From the perspective of causal analysis, our goal is to estimate the causal effect of an intervention in the labor market, in this case a reform of employment protection, on an outcome (see Morgan and Winship, 2007; Angrist and Pischke, 2009; Wooldridge and Imbens, 2009).
Following conventional terminology, we will also refer to this intervention as a treatment, to units (countries) exposed to treatment as treated units, and to units not exposed to treatment as control units. Our goal is to arrive at a statistical model in which the association between exposure to treatment and outcome identifies the causal effect of the treatment on the outcome.
Following Morgan and Winship's exposition, Figure 4.1 schematizes this issue.
Figure 4.1 Causal Diagram
[Diagram: U affects Treatment via path a and Outcome via path b; Treatment affects Outcome via path c.]
We want to estimate the effect of a treatment (a change in employment regulation) on an outcome (for example, youth unemployment) in the presence of confounder U. Because U causes both treatment (via path a) and outcome (via path b), the effect of the treatment on the outcome variable (path c) is not identified unless we account for the influence of the factor U, which may be observed or unobserved. Causal inference is possible, however, if we can appropriately condition on control variables so that the effect of U (or many U's) on either the outcome or the treatment is "controlled away". If we can "block" (Morgan and Winship, 2007) either path a or path b, the treatment effect (path c) is identified, i.e. we can estimate the treatment effect in a simple regression framework and assign a causal interpretation to the corresponding regression coefficient.
Data and Methods
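The blocking logic of Figure 4.1 can be illustrated with a small simulation. The following is a minimal sketch, not part of the original analysis: the path coefficients (a = 1.0, b = 2.0, c = 0.5), the sample size, and the function names are arbitrary assumptions chosen for illustration. It compares a naive regression of the outcome on the treatment with a regression that conditions on U.

```python
import random


def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def residuals(x, y):
    """Residuals of y after an OLS fit on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def simulate(n=20000, a=1.0, b=2.0, c=0.5, seed=0):
    """Hypothetical data: U drives treatment (path a) and outcome (path b)."""
    rng = random.Random(seed)
    u = [rng.gauss(0, 1) for _ in range(n)]
    t = [a * ui + rng.gauss(0, 1) for ui in u]                 # path a
    y = [c * ti + b * ui + rng.gauss(0, 1)                     # paths c and b
         for ti, ui in zip(t, u)]
    naive = slope(t, y)                                        # ignores U
    # Conditioning on U: partial U out of both T and Y, then regress
    # (equivalent to including U as a regressor, by Frisch-Waugh-Lovell).
    adjusted = slope(residuals(u, t), residuals(u, y))
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = simulate()
    print(f"naive: {naive:.3f}, adjusted for U: {adjusted:.3f}")
```

Under these assumed values the naive slope converges to c + ab/(a² + 1) = 1.5, whereas the adjusted slope converges to the true effect c = 0.5. In the application, of course, U is at best partially observed, which is why the choice of control variables matters.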
Randomly assigning treatment status is the benchmark approach to achieve identification of causal effects, since randomization ensures that treated and control units do not systematically differ in their observed and unobserved U's. However, since employment protection reforms are not implemented in this fashion, we need to worry about the presence of unobserved or observed U's. In this context, we can think of two strategies to rule out the influence of U. First, we could use a set of control variables W that model treatment assignment and block path a. This approach underlies statistical matching analysis. The goal is to model assignment to the treatment status using control variables in order to "balance" the U's between treated and control groups, leaving no systematic differences between the two groups.26 Alternatively, we could use a set of control variables X that account for the direct influence of U on the outcome variables, and hence block path b. In the following, we propose a conditioning strategy using the control variables that essentially tries to accomplish both: adjusting for factors that determine treatment assignment as well as adjusting for causes of the outcome.
For both cross-sectional and longitudinal analysis, we employ an extensive set of control variables, which were selected because they predict both the outcome variables and treatment assignment. While only a few quantitative studies discuss the causal factors behind the assignment of employment protection regimes (for example, Rueda, 2005, 2007; Botero et al., 2004), there has been much research on the determinants of aggregate employment outcomes (for example, Blanchard and Wolfers, 2000; Baccaro and Rei, 2007; Bertola et al., 2007). From this research, we distilled a list of control variables (see Data Appendix at the end of this chapter).
26 Instrumental variable analysis, by relying on exogenous variation in treatment assignment, can be similarly interpreted as an attempt to rule out an effect of U via path a. While instrumental variable specifications have been suggested to estimate the effect of employment protection (Allard and Lindert, 2006), we could not find credible instruments in this application. For an interesting approach to identify institutional
Although these variables represent a fairly comprehensive set of the controls typically used in this field of research, they may not fully capture the unobserved confounders. Published research has often entered them as linear effects into regression models and has often implicitly or explicitly claimed to achieve identification of causal effects in this way. However, we can use the information contained in the control variables more effectively. Following the logic of matching analysis, we can try to detect the set of W control variables that model treatment assignment and attempt to balance the data in this way. In particular, in addition to the main effects of the control variables, we add "powerful" quadratic/interaction terms generated from the control variables that predict treatment assignment (the choice of employment protection regime). We are not interested in interpreting these interactions/non-linear terms, because their sole purpose is to balance the data.
To select the quadratic/interaction terms, we use a data-driven approach. Based on Imbens and Rubin's (2008) suggestions for specifying a propensity score model, we propose an algorithm that selects all "powerful" quadratic and interaction terms generated from the control variables that predict treatment assignment. When estimating the effect of employment protection legislation, we then control for these quadratic/interaction terms to explicitly capture some additional factors unaccounted for by the conventional linear specifications. If these non-linear/interaction terms matter, omitting them from the specification would constitute a specification error. How exactly the algorithms were designed is discussed in the following section.
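To make the idea of data-driven term selection concrete, the following sketch shows one possible forward-stepwise procedure. It is an illustration only and not the algorithm used in this chapter, which is described in the following section. The function names, the use of an OLS F-statistic for a continuous treatment, and the threshold of 4.0 are all assumptions made for this example.

```python
import itertools
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares from an OLS fit of y on the given columns."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(p * q for p, q in zip(columns[i], y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(beta[j] * columns[j][i] for j in range(k))
              for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def second_order_terms(covariates):
    """All squares and pairwise products of the covariates, keyed by label."""
    terms = {}
    names = sorted(covariates)
    for a, b in itertools.combinations_with_replacement(names, 2):
        label = f"{a}^2" if a == b else f"{a}*{b}"
        terms[label] = [p * q for p, q in zip(covariates[a], covariates[b])]
    return terms


def select_terms(covariates, treatment, f_threshold=4.0):
    """Forward selection: repeatedly add the second-order term with the
    largest F-statistic for predicting treatment, until none exceeds the
    (assumed) threshold. Main effects are always included."""
    n = len(treatment)
    base = [[1.0] * n] + [covariates[name] for name in sorted(covariates)]
    candidates = second_order_terms(covariates)
    selected = []
    current_rss = rss(base, treatment)
    while candidates:
        k_new = len(base) + 1
        scores = {}
        for label, col in candidates.items():
            new_rss = rss(base + [col], treatment)
            scores[label] = (current_rss - new_rss) / (new_rss / (n - k_new))
        best = max(scores, key=scores.get)
        if scores[best] < f_threshold:
            break
        base.append(candidates.pop(best))
        selected.append(best)
        current_rss = rss(base, treatment)
    return selected


def demo(n=300, seed=1):
    """Hypothetical data in which treatment depends on an x1*x2 interaction."""
    rng = random.Random(seed)
    cov = {name: [rng.gauss(0, 1) for _ in range(n)]
           for name in ("x1", "x2", "x3")}
    treatment = [a + b + 1.5 * a * b + rng.gauss(0, 1)
                 for a, b in zip(cov["x1"], cov["x2"])]
    return select_terms(cov, treatment)


if __name__ == "__main__":
    print(demo())
```

In this simulated example the strong x1*x2 interaction is picked up first. Depending on the threshold, further terms may also be selected by chance, which is the trade-off inherent in any such procedure.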
Implicitly, we try to accomplish the same goal as in propensity score matching analysis, but without estimating and adjusting for a propensity score. Instead, we use the variables that predict treatment status (or treatment intensity, since our treatment variable is continuous) and include them directly in the outcome equation. Angrist and Pischke (2009: 86) point to the similarity between first estimating and then adjusting for the propensity score and a regression model that uses a flexible specification of the variables used to model the propensity score. Instead of estimating a balancing score and then conditioning on it, we directly control for the observed covariates used to predict the treatment variable and therefore attempt to block path a in Figure 4.1 above.
We apply the same reasoning when modeling the U's that affect the outcome (path b in Figure 4.1). Again, published research usually constrains the effects of control variables to be linear, which is not plausible. For example, institutions modify the impact of macroeconomic shocks, but institutions may also interact with each other. Indeed, a number of contributions have focused on precisely these questions (Blanchard and Wolfers, 2000; Bassanini and Duval, 2006).
Unfortunately, published research is a poor guide to the functional form of control variables and to which interactions to include, particularly when studying the youth labor market (or another specific demographic group). Instead, one strategy has been to use nonlinear least squares, where sets of institutions are allowed to interact with many observed macroeconomic variables (e.g. Blanchard and Wolfers, 2000; Jimeno and Rodriguez-Palenzuela, 2002), leading to the estimation of a large number of parameters while avoiding any statement about just which interactions/nonlinearities matter.
Similar to modeling treatment assignment using additional non-linear terms, we therefore also use a data-driven approach to detect non-linear terms that predict the respective outcome variable (path b in Figure 4.1). If these non-linearities are present and we omit them from the model, we commit a specification error. The risk of this approach is that we include an irrelevant term in the model that is just significant by chance. Essentially, we are caught between two potential sources of specification error. An advantage of the data-driven approach is that it
2009, for a discussion of the pitfalls of model dependence), and replaces the subjectively preferred specification with a specification that is determined on the basis of objective criteria.
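The risk of admitting a term that is significant only by chance grows with the number of candidate terms. The following back-of-the-envelope sketch makes this concrete. It rests on strong simplifying assumptions that are not part of the original analysis: every candidate term is truly irrelevant, each is tested once at level alpha, and the tests are independent. The function names are hypothetical.

```python
from math import comb


def n_second_order_terms(k):
    """Number of squares plus pairwise interactions of k covariates."""
    return k + comb(k, 2)


def prob_any_spurious(n_terms, alpha=0.05):
    """Probability that at least one irrelevant term passes a level-alpha
    test, assuming independent tests (an idealization)."""
    return 1.0 - (1.0 - alpha) ** n_terms


if __name__ == "__main__":
    for k in (3, 5, 10):
        m = n_second_order_terms(k)
        print(k, m, round(prob_any_spurious(m), 3))
```

Even a modest number of control variables generates many candidate terms, so some chance inclusions are to be expected. This is the cost that has to be weighed against the specification error of omitting relevant non-linearities.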
Because we let an algorithm decide on the functional form of the control variables, specification searches on the basis of the treatment effect are precluded. "Partisan model selection" is a particular concern in this setting, because we deal with relatively small sample sizes, in which sensitivity to specification changes is a considerable problem and leaves a lot of room that researchers may exploit to arrive at a preferred result. By delegating the decision about the functional form of the control variables to an algorithm, we effectively preclude some model selection on the basis of estimated treatment effects.
Finally, while Imbens and Rubin (2008) recommend a partly data-driven approach to balance the data vis-à-vis the treatment variable, applying the same procedure vis-à-vis the outcome variable may seem questionable. However, repeating the analysis with only the non-linear terms that predict the treatment variable (omitting the non-linear terms that predict the outcome variable) yielded virtually identical results. In the following analyses, we generally use both sets of non-linear terms.
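The resulting specification can be summarized schematically. The notation below is an assumption introduced here for illustration and does not appear in the original text: D_i is the treatment (employment protection), x_{ij} are the k control variables entered as main effects, z_{im} are quadratic/interaction terms generated from the controls, S_D is the set of such terms selected because they predict the treatment, and S_Y is the set selected because they predict the outcome. The sketch abstracts from the cross-sectional versus longitudinal structure of the data.

```latex
Y_i = \alpha + \tau D_i + \sum_{j=1}^{k} \beta_j x_{ij}
      + \sum_{m \in S_D \cup S_Y} \gamma_m z_{im} + \varepsilon_i
```

Here \tau is the coefficient given a causal interpretation, while the \gamma_m are nuisance parameters whose sole purpose is to balance the data. The robustness check described above corresponds to replacing S_D \cup S_Y with S_D alone.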