An Approach to Macroeconomic Quasi-Experimentation


VI. Regression Discontinuity


Universe of Counterfactual Outcome

Probability Space = $(\Omega, \omega, P)$, where $\Omega$ is the sample space, $\omega$ an event, and $P$ the probability measure

Objective: To construct $P$ for a given experiment

$$P: \omega \to [0, 1]$$

But... How should we interpret regression estimates?

$$y = \alpha_1 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$

$\beta_1 = (X'$ …


Treatment and Control group Comparability

Ideal World

Conduct Experiments

Causal inference: Random assignment of treatment and control

Interpretation not to be heavily "model dependent"

Real World

Self-selection, measurement error, omitted variable bias, simultaneity

No counterfactuals

Econometrics: "Statistics with bad data"

Selection on Observables

–Mostly assume CIA–

Selection on Unobservables

-Often as hard to justify

Further Readings:

-Angrist and Pischke (2008) "Mostly Harmless Econometrics"


Identification Strategy

How can I approximate observational data to an experiment in the

absence of random assignment?

Non-Conventional Approach to Macroeconomics

Propensity Score Matching

Diff-in-Diff

Regression Discontinuity Design

Event Study

Conventional Approach to Macroeconomics

VARs (Reduced, structural, factor-augmented, etc.)

Principal components

Hazard Functions

ARs (ARIMA, ARCH models)

Cluster Analysis


Unidentified Questions

Do children do better at school if they start at 6 or 7?

Counterfactuals: Test score given that I started school at age 6 had I started at age 7

Test score given that I started school at age 7 had I started at age 6

Two groups: one starts at 6, the other at 7, compare test scores in first

grade

Bias: One group is older when taking the test

Group that starts at 6 takes test in 2nd grade, group that starts at 7 takes test in 1st grade


Impact on income if parents have 1 vs 2 children?

Use of twins: more comparable with parents that have 1 child

Do hospitals make people healthier or sicker?

National Health Interview Survey 2005

Group         Sample Size   Mean Health Status   St. Error
Hospital      7,774         2.21                 0.014
No Hospital   90,049        2.93                 0.003


Tennessee STAR Experiment 1985-1989

$12 million USD

11,600 children in kindergarten, 3 treatments:

Small classes of 13-17

Regular classes of 22-25 with a part-time teacher’s aide


Does smoking lead to lower infant birth weight?

Treatment status: mother’s smoking status

Outcome: Birth weight of infants

Problems: age is related to both treatment status and outcome

older mothers have heavier infants


Counterfactuals of interest

Outcome (infants’ birth-weight) of mothers who smoked if they had chosen not to

smoke


Hospital Example:

Group         Sample Size   Mean Health Status   St. Error
Hospital      7,774         2.21                 0.014
No Hospital   90,049        2.93                 0.003

Trivial example... but resembles

What is the variable of interest?

$$[y_{1i} \mid D_i = 1] - [y_{0i} \mid D_i = 1]$$

Problem: only observe one outcome per person

$$y_i = D_i y_{1i} + (1 - D_i) y_{0i}$$

$y_{1i} \mid D_i = 1$: Observed
$y_{1i} \mid D_i = 0$: Hypothetical
$y_{0i} \mid D_i = 0$: Observed
$y_{0i} \mid D_i = 1$: Hypothetical

Where $y_i$ = Health Status and $D_i = 1$ if hospital


Group         Mean Health Status
Hospital      2.21
No Hospital   2.93

What I observe:

$$E[y_{1i} \mid D_i = 1] - E[y_{0i} \mid D_i = 0] = (2.21 - 2.93)$$

$$= \underbrace{E[y_{1i} \mid D_i = 1] - E[y_{0i} \mid D_i = 1]}_{E[y_{1i} - y_{0i} \mid D_i = 1]} + \underbrace{E[y_{0i} \mid D_i = 1] - E[y_{0i} \mid D_i = 0]}_{\text{Bias} < 0}$$
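A small simulation can make this decomposition concrete. The sketch below is illustrative only: the data-generating process (an unobserved `frailty` variable, a constant treatment effect of 0.2, and the selection rule) is an assumption made up for the example and is not taken from the NHIS data.

```python
import random
from statistics import mean

def simulate(n=20000, seed=0):
    """Simulate potential outcomes where frail people select into hospital."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        frailty = rng.gauss(0.0, 1.0)                    # unobserved (hypothetical)
        y0 = 3.0 - 0.8 * frailty + rng.gauss(0.0, 0.5)   # health without hospital
        y1 = y0 + 0.2                                    # assumed true effect: +0.2
        d = 1 if frailty + rng.gauss(0.0, 1.0) > 1.0 else 0  # frail units go to hospital
        rows.append((d, y0, y1))
    return rows

rows = simulate()
treated = [r for r in rows if r[0] == 1]
control = [r for r in rows if r[0] == 0]

# What the analyst observes: E[y1 | D=1] - E[y0 | D=0]
observed_diff = mean(y1 for _, _, y1 in treated) - mean(y0 for _, y0, _ in control)
# What the analyst wants: E[y1 - y0 | D=1]
att = mean(y1 - y0 for _, y0, y1 in treated)
# Selection bias: E[y0 | D=1] - E[y0 | D=0]
selection_bias = mean(y0 for _, y0, _ in treated) - mean(y0 for _, y0, _ in control)

print(f"observed difference = {observed_diff:.3f}")
print(f"ATT                 = {att:.3f}")
print(f"selection bias      = {selection_bias:.3f}")
```

In this made-up setting the hospital helps everyone, yet the observed difference is negative because the bias term dominates, which mirrors the sign pattern on the slide.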

Potential Outcomes framework -vs- Regression framework

$$y_i = D_i y_{1i} + (1 - D_i) y_{0i} \quad \text{-vs-} \quad y_i = \alpha + \rho D_i + \eta_i$$

Rearranging the potential outcomes expression:

$$y_i = y_{0i} + (y_{1i} - y_{0i}) D_i$$


$$y_i = \alpha + \rho D_i + \eta_i$$

where $\alpha = E[y_{0i}]$ and $\eta_i = y_{0i} - E[y_{0i}]$

$$E[y_i \mid D_i = 1] = \alpha + \rho + E[\eta_i \mid D_i = 1]$$

$$E[y_i \mid D_i = 0] = \alpha + E[\eta_i \mid D_i = 0]$$

$$E[y_i \mid D_i = 1] - E[y_i \mid D_i = 0] = \rho + \underbrace{E[\eta_i \mid D_i = 1] - E[\eta_i \mid D_i = 0]}_{E[y_{0i} \mid D_i = 1] - E[y_{0i} \mid D_i = 0]\ \text{(BIAS)}}$$
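The link between the regression framework and the comparison of group means can be checked numerically: with a single 0/1 regressor, the OLS slope equals the difference in group means and the intercept equals the control mean. The sketch below uses a tiny made-up sample (the numbers are hypothetical, not from the slides) and a hand-written OLS helper, `ols_on_dummy`, introduced only for this illustration.

```python
from statistics import mean

def ols_on_dummy(y, d):
    """OLS of y on a constant and a 0/1 regressor d; returns (alpha_hat, rho_hat)."""
    d_bar, y_bar = mean(d), mean(y)
    cov = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    var = sum((di - d_bar) ** 2 for di in d)
    rho_hat = cov / var
    alpha_hat = y_bar - rho_hat * d_bar
    return alpha_hat, rho_hat

# Hypothetical sample: d = 1 for treated units, y = outcome.
d = [1, 1, 1, 0, 0, 0, 0]
y = [2.0, 2.5, 2.1, 3.0, 2.8, 3.1, 2.9]

alpha_hat, rho_hat = ols_on_dummy(y, d)
mean_treated = mean(yi for yi, di in zip(y, d) if di == 1)
mean_control = mean(yi for yi, di in zip(y, d) if di == 0)

print(f"rho_hat = {rho_hat:.4f}, difference in means = {mean_treated - mean_control:.4f}")
```

So the estimated $\rho$ inherits whatever selection bias the raw comparison of means contains.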


Solution:

-Controlled Experiment

-Bouncer in front of Hospital

Optimistic view

Conditional Independence Assumption (CIA)


$$E[y_i \mid X_i, D_i = 1] - E[y_i \mid X_i, D_i = 0] = (\text{what I want}) + \underbrace{E[y_{0i} \mid X_i, D_i = 1] - E[y_{0i} \mid X_i, D_i = 0]}_{0}$$

Non-Binary Treatment:

Years of school $\Rightarrow S_i$

What individual "$i$" would earn for any value of $S$ $\Rightarrow Y_{si} \equiv f(s)$

CIA $\Rightarrow y_{si} \perp S_i \mid X_i \ \forall s$


Example 2: Going to College

$y_{1i}$ - earnings had "$i$" gone to college
$y_{0i}$ - earnings had "$i$" not gone to college
$C_i = 1$ go to college
$C_i = 0$ don’t go to college

I observe:

$$E[y_i \mid C_i = 1] - E[y_i \mid C_i = 0] = E[y_{1i} - y_{0i} \mid C_i = 1] + E[y_{0i} \mid C_i = 1] - E[y_{0i} \mid C_i = 0]$$


Bad Control Problem

College -vs- no college / Blue -vs- white collar

$$y_i = C_i y_{1i} + (1 - C_i) y_{0i}$$

$w_i = 1$ - white collar

$$w_i = C_i w_{1i} + (1 - C_i) w_{0i}$$

$w_{1i}$ - white collar & $C = 1$
$w_{0i}$ - white collar & $C = 0$

CIA $\Rightarrow$

$$E[y_i \mid C_i = 1] - E[y_i \mid C_i = 0] = E[y_{1i} - y_{0i}]$$

$$E[w_i \mid C_i = 1] - E[w_i \mid C_i = 0] = E[w_{1i} - w_{0i}]$$


Bad Control Problem cont.

Difference in $y_i$ between college graduates and others (without college), given that they are white collar, i.e. want:

$$y_{1i} - y_{0i} \mid w_{1i}$$

$$E[y_i \mid w_i = 1, C_i = 1] - E[y_i \mid w_i = 1, C_i = 0]$$

$$= E[y_{1i} \mid w_{1i} = 1, C_i = 1] - E[y_{0i} \mid w_{0i} = 1, C_i = 0]$$

$$= E[y_{1i} \mid w_{1i} = 1] - E[y_{0i} \mid w_{0i} = 1] \quad \text{(+ CIA)}$$

$$= \underbrace{E[y_{1i} \mid w_{1i} = 1] - E[y_{0i} \mid w_{1i} = 1]}_{E[y_{1i} - y_{0i} \mid w_{1i} = 1]} + \underbrace{E[y_{0i} \mid w_{1i} = 1] - E[y_{0i} \mid w_{0i} = 1]}_{\text{BIAS}}$$

Causal effect of college on those that work in a white collar job when they have a college degree

Any college student who gets a white collar job $\approx E[y_{0i}]$

Gets a white collar job without college (e.g. Bill Gates)
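The bad control problem can be illustrated with a simulation in which college is randomly assigned, so the unconditional comparison is unbiased, yet conditioning on white collar status reintroduces bias. Everything about the data-generating process below (the `ability` variable, the thresholds for white collar work, and the effect size of 1.0) is an assumption chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(1)
rows = []
for _ in range(50000):
    ability = rng.gauss(0.0, 1.0)        # unobserved (hypothetical)
    c = rng.randint(0, 1)                # college randomly assigned (assumption)
    y0 = 10.0 + 2.0 * ability            # earnings without college
    y1 = y0 + 1.0                        # assumed causal effect of college: 1.0
    w0 = 1 if ability > 1.0 else 0       # white collar without college: only high ability
    w1 = 1 if ability > -1.0 else 0      # white collar with college: most people
    y = y1 if c == 1 else y0             # observed earnings
    w = w1 if c == 1 else w0             # observed occupation
    rows.append((c, w, y))

def college_gap(sample):
    """Mean earnings of college units minus mean earnings of non-college units."""
    return (mean(y for c, _, y in sample if c == 1)
            - mean(y for c, _, y in sample if c == 0))

unconditional = college_gap(rows)
white_collar_only = college_gap([r for r in rows if r[1] == 1])

print(f"unconditional gap:     {unconditional:.2f}")
print(f"gap given white collar: {white_collar_only:.2f}")
```

Among white collar workers, the non-college group is a highly selected set of high-ability people, so the conditional comparison understates (here even reverses) the assumed effect.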


Matching & Propensity Score Functions

Estimates the effects of a treatment by accounting for covariates that predict

receiving treatment

-Rosenbaum and Rubin (1983)

Easier with categorical variables (Matching)

Harder with continuous variables (Propensity Score Matching)


Example: Training Program

Variables (All binary except income):

Treatment

Black

Hispanic

Married

Degree

Income

$$\sum_s \varpi_s (\bar{y}_{1s} - \bar{y}_{0s})$$

where $\varpi_s =$ …
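With categorical covariates, the weighted sum of within-cell differences can be computed directly. The sketch below assumes the weight of each cell is its share of treated units (one common choice, which targets the effect on the treated); the slide's own definition of the weights is cut off, so this choice, the helper name `matching_estimate`, and the toy records are all assumptions.

```python
from collections import defaultdict
from statistics import mean

def matching_estimate(records):
    """records: list of (stratum, treated, outcome) with treated in {0, 1}.

    Returns sum_s weight_s * (ybar_1s - ybar_0s), where weight_s is the share
    of treated units falling in stratum s (an assumed weighting).
    """
    cells = defaultdict(lambda: {0: [], 1: []})
    for stratum, treated, outcome in records:
        cells[stratum][treated].append(outcome)
    # Only strata containing both treated and control units can be compared.
    usable = [c for c in cells.values() if c[0] and c[1]]
    n_treated = sum(len(c[1]) for c in usable)
    return sum(len(c[1]) / n_treated * (mean(c[1]) - mean(c[0])) for c in usable)

# Hypothetical data: stratum = (married, degree), outcome = income.
records = [
    ((0, 0), 1, 12.0), ((0, 0), 1, 14.0), ((0, 0), 0, 10.0), ((0, 0), 0, 12.0),
    ((1, 1), 1, 25.0), ((1, 1), 0, 20.0), ((1, 1), 0, 22.0), ((1, 1), 0, 21.0),
]

estimate = matching_estimate(records)
naive = (mean(y for _, t, y in records if t == 1)
         - mean(y for _, t, y in records if t == 0))
print(f"matching estimate = {estimate:.3f}, naive difference = {naive:.3f}")
```

In this toy data the naive difference is zero while every cell shows a positive gap, because treated units are concentrated in the low-income cell.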


Key Assumption

Propensity Score Theorem:

Corollary of CIA

(CIA) $y_{0i}, y_{1i} \perp D_i \mid X_i$ $\longrightarrow$ (PST) $y_{0i}, y_{1i} \perp D_i \mid P(X_i)$

4 Steps to Propensity Score Matching:

1. Estimate Propensity Score
2. Matching
3. Stratification


Propensity score matching methodology

Propensity Score: conditional probability of receiving treatment given $X_i$

Through the use of a logistic model or through generalized boosted modeling

$$P(X_i) = \Pr(D_i = 1 \mid X_i) \quad \text{for each } i \text{ within the sample}$$

$$\Pr(D_i = 1 \mid X_i) = \Phi(X_i' \delta)$$
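As a rough sketch of the estimation step, the propensity score can be fit with a logistic model by maximizing the log-likelihood. The code below uses a logistic link (as the text mentions) rather than the normal CDF $\Phi$ in the formula, and a plain gradient-ascent loop written by hand; the function names, step size, and toy data are assumptions for illustration, and in practice a statistics package would be used instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(X, d, steps=2000, lr=0.5):
    """Fit Pr(D=1|X) = sigmoid(X'delta) by gradient ascent on the mean log-likelihood.

    Each row of X must start with 1.0 so that delta[0] is the intercept.
    """
    n, k = len(X), len(X[0])
    delta = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for xi, di in zip(X, d):
            resid = di - sigmoid(sum(a * b for a, b in zip(xi, delta)))
            for j in range(k):
                grad[j] += resid * xi[j]
        delta = [dj + lr * gj / n for dj, gj in zip(delta, grad)]
    return delta

# Hypothetical sample with one binary covariate (say, married):
# 40 units with x = 0, of whom 10 are treated; 60 with x = 1, of whom 45 are treated.
X = [[1.0, 0.0]] * 40 + [[1.0, 1.0]] * 60
d = [1] * 10 + [0] * 30 + [1] * 45 + [0] * 15

delta = fit_logit(X, d)
score_x0 = sigmoid(delta[0])             # estimated P(X) for x = 0
score_x1 = sigmoid(delta[0] + delta[1])  # estimated P(X) for x = 1
print(f"P(x=0) = {score_x0:.3f}, P(x=1) = {score_x1:.3f}")
```

With a single binary covariate the fitted scores simply reproduce the treated share in each cell, which gives a quick sanity check on the fit.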


Matching: Find individuals without treatment whose propensity scores are similar to those of individuals with treatment

Stratification:


Matching methods explained

Propensity scores for treated and control groups

Matching methods: for each treated observation i, we need to find matching control observation(s) j with similar characteristics

Matching with or without replacement

Matching without replacement: each control observation is used no

more than one time as a match for a treated observation.


Nearest neighbor matching

For each treated observation $i$, select a control observation $j$ that has the closest $x$.

$$\min \| p_i - p_j \|$$

Radius matching

Each treated observation $i$ is matched with the control observations $j$ that fall within a specified radius.

$$\| p_i - p_j \| < r$$

Kernel matching

Each treated observation i is matched with several control observations, with

weights inversely proportional to the distance between treated and control

observations.

With matching based on propensity scores, the weights are defined as:

$$w(i, j) = \frac{K\left(\frac{p_j - p_i}{h}\right)}{\sum_{j=1}^{n_0} K\left(\frac{p_j - p_i}{h}\right)}$$

Here h is the bandwidth parameter.
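The three matching rules described so far (nearest neighbor, radius, kernel) can be sketched as small functions operating on propensity scores. The Gaussian kernel is an assumption (the slides do not name a kernel), and the scores, radius, and bandwidth below are made-up values.

```python
import math

def nearest_neighbor(p_i, controls):
    """Index of the control whose propensity score is closest to p_i."""
    return min(range(len(controls)), key=lambda j: abs(p_i - controls[j]))

def radius_matches(p_i, controls, r):
    """Indices of all controls with |p_i - p_j| < r."""
    return [j for j, p_j in enumerate(controls) if abs(p_i - p_j) < r]

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_weights(p_i, controls, h):
    """w(i, j) = K((p_j - p_i) / h) / sum_j K((p_j - p_i) / h)."""
    raw = [gaussian_kernel((p_j - p_i) / h) for p_j in controls]
    total = sum(raw)
    return [k / total for k in raw]

controls = [0.10, 0.32, 0.45, 0.80]   # hypothetical control propensity scores
p_treated = 0.40                      # hypothetical treated unit

nn = nearest_neighbor(p_treated, controls)
within = radius_matches(p_treated, controls, r=0.1)
weights = kernel_weights(p_treated, controls, h=0.1)
print(nn, within, [round(w, 3) for w in weights])
```

Nearest neighbor uses one control, radius matching uses every control inside the band, and kernel matching uses all controls but down-weights distant ones; a smaller `h` concentrates the weight on the closest controls.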

Stratification or interval matching


To Recap

Average treatment effect on the treated (ATET)

ATET is the difference between the outcomes of the treated and the outcomes the treated observations would have had if they had not been treated.

$$ATET = E(\Delta \mid D = 1) = E(y_1 \mid x, D = 1) - E(y_0 \mid x, D = 1)$$

The second term is a counterfactual so it is not observable and needs to be

estimated.

Propensity score method

After matching on propensity scores we can compare the outcomes of treated and

control observations

$$ATET = E(\Delta \mid p(x), D = 1) = E(y_1 \mid p(x), D = 1) - E(y_0 \mid p(x), D = 0)$$

Empirical estimation

Each treated observation $i$ is matched to $j$ control observations and their outcomes $y_0$ are weighted by $w$.

$$ATET = \frac{1}{n_1} \sum_{i \in (D = 1)} \Big[ y_{1,i} - \sum_j w(i, j)\, y_{0,j} \Big]$$
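A minimal sketch of the empirical ATET follows, using nearest-neighbor matching with replacement so that $w(i, j) = 1$ for the closest control and 0 otherwise. The function name and the toy (propensity score, outcome) pairs are hypothetical.

```python
def atet_nearest_neighbor(treated, controls):
    """treated, controls: lists of (propensity_score, outcome) pairs.

    Each treated unit is matched, with replacement, to the control with the
    closest propensity score; the ATET is the mean treated-minus-match gap.
    """
    gaps = []
    for p_i, y_i in treated:
        _, y_match = min(controls, key=lambda c: abs(c[0] - p_i))
        gaps.append(y_i - y_match)
    return sum(gaps) / len(gaps)

# Hypothetical matched sample.
treated = [(0.70, 15.0), (0.40, 11.0)]
controls = [(0.10, 5.0), (0.35, 9.0), (0.75, 12.0)]

atet = atet_nearest_neighbor(treated, controls)
print(f"ATET estimate = {atet:.2f}")
```

Swapping in radius or kernel weights only changes how the counterfactual outcome for each treated unit is averaged over controls; the outer average over treated units stays the same.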


Diff-in-Diff

Scale problem: Non-linearities in Outcome. What if control group

had been higher?


Good

Better


Key Assumption

Trend in control group approximates what would have occurred in treatment group in

the absence of treatment

"weaker version of CIA"

$$DD = E(\Delta_{treated} - \Delta_{control} \mid D = 1) = E[(y_{1,t+1} - y_{1,t}) - (y_{0,t+1} - y_{0,t}) \mid x, D = 1]$$

Regression Framework:

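$$y_i = \beta_0 + \beta_1 D^{Post} + \beta_2 D^{Treat} + \beta_3 D^{Post} D^{Treat} + (\beta_4 X_i) + \varepsilon_i$$

$$E[y \mid X_i, D^{Post} = 1, D^{Treat} = 1] = \beta_0 + \beta_1 + \beta_2 + \beta_3$$

$$E[y \mid X_i, D^{Post} = 0, D^{Treat} = 1] = \beta_0 + \beta_2$$

$$E[y \mid X_i, D^{Post} = 1, D^{Treat} = 0] = \beta_0 + \beta_1$$

$$E[y \mid X_i, D^{Post} = 0, D^{Treat} = 0] = \beta_0$$

Post minus pre, treated group: $\beta_1 + \beta_3$

Post minus pre, control group: $\beta_1$

Difference-in-differences: $(\beta_1 + \beta_3) - \beta_1 = \beta_3$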
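Without covariates, the regression is saturated, so its coefficients can be read off the four cell means. The sketch below computes them that way on made-up data; the function name `did` and the records are hypothetical.

```python
from statistics import mean

def did(records):
    """records: list of (treat, post, y) with treat, post in {0, 1}.

    Returns (beta0, beta1, beta2, beta3) of the saturated regression
    y = b0 + b1*Post + b2*Treat + b3*Post*Treat, computed from cell means.
    """
    cell = {(t, p): mean(y for tt, pp, y in records if (tt, pp) == (t, p))
            for t in (0, 1) for p in (0, 1)}
    beta0 = cell[(0, 0)]                            # control, pre
    beta1 = cell[(0, 1)] - cell[(0, 0)]             # control trend
    beta2 = cell[(1, 0)] - cell[(0, 0)]             # baseline gap between groups
    beta3 = (cell[(1, 1)] - cell[(1, 0)]) - beta1   # treated trend minus control trend
    return beta0, beta1, beta2, beta3

# Hypothetical data: (treat, post, outcome).
records = [
    (0, 0, 10.0), (0, 0, 12.0), (0, 1, 13.0), (0, 1, 15.0),
    (1, 0, 20.0), (1, 0, 22.0), (1, 1, 27.0), (1, 1, 29.0),
]

beta0, beta1, beta2, beta3 = did(records)
print(f"DD estimate (beta3) = {beta3:.1f}")
```

The treated group starts at a much higher level, but only the difference in trends enters `beta3`, which is why the method relies on the common-trend assumption rather than on comparable levels.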


Event Study

Event Date


Intro

Hahn et al. (2001): “RDD require seemingly mild assumptions compared to those needed for other non-experimental approaches”

Lee (2008): “one need not assume the RDD isolates treatment variation that is ‘as good as randomized’; instead, such randomized variation is a consequence of agents’ inability to precisely control the assignment variable near the known cutoff”

Precise sorting around the cutoff is a sign of self-selection

Non-Parametric (e.g. local linear regression, using only data close to the cutoff) and Parametric (e.g. a functional form like a low-order polynomial) estimation should be seen as complementary. In practice, they lead to the computation of the exact same statistic.

Disadvantages

Statistical power is lower than randomized experiments of equal sample size

(higher Type-II error)


Regression Discontinuity Design

Closest cousin of a randomized experiment

Deterministic rule that assigns treatment in a discontinuous fashion

$$D_i = \begin{cases} 1 & \text{if } x_i \ge x_0 \\ 0 & \text{if } x_i < x_0 \end{cases}$$

RD Scatterplot: Positive Treatment Effect

RD Scatterplot: No Treatment Effect


Randomized Experiments: Treatment and control groups are divided on the basis of a randomly generated number.

For example, let $\mu$ follow a uniform distribution with range [0,4]. Units with $\mu \ge 2$ receive treatment, units with $\mu < 2$ get placebo

Think of RDD where assignment variable is $X = \mu$ and cutoff = 2

Only difference: $X$ is independent of $Y_i(1)$ and $Y_i(0)$

so, $E[Y_i(1) \mid X = c]$, $E[Y_i(0) \mid X = c]$


Examples

PSAT/NMSQT: Top 16,000 test-takers get a scholarship

A small difference in test score can mean a discontinuous jump in scholarship amount (Thistlethwaite & Campbell 1960)

School Class Size: Maimonides’ Rule -No more than 40 kids in a

class in Israel

40 kids in school means 40 kids per class. 41 kids means two classes with

20 and 21. (Angrist & Lavy 1999)

Union Elections: If employees want to unionize, NLRB holds election

50%: the employer doesn’t recognize the union, and 50% + 1 means the

employer is required to "bargain in good faith" (DiNardo & Lee 2004)

Air Pollution and Home Values: Clean Air Act’s National Ambient

Air Quality Standards


Thistlethwaite & Campbell 1960

A small difference in test score

→

discontinuous jump in scholarship amount

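$$Y_i = \alpha + \tau D_i + X_i' \beta + \varepsilon_i$$

$$B' - A'' = \lim_{\varepsilon \downarrow 0} E[Y_i \mid X_i = c + \varepsilon] - \lim_{\varepsilon \uparrow 0} E[Y_i \mid X_i = c + \varepsilon]$$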
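The jump at the cutoff can be estimated by fitting a separate line on each side, using only observations within a bandwidth, and differencing the two fitted values at the cutoff. The sketch below is one simple way to do this (uniform kernel, fixed bandwidth); the simulated data, the assumed true jump of 2.0, the cutoff of 2, and the helper names are all assumptions for illustration.

```python
import random
from statistics import mean

def linear_fit(xs, ys):
    """Simple OLS of ys on xs; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

def sharp_rd(x, y, cutoff, bandwidth):
    """Local linear RD estimate: difference of the two fitted values at the cutoff."""
    # Center x at the cutoff so each intercept is the fitted value at the cutoff.
    left = [(xi - cutoff, yi) for xi, yi in zip(x, y)
            if cutoff - bandwidth <= xi < cutoff]
    right = [(xi - cutoff, yi) for xi, yi in zip(x, y)
             if cutoff <= xi <= cutoff + bandwidth]
    a_left, _ = linear_fit(*zip(*left))
    a_right, _ = linear_fit(*zip(*right))
    return a_right - a_left

# Simulated data: assignment variable uniform on [0, 4], treatment if x >= 2.
rng = random.Random(42)
x = [rng.uniform(0.0, 4.0) for _ in range(4000)]
y = [1.0 + 0.5 * xi + (2.0 if xi >= 2.0 else 0.0) + rng.gauss(0.0, 0.3) for xi in x]

tau_hat = sharp_rd(x, y, cutoff=2.0, bandwidth=1.0)
print(f"estimated jump at the cutoff = {tau_hat:.3f}")
```

A narrower bandwidth relies less on the functional form but uses fewer observations, which is the bias and variance trade-off behind the power concerns noted earlier.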


Nonlinear RD


Issues with Causal Inference

Causal inference is possible because of the continuity of the underlying functions $E[Y_1 \mid X = c]$ and $E[Y_0 \mid X = c]$

Can use average outcome right below cutoff (denied treatment) as counterfactuals

for those right above cutoff (treated)

Limitation: there are no observations closer to the cutoff than c’ and c”. RDD is fundamentally an extrapolation-based approach

Since data is required away from cutoff, estimates will depend on chosen functional form