A Basic Propensity Score Matching Example for Diff-in-Diff
2022-09-29
In this tutorial, we will be using the same dataset as other tutorials, namely my panel data for Philadelphia census tracts.
It should be noted that this matching example differs from the catholic school data homework in that, instead of strictly treated and untreated groups, we have an eventually treated group and an untreated group. Obviously, this is why we have been using diff-in-diff to analyze this panel data. Ultimately, we will use diff-in-diff after matching. Another difference is that I am going to match solely on the pre-treatment means of certain variables at the tract level, which means this requires a few more steps than the matching example in the homework.
Consider this: if we are trying to match on likelihood of treatment (AKA propensity score matching), then it follows logically that we should match on pre-treatment variables. After all, it was their values before, not after, the treatment began that should theoretically determine their selection into treatment.
First, we’ll load the data, which is in my github files.
# load the data
library(curl)
download.file("https://raw.githubusercontent.com/alexmarsella/alexmarsella.github.io/master/assets/aggrework.rds", "aggrework.rds", method="curl")
agg <- readRDS("aggrework.rds")
# subset down to pre-treatment
aggpre2014 <- subset(agg, agg$year < 2014)
This step requires some extra explanation. Remember, I have determined that I want to match on the pre-treatment average of some variables within tracts. I am not trying to match for specific years. My goal here is to just find a set of 8 tracts that can pair with my 8 treated tracts on their propensity scores. I will ultimately match on pre-2014 violent crime rate, per capita income, proportion African-American, proportion who have not completed high school (25 and older), and proportion 16 to 64 who have not worked in the past twelve months.
Context matters. I am claiming here that I believe these five variables constitute criterion on which treatment may have been assigned.
library(dplyr)
# a strange error may occur if you do not include this next line before you do the following lines
aggpre2014 <- as.data.frame(aggpre2014)
# note that I include the treatment variable "fullzone" because I want to maintain this in the data, same with tract. I am obviously not calculating the average for these, but for tract X, the average is literally X. Same with "fullzone", for a tract in the zone, the average value is 1. I include this for completeness.
vsumdatapre2014 <- aggpre2014 %>%
group_by(tract) %>%
summarize(vcap=mean(vcap), pcinc=mean(as.numeric(pcinc))
,black=mean(as.numeric(black))
,lessthanhighcap=mean(as.numeric(lessthanhighcap))
,fullzone=mean(as.numeric(fullzone))
,nowork12cap=mean(as.numeric(nowork12cap))
,tract=mean(as.numeric(tract)))
Now, we’ll use the function matchit
. Note that the outcome variable is not violent crime, but fullzone
. Why? Because fullzone
is the variable equal to 1 for the eventually treated units and 0 for never treated units. Matchit will build a predictive model of propensity scores to determine how the independent variables chosen affect propensity of being included in the treated group!
library(MatchIt)
m.out1 <- matchit(fullzone ~ vcap+black+
lessthanhighcap+
+pcinc+nowork12cap ,
method = "nearest", data = vsumdatapre2014)
summary(m.out1)
##
## Call:
## matchit(formula = fullzone ~ vcap + black + lessthanhighcap +
## +pcinc + nowork12cap, data = vsumdatapre2014, method = "nearest")
##
## Summary of Balance for All Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance 0.0758 0.0201 1.0422 2.4548
## vcap 39.0256 30.3453 0.6751 0.3824
## black 0.7333 0.4346 0.9343 0.7970
## lessthanhighcap 0.2299 0.1960 0.3307 0.8414
## pcinc 13564.5625 23691.2452 -2.5258 0.0735
## nowork12cap 0.4632 0.3244 1.6036 0.4663
## eCDF Mean eCDF Max
## distance 0.3581 0.6083
## vcap 0.2011 0.5671
## black 0.2233 0.4639
## lessthanhighcap 0.1332 0.3546
## pcinc 0.2926 0.6025
## nowork12cap 0.3182 0.6022
##
##
## Summary of Balance for Matched Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio
## distance 0.0758 0.0741 0.0311 1.1623
## vcap 39.0256 40.0314 -0.0782 0.3951
## black 0.7333 0.7423 -0.0284 0.7360
## lessthanhighcap 0.2299 0.2288 0.0109 1.3767
## pcinc 13564.5625 13855.6250 -0.0726 1.6050
## nowork12cap 0.4632 0.4654 -0.0248 1.0361
## eCDF Mean eCDF Max Std. Pair Dist.
## distance 0.0027 0.125 0.0489
## vcap 0.1030 0.500 1.6389
## black 0.0890 0.500 0.9046
## lessthanhighcap 0.0717 0.250 1.1252
## pcinc 0.0533 0.375 0.8389
## nowork12cap 0.0520 0.250 0.6136
##
## Sample Sizes:
## Control Treated
## All 367 8
## Matched 8 8
## Unmatched 359 0
## Discarded 0 0
#will show us the new balance
plot(summary(m.out1), abs=FALSE )
#the match.data function turns the matching output into a dataset
m.data1 <- match.data(m.out1)
Now comes another major departure from the homework. I am not going to analyze m.data1
. After all, m.data1
is 16 observations, pre-2014 averages for 8 treated and 8 untreated tracts. Not useful! What is useful is the choice of tracts themself. So now I am going to subset agg
down to just those tracts. It also assigned a subclass
to each observation. The subclass determines the partner tract. In other words, the two tracts in subclass 1 were matched. The two tracts in subclass 2 were matched, and so on. Generally speaking, people will cluster their standard errors at the subclass instead of the tract level in this context.
library(lfe)
library(stargazer)
#once again, we need to set agg to a data frame to avoid an error
agg <- as.data.frame(agg)
#subset down to the matched tracts
m.data2 <- subset(agg, agg$tract %in% c(m.data1$tract))
#feeding subclass in
m.data2<-merge(x = m.data2, y = m.data1[ , c("subclass","tract")], by = "tract", all.x=TRUE)
#now, we just run a diff in diff model with two-way fixed effects.
match2 <- felm(vcap ~ post:fullzone+black+
+lessthanhighcap+
pcinc+nowork12cap|year+tract|0|subclass, data=m.data2)
# for good measure, let's run another, except let's include all of our controls, even if we didn't match on them.
match1 <- felm(vcap ~ post:fullzone+white+black+hisp+nofathercap
+male15to29cap+lessthanhighcap+highcap+somecollegecap
+bachelorcap+pcinc+nowork12cap|year+tract|0|subclass, data=m.data2)
#finally, a nice stargazer table for reporting our results.
stargazer(match1,match2,
add.lines = list(c("TWFE?", "X","X")), title="Difference-in-Differences for Violent Crime",
align=FALSE, notes="Std. Errors Clustered at subclass level", df=FALSE, notes.append=TRUE, font.size = "small" ,
covariate.labels = c("White","Black","Hispanic","Single Mother", "Male 15 to 29", "Less than Highschool", "High School","Some College","Bachelor's","Income","No work 12 months","Post x Zone"), type="text")
##
## Difference-in-Differences for Violent Crime
## =============================================================
## Dependent variable:
## ----------------------------------------
## vcap
## (1) (2)
## -------------------------------------------------------------
## White 22.106
## (51.336)
##
## Black -35.179 -47.628
## (36.964) (37.836)
##
## Hispanic -15.959
## (92.617)
##
## Single Mother 2.433
## (6.876)
##
## Male 15 to 29 -55.732
## (30.788)
##
## Less than Highschool -15.538 4.377
## (31.174) (32.748)
##
## High School -21.962
## (33.468)
##
## Some College 9.567
## (25.226)
##
## Bachelor's 60.493
## (42.879)
##
## Income -0.0004 -0.0003
## (0.001) (0.001)
##
## No work 12 months 1.810 -9.117
## (17.709) (10.980)
##
## Post x Zone -15.059* -14.382**
## (6.927) (5.791)
##
## -------------------------------------------------------------
## TWFE? X X
## Observations 160 160
## R2 0.865 0.852
## Adjusted R2 0.825 0.819
## Residual Std. Error 7.677 7.805
## =============================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
## Std. Errors Clustered at subclass level