Intermediate Data Tidying and Plotting
2022-08-02
This tutorial can be worked through in order from beginning to end, or dipped into at various points for specific code examples. In this tutorial, I touch on:
- loading in CSV files
- thinking about the structure of data and how to make it more palatable for analysis
- “tidying” data from “wide” to “long”
- dealing with date variables
- subsetting on the fly
- plotting with base R
- slightly more advanced plotting with base R
- a quick note on why a plot doesn’t prove anything in the context of Causal Inference
Save yourself some headaches now and set up a working directory where you keep the files related to your R script. Anything in your working directory can be called into your environment directly, as you will see further down the page. Make a new folder on your computer or choose one that already exists; it is up to you.
getwd() #this allows you to see what your directory is
## [1] "C:/Users/Alex/Dropbox/PROJECTS/Econ482"
setwd("C:/Users/Alex/Dropbox/PROJECTS/Econ482") #this is the directory I will be using for this class
My friend Tom used to be a prosecutor for the Philadelphia DA. He asked me to gather some data on conviction rates for him. The DA’s office provides data on prosecutorial outcomes, which I downloaded and named tom.csv. What Tom wanted was a plot of guilty trial outcomes and guilty pleas.
data <- read.csv("tom.csv", stringsAsFactors = F)
The only reason that file read in so seamlessly is that it’s located in “C:/Users/Alex/Dropbox/PROJECTS/Econ482”. If it were not located there, R would return an error. You, the reader, can acquire this file with the following code.
library(readr)
data <- read_csv("https://raw.githubusercontent.com/alexmarsella/alexmarsella.github.io/master/assets/tom.csv")
Let’s take a look at this dataset, using the head() command to view the first ten rows. I hide the output for the sake of this document, but you should try it yourself on your own computer to see what it looks like.
head(data, n=10)
Think about the format of this data. Each date repeats seven times, because the data lists every case outcome on a given day, along with the total number of cases. The numbers under each crime column are simply counts of how many cases of that crime had the given outcome on the given day.
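If you want to verify this structure on your own machine, the outcome column should contain seven unique values (six case outcomes plus a “Total” row); a quick, optional check:

unique(data$dispoType) #should return the six case outcomes plus "Total"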
This data is in what is called a “wide” format, which means that each variable of interest, namely each crime, has its own column. If I wanted to study or even plot this data, this format would be very tedious to deal with. There is an alternative format called “long” (also called “tidy”, and you’ll see why).
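If the wide/long distinction feels abstract, here is a toy example with made-up numbers (not the DA data) showing the kind of reshape melt() performs:

#a toy illustration of wide vs. long (made-up numbers, not the DA data)
library(reshape2)
wide <- data.frame(day = c("Mon", "Tue"), Robbery = c(2, 5), Assault = c(1, 3))
long <- melt(wide, id = "day") #one row per day-crime pair
long
##   day variable value
## 1 Mon  Robbery     2
## 2 Tue  Robbery     5
## 3 Mon  Assault     1
## 4 Tue  Assault     3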
First, I am going to load two libraries: dplyr and reshape2.
library(dplyr)
library(reshape2)
#first, melt() will stack all of the other columns into variable/value pairs, keeping date and trial outcome as id columns
data2 <- melt(data, id=c("date_value", "dispoType"))
colnames(data2) <- c("DATE", "DISP_TYPE", "OFFENSE_TYPE", "N_CASES")
#now, take another look at the data to see how it looks
head(data2)
## DATE DISP_TYPE OFFENSE_TYPE N_CASES
## 1 2014-01-02 Dismissed/Withdrawn/Etc Homicide..Shooting 0
## 2 2014-01-02 Diversion Homicide..Shooting 0
## 3 2014-01-02 Exonerated/Won on Appeal Homicide..Shooting 0
## 4 2014-01-02 Guilty Homicide..Shooting 0
## 5 2014-01-02 Guilty Plea/Nolo Homicide..Shooting 0
## 6 2014-01-02 Not Guilty/Acquittal Homicide..Shooting 0
Note that now there are only four columns. We’re not quite done, but the data is a bit cleaner. We still have date and trial outcome, but now there is a single variable for the type of offense and a count of cases for each offense-outcome-day combination.
Tom was interested in guilty outcomes, so I will filter the data a bit. dplyr allows me to filter for Guilty, Guilty Plea/Nolo, and Total within DISP_TYPE using the %in% operator.
data2 <- filter(data2, DISP_TYPE %in% c("Guilty", "Guilty Plea/Nolo", "Total"))
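If %in% is new to you, it returns a logical vector marking which elements of the left-hand vector appear in the right-hand one:

c("Guilty", "Diversion") %in% c("Guilty", "Guilty Plea/Nolo", "Total")
## [1]  TRUE FALSE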
We have to pick a level of aggregation: if we measured conviction rates on a daily basis, we would have an extremely noisy, cluttered plot at the end of the day. Let’s aggregate at the month level.
#CREATE MONTH_DATE COLUMN AND ROLL UP
data2["MONTH_DATE"] <- format(as.Date(data2$DATE), "%Y-%m-01")
#I want a column for month, a column for offense type, and then the outcome types (DISP_TYPE) split into as many columns as there are unique outcomes
#I also want to sum those outcomes on an offense-type by month basis
#The variable that tells me the value of interest is N_CASES
data2 <- dcast(data=data2, formula = MONTH_DATE + OFFENSE_TYPE ~ DISP_TYPE, fun.aggregate = sum, value.var="N_CASES")
#Now, I will sum the guilties as "convicted" and then divide by "total" to get an actual conviction rate
data2["Convicted"] <- data2["Guilty"] + data2["Guilty Plea/Nolo"]
data2["Conviction Rate"] <- data2["Convicted"] / data2["Total"]
head(data2, n=10)
##    MONTH_DATE                 OFFENSE_TYPE Guilty Guilty Plea/Nolo Total Convicted Conviction Rate
## 1  2014-01-01           Homicide..Shooting      0                0     0         0             NaN
## 2  2014-01-01              Homicide..Other     13               13    27        26       0.9629630
## 3  2014-01-01             Attempted.Murder      4               15    29        19       0.6551724
## 4  2014-01-01           Non.Fatal.Shooting      2                0    11         2       0.1818182
## 5  2014-01-01                         Rape      2               13    23        15       0.6521739
## 6  2014-01-01 Robbery.with.a.Deadly.Weapon      1               19    59        20       0.3389831
## 7  2014-01-01                      Robbery      8               19    72        27       0.3750000
## 8  2014-01-01           Aggravated.Assault     29               92   360       121       0.3361111
## 9  2014-01-01                Strangulation      0                0     0         0             NaN
## 10 2014-01-01               Simple.Assault     12               27   156        39       0.2500000
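As an aside, the same month-level rollup can be done with dplyr plus tidyr instead of reshape2. This is a sketch only, applied to the long data from before the dcast() step (I use data2_long as a stand-in name for that intermediate data frame; tidyr is an extra package you would need to install):

library(tidyr)
#group by month, offense, and outcome, sum the case counts, then spread the outcomes into columns
data2_alt <- data2_long %>%
  group_by(MONTH_DATE, OFFENSE_TYPE, DISP_TYPE) %>%
  summarise(N_CASES = sum(N_CASES), .groups = "drop") %>%
  pivot_wider(names_from = DISP_TYPE, values_from = N_CASES)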
Plotting our Data
Now, our data is in an easily plottable format. We have offense type by month data with conviction rates. We could now plot, for any individual crime type, conviction rates on a monthly basis.
That being said, our date variable is just a character variable; we want it to be a date that R recognizes as a date, so the ordering is unambiguous.
library(lubridate) #lubridate offers handy date tools, though the conversion below only uses base R's as.Date()
data2$date <- as.Date(data2$MONTH_DATE)
Note that the following plots every offense type’s conviction rate at once, which gives you a really ugly plot.
plot(data2$date, data2$`Conviction Rate`)
We should be specific about which crime’s conviction rate we want to plot. One way of doing this is using square brackets to subset our data. Note that because Conviction Rate has a space in it, we surround it with backticks (top left of your keyboard, sharing a key with tilde). Without those backticks, R would treat the space as a break in the code and try to find something called data2$Conviction, which does not exist.
plot(data2$date[data2$OFFENSE_TYPE=='Robbery.with.a.Deadly.Weapon'],
data2$`Conviction Rate`[data2$OFFENSE_TYPE=='Robbery.with.a.Deadly.Weapon']) #Always use double equals in situations like this.
#note that I subset date as well as conviction rate, or else the x and y lengths would differ in the plot and return an error
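If you prefer to avoid backticks entirely, double-bracket indexing accepts the column name as a string; this sketch reproduces the plot above:

plot(data2$date[data2$OFFENSE_TYPE=='Robbery.with.a.Deadly.Weapon'],
     data2[["Conviction Rate"]][data2$OFFENSE_TYPE=='Robbery.with.a.Deadly.Weapon']) #same plot, no backticks needed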
We can make this plot a bit more attractive. We’ll add xlab, ylab, and main within the plot() command.
plot(data2$date[data2$OFFENSE_TYPE=='Robbery.with.a.Deadly.Weapon'],
data2$`Conviction Rate`[data2$OFFENSE_TYPE=='Robbery.with.a.Deadly.Weapon'],
xlab="Year", ylab="Armed Robbery Conviction Rate", main="Monthly Conviction Rate - Armed Robbery (Guilty, Guilty Plea, Nolo)")
We can also get really advanced with it, adding a smoothing line and some vertical lines around specific events. For example, I can add a black line for when the recent DA was elected, and a blue line for when Covid began. Determining how to use the more advanced maneuvers below is left as an exercise for the reader.
plot(data2$date[data2$OFFENSE_TYPE=='Robbery.with.a.Deadly.Weapon'],
data2$`Conviction Rate`[data2$OFFENSE_TYPE=='Robbery.with.a.Deadly.Weapon'],
xlab="Year", ylab="Armed Robbery Conviction Rate", main="Monthly Conviction Rate - Armed Robbery (Guilty, Guilty Plea, Nolo)")
#convert the dates to numeric (days since 1970-01-01) so loess() can fit on them
data2$ndate <- as.numeric(data2$date)
# a bit more advanced, but we can add a smoothing line to try to fit the points. We'll define a loess object that fits the dates to the rates
lo <- loess(data2$`Conviction Rate`[data2$OFFENSE_TYPE=='Robbery.with.a.Deadly.Weapon'] ~ data2$ndate[data2$OFFENSE_TYPE=='Robbery.with.a.Deadly.Weapon'])
#a fine grid of x values across the date range, for a smooth curve
xl <- seq(min(data2$ndate), max(data2$ndate), (max(data2$ndate) - min(data2$ndate))/1000)
lines(xl, predict(lo, xl), col='red', lwd=2)
#vertical lines in days-since-1970 units: 17532 is 2018-01-01 (around when the new DA took office) and 18322 is 2020-03-01 (around when Covid began)
abline(v=c(17532, 18322), col=c("black", "blue"), lty=c(1,2), lwd=c(2, 3))
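If you want the plot to be self-explanatory, base R’s legend() can label the smoothing line and the two event markers; a sketch to run after the plotting code above:

legend("topleft",
       legend = c("Loess fit", "New DA takes office", "Covid begins"),
       col = c("red", "black", "blue"),
       lty = c(1, 1, 2), lwd = 2, bty = "n") #bty="n" drops the box around the legend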
… but does this plot prove anything? Thinking about Causal Inference
Looking at the plot, you may think that conviction rates for robbery with a deadly weapon were rising before the DA switched over in 2018, and fell afterward. This is certainly a fact, but the harder claim to justify is whether the Philadelphia DA caused the reduction in convictions.
This plot alone does not meet the burden of proof required for economists to make claims like “The Philly DA elected in 2018 caused conviction rates to fall.” A few things are needed before a good applied social scientist would actually be able to make that claim:
- A strong theory
- Tom has his own theory, which is that the DA fired many of the seasoned prosecutors and replaced them with young, inexperienced ones.
- This is an intuitive theory. It certainly provides an explanation for the fall in conviction rates if we can demonstrate with causal inference methods that the fall is attributable to the 2018 DA change. This leads us to an equally important, if not more important, ingredient:
- Strong causal evidence derived from a causal inference method with a clear identification strategy.
- For example, what if we look at another city on the East Coast? Suppose we find that Pittsburgh had a similar upward trending conviction rate for this crime before 2018, and also experienced a similar decline post-2018. Considering that Pittsburgh has had the same DA since 1998, it would be quite interesting if we found this to be true. If it meets the parallel trends assumption, then this could be a setting for a basic diff-in-diff.
- Note: A two-way fixed effects difference-in-differences model (with control data that meets the parallel trends assumption) is probably the bare minimum for attempting to “prove” that the new DA caused the reduced conviction rates; a rough sketch of what that regression could look like follows this list. Synthetic control, or even synthetic difference-in-differences, may be preferable (and is more “in fashion” as econometricians develop more and more refinements of and additions to the classic diff-in-diff model).
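To make the note above concrete, here is a minimal sketch of a two-way fixed effects diff-in-diff in R. Everything here is hypothetical: it assumes a combined monthly panel called panel with columns conv_rate, city, month, and an indicator treat_post equal to 1 for Philadelphia observations after the DA change and 0 otherwise. Building that panel (and checking parallel trends) is the real work.

#hypothetical two-way fixed effects diff-in-diff; 'panel', 'conv_rate', 'city', 'month', and 'treat_post' are assumed, not built above
did <- lm(conv_rate ~ treat_post + factor(city) + factor(month), data = panel)
summary(did) #the coefficient on treat_post is the diff-in-diff estimate of the DA change's effect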