Investigating crime stats

Jan 29, 2017 21:13 · 2397 words · 12 minute read ggplot2 readxl tidyr stringr

Importing different data sets, calculations

Goal: Dealing with Excel files, analyzing data, adjusting for population.

In his inaugural address, Donald Trump promised to save America from a hellish wave of crime and disorder.

“The crime and the gangs and the drugs that have stolen too many lives and robbed our country of so much unrealized potential,” said President Trump. “This American carnage stops right here and stops right now.”

A statement on the White House website lays out the foundation for their claims.

White House statement on crime

The Trump administration alleges that:

  • In 2015, homicides increased by 17% in America’s fifty largest cities
  • That’s the largest increase in 25 years
  • In our nation’s capital, killings rose by 50 percent over the past four years
  • There were thousands of shootings in Chicago last year alone

Trump does not have a great track record when it comes to accurate representation of data.

So think like a journalist.

Is this true? How can it be verified?

With data on homicides. Where can we get that data?

Reframe that question.

Who would collect data on homicides and why? (Think local then go wider.)

  • Police departments
    • To arrest those responsible
  • City governments
    • To track population, public policy adjustments
  • Local news media
    • To report on events that affect the public
  • The Justice Department
    • To track local, regional, state trends
  • The Federal Bureau of Investigation
    • To track local statistics
  • Insurance, real estate companies, businesses
    • To measure risk and likelihood of crime in an area to set rates

How can you get the data from them?

  • Call them up
  • Check online
  • File a FOIA request

Let’s use the FBI’s Uniform Crime Reporting Statistics

They’ve collected data from police departments across the country for decades.

Each year they create a new report. However, their table-building tool, which lets people pull data across several years, only goes up to 2014. It doesn’t have the latest 2015 data.

But that’s alright for now.

To verify the first claim, “In 2015, homicides increased by 17% in America’s fifty largest cities,” we need both the 2015 and 2014 data so we can compare them.

We’ll need a list of the 50 largest cities and we’ll need homicide totals from 2014 and 2015 to measure the percent change.
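For reference, the math at the end of this is just the standard percent-change formula. Here’s a minimal sketch with made-up numbers:

# Percent change = (new - old) / old * 100 (these numbers are made up for illustration)
old <- 100
new <- 117
round((new - old) / old * 100, 2)
## [1] 17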

The 2015 data is here and the 2014 data is here.

It’s not the most intuitive site to navigate, so I’ll point you in the right direction.

Click the ‘Download Excel’ link on Table 8 of the 2015 data and save the file to your data folder.

Do the same for the 2014 data on Table 8.

You should now have two Excel spreadsheets

Let’s load them in and see what they look like.

We’ll use the readxl package.

library(readxl)
h2014 <- read_excel("data/table-8.xls", sheet=1)
This is what it looks like when you View() the dataframe in RStudio.

Problems, right?

That’s because it reflects the formatting of the Excel spreadsheet.

It’s not the tidy data structure we’re used to working with in R.

There are too many cells with NAs.
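If you want to put a number on the mess before fixing it, one quick optional check is to count the NA cells in the raw import:

# Optional check: how many cells came in as NA?
sum(is.na(h2014))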

We have to clean it up.

# Fortunately, we can tell the read_excel() function to skip a set number of rows when bringing in data

h2014 <- read_excel("data/table-8.xls", sheet=1, skip=3)

Better. The columns have been named correctly, but it’s good practice to change column names so they’re syntactically valid. This means no spaces or odd characters.

To do that, use the make.names() function.

colnames(h2014) <- make.names(colnames(h2014))
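If you’re curious what make.names() actually does, here’s a tiny illustration with header strings like the ones in this table:

# make.names() swaps spaces and other invalid characters for dots
make.names(c("Murder and nonnegligent manslaughter", "Violent crime"))
## [1] "Murder.and.nonnegligent.manslaughter" "Violent.crime"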
Now we're talking.

But now we need to replace the NAs in the State column.

There might be a quicker way to do it, but for now I’m going to show you a simple loop.

# What this loop does

# For every row between 1 and the total number of rows in the h2014 dataframe
for (i in 1:nrow(h2014)) {
  
  # If the cell in State is blank,
  if(is.na(h2014$State[i])) {
    # Then fill it with the value of the State cell previous to it
    h2014$State[i]<-h2014$State[i-1]
  } 
}

# This way the cell will be filled with the last non-blank value, which, logically, is the state identifier.
Excellent.
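As an aside, that quicker way does exist: the tidyr package (which we’ll load later anyway) has a fill() function that does the same fill-down in one line. A hedged alternative to the loop:

# Alternative to the loop: fill each NA State cell with the last non-NA value above it
library(tidyr)
h2014 <- fill(h2014, State)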

Now, let’s do this to the 2015 data by just adjusting the code from above.

h2015 <- read_excel("data/Table_8_Offenses_Known_to_Law_Enforcement_by_State_by_City_2015.xls", sheet=1, skip=3)

colnames(h2015) <- make.names(colnames(h2015))

# For every row between 1 and the total number of rows in the h2015 dataframe
for (i in 1:nrow(h2015)) {
  
  # If the cell in State is blank,
  if(is.na(h2015$State[i])) {
    # Then fill it with the value of the State cell previous to it
    h2015$State[i]<-h2015$State[i-1]
  } 
}

Great, we’ve got two data sets with homicides for all cities across the country.

Let’s narrow the 2015 data down to the 50 largest cities (Fortunately, the data includes population). And then we’ll join it to the 2014 data.

We’ll use the dplyr package.

library(dplyr)

# Arranging by population in descending order
h2015 <- arrange(h2015, desc(Population))

# Grabbing just the first 50 cities
h2015_50 <- h2015[1:50,]

# Prepping the two dataframes for joining
h2015_50 <- select(h2015_50, State, City, Homicides2015=Murder.and.nonnegligent.manslaughter)
h2014 <- select(h2014, State, City, Homicides2014=Murder.and.nonnegligent.manslaughter)
h2015_joined <- left_join(h2015_50, h2014, by=c("State", "City"))

Did that work?

head(h2015_joined)
## # A tibble: 6 x 4
##   State        City                            Homicides2015 Homicides2014
##   <chr>        <chr>                                   <dbl>         <dbl>
## 1 NEW YORK     New York                                  352            NA
## 2 CALIFORNIA   Los Angeles                               282           260
## 3 ILLINOIS     Chicago                                   478           411
## 4 TEXAS        Houston                                   303           242
## 5 PENNSYLVANIA Philadelphia                              280           248
## 6 NEVADA       Las Vegas Metropolitan Police …           127           122

Hm, it didn’t join successfully on some. What’s up with that?

Let’s take a look at what’s going on in New York.
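One way to peek at those rows (just an ad hoc check) is to grep for the city name in the 2014 table and look at what comes back:

# Show any 2014 rows whose City value contains "New York"
h2014[grep("New York", h2014$City), c("State", "City")]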

Aha

It looks like a number of the data points have footnote numbers appended to them (dammit, Excel!).

We have to use regex to clean out those numeric characters.

# This library has a function I'll use to trim trailing spaces
library(stringr)

# These commands get rid of numbers and commas
h2014$City <- gsub('[0-9]+', '', h2014$City)
h2014$City <- gsub(',', '', h2014$City)
h2014$City <- str_trim(h2014$City)

h2014$State <- gsub('[0-9]+', '', h2014$State)
h2014$State <- gsub(',', '', h2014$State)
h2014$State <- str_trim(h2014$State)

# Now let's try to bring the 2015 and 2014 dataframes together again
h2015_joined <- left_join(h2015_50, h2014, by=c("State", "City"))

Alright, how many NA cells are there now? We can check with sum(is.na(h2015_joined$Homicides2014)), which tells us there are 5 missing values.

Why is that?

Well, I won’t make you dig around further, but if you did, you’d see that it was because most of those cities weren’t in both years.
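If you do want to dig, a quick way (a sketch using the dplyr functions we already loaded) is to filter the joined table for rows where the 2014 column is still NA:

# Which of the 50 largest cities still have no 2014 match?
filter(h2015_joined, is.na(Homicides2014)) %>%
  select(State, City)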

As in Fort Worth had data for 2015 but not for 2014.

In all, Honolulu, Fort Worth, Tucson, and Wichita were not present in the 2014 data and thus could not be joined.

Nashville was present in both data sets but did not join successfully because in 2015, officials called it “Nashville Metropolitan” while in 2014, it was called just “Nashville”.

This is the frustrating thing about working with data.

It can be incomplete or inconsistent and takes a long time to prepare it before we can analyze it.

OK, let’s do what we can.

# We can at least fix Nashville quickly

h2014$City <- gsub('Nashville', 'Nashville Metropolitan', h2014$City)

# Now let's try to bring the 2015 and 2014 dataframes together again
h2015_joined <- left_join(h2015_50, h2014, by=c("State", "City"))

Alright, we’re down to just four missing cities.

Just for fun, let’s do some math even though we know the data is incomplete.

total_homicides_2014 <- sum(h2015_joined$Homicides2014, na.rm=T)
total_homicides_2015 <- sum(h2015_joined$Homicides2015, na.rm=T)

# Percent change

percent_change_2015 <- round((total_homicides_2015- total_homicides_2014)/total_homicides_2014*100,2)

So with four cities missing from the 2014 data, there were 4368 homicides in 2014 and 5185 homicides in 2015.

That’s an 18.7 percent change, which is close to the 17 percent that Trump alleges. But because four cities are missing from the 2014 total while their 2015 homicides are still counted, that percent change is exaggerated.

Let’s try it one more way and try to make it as complete as possible.

# Let's filter to the largest 50 cities that have both 2014 and 2015 data

h2015 <- select(h2015, State, City, Population, Homicides2015=Murder.and.nonnegligent.manslaughter)
h2015_joined <- left_join(h2015, h2014, by=c("State", "City"))
h2015_joined <- filter(h2015_joined, !is.na(Homicides2014))
h2015_joined <- h2015_joined[1:50,]

Let’s try the math again.

total_homicides_2014 <- sum(h2015_joined$Homicides2014, na.rm=T)
total_homicides_2015 <- sum(h2015_joined$Homicides2015, na.rm=T)

# Percent change

percent_change_2015 <- round((total_homicides_2015- total_homicides_2014)/total_homicides_2014*100,2)

Alright, using complete data, the percent change in homicides between 2014 and 2015 is 15.63 percent, which is an increase, but not as big an increase as Trump alleges.

Also, what happens when you include all the matched city data and don’t limit it to the 50 largest cities? The percent change is 14.34 percent.
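For transparency, here’s roughly how that all-cities number can be computed, reusing the h2015 and h2014 frames prepared above (a sketch; all_joined is just a scratch name, and your exact figure depends on how many city names matched):

# Join every city, keep only those present in both years, and redo the math
all_joined <- left_join(h2015, h2014, by=c("State", "City"))
all_joined <- filter(all_joined, !is.na(Homicides2014), !is.na(Homicides2015))
round((sum(all_joined$Homicides2015) - sum(all_joined$Homicides2014)) /
        sum(all_joined$Homicides2014) * 100, 2)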

Is that the largest in 25 years?

To look into this other allegation, we can finally use the FBI’s table-building tool.

Navigate to the bottom left of the page

Click Multiple agencies, one variable under the Large local agencies section.

Hold the control button and select Cities 1,000,000 or over, Cities from 500,000 thru 999,999, and Cities from 250,000 thru 499,999

Click the Next button.

  • Select all the agencies in the menu on the left (command+A or ctrl+A)
  • Choose Murder and nonnegligent manslaughter in the pull down menu
  • Pick 1990 to 2014 (which gives us 25 years of data)
  • Click Get Table
Sharp looking table

OK, get that data by clicking the ‘Spreadsheet of this table’ link and saving the file.

Move that file over to your data folder and load it into R.

Let’s take a quick look at the structure before bringing it in.
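If you’d rather peek from R instead of opening the file, readLines() will show the first few raw lines of the download (an optional check):

# Peek at the first few raw lines of the CSV before importing it
readLines("data/LocalCrimeTrendsInOneVar.csv", n = 8)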

Yup, gross structure like last time.

When we bring it in, we’ll need to skip the first few rows again.

trends <- read.csv("data/LocalCrimeTrendsInOneVar.csv", stringsAsFactors=F, skip=4)

Looks like that worked except:

All those footnotes at the bottom are going to mess things up.

We can filter that out.

trends <- trends[1:82,]

# Also, let's get rid of that floating blank column at the end
trends$X <- NULL

Let’s figure out the percent change between each year.

We’ll use the tidyr package to restructure the data, which makes it easier to total up homicides per year.

library(tidyr)

re_trends <- trends %>%
  gather("year", "homicides", 3:27) %>%
  group_by(year) %>%
  summarize(total=sum(homicides, na.rm=T))

library(knitr)
kable(re_trends)
year total
X1990 11187
X1991 11743
X1992 11160
X1993 11359
X1994 10365
X1995 9445
X1996 8448
X1997 7773
X1998 6883
X1999 6309
X2000 6667
X2001 7067
X2002 7031
X2003 7078
X2004 6779
X2005 6766
X2006 7225
X2007 6909
X2008 6550
X2009 5953
X2010 5851
X2011 5659
X2012 5859
X2013 5385
X2014 5395

Alright, we’re getting somewhere.

First, we need to clean it up a bit and figure out the year-to-year percent change.

And then we can try to visualize it.

re_trends$year <- gsub("X", "", re_trends$year)

# There's probably a faster way to do this, but I'm going to make a loop to determine the percent change for each row.

re_trends$percent_change <- 0

for (i in 2:nrow(re_trends)) {
  re_trends$percent_change[i] <- round((re_trends$total[i]-re_trends$total[i-1])/re_trends$total[i-1]*100,2)
}
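And since the comment above says there’s probably a faster way: here’s one, a vectorized sketch using dplyr’s lag(), which looks up the previous year’s total for every row at once (note the first year comes out as NA rather than 0):

# Vectorized alternative to the loop: lag() gives the previous year's total
re_trends$percent_change <- round(
  (re_trends$total - dplyr::lag(re_trends$total)) /
    dplyr::lag(re_trends$total) * 100, 2)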

library(ggplot2)

ggplot(re_trends, aes(x=year, y=percent_change)) + geom_bar(stat="identity")

Well, even though we’re not comparing the same cities, we can get a general sense of percent change year to year.

According to that measurement, yes, the 2014-to-2015 percent change of 15.63 percent is higher than any other year-to-year change since 1990.

But that number is misleading.

Why?

Check out the overall trend of raw numbers in these cities.

ggplot(re_trends, aes(x=year, y=total)) + geom_bar(stat="identity")

As you can see, focusing on year-to-year percent change takes away from the larger picture, which is that homicides are on an overall downward trend and have been since the ’90s.

Homicides in Washington DC

“In our nation’s capital, killings rose by 50 percent over the past four years.”

Is that true?

According to the 2015 data, the number of homicides was 165. Four years earlier, it was 88.

That’s about an 88 percent increase, it’s true.
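A quick check of that arithmetic:

# (165 - 88) / 88, expressed as a percentage
round((165 - 88) / 88 * 100, 2)
## [1] 87.5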

But once again, why focus on four years?

Let’s look at the data.

dc <- filter(trends, State=="DC") %>%
  gather("Year", "Total", 3:27) 

# Adding the 2015 data
dc <- rbind(dc, data.frame(Agency="Washington Metropolitan Police Dept", State="DC", Year="X2015", Total=162))
dc$Year <- gsub("X", "", dc$Year)

ggplot(dc, aes(x=Year, y=Total)) + geom_bar(stat="identity")

Because four years ago, DC had the lowest number of homicides in decades. So if you want to exaggerate the change between now and then, you pick the “then” with the lowest point.

Is there an uptick in homicides? Sure. But that’s only happened very recently.

Chicago homicides

“There were thousands of shootings in Chicago last year alone.”

You know what we’re going to do.

Same thing as with DC but for Chicago.

chi <- filter(trends, Agency=="Chicago Police Dept") %>%
  gather("Year", "Total", 3:27) 

# Adding the 2015 data
chi <- rbind(chi, data.frame(Agency="Chicago Police Dept", State="IL", Year="X2015", Total=478))
chi$Year <- gsub("X", "", chi$Year)

ggplot(chi, aes(x=Year, y=Total)) + geom_bar(stat="identity")

The FBI doesn’t track shootings, only homicides.

We’d need to get that data from the Chicago Police themselves, most likely.

So, it’s true that shootings are in the “thousands” but in relation to what?

We know for sure that homicide deaths in Chicago are far lower than they were in the early ’90s, even with the recent uptick. It’s most likely that shootings have come down over that same period, too.

We’ll only know for sure if we have the data.

Also, why the fixation on Chicago? Is that fair? Sure, it has a lot of shootings and deaths, but it also has a very large population.

Let’s check that real quick with the 2015 data since that data set includes both total and population.

library(DT)

h2015b <- select(h2015, State, City, Population, Homicides2015) %>%
  filter(!is.na(City)) %>%
  filter(Population>100000)

h2015b$per_capita <- round(h2015b$Homicides2015/h2015b$Population*100000,2)

h2015b <- arrange(h2015b, desc(per_capita))
datatable(h2015b)

St. Louis and Baltimore and many other cities actually have higher homicides per 100,000 residents than Chicago.

Chicago’s actually 25th compared to all the other cities. Even Hartford has a higher rate of homicides.
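If you want to verify that ranking yourself, one way is to look up Chicago’s row position in the table we just sorted by per-capita rate (grepl() guards against any leftover footnote characters in the city name):

# Chicago's position in the per-capita table (h2015b is sorted by rate)
which(grepl("Chicago", h2015b$City))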

What’s the lesson in all this?

It’s not that difficult to verify claims when the data is easily available.

The hardest part was just cleaning up the data so we could make apples-to-apples comparisons.
