US Police Killings: What the data tells us
Exploratory Data Analysis on Police Killings from 2015-2016
Introduction
In this article, we will analyze one of America’s hottest political topics, which encompasses issues ranging from institutional racism to the role of law enforcement personnel in society. But first, I have a favor to ask. For the next 10 minutes, let’s leave our preconceived notions of what’s true at the door. Prior domain knowledge is vital for making inferences from data. But if we build our statistical models based on preexisting beliefs, we are less likely to get to the right answers and more likely to ask the wrong questions. That was my spiel on the Philosophy of Statistics. Let’s get started.
Background and goals
The ever-growing argument, pushed by American liberals and libertarians and opposed heavily by a staunch conservative base, is that the US has a flawed law enforcement system that costs too many innocent civilians their lives. US cops kill around 1,000 people a year. If we contrast this number with other developed countries like Finland, where the cops fired only 6 shots in 2013, we get a grotesque picture. But if we look at other countries with similar levels of violent crimes and homicides, the picture gets fuzzier. In this project, we will limit our scope to analyzing police killings inside the US, and try to come up with useful insights based on data from police encounters that led to killings. Without further ado, let’s dig into the data.
Data-sets used
We are using 5 data-sets for this project:
1) a data-set on police killings from 2016
2) an identically framed data-set from 2015
3) a data-set for January-June 2015 that has fewer data-points than 2) but more informational features on the incidents
4) a data-set on US state populations and incidences of Violent Crime in 2015
5) a data-set with statistics from 1000+ US cities
The first 2 were compiled via “The Counted”, a project by The Guardian, a British Newspaper. The 3rd was compiled by Nate Silver’s FiveThirtyEight. I gathered all three from the website, Kaggle. I customized the 4th data-set on state populations myself, trimming down data compiled by the FBI. The fifth data-set on city statistics was obtained from www.simplemaps.com. You can find all of these files on my GitHub repository.
Pre-processing: Data-cleaning & Feature Engineering
This project will be completed in R. Data visualization, one of R’s main strengths, comes in very handy in such analyses. My programming scripts and plots are publicly available on GitHub.
Merging data sets: The first 2 data-sets we use are from the same source, so merging them together requires just one line of code. The 3rd data-set, while similar, is from a different source, so there are discrepancies we need to address. Even though it has all the same features as the first 2, some of them are named slightly differently. There are also additional features in this one, relating to demographic details of the tracts/areas in which the killings occurred. Instead of trying to merge all three data-sets together, we are going to use the larger merged data-set from the first 2 data-sets for most of our visualizations, and work with the 3rd data-set exclusively when looking at demographics data. The 4th and 5th data-sets will also be used to see how different states and cities compare in relation to police killings.
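To make the merge step concrete, here is a minimal sketch in Python (the project itself is in R, so this is an illustration and not the actual script). The rows are made up, and the mismatched column names `Age` and `State` are hypothetical stand-ins for whatever actually differs between the two sources.

```python
# Hypothetical miniature rows; the real data-sets have many more columns.
killings_2015 = [{"age": 34, "state": "CA", "armed": "Knife"}]
killings_2016 = [{"age": 51, "state": "TX", "armed": "Firearm"}]

# Same source and identical columns, so stacking the two years is one line.
killings_2015_16 = killings_2015 + killings_2016

# The third data-set names some columns differently (hypothetical names here).
RENAMES = {"Age": "age", "State": "state"}


def harmonize(row, renames=RENAMES):
    """Return a copy of row with mismatched column names mapped to the shared ones."""
    return {renames.get(key, key): value for key, value in row.items()}


extended_row = harmonize({"Age": 27, "State": "OK", "armed": "No", "pov": 21.3})
print(len(killings_2015_16), sorted(extended_row))
```

Renaming columns up front keeps the extended data-set comparable with the merged one even though the two are analyzed separately.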
Removing and transforming features: We remove identifying features like ID, Name and Street Address as they are almost always unique to each case and don’t allow us to reach any generalizations. We also remove the date features: Year, Month and Day. I checked to make sure these variables are uniformly distributed so we are not losing any valuable information. For the extended features data-set, we also remove any feature that serves as a name/ID for the county/area where the killing happened. These values are also mostly unique to each incident and rule out general trends. We do leave in location features such as City, State, Latitude and Longitude. Many of these features are imported to our data-frame in the wrong format when we load the data-sets. So we have to manually check the data-set summaries to ensure the numeric features are indeed saved as numeric info, the categorical features are saved as factors and so on, and convert them to the right format when they aren’t.
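The trimming and type-checking described here could look roughly like the following Python sketch (again, not the R code used in the project). The column names in `ID_AND_DATE_COLUMNS` are assumptions about how the raw files label these fields, and only the age conversion is shown.

```python
# Assumed names for the identifying and date columns that get dropped.
ID_AND_DATE_COLUMNS = {"uid", "name", "streetaddress", "year", "month", "day"}


def clean_row(row):
    """Drop identifying/date columns and make sure age is stored as a number."""
    kept = {k: v for k, v in row.items() if k not in ID_AND_DATE_COLUMNS}
    try:
        kept["age"] = int(kept["age"])
    except (KeyError, TypeError, ValueError):
        # Missing or non-numeric ages (e.g. "Unknown") become None.
        kept["age"] = None
    return kept


raw = {"uid": "2", "name": "X", "age": "22", "year": "2015", "state": "GA"}
cleaned = clean_row(raw)
print(cleaned)
```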
Missing values and data sub-sets: A small portion of the data-points have missing values for some of the features, such as Longitude and Latitude. But instead of removing these data points from our data-frame entirely, it makes more sense to use subsets of the data-frames accordingly to leave out specific data points when we are trying to visualize certain features.
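As a small illustration of this subsetting idea, the Python sketch below keeps every row in the main list and only filters at plotting time. The helper name `with_values` and the sample rows are hypothetical.

```python
rows = [
    {"city": "A", "latitude": 33.7, "longitude": -84.4, "age": 22},
    {"city": "B", "latitude": None, "longitude": None, "age": 40},
]


def with_values(rows, *columns):
    """Return only the rows that have a value for every requested column."""
    return [r for r in rows if all(r.get(c) is not None for c in columns)]


# Only the map needs coordinates; the age plots can still use both rows.
mappable = with_values(rows, "latitude", "longitude")
print(len(mappable), len(with_values(rows, "age")))
```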
Feature extraction: Next we add some new features. We add the feature, Region, to our data-set. Based on the state in which the killings occurred, we assign the data-points to one of the 4 regions in the US: West, South, Midwest, and Northeast. We also add another feature, Agegroup, to separate the deceased into different age groups.
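A sketch of how these two features might be derived, written in Python for illustration. The state-to-region mapping follows the standard US Census Bureau regions, which I am assuming matches the grouping used in the project. The age-group cutoffs (under 18, 18-34, 35+) are an assumption made for this example, since the article does not state the exact boundaries.

```python
# US Census Bureau regions (assumed to match the project's grouping).
REGION_STATES = {
    "Northeast": "CT ME MA NH RI VT NJ NY PA",
    "Midwest": "IL IN MI OH WI IA KS MN MO NE ND SD",
    "South": "DE DC FL GA MD NC SC VA WV AL KY MS TN AR LA OK TX",
    "West": "AZ CO ID MT NV NM UT WY AK CA HI OR WA",
}
STATE_TO_REGION = {
    state: region
    for region, states in REGION_STATES.items()
    for state in states.split()
}


def age_group(age):
    """Bucket an age into one of 3 groups; the cutoffs are assumed, not the article's."""
    if age is None:
        return None
    if age < 18:
        return "Under 18"
    return "18-34" if age < 35 else "35+"


print(STATE_TO_REGION["TX"], age_group(17), age_group(42))
```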
After trimming our features, we are left with the following 4 data-sets (the 2015 and 2016 data-sets from The Counted are now merged into one):
A) 2015 data-set with extended features- 467 killings
1) age- Age of deceased
2) gender- Gender of deceased
3) raceethnicity- Race/ethnicity of deceased
4) city- City where incident occurred
5) state- State where incident occurred
6) latitude- Latitude, geocoded from address
7) longitude- Longitude, geocoded from address
8) lawenforcementagency- Agency involved in incident
9) cause- Cause of death
10) armed- How/whether deceased was armed
11) region- Region of the US where the incident occurred
12) agegroup- Age-group that the deceased belongs to
13) share_white- Share of pop that is non-Hispanic white
14) share_black- Share of pop that is black (alone, not in combination)
15) share_hispanic- Share of pop that is Hispanic/Latino (any race)
16) p_income- Tract-level median personal income
17) h_income- Tract-level median household income
18) county_income- County-level median household income
19) pov- Tract-level poverty rate
20) urate- Tract-level unemployment rate
21) college- Share of 25+ pop with BA or higher
B) 2015–16 data-set- 2226 killings
Features 1-12 from data-set A
C) 2015 State data-set-
1) State ID
2) Population
3) Violent Crime incidents in 2015
D) City data-set
1) City
2) State ID
3) State name
4) Population
5) Population proper
6) Population density
Analysis: Geographical visualizations & Maps
At first, we are going to compare different states with each other, and then look at the cities with the most police killings.
The y-axis represents the number of violent crimes in each state in 2015 and the x-axis represents the number of police killings in that state in the same time period. The straight line passes through the origin, and its gradient is the total number of violent crimes in America divided by the total number of police killings. For the most part, these 2 factors were proportional, but there were outliers. Oklahoma, Arizona, and California (which topped both these statistics by a mile) had disproportionately high police killings relative to their incidences of violence. New York, on the other hand, had disproportionately few killings relative to its incidences of violence. Given that incidences of violence are roughly proportional to the population of a state, if we were to create a 3D scatter plot of the Population, Incidences of Violence and Killings in each state, we would expect the points to fall close to a straight line in 3D space (it’s in my R script but I haven’t included it here to keep things simple).
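The reference line and the idea of a state sitting above or below it can be expressed with a little arithmetic. The Python sketch below uses invented state labels and numbers (not the FBI figures) and computes, for each state, the ratio of actual killings to the killings the national line would predict from its violent crime count.

```python
# state: (police_killings, violent_crimes); illustrative numbers, not the real data.
states = {
    "AA": (20, 10000),
    "BB": (10, 10000),
    "CC": (30, 10000),
}

total_killings = sum(killings for killings, _ in states.values())
total_crimes = sum(crimes for _, crimes in states.values())

# Gradient of the line through the origin: violent crimes per police killing.
slope = total_crimes / total_killings


def killings_ratio(killings, crimes):
    """Actual killings divided by the killings expected from the national line."""
    expected = crimes / slope
    return killings / expected


ratios = {state: killings_ratio(*values) for state, values in states.items()}
for state, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(state, round(ratio, 2))
```

A ratio well above 1 corresponds to a state with disproportionately many killings for its level of violent crime, and a ratio well below 1 to the opposite case.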
We make a similar plot for the cities where there were 10 or more Police killings in 2015. This time we have replaced the Incidences of Violence by the Population (proper) of the city and we still observe a similar correlation, which is unsurprising as the incidences of violent crime are also usually proportional to the city population. New York City was even more of an outlier than the state of New York with disproportionately low killings given its population. Next, we are going to put some maps on our plots to better visualize the locations.
Map methodology- For all the map visualizations, we will work with the large 2015–16 data-set and leave out killings in the states of Alaska and Hawaii, so we can zoom in geographically to see finer details. Despite leaving out those 2 states, we are still covering 99% of the killings in the US so we are not missing much information. For our maps, we are going to utilize the “ggplot2” library in R. While these plots can also be done using the “ggmap” library, I used a combination of “ggplot2” and “sf” libraries. If you are an aspiring data-scientist, I would strongly encourage learning to make good use of these R libraries.
For the first map, we are looking at all police killings during 2015–16, with the shade of the dot indicating the age of the deceased. If you look at the dots in the Kentucky, Tennessee, Virginia, North Carolina area, it seems that those killed there were slightly older than those killed elsewhere. The second insight, which should be obvious to you if you are familiar with the US map, is that the killings were concentrated in the cities, with bigger cities accounting for higher concentrations. No big surprise there. If you see the US population heat-map below, you should notice that these 2 plots look somewhat similar.
Now let’s take a look at a smaller subset of the data which consists only of individuals who were killed when they were unarmed. That would be 389 people, or roughly 17.5% of the total people killed in those 2 years.
If the data is representative of what’s still happening now, then it seems that the killings of unarmed individuals are also a wide-spread problem throughout the US. Gunshots seemed to be the leading cause of death here as well (gunshots are responsible for over 90% of the killings in 2015–16), with death in custody being the second leading cause of death. Now let’s investigate how race/ethnicity factors into the picture.
The first map covers killings separated by all 8 categories of race/ethnicity available in the data-set. But because of the large number of groups, it is a little hard to follow. So we subset our data-set to only include those killed who were either “Black”, “White”, “Hispanic/Latino” or “Native American”. By doing so, we still retain more than 95% of the data-points and get cleaner visuals. It seems that Whites killed were spread throughout most of the country. Black people seem to have made up the lion’s share of the deceased near the crime-riddled areas of Baltimore, DC, and Chicago. Hispanics/Latinos understandably made up more of those killed in states with large Hispanic/Latino populations like California, New Mexico, and Texas, and also made up most of the deceased in towns along the Mexican border. Southern California seems to have had a heavy combination of the first 3 of the mentioned races/ethnicities. Native American deaths seem to have had the highest concentration near/at the Native American reservations of Arizona and New Mexico.
Analysis: Bar-plots and histograms
Staying on the topic of race/ethnicity, let’s take a quick look at how the different ethnicities and races differed by age.
Arab-Americans killed had the lowest median age and those classified as “Unknown” had the highest. But these are groups without a substantial number of data-points. So let’s focus on the 3 groups that made up most (about 94%) of the data-set: Blacks, Whites and Hispanics/Latinos. We can see there are noticeable differences here as well, so let’s go in deeper with some histograms.
The age of Hispanics/Latinos killed tended to be lower than that of Whites killed, and Blacks had the lowest median age out of the groups. The median ages for the killed among Whites, Hispanics/Latinos and Blacks were 38, 31 and 30 respectively. However, it is important to note that the median ages for these groups for the general US population are 42.9, 28.1 and 33.3, respectively. That at least explains why the group of Whites killed had the highest median age. The tail end of the graphs for Whites also extended a lot further into older ages, but it’s understandable that they had more outliers as they made up over half the data-set. It’s also worth noting that the median age for Hispanics/Latinos in the US is driven down by the fact that a large portion of their community is 1st generation immigrants who tend to be younger than the US average age. They also tend to have more kids than the other 2 groups.
Staying on the topic of age, let’s look at how things differed by US region and narrow things down by a few more factors.
It’s hard to draw firm conclusions about Age and Region from this plot, as a box-plot isn’t meant to show the number of instances and focuses rather on the breakdown by percentiles. But it does appear (as is the case if you examine the numbers) that the ages of those killed in the Midwest and West were lower than the ages of those killed in the Northeast and the South. The ages of those killed who possessed knives were higher than those of their counterparts in all regions (assuming we leave out the small subset of people who were armed with a Vehicle, “other”, or where the cases were/are “disputed”).
Gunshots seemed to be the cause of death in more than 90% of the cases, or 2021/2226 cases to be exact. The median age of those killed in custody (which is the 2nd leading cause of death at 82 killings) was slightly higher than that of those killed by gunshots in all but the Midwest region. It is interesting that the cases where the deceased were struck by vehicles involve older people, based on the upper quartile of the box-plots. While they did make up a small subset of the data, these cases were significant in quantity (49 in total and at least 9 in each of the 4 regions).
Let’s move on to breakdowns by income from the additional information we found on the FiveThirtyEight data-set. Note that this is only information from fewer than 500 people who were killed between January-June 2015. First, let’s look at incomes.
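The numbers behind a box-plot are just quartiles computed per group. Here is a minimal Python sketch using the standard library's `statistics.quantiles` on invented ages. Note that R's box-plots use a slightly different quartile definition (Tukey's hinges), so values from the project's R plots can differ a little from what this method returns.

```python
from collections import defaultdict
from statistics import quantiles


def quartiles_by(rows, group_key, value_key="age"):
    """Return {group: (Q1, median, Q3)} for groups with at least 2 known values."""
    groups = defaultdict(list)
    for row in rows:
        if row.get(value_key) is not None:
            groups[row[group_key]].append(row[value_key])
    return {
        group: tuple(quantiles(values, n=4))
        for group, values in groups.items()
        if len(values) >= 2
    }


# Invented ages, for illustration only.
rows = [{"region": "West", "age": a} for a in (20, 30, 40, 50, 60, 70, 80)]
rows += [{"region": "South", "age": a} for a in (25, 35, 45)]
rows.append({"region": "South", "age": None})

for region, (q1, median, q3) in quartiles_by(rows, "region").items():
    print(region, q1, median, q3)
```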
The important thing to note here is that these incomes were not those of the individuals who were killed, but rather of the people who reside at the location where the killings occurred. In the words of the source, “Census tracts were determined by geocoding addresses to latitude/longitude using the Bing Maps and Google Maps APIs and then overlaying points onto 2014 census tracts”. However, we do see patterns that suggest a sizable number of the deceased were killed in counties/cities/tracts that are representative of the deceased’s ethnic background. In all income categories, killings of Native Americans occurred in the areas with the lowest incomes, while Asian Americans were on the other end of the spectrum. As per Wikipedia, the median 2018 household income per race/ethnicity is as follows: Asian- 80,720, White- 61,349, Native Hawaiian and Other Pacific Islander- 57,112, Hispanic or Latino (of any race)- 46,882, American Indian and Alaska Native- 39,719, Black or African American- 38,555. Interestingly, it seems that Native Americans fall in a higher income bracket than Blacks. But according to the data from the killings, Native Americans got killed in lower income areas than Blacks did. Among many other things, this could suggest that either a lot of black people got killed in neighborhoods without predominantly black families or that the national incomes of Native American households were boosted when they are categorized in the same group as Native Alaskans. To understand it better we would be deviating towards data that’s beyond the scope of this project, so I will leave that task to those of you who are curious. It is worth noting that the categorical median incomes we see are all significantly lower than the national median income for any race/ethnicity. For our data-set, the killings occurred in areas where the median household income is $42,759. The median income of a 2018 household in the US is $57,112.
Fewer than 24% of the police killings in this data-set occurred in Census tracts where the median household income was higher than the national median. If this data is representative of US police killings in general, the killings are more likely to occur in poorer areas. Let’s move on to poverty rates in these tracts.
A pattern similar to that for income is seen with poverty rates in the tracts, with Native Americans being killed in places with the highest poverty. The poverty rate on Native American reservations is 28.4%, compared with 22% among all Native Americans. Both numbers are much higher than the 12.7% national poverty rate. Now let’s move on to the racial/ethnic demographics of the places where the killings occurred.
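The share of killings in tracts above the national median is a simple proportion over the tract-level household incomes. A Python sketch of that calculation is shown here with hypothetical income values; `NATIONAL_MEDIAN` uses the national figure quoted in this article, and tracts with missing income are left out of the denominator, which is an assumption about how missing values were handled.

```python
def share_above(values, threshold):
    """Fraction of the known values that are strictly greater than threshold."""
    known = [v for v in values if v is not None]
    return sum(v > threshold for v in known) / len(known)


# Hypothetical tract-level median household incomes (h_income); None = missing.
tract_incomes = [28000, 41000, None, 52000, 66000, 39000]
NATIONAL_MEDIAN = 57112  # national figure quoted in the article

print(f"{share_above(tract_incomes, NATIONAL_MEDIAN):.0%}")
```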
More than 75% of the deceased Whites were killed in neighborhoods that are more than 50% non-Hispanic White. Blacks and Hispanics/Latinos were also reported to be killed in neighborhoods where people from their respective racial/ethnic backgrounds constituted a relatively larger share of the population than what you find in the average American neighborhood.
This one is a little more interesting. It looks like Native Americans, for some reason, were reported to be killed in tracts with better educated populations than their counterparts. What’s even more interesting is that Blacks and Hispanics are 2nd and 3rd on this list respectively, if we look at the median college education level of the tracts where they were killed. I personally cannot think of good hypotheses to explain this phenomenon. If you have an answer, feel free to let me know.
Now let’s move away from the demographics of the region and go back to our original 2015–16 data-set for some stacked bar-plots. In the next 2 plots, we will explore a feature we haven’t looked at yet, Age-group.
Before discussing these 2 plots, we should note that 38 of the deceased were reported to be under the age of 18, while the other 2 age-groups had over 1000 data-points each. Because of the much smaller sample-size of data for the under-18 group, inferences from this group are less likely to be representative of what happens in the overall population than inferences from the other 2 age-groups. That being said, let’s take a look at the graphs. It intuitively makes sense that the oldest group had the highest percentage of white people. The percentage of white people tends to be higher in older age-groups in the US. The second one is quite interesting. Out of the 3 age-groups, the minors killed had the lowest percentage of cases where they were armed with either a firearm or a knife. The under-18 group also had the highest percentage (26.3%) of cases where those killed were unarmed. Note that there isn’t enough information here to conclude that these statistics are reflective of the general population. But if they are found to be representative with further analysis, there should be cause for concern.
In the above plot, we see the “armed” status of those killed by different regions. It’s worth noting that an alarmingly high percentage of those killed in every region were unarmed, with values ranging from 16.9% to 18.5% between the different regions.
Gunshots seemingly caused the overwhelming majority of deaths for each race. The second leading cause of death seems a bit more interesting. The rate of death in custody was lowest for Whites (not counting the sub-set of 7 Arab-Americans killed) and highest for Native Americans. This next plot is interesting as well.
If we assume that our data represents the general trends for police killings today, then Black people who are killed are more likely to be armed with a firearm (49.1% were in this case) than those of any other racial/ethnic group. It would also imply that they are less likely than any other racial/ethnic group to be armed with a knife when killed. More interestingly, if we exclude the 7 Arab-Americans, this would imply that black people are more likely than any other racial/ethnic group to be unarmed when killed (20.9% of black people killed in 2015–16 were unarmed). These insights, if true, can relate to the highly debated discussion of whether stereotypes associating black people with having firearms lead to more killings of unarmed black people. And this makes for a perfect segue to the next section of our project.
Discussion- understanding racial bias
To provide numbers to the plot above, here is a racial/ethnic summary of the overall number of killings for 2015–16:
2015–16 police killings by race/ethnicity of the deceased:
-Arab-American- 7
-Asian/Pacific Islander- 45
-Black- 568
-Hispanic/Latino- 378
-Native American- 34
-Other- 1
-Unknown- 43
-White- 1150
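The counts listed above can be turned into percentage shares with a few lines of Python. The counts are copied from the list; the code itself is only an illustrative sketch and not part of the project's R scripts.

```python
# 2015-16 police killings by race/ethnicity, copied from the list above.
counts = {
    "Arab-American": 7,
    "Asian/Pacific Islander": 45,
    "Black": 568,
    "Hispanic/Latino": 378,
    "Native American": 34,
    "Other": 1,
    "Unknown": 43,
    "White": 1150,
}

total = sum(counts.values())
shares = {group: 100 * n / total for group, n in counts.items()}

for group, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{group}: {share:.1f}%")
```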
You may notice that black people made up a disproportionately large share (more than 25%) of those killed, while comprising only 12% of the overall American population. However, trying to find meaningful conclusions purely based on these statistics would mean we are ignoring the confounding factors behind them. According to this FBI report, the homicide rate for black males is 8 times higher than that for white males. A novice at statistics may be tempted to prematurely conclude that higher crime rates are the explanation for black people getting killed at a higher rate. But again, to reach conclusions about a race/ethnicity based on snippets of data like this would not only be incorrect, but also dangerously ignorant. We would be breaking the Golden Rule of Observational Studies: Correlation does not equal Causation. A deeper level of confounding factors such as poverty rate, educational background, and socioeconomic status could arguably predict incidences of violent crime among groups much better.
*Read this if you are not sure what a confounding factor is. It’s a variable that affects both the independent and dependent variable in a study, and leads researchers to wrong conclusions if it’s not balanced for. Ex: Smokers are likely to drink coffee, and also likely to develop pancreatic cancer. But if the effects of smoking aren’t eliminated from a study investigating the effects of drinking coffee, we may wrongfully conclude that drinking coffee causes pancreatic cancer*
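To see the coffee example in numbers, here is a small Python sketch with invented counts, chosen so that coffee has no effect at all within either smoking group. The crude comparison still makes coffee look harmful, purely because coffee drinkers are more often smokers.

```python
# (is_smoker, drinks_coffee) -> (people, cancer_cases); invented counts.
COHORT = {
    (True, True): (800, 80),
    (True, False): (200, 20),
    (False, True): (200, 2),
    (False, False): (800, 8),
}


def risk(drinks_coffee, is_smoker=None):
    """Cancer risk for coffee drinkers or non-drinkers, optionally within one smoking group."""
    cells = [
        counts
        for (smoker, coffee), counts in COHORT.items()
        if coffee == drinks_coffee and (is_smoker is None or smoker == is_smoker)
    ]
    people = sum(p for p, _ in cells)
    cases = sum(c for _, c in cells)
    return cases / people


def risk_ratio(is_smoker=None):
    """Risk among coffee drinkers divided by risk among non-drinkers."""
    return risk(True, is_smoker) / risk(False, is_smoker)


crude = risk_ratio()
among_smokers = risk_ratio(True)
among_non_smokers = risk_ratio(False)
print(round(crude, 2), round(among_smokers, 2), round(among_non_smokers, 2))
```

Comparing within each smoking group is the simplest way of accounting for the confounder; with real data the groups are rarely this clean.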
If we had more demographic information on the individuals that were killed, we would be able to balance for several confounding factors and reach better conclusions. But we don’t. Which leads us to face the biggest obstacle in the path to having statistically conclusive answers about policing and racial bias: the lack of data. The problem with not having a system in place that comprehensively tracks all police interactions is that we only get data on interactions that lead to killings, and even then we get incomplete data. If we had data on how many total interactions the police have with people from different races/ethnicities and what percentage of them lead to violence/killings, we would be better equipped to spot police biases. Right now, data for such instances is only available when the individual department that has the data chooses to release it. And those departments that choose to do so may not be representative at all of the average US police department. As you may have heard recently, the FBI is planning to start collecting data on all cases involving use of serious violence by the police from 2019. To an extent, this change will allow for better analysis. But having all that data from the police departments may still not give us the straightforward answers we are looking for. That is because these are not answers we can find conclusively through randomized trials in a lab setting. In observational studies, we have to make sure that we neutralize all confounding factors such as the ones I referred to earlier. And this is not easy to do, as there could be confounding factors that we haven’t even thought about. In the words of Keith Payne, a psychologist and neuroscientist who has researched police bias, “It is very difficult to empirically separate the effects of poverty, geography, race, race bias and policing tactics”.
Conclusion
While the complexities mentioned in the last paragraph may seem a little discouraging for those of us trying to find conclusions about this troubling issue, there is reason to be hopeful. Even with the current limitations, we are able to develop important insights into the issue (like ones we came up with in this project). I will conclude this article with a final thought. I believe there are only 2 things separating us from having all the right answers. The first is enough data, and we are slowly but surely getting there. The second is having more curious people who are willing to look at this data and ask the right questions. And this group would probably consist of people like you, yes you, if you have made it this far into the article.
Thank you for reading so far. I hope this article taught you something new, be it about Exploratory Data Analysis in R or about the issue of US Police violence. If you think this may be interesting or educational to others you know, please do share it with them. If you liked the article, and had something you wanted to share with me, feel free to comment, contact me via email at nadir.nibras@gmail.com or at https://www.linkedin.com/in/nadirnibras/.
I am committed to improving my methods, analyses or data-sets, so if you have any suggestions, feel free to comment or let me know otherwise. If you want to follow more of my work on data science, follow me on Medium and LinkedIn.
Also, if you are a fan of Data Science or its applications in understanding the world, please do connect with me. It’s always fun talking to fellow stats nerds and I would love to collaborate on fun projects :)