Arrest Risk Prediction, Chicago IL

Introduction

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(spatstat)
library(raster)
library(spdep)
library(FNN)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(mapview)
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

paletteGray <- c("gray90", "gray70", "gray50", "gray30", "gray10")

Over the past months, high profile cases of police brutality and the civil unrest which followed have brought the injustice in American policing to the public consciousness. Poor communities of color are policed disproportionately, resulting in more arrests and the appearance of being more crime-ridden. This cycle reinforces itself, and creates communities which are chronically over-policed and under-resourced.

In this analysis, I predict hotspots for arrest risk in Chicago, Illinois. In doing so, I identify communities which can benefit from alternatives to policing: social services, educational resources, housing aid, and more. Using this model’s predictions, I offer the Chicago city government recommendations for allocating resources citywide and reducing the overpolicing of disadvantaged neighborhoods.

Methods and Data

This analysis uses open sourced data from the City of Chicago’s Open Data Portal and the US Census Bureau’s American Community Survey. From these sources, I wrangle data and design features for risk factors involving income, education, age, and demographics. Using a fishnet grid, I break the city into square cells to design a model which predicts the risk of an arrest occurring in each cell based on its risk factor data.

# READ UNIT/AREA BOUNDARIES
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
  
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = beat_num)

bothPoliceUnits <- rbind(mutate(policeDistricts, Legend = "Police Districts"), 
                         mutate(policeBeats, Legend = "Police Beats"))

chicagoBound <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 

fishnet <- 
  st_make_grid(chicagoBound, cellsize = 500) %>%
  st_sf() %>%
  mutate(uniqueID = rownames(.))



# READ RISK FACTORS
abandonCars <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Abandoned-Vehicles/3c9v-pnva") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Cars")

abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%  filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Buildings")

graffiti <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Graffiti-Removal-Historical/hec5-y4x5") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    filter(where_is_the_graffiti_located_ %in% c("Front", "Rear", "Side")) %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Graffiti")

streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Street_Lights_Out")

sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Sanitation")

liquorRetail <- 
  read.socrata("https://data.cityofchicago.org/resource/nrmj-3kcf.json") %>%  
    filter(business_activity == "Retail Sales of Packaged Liquor") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Liquor_Retail")

busStops <-
  st_read("/Users/annaduan/Documents/GitHub/Project-3-Risk-Prediction/CTA_BusStops/CTA_BusStops.shp") %>%
        st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
  dplyr::select(Y = POINT_Y, X = POINT_X) %>%
    na.omit() %>%
    mutate(Legend = "Bus_Stops") %>%
  dplyr::select(-X, -Y)
#https://data.cityofchicago.org/Transportation/CTA-Bus-Stops-Shapefile/pxug-u72f

#additional features
buildingViolation <- 
  read.socrata("https://data.cityofchicago.org/resource/22u3-xenr.json") %>%  
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Building_Violation")

affordableRentHousing <- 
    read.socrata("https://data.cityofchicago.org/resource/s6ha-ppgi.json") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Affordable_Rental_Housing")

alleyLightsOut <- 
    read.socrata("https://data.cityofchicago.org/resource/t28b-ys7j.json") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Alley_Lights_Out")

garbageOpen <- 
    read.socrata("https://data.cityofchicago.org/resource/9ksk-na4q.json") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Garbage_Open")

policeStations <- 
    read.socrata("https://data.cityofchicago.org/resource/z8bn-74gv.json") %>%
  dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Police_Stations")



# READ CENSUS DATA
census_api_key("d9ebfd04caa0138647fbacd94c657cdecbf705e9", install = TRUE, overwrite = TRUE)
#row 1: vacant, total housing units, mhhinc
#row 2: white, population, renter occ, owner occ, #no HS degree
#row 3: male adults, female adults, poverty level, youth unemployed (18-34, veteran), youth unemployed (non veteran)
#row 4-5: male 15-17, male 18-19, male 20, male 21, male 22-24, male 25-29, male 30-34
#row 6-7: female 18-34
acs <-
  get_acs(geography = "tract", variables = c("B25002_003E", "B25001_001E", "B19013_001E", "B01001A_001E", "B01003_001E", "B07013_002E", "B07013_003E", "B06009_002E", 
"B05003_008E", "B05003_019E", "B06012_002", "B21005_006E", "B21005_011E", 
"B01001_006E", "B01001_007E", "B01001_008E", "B01001_009E","B01001_010E", "B01001_011E", "B01001_012E",
"B01001_031E", "B01001_032E", "B01001_033E", "B01001_034E", "B01001_035E", "B01001_036E"
), year=2018, state=17, county=031, geometry=T) %>%
  st_transform(st_crs(fishnet))
acs <-         #filter for chicago tracts
  rbind(
    st_centroid(acs)[chicagoBound,] %>%
      st_drop_geometry() %>%
      left_join(acs) %>%
      st_sf() %>%
      mutate(inChicago = "YES"),
    st_centroid(acs)[chicagoBound, op = st_disjoint] %>%
      st_drop_geometry() %>%
      left_join(acs) %>%
      st_sf() %>%
      mutate(inChicago = "NO")) %>%
  filter(inChicago == "YES") %>%
  dplyr::select(-inChicago)
#long to wide form
acs <-
  acs %>%
  dplyr::select(-moe, -GEOID) %>%
  spread(variable, estimate) %>%
  dplyr::select(-geometry) %>%
  rename(vacantUnits = B25002_003,
         totalUnits = B25001_001,
         medHHInc = B19013_001,
         white = B01001A_001,
         population = B01003_001,
         ownerOcc = B07013_002,
         renterOcc = B07013_003,
         noHsDegree = B06009_002,
         maleAdult = B05003_008,
         femaleAdult = B05003_019,
         poverty = B06012_002,
         youthUnempVet = B21005_006,
         youthUnempNonVet = B21005_011,
         male1517 = B01001_006,
         male1819 = B01001_007,
         male20 = B01001_008,
         male21 = B01001_009,
         male2224 = B01001_010,
         male2529 = B01001_011,
         male3034 = B01001_012,
         female1819 = B01001_031,
         female20 = B01001_032,
         female21 = B01001_033,
         female2224 = B01001_034,
         female2529 = B01001_035,
         female3034 = B01001_036)
acs <- 
  acs %>%
  mutate(pctVacant = ifelse(totalUnits > 0, vacantUnits / totalUnits, 0),
         pctWhite = ifelse(population > 0, white / population, 0), 
         pctRenterOcc = renterOcc/ (renterOcc + ownerOcc),
         pctNoHS = noHsDegree/ (maleAdult + femaleAdult),
         pctPoverty = ifelse(population > 0, poverty / population, 0),
         youthUnemploy = (youthUnempVet + youthUnempNonVet) / (male1819 + male20 + male21 + male2224 + male2529 + male3034 + female1819 + female20 + female21 + female2224 + female2529 + female3034),
         pctMaleYouth = ifelse(population > 0, (male1517 + male1819 + male20 + male21 + male2224 + male2529 + male3034) / population, 0)) %>%
  dplyr::select(-totalUnits,-vacantUnits,-white,-renterOcc,-ownerOcc, -noHsDegree, -maleAdult, -femaleAdult, -youthUnempVet, -youthUnempNonVet, -male1517, -male1819, -male20, -male21, -male2224, -male2529, -male3034, -female1819, -female20, -female21, -female2224, -female2529, -female3034, -poverty)



# ATTACH VARIABLES TO FISHNET
vars_net <- 
  rbind(abandonCars,streetLightsOut,abandonBuildings,
        liquorRetail, graffiti, sanitation, busStops, buildingViolation, affordableRentHousing, alleyLightsOut, garbageOpen, policeStations) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
    full_join(fishnet) %>%
    spread(Legend, count, fill=0) %>%
    st_sf() %>%
    dplyr::select(-`<NA>`) %>%
    na.omit() %>%
    ungroup()
vars_net <- vars_net %>%
  st_join(., acs)



# MAKE VARS_NET MAPPABLE
vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()



# NEAREST NEIGHBOR FUNCTION
nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
    output <-
      as.data.frame(nn) %>%
      rownames_to_column(var = "thisPoint") %>%
      gather(points, point_distance, V1:ncol(.)) %>%
      arrange(as.numeric(thisPoint)) %>%
      group_by(thisPoint) %>%
      summarize(pointDistance = mean(point_distance)) %>%
      arrange(as.numeric(thisPoint)) %>% 
      dplyr::select(-thisPoint) %>%
      pull()
  return(output) 
}



# MAKE NN FEATURES
st_c <- st_coordinates
st_coid <- st_centroid

vars_net <-
  vars_net %>%
    mutate(
      Abandoned_Buildings.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3),
      Abandoned_Cars.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonCars),3),
      Graffiti.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(graffiti),3),
      Liquor_Retail.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(liquorRetail),3),
      Street_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3),
      Sanitation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(sanitation),3),
      Bus_stops.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(busStops),3),
      Building_Violation.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(buildingViolation),3),
      Affordable_Rental_Housing.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(affordableRentHousing),3),
      Alley_Lights_Out.nn = 
        nn_function(st_c(st_coid(vars_net)), st_c(alleyLightsOut),3),
      Garbage_Open.nn = 
        nn_function(st_c(st_coid(vars_net)), st_c(garbageOpen),3),
      Police_Stations.nn = 
        nn_function(st_c(st_coid(vars_net)), st_c(policeStations),3))

vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

vars <- unique(vars_net.long.nn$Variable)
mapList <- list()



# READ DIST TO DOWNTOWN
neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet)) 

loopPoint <- neighborhoods %>%
  filter(name == "Loop") %>%
  st_centroid()

vars_net$loopDistance =
  st_distance(st_centroid(vars_net),loopPoint) %>%
  as.numeric()

Exploratory Analysis

Arrests

Figure 1 shows the arrests observed in Chicago during 2017.

# READ CRIME
arrests <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2017/d62x-nvdr?Arrest=true") %>% 
    mutate(x = gsub("[()]", "", Location)) %>%
    separate(x,into= c("Y","X"), sep=",") %>%
    mutate(X = as.numeric(X),Y = as.numeric(Y)) %>% 
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant")%>%
    st_transform('ESRI:102271') %>% 
    distinct() %>%
  st_intersection(chicagoBound)

crime_net <- 
  dplyr::select(arrests) %>% 
  mutate(countArrests = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countArrests = replace_na(countArrests, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))



# Map 1: arrests in 2017
ggplot() +
  geom_sf(data = chicagoBound, fill = "black") +
  geom_sf(data = arrests, aes(colour = Id), size = 0.1) +
  scale_colour_viridis(name = "1 Observed Arrest", labels = "", option = "B") +
  labs(title = "Figure 1: Observed Arrests, 2017", subtitle = "Chicago, IL") +
  mapTheme()

Figure 2 presents this data in a fishnet, which is the scale we will be working at. Arrests were most common in the downtown area, the area west of it, and the South Side.

ggplot() +
  geom_sf(data = crime_net, aes(fill = countArrests, colour = countArrests)) +
  scale_fill_viridis(option = "A", name = "Arrests Per Cell") +
  scale_colour_viridis(option = "A", name = "Arrests Per Cell") +
  labs(title = "Figure 2: Observed Arrests Joined to Fishnet, 2017", subtitle = "Chicago, IL") +
  mapTheme()

Figure 3 shows us that 2017 observed arrests are unevenly distributed across the fishnet. A small portion of cells accounts for the bulk of the arrests. This suggests that unfavorable conditions in some areas increase their risk of arrests.

# FINAL NET: COMBINE ARRESTS AND RISK FACTORS
final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID") 

final_net <- #spatial join arrests and risk factors
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
    st_join(dplyr::select(policeDistricts, District)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()


ggplot(data = final_net, aes(x=countArrests)) +
  geom_histogram() +
  labs(title = "Figure 3: Distribution of Observed Arrests, 2017", subtitle = "Chicago, IL", x = "Fishnet Cells", y = "Arrests Observed") +
  plotTheme()

Risk Factors

To explore this, I select risk factors which are believed to correlate with crime and arrests. These include signs of a disorderly physical environment, unemployment, and low education attainment. I map these factors in Figure 4, where we see that percent renter occupancy, building violations, and percent poverty have similar distributions to observed arrests, suggesting that they may be predictors of arrest risk.

#get rid of NAME variable (census tract name), select preferred features
vars_net_noname <- subset(vars_net, select = -c(NAME)) %>%
  dplyr::select(pctWhite, Abandoned_Buildings.nn, Affordable_Rental_Housing.nn, medHHInc, Sanitation.nn, Police_Stations.nn, Liquor_Retail.nn, Abandoned_Cars.nn, Graffiti.nn, pctNoHS, youthUnemploy, Alley_Lights_Out, Bus_Stops, Street_Lights_Out, pctRenterOcc, pctPoverty, pctVacant, Garbage_Open, Building_Violation)

#make vars_net.long to use for mapping
vars_net.long <-
   gather(vars_net_noname, Variable, value, -geometry)

#select features to show in multiple map
vars_net.long.select <-
   gather(vars_net_noname, Variable, value, -geometry) %>%
  dplyr::filter(Variable == "Abandoned_Buildings.nn" | Variable == "Affordable_Rental_Housing.nn"| Variable == "medHHInc"| Variable == "Street_Lights_Out" | Variable == "pctRenterOcc" | Variable == "pctPoverty" | Variable == "pctVacant" | Variable == "Garbage_Open" | Variable == "Building_Violation")

#make list of unique variables
vars <- unique(vars_net.long.select$Variable)
mapList <- list()

#map risk factors
for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(option = "A", name="") +
      labs(title=i) +
      mapTheme() +
    theme(plot.title = element_text(size=10))}

do.call(grid.arrange,c(mapList, ncol = 3, top = "Figure 4: Risk Factors by Fishnet"))

Arrest Hotspots

Next, we need to know whether, and where, observed arrests cluster spatially. Using the Local Moran’s I statistic, I locate areas in Chicago where the clustering of arrests is greater than would be expected with random distribution. In Figure 5 we see that only some areas with high observed arrests have high clustering (high Moran’s I), namely the area west of Downtown. Of these areas, only some have a high level of confidence (low P Value) in their Moran’s I. Areas with high arrest counts, clustering, and confidence are likely to be identified as Chicago’s arrest risk hotspots.

With the hotspots identified, we can engineer two new features for our model: arrests.isSig tells whether a cell is in a hotspot, and arrests.isSig.dist is the distance from a cell to its nearest hotspot. These features allow us to factor in the spatial clustering of arrests.

#assign weights at the grid cell level
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

#add Moran's I and P value to final_net
final_net.localMorans <- 
  cbind(
    as.data.frame(localmoran(final_net$countArrests, final_net.weights)),
    as.data.frame(final_net)) %>% 
    st_sf() %>%
      dplyr::select(Arrest_Count = countArrests, 
                    Local_Morans_I = Ii, 
                    Morans_P_Value = `Pr(z > 0)`) %>%
      mutate(Sig_Hotspots = ifelse(Morans_P_Value <= 0.0000001, 1, 0)) %>% #change P value
      gather(Variable, Value, -geometry)

#get unique values for mapping
vars <- unique(final_net.localMorans$Variable)
varList <- list()

#map moran's I and P Value
for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="", option = "B") +
      labs(title=i) +
      mapTheme() + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList, ncol = 4, top = "Figure 5: Local Moran's I Statistics, Observed Arrests"))

#feature engineering: is hotspot? + dist to hotspot
final_net <-
  final_net %>% 
  mutate(arrests.isSig = 
           ifelse(localmoran(final_net$countArrests, 
                             final_net.weights)[,5] <= 0.0000001, 1, 0)) %>%
  mutate(arrests.isSig.dist = 
           nn_function(st_coordinates(st_centroid(final_net)),
                       st_coordinates(st_centroid(
                         filter(final_net, arrests.isSig == 1))), 1))

Correlations with Arrest Risk

Now with our complete set of risk factor features, we explore the correlation between the features and observed arrests. Figure 6 visualizes some of these correlations. As expected, physical signs of disorder such as open garbage, building violations, and vacant properties correlate strongly with arrests. These features will be used in our model. Surprisingly, youth unemployment, graffiti, and street lights out have weak correlations with arrests despite the ample literature suggesting their link with crime.

correlation.long <-
  st_drop_geometry(final_net) %>%
    dplyr::select(-uniqueID, -cvID, -name, -District, -NAME, -Building_Violation.nn, -Abandoned_Buildings, -Garbage_Open.nn, -Street_Lights_Out.nn, -Bus_stops.nn, -Sanitation, -Alley_Lights_Out.nn, -Liquor_Retail, -Affordable_Rental_Housing, -Abandoned_Cars, -Police_Stations, -Graffiti, -pctMaleYouth) %>%
    gather(Variable, Value, -countArrests) %>%
  mutate(Value = as.numeric(Value))

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, countArrests, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, countArrests)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Figure 6: Observed Arrests as a Function of Risk Factors", subtitle = "Chicago, IL") +
  plotTheme()

Regression Model Design

Using the scatterplots and a calculation of each feature’s correlation, I design several sets of model features. For comparison, I make one that does not include spatial process features (reg.vars).

reg.vars <- c("pctWhite", "Abandoned_Buildings.nn", "Affordable_Rental_Housing.nn", "medHHInc", "Sanitation.nn", "Police_Stations.nn", "Liquor_Retail.nn", "Abandoned_Cars.nn", "Graffiti.nn", "population", "pctNoHS", "youthUnemploy", "Alley_Lights_Out", "Bus_Stops", "Street_Lights_Out", "pctRenterOcc", "pctPoverty", "pctVacant", "Garbage_Open", "Building_Violation")

#reg.ss.vars <- c("arrests.isSig", "arrests.isSig.dist", "Abandoned_Buildings", "pctVacant", "pctPoverty", "pctRenterOcc", "Street_Lights_Out", "youthUnemploy", "pctNoHS", "pctWhite", "medHHInc", "Sanitation.nn", "Liquor_Retail.nn", "Abandoned_Cars.nn", "Graffiti.nn", "countArrests", "loopDistance")

#reg.ss.vars1 <- c("arrests.isSig", "arrests.isSig.dist", "Abandoned_Buildings", "pctVacant", "pctPoverty", "pctRenterOcc", "Street_Lights_Out", "youthUnemploy", "pctWhite", "medHHInc", "Sanitation.nn", "Liquor_Retail.nn", "Abandoned_Cars.nn", "countArrests", "loopDistance")

#reg.ss.vars2 <- c("arrests.isSig", "Abandoned_Buildings", "pctVacant", "pctPoverty", "pctRenterOcc", "Street_Lights_Out", "youthUnemploy", "pctWhite", "medHHInc", "Sanitation.nn", "Liquor_Retail.nn", "Abandoned_Cars.nn", "countArrests", "loopDistance")

reg.ss.vars3 <- c("pctWhite", "Abandoned_Buildings.nn", "Affordable_Rental_Housing.nn", "medHHInc", "Sanitation.nn", "Police_Stations.nn", "Liquor_Retail.nn", "Abandoned_Cars.nn", "Graffiti.nn", "population", "pctNoHS", "youthUnemploy", "Alley_Lights_Out", "Bus_Stops", "Street_Lights_Out", "pctRenterOcc", "pctPoverty", "pctVacant", "Garbage_Open", "Building_Violation", "arrests.isSig", "arrests.isSig.dist")

Following are estimations of four regressions. Two use only risk factors and two incorporate Moran’s I spatial process features (arrests.isSig and arrests.isSig.dist). These are cross validated with Leave One Group Out cross validation (LOGO-CV), which is a form of spatial cross validation that trains the model on n-1 areas, tests its goodness of fit on the excluded area, and repeats for all areas (in this case, neighborhoods). The models are also cross-validated with random K-Folds, a non-spatial test, for comparison.

Figure 7 suggests that LOGO-CV has higher errors than Random K-Fold cross validation, and that incorporating spatial process features reduces error.

#crossValidate function
crossValidate <- function(dataset, id, dependentVariable, indVariables) {

allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])

for (i in cvID_list) {

  thisFold <- i
  cat("This hold out fold is", thisFold, "\n")

  fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
                dplyr::select(id, geometry, indVariables, dependentVariable)
  fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
                dplyr::select(id, geometry, indVariables, dependentVariable)
  
  regression <-
    glm(countArrests ~ ., family = "poisson", 
      data = fold.train %>% 
      dplyr::select(-geometry, -id))
  
  thisPrediction <- 
    mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
  allPredictions <-
    rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

#Regressions
reg.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countArrests",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = cvID, countArrests, Prediction, geometry)

reg.ss.cv <- crossValidate(
  dataset = final_net,
  id = "cvID",
  dependentVariable = "countArrests",
  indVariables = reg.ss.vars3) %>%
    dplyr::select(cvID = cvID, countArrests, Prediction, geometry)
  
reg.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countArrests",
  indVariables = reg.vars) %>%
    dplyr::select(cvID = name, countArrests, Prediction, geometry)

reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",
  dependentVariable = "countArrests",
  indVariables = reg.ss.vars3) %>%
    dplyr::select(cvID = name, countArrests, Prediction, geometry)


#summary of regression results
reg.summary <- 
  rbind(
    mutate(reg.cv,           Error = Prediction - countArrests,
                             Regression = "Random k-fold CV: Just Risk Factors"),
                             
    mutate(reg.ss.cv,        Error = Prediction - countArrests,
                             Regression = "Random k-fold CV: Spatial Process"),
    
    mutate(reg.spatialCV,    Error = Prediction - countArrests,
                             Regression = "Spatial LOGO-CV: Just Risk Factors"),
                             
    mutate(reg.ss.spatialCV, Error = Prediction - countArrests,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>%
    st_sf() 


error_by_reg_and_fold <- 
  reg.summary %>%
    group_by(Regression, cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countArrests, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

  ggplot() +
  geom_sf(data = error_by_reg_and_fold, aes(fill = MAE, colour = MAE)) +
        scale_fill_viridis(option = "B") +
    scale_colour_viridis(option = "B") +
 facet_wrap(~Regression) +  
      labs(title="Figure 7: Spatial Distribution of Mean Average Error", subtitle = "K-fold Cross Validation vs. LOGO-CV") +
  mapTheme()

Figure 8 gives us additional important information: across all models, most samples have low error but a few outliers have very high error. This may mean that our model does not generalize well citywide.

error_by_reg_and_fold %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="black", fill = "yellow") +
    facet_wrap(~Regression) +  
    geom_vline(xintercept = 0) + scale_x_continuous(breaks = seq(0, 8, by = 1)) + 
    labs(title="Figure 8: Distribution of Mean Average Error", subtitle = "K-Fold cross validation vs. LOGO-CV",
         x="Mean Absolute Error", y="Count") +
  plotTheme()

Evaluating the Model

Error Across Models

Table 1 shows us that Random K-Fold cross validation and spatial process features indeed make the most accurate model, with a mean average error of 2.97. For reference, the average number of observed arrests in each cell is 19.22, and the average prediction across models is 19.23 arrests.

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable(caption = "Table 1: MAE and SD MAE by Regression Type") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "Yellow") %>%
    row_spec(4, color = "black", background = "Yellow")

Table 1: MAE and SD MAE by Regression Type
Regression	Mean_MAE	SD_MAE
Random k-fold CV: Just Risk Factors	3.14	3.53
Random k-fold CV: Spatial Process	2.99	3.26
Spatial LOGO-CV: Just Risk Factors	4.74	5.92
Spatial LOGO-CV: Spatial Process	4.04	5.01

Does the model generalize across neighborhoods?

To further test our model, we calculate a Global Moran’s I statistic using neighborhood weights to see if its spatial process features properly account for variation in arrests across neighborhoods. A perfect score of 0 would mean that model errors exhibit no spatial clustering. Table 2 shows that our spatial process features do make the model generalizable across neighborhoods: the Moran’s I value is 0.1.

neighborhood.weights <-
  filter(error_by_reg_and_fold, Regression == "Spatial LOGO-CV: Spatial Process") %>%
    group_by(cvID) %>%
      poly2nb(as_Spatial(.), queen=TRUE) %>%
      nb2listw(., style="W", zero.policy=TRUE)

filter(error_by_reg_and_fold, str_detect(Regression, "LOGO"))  %>% 
    st_drop_geometry() %>%
    group_by(Regression) %>%
    summarize(Morans_I = moran.mc(abs(Mean_Error), neighborhood.weights, 
                                 nsim = 999, zero.policy = TRUE, 
                                 na.action=na.omit)[[1]],
              p_value = moran.mc(abs(Mean_Error), neighborhood.weights, 
                                 nsim = 999, zero.policy = TRUE, 
                                 na.action=na.omit)[[3]]) %>% #added here
   kable(caption = "Table 2: Moran's I and P Value by Regression") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(2, color = "black", background = "Yellow")

Table 2: Moran’s I and P Value by Regression
Regression	Morans_I	p_value
Spatial LOGO-CV: Just Risk Factors	0.1769553	0.008
Spatial LOGO-CV: Spatial Process	0.0998272	0.053

In Figure 9, we can see the effect of the spatial features. Compared to Just Risk Factors, our Spatial Process regression model predicts more precise risk hotspots, including some spots with no observed arrests. These spots are important because they may experience some of the same structural issues which contribute to observed arrests and can benefit from similar interventions.

#predictions
reg.summary %>%
  ggplot() +
  geom_sf(aes(fill = Prediction, colour = Prediction)) +
  scale_fill_viridis(name = "Predicted Arrests", option = "B") +
  scale_colour_viridis(name = "Predicted Arrests", option = "B") +
  facet_wrap(~Regression) +  
  labs(title = "Figure 9: Predicted Arrests By Regression", subtitle = "Chicago, IL") +
  mapTheme()

Comparing Figure 9 to 10, we see that our model predicts hotspots accurately.

#actual counts, for comparison
reg.ss.cv %>%
  ggplot() +
  geom_sf(aes(fill = countArrests, colour = countArrests)) +
  scale_fill_viridis(name = "Arrest Count", option = "B") +
  scale_colour_viridis(name = "Arrest Count", option = "B") +
  labs(title = "Figure 10: Observed Arrests, 2017", subtitle = "Chicago, IL") +
  mapTheme()

However, Figure 11 suggests that our model under-predicts arrest risk in neighborhoods with high observed arrests and over-predicts in ones with low arrest rates.

#comparing high arrest and low arrest areas
st_drop_geometry(reg.summary) %>%
  group_by(Regression) %>%
    mutate(arrests_Decile = ntile(countArrests, 10)) %>%
  group_by(Regression, arrests_Decile) %>%
    summarize(meanObserved = mean(countArrests, na.rm=T),
              meanPrediction = mean(Prediction, na.rm=T)) %>%
    gather(Variable, Value, -Regression, -arrests_Decile) %>%          
    ggplot(aes(arrests_Decile, Value, shape = Variable)) +
      geom_point(size = 2) + geom_path(aes(group = arrests_Decile), colour = "black") +
      scale_shape_manual(values = c(2, 17)) +
      facet_wrap(~Regression) + xlim(0,10) +
      labs(title = "Figure 11: Predicted and Observed Arrests by Observed Arrest Decile", subtitle = "Chicago, IL", x = "Observed Arrest Decile", y = "Arrest Count") +
  plotTheme()

Does the model generalize across racial contexts?

In addition to space, our model must be generalizable across different racial contexts. Figure 12 shows the racial context of Chicago. Immediately, it bears a ressemblance to the distribution of arrests: non-white neighborhoods have the most observed arrests.

raceContextACS <- acs %>%
  dplyr::select(pctWhite) %>%
  mutate(Race_Context = ifelse(pctWhite > .5, "Majority_White", "Majority_Non_White"))

ggplot() +
  geom_sf(data = raceContextACS, aes(fill = Race_Context)) +
  scale_fill_viridis(option = "B", discrete = TRUE, name = "Race Context") +
  labs(title = "Figure 12: Neighborhood Racial Context", subtitle = "Chicago, IL") +
  mapTheme()

To be generalizable, our model must produce similar errors in both majority white and majority non-white neighborhoods. This is not the case. The absolute mean error in both regressions is higher for non-white neighborhoods (Table 3). Additionally, our model under-predicts arrest risk in non-white neighborhoods and over-predicts in majority white neighborhoods. This flaw is problematic because it may cause under-allocation of interventions and resources in non-white neighborhoods.

reg.summary %>% 
  filter(str_detect(Regression, "LOGO")) %>%
    st_centroid() %>%
    st_join(raceContextACS) %>%
    na.omit() %>%
      st_drop_geometry() %>%
      group_by(Regression, Race_Context) %>%
      summarize(mean.Error = mean(Error, na.rm = T)) %>%
      spread(Race_Context, mean.Error) %>%
      kable(caption = "Table 3: Mean Error by Neighborhood Racial Context") %>%
        kable_styling("striped", full_width = F) %>%
      row_spec(2, color = "black", background = "Yellow")

Table 3: Mean Error by Neighborhood Racial Context
Regression	Majority_Non_White	Majority_White
Spatial LOGO-CV: Just Risk Factors	-0.2704655	0.2171782
Spatial LOGO-CV: Spatial Process	-0.2270573	0.2001028

How does this model compare to traditional crime hotspot algorithms?

Ultimately, an important metric for this model’s quality is whether it predicts arrest risk better than traditional policing algorithms. Following, I will compare its accuracy with that of a Kernel Density model which is commonly used to predict crime. Kernel Density uses past (2017) arrest locations to predict future (2018) arrest risk. Figure 13 shows a visual comparison of the two models: we can see that while the risk map generated by the Kernel Density model generally maps onto 2018 arrest points, our Risk Prediction model is far more accurate.

arrests_ppp <- as.ppp(st_coordinates(arrests), W = st_bbox(final_net))
arrests_KD <- spatstat::density.ppp(arrests_ppp, 1000)

#as.data.frame(arrests_KD) %>%
#  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
#  aggregate(., final_net, mean) %>%
#   ggplot() +
#     geom_sf(aes(fill=value, colour=value)) +
#     geom_sf(data = sample_n(arrests, 1500), size = .5) +
#     scale_fill_viridis(name = "Density") +
#    scale_colour_viridis(name = "Density") +
#     labs(title = "Figure 10: Kernel Density of Observed Arrests 2017", subtitle = "Chicago, IL") +
#     mapTheme()

#download 2018 arrests data
arrests18 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy?Arrest=true") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]

#compare kernel density vs risk prediction, overlay 2018 burglaries
arrests_ppp <- as.ppp(st_coordinates(arrests), W = st_bbox(final_net))
arrests_KD <- spatstat::density.ppp(arrests_ppp, 1000)

arrests_KDE_sf <- as.data.frame(arrests_KD) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
  mutate(label = "Kernel Density 2017",
         Risk_Category = ntile(value, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(arrests18) %>% mutate(arrestsCount = 1), ., sum) %>%
    mutate(arrestsCount = replace_na(arrestsCount, 0))) %>%
  dplyr::select(label, Risk_Category, arrestsCount)

arrests_risk_sf <-
  filter(reg.summary, Regression == "Spatial LOGO-CV: Spatial Process") %>%
  mutate(label = "Risk Predictions 2017",
         Risk_Category = ntile(Prediction, 100),
         Risk_Category = case_when(
           Risk_Category >= 90 ~ "90% to 100%",
           Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
           Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
           Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
           Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
  cbind(
    aggregate(
      dplyr::select(arrests18) %>% mutate(arrestsCount = 1), ., sum) %>%
      mutate(arrestsCount = replace_na(arrestsCount, 0))) %>%
  dplyr::select(label,Risk_Category, arrestsCount)

#Map
rbind(arrests_KDE_sf, arrests_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(arrests18, 3000), size = .5, colour = "black") +
    facet_wrap(~label, ) +
    scale_fill_viridis(discrete = TRUE, option = "C", name = "Arrest Risk Category") +
    labs(title="Figure 13: Comparison of Kernel Density and Risk Predictions",
         subtitle="2017 Arrest Risk Predictions; 2018 Observed Arrests; Chicago, IL") +
    mapTheme()

Figure 14 suggests otherwise. For all risk categories lower than 90%, the traditional Kernel Density model is actually more accurate. For the highest category, however, our risk model is more accurate. This means that our model is better at locating arrest risk hotspots, which is its intended purpose.

rbind(arrests_KDE_sf, arrests_risk_sf) %>%
  st_set_geometry(NULL) %>% na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countArrests = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Rate_of_test_set_crimes = countArrests / sum(countArrests)) %>%
    ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_viridis(discrete = TRUE, option = "B", name = "Model Used") +
      labs(title = "Figure 14: Risk Prediction vs. Kernel Density, 2018 Arrests", subtitle = "Chicago, IL", x = "Risk Category", y = "Rate of Test Set Crimes") +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
  plotTheme()

Conclusion

Overall, this model achieves the goal of identifying arrest risk hotspots in Chicago that can benefit from interventions in social services and alternative policing methods. It predicts hotspots more accurately than the traditionally used Kernel Density model, and it identifies areas with no observed arrests that have latent arrest risk. This is important, as these areas may be influenced by the same negative factors that cause high arrest rates and they should be targeted with preventative interventions.

A limitation of this model is that it systematically under-predicts arrests in areas with high observed arrests and in majority non-white neighborhoods. As it is intended to be used to allocate resources to at-risk neighborhoods, this may lead to under-allocation in communities with the most need. Further, this model relies on potentially problematic theories such as the “Broken Windows Theory” which posits that neighborhoods with more physical disorder have more crime. It is additionally limited by its assumption that crime equates to arrests. In reality, arrests are more likely to occur in response to crime in disadvantaged neighborhoods. For these reasons, this model should not be used for policing because it can lead to the perpetuation of overpolicing in low income and minority neighborhoods. For the purpose of identifying high arrest risk neighborhoods in which to allocate interventions, however, it is effective and it should be implemented by Chicago to encourage positive alternatives to policing.