Introduction

Since the beginning of the Covid-19 pandemic, the Centers for Disease Control and Prevention (CDC) as well as a growing body of literature has demonstrated the disparate impacts of Covid-19 on individuals with lower access to healthcare, essential workers, those in crowded housing, and low-wealth communities. Often, this results in communities of color experiencing a higher rate of Covid-19 deaths and infections. Local governments and non-governmental agencies have targeted specific interventions aimed at mitigating the disparities in the impacts of Covid-19. In New York City, the Department of Health and Mental Hygiene launched the Covid-19 Equity Action Plan which details efforts to improve access to Covid-19 care and resources in communities of color by partnering with community health providers and making Covid-19 guidance available in 20 languages. In Philadelphia, the Black Doctor’s Covid-19 Consortium offered free vaccine clinics in 2021, prioritizing residents in zipcodes with the high positivity and death rates.

While the disparate impacts of Covid-19 are undeniable, the past two years have made me wonder: how do we quantify the disparate impact of Covid-19 on communities? Covid-19 guides from the CDC and local governments tend to look at case, hospitalization, and death rates per 100,000. Accordingly, interventions often allocate resources based on either an individual metric or a composite of the three. However, throughout the pandemic, there has been little discussion within public health and planning about case outcomes, or the likelihood of an individual dying from Covid-19 once infected. This is important, as it can further help allocate resources and inform future pandemic responses based on an understanding of a community’s vulnerability to Covid-19 infection. So far, most discussion of case outcomes has been confined to the medical literature. There, it is well-documented through clinical studies that individuals with attributes such as being older than 55, organ failure, and hypoxia are most likely to die from Covid-19 infection. In the spatial literature, Covid-19 mortality risk has been linked to income, race, transit use, travel, and tourism on a national scale. However, few studies have examined neighborhood-level risk factors and lived conditions which are associated with fatal Covid-19 case outcomes.

Motivation

The goal of this project is to model Covid-19 Mortality Ratio, defined as the ratio of Covid-19 deaths to infections. From a planning and public health perspective, understanding risk of fatal Covid-19 outcomes is extremely important. Currently, interventions and resource allocation is largely based on individual metrics like case, hospitalization, and death rates, and the severity of the pandemic at large and in communities in mostly understood this way. However, not all communities experiencing a high rate of infection experience a high rate of deaths from infection. By modeling the risk factors that correlate with high likelihood of dying due to Covid-19, we create a more accurate model of vulnerability and risk that can be applied to future crises and inform proactive responses and planning.

knitr::opts_chunk$set(
    message = FALSE,
    warning = FALSE,
    include = FALSE
)
options(scipen=10000000)
library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(mapview)
library(lubridate)
library(pander)
library(ggplot2)
library(tibble)
library(FNN)
library(spdep)
library(ggcorrplot)
library(gmodels)
library(vcd)
library(dplyr)

options(tigris_class = "sf")
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")


mapTheme <- function(base_size = 12) {
  theme(
    text = element_text(color = "black", family="Avenir"),
    plot.title = element_text(size = 18,colour = "black", hjust = 0.5),
    plot.subtitle=element_text(face="italic", hjust = 0.5),
    plot.caption=element_text(size = 7, hjust = 0.5),
    axis.ticks = element_blank(),
    panel.background = element_blank(), 
    axis.text = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "white", fill=NA, size=2)
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 12,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "white", fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=8),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
    output <-
      as.data.frame(nn) %>%
      rownames_to_column(var = "thisPoint") %>%
      gather(points, point_distance, V1:ncol(.)) %>%
      arrange(as.numeric(thisPoint)) %>%
      group_by(thisPoint) %>%
      summarize(pointDistance = mean(point_distance)) %>%
      arrange(as.numeric(thisPoint)) %>% 
      dplyr::select(-thisPoint) %>%
      pull()
  return(output) 
}

Methods

I selected New York City for this analysis for several reasons. First, it has experienced a high incidence of both Covid-19 cases and deaths, resulting in a large sample of high resolution Covid-19 data. Additionally, it has been documented that New Yorkers have experienced a high disparity in both Covid-19 exposure and outcomes based on factors including race, median_age, and profession. Finally, its geographical size allows for a large sample size of communities with diverse demographic and socioeconomic attributes.

To conduct this analysis, I relied primarily on four data sources:
New York City Department of Health and Mental Hygiene
- Covid-19 cases, positivity, testing, hospitalization, deaths
US Census Bureau
- American Community Survey 2018 5-year Estimates
New York City OpenData
- Crime, institutions, evictions data
US Department of Housing and Urban Development (HUD)
- Small Area Fair Market Rents (SAFMR) data
- Picture of Subsidized Households Housing Choice Voucher data
- Census tract to zip code crosswalk files

Covid-19 Data

To construct my dependent variable, mortality ratio, I used Covid-19 data collected by the New York City Department of Health and Mental Hygiene. I accessed this data on the department’s Github, and selected an archived dataset from December 1, 2021 to exclude the Omicron variant which has a much lower fatality rate than previous variants. This data is collected at the zip code level, and includes counts as well as rates (per 100,000) for Covid-19 cases, positivity, hospitalizations, deaths, and testing. To calculate mortality ratio, I divided COVID_DEATH_RATE by COVID_CASE_RATE.

safmr <- st_read("/Users/annaduan/Documents/GitHub/NYC\ Covid/Data/nyc_safmrs.csv") %>%
  rename(zcta = zip_code) %>%
  mutate(safmr0 = as.numeric(X0BR), 
         safmr1= as.numeric(X1BR), 
         safmr2= as.numeric(X2BR), 
         safmr3= as.numeric(X3BR), 
         safmr4= as.numeric(X4BR)) %>%
  dplyr::select(safmr4, zcta)

zones <- st_read("https://data.cityofnewyork.us/resource/pri4-ifjk.geojson") %>% dplyr::select(modzcta) %>%
  rename(zcta = modzcta) %>%
  left_join(safmr)



hosp <- st_read("https://raw.githubusercontent.com/nychealth/coronavirus-data/master/trends/hosprate-by-modzcta.csv") %>%
dplyr::filter(., !grepl("2022", date))

hosp <- hosp %>%
  dplyr::select(-date, -HOSPRATE_Bronx, -HOSPRATE_Brooklyn, -HOSPRATE_Manhattan, -HOSPRATE_Queens, -HOSPRATE_Staten_Island, -HOSPRATE_Citywide)
colnames(hosp)<-gsub("HOSPRATE_","",colnames(hosp))

hosp <- sapply(hosp, as.numeric ) %>%
data.frame() %>%
 replace(is.na(.), 0)
colnames(hosp)<-gsub("X","",colnames(hosp))
hosp <- data.frame(t(hosp)) %>%
  mutate(hosp = X1+X2+X3+X4+X5+X7+X8+X9+X10+X11+X12+X13+X14+X15+X16+X17+X18+X19+X20+X21+X22) %>% dplyr::select(hosp) %>%
  tibble::rownames_to_column(., var = "zcta")


byGeo <- st_read("https://raw.githubusercontent.com/nychealth/coronavirus-data/b51d9e4f2479435faefe261822649249b175a2d0/totals/data-by-modzcta.csv") %>%      #commit Jan 2 2022 to account for Omicron
  rename(zcta = MODIFIED_ZCTA)

byGeo <- left_join(zones, byGeo, by = "zcta") %>%
  filter(zcta != "99999")

byGeo$MORT_RATIO <- (as.numeric(byGeo$COVID_DEATH_RATE)/as.numeric(byGeo$COVID_CASE_RATE))*100

byGeo <- byGeo %>%
  st_transform("EPSG:2263") %>%
  dplyr::select(-NEIGHBORHOOD_NAME, -label, -lat, -lon) %>%
  left_join(hosp, by = "zcta") %>%
mutate(COVID_CASE_RATE = as.numeric(COVID_CASE_RATE),
COVID_DEATH_RATE = as.numeric(COVID_DEATH_RATE),
PERCENT_POSITIVE = as.numeric(PERCENT_POSITIVE),
COVID_CONFIRMED_DEATH_RATE = as.numeric(COVID_CONFIRMED_DEATH_RATE),
COVID_CONFIRMED_CASE_RATE = as.numeric(COVID_CONFIRMED_CASE_RATE),
POP_DENOMINATOR = as.numeric(POP_DENOMINATOR),
TOTAL_COVID_TESTS = as.numeric(TOTAL_COVID_TESTS)) %>%
  dplyr::select(zcta, safmr4, BOROUGH_GROUP, COVID_CASE_RATE, POP_DENOMINATOR, COVID_DEATH_RATE, PERCENT_POSITIVE, TOTAL_COVID_TESTS, MORT_RATIO, hosp)
grid.arrange(
ggplot()+
  geom_sf(data = byGeo, colour = "transparent", aes(fill = as.numeric(COVID_CASE_RATE)), size = 0.5) +
  scale_fill_viridis(option = 'A', name = "Cases/ \n100,000") +
  labs(title = "Cases per 100k") +
  mapTheme() +theme(legend.position="bottom")
,
ggplot()+
  geom_sf(data = byGeo, colour = "transparent", aes(fill = as.numeric(hosp)), size = 0.5) +
  scale_fill_viridis(option = 'A', name = "Hospitalizations/ \n100,000") +
  labs(title = "Hospitalizations per 100k") +
  mapTheme() +theme(legend.position="bottom")
, 
   ggplot()+
  geom_sf(data = byGeo, colour = "transparent", aes(fill = as.numeric(COVID_DEATH_RATE)), size = 0.5) +
  scale_fill_viridis(option = 'A', name = "Deaths/ \n100,000") +
  labs(title = "Deaths per 100k") +
  mapTheme() +theme(legend.position="bottom")
,
ggplot()+
  geom_sf(data = byGeo, colour = "transparent", aes(fill = as.numeric(MORT_RATIO)), size = 0.5) +
  scale_fill_viridis(option = 'A', name = "Mortality Ratio (%)") +
  labs(title = "Mortality Ratio") +
  mapTheme() + theme(legend.position="bottom")
,
ncol = 2, top = "Covid-19 Metrics in New York City by Zip Code", bottom = "Source: https://github.com/nychealth/coronavirus-data/blob/master/totals/data-by-modzcta.csv")

Census Data

The census data in this analysis was pulled from the Census Bureau’s API using the R package, tidycensus. This data encompasses variables including demographics, living conditions, employment, and English fluency at the census tract level. For each variable, I calculated its percentage share of the whole population or its appropriate denominator. I then removed 66 census tracts that had a population of 0 and NA values for median income and median age, ending with 2101 census tracts.

#acs
census_api_key("7aaf667fa42073b560f4759176037374b38cb7fd", install = TRUE, overwrite = TRUE)
vars <- load_variables(year = 2018,
                       dataset = "acs5",
                       cache = TRUE)

pull_acs <- function(cty) {   #input is a borough name
  get_acs(geography = "tract", variables = c("B01003_001", #TOTAL POPULATION
                                                   "B01002_001", #MEDIAN AGE
                                                   "B01001_026", #TOTAL FEMALE POP
                                                   "B02001_002", #TOTAL WHITE ALONE
                                                   "B02001_003", #TOTAL BLACK ALONE
                                                   "B03003_003", #TOTAL HISPANIC/LATINO
                                                   "C18120_006", #TOTAL UNEMPLOYED
                                                   "B05003_008", #MALE 18+
                                                   "B05003_019", #FEMALE 18+
                                                   "B11010_009", #TOTAL FEMALE-HEADED HOUSEHOLD
                                                   "B06012_002", #TOTAL BELOW POVERTY
                                                   "B19057_002", #TOTAL RECEIVING PUBLIC ASSISTANCE
                                                   "B19013_001", #MEDIAN HOUSEHOLD median_age
                                                   "B25003_003", #RENTER OCCUPIED
                                                   "B09019_002", #TOTAL HOUSEHOLDS
                                                   "B25129_039", #RENTER OCC, Moved after 2010
                                                   "B25129_003", #OWNER OCC, Moved after 2010)
                                                   "B25002_003", #VACANT PROPERTY
                                                   "B25002_001", #HOUSING UNIT COUNT
                                                   "B25123_003", #OWNER OCC 1+ CONDITION
                                                   "B25123_009", #RENTER OCC 1+ CONDITION
                                                   "B19001_001", #HOUSEHOLDS W/ KNOWN INCOME
                                                   "B19001A_014",#White alone 100-124.99
                                                   "B19001A_015",#White alone 125-149.99
                                                   "B19001A_016",#White alone 150-199.99
                                                   "B19001A_017",#White alone 200+
                                                   "B19001B_002",#Black alone under 10  
                                                   "B19001B_003",#Black alone 10-14.99  
                                                   "B19001B_004",#Black alone 15-19.99  
                                                   "B19001B_005",#Black alone 20-24.99
                                             
                                                   "B25032_010", #owner occupied apt (50+ housing units)
                                                   "B25032_021", #renter occupied apt (50 + housing units)
                                                   "B25010_002", #avg owner occ household size
                                                   "B25010_003", #avg renter occ household size
                                                   "B08141_002", #no vehicles
                                                   "B08141_016", #work via transit
                                                   "B08141_031", #wfh_2018
                                                   "C24050_001",# total employed 16+
                                                   "C24050_006",# work in retail
                                                   "C24050_011",# works n education, health, social assistance
                                                   "C24050_029",# works in service
                                                   "B15003_001",# age 25+
                                                   "B15003_012",# age 25+ 8th grade
                                                   "B15003_017",# age 25+ 12th grade diploma
                                                   "B15003_022",# age 25+ bachelors 
                                                   "B15003_024",# age 25+ profesional degree
                                                   "B05002_013",# foreign born population
                                                   "C16002_013",# limited English household
                                                   "B27010_033",# individuals without health insurance
                                                   "B25011_042",# renter living alone
                                                   "B25011_018",# owner living alone
                                                   "B17024_106",# 65-74 years 
                                                   "B26107_003"# living in group quarters non veteran
                                             ), year=2018, state=36, county = cty, geometry=T) %>%
  st_transform("EPSG:2263") 
}
  
queens <- pull_acs(81)
kings <- pull_acs(47)
bronx <- pull_acs(5)
richmond <- pull_acs(85)
newyork <- pull_acs(61)
acs <- rbind(queens, kings, bronx, newyork, richmond)

acs <- 
  acs %>%
  dplyr::select( -NAME, -moe) %>%
  spread(variable, estimate) %>%
  dplyr::select(-geometry) %>%
  rename(population = B01003_001, 
         median_age = B01002_001,
         female = B01001_026, 
         white = B02001_002,
         black = B02001_003, 
         hispanic = B03003_003,
         unemployed = C18120_006,
         male_18 = B05003_008,
         female_18 = B05003_019,
         female_hh = B11010_009,
         public_asst = B19057_002, 
         poverty = B06012_002,
         medhhinc = B19013_001,
         renter_hh = B25003_003,
         total_hh = B09019_002,
         renter_5_years = B25129_039,
         owner_5_years = B25129_003,
         vacant = B25002_003,
         housing_units = B25002_001,
         owner_1condition = B25123_003,
         renter_1condition = B25123_009,
         known_hh_incomes = B19001_001,
         white100_124k = B19001A_014,
         white125_149k = B19001A_015,
         white150_199k = B19001A_016,
         white200k_plus = B19001A_017,
         black_10k_below = B19001B_002,
         black_10_14k = B19001B_003,
         black_15_19k = B19001B_004,
         black_20_24k = B19001B_005) %>%
  mutate(female_pct = ifelse(population > 0, female / population, 0),
         white_pct = ifelse(population > 0, white / population, 0),
         black_pct = ifelse(population > 0, black/ population, 0),
         hispanic_pct = ifelse(population > 0, hispanic / population, 0),
         other_race_pct = ifelse(population > 0, (population - (hispanic + white + black))/population, 0),
         adult_18 = male_18 + female_18,
         adult_pct = ifelse(population > 0, adult_18 / population, 0), 
         children_pct = 1 - adult_pct,
         female_hh_pct = ifelse(total_hh > 0, female_hh / total_hh, 0),
         public_asst_pct = ifelse(total_hh > 0, public_asst / total_hh, 0),
         poverty_pct = ifelse(population > 0, poverty / population, 0),
         renter_pct = ifelse(housing_units > 0, renter_hh / housing_units, 0),
         tenure_5yr_pct = ifelse(total_hh > 0, (renter_5_years + owner_5_years) / total_hh, 0),
         vacancy_pct = ifelse(housing_units > 0, vacant / housing_units, 0),
         housing_condition_pct = ifelse(housing_units > 0, (owner_1condition + renter_1condition)/housing_units, 0),
         unemployed_pct = ifelse(adult_18 > 0, unemployed / adult_18, 0),     
         race_ice = ifelse(known_hh_incomes >0, ((white100_124k + white125_149k + white150_199k + white200k_plus) - (black_10k_below + black_10_14k + black_15_19k + black_20_24k))/known_hh_incomes, 0),#((white non-hisp over US$100 k)  −(black alone under US$25 k))/total population_household income
apartments = ifelse(housing_units > 0, (B25032_010 + B25032_021)/housing_units, 0),
hh_size_renter = B25010_003,
hh_size_owner = B25010_002,
no_veh_hh = ifelse(total_hh > 0, B08141_002/total_hh, 0),
total_workers = C24050_001,
transit_commute = ifelse(total_workers >0, B08141_016/total_workers,0),
wfh_2018 = ifelse(total_workers >0, B08141_031/total_workers,0),
retail_worker = ifelse(total_workers >0, C24050_006/total_workers,0),
edu_health_soc_worker =ifelse(total_workers >0, C24050_011/total_workers,0),
service_worker = ifelse(total_workers >0, C24050_029/total_workers,0),
age_25plus = B15003_001, 
edu_ms =ifelse(age_25plus >0, B15003_012/age_25plus,0),
edu_hs = ifelse(age_25plus >0, B15003_017/age_25plus,0),
edu_bac = ifelse(age_25plus >0, B15003_022/age_25plus,0),
edu_prof =ifelse(age_25plus >0, B15003_024/age_25plus,0),
foreign_born=ifelse(population >0, B05002_013/population,0),
limited_english_hh=ifelse(total_hh >0, C16002_013/total_hh,0),
no_insurance=ifelse(population >0, B27010_033/population, 0),
live_alone= ifelse(total_hh >0, (B25011_042+ B25011_018)/total_hh, 0),
age_65_74 = ifelse(population >0, B17024_106/population,0),
group_quarters =ifelse(population >0, B26107_003/population,0)) %>%
  dplyr::select(median_age, medhhinc, female_pct, white_pct, black_pct, hispanic_pct, other_race_pct, adult_pct, female_hh_pct, public_asst_pct, poverty_pct, renter_pct, tenure_5yr_pct, vacancy_pct, housing_condition_pct, unemployed_pct,  race_ice, GEOID, population,apartments, hh_size_renter, hh_size_owner, no_veh_hh, total_workers, transit_commute, wfh_2018, retail_worker, edu_health_soc_worker, service_worker, age_25plus, edu_ms, edu_hs, edu_bac, edu_prof, foreign_born, limited_english_hh, no_insurance, live_alone, age_65_74, group_quarters )
acs[is.na(acs)] <- 0 #2167
acs <- acs %>%
  filter(median_age != 0 & medhhinc != 0 & population != 0) #2101

Additional Data

The rest of my independent variables came from the New York City OpenData portal and HUD’s Picture of Subsidized Housing dataset. From NYC OpenData, I pulled point data on evictions, public transit stops, healthcare facilities, schools, points of interest, and shootings to try and capture potential stressors, constraints, and resources across the city. From HUD’s dataset, I collected census tract level data on public housing availability and use, as the literature has suggested that individuals living in subsidized housing may be more vulnerable to Covid-19 exposure due to higher residential density.

evictions <- st_read("https://data.cityofnewyork.us/resource/6z8x-wfk4.csv") %>%
dplyr::filter(., !grepl("2022", executed_date)) %>%
  dplyr::select(
    # -court_index_number, -docket_number, -eviction_address, -eviction_apt_num, -marshal_first_name, -marshal_last_name, -borough, -eviction_zip, -ejectment, -community_board, -council_district, -census_tract, -bin, -bbl, -nta, -ejectment, -eviction_possession
longitude, latitude
                ) %>%
  mutate(latitude = as.numeric(latitude),
         longitude = as.numeric(longitude),
         eviction = 1) %>%
  na.omit(., latitude) %>%
  st_as_sf(., crs = 4326, coords = c("longitude", "latitude")) %>%
  na.omit("latitude") %>%
  sf::st_transform('EPSG:2263') %>%
  dplyr::select(geometry)

evictions_prox <- st_coordinates(evictions)


busstops <- st_read("https://data.cityofnewyork.us/resource/t4f2-8md7.geojson") %>%
  dplyr::select(geometry) %>%
  st_transform("EPSG:2263")


healthcare <- st_read("https://data.cityofnewyork.us/resource/f7b6-v6v3.geojson") %>%
  sf::st_transform('EPSG:2263') %>%
  dplyr::select(geometry)

schools <- st_read("/Users/annaduan/Documents/GitHub/NYC\ Covid/Data/Public_School_Locations\ 2/Public_Schools_Points_2011-2012A.shp") %>%
  dplyr::select(geometry) %>%
  sf::st_transform('EPSG:2263') %>%
  mutate(schools = 1)
  
poi <- st_read("https://data.cityofnewyork.us/resource/t95h-5fsr.geojson")

shootings_hist <- st_read("/Users/annaduan/Documents/GitHub/NYC\ Covid/Data/shooting_hist/geo_export_eb0fa86e-f53e-4323-8146-851f587c0ca1.shp") %>%
  filter(date_occur > "2016-12-31")

shootings_pres <- st_read("/Users/annaduan/Documents/GitHub/NYC\ Covid/Data/shooting_pres/geo_export_fe7f8a1b-8196-4bbf-b798-65c65a6308f0.shp") %>%
    filter(date_occur < "2022-01-01")

shootings <- rbind(shootings_hist, shootings_pres) %>%
  dplyr::select(geometry) %>%
    sf::st_transform('EPSG:2263') %>%
  mutate(shootings = 1)
shootings_prox <- shootings %>%
  dplyr::select(geometry)  %>%
  st_coordinates()

## public housing
public_housing <- st_read("/Users/annaduan/Documents/GitHub/NYC\ Covid/Data/public_housing.csv") %>%
  filter(grepl("Kings County", Name) | grepl("Queens County", Name) | grepl("Richmond County", Name) | grepl("Bronx County", Name) | grepl("New York County", Name)) %>%
dplyr::select(Code, Subsidized.units.available) %>%
  mutate(p_housing_units = as.numeric(Subsidized.units.available)) %>%
transform(Code=str_replace(Code,"=","")) %>%
  rename(GEOID = Code) %>%
  dplyr::select(GEOID, p_housing_units) 

public_housing <- acs %>%
  dplyr::select(GEOID) %>%
  st_centroid() %>%
  left_join(public_housing) %>%
  st_as_sf() %>%
  filter(is.na(p_housing_units) == F) %>%
  dplyr::select(-GEOID)

Data Merging

After cleaning the data, I merged the independent variables with the dependent variable, Covid-19 Mortality Ratio. I had to use the Department of Housing and Urban Development’s (HUD) Crosswalk dataset to convert census-tract level data to the zip code level based on population ratios.

zip_list <- c("10001", "10002", "10003", "10026", "10004", "10005", "10006", "10007", "10009", "10010", "10011", "10012", "10013", "10014", "10016", "10017", "10018", "10030", "10019", "10021", "10022", "10023", "10024", "10025", "10027", "10028", "10029", "10031", "10032", "10033", "10034", "10037", "10035", "10036", "10038", "10040", "11104", "11105", "10039", "10128", "10044", "10065", "10069", "10075", "10280", "10282", "10301", "11106", "10302", "10303", "10304", "10451", "10305", "10306", "10307", "10308", "10309", "10310", "10312", "10314", "10456", "10452", "10453", "10454", "10455", "10457", "10458", "10459", "10460", "10461", "11004", "11101", "10462", "10463", "10464", "10465", "10466", "10467", "10468", "10469", "10470", "11102", "11109", "10471", "10472", "10473", "10474", "11103", "10475", "11201", "11203", "11204", "11205", "11206", "11207", "11209", "11208", "11210", "11211", "11212", "11213", "11218", "11214", "11215", "11216", "11217", "11219", "11220", "11221", "11222", "11223", "11224", "11225", "11226", "11228", "11230", "11415", "11229", "11231", "11358", "11232", "11233", "11234", "11235", "11236", "11237", "11238", "11239", "11354", "11355", "11356", "11357", "11360", "11372", "11361", "11362", "11416", "11363", "11364", "11373", "11365", "11366", "11367", "11374", "11368", "11417", "11369", "11370", "11375", "11379", "11377", "11378", "11421", "11422", "11385", "11411", "11412", "11413", "11414", "11420", "11418", "11419", "11423", "11426", "11427", "11428", "11436", "11691", "11692", "11693", "11429", "11432", "11433", "11434", "11435", "11694", "11697")

crosswalk <- st_read("/Users/annaduan/Desktop/HIP/Rent\ Limit\ Study/Data/zip_tract_crosswalk.csv") %>%
  filter(usps_zip_pref_state == "NY" & zip %in% zip_list) %>%
 dplyr::select(-usps_zip_pref_city, -usps_zip_pref_state) %>%
  rename(GEOID = tract) %>% 
  dplyr::select(-bus_ratio, -oth_ratio, -tot_ratio) %>%
  rename(zip_code = zip)

acs_zip <- acs %>%
  left_join(crosswalk, keep = FALSE, by= "GEOID") %>%
  mutate(res_ratio = as.numeric(res_ratio)) %>%
  mutate_at(c("median_age","medhhinc", "female_pct","white_pct","black_pct", "hispanic_pct",          "other_race_pct","adult_pct","female_hh_pct" , "public_asst_pct" ,"poverty_pct","renter_pct","tenure_5yr_pct" ,"vacancy_pct", "housing_condition_pct","unemployed_pct","race_ice","population", "apartments" ,"hh_size_renter","hh_size_owner","no_veh_hh" , "total_workers" ,"transit_commute","wfh_2018", "retail_worker","edu_health_soc_worker","service_worker","age_25plus","edu_ms","edu_hs","edu_bac","edu_prof", "foreign_born" ,"limited_english_hh","no_insurance","live_alone", "age_65_74", "group_quarters"), ~.*res_ratio)

acs_zip[is.na(acs_zip)] <- 0
  
acs_zip <- acs_zip %>%
  st_drop_geometry() %>%
  group_by(zip_code) %>%
  summarise_at(c("median_age","medhhinc", "female_pct","white_pct","black_pct", "hispanic_pct",          "other_race_pct","adult_pct","female_hh_pct" , "public_asst_pct" ,"poverty_pct","renter_pct","tenure_5yr_pct" ,"vacancy_pct", "housing_condition_pct","unemployed_pct","race_ice","population", "apartments" ,"hh_size_renter","hh_size_owner","no_veh_hh" , "total_workers" ,"transit_commute","wfh_2018", "retail_worker","edu_health_soc_worker","service_worker","age_25plus","edu_ms","edu_hs","edu_bac","edu_prof", "foreign_born" ,"limited_english_hh","no_insurance","live_alone", "age_65_74", "group_quarters"), sum) %>%
  filter(zip_code != 0) %>%
  rename(zcta = zip_code)

#add covid data
data_zip <- byGeo %>%
  left_join(acs_zip, by = "zcta") %>%
  st_as_sf() %>%
  st_transform("EPSG:2263") %>%
  mutate(TEST_RATE = as.numeric(TOTAL_COVID_TESTS)/as.numeric(POP_DENOMINATOR), na.rm = T) 

data_zip <- data_zip %>%
  dplyr::select(zcta) %>%
  st_intersection(public_housing) %>%
  group_by(zcta) %>%
  summarise(p_housing_units = sum(p_housing_units)) %>%
  st_drop_geometry() %>%
  full_join(data_zip, by = "zcta") %>%
  st_as_sf()

grid.arrange(
ggplot() +
  geom_sf(data = acs, color = "white", fill = "purple",size = 0.2) +
  scale_fill_viridis(option = "A", name = "NYC Borough") +
  labs(title = "Tract", subtitle = "n = 2101") +
  mapTheme() + theme(legend.position="none")
,
ggplot() +
  geom_sf(data = data_zip, color = "white", fill = "magenta", size = 0.2) +
  scale_fill_viridis(option = "A", name = "NYC Borough") +
  labs(title = "ZIP Code", subtitle = "n = 177") +
  mapTheme() + theme(legend.position="none"), ncol = 2, top = "Tract to ZIP")

Next, I create a set of “nearest neighbor” variables to represent the point data. Nearest neighbor is a function that calculates the average distance between the centroid of a ZIP code and the k nearest “neighbor” ZIP codes with a certain attribute. I create these variables for healthcare, transit, evictions, schools, shootings, and public housing.

data_zip <-
  data_zip %>%
    mutate(
           healthcare.nn5 = nn_function(st_coordinates(st_centroid(data_zip)), st_coordinates(healthcare),5),
           busstop.nn5 = nn_function(st_coordinates(st_centroid(data_zip)), st_coordinates(busstops),5),
           evictions.nn5 = nn_function(st_coordinates(st_centroid(data_zip)), evictions_prox,5),
           schools.nn5 =  nn_function(st_coordinates(st_centroid(data_zip)), st_coordinates(schools),5),
           shootings.nn5 = nn_function(st_coordinates(st_centroid(data_zip)), shootings_prox,5),
           public_housing = ifelse(population >0, p_housing_units/population, 0)) %>%
    dplyr::select( -hosp, -na.rm, -POP_DENOMINATOR)

shootings_bronx <- shootings %>%
  st_intersection(st_union(data_zip %>% filter(BOROUGH_GROUP == "Bronx")))

grid.arrange(
ggplot() +
  geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx"),fill = "magenta", colour = "white", alpha = 0.5) +
  geom_sf(data = shootings_bronx, size = 1, colour = "black") +
  labs(title = "Point data", subtitle = "Bronx, NYC") +
  mapTheme() 
,
ggplot() +
  geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx"), aes(fill = shootings.nn5), colour = "white") +
  scale_fill_viridis(option = 'magma', name = "Mean distance to 5 \nnearest shootings (ft)") +
  labs(title = "Nearest Neighbors Variable", subtitle = "Bronx, NYC") +
  mapTheme() , ncol = 2, top = "Shootings Data Feature Engineering")

Exploratory Analysis

Summary Statistics

Before creating a model, I calculate some summary statistics regarding Covid-19 mortality ratio in New York City.

  • On average, just over 2 percent of positive Covid-19 cases resulted in death between January, 2020 and December, 2021
  • In the ZIP code with the lowest mortality ratio, this figure was 0.06%, and the highest ratio for a MODZCTA was 5.4%
  • In terms of death count, the least affected ZIP code experienced just 1 death, and the most affected one had 662
  • Case rates have a positive skew, death rates have a negative skew, and mortality ratios are roughly normally distributed, with a mode of 2
grid.arrange(
ggplot()+
  geom_histogram(data = data_zip, aes(x = as.numeric(COVID_CASE_RATE)), bins = 50, fill = "violet") +
  labs(title = "Case Rate", x = "Cases per 100k", y = "ZIP codes") +
  plotTheme()

,
ggplot()+
  geom_histogram(data = data_zip, aes(x = as.numeric(COVID_DEATH_RATE)), bins = 50, fill = "magenta") +
  labs(title = "Death Rate", x = "Deaths per 100k", y = "ZIP codes") +
  plotTheme()
,
ggplot()+
  geom_histogram(data = data_zip, aes(x = MORT_RATIO), bins = 50, fill = "purple") +
  labs(title = "Mortality Ratio", x = "% of Infections", y = "ZIP codes") +
  plotTheme(),
ncol = 3, top = "New York City Covid-19 Cases, Deaths, and Mortality Ratios, 2020-2021")

Variation within boroughs

Let’s look closer. By mapping mortality ratios in the Bronx and Staten Island, we see that there appears to be greater disparities in Staten Island than in the Bronx. Perhaps this is due to a higher level of inequality.

boro_labs <- data_zip %>%
  dplyr::select(BOROUGH_GROUP, MORT_RATIO) %>%
  group_by(BOROUGH_GROUP) %>%
  summarise(Mortality_Ratio = mean(MORT_RATIO))

ggplot()+
    geom_sf(data = data_zip, colour = "transparent", fill = "gray90", size = 0.5) +
  geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx" | BOROUGH_GROUP == "Staten Island"), colour = "transparent", aes(fill = as.numeric(MORT_RATIO)), size = 0.5) +
  geom_sf_text(data = boro_labs %>% filter(BOROUGH_GROUP == "Staten Island" | BOROUGH_GROUP == "Bronx"), aes(label = BOROUGH_GROUP), size = 3, colour = "white", nudge_x = 4)+
  scale_fill_viridis(option = 'A', name = "% of Positive Cases") +
  labs(title = "Mortality Ratio in the Bronx and Staten Island", caption = "Mortality ratio = (deaths/positive cases)*100") +
  mapTheme()

I map a few variables to test this theory.

The spatial distribution of poverty does not resemble that of mortality ratio.

ggplot()+
    geom_sf(data = data_zip, colour = "transparent", fill = "gray90", size = 0.5) +
  geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx" | BOROUGH_GROUP == "Staten Island"), colour = "transparent", aes(fill = as.numeric(poverty_pct)), size = 0.5) +
  scale_fill_viridis(option = 'A', name = "% of\npopulation") +
  labs(title = "Below Poverty Line") +
  mapTheme()

Unemployment, too, does not map well onto mortality ratio.

ggplot()+
    geom_sf(data = data_zip, colour = "transparent", fill = "gray90", size = 0.5) +
  geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx" | BOROUGH_GROUP == "Staten Island"), colour = "transparent", aes(fill = as.numeric(unemployed_pct)), size = 0.5) +
  scale_fill_viridis(option = 'A', name = "% of population \n 16+") +
  labs(title = "Age 16+ Unemployed") +
  mapTheme()

Percent non-White loosely resembles the mortality ratio in Staten Island, but less so in the Bronx.

ggplot()+
    geom_sf(data = data_zip, colour = "transparent", fill = "gray90", size = 0.5) +
  geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx" | BOROUGH_GROUP == "Staten Island"), colour = "transparent", aes(fill = 1 - as.numeric(white_pct)), size = 0.5) +
  scale_fill_viridis(option = 'A', name = "% of\npopulation") +
  labs(title = "Non-White Residents") +
  mapTheme()

Age appears to more closely resemble mortality ratio in the Bronx than in Staten Island.

ggplot()+
    geom_sf(data = data_zip, colour = "transparent", fill = "gray90", size = 0.5) +
  geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx" | BOROUGH_GROUP == "Staten Island"), colour = "transparent", aes(fill = as.numeric(median_age)), size = 0.5) +
  scale_fill_viridis(option = 'A', name = "Years") +
  labs(title = "Median Age") +
  mapTheme()

Percent foreign-born more closely resembles mortality ratio in Staten Island.

ggplot()+
    geom_sf(data = data_zip, colour = "transparent", fill = "gray90", size = 0.5) +
  geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx" | BOROUGH_GROUP == "Staten Island"), colour = "transparent", aes(fill = as.numeric(foreign_born)), size = 0.5) +
  scale_fill_viridis(option = 'A', name = "% of\npopulation") +
  labs(title = "Foreign-born Residents") +
  mapTheme()

Together, these trends suggest that a number of variables likely have interaction effects with the borough that a ZIP code is in.

Distribution of independent variables

Next, I plot the distribution of all of the current variables. The following variables have skewed distributions, and may need further transformation before being used for modelling:

skewed_list <- c("black_pct", "hispanic_pct", "white_pct", "busstop.nn5", "edu_ms", "edu_prof", "evictions.nn5", "female_hh_pct", "healthcare.nn5", "limited_english_hh", "live_alone", "no_veh_hh", "p_housing_units", "poverty_pct", "public_asst_pct", "race_ice", "renter_pct", "safmr4", "schools.nn5", "tenure_5yr_pct", "unemployed_pct", "wfh_2018") 
  

skewed_desc <- c("Percentage share of Black residents", "Percentage share of Hispanic residents", "Percentage share of White residents", "Average distance to nearest 5 bus stops from ZIP code centroid", "Percentage share of residents 25 years and older with middle school education level", "Percentage share of residents 25 years and older with professional degrees", "Average distance to nearest 5 evictions", "Percentage share of households headed by female householder", "Average distance to nearest 5 heatlhcare facilities", "Percentage share of households with limited english fluency", "Percentage share of residents living alone", "Percentage share of households with no access to automobiles", "Number of public housing units per ZIP code", "Percentage share of residents living below poverty line", "Percentage share of households receiving public assistance", "Racialized Index of Concentration at the Extremes (RICE); -1 = most disadvantaged, 1 = most advantaged", "Percentage share of households that rent their homes", "HUD Housing Choice Voucher rent limit for 4 bedroom unit", "Average distance to nearest 5 public K-12 schools", "Percentage share of renters who have lived in their current address for 5 or more years", "Percentage share of population 16 years and older that is unemployed", "Percentage share of workers who worked from home in 2018")

skewed_vars <- as.data.frame(skewed_list) %>%
  mutate(col = seq.int(nrow(.))) %>%
  rename(Variable = skewed_list)

skewed_desc <- as.data.frame(skewed_desc) %>%
  mutate(col = seq.int(nrow(.))) %>%
  rename(Description = skewed_desc) 

left_join(skewed_vars, skewed_desc) %>%
  dplyr::select(-col) %>%
  pander(caption = "Skewed Variables")

Joining, by = “col”

Skewed Variables
Variable Description
black_pct Percentage share of Black residents
hispanic_pct Percentage share of Hispanic residents
white_pct Percentage share of White residents
busstop.nn5 Average distance to nearest 5 bus stops from ZIP code centroid
edu_ms Percentage share of residents 25 years and older with middle school education level
edu_prof Percentage share of residents 25 years and older with professional degrees
evictions.nn5 Average distance to nearest 5 evictions
female_hh_pct Percentage share of households headed by female householder
healthcare.nn5 Average distance to nearest 5 heatlhcare facilities
limited_english_hh Percentage share of households with limited english fluency
live_alone Percentage share of residents living alone
no_veh_hh Percentage share of households with no access to automobiles
p_housing_units Number of public housing units per ZIP code
poverty_pct Percentage share of residents living below poverty line
public_asst_pct Percentage share of households receiving public assistance
race_ice Racialized Index of Concentration at the Extremes (RICE); -1 = most disadvantaged, 1 = most advantaged
renter_pct Percentage share of households that rent their homes
safmr4 HUD Housing Choice Voucher rent limit for 4 bedroom unit
schools.nn5 Average distance to nearest 5 public K-12 schools
tenure_5yr_pct Percentage share of renters who have lived in their current address for 5 or more years
unemployed_pct Percentage share of population 16 years and older that is unemployed
wfh_2018 Percentage share of workers who worked from home in 2018
correlation.long <-
  st_drop_geometry(data_zip) %>%
    dplyr::select(-zcta, -BOROUGH_GROUP, -COVID_CASE_RATE,  -COVID_DEATH_RATE, -other_race_pct, -PERCENT_POSITIVE,-TEST_RATE, -total_workers, -population, -hh_size_owner, -group_quarters, -apartments, -age_25plus) %>%
    gather(Variable, Value, -MORT_RATIO) %>%
  mutate(Value = as.numeric(Value))

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, MORT_RATIO, use = "complete.obs"))

ggplot(correlation.long %>% filter(Variable %in% skewed_list), aes(x = Value)) +
  geom_histogram(bins = 50, fill = "magenta") +
  facet_wrap(~Variable, ncol = 6, scales = "free") +
  labs(title = "Covid-19 Mortality Ratio as a function of risk factors", subtitle = "Skewed Variables") +
  plotTheme()

Feature Engineering

Based on what I found in my exploratory analysis, I further transform some of the variables in the following ways:

Create dummy variables for each borough to later use for interactions when modelling

data_zip <- data_zip %>%
  mutate(bronx = ifelse(BOROUGH_GROUP == "Bronx", 1, 0),
         manhattan = ifelse(BOROUGH_GROUP == "Manhattan", 1, 0),
         statenisland = ifelse(BOROUGH_GROUP == "Staten Island", 1, 0),
         queens = ifelse(BOROUGH_GROUP == "Queens", 1, 0),
         brooklyn = ifelse(BOROUGH_GROUP == "Brooklyn", 1, 0))

data_zip %>%
  st_drop_geometry() %>%
  dplyr::select(bronx, queens, manhattan, statenisland, brooklyn) %>%
  unique() %>%
  pander(caption = "Borough Group Dummy Variables")
Borough Group Dummy Variables
bronx queens manhattan statenisland brooklyn
0 0 1 0 0
0 0 0 1 0
1 0 0 0 0
0 1 0 0 0
0 0 0 0 1

Create categorical, log-transformed, and dummy variables to account for skewness of some variables

data_zip <- data_zip %>%
  mutate(black_med = ifelse(black_pct > median(black_pct), 1, 0),
         black_mean = ifelse(black_pct > mean(black_pct), 1, 0),
         black_maj = ifelse(black_pct > 0.5, 1, 0),
         black_log = log(black_pct),
         hispanic_med = ifelse(hispanic_pct > median(hispanic_pct), 1, 0),
         hispanic_mean = ifelse(hispanic_pct > mean(hispanic_pct), 1, 0),
         hispanic_maj = ifelse(hispanic_pct > 0.5, 1, 0),
         hispanic_log = log(hispanic_pct),
         white_maj = ifelse(white_pct > 0.5, 1, 0),
         white_qt3 = ifelse(white_pct > 0.68785, 1, 0),
         white_log = log(white_pct),
         nonwhite_qt3 = ifelse(white_pct < 0.22, 1, 0),
         bus.nn5_med = ifelse(busstop.nn5 > median(busstop.nn5), 1, 0),
         bus.nn5_mean = ifelse(busstop.nn5 > mean(busstop.nn5), 1, 0),
         bus.nn5_qt3 = ifelse(busstop.nn5 > 14895.4, 1, 0),
         busnn5_log = log(busstop.nn5),
         grade8_med = ifelse(edu_ms > median(edu_ms), 1, 0),
         grade8_mean = ifelse(edu_ms > mean(edu_ms), 1, 0),
         grade8_qt3 = ifelse(edu_ms > 0.022335, 1, 0),
         grade8_log = log(edu_ms),
         prof_degree_med = ifelse(edu_prof > median(edu_prof), 1, 0),
         prof_degree_mean = ifelse(edu_prof > mean(edu_prof), 1, 0),
         prof_degree_qt3 = ifelse(edu_prof > 0.0434616, 1, 0),
         prof_degree_log = log(edu_prof),
         evict.nn5_med = ifelse(evictions.nn5 > median(evictions.nn5), 1, 0),
         evict.nn5_mean = ifelse(evictions.nn5 > mean(evictions.nn5), 1, 0),
         evict.nn5_qt3 = ifelse(evictions.nn5 > 4622.2, 1, 0),
         evict.nn5_log = log(evictions.nn5),
         female_hh_med = ifelse(female_hh_pct > median(female_hh_pct), 1, 0),
         female_hh_mean = ifelse(female_hh_pct > mean(female_hh_pct), 1, 0),
         female_hh_qt3 = ifelse(female_hh_pct > 0.12607, 1, 0),
         female_hh_log = log(female_hh_pct),
         health.nn5_med = ifelse(healthcare.nn5 > median(healthcare.nn5), 1, 0),
         health.nn5_mean = ifelse(healthcare.nn5 > mean(healthcare.nn5), 1, 0),
         health.nn5_qt3 = ifelse(healthcare.nn5 > 14879, 1, 0),
         healthcare.nn5_log = log(healthcare.nn5),
         lim_english_binary = ifelse(limited_english_hh > 0.0002791, 1, 0),
         lim_english_mean = ifelse(limited_english_hh > mean(limited_english_hh), 1, 0),
         lim_english_qt3 = ifelse(limited_english_hh > 0.0023350, 1, 0),
         lim_english_log = log(limited_english_hh),
         live_alone_med = ifelse(live_alone > median(live_alone), 1, 0),
         live_alone_mean = ifelse(live_alone > mean(live_alone), 1, 0),
         live_alone_qt3 = ifelse(live_alone > 0.17207, 1, 0),
         live_alone_log = log(live_alone),
         no_car.med = ifelse(no_veh_hh > median(no_veh_hh), 1, 0),
         no_car.mean = ifelse(no_veh_hh > mean(no_veh_hh), 1, 0),
         no_car.qt3 = ifelse(no_veh_hh > 0.341979, 1, 0),
         no_car_log = log(no_veh_hh),
         ph_units_binary = ifelse(p_housing_units != 0, 1, 0),
         poverty_med = ifelse(poverty_pct > median(poverty_pct), 1, 0),
         poverty_mean = ifelse(poverty_pct > mean(poverty_pct), 1, 0),
         poverty_qt3 = ifelse(poverty_pct > 0.21182, 1, 0),
         poverty_log = log(poverty_pct),
         public_asst_med = ifelse(public_asst_pct > median(public_asst_pct), 1, 0),
         public_asst_mean = ifelse(public_asst_pct > mean(public_asst_pct), 1, 0),
         public_asst_qt3 = ifelse(public_asst_pct > 0.019006, 1, 0),
         public_asst_log = log(public_asst_pct),
         rice_binary = ifelse(race_ice > 0, 1, 0),
         rice_log = log(race_ice), 
         renter_med = ifelse(renter_pct > median(renter_pct), 1, 0),
         renter_mean = ifelse(renter_pct > mean(renter_pct), 1, 0),
         renter_log = log(renter_pct),
         safmr4_med = ifelse(safmr4 > median(safmr4), 1, 0),
         safmr_log = log(safmr4),
         schools.nn5_med = ifelse(schools.nn5 > median(schools.nn5), 1, 0),
         schools.nn5_mean = ifelse(schools.nn5 > mean(schools.nn5), 1, 0),
         schools.nn5_qt3 = ifelse(schools.nn5 > 2757.5, 1, 0),
         schools_log = log(schools.nn5),
         shootings.nn5_med = ifelse(shootings.nn5 > median(shootings.nn5), 1, 0),
         shootings.nn5_mean = ifelse(shootings.nn5 > mean(shootings.nn5), 1, 0),
         shootings.nn5_qt3 = ifelse(shootings.nn5 > 3003.25, 1, 0),
         shootings_log = log(shootings.nn5),
         tenure5y_med = ifelse(tenure_5yr_pct > median(tenure_5yr_pct), 1, 0),
         tenure5y_mean = ifelse(tenure_5yr_pct > mean(tenure_5yr_pct), 1, 0),
         tenure5y_qt3 = ifelse(tenure_5yr_pct > 0.073214, 1, 0),
         tenure5y_log = log(tenure_5yr_pct),
         unemployed_med = ifelse(unemployed_pct > median(unemployed_pct), 1, 0),
         unemployed_mean = ifelse(unemployed_pct > mean(unemployed_pct), 1, 0),
         unemployed_qt3 = ifelse(unemployed_pct > 0.04807, 1, 0),
         unemployed_log = log(unemployed_pct),
         wfh_med = ifelse(wfh_2018 > median(wfh_2018), 1, 0),
         wfh_mean = ifelse(wfh_2018 > mean(wfh_2018), 1, 0),
         wfh_qt3 = ifelse(wfh_2018 > 0.050053, 1, 0),
         wfh_log = log(wfh_2018))

Create additional nearest-neighbor variables to account for exposures

I create nearest neighbor variables for cut-offs for continuous demographic variables such as poverty50.nn3, or the distance to the nearest 3 grid cells where the poverty rate is above 50%. For each variable, I create a few iterations with a different number of neighbors for later testing.

data_zip <- data_zip %>%
  mutate(phousing.nn1 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(p_housing_units >0))),1),
           phousing.nn5 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(p_housing_units >0))),5),
            poverty30.nn3 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(poverty_pct > 0.3))),3),
            # poverty50.nn3 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(poverty_pct > 0.5))),3),
           renter70.nn3 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(renter_pct > 0.7))),3),
           unemployed5.nn3 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(renter_pct > 0.05))),3),
           public_asst2.nn3 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(public_asst_pct > 0.02))),3),
           ice_qt1.nn1 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(race_ice < -0.02))),1),
           ice_qt1.nn5 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(race_ice < -0.02))),5),
           ice_qt4.nn1 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(race_ice > .32))),1),
           ice_qt4.nn5 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(race_ice > .32))),5))

threshold_nn_list <- c("phousing.nn1", "phousing.nn5", "poverty30.nn3", "renter70.nn3", "unemployed5.nn3", "public_asst2.nn3", "ice_qt1.nn1", "ice_qt1.nn5", "ice_qt4.nn1", "ice_qt4.nn5")

threshold_nn_vars_desc_list <- c("Distance to nearest public housing project", "Average distance to nearest 5 public housing projects", "Average distance to nearest ZIP code where poverty > 30%", "Average distance to nearest 3 ZIP codes where renters > 70%", "Average distance to nearest 3 ZIP codes where unemployment > 5%", "Average distance to nearest 3 ZIP codes where public assistance use > 2%", "Distance to nearest ZIP code where ICE > citywide 25th percentile", "Average distance to nearest 5 ZIP codes where ICE > citywide 25th percentile", "Distance to nearest ZIP code where ICE > citywide 75th percentile", "Average distance to nearest 5 ZIP codes where ICE > citywide 75th percentile") 

threshold_nn_vars <- as.data.frame(threshold_nn_list) %>%
  rename(Variable = threshold_nn_list) %>%
  mutate(col = seq.int(nrow(.)))

threshold_nn_vars_desc <- as.data.frame(threshold_nn_vars_desc_list) %>%
  rename(Description = threshold_nn_vars_desc_list) %>%
  mutate(col = seq.int(nrow(.)))

left_join(threshold_nn_vars, threshold_nn_vars_desc) %>%
  dplyr::select(-col) %>%
  pander(caption = "Nearest Neighbor Variables") 

Joining, by = “col”

Nearest Neighbor Variables
Variable Description
phousing.nn1 Distance to nearest public housing project
phousing.nn5 Average distance to nearest 5 public housing projects
poverty30.nn3 Average distance to nearest ZIP code where poverty > 30%
renter70.nn3 Average distance to nearest 3 ZIP codes where renters > 70%
unemployed5.nn3 Average distance to nearest 3 ZIP codes where unemployment > 5%
public_asst2.nn3 Average distance to nearest 3 ZIP codes where public assistance use > 2%
ice_qt1.nn1 Distance to nearest ZIP code where ICE > citywide 25th percentile
ice_qt1.nn5 Average distance to nearest 5 ZIP codes where ICE > citywide 25th percentile
ice_qt4.nn1 Distance to nearest ZIP code where ICE > citywide 75th percentile
ice_qt4.nn5 Average distance to nearest 5 ZIP codes where ICE > citywide 75th percentile

Association and Correlation Testing

Correlations

After testing all variables for correlation with mortality ratio, I found that factors like the share of workers in service, education, health and social work are more strongly correlated with mortality ratio than race. Socioeconomic variables like median income, education, and vehicle ownership are also correlated with mortality ratio. Poverty, unemployment, and english proficiency are also correlated with mortality ratio. All of these correlations are statistically signficant. As a whole, these correlations suggest that neighborhood stressers, living conditions, and employment status and industry all affect a community’s Covid-19 mortality risk.

The 9 strongest positive correlations with mortality ratio includes a mix of demographics, employment, housing, and socioeconomic factors. Surprisingly the two strongest positive correlations are the percentage share of service workers and the percentage share of workers in health, education, and social work.

pos_cors <- c("edu_health_soc_worker","service_worker","edu_hs","housing_condition_pct","black_log","foreign_born","unemployed_med","poverty_log","lim_english_binary")
neg_cors <- c("race_ice","edu_bac","medhhinc","tenure5y_qt3","safmr4","white_pct", "no_car.qt3","wfh_2018","no_veh_hh","vacancy_pct")

correlation.long <-
  st_drop_geometry(data_zip) %>%
    dplyr::select( -zcta, -BOROUGH_GROUP, -COVID_CASE_RATE,  -COVID_DEATH_RATE, -other_race_pct, -PERCENT_POSITIVE,-TEST_RATE, -total_workers, -population, -hh_size_owner, -age_25plus) %>%
    gather(Variable, Value, -MORT_RATIO) %>%
  mutate(Value = as.numeric(Value))

correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, MORT_RATIO, use = "complete.obs"))

neg_cor_list <- c("race_ice", "edu_bac", 'medhhinc', "tenure5y_qt3", "safmr4", 'edu_prof', "white_pct", "no_car.qt3", "wfh_2018")

pos_cor_list <- c("edu_health_soc_worker", "service_worker", "edu_hs", "housing_condition_pct", "black_log", "edu_ms", "foreign_born", "unemployed_med", "poverty_log")

cor.neg <- correlation.cor %>%
  filter(correlation < 0) %>%
  filter(Variable %in% neg_cor_list)
cor.neg <- cor.neg[order(cor.neg$correlation),] %>%
  mutate(col = seq.int(nrow(.)))
cor.neg.desc.list <- c("Racialized Index of Concentration at the Extremes; -1 to 1","Percentage share of population older than 25 with Bachelor's degree","Median household income","Binary: percentage share of renters who have lived in their current address for 5 years or longer > 7.3% (75th percentile)","HUD Housing Choice Voucher rent limit for 4 bedroom unit","Percentage share of population older than 25 with professional degree","Percentage share of White residents","Binary: percentage share of carless households > 34% (75th percentile)","Percentage share of workers who worked from home in 2018")
cor.neg.desc <- as.data.frame(cor.neg.desc.list) %>%
  rename(Description = cor.neg.desc.list) %>%
  mutate(col = seq.int(nrow(.)))
cor.neg <- left_join(cor.neg, cor.neg.desc) %>%
  dplyr::select(-col)


cor.pos <- correlation.cor %>%
  filter(correlation > 0) %>%
  filter(Variable %in% pos_cor_list)
cor.pos <- cor.pos[order(-cor.pos$correlation),] %>%
  mutate(col = seq.int(nrow(.)))
cor.pos.desc.list <- c("Percentage share of workers working in education, health, or social work industry","Percentage share of workers in service industry","Percentage share of adults older than 25 with only high school degree","Percent of housing units with 1 or more physical conditions","Log-transformed percentage share of Black residents","Percentage share of adults older than 25 with only middle school degree","Percentage share of residents who are foreign-born","Binary: percentage share of residents older than 16 and unemployed > 3.8% (median)","Log-transformed percentage share of residents living below poverty line")
cor.pos.desc <- as.data.frame(cor.pos.desc.list) %>%
  rename(Description = cor.pos.desc.list) %>%
  mutate(col = seq.int(nrow(.)))
cor.pos <- left_join(cor.pos, cor.pos.desc) %>%
  dplyr::select(-col)
    
    


pander(cor.pos, caption = "Top 9 Positive Correlations with Mortality Ratio") 
Top 9 Positive Correlations with Mortality Ratio
Variable correlation Description
edu_health_soc_worker 0.5495 Percentage share of workers working in education, health, or social work industry
service_worker 0.5335 Percentage share of workers in service industry
edu_hs 0.4795 Percentage share of adults older than 25 with only high school degree
housing_condition_pct 0.4496 Percent of housing units with 1 or more physical conditions
black_log 0.4121 Log-transformed percentage share of Black residents
edu_ms 0.3996 Percentage share of adults older than 25 with only middle school degree
foreign_born 0.3927 Percentage share of residents who are foreign-born
unemployed_med 0.381 Binary: percentage share of residents older than 16 and unemployed > 3.8% (median)
poverty_log 0.3679 Log-transformed percentage share of residents living below poverty line
ggplot(correlation.long %>% filter(Variable %in% pos_cors), aes(Value, MORT_RATIO)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor%>% filter(Variable %in% pos_cors), aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "magenta") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Positive Correlation") +
  plotTheme()

The strongest negative correlations are primarily socioeconomic, education, and housing variables. The strongest correlation is radicalized index of concentration at the extremes, but this is closely followed by the share of residents with Bachelors degrees, median household income, and share of renters with 5 years or more of tenure.

pander(cor.neg, caption = "Top 9 Negative Correlations with Mortality Ratio") 
Top 9 Negative Correlations with Mortality Ratio
Variable correlation Description
race_ice -0.5966 Racialized Index of Concentration at the Extremes; -1 to 1
edu_bac -0.5802 Percentage share of population older than 25 with Bachelor’s degree
medhhinc -0.5791 Median household income
tenure5y_qt3 -0.5357 Binary: percentage share of renters who have lived in their current address for 5 years or longer > 7.3% (75th percentile)
safmr4 -0.5319 HUD Housing Choice Voucher rent limit for 4 bedroom unit
edu_prof -0.4999 Percentage share of population older than 25 with professional degree
white_pct -0.4971 Percentage share of White residents
no_car.qt3 -0.4824 Binary: percentage share of carless households > 34% (75th percentile)
wfh_2018 -0.4603 Percentage share of workers who worked from home in 2018
ggplot(correlation.long %>% filter(Variable %in% cor.neg$Variable), aes(Value, MORT_RATIO)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor%>% filter(Variable %in% cor.neg$Variable), aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "purple") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Negative Correlation") +
  plotTheme()

Difference in means

Following, I use T-tests to examine whether there is a statistically significant difference in mean Mortality Ratio for the 3 binary variables with the strongest correlations:

  1. unemployed_med: unemployment rate > 3.8%
ggplot() +
  geom_sf(data = data_zip, aes(fill = as.character(unemployed_med)), colour = "transparent") +
  scale_fill_viridis(discrete = TRUE, option = "magma", name = "Unemployment > 3.8% (median)", direction = -1) +
  labs(title = "Unemployment Binary") +
  mapTheme()

  1. tenure5y_qt3: share of renters in a ZIP code with more than 5 year tenure > 7.3%
ggplot() +
  geom_sf(data = data_zip, aes(fill = as.character(tenure5y_qt3)), colour = "transparent") +
  scale_fill_viridis(discrete = TRUE, direction = -1, option = "magma", name = "5 year tenure > 7.3%\n (75th percentile)") +
  labs(title = "5 Year Renter Tenure Binary") +
  mapTheme()

  1. no_car.qt3: share of households with 0 vehicles available > 34%
ggplot() +
  geom_sf(data = data_zip, aes(fill = as.character(no_car.med)), colour = "transparent") +
  scale_fill_viridis(discrete = TRUE, direction = -1, option = "magma", name = "Carless households > 34%\n (75th percentile)") +
  labs(title = "No Vehicle Households Binary") +
  mapTheme()

For all 3 variables, there is a statistically significant difference in means. Mortality ratio is 4 times higher in areas where unemployment > 3.8%. Mortality ratio is 8 times higher in neighborhoods where less than 7.3% of renters have 5 years or more of tenure. Neighborhoods with more carless households generally have a lower mortality ratio.

Model Building

Baseline Model

To build my first model, I use a backwards stepwise function to iteratively remove independent variables from a list of all of the variables and test for improvements in model fit. The optimal model returned by the function still has a very long list of variables, but it’s a good starting point. It has a high R-squared and adjusted R-squared, of 0.92 and 0.85, which suggests strong fit. However, from the results alone we can see many colinear variables such as the log and binary variables for percentage share of Black residents. Additionally, many of the variables are not statistically significant predictors.

mod_0 <- lm(MORT_RATIO ~ p_housing_units + median_age + white_pct + 
    female_hh_pct + public_asst_pct + renter_pct + tenure_5yr_pct + 
    housing_condition_pct + unemployed_pct + race_ice + apartments + 
    no_veh_hh + total_workers + transit_commute + wfh_2018 + 
    service_worker + age_25plus + edu_bac + edu_prof + foreign_born + 
    no_insurance + live_alone + age_65_74 + healthcare.nn5 + 
    busstop.nn5 + evictions.nn5 + schools.nn5 + shootings.nn5 + 
    bronx + manhattan + queens + black_med + black_maj + black_log + 
    hispanic_mean + hispanic_maj + white_log + nonwhite_qt3 + 
    bus.nn5_qt3 + grade8_med + grade8_qt3 + prof_degree_med + 
    prof_degree_mean + prof_degree_qt3 + prof_degree_log + evict.nn5_med + 
    evict.nn5_mean + evict.nn5_qt3 + evict.nn5_log + female_hh_mean + 
    female_hh_qt3 + female_hh_log + health.nn5_med + health.nn5_mean + 
    health.nn5_qt3 + healthcare.nn5_log + lim_english_binary + 
    lim_english_mean + lim_english_qt3 + live_alone_med + live_alone_mean + 
    live_alone_qt3 + live_alone_log + no_car.med + no_car.mean + 
    no_car.qt3 + poverty_qt3 + poverty_log + public_asst_med + 
    public_asst_qt3 + rice_binary + renter_mean + renter_log + 
    safmr4_med + safmr_log + schools.nn5_med + schools.nn5_mean + 
    schools.nn5_qt3 + schools_log + shootings.nn5_med + shootings.nn5_mean + 
    shootings_log + tenure5y_mean + tenure5y_qt3 + tenure5y_log + 
    unemployed_med + unemployed_log + wfh_med + wfh_mean + wfh_qt3 + 
    wfh_log + phousing.nn1 + phousing.nn5 + unemployed5.nn3 + 
    public_asst2.nn3 + ice_qt1.nn1 + ice_qt1.nn5 + ice_qt4.nn5 + 
    poverty30.nn3, data = data_final)

pander(summary(mod_0), caption = "Baseline Model")
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.294 7.334 1.267 0.2089
p_housing_units -0.0001008 3.987e-05 -2.529 0.01348
median_age 0.1495 0.03349 4.465 2.706e-05
white_pct 2.766 1.14 2.426 0.01763
female_hh_pct -36.01 10.34 -3.482 0.0008243
public_asst_pct 10.1 12.99 0.7775 0.4392
renter_pct 3.437 1.62 2.121 0.03715
tenure_5yr_pct -22.1 7.57 -2.92 0.004596
housing_condition_pct -8.193 2.154 -3.804 0.0002836
unemployed_pct 15.33 13.52 1.134 0.2602
race_ice -5.14 2.059 -2.496 0.01468
apartments 1.362 0.589 2.313 0.02338
no_veh_hh 2.291 2.417 0.948 0.3461
total_workers 0.0003498 0.0003115 1.123 0.2649
transit_commute 0.9136 0.8096 1.128 0.2627
wfh_2018 2.211 8.675 0.2548 0.7995
service_worker -3.044 2.627 -1.159 0.2501
age_25plus -0.0002785 0.0002146 -1.298 0.1982
edu_bac -2.477 2.272 -1.09 0.2791
edu_prof -6.8 4.73 -1.438 0.1546
foreign_born 1.128 1.174 0.9605 0.3398
no_insurance -3.162 2.039 -1.551 0.125
live_alone 20.88 6.942 3.007 0.003559
age_65_74 -6.948 5.142 -1.351 0.1806
healthcare.nn5 6.943e-06 3.802e-05 0.1826 0.8556
busstop.nn5 2.797e-05 1.902e-05 1.471 0.1454
evictions.nn5 8.977e-05 6.775e-05 1.325 0.1891
schools.nn5 -0.0002754 0.0001311 -2.101 0.03895
shootings.nn5 -0.0002047 6.593e-05 -3.105 0.002663
bronx -0.3216 0.365 -0.8811 0.381
manhattan -1.038 0.3367 -3.083 0.002844
queens -0.2816 0.2618 -1.076 0.2854
black_med 0.1917 0.2291 0.837 0.4052
black_maj -0.7775 0.2674 -2.908 0.004757
black_log -0.1339 0.1149 -1.165 0.2477
hispanic_mean -0.2722 0.1491 -1.825 0.07191
hispanic_maj 0.3455 0.2268 1.523 0.1318
white_log -0.1319 0.2346 -0.5624 0.5755
nonwhite_qt3 0.2855 0.2115 1.35 0.1811
bus.nn5_qt3 -0.1092 0.2504 -0.4359 0.6642
grade8_med 0.2344 0.1554 1.508 0.1357
grade8_qt3 0.2884 0.1479 1.949 0.05494
prof_degree_med 0.566 0.1712 3.306 0.001437
prof_degree_mean -0.8206 0.3263 -2.515 0.01398
prof_degree_qt3 1.161 0.3552 3.268 0.00162
prof_degree_log 0.302 0.1369 2.206 0.03033
evict.nn5_med -0.07806 0.1896 -0.4116 0.6818
evict.nn5_mean 0.09519 0.1988 0.4788 0.6334
evict.nn5_qt3 -0.233 0.2161 -1.078 0.2843
evict.nn5_log -0.2644 0.3019 -0.876 0.3838
female_hh_mean 0.4387 0.2547 1.723 0.08899
female_hh_qt3 0.1998 0.2781 0.7184 0.4747
female_hh_log 3.855 1.192 3.233 0.001805
health.nn5_med 0.1647 0.1881 0.8753 0.3841
health.nn5_mean -0.385 0.2145 -1.795 0.07663
health.nn5_qt3 0.5226 0.2391 2.186 0.03187
healthcare.nn5_log -0.7863 0.3097 -2.539 0.01313
lim_english_binary -0.1058 0.1363 -0.7758 0.4402
lim_english_mean 0.352 0.1669 2.109 0.03821
lim_english_qt3 -0.6304 0.1834 -3.436 0.0009532
live_alone_med 0.2061 0.1892 1.089 0.2795
live_alone_mean -0.6473 0.2508 -2.581 0.01176
live_alone_qt3 0.7925 0.2705 2.929 0.004469
live_alone_log -4.849 1.269 -3.822 0.0002669
no_car.med -0.5682 0.2578 -2.204 0.03049
no_car.mean -0.3253 0.2371 -1.372 0.1741
no_car.qt3 -0.1409 0.2503 -0.5628 0.5752
poverty_qt3 -0.1869 0.191 -0.9785 0.3309
poverty_log 0.1554 0.2564 0.606 0.5463
public_asst_med -0.2082 0.1589 -1.311 0.1939
public_asst_qt3 -0.1478 0.1956 -0.7555 0.4522
rice_binary -0.91 0.2868 -3.173 0.002165
renter_mean -0.4465 0.1969 -2.267 0.02616
renter_log -0.3832 0.4796 -0.799 0.4267
safmr4_med 0.2729 0.1586 1.721 0.08932
safmr_log -0.5574 0.734 -0.7594 0.4499
schools.nn5_med -0.1697 0.2154 -0.7879 0.4332
schools.nn5_mean -0.3048 0.2123 -1.436 0.1551
schools.nn5_qt3 0.4584 0.184 2.491 0.01488
schools_log 0.6456 0.2864 2.254 0.02705
shootings.nn5_med 0.2947 0.1536 1.918 0.05881
shootings.nn5_mean 0.1583 0.1637 0.9669 0.3366
shootings_log -0.186 0.1353 -1.375 0.173
tenure5y_mean -0.4905 0.2243 -2.187 0.03177
tenure5y_qt3 0.1397 0.285 0.4903 0.6253
tenure5y_log 1.502 0.5019 2.992 0.00372
unemployed_med 0.7974 0.1771 4.503 2.344e-05
unemployed_log -1.729 0.5946 -2.908 0.004749
wfh_med 0.3879 0.181 2.143 0.03526
wfh_mean -0.5991 0.1856 -3.229 0.001829
wfh_qt3 0.3183 0.2067 1.54 0.1277
wfh_log -0.3616 0.2517 -1.437 0.1548
phousing.nn1 4.528e-05 2.898e-05 1.563 0.1223
phousing.nn5 -9.464e-05 5.082e-05 -1.862 0.0664
unemployed5.nn3 0.0001696 7.534e-05 2.252 0.02719
public_asst2.nn3 1.163e-05 1.678e-05 0.693 0.4904
ice_qt1.nn1 -0.0001689 4.412e-05 -3.828 0.000261
ice_qt1.nn5 0.0001998 5.344e-05 3.739 0.0003535
ice_qt4.nn5 1.542e-05 1.347e-05 1.145 0.2557
poverty30.nn3 -3.196e-05 1.501e-05 -2.129 0.03642
Baseline Model
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
177 0.4039 0.9127 0.8004

Colinearity

For my next model, I eliminated all insignificant variables. Next, I selected the variables with the highest absolute estimate in instances where I have multiple variations of a variable (e.g. log-transformed, median binary, mean binary). I additionally checked for further colinear variables. In this process, I found a few interesting relationships:

RICE eliminates the need for many variables such as race, median household income, housing conditions, and service workers.

ggplot(data = data_final, aes(x = race_ice, y = service_worker)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "magenta") +
  labs(title = "Index of Concentration at the Extremes and Service Workers", subtitle = "cor = -0.88") +
  plotTheme()

The percent of renters is colinear with evictions, percent of commuters using transit, shootings, and distance from healthcare facilities.

ggplot(data = data_final, aes(x = renter_pct, y = transit_commute)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "magenta") +
  labs(title = "Percent Renter and Transit Commutes", subtitle = "cor = 0.82") +
  plotTheme()

Living alone is colinear with living in Manhattan and professional degrees.

ggplot(data = data_final, aes(x = live_alone, y = prof_degree_qt3)) +
  geom_point() +
  geom_smooth(method = "glm", se = FALSE, colour = "magenta") +
  labs(title = "Living Alone and Professional Degrees", subtitle = "cor = 0.69") +
  plotTheme()

Model 1: Demographics and living conditions

With an R-squared of 0.65 and p-value of less than 0.001, my first model is statistically significant and accurate. Immediately standing out as strong predictors are percent renter, Queens, race_ice, Staten Island, apartments, uninsured residents, residents living alone, distance to transit, households without cars. This tells us an interesting story: while racialized index of concentration at the extremes (race_ice) has strong explanatory power, variables describing communities’ living conditions are stronger predictors of mortality ratio. For instance, the strongest predictor is the share of residents who live alone. The more residents who live alone, the lower the mortality ratio. Similarly important is the share of householders who are renters (renter_pct). ZIP codes with more renters experience higher mortality ratios. I initially suspected that these relationships were spurious, and that these variables would be highly colinear with race and income. However, they have low correlations with both. As expected, ICE is also a strong predictor: for each unit increase in ICE, which is associated with a higher concentration of White residents in the 75th percentile of median household income, mortality ratio decreases by 2%.

library(car)
mod_1 <- lm(MORT_RATIO ~ median_age  + renter_pct + race_ice + apartments + no_insurance + live_alone + bus.nn5_qt3  + no_car.med , data = data_final)
pander(summary(mod_1), caption = "Model 1")  
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.817 1.395 1.302 0.1947
median_age 0.09769 0.01358 7.194 1.984e-11
renter_pct 2.274 0.4652 4.888 2.366e-06
race_ice -1.934 0.2615 -7.398 6.3e-12
apartments 0.9579 0.3027 3.165 0.001844
no_insurance -5.046 1.334 -3.782 0.0002158
live_alone -2.547 0.8949 -2.846 0.004974
bus.nn5_qt3 0.2942 0.1212 2.427 0.01627
no_car.med -0.4474 0.1358 -3.295 0.001201
Model 1
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
177 0.5238 0.6797 0.6644

Model 2: Demographics, living conditions, and education

To build my next model, I tested additional demographic variables as well as housing and employment variables. Adding grade8_med, a binary variable based on the citywide median of share of residents with only a middle school degree increased the model’s R-squared by 0.02. While variables such as service workers, health and education workers, unemployment, poverty rate, and public assistance are all strongly correlated with mortality ratio, I ultimately did not include them because they are colinear with race_ice and grade8_med. Similarly, the shares of foreign born residents and households with limited english fluency were insignificant and did not contribute to the model. In this model, the share of renters remains an important predictor, but its estimate and significance level have both decreased. By contrast, adding middle school education levels makes share of apartments a stronger and more significant predictor. One reason for this could be that by controlling for education level, the model separates the shares of people who are renting out of necessity and preference.

# anova(mod_0, mod_1, mod_1_g8, mod_2)
# cor.test(data_zip$lim_english_binary, data_zip$bus.nn5_qt3)
mod_2 <- lm(MORT_RATIO ~ median_age  + renter_pct + race_ice + apartments + no_insurance + live_alone + bus.nn5_qt3  + no_car.med + grade8_med, data = data_final)
pander(summary(mod_2), caption = "Model 2")    
  Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.26 1.398 0.9012 0.3688
median_age 0.09676 0.01341 7.216 1.783e-11
renter_pct 1.937 0.4814 4.023 8.689e-05
race_ice -1.829 0.262 -6.98 6.614e-11
apartments 1.041 0.3009 3.46 0.000686
no_insurance -4.441 1.342 -3.31 0.001144
live_alone -1.87 0.9298 -2.011 0.04593
bus.nn5_qt3 0.3519 0.1222 2.881 0.004489
no_car.med -0.4423 0.134 -3.3 0.001183
grade8_med 0.2639 0.1131 2.334 0.02078
Model 2
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
177 0.517 0.6898 0.6731

Model 3: Demographics, living conditions, education, and borough interactions

For my final model, I test the effect of adding borough interaction terms to some of the variables. In the exploratory analysis, I found that a few variables such as race and age seem to have a stronger association with mortality ratio in some boroughs than others. Adding the boroughs categorical variable as an interaction term to the binary variable for median share of residents with only middle school education increased the R-squared by 0.06. Interestingly, adding this also makes the share of people living alone and the share of people who are uninsured insignificant, so I removed these variables. Share of renters, ICE, and households with no cars also become weaker and less significant predictors. With the addition of these interaction terms, the share of apartments and distance to transit become more important and significant predictors. Among the boroughs, Manhattan and Queens have the strongest and most significant interactions with grade 8 education. Overall, this model is highly accurate and statistically signficant, with an R-squared of 0.72 and a p-value of well below 0.0001.

mod_3 <- lm(MORT_RATIO ~ median_age + race_ice + apartments  + bus.nn5_qt3  + no_car.med + BOROUGH_GROUP*grade8_med, data = data_final)
pander(summary(mod_3), caption = "Model 3")
Table continues below
  Estimate Std. Error t value
(Intercept) -0.7027 0.5951 -1.181
median_age 0.07952 0.01152 6.904
race_ice -2.195 0.2538 -8.647
apartments 1.523 0.2493 6.108
bus.nn5_qt3 0.4008 0.1275 3.143
no_car.med -0.1761 0.1168 -1.508
BOROUGH_GROUPManhattan -0.9191 0.3305 -2.781
BOROUGH_GROUPStaten Island -0.7328 0.3646 -2.01
BOROUGH_GROUPQueens -0.6943 0.3259 -2.131
BOROUGH_GROUPBrooklyn 0.1518 0.3354 0.4525
grade8_med -0.2139 0.3368 -0.6351
BOROUGH_GROUPManhattan:grade8_med 0.9928 0.3749 2.648
BOROUGH_GROUPStaten Island:grade8_med 0.5454 0.6326 0.8622
BOROUGH_GROUPQueens:grade8_med 1.032 0.3524 2.928
BOROUGH_GROUPBrooklyn:grade8_med 0.09724 0.3658 0.2658
  Pr(>|t|)
(Intercept) 0.2394
median_age 1.091e-10
race_ice 4.916e-15
apartments 7.232e-09
bus.nn5_qt3 0.001991
no_car.med 0.1336
BOROUGH_GROUPManhattan 0.006066
BOROUGH_GROUPStaten Island 0.04609
BOROUGH_GROUPQueens 0.03464
BOROUGH_GROUPBrooklyn 0.6515
grade8_med 0.5263
BOROUGH_GROUPManhattan:grade8_med 0.008891
BOROUGH_GROUPStaten Island:grade8_med 0.3899
BOROUGH_GROUPQueens:grade8_med 0.003903
BOROUGH_GROUPBrooklyn:grade8_med 0.7907
Model 3
Observations Residual Std. Error \(R^2\) Adjusted \(R^2\)
177 0.4993 0.7193 0.695

Model Results

My final model suggests a strong relationship between mortality ratio and living conditions, spatial fixed effects, and socioeconomic variables. The three strongest predictors of mortality ratio are the following:

  1. Racialized Index of Concentration at the Extremes
  2. Share of housing units in a ZIP code that are apartments
  3. Interaction term between share of residents with middle school degrees and the ZIP code being in Queens

These results challenge current understandings of the risk factors for Covid-19 cases and deaths. Housing variables are strong predictors for Covid-19 case outcomes. While it is easy to understand why higher shares of apartments, and likely residential density, is associated with higher Covid-19 transmission, it is difficult to interpret why it is correlated with a higher rate of deaths from Covid-19 cases. It is possible that the inclusion of socioeconomic variables like education and relative advantage (ICE) allows the model to identify ZIP codes where the share of renters is determined by the share of people lacking the wealth and other means to live in larger homes. However, even when controlling for RICE and median age, there is a clear relationship between the share of apartments and mortality ratios. Therefore, further research is needed on the mechanisms by which living conditions affect Covid-19 case outcomes in order to inform interventions.

In addition to apartments, transit stop density and the share of households with no cars are negatively correlated with mortality ratio. As with apartments, this runs counter to the conventional wisdom that transit use and lack of private transportation are risks factors for Covid-19 exposure. To further investigate this trend, I plotted three transit density scenarios. I controlled for RICE, changing the average distance to the nearest 5 bus stops. In both the Bronx and Staten Island, the boroughs with the highest and lowest mortality ratios, the negative correlation between transit density and mortality ratio holds. In Staten Island, however, it appears that the correlation between transit stop density and mortality ratio is stronger, as the distance between the 1000ft and 15000ft scenario lines is much more pronounced than for the Bronx. This suggests that contrary to the belief that transit poses a risk during the pandemic, it may serve as an asset for mitigating the burden of Covid-19 on communities.

Among the socioeconomic variables, RICE is an important predictor that confirms the current literature documenting the disparate impacts of Covid-19 on Black and low-income communities. RICE is a metric which compares the share of White households in the top 25th percentile of median household income with the share of Black households in the lowest 25th percentile, and ranges from -1 to 1 with 1 considered the most advantaged. In this model, a unit increase in RICE is associated with a decrease of 1.8% in mortality ratio. Considering that the citywide average mortality ratio is 2.1%, the effect of income and race is substantial. Education, too is an important predictor in the model. Attainment of a middle school degree is the strongest of the educational variables that I tested. Citywide, the higher the share of adults above 25 years old with only a middle school degree, the higher the mortality ratio. This can be interpreted as areas with an overall lower level of education are likely to experience higher mortality ratios. Initially this result appeared spurious. However, when I tested variables such as unemployment, poverty, and share of service workers that are strongly correlated with 8th grade education, they did not predict as strongly for mortality ratio. Likewise, demographic variables that I expected to be correlated with higher mortality ratios such as foreign born residents, households with limited english fluency, and age did not predict well.

Lastly, this model demonstrates the impact of spatial fixed effects at the borough level. As I found in my exploratory analysis, ZIP codes in Staten Island, Manhattan, and Queens have significantly lower mortality ratios compared to Brooklyn and the Bronx.

Evaluation

Goodness of fit

Of all the models, my final model has the second lowest AIC, after the baseline model created by backward selection. However, the baseline model has many duplicate variables and suffers from high colinearity, making it overfit. By adding boroughs as interaction terms, I was able to reduce AIC in my final model by 9%. Further, adding borough interaction terms and taking out shares of people living alone and without insurance increased R-squared by 0.03, or 4%.

Fit Metrics by Model
Model df AIC R2
Baseline 101 236.1 0.9127
1 10 284.1 0.6797
2 11 280.4 0.6898
3 16 272.8 0.7193

Generalizability

Residuals by mortality ratio

We now know that the addition of new variables reduces the magnitude of error for each model. What about the models’ performance at different mortality ratios? The baseline model over-predicts mortality ratio at low values and underpredicts at high values. Model 1 improves the accuracy for high mortality ratios. The residuals don’t change visibly between model 1 and 2, but model 3 visibly improves accuracy at high mortality ratios. Still, it is clear that all of the models predict much better at mid and high-range mortality ratios.

Error by borough

Next, I calculate mean absolute percentage error (MAPE) by borough to see if the model predicts with similar accuracy citywide. Ideally, errors should be evenly distributed. Across all 4 models, percentage error is highest in Staten Island and Manhattan. In line with the distribution of residuals, these boroughs have the lowest mortality ratios citywide.

resid_boro_3 <-left_join(data_zip %>%
  dplyr::select(BOROUGH_GROUP, zcta), st_drop_geometry(resid_3), by = "zcta") %>%
  st_as_sf() %>%
  group_by(BOROUGH_GROUP) %>%
  summarise(MAE = mean(MAE),
            MAPE = mean(MAPE),
            fitted = mean(fitted),
            MORT_RATIO = mean(MORT_RATIO))

pander(st_drop_geometry(resid_boro_3), caption = "Model 3")  
Model 3
BOROUGH_GROUP MAE MAPE fitted MORT_RATIO
Bronx 0.7625 32.75 2.585 2.448
Manhattan 0.9922 176.8 2.057 1.515
Staten Island 1.1 82.43 2.403 1.544
Queens 0.6647 38.4 2.247 2.351
Brooklyn 1.103 46.37 1.451 2.301

Error by demographic

While it is useful to know how the model predicts across different boroughs and mortality ratio levels, it is arguably more important to understand how it predicts across demographic contexts. As the exploratory analysis section demonstrated, there is a high degree of racial and economic variation within the boroughs. To effectively inform policy intervention, this model must predict similarly for a diverse range of communities.

The model predicts more accurately in majority non-White contexts. This is likely due to the fact that it predicts more accurately in ZIP codes with higher mortality ratios. On average, error in majority non-White ZIP codes is 0.2% lower than in majority White contexts.

context_acs <- byGeo %>%
  dplyr::select(zcta) %>%
  left_join(acs_zip)
  
raceContextACS <- context_acs %>%
  dplyr::select(white_pct, zcta) %>%
  mutate(race_context = ifelse(white_pct > .5, "Majority_White", "Majority_Non_White"))

nativityContextACS <- context_acs %>%
  dplyr::select(foreign_born, zcta) %>%
  mutate(nativity_context = ifelse(foreign_born < 0.34820, "Less than 35%", "More than 35%"))

iceContextACS <- context_acs %>%
  dplyr::select(race_ice, zcta) %>%
  mutate(ice_context = ifelse(race_ice < 0, "-1 to 0", "0 to 1"))

povertyContextACS <- context_acs %>%
  dplyr::select(poverty_pct, zcta) %>%
  mutate(poverty_context = ifelse(poverty_pct > 0.15, "More than 15%", "Less than 15%"))

ggplot() +
  geom_sf(data = raceContextACS, aes(fill = race_context), color = "white") +
  scale_fill_viridis(option = "D", discrete = TRUE, name = "Majority Group") +
  labs(title = "Racial Context") +
  mapTheme()

  left_join(st_drop_geometry(raceContextACS), st_drop_geometry(resid_3), by = "zcta") %>%
    na.omit() %>%
    group_by(race_context) %>%
    summarize(mean.Error = mean(MAE, na.rm = T)) %>%
    spread(race_context, mean.Error) %>%
    kable(caption = "Mean Error by Racial Context") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(1, color = "black", background = "Yellow") %>%
    row_spec(0, color = "black", background = "white")
Mean Error by Racial Context
Majority_Non_White Majority_White
0.7777365 0.9904661

The model also predicts more accurately in ZIP codes with a poverty rate above the citywide mean of 15%. In ZIP codes with a poverty rate below 15%, the average error is 0.1% higher. While this does not seem substantial, this is a higher percentage error in lower poverty areas because these areas generally have lower mortality ratios.

ggplot() +
  geom_sf(data = povertyContextACS, aes(fill = poverty_context), color = "white") +
  scale_fill_viridis(option = "D", discrete = TRUE, name = "Poverty rate (%)") +
  labs(title = "Poverty", subtitle = "15% reflects the median value citywide") +
  mapTheme()

   left_join(st_drop_geometry(povertyContextACS), st_drop_geometry(resid_3), by = "zcta") %>%
    na.omit() %>%
    group_by(poverty_context) %>%
    summarize(mean.Error = mean(MAE, na.rm = T)) %>%
    spread(poverty_context, mean.Error) %>%
    kable(caption = "Mean Error by Poverty Context") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(1, color = "black", background = "Yellow") %>%
    row_spec(0, color = "black", background = "white")
Mean Error by Poverty Context
Less than 15% More than 15%
0.8192159 0.9692555

Finally, the model predicts more accurately in ZIP codes with a higher share of foreign born residents.

ggplot() +
  geom_sf(data = nativityContextACS, aes(fill = nativity_context), color = "white") +
  scale_fill_viridis(option = "D", discrete = TRUE, name = "Majority Group") +
  labs(title = "Foreign vs US-born Residents") +
  mapTheme()

   left_join(st_drop_geometry(nativityContextACS), st_drop_geometry(resid_3), by = "zcta") %>%
    na.omit() %>%
    group_by(nativity_context) %>%
    summarize(mean.Error = mean(MAE, na.rm = T)) %>%
    spread(nativity_context, mean.Error) %>%
    kable(caption = "Mean Error by Foreign Born Context") %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(1, color = "black", background = "Yellow") %>%
    row_spec(0, color = "black", background = "white")
Mean Error by Foreign Born Context
Less than 35% More than 35%
0.9935431 0.7699136

Discussion

The results of this model have important policy implications. First, they demonstrate a relationship between housing and transportation conditions and a community’s Covid-19 resiliency. Adding to existing literature on the positive correlation between housing density and Covid-19 transmission, the model suggests that a higher share of apartments is also correlated with an increased proportion of cases that result in death. With regards to transportation, the model suggests that while Covid-19 transmission is associated with transit use, a higher density of transit in a neighborhood is associated with a lower risk of individuals succumbing to Covid-19.

The model also confirms the role of previously studied demographic variables in influencing Covid-19 outcomes. However, many attributes that have been linked to Covid-19 infection and deaths were not strong predictors for mortality ratio. This includes median age, foreign born residents, people in group housing, and not having health insurance.

While these results are important for understanding the factors influencing a community’s vulnerability to Covid-19 related deaths, this model has a number of limitations. With an R-squared of 0.72, it predicts relatively accurately on average. However, there is a clear drop off in model performance as mortality ratio decreases. The model’s inability to account for areas with lower mortality ratios causes it to over-predict in Staten Island and Manhattan, as these boroughs have the lowest mortality ratios. Similarly, it predicts better for ZIP codes that are majority non-White, experience higher poverty, and have a higher share of foreign born residents. Considering the use case of developing a model for allocating resources to more vulnerable communities that have been disparately affected by the pandemic, this is a fair tradeoff, as the model can still identify the most at-risk ZIP codes in terms of mortality ratio.

In all, the results of this model make it clear that mortality ratio has very different risk factors than the metrics which have traditionally been used to inform Covid-19 policies. In future interventions, it will be important to adopt practices that prioritize neighborhoods at the highest risk of mortality once infected, and not necessarily the ones with the most cases. While this model has limited ability to predict for mortality ratio in areas where it is lower, it offers an important first step for understanding the factors which cause some areas to suffer disproportionately from Covid-19 transmission. In future research, it would be valuable to study the disparate impacts of Covid-19 on workers in different industries, as the exploratory analysis showed some associations between service, education, and healthcare industry workers and mortality ratio. Similarly, while this study was not able to establish associations between institutions such as public housing and schools and a community’s Covid-19 mortality ratio, future research is needed on the role that public assistance and institutions can play in alleviating the burdens of public health emergencies.

Sources

CDC. “Health Equity Considerations and Racial and Ethnic Minority Groups.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 25 Jan. 2022, https://www.cdc.gov/coronavirus/2019-ncov/community/health-equity/race-ethnicity.html.

CDC. “Omicron Variant: What You Need to Know.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 29 Mar. 2022, https://www.cdc.gov/coronavirus/2019-ncov/variants/omicron-variant.html.

Gallo Marin, Benjamin, et al. “Predictors of Covid ‐19 Severity: A Literature Review.” Reviews in Medical Virology, vol. 31, no. 1, 2020, pp. 1–10., https://doi.org/10.1002/rmv.2146.

Ramírez-Aldana, Ricardo, et al. “Spatial Epidemiological Study of the Distribution, Clustering, and Risk Factors Associated with Early COVID-19 Mortality in Mexico.” PLOS ONE, vol. 16, no. 7, 2021, https://doi.org/10.1371/journal.pone.0254884.

Thomas, Michael M., et al. “Investigating the Association between Mass Transit Adoption and Covid-19 Infections in US Metropolitan Areas.” Science of The Total Environment, vol. 811, 2022, p. 152284., https://doi.org/10.1016/j.scitotenv.2021.152284.