Since the beginning of the Covid-19 pandemic, the Centers for Disease Control and Prevention (CDC) as well as a growing body of literature has demonstrated the disparate impacts of Covid-19 on individuals with lower access to healthcare, essential workers, those in crowded housing, and low-wealth communities. Often, this results in communities of color experiencing a higher rate of Covid-19 deaths and infections. Local governments and non-governmental agencies have targeted specific interventions aimed at mitigating the disparities in the impacts of Covid-19. In New York City, the Department of Health and Mental Hygiene launched the Covid-19 Equity Action Plan which details efforts to improve access to Covid-19 care and resources in communities of color by partnering with community health providers and making Covid-19 guidance available in 20 languages. In Philadelphia, the Black Doctor’s Covid-19 Consortium offered free vaccine clinics in 2021, prioritizing residents in zipcodes with the high positivity and death rates.
While the disparate impacts of Covid-19 are undeniable, the past two years have made me wonder: how do we quantify the disparate impact of Covid-19 on communities? Covid-19 guides from the CDC and local governments tend to look at case, hospitalization, and death rates per 100,000. Accordingly, interventions often allocate resources based on either an individual metric or a composite of the three. However, throughout the pandemic, there has been little discussion within public health and planning about case outcomes, or the likelihood of an individual dying from Covid-19 once infected. This is important, as it can further help allocate resources and inform future pandemic responses based on an understanding of a community’s vulnerability to Covid-19 infection. So far, most discussion of case outcomes has been confined to the medical literature. There, it is well-documented through clinical studies that individuals with attributes such as being older than 55, organ failure, and hypoxia are most likely to die from Covid-19 infection. In the spatial literature, Covid-19 mortality risk has been linked to income, race, transit use, travel, and tourism on a national scale. However, few studies have examined neighborhood-level risk factors and lived conditions which are associated with fatal Covid-19 case outcomes.
The goal of this project is to model Covid-19 Mortality Ratio, defined as the ratio of Covid-19 deaths to infections. From a planning and public health perspective, understanding risk of fatal Covid-19 outcomes is extremely important. Currently, interventions and resource allocation is largely based on individual metrics like case, hospitalization, and death rates, and the severity of the pandemic at large and in communities in mostly understood this way. However, not all communities experiencing a high rate of infection experience a high rate of deaths from infection. By modeling the risk factors that correlate with high likelihood of dying due to Covid-19, we create a more accurate model of vulnerability and risk that can be applied to future crises and inform proactive responses and planning.
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
include = FALSE
)
options(scipen=10000000)
library(tidyverse)
library(sf)
library(RSocrata)
library(viridis)
library(grid)
library(gridExtra)
library(knitr)
library(kableExtra)
library(tidycensus)
library(mapview)
library(lubridate)
library(pander)
library(ggplot2)
library(tibble)
library(FNN)
library(spdep)
library(ggcorrplot)
library(gmodels)
library(vcd)
library(dplyr)
options(tigris_class = "sf")
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
mapTheme <- function(base_size = 12) {
theme(
text = element_text(color = "black", family="Avenir"),
plot.title = element_text(size = 18,colour = "black", hjust = 0.5),
plot.subtitle=element_text(face="italic", hjust = 0.5),
plot.caption=element_text(size = 7, hjust = 0.5),
axis.ticks = element_blank(),
panel.background = element_blank(),
axis.text = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "white", fill=NA, size=2)
)
}
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 12,colour = "black"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "white", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=8),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <-
as.matrix(measureFrom)
measureTo_Matrix <-
as.matrix(measureTo)
nn <-
get.knnx(measureTo, measureFrom, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
I selected New York City for this analysis for several reasons. First, it has experienced a high incidence of both Covid-19 cases and deaths, resulting in a large sample of high resolution Covid-19 data. Additionally, it has been documented that New Yorkers have experienced a high disparity in both Covid-19 exposure and outcomes based on factors including race, median_age, and profession. Finally, its geographical size allows for a large sample size of communities with diverse demographic and socioeconomic attributes.
To conduct this analysis, I relied primarily on four data sources:
New York City Department of Health and Mental Hygiene
- Covid-19 cases, positivity, testing, hospitalization, deaths
US Census Bureau
- American Community Survey 2018 5-year Estimates
New York City OpenData
- Crime, institutions, evictions data
US Department of Housing and Urban Development (HUD)
- Small Area Fair Market Rents (SAFMR) data
- Picture of Subsidized Households Housing Choice Voucher data
- Census tract to zip code crosswalk files
To construct my dependent variable, mortality ratio, I used Covid-19 data collected by the New York City Department of Health and Mental Hygiene. I accessed this data on the department’s Github, and selected an archived dataset from December 1, 2021 to exclude the Omicron variant which has a much lower fatality rate than previous variants. This data is collected at the zip code level, and includes counts as well as rates (per 100,000) for Covid-19 cases, positivity, hospitalizations, deaths, and testing. To calculate mortality ratio, I divided COVID_DEATH_RATE by COVID_CASE_RATE.
safmr <- st_read("/Users/annaduan/Documents/GitHub/NYC\ Covid/Data/nyc_safmrs.csv") %>%
rename(zcta = zip_code) %>%
mutate(safmr0 = as.numeric(X0BR),
safmr1= as.numeric(X1BR),
safmr2= as.numeric(X2BR),
safmr3= as.numeric(X3BR),
safmr4= as.numeric(X4BR)) %>%
dplyr::select(safmr4, zcta)
zones <- st_read("https://data.cityofnewyork.us/resource/pri4-ifjk.geojson") %>% dplyr::select(modzcta) %>%
rename(zcta = modzcta) %>%
left_join(safmr)
hosp <- st_read("https://raw.githubusercontent.com/nychealth/coronavirus-data/master/trends/hosprate-by-modzcta.csv") %>%
dplyr::filter(., !grepl("2022", date))
hosp <- hosp %>%
dplyr::select(-date, -HOSPRATE_Bronx, -HOSPRATE_Brooklyn, -HOSPRATE_Manhattan, -HOSPRATE_Queens, -HOSPRATE_Staten_Island, -HOSPRATE_Citywide)
colnames(hosp)<-gsub("HOSPRATE_","",colnames(hosp))
hosp <- sapply(hosp, as.numeric ) %>%
data.frame() %>%
replace(is.na(.), 0)
colnames(hosp)<-gsub("X","",colnames(hosp))
hosp <- data.frame(t(hosp)) %>%
mutate(hosp = X1+X2+X3+X4+X5+X7+X8+X9+X10+X11+X12+X13+X14+X15+X16+X17+X18+X19+X20+X21+X22) %>% dplyr::select(hosp) %>%
tibble::rownames_to_column(., var = "zcta")
byGeo <- st_read("https://raw.githubusercontent.com/nychealth/coronavirus-data/b51d9e4f2479435faefe261822649249b175a2d0/totals/data-by-modzcta.csv") %>% #commit Jan 2 2022 to account for Omicron
rename(zcta = MODIFIED_ZCTA)
byGeo <- left_join(zones, byGeo, by = "zcta") %>%
filter(zcta != "99999")
byGeo$MORT_RATIO <- (as.numeric(byGeo$COVID_DEATH_RATE)/as.numeric(byGeo$COVID_CASE_RATE))*100
byGeo <- byGeo %>%
st_transform("EPSG:2263") %>%
dplyr::select(-NEIGHBORHOOD_NAME, -label, -lat, -lon) %>%
left_join(hosp, by = "zcta") %>%
mutate(COVID_CASE_RATE = as.numeric(COVID_CASE_RATE),
COVID_DEATH_RATE = as.numeric(COVID_DEATH_RATE),
PERCENT_POSITIVE = as.numeric(PERCENT_POSITIVE),
COVID_CONFIRMED_DEATH_RATE = as.numeric(COVID_CONFIRMED_DEATH_RATE),
COVID_CONFIRMED_CASE_RATE = as.numeric(COVID_CONFIRMED_CASE_RATE),
POP_DENOMINATOR = as.numeric(POP_DENOMINATOR),
TOTAL_COVID_TESTS = as.numeric(TOTAL_COVID_TESTS)) %>%
dplyr::select(zcta, safmr4, BOROUGH_GROUP, COVID_CASE_RATE, POP_DENOMINATOR, COVID_DEATH_RATE, PERCENT_POSITIVE, TOTAL_COVID_TESTS, MORT_RATIO, hosp)
grid.arrange(
ggplot()+
geom_sf(data = byGeo, colour = "transparent", aes(fill = as.numeric(COVID_CASE_RATE)), size = 0.5) +
scale_fill_viridis(option = 'A', name = "Cases/ \n100,000") +
labs(title = "Cases per 100k") +
mapTheme() +theme(legend.position="bottom")
,
ggplot()+
geom_sf(data = byGeo, colour = "transparent", aes(fill = as.numeric(hosp)), size = 0.5) +
scale_fill_viridis(option = 'A', name = "Hospitalizations/ \n100,000") +
labs(title = "Hospitalizations per 100k") +
mapTheme() +theme(legend.position="bottom")
,
ggplot()+
geom_sf(data = byGeo, colour = "transparent", aes(fill = as.numeric(COVID_DEATH_RATE)), size = 0.5) +
scale_fill_viridis(option = 'A', name = "Deaths/ \n100,000") +
labs(title = "Deaths per 100k") +
mapTheme() +theme(legend.position="bottom")
,
ggplot()+
geom_sf(data = byGeo, colour = "transparent", aes(fill = as.numeric(MORT_RATIO)), size = 0.5) +
scale_fill_viridis(option = 'A', name = "Mortality Ratio (%)") +
labs(title = "Mortality Ratio") +
mapTheme() + theme(legend.position="bottom")
,
ncol = 2, top = "Covid-19 Metrics in New York City by Zip Code", bottom = "Source: https://github.com/nychealth/coronavirus-data/blob/master/totals/data-by-modzcta.csv")
The census data in this analysis was pulled from the Census Bureau’s API using the R package, tidycensus. This data encompasses variables including demographics, living conditions, employment, and English fluency at the census tract level. For each variable, I calculated its percentage share of the whole population or its appropriate denominator. I then removed 66 census tracts that had a population of 0 and NA values for median income and median age, ending with 2101 census tracts.
#acs
census_api_key("7aaf667fa42073b560f4759176037374b38cb7fd", install = TRUE, overwrite = TRUE)
vars <- load_variables(year = 2018,
dataset = "acs5",
cache = TRUE)
pull_acs <- function(cty) { #input is a borough name
get_acs(geography = "tract", variables = c("B01003_001", #TOTAL POPULATION
"B01002_001", #MEDIAN AGE
"B01001_026", #TOTAL FEMALE POP
"B02001_002", #TOTAL WHITE ALONE
"B02001_003", #TOTAL BLACK ALONE
"B03003_003", #TOTAL HISPANIC/LATINO
"C18120_006", #TOTAL UNEMPLOYED
"B05003_008", #MALE 18+
"B05003_019", #FEMALE 18+
"B11010_009", #TOTAL FEMALE-HEADED HOUSEHOLD
"B06012_002", #TOTAL BELOW POVERTY
"B19057_002", #TOTAL RECEIVING PUBLIC ASSISTANCE
"B19013_001", #MEDIAN HOUSEHOLD median_age
"B25003_003", #RENTER OCCUPIED
"B09019_002", #TOTAL HOUSEHOLDS
"B25129_039", #RENTER OCC, Moved after 2010
"B25129_003", #OWNER OCC, Moved after 2010)
"B25002_003", #VACANT PROPERTY
"B25002_001", #HOUSING UNIT COUNT
"B25123_003", #OWNER OCC 1+ CONDITION
"B25123_009", #RENTER OCC 1+ CONDITION
"B19001_001", #HOUSEHOLDS W/ KNOWN INCOME
"B19001A_014",#White alone 100-124.99
"B19001A_015",#White alone 125-149.99
"B19001A_016",#White alone 150-199.99
"B19001A_017",#White alone 200+
"B19001B_002",#Black alone under 10
"B19001B_003",#Black alone 10-14.99
"B19001B_004",#Black alone 15-19.99
"B19001B_005",#Black alone 20-24.99
"B25032_010", #owner occupied apt (50+ housing units)
"B25032_021", #renter occupied apt (50 + housing units)
"B25010_002", #avg owner occ household size
"B25010_003", #avg renter occ household size
"B08141_002", #no vehicles
"B08141_016", #work via transit
"B08141_031", #wfh_2018
"C24050_001",# total employed 16+
"C24050_006",# work in retail
"C24050_011",# works n education, health, social assistance
"C24050_029",# works in service
"B15003_001",# age 25+
"B15003_012",# age 25+ 8th grade
"B15003_017",# age 25+ 12th grade diploma
"B15003_022",# age 25+ bachelors
"B15003_024",# age 25+ profesional degree
"B05002_013",# foreign born population
"C16002_013",# limited English household
"B27010_033",# individuals without health insurance
"B25011_042",# renter living alone
"B25011_018",# owner living alone
"B17024_106",# 65-74 years
"B26107_003"# living in group quarters non veteran
), year=2018, state=36, county = cty, geometry=T) %>%
st_transform("EPSG:2263")
}
queens <- pull_acs(81)
kings <- pull_acs(47)
bronx <- pull_acs(5)
richmond <- pull_acs(85)
newyork <- pull_acs(61)
acs <- rbind(queens, kings, bronx, newyork, richmond)
acs <-
acs %>%
dplyr::select( -NAME, -moe) %>%
spread(variable, estimate) %>%
dplyr::select(-geometry) %>%
rename(population = B01003_001,
median_age = B01002_001,
female = B01001_026,
white = B02001_002,
black = B02001_003,
hispanic = B03003_003,
unemployed = C18120_006,
male_18 = B05003_008,
female_18 = B05003_019,
female_hh = B11010_009,
public_asst = B19057_002,
poverty = B06012_002,
medhhinc = B19013_001,
renter_hh = B25003_003,
total_hh = B09019_002,
renter_5_years = B25129_039,
owner_5_years = B25129_003,
vacant = B25002_003,
housing_units = B25002_001,
owner_1condition = B25123_003,
renter_1condition = B25123_009,
known_hh_incomes = B19001_001,
white100_124k = B19001A_014,
white125_149k = B19001A_015,
white150_199k = B19001A_016,
white200k_plus = B19001A_017,
black_10k_below = B19001B_002,
black_10_14k = B19001B_003,
black_15_19k = B19001B_004,
black_20_24k = B19001B_005) %>%
mutate(female_pct = ifelse(population > 0, female / population, 0),
white_pct = ifelse(population > 0, white / population, 0),
black_pct = ifelse(population > 0, black/ population, 0),
hispanic_pct = ifelse(population > 0, hispanic / population, 0),
other_race_pct = ifelse(population > 0, (population - (hispanic + white + black))/population, 0),
adult_18 = male_18 + female_18,
adult_pct = ifelse(population > 0, adult_18 / population, 0),
children_pct = 1 - adult_pct,
female_hh_pct = ifelse(total_hh > 0, female_hh / total_hh, 0),
public_asst_pct = ifelse(total_hh > 0, public_asst / total_hh, 0),
poverty_pct = ifelse(population > 0, poverty / population, 0),
renter_pct = ifelse(housing_units > 0, renter_hh / housing_units, 0),
tenure_5yr_pct = ifelse(total_hh > 0, (renter_5_years + owner_5_years) / total_hh, 0),
vacancy_pct = ifelse(housing_units > 0, vacant / housing_units, 0),
housing_condition_pct = ifelse(housing_units > 0, (owner_1condition + renter_1condition)/housing_units, 0),
unemployed_pct = ifelse(adult_18 > 0, unemployed / adult_18, 0),
race_ice = ifelse(known_hh_incomes >0, ((white100_124k + white125_149k + white150_199k + white200k_plus) - (black_10k_below + black_10_14k + black_15_19k + black_20_24k))/known_hh_incomes, 0),#((white non-hisp over US$100 k) −(black alone under US$25 k))/total population_household income
apartments = ifelse(housing_units > 0, (B25032_010 + B25032_021)/housing_units, 0),
hh_size_renter = B25010_003,
hh_size_owner = B25010_002,
no_veh_hh = ifelse(total_hh > 0, B08141_002/total_hh, 0),
total_workers = C24050_001,
transit_commute = ifelse(total_workers >0, B08141_016/total_workers,0),
wfh_2018 = ifelse(total_workers >0, B08141_031/total_workers,0),
retail_worker = ifelse(total_workers >0, C24050_006/total_workers,0),
edu_health_soc_worker =ifelse(total_workers >0, C24050_011/total_workers,0),
service_worker = ifelse(total_workers >0, C24050_029/total_workers,0),
age_25plus = B15003_001,
edu_ms =ifelse(age_25plus >0, B15003_012/age_25plus,0),
edu_hs = ifelse(age_25plus >0, B15003_017/age_25plus,0),
edu_bac = ifelse(age_25plus >0, B15003_022/age_25plus,0),
edu_prof =ifelse(age_25plus >0, B15003_024/age_25plus,0),
foreign_born=ifelse(population >0, B05002_013/population,0),
limited_english_hh=ifelse(total_hh >0, C16002_013/total_hh,0),
no_insurance=ifelse(population >0, B27010_033/population, 0),
live_alone= ifelse(total_hh >0, (B25011_042+ B25011_018)/total_hh, 0),
age_65_74 = ifelse(population >0, B17024_106/population,0),
group_quarters =ifelse(population >0, B26107_003/population,0)) %>%
dplyr::select(median_age, medhhinc, female_pct, white_pct, black_pct, hispanic_pct, other_race_pct, adult_pct, female_hh_pct, public_asst_pct, poverty_pct, renter_pct, tenure_5yr_pct, vacancy_pct, housing_condition_pct, unemployed_pct, race_ice, GEOID, population,apartments, hh_size_renter, hh_size_owner, no_veh_hh, total_workers, transit_commute, wfh_2018, retail_worker, edu_health_soc_worker, service_worker, age_25plus, edu_ms, edu_hs, edu_bac, edu_prof, foreign_born, limited_english_hh, no_insurance, live_alone, age_65_74, group_quarters )
acs[is.na(acs)] <- 0 #2167
acs <- acs %>%
filter(median_age != 0 & medhhinc != 0 & population != 0) #2101
The rest of my independent variables came from the New York City OpenData portal and HUD’s Picture of Subsidized Housing dataset. From NYC OpenData, I pulled point data on evictions, public transit stops, healthcare facilities, schools, points of interest, and shootings to try and capture potential stressors, constraints, and resources across the city. From HUD’s dataset, I collected census tract level data on public housing availability and use, as the literature has suggested that individuals living in subsidized housing may be more vulnerable to Covid-19 exposure due to higher residential density.
evictions <- st_read("https://data.cityofnewyork.us/resource/6z8x-wfk4.csv") %>%
dplyr::filter(., !grepl("2022", executed_date)) %>%
dplyr::select(
# -court_index_number, -docket_number, -eviction_address, -eviction_apt_num, -marshal_first_name, -marshal_last_name, -borough, -eviction_zip, -ejectment, -community_board, -council_district, -census_tract, -bin, -bbl, -nta, -ejectment, -eviction_possession
longitude, latitude
) %>%
mutate(latitude = as.numeric(latitude),
longitude = as.numeric(longitude),
eviction = 1) %>%
na.omit(., latitude) %>%
st_as_sf(., crs = 4326, coords = c("longitude", "latitude")) %>%
na.omit("latitude") %>%
sf::st_transform('EPSG:2263') %>%
dplyr::select(geometry)
evictions_prox <- st_coordinates(evictions)
busstops <- st_read("https://data.cityofnewyork.us/resource/t4f2-8md7.geojson") %>%
dplyr::select(geometry) %>%
st_transform("EPSG:2263")
healthcare <- st_read("https://data.cityofnewyork.us/resource/f7b6-v6v3.geojson") %>%
sf::st_transform('EPSG:2263') %>%
dplyr::select(geometry)
schools <- st_read("/Users/annaduan/Documents/GitHub/NYC\ Covid/Data/Public_School_Locations\ 2/Public_Schools_Points_2011-2012A.shp") %>%
dplyr::select(geometry) %>%
sf::st_transform('EPSG:2263') %>%
mutate(schools = 1)
poi <- st_read("https://data.cityofnewyork.us/resource/t95h-5fsr.geojson")
shootings_hist <- st_read("/Users/annaduan/Documents/GitHub/NYC\ Covid/Data/shooting_hist/geo_export_eb0fa86e-f53e-4323-8146-851f587c0ca1.shp") %>%
filter(date_occur > "2016-12-31")
shootings_pres <- st_read("/Users/annaduan/Documents/GitHub/NYC\ Covid/Data/shooting_pres/geo_export_fe7f8a1b-8196-4bbf-b798-65c65a6308f0.shp") %>%
filter(date_occur < "2022-01-01")
shootings <- rbind(shootings_hist, shootings_pres) %>%
dplyr::select(geometry) %>%
sf::st_transform('EPSG:2263') %>%
mutate(shootings = 1)
shootings_prox <- shootings %>%
dplyr::select(geometry) %>%
st_coordinates()
## public housing
public_housing <- st_read("/Users/annaduan/Documents/GitHub/NYC\ Covid/Data/public_housing.csv") %>%
filter(grepl("Kings County", Name) | grepl("Queens County", Name) | grepl("Richmond County", Name) | grepl("Bronx County", Name) | grepl("New York County", Name)) %>%
dplyr::select(Code, Subsidized.units.available) %>%
mutate(p_housing_units = as.numeric(Subsidized.units.available)) %>%
transform(Code=str_replace(Code,"=","")) %>%
rename(GEOID = Code) %>%
dplyr::select(GEOID, p_housing_units)
public_housing <- acs %>%
dplyr::select(GEOID) %>%
st_centroid() %>%
left_join(public_housing) %>%
st_as_sf() %>%
filter(is.na(p_housing_units) == F) %>%
dplyr::select(-GEOID)
After cleaning the data, I merged the independent variables with the dependent variable, Covid-19 Mortality Ratio. I had to use the Department of Housing and Urban Development’s (HUD) Crosswalk dataset to convert census-tract level data to the zip code level based on population ratios.
zip_list <- c("10001", "10002", "10003", "10026", "10004", "10005", "10006", "10007", "10009", "10010", "10011", "10012", "10013", "10014", "10016", "10017", "10018", "10030", "10019", "10021", "10022", "10023", "10024", "10025", "10027", "10028", "10029", "10031", "10032", "10033", "10034", "10037", "10035", "10036", "10038", "10040", "11104", "11105", "10039", "10128", "10044", "10065", "10069", "10075", "10280", "10282", "10301", "11106", "10302", "10303", "10304", "10451", "10305", "10306", "10307", "10308", "10309", "10310", "10312", "10314", "10456", "10452", "10453", "10454", "10455", "10457", "10458", "10459", "10460", "10461", "11004", "11101", "10462", "10463", "10464", "10465", "10466", "10467", "10468", "10469", "10470", "11102", "11109", "10471", "10472", "10473", "10474", "11103", "10475", "11201", "11203", "11204", "11205", "11206", "11207", "11209", "11208", "11210", "11211", "11212", "11213", "11218", "11214", "11215", "11216", "11217", "11219", "11220", "11221", "11222", "11223", "11224", "11225", "11226", "11228", "11230", "11415", "11229", "11231", "11358", "11232", "11233", "11234", "11235", "11236", "11237", "11238", "11239", "11354", "11355", "11356", "11357", "11360", "11372", "11361", "11362", "11416", "11363", "11364", "11373", "11365", "11366", "11367", "11374", "11368", "11417", "11369", "11370", "11375", "11379", "11377", "11378", "11421", "11422", "11385", "11411", "11412", "11413", "11414", "11420", "11418", "11419", "11423", "11426", "11427", "11428", "11436", "11691", "11692", "11693", "11429", "11432", "11433", "11434", "11435", "11694", "11697")
crosswalk <- st_read("/Users/annaduan/Desktop/HIP/Rent\ Limit\ Study/Data/zip_tract_crosswalk.csv") %>%
filter(usps_zip_pref_state == "NY" & zip %in% zip_list) %>%
dplyr::select(-usps_zip_pref_city, -usps_zip_pref_state) %>%
rename(GEOID = tract) %>%
dplyr::select(-bus_ratio, -oth_ratio, -tot_ratio) %>%
rename(zip_code = zip)
acs_zip <- acs %>%
left_join(crosswalk, keep = FALSE, by= "GEOID") %>%
mutate(res_ratio = as.numeric(res_ratio)) %>%
mutate_at(c("median_age","medhhinc", "female_pct","white_pct","black_pct", "hispanic_pct", "other_race_pct","adult_pct","female_hh_pct" , "public_asst_pct" ,"poverty_pct","renter_pct","tenure_5yr_pct" ,"vacancy_pct", "housing_condition_pct","unemployed_pct","race_ice","population", "apartments" ,"hh_size_renter","hh_size_owner","no_veh_hh" , "total_workers" ,"transit_commute","wfh_2018", "retail_worker","edu_health_soc_worker","service_worker","age_25plus","edu_ms","edu_hs","edu_bac","edu_prof", "foreign_born" ,"limited_english_hh","no_insurance","live_alone", "age_65_74", "group_quarters"), ~.*res_ratio)
acs_zip[is.na(acs_zip)] <- 0
acs_zip <- acs_zip %>%
st_drop_geometry() %>%
group_by(zip_code) %>%
summarise_at(c("median_age","medhhinc", "female_pct","white_pct","black_pct", "hispanic_pct", "other_race_pct","adult_pct","female_hh_pct" , "public_asst_pct" ,"poverty_pct","renter_pct","tenure_5yr_pct" ,"vacancy_pct", "housing_condition_pct","unemployed_pct","race_ice","population", "apartments" ,"hh_size_renter","hh_size_owner","no_veh_hh" , "total_workers" ,"transit_commute","wfh_2018", "retail_worker","edu_health_soc_worker","service_worker","age_25plus","edu_ms","edu_hs","edu_bac","edu_prof", "foreign_born" ,"limited_english_hh","no_insurance","live_alone", "age_65_74", "group_quarters"), sum) %>%
filter(zip_code != 0) %>%
rename(zcta = zip_code)
#add covid data
data_zip <- byGeo %>%
left_join(acs_zip, by = "zcta") %>%
st_as_sf() %>%
st_transform("EPSG:2263") %>%
mutate(TEST_RATE = as.numeric(TOTAL_COVID_TESTS)/as.numeric(POP_DENOMINATOR), na.rm = T)
data_zip <- data_zip %>%
dplyr::select(zcta) %>%
st_intersection(public_housing) %>%
group_by(zcta) %>%
summarise(p_housing_units = sum(p_housing_units)) %>%
st_drop_geometry() %>%
full_join(data_zip, by = "zcta") %>%
st_as_sf()
grid.arrange(
ggplot() +
geom_sf(data = acs, color = "white", fill = "purple",size = 0.2) +
scale_fill_viridis(option = "A", name = "NYC Borough") +
labs(title = "Tract", subtitle = "n = 2101") +
mapTheme() + theme(legend.position="none")
,
ggplot() +
geom_sf(data = data_zip, color = "white", fill = "magenta", size = 0.2) +
scale_fill_viridis(option = "A", name = "NYC Borough") +
labs(title = "ZIP Code", subtitle = "n = 177") +
mapTheme() + theme(legend.position="none"), ncol = 2, top = "Tract to ZIP")
Next, I create a set of “nearest neighbor” variables to represent the point data. Nearest neighbor is a function that calculates the average distance between the centroid of a ZIP code and the k nearest “neighbor” ZIP codes with a certain attribute. I create these variables for healthcare, transit, evictions, schools, shootings, and public housing.
data_zip <-
data_zip %>%
mutate(
healthcare.nn5 = nn_function(st_coordinates(st_centroid(data_zip)), st_coordinates(healthcare),5),
busstop.nn5 = nn_function(st_coordinates(st_centroid(data_zip)), st_coordinates(busstops),5),
evictions.nn5 = nn_function(st_coordinates(st_centroid(data_zip)), evictions_prox,5),
schools.nn5 = nn_function(st_coordinates(st_centroid(data_zip)), st_coordinates(schools),5),
shootings.nn5 = nn_function(st_coordinates(st_centroid(data_zip)), shootings_prox,5),
public_housing = ifelse(population >0, p_housing_units/population, 0)) %>%
dplyr::select( -hosp, -na.rm, -POP_DENOMINATOR)
shootings_bronx <- shootings %>%
st_intersection(st_union(data_zip %>% filter(BOROUGH_GROUP == "Bronx")))
grid.arrange(
ggplot() +
geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx"),fill = "magenta", colour = "white", alpha = 0.5) +
geom_sf(data = shootings_bronx, size = 1, colour = "black") +
labs(title = "Point data", subtitle = "Bronx, NYC") +
mapTheme()
,
ggplot() +
geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx"), aes(fill = shootings.nn5), colour = "white") +
scale_fill_viridis(option = 'magma', name = "Mean distance to 5 \nnearest shootings (ft)") +
labs(title = "Nearest Neighbors Variable", subtitle = "Bronx, NYC") +
mapTheme() , ncol = 2, top = "Shootings Data Feature Engineering")
Before creating a model, I calculate some summary statistics regarding Covid-19 mortality ratio in New York City.
grid.arrange(
ggplot()+
geom_histogram(data = data_zip, aes(x = as.numeric(COVID_CASE_RATE)), bins = 50, fill = "violet") +
labs(title = "Case Rate", x = "Cases per 100k", y = "ZIP codes") +
plotTheme()
,
ggplot()+
geom_histogram(data = data_zip, aes(x = as.numeric(COVID_DEATH_RATE)), bins = 50, fill = "magenta") +
labs(title = "Death Rate", x = "Deaths per 100k", y = "ZIP codes") +
plotTheme()
,
ggplot()+
geom_histogram(data = data_zip, aes(x = MORT_RATIO), bins = 50, fill = "purple") +
labs(title = "Mortality Ratio", x = "% of Infections", y = "ZIP codes") +
plotTheme(),
ncol = 3, top = "New York City Covid-19 Cases, Deaths, and Mortality Ratios, 2020-2021")
Geographically, we see that of the city’s 5 boroughs, the Bronx has the highest mortality ratio (2.5%) and Staten Island has the lowest (1.5%).
data_zip %>%
dplyr::select(MORT_RATIO, BOROUGH_GROUP) %>%
group_by(BOROUGH_GROUP) %>%
summarise(Mortality_Ratio = mean(MORT_RATIO)) %>%
ggplot() +
geom_sf(aes(fill = Mortality_Ratio), color = "white") +
geom_sf_text(aes(label = BOROUGH_GROUP), size = 3, colour = "gray50")+
scale_fill_viridis(option = 'magma', name = "% of Positive Cases") +
labs(title = "Mortality Ratio by NYC Borough", caption = "Mortality ratio = (deaths/positive cases)*100") +
mapTheme()
data_zip$BOROUGH_GROUP <- factor(data_zip$BOROUGH_GROUP, # Relevel group factor
levels = c("Bronx", "Manhattan", "Staten Island", "Queens", "Brooklyn"))
byBorough <- data_zip %>%
group_by(BOROUGH_GROUP) %>%
summarise(MORT_RATIO = mean(MORT_RATIO), na.rm = T,
Positivity = mean(PERCENT_POSITIVE), na.rm = T,
Evictions_nn5 = mean(evictions.nn5), na.rm = T,
ICE_race = mean(race_ice), na.rm = T,
Shootings_nn5 = mean(shootings.nn5), na.rm = T,
Unemployment = mean(unemployed_pct), na.rm = T,
Carless_HHs = mean(no_veh_hh), na.rm = T,
Transit_use = mean(transit_commute), na.rm = T,
Bachelors = mean(edu_bac), na.rm = T,
Foreign_born = mean(foreign_born), na.rm = T,
White_pct = mean(white_pct), na.rm = T,
Uninsured = mean(no_insurance), na.rm = T,
Age65_74 = mean(age_65_74), na.rm = T) %>%
mutate(BOROUGH_GROUP = ifelse(BOROUGH_GROUP == "Brooklyn", "BKN", ifelse(BOROUGH_GROUP == "Bronx", "BRX", ifelse(BOROUGH_GROUP == "Queens", "QNS", ifelse(BOROUGH_GROUP == "Staten Island", "SI", "MHN"))))) %>%
rename(Borough = BOROUGH_GROUP)
byBorough<- byBorough %>%
st_drop_geometry() %>%
dplyr::select(-na.rm) %>%
gather(Variable, Value, -Borough)
pander(byBorough %>%filter(Variable == "MORT_RATIO") %>% dplyr::select(-Variable), caption = "Mortality Ratio by Borough")
Borough | Value |
---|---|
BRX | 2.448 |
MHN | 1.515 |
SI | 1.544 |
QNS | 2.351 |
BKN | 2.301 |
Looking closer at the differences between the boroughs, we see a more complicated picture. In the Bronx, we see the city’s highest unemployment rate, lowest share of residents with Bachelor’s degrees, and a relatively high share of foreign born residents. This suggests that these variables may be positively correlated with mortality ratio.
In Staten Island, however, we observe the highest rate of evictions and shootings, and the city’s highest positivity rate. On the other hand, Staten Island also has the highest share of white residents, the lowest of foreign born residents, and low rates of individuals who use transit and households without cars. The combination of these trends suggests that Staten Island has higher variation in neighborhood conditions and demographics.
byBorough %>%
ggplot(aes(Borough, Value)) +
geom_bar(stat = "identity", position = "dodge", fill = "violet") +
facet_wrap(~Variable, scales = "free", ncol=4) +
labs(title = "Mortality Ratio and Risk Factors across NYC Boroughs") +
plotTheme() + theme(legend.position="bottom")
Let’s look closer. By mapping mortality ratios in the Bronx and Staten Island, we see that there appears to be greater disparities in Staten Island than in the Bronx. Perhaps this is due to a higher level of inequality.
boro_labs <- data_zip %>%
dplyr::select(BOROUGH_GROUP, MORT_RATIO) %>%
group_by(BOROUGH_GROUP) %>%
summarise(Mortality_Ratio = mean(MORT_RATIO))
ggplot()+
geom_sf(data = data_zip, colour = "transparent", fill = "gray90", size = 0.5) +
geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx" | BOROUGH_GROUP == "Staten Island"), colour = "transparent", aes(fill = as.numeric(MORT_RATIO)), size = 0.5) +
geom_sf_text(data = boro_labs %>% filter(BOROUGH_GROUP == "Staten Island" | BOROUGH_GROUP == "Bronx"), aes(label = BOROUGH_GROUP), size = 3, colour = "white", nudge_x = 4)+
scale_fill_viridis(option = 'A', name = "% of Positive Cases") +
labs(title = "Mortality Ratio in the Bronx and Staten Island", caption = "Mortality ratio = (deaths/positive cases)*100") +
mapTheme()
I map a few variables to test this theory.
The spatial distribution of poverty does not resemble that of mortality ratio.
ggplot()+
geom_sf(data = data_zip, colour = "transparent", fill = "gray90", size = 0.5) +
geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx" | BOROUGH_GROUP == "Staten Island"), colour = "transparent", aes(fill = as.numeric(poverty_pct)), size = 0.5) +
scale_fill_viridis(option = 'A', name = "% of\npopulation") +
labs(title = "Below Poverty Line") +
mapTheme()
Unemployment, too, does not map well onto mortality ratio.
ggplot()+
geom_sf(data = data_zip, colour = "transparent", fill = "gray90", size = 0.5) +
geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx" | BOROUGH_GROUP == "Staten Island"), colour = "transparent", aes(fill = as.numeric(unemployed_pct)), size = 0.5) +
scale_fill_viridis(option = 'A', name = "% of population \n 16+") +
labs(title = "Age 16+ Unemployed") +
mapTheme()
Percent non-White loosely resembles the mortality ratio in Staten Island, but less so in the Bronx.
ggplot()+
geom_sf(data = data_zip, colour = "transparent", fill = "gray90", size = 0.5) +
geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx" | BOROUGH_GROUP == "Staten Island"), colour = "transparent", aes(fill = 1 - as.numeric(white_pct)), size = 0.5) +
scale_fill_viridis(option = 'A', name = "% of\npopulation") +
labs(title = "Non-White Residents") +
mapTheme()
Age appears to more closely resemble mortality ratio in the Bronx than in Staten Island.
ggplot()+
geom_sf(data = data_zip, colour = "transparent", fill = "gray90", size = 0.5) +
geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx" | BOROUGH_GROUP == "Staten Island"), colour = "transparent", aes(fill = as.numeric(median_age)), size = 0.5) +
scale_fill_viridis(option = 'A', name = "Years") +
labs(title = "Median Age") +
mapTheme()
Percent foreign-born more closely resembles mortality ratio in Staten Island.
ggplot()+
geom_sf(data = data_zip, colour = "transparent", fill = "gray90", size = 0.5) +
geom_sf(data = data_zip %>% filter(BOROUGH_GROUP == "Bronx" | BOROUGH_GROUP == "Staten Island"), colour = "transparent", aes(fill = as.numeric(foreign_born)), size = 0.5) +
scale_fill_viridis(option = 'A', name = "% of\npopulation") +
labs(title = "Foreign-born Residents") +
mapTheme()
Together, these trends suggest that a number of variables likely have interaction effects with the borough that a ZIP code is in.
Next, I plot the distribution of all of the current variables. The following variables have skewed distributions, and may need further transformation before being used for modelling:
skewed_list <- c("black_pct", "hispanic_pct", "white_pct", "busstop.nn5", "edu_ms", "edu_prof", "evictions.nn5", "female_hh_pct", "healthcare.nn5", "limited_english_hh", "live_alone", "no_veh_hh", "p_housing_units", "poverty_pct", "public_asst_pct", "race_ice", "renter_pct", "safmr4", "schools.nn5", "tenure_5yr_pct", "unemployed_pct", "wfh_2018")
skewed_desc <- c("Percentage share of Black residents", "Percentage share of Hispanic residents", "Percentage share of White residents", "Average distance to nearest 5 bus stops from ZIP code centroid", "Percentage share of residents 25 years and older with middle school education level", "Percentage share of residents 25 years and older with professional degrees", "Average distance to nearest 5 evictions", "Percentage share of households headed by female householder", "Average distance to nearest 5 heatlhcare facilities", "Percentage share of households with limited english fluency", "Percentage share of residents living alone", "Percentage share of households with no access to automobiles", "Number of public housing units per ZIP code", "Percentage share of residents living below poverty line", "Percentage share of households receiving public assistance", "Racialized Index of Concentration at the Extremes (RICE); -1 = most disadvantaged, 1 = most advantaged", "Percentage share of households that rent their homes", "HUD Housing Choice Voucher rent limit for 4 bedroom unit", "Average distance to nearest 5 public K-12 schools", "Percentage share of renters who have lived in their current address for 5 or more years", "Percentage share of population 16 years and older that is unemployed", "Percentage share of workers who worked from home in 2018")
skewed_vars <- as.data.frame(skewed_list) %>%
mutate(col = seq.int(nrow(.))) %>%
rename(Variable = skewed_list)
skewed_desc <- as.data.frame(skewed_desc) %>%
mutate(col = seq.int(nrow(.))) %>%
rename(Description = skewed_desc)
left_join(skewed_vars, skewed_desc) %>%
dplyr::select(-col) %>%
pander(caption = "Skewed Variables")
Joining, by = “col”
Variable | Description |
---|---|
black_pct | Percentage share of Black residents |
hispanic_pct | Percentage share of Hispanic residents |
white_pct | Percentage share of White residents |
busstop.nn5 | Average distance to nearest 5 bus stops from ZIP code centroid |
edu_ms | Percentage share of residents 25 years and older with middle school education level |
edu_prof | Percentage share of residents 25 years and older with professional degrees |
evictions.nn5 | Average distance to nearest 5 evictions |
female_hh_pct | Percentage share of households headed by female householder |
healthcare.nn5 | Average distance to nearest 5 heatlhcare facilities |
limited_english_hh | Percentage share of households with limited english fluency |
live_alone | Percentage share of residents living alone |
no_veh_hh | Percentage share of households with no access to automobiles |
p_housing_units | Number of public housing units per ZIP code |
poverty_pct | Percentage share of residents living below poverty line |
public_asst_pct | Percentage share of households receiving public assistance |
race_ice | Racialized Index of Concentration at the Extremes (RICE); -1 = most disadvantaged, 1 = most advantaged |
renter_pct | Percentage share of households that rent their homes |
safmr4 | HUD Housing Choice Voucher rent limit for 4 bedroom unit |
schools.nn5 | Average distance to nearest 5 public K-12 schools |
tenure_5yr_pct | Percentage share of renters who have lived in their current address for 5 or more years |
unemployed_pct | Percentage share of population 16 years and older that is unemployed |
wfh_2018 | Percentage share of workers who worked from home in 2018 |
correlation.long <-
st_drop_geometry(data_zip) %>%
dplyr::select(-zcta, -BOROUGH_GROUP, -COVID_CASE_RATE, -COVID_DEATH_RATE, -other_race_pct, -PERCENT_POSITIVE,-TEST_RATE, -total_workers, -population, -hh_size_owner, -group_quarters, -apartments, -age_25plus) %>%
gather(Variable, Value, -MORT_RATIO) %>%
mutate(Value = as.numeric(Value))
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, MORT_RATIO, use = "complete.obs"))
ggplot(correlation.long %>% filter(Variable %in% skewed_list), aes(x = Value)) +
geom_histogram(bins = 50, fill = "magenta") +
facet_wrap(~Variable, ncol = 6, scales = "free") +
labs(title = "Covid-19 Mortality Ratio as a function of risk factors", subtitle = "Skewed Variables") +
plotTheme()
Based on what I found in my exploratory analysis, I further transform some of the variables in the following ways:
data_zip <- data_zip %>%
mutate(bronx = ifelse(BOROUGH_GROUP == "Bronx", 1, 0),
manhattan = ifelse(BOROUGH_GROUP == "Manhattan", 1, 0),
statenisland = ifelse(BOROUGH_GROUP == "Staten Island", 1, 0),
queens = ifelse(BOROUGH_GROUP == "Queens", 1, 0),
brooklyn = ifelse(BOROUGH_GROUP == "Brooklyn", 1, 0))
data_zip %>%
st_drop_geometry() %>%
dplyr::select(bronx, queens, manhattan, statenisland, brooklyn) %>%
unique() %>%
pander(caption = "Borough Group Dummy Variables")
bronx | queens | manhattan | statenisland | brooklyn |
---|---|---|---|---|
0 | 0 | 1 | 0 | 0 |
0 | 0 | 0 | 1 | 0 |
1 | 0 | 0 | 0 | 0 |
0 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 |
data_zip <- data_zip %>%
mutate(black_med = ifelse(black_pct > median(black_pct), 1, 0),
black_mean = ifelse(black_pct > mean(black_pct), 1, 0),
black_maj = ifelse(black_pct > 0.5, 1, 0),
black_log = log(black_pct),
hispanic_med = ifelse(hispanic_pct > median(hispanic_pct), 1, 0),
hispanic_mean = ifelse(hispanic_pct > mean(hispanic_pct), 1, 0),
hispanic_maj = ifelse(hispanic_pct > 0.5, 1, 0),
hispanic_log = log(hispanic_pct),
white_maj = ifelse(white_pct > 0.5, 1, 0),
white_qt3 = ifelse(white_pct > 0.68785, 1, 0),
white_log = log(white_pct),
nonwhite_qt3 = ifelse(white_pct < 0.22, 1, 0),
bus.nn5_med = ifelse(busstop.nn5 > median(busstop.nn5), 1, 0),
bus.nn5_mean = ifelse(busstop.nn5 > mean(busstop.nn5), 1, 0),
bus.nn5_qt3 = ifelse(busstop.nn5 > 14895.4, 1, 0),
busnn5_log = log(busstop.nn5),
grade8_med = ifelse(edu_ms > median(edu_ms), 1, 0),
grade8_mean = ifelse(edu_ms > mean(edu_ms), 1, 0),
grade8_qt3 = ifelse(edu_ms > 0.022335, 1, 0),
grade8_log = log(edu_ms),
prof_degree_med = ifelse(edu_prof > median(edu_prof), 1, 0),
prof_degree_mean = ifelse(edu_prof > mean(edu_prof), 1, 0),
prof_degree_qt3 = ifelse(edu_prof > 0.0434616, 1, 0),
prof_degree_log = log(edu_prof),
evict.nn5_med = ifelse(evictions.nn5 > median(evictions.nn5), 1, 0),
evict.nn5_mean = ifelse(evictions.nn5 > mean(evictions.nn5), 1, 0),
evict.nn5_qt3 = ifelse(evictions.nn5 > 4622.2, 1, 0),
evict.nn5_log = log(evictions.nn5),
female_hh_med = ifelse(female_hh_pct > median(female_hh_pct), 1, 0),
female_hh_mean = ifelse(female_hh_pct > mean(female_hh_pct), 1, 0),
female_hh_qt3 = ifelse(female_hh_pct > 0.12607, 1, 0),
female_hh_log = log(female_hh_pct),
health.nn5_med = ifelse(healthcare.nn5 > median(healthcare.nn5), 1, 0),
health.nn5_mean = ifelse(healthcare.nn5 > mean(healthcare.nn5), 1, 0),
health.nn5_qt3 = ifelse(healthcare.nn5 > 14879, 1, 0),
healthcare.nn5_log = log(healthcare.nn5),
lim_english_binary = ifelse(limited_english_hh > 0.0002791, 1, 0),
lim_english_mean = ifelse(limited_english_hh > mean(limited_english_hh), 1, 0),
lim_english_qt3 = ifelse(limited_english_hh > 0.0023350, 1, 0),
lim_english_log = log(limited_english_hh),
live_alone_med = ifelse(live_alone > median(live_alone), 1, 0),
live_alone_mean = ifelse(live_alone > mean(live_alone), 1, 0),
live_alone_qt3 = ifelse(live_alone > 0.17207, 1, 0),
live_alone_log = log(live_alone),
no_car.med = ifelse(no_veh_hh > median(no_veh_hh), 1, 0),
no_car.mean = ifelse(no_veh_hh > mean(no_veh_hh), 1, 0),
no_car.qt3 = ifelse(no_veh_hh > 0.341979, 1, 0),
no_car_log = log(no_veh_hh),
ph_units_binary = ifelse(p_housing_units != 0, 1, 0),
poverty_med = ifelse(poverty_pct > median(poverty_pct), 1, 0),
poverty_mean = ifelse(poverty_pct > mean(poverty_pct), 1, 0),
poverty_qt3 = ifelse(poverty_pct > 0.21182, 1, 0),
poverty_log = log(poverty_pct),
public_asst_med = ifelse(public_asst_pct > median(public_asst_pct), 1, 0),
public_asst_mean = ifelse(public_asst_pct > mean(public_asst_pct), 1, 0),
public_asst_qt3 = ifelse(public_asst_pct > 0.019006, 1, 0),
public_asst_log = log(public_asst_pct),
rice_binary = ifelse(race_ice > 0, 1, 0),
rice_log = log(race_ice),
renter_med = ifelse(renter_pct > median(renter_pct), 1, 0),
renter_mean = ifelse(renter_pct > mean(renter_pct), 1, 0),
renter_log = log(renter_pct),
safmr4_med = ifelse(safmr4 > median(safmr4), 1, 0),
safmr_log = log(safmr4),
schools.nn5_med = ifelse(schools.nn5 > median(schools.nn5), 1, 0),
schools.nn5_mean = ifelse(schools.nn5 > mean(schools.nn5), 1, 0),
schools.nn5_qt3 = ifelse(schools.nn5 > 2757.5, 1, 0),
schools_log = log(schools.nn5),
shootings.nn5_med = ifelse(shootings.nn5 > median(shootings.nn5), 1, 0),
shootings.nn5_mean = ifelse(shootings.nn5 > mean(shootings.nn5), 1, 0),
shootings.nn5_qt3 = ifelse(shootings.nn5 > 3003.25, 1, 0),
shootings_log = log(shootings.nn5),
tenure5y_med = ifelse(tenure_5yr_pct > median(tenure_5yr_pct), 1, 0),
tenure5y_mean = ifelse(tenure_5yr_pct > mean(tenure_5yr_pct), 1, 0),
tenure5y_qt3 = ifelse(tenure_5yr_pct > 0.073214, 1, 0),
tenure5y_log = log(tenure_5yr_pct),
unemployed_med = ifelse(unemployed_pct > median(unemployed_pct), 1, 0),
unemployed_mean = ifelse(unemployed_pct > mean(unemployed_pct), 1, 0),
unemployed_qt3 = ifelse(unemployed_pct > 0.04807, 1, 0),
unemployed_log = log(unemployed_pct),
wfh_med = ifelse(wfh_2018 > median(wfh_2018), 1, 0),
wfh_mean = ifelse(wfh_2018 > mean(wfh_2018), 1, 0),
wfh_qt3 = ifelse(wfh_2018 > 0.050053, 1, 0),
wfh_log = log(wfh_2018))
I create nearest neighbor variables for cut-offs for continuous demographic variables such as poverty50.nn3, or the distance to the nearest 3 grid cells where the poverty rate is above 50%. For each variable, I create a few iterations with a different number of neighbors for later testing.
data_zip <- data_zip %>%
mutate(phousing.nn1 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(p_housing_units >0))),1),
phousing.nn5 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(p_housing_units >0))),5),
poverty30.nn3 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(poverty_pct > 0.3))),3),
# poverty50.nn3 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(poverty_pct > 0.5))),3),
renter70.nn3 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(renter_pct > 0.7))),3),
unemployed5.nn3 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(renter_pct > 0.05))),3),
public_asst2.nn3 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(public_asst_pct > 0.02))),3),
ice_qt1.nn1 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(race_ice < -0.02))),1),
ice_qt1.nn5 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(race_ice < -0.02))),5),
ice_qt4.nn1 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(race_ice > .32))),1),
ice_qt4.nn5 = nn_function(st_coordinates(st_centroid(data_zip)),st_coordinates(st_centroid(data_zip %>% filter(race_ice > .32))),5))
threshold_nn_list <- c("phousing.nn1", "phousing.nn5", "poverty30.nn3", "renter70.nn3", "unemployed5.nn3", "public_asst2.nn3", "ice_qt1.nn1", "ice_qt1.nn5", "ice_qt4.nn1", "ice_qt4.nn5")
threshold_nn_vars_desc_list <- c("Distance to nearest public housing project", "Average distance to nearest 5 public housing projects", "Average distance to nearest ZIP code where poverty > 30%", "Average distance to nearest 3 ZIP codes where renters > 70%", "Average distance to nearest 3 ZIP codes where unemployment > 5%", "Average distance to nearest 3 ZIP codes where public assistance use > 2%", "Distance to nearest ZIP code where ICE > citywide 25th percentile", "Average distance to nearest 5 ZIP codes where ICE > citywide 25th percentile", "Distance to nearest ZIP code where ICE > citywide 75th percentile", "Average distance to nearest 5 ZIP codes where ICE > citywide 75th percentile")
threshold_nn_vars <- as.data.frame(threshold_nn_list) %>%
rename(Variable = threshold_nn_list) %>%
mutate(col = seq.int(nrow(.)))
threshold_nn_vars_desc <- as.data.frame(threshold_nn_vars_desc_list) %>%
rename(Description = threshold_nn_vars_desc_list) %>%
mutate(col = seq.int(nrow(.)))
left_join(threshold_nn_vars, threshold_nn_vars_desc) %>%
dplyr::select(-col) %>%
pander(caption = "Nearest Neighbor Variables")
Joining, by = “col”
Variable | Description |
---|---|
phousing.nn1 | Distance to nearest public housing project |
phousing.nn5 | Average distance to nearest 5 public housing projects |
poverty30.nn3 | Average distance to nearest ZIP code where poverty > 30% |
renter70.nn3 | Average distance to nearest 3 ZIP codes where renters > 70% |
unemployed5.nn3 | Average distance to nearest 3 ZIP codes where unemployment > 5% |
public_asst2.nn3 | Average distance to nearest 3 ZIP codes where public assistance use > 2% |
ice_qt1.nn1 | Distance to nearest ZIP code where ICE > citywide 25th percentile |
ice_qt1.nn5 | Average distance to nearest 5 ZIP codes where ICE > citywide 25th percentile |
ice_qt4.nn1 | Distance to nearest ZIP code where ICE > citywide 75th percentile |
ice_qt4.nn5 | Average distance to nearest 5 ZIP codes where ICE > citywide 75th percentile |
After testing all variables for correlation with mortality ratio, I found that factors like the share of workers in service, education, health and social work are more strongly correlated with mortality ratio than race. Socioeconomic variables like median income, education, and vehicle ownership are also correlated with mortality ratio. Poverty, unemployment, and english proficiency are also correlated with mortality ratio. All of these correlations are statistically signficant. As a whole, these correlations suggest that neighborhood stressers, living conditions, and employment status and industry all affect a community’s Covid-19 mortality risk.
The 9 strongest positive correlations with mortality ratio includes a mix of demographics, employment, housing, and socioeconomic factors. Surprisingly the two strongest positive correlations are the percentage share of service workers and the percentage share of workers in health, education, and social work.
pos_cors <- c("edu_health_soc_worker","service_worker","edu_hs","housing_condition_pct","black_log","foreign_born","unemployed_med","poverty_log","lim_english_binary")
neg_cors <- c("race_ice","edu_bac","medhhinc","tenure5y_qt3","safmr4","white_pct", "no_car.qt3","wfh_2018","no_veh_hh","vacancy_pct")
correlation.long <-
st_drop_geometry(data_zip) %>%
dplyr::select( -zcta, -BOROUGH_GROUP, -COVID_CASE_RATE, -COVID_DEATH_RATE, -other_race_pct, -PERCENT_POSITIVE,-TEST_RATE, -total_workers, -population, -hh_size_owner, -age_25plus) %>%
gather(Variable, Value, -MORT_RATIO) %>%
mutate(Value = as.numeric(Value))
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, MORT_RATIO, use = "complete.obs"))
neg_cor_list <- c("race_ice", "edu_bac", 'medhhinc', "tenure5y_qt3", "safmr4", 'edu_prof', "white_pct", "no_car.qt3", "wfh_2018")
pos_cor_list <- c("edu_health_soc_worker", "service_worker", "edu_hs", "housing_condition_pct", "black_log", "edu_ms", "foreign_born", "unemployed_med", "poverty_log")
cor.neg <- correlation.cor %>%
filter(correlation < 0) %>%
filter(Variable %in% neg_cor_list)
cor.neg <- cor.neg[order(cor.neg$correlation),] %>%
mutate(col = seq.int(nrow(.)))
cor.neg.desc.list <- c("Racialized Index of Concentration at the Extremes; -1 to 1","Percentage share of population older than 25 with Bachelor's degree","Median household income","Binary: percentage share of renters who have lived in their current address for 5 years or longer > 7.3% (75th percentile)","HUD Housing Choice Voucher rent limit for 4 bedroom unit","Percentage share of population older than 25 with professional degree","Percentage share of White residents","Binary: percentage share of carless households > 34% (75th percentile)","Percentage share of workers who worked from home in 2018")
cor.neg.desc <- as.data.frame(cor.neg.desc.list) %>%
rename(Description = cor.neg.desc.list) %>%
mutate(col = seq.int(nrow(.)))
cor.neg <- left_join(cor.neg, cor.neg.desc) %>%
dplyr::select(-col)
cor.pos <- correlation.cor %>%
filter(correlation > 0) %>%
filter(Variable %in% pos_cor_list)
cor.pos <- cor.pos[order(-cor.pos$correlation),] %>%
mutate(col = seq.int(nrow(.)))
cor.pos.desc.list <- c("Percentage share of workers working in education, health, or social work industry","Percentage share of workers in service industry","Percentage share of adults older than 25 with only high school degree","Percent of housing units with 1 or more physical conditions","Log-transformed percentage share of Black residents","Percentage share of adults older than 25 with only middle school degree","Percentage share of residents who are foreign-born","Binary: percentage share of residents older than 16 and unemployed > 3.8% (median)","Log-transformed percentage share of residents living below poverty line")
cor.pos.desc <- as.data.frame(cor.pos.desc.list) %>%
rename(Description = cor.pos.desc.list) %>%
mutate(col = seq.int(nrow(.)))
cor.pos <- left_join(cor.pos, cor.pos.desc) %>%
dplyr::select(-col)
pander(cor.pos, caption = "Top 9 Positive Correlations with Mortality Ratio")
Variable | correlation | Description |
---|---|---|
edu_health_soc_worker | 0.5495 | Percentage share of workers working in education, health, or social work industry |
service_worker | 0.5335 | Percentage share of workers in service industry |
edu_hs | 0.4795 | Percentage share of adults older than 25 with only high school degree |
housing_condition_pct | 0.4496 | Percent of housing units with 1 or more physical conditions |
black_log | 0.4121 | Log-transformed percentage share of Black residents |
edu_ms | 0.3996 | Percentage share of adults older than 25 with only middle school degree |
foreign_born | 0.3927 | Percentage share of residents who are foreign-born |
unemployed_med | 0.381 | Binary: percentage share of residents older than 16 and unemployed > 3.8% (median) |
poverty_log | 0.3679 | Log-transformed percentage share of residents living below poverty line |
ggplot(correlation.long %>% filter(Variable %in% pos_cors), aes(Value, MORT_RATIO)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor%>% filter(Variable %in% pos_cors), aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "magenta") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Positive Correlation") +
plotTheme()
The strongest negative correlations are primarily socioeconomic, education, and housing variables. The strongest correlation is radicalized index of concentration at the extremes, but this is closely followed by the share of residents with Bachelors degrees, median household income, and share of renters with 5 years or more of tenure.
pander(cor.neg, caption = "Top 9 Negative Correlations with Mortality Ratio")
Variable | correlation | Description |
---|---|---|
race_ice | -0.5966 | Racialized Index of Concentration at the Extremes; -1 to 1 |
edu_bac | -0.5802 | Percentage share of population older than 25 with Bachelor’s degree |
medhhinc | -0.5791 | Median household income |
tenure5y_qt3 | -0.5357 | Binary: percentage share of renters who have lived in their current address for 5 years or longer > 7.3% (75th percentile) |
safmr4 | -0.5319 | HUD Housing Choice Voucher rent limit for 4 bedroom unit |
edu_prof | -0.4999 | Percentage share of population older than 25 with professional degree |
white_pct | -0.4971 | Percentage share of White residents |
no_car.qt3 | -0.4824 | Binary: percentage share of carless households > 34% (75th percentile) |
wfh_2018 | -0.4603 | Percentage share of workers who worked from home in 2018 |
ggplot(correlation.long %>% filter(Variable %in% cor.neg$Variable), aes(Value, MORT_RATIO)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor%>% filter(Variable %in% cor.neg$Variable), aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "purple") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Negative Correlation") +
plotTheme()
Following, I use T-tests to examine whether there is a statistically significant difference in mean Mortality Ratio for the 3 binary variables with the strongest correlations:
ggplot() +
geom_sf(data = data_zip, aes(fill = as.character(unemployed_med)), colour = "transparent") +
scale_fill_viridis(discrete = TRUE, option = "magma", name = "Unemployment > 3.8% (median)", direction = -1) +
labs(title = "Unemployment Binary") +
mapTheme()
ggplot() +
geom_sf(data = data_zip, aes(fill = as.character(tenure5y_qt3)), colour = "transparent") +
scale_fill_viridis(discrete = TRUE, direction = -1, option = "magma", name = "5 year tenure > 7.3%\n (75th percentile)") +
labs(title = "5 Year Renter Tenure Binary") +
mapTheme()
ggplot() +
geom_sf(data = data_zip, aes(fill = as.character(no_car.med)), colour = "transparent") +
scale_fill_viridis(discrete = TRUE, direction = -1, option = "magma", name = "Carless households > 34%\n (75th percentile)") +
labs(title = "No Vehicle Households Binary") +
mapTheme()
For all 3 variables, there is a statistically significant difference in means. Mortality ratio is 4 times higher in areas where unemployment > 3.8%. Mortality ratio is 8 times higher in neighborhoods where less than 7.3% of renters have 5 years or more of tenure. Neighborhoods with more carless households generally have a lower mortality ratio.
To build my first model, I use a backwards stepwise function to iteratively remove independent variables from a list of all of the variables and test for improvements in model fit. The optimal model returned by the function still has a very long list of variables, but it’s a good starting point. It has a high R-squared and adjusted R-squared, of 0.92 and 0.85, which suggests strong fit. However, from the results alone we can see many colinear variables such as the log and binary variables for percentage share of Black residents. Additionally, many of the variables are not statistically significant predictors.
mod_0 <- lm(MORT_RATIO ~ p_housing_units + median_age + white_pct +
female_hh_pct + public_asst_pct + renter_pct + tenure_5yr_pct +
housing_condition_pct + unemployed_pct + race_ice + apartments +
no_veh_hh + total_workers + transit_commute + wfh_2018 +
service_worker + age_25plus + edu_bac + edu_prof + foreign_born +
no_insurance + live_alone + age_65_74 + healthcare.nn5 +
busstop.nn5 + evictions.nn5 + schools.nn5 + shootings.nn5 +
bronx + manhattan + queens + black_med + black_maj + black_log +
hispanic_mean + hispanic_maj + white_log + nonwhite_qt3 +
bus.nn5_qt3 + grade8_med + grade8_qt3 + prof_degree_med +
prof_degree_mean + prof_degree_qt3 + prof_degree_log + evict.nn5_med +
evict.nn5_mean + evict.nn5_qt3 + evict.nn5_log + female_hh_mean +
female_hh_qt3 + female_hh_log + health.nn5_med + health.nn5_mean +
health.nn5_qt3 + healthcare.nn5_log + lim_english_binary +
lim_english_mean + lim_english_qt3 + live_alone_med + live_alone_mean +
live_alone_qt3 + live_alone_log + no_car.med + no_car.mean +
no_car.qt3 + poverty_qt3 + poverty_log + public_asst_med +
public_asst_qt3 + rice_binary + renter_mean + renter_log +
safmr4_med + safmr_log + schools.nn5_med + schools.nn5_mean +
schools.nn5_qt3 + schools_log + shootings.nn5_med + shootings.nn5_mean +
shootings_log + tenure5y_mean + tenure5y_qt3 + tenure5y_log +
unemployed_med + unemployed_log + wfh_med + wfh_mean + wfh_qt3 +
wfh_log + phousing.nn1 + phousing.nn5 + unemployed5.nn3 +
public_asst2.nn3 + ice_qt1.nn1 + ice_qt1.nn5 + ice_qt4.nn5 +
poverty30.nn3, data = data_final)
pander(summary(mod_0), caption = "Baseline Model")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 9.294 | 7.334 | 1.267 | 0.2089 |
p_housing_units | -0.0001008 | 3.987e-05 | -2.529 | 0.01348 |
median_age | 0.1495 | 0.03349 | 4.465 | 2.706e-05 |
white_pct | 2.766 | 1.14 | 2.426 | 0.01763 |
female_hh_pct | -36.01 | 10.34 | -3.482 | 0.0008243 |
public_asst_pct | 10.1 | 12.99 | 0.7775 | 0.4392 |
renter_pct | 3.437 | 1.62 | 2.121 | 0.03715 |
tenure_5yr_pct | -22.1 | 7.57 | -2.92 | 0.004596 |
housing_condition_pct | -8.193 | 2.154 | -3.804 | 0.0002836 |
unemployed_pct | 15.33 | 13.52 | 1.134 | 0.2602 |
race_ice | -5.14 | 2.059 | -2.496 | 0.01468 |
apartments | 1.362 | 0.589 | 2.313 | 0.02338 |
no_veh_hh | 2.291 | 2.417 | 0.948 | 0.3461 |
total_workers | 0.0003498 | 0.0003115 | 1.123 | 0.2649 |
transit_commute | 0.9136 | 0.8096 | 1.128 | 0.2627 |
wfh_2018 | 2.211 | 8.675 | 0.2548 | 0.7995 |
service_worker | -3.044 | 2.627 | -1.159 | 0.2501 |
age_25plus | -0.0002785 | 0.0002146 | -1.298 | 0.1982 |
edu_bac | -2.477 | 2.272 | -1.09 | 0.2791 |
edu_prof | -6.8 | 4.73 | -1.438 | 0.1546 |
foreign_born | 1.128 | 1.174 | 0.9605 | 0.3398 |
no_insurance | -3.162 | 2.039 | -1.551 | 0.125 |
live_alone | 20.88 | 6.942 | 3.007 | 0.003559 |
age_65_74 | -6.948 | 5.142 | -1.351 | 0.1806 |
healthcare.nn5 | 6.943e-06 | 3.802e-05 | 0.1826 | 0.8556 |
busstop.nn5 | 2.797e-05 | 1.902e-05 | 1.471 | 0.1454 |
evictions.nn5 | 8.977e-05 | 6.775e-05 | 1.325 | 0.1891 |
schools.nn5 | -0.0002754 | 0.0001311 | -2.101 | 0.03895 |
shootings.nn5 | -0.0002047 | 6.593e-05 | -3.105 | 0.002663 |
bronx | -0.3216 | 0.365 | -0.8811 | 0.381 |
manhattan | -1.038 | 0.3367 | -3.083 | 0.002844 |
queens | -0.2816 | 0.2618 | -1.076 | 0.2854 |
black_med | 0.1917 | 0.2291 | 0.837 | 0.4052 |
black_maj | -0.7775 | 0.2674 | -2.908 | 0.004757 |
black_log | -0.1339 | 0.1149 | -1.165 | 0.2477 |
hispanic_mean | -0.2722 | 0.1491 | -1.825 | 0.07191 |
hispanic_maj | 0.3455 | 0.2268 | 1.523 | 0.1318 |
white_log | -0.1319 | 0.2346 | -0.5624 | 0.5755 |
nonwhite_qt3 | 0.2855 | 0.2115 | 1.35 | 0.1811 |
bus.nn5_qt3 | -0.1092 | 0.2504 | -0.4359 | 0.6642 |
grade8_med | 0.2344 | 0.1554 | 1.508 | 0.1357 |
grade8_qt3 | 0.2884 | 0.1479 | 1.949 | 0.05494 |
prof_degree_med | 0.566 | 0.1712 | 3.306 | 0.001437 |
prof_degree_mean | -0.8206 | 0.3263 | -2.515 | 0.01398 |
prof_degree_qt3 | 1.161 | 0.3552 | 3.268 | 0.00162 |
prof_degree_log | 0.302 | 0.1369 | 2.206 | 0.03033 |
evict.nn5_med | -0.07806 | 0.1896 | -0.4116 | 0.6818 |
evict.nn5_mean | 0.09519 | 0.1988 | 0.4788 | 0.6334 |
evict.nn5_qt3 | -0.233 | 0.2161 | -1.078 | 0.2843 |
evict.nn5_log | -0.2644 | 0.3019 | -0.876 | 0.3838 |
female_hh_mean | 0.4387 | 0.2547 | 1.723 | 0.08899 |
female_hh_qt3 | 0.1998 | 0.2781 | 0.7184 | 0.4747 |
female_hh_log | 3.855 | 1.192 | 3.233 | 0.001805 |
health.nn5_med | 0.1647 | 0.1881 | 0.8753 | 0.3841 |
health.nn5_mean | -0.385 | 0.2145 | -1.795 | 0.07663 |
health.nn5_qt3 | 0.5226 | 0.2391 | 2.186 | 0.03187 |
healthcare.nn5_log | -0.7863 | 0.3097 | -2.539 | 0.01313 |
lim_english_binary | -0.1058 | 0.1363 | -0.7758 | 0.4402 |
lim_english_mean | 0.352 | 0.1669 | 2.109 | 0.03821 |
lim_english_qt3 | -0.6304 | 0.1834 | -3.436 | 0.0009532 |
live_alone_med | 0.2061 | 0.1892 | 1.089 | 0.2795 |
live_alone_mean | -0.6473 | 0.2508 | -2.581 | 0.01176 |
live_alone_qt3 | 0.7925 | 0.2705 | 2.929 | 0.004469 |
live_alone_log | -4.849 | 1.269 | -3.822 | 0.0002669 |
no_car.med | -0.5682 | 0.2578 | -2.204 | 0.03049 |
no_car.mean | -0.3253 | 0.2371 | -1.372 | 0.1741 |
no_car.qt3 | -0.1409 | 0.2503 | -0.5628 | 0.5752 |
poverty_qt3 | -0.1869 | 0.191 | -0.9785 | 0.3309 |
poverty_log | 0.1554 | 0.2564 | 0.606 | 0.5463 |
public_asst_med | -0.2082 | 0.1589 | -1.311 | 0.1939 |
public_asst_qt3 | -0.1478 | 0.1956 | -0.7555 | 0.4522 |
rice_binary | -0.91 | 0.2868 | -3.173 | 0.002165 |
renter_mean | -0.4465 | 0.1969 | -2.267 | 0.02616 |
renter_log | -0.3832 | 0.4796 | -0.799 | 0.4267 |
safmr4_med | 0.2729 | 0.1586 | 1.721 | 0.08932 |
safmr_log | -0.5574 | 0.734 | -0.7594 | 0.4499 |
schools.nn5_med | -0.1697 | 0.2154 | -0.7879 | 0.4332 |
schools.nn5_mean | -0.3048 | 0.2123 | -1.436 | 0.1551 |
schools.nn5_qt3 | 0.4584 | 0.184 | 2.491 | 0.01488 |
schools_log | 0.6456 | 0.2864 | 2.254 | 0.02705 |
shootings.nn5_med | 0.2947 | 0.1536 | 1.918 | 0.05881 |
shootings.nn5_mean | 0.1583 | 0.1637 | 0.9669 | 0.3366 |
shootings_log | -0.186 | 0.1353 | -1.375 | 0.173 |
tenure5y_mean | -0.4905 | 0.2243 | -2.187 | 0.03177 |
tenure5y_qt3 | 0.1397 | 0.285 | 0.4903 | 0.6253 |
tenure5y_log | 1.502 | 0.5019 | 2.992 | 0.00372 |
unemployed_med | 0.7974 | 0.1771 | 4.503 | 2.344e-05 |
unemployed_log | -1.729 | 0.5946 | -2.908 | 0.004749 |
wfh_med | 0.3879 | 0.181 | 2.143 | 0.03526 |
wfh_mean | -0.5991 | 0.1856 | -3.229 | 0.001829 |
wfh_qt3 | 0.3183 | 0.2067 | 1.54 | 0.1277 |
wfh_log | -0.3616 | 0.2517 | -1.437 | 0.1548 |
phousing.nn1 | 4.528e-05 | 2.898e-05 | 1.563 | 0.1223 |
phousing.nn5 | -9.464e-05 | 5.082e-05 | -1.862 | 0.0664 |
unemployed5.nn3 | 0.0001696 | 7.534e-05 | 2.252 | 0.02719 |
public_asst2.nn3 | 1.163e-05 | 1.678e-05 | 0.693 | 0.4904 |
ice_qt1.nn1 | -0.0001689 | 4.412e-05 | -3.828 | 0.000261 |
ice_qt1.nn5 | 0.0001998 | 5.344e-05 | 3.739 | 0.0003535 |
ice_qt4.nn5 | 1.542e-05 | 1.347e-05 | 1.145 | 0.2557 |
poverty30.nn3 | -3.196e-05 | 1.501e-05 | -2.129 | 0.03642 |
Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
177 | 0.4039 | 0.9127 | 0.8004 |
For my next model, I eliminated all insignificant variables. Next, I selected the variables with the highest absolute estimate in instances where I have multiple variations of a variable (e.g. log-transformed, median binary, mean binary). I additionally checked for further colinear variables. In this process, I found a few interesting relationships:
RICE eliminates the need for many variables such as race, median household income, housing conditions, and service workers.
ggplot(data = data_final, aes(x = race_ice, y = service_worker)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, colour = "magenta") +
labs(title = "Index of Concentration at the Extremes and Service Workers", subtitle = "cor = -0.88") +
plotTheme()
The percent of renters is colinear with evictions, percent of commuters using transit, shootings, and distance from healthcare facilities.
ggplot(data = data_final, aes(x = renter_pct, y = transit_commute)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, colour = "magenta") +
labs(title = "Percent Renter and Transit Commutes", subtitle = "cor = 0.82") +
plotTheme()
Living alone is colinear with living in Manhattan and professional degrees.
ggplot(data = data_final, aes(x = live_alone, y = prof_degree_qt3)) +
geom_point() +
geom_smooth(method = "glm", se = FALSE, colour = "magenta") +
labs(title = "Living Alone and Professional Degrees", subtitle = "cor = 0.69") +
plotTheme()
With an R-squared of 0.65 and p-value of less than 0.001, my first model is statistically significant and accurate. Immediately standing out as strong predictors are percent renter, Queens, race_ice, Staten Island, apartments, uninsured residents, residents living alone, distance to transit, households without cars. This tells us an interesting story: while racialized index of concentration at the extremes (race_ice) has strong explanatory power, variables describing communities’ living conditions are stronger predictors of mortality ratio. For instance, the strongest predictor is the share of residents who live alone. The more residents who live alone, the lower the mortality ratio. Similarly important is the share of householders who are renters (renter_pct). ZIP codes with more renters experience higher mortality ratios. I initially suspected that these relationships were spurious, and that these variables would be highly colinear with race and income. However, they have low correlations with both. As expected, ICE is also a strong predictor: for each unit increase in ICE, which is associated with a higher concentration of White residents in the 75th percentile of median household income, mortality ratio decreases by 2%.
library(car)
mod_1 <- lm(MORT_RATIO ~ median_age + renter_pct + race_ice + apartments + no_insurance + live_alone + bus.nn5_qt3 + no_car.med , data = data_final)
pander(summary(mod_1), caption = "Model 1")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 1.817 | 1.395 | 1.302 | 0.1947 |
median_age | 0.09769 | 0.01358 | 7.194 | 1.984e-11 |
renter_pct | 2.274 | 0.4652 | 4.888 | 2.366e-06 |
race_ice | -1.934 | 0.2615 | -7.398 | 6.3e-12 |
apartments | 0.9579 | 0.3027 | 3.165 | 0.001844 |
no_insurance | -5.046 | 1.334 | -3.782 | 0.0002158 |
live_alone | -2.547 | 0.8949 | -2.846 | 0.004974 |
bus.nn5_qt3 | 0.2942 | 0.1212 | 2.427 | 0.01627 |
no_car.med | -0.4474 | 0.1358 | -3.295 | 0.001201 |
Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
177 | 0.5238 | 0.6797 | 0.6644 |
To build my next model, I tested additional demographic variables as well as housing and employment variables. Adding grade8_med, a binary variable based on the citywide median of share of residents with only a middle school degree increased the model’s R-squared by 0.02. While variables such as service workers, health and education workers, unemployment, poverty rate, and public assistance are all strongly correlated with mortality ratio, I ultimately did not include them because they are colinear with race_ice and grade8_med. Similarly, the shares of foreign born residents and households with limited english fluency were insignificant and did not contribute to the model. In this model, the share of renters remains an important predictor, but its estimate and significance level have both decreased. By contrast, adding middle school education levels makes share of apartments a stronger and more significant predictor. One reason for this could be that by controlling for education level, the model separates the shares of people who are renting out of necessity and preference.
# anova(mod_0, mod_1, mod_1_g8, mod_2)
# cor.test(data_zip$lim_english_binary, data_zip$bus.nn5_qt3)
mod_2 <- lm(MORT_RATIO ~ median_age + renter_pct + race_ice + apartments + no_insurance + live_alone + bus.nn5_qt3 + no_car.med + grade8_med, data = data_final)
pander(summary(mod_2), caption = "Model 2")
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | 1.26 | 1.398 | 0.9012 | 0.3688 |
median_age | 0.09676 | 0.01341 | 7.216 | 1.783e-11 |
renter_pct | 1.937 | 0.4814 | 4.023 | 8.689e-05 |
race_ice | -1.829 | 0.262 | -6.98 | 6.614e-11 |
apartments | 1.041 | 0.3009 | 3.46 | 0.000686 |
no_insurance | -4.441 | 1.342 | -3.31 | 0.001144 |
live_alone | -1.87 | 0.9298 | -2.011 | 0.04593 |
bus.nn5_qt3 | 0.3519 | 0.1222 | 2.881 | 0.004489 |
no_car.med | -0.4423 | 0.134 | -3.3 | 0.001183 |
grade8_med | 0.2639 | 0.1131 | 2.334 | 0.02078 |
Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
177 | 0.517 | 0.6898 | 0.6731 |
For my final model, I test the effect of adding borough interaction terms to some of the variables. In the exploratory analysis, I found that a few variables such as race and age seem to have a stronger association with mortality ratio in some boroughs than others. Adding the boroughs categorical variable as an interaction term to the binary variable for median share of residents with only middle school education increased the R-squared by 0.06. Interestingly, adding this also makes the share of people living alone and the share of people who are uninsured insignificant, so I removed these variables. Share of renters, ICE, and households with no cars also become weaker and less significant predictors. With the addition of these interaction terms, the share of apartments and distance to transit become more important and significant predictors. Among the boroughs, Manhattan and Queens have the strongest and most significant interactions with grade 8 education. Overall, this model is highly accurate and statistically signficant, with an R-squared of 0.72 and a p-value of well below 0.0001.
mod_3 <- lm(MORT_RATIO ~ median_age + race_ice + apartments + bus.nn5_qt3 + no_car.med + BOROUGH_GROUP*grade8_med, data = data_final)
pander(summary(mod_3), caption = "Model 3")
Estimate | Std. Error | t value | |
---|---|---|---|
(Intercept) | -0.7027 | 0.5951 | -1.181 |
median_age | 0.07952 | 0.01152 | 6.904 |
race_ice | -2.195 | 0.2538 | -8.647 |
apartments | 1.523 | 0.2493 | 6.108 |
bus.nn5_qt3 | 0.4008 | 0.1275 | 3.143 |
no_car.med | -0.1761 | 0.1168 | -1.508 |
BOROUGH_GROUPManhattan | -0.9191 | 0.3305 | -2.781 |
BOROUGH_GROUPStaten Island | -0.7328 | 0.3646 | -2.01 |
BOROUGH_GROUPQueens | -0.6943 | 0.3259 | -2.131 |
BOROUGH_GROUPBrooklyn | 0.1518 | 0.3354 | 0.4525 |
grade8_med | -0.2139 | 0.3368 | -0.6351 |
BOROUGH_GROUPManhattan:grade8_med | 0.9928 | 0.3749 | 2.648 |
BOROUGH_GROUPStaten Island:grade8_med | 0.5454 | 0.6326 | 0.8622 |
BOROUGH_GROUPQueens:grade8_med | 1.032 | 0.3524 | 2.928 |
BOROUGH_GROUPBrooklyn:grade8_med | 0.09724 | 0.3658 | 0.2658 |
Pr(>|t|) | |
---|---|
(Intercept) | 0.2394 |
median_age | 1.091e-10 |
race_ice | 4.916e-15 |
apartments | 7.232e-09 |
bus.nn5_qt3 | 0.001991 |
no_car.med | 0.1336 |
BOROUGH_GROUPManhattan | 0.006066 |
BOROUGH_GROUPStaten Island | 0.04609 |
BOROUGH_GROUPQueens | 0.03464 |
BOROUGH_GROUPBrooklyn | 0.6515 |
grade8_med | 0.5263 |
BOROUGH_GROUPManhattan:grade8_med | 0.008891 |
BOROUGH_GROUPStaten Island:grade8_med | 0.3899 |
BOROUGH_GROUPQueens:grade8_med | 0.003903 |
BOROUGH_GROUPBrooklyn:grade8_med | 0.7907 |
Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
---|---|---|---|
177 | 0.4993 | 0.7193 | 0.695 |
My final model suggests a strong relationship between mortality ratio and living conditions, spatial fixed effects, and socioeconomic variables. The three strongest predictors of mortality ratio are the following:
These results challenge current understandings of the risk factors for Covid-19 cases and deaths. Housing variables are strong predictors for Covid-19 case outcomes. While it is easy to understand why higher shares of apartments, and likely residential density, is associated with higher Covid-19 transmission, it is difficult to interpret why it is correlated with a higher rate of deaths from Covid-19 cases. It is possible that the inclusion of socioeconomic variables like education and relative advantage (ICE) allows the model to identify ZIP codes where the share of renters is determined by the share of people lacking the wealth and other means to live in larger homes. However, even when controlling for RICE and median age, there is a clear relationship between the share of apartments and mortality ratios. Therefore, further research is needed on the mechanisms by which living conditions affect Covid-19 case outcomes in order to inform interventions.
In addition to apartments, transit stop density and the share of households with no cars are negatively correlated with mortality ratio. As with apartments, this runs counter to the conventional wisdom that transit use and lack of private transportation are risks factors for Covid-19 exposure. To further investigate this trend, I plotted three transit density scenarios. I controlled for RICE, changing the average distance to the nearest 5 bus stops. In both the Bronx and Staten Island, the boroughs with the highest and lowest mortality ratios, the negative correlation between transit density and mortality ratio holds. In Staten Island, however, it appears that the correlation between transit stop density and mortality ratio is stronger, as the distance between the 1000ft and 15000ft scenario lines is much more pronounced than for the Bronx. This suggests that contrary to the belief that transit poses a risk during the pandemic, it may serve as an asset for mitigating the burden of Covid-19 on communities.
Among the socioeconomic variables, RICE is an important predictor that confirms the current literature documenting the disparate impacts of Covid-19 on Black and low-income communities. RICE is a metric which compares the share of White households in the top 25th percentile of median household income with the share of Black households in the lowest 25th percentile, and ranges from -1 to 1 with 1 considered the most advantaged. In this model, a unit increase in RICE is associated with a decrease of 1.8% in mortality ratio. Considering that the citywide average mortality ratio is 2.1%, the effect of income and race is substantial. Education, too is an important predictor in the model. Attainment of a middle school degree is the strongest of the educational variables that I tested. Citywide, the higher the share of adults above 25 years old with only a middle school degree, the higher the mortality ratio. This can be interpreted as areas with an overall lower level of education are likely to experience higher mortality ratios. Initially this result appeared spurious. However, when I tested variables such as unemployment, poverty, and share of service workers that are strongly correlated with 8th grade education, they did not predict as strongly for mortality ratio. Likewise, demographic variables that I expected to be correlated with higher mortality ratios such as foreign born residents, households with limited english fluency, and age did not predict well.
Lastly, this model demonstrates the impact of spatial fixed effects at the borough level. As I found in my exploratory analysis, ZIP codes in Staten Island, Manhattan, and Queens have significantly lower mortality ratios compared to Brooklyn and the Bronx.
Of all the models, my final model has the second lowest AIC, after the baseline model created by backward selection. However, the baseline model has many duplicate variables and suffers from high colinearity, making it overfit. By adding boroughs as interaction terms, I was able to reduce AIC in my final model by 9%. Further, adding borough interaction terms and taking out shares of people living alone and without insurance increased R-squared by 0.03, or 4%.
Model | df | AIC | R2 |
---|---|---|---|
Baseline | 101 | 236.1 | 0.9127 |
1 | 10 | 284.1 | 0.6797 |
2 | 11 | 280.4 | 0.6898 |
3 | 16 | 272.8 | 0.7193 |
We now know that the addition of new variables reduces the magnitude of error for each model. What about the models’ performance at different mortality ratios? The baseline model over-predicts mortality ratio at low values and underpredicts at high values. Model 1 improves the accuracy for high mortality ratios. The residuals don’t change visibly between model 1 and 2, but model 3 visibly improves accuracy at high mortality ratios. Still, it is clear that all of the models predict much better at mid and high-range mortality ratios.
Next, I calculate mean absolute percentage error (MAPE) by borough to see if the model predicts with similar accuracy citywide. Ideally, errors should be evenly distributed. Across all 4 models, percentage error is highest in Staten Island and Manhattan. In line with the distribution of residuals, these boroughs have the lowest mortality ratios citywide.
resid_boro_3 <-left_join(data_zip %>%
dplyr::select(BOROUGH_GROUP, zcta), st_drop_geometry(resid_3), by = "zcta") %>%
st_as_sf() %>%
group_by(BOROUGH_GROUP) %>%
summarise(MAE = mean(MAE),
MAPE = mean(MAPE),
fitted = mean(fitted),
MORT_RATIO = mean(MORT_RATIO))
pander(st_drop_geometry(resid_boro_3), caption = "Model 3")
BOROUGH_GROUP | MAE | MAPE | fitted | MORT_RATIO |
---|---|---|---|---|
Bronx | 0.7625 | 32.75 | 2.585 | 2.448 |
Manhattan | 0.9922 | 176.8 | 2.057 | 1.515 |
Staten Island | 1.1 | 82.43 | 2.403 | 1.544 |
Queens | 0.6647 | 38.4 | 2.247 | 2.351 |
Brooklyn | 1.103 | 46.37 | 1.451 | 2.301 |
While it is useful to know how the model predicts across different boroughs and mortality ratio levels, it is arguably more important to understand how it predicts across demographic contexts. As the exploratory analysis section demonstrated, there is a high degree of racial and economic variation within the boroughs. To effectively inform policy intervention, this model must predict similarly for a diverse range of communities.
The model predicts more accurately in majority non-White contexts. This is likely due to the fact that it predicts more accurately in ZIP codes with higher mortality ratios. On average, error in majority non-White ZIP codes is 0.2% lower than in majority White contexts.
context_acs <- byGeo %>%
dplyr::select(zcta) %>%
left_join(acs_zip)
raceContextACS <- context_acs %>%
dplyr::select(white_pct, zcta) %>%
mutate(race_context = ifelse(white_pct > .5, "Majority_White", "Majority_Non_White"))
nativityContextACS <- context_acs %>%
dplyr::select(foreign_born, zcta) %>%
mutate(nativity_context = ifelse(foreign_born < 0.34820, "Less than 35%", "More than 35%"))
iceContextACS <- context_acs %>%
dplyr::select(race_ice, zcta) %>%
mutate(ice_context = ifelse(race_ice < 0, "-1 to 0", "0 to 1"))
povertyContextACS <- context_acs %>%
dplyr::select(poverty_pct, zcta) %>%
mutate(poverty_context = ifelse(poverty_pct > 0.15, "More than 15%", "Less than 15%"))
ggplot() +
geom_sf(data = raceContextACS, aes(fill = race_context), color = "white") +
scale_fill_viridis(option = "D", discrete = TRUE, name = "Majority Group") +
labs(title = "Racial Context") +
mapTheme()
left_join(st_drop_geometry(raceContextACS), st_drop_geometry(resid_3), by = "zcta") %>%
na.omit() %>%
group_by(race_context) %>%
summarize(mean.Error = mean(MAE, na.rm = T)) %>%
spread(race_context, mean.Error) %>%
kable(caption = "Mean Error by Racial Context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "Yellow") %>%
row_spec(0, color = "black", background = "white")
Majority_Non_White | Majority_White |
---|---|
0.7777365 | 0.9904661 |
The model also predicts more accurately in ZIP codes with a poverty rate above the citywide mean of 15%. In ZIP codes with a poverty rate below 15%, the average error is 0.1% higher. While this does not seem substantial, this is a higher percentage error in lower poverty areas because these areas generally have lower mortality ratios.
ggplot() +
geom_sf(data = povertyContextACS, aes(fill = poverty_context), color = "white") +
scale_fill_viridis(option = "D", discrete = TRUE, name = "Poverty rate (%)") +
labs(title = "Poverty", subtitle = "15% reflects the median value citywide") +
mapTheme()
left_join(st_drop_geometry(povertyContextACS), st_drop_geometry(resid_3), by = "zcta") %>%
na.omit() %>%
group_by(poverty_context) %>%
summarize(mean.Error = mean(MAE, na.rm = T)) %>%
spread(poverty_context, mean.Error) %>%
kable(caption = "Mean Error by Poverty Context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "Yellow") %>%
row_spec(0, color = "black", background = "white")
Less than 15% | More than 15% |
---|---|
0.8192159 | 0.9692555 |
Finally, the model predicts more accurately in ZIP codes with a higher share of foreign born residents.
ggplot() +
geom_sf(data = nativityContextACS, aes(fill = nativity_context), color = "white") +
scale_fill_viridis(option = "D", discrete = TRUE, name = "Majority Group") +
labs(title = "Foreign vs US-born Residents") +
mapTheme()
left_join(st_drop_geometry(nativityContextACS), st_drop_geometry(resid_3), by = "zcta") %>%
na.omit() %>%
group_by(nativity_context) %>%
summarize(mean.Error = mean(MAE, na.rm = T)) %>%
spread(nativity_context, mean.Error) %>%
kable(caption = "Mean Error by Foreign Born Context") %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "Yellow") %>%
row_spec(0, color = "black", background = "white")
Less than 35% | More than 35% |
---|---|
0.9935431 | 0.7699136 |
The results of this model have important policy implications. First, they demonstrate a relationship between housing and transportation conditions and a community’s Covid-19 resiliency. Adding to existing literature on the positive correlation between housing density and Covid-19 transmission, the model suggests that a higher share of apartments is also correlated with an increased proportion of cases that result in death. With regards to transportation, the model suggests that while Covid-19 transmission is associated with transit use, a higher density of transit in a neighborhood is associated with a lower risk of individuals succumbing to Covid-19.
The model also confirms the role of previously studied demographic variables in influencing Covid-19 outcomes. However, many attributes that have been linked to Covid-19 infection and deaths were not strong predictors for mortality ratio. This includes median age, foreign born residents, people in group housing, and not having health insurance.
While these results are important for understanding the factors influencing a community’s vulnerability to Covid-19 related deaths, this model has a number of limitations. With an R-squared of 0.72, it predicts relatively accurately on average. However, there is a clear drop off in model performance as mortality ratio decreases. The model’s inability to account for areas with lower mortality ratios causes it to over-predict in Staten Island and Manhattan, as these boroughs have the lowest mortality ratios. Similarly, it predicts better for ZIP codes that are majority non-White, experience higher poverty, and have a higher share of foreign born residents. Considering the use case of developing a model for allocating resources to more vulnerable communities that have been disparately affected by the pandemic, this is a fair tradeoff, as the model can still identify the most at-risk ZIP codes in terms of mortality ratio.
In all, the results of this model make it clear that mortality ratio has very different risk factors than the metrics which have traditionally been used to inform Covid-19 policies. In future interventions, it will be important to adopt practices that prioritize neighborhoods at the highest risk of mortality once infected, and not necessarily the ones with the most cases. While this model has limited ability to predict for mortality ratio in areas where it is lower, it offers an important first step for understanding the factors which cause some areas to suffer disproportionately from Covid-19 transmission. In future research, it would be valuable to study the disparate impacts of Covid-19 on workers in different industries, as the exploratory analysis showed some associations between service, education, and healthcare industry workers and mortality ratio. Similarly, while this study was not able to establish associations between institutions such as public housing and schools and a community’s Covid-19 mortality ratio, future research is needed on the role that public assistance and institutions can play in alleviating the burdens of public health emergencies.
CDC. “Health Equity Considerations and Racial and Ethnic Minority Groups.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 25 Jan. 2022, https://www.cdc.gov/coronavirus/2019-ncov/community/health-equity/race-ethnicity.html.
CDC. “Omicron Variant: What You Need to Know.” Centers for Disease Control and Prevention, Centers for Disease Control and Prevention, 29 Mar. 2022, https://www.cdc.gov/coronavirus/2019-ncov/variants/omicron-variant.html.
Gallo Marin, Benjamin, et al. “Predictors of Covid ‐19 Severity: A Literature Review.” Reviews in Medical Virology, vol. 31, no. 1, 2020, pp. 1–10., https://doi.org/10.1002/rmv.2146.
Ramírez-Aldana, Ricardo, et al. “Spatial Epidemiological Study of the Distribution, Clustering, and Risk Factors Associated with Early COVID-19 Mortality in Mexico.” PLOS ONE, vol. 16, no. 7, 2021, https://doi.org/10.1371/journal.pone.0254884.
Thomas, Michael M., et al. “Investigating the Association between Mass Transit Adoption and Covid-19 Infections in US Metropolitan Areas.” Science of The Total Environment, vol. 811, 2022, p. 152284., https://doi.org/10.1016/j.scitotenv.2021.152284.