knitr::opts_chunk$set(message = FALSE, warning = FALSE)
options(scipen=10000000)

library(tidyverse)
library(kableExtra)
library(caret)
library(knitr) 
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
library(viridis)
library(gridExtra)
library(forcats)
palette5 <- c("#d24ae8","#f0b660","#981FAC","#f5614e","#00bafe")
palette4 <- c("#d24ae8","#f0b660","#981FAC","#f5614e")
palette2 <- c("#d24ae8","#f0b660")

root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

housingsubsidy <- read.csv("/Users/annaduan/Documents/GitHub/Home-Repair-Tax-Credit-Program/housingSubsidy.csv") %>% 
  rename(tookSubsidyNum = y_numeric,
         prevMktgOutcome = poutcome,
         tookSubsidy = y)

Motivation

Emil City’s Department for Housing and Community Development (HCD) offers a home repair tax credit program to eligible homeowners. Historically, homeowners who use this credit have increased the value of their homes by an average of $10,000, and neighboring homes have gained a $56,000 premium. However, participation in the program is low: only 11% of eligible homeowners take the credit (Figure 1).

To better allocate HCD’s resources and maximize uptake, we design a model that targets outreach toward homeowners likely to accept the subsidy. Using data from previous marketing campaigns, we build a classifier that predicts whether a homeowner is likely to accept. With this model, we present Emil City’s HCD with a new outreach strategy that balances cost and participation.

ggplot(mutate(housingsubsidy, tookSubsidy = fct_infreq(tookSubsidy))) + geom_bar(aes(x = tookSubsidy)) +
  labs(title = "Figure 1: Uptake in Prior Campaigns", x = "Repair Credit Uptake Among Contacted Homeowners", y = "Count of Homeowners") +
  plotTheme()

Data

We use records from past marketing campaigns for our analysis. These records describe the characteristics of each homeowner and whether they took the repair credit. Below, we list the variables and visualize their relationship with the credit uptake outcome.

Continuous Variables

  • Age
  • Unemployment rate at time of campaign (unemploy_rate)
  • Consumer price index at time of campaign (cons.price.idx)
  • Consumer confidence index at time of campaign (cons.conf.idx)
  • Inflation rate at time of campaign (inflation_rate)
  • Annual spending on home repairs (spent_on_repairs)
  • Number of contacts during this campaign (campaign)
  • Number of contacts prior to campaign (previous)
  • Days since last contact (pdays; 999 indicates never contacted)

Among these variables, the following appear most strongly correlated with uptake: unemployment rate, number of contacts prior to the campaign, inflation rate, and contacts during the campaign (Figure 2). Predictably, homeowners who accepted were contacted more times prior to the campaign, and more recently. They were also more likely to accept when contacted during periods of low inflation, suggesting that homeowners are more willing to spend on repairs when the economy is healthy. Unemployment rate suggests the same effect: homeowners who accepted were contacted at times of lower unemployment. Age, consumer confidence index, consumer price index, and annual repair spending do not appear strongly related to the likelihood of accepting the subsidy, so we omit them from this visualization.

#Continuous variables
housingsubsidy %>%
  dplyr::select(tookSubsidy, campaign, pdays, previous, unemploy_rate, inflation_rate) %>% 
  gather(Variable, value, -tookSubsidy) %>%
    ggplot(aes(tookSubsidy, value, fill=tookSubsidy)) + 
      geom_bar(position = "dodge", stat = "summary", fun = "mean") + 
      facet_wrap(~Variable, scales = "free") +
      scale_fill_viridis(discrete = TRUE, option = "D") +
      labs(x="Credit Uptake Outcome", y="Mean", 
           title = "Figure 2: Feature Associations with Likelihood of Accepting Subsidy",
           subtitle = "(continuous outcomes)") +
  plotTheme() + theme(legend.position = "none")

Yes/No Variables

  • Does this property have a tax lien?
  • Does the homeowner carry a mortgage?
  • Is the homeowner paying tax in Philadelphia?
  • Did the homeowner take the subsidy?

Most homeowners who carry a mortgage did not accept the credit (Figure 3). Similarly, most homeowners with a tax bill in Philadelphia declined. This may give the impression that both variables are negatively correlated with subsidy uptake. However, recall that more than 90% of homeowners in the data declined the credit, so it is uncertain whether these trends reflect real correlations; further analysis is needed.

#yes/no variables
housingsubsidy %>%
  dplyr::select(tookSubsidy, mortgage, taxbill_in_phl) %>%
  gather(Variable, value, -tookSubsidy) %>%
  count(Variable, value, tookSubsidy) %>%
  filter(value == "yes") %>%
    ggplot(aes(tookSubsidy, n, fill = tookSubsidy)) +   
      geom_bar(position = "dodge", stat="identity") +
      facet_wrap(~Variable, scales = "free", ncol=2) +
      scale_fill_viridis(discrete = TRUE, option = "B") +
      labs(x="Credit Uptake Outcome", y="Count",
           title = "Figure 3: Feature Associations with the Likelihood of Accepting Subsidy",
           subtitle = "Two category features (Yes and No)") +
      plotTheme() + theme(legend.position = "none")
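
One simple form of such further analysis is a chi-square test of independence between each yes/no feature and uptake; a minimal sketch for the mortgage variable, not part of the original pipeline (the same call works for taxbill_in_phl):

#quick association check; a small p-value alone does not establish a usable effect
chisq.test(table(housingsubsidy$mortgage, housingsubsidy$tookSubsidy))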

Categorical Variables

  • Job
  • Marital status
  • Education attainment
  • Contact type (e.g. cell vs telephone)
  • Month of last contact
  • Day of week of last contact
  • Previous marketing outcome

Among these variables we see a similar pattern: in almost every category, more homeowners decline than accept (Figure 4). Month of contact appears to have a relatively strong effect on uptake; homeowners contacted in December, March, October, and September were more likely to accept the subsidy than those contacted in other months. Homeowners contacted via cell phone were also more likely to accept than those contacted via landline telephone.

#more than 2 categories
housingsubsidy %>%
  dplyr::select(tookSubsidy, education, contact, month, day_of_week) %>%
  gather(Variable, value, -tookSubsidy) %>%
  count(Variable, value, tookSubsidy) %>%
  ggplot(aes(value, n, fill = tookSubsidy)) +   
    geom_bar(position = "dodge", stat="identity") +
    facet_wrap(~Variable, scales="free", ncol=2) +
    scale_fill_manual(values = palette2) +
    labs(x="Credit Uptake Outcome", y="Count",
         title = "Figure 4: Feature Associations with the Likelihood of Accepting Subsidy",
         subtitle = "Multiple category features") +
    plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Feature Engineering

To better utilize the data, we engineer new features to increase our model’s predictive power. Below, we create new category groupings for employment status, age group, education, job industry, last contact date, and whether an individual has been contacted before.

We use the job variable to create an employment status feature that tells us whether an individual is employed, unemployed, retired, or a student. Employment status, rather than a person's specific job, should better predict whether they accept money from the government to fix their house. Similarly, we categorize age into “Young adult”, “Middle-aged adult”, and “Old adult”, as we found that credit uptake varies most between the 18-34, 35-64, and 65+ groups. Next, we separate education into high school, university, unknown, illiterate, and below high school, grouping together all categories less than high school. For industry, we split jobs into blue collar, services, entrepreneur, white collar, and unknown to reflect typical wages and, by extension, expected propensity to spend on home repairs. Finally, we create categories for the time since the last marketing contact and for whether an individual has been contacted before. (As described in the modeling section below, several of these levels are later collapsed; the code shows the final groupings.) These new groupings should better capture the relationship between our independent variables and the uptake outcome.

housingsubsidy <- 
  housingsubsidy %>% 
  mutate(#several finer job/education groupings were collapsed after an initial model run (see below)
         employStatus = case_when(job %in% c("retired", "student", "unemployed") ~ "not employed",
                                  TRUE ~ "employed"),
         ageGroup = case_when(age >= 18 & age < 35 ~ "Young adult",
                              age >= 35 & age < 65 ~ "Middle-aged adult",
                              age >= 65 ~ "Old adult"), 
         degreeGroup = case_when(education == "university.degree" ~ "University",
                                 TRUE ~ "No University"),
         industry = case_when(job %in% c("blue-collar", "technician") ~ "blue collar",
                              TRUE ~ "Non Blue-Collar"),
         lastContact = case_when(pdays == 999 ~ "never contacted",
                                 pdays < 7 ~ "Past week",
                                 pdays >= 7 & pdays < 14 ~ "1-2 Weeks",
                                 pdays >= 14 & pdays < 21 ~ "2-3 Weeks",
                                 TRUE ~ "More than 3 Weeks"),
         contactedBefore = case_when(previous == 0 ~ "no",
                                     TRUE ~ "yes"),
         tookSubsidy = case_when(tookSubsidy == "yes" ~ "accepted",
                                 tookSubsidy == "no" ~ "declined"))

Next, we split our data into training and test sets.

set.seed(10)
#stratify the 65/35 split on uptake outcome and education
trainIndex <- createDataPartition(y = paste(housingsubsidy$tookSubsidy, housingsubsidy$education),
                                  p = .65,
                                  list = FALSE,
                                  times = 1)
housingsubsidyTrain <- housingsubsidy[ trainIndex,]
housingsubsidyTest  <- housingsubsidy[-trainIndex,]

Estimating an Uptake Model

We test multiple regression models before settling on one with an optimal mix of original and engineered features (reg3). For comparison, we also fit a “kitchen sink” model consisting only of unengineered features (reg1). Using our training set, we estimate both regressions; the results are displayed below.

First, in the summary for the kitchen sink model, we see that, as noted earlier, individuals contacted in March have substantially higher odds of accepting the credit; December’s odds ratio is also above 2, though not statistically significant. While September and October contacts appear positively associated with uptake in Figure 4, their coefficients do not show a significant increase in the odds of uptake. Wednesday and Thursday appear to be the best days to contact homeowners, although no day-of-week effect is significant. As for age, the selected model below shows that young adults (18-34) are the most likely to accept a credit.

library(sjPlot)
library(sjmisc)
library(sjlabelled)

#without engineered features - kitchen sink 
reg1 <- glm(tookSubsidyNum ~ .,
                  data=housingsubsidyTrain %>% 
                    dplyr::select(-employStatus, -ageGroup, -degreeGroup, -industry, -lastContact, -contactedBefore, -X, -tookSubsidy),
                  family="binomial" (link="logit"))
tab_model(reg1)
  tookSubsidyNum
Predictors Odds Ratios CI p
(Intercept) 0.00 0.00 – 127749944234945308450718746673152.00 0.163
age 1.01 1.00 – 1.03 0.128
job [blue-collar] 0.73 0.42 – 1.25 0.253
job [entrepreneur] 0.71 0.26 – 1.65 0.461
job [housemaid] 0.94 0.34 – 2.27 0.892
job [management] 0.50 0.26 – 0.91 0.028
job [retired] 0.69 0.32 – 1.44 0.337
job [self-employed] 0.46 0.18 – 1.03 0.078
job [services] 0.88 0.49 – 1.55 0.674
job [student] 0.92 0.37 – 2.19 0.853
job [technician] 0.98 0.62 – 1.55 0.947
job [unemployed] 0.86 0.35 – 1.96 0.739
job [unknown] 0.34 0.04 – 1.59 0.229
marital [married] 1.51 0.93 – 2.55 0.113
marital [single] 1.36 0.77 – 2.47 0.294
marital [unknown] 2.00 0.08 – 16.85 0.585
education [basic.6y] 0.80 0.31 – 1.91 0.633
education [basic.9y] 1.29 0.68 – 2.48 0.438
education [high.school] 1.07 0.57 – 2.06 0.840
education [illiterate] 0.00 NA – 1720328852398799471310254903302547392430080.00 0.983
education [professional.course] 1.24 0.62 – 2.49 0.541
education [university.degree] 1.26 0.67 – 2.43 0.476
education [unknown] 1.44 0.63 – 3.20 0.377
taxLien [unknown] 1.17 0.76 – 1.76 0.460
taxLien [yes] 0.00 NA – 6037506165911992147905779907636057289523200.00 0.984
mortgage [unknown] 0.83 0.26 – 2.24 0.730
mortgage [yes] 0.90 0.68 – 1.20 0.479
taxbill_in_phl [yes] 1.09 0.75 – 1.63 0.656
contact [telephone] 0.34 0.18 – 0.61 0.001
month [aug] 1.17 0.47 – 2.89 0.737
month [dec] 2.35 0.54 – 10.12 0.251
month [jul] 1.06 0.50 – 2.29 0.871
month [jun] 1.12 0.45 – 2.84 0.807
month [mar] 6.97 2.44 – 20.24 <0.001
month [may] 0.83 0.43 – 1.61 0.563
month [nov] 0.65 0.27 – 1.56 0.336
month [oct] 1.13 0.37 – 3.41 0.834
month [sep] 0.92 0.25 – 3.35 0.898
day_of_week [mon] 1.23 0.79 – 1.93 0.364
day_of_week [thu] 1.25 0.80 – 1.97 0.328
day_of_week [tue] 1.18 0.74 – 1.88 0.487
day_of_week [wed] 1.35 0.85 – 2.14 0.208
campaign 0.96 0.89 – 1.02 0.264
pdays 1.00 1.00 – 1.00 0.178
previous 0.98 0.67 – 1.44 0.900
prevMktgOutcome [nonexistent] 1.34 0.72 – 2.58 0.365
prevMktgOutcome [success] 2.00 0.50 – 8.13 0.321
unemploy_rate 0.28 0.11 – 0.74 0.010
cons.price.idx 5.44 0.99 – 30.54 0.052
cons.conf.idx 1.05 0.99 – 1.11 0.125
inflation_rate 1.31 0.54 – 3.17 0.543
spent_on_repairs 1.00 0.98 – 1.03 0.667
Observations 2681
R2 Tjur 0.184
pR2(reg1)
## fitting null model for pseudo-r2
##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -730.4513292 -899.5516143  338.2005703    0.1879829    0.1185149    0.2424451
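
The odds ratios and intervals in the table above can be approximately reproduced straight from the fitted glm; a minimal sketch assuming reg1 from the chunk above (profile-likelihood intervals are slow on a model this wide and will warn where the model nearly separates, e.g. taxLien [yes]):

#odds ratios with 95% profile-likelihood confidence intervals
exp(cbind(OddsRatio = coef(reg1), confint(reg1)))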

Next, we estimate a regression for our selected model. In an intermediate run that included all engineered features, the regression summary returned no coefficients for several levels of employment status (retired, student, unemployed), degree (below HS, HS, illiterate, unknown), and industry (entrepreneur, services, unknown, white collar), because these levels are redundant with the original job and education variables. We therefore returned to feature engineering, collapsed these groupings, and estimated the model again. Our McFadden score, a goodness-of-fit metric, increased from 0.188 to 0.192.

#Selected model 
reg3 <- glm(tookSubsidyNum ~ .,
                  data=housingsubsidyTrain %>% 
                    dplyr::select(-X, -tookSubsidy, -industry, -employStatus, -degreeGroup, -contactedBefore, -taxLien, -pdays),
                  family="binomial" (link="logit"))
tab_model(reg3)
  tookSubsidyNum
Predictors Odds Ratios CI p
(Intercept) 0.00 0.00 – 1965015311576484768907264.00 0.126
age 1.04 1.01 – 1.06 0.003
job [blue-collar] 0.75 0.43 – 1.29 0.297
job [entrepreneur] 0.69 0.25 – 1.61 0.431
job [housemaid] 0.95 0.35 – 2.30 0.920
job [management] 0.51 0.26 – 0.93 0.033
job [retired] 0.70 0.31 – 1.50 0.368
job [self-employed] 0.46 0.18 – 1.03 0.079
job [services] 0.91 0.51 – 1.59 0.745
job [student] 0.93 0.37 – 2.23 0.876
job [technician] 1.02 0.64 – 1.60 0.938
job [unemployed] 0.85 0.34 – 1.92 0.705
job [unknown] 0.35 0.04 – 1.64 0.241
marital [married] 1.50 0.92 – 2.55 0.114
marital [single] 1.36 0.77 – 2.47 0.301
marital [unknown] 2.18 0.10 – 17.36 0.527
education [basic.6y] 0.80 0.31 – 1.89 0.625
education [basic.9y] 1.23 0.66 – 2.36 0.517
education [high.school] 1.01 0.54 – 1.94 0.973
education [illiterate] 0.00 NA – 1937867379328438066680464865036034994339840.00 0.983
education [professional.course] 1.16 0.59 – 2.33 0.665
education [university.degree] 1.20 0.64 – 2.29 0.573
education [unknown] 1.40 0.62 – 3.11 0.412
mortgage [unknown] 0.90 0.28 – 2.45 0.852
mortgage [yes] 0.91 0.68 – 1.21 0.499
taxbill_in_phl [yes] 1.11 0.76 – 1.65 0.612
contact [telephone] 0.34 0.18 – 0.61 0.001
month [aug] 1.22 0.49 – 3.04 0.667
month [dec] 2.31 0.52 – 10.03 0.262
month [jul] 1.08 0.50 – 2.32 0.848
month [jun] 1.08 0.43 – 2.75 0.868
month [mar] 6.83 2.35 – 20.06 <0.001
month [may] 0.86 0.45 – 1.68 0.648
month [nov] 0.66 0.27 – 1.58 0.346
month [oct] 1.17 0.38 – 3.55 0.786
month [sep] 1.01 0.27 – 3.72 0.989
day_of_week [mon] 1.23 0.79 – 1.94 0.361
day_of_week [thu] 1.24 0.79 – 1.95 0.357
day_of_week [tue] 1.15 0.72 – 1.83 0.564
day_of_week [wed] 1.31 0.82 – 2.09 0.254
campaign 0.96 0.88 – 1.02 0.234
previous 1.00 0.68 – 1.47 0.986
prevMktgOutcome [nonexistent] 1.38 0.73 – 2.65 0.330
prevMktgOutcome [success] 2.25 0.44 – 12.61 0.339
unemploy_rate 0.26 0.10 – 0.69 0.007
cons.price.idx 6.24 1.12 – 35.49 0.037
cons.conf.idx 1.05 0.99 – 1.11 0.118
inflation_rate 1.31 0.54 – 3.17 0.550
spent_on_repairs 1.01 0.98 – 1.03 0.591
ageGroup [Old adult] 0.43 0.15 – 1.22 0.114
ageGroup [Young adult] 1.79 1.11 – 2.90 0.018
lastContact [2-3 Weeks] 0.39 0.05 – 2.95 0.361
lastContact [More than 3 Weeks] 209337.53 0.00 – NA 0.982
lastContact [never contacted] 0.27 0.05 – 1.39 0.112
lastContact [Past week] 0.54 0.13 – 1.99 0.372
Observations 2681
R2 Tjur 0.188
pR2(reg3)
## fitting null model for pseudo-r2
##          llh      llhNull           G2     McFadden         r2ML         r2CU 
## -726.7006142 -899.5516143  345.7020003    0.1921524    0.1209778    0.2474835
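
For a side-by-side view of fit, the McFadden statistic can be pulled from both models in a single call; a small sketch assuming reg1 and reg3 from the chunks above:

#McFadden pseudo-R2: kitchen sink vs. selected model
c(kitchenSink = pR2(reg1)["McFadden"], selected = pR2(reg3)["McFadden"])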

Evaluating the Model

Goodness of fit

Next, we compare our model’s ability to predict positive and negative uptake outcomes. A positive outcome (1) means a homeowner accepts the credit; a negative outcome (0) means they decline.

In a highly predictive model, the distribution of predicted probabilities for observed negatives should peak near 0 and the distribution for observed positives should peak near 1 (Figure 5). Our model is much more adept at predicting negative outcomes. This is expected, as most observed outcomes in our data are negative. It also means our model has lower sensitivity (true positive rate) than specificity (true negative rate).

#reg3 engineering features
testProbs_r3 <- data.frame(Outcome = as.factor(housingsubsidyTest$tookSubsidyNum),
                        Probs = predict(reg3, housingsubsidyTest, type= "response"))


ggplot(testProbs_r3, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) +
  labs(x = "Uptake Outcome, 0 = decline, 1 = accept", y = "Density of Probabilities",
       title = "Figure 5: Distribution of Predicted Probabilities by Observed Outcome") +
  plotTheme() + theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

#kitchen sink
testProbs_ks <- data.frame(Outcome = as.factor(housingsubsidyTest$tookSubsidyNum),
                        Probs = predict(reg1, housingsubsidyTest, type= "response"))

Feature engineering, however, gives our model slightly higher sensitivity than the kitchen sink model (0.229 vs. 0.224 at the default 0.5 threshold, per the confusion matrices below).

Confusion matrix for kitchen sink model:

testProbs_ks <- 
  testProbs_ks %>%
  mutate(predOutcome = as.factor(ifelse(Probs > 0.5, 1, 0))) 

#confusion matrix
caret::confusionMatrix(testProbs_ks$predOutcome, testProbs_ks$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1257  132
##          1   11   38
##                                               
##                Accuracy : 0.9006              
##                  95% CI : (0.8839, 0.9155)    
##     No Information Rate : 0.8818              
##     P-Value [Acc > NIR] : 0.01369             
##                                               
##                   Kappa : 0.3106              
##                                               
##  Mcnemar's Test P-Value : < 0.0000000000000002
##                                               
##             Sensitivity : 0.22353             
##             Specificity : 0.99132             
##          Pos Pred Value : 0.77551             
##          Neg Pred Value : 0.90497             
##              Prevalence : 0.11822             
##          Detection Rate : 0.02643             
##    Detection Prevalence : 0.03408             
##       Balanced Accuracy : 0.60743             
##                                               
##        'Positive' Class : 1                   
## 

Confusion matrix for our model:

testProbs_r3 <- 
  testProbs_r3 %>%
  mutate(predOutcome = as.factor(ifelse(Probs > 0.5, 1, 0))) 

#confusion matrix
caret::confusionMatrix(testProbs_r3$predOutcome, testProbs_r3$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1253  131
##          1   15   39
##                                               
##                Accuracy : 0.8985              
##                  95% CI : (0.8817, 0.9136)    
##     No Information Rate : 0.8818              
##     P-Value [Acc > NIR] : 0.02562             
##                                               
##                   Kappa : 0.3088              
##                                               
##  Mcnemar's Test P-Value : < 0.0000000000000002
##                                               
##             Sensitivity : 0.22941             
##             Specificity : 0.98817             
##          Pos Pred Value : 0.72222             
##          Neg Pred Value : 0.90535             
##              Prevalence : 0.11822             
##          Detection Rate : 0.02712             
##    Detection Prevalence : 0.03755             
##       Balanced Accuracy : 0.60879             
##                                               
##        'Positive' Class : 1                   
## 

ROC Curve

Another tool we use to evaluate the model is the ROC curve, where the x-axis represents the false positive rate (predicted accept, observed decline) and the y-axis the true positive rate (predicted and observed accept). In Figure 6, once the true positive rate exceeds roughly 0.5, the false positive rate begins to climb more quickly, an outcome that wastes outreach resources.

Another metric we examine is the area under the ROC curve (AUC). An AUC of 1 means perfect separation on this sample, which typically signals over-fitting and poor generalizability to new data, while a random “coin flip” model scores 0.5. A strong model should therefore fall between 0.5 and 1; ours achieves 0.8258.

ggplot(testProbs_r3, aes(d = as.numeric(Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#FE9900") + 
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "Figure 6: ROC Curve", x = "False Positive Rate", y = "True Positive Rate") +
  plotTheme()

#area under curve 
pROC::auc(testProbs_r3$Outcome, testProbs_r3$Probs) 
## Area under the curve: 0.8258

Cross Validation

Next, we cross-validate both models on the full dataset, using 100 folds, to see how well they predict on data they were not fit to.

The AUC for our model is slightly higher than the kitchen sink model’s (Figure 7). The specificity, or true negative rate, is almost perfect for both models, likely because of the large share of observed negatives in the training data. Finally, the sensitivity, or true positive rate, in our final model is slightly higher than the kitchen sink model’s. This shows that we improved our model’s ability to predict positive cases.


ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

#kitchen sink
cvFit1 <- train(tookSubsidy ~ ., data = housingsubsidy %>% 
                                   dplyr::select(-employStatus, -ageGroup, -degreeGroup, -industry, -lastContact, -contactedBefore, -X, -tookSubsidyNum), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)




#selected features
cvFit2 <- train(tookSubsidy ~ ., data = housingsubsidy %>% 
                                   dplyr::select(-X, -tookSubsidyNum, -industry, -employStatus, -degreeGroup, -contactedBefore, -taxLien, -pdays), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)



#kitchen sink
f1<-
dplyr::select(cvFit1$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit1$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Kitchen Sink Model Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics: Kitchen Sink",
         subtitle = "Across-fold mean reprented as dotted lines kitchen sink") +
    plotTheme()
cvFit1


#selected model
f2<-
dplyr::select(cvFit2$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit2$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#FF006A") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#981FAC", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Selected Model Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics: Final Model",
         subtitle = "Across-fold mean reprented as dotted lines new features model") +
    plotTheme()

grid.arrange(f1, f2, ncol =1, top = "Figure 7: ROC, Sensitivity and Specificity of Models")

cvFit2
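
To compare the across-fold means directly rather than reading them off the histograms, a small sketch assuming cvFit1 and cvFit2 from the chunks above:

#mean cross-validated ROC, sensitivity, and specificity for both models
rbind(kitchenSink = cvFit1$results[c("ROC", "Sens", "Spec")],
      selected    = cvFit2$results[c("ROC", "Sens", "Spec")])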

Cost Benefit Analysis

The final step in designing this model is to choose the probability threshold at which it classifies an individual as likely to accept the tax credit. Our goal is to balance saving money with increasing program uptake.

Assuming for illustration that the whole eligible population is represented by our test set, we calculate the baseline revenue and uptake rate HCD would see without an algorithm. Without the model, HCD allocates $2,850 in marketing resources to contact each of the 1,430 homeowners in the test set. 11% (157) accept, and HCD gives each of them a $5,000 tax credit for repairs. As a result, their houses gain $10,000 in value, and neighboring houses experience a $56,000 premium, although the latter isn’t measurable given our data. In the end, HCD’s revenue is -$3.3 million; not using an algorithm is costly. This baseline serves as our point of comparison in the cost/benefit analysis.

HCD’s cost benefit without algorithm: (10000 - 5000) * (0.11 * 1430) - 2850 * 1430 = -$3,289,000
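
This baseline is easy to verify in R; a minimal sketch using the same illustrative figures (1,430 contacts, 11% uptake, $5,000 credit, $10,000 premium, $2,850 per contact):

#no-algorithm baseline: net credit premium minus universal outreach cost
n_homeowners <- 1430
uptake_rate <- 0.11
(10000 - 5000) * (uptake_rate * n_homeowners) - 2850 * n_homeowners
## [1] -3289000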

To calculate cost/benefit using our model, we first assume that 25% of homeowners who we predict will accept the credit actually accept it when we contact them. We count the $10,000 premium as the benefit, and the $5000 subsidy and $2850 in marketing and advertising as the cost. As we do not have spatial data, we do not include the neighboring home premium in our calculation.

Assumptions:
1. $10,000 premium per house that accepts subsidy
2. $2,850 in marketing for each contact
3. $5,000 subsidy per house that accepts
4. 25% of contacted homeowners accept subsidy

Below is the method we use for calculating cost/benefit:

  • True negative: Predicted correctly homeowner would not take the credit, no marketing resources were allocated, no credit was allocated: $0

  • True positive: Predicted correctly that a homeowner would take the credit; allocated marketing resources, and 25% took the credit: $10,000 - ($5,000 + $2,850) = $2,150 return for each of the 25% of cases that accepted the credit. We spend $2,850 on the other 75% of cases, who were sent the offer but did not take the credit.

  • Equation: ((10000 - 7850) * (true positive count * .25)) + (-2850 * (true positive count * .75))

  • False negative: Predicted that a homeowner would not take the credit but they did. These are likely homeowners who signed up for reasons unrelated to the marketing campaign. Thus, we ‘0 out’ this category, assuming the cost/benefit of this is $0

  • False positive: Predicted incorrectly that a homeowner would take the credit; allocated marketing resources; no credit allocated: -$2850

  • Equation: -2850 * false positive count

In Table 1, we see that even when HCD uses our algorithm to optimize outreach, it still spends more than it directly generates. Reaching out to all true and false positives, it contacts 54 homeowners at a total cost of $153,900. 25% of the 39 true positives accept the subsidy, so HCD spends $48,750 on credits. Each participating home gains $10,000 in value, totaling $97,500. The final revenue is -$105,150.
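
These figures follow directly from the confusion matrix counts; a minimal sketch of the arithmetic at the 0.5 threshold (39 true positives, 15 false positives):

#outreach cost, credit outlays, and premiums for predicted positives
tp <- 39; fp <- 15
outreach <- 2850 * (tp + fp)   #$153,900 contacting all predicted positives
credits <- 5000 * (tp * .25)   #$48,750 in credits to the 25% who accept
premium <- 10000 * (tp * .25)  #$97,500 in home value gains
premium - credits - outreach
## [1] -105150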

However, HCD is a public agency dealing in public goods, so financial return is only one of many outcomes it cares about. Our calculation accounts for neither the $56,000 premium that neighboring houses experience nor the improved quality of life homeowners gain from repairing their homes. The algorithm also dramatically reduces cost: without it, HCD would reach out to all 1,430 eligible homeowners and incur -$3.3 million in revenue. Our model saves HCD more than $3 million and leads to 140 credits given out, only 17 fewer than contacting every eligible homeowner would yield.

cost_benefit_table <-
   testProbs_r3 %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               case_when(Variable == "True_Negative"  ~ 0,  
                         Variable == "True_Positive"  ~ ((10000 - 7850) * (Count * .25)) + (-2850 * (Count * .75)),  
                         Variable == "False_Negative" ~ 0,
                         Variable == "False_Positive" ~ (-2850)*Count)) %>%
    bind_cols(data.frame(Description = c(
              "Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no credit was allocated.",
              "Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took the credit.",
              "Predicted that a homeowner would not take the credit but they did. These are likely homeowners who signed up for reasons unrelated to the marketing campaign. Thus, we '0 out' this category, assuming the cost/benefit of this is $0.",
              "Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated.")))

kable(cost_benefit_table) %>% 
  kable_styling(font_size = 12, full_width = F,  
                bootstrap_options = c("striped", "hover", "condensed")) %>%
  footnote(general_title = "\n",
           general = "Table 1")
Variable Count Revenue Description
True_Negative 1253 0 Predicted correctly homeowner would not take the credit, no marketing resources were allocated, and no credit was allocated.
True_Positive 39 -62400 Predicted correctly homeowner would take the credit; allocated the marketing resources, and 25% took the credit.
False_Negative 131 0 Predicted that a homeowner would not take the credit but they did. These are likely homeowners who signed up for reasons unrelated to the marketing campaign. Thus, we ‘0 out’ this category, assuming the cost/benefit of this is $0.
False_Positive 15 -42750 Predicted incorrectly homeowner would take the credit; allocated marketing resources; no credit allocated.

Table 1

There is still room to optimize this algorithm for cost efficiency and the number of subsidies provided by adjusting the probability threshold at which a household is classified as a predicted positive. Currently we use the default of 0.5, classifying predicted probabilities above 50% as positive. In the code below, we calculate confusion matrix metrics and revenue for each threshold from 0.01 to 0.99.

Figure 8 visualizes revenue by confusion matrix type, showing the tradeoffs in our model. False negatives and true negatives generate no costs and no (measurable) benefits at any threshold. By contrast, at lower thresholds, where more homeowners are predicted likely to accept a subsidy, more money is spent on outreach, so false positives generate higher costs. True positives follow the same trend, though their curve is less steep because each accepting homeowner generates $10,000, offsetting cost.

#compute confusion matrix counts and revenue at each threshold from 0.01 to 0.99
iterateThresholds <- function(data) {
  x = .01
  all_prediction <- data.frame()
  while (x <= 1) {
  
  this_prediction <-
      data %>%
      mutate(predOutcome = ifelse(Probs > x, 1, 0)) %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
     gather(Variable, Count) %>%
     mutate(Revenue =
               ifelse(Variable == "True_Negative", Count * 0,
               ifelse(Variable == "True_Positive",((10000 - 7850) * (Count * .25)) + (-2850 * (Count * .75)),
               ifelse(Variable == "False_Negative", 0 * Count,
               ifelse(Variable == "False_Positive", (-2850) * Count, 0)))),
            Threshold = x)
  
  all_prediction <- rbind(all_prediction, this_prediction)
  x <- x + .01
  }
return(all_prediction)
}

whichThreshold <- iterateThresholds(testProbs_r3)


#plot
whichThreshold %>%
  ggplot(.,aes(Threshold, Revenue, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette5[c(5, 1:3)]) +    
  labs(title = "Figure 8: Revenue by Confusion Matrix Type and Threshold",
       y = "Revenue") +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix")) 

#calculate revenue and number of credits allocated
whichThreshold_revenue <- 
  whichThreshold %>% 
    mutate(actualcreditno = ifelse(Variable == "True_Negative" | Variable == "False_Positive", Count, ifelse(Variable == "True_Positive", Count*.75, 0)),
           actualcredityes = ifelse(Variable == "True_Positive", (Count * .25), ifelse(Variable == "False_Negative", Count, 0)),
           total_TP = ifelse(Variable == "True_Positive", Count, 0),
           total_FN = ifelse(Variable == "False_Negative", Count, 0)) %>%  
    group_by(Threshold) %>% 
    summarize(Revenue = sum(Revenue),
              creditsGiven = sum(actualcredityes),
              falseNegative = sum(total_FN),
              truePositive = sum(total_TP))
whichThreshold_revenue <- whichThreshold_revenue %>%
  mutate(TP_rate = truePositive/creditsGiven)



After calculating revenue and number of credits allocated for each threshold, we find an interesting outcome.

Surprisingly, revenue and total credits allocated both flatten out after threshold 0.92. In the 0.92 to 0.99 range, they are both maximized, satisfying the goals of minimizing cost and maximizing credits allocated (Figure 9). If we are optimizing solely for these outcomes, we should use any of the thresholds within this range.

However, within this range, the credits given out are all received by false negative homeowners. In other words, HCD didn’t contact them and they found and applied for the credit on their own. If HCD’s goal is to reach homeowners who do not already know about the program, 0.92 to 0.99 might not be our optimal range. Additionally, while our calculations do not count the $56,000 premium on neighboring houses, this is another significant incentive for HCD to maximize community awareness of the program.

Where does HCD reach the most new homeowners, then? Figure 9 shows that true positives are highest at the lowest thresholds. This is as expected. If our model classifies any probability as a positive outcome, HCD will reach out to all eligible homeowners. This way, HCD reaches the most homeowners who are unaware of the program. However, this is highly cost inefficient. Recall that total revenue without the algorithm is -$3,289,000. At thresholds lower than 0.04, the model will make HCD incur higher costs than it would with no model. If HCD’s only goal is to spend less on the program than it would with no model, it can use any threshold from 0.04 to 0.99. However, when factoring in the cost of designing the model and hiring a data scientist, it needs to use a higher threshold.

p1 <-
  ggplot(whichThreshold_revenue)+
  geom_line(aes(x = Threshold, y = Revenue))+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -Revenue)[1,1]))+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -Revenue)[8,1]))+
  annotate("text", x = 0.75, y = -2000000, label = "Optimal\nThreshold:\n0.92 - 0.99")+
    labs(title = "Total Revenue by Threshold for Test Sample",
         subtitle = "Vertical Lines Denote Optimal Threshold Range") +
   plotTheme()

p2 <-  
  ggplot(whichThreshold_revenue)+
  geom_line(aes(x = Threshold, y = creditsGiven))+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -creditsGiven)[1,1]))+
    geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -creditsGiven)[8,1]))+
    annotate("text", x = 0.75, y = 70, label = "Optimal\nThreshold:\n0.92 - 0.99")+
    labs(title = "Total Count of Credits by Threshold for Test Sample",
         subtitle = "Vertical Lines Denote Optimal Threshold Range") +
   plotTheme()

p3 <-  
  ggplot(whichThreshold_revenue)+
  geom_line(aes(x = Threshold, y = truePositive))+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -truePositive)[1,1]))+
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -truePositive)[2,1]))+
  annotate("text", x = 0.75, y = 70, label = "Optimal\nThreshold:\n0.01 - 0.02")+
    labs(title = "True Positives by Threshold for Test Sample",
         subtitle = "Vertical Lines Denote Optimal Threshold Range") +
   plotTheme()

p4 <-  
  ggplot(whichThreshold_revenue)+
  geom_line(aes(x = Threshold, y = falseNegative)) +
  geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -falseNegative)[1,1])) +
    geom_vline(xintercept =  pull(arrange(whichThreshold_revenue, -falseNegative)[8,1])) +
  annotate("text", x = 0.75, y = 70, label = "Optimal\nThreshold:\n0.92 - 0.99") +
    labs(title = "False Negatives by Threshold for Test Sample",
         subtitle = "Vertical Lines Denote Optimal Threshold Range") +
   plotTheme()

 grid.arrange(p1, p2, p3, p4, ncol =1, top = "Figure 9: Revenue, Credits Given, True Positives, and False Negatives by Threshold")

Considering the previous discussion, the optimal threshold for HCD depends on its priorities and operating budget. In the unlikely case that it values revenue above all, it should use a threshold between 0.92 and 0.99. If it wants to maximize its reach to unaware homeowners and capture unmeasured neighborhood benefits, including the $56,000 sales premium for neighbors, it should choose a threshold between 0.04 and 0.99; within this range, the higher HCD’s budget, the lower it can set the threshold to increase true positives. For comparison, Table 2 displays revenue and credits allocated at thresholds of 0.92 (optimized for cost and total credits taken), 0.5 (the default, for reference), and 0.04 (optimized for true positives while remaining cheaper than the no-algorithm approach). Note that at 0.92 all credits go to false negatives; at 0.5 this figure is 93%; and at 0.04 it is 20%.
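
The false-negative shares quoted above can be read out of whichThreshold_revenue; a small sketch (rounding Threshold avoids floating-point mismatches from the repeated 0.01 increments):

#share of credits at each comparison threshold that go to uncontacted homeowners
whichThreshold_revenue %>%
  filter(round(Threshold, 2) %in% c(0.92, 0.50, 0.04)) %>%
  mutate(FN_share = falseNegative / creditsGiven)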

#compare thresholds 0.92, 0.50, and 0.04
threshold_table <- whichThreshold_revenue[c(92, 50, 4), c("Threshold", "Revenue", "creditsGiven")]
threshold_table %>%
  kable(caption = "Table 2: Total Revenue and Credits Given Using Algorithm", col.names = c("Threshold", "Total Revenue", "Total Count of Credits")) %>%
    kable_styling("striped", full_width = T) 
Table 2: Total Revenue and Credits Given Using Algorithm
Threshold Total Revenue Total Count of Credits
0.92 0 170.00
0.50 -105150 140.75
0.04 -2923600 50.00

Conclusion

We recommend that HCD use this model in its next marketing campaign for the home repair tax credit program. The model gives HCD a range of options for optimizing cost, uptake, and reach to homeowners previously unaware of the program. Because HCD aims to improve the housing stock and quality of life in Emil City, the ability to tune the classification threshold to its particular goals makes the model especially valuable.

This model could be improved in a number of ways. Its major weakness is predicting true positives, the cases where a homeowner is likely to accept the tax credit. This stems from the low share of positive outcomes in the data we trained and tested on; improving the model would require more observations in which the homeowner accepts the credit. Additionally, spatial data would improve our ability to optimize for cost, because it would allow us to factor in the $56,000 premium that neighboring houses experience, a significant benefit.

Given what we know from exploratory analysis of our data, we have several recommendations for HCD to make its marketing efforts more effective:

  1. Conduct outreach on Tuesdays and Wednesdays, and in December, March, October, and September
  2. Contact homeowners via cell phone
  3. Target young adults (18-34)