
Accurate prediction of home sale prices is especially important now, as the real estate market is seeing record levels of activity due to the pandemic. In this report, we construct a new hedonic model for Zillow’s housing market predictions. The task is challenging because many factors affect the real estate market, and many of them have a non-linear relationship with price. The model is built for Miami and Miami Beach and incorporates local intelligence from open data to adapt it to local housing and development patterns. To estimate home sale prices, we use determinants including internal characteristics, nearby amenities and disamenities, and spatial processes such as clustering. Applied to a set of 3,503 houses, the model predicts that home sale prices are highest on the shoreline and in Miami Beach.


For this analysis, we received a dataset of houses and their internal characteristics, including number of rooms, living area, and pools. In addition, we gathered open data from the American Community Survey, Miami-Dade County’s Open Data Hub, and OpenStreetMap. These sources provide census-tract-level demographic information and point and polygon layers of amenities, including restaurants and parks (Table 1). We expect attributes that contribute to quality of life, such as proximity to restaurants, park space, and short commuting times, to correlate positively with sale prices.

library(sf)
library(dplyr)

# Load cultural venues as points (the source path is blank in the original).
culture <- st_read("") %>%
  st_as_sf(coords = c("LON", "LAT"), crs = 4326, agr = "constant")

# Flag each venue as inside or outside miamiBound, then keep only those inside.
culture <- rbind(
  culture[miamiBound,] %>%
    st_drop_geometry() %>%
    left_join(culture) %>%
    st_sf() %>%
    mutate(inMiami = "YES"),
  culture[miamiBound, op = st_disjoint] %>%
    st_drop_geometry() %>%
    left_join(culture) %>%
    st_sf() %>%
    mutate(inMiami = "NO")) %>%
  filter(inMiami == "YES")

## Table 1: Descriptive Statistics for Miami Houses
## =================================================================================
## Statistic              N     Mean    St. Dev.   Min  Pctl(25) Pctl(75)     Max   
## ---------------------------------------------------------------------------------
## AdjustedSqFt         3,503  2,082.2   1,426.0   331   1,274     2,379    18,006  
## LotSize              3,503  7,657.8   4,401.4  1,250  5,500     8,025    80,664  
## Bed                  3,503    3.0       1.1      0      2         4        13    
## Bath                 3,503    2.1       1.3      0      1         3        12    
## Stories              3,503    1.2       0.4      0      1         1         4    
## commercialProperties 3,503   15.6      26.1      0      0        21        203   
## age                  3,503   69.2      22.5      1      67       82        115   
## distWater            3,503    1.1       1.3    0.000   0.1       1.6       5.4   
## foodEstablishments   3,503    3.1       8.0      0      0         3        120   
## cultureSpots         3,503   0.01       0.1      0      0         0         1    
## busStops             3,503    2.2       2.3      0      0         4        13    
## parkArea             3,503 318,639.1 664,530.2   0      0     272,049.5 4,625,338
## timeToWork           3,503  2,347.2   1,200.6   476   1,480     3,037     5,962  
## monthhousingcost     3,503  1,888.9    807.6    390   1,316     2,421     5,078  
## pctVacant            3,503    0.2       0.1     0.0    0.1       0.2       0.5   
## ---------------------------------------------------------------------------------

First, we use a correlation matrix (Figure 1) to get an idea of how different variables correlate with sale price. The following variables correlate positively with sale price:

* AdjustedSqFt
* Bath
* Bed
* Stories
* Food establishments (slight)
* Park area
* Median household income

The following variables correlate negatively with sale price:

* Time to work
* Monthly housing cost
* Age
* Percent renter occupancy
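
The ranking of correlations described above could be reproduced along the following lines. This is a minimal base R sketch on made-up data; the toy values, the reduced set of columns, and the split into `positive` and `negative` vectors are assumptions for illustration, not the report's actual code.

```r
# Toy data standing in for the report's house table (values are made up).
houses <- data.frame(
  SalePrice    = c(250000, 400000, 320000, 900000, 150000, 610000),
  AdjustedSqFt = c(1100, 1800, 1500, 3900, 800, 2600),
  age          = c(80, 60, 75, 10, 95, 30),
  busStops     = c(4, 2, 3, 0, 6, 1)
)

# Pearson correlation of every predictor with SalePrice.
predictors <- setdiff(names(houses), "SalePrice")
cors <- sapply(predictors, function(v) cor(houses[[v]], houses$SalePrice))

# Split into positive and negative lists, strongest first.
positive <- sort(cors[cors > 0], decreasing = TRUE)
negative <- sort(cors[cors < 0])
print(positive)
print(negative)
```

Pearson correlation only captures linear association, so a variable with a strongly non-linear relationship to price can rank low in such a list.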

Four variables that we explore further are distWater, parkArea, pctVacant, and busStops. We expect higher distWater and pctVacant values to correlate with lower sale prices, and higher parkArea and busStops values to correlate with higher sale prices.

As expected, proximity to the shoreline (low distWater) is desirable and correlates with higher sale price. Based on Figure 2.1, however, this trend is strongest when distWater < 1. This suggests that we need to engineer the distWater feature to account for the stronger correlation at shorter distances. The amount of park area near a house is positively correlated with sale price, although not as strongly as expected (Figure 2.2).
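
Two simple ways to engineer a distance feature whose effect is concentrated within the first mile are sketched below in base R. The distances, the 1-mile cap, the bin edges, and the names `distWater_capped` and `distWater_cat` are all assumptions for illustration; the report does not specify which transformation was used.

```r
# Hypothetical distances to water in miles (not the report's data).
distWater <- c(0.05, 0.4, 0.9, 1.0, 2.5, 5.4)

# Option 1: cap the distance so that variation beyond 1 mile is ignored.
distWater_capped <- pmin(distWater, 1)

# Option 2: a categorical version with finer bins close to the shore.
# Intervals are closed on the right, so exactly 1 mile falls in the third bin.
distWater_cat <- cut(
  distWater,
  breaks = c(0, 0.25, 0.5, 1, Inf),
  labels = c("0-0.25 mi", "0.25-0.5 mi", "0.5-1 mi", "over 1 mi"),
  include.lowest = TRUE
)

print(data.frame(distWater, distWater_capped, distWater_cat))
```

The capped version keeps a single numeric coefficient, while the categorical version lets each band have its own price effect at the cost of more parameters.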

Surprisingly, the share of a census tract that is vacant has a positive correlation with sale price (Figure 2.3). This is unexpected because typically, vacant houses correlate with neighborhood disorder and unattractiveness. However, it is possible that these vacancies are the result of new construction and therefore do not make the area less attractive.

Bus stops also defy our expectations: according to transit-oriented development (TOD) theory, homes near transportation should be more attractive. However, as Figure 2.4 shows, the number of nearby bus stops correlates negatively with sale price. This may be because lower-income households tend to rely more on public transit.

We further examine relationships between variables, sale prices, and space in a series of maps. Figure 3 shows the distribution of observed sale prices. Prices are highest in Miami Beach and along the shoreline of Miami. They fall progressively moving away from the shore and reach their lowest cluster in the north.

Next, we map distWater, AdjustedSqFt, MedHHInc, parkArea, and foodEstablishments.

In a city known for its beaches, we expect distance to water to be negatively correlated with sale price. Figure 4.1 shows that the spatial pattern of distance to the shoreline closely mirrors that of sale price.

However, it’s possible that sale price appears to be strongly correlated with distance to water because of other factors. Downtown Miami is located near the beach, and the heightened development around the beaches and downtown may give the area other attractive features.

Two of these features are green space and dining venues. Figure 4.2 shows that houses in Miami Beach have the most green space nearby. Interestingly, only some of the houses with high sale prices have high park area, mostly in Miami Beach. This may be due to crime incidents in park areas. In Figure 4.3, we see a similar pattern for dining venues, with the highest values for houses in Miami Beach and along the northern part of Miami’s shoreline. This suggests that distance to the shoreline is only part of the story.

Outside of amenities, we expect that sale price correlates with median household income of census tracts and the adjusted square footage of houses. In Figures 4.4 and 4.5, we see this confirmed: much like sale price, the highest values for income and square footage are along the shoreline.

Knowing that attractive amenities, high-income households, large houses, and high sale prices cluster around the shoreline, we can begin to test features for our regression model. We ranked the features by their correlation with sale price, then added them in that order until an additional feature no longer increased the model’s predictive power for sale price. The regression model that we chose includes house internal features, census tract variables, transportation availability, park area, and flood risk.
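
The selection procedure described above can be sketched in base R as a simple forward pass over features ranked by correlation. The toy data, the feature names, and the use of in-sample adjusted R-squared as the measure of "predictive power" are assumptions for illustration; the report does not state which metric was used.

```r
set.seed(1)
n <- 200

# Toy data: price depends on sqft and age; `noise` is unrelated by construction.
toy <- data.frame(
  sqft  = rnorm(n, 2000, 500),
  age   = rnorm(n, 60, 20),
  noise = rnorm(n)
)
toy$SalePrice <- 300 * toy$sqft - 2000 * toy$age + rnorm(n, sd = 50000)

# Rank candidate features by absolute correlation with sale price.
candidates <- setdiff(names(toy), "SalePrice")
strength <- sapply(candidates, function(v) abs(cor(toy[[v]], toy$SalePrice)))
ranked <- candidates[order(-strength)]

# Add features in rank order; stop at the first one that does not help.
selected <- character(0)
best <- -Inf
for (v in ranked) {
  fit <- lm(reformulate(c(selected, v), response = "SalePrice"), data = toy)
  score <- summary(fit)$adj.r.squared  # assumed stand-in for predictive power
  if (score <= best) break
  selected <- c(selected, v)
  best <- score
}
print(selected)
```

Scoring on held-out data instead of in-sample adjusted R-squared would give a stricter stopping rule.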

Next, to test this model, we split the houses with observed sale prices into a training set, used to fit the model, and a test set, used to evaluate it.
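
A random split of this kind can be sketched in base R as follows. The toy data frame `sold` and the 60% training share are assumed values for illustration; the report does not state the proportion used.

```r
set.seed(42)  # make the random split reproducible

# Toy stand-in for houses with observed sale prices.
sold <- data.frame(id = 1:10, SalePrice = seq(100000, 1000000, by = 100000))

# Randomly assign 60% of rows to training (the share is an assumption).
train_idx <- sample(nrow(sold), size = floor(0.6 * nrow(sold)))
train <- sold[train_idx, ]
test  <- sold[-train_idx, ]

cat("training rows:", nrow(train), " test rows:", nrow(test), "\n")
```

Every house lands in exactly one of the two sets, so the test set contains no rows the model has seen during fitting.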

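The results of the regression model on the training set are presented in Table 2. The R-squared value of 0.9 means that our variables explain about 90% of the variation in sale price in the training set. Several coefficients, including ActualSqFt, are significant at the 1% level, although many of the zoning and flood insurance categories are not statistically significant.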

## Table 2: LM of Training Data
## =====================================================================
##                                               Dependent variable:    
##                                           ---------------------------
##                                                    SalePrice         
## ---------------------------------------------------------------------
## ActualSqFt                                         870.0***          
##                                                     (25.8)           
## LotSize                                             16.7**           
##                                                      (7.2)           
## Zoning0104 - SINGLE FAM - ANCILIARY UNIT           100,352.6         
##                                                   (81,601.5)         
## Zoning0800 - SGL FAMILY - 1701-1900 SQ          1,604,817.0***       
##                                                   (142,839.5)        
## Zoning2100 - ESTATES - 15000 SQFT LOT           3,702,589.0***       
##                                                   (278,752.4)        
## Zoning2200 - ESTATES - 25000 SQFT LOT           1,114,798.0***       
##                                                   (406,473.6)        
## Zoning2800 - TOWNHOUSE                             52,270.8          
##                                                   (393,499.5)        
## Zoning3900 - MULTI-FAMILY - 38-62 U/A              97,389.1          
##                                                   (155,539.5)        
## Zoning3901 - GENERAL URBAN 36 U/A LIMITED          164,228.0         
##                                                   (188,683.1)        
## Zoning4600 - MULTI-FAMILY - 5 STORY                                  
##                                                   (254,515.3)        
## Zoning4601 - MULTI-FAMILY - 8 STORY                                  
##                                                   (584,357.1)        
## Zoning4801 - RESIDENTIAL-LIMITED RETAI             326,606.2         
##                                                   (408,154.0)        
## Zoning5700 - DUPLEXES - GENERAL                    107,054.8         
##                                                   (72,636.1)         
## Zoning6100 - COMMERCIAL - NEIGHBORHOOD            -153,866.1         
##                                                   (311,195.1)        
## Zoning6101 - CEN-PEDESTRIAN ORIENTATIO             179,162.6         
##                                                   (290,088.6)        
## Zoning6106 - RESIDENTIAL-LIBERAL RETAI             543,719.2         
##                                                   (594,500.4)        
## Zoning6107 - RESIDENTIAL-MEDIUM RETAIL             344,976.2         
##                                                   (248,114.3)        
## Zoning6110 - COMM/RESIDENTIAL-DESIGN D             142,344.1         
##                                                   (431,713.7)        
## Zoning6402 - URBAN CORE 24 STORY/7FLR              977,335.5         
##                                                   (820,653.6)        
## Zoning7000 - INDUSTRIAL - GENERAL                  217,099.0         
##                                                   (809,087.8)        
## Zoning7700 - INDUSTRIAL - RESTRICTED               182,791.0         
##                                                   (419,503.1)        
## Stories.cat3+ Stories                           1,689,332.0***       
##                                                   (217,394.2)        
## Stories.catUp to 1 Stories                       249,130.2***        
##                                                   (71,265.3)         
## Bath.cat3+ Bathrooms                             -306,386.0***       
##                                                   (75,691.9)         
## Bath.catUp to 1 Bathroom                         235,574.1***        
##                                                   (61,901.7)         
## PoolPool                                           -39,506.9         
##                                                   (67,076.9)         
## medHHInc                                             -0.8            
##                                                      (1.0)           
## DockNo Dock                                      -447,900.5***       
##                                                   (125,730.3)        
## Bed.cat4+ Beds                                   -181,398.0***       
##                                                   (69,130.4)         
## Bed.catUp to 2 Beds                               124,709.6**        
##                                                   (58,671.8)         
## middleCatchde Diego, Jose Middle                   104,349.0         
##                                                   (142,965.0)        
## middleCatchJones-Ayers, Georgia Middle             74,406.7          
##                                                   (133,957.2)        
## middleCatchKinloch Park Middle                     46,837.2          
##                                                   (174,839.7)        
## middleCatchMann, Horace Middle                      -366.5           
##                                                   (151,053.0)        
## middleCatchNautilus Middle                       879,387.2***        
##                                                   (272,431.2)        
## middleCatchother                                   102,353.8         
##                                                   (126,529.1)        
## middleCatchPonce de Leon Middle                   240,613.4*         
##                                                   (132,522.5)        
## middleCatchShenandoah Middle                       126,364.6         
##                                                   (115,328.7)        
## age                                               -2,510.6**         
##                                                    (1,095.5)         
## pctVacant                                         -502,820.8         
##                                                   (397,606.0)        
## pctRenterOcc                                      -110,208.2         
##                                                   (217,189.3)        
## monthhousingcost                                     -22.8           
##                                                     (81.6)           
## PatioPatio                                        -85,407.7*         
##                                                   (44,095.8)         
## foodEstablishments                                 7,148.1**         
##                                                    (3,458.6)         
## timeToWork                                           -0.5            
##                                                     (53.8)           
## metromoverStops                                     8,538.5          
##                                                    (6,276.9)         
## metrorailStops                                     -4,129.2          
##                                                   (24,407.6)         
## parkArea                                            -0.2***          
##                                                     (0.05)           
## floodInsureType0294                               -123,745.1         
##                                                   (118,363.8)        
## floodInsureType0304                                -66,991.1         
##                                                   (199,556.8)        
## floodInsureType0307                              -741,470.8**        
##                                                   (289,096.7)        
## floodInsureType0308                                109,778.8         
##                                                   (232,076.7)        
## floodInsureType0309                               -235,418.1         
##                                                   (287,603.4)        
## floodInsureType0311                               -288,808.3         
##                                                   (218,469.6)        
## floodInsureType0312                               -137,721.1         
##                                                   (244,145.3)        
## floodInsureType0313                               -209,697.3         
##                                                   (209,222.1)        
## floodInsureType0314                               -196,682.5         
##                                                   (234,399.0)        
## floodInsureType0316                             1,120,936.0***       
##                                                   (336,868.1)        
## floodInsureType0317                               -198,299.6         
##                                                   (295,073.5)        
## floodInsureType0318                               -569,802.4         
##                                                   (520,250.7)        
## floodInsureType0319                             14,053,694.0***      
##                                                   (946,613.2)        
## floodInsureType0476                               -130,377.8         
##                                                   (194,466.9)        
## floodInsureType0477                               -566,390.3*        
##                                                   (291,553.8)        
## floodInsureTypeother                               -50,719.5         
##                                                   (177,357.8)        
## Constant                                         -795,732.2***       
##                                                   (291,875.5)        
## ---------------------------------------------------------------------
## Observations                                         1,651           
## R2                                                    0.9            
## Adjusted R2                                           0.9            
## Residual Std. Error                          797,033.6 (df = 1586)   
## F Statistic                                167.5*** (df = 64; 1586)  
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01

Next, we test the model on our test set to see how well it performs on new data. Overall, it is relatively accurate: the average percentage error is 7.1%. However, the mean absolute error is $351,281, which is concerning because the average price in the test set is $689,606. This may be because the model's dollar errors are larger for expensive houses, since a given percentage error translates into a larger absolute error at a higher price. Indeed, in Figures 5.1 and 5.2, absolute error is higher for homes with high observed prices, but percent error is consistently low, at less than 10%.
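
The difference between the two error measures can be illustrated with a short base R sketch. The three observed and predicted prices are made up so that every prediction is off by exactly 10%; they are not taken from the report's test set.

```r
# Made-up observed and predicted prices, each prediction off by 10%.
observed  <- c(200000, 500000, 2000000)
predicted <- c(220000, 450000, 1800000)

abs_error <- abs(predicted - observed)  # error in dollars
ape       <- abs_error / observed       # error as a share of observed price

mae  <- mean(abs_error)  # mean absolute error
mape <- mean(ape)        # mean absolute percentage error

print(data.frame(observed, abs_error, ape))
cat("MAE:", mae, " MAPE:", mape, "\n")
```

Although every house has the same percentage error here, the most expensive house contributes most of the mean absolute error, which is how a low percentage error and a large dollar error can coexist.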

Is our model generalizable?

In addition to accuracy, generalizability is important for our model to be effective. To assess it, we run k-fold cross-validation, which tests the model on different segments of our test set.

We see in Table 3 that Fold075, one of the 100 partitioned segments of our test data, has an adjusted R-squared of 0.78 and a mean absolute error of $350,348.9. While the R-squared is high, the error is more than half of the average predicted price across all folds. These results are similar to the results for our training set, which suggests strong generalizability across groups.

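## Table 3: Regression Results of One Test Set
## ====================================================
## Resample        MAE            RMSE        Rsquared
## ----------------------------------------------------
## Fold075      583,276.9     1,115,646.2      0.949
## ----------------------------------------------------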
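
A k-fold procedure that produces one row of MAE and RMSE per fold can be sketched in base R as shown below. The toy data, the single-predictor model, and the choice of 5 folds (rather than the 100 mentioned above) are assumptions made to keep the example small; this is not the report's code.

```r
set.seed(7)
n <- 100

# Toy data: price as a noisy linear function of square footage.
toy <- data.frame(sqft = runif(n, 800, 4000))
toy$SalePrice <- 250 * toy$sqft + rnorm(n, sd = 40000)

k <- 5  # assumed fold count for the toy example
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row

# Fit on all folds but one, then score on the held-out fold.
cv <- do.call(rbind, lapply(seq_len(k), function(i) {
  fit  <- lm(SalePrice ~ sqft, data = toy[fold != i, ])
  held <- toy[fold == i, ]
  err  <- predict(fit, newdata = held) - held$SalePrice
  data.frame(Resample = sprintf("Fold%03d", i),
             MAE  = mean(abs(err)),
             RMSE = sqrt(mean(err^2)))
}))
print(cv)
```

Comparing the spread of MAE across the rows of `cv`, rather than inspecting a single fold, gives a more direct picture of how consistently the model performs.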

Do errors cluster spatially?

One way to investigate the reason for the model’s inconsistency is to look at how it treats houses across space.


In Figure 7.1, we can see that residuals are evenly distributed spatially. This suggests that the model's limited generalizability is not due to spatial processes, but to other factors.

To see more clearly the effect of spatial processes on our errors, we can look at spatial lags, which summarize the prices and errors of each house's neighbors and so reveal clustering. In Figure 8.1, we can see that the errors of neighboring houses do not rise together in the way that neighboring sale prices do. This again tells us that most of our errors are not spatial in nature.
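
A spatial lag can be computed as the mean value among each house's nearest neighbors. The base R sketch below uses made-up coordinates and errors, and the choice of k = 2 neighbors is an assumption; the report does not state how neighbors were defined.

```r
# Hypothetical projected coordinates and model errors for six houses.
x   <- c(0, 1, 2, 10, 11, 12)
y   <- c(0, 0, 0, 0, 0, 0)
err <- c(10, 12, 11, 50, 55, 52)

k <- 2  # number of nearest neighbors to average (assumed)
d <- as.matrix(dist(cbind(x, y)))
diag(d) <- Inf  # a house is not its own neighbor

# Spatial lag: mean error of each house's k nearest neighbors.
lag_err <- apply(d, 1, function(dists) mean(err[order(dists)[seq_len(k)]]))

print(data.frame(err, lag_err))
```

In this toy example the errors are deliberately clustered, so `err` and `lag_err` move together; the absence of such a relationship is what indicates that errors are not spatial.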

Finally, we can use a Moran’s I test to gain further insight into the spatial autocorrelation of our model errors. If our model errors are not influenced by spatial processes, we should see a Moran’s I close to 0. Our Moran’s I value is less than 0.2, which indicates that spatial clustering of errors is weak.
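
For reference, the Moran's I statistic itself can be computed in a few lines of base R. The helper `morans_i`, the four-house adjacency matrix, and the example values are hypothetical and only illustrate the formula; they are not the report's code and do not perform the accompanying significance test.

```r
# Moran's I for values z and a spatial weights matrix w (w[i, j] > 0 if neighbors).
morans_i <- function(z, w) {
  z <- z - mean(z)
  n <- length(z)
  (n / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Four houses in a row; each is a neighbor of the adjacent ones.
w <- matrix(0, 4, 4)
w[cbind(1:3, 2:4)] <- 1
w[cbind(2:4, 1:3)] <- 1

clustered   <- morans_i(c(1, 1, 5, 5), w)  # similar values sit side by side
alternating <- morans_i(c(1, 5, 1, 5), w)  # neighbors are always different

cat("clustered:", clustered, " alternating:", alternating, "\n")
```

Positive values indicate that similar errors sit near each other and negative values that neighbors tend to differ. Under no spatial autocorrelation the expected value is -1/(n - 1), which approaches 0 as the number of houses grows.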

Accounting for neighborhood variance

We now add neighborhood as a feature in our model to account for the differences in sale price across neighborhoods. Table 4 tells us that when we account for neighborhood effects, the absolute error falls only slightly (from $379,308.8 to $377,438.1), while the absolute percentage error falls somewhat more (from 0.935 to 0.918). This may mean that the neighborhood model helps most for lower-priced homes and helps less for expensive ones. The small size of the improvement also tells us that our baseline model already accounted for much of the spatial disparity through our use of open data features.

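## Table 4: Neighborhood Effect on Error
## ==========================================================
## Regression             SalePrice.AbsError   SalePrice.APE
## ----------------------------------------------------------
## Baseline Regression        379,308.8           0.9352236
## Neighborhood Effects       377,438.1           0.9184490
## ----------------------------------------------------------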
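
Adding neighborhood as a feature amounts to including it as a categorical term in the regression. The base R sketch below compares a baseline and a neighborhood model on made-up data; the column names, the values, and the use of in-sample residuals (rather than test-set errors as in Table 4) are assumptions for illustration.

```r
# Made-up prices (in thousands) for two neighborhoods with different price levels.
toy <- data.frame(
  SalePrice = c(200, 220, 240, 500, 520, 540),
  sqft      = c(1000, 1100, 1200, 1000, 1100, 1200),
  nhood     = factor(c("A", "A", "A", "B", "B", "B"))
)

baseline   <- lm(SalePrice ~ sqft, data = toy)
with_nhood <- lm(SalePrice ~ sqft + nhood, data = toy)  # neighborhood fixed effect

# In-sample mean absolute error of a fitted model.
mae <- function(fit) mean(abs(residuals(fit)))
print(c(baseline = mae(baseline), neighborhood = mae(with_nhood)))
```

In this constructed case the neighborhood term removes nearly all of the error because the price gap between neighborhoods is not explained by any other feature. When other features already capture location, as the text suggests for the baseline model, the gain is much smaller.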

In Figure 10.1, we can see the effect of the neighborhood model on the accuracy of our predictions. As suspected, the neighborhood model fits the data only marginally better. For some of the higher-priced homes, however, it appears that adding neighborhood as a feature causes an overestimation of price. This may be due to variation within neighborhoods.

Finally, Figure 11.1 shows the prices that we predicted for the set of 3,503 Miami-area homes. As we saw with the known home sale prices, our predicted prices are highest close to the shoreline and in Miami Beach.

In Figure 12.1, we can see the spatial distribution of our errors. Overall, it appears that mean absolute percentage error (MAPE) is lower in Miami Beach, where sale prices are higher. It is highest in the center of Miami, and relatively low on the Miami shoreline. This suggests that our errors were smaller in neighborhoods with higher observed sale prices.

Figure 13.1 confirms this trend. We can see clearly that, with one exception, neighborhoods with lower mean prices have higher mean absolute percentage errors.

Does this model work equally for different demographic groups?

The variation in the MAPE of different neighborhoods suggests our model has limited generalizability. To test this, we look at how well it predicts prices across different racial and income contexts. Figure 14.1 shows the racial and income context of Miami.

Tables 5 and 6 confirm that this model performs differently for different demographics. In majority non-white neighborhoods, MAPE is 45% higher than in majority white neighborhoods. Similarly, MAPE is 41% higher in low-income neighborhoods than in high-income ones.
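
A group comparison of this kind reduces to averaging the absolute percentage error within each group. The base R sketch below uses made-up errors and hypothetical group labels; it does not reproduce the values in Tables 5 and 6.

```r
# Made-up absolute percentage errors with an income label per house.
test <- data.frame(
  ape   = c(0.10, 0.20, 0.30, 0.40),
  group = c("High Income", "High Income", "Low Income", "Low Income")
)

# MAPE within each group, and how much higher it is in the low-income group.
mape_by_group <- tapply(test$ape, test$group, mean)
ratio <- mape_by_group[["Low Income"]] / mape_by_group[["High Income"]]

print(mape_by_group)
cat("Low-income MAPE is", round(100 * (ratio - 1)), "% higher\n")
```

The same pattern applies to the racial context by swapping the grouping column.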

In conclusion, our model is effective at predicting the distribution of home sale prices in the area as the pattern of predicted prices matches that of recent known home sale prices. However, it is limited in its ability to generalize across different types of neighborhoods.

Many of our variables had interesting relationships with sale price. As expected, distance from the shore correlates negatively with sale price. However, the correlation was much weaker than expected because distance only seemed to matter within the first mile from the water. By contrast, floodInsureType correlated quite strongly with sale price. This feature comes from FEMA’s rating of flood risk for different neighborhoods, and contrary to our expectation, it was not based only on distance from the shoreline.

In general, we found that houses’ internal features were the strongest predictors of sale price. In addition, census features such as percent vacancy in a tract, median household income, and travel time to work correlated with sale price. Contrary to our expectations, distance to water, food establishments, businesses, school catchment areas, and park space were less correlated with home sale price.

Our errors were higher than we would like. As discussed in the cross-validation section, some folds had a mean absolute error of hundreds of thousands of dollars. Further, our errors were not distributed evenly, and were higher in low-income and majority-minority neighborhoods. Indeed, the highest MAPE values on our maps corresponded with the lowest-income and lowest-sale-price neighborhoods. Based on our Moran’s I test, however, we were successful at limiting spatial clustering of errors. In all, our model predicted much better in high-income neighborhoods with high observed sale prices. This disparity is likely attributable to the fact that we used mainly positive home attributes and neighborhood amenities in our model. If we had used more attributes such as poverty rate, race, and renter occupancy rate, our model might have been better at modeling majority-minority and low-income neighborhoods.


To conclude, we do not believe that this model is ready for use by Zillow. Its errors are too large, and its predictions are too uneven across different types of neighborhoods. Moving forward, we will add more features to this model to improve its ability to predict in neighborhoods with varying demographics. Additionally, we will fine-tune features to account for non-linear relationships with sale price, for example by creating more categories.