Keywords: Ordinary Least Squares Regression, K-fold Cross-Validation, Stepwise Regression
GitHub Repository: MUSA5000-OLS | Website
Philadelphia has experienced significant demographic and economic transformations over recent decades, leading to notable implications for its urban housing market. These shifts have resulted in variations in median house values, which serve not only as a reflection of the city’s economic health but also as a proxy for broader social and spatial dynamics. Increasing median house values may indicate an influx of higher-income residents or early stages of gentrification, whereas declining values can be symptomatic of disinvestment and economic decline. Given these dynamics, accurately forecasting median house values is vital for urban planners and policymakers who are tasked with promoting sustainable and equitable urban development.
This study investigates the determinants of housing values in Philadelphia by focusing on four critical neighborhood characteristics: educational attainment, vacancy rates, the proportion of single-family homes, and poverty levels. These factors are closely related to housing market trends and provide insights into the socioeconomic status of neighborhoods. Educational attainment is closely linked to local economic prosperity. Individuals with higher educational attainment typically earn higher incomes and contribute more robustly to local economies. Consequently, a higher concentration of individuals with advanced educational backgrounds tends to increase demand for quality housing in affluent neighborhoods, thereby driving up housing prices and values. Conversely, high poverty rates signal economic distress, as many residents cannot afford luxury housing, and as a result, the housing values in these neighborhoods are likely to be lower.
High vacancy rates are frequently linked to declining neighborhoods, as vacant properties undermine community vitality, weaken crime prevention efforts, and diminish the overall health of commercial districts. Consequently, areas with a high concentration of vacant housing units tend to exhibit lower median housing values. In addition, although single-family homes are generally valued for their privacy and comfort, their common occurrence in suburban settings is often associated with limited accessibility to urban amenities and inadequate infrastructure, which may ultimately suppress their market values.
In this study, we utilize ordinary least squares (OLS) regression to analyze the relationship between these socioeconomic factors and median house values in Philadelphia. By examining these relationships, we aim to identify critical predictors of median housing values throughout Philadelphia and offer insights for decision-makers and community initiatives.
To predict median house values in Philadelphia, we obtained the original dataset from the United States Census, representing census block groups from the year 2000 and initially containing 1,816 observations. The key variables included MEDHVAL (median house value), PCTBACHMOR (% of individuals with bachelor’s degrees or higher), NBELPOV100 (number of households living in poverty), PCTVACANT (% of vacant houses), and PCTSINGLES (% of single house units).
For modeling purposes, we refined the dataset using the following criteria:
In addition, we removed one specific block group in North Philadelphia that exhibited inconsistencies: it had an unusually high median house value (over $800,000) despite a very low median household income (less than $8,000).
After these cleaning steps, the final dataset contained 1,720 observations.
We will first examine the summary statistics (mean and standard deviation, SD) of key variables in the dataset. These include the dependent variable MEDHVAL (Median House Value) and the predictors NBELPOV100 (# Households Living in Poverty), PCTBACHMOR (% of Individuals with Bachelor’s Degrees or Higher), PCTVACANT (% of Vacant Houses), and PCTSINGLES (% of Single House Units).
The mean (\(\bar{X}\)) represents the average value of a variable and is calculated as follows:
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]
where \(n\) is the number of observations and \(X_i\) is the value of the \(i\)-th observation.
The mean provides a single representative value of the dataset. To measure the variability of the data, we use the standard deviation (SD), which quantifies how much the values in a dataset deviate from the mean. The formula for the sample standard deviation (\(s\)) is:
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \bar{X} \right)^2} \]
where \(n\) is the number of observations, \(X_i\) is the value of the \(i\)-th observation, and \(\bar{X}\) is the sample mean.
A larger standard deviation indicates that the data points are more spread out, while a smaller standard deviation suggests that the data points are more closely clustered around the mean.
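As a quick illustration, the two formulas above can be verified in R against the built-in `mean()` and `sd()` functions (the vector here is made-up toy data, not values from the study):

```r
# Toy data (not from the study) to illustrate the mean and sample SD formulas
x <- c(42000, 55000, 61000, 87000, 120000)
n <- length(x)

x_bar <- sum(x) / n                      # mean: (1/n) * sum of X_i
s <- sqrt(sum((x - x_bar)^2) / (n - 1))  # sample SD: n - 1 in the denominator

# Base R's mean() and sd() implement the same definitions
stopifnot(all.equal(x_bar, mean(x)), all.equal(s, sd(x)))
```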
We will also examine the histograms and apply logarithmic transformations to key variables to assess whether the transformed variables exhibit a more normal-like distribution.
Histograms provide a visual representation of how a variable’s values are distributed, which helps identify whether the data follows a normal distribution, is right-skewed, or left-skewed. Note that linear regression models assume that variables are approximately normally distributed.
X-axis: Represents the values of the variable (e.g., house prices, income levels).
Y-axis: Represents the frequency of observations within each bin.
A right-skewed histogram suggests that a small number of observations have significantly higher values compared to the rest. For variables that exhibit right-skewness, a log transformation can improve normality. Since the log transformation is undefined for zero or negative values, we must first check whether any variable contains zero.
By comparing the original histograms with the log-transformed histograms, we will determine whether the transformation improves the suitability of the data for predictive modeling.
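Since some variables contain zeros, the transformation actually applied later in this report is `log(1 + x)` rather than a plain logarithm. A minimal sketch of the zero check and the shifted log, on hypothetical values:

```r
# Hypothetical right-skewed values containing zeros (not study data)
pct <- c(0, 0, 2.5, 4, 7, 12, 30, 65)

has_zero <- any(pct == 0)   # TRUE: plain log() would return -Inf here
log_pct  <- log(1 + pct)    # shift by 1 so zeros map to log(1) = 0

stopifnot(has_zero, all(is.finite(log_pct)), log_pct[1] == 0)
```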
We will also analyze the correlations between the predictors to detect potential multicollinearity before proceeding with regression analysis, as multicollinearity can distort model interpretations.
Multicollinearity occurs when predictors are highly correlated with one another, which can lead to unstable regression coefficients, inflated standard errors which reduce the statistical significance of predictors, and a higher risk of overfitting because redundant variables do not provide additional information to the model.
The correlation coefficient \(r\) quantifies the strength and direction of the linear relationship between two variables. It is calculated as:
\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
In a more concise way, this formula can be expressed as:
\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_x} \right) \left( \frac{y_i - \bar{y}}{S_y} \right) \] where \(\bar{x}\) and \(\bar{y}\) are the sample means of the two variables, and \(S_x\) and \(S_y\) are their sample standard deviations.
The correlation coefficient \(r\) ranges from -1 to 1: A value of +1 indicates a perfect positive linear relationship, while a value of -1 indicates a perfect negative linear relationship. A value of 0 suggests no linear relationship between the variables, meaning changes in one do not influence the other.
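The standardized-products form of \(r\) can be checked against R's built-in `cor()`; the data below are simulated for illustration only:

```r
# Simulated pair of variables with a positive relationship by construction
set.seed(1)
x <- rnorm(100)
y <- 0.6 * x + rnorm(100)

# r as the average product of z-scores, with an n - 1 denominator
r_manual <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (length(x) - 1)

stopifnot(all.equal(r_manual, cor(x, y)))
```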
After getting a general sense of the data, we conduct multiple regression analysis to examine the relationship between the dependent variable, Median House Value (MEDHVAL), and the predictors: educational attainment (PCTBACHMOR), number of households living in poverty (NBELPOV100), percentage of vacant houses (PCTVACANT), and percentage of single-family house units (PCTSINGLES). Regression analysis is a statistical method for examining the relationship between a dependent variable and one or more predictors. With this type of analysis, researchers can identify the strength and direction of the relationship between variables, make predictions, and assess the significance of predictors. The model also estimates coefficients for each predictor, which represent the expected change in the dependent variable for a one-unit change in the predictor, holding other predictors constant. For this study, the multiple regression model is formulated as follows:
\[ \text{LNMEDHVAL} = \beta_0 + \beta_1 \text{PCTVACANT} + \beta_2 \text{PCTSINGLES} + \beta_3 \text{PCTBACHMOR} + \beta_4 \text{log(NBELPOV100)} + \epsilon \] where LNMEDHVAL is the log-transformed median house value, PCTVACANT is the proportion of vacant housing units, PCTSINGLES is the proportion of single-family housing units, PCTBACHMOR is the percentage of residents holding a bachelor’s degree or higher, and log(NBELPOV100) is the log-transformed number of households living below the poverty line.
\(\beta_0\) is the intercept, \(\beta_1\) through \(\beta_4\) are the coefficients for each predictor, and \(\epsilon\) is the error term. Each coefficient represents the change in the log-transformed median house value for a one-unit change in the corresponding predictor, holding other predictors constant. The error term \(\epsilon\) accounts for the variability in the dependent variable that is not explained by the predictors.
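In R, this specification is a single `lm()` call. The sketch below uses simulated stand-in data carrying the same variable names — it is not the Philadelphia dataset, whose actual fit appears in the results section:

```r
# Simulated stand-in data mimicking the variable names (NOT the study dataset)
set.seed(42)
n <- 200
df <- data.frame(
  PCTVACANT    = runif(n, 0, 40),
  PCTSINGLES   = runif(n, 0, 50),
  PCTBACHMOR   = runif(n, 0, 60),
  LNNBELPOV100 = log(1 + rpois(n, 150))
)
df$LNMEDHVAL <- 11 - 0.02 * df$PCTVACANT + 0.003 * df$PCTSINGLES +
  0.02 * df$PCTBACHMOR - 0.08 * df$LNNBELPOV100 + rnorm(n, sd = 0.35)

# The model as formulated above: LNMEDHVAL on the four predictors
fit <- lm(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100, data = df)
coef(fit)   # beta_0 through beta_4
```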
For the results of the regression analysis to be valid, several key assumptions must be met. These assumptions include linearity, independence of observations, homoscedasticity, normality of residuals, no multicollinearity, and no fewer than 10 observations per predictor.
Linearity assumes that the relationship between the dependent variable and the predictors is linear. To verify this assumption, we made scatter plots of the dependent variable against each predictor. If the relationship appears linear, the assumption is met.
Independence of Observations assumes that the observations are independent of each other. There should be no spatial or temporal or other forms of dependence in the data.
Homoscedasticity assumes that the variance of the residuals \(\epsilon\) is constant across all levels of the predictors. To check this assumption, we made a scatter plot of the standardized residuals against the predicted values. If the residuals are evenly spread around zero, the assumption is met; any systematic pattern may indicate heteroscedasticity.
Normality of Residuals assumes that the residuals are normally distributed. We examined the histogram of the standardized residuals to check whether they are approximately normally distributed. If the histogram is bell-shaped, the assumption is met.
No Multicollinearity assumes that the predictors are not highly correlated with each other. We calculated the correlation matrix of the predictors to check for multicollinearity. If no correlation coefficient is greater than 0.8 or less than -0.8, the assumption is met.
No Fewer than 10 Observations per Predictor assumes that there are at least 10 observations for each predictor in the model. Since there are over 1,700 observations in the dataset, this assumption was met.
After verifying the assumptions of the regression, we start the regression analysis. We estimate the following parameters: \(\beta_0\), the intercept; \(\beta_1, \dots, \beta_k\), the coefficients of each independent variable; and \(\sigma^2\), the variance of the error terms, which quantifies the variability in the dependent variable that is not explained by the predictors.
The least squares method estimates the coefficients by minimizing the sum of squared errors (SSE), which is the sum of the squared differences between the observed values and the predicted values. The formula for SSE is:
\[ \text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \dots - \hat{\beta}_k x_{ik})^2 \] where \(y_i\) is the actual or observed value, \(\hat{y}_i\) is the predicted value, \(\hat{\beta}_0, \dots, \hat{\beta}_k\) are the estimated coefficients, and \(x_{i1}, \dots, x_{ik}\) are the predictor values for the \(i\)-th observation.
With the Error Sum of Squares (SSE) calculated, we can estimate the variance of the error terms \(\sigma^2\) using the formula:
\[ \sigma^2 = \frac{\text{SSE}}{n - (k+1)} \] where \(n\) is the number of observations and \(k\) is the number of predictors in the model.
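Both quantities can be pulled straight from a fitted `lm` object. A toy check (on simulated data) that `summary()`'s residual standard error equals \(\sqrt{\text{SSE}/(n-(k+1))}\):

```r
# Toy regression (simulated data) to verify the SSE and variance formulas
set.seed(7)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)

SSE <- sum(residuals(fit)^2)   # sum of squared errors
n <- length(y); k <- 1         # one predictor
sigma2 <- SSE / (n - (k + 1))  # estimated error variance

# summary(fit)$sigma is the residual standard error, i.e. sqrt(sigma2)
stopifnot(all.equal(sqrt(sigma2), summary(fit)$sigma))
```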
We evaluated the model’s fit using the coefficient of determination \(R^2\) and the adjusted \(R^2\). The coefficient of determination \(R^2\) measures the proportion of variance in the dependent variable that is explained by the predictors. The adjusted \(R^2\) corrects the \(R^2\) value for the number of predictors, providing a more accurate measure of model fit for multiple regression, since adding predictors can artificially inflate the \(R^2\) value.
To obtain \(R^2\), \(SST\) or the total sum of squares, needed to be calculated first. The \(SST\) measures the total variance in the dependent variable, given by: \[ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \] Where \(y_i\) is the observed value, and \(\bar{y}\) is the mean of the observed value. Then, the \(R^2\) can be obtained by: \[ R^2 = 1 - \frac{SSE}{SST} \] After that, \(R^2\) is adjusted as follows based on the number of observations \(n\) and the number of predictors \(k\): \[ R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} \]
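A quick sanity check of both formulas against what `summary(lm())` reports, again on simulated data:

```r
# Simulated simple regression to verify the R^2 and adjusted R^2 formulas
set.seed(3)
x <- runif(80)
y <- 1 + 2 * x + rnorm(80)
fit <- lm(y ~ x)

SST <- sum((y - mean(y))^2)   # total sum of squares
SSE <- sum(residuals(fit)^2)  # error sum of squares
r2  <- 1 - SSE / SST          # coefficient of determination

n <- length(y); k <- 1
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

stopifnot(all.equal(r2, summary(fit)$r.squared),
          all.equal(r2_adj, summary(fit)$adj.r.squared))
```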
In this analysis, several hypothesis tests are conducted to determine the significance of the model and its predictors. The overall significance of the model is assessed using the F-ratio, which compares the variance explained by the model to the variance left unexplained. A higher F-ratio indicates that the model explains a significant amount of variance in the dependent variable relative to the residual variance.
The null hypothesis and alternative hypothesis for the F-ratio are stated as follows:
The null hypothesis \(H_0\) states that all coefficients are equal to zero, meaning that the predictors do not explain the variance in the dependent variable (median house value). Stated as:
\[ H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0 \]
The alternative hypothesis \(H_a\) states that at least one coefficient is not equal to zero. In our case, this means that at least one predictor explains the variance in the dependent variable (median house value). Stated as: \[ H_a: \text{At least one } \beta_i \neq 0 \]
We also conduct a t-test to determine the significance of each individual predictor in the model:
The null hypothesis \(H_{0i}\) states that the coefficient for the predictor i is equal to zero, meaning that the predictor does not explain the variance in the dependent variable (median house value). Stated as:
\[ H_{0i}: \beta_i = 0 \]
The alternative hypothesis \(H_{ai}\) states that the coefficient for the predictor i is not equal to zero, meaning that the predictor explains the variance in the dependent variable (median house value). Stated as:
\[ H_{ai}: \beta_i \neq 0 \]
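Both tests are reported automatically by `summary()` on a fitted model. A sketch on simulated data showing where the F-ratio and the per-predictor t statistics live:

```r
# Simulated data: x1 truly predicts y, x2 is pure noise
set.seed(5)
d <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
d$y <- 1 + 0.8 * d$x1 + rnorm(60)

s <- summary(lm(y ~ x1 + x2, data = d))

s$fstatistic                  # overall F-ratio with its two degrees of freedom
s$coefficients[, "t value"]   # t statistic for each coefficient
s$coefficients[, "Pr(>|t|)"]  # two-sided p-value for each coefficient
```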
In addition to the multiple regression analysis, we conduct stepwise regression analysis to identify the most significant predictors of median house values in Philadelphia. Stepwise regression is a method that automatically selects the best subset of predictors for the model. It involves adding or removing predictors based on their statistical significance (p-value) and the AIC (Akaike Information Criterion). The stepwise regression analysis will help us identify the most important predictors and improve the overall model fit.
However, stepwise regression has several limitations. First, the final stepwise model is not guaranteed to be optimal in any specific sense: there may be other models as good as, or better than, the one selected, but the procedure yields only a single model. Second, stepwise regression does not take into account researchers’ knowledge about the predictors; important variables may be excluded from the model if they are not included initially. Moreover, this method runs the risk of Type I and Type II errors, meaning it may include unimportant variables or exclude important ones.
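Using `MASS::stepAIC` (from the MASS package used in this report), the procedure looks like this on simulated data where only one of three candidate predictors truly matters:

```r
library(MASS)   # stepAIC(); MASS ships with standard R installations

# Simulated data: only x1 actually drives y; x2 and x3 are noise
set.seed(9)
n <- 150
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 + 1.5 * d$x1 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = d)
# Add/drop predictors in both directions, minimizing AIC at each step
best <- stepAIC(full, direction = "both", trace = FALSE)
names(coef(best))   # the genuine predictor x1 is retained
```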
To evaluate the performance of the regression model, we conduct K-fold cross-validation. Cross-validation is a technique used to evaluate the performance of predictive models. In K-fold cross-validation, the dataset is randomly divided into K equal-sized folds. The model is trained on K-1 folds and tested on the remaining fold. This process is repeated K times, with each fold serving as the test set exactly once. The average performance across all K folds is then calculated. Cross-validation helps to assess the model’s generalization performance and reduce the risk of overfitting.
In our analysis, we used K=5, which is a common choice for cross-validation. The root mean squared error (RMSE) is used to compare the performance of different models. The RMSE measures the difference between the predicted values and the actual values. A lower value of RMSE indicates a better predictive model.
The mean squared error (MSE) is calculated as follows: \[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \] and the root mean squared error (RMSE) is the square root of the MSE: \[ \text{RMSE} = \sqrt{\text{MSE}} \]
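The 5-fold procedure can be sketched in base R as follows (the report itself uses the `caret` package; the data here are simulated):

```r
# Base-R sketch of 5-fold cross-validated RMSE on simulated data
set.seed(11)
n <- 200
d <- data.frame(x = runif(n))
d$y <- 3 + 2 * d$x + rnorm(n, sd = 0.5)

K <- 5
fold <- sample(rep(1:K, length.out = n))  # randomly assign each row to a fold

rmse <- sapply(1:K, function(k) {
  train <- d[fold != k, ]                 # fit on K - 1 folds
  test  <- d[fold == k, ]                 # hold out the k-th fold
  fit   <- lm(y ~ x, data = train)
  sqrt(mean((test$y - predict(fit, newdata = test))^2))  # fold RMSE
})
mean(rmse)   # cross-validated RMSE, close to the noise SD of 0.5 here
```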
In this analysis, we utilize the R programming language, a statistical computing environment widely used for data analysis and visualization. The following libraries and packages were employed to conduct the analysis:
- `tidyverse`: a collection of R packages for data manipulation and visualization.
- `sf`: a package for spatial data processing, especially for handling shapefiles.
- `ggplot2`: a popular visualization package for creating high-quality maps and charts, part of the `tidyverse` collection.
- `ggcorrplot`: a package designed to visualize correlation matrices using `ggplot2`.
- `patchwork`: a package for combining multiple `ggplot2` plots into a single plot.
- `MASS`: a package providing multiple functions and datasets for statistical analysis, including linear models.
- `caret`: a package for creating predictive models and conducting machine learning tasks, used here for k-fold cross-validation.
- `kableExtra`: a package for creating tables with advanced formatting options.

The table below presents the summary statistics for the key variables, giving an overview of their central tendency and variability. The dependent variable, median house value, has a mean of 66,287.73, indicating that the average median house value by census block group in Philadelphia is approximately 66,288 dollars. The standard deviation of 60,006.08 dollars suggests that house values exhibit substantial variability across block groups.
For the predictor variables, all have standard deviations close to or greater than the mean, indicating large differences in educational attainment, economic status, housing occupancy, and housing structure composition across Philadelphia’s census block groups.
dependent_var <- "MEDHVAL"
predictors <- c("PCTBACHMOR", "NBELPOV100", "PCTVACANT", "PCTSINGLES")
# Compute the mean and SD of each variable, then reshape to one row per variable
summary_stats <- data %>%
dplyr::select(all_of(c(dependent_var, predictors))) %>%
summarise_all(list(Mean = mean, SD = sd), na.rm = TRUE) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
separate(Variable, into = c("Variable", "Stat"), sep = "_") %>%
pivot_wider(names_from = Stat, values_from = Value)
summary_stats$Variable <- recode(summary_stats$Variable,
"MEDHVAL" = "Median House Value",
"NBELPOV100" = "# Households Living in Poverty",
"PCTBACHMOR" = "% of Individuals with Bachelor’s Degrees or Higher",
"PCTVACANT" = "% of Vacant Houses",
"PCTSINGLES" = "% of Single House Units"
)
summary_stats <- summary_stats %>%
mutate(
Mean = round(Mean, 2),
SD = round(SD, 2)
)
summary_stats <- summary_stats %>%
arrange(Variable == "Median House Value")
predictor_rows <- which(summary_stats$Variable != "Median House Value")
dependent_rows <- which(summary_stats$Variable == "Median House Value")
# Determine the start and end rows for each group
start_pred <- min(predictor_rows)
end_pred <- max(predictor_rows)
start_dep <- min(dependent_rows)
end_dep <- max(dependent_rows)
# Create the table using kable and add extra formatting
kable(summary_stats, caption = "Summary Statistics",
align = c("l", "l", "l"), booktabs = TRUE, escape = FALSE ) %>%
add_header_above(c(" " = 1, "Statistics" = 2)) %>%
kable_styling(full_width = FALSE) %>%
group_rows("Predictors", start_pred, end_pred) %>%
group_rows("Dependent Variable", start_dep, end_dep)%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE)
| Variable | Mean | SD |
|---|---|---|
| **Predictors** | | |
| % of Individuals with Bachelor’s Degrees or Higher | 16.08 | 17.77 |
| # Households Living in Poverty | 189.77 | 164.32 |
| % of Vacant Houses | 11.29 | 9.63 |
| % of Single House Units | 9.23 | 13.25 |
| **Dependent Variable** | | |
| Median House Value | 66287.73 | 60006.08 |
One of the assumptions of OLS regression is that variables follow a normal distribution. To assess the distribution of key variables, their histograms are presented below. As seen in the plots, all key variables are right-skewed (positively skewed), with a concentration of smaller values and a long right tail.
longer_version<- data %>%
pivot_longer(cols = c("MEDHVAL", "PCTBACHMOR", "NBELPOV100", "PCTVACANT", "PCTSINGLES"),
names_to = "Variable",
values_to = "Value")
ggplot(longer_version,aes(x = Value)) +
geom_histogram(aes(y = ..count..), fill = "black", alpha = 0.7) +
facet_wrap(~Variable, scales = "free", ncol = 3, labeller = as_labeller(c(
"MEDHVAL" = "Median House Value",
"PCTBACHMOR" = "% with Bachelor’s Degrees or Higher",
"NBELPOV100" = "# Households Living in Poverty",
"PCTVACANT" = "% of Vacant Houses",
"PCTSINGLES" = "% of Single House Units"
))) +
labs(x = "Value", y = "Count", title = "Histograms of Dependent and Predictor Variables") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
To meet the normality assumption of OLS, log transformations are applied where necessary to improve distributional symmetry. This transformation helps normalize the distributions, making them more suitable for regression analysis. The histograms of the log-transformed variables indicate a more normal-like distribution. However, some slight skewness remains: the dependent variable (Log Median House Value) and the predictor Log % of Single House Units are slightly left-skewed, while Log # Households in Poverty and Log % of Vacant Houses are slightly right-skewed.
data <- data %>%
mutate(
LNMEDHVAL = log(MEDHVAL),
LNPCTBACHMOR = log(1+PCTBACHMOR),
LNNBELPOV100 = log(1+NBELPOV100),
LNPCTVACANT = log(1+PCTVACANT),
LNPCTSINGLES = log(1+PCTSINGLES)
)
longer_version2 <- data %>%
pivot_longer(cols = c(LNMEDHVAL, LNPCTBACHMOR ,LNNBELPOV100,LNPCTVACANT, LNPCTSINGLES),
names_to = "Variable",
values_to = "Value")
ggplot(longer_version2,aes(x = Value)) +
geom_histogram(aes(y = ..count..), fill = "red", alpha = 0.7) +
facet_wrap(~Variable, scales = "free", ncol = 3, labeller = as_labeller(c(
"LNMEDHVAL" = "Log Median House Value",
"LNPCTBACHMOR" = "Log % with Bachelor’s Degree",
"LNNBELPOV100" = "Log # Households in Poverty",
"LNPCTVACANT" = "Log % Vacant Houses",
"LNPCTSINGLES" = "Log % Single House Units"
))) +
labs(x = "Value", y = "Count", title = "Histograms of Log-Transformed Dependent and Predictor Variables") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=6),
axis.text.y=element_text(size=6),
axis.title=element_text(size=8))
Other regression assumptions will be assessed separately in the Regression Assumption Checks section later.
Linearity assumes that the relationship between the dependent variable and the predictors is linear. To examine this relationship, we analyze the spatial distribution of key variables. Certain predictor maps exhibit patterns similar to the dependent variable, suggesting a strong potential relationship.
Below is the choropleth map of the dependent variable, Log Transformed Median House Value. It reveals that central city areas and Germantown/Chestnut Hill have higher median house values, while North Philadelphia has lower values.
ggplot(shape) +
geom_sf(aes(fill = LNMEDHVAL), color = "transparent") +
scale_fill_gradientn(colors = c("#fff0f3", "#a4133c"),
name = "LNMEDHVAL",
na.value = "transparent") +
theme(legend.text = element_text(size = 9),
legend.title = element_text(size = 10),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8)) +
labs(title = "Log Transformed Median House Value")
To understand the spatial relationship between the dependent variable and predictors, we present four key predictor maps below as well.
shpe_longer<- shape %>%
pivot_longer(cols = c("PCTVACANT", "PCTSINGLES", "PCTBACHMOR", "LNNBELPOV"),
names_to = "Variable",
values_to = "Value")
custom_titles <- c(
PCTVACANT = "Percent of Vacant Houses",
PCTSINGLES = "Percent of Single House Units",
PCTBACHMOR = "Percent of Bachelor's Degree or Higher",
LNNBELPOV = "Log-Transformed Households in Poverty"
)
plot_list <- lapply(unique(shpe_longer$Variable), function(var_name) {
data_subset <- subset(shpe_longer, Variable == var_name)
ggplot(data_subset) +
geom_sf(aes(fill = Value), color = "transparent") +
scale_fill_gradientn(
colors = c("#fff0f3", "#a4133c"),
name = var_name,
na.value = "transparent"
) +
labs(title = custom_titles[[var_name]]) +
theme(
legend.text = element_text(size = 8),
legend.title = element_text(size = 10),
legend.key.size = unit(0.3, "cm"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 15, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill = NA, size = 0.8)
)
})
# Combine the plots into a grid (2 columns by 2 rows)
combined_plot <- (plot_list[[1]] + plot_list[[2]]) /
(plot_list[[3]] + plot_list[[4]])
combined_plot
These maps also help identify whether predictors are closely related to each other, which could cause multicollinearity problems. The Percent of Bachelor’s Degree or Higher (PCTBACHMOR) map looks very similar to the Log Transformed Median House Value (LNMEDHVAL) map, suggesting that areas with more college graduates tend to have higher house values. Also, the Percent of Vacant Houses (PCTVACANT) and Percent of Single House Units (PCTSINGLES) maps share a similar pattern: North Philadelphia and University City have a high percentage of vacant houses and a low percentage of single-house units, while Germantown/Chestnut Hill and Far Northeast Philadelphia show the opposite trend. This suggests that these two predictors may be strongly related, which could lead to multicollinearity issues in the regression analysis.
To further examine multicollinearity, we create a correlation matrix between predictors. All correlation values are below 0.7, indicating that no severe multicollinearity occurs in our model.
Among them, % of Single House Units (PCTSINGLES) and Logged Households Living in Poverty (LNNBELPOV100) have a correlation of -0.32, suggesting some relationship but not one strong enough to cause severe multicollinearity. Looking back at the choropleth maps of these two predictors, their spatial patterns are not similar. % of Single House Units (PCTSINGLES) and % of Individuals with a Bachelor’s Degree or Higher (PCTBACHMOR) have a correlation of -0.3, again indicating some relationship without severe multicollinearity; the choropleth maps of these two predictors also do not exhibit similar spatial patterns. Additionally, while the choropleth maps of Percent of Vacant Houses (PCTVACANT) and Percent of Single House Units (PCTSINGLES) share a similar pattern, as mentioned before, their correlation is only 0.2, suggesting a weak relationship without multicollinearity.
There is no severe multicollinearity among the predictors, supporting the assumptions of OLS regression. However, had the correlation matrix shown high correlations (\(|r| > 0.8\)), we could further confirm the severity of multicollinearity using the Variance Inflation Factor (VIF): VIF < 5 indicates low multicollinearity and is generally acceptable, while VIF \(\geq\) 5 suggests moderate to high multicollinearity that may require attention; VIF \(\geq\) 10 signals severe multicollinearity, necessitating corrective measures.
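VIF can be computed from first principles by regressing each predictor on the others and taking \(1/(1-R^2)\); `car::vif()` returns the same values, but since `car` is not among the packages loaded in this report, a base-R sketch on simulated predictors is shown:

```r
# Simulated predictors: a and b moderately correlated, c independent
set.seed(13)
n <- 300
X <- data.frame(a = rnorm(n))
X$b <- 0.5 * X$a + rnorm(n)
X$c <- rnorm(n)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1 / (1 - r2)
})
vif   # all well below 5: no worrying multicollinearity in this toy set
```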
custom_labels <- c(
"% of Vacant Houses" = "PCTVACANT",
"% of Single House Units" = "PCTSINGLES",
"% of Individuals with Bachelor’s Degrees or Higher" = "PCTBACHMOR",
"Log # Households Living in Poverty" = "LNNBELPOV100"
)
predictor_vars <- data[, c("PCTVACANT", "PCTSINGLES", "PCTBACHMOR", "LNNBELPOV100")]
cor_matrix <- cor(predictor_vars, use = "complete.obs", method = "pearson")
# Match labels to the matrix's column order so names line up with the right variables
rownames(cor_matrix) <- names(custom_labels)[match(colnames(cor_matrix), custom_labels)]
colnames(cor_matrix) <- rownames(cor_matrix)
ggcorrplot(cor_matrix,
method = "square",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("grey", "white", "#a4133c"))+
labs(title = "Correlation Matrix for all Predictor Variables") +
theme(plot.subtitle = element_text(size = 9, face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_text(size = 7),
axis.text.y = element_text(size = 7),
axis.title = element_text(size = 8))
Four independent variables were included in the multiple regression model to predict the log-transformed median house value (LNMEDHVAL) in Philadelphia at the census block group level: the proportion of vacant housing units (PCTVACANT), the proportion of single-family housing units (PCTSINGLES), the proportion of residents with a bachelor’s degree or higher (PCTBACHMOR), and the log-transformed number of households living below the poverty line (LNNBELPOV100). The regression results are presented below:
##
## Call:
## lm(formula = LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR +
## LNNBELPOV100, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.25825 -0.20391 0.03822 0.21744 2.24347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.1137661 0.0465330 238.836 < 2e-16 ***
## PCTVACANT -0.0191569 0.0009779 -19.590 < 2e-16 ***
## PCTSINGLES 0.0029769 0.0007032 4.234 2.42e-05 ***
## PCTBACHMOR 0.0209098 0.0005432 38.494 < 2e-16 ***
## LNNBELPOV100 -0.0789054 0.0084569 -9.330 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3665 on 1715 degrees of freedom
## Multiple R-squared: 0.6623, Adjusted R-squared: 0.6615
## F-statistic: 840.9 on 4 and 1715 DF, p-value: < 2.2e-16
The regression output tells us that median housing values are significantly correlated with the proportion of vacant housing units (PCTVACANT), the proportion of single-family housing units (PCTSINGLES), the proportion of residents with a bachelor’s degree or higher (PCTBACHMOR), and the log-transformed number of households living below the poverty line (LNNBELPOV100), with \(p < 0.0001\) for all four predictors.
As the proportion of vacant housing units (PCTVACANT) goes up by 1 unit (1 percentage point), the median house value goes down by approximately 1.90%, holding the other three predictors constant. In addition, the \(p\)-value for \(\beta_1\) is less than 0.0001, indicating that if there were no relationship between PCTVACANT and the dependent variable (i.e., if the null hypothesis \(\beta_1 = 0\) were true), the probability of obtaining a \(\beta_1\) estimate as extreme as -0.0191569 would be less than 0.0001. We can safely reject the null hypothesis \(H_0: \beta_1 = 0\) in favor of \(H_a: \beta_1 \neq 0\).
\[(e^{\beta_1} - 1) \times 100\% = (e^{-0.0191569} - 1) \times 100\% \approx -1.90\% \]
Similarly, as the proportion of single-family housing units (PCTSINGLES) goes up by 1 unit (1 percentage point), the median house value goes up by approximately 0.298%, holding the other three predictors constant. The \(p\)-value for \(\beta_2\) is less than 0.0001, indicating that if there were no relationship between PCTSINGLES and the dependent variable (i.e., if the null hypothesis \(\beta_2 = 0\) were true), the probability of obtaining a coefficient estimate at least as extreme as 0.0029769 would be less than 0.0001. We can also safely reject the null hypothesis \(H_0: \beta_2 = 0\) in favor of \(H_a: \beta_2 \neq 0\).
\[(e^{\beta} - 1) \times 100\% = (e^{0.0029769} - 1) \times 100\% \approx 0.298\% \]
Furthermore, as the proportion of residents with a bachelor’s degree or higher (PCTBACHMOR) goes up by 1 unit (1 percentage point), the median house value goes up by approximately 2.11%, holding the other three predictors constant. As with \(\beta_2\), the \(p\)-value for \(\beta_3\) is less than 0.0001, indicating that if there were no relationship between PCTBACHMOR and the dependent variable (i.e., if the null hypothesis \(\beta_3 = 0\) were true), the probability of obtaining a coefficient estimate at least as extreme as 0.0209098 would be less than 0.0001. We can also safely reject the null hypothesis \(H_0: \beta_3 = 0\) in favor of \(H_a: \beta_3 \neq 0\).
\[(e^{\beta} - 1) \times 100\% = (e^{0.0209098} - 1) \times 100\% \approx 2.11\% \]
Lastly, for a 1% increase in the number of households living below the poverty line, the median house value goes down by approximately 0.078%, holding the other three predictors constant. The \(p\)-value for \(\beta_4\) is also less than 0.0001, indicating that if there were no relationship between the number of households below the poverty line and the dependent variable (i.e., if the null hypothesis \(\beta_4 = 0\) were true), the probability of obtaining a coefficient estimate at least as extreme as -0.0789054 would be less than 0.0001. As before, we can safely reject the null hypothesis \(H_0: \beta_4 = 0\) in favor of \(H_a: \beta_4 \neq 0\).
\[(1.01^{\beta} - 1) \times 100\% = (1.01^{-0.0789054} - 1) \times 100\% \approx -0.078\%\]
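Because the model is log-linear in the percent predictors and log-log in the poverty count, these percentage effects follow directly from the coefficient estimates. A minimal R sketch reproducing the numbers above from the printed output:

```r
# Coefficient estimates copied from the regression output above
beta <- c(PCTVACANT    = -0.0191569,
          PCTSINGLES   =  0.0029769,
          PCTBACHMOR   =  0.0209098,
          LNNBELPOV100 = -0.0789054)

# Percent-scale predictors: effect on LNMEDHVAL of a 1-point increase
pct_effect <- (exp(beta[c("PCTVACANT", "PCTSINGLES", "PCTBACHMOR")]) - 1) * 100

# Log-transformed predictor: effect of a 1% increase in the poverty count
pov_effect <- (1.01^beta["LNNBELPOV100"] - 1) * 100

round(pct_effect, 3)  # -1.897, 0.298, 2.113
round(pov_effect, 3)  # -0.078
```

Note that \((e^{\beta} - 1) \times 100\%\) differs slightly from the naive \(\beta \times 100\%\) approximation for the larger coefficients.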
First, the residuals from the model show a reasonable distribution of errors, with a median residual of 0.03822, a minimum of -2.26, a first quartile of -0.20, a third quartile of 0.22, and a maximum of 2.24. This distribution suggests that the model captures much of the variability in the dependent variable (the log-transformed median house value), although some residuals lie more than 2.5 standard deviations from the mean and can be treated as outliers.
Moreover, the model has a multiple \(R^2\) of 0.6623 and an adjusted \(R^2\) of 0.6615, indicating that approximately 66% of the variance in LNMEDHVAL is explained by the predictors. The \(F\)-statistic of 840.9, with a \(p\)-value below 0.0001, indicates that the model as a whole is statistically significant.
## Analysis of Variance Table
##
## Response: LNMEDHVAL
## Df Sum Sq Mean Sq F value Pr(>F)
## PCTVACANT 1 180.392 180.392 1343.087 < 2.2e-16 ***
## PCTSINGLES 1 24.543 24.543 182.734 < 2.2e-16 ***
## PCTBACHMOR 1 235.118 235.118 1750.551 < 2.2e-16 ***
## LNNBELPOV100 1 11.692 11.692 87.054 < 2.2e-16 ***
## Residuals 1715 230.344 0.134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Analysis of Variance (ANOVA) table provides additional insight into the significance of each predictor in explaining variation in the dependent variable (LNMEDHVAL). For the proportion of vacant housing units (PCTVACANT), the sum of squares is 180.392, the amount of variation in LNMEDHVAL attributable to changes in PCTVACANT. A higher sum of squares indicates a stronger relationship between the predictor and the dependent variable. Note that these are sequential (Type I) sums of squares, so each value depends on the order in which the predictors enter the model. The sum of squares for the percentage of residents with a bachelor’s degree or higher (PCTBACHMOR) explains the largest portion of the variance in LNMEDHVAL, while the sum of squares for the log-transformed number of households living below the poverty line (LNNBELPOV100) explains the least, at only 11.69.
The residual sum of squares is 230.34, with a mean square error of 0.134. Overall, the ANOVA table provides a comprehensive overview of the significance of each predictor and the proportion of variance it explains. It confirms that all four predictors contribute significantly to explaining the variance in the dependent variable. These findings provide strong evidence of the robustness of the model and the relevance of neighborhood characteristics in predicting housing values.
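The relative contribution of each term can be read off the table by dividing each sequential sum of squares by their total; a quick sketch using the printed values:

```r
# Sequential (Type I) sums of squares copied from the ANOVA table above
ss <- c(PCTVACANT    = 180.392,
        PCTSINGLES   =  24.543,
        PCTBACHMOR   = 235.118,
        LNNBELPOV100 =  11.692,
        Residuals    = 230.344)

total_ss <- sum(ss)               # total sum of squares in LNMEDHVAL
round(100 * ss / total_ss, 1)     # percent of total variance per term
```

By this decomposition, PCTBACHMOR accounts for roughly a third of the total variance, while LNNBELPOV100 accounts for under 2%, consistent with the discussion above.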
In this section, we conduct a variety of analyses to check whether the assumptions of linear regression are met. In an earlier section, we already examined the variable distributions and multicollinearity. Here, we check the following assumptions:
To further examine the linear relationship between the dependent variable and the predictors, we create four scatter plots, where the x-axis represents each predictor and the y-axis represents the dependent variable, Log Transformed Median House Value.
longer <- data %>%
  pivot_longer(cols = c("PCTBACHMOR", "LNNBELPOV100", "PCTVACANT", "PCTSINGLES"),
               names_to = "Variable",
               values_to = "Value")

ggplot(longer, aes(x = Value, y = LNMEDHVAL)) +
  geom_point(color = "black", size = 0.4) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  facet_wrap(~ Variable, scales = "free", labeller = as_labeller(c(
    "PCTBACHMOR" = "% with Bachelor’s Degrees or Higher",
    "LNNBELPOV100" = "Logged Households Living in Poverty",
    "PCTVACANT" = "% of Vacant Houses",
    "PCTSINGLES" = "% of Single House Units"
  ))) +
  theme_light() +
  theme(plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(size = 6),
        axis.text.y = element_text(size = 6),
        axis.title = element_text(size = 8)) +
  labs(title = "Scatter Plots of Dependent Variable vs. Predictors",
       x = "Predictor Value",
       y = "Log of Median House Value")
From the plots, we observe the following trends:
Logged Households Living in Poverty shows a negative linear relationship with the dependent variable. As the logged number of households living in poverty increases, the log-transformed median house value decreases, reflected in a correlation coefficient \(< 0\). However, the association is not strictly linear: the points are widely scattered, particularly at lower poverty levels, where housing values vary considerably. This indicates that while poverty levels have a negative impact on housing values, the effect is not uniform across block groups. The substantial variation in housing values even at similar poverty levels suggests that other factors also influence housing values.
% with Bachelor’s Degrees or Higher exhibits a positive linear relationship with the dependent variable. As the percentage of individuals with a bachelor’s degree or higher increases, the log-transformed median house value also rises, with a correlation coefficient \(> 0\). The points form a tight band along the trend line, indicating that higher education levels within a neighborhood are strongly correlated with higher housing values. The consistent upward trend suggests that educational attainment is a key driver of housing prices, and the clear linear pattern implies that this variable is well suited to linear regression analysis.
% of Single House Units has a weaker but still positive relationship with the dependent variable. While the trend is less obvious, areas with a higher percentage of single-family units tend to have higher log-transformed median house values, especially when the percentage is sufficiently large; the correlation coefficient is \(> 0\). The scatter points show considerable variability in housing values at all levels of single-family units, with some block groups having high housing values despite a low percentage of single-family units. This variability suggests that while the relationship is positive, other factors are also influencing housing values.
% of Vacant Houses has a generally negative linear relationship with the dependent variable. As the percentage of vacant houses increases, the log-transformed median house value decreases, with a correlation coefficient \(< 0\). However, the graph reveals a more complex relationship, with significant variation in housing values across vacancy rates. There are clusters of block groups with high housing values and low vacancy rates, indicating that the relationship is not strictly linear. In the middle and lower ranges of vacancy rates, the decline in housing values becomes less pronounced, suggesting that while vacant housing is associated with lower property values, the impact diminishes or becomes less predictable as the vacancy rate changes.
In summary, the scatter plots show that while PCTBACHMOR and LNMEDHVAL have a clear linear relationship, the other predictors have more complex, partly non-linear relationships with the dependent variable. This suggests that the linear regression model may not fully capture the complexity of the relationships between the predictors and median house values.
Next, we examine the normality of the residuals of the regression model. The histogram of standardized residuals, shown below, provides a detailed view of the distribution of the residuals from the OLS model:
ggplot(data, aes(x = Standardized_Residuals)) +
  geom_histogram(bins = 30, fill = "black") +
  labs(title = "Histogram of Standardized Residuals",
       x = "Standardized Residuals",
       y = "Frequency") +
  theme_minimal() +
  theme(plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(size = 6),
        axis.text.y = element_text(size = 6),
        axis.title = element_text(size = 8))
This graph indicates an approximately normal distribution of the residuals, with a bell-shaped curve centered around 0. The majority of the residuals fall between -2 and 2 on the x-axis, with the highest frequency around 0. This concentration suggests that most predicted values are close to the actual values, producing residuals that are small and symmetrically distributed around 0, which generally satisfies the normality-of-residuals assumption of OLS regression.
However, there are still some noticeable deviations from perfect normality in the tails of the distribution: a few residuals fall below -3 on the left, and a few exceed 3 on the right. These outliers indicate that the model is not capturing all of the variability in the dependent variable, producing some large residuals. While the majority of residuals are normally distributed, these outliers suggest that the model may not fully capture the complexity of the relationship between the predictors and the dependent variable, and further examination of those specific data points may be needed to understand the reasons for the large residuals.
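A formal complement to the visual check is the Shapiro-Wilk test on the standardized residuals. The block-group data are not reproduced here, so the sketch below illustrates the calls on R's built-in mtcars data; the same functions apply to the fitted housing model:

```r
# Illustrative normality check on a toy model (built-in mtcars data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

std_res <- rstandard(fit)    # internally studentized (standardized) residuals

# Shapiro-Wilk test: a small p-value is evidence against normality
shapiro.test(std_res)

# Flag observations beyond +/- 2.5 standard deviations, as in the text
which(abs(std_res) > 2.5)
```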
After checking the normality of residuals, we examine homoscedasticity, which requires the residuals to have constant variance across all levels of the fitted values. Standardized residuals are the residuals divided by their estimated standard error: \[e_i^* \approx \frac{\epsilon_i}{s} \approx \frac{\epsilon_i}{\sqrt{\frac{SSE}{n-k-1}}},\] where \(k\) is the number of predictors (here \(k = 4\), so \(n-k-1 = 1715\)). The scatter plot of standardized residuals against fitted values is a valuable tool for assessing homoscedasticity:
ggplot(data, aes(x = Fitted, y = Standardized_Residuals)) +
  geom_point(color = "black", size = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Scatter Plot of Standardized Residuals vs Fitted Values",
       x = "Predicted Values",
       y = "Standardized Residuals") +
  theme_minimal() +
  theme(plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        axis.text.x = element_text(size = 6),
        axis.text.y = element_text(size = 6),
        axis.title = element_text(size = 8))
The scatter plot shows no clear pattern in the residuals as the predicted values rise or fall. Instead, the residuals appear randomly distributed around the 0 line, indicating roughly constant variance across all levels of the fitted values. The absence of an obvious funnel shape or systematic trend suggests that the homoscedasticity assumption is satisfied. In addition, the residuals are generally symmetrically distributed around the 0 line, indicating that the model does not systematically over- or under-predict the dependent variable over the range of fitted values.
However, the plot does show some potential outliers, particularly points outside the -3 to 3 range of standardized residuals. These extreme residuals lie above or below the main cluster and may indicate that the model’s predicted values deviate substantially from the actual values for those observations. Although the number of such residuals is relatively small, their presence indicates that the model may not fit all of the data well.
Then, we test the independence of observations, specifically examining spatial autocorrelation. The choropleth map of the percentage of residents with a bachelor’s degree or higher shows that Center City and the Germantown/Chestnut Hill area have higher values, while North Philadelphia has lower values. This clustering suggests potential spatial dependence among block groups. Similarly, the dependent variable, the log-transformed median house value, follows the same pattern, further indicating that observations may not be spatially independent.
The final assumption we test is the independence of residuals, which we assess by analyzing the choropleth map of regression residuals. The map reveals that residuals tend to be systematically low in the center of Philadelphia and higher in surrounding block groups, suggesting spatial autocorrelation. This pattern indicates that residuals are not randomly distributed, meaning the current regression model may not fully capture spatial heterogeneity. As a result, further examination of spatial autocorrelation in both the variables and residuals is necessary, and a spatial regression model may be required.
join <- data %>%
  dplyr::select(POLY_ID, Standardized_Residuals)

shape <- shape %>%
  left_join(join, by = c("POLY_ID" = "POLY_ID"))

ggplot(shape) +
  geom_sf(aes(fill = Standardized_Residuals), color = "transparent") +
  scale_fill_gradientn(colors = c("#fff0f3", "#a4133c"),
                       name = "Std Residuals",
                       na.value = "transparent") +
  labs(title = "Choropleth Map of Standardized Residuals") +
  theme(legend.text = element_text(size = 9),
        legend.title = element_text(size = 10),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        plot.subtitle = element_text(size = 9, face = "italic"),
        plot.title = element_text(size = 12, face = "bold"),
        panel.background = element_blank(),
        panel.border = element_rect(colour = "grey", fill = NA, size = 0.8))
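The spatial patterns noted above can be tested formally with Moran's I. As a hypothetical sketch using the spdep package on a toy regular grid (the actual test would use a contiguity structure built from the block-group polygons and the model's standardized residuals):

```r
library(spdep)

# Toy 5x5 grid with rook contiguity, standing in for block-group neighbors
nb <- cell2nb(5, 5)
lw <- nb2listw(nb, style = "W")   # row-standardized spatial weights

set.seed(42)
x <- rnorm(25)                    # stand-in for standardized residuals

# Moran's I test: a significant statistic indicates spatial autocorrelation
moran.test(x, lw)
```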
We performed stepwise regression to determine the most relevant predictors for the model. The initial model included all four predictors, with a residual sum of squares of 230.34 and an Akaike Information Criterion (AIC) of -3448.07. A lower AIC value indicates a better balance between model fit and complexity.
## Start: AIC=-3448.07
## LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100
##
## Df Sum of Sq RSS AIC
## <none> 230.34 -3448.1
## - PCTSINGLES 1 2.407 232.75 -3432.2
## - LNNBELPOV100 1 11.692 242.04 -3364.9
## - PCTVACANT 1 51.546 281.89 -3102.7
## - PCTBACHMOR 1 199.020 429.36 -2379.0
The stepwise procedure examined whether removing each predictor would reduce the AIC and improve the model fit. However, removing any predictor resulted in a higher AIC and an increased residual sum of squares (RSS), indicating that all four predictors are important to the model. For instance, removing the proportion of vacant housing units (PCTVACANT) increased the AIC to -3102.7 and the residual sum of squares to 281.89, a substantial decline in model performance.
As a result, none of the predictors were removed from the model, as eliminating any of them worsened performance. The final model therefore remained unchanged from the initial one, which includes all four predictors.
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100
##
## Final Model:
## LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 1715 230.3435 -3448.073
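The stepwise search summarized above can be reproduced with base R's step(), which drops terms only when doing so lowers the AIC. A self-contained sketch on the built-in mtcars data (the report's search applied the same call to the full housing model, which is not reproduced here):

```r
# Backward stepwise selection by AIC on a toy model (built-in mtcars data)
fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
step_fit <- step(fit, direction = "backward", trace = 0)

formula(step_fit)   # terms retained after minimizing AIC
AIC(step_fit)       # never higher than the starting model's AIC
```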
We also performed 5-fold cross-validation to assess the model’s predictive power and determine whether a model with fewer predictors could perform better.
cv_ctrl <- trainControl(method = "cv", number = 5)  # named to avoid masking base R's lm()
cvlm_model <- train(LNMEDHVAL ~ PCTVACANT + PCTSINGLES + PCTBACHMOR + LNNBELPOV100,
                    data = data, method = "lm", trControl = cv_ctrl)
print(cvlm_model)
## Linear Regression
##
## 1720 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1376, 1376, 1376, 1376, 1376
## Resampling results:
##
## RMSE Rsquared MAE
## 0.3680934 0.6638173 0.2731422
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
cvlm_model_reduced <- train(LNMEDHVAL ~ PCTVACANT + MEDHHINC,
                            data = data, method = "lm", trControl = cv_ctrl)
print(cvlm_model_reduced)
## Linear Regression
##
## 1720 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1376, 1376, 1376, 1376, 1376
## Resampling results:
##
## RMSE Rsquared MAE
## 0.4428679 0.5095797 0.3181564
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The original model with all four predictors had a cross-validated root mean squared error (RMSE) of 0.3680934, while the reduced model with only two predictors, the percentage of vacant houses (PCTVACANT) and median household income (MEDHHINC), had a higher RMSE of 0.4428679. A higher RMSE indicates that the model’s predictions deviate more from the actual values, implying lower accuracy and a poorer fit to the data. The original model therefore outperforms the reduced model, demonstrating that all four predictors contribute to improved predictive accuracy.
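For reference, the RMSE that caret reports is simply the square root of the mean squared prediction error; a minimal sketch with hypothetical values:

```r
# RMSE: square root of the mean squared difference between actual and predicted
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

actual    <- c(11.2, 10.8, 11.5, 10.9)   # hypothetical LNMEDHVAL values
predicted <- c(11.0, 11.0, 11.3, 11.1)   # hypothetical model predictions
rmse(actual, predicted)                   # 0.2
```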
In this study, we analyzed Philadelphia census block group data (n = 1,720) to examine how neighborhood characteristics relate to median house values. To address non-normality, we applied a logarithmic transformation to the median house value, which allows us to interpret the regression coefficients as approximate percentage changes in the median house value. Our regression analysis revealed that for every 1-percentage-point increase in the percentage of vacant houses (PCTVACANT), the median house value decreases by approximately 1.90%. Similarly, a 1-percentage-point increase in the percentage of single-family homes (PCTSINGLES) is associated with a roughly 0.298% increase in the median house value, while a 1-percentage-point increase in the percentage of residents with at least a bachelor’s degree (PCTBACHMOR) corresponds to about a 2.11% increase. In contrast, every 1% increase in the number of households living below the poverty line (LNNBELPOV100) is linked with approximately a 0.078% decrease in the median house value. Overall, our model explains about 66% of the variation in the dependent variable, indicating that the selected predictors significantly influence median house values. While most results were in line with expectations, the relatively small impact of the percentage of single-family homes (PCTSINGLES) was somewhat surprising.
The model’s overall quality is supported by an R² of approximately 0.66 and a highly significant F-ratio, indicating a strong overall fit. Stepwise regression, guided by the Akaike Information Criterion, retained all four predictors (PCTVACANT, PCTSINGLES, PCTBACHMOR, and LNNBELPOV100), demonstrating that each variable contributes meaningfully to explaining the variation in the median house value.
Furthermore, 5-fold cross-validation, evaluated using root mean squared error (RMSE), showed that the full model performs better than a reduced model that includes only the percentage of vacant houses (PCTVACANT) and median household income (MEDHHINC). These validation techniques reinforce our confidence in the robustness and predictive power of our model.
Despite the robust performance of our model, several limitations should be noted:
Spatial Autocorrelation: Choropleth maps indicate that median house values and some predictors display spatial clustering. This suggests that neighboring census block groups may influence one another, potentially biasing our parameter estimates. Future research should incorporate spatial econometric techniques, such as Moran’s I or spatial lag models, to address this issue.
Measurement of Poverty: The variable representing households living below the poverty line (LNNBELPOV100) is based on a raw count rather than a percentage. This choice may obscure relative differences in poverty levels across block groups and affect the interpretation of its impact on median house values.
Zero-Inflated Distributions: Although the logarithmic transformation improved the normality of most variables, it may not fully address the zero-inflated nature of some predictors. Future studies could explore alternative transformation methods or models that are better suited for handling zero-heavy distributions.
Ridge and LASSO regression are used when there are highly correlated predictors or when only a subset of features is expected to be truly important. In such cases, some predictor coefficients may become excessively large, leading to overfitting in a linear model. These methods manage overfitting by adding a penalty term to the loss function, which constrains the coefficients. However, in our model, the highest correlation between predictors is 0.32, indicating no significant multicollinearity. Additionally, no predictor exhibits excessively large coefficients. Therefore, applying Ridge or LASSO regression in this case would be unnecessary.
Our findings have significant implications for urban policy aimed at promoting community stability and equitable growth. The strong negative effect of higher vacancy rates (PCTVACANT) on median house values suggests that policies focused on rehabilitating vacant properties or supporting community-based investments could help stabilize neighborhoods without displacing existing residents. Similarly, the positive association between educational attainment (PCTBACHMOR) and median house values highlights the importance of policies that enhance educational opportunities while also ensuring affordable housing for long-term residents. Future research should consider incorporating additional predictors, such as access to public amenities, quality of local services, or historical neighborhood trends, and should employ spatial econometric techniques to capture spatial dependencies more accurately. Although our current analysis does not warrant the use of Ridge or LASSO regression, these methods might be revisited in future studies if the dataset expands or if additional predictors are included.