Using the Spatial Configuration of the Data to Improve Estimation

By

 

R. Kelley Pace

LREC Endowed Chair of Real Estate

Department of Finance

E.J. Ourso College of Business Administration

Louisiana State University

Baton Rouge, LA 70803

(225)-388-6256

FAX: (225)-388-6366

kelley@spatial-statistics.com

kelley@pace.am

 

Otis W. Gilley

Department of Economics and Finance

College of Administration and Business

Louisiana Tech University

Ruston, Louisiana 71272

(318)-257-2389

 

 

Abstract: Using the well-known Harrison and Rubinfeld (1978) hedonic pricing data, this manuscript demonstrates the substantial benefits obtained by modeling the spatial dependence of the errors. Specifically, the estimated sum-of-squared errors from the spatial autoregression fell by 44% relative to OLS. The spatial autoregression corrects predicted values by a nonparametric estimate of the error on nearby observations and thus mimics the behavior of appraisers. The spatial autoregression, by formally incorporating the areal configuration of the data to increase predictive accuracy and estimation efficiency, has great potential in real estate empirical work.

 

Both authors gratefully acknowledge the research support they have received from their respective institutions.

This manuscript appeared as:

 

Kluwer Academic Publishers owns the copyright to this work and has graciously granted permission to us to place this upon our website and Spatial Statistics CD-ROM.

 

Using the Spatial Configuration of the Data to Improve Estimation

I. Introduction

In a well-known paper, Harrison and Rubinfeld (1978) investigated various methodological issues related to the use of housing data to estimate the demand for clean air. They illustrated their procedures using data from the Boston SMSA with 506 observations (one observation per census tract) on 14 non-constant independent variables. These variables include proxies for pollution, crime, distance to various centers, geographical features, accessibility, housing size, age, race, status, tax burden, educational quality, zoning, and industrial externalities.

Despite the inclusion of a wide variety of important economic variables, the Harrison and Rubinfeld model and data exhibit various problems common to many hedonic pricing or mass appraisal models. For example, not all variables exhibit the proper sign. Specifically, the AGE variable is insignificant and positive. In addition, the residuals display a pattern across space, a result incompatible with the assumed independent and identically distributed (iid) error structure.

To resolve these empirical problems, this paper explicitly allows for the areal configuration of the observations through a spatial autoregression. By appropriate differencing of the observations, the spatial autoregression restores a more nearly iid error structure, which greatly improves the results. Specifically, the estimated spatial autoregression yields a negative and significant coefficient for AGE while vastly improving the sample goodness-of-fit. The estimated sum-of-squared errors fall by 44% relative to the original OLS results.

Section II discusses the spatial autoregressive estimator employed, section III estimates the resulting spatial autoregression, while section IV concludes with the key results.

II. A Spatial Autoregressive Estimator

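When errors exhibit spatial autocorrelation, a common estimator corrects the usual prediction of the dependent variable, $X\beta$, by a weighted average of the errors on nearby properties, scaled by an autoregressive parameter $\delta$, as in (1),

$$Y = X\beta + \delta D (Y - X\beta) + \varepsilon \qquad (1)$$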

where D represents an n by n comparable weighting matrix with 0s on the diagonal (the observation cannot predict itself). The rows of D sum to 1 as implied by (2) below. The non-zero entries on the ith row of D represent the observations whose errors interact with the error on the ith observation. We assume independent, 0 mean errors from a normal distribution. These assumptions appear in (2).

$$D\mathbf{1} = \mathbf{1}, \qquad \varepsilon \sim N(0, \sigma^2 I) \qquad (2)$$

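In the spatial statistics literature, the model in (1) and (2) describes a simultaneous autoregression (SAR) with the log-likelihood function,

$$L(\beta, \delta, \sigma^2) = \ln|B| - \frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(Y - X\beta)'B'B(Y - X\beta) \qquad (3)$$

where B equals $I - \delta D$. The maximum likelihood method estimates the model efficiently in large samples (given the assumptions hold).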
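As a minimal sketch of how such a log-likelihood could be evaluated numerically, the Python function below assumes the standard Gaussian SAR error form with B = I - δD, where δ denotes the autoregressive parameter. The function name, the three-tract toy data, and the use of NumPy are illustrative assumptions and not part of the original study.

```python
import numpy as np

def sar_log_likelihood(y, X, D, beta, delta, sigma2):
    """Gaussian SAR log-likelihood, assuming B = I - delta * D."""
    n = len(y)
    B = np.eye(n) - delta * D
    sign, logdet = np.linalg.slogdet(B)      # log|B| computed stably
    if sign <= 0:                            # delta outside the admissible range
        return -np.inf
    e = B @ (y - X @ beta)                   # spatially differenced errors
    return logdet - 0.5 * n * np.log(2 * np.pi * sigma2) - (e @ e) / (2 * sigma2)

# Hypothetical toy data: three tracts, each weighted equally on the other two.
D = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
X = np.column_stack([np.ones(3), [1.0, 2.0, 4.0]])
y = np.array([1.0, 2.5, 3.0])
beta = np.array([0.5, 0.7])

ll_iid = sar_log_likelihood(y, X, D, beta, 0.0, 1.0)   # delta = 0 gives the iid normal case
ll_sar = sar_log_likelihood(y, X, D, beta, 0.5, 1.0)
```

Setting delta to 0 removes the log-determinant term and the differencing, so the function collapses to the familiar iid normal log-likelihood that underlies OLS.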

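Assuming the existence of the ML estimate, one could predict Y via (4).

$$\hat{Y} = X\hat{\beta} + \hat{\delta} D (Y - X\hat{\beta}) \qquad (4)$$

Furthermore, (4) leads to the estimated errors in (5).

$$\hat{e} = Y - \hat{Y} = (I - \hat{\delta} D)(Y - X\hat{\beta}) \qquad (5)$$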
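The prediction step can be sketched in a few lines of Python. The sketch assumes the correction takes the form "X times beta plus δ times the weighted average of neighbouring errors", with δ the estimated autoregressive parameter; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def sar_predict(y, X, D, beta_hat, delta_hat):
    """Return (y_hat, e_hat) for the in-sample observations."""
    naive_resid = y - X @ beta_hat                         # errors from X beta alone
    y_hat = X @ beta_hat + delta_hat * (D @ naive_resid)   # add weighted neighbour errors
    e_hat = y - y_hat                                      # same as (I - delta D)(y - X beta)
    return y_hat, e_hat

# Hypothetical toy data: three tracts, each weighted equally on the other two.
D = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
X = np.column_stack([np.ones(3), [1.0, 2.0, 4.0]])
y = np.array([1.0, 2.5, 3.0])
beta_hat = np.array([0.5, 0.7])

y_hat, e_hat = sar_predict(y, X, D, beta_hat, 0.8)
```

Because each row of D has a zero diagonal, an observation's own error never enters its own prediction; only the errors of its neighbours do.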

Analogously, one could compute ex-sample errors by (6).

(6)
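One way the ex-sample calculation might look is sketched below. It rests on an assumption not stated in the text: that each held-out observation is corrected using only the errors of in-sample neighbours, collected in a hypothetical m by n weight matrix `D_new` whose rows sum to 1. All names and numbers are illustrative.

```python
import numpy as np

def sar_ex_sample_errors(y_new, X_new, D_new, y, X, beta_hat, delta_hat):
    """Errors for held-out observations, corrected by in-sample neighbours' errors."""
    in_sample_resid = y - X @ beta_hat
    y_new_hat = X_new @ beta_hat + delta_hat * (D_new @ in_sample_resid)
    return y_new - y_new_hat

# Hypothetical in-sample data (three tracts) and one held-out tract.
X = np.column_stack([np.ones(3), [1.0, 2.0, 4.0]])
y = np.array([1.0, 2.5, 3.0])
beta_hat = np.array([0.5, 0.7])

X_new = np.array([[1.0, 3.0]])
y_new = np.array([2.9])
D_new = np.array([[0.0, 0.5, 0.5]])   # held-out tract borrows from tracts 2 and 3

ex_errors = sar_ex_sample_errors(y_new, X_new, D_new, y, X, beta_hat, 0.8)
```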

 

III. Maximum Likelihood Sample Estimation of a Spatial Autoregression

This section illustrates the spatial autoregression estimator from section II using the augmented Harrison and Rubinfeld (1978) data. Section A discusses the data, section B presents the model, section C specifies the spatial weight matrix, and section D presents the actual estimation results.

 

A. Data

In a well-known paper, Harrison and Rubinfeld investigated various methodological issues related to the use of housing data to estimate the demand for clean air. They illustrated their procedures using data from the Boston SMSA with 506 observations (one observation per census tract) on 14 non-constant independent variables. These variables include levels of nitrogen oxides (NOX), particulate concentrations (PART), average number of rooms (RM), proportion of structures built before 1940 (AGE), black population proportion (B), lower status population proportion (LSTAT), crime rate (CRIM), proportion of area zoned with large lots (ZN), proportion of nonretail business area (INDUS), property tax rate (TAX), pupil-teacher ratio (PTRATIO), location contiguous to the Charles River (CHAS), weighted distances to the employment centers (DIS), and an index of accessibility (RAD). As mentioned previously, many authors have used the data to illustrate various points.

We manually collected the location of each tract in latitude (LAT) and longitude (LON) out of the 1970 census. In the process of conducting this project, we rechecked the data against the original census data. We discovered eight miscoded dependent variable observations. We employ the corrected data in the estimation.

 

B. Model

We fitted the following model from Belsley, Kuh, and Welsch (1980):

The quadratic expression involving latitude and longitude does not follow Belsley, Kuh, and Welsch. However, the addition of these terms removes any "large scale" locational factors from the conditional mean and follows a standard practice in the spatial statistics area. The addition of these variables raises the R² from .811 to .814, a very small amount.

 

C. Specification of the Spatial Weight Matrix

The weight given to the census tracts for differencing depended upon their proximity as measured by the latitude and longitude for each observation relative to all other tracts (using the Euclidean metric). Initially, we weight every observation j by its distance $d_{ij}$ from the observation i as given by the function in (8).

$$W_{ij} = \max\left(1 - \frac{d_{ij}}{d_{\max}},\ 0\right) \qquad (8)$$

Naturally, this yields a weight of 1 for the tract itself ($d_{ij}=0$) and 0 for each observation j more than $d_{\max}$ distance from observation i. Subsequently, in (9) we normalize the initial weights so that $\sum_j D_{ij} = 1$.

$$D_{ij} = \frac{W_{ij}}{\sum_{k \neq i} W_{ik}}, \quad j \neq i \qquad (9)$$

In addition, we set $D_{ii} = 0$, as assumed in section II, to prevent each observation from predicting itself. Depending upon their areal configuration, some observations may remain undifferenced while others may become differenced with many nearby observations.

For example, suppose we have 506 observations. For the third observation, D might appear as,

.

Note, the third entry of this row equals 0 while the row sums to 1.
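The construction of D described in this section can be sketched as follows. The sketch assumes a linear taper, max(1 - d/dmax, 0), as the initial weight function; that functional form, the function name, and the four sample coordinates are assumptions made for illustration only.

```python
import numpy as np

def spatial_weights(lat, lon, dmax):
    """Row-normalised weight matrix with zero diagonal (linear taper is an assumption)."""
    coords = np.column_stack([lat, lon])
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))           # Euclidean distance in lat/lon units
    W = np.maximum(1.0 - dist / dmax, 0.0)            # 1 at distance 0, 0 beyond dmax
    np.fill_diagonal(W, 0.0)                          # no observation predicts itself
    row_sums = W.sum(axis=1, keepdims=True)
    # Rows with no neighbour inside dmax stay all zero (the tract remains undifferenced).
    return np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

# Hypothetical coordinates: three clustered tracts and one isolated tract.
lat = np.array([0.0, 0.0, 0.005, 1.0])
lon = np.array([0.0, 0.004, 0.0, 1.0])
D = spatial_weights(lat, lon, dmax=0.0099)
```

In this toy case the first three rows sum to 1 and give more weight to the nearer neighbour, while the isolated fourth tract has an all-zero row.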

 

 

D. Maximum Likelihood Sample Estimation

Table 1 contains the sample estimates from using OLS and the SAR maximum likelihood estimators. Based upon a two-dimensional grid search, the SAR maximum likelihood estimate of the autoregressive parameter δ was .8 and dmax was .0099. For the SAR maximum likelihood estimate, the sample R² was .89571 while for OLS the corresponding R² was .81388, so the OLS sum-of-squared errors exceeds the corresponding SAR maximum likelihood sum-of-squared errors by 79%. Note, the model contained numerous variables controlling for locational effects. It included a variable for distances to the various centers, a variable measuring accessibility to radial highways, a Charles River dummy, and a bivariate quadratic function of latitude and longitude. Despite a very reasonable effort to control for locational effects, the SAR maximum likelihood estimator greatly reduced overall errors.
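A two-dimensional grid search of this kind can be sketched as below. The sketch is run on simulated data, not the Boston data, and makes several assumptions: a linear taper for the initial weights, a profile (concentrated) likelihood in which the regression coefficients and error variance are replaced by their least-squares values for each candidate δ, and arbitrary grid values. All function names are hypothetical.

```python
import numpy as np

def spatial_weights(coords, dmax):
    """Row-normalised weights with zero diagonal (assumed linear taper to dmax)."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    W = np.maximum(1.0 - dist / dmax, 0.0)
    np.fill_diagonal(W, 0.0)
    s = W.sum(axis=1, keepdims=True)
    return np.divide(W, s, out=np.zeros_like(W), where=s > 0)

def profile_loglik(y, X, D, delta):
    """Log-likelihood with beta and sigma^2 concentrated out (constants dropped)."""
    n = len(y)
    B = np.eye(n) - delta * D
    sign, logdet = np.linalg.slogdet(B)
    if sign <= 0:
        return -np.inf
    By, BX = B @ y, B @ X
    beta, *_ = np.linalg.lstsq(BX, By, rcond=None)    # OLS on the differenced data
    sse = np.sum((By - BX @ beta) ** 2)
    return logdet - 0.5 * n * np.log(sse / n)

def grid_search(y, X, coords, deltas, dmaxes):
    """Return (loglik, delta, dmax) with the highest profile likelihood on the grid."""
    results = []
    for dmax in dmaxes:
        D = spatial_weights(coords, dmax)
        for delta in deltas:
            results.append((profile_loglik(y, X, D, delta), delta, dmax))
    return max(results)

# Simulated example: 60 points on the unit square with spatially dependent errors.
rng = np.random.default_rng(0)
n = 60
coords = rng.uniform(size=(n, 2))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
D_true = spatial_weights(coords, 0.3)
eps = rng.normal(scale=0.1, size=n)
y = X @ np.array([1.0, 2.0]) + np.linalg.solve(np.eye(n) - 0.8 * D_true, eps)

deltas = np.arange(0.0, 1.0, 0.1)
dmaxes = [0.2, 0.3, 0.4]
best_ll, best_delta, best_dmax = grid_search(y, X, coords, deltas, dmaxes)
```

The log-determinant term is what distinguishes this search from simply minimising the sum-of-squared errors of the differenced regression: it penalises values of δ that shrink the errors merely by over-differencing.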

The explanation for this lies in the type of error. The spatial statistics literature draws a distinction between "large scale" and "small scale" variation. All of the locational variables included in the Harrison and Rubinfeld data measure large scale effects. However, as the activities of real estate appraisers attest, the small scale neighborhood and very local influences may prove more important in the prediction of housing values. Differencing contiguous and other nearby tracts from each other cancels much of the error from unobservable local causes. This lower error can increase the efficiency of parameter estimates which in turn can aid accurate prediction.

Note the treatment by the two estimators of the AGE variable. OLS produces a positive but insignificant estimate of AGE while the maximum likelihood SAR produces a negative estimate with a t statistic of -3.32. Furthermore, the zoning variable (ZN) under OLS has a negative but insignificant estimate. In contrast, the maximum likelihood SAR estimator yields a positive estimate of the effects of zoning that is significant at the 10% level (t statistic of 1.81).

The estimators differ in their estimates of the magnitude of other effects. For example, the SAR maximum likelihood estimate for the variable B, the effects of race, is 86% greater than the corresponding OLS estimate. However, the SAR maximum likelihood assigns other variables lower parameter estimates than OLS. Specifically, the coefficient on the pollution variable (NOXSQ) changes from -.59965 under OLS to -.36895 under the SAR maximum likelihood estimator. As the pollution variable was the main focus of the Harrison and Rubinfeld study, this highlights the importance of estimator choice.
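The magnitudes quoted in this section can be checked directly from the figures reported in Table 1. The short calculation below is only an arithmetic check; it uses the fact that both R² values share the same total sum of squares, so the ratio of the sums-of-squared errors equals the ratio of (1 - R²). Variable names are illustrative.

```python
# Figures as reported in Table 1.
r2_ols, r2_sar = 0.81388, 0.89571
b_ols, b_sar = 0.00036, 0.00067          # coefficients on B
nox_ols, nox_sar = -0.59965, -0.36895    # coefficients on NOXSQ

# SSE = (1 - R^2) * SST, and SST is common to both fits, so it cancels.
sse_ratio = (1 - r2_ols) / (1 - r2_sar)  # about 1.78
sse_reduction = 1 - 1 / sse_ratio        # about 0.44, the 44% fall in SSE

b_ratio = b_sar / b_ols                  # about 1.86, i.e. 86% greater
nox_shrinkage = 1 - nox_sar / nox_ols    # share by which the NOXSQ coefficient shrinks
```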

IV. Conclusion

One cannot judge estimators on the basis of a single sample. Nonetheless, the much better fit produced by the SAR maximum likelihood estimator relative to OLS (SSEols/SSEsar=1.78, as implied by the R² values in Table 1) should make it a candidate for real estate empirical work. In addition, the SAR maximum likelihood estimator's negative and significant coefficient for AGE and positive and significant coefficient for zoning (ZN) coincide more closely with most individuals' priors than the OLS estimates, which were insignificant and had the opposite signs.

The SAR maximum likelihood estimator can use the same variables as OLS to estimate a regression. However, the SAR maximum likelihood estimator, like an appraiser, uses the correlated errors on nearby properties to improve the overall prediction.

Ironically, the formal empirical tools currently employed in real estate do not make much use of the rich spatial information present in the data. Indeed, even the implementation of the SAR estimator herein leaves substantial room for improvement. The present implementation assumes "isotropy" (same variance-covariance structure over space) and does not take into account many of the factors appraisers might use. For example, we do not account for the road network or physical obstructions such as rivers.

The continual improvement of geographic information systems offers great potential for incorporating such types of spatial information in constructing the weight matrix, D (Clapp and Rodriguez (1995)). For example, using Census data, one could attempt the following refinement for transactions data since the Census attempts to group similar entities. Holding distance constant, one could give higher weights to transactions occurring in the same census block, slightly lesser weights to observations in the same block group, lower weights yet to those in the same tract, and the lowest weights to those in a different tract. As an additional example, one could program the geographical information system to change the weight given (holding distance constant) to an observation depending upon traffic counts. The experience of appraisers over the years should lead to rich heuristics for specifying weights. The intersection of geographical information systems, appraiser heuristics, and spatial statistics has a great potential in sharpening the results from real estate data.
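The census-based refinement suggested above could be prototyped along the following lines. This is only a sketch: the four multipliers, the dictionary keys, and the function names are hypothetical, and any real choice of values would need to be estimated or guided by appraiser experience.

```python
def census_multiplier(a, b, multipliers=(1.0, 0.75, 0.5, 0.25)):
    """Scale factor for a distance-based weight; the four values are hypothetical."""
    same_block, same_group, same_tract, different_tract = multipliers
    if a["tract"] != b["tract"]:
        return different_tract
    if a["block_group"] != b["block_group"]:
        return same_tract
    if a["block"] != b["block"]:
        return same_group
    return same_block

def refined_weight(distance_weight, a, b):
    """Distance-based weight adjusted for shared census geography."""
    return distance_weight * census_multiplier(a, b)

# Hypothetical transactions identified by census geography.
sale_a = {"tract": "0101", "block_group": "1", "block": "1001"}
sale_b = {"tract": "0101", "block_group": "1", "block": "1002"}
sale_c = {"tract": "0102", "block_group": "2", "block": "2001"}

w_ab = refined_weight(0.6, sale_a, sale_b)   # same block group, different block
w_ac = refined_weight(0.6, sale_a, sale_c)   # different tract
```

After such an adjustment the rows of the weight matrix would need to be renormalized so that they again sum to 1.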

 

 

Bibliography

Subramanian, Shankar, and Richard T. Carson, "Robust Regression in the Presence of Heteroskedasticity," Advances in Econometrics, JAI Press, Volume 7, p. 85-138, 1988.

 

Table 1 — OLS and Spatial Autoregressive Estimates

Variable        β_ols    t_ols        β_sar    t_sar
CRIM         -0.01186    -9.53     -0.00670    -6.83
ZN           -0.00021    -0.37      0.00091     1.81
INDUS        -0.00041    -0.17     -0.00101    -0.35
CHAS          0.08165     2.46     -0.01231    -0.45
NOXSQ        -0.59965    -5.07     -0.36895    -2.37
RM2           0.00593     4.50      0.00873     8.39
AGE           0.00009     0.17     -0.00162    -3.32
DIS          -0.21579    -4.40     -0.18685    -2.63
RAD           0.08882     4.53      0.07262     3.72
TAX          -0.00043    -3.50     -0.00041    -3.51
PTRATIO      -0.02709    -5.20     -0.01704    -3.09
B             0.00036     3.53      0.00067     5.99
LSTAT        -0.37763   -15.26     -0.24588   -11.35
LAT        -278.54000    -1.44   -262.61000    -1.38
LON           9.87540     0.03    555.95000     1.99
LAT*LON      -0.18337    -0.12     -1.60820    -1.19
LAT2          2.01620     1.50      2.32520     1.71
LON2          0.03862     0.01     -5.22980    -1.71
R2            0.81388               0.89571
δ                                   0.8000
dmax                                0.0099