Quick Computation of Spatial Autoregressive Estimators

By

R. Kelley Pace
LREC Chair of Real Estate
E.J. Ourso College of Business
Louisiana State University
Baton Rouge, LA 70803-6308
(578)-388-6256

and

Ronald Barry
Associate Professor of Mathematical Sciences
University of Alaska
Fairbanks, Alaska 99775-6660
(907)-474-7226
FAX: (907)-474-5394
FFRPB@aurora.alaska.edu
The contact information is updated from the published manuscript. This was published in: Pace, R. Kelley, and Ronald Barry, “Quick Computation of Regressions with a Spatially Autoregressive Dependent Variable,” Geographical Analysis, Volume 29, Number 3, July 1997, p. 232-247. We would like to thank Ohio State University Press for having granted us copyright permission to distribute this electronic version of the paper. Ohio State University Press owns the copyright to this work.
Abstract: Spatial estimators usually require the manipulation of n² relations among n observations and use operations such as determinants, eigenvalues, and inverses whose operation counts grow at a rate proportional to n³. This paper provides ways to quickly compute estimates when the dependent variable follows a spatial autoregressive process, which by appropriate specification of the independent variables can subsume the case when the errors follow a spatial autoregressive process. Since only nearby observations tend to affect a given observation, most observations have no effect and hence the spatial weight matrix becomes sparse. By exploiting sparsity and rearranging computations, one can compute estimates at low cost. As a demonstration of the efficacy of these techniques, the paper provides a Monte Carlo study whereby 3107 observation regressions require only 0.1 seconds each when using Matlab on a 200MHz Pentium Pro personal computer. In addition, the paper illustrates these techniques by examining voting behavior across US counties in the 1980 presidential election.

ACKNOWLEDGMENTS: We would like to thank the University of Alaska for its generous research support. In particular, we would like to thank Todd Lee for his comments. In addition, Pace would like to thank the Center for Real Estate and Urban Economics, University of Connecticut for its support. We would both like to thank the referees and the editor for their careful reading of the manuscript and the resultant suggestions which have led to considerable improvements.
Quick Computation of Spatial Autoregressive Estimators

Introduction

The examination of empirical data over space, with explicit recognition of the influence each observation has upon the others, has made large gains since the seminal contributions of Whittle (1954). For example, Ord (1975) proposed an algorithm involving eigenvalues which made spatial estimation practical for small to moderate sized data sets. In addition, Griffith (1988), Anselin (1988), Haining (1990), Anselin and Hudak (1992), and others have worked on the implementation of spatial estimation and have provided actual code in major languages for computing spatial estimates. Despite these advances, much work needs to be done to extend the benefits of spatial estimation to the larger data sets made increasingly available by the widespread use of geographic information systems.

Spatial estimators by necessity must examine the relation between each observation and every other observation. This leads to the use of n by n matrices, where n represents the number of observations. Quickly, the logistics spiral out of control: a 10,000 observation problem creates a 10,000 by 10,000 matrix which would require 800MB of storage (double precision). Computational counts for operations such as determinants and inverses grow with the cube of n. Point estimates alone create these computational exigencies; inference can further exacerbate the computational demands. As an additional problem, traditional maximum likelihood techniques require non-linear optimization using either analytic derivatives or finite difference approximations. Unfortunately, these can fail to find the global optimum and do so without informing the user of their failure. For example, Ripley (1988, p. 11-15) provides an example with numerous local optima in a one-parameter profile likelihood problem. Hence, an ideal spatial estimator would (a) handle large data sets; (b) handle point estimation and inference quickly; and (c) not rely on local non-linear optimization algorithms. This paper provides algorithms which achieve all of these goals.

The main weapon against these problems is sparseness, the prevalence of zeros in a matrix. The zeros arise because only nearby observations directly affect each other. Sparseness greatly accelerates computations and reduces storage requirements. For example, sparseness allows us to compute 100 determinants of 3107 by 3107 matrices in 83 seconds. Moreover, while the dense version of these matrices would require 77MB, the sparse version requires less than 1MB. A secondary method is based on reorganizing the computations in the likelihood to avoid iteration and on solving equation systems as opposed to computing inverses.

We examine the case where the dependent variable follows a spatial autoregressive process. This case can subsume the case where the errors follow a spatial autoregressive process when the independent variables include variables together with their spatial lags (Anselin 1988, p. 227). To illustrate our technique, we provide an example and a simulation using U.S. counties or their equivalents. Counties have many attractive characteristics as a geographical entity. First, they tend to have stable definitions over recent time, in contrast to census tracts and blocks. Second, many variables are available at this level of aggregation. For example, USA Counties on CD-ROM contains 2,844 variables across 3,141 counties or their equivalents across the U.S.
These data cover many topics of interest such as age, agriculture, crime, housing, income, education, and elections. More aggregation would destroy the geographical nature of the data, and less aggregation often reduces data availability because of problems with ensuring the privacy of respondents.

As our empirical example, we model voting behavior across space. Many countries have automatic voter registration and compulsory voter turnout. In contrast, the relatively low rate of voter registration and turnout in the U.S. generates considerable comment. Most of the discussion centers around variables such as income, traditionally discussed without explicitly considering spatial effects. However, political offices and parties have traditionally been organized in hierarchies using such geographical entities as precincts, wards, counties, and states. It seems plausible that the effects of voting across space cannot be completely described by a small number of observable independent variables. To what degree do the observable independent variables versus the spatial effects describe voting behavior across US counties? We show how our techniques facilitate investigating the question of the relative importance of these two components.

In addition, we conduct a Monte Carlo experiment involving 22,500 regressions of 3107 observations per regression using the same spatial weight matrix employed in the empirical voting example. The simulation demonstrates how sparsity greatly reduces the difficulty of generating vectors of spatially dependent variates. As we show, it takes, on average, only 0.1 seconds using the Matlab language on a 200MHz Pentium Pro to generate a spatially dependent vector and to find the resultant estimates.

The next section presents the model, likelihood function, and estimation procedures. Later sections show the role of sparsity, illustrate the techniques with a voting example, and discuss a Monte Carlo study of the estimator. The final section presents the key results.

A Quick Spatial Autoregressive Estimator

This section discusses estimation of a spatial autoregressive process in the dependent variable. Part 1 presents the model and its likelihood, part 2 discusses the estimated generalized least squares (EGLS) and maximum likelihood (ML) estimation of the model, part 3 provides a means to conduct inference without computing information matrices, and part 4 examines the computation of likelihood ratio tests.

1. The Spatial Autoregressive Likelihood Function
When the dependent variable exhibits spatial autocorrelation, the simultaneous autoregressive estimator corrects the usual prediction of the dependent variable, as in (1),

(1)  (I − αD)Y = Xβ + ε,

where D represents an n by n weighting matrix with 0s on the diagonal (the observation cannot predict itself) and

(2)  (I − αD)Y = (1 − α)Y + α(I − D)Y,  0 ≤ α < 1.

Hence, we are looking for some optimal convex combination of Y in levels and its spatial first differences (I − D)Y. To maintain the interpretation of a weighted average, the rows of D sum to 1, as implied by (5) below. Such weighting matrices are said to be row-standardized (Anselin and Hudak 1992, p. 514). A non-zero entry in the jth column of the ith row indicates that the jth observation will be used to adjust the prediction of the ith observation (i ≠ j). After correcting for these interactions, the simultaneous autoregressive (SAR) models assume the residuals, ε, are independently and normally distributed. These assumptions are summarized as:
(3)  (I − αD)Y = Xβ + ε,  ε ~ N(0, σ²I).

As an illustration of how to construct D, compare the distance d_ij between every pair of observations j and i to d_i(m), the distance from observation i to its mth nearest neighbor, and assign the initial weights as in (4),

(4)  D*_ij = 1 if d_ij ≤ d_i(m) and i ≠ j; D*_ij = 0 otherwise.

Subsequently, one could normalize the initial weights so that each row sums to 1, as in (5),

(5)  D_ij = D*_ij / Σ_k D*_ik.
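To make the construction in (4) and (5) concrete, the sketch below builds a row-standardized nearest-neighbor weight matrix with scipy. This is our Python illustration, not the authors' code (their computations used Matlab); the function name make_weight_matrix and the default of m = 4 neighbors (the value used later in the paper) are our choices.

```python
# Sketch: build a row-standardized m-nearest-neighbor weight matrix D
# from point coordinates, along the lines of (4) and (5).
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix

def make_weight_matrix(coords, m=4):
    """coords: (n, 2) array of centroids; m: neighbors per observation."""
    n = coords.shape[0]
    tree = cKDTree(coords)
    # Query m+1 neighbors because each point is its own nearest match.
    _, idx = tree.query(coords, k=m + 1)
    rows = np.repeat(np.arange(n), m)
    cols = idx[:, 1:].ravel()        # drop the self-match in column 0
    vals = np.full(n * m, 1.0 / m)   # equal weights give row sums of 1, as in (5)
    return csr_matrix((vals, (rows, cols)), shape=(n, n))
```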
Finally, the profile likelihood function for the autoregressive model in (1) appears in (6),

(6)  ln L(α) = C + ln|I − αD| − (n/2) ln(SSE(α)),

where C represents a constant and SSE denotes the sum-of-squared errors.

2. Estimated Generalized Least Squares Computations

If one knew the value of α, one could estimate β by ordinary least squares applied to the spatially transformed dependent variable (I − αD)Y, as in (7),
(7)  β̂(α) = (X'X)⁻¹X'(I − αD)Y.

Conditional upon α, the optimal solution for β̂(α) decomposes into a combination of two ordinary least squares estimates, as in (8),

(8)  β̂(α) = β̂_O − αβ̂_D,

where

(9)  β̂_O = (X'X)⁻¹X'Y and β̂_D = (X'X)⁻¹X'DY,

where the associated residuals are e_O = Y − Xβ̂_O and e_D = DY − Xβ̂_D, so that SSE(α) = e_O'e_O − 2αe_O'e_D + α²e_D'e_D. Interestingly, if one simply wishes to compute EGLS, one could form the first order conditions as in (10),

(10)  dSSE(α)/dα = −2e_O'e_D + 2αe_D'e_D = 0.

This leads to the simple solution in (11),

(11)  α̂_EGLS = e_O'e_D / e_D'e_D.
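The EGLS computations in (8) through (11) amount to two ordinary least squares fits and a scalar division. A minimal Python sketch, assuming a numpy vector y, matrix X, and sparse weight matrix D, might read as follows; the variable names are ours and the code is an illustration rather than the authors' implementation.

```python
# Sketch of the EGLS computations in (8)-(11).
import numpy as np

def egls(y, X, D):
    Dy = D @ y
    # Two OLS fits, as in (9): y on X and Dy on X.
    b_o, *_ = np.linalg.lstsq(X, y, rcond=None)
    b_d, *_ = np.linalg.lstsq(X, Dy, rcond=None)
    e_o = y - X @ b_o                  # residuals from the first fit
    e_d = Dy - X @ b_d                 # residuals from the second fit
    alpha = (e_o @ e_d) / (e_d @ e_d)  # first-order condition (10) -> (11)
    beta = b_o - alpha * b_d           # equation (8)
    return alpha, beta, e_o, e_d
```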
3. Maximum Likelihood Computations

Returning to the maximum likelihood approach, one could rewrite the log-likelihood function by substituting the quadratic expression for the SSE developed above into the log-likelihood function in (6). After dropping the constant, this yields (12),

(12)  ln L(α) = ln|I − αD| − (n/2) ln(e_O'e_O − 2αe_O'e_D + α²e_D'e_D).

We wish to maximize the log-likelihood over α, which we accomplish by evaluating (12) across a finite grid of values, as in (13),

(13)  α̂_ML = argmax over α ∈ [0, 1) of ln|I − αD| − (n/2) ln(SSE(α)).

Given (a) the scalars e_O'e_O, e_O'e_D, and e_D'e_D, and (b) the log-determinants ln|I − αD| computed for each grid value of α, evaluating (12) over the entire grid reduces to a handful of vector operations. From a computational standpoint, the vectorization of the problem avoids the relatively high overhead incurred by invocation of non-linear optimizers and their normally sequential nature. In computational environments with fast vector operations, such sequential operations can greatly reduce performance. The use of a finite set of values of α will cause some small granularity in the chosen value α̂_ML, but it should not prove difficult to make the granularity small relative to the statistical precision of the estimated α̂_ML. While this approach may suffer a small loss of precision relative to non-linear maximization, evaluating the log-likelihood function over a grid offers the advantage of robustness.
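The vectorized grid evaluation might look like the Python sketch below. It assumes the residual vectors e_O and e_D from the EGLS step and a precomputed vector of log-determinants ln|I − αD|, one per grid value (a sparse LU sketch for computing these appears in a later section); the names are ours.

```python
# Sketch of the grid maximization in (12)-(13). Because SSE(alpha) is
# quadratic in alpha, the whole grid is one vector expression -- no
# non-linear optimizer is invoked.
import numpy as np

def profile_ml(e_o, e_d, logdets, alphas):
    n = e_o.shape[0]
    sse = (e_o @ e_o) - 2.0 * alphas * (e_o @ e_d) + alphas**2 * (e_d @ e_d)
    loglik = logdets - (n / 2.0) * np.log(sse)   # equation (12)
    best = int(np.argmax(loglik))                # equation (13)
    return alphas[best], loglik
```

With a grid such as alphas = np.linspace(0, 0.99, 100), the granularity is 0.01 and can be refined simply by using a finer grid.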
4. Avoiding Computation of the Information Matrix

Computation of the information matrix often becomes expensive in a spatial context. Inspection of the information matrix in Ord (1975, p. 124-125) or in Anselin (1988, p. 76-77) shows it requires, among other computations, an n by n inverse and multiplication of two n by n matrices. Moreover, its usual formulation requires the eigenvalues of the spatial weight matrix. Finding the eigenvalues of a large matrix demands substantial computational resources. In addition, the information matrix approach works best when the profile likelihoods are quadratic in α. However, most plots of the profile likelihood display substantial asymmetry (Ripley 1988, p. 14). In such cases, as Meeker and Escobar (1995) and others forcefully argue, profile likelihood techniques can outperform the information matrix approach. Finally, the information matrix approach requires enough "smoothness" to make second derivatives well-behaved.

Fortunately, since the technology previously outlined facilitates rapid maximization of the likelihood, this provides an alternative route to inference. For example, the speedy maximization of the likelihood function allows one to compute restricted least squares estimates for submodels and thereby form likelihood ratio tests. In (14) appears the formula for the restricted least-squares estimator given the linear hypotheses Rβ = r,

(14)  β̂_R = β̂ − (X'X)⁻¹R'[R(X'X)⁻¹R']⁻¹(Rβ̂ − r).

For a single hypothesis such as β_j = 0, R is a 1 by k vector containing a 1 in the jth position and 0s elsewhere, and r equals 0. More generally, for a joint hypothesis comprised of h parts, R is h by k, r is an h by 1 vector, and the expression [R(X'X)⁻¹R']⁻¹ involves only an h by h inverse, a trivial computation for small h. As a more detailed example, suppose interest centers on whether a variable such as ln(Income) and its spatial lag jointly affect the dependent variable: R would then contain two rows, each selecting one of the two coefficients, and r would be a 2 by 1 vector of zeros. While one could drop variables to implement hypotheses such as these, the restricted least squares formula in (14) avoids reconstructing the regression for each hypothesis and hence costs less to compute.
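The sketch below implements (14) with linear solves in place of explicit inverses; restricted_ls is our hypothetical name, and the inputs are the unrestricted estimate b together with R and r.

```python
# Sketch of the restricted least squares estimator in (14):
# b_r = b - (X'X)^{-1} R' [R (X'X)^{-1} R']^{-1} (R b - r).
import numpy as np

def restricted_ls(X, b, R, r):
    XtX = X.T @ X
    A = np.linalg.solve(XtX, R.T)                 # (X'X)^{-1} R', k by h
    lagrange = np.linalg.solve(R @ A, R @ b - r)  # only an h by h system
    return b - A @ lagrange
```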
Sparsity and Computations

If differencing an observation with its nearby neighbors removes most of the effects of autocorrelation, the spatial weighting matrix D can be quite sparse. For example, if an observation displays error dependency with only its nearest m neighbors, only m non-zero entries exist per row of D. Thus, D will contain nm non-zero elements out of n² possible elements. This produces an m/n proportion of non-zero elements, a popular measure of sparsity. For example, with this problem we used four neighbors for each observation. Hence, D has a sparsity of 4/3107 (0.13%). This represents a very high level of sparsity, and the level of sparsity increases as n grows.

Sparsity results in a number of computational gains. First, it dramatically decreases the storage needed for D and for matrices derived from it such as (I − αD). Second, sparsity greatly accelerates computations. For example, multiplying the n by n matrix D by the n by k matrix X requires O(kn²) operations using dense matrices. Barring computational bookkeeping, the equivalent sparse operation requires O(knm) operations. The real benefits come when computing determinants, inverses, or solving systems of equations. All of these operations can build upon the LU decomposition of a matrix through Gaussian elimination, and all require O(n³) operations when using dense matrices (Golub and Van Loan 1989). Sparsity, however, can totally change the order of the number of operations required in these computations. For example, if the matrix had all of its non-zero elements in a narrow band about the diagonal, the operation count for the LU decomposition would grow only linearly in n for a fixed bandwidth. Unfortunately, the existence of a pure band structure does not arise very often. Figure 1a shows the actual plot of the non-zero elements in the spatial weight matrix for the county data; while the non-zeros do not form a pure band, judicious reordering of the rows and columns (George and Liu 1981) concentrates them and limits the fill-in incurred during the LU decomposition.
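As an illustration of the determinant computation through a sparse LU decomposition, consider the Python sketch below, assuming scipy's sparse routines. For a row-standardized D and 0 ≤ α < 1, the determinant of (I − αD) is positive, so ln|I − αD| is recovered as the sum of the logs of the absolute diagonal entries of U (L carries a unit diagonal).

```python
# Sketch: ln|I - alpha*D| via a sparse LU factorization.
import numpy as np
from scipy.sparse import identity
from scipy.sparse.linalg import splu

def logdet(D, alpha):
    n = D.shape[0]
    A = (identity(n, format='csc') - alpha * D).tocsc()
    lu = splu(A)  # sparse Gaussian elimination with fill-reducing reordering
    return float(np.sum(np.log(np.abs(lu.U.diagonal()))))
```

Calling logdet over the grid of α values supplies the vector of log-determinants used in the profile likelihood sketch of the previous section.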
While the actual mechanics of these algorithms may seem quite involved, the intuition is simple. If one had many equations to solve, the fewer variables in each equation the better (more sparsity preferred). It seems intuitive to arrange the equations so that each one uses variables present only in nearby equations (low bandwidth preferred). Ideally, one would like to order the equations so that one could solve the first one for a variable, take this variable's value, substitute it into the second one, solve for the second variable, and so on. Gaussian elimination and the LU decomposition allow one to perform precisely this procedure. Hence, these algorithms formalize and swiftly execute a natural set of computations.

We computed the 100 determinants of the 3107 by 3107 matrices (I − αD), one for each grid value of α, in 83 seconds. To place these results in perspective, Li (1995) took the eigenvalue route to computing determinants. Li used an IBM RS6000 Model 550 and a CM5 parallel processing supercomputer. The CM5 had 32 processors, each with 32MB of local memory and four vector units. For a 2500 by 2500 spatial weight matrix the RS6000 required 8515.07 seconds while the CM5 required 45.78 seconds. Adjusting for size differences ((3107/2500)³), these times would go to 16345.26 and 87.88 seconds for a 3107 by 3107 problem. Hence, the use of sparse technology allows personal computers to approach supercomputer performance for this problem.

The use of sparsity does not preclude the use of supercomputing technology. A substantial amount of development has gone into devising parallel sparse routines (Saad 1996, p. 324-422). Employing supercomputers and sparseness could vastly extend the range of computable spatial problems. Moreover, as demonstrated herein, one can easily vectorize spatial estimators. Note also that the smoothness of the log-determinant as a function of α means that little is lost by evaluating it over a finite grid.

Voting Across Counties

In this section we illustrate the techniques presented in previous sections using data on the votes cast in the 1980 presidential election across U.S. counties. In what follows, part 1 discusses the data employed, part 2 presents the two models used, part 3 gives the estimation results, while part 4 demonstrates the use of the likelihood ratio tests discussed in a previous section.

1. Voting Data

Specifically, we used the geographic centroids from all the counties (or their equivalents) in the continental U.S. from the 1990 Census which recorded votes in the 1980 presidential election. This yielded a matrix D with 3,107 rows and 3,107 columns. We picked 1980 because the presidential election cycle of every four years corresponded to the census data collection cycle of every ten years. We collected data on the total number of votes cast in the 1980 presidential election per county (Votes), the population in each county of 18 years of age or older (Pop), the population in each county with a 12th grade or higher education (Education), the number of owner-occupied housing units (Houses), and the aggregate income (Income).

2. Models

We fitted two models, by OLS and maximum likelihood, respectively. We elected to examine the log of the proportion of votes cast for both candidates in the 1980 presidential election. Hence, we can express our dependent variable as ln(PrVotes) = ln(Votes/Pop) = ln(Votes) − ln(Pop).
We fitted the following model via OLS:

ln(PrVotes) = Intercept β₁ + ln(Pop)β₂ + ln(Education)β₃ + ln(Houses)β₄ + ln(Income)β₅ + ε

We fitted the following model, which subsumes the previous one, via maximum likelihood:

ln(PrVotes) = Intercept β₁ + ln(Pop)β₂ + ln(Education)β₃ + ln(Houses)β₄ + ln(Income)β₅ + Dln(Pop)β₆ + Dln(Education)β₇ + Dln(Houses)β₈ + Dln(Income)β₉ + Dln(PrVotes)α + ε

This looks at the same fundamental model, but adds to it the spatial lags of the dependent and independent variables.

3. Estimation Results

As Table 2 documents, both the OLS and maximum likelihood predictions showed reasonably good relative fit, with R²s of .5242 and .7123, respectively. As this indicates, maximum likelihood using the spatial information displayed considerably lower error than OLS. In fact, the SSE from OLS of 49.2825 was 78.12% higher than the SSE of 27.6686 from maximum likelihood. Moreover, the median absolute error from OLS of .0864 was 40.49% higher than the median absolute error of .0615 from maximum likelihood.

The OLS estimates displayed the expected signs with the exception of income, which was negative and significant. The variable ln(Pop) had a negative sign for both OLS and maximum likelihood. However, this arises because we use the proportion of votes cast, which effectively removes a coefficient of 1 from both sides. Given that the measured coefficients on ln(Pop) were greater than −1, population has the expected positive overall effect on votes, but the proportion voting declines with population per se. Similarly, maximum likelihood on the expanded model displayed the expected signs on all the variables.

Inspection of the maximum likelihood results shows the probable source of OLS's difficulties. The significance of spatially lagged income, which the OLS model omits, and its correlation with income led OLS to adjust towards the negative and significant omitted variable. Relative to OLS, maximum likelihood ascribes a smaller influence to education and a larger influence to home ownership upon the propensity to vote. Naturally, the large coefficient on the lagged dependent variable suggests the existence of geographically correlated but omitted variables.
4. Inference

The second section discussed means of avoiding computation of the information matrix in a spatial setting. When the likelihood costs little to compute, likelihood ratio tests have substantial appeal. In this case, the restricted maximum likelihood estimates do not require many computational resources. We test the hypotheses that each of the basic independent variables and their spatial lags have no effect upon the regression. As we have nine total independent variables (four basic variables, four spatially lagged basic independent variables, and an intercept), the matrix R has 4 rows and 9 columns. It contains all zeros except for a one in each row in the position of the variables whose effects we wish to set to zero. The vector r is a 4 by 1 vector of zeros.

Table 3 displays the results of the likelihood ratio tests for the deletion of all spatially lagged variables, of the spatially lagged dependent variable, and of all the basic independent variables with their lags. All of the variables or combinations of the variables were statistically significant. It required only .09 seconds to compute the restricted likelihoods for the deletion of the basic independent variables with their lags using the techniques presented earlier. The likelihood associated with the OLS regression on the basic variables, coupled with the maximum likelihood obtained for all variables, gives the likelihood ratio test for the significance of all spatially lagged variables. The likelihood associated with α = 0 from the profile likelihood, coupled with the maximum likelihood obtained for all variables, gives the likelihood ratio test for the significance of the spatially lagged dependent variable, DY. These likelihoods require no additional computations but appear as a byproduct of the basic procedure and hence impose little computational cost. The significance of α, the parameter estimate on the spatially lagged dependent variable, and the significance of the other spatially lagged independent variables show the substantial contribution geographically correlated variables make to the overall fit.
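For readers implementing such tests, the statistic is simply twice the gap between the unrestricted and restricted maximized log-likelihoods, referred to a chi-squared distribution with h degrees of freedom. A minimal sketch, assuming scipy:

```python
# Sketch: likelihood ratio test from two maximized log-likelihoods.
from scipy.stats import chi2

def lr_test(loglik_full, loglik_restricted, h):
    stat = 2.0 * (loglik_full - loglik_restricted)
    return stat, chi2.sf(stat, df=h)  # statistic and p-value
```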
An Illustrative Simulation

This section examines a simulation of 22,500 regressions, each using 3107 observations. Part 1 discusses the data, part 2 gives the timings of the simulation computations, and part 3 provides the statistical results.

1. Simulation Data

To provide verisimilitude to the simulation, we chose the same spatial weighting matrix as employed in the empirical voting example. In the simulation, we:

1. Generated uniform random variables for nine columns of X and used a constant for the other column.
2. Set β to a vector of ones.
3. Let α take each value in [.01, .05, .1, .25, .5, .75, .9, .95, .99].
4. Set σ to each of [.1, .5, 1, 2, 10].
5. Generated a common set of 250 N(0,1) vectors of 3107 elements each using the Matlab normal random number generator. We perform this operation once for the entire simulation. Scaling the common N(0,1) errors by σ generates the N(0, σ²) random variables. This practice, referred to as "correlated sampling" (Rubinstein 1981), greatly reduces the variance in Monte Carlo experiments.

We subsequently generated the autocorrelated dependent variable, Y, according to (15),

(15)  Y = (I − αD)⁻¹(Xβ + σu),

where u represents an n by iter matrix of N(0,1) random variates, iter represents the number of iterations in the experiment, and Xβ is added to each column of σu. In actuality, we solved the corresponding equation system in (16) for the coefficients Z via Gaussian elimination using an LU decomposition, rather than computing the inverse, as this goes much faster,

(16)  (I − αD)Z = [Xβ  σu].
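A sketch of this generation step, assuming scipy, appears below; factoring (I − αD) once and solving for all iter + 1 right-hand sides mirrors (16). The names are ours.

```python
# Sketch: generate autocorrelated dependent variables by solving (16)
# with one sparse LU factorization instead of forming the inverse in (17).
import numpy as np
from scipy.sparse import identity
from scipy.sparse.linalg import splu

def simulate_y(D, X, beta, alpha, sigma, iters, rng):
    n = D.shape[0]
    A = (identity(n, format='csc') - alpha * D).tocsc()
    lu = splu(A)                         # factor once, reuse for every column
    U = rng.standard_normal((n, iters))  # common N(0,1) draws
    rhs = np.column_stack([X @ beta, sigma * U])  # n by (iter+1) right-hand side
    Z = lu.solve(rhs)
    return Z[:, :1] + Z[:, 1:]           # each column is one replicate of Y
```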
Compare this to the usual inverse formulation,

(17)  Z = (I − αD)⁻¹[Xβ  σu].

Relative to (16), (17) requires solving a larger system (n by n instead of n by (iter+1)) and subsequently multiplying an n by n matrix by an n by (iter+1) matrix (O(n²(iter+1))). Since iter+1 usually is much smaller than n, the inverse method takes substantially longer to yield the same results.

2. Timing Results from the Monte Carlo Experiment

Table 4 contains the results from the Monte Carlo experiment using the autoregressive dependent variable model. Table 4 contains 45 cases resulting from nine autoregressive parameter values, α, and five levels of error variability, σ. Each case contains the average of the results from five runs of 100 iterations each. The runs required 36.8 minutes in total. Thus, each maximum likelihood spatial autoregression and simulated dependent variable needed 0.1 seconds of computational time, a very low figure.

3. Statistical Results from the Monte Carlo Experiment

The results in Table 4 match some of those reported in the literature using regular lattices. First, the maximum likelihood estimator slightly underestimates the true differencing parameter, α. Second, the EGLS estimator overestimates α. Note that two versions of the EGLS estimator appear. The first one comes from (11). The second one comes from picking the value of α off the grid which minimizes the SSE. Hence, one could consider EGLS2 an inequality restricted EGLS estimator. It has the same granularity problem as the maximum likelihood estimator and hence makes a better comparison with the maximum likelihood estimator. For example, examine the cases where α equals 0.99. In these cases, both EGLS2 and ML give the same value of the square root of the mean squared error (RMSE) of 0.05. However, as the comparison between EGLS1 and EGLS2 reveals, for most cases the granularity problem does not greatly affect the results. The inequality restricted nature of EGLS2 does lead it to perform differently than EGLS1 at the endpoints. Third, the maximum likelihood estimator greatly outperforms EGLS in most cases. In fact, maximum likelihood displays a factor of 14.81 better performance in the worst case for EGLS (α = .5, σ = 10). EGLS becomes most acceptable for high α and low σ. Fourth, all of the estimators perform worse as σ rises. Each estimator reaches the nadir of its performance at various points depending upon both α and σ. For example, the maximum likelihood estimator reaches its nadir at an α of 0.1 with a σ of 10. The EGLS1 estimator reaches its nadir at an α of 0.5 with a σ of 10.

Conclusion

A variety of methods can greatly accelerate the computation of large scale spatial autoregressions. This paper explored the use of sparse matrix techniques, formulating the profile likelihood to avoid iterative computations, and using restricted least squares to form likelihood ratio tests as opposed to computing information matrices. As illustrations of the efficacy of these techniques, we looked at an empirical example at the county level and conducted a Monte Carlo experiment with 22,500 spatial autoregressions. Despite the formidable size of the regressions (3107 observations), these cost only 0.1 seconds each to generate the spatially dependent variable and estimate the coefficients.

The voting example showed the power of geographic information to help clarify social phenomena. OLS on the non-spatial variables displayed 78% higher sum-of-squared errors than maximum likelihood on the combination of spatial and non-spatial variables.
Relative to OLS, maximum likelihood ascribes a smaller influence of education and a larger influence of home ownership upon the propensity to vote. More importantly, OLS shows income has a significant, negative effect on the proportion voting while maximum likelihood shows income has a small positive but statistically insignificant effect. However, spatially lagged income has a significant, negative effect upon voting. Hopefully, additions like these to the spatial statistics toolkit will allow computations to keep pace with the ever-increasing flow of geographic information and bring spatial techniques into more routine usage.
Literature Cited

Anselin, L. (1988). Spatial Econometrics: Methods and Models. Dordrecht: Kluwer.
Anselin, L., and S. Hudak (1992). "Spatial Econometrics in Practice: A Review of Software Options." Regional Science and Urban Economics 22, 509-536.
Bureau of the Census (1994). USA Counties on CD-ROM.
Can, A. (1992). "Specification and Estimation of Hedonic Housing Price Models." Regional Science and Urban Economics 22, 453-474.
Can, A., and I. Megbolugbe (forthcoming). "Spatial Dependence and House Price Index Construction." Journal of the Real Estate Finance and Economics.
Cressie, N. (1993). Statistics for Spatial Data. Revised ed. New York: John Wiley.
Dubin, R. (1988). "Spatial Autocorrelation." Review of Economics and Statistics 70, 466-474.
George, A., and J. Liu (1981). Computer Solution of Large Sparse Positive Definite Systems. Englewood Cliffs: Prentice-Hall.
Gilley, O., and K. Pace (1995). "Using Inequality Restrictions to Tame Multicollinearity in Hedonic Pricing Models." Review of Economics and Statistics 77, 609-621.
Golub, G., and C. Van Loan (1989). Matrix Computations. Second ed. Baltimore: Johns Hopkins.
Griffith, D. (1988). "Estimating Spatial Autoregressive Model Parameters with Commercial Statistical Software." Geographical Analysis 20, 176-186.
Griffith, D. (1995). "Some Guidelines for Specifying the Geographic Weights Matrix Contained in Spatial Statistical Models," in: Arlinghaus, S., ed. Practical Handbook of Spatial Statistics. Boca Raton: CRC Press, 65-82.
Haining, R. (1990). Spatial Data Analysis in the Social and Environmental Sciences. Cambridge: Cambridge University Press.
Judge, G., et al. (1988). Introduction to the Theory and Practice of Econometrics. New York: John Wiley.
Li, B. (1995). "Implementing Spatial Statistics on Parallel Computers," in: Arlinghaus, S., ed. Practical Handbook of Spatial Statistics. Boca Raton: CRC Press, 107-148.
Meeker, W., and L. Escobar (1995). "Teaching about Approximate Confidence Regions Based on Maximum Likelihood Estimates." The American Statistician 49, 48-53.
Ord, K. (1975). "Estimation Methods for Models of Spatial Interaction." Journal of the American Statistical Association 70, 120-126.
Pace, K., and R. Barry (forthcoming). "Sparse Spatial Autoregressions." Statistics and Probability Letters.
Pace, K., and O. Gilley (forthcoming). "Using the Spatial Configuration of the Data to Improve Estimation." Journal of the Real Estate Finance and Economics.
Pace, K., and O. Gilley (1993). "Improving Prediction and Assessing Specification Quality in Non-Linear Statistical Valuation Models." Journal of Business and Economic Statistics 11, 301-310.
Ripley, B. (1981). Spatial Statistics. New York: John Wiley.
Ripley, B. (1988). Statistical Inference for Spatial Processes. Cambridge: Cambridge University Press.
Rubinstein, R. (1981). Simulation and the Monte Carlo Method. New York: John Wiley.
Saad, Y. (1996). Iterative Methods for Sparse Linear Systems. Boston: PWS Publishing.
Whittle, P. (1954). "On Stationary Processes in the Plane." Biometrika 41, 434-449.