Lecture 4: Model Comparison

James P. LeSage
University of Toledo, Department of Economics
Toledo, OH 43606
[email protected]
March, 2004

We consider a class of spatial regression models introduced in Ord (1975) and elaborated in Anselin (1988), shown in (1). The sample observations in these models represent regions located in space, for example counties, states, or countries.

    y = ρW y + Xβ + u
    u = λV u + ε                                        (1)

where X denotes an n by k matrix of explanatory variables as in ordinary least-squares regression, β is an associated k by 1 parameter vector, and ε is an n by 1 disturbance vector, which we assume takes the form ε ∼ N(0, σ²In). The n by n matrices W and V specify the structure of spatial dependence between observations (y) or disturbances (u), with a common specification having elements Wij > 0 for observations j = 1, ..., n sufficiently close (as measured by some distance metric) to observation i. As noted above, observations reflect geographical regions, so distances might be calculated on the basis of the centroids of the regions/observations. The expressions W y and V u produce vectors that are often called spatial lags, and ρ and λ denote scalar parameters to be estimated along with β and σ². Non-zero values for the scalars ρ and λ indicate that the spatial lags exert an influence on y or u.
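To fix ideas, here is a minimal sketch (in Python/NumPy rather than the MATLAB toolbox discussed later) that generates data from the model in (1). The 5-nearest-neighbor W built from random centroids, the choice V = W, and all parameter values are illustrative assumptions, not anything specified in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
coords = rng.uniform(size=(n, 2))               # random region centroids

# W: row-normalized 5-nearest-neighbor weights from centroid distances
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
np.fill_diagonal(d, np.inf)
W = np.zeros((n, n))
W[np.arange(n)[:, None], np.argsort(d, axis=1)[:, :5]] = 1.0
W /= W.sum(axis=1, keepdims=True)
V = W.copy()                                    # same weights for disturbances (assumption)

rho, lam, sigma = 0.6, 0.3, 0.5
beta = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, k))
eps = rng.normal(scale=sigma, size=n)

# reduced form: u = (I - lam V)^{-1} eps,  y = (I - rho W)^{-1} (X beta + u)
u = np.linalg.solve(np.eye(n) - lam * V, eps)
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + u)
```

Row-normalizing W keeps its spectral radius at one, so (I − ρW) is nonsingular for the |ρ| < 1 values considered throughout these slides.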


Comparing alternative models

We wish to compare:

1) Model specifications, e.g., y = ρW y + Xβ + ε vs. y = Xβ + u, u = ρW u + ε. Selection of the appropriate model (SAR, SEM, SDM or SAC) has been the subject of numerous Monte Carlo studies that examine alternative systematic or sequential specification search approaches; Florax, Folmer and Rey (2003) provide a review of this literature. All of these approaches rely on maximum likelihood estimation in conjunction with conventional specification tests such as the Lagrange multiplier (LM) or likelihood ratio (LR) tests.

2) Weight matrix specifications, e.g., W based on contiguity vs. W based on nearest neighbors, distance, or a parameterized form. Parameterizing the weight matrix causes a problem for maximum likelihood methods: the likelihood becomes ill-defined when the spatial dependence parameter is zero. This poses no problem for Bayesian methods.

3) Explanatory variables, e.g., X1, X2, X3 vs. X1, X3. Only Bayesian methods offer the potential for a comprehensive solution here.


Current state of parameter estimation and inference in spatial econometrics

There are a lot of good methods/tools, each with their strengths and weaknesses.

• Likelihood
  – Strengths: inference regarding parameter dispersion is theoretically sound; strong connection to economic theory of production, utility, spillovers; imposes the theoretical restriction on the spatial dependence parameter
  – Weaknesses: slow; hard to code for large problems; not robust to non-normal error distributions; inference regarding dispersion is difficult in practice
• GMM
  – Strengths: fast; easy to code; robust to error distribution; theoretically sound; strong connection to economic theory of production, utility, spillovers
  – Weaknesses: doesn't impose the theoretical restriction on the spatial dependence parameter; inference regarding dispersion is an unsettled issue


• Semi-parametric
  – Strengths: robust with respect to many possible problems (e.g., many error distributions, model specification problems); good for prediction
  – Weaknesses: throws away the parsimonious (spatial autoregressive) structure; data requirements and tuning parameters make it harder to implement in small samples; inference difficult; weak connection to economic theory of production, utility, spillovers
• Bayesian
  – Strengths: inference regarding parameter dispersion (both theoretical and applied); strong connection to economic theory of production, utility, spillovers; imposes the theoretical restriction on the spatial dependence parameter; works for binary, truncated, missing, and multinomial dependent variables; allows parameterized spatial weight matrices
  – Weaknesses: slow; hard to code


Points of failure for non-Bayesian methods

• Likelihood fails for parameterized weight matrices because the likelihood ratio is ill-defined at ρ = 0.
• Likelihood requires sequential testing for comparison of model specifications, and reliance on a host of Monte Carlo evidence. It boils down to parameter inference on a nested model structure (changing the weight matrix or explanatory variables destroys nesting).
• GMM is not well developed in this area. It also boils down to parameter inference on a nested model structure (changing the weight matrix or explanatory variables destroys nesting).
• Semi-parametric methods do not wish to participate in this issue: they do not posit a true data generating process, relying instead on flexible functional forms, and they throw away parsimonious model structures/specifications that can be derived from economic theory.


Current state of Bayesian model comparison in spatial econometrics

• Things that are currently available in my MATLAB spatial econometrics toolbox, or will be available soon.
• The focus here is only on spatial autoregressive/spatial error models (ignoring other spatial estimation functions). Some of this is based on recent work with Olivier Parent.

    y = ρW y + αι + Xβ + u
    u = λDu + ε
    ε ∼ N(0, σ²V)
    V = diag(v1, v2, ..., vn)

• We need to rely on priors (π) that are neither too informative nor too diffuse, to avoid Lindley's (1957) paradox.


Priors

• For β, Zellner's g-prior:

    πb(β|σ²) ∼ N[0, σ²(gi X'Mi XMi)^{−1}]              (2)

Fernandez, Ley and Steel (2001a, 2001b) provide a theoretical justification for use of the g-prior as well as Monte Carlo evidence comparing nine alternative approaches to setting the hyperparameter g. They recommend setting gi = 1/max{n, k²Mi} for the case of least-squares based MC³ methodology.
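The recommended setting can be written as a one-line helper; the function name and the example values below are ours, purely for illustration.

```python
def fls_g(n, k_model):
    """Fernandez, Ley and Steel benchmark: g_i = 1 / max(n, k_Mi^2)."""
    return 1.0 / max(n, k_model ** 2)

g_small = fls_g(3107, 4)    # sample size dominates: 1/3107
g_large = fls_g(100, 15)    # squared model size dominates: 1/225
```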

• For α, a diffuse prior.
• For σ², a Gamma prior (or diffuse where ν = d = 0):

    πs(σ²) ∼ G(ν, d)                                    (3)

• For ρ, λ, either a uniform prior on the interval [−1, 1] or a type of Beta(a, a) distribution centered on zero:

    πr(ρ) = U[−1, 1]
    πr(ρ) = (1/Be(a, a)) · (1 + ρ)^{a−1}(1 − ρ)^{a−1} / 2^{2a−1}        (4)


[Figure: priors for ρ, λ — Beta(a, a)-type densities on [−1, 1] for a = 1.01, 1.1 and 2.0; larger values of a place more prior mass near zero.]


Log marginal posteriors

• Univariate integration with respect to ρ, λ for SAR and SEM models (this problem solved):

    SAR: y = ρW y + αι + Xβ + ε
    SEM: y = αι + Xβ + u
         u = λDu + ε
         ε ∼ N(0, σ²In)

• Bivariate integration for the general model (this problem NOT yet solved):

    y = ρW y + αι + Xβ + u
    u = λDu + ε
    ε ∼ N(0, σ²In)

• An MCMC solution is needed for heteroscedastic models (this problem solved):

    ε ∼ N(0, σ²V)
    V = diag(v1, v2, ..., vn)

Nature of the integration problem

• SAR model — we can analytically integrate out β, σ, arriving at a log marginal posterior:

    p(ρ|y) = K2 (g/(1+g))^{k/2} ∫ |In − ρW| [νs̄² + S(ρ) + Q(ρ)]^{−(n+ν)/2} πr(ρ) dρ

    K2   = Γ((n+ν)/2) Γ(ν/2)^{−1} (νs̄²)^{ν/2} π^{−n/2}
    Q(ρ) = (g/(g+1)) β̂(ρ)' X'X β̂(ρ)

• |In − ρW|, S(ρ), Q(ρ) and πr(ρ) can be vectorized over a grid of ρ values, making integration easy.
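The vectorization idea can be sketched as follows in Python; this is an illustrative reimplementation, not toolbox code. It uses the identity log|In − ρW| = Σi log(1 − ρλi) over the eigenvalues of W (computed once, reused for every grid point), and it assumes S(ρ) is the residual sum of squares of the spatially filtered regression, a term the slides leave undefined.

```python
import numpy as np

def sar_log_marginal_grid(y, X, W, nu=4.0, s2bar=1.0, g=None, grid=None):
    """Unnormalized log-marginal of rho on a grid (illustrative sketch)."""
    n, k = X.shape
    g = 1.0 / max(n, k ** 2) if g is None else g
    grid = np.linspace(-0.99, 0.99, 199) if grid is None else grid
    evals = np.linalg.eigvals(W).astype(complex)   # one-off cost, reused per rho
    XtX = X.T @ X
    out = np.empty(len(grid))
    for i, rho in enumerate(grid):
        logdet = np.sum(np.log(1.0 - rho * evals)).real   # log|I - rho W|
        yf = y - rho * (W @ y)                            # spatially filtered y
        beta_hat = np.linalg.solve(XtX, X.T @ yf)
        resid = yf - X @ beta_hat
        S = float(resid @ resid)                          # assumed S(rho)
        Q = (g / (g + 1.0)) * float(beta_hat @ XtX @ beta_hat)
        out[i] = logdet - 0.5 * (n + nu) * np.log(nu * s2bar + S + Q)
    return grid, out

# tiny illustrative run on simulated data
rng = np.random.default_rng(0)
n = 60
W = rng.uniform(size=(n, n))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)                 # row-stochastic
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = np.linalg.solve(np.eye(n) - 0.5 * W,
                    X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n))
grid, lm = sar_log_marginal_grid(y, X, W)
```

Everything inside the loop is cheap once the eigenvalues and cross-product matrices are in hand, which is why the grid evaluation in the timing tables later is so fast.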




• SEM model — let y* = y − λDy, X* = X − λDX, and πb(β|σ²) ∼ N[0, σ²C], C = (g X*'Mi X*Mi)^{−1}. We can analytically integrate out β, σ, arriving at a log marginal posterior:

    p(λ|y) = Γ((n−1)/2) π^{−n/2} ∫ ( |C*| / |X*'X* + C*| )^{1/2} [S(λ) + Q]^{−(n−1)/2} |In − λD| πr(λ) dλ        (5)

    S(λ)  = (g/(1+g)) e*(λ)' e*(λ)
    e*(λ) = y* − X*β̂ − α̂ι
    Q     = (1/(1+g)) (y* − ȳ*ι)'(y* − ȳ*ι)
    β̂    = (X*'X*)^{−1} X*'y*
    α̂    = ȳ − λ·(Dy)̄,  where (Dy)̄ = (1/n) Σi (Dy)i

– |In − λD|, S(λ), Q and πr(λ) can be vectorized over a grid of λ values, but this is more computationally intensive.


Operational/implementation issues

For the case of a finite # of homoscedastic SAR, SEM models:

• There is no need to estimate β, σ to do model comparison.
• There is a need to store the vectorized log-marginal posterior for scaling reasons. A vector of 2,000 values seems sufficient for each model being compared.
• Using the log-determinant approximation methods of Pace and Barry (1999) and vectorization, problems involving reasonably large samples pose no difficulty.
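The scaling issue behind storing the log-marginal vectors can be handled with a log-sum-exp style shift: subtract the overall maximum before exponentiating, then integrate and normalize. This is an illustrative sketch, not toolbox code; the two synthetic vectors at the bottom are invented inputs.

```python
import numpy as np

def model_probs(log_marginals):
    """log_marginals: list of 1-D arrays, one grid vector per model."""
    stacked = [np.asarray(v, dtype=float) for v in log_marginals]
    shift = max(v.max() for v in stacked)            # common scale factor
    weights = np.array([np.exp(v - shift).sum() for v in stacked])
    return weights / weights.sum()

# two made-up 2,000-point log-marginal vectors; the flatter one carries more mass
probs = model_probs([-np.arange(2000.0) / 500.0,
                     -np.arange(2000.0) / 100.0])
```

Without the shift, exponentiating raw log-marginals for large n would under- or overflow, which is exactly why the vectors must be stored and scaled jointly.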


• An example for SAR models, y = ρW y + αι + Xβ + ε, ε ∼ N(0, σ²In) (times in seconds):

    # of observations   time for log-det   time for log-marginal   total time
    49                  0.0400             0.0250                  0.1000
    506                 0.1205             0.0300                  0.1805
    3,107               1.3965             0.0855                  1.4725
    24,473              10.6905            0.6160                  12.2275

• An example for SEM models, y = αι + Xβ + (In − λD)^{−1}ε, ε ∼ N(0, σ²In), with a grid of 0.001 over lambda:

    # of observations   time for log-det   time for log-marginal   total time
    49                  0.0400             0.1000                  0.1755
    506                 0.1205             0.8915                  1.0365
    3,107               1.3965             1.1570                  2.4535
    24,473              10.6905            49.1955                 59.8110

With a grid of 0.01 over lambda and spline interpolation:

    # of observations   time for log-det   time for log-marginal   total time
    49                  0.0400             0.0200                  0.0950
    506                 0.1205             0.1050                  0.2550
    3,107               1.3965             0.1255                  1.4870
    24,473              10.6905            5.3070                  17.0190
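The spline trick in the last table can be sketched as follows: compute the log-determinant exactly on the coarse 0.01 grid, then interpolate onto the fine 0.001 grid. The sparse random D, the grid endpoints, and SciPy's CubicSpline (standing in for whatever interpolant the toolbox uses) are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(1)
n = 200
# sparse nonnegative D, row-normalized (rows with no neighbors stay zero)
D = rng.uniform(size=(n, n)) * (rng.uniform(size=(n, n)) < 0.05)
np.fill_diagonal(D, 0.0)
row_sums = D.sum(axis=1, keepdims=True)
D = np.divide(D, row_sums, out=np.zeros_like(D), where=row_sums > 0)

evals = np.linalg.eigvals(D).astype(complex)     # one-off eigenvalue cost
coarse = np.linspace(-0.99, 0.99, 199)           # 0.01 grid
exact_coarse = np.array([np.sum(np.log(1 - lam * evals)).real for lam in coarse])

fine = np.linspace(-0.99, 0.99, 1981)            # 0.001 grid
spline = CubicSpline(coarse, exact_coarse)
approx_fine = spline(fine)
```

Because the log-determinant is a smooth function of λ away from the boundary, the interpolation error is tiny relative to the tenfold reduction in exact evaluations.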


Performance: n, signal/noise, spatial dependence (* denotes the true model)

n = 49, r-squared = 0.9 (average over 30 trials)

    Models\rho   -0.5   -0.25   -0.10   0      0.10   0.25   0.50
    Pr(SAR)*     0.98   0.76    0.44    0.32   0.42   0.71   0.98
    Pr(SEM)      0.02   0.24    0.56    0.68   0.58   0.29   0.02

    Models\lam   -0.5   -0.25   -0.10   0      0.10   0.25   0.50
    Pr(SAR)      0.14   0.20    0.31    0.29   0.29   0.38   0.17
    Pr(SEM)*     0.86   0.80    0.69    0.71   0.71   0.62   0.83

n = 506, r-squared = 0.9

    Models\rho   -0.5   -0.25   -0.10   0      0.10   0.25   0.50
    Pr(SAR)*     1.00   0.99    0.17    0.33   0.67   1.00   1.00
    Pr(SEM)      0.00   0.01    0.83    0.67   0.33   0.00   0.00

    Models\lam   -0.5   -0.25   -0.10   0      0.10   0.25   0.50
    Pr(SAR)      0.32   0.21    0.11    0.30   0.01   0.00   0.00
    Pr(SEM)*     0.68   0.79    0.89    0.70   0.99   1.00   1.00

n = 3107, r-squared = 0.9

    Models\rho   -0.5   -0.25   -0.10   0      0.10   0.25   0.50
    Pr(SAR)*     1.00   1.00    1.00    0.28   1.00   1.00   1.00
    Pr(SEM)      0.00   0.00    0.00    0.72   0.00   0.00   0.00

    Models\lam   -0.5   -0.25   -0.10   0      0.10   0.25   0.50
    Pr(SAR)      0.00   0.00    0.06    0.27   0.02   0.00   0.00
    Pr(SEM)*     1.00   1.00    0.94    0.73   0.98   1.00   1.00


Weight matrix comparisons: n = 506, SAR model (averaged over 30 trials)

    Models\rho   -0.5   -0.25   -0.10   0      0.10   0.25   0.50
    Pr(W1)       0.00   0.00    0.01    0.11   0.00   0.00   0.00
    Pr(W2)       0.00   0.00    0.07    0.12   0.08   0.00   0.00
    Pr(W3)*      1.00   1.00    0.73    0.16   0.74   1.00   1.00
    Pr(W4)       0.00   0.00    0.14    0.19   0.10   0.00   0.00
    Pr(W5)       0.00   0.00    0.04    0.20   0.05   0.00   0.00
    Pr(W6)       0.00   0.00    0.01    0.22   0.03   0.00   0.00


For the case of a finite # of homoscedastic general (SAC) models

• Bivariate integration would require a 2,000 by 2,000 grid, i.e., 4,000,000 double precision numbers.
• A bivariate grid over ρ, λ is required for the log-determinant terms |In − ρW| and |In − λD|.
• A smaller grid with spline interpolation may be possible.
• I'm close to an MCMC solution: univariate integration over ρ conditional on λ, and univariate integration over λ conditional on ρ, then take an MCMC average of the log-marginal posterior vectors.


For the case of a finite # of heteroscedastic SAR, SEM models

• An MCMC solution needs to be used.
• One advantage of this approach is that the log-marginal posterior comes (almost) for free as a part of MCMC estimation of these models.
• On every trip through the MCMC sampler, evaluate the log-marginal posterior for the current values of β, σ, ρ (λ) and V = diag(v1, v2, ..., vn), then take the MCMC average.
• An example for heteroscedastic SAR models, 2,500 draws with the first 500 excluded for burn-in (times in seconds):

    # of observations   time for log-det   time for log-marginal   total time
    49                  0.0350             55.4300                 55.5750
    506                 0.1850             89.4940                 89.7490
    3,107               0.8460             270.2480                271.1895
    24,473              10.8355            2118.9620               2130.4235 (35 minutes)


MC³ and BMA

For the case of an infinite # of homoscedastic SAR, SEM models

– A large literature exists on Bayesian model averaging over alternative linear regression models containing differing explanatory variables (Raftery, Madigan and Hoeting, 1997; Fernandez, Ley and Steel, 2001a).
– We introduce SAR and SEM model estimation when uncertainty exists regarding the choice of regressors. The Markov Chain Monte Carlo model composition (MC³) approach introduced in Madigan and York (1995) is set forth here for the SAR and SEM models.
– For a regression model with an intercept and k possible explanatory variables, there are 2^k possible ways to select regressors to be included or excluded from the model. For k = 15, say, we have 32,768 possible models, ruling out computation of the log-marginal for all possible models as infeasible.
– The MC³ method of Madigan and York (1995) devises a strategic stochastic process that can move through the potentially large model space and sample regions of high posterior support. This eliminates the need to consider all models by constructing a sample from relevant parts of the model space, while ignoring irrelevant models.
– Specifically, they construct a Markov chain M(i), i = 1, 2, ..., with state space ℵ that has an equilibrium distribution

p(Mi|y), where p(Mi|y) denotes the posterior probability of model Mi based on the data y.
– The Markov chain is based on a neighborhood nbd(M) for each M ∈ ℵ, which consists of the model M itself along with models containing either one variable more or one variable less than M. The addition of an explanatory variable to the model is often labelled a 'birth process', whereas deleting a variable from the set of explanatory variables is called a 'death process'.
– A transition matrix q is defined by setting q(M → M′) = 0 for all M′ ∉ nbd(M) and q(M → M′) constant for all M′ ∈ nbd(M). If the chain is currently in state M, we proceed by drawing M′ from q(M → M′). This new model is then accepted with probability:



    min[ 1, p(M′|y)/p(M|y) ] = min[ 1, O_{M′,M} ]        (6)

– We note that the computational ease of constructing posterior model probabilities, or Bayes factors for the case of equal prior probabilities assigned to all candidate models, allows us to easily construct a Metropolis-Hastings sampling scheme that implements the MC³ method.
– A vector of the log-marginal values for the current model M is stored during sampling, along with a vector for the proposed model M′. These are then scaled and integrated to produce O_{M′,M}, which is used in (6) to decide whether to accept the new model or stay with the current model.
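The whole MC³ loop can be sketched as follows. Here log_marginal() is a stand-in scoring function (the real method integrates the scaled SAR/SEM log-marginal vectors as just described), and the "true" model and all constants are invented for illustration; the proposal always flips one variable, with the "stay" option arising through rejection.

```python
import numpy as np

rng = np.random.default_rng(4)
k = 7
true = np.array([1, 1, 1, 0, 0, 0, 0], dtype=bool)   # pretend first 3 regressors matter

def log_marginal(model):
    # stand-in: reward overlap with the "true" model, penalize model size
    return 5.0 * float(np.sum(model & true)) - 2.0 * float(model.sum())

model = np.ones(k, dtype=bool)
counts = {}
for _ in range(5000):
    proposal = model.copy()
    j = rng.integers(k)
    proposal[j] = ~proposal[j]                       # birth or death move
    # Metropolis-Hastings: accept with prob min(1, posterior odds)
    if np.log(rng.uniform()) < log_marginal(proposal) - log_marginal(model):
        model = proposal
    key = tuple(int(b) for b in model)
    counts[key] = counts.get(key, 0) + 1

best = max(counts, key=counts.get)                   # most visited model
```

Visit frequencies over the chain estimate the posterior model probabilities, so only the models in the high-probability region are ever scored.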


An example for SAR models

– Generated SAR model: y = ρW y + Xβ + ε, ε ∼ N(0, σ²In), with X = [X1, X2, X3]

    # of unique models found = 141
    # of models with probs > 0.001 = 26
    # of MCMC draws = 10000

    variables   x1     x2     x3     x1out  x2out  x3out  x4out  mprobs
    model 1     1      1      1      1      0      1      0      0.0120
    model 2     1      1      1      0      1      0      1      0.0124
    model 3     1      1      1      1      0      0      1      0.0129
    model 4     1      1      1      1      0      0      0      0.0150
    model 5     1      1      1      0      0      1      0      0.0683
    model 6     1      1      1      0      0      0      0      0.0722
    model 7     1      1      1      0      1      0      0      0.0736
    model 8     1      1      1      0      0      0      1      0.0773
    model 9     1      1      1      1      0      0      0      0.0850
    model 10    1      1      1      0      0      0      0      0.4839
    freqs       26     26     26     11     11     11     11
    vprobs      1.000  1.000  1.000  0.423  0.423  0.423  0.423


SAR model: y = ρW y + Xβ + ε, ε ∼ N(0, σ²In)

OLS BMA model information (48 seconds, 20,000 draws)

    # of unique models found = 61
    # of models with prob > 0.001 = 21

    variables   x1     x2     x3     x1out  x2out  x3out  mprobs
    model 1     1      0      0      1      0      0      0.0127
    model 2     1      0      0      0      1      0      0.0142
    model 3     1      0      1      1      0      0      0.0180
    model 4     1      0      1      0      0      1      0.0181
    model 5     1      0      1      0      1      0      0.0237
    model 6     1      1      0      0      0      0      0.0938
    model 7*    1      1      1      0      0      0      0.1207
    model 8     1      0      0      0      0      0      0.2557
    model 9     1      0      1      0      0      0      0.3805
    freqs       18     9      12     5      6      5
    vprobs      0.857  0.429  0.571  0.238  0.286  0.238

SAR BMA model information (300 seconds, 20,000 draws)

    # of unique models found = 48
    # of models with prob > 0.001 = 14

    variables   x1     x2     x3     x1out  x2out  x3out  mprobs
    model 1     1      1      0      0      0      0      0.0052
    model 2     1      0      1      1      0      0      0.0123
    model 3     1      0      1      0      0      1      0.0130
    model 4     1      0      1      0      1      0      0.0178
    model 5     1      1      1      1      0      0      0.0279
    model 6     1      1      1      0      0      1      0.0324
    model 7     1      1      1      0      1      0      0.0374
    model 8     1      0      1      0      0      0      0.2562
    model 9*    1      1      1      0      0      0      0.5836
    freqs       13     9      12     4      4      4
    vprobs      0.929  0.643  0.857  0.286  0.286  0.286

Model Averaging

– For the election dataset we compare the MC³ methodology to posterior model probabilities for two SAR models: one based on actual explanatory variables (a constant term, education, homeownership and household income) as the X matrix, and another based on this correct X matrix plus 3 random normal vectors. Of course, these bogus explanatory variables should not appear in the high posterior probability models identified by the MC³ estimation methodology.
– We first compared these two models by producing posterior model probabilities for each, and the probability associated with the true model without the bogus explanatory variables was 1.0. An alternative would be to consider the set of all 2^k possible models which arise from a set of k = 7 explanatory variables. Since k = 7, there are 2^7 = 128 possible models, so we could compute posterior model probabilities for each of these in this small example. Our next example, based on the cross-country growth regressions, illustrates the difficulty of taking this approach in general, since there k = 16, so 2^k = 65,536 possible models.


– We follow Fernandez et al. (2001a) and use Zellner's g-prior with the value of g set to 1000·σ̂², where σ̂² represents an estimate from the sar_g model with all variables included in the X-matrix. The posterior mean estimates from the sar_g model with all variables included in the X-matrix are shown below, along with the prior standard deviations for each of the β coefficients. If one considers an interval of ±3σ around a prior mean of zero for the β parameters used by the Zellner g-prior, these prior standard deviations seem loose enough to allow the sample data to determine the resulting estimates.

    bhat      prior std
    0.6263    1.2724
    0.2196    0.4180
    0.4819    0.4559
    -0.0989   0.4946
    0.0004    0.0670
    -0.0009   0.0666
    -0.0000   0.0668


Model averaging information

    Model          const  educ  homeo  income  x1-bog  x2-bog  x3-bog  probs
    model 1        1      0     1      1       0       1       0       0.0108
    model 2        1      0     1      1       1       0       0       0.0126
    model 3        1      1     1      1       1       1       1       0.0130
    model 4        1      0     1      1       0       0       1       0.0177
    model 5        1      0     1      0       0       0       0       0.0247
    model 6        1      1     1      1       1       1       0       0.0290
    model 7        1      0     1      1       0       0       0       0.0417
    model 8        1      1     1      1       0       1       1       0.0441
    model 9        1      1     1      1       1       0       1       0.0498
    model 10       1      1     1      1       0       1       0       0.0985
    model 11       1      1     1      1       1       0       0       0.1114
    model 12       1      1     1      1       0       0       1       0.1693
    model 13       1      1     1      1       0       0       0       0.3774
    #Occurrences   51     35    51     44      34      29      37      1.0000

    Bayesian spatial autoregressive model
    Dependent Variable   = voters
    R-squared            = 0.4422
    Rbar-squared         = 0.4417
    mean of sige draws   = 0.0138
    Nobs, Nvars          = 3107, 4
    ndraws, nomit        = 2500, 500
    total time in secs   = 18.3560
    time for lndet       = 1.7220
    time for sampling    = 16.4140
    Pace and Barry, 1999 MC lndet approximation used
    order for MC appr    = 50
    iter for MC appr     = 30
    numerical integration used for rho
    min and max rho      = -1.0000, 1.0000
    ***************************************************************
    Posterior Estimates
    Variable      Coefficient   Std Deviation   p-level
    const         0.626298      0.041466        0.000000
    educ          0.220239      0.015633        0.000000
    homeowners    0.481845      0.014338        0.000000
    income        -0.098988     0.016475        0.000000
    rho           0.588173      0.015693        0.000000

    Bayesian spatial autoregressive model (homoscedastic version)
    Dependent Variable   = voters
    R-squared            = 0.4421
    Rbar-squared         = 0.4410
    mean of sige draws   = 0.0138
    Nobs, Nvars          = 3107, 7
    ndraws, nomit        = 2500, 500
    total time in secs   = 19.0880
    time for lndet       = 1.8530
    time for sampling    = 17.1150
    Pace and Barry, 1999 MC lndet approximation used
    order for MC appr    = 50
    iter for MC appr     = 30
    numerical integration used for rho
    min and max rho      = -1.0000, 1.0000
    ***************************************************************
    Posterior Estimates
    Variable      Coefficient   Std Deviation   p-level
    const         0.626294      0.042451        0.000000
    educ          0.219571      0.016212        0.000000
    homeowners    0.481878      0.014427        0.000000
    income        -0.098920     0.016697        0.000000
    x1-bogus      0.000418      0.002156        0.412500
    x2-bogus      -0.000900     0.002071        0.333000
    x3-bogus      -0.000043     0.002145        0.493500
    rho           0.589170      0.015343        0.000000


    SAR Bayesian Model Averaging Estimates
    Dependent Variable   = voters
    R-squared            = 0.4338
    sigma^2              = 0.0139
    # unique models      = 72
    Nobs, Nvars          = 3107, 7
    ndraws for BMA       = 5000
    ndraws for estimates = 1200
    nomit for estimates  = 200
    time for lndet       = 1.9130
    time for BMA sampling= 75.0280
    time for estimates   = 105.7220
    Pace and Barry, 1999 MC lndet approximation used
    order for MC appr    = 50
    iter for MC appr     = 30
    min and max rho      = -1.0000, 1.0000
    ***************************************************************
    Variable      Prior Mean   Std Deviation
    const         0.000000     1.272378
    educ          0.000000     0.417961
    homeowners    0.000000     0.455872
    income        0.000000     0.494618
    x1-bogus      0.000000     0.067004
    x2-bogus      0.000000     0.066649
    x3-bogus      0.000000     0.066815
    ***************************************************************
    Posterior Estimates
    Variable      Coefficient   Std Deviation   p-level
    const         0.582201      0.018759        0.000000
    educ          0.195921      0.006934        0.000000
    homeowners    0.483636      0.006542        0.000000
    income        -0.082365     0.007474        0.000000
    x1-bogus      0.000074      0.000273        0.399000
    x2-bogus      -0.000205     0.000243        0.192000
    x3-bogus      0.000008      0.000407        0.508000
    rho           0.600754      0.006940        0.000000


Conclusions

• Work that is done:
  – For finite homoscedastic models, involving alternative W matrices, SAR vs. SEM model specification
  – For infinite homoscedastic models, involving SAR, SEM models, MC³, BMA over alternative X's
  – For finite heteroscedastic models, involving alternative W matrices, SAR vs. SEM model specification
• Work to be done:
  – For infinite heteroscedastic models, involving SAR, SEM models, MC³, BMA over alternative X's
  – MC³, BMA for the case of SAR, SEM, alternative W matrices and alternative X's
  – More general models:

        y = ρW y + αι + Xβ + u
        u = λDu + ε
        ε ∼ N(0, σ²In)

• I am trying to produce a paper/manual that describes the theory and use of my toolbox functions for model comparison purposes.
