Discrete Unobserved Heterogeneity in Discrete Choice Panel Data Models∗

Student: Raffaele Saggio†
CEMFI, Madrid

Advisor: Stéphane Bonhomme‡
CEMFI, Madrid

June 19, 2012

Abstract

By restricting the support of the unobserved heterogeneity and allowing cross-sectional units to be classified into a finite number of classes, we provide a non-linear panel data estimator that leaves the relationship between unobservables and observables unrestricted and is able to use the whole sample in estimating the effects of interest. The latter represents an important improvement over typical maximum likelihood fixed effects models, where the parameters are identified only through the sample of movers. Using results from Hahn and Moon (2010), we show that this non-linear group fixed effects estimator is consistent as both N and T go to infinity under correct specification. It is also higher-order unbiased compared to standard non-linear panel data estimators. We apply this new estimator in several empirical applications. The results suggest that the non-linear group fixed effects estimator can be considered a reliable solution for dealing with unobserved heterogeneity in a flexible yet parsimonious way.

JEL Classification: C23
Keywords: Discrete Heterogeneity; Fixed Effects; Non-Linear Panel Data



∗ I would like to thank my supervisor, Stéphane Bonhomme, for his patience, his excellent advice and his constant support during the elaboration of my final thesis. Special thanks to Elena Manresa for her precious comments and constant support. I would also like to thank Manuel Arellano, Pedro Mira and Enrique Sentana for giving me the opportunity to share my ideas with them and for their useful suggestions during my presentations at CEMFI. Special thanks also to Jesus Carro for providing me with the data on labour force participation. All errors are my own.
† E-mail: [email protected]
‡ E-mail: [email protected]

1 Introduction

Controlling for unobserved heterogeneity when using micro-datasets is a crucial requirement for reliable inference. In this respect, the use of panel data has allowed economists to specify rich patterns of heterogeneity (e.g. Browning and Carro, 2007). However, applied researchers are also aware of the trade-off between allowing for rich patterns of heterogeneity and doing so in a parsimonious way. A typical example along this line is the linear fixed effects model in the presence of low longitudinal variation in the covariates (Hahn et al., 2011). This trade-off is magnified in non-linear panel data models. Indeed, in a recent survey of non-linear panel data analysis, Arellano and Bonhomme (2011) write: “The empirical success of panel data is mostly confined to linear models and special nonlinearities [...] The analysis of nonlinear panel data models remains a challenge for econometricians”.

To understand the difficulties that arise in a non-linear panel data context, consider the empirical investigation of Persson and Tabellini (2009), who analyze the dynamics of political change. Their main empirical result is that democratic capital, measured by a nation’s historical experience with democracy and the incidence of democracy in its neighborhood, appears to raise (decrease) exit rates from autocracy (democracy). However, once they allow for unobserved fixed heterogeneity, the logit fixed effects model delivers somewhat different results: the coefficient of domestic democratic capital has the opposite (and significant) sign. The problem is that the logit fixed effects model uses only a relatively small fraction of the sample. The dropped observations correspond to the countries that have experienced long-lived regimes of autocracy or democracy. In practice, this means that more than 60% of the observations are dropped when estimating the exit rates from democracy (and more than 30% for the exit from autocracy).

Persson and Tabellini’s problem is therefore two-fold. On the one hand, Persson and Tabellini (2009) want to allow for unobserved fixed heterogeneity so as to reduce the incidence of endogeneity issues. The concern here is a familiar one: state dependence versus unobserved heterogeneity. Unobserved historical factors influencing both political and economic development are likely to be correlated with democratic capital and its accumulation over the years for a given country. If unobserved heterogeneity is not included in the econometric specification, domestic democratic capital may therefore capture unobserved heterogeneity rather than state dependence. This calls for a methodology that allows for a country-specific, time-invariant effect that is correlated with the regressors, as in the logit fixed effects model.

On the other hand, Persson and Tabellini (2009) know that the logit fixed effects model will automatically drop more than half of the original sample when estimating the effects of interest. This raises additional concerns, not only from a statistician’s perspective (we lose efficiency when dropping more than half of our sample) but also, and more importantly, from an economist’s perspective. Using the logit fixed effects specification, the marginal effect of democratic capital is identified only from a special and relatively small fraction of countries. In particular, we drop from estimation the countries with long-lived regimes, which are essential to fully capture the effects of domestic democratic capital on non-exit. Ultimately, the switch in sign that we observe between the logit fixed effects estimates and the estimates without fixed effects may indeed be attributed to the fact that the logit fixed effects model uses a smaller and selected sample that excludes all the countries with long-lived regimes.

The objective of this thesis is to lay down the basics for an estimator that could efficiently rebalance the components of the trade-off that we have just characterized. Our starting point is to follow the approach of Bonhomme and Manresa (2012) and restrict the support of unobserved heterogeneity, thereby assuming that individual units belong to a small number of groups.
We show how to construct a non-linear group fixed effects estimator (NLGFE) that has the following three fundamental features: (i) it estimates the parameters of interest leaving the relationship between the covariates and the discrete components of unobserved heterogeneity unrestricted, thereby following the typical fixed effects approach; (ii) by exploiting the within-group variation, it uses, in principle, the whole sample when constructing the likelihood of the data; (iii) it is consistent as both N and T go to infinity under correct specification. Moreover, unlike most non-linear panel data estimators, the incidental parameter problem becomes negligible even when N grows as some exponential function of T, whereas maximum likelihood fixed effects estimators typically require N = o(T).

Our approach can be related to structural labour models that assume a discrete number of unobserved types for individuals when analyzing discrete choice problems; see for instance Guner, Kaygusuz and Ventura (2011). In these models, the unobserved types represent state variables. Generally, in this kind of model, it is possible to identify the unobserved discrete types only through some restrictive distributional assumptions, thereby following the typical random effects approach. A major advantage of our methodology compared to such techniques is that we can identify the group effects, and therefore the unobserved types, while leaving the conditional distribution of the unobserved effects given observables unrestricted. Moreover, group effects heterogeneity is also related to the empirical industrial organization literature of Ackerberg and Gowrisankaran (2002) and Pakes, Ostrovsky and Berry (2005), where game-theoretic models predict that the number of possible equilibria is finite. We can also link the group effects framework to social interactions models; see for instance Townsend (1994) and Munshi and Rosenzweig (2009). Finally, by allowing the discrete components of unobserved heterogeneity to be time-varying, we can relate our approach to models that analyze the determinants of political change (see Persson and Tabellini, 2009), where “democratization waves” (Huntington, 1993) affect groups of countries at one period of time (e.g. the Arab Spring in 2011).

We show how to compute the NLGFE in a non-linear panel data context. The computation of such an estimator is challenging, as it relies on the optimal grouping of the cross-sectional units.
Following the insights of the data clustering literature (Lin, 2005), we extend the standard “k-means” algorithm, allowing for non-linearities as well as the presence of covariates, in order to compute the parameters of interest. The optimal grouping is computed following a likelihood criterion, in a similar manner to Bonhomme and Manresa (2012), where the group assignment is based on a least-squares criterion. Using a heuristic approach, we obtain a fast and reliable algorithm that we show performs well, especially for low-dimensional problems.

Regarding the asymptotic properties of the NLGFE, we follow a recent paper by Hahn and Moon (2010). They derive the asymptotic properties of an estimator of a game-theoretic model where there exists a finite number of equilibria. By allowing for equilibria that are selected depending on unobserved market characteristics, they notice that estimating the selected equilibrium out of the finite set of equilibria is equivalent to estimating fixed effects that have finite support in a panel data framework. This corresponds exactly to the framework that characterizes the NLGFE described in this thesis. Hahn and Moon (2010) show that, although the estimator remains biased for fixed T, the bias vanishes even if N grows at some exponential rate of T. This is a crucial feature of the NLGFE, as standard maximum likelihood estimators (MLE) of non-linear panel data models with fixed effects require N = o(T) in order to eliminate the incidental parameter problem of Neyman and Scott (1948). The NLGFE is therefore automatically higher-order bias reducing, even when compared to recent estimators that have biases of order $O(T^{-2})$, as in Carro (2006). The intuition behind this result stems from the fact that the estimation of the groups improves very quickly as T increases. As a matter of fact, the NLGFE is asymptotically equivalent to the infeasible ML estimator that treats the population group membership as known. The latter result is important for computation, as it implies that standard errors can be calculated using standard techniques and are therefore unaffected by the estimation of the group membership.

However, there exists one important caveat when analyzing the properties of the NLGFE. There could be cases where the optimal group structure that we compute contains one group with only stayers, that is, individual units that do not change their status over the T periods in which we observe them. We refer to this situation as the “constrained” solution of the NLGFE. Notice that under such a constrained solution the group effect for the stayers is estimated to be ±∞.¹
Even more importantly, in this situation we end up losing one of the three crucial features of the NLGFE: the possibility of estimating the effects of interest using the whole sample available. We investigate the nature of this problem extensively by means of several simulation studies. Moreover, we also show that testing for the presence of such a constrained solution is difficult, since we are testing for parameters at the boundary of the parameter space. We point to the results of Andrews (1999) in order to derive the correct asymptotic distribution of a likelihood ratio statistic aimed at testing for the presence of the constrained solution. Moreover, as shown by Andrews and Guggenberger (2009, 2010), test statistics based on hybrid subsampling can deliver correct asymptotic size when testing for parameters at the boundary. This result does not depend on the choice of the subsample size b, provided that b → ∞ and b/N → 0 as N → ∞, and it holds under mild regularity conditions. A substantial part of future work will be devoted to the proper statistical analysis of the stayers’ problem. This analysis will build on the theoretical insights of Andrews (1999) and Andrews and Guggenberger (2009, 2010) as well as the Monte Carlo results described in this thesis.

¹ If the stayers are characterized by T positive responses, that is, a T-sequence of 1’s, then the group effect will be +∞, and −∞ otherwise.

We implement the NLGFE in three empirical applications. The first one studies the determinants of labour force participation for married women. Recently, Carro (2006) proposed a modified maximum likelihood fixed effects estimator for a dynamic binary choice model that is shown to reduce the order of the bias from $O(T^{-1})$ to $O(T^{-2})$. However, his approach has the major drawback that, as in the maximum likelihood fixed effects model, it drops from estimation all individual units that do not change their status over the T periods. These individuals correspond to about 55% of the observations in the original sample. Using Carro’s original data, we implement our approach and show that under group effects heterogeneity we can estimate the effects of interest using the whole sample available. This has important consequences when calculating the marginal effects for many key policy parameters. For instance, when computing the marginal effect of having a child between 0 and 2 years old on the probability of participating in the labour market, we see that under our approach the effect is reduced by 64% compared to what we would calculate using the maximum likelihood fixed effects model.
This is a natural consequence of the fact that the marginal effect is now calculated by averaging over the entire sample, that is, by averaging also over the wives who have always participated in the labour market. In particular, the whole sample now includes 676 married women who have always worked, and 43% of them had at least one child during the T periods of observation. This makes the marginal effect of having a child between 0 and 2 years old (as well as that of many other variables, such as log income) shrink significantly in absolute value.

The second application is related to Persson and Tabellini (2009). As mentioned before, their empirical application suffers from the typical trade-off that practitioners face when estimating non-linear panel data models. We apply our approach to their data set in order to ask whether, once we use the whole available sample of countries and allow for group heterogeneity, the effect of democratic capital reverses its sign in a similar fashion as when we allow for fixed effects. The answer to this question is positive. Even when allowing for a minimal degree of heterogeneity, that is, with only two groups, the effect of democratic capital on the exit from autocracy² shares the sign and significance of the fixed effects estimate. For the exit from democracy, we find that with minimal heterogeneity the effect is negative and insignificant, and it switches to positive and significant, like the fixed effects estimate, when allowing for higher degrees of heterogeneity. We believe that these results represent an important robustness check, showing that the differences between pooled and fixed effects estimates in Persson and Tabellini (2009) are not due to the selected sample that excludes the long-lived regimes. The key, instead, lies in unobserved heterogeneity. Our results confirm the concern that domestic democratic capital seems to capture unobserved heterogeneity rather than state dependence. This is true even if we introduce a minimal degree of heterogeneity into the model: two unobserved and heterogeneous effects that affect the exit rates from democracy/autocracy. Allowing for unobserved heterogeneity is therefore crucial in this application and, to our knowledge, the only way to allow for such unobserved heterogeneity without relying on a small and unrepresentative sample is the NLGFE approach.

In this second application, our sample is composed of durations of either democracy or autocracy. This makes the variability that we can use for the NLGFE extremely small.
As a matter of fact, as suggested by our Monte Carlo results, we find that in two out of the three cases considered, the likelihood under the constrained solution is higher than in the unconstrained case.

² Note that, as in Persson and Tabellini (2005), we consider the negative of the hazard rate out of autocracy; see p. 108.

The third application therefore investigates the link between the probability of switching from an autocracy to a democracy and several variables, including GDP per capita and democratic capital, by pooling the data on the hazard rates of both democracy and autocracy. This can be related to several studies; see for instance Acemoglu et al. (2008) and Moral-Benito and Bartolucci (2012). As in the other two applications, we are interested in how the estimates change once we move from the standard fixed effects estimates, which rely on the sample of movers, to the NLGFE estimates, which use the whole sample available. As suggested by the recent political economy literature (e.g. Moral-Benito and Bartolucci, 2012), we allow for non-linearities in the effect of GDP per capita on the probability of switching regime. We find that the coefficient of lagged GDP per capita is negative (while it is positive for the pooled probit). However, this coefficient is also insignificant, as in Acemoglu et al. (2008), probably because the fixed effects model relies on a much smaller sample. When using the NLGFE, and therefore the whole sample available, the coefficient on lagged per capita income is positive (as the fixed effects estimate) and it is also significant for G ≥ 3. Moreover, similarly to the previous application, the coefficient on domestic democratic capital is negative under the fixed effects model while it is positive under the pooled probit. When using the NLGFE, as we increase the degree of heterogeneity, the coefficient switches from positive and insignificant to negative and significant. Finally, we notice that, as in Moral-Benito and Bartolucci (2012), the non-linear effect is positive and significant across all the specifications.

The remainder of this thesis is organized as follows. Section 2 presents the NLGFE. Section 3 presents our iterative algorithm for estimating the parameters of interest. Section 4 characterizes the asymptotic results for the NLGFE. Section 5 introduces the stayers’ problem and describes it by means of Monte Carlo studies. Section 6 discusses three empirical applications. Section 7 concludes, while Section 8 presents the directions for future work.


1.1 Contributions and Related Literature

This thesis is connected with Bonhomme and Manresa (2012), who introduce the group fixed effects estimator in a linear panel data context. However, we believe that the gains, from an applied researcher’s perspective, are much higher in a non-linear panel data framework than in its linear counterpart. Given the wide use of binary data in empirical applications (e.g. labor force participation, smoking, etc.), there is still an open question among practitioners on how to allow for heterogeneity in discrete choice models in a flexible yet parsimonious way. The objective of this thesis is to provide a reliable answer to this question. Moreover, as we point out in the subsequent sections, many additional challenges arise in the non-linear framework compared to the linear counterpart in terms of computation, asymptotics and identification.

This thesis is also related to a recent paper by Browning and Carro (2011), who establish necessary and sufficient conditions for identification in a dynamic binary choice model with “maximal heterogeneity”. In particular, they allow for fixed effects as well as heterogeneous effects in both the auto-regressive coefficient and the covariates. Although this certainly represents an important contribution, Browning and Carro (2011) suffers from one limitation. Once covariates are allowed for, their model, being a mixture model, is able to identify the effects of interest only if an assumption is made regarding the distribution of the unobserved components given the covariates. The problem is that these assumptions are difficult to test and lack a robust theoretical justification. Instead, when using the NLGFE, the conditional distribution of unobservables given observables is left unrestricted. This is an important feature that our estimator shares with the standard fixed effects approach.

2 The Non-Linear Group Fixed Effects Estimator

In this section we characterize, following Bonhomme and Manresa (2012), the GFE in the non-linear context. We provide a brief discussion on the relationship between the NLGFE and finite mixture modeling. The model presented here can be extended to allow for time-varying group effects.


2.1 The Framework

Let us consider a class of panel data models with dimensions $N$ and $T$, where a vector of binary outcomes $y_i = (y_{i1}, \dots, y_{iT})^\top$ is related to a matrix of $K$ regressors denoted as $x_i = (x_{i1}, \dots, x_{iT})^\top$. The vector of regressors $x_{it}$ may include strictly exogenous regressors as well as predetermined regressors. In this section we assume that our panel is balanced, but the following discussion can be easily extended to unbalanced panels. Our model is the following:

$$y_{it} = 1\{x_{it}^\top \theta + \alpha_{g_i} + v_{it} > 0\}. \tag{1}$$
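To make model (1) concrete, the following is a minimal simulation sketch in Python. It assumes a single regressor ($K = 1$), standard normal errors (the probit case used later in the text), and a regressor whose mean shifts with the group effect, so that $x_{it}$ and $\alpha_{g_i}$ are correlated. The function name `simulate_grouped_probit`, the parameter values and the 0.5 loading are hypothetical choices made only for illustration.

```python
import numpy as np

def simulate_grouped_probit(N=500, T=8, theta=1.0, alpha=(-1.5, 0.0, 1.5), seed=0):
    """Draw (y, x, g) from y_it = 1{x_it * theta + alpha_{g_i} + v_it > 0}, v_it ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    alpha = np.asarray(alpha, dtype=float)
    g = rng.integers(0, alpha.size, size=N)            # latent group of each unit
    # Illustrative assumption: the regressor mean depends on the group effect,
    # so covariates and group effects are correlated (fixed-effects style).
    x = 0.5 * alpha[g][:, None] + rng.standard_normal((N, T))
    v = rng.standard_normal((N, T))                     # probit errors
    y = (x * theta + alpha[g][:, None] + v > 0).astype(int)
    return y, x, g

y, x, g = simulate_grouped_probit()
# "Stayers" never change status: their T outcomes are all 0 or all 1.
stayer = (y.sum(axis=1) == 0) | (y.sum(axis=1) == y.shape[1])
print(f"share of stayers: {stayer.mean():.2f}")
```

The printed share is the fraction of units that a fixed effects likelihood based on movers would drop from this simulated sample.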

We have two types of parameters: the parameters common across individuals, $\theta \in \Theta$, where $\Theta$ is a subset of $\mathbb{R}^K$, and the group-specific parameters $\alpha_g \in A$, where $A \subset \mathbb{R}$. The group membership variables $g_i$ assign each individual $i \in \{1, \dots, N\}$ to one of the $G$ groups. The relationship between $x_{it}$ and $\alpha_{g_i}$ is left unrestricted. Let $\alpha$ and $\gamma$ denote, respectively, the set of all $\alpha_g$ and the set of all $g_i$. Hence $\gamma \in \Gamma_G$, where $\Gamma_G$ is the set of all possible groupings of $\{1, \dots, N\}$ into the $G$ groups. We assume that $G$ is chosen by the researcher. The problem of how to optimally choose $G$ is left for future work.³ The NLGFE is given by

$$(\hat\theta, \hat\alpha, \hat\gamma) = \operatorname*{argmax}_{(\theta, \alpha, \gamma) \in \Theta \times A^G \times \Gamma_G} \sum_{i=1}^{N} \sum_{t=1}^{T} \Big\{ y_{it} \log F(x_{it}^\top \theta + \alpha_{g_i}) + (1 - y_{it}) \log\big[1 - F(x_{it}^\top \theta + \alpha_{g_i})\big] \Big\}. \tag{2}$$

For computational purposes, it is convenient to rewrite the NLGFE using the group membership characterization, i.e.

$$\hat g_i(\theta, \alpha) = \operatorname*{argmax}_{g \in \{1, 2, \dots, G\}} \sum_{t=1}^{T} \Big\{ y_{it} \log F(x_{it}^\top \theta + \alpha_g) + (1 - y_{it}) \log\big[1 - F(x_{it}^\top \theta + \alpha_g)\big] \Big\}. \tag{3}$$

This corresponds to the optimal assignment for each individual unit. Notice that the criterion that we use to compute this assignment is a likelihood criterion. Other criteria have been used in the clustering literature, relying for instance on an entropy criterion or on dissimilarity coefficients. See Li (2005) for a discussion of how to relate all of these possible criteria to the maximum likelihood framework.

³ See Bonhomme and Manresa (2012) for a discussion on this issue.
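The assignment rule in (3) can be sketched as follows. This is an illustrative Python version that assumes the probit link ($F = \Phi$), a balanced panel stored as arrays `y` of shape `(N, T)` and `x` of shape `(N, T, K)`, and groups labelled `0, ..., G-1`; the function name `assign_groups` is hypothetical.

```python
import numpy as np
from scipy.stats import norm

def assign_groups(theta, alpha, y, x):
    """For each unit i, return the group g maximizing its probit log-likelihood, as in (3)."""
    xb = x @ theta                                       # x_it' theta, shape (N, T)
    index = xb[:, :, None] + alpha[None, None, :]        # add each candidate alpha_g: (N, T, G)
    # log(1 - Phi(z)) is computed as log Phi(-z) for numerical stability.
    ll = (y[:, :, None] * norm.logcdf(index)
          + (1 - y[:, :, None]) * norm.logcdf(-index)).sum(axis=1)   # (N, G)
    return ll.argmax(axis=1), ll

# Two units, three periods, one regressor with no effect (theta = 0).
y = np.array([[0, 0, 0], [1, 1, 1]])
x = np.zeros((2, 3, 1))
groups, ll = assign_groups(np.zeros(1), np.array([-1.0, 1.0]), y, x)
print(groups)
```

In this toy case the unit that never responds is matched to the group with the lower effect and the unit that always responds to the group with the higher effect.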

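The NLGFE introduced in (2) may therefore be rewritten as

$$(\hat\theta, \hat\alpha) = \operatorname*{argmax}_{(\theta, \alpha) \in \Theta \times A^G} \sum_{i=1}^{N} \sum_{t=1}^{T} \Big\{ y_{it} \log F(x_{it}^\top \theta + \alpha_{\hat g_i(\theta,\alpha)}) + (1 - y_{it}) \log\big[1 - F(x_{it}^\top \theta + \alpha_{\hat g_i(\theta,\alpha)})\big] \Big\}, \tag{4}$$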

where $\hat g_i(\theta, \alpha)$ is given by (3).

Three features of the NLGFE are worth mentioning. First, in the following sections we take $F$ to be the probit link, but other suitable links can be used in this framework. This is an advantage compared to other non-linear panel data estimators, such as Honoré and Kyriazidou (2000) and the standard logit FE model, which need to rely on the logit link.

Second, there is an important connection between the NLGFE and standard finite mixture models. Following the insights of Li (2005), we notice that in our setting the observed data, characterized by the binary outcome in $\{0, 1\}^T$ and a matrix of covariates, are generated by $G$ latent classes:⁴

$$\Pr(y_{i1}, \dots, y_{iT} \mid x_i) = \sum_{g=1}^{G} \pi_{ig} \prod_{t=1}^{T} \Phi(x_{it}^\top \theta + a_{gt})^{y_{it}} \big[1 - \Phi(x_{it}^\top \theta + a_{gt})\big]^{1 - y_{it}}. \tag{5}$$

Let us introduce the auxiliary indicators $h_{ig}$, which indicate whether the observed data for unit $i$ are generated from group $g$ or not. The log-likelihood is therefore given by (Symons, 1981)

$$L(\theta, h, a \mid y, x) = \sum_{i=1}^{N} \sum_{g=1}^{G} h_{ig} \sum_{t=1}^{T} \Big\{ y_{it} \log \Phi(x_{it}^\top \theta + a_{gt}) + [1 - y_{it}] \log\big[1 - \Phi(x_{it}^\top \theta + a_{gt})\big] \Big\}. \tag{6}$$
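The link between (5) and (6) when the weights $\pi_{ig}$ are free for every unit can be checked numerically. The sketch below is only an illustration: it takes an arbitrary matrix `ell` of per-unit, per-group log-likelihoods (hypothetical numbers, not estimates) and verifies that the mixture log-likelihood never exceeds the classification log-likelihood, with equality when each unit puts all its weight on its best group.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical log-likelihoods ell[i, g] of unit i under group g (N = 4, G = 3).
ell = rng.normal(loc=-5.0, scale=2.0, size=(4, 3))

def mixture_loglik(ell, pi):
    """sum_i log sum_g pi_ig * exp(ell_ig), the log of (5) with unit-specific weights."""
    m = ell.max(axis=1, keepdims=True)                  # stabilise the exponentials
    return float((np.log((pi * np.exp(ell - m)).sum(axis=1)) + m[:, 0]).sum())

def classification_loglik(ell):
    """Maximum of (6) over hard assignments h: each unit takes its best group."""
    return float(ell.max(axis=1).sum())

pi_random = rng.dirichlet(np.ones(3), size=4)           # arbitrary weights on the simplex
pi_hard = np.eye(3)[ell.argmax(axis=1)]                 # all weight on the best group

print(mixture_loglik(ell, pi_random) <= classification_loglik(ell))
print(np.isclose(mixture_loglik(ell, pi_hard), classification_loglik(ell)))
```

Because a weighted average of likelihoods cannot exceed the largest one, maximizing over unrestricted unit-specific weights reduces to picking the best group for each unit.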

Notice from (6) that the NLGFE may be interpreted as the maximizer of the pseudo-likelihood of a mixture-of-probits model, where the mixing probabilities are individual-specific and unrestricted. A crucial difference relative to recent and influential applied work such as Browning and Carro (2011) is that in the grouped fixed effects approach the group probabilities in (5) are unrestricted functions of the individual dummies, whereas in typical mixture applications they need to be specified as functions of the covariates. As noticed by Bonhomme and Manresa (2012), this fact establishes an important link between the NLGFE and the mixture approach.

Third, the NLGFE corresponds to the maximizer of a likelihood function where the partition of the parameter space is defined by the different values of $\hat g_i(\theta, \alpha)$ for $i \in \{1, \dots, N\}$. Taking an element of such a partition as given, estimates of $(\theta, \alpha)$ are obtained by solving a standard maximum likelihood problem using a set of group dummies as additional regressors. However, the complexity of the problem increases with $N$, as the number of partitions of the $N$ units into the $G$ groups grows very quickly, making complete search essentially infeasible. Moreover, the criterion function is non-standard: although it is globally continuous, it is neither globally differentiable nor concave for $G > 1$. In the next section we discuss how to compute the NLGFE efficiently.

⁴ See McLachlan and Peel (2000).
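To give a sense of why complete search over groupings is infeasible, the following sketch counts the partitions of $N$ labelled units into $G$ non-empty groups (Stirling numbers of the second kind). Counting non-empty, unlabelled groups is an assumption made for this illustration; the helper name `stirling2` is hypothetical.

```python
def stirling2(n, k):
    """Number of ways to partition n labelled units into k non-empty groups."""
    row = [1] + [0] * k                 # S(0, 0) = 1 and S(0, j) = 0 for j > 0
    for _ in range(n):
        # Recurrence S(m+1, j) = j * S(m, j) + S(m, j-1); updating right to left
        # lets row[j-1] still hold the value from the previous m.
        for j in range(k, 0, -1):
            row[j] = j * row[j] + row[j - 1]
        row[0] = 0
    return row[k]

for n in (10, 50, 159):
    print(n, len(str(stirling2(n, 3))), "digits")
```

Each candidate partition would require its own probit fit, so the count above is also the number of maximum likelihood problems an exhaustive search would have to solve.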

3 Computation

In this section, we discuss how to compute the NLGFE introduced in Section 2. We first present an iterative algorithm that is aimed at computing the solution of the maximization problem described in (2). In Section 3.2, we present some results in order to verify the robustness of our method.

3.1 The Iterative Algorithm

Computing the NLGFE is challenging given the large number of local maxima of the objective function. As mentioned at the end of Section 2, our problem is to maximize a piece-wise likelihood function with as many pieces as there are ways to classify $N$ objects into $G$ classes. If there are no covariates, this problem boils down to a typical problem in the computer science literature: how to cluster binary observations into $G$ clusters. We exploit the connection with the data clustering literature in order to obtain fast and reliable computation methods for the problem described in (2). In particular, we propose an extension of the classical k-means algorithm that accounts for covariates, non-linearities in the dependent variable and unbalanced observations. Our iterative algorithm is defined as follows:

• Set $s = 0$. Let $(\theta^{(0)}, \alpha^{(0)}) \in \Theta \times A^G$ be some initial values.

• Compute, for all $i \in \{1, \dots, N\}$,

$$\hat g_i^{(s+1)} = \operatorname*{argmax}_{g \in \{1, 2, \dots, G\}} \sum_{t=1}^{T_i} \Big\{ y_{it} \log \Phi(x_{it}^\top \theta^{(s)} + \alpha_g^{(s)}) + (1 - y_{it}) \log\big[1 - \Phi(x_{it}^\top \theta^{(s)} + \alpha_g^{(s)})\big] \Big\}.$$

• Finally, compute

$$(\hat\theta^{(s+1)}, \hat\alpha^{(s+1)}) = \operatorname*{argmax}_{(\theta, \alpha) \in \Theta \times A^G} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \Big\{ y_{it} \log \Phi(x_{it}^\top \theta + \alpha_{\hat g_i^{(s+1)}}) + (1 - y_{it}) \log\big[1 - \Phi(x_{it}^\top \theta + \alpha_{\hat g_i^{(s+1)}})\big] \Big\}.$$

• Iterate on $s$ until convergence.
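The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the panel is balanced, the probit update is done with a generic numerical optimizer (`scipy.optimize.minimize` with L-BFGS-B) rather than a dedicated probit routine, the group effects are kept inside a box `[-bound, bound]` so that a group made only of stayers does not push them to infinity, and iteration stops when the grouping no longer changes. All function names and tuning values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def loglik(params, groups, y, x):
    """Probit log-likelihood given a grouping; params = (theta_1..theta_K, alpha_1..alpha_G)."""
    K = x.shape[2]
    theta, alpha = params[:K], params[K:]
    index = x @ theta + alpha[groups][:, None]
    return np.sum(y * norm.logcdf(index) + (1 - y) * norm.logcdf(-index))

def assign(params, y, x):
    """Assignment step: best group for each unit given (theta, alpha)."""
    K = x.shape[2]
    theta, alpha = params[:K], params[K:]
    index = (x @ theta)[:, :, None] + alpha[None, None, :]
    ll = (y[:, :, None] * norm.logcdf(index)
          + (1 - y[:, :, None]) * norm.logcdf(-index)).sum(axis=1)
    return ll.argmax(axis=1)

def nlgfe(y, x, G, start, max_iter=100, bound=8.0):
    """Alternate the assignment and update steps until the grouping stops changing."""
    K = x.shape[2]
    params = np.asarray(start, dtype=float)
    bounds = [(None, None)] * K + [(-bound, bound)] * G   # box on alpha: sketch assumption
    groups, path = None, []
    for _ in range(max_iter):
        new_groups = assign(params, y, x)
        if groups is not None and np.array_equal(new_groups, groups):
            break
        groups = new_groups
        res = minimize(lambda p: -loglik(p, groups, y, x), params,
                       method="L-BFGS-B", bounds=bounds)
        if -res.fun >= loglik(params, groups, y, x):      # keep only improving updates
            params = res.x
        path.append(loglik(params, groups, y, x))
    return params, groups, path

# Simulated example with two groups and one regressor (illustrative values).
rng = np.random.default_rng(0)
N, T = 200, 10
g0 = rng.integers(0, 2, size=N)
alpha0 = np.array([-1.0, 1.0])
x = rng.standard_normal((N, T, 1)) + 0.5 * alpha0[g0][:, None, None]
y = (x[:, :, 0] + alpha0[g0][:, None] + rng.standard_normal((N, T)) > 0).astype(int)
params, groups, path = nlgfe(y, x, G=2, start=np.array([0.0, -0.5, 0.5]))
print(params, len(path))
```

By construction, neither step can lower the log-likelihood, so the values stored in `path` are non-decreasing; this is the same monotonicity property that makes k-means converge.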

Notice that this algorithm alternates between two steps. In the first step, we ask what the optimal group for individual $i$ is, for given values of $(\theta, \alpha)$. The second step is the updating step: using the group structure produced by the first step, we update our estimates of $(\theta, \alpha)$ by running a simple probit regression that includes the group membership indicators as covariates. We iterate between these two steps until numerical convergence is obtained, which is typically very fast. However, this algorithm is sensitive to local maxima, given our piece-wise likelihood function. In practical terms, this means that the solution computed depends on the chosen starting values. A solution to this problem is to draw $N_{sim}$ possible starting values and select the solution that yields the highest likelihood.
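The multiple-starting-values strategy can be sketched generically. In the Python illustration below, the helper `best_of_starts` is hypothetical and a simple bimodal function stands in for the piece-wise likelihood, so that the dependence on starting values is easy to see; in practice `fit` would be one run of the iterative algorithm from a given $(\theta^{(0)}, \alpha^{(0)})$.

```python
import numpy as np
from scipy.optimize import minimize

def best_of_starts(fit, starts):
    """Run `fit` from every starting value and keep the run with the highest objective."""
    runs = [fit(s) for s in starts]
    return max(runs, key=lambda run: run[1])

# Toy stand-in for the likelihood: a local maximum near -2 and the global one near +2.
def objective(p):
    return float(np.exp(-(p[0] - 2.0) ** 2) + 0.5 * np.exp(-(p[0] + 2.0) ** 2))

def fit(start):
    """One local search; returns (maximizer, objective value)."""
    res = minimize(lambda p: -objective(p), np.atleast_1d(start), method="Nelder-Mead")
    return res.x, -res.fun

rng = np.random.default_rng(0)
starts = rng.uniform(-4.0, 4.0, size=20)     # plays the role of the N_sim starting values
best_params, best_value = best_of_starts(fit, starts)
print(best_params, best_value)
```

Local searches started on the left of the valley stop at the lower peak, which is why only the best run across starting values is retained.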

3.2 Numerical Performance of the Iterative Algorithm

We now present some preliminary results on the problem of local maxima discussed above. We use data from Persson and Tabellini’s empirical application (where $N = 159$ and $\max_i T_i = 150$), with two regressors, and we want to evaluate how the probability of switching from an autocracy to a democracy is affected by these two regressors. We set $N_{sim} = 10000$ in our algorithm. Figure 1 shows that our NLGFE is very reliable for small values of $G$. As in Bonhomme and Manresa (2012), we notice that the number of convergence points increases with $G$.⁵ Overall, these results suggest that computing the NLGFE is a difficult problem, as indicated by the configuration of the likelihood plotted in Figure 1. However, due to recent advances in the clustering literature, our iterative algorithm delivers fast and reliable solutions, especially for low-dimensional problems. Another possible way to deal with the problem of local maxima is to use an exact computational method such as the branch and bound approach. This approach has been exploited by Bonhomme and Manresa (2012) when computing the GFE in a linear context. Future work will be devoted to extending the branch and bound approach to the non-linear framework.

⁵ Notice that between $G = 4$ and $G = 5$ the increase does not seem to be so pronounced. This is a consequence of the fact that, as $G$ increases, the NLGFE tends to deliver the constrained type of solution. See Section 5 for details.

4 Asymptotics

In this section we characterize the asymptotic properties of the NLGFE as both N and T go to infinity, following the framework of Hahn and Moon (2010). Given the generality of their results, we specify, where possible, more primitive assumptions in order to deliver the results. We restrict our attention to the time-invariant case. The derivation of the asymptotic distribution for the time-varying case is left for future work.

4.1 Asymptotic Distribution

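We consider the following data generating process:

$$y_{it} = 1\{x_{it}^{\top}\theta^{0} + \alpha_{g_i^0} + v_{it} > 0\}. \tag{7}$$

The superscript $0$ denotes true parameter values, that is, $g_i^0$ denotes the true group membership indicators and $\alpha_{g_i^0}$ the true group effect associated with units that belong to group $g^0$. It is assumed that $G = G^0$ is known and independent of the sample size. See Bonhomme and Manresa (2012), Section 6, for a discussion of the properties of the GFE when both assumptions are relaxed in the linear context. Let $w_{it} = (y_{it}; x_{it})$ denote the vector of observed outcomes and covariates. We consider the following maximum likelihood estimator:

$$
\begin{aligned}
\hat g_i(\theta,\alpha) &= \operatorname*{argmax}_{g \in \{1,2,\dots,G\}} \sum_{t=1}^{T} L_{it}(\theta,\alpha;g) \\
&= \operatorname*{argmax}_{g \in \{1,2,\dots,G\}} \sum_{t=1}^{T} \Big\{ y_{it}\log\Phi(x_{it}^{\top}\theta + \alpha_{g}) + (1-y_{it})\log\big[1-\Phi(x_{it}^{\top}\theta + \alpha_{g})\big] \Big\},
\end{aligned}
$$

$$
\begin{aligned}
\hat\phi = (\hat\theta,\hat\alpha) &= \operatorname*{argmax}_{(\theta,\alpha) \in \Theta\times A^{G}} \sum_{i=1}^{N}\sum_{t=1}^{T} L_{it}(\theta,\alpha;\hat g_i) \\
&= \operatorname*{argmax}_{(\theta,\alpha) \in \Theta\times A^{G}} \sum_{i=1}^{N}\sum_{t=1}^{T} \Big\{ y_{it}\log\Phi(x_{it}^{\top}\theta + \alpha_{\hat g_i}) + (1-y_{it})\log\big[1-\Phi(x_{it}^{\top}\theta + \alpha_{\hat g_i})\big] \Big\}.
\end{aligned}
\tag{8}
$$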
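The classification step $\hat g_i(\theta,\alpha)$ defined above can be sketched for a single unit as follows. This is only an illustration under stated assumptions: the function names are hypothetical, group labels are 0-based rather than $\{1,\dots,G\}$, and probabilities are clipped away from 0 and 1 to keep the logarithms finite, a numerical safeguard that is not part of the estimator's definition.

```python
import math


def norm_cdf(z):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_loglik(y, x, theta, alpha_g, eps=1e-12):
    """Sum over t of y*log(Phi) + (1-y)*log(1-Phi) for one unit and one candidate group."""
    total = 0.0
    for y_t, x_t in zip(y, x):
        index = sum(xk * tk for xk, tk in zip(x_t, theta)) + alpha_g
        p = min(max(norm_cdf(index), eps), 1.0 - eps)  # numerical safeguard
        total += y_t * math.log(p) + (1 - y_t) * math.log(1.0 - p)
    return total


def classify(y, x, theta, alpha):
    """Return the 0-based label of the group that maximizes the unit's log-likelihood."""
    return max(range(len(alpha)), key=lambda g: probit_loglik(y, x, theta, alpha[g]))
```

For example, with `theta = [0.5]`, `alpha = [-1.0, 1.0]` and all covariates equal to zero, a unit with responses `[1, 1, 1, 1, 0]` is assigned to the second group and a unit with responses `[0, 0, 0, 0, 1]` to the first. The full estimator alternates this step with the maximization in (8).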

For notational simplicity, we denote $L(\phi,\gamma) = E[L_{it}(\phi;\gamma)]$, where $L_{it}(\phi;\gamma) = y_{it}\log\Phi(x_{it}^{\top}\theta + \alpha_{g_i}) + (1-y_{it})\log\big[1-\Phi(x_{it}^{\top}\theta + \alpha_{g_i})\big]$, with $\phi = (\theta,\alpha)$ and $\gamma$ representing the set of all $g_i$. The following conditions are imposed:

Assumption 1: (i) The parameter spaces $\Theta$ and $A$ are compact subsets of $R^{K}$ and $R$, respectively. (ii) $L(\phi,\gamma)$ is continuous in $(\phi,\gamma)$. (iii) The function $L(\phi,\gamma)$ is uniquely maximized at $(\phi^{0},\gamma^{0})$. (iv) There exists some $M(w)$ such that $\sup_{\phi,\gamma} \left| \partial L_{it}(\theta,\alpha;g_i)/\partial\theta_k \right| \le M(w)$ and $\max_i E[M(w_{it})^{2}] < \infty$.

Assumption 2: For each $i$, $\{w_{it}, t = 1, 2, \dots\}$ is strictly stationary. Moreover, we assume that the difference, if any, in the joint distribution of the processes $\{w_{i1}, w_{i2}, \dots\}$ across $i$ is characterized only by their difference in $g_i$.

Assumption 3: Let $\psi_1$ and $\psi_2$ denote, respectively, $L_{it}(\phi,\gamma) - E(L_{it}(\phi,\gamma))$ and $M(w_t) - E[M(w_t)]$. Then, for any $t$ and $k$, we assume that $\psi_j$, for $j = 1, 2$, is stationary and $\alpha$-mixing and such that, for constants $m$, $M$ and $j = 1, 2$,

$$0 < mk \le \lVert \psi_{j,t+1}, \dots, \psi_{j,t+k} \rVert \le Mk. \tag{9}$$

Remarks. Assumptions 1(i), 1(ii) and 1(iii) are the usual regularity conditions for extremum estimators that guarantee asymptotic identification. Assumption 1(iv) is needed for technical reasons. Assumption 2 is the same as Condition 2 in Hahn and Moon (2010). It is typically satisfied when the time series of the observed data follows a time-homogeneous Markov process where $w_{it} = (y_{it}, x_{it}) = (y_{it}, y_{it-1})$. Therefore, the results here allow for the presence of predetermined regressors. Assumption 3 provides a sufficient condition guaranteeing that the tail probabilities of $T^{-1}\sum_{t=1}^{T}\{L_{it}(\phi;\gamma) - E[L_{it}(\phi;\gamma)]\}$ and $T^{-1}\sum_{t=1}^{T}\{M(w_{it}) - E[M(w_{it})]\}$ are properly bounded for all $i$. In particular, we assume that $\psi_1$ and $\psi_2$ are stationary and $\alpha$-mixing and satisfy condition (9). Assumption 3 is needed because the classification of the groups depends on the properties of the tail probabilities, and a standard central limit theorem would not be sufficient in this context to characterize the asymptotic distribution of the NLGFE.⁶

⁶ See Bonhomme and Manresa (2012) for a more primitive assumption, where it is assumed that the linear error in their regression model is α-mixing with a faster-than-polynomial decay rate, with tails also decaying at a faster-than-polynomial rate.

Theorem 1. Let $\tilde\phi$ be the unfeasible estimator that estimates the common parameters under the assumption that we can identify ex-ante the group structure in our data, that is,

$$
\begin{aligned}
\tilde\phi = (\tilde\theta,\tilde\alpha) &= \operatorname*{argmax}_{(\theta,\alpha) \in \Theta\times A^{G}} \sum_{i=1}^{N}\sum_{t=1}^{T} L(w_{it};\theta,\alpha;g_i^{0}) \\
&= \operatorname*{argmax}_{(\theta,\alpha) \in \Theta\times A^{G}} \sum_{i=1}^{N}\sum_{t=1}^{T} \Big\{ y_{it}\log\big[\Phi(x_{it}^{\top}\theta + \alpha_{g_i^0})\big] + (1-y_{it})\log\big[1-\Phi(x_{it}^{\top}\theta + \alpha_{g_i^0})\big] \Big\}.
\end{aligned}
$$

We have that $\sqrt{NT}(\tilde\phi - \phi^{0}) \xrightarrow{d} N(0,\Sigma)$ as $N \to \infty$ and $T \to \infty$, where $\Sigma$ is positive definite. Then, under Assumptions 1-3, $\sqrt{NT}(\hat\phi - \phi^{0}) \xrightarrow{d} N(0,\Sigma)$ as $N \to \infty$ and $T \to \infty$, provided that $N = \exp(\epsilon\sqrt{T})$, where $\epsilon > 0$.

The proof of this result follows from Lemmas 1, 2 and 3 of Hahn and Moon (2010) and Corollaries 4.1 and 4.2 of Bosq (1993). We just need to verify that our more primitive assumptions satisfy the assumptions of Hahn and Moon (2010). In particular, Assumptions 1(i), 1(ii) and 1(iii) satisfy Condition 1(ii) of Hahn and Moon (2010); see Assumption 1*(b*) of Andrews (1999) for a proof. Moreover, Condition 1(i) of Hahn and Moon becomes, in our setting,

$$
E\left[ 1\{g_i \neq g_i^{0}\} \left[ y_{it}\log\left( \frac{\Phi(x_{it}^{\top}\theta^{0} + \alpha_{g_i^0})}{\Phi(x_{it}^{\top}\theta^{0} + \alpha_{g_i})} \right) + (1-y_{it})\log\left( \frac{1-\Phi(x_{it}^{\top}\theta^{0} + \alpha_{g_i^0})}{1-\Phi(x_{it}^{\top}\theta^{0} + \alpha_{g_i})} \right) \right] \right] > 0 \tag{10}
$$

via Assumption 1(iii), which therefore implies the validity of Condition 1(i) provided that $G > 1$. Notice that the expression in (10) ensures that we can identify each group for all the individuals in the population, so that $T^{-1}\sum_{t=1}^{T} L_{it}(\phi, g_i)$ converges uniformly to $E[L_{it}(\phi, g_i)]$ for each $i$. Corollaries 4.1 and 4.2 of Bosq (1993) provide the sufficient conditions that satisfy Theorem 3 of Hahn and Moon (2010), which is used to derive the rate of decay of the tail probabilities described in (9).

Theorem 1 states that the NLGFE is asymptotically equivalent to the unfeasible ML estimator that treats the population group values as known. This result is important for computation, as it implies that standard errors can be calculated using standard techniques and are therefore unaffected by the estimation of the group membership. Notice that, in order to establish this result, we need both N and T to go to infinity at a suitable rate. If T is fixed, then the NLGFE suffers from the incidental parameter problem of Neyman and Scott (1948), which in this case implies that, with T fixed, $\Pr[\hat g_i \neq g_i^{0}] > 0$ as $N \to \infty$. However, the NLGFE remains automatically higher-order bias reducing, even when compared to recent non-linear panel data estimators that have biases of order $O(T^{-2})$, as in Carro (2006).
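To see why group misclassification does not vanish when T is fixed, a small exact calculation may help. The sketch below makes simplifying assumptions that are not in the thesis: two groups, no covariates, and group probabilities treated as known, so that a unit is assigned to whichever group gives its response sequence the higher Bernoulli likelihood. The function names and the values 0.6 and 0.9 are purely illustrative.

```python
import math


def binom_pmf(k, T, p):
    """Probability of k ones out of T independent Bernoulli(p) draws."""
    return math.comb(T, k) * p**k * (1.0 - p) ** (T - k)


def loglik(k, T, p):
    """Log-likelihood of a sequence with k ones out of T under probability p (0 < p < 1)."""
    return k * math.log(p) + (T - k) * math.log(1.0 - p)


def misclassification_prob(T, p_true, p_other):
    """P(a unit drawn from the p_true group has strictly higher likelihood under p_other)."""
    return sum(
        binom_pmf(k, T, p_true)
        for k in range(T + 1)
        if loglik(k, T, p_other) > loglik(k, T, p_true)
    )


for T in (5, 10, 20, 40):
    print(T, misclassification_prob(T, 0.6, 0.9))
```

In this toy setting the misclassification probability is strictly positive for every finite T and shrinks as T grows, which is the mechanism behind the requirement that T go to infinity together with N.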

5 The Stayers' Problem

We would like to compute an estimator that accounts for discrete unobserved heterogeneity and estimates the parameters of interest using the whole sample. However, we may end up in situations where the optimal group structure that we find clusters only stayers within one group. We define as stayers those observations that do not change their response variable over the T periods in which we observe them. To understand the nature of the problem, consider the following table drawn from Carro (2006), which analyzes the determinants of labour force participation using PSID data. The table shows the percentage of women that participate in the labour market for the 10 calendar years 1979-1988. Carro (2006) estimates a dynamic discrete choice panel data model with fixed effects. His modified maximum likelihood estimator (MMLE) has a bias of order $O(T^{-2})$, compared to the standard maximum likelihood estimator with fixed effects, which has a bias of order $O(T^{-1})$. However, as we can see from the table below, he has to drop more than 50% of the sample due to the high frequency of always participants. Similarly to what happened in Persson and Tabellini (2009), this is extremely unsatisfactory, since the effects of interest are captured only by focusing on a special kind of woman, leaving out of the estimation both the always participants and the never participants, which represent two crucial categories when estimating the determinants of labour force participation.

Suppose now that we discretize the unobserved heterogeneity into two components, which we label {L; H}. Is our NLGFE going to deliver a group structure where only the always participants belong to group H? In principle this could happen, given the high frequency of always participants. We therefore estimate two group-specific probabilities of participating, $p_L$ and $p_H$, for T = 9 and frequencies given by the table above. Figure 2 plots the likelihood obtained with our NLGFE estimator in both the unrestricted case and the restricted case. The restricted case corresponds to the case where we restrict $\hat p_H = 1$, therefore forcing the always participants (and only them) to belong to group H. We refer to this as the constrained solution. Figure 2 clearly shows that the optimal assignment computed using the unconstrained NLGFE dominates the constrained case. Despite its relative simplicity, this is a useful exercise for understanding the source of identification when using the NLGFE. Notice that the optimal assignment implies that wives who have worked 7, 8 and 9 years belong to group H. This is reasonable, given that we do not expect women who have worked 7 or 8 years out of 9 to be completely different from those who have worked during the whole observed period of 9 years. It is exactly with this intermediate specification, where the stayers are clustered together with the quasi-stayers, that the researcher is able to use the whole sample to estimate the effects of interest.

However, we still would like to understand how we can statistically reject the presence of the constrained solution that we have just introduced. To fix ideas, let us still consider the case where there are no covariates. Our problem is to assign each data point $\{y_{i1}, \dots, y_{iT}\}'$ to a particular group $g_i \in \{1, 2, \dots, G\}$ and, given the group structure $g \in \{1, 2, \dots, G\}^{N}$, to estimate the group-specific means. Via the iterative algorithm described previously, we obtain estimates $\hat g$ and $\hat p$. Without loss of generality, let us consider the case where we test whether we can reject the existence of a degenerate group structure in which we cluster within one group only the individuals with a positive response for all the T periods. A natural test statistic is the likelihood ratio

$$\Lambda(Y) = -2\log\left[ \frac{L(\hat p_1, \hat p_2, \dots, 1; \hat g \mid Y)}{L(\tilde p_1, \tilde p_2, \dots, \tilde p_G; \tilde g \mid Y)} \right] \tag{11}$$

where $L(\cdot)$ denotes the likelihood of our data, $Y$; $\tilde\theta = (\tilde p_1, \tilde p_2, \dots, \tilde p_G; \tilde g)$ denotes the unrestricted estimates, while $\hat\theta$ denotes the restricted estimates. Notice that with (11) we are testing for a null value that lies at the boundary of the parameter space. This makes the distribution of $\Lambda(Y)$ non-standard. It is easy to show that the same problem also applies to the case with covariates.⁷ Andrews (1999) computes the distribution for test statistics of the form of $\Lambda(Y)$. The asymptotic distribution of $\theta$ is given by that of a random vector that minimizes a stochastic quadratic function over a convex cone that approximates the shifted and rescaled parameter space. The asymptotic distribution often depends on (estimable) nuisance parameters. Moreover, Andrews and Guggenberger (2009, 2010) show that test statistics based on hybrid subsampling techniques have correct asymptotic size when testing for parameters at the boundary. This result does not depend on the choice of subsample size $b$, provided $b \to \infty$ and $b/N \to 0$ as $N \to \infty$, and it is satisfied under mild regularity conditions.⁸

⁷ In this case, the group effect parameter associated with the always stayers would be at the boundary, that is, ±∞.

⁸ See Section 6.3.

In order to get a general sense of how likely it is to compute the constrained solution when using the NLGFE, we perform a small Monte Carlo exercise. The DGP that we consider is characterized by G = 2 and no covariates, and the random variable $\{y_{i1}, \dots, y_{iT}\}'$ is drawn from a Bernoulli distribution with probability $p_1$ or $p_2$, depending on whether the i-th observation belongs to group one or two. The probability of belonging to group two is fixed at 0.5. For the given DGP, we simulate the data S = 1000 times for N = 100 and T = 10. Table 1 shows our results. We notice that for well-separated problems, that is, for $p_1$ significantly different from $p_2$, the probability of computing a constrained solution in our sample is very low. However, the harder it is to differentiate between the two groups, the more likely it is to compute such a solution. In particular, for $p_1 = 0.6$, we compute a constrained solution in 350 of the 1000 simulated samples when the true value of $p_2$ is 0.99. Similarly, when $p_2 = 1$, in 150 of the 1000 simulated samples the value found in our sample is different from the true population one. These are the cases where a test for a constrained solution is especially needed. Notice, however, that for smaller values of $p_1$, the probability of computing $\hat p_2 \neq 1$ is increasingly lower.

We conclude from these small Monte Carlo results that testing for the constrained solution is extremely important for samples that exhibit a significantly thick right or left tail.⁹ This is intuitive, especially when G = 2. Let us assume for simplicity that we are analyzing a labour force participation decision. When computing the group assignment in the constrained solution, we always pay a price by clustering the quasi-stayers - those that have worked T − 1 or T − 2 periods out of the T periods observed - with those individuals that have instead worked very little over the T periods. This price depends on the frequencies of the individuals that have worked 0 periods, 1 period, 2 periods, etc. The higher these frequencies, the higher the price that we pay in the constrained solution, since the quasi-stayers are clustered together with individuals that are supposedly very different from them and that could represent a substantial part of our sample.
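The Monte Carlo design described above can be sketched as follows. One substitution should be flagged: instead of the iterative algorithm used in the thesis, the sketch finds the optimum exactly by enumerating cutoffs on the number of positive responses $k_i$. This is valid in this special case (G = 2, no covariates) because the log-likelihood difference between the two groups is linear in $k_i$. All function names are hypothetical, and the output is not expected to reproduce Table 1 digit by digit.

```python
import math
import random


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def group_loglik(counts, T):
    """Bernoulli log-likelihood of a group evaluated at its own mean, and that mean."""
    ones = sum(counts)
    zeros = T * len(counts) - ones
    p = ones / (T * len(counts))
    return xlogy(ones, p) + xlogy(zeros, 1.0 - p), p


def best_two_group_split(counts, T):
    """Exact G = 2 optimum: try every cutoff c and put units with k >= c in the high group."""
    best = None
    for c in range(1, T + 1):
        low = [k for k in counts if k < c]
        high = [k for k in counts if k >= c]
        if not low or not high:
            continue
        ll_low, p_low = group_loglik(low, T)
        ll_high, p_high = group_loglik(high, T)
        if best is None or ll_low + ll_high > best[0]:
            best = (ll_low + ll_high, p_low, p_high, c)
    return best


def share_constrained(p1, p2, N=100, T=10, S=1000, pi2=0.5, seed=0):
    """Fraction of simulated samples in which the estimated high-group probability equals 1."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(S):
        counts = []
        for _ in range(N):
            p = p2 if rng.random() < pi2 else p1
            counts.append(sum(rng.random() < p for _ in range(T)))
        best = best_two_group_split(counts, T)
        if best is not None and best[2] == 1.0:
            hits += 1
    return hits / S
```

A call such as `share_constrained(0.6, 0.99)` plays the role of one cell of Table 1, namely the share of samples in which only the always-positive units end up in the high group.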
However, if these frequencies are low, that is, if we have a $p_1$ that is not so different from $p_2$, the constrained solution tends to be the optimal one. This happens because in group 1 we no longer have a predominance of unemployed or quasi-unemployed individuals but instead, given the true population coefficient values, a large fraction of the so-called quasi-stayers.

Our empirical applications seem to confirm these insights. When analyzing the determinants of durations of democracy/autocracy, our sample has a very thick left tail (see Figure 3), since this sample, being a duration sample, is very concentrated around the negative responses,¹⁰ and for G = 3 we indeed find that the likelihood under the constrained solution is higher than in the unconstrained case. By contrast, when analyzing the determinants of labour force participation, the sample is more balanced, since we observe (see the table at the beginning of this section) both always employed and unemployed individuals. As a matter of fact, only for G = 5 do we find that the constrained solution has a higher likelihood than the unconstrained solution. Finally, notice that as we increase the degree of heterogeneity by allowing for a higher number of groups, the probability of computing a constrained solution is clearly going to increase significantly.

⁹ We refer to a thick right tail as a sample where a vast majority of the observations has a positive response in their binary outcomes over the T periods of observation. Similarly, a thick left tail corresponds to a sample where a vast majority of the observations has a negative response.
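The clustering of stayers with quasi-stayers in the no-covariates case with G = 2 can be made explicit with a short derivation. The notation $k_i$ and $c^{*}$ is introduced here for illustration and does not appear in the thesis; the derivation treats $p_L < p_H < 1$ as given.

```latex
% Log-likelihood of unit i under a group with participation probability p,
% where k_i is the number of positive responses over the T periods.
\ell_i(p) = k_i \log p + (T - k_i)\log(1 - p),
\qquad k_i = \sum_{t=1}^{T} y_{it}.

% The difference between the two groups is linear and increasing in k_i when p_H > p_L:
\ell_i(p_H) - \ell_i(p_L)
  = k_i \log\frac{p_H (1 - p_L)}{p_L (1 - p_H)}
  + T \log\frac{1 - p_H}{1 - p_L}.

% Hence unit i is assigned to H if and only if k_i exceeds the cutoff
k_i \;\ge\; c^{*}
  = T \, \frac{\log\dfrac{1 - p_L}{1 - p_H}}
              {\log\dfrac{p_H (1 - p_L)}{p_L (1 - p_H)}},
\qquad \frac{c^{*}}{T} \to 1 \ \text{ as } \ p_H \to 1 .
```

Under these assumptions the assignment is a cutoff rule in $k_i$. An interior cutoff (for example 7 out of 9) groups quasi-stayers with stayers, while the constrained solution corresponds to the limiting case in which the cutoff reaches T and only units with $k_i = T$ remain in group H.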

6 Empirical Analysis

We consider three different empirical applications. In Section 6.1, we analyze the determinants of labour force participation. In Section 6.2, we test the results of Persson and Tabellini (2009) using the NLGFE’s approach. Finally, in Section 6.3, we perform an analysis of the probability of switching from an autocracy to a democracy.

6.1 Labour Force Participation

In this first application, we analyze the performance of our estimator in one of the most typical applications of binary choice panel data. We analyze data on 1461 married women corresponding to waves 12-22 of the Panel Study of Income Dynamics (PSID). This is the data set originally analyzed in Carro (2006), which we started to describe in Section 5. Carro (2006) proposes a modified maximum likelihood estimator for a dynamic binary choice model that is shown to reduce the order of the bias from $O(T^{-1})$ to $O(T^{-2})$. However, his approach has the major drawback that it drops from estimation all individual units that do not change their status over the T periods, which corresponds to about 55% of the original sample. This type of drawback is shared by all non-linear panel data maximum likelihood FE estimators (ML-FE), like the Logit fixed effects (Logit FE) model, but also by estimators such as the one described in Honoré and Kyriazidou (2000), where the practitioner has to drop all the observations where $y_{i1} = y_{i2}$.

¹⁰ This happens because the response variable in this case is the exit rate from autocracy/democracy, and we have two different samples according to whether we are analyzing the hazard rate of democracy or autocracy.

Using Carro's original data, we implement our approach and show that under group-effects heterogeneity we can estimate the effects of interest using the whole sample available. This has important consequences when calculating the marginal effects for many policy parameters. Table 2 shows our results. First, notice that when comparing the estimates in the restricted sample (that is, the sample without stayers), the NLGFE estimates seem to lie between the ML-FE and Pooled Probit estimates. This becomes even more apparent when computing the NLGFE in the restricted sample - that is, where observations with $\sum_{t=1}^{T} y_{it} = 0$ or $\sum_{t=1}^{T} y_{it} = T$ are dropped - which is the sample used by the FE estimator in computing the effects of interest. When analyzing the marginal effects reported in Table 3, we see that the magnitude, in absolute value, of the estimates is significantly reduced under the NLGFE. For instance, when computing the marginal effect of having a child between 0 and 2 years old on the probability of participating in the labour market, we see that under our approach the effect is reduced by 64% when G = 4 compared to what we would calculate using the maximum likelihood fixed effects model. Similarly, for log income we find that the effect is reduced in absolute value by 72% compared to the ML-FE estimates. This is a natural consequence of the fact that the marginal effect is now calculated by averaging over the entire sample, that is, by averaging also over the wives that have always participated in the labour market. In particular, the whole sample now includes 676 married wives who have always worked, and 43% of them had at least one child during the T periods of observation. This makes the marginal effect of having a child between 0 and 2 years old, or of having a higher income, shrink significantly in absolute value.
The policy implications of this result are important: generally, panel data labour force participation results are computed without considering two crucial categories: the always participating and the never participating wives. Our results show that when we include these two crucial categories and we allow for group time-invariant heterogeneity, most of the key-policy parameters, when computed over this more representative sample of the population, have an effect that is significantly reduced in absolute value.
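The shrinkage of the marginal effects once stayers are included can be illustrated with a stylized calculation. The sketch below is only an assumption-laden illustration: it uses the derivative form of the probit marginal effect, $\theta_k\,\varphi(x_{it}^{\top}\theta + \alpha_{g_i})$ with $\varphi$ the standard normal density (the thesis may compute effects of discrete regressors differently), and all numbers are invented for the example rather than taken from Tables 2 or 3.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def average_marginal_effect(theta_k, indices):
    """Average of theta_k * phi(index) over the supplied observations."""
    return theta_k * sum(norm_pdf(v) for v in indices) / len(indices)


# Hypothetical fitted indices x'theta + alpha_g (illustrative numbers only).
movers = [-0.3, 0.1, 0.4, -0.5]   # units whose response changes over time
stayers = [2.2, 2.5, 2.8, 2.4]    # always-participating units: large positive index
theta_k = -0.4                    # hypothetical coefficient on the regressor of interest

ame_movers = average_marginal_effect(theta_k, movers)
ame_all = average_marginal_effect(theta_k, movers + stayers)
print(ame_movers, ame_all)
```

Because the density is small at the large indices typical of always participants, averaging over the whole sample pulls the marginal effect towards zero relative to an average over movers only, which is the mechanism described above.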


Finally, notice that for G = 2, 3, 4 the likelihood of the constrained solution, reported in square brackets, is always lower than that of the unconstrained solution.¹¹ This allows us to conclude that, when analyzing labour force participation decisions, the NLGFE seems to represent a viable alternative, especially when we are interested in computing a causal effect without the need to exclude from our sample both the always participating and the never participating wives. An interesting extension of this application would be to add another layer of heterogeneity. In particular, we could consider group-specific effects in order to see how heterogeneous the impact of the covariates on the response is in our sample. We could then check the difference between including and excluding all the wives that have always participated.

6.2 Exit Rates from Autocracy-Democracy

Persson and Tabellini (2009) investigate the dynamics of political change. They find that domestic democratic capital appears to raise (decrease) exit rates from autocracy (democracy). This result is computed without allowing for any type of unobserved heterogeneity. However, as argued by the authors, unobserved heterogeneity is likely to be a concern in this application, as we expect unobserved heterogeneous effects, like unobserved historical factors influencing both political and economic development, to be correlated with democratic capital and its accumulation over the years for a particular country. Somewhat surprisingly, once they allow for unobserved fixed heterogeneity using the Logit FE model, Persson and Tabellini (2009) find that the coefficient of domestic democratic capital has the opposite (and significant) sign. However, since the Logit FE model drops the long-lived regime countries from estimation, this could be due to the selected sample that the Logit FE model is constrained to use. We apply our approach to Persson and Tabellini's original data set. We want to ask whether, once we use the whole available sample of countries and allow for group heterogeneity, the effect of democratic capital reverses its sign as it does when we allow for fixed effects, suggesting therefore that the coefficient on democratic capital in the pooled probit estimates is likely to reflect unobserved heterogeneity rather than state dependence.

¹¹ However, for G = 5, this is no longer true. For an intuition for this result, see Section 5.

According to Table 4, the answer to this question is positive. Even when allowing for a minimal degree of heterogeneity - that is, with only two groups - the effect of democratic capital on the negative of the exit rate from autocracy shares the sign and significance of the FE estimate, while for the exit from democracy we find that with minimal heterogeneity the effect is negative and insignificant, and it switches to positive and significant - as in the FE estimates - when allowing for higher degrees of heterogeneity. This represents an important robustness check for assessing whether unobserved heterogeneity, and not the selected sample that excludes the long-lived regimes, is what drives the results in Persson and Tabellini's original application. It is indeed remarkable that, even if we introduce into the model a minimal degree of heterogeneity - only two unobserved and heterogeneous effects for the exit rates from democracy/autocracy - the estimates switch from being negative and significant in the pooled probit model to being positive and significant in the NLGFE with G = 2. Hence, this result confirms the initial concern of Persson and Tabellini (2009): domestic democratic capital seems to capture unobserved heterogeneity rather than state dependence. Allowing for unobserved heterogeneity is therefore crucial in this application and, to our knowledge, the only way to allow for such unobserved heterogeneity without relying on a small and unrepresentative sample is by using the NLGFE approach. However, it is important to stress that only for G = 2 does the constrained solution have a lower likelihood than the unconstrained solution. As described in Section 5, this happens because the sample from Persson and Tabellini (2009), being a duration sample, has very little variation that we can exploit in order to compute the NLGFE. This makes the computation of the NLGFE problematic.
In the next section, we consider a slightly different application where we can overcome more easily this type of problem.

6.3 Probability of Switching from an Autocracy to a Democracy

In order to overcome the difficulties in computing the NLGFE described at the end of the previous section, we construct a new sample, using the Polity IV data of Persson and Tabellini, where we pool the data on the hazard rates of both democracy and autocracy. Given this new data, we are interested in the differences between FE, Pooled and NLGFE estimates when investigating how transitions from autocracy to democracy are influenced by GDP per capita, Democratic Capital and Foreign Democratic Capital. As suggested by the recent Political Economy literature (e.g. Moral-Benito and Bartolucci, 2012), we allow for non-linearities in the effect of GDP per capita on the probability of switching regime. Table 5 shows our results. There are three main findings that we believe are worth mentioning. First, the coefficient of Lagged GDP per capita is negative under the Logit FE specification while it is positive for the pooled probit. However, this coefficient is also insignificant, as in Acemoglu et al. (2008), probably due to the fact that with the FE model we are considering a much smaller sample. When using the NLGFE, and therefore the whole sample available, the coefficient on lagged per capita income is positive (as in FE) and it is also significant for G ≥ 3. Second, similarly to the previous application, the coefficient on domestic democratic capital is negative under the FE model while it is positive under the Pooled Probit. When using the NLGFE, as we increase the degree of heterogeneity, we see that the coefficient switches from positive and insignificant to negative and significant. Third, as in Moral-Benito and Bartolucci (2012), the non-linear effect is positive and significant across all the specifications.

Finally, notice that one advantage of this type of application is that we recover a group structure that can be easily interpreted and analyzed. We represent this group structure for G = 4, since this corresponds to our preferred specification (for higher degrees of heterogeneity the constrained solution dominates the unconstrained one). The group structure is plotted in Figure 4. Our group structure is fairly similar to the one found in Bonhomme and Manresa (2012). One important difference is represented by South Africa, which is grouped together with the high-democracy countries such as the USA, Canada, the UK, etc.
This happens because in the Polity IV data South Africa is categorized as a democracy even during the apartheid period, whereas in the Freedom House Index used by Bonhomme and Manresa (2012) this is not the case. In general we see that the interpretation of the groups is straightforward. The first group - which corresponds to high-income, high-democracy countries - includes the US and Canada, most of Continental Europe, Japan and Australia, but also India and Costa Rica.


Then we have a second group of low-income, low-democracy countries, which includes a large share of North and Central Africa, China, and Iran, among others. We also have a third group that includes countries that have experienced sufficiently long autocratic regimes but have recently switched to democracy. This includes countries like Spain, Portugal, Argentina and Mexico, and also Eastern European countries like Romania and Bulgaria. Finally, we have a fourth group. In this group we have countries that have also experienced a transition from autocracy to democracy but have generally experienced longer spells of democracy compared to the third group. Some examples are Turkey, Brazil and Greece. An interesting extension of this analysis is to allow for time-varying group effects, as in Bonhomme and Manresa (2012). However, since the analysis of the NLGFE remains a challenge in the time-varying case, we leave this extension for future work.

7 Conclusions

In this thesis, we have laid down the basics for an estimator that can efficiently rebalance a tradeoff that has always existed in the analysis of non-linear panel data models. Our approach delivers estimates of common regression parameters, together with (interpretable) group membership and group effects. These effects are generally estimated using the whole sample available. This represents an important advantage compared to standard non-linear FE techniques, which generally drop from estimation the so-called stayers, who in many applications - such as labour force participation or political transition - represent crucial categories when estimating the effects of interest. We demonstrate the importance of this last point when analyzing the determinants of labour force participation. We also show that the NLGFE can represent an important robustness check for assessing the importance of unobserved heterogeneity in cases where, due to the limitations of the maximum likelihood fixed effects model, it would not be possible to establish it ex-ante.


8 Future Work

Future work should focus on the following aspects:

Stayers' Problem: A formal statistical analysis of the Stayers' Problem represents the first and most important extension that should be provided. In this thesis, we have pointed out some theoretical results - e.g. Andrews (1999) - that could be useful in developing a proper statistical analysis of this problem. In particular, establishing a routine to test for the presence of the constrained solution is fundamental in order to promote the use of the NLGFE among practitioners.

Time-Varying Group Effects: Extending the model in Section 2 to allow for time-varying group effects, and deriving the asymptotic properties of the NLGFE under the time-varying assumption, represents a natural extension considering the framework developed by Bonhomme and Manresa (2012). Provided that this is possible to achieve, having time-varying group effects would represent another important feature of our estimator. In particular, we could have an estimator that estimates the parameters of the model using the whole sample available and allows for time-varying unobserved heterogeneity. These two aspects are especially important when comparing the NLGFE with an estimator such as the Logit fixed effects estimator, which uses only the sample of movers and assumes time-invariant unobserved heterogeneity.

Number of Groups: In this thesis, we have abstracted from how to optimally choose the number of groups, G. In the non-linear framework, the formulation of a criterion for optimally choosing the number of groups must also take into account the stayers' problem described in Section 5. We believe this represents a first important step towards a methodology for optimally choosing G, in a similar fashion to Bonhomme and Manresa (2012). Moreover, the asymptotic results must be extended to consider the important issue of misspecification of the number of groups.


Figures and Tables

Figure 1: Solutions of the Iterative Algorithm

Note: Each dot in the figure shows a convergence point of the iterative algorithm, for 1,000 random choices of the starting values. The bar on the right shows the value of the likelihood function. The data is taken from Persson and Tabellini, where N = 159, max(Ti) = 150. The x-axis represents democratic capital and the y-axis foreign democratic capital. See Persson and Tabellini (2009) for a definition of these variables.


Figure 2: Likelihood Comparison

Note: The figure shows the likelihood comparison between the constrained solution where we cluster within one group only the always stayers (that is, those that have worked nine periods out of nine) inside the group labelled as H and the unconstrained solution which is found using the iterative algorithm described in Section 2. The optimal solution implies that we cluster within one group the individuals that have worked 7,8,9 periods inside group H.


Figure 3: Distribution of the Frequencies of Duration of Democracy

Note: Data from Persson and Tabellini (2009). The graph shows the histogram of the variable exit from a democracy.


Figure 4: Group Structure with G = 4

Legend: No Democracy; Late Democracy; High Democracy; Early Democracy; No data.

Note: Data from Polity IV. The periods of observation are from 1951 up to 2000. The panel is balanced.


Table 1: The Stayers' Problem - Monte Carlo Results

                   p2 = 0.95   p2 = 0.96   p2 = 0.97   p2 = 0.98   p2 = 0.99   p2 = 1
p1 = 0.3
  E(1{p̂2 = 1})     0           0           0           0           0.0130      0.9960
  max(p̂2)          0.9800      0.9833      0.9917      0.9963      1           1
  E(p̂2)            0.9496      0.9602      0.9703      0.9798      0.9900      1.0000
p1 = 0.4
  E(1{p̂2 = 1})     0           0           0           0.001       0.008       0.9340
  max(p̂2)          0.9897      0.9886      0.9957      1           1           1
  E(p̂2)            0.9509      0.9604      0.9690      0.9794      0.9903      0.9999
p1 = 0.5
  E(1{p̂2 = 1})     0           0           0           0.0030      0.0370      0.8080
  max(p̂2)          0.9838      0.9895      0.9925      1           1           1
  E(p̂2)            0.9494      0.9596      0.9713      0.9815      0.9902      0.9995
p1 = 0.6
  E(1{p̂2 = 1})     0.0020      0.0060      0.0220      0.0950      0.3500      0.8500
  max(p̂2)          1           1           1           1           1           1
  E(p̂2)            0.9588      0.9666      0.9737      0.9814      0.9907      0.9990
p1 = 0.7
  E(1{p̂2 = 1})     0.2110      0.3320      0.4950      0.6320      0.7730      0.8460
  max(p̂2)          1           1           1           1           1           1
  E(p̂2)            0.9660      0.9740      0.9826      0.9898      0.9955      0.9980

Note: Monte Carlo results with π2 = 0.5 (the probability of belonging to group 2), N = 100 and T = 10. The number of groups is fixed at 2, and we simulate 1,000 samples for each combination of p2 and p1.
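The design in this note can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not taken from the text: outcomes are drawn as independent Bernoulli variables with group-specific probabilities and no covariates, and the two-group estimator alternates between assigning each unit to its most likely group and updating the group frequencies from a single fixed starting value. The function names (`simulate_panel`, `fit_two_groups`, `stayers_monte_carlo`) are hypothetical, and the sketch is not expected to reproduce the figures in Table 1.

```python
import numpy as np


def simulate_panel(rng, n, t, p1, p2, pi2):
    """Draw y_it ~ Bernoulli(p_g(i)), with P(unit belongs to group 2) = pi2."""
    in_group2 = rng.random(n) < pi2
    p_unit = np.where(in_group2, p2, p1)
    return (rng.random((n, t)) < p_unit[:, None]).astype(float)


def fit_two_groups(y, p_init, max_iter=100, eps=1e-10):
    """Alternate between group assignment and group-frequency updates.

    Assumed simplification: no covariates, one fixed starting value.
    Returns the two estimated probabilities in increasing order.
    """
    t = y.shape[1]
    ones = y.sum(axis=1)  # number of 1's observed for each unit
    p = np.sort(np.asarray(p_init, dtype=float))
    labels = None
    for _ in range(max_iter):
        # Clip only inside the log-likelihood so that p can reach exactly 0 or 1.
        pc = np.clip(p, eps, 1 - eps)
        loglik = ones[:, None] * np.log(pc) + (t - ones)[:, None] * np.log1p(-pc)
        new_labels = loglik.argmax(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable: converged
        labels = new_labels
        for g in range(2):
            members = labels == g
            if members.any():  # keep the previous value if a group is empty
                p[g] = ones[members].sum() / (t * members.sum())
    return np.sort(p)


def stayers_monte_carlo(p1, p2, pi2=0.5, n=100, t=10, reps=1000, seed=0):
    """Summarise the estimate of p2 over `reps` simulated samples."""
    rng = np.random.default_rng(seed)
    p2_hat = np.empty(reps)
    for r in range(reps):
        y = simulate_panel(rng, n, t, p1, p2, pi2)
        p2_hat[r] = fit_two_groups(y, p_init=(0.25, 0.75))[1]
    return {
        "share_p2_hat_equal_1": float(np.mean(p2_hat == 1.0)),
        "max_p2_hat": float(p2_hat.max()),
        "mean_p2_hat": float(p2_hat.mean()),
    }
```

Calling, for example, `stayers_monte_carlo(p1=0.3, p2=0.99)` returns the three statistics reported in each block of the table: the share of samples in which the estimate of p2 equals one, its maximum, and its mean. A closer replication would draw many random starting values per sample and keep the solution with the highest likelihood.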


Table 2: NLGFE Labour Force Participation

Panel (A)

| | G=2 | G=3 | G=4 | MME | ML-FE | P-Probit |
|---|---|---|---|---|---|---|
| φ | 1.413 (0.035) | 1.0443 (0.034) | 0.923 (0.039) | 1.081 (0.042) | 0.753 (0.043) | 2.118 (0.030) |
| # Children 0-2 | -0.354 (0.038) | -0.333 (0.041) | -0.419 (0.042) | -0.400 (0.058) | -0.534 (0.064) | -0.252 (0.033) |
| # Children 3-5 | -0.120 (0.034) | -0.091 (0.037) | -0.181 (0.037) | -0.183 (0.050) | -0.283 (0.055) | -0.074 (0.0303) |
| # Children 6-17 | 0.02 (0.018) | 0.057 (0.019) | -0.009 (0.020) | -0.038 (0.039) | -0.078 (0.043) | -0.005 (0.015) |
| Log Income | -0.148 (0.025) | -0.167 (0.027) | -0.1505 (0.028) | -0.209 (0.051) | -0.253 (0.055) | -0.128 (0.042) |
| Observations | 14610 | 14610 | 14610 | 6640 | 6640 | 14610 |
| Likelihoods Comparison | -3606.5 [-3742.0] | -3184.7 [-3246.9] | -3031.1 [-3077.4] | | | |

Panel (B)

| | G=2 | G=3 | G=4 | MME | ML-FE | P-Probit |
|---|---|---|---|---|---|---|
| φ | 0.916 (0.039) | 0.841 (0.040) | 0.804 (0.040) | 1.081 (0.042) | 0.753 (0.043) | 1.238 (0.036) |
| # Children 0-2 | -0.37 (0.042) | -0.384 (0.042) | -0.447 (0.043) | -0.400 (0.058) | -0.534 (0.064) | -0.271 (0.039) |
| # Children 3-5 | -0.120 (0.037) | -0.131 (0.037) | -0.181 (0.037) | -0.183 (0.050) | -0.283 (0.055) | -0.063 (0.035) |
| # Children 6-17 | 0.046 (0.020) | 0.055 (0.020) | 0.009 (0.020) | -0.038 (0.039) | -0.078 (0.043) | 0.051 (0.019) |
| Log Income | -0.136 (0.029) | -0.188 (0.029) | -0.171 (0.029) | -0.209 (0.051) | -0.253 (0.055) | -0.096 (0.028) |
| Observations | 6640 | 6640 | 6640 | 6640 | 6640 | 6640 |

Table 3: NLGFE Labour Force Participation - Marginal Effects

Panel (A)

| | G=2 | G=3 | G=4 | ML-FE | P-Probit |
|---|---|---|---|---|---|
| φ | 0.213 (0.004) | 0.141 (0.004) | 0.119 (0.004) | 0.203 (0.011) | 0.401 (0.003) |
| # Children 0-2 | -0.054 (0.006) | -0.045 (0.005) | -0.054 (0.005) | -0.149 (0.015) | -0.048 (0.006) |
| # Children 3-5 | -0.018 (0.005) | -0.012 (0.005) | -0.023 (0.005) | -0.078 (0.014) | -0.014 (0.006) |
| # Children 6-17 | 0.003 (0.003) | 0.008 (0.003) | -0.001 (0.003) | -0.02 (0.011) | -0.001 (0.003) |
| Log Income | -0.022 (0.004) | -0.023 (0.004) | -0.019 (0.004) | -0.068 (0.015) | -0.023 (0.004) |
| Observations | 14610 | 14610 | 14610 | 6640 | 14610 |

Panel (B)

| | G=2 | G=3 | G=4 | ML-FE | P-Probit |
|---|---|---|---|---|---|
| φ | 0.261 (0.010) | 0.233 (0.010) | 0.220 (0.010) | 0.203 (0.011) | 0.389 (0.007) |
| # Children 0-2 | -0.105 (0.011) | -0.106 (0.011) | -0.122 (0.005) | -0.149 (0.011) | -0.085 (0.012) |
| # Children 3-5 | -0.034 (0.010) | -0.036 (0.010) | -0.050 (0.005) | -0.078 (0.010) | -0.020 (0.011) |
| # Children 6-17 | 0.013 (0.006) | 0.015 (0.006) | 0.002 (0.003) | -0.02 (0.006) | -0.016 (0.006) |
| Log Income | -0.039 (0.008) | -0.052 (0.008) | -0.047 (0.004) | -0.068 (0.008) | -0.030 (0.009) |
| Observations | 6640 | 6640 | 6640 | 6640 | 6640 |

Note: Tables 2 and 3 use data from waves 12-22 of the Panel Study of Income Dynamics (PSID). MME stands for the modified maximum likelihood estimator of Carro (2006). ML-FE denotes the maximum likelihood estimates with fixed effects. The sample includes only continuously married women aged between 18 and 60 in 1985 whose husbands are labor force participants in each of the sample years. The dependent variable is 1 if the wife participates, 0 otherwise. The regressors include: lagged participation (with coefficient φ), the number of children aged 0-2, 3-5 and 6-17, log income, time dummies, and a quadratic function of age. The numbers in square brackets in Table 2 represent the likelihood under the constrained solution, in which one group contains only the wives who have always participated in the labour market over the T periods in which we observe them. Panel (B) of Table 2 computes the NLGFE and Pooled Probit estimates using the same sample as in the FE model.


Table 4: Probability of Exit Autocracy - Democracy

Panel (A): Democracy

| | G=2 | G=3 | G=4 | Logit-FE | P-Probit |
|---|---|---|---|---|---|
| Domestic Democratic Capital | -0.01 (0.463) | 1.22 (0.53) | 2.10 (0.59) | 38.28 (6.25) | -0.91 (0.37) |
| Foreign Democratic Capital | -1.87 (0.56) | -2.35 (0.60) | -3.01 (0.65) | -8.40 (1.98) | -1.30 (0.50) |
| Lagged per Capita Income | -0.54 (0.10) | -0.76 (0.11) | -0.99 (0.13) | -2.55 (0.77) | -0.49 (0.08) |
| Observations | 3848 | 3848 | 3848 | 1569 | 3848 |
| Likelihoods Comparison | -301.0913 [-303.6249] | -283.9859 [-282.6187] | -275.1577 [-272.5159] | | |

Panel (B): Autocracy

| | G=2 | G=3 | G=4 | Logit-FE | P-Probit |
|---|---|---|---|---|---|
| Domestic Democratic Capital | 1.20 (0.43) | 1.77 (0.49) | 1.39 (0.44) | 21.21 (3.73) | -1.58 (0.34) |
| Foreign Democratic Capital | -3.22 (0.46) | -3.60 (0.48) | -4.19 (0.51) | -9.66 (1.56) | -1.80 (0.38) |
| Lagged per Capita Income | -0.40 (0.07) | -0.53 (0.09) | -0.71 (0.10) | -1.04 (0.54) | -0.04 (0.05) |
| Observations | 4420 | 4420 | 4420 | 2966 | 4420 |
| Likelihoods Comparison | -465.5219 [-474.3399] | -442.1982 [-441.1693] | -435.6897 [-425.8631] | | |

Note: Data are taken from Persson and Tabellini (2009). The results for the FE model in both Panel (A) and Panel (B) are taken from Table 3 of Persson and Tabellini (2009). For both samples, the panel is unbalanced and max(T_i) = 150. The dependent variable is 1 if the country exits from democracy/autocracy, 0 otherwise. The regressors include: domestic democratic capital, foreign democratic capital, lagged per capita income, wars (current and lagged), and linear and quadratic time trends. In Panel (B) we report the estimates on the negative of the hazard rate of autocracy, as in Persson and Tabellini (2009). The numbers in square brackets represent the likelihood under the constrained solution, in which one group contains the countries that never exit from the democracy status (Panel A) or the autocracy status (Panel B).


Table 5: Probability of Switching from Autocracy to Democracy

Panel (A)

| | G=2 | G=3 | G=4 | G=5 | G=6 | ML-FE | P-Probit |
|---|---|---|---|---|---|---|---|
| Income ∗ Democracy_{t−1} | 0.45 (0.01) | 0.43 (0.01) | 0.42 (0.01) | 0.42 (0.01) | 0.42 (0.01) | 0.76 (0.031) | 0.47 (0.01) |
| Democratic Capital | 0.58 (0.33) | 0.49 (0.36) | 0.06 (0.38) | -0.85 (0.39) | -0.96 (0.40) | -18.5 (2.77) | 1.24 (0.32) |
| Foreign Democratic Capital | 2.36 (0.39) | 2.81 (0.40) | 2.89 (0.40) | 3.12 (0.41) | 3.22 (0.41) | 4.94 (2.33) | 2.05 (0.82) |
| Lagged per Capita Income | 0.09 (0.05) | 0.28 (0.06) | 0.23 (0.06) | 0.42 (0.07) | 0.47 (0.08) | 0.61 (0.38) | -0.049 (0.04) |
| Observations | 6566 | 6566 | 6566 | 6566 | 6566 | 1569 | 6566 |
| Likelihoods Comparison | -630.4769 [-641.7667] | -608.2231 [-614.2140] | -599.0340 [-600.0955] | -593.7114 [-590.8083] | -591.1392 [-585.6542] | | |

Note: Data are taken from Polity IV. The observation period starts in 1951 and ends in 2000. The panel is therefore balanced. The dependent variable is 1 if the country is a democracy and 0 if the country is an autocracy. The regressors include: domestic democratic capital, foreign democratic capital, lagged per capita income, wars (current and lagged), and linear and quadratic time trends. The numbers in square brackets represent the likelihood under the constrained solution, in which one group contains the countries that have always been a democracy during the T periods in which we observe them.


References

[1] Acemoglu, D., S. Johnson, J. Robinson, and P. Yared (2008): “Income and Democracy” American Economic Review 98, 808-842.
[2] Ackerberg, D. A., and G. Gowrisankaran (2002): “Quantifying Equilibrium Network Externalities in the ACH Banking Industry” Rand Journal of Economics, forthcoming.
[3] Andrews, D. W. K. (1999): “Estimation When a Parameter Is on a Boundary” Econometrica 67, 1341-1383.
[4] Andrews, D. W. K. (2001): “Testing When a Parameter Is on the Boundary of the Maintained Hypothesis” Econometrica 69, 683-734.
[5] Andrews, D. W. K., and P. Guggenberger (2009): “Hybrid and size-corrected subsampling methods” Econometrica 77, 721-762.
[6] Andrews, D. W. K., and P. Guggenberger (2010): “Applications of subsampling, hybrid, and size-correction methods” Journal of Econometrics 158, 285-305.
[7] Arellano, M., and J. Hahn (2007): “Understanding Bias in Nonlinear Panel Models: Some Recent Developments” In: R. Blundell, W. Newey, and T. Persson (eds.): Advances in Economics and Econometrics, Ninth World Congress, Cambridge University Press.
[8] Arellano, M., and S. Bonhomme (2011): “Nonlinear Panel Data Analysis” Annual Review of Economics, vol. 3.
[9] Bartolucci, C., and E. Moral-Benito (2012): “Income and Democracy: Revisiting the Evidence” mimeo.
[10] Bonhomme, S., and E. Manresa (2012): “Discrete Heterogeneity Patterns in Panel Data” mimeo.
[11] Bosq, D. (1993): “Bernstein-type large deviations inequality for partial sums of strong mixing processes” Statistics 24, 59-70.
[12] Browning, M., and J. Carro (2007): “Heterogeneity and Microeconometrics Modelling” in Advances in Economics and Econometrics, Theory and Applications: Ninth World Congress of the Econometric Society, Vol. 3, ed. by R. Blundell, W. Newey, and T. Persson. Cambridge, U.K.: Cambridge University Press, 47-74.
[13] Browning, M., and J. Carro (2011): “Dynamic Binary Outcome Models with Maximal Heterogeneity” mimeo.
[14] Brusco, M. J. (2006): “A Repetitive Branch-and-Bound Procedure for Minimum Within-Cluster Sums of Squares Partitioning” Psychometrika 71, 357-373.
[15] Carro, J. (2006): “Estimating dynamic panel data discrete choice models with fixed effects” Journal of Econometrics 140, 503-528.
[16] Guner, N., R. Kaygusuz, and G. Ventura (2011): “Taxation and Household Labor Supply” The Review of Economic Studies, forthcoming.
[17] Hahn, J., and W. Newey (2004): “Jackknife and Analytical Bias Reduction for Nonlinear Panel Models” Econometrica 72, 1295-1319.
[18] Hahn, J., and H. Moon (2010): “Panel Data Models with Finite Number of Multiple Equilibria” Econometric Theory 26(3), 863-881.
[19] Heckman, J., and B. Singer (1984): “A Method for Minimizing the Impact of Distributional Assumptions in Econometric Models for Duration Data” Econometrica 52(2), 271-320.
[20] Honoré, B., and E. Kyriazidou (2000): “Panel data discrete choice models with lagged dependent variables” Econometrica 68, 839-874.
[21] McLachlan, G., and D. Peel (2000): “Finite Mixture Models” Wiley Series in Probabilities and Statistics.
[22] Huntington, S. P. (1991): “The Third Wave: Democratization in the Late Twentieth Century” Norman, OK, and London: University of Oklahoma Press.
[23] Munshi, K., and M. Rosenzweig (2009): “Why is Mobility in India so Low? Social Insurance, Inequality, and Growth” BREAD Working Paper No. 092.
[24] Neyman, J., and E. Scott (1948): “Consistent Estimates Based on Partially Consistent Observations” Econometrica 16, 1-31.
[25] Pakes, A., M. Ostrovsky, and S. Berry (2005): “Simple Estimators for the Parameters of Discrete Dynamic Games” unpublished working paper.
[26] Persson, T., and G. Tabellini (2009): “Democratic Capital: The Nexus of Political and Economic Change” American Economic Journal: Macroeconomics 1:2, 88-126.
[27] Pollard, D. (1982): “A Central Limit Theorem for K-Means Clustering” Annals of Statistics 10, 919-926.
[28] Townsend, R. M. (1994): “Risk and Insurance in Village India” Econometrica 62, 539-91.
