Nonparametric Bayes inference on conditional independence

arXiv:1404.1429v1 [stat.ME] 5 Apr 2014

Tsuyoshi Kunihama∗



David B. Dunson†



April, 2014

Abstract

In broad applications, it is routinely of interest to assess whether there is evidence in the data to refute the assumption of conditional independence of Y and X conditionally on Z. Such tests are well developed in parametric models but are not straightforward in the nonparametric case. We propose a general Bayesian approach, which relies on an encompassing nonparametric Bayes model for the joint distribution of Y, X and Z. The framework allows Y, X and Z to be random variables on arbitrary spaces, and can accommodate different dimensional vectors having a mixture of discrete and continuous measurement scales. Using conditional mutual information as a scalar summary of the strength of the conditional dependence relationship, we construct null and alternative hypotheses. We provide conditions under which the correct hypothesis will be consistently selected. Computational methods are developed, which can be incorporated within MCMC algorithms for the encompassing model. The methods are applied to variable selection and assessed through simulations and criminology applications.

Key words: Dirichlet process; Graphical model; Hypothesis testing; Mutual information; Variable selection.

1 Introduction

One of the canonical problems in statistics is to assess whether or not Y is conditionally independent of X given Z, expressed as Y ⊥ X | Z. In general, Y ∈ 𝒴 is a response, X ∈ 𝒳 are predictors of interest, Z ∈ 𝒵 are adjustment variables or covariates, and the variables can be multivariate and have a variety of measurement scales and domains. There is a rich literature on testing of conditional independence in parametric models; often this corresponds to testing whether a vector of regression coefficients for the X variables is equal to zero. However, much less consideration has been given to this problem from a nonparametric perspective, particularly from a model-based Bayesian perspective. A rich variety of Bayesian nonparametric models have been proposed for characterizing joint and conditional distributions, ranging from Dirichlet process mixtures (Lo (1984); West et al. (1994); Escobar and West (1995); Müller et al. (1996)) to kernel stick-breaking processes (Dunson and Park (2008); An et al. (2008)). In addition, there is an emerging literature providing an asymptotic frequentist justification for these models (Norets and Pelenis (2011, 2012); Pati et al. (2013)). However, there has been amazingly little consideration of testing problems in the nonparametric Bayes literature. Notable exceptions include Dunson and Peddada (2008), Ma and Wong (2011) and Holmes et al. (2012), all of which consider testing for group differences. Also, more relevant to the conditional independence testing problem, several authors have proposed nonparametric Bayes methods for variable selection (Chung and Dunson (2009); Ma (2011); Reich et al. (2012)). It tends to be highly challenging to design Bayesian methods that accommodate variable selection or testing of conditional independence. It seems that the easiest approach would be to separately fit nonparametric Bayes models with and without conditional independence, and then calculate the Bayes factor as a ratio of marginal likelihoods for each model.

∗ Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA. Email: [email protected]
† Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA. Email: [email protected]
However, accurately approximating marginal likelihoods in infinite-dimensional Bayesian models remains problematic, and the use of Bayes factors in nonparametric testing has recently been called into question (Xu et al. (2012)). Also, in the variable selection context, one needs to examine many possible conditional independence relationships, so this approach would be computationally infeasible.

An alternative strategy, which is broadly used in parametric models, is to define an encompassing model which includes each possible hypothesis. Then, as a byproduct of conducting posterior computation under this one model, one can simultaneously calculate posterior hypothesis probabilities and conduct model-averaged predictions. In the nonparametric setting, defining an encompassing approach is far from straightforward, as it becomes necessary to define priors that place positive probabilities on each possible conditional independence relationship. Unlike in Gaussian graphical models or typical parametric models, one cannot simply zero out finitely many parameters. Relying on a probit link function, Chung and Dunson (2009) proposed a flexible stick-breaking process and applied it to the modeling of conditional distributions. Reich et al. (2012) combined the kernel stick-breaking process and stochastic search variable selection (George and McCulloch (1993, 1997)). These approaches select predictors which have effects on the stick-breaking weights and/or the mean of the response.

In this paper, our emphasis is on developing a novel methodology for testing conditional independence. The proposed approach utilizes an existing nonparametric Bayesian model as an encompassing model and constructs null and alternative hypotheses relying on conditional mutual information. In information theory, the conditional mutual information is a well-known scalar measure of the strength of a conditional dependence relationship. Based on empirical process theory, we show that the proposed method consistently selects conditionally dependent predictors under appropriate conditions. Then, we apply the method to variable selection problems where we investigate the conditional dependence relationships between the response and each of the predictors simultaneously.
For posterior computation, variable selection can be conducted along with estimation of the parameters of the encompassing model via a Markov chain Monte Carlo (MCMC) algorithm. We leverage the rich existing literature on efficient computational implementations for nonparametric Bayesian models to obtain a straightforward implementation.

Section 2 proposes a novel approach for testing conditional independence and extends it to variable selection. Section 3 assesses the performance of the proposed approach against competitors through simulation. Section 4 applies it to a criminology study. Section 5 discusses future directions.

2 Tests based on conditional mutual information

2.1 General framework

Let Y, X and Z be univariate or multivariate random variables where each element can have any type of scale and domain. f(y, x, z) denotes the joint density of Y, X and Z with respect to a product measure µ. The marginal densities we use below are denoted by f(y, z), f(x, z) and f(z). Suppose the primary interest is in testing if Y and X are conditionally independent given Z. The null hypothesis H0 : Y ⊥ X | Z can be equivalently expressed as

H0 : f(y, x, z)f(z) = f(y, z)f(x, z),   (1)

for all (y, x, z) in the support of f. In information theory, conditional mutual information (CMI) is a widely-used quantity which measures the strength of the functional relationship between Y and X given Z, defined by

ζ = ∫ f(y, x, z) log [ f(y, x, z)f(z) / {f(y, z)f(x, z)} ] dµ.   (2)

Using the Kullback-Leibler divergence, KL(p, q) = ∫ p log(p/q), the CMI can be expressed as KL(f(y, x, z), f(y, z)f(x, z)/f(z)), which is always non-negative. CMI is zero if Y ⊥ X | Z, while large values of CMI indicate substantial violations of conditional independence; in the extreme, given knowledge of Z there is a functional relationship between Y and X. Hence, the null hypothesis (1) corresponds to

H0 : ζ = 0.   (3)
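As a concrete numerical check (an illustration added here, not part of the paper): when (Y, X, Z) are jointly Gaussian, the CMI in (2) reduces to ζ = −(1/2) log(1 − ρ²), where ρ is the partial correlation of Y and X given Z. A minimal sketch:

```python
import numpy as np

def gaussian_cmi(y, x, z):
    """CMI of Y and X given Z under a joint-Gaussian assumption:
    zeta = -0.5 * log(1 - rho^2), rho the partial correlation of Y, X given Z."""
    # Residualize y and x on (1, z) by least squares, then correlate residuals.
    Z = np.column_stack([np.ones_like(z), z])
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    rho = np.corrcoef(ry, rx)[0, 1]
    return -0.5 * np.log(1.0 - rho ** 2)

rng = np.random.default_rng(0)
z = rng.normal(size=5000)
x = z + rng.normal(size=5000)
y_dep = x + z + rng.normal(size=5000)   # Y depends on X given Z
y_ind = z + rng.normal(size=5000)       # Y is conditionally independent of X given Z
cmi_dep = gaussian_cmi(y_dep, x, z)     # noticeably positive
cmi_ind = gaussian_cmi(y_ind, x, z)     # near zero
```

In this Gaussian special case the CMI is exactly zero under conditional independence, matching (3).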

2.2 Proposed tests of conditional independence

Various nonparametric tests of conditional independence have been proposed in the frequentist literature, relying on different expressions of conditional independence with characteristic functions (Su and White (2007)), probability density functions (Su and White (2008)), distribution functions (Seth and Principe (2010); Györfi and Walk (2012)), copula densities (Bouezmarni et al. (2012)) and kernel methods (Fukumizu et al. (2008)). Also, Song (2009) constructs a test using Rosenblatt-transforms of random variables. Seth and Principe (2012) review the field and develop an asymmetric measure of conditional independence based on cumulative distribution functions.

We propose a novel method which expresses the conditional independence relationship using the CMI. Let Lµ be a space of joint probability densities with respect to a measure µ. We define a true data-generating probability P0 and assume it has a density f0 ∈ Lµ. A countable product of P0 is denoted by P0∞. Let Π be a prior distribution on F ⊂ Lµ with Π(F) = 1. The data set Dn consists of independent identically distributed observations (yi, xi, zi) from P0, i = 1, . . . , n. Let ζ0 be the CMI induced by the true data-generating density, that is,

ζ0 = ∫ log [ f0(y, x, z)f0(z) / {f0(y, z)f0(x, z)} ] dP0 = ∫ f0(y, x, z) log [ f0(y, x, z)f0(z) / {f0(y, z)f0(x, z)} ] dµ.   (4)

As noted above, if H0 : Y ⊥ X | Z then ζ0 = 0. We estimate the CMI relying on an encompassing nonparametric Bayes model for the joint density f ∈ F. First, we define a function ζ(·, ·) of a joint density p ∈ Lµ and a probability measure P on 𝒳 × 𝒴 × 𝒵 as

ζ(p, P) = ∫ log [ p(y, x, z)p(z) / {p(y, z)p(x, z)} ] dP.

Using this function, the true CMI ζ0 can be expressed as ζ(f0, P0). Intuitively, if p and P are close to f0 and P0 in some sense, ζ(p, P) can approximate ζ0 well. As an estimate of P0, we utilize the empirical measure Pn, given by

Pn = (1/n) Σ_{i=1}^n δ(yi,xi,zi),

where δ(y,x,z) is the Dirac measure concentrated at (y, x, z). Pn is a consistent estimate of P0 in that Pn(A) → P0(A) almost surely for a fixed A by the strong law of large numbers. Then, we let

ζ(f, Pn) = ∫ log [ f(y, x, z)f(z) / {f(y, z)f(x, z)} ] dPn = (1/n) Σ_{i=1}^n log [ f(yi, xi, zi)f(zi) / {f(yi, zi)f(xi, zi)} ], f ∈ F,   (5)

where ζ(f, Pn) ∈ ℜ and, for any fixed f ∈ F, ζ(f, Pn) → ζ(f, P0) almost surely P0∞ by the law of large numbers.

Posterior computation for the estimated CMI relies on a simple modification of any existing MCMC algorithm for the encompassing model, adding one additional step. In particular, for each of many draws from the posterior distribution of the parameters in the encompassing model, we compute and save ζ(f, Pn). We can then calculate posterior summaries of the CMI from these samples, which we use as a basis for our testing approach. Under our asymptotic theory below, as n increases the posterior of ζ(f, Pn) will be increasingly concentrated around the true CMI, ζ0. Therefore, if ζ0 is not close to zero, zero should lie in the left tail of the distribution of ζ(f, Pn). We consider the posterior probability of ζ(f, Pn) being positive as a weight of evidence of a violation of conditional independence and reject the hypothesis if the weight is large. The posterior probability can be estimated by (1/R) Σ_{r=1}^R 1{ζ(f^(r), Pn) > 0}, where R is the number of MCMC iterations after the burn-in period, 1{·} is an indicator function and f^(r) is the joint density under the encompassing model at the rth MCMC iteration.
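This extra MCMC step can be sketched as follows; the four log-density vectors per draw are hypothetical stand-ins for evaluations of the encompassing model's joint and marginal densities at the observed points:

```python
import numpy as np

def cmi_draw(log_f_yxz, log_f_yz, log_f_xz, log_f_z):
    """zeta(f, Pn) in (5): average log-density ratio over the n observations."""
    return np.mean(log_f_yxz + log_f_z - log_f_yz - log_f_xz)

def prob_positive_cmi(draws):
    """Estimate P(zeta(f, Pn) > 0 | Dn) over R retained MCMC draws; each draw
    is a tuple of the four log-density vectors at (y_i, x_i, z_i), i = 1..n."""
    zetas = np.array([cmi_draw(*d) for d in draws])
    return np.mean(zetas > 0), zetas

# Toy illustration with fabricated log-density values (R = 2 draws, n = 3):
draws = [
    (np.log([0.5, 0.4, 0.6]), np.log([0.3, 0.3, 0.4]),
     np.log([0.4, 0.4, 0.5]), np.log([0.3, 0.35, 0.4])),
    (np.log([0.2, 0.2, 0.2]), np.log([0.4, 0.4, 0.4]),
     np.log([0.5, 0.5, 0.5]), np.log([0.3, 0.3, 0.3])),
]
p_pos, zetas = prob_positive_cmi(draws)
```

Here the first fabricated draw gives a positive ζ(f, Pn) and the second a negative one, so the estimated posterior probability is 0.5.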

The next theorem provides sufficient conditions under which the posterior of ζ(f, Pn) concentrates on arbitrarily small neighborhoods of the true CMI as the sample size increases. The conditions imply the posterior consistency of the density f ∈ F to f0 in KL divergence (Norets (2012)).

Theorem 1. Suppose for any ǫ > 0,

Π [KL{f0(y, x, z), f(y, x, z)} < ǫ] > 0,   (6)

and the following classes of functions

{ log f0(y, x, z)/f(y, x, z) : f ∈ F }, { log f0(y, z)/f(y, z) : f ∈ F }, { log f0(x, z)/f(x, z) : f ∈ F }, { log f0(z)/f(z) : f ∈ F },

are P0-Glivenko-Cantelli. Then, for any ǫ′ > 0,

Π (|ζ(f, Pn) − ζ0| < ǫ′ | Dn) → 1, almost surely P0∞.

The proof is given in Appendix A. The condition (6) means the true data-generating density is in the KL support of the prior. Such KL support conditions are standard for Bayesian nonparametric models, and are routinely employed in theorems on posterior asymptotics (Ghosal et al. (1999); Ghosh and Ramamoorthi (2003); Tokdar (2006)). Wu and Ghosal (2008) discuss the KL property for various types of kernels in Dirichlet process mixture models. As for the Glivenko-Cantelli class, theoretical properties of the class have been studied in empirical process theory (van der Vaart and Wellner (1996); Kosorok (2008)); it is a class of functions over which the law of large numbers holds uniformly.

We show an example which satisfies the sufficient conditions. Let y ∈ ℜ^d1, x ∈ ℜ^d2, z ∈ ℜ^d3, w = (y′, x′, z′)′ ∈ ℜ^q where q = d1 + d2 + d3, and let φΣ be the q-dimensional normal density with mean zero and covariance Σ. Then, we consider the location mixture of normals with diagonal covariance as an encompassing model,

f(w) = ∫ φΣ(w − µ) dQ(µ),   (7)

where Σ = diag(σ1², . . . , σq²), µ = (µ1, . . . , µq)′,

Q = Σ_{h=1}^H πh δµh,  Σ_{h=1}^H πh = 1,  µh ∼ G,   (8)

and G is a distribution on ℜ^q. (7) and (8) correspond to the standard finite mixture of normals if π = (π1, . . . , πH)′ is generated from the Dirichlet distribution, and to the truncation approximation of the Dirichlet process mixture of normals if π is constructed through the stick-breaking process: πh = Vh ∏_{l<h} (1 − Vl).

In the variable selection setting, one tests Y ⊥ Xj | X−j for each predictor; ζ0,j denotes the corresponding true CMI and ζj(f, Pn) its estimate, defined analogously to (5) with (Xj, X−j) in place of (X, Z).

Theorem 2. Suppose for any ǫ > 0,

Π [KL{f0(y, x), f(y, x)} < ǫ] > 0,   (10)

and the following classes of functions

{ log f0(y, x)/f(y, x) : f ∈ F }, { log f0(x)/f(x) : f ∈ F },
{ log f0(y, x−j)/f(y, x−j) : f ∈ F }, { log f0(x−j)/f(x−j) : f ∈ F }, j = 1, . . . , p,

are P0-Glivenko-Cantelli. Then, for any ǫ′ > 0,

Π ( max_{1≤j≤p} |ζj(f, Pn) − ζ0,j| < ǫ′ | Dn ) → 1, almost surely P0∞.
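For concreteness, the truncated stick-breaking weights mentioned above can be sampled as below; the Beta(1, α) stick proposal corresponds to a truncated Dirichlet process with concentration α (the specific α and truncation level are our illustrative choices):

```python
import numpy as np

def stick_breaking_weights(H, alpha, rng):
    """pi_h = V_h * prod_{l<h} (1 - V_l), V_h ~ Beta(1, alpha),
    with V_H set to 1 so the truncated weights sum exactly to one."""
    V = rng.beta(1.0, alpha, size=H)
    V[-1] = 1.0                                # truncation: close the stick
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - V[:-1])])
    return V * remaining

rng = np.random.default_rng(1)
pi = stick_breaking_weights(H=20, alpha=1.0, rng=rng)
# pi is a valid probability vector over the H mixture components
```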

We illustrate a simple but non-trivial encompassing model which satisfies the sufficient conditions in Theorem 2. Let y ∈ ℜ, x ∈ ℜ^p and let φa be the univariate normal density with mean 0 and standard deviation a. Then, we consider the location mixture of normals where the kernel is the product of a regression density for the response and independent normal densities for the predictors,

f(y, x) = ∫ φσ(y − x̃′β) ∏_{j=1}^p φτj(xj − µj) Q(dβ, dµ),   (11)

where x̃ = (1, x′)′, β = (β0, . . . , βp)′, τ = (τ1, . . . , τp)′ and µ = (µ1, . . . , µp)′. Dirichlet process mixture models of this type have been widely studied (West et al. (1994); Escobar and West (1995); Müller et al. (1996); Hannah et al. (2011)). Compared with (7), each component in the mixture model is more flexible, and hence the model favors fewer components. We assume the mixing measure Q is discrete with

Q = Σ_{h=1}^H πh δ(βh,µh),  Σ_{h=1}^H πh = 1,  (βh, µh) ∼ G,   (12)

where βh = (β0,h, . . . , βp,h)′, µh = (µ1,h, . . . , µp,h)′ and G is a distribution on ℜ^{p+1} × ℜ^p. As discussed in subsection 2.2, this class of probabilities includes the finite mixture model and truncation approximations to Dirichlet process mixtures. The prior distribution for the joint densities is induced through Π = ΠQ × Π(σ,τ), where ΠQ and Π(σ,τ) are the prior distributions for Q and (σ, τ).

Lemma 2. Suppose the true density can be expressed in the form f0(y, x) = ∫ φσ0(y − x̃′β) ∏_{j=1}^p φτ0,j(xj − µj) Q0(dβ, dµ). If Q0, G and Π(σ,τ) have compact supports, Q0 belongs to the support of ΠQ and (σ0, τ0) is in the support of Π(σ,τ), then Π ( max_{1≤j≤p} |ζj(f, Pn) − ζ0,j| < ǫ′ | Dn ) → 1 a.s. P0∞.

The proof is given in Appendix D. As mentioned in Lemma 1, the result can be extended to the location-scale mixture of normals.
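A sketch of evaluating the encompassing density (11)-(12) at a single point for one draw of the mixture parameters; all parameter values below are fabricated for illustration:

```python
import numpy as np
from math import log, pi as PI

def log_norm_pdf(u, sd):
    """Univariate normal log-density with mean 0 and standard deviation sd."""
    return -0.5 * log(2 * PI) - np.log(sd) - 0.5 * (u / sd) ** 2

def joint_density(y, x, pi_h, beta, mu, sigma, tau):
    """f(y, x) = sum_h pi_h * phi_sigma(y - x~'beta_h) * prod_j phi_tau_j(x_j - mu_jh),
    with beta: (H, p+1), mu: (H, p), pi_h: (H,), and scalar sigma, tau here."""
    xt = np.concatenate([[1.0], x])                       # x~ = (1, x')'
    log_comp = (log_norm_pdf(y - beta @ xt, sigma)
                + np.sum(log_norm_pdf(x - mu, tau), axis=1))
    # Log-sum-exp over components for numerical stability.
    m = log_comp.max()
    return np.exp(m) * np.sum(pi_h * np.exp(log_comp - m))

H, p = 3, 2
rng = np.random.default_rng(2)
pi_h = np.array([0.5, 0.3, 0.2])
beta = rng.normal(size=(H, p + 1))
mu = rng.normal(size=(H, p))
f = joint_density(0.1, np.array([0.2, -0.3]), pi_h, beta, mu, sigma=1.0, tau=1.0)
# f is a strictly positive density value
```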

3 Simulation study

In this section, we assess the performance of the proposed method by comparing it to frequentist nonparametric methods. As competitors, we employ the method based on cumulative distribution functions with a Cramér-von-Mises type statistic (CM: Linton and Gozalo (1996)), the kernel measure method based on normalized cross-covariance operators on reproducing kernel Hilbert spaces (NCCO: Fukumizu et al. (2008)) and the asymmetric quadratic measure (AQM: Seth and Principe (2012)). The matlab codes for these methods are available at http://www.sohanseth.com/Home/codes. We use the default settings recommended in Seth and Principe (2012). Also, as a popular parametric method, we add Lasso to the list, for which we use the regularization coefficient producing the minimum mean square error with 5-fold cross validation.

Letting TP, FP, TN and FN denote true positives, false positives, true negatives and false negatives, we evaluate the test performance based on the following measures: Type 1 error (FP/(FP+TN)), Type 2 error (FN/(TP+FN)), positive predictive value (TP/(TP+FP)), negative predictive value (TN/(TN+FN)) and accuracy ((TP+TN)/(TP+FP+TN+FN)). Small values for the Type 1 and 2 errors and large values for the rest indicate good performance.

As an encompassing model for the proposed method, we employ a Dirichlet process location-scale mixture of normals, (13)-(14): the location-scale analogue of (11)-(12), with kernel φσ(y − x̃′β) and stick-breaking weights πh = Vh ∏_{l<h} (1 − Vl). We select the predictor xj when Π(ζj(f, Pn) > 0 | Dn) > 0.95, j = 1, . . . , p. We consider three different data-generating functions from which we simulate 100 data sets with n = 100 and p = 10. First, we generate data from a linear regression model with strong dependence among predictors.

Case 1:

yi = −xi,1 + xi,4 − xi,7 + εi,  εi ∼ N(0, 1),
xi = (xi,1, . . . , xi,10)′ ∼ N(0, Σx),  Σx = {σj,j′},  σj,j′ = Cov(xi,j, xi,j′) = 0.7^{|j−j′|}.

The left panel in Figure 1 and the last column in Table 1 show the ROC curves and area under the curve (AUC) averaged over 100 data sets in Case 1. For the proposed method, we obtain the curve by shifting the threshold a in Π(ζj(f, Pn) > a | Dn) > 0.95. For LASSO, we shift the threshold for the absolute values of the regression coefficients. Although the AUC for the proposed method is slightly smaller than that for LASSO and AQM, it is large and close to one.

Table 1 reports the averaged measures of the test performance over 100 data sets in Case 1. For LASSO, its high Type 1 error and low PPV indicate it incorrectly rejects many hypotheses. Though the data are generated from the linear model, the strong dependence among predictors can cause poor performance. On the other hand, the high Type 2 errors and low NPV of CM and AQM imply that they often fail to detect dependent relations. NCCO also faces the same problem of missing dependent predictors, but its performance is much better, with high ACC. The proposed method works quite well, reporting small Type 1 and 2 errors and high PPV and NPV. Compared to NCCO, there is not a big difference in the measures involving false positives, but it less often produces false negatives, showing a lower Type 2 error and higher NPV. Also, it reports the highest accuracy. Therefore, the

proposed method outperforms the competitors.

Next, we generate data from a model in which the strong dependence among predictors remains but the relation between the response and predictors becomes non-linear.

Case 2:
yi = −xi,1 + exp(xi,4) − x²i,7 + εi,  εi ∼ N(0, 1),
xi = (xi,1, . . . , xi,10)′ ∼ N(0, Σx),  Σx = {σj,j′},  σj,j′ = Cov(xi,j, xi,j′) = 0.7^{|j−j′|}.

The ROC curves and AUC in Case 2 are given in the middle panel of Figure 1 and in Table 2. Though the competitors' curves are away from the random-guess line y = x, the proposed method shows the largest AUC. Table 2 summarizes the test performance measures. The proposed method reports small Type 1 and 2 errors and high PPV, NPV and ACC. From the high Type 1 error and small PPV, LASSO tends to wrongly pick up conditionally independent predictors. The high Type 2 errors and small NPV indicate CM and AQM have difficulty in finding dependent structures. NCCO performs better than CM and AQM but still reports a high Type 2 error and low NPV compared to the proposed method. Hence, we conclude the proposed method outperforms the other methods.

We also simulate data from a different non-linear model where the dependence comes from division of the sample into subgroups and non-linear regressions.

Case 3:
yi = 0.8x²i,1 − xi,4 + εi,  εi ∼ N(0, 0.7²),  if si = 0,
yi = −xi,1 + 1.2 exp(xi,7) + εi,  εi ∼ N(0, 1),  if si = 1,
si ∼ Multinomial(1, 0.5),
xi,j ∼ N(µj,si, σ²j,si),  j = 1, . . . , 10,
µj,s ∼ N(0, 1),  σ²j,s ∼ Inverse-Gamma(2, 0.5),  s ∈ {0, 1},
µj,0 = µj,1,  σ²j,0 = σ²j,1,  j ∉ {1, 4, 7}.
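The three simulation designs are easy to reproduce; a sketch for Case 1 (Cases 2 and 3 modify only the regression function and the predictor distribution; this is our illustrative code, not the authors'):

```python
import numpy as np

def simulate_case1(n=100, p=10, rng=None):
    """Case 1: y = -x1 + x4 - x7 + eps, eps ~ N(0, 1), with strongly
    dependent predictors, Cov(x_j, x_j') = 0.7 ** |j - j'|."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = np.arange(p)
    Sigma = 0.7 ** np.abs(idx[:, None] - idx[None, :])
    x = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = -x[:, 0] + x[:, 3] - x[:, 6] + rng.normal(size=n)
    return y, x

y, x = simulate_case1()   # n = 100 observations, p = 10 correlated predictors
```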

The right plot in Figure 1 and the last column in Table 3 correspond to the ROC curves and AUC in Case 3. It shows CM works poorly, since its curve is quite close to the random-guess line. The AUC for the proposed method is smaller than that for AQM, but the curve is still far away from the y = x line. Table 3 reports the measures of the test performance. LASSO is likely to reject correct hypotheses, and CM produces the

worst results in all measures except Type 1 error. The proposed method, NCCO and AQM show small Type 1 errors and high PPV, indicating they are less likely to produce false positives. As for false negatives, the differences in Type 2 error and NPV between the proposed method and NCCO are small, with AQM slightly worse. Also, the proposed method reports the highest ACC among them. Therefore, we can conclude the proposed method works well.

Method     Type 1   Type 2   PPV     NPV     ACC     AUC
Proposed   0.022    0.126    0.955   0.956   0.946   0.984
LASSO      0.497    0.000    0.506   1.000   0.652   0.999
CM         0.002    0.803    0.971   0.746   0.757   0.806
NCCO       0.001    0.243    0.996   0.913   0.926   0.928
AQM        0.000    0.676    1.000   0.779   0.797   0.986

Table 1: Averages of the test performance measures over 100 data sets in Case 1. PPV, NPV, ACC and AUC represent positive predictive value, negative predictive value, accuracy and area under the curve.

Method     Type 1   Type 2   PPV     NPV     ACC     AUC
Proposed   0.040    0.120    0.928   0.955   0.936   0.989
LASSO      0.320    0.200    0.585   0.891   0.716   0.848
CM         0.017    0.906    0.719   0.717   0.716   0.643
NCCO       0.002    0.370    0.994   0.874   0.887   0.878
AQM        0.000    0.760    1.000   0.759   0.772   0.973

Table 2: Averages of the test performance measures over 100 data sets in Case 2. PPV, NPV, ACC and AUC represent positive predictive value, negative predictive value, accuracy and area under the curve.

Method     Type 1   Type 2   PPV     NPV     ACC     AUC
Proposed   0.028    0.270    0.943   0.904   0.899   0.896
LASSO      0.272    0.276    0.647   0.888   0.726   0.784
CM         0.155    0.780    0.432   0.721   0.657   0.476
NCCO       0.035    0.270    0.940   0.902   0.894   0.824
AQM        0.002    0.413    0.994   0.855   0.874   0.947

Table 3: Averages of the test performance measures over 100 data sets in Case 3. PPV, NPV, ACC and AUC represent positive predictive value, negative predictive value, accuracy and area under the curve.


Figure 1: ROC curves in Case 1 (left), Case 2 (middle) and Case 3 (right). The y axis represents the true positive rate and the x axis the false positive rate. Blue crosses, pink diamonds, red squares, green circles and purple triangles indicate the averages of the true and false positive rates over 100 data sets for the proposed method, LASSO, CM, NCCO and AQM.
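The ROC construction described in Section 3 (sweeping the threshold a over per-predictor scores) can be sketched generically; the scores and truth below are fabricated:

```python
import numpy as np

def roc_points(scores, truth, thresholds):
    """Trace (FPR, TPR) pairs by sweeping the decision threshold; scores are
    generic per-predictor statistics, e.g. posterior probabilities for the
    proposed method or absolute coefficients for LASSO."""
    pts = []
    truth = np.asarray(truth, dtype=bool)
    for a in thresholds:
        sel = np.asarray(scores) > a
        tpr = (sel & truth).sum() / truth.sum()
        fpr = (sel & ~truth).sum() / (~truth).sum()
        pts.append((fpr, tpr))
    return pts

scores = [0.99, 0.1, 0.2, 0.97, 0.3, 0.05, 0.6, 0.1, 0.2, 0.15]
truth = [j in (0, 3, 6) for j in range(10)]
pts = roc_points(scores, truth, thresholds=[0.0, 0.5, 0.95, 1.0])
# threshold 0 selects everything (corner (1, 1)); threshold 1 selects nothing
```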

4 Application to criminology data

In this section, we apply the proposed method to the communities and crime data from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime+Unnormalized). The data set is culled from the 1990 US Census, the 1995 US FBI Uniform Crime Report and the 1990 US Law Enforcement Management and Administrative Statistics Survey. It includes various types of crime data and demographic information for n = 2,215 communities in the US.

Sources: [1] U.S. Department of Commerce, Bureau of the Census, Census of Population and Housing 1990 United States: Summary Tape File 1a and 3a; [2] U.S. Department of Commerce, Bureau of the Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research, Ann Arbor, Michigan (1992); [3] U.S. Department of Justice, Bureau of Justice Statistics, Law Enforcement Management and Administrative Statistics; [4] U.S. Department of Commerce, Bureau of the Census Producer, Washington, DC and Inter-university Consortium for Political and Social Research, Ann Arbor, Michigan (1992); [5] U.S. Department of Justice, Federal Bureau of Investigation, Crime in the United States (1995).

We use 10 crimes as responses: numbers of murders, rapes, robberies, assaults, burglaries, larcenies, auto thefts, arsons, violent crimes (sum of murders, rapes, robberies and assaults) and non-violent crimes (sum of burglaries, larcenies, auto thefts and arsons). As predictors, we select p = 68 variables, such as per capita income and population

density, which indicate demographic characteristics of the communities. The list is given in the supplemental materials. The data set consists of count, percentage and positive continuous variables. We observe that the count variables have right-skewed distributions and the percentage variables can inflate at 0% and 100%. Also, the data set includes missing values in the response.

To incorporate mixed-scale measurements, we develop a joint model which relies on the rounded kernel method of Canale and Dunson (2011). Let y∗ ∈ ℜ and x∗ = (x∗1, . . . , x∗p)′ ∈ ℜ^p be latent continuous variables for the response y and predictors x = (x1, . . . , xp)′. We induce a flexible nonparametric model on y and x through a Dirichlet process mixture of normals for the latent variables. If xj is a count variable, it can be expressed as xj = l if al < x∗j ≤ al+1, l = 0, 1, 2, . . ., where −∞ = a0 < a1 < a2 < · · · with al = log(l) for l ≥ 1. Since the log function shrinks large values, a distribution with positive skewness can be efficiently approximated by mixtures of normals with the log cut-points. A percentage variable which can inflate at 0% and 100% can be induced by

xj = 0 if x∗j ≤ 0,  xj = x∗j if 0 < x∗j < 100,  xj = 100 if 100 ≤ x∗j.

As for a positive continuous variable, we can set xj = exp(x∗j), which corresponds to applying the log transformation to the original data and modeling it as a continuous variable on the real line. For the latent variables, we utilize the Dirichlet process mixture of normals (13) and (14), except we use the observed predictors for the regression on y∗. Then, we obtain the joint model of y and x by integrating out the latent variables,

f(y, x) = Σ_{h=1}^H πh f(y | x, θh) ⋯

with πh = Vh ∏_{l<h} (1 − Vl). We select the predictor xj when Π(ζj(f, Pn) > 0 | Dn) > 0.95, j = 1, . . . , p. In the computation of ζj(f, Pn) in (9), we evaluate f(yi, xi,−j) using a Monte Carlo approximation with 500 samples.

Figures 2-5 show 90% credible intervals of ζj(f, Pn) for all j, and Tables 4-7 report the top 10 selected predictors in descending order of the posterior mean CMI with each crime as the response. Full lists of the selected predictors are given in the supplemental materials. Certain predictors are selected for many different crime-related response variables. For all crimes, land area and population density show the first and second largest conditional dependence adjusting for other factors. Also, their posterior means of the CMI are much larger than those of other predictors, especially for burglaries, larcenies, auto thefts and non-violent crimes. In addition, population in urban areas is selected 8 times; population, % of kids with two parents and % of persons in dense housing are picked up 7 times; and % Caucasian, % of households with investment & rent income, % of housing occupied and % of families with two parents are conditionally dependent with 6 types of crimes. On the other hand, 12 predictors, such as % of housing units with less than 3 bedrooms and % of moms of kids under 18 in labor force, are not selected for any crimes.

We can also find similarities in the top 10 lists. We observe that certain types of variables obtain high ranks for many responses. For example, all crimes except larcenies and auto thefts share at least one of population in the community and population in urban areas in their lists. In addition, % of families with parents and % of kids with parents show relatively strong conditional dependence with all crimes other than murders, auto thefts and arsons. The posterior means of CMI of race variables are large for murders, robberies, assaults and violent crimes. Also, the top 10 lists of rapes, burglaries, arsons and non-violent crimes include more than one predictor related to divorce.

We also apply the competitors discussed in Section 3 to the crime data using the same default settings. For the missing values, we impute them by the mean of observed values. The lists of the selected predictors are given in the supplemental materials. CM seems to work poorly in that it selects all predictors for all crimes.
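The rounding mappings for the mixed-scale variables described in this section can be sketched as deterministic transforms of the latent Gaussian variables (an illustrative reading of the scheme, not the authors' code):

```python
import numpy as np

def latent_to_count(x_star):
    """Count scale: x = l iff a_l < x* <= a_{l+1}, with a_0 = -inf and
    a_l = log(l) for l >= 1, i.e. x = ceil(exp(x*)) - 1."""
    return (np.ceil(np.exp(np.asarray(x_star, dtype=float))) - 1).astype(int)

def latent_to_percentage(x_star):
    """Percentage scale with point masses at 0% and 100%."""
    return np.clip(np.asarray(x_star, dtype=float), 0.0, 100.0)

def latent_to_positive(x_star):
    """Positive continuous scale: x = exp(x*), i.e. the variable is
    modeled on the log scale."""
    return np.exp(np.asarray(x_star, dtype=float))

# e.g. x* = 1.0 maps to count 2, since log(2) < 1.0 <= log(3)
counts = latent_to_count(np.array([-5.0, np.log(3.5), 1.0]))
```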
The predictors selected by LASSO overlap with those selected by the proposed method, such as population and % of housing occupied, but land area and population density are often missed. NCCO shows little difference across crimes: it basically selects the same sets of predictors for all crimes, and land area and population density are not among them. AQM shares some predictors, such as race, with the proposed method, but it also fails to pick up the top 2 variables. The inability of the other methods to detect these important predictors is likely due to their non-linear and non-monotonic relationships with the crime responses.


Figure 2: 90% credible intervals of the estimated CMIs for murders (top), rapes (middle) and robberies (bottom). The red color indicates the corresponding predictors are selected.

Murder
No.  Mean    90% CI              Predictor
66   0.2587  [0.2157, 0.2936]    land area in square miles
67   0.1188  [0.0905, 0.1454]    population density in persons per square mile
4    0.0507  [0.0302, 0.0678]    % of population that is caucasian
9    0.0250  [0.0043, 0.0636]    # of people living in areas classified as urban
1    0.0250  [0.0015, 0.0469]    population for community
3    0.0192  [0.0058, 0.0374]    % of population that is african american
57   0.0177  [0.00007, 0.0463]   rental housing: lower quartile rent
13   0.0075  [0.0004, 0.0149]    % of households with investment / rent income in 1989
6    0.0067  [0.0021, 0.0125]    % of population that is of hispanic heritage
64   0.0039  [0.0005, 0.0067]    % of people born in the same state as currently living

Rape
No.  Mean    90% CI              Predictor
66   0.4168  [0.3929, 0.4428]    land area in square miles
67   0.1964  [0.1727, 0.2217]    population density in persons per square mile
1    0.0680  [0.0523, 0.0865]    population for community
9    0.0359  [0.0086, 0.0608]    # of people living in areas classified as urban
30   0.0189  [0.0013, 0.0379]    % of population who are divorced
32   0.0178  [0.0009, 0.0398]    % of families (with kids) that are headed by two parents
33   0.0174  [0.0006, 0.0389]    % of kids in family housing with two parents
29   0.0156  [0.0005, 0.0330]    % of females who are divorced
27   0.0123  [0.0009, 0.0265]    % of males who are divorced
39   0.0051  [0.0004, 0.0118]    total number of people known to be foreign born

Robbery
No.  Mean    90% CI              Predictor
66   0.6074  [0.5551, 0.6554]    land area in square miles
67   0.5080  [0.4548, 0.5605]    population density in persons per square mile
33   0.0859  [0.0545, 0.1203]    % of kids in family housing with two parents
4    0.0652  [0.0353, 0.0953]    % of population that is caucasian
3    0.0530  [0.0211, 0.0865]    % of population that is african american
9    0.0469  [0.0078, 0.0926]    # of people living in areas classified as urban
1    0.0388  [0.0268, 0.0623]    population for community
47   0.0277  [0.0084, 0.0493]    % of persons in dense housing (more than 1 person per room)
30   0.0159  [0.0007, 0.0348]    % of population who are divorced
18   0.0139  [0.0009, 0.0326]    per capita income

Table 4: List of top 10 selected predictors in descending order of the posterior mean for murders (top), rapes (middle) and robberies (bottom). Mean and 90% CI indicate the posterior mean and 90% credible interval.


[Figure 3 here: three panels (Assault, Burglary, Larceny), each plotting the estimated CMI (y-axis) with 90% credible intervals against the predictor index 1-68 (x-axis).]

Figure 3: 90% credible intervals of the estimated CMIs for assaults (top), burglaries (middle) and larcenies (bottom). The red color indicates the corresponding predictors are selected.

Assault
No.  Mean    90% CI              Predictor
66   0.3380  [0.2897, 0.3914]    land area in square miles
67   0.1760  [0.1318, 0.2267]    population density in persons per square mile
9    0.0760  [0.0451, 0.0996]    # of people living in areas classified as urban
1    0.0413  [0.0186, 0.0641]    population for community
33   0.0350  [0.0114, 0.0571]    % of kids in family housing with two parents
13   0.0348  [0.0234, 0.0478]    % of households with investment / rent income in 1989
32   0.0176  [0.0010, 0.0403]    % of families (with kids) that are headed by two parents
47   0.0171  [0.0057, 0.0283]    % of persons in dense housing (more than 1 person per room)
4    0.0168  [0.0046, 0.0284]    % of population that is caucasian
3    0.0070  [0.0004, 0.0174]    % of population that is african american

Burglary
No.  Mean    90% CI              Predictor
66   0.9177  [0.8717, 0.9492]    land area in square miles
67   0.7075  [0.6639, 0.7464]    population density in persons per square mile
33   0.0508  [0.0241, 0.0796]    % of kids in family housing with two parents
47   0.0281  [0.0146, 0.0444]    % of persons in dense housing (more than 1 person per room)
29   0.0173  [0.0100, 0.0276]    % of females who are divorced
50   0.0152  [0.0071, 0.0236]    % of housing occupied
13   0.0135  [0.0008, 0.0303]    % of households with investment / rent income in 1989
6    0.0097  [0.00007, 0.0166]   % of population that is of hispanic heritage
30   0.0083  [0.0001, 0.0224]    % of population who are divorced
9    0.0078  [0.0004, 0.0258]    # of people living in areas classified as urban

Larceny
No.  Mean    90% CI              Predictor
66   0.9425  [0.9149, 0.9682]    land area in square miles
67   0.8035  [0.7707, 0.8359]    population density in persons per square mile
32   0.0305  [0.0003, 0.0505]    % of families (with kids) that are headed by two parents
2    0.0233  [0.0126, 0.0397]    mean people per household
22   0.0219  [0.00001, 0.0436]   % of people 25 and over with a bachelors degree or higher education
35   0.0217  [0.0085, 0.0383]    % of kids age 12-17 in two parent households
65   0.0165  [0.0062, 0.0256]    % of people living in the same city as in 1985 (5 years before)
8    0.0163  [0.0008, 0.0321]    % of population that is 65 and over in age
45   0.0135  [0.00002, 0.0520]   % of all occupied households that are large (6 or more people)
33   0.0133  [0.00002, 0.0422]   % of kids in family housing with two parents

Table 5: List of top 10 selected predictors in descending order of the posterior mean for assaults (top), burglaries (middle) and larcenies (bottom). Mean and 90% CI indicate the posterior mean and 90% credible interval.


[Figure 4 here: three panels (Auto Theft, Arson, Violent Crime), each plotting the estimated CMI (y-axis) with 90% credible intervals against the predictor index 1-68 (x-axis).]

Figure 4: 90% credible intervals of the estimated CMIs for auto thefts (top), arsons (middle) and violent crimes (bottom). The red color indicates the corresponding predictors are selected.

Auto Theft
No.  Mean    90% CI              Predictor
66   0.7650  [0.7310, 0.8011]    land area in square miles
67   0.6471  [0.6098, 0.6847]    population density in persons per square mile
47   0.0298  [0.0164, 0.0437]    % of persons in dense housing (more than 1 person per room)
30   0.0245  [0.0008, 0.0541]    % of population who are divorced
18   0.0229  [0.0001, 0.0626]    per capita income
13   0.0211  [0.0054, 0.0405]    % of households with investment / rent income in 1989
46   0.0197  [0.00001, 0.0899]   % of people in owner occupied households
60   0.0138  [0.0054, 0.0342]    median gross rent
53   0.0119  [0.0050, 0.0178]    % of vacant housing that has been vacant more than 6 months
4    0.0095  [0.0004, 0.0214]    % of population that is caucasian

Arson
No.  Mean    90% CI              Predictor
66   0.3030  [0.2517, 0.3593]    land area in square miles
67   0.1619  [0.1226, 0.2084]    population density in persons per square mile
1    0.0394  [0.0131, 0.0689]    population for community
9    0.0152  [0.0010, 0.0471]    # of people living in areas classified as urban
19   0.0131  [0.0005, 0.0323]    # of people under the poverty level
27   0.0119  [0.0022, 0.0229]    % of males who are divorced
13   0.0085  [0.0004, 0.0168]    % of households with investment / rent income in 1989
29   0.0078  [0.0001, 0.0212]    % of females who are divorced
41   0.0039  [0.0013, 0.0071]    % of population who have immigrated within the last 5 years
15   0.0031  [0.0004, 0.0065]    % of households with public assistance income in 1989

Violent Crime
No.  Mean    90% CI              Predictor
66   0.5254  [0.4868, 0.5763]    land area in square miles
67   0.3515  [0.3106, 0.4052]    population density in persons per square mile
9    0.1004  [0.0589, 0.1498]    # of people living in areas classified as urban
33   0.0751  [0.0412, 0.1058]    % of kids in family housing with two parents
47   0.0272  [0.0140, 0.0417]    % of persons in dense housing (more than 1 person per room)
32   0.0242  [0.0012, 0.0581]    % of families (with kids) that are headed by two parents
13   0.0242  [0.0094, 0.0451]    % of households with investment / rent income in 1989
4    0.0163  [0.0029, 0.0329]    % of population that is caucasian
1    0.0153  [0.0003, 0.0394]    population for community
3    0.0137  [0.0014, 0.0278]    % of population that is african american

Table 6: List of top 10 selected predictors in descending order of the posterior mean for auto thefts (top), arsons (middle) and violent crimes (bottom). Mean and 90% CI indicate the posterior mean and 90% credible interval.


[Figure 5 here: a single panel (Non-Violent Crime) plotting the estimated CMI (y-axis) with 90% credible intervals against the predictor index 1-68 (x-axis).]

Figure 5: 90% credible intervals of the estimated CMIs for non-violent crimes. The red color indicates the corresponding predictors are selected.

Non-Violent Crime
No.  Mean    90% CI              Predictor
66   0.9859  [0.9500, 1.0189]    land area in square miles
67   0.8282  [0.7870, 0.8700]    population density in persons per square mile
32   0.0300  [0.0082, 0.0518]    % of families (with kids) that are headed by two parents
28   0.0217  [0.0011, 0.0475]    % of males who have never married
33   0.0217  [0.0015, 0.0484]    % of kids in family housing with two parents
9    0.0200  [0.0017, 0.0518]    # of people living in areas classified as urban
30   0.0183  [0.0006, 0.0399]    % of population who are divorced
27   0.0182  [0.0001, 0.0443]    % of males who are divorced
47   0.0181  [0.0043, 0.0353]    % of persons in dense housing (more than 1 person per room)
1    0.0174  [0.0001, 0.0426]    population for community

Table 7: List of top 10 selected predictors in descending order of the posterior mean for non-violent crimes. Mean and 90% CI indicate the posterior mean and 90% credible interval.
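For a concrete sense of the estimator behind these rankings: the plug-in CMI $\zeta(f, P_n)$ is the empirical average over the observations of $\log\{f(y,x,z)f(z)/(f(y,z)f(x,z))\}$, where the joint and marginal densities are all read off one fitted mixture. The sketch below uses a diagonal-covariance Gaussian mixture standing in for a single posterior draw of the encompassing model; the function names and the toy mixture are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def normal_logpdf(x, mu, sd):
    return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((x - mu) / sd) ** 2

def mixture_logpdf(data, weights, means, sds):
    """Log density of a diagonal-covariance Gaussian mixture at each row.

    data: (n, d); weights: (H,); means, sds: (H, d).  Marginalising the
    mixture over a subset of coordinates just slices means/sds to those
    columns, which makes the plug-in CMI cheap to evaluate.
    """
    comp = np.stack([normal_logpdf(data, means[h], sds[h]).sum(axis=1)
                     for h in range(len(weights))], axis=1)   # (n, H)
    m = comp.max(axis=1, keepdims=True)                       # log-sum-exp
    return m.ravel() + np.log(np.exp(comp - m) @ weights)

def plug_in_cmi(data, weights, means, sds):
    """zeta(f, P_n): average of log[f(y,x,z)f(z) / (f(y,z)f(x,z))]
    over the observed points, for data columns ordered (y, x, z)."""
    lp = lambda cols: mixture_logpdf(data[:, cols], weights,
                                     means[:, cols], sds[:, cols])
    return float(np.mean(lp([0, 1, 2]) + lp([2]) - lp([0, 2]) - lp([1, 2])))

rng = np.random.default_rng(1)

# under a single independent Gaussian component the log ratio is 0 pointwise
w1, mu1, sd1 = np.ones(1), np.zeros((1, 3)), np.ones((1, 3))
cmi_indep = plug_in_cmi(rng.normal(size=(500, 3)), w1, mu1, sd1)

# two well-separated components make y and x co-move given z: positive CMI
w2 = np.array([0.5, 0.5])
mu2 = np.array([[-2.0, -2.0, 0.0], [2.0, 2.0, 0.0]])
sd2 = np.ones((2, 3))
h = rng.integers(0, 2, size=2000)
cmi_dep = plug_in_cmi(mu2[h] + rng.normal(size=(2000, 3)), w2, mu2, sd2)
```

In the independent case the density ratio is identically one, so `cmi_indep` is zero up to rounding, while `cmi_dep` is clearly positive; in the paper the analogous average is evaluated across posterior draws of $f$ to produce the intervals reported above.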

Acknowledgement

This work was supported by the Nakajima Foundation and by grants ES017436 and ES017240 from the National Institute of Environmental Health Sciences (NIEHS) of the US National Institutes of Health. The computational results were mainly generated using Ox (Doornik, 2006) and Matlab.


A  Proof of Theorem 1

For $\epsilon > 0$, define $E = \{f : \mathrm{KL}(f_0(y,x,z), f(y,x,z)) < \epsilon\}$. Then there exists $N$ such that for $n > N$ and $f \in E$,
\begin{align*}
|\zeta(f, P_n) - \zeta_0|
&= \left| \int \log \frac{f(y,x,z)f(z)}{f(y,z)f(x,z)} \, dP_n - \int \log \frac{f_0(y,x,z)f_0(z)}{f_0(y,z)f_0(x,z)} \, dP_0 \right| \\
&\leq \sup_{f \in \mathcal{F}} \left| \int \log \frac{f_0(y,x,z)}{f(y,x,z)} \, dP_n - \int \log \frac{f_0(y,x,z)}{f(y,x,z)} \, dP_0 \right| \tag{16} \\
&\quad + \sup_{f \in \mathcal{F}} \left| \int \log \frac{f_0(y,z)}{f(y,z)} \, dP_n - \int \log \frac{f_0(y,z)}{f(y,z)} \, dP_0 \right| \tag{17} \\
&\quad + \sup_{f \in \mathcal{F}} \left| \int \log \frac{f_0(x,z)}{f(x,z)} \, dP_n - \int \log \frac{f_0(x,z)}{f(x,z)} \, dP_0 \right| \tag{18} \\
&\quad + \sup_{f \in \mathcal{F}} \left| \int \log \frac{f_0(z)}{f(z)} \, dP_n - \int \log \frac{f_0(z)}{f(z)} \, dP_0 \right| \tag{19} \\
&\quad + \left| \int \log \frac{f_0(y,x,z)f_0(z)}{f_0(y,z)f_0(x,z)} \, dP_n - \int \log \frac{f_0(y,x,z)f_0(z)}{f_0(y,z)f_0(x,z)} \, dP_0 \right| \tag{20} \\
&\quad + \int \log \frac{f_0(y,x,z)}{f(y,x,z)} \, dP_0 + \int \log \frac{f_0(y,z)}{f(y,z)} \, dP_0 \tag{21} \\
&\quad + \int \log \frac{f_0(x,z)}{f(x,z)} \, dP_0 + \int \log \frac{f_0(z)}{f(z)} \, dP_0 \tag{22} \\
&\leq 9\epsilon.
\end{align*}
Each term in (16)-(19) can be bounded by $\epsilon$ from the definition of $P_0$-Glivenko-Cantelli classes. The term (20) goes to zero by the strong law of large numbers. The sums (21) and (22) are each bounded by $2\epsilon$; this follows from the non-negativity of the KL divergence, for example,
\[
\int \log \frac{f_0(y,z)}{f(y,z)} \, dP_0 \leq \int \log \frac{f_0(y,z)}{f(y,z)} \, dP_0 + \int \log \frac{f_0(x \mid y,z)}{f(x \mid y,z)} \, dP_0 = \int \log \frac{f_0(y,x,z)}{f(y,x,z)} \, dP_0 < \epsilon.
\]
Hence, setting $\epsilon' = 9\epsilon$, we have $E \subset \{f : |\zeta(f, P_n) - \zeta_0| < \epsilon'\}$. Norets (2012) shows that if $\{\log(f_0(y,x,z)/f(y,x,z)), f \in \mathcal{F}\}$ is $P_0$-Glivenko-Cantelli and the KL support condition (6) is satisfied, then the posterior converges to the true data-generating density in KL divergence. Therefore, $\Pi(|\zeta(f, P_n) - \zeta_0| < \epsilon' \mid D_n) \geq \Pi(E \mid D_n) \to 1$ almost surely $P_0^\infty$.

B  Proof of Proposition 1

Without loss of generality, we assume $d_1 = d_2 = d_3 = 1$. In this proof, let $\phi_\sigma$ denote the normal density with mean $0$ and standard deviation $\sigma$. First, we show that the encompassing model satisfies the KL support condition (6); this part relies on Theorem 3 in Ghosal et al. (1999). Since $Q_0$ and $G$ have compact support, we suppose $Q_0(A) = 1$ and $Q(B) = 1$, where $A = \{\mu : -k \leq \mu_1, \mu_2, \mu_3 \leq k\}$, $B = \{\mu : -k' \leq \mu_1, \mu_2, \mu_3 \leq k'\}$ and $Q$ is in the support of $\Pi_Q$. Because $f_0$ has moments of all orders, for any $\eta > 0$ there exists $a$ such that $\int_{|y|>a} g f_0 \, dy\,dx\,dz < \eta$, $\int_{|x|>a} g f_0 \, dy\,dx\,dz < \eta$ and $\int_{|z|>a} g f_0 \, dy\,dx\,dz < \eta$, where $g(y,x,z) = \max(1,|y|) + \max(1,|x|) + \max(1,|z|)$. The KL divergence between $f_0$ and $f$ can be expressed as
\begin{align*}
\int f_0 \log \frac{f_0}{f}
&= \int f_0(y,x,z) \log \frac{\int \phi_{\sigma_{0,1}}(y-\mu_1)\,\phi_{\sigma_{0,2}}(x-\mu_2)\,\phi_{\sigma_{0,3}}(z-\mu_3)\, dQ_0(\mu)}{\int \phi_{\sigma_1}(y-\mu_1)\,\phi_{\sigma_2}(x-\mu_2)\,\phi_{\sigma_3}(z-\mu_3)\, dQ_0(\mu)} \, dy\,dx\,dz \tag{23} \\
&\quad + \int f_0(y,x,z) \log \frac{\int \phi_{\sigma_1}(y-\mu_1)\,\phi_{\sigma_2}(x-\mu_2)\,\phi_{\sigma_3}(z-\mu_3)\, dQ_0(\mu)}{\int \phi_{\sigma_1}(y-\mu_1)\,\phi_{\sigma_2}(x-\mu_2)\,\phi_{\sigma_3}(z-\mu_3)\, dQ(\mu)} \, dy\,dx\,dz. \tag{24}
\end{align*}
For the term (24), we divide the support of $(y,x,z)$ into $C = \{(y,x,z) \in \mathbb{R}^3 : -a \leq y, x, z \leq a\}$ and $C^c$. In the set $C^c$, at least one of the variables is less than $-a$ or larger than $a$; consider, for example, the case $y < -a$ and $-a \leq x, z \leq a$. Define $\bar{\mu}_1$ and $\mu_1^*$, which are functions of $y$, such that $\phi_{\sigma_1}(y - \bar{\mu}_1) = \sup_{|\mu_1| \leq k} \phi_{\sigma_1}(y - \mu_1)$ and $\phi_{\sigma_1}(y - \mu_1^*) = \inf_{|\mu_1| \leq k'} \phi_{\sigma_1}(y - \mu_1)$, and define $\bar{\mu}_2, \mu_2^*, \bar{\mu}_3, \mu_3^*$ as functions of $x$ and $z$ analogously. Then
\begin{align*}
&\int_{-\infty}^{-a}\!\int_{-a}^{a}\!\int_{-a}^{a} f_0(y,x,z) \log \frac{\int \phi_{\sigma_1}(y-\mu_1)\,\phi_{\sigma_2}(x-\mu_2)\,\phi_{\sigma_3}(z-\mu_3)\, dQ_0(\mu)}{\int \phi_{\sigma_1}(y-\mu_1)\,\phi_{\sigma_2}(x-\mu_2)\,\phi_{\sigma_3}(z-\mu_3)\, dQ(\mu)} \, dy\,dx\,dz \\
&\leq \int_{-\infty}^{-a}\!\int_{-a}^{a}\!\int_{-a}^{a} f_0(y,x,z) \log \frac{\sup_{|\mu_1|\leq k}\phi_{\sigma_1}(y-\mu_1)\, \sup_{|\mu_2|\leq k}\phi_{\sigma_2}(x-\mu_2)\, \sup_{|\mu_3|\leq k}\phi_{\sigma_3}(z-\mu_3)}{\inf_{|\mu_1|\leq k'}\phi_{\sigma_1}(y-\mu_1)\, \inf_{|\mu_2|\leq k'}\phi_{\sigma_2}(x-\mu_2)\, \inf_{|\mu_3|\leq k'}\phi_{\sigma_3}(z-\mu_3)} \, dy\,dx\,dz \\
&\leq \int_{-\infty}^{-a}\!\int\!\int f_0 \log \frac{\phi_{\sigma_1}(y-\bar{\mu}_1)}{\phi_{\sigma_1}(y-\mu_1^*)} \, dy\,dx\,dz + \int_{-\infty}^{-a}\!\int\!\int f_0 \log \frac{\phi_{\sigma_2}(x-\bar{\mu}_2)}{\phi_{\sigma_2}(x-\mu_2^*)} \, dy\,dx\,dz + \int_{-\infty}^{-a}\!\int\!\int f_0 \log \frac{\phi_{\sigma_3}(z-\bar{\mu}_3)}{\phi_{\sigma_3}(z-\mu_3^*)} \, dy\,dx\,dz \\
&\leq \int_{-\infty}^{-a}\!\int\!\int \left( \frac{k+k'}{\sigma_1^2}|y| + \frac{k^2+k'^2}{2\sigma_1^2} \right) f_0 \, dy\,dx\,dz + \int_{-\infty}^{-a}\!\int\!\int \left( \frac{k+k'}{\sigma_2^2}|x| + \frac{k^2+k'^2}{2\sigma_2^2} \right) f_0 \, dy\,dx\,dz \\
&\quad + \int_{-\infty}^{-a}\!\int\!\int \left( \frac{k+k'}{\sigma_3^2}|z| + \frac{k^2+k'^2}{2\sigma_3^2} \right) f_0 \, dy\,dx\,dz \\
&< \left( \sum_{j=1}^{3}\frac{k+k'}{\sigma_j^2} + \sum_{j=1}^{3}\frac{k^2+k'^2}{2\sigma_j^2} \right)\eta. \tag{25}
\end{align*}
For the remaining regions in $C^c$, the corresponding integrals can be bounded by (25). For the region $C$, following Ghosal et al. (1999), it can be shown that for $0 < \tilde{\eta} < 1/3$ there exists a set $E$ of probability measures with $\Pi_Q(E) > 0$ such that for $Q \in E$,
\[
\left| \frac{\int \phi_{\sigma_1}(y-\mu_1)\,\phi_{\sigma_2}(x-\mu_2)\,\phi_{\sigma_3}(z-\mu_3)\, dQ_0(\mu)}{\int \phi_{\sigma_1}(y-\mu_1)\,\phi_{\sigma_2}(x-\mu_2)\,\phi_{\sigma_3}(z-\mu_3)\, dQ(\mu)} - 1 \right| < \frac{3\tilde{\eta}}{1-3\tilde{\eta}}.
\]
Therefore, for fixed $\Sigma$ and $Q \in E$, (24) is less than
\[
\left( \sum_{j=1}^{3}\frac{k+k'}{\sigma_j^2} + \sum_{j=1}^{3}\frac{k^2+k'^2}{2\sigma_j^2} \right)\eta + \frac{3\tilde{\eta}}{1-3\tilde{\eta}}. \tag{26}
\]
Also, the term (23) converges to $0$ as $\sigma_j \to \sigma_{0,j}$, $j = 1,2,3$. This follows from the dominated convergence theorem, using the inequality
\[
\frac{\int \phi_{\sigma_{0,1}}(y-\mu_1)\,\phi_{\sigma_{0,2}}(x-\mu_2)\,\phi_{\sigma_{0,3}}(z-\mu_3)\, dQ_0(\mu)}{\int \phi_{\sigma_1}(y-\mu_1)\,\phi_{\sigma_2}(x-\mu_2)\,\phi_{\sigma_3}(z-\mu_3)\, dQ_0(\mu)}
\leq \sup_{\mu \in A} \frac{\phi_{\sigma_{0,1}}(y-\mu_1)\,\phi_{\sigma_{0,2}}(x-\mu_2)\,\phi_{\sigma_{0,3}}(z-\mu_3)}{\phi_{\sigma_1}(y-\mu_1)\,\phi_{\sigma_2}(x-\mu_2)\,\phi_{\sigma_3}(z-\mu_3)}.
\]
For any $\epsilon > 0$, we choose a small neighborhood $N$ of $\Sigma_0$, together with $\eta$ and $\tilde{\eta}$, such that for $\Sigma \in N$ both (23) and (26) are less than $\epsilon/2$. Then the KL support condition is satisfied.

Next, we check that the class $\{\log(f_0(y,x,z)/f(y,x,z)), f \in \mathcal{F}\}$ is $P_0$-Glivenko-Cantelli. Using the expression of $Q$ in (8), the joint density $f(\cdot)$ in (7) can be regarded as a function of $t = (\pi, \{\mu_h\}, \Sigma)$; to make this explicit, we write $f(\cdot, t)$. Let $T$ be the space of parameters such that $\mathcal{F} = \{f(\cdot, t) : t \in T\}$. Following Lemma 6.1 in Wellner (2003), it suffices to check that $\log(f_0(\cdot)/f(\cdot,t))$ is continuous at each $t \in T$ for $P_0$-almost all $(y,x,z) \in \mathcal{Y}\times\mathcal{X}\times\mathcal{Z}$, that $T$ is compact, and that there exists a function $B(\cdot)$ such that $\sup_{t \in T}|\log(f_0(\cdot)/f(\cdot,t))| \leq B(\cdot)$ and $\int B \, dP_0 < \infty$.

By assumption, $T$ is compact. We suppose $\Pi_\Sigma(A') = 1$, where $A' = \{\Sigma : 0 < \underline{\sigma} \leq \sigma_1, \sigma_2, \sigma_3 \leq \bar{\sigma}\}$. It can be checked that $\log(f_0(\cdot)/f(\cdot,t))$ is continuous at $t$, $P_0$-almost surely. As for the function $B(\cdot)$, we can choose it as follows:
\begin{align*}
\log \frac{f_0(y,x,z)}{f(y,x,z,t)}
&= \log \frac{\int \phi_{\sigma_{0,1}}(y-\mu_1)\,\phi_{\sigma_{0,2}}(x-\mu_2)\,\phi_{\sigma_{0,3}}(z-\mu_3)\, dQ_0(\mu)}{\sum_{h=1}^{H}\pi_h \phi_{\sigma_1}(y-\mu_{1,h})\,\phi_{\sigma_2}(x-\mu_{2,h})\,\phi_{\sigma_3}(z-\mu_{3,h})} \\
&\leq \log \frac{\phi_{\sigma_{0,1}}(y-\bar{\mu}_1)}{\phi_{\sigma_1}(y-\mu_1^*)} + \log \frac{\phi_{\sigma_{0,2}}(x-\bar{\mu}_2)}{\phi_{\sigma_2}(x-\mu_2^*)} + \log \frac{\phi_{\sigma_{0,3}}(z-\bar{\mu}_3)}{\phi_{\sigma_3}(z-\mu_3^*)} \\
&\leq \log \prod_{j=1}^{3} \max\left(\bar{\sigma}\sigma_{0,j}^{-1}, \sigma_{0,j}\underline{\sigma}^{-1}\right) + M_1(y^2+|y|+1) + M_2(x^2+|x|+1) + M_3(z^2+|z|+1) \\
&\equiv B(y,x,z),
\end{align*}
where $M_j = \max\{(\sigma_{0,j}^{-2}+\underline{\sigma}^{-2})/2,\; k\sigma_{0,j}^{-2}+k'\underline{\sigma}^{-2},\; (k^2\sigma_{0,j}^{-2}+k'^2\underline{\sigma}^{-2})/2\}$. It is easy to check that $-\log(f_0(\cdot)/f(\cdot,t)) \leq B(\cdot)$ for $t \in T$ as well, and $\int B(y,x,z)\, dP_0 < \infty$ because $f_0$ has moments of all orders. Therefore, $\{\log(f_0(y,x,z)/f(y,x,z)), f \in \mathcal{F}\}$ is $P_0$-Glivenko-Cantelli. The other classes of functions can be shown to be $P_0$-Glivenko-Cantelli in a similar way.


C  Proof of Theorem 2

The proof is quite similar to that of Theorem 1. For $\epsilon > 0$, define $E = \{f : \mathrm{KL}(f_0(y,x), f(y,x)) < \epsilon\}$. Then there exists $N$ such that for $n > N$ and $f \in E$,
\begin{align*}
\max_{1\leq j\leq p} |\delta_j(f, P_n) - \delta_{0,j}|
&= \max_{1\leq j\leq p}\left| \int \log \frac{f(y,x)f(x_{-j})}{f(y,x_{-j})f(x)} \, dP_n - \int \log \frac{f_0(y,x)f_0(x_{-j})}{f_0(y,x_{-j})f_0(x)} \, dP_0 \right| \\
&\leq \sup_{f\in\mathcal{F}} \left| \int \log \frac{f_0(y,x)}{f(y,x)} \, dP_n - \int \log \frac{f_0(y,x)}{f(y,x)} \, dP_0 \right| \tag{27} \\
&\quad + \max_{1\leq j\leq p} \sup_{f\in\mathcal{F}} \left| \int \log \frac{f_0(y,x_{-j})}{f(y,x_{-j})} \, dP_n - \int \log \frac{f_0(y,x_{-j})}{f(y,x_{-j})} \, dP_0 \right| \tag{28} \\
&\quad + \sup_{f\in\mathcal{F}} \left| \int \log \frac{f_0(x)}{f(x)} \, dP_n - \int \log \frac{f_0(x)}{f(x)} \, dP_0 \right| \tag{29} \\
&\quad + \max_{1\leq j\leq p} \sup_{f\in\mathcal{F}} \left| \int \log \frac{f_0(x_{-j})}{f(x_{-j})} \, dP_n - \int \log \frac{f_0(x_{-j})}{f(x_{-j})} \, dP_0 \right| \tag{30} \\
&\quad + \max_{1\leq j\leq p} \left| \int \log \frac{f_0(y,x)f_0(x_{-j})}{f_0(y,x_{-j})f_0(x)} \, dP_n - \int \log \frac{f_0(y,x)f_0(x_{-j})}{f_0(y,x_{-j})f_0(x)} \, dP_0 \right| \tag{31} \\
&\quad + \int \log \frac{f_0(y,x)}{f(y,x)} \, dP_0 + \max_{1\leq j\leq p} \int \log \frac{f_0(y,x_{-j})}{f(y,x_{-j})} \, dP_0 \tag{32} \\
&\quad + \int \log \frac{f_0(x)}{f(x)} \, dP_0 + \max_{1\leq j\leq p} \int \log \frac{f_0(x_{-j})}{f(x_{-j})} \, dP_0 \tag{33} \\
&\leq 9\epsilon.
\end{align*}
The terms (27)-(30) are less than $\epsilon$ from the definition of $P_0$-Glivenko-Cantelli classes. The term (31) converges to zero by the strong law of large numbers. Each term in (32) and (33) is bounded by $\mathrm{KL}(f_0(y,x), f(y,x))$, which is less than $\epsilon$. Therefore, $E \subset \{f : \max_{1\leq j\leq p}|\delta_j(f,P_n) - \delta_{0,j}| < \epsilon'\}$ with $\epsilon' = 9\epsilon$, and $\Pi(\max_{1\leq j\leq p}|\delta_j(f,P_n) - \delta_{0,j}| < \epsilon' \mid D_n) \geq \Pi(E \mid D_n) \to 1$ almost surely $P_0^\infty$, by the posterior consistency of the joint densities in KL divergence (Norets (2012)).

D  Proof of Proposition 2

The proof is quite similar to that of Proposition 1. Without loss of generality, we assume $p = 2$ and $\beta_0 = 0$. Since $Q_0$ and $G$ have compact support, we suppose $Q_0(A) = 1$ and $Q(B) = 1$ for $Q$ in the support of $\Pi_Q$, where $A = \{(\beta,\mu) : -k \leq \beta_1, \beta_2, \mu_1, \mu_2 \leq k\}$ and $B = \{(\beta,\mu) : -k' \leq \beta_1, \beta_2, \mu_1, \mu_2 \leq k'\}$. Since $f_0$ has moments of all orders, for any $\eta > 0$ there exists $a$ such that $\int_{|y|>a} g f_0 \, dy\,dx < \eta$, $\int_{|x_1|>a} g f_0 \, dy\,dx < \eta$ and $\int_{|x_2|>a} g f_0 \, dy\,dx < \eta$, where $g(y,x) = 1 + |x_1| + |x_2| + x_1^2 + x_2^2 + |y||x_1| + |y||x_2| + |x_1||x_2|$. The KL divergence between $f_0$ and $f$ can be expressed as
\begin{align*}
\int f_0 \log \frac{f_0}{f}
&= \int f_0(y,x) \log \frac{\int \phi_{\sigma_0}(y-x'\beta)\,\phi_{\tau_{0,1}}(x_1-\mu_1)\,\phi_{\tau_{0,2}}(x_2-\mu_2)\, dQ_0(\beta,\mu)}{\int \phi_{\sigma}(y-x'\beta)\,\phi_{\tau_1}(x_1-\mu_1)\,\phi_{\tau_2}(x_2-\mu_2)\, dQ_0(\beta,\mu)} \, dy\,dx \tag{34} \\
&\quad + \int f_0(y,x) \log \frac{\int \phi_{\sigma}(y-x'\beta)\,\phi_{\tau_1}(x_1-\mu_1)\,\phi_{\tau_2}(x_2-\mu_2)\, dQ_0(\beta,\mu)}{\int \phi_{\sigma}(y-x'\beta)\,\phi_{\tau_1}(x_1-\mu_1)\,\phi_{\tau_2}(x_2-\mu_2)\, dQ(\beta,\mu)} \, dy\,dx. \tag{35}
\end{align*}
As discussed in Appendix B, we consider the integral (35) over $C = \{(y,x) \in \mathbb{R}^3 : -a \leq y, x_1, x_2 \leq a\}$ and $C^c$ separately. For the region $C^c$, we check, for example, the subset $\{(y,x) : y < -a, \ -a \leq x_1, x_2 \leq a\}$:
\begin{align*}
&\int_{-\infty}^{-a}\!\int_{-a}^{a}\!\int_{-a}^{a} f_0(y,x) \log \frac{\int \phi_{\sigma}(y-x'\beta)\,\phi_{\tau_1}(x_1-\mu_1)\,\phi_{\tau_2}(x_2-\mu_2)\, dQ_0(\beta,\mu)}{\int \phi_{\sigma}(y-x'\beta)\,\phi_{\tau_1}(x_1-\mu_1)\,\phi_{\tau_2}(x_2-\mu_2)\, dQ(\beta,\mu)} \, dy\,dx \\
&\leq \int_{-\infty}^{-a}\!\int\!\int \frac{1}{2\sigma^2}\left\{(k^2+k'^2)(x_1^2+x_2^2) + 2(k+k')(|x_1|+|x_2|)|y| + 2(k^2+k'^2)|x_1||x_2|\right\} f_0(y,x)\, dy\,dx \\
&\quad + \int_{-\infty}^{-a}\!\int\!\int \left( \frac{k+k'}{\tau_1^2}|x_1| + \frac{k^2+k'^2}{2\tau_1^2} \right) f_0(y,x)\, dy\,dx
+ \int_{-\infty}^{-a}\!\int\!\int \left( \frac{k+k'}{\tau_2^2}|x_2| + \frac{k^2+k'^2}{2\tau_2^2} \right) f_0(y,x)\, dy\,dx \\
&< \left( \frac{k+k'}{\sigma^2} + \frac{3(k^2+k'^2)}{2\sigma^2} + \frac{k+k'}{\tau_1^2} + \frac{k^2+k'^2}{2\tau_1^2} + \frac{k+k'}{\tau_2^2} + \frac{k^2+k'^2}{2\tau_2^2} \right)\eta. \tag{36}
\end{align*}
For the other subsets in $C^c$, we can bound each of the corresponding integrals by (36). Following Ghosal et al. (1999), there exists a set $E$ in the space of probability measures on $(\beta,\mu)$ with $\Pi_Q(E) > 0$ such that for $Q \in E$ the integral over $C$ is less than $3\tilde{\eta}/(1-3\tilde{\eta})$, where $\tilde{\eta} < 1/3$. Therefore, for $Q \in E$, (35) is less than
\[
\left( \frac{k+k'}{\sigma^2} + \frac{3(k^2+k'^2)}{2\sigma^2} + \frac{k+k'}{\tau_1^2} + \frac{k^2+k'^2}{2\tau_1^2} + \frac{k+k'}{\tau_2^2} + \frac{k^2+k'^2}{2\tau_2^2} \right)\eta + \frac{3\tilde{\eta}}{1-3\tilde{\eta}}. \tag{37}
\]
Also, the term (34) converges to $0$ as $\sigma \to \sigma_0$ and $\tau_j \to \tau_{0,j}$ for $j = 1, 2$, by the dominated convergence theorem with the inequality
\[
\frac{\int \phi_{\sigma_0}(y-x'\beta)\,\phi_{\tau_{0,1}}(x_1-\mu_1)\,\phi_{\tau_{0,2}}(x_2-\mu_2)\, dQ_0(\beta,\mu)}{\int \phi_{\sigma}(y-x'\beta)\,\phi_{\tau_1}(x_1-\mu_1)\,\phi_{\tau_2}(x_2-\mu_2)\, dQ_0(\beta,\mu)}
\leq \sup_{(\beta,\mu)\in A} \frac{\phi_{\sigma_0}(y-x'\beta)\,\phi_{\tau_{0,1}}(x_1-\mu_1)\,\phi_{\tau_{0,2}}(x_2-\mu_2)}{\phi_{\sigma}(y-x'\beta)\,\phi_{\tau_1}(x_1-\mu_1)\,\phi_{\tau_2}(x_2-\mu_2)}.
\]
Based on the expression of $Q$ in (12), let $f(\cdot, t)$ denote the joint density (11), where $t = (\pi, \{\beta_h\}, \{\mu_h\}, \sigma, \tau)$, and let $T$ be the space of $t$ such that $\mathcal{F} = \{f(\cdot,t) : t \in T\}$. By assumption, $T$ is compact, and it is easy to check that $\log(f_0(\cdot)/f(\cdot,t))$ is continuous at $t$, $P_0$-almost surely. We suppose $\Pi_{(\sigma,\tau)}(A') = 1$, where $A' = \{(\sigma,\tau) : 0 < \underline{\sigma} \leq \sigma, \tau_1, \tau_2 \leq \bar{\sigma}\}$. Then, for $t \in T$,
\begin{align*}
\log \frac{f_0(y,x)}{f(y,x,t)}
&\leq \log \max\left(\bar{\sigma}\sigma_0^{-1}, \sigma_0\underline{\sigma}^{-1}\right) + \sum_{j=1}^{2}\log \max\left(\bar{\sigma}\tau_{0,j}^{-1}, \tau_{0,j}\underline{\sigma}^{-1}\right) \\
&\quad + M_y\left(y^2 + x_1^2 + x_2^2 + |y||x_1| + |y||x_2| + |x_1||x_2|\right) + M_1(x_1^2+|x_1|+1) + M_2(x_2^2+|x_2|+1) \\
&\equiv B(y,x),
\end{align*}
where $M_y = \max\{\sigma_0^{-2}+\underline{\sigma}^{-2},\; k\sigma_0^{-2}+k'\underline{\sigma}^{-2},\; k^2\sigma_0^{-2}+k'^2\underline{\sigma}^{-2}\}$ and $M_j = \max\{\tau_{0,j}^{-2}+\underline{\sigma}^{-2},\; k\tau_{0,j}^{-2}+k'\underline{\sigma}^{-2},\; k^2\tau_{0,j}^{-2}+k'^2\underline{\sigma}^{-2}\}$. It is easy to check that $\int B(y,x)\, dP_0 < \infty$. As a result, $\{\log(f_0(y,x)/f(y,x)), f \in \mathcal{F}\}$ is $P_0$-Glivenko-Cantelli. Similarly, we can show that the other classes of ratios of marginal densities are also $P_0$-Glivenko-Cantelli.

References

An, Q., C. Wang, I. Shterev, E. Wang, L. Carin, and D. B. Dunson (2008). Hierarchical Kernel Stick-Breaking Process for Multi-Task Image Analysis. 25th International Conference on Machine Learning.

Bouezmarni, T., J. V. K. Rombouts, and A. Taamouti (2012). Nonparametric Copula-Based Test for Conditional Independence with Applications to Granger Causality. Journal of Business & Economic Statistics 30, 275-287.

Canale, A. and D. B. Dunson (2011). Bayesian kernel mixtures for counts. Journal of the American Statistical Association 106, 1528-1539.

Chung, Y. and D. B. Dunson (2009). Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association 104, 1646-1660.

Doornik, J. (2006). Ox: Object Oriented Matrix Programming. London: Timberlake Consultants Press.

Dunson, D. B. and J.-H. Park (2008). Kernel stick-breaking processes. Biometrika 95, 307-323.

Dunson, D. B. and S. D. Peddada (2008). Bayesian nonparametric inference on stochastic ordering. Biometrika 95, 859-874.

Escobar, M. D. and M. West (1995). Bayesian Density Estimation and Inference Using Mixtures. Journal of the American Statistical Association 90, 577-588.

Fukumizu, K., A. Gretton, X. Sun, and B. Schölkopf (2008). Kernel measures of conditional dependence. Proc. Adv. Neural Inf. Process. Syst. 20, 489-496.

George, E. I. and R. E. McCulloch (1993). Variable Selection via Gibbs Sampling. Journal of the American Statistical Association 88, 881-889.

George, E. I. and R. E. McCulloch (1997). Approaches for Bayesian variable selection. Statistica Sinica 7, 339-374.

Ghosal, S., J. K. Ghosh, and R. V. Ramamoorthi (1999). Posterior consistency of Dirichlet mixtures in density estimation. Annals of Statistics 27, 143-158.

Ghosh, J. K. and R. V. Ramamoorthi (2003). Bayesian Nonparametrics. Springer.

Györfi, L. and H. Walk (2012). Strongly consistent nonparametric tests of conditional independence. Statistics & Probability Letters 82, 1145-1150.

Hannah, L. A., D. M. Blei, and W. B. Powell (2011). Dirichlet Process Mixtures of Generalized Linear Models. Journal of Machine Learning Research 12, 1923-1953.

Holmes, C. C., F. Caron, J. E. Griffin, and D. A. Stephens (2012). Two-sample Bayesian nonparametric hypothesis testing. Technical Report, http://arxiv.org/abs/0910.5060.

Kosorok, M. R. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer.

Linton, O. and P. Gozalo (1996). Conditional independence restrictions: Testing and estimation. Cowles Foundation Discussion Papers 1140.

Lo, A. Y. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. Annals of Statistics 12, 351-357.

Ma, L. (2011). Adaptive testing of conditional association through Bayesian recursive mixture modeling. Technical Report, Department of Statistical Science, Duke University.

Ma, L. and W. H. Wong (2011). Coupling Optional Polya Trees and the Two Sample Problem. Journal of the American Statistical Association 106, 1553-1565.

Muliere, P. and L. Tardella (1998). Approximating Distributions of Random Functionals of Ferguson-Dirichlet Priors. Canadian Journal of Statistics 26, 283-297.

Müller, P., A. Erkanli, and M. West (1996). Bayesian curve fitting using multivariate normal mixtures. Biometrika 83, 67-79.

Norets, A. (2012). Posterior consistency in Kullback-Leibler divergence with applications to misspecified infinite dimensional models. Technical Report.

Norets, A. and J. Pelenis (2011). Posterior Consistency in Conditional Density Estimation by Covariate Dependent Mixtures. Econometric Theory. In press.

Norets, A. and J. Pelenis (2012). Bayesian Modeling of Joint and Conditional Distributions. Journal of Econometrics 168, 332-346.

Pati, D., D. B. Dunson, and S. T. Tokdar (2013). Posterior consistency in conditional distribution estimation. Journal of Multivariate Analysis 116, 456-472.

Reich, B. J., E. Kalendra, C. B. Storlie, H. D. Bondell, and M. Fuentes (2012). Variable selection for high dimensional Bayesian density estimation: application to human exposure simulation. Journal of the Royal Statistical Society Series C (Applied Statistics) 61, 47-66.

Seth, S. and J. C. Principe (2010). A conditional distribution function based approach to design nonparametric tests of independence and conditional independence. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2066-2069.

Seth, S. and J. C. Principe (2012). Assessing Granger non-causality using nonparametric measure of conditional independence. IEEE Transactions on Neural Networks and Learning Systems 23, 47-59.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica 4, 639-650.

Song, K. (2009). Testing conditional independence via Rosenblatt transforms. The Annals of Statistics 37, 4011-4045.

Su, L. and H. White (2007). A consistent characteristic function-based test for conditional independence. Journal of Econometrics 141, 807-834.

Su, L. and H. White (2008). A nonparametric Hellinger metric test for conditional independence. Econometric Theory 24, 829-864.

Tokdar, S. T. (2006). Posterior consistency of Dirichlet location-scale mixture of normals in density estimation and regression. Sankhya 67, 90-110.

van der Vaart, A. and J. Wellner (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer.

Wellner, J. (2003). Empirical processes: Theory and applications. Technical Report. http://www.stat.washington.edu/jaw/RESEARCH/TALKS/BocconiSS/emp-prcbk-big2.pdf

West, M., P. Müller, and M. D. Escobar (1994). Hierarchical priors and mixture models, with application in regression and density estimation. Aspects of Uncertainty: A Tribute to D. V. Lindley, 363-386.

Wu, Y. and S. Ghosal (2008). Kullback-Leibler property of kernel mixture priors in Bayesian density estimation. Electronic Journal of Statistics 2, 298-331.

Xu, X., P. Lu, S. N. MacEachern, and R. Xu (2012). Calibrated Bayes factor for model comparison and prediction. Technical Report No. 855, Department of Statistics, The Ohio State University.