Semiparametric Hidden Markov Model with Nonparametric Regression

Mian Huang 1, Qinghua Ji 2, and Weixin Yao 3

1 School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, P. R. China.
2 School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, P. R. China.
3 Department of Statistics, University of California, Riverside, California, U.S.A.

Address for correspondence: Weixin Yao, Department of Statistics, University of California, Riverside, California, 92521, U.S.A. E-mail: [email protected]. Phone: (+001) 951 827 6007. Fax: (+001) 951 827 3286.

Abstract: The hidden Markov model regression (HMMR) has been popularly used in many fields such as gene expression analysis and activity recognition. However, the traditional HMMR requires a strong linearity assumption for the emission model. In this paper, we propose a hidden Markov model with nonparametric regression (HMM-NR), in which the mean and variance of the emission model are unknown smooth functions. The new semiparametric model may greatly reduce the modeling bias and thus enhance the applicability of the traditional hidden Markov model regression. We propose an estimation procedure for the transition probability matrix and the nonparametric mean and variance functions by combining the ideas of the EM algorithm and kernel regression. Simulation studies and a real dataset application are used to demonstrate the effectiveness of the new estimation procedure.

Key words: EM algorithm; kernel regression; hidden Markov model regression; forward-backward algorithm.

1 Introduction

The hidden Markov model (HMM) is a powerful and useful tool for modeling generative sequences that can be characterised by an underlying latent process. The HMM has been successfully applied in many fields, including speech recognition (Rabiner, 1989; Huang et al., 1990), genetics and genomics (Gough et al., 2001; Krogh et al., 2001; Wang et al., 2007), and artificial intelligence (Bui et al., 2002). The book by MacDonald and Zucchini (1997) illustrates several applications of HMMs, and Cappé et al. (2009) provides a more complete introduction to the HMM. An important extension of the HMM is the hidden Markov model regression (HMMR), proposed by Fridman (1993) to incorporate the effect of covariates. HMMR has been applied in many fields, e.g., gene expression (Lee et al., 2014) and activity recognition (Trabelsi et al., 2013). The emission model of HMMR is a linear regression model, which may be inappropriate, or even provide misleading results, when the real relationship is complex and nonlinear.

To solve this problem and relax the linearity assumption, we propose a nonparametric extension of the traditional parametric emission model in the HMM framework. The proposed model allows the regression function in each emission model to be an unknown but smooth function of the covariates. With this specification, the model is referred to as the hidden Markov model with nonparametric regression, or HMM-NR for short. Another motivation comes from an analysis of US housing data in Huang et al. (2013), who used nonparametric mixture regression to analyze the data but did not incorporate the correlation among the time series observations. By combining the ideas of nonparametric mixture regression (Huang et al., 2013) and the hidden Markov model, the proposed HMM-NR is more appropriate for this type of time series data.

In this paper, we introduce the model definition for the HMM-NR and propose a modified EM algorithm that combines the ideas of the EM algorithm and kernel regression. The transition probability matrix and the mean and variance functions are estimated simultaneously in the modified EM algorithm. By extending the definition of degrees of freedom for kernel estimates in Fan et al. (2001) and Huang et al. (2013), we propose to choose the bandwidth by the Bayesian information criterion. We conduct a simulation study and a real data application to demonstrate the effectiveness of the proposed estimation procedure.

The rest of this article is structured as follows. We introduce the HMM-NR and the estimation procedure in Section 2. In Section 3, we present a simulation study and a real data application. Some discussions are given in Section 4.


2 Model and estimation

In this section, we introduce the proposed semiparametric hidden Markov model regression and a modified EM algorithm that simplifies the computation.

2.1 Model setting

Suppose that the stochastic triplet sequence $\{S_t, X_t, Y_t\}_{t=0}^{\infty}$ is a Markov process, where $S_t$ is an unobserved latent homogeneous irreducible Markov chain with finite state space $\{1, 2, \ldots, S\}$. Let $\Gamma$ be the $S \times S$ transition matrix of $S_t$, with elements $\gamma_{jk} = P(S_{t+1} = k \mid S_t = j)$, and let $\pi = (\pi_1, \ldots, \pi_S)$ be the vector of initial state probabilities, where $\sum_{k=1}^{S} \pi_k = 1$ and $\sum_{k=1}^{S} \gamma_{jk} = 1$ for all $j$. We assume that the initial probabilities and transition probabilities are all positive. The observation couple is $\{X_t, Y_t\}$, where $X_t$ is the covariate, which is independent of $(S_t, X_{t-1}, Y_{t-1})$. Given $S_t = k$ and $X_t$, the proposed HMM-NR assumes that the response variable $Y_t$ follows a normal regression model with mean function $m_k(X_t)$ and variance function $\sigma_k^2(X_t)$, and is independent of $(S_{t-1}, X_{t-1}, Y_{t-1})$. Specifically, for the Markov triplet $(S_t, X_t, Y_t)$, we have
$$p(S_{t+1} \mid S_t, X_t, Y_t) = p(S_{t+1} \mid S_t), \qquad Y_t \mid X_t, S_t = k \sim N\{m_k(X_t), \sigma_k^2(X_t)\},$$
where $m_k(\cdot)$ is an unknown smooth mean function, $\sigma_k^2(\cdot)$ is a positive unknown smooth variance function, and $N(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. Note that the traditional HMMR is a special case of the proposed HMM-NR in which $m_k(x)$ is a linear function of $x$ and $\sigma_k^2(x)$ is constant. We consider univariate $X_t$ throughout this paper. The proposed model and estimation procedure can be easily extended to multivariate $X_t$, but such an extension is less desirable due to the "curse of dimensionality".

Denote $p_k(x, y) = P(Y_t = y \mid X_t = x, S_t = k) = \phi\{y \mid m_k(x), \sigma_k^2(x)\}$ for $k = 1, \ldots, S$, where $\phi(\cdot \mid \mu, \sigma^2)$ is the normal density function with mean $\mu$ and variance $\sigma^2$. Let $P(x, y)$ be the $S \times S$ diagonal matrix with diagonal elements $p_1(x, y), \ldots, p_S(x, y)$, i.e., $P(x, y) = \mathrm{diag}\{p_1(x, y), \ldots, p_S(x, y)\}$. We have thus defined an HMM-NR with unknown parameters $\Gamma$ and $\pi$ and unknown functions $\theta(x) = \{m_k(x), \sigma_k^2(x), k = 1, \ldots, S\}$.
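As a concrete illustration, the generative process just described can be sketched in Python. The sketch below uses the two-state design from the simulation study of Section 3.1; the function name and code organization are our own illustration, not part of the estimation procedure:

```python
import numpy as np

def simulate_hmm_nr(T, pi0, Gamma, m_funcs, var_funcs, rng=None):
    """Simulate (s, x, y) from an HMM with nonparametric (here:
    user-supplied) emission mean and variance functions."""
    rng = np.random.default_rng(rng)
    S = len(pi0)
    s = np.empty(T, dtype=int)
    x = rng.uniform(0.0, 1.0, size=T)   # covariate, independent of the chain
    y = np.empty(T)
    s[0] = rng.choice(S, p=pi0)         # initial state from pi
    for t in range(T):
        if t > 0:
            s[t] = rng.choice(S, p=Gamma[s[t - 1]])  # Markov transition
        k = s[t]
        y[t] = rng.normal(m_funcs[k](x[t]), np.sqrt(var_funcs[k](x[t])))
    return s, x, y

# Two-state example with smooth, nonlinear mean functions (Section 3.1 design)
pi0 = np.array([0.4, 0.6])
Gamma = np.array([[0.35, 0.65], [0.6, 0.4]])
m_funcs = [lambda x: 3.4 + np.cos(2.5 * np.pi * x) / 3,
           lambda x: 2.2 + np.sin(3 * np.pi * x) / 5]
var_funcs = [lambda x: 0.06, lambda x: 0.05]
s, x, y = simulate_hmm_nr(200, pi0, Gamma, m_funcs, var_funcs, rng=0)
```

Only $(x_t, y_t)$ would be available to the estimation procedure; the state sequence $s$ is latent.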

2.2 A modified EM algorithm

The traditional EM algorithm for HMMR cannot be directly applied to the proposed HMM-NR due to the nonparametric mean and variance functions. We propose a modified EM algorithm that estimates the unknown parameters and the unknown mean and variance functions of the HMM-NR by combining the ideas of the EM algorithm and kernel regression. Let $\{(x_t, y_t), t = 1, 2, \ldots, T\}$ be a finite realization of $\{X_t, Y_t\}_{t=0}^{\infty}$. The likelihood function for the observed data is
$$\ell(\pi, \Gamma, \theta(\cdot)) = \pi P(x_1, y_1)\, \Gamma P(x_2, y_2)\, \Gamma P(x_3, y_3) \cdots \Gamma P(x_T, y_T)\, \mathbf{1}^{\prime}. \tag{2.1}$$

Borrowing the idea from the parametric HMM, we also use the EM algorithm to simplify the computation. Define the latent state vector $\mathbf{z}_t = (z_{t1}, \ldots, z_{tS})$, where $z_{tk}$ is the indicator of $S_t$, that is,
$$z_{tk} = \begin{cases} 1, & \text{if } S_t = k, \\ 0, & \text{otherwise.} \end{cases}$$
Then the complete log-likelihood function for $\{(x_t, y_t, \mathbf{z}_t), t = 1, 2, \ldots, T\}$ is
$$L = \sum_{k=1}^{S} z_{1k} \log(\pi_k) + \sum_{t=2}^{T} \sum_{j=1}^{S} \sum_{k=1}^{S} z_{t-1,j} z_{tk} \log(\gamma_{jk}) + \sum_{t=1}^{T} \sum_{k=1}^{S} z_{tk} \log \phi\{y_t \mid m_k(x_t), \sigma_k^2(x_t)\}. \tag{2.2}$$

The modified EM algorithm works in an iterative way. In the E step, we calculate the expectation of the complete log-likelihood function (2.2), given the entire sequence $\{x_t, y_t\}$ and the current estimate of $(\pi, \Gamma, \theta(\cdot))$. This is equivalent to calculating the expectations of $z_{tk}$ and $z_{t-1,j} z_{tk}$, that is,
$$E\{L \mid (x_1, y_1, \ldots, x_T, y_T)\} = \sum_{k=1}^{S} r_{1k} \log(\pi_k) + \sum_{t=2}^{T} \sum_{j=1}^{S} \sum_{k=1}^{S} h_{tjk} \log(\gamma_{jk}) + \sum_{t=1}^{T} \sum_{k=1}^{S} r_{tk} \log \phi\{y_t \mid m_k(x_t), \sigma_k^2(x_t)\},$$
where $r_{tk}$ is the conditional probability of being in state $k$ at time $t$ given the entire observed sequence, and $h_{tjk}$ is the conditional probability of being in state $k$ at time $t$ and in state $j$ at time $t-1$ given the entire observed sequence. In practice, $r_{tk}$ and $h_{tjk}$ can be efficiently calculated by the forward-backward algorithm. We define the $1 \times S$ vector of forward probabilities $\alpha_t = (\alpha_{t1}, \ldots, \alpha_{tS})$ as
$$\alpha_t = \pi P(x_1, y_1)\, \Gamma P(x_2, y_2) \cdots \Gamma P(x_t, y_t), \qquad t = 1, 2, \ldots, T,$$
and the $S \times 1$ vector of backward probabilities $\beta_t = (\beta_{t1}, \ldots, \beta_{tS})^{\prime}$ as
$$\beta_t = \Gamma P(x_{t+1}, y_{t+1})\, \Gamma P(x_{t+2}, y_{t+2}) \cdots \Gamma P(x_T, y_T)\, \mathbf{1}^{\prime}, \qquad t = 1, 2, \ldots, T.$$
Then, we have
$$r_{tk} = \alpha_{tk} \beta_{tk} / \ell(\pi, \Gamma, \theta(\cdot)) \quad \text{and} \quad h_{tjk} = \alpha_{t-1,j}\, \gamma_{jk}\, p_k(x_t, y_t)\, \beta_{tk} / \ell(\pi, \Gamma, \theta(\cdot)),$$
where $\ell(\pi, \Gamma, \theta(\cdot))$ is defined in (2.1).
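The recursion above can be sketched in Python. The per-step normalization below is a standard numerical safeguard against underflow, not part of the paper's notation; the scaling constants cancel in $r_{tk}$ and $h_{tjk}$:

```python
import numpy as np

def forward_backward(pi0, Gamma, p):
    """Scaled forward-backward recursion.
    p: (T, S) array with p[t, k] = p_k(x_t, y_t), the emission density.
    Returns r (T, S), h (T, S, S) with h[t] for the transition t-1 -> t
    (t >= 1), and the observed-data log-likelihood."""
    T, S = p.shape
    alpha = np.empty((T, S)); beta = np.empty((T, S)); c = np.empty(T)
    alpha[0] = pi0 * p[0]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):                      # forward pass
        alpha[t] = (alpha[t - 1] @ Gamma) * p[t]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):             # backward pass
        beta[t] = Gamma @ (p[t + 1] * beta[t + 1]) / c[t + 1]
    r = alpha * beta                           # r[t, k] = P(S_t = k | data)
    h = np.zeros((T, S, S))
    for t in range(1, T):
        h[t] = alpha[t - 1][:, None] * Gamma * (p[t] * beta[t])[None, :] / c[t]
    return r, h, np.log(c).sum()
```

Each row of `r` and each slice `h[t]` (for $t \ge 2$ in the paper's indexing) sums to one, and `np.log(c).sum()` equals $\log \ell(\pi, \Gamma, \theta(\cdot))$.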

In the M step, we maximize $E\{L \mid (x_1, y_1, \ldots, x_T, y_T)\}$ with respect to the unknown parameters $\pi$ and $\Gamma$ and the unknown functions $\theta(\cdot)$. Note that $E(L) = L_1(\pi) + L_2(\Gamma) + L_3(\theta(\cdot))$, where
$$L_1(\pi) = \sum_{k=1}^{S} r_{1k} \log(\pi_k), \qquad L_2(\Gamma) = \sum_{t=2}^{T} \sum_{j=1}^{S} \sum_{k=1}^{S} h_{tjk} \log(\gamma_{jk}),$$
and
$$L_3(\theta(\cdot)) = \sum_{t=1}^{T} \sum_{k=1}^{S} r_{tk} \log \phi\{y_t \mid m_k(x_t), \sigma_k^2(x_t)\}.$$
Therefore, to maximize $E\{L \mid (x_1, y_1, \ldots, x_T, y_T)\}$, we only need to maximize $L_1(\pi)$, $L_2(\Gamma)$, and $L_3(\theta(\cdot))$ separately. Under the constraint $\sum_{k=1}^{S} \pi_k = 1$, the maximizer of $L_1(\pi)$ is $\pi_k = r_{1k}$. The maximization of $L_2(\Gamma)$ can be performed separately for each $j = 1, \ldots, S$ by maximizing
$$\sum_{t=2}^{T} \sum_{k=1}^{S} h_{tjk} \log(\gamma_{jk}).$$
With the constraint $\sum_{k=1}^{S} \gamma_{jk} = 1$ and the equality $\sum_{k=1}^{S} h_{tjk} = r_{t-1,j}$, it is easy to show that
$$\gamma_{jk} = \frac{\sum_{t=2}^{T} h_{tjk}}{\sum_{t=2}^{T} \sum_{k=1}^{S} h_{tjk}} = \frac{\sum_{t=2}^{T} h_{tjk}}{\sum_{t=2}^{T} r_{t-1,j}}$$
for $j = 1, \ldots, S$.

Note that $L_3$ cannot be maximized directly due to the unknown smooth functions $\theta(\cdot)$. We propose to estimate $\theta(\cdot)$ based on the ideas of local likelihood and kernel approximations of $m_k(x)$ and $\sigma_k^2(x)$ in $L_3$. Define the local log-likelihood function of $L_3$ as
$$\ell_3(\mathbf{m}, \boldsymbol{\sigma}^2) = \sum_{t=1}^{T} \sum_{k=1}^{S} r_{tk} \log \phi\{y_t \mid m_k, \sigma_k^2\}\, K_h(x_t - x), \tag{2.3}$$


where $K_h(\cdot) = h^{-1} K(\cdot/h)$ is a rescaled kernel of a kernel function $K(\cdot)$ with bandwidth $h$, $\mathbf{m} = \{m_1, \ldots, m_S\}$, and $\boldsymbol{\sigma}^2 = \{\sigma_1^2, \ldots, \sigma_S^2\}$. The solutions of maximizing $\ell_3(\mathbf{m}, \boldsymbol{\sigma}^2)$ are
$$m_k(x) = \frac{\sum_{t=1}^{T} \omega_{tk}(x)\, y_t}{\sum_{t=1}^{T} \omega_{tk}(x)} \quad \text{and} \quad \sigma_k^2(x) = \frac{\sum_{t=1}^{T} \omega_{tk}(x)\, \{y_t - m_k(x)\}^2}{\sum_{t=1}^{T} \omega_{tk}(x)},$$
where $\omega_{tk}(x) = r_{tk} \cdot K_h(x_t - x)$. The maximization is performed at a set of grid points, and interpolation is then used to obtain a functional estimate of $\theta(x)$.
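A sketch of these closed-form M-step updates, given the E-step quantities $r_{tk}$ and $h_{tjk}$, is shown below. The function, its argument layout, and the grid-based evaluation are our illustration; the Epanechnikov kernel matches the choice made in the simulation section:

```python
import numpy as np

def m_step(x, y, r, h, grid, bandwidth):
    """Closed-form M-step: pi_k = r_{1k}; gamma_{jk} by row-normalized
    sums of h_{tjk}; m_k and sigma_k^2 by the kernel-weighted averages
    above, evaluated at grid points."""
    pi0 = r[0]
    num = h[1:].sum(axis=0)                       # sum over t of h_{tjk}
    Gamma = num / num.sum(axis=1, keepdims=True)  # rows sum to one
    def K(u):                                     # Epanechnikov kernel
        return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    m = np.empty((len(grid), r.shape[1]))
    s2 = np.empty_like(m)
    for i, x0 in enumerate(grid):
        # omega_{tk}(x0) = r_{tk} * K_h(x_t - x0)
        w = r * (K((x - x0) / bandwidth) / bandwidth)[:, None]
        W = w.sum(axis=0)
        m[i] = (w * y[:, None]).sum(axis=0) / W
        s2[i] = (w * (y[:, None] - m[i]) ** 2).sum(axis=0) / W
    return pi0, Gamma, m, s2
```

In practice the bandwidth must be large enough that every grid point has positive total weight; otherwise the ratios are undefined at that point.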

2.3 Selection of the tuning parameter

In order to implement the proposed modified EM algorithm, we need to select the bandwidth $h$ in (2.3). We propose to select $h$ by the Bayesian information criterion (BIC), which has the form $-2L + \log(T) \times \mathrm{DoF}$, where $L$ is the maximized log-likelihood and $\mathrm{DoF}$ is the degrees of freedom of the model. Unlike for the parametric HMM, the DoF is not well defined for the proposed semiparametric HMM-NR due to the unknown smooth functions $\theta(\cdot)$. Here we extend the approach used by Fan et al. (2001) and Huang et al. (2013) to the proposed HMM-NR to define the DoF. The degrees of freedom of a one-dimensional smoothing function is
$$\mathrm{df} = \tau_K h^{-1} |\Omega| \left\{ K(0) - \frac{1}{2} \int K^2(t)\,dt \right\},$$
where $\Omega$ is the support of $x$, and
$$\tau_K = \frac{K(0) - \frac{1}{2}\int K^2(t)\,dt}{\int \{K(t) - \frac{1}{2} K * K(t)\}^2\,dt}.$$
Hence, the degrees of freedom for the proposed HMM-NR is
$$\mathrm{DoF} = 2S \times \mathrm{df} + S^2 - 1.$$
Because the number of states $S$ is assumed known, the DoF depends only on the bandwidth $h$ in our model setting.
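For instance, for the Epanechnikov kernel $K(u) = 0.75(1 - u^2)_{+}$ used later in the simulations, the kernel constants in df can be evaluated numerically. The grid-based integration below is our own device for illustration:

```python
import numpy as np

def epanechnikov_df(h, omega_len):
    """df = tau_K * h^{-1} * |Omega| * {K(0) - 1/2 * int K^2},
    with the kernel integrals evaluated on a fine grid."""
    u = np.linspace(-2.0, 2.0, 4001)          # covers support of K and K*K
    du = u[1] - u[0]
    K = np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
    KK = np.convolve(K, K, mode="same") * du  # convolution (K * K)(u)
    cK = K[u.size // 2] - 0.5 * np.sum(K ** 2) * du   # K(0) - 1/2 ∫ K^2
    tauK = cK / (np.sum((K - 0.5 * KK) ** 2) * du)
    return tauK / h * omega_len * cK

def bic(loglik, T, S, h, omega_len):
    """BIC = -2 L + log(T) * DoF with DoF = 2S * df + S^2 - 1."""
    dof = 2 * S * epanechnikov_df(h, omega_len) + S ** 2 - 1
    return -2.0 * loglik + np.log(T) * dof
```

Candidate bandwidths can then be compared by refitting the model at each $h$ and picking the one minimizing `bic`; note that df grows like $1/h$, so small bandwidths are penalized more heavily.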

3 Examples

In this section, we use a simulation study and a real data application to demonstrate the effectiveness of the new model and the proposed modified EM algorithm. To assess the performance of the estimators, we use the square root of the average squared errors (RASE). For example, for the mean functions,
$$\mathrm{RASE}_m^2 = N^{-1} \sum_{k=1}^{S} \sum_{j=1}^{N} \{\hat{m}_k(u_j) - m_k(u_j)\}^2,$$
where $\{u_j, j = 1, 2, \ldots, N\}$ are grid points evenly spaced in the range of the covariate $x$. The RASE for the variance functions $\sigma_k^2(x)$ is defined similarly.
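In code, with the component functions evaluated on the grid (the array shapes are our convention):

```python
import numpy as np

def rase(est, truth):
    """RASE: est and truth are (S, N) arrays holding the S component
    functions evaluated at the N grid points u_1, ..., u_N."""
    est = np.asarray(est, dtype=float)
    truth = np.asarray(truth, dtype=float)
    N = est.shape[1]
    # RASE^2 = N^{-1} * sum over components and grid points
    return np.sqrt(np.sum((est - truth) ** 2) / N)
```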

3.1 Simulation

Example. In this example, we conduct a simulation for a two-state HMM-NR with $\pi = (0.4, 0.6)$,
$$m_1(x) = 3.4 + \frac{1}{3}\cos(2.5\pi x), \qquad m_2(x) = 2.2 + \frac{1}{5}\sin(3\pi x),$$
$$\sigma_1^2(x) = 0.06, \qquad \sigma_2^2(x) = 0.05,$$

and
$$\Gamma = \begin{pmatrix} 0.35 & 0.65 \\ 0.60 & 0.40 \end{pmatrix}.$$
We generate T (200 or 400) points of the predictor x from the one-dimensional uniform distribution on [0, 1]. We use the K-means algorithm to obtain initial values, and the Epanechnikov kernel is used for the kernel regression. To select the bandwidth for the simulation, we generate several datasets for a given sample size and select a bandwidth by BIC for each dataset. The optimal bandwidth is taken to be the average of these selected bandwidths. For evaluation, we then use 2/3 of the optimal bandwidth, the optimal bandwidth itself, and 1.5 times the optimal bandwidth. We conduct 500 simulations with sample sizes T = 200 and 400, respectively. The results are reported in Table 1. RASE_m and RASE_σ² display the mean (with standard deviation in parentheses) of the RASEs for the mean and variance functions, respectively, over the 500 simulations; γ11 and γ21 display the mean (with standard deviation in parentheses) of the estimated transition matrix parameters. From Table 1, we see that the proposed estimation procedure is effective across a wide range of bandwidths.

Table 1: Summary of simulation results

  T     h      RASE_m         RASE_σ²          γ11            γ21
  200   0.04   0.012(0.006)   0.0015(0.0009)   0.344(0.051)   0.602(0.053)
        0.06   0.009(0.004)   0.0012(0.0007)   0.343(0.051)   0.600(0.051)
        0.09   0.013(0.004)   0.0016(0.0008)   0.350(0.050)   0.605(0.051)
  400   0.026  0.007(0.002)   0.001(0.0006)    0.347(0.037)   0.598(0.034)
        0.04   0.005(0.002)   0.0006(0.0003)   0.344(0.036)   0.601(0.035)
        0.06   0.006(0.002)   0.0006(0.0003)   0.349(0.037)   0.604(0.035)

We next assess the accuracy of the standard error estimation via the parametric bootstrap method. We first generate data from the fitted model, then refit the model to obtain the bootstrap estimates, and calculate their standard deviations over the replicated bootstrap estimates. Tables 2, 3, and 4 summarize the performance of the standard errors of the functional estimates at x = 0.1, 0.2, ..., 0.9 and of the transition matrix. The true standard errors, denoted by SD, are approximated via the standard deviation of the 500 estimates underlying Table 1; SE and Std denote the sample average and standard deviation of the 500 estimated standard errors obtained from the bootstrap. The results in Tables 2, 3, and 4 demonstrate that the proposed bootstrap procedure works reasonably well in general.

3.2 Application

In this section, we illustrate our method by analyzing a real dataset. The data contain the monthly change of the S&P/Case-Shiller HPI and the monthly growth rate of GDP of the United States from January 1990 to December 2002. The same data were analyzed via the nonparametric mixture of regressions in Huang et al. (2013); a scatterplot is shown in Figure 1(a). The previous analysis by Huang et al. (2013) investigated the impact of the GDP growth rate on the HPI change while ignoring the correlation among the time series observations. Here, we re-analyze the data using the HMM-NR model. The response is the HPI change, and the predictor is the GDP growth rate. The hidden state corresponds to the economic cycle to which each observation belongs. We analyze the data by a 2-state HMM-NR via the proposed estimation procedure. We


Table 2: Standard errors of nonparametric functional estimates via the parametric bootstrap when T = 200 and h = 0.06

  x            0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
  m1(·)   SD   0.071  0.069  0.065  0.065  0.063  0.065  0.068  0.068  0.068
          SE   0.069  0.071  0.067  0.058  0.061  0.067  0.063  0.059  0.064
          Std  0.017  0.014  0.014  0.012  0.012  0.013  0.014  0.012  0.014
  m2(·)   SD   0.054  0.047  0.052  0.069  0.051  0.057  0.048  0.046  0.058
          SE   0.050  0.052  0.060  0.061  0.055  0.052  0.051  0.048  0.052
          Std  0.009  0.012  0.013  0.015  0.012  0.010  0.009  0.009  0.012
  σ1²(·)  SD   0.031  0.028  0.023  0.018  0.022  0.025  0.025  0.023  0.025
          SE   0.033  0.034  0.026  0.020  0.024  0.030  0.027  0.024  0.025
          Std  0.013  0.011  0.008  0.007  0.008  0.010  0.009  0.011  0.011
  σ2²(·)  SD   0.017  0.015  0.023  0.025  0.019  0.016  0.015  0.015  0.017
          SE   0.016  0.018  0.024  0.026  0.022  0.020  0.018  0.016  0.017
          Std  0.005  0.006  0.008  0.009  0.008  0.006  0.006  0.005  0.007

choose N = 100 grid points evenly from the range of the predictor. The estimated mean functions are shown in Figure 1(b), together with the hard-clustering result, which is obtained by assigning state identities according to the largest $p_k(x_t, y_t)$, $k = 1, 2$. Our results are similar to those of Huang et al. (2013): in the lower component (cycle 1, from January 1990 to September 1997), the GDP growth tends to show a positive impact on the HPI change, while in the upper component (cycle 2, from October 1997 to December 2002), the HPI change tends to be lower when the GDP growth is moderate. The estimate of the transition matrix is $(\hat{\gamma}_{11}, \hat{\gamma}_{12}, \hat{\gamma}_{21}, \hat{\gamma}_{22}) = (1, 0, 0.01, 0.99)$. From this result, we conclude that the state of


Table 3: Standard errors of nonparametric functional estimates via the parametric bootstrap when T = 400 and h = 0.04

  x            0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
  m1(·)   SD   0.059  0.054  0.058  0.051  0.054  0.048  0.050  0.051  0.062
          SE   0.052  0.056  0.054  0.050  0.050  0.048  0.050  0.051  0.052
          Std  0.010  0.009  0.008  0.008  0.007  0.010  0.010  0.009  0.010
  m2(·)   SD   0.046  0.041  0.053  0.055  0.038  0.035  0.044  0.042  0.037
          SE   0.040  0.042  0.048  0.049  0.044  0.043  0.044  0.041  0.042
          Std  0.008  0.008  0.009  0.010  0.010  –      –      –      –
  σ1²(·)  SD   0.022  0.017  0.022  0.015  0.015  0.021  0.019  0.015  0.017
          SE   0.021  0.025  0.021  0.018  0.019  0.022  0.019  0.018  0.020
          Std  0.007  0.007  0.006  0.005  0.005  0.007  0.006  0.005  0.006
  σ2²(·)  SD   0.017  0.013  0.019  0.020  0.020  0.014  0.016  0.012  0.011
          SE   0.014  0.014  0.017  0.020  0.017  0.015  0.015  0.013  0.014
          Std  0.005  0.004  0.006  0.008  0.006  0.005  0.005  0.003  0.004

the macroeconomy is very stable over time.

4 Conclusion and discussion

In this paper, we extended the standard parametric hidden Markov model regression to a hidden Markov model with nonparametric regression. We further proposed a modified EM algorithm to estimate the model by combining the ideas of the EM algorithm and kernel regression. Simulation results demonstrate that the proposed estimation procedure performs quite well. This article mainly focuses on the new


Table 4: Standard errors of transition matrix estimates via the parametric bootstrap

        T = 200, h = 0.06        T = 400, h = 0.04
        SD      SE(Std)          SD      SE(Std)
  γ11   0.051   0.051(0.007)     0.035   0.035(0.003)
  γ21   0.052   0.051(0.007)     0.033   0.035(0.004)

methodology and its computation. Further work is needed to investigate the asymptotic properties of the proposed estimation procedure. In addition, we assume in this article that the transition matrix is constant and does not depend on the covariates; it would also be interesting to incorporate covariates into the transition matrix of the proposed HMM-NR. Finally, we assume that $S_t$ in the triplet $(S_t, X_t, Y_t)$ is unobserved while $X_t$ and $Y_t$ are observed. When $X_t$ is an unobserved hidden continuous random variable, conditionally Gaussian pairwise Markov switching models (CGPMSMs; Abbassi et al., 2015) could be applied. It is of interest to investigate whether the functional modeling method in our paper could be used to extend the CGPMSMs, similarly to the generalization of HMMR to HMM-NR.

Acknowledgements

We thank the referees, the Associate Editor, and the Editor, whose comments and suggestions have helped us improve the paper significantly. Huang's research is supported by the National Natural Science Foundation of China (NNSFC), Grant 11301324, and the Shanghai Chenguang Program. Yao's research is supported by NSF grant DMS-1461677 and a Department of Energy award, No. 10006272.

Figure 1: (a) Scatterplot of the U.S. house price index data (HPI change versus GDP growth rate). (b) Estimated mean functions with 95% confidence intervals and the hard-clustering results.


References

Abbassi, N., Benboudjema, D., Derrode, S., and Pieczynski, W. (2015). Optimal filter approximations in conditionally Gaussian pairwise Markov switching models. IEEE Transactions on Automatic Control, 60(4), 1104–1109.

Bui, H. H., Venkatesh, S., and West, G. (2002). Policy recognition in the abstract hidden Markov model. Journal of Artificial Intelligence Research, 17, 451–499.

Cappé, O., Moulines, E., and Rydén, T. (2009). Inference in hidden Markov models. In Proceedings of EUSFLAT Conference, pages 14–16.

Fan, J., Zhang, C., and Zhang, J. (2001). Generalized likelihood ratio statistics and Wilks phenomenon. Annals of Statistics, 29(1), 153–193.

Fridman, M. (1993). Hidden Markov model regression. Institute for Mathematics and its Applications (USA).

Gough, J., Karplus, K., Hughey, R., and Chothia, C. (2001). Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. Journal of Molecular Biology, 313(4), 903–919.

Huang, M., Li, R., and Wang, S. (2013). Nonparametric mixture of regression models. Journal of the American Statistical Association, 108(503), 929–941.

Huang, X. D., Ariki, Y., and Jack, M. A. (1990). Hidden Markov Models for Speech Recognition. Edinburgh University Press, Edinburgh.

Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of Molecular Biology, 305(3), 567–580.

Lee, Y., Ghosh, D., and Zhang, Y. (2014). Regression hidden Markov modeling reveals heterogeneous gene expression regulation: a case study in mouse embryonic stem cells. BMC Genomics, 15(1), 1.

MacDonald, I. L. and Zucchini, W. (1997). Hidden Markov and Other Models for Discrete-Valued Time Series, volume 110. CRC Press.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286.

Trabelsi, D., Mohammed, S., Chamroukhi, F., Oukhellou, L., and Amirat, Y. (2013). An unsupervised approach for automatic activity recognition based on hidden Markov model regression. IEEE Transactions on Automation Science and Engineering, 10(3), 829–835.

Wang, K., Li, M., Hadley, D., Liu, R., Glessner, J., Grant, S. F., Hakonarson, H., and Bucan, M. (2007). PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research, 17(11), 1665–1674.