Hidden Markov Models

Prof. Dr. Walter Zucchini, Andreas Berzel, Jan Bulla
Institute for Statistics and Econometrics, University of Göttingen

February 2006

[Title-page figure: weekly soap sales (units sold) against week, weeks 0–250, years 1993–1997; repeated as Figure 1.1 below]

Contents

1 Introduction
2 Fundamentals
   2.1 Independent mixture distributions
   2.2 Markov chains
   Problems for Chapter 2
3 Hidden Markov Models
   3.1 The basic hidden Markov model
   3.2 Marginal distributions and moments of a hidden Markov model
   3.3 The likelihood of a hidden Markov model
   Problems for Chapter 3
4 Parameter Estimation
   4.1 Forward and backward probabilities
   4.2 The EM-algorithm
   4.3 Direct maximization of the likelihood
      4.3.1 Parameter restrictions
      4.3.2 Numerical underflow
      4.3.3 An efficient algorithm
   4.4 Standard errors of the parameter estimates
   Problems for Chapter 4
5 Forecasting and Decoding
   5.1 Conditional distributions
   5.2 Forecast distributions
   5.3 Decoding
      5.3.1 Local decoding
      5.3.2 Global decoding
      5.3.3 State prediction
   Problems for Chapter 5
6 Model Selection and Model Validation
   6.1 Model selection
   6.2 Model validation with pseudo-residuals
   Problems for Chapter 6
7 Applications and Extensions
   7.1 Hidden Markov models with various component distributions
   7.2 Second order hidden Markov models
   7.3 Hidden Markov models for multivariate series
      7.3.1 Series of multinomial-like observations and categorical series
      7.3.2 Other multivariate series
   7.4 Series which depend on covariates, such as time
   Problems for Chapter 7
Solutions
References

1 Introduction

Hidden Markov models (HMMs) are a class of models in which the distribution that generates an observation depends on the state of an underlying and unobserved Markov process. They show promise as flexible general-purpose models for univariate and multivariate time series, especially for discrete-valued series, including categorical series and series of counts (Zucchini and MacDonald, 1998). Consider, for example, a series of weekly sales of a specific soap product in a supermarket, as shown in Figure 1.1.¹

[Figure: weekly soap sales (units sold) against week, weeks 0–250, years 1993–1997]

Figure 1.1: Series of weekly sales of a specific soap product.

In this case, the application of standard time series models such as ARMA models is limited, as they are based on the normal distribution. The basic model for unbounded counts is instead the Poisson distribution. However, the standard Poisson model is not appropriate either, since, as will be demonstrated later, the series displays considerable overdispersion relative to the Poisson distribution and strong positive serial dependence. In addition, there seem to be some periods with a low rate of weekly sales and other periods with a relatively high rate of weekly sales. The class of hidden Markov time series models, which model the probability distribution of Xt in dependence on the unobserved (hidden) state Ct of an m-state Markov chain and which can accommodate both overdispersion and serial dependence, therefore seems to be a useful tool for modelling this series and attempting to understand its structure.

¹ The data are taken from the DFF database provided by the Marketing Group of the University of Chicago on the website "http://gsbwww.uchicago.edu/research/mkt/Databases/Databases.html".

The fitting of a Poisson hidden Markov model to the series of weekly soap sales will constitute an integral part of this course; most of the aspects of HMMs introduced during the course will be demonstrated by means of this series.²

HMMs have been used for more than two decades in signal-processing applications, especially in the context of automatic speech recognition, but interest in the theory and applications of HMMs is rapidly expanding to other fields, e.g.

• all kinds of recognition: faces, speech, gesture, handwriting/signature,
• bioinformatics: biological sequence analysis,
• environment: wind direction, rainfall, earthquakes,
• finance: daily return series.

The bibliography at the end of this paper lists several articles and monographs that deal with the application of HMMs in these fields and may be of interest for further reading (Durbin et al., 1998; Elliott, Aggoun, and Moore, 1995; Koski, 2001; Rabiner, 1989; Ephraim and Merhav, 2002). Attractive features of HMMs include their versatility, their mathematical tractability, and the fact that the likelihood is relatively straightforward to compute (Zucchini and MacDonald, 2001). In detail, HMMs are characterized by the following properties:

• all moments are available: mean, variance, autocorrelations,
• the likelihood is easy to compute: computation is linear in T,
• marginal distributions are available: missing observations pose no problem,
• conditional distributions are available: outlier identification; k-step-ahead forecast distribution; joint distribution of multiple forecasts.

In addition, HMMs are interpretable in many cases and can easily accommodate additional covariates. Furthermore, they are moderately parsimonious, i.e. a simple two-state model often provides a reasonable fit.
The main objectives when dealing with hidden Markov models are the following:

• reveal the structure of the data, i.e. trend, seasonal variation and serial dependence,
• forecast future values, including prediction intervals,
• identify unusual values,
• relate the observations to other series (covariates).

² The main intention of this paper is to introduce the basic hidden Markov model and its properties. For that reason we do not consider any covariates for the soap sales series, although this would be reasonable and also possible in a hidden Markov model (for details see Section 7.4). It may well be that, taking into account covariates such as price, one would apply models other than a hidden Markov model, e.g. a simple regression model.


This course intends to give an introduction to the simple hidden Markov model (HMM). It is simple in the sense that it is restricted to stationary time series (i.e. without trend or seasonal variation). The observations may be either discrete- or continuous-valued, but in this course we will assume them to be univariate and we will ignore any information that may be available on covariates. Only at the end of this course will we give a short overview of possible extensions of the basic hidden Markov model. The emphasis will be on the application of the models, in particular model specification, parameter estimation, model selection, diagnostic checking, and forecasting.

The organisation of this course is as follows. Chapter 2 provides the basic ideas of mixture distributions and Markov chains, two fundamental concepts that are needed for an understanding of the basic structure of hidden Markov models. Chapter 3 introduces the simple HMM and its basic properties. The estimation of HMMs is presented in Chapter 4. Further aspects of HMMs, such as forecasting and decoding, are treated in Chapter 5, while Chapter 6 deals with model selection and testing of HMMs. Finally, the course concludes with a brief description of possible applications (with various component distributions) and extensions of the basic HMM (Chapter 7). Each chapter contains a few problems for practice, some rather theoretical, others more practical, which represent an integral part of this course. Since it is difficult to grasp the idea of hidden Markov models without applying them in practice, and as some aspects of hidden Markov models are only covered by these problems, working through the problems is strongly recommended!
The practical problems are meant to be solved by means of the software package R (R Development Core Team, 2005).³ Solutions to the theoretical problems are provided at the end of this paper, while solutions to the practical problems will be made available in a separate document.

³ The software is freely available on "http://www.r-project.org".

2 Fundamentals

This chapter is intended to provide a short introduction to two fundamental concepts that are needed for an understanding of the basic structure of HMMs. First, a short overview of mixture distributions is given (Section 2.1), because the marginal distribution of an HMM is a discrete mixture model. Afterwards, Markov chains are introduced (Section 2.2); they are of great importance since the parameter process of an HMM is modelled by a Markov chain.

2.1 Independent mixture distributions

Reconsider the series of weekly soap sales introduced in Chapter 1 and assume that there is no serial correlation in the series, i.e. the weekly sales are independent counts. The basic model for independent unbounded counts is the Poisson distribution, with probability function

p(x) = λ^x e^(−λ) / x!

and the restrictive property that the variance equals the mean (E(X) = Var(X)). For the soap sales series one has x̄ = 5.44 and s² = 15.4, indicating strong overdispersion relative to the Poisson distribution and therefore the inappropriateness of the latter. This is displayed in Figure 2.1, whose left panel shows a barplot of the observations and the fitted Poisson distribution. In addition, the distribution of the weekly soap sales is bimodal, while the Poisson distribution has only one mode.
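The mean–variance comparison takes only a few lines of code. The course's practical problems use R; the following Python sketch is an illustration only, and the counts in it are made up, not the actual soap sales data:

```python
def dispersion_stats(xs):
    """Sample mean and unbiased sample variance of a series of counts."""
    n = len(xs)
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, s2

# For a Poisson sample, mean and variance should be roughly equal;
# s2 much larger than the mean signals overdispersion.
toy_counts = [2, 3, 1, 4, 15, 12, 2, 18, 3, 1]  # illustrative, not the soap data
mean, s2 = dispersion_stats(toy_counts)
overdispersed = s2 > mean
```

For the real series the same computation yields x̄ = 5.44 and s² = 15.4, the values quoted above.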

[Figure: left panel "counts and Poisson model" (observed relative frequency vs. fitted Poisson model); right panel "soap sales series and constant rate of sales" (units sold per week with fitted rate λ̂)]

Figure 2.1: Soap sales series and fitted Poisson distribution.


One method of overcoming the deficiencies of the Poisson distribution in such a case is to use an independent mixture model. Mixture distributions can be applied in many different fields. They are very useful when treating overdispersed observations or data with bimodal or, more generally, multimodal distributions. Mixture models assume that the overdispersion and/or multimodality of the observations of a variable may be due to unobserved heterogeneity in the population, i.e. the population consists of different groups with different distributions for that variable. Imagine, for example, the distribution of the number of packages of cigarettes bought by a number of customers in a supermarket. The customers can be separated into several groups, e.g. non-smokers, persons who smoke occasionally, etc. This leads to overdispersion relative to the Poisson distribution and maybe even to a multimodal distribution of their purchases.

Suppose that each count of the soap sales series is generated by one of two Poisson distributions with means λ1 and λ2, where the choice of mean is made by some other random mechanism which we call the parameter process.⁴ For example, if λ1 is selected with probability δ1 and λ2 with probability δ2 = 1 − δ1, then the variance of the resulting series is greater than its mean, as will be shown later in this section. If the parameter process is a series of independent random variables then the counts are also independent. This is an example of an independent mixture.

In general, an independent mixture distribution consists of a certain number of component distributions.⁵ These component distributions can be either discrete or continuous. In the case of two component distributions, the mixture is characterized by the two random variables X1 and X2 and their probability functions or probability density functions (pdf), respectively:

random variable        X1      X2
probability function   p1(x)   p2(x)
pdf                    f1(x)   f2(x)

Moreover, for the parameter process a discrete random variable C is needed to perform the mixture:

C := { 1 with probability δ1
     { 2 with probability δ2 = 1 − δ1 .

⁴ Here, we do not have heterogeneity in a group of customers but heterogeneity over time. There are phases with a low rate of sales and phases with a high rate of sales. This might be due to covariates such as the price, which we do not consider here.

⁵ This is the case of a discrete mixture. In many applications it is not reasonable to assume that the population heterogeneity is represented by a finite number of components having a discrete distribution (Böhning, 2000). Instead, one might think of a continuous mixture model, which can be interpreted as a special case of a discrete mixture with infinitely many components. A prominent example of a continuous mixture is the negative binomial distribution, which can be derived as a mixture of Poisson distributions where the parameter λ is gamma-distributed. As continuous mixtures are of no relevance for HMMs, in the following we only consider discrete mixture distributions. More details on continuous mixtures can be found in Zucchini, Böker, and Stadie (2001) or Böhning (2000).


One may imagine C like tossing a coin: if C takes the value 1, draw an observation from X1; if C takes the value 2, draw an observation from X2. The structure of this process for the case of two continuous component distributions is shown in Figure 2.2.

[Figure: parameter process selecting state 1 (δ1 = 0.75) or state 2 (δ2 = 0.25); state-dependent process with densities f1 and f2; resulting observations, e.g. 24.3, 16.8, 9, 12.1, 31.6, 14.5]

Figure 2.2: Process structure of a two-component mixture distribution.

Please note: in practice, we do not know which way the coin landed. Only the observations generated by either X1 or X2 can be observed, and in most cases they cannot be assigned to a distinct random variable! Given the probability of each component and the respective probability distributions, the probability (density) function of the mixture can be computed easily. Let X denote the mixture random variable. Then the probability (density) function is given by

p(x) = δ1 p1(x) + δ2 p2(x)   (discrete case)

or

f(x) = δ1 f1(x) + δ2 f2(x)   (continuous case) .

The proof for the discrete case is straightforward:

P(X = x) = P(({X1 = x} ∩ {C = 1}) ∪ ({X2 = x} ∩ {C = 2}))
         = P(X1 = x, C = 1) + P(X2 = x, C = 2)
         = P(X1 = x | C = 1) P(C = 1) + P(X2 = x | C = 2) P(C = 2)
         = p1(x) δ1 + p2(x) δ2 .

The result for the continuous case can be proved analogously. Two graphical examples of two-component mixtures are given in Figure 2.3.


[Figure: left column "discrete component mixture" showing P1, P2 and the mixture 0.3 P1 + 0.7 P2; right column "continuous component mixture" showing f1, f2 and the mixture 0.4 f1 + 0.6 f2]

Figure 2.3: Two-component mixture distributions with discrete or continuous components.

The extension to the m-component case is straightforward. Let δ1, δ2, ..., δm denote the weights assigned to the different components, and let p1, p2, ..., pm or f1, f2, ..., fm denote their probability functions or probability density functions, respectively. Then the distribution of the mixture, given by the random variable X, is simply a linear combination of the component distributions:

p(x) = Σ_{i=1}^m δi pi(x)   (discrete case) ,
f(x) = Σ_{i=1}^m δi fi(x)   (continuous case) .

Moreover, the calculation of the expectation of the mixture can also be traced back to the expectations of the component distributions. For a mixture of m discrete component distributions one obtains

E(X) = Σ_x x p(x) = Σ_x x Σ_{i=1}^m δi pi(x) = Σ_{i=1}^m δi Σ_x x pi(x) = Σ_{i=1}^m δi E(Xi) .

An analogous result can be derived for a mixture of continuous component distributions. Thus, in the case of two discrete or continuous components it holds that

E(X) = δ1 E(X1) + δ2 E(X2) = δ1 E(X1) + (1 − δ1) E(X2) .

More generally, the k-th moment E(X^k) of a mixture is simply a linear combination of the respective moments of its components:

E(X^k) = Σ_{i=1}^m δi E(Xi^k) ,   k = 1, 2, ...


Note that this does not hold for the central moments, e.g. the variance of a mixture:

Var(X) ≠ Σ_i δi Var(Xi) .

For example, in the two-component case the variance of the mixture can be calculated from the following formula:

Var(X) = δ1 Var(X1) + δ2 Var(X2) + δ1 δ2 (E(X1) − E(X2))² .

For the derivation of that formula see Problem 2.1! In general, the variance of a mixture model can be computed using the identity Var(X) = E(X²) − (E(X))² in combination with the above result for E(X^k). For example, for a Poisson mixture one has

E(X) = Σ_i δi λi   and   E(X²) = Σ_i δi (λi + λi²) ,

and it follows that

Var(X) = Σ_i δi (λi + λi²) − (Σ_i δi λi)² .
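The moment formulas are easy to check numerically. A minimal Python sketch (the mixture parameters below are illustrative, not estimates):

```python
def mixture_moments(deltas, lams):
    # Poisson mixture: E(X) = sum_i d_i lam_i ; E(X^2) = sum_i d_i (lam_i + lam_i^2)
    mean = sum(d * l for d, l in zip(deltas, lams))
    ex2 = sum(d * (l + l * l) for d, l in zip(deltas, lams))
    var = ex2 - mean ** 2          # Var(X) = E(X^2) - (E(X))^2
    return mean, var

# two-component check against the closed formula
# Var(X) = d1 Var(X1) + d2 Var(X2) + d1 d2 (E(X1) - E(X2))^2, with Var(Xi) = lam_i
d1, d2, l1, l2 = 0.85, 0.15, 4.22, 12.5
mean, var = mixture_moments([d1, d2], [l1, l2])
var_formula = d1 * l1 + d2 * l2 + d1 * d2 * (l1 - l2) ** 2
```

Both routes give the same variance, and the variance exceeds the mean whenever λ1 ≠ λ2, i.e. the Poisson mixture is overdispersed.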

The estimation of the parameters of a mixture distribution is usually performed by a maximum likelihood (ML) algorithm. In general, the likelihood of a mixture model with m components is given by

L(θ1, ..., θm, δ1, ..., δm; x1, ..., xn) = Π_{j=1}^n Σ_{i=1}^m δi pi(xj, θi)   (discrete case) or
L(θ1, ..., θm, δ1, ..., δm; x1, ..., xn) = Π_{j=1}^n Σ_{i=1}^m δi fi(xj, θi)   (continuous case) ,

where θ1, ..., θm are the parameter vectors of the component distributions, δ1, ..., δm are the mixing parameters, and x1, ..., xn are the observations. Thus, in the case of one-parametric component distributions, 2m − 1 parameters have to be estimated, namely m parameters for the component distributions and m − 1 mixing parameters, as the mixing parameters must sum to one!

The maximization of the likelihood is not trivial, since it is not possible to solve the maximization problem analytically. This shall be demonstrated by a simple example. Suppose X1, X2 are Poisson distributed with means λ1 and λ2, i.e. X1 ~ Po(λ1), X2 ~ Po(λ2), and let δ1, δ2 be the mixing parameters with δ1 + δ2 = 1. This means the mixture distribution p(x) is given by p(x) = δ1 p1(x) + δ2 p2(x), where

p1(x) = λ1^x e^(−λ1) / x!   and   p2(x) = λ2^x e^(−λ2) / x! .


As δ2 = 1 − δ1, three parameters have to be estimated: λ1, λ2, δ1. Let x1, x2, ..., xn denote the observed values of the mixture distribution with probability function p(x). Then the likelihood is given by the following expression:

L(λ1, λ2, δ1; x1, x2, ..., xn) = p(x1) p(x2) ... p(xn)
  = Π_{i=1}^n p(xi)
  = Π_{i=1}^n [ δ1 λ1^{xi} e^(−λ1) / xi! + (1 − δ1) λ2^{xi} e^(−λ2) / xi! ] .
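This likelihood is straightforward to evaluate in code, even though its maximization must be numerical. The course's practical problems use R (where one would hand such an objective to an optimizer such as nlm or optim); the following Python sketch, with made-up toy data and a deliberately crude grid search, only illustrates the idea:

```python
import math

def neg_log_likelihood(lam1, lam2, delta1, data):
    # -log L(lam1, lam2, delta1; x_1..x_n) for a two-component Poisson mixture
    if not (lam1 > 0 and lam2 > 0 and 0 < delta1 < 1):
        return float("inf")   # outside the parameter space
    nll = 0.0
    for x in data:
        p1 = lam1 ** x * math.exp(-lam1) / math.factorial(x)
        p2 = lam2 ** x * math.exp(-lam2) / math.factorial(x)
        nll -= math.log(delta1 * p1 + (1.0 - delta1) * p2)
    return nll

# a crude grid search stands in for a real optimizer
data = [0, 1, 2, 1, 10, 12, 9, 0, 2, 11]   # toy counts, not the soap data
best = min(
    ((l1, l2, d) for l1 in range(1, 15) for l2 in range(1, 15) for d in (0.3, 0.5, 0.7)),
    key=lambda p: neg_log_likelihood(p[0], p[1], p[2], data),
)
```

The grid-search result is only a rough starting point; refined estimates would come from a proper numerical optimizer or, as discussed in Chapter 4, the EM algorithm.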

The maximization with respect to λ1, λ2 and δ1 is not trivial, as the likelihood is a product of n sums. Taking the logarithm and differentiating does not simplify the expression, so the calculation of an explicit solution becomes lengthy and nasty; it is much trickier than usual! Therefore, parameter estimation has to be carried out by numerical maximization of the likelihood using special software. A very useful software package for the estimation of mixture models is C.A.MAN, which was developed by Böhning, Schlattmann, and Lindsay (1992).⁶ It can be used not only if the number of components is fixed but also in the case that the number of components is unknown and has to be estimated from the observations. In the latter case the likelihood function has to be maximized over the parameters given above and the number of components m. Note that in general fitting one component for each observation does not maximize the likelihood! If one uses C.A.MAN to fit a mixture of two or three Poisson distributions to the soap sales data, one obtains the following results:

two-component mixture:

  i    δi     λi
  1    0.85   4.22
  2    0.15   12.5

with log L = −634,

three-component mixture:

  i    δi     λi
  1    0.11   1.44
  2    0.77   4.80
  3    0.12   13.5

with log L = −628.

The value of the log-likelihood can even be slightly improved by using a four-component mixture but the improvement is too small to justify the additional parameters! Figure 2.4 displays barplots of the original observations and the fitted mixture models.

⁶ The software is provided on http://www.medizin.fu-berlin.de/sozmed/caman.html.


[Figure: left panel "counts and two-component mixture", right panel "counts and three-component mixture"; each shows the observed relative frequencies and the fitted mixture]

Figure 2.4: Soap sales data and fitted Poisson mixtures.

It is obvious that both mixtures fit the observations much better than a simple Poisson distribution. In addition, for the three-component mixture one obtains E(X) = 5.44 and Var(X) = 15.2, which is quite similar to the respective moments of the dataset.

A field in which mixture distributions with continuous components may be applied is the analysis of stock returns, as demonstrated in the following. Figure 2.5 shows the daily percentage return on the New York Stock Exchange Composite Index (NYSE CI) between 01/02/1990 and 11/29/1996.

[Figure: daily percentage return (%) against trading day, 1990–1996]

Figure 2.5: Percentage return on NYSE CI, 01/02/1990–11/29/1996.

It is clearly visible that the variance of the returns is not constant over the whole trading period. Instead, there are some periods with low absolute returns and other periods with high absolute returns. For that reason a simple normal distribution does not provide an adequate description of the daily percentage return on the Composite Index, as can be seen in Figure 2.6, which shows a histogram of the daily returns and a fitted normal distribution.



Figure 2.6: Histogram of daily returns on NYSE CI and fitted normal distribution.

It becomes clear that the fitted normal underestimates the probability of extremely small absolute returns as well as that of extremely large absolute returns. In contrast, a mixture of two normal distributions or a mixture of a normal and a double-exponential distribution, as shown in Figure 2.7, provides a better fit.

[Figure: left panel "mixture of two normal distributions" with components (A) N(0.08; 0.4) and (B) N(−0.03; 1.1); right panel "normal/double-exponential mixture" with components (A) N(0.08; 0.4) and (B) D-exp(−0.03; 1.1); in each panel the mixture distribution is 75% (A) + 25% (B)]

Figure 2.7: Histogram of daily returns on NYSE CI and fitted mixtures.

These mixtures were fitted by hand, since the software package C.A.MAN does not work for all mixtures. Note that it is even possible to use different model families for the different components!

2.2 Markov chains

The theory of Markov chains is well elaborated; here, however, we can only give a short introduction to Markov chains and some of their basic properties that are needed for the construction of HMMs. For a detailed description of Markov chains see e.g. Grimmett and Stirzaker (2001) or Parzen (1962).

Consider a stochastic process, i.e. a sequence of random variables {Ct : t ∈ T} indexed by some set T which take values in some set S, called the state space. The time set T may be either discrete or continuous, but in the following we concentrate on the discrete case, i.e. on discrete-time processes {C1, C2, C3, ...} = {Ct : t ∈ T} where T = {1, 2, 3, ...}. A stochastic process {Ct : t = 1, 2, ...} is a Markov process if, for each time t, the next state Ct+1 depends only on the current state Ct. That is, given Ct, Ct+1 does not depend further on the history {C1, C2, ..., Ct−1} of the chain.


More mathematically, one can define a Markov process as follows. Let c1, c2, ..., ct, ct+1, t ∈ {1, 2, ...}, denote a sequence of observations of a stochastic process {Cs, s = 1, 2, ...}. {Cs} is a Markov process if it has the Markov property

P(Ct+1 = ct+1 | Ct = ct, Ct−1 = ct−1, ..., C1 = c1) = P(Ct+1 = ct+1 | Ct = ct)

for all t ∈ {1, 2, ...} (Grimmett and Stirzaker, 2001), where the conditioning event on the left-hand side is the entire history of the process. I.e., given the present of the process, Ct, its future, Ct+1, is independent of its past Ct−1, Ct−2, ..., C1. This structure of a Markov process is illustrated in Figure 2.8.

Ct−2 → Ct−1 → Ct → Ct+1

Figure 2.8: Markov process.

Up to now, we have not specified the state space S any further. Analogous to the time set T, the state space S of a Markov process may be either continuous or discrete. A Markov process whose state space is discrete is called a Markov chain. In the following, we will only deal with Markov chains whose state space is a finite set of integers S = {1, 2, ..., m}, i.e. ct ∈ {1, 2, ..., m} for all t ∈ {1, 2, ...}. Then, if Ct = i, i ∈ {1, 2, ..., m}, we say that the Markov chain is in the i-th state at time t.

In order to describe a Markov chain, the probabilities of changing from one state i to another state j, P(Ct+1 = j | Ct = i), deserve a closer look. In general, there are two possibilities: these so-called "transition probabilities" can change in time or they can be time-stable. The latter case can be characterized as follows. A Markov chain is called homogeneous, or a Markov chain with stationary transition probabilities, if the transition probabilities are independent of t, i.e.

P(Ct+1 = j | Ct = i) =: γij   for all t ∈ {1, 2, ...} and i, j ∈ {1, ..., m} .

Then γij denotes the probability of a transition from state i to state j. In the following, we will only deal with homogeneous Markov chains. The transition probabilities of a homogeneous m-state Markov chain can be summarized in an m × m matrix, the so-called transition probability matrix Γ:

       ( γ11 · · · γ1m )
Γ :=   (  ⋮   ⋱   ⋮   )
       ( γm1 · · · γmm )

with γij := P(Ct+1 = j | Ct = i) and Σ_{j=1}^m γij = 1 for all i.
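A homogeneous Markov chain can be simulated directly from Γ by using each row as the conditional distribution of the next state. The course's practical problems use R; the following Python sketch (with an illustrative 2 × 2 matrix and 0-based states, unlike the 1-based states in the text) shows the idea:

```python
import random

def simulate_chain(gamma, start, n, rng):
    """Simulate n steps of a homogeneous Markov chain.

    gamma: transition probability matrix as a list of rows (each row sums to 1)
    start: initial state (0-based index here)
    """
    states = [start]
    for _ in range(n - 1):
        row = gamma[states[-1]]        # conditional distribution of the next state
        u, c, j = rng.random(), 0.0, 0
        for j, p in enumerate(row):    # invert the row's cumulative distribution
            c += p
            if u < c:
                break
        states.append(j)
    return states

gamma = [[0.9, 0.1], [0.6, 0.4]]       # illustrative matrix
path = simulate_chain(gamma, 0, 1000, random.Random(7))
```

Over a long run the empirical state frequencies approach the stationary distribution discussed below.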


As each row i of Γ describes a probability function, namely the transition probabilities for changing from state i to any state j, the row sums equal one. The transition probability matrix Γ contains the one-step transition probabilities and thus describes the short-term behaviour of the Markov chain. For describing the long-term behaviour of a Markov chain one can define the k-step transition probabilities, i.e. the probabilities of transition from state i at time t to state j at time t + k:

γij(k) := P(Ct+k = j | Ct = i) .

The k-step transition probabilities can easily be computed via some matrix algebra, as the matrix Γ(k) containing the k-step transition probabilities is simply the k-th power of the transition probability matrix Γ:

          ( γ11(k) · · · γ1m(k) )
Γ(k) :=   (   ⋮     ⋱     ⋮    )   = Γ^k .
          ( γm1(k) · · · γmm(k) )

This result follows from the Chapman-Kolmogorov equations (for a proof see Grimmett and Stirzaker (2001)):

γij(s + u) = Σ_{h=1}^m γih(s) γhj(u) .

In this context one says that state i communicates with state j, written i → j, if the chain may ever visit state j with positive probability, starting from state i. That is, i → j if there exists some k ∈ {1, 2, ...} with γij(k) > 0. Furthermore, one says that i and j intercommunicate, and writes i ↔ j, if i → j and j → i. A Markov chain is defined to be irreducible if i ↔ j for all i, j ∈ {1, 2, ..., m}. In the following, as in most applications, we assume the Markov chain to be irreducible.

The k-step transition probabilities provide the conditional probability of being in state j at time t + k, given that the chain is in state i at time t. However, in general one is also interested in the marginal probability δi(t) of being in state i at a given time t. Given the initial probability distribution of the first state

δ(1) := (δ1(1), δ2(1), ..., δm(1)) = (P(C1 = 1), P(C1 = 2), ..., P(C1 = m)) ,⁷

the probability function of the state Ct can be computed as

δ(t) := (δ1(t), ..., δm(t)) = (P(Ct = 1), ..., P(Ct = m)) = δ(1)Γ^{t−1} .

A question then arising is whether any statement can be made concerning the behaviour of δ(t) for very large t.
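The relation Γ(k) = Γ^k is easy to check numerically. A minimal Python sketch with an illustrative 2 × 2 matrix (the course's practical problems use R):

```python
def mat_mul(a, b):
    # product of two square matrices given as lists of rows
    n = len(a)
    return [[sum(a[i][h] * b[h][j] for h in range(n)) for j in range(n)]
            for i in range(n)]

def k_step(gamma, k):
    # Gamma(k) = Gamma^k, by the Chapman-Kolmogorov equations
    result = gamma
    for _ in range(k - 1):
        result = mat_mul(result, gamma)
    return result

gamma = [[0.9, 0.1], [0.6, 0.4]]   # illustrative transition matrix
gamma2 = k_step(gamma, 2)          # two-step transition probabilities
```

Each row of Γ(k) is again a probability distribution, so the row sums remain one for every k.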

⁷ We would like to mention that, as a convention, vectors are row vectors in this paper, i.e. δ is a row vector, while δ′ is the respective column vector.


One can show that, in case the Markov chain is homogeneous and irreducible, δ(t) converges to a fixed vector δ := (δ1, δ2, ..., δm) for large t, and that δ is the unique vector of length m solving

Σ_{i=1}^m δi γij = δj   for all j ∈ {1, 2, ..., m} ,   subject to   Σ_{j=1}^m δj = 1 ,

or equivalently

δ = δΓ   subject to   δ1′ = 1 ,

where 1 is a row vector of ones (so that 1′ is the respective column vector). For a proof of that result see e.g. Seneta (1981). The system of equations can be solved easily. For example, for a two-state Markov chain with transition probability matrix

Γ = ( γ11  γ12 )
    ( γ21  γ22 )

one obtains (see Problem 2.3):

δ = (δ1  δ2) = 1/(γ12 + γ21) · (γ21  γ12) .
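The two-state formula can be verified numerically: a small Python sketch (illustrative matrix) computes δ from γ12 and γ21 and then checks the defining equation δ = δΓ:

```python
def stationary_two_state(g12, g21):
    # delta = (g21, g12) / (g12 + g21) for a two-state chain
    s = g12 + g21
    return [g21 / s, g12 / s]

def vec_mat(v, m):
    # row vector times matrix
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

gamma = [[0.9, 0.1], [0.6, 0.4]]   # illustrative transition matrix
delta = stationary_two_state(gamma[0][1], gamma[1][0])
check = vec_mat(delta, gamma)      # should reproduce delta, since delta = delta * Gamma
```

For larger m one would instead solve the linear system δ = δΓ, δ1′ = 1 with a general-purpose solver.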

The vector δ is called the stationary distribution of Γ, and a Markov chain is said to be stationary if the stationary distribution δ exists and δ(t) = δ holds for each t ∈ {1, 2, ...}, in particular for the initial distribution of the first state, i.e. δ(1) = δ. In practice, it depends on the application whether one assumes the underlying Markov chain of an HMM to be stationary or not; however, apart from the following example, we consider only stationary Markov chains in this paper.

In the following we consider a simple example in order to illustrate the given properties of Markov chains. Imagine a sequence of rainy and sunny days (starting with a sunny day), modelled in such a way that tomorrow's weather depends only on today's weather, and the transition probabilities of the respective Markov chain are given by the following table:

                 day t + 1
day t        rainy    sunny
rainy          0.9      0.1
sunny          0.6      0.4

These transition probabilities can be interpreted as follows. Assuming today to be rainy, the probability that tomorrow is rainy as well is 90%, while the probability of a sunny day is only 10%. Analogously, the probability that a sunny day is followed by another sunny day is 40%, while the probability that a rainy day follows is 60%. This is illustrated again in Figure 2.9.


Figure 2.9: Transition probabilities for a sequence of rainy and sunny days.

This situation can be modelled by a simple two-state Markov chain, namely with its transition probability matrix Γ given by

Γ := ( γ11 γ12 ; γ21 γ22 )
   = ( P(Ct+1 = 1 | Ct = 1)  P(Ct+1 = 2 | Ct = 1) ; P(Ct+1 = 1 | Ct = 2)  P(Ct+1 = 2 | Ct = 2) )
   = ( P(Ct+1 = rainy | Ct = rainy)  P(Ct+1 = sunny | Ct = rainy) ; P(Ct+1 = rainy | Ct = sunny)  P(Ct+1 = sunny | Ct = sunny) )
   = ( 0.9 0.1 ; 0.6 0.4 ).

Furthermore, as we assume the Markov chain to start with a sunny day, we have the following probability distribution δ(1) of today's weather:

δ(1) = (δ1(1) δ2(1)) = (P(C1 = 1) P(C1 = 2)) = (P(C1 = rainy) P(C1 = sunny)) = (0 1).

Given Γ and δ(1), the probability distributions of the weather on the following days can be computed as follows:

δ(2) = (P(C2 = 1) P(C2 = 2)) = δ(1)Γ = (0 1)( 0.9 0.1 ; 0.6 0.4 ) = (0.6 0.4),


δ(3) = δ(2)Γ = (0.6 0.4)( 0.9 0.1 ; 0.6 0.4 ) = (0.78 0.22),
...
δ(k) = δ(k − 1)Γ = δ(1)Γ^{k−1}.

Thus, for example, the probability that the day after tomorrow is rainy, given that today is sunny, equals 78%. In addition, solving

(δ1 δ2) = (δ1 δ2)( 0.9 0.1 ; 0.6 0.4 )   subject to   δ1 + δ2 = 1,

or using the formula given above, leads to the stationary distribution

δ = (0.86 0.14).

This means that the probability that a given day sufficiently far in the future is rainy is about 86%, while the probability of a sunny day equals only 14%, independent of the fact that today is sunny. Furthermore, if we had not known the initial state, and thus δ(1), and had assumed the Markov chain to be stationary, δ would be the marginal probability distribution of the weather on any given day, even today!

Problems for Chapter 2

Problem 2.1 Let X be a random variable which is distributed as a mixture of two distributions with expectations µ1, µ2 and variances σ1², σ2², respectively, where the mixing parameters are δ1 and δ2 with δ1 + δ2 = 1.

(a) Show that Var(X) = δ1σ1² + δ2σ2² + δ1δ2(µ1 − µ2)².

(b) Show that a mixture of two Poisson distributions, Po(λ1) and Po(λ2), with λ1 ≠ λ2, is overdispersed, that is Var(X) > E(X).

Problem 2.2 The object of this exercise is for you to learn to write R functions that enable you to work with mixture distributions. To carry out this exercise you need to use the R functions dpois(), ppois(), qpois(), rpois(). If you are not familiar with these then use the help command to learn how to use them.


Consider a mixture of two Poisson distributions with means λ1 and λ2 and mixing parameter δ. The probability function of this mixture model is given by

p(x) = δ · e^{−λ1} λ1^x / x! + (1 − δ) · e^{−λ2} λ2^x / x!

for x = 0, 1, 2, ..., and zero elsewhere.

(a) Write a set of R functions:

dpoismix(x,lambda,delta), ppoismix(q,lambda,delta),
qpoismix(p,lambda,delta), rpoismix(n,lambda,delta),

analogous to the four standard functions for the Poisson distribution, but for a mixture of two Poisson distributions. The argument lambda in these functions is a vector of length 2 that contains the values of λ1 and λ2. You may use any of the available R functions, such as dpois() and ppois(), to construct your functions. The tricky one is qpoismix(p,lambda,delta): it should compute the quantile, defined as the smallest non-negative integer x such that F(x) ≥ p. For experienced R users: write qpoismix() so that it works when p is a vector.

(b) Use graphics to check and illustrate your functions. In particular, verify that the random samples generated using rpoismix() have the required properties; for example, their frequency distribution should correspond to the probabilities obtained using dpoismix().

Problem 2.3 Consider a stationary two-state Markov chain with transition probability matrix

Γ = ( γ11 γ12 ; γ21 γ22 ).

(a) Show that the corresponding stationary distribution is given by

(δ1 δ2) = 1/(γ12 + γ21) · (γ21 γ12).

(b) Consider the case

Γ = ( 0.9 0.1 ; 0.2 0.8 )

and the following two sequences of observations that are assumed to be generated by the above Markov chain:

sequence 1: 1 1 1 2 2 1,
sequence 2: 1 2 1 2 1 1.

Compute the probability of each of the sequences. Note that each sequence contains the same number of ones and twos. Try to figure out why these sequences are not equally probable.


Problem 2.4 Consider a stationary two-state Markov chain with transition probability matrix

Γ = ( γ11 γ12 ; γ21 γ22 ) = ( 1 − γ1  γ1 ; γ2  1 − γ2 ).

Show that the k-step transition probability matrix, i.e. Γ^k, is given by

Γ^k = ( δ1 δ2 ; δ1 δ2 ) + w^k ( δ2 −δ2 ; −δ1 δ1 ),

where w = 1 − γ1 − γ2. [Hint: One way to show this is to use mathematical induction.]

Problem 2.5 Find out how to use the following R commands:

• %*% (matrix multiplication),
• t() (transpose a matrix),
• solve() (solve a system of linear equations, or invert a matrix),
• diag() (extract or replace the diagonal of a matrix, or construct a diagonal matrix).

Then, write an R function statdist(gamma) that computes the stationary distribution, δ, of a stationary m-state Markov chain with transition probability matrix gamma. The solution to question (a) in Problem 2.3 outlines how this can be done. In effect one computes a matrix A, which is I − Γ except that the last column is a vector of ones. The required stationary distribution is the solution of the system of linear equations A′x = b, where b is a vector whose last entry is 1 and whose other entries are zero (use the R function solve() to solve these equations).
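The linear-system construction that Problem 2.5 outlines can be sketched as follows, here in Python rather than R, purely for illustration (the exercise itself asks for an R function using solve()):

```python
# Sketch of the statdist computation outlined in Problem 2.5:
# build A = I - Gamma with its last column replaced by ones, then solve
# A' x = b with b = (0, ..., 0, 1)'.

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [M[r][c] - f * M[col][c] for c in range(n + 1)]
    return [M[i][n] / M[i][i] for i in range(n)]

def statdist(gamma):
    m = len(gamma)
    # A = I - Gamma, except that the last column is a vector of ones
    A = [[(1.0 if i == j else 0.0) - gamma[i][j] for j in range(m)]
         for i in range(m)]
    for i in range(m):
        A[i][m - 1] = 1.0
    At = [[A[i][j] for i in range(m)] for j in range(m)]  # transpose of A
    b = [0.0] * (m - 1) + [1.0]
    return solve(At, b)

print(statdist([[0.9, 0.1], [0.6, 0.4]]))  # approximately [0.857, 0.143]
```

Replacing the last column works because the rows of I − Γ sum to zero, so one of the m equations δ(I − Γ) = 0 is redundant and can be traded for the normalization δ1′ = 1.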

3 Hidden Markov Models

Applying a discrete mixture distribution to the soap sales series, as shown in Section 2.1, one may think of underlying, unobserved states that determine from which component of the mixture an observation is drawn. In an independent mixture model these states, and hence the observations, are assumed to be independent. However, this is not the case for the soap sales series, as can be seen from the autocorrelation function shown in Figure 3.1.

Figure 3.1: Soap sales series autocorrelation function (series mean = 5.44, variance = 15.4).

Figure 3.1 reveals that the soap sales observations are significantly correlated at lags one to three, indicating serial dependence in the series. Thus, an independent mixture is not an appropriate model here, as it does not use all the information contained in the data, and one should consider alternative models. One way of modelling count data series with serial correlation is to apply a Poisson HMM, which is a special case of a dependent mixture. Therefore, in the following sections the basic HMM and its properties, such as marginal distributions, moments, and the likelihood, will be introduced.⁸

⁸ Again we would like to point out that ignoring any covariates for the soap sales series influences the selection of an appropriate model. It may well be that the observed serial correlation is due to some covariate such as price. Thus, taking covariates into account, there might exist models which are more appropriate than a HMM, although covariates can also be incorporated in HMMs, as will be shown in Section 7.4.

3.1 The basic hidden Markov model

Let {St} = {St, t = 1, 2, ...} = {S1, S2, ...} denote a sequence of observations,

{Ct} = {Ct, t = 1, 2, ...} = {C1, C2, ...}

a Markov chain defined on the state space {1, 2, ..., m}, and let

S^(t) = {Si, i = 1, 2, ..., t} = {S1, S2, ..., St}   and   C^(t) = {Ci, i = 1, 2, ..., t} = {C1, C2, ..., Ct}

denote their respective histories up to time t. Consider a stochastic process consisting of two parts: firstly, the underlying but unobserved parameter process {Ct}, which has the Markov property

P(Ct | C^(t−1)) = P(Ct | Ct−1),

and secondly the state-dependent process {St}, for which

P(St | S^(t−1), C^(t)) = P(St | Ct)

holds. Then the stochastic process {St} is called an m-state hidden Markov model. The second property is called conditional independence and means that, if Ct is known, St depends only on Ct and not on previous states or observations.⁹ The basic structure of the HMM is illustrated in Figure 3.2.


Figure 3.2: Basic structure of a HMM.

Thus, a HMM is a combination of two processes: a Markov chain, which determines the state at time t, Ct = ct, and a state-dependent process, which generates the observation St = st depending on the current state Ct = ct; in fact, for each possible state of the state space {1, 2, ..., m} one has a different distribution for St. In general {Ct} may be any Markov chain; in this paper, however, we assume the Markov chain to be homogeneous and irreducible with transition probability matrix Γ. By the irreducibility of {Ct} there exists a unique stationary distribution δ of the Markov chain (for details see Section 2.2). We shall suppose, unless otherwise indicated, that {Ct} is stationary, so that δ is the distribution of Ct for all t.

Note that a HMM is a rather theoretical construction. In reality, only the state-dependent process {St} is observed, while the underlying Markov chain {Ct} remains unknown or hidden. However, in many applications there is a reasonable interpretation for the underlying states. Suppose, for example, that the daily return series introduced in Section 2.1 is modelled with a two-state HMM. Then the states of the underlying Markov chain might be interpreted as general states of the financial market, namely a state with low trading activity and a state with high trading activity.

⁹ Here and in the following, if not indicated otherwise, we refer to MacDonald and Zucchini (1997) as a standard reference.

The process generating the observations of a two-state HMM is illustrated in Figure 3.3.

Figure 3.3: Process structure of a two-state HMM.

In contrast to Figure 2.2, which showed the process structure of a two-component independent mixture model, here the probabilities for the state Ct+1 do depend on the state Ct, since the parameter process is modelled via a Markov chain. However, as in the case of independent mixtures, for each state one has a different distribution, either discrete or continuous, for the state-dependent random variable St at time t, as shown in Figure 3.4.


Figure 3.4: Component distributions of a HMM.

One may think of any possible distribution for modelling the state-dependent distributions. For example, assuming Poisson distributions as state-dependent distributions yields a Poisson HMM; if the state-dependent distributions are normal, one obtains a normal HMM. It is even possible to take different families of distributions for different states! The notation for the state-dependent distributions shall now be introduced in detail. Let Ct be the state of the Markov chain at time t, St the state-dependent random variable at time t, and s any possible realisation of St. Then, in the case of discrete components, we use the following notation for the state-dependent distributions:

Ct = 1 : p1(s) := P(St = s | Ct = 1)
Ct = 2 : p2(s) := P(St = s | Ct = 2)
...
Ct = m : pm(s) := P(St = s | Ct = m).

Hence, pi(s), i ∈ {1, ..., m}, represents the distribution of St given that Ct = i. Note that in general it would be necessary to add another index for the time t, i.e. to write p_{t,i}(s); however, as we assume that the state-dependent distributions do not change over time, we omit the time index in this paper. The notation for the continuous case is similar:

Ct = 1 : f1(s) := f(St | Ct = 1)
Ct = 2 : f2(s) := f(St | Ct = 2)
...
Ct = m : fm(s) := f(St | Ct = m).

Again, we omit the index for the time t.

3.2 Marginal distributions and moments of a hidden Markov model

In the preceding section the basic structure of a HMM and the notation for the state-dependent distributions were introduced. In this section we derive the marginal distributions of a HMM and the respective moments. By marginal distributions the following is meant: given a model for C1, ..., Cn, S1, ..., Sn, what is the distribution of St, or even the joint distribution of St and St+k? That is, we are interested in the unconditional distribution of St, P(St = s), instead of the conditional distribution pi(s) = P(St = s | Ct = i). The calculations for the derivation of P(St = s) can be carried out analogously to the case of independent mixture distributions, and one obtains:

P(St = s) = P({St = s, Ct = 1} ∪ {St = s, Ct = 2} ∪ ··· ∪ {St = s, Ct = m})
          = Σ_{i=1}^m P(St = s, Ct = i)
          = Σ_{i=1}^m P(Ct = i) P(St = s | Ct = i)
          = Σ_{i=1}^m δi pi(s).


The second step follows from the fact that the events {Ct = i} are disjoint, and hence so are the events {St = s, Ct = i}; the equality P(Ct = i) = δi is a consequence of the assumption that the underlying Markov chain is stationary, which means that its stationary distribution δ is the distribution of Ct for all t. Thus, P(St = s) is simply a linear combination of the respective probabilities of the component distributions, with the stationary distribution δ as the vector of weights. Using matrix notation, the result can be rewritten as

P(St = s) = (δ1 δ2 ··· δm) diag(p1(s), p2(s), ..., pm(s)) (1 1 ··· 1)′ = δP(s)1′,

where

P(s) := diag(p1(s), p2(s), ..., pm(s)),

i.e. a diagonal matrix containing the probabilities of the observation s conditioned on the different states.¹⁰

The computation of the bivariate marginal distribution of St and St+k, P(St = u, St+k = v), is slightly more challenging but follows the same procedure:

P(St = u, St+k = v)
 = Σ_{i=1}^m Σ_{j=1}^m P(St = u, St+k = v, Ct = i, Ct+k = j)
 = Σ_{i=1}^m Σ_{j=1}^m P(St = u, St+k = v | Ct = i, Ct+k = j) P(Ct = i, Ct+k = j)    (conditional independence)
 = Σ_{i=1}^m Σ_{j=1}^m pi(u) pj(v) P(Ct = i) P(Ct+k = j | Ct = i)
 = Σ_{i=1}^m Σ_{j=1}^m δi pi(u) γij(k) pj(v).

¹⁰ Note that we use a different notation here than that used by MacDonald and Zucchini (1997). They denote the diagonal matrix of the conditional probabilities by λ, a symbol which we will use later for a vector containing the means of the Poisson component distributions of a HMM. In addition, we would like to mention again that we assume the state-dependent probabilities to be constant over time, i.e. P(s) holds for all t; otherwise one would have to introduce a time index, e.g. write P_t(s).


Again, the result can be rewritten using matrix notation:

P(St = u, St+k = v) = δP(u)Γ^k P(v)1′,

where δ, Γ and 1′ are as above, and

P(u) = diag(p1(u), p2(u), ..., pm(u)),   P(v) = diag(p1(v), p2(v), ..., pm(v)).

The result given in matrix notation can be interpreted in a very convenient way. The initial probabilities of the first state are given by δ. Hence, δ is multiplied by the matrix P(u) containing the probabilities of the observation u conditioned on the different states. Then, multiplying k times by Γ moves from time t to time t + k. Finally, the observation v is modelled by P(v), and the unit vector sums the probabilities up. Analogously, one can derive formulae for multivariate marginal distributions such as

P(St = s, St+k = u, St+k+l = v) = δP(s)Γ^k P(u)Γ^l P(v)1′.
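These matrix formulae translate directly into code. The following Python sketch (illustrative only; the parameter values below are chosen for the example and are not from the text) evaluates the univariate marginal δP(s)1′ and the bivariate marginal δP(u)Γ^k P(v)1′ for a two-state Poisson HMM, and checks that summing the bivariate marginal over v recovers the univariate one:

```python
from math import exp, factorial

# Sketch: marginal distributions of a two-state Poisson HMM via
# P(S_t = s) = delta P(s) 1'  and
# P(S_t = u, S_{t+k} = v) = delta P(u) Gamma^k P(v) 1'.
# Parameters are illustrative assumptions.

gamma = [[0.1, 0.9], [0.4, 0.6]]
lam = [1.0, 3.0]
den = gamma[0][1] + gamma[1][0]
delta = [gamma[1][0] / den, gamma[0][1] / den]   # stationary distribution

def pois(l, s):
    return exp(-l) * l**s / factorial(s)

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def P(s):
    """Diagonal matrix of state-dependent probabilities p_i(s)."""
    return [[pois(lam[0], s), 0.0], [0.0, pois(lam[1], s)]]

def marginal(s):
    return sum(matmul([delta], P(s))[0])          # delta P(s) 1'

def bivariate(u, v, k):
    Gk = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(k):
        Gk = matmul(Gk, gamma)                    # Gamma^k
    return sum(matmul(matmul(matmul([delta], P(u)), Gk), P(v))[0])

# Consistency check: summing the bivariate marginal over v gives P(S_t = u).
total = sum(bivariate(2, v, 3) for v in range(60))
print(marginal(2), total)
```

The check illustrates that the bivariate formula is a genuine joint distribution whose v-margin is the univariate marginal.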

The moments of a HMM can be obtained in a similar way. For example, the expectation of St, E(St), can be traced back to the expectations of the component distributions:

E(St) = Σ_{i=1}^m δi E(St | Ct = i).

The proof is straightforward:

E(St) = Σ_s s P(St = s)
      = Σ_s s Σ_{i=1}^m δi pi(s)
      = Σ_{i=1}^m δi Σ_s s pi(s)
      = Σ_{i=1}^m δi E(St | Ct = i).

Thus, the expectation of St, E(St), can be interpreted as a linear combination of the expectations of the component distributions with δ as the vector of weights. This result can be generalized to any function g of St, g(St), and thus to any moment of higher order, and one has that

E(g(St)) = Σ_{i=1}^m δi E(g(St) | Ct = i).


The proof is similar to the derivation of E(St) and is left to the reader as an exercise. It is also possible to derive moments of the joint distribution of St and St+k in a similar way. For example, the expectation of a function of the joint distribution, E(g(St, St+k)), is

E(g(St, St+k)) = Σ_{i=1}^m Σ_{j=1}^m δi γij(k) E(g(St, St+k) | Ct = i, Ct+k = j),

where γij(k) = (Γ^k)_{ij} is the k-step transition probability from state i to state j of the underlying Markov chain, i.e. the element in the i-th row and j-th column of the k-step transition probability matrix Γ(k) = Γ^k. The proof is slightly more challenging:

E(g(St, St+k))
 = Σ_{st} Σ_{st+k} g(st, st+k) P(St = st, St+k = st+k)
 = Σ_{st} Σ_{st+k} g(st, st+k) Σ_{i=1}^m Σ_{j=1}^m P(St = st, St+k = st+k | Ct = i, Ct+k = j) P(Ct = i, Ct+k = j)
 = Σ_{st} Σ_{st+k} g(st, st+k) Σ_{i=1}^m Σ_{j=1}^m P(St = st, St+k = st+k | Ct = i, Ct+k = j) P(Ct = i) P(Ct+k = j | Ct = i)
 = Σ_{st} Σ_{st+k} g(st, st+k) Σ_{i=1}^m Σ_{j=1}^m P(St = st, St+k = st+k | Ct = i, Ct+k = j) δi γij(k)
 = Σ_{i=1}^m Σ_{j=1}^m δi γij(k) Σ_{st} Σ_{st+k} g(st, st+k) P(St = st, St+k = st+k | Ct = i, Ct+k = j)
 = Σ_{i=1}^m Σ_{j=1}^m δi γij(k) E(g(St, St+k) | Ct = i, Ct+k = j).

From a theoretical point of view it is also possible to derive formulae for the centred moments, for example the variance Var(St), but if the actual component distributions are unknown the formulae may become unwieldy. Therefore we only mention that in general Var(St) can be computed using Var(St) = E(St²) − (E(St))². Using this equality and the above formula for E(g(St)), one can derive the following moments of a Poisson HMM:

E(St) = Σ_{i=1}^m δi λi = δλ′,
E(St²) = Σ_{i=1}^m (λi² + λi)δi = λDλ′ + δλ′,
Var(St) = Σ_{i=1}^m (λi² + λi)δi − (Σ_{i=1}^m δi λi)² = λDλ′ + δλ′ − (δλ′)²,
E(St, St+k) = Σ_{i=1}^m Σ_{j=1}^m λi λj δi γij(k) = δΛΓ^k λ′,
Cov(St, St+k) = Σ_{i=1}^m Σ_{j=1}^m λi λj δi γij(k) − (Σ_{i=1}^m δi λi)² = δΛΓ^k λ′ − (δλ′)²,
ρk = Corr(St, St+k) = (δΛΓ^k λ′ − (δλ′)²) / (δΛλ′ + δλ′ − (δλ′)²),

where λ = (λ1, ..., λm) is the vector of the parameters of the state-dependent Poisson distributions, δ = (δ1, ..., δm) is the stationary distribution of the underlying Markov chain, Γ is the transition probability matrix, and D = diag(δ) and Λ = diag(λ) are diagonal matrices with the elements of δ and λ as diagonal elements, respectively (for a detailed derivation of these formulae see Problem 3.1). For a two-state Poisson HMM the moments reduce to:

E(St) = δ1λ1 + δ2λ2,
E(St²) = (λ1 + λ1²)δ1 + (λ2 + λ2²)δ2,
Var(St) = δ1λ1 + δ2λ2 + δ1δ2(λ2 − λ1)²,
E(St, St+k) = δ1δ2(λ2 − λ1)²(1 − γ12 − γ21)^k + (δ1λ1 + δ2λ2)²,
Cov(St, St+k) = δ1δ2(λ2 − λ1)²(1 − γ12 − γ21)^k,
ρk = δ1δ2(λ2 − λ1)²(1 − γ12 − γ21)^k / (δ1λ1 + δ2λ2 + δ1δ2(λ2 − λ1)²).

The derivation of some of these formulae is not as simple as it may seem. For example, the computation of E(St, St+k) is rather tricky, as one needs to find an explicit expression for γij(k), or Γ^k respectively; for details see Problems 2.4 and 3.1!
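The equivalence of the general matrix expression for the covariance and the two-state closed form is easy to check numerically. A Python sketch with illustrative parameters (the numbers below are assumptions, not from the text):

```python
# Sketch: moments of a two-state Poisson HMM, comparing the matrix formula
# Cov(S_t, S_{t+k}) = delta Lambda Gamma^k lambda' - (delta lambda')^2
# with the closed form delta1 delta2 (lambda2-lambda1)^2 (1-g12-g21)^k.
# Parameters are illustrative assumptions.

g12, g21 = 0.1, 0.3
lam = [2.0, 7.0]
gamma = [[1 - g12, g12], [g21, 1 - g21]]
delta = [g21 / (g12 + g21), g12 / (g12 + g21)]   # stationary distribution

def cov_matrix_form(k):
    Gk = [[1.0, 0.0], [0.0, 1.0]]                 # Gamma^k by iteration
    for _ in range(k):
        Gk = [[sum(Gk[i][r] * gamma[r][j] for r in range(2))
               for j in range(2)] for i in range(2)]
    # delta Lambda Gamma^k lambda'
    e_joint = sum(delta[i] * lam[i] * Gk[i][j] * lam[j]
                  for i in range(2) for j in range(2))
    mean = delta[0] * lam[0] + delta[1] * lam[1]
    return e_joint - mean**2

def cov_closed_form(k):
    return delta[0] * delta[1] * (lam[1] - lam[0])**2 * (1 - g12 - g21)**k

for k in range(1, 6):
    assert abs(cov_matrix_form(k) - cov_closed_form(k)) < 1e-10

mean = delta[0] * lam[0] + delta[1] * lam[1]
var = mean + delta[0] * delta[1] * (lam[1] - lam[0])**2
print(mean, var, cov_matrix_form(1) / var)   # E(S_t), Var(S_t), rho_1
```

Note that the autocorrelation decays geometrically in k at rate 1 − γ12 − γ21, which is exactly the second eigenvalue of Γ obtained in Problem 2.4.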

3.3 The likelihood of a hidden Markov model

The aim of this section is to develop a closed formula for the likelihood of a HMM in a general framework. In order to motivate the understanding of the likelihood, a simple example is considered first, before the general form of the likelihood is derived. In Section 3.2 we have already given a formula for the multivariate marginal distribution of St, St+k and St+k+l:

P(St = s, St+k = u, St+k+l = v) = δP(s)Γ^k P(u)Γ^l P(v)1′.


For example, one may be interested in the probability that a HMM takes the values s at t = 1, u at t = 3 and v at t = 7. According to the above formula, this probability is given by

P(S1 = s, S3 = u, S7 = v) = δP(s)Γ²P(u)Γ⁴P(v)1′.

In order to demonstrate that the formula really holds, we consider a simple example: a two-state Bernoulli HMM where the transition probability matrix Γ and the respective stationary distribution δ are given by

Γ = ( 1/2 1/2 ; 1/4 3/4 )   and   δ = (1/3 2/3).

The state-dependent probabilities are determined by two Bernoulli-distributed random variables, where the first state represents the case of tossing a fair coin and the second state denotes the deterministic case, i.e.:

Ct = 1 : Be(1/2) ⇒ p1(0) = P(St = 0 | Ct = 1) = 1/2,  p1(1) = P(St = 1 | Ct = 1) = 1/2,
Ct = 2 : Be(1)   ⇒ p2(0) = P(St = 0 | Ct = 2) = 0,    p2(1) = P(St = 1 | Ct = 2) = 1.

Then, the likelihood of S1 = S2 = S3 = 1 can be calculated as follows:

P(S1 = 1, S2 = 1, S3 = 1)
 = Σ_{i=1}^2 Σ_{j=1}^2 Σ_{k=1}^2 P(S1 = 1, S2 = 1, S3 = 1, C1 = i, C2 = j, C3 = k)
 = Σ_{i=1}^2 Σ_{j=1}^2 Σ_{k=1}^2 P(S1 = 1, S2 = 1, S3 = 1 | C1 = i, C2 = j, C3 = k) P(C1 = i, C2 = j, C3 = k)
 = Σ_{i=1}^2 Σ_{j=1}^2 Σ_{k=1}^2 P(S1 = 1 | C1 = i) P(S2 = 1 | C2 = j) P(S3 = 1 | C3 = k) P(C1 = i, C2 = j, C3 = k)
 = Σ_{i=1}^2 Σ_{j=1}^2 Σ_{k=1}^2 pi(1) pj(1) pk(1) δi γij γjk.

The third step follows from the conditional independence property of the HMM, and the last step can be derived by using the Markov property of the underlying Markov chain {Ct}:

P(C1 = i, C2 = j, C3 = k) = P(C1 = i) P(C2 = j | C1 = i) P(C3 = k | C1 = i, C2 = j)
                          = P(C1 = i) P(C2 = j | C1 = i) P(C3 = k | C2 = j)
                          = δi γij γjk.


Now, in order to compute the likelihood, the possible values of i, j, k and thus of δi , γij , γjk , pi (1), pj (1) and pk (1) can be written down in a table, and the above sum can be obtained by multiplying the values line by line and summing up the resulting products, as demonstrated in Table 3.1.

 i  j  k | pi(1) pj(1) pk(1) |  δi   γij   γjk  |   Π
 1  1  1 |  1/2   1/2   1/2  |  1/3   1/2   1/2 |  1/96
 1  1  2 |  1/2   1/2   1    |  1/3   1/2   1/2 |  1/48
 1  2  1 |  1/2   1     1/2  |  1/3   1/2   1/4 |  1/96
 1  2  2 |  1/2   1     1    |  1/3   1/2   3/4 |  1/16
 2  1  1 |  1     1/2   1/2  |  2/3   1/4   1/2 |  1/48
 2  1  2 |  1     1/2   1    |  2/3   1/4   1/2 |  1/24
 2  2  1 |  1     1     1/2  |  2/3   3/4   1/4 |  1/16
 2  2  2 |  1     1     1    |  2/3   3/4   3/4 |  3/8
                                              Σ = 29/48

Table 3.1: Computation of the likelihood of a Bernoulli HMM.

In this table Π denotes the product pi(1)pj(1)pk(1)δiγijγjk for the given values of i, j and k, and Σ denotes the sum of all products. Thus, we have obtained the triple sum given above, i.e. the likelihood of S1 = S2 = S3 = 1, namely

P({S1 = 1, S2 = 1, S3 = 1}) = 29/48.

It is also possible, and even easier, to compute the likelihood using matrix notation, since the triple sum that yields the likelihood can be rewritten as follows:

P({S1 = 1, S2 = 1, S3 = 1})
 = Σ_{i=1}^2 Σ_{j=1}^2 Σ_{k=1}^2 pi(1) pj(1) pk(1) δi γij γjk
 = Σ_{i=1}^2 Σ_{j=1}^2 Σ_{k=1}^2 δi pi(1) γij pj(1) γjk pk(1)
 = δP(1)ΓP(1)ΓP(1)1′,

where δ and Γ are as given above and P(1) is a diagonal matrix containing the state-dependent probabilities p1(1) and p2(1) as diagonal elements:

P(1) = ( 1/2 0 ; 0 1 ).


This is exactly equivalent to the formula for the multivariate marginal distribution given above! Another interesting fact in this context is that, although they are based on a Markov chain, HMMs do not in general have the Markov property. This can be shown by a simple counterexample. For the given Bernoulli HMM one obtains

P({S3 = 1 | S1 = 1, S2 = 1}) = P({S1 = 1, S2 = 1, S3 = 1}) / P({S1 = 1, S2 = 1}) = (29/48) / (17/24) = 29/34,

while

P({S3 = 1 | S2 = 1}) = P({S2 = 1, S3 = 1}) / P({S2 = 1}) = (17/24) / (5/6) = 17/20,

where P({S1 = 1, S2 = 1}) = P({S2 = 1, S3 = 1}) and P({S2 = 1}) can be obtained by constructing a table like Table 3.1 or via matrix multiplication according to

P({S1 = 1, S2 = 1}) = P({S2 = 1, S3 = 1}) = δP(1)ΓP(1)1′   and   P({S2 = 1}) = δP(1)1′.
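The numbers in this counterexample can be checked mechanically. A short Python sketch of the matrix computations for the Bernoulli HMM above (the exercises in this paper use R; Python is used here only for illustration):

```python
# Sketch: checking the Bernoulli-HMM probabilities from the text
# (29/48, 17/24, 5/6) and the failure of the Markov property.

delta = [1/3, 2/3]
gamma = [[1/2, 1/2], [1/4, 3/4]]
P1 = [[1/2, 0.0], [0.0, 1.0]]   # diag(p1(1), p2(1))

def vec_mat(v, M):
    return [sum(v[i] * M[i][j] for i in range(2)) for j in range(2)]

def prob_ones(n):
    """P(S_1 = 1, ..., S_n = 1) = delta P(1) (Gamma P(1))^(n-1) 1'."""
    v = vec_mat(delta, P1)
    for _ in range(n - 1):
        v = vec_mat(vec_mat(v, gamma), P1)
    return sum(v)

p1, p2, p3 = prob_ones(1), prob_ones(2), prob_ones(3)
print(p3)        # 29/48
print(p3 / p2)   # P(S3=1 | S1=1, S2=1) = 29/34
print(p2 / p1)   # P(S3=1 | S2=1) = 17/20, using stationarity
```

Since 29/34 ≠ 17/20, the two conditional probabilities differ, confirming that this HMM is not itself a Markov chain.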

Thus, in this case P({S3 = 1 | S2 = 1}) ≠ P({S3 = 1 | S1 = 1, S2 = 1}), i.e. the HMM considered here does not have the Markov property.

Now we return to the aim of this section, the derivation of a closed expression for the likelihood of a HMM in general. Analogous to the above example and the formula for the multivariate marginal distribution, one obtains the following formula for the likelihood LT of a HMM:

LT := P({S1 = s1, S2 = s2, ..., ST = sT}) = δP(s1)ΓP(s2)Γ···ΓP(sT)1′.

The proof is straightforward and similar to the computation of the likelihood in the above example:

P(S1 = s1, S2 = s2, ..., ST = sT)
 = Σ_{c1=1}^m Σ_{c2=1}^m ··· Σ_{cT=1}^m P(S1 = s1, ..., ST = sT, C1 = c1, ..., CT = cT)
 = Σ_{c1=1}^m ··· Σ_{cT=1}^m P(S1 = s1, ..., ST = sT | C1 = c1, ..., CT = cT) P(C1 = c1, ..., CT = cT)
 = Σ_{c1=1}^m ··· Σ_{cT=1}^m P(S1 = s1 | C1 = c1) P(S2 = s2 | C2 = c2) ··· P(ST = sT | CT = cT) · P(C1 = c1) P(C2 = c2 | C1 = c1) ··· P(CT = cT | CT−1 = cT−1)
 = Σ_{c1=1}^m ··· Σ_{cT=1}^m pc1(s1) pc2(s2) ··· pcT(sT) δc1 γc1c2 γc2c3 ··· γcT−1cT
 = δP(s1)ΓP(s2)Γ···ΓP(sT)1′.


The first part of the third step follows from the conditional independence property of HMMs, while the second part of step three can be derived from the Markov property of the underlying Markov chain. Now, using the equality δ = δΓ and defining Bt = ΓP(st) for all t ∈ {1, 2, ..., T}, the likelihood can be rewritten in a more convenient form:

LT = δB1B2···BT1′.
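The closed-form likelihood translates into a few lines of code. The following Python sketch (for illustration; the paper's own exercises implement this in R, and the Poisson parameters below are illustrative assumptions) also verifies the matrix formula against a brute-force sum over all state sequences:

```python
from math import exp, factorial
from itertools import product

# Sketch: likelihood L_T = delta P(s1) Gamma P(s2) ... Gamma P(sT) 1'
# for a two-state Poisson HMM, checked against the brute-force sum over
# all possible state sequences. Parameters are illustrative assumptions.

gamma = [[0.1, 0.9], [0.4, 0.6]]
lam = [1.0, 3.0]
den = gamma[0][1] + gamma[1][0]
delta = [gamma[1][0] / den, gamma[0][1] / den]   # stationary distribution

def p(i, s):
    """State-dependent Poisson probability p_i(s)."""
    return exp(-lam[i]) * lam[i]**s / factorial(s)

def likelihood(obs):
    v = [delta[i] * p(i, obs[0]) for i in range(2)]        # delta P(s1)
    for s in obs[1:]:
        v = [sum(v[i] * gamma[i][j] for i in range(2)) * p(j, s)
             for j in range(2)]                             # ... Gamma P(s)
    return sum(v)                                           # ... times 1'

def brute_force(obs):
    T = len(obs)
    total = 0.0
    for states in product(range(2), repeat=T):
        pr = delta[states[0]] * p(states[0], obs[0])
        for t in range(1, T):
            pr *= gamma[states[t-1]][states[t]] * p(states[t], obs[t])
        total += pr
    return total

obs = [0, 2, 1]
print(likelihood(obs), brute_force(obs))   # the two values agree
```

The brute-force sum has m^T terms, while the matrix recursion needs only O(T m²) operations, which is the practical point of the closed formula.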

One extremely nice property of HMMs is the ease with which one can deal with missing observations. Suppose, for example, that one has observed only the realisations s1, s2, s4, s7, ..., sT of a HMM, i.e. the realisations s3, s5 and s6 are missing. Then, in order to compute the likelihood of the given observations, one only has to replace the respective diagonal matrices containing the state-dependent probabilities of the missing observations, i.e. P(s3), P(s5) and P(s6), by the identity matrix I in the above formula of the likelihood, and one obtains:

L_T^{(−3,5,6)} = P(S1 = s1, S2 = s2, S4 = s4, S7 = s7, ..., ST = sT)
             = δP(s1)ΓP(s2)ΓIΓP(s4)ΓIΓIΓP(s7)···ΓP(sT)1′
             = δP(s1)ΓP(s2)Γ²P(s4)Γ³P(s7)···ΓP(sT)1′,

where the missing observations are denoted by an additional upper index. Note that this likelihood is nothing but the multivariate marginal distribution of S1, S2, S4, S7, ..., ST, evaluated at the observations s1, s2, s4, s7, ..., sT! Again, the likelihood can be rewritten using Bt = ΓP(st):

L_T^{(−3,5,6)} = δB1B2ΓB4Γ²B7···BT1′.

The fact that, even in the case of missing observations, the likelihood of a HMM can be computed easily is especially useful for the derivation of conditional distributions, as will be shown in Section 5.1.
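The missing-data device, replacing P(st) by the identity, amounts to a one-line change in the likelihood recursion. A Python sketch with illustrative parameters, checked by marginalizing the missing observation explicitly:

```python
from math import exp, factorial

# Sketch: likelihood with missing observations for a two-state Poisson HMM.
# A missing observation contributes the identity matrix, i.e. the factor
# Gamma P(s_t) degenerates to Gamma. Parameters are illustrative.

gamma = [[0.1, 0.9], [0.4, 0.6]]
lam = [1.0, 3.0]
den = gamma[0][1] + gamma[1][0]
delta = [gamma[1][0] / den, gamma[0][1] / den]

def p(i, s):
    return exp(-lam[i]) * lam[i]**s / factorial(s)

def likelihood(obs):
    """obs is a list of observations; None marks a missing value."""
    v = delta[:]
    first = True
    for s in obs:
        if not first:
            v = [sum(v[i] * gamma[i][j] for i in range(2)) for j in range(2)]
        first = False
        if s is not None:                 # missing: P(s) is replaced by I
            v = [v[j] * p(j, s) for j in range(2)]
    return sum(v)

# P(S1 = 0, S3 = 1) = delta P(0) Gamma^2 P(1) 1'  ...
direct = likelihood([0, None, 1])
# ... equals the full likelihood summed over all values of the missing S2.
marginalized = sum(likelihood([0, s2, 1]) for s2 in range(80))
print(direct, marginalized)
```

The agreement of the two numbers illustrates the remark above: the missing-data likelihood is just the marginal distribution of the observed components.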

Problems for Chapter 3

Problem 3.1 Consider a stationary m-state Poisson HMM {St, t = 1, 2, ...} with transition probability matrix Γ, state-dependent process parameters λ = (λ1, λ2, ..., λm), and stationary distribution δ = (δ1, δ2, ..., δm) of the Markov chain. Let D = diag(δ) be a diagonal matrix with diagonal elements δ1, δ2, ..., δm, and Λ = diag(λ) a diagonal matrix with diagonal elements λ1, λ2, ..., λm. Derive the following expressions for the moments of St:


(a) E(St) = δλ′

(b) E(St²) = Σ_{i=1}^m (λi² + λi)δi = λDλ′ + δλ′

(c) Var(St) = λDλ′ + δλ′ − (δλ′)²   [= δΛλ′ + δλ′ − (δλ′)²]

(d) E(St, St+k) = δΛΓ^k λ′   [Hint: E(St St+k | Ct = i, Ct+k = j) = λiλj for k ≠ 0]

(e) Cov(St, St+k) = δΛΓ^k λ′ − (δλ′)²

(f) ρk = Corr(St, St+k) = (δΛΓ^k λ′ − (δλ′)²) / (δΛλ′ + δλ′ − (δλ′)²)

Show also that for the case m = 2:

ρk = δ1δ2(λ2 − λ1)²(1 − γ12 − γ21)^k / (δ1λ1 + δ2λ2 + δ1δ2(λ2 − λ1)²).

[Hint: You may need the result of Problem 2.4!]

Problem 3.2 Consider a stationary two-state Poisson HMM with parameters

Γ = ( 0.1 0.9 ; 0.4 0.6 )   and   λ = (1 3).

Compute the probability that the first three observations of this model are 0, 2, 1 in each of the following ways.

(a) Consider all possible sequences of states of the Markov chain that could have occurred. Compute the probability of each sequence and multiply it by the probability of the observations given that sequence. Finally, sum up the probabilities.

(b) Apply the formula P(S1 = 0, S2 = 2, S3 = 1) = δP(0)ΓP(2)ΓP(1)1′, where

P(s) = diag( λ1^s e^{−λ1}/s!, λ2^s e^{−λ2}/s! ) = diag( 1^s e^{−1}/s!, 3^s e^{−3}/s! ).

Problem 3.3 Consider again the model defined in Problem 3.2. In that problem you were asked to compute P(S1 = 0, S2 = 2, S3 = 1). Now compute P(S1 = 0, S3 = 1) in each of the following ways.

(a) Consider all possible sequences of states of the Markov chain that could have occurred. Compute the probability of each sequence and multiply it by the probability of the observations given that sequence. Finally, sum up the probabilities.

(b) Apply the formula P(S1 = 0, S3 = 1) = δP(0)ΓIΓP(1)1′ = δP(0)Γ²P(1)1′, where I is the 2×2 identity matrix, and check that this probability is equal to your answer in (a).


Problem 3.4 Find out how to use the following R commands:

• for() (used for looping),
• sample() (a very useful function for drawing random samples).

Then, write an R function genPoisHMM(n,gamma,lambda) that generates a series of length n from a stationary m-state Poisson HMM with transition probability matrix gamma and Poisson parameters lambda. Regard the following notes and specifications.

(a) The function should determine the number of states, m, e.g. m [...]

γij ∈ [0; 1] and Σ_{j=1}^m γij = 1 ∀ i ∈ {1, 2, ..., m},   λi > 0 ∀ i ∈ {1, 2, ..., m}.

The first constraint assures that Γ is in fact a transition probability matrix, i.e. all elements of Γ must lie within the interval [0; 1] and the rows of Γ must sum to 1. This is a generic constraint that holds for all HMMs. The second constraint guarantees that the parameters λi of the Poisson component distributions stay within their admissible range, i.e. are positive. This constraint differs depending on which distribution has been chosen for the component distributions; e.g. for a binomial HMM it changes to πi ∈ [0, 1].

4 Parameter Estimation


In order to satisfy the second constraint, one can introduce a simple transformation of the λi parameters: ηi := log(λi )

λi = eηi



∀i ∈ {1, . . . , m} .

Then ηi ∈ ℝ, and one can maximize the likelihood using the unconstrained parameters ηi and obtain the respective estimators of the constrained parameters, λ̂i, by transforming back the estimated unconstrained parameters η̂i according to the above formula. This procedure can easily be adapted to distributions other than the Poisson; for example, in the case of a binomial HMM one can use a logit transformation of the parameters πi.

The reparameterization of the transition probabilities γij is a bit more demanding. Note that the matrix Γ has m² entries but only m(m − 1) free parameters, as the above row-sum constraints must be fulfilled. In the following we show one possible transformation

γij ∈ [0; 1]   ↔   τij ∈ ℝ   ∀ i, j ∈ {1, ..., m}, i ≠ j   (m(m − 1) pairs of indices)

for the non-diagonal elements of the transition probability matrix. For better readability we restrict our attention to the case m = 3; however, the same transformation also works in the general case. Let T be a 3 × 3 matrix whose off-diagonal elements τij, i, j ∈ {1, 2, 3}, i ≠ j, are arbitrary real numbers, i.e. τij ∈ ℝ:

T := ( −    τ12  τ13
       τ21  −    τ23
       τ31  τ32  −   ) .

Now let g : ℝ → ℝ⁺ be any strictly monotone increasing function, e.g. g(x) = e^x, and

ρij := g(τij)  for i ≠ j ,    ρij := 1  for i = j .

By this transformation a new matrix R is generated in which all elements ρij, i, j ∈ {1, 2, 3}, are strictly positive:

R := ( 1    ρ12  ρ13
       ρ21  1    ρ23
       ρ31  ρ32  1   ) .

Finally, divide all elements of R by their respective row sums, i.e.

γij := ρij / Σ_{k=1}^3 ρik   ∀ i, j ∈ {1, 2, 3} .


This yields a matrix Γ which has the properties of a transition probability matrix:

Γ := ( γ11  γ12  γ13
       γ21  γ22  γ23
       γ31  γ32  γ33 ) ,

where γij ∈ [0; 1] for all i, j ∈ {1, 2, 3} and Σ_{j=1}^3 γij = 1 for all i ∈ {1, 2, 3}. The proof is left as an exercise to the reader (see Problem 4.3)!

This transformation can easily be reversed. Given the transition probability matrix Γ, the matrix T with the τij as off-diagonal elements can be obtained by dividing each element γij of Γ by the respective diagonal element γii, which yields the matrix R, and then applying the inverse of the above function g(x), e.g. g⁻¹(x) = log(x), to the off-diagonal elements of R. The detailed derivation of the reverse transformation is also left to the reader (see Problem 4.3 again).

Now, using the transformations of λ and Γ outlined above, the maximum likelihood estimators for the general case m can be obtained in two steps.

1. Maximize the likelihood LT with respect to the new unconstrained parameters η = (η1, ..., ηm) and T = {τij, i, j ∈ {1, ..., m}, i ≠ j}.

2. Convert the estimated unconstrained parameters back to the constrained parameters: T̂ → Γ̂, η̂ → λ̂.

For an implementation of the transformations using the statistical software R see Problem 4.4.
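To make the pair of transformations concrete, here is a minimal sketch in Python (the exercises implement the same logic in R as param() and paraminv(); the names pack and unpack, and the row-by-row ordering of the τij, are our own conventions):

```python
import math

def unpack(parvect, m):
    """Map m*m unconstrained working parameters back to (Gamma, lambda).

    The first m*(m-1) entries are the off-diagonal tau_ij (row by row),
    the last m entries are eta_i = log(lambda_i).
    """
    taus = parvect[:m * (m - 1)]
    lambdas = [math.exp(eta) for eta in parvect[m * (m - 1):]]
    gamma = []
    k = 0
    for i in range(m):
        row = []
        for j in range(m):
            if i == j:
                row.append(1.0)                 # rho_ii := 1
            else:
                row.append(math.exp(taus[k]))   # rho_ij := g(tau_ij), g = exp
                k += 1
        s = sum(row)
        gamma.append([r / s for r in row])      # gamma_ij = rho_ij / row sum
    return gamma, lambdas

def pack(gamma, lambdas):
    """Inverse transformation: tau_ij = log(gamma_ij / gamma_ii),
    eta_i = log(lambda_i)."""
    m = len(lambdas)
    taus = [math.log(gamma[i][j] / gamma[i][i])
            for i in range(m) for j in range(m) if i != j]
    return taus + [math.log(l) for l in lambdas]
```

A useful check is that pack and unpack are inverses of each other, which is exactly the property the optimizer relies on.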

4.3.2 Numerical underflow

Having overcome the parameter constraints, one usually has to deal with a second problem: numerical underflow. As shown in Section 3.3, the likelihood of an HMM is given by the expression LT = δB1 · · · BT 1′. Since it follows from the definition of the so-called forward probabilities introduced in Section 4.1 that LT = αT 1′, one way of evaluating LT is to compute the forward probabilities via the following recursion:

start:   α0 := δ
update:  αt := αt−1 Bt   for t = 1, 2, ..., T.

However, the problem is that the entries of αt become very small for large t, which leads to numerical underflow when computing the likelihood of a long series; it might even happen that LT is rounded to zero, which makes it impossible to estimate the parameters of the HMM.


The usual solution to the problem of numerical underflow, i.e. maximizing the log-likelihood instead of the likelihood, cannot be implemented that easily here, since the likelihood is a product of matrices. However, one can derive a closed formula for the log-likelihood of an HMM by applying a trick that, basically, yields an alternative algorithm for evaluating the likelihood recursively via scaled forward probabilities. This will be demonstrated in the following. First, we need to introduce some new notation. Define

wt := αt 1′   and   φt := αt / wt .

Hence wt is a scalar, namely the sum of the elements of αt, which in the following will be used as a scaling factor, and φt is a vector containing the rescaled forward probabilities, i.e. the forward probabilities divided by their sum. Then another recursion can be developed:

start:   φ0 = α0 / w0 = δ / (δ1′) = δ
update:  φt = αt / wt = αt−1 Bt / wt = (wt−1 / wt) φt−1 Bt   for t = 1, 2, ..., T.

Executing this recursion up to φT one obtains:

φT = (wT−1/wT)(wT−2/wT−1) · · · (w0/w1) δ B1 B2 · · · BT−1 BT
   = (wT−1/wT)(wT−2/wT−1) · · · (w0/w1) αT .

Of course, it would be possible now to cancel all scaling factors except wT and w0; however, this would not help us in constructing a rescaling algorithm for computing the likelihood. Instead, the above equality can be converted as follows:

     φT = (wT−1/wT)(wT−2/wT−1) · · · (w0/w1) αT
⇔   (wT/wT−1)(wT−1/wT−2) · · · (w1/w0) φT = αT
⇔   (wT/wT−1)(wT−1/wT−2) · · · (w1/w0) φT 1′ = αT 1′
⇔   (wT/wT−1)(wT−1/wT−2) · · · (w1/w0) = LT
⇔   ∏_{t=1}^T (wt/wt−1) = LT .

Here we have used the fact that, by its definition, the entries of φT sum to one, i.e. φT 1′ = 1, together with αT 1′ = LT.


Now that the likelihood LT has been rewritten as a product of scalars, it is possible to take the logarithm of the likelihood, and one obtains

log LT = Σ_{t=1}^T log(wt / wt−1) .

However, one still has to evaluate the terms of the sum, i.e. the quotients of the scaling factors. Using the recursion for φt together with the equality φt 1′ = 1, the quotients can be obtained via

     φt = (wt−1/wt) φt−1 Bt
⇔   φt 1′ = (wt−1/wt) φt−1 Bt 1′
⇔   wt/wt−1 = φt−1 Bt 1′ .

Having derived formulae for the log-likelihood of an HMM and for the terms of the log-likelihood, i.e. the quotients wt/wt−1, one can develop an efficient algorithm for evaluating the log-likelihood recursively using the scaled forward probabilities. This algorithm will be given in the following section.

4.3.3 An efficient algorithm

The formulae derived in the preceding section can be implemented efficiently via the following recursive algorithm:

start (t = 0):
  log L0 = 0   and   φ0 = δ .

update (t = 1, 2, ..., T):
  vt = φt−1 Bt
  ut = vt 1′                      ( = φt−1 Bt 1′ = wt/wt−1 )
  log Lt = log Lt−1 + log(ut)     ( = log Lt−1 + log(wt/wt−1) )
  φt = vt / ut                    ( = (wt−1/wt) φt−1 Bt ) .

Repeating the loop for t = 1, 2, ..., T leads to the required log-likelihood log LT. Implementing this algorithm and using the parameter transformations outlined in Section 4.3.1, one can estimate the parameters of an HMM by maximum likelihood via direct numerical maximization, using R or other statistical software packages that provide a function for numerical maximization (or, alternatively, minimization). For an implementation of the algorithm and of the transformations outlined above, and for the programming of a function that performs the maximum likelihood estimation for a Poisson HMM using the software package R, see Problems 4.4, 4.6 and 4.7!
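A direct transcription of the algorithm, sketched here in Python for a Poisson HMM (the exercises implement it in R; the function name loglik_poisHMM and the explicit delta argument are our own choices):

```python
import math

def poisson_pmf(s, lam):
    """State-dependent probability p_i(s) for a Poisson component."""
    return lam ** s * math.exp(-lam) / math.factorial(s)

def loglik_poisHMM(obs, gamma, lam, delta):
    """Scaled forward recursion: log L_T = sum_t log(w_t / w_{t-1})."""
    m = len(lam)
    phi = list(delta)            # phi_0 = delta
    logL = 0.0                   # log L_0 = 0
    for s in obs:
        # v = phi_{t-1} B_t, where B_t = Gamma diag(p_1(s_t), ..., p_m(s_t))
        v = [sum(phi[i] * gamma[i][j] for i in range(m)) * poisson_pmf(s, lam[j])
             for j in range(m)]
        u = sum(v)               # u_t = v_t 1' = w_t / w_{t-1}
        logL += math.log(u)
        phi = [x / u for x in v] # rescale so that phi_t sums to one
    return logL
```

For a short series the result agrees with the direct (unscaled) evaluation of the likelihood; for long series only the scaled version avoids underflow.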


Now, reconsider the series of weekly sales of a soap product again. Numerical computation of the maximum likelihood estimates for Poisson HMMs with m = 2, 3, 4 yields the following results:

m = 2:
  Γ̂ = ( 0.91  0.09          δ̂ = (0.81, 0.19)
        0.37  0.63 )        λ̂ = (4.02, 11.37)

m = 3:
  Γ̂ = ( 0.86  0.12  0.02    δ̂ = (0.72, 0.22, 0.06)
        0.44  0.54  0.02    λ̂ = (3.74, 8.44, 14.93)
        0.00  0.30  0.70 )

m = 4:
  Γ̂ = ( 0.18  0.63  0.00  0.19    δ̂ = (0.02, 0.70, 0.21, 0.07)
        0.00  0.88  0.10  0.02    λ̂ = (0.00, 3.87, 8.20, 14.69)
        0.08  0.33  0.57  0.02
        0.00  0.00  0.36  0.64 )

Note that, as we assume the underlying Markov chain to be stationary, in each case δ̂ is the stationary distribution of Γ̂ and has not been estimated! In addition, we have to mention that all estimates are positive, although it seems that some of them are exactly zero; this is due to rounding.

Furthermore, it is noticeable that, using the standard version of the above estimation algorithm, one cannot guarantee that the λ̂i-estimates occur in ascending order. It may happen that the maximum likelihood estimation yields parameter estimates where the largest of the λ̂i-values does not belong to the last state. However, it is possible to sort the states after the estimation procedure; in fact, this has been done in this case. Another possibility to assure an ascending order of the λ̂i-values is to estimate the positive differences between the parameters instead of the parameters themselves.

Regarding the parameter estimates for the different models in detail, one can see how the models change for different numbers of states. A great change occurs when moving from a two-state to a three-state model: considering the means λ̂i, the two states of the two-state model diverge and a new state appears in the middle. In contrast, moving from m = 3 to m = 4 hardly influences the parameter estimates of the three-state model; the only change is that a quasi-degenerate state (λ1 and δ1 close to zero) appears to the left of the first state of the three-state model.

These observations point at a heuristic policy for fitting (Poisson) HMMs. In general, it is reasonable to fit a simple Poisson model first. Then one chooses two values to the left and to the right of the resulting λ̂ as starting values for the fitting of a two-state Poisson HMM. In the next step, one moves to the left and to the right of the λ̂i of the two-state model in order to obtain starting values for the three-state model; and so on. In doing so one should consider that in general the λ̂i-values of all Poisson HMMs lie within the range of the observations, and so should the starting values! It is even possible to force the λ̂i-values to lie within the desired range by using a special type of logit transformation for the λi instead of the log transformation outlined above.

Given the parameter estimates one can easily compute the marginal distribution of St using the formula derived in Section 3.2. In this context it is noticeable that the marginal distribution of St is equivalent to an independent mixture distribution with means λ̂ and mixing parameters δ̂. Figure 4.1 shows the marginal distributions of the fitted Poisson HMMs for m = 2, 3, 4.

[Figure 4.1 consists of three rows of plots, one for each of m = 2, 3, 4. The left panel of each row ("counts and m-state HMM") compares the observed relative frequencies with the marginal distribution of the fitted m-state HMM; the right panel ("m rates of sales") shows the sales series together with the m fitted rates λ̂i.]

Figure 4.1: Soap sales series and fitted HMMs.

Note that the two colours of the marginal distribution for the two-state model indicate the contributions of the two component distributions: the black part gives the probabilities of a Poisson with mean λ̂1 weighted with δ̂1, and the shaded part marks the probabilities of the second component weighted with δ̂2. It is obvious that even the marginal distribution of the two-state model provides a much better fit to the data than the simple Poisson distribution shown in Figure 2.1. However, the fit can be further improved by using a Poisson HMM with three or four states.
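The marginal probabilities themselves are straightforward to evaluate as a δ̂-weighted mixture of Poisson probabilities; a minimal Python sketch, using the fitted three-state values quoted above:

```python
import math

def poisson_pmf(s, lam):
    return lam ** s * math.exp(-lam) / math.factorial(s)

def marginal_pmf(s, delta, lam):
    """Marginal P(S_t = s) = sum_i delta_i * p_i(s) for a stationary HMM."""
    return sum(d * poisson_pmf(s, l) for d, l in zip(delta, lam))

# fitted three-state values quoted in the text
delta = [0.72, 0.22, 0.06]
lam = [3.74, 8.44, 14.93]
probs = [marginal_pmf(s, delta, lam) for s in range(60)]
```

Since δ̂ sums to one and the Poisson tails beyond s = 60 are negligible here, the computed probabilities sum to one (up to rounding).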


In this context, it is interesting to compare the marginal distributions of the HMMs to the independent mixture distributions that were fitted in Section 2.1. The parameters of the HMM marginals differ strikingly from those of the independent mixtures, at least for m = 3, and comparing the respective plots, one might even think that the three-state independent mixture provides a better fit than does the three-state HMM. However, the HMM takes account of the serial correlation of the observations and therefore leads to a larger log-likelihood than the independent mixture (see Table 6.1). The question is now how to select an appropriate model and how to check the fit of the selected model. The answer to this question will be given in Chapter 6.

4.4 Standard errors of the parameter estimates

With the knowledge of the preceding sections, parameter estimates Θ̂ := (Γ̂, λ̂) can be obtained by maximizing the likelihood for given data.¹¹ The accuracy of the estimates cannot be calculated easily; only some asymptotic results are available. Under certain conditions, Θ̂ is consistent and asymptotically normal (for details see MacDonald and Zucchini (1997, p. 95)). However, for practical applications the results lead to a couple of problems.

• The sample is usually of finite size T; therefore the question of how quickly the asymptotic behaviour sets in becomes an important aspect.

• Assuming normality, how can the standard error SE(Θ̂) be calculated?

• The entries of Θ̂ are correlated; hence the interpretation is non-trivial, as the parameters are linked. This problem, which also occurs in other contexts, means that statements can be made about the standard errors of the single parameters separately, while the error of the whole model, which is more interesting, cannot be calculated directly.

One possible solution to the problems mentioned above is the use of the parametric bootstrap method. The parametric bootstrap assumes that the fitted model with the estimated parameters Θ̂ is the true one. Then, in order to obtain the distributional properties of Θ̂, the following steps are repeated for b ∈ {1, 2, ..., B}.

• Generate a new sample of observations of length T from the fitted model with parameters Θ̂ (usually, the length should be the same as the number of original observations). This sample is called the bootstrap sample b.

• Estimate the vector of parameters Θ̂*_b for the bootstrap sample b.

11 The definition of Θ̂ implies that one is dealing with a Poisson HMM. For other models the vector λ̂ in the definition of Θ̂ has to be replaced by the appropriate vector of parameters.


This procedure, which yields B vectors of parameter estimates, Θ̂*_1, ..., Θ̂*_B, and thus an empirical distribution of the parameter estimates Θ̂, is illustrated in Figure 4.2.

[Figure 4.2 is a schematic diagram: the original sample yields the estimate θ̂ and the fitted model; from the fitted model, bootstrap samples 1, 2, ..., B are generated, each yielding an estimate θ̂*_b; together the θ̂*_b form a histogram, i.e. an empirical distribution, of θ̂.]

Figure 4.2: Parametric bootstrap.

Then, given the parameter estimates for the bootstrap samples, the distributional properties of Θ̂ can be analyzed. For example, the standard error of θ̂, i.e. of one of the parameters, can be estimated by the standard deviation of the θ̂*_b, b = 1, ..., B,

SE(θ̂) = √( (1/(B − 1)) Σ_{b=1}^B (θ̂*_b − θ̂*_(.))² ) ,

where θ̂*_(.) = (1/B) Σ_{b=1}^B θ̂*_b denotes the mean of the respective parameter estimates θ̂*_b.

More generally, the variance-covariance matrix of Θ̂ can be obtained in a similar way:

Var-Cov(Θ̂) = (1/(B − 1)) Σ_{b=1}^B (Θ̂*_b − Θ̂*_(.))′ (Θ̂*_b − Θ̂*_(.)) ,

where Θ̂*_(.) is the vector of means of the parameter estimates for the bootstrap samples. E.g. for a two-state Poisson HMM, Θ is given by Θ = (γ12, γ21, λ1, λ2), and the resulting variance-covariance matrix contains (all entries estimated from the bootstrap replications):

Var-Cov(Θ̂) = ( Var(γ12)       Cov(γ12, γ21)  Cov(γ12, λ1)  Cov(γ12, λ2)
                Cov(γ21, γ12)  Var(γ21)       Cov(γ21, λ1)  Cov(γ21, λ2)
                Cov(λ1, γ12)   Cov(λ1, γ21)   Var(λ1)       Cov(λ1, λ2)
                Cov(λ2, γ12)   Cov(λ2, γ21)   Cov(λ1, λ2)   Var(λ2)      ) .
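Given the B bootstrap estimate vectors, the two formulae above reduce to elementary sums; a Python sketch (the generation of the bootstrap samples and the re-estimation step, which dominate the computing time, are omitted here):

```python
import math

def bootstrap_se(estimates):
    """Standard errors and variance-covariance matrix computed from B
    bootstrap parameter vectors Theta*_1, ..., Theta*_B (lists of equal
    length), using the divisor B - 1 as in the text."""
    B = len(estimates)
    p = len(estimates[0])
    means = [sum(e[k] for e in estimates) / B for k in range(p)]   # Theta*_(.)
    varcov = [[sum((e[i] - means[i]) * (e[j] - means[j]) for e in estimates)
               / (B - 1)
               for j in range(p)] for i in range(p)]
    se = [math.sqrt(varcov[k][k]) for k in range(p)]
    return se, varcov
```

Note that the standard errors are simply the square roots of the diagonal of the estimated variance-covariance matrix.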


Problems for Chapter 4

Problem 4.1
The backward probabilities of a HMM are defined as

β′t := Bt+1 Bt+2 ... BT 1′   for t = T − 1, T − 2, ..., 1   and   βT := 1 .

Show that the ith entry of the vector βt is equal to P(St+1 = st+1, St+2 = st+2, ..., ST = sT | Ct = i), i ∈ {1, 2, ..., m}.

Problem 4.2
To carry out this exercise you need to use the R functions dpois() and nlm(). If you are not familiar with these, then use the help() command to learn how to use them. Especially, find out what the dpois() option log=TRUE does.

(a) Write an R function nllkpois(x,lambda) that computes the negative log-likelihood for the Poisson distribution with parameter λ given a sample of (independent) observations x1, x2, ..., xn, where the latter are represented in your function by the vector x.¹²

(b) Generate a random sample x of size n from a Poisson distribution with parameter λ. Initially use n = 4 and λ = 1, but later you should experiment with a variety of other values.

(i) Use the sample to verify that your function nllkpois() works correctly.

(ii) Use the function nlm() to estimate λ numerically, i.e. to minimize your function nllkpois() for the given sample values, x.

Problem 4.3
Consider the following parameterization of the transition probability matrix of an m-state Markov chain. Let τij ∈ ℝ (i, j = 1, 2, ..., m; i ≠ j) be m(m − 1) arbitrary real numbers and let g : ℝ → [0, ∞) be some strictly monotone increasing function, e.g. g(x) = e^x. Define

ρij = 1  for i = j ,    ρij = g(τij)  for i ≠ j ,

and

γij = ρij / Σ_{k=1}^m ρik   for i, j = 1, 2, ..., m.

12 To estimate the parameters of a distribution we usually maximize the likelihood, or the log-likelihood. This is of course equivalent to minimizing the negative log-likelihood. We will use the latter option because the R function nlm() is designed to search for the minimum, rather than the maximum, of a given function.


(a) Show that the matrix Γ with entries γij constructed this way is a transition probability matrix, i.e. show that 0 ≤ γij ≤ 1 for all i, j ∈ {1, ..., m}, and that the row sums of Γ are all equal to one.

(b) Given the transition probability matrix Γ = {γij, i, j = 1, 2, ..., m}, derive an expression for the parameters τij, i, j = 1, 2, ..., m; i ≠ j.

Problem 4.4
(a) Write an R function param(parvect) that computes the parameters Γ and λ1, λ2, ..., λm of a Poisson HMM, where the input vector parvect contains m² entries; the first m(m − 1) entries are τij, i, j = 1, 2, ..., m; i ≠ j, and the last m entries are log(λi), i = 1, 2, ..., m. For the definition of τij see Problem 4.3 or Section 4.3.1.

(b) Write an R function paraminv(gamma,lambda) that returns a vector of length m². The first m(m − 1) entries of the vector are τij, i, j = 1, 2, ..., m; i ≠ j, and the last m entries are log(λi). The input gamma contains the m × m matrix Γ, and the vector lambda contains the Poisson parameters λ1, λ2, ..., λm.

Problem 4.5
The purpose of this exercise is for you to investigate the numerical behaviour of the direct method of evaluating the likelihood of a HMM, and to compare this to the behaviour of an alternative algorithm that applies scaling. Consider a two-state Poisson HMM with parameters

Γ = ( 0.9  0.1
      0.2  0.8 ) ,    (λ1, λ2) = (1, 5) .

Compute the likelihood, L10, of the following sequence of 10 observations

2, 8, 6, 3, 6, 1, 0, 0, 4, 7

in the following two ways, using the software R.

(a) Apply the direct method, i.e. compute the likelihood L10 using the recursion for the forward probabilities; that is

L10 = α10 1′ ,   where α0 = δ   and   αt = αt−1 Bt ,

with

Bt = Γ P(st) = Γ ( p1(st)  0
                   0       p2(st) ) ,

pi(st) = λi^st e^{−λi} / st! ,   i = 1, 2; t = 1, 2, ..., 10 .

Examine the numerical values of the vectors α0, α1, ..., α10.


(b) Use the following algorithm to compute log L10:

Initialize:
• Set φ0 ← δ and logL ← 0.

Loop over t = 1, 2, ..., 10:
• v ← φt−1 Bt
• u ← v1′
• logL ← logL + log(u)
• φt ← v/u

In this algorithm, the symbol v denotes a temporary vector of length m; u denotes a temporary scalar, and logL is a scalar in which the log-likelihood is accumulated. At the end of the loop, the required log L10 is given by logL. Examine the numerical values of the vectors φ0, φ1, ..., φ10 (the easiest way to do this is to store these vectors as rows in an 11 × 2 matrix).

Problem 4.6
Write an R function nllkpoisHMM(s,gamma,lambda) that computes minus one times the log-likelihood of a Poisson HMM with parameters gamma and lambda for a series of observations contained in the vector s.

Notes:
(a) You can use the function statdist() (see Problem 2.5) to compute δ.
(b) Use the algorithm given in part (b) of Problem 4.5 rather than the direct method outlined in part (a) of the same problem.
(c) It is not necessary to store each vector φt, t = 1, 2, ..., n; you can simply reuse the same vector. Start by setting phi <- delta.

5 Forecasting and Decoding


Problems for Chapter 5

Problem 5.1
Show that the conditional distribution of Su, given all the other observations, s(−u) := s1, s2, ..., su−1, su+1, ..., sT, can be obtained as

(a) a ratio of two likelihoods:

P(Su = s | S(−u) = s(−u)) = (αu−1 Γ P(s) β′u) / (αu−1 Γ β′u) ,   u = 1, 2, ..., T ,

where α0 := δ and βT := 1.
Note: This expression shows that the conditional probability can be regarded as the ratio of two likelihoods of a HMM; the numerator is the likelihood of the observed time series, but with the observation su replaced by s; the denominator is the likelihood of the observed time series, but with the observation su regarded as missing.

(b) a mixture of the state-dependent probability distributions:

P(Su = s | S(−u) = s(−u)) = Σ_{i=1}^m ζu(i) pi(s) ,

where ζu(i) = du(i) / Σ_{j=1}^m du(j), and du(i) is the product of the ith entry of the vector αu−1 Γ and the ith entry of the vector βu.
Note: This expression shows that the conditional probability can be regarded as a mixture of the state-dependent distributions, pi(s), where the mixture coefficients, ζu(i), are functions of the observations s(−u) and, of course, the model parameters.

Problem 5.2
Write a function conditional.poisHMM(u,srange,st,gamma,lambda) that computes the conditional probability given in Problem 5.1, for time t = u and for all values s ∈ srange, given the observations contained in the vector st, the transition probability matrix gamma and the vector of m Poisson parameters lambda. This can be done using two calls of the function nllkpoisHMM(s,gamma,lambda) (see Problem 4.6).


Problem 5.3
Consider a stationary m-state Poisson HMM with parameters Γ and λ = (λ1, λ2, ..., λm), and let pT+h(s) = P(ST+h = s | S(T) = s(T)), s = 0, 1, 2, ..., h = 1, 2, ..., be the h-step-ahead forecast distribution.

(a) Show that the h-step-ahead forecast distribution is given by

pT+h(s) = Σ_{i=1}^m ξi pi(s) ,

where ξi is the ith entry of the vector αT Γ^h / (αT 1′).

(b) Show that as the forecast horizon, h, becomes large, the forecast distribution approaches

δ P(s) 1′ = Σ_{i=1}^m δi pi(s) ,

where P(s) is a diagonal matrix with diagonal entries pi(s) = P(St = s | Ct = i), i = 1, 2, ..., m, and δ is the stationary distribution of the Markov chain, i.e. the solution to the system of equations δ = δΓ subject to δ1′ = 1. [Hint: Use the fact that for an arbitrary vector, say η, whose entries sum to one, the vector ηΓ^h approaches δ as h increases.]

Problem 5.4
Write a function forecast.poisHMM(h=1,srange,st,gamma,lambda) that computes pT+h(s) as defined in Problem 5.3 for s ∈ srange, given the observations contained in the vector st, an m × m transition probability matrix gamma, and the vector of m Poisson parameters lambda. Use your function to give the forecast distributions (for h = 1, 2) for the time series given in Problem 4.5.
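The forecast computation described in Problem 5.3(a) can be sketched as follows (in Python rather than the R asked for in Problem 5.4; the unscaled forward recursion used here is adequate for short series):

```python
import math

def poisson_pmf(s, lam):
    return lam ** s * math.exp(-lam) / math.factorial(s)

def mat_vec(v, A):
    """Row vector times matrix: returns v A."""
    m = len(v)
    return [sum(v[i] * A[i][j] for i in range(m)) for j in range(m)]

def forecast_poisHMM(h, srange, obs, gamma, lam, delta):
    """h-step-ahead forecast distribution p_{T+h}(s) = sum_i xi_i p_i(s),
    where xi = alpha_T Gamma^h / (alpha_T 1')."""
    m = len(lam)
    alpha = list(delta)
    for s in obs:                     # forward recursion: alpha_t = alpha_{t-1} B_t
        alpha = [a * poisson_pmf(s, lam[j])
                 for j, a in enumerate(mat_vec(alpha, gamma))]
    xi = list(alpha)
    for _ in range(h):                # xi = alpha_T Gamma^h
        xi = mat_vec(xi, gamma)
    total = sum(alpha)                # alpha_T 1' = L_T
    xi = [x / total for x in xi]
    return [sum(xi[i] * poisson_pmf(s, lam[i]) for i in range(m)) for s in srange]
```

Because the rows of Γ sum to one, the weights ξi always sum to one, and for large h they converge to the stationary distribution δ, as claimed in Problem 5.3(b).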

6 Model Selection and Model Validation

In Section 4.3 we have already mentioned the task of selecting an appropriate HMM for a series of observations and of checking the fit of the selected model. In this chapter we will give an introduction to model selection in HMMs (Section 6.1) and to the respective model diagnostics using pseudo-residuals (Section 6.2).

6.1 Model selection

A problem arising naturally when working with HMMs is that of selecting an appropriate model, i.e. of choosing the appropriate number of states m. Since the asymptotics for the order m of a HMM are not yet clear (see e.g. Rydén (1995)), one needs to specify a criterion for model comparison. Assume that the observations s1, ..., sT were generated by the unknown operating model f and that one fits models from two different approximating families, {g1 ∈ G1} and {g2 ∈ G2}. The intention of model selection is to find the model which is best in a certain sense. There exist two different approaches to model selection.

In the classical approach one selects the family estimated to be closest to the true model. For that purpose one defines a discrepancy between the operating and the approximating models, ∆(f, g1) and ∆(f, g2), and, as the operating model is unknown, derives the estimated expected discrepancy as selection criterion, Ê(∆(f, g1)) and Ê(∆(f, g2)). Choosing the Kullback-Leibler discrepancy leads to the so-called Akaike information criterion (AIC), which can be computed as follows:

AIC = −2 logL + 2p ,

where logL is the log-likelihood of the fitted model and p denotes the number of parameters of the model. The first term is a measure of fit; it decreases with increasing number of states m. The second term is a penalty term; it increases with increasing m. The AIC is the canonical choice for model comparison.¹³

The idea of the second approach to model selection, the Bayesian approach, is to select the family estimated most likely to be true. In a first step, before considering the observations, one specifies the so-called priors, i.e. the probabilities that f stems from the approximating

For a more detailed introduction to the AIC and its derivation see e.g. Zucchini (2000).


family, P(f ∈ G1) and P(f ∈ G2). Then, in a second step, one computes and compares the posteriors, i.e. the probabilities that f belongs to the approximating family given the observations, P(f ∈ G1 | s1, ..., sT) and P(f ∈ G2 | s1, ..., sT). This approach results in the Bayesian information criterion (BIC), which has a slightly modified penalty term compared to the AIC:

BIC = −2 logL + p · log(T) ,

where logL and p are defined as for the AIC and T is the number of observations.¹⁴ Compared to the AIC, the penalty of the BIC has more weight, because in general it holds that log(T) > 2. Thus, the Bayesian information criterion usually favours models with fewer parameters than does the Akaike information criterion. This is also true for the soap sales series, as can be seen in Figure 6.1, which plots the AIC and the BIC against the number of states m. The exact values of both criteria are provided in Table 6.1.¹⁵

[Figure 6.1 plots the AIC and the BIC (vertical axis, roughly 1200 to 1500) against the number of states m = 1, ..., 6; the BIC curve lies above the AIC curve and rises more steeply.]

Figure 6.1: Model selection criteria AIC and BIC for the soap sales series.

model      k    logL    AIC    BIC
Poisson    1    -712    1426   1429
2-state    4    -619    1245   1259
3-state    9    -611    1239   1270
4-state   16    -604    1240   1296
5-state   25    -603    1255   1342
6-state   36    -598    1268   1393

Table 6.1: Model selection criteria AIC and BIC for the soap sales series.

According to the AIC the model with m = 3 is the most appropriate, while the BIC selects the two-state model. Thus, the model selected depends on the approach to model selection one wants to follow.

14 We use T here for the number of observations as we are dealing with HMMs; more generally one denotes the number of observations by n. For more details on the BIC we refer to Zucchini (2000) again.
15 Compare these values to the log-likelihoods obtained for the independent mixture models in Section 2.1. Although the HMMs require more parameters, the resulting values of the AIC and the BIC are lower than those obtained for the independent mixture models. Consider also the comparison of both model families at the end of Section 4.3.
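Both criteria are elementary to compute from the fitted log-likelihood; a minimal sketch (Python):

```python
import math

def aic(logL, p):
    """Akaike information criterion: -2 logL + 2p."""
    return -2 * logL + 2 * p

def bic(logL, p, T):
    """Bayesian information criterion: -2 logL + p log(T),
    where T is the number of observations."""
    return -2 * logL + p * math.log(T)

# e.g. the one-state (simple Poisson) model from Table 6.1:
# aic(-712, 1) -> 1426
```

Note that, up to the rounding of the log-likelihoods, these formulae reproduce the values in Table 6.1.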


Assume that one considers the AIC as selection criterion, i.e. chooses the three-state model. The respective parameter estimates are given by:

  Γ̂ = ( 0.86  0.12  0.02    δ̂ = (0.72, 0.22, 0.06)
        0.44  0.54  0.02    λ̂ = (3.74, 8.44, 14.93) .
        0.00  0.30  0.70 )

The state-dependent component distributions of this model, together with the resulting marginal, are illustrated in Figure 6.2. Note that the contributions of the component distributions to the marginal distribution are indicated by the respective colours.

[Figure 6.2 shows four panels of probability functions over s = 0, ..., 24: state 1 (λ1 = 3.74, δ1 = 0.72), state 2 (λ2 = 8.44, δ2 = 0.22), state 3 (λ3 = 14.93, δ3 = 0.06), and the resulting three-state-HMM marginal.]

Figure 6.2: Component distributions of the selected three-state HMM.

It is also interesting to consider the autocorrelation function of the selected three-state Poisson HMM. The autocorrelations can be evaluated using the formula given in Section 3.2. In Figure 6.3 the autocorrelation function of the fitted model is juxtaposed with that of the observations. It is obvious that both autocorrelation functions agree to a great extent. Furthermore, the marginal moments of the HMM, which are also given in Figure 6.3, lie close to the empirical moments of the observations. However, one can apply more sophisticated methods for model diagnostics, as will be shown in the following section.

[Figure 6.3 plots both autocorrelation functions for lags 0 to 12, together with the moments: soap sales series, mean = 5.44, variance = 15.4; three-state HMM, mean = 5.42, variance = 14.7.]

Figure 6.3: Sample and three-state HMM autocorrelation functions.


6.2 Model validation with pseudo-residuals

In the previous section we have considered two criteria for model selection in HMMs. These criteria provide a decision rule for selecting the best of several fitted models; however, they do not guarantee that the selected model is indeed appropriate. Therefore, one still has to assess the fit of the chosen model. In general, residuals are a popular tool for assessing the fit of a model. In the ideal case they are independently and identically distributed. Furthermore, it is of great advantage if, in the case of a valid model, they are U(0, 1)- or N(0, 1)-distributed, independent of the fitted model.

It is well known that in a standard regression, e.g. of the form yi = θ0 + θ1 xi + ei, the residuals ei are independently and identically distributed; however, the estimators of the residuals êi = yi − θ̂0 − θ̂1 xi are only approximately independently and identically distributed. Nevertheless, one can analyze the residuals by means of graphical tools in order to assess the fit of the model. For example, in order to search for outliers or to check for specific structures and dependencies, one can draw an index plot of the residuals or plot them against the dependent variable or the covariates. In case certain patterns occur, the model has to be revised and improved, e.g. by adding quadratic or cubic terms or even new covariates. Another possibility is to test distributional assumptions using histograms or qq-plots of the residuals. Some of these residual plots are illustrated in Figure 6.4.

[Figure 6.4 sketches four diagnostic plots of the residuals êi: an index plot (with an outlier marked), a plot against yi, a histogram, and a qq-plot of empirical against theoretical quantiles.]

Figure 6.4: Residual plots.

When dealing with HMMs the residual analysis becomes slightly more complicated, due to the fact that the residuals are not even approximately independently and identically distributed. Usually, each of the states of the HMM emits observations from a different distribution. Hence, different distributions should also be fitted to the residuals according to the different states of the HMM. The difficulty lies in the fact that, in general, the underlying Markov chain is unknown and therefore the residuals cannot be sorted in a suitable way. One solution to this problem is the use of so-called pseudo-residuals, which we will introduce in the following.¹⁶ The concept of pseudo-residuals, which is based on the concept of p-values, meets the desired properties of residuals outlined at the beginning of this section at least

16 For a detailed introduction (in German) to the construction and application of pseudo-residuals see Stadie (2003).


approximately, which makes it an appropriate means for assessing the fit of a model. Pseudo-residuals can be defined for almost every model and, in case the model is valid, they are independently and identically uniformly or normally distributed.

For the construction of pseudo-residuals consider the following theorem: Let X be a random variable with distribution function F. Then U := F(X) is uniformly distributed on the unit interval, i.e.

U = F(X) ∼ U[0, 1] .

The proof is left as an exercise to the reader (see Problem 6.1). Based on this theorem, a first version of the pseudo-residual of an observation xi from a continuous random variable Xi can be defined as the probability of obtaining an observation lower than xi under the fitted model:

ui = P{Xi ≤ xi} = F_Xi(xi) .

Given the validity of the fitted model, this type of pseudo-residual is U(0, 1)-distributed, with residuals for extreme observations close to 0 or 1 (Zucchini and MacDonald, 1999). The construction of the so-called uniform pseudo-residual is illustrated in Figure 6.5.

[Two panels: the density f(x) of the model with the observation x* marked, and the density f(u) of the resulting residual u* = F(x*) ∈ [0,1].]
Figure 6.5: Construction of uniformly distributed pseudo-residuals.

Thus, using the concept of uniform pseudo-residuals, observations from different distributions can be compared. Assume we have observations x_1, ..., x_n and a model X_i \sim F_i, i = 1, ..., n; i.e., each x_i has its own distribution function, and therefore the x_i-values cannot be compared directly. However, the concept of pseudo-residuals can be used to generate residuals u_i which, if the model holds, are independently and identically U(0,1)-distributed and therefore can be compared. The respective residual plots could look as shown in Figure 6.6.
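As a small sketch (not from the text), the probability integral transform that defines the uniform pseudo-residuals takes only a few lines of Python; the per-observation models and their parameters below are illustrative assumptions, and the standard library's `statistics.NormalDist` supplies the distribution functions:

```python
from statistics import NormalDist

def uniform_pseudo_residuals(observations, cdfs):
    """PIT: u_i = F_i(x_i); under a valid model the u_i are i.i.d. U(0,1)."""
    return [F(x) for x, F in zip(observations, cdfs)]

# Each observation has its own fitted distribution (illustrative parameters).
models = [NormalDist(mu=0.0, sigma=1.0), NormalDist(mu=5.0, sigma=2.0)]
xs = [0.0, 5.0]
us = uniform_pseudo_residuals(xs, [m.cdf for m in models])
# Both observations sit at their model medians, so both residuals are 0.5.
```

Although the two x-values (0.0 and 5.0) are not directly comparable, the residuals are: both equal 0.5, i.e. both observations are equally (un)extreme under their respective models.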

[Three panels: an indexplot of the u_i against i (with values such as 0.97 and 0.999 lying close to 1), a histogram, and a qq-plot of the uniform pseudo-residuals.]
Figure 6.6: Diagnostic plots for uniform pseudo-residuals.


If the histogram and the qq-plot of the uniform pseudo-residuals u_i do not look as they should, one can deduce that the residuals are not uniformly distributed and thus that the model is not valid. The concept of uniform pseudo-residuals is certainly very useful; however, it may lead to problems in outlier identification. For example, consider the indexplot given in Figure 6.6 and look at the values lying close to 0 or 1. It is hard to see whether a value is very unlikely or not: as a value of 0.999 is difficult to distinguish from a value of 0.97, the indexplot is almost useless for a quick visual analysis. Nevertheless, this deficiency of uniform pseudo-residuals can easily be overcome using another theorem. Let \Phi be the distribution function of the standard normal distribution and X a random variable with continuous distribution function F. Then Z := \Phi^{-1}(F(X)) is standard normally distributed:

Z = \Phi^{-1}(F(X)) \sim N(0, 1).

For the proof we refer to Problem 6.1 again. Using this theorem, a second version of pseudo-residuals arises from a simple transformation of the uniform pseudo-residuals via the inverse distribution function of the standard normal distribution:

z_i = \Phi^{-1}(u_i) = \Phi^{-1}(F_{X_i}(x_i)).

If the fitted model is valid, these so-called normal pseudo-residuals are N(0,1)-distributed, with the value of the residual equal to 0 if the observation coincides with the median. In this sense, the normal pseudo-residuals measure the deviation from the median, not from the expectation (Zucchini, 2002). The construction of normal pseudo-residuals is illustrated in Figure 6.7.
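A minimal sketch of the transformation to normal pseudo-residuals, again using only Python's standard library; the exponential model (rate 1) and the observation values are illustrative assumptions, not taken from the text:

```python
import math
from statistics import NormalDist

def normal_pseudo_residual(x, model_cdf):
    """z = Phi^{-1}(F(x)): N(0,1) under a valid model, equal to 0 at the median."""
    return NormalDist().inv_cdf(model_cdf(x))

def expo_cdf(x):
    """Illustrative fitted model: exponential with rate 1, F(x) = 1 - exp(-x)."""
    return 1.0 - math.exp(-x)

z_med = normal_pseudo_residual(math.log(2.0), expo_cdf)  # model median -> z near 0
z_big = normal_pseudo_residual(6.0, expo_cdf)            # deep right tail -> large z
```

The observation at the model median maps to a residual near 0, while the tail observation maps to a clearly large positive residual, which is exactly the outlier-visibility gain over the uniform scale discussed above.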

[Three rows of panels, one per observation x_1, x_2, ..., x_n: the model density f_i(x) with x_i marked, the uniform residual u_i = F_i(x_i) on the density f(u), and the normal residual z_i = \Phi^{-1}(u_i) on the standard normal density \phi(z), plotted over roughly (-3, 3).]
Figure 6.7: Construction of normal pseudo-residuals.


Thus, given that the observations x_1, ..., x_n were indeed generated by the model X_i \sim F_i, the normal pseudo-residuals z_i should follow a standard normal distribution, which can be checked either by visually analyzing the histogram or qq-plot or by performing tests for normality. Some examples of diagnostic plots for normal pseudo-residuals are given in Figure 6.8.

[Three panels: an indexplot of the z_i over roughly (-3, 3), in which \Phi^{-1}(0.999) stands out as an outlier while \Phi^{-1}(0.97) does not, a histogram, and a qq-plot of the normal pseudo-residuals.]
Figure 6.8: Diagnostic plots for normal pseudo-residuals.

The second version of pseudo-residuals has the advantage that the absolute value of a residual increases with increasing deviation from the median, and that extreme observations can be identified more easily on a normal scale (Zucchini and MacDonald, 1999). This becomes obvious if one compares the indexplots of the uniform and normal pseudo-residuals given in Figures 6.6 and 6.8. While in the case of uniform pseudo-residuals it was difficult to distinguish a value of 0.999 from a value of 0.97, in the indexplot of the normal pseudo-residuals it is clearly visible that the observation belonging to the residual \Phi^{-1}(0.999) is an outlier, whereas the residual \Phi^{-1}(0.97) does not differ strikingly from the other residuals.

Note that the theory of pseudo-residuals outlined so far applies to continuous distributions only. In the case of discrete observations, the pseudo-residuals have to be modified in order to allow for the discreteness of the data (for the following, compare Zucchini (2002) or Stadie (2003)). The pseudo-residuals are then no longer defined as points but as intervals. Thus, for a discrete random variable X_i with distribution function F_{X_i} one obtains the uniform pseudo-residual segments

[u_i^-; u_i^+] = [F_{X_i}(x_i^-); F_{X_i}(x_i)],

with x_i^- denoting the greatest possible realization that is lower than x_i, and the normal pseudo-residual segments

[z_i^-; z_i^+] = [\Phi^{-1}(u_i^-); \Phi^{-1}(u_i^+)] = [\Phi^{-1}(F_{X_i}(x_i^-)); \Phi^{-1}(F_{X_i}(x_i))].

The construction of the normal pseudo-residual segment of a discrete random variable is illustrated in Figure 6.9.
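A sketch of the discrete construction for a single observation; the Poisson model and its parameter are illustrative assumptions (in the soap sales application a Poisson HMM plays this role, but here we use a plain Poisson distribution to keep the example short):

```python
import math
from statistics import NormalDist

def poisson_cdf(x, lam):
    """P(X <= x) for a Poisson(lam) variable, x a non-negative integer."""
    return sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(x + 1))

def pseudo_residual_segment(x, lam):
    """Uniform and normal pseudo-residual segments for an observation x."""
    u_lo = poisson_cdf(x - 1, lam) if x > 0 else 0.0   # F(x^-) = P(X < x)
    u_hi = poisson_cdf(x, lam)                          # F(x)   = P(X <= x)
    phi_inv = NormalDist().inv_cdf
    # inv_cdf is undefined at 0, so map u_lo = 0 to -infinity explicitly
    z_lo = phi_inv(u_lo) if u_lo > 0.0 else float("-inf")
    return (u_lo, u_hi), (z_lo, phi_inv(u_hi))

(u_lo, u_hi), (z_lo, z_hi) = pseudo_residual_segment(2, lam=2.0)
# u_hi - u_lo equals P(X = 2) under the model.
```

The width of the uniform segment is the model probability of the observation itself, which is the property discussed in the next paragraph.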


[Two panels: the probability function P_i(x) of the discrete variable, partitioned into P(X_i < x_i), P(X_i = x_i), and P(X_i > x_i), and the standard normal density \phi(z) on which these probabilities delimit the pseudo-residual segment [z_i^-, z_i^+], with z_i^- = \Phi^{-1}(F_i(x_i^-)) and z_i^+ = \Phi^{-1}(F_i(x_i)).]
Figure 6.9: Construction of normal pseudo-residuals in the discrete case.

Both versions of pseudo-residual segments contain information on how extreme and how rare the observations are. For example, the lower limit u_i^- of the uniform pseudo-residual interval specifies the probability of observing a value lower than x_i, 1 - u_i^+ gives the probability of a value greater than x_i, and the difference u_i^+ - u_i^- is equal to the probability of the observation x_i under the fitted model. In analogy to the continuous case, the pseudo-residual segments can be interpreted as interval-censored realisations of a uniform or a standard normal distribution, respectively, if the fitted model is valid. Though this is strictly correct only if the parameters of the fitted model are known, it remains approximately correct if the number of estimated parameters is small compared to the size of the sample (Stadie, 2003). The diagnostic plots of pseudo-residuals of discrete random variables, as shown in Figure 6.10, look somewhat different from those of continuous random variables.

[Three panels: an indexplot of the pseudo-residual segments over roughly (-3, 3), a histogram of the mid-pseudo-residuals, and a qq-plot of the sorted segments against theoretical quantiles.]
Figure 6.10: Diagnostic plots for discrete pseudo-residuals.

Certainly, it is easy to construct an indexplot of the pseudo-residual segments or to plot them against any independent or dependent variable. However, in order to construct a qq-plot of the pseudo-residual segments, one has to define an ordering of the pseudo-residuals. One possibility for sorting the pseudo-residual segments is to consider the so-called mid-pseudo-residuals, defined as

z_i^m = \Phi^{-1}\left(\frac{u_i^- + u_i^+}{2}\right).

The qq-plot shown in Figure 6.10 was constructed by sorting the pseudo-residual segments according to the respective mid-pseudo-residuals. Furthermore, the mid-pseudo-residuals can be


used themselves to check for normality, for example via a histogram of the mid-pseudo-residuals.

Having outlined the basic theory of pseudo-residuals, we can now consider their use in the context of HMMs. The analysis of the pseudo-residuals of an HMM serves two purposes, namely the assessment of the general fit of a selected model and the detection of outliers. Depending on the aspects of the model to be analyzed, one can distinguish two approaches to computing the pseudo-residuals of an HMM.

The first technique considers the observations one at a time and seeks those which, relative to the model and all other observations in the series, are sufficiently extreme to suggest that they differ in nature or origin from the others. This means that one computes the pseudo-residuals from the conditional distribution of S_u given all the other observations, S^{(-u)} = s^{(-u)}, according to

z_u = \Phi^{-1}\big(P(S_u \le s_u \mid S^{(-u)} = s^{(-u)})\big)

for continuous component distributions, and

[z_u^-; z_u^+] = \big[\Phi^{-1}\big(P(S_u < s_u \mid S^{(-u)} = s^{(-u)})\big);\; \Phi^{-1}\big(P(S_u \le s_u \mid S^{(-u)} = s^{(-u)})\big)\big]

for discrete component distributions, where in both cases the conditional distribution P(S_u = s \mid S^{(-u)} = s^{(-u)}) is given by the formula introduced in Section 5.1. The following residual plots are based on this type of pseudo-residual.

The second technique for outlier detection seeks observations which are extreme relative to the model and all preceding observations. In this case the relevant conditional probability is P(S_u = s \mid S^{(u-1)} = s^{(u-1)}), and the respective pseudo-residuals are

z_u = \Phi^{-1}\big(P(S_u \le s_u \mid S^{(u-1)} = s^{(u-1)})\big)

for continuous component distributions, and

[z_u^-; z_u^+] = \big[\Phi^{-1}\big(P(S_u < s_u \mid S^{(u-1)} = s^{(u-1)})\big);\; \Phi^{-1}\big(P(S_u \le s_u \mid S^{(u-1)} = s^{(u-1)})\big)\big]

for discrete component distributions.
Here, in analogy to the conditional distribution in the first case, the desired conditional probability P(S_u = s \mid S^{(u-1)} = s^{(u-1)}) is given by the ratio of the likelihood of the first u observations to that of the first u - 1:

P(S_u = s \mid S^{(u-1)} = s^{(u-1)}) = \frac{\alpha_{u-1} \Gamma P(s) \mathbf{1}'}{\alpha_{u-1} \mathbf{1}'}.

The pseudo-residuals of the second type can be interpreted as forecast pseudo-residuals, since they measure the deviation of an observation from the median of the corresponding one-step-ahead forecast. If a forecast pseudo-residual is extreme, this indicates that the respective observation is an outlier or, alternatively, that the model no longer provides an acceptable description of the series.
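A sketch of the forecast pseudo-residual for a Poisson HMM; the two-state parameters are illustrative assumptions, and for brevity the sketch computes the point version based on P(S_u \le s_u) rather than the full segment (the forward vectors are left unscaled, which is fine for short series):

```python
import math
from statistics import NormalDist

def poisson_pmf(s, lam):
    return math.exp(-lam) * lam**s / math.factorial(s)

def forward_vector(obs, delta, gamma, lams):
    """alpha_t = delta P(s_1) Gamma P(s_2) ... Gamma P(s_t), unscaled."""
    alpha = [d * poisson_pmf(obs[0], l) for d, l in zip(delta, lams)]
    for s in obs[1:]:
        alpha = [sum(alpha[i] * gamma[i][j] for i in range(len(alpha)))
                 * poisson_pmf(s, lams[j]) for j in range(len(alpha))]
    return alpha

def forecast_pseudo_residual(obs, s_next, delta, gamma, lams):
    """z = Phi^{-1}( P(S_u <= s_next | all preceding observations) )."""
    alpha = forward_vector(obs, delta, gamma, lams)
    m, denom = len(alpha), sum(alpha)
    cdf = sum(
        sum(alpha[i] * gamma[i][j] for i in range(m)) * poisson_pmf(s, lams[j])
        for s in range(s_next + 1) for j in range(m)
    ) / denom
    return NormalDist().inv_cdf(cdf)

# two-state Poisson HMM with illustrative parameters
delta = [0.5, 0.5]
gamma = [[0.9, 0.1], [0.2, 0.8]]
lams = [1.0, 5.0]
z = forecast_pseudo_residual([1, 0, 2], s_next=2, delta=delta, gamma=gamma, lams=lams)
```

After a run of low counts the model favours the low-rate state, so an observation of 2 is unremarkable and z stays moderate; a large s_next would push z into the outlier region.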


In Figures 6.11 to 6.14 we illustrate different types of residual plots for the fitted models of the soap sales series, using the first definition of pseudo-residuals, based on the conditional distribution P(S_u = s \mid S^{(-u)} = s^{(-u)}). In order to compare the pseudo-residuals of the selected three-state model with those of the rejected models, we also provide the residual plots of the one-, two-, and four-state models.

[Each of the following figures shows four panels, one per fitted model (one-, two-, three-, and four-state HMM), over the 250 weeks of the soap sales series.]
Figure 6.11: Indexplots of the pseudo-residuals of the soap sales series.
Figure 6.12: Histograms of the mid-pseudo-residuals of the soap sales series.
Figure 6.13: qq-plots of the pseudo-residuals of the soap sales series.

Regarding the residual plots provided in Figures 6.11 to 6.13, it is obvious that the selected three-state model provides an acceptable fit while, for example, the pseudo-residuals of the one-state model, i.e. a simple Poisson distribution, deviate strikingly from the standard normal distribution. However, considering the residual plots only, and not the model selection criteria, one might even accept the two-state model as a simple alternative.


Figure 6.14 shows the autocorrelation functions of the mid-pseudo-residuals of the fitted models. Although the mid-pseudo-residuals should clearly not be as highly correlated as those of the one-state model shown in the left panel, one can show that the pseudo-residuals of an HMM are not independent in general (MacDonald and Zucchini, 1997). Nevertheless, the mid-pseudo-residuals should be approximately independent if the fitted model is adequate. In the present case, the weak negative correlation of the pseudo-residuals of the three-state model at the first lag is still acceptable.

[Four panels, one per fitted model (one-, two-, three-, and four-state HMM), each showing the sample autocorrelation function for lags 0 to 12.]
Figure 6.14: Autocorrelation functions of the mid-pseudo-residuals of the soap sales series.

Problems for Chapter 6

Problem 6.1

(a) Let X be a continuous-valued random variable with distribution function F. Show that the random variable U = F(X) is uniformly distributed over the interval [0, 1], i.e. U \sim U(0, 1).

(b) Suppose that U \sim U(0, 1) and let F be the distribution function of a continuous-valued random variable. Show that the random variable X = F^{-1}(U) has the distribution function F.

(c) (i) Give the explicit expression for F^{-1} for the exponential distribution, i.e. the density function f(x) = \lambda e^{-\lambda x}, x \ge 0.
    (ii) Verify your result by generating 1000 uniformly distributed random deviates, transforming these using F^{-1}, and then examining the histogram of the resulting values.

(d) Show that for a continuous-valued random variable X with distribution function F, the random variable Z = \Phi^{-1}(F(X)), where \Phi is the distribution function of the standard normal distribution, is standard normally distributed, i.e. Z \sim N(0, 1).
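One possible sketch of parts (c)(i)-(ii), written in Python rather than R: solving u = 1 - e^{-\lambda x} for x gives F^{-1}(u) = -\log(1 - u)/\lambda, and transforming uniform deviates through F^{-1} should yield an exponential sample (here checked via the sample mean instead of a histogram; \lambda = 2 is arbitrary):

```python
import math
import random

def exp_inverse_cdf(u, lam):
    """Inverse of F(x) = 1 - exp(-lam * x): F^{-1}(u) = -log(1 - u) / lam."""
    return -math.log(1.0 - u) / lam

random.seed(1)
lam = 2.0
sample = [exp_inverse_cdf(random.random(), lam) for _ in range(1000)]
mean = sum(sample) / len(sample)   # should be close to 1/lam = 0.5
```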


Problem 6.2

(a) Write an R function estimate.expHMM() to estimate the parameters of an m-state exponential HMM. Note that the two functions param() and paraminv() (developed for the Poisson case) can be used without change in this case, too.

(b) Apply the above function to estimate the parameters for the following series of observations for m = 1, 2, 3:
14.8, 35.0, 13.2, 4.4, 0.1, 0.3, 0.8, 65.6, 1.0, 1.1, 2.1, 10.0, 0.1, 8.2, 17.5, 69.4, 15.4, 0.3, 0.1, 0.7

(c) Write a function forecast.expHMM(h=1, srange, st, gamma, lambda) that computes the h-step-ahead forecast for an exponential HMM. Use your function to forecast the series in (b) for h = 1, 2.

(d) Write a function psresid.forecast.expHMM(st, gamma, lambda) that computes the one-step-ahead forecast normal pseudo-residuals for an exponential HMM. Note that the distribution function of the one-step-ahead forecast at time t is given by

P(S_{t+1} \le s \mid S^{(t)} = s^{(t)}) = \sum_{i=1}^{m} \xi_i F_i(s),

where \xi_i is the i-th entry of the vector \alpha_t \Gamma / (\alpha_t \mathbf{1}'). Note also that the distribution function at time t = 0 is

P(S_1 \le s) = \sum_{i=1}^{m} \delta_i F_i(s).

Use your function to compute the forecast normal pseudo-residuals for the time series given in (b). Use plots (e.g. histogram, qq-plot) to examine these residuals.
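The forecast distribution in (d) can be sketched as follows, in Python rather than R; the two-state parameters below are illustrative assumptions (in the exercise they would come from estimate.expHMM()), and the forward vectors are left unscaled for brevity:

```python
import math

def exp_pdf(x, lam):
    return lam * math.exp(-lam * x)

def exp_cdf(x, lam):
    return 1.0 - math.exp(-lam * x)

def forecast_cdf(obs, s, delta, gamma, lams):
    """One-step-ahead forecast CDF of an exponential HMM:
    P(S_{t+1} <= s | history) = sum_i xi_i F_i(s), xi = alpha_t Gamma / (alpha_t 1')."""
    m = len(delta)
    alpha = [delta[i] * exp_pdf(obs[0], lams[i]) for i in range(m)]
    for x in obs[1:]:
        alpha = [sum(alpha[i] * gamma[i][j] for i in range(m)) * exp_pdf(x, lams[j])
                 for j in range(m)]
    denom = sum(alpha)
    xi = [sum(alpha[i] * gamma[i][j] for i in range(m)) / denom for j in range(m)]
    return sum(xi[j] * exp_cdf(s, lams[j]) for j in range(m))

delta = [0.5, 0.5]
gamma = [[0.9, 0.1], [0.2, 0.8]]
lams = [1.0, 0.1]   # state-dependent rates (state means 1 and 10)
p = forecast_cdf([0.5, 2.0, 12.0], 5.0, delta, gamma, lams)
```

Feeding this CDF value through \Phi^{-1} would give the forecast normal pseudo-residual asked for in (d).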

7 Applications and Extensions

One of the key advantages of HMMs is the ease with which the basic model can be generalized, in several different ways, to provide models for a wide range of time series. In this chapter we outline the application of the basic HMM with component distributions other than the Poisson (Section 7.1) and the basic ideas of some possible extensions of the basic HMM (Sections 7.2 to 7.4).

The first generalization, which we discuss in Section 7.2, adds flexibility to the basic model by generalizing the underlying parameter process: the assumption that the parameter process is a first-order Markov chain is relaxed by allowing it to be a second-order Markov chain. This extension can be applied not only to the basic model but also to most of the other models to be discussed. We then illustrate how the basic model can be generalized to construct HMMs for a number of different and more complex types of time series, including the following.

• Series of multinomial-like observations and categorical series (Section 7.3.1): Examples of multinomial-like series include daily sales of a particular item categorized into the four consumer categories: adult female, adult male, juvenile female, juvenile male. An important special case of the multinomial-like series is that in which there is exactly one of q possible mutually exclusive outcomes in each time period, that is, a categorical time series. An example is a time series of hourly wind directions in the form of the conventional 16 categories, i.e. the 16 points of the compass.

• Multivariate series (Section 7.3.2): An example of a bivariate discrete-valued series is the number of sales of each of two related items. A key feature of multivariate time series is that, in addition to serial dependence within each series, there may be dependence across the series.

• Series which depend on covariates (Section 7.4): Many, if not most, time series that are studied in practice exhibit trend, seasonal variation, or both. Examples include monthly sales of items, daily numbers of shares traded, insurance claims received, and so on. One can regard such series as depending on the covariate "time". In some cases covariates other than time are relevant; for example, one might wish to model the number of sales of an item as a function of the price, advertising expenditure, and sales promotions, and allow for trend and seasonal fluctuations.


Note that this chapter (or at least Sections 7.2 to 7.4) is not part of the general introduction to the basic HMM but is rather meant for further reading. For that reason, we concentrate on outlining the basic ideas of the possible extensions and conceivable applications without providing all the details. For a more detailed description of these extensions and their applications (especially for discrete random variables) we refer to MacDonald and Zucchini (1997).

7.1 Hidden Markov models with various component distributions

In this paper we have introduced the basic HMM concentrating on Poisson component distributions. However, as already mentioned, one may use any other discrete or continuous distribution for modelling the state-dependent distributions (in fact, it is even feasible to use different distributions for different states). One simply has to redefine the matrices containing the state-dependent probabilities and, if necessary, the transformations for the state-dependent parameters in the estimation process. In the following we outline some possible applications of HMMs with various component distributions without going into detail (for more details on HMMs with other, in particular discrete-valued, component distributions and their applications, we refer again to MacDonald and Zucchini (1997)).

• Hidden Markov models for unbounded counts: The Poisson distribution is the "classical" model for unbounded counts. However, a popular alternative for modelling unbounded counts, especially for overdispersed data, is the negative binomial distribution, whose state-dependent probabilities are given by

p_i(s) = \frac{\Gamma(s + 1/\eta_i)}{\Gamma(1/\eta_i)\,\Gamma(s+1)} \left(\frac{1}{1+\eta_i\mu_i}\right)^{1/\eta_i} \left(\frac{\eta_i\mu_i}{1+\eta_i\mu_i}\right)^{s}, \qquad s = 0, 1, 2, \ldots,

with E(S_t \mid C_t = i) = \mu_i and Var(S_t \mid C_t = i) = \mu_i(1 + \eta_i\mu_i) (note that this is only one of several possible parameterizations of the negative binomial distribution). A negative binomial HMM may be applied if the data are still overdispersed relative to a Poisson HMM. Conceivable (economic) examples for the application of Poisson or negative binomial HMMs include series of counts of stoppages or breakdowns of technical equipment, earthquakes, sales (see the soap sales series), insurance claims, reported accidents, defective items, or stock trades.

• Hidden Markov models for binary data: The Bernoulli HMM, which may be used to model binary time series, is the simplest HMM. Its state-dependent probabilities for the two possible outcomes are given by

p_i(0) = P(S_t = 0 \mid C_t = i) = 1 - \pi_i  (failure),
p_i(1) = P(S_t = 1 \mid C_t = i) = \pi_i      (success).
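Returning to the negative binomial parameterization above, its mean and variance can be checked numerically; this sketch (in Python rather than R, with arbitrary values of \mu and \eta) computes the pmf on the log scale via lgamma to avoid overflow for large s:

```python
import math

def negbin_pmf(s, mu, eta):
    """Negative binomial pmf in the (mu, eta) parameterization:
    mean mu, variance mu * (1 + eta * mu)."""
    r = 1.0 / eta                   # "size" parameter 1/eta
    p = 1.0 / (1.0 + eta * mu)      # success probability
    log_pmf = (math.lgamma(s + r) - math.lgamma(r) - math.lgamma(s + 1)
               + r * math.log(p) + s * math.log(1.0 - p))
    return math.exp(log_pmf)

mu, eta = 3.0, 0.5
probs = [negbin_pmf(s, mu, eta) for s in range(400)]
mean = sum(s * p for s, p in enumerate(probs))
var = sum((s - mean) ** 2 * p for s, p in enumerate(probs))
# mean ≈ mu = 3.0 and var ≈ mu * (1 + eta * mu) = 7.5
```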


An artificial example of a Bernoulli HMM is provided in Section 3.3. Real examples for the application of Bernoulli HMMs are given in the following table:

random variable              failure      success
daily rainfall               no rain      rain
daily trading of a share     not traded   traded
daily presence               absent       present
market value of a share      down         up

• Hidden Markov models for bounded counts: Binomial HMMs may be applied to model series of bounded counts, where n_t is the number of trials at time t and s_t is the number of successes at time t. The state-dependent binomial probabilities are given by

p_{ti}(s_t) = \binom{n_t}{s_t} \pi_i^{s_t} (1 - \pi_i)^{n_t - s_t}.

Possible examples of series of bounded counts that may be described by a binomial HMM are series of

– purchasing preferences, i.e. n_t = number of purchases of all brands on day t, s_t = number of purchases of brand A on day t;

– sales of newspapers or magazines, i.e. n_t = number available on day (or week) t, s_t = number purchased on day (or week) t.

Note that there is one restriction when computing the forecast distribution of a binomial HMM: either the number of trials at time T + h must be known, or one has to fit a separate model to predict the respective number of trials. Of course, by setting n_{T+h} = 1 it is also possible simply to compute the forecast distribution of the count proportions.

• Hidden Markov models for continuous-valued time series: So far, we have primarily considered HMMs with discrete-valued state-dependent component distributions. However, we have also mentioned that it is possible to use continuous-valued component distributions; one simply has to replace the probability functions by the respective state-dependent probability density functions. Important HMMs for continuous-valued time series are exponential, Gamma, or normal HMMs (for the application of an exponential HMM see Problem 6.2). Normal HMMs are often applied for modelling stock returns, especially if the observed kurtosis of a series is greater than the theoretical kurtosis of a normal distribution, E(S_t - \mu)^4 / \sigma^4 = 3. For example, the series of returns on the NYSE Composite Index introduced in Section 2.1 has a kurtosis of about 5.75, which is clearly


higher than that of a normal distribution. In addition, the series shows serial correlation at the first lag, so that a normal HMM seems to be a more appropriate model than the independent mixture model used in Section 2.1. However, the estimated parameters of a fitted stationary two-state normal HMM, given by

\hat{\Gamma} = \begin{pmatrix} 0.92 & 0.08 \\ 0.03 & 0.97 \end{pmatrix}, \qquad \hat{\delta} = (0.25,\ 0.75), \qquad \hat{\mu} = (-0.05,\ 0.07), \qquad \hat{\sigma} = (1.0,\ 0.5),

where \hat{\delta} is the stationary distribution of \hat{\Gamma}, are quite similar to those of the independent mixture of two normal distributions shown in Figure 2.7. Note that the likelihood of a normal HMM is indeed unbounded: theoretically, it is possible to increase the likelihood without limit by letting one state-dependent variance approach zero. In practice, this may lead to problems in parameter estimation.
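A simulation sketch (in Python rather than R, and not part of the text) that generates a series from the fitted two-state normal HMM above; such simulated series are useful, for example, for parametric-bootstrap checks of the fit:

```python
import random

def simulate_normal_hmm(T, delta, gamma, mu, sigma, seed=0):
    """Draw a state path from Gamma and normal observations from the active state."""
    rng = random.Random(seed)
    states, obs = [], []
    probs = delta                      # initial state distribution
    for _ in range(T):
        state = 0 if rng.random() < probs[0] else 1
        states.append(state)
        obs.append(rng.gauss(mu[state], sigma[state]))
        probs = gamma[state]           # next state drawn from this row of Gamma
    return states, obs

# the estimated parameters from the text
gamma = [[0.92, 0.08], [0.03, 0.97]]
delta, mu, sigma = [0.25, 0.75], [-0.05, 0.07], [1.0, 0.5]
states, obs = simulate_normal_hmm(250, delta, gamma, mu, sigma)
```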

7.2 Second-order hidden Markov models

One generalization of the basic model is a second-order HMM. This can be constructed by replacing the underlying first-order Markov chain in the basic model (or in the extensions to follow) by a stationary second-order chain. The latter is characterized by the transition probabilities

\gamma_{ijk} = P(C_t = k \mid C_{t-1} = j, C_{t-2} = i)

and the stationary bivariate distribution u(j, k) = P(C_{t-1} = j, C_t = k). That is, the probabilities u(j, k) satisfy

u(j, k) = \sum_{i=1}^{m} u(i, j)\,\gamma_{ijk} \qquad \text{and} \qquad \sum_{j=1}^{m}\sum_{k=1}^{m} u(j, k) = 1.

The process {C_t} is not a first-order Markov chain, but we can define a new process {X_t} with X_t = (C_{t-1}, C_t), which is a first-order Markov chain (on the state space M^2). For example, consider the most general stationary second-order Markov chain {C_t} on the two states 1 and 2. This can be characterized by the four transition probabilities

a = P(C_t = 2 \mid C_{t-1} = 1, C_{t-2} = 1),
b = P(C_t = 1 \mid C_{t-1} = 2, C_{t-2} = 2),
c = P(C_t = 1 \mid C_{t-1} = 2, C_{t-2} = 1),
d = P(C_t = 2 \mid C_{t-1} = 1, C_{t-2} = 2).

The process {X_t}, as defined above, is then a first-order Markov chain on the four states (1,1), (1,2), (2,1), (2,2) (in that order), with transition probability matrix

\begin{pmatrix} 1-a & a & 0 & 0 \\ 0 & 0 & c & 1-c \\ 1-d & d & 0 & 0 \\ 0 & 0 & b & 1-b \end{pmatrix}.

The parameters a, b, c and d are bounded by 0 and 1 but are otherwise unconstrained. The stationary distribution of {X_t} is proportional to

\big( b(1-d),\ ab,\ ab,\ a(1-c) \big),


from which it follows that the matrix of stationary bivariate probabilities for {C_t} is

\frac{1}{b(1-d) + 2ab + a(1-c)} \begin{pmatrix} b(1-d) & ab \\ ab & a(1-c) \end{pmatrix}.

We will mention only two important aspects of this second-order hidden Markov model here; further details are given in Section 3.2 of MacDonald and Zucchini (1997). The first is that it is possible to evaluate the likelihood of a second-order hidden Markov model in a very similar fashion to that of the basic model: the computational effort is in this case cubic in m, the number of states, and, as before, linear in T, the number of observations. This enables one to estimate the parameters, for example by direct maximization of the likelihood, and also to compute the forecast distributions. The second aspect relates to the number of parameters of the (second-order) Markov chain component of the model. In general this has m^2(m - 1) free parameters, a number that rapidly becomes prohibitively large as m increases. This over-parameterization problem can be circumvented by using a restricted sub-class of second-order Markov chain models, for example that described by Pegram (1980), which has only m + 1 parameters, or that described by Raftery (1985), which has m(m - 1) + 1 parameters. Such models are necessarily less flexible than the general class of second-order Markov chains: they maintain the second-order structure of the chain but trade off some flexibility in return for a reduction in the number of parameters. Having pointed out that it is possible to increase the order of the Markov chain in an HMM from one to two (or higher, in fact), in what follows we restrict our attention to the simpler case of a first-order Markov chain.
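The pair-process construction can be checked numerically. This sketch (with arbitrary values of a, b, c, d; not from the text) builds the 4x4 transition matrix above and confirms by power iteration that the stationary distribution of {X_t} is proportional to (b(1-d), ab, ab, a(1-c)):

```python
def pair_chain_tpm(a, b, c, d):
    """TPM of X_t = (C_{t-1}, C_t) on the states (1,1), (1,2), (2,1), (2,2)."""
    return [[1 - a, a, 0.0, 0.0],
            [0.0, 0.0, c, 1 - c],
            [1 - d, d, 0.0, 0.0],
            [0.0, 0.0, b, 1 - b]]

def stationary(P, iters=2000):
    """Stationary distribution by power iteration (P assumed ergodic)."""
    pi = [0.25] * 4
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(4)) for j in range(4)]
    return pi

a, b, c, d = 0.3, 0.6, 0.4, 0.2
pi = stationary(pair_chain_tpm(a, b, c, d))
# pi should be proportional to (b(1-d), ab, ab, a(1-c))
```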

7.3 Hidden Markov models for multivariate series

7.3.1 Series of multinomial-like observations and categorical series

One of the examples of the basic model that we mentioned in Section 7.1 is the binomial HMM, in which the series of observations {s_t : t = 1, ..., T} can be regarded as the numbers of successes in n_t independent Bernoulli trials. The m-state model has m different probabilities of success, \pi_i, one for each state i. The probability that is applicable at a given time t is determined by a first-order Markov chain {C_t}. Thus, for example, if C_1 = 2 and C_2 = 4, the success probability \pi_2 is applicable at time t = 1, and \pi_4 at time t = 2.

A multinomial HMM is a generalization of the binomial HMM in which there are q mutually exclusive possible outcomes in each trial, where q \ge 2. The observations are then q series of counts, {s_{tj} : t = 1, ..., T, j = 1, ..., q} with s_{t1} + s_{t2} + ... + s_{tq} = n_t, where n_t is the (known) number of trials at time t. Thus, for example, s_{23} represents the number of outcomes at time t = 2 that were of type 3. The counts s_{tj} at time t can be combined in a vector s_t = (s_{t1}, s_{t2}, ..., s_{tq}). We will suppose that, conditional on C^{(T)}, the random vectors {S_t = (S_{t1}, S_{t2}, ..., S_{tq}) : t = 1, ..., T} are mutually independent.


The parameters of the model are as follows. As in the basic model, the matrix \Gamma has m(m - 1) free parameters. With each of the m states of the Markov chain is associated a multinomial distribution with parameters n_t (known) and q unknown probabilities which, for state i, we denote by \pi_{i1}, \pi_{i2}, ..., \pi_{iq}. These probabilities are constrained because \sum_{j=1}^{q} \pi_{ij} = 1 for each state i. This component of the model therefore has m(q - 1) free parameters, and the entire model has m^2 - m + (q - 1)m = m^2 + m(q - 2).

The likelihood of observations s_1, ..., s_T from a general multinomial HMM differs little from that of a binomial HMM; the only difference is that the binomial probabilities

p_{ti}(s_t) = \binom{n_t}{s_t} \pi_i^{s_t} (1 - \pi_i)^{n_t - s_t}

are replaced by the multinomial probabilities

p_{ti}(s_t) = P(S_t = s_t \mid C_t = i) = \binom{n_t}{s_{t1}, s_{t2}, \ldots, s_{tq}} \pi_{i1}^{s_{t1}} \pi_{i2}^{s_{t2}} \cdots \pi_{iq}^{s_{tq}}.[17]

The likelihood is therefore given by

L_T = \delta P_1(s_1) \Gamma P_2(s_2) \cdots \Gamma P_T(s_T) \mathbf{1}', \qquad \text{where } P_t(s_t) = \mathrm{diag}\big(p_{t1}(s_t), \ldots, p_{tm}(s_t)\big).

In maximizing the likelihood in order to estimate the parameters, one must observe a number of "generalized upper bound" constraints, namely \sum_{j \ne i} \gamma_{ij} \le 1 and \sum_{j=1}^{q-1} \pi_{ij} \le 1, i = 1, 2, ..., m, as well as the usual constraints on probabilities: 0 \le \gamma_{ij} \le 1 and 0 \le \pi_{ij} \le 1. Once the parameters have been estimated, they can be used to estimate various forecast distributions, that is, the distributions of future values of the series. This is done using the formulae that were given for the basic model, except that the state-dependent probabilities are now based on the multinomial distribution rather than on the binomial or Poisson. There is, however, one restriction to such forecasts (which also applies to the binomial HMM). We have assumed that n_t, the number of trials at time t, is known. This number, being the sum of the q observed counts at time t, is certainly known at times t = 1, 2, ..., T.
Now, in order to compute the one-step-ahead forecast distribution, one needs to know n_{T+1}, the number of trials that will take place at time T + 1. This will be known in some applications, for instance when the number of trials is prescribed by a sampling scheme. However, there are also applications in which n_{T+1} is a random variable whose value remains unknown until time T + 1. In the latter case it is not possible to (directly) compute the forecast distribution of the counts at time T + 1. However, by setting n_{T+1} = 1, it is possible to compute the forecast distribution of the count proportions. An alternative approach, discussed in Section 3.8 of MacDonald and Zucchini (1997), is to fit a separate model to the series {n_t}, to use that model to compute the forecast distribution of n_{T+1}, and then, finally, to use that to compute the required forecast distribution of the counts for the multinomial HMM.

[17] Note that these probabilities need to be indexed by the time t because the number of trials n_t is time-dependent. In contrast, we assume the state-dependent probabilities \pi_{i1}, \pi_{i2}, ..., \pi_{iq} to be constant over time.


A simple but important special case of the multinomial HMM is that in which n_t = 1 for all t. This provides models for categorical time series. Here, the state-dependent probabilities p_{ti}(s_t) and the matrix expression for the likelihood simplify considerably. If the j-th component of s_t is 1 (and the others are therefore 0), we have p_{ti}(s_t) = \pi_{ij}, and the subscript t is clearly unnecessary. It follows that

P(s_t) = \mathrm{diag}\big(p_1(s_t), \ldots, p_m(s_t)\big) = \mathrm{diag}(\pi_{1j}, \ldots, \pi_{mj}) =: P(j)

if again s_t is a vector with j-th component 1 and the others 0. Hence the likelihood of observing categories j_1, j_2, ..., j_T at times 1, 2, ..., T is given by

L_T = \delta P(j_1) \Gamma P(j_2) \cdots \Gamma P(j_T) \mathbf{1}'.

This implies, for instance, that the probability of observing category j at time t, given that category l is observed at time t - 1, is

\frac{\delta P(l) \Gamma P(j) \mathbf{1}'}{\delta P(l) \mathbf{1}'}.

Thus, once the parameters of the model have been estimated, it is very easy to compute the forecast distribution.
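The simplified likelihood for categorical series is easy to code; a sketch with illustrative two-state, three-category parameters (in Python rather than R, using the usual left-to-right evaluation of the matrix product):

```python
def categorical_hmm_likelihood(categories, delta, gamma, pi):
    """L_T = delta P(j_1) Gamma P(j_2) ... Gamma P(j_T) 1',
    with P(j) = diag(pi[0][j], ..., pi[m-1][j])."""
    m = len(delta)
    phi = [delta[i] * pi[i][categories[0]] for i in range(m)]
    for j in categories[1:]:
        phi = [sum(phi[i] * gamma[i][k] for i in range(m)) * pi[k][j]
               for k in range(m)]
    return sum(phi)

# two states, three categories (illustrative parameters)
delta = [0.5, 0.5]
gamma = [[0.8, 0.2], [0.3, 0.7]]
pi = [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]   # pi[i][j] = P(category j | state i)
L = categorical_hmm_likelihood([0, 2, 1], delta, gamma, pi)
```

Because the likelihood is a genuine probability, summing it over all q^T possible category sequences gives exactly 1, which is a convenient sanity check on an implementation.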

7.3.2 Other multivariate series

The series of multinomial-like counts discussed in the last section are, of course, examples of multivariate series, but they have a specific structure. In this section we illustrate how it is possible to develop HMMs for different and more complex types of multivariate series.

Consider q time series {(S_t1, S_t2, ..., S_tq) : t = 1, ..., T}, which we will represent compactly as {S_t : t = 1, ..., T}. As we did for the basic HMM, we will assume that, conditional on C^(T) = {C_t : t = 1, ..., T}, the above random vectors are mutually independent. We will refer to this property as conditional independence along time, in order to distinguish it from a different type of conditional independence that we will describe later.

To specify an HMM for such a series it is necessary to postulate a model for the distribution of this random vector for each of the m states of the parameter process. Expressed in symbols, one requires formulae for the probabilities

p_ti(s_t) = P(S_t = s_t | C_t = i),   i = 1, 2, ..., m.

(For generality we keep the time index here, i.e. we allow the state-dependent probabilities to change over time.) In the case of the multinomial HMMs these probabilities were supplied by m multinomial distributions.

We note that it is not required that each of the q component series has the same distribution. For example, S_t1 could be modelled using a Poisson distribution, S_t2 using a Bernoulli distribution, and so on. The expression for the likelihood that is given below also holds if some of the series are continuous-valued. Secondly, it is not necessary that the distribution of S_t1 (say) is of the same type for each of the m states of the underlying Markov chain. In principle one could use the Poisson distribution in state 1, the negative binomial distribution in state 2, and so on. However, we have not encountered applications in which this feature could be usefully exploited.

What is necessary is to specify models for m joint distributions, a task that can be anything but trivial. For example, there is no single bivariate Poisson distribution; different versions are available and they have different properties (in contrast, one can reasonably speak of the bivariate normal distribution because, for practical purposes, there is only one). One has to select a version that is appropriate in the context of the particular application being investigated.

Once the required joint distributions have been selected, i.e. once one has given formulae for the state-dependent probabilities p_ti(s_t), the likelihood of a general multivariate HMM is easy to write down. It has essentially the same form as that for the basic model, namely

L_T = δ P_1(s_1) Γ P_2(s_2) · · · Γ P_T(s_T) 1′,

where s_1, ..., s_T are the observations and P_t(s_t) = diag(p_t1(s_t), ..., p_tm(s_t)). This is, in fact, the same formula as the one given above for multinomial-like series.

The task of finding suitable joint distributions is greatly simplified if one can make the assumption of contemporaneous conditional independence. We will illustrate the meaning of this term by means of the multisite precipitation model discussed by Zucchini and Guttorp (1991). In their application there are five binary time series representing the presence or absence of rain at each of five sites, which are regarded as being linked by a common climate process {C_t}.
There the random variables S_tj are binary. Let π_tij be defined as

π_tij = P(S_tj = 1 | C_t = i) = 1 − P(S_tj = 0 | C_t = i).

The assumption of contemporaneous conditional independence implies that the required joint probability p_ti(s_t) can be computed as the product of the marginal probabilities:

p_ti(s_t) = ∏_{j=1}^{q} π_tij^{s_tj} (1 − π_tij)^{1 − s_tj}.

Thus, for example, given climate state i, the probability that it will rain at, say, sites (1, 2, 4) but not at sites (3, 5) on day t is the product of the (marginal) probabilities that these events occur, namely

π_ti1 π_ti2 (1 − π_ti3) π_ti4 (1 − π_ti5).

For general multivariate HMMs that are contemporaneously conditionally independent, the state-dependent probabilities are given by a product of the corresponding q marginal probabilities:

p_ti(s_t) = ∏_{j=1}^{q} P(S_tj = s_tj | C_t = i).
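The Bernoulli product above is straightforward to compute. The sketch below uses five hypothetical site probabilities (the values are not from the Zucchini and Guttorp application):

```python
# State-dependent joint probability under contemporaneous conditional
# independence: a product of Bernoulli marginals. The site probabilities
# are illustrative, not estimates from the rainfall application.

def joint_prob(pi_i, s):
    """p_i(s) = prod_j pi_ij^{s_j} (1 - pi_ij)^{1 - s_j} for binary s."""
    p = 1.0
    for prob, sj in zip(pi_i, s):
        p *= prob if sj == 1 else 1.0 - prob
    return p

pi_state1 = [0.8, 0.7, 0.3, 0.9, 0.4]       # P(rain at site j | state 1)
p = joint_prob(pi_state1, [1, 1, 0, 1, 0])  # rain at sites 1, 2, 4 only
```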


We wish to emphasize that the above two conditional independence assumptions, namely conditional independence along time and contemporaneous conditional independence, do not imply that (a) the individual component series are serially independent, or that (b) the component series are mutually independent. The parameter process, namely the Markov chain, which is common to all the components, induces both serial dependence and cross-dependence in the component series, even when these are assumed to have both of the above conditional independence properties. This is illustrated in Figure 7.1.

[Figure: a common two-state Markov chain {C_t} and the state-dependent process (S_t1, S_t2); conditional on the states, the observations are independent along time and across the series, but unconditionally the component series are not independent.]

Figure 7.1: Contemporaneous conditional independence.

Details relating to serial and cross-correlation functions of these models are given in Section 3.4 of MacDonald and Zucchini (1997), as are other general classes of models for multivariate HMMs, such as models with time lags and models in which some of the variables are discrete and others are continuous. Multivariate HMMs (with continuous component distributions) might, for example, be used for modelling multivariate financial time series. One could fit a two-state multivariate normal HMM to a multivariate series of daily returns on a number of shares where, as in the univariate case, the underlying states of the Markov chain might correspond to calm and turbulent phases of the stock market.

7.4 Series which depend on covariates, such as time

HMMs can also be modified to allow for the influence of covariates, by postulating dependence of the state-dependent probabilities p_ti(s_t) on those covariates, as illustrated in Figure 7.2. This opens the way for such models to incorporate time trend and seasonality, for instance.

In fact, we have already allowed the state-dependent probabilities to change over time in the previous section.


[Figure: observed covariates X_{t−2}, X_{t−1}, X_t, X_{t+1}; observations S_{t−2}, S_{t−1}, S_t, S_{t+1}; hidden states C_{t−2}, C_{t−1}, C_t, C_{t+1}.]

Figure 7.2: HMMs with covariates for the state-dependent probabilities.

We take {C_t} to be the usual Markov chain, and we suppose, in the case of Poisson HMMs, that the conditional mean λ_ti, where the index t indicates that the mean changes over time, depends on the (row) vector x_t of q covariates, for instance as follows:

log λ_ti = β_i x_t′.

In the case of binomial HMMs, the corresponding assumption is that

logit π_ti = β_i x_t′.

The elements of x_t could include a constant, time (t), sinusoidal components expressing seasonality (for example cos(2πt/r) and sin(2πt/r) for some positive integer r), and any other relevant covariates. For example, a binomial HMM with

logit π_ti = β_i1 + β_i2 t + β_i3 cos(2πt/r) + β_i4 sin(2πt/r) + β_i5 y_t + β_i6 z_t

allows for a (logistic-linear) time trend, r-period seasonality and the influence of covariates y_t and z_t in the state-dependent success probabilities π_ti. Additional sine-cosine pairs can, if necessary, be included to model more complex seasonal patterns. Similar models for the log of the conditional mean λ_ti are possible in the Poisson-HMM case. Clearly, link functions other than the canonical ones used here could be considered too.

The expression for the likelihood of T consecutive observations s_1, ..., s_T for these models involving covariates, i.e.

L_T = δ P_1(s_1) Γ P_2(s_2) · · · Γ P_T(s_T) 1′,

is similar to the one in the basic case. The only difference is the dependence of the state-dependent probabilities on covariates, i.e. the precise definition of p_ti(s_t), and hence of P_t(s_t) = diag(p_t1(s_t), ..., p_tm(s_t)). It is worth noting that binomial and Poisson HMMs which allow for covariates in this way provide useful generalizations of logistic regression and Poisson regression, respectively: generalizations that drop the independence assumption of such regression models and allow for serial dependence.
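As a sketch of the logistic-linear specification above, the following computes a state-dependent success probability π_ti from a coefficient vector; the coefficient values, the period r, and the covariates y_t, z_t are all hypothetical:

```python
# Covariate-dependent success probability via a logit link:
# logit pi_ti = b1 + b2 t + b3 cos(2 pi t / r) + b4 sin(2 pi t / r)
#              + b5 y_t + b6 z_t.  All numerical values are illustrative.
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def pi_ti(beta_i, t, r, y_t, z_t):
    b1, b2, b3, b4, b5, b6 = beta_i
    eta = (b1 + b2 * t
           + b3 * math.cos(2 * math.pi * t / r)
           + b4 * math.sin(2 * math.pi * t / r)
           + b5 * y_t + b6 * z_t)
    return logistic(eta)

beta_state1 = (-1.0, 0.002, 0.5, -0.3, 0.1, 0.0)  # hypothetical coefficients
p = pi_ti(beta_state1, t=10, r=12, y_t=1.0, z_t=0.0)
```

Whatever the covariate values, the logit link guarantees that the resulting probability lies strictly between 0 and 1.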


An alternative way to model time trend and seasonality in HMMs is to drop the assumption that the Markov chain is homogeneous, and to assume instead that the transition probabilities are a function of time, i.e.

Γ_t = [ γ_11,t  γ_12,t ;  γ_21,t  γ_22,t ].

More generally, the transition probabilities can be modelled as depending on one or more covariates, not necessarily time but any variables that are considered relevant. The incorporation of covariates into the Markov chain is not as straightforward as incorporating them into the state-dependent probabilities. One reason why it could be worthwhile, however, is that the resulting Markov chain may have a useful substantive interpretation, e.g. as a weather process which is itself complex but determines rainfall probabilities at several sites in fairly simple fashion.

We will illustrate one way in which it is possible to modify the transition probabilities of the Markov chain to model time trend and seasonality. Consider a model based on a two-state Markov chain {C_t} with

P(C_t = 2 | C_{t−1} = 1) = γ_1,t,   P(C_t = 1 | C_{t−1} = 2) = γ_2,t,

and, for i = 1, 2,

logit γ_i,t = β_i x_t′.

For example, a model incorporating r-period seasonality is that with

logit γ_i,t = β_i1 + β_i2 cos(2πt/r) + β_i3 sin(2πt/r).

In general, the above assumption on logit γ_i,t implies that the transition probability matrix, for transitions between times t − 1 and t, is given by

Γ_t = [ 1/(1 + exp(β_1 x_t′))               exp(β_1 x_t′)/(1 + exp(β_1 x_t′)) ;
        exp(β_2 x_t′)/(1 + exp(β_2 x_t′))   1/(1 + exp(β_2 x_t′)) ].

Extension of this model to the case m > 2 presents difficulties, but they appear not to be insuperable. One important difference between the class of models proposed here and other HMMs (and a consequence of the non-homogeneity) is that we cannot always assume that there is a stationary distribution for the Markov chain. This problem arises when one or more of the covariates are functions of time, as in models for trend or seasonality. If necessary, we therefore assume instead some initial distribution δ (at time t = 1). A very general class of models in which the Markov chain is non-homogeneous and which allows for the influence of covariates is that of Hughes (1993). This model, and additional details relating to the models outlined above, are given in Chapter 3 of MacDonald and Zucchini (1997).
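The time-varying transition matrix Γ_t can be built directly from the two logit equations; the coefficient vectors β_1, β_2 and the period r below are hypothetical:

```python
# Time-varying transition matrix of a two-state chain with
# logit gamma_{i,t} = b1 + b2 cos(2 pi t / r) + b3 sin(2 pi t / r).
# The coefficient values are illustrative.
import math

def gamma_it(beta, t, r):
    b1, b2, b3 = beta
    eta = b1 + b2 * math.cos(2 * math.pi * t / r) + b3 * math.sin(2 * math.pi * t / r)
    return math.exp(eta) / (1.0 + math.exp(eta))  # inverse logit

def Gamma_t(beta1, beta2, t, r):
    g1 = gamma_it(beta1, t, r)   # P(C_t = 2 | C_{t-1} = 1)
    g2 = gamma_it(beta2, t, r)   # P(C_t = 1 | C_{t-1} = 2)
    return [[1.0 - g1, g1], [g2, 1.0 - g2]]

G = Gamma_t((-2.0, 0.8, -0.5), (-1.0, 0.3, 0.4), t=7, r=12)
```

By construction each row of Γ_t sums to one, so the matrix is a valid transition probability matrix for every t.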


Problems for Chapter 7

Problem 7.1
Give an expression for the likelihood function of an HMM in the case where the state-dependent distributions are given by

(a) geometric distributions with parameters θ_i ∈ (0, 1), i = 1, 2, ..., m,

(b) exponential distributions with parameters λ_i > 0, i = 1, 2, ..., m,

(c) bivariate normal distributions with parameters

μ_i = (μ_1i, μ_2i)′,   Σ_i = [ σ²_1i  σ_12i ;  σ_12i  σ²_2i ].

Solutions

Here, we only outline the solutions to the theoretical problems. Solutions to the practical problems will be provided in a separate document.

Problems for Chapter 2

Solution to Problem 2.1
(a) For a mixture of two distributions one has that

E(X²) = δ_1 E(X_1²) + δ_2 E(X_2²) = δ_1(σ_1² + μ_1²) + δ_2(σ_2² + μ_2²)   [as E(X_i²) = Var(X_i) + (E(X_i))²],

(E(X))² = (δ_1 μ_1 + δ_2 μ_2)² = δ_1² μ_1² + δ_2² μ_2² + 2 δ_1 δ_2 μ_1 μ_2,

Var(X) = E(X²) − (E(X))² = δ_1 σ_1² + δ_2 σ_2² + δ_1(1 − δ_1) μ_1² + δ_2(1 − δ_2) μ_2² − 2 δ_1 δ_2 μ_1 μ_2.

Since δ_1 + δ_2 = 1, it follows that δ_1(1 − δ_1) = δ_2(1 − δ_2) = δ_1 δ_2, and so

Var(X) = δ_1 σ_1² + δ_2 σ_2² + δ_1 δ_2 (μ_1 − μ_2)².

(b) For a mixture of two Poisson distributions, Po(λ_i), i = 1, 2, one has that μ_i = σ_i² = λ_i, and so

Var(X) = δ_1 λ_1 + δ_2 λ_2 + δ_1 δ_2 (λ_1 − λ_2)² = E(X) + δ_1 δ_2 (λ_1 − λ_2)² > E(X)   for λ_1 ≠ λ_2.
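The variance identity in (a) can be checked numerically by computing Var(X) both from the moment identities and from the closed form; the parameter values below are arbitrary:

```python
# Numerical check of the two-component mixture variance formula
# Var(X) = d1 s1^2 + d2 s2^2 + d1 d2 (m1 - m2)^2.
# The mixing weights, means and standard deviations are arbitrary.
d1, d2 = 0.3, 0.7
m1, m2 = 1.0, 4.0
s1, s2 = 0.5, 2.0

EX2 = d1 * (s1**2 + m1**2) + d2 * (s2**2 + m2**2)  # E(X^2) via E(X_i^2)
EX = d1 * m1 + d2 * m2                             # E(X)
var_moments = EX2 - EX**2                          # Var(X) = E(X^2) - (E(X))^2
var_formula = d1 * s1**2 + d2 * s2**2 + d1 * d2 * (m1 - m2)**2
```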


Solution to Problem 2.3
(a) For a two-state stationary Markov chain we have that

δΓ = δ   with   δ_1 + δ_2 = 1, i.e. δ1′ = 1.

Denoting the 2 × 2 identity matrix by I, we have that δ(I − Γ) = 0 and δ1′ = 1, i.e.

(δ_1 δ_2) [ 1 − γ_11   −γ_12 ;  −γ_21   1 − γ_22 ] = (0 0).

This yields two equations which are identical (recall that γ_12 = 1 − γ_11 and γ_21 = 1 − γ_22):

δ_1(1 − γ_11) − δ_2 γ_21 = 0  ⟹  δ_1 γ_12 − δ_2 γ_21 = 0,
−δ_1 γ_12 + δ_2(1 − γ_22) = 0  ⟹  δ_1 γ_12 − δ_2 γ_21 = 0.

Since the second equation is redundant, the additional equation needed to solve for δ_1 and δ_2 is δ_1 + δ_2 = 1, which yields δ_2 = 1 − δ_1 and hence

δ_1 γ_12 − (1 − δ_1) γ_21 = 0  ⟹  δ_1 = γ_21/(γ_12 + γ_21)   and   δ_2 = γ_12/(γ_12 + γ_21).

A more direct way of solving the problem is simply to replace the second (redundant) equation in the system δ(I − Γ) = 0 by δ1′ = 1. That yields

(δ_1 δ_2) [ 1 − γ_11   1 ;  −γ_21   1 ] = (0 1).

For the more general case m ≥ 2, one replaces the last column of (I − Γ) by a vector of ones, and the vector on the right-hand side of the equation by a vector whose last entry is 1 and whose other entries are 0. For example, for m = 3 one solves the system

(δ_1 δ_2 δ_3) [ 1 − γ_11   −γ_12   1 ;  −γ_21   1 − γ_22   1 ;  −γ_31   −γ_32   1 ] = (0 0 1).

(b) We first need to compute δ, which is given by

(δ_1 δ_2) = (1/(0.1 + 0.2)) (0.2 0.1) = (2/3 1/3).

Then the probabilities of the two sequences are computed as follows:

sequence 1: 2/3 · 0.9 · 0.9 · 0.1 · 0.8 · 0.2 = 0.00864,
sequence 2: 1/3 · 0.2 · 0.9 · 0.1 · 0.2 · 0.9 = 0.00108.

The probabilities of the sequences depend on the probabilities of the respective first states and on the orders in which the states occur. Here, sequence 2 is less likely because it starts in the less probable of the two states and because its order of states is less probable than that of sequence 1.
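For the two-state case the stationary distribution has the explicit form derived above, which can be checked directly against the defining equation δΓ = δ; the transition matrix below is illustrative:

```python
# Stationary distribution of a two-state chain via the closed form
# d1 = g21 / (g12 + g21), checked against delta Gamma = delta.
# The transition matrix is illustrative.

def stationary_two_state(Gamma):
    g12, g21 = Gamma[0][1], Gamma[1][0]
    d1 = g21 / (g12 + g21)
    return [d1, 1.0 - d1]

Gamma = [[0.9, 0.1], [0.2, 0.8]]
delta = stationary_two_state(Gamma)  # (2/3, 1/3) for this Gamma
```

For general m one would solve the linear system with the last column of (I − Γ) replaced by ones, as described in the solution.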


Solution to Problem 2.4
The result can be proved by mathematical induction: we show that (a) it holds for k = 1, and (b) if it holds for k = j, j ∈ {1, 2, ...}, then it also holds for k = j + 1. From Problem 2.3 we know that

(δ_1 δ_2) = (1/(γ_1 + γ_2)) (γ_2 γ_1),

and so for k = 1 we obtain

(1/(γ_1 + γ_2)) { [γ_2  γ_1 ; γ_2  γ_1] + (1 − γ_1 − γ_2) [γ_1  −γ_1 ; −γ_2  γ_2] }
  = (1/(γ_1 + γ_2)) [ (γ_1 + γ_2) − γ_1(γ_1 + γ_2)   γ_1(γ_1 + γ_2) ;  γ_2(γ_1 + γ_2)   (γ_1 + γ_2) − γ_2(γ_1 + γ_2) ]
  = [ 1 − γ_1   γ_1 ;  γ_2   1 − γ_2 ] = Γ,

which proves that the result holds for k = 1. Now suppose that it holds for k = j, j ∈ {1, 2, ...}. Then, for k = j + 1, we have that

Γ^{j+1} = Γ Γ^j
  = [ 1 − γ_1   γ_1 ;  γ_2   1 − γ_2 ] · (1/(γ_1 + γ_2)) { [γ_2  γ_1 ; γ_2  γ_1] + (1 − γ_1 − γ_2)^j [γ_1  −γ_1 ; −γ_2  γ_2] }
  = (1/(γ_1 + γ_2)) { [γ_2  γ_1 ; γ_2  γ_1] + (1 − γ_1 − γ_2)^j [ γ_1(1 − γ_1 − γ_2)   −γ_1(1 − γ_1 − γ_2) ;  −γ_2(1 − γ_1 − γ_2)   γ_2(1 − γ_1 − γ_2) ] }
  = [ δ_1  δ_2 ; δ_1  δ_2 ] + (1 − γ_1 − γ_2)^{j+1} [ δ_2  −δ_2 ; −δ_1  δ_1 ].

It follows that the result holds for k = 1, 2, ... .
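The closed form for Γ^k can also be verified numerically by comparing it with repeated matrix multiplication; the off-diagonal probabilities below are arbitrary:

```python
# Check the closed form Gamma^k = (1/(g1+g2)) { [g2, g1; g2, g1]
# + (1 - g1 - g2)^k [g1, -g1; -g2, g2] } against repeated multiplication.
# The values of g1 and g2 are arbitrary.
g1, g2 = 0.3, 0.2
Gamma = [[1 - g1, g1], [g2, 1 - g2]]

def matmul(A, B):
    return [[sum(A[i][r] * B[r][j] for r in range(2)) for j in range(2)]
            for i in range(2)]

def power(A, k):
    P = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(k):
        P = matmul(P, A)
    return P

def closed_form(g1, g2, k):
    c, s = (1 - g1 - g2) ** k, g1 + g2
    return [[(g2 + c * g1) / s, (g1 - c * g1) / s],
            [(g2 - c * g2) / s, (g1 + c * g2) / s]]
```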

Problems for Chapter 3

Solution to Problem 3.1
(a)

E(S_t) = Σ_{i=1}^m E(S_t | C_t = i) P(C_t = i) = Σ_{i=1}^m λ_i δ_i = δλ′.

(b) We use the fact that, for a Poisson-distributed random variable X ∼ Po(λ), E(X²) = Var(X) + (E(X))² = λ + λ²:

E(S_t²) = Σ_{i=1}^m E(S_t² | C_t = i) P(C_t = i) = Σ_{i=1}^m (λ_i + λ_i²) δ_i = Σ_{i=1}^m λ_i² δ_i + Σ_{i=1}^m λ_i δ_i = δΛλ′ + δλ′,

where Λ = diag(λ_1, ..., λ_m).


(c) Var(S_t) = E(S_t²) − (E(S_t))² = δΛλ′ + δλ′ − (δλ′)².

(d)

E(S_t S_{t+k}) = Σ_{i=1}^m Σ_{j=1}^m E(S_t S_{t+k} | C_t = i, C_{t+k} = j) P(C_t = i, C_{t+k} = j)
  = Σ_{i=1}^m Σ_{j=1}^m E(S_t | C_t = i) E(S_{t+k} | C_{t+k} = j) P(C_t = i) P(C_{t+k} = j | C_t = i)
  = Σ_{i=1}^m Σ_{j=1}^m δ_i λ_i γ_ij(k) λ_j
  = δΛΓ^k λ′.

The first step follows from the conditional independence property of the HMM, and γ_ij(k) denotes the entry in the i-th row and j-th column of the matrix Γ^k.

(e) The result follows from the definition of the covariance and from (a) and (d):

Cov(S_t, S_{t+k}) = E(S_t S_{t+k}) − E(S_t) E(S_{t+k}) = δΛΓ^k λ′ − (δλ′)².

(f) The result follows from the definition of the correlation, from (c) and (e), and from stationarity (Var(S_t) = Var(S_{t+k})):

Cor(S_t, S_{t+k}) = Cov(S_t, S_{t+k}) / sqrt(Var(S_t) Var(S_{t+k})) = (δΛΓ^k λ′ − (δλ′)²) / (δΛλ′ + δλ′ − (δλ′)²).

For m = 2, δ_1 + δ_2 = 1 and the denominator of this formula reduces to

δ_1 λ_1² + δ_2 λ_2² + δ_1 λ_1 + δ_2 λ_2 − (δ_1 λ_1 + δ_2 λ_2)²
  = δ_1 λ_1 + δ_2 λ_2 + δ_1 λ_1² + δ_2 λ_2² − δ_1² λ_1² − 2 δ_1 δ_2 λ_1 λ_2 − δ_2² λ_2²
  = δ_1 λ_1 + δ_2 λ_2 + λ_1² δ_1(1 − δ_1) + λ_2² δ_2(1 − δ_2) − 2 δ_1 δ_2 λ_1 λ_2
  = δ_1 λ_1 + δ_2 λ_2 + λ_1² δ_1 δ_2 + λ_2² δ_1 δ_2 − 2 δ_1 δ_2 λ_1 λ_2
  = δ_1 λ_1 + δ_2 λ_2 + δ_1 δ_2 (λ_2 − λ_1)².

For the numerator we use the result of Problem 2.4 and obtain

δΛΓ^k λ′ − (δλ′)²
  = (δ_1 λ_1  δ_2 λ_2) { [δ_1  δ_2 ; δ_1  δ_2] + (1 − γ_1 − γ_2)^k [δ_2  −δ_2 ; −δ_1  δ_1] } (λ_1, λ_2)′ − (δ_1 λ_1 + δ_2 λ_2)².

The first term, (δ_1 λ_1  δ_2 λ_2) [δ_1  δ_2 ; δ_1  δ_2] (λ_1, λ_2)′ = (δ_1 λ_1 + δ_2 λ_2)², cancels with the last, and the remaining term reduces to

(1 − γ_1 − γ_2)^k (δ_1 λ_1  δ_2 λ_2) ( δ_2(λ_1 − λ_2), δ_1(λ_2 − λ_1) )′
  = (1 − γ_1 − γ_2)^k ( δ_1 δ_2 λ_1 (λ_1 − λ_2) + δ_1 δ_2 λ_2 (λ_2 − λ_1) )
  = (1 − γ_1 − γ_2)^k δ_1 δ_2 (λ_2 − λ_1)²,

which proves the result.
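The moment formulas of Problem 3.1 can be checked numerically for a two-state Poisson HMM by comparing the matrix expression δΛΓ^k λ′ − (δλ′)² with the closed form just derived; the parameter values are arbitrary:

```python
# Numerical check of the m = 2 Poisson-HMM moment formulas: the matrix
# expression for Cov(S_t, S_{t+k}) against the closed form
# (1 - g1 - g2)^k d1 d2 (lam2 - lam1)^2.  Parameter values are arbitrary.
g1, g2 = 0.1, 0.3
lam1, lam2 = 1.0, 5.0
Gamma = [[1 - g1, g1], [g2, 1 - g2]]
d1, d2 = g2 / (g1 + g2), g1 / (g1 + g2)   # stationary distribution
k = 3

def matmul(A, B):
    return [[sum(A[i][r] * B[r][j] for r in range(2)) for j in range(2)]
            for i in range(2)]

Gk = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(k):
    Gk = matmul(Gk, Gamma)                # Gk = Gamma^k

delta, lam = [d1, d2], [lam1, lam2]
mean = d1 * lam1 + d2 * lam2              # E(S_t) = delta lambda'
cov = (sum(delta[i] * lam[i] * Gk[i][j] * lam[j]
           for i in range(2) for j in range(2)) - mean**2)
cov_closed = (1 - g1 - g2)**k * d1 * d2 * (lam2 - lam1)**2
var = d1 * lam1**2 + d2 * lam2**2 + mean - mean**2
var_closed = mean + d1 * d2 * (lam2 - lam1)**2
```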

Solution to Problem 3.2
(a) From the Markov property and the stationarity of the sequence C_t, t = 1, 2, 3, we have that

P(C_1 = i, C_2 = j, C_3 = k) = P(C_1 = i) P(C_2 = j | C_1 = i) P(C_3 = k | C_2 = j) = δ_i γ_ij γ_jk,

and thus, using the conditional independence property, we obtain

P(S_1 = 0, S_2 = 2, S_3 = 1)
  = Σ_{i=1}^2 Σ_{j=1}^2 Σ_{k=1}^2 P(S_1 = 0, S_2 = 2, S_3 = 1 | C_1 = i, C_2 = j, C_3 = k) P(C_1 = i, C_2 = j, C_3 = k)
  = Σ_{i=1}^2 Σ_{j=1}^2 Σ_{k=1}^2 p_i(0) p_j(2) p_k(1) δ_i γ_ij γ_jk,

where p_ℓ(s) = λ_ℓ^s e^{−λ_ℓ} / s!, ℓ ∈ {1, 2}.

Here we have

(δ_1 δ_2) = (1/(0.9 + 0.4)) (0.4 0.9) = (4/13 9/13),

and with λ_1 = 1 and λ_2 = 3 we obtain

p_1(0) ≈ 0.368, p_1(2) ≈ 0.184, p_1(1) ≈ 0.368,
p_2(0) ≈ 0.050, p_2(2) ≈ 0.224, p_2(1) ≈ 0.149.

The following table lists all possible sequences of states, the respective probabilities and their products.

i  j  k  p_i(0)  p_j(2)  p_k(1)  δ_i    γ_ij  γ_jk  product
1  1  1  0.368   0.184   0.368   4/13   0.1   0.1   0.000077
1  1  2  0.368   0.184   0.149   4/13   0.1   0.9   0.000280
1  2  1  0.368   0.224   0.368   4/13   0.9   0.4   0.003359
1  2  2  0.368   0.224   0.149   4/13   0.9   0.6   0.002045
2  1  1  0.050   0.184   0.368   9/13   0.4   0.1   0.000093
2  1  2  0.050   0.184   0.149   9/13   0.4   0.9   0.000341
2  2  1  0.050   0.224   0.368   9/13   0.6   0.4   0.000682
2  2  2  0.050   0.224   0.149   9/13   0.6   0.6   0.000415
                                              sum:  0.007292

Note that the values in the table can be used to compute the conditional probability of each possible sequence of states C_1, C_2, C_3, using

P(C_1 = i, C_2 = j, C_3 = k | S_1 = 0, S_2 = 2, S_3 = 1)
  = P(C_1 = i, C_2 = j, C_3 = k, S_1 = 0, S_2 = 2, S_3 = 1) / P(S_1 = 0, S_2 = 2, S_3 = 1)

for i, j, k ∈ {1, 2}. Thus the most likely sequence of states here is C_1 = 1, C_2 = 2, C_3 = 1; its conditional probability is given by 0.003359/0.007292 ≈ 0.461.

(b)

P(S_1 = 0, S_2 = 2, S_3 = 1) = δ P(0) Γ P(2) Γ P(1) 1′
  = (4/13 9/13) [0.368, 0; 0, 0.050] [0.1, 0.9; 0.4, 0.6] [0.184, 0; 0, 0.224] [0.1, 0.9; 0.4, 0.6] [0.368, 0; 0, 0.149] (1, 1)′
  ≈ 0.007292.

Solution to Problem 3.3
(a) Analogously to Problem 3.2, one can show that

P(S_1 = 0, S_3 = 1)
  = Σ_{i=1}^2 Σ_{j=1}^2 Σ_{k=1}^2 P(S_1 = 0, S_3 = 1 | C_1 = i, C_2 = j, C_3 = k) P(C_1 = i, C_2 = j, C_3 = k)
  = Σ_{i=1}^2 Σ_{j=1}^2 Σ_{k=1}^2 p_i(0) p_k(1) δ_i γ_ij γ_jk,

where p_ℓ(s) = λ_ℓ^s e^{−λ_ℓ} / s!, ℓ ∈ {1, 2}.


Furthermore, as in Problem 3.2, we have

(δ_1 δ_2) = (1/(0.9 + 0.4)) (0.4 0.9) = (4/13 9/13),

and with λ_1 = 1 and λ_2 = 3 we obtain

p_1(0) ≈ 0.368, p_1(1) ≈ 0.368, p_2(0) ≈ 0.050, p_2(1) ≈ 0.149.

The following table lists all possible sequences of states, the respective probabilities and their products.

i  j  k  p_i(0)  p_k(1)  δ_i    γ_ij  γ_jk  product
1  1  1  0.368   0.368   4/13   0.1   0.1   0.000416
1  1  2  0.368   0.149   4/13   0.1   0.9   0.001522
1  2  1  0.368   0.368   4/13   0.9   0.4   0.014991
1  2  2  0.368   0.149   4/13   0.9   0.6   0.009130
2  1  1  0.050   0.368   9/13   0.4   0.1   0.000507
2  1  2  0.050   0.149   9/13   0.4   0.9   0.001853
2  2  1  0.050   0.368   9/13   0.6   0.4   0.003043
2  2  2  0.050   0.149   9/13   0.6   0.6   0.001853
                                      sum:  0.033316

(b)

P(S_1 = 0, S_3 = 1) = δ P(0) Γ² P(1) 1′
  = (4/13 9/13) [0.368, 0; 0, 0.050] [0.1, 0.9; 0.4, 0.6]² [0.368, 0; 0, 0.149] (1, 1)′
  ≈ 0.033316.
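Both matrix-product likelihoods (Problems 3.2(b) and 3.3(b)) can be recomputed with a short forward recursion; a missing observation (Problem 3.3) simply replaces P(s) by the identity, leaving Γ alone in the product:

```python
# Recomputing the likelihoods of Problems 3.2(b) and 3.3(b) with the
# matrix-product formula; lambda = (1, 3), Gamma = [[0.1, 0.9], [0.4, 0.6]],
# delta = (4/13, 9/13), as in the solutions above.
import math

lam = [1.0, 3.0]
Gamma = [[0.1, 0.9], [0.4, 0.6]]
delta = [4/13, 9/13]

def pois(l, s):
    return l**s * math.exp(-l) / math.factorial(s)

def likelihood(obs):
    """obs is a list of counts; None marks a missing observation,
    for which P(s) is replaced by the identity matrix."""
    phi = [delta[i] * (1.0 if obs[0] is None else pois(lam[i], obs[0]))
           for i in range(2)]
    for s in obs[1:]:
        phi = [sum(phi[r] * Gamma[r][i] for r in range(2))
               * (1.0 if s is None else pois(lam[i], s))
               for i in range(2)]
    return sum(phi)

L1 = likelihood([0, 2, 1])      # Problem 3.2(b): ~ 0.007292
L2 = likelihood([0, None, 1])   # Problem 3.3(b): ~ 0.033316
```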

Problems for Chapter 4

Solution to Problem 4.1
There are various ways to prove this result. We will use mathematical induction, i.e. (a) we show that the result holds for t = T − 1, and (b) we show that, if the result holds for t + 1, then it also holds for t.

(a) The case t = T − 1. The vector β′_{T−1} is defined as β′_{T−1} = B_T 1′ = Γ P(s_T) 1′. Thus the i-th element of β_{T−1} is given by

β_{T−1}(i) = Σ_{j=1}^m γ_ij p_j(s_T) = Σ_{j=1}^m P(C_T = j | C_{T−1} = i) P(S_T = s_T | C_T = j).

Now, from

P(C_T = j | C_{T−1} = i) = P(C_T = j, C_{T−1} = i) / P(C_{T−1} = i) = P(C_{T−1} = i | C_T = j) P(C_T = j) / P(C_{T−1} = i)

and

P(C_{T−1} = i | C_T = j) P(S_T = s_T | C_T = j) = P(S_T = s_T, C_{T−1} = i | C_T = j)   (conditional independence),

it follows that

β_{T−1}(i) = (1/P(C_{T−1} = i)) Σ_{j=1}^m P(S_T = s_T, C_{T−1} = i | C_T = j) P(C_T = j)
  = (1/P(C_{T−1} = i)) Σ_{j=1}^m P(S_T = s_T, C_{T−1} = i, C_T = j)
  = P(S_T = s_T, C_{T−1} = i) / P(C_{T−1} = i)
  = P(S_T = s_T | C_{T−1} = i).

Thus the result holds for t = T − 1.

(b) The case t. Suppose that the result holds for t + 1 and, for convenience, define R[j] as the event {S_j = s_j, S_{j+1} = s_{j+1}, ..., S_T = s_T}. We wish to show that the i-th entry of β_t is equal to P(R[t+1] | C_t = i). Given β_{t+1}, the vector β_t can be computed via the recursion β′_t = B_{t+1} β′_{t+1} = Γ P(s_{t+1}) β′_{t+1}. Thus the i-th entry of β_t is given by

β_t(i) = Σ_{j=1}^m γ_ij p_j(s_{t+1}) β_{t+1}(j)
  = Σ_{j=1}^m P(C_{t+1} = j | C_t = i) P(S_{t+1} = s_{t+1} | C_{t+1} = j) P(R[t+2] | C_{t+1} = j)
  = Σ_{j=1}^m P(C_{t+1} = j | C_t = i) P(S_{t+1} = s_{t+1}, R[t+2] | C_{t+1} = j).

The last step follows from the conditional independence of S_{t+1} and the subsequent S_j, j = t + 2, ..., T, given C_{t+1} = j. Also, {S_{t+1} = s_{t+1}} ∩ R[t+2] = R[t+1]. Using the same steps as in the case t = T − 1, it follows that

β_t(i) = (1/P(C_t = i)) Σ_{j=1}^m P(R[t+1], C_t = i | C_{t+1} = j) P(C_{t+1} = j)
  = P(R[t+1], C_t = i) / P(C_t = i)
  = P(R[t+1] | C_t = i).


Solution to Problem 4.3
(a) We have

ϱ_ij = 1 for i = j,   ϱ_ij = g(τ_ij) ≥ 0 for i ≠ j,

and so ϱ_ij ≥ 0 for all i, j = 1, 2, ..., m. Then

γ_ij = ϱ_ij / Σ_{k=1}^m ϱ_ik ∈ [0, 1]   for all i, j = 1, 2, ..., m,

and

Σ_{j=1}^m γ_ij = ( Σ_{j=1}^m ϱ_ij ) / ( Σ_{k=1}^m ϱ_ik ) = 1   for all i = 1, 2, ..., m.

(b) We assume here that γ_ii ≠ 0 for all i = 1, 2, ..., m. Then ϱ_ij = γ_ij / γ_ii, and τ_ij = g^{−1}(ϱ_ij) for i ≠ j; i, j = 1, 2, ..., m. The existence of g^{−1} follows from the assumption that g is a strictly monotone increasing function.
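A common concrete choice in this reparametrization is g = exp, which makes the working parameters τ_ij unconstrained. The sketch below implements the map and its inverse for this choice (the τ values are arbitrary):

```python
# Reparametrization of the transition matrix with g = exp:
# rho_ii = 1, rho_ij = exp(tau_ij); gamma_ij = rho_ij / sum_k rho_ik.
# The inverse uses tau_ij = log(gamma_ij / gamma_ii). Values are arbitrary.
import math

def tau_to_gamma(tau):
    m = len(tau)
    rho = [[1.0 if i == j else math.exp(tau[i][j]) for j in range(m)]
           for i in range(m)]
    return [[rho[i][j] / sum(rho[i]) for j in range(m)] for i in range(m)]

def gamma_to_tau(Gamma):
    """Inverse map, assuming gamma_ii != 0 (diagonal entries set to 0)."""
    m = len(Gamma)
    return [[0.0 if i == j else math.log(Gamma[i][j] / Gamma[i][i])
             for j in range(m)] for i in range(m)]

tau = [[0.0, -1.0, 0.5], [2.0, 0.0, -0.3], [0.1, 1.2, 0.0]]
G = tau_to_gamma(tau)
```

However far an unconstrained optimizer moves the τ_ij, the resulting Γ always has non-negative entries and unit row sums.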

Problems for Chapter 5

Solution to Problem 5.1
(a)

P(S_u = s | S^(−u) = s^(−u)) = P(S_u = s, S^(−u) = s^(−u)) / P(S^(−u) = s^(−u))
  = δ B_1 B_2 · · · B_{u−1} Γ P(s) B_{u+1} · · · B_T 1′ / δ B_1 B_2 · · · B_{u−1} Γ B_{u+1} · · · B_T 1′
  = α_{u−1} Γ P(s) β′_u / α_{u−1} Γ β′_u.

(b) Denoting η = α_{u−1} Γ, we have

P(S_u = s | S^(−u) = s^(−u)) = Σ_{i=1}^m η(i) p_i(s) β_u(i) / Σ_{i=1}^m η(i) β_u(i).

Now, denoting η(i) β_u(i) by d_u(i), and d_u(i) / Σ_{j=1}^m d_u(j) by ζ_u(i), we obtain

P(S_u = s | S^(−u) = s^(−u)) = Σ_{i=1}^m ζ_u(i) p_i(s).


Solution to Problem 5.3
(a)

P(S_{T+h} = s | S^(T) = s^(T)) = P(S^(T) = s^(T), S_{T+h} = s) / P(S^(T) = s^(T))
  = δ B_1 B_2 · · · B_T Γ^h P(s) 1′ / δ B_1 B_2 · · · B_T 1′
  = α_T Γ^h P(s) 1′ / α_T 1′.

Now ξ := α_T Γ^h / α_T 1′ is a (row) vector of length m whose entries sum to one (to see this, note that the entries of φ_T = α_T / α_T 1′ sum to one, and that the rows of Γ^h sum to one). Thus the above expression reduces to the form

ξ P(s) 1′ = Σ_{i=1}^m ξ_i p_i(s).

This result holds for any discrete state-dependent distribution, not only for the Poisson. Furthermore, an analogous result holds for continuous state-dependent distributions, say with probability density functions (pdfs) f_1, f_2, ..., f_m:

f_{S_{T+h} | S^(T) = s^(T)}(s) = Σ_{i=1}^m ξ_i f_i(s).

(b) According to (a),

P(S_{T+h} = s | S^(T) = s^(T)) = φ_T Γ^h P(s) 1′,

where φ_T = α_T / α_T 1′ is a vector whose entries sum to one. Since the Markov chain is homogeneous, it follows that φ_T Γ^h approaches δ as h increases. Thus the forecast distribution approaches

δ P(s) 1′ = Σ_{i=1}^m δ_i p_i(s).

Again, this result is true for any discrete state-dependent distribution. For continuous state-dependent distributions one simply replaces p_i(s) by f_i(s) to obtain an analogous result; i.e. as h increases, the pdf of the h-step-ahead forecast approaches

δ P(s) 1′ = Σ_{i=1}^m δ_i f_i(s).
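The convergence of the forecast weights ξ = φ_T Γ^h towards δ can be demonstrated numerically for a two-state chain; the transition probabilities and the vector φ_T are illustrative:

```python
# Forecast weights xi = phi_T Gamma^h approaching the stationary
# distribution delta as h grows (two-state chain; values illustrative).
g1, g2 = 0.1, 0.3
Gamma = [[1 - g1, g1], [g2, 1 - g2]]
delta = [g2 / (g1 + g2), g1 / (g1 + g2)]
phi_T = [0.95, 0.05]          # a normalized forward vector at time T

def step(v, A):
    return [sum(v[r] * A[r][j] for r in range(2)) for j in range(2)]

xi = phi_T
for _ in range(200):           # h = 200 steps ahead
    xi = step(xi, Gamma)
```

After 200 steps the geometric convergence factor |1 − g1 − g2|^h = 0.6^200 is negligible, so ξ is numerically indistinguishable from δ.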


Problems for Chapter 6

Solution to Problem 6.1
(a) As X is a continuous-valued random variable, its distribution function (d.f.) F(x) is strictly increasing on the support of X. It therefore has an inverse, F^{−1}. Secondly, as F(x) ∈ [0, 1], the random variable U = F(X) has support [0, 1], and its d.f. is given by

F_U(u) = P(U ≤ u) = P(F(X) ≤ u) = P(X ≤ F^{−1}(u)) = F(F^{−1}(u)) = u   for u ∈ [0, 1].

This is the d.f. of a U(0, 1) random variable.

(b) Let U ∼ U(0, 1) and X = F^{−1}(U), where F is the d.f. of some continuous-valued random variable. Then the d.f. of X can be written as

F_X(x) = P(X ≤ x) = P(F^{−1}(U) ≤ x) = P(U ≤ F(x)) = F(x).

The last step follows from the fact that U is uniformly distributed. Thus the d.f. of X is given by F.

(c) (i)

F(x) = ∫_0^x λ e^{−λt} dt = 1 − e^{−λx},   x ≥ 0.

Setting u = 1 − e^{−λx}, one has that x = −(1/λ) log(1 − u), and thus F^{−1}(u) = −(1/λ) log(1 − u).

(ii) The commands needed to do this are lambda