Modeling seasonality of influenza with Hidden Markov Models

Al Ozonoff, Suporn Sukpraprut, Paola Sebastiani
Boston University School of Public Health, Department of Biostatistics

Abstract

Surveillance of respiratory disease requires a baseline model for the expected case load. This baseline follows a characteristically seasonal pattern, with frequent annual increases in the traditional “flu season” months of September through April. After baseline modeling, anomaly detection methods are used to determine the public health significance of variable incidence above baseline. Thus improved baseline models should yield improved results for anomaly detection. Typical approaches to modeling seasonal baseline use cyclic regression models (also known as Serfling’s method) or variations on this approach. We propose an alternative approach, based on Hidden Markov models (HMMs). We demonstrate this approach on pneumonia and influenza (P&I) mortality data available from the Centers for Disease Control and Prevention (CDC), and compare the performance of HMMs to Serfling’s method and related models.

Keywords: Surveillance, influenza, seasonality, Hidden Markov models.

1 Introduction

Syndromic surveillance, as defined by the CDC, is the practice of “...using health-related data that precede diagnosis and signal a sufficient probability of a case or an outbreak to warrant further public health response.” Thus syndromic surveillance includes the prospective monitoring of pre-diagnostic health data, such as non-specific respiratory disease that could signal a large public health event such as epidemic influenza or a bioterrorist attack. The prospective surveillance paradigm is first to determine what constitutes normal behavior; then to be vigilant for deviations from the norm and act or investigate accordingly.

Surveillance data typically consist of time series counts of incident disease cases, aggregated weekly, daily, or more frequently. In this context, establishing normal behavior requires a baseline stochastic time series model from which residuals can be used as the basis for an anomaly detection scheme. Thus improved baseline modeling, and subsequent reduction of residual errors, should lead to improved event detection.

2 Statistical models for influenza

2.1 Serfling’s method

Since the mid-1960s, the Centers for Disease Control and Prevention (CDC) influenza surveillance programs have used a cyclic regression model to determine epidemic influenza activity and excess mortality attributed to influenza. The original model formulation is due to Serfling, and thus the method for determining the epidemic threshold bears his name. As currently implemented by CDC, Serfling’s method uses five years of weekly influenza mortality data to fit a periodic regression model containing terms for intercept, linear trend, and a pair of harmonic terms to capture the underlying sinusoidal behavior of seasonal influenza. Thus Serfling’s model is formulated as a simple linear regression:

$$y_t = \alpha_0 + \alpha_1 t + \beta_1 \sin\left(\frac{2\pi t}{52}\right) + \beta_2 \cos\left(\frac{2\pi t}{52}\right) + \epsilon_t \qquad (1)$$

where $y_t$ denotes observed weekly pneumonia and influenza (P&I) mortality in the U.S. at week $t$ of the five year period; $\alpha_i$ and $\beta_j$ are regression coefficients to be estimated; and $\epsilon_t$ is a normally distributed error term. Since the model estimates the non-epidemic seasonal baseline, weeks that typically show high influenza activity are excluded to avoid biased parameter estimates. Current CDC practice is to exclude the months of October through April.

After parameters of the model have been estimated, the upper limit of a confidence band around the sinusoid is used to determine the epidemic threshold for that time period. CDC uses this model prospectively, meaning each week the model can be refitted and a new threshold determined. This provides a continually updated determination of the state of epidemic influenza in the U.S. which is timely up to a two to three week reporting lag. Details of the data used by CDC for Serfling’s method are described below in section 3.

2.2 Auto-regressive models

Time series of infectious disease surveillance data, and influenza data in particular, often exhibit high degrees of serial autocorrelation. The family of auto-regressive integrated moving average (ARIMA) models accounts for this auto-correlation explicitly, and these models have been used successfully in a variety of surveillance contexts. For this study we have chosen a simplified second order periodic auto-regression (PAR) model, which incorporates an additional auto-regressive component into Serfling’s model:

$$y_t = \alpha_0 + \alpha_1 t + \beta_1 \sin\left(\frac{2\pi t}{52}\right) + \beta_2 \cos\left(\frac{2\pi t}{52}\right) + \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \epsilon_t \qquad (2)$$

where $\gamma_i$ are the auto-regressive parameters to be estimated simultaneously with the other regression coefficients.
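As an illustration of the cyclic regression in equation (1), the following Python sketch fits the four regression coefficients by ordinary least squares on non-epidemic weeks only and then derives an upper threshold. It is a minimal sketch under stated assumptions, not CDC's implementation: the function names `serfling_design` and `fit_serfling`, the synthetic data, the choice of baseline weeks, and the multiplier `z = 1.645` are all hypothetical, and the threshold is simplified to the fitted baseline plus a constant multiple of the residual standard deviation rather than a full confidence band.

```python
import numpy as np


def serfling_design(t, period=52.0):
    """Design matrix for equation (1): intercept, trend, one harmonic pair."""
    t = np.asarray(t, dtype=float)
    w = 2.0 * np.pi * t / period
    return np.column_stack([np.ones_like(t), t, np.sin(w), np.cos(w)])


def fit_serfling(t, y, baseline_mask, z=1.645):
    """Least-squares fit on baseline (non-epidemic) weeks only.

    Returns the coefficients, the fitted baseline for every week, and a
    simplified upper threshold (baseline + z * residual standard deviation).
    """
    y = np.asarray(y, dtype=float)
    X = serfling_design(t)
    Xb, yb = X[baseline_mask], y[baseline_mask]
    coef, *_ = np.linalg.lstsq(Xb, yb, rcond=None)
    resid = yb - Xb @ coef
    dof = Xb.shape[0] - Xb.shape[1]
    sigma = np.sqrt(resid @ resid / dof)
    baseline = X @ coef
    return coef, baseline, baseline + z * sigma


# Synthetic five-year weekly series (hypothetical values, not CDC data).
rng = np.random.default_rng(0)
t = np.arange(1, 261)
true_coef = np.array([700.0, 0.1, 40.0, 60.0])
y = serfling_design(t) @ true_coef + rng.normal(0.0, 10.0, t.size)

# Hypothetical baseline weeks: a mid-year block, assuming week 1 falls in January.
week_of_year = (t - 1) % 52
baseline_mask = (week_of_year >= 17) & (week_of_year < 39)

coef, baseline, threshold = fit_serfling(t, y, baseline_mask)
```

Weeks whose observed count exceeds `threshold` would be flagged as above the epidemic threshold in this simplified setting. Refitting each week on the most recent five years would mimic the prospective use described in section 2.1.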

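The periodic auto-regression of equation (2) can be fitted in the same least-squares framework by appending the two lagged observations to the Serfling design matrix. The sketch below is illustrative only: `par_design` and `residual_sum_of_squares` are hypothetical helper names, the series is simulated with made-up coefficients, and ordinary least squares is an assumed estimation method that the text does not specify.

```python
import numpy as np


def par_design(t, y, period=52.0):
    """Design matrix and target for equation (2), using rows t = 3..n.

    The first two weeks are dropped because their lagged values are missing.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 2.0 * np.pi * t[2:] / period
    X = np.column_stack([
        np.ones(t.size - 2),
        t[2:],
        np.sin(w),
        np.cos(w),
        y[1:-1],  # y_{t-1}
        y[:-2],   # y_{t-2}
    ])
    return X, y[2:]


def residual_sum_of_squares(X, target):
    """Ordinary least-squares fit; returns coefficients and the RSS."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    return coef, float(resid @ resid)


# Simulated six-year weekly series with second-order autocorrelation
# (hypothetical coefficients, not estimates from P&I data).
rng = np.random.default_rng(1)
n = 312
t = np.arange(1, n + 1)
forcing = (300.0 + 0.05 * t
           + 20.0 * np.sin(2.0 * np.pi * t / 52.0)
           + 30.0 * np.cos(2.0 * np.pi * t / 52.0))
y = np.full(n, 1000.0)
for i in range(2, n):
    y[i] = forcing[i] + 0.5 * y[i - 1] + 0.2 * y[i - 2] + rng.normal(0.0, 10.0)

X, target = par_design(t, y)
coef_par, rss_par = residual_sum_of_squares(X, target)
# Serfling-only fit on the same rows, for comparison.
coef_serfling, rss_serfling = residual_sum_of_squares(X[:, :4], target)
```

Because the Serfling columns are a subset of the PAR columns, the PAR residual sum of squares on the same rows can never exceed the Serfling one. The size of the gap indicates how much serial correlation the harmonic terms leave unexplained.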
2.3 Hidden Markov models

Intuitively: Hidden Markov models (HMMs) use a hidden (latent, unobserved) discrete random variable $H_t$, the state of the system at time $t$. Observed variables are then modeled, conditional upon the hidden state. Thus, knowledge of the state implies knowledge of the stochastic distribution of the observed random variable. HMMs are constrained by the Markov property: the conditional probability of a state change (also called the transition probability) depends only upon the value of the latent state at the previous time point. Thus we can specify the HMM with $k$ underlying states, using a $k \times k$ matrix of transition probabilities together with probability distributions of the data conditional on the hidden state.

HMMs allow both likelihood and Bayesian approaches to parameter estimation. We have chosen Bayesian Inference Using Gibbs Sampling (BUGS), with open-source implementation in the statistical system R via the OpenBUGS project.

We considered two HMM model formulations. The first model uses a single parameter $\alpha_e$ to reflect a shift in mean upon transition to an epidemic state:

$$H_t = 0: \quad Y_t \sim \alpha_0 + \alpha_1 t + \beta_1 \sin\left(\frac{2\pi t}{52}\right) + \beta_2 \cos\left(\frac{2\pi t}{52}\right) + \epsilon_t \qquad (3)$$

$$H_t = 1: \quad Y_t \sim (\alpha_0 + \alpha_e) + \alpha_1 t + \beta_1 \sin\left(\frac{2\pi t}{52}\right) + \beta_2 \cos\left(\frac{2\pi t}{52}\right) + \epsilon_t \qquad (4)$$

where $H_t$ are indicator variables for epidemic state at time $t$, $Y_t$ are the observed data, and the remaining parameters are identical to those in Serfling’s model.

The second formulation incorporates a second-order autoregressive term into the epidemic state model, while the baseline (non-epidemic state) model remains the same.

$$H_t = 1: \quad Y_t \sim (\alpha_0 + \alpha_e) + \alpha_1 t + \beta_1 \sin\left(\frac{2\pi t}{52}\right) + \beta_2 \cos\left(\frac{2\pi t}{52}\right) + \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \epsilon_t \qquad (5)$$
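To make the mean-shift formulation of equations (3) and (4) concrete, the following Python sketch runs the standard scaled forward recursion for a two-state HMM with Gaussian observation noise. It is not the Gibbs-sampling estimation used in this study: all parameters (baseline coefficients, `alpha_e`, the noise standard deviation, the transition matrix) are fixed at hypothetical values rather than estimated, and the names `gaussian_pdf` and `forward_filter` are illustrative.

```python
import numpy as np


def gaussian_pdf(y, mean, sd):
    """Normal density, evaluated elementwise."""
    z = (y - mean) / sd
    return np.exp(-0.5 * z * z) / (sd * np.sqrt(2.0 * np.pi))


def forward_filter(y, baseline, alpha_e, sd, trans, init):
    """Scaled forward recursion for the two-state mean-shift HMM.

    trans[i, j] is P(H_{t+1} = j | H_t = i); init is the distribution of H_1.
    Returns filtered probabilities P(H_t = k | y_1..y_t) and the log-likelihood.
    """
    y = np.asarray(y, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    means = np.column_stack([baseline, baseline + alpha_e])  # state 0, state 1
    emis = gaussian_pdf(y[:, None], means, sd)
    filt = np.empty((y.size, 2))
    loglik = 0.0
    pred = np.asarray(init, dtype=float)
    for i in range(y.size):
        joint = pred * emis[i]      # P(H_t = k, y_t | y_1..y_{t-1})
        norm = joint.sum()          # P(y_t | y_1..y_{t-1})
        filt[i] = joint / norm
        loglik += np.log(norm)
        pred = filt[i] @ trans      # one-step-ahead state distribution
    return filt, loglik


# Simulate 312 weeks from the model with hypothetical parameter values.
rng = np.random.default_rng(2)
n = 312
t = np.arange(1, n + 1)
baseline = (700.0 + 0.1 * t
            + 40.0 * np.sin(2.0 * np.pi * t / 52.0)
            + 60.0 * np.cos(2.0 * np.pi * t / 52.0))
trans = np.array([[0.95, 0.05],
                  [0.10, 0.90]])
states = np.zeros(n, dtype=int)
for i in range(1, n):
    states[i] = rng.choice(2, p=trans[states[i - 1]])
y = baseline + 80.0 * states + rng.normal(0.0, 10.0, n)

filt, loglik = forward_filter(y, baseline, 80.0, 10.0, trans, init=[0.5, 0.5])
epidemic = filt[:, 1] > 0.5  # weeks classified as epidemic
```

In a full analysis the parameters would be sampled or optimized rather than fixed. The same recursion supplies the likelihood needed for a maximum-likelihood fit, which is one of the two estimation routes mentioned above.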

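The models are compared on root mean squared error in sections 4 and 5, and a short Python sketch shows how that comparison can be computed. It uses the conventional definition of RMSE, in which squared errors are averaged over the weeks before the square root is taken; that normalization, like the function names `rmse` and `relative_reduction`, is an assumption of this sketch. The four RMSE values are those reported in section 5.

```python
import numpy as np


def rmse(y, y_hat):
    """Root mean squared error between observed and fitted values."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def relative_reduction(reference, value):
    """Percentage reduction of `value` relative to `reference`."""
    return 100.0 * (reference - value) / reference


# RMSE values reported in section 5 for the 312-week index period.
reported = {
    "Serfling": 83.3,
    "PAR": 72.0,
    "mean-shift HMM": 63.7,
    "AR-enhanced HMM": 60.4,
}

for name, value in reported.items():
    change = relative_reduction(reported["Serfling"], value)
    print(f"{name}: RMSE = {value:.1f}, reduction vs Serfling = {change:.1f}%")
```

Given fitted values from any of the four models, `rmse(y, y_hat)` yields the quantity being compared, and `relative_reduction` expresses each model's error relative to the Serfling baseline.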
3 Data

One component of influenza surveillance in the U.S. as performed by CDC is the 122 Cities Mortality Program. This program has been operated continuously by CDC since 1962 and provides surveillance coverage for roughly 25% of the U.S. population. P&I mortality is used as a proxy measure for influenza activity, and is generally accepted as the best available data in terms of completeness and similarity to actual influenza activity.

4 Evaluation

We measured model goodness-of-fit using root mean squared error (RMSE). After selecting a six year (312 week) period of influenza data, which we refer to as the index period, for each model we estimated parameters and determined fitted values for the selected weeks. Then we calculated:

$$\mathrm{RMSE} = \sqrt{\sum_{t=1}^{312} (y_t - \hat{y}_t)^2} \qquad (6)$$

where $y_t$ denotes the observed weekly count of P&I deaths for week $t$ of the index period and $\hat{y}_t$ denotes the fitted value for week $t$ from the model under evaluation. Since all 312 weeks of the index period are used to estimate parameters, performance on this metric indicates model ability to fit data retrospectively. Lower values of RMSE reflect fitted values closer to the observed data and thus better goodness-of-fit.

5 Results

For an initial investigation, we used a single index period from September 1990 to September 1996 to evaluate model fit. RMSE for each of the four model formulations (Serfling, PAR, and the two HMM models) was calculated for the 312 weeks in the index period. Serfling’s model demonstrated the poorest fit (RMSE = 83.3), while the PAR model showed a roughly 14% reduction in model error (RMSE = 72.0). Both HMMs showed a substantial improvement over Serfling in goodness-of-fit, roughly 24% and 27% (RMSE = 63.7 and 60.4 for the mean-shift HMM and the AR-enhanced HMM, respectively).

6 Discussion

Based on these preliminary results, it appears that temporal modeling of seasonal surveillance data can be substantially improved by implementing relatively straightforward methodology. Hidden Markov models are intuitively motivated, and demonstrate improvements in goodness-of-fit when applied to retrospective P&I mortality data. Further work is needed to assess prospective performance of HMMs relative to other models (e.g. Serfling or ARIMA) using one-step-ahead forecasting error as an evaluation measure. Additional efforts to model the epidemic phase of influenza activity may also yield improved results.

Acknowledgements: Research partially supported by NIAID U19 AI62627-02, administered via the Blood Center of Wisconsin pilot program.

References

Le Strat, Y., Carrat, F. (1999), “Monitoring epidemiologic surveillance data using hidden Markov models”, Statistics in Medicine, 18, 3463-3478.

Rath, T.M., Carreras, M., Sebastiani, P. (2003), “Automated detection of influenza epidemics with Hidden Markov models”, Proceedings of the 5th International Symposium on Intelligent Data Analysis, Lecture Notes in Computer Science volume 2810 (Springer).

Serfling, R.E. (1963), “Methods for current statistical analysis of excess pneumonia-influenza deaths”, Public Health Reports, 78, 494-506.