
Using Linear Regression Functions to Abstract High-Frequency Data in Medicine

Jian Li, MSc and Tze-Yun Leong, Ph.D.
Medical Computing Laboratory, School of Computing, National University of Singapore, Singapore 117543
{lijian,leongty}@comp.nus.edu.sg

Abstract

This paper investigates the problem of representing medical time series in piece-wise linear functions and proposes a novel algorithm to transform time-stamped numeric data into simple linear regression functions. We apply methods that involve the hat matrix leverage value and the studentized deleted residual to identify outliers, and a heuristic approach to remove them from the data sets. By distinguishing the breaking points from true outliers, we can efficiently break the data set with respect to the underlying patterns. Using a rough segmentation step, our approach avoids using the whole data set as input, and reduces the space requirement. The experimental results indicate that our method can achieve a more accurate representation of the underlying patterns in data sets collected in the intensive care units efficiently.

INTRODUCTION

Modern hospitals and medical centers are well equipped with data collecting devices, which provide relatively inexpensive means to collect and store data in hospital information systems. The extensive amount of data gathered in medical databases necessitates specialized tools that help health care practitioners reason effectively with the data. There have been several efforts to develop such pattern or trend recognition and processing tools. Shahar [1] proposed the Knowledge-Based Temporal Abstraction (KBTA) framework with clear semantics for both the domain-specific knowledge and the problem-solving method. Combi [2] proposed an object-oriented data model and a query language to abstract clinical data sequences. However, such systems apply only qualitative inference mechanisms to abstract raw temporal data into higher-level concepts. In data-intensive settings such as the intensive care units (ICUs), directly applying such inference mechanisms cannot faithfully reveal the underlying global features. Other intelligent medical data analysis systems, such as Haimowitz's TrendDx [3], can manipulate high-frequency data well. These systems, however, adopt computationally expensive methods to compare all the possible trends in the data with the templates. For example, in TrendDx, to match a trend template of I intervals with a data cluster of T data elements, the [...]

For example, {acute increases in PaCO2, somnolence, CO2 narcosis} is a sequential pattern describing one possible evaluation pathway for acute acidosis. Such sequential patterns can be easily defined by domain experts and queried using a conventional database query language such as SQL. However, many of the categorical events, such as {acute increases in PaCO2}, are interesting common trend patterns themselves, and such knowledge can be directly abstracted and learned from the numerical data. By transforming the numerical time series data into higher-level conceptual forms, we can make existing sequential learning algorithms directly applicable to numerical features of time series. We can also represent common trend patterns conveniently and efficiently, helping domain experts to define, query and filter out the interesting patterns discovered in time series data. We propose an efficient intermediate-level abstraction method for abstracting high-frequency data in the ICUs. Appropriate abstraction of high-frequency data can also facilitate data visualization [6][7], and archiving and retrieving of interesting common trend patterns of the underlying temporal data [8].

1067-5027/00/$5.00 © 2000 AMIA, Inc.

Our research concentrates on representing time series data in piece-wise linear segments and developing algorithms to achieve the target representations. By representing medical time series in linear segments, we can easily express trend patterns. For example, the slopes of the segments can be interpreted as increasing, decreasing, or stable trends in temporal abstraction tasks; these trends can in turn be combined to express more complex trend patterns.
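As a rough illustration of how segment slopes can be read as trends, the following minimal sketch fits a simple linear regression to each segment and labels its slope. It is not the paper's algorithm: the breakpoints are assumed to be given, and the names `fit_line`, `trend_label`, `describe_segments` and the threshold `stable_band` are hypothetical choices made for this example.

```python
def fit_line(times, values):
    """Least-squares fit of values = intercept + slope * times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return intercept, slope


def trend_label(slope, stable_band=0.05):
    """Map a slope to a qualitative trend.

    stable_band is a hypothetical, application-specific threshold:
    slopes within +/- stable_band are treated as "stable".
    """
    if slope > stable_band:
        return "increasing"
    if slope < -stable_band:
        return "decreasing"
    return "stable"


def describe_segments(times, values, breakpoints):
    """Label each segment; breakpoints are the indices where a new segment starts."""
    bounds = [0] + list(breakpoints) + [len(times)]
    labels = []
    for start, end in zip(bounds, bounds[1:]):
        _, slope = fit_line(times[start:end], values[start:end])
        labels.append(trend_label(slope))
    return labels


if __name__ == "__main__":
    times = list(range(9))
    values = [1, 2, 3, 3, 3, 3, 3, 2, 1]
    print(describe_segments(times, values, breakpoints=[3, 6]))
```

A sequence of such labels is one simple way to express a more complex trend pattern, for example a rise followed by a plateau and a fall.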

In related work, Shatkay [4] developed an algorithm using end-point interpolation to segment medical time series; however, the algorithm often ignores some meaningful patterns, and the resulting patterns are sensitive to noise. Keogh [5] tried to segment time series by minimizing the norm error, but the output is still sensitive to noise in the data. We propose a new algorithm, called outlier-based segmentation (OBS), to transform the time-sequenced numerical data into linear segments, which can in turn express certain complex patterns. Informally, OBS finds breaking points and generates a simple linear regression function for each subsequence between the breaking points. To determine whether a data point is a breaking point, we apply the following heuristic:

* Heuristic 1: For two consecutive segments, the points on the second segment can be regarded as outliers to the first [...]

where p is the number of regression parameters in the regression function, including the intercept term, and n is the number of observations. For the simple linear regression model, p is 2. The SDR is defined to be:

$$ SDR_i = e_i \left[ \frac{n - p - 1}{SSE\,(1 - h_{ii}) - e_i^2} \right]^{1/2} $$

where SSE stands for the Sum of Squared Errors and $e_i = Y_i - \hat{Y}_i$. Since the SDR follows the t distribution with n - p - 1 degrees of freedom, we can conduct a formal test to determine whether an observation is outlying with respect to its Y value. For more details about the hat matrix leverage value and the SDR, please see [9].
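The leverage value and the SDR can be computed directly for a simple linear regression. The sketch below is an illustration only, not the authors' implementation: the function name `leverage_and_sdr` is hypothetical, and the leverage uses the standard closed form for simple linear regression, h_ii = 1/n + (x_i - mean(x))^2 / S_xx, which is assumed here rather than quoted from the text.

```python
import math


def leverage_and_sdr(x, y):
    """Return (leverages, sdrs) for the fit y = b0 + b1 * x.

    p = 2 because a simple linear regression has an intercept and a slope.
    Needs n > p + 1 observations and at least two distinct x values.
    """
    n = len(x)
    p = 2
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x

    # Ordinary residuals e_i = Y_i - Yhat_i and their sum of squares (SSE).
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in residuals)

    # Diagonal of the hat matrix (assumed closed form for one predictor).
    leverages = [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]

    sdrs = []
    for e, h in zip(residuals, leverages):
        denom = sse * (1.0 - h) - e * e
        if denom <= 0.0:
            # The remaining points fit a line exactly (or rounding error):
            # the SDR is unbounded unless the residual itself is zero.
            sdrs.append(0.0 if e == 0.0 else math.copysign(math.inf, e))
        else:
            sdrs.append(e * math.sqrt((n - p - 1) / denom))
    return leverages, sdrs


if __name__ == "__main__":
    x = list(range(10))
    noise = [0.1, -0.1, 0.05, -0.05, 0.1, 0.0, -0.1, 0.05, -0.05, 0.1]
    y = [2 * xi + ni for xi, ni in zip(x, noise)]
    y[5] += 10.0  # inject one outlying observation
    leverages, sdrs = leverage_and_sdr(x, y)
    for xi, h, t in zip(x, leverages, sdrs):
        print(xi, round(h, 3), round(t, 2))
```

In a formal test, each |SDR| would be compared against a critical value of the t distribution with n - p - 1 degrees of freedom; the choice of significance level is left open here.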

OBS can be decomposed into four subproblems:

1. Rough Segmenting. OBS first finds a segment with standard error $\varphi$, which can be computed using all the data in the subsequence, called the offset error $\varphi = (\varepsilon + \delta)$, where $\varepsilon$, the tolerance error, is an acceptable error specified by the user for a specific application, and $\delta$ is a small positive real value such that $\varphi$ is slightly larger than $\varepsilon$. We propose an algorithm similar to the Newton iteration method to search for such a segment. We can initialize this process with some arbitrary length of time series data, called the initial size, and halt searching when $\varphi$ is between $\varepsilon$ and $\varepsilon$ + [...], namely $\varepsilon$ [...]