
Ontology-Centered Syndromic Surveillance for Bioterrorism

IEEE Intelligent Systems, vol. 20, no. 5, September/October 2005


Monica Crubézy, Martin O’Connor, Zachary Pincus, and Mark A. Musen, Stanford University
David L. Buckeridge, Stanford University and VA Palo Alto Health Care System

In recent years, public health surveillance has become a priority, driven by concerns of possible bioterrorist attacks and disease outbreaks. Authorities argue that syndromic surveillance, or the monitoring of prediagnostic health-related data for early detection of nascent outbreaks, is crucial to preventing massive illness and death.1

Syndromic surveillance could prevent widespread illness and death, but public-health analysts face many technical barriers. BioSTORM can help them by supporting ontology-based data integration and problem-solver deployment.


Rapid outbreak detection is important following a bioterrorist attack because public health interventions generally are more effective if applied early in the course of an outbreak. With an anthrax attack, for example, a delay of even hours in administering chemoprophylaxis can substantially lessen chances for survival. In this case, a surveillance system should sound an early warning by detecting an abnormal increase in clinic visits or pharmaceutical purchases before the first clinical diagnoses are made.

The need for improved surveillance and the increasing availability of electronic data have resulted in a blossoming of surveillance-system development.2 Electronic data’s availability in the public health information infrastructure presents both tremendous opportunities and considerable challenges for surveillance (see the sidebar "Surveillance Systems and Public Health: Applications and Issues"). Analysts can improve public health decision making by extracting more information from a growing set of data sources. However, they face technical barriers to incorporating physically heterogeneous data sources into surveillance systems and, more important, integrating these disparate data sources in a way that offers semantic coherence and improves decision making. Furthermore, analysts face the challenge of making sense of high-dimensional, noisy data to identify patterns of disease outbreaks.

To meet syndromic surveillance’s complex operational and research needs, and as part of DARPA’s national biosurveillance technology program, we’ve developed BioSTORM (the Biological Spatio-Temporal Outbreak Reasoning Module). BioSTORM is an experimental end-to-end computational framework that integrates disparate data sources and deploys various analytic problem solvers to support public health analysts in interpreting surveillance data and identifying disease outbreaks.

Ontologies for integrating and analyzing surveillance data

Central to our approach is the use of ontologies to model and annotate syndromic surveillance information and knowledge. Ontologies are computer-stored specifications of concepts, properties, and relationships that are important for describing an area of expertise. They provide principled, structured, and queryable frameworks for modeling knowledge and encoding data semantics independently of their low-level representation. In recent years, the biomedical community has increasingly been building information systems that rely on ontologies, such as UMLS (Unified Medical Language System) and SNOMED (Systematized Nomenclature of Medicine). In the BioSTORM system, we used ontologies to describe characteristics, types, and relationships of data emanating from different sources,3 as well as to specify problem-solving-method repositories.4

We’ve been working for nearly two decades to build intelligent systems that rely on ontologies to model and operationalize their different components. Our research produced Protégé, a mature methodology and software tool for building ontology-centered, component-based systems (see the sidebar "Protégé: A Framework for Building Ontology-Based Intelligent Systems"). Protégé provides the basis for all ontologies underlying the BioSTORM system’s components.

Adopting an ontology-centered framework for integrating and analyzing surveillance data provides a durable foundation for principled, reproducible, and scalable implementation and evaluation of public health surveillance approaches. Figure 1 shows the BioSTORM system’s four main components:

• a data-source ontology for describing and unifying the disparate semantics of data sources and data streams;
• a library of statistical and knowledge-based problem solvers for analyzing syndromic surveillance data;
• a mediation component that includes a data broker to integrate multiple, related data sources that the data-source ontology describes and a mapping interpreter to connect the integrated data to those problem solvers that can best analyze them; and
• a controller that identifies and deploys configurations of problem solvers to analyze incoming data streams.

Figure 1. Overview of BioSTORM’s architecture and components. The deployment controller orchestrates problem solvers’ deployment and the flow of data to them, using a blackboard mechanism. Incoming data streams pass through the data broker and mapping interpreter to a set of problem solvers. The data broker and mapping interpreter use the data-source and mapping ontologies to construct semantically uniform data streams. The controller uses the surveillance problem-solving ontology to configure problem solver sets into analytic strategies that operate on those data streams.

Our solution’s novelties are

• the central use of ontologies, which underlie each component of our surveillance architecture and let the system encompass the rich semantics of both incoming surveillance data and analytic problem solvers; and
• the companion use of a two-tier mediation component, which lets the system reconcile surveillance data’s heterogeneous semantics with those of various reusable analytic problem solvers.

As a result, our approach leads to the meaningful configuration and flexible deployment of knowledge-level surveillance-analysis strategies.

We used BioSTORM to perform a variety of analyses on several data sources. We used dispatch data from the San Francisco 911 emergency call center to determine whether we could use patterns in those data to detect large-scale influenza-like outbreaks (see the sidebar "Surveillance of San Francisco 911 Emergency Dispatch Data"). We analyzed emergency-room dispatch data from the Palo Alto Veterans Administration Medical Center for the same purpose. Finally, we analyzed emergency room respiratory records from hospitals in Norfolk, Virginia, primarily to demonstrate our system’s ability to deploy multiple distinct analytic strategies in parallel on large data sets.

A data-source ontology for describing and contextualizing data streams

Public health surveillance data are diverse and usually distributed in various databases and files with little common semantic or syntactic structure. To enable analyses using disparate data sources, we must precisely specify how to characterize and combine different data sources and types. To apply appropriate analytical methods to relevant outbreak-detection data, we must relate the data sources to one another, to descriptions of reportable conditions, and to enumerations of the primitive data on which we can make diagnoses of those conditions. Ontologies best model this knowledge.

We developed a data-source ontology3 that lets us describe extremely diverse data coherently. In particular, we can customize our ontology for a particular data-source domain, such as syndromic surveillance. The ontology further supports the BioSTORM components in reasoning with and processing the data.

Our data-source ontology aims to make data self-descriptive by associating a structured, multilevel context with each potential data source. A developer describes the context for a data source by filling in a template with details about it, at the levels of the data source, groups of related data, and atomic data elements (see figure 2a). Our data-source ontology, customized for syndromic surveillance, provides a taxonomy of data-source attributes to describe this context (see figure 2b).

Developers describe individual data elements with metadata terms from the Logical Observation Identifiers Names and Codes (LOINC) standard. The LOINC approach describes a piece of data along five major semantic axes, including "kind-of-property," "time aspect," and "scale," suitable for contextualizing results from clinical laboratories. We’ve generalized the LOINC axes from this role into a generic set of descriptors for contextualizing many types of syndromic surveillance data. Our five axes are

1. What is being measured? (for example, "Robitussin sales")
2. How is it measured? (for example, "cases sold per day")
3. When/for how long is it measured? (for example, "averaged over a week")
4. Where is it measured? (for example, "pharmacies in the 94305 ZIP code")
5. What are the possible values? (for example, "numeric count")
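To make the five descriptor axes concrete, here is a minimal Java sketch that captures them as a plain value object; the class and field names are ours for illustration, not the attribute names the data-source ontology actually uses, which live as Protégé frames rather than Java code.

// Hypothetical sketch of the five generalized LOINC-style descriptor axes.
// Class and field names are illustrative; BioSTORM stores these descriptors
// as frames in its Protégé data-source ontology rather than as Java objects.
public final class DatumContext {
    final String what;   // what is being measured, e.g. "Robitussin sales"
    final String how;    // how it is measured, e.g. "cases sold per day"
    final String when;   // temporal aspect, e.g. "averaged over a week"
    final String where;  // spatial aspect, e.g. "pharmacies in the 94305 ZIP code"
    final String scale;  // space of possible values, e.g. "numeric count"

    DatumContext(String what, String how, String when, String where, String scale) {
        this.what = what; this.how = how; this.when = when;
        this.where = where; this.scale = scale;
    }

    @Override public String toString() {
        return what + " | " + how + " | " + when + " | " + where + " | " + scale;
    }

    public static void main(String[] args) {
        DatumContext coughSyrupSales = new DatumContext(
            "Robitussin sales", "cases sold per day", "averaged over a week",
            "pharmacies in the 94305 ZIP code", "numeric count");
        System.out.println(coughSyrupSales);
    }
}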




Figure 2. Template data-source ontology for contextualizing data, customized for syndromic surveillance. (a) Data values are associated with metadata describing the data and relevant context. Arrows indicate one-to-one and many-to-one relationships between concepts. (b) Highlights show additions to our Protégé-developed template data-source ontology that are specific to syndromic surveillance. The left snapshot shows the template’s structure with our added context classes. The right snapshot shows the top levels of the metadata attributes taxonomy, expanded with our vocabulary of "Measurable Properties" used to describe data elements as LOINC objects.

Our data-source ontology’s systematic, template-directed process lets developers create a customized local model of each surveillance data source. Each local model shares a common structure, space of attributes, and set of possible attribute values. Without discarding each data source’s specificities, our data-source ontology represents each in a semantically uniform fashion. More precisely, the ontology provides a hybrid approach to data integration, in that it combines the semantic rigor of a global, shared ontology with the flexibility and level of detail that come from devising customized, local ontologies for each data source. Most important, the ontology provides an abstract, metadata-rich view of data sources that’s unconcerned with the way data are stored (such as tab-delimited files and XML documents). Such metadata thus support integrating heterogeneous surveillance data at the level of semantic reconciliation, allowing uniform application of analytic problem solvers to each data source.

The ontology describes generic data sources; its instances describe the specific data fields of particular data sources. For example, from the 911 dispatch data, VA patient data, and data related to reportable diseases, the data-source ontology captures individual-level primitive data (such as signs, symptoms, and laboratory tests) as well as observable population-level data (such as aggregated syndrome counts and school absenteeism). Describing the 911 and the VA data sources merely required completing our data-source ontology’s templates by selecting appropriate properties for describing the data types emanating from each data source.
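As a rough illustration of what filling in such a template might look like, the following sketch records the three context levels (data source, data group, datum) for the 911 dispatch source as plain Java maps; the keys and values are hypothetical and only loosely echo the context classes shown in figure 2, whereas the real descriptions are stored as instances of the Protégé-based data-source ontology.

import java.util.Map;

// Illustrative sketch of the three-level context template (data source,
// data group, datum) filled in for the 911 dispatch source. Keys and
// values are hypothetical stand-ins for ontology properties.
public final class DataSourceTemplateExample {
    public static void main(String[] args) {
        Map<String, String> sourceContext = Map.of(
            "sourceType", "Emergency911CallCenterContext",
            "coverage", "City and County of San Francisco");

        Map<String, String> groupContext = Map.of(
            "grouping", "one record per dispatched call",
            "fields", "date, location, dispatch code");

        Map<String, String> datumContext = Map.of(
            "measurableProperty", "IndividualLevelHumanData",
            "kindOfProperty", "UnitlessProperty",
            "scale", "Nominal");

        System.out.println("Source: " + sourceContext);
        System.out.println("Group:  " + groupContext);
        System.out.println("Datum:  " + datumContext);
    }
}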

A library of problem-solving methods for analyzing surveillance data

Next-generation surveillance systems require various analytic methods, ranging from traditional statistical techniques operating on low-level data (such as raw disease counts) to knowledge-based approaches capable of reasoning about qualitative data and detecting unusual patterns. Additionally, systems must be capable of making correlations among different kinds of data and must be able to aggregate and abstract data into information about populations, spatial regions, and temporal intervals. As a result, it’s best to view analyzing surveillance data as a problem-solving task addressed with a set of problem-solving methods carefully configured to the case at hand.

To this end, BioSTORM has a computational methods library that can analyze multiple, varying data types to detect abnormalities.5 BioSTORM’s library includes both generic, disease-independent statistical methods that analyze data as single or multiple time series and knowledge-based methods that relate detected abnormalities to knowledge about reportable diseases.

Surveillance Systems and Public Health: Applications and Issues

Most recent surveillance systems use electronic data and statistical-analysis methods. In general, they focus on interpreting noisy, prediagnostic data sources, such as admission codes from emergency room visits, reports of drug sales and absenteeism, and calls to medical-advice personnel. For example, the Real-Time Outbreak Detection System1 allows automated transmission and analysis of admission codes and other data from hospital information systems at many Pittsburgh emergency rooms. The Electronic Surveillance System for the Early Notification of Community-based Epidemics2 monitors disease codes assigned for outpatient visits by military personnel and their dependents across the US and throughout the world. More recently, the US Centers for Disease Control and Prevention began developing the BioSense system, which monitors data from many sources, including US Department of Defense and Veterans Affairs facilities and over-the-counter pharmaceutical sales. By summer 2003, public health authorities had deployed more than 100 different surveillance systems in the US, all relying on electronic data to rapidly detect disease outbreaks.3

In most situations, electronic surveillance data are not collected for the express purpose of monitoring public health. Recently deployed surveillance systems tend to rely on data collected for administrative and business purposes. For example, many systems follow records collected to enable billing, or pharmaceutical sales records collected for inventory and marketing purposes. Because these data sources aren’t collected with surveillance in mind, they’re often biased. Additionally, because public health agencies don’t control data collection, data rarely conform to a standard format.

When incorporating data sources into a surveillance system, developers must reconcile differences in representing biosurveillance concepts. Semantic reconciliation is especially important so analyses across data sources can integrate conceptually diverse data and reason about them in a consistent manner. Unfortunately, existing surveillance systems don’t have comprehensive data-integration strategies; instead, they’re developed specifically to operate using the limited data sources available at the system’s implementation. These systems are therefore extremely limited by the number of usable data sources and by the significant effort required to incorporate novel data streams.

Surveillance systems also have complex operational and research requirements. The straightforward time-series algorithms analysts use to summarize traditional surveillance data aren’t suitable for integrating the complex data that modern surveillance systems process. Surveillance data are rich in spatial and temporal measurements, which analysts must aggregate and interpret meaningfully. Thus, the high dimensionality, heterogeneity, and unpredictable nature of both data and disease-outbreak patterns require that systems have a range of analytic methods that can make sense out of surveillance data in numerous situations.

Surveillance systems require a variety of methods to process large volumes of data, identify weak signals, analyze multiple indicators, and account for spatial structure. Moreover, these systems must use these different methods together in potentially complex configurations to provide appropriate results. Analyzing surveillance data thus becomes a problem-solving process that involves a set of methods suitably chosen and tailored to each situation at hand. Rather than implementing a specific, ad hoc approach, surveillance systems must provide an infrastructure for applying different analytic strategies to incoming data streams.

The current approach to incorporating algorithms into surveillance systems, however, is to hand-tune a small set of analytic methods to one or two data sources. This solution is inadequate for assembling and evaluating different analytic strategies without substantial reprogramming. In particular, the approach makes it difficult to add new data types to the system and to experiment with and configure analytic methods to help public health officials in their outbreak surveillance tasks. Next-generation surveillance systems should accommodate surveillance data’s high dimensionality and semantic richness, provide a means to reconcile their semantic heterogeneity, and support the configuration of a range of analytic methods necessary for data interpretation.

References

1. F.C. Tsui et al., "Technical Description of RODS: A Real-Time Public Health Surveillance System," J. Am. Medical Informatics Assoc., vol. 10, no. 5, 2003, pp. 399–408.
2. J. Lombardo et al., "A Systems Overview of the Electronic Surveillance System for the Early Notification of Community-Based Epidemics (ESSENCE II)," J. Urban Health, vol. 80, no. 2, supplement 1, 2003, pp. i32–i42.
3. J.W. Buehler et al., "Syndromic Surveillance and Bioterrorism-Related Epidemics," Emerging Infectious Diseases, vol. 9, no. 10, 2003, pp. 1197–1204.

We implement relatively straightforward methods as simple software routines, while we incorporate sophisticated methods into the system by "wrapping" existing software libraries to conform to our surveillance problem-solving ontology’s requirements.

Within the context of syndromic surveillance, a system must model each method’s performance characteristics, data requirements, and assumptions. We categorize our library’s methods by the tasks they perform, the data types they can operate on, and the signal types they can detect. For example, a cumulative sum method signals an abnormality in a single temporal data stream and is well suited to detecting gradually increasing signals. Making such knowledge explicit both facilitates system modification to enhance method portability and reuse and helps public health professionals understand the system.
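To give a flavor of the kind of simple statistical routine the library wraps, the following is a minimal one-sided cumulative sum sketch; the counts, baseline mean, allowance, and threshold are invented for illustration and are not the parameters or implementation that BioSTORM’s library uses.

// Minimal one-sided CUSUM sketch for a single daily count series.
// Parameter values (baseline mean, allowance k, threshold h) are illustrative.
public final class CusumSketch {
    public static void main(String[] args) {
        double[] dailyCounts = {4, 5, 3, 4, 6, 5, 7, 9, 11, 14}; // hypothetical data
        double mean = 5.0;  // expected count under normal conditions (assumed known)
        double k = 1.0;     // allowance: ignore deviations smaller than k
        double h = 6.0;     // decision threshold for signaling an abnormality

        double cusum = 0.0;
        for (int day = 0; day < dailyCounts.length; day++) {
            // Accumulate only positive deviations beyond the allowance;
            // a gradual upward drift therefore builds up over several days.
            cusum = Math.max(0.0, cusum + (dailyCounts[day] - mean - k));
            if (cusum > h) {
                System.out.println("Abnormality signaled on day " + day
                        + " (CUSUM = " + cusum + ")");
            }
        }
    }
}

Because only positive deviations beyond the allowance accumulate, a modest but sustained increase eventually crosses the threshold even though no single day looks alarming on its own.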


Protégé: A Framework for Building Ontology-Based Intelligent Systems

Protégé is a methodology for building knowledge-based systems from three classes of reusable components:1

• domain ontologies, or models of the concepts in an application area and relations between those concepts;2
• associated knowledge bases containing domain facts; and
• problem-solving methods, or algorithms that apply generic reasoning patterns to domain knowledge.3

Protégé (http://protege.stanford.edu) is a free, open source ontology development framework that gives a growing user community a tool suite to construct domain models and knowledge-based applications. Protégé implements a knowledge model compatible with the Open Knowledge Base Connectivity protocol,4 designed for interoperability among frame-based systems. In a frame-based modeling representation, an ontology consists of a set of classes organized in a subsumption hierarchy to represent a domain’s salient concepts, the properties—or slots—associated with each concept, and a set of instances of those classes—individual exemplars of the concepts that hold specific values for their properties. Protégé also supports other formalisms for representing knowledge bases, such as the Semantic Web languages RDF (www.w3.org/RDF) and OWL (www.w3.org/2001/sw/WebOnt).

Protégé provides both a wide set of user interface elements for knowledge modeling and entry and the capability to include custom-designed plug-in elements as application extensions.5,6 Protégé is not only a robust, user-friendly environment for building ontologies; it’s also a full-fledged server that can provide knowledge encoded in ontologies to any piece of software invoking it.

In the BioSTORM system, Protégé provides a knowledge-based framework for our overall methodology and specific knowledge authoring and integration tools. In fact, most of BioSTORM’s components rely on Protégé to supply and manage their ontologies. We developed custom knowledge-entry tools for annotating data sources in our data-source ontology and for describing and organizing our library’s analytical methods according to our surveillance problem-solving ontology. Furthermore, we developed a specific user interface plug-in to Protégé that lets users declare mapping relations between particular data groups and the input specifications of particular problem-solving methods.7

References

1. J.H. Gennari et al., "The Evolution of Protégé: An Environment for Knowledge-Based Systems Development," Int’l J. Human-Computer Studies, vol. 58, no. 1, 2003, pp. 89–123.
2. T.R. Gruber, "A Translation Approach to Portable Ontology Specifications," Knowledge Acquisition, vol. 5, no. 2, 1993, pp. 199–220.
3. J. McDermott, "Preliminary Steps Toward a Taxonomy of Problem-Solving Methods," Automating Knowledge Acquisition for Expert Systems, Kluwer, 1988, pp. 225–254.
4. V.K. Chaudhri et al., "OKBC: A Programmatic Foundation for Knowledge Base Interoperability," Proc. 15th Nat’l Conf. Artificial Intelligence (AAAI 98), AAAI Press, 1998, pp. 600–607.
5. M.A. Musen et al., "Component-Based Support for Building Knowledge-Acquisition Systems," Proc. Conf. Intelligent Information Processing (IIP 2000), 16th Int’l Federation for Information Processing World Computer Congress (WCC 2000), 2000, pp. 18–22.
6. N.F. Noy et al., "Creating and Acquiring Semantic Web Contents with Protégé-2000," IEEE Intelligent Systems, vol. 16, no. 2, 2001, pp. 60–71.
7. M. Crubézy, Z. Pincus, and M.A. Musen, "Mediating Knowledge between Application Components," Proc. Semantic Integration Workshop, 2nd Int’l Semantic Web Conf. (ISWC 03), CEUR, 2003; http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-82/SI_demo_02.pdf.

Naturally, ontologies are ideally suited to providing the modeling framework necessary to categorize and annotate collections of problem solvers. Our approach was motivated by task-analysis methodologies developed in the knowledge-based systems community and was guided by the Unified Problem-Solving Method Description Language framework for modeling libraries of problem-solving methods (www.cs.vu.nl/~upml). Such a framework involves modeling the system’s top-level task and identifying a problem-solving method that can perform that task (see figure 3a). We then model the method as entailing several subtasks, and we model each subtask as solved by a problem-solving method that may, in turn, entail new subtasks. Modeling continues until one or more primitive methods can solve each subtask.
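A small sketch of this recursive task-method structure appears below; the subtask and method names are taken from figure 3a, but the Java types themselves are hypothetical stand-ins for what the ontology represents as frames.

import java.util.List;

// Rough sketch of the task-method decomposition described above: a task is
// realized by a method, and a non-primitive method entails subtasks, each
// solved in turn by another method. Names follow figure 3a; types are illustrative.
public final class TaskMethodDecomposition {
    record Task(String name, Method realizedBy) {}
    record Method(String name, List<Task> subtasks) {
        boolean isPrimitive() { return subtasks.isEmpty(); }
    }

    static void print(Task task, String indent) {
        System.out.println(indent + "Task: " + task.name());
        Method m = task.realizedBy();
        System.out.println(indent + "  Method: " + m.name()
                + (m.isPrimitive() ? " [primitive]" : ""));
        for (Task sub : m.subtasks()) print(sub, indent + "    ");
    }

    public static void main(String[] args) {
        Task makePrediction = new Task("Make prediction",
                new Method("ARIMA forecast", List.of()));
        Task makeComparison = new Task("Make comparison",
                new Method("Compute standard residual", List.of()));
        Task signalAbnormality = new Task("Signal abnormality",
                new Method("Cusum", List.of()));
        Task monitoring = new Task("Temporal monitoring",
                new Method("Standard temporal surveillance",
                        List.of(makePrediction, makeComparison, signalAbnormality)));
        print(monitoring, "");
    }
}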

This model has been helpful even when many surveillance methods are statistical rather than knowledge based. Following the task-method decomposition approach, we developed a practical, ontology-based framework for categorizing abnormality detection algorithms. We based the framework on the information contexts analysts commonly encounter in surveillance work and the algorithms’ functional requirements.5 We then created a surveillance problem-solving ontology in Protégé, which models and classifies the surveillance methods that automate each monitoring subtask (see figure 3b). With this ontology, our library becomes a computer-processable repository of problem-solving methods that BioSTORM can index, query, and invoke when analyzing surveillance data.

In our surveillance problem-solving ontology, we associated our library’s analytic methods with a method ontology that defines the classes of data and knowledge on which a given method operates (see figure 3c).4,6 For example, statistical methods in our library each have specific requirements for the structure and parameters of the statistical model on which they operate. Some operate at the population level, whereas others work at the individual level. Many algorithms expect time-series data at varying granularities, and certain algorithms require spatial data at several levels of aggregation. The method ontology makes explicit a problem-solving method’s data requirements, enabling BioSTORM to apply that method uniformly to various data sources, independent of their storage format.

Figure 3. The surveillance problem-solving ontology, modeled with Protégé. (a) A schematic of the monitoring task’s refinement into subtasks and the standard temporal surveillance method’s decomposition into further subtasks and submethods. (b) A portion of the surveillance method ontology. (c) A portion of the associated method ontology that models input and output data requirements for surveillance methods. (In the figure, ARIMA stands for autoregressive integrated moving average and EWMA for exponentially weighted moving average.)

The method ontology helps BioSTORM map data sources to appropriate problem solvers and reconcile the semantic differences between data and methods. Furthermore, the method ontology facilitates analytic methods’ interoperation by identifying appropriate interactions between methods and different data types. Overall, our framework provides a structure for incorporating surveillance algorithms into our system and establishing those methods’ data and knowledge requirements. By making each method’s characteristics explicit, the surveillance problem-solving ontology helps BioSTORM identify suitable methods for a specific subtask in the overall task decomposition.
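The following sketch conveys, in plain Java rather than ontology frames, the kind of metadata the surveillance problem-solving ontology makes explicit and how a controller might use it to shortlist methods; the enum values, record fields, and method names are illustrative only and are not BioSTORM’s actual ontology terms.

import java.util.List;

// Sketch of method metadata (task performed, input data type, signal detected)
// and of selecting candidate methods by their declared requirements.
// All names here are hypothetical stand-ins for ontology classes.
public final class MethodCatalogSketch {
    enum TaskKind { FORECAST, ABERRANCY_DETECTION, OBSERVATION }
    enum DataKind { SINGLE_TIME_SERIES, MULTIPLE_TIME_SERIES, SPATIAL_COUNTS }
    enum SignalKind { GRADUAL_INCREASE, SPIKE, SPATIAL_CLUSTER }

    record MethodDescription(String name, TaskKind task,
                             DataKind input, SignalKind detects) {}

    public static void main(String[] args) {
        List<MethodDescription> library = List.of(
            new MethodDescription("CumulativeSum", TaskKind.ABERRANCY_DETECTION,
                    DataKind.SINGLE_TIME_SERIES, SignalKind.GRADUAL_INCREASE),
            new MethodDescription("PoissonAberrancyCalculator", TaskKind.ABERRANCY_DETECTION,
                    DataKind.SPATIAL_COUNTS, SignalKind.SPATIAL_CLUSTER),
            new MethodDescription("ExpectedValueForecast", TaskKind.FORECAST,
                    DataKind.SINGLE_TIME_SERIES, SignalKind.SPIKE));

        // Select every aberrancy-detection method that can run on a single time series.
        library.stream()
               .filter(m -> m.task() == TaskKind.ABERRANCY_DETECTION
                         && m.input() == DataKind.SINGLE_TIME_SERIES)
               .forEach(m -> System.out.println("Candidate method: " + m.name()));
    }
}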


A mediation component for integrating data and problem solvers

Our data-source ontology provides a consistent mechanism for describing data sources and elements so that surveillance methods can use them concurrently. However, surveillance methods operating on these data could have different input requirements. Also, many methods used for surveillance analysis are actually generic, spatiotemporal methods that can operate on any data type, as long as the data are expressed in statistical terms that the methods expect. Keeping these methods’ data requirements generic fosters their flexible reuse in different analytic strategies.

Because each problem solver in our library adheres to the declarative method ontology, our system knows the data types that each method can process. This explicit information about our methods’ data requirements, combined with explicit information about the data-source ontology’s available data sources, allowed us to devise a uniform mechanism to mediate data from multiple sources to various methods at runtime. This mechanism involves a data broker and a mapping interpreter. These components operate at the data level and the ontological level, respectively, to reconcile the syntactic and semantic differences between incoming data streams and the data expectations of the methods. Together, the data broker and the mapping interpreter provide a semantic bridge between analytic methods and raw data. Our approach further supports BioSTORM in making meaningful computations over disparate types of surveillance data without major reprogramming whenever we incorporate a new data source or we develop a new problem-solving method for the library.


Surveillance of San Francisco 911 Emergency Dispatch Data

BioSTORM successfully deployed a set of problem solvers over streams of 911 emergency data in San Francisco (from the year 1999) based on the ontologies that form the system’s backbone (see figure A). The three aberrancy methods allow us to examine the same data from different perspectives. To reduce the false-positive rate, we first examined results from the likelihood calculator, which makes the fewest comparisons by collapsing input values across locations. We set each aberrancy method’s threshold at a level that declared an epidemic on average once a month, or 3.3 percent of the days in the training period. In November and December 1999, the likelihood calculator triggered an alert on eight days for respiratory calls, four days for cardiovascular calls, and two days for trauma calls.

To show how we can use different methods to confirm and localize aberrancies, we examined the other two aberrancy methods’ results for the day of the first respiratory alert—27 November 1999. On that day, which coincides with a local peak in influenza admissions, the minimum aberrancy calculator declared a respiratory alert for a ZIP code (94133) in the northeast corner of the city. Examining the output from the Poisson aberrancy calculator for that ZIP code revealed that, on average, 0.26 respiratory calls were expected and two calls were made.
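As a back-of-the-envelope illustration of the Poisson comparison just described, the sketch below asks how unlikely two observed calls are when only 0.26 were expected; this is our own simplified tail-probability calculation, not the Poisson aberrancy calculator in BioSTORM’s method library.

// Back-of-the-envelope version of the Poisson comparison mentioned above:
// given an expected count of 0.26 respiratory calls for ZIP code 94133,
// how surprising are the two calls actually observed? This is only a sketch.
public final class PoissonAberrancySketch {
    // P(X = k) for a Poisson distribution with mean lambda, computed in log space.
    static double poissonPmf(double lambda, int k) {
        double logP = -lambda + k * Math.log(lambda);
        for (int i = 2; i <= k; i++) logP -= Math.log(i);
        return Math.exp(logP);
    }

    public static void main(String[] args) {
        double expected = 0.26; // expected respiratory calls for the ZIP code and day
        int observed = 2;       // calls actually dispatched

        // p-value: probability of seeing at least the observed count by chance.
        double pAtLeastObserved = 1.0;
        for (int k = 0; k < observed; k++) pAtLeastObserved -= poissonPmf(expected, k);

        System.out.printf("P(X >= %d | lambda = %.2f) = %.4f%n",
                observed, expected, pAtLeastObserved);
    }
}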

Figure A. Three concurrent surveillance strategies’ deployment on 911 data in San Francisco. Each strategy performs a different analysis on the same data. The aggregation observation method and Poisson expected value calculator forecast method convert raw data to aggregated, observed, and expected counts for each day. BioSTORM then performs three different analyses. The minimum aberrancy calculator method examines all syndrome counts by location and identifies the most unusual syndrome count and p-values for each ZIP code to find spatial clusters of unusual counts that could span syndromes. The likelihood calculator method examines each syndrome’s count separately, with counts aggregated across all ZIP codes. This approach is useful for identifying outbreaks dispersed across a wide area. The Poisson aberrancy calculator method examines each syndrome’s counts separately by spatial location to identify spatial clusters confined to one syndrome. BioSTORM’s controller initializes, connects, and orchestrates all methods based on our surveillance problem-solving ontology. (The figure also plots the likelihood calculator’s log-likelihood values for November and December 1999.)

The data broker

The data broker component uses the data-source ontology to allow problem solvers to read data from many sources at runtime. The component queries the data-source ontology for a particular data source’s description and context and constructs a stream of uniform data objects from raw data. First, the data broker accesses and retrieves data in the original location (for example, relational databases or flat files) based on metadata describing the data-source ontology’s low-level data classes. Next, it formats the data and groups them as the data-source ontology specifies. The data broker packages the data with the appropriate context annotations to create syntactically uniform and semantically unambiguous data objects. The data objects are then ready for the problem solvers that must operate on them. This way, each problem solver receives a customized set of data objects and can ignore the raw data’s original formats.
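A minimal sketch of this brokering step follows, using a made-up tab-delimited record and hypothetical field and context names; the real data broker reads its formatting and grouping instructions from the data-source ontology rather than hard-coding them.

import java.util.List;
import java.util.Map;

// Sketch of the data broker's role: take a raw record from its original
// store (here, a fake tab-delimited line) and package it into a uniform
// data object carrying context annotations. Field and key names are hypothetical.
public final class DataBrokerSketch {
    record UniformDataObject(Map<String, String> context, List<String> values) {}

    // Build a uniform data object from one raw 911 dispatch record.
    static UniformDataObject broker(String rawTabDelimitedRecord) {
        List<String> fields = List.of(rawTabDelimitedRecord.split("\t"));
        Map<String, String> context = Map.of(
                "dataSource", "Dispatch911",
                "fieldOrder", "date, zipCode, dispatchCode");
        return new UniformDataObject(context, fields);
    }

    public static void main(String[] args) {
        UniformDataObject obj = broker("1999-11-27\t94133\tRESP");
        System.out.println(obj.context());
        System.out.println(obj.values());
    }
}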

Figure 4. Mapping 911 call-record data groups to the aggregation method’s individual events with Protégé. (a) We must reconcile various mismatches to enable the aggregation method to process data groups as individual events. (b) According to a mapping ontology, mapping relations specify transformations of instance data from the data-source ontology to the method ontology, both at the class and slot levels. (c) A Protégé snapshot with a sample mapping relation, defined between the source’s data group and the method’s individual event, highlights a slot-level mapping specifying that the dataSource slot of the method’s individual event should have the value "Dispatch911." (d) An individual event instance, automatically generated from a 911 Call Record Data Group instance (both shown in Protégé snapshots) by the mapping interpreter. (The mapping ontology shown in the figure defines renaming, constant, lexical, functional, and recursive slot mappings, as well as instance and global mappings.)

The mapping interpreter

Some problem solvers can operate directly on data that the data broker supplies. However, many surveillance methods in our library expect data in a different format, conceptualization, or granularity level from the data broker’s lower-level data objects. In these cases, BioSTORM must supply data to the problem solvers in the appropriate representation, which requires mapping and transforming the data (see figure 4a).

Using Protégé, we devised a mapping ontology4,6 that enumerates the ontological transformation types—or mapping relations—that enable data sources and data elements to match different problem solvers’ particular data requirements (see figure 4b and the sidebar "Protégé: A Framework for Building Ontology-Based Intelligent Systems"). For each data group in the data-source ontology, specific mapping relations define data elements’ transformation into runtime inputs of problem solvers. These transformations range from simply renaming data-specific elements to the corresponding terms the method uses, to composing lexical or functional expressions of data elements to match method terms as the method ontology defines them. For example, when configuring a surveillance method to aggregate different data streams, where each stream reported on different 911 dispatches, we created a set of mappings to transform the different data streams’ contents into individual events as the aggregation method requires (figure 4c).


Based on the created set of data-to-method mapping relations, BioSTORM must translate incoming data elements to a set of input instances for the particular method to use. Our mapping interpreter performs this task by processing the mapping relations for each data group and problem solver and generating streams of data structures that the problem solvers can process (for example, generating individual events for the aggregation method; see figure 4d). The mapping interpreter therefore transforms both the original data’s contents and semantics to conform to each problem solver’s input requirements.
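The sketch below illustrates two of the mapping-relation kinds (a renaming slot mapping and a constant slot mapping) together with a toy interpreter that applies them to turn a 911 call-record data group into an aggregation-method individual event; the slot names and Java types are hypothetical, since BioSTORM declares these relations as instances of the Protégé mapping ontology rather than as code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of two mapping-relation kinds and a tiny interpreter that applies
// them to produce an "individual event" from a 911 data-group record.
// Slot names are illustrative only.
public final class MappingInterpreterSketch {
    interface SlotMapping { void apply(Map<String, String> source, Map<String, String> target); }

    // Copy a source slot's value into a differently named target slot.
    record RenamingSlotMapping(String sourceSlot, String targetSlot) implements SlotMapping {
        public void apply(Map<String, String> source, Map<String, String> target) {
            target.put(targetSlot, source.get(sourceSlot));
        }
    }

    // Fill a target slot with a fixed value, regardless of the source record.
    record ConstantSlotMapping(String targetSlot, String value) implements SlotMapping {
        public void apply(Map<String, String> source, Map<String, String> target) {
            target.put(targetSlot, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> callRecordGroup = Map.of(
                "callDate", "1999-11-27", "callZip", "94133", "dispatchCode", "RESP");

        List<SlotMapping> mappings = List.of(
                new RenamingSlotMapping("callDate", "eventDate"),
                new RenamingSlotMapping("callZip", "location"),
                new RenamingSlotMapping("dispatchCode", "illnessCategory"),
                new ConstantSlotMapping("dataSource", "Dispatch911"));

        Map<String, String> individualEvent = new HashMap<>();
        for (SlotMapping m : mappings) m.apply(callRecordGroup, individualEvent);
        System.out.println(individualEvent);
    }
}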

A controller for deploying surveillance methods

As an overarching piece of the BioSTORM infrastructure, we developed a controller that coordinates data flow from raw representations to appropriate problem solvers via the data broker and mapping interpreter.7 It manages disparate data sources’ unification into semantically uniform data streams, maps them to multiple problem solvers, and deploys the problem solvers to conduct surveillance. It uses the ontologies that describe the surveillance data and analytic methods to configure and monitor system components. The controller also ensures that the data broker passes the correct data at the correct time through the data mapper to the relevant problem solver. This entire process must execute efficiently, with potentially many data sources being sent to various complex problem-solver configurations operating in parallel. Thus, the architecture doesn’t store, access, or manipulate data directly in the data-source and method ontologies; rather, these ontologies hold abstract models and references to the data’s various states.

The controller is based on the JavaSpaces implementation of the Linda blackboard-based model to provide a distributed, knowledge-driven means for problem-solver deployment (www.sun.com/software/jini/specs/jini1.1html/js-spec.html). JavaSpaces provides a shared data store that lets Java processes exchange data. We’ve developed a hybrid Linda/relational knowledge-driven model to provide efficient data flow in a deployed BioSTORM system. In this model, the system groups data into bundles based on shared semantic properties, which it describes in terms of our data-source ontology. Data might be bundled spatially, temporally, or based on other semantic properties, such as all 911 calls for a particular ZIP code on a particular day. These bundles can be exchanged through JavaSpaces’ data store. Relational databases store the raw data and semantic markers associated with the bundles that reference those data. The data broker then controls the bundled data’s actual insertion and extraction to and from the appropriate relational database. The controller completely shields problem solvers from these details.

Our approach lets BioSTORM take advantage of the tremendous efficiencies the Linda model affords while enabling the system to scale up to handle large data sets. Additionally, our solution lets problem solvers exchange data using high-level terms the data-source ontology provides, freeing them from low-level data formatting concerns. When problem solvers require data in a form that the data-source ontology doesn’t provide, the controller invokes the mapping interpreter seamlessly to perform any necessary tailoring. The controller thus provides a coherent, efficient runtime system that unifies data sources, knowledge bases, and problem solvers. It provides the basis for the BioSTORM computational system to support the modular, concurrent application and structured evaluation of multiple knowledge-driven analytic methods. The sidebar "Surveillance of San Francisco 911 Emergency Dispatch Data" shows the ontology-centered deployment of a set of problem solvers over streams of 911 emergency data in San Francisco.
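The sketch below mimics the bundle exchange with an in-memory queue standing in for the JavaSpaces store; the Bundle record, tag names, and matching rule are hypothetical simplifications of the Linda-style entries and templates the controller actually uses, and the relational persistence layer is omitted entirely.

import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Highly simplified, in-memory stand-in for the JavaSpaces-style blackboard:
// the data broker posts bundles tagged with semantic properties, and a problem
// solver takes the next bundle whose tags match its template.
public final class BlackboardSketch {
    record Bundle(Map<String, String> tags, List<Integer> counts) {}

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Bundle> blackboard = new LinkedBlockingQueue<>();

        // Data broker side: post respiratory call counts for one day and ZIP code.
        blackboard.put(new Bundle(
                Map.of("dataSource", "Dispatch911", "zip", "94133",
                       "day", "1999-11-27", "category", "Respiratory"),
                List.of(2)));

        // Problem-solver side: take a bundle and process it if its tags match the template.
        Map<String, String> template = Map.of("dataSource", "Dispatch911", "category", "Respiratory");
        Bundle bundle = blackboard.take();
        boolean matches = template.entrySet().stream()
                .allMatch(e -> e.getValue().equals(bundle.tags().get(e.getKey())));
        if (matches) {
            System.out.println("Processing counts " + bundle.counts()
                    + " for " + bundle.tags());
        }
    }
}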

BioSTORM demonstrates an end-to-end solution to many problems associated with data acquisition, integration, and analysis for public health surveillance. Its architecture builds solidly on long-standing AI work on using ontologies to integrate semantically disparate data, mapping reusable problem-solving methods to domain ontologies, and using blackboard control methods for distributed problem-solver deployment. Most important, it uses ontologies’ full potential to represent all knowledge necessary in a system to acquire, access, integrate, and mediate disparate data to judiciously chosen analytic methods. BioSTORM demonstrates how established AI methods can help us rapidly develop a robust approach to analyzing large volumes of disparate, noisy data. Our generic approach is successful because ontologies let us abstract data storage and analytic programming details at the semantic level, where data and programs can interoperate meaningfully.

Because analytic methods are independent of data sources and data-to-method mapping occurs at the semantic level, the approach enables a degree of flexibility that current surveillance systems don’t offer. In that sense, our ontology-centered approach provides a novel contribution to public health surveillance. Unlike special-purpose solutions, our system can accommodate data sources’ and outbreak patterns’ unpredictable nature. When developers identify novel data sources, they can edit our data-source ontology and incorporate them in a straightforward manner. When developers create new analytic methods, they can model them in our library of surveillance problem solvers and easily encode and add them to our control structure. Unlike existing syndromic surveillance systems, BioSTORM doesn’t require reprogramming whenever a new data source or a new analytic algorithm becomes available; instead, developers simply edit the associated ontologies, which our system then reads to automatically configure new analysis strategies.

Improvements to our approach include customizing the data-source ontology and mediating components to account for emerging biomedical data-messaging standards such as the US Centers for Disease Control and Prevention’s Public Health Information Network Messaging System and Health Level 7 (www.cdc.gov/phin/software-solutions/phinms; www.hl7.org). Also, our next step in classifying surveillance methods will be defining correspondences between specific disease agents and epidemic patterns. For example, influenza outbreaks often produce signals that are geographically diffuse and rise to a peak incidence over weeks. Linking outbreaks to signal types would facilitate selecting analytic methods appropriate for detecting a specific outbreak. A knowledge base of epidemic patterns associated with various disease agents would complete the system and help analytic methods determine the type of outbreak being detected. Finally, further testing the performance of the BioSTORM controller against custom-developed solutions would let us assess our ontology-driven implementation’s computational efficiency.

The ease with which our ontology-based approach accommodates system changes has implications that extend beyond system maintenance. The major difficulty with current syndromic surveillance systems is that none has been rigorously evaluated. The homeland security community has taken it on faith that these systems are useful.2

However, because syndromic surveillance of electronic health data is a new discipline, identifying the optimal data and methods for detecting disease outbreaks won’t be possible until many combinations of data sources and analytic methods have been evaluated. BioSTORM’s architecture offers a modular framework in which developers can incorporate new problem-solving methods and data sources and then measure system performance. Furthermore, the architecture’s ontology-centered approach provides the basis for correlating different data sources and selecting appropriate analytic methods at the ontological level, even when data and methods encode different spatiotemporal granularities. We believe that BioSTORM can have immediate payoff as a test bed for evaluating surveillance data sources’ and analytics’ relative contributions.

Beyond syndromic surveillance, our architecture provides the conceptual and implementation components for developing many other data-source monitoring and analysis applications. Because all components are generic, adapting the architecture for a new application, such as weather forecasting, is a matter of modeling application-specific knowledge elements into each ontology. Because our ontologies are all developed in the Protégé modeling environment, developers get significant tool support from the environment to perform those ontology extensions.

The Authors

Monica Crubézy is a research scientist in the Stanford Medical Informatics Laboratory at Stanford University. Her research focuses on modeling libraries of problem-solving methods and integrating them with domain ontologies to achieve knowledge-intensive tasks. She also studies mapping and mediating knowledge among ontology-based system components, such as in the context of the Semantic Web. She received her PhD in computer science from the Université de Nice-Sophia Antipolis and the Institut National de Recherche en Informatique et Automatique. Contact her at Stanford Medical Informatics, 251 Campus Dr., Stanford Univ., Stanford, CA 94305; [email protected]; http://smi-web.stanford.edu/people/crubezy.

Martin O’Connor is a systems software developer in the Stanford Medical Informatics Laboratory at Stanford University. His research interests include temporal query languages and distributed problem-solver deployment. He received his MSc in computer science from the University of Dublin, Trinity College. Contact him at Stanford Medical Informatics, 251 Campus Dr., Stanford Univ., Stanford, CA 94305; [email protected]; http://smi-web.stanford.edu/people/moconnor.

David L. Buckeridge is an assistant professor of epidemiology and biostatistics at McGill University’s Center for Clinical and Health Informatics. He holds a Canada Research Chair at McGill University. His research interests include public health informatics, particularly public health surveillance informatics. He received his PhD in biomedical informatics from Stanford University. Contact him at McGill Univ., Clinical and Health Informatics Research Group, 1140 Pine Ave. West, Montreal QC, H3A 1A3; [email protected].

Zachary Pincus is a doctoral student in Stanford University’s Biomedical Informatics program. His research interests include image segmentation and machine-learning algorithms for quantifying biological microscopy images as well as generic structures and procedures for mapping between ontologies and data sources. He received his BS in biology from Stanford University. Contact him at Stanford Medical Informatics, 251 Campus Dr., Stanford Univ., Stanford, CA 94305; [email protected].

Mark A. Musen is a professor of medicine (medical informatics) and computer science at Stanford University and the head of the Stanford Medical Informatics Laboratory. He has directed Protégé since its inception in 1986. His research interests include knowledge acquisition for intelligent systems, knowledge-system architecture, and medical decision support. He received his PhD in medical information sciences from Stanford University. He’s a member of the American College of Medical Informatics and the American Society for Clinical Investigation. He’s served on the US National Library of Medicine’s Biomedical Library Review Committee. Contact him at Stanford Medical Informatics, 251 Campus Dr., Stanford Univ., Stanford, CA 94305; [email protected]; http://smi-web.stanford.edu/people/musen.

Acknowledgments

This work was supported by DARPA under a national research program for biosurveillance technology. We presented a preliminary version of this article at the 2005 AAAI Spring Symposium on AI Technologies for Homeland Security. We thank our anonymous reviewers for their insightful comments.

References


1. J.A. Pavlin, "Epidemiology of Bioterrorism," Emerging Infectious Diseases, vol. 5, no. 4, 1999, pp. 528–530.
2. D.M. Bravata et al., "A Critical Evaluation of Existing Surveillance Systems for Illnesses and Syndromes Potentially Related to Bioterrorism," Annals of Internal Medicine, vol. 140, no. 11, 2004, pp. 910–922.
3. Z. Pincus and M.A. Musen, "Contextualizing Heterogeneous Data for Integration and Inference," Proc. Am. Medical Informatics Assoc. Ann. Symp., Hanley and Belfus, 2003, pp. 514–518.
4. M. Crubézy and M.A. Musen, "Ontologies in Support of Problem Solving," Handbook on Ontologies, S. Staab and R. Studer, eds., Springer, 2003, pp. 321–341.
5. D.L. Buckeridge et al., "An Analytic Framework for Space-Time Aberrancy Detection in Public Health Surveillance Data," Proc. Am. Medical Informatics Assoc. Ann. Symp., Hanley and Belfus, 2003, pp. 120–124.
6. J.H. Gennari et al., "Mapping Domains to Methods in Support of Reuse," Int’l J. Human-Computer Studies, vol. 41, no. 3, 1994, pp. 399–424.
7. M.J. O’Connor et al., "RASTA: A Distributed Temporal Abstraction System to Facilitate Knowledge-Driven Monitoring of Clinical Databases," Proc. 10th World Congress on Medical Informatics (MEDINFO 01), IOS Press, 2001, pp. 508–512.
