IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 12, NO. 00, 2008

XML-Based Data Model and Architecture for a Knowledge-Based Grid-Enabled Problem-Solving Environment for High-Throughput Biological Imaging

Wamiq M. Ahmed, Student Member, IEEE, Dominik Lenz, Jia Liu, J. Paul Robinson, and Arif Ghafoor, Fellow, IEEE


Abstract—High-throughput biological imaging uses automated imaging devices to collect a large number of microscopic images for analysis of biological systems and validation of scientific hypotheses. Efficient manipulation of these datasets for knowledge discovery requires high-performance computational resources, efficient storage, and automated tools for extracting and sharing such knowledge among different research sites. Newly emerging grid technologies provide powerful means for exploiting the full potential of these imaging techniques. Efficient utilization of grid resources requires the development of knowledge-based tools and services that combine domain knowledge with analysis algorithms. In this paper, we first investigate how grid infrastructure can facilitate high-throughput biological imaging research, and present an architecture for providing knowledge-based grid services for this field. We identify two levels of knowledge-based services. The first level provides tools for extracting spatiotemporal knowledge from image sets, and the second level provides high-level knowledge management and reasoning services. We then present the cellular imaging markup language, an extensible markup language (XML)-based language for modeling of biological images and representation of spatiotemporal knowledge. This scheme can be used for spatiotemporal event composition, matching, and automated knowledge extraction and representation for large biological imaging datasets. We demonstrate the expressive power of this formalism by means of different examples and extensive experimental results.

Index Terms—Grid computing, high-throughput biological imaging, knowledge extraction (KE), spatiotemporal modeling.


I. INTRODUCTION


Microscopic imaging is one of the most popular means for studying biological processes. Recent advances in microscopic and optical instrumentation and automation technologies have made sophisticated imaging tools available to the biologist [1]. The rapidly evolving field of high-content/high-throughput screening (HCS/HTS) uses automated imaging instruments for image-based analysis of large populations of cells,


to monitor their spatiotemporal behavior under different experimental conditions [2]. These technologies provide valuable information about biological processes and can help in generating rich information bases about different cellular systems [3]. These information bases can, in turn, help in developing better systems-level understanding of biological organisms, and can also aid in developing new drugs. In order to exploit the full potential of these technologies, many challenges have to be overcome. Analysis of a large number of cells for validating scientific hypotheses requires high-throughput computing resources, efficient management of datasets, and knowledge-based image understanding tools that can rapidly identify relevant biological objects and events and extract high-level semantic information [4], [5]. Significant progress has been made in recent years in the areas of distributed computing and distributed data management using grid technologies [6], [7]. The availability of basic grid services provides a great opportunity for developing knowledge-based applications and services for high-throughput biological imaging. This requires the development of tools and services for spatiotemporal knowledge representation and extraction. Development of such tools and knowledge representation schemes would not only expedite the analysis of high-throughput imaging data but would also make the new knowledge generated in the process available for the design of other experiments. It is increasingly clear that availability of such knowledge would significantly impact future biological research [8].

Grid technologies are a promising solution for the challenging demands of high-throughput biological imaging research. Computational grids have provided high-throughput computing resources for compute-intensive applications in different scientific domains [9], [10].
Data grids have provided elegant solutions for the management of distributed data sources for applications that involve large volumes of data [10]. The combined potential of enormous computational and data resources can be best utilized by knowledge-based tools and services that can extract high-level knowledge from huge datasets and can help researchers in making informed decisions, and thus, aid in the process of biological discovery. These tools and services can allow automated extraction of biological knowledge and high-level information-rich metadata from large image sets and make them available to other researchers. The maturation of computational and data grids provides an opportunity for applying the


Manuscript received November 7, 2006; revised xxx. W. M. Ahmed and A. Ghafoor are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47906 USA (e-mail: [email protected]; [email protected]). D. Lenz, J. Liu, and J. P. Robinson are with the Department of Basic Medical Sciences, Purdue University, West Lafayette, IN 47906 USA (e-mail: [email protected]; [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TITB.2008.904153



powerful services provided by these technologies for solving challenging problems in the domain of HCS. A key challenge for data mining and knowledge discovery in large HCS image sets is the extraction of semantic information about cell states in terms of their spatiotemporal dynamics. Once this information is extracted, advanced data mining techniques can be applied for discovery of new knowledge from diverse repositories of these data. These issues are addressed in this paper. The specific contributions of this paper are as follows.

1) We investigate how grid infrastructure can facilitate high-throughput biological imaging research, and present an architecture for providing knowledge-based grid services for this field that builds on basic computing, communication, and data grid services. The proposed architecture provides two layers, a knowledge extraction (KE) layer and a knowledge management (KM) layer. The KE layer deals with algorithms for extraction of spatiotemporal knowledge from images, whereas the KM layer deals with higher level knowledge representation and reasoning.

2) We introduce a data model and an extensible markup language (XML) based language for specifying objects and spatial and temporal events. This specification of spatiotemporal knowledge can be used by object- and event-identification tools to extract spatiotemporal knowledge from image sets. Schemas are also provided for the representation of this automatically extracted knowledge, making this information-rich metadata available to other researchers in a machine-readable format.

There are a number of benefits to this approach. Firstly, it can speed up the analysis of large image datasets by providing grid-enabled automated tools for extraction of spatiotemporal knowledge that can help researchers in understanding biological systems.
Secondly, these tools can also be used for automatic extraction of spatiotemporal metadata and representation of these metadata at varying levels of detail in a machine-readable format so that this knowledge becomes available to other researchers. Thirdly, it enhances the opportunity for applying advanced data mining tools to large image sets for identifying hidden patterns in such data.

124

A. High-Throughput Fluorescence Microscopic Imaging


Fluorescence imaging is one of the most useful tools in the arsenal of biologists, as it enables the visual examination of biological systems using fluorescent markers [1], [11]. As many biological samples are relatively transparent, various markers or probes are used to highlight cell compartments of interest. These markers are frequently fluorescent molecules that have specific absorption and emission spectra, and they emit light in their emission spectrum when excited within their appropriate absorption wavelengths. Optical filters are used to collect light in particular spectral bands. Fig. 1 shows cell nuclei stained with a Hoechst dye and cell cytoplasms stained with Alexa 488 dye. Similarly, different cell locations can be stained with a variety of markers, and this information is collectively used for understanding biological phenomena. Recent advances in automation technologies combined with microscopic imaging techniques have given rise to the highly promising field of HCS. HCS technologies image large populations of cells in an automated fashion [2]. They allow the examination of cell populations during and subsequent to different experimental perturbations. A key feature of biological systems is their spatial and temporal dynamics in response to variations in experimental conditions. For example, the location of different proteins inside cells reveals important information about their function, and changes in the phenotype of cells are important for understanding different cell states. Monitoring these spatiotemporal dynamics is crucial to understanding biological systems.

Fig. 1. Cell nuclei and cytoplasm stained with different dyes.


B. Examples of Spatiotemporal Cellular Events


Many biological events involve spatial and temporal changes in different properties of cells and subcellular objects. Apoptosis is one such event [12]. Apoptosis is defined as programmed cell death. Every living cell spends its life cycle according to a set of instructions, and at the end, it initiates the process of apoptosis. It will be used as a detailed example in Section V and is therefore explained more fully here. Studies of apoptosis are important for cancer cell research. Cancerous cells often avoid apoptosis and instead continue to divide without control. Many anticancer drugs are designed to induce apoptosis in cancer cells, for example, to stop the expansion of a tumor. Apoptosis can be detected by using a specific surface marker, for example Annexin V fluorescein isothiocyanate (FITC), as well as other probes for staining the nucleus and cytoplasm. Another fluorescent nuclear stain called propidium iodide (PI) is used to verify cell viability. This dye cannot penetrate the intact cell membrane of healthy cells, so it is not found in nuclei. As the cell undergoes apoptosis, its cell membrane becomes more permeable and PI stains the nucleus.

Cell division is another example of such events. Study of the effect of different perturbations on cell division is also useful for monitoring the progression of cancer cells. Cell division involves enlargement of a cell followed by nuclear division and the formation of daughter cells.

In addition to physical changes in cells, more subtle alterations within cellular metabolic states can be observed by monitoring the production and location of various proteins. Proteomics, which is the study of the structure, function, and interactions of proteins produced by the genes of a particular cell, relies heavily on the location of different proteins inside cells. The function of proteins depends on the location where they are expressed [13].
Automated tools for localization of proteins and monitoring their dynamics can be very helpful for the field of proteomics [14].


C. Related Work


Grid technologies have been applied to many problems in the biological and medical sciences [10], [15]. The biomedical informatics research network (BIRN) focuses on applying grid services to neurological problems [15]. A grid-enabled image processing toolkit has been proposed in [16]. The OpenMolGRID project aims at providing grid-enabled applications for molecular science and engineering [15]. All of these projects build on basic grid services and apply grid technologies to particular domains. Like these projects, our aim is to apply the computing, communication, and data services provided by the grid to a specific domain (HCS). Additionally, we focus on providing knowledge-based services that can be used for representing and extracting spatiotemporal knowledge. These tools can be used not only for extraction of quantitative knowledge from image sets but also for automated metadata generation. We propose the cellular imaging markup language (CIML) for this purpose.

This paper also draws on related work in the computer vision and knowledge representation communities. A language-based representation of video events and a markup language for video annotation based on XML have been proposed in [17]. The authors also discuss domain ontologies for the physical security and meeting domains. An ontology-based approach for semantic image interpretation is proposed in [18]. The idea of embedding semantic information into multimedia documents has been proposed in [19]. The open microscopy environment (OME) has been proposed as a solution for the management of multidimensional biological image sets [4]. It uses XML schemas for saving image data along with the experiment context and analysis results. The OME focuses not on providing image analysis or KE tools but on providing a unified data format for storage and sharing of multidimensional microscopic images.
The challenging problem of knowledge-based semantic interpretation of biological images requires the incorporation of domain-specific knowledge into intelligent image understanding tools. This requires modeling of images and mechanisms for representing spatiotemporal biological knowledge. At the same time, the high volume of image sets generated by HCS technologies requires powerful computational resources and efficient data storage and management systems. High-throughput biological imaging can greatly benefit from developments in the grid, computer vision, and knowledge representation communities. Based on these observations, we propose an architecture that builds upon basic grid services and provides KE and management services for the HCS. We also propose an XML-based language for event representation and event markup with a specific focus on biological imaging. This approach enables the extraction of quantitative knowledge from large biological image sets for hypothesis-driven image analysis. This extracted knowledge can also be embedded into an image format like OME. This approach can be useful in solving the interoperability issues of biological imaging data produced by heterogeneous sources. This makes it possible to use advanced data mining techniques for rule induction.

The rest of the paper is organized as follows. Section II describes how grid technologies can be used to provide knowledge-based services for the HCS. Section III discusses the issues of spatiotemporal event recognition and representation. The architecture of our grid-enabled system for the HCS, along with our spatiotemporal model and the CIML, is presented in Section IV. An illustrative example is described in Section V. Section VI describes the implementation status and presents results of an apoptosis screen, and the paper is concluded in Section VII.

Fig. 2. Hierarchy of grid services.


II. KNOWLEDGE-BASED GRID FOR HCS


Grid technologies have been proposed for applications that require large computational resources or deal with high volumes of data. Modeling and simulation of complex systems and research in particle physics are some examples [10]. Grids provide middleware and tools for harnessing diverse and distributed computational and storage resources in a seamless fashion and for providing the end user with a uniform front end [6]. Grid technologies were initially motivated by computationally intensive applications that required tremendous computing power, and computational grids provided the required resources [6]. Development of data grids was motivated by the need for systems to analyze and process data stored at geographically diverse data storage facilities, using the computational power provided by computational grids [20]. Great progress has been made in recent years in providing powerful middleware and services for computational and data grids [6]. Recently, there has been tremendous interest in knowledge grids as enabling technologies for better exploiting the underlying computational and data services provided by computational and data grids [16], [21]–[24]. Knowledge grid technologies aim at providing knowledge services that are tailored for particular domains, and thus, facilitate efficient utilization of available resources. Fig. 2 shows the hierarchy of grid services. In this section, we investigate how grid infrastructure can help high-throughput biological imaging, and present the architecture of a knowledge-based problem-solving environment for this rapidly expanding field.

Biological imaging investigations are normally aimed at validating scientific hypotheses. The first step, therefore, is hypothesis definition, followed by experiment design. The experiment is then conducted and results are analyzed to understand the biological processes.
Fig. 3. HCS workflow utilizing grid services.

Owing to the complexity of biological organisms, it is evident that future research in biological sciences and drug development will require more and more collaboration among research groups. This collaboration effort will be facilitated by automated KE and metadata-generation tools. Grid technologies, as described in Fig. 3, provide a promising solution to meet the computational, data processing, and
collaboration needs of such tools to facilitate the HCS pipeline. Hypothesis development requires intensive searches for related knowledge available in diverse locations. As the data in this case happen to be in the form of large collections of images, a key requirement is to have automated KE and metadata-generation tools and formalisms for representation of this extracted knowledge so that software agents can search for relevant knowledge. The design of imaging experiments would also benefit from collaboration services. For example, if a few researchers are studying the proteome of the same organism, they may want to discuss the design of experiments and may even want to divide the whole proteome among themselves so as to avoid repetition of the same work. The human genome project is an example of a recent project of this nature [25]. Grid services can also make remote instruments available to scientists for conducting experiments. Remote microscopy has been used to provide access to expensive instruments like electron microscopes or to obtain a second opinion from a pathologist [26]. A large drug screen may require expensive instruments like imaging cytometers at multiple geographic sites, requiring a high degree of coordination. Efficient grid middleware can provide these services. Once the experiments are conducted, the results are gathered and analyzed. In the case of HCS, transfer of large volumes of imaging data may require a high-speed communication infrastructure. These data are to be transferred to appropriate locations where they are analyzed using high-performance computational resources. Two levels of analysis can be identified here. The first level is concerned with identifying the basic information about objects and events depicted in image sets. This level essentially extracts “numbers” out of images. The second level deals with data mining and reasoning based on the basic information extracted by the first-level processing. 
in Fig. 4. The KE layer provides algorithms for extraction of spatiotemporal knowledge from images, while the KM layer deals with higher level knowledge representation and reasoning. Imaging data may consist of two-dimensional microscope images, three-dimensional confocal datasets, time-lapse images, or multispectral images [4]. The KE layer provides services for identifying objects and events. Object identification by classification algorithms is based on object features like size, area, shape, and spectrum, to cite a few examples. This layer also provides algorithms for preprocessing of images for noise removal, and algorithms for registration and segmentation. Algorithms are also provided for identifying spatial and temporal events, as explained in Section III. The KM layer provides services for specifying objects and events so that they can be identified by the algorithms provided by the KE layer. This layer also provides data mining and reasoning services for discovering new information from the spatiotemporal information about the images. Other services provided by this layer include automatic metadata generation and visualization.

The working and interaction of KE- and KM-layer tools are demonstrated in Fig. 5. The user provides an event composition file in the CIML representation, as described in Section IV. The XML parser reads this description and generates workflows for KE- and KM-layer tools. The KE-layer tool manager and KM-layer tool manager each read and execute the workflows. The information extracted by KE-layer tools is represented as XML documents and stored in a KE-layer XML database. KE-layer tools generally extract information from a dataset produced by a single experiment, whereas KM-layer tools operate on data produced by multiple experiments. For this reason, KE-layer tools operate directly on the imaging data, while KM-layer tools operate on the information extracted by KE-layer tools, as shown in Fig. 5.
The tools within KE and KM layers also exchange data in the form of XML documents.


A. Event Composition File


The event composition file describes the spatiotemporal events of interest that are to be matched in imaging datasets. This file describes not only the high-level event information in terms of semantics but also the specific algorithms used for particular steps of processing. For example, apoptosis is defined in terms of spatial overlap of different types of dyes as explained in Section V. The specific information about what segmentation algorithm and spatial representations are to be used is described in the event composition file. CIML representations of spatiotemporal events along with examples are given in Section IV.
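As a concrete illustration, the sketch below parses a hypothetical event composition file with Python's standard library. The element and attribute names are invented for illustration only; the actual CIML syntax is defined in Section IV.

```python
import xml.etree.ElementTree as ET

# Hypothetical event-composition fragment. Element and attribute names
# are assumptions for illustration, not the paper's actual CIML schema.
COMPOSITION = """
<EventComposition name="apoptosis">
  <Segmentation algorithm="watershed"/>
  <SpatialRepresentation type="bounding-box"/>
  <Event type="spatial" relation="overlap">
    <Object marker="AnnexinV-FITC"/>
    <Object marker="PI"/>
  </Event>
</EventComposition>
"""

def parse_composition(xml_text):
    """Extract the processing choices and the event definition."""
    root = ET.fromstring(xml_text)
    event = root.find("Event")
    return {
        "event": root.get("name"),
        "segmentation": root.find("Segmentation").get("algorithm"),
        "spatial_representation": root.find("SpatialRepresentation").get("type"),
        "relation": event.get("relation"),
        "markers": [o.get("marker") for o in event.findall("Object")],
    }

spec = parse_composition(COMPOSITION)
print(spec["segmentation"], spec["relation"])  # watershed overlap
```

A parser of this kind would hand the semantic part (the event and its markers) to the KM layer and the algorithmic choices (segmentation, spatial representation) to the workflow generator.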


B. Workflows and Tool Managers



Fig. 4. Services provided by the proposed architecture.

Fig. 5. Interaction between KE and KM layers.

Fig. 6. Fragment of a workflow.

Workflows describe what processing tools and algorithms are to be used for problem solving. While the event composition file contains the semantic information about the events of interest, workflows contain information about what algorithms will be applied and in what order. A fragment of a workflow for segmentation and feature extraction is shown in Fig. 6. This example shows that a watershed segmentation algorithm is to be used and Haralick texture features are to be extracted. The task of the tool manager is to match available tools with the user's request as specified in the workflow. Tool managers read the workflows and execute them by running appropriate tools on the inputs.
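The tool-manager dispatch loop can be sketched as follows. The registry contents and step names are illustrative assumptions, not the paper's actual tool set.

```python
# Sketch of a tool manager: match each workflow step to a registered tool
# and run the steps in order, threading the result through. The tool
# names and their string-producing stand-in bodies are invented.
TOOL_REGISTRY = {
    "watershed_segmentation": lambda data: f"segmented({data})",
    "haralick_features": lambda data: f"features({data})",
}

def execute_workflow(steps, data):
    """Run workflow steps sequentially on the input data."""
    for step in steps:
        tool = TOOL_REGISTRY.get(step)
        if tool is None:
            raise KeyError(f"no tool available for workflow step '{step}'")
        data = tool(data)
    return data

result = execute_workflow(["watershed_segmentation", "haralick_features"], "well_A01")
print(result)  # features(segmented(well_A01))
```

The registry is the point where "available tools" are matched against the user's request: an unknown step fails fast rather than silently skipping a stage of the pipeline.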


C. KE-Layer Tools


As explained earlier, the KE layer provides tools that operate directly on image data. These tools include algorithms for preprocessing, segmentation, feature extraction, clustering, classification, spatial analysis, and temporal analysis. Preprocessing is generally used for removing noise and artifacts from images. For large sets of fluorescence images, variations in light intensity need to be normalized using intensity-normalization algorithms. Segmentation algorithms, such as region-based and edge-based algorithms, attempt to separate objects of interest from the background. After identifying the objects of interest, the next step is to extract object features for classification. These features include size, intensity, region and boundary descriptors, and texture features. The features extracted by feature-extraction modules are used for clustering and classification. Clustering generally refers to unsupervised classification, where objects are grouped based on the similarity of their features. Supervised classification, in contrast, uses labeled examples of classes for training classifiers. Classification algorithms such as neural networks, decision trees, and support vector machines are used [27]. Spatial analysis refers to the analysis of an object's position with respect to others in the image, and temporal analysis refers to changes in the features of an object over time, or changes in the features and position of one object with respect to other objects over time [28].
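As a minimal stand-in for the classifiers mentioned above, the sketch below trains a nearest-centroid classifier over per-object feature vectors (for example, size and mean intensity). The feature values and class labels are invented for illustration.

```python
import math

# Nearest-centroid classification over object feature vectors: a simple
# supervised classifier standing in for the neural networks, decision
# trees, or SVMs a KE layer might actually use. All numbers are invented.
def centroid(vectors):
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def train(labeled):
    """labeled: dict mapping class name -> list of feature vectors."""
    return {cls: centroid(vecs) for cls, vecs in labeled.items()}

def classify(model, vec):
    return min(model, key=lambda cls: math.dist(model[cls], vec))

# Hypothetical training data: [size, marker intensity] per object.
model = train({
    "healthy":   [[100.0, 0.2], [110.0, 0.3]],
    "apoptotic": [[60.0, 0.8], [55.0, 0.9]],
})
print(classify(model, [58.0, 0.85]))  # apoptotic
```

The same train/classify interface would let the tool manager swap in a stronger classifier without changing the surrounding workflow.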


D. KM-Layer Tools


The KM-layer tools operate on the information extracted by the KE layer. Some of the tools provided by this layer include tools for generating metadata and tools for visualizing experimental results. Other tools provided by this layer, such as data mining and reasoning tools, operate on collections of XML documents generated by the KE layer. For example, a screening experiment that uses different drug concentrations for monitoring the effect of concentration on a particular cell type would use KE-layer tools to generate separate XML documents for each concentration of the drug used. The KM-layer tools for reasoning can then be used to find the optimal drug concentration for inducing a certain phenomenon. Data mining tools can be used for finding association rules and hidden patterns in such data. The results of KM-layer processing are also represented as XML documents and stored in an XML database.
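The drug-concentration reasoning step described above can be sketched as follows, assuming the KE layer has already reduced each experiment to an apoptotic fraction. The numbers and the target threshold are invented for illustration.

```python
# KM-layer reasoning sketch: given per-concentration summaries produced
# by KE-layer tools (values invented), find the lowest drug concentration
# whose apoptotic fraction reaches a target level.
screen = {0.1: 0.05, 0.5: 0.22, 1.0: 0.61, 2.0: 0.64, 5.0: 0.65}  # uM -> fraction

def optimal_concentration(results, target=0.6):
    """Return the smallest concentration meeting the target, or None."""
    hits = [conc for conc, frac in sorted(results.items()) if frac >= target]
    return hits[0] if hits else None

print(optimal_concentration(screen))  # 1.0
```

In the architecture of Fig. 5, the `screen` dictionary would come from the per-experiment XML documents in the KE-layer database rather than being hard-coded.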



Fig. 7. Intermediate results represented in XML.

E. XML Documents for Extracted Information


Intermediate as well as final results of analysis are stored as XML documents. KE-layer tools store the results of information extraction in the KE-layer XML database as XML documents. Similarly, KM-layer tools store the processing results as XML documents in the KM-layer XML database. Different tools at each layer communicate intermediate results in the form of XML documents. An example of the results of feature extraction is shown in Fig. 7. These results can be passed to the classification module for recognizing objects. These knowledge services are based on the basic computational and data grid services. The lower layers manage issues like resource discovery, scheduling, data movement, security, and collaboration. Our current implementation uses Condor [29] as the basic grid infrastructure as described in Section VI. The focus of this paper is on spatiotemporal KE and representation services.
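A hedged sketch of serializing feature-extraction results as XML, in the spirit of Fig. 7. The element names are assumptions for illustration, not the paper's actual schema.

```python
import xml.etree.ElementTree as ET

# Serialize one object's extracted features as an XML fragment that a
# downstream classification module (or the KE-layer XML database) could
# consume. Element and attribute names are invented.
def features_to_xml(object_id, features):
    obj = ET.Element("Object", id=str(object_id))
    for name, value in features.items():
        ET.SubElement(obj, "Feature", name=name).text = str(value)
    return ET.tostring(obj, encoding="unicode")

doc = features_to_xml(17, {"area": 412, "meanIntensity": 96.5})
parsed = ET.fromstring(doc)
print(parsed.get("id"), parsed.find("Feature").get("name"))  # 17 area
```

Because the round trip goes through plain XML, any tool at either layer can read the intermediate results without linking against the tool that produced them.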


spatial representation. The bounding box is a simple and efficient approach for capturing spatial information about objects in an image. Even though it is imprecise compared to the convexhull- or the articulate-boundary-based analysis, its speed and simplicity make it an attractive choice for high-throughput biological imaging analysis where large numbers of cells are to be analyzed. Interobject spatial analysis can then be performed using projections on coordinate axes. An extension of temporal logic [34] to two and three dimensions is then used to describe the interobject relationships as described in [35]. Fig. 8 shows graphically some of the predicates used in our system. More predicates and functions can be defined, but only those used in later sections to explain our event representation scheme are introduced here. Qualifiers can also be specified for these predicates for different situations, for example, disjoint along with a distance between the centroids of objects, or overlap with the area of the overlapping region. Many biological imaging experiments involve studying the kinetics of the diffusion of different proteins and chemicals among multiple subcellular compartments. Monitoring the variations of spatial relations between different objects over time is highly beneficial for such experiments. For example, if a drug is designed to enter a certain compartment of the cell, it is very important to monitor the dynamics of this process. This will involve using markers that reveal the position of the drug and the location of the targeted compartment of the cell. If the drug penetrates from outside the cell into the particular region, the two markers will initially be at disjoint locations, and then, overlap as time passes. Monitoring the evolution of objects and their spatial relations over time thus can be very helpful in understanding biological processes.

498

IV. ARCHITECTURE OF THE XML-BASED GRID FOR HCS

499

XML is a general-purpose markup language that provides a means for defining special-purpose markup languages [36]. It has been extensively used as a platform-independent mechanism for describing data in many different domains [37]–[39]. Its flexibility in defining different types of data makes it suitable for description of biological objects and events. The general architecture of our system is shown in Fig. 9. This system uses a modular approach of describing objects in terms of their attributes, and then, spatial and temporal events based on the changes in attributes of the objects that constitute those events. Spatial events are described as the constituent objects along with the relationships among their spatial representations (bounding


Fig. 8. Graphical representation of three different spatial relations. (a) Disjoint. (b) Overlap. (c) Contain.

III. SPATIOTEMPORAL EVENT RECOGNITION AND REPRESENTATION

Most of the events of interest in biological imaging arise either because of changes in the attributes of individual objects or because of interactions among multiple objects. For example, a cell undergoing division changes size and shape, and then, divides into two daughter cells. In order to recognize particular events of interest, different objects in the scene need to be identified, and then, monitored over a period of time. Many approaches have been proposed in the literature for recognizing events in videos [30], [31]. These can be adapted to the requirements of event detection in biological images as well. In our system, we define predicates for spatial analysis based on bounding box, convex hull, or exact outline, depending on the speed and accuracy requirements of the application. Temporal event identification is based on a state-machine-based approach where each state is defined by the objects participating in that subevent, the specific values of their attributes, and the spatial relations between them. One of the objectives is to decouple event recognition from event representation. This way, newer events can be defined in a flexible manner and this representation can be used by event-recognition logic to identify events of interest independent of the lower level algorithms used. Objects of interest can be identified using different parameters, for example, size, shape, color, texture, or spectral information in the case of multispectral imaging [32], [33]. The spatial location of different objects relative to each other can be analyzed using bounding boxes around objects or other suitable


AHMED et al.: XML-BASED DATA MODEL AND ARCHITECTURE FOR A KNOWLEDGE-BASED GRID-ENABLED

Fig. 9. Architecture of the grid-enabled component-based system for HCS.

Fig. 10. Spatiotemporal data model.

box, convex hull, or exact outline) depending upon the speed and accuracy requirements of the application. Temporal events are generally composed of a sequence of one or more spatial events occurring over a period of time. Some temporal events, such as a monotonic increase in the size of a cell, cannot be easily broken down into constituent spatial events and require special functions. The XML editor is used for describing the attributes of objects and events manually. In fact, any text or XML editor can be used for this purpose. Alternatively, the feature extractor module can be used for specifying objects and events. The user provides example images of the objects and events, and the feature extractor automatically extracts the attributes and generates XML representations that are stored in the objects and events repository. These representations are used by object and event recognition tools for identifying their specific instances in image sets. The information about the objects and events thus identified is kept in the XML knowledge base. This knowledge base represents the results of first-level processing as explained in Section II. Analysis results with different levels of detail can then be embedded into the image data using the XML embedding module.

A. Spatiotemporal Data Model

Fig. 10 shows the entity-relation diagram for the data model. We model biological images as a composition of biological objects that are specified by their features like size, shape, spectrum, etc. The choice of the specific features to use for defining objects depends on the particular application and the choice of imaging modality. For many cases, simple features like size and color may be adequate, whereas for others, more complicated features like Haralick texture and shape descriptors may be required [40]. Representation of objects in terms of their features allows automated software tools to identify objects meeting these feature criteria in a large set of images. It also makes it possible to use the same analysis parameters for other image sets for comparison.

Events are modeled as a composition of objects. Events are defined in terms of their attributes like start time, duration, spatial relations between participating objects, and other conceptual information. As shown in Fig. 10, spatial and temporal events are defined as the two types of events. Spatial events are defined solely by the participating objects and their spatial relations. For example, two touching nuclei constitute a spatial event that is completely defined by the features of the objects and their spatial relation. The simplest spatial event is the single-object event, which is defined by the participating object along with specific values of its features. Multiple-object spatial events are defined in terms of the participating objects and their spatial relations. A spatial event persisting over a number of frames gives rise to a simple temporal event. Temporal events may involve single or multiple objects. Single-object temporal events involve the evolution of an object over time, for example, expansion of a cell after undergoing a certain treatment. Multiple-object events involve the evolution of the attributes and spatial relations of the participating objects, for example, the movement of a particular protein from nucleus to cytoplasm as the cell is undergoing expansion in response to a certain treatment. Temporal events may also be single-threaded or multithreaded [17]. A single-threaded event is one in which all constituent events happen in a linear order, whereas in a multithreaded event, one or more constituent events happen in parallel.

B. Representation of Objects

Biological cells, subcellular compartments, or other entities of interest constitute the objects. Objects are defined in terms of their parameters, which may include color, size, shape, texture, spectrum, or other parameters depending on the application. Objects can be defined either manually using the XML editor or by providing example images of the objects to the feature extractor, which extracts the relevant parameters of objects. These are then stored as XML schemas. An example schema is shown in Fig. 11(a).
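To make the feature-range idea concrete, the sketch below emits a small XML object description from a dictionary of feature ranges. This is a hypothetical Python illustration; the element and attribute names (`object`, `feature`, `min`, `max`) are invented for this example and are not taken from the paper's actual CIML schemas (cf. Fig. 11).

```python
import xml.etree.ElementTree as ET

def object_schema(name, features):
    """Build an XML description of an object from {feature: (min, max)} ranges."""
    obj = ET.Element("object", name=name)
    for fname, (lo, hi) in features.items():
        f = ET.SubElement(obj, "feature", name=fname)
        ET.SubElement(f, "min").text = str(lo)
        ET.SubElement(f, "max").text = str(hi)
    return ET.tostring(obj, encoding="unicode")

# A cell-like object defined by ranges of area and mean intensity.
print(object_schema("cancer_cell",
                    {"area": (80, 400), "mean_intensity": (0.3, 1.0)}))
```

A matching tool would parse such a description and report every segmented object whose measured features fall inside all the specified ranges, so the same analysis parameters can be reapplied to other image sets.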


Fig. 11. (a) Representation of a cancer cell. (b) Representation of the expansion phase in a cell division event.

IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 12, NO. 00, 2008


C. Representation of Spatial and Temporal Events


Spatial and temporal events can also be specified either manually or by using the feature extractor. Spatial events may involve one or more objects and may be characterized by certain values of an object’s features or by one object’s features, for example, location, in relation to another. As an example of a spatial event, one may be interested in finding all the cells that became compressed because of a certain treatment. This condition may be identified by the size or shape of the objects. Alternatively, one may wish to identify all cells that contain a specific number of subcellular objects. This task will require analyzing the location of different objects relative to each other. Temporal events arise because of changes in different parameters of objects over time. For example, diffusion of a protein into a cell from the outside environment is a temporal event, and in this situation, the diffusion rate and the particular location to which the protein binds may be of interest. In order to perform such analysis tasks, subcellular objects need to be identified along with their positions relative to each other. This information then needs to be monitored over time to extract temporal information. The XML specification of cell division as a temporal event is shown in Fig. 12(a). Salient phases in a cell division would involve expansion of the cell followed by nuclear division, and then, the splitting of the cell into two daughter cells. Figs. 11(b) and 12(b) show the XML specification of the expansion and nuclear division phases of cell division. A graphical representation of different phases of cell division is shown in Fig. 13. Temporal relations used in Fig. 13 are described in Fig. 14.
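The state-machine matching described in Section III can be sketched for these cell-division phases as follows. This is illustrative Python rather than the paper's Matlab modules; the per-frame measurements and thresholds are invented for the example.

```python
# Each state is a predicate over one frame's measurements for a tracked cell;
# the temporal event matches when the states are satisfied in order.

def expansion(cell):         # cell grows beyond a nominal size
    return cell["n_nuclei"] == 1 and cell["area"] > 150

def nuclear_division(cell):  # two nuclei inside a single cell body
    return cell["n_nuclei"] == 2 and cell["n_bodies"] == 1

def split(cell):             # cell splits into two daughter cells
    return cell["n_bodies"] == 2

def match_event(frames, states):
    """True if each state predicate is satisfied, in order, by some
    nondecreasing sequence of frames."""
    i = 0
    for frame in frames:
        if i < len(states) and states[i](frame):
            i += 1
    return i == len(states)

division = [expansion, nuclear_division, split]
track = [
    {"area": 100, "n_nuclei": 1, "n_bodies": 1},
    {"area": 180, "n_nuclei": 1, "n_bodies": 1},  # expansion
    {"area": 185, "n_nuclei": 2, "n_bodies": 1},  # nuclear division
    {"area": 90,  "n_nuclei": 1, "n_bodies": 2},  # split
]
print(match_event(track, division))  # → True
```

Because the event is specified as data (an ordered list of state predicates), recognition stays decoupled from representation: new events can be defined without touching the matching logic.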


D. Representation of Analysis Results


HCS experiments produce huge datasets. Hypothesis-driven biological imaging experiments require large cell populations in order for the results to have statistical significance. It is important to have a mechanism not only for automated KE but also for sharing the analysis results of cell-screening experiments along with the imaging data. Most such experiments aim at identifying sample subpopulations that meet certain specific criteria. Different levels of detail may be required for the extracted knowledge. For example, in some cases, it may be sufficient to identify only the different subpopulations. In others, locations of different objects may also be needed. These analysis results can then be embedded into the image data using a data format like the one introduced in the OME [4]. This approach facilitates searching such imaging data and also reduces computational time if reanalysis is required. It also makes it easy to share data and results among different research groups.

Automatic identification of spatiotemporal events in biological imaging experiments opens new avenues for event mining and rule induction. A large population of cells can be monitored over time under different perturbations and the effects can be automatically analyzed. This approach can be very helpful in drug discovery for the identification of promising targets for potential drugs.


V. AN EXAMPLE OF EVENT REPRESENTATION


Fig. 12. (a) Representation of a cell division event. (b) Representation of the nuclear division phase in a cell division event.

Fig. 13. Graphical representation of subevents for cell division.

In this section, we explain the concepts introduced in earlier sections by means of a detailed example of an apoptosis screening. The knowledge representation schemas used for this example are also used for an actual apoptosis screening that is explained in Section VI. A discussion of the process of apoptosis appears in Section I-B. The nuclei of cells are identified using a Hoechst marker. PI is used for identifying dead cells, as it can penetrate the ruptured cell membranes of dead cells but cannot penetrate normal live cells. Annexin V-FITC is used as an apoptotic marker because it binds specifically to the membrane of apoptotic cells. The objective of such a study is normally to test the efficacy of drugs in inducing apoptosis. Events of interest in this case are early apoptotic, late apoptotic, live, and dead cells. These conditions can be identified by observing the overlap between regions that contain different types of markers. Cells containing only Hoechst will be live, cells with both Hoechst and PI staining will be dead, cells with Hoechst and Annexin V-FITC will be early apoptotic, and cells with all three markers will be late apoptotic and necrotic. The XML representation for an apoptosis screen is given in Fig. 15. This screen will provide the percentage of cells in live, dead, and apoptotic states. An example of a schema for the analysis results is shown in Fig. 16.
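Assuming per-cell booleans produced by a spatial-analysis step (does each dye region overlap the nucleus?), the marker-overlap gating described above reduces to a few lines. A hedged Python sketch, with an input encoding invented for this example:

```python
# Classify a cell's state from the presence of the three markers.
def classify(hoechst, annexin, pi):
    if not hoechst:
        return "no cell"                  # Hoechst marks every nucleus
    if annexin and pi:
        return "late apoptotic/necrotic"  # all three markers present
    if annexin:
        return "early apoptotic"          # Hoechst + Annexin V-FITC
    if pi:
        return "dead"                     # Hoechst + PI
    return "live"                         # Hoechst only

# Tally a (toy) population, one (hoechst, annexin, pi) triple per cell.
population = [(True, False, False), (True, True, False),
              (True, True, True), (True, False, True)]
counts = {}
for h, a, p in population:
    state = classify(h, a, p)
    counts[state] = counts.get(state, 0) + 1
print(counts)
```

Dividing each count by the population size yields the percentages of live, dead, and apoptotic cells that the screen reports.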


VI. EXPERIMENTAL RESULTS AND DISCUSSION


Fig. 14. Temporal relations.

A subset of the proposed system has been developed. The feature-extraction, event-specification, and spatial-event-identification modules, as well as the XML parser, are developed in Matlab. Condor [29] is used as the basic grid infrastructure for the problem-solving environment, as shown in Fig. 17. Condor is a distributed software system that can harness the computing power of diverse computing resources and present them to the end user as a single resource. At our university, the Condor pool consists of many high-performance clusters and a network of commodity processors. These clusters are owned by different


Fig. 15. Representation of events involved in an apoptosis screen.

research groups and during their idle time are made available to Condor’s job scheduler. In our system, either objects can be specified manually in terms of their attributes or some representative images can be provided to the feature extractor, which extracts object features, and then, stores them as XML object representation schemas. A screen shot of the graphical interface is shown in Fig. 18. As HCS screens typically generate a large number of images, the object representation schemas can be used to automatically identify objects within all images. The spatial-analysis module provides algorithms based on bounding box, convex hull, and exact outline. Spatial events of interest can be specified manually or using the graphical interface shown in Fig. 18. The event-specification module then generates XML schemas corresponding to the spatial events of interest. This spatial-event knowledge is then used by the spatial-analysis module to find all such events in the set of collected images.
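The screen is embarrassingly parallel: each image can be analyzed against the object and event schemas independently, which is why the run times reported in Section VI scale almost linearly with the number of processors. A rough Python sketch of that decomposition, using `multiprocessing` as a stand-in for Condor's job scheduler; `analyze` is a placeholder, not the paper's Matlab module:

```python
from multiprocessing import Pool

def analyze(image_id):
    # Placeholder for per-image object identification and event matching.
    return image_id, image_id % 4   # fake count of matched events

def run_screen(image_ids, n_workers):
    """Partition the image set across workers; no coordination is needed
    because each image is processed independently."""
    with Pool(processes=n_workers) as pool:
        return pool.map(analyze, image_ids)

if __name__ == "__main__":
    results = run_screen(range(16), n_workers=4)
    print(len(results))  # → 16
```

With no shared state between tasks, doubling the worker count roughly halves the wall-clock time until per-job scheduling overhead dominates.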

Fig. 16. Representation of analysis results for the apoptosis screen.

Fig. 17. Implementation of the proposed architecture on Condor.

An apoptosis screening, as described in the previous sections, was performed using the event-specification and matching modules described earlier. Human myelogenous leukemia (HL60) cells were exposed to one of two apoptosis-inducing drugs, rotenone or camptothecin, for 24 h. A negative control consisting of cells without any treatment was also prepared. Subsequently, staining was performed. Three dyes were used: Hoechst 33342, PI, and Annexin V-FITC, as described earlier. Four hundred sets of images were collected for each of the samples using an imaging cytometer (iCys research imaging cytometer, CompuCyte Corporation, Cambridge, MA). Each set consisted of three images, one for each of the dyes used. Fig. 19 shows a representative set of images. As explained in Section V, events of interest in this case are early apoptotic, late apoptotic, live, and dead cells. These are defined in terms of spatial relationships between the three dyes used, as specified in the schema shown in Fig. 15. The spatial-analysis module used this knowledge and identified the populations of live, dead, early apoptotic, and late apoptotic cells.

Next, we compare the performance of different spatial-analysis algorithms: bounding-box-, convex-hull-, and exact-outline-based methods. We compare the algorithms in terms of their speed, accuracy, and scalability.


A. Speed and Accuracy


Fig. 18. Graphical interface for objects and events specification.

Interobject spatial analysis is useful for localizing different types of particles inside cells. For this experiment, bovine aortic endothelial cells were used. Images of 15 cells were used as templates to generate a set of 30 composite images containing 16 cells each. Each cell for the composite images was selected randomly from the set of 15 cellular image templates. Another set of 30 images was generated to simulate particles within cells. Spatial analysis was carried out using bounding-box-, convex-hull-, and exact-outline-based approaches, and the run times and accuracy were measured. For accuracy, the exact-outline approach was used as the standard, as it is the most accurate of the three spatial representations used for the analysis. The results are shown in Table I. Although the exact-outline approach is the most accurate, it has a high processing cost, whereas the bounding-box approach is the least accurate but has the lowest computational cost. The speed and accuracy of the convex-hull approach fall between those of the bounding-box and exact-outline approaches. The user can select one of these algorithms based on the speed and accuracy requirements of the application.
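The accuracy gap arises because the bounding box over-approximates an object: two shapes can have intersecting boxes yet disjoint outlines. A synthetic Python/NumPy illustration (masks invented for this example, not the paper's data):

```python
import numpy as np

def bbox(mask):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def boxes_overlap(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return not (ax1 < bx0 or bx1 < ax0 or ay1 < by0 or by1 < ay0)

n = 101
yy, xx = np.mgrid[0:n, 0:n]
disk = (xx - 50) ** 2 + (yy - 50) ** 2 <= 40 ** 2   # exact outline: a disk
probe = np.zeros((n, n), dtype=bool)
probe[0:20, 0:20] = True                             # particle near one corner

box_says = boxes_overlap(bbox(disk), bbox(probe))    # boxes do intersect
exact_says = bool((disk & probe).any())              # outlines do not
print(box_says, exact_says)  # → True False
```

The corner of the disk's bounding box contains no disk pixels, so the box test reports an overlap the exact outline rejects; the convex hull reduces, but does not eliminate, this over-approximation for non-convex shapes.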



Fig. 19. Representative set of images for the apoptosis screen showing HL60 cells in different states. (a) Hoechst 33342. (b) Annexin V-FITC. (c) PI. (d) Scatter image.

TABLE I. Comparison of different spatial analysis algorithms.

Fig. 20. Scaling the number of processors for a set of 1024 images for the apoptosis screen.

Fig. 21. Speedups for different numbers of processors for the apoptosis screen.

Fig. 22. Scaling the number of images for 16 processors for the apoptosis screen.


B. Scalability


Here, we present scalability results for an apoptosis screen. A total of 1028 images and 16 processors from the Condor pool were used. In the first experiment, the number of images was fixed at 1028 and the number of processors was varied from 1 to 16. The resulting run times and speedups are presented in Figs. 20 and 21. The speedups are linear in the number of processors, implying that a subsequent increase in the number of processors will result in a proportional improvement in run time. In the second experiment, the number of processors was fixed at 16 and the image set was varied from 64 to 1028 images. The resulting run times are shown in Fig. 22. The run time is linear as a function of the size of the dataset, which suggests that the algorithms scale well for larger datasets. In order to investigate the scalability behavior of the convex-hull- and exact-outline-based approaches, we used a set of 64 images and performed spatial analysis using convex hull and exact outline while varying the number of processors from 1 to 16. The resulting run times and speedups are shown in Figs. 23 and 24. The speedups are linear for the convex-hull approach, whereas those for the exact-outline approach are not strictly linear but scale well for a large number of processors. In another experiment, we varied the size of the image set from 32 to 512 images for 16 processors and performed spatial analysis using convex-hull- and exact-outline-based approaches. The results are shown in Fig. 25. As can be seen, the two


algorithms scale linearly as the size of the image set is increased. We conclude that all of these algorithms scale well in both the number of processors and the size of the input dataset. Moreover, these algorithms provide a tradeoff between speed and accuracy, and the most suitable one can be chosen depending upon the speed and accuracy requirements of the application.

HCS applications require powerful computing and data management resources. At the same time, these systems have to be affordable and economical. Grid technologies provide a promising solution. The realization of this solution requires powerful knowledge-based tools that combine domain knowledge with analysis algorithms to exploit the full potential of the underlying grid services. We believe that the grid-enabled HCS system and the spatiotemporal knowledge representation scheme proposed in this paper provide a powerful yet economical solution to the challenging requirements of HCS applications.

Fig. 24. Speedups for different numbers of processors for spatial analysis.

Fig. 25. Scaling the number of images for 16 processors for spatial analysis.

VII. CONCLUSION

Automatic extraction and management of spatiotemporal semantic knowledge from the large image sets produced by HCS screens requires high-throughput computing resources, efficient management of data, and intelligent knowledge-based tools. Grid technologies provide a basic infrastructure on which domain-specific services can be built. In this paper, we propose an architecture for a knowledge-based grid-enabled problem-solving environment for HCS. The two main layers of this architecture are the KE layer and the KM layer. The KE layer provides tools for extracting information about biological objects and spatiotemporal events, whereas the KM layer provides high-level knowledge representation and reasoning services. We also introduce an XML-based language for modeling biological images and for representing spatiotemporal knowledge. This mechanism provides a knowledge representation scheme that is used by event-recognition logic to identify events of interest. This approach combines domain knowledge with analysis tools to fully utilize the powerful computing and data management services provided by grid infrastructure. A subset of the proposed architecture has been implemented and evaluated.

In the future, we plan to explore data mining techniques for automatic rule induction from the spatiotemporal dynamics of biological systems. We would also like to develop algorithms for automatically learning the structure of complex spatiotemporal events from example images.


ACKNOWLEDGMENT


The authors would like to thank J. Sturgis for help with collecting images for this paper.


REFERENCES


Fig. 23. Scaling the number of processors for a set of 64 images for spatial analysis.


[1] D. J. Stephens and V. J. Allan, "Light microscopy techniques for live cell imaging," Science, vol. 300, no. 5616, pp. 82–86, Apr. 2003.
[2] K. A. Giuliano, "High-content screening: A new approach to easing key bottlenecks in the drug discovery process," J. Biomol. Screening, vol. 2, no. 4, pp. 249–259, 1997.
[3] K. A. Giuliano et al., "Systems cell biology knowledge created from high content screening," Assay Drug Develop. Technol., vol. 3, no. 5, pp. 501–514, 2005.
[4] J. R. Swedlow, I. Goldberg, E. Brauner, and P. K. Sorger, "Informatics and quantitative analysis in biological imaging," Science, vol. 300, no. 5616, pp. 100–102, Apr. 2003.
[5] X. Zhou and S. T. C. Wong, "High content cellular imaging for drug development," IEEE Signal Process. Mag., vol. 23, no. 2, pp. 170–174, Mar. 2006.
[6] K. Krauter, R. Buyya, and M. Maheswaran, "A taxonomy and survey of grid resource management systems for distributed computing," Softw. Pract. Exp., vol. 32, pp. 135–164, 2002.
[7] I. Foster and C. Kesselman, "Globus: A toolkit-based grid architecture," in The Grid: Blueprint for a New Computing Infrastructure, I. Foster and C. Kesselman, Eds. San Mateo, CA: Morgan Kaufmann, 1999, pp. 259–278.
[8] F. S. Collins, M. Morgan, and A. Patrinos, "The human genome project: Lessons from large-scale biology," Science, vol. 300, no. 5617, pp. 286–290, Apr. 2003.
[9] M. Cannataro, C. Comito, F. L. Schiavo, and P. Veltri, "Proteus, a grid based problem solving environment for bioinformatics: Architecture and experiments," IEEE Comput. Intell. Bull., vol. 3, no. 1, pp. 7–18, Feb. 2004.
[10] W. E. Johnston, "Computational and data grids in large-scale science and engineering," Future Gener. Comput. Syst., vol. 18, pp. 1085–1100, 2002.
[11] J. P. Robinson, "Principles of confocal microscopy," Methods Cell Biol., vol. 63, pp. 89–106.
[12] J. D. Robertson, S. Orrenius, and B. Zhivotovsky, "Review: Nuclear event in apoptosis," J. Struct. Biol., vol. 129, no. 2/3, pp. 346–358, Apr. 2000.
[13] W. Huh et al., "Global analysis of protein localization in budding yeast," Nature, vol. 425, pp. 686–691, Oct. 2003.
[14] K. Huang and R. F. Murphy, "From quantitative microscopy to automated image understanding," J. Biomed. Opt., vol. 9, no. 5, pp. 893–912, 2004.
[15] M. Ellisman et al., "The emerging role of biogrids," Commun. ACM, vol. 47, no. 11, pp. 52–57, Nov. 2004.
[16] S. Hastings et al., "Image processing for the grid: A toolkit for building grid-enabled image processing applications," in Proc. 3rd IEEE/ACM Int. Symp. Cluster Comput. Grid (CCGRID 2003), pp. 36–43.
[17] R. Nevatia, J. Hobbs, and B. Bolles, "An ontology for video event representation," in Proc. IEEE Workshop Event Detect. Recog., Jun. 2004, p. 119.
[18] N. Maillot, M. Thonnat, and A. Boucher, "Towards ontology-based cognitive vision," Mach. Vision Appl., vol. 16, no. 1, pp. 33–40, Dec. 2004.
[19] K. Thirunarayan, "On embedding machine-processable semantics into documents," IEEE Trans. Knowl. Data Eng., vol. 17, no. 7, pp. 1014–1018, Jul. 2005.
[20] M. Cannataro, D. Talia, and P. Trunfio, "Distributed data mining on the grid," Future Gener. Comput. Syst., vol. 18, pp. 1101–1112, 2002.
[21] S. Hastings et al., "Grid-based management of biomedical data using an XML-based distributed data management system," in Proc. 2005 ACM Symp. Appl. Comput., pp. 105–109.
[22] D. D. Roure, N. R. Jennings, and N. R. Shadbolt, "The semantic grid: Past, present, and future," Proc. IEEE, vol. 93, no. 3, pp. 669–681, Mar. 2005.
[23] H. Zhuge, "China's e-science knowledge grid environment," IEEE Intell. Syst., vol. 19, no. 1, pp. 13–17, Jan./Feb. 2004.
[24] M. Cannataro and D. Talia, "Semantics and knowledge grids: Building the next-generation grid," IEEE Intell. Syst., vol. 19, no. 1, pp. 56–63, Jan./Feb. 2004.
[25] J. C. Venter et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, Feb. 2001.
[26] K. Yoshida et al., "Design of a remote operation system for trans pacific microscopy via international advanced networks," J. Electron Microsc., vol. 51, pp. S253–S257, 2002.
[27] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: A review," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 4–37, Jan. 2000.
[28] W. M. Ahmed, M. N. Ayyaz, B. Rajwa, F. Khan, A. Ghafoor, and J. P. Robinson, "Semantic analysis of biological imaging data: Challenges and opportunities," Int. J. Semantic Comput., vol. 1, no. 1, pp. 67–85, Mar. 2007.
[29] Condor project homepage [Online]. Available: http://www.cs.wisc.edu/condor
[30] D. Ayers and M. Shah, "Monitoring human behavior from video taken in an office environment," Image Vision Comput., vol. 19, no. 12, pp. 833–846, 2001.
[31] A. F. Bobick and Y. A. Ivanov, "Action recognition using probabilistic parsing," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recog., 1998, pp. 196–202.
[32] M. V. Boland and R. F. Murphy, "A neural network classifier capable of recognizing the patterns of all major subcellular structures in fluorescence microscope images of HeLa cells," Bioinformatics, vol. 17, no. 12, pp. 1213–1223, 2001.
[33] M. E. Dickinson, G. Bearman, S. Tille, R. Lansford, and S. E. Fraser, "Multi-spectral imaging and linear unmixing add a whole new dimension to laser scanning fluorescence microscopy," BioTechniques, vol. 31, no. 6, pp. 1272, 1274–1276, 1278, 2001.
[34] J. F. Allen, "Maintaining knowledge about temporal intervals," Commun. ACM, vol. 26, no. 11, pp. 832–843, Nov. 1983.
[35] Y. F. Day, S. Dagtas, M. Iino, A. Khokhar, and A. Ghafoor, "Object-oriented conceptual modeling of video data," in Proc. 11th Int. Conf. Data Eng. (ICDE 1995), pp. 401–408.
[36] M. Klein, "XML, RDF, and relatives," IEEE Intell. Syst., vol. 16, no. 2, pp. 26–28, Mar./Apr. 2001.
[37] S. Decker et al., "The semantic Web: The roles of XML and RDF," IEEE Internet Comput., vol. 4, no. 5, pp. 63–74, Sep./Oct. 2000.
[38] A. Yao and J. Jin, "The development of a video metadata authoring and browsing system in XML," in Proc. ACM Int. Conf. Proc. Series, 2000, pp. 39–46.
[39] S. Wrede, J. Fritsch, C. Bauckhage, and G. Sagerer, "An XML based framework for cognitive vision architectures," in Proc. 17th Int. Conf. Pattern Recog., 2004, pp. 757–760.
[40] R. C. Gonzalez, R. E. Woods, and S. L. Eddins, Digital Image Processing Using Matlab. Upper Saddle River, NJ: Pearson/Prentice-Hall, 2004, ch. 11, pp. 426–483.
[41] P. Mazzatorta et al., "OpenMolGRID: Molecular science and engineering in a grid context," in Proc. 2004 Int. Conf. Parallel Distrib. Process. Tech. Appl., pp. 775–779.



Wamiq M. Ahmed (S'xx) received the B.S. degree in electronic engineering in 2000 from Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, Pakistan, and the M.S. degree in 2003 from Purdue University, West Lafayette, IN, where he is currently working toward the Ph.D. degree in the School of Electrical and Computer Engineering. His current research interests include multimedia information systems, bioinformatics, and grid computing.

Dominik Lenz received the D.V.M. degree from the Faculty of Veterinary Medicine, University of Leipzig, Leipzig, Germany, in 2000. From 2000 to 2005, he was a Postdoctoral Fellow in the Department of Pediatric Cardiology, Heart Center Leipzig, University of Leipzig. He is currently with Purdue University, West Lafayette, IN, where from 2005 to 2007, he was a Postdoctoral Fellow in the Cytometry Laboratories. His current research interests include slide-based cytometry and quantitative imaging in a variety of biological disciplines.


Jia Liu received the Bachelor of Medical Science degree from the School of Basic Medical Sciences, Peking University Health Science Center, Beijing, China, in 2001. He was in technical support for a biotechnology company for one year. In August 2003, he joined the Department of Basic Medical Sciences, Purdue University, West Lafayette, IN, as a Graduate Student. His current research interests include the role of reactive oxygen species/reactive nitrogen species (ROS/RNS) in the regulation of cell function. Mr. Liu is a Student Member of the International Society of Analytical Cytology (ISAC).



J. Paul Robinson received the Ph.D. degree in immunopathology from the University of New South Wales, Sydney, Australia. He is the SVM Professor of cytomics in the School of Veterinary Medicine, Purdue University, West Lafayette, IN, where he is a Professor in the Weldon School of Biomedical Engineering, and is currently the Director of Purdue University Cytometry Laboratories and the Deputy Director for cytomics and imaging in the Bindley Biosciences Center. He is the Editor-in-Chief of the Current Protocols in Cytometry and an Associate Editor of the Cytometry. He is an active researcher with over 110 peer-reviewed publications and 20 book chapters, has edited seven books, and has given over 80 international lectures and taught advanced courses in over a dozen countries. He has been on numerous National Institute of Health (NIH) and National Science Foundation (NSF) review boards. His current research interests include reactive oxygen species, primarily in neutrophils, but more recently in HL-60 cells and other cell lines, biochemical pathways of apoptosis as related to reactive oxygen species in mitochondria, bioengineering with hardware and software groups developing innovating technologies such as high-speed multispectral cytometry, optical tools for quantitative fluorescence measurement, and advanced classification approaches for clinical diagnostics. Prof. Robinson was elected to the College of Fellows, American Institute for Medical and Biological Engineering, in 2004. He was the recipient of the Gamma Sigma Delta Award of Merit Research in 2002 and the Pfizer Award for Innovative Research in 2004. He is currently the President of the International Society for Analytical Cytology.

Arif Ghafoor (S’83–M’83–SM’89–F’99) is currently a Professor in the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, where he is the Director of the Distributed Multimedia Systems Laboratory. He has been engaged in research areas related to multimedia information systems, database security, and parallel and distributed computing, and has published numerous papers in these areas. He has served on the editorial boards of various journals, including the Association for Computing Machinery (ACM)/Springer Multimedia Systems Journal, the Journal of Parallel and Distributed Databases, and the International Journal on Computer Networks. He has been a Guest/Co-Guest Editor for special issues of numerous journals, including the ACM/Springer Multimedia Systems Journal, the Journal of Parallel and Distributed Computing, and the International Journal on Multimedia Tools and Applications. He has coedited the book Multimedia Document Systems in Perspectives and coauthored the book Semantic Models for Multimedia Database Searching and Browsing (Kluwer, 2000). Prof. Ghafoor has also been a Guest/Co-Guest Editor for special issues of the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS and the IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. He was the recipient of the IEEE Computer Society 2000 Technical Achievement Award for his research contributions in the area of multimedia systems.






IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 12, NO. 00, 2008

XML-Based Data Model and Architecture for a Knowledge-Based Grid-Enabled Problem-Solving Environment for High-Throughput Biological Imaging

Wamiq M. Ahmed, Student Member, IEEE, Dominik Lenz, Jia Liu, J. Paul Robinson, and Arif Ghafoor, Fellow, IEEE

Abstract—High-throughput biological imaging uses automated imaging devices to collect a large number of microscopic images for the analysis of biological systems and the validation of scientific hypotheses. Efficient manipulation of these datasets for knowledge discovery requires high-performance computational resources, efficient storage, and automated tools for extracting and sharing such knowledge among different research sites. Newly emerging grid technologies provide powerful means for exploiting the full potential of these imaging techniques. Efficient utilization of grid resources requires the development of knowledge-based tools and services that combine domain knowledge with analysis algorithms. In this paper, we first investigate how grid infrastructure can facilitate high-throughput biological imaging research, and present an architecture for providing knowledge-based grid services for this field. We identify two levels of knowledge-based services: the first level provides tools for extracting spatiotemporal knowledge from image sets, and the second level provides high-level knowledge management and reasoning services. We then present cellular imaging markup language (CIML), an extensible markup language (XML)-based language for modeling biological images and representing spatiotemporal knowledge. This scheme can be used for spatiotemporal event composition, matching, and automated knowledge extraction and representation for large biological imaging datasets. We demonstrate the expressive power of this formalism by means of different examples and extensive experimental results.

Index Terms—Grid computing, high-throughput biological imaging, knowledge extraction (KE), spatiotemporal modeling.


I. INTRODUCTION


MICROSCOPIC imaging is one of the most popular means for studying biological processes. Recent advances in microscopic and optical instrumentation and automation technologies have made sophisticated imaging tools available to the biologist [1]. The rapidly evolving field of high-content/high-throughput screening (HCS/HTS) uses automated imaging instruments for image-based analysis of large populations of cells, to monitor their spatiotemporal behavior under different experimental conditions [2]. These technologies provide valuable information about biological processes and can help in generating rich information bases about different cellular systems [3]. These information bases can, in turn, help in developing a better systems-level understanding of biological organisms, and can also aid in developing new drugs. In order to exploit the full potential of these technologies, many challenges have to be overcome. Analysis of a large number of cells for validating scientific hypotheses requires high-throughput computing resources, efficient management of datasets, and knowledge-based image understanding tools that can rapidly identify relevant biological objects and events and extract high-level semantic information [4], [5]. Significant progress has been made in recent years in the areas of distributed computing and distributed data management using grid technologies [6], [7]. The availability of basic grid services provides a great opportunity for developing knowledge-based applications and services for high-throughput biological imaging. This requires the development of tools and services for spatiotemporal knowledge representation and extraction. Such tools and knowledge representation schemes would not only expedite high-throughput imaging research but would also make the new knowledge generated in the process available for the design of other experiments. It is increasingly clear that the availability of such knowledge would significantly impact future biological research [8].

Grid technologies are a promising solution for the challenging demands of high-throughput biological imaging research. Computational grids have provided high-throughput computing resources for compute-intensive applications in different scientific domains [9], [10]. Data grids have provided elegant solutions for the management of distributed data sources for applications that involve large volumes of data [10]. The combined potential of enormous computational and data resources can be best utilized by knowledge-based tools and services that can extract high-level knowledge from huge datasets, help researchers make informed decisions, and thus aid in the process of biological discovery. These tools and services can allow automated extraction of biological knowledge and high-level, information-rich metadata from large image sets and make them available to other researchers. The maturation of computational and data grids provides an opportunity for applying the



Manuscript received November 7, 2006; revised xxx. W. M. Ahmed and A. Ghafoor are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47906 USA (e-mail: [email protected]; [email protected]). D. Lenz, J. Liu, and J. P. Robinson are with the Department of Basic Medical Sciences, Purdue University, West Lafayette, IN 47906 USA (e-mail: dominik@flowcyt.cyto.purdue.edu; [email protected]; [email protected]. purdue.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TITB.2008.904153

1089-7771/$25.00 © 2008 IEEE

powerful services provided by these technologies for solving challenging problems in the domain of HCS. A key challenge for data mining and knowledge discovery in large HCS image sets is the extraction of semantic information about cell states in terms of their spatiotemporal dynamics. Once this information is extracted, advanced data mining techniques can be applied for the discovery of new knowledge from diverse repositories of these data. These issues are addressed in this paper. The specific contributions of this paper are as follows.

1) We investigate how grid infrastructure can facilitate high-throughput biological imaging research, and present an architecture for providing knowledge-based grid services for this field that builds on basic computing, communication, and data grid services. The proposed architecture provides two layers: a knowledge extraction (KE) layer and a knowledge management (KM) layer. The KE layer deals with algorithms for the extraction of spatiotemporal knowledge from images, whereas the KM layer deals with higher level knowledge representation and reasoning.

2) We introduce a data model and an extensible markup language (XML)-based language for specifying objects and spatial and temporal events. This specification of spatiotemporal knowledge can be used by object- and event-identification tools to extract spatiotemporal knowledge from image sets. Schemas are also provided for the representation of this automatically extracted knowledge, which makes this information-rich metadata available to other researchers in a machine-readable format.

There are a number of benefits to this approach. First, it can speed up the analysis of large image datasets by providing grid-enabled automated tools for the extraction of spatiotemporal knowledge that can help researchers in understanding biological systems. Second, these tools can be used for automatic extraction of spatiotemporal metadata and representation of these metadata at varying levels of detail in a machine-readable format, so that this knowledge becomes available to other researchers. Third, this approach enhances the opportunity for applying advanced data mining tools to large image sets for identifying hidden patterns in such data.


A. High-Throughput Fluorescence Microscopic Imaging


Fluorescence imaging is one of the most useful tools in the biologist's arsenal, as it enables the visual examination of biological systems using fluorescent markers [1], [11]. As many biological samples are relatively transparent, various markers or probes are used to highlight cell compartments of interest. These markers are frequently fluorescent molecules that have specific absorption and emission spectra; they emit light in their emission spectrum when excited within their absorption wavelengths. Optical filters are used to collect light in particular spectral bands. Fig. 1 shows cell nuclei stained with a Hoechst dye and cell cytoplasm stained with an Alexa 488 dye. Similarly, different cell locations can be stained with a variety of markers, and this information is used collectively for understanding biological phenomena. Recent advances in automation technologies combined with microscopic imaging techniques have given rise to the highly promising field of HCS. HCS technologies image large populations of cells in an automated fashion [2]. They allow the examination of cell populations during and subsequent to different experimental perturbations. A key feature of biological systems is their spatial and temporal dynamics in response to variations in experimental conditions. For example, the location of different proteins inside cells reveals important information about their function, and changes in the phenotype of cells are important for understanding different cell states. Monitoring these spatiotemporal dynamics is crucial to understanding biological systems.

Fig. 1. Cell nuclei and cytoplasm stained with different dyes.


B. Examples of Spatiotemporal Cellular Events


Many biological events involve spatial and temporal changes in different properties of cells and subcellular objects. Apoptosis is one such event [12]. Apoptosis is defined as programmed cell death. Every living cell spends its life cycle according to a set of instructions, and at the end, it initiates the process of apoptosis. It will be used as a detailed example in Section V and is therefore explained more fully here. Studies of apoptosis are important for cancer cell research. Cancerous cells often avoid apoptosis and instead continue to divide without control. Many anticancer drugs are designed to induce apoptosis in cancer cells, for example, to stop the expansion of a tumor. Apoptosis can be detected by using a specific surface marker, for example Annexin V fluorescein isothiocyanate (FITC), as well as other probes for staining the nucleus and cytoplasm. Another fluorescent nuclear stain called propidium iodide (PI) is used to verify cell viability. This dye cannot penetrate the intact cell membrane of healthy cells, so it is not found in nuclei. As the cell undergoes apoptosis, its cell membrane becomes more permeable and PI stains the nucleus. Cell division is another example of such events. Study of the effect of different perturbations on cell division is also useful for monitoring the progression of cancer cells. Cell division involves enlargement of a cell followed by nuclear division and the formation of daughter cells. In addition to physical changes in cells, more subtle alterations within cellular metabolic states can be observed by monitoring the production and location of various proteins. Proteomics, which is the study of the structure, function, and interactions of proteins produced by the genes of a particular cell, relies heavily on the location of different proteins inside cells. The function of proteins depends on the location where they are expressed [13]. 
Automated tools for localization of proteins and monitoring their dynamics can be very helpful for the field of proteomics [14].
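The Annexin V/PI staining logic above can be sketched as a simple per-cell decision rule. This is an illustrative sketch, not the paper's implementation; the normalized-intensity threshold and the returned labels are assumptions introduced here for clarity.

```python
# Hypothetical sketch: staging a cell from normalized marker intensities.
# The 0.5 threshold and the label names are illustrative assumptions.

def classify_cell(annexin_v: float, pi: float, threshold: float = 0.5) -> str:
    """Stage a cell from Annexin V-FITC and PI intensities in [0, 1].

    Healthy cells bind neither probe; early apoptotic cells expose
    phosphatidylserine (Annexin V+) while the intact membrane still
    excludes PI; cells positive for both probes are late apoptotic.
    """
    annexin_pos = annexin_v >= threshold
    pi_pos = pi >= threshold
    if annexin_pos and pi_pos:
        return "late apoptotic"
    if annexin_pos:
        return "early apoptotic"
    if pi_pos:
        return "necrotic"  # PI uptake without Annexin V binding
    return "healthy"
```

In a screen, such a rule would be applied to every segmented cell, with the intensities coming from the feature-extraction step described later.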


C. Related Work


Grid technologies have been applied to many problems in biological and medical sciences [10], [15]. The biomedical informatics research network (BIRN) focuses on applying grid services for neurological problems [15]. A grid-enabled image processing toolkit has been proposed in [16]. The OpenMolGRID project aims at providing grid-enabled applications for molecular science and engineering [15]. All of these projects build on basic grid services and apply grid technologies to particular domains. Like these projects, our aim is to apply the computing, communication, and data services provided by the grid to a specific domain (HCS). Additionally, we focus on providing knowledge-based services that can be used for representing and extracting spatiotemporal knowledge. These tools can be used not only for extraction of quantitative knowledge from image sets but also for automated metadata generation. We propose cellular imaging markup language (CIML) for this purpose. This paper also draws on related work in the computer vision and knowledge representation communities. A languagebased representation of video events and a markup language for video annotation based on XML have been proposed in [17]. The authors also discuss domain ontologies for physical security and meeting domains. An ontology-based approach for semantic image interpretation is proposed in [18]. The idea of embedding semantic information into multimedia documents has been proposed in [19]. Open microscopy environment (OME) has been proposed as a solution for the management of multidimensional biological image sets [4]. It uses XML schemas for saving image data along with the experiment context and analysis results. The OME focuses not on providing image analysis or KE tools but on providing a unified data format for storage and sharing of multidimensional microscopic images. 
The challenging problem of knowledge-based semantic interpretation of biological images requires the incorporation of domain-specific knowledge into intelligent image understanding tools. This requires modeling of images and mechanisms for representing spatiotemporal biological knowledge. At the same time, the high volume of image sets generated by HCS technologies requires powerful computational resources and efficient data storage and management systems. High-throughput biological imaging can thus greatly benefit from developments in the grid, computer vision, and knowledge representation communities. Based on these observations, we propose an architecture that builds upon the basic grid services and provides KE and management services for the HCS. We also propose an XML-based language for event representation and event markup with a specific focus on biological imaging. This approach enables the extraction of quantitative knowledge from large biological image sets for hypothesis-driven image analysis. The extracted knowledge can also be embedded into an image format like OME, which can be useful in solving the interoperability issues of biological imaging data produced by heterogeneous sources and makes it possible to use advanced data mining techniques for rule induction. The rest of the paper is organized as follows. Section II describes how grid technologies can be used to provide knowledge-based services for the HCS. Section III discusses the issues of spatiotemporal event recognition and representation. The architecture of our grid-enabled system for the HCS, along with our spatiotemporal model and the CIML, is presented in Section IV. An illustrative example is described in Section V. Section VI describes the implementation status and presents the results of an apoptosis screen, and the paper is concluded in Section VII.

Fig. 2. Hierarchy of grid services.


II. KNOWLEDGE-BASED GRID FOR HCS


Grid technologies have been proposed for applications that require large computational resources or deal with high volumes of data. Modeling and simulation of complex systems and research in particle physics are some examples [10]. Grids provide middleware and tools for harnessing diverse and distributed computational and storage resources in a seamless fashion and for providing the end user with a uniform front end [6]. Grid technologies were initially motivated by computationally intensive applications that required tremendous computing power and computational grids provided the required resources [6]. Development of data grids was motivated by the need for systems to analyze and process data stored at geographically diverse data storage facilities, using the computational power provided by the computational grids [20]. Great progress has been made in recent years in providing powerful middleware and services for computational and data grids [6]. Recently, there has been tremendous interest in knowledge grids as enabling technologies for better exploiting the underlying computational and data services provided by computational and data grids [16], [21]–[24]. Knowledge grid technologies aim at providing knowledge services that are tailored for particular domains, and thus, facilitate efficient utilization of available resources. Fig. 2 shows the hierarchy of grid services. In this section, we investigate how grid infrastructure can help high-throughput biological imaging, and present the architecture of a knowledgebased problem-solving environment for this rapidly expanding field. Biological imaging investigations are normally aimed at validating scientific hypotheses. The first step, therefore, is hypothesis definition, followed by experiment design. The experiment is then conducted and results are analyzed to understand the biological processes. 
Owing to the complexity of biological organisms, it is evident that future research in biological sciences and drug development will require more and more collaboration among research groups. This collaboration effort will be facilitated by automated KE and metadata-generation tools. Grid technologies, such as described in Fig. 3, provide a promising solution to meet the computational, data processing, and



Fig. 3. HCS workflow utilizing grid services.

collaboration needs of such tools to facilitate the HCS pipeline. Hypothesis development requires intensive searches for related knowledge available in diverse locations. As the data in this case happen to be in the form of large collections of images, a key requirement is to have automated KE and metadata-generation tools and formalisms for representing this extracted knowledge so that software agents can search for relevant knowledge. The design of imaging experiments would also benefit from collaboration services. For example, if several researchers are studying the proteome of the same organism, they may want to discuss the design of experiments and may even want to divide the whole proteome among themselves so as to avoid repetition of the same work. The Human Genome Project is a recent example of a project of this nature [25]. Grid services can also make remote instruments available to scientists for conducting experiments. Remote microscopy has been used to provide access to expensive instruments like electron microscopes or to obtain a second opinion from a pathologist [26]. A large drug screen may require expensive instruments like imaging cytometers at multiple geographic sites, requiring a high degree of coordination; efficient grid middleware can provide these services. Once the experiments are conducted, the results are gathered and analyzed. In the case of HCS, the transfer of large volumes of imaging data may require a high-speed communication infrastructure. These data are then transferred to appropriate locations where they are analyzed using high-performance computational resources. Two levels of analysis can be identified here. The first level is concerned with identifying basic information about the objects and events depicted in image sets; this level essentially extracts "numbers" out of images. The second level deals with data mining and reasoning based on the basic information extracted by the first-level processing. Both types of analysis require high-performance computational resources. Finally, the high-level knowledge thus extracted should be represented in a machine-readable format so that it is available for new studies. Based on this discussion, we envision two levels of knowledge services for the HCS, as shown
Both these types of analysis require high-performance computational resources. Finally, the high-level knowledge thus extracted should be represented in a machine-readable format so that it is available for new studies. Based on the discussion in previous paragraphs, we envision two levels of knowledge services for the HCS, as shown

in Fig. 4. The KE layer provides algorithms for the extraction of spatiotemporal knowledge from images, while the KM layer deals with higher level knowledge representation and reasoning. Imaging data may consist of two-dimensional microscope images, three-dimensional confocal datasets, time-lapse images, or multispectral images [4]. The KE layer provides services for identifying objects and events. Object identification by classification algorithms is based on object features such as size, area, shape, and spectrum. This layer also provides algorithms for preprocessing images for noise removal, and algorithms for registration and segmentation. Algorithms are also provided for identifying spatial and temporal events, as explained in Section III. The KM layer provides services for specifying objects and events so that they can be identified by the algorithms of the KE layer. This layer also provides data mining and reasoning services for discovering new information from the spatiotemporal information about the images. Other services provided by this layer include automatic metadata generation and visualization. The working and interaction of KE- and KM-layer tools are illustrated in Fig. 5. The user provides an event composition file in the CIML representation, as described in Section IV. The XML parser reads this description and generates workflows for KE- and KM-layer tools. The KE-layer tool manager and the KM-layer tool manager each read and execute their workflows. The information extracted by KE-layer tools is represented as XML documents and stored in a KE-layer XML database. KE-layer tools generally extract information from a dataset produced by a single experiment, whereas KM-layer tools operate on data produced by multiple experiments. For this reason, KE-layer tools operate directly on the imaging data, while KM-layer tools operate on the information extracted by KE-layer tools, as shown in Fig. 5.
The tools within KE and KM layers also exchange data in the form of XML documents.
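The parse step of Fig. 5 might be sketched as follows, assuming a minimal CIML-like layout; the element names (`event`, `ke-step`, `km-step`) are illustrative placeholders, not the published schema.

```python
# Sketch of splitting an event composition into KE- and KM-layer workflows.
# The embedded XML and its tag names are assumptions for illustration.
import xml.etree.ElementTree as ET

CIML = """
<event name="apoptosis">
  <ke-step tool="segmentation" algorithm="watershed"/>
  <ke-step tool="feature-extraction" algorithm="haralick"/>
  <km-step tool="reasoning" algorithm="rule-matching"/>
</event>
"""

def build_workflows(ciml_text):
    """Return the ordered (tool, algorithm) steps for each layer."""
    root = ET.fromstring(ciml_text)
    ke = [(s.get("tool"), s.get("algorithm")) for s in root.findall("ke-step")]
    km = [(s.get("tool"), s.get("algorithm")) for s in root.findall("km-step")]
    return {"ke": ke, "km": km}
```

Each layer's tool manager would then receive only its own step list, mirroring the separation between the two XML databases described above.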


A. Event Composition File


The event composition file describes the spatiotemporal events of interest that are to be matched in imaging datasets. This file describes not only the high-level event information in terms of semantics but also the specific algorithms used for particular steps of processing. For example, apoptosis is defined in terms of spatial overlap of different types of dyes as explained in Section V. The specific information about what segmentation algorithm and spatial representations are to be used is described in the event composition file. CIML representations of spatiotemporal events along with examples are given in Section IV.
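As an illustration, an apoptosis event composition might look like the fragment embedded below; the element names are assumptions rather than the published CIML schema. The helper simply recovers the spatial relation, the markers involved, and the segmentation algorithm named for an event.

```python
# Illustrative event composition: apoptosis as spatial overlap between the
# Annexin V-FITC signal and the PI-stained nucleus. Tag names are assumed.
import xml.etree.ElementTree as ET

COMPOSITION = """
<event-composition>
  <event name="apoptosis">
    <spatial relation="overlap">
      <object marker="AnnexinV-FITC"/>
      <object marker="PI"/>
    </spatial>
    <segmentation algorithm="watershed"/>
  </event>
</event-composition>
"""

def event_spec(text, name):
    """Return (relation, markers, segmentation algorithm) for a named event."""
    root = ET.fromstring(text)
    for ev in root.findall("event"):
        if ev.get("name") == name:
            spatial = ev.find("spatial")
            markers = [o.get("marker") for o in spatial.findall("object")]
            return (spatial.get("relation"), markers,
                    ev.find("segmentation").get("algorithm"))
    return None
```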


B. Workflows and Tool Managers





Workflows describe which processing tools and algorithms are to be used for problem solving. While the event composition file contains the semantic information about the events of interest, workflows specify which algorithms will be applied and in what order. A fragment of a workflow for segmentation and feature extraction is shown in Fig. 6. This example shows that a watershed segmentation algorithm is to be used and that Haralick texture features are to be extracted.





Fig. 4. Services provided by the proposed architecture.

Fig. 5. Interaction between KE and KM layers.

Fig. 6. Fragment of a workflow.




The task of the tool manager is to match the available tools with the user's request as specified in the workflow. Tool managers read the workflows and execute them by running the appropriate tools on the inputs.
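A minimal sketch of this matching-and-execution step follows, with toy stand-ins for real segmentation and feature-extraction tools; the registry layout and the tool outputs are illustrative assumptions, not the system's actual interfaces.

```python
# Toy tool manager: match each workflow step to a registered tool and run
# the steps in order, threading the intermediate result through.

def run_workflow(steps, registry, data):
    """Execute (tool, algorithm) steps; raise if a step has no matching tool."""
    for tool, algorithm in steps:
        if tool not in registry:
            raise KeyError(f"no registered tool for step {tool!r}")
        data = registry[tool](data, algorithm)
    return data

# Stand-ins for real image-processing tools.
registry = {
    "segmentation": lambda img, alg: {"algorithm": alg, "objects": [1, 2, 3]},
    "feature-extraction": lambda seg, alg: {"features": alg,
                                            "count": len(seg["objects"])},
}

result = run_workflow(
    [("segmentation", "watershed"), ("feature-extraction", "haralick")],
    registry, data=None)
```

The design mirrors the text: the workflow carries only step names, and the manager resolves them against whatever tools the grid site actually provides.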


C. KE-Layer Tools


As explained earlier, the KE layer provides tools that operate directly on image data. These tools include algorithms for preprocessing, segmentation, feature extraction, clustering, classification, spatial analysis, and temporal analysis. Preprocessing is generally used for removing noise and artifacts from images. For large sets of fluorescence images, variations in light intensity need to be normalized using intensity-normalization algorithms. Segmentation algorithms, such as region-based and edge-based algorithms, attempt to separate objects of interest from the background. After identifying the objects of interest, the next step is to extract object features for classification. These features include size, intensity, region and boundary descriptors, and texture features. The features extracted by feature-extraction modules are used for clustering and classification. Clustering generally refers to unsupervised classification, where objects are grouped based on the similarity of their features. Supervised classification, in contrast, uses labeled examples of classes for training classifiers. Algorithms such as neural networks, decision trees, and support vector machines are used for classification [27]. Spatial analysis refers to the analysis of an object's position with respect to others in the image, and temporal analysis refers to changes in the features of an object over time, or changes in the features and position of one object with respect to other objects over time [28].
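As a toy illustration of supervised classification over such features, the sketch below uses a nearest-centroid rule standing in for the neural-network, decision-tree, or SVM classifiers mentioned above; the feature values and class labels are made up.

```python
# Nearest-centroid classifier over extracted object features, e.g.
# (area, mean intensity) pairs. Centroid values are illustrative.

def nearest_centroid(features, centroids):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(features, centroids[label]))

# Hypothetical class centroids learned from labeled training examples.
centroids = {"nucleus": (10.0, 0.9), "cytoplasm": (40.0, 0.3)}
```

In practice, the centroids (or a trained model) would come from labeled examples, as the supervised-classification discussion above describes.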


D. KM-Layer Tools


The KM-layer tools operate on the information extracted by the KE layer. Some of the tools provided by this layer include tools for generating metadata and tools for visualizing experimental results. Other tools provided by this layer, such as data mining and reasoning tools, operate on collections of XML documents generated by the KE layer. For example, a screening experiment that uses different drug concentrations for monitoring the effect of concentration on a particular cell type would use KE-layer tools to generate separate XML documents for each concentration of the drug used. The KM-layer tools for reasoning can then be used to find the optimal drug concentration for inducing a certain phenomenon. Data mining tools can be used for finding association rules and hidden patterns in such data. The results of KM-layer processing are also represented as XML documents and stored in an XML database.
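The concentration-screening example might be sketched as follows; the XML layout, the per-document fields, and the target fraction are assumptions introduced for illustration.

```python
# KM-layer reasoning sketch: aggregate KE-layer XML documents (one per drug
# concentration) and pick the lowest concentration whose apoptotic fraction
# reaches a target. Element and attribute names are assumed.
import xml.etree.ElementTree as ET

docs = [
    '<screen concentration="0.1"><apoptotic fraction="0.05"/></screen>',
    '<screen concentration="1.0"><apoptotic fraction="0.42"/></screen>',
    '<screen concentration="10.0"><apoptotic fraction="0.81"/></screen>',
]

def optimal_concentration(xml_docs, target=0.5):
    """Lowest concentration meeting the target fraction, or None."""
    hits = []
    for text in xml_docs:
        root = ET.fromstring(text)
        conc = float(root.get("concentration"))
        frac = float(root.find("apoptotic").get("fraction"))
        if frac >= target:
            hits.append(conc)
    return min(hits) if hits else None
```

Because every experiment's results are already machine-readable XML, this kind of cross-experiment query needs no access to the original images.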


Fig. 7. Intermediate results represented in XML.

E. XML Documents for Extracted Information


Intermediate as well as final results of analysis are stored as XML documents. KE-layer tools store the results of information extraction in the KE-layer XML database as XML documents. Similarly, KM-layer tools store the processing results as XML documents in the KM-layer XML database. Different tools at each layer communicate intermediate results in the form of XML documents. An example of the results of feature extraction is shown in Fig. 7. These results can be passed to the classification module for recognizing objects. These knowledge services are based on the basic computational and data grid services. The lower layers manage issues like resource discovery, scheduling, data movement, security, and collaboration. Our current implementation uses Condor [29] as the basic grid infrastructure as described in Section VI. The focus of this paper is on spatiotemporal KE and representation services.
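In the spirit of Fig. 7, a KE-layer tool might serialize per-object features as XML along the following lines; the element layout is an assumption, not the paper's exact schema.

```python
# Sketch of serializing feature-extraction results as an XML document that
# a downstream classification module (or the KM layer) can consume.
import xml.etree.ElementTree as ET

def features_to_xml(objects):
    """Map {object_id: {feature_name: value}} to an XML string."""
    root = ET.Element("feature-extraction")
    for obj_id, feats in objects.items():
        node = ET.SubElement(root, "object", id=str(obj_id))
        for name, value in feats.items():
            ET.SubElement(node, "feature", name=name, value=str(value))
    return ET.tostring(root, encoding="unicode")

xml_out = features_to_xml({1: {"area": 154, "intensity": 0.72}})
```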

425 426 427 428 429 430 431 432 433 434 435 436 437 438 439

440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467

spatial representation. The bounding box is a simple and efficient approach for capturing spatial information about objects in an image. Even though it is imprecise compared to the convexhull- or the articulate-boundary-based analysis, its speed and simplicity make it an attractive choice for high-throughput biological imaging analysis where large numbers of cells are to be analyzed. Interobject spatial analysis can then be performed using projections on coordinate axes. An extension of temporal logic [34] to two and three dimensions is then used to describe the interobject relationships as described in [35]. Fig. 8 shows graphically some of the predicates used in our system. More predicates and functions can be defined, but only those used in later sections to explain our event representation scheme are introduced here. Qualifiers can also be specified for these predicates for different situations, for example, disjoint along with a distance between the centroids of objects, or overlap with the area of the overlapping region. Many biological imaging experiments involve studying the kinetics of the diffusion of different proteins and chemicals among multiple subcellular compartments. Monitoring the variations of spatial relations between different objects over time is highly beneficial for such experiments. For example, if a drug is designed to enter a certain compartment of the cell, it is very important to monitor the dynamics of this process. This will involve using markers that reveal the position of the drug and the location of the targeted compartment of the cell. If the drug penetrates from outside the cell into the particular region, the two markers will initially be at disjoint locations, and then, overlap as time passes. Monitoring the evolution of objects and their spatial relations over time thus can be very helpful in understanding biological processes.

498

IV. ARCHITECTURE OF THE XML-BASED GRID FOR HCS

499

XML is a general-purpose markup language that provides a means for defining special-purpose markup languages [36]. It has been extensively used as a platform-independent mechanism for describing data in many different domains [37]–[39]. Its flexibility in defining different types of data makes it suitable for description of biological objects and events. The general architecture of our system is shown in Fig. 9. This system uses a modular approach of describing objects in terms of their attributes, and then, spatial and temporal events based on the changes in attributes of the objects that constitute those events. Spatial events are described as the constituent objects along with the relationships among their spatial representations (bounding

500

IE E Pr E oo f

423

Fig. 8. Graphical representation of three different spatial relations. (a) Disjoint. (b) Overlap. (c) Contain.

III. SPATIOTEMPORAL EVENT RECOGNITION AND REPRESENTATION

Most of the events of interest in biological imaging arise either because of changes in the attributes of individual objects or because of interactions among multiple objects. For example, a cell undergoing division changes size and shape, and then, divides into two daughter cells. In order to recognize particular events of interest, different objects in the scene need to be identified, and then, monitored over a period of time. Many approaches have been proposed in the literature for recognizing events in videos [30], [31]. These can be adapted to the requirements of event detection in biological images as well. In our system, we define predicates for spatial analysis based on bounding box, convex hull, or exact outline, depending on the speed and accuracy requirements of the application. Temporal event identification is based on a state-machine-based approach where each state is defined by the objects participating in that subevent, the specific values of their attributes, and the spatial relations between them. One of the objectives is to decouple event recognition from event representation. This way, newer events can be defined in a flexible manner and this representation can be used by event-recognition logic to identify events of interest independent of the lower level algorithms used. Objects of interest can be identified using different parameters, for example size, shape, color, texture, or spectral information in the case of multispectral imaging [32], [33]. Spatial location of different objects relative to each other can be analyzed using bounding boxes around objects or other suitable

468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497

501 502 503 504 505 506 507 508 509 510 511

AHMED et al.: XML-BASED DATA MODEL AND ARCHITECTURE FOR A KNOWLEDGE-BASED GRID-ENABLED

Fig. 10.

Fig. 9.

513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532

533 534 535 536 537 538 539 540 541 542 543 544 545 546

Spatiotemporal data model.

Architecture of the grid-enabled component-based system for HCS.

box, convex hull, or exact outline) depending upon the speed and accuracy requirements of the application. Temporal events are generally composed of a sequence of one or more spatial events occurring over a period of time. Some temporal events, such as a monotonic increase in the size of a cell, cannot be easily broken down into constituent spatial events and require special functions. The XML editor is used for describing the attributes of objects and events manually. In fact, any text or XML editor can be used for this purpose. Alternatively, the feature extractor module can be used for specifying objects and events. The user provides example images of the objects and events, and the feature extractor automatically extracts the attributes and generates XML representations that are stored in the objects and events repository. These representations are used by object and event recognition tools for identifying their specific instances in image sets. The information about the objects and events thus identified is kept in the XML knowledge base. This knowledge base represents the results of first-level processing as explained in Section II. Analysis results with different levels of detail can then be embedded into the image data using the XML embedding module.

Events are modeled as a composition of objects. Events are defined in terms of their attributes like start time, duration, spatial relations between participating objects, and other conceptual information. As shown in Fig. 10, spatial and temporal events are defined as the two types of events. Spatial events are defined solely by the participating objects and their spatial relations. For example, two touching nuclei constitute a spatial event that is completely defined by the features of the objects and their spatial relation. The simplest spatial event is the single-object event that is defined by the participating object along with specific values of its features. Multiple-object spatial events are defined in terms of the participating objects and their spatial relations. A spatial event persisting over a number of frames gives rise to a simple temporal event. Temporal events may also be composed of single or multiple events. Single-object temporal events involve evolution of an object over time, for example, expansion of a cell after undergoing a certain treatment. Multiple-object events involve the evolution of the attributes and spatial relations of the participating objects, for example, the movement of a particular protein from nucleus to cytoplasm as the cell is undergoing expansion in response to a certain treatment. Temporal events may also be single-threaded or multithreaded [17]. A single-threaded event is one in which all constituent events happen in a linear order, whereas in a multithreaded event, one or more constituent events happen in parallel.

573

B. Representation of Objects

574

Biological cells, subcellular compartments, or other entities of interest constitute the objects. Objects are defined in terms of their parameters, which may include color, size, shape, texture, spectrum, or other parameters depending on the application. Objects can be defined either manually using the XML editor or by providing example images of the objects to the feature extractor that extracts the relevant parameters of objects. These are then stored as XML schemas. An example schema is shown in Fig. 11(a).

575

IE E Pr E oo f

512

7

A. Spatiotemporal Data Model

Fig. 10 shows the entity-relation diagram for the data model. We model biological images as a composition of biological objects that are specified by their features like size, shape, spectrum, etc. The choice of the specific features to use for defining objects depends on the particular application and the choice of imaging modality. For many cases, simple features like size and color may be adequate, whereas for others, more complicated features like Haralick texture and shape descriptors may be required [40]. Representation of objects in terms of their features allows automated software tools to identify objects meeting these feature criteria in a large set of images. It also makes it possible to use the same analysis parameters for other image sets for comparison.

547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572

576 577 578 579 580 581 582 583

8

(a) Representation of a cancer cell. (b) Representation of the expansion phase in a cell division event.

IE E Pr E oo f

Fig 11.

IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 12, NO. 00, 2008

584
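A recognition tool consumes such object representations by turning the stored feature ranges into a filter over detected objects. The sketch below assumes a hypothetical schema layout with per-feature min/max bounds; the element names and feature values are illustrative, not the schema actually shown in the figure.

```python
import xml.etree.ElementTree as ET

# Illustrative object representation: feature ranges for one cell type.
schema = """
<object name="cancer_cell">
  <feature name="size" min="80" max="300"/>
  <feature name="circularity" min="0.6" max="1.0"/>
</object>
"""

def load_ranges(xml_text):
    """Parse an object schema into {feature: (min, max)}."""
    root = ET.fromstring(xml_text)
    return {f.get("name"): (float(f.get("min")), float(f.get("max")))
            for f in root.findall("feature")}

def matches(measured, ranges):
    """An object matches if every schema feature falls within its range."""
    return all(lo <= measured.get(name, float("nan")) <= hi
               for name, (lo, hi) in ranges.items())

ranges = load_ranges(schema)
print(matches({"size": 150, "circularity": 0.8}, ranges))  # True
print(matches({"size": 40, "circularity": 0.8}, ranges))   # False
```

Because the criteria live in the XML rather than in code, the same filter can be replayed unchanged over other image sets, which is the comparability property noted above.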

C. Representation of Spatial and Temporal Events


Spatial and temporal events can also be specified either manually or by using the feature extractor. Spatial events may involve one or more objects and may be characterized by certain values of an object’s features or by one object’s features, for example, location, in relation to another. As an example of a spatial event, one may be interested in finding all the cells that became compressed because of a certain treatment. This condition may be identified by the size or shape of the objects. Alternatively, one may wish to identify all cells that contain a specific number of subcellular objects. This task will require analyzing the location of different objects relative to each other. Temporal events arise because of changes in different parameters of objects over time. For example, diffusion of a protein into a cell from the outside environment is a temporal event, and in this situation, the diffusion rate and the particular location to which the protein binds may be of interest. In order to perform such analysis tasks, subcellular objects need to be identified along with their positions relative to each other. This information then needs to be monitored over time to extract temporal information. The XML specification of cell division as a temporal event is shown in Fig. 12(a). Salient phases in a cell division would involve expansion of the cell followed by nuclear division, and then, the splitting of the cell into two daughter cells. Figs. 11(b) and 12(b) show the XML specification of the expansion and nuclear division phases of cell division. A graphical representation of different phases of cell division is shown in Fig. 13. Temporal relations used in Fig. 13 are described in Fig. 14.
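The state-machine decomposition of cell division described above can be sketched directly: each state corresponds to one phase (a spatial event), and the machine advances whenever the next phase is observed in the frame stream. The phase names and per-frame labels below are illustrative assumptions.

```python
# Phases of a composite temporal event (cell division), in order.
PHASES = ["expansion", "nuclear_division", "splitting"]

def match_event(frame_labels, phases=PHASES):
    """Return True if the phases occur in order within the labeled frames."""
    state = 0
    for label in frame_labels:
        if state < len(phases) and label == phases[state]:
            state += 1                  # subevent observed: advance the machine
    return state == len(phases)        # accepting state reached?

# Frame labels produced by lower-level spatial analysis (illustrative):
frames = ["idle", "expansion", "expansion", "nuclear_division",
          "idle", "splitting"]
print(match_event(frames))  # True: all three phases occurred in order
```

Decoupling the phase list from the matching loop is the point made in Section III: new events are defined as data, and the same recognition logic is reused.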


D. Representation of Analysis Results


HCS experiments produce huge datasets. Hypothesis-driven biological imaging experiments require large cell populations in order for the results to have statistical significance. It is important to have a mechanism not only for automated KE but also for sharing analysis results of cell screening experiments along with the imaging data. Most such experiments aim at identifying sample subpopulations that meet certain specific criteria. Different levels of detail may be required for the extracted knowledge. For example, in some cases, it may be sufficient to identify only the different subpopulations; in others, the locations of different objects may also be needed. These analysis results can then be embedded into the image data using a data format like the one introduced in the OME [4]. This approach facilitates searching such imaging data and also reduces computational time if reanalysis is required. It also makes it easy to share data and results among different research groups.

Automatic identification of spatiotemporal events in biological imaging experiments opens new avenues for event mining and rule induction. A large population of cells can be monitored over time under different perturbations, and the effects can be automatically analyzed. This approach can be very helpful in drug discovery for the identification of promising targets for potential drugs.


V. AN EXAMPLE OF EVENT REPRESENTATION


Fig. 12. (a) Representation of a cell division event. (b) Representation of the nuclear division phase in a cell division event.
Fig. 13. Graphical representation of subevents for cell division.

In this section, we explain the concepts introduced in earlier sections by means of a detailed example of an apoptosis screening. The knowledge representation schemas used for this example are also used for an actual apoptosis screening that is explained in Section VI. A discussion of the process of apoptosis appears in Section I-B. The nuclei of cells are identified using a Hoechst marker. PI is used for identifying dead cells, as it can penetrate the ruptured cell membranes of dead cells but cannot penetrate normal live cells. Annexin V-FITC is used as an apoptotic marker because it binds specifically to the membrane of apoptotic cells. The objective of such a study is normally to test the efficacy of drugs in inducing apoptosis. Events of interest in this case are early apoptotic, late apoptotic, live, and dead cells. These conditions can be identified by observing the overlap between regions that contain the different markers. Cells containing only Hoechst will be live, cells with both Hoechst and PI staining will be dead, cells with Hoechst and Annexin V-FITC will be early apoptotic, and cells with all three markers will be late apoptotic and necrotic. The XML representation for an apoptosis screen is given in Fig. 15. This screen will provide the percentage of cells in live, dead, and apoptotic states. An example of a schema for the analysis results is shown in Fig. 16.
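The marker-based decision rules just described can be written down directly. This is a sketch of the classification logic only, not the XML-driven implementation used in the system:

```python
def classify_cell(hoechst, pi, annexin):
    """Classify a cell from boolean marker presence, per the rules in the text."""
    if not hoechst:
        return "unstained"        # no nuclear marker: not a scored cell
    if pi and annexin:
        return "late apoptotic"   # all three markers present
    if pi:
        return "dead"             # Hoechst + PI
    if annexin:
        return "early apoptotic"  # Hoechst + Annexin V-FITC
    return "live"                 # Hoechst only

print(classify_cell(True, False, False))  # live
print(classify_cell(True, True, True))    # late apoptotic
```

In the actual system, the three boolean inputs come from spatial overlap tests between the per-dye image regions, as specified in the event schema of Fig. 15.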


VI. EXPERIMENTAL RESULTS AND DISCUSSION


Fig. 14. Temporal relations.
Fig. 15. Representation of events involved in an apoptosis screen.

A subset of the proposed system has been developed. The feature-extraction, event-specification, and spatial-event-identification modules, as well as the XML parser, are developed in Matlab. Condor [29] is used as the basic grid infrastructure for the problem-solving environment, as shown in Fig. 17. Condor is a distributed software system that can harness the computing power of diverse computing resources and present them to the end user as a single resource. At our university, the Condor pool consists of many high-performance clusters and a network of commodity processors. These clusters are owned by different research groups and, during their idle time, are made available to Condor's job scheduler.

In our system, objects can either be specified manually in terms of their attributes, or representative images can be provided to the feature extractor, which extracts the object features and then stores them as XML object representation schemas. A screen shot of the graphical interface is shown in Fig. 18. As HCS screens typically generate a large number of images, the object representation schemas can be used to automatically identify objects within all images. The spatial-analysis module provides algorithms based on the bounding box, convex hull, and exact outline. Spatial events of interest can be specified manually or using the graphical interface shown in Fig. 18. The event-specification module then generates XML schemas corresponding to the spatial events of interest. This spatial-event knowledge is then used by the spatial-analysis module to find all such events in the set of collected images.
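Farming such image batches out to the Condor pool takes only a short submit description file. The sketch below assumes a hypothetical `spatial_analysis` executable that takes a batch index and an event schema; the file names are illustrative.

```
# Hypothetical Condor submit description: queue 16 copies of the
# spatial-analysis job, one batch of images per processor.
universe    = vanilla
executable  = spatial_analysis
arguments   = --batch $(Process) --schema apoptosis_events.xml
output      = batch_$(Process).out
error       = batch_$(Process).err
log         = screen.log
queue 16
```

Condor expands `$(Process)` to 0–15, so each queued job picks up a distinct batch; this is the mechanism behind the processor-scaling experiments reported below.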

An apoptosis screening, as described in the previous sections, was performed using the event specification and matching modules described earlier. Human myelogenous leukemia (HL60) cells were exposed to one of two apoptosis-inducing drugs, rotenone or camptothecin, for 24 h. A negative control consisting of cells without any treatment was also prepared. Subsequently, staining was performed with three dyes: Hoechst 33342, PI, and Annexin V-FITC, as described earlier. Four hundred sets of images were collected for each of the samples using an imaging cytometer (iCys research imaging cytometer, CompuCyte Corporation, Cambridge, MA). Each set consisted of three images, one for each of the dyes used. Fig. 19 shows a representative set of images. As explained in Section V, the events of interest in this case are early apoptotic, late apoptotic, live, and dead cells. These are defined in terms of spatial relationships between the three dyes used, as specified in the schema shown in Fig. 15. The spatial-analysis module used this knowledge and identified the populations of live, dead, early apoptotic, and late apoptotic cells.

Fig. 16. Representation of analysis results for the apoptosis screen.
Fig. 17. Implementation of the proposed architecture on Condor.

Next, we compare the performance of the different spatial-analysis algorithms: the bounding-box-, convex-hull-, and exact-outline-based methods. We compare them in terms of speed, accuracy, and scalability.


A. Speed and Accuracy


Fig. 18. Graphical interface for objects and events specification.

Interobject spatial analysis is useful for localizing different types of particles inside cells. For this experiment, bovine aortal endothelial cells were used. Images of 15 cells were used as templates to generate a set of 30 composite images containing 16 cells each. Each cell in the composite images was selected randomly from the set of 15 cellular image templates. Another set of 30 images was generated to simulate particles within cells. Spatial analysis was carried out using the bounding-box-, convex-hull-, and exact-outline-based approaches, and the run times and accuracy were measured. For accuracy, the exact-outline approach was used as the standard, as it is the most accurate of the three spatial representations used in the analysis. The results are shown in Table I. Although the exact-outline approach is the most accurate, it has a high processing cost, whereas the bounding-box approach is the least accurate but has the lowest computational cost. The speed and accuracy of the convex-hull approach fall between those of the bounding-box and exact-outline approaches. The user can select one of these algorithms based on the speed and accuracy requirements of the application.
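The speed of the bounding-box method comes from the fact that its predicates reduce to interval comparisons on the coordinate axes, whereas the convex-hull and exact-outline tests require geometric intersection. A minimal sketch of the three relations of Fig. 8, with boxes given as (xmin, ymin, xmax, ymax):

```python
def disjoint(a, b):
    """Boxes are disjoint if their projections fail to overlap on some axis."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax1 < bx0 or bx1 < ax0 or ay1 < by0 or by1 < ay0

def contains(a, b):
    """Box a contains box b if b lies within a on both axes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx0 and ay0 <= by0 and bx1 <= ax1 and by1 <= ay1

def overlap(a, b):
    """Boxes overlap if they are neither disjoint nor nested."""
    return not disjoint(a, b) and not contains(a, b) and not contains(b, a)

cell, particle = (0, 0, 10, 10), (2, 2, 5, 5)
print(contains(cell, particle))          # True: the particle lies inside the cell
print(disjoint(cell, (20, 20, 25, 25)))  # True
```

Each predicate is a handful of comparisons per object pair, which is why the bounding-box method wins on run time in Table I at the cost of precision near curved boundaries.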

Fig. 19. Representative set of images for the apoptosis screen showing HL60 cells in different states. (a) Hoechst 33342. (b) Annexin V-FITC. (c) PI. (d) Scatter image.

TABLE I. COMPARISON OF DIFFERENT SPATIAL ANALYSIS ALGORITHMS

B. Scalability

Here, we present scalability results for the apoptosis screen. A total of 1024 images and 16 processors from the Condor pool were used. In the first experiment, the number of images was fixed at 1024 and the number of processors was varied from 1 to 16. The resulting run times and speedups are presented in Figs. 20 and 21. The speedups are linear in the number of processors, implying that a further increase in the number of processors will result in a proportional improvement in run time. In the second experiment, the number of processors was fixed at 16 and the image set was varied from 64 to 1024 images. The resulting run times are shown in Fig. 22. The run time is linear as a function of the size of the dataset, which suggests that the algorithms scale well for larger datasets.

Fig. 20. Scaling the number of processors for a set of 1024 images for the apoptosis screen.
Fig. 21. Speedups for different numbers of processors for the apoptosis screen.
Fig. 22. Scaling the number of images for 16 processors for the apoptosis screen.

In order to investigate the scalability behavior of the convex-hull- and exact-outline-based approaches, we used a set of 64 images and performed spatial analysis using the convex hull and the exact outline while varying the number of processors from 1 to 16. The resulting run times and speedups are shown in Figs. 23 and 24. The speedups are linear for the convex-hull approach, whereas those for the exact-outline approach are not strictly linear but scale well for a large number of processors. In another experiment, we varied the size of the image set from 32 to 512 images on 16 processors and performed spatial analysis using the convex-hull- and exact-outline-based approaches. The results are shown in Fig. 25. As can be seen, the two algorithms scale linearly as the size of the image set is increased. We conclude that all of these algorithms scale well in both the number of processors and the size of the input dataset. Moreover, these algorithms provide a tradeoff between speed and accuracy, and the most suitable one can be chosen depending upon the speed and accuracy requirements of the application.

Fig. 23. Scaling the number of processors for a set of 64 images for spatial analysis.
Fig. 24. Speedups for different numbers of processors for spatial analysis.
Fig. 25. Scaling the number of images for 16 processors for spatial analysis.

HCS applications require powerful computing and data management resources. At the same time, these systems have to be affordable and economical. Grid technologies provide a promising solution. The realization of this solution requires powerful knowledge-based tools that combine domain knowledge with analysis algorithms to exploit the full potential of the underlying grid services. We believe that the grid-enabled HCS system and the spatiotemporal knowledge representation scheme proposed in this paper provide a powerful yet economical solution to the challenging requirements of HCS applications.

VII. CONCLUSION

Automatic extraction and management of spatiotemporal semantic knowledge from the large image sets produced by HCS screens requires high-throughput computing resources, efficient management of data, and intelligent knowledge-based tools. Grid technologies provide a basic infrastructure on which domain-specific services can be built. In this paper, we propose an architecture for a knowledge-based grid-enabled problem-solving environment for HCS. The two main layers of this architecture are the KE layer and the KM layer. The KE layer provides tools for extracting information about biological objects and spatiotemporal events, whereas the KM layer provides high-level knowledge representation and reasoning services. We also introduce an XML-based language for modeling biological images and for representing spatiotemporal knowledge. This mechanism provides a knowledge representation scheme that is used by event-recognition logic to identify events of interest. This approach combines domain knowledge with analysis tools to fully utilize the powerful computing and data management services provided by grid infrastructure. A subset of the proposed architecture has been implemented and evaluated. In the future, we plan to explore data mining techniques for automatic rule induction from the spatiotemporal dynamics of biological systems. We would also like to develop algorithms for automatically learning the structure of complex spatiotemporal events from example images.

ACKNOWLEDGMENT

The authors would like to thank J. Sturgis for help with collecting images for this paper.

REFERENCES

[1] D. J. Stephens and V. J. Allan, "Light microscopy techniques for live cell imaging," Science, vol. 300, no. 5616, pp. 82–86, Apr. 2003.
[2] K. A. Giuliano, "High-content screening: A new approach to easing key bottlenecks in the drug discovery process," J. Biomol. Screening, vol. 2, no. 4, pp. 249–259, 1997.
[3] K. A. Giuliano et al., "Systems cell biology knowledge created from high content screening," Assay Drug Develop. Technol., vol. 3, no. 5, pp. 501–514, 2005.
[4] J. R. Swedlow, I. Goldberg, E. Brauner, and P. K. Sorger, "Informatics and quantitative analysis in biological imaging," Science, vol. 300, no. 5616, pp. 100–102, Apr. 2003.
[5] X. Zhou and S. T. C. Wong, "High content cellular imaging for drug development," IEEE Signal Process. Mag., vol. 23, no. 2, pp. 170–174, Mar. 2006.
[6] K. Krauter, R. Buyya, and M. Maheswaran, "A taxonomy and survey of grid resource management systems for distributed computing," Softw. Practice Exp., vol. 32, pp. 135–164, 2002.
[7] I. Foster and C. Kesselman, "Globus: A toolkit-based grid architecture," in The Grid: Blueprint for a New Computing Infrastructure, I. Foster and C. Kesselman, Eds. San Mateo, CA: Morgan Kaufmann, 1999, pp. 259–278.
[8] F. S. Collins, M. Morgan, and A. Patrinos, "The human genome project: Lessons from large-scale biology," Science, vol. 300, no. 5617, pp. 286–290, Apr. 2003.
[9] M. Cannataro, C. Comito, F. L. Schiavo, and P. Veltri, "Proteus, a grid based problem solving environment for bioinformatics: Architecture and experiments," IEEE Comput. Intell. Bull., vol. 3, no. 1, pp. 7–18, Feb. 2004.
[10] W. E. Johnston, "Computational and data grids in large-scale science and engineering," Future Gener. Comput. Syst., vol. 18, pp. 1085–1100, 2002.
[11] J. P. Robinson, "Principles of confocal microscopy," Methods Cell Biol., vol. 63, pp. 89–106.
[12] J. D. Robertson, S. Orrenius, and B. Zhivotovsky, "Review: Nuclear events in apoptosis," J. Struct. Biol., vol. 129, no. 2/3, pp. 346–358, Apr. 2000.
[13] W. Huh et al., "Global analysis of protein localization in budding yeast," Nature, vol. 425, pp. 686–691, Oct. 2003.
[14] K. Huang and R. F. Murphy, "From quantitative microscopy to automated image understanding," J. Biomed. Opt., vol. 9, no. 5, pp. 893–912, 2004.
[15] M. Ellisman et al., "The emerging role of biogrids," Commun. ACM, vol. 47, no. 11, pp. 52–57, Nov. 2004.
[16] S. Hastings et al., "Image processing for the grid: A toolkit for building grid-enabled image processing applications," in Proc. 3rd IEEE/ACM Int. Symp. Cluster Comput. Grid (CCGRID 2003), pp. 36–43.
[17] R. Nevatia, J. Hobbs, and B. Bolles, "An ontology for video event representation," in Proc. IEEE Workshop Event Detect. Recog., Jun. 2004, p. 119.
[18] N. Maillot, M. Thonnat, and A. Boucher, "Towards ontology-based cognitive vision," Mach. Vision Appl., vol. 16, no. 1, pp. 33–40, Dec. 2004.
[19] K. Thirunarayan, "On embedding machine-processable semantics into documents," IEEE Trans. Knowl. Data Eng., vol. 17, no. 7, pp. 1014–1018, Jul. 2005.
[20] M. Cannataro, D. Talia, and P. Trunfio, "Distributed data mining on the grid," Future Gener. Comput. Syst., vol. 18, pp. 1101–1112, 2002.
[21] S. Hastings et al., "Grid-based management of biomedical data using an XML-based distributed data management system," in Proc. 2005 ACM Symp. Appl. Comput., pp. 105–109.
[22] D. D. Roure, N. R. Jennings, and N. R. Shadbolt, "The semantic grid: Past, present, and future," Proc. IEEE, vol. 93, no. 3, pp. 669–681, Mar. 2005.
[23] H. Zhuge, "China's e-science knowledge grid environment," IEEE Intell. Syst., vol. 19, no. 1, pp. 13–17, Jan./Feb. 2004.
[24] M. Cannataro and D. Talia, "Semantics and knowledge grids: Building the next-generation grid," IEEE Intell. Syst., vol. 19, no. 1, pp. 56–63, Jan./Feb. 2004.
[25] J. C. Venter et al., "The sequence of the human genome," Science, vol. 291, no. 5507, pp. 1304–1351, Feb. 2001.
[26] K. Yoshida et al., "Design of a remote operation system for trans pacific microscopy via international advanced networks," J. Electron Microsc., vol. 51, pp. S253–S257, 2002.
[27] A. K. Jain, R. P. W. Duin, and J. Mao, "Statistical pattern recognition: A review," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 4–37, Jan. 2000.
[28] W. M. Ahmed, M. N. Ayyaz, B. Rajwa, F. Khan, A. Ghafoor, and J. P. Robinson, "Semantic analysis of biological imaging data: Challenges and opportunities," Int. J. Semantic Comput., vol. 1, no. 1, pp. 67–85, Mar. 2007.
[29] Condor project homepage [Online]. Available: http://www.cs.wisc.edu/condor
[30] D. Ayers and M. Shah, "Monitoring human behavior from video taken in an office environment," Image Vision Comput., vol. 19, no. 12, pp. 833–846, 2001.
[31] A. F. Bobick and Y. A. Ivanov, "Action recognition using probabilistic parsing," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recog., 1998, pp. 196–202.
[32] M. V. Boland and R. F. Murphy, "A neural network classifier capable of recognizing the patterns of all major subcellular structures in fluorescence microscope images of HeLa cells," Bioinformatics, vol. 17, no. 12, pp. 1213–1223, 2001.
[33] M. E. Dickinson, G. Bearman, S. Tille, R. Lansford, and S. E. Fraser, "Multi-spectral imaging and linear unmixing add a whole new dimension to laser scanning fluorescence microscopy," BioTechniques, vol. 31, no. 6, pp. 1272, 1274–1276, 1278, 2001.
[34] J. F. Allen, "Maintaining knowledge about temporal intervals," Commun. ACM, vol. 26, no. 11, pp. 832–843, Nov. 1983.
[35] Y. F. Day, S. Dagtas, M. Iino, A. Khokhar, and A. Ghafoor, "Object-oriented conceptual modeling of video data," in Proc. 11th Int. Conf. Data Eng. (ICDE 1995), pp. 401–408.
[36] M. Klein, "XML, RDF, and relatives," IEEE Intell. Syst., vol. 16, no. 2, pp. 26–28, Mar./Apr. 2001.
[37] S. Decker et al., "The semantic Web: The roles of XML and RDF," IEEE Internet Comput., vol. 4, no. 5, pp. 63–74, Sep./Oct. 2000.
[38] A. Yao and J. Jin, "The development of a video metadata authoring and browsing system in XML," in Proc. ACM Int. Conf. Proc. Series, 2000, pp. 39–46.
[39] S. Wrede, J. Fritsch, C. Bauckhage, and G. Sagerer, "An XML based framework for cognitive vision architectures," in Proc. 17th Int. Conf. Pattern Recog., 2004, pp. 757–760.
[40] R. C. Gonzalez, R. E. Woods, and S. L. Eddins, Digital Image Processing Using MATLAB. Upper Saddle River, NJ: Pearson/Prentice-Hall, 2004, ch. 11, pp. 426–483.
[41] P. Mazzatorta et al., "OpenMolGRID: Molecular science and engineering in a grid context," in Proc. 2004 Int. Conf. Parallel Distrib. Process. Tech. Appl., pp. 775–779.


IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 12, NO. 00, 2008

Wamiq M. Ahmed (S’xx) received the B.S. degree in electronic engineering in 2000 from the Ghulam Ishaq Khan Institute of Engineering Sciences and Technology, Topi, Pakistan, and the M.S. degree in 2003 from Purdue University, West Lafayette, IN, where he is currently working toward the Ph.D. degree in the School of Electrical and Computer Engineering.
His current research interests include multimedia information systems, bioinformatics, and grid computing.

Dominik Lenz received the D.V.M. degree from the Faculty of Veterinary Medicine, University of Leipzig, Leipzig, Germany, in 2000. From 2000 to 2005, he was a Postdoctoral Fellow in the Department of Pediatric Cardiology, Heart Center Leipzig, University of Leipzig. He is currently with Purdue University, West Lafayette, IN, where, from 2005 to 2007, he was a Postdoctoral Fellow in the Cytometry Laboratories.
His current research interests include slide-based cytometry and quantitative imaging in a variety of biological disciplines.


Jia Liu received the Bachelor of Medical Science degree from the School of Basic Medical Sciences, Peking University Health Science Center, Beijing, China, in 2001. He then worked in technical support for a biotechnology company for one year. In August 2003, he joined the Department of Basic Medical Sciences, Purdue University, West Lafayette, IN, as a Graduate Student.
His current research interests include the role of reactive oxygen species/reactive nitrogen species (ROS/RNS) in the regulation of cell function.
Mr. Liu is a Student Member of the International Society of Analytical Cytology (ISAC).



J. Paul Robinson received the Ph.D. degree in immunopathology from the University of New South Wales, Sydney, Australia. He is the SVM Professor of cytomics in the School of Veterinary Medicine, Purdue University, West Lafayette, IN, where he is a Professor in the Weldon School of Biomedical Engineering, and is currently the Director of the Purdue University Cytometry Laboratories and the Deputy Director for cytomics and imaging in the Bindley Biosciences Center. He is the Editor-in-Chief of Current Protocols in Cytometry and an Associate Editor of Cytometry. He is an active researcher with over 110 peer-reviewed publications and 20 book chapters, has edited seven books, and has given over 80 international lectures and taught advanced courses in over a dozen countries. He has served on numerous National Institutes of Health (NIH) and National Science Foundation (NSF) review boards. His current research interests include reactive oxygen species, primarily in neutrophils but more recently in HL-60 cells and other cell lines; biochemical pathways of apoptosis as related to reactive oxygen species in mitochondria; and bioengineering, with hardware and software groups developing innovative technologies such as high-speed multispectral cytometry, optical tools for quantitative fluorescence measurement, and advanced classification approaches for clinical diagnostics.
Prof. Robinson was elected to the College of Fellows, American Institute for Medical and Biological Engineering, in 2004. He was the recipient of the Gamma Sigma Delta Award of Merit for Research in 2002 and the Pfizer Award for Innovative Research in 2004. He is currently the President of the International Society for Analytical Cytology.

Arif Ghafoor (S’83–M’83–SM’89–F’99) is currently a Professor in the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, where he is the Director of the Distributed Multimedia Systems Laboratory. He is engaged in research areas related to multimedia information systems, database security, and parallel and distributed computing, and has published numerous papers in these areas. He has served on the Editorial Boards of various journals, including the Association for Computing Machinery (ACM)/Springer Multimedia Systems Journal, the Journal of Parallel and Distributed Databases, and the International Journal on Computer Networks. He has been a Guest/Co-Guest Editor for special issues of numerous journals, including the ACM/Springer Multimedia Systems Journal, the Journal of Parallel and Distributed Computing, and the International Journal on Multimedia Tools and Applications. He has coedited the book "Multimedia Document Systems in Perspectives" and coauthored the book "Semantic Models for Multimedia Database Searching and Browsing" (Kluwer, 2000).
Prof. Ghafoor has also been a Guest/Co-Guest Editor for special issues of journals such as the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS and the IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING. He was the recipient of the IEEE Computer Society 2000 Technical Achievement Award for his research contributions in the area of multimedia systems.


