
Handbook of Research on Advanced Techniques in Diagnostic Imaging and Biomedical Applications, Themis P. Exarchos et al (Eds) IGI Global, Copyright © 2009. DOI: 10.4018/978-1-60566-314-2.ch001

Computational methods and tools for decision support in biomedicine: A critical overview

Ioannis N. Dimou, Michalis E. Zervakis, David Lowe and Manolis Tsiknakis

Abstract

The automation of diagnostic tools and the increasing availability of extensive medical datasets in the last decade have triggered the development of new analytic methodologies in the context of medical informatics. The aim is always to explore the problems' feature spaces, extract useful information and support clinicians in their time-, volume- and accuracy-demanding decision-making tasks. From simple summarizing statistics to state-of-the-art pattern analysis algorithms, the underlying principles that drive all medical problems show trends that can be identified and taken into account to improve the usefulness of computerized medicine to the field clinicians and ultimately to the patient. This chapter is a thorough effort to map this field and highlight the achievements, shortcomings and assumptions of each family of methods. We have focused on methodological issues so as to draw general conclusions from the large number of notable, yet too case-specific, works presented in the field. Most importantly, we attempt to identify general trends and pitfalls that occur during the research process.

1 Introduction (Assumptions)

Contemporary and future methods of healthcare delivery will exploit new technology, novel sensing devices and a plethora of modes of information, served from distributed data sources. This raw data is inevitably increasing in volume and complexity at a rate faster than the ability of primary healthcare providers to access and understand it. The availability of technologies such as the Grid will allow greater accessibility of raw data, but does not address the problem of conveying timely information in the appropriate form to the individual or the clinician. The raw data will generally be unintelligible to the medical or technical non-specialist in that specific area. Several countries are now considering some of these issues of integrated personalised healthcare, and the requirement for automated 'smart' data mining methodologies that provide medical decision support to the clinician (and the individual), incorporating more reliable 'expert' data analysis based on principled pattern processing methodologies. Within this environment, the area of medical imaging, with its various structural (CT, MRI, US) and functional (PET, fMRI) modalities, is probably at the top of the list with respect to the amount of raw data generated. Most of these modalities are explored in other chapters of this volume. Even though visualization by human experts enables the accurate localization of anatomic structures and/or temporal events, their systematic evaluation requires the algorithmic extraction of certain characteristic features that encode the anatomic or functional properties under scrutiny. Such imaging features, as markers of a disease, can then be integrated with other clinical, biological
and genomic markers towards the most effective diagnostic, prognostic and therapeutic actions. It is the purpose of this chapter to address issues related to the decision-making process, trace developments in infrastructure and techniques, and explore new frontiers in this area.

1.1 The medical informatics revolution (infrastructure, tools)

During the last decades a constant shift has been observed in the medical field. Doctors are increasingly supported by advanced sensing equipment. These tools provide objective diagnostic information which reduces the margin of error in the diagnosis and prognosis of diseases. Detailed imaging techniques map the human body, advanced signal processing characterizes biosignals, and biochemical analyses are now largely automated, faster and more accurate. In the broader medical research field, larger datasets of patients including multiple covariates are becoming available for analysis. This relative data abundance has resulted in an explosion of papers reporting thorough statistical analysis with data mining and pattern classification techniques. New findings are more easily made public through the internet, and cheap processing power aids the development of complex models of diseases, drugs and their effects. Medical informatics has been defined as the field that "concerns itself with the cognitive, information processing, and communication tasks of medical practice, education, and research, including information science and the technology to support these tasks" ([45]). It is by now evident that we should not just provide information but also intelligently summarize it. The information flow in a medical information system resembles the well-known paradigm of the information pyramid in pattern recognition. At an initial stage a large amount of data is collected from various sensors and pre-processed. This data is stored in a structured, accessible format and fused with other information, such as expert knowledge. At a higher level, patterns are sought in the full dataset and translated in an intelligent way into helpful reasoning. This output supports healthcare providers in making prognoses and diagnoses, deciding the course of action and estimating outcomes. At the end of this process, feedback to the system in the form of expert evaluation or validity of analysis can be incorporated to improve future performance.

1.2 Current State in medical informatics (more practical, organizational)

According to the latest statistics, an increasing number of health care institutions are initiating the shift to informatics. Most hospitals employ simple web pages; many maintain web portals providing appointment request management and results retrieval, and a few large ones can show an integrated information system both for internal use and for patient accessibility. The underlying driving force has traditionally been the automation of the charging system for the provided medical services. From a more practical point of view, funding for Healthcare Information Technologies (HIT) is at an all-time high, reaching in some countries 3.5% of the healthcare budget ([41],[7],[8]). Leading the way in this sector are US and UK hospitals, followed by other advanced western European countries and Australia. In most cases such efforts are neither well organized nor standardized within countries, and much less can be expected at a global scale. Although initially driven by logistics and billing support needs, healthcare information systems now face a growing interest in higher level applications such as diagnostic decision support and treatment evaluation. Compared to the impact of computers and networking in other industries such as banking, one can safely claim that the advent of informatics in medicine is still in its infancy. Starting from the US, serious effort has been made in the last 5 years in creating a standard for a patient's Electronic Health Record (EHR) ([2],[13],[39]) to facilitate data exchange not only between hospitals, labs and clinicians, but also between institutions and countries. The task of
structuring an individual's medical status and history in a common format has been undertaken by a number of vendors. There are three main organizations that create standards related to the EHR: HL7, CEN TC 251 and ASTM E31. HL7, operating in the United States, develops the most widely used health care-related electronic data exchange standards in North America, while CEN TC 251, operating in 19 European member states, is the pre-eminent healthcare information technology standards developing organization in Europe. Both HL7 and CEN collaborate with ASTM E31, which operates in the United States and whose standards are mainly used by commercial laboratory vendors. The combination of EHR standards with the upward trend in funding for HIT creates good prospects for the field. However, a decisive factor in the expansion of such technologies in the applied domain is their acceptance by healthcare personnel. There is some scepticism regarding the extensive use of computerized tools for decision support, with concerns ranging from individual data safety to the reliability of automated systems and to proper training and acceptance; the attitude of clinicians is most positive in research-related institutions.

1.3 A Diversity of Applications

Assuming the above current infrastructure and mentality in the medical sector, we can proceed to identify key application areas where automatic data processing is already, or can be, effectively applied. Diagnosis of diseases is probably the single most important application of informatics in medical decision making. It involves the evaluation of criteria which discriminate between different pathologies. Common methods used in this application are indices and general nonlinear classification methods such as Neural Networks and Support Vector Machines. A large number of studies in this area ([17],[19],[16],[18],[20]) focus on feature selection and the design of algorithms that can separate cases exhibiting a pathology from a general population. In a more general context, Bayesian Belief Networks are applied to identify illnesses from a broad variety taking into account prior knowledge ([28]). Some fuzzy classification techniques can account for multiple concurrent pathological causes. These methods may face difficulties in generalizing to larger and more diverse populations than the ones used to design them. Prognosis is the prediction of the appearance of a certain pathology and as such it is also among the leading applications of pattern recognition tools in medicine. Its usefulness ranges from preemptive advice to statistical analysis of risk for insurance companies. At a subsequent stage, assuming that the patient is already diagnosed, treatment evaluation is a key feature. Treatment evaluation involves monitoring the progress of the disease and correlating the results with the treatment plan. The effort is to quantify the effect of the chosen treatment on the pathology. In a similar vein, patient survival modelling utilizes ideas from statistical survival analysis and censoring models to estimate the probability of death, due to a specific disease, of a patient at a specific time. It is also used for patient grouping and allocation to reference centres. Moreover, the statistical results from the survivability analysis are used to optimize follow-on monitoring of patients in an individualized manner. A less known yet important application of medical pattern mapping methods lies in the smart indexing and analysis of medical publications and patient data files to extract textual information about certain pathologies and to assist in locating relevant bibliographic material for research ([15]). At a lower level, computerized health records (EHR) are now a reality and are handled by medical information systems that provide near real-time data access, dramatically expanding the clinician's opportunities for timely response. Although not an application in itself, these standards
elevate the functionality of upper layers of decision support systems by providing a unified, structured and meaningful representation of multimodal and diverse data. From a broader standpoint, automated data processing has made possible the large scale analysis of populations to identify disease trails and epidemiological aspects that were previously beyond the reach of a single researcher. All the above applications share common analytic aspects but at the same time pose a diversity of difficulties for biomedical engineers, as they indicate a transition from "hospital care" to "ambulatory care" and a shift from disease management to personalized treatment.

Enabling BMI with Medical Informatics

The dominant theme of medical thinking over the last two centuries has been the ascendancy of the scientific method. Since its acceptance, it has become the lens through which we see the world, and it governs everything from the way we view disease to the way we battle it. While the advance of the scientific method is pushing medical knowledge down to a fine-grained molecular and genetic level, events at the other end of the scale are forcing the change. Best practice as determined by experts on the basis of outcomes requires integrating individual clinical expertise with the best available external evidence (e.g., information and knowledge coming from extensive clinico-genomic trials). However, finding, interpreting and applying the results of the thousands of clinical trials reported in the published research literature is a major problem ([38]), and a real challenge for medical informatics research. Can we put together rational structures for the way clinical evidence is pooled, communicated, and applied to routine care? What tools and methods need to be developed to help achieve these aims in a manner that is practicable, testable, and in keeping with the fundamental goal of healthcare, the relief from disease? Many different groups within clinical research are addressing the issues raised here, and not always in a co-ordinated fashion. Indeed, these groups are not always even aware that their efforts are connected, nor that their concerns are ones of informatics. Genomic researchers are working to discover the molecular mechanisms of diseases. Access to and integration of data coming from the clinical setting is essential for functional genomics research. Medical informatics professionals need to accelerate the development of both the information models and the tools needed for a number of tasks. Data-driven medical decision support systems have specific peculiarities that differ from many other data processing areas. One such difference is the inherent uncertainty at all levels of the information processing hierarchy: from random noise in sensor data and missing, inaccurate or miscoded data in patient records, to unknown models and, at the highest levels, the lack of absolute truth, variable utilities and competing motivations for decision making (financial, accessibility, optimum patient care, quality of life). In order to be able to exploit developments in technology for individualised healthcare we will require commensurate developments in our abilities to process this data deluge and refine it into manageable streams of information and knowledge capable of human interpretation.
In the following sections of this chapter we present a perspective on the current state of the art in automated data analysis and its relevance to decision support based on population and personal biomedical data, with more emphasis on the medical aspect. We provide a set of taxonomies of data processing approaches, in which we offer our own suggestions for how these techniques can be viewed as part of a common hierarchy, independent of the trends in specific medical areas. We also indicate failings in these methods and gaps in knowledge where future work is going to be required if we are to capitalise on augmenting diverse patient information.


2 From Phenotype Data to Medical Knowledge: fundamental questions and needs (Background)

2.1 Need for Intelligent Knowledge extraction tools

Advanced biosensing methods and technologies have resulted in an explosion of information and knowledge about diseases and their treatment. As a result, our ability to characterize, understand and cope with the various forms of diseases is growing. In such a diverse and semantically heterogeneous medical environment, in terms of both clinical and genomic information, clinical data-processing and decision-making processes become more demanding in terms of their domain of reference, i.e., population-oriented and lifetime clinical profiles enriched by evidential genomic/genetic information. The ability of healthcare professionals to be informed, and to consider and adapt quickly to the potential changes and advances of medical practice, is of crucial importance for future clinical decision-making. At the same time, errors in U.S. hospitals cause from 44,000 to 98,000 deaths per year, putting medical errors, even at the more conservative estimate, above the eighth leading cause of death ([14]).

Figure 1. Knowledge coupling in biomedical decision support systems (figure elements: expert knowledge from doctors, biologists and interdisciplinary researchers; medical markers; genomic data; optimal decision/treatment).

It seems that difficulties and failures of medical decision-making in everyday practice are largely failures in knowledge coupling, due to the over-reliance on the unaided human mind to recall and organize all the relevant details. They are not, specifically and essentially, failures to reason logically with the medical knowledge once it is presented completely and in a highly organized form within the framework of the patient's total and unique situation ([40]). If we are to reduce errors and provide quality of care, we must transform the current healthcare enterprise to one in which caregivers exercise their unique human capacities within supportive systems that compensate for their inevitable human limitations.


Therefore, tools to extend the capacity of the unaided mind are required to couple the details of knowledge about a problem with the relevant knowledge from combined, evidenced clinical and genomic knowledge repositories. The main questions that need to be addressed in this effort are the following. Do we have the necessary tools for intelligent data analysis and summarized presentation? A realistic answer is not straightforward. While there have been impressive advances in specific problems, others still remain disputed. Section 3 describes in detail each method's limitations and successes. Can we link the research results to the clinical context or are we just publishing statistical papers? The cases where a direct and exploitable connection between the computational analysis findings and actual treatment decisions has been achieved are not as many as would be desired. Unfortunately, a large part of the advanced analysis of medical data still remains confined to a theoretic/publication-oriented approach. Can we deliver the results to the medical professionals effectively? Delivery and presentation of the results depend on the effectiveness of the tools to mine and compress the knowledge from the dataset and to provide comprehensible pointers for justifiable decisions. Are the healthcare providers educated? Do they have the appropriate mentality to use the provided tools? It should be made clear to the end users that a decision support system's role is advisory and complementary to the experts' opinions. Also, the ethical concerns of putting an automated system into the decision loop and of making the required personal medical records available should be clarified. Such aspects, though, are deemed to be beyond the scope of this chapter.

2.2 Underlying technologies

In this direction there are a number of technologies that serve as foundations upon which the upper-layer services can be built. At the data collection level, most research clinicians face the need for common protocols to standardise and assist data collection, allow online incorporation of new cases and enable comparison of results over diverse population regions. Much like the EHR eases the transfer of health records, common data standards ([36],[37]) are needed to facilitate interoperability not only of patient data but also of analysis and presentation tools, at levels ranging from institutional to international. Being able to compare results is a landmark in drawing useful large scale conclusions about disease and eventually assessing treatments on a diverse population basis. The emerging GRID ([10],[33]) technologies are of great use in the wide application field of collecting, storing, processing, and presenting medical data in a way transparent to the end user. By utilizing bandwidth, computational power and other resources optimally, the GRID promises a cost effective solution to the unification of the medical informatics landscape. Although presently limited to research purposes, its use is undoubtedly going to extend the limits of Artificial Intelligence in Medicine. Moving to a more abstract (re)presentation layer, the development of medical and genomic ontologies and their coupling is a necessary step towards storing and handling huge datasets in a structured and logically founded way. A lot of research effort has been devoted to this recently, primarily in connection with DNA analysis. A more detailed description of medical ontologies is given in section 4 of this chapter.


2.3 Specific Challenges

The aforementioned needs, and the posed scientific and technological challenges, push for transdisciplinary team science and translational research. The breadth and depth of information already available in both the medical and genomic research communities present an enormous opportunity for improving our ability to study disease mechanisms, reduce mortality, improve therapies and meet the demanding individualization-of-care needs. Up to now, the lack of a common infrastructure has prevented clinical research institutions from being able to mine and analyze disparate data sources. This inability to share both data and technologies developed by the MI and BI research communities, and by different vendors and scientific groups, can therefore severely hamper the research process. Similarly, the lack of a unifying architecture can prove to be a major roadblock to a researcher's ability to mine huge, distributed and heterogeneous information and data sources. Most critically, however, even within a single laboratory, researchers have difficulty integrating data from different technologies because of a lack of common standards and other technological, medico-legal and ethical issues. Among the challenges that characterize medical data pattern analysis, one can pinpoint missingness as a key concept. Missing data occur due to inconsistent data entry, poor protocol design, death censoring, inadequate personnel familiarization with the system and inconsistent sensing equipment between collection centres. The patterns of missingness depend on the specific cause of this effect. The optimal way to handle it is in turn based upon the pattern and the covariance of the effect with other model factors. As shown in review papers in the field ([21]), this phenomenon is most often handled inadequately in published clinical research. Cases or covariates with missing data are discarded or imputed in naïve ways, resulting in added bias to the statistical findings. There is by now such a large amount of knowledge on addressing incomplete-data problems in other application fields that safe imputation of medical datasets is realistically achievable up to certain missingness ratios. Above these thresholds the affected covariates have to be discarded. In particular, EM imputation ([22]), data augmentation ([23],[24]) and multiple imputation ([25]) are considered effective and statistically sound methods to compensate for this problem. Apart from incomplete data, noise is also a crucial factor in the medical informatics data pyramid. The human body, as the object of study, has itself largely varying characteristics. Sensing equipment also introduces an additional noise component. On top of that, the operator or examiner assesses the raw information in a subjective way depending on the context, their experience and other factors. The final quantifiable and electronically storable result is far from an ideal measurement. Taking this into account, any biomedical pattern recognition system has to be robust with respect to noise. Some algorithms can be adjusted to a desirable level of smoothing. Common practice dictates that the noise component should be removed as early as possible in the processing sequence. This is usually achieved through pre-processing. As an additional measure, cross validation of the results can be used in the post-processing phase to minimize output variance and sensitivity to noise.
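As an illustration of the model-based imputation approaches mentioned above, the following sketch (ours, not from the chapter) contrasts naive mean imputation with scikit-learn's iterative imputer, which approximates chained-equation, multiple-imputation-style estimation; the covariates, values and missingness mechanism are all hypothetical.

```python
# Hedged sketch: model-based imputation of a toy clinical dataset.
# Assumes scikit-learn is available; column meanings and values are invented.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)

# Hypothetical covariates: [age, systolic BP, tumour marker level]
X = rng.normal(loc=[60.0, 130.0, 4.0], scale=[10.0, 15.0, 1.5], size=(200, 3))

# Introduce ~15% missing values completely at random (MCAR assumption).
mask = rng.random(X.shape) < 0.15
X_missing = X.copy()
X_missing[mask] = np.nan

# Naive mean imputation (often biased when the data are not MCAR).
X_mean = SimpleImputer(strategy="mean").fit_transform(X_missing)

# Iterative (chained-equation style) imputation using the other covariates.
X_iter = IterativeImputer(max_iter=10, random_state=0).fit_transform(X_missing)

print("mean-imputation RMSE:", np.sqrt(np.mean((X_mean[mask] - X[mask]) ** 2)))
print("iterative RMSE:      ", np.sqrt(np.mean((X_iter[mask] - X[mask]) ** 2)))
```

On data of this kind the model-based imputer typically recovers the missing entries more accurately, which is the point made above about preferring principled imputation over naive substitution, at least up to moderate missingness ratios.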
Closely related to the presence of noise and the reliability of the input information is the concept of uncertainty. Researchers observe very strong uncertainties in the data, models, and expert opinions relating to clinical information. Being able to quantify this factor at every step of the computation process makes it possible to infer bounds on the final decision support outcome. This is far from theoretical. The real world decisions that a healthcare provider has to make require that the information used is as concise as possible. Confidence intervals are common metrics that bound a variable's expected range. More advanced techniques such as Bayesian Belief Networks ([27],[28]) and the Dempster-Shafer theory of evidence ([26]) are already used in commercial diagnostic support software packages.
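As a small illustration of attaching uncertainty bounds to a decision-support output (our sketch, not part of the chapter), a bootstrap confidence interval can be placed around an estimated classifier accuracy; the dataset and classifier are arbitrary assumptions.

```python
# Hedged sketch: bootstrap confidence interval for classifier accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

rng = np.random.default_rng(0)
accs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_te), len(y_te))   # resample the test set with replacement
    accs.append(np.mean(y_hat[idx] == y_te[idx]))

lo, hi = np.percentile(accs, [2.5, 97.5])
print(f"accuracy = {np.mean(y_hat == y_te):.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```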

Finally, a usually overlooked aspect of any medical problem is that the outcome usually affects human lives. This makes the misclassification cost largely asymmetric. A false negative patient diagnosis costs far more than a false positive one (and sometimes the cost is not even measurable in medical expenses). Yet in most clinical classification research papers symmetric cost functions are assumed as a default, for simplicity reasons. Another important problem is that the models' assumptions are not always explained in detail. It is common practice to assume Gaussian distributions or equal class priors due to their mathematical tractability. However, not all models are applicable to all cases. Biomedical datasets usually consist of features of different modalities. Data types can range from CT images, to EEGs, to blood tests, to experts' opinions, to microarray data. All have different dimensionalities and dynamic ranges and require different pre-processing, data mapping, feature extraction and representation methods. Moreover, even assuming that each is handled properly, they are associated with different statistical characteristics, uncertainties and noise components. Even from a practical perspective, storage, data reduction, transmission and compact high-level presentation become complicated. Data fusion will also play an important part in clinical information systems. Combining the various types of available information into a single decision boundary is important if we are to utilize all available information. To do this, one has to take into account the relative weight of each data source and contextual information. At a more advanced level, a medical decision support system should be able to combine the results of multiple classifiers to provide a more effective classification result. The advantage of this is that different classifiers can learn different parts of the feature space of a multimodal dataset and provide a fused output. A number of research works already focus on this field. The difficulties in this area include high covariance of the classifiers' outputs and different uncertainties. The data is collected through large scale multicenter studies. The participating centres are usually distributed geographically and have to submit new samples in a concise yet simple-to-the-end-user manner. This creates the need for on-site data reduction in order to be able to transmit and store high volumes of patient data. In this sense a pattern classifier should be scalable or modular, or be able to run in an automated way before the data has to be transmitted to a core facility.
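To make the point about asymmetric misclassification costs concrete, the following sketch (our illustration, not part of the chapter) picks the decision threshold that minimizes an expected cost in which a false negative is assumed to be ten times as costly as a false positive; the cost values and dataset are hypothetical stand-ins.

```python
# Hedged sketch: cost-sensitive thresholding of a probabilistic classifier.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

C_FN, C_FP = 10.0, 1.0   # assumed costs: missing a disease case vs. a false alarm

X, y = load_breast_cancer(return_X_y=True)
y = 1 - y                # relabel so that 1 = malignant ("disease present")
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

p = LogisticRegression(max_iter=5000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

def expected_cost(threshold):
    pred = (p >= threshold).astype(int)
    fn = np.sum((pred == 0) & (y_te == 1))
    fp = np.sum((pred == 1) & (y_te == 0))
    return (C_FN * fn + C_FP * fp) / len(y_te)

thresholds = np.linspace(0.05, 0.95, 19)
best = min(thresholds, key=expected_cost)
print("cost at default threshold 0.5:", round(expected_cost(0.5), 3))
print("cost-optimal threshold:", round(best, 2), "cost:", round(expected_cost(best), 3))
```

With asymmetric costs the optimal operating point generally shifts towards a lower threshold, i.e. towards fewer false negatives, which is exactly the adjustment that symmetric-cost analyses ignore.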

3 Methods for Processing Medical Data

Having established some of the main challenges to be addressed by the biomedical informatics research community, we will in the following sections present and critically assess the available methods for processing multilevel medical data. The objective remains to identify the adequacy or inadequacy of methods and tools for medical data processing and analysis and to point out future directions. The following is given as a possible taxonomy of approaches and methods for processing biomedical data.

3.1 The pattern processing waterfall model

We recognize that medical informatics data analysis has the intention of taking massive amounts of low-level, noisy and rudimentary data (such as supplied by hundreds of sensing probes, as in EEG/MEG studies) and eventually providing decision support by turning the low-level noisy data into high quality, low data rate information that can be interpreted by the clinician or the patient to take a more informed decision. This process of moving from high data rate but uninformative data
to low data rate and highly distilled information sources may be mapped over to an accepted model of pattern processing, known as the waterfall model (Figure 2).

Figure 2: The Pattern Processing Waterfall Model. Hierarchical, from large volumes of low quality data to low volumes of high quality information (Environment, Sensing, Signal Processing, Feature Extraction, Pattern Interpretation, Situation Assessment, Decision Support; signals, reduced signals, patterns, states, options/costs and instructions are the interfaces passed between levels).

In this model, the explicitly hierarchical nature of data analysis is acknowledged. Figure 2 shows us that to go from low level data analysis to high level decision support invariably requires different stages of analysis, including Sensing, Signal Processing, Feature extraction, Pattern interpretation, Situation Assessment and finally Decision support. Each level of this processing hierarchy produces a common interface of results, such as signals, reduced signals, patterns, system states, options and costs, and finally instructions for action. In current biomedical data analysis approaches we tend to concentrate more on the lower levels in this hierarchy, choosing to focus on signal processing methods for noise reduction, feature extraction for changing signal characteristics into more manageable structures such as power spectra or intrinsic images, and also pattern interpretation such as producing models to assist in classification and prognosis.
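The lower levels of this hierarchy map naturally onto a processing pipeline. The sketch below (our illustration, not from the chapter) chains a simple signal-conditioning step, a feature-extraction step and a pattern-interpretation (classification) step; the dataset and the particular components are stand-ins for whatever a given modality would require.

```python
# Hedged sketch: lower levels of the waterfall model as a processing pipeline
# (signal conditioning -> feature extraction -> pattern interpretation).
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)   # stand-in for pre-extracted sensor features

pipeline = Pipeline([
    ("signal_processing", StandardScaler()),        # conditioning / normalisation
    ("feature_extraction", PCA(n_components=10)),   # reduced signals -> patterns
    ("pattern_interpretation", LogisticRegression(max_iter=5000)),
])

print("decision-support proxy (CV accuracy):",
      cross_val_score(pipeline, X, y, cv=5).mean().round(3))
```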


We now consider how these different levels in the hierarchy map over into techniques and methods used in biomedical data analysis.

3.2 Functional Taxonomies

Medical data sources can be temporally and spatially correlated, such as EEG and fMRI, and they can also be unstructured, such as doctors' records of patient profiles. The data can be very noisy, and often has missing values. Data dimensionalities can be high: 1-D time series, 2-D images, 3-D body scans, 4-D spatio-temporal sequences. How we process this data depends on the eventual function required of it. The first high-level division of techniques depends upon whether or not we are aware of the type or class of ailment, or whether we are more concerned with a generic investigation of health state. If we are focusing on a specific class of ailment (e.g. benign or malignant cancer, epilepsy or alternative seizure versus healthy state), then we are concerned with a supervised problem. Supervised problems are exemplified by classification and prediction problems where we are able to collect data involving ground truth target vectors for patients with known established outcomes. In supervised approaches we use the labelled data as a source for constructing generic models which can then be applied to individuals from a population in which these target labels are not yet known. For example, using histology or DNA samples to make an estimate of cancer type or staging. If we cannot pose the problem in a way in which we can ascribe an explicit target value of a prognostic or medical indicator, then we are dealing with an unsupervised problem. For instance, in dealing with the visualization of a population of thousands of people based on features extracted from ECG waveforms, specific ailments might be irrelevant and we are more interested in how the population is dispersed as a distribution. This would be the visualization problem of unsupervised density estimation. Using such an unconditional distribution would be useful at the individualized level for detecting whether an individual should be considered somehow anomalous, or an outlier of the population. This would be used as a warning signal for further investigation of that individual. Figure 3 graphically depicts this hierarchical task decomposition based on the nature of the problem faced by the biomedical engineer. After the first broad categorisation between supervised and unsupervised tasks, the next major division is whether we can model the problem as a deterministic or a stochastic problem. Of course, most real-world problems are combinations of random and deterministic parts. However, the techniques we have normally either assume small noise components, so the problem is largely one of function approximation, or ignore spatio-temporal dependencies and approximate the full joint distribution of stochastic processes through products of independent simple distributions. Whether we choose a deterministic or stochastic framework, we are then faced with techniques which are linear or nonlinear. Linear signal processing and Gaussian distribution assumptions provide us with our most advanced and best-understood techniques. Although these techniques may not fit the actual biomedical situation, we assume that a local approximation of the real situation can be closely modelled using linear techniques. More recently there have been advances in algorithms and in the understanding of nonlinear methods, such as machine learning approaches as in adaptive neural networks, which have found extensive use in the biomedical domain.
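The supervised/unsupervised distinction above can be illustrated on the same feature matrix: a labelled classifier on one hand, and an unconditional model used only to flag population outliers on the other. The sketch below is ours and uses an arbitrary public dataset as a stand-in for patient features.

```python
# Hedged sketch: the same features handled as a supervised and an unsupervised task.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Supervised: labels (ground-truth diagnoses) are available, so we fit a classifier.
clf = make_pipeline(StandardScaler(), SVC())
print("supervised CV accuracy:", cross_val_score(clf, X, y, cv=5).mean().round(3))

# Unsupervised: ignore the labels, model the population, and flag anomalous
# individuals for further investigation (the "warning signal" use case above).
outlier_flags = IsolationForest(random_state=0).fit_predict(X)   # -1 marks outliers
print("fraction flagged as anomalous:", np.mean(outlier_flags == -1).round(3))
```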


Figure 3: Hierarchical division of approaches for data analysis. Biomedical processing splits into unsupervised and supervised tasks; each of these splits into deterministic and stochastic formulations, and each of those into linear and nonlinear techniques.

3.3 Unsupervised techniques

In unsupervised approaches to low level biomedical data analysis, the generic approaches are concerned with (1) Feature extraction, (2) Clustering, (3) Visualisation, and (4) Density Modelling. Figure 4 depicts a functional hierarchy of unsupervised processing tasks, along with some exemplar common algorithms used to implement these functional tasks. In feature extraction, most biomedical techniques are interested either in decomposing the data into components (ICA, PCA) so that the noise and the signal may be more easily discriminated, or in transforming into another, more appropriate representation of the data (Projection Pursuit, PCA), or simply in data reduction without reducing signal content. For example, in brain state analysis it is common to think in terms of spectral bands. Several techniques are focused on producing spectra from time series (FFT, state space/ARMA models). Time-frequency spectrograms have also been used for this purpose of transforming into more appropriate spaces in areas such as ECG waveform monitoring (where visualization is easier), but are unsuitable for high dimensional signals. There are several ways to produce spectra from time series, and some are numerically more suitable than others, since the noise in the time series needs to be smoothed if reliable spectral characterizations are to be obtained. In time series problems, recent common nonlinear methods have looked at using embedology as an approach to construct state spaces which can then be used for subsequent model building. For preliminary noise removal, a common recent approach in biomedical data analysis has been to use wavelet transforms in one or two dimensions, replacing previous methods of filtering in the frequency domain. The scale-space models have also been augmented by data-adaptive basis function expansion models, such as using PCA/SVD approaches and reconstructing the signal only with the most significant basis functions (in the case of the PCA approach, only the directions with largest variance are used in the reconstructions). The wavelet approach (and to a lesser extent PCA approaches) has also been a commonly favoured technique for data compression. In terms of the BioPattern project, if Grid-enabled methods are to be routinely used, and if this involves the passing of significant amounts of data between large processing centres, then data compression methods will be important for bandwidth considerations alone (since medical data, as we have already commented, is exponentially expanding with the advent of new technologies). Another major commonality in biomedical processing in the last 8 years has been the development of methods for blind source separation, and in particular a plethora of algorithms for independent component analysis (ICA). ICA is a class of algorithms which seek to explain a set of observed biomedical data in terms of a latent set of hidden 'sources'. ICA assumes that these sources are mutually independent, that they are not temporally correlated, and that at most one of them is Gaussian distributed. For temporally correlated sources, recent advances in ICA methods, such as the method of Complexity Pursuit, have been developed to tackle sources which have some temporal structure, though this is still a basic limitation of ICA models. Similarly, extensions to instantaneous mixing and linearity have been made, with some algorithms for ICA in situations in which the mixtures are convolutive, and nonlinear, though they are not yet in common use in biomedical data analysis. One can also take these 'sources' as nonorthogonal bases which explain the signal variation, and so they can be used in the same way as PCA and wavelet basis function methods, i.e. for data reduction and noise removal. Hence one of the more common uses of ICA methods is in artifact removal from biosignals, such as removing muscle artifact from EEG and ECG. Note that there are many algorithms for estimating independent components, based on using different contrast functions or on moment/cumulant expansions. The FastICA method seems to be prevalent in biomedical signal processing, though it does have some drawbacks, one of which is the lack of any principled method for determining how many sources might make up a given set of signals. ICA is also basically a deterministic algorithm, ignoring the detail of the noise distribution on the signals, although again, some very recent developments have considered mixture models to try and partially circumvent this restriction.
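As an illustration of the blind source separation idea described above (our sketch, not from the chapter), FastICA can unmix synthetic signals that stand in for, say, a rhythmic brain source contaminated by a slow artifact; the signals and mixing matrix here are entirely artificial.

```python
# Hedged sketch: FastICA on synthetic mixtures, mimicking artifact separation.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * np.pi * 10 * t)               # stand-in for a rhythmic "brain" source
s2 = np.sign(np.sin(2 * np.pi * 0.5 * t))     # stand-in for a slow artifact (e.g. eye/muscle)
S = np.c_[s1, s2] + 0.05 * np.random.default_rng(0).normal(size=(len(t), 2))

A = np.array([[1.0, 0.6], [0.4, 1.0]])        # unknown mixing matrix ("electrodes")
X = S @ A.T                                   # observed mixed channels

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                  # estimated independent sources

# One recovered component tracks the artifact; zeroing it and re-mixing with
# ica.inverse_transform would yield artifact-reduced channels.
for i in range(2):
    corr = max(abs(np.corrcoef(S_est[:, i], s)[0, 1]) for s in (s1, s2))
    print(f"component {i}: best |correlation| with a true source = {corr:.2f}")
```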

Figure 4: Functional separation of unsupervised tasks, with algorithm examples. Feature Extraction (e.g. wavelets, FFT, polyspectra, PCA, ICA, Projection Pursuit); Clustering (e.g. K-means, Gaussian clustering, vector quantisation, dendrograms, SOFM); Visualisation (e.g. SOFM, GTM, PCA projections, Sammon map, MDS, Neuroscale, Projection Pursuit); Density Modelling (e.g. mixture models, Parzen estimators, extreme value distributions).

Clustering is the next most common use of algorithms in unsupervised biopattern analysis. One of the more common approaches (especially in the genetic domain) is constructing hierarchical clustering models such as the dendrogram. Other common clustering approaches are K-means, fuzzy (or soft) k-means ([1],[6]) (equivalent to Gaussian clustering), prototype-based clustering
such as vector quantization, and non-metric methods such as the self-organising feature map of Kohonen (though this is really a visualization technique, discussed next). Clustering methods rely on the existence of dissimilarity measures (such as Euclidean or city-block distance between features, but other measures can be used, such as entropy measures, the Itakura distance or the Kullback-Leibler divergence) or scoring methods (parametric t-tests, or nonparametric scoring methods such as the Wilcoxon statistic and Kruskal-Wallis scores). One criticism of clustering methods is that the distribution of patterns in clusters, and indeed the number of clusters, varies and depends crucially on the metrics used for comparisons. Visualisation of high dimensional biomedical data is an important and under-researched area. It is difficult since the projection of high-dimensional data into low dimensional spaces in which data can be visualized requires compromises to be made which necessarily distort the data. Amongst the more common visualization methods are projections into dominant principal component spaces, the self-organising feature map (used in areas such as document clustering, not yet really developed extensively for patient data visualization), the Generative Topographic Map (a principled probabilistic version of the SOFM), and a group of methods which explicitly exploit relative dissimilarity between feature vectors rather than the absolute values. Amongst these methods are Sammon maps, multidimensional scaling and the neural network counterpart, NeuroScale. These latter methods attempt to distribute patient characteristics in a low-dimensional (typically two-dimensional) space so as to preserve the topographic structure of the high-dimensional data by minimizing STRESS functions. Exploratory Projection Pursuit is another such method that attempts to identify meaningful projections of the feature space. It is based on a variety of metrics called projection indices which are to be maximized. A number of useful tools that implement this method are available online (REFS). These methods tend to be computationally intensive and are also sensitive to the choices of metrics used to describe dissimilarity. Some of the methods are heuristic, probabilistic or deterministic. Visualisation is an area which has yet to be focused on the needs of the clinician as a decision support tool (such as identifying specific patients who might be at risk). The final area of unsupervised methods is related to density modelling. This is an explicit attempt to describe the probabilistic structure of medical data in situations in which we do not have gold standard target data vectors. In this domain we are interested in describing the unconditional probability distribution of patient characteristics without knowledge of explicit disease characteristics or states. This is linked to the visualization area above, but now we are not constrained to low-dimensional representations. Amongst this class of algorithms are Mixture Models, Extreme Value Distribution theory and Parzen density estimators (and other kernel based methods). The main use of density models so far is to try and identify outliers. So, examples would be patients with anomalous characteristics in their bioprofiles, or trying to determine groups of significantly expressed genes amongst the thousands observed in microarray experiments to help in identifying pathways.
Unconditional density modelling is very difficult since most approaches have to assume some form of independence in the structures to construct any feasible models. As such this is an under-researched activity.
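The following sketch (ours, with synthetic data standing in for patient bioprofiles) illustrates the density-modelling use case just described: a mixture model is fitted to unlabelled data and the lowest-likelihood individuals are flagged as candidate outliers.

```python
# Hedged sketch: unconditional density modelling with a Gaussian mixture,
# used to flag low-likelihood (potentially anomalous) profiles.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic "population": two sub-groups plus a few scattered anomalies.
X_normal, _ = make_blobs(n_samples=500, centers=2, cluster_std=1.0, random_state=0)
X_anom = np.random.default_rng(0).uniform(-12, 12, size=(10, 2))
X = np.vstack([X_normal, X_anom])             # rows 500-509 are the injected anomalies

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_density = gmm.score_samples(X)            # per-sample log-likelihood under the model

threshold = np.percentile(log_density, 2)     # flag the 2% least likely profiles
flagged = np.where(log_density < threshold)[0]
print("flagged indices:", flagged)
```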

3.4 Supervised Techniques

The most common algorithmic approaches to biomedical data analysis are in terms of supervised tasks. In these situations, data has typically been explicitly collected for a specific purpose, such as the classification of cancer types. It should be noted that the vast majority of medical data is not in a form suitable for supervised tasks, yet supervised algorithms have dominated biomedical data processing in recent years. Since in supervised tasks we have extra information (class labels or desired target vectors), we can construct more refined models incorporating higher levels of prior or domain knowledge. Amongst the most common supervised tasks in a functional taxonomy are (1) Feature extraction, (2) Prediction (and prognosis), (3) Classification, and (4) Function Approximation. Figure 5 depicts the functional taxonomy of supervised tasks along with some chosen examples of common algorithms used in each area.

Figure 5: Functional separation of supervised tasks, with algorithm examples. Feature Extraction (e.g. Fisher discriminants, nonlinear discriminants, ARD); Prediction/Prognosis (e.g. neural networks including MLPs and RBFs, SVMs, RVMs, partial kernel logistic regression); Classification (e.g. neural networks including MLPs and RBFs, SVMs, RVMs, logistic regression); Function Approximation (e.g. kernel smoothers, smoothing splines, linear and nonlinear regression, E-SVM).

For signal processing purposes, if the supervised task is known in advance, then more appropriate features can be extracted from the raw signals. For example, a projection onto PCA directions, which does not involve classification information (only unconditional variance information), may remove the precise information needed to make an accurate classification. Fisher Linear Discriminant analysis is a common linear method of extracting directions which maximize between-class separation while reducing within-class variability, through maximizing a separation criterion. This gives features which more accurately separate classes but use fewer dimensions than the original data source. This approach is more useful for data reduction and noise reduction in biomedical situations in which the data is already labeled and pre-segmented. Once the linear separating directions have been found, new, unlabelled data can be ascribed classes on the basis of their projections onto these direction vectors. An extension of Fisher discriminants has been made using a neural network approach, in which nonlinear transformations of the data can be found which reduce the data dimensionality and preserve or enhance class separability. Many of the common approaches in supervised biomedical tasks come under the categories of prediction and classification. Because we have access to a dataset of labeled patient data, we can construct parametric or nonparametric models which try to reproduce the generator of the labeled data without reproducing the noise itself. Hence in a prognosis task, where we wish to predict the likely outcome of the health of a patient, we need access to linear or nonlinear models, which could be deterministic or statistical in nature. By far the most common methods are neural networks of different forms: in particular, multilayer perceptrons (MLPs), radial basis function networks (RBFs), support vector machines (SVMs) and their close counterpart relevance vector machines (RVMs), along with linear predictive models and explicitly parameterized statistical models which are less adaptive in nature. The only real difference between classification and prediction tasks is in the nature of the target variable: in classification tasks we are concerned with the estimation of a binary vector, whereas in generic prediction the target is typically a continuously variable quantity, such as life expectancy. In each of these situations the neural network model is used either to estimate a point conditional probability of a given pattern belonging to a given class, or as a regression model to estimate the interpolated expected average value conditioned on the input. The class of artificial neural network models has provided one of the more common areas of overlap of algorithms for biomedical data analysis. Another difference is in the use of ROC and area-under-ROC curves as a common metric for model validation in medical classification tasks, partly discussed in the next section. However, the ROC curve carries less information than the misclassification matrix of the classification model, so this constraint should be borne in mind when using ROC statistics to choose between models. Other common non-neural methods for classification in the medical area have been the use of expert systems, Blackboard architectures, and Dempster-Shafer fuzzy evidence approaches to classification. These methods were more popular in the past. They have advantages in that they explicitly try to deal with expert domain knowledge and uncertainty. They suffer from unstable rule extraction and rule elicitation from experts and the ad-hoc nature of some of the algorithms. The final task of supervised pattern processing is Function Approximation. This is a situation for data modelling and understanding rather than classification, prediction or feature extraction. In data modelling, one aim would be to provide some form of regularization or smoothing to allow noise and outliers to be extracted. Techniques which fall in this category are kernel smoothers, smoothing splines and linear and nonlinear regressors. Again, suitably modified neural networks can also be used for this filtering task. Medical data imputation and data correction can be made if we have a good interpolator model of the medical data. Again, this interpolator could be a deterministic function, or it could be viewed generatively as sampling from an underlying distribution function, using MCMC techniques for example. This class of techniques is currently underrepresented in the biomedical data analysis arena.
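As a concrete instance of supervised feature extraction followed by classification (our sketch; dataset and model choices are arbitrary), Fisher's linear discriminant can project labeled data onto class-separating directions before a simple classifier is applied to the reduced representation.

```python
# Hedged sketch: Fisher discriminant projection as supervised feature extraction,
# followed by nearest-centroid classification of the projected data.
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)     # 13 features, 3 classes (proxy for clinical data)

pipe = make_pipeline(
    StandardScaler(),
    LinearDiscriminantAnalysis(n_components=2),   # at most (n_classes - 1) directions
    NearestCentroid(),
)
print("accuracy with 2 discriminant features:",
      cross_val_score(pipe, X, y, cv=5).mean().round(3))
```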

3.5 Model Estimation Techniques

The final taxonomy we consider is a set of common aims of the data analyst which cut across the supervised/unsupervised boundary. The main common aspects in this taxonomy are (1) Assessment and Selection, (2) Inference and Averaging, and (3) Metrics. Model selection and assessment are crucial common aspects. In biomedical data processing it is crucial to be able to construct low-bias models which are robust to fluctuations (in data and model parameters) for stability. Overtraining of adaptive models and over-parameterisation of parameterized models are two examples of situations to be avoided in medical data processing in particular. Common methods for model assessment involve issues linked to the bias-variance dilemma, and particularly issues of regularization, either explicitly through the cost functions being used or implicitly through restricting the model class. Common methods used involve out-of-sample prediction error trade-off, cross-validation and bootstrap methods, and measures of fitness based on minimum description length (MDL) and issues linked to Vapnik-Chervonenkis (V-C) dimensionality.


However, more and more in the intermediate levels of biomedical data processing, the use of single models is being replaced by methods of averaging. This is motivated from the Bayesian perspective of marginalization rather than selection. By averaging over model predictions or over models trained on different data samples drawn from the same distribution, it is possible to compensate for weak or overtrained models. Common methods of averaging include Bootstrap, Boosting, Bagging, Stacking and Bayesian averaging (and since the latter is generally intractable, approximate methods based on sampling, such as Markov chain Monte Carlo methods, are used). These methods are becoming better known to biomedical data analysts and are techniques in active development in generic pattern processing. The final common aspect is related to metrics. In biomedical data processing there are already several common metrics used, such as the ROC curve used as a performance metric in choosing classification models, and scoring metrics such as the Wilcoxon and Kruskal-Wallis statistics. However, we also need to agree on metrics for clustering, visualization, model comparisons and optimization. The common metrics involve prediction error variance, different types of entropy measure, mutual information, the Kullback-Leibler divergence, and dissimilarity metrics such as the Standardised Residual Sum of Squares (STRESS). Using different performance metrics produces different estimates of which approaches are superior. For example, much effort is devoted to comparing different types of neural network models on specific tasks using different metrics (the SVM models use different cost functions to MLPs, for example). It is probably more important to focus effort on the global failings of neural networks in the context of the Bioprofile.
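A small sketch of the selection-versus-averaging point made above (our illustration, arbitrary dataset and models): cross-validation is used to assess a single model, and a bagged ensemble of the same base model, averaged over bootstrap resamples of the training data, is assessed in the same way for comparison.

```python
# Hedged sketch: model assessment by cross-validation, and simple model averaging
# (bagging) over bootstrap resamples of the training data.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

print("single tree  CV accuracy:", cross_val_score(single, X, y, cv=10).mean().round(3))
print("bagged trees CV accuracy:", cross_val_score(bagged, X, y, cv=10).mean().round(3))
```

The averaged ensemble typically reduces the variance of the overtrained single tree, which is the compensation effect the paragraph above attributes to averaging methods in general.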

Figure 6: Common model estimation related algorithmic issues. Assessment and Selection (e.g. bias/variance, optimisation and regularisation, prediction error, MDL, V-C dimension, cross validation, bootstrap); Inference and Averaging (e.g. boosting, bootstrap, Bayesian MCMC, bagging, stacking); Metrics (e.g. entropy, K-L divergence, dissimilarity metrics, mutual information, ML/MAP, scoring methods).

3.6 Classifier Fusion

In pattern analysis it is known that there is no single best algorithm for a specific dataset. Classifier ensembles have in recent years produced promising results, improving accuracy, confidence and, most importantly, feature space coverage in many practical applications. In medical problems the combination of multiple classifiers has been effectively applied to diagnosis, gene selection and patient grouping ([47],+,+). The driving principle behind this method is that a combination of even moderate elementary classifiers can map a feature space better, provided that their outcomes are interpreted and combined in a statistically appropriate way. Classifier fusion methods are described and reviewed in [49], [50], [51] and [52], and in [48] the concept of decision profiles (DPs) is introduced. A common pitfall in using ensembles is that the base (level-1) classifiers' outcomes are in many cases not interpretable as probabilities. This limits the choice of the available combiners if the objective is to provide statistical bounds on the fused output. Therefore researchers should pay attention to the assumptions of each fusion method. Another problem, for both simple classifiers and combiners, is that the ratio of positive to negative cases in clinical datasets is usually low. This creates problems in training and adds bias to the results. Adjusting for prior class distributions is an adequate way to handle this asymmetry.
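The following sketch (ours; the classifiers and data are arbitrary choices) shows one of the simplest statistically interpretable fusion rules discussed above: averaging predicted class probabilities from heterogeneous base classifiers, i.e. soft voting.

```python
# Hedged sketch: fusing heterogeneous classifiers by averaging their
# predicted class probabilities (soft voting).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

base = [
    ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True))),  # enables predict_proba
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
]
fused = VotingClassifier(estimators=base, voting="soft")

for name, model in base + [("fused", fused)]:
    print(f"{name:5s} CV accuracy: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```

Note that soft voting implicitly assumes the base outputs are (roughly) calibrated probabilities; as the text warns, when they are not, either calibration or a different combiner is needed before any statistical bound can be placed on the fused output.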

3.7 Performance metrics, evaluation and generalization

Many novel and promising models in the published literature are presented and evaluated in a way that leaves doubt regarding the variance, reproducibility and reliability of the results. Showing high accuracies on a specific dataset instance is not important unless supported by analysis that verifies statistical robustness. In addition, many newly proposed models only provide point estimates of desired outputs such as classification accuracy, neglecting the fact that all models are wrong and so, as a minimum, estimates of uncertainty on point predictors are desired. This is especially important in life-critical applications where additional cost factors need to be accounted for. Ideally, diagnostic or prognostic models should aim to produce either the class-conditional probabilities P[D|Y] or P[Y|D] for combinations of disease occurrence D and feature or test presence Y. Measures of accuracy based on these model estimators need to be presented when new models are introduced. In this respect, assessment and evaluation need to also accommodate asymmetric misclassification costs of model predictions. Hence, model accuracy on, e.g., a specific disease class is inadequate. We can consider the production of alternative models as equivalent to obtaining different diagnostic tests for a medical condition.

Binary Diagnostics: Consider the case of a binary diagnostic test. The measures of accuracy are based on the false positive fraction (FPF = P[Y=1|D=0]) and the true positive fraction (TPF = P[Y=1|D=1]), where D=1/0 denotes the presence or absence of a disease, and Y=1/0 denotes diagnostic evidence for or against the disease. The overall misclassification probability of such a diagnostic test is

P[Y ≠ D] = p (1 - TPF) + (1 - p) FPF,

where p is the prevalence (prior) of the disease in the population. However, this is an inadequate summary of a diagnostic, since the costs and implications of the two types of misclassification are usually very different. Hence reporting FPF and (1 - TPF) is required as a minimum. False negative errors imply that patients with the disease do not undergo treatment, whereas false positive errors tend not to be quite so serious, beyond the unnecessary treatment applied to a healthy patient.


Alternatives to the misclassification probabilities are the predictive values. Predictive values quantify the clinical relevance of the test and involve the reverse conditioning of the probabilities. For instance, the positive predictive value is PPV = P[D=1|Y=1] and the negative predictive value is NPV = P[D=0|Y=0]. A perfect test will predict the disease perfectly, with PPV = NPV = 1. An uninformative test will have PPV = p and NPV = (1 - p), and so the predictive values depend also on the prevalence of the disease, and not just on the model's diagnostic output. The classification probabilities can be expressed in terms of the predictive values as long as the prevalence is also known. The third primary description of the prognostic value of a 'test' (or model) is in terms of likelihood ratios:

Positive Likelihood Ratio = PLR = P[Y=1|D=1] / P[Y=1|D=0] and
Negative Likelihood Ratio = NLR = P[Y=0|D=1] / P[Y=0|D=0].

These are specifically relevant since they quantify the increase in knowledge about a disease gained through the output of a diagnostic model. These likelihood ratios are also known as Bayes factors and are motivated from predicting the disease status from the model. In this sense they relate closely to the previous predictive values, except that the prevalences are not used. In fact they are also simple functions of the classification probabilities: PLR = TPF/FPF, and NLR = (1 - TPF)/(1 - FPF). These three quantifiers (likelihood ratios, predictive values, classification probabilities) represent approaches to quantifying the diagnostic accuracy of models (regarded as diagnostic tests) determined across data. It should not be forgotten that the cost of inaccurate diagnosis should often also be taken into account.

Continuous Diagnostics: For models that produce continuous or ordinal output scales Y, alternative accuracy tests need to be considered in addition to the methods discussed for the binary case. The currently accepted approach is to construct the Receiver Operating Characteristic (ROC) for a given model. Using a threshold c, define a binary test based on the continuous model result Y as positive if Y ≥ c and negative if Y < c. Then the corresponding true and false positive fractions are functions of c: TPF(c) = P[Y ≥ c | D=1], FPF(c) = P[Y ≥ c | D=0]. The ROC curve is the set of possible true and false positive fractions as c is varied:

ROC = { (FPF(c), TPF(c)), c ∈ (-∞, ∞) }.

Very few introduced models provide information on ROC curves and comparisons between models using different ROC curves. The ROC curve represents the full trade-off involved in selecting prognostic classes depending upon a threshold. Different situations or clinicians would require different thresholds, and the compromises between different models should be able to be presented to the clinician.
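The binary-test quantities above are straightforward to compute from a 2x2 table of test outcomes. The sketch below (ours, with invented counts) follows the definitions in the text, including the dependence of PPV/NPV on prevalence and the prevalence-weighted misclassification probability.

```python
# Hedged sketch: binary diagnostic summaries from hypothetical counts.
# tp, fp, fn, tn are counts of (test result, disease status) outcomes; values are invented.
tp, fp, fn, tn = 90, 30, 10, 870

tpf = tp / (tp + fn)            # sensitivity, P[Y=1 | D=1]
fpf = fp / (fp + tn)            # 1 - specificity, P[Y=1 | D=0]
prevalence = (tp + fn) / (tp + fp + fn + tn)

ppv = tp / (tp + fp)            # P[D=1 | Y=1]
npv = tn / (tn + fn)            # P[D=0 | Y=0]
plr = tpf / fpf                 # positive likelihood ratio
nlr = (1 - tpf) / (1 - fpf)     # negative likelihood ratio
misclassification = prevalence * (1 - tpf) + (1 - prevalence) * fpf

print(f"TPF={tpf:.2f} FPF={fpf:.2f} PPV={ppv:.2f} NPV={npv:.2f}")
print(f"PLR={plr:.2f} NLR={nlr:.2f} P[Y != D]={misclassification:.3f}")
```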


The ROC curve is invariant to strictly increasing transformations of Y. The likelihood ratio plays an important role here, in that the optimal criterion based on Y for classifying subjects as positive for the disease is LR(Y) > c, in the sense that it achieves the highest TPF for a given FPF = P[LR(Y) > c | D=0]. Summary statistics of the ROC curve are often used when it is difficult to estimate the full curve. For instance, the area under the ROC curve (AUC) is popular because it can be interpreted as the probability that the test results of a randomly selected pair of subjects, one with and one without the disease, are correctly ordered, i.e. AUC = P[Y(diseased) > Y(non-diseased)]. The ROC curve also provides a distribution-free distance measure between the two class-conditional distributions, which makes ROC comparisons useful when combining multiple test results. For multiple tests, where the test result Y is multidimensional, the ROC curve of the likelihood ratio seems to require computation of the joint likelihood, which is in general difficult. However, the risk score RS(Y) = P[D=1 | Y] has the same ROC curve as LR(Y) and yields the same optimal decision rules; combining multiple tests therefore reduces to obtaining risk score functions, which most models evaluate directly. Thus, for both binary and continuous prognostic models, accuracy measures of this kind should be reported so that appropriate quantification and comparison of performance can be made. This subsection has outlined a minimum such requirement.

Prospective evaluation of models: Several scoring systems, mathematical models and pattern recognition algorithms have been developed to distinguish between pathologies; however, few of them have been externally validated in new populations. Comparing their performance on a prospectively collected, large, multicenter dataset is the test that reveals the real generalization capability of each method. Work in this area ([53]) is limited, mainly due to the lack of data available at such a large temporal and geographical scale.
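For the continuous case, the sketch below (not from the chapter) estimates the ROC curve and AUC of a synthetic continuous score and empirically checks the pairwise-ordering interpretation of the AUC; it assumes NumPy and scikit-learn are available, and the score distributions are invented purely for illustration.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = np.concatenate([np.ones(200), np.zeros(800)])        # D = 1 / 0
y_score = np.concatenate([rng.normal(1.0, 1.0, 200),          # diseased scores
                          rng.normal(0.0, 1.0, 800)])         # non-diseased scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)             # FPF(c), TPF(c)
auc = roc_auc_score(y_true, y_score)

# Empirical check of AUC = P[Y(diseased) > Y(non-diseased)].
diseased, healthy = y_score[y_true == 1], y_score[y_true == 0]
pairwise = (diseased[:, None] > healthy[None, :]).mean()
print(f"AUC = {auc:.3f}, P[Y(D=1) > Y(D=0)] = {pairwise:.3f}")

With continuous scores (negligible ties) the two quantities coincide, which is exactly the ordering interpretation given above.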

4 Future and Emerging Trends

In this chapter we have focused on the data mining and knowledge extraction methods and tools used in medical informatics, attempting to identify … Achieving the vision of individualised healthcare in the post-genomics era requires substantial advances in a number of other scientific domains. Such domains include methods and tools for seamless access and integration of distributed, heterogeneous, multi-level biomedical data; biomedical ontologies; visualisation tools; and computational environments able to support efficient processing of multi-level biomedical data. In the following sections we briefly review current efforts and the state of the art in the domain of biomedical ontologies and discuss the promise of Grid technology in responding to the computational challenges in biomedicine.


4.1 Medical data integration

Ironically, huge gains in efficiency at the "front end" of the discovery pipeline have created huge "downstream" inefficiencies, because the data cannot be accessed, integrated and analyzed quickly enough to meet the demands of drug R&D. The industry has outgrown traditional proprietary data capture and integration methods, and traditional "big IT" approaches solve only part of the problem. First-generation integration solutions centered on the concept of local repositories (silos, warehouses) have not scaled well, are costly to maintain and are ultimately limited in their long-term usefulness.

The integration and exploitation of the data and information generated at all levels by the disciplines of bioinformatics, medical informatics, medical imaging and clinical epidemiology requires a new synergetic approach that enables a bi-directional dialogue between these scientific disciplines and integration in terms of data, methods, technologies, tools and applications. To approach this vision and achieve the aforementioned objectives, a new breed of techniques, systems and software tools is required for two main reasons: (a) to convert the enormous amount of data collected by geneticists and molecular biologists into information that physicians and other healthcare providers can use for the delivery of care, and the converse; and (b) to codify and anonymize clinical phenotypic data for analysis by researchers.

Uniform information modelling: Towards the goal of seamless information and data integration (for sharing, exchanging and processing the relevant information and data items), uniform information and data representation models are needed. The Resource Description Framework (RDF) and XML technology offer the most suitable infrastructure framework for seamless information/data integration (http://www.w3.org/DesignIssues/Toolbox.html). Based on an appropriate RDF query language (Karvounarakis et al., 2001), the generated XML documents can be parsed in order to: (i) homogenize their content (according to the adopted data models and ontologies); and (ii) apply dynamic querying operations that generate sets of data on which intelligent data processing operations can be uniformly applied (Potamias et al., 2004).
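As a simple illustration of such RDF-based querying, the sketch below (not from the chapter) encodes two toy patient records as RDF triples and applies a dynamic SPARQL query. It assumes the rdflib Python library; the ex: vocabulary (ex:Patient, ex:hasDiagnosis, ex:hasAge) and the diagnosis codes are purely hypothetical.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

EX = Namespace("http://example.org/clinical#")
g = Graph()
g.bind("ex", EX)

# Two toy patient records expressed as RDF triples.
g.add((EX.p001, RDF.type, EX.Patient))
g.add((EX.p001, EX.hasDiagnosis, Literal("C50.9")))   # illustrative ICD-style code
g.add((EX.p001, EX.hasAge, Literal(54)))
g.add((EX.p002, RDF.type, EX.Patient))
g.add((EX.p002, EX.hasDiagnosis, Literal("I10")))
g.add((EX.p002, EX.hasAge, Literal(67)))

# A dynamic query: all patients carrying a given diagnosis code.
results = g.query("""
    PREFIX ex: <http://example.org/clinical#>
    SELECT ?patient ?age WHERE {
        ?patient a ex:Patient ;
                 ex:hasDiagnosis "C50.9" ;
                 ex:hasAge ?age .
    }
""")
for patient, age in results:
    print(patient, age)

The same graph could be serialized to XML and homogenized against an agreed ontology before such queries are applied, in the spirit of the integration pipeline described above.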

4.2 Expert knowledge integration

A peculiarity of medical problems which, if exploited, can lead to great gains in effectiveness is the need for integration of expert knowledge into data-driven decision support systems. The difficulties encountered in this task are twofold. First, the system's representation of the problem has to be presented to the clinician in a comprehensible way so that he or she is able to provide feedback; the mere presentation of classification results is not adequate for a productive man-machine interaction. Second, the expert's assessment has to be quantified in a form that the decision support model can use. In practice, black-box techniques do not lend themselves easily to such tasks. Support Vector Machines can take advantage of prior knowledge in the form of preselected feature transformations that encode the symmetries of the dataset or characteristics indicated by a human expert. Bayesian belief networks, on the other hand, provide both a graphical map of the problem's interactions and a natural way to incorporate experts' input through conditional probability values. Other relevant directions include expert-guided variable selection and the verification of model results by a practitioner.
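As a toy illustration of how expert-quantified knowledge can enter such a model, the sketch below (not from the chapter; all probability values are hypothetical) builds a small belief network Disease -> {Marker, Symptom} from expert-elicited conditional probability tables and computes the posterior by enumeration, assuming the two findings are conditionally independent given the disease.

# Expert-elicited prior and conditional probability tables (hypothetical).
p_disease = 0.02                              # P(D=1)
cpt = {
    "marker":  {1: 0.85, 0: 0.10},            # P(Marker=1  | D=d)
    "symptom": {1: 0.70, 0: 0.20},            # P(Symptom=1 | D=d)
}

def posterior(evidence):
    """P(D=1 | evidence); evidence maps variable name -> observed 0/1 value."""
    def joint(d):
        prob = p_disease if d == 1 else 1 - p_disease
        for var, value in evidence.items():
            p1 = cpt[var][d]
            prob *= p1 if value == 1 else 1 - p1
        return prob
    return joint(1) / (joint(1) + joint(0))

print(f"P(D=1 | marker+, symptom+) = {posterior({'marker': 1, 'symptom': 1}):.3f}")
print(f"P(D=1 | marker+, symptom-) = {posterior({'marker': 1, 'symptom': 0}):.3f}")

The conditional probability values are exactly the quantities a clinical expert can be asked to supply or adjust, which is what makes this family of models attractive for man-machine interaction.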

4.3 Presentation/visualization

Decision support systems additionally require clear and concise graphical user interfaces. Such interfaces need to summarize the patient history as a timeline and to present and highlight the critical information, so as to assist rapid situation awareness and action by the clinician.

4.4 Medical Ontologies

Given the increasing availability of biomedical information located at different sites and accessible over the Internet, researchers need new methods to integrate such information, as well as novel methods to search, access and retrieve it, since it must be gathered, classified and interpreted. To integrate distributed and heterogeneous databases, two levels of heterogeneity must be considered: (i) databases may be located on various platforms spread over the Internet, with different architectures, operating systems and database management systems; and (ii) databases may present different conceptual data models and different underlying database schemes.

The development of medical ontologies has surfaced in part as a by-product of the genomic ontologies development effort. Researchers trying to map genes to phenotypes needed a uniform representation of both genomic and medical concepts and therefore extended the first ontologies to include diseases, body characteristics and small-scale human biological formations. Access to the patients' clinical and genomic information sources should be syntactically and semantically consistent; in other words, the posted queries as well as the data-extraction functions should be consistent with a uniform clinical data model. Standard and well-documented interfaces expressed in the Interface Definition Language (IDL) provide the basic support for interoperability among clinical data and information. In particular, the Clinical Observation Access Service interface of the OMG group (http://www.omg.org/technology/documents/formal/clinical_observation_access_service.htm) and the HL7 RIM (Health Level 7 Reference Information Model; http://www.hl7.org/library/datamodel/RIM/C30118/rim.html) may serve the needs for the uniform modelling of, and access to, clinical information items. Furthermore, general and specific medical ontologies may be utilised (e.g. UMLS, SNOMED, MeSH, ICD).

4.5 Biomedical GRIDs

The Grid, the IT infrastructure of the future, promises to transform computation, communication and collaboration. GRID systems and applications aim to integrate, 'virtualise' and manage resources and services within distributed, heterogeneous, dynamic "virtual organizations". Grid computing delivers on the potential of the growth and abundance of network-connected systems and bandwidth: computation, collaboration and communication over the Web. At the heart of Grid computing is a computing infrastructure that provides dependable, consistent, pervasive and inexpensive access to computational capabilities; by pooling federated assets into a virtual system, a grid provides a single point of access to powerful distributed resources (http://www.sun.com/software/grid/overview.xml).

A number of challenging, as well as promising, technologies need to be smoothly integrated in order to realize the envisioned biomedical GRID infrastructure and related services (see Figure 5). Many research and development areas in informatics are necessary to support medical research, such as the development of models and digital simulations, the preprocessing of imaging data, the parametrization of disease classification models, diagnostic support, and global-scale data access and association (Oliveira et al., 2004; Berti et al., 2003). Expected contributions of Grid technology to medical research and, more generally, to the realization of genomic medicine include (Healthgrid White Paper, 2005):

• Providing personalized healthcare services based on: (a) the genetic profile of each patient, (b) epidemiological studies, (c) heredity, (d) statistical analysis results, and (e) clinical observations.

• Developing models and digital simulations of diseases.

• Providing tools to support physicians' training and to improve biomedical knowledge management.

• Integrating databases and knowledge between the clinical world and that of genomic research.

4.6 Ethical and legal issues

The automated analysis of biomedical data also raises ethical and legal issues concerning its implications for individual patient data. Equally important is the human factor: the training and familiarization of clinical users, the handling of errors, the interpretation of model outputs, and a clear appreciation of the limitations of such systems.

5 Conclusions

In this chapter we have identified common algorithmic aspects in biomedical data processing by reference to a more generic taxonomy of approaches to pattern processing. In biomedical data analysis, whether the task is time series or image analysis, microarray processing or histology analysis, common themes emerge, such as the desire to reduce noise, reduce dimension, transform the data to representations more suitable for subsequent interpretation, extract similarities, and exploit dissimilarities. In many areas there have been significant advances and much international research (notably in neural networks, optimisation and image analysis, and Bayesian approaches to inference). However, we would identify a few bottleneck areas:

• The development of methods which have been genuinely devised to support medical decision making (i.e. with clinician involvement in how complex and high-dimensional data could be presented to a clinician to help in a diagnosis or prognosis). To this end, tools and techniques which link clinicians to data and to the analysis of that data need to be encouraged.

• Medical data is notoriously unreliable, noisy, distributed and incomplete, yet many common methods assume data integrity. Proportionally, there are insufficient numbers of methods being examined which deal with uncertainty explicitly, both in their inputs and in their outputs; truly probabilistic methods which present predictions along with the uncertainties in those predictions are very few.

Similarly, developments in other areas such as complexity, communications and information theory probably have a great deal to offer to biomedical data processing as methods cross discipline boundaries. What we have presented in this chapter is therefore a summary of current common aspects, reflecting future promise and indicating that much more research will be needed to deal with the pattern processing waterfall. Research in the field indicates that the fundamental methods for multilevel biomedical data do exist. Advances are required in other areas, such as (a) biomedical ontologies, (b) ontology-based integration of heterogeneous biomedical data, and (c) service-oriented computational frameworks capitalizing on modern technologies (e.g. the Grid) that enable the fast and efficient processing of biomedical data.

Acknowledgements

References

[1] Gupta, S.K., Rao, S., Bhatnagar, V.: K-means Clustering Algorithm for Categorical Attributes. LNCS 1676 (1999) 203-208.

[2] Katehakis, D.G., Sfakianakis, S., Tsiknakis, M., Orphanoudakis, S.C.: An Infrastructure for Integrated Electronic Health Record Services: The Role of XML (Extensible Markup Language). Journal of Medical Internet Research 3:1, 2001.

[3] Lopez, L.M., Ruiz, I.F., Bueno, R.M., Ruiz, G.T.: Dynamic Discretisation of Continuous Values from Time Series. In Proc. 11th European Conference on Machine Learning (ECML 2000), eds. R. Lopez de Mantaras and E. Plaza, LNAI 1810 (2000) 290-291.

[4] Potamias, G., Gaga, L., Blazadonakis, M., Moustakis, V. (1993). Multistrategy learning support to the acquisition of clinical knowledge: methodological issues (extended abstract). In MLNet Workshop on Multi-Strategy Learning, Blanes, Spain.

[5] Quinlan, J.R.: Induction of decision trees. Machine Learning 1:1 (1986) 81-106.

[6] San, O.M., Huynh, V-N., Nakamori, Y.: An alternative extension of the k-means algorithm for clustering categorical data. Int. J. Appl. Math. Comput. Sci. 14:2 (2004) 241-247.

[7] Committee on Quality of Health Care in America (2001). Crossing the Quality Chasm: A New Health System for the 21st Century. Washington, DC: National Academy Press.

[8] Grimson, J. (2001). Delivering the electronic healthcare record for the 21st century. Int J Med Inform 64:2-3, pp. 111-127.

[9] Gunter, C. (2004). Human genomics and medicine. Nature 429, p. 439.

[10] Healthgrid White Paper (2005). A joint White Paper from the Healthgrid Association and Cisco Systems, November 2005. [http://whitepaper.healthgrid.org/30260HealthgridWPv5.pdf]

[11] Hilario, M., Kalousis, A., Prados, J., Binz, P.A. (2004). Data Mining for Mass-Spectra Based Diagnosis and Biomarker Discovery. Biosilico 2:5, pp. 171-222.

[12] Ideker, T., Thorsson, V., Ranish, J.A., Christmas, R., Buhler, J., Eng, J.K., Bumgarner, R., Goodlett, D.R., Aebersold, R., Hood, L. (2001). Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292, pp. 929-934.

[13] Katehakis, D.G., Tsiknakis, M., Orphanoudakis, S. (2002). Towards an Integrated Electronic Health Record - Current Status and Challenges. Business Briefing: Global Healthcare 2002, The Official Publication of the World Medical Association, January 2002. [http://www.ics.forth.gr/~katehaki/publications/bb2002.pdf]

[14] Kohn, L.T., Corrigan, J.M., Donaldson, M.S. (eds) (1999). To Err Is Human: Building a Safer Health System. Committee on Quality of Health Care in America. Washington, DC: National Academy Press.

[15] Bichindaritz, I., Akkineni, S.: Concept mining for indexing medical literature. Engineering Applications of Artificial Intelligence 19:4, June 2006, pp. 411-417.

[16] Sharpe, P.K., Caleb, P.: Artificial neural networks within medical decision support systems. Scand. Journal of Clinical and Lab Investigation, 1994, 219, pp. 3-11.

[17] Veropoulos, K., Campbell, C., Cristianini, N.: Controlling the Sensitivity of Support Vector Machines. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI99), Workshop ML3, pp. 55-60.

[18] Smith, A.E., Nugent, C.D., McClean, S.I.: Evaluation of inherent performance of intelligent medical decision support systems: utilising neural networks as an example. Artificial Intelligence in Medicine, January 2003, 27:1, pp. 1-27.

[19] Brameier, M., Banzhaf, W.: A comparison of linear genetic programming and neural networks in medical data mining. IEEE Trans. Evolutionary Computation 5:1, February 2001.

[20] Lisboa, P.J.G., Vellido, A., Wong, H.: Outstanding Issues for Clinical Decision Support with Neural Networks. In H. Malmgren, M. Borga, L. Niklasson (eds), Artificial Neural Networks in Medicine and Biology, Springer, London, pp. 63-71, 2000.

[21] Burton, A., Altman, D.G.: Missing covariate data within cancer prognostic studies: a review of current reporting and proposed guidelines. British Journal of Cancer (2004) 91, 4-8.

[22] Schneider, T. (2001). Analysis of Incomplete Climate Data: Estimation of Mean Values and Covariance Matrices and Imputation of Missing Values. Journal of Climate 14:5, pp. 853-871.

[23] Schafer, J.L.: Analysis of Incomplete Multivariate Data. London: Chapman & Hall, 1997. ISBN: 0-412-04061-1.

[24] Bernards, C.A., Farmer, M.M., et al.: Comparison of two Multiple Imputation Procedures in a Cancer Screening Survey. Journal of Data Science 1 (2003).

[25] Schafer, J., Olsen, M.: Multiple imputation for multivariate missing-data problems: a data analyst's perspective. Multivariate Behavioural Research 33, pp. 545-571.

[26] Dempster, A.P.: A generalization of Bayesian inference. Journal of the Royal Statistical Society, Series B, 30, pp. 205-247, 1968.

[27] Friedman, N., Goldszmidt, M.: Learning Bayesian Networks from Data. SRI International, 1998.

[28] Milho, I., Fred, A.: An auxiliary system for medical diagnosis based on Bayesian belief networks. In Proc. RECPAD'2000, pp. 271-276.

[29] Martin-Sanchez, F., et al. (2004). Synergy between medical informatics and bioinformatics: facilitating genomic medicine for future health care. Journal of Biomedical Informatics 37:1, pp. 30-42.

[30] MI_BI_NI (2001). Synergy between Research in Medical Informatics, Bio-Informatics and Neuro-Informatics: Knowledge empowering Individualised Healthcare and Well-Being. Workshop on Biomedical Informatics, 14 December 2001, Pyramids, Place Rogier, Brussels.

[31] Moorman, P.W., et al. (1995). Evaluation of reporting based on descriptional knowledge. J. Am. Med. Inform. Assoc. 2:6, pp. 365-373.

[32] Nature (2004). Making data dreams come true (editorial). Nature 428:6980, p. 239.

[33] Oliveira, I.C., Oliveira, J.L., Martin-Sanchez, F., Maojo, V., Pereira, A.S. (2004). Biomedical information integration for health applications with Grid: a requirements perspective.

[34] Pittman, J., Huang, E., Dressman, H., Horng, C-F., Cheng, S.H., Tsou, M-H., Chen, C-M., Bild, A., Iversen, E.S., Huang, A.T., Nevins, J.R., West, M. (2004). Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proc Natl Acad Sci USA 101:22, pp. 8431-8436.

[35] Pomeroy, S.L., et al. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415:6870, pp. 436-442.

[36] Potamias, G., Koumakis, L., Moustakis, V. (2004). Mining XML Clinical Data: The HealthObs System. Ingenierie des systemes d'information, special session: Recherche, extraction et exploration d'information 10:1, 2005.

[37] Potamias, G., Moustakis, V. (2001). Knowledge Discovery from Distributed Clinical Data Sources: The Era for Internet-Based Epidemiology. In Proc. 23rd Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Istanbul, Turkey, pp. 25-28.

[38] Rindfleisch, T.C., Brutlag, D.L. (1998). Directions for Clinical Research and Genomic Research into the Next Decade: Implications for Informatics. JAMIA 5:5, pp. 404-411.

[39] Shabo, A., Vortman, P., Robson, B. (2001). Who's Afraid of Lifetime Electronic Medical Records? Towards Electronic Health Records (TEHRE 2001), November 14, London, UK. [http://www.haifa.il.ibm.com/projects/software/imr/papers/WhosAfraidOfEMRfinal.pdf]

[40] Weed, L.L. (1991). Knowledge Coupling: New premises and new tools for medical care and education. Springer-Verlag. ISBN: 0387975373.

[41] Steele, A.: Medical Informatics Around the World: An International Perspective Focusing on Training Issues. Universal Publishers, 2002. ISBN: 1581126344.

[42] Cimino, J.J., Shortliffe, E.H.: Biomedical Informatics: Computer Applications in Health Care and Biomedicine (Health Informatics). 2006.

[43] Chen, H., Fuller, S.S., Friedman, C., Hersh, W. (eds): Medical Informatics: Knowledge Management and Data Mining in Biomedicine (Integrated Series in Information Systems). Springer, 2005.

[44] Standard Guide for Content and Structure of the Electronic Health Record (EHR). American Society for Testing and Materials, Annual Book of Standards, 1999.

[45] Greenes, R.A., Shortliffe, E.H.: Medical informatics. An emerging academic discipline and institutional priority. JAMA 1990; 263:1114-20.

[46] Bates, D.W.: The quality case for information technology in healthcare. BMC Medical Informatics and Decision Making 2002, 2:7.

[47] Dimou, I., Manikis, G., Zervakis, M.: Classifier Fusion approaches for diagnostic cancer models. IEEE EMBS 2006.

[48] Kuncheva, L.I., Bezdek, J.C., Duin, R.P.W.: Decision templates for multiple classifier fusion: An experimental comparison. Pattern Recognition 34:2 (2001) 299-314.

[49] Altincay, H.: On naive Bayesian fusion of dependent classifiers. Pattern Recognition Letters 26 (2005) 2463-2473.

[50] Tumer, K., Ghosh, J.: Classifier combining: Analytical results and implications. AAAI 96 Workshop on Induction of Multiple Learning Models, 1995.

[51] Kuncheva, L.I.: A theoretical study on six classifier fusion strategies. IEEE Trans. on Pattern Analysis and Machine Intelligence 24:2, February 2002.

[52] Ruta, D., Gabrys, B.: An overview of classifier fusion methods. Computing and Information Systems 7 (2000) 1-10.

[53] Van Holsbeke, C., Van Calster, B., et al.: An external validation of different mathematical models to distinguish between benign and malignant adnexal tumors: a multicenter study by the International Ovarian Tumor Analysis (IOTA) group. Clinical Cancer Research, ?? 2006.
