A NEW CLUSTERING ALGORITHM FOR SEGMENTATION OF MAGNETIC RESONANCE IMAGES

By ERHAN GOKCAY

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2000

Copyright 2000 by Erhan Gokcay

ACKNOWLEDGEMENTS

First and foremost, I wish to thank my advisor, Dr. Jose Principe. He allowed me the freedom to explore while at the same time providing invaluable insight, without which this dissertation would not have been possible. I also wish to thank the members of my committee, Dr. John Harris, Dr. Christiana Leonard, Dr. Joseph Wilson, and Dr. William Edmonson, for their insightful comments, which improved the quality of this dissertation. Finally, I wish to thank my wife Didem and my son Tugra for their patience and support during the long nights I have been working.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT

CHAPTERS

1 INTRODUCTION
  1.1 Magnetic Resonance Image Segmentation
  1.2 Image Formation in MRI
  1.3 Characteristics of Medical Imagery
  1.4 Segmentation of MR Images
    1.4.1 Feature Extraction
    1.4.2 Gray Scale Single Image Segmentation
    1.4.3 Multispectral Segmentation
  1.5 Validation
    1.5.1 MRI Contrast Methods
    1.5.2 Validation Using Phantoms
    1.5.3 Validation Using MRI Simulations
    1.5.4 Manual Labeling of MR Images
    1.5.5 Brain Development During Childhood
  1.6 Motivation
  1.7 Outline

2 UNSUPERVISED LEARNING AND CLUSTERING
  2.1 Classical Methods
    2.1.1 Discussion
    2.1.2 Clustering Criterion
    2.1.3 Similarity Measures
  2.2 Criterion Functions
    2.2.1 The Sum-of-Squared-Error Criterion
    2.2.2 The Scatter Matrices
  2.3 Clustering Algorithms
    2.3.1 Iterative Optimization
    2.3.2 Merging and Splitting
    2.3.3 Neighborhood Dependent Methods
    2.3.4 Hierarchical Clustering
    2.3.5 Nonparametric Clustering
  2.4 Mixture Models
    2.4.1 Maximum Likelihood Estimation
    2.4.2 EM Algorithm
  2.5 Competitive Networks
  2.6 ART Networks
  2.7 Conclusion

3 ENTROPY AND INFORMATION THEORY
  3.1 Introduction
  3.2 Maximum Entropy Principle
    3.2.1 Mutual Information
  3.3 Divergence Measures
    3.3.1 The Relationship to Maximum Entropy Measure
    3.3.2 Other Entropy Measures
    3.3.3 Other Divergence Measures
  3.4 Conclusion

4 CLUSTERING EVALUATION FUNCTIONS
  4.1 Introduction
  4.2 Density Estimation
    4.2.1 Clustering Evaluation Function
    4.2.2 CEF as a Weighted Average Distance
    4.2.3 CEF as a Bhattacharya Related Distance
    4.2.4 Properties as a Distance
    4.2.5 Summary
  4.3 Multiple Clusters
  4.4 Results
    4.4.1 A Parameter to Control Clustering
    4.4.2 Effect of the Variance to the Pdf Function
    4.4.3 Performance Surface
  4.5 Comparison of Distance Measures in Clustering
    4.5.1 CEF as a Distance Measure
    4.5.2 Sensitivity Analysis
    4.5.3 Normalization

5 OPTIMIZATION ALGORITHM
  5.1 Introduction
  5.2 Combinatorial Optimization Problems
    5.2.1 Local Minima
    5.2.2 Simulated Annealing
  5.3 The Algorithm
    5.3.1 A New Neighborhood Structure
    5.3.2 Grouping Algorithm
    5.3.3 Optimization Algorithm
    5.3.4 Convergence
  5.4 Preliminary Result
  5.5 Comparison

6 APPLICATIONS
  6.1 Implementation of the IMAGETOOL Program
    6.1.1 PVWAVE Implementation
    6.1.2 Tools Provided With the System
  6.2 Testing on MR Images
    6.2.1 Feature Extraction
    6.2.2 Test Image
  6.3 Validation
    6.3.1 Brain Surface Extraction
    6.3.2 Segmentation
    6.3.3 Segmentation Results
  6.4 Results

7 CONCLUSIONS AND FUTURE WORK

REFERENCES

BIOGRAPHICAL SKETCH


ABSTRACT

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

A NEW CLUSTERING ALGORITHM FOR SEGMENTATION OF MAGNETIC RESONANCE IMAGES

By Erhan Gokcay August 2000

Chairman: Dr. Jose C. Principe Major Department: Computer and Information Science and Engineering

The major goal of this dissertation is to present a new clustering algorithm based on information theoretic measures and to apply the algorithm to segment Magnetic Resonance (MR) images. Since MR images are highly variable from subject to subject, data driven segmentation methods seem appropriate. We developed a new clustering evaluation function based on information theory that outperforms previous clustering algorithms, and the new cost function works as a valley seeking algorithm. Since optimization of the clustering evaluation function is difficult because of its stepwise nature and the existence of local minima, we developed an improvement on the K-change algorithm commonly used in clustering problems. When applied to nonlinearly separable data, the algorithm performed very well and was able to find the nonlinear boundaries between clusters without supervision.


The clustering algorithm is applied to segment brain MR images with successful results. A feature set is created from the MR images using entropy measures of small blocks from the input image. Clustering the whole brain image is computationally intensive. Therefore, a small section of the brain is first used to train the clustering algorithm. Afterwards, the rest of the brain is clustered with the proposed distance measure using the results obtained from the training image. The algorithm is easy to apply, and the calculations are simplified by choosing a proper distance measure which does not require numerical integration.


CHAPTER 1 INTRODUCTION

1.1 Magnetic Resonance Image Segmentation

Segmentation of medical imagery is a challenging task due to the complexity of the images, as well as the absence of anatomical models that fully capture the possible deformations in each structure. Brain tissue is a particularly complex structure, and its segmentation is an important step for the derivation of computerized anatomical atlases, as well as for pre- and intra-operative guidance for therapeutic intervention.

MRI segmentation has been proposed for a number of clinical investigations of varying complexity. Measurements of tumor volume and its response to therapy have used gray scale image methods as applied to X-ray Computerized Tomography (CT) or simple MRI datasets [Cli87]. However, the differentiation of tissues within tumors that have similar MRI characteristics, such as edema, necrotic, or scar tissue, has proven to be important in the evaluation of response to therapy, and hence multispectral methods have been proposed [Van91a] [Cla93]. Recently, multimodality approaches, such as positron emission tomography (PET) and functional magnetic resonance imaging (fMRI) studies using radiotracers [Tju94] or contrast materials [Tju94] [Buc91], have been suggested to provide better tumor tissue specification and to identify active tumor tissue. Hence, segmentation methods need to include these additional image data sets. In the same context, a similar progression of segmentation methods is evolving for the planning of surgical procedures, primarily in neurological investigations [Hil93] [Zha90] [Cli91], surgery simulations [Hu90] [Kam93], or the actual implementation of surgery in the operating suite, where both normal tissues and the localization of the lesion or mass need to be accurately identified. The methods proposed include gray scale image segmentation and multispectral segmentation for anatomical images, with additional recent efforts directed toward the mapping of functional metrics (fMRI, EEG, etc.) to provide the locations of important functional regions of the brain as required for optimal surgical planning.

Other applications of MRI segmentation include the diagnosis of brain trauma, where white matter lesions, a signature of traumatic brain injury, may potentially be identified in moderate and possibly mild cases. These methods, in turn, may require correlation of anatomical images with functional metrics to provide sensitive measurements of brain trauma. MRI segmentation methods have also been useful in the diagnostic imaging of multiple sclerosis [Wal92], including the detection of lesions [Raf90] and the quantitation of lesion volume using multispectral methods [Jac93].

In order to understand the issues in medical image segmentation, in contrast with the segmentation of, say, images of indoor environments (the kind of images with which general purpose visual segmentation systems deal), we need an understanding of the salient characteristics of medical imagery. One application of our clustering algorithm is to map and identify important brain structures, which may be important in brain surgery.

1.2 Image Formation in MRI

MRI exploits the inherent magnetic moment of certain atomic nuclei. The nucleus of the hydrogen atom (the proton) is used in biologic tissue imaging due to its abundance in the human body and its large magnetic moment. When the subject is positioned in the core of the imaging magnet, protons in the tissues experience a strong static magnetic field and precess at a characteristic frequency that is a function solely of the magnetic field strength, and does not depend, for instance, on the tissue to which the proton belongs. An excitation magnetic field is applied at this characteristic frequency to alter the orientation of precession of the protons. The protons relax to their steady state after the excitation field is stopped. MRI is useful because protons in different tissues relax to their steady state at different rates. MRI essentially measures the components of the magnitude vector of the precession orientation at different times and thus differentiates tissues. These measures are encoded in 3D using methods for slice selection, frequency encoding, and phase encoding. Slice selection is performed by exciting thin cross-sections of tissue one at a time. Frequency encoding is achieved by varying the returned frequency of the measured signal, and phase encoding is done by spatially varying the returned phase of the measured signal.

1.3 Characteristics of Medical Imagery

While the nature of medical imagery allows a segmentation system to ignore issues such as illumination and pose determination that would be important to a more general purpose segmentation system, there are other issues, which are briefly discussed below. The objects to be segmented from medical imagery are actual anatomical structures, which are often non-rigid and complex in shape and exhibit considerable variability from person to person. This, combined with the absence of explicit shape models that capture the deformations in anatomy, makes the segmentation task challenging. Magnetic resonance images are further complicated by limitations in the imaging equipment: inhomogeneities in the receiver or transmitter coils lead to a non-linear gain artifact in the images, and large differences in the magnetic susceptibilities of adjacent tissues lead to distortion of the gradient magnetic field, and hence a spatial susceptibility artifact in the images. In addition, the signal is degraded by the motion artifacts that may appear in the images due to movement of the subject during the scan.

1.4 Segmentation of MR Images

MR segmentation can be roughly divided into two categories: single image segmentation, where a single 2D or 3D gray scale image is used, and multispectral image segmentation, where multiple MR images with different gray scale contrasts are available.

1.4.1 Feature Extraction

Segmentation of MR images is based on sets of features that can be extracted from the images, such as pixel intensities, which in turn can be used to calculate other features such as edges and texture. Rather than using all the information in the images at once, feature extraction and selection reduces the problem of segmentation to the grouping of feature vectors [Jai89] [Zad92] [Zad96]. Selection of good features is the key to successful segmentation [Pal93]. The focus of this thesis is not feature extraction. Therefore, we will use a simple but effective feature extraction method based on entropy measures, and we will not investigate feature extraction further.

1.4.2 Gray Scale Single Image Segmentation

The most intuitive approach to segmentation is global thresholding. One common difficulty with this approach is determining the value of the thresholds. Knowledge guided thresholding methods, where global thresholds are determined based on a "goodness function" describing the separation of background, skull and brain, have been reported [Van88] [Cla93] [Jac93] [Hal92] [Lia93] [Ger92b]. The method is limited, and its successful application for clinical use is hindered by the variability of anatomy and MR data. Edge detection schemes [Pel82] [Mar80] [Bru93] suffer from incorrect detection of edges due to noise, over- and under-segmentation, and variability in threshold selection in the edge image [Del91a] [Sah88]. Combination with morphological filtering [Rit96] [Rit87a] [Rit87b] has also been reported [Bom90]. Another method is boundary tracing [Ash90], where the operator clicks a pixel in a region to be outlined and the method then finds the boundary starting from that point. It is usually restricted to the segmentation of large, well defined structures, and cannot distinguish tissue types. Seed growing methods have also been reported [Rob94], where the segmentation requires an operator to empirically select seeds and thresholds. Pixels around the seed are examined and included in the region if they are within the thresholds; each added pixel then becomes a new seed, as illustrated in the sketch below. Random field methods have also been successfully applied; they require an energy function describing the problem, which is often very difficult to define [Bou94] [Kar90].
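The seed growing step is easy to sketch in code. The following is a minimal Python illustration and is not part of the original work: the image, the seed location, and the thresholds are hypothetical, and a practical implementation would add connectivity options and operator interaction.

```python
import numpy as np

def region_grow(image, seed, low, high):
    """Grow a region from `seed`, adding 4-connected neighbors whose
    intensity lies within [low, high]; each accepted pixel becomes a new seed."""
    rows, cols = image.shape
    segmented = np.zeros_like(image, dtype=bool)
    stack = [seed]
    while stack:
        r, c = stack.pop()
        if not (0 <= r < rows and 0 <= c < cols):
            continue
        if segmented[r, c] or not (low <= image[r, c] <= high):
            continue
        segmented[r, c] = True
        stack.extend([(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)])
    return segmented

# Hypothetical example: a bright square on a dark background.
img = np.zeros((32, 32))
img[8:24, 8:24] = 100.0
mask = region_grow(img, seed=(16, 16), low=50.0, high=150.0)
print(mask.sum())  # 256 pixels: the bright square is recovered
```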

1.4.3 Multispectral Segmentation

Supervised methods require a user supplied training set, usually found by drawing regions of interest on the images. In maximum likelihood (ML) methods, where multivariate Gaussian distributions are assumed [Cla93] [Hal92] [Lia94] [Ger92b] [Kar90], statistics such as the mean and covariance matrices are calculated. The remaining pixels are then classified by calculating the likelihood of each tissue class and picking the tissue type with the highest probability. Parametric methods are useful when the feature distributions for different classes are well known, which is not necessarily the case for MR images [Cla93]. k-nearest neighbor (kNN) classification has given superior results both in terms of accuracy and reproducibility compared to parametric methods [Cla93]. Artificial neural networks are also commonly used [Cla93] [Hal92] [Daw91] [Hay94] [Has95] [Hec89] [Ozk93] [Wan98]. All supervised methods are operator dependent. Inter- and intra-operator variability has been measured and shown to be relatively large [Cla93] [Ger92b] [Dav96]. For this reason, unsupervised methods may be preferred from the viewpoint of reproducibility.

Unsupervised techniques, usually called clustering, automatically find the structure in the data. A cluster is an area in feature space with a high density. Clustering methods include k-means [Bez93] [Ger92b] [Van88] and its fuzzy equivalent, fuzzy c-means [Bez93] [Hal92] [Phi95]. These methods and their variants are basically limited to finding linearly separable clusters. Another promising development is the use of semi-supervised methods [Ben94b]. The expectation-maximization (EM) algorithm has also been used in clustering of MR images [Wel96] [Dem77] [Gui97], where the knowledge of the tissue class is considered as the missing information. As usual, the method assumes a normal distribution, and it incorporates an explicit model of the bias field which frequently arises in MR images.

Model based approaches, including deformable models [Dav96] [Coh91], also known as active contour models, provide a method for minimizing an objective function to obtain


a contour of interest, especially if an approximate location of the contour is available. A deformable contour is a planar curve which has an initial position and an objective function associated with it. A special class of deformable contours called snakes was introduced by Witkin [Wit88], in which the initial position is specified interactively by the user and the objective function is referred to as the energy of the snake. The snake tries to minimize its energy over time, similar to physical systems. This energy of the snake is expressed as a sum of two components: the internal energy of the snake and the external energy of the snake which is given as

\[
E_{snake} = E_{internal} + E_{external} \tag{1.1}
\]

The internal energy term imposes a piecewise smoothness constraint on the snake, and the external energy term is responsible for attracting the snake to interesting features in the image. The balloon model for deformable contours is an extension of the snake model. It modifies the snake energy to include a “balloon” force, which can be either an inflation force, or a deflation force. All these methods require an operator input to place the snake close to a boundary.

1.5 Validation

MRI segmentation is being proposed as a method for determining the volume of tissues and their 3D spatial distributions in applications involving diagnostic, therapeutic, or surgical simulation protocols. Some form of quantitative measure of the accuracy and/or reproducibility of the proposed segmentation method is clearly required. Since a direct measure of ground truth is not logistically feasible, or even possible with pathologic correlation, several alternative procedures have been used.


1.5.1 MRI Contrast Methods

The use of MR contrast agents in neuroinvestigations of the brain provides information about whether or not a breakdown of the blood-brain barrier (BBB) has occurred and about the integrity of the tissue vascularity, both of which are often tumor type- and stage-dependent [Run89] [Bra93] [Hen93] [Bro90]. However, MR contrast may not be optimum for the quantitative differentiation of active tumor tissue, scar tissue, or recurrent tumors. Many segmentation methods, in particular gray scale methods and multispectral methods, use MR contrast information with T1-weighted images for tumor volume or size estimations despite the limitations of these methods in the absence of ground truth determinations [Mcc94] [Gal93]. Recently the use of multi-modality imaging methods, such as correlation with PET studies, has been proposed to identify active tissues [Tju94]. Alternatively, the use of fMRI measurement of contrast dynamics has been suggested to provide better differentiation of active tumor tissue in neurological investigations, and these functional images could potentially be included in segmentation methods [4,5].

1.5.2 Validation Using Phantoms

The use of phantoms constructed with compartments containing known volumes is widely reported [Cli91] [Jac93] [Koh91] [Ger92b] [Mit94] [Bra94] [Pec92] [Jac90]. The typical phantom represents a very idealized case consisting of two or three highly contrasting classes in a homogeneous background [Koh91] [Ash90] [Jac90]. Phantoms containing paramagnetic agents have been introduced to mimic the MRI parameters of the tissues being modelled [Cli91] [Jac93] [Ger92b]. However, phantoms have not evolved to encompass all the desired features which would allow a realistic segmentation validation, namely: a high level of geometric complexity in three dimensions; multiple classes (e.g. representative of white matter, gray matter, cerebrospinal fluid, tumor, background, etc.); and, more importantly, RF coil loading similar to humans and MRI parameter distributions similar to those of human tissue. The reported accuracy obtained using phantoms is very high for large volumes [Koh91] [Bra94] [Ash90], but decreases as the volumes become smaller [Koh91] [Ger92b]. For a true indication of the maximum obtainable accuracy of the segmentation methods, the phantom volumes should be comparable to the anatomical or pathological structures of interest.

In summary, phantoms do not fully exhibit the characteristics that make segmentation of human tissues so difficult. The distribution of MRI parameters for a given tissue class is not necessarily Gaussian or unimodal, and will often overlap for different tissues. The complex spatial distribution of the tissue regions, in turn, may cause the MR image intensity in a given pixel to represent signal from a mix of tissues, commonly referred to as the partial volume artifact. Although phantom images provide an excellent means for daily quality control of the MRI scanner, they can only provide a limited degree of confidence in the reliability of the segmentation methods. We believe using phantoms would limit the evaluation of the algorithm, because of the limited modelling capabilities of phantoms. Using MR images of the brain itself is more suitable for our purpose.

1.5.3 Validation Using MRI Simulations

Because of the increase in computer speeds, studying MR imaging via computer simulations has become very attractive. Several MR signal simulations, and the resulting image construction methods, can be found in the literature [Bit84] [Odo85] [Sum96] [Bea92] [Hei93] [Pet93] [Sch94]. These simulation methods have so far not been used for the evaluation of segmentation methods, but were used to investigate a wide variety of MR processes, including optimization of RF pulse techniques [Bit94] [Sum96] [Pet93], the merit of spin warp imaging in the presence of field inhomogeneities and gradient field phase-encoding with wavelet encoding [Bea92], and noise filtering [Hei93].

In summary, simulation methods can be extended to include MRI segmentation analysis. The robustness of the segmentation process may be probed by corrupting the simulated signal with noise, nonlinear field gradients, or, more importantly, nonuniform RF excitation. In this fashion, one source of signal uncertainty can be introduced at a time, and the resulting segmentation uncertainty can be related to the signal source uncertainty in a quantifiable manner.

1.5.4 Manual Labeling of MR Images

Some validation methods have let experts manually trace the boundaries of the different tissue regions [Van91a] [Del91b] [Zij93] [Vel94]. The major advantage of the manual labeling technique is that it truly mimics the radiologist's interpretation, which realistically is the only "valid truth" available for in vivo imaging. However, there is considerable variation between operators [Ger92b] [Eil90], limiting "ground truth" determinations. Furthermore, manual labeling is labor intensive and currently cannot feasibly be performed for large numbers of image data sets. Improvements in the area of manual labeling may be found by interfacing locally operating segmentation techniques with manual improvements. As manual labeling allows an evaluation most closely related to the radiologists' opinion, these improvements deserve further investigation.

1.5.5 Brain Development During Childhood

Normal brain development during childhood is a complex and dynamic process for which detailed scientific information is lacking. Several studies have investigated the volumetric analysis of the brain during childhood [Van91b] [Puj93] [Gie99] [Rei96] [Ben94a] [Bro87]. Prominent, age-related changes in gray matter, white matter and CSF volumes are evident during childhood and appear to reflect ongoing maturation and remodelling of the central nervous system. There is little change in cerebral volume after the age of 5 years in either male or female subjects [Rei96] [Gie99]. After removing the effect of total cerebral volume, age was found to predict a significant proportion of the variance in cortical gray matter and cerebral white matter volume, such that increasing age was associated with decreasing gray matter volume and increasing white matter volume in children. The change in CSF was found to be very small with increasing age, as shown in Figure 1-1 [Rei96], so that the brain volume is not a determining factor in the volumes of gray and white matter. The change in gray matter is about -1% per year for boys and -0.92% per year for girls. The change in white matter is about +0.093% for boys and +0.072% for girls [Rei96]. We will use these findings to quantify our clustering method, since detecting a 1% change is a good indicator with which to evaluate a segmentation method. We expect to find a change of about 1% using the proposed clustering algorithm. Although this method will not verify the segmentation of individual structures, it is still a good validation method, because the change is very small and difficult to show.

1. Figure 1-1 is reprinted with the permission of Oxford University Press.

Figure 1-1. Total cerebral volume in children aged 5 to 17 years

1.6 Motivation

Segmentation of MR images is considered to be difficult because of the non-rigid and complex shapes of anatomical structures. Adding the high variability among patients, and even within the same scan, makes it difficult for model based approaches to segment MR images. We believe data based approaches are more appropriate for MR images because of the complexity of anatomical structures.

Many clustering algorithms have been proposed to solve segmentation problems in MR images; these are reviewed in Chapter 2. A common problem with the segmentation algorithms is the fact that they depend on the Euclidean distance measure to separate the clusters. Deformable models are excluded from this reasoning, but they need to be placed near the boundary, and we will consider only data-driven methods because of the complexity of the brain images. The Euclidean distance has limited capacity to separate nonlinearly separable clusters. To be able to distinguish nonlinearly separable clusters, more information about the structure of the data should be obtained. On the other hand, because of the missing label assignments, clustering is a more computationally intensive operation than classification. More information about the structure should therefore be collected without introducing complicated calculations.

The motivation of this dissertation is to develop a new cost function for clustering which can be used with nonlinearly separable clusters. Such a method should be computationally feasible and simple to calculate. The proposed method does not require any numerical integration methods, and it uses information theory to collect more information about the data. The stepwise nature of the cost function required us to develop an improved version of the K-change algorithm [Dud73].

1.7 Outline

In Chapter 2, the basic clustering algorithms are reviewed. There are many variations of these algorithms, but the basic principles stay the same. Many of them cannot be used with nonlinearly separable clusters, and the ones that can be used, like the valley seeking algorithm [Fug90], suffer from generating more clusters than intended when the distributions are not unimodal. Chapter 3 covers the basics of information theory and entropy based distance measures. Many of these calculations require numerical methods, which increase the already high computational cost of clustering algorithms. Therefore, we propose a different distance measure which does not require numerical integration and is simple to calculate.


Chapter 4 covers the new clustering evaluation function proposed here and gives some initial results to demonstrate the power of the cost function, which is capable of separating nonlinearly combined clusters. Chapter 5 focuses on the optimization algorithm that minimizes the cost function developed in Chapter 4. We propose an improvement to the K-change algorithm by introducing a changing group size scheme. In Chapter 6, applications to MR image segmentation are tested and discussed, and Chapter 7 includes the conclusions and a discussion of future research.

CHAPTER 2 UNSUPERVISED LEARNING AND CLUSTERING

2.1 Classical Methods

2.1.1 Discussion

There are many important applications of pattern recognition, covering a wide range of information processing problems of great practical significance, from speech recognition and the classification of handwritten characters to fault detection and medical diagnosis. The discussion here provides the basic elements of clustering; there are many variations of these ideas in the literature, so we investigate only the basic algorithms.

Clustering [Har85] is an unsupervised way of grouping data using a given measure of similarity. Clustering algorithms attempt to organize unlabeled feature vectors into clusters or "natural groups" such that samples within a cluster are more similar to each other than to samples belonging to different clusters. Since there is no information given about the underlying data structure or the number of clusters, there is no single solution to clustering, neither is there a single similarity measure to differentiate all clusters. For this reason there is no theory which describes clustering uniquely.

Pattern classification can be divided into two areas depending on the external knowledge about the input data. If we know the labels of our input data, the pattern recognition problem is considered supervised; otherwise the problem is called unsupervised. Here we will only cover statistical pattern recognition. There are several ways of handling the problem of pattern recognition if the labels are given a priori. Since we know the labels, the problem reduces to finding features of the data set with the known labels, and to building a classifier using these features. Bayes' rule shows how to calculate the posterior probability from the a priori probability. Assume that we know the a priori probabilities $P(c_i)$ and the conditional densities $p(x|c_i)$. When we measure x, we can calculate the posterior probability $P(c_i|x)$ as shown in (2.1).

\[
P(c_i \mid x) = \frac{p(x \mid c_i)\,P(c_i)}{p(x)} \tag{2.1}
\]

where

\[
p(x) = \sum_{i=1}^{N} p(x \mid c_i)\,P(c_i) \tag{2.2}
\]
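As a concrete illustration of (2.1) and (2.2), the following sketch computes the posterior class probabilities for a scalar measurement x under assumed Gaussian class-conditional densities. The priors, means, and variances are invented for the example.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density used here as the class-conditional p(x | c_i)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Hypothetical two-class problem: priors P(c_i) and class-conditional parameters.
priors = np.array([0.6, 0.4])
means = np.array([0.0, 3.0])
variances = np.array([1.0, 2.0])

x = 1.5
likelihoods = gaussian_pdf(x, means, variances)   # p(x | c_i)
evidence = np.sum(likelihoods * priors)           # p(x), Eq. (2.2)
posteriors = likelihoods * priors / evidence      # P(c_i | x), Eq. (2.1)

print(posteriors, posteriors.sum())  # posteriors sum to 1
```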

In the case of unsupervised classification, or clustering, we do not have the labels, which makes the problem harder. The clustering problem is not well defined unless the resulting clusters are required to have certain properties. The fundamental problem in clustering is how to choose these properties. Once we have a suitable definition of a cluster, it is possible to evaluate the validity of the resulting clustering using standard statistical validation procedures. There are two basic approaches to clustering, which we call parametric and nonparametric approaches. If the purpose of unsupervised learning is data description, then we can assume a predefined distribution function for the data set and calculate the sufficient statistics which describe the data set in a compact way. For example, if we assume that the data set comes from a normal distribution $N(M, \Sigma)$, which is defined as

\[
N_X(M, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(X - M)^T \Sigma^{-1} (X - M)\right) \tag{2.3}
\]

the sufficient statistics are the sample mean $M = E\{X\}$ and the sample covariance matrix $\Sigma = E\{XX^T\}$, which describe the distribution perfectly. Unfortunately, if the data set is not distributed according to our choice, then these statistics can be very misleading.

Another approach uses a mixture of distributions to describe the data [Mcl88] [Mcl96] [Dem77]. We can approximate virtually any density function in this way, but estimating the parameters of a mixture is not a trivial operation, and the question of how to separate the data set into different clusters remains unanswered, since estimating the distribution does not tell us how to divide the data set into clusters. If we are using the first approach, namely fitting one distribution function to each cluster, then the clustering can be done by trying to estimate the parameters of the distributions. If we are using the mixture of distributions approach, then the clustering is very loosely defined. Assume that we have more mixture components than the number of clusters; the model does not tell us how to combine the components to obtain the desired clustering.

Another approach to clustering is to group the data set into groups of points which possess strong internal similarities [Dud73] [Fug70]. To measure the similarities we use a criterion function and seek the grouping that finds the extreme point of the criterion function. For this kind of algorithm we need a cost function to evaluate how well the clustering fits the data, and an algorithm to minimize the cost function. For a given clustering problem, the input data X is fixed. The clustering varies only through the sample assignment C, which means that the minimization algorithm will change only C. Because of the discrete and unordered nature of C, classical steepest descent search algorithms cannot be applied easily.

2.1.2 Clustering Criterion

We will define the clustering problem as follows. We assume that we have N samples, $x_1, \ldots, x_N$. At this point we assume that the samples are not random variables, since once the samples are fixed by the clustering algorithm they are no longer random variables. The problem is to place each sample into one of L clusters, $w_1, \ldots, w_L$, where L is assumed to be given. The cluster to which the ith sample is assigned is denoted by $w_{k(i)}$, where k(i) is an integer between $1, \ldots, L$ and $i = 1, \ldots, N$. A clustering C is a vector made up of the $w_{k(i)}$ and X is a vector made up of the $x_i$'s, that is,

\[
C = [w_{k(1)} \ldots w_{k(N)}]^T \tag{2.4}
\]

and

\[
X = [x_1 \ldots x_N] \tag{2.5}
\]

The clustering criterion J is a function of C and X and can be written as

\[
J(C, X) = J(w_{k(1)}, \ldots, w_{k(N)};\, x_1, \ldots, x_N) \tag{2.6}
\]

The best clustering $C_0$ should satisfy

\[
J(C_0, X) = \min_C J(C, X) \quad \text{or} \quad \max_C J(C, X) \tag{2.7}
\]

depending on the criterion. Only minimization will be considered, since maximization can always be converted to minimization.


2.1.3 Similarity Measures

In order to apply a clustering algorithm we have to define how we will measure similarity between samples. The most obvious measure of the similarity between two samples is the distance between them. The $L_p$ norm is the generalized distance measure, where $p = 2$ corresponds to the Euclidean distance [Ben66] [Gre74]. The $L_p$ norm between two vectors of size N is given as

\[
L_p(x_1, x_2) = \left( \sum_{i=1}^{N} (x_1(i) - x_2(i))^p \right)^{1/p} \tag{2.8}
\]

If this distance is a good measure of similarity, then we would expect the distance between samples in the same cluster to be significantly less than the distance between samples in different clusters. Another way to measure the similarity between two vectors is the normalized inner product, which is given as

\[
s(x_1, x_2) = \frac{x_1^T x_2}{\|x_1\|\,\|x_2\|} \tag{2.9}
\]

This measure is basically the cosine of the angle between the two vectors.
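Both similarity measures are straightforward to compute. The sketch below evaluates the $L_p$ distance of (2.8) and the normalized inner product of (2.9) for two made-up feature vectors.

```python
import numpy as np

def lp_distance(x1, x2, p=2):
    """L_p norm of the difference between two vectors, Eq. (2.8); p=2 is Euclidean."""
    return np.sum(np.abs(x1 - x2) ** p) ** (1.0 / p)

def cosine_similarity(x1, x2):
    """Normalized inner product of Eq. (2.9): the cosine of the angle between the vectors."""
    return x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 1.0])
print(lp_distance(a, b, p=1), lp_distance(a, b, p=2))
print(cosine_similarity(a, b))
```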


2.2 Criterion Functions

2.2.1 The Sum-of-Squared-Error Criterion

One of the simplest and most widely used error criteria is the sum-of-squared-error criterion. Let $N_i$ be the number of samples in $X_i$ and let $m_i$ be the mean of those samples,

\[
m_i = \frac{1}{N_i} \sum_{x \in X_i} x \tag{2.10}
\]

Then the criterion can be defined as

\[
J = \sum_{i=1}^{L} \sum_{x \in X_i} \|x - m_i\|^2 \tag{2.11}
\]

which means that the mean vector $m_i$ is the best representation of the samples in $X_i$, in the sense that it minimizes the sum of the squared error vectors $x - m_i$. The error function J measures the total squared error when N samples are represented by L cluster centers $m_1, \ldots, m_L$. The value of J depends on how the samples are distributed among the cluster centers. This kind of clustering is often called a minimum-variance partition [Dud73]. It works well when the clusters are compact regions that are well separated from each other, but it gives unexpected results when the distance between clusters is comparable to the size of the clusters. An equivalent expression can be obtained by eliminating the mean vectors from the expression, as in

\[
J = \frac{1}{2} \sum_{i=1}^{L} n_i \tilde{s}_i \tag{2.12}
\]

where

\[
\tilde{s}_i = \frac{1}{n_i^2} \sum_{x_1 \in X_i} \sum_{x_2 \in X_i} \|x_1 - x_2\|^2 \tag{2.13}
\]

The above expression shows that the sum-of-squared-error criterion uses the Euclidean distance to measure similarity [Dud73]. We can derive different criterion functions by replacing $\tilde{s}_i$ with other similarity functions $s(x_1, x_2)$.
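A direct implementation of the criterion (2.10)-(2.11) is shown below. The data points and the two cluster assignments are synthetic and only serve to show that a correct grouping yields a much smaller J than an incorrect one.

```python
import numpy as np

def sum_of_squared_error(X, labels):
    """Eq. (2.11): sum over clusters of squared distances to the cluster mean."""
    J = 0.0
    for k in np.unique(labels):
        cluster = X[labels == k]
        mean = cluster.mean(axis=0)          # m_i, Eq. (2.10)
        J += np.sum((cluster - mean) ** 2)   # sum of ||x - m_i||^2
    return J

# Two small, well-separated synthetic clusters.
X = np.array([[0.0, 0.1], [0.2, -0.1], [5.0, 5.2], [5.1, 4.9]])
print(sum_of_squared_error(X, np.array([0, 0, 1, 1])))  # small J for the correct grouping
print(sum_of_squared_error(X, np.array([0, 1, 0, 1])))  # much larger J for a bad grouping
```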

2.2.2 The Scatter Matrices

In discriminant analysis, within-class, between-class and mixture scatter matrices are used to measure and formulate class separability. These matrices can be combined in different ways to be used as a criterion function. Let us make the following definitions.

Mean vector for the ith cluster:

\[
m_i = \frac{1}{N_i} \sum_{x \in X_i} x \tag{2.14}
\]

Total mean vector:

\[
m = \frac{1}{N} \sum_{x \in X} x = \frac{1}{N} \sum_{i=1}^{L} N_i m_i \tag{2.15}
\]

Scatter matrix for the ith cluster:

\[
S_i = \sum_{x \in X_i} (x - m_i)(x - m_i)^T \tag{2.16}
\]

Within-cluster scatter matrix:

\[
S_W = \sum_{i=1}^{L} S_i \tag{2.17}
\]

Between-cluster scatter matrix:

\[
S_B = \sum_{i=1}^{L} n_i (m_i - m)(m_i - m)^T \tag{2.18}
\]

Total scatter matrix:

\[
S_T = S_W + S_B \tag{2.19}
\]

The following criterion functions can be defined [Fug90] [Dud73]:

\[
J_1 = \mathrm{tr}(S_2^{-1} S_1) \tag{2.20}
\]

\[
J_2 = \ln |S_2^{-1} S_1| \tag{2.21}
\]

\[
J_3 = \mathrm{tr}(S_1) - \mu\,(\mathrm{tr}(S_2) - c) \tag{2.22}
\]

\[
J_4 = \frac{\mathrm{tr}(S_1)}{\mathrm{tr}(S_2)} \tag{2.23}
\]

where $S_1$ and $S_2$ are each one of $S_W$, $S_B$, or $S_T$. Some combinations are invariant under any nonsingular linear transformation and some are not. These functions are not universally applicable, and this is a major flaw of these criterion functions. Once the function is determined, the clustering that the function will provide is fixed in terms of parameters; we are assuming we can reach the global extreme point, which may not always be the case. If the function does not provide good results with a certain data set, no parameter is readily available to change the behavior of the clustering output. The only parameter we can change is the function itself, which may be difficult to set. Another limitation of the criterion functions mentioned above is the fact that they are basically second order statistics. In the coming chapters we will provide a new clustering function using information theoretic measures, together with a practical way to calculate and optimize the resulting function, which gives us a way to control the clustering behavior of the criterion. There are other measures using entropy and information theory to measure cluster separability, which will be covered in the next chapter. These methods will be the basis for our clustering function.
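The scatter matrices (2.14)-(2.19) and, for example, the trace ratio $J_4$ of (2.23) can be computed directly from a labeled sample, as in the following sketch. The data are synthetic, and the choice $S_1 = S_B$, $S_2 = S_W$ is just one of the allowed combinations.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-cluster (S_W), between-cluster (S_B) and total (S_T) scatter, Eqs. (2.16)-(2.19)."""
    m = X.mean(axis=0)                           # total mean, Eq. (2.15)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for k in np.unique(labels):
        cluster = X[labels == k]
        m_i = cluster.mean(axis=0)               # cluster mean, Eq. (2.14)
        diff = cluster - m_i
        S_W += diff.T @ diff                     # S_i accumulated into S_W
        S_B += len(cluster) * np.outer(m_i - m, m_i - m)
    return S_W, S_B, S_W + S_B

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
labels = np.repeat([0, 1], 50)
S_W, S_B, S_T = scatter_matrices(X, labels)
J4 = np.trace(S_B) / np.trace(S_W)               # Eq. (2.23) with S_1 = S_B, S_2 = S_W
print(J4)
```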

2.3 Clustering Algorithms

2.3.1 Iterative Optimization

The input data are finite, and therefore there are only a finite number of possible partitions. In theory, the clustering criterion can always be optimized by exhaustive enumeration. However, in practice such an approach is not feasible, since the number of iterations grows exponentially with the number of clusters and the sample size; the number of different solutions is given approximately by $L^N / L!$.

The basic idea in iterative optimization is to find an initial partition and to move samples from one group to another if such a move improves the value of the criterion function. In general this procedure only guarantees local optimization, and different initial points will give different results. The simplicity of the method usually overcomes these limitations in most problems. In the following chapters we will improve this optimization method to obtain better results and use it in our algorithm. A minimal sketch of the basic scheme is given below.
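The sketch uses the sum-of-squared-error criterion (2.11) as J and simply tries moving each sample to every other cluster, keeping a move whenever J decreases. It illustrates the basic idea only, not the improved algorithm developed later in this dissertation; the data and initial partition are synthetic.

```python
import numpy as np

def sse(X, labels, L):
    """Sum-of-squared-error criterion J of Eq. (2.11), skipping empty clusters."""
    return sum(np.sum((X[labels == k] - X[labels == k].mean(axis=0)) ** 2)
               for k in range(L) if np.any(labels == k))

def iterative_optimization(X, labels, L, max_sweeps=20):
    """Move one sample at a time to another cluster whenever the move lowers J."""
    labels = labels.copy()
    best = sse(X, labels, L)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(X)):
            current = labels[i]
            for k in range(L):
                if k == current:
                    continue
                labels[i] = k                      # tentative move
                J = sse(X, labels, L)
                if J < best:
                    best, current, improved = J, k, True
                else:
                    labels[i] = current            # undo the move
            labels[i] = current
        if not improved:                           # local optimum reached
            break
    return labels, best

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
init = rng.integers(0, 2, size=60)                 # random initial partition
labels, J = iterative_optimization(X, init, L=2)
print(J)
```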


2.3.2 Merging and Splitting

After a number of clusters are obtained, it is possible to merge or split certain clusters. Merging may be required if two clusters are very similar. Of course, we should define a similarity measure for this operation, as we did for clustering; several of the measures given in (2.20), (2.21), (2.22) and (2.23) can be used for this purpose [Dud73]. Merging is sometimes desired when the cluster size is very small. The criterion for appropriate splitting is far more difficult to define [Dud73]. Multimodal and nonsymmetric distributions, as well as distributions with a large variance along one direction, can be split. We can start by partitioning the input data of size N into N clusters containing one sample each. The next step is to partition the current clustering into N-1 clusters, and we can continue doing this until we reach the desired clustering. At any level, samples from different clusters may be combined to form a single cluster. Merging and splitting are heuristic operations with no guarantee of reaching the desired clustering, but they are still useful [Dud73].

2.3.3 Neighborhood Dependent Methods

Once we choose a measure to describe the similarity between clusters, we can use the following algorithm [Dud73]. After an initial clustering, each sample is reassigned to the nearest cluster center and the means of the clusters are recalculated, until there is no change in the clustering. This algorithm is called nearest mean reclassification [Dud73].


2.3.4 Hierarchical Clustering

Consider a sequence of partitions of the N samples into C clusters. The first partition is into N clusters, where each cluster contains exactly one sample. The next iteration is a partition into N-1 clusters, and so on, until all samples form one cluster. If the sequence has the property that whenever two samples are in the same cluster at some level they remain together at all higher levels, then the sequence is called a hierarchical clustering. It should be noted that the clustering can also be done in reverse order, that is, first all samples form a single cluster, and at each iteration more clusters are generated. In order to combine or divide the clusters, we need a way to measure the similarity within clusters and the dissimilarity between clusters. Commonly used distance measures [Dud73] are as follows:

\[
D_{min}(C_1, C_2) = \min_{x_1 \in C_1,\, x_2 \in C_2} \|x_1 - x_2\| \tag{2.24}
\]

\[
D_{max}(C_1, C_2) = \max_{x_1 \in C_1,\, x_2 \in C_2} \|x_1 - x_2\| \tag{2.25}
\]

\[
D_{avg}(C_1, C_2) = \frac{1}{N_1 N_2} \sum_{x_1 \in C_1} \sum_{x_2 \in C_2} \|x_1 - x_2\| \tag{2.26}
\]

\[
D_{mean}(C_1, C_2) = \|m_1 - m_2\| \tag{2.27}
\]

All of these measures have a minimum-variance flavor, and they usually give the same results if the clusters are compact and well separated. However, if the clusters are close to each other, and/or the shapes are not basically hyperspherical, very different results may be obtained. Although never tested, it is possible to use our clustering evaluation function to measure the distance between clusters in hierarchical clustering to improve the results.
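The four inter-cluster distances (2.24)-(2.27) can be written compactly as in the sketch below; the two small point sets are arbitrary examples.

```python
import numpy as np

def pairwise_distances(C1, C2):
    """Matrix of Euclidean distances ||x1 - x2|| for every x1 in C1 and x2 in C2."""
    return np.linalg.norm(C1[:, None, :] - C2[None, :, :], axis=-1)

def cluster_distances(C1, C2):
    D = pairwise_distances(C1, C2)
    return {
        "D_min":  D.min(),                                            # Eq. (2.24)
        "D_max":  D.max(),                                            # Eq. (2.25)
        "D_avg":  D.mean(),                                           # Eq. (2.26)
        "D_mean": np.linalg.norm(C1.mean(axis=0) - C2.mean(axis=0)),  # Eq. (2.27)
    }

C1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
C2 = np.array([[4.0, 4.0], [5.0, 4.0]])
print(cluster_distances(C1, C2))
```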


2.3.5 Nonparametric Clustering

When a mixture density function has peaks and valleys, it is most natural to divide the samples into clusters along the valleys. The valleys may not have a parametric structure, which creates difficulties for parametric assumptions. One way to discover a valley is to estimate the local density gradient at each sample point and move the samples in the direction of the gradient. By repeating this, we move the samples away from the valley, and the samples form compact clusters. We call this procedure the valley-seeking procedure [Fuk90].

Figure 2-1. Valley seeking algorithm

The local gradient can be estimated by the local mean vector around the sample, as illustrated in Figure 2-1. The direction of the local gradient is given by

\[
\frac{\nabla p(X)}{p(X)} \cong M(X) \tag{2.28}
\]


The local gradient near the decision surface will be proportional to the difference of the local means, that is, $M_1(x) - M_2(x)$, where $M_1(x)$ is the local mean of one cluster and $M_2(x)$ is the local mean of the other cluster inside the same local region. The method seems very promising, but it may result in too many clusters if there is a slight nonuniformity in one of the clusters. The performance and the number of clusters depend on the local region used to calculate the gradient vector. If the local region is too small, there will be many clusters; on the other hand, if the region is too large, then all the points form one cluster. So the size of the local region is directly related to the number of clusters but only loosely related to the quality of the clustering. The advantage of this method is that the number of clusters does not need to be specified in advance, but in some cases this may be a disadvantage. Assume we want to improve the clustering without increasing the number of clusters. A change in the local region will change the number of clusters and will not help to improve the clustering. Of course, small changes will change the clustering without increasing the cluster number, but determining the range of the parameter may be a problem.
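A rough sketch of the valley seeking idea is given below: each sample is repeatedly moved toward the local mean of its neighborhood, so that samples drift away from low-density valleys and collect into compact groups, which are then linked into clusters. The neighborhood radius and the iteration count are arbitrary choices for the illustration, and this is only a simplified variant of the procedure in [Fuk90]. Note that the radius plays exactly the role of the local region size discussed above: too small a radius produces many clusters, too large a radius produces one.

```python
import numpy as np

def valley_seeking(X, radius=1.0, iterations=20):
    """Move each sample toward the local mean of its neighborhood (an estimate of the
    density gradient direction, Eq. (2.28)), then group samples that end up together."""
    Z = X.copy()
    for _ in range(iterations):
        for i in range(len(Z)):
            neighbors = Z[np.linalg.norm(Z - Z[i], axis=1) < radius]
            Z[i] = neighbors.mean(axis=0)          # local mean M(X)
    # Link samples whose final positions are nearly identical into clusters.
    labels = -np.ones(len(Z), dtype=int)
    next_label = 0
    for i in range(len(Z)):
        if labels[i] == -1:
            close = np.linalg.norm(Z - Z[i], axis=1) < radius / 2
            labels[close & (labels == -1)] = next_label
            next_label += 1
    return labels

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(3, 0.3, (40, 2))])
print(np.unique(valley_seeking(X)))   # ideally two labels, one per density mode
```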

2.4 Mixture Models

The mixture model is a semi-parametric way of estimating the underlying density function [Dud73] [Par62] [Chr81]. In the non-parametric kernel-based approach to density estimation, the density function is represented as a linear superposition of kernel functions, with one kernel centered on each data point. In the mixture model the density function is again formed by a linear combination of basis functions, but the number of basis functions is treated as a parameter of the model and is much less than the number N of data points. We write the density estimator as a linear combination of component densities $p(x|i)$ in the form

\[
p(x) = \sum_{i=1}^{L} p(x \mid i)\,P(i) \tag{2.29}
\]

Such a representation is called a mixture distribution, and the coefficients $P(i)$ are called the mixing parameters.

2.4.1 Maximum Likelihood Estimation

The maximum likelihood estimate may be obtained by maximizing $\prod_{j=1}^{N} p(x_j)$ with respect to $P_i$, $M_i$ and $\Sigma_i$ under the constraint

\[
\sum_{i=1}^{L} P_i = 1 \tag{2.30}
\]

The negative log-likelihood is given by

\[
E = -\ln(\Gamma) = -\sum_{i=1}^{N} \ln(p(x_i)) \tag{2.31}
\]

which can be regarded as an error function [Fug90]. Maximizing the likelihood $\Gamma = \prod_{j=1}^{N} p(x_j)$ is then equivalent to minimizing E. One way to solve the maximum likelihood problem is the EM algorithm, which is explained next.


2.4.2 EM Algorithm

Usually no theoretical solution exists for the likelihood equations, and it is necessary to use numerical methods. Direct maximization of the likelihood function using Newton-Raphson or gradient methods is possible, but it may need analytical work to obtain the gradient and possibly the Hessian. The EM algorithm [Dem77] is a general method for computing maximum-likelihood (ML) estimates for "incomplete data" problems. Each iteration of the EM algorithm consists of two steps, called the expectation step or E-step and the maximization step or M-step, hence the name EM algorithm, given by [Dem77] in their fundamental paper. The EM algorithm can be applied in situations described as incomplete-data problems, where ML estimation is made difficult by the absence of some part of the data in a more familiar and simpler data structure. The parameters are estimated after filling in initial values for the missing data; the latter are then updated by their predicted values using these parameter estimates. The parameters are then re-estimated iteratively until convergence.

The term "incomplete data" refers in general to the existence of two sample spaces X and Y and a many-to-one mapping H from X to Y, where $x \in X$ and $y \in Y$ are elements of the sample spaces and $y = H(x)$. The corresponding x in X is not observed directly, but only indirectly through y. Let $f(x|\theta)$ be the parametric distribution of x, where $\theta$ is a vector of parameters taking values in $\Theta$. The distribution of y, denoted by $g(y|\theta)$, is also parametrized by $\theta$, since the complete-data specification $f(\cdot|\cdot)$ is related to the incomplete-data specification $g(\cdot|\cdot)$ by

\[
g(y \mid \theta) = \int_{H(x) = y} f(x \mid \theta)\, dx \tag{2.32}
\]

The EM algorithm tries to find a value of $\theta$ which maximizes $g(y|\theta)$ given an observed y, and it uses the associated family $f(x|\theta)$. It should be noted that there are many possible complete-data specifications $f(x|\theta)$ that will generate $g(y|\theta)$. The maximum-likelihood estimator $\hat{\theta}$ maximizes the log-likelihood function

\[
L(\theta) = \ln(g(y \mid \theta)) \tag{2.33}
\]

over $\theta$,

\[
\hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta) \tag{2.34}
\]

The main idea behind the EM algorithm is that, there are cases in which the estimation of θ would be easy if the complete data x were available, and is only difficult for the incomplete data y. In other words the maximization of ln ( g ( y θ ) ) is complicated, where the maximization of ln ( f ( x θ ) ) is easy. Since only the incomplete data y is available in practice, it is not possible to directly perform the optimization of the complete data likelihood ln ( f ( x θ ) ) . Instead it will be easier to estimate ln ( f ( x θ ) ) from y and use this estimator to find θˆ . Since estimating the complete data likelihood requires θ , we need an iterative approach. First using an estimate of θ the complete likelihood function will be estimated, then this likelihood function should be maximized over θ , and so on, until a satisfactory convergence is obtained. Given the current value θ' of the parameters and y, we can estimate ln ( f ( x θ ) ) using

31

P ( θ, θ' ) = E [ ln ( p ( x θ ) ) y, θ' ]

(2.35)

The EM algorithm can be expressed as

E-step:   P(θ, θ^p) = E[ ln p(x|θ) | y, θ^p ]    (2.36)

M-step:   θ^{p+1} ∈ argmax_{θ∈Θ} P(θ, θ^p)    (2.37)

where θ^p is the value of θ at the pth iteration. For the problem of density estimation using a

mixture model we do not have the corresponding cluster labels. The missing labels can be considered as incomplete data and can be estimated together with the other parameters of the mixture using the EM algorithm described above. Unfortunately, estimating the mixture model does not answer the question of how to cluster the data set using the mixture model.
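As an illustration of the E-step (2.36) and M-step (2.37), the following is a minimal Python sketch of EM for a one-dimensional mixture of Gaussians. The data, the number of components, and all function names are hypothetical choices for this example only; this is not the mixture model used later in the dissertation.

import numpy as np

def em_gaussian_mixture(x, n_components=2, n_iter=50, seed=0):
    """Minimal EM for a 1-D Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initial guesses for the mixture weights, means and variances.
    w = np.full(n_components, 1.0 / n_components)
    mu = rng.choice(x, n_components, replace=False)
    var = np.full(n_components, np.var(x))
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each sample
        # (the "missing" cluster labels play the role of the incomplete data).
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = w * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters that maximize the expected
        # complete-data log-likelihood.
        nk = resp.sum(axis=0)
        w = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 0.5, 200)])
    print(em_gaussian_mixture(data))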

2.5 Competitive Networks

Unsupervised learning can be accomplished by appropriately designed neural networks. The original unsupervised learning rule was proposed by Hebb [Heb49] and is inspired by biological synaptic signal changes. In this rule, the change in a weight depends on the correlation of the pre- and post-synaptic signals x and y, respectively, which may be formulated as

w_{k+1} = w_k + ρ y_k x_k    (2.38)

where ρ > 0, x_k is the unit's input signal, and y_k = x_k^T w_k is the unit's output. The analysis shows that the rule is unstable in this form and drives the weights to infinity in magnitude. One way to prevent divergence is to normalize the weight vector after each iteration. Another update rule which prevents the divergence, proposed by Oja [Oja82] [Oja85], adds a weight decay proportional to y^2 and results in the following update rule:

w_{k+1} = w_k + ρ ( x_k – y_k w_k ) y_k    (2.39)
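A minimal Python sketch of the update in (2.39) is given below; the learning rate, the iteration count, and the synthetic data are arbitrary choices for illustration, not values taken from this work.

import numpy as np

def oja_rule(X, rho=0.01, n_epochs=50, seed=0):
    """Oja's update w <- w + rho * (x - y*w) * y with y = x^T w."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    for _ in range(n_epochs):
        for x in X:
            y = x @ w
            w = w + rho * (x - y * w) * y   # the decay term keeps ||w|| bounded
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # correlated zero-mean 2-D data; w converges toward the first principal direction
    X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=500)
    print(oja_rule(X))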

There are many variations of this update rule, and it is used very frequently in principal component analysis (PCA) [Ama77] [Lin88] [San89] [Oja83] [Rub89] [Pri90]. There are linear and nonlinear versions [Oja91] [Tay93] [Kar94] [Xu94] [Pri90]. Competitive networks [Gro76a] [Gro76b] [Lip89] [Rum85] [Mal73] [Gro69] [Pri90] can be used in clustering procedures [Dud73] [Har75]. Since in clustering there is no signal to indicate the cluster labels, competitive networks use a competition procedure to find the output node to be updated according to a particular weight update rule. The unit with the largest activation is usually chosen as the winner, whose weight vector is updated according to the rule

∆w i = ρ ( x k – w i )

(2.40)

where w_i is the weight vector of the winning node. The weight vectors of the other nodes are not updated. The net effect of the rule is to move the weight vector of each node towards the center of mass of the nearest dense cluster of data points. This means that the number of output nodes determines the number of clusters. One application of competitive learning is adaptive vector quantization [Ger82] [Ger92a]. Vector quantization is a technique where the input space is divided into a number of distinct regions, and for each region a "template" or reconstruction vector is defined. When presented with a new input vector x, a vector quantizer first determines the region in which the vector lies. Then the quantizer outputs an encoded version of the reconstruction vector w_i representing the particular region containing x. The set of all possible reconstruction vectors is usually called the codebook of the quantizer [Lin80] [Gra84]. When the Euclidean distance is used to measure the similarity of x to the regions, the quantizer is called a Voronoi quantizer [Gra84].
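The following is a minimal sketch of the winner-take-all update in (2.40), used here as a simple vector quantizer. The number of output nodes, the learning rate, and the data are hypothetical choices made only for this illustration.

import numpy as np

def competitive_learning(X, n_nodes=3, rho=0.05, n_epochs=20, seed=0):
    """Move only the winning weight vector toward each input (Eq. 2.40)."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), n_nodes, replace=False)].copy()  # codebook initialization
    for _ in range(n_epochs):
        for x in X:
            i = np.argmin(np.linalg.norm(W - x, axis=1))  # winner: closest node
            W[i] += rho * (x - W[i])                      # update the winner only
    return W

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ((0, 0), (3, 0), (0, 3))])
    print(competitive_learning(X))  # codebook vectors land near the three cluster centers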

2.6 ART Networks

Adaptive resonance architectures are artificial neural networks that are capable of stable categorization of an arbitrary sequence of unlabeled input patterns in real time. These architectures are capable of continuous training with nonstationary inputs. They also solve the stability-plasticity dilemma; in other words, they let the network adapt yet prevent current inputs from destroying past training. The basic principles of the underlying theory of these networks, known as adaptive resonance theory (ART), were introduced by Grossberg in 1976 [Gro76a] [Gro76b]. A class of ART architectures, called ART1 [Car87a] [Car88], is characterized by a system of ordinary differential equations, with associated theorems. A number of interpretations and simplifications of the ART1 net have been reported in the literature [Lip87] [Pao89] [Moo89]. The basic architecture of the ART1 net consists of a layer of linear units representing prototype vectors whose outputs are acted on by a winner-take-all network. This architecture is identical to the simple competitive network with one major difference: the linear prototype units are allocated dynamically, as needed, in response to novel input vectors. Once a prototype unit is allocated, appropriate lateral-inhibitory and self-excitatory connections are introduced so that the allocated unit may compete with preexisting prototype units. Alternatively, one may assume a prewired network with a large number of inactive (zero-weight) units. A unit becomes active if the training algorithm decides to assign it as a cluster prototype unit, and its weights are adapted accordingly. The general idea behind ART1 training is as follows. Every training iteration consists of taking a training example x_k and examining existing prototypes (weight vectors w_j) that are sufficiently similar to x_k. If a prototype w_i is found to match x_k (according to a similarity test based on a preset matching threshold), sample x_k is added to the cluster represented by w_i, and w_i is modified to make it better match x_k. If no prototype matches x_k, then x_k becomes the prototype for a new cluster. The family of ART networks also includes more complex models such as ART2 [Car87b] and ART3 [Car90]. These ART models are capable of clustering binary and analog input patterns. A simplified model of ART2, ART2-A [Car91a], has been proposed that is two to three orders of magnitude faster than ART2. Also, a supervised real-time learning ART model called ARTMAP has been proposed [Car91b].
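The match-or-allocate idea described above can be sketched in a heavily simplified form as follows. This is only a leader-style caricature of ART1 training for binary patterns: the vigilance threshold, the overlap test, and the function names are assumptions for this example, and the differential-equation dynamics and exact ART1 weight rules are omitted.

import numpy as np

def simplified_art1(patterns, vigilance=0.7):
    """Leader-style clustering in the spirit of ART1: match a prototype or allocate one."""
    prototypes = []
    labels = []
    for x in patterns:
        assigned = False
        for j, w in enumerate(prototypes):
            overlap = np.logical_and(x, w).sum() / max(x.sum(), 1)
            if overlap >= vigilance:                   # similarity (vigilance) test
                prototypes[j] = np.logical_and(x, w)   # move prototype toward the sample
                labels.append(j)
                assigned = True
                break
        if not assigned:                               # allocate a new prototype unit
            prototypes.append(x.astype(bool))
            labels.append(len(prototypes) - 1)
    return prototypes, labels

if __name__ == "__main__":
    pats = np.array([[1, 1, 1, 0, 0],
                     [1, 1, 0, 0, 0],
                     [0, 0, 1, 1, 1],
                     [0, 0, 0, 1, 1]])
    protos, labs = simplified_art1(pats, vigilance=0.6)
    print(labs)   # first two patterns share one cluster, last two share another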

2.7 Conclusion

In this chapter we summarized the basic algorithms used in clustering. There are many variations of these algorithms, but the basic principle stays the same. A fundamental problem can be seen immediately, which can be summarized as the use of the Euclidean distance as a measure of cluster separability. Other methods use the mean and variance to differentiate the clusters. When there are nonlinear structures in the data, the Euclidean distance and differences in the mean and variance are an inadequate measure of cluster separability. The valley seeking algorithm tries to solve the problem by moving the samples along the gradients, and the algorithm behaves as a classifier if the clusters are well separated and unimodal. But when the clusters are multi-modal and overlapping, the valley seeking algorithm may create more cluster centers than there are clusters in the data. The question of how to combine these cluster centers is not answered, even when we know the exact number of clusters. Defining the number of clusters beforehand can be an advantage depending on the problem. For example, in MRI segmentation it is to our advantage to fix the number of clusters, since we know this number a priori. Consider an MRI brain image where the basic structures are CSF, white matter and gray matter. Failure to fix the number of clusters in this problem beforehand will raise the question of how to combine the excess cluster centers later. When we consider the fact that the tissue boundaries in an MRI brain image are not sharply defined, it is obvious that Euclidean distance measures and mean and variance differences are not enough to differentiate the clusters in a brain MRI. The variability of brain structures among persons and within the same scan makes it difficult to use model-based approaches, since the model that fits a particular part of the brain may not fit the rest. This encourages us to use data-driven algorithms, where there is no predefined structure imposed on the data. But the limitations of Euclidean distance measures force us to seek other measures of cluster separability.

CHAPTER 3 ENTROPY AND INFORMATION THEORY

3.1 Introduction

Entropy [Sha48] [Sha62] [Kaz90] was introduced into information theory by Shannon (1948). The entropy of a random variable is a measure of the average amount of information it contains; in other words, entropy measures the uncertainty of the random variable. Consider a random variable X which can take values x_1 … x_N with probabilities p(x_k), k = 1…N. If we know that the event x_k occurs with probability p_k = 1, which requires that p_i = 0 for i ≠ k, there is no surprise and therefore no information is contained in X, since we know the outcome exactly. If we want to send the value of X to a receiver, then the amount of information is given as I(x_k) = –ln( p(x_k) ) if the variable takes the value x_k. Thus, the expected information needed to transmit the value of X is given by

E( I(x_k) ) = H_S(X) = –∑_k p(x_k) ln( p(x_k) )    (3.1)

which is called the entropy of the random variable X. Shannon's measure of entropy was developed essentially for the discrete case. Moving to the continuous random variable case, where summations are usually replaced with integrals, is not so trivial, because a continuous random variable takes values from –∞ to ∞, which makes the information content infinite. In order to avoid this problem, the continuous entropy calculation is treated as a differential entropy, instead of an absolute entropy as in the case of discrete random variables [Pap65] [Hog65]. If we let the interval between discrete values be

∇x_k = x_k – x_{k–1}    (3.2)

If the continuous density function is given as f(x), then p_k can be approximated by f(x_k) ∇x_k, so that

H = –∑_k p(x_k) ln( p(x_k) ) ≈ –∑_k f(x_k) ∇x_k ln( f(x_k) ∇x_k )    (3.3)

After some manipulations and replacing the summation by an integral we obtain

H = –∫_X f(x) ln( f(x) ) dx – ln( ∇x_k )    (3.4)

In this equation –ln(∇x_k) → ∞ as ∇x_k → 0, which suggests that the entropy of a continuous variable is infinite. If the equation is used to make comparisons between different density functions, the last term cancels out. We can drop the last term and use the equation as a measure of entropy by assuming that the measure is the differential entropy with the reference term –ln(∇x_k). If all measurements are made relative to the same reference point, dropping the last term from (3.4) is justified and we have



h(X) = –∫_X f(x) ln( f(x) ) dx    (3.5)

3.2 Maximum Entropy Principle

The Maximum Entropy Principle (MaxEnt), or the principle of maximum uncertainty, was proposed independently by Jaynes, Ingarden and Kullback [Jay57] [Kul59] [Kap92]. Given just some mean values, there are usually infinitely many compatible distributions. MaxEnt encourages us to select the distribution that maximizes the Shannon entropy measure while being consistent with the given constraints. In other words, out of all distributions consistent with the constraints, we should choose the distribution that has maximum uncertainty, or choose the distribution that is most random. Mathematically, this principle states that we should maximize

–∑_{i=1}^{N} p_i ln p_i    (3.6)

subject to

∑_{i=1}^{N} p_i = 1    (3.7)

∑_{i=1}^{N} p_i g_r(x_i) = a_r,   r = 1, …, M    (3.8)

and

p_i ≥ 0,   i = 1, …, N    (3.9)

The maximization can be done using Lagrange multipliers.

3.2.1 Mutual Information

Let's assume that H(X) represents the uncertainty about the system input before observing the system output, and that the conditional entropy H(X|Y) represents the uncertainty that remains about the input after observing the system output. The difference H(X) – H(X|Y) then represents the reduction in uncertainty about the system input obtained by observing the system output. This quantity is called the mutual information [Cov91] [Gra90] between the random variables X and Y, which is given by

I(X;Y) = H(X) – H(X|Y)    (3.10)

Entropy is a special case of mutual information, where

H ( X ) = I ( X ;X )

(3.11)

There are some important properties of the mutual information measure. These properties can be summarized as follows [Kap92].

1. The mutual information is symmetric, that is, I(X;Y) = I(Y;X).
2. The mutual information is always nonnegative, that is, I(X;Y) ≥ 0.

The mutual information can also be regarded as the Kullback-Leibler divergence [Kul59] between the joint pdf f_{X1 X2}(x_1, x_2) and the factorized marginal pdf f_{X1}(x_1) f_{X2}(x_2). The Kullback-Leibler divergence is defined in (3.12) and (3.13). The importance of mutual information is that it provides more information about the structure of two pdf functions than second-order measures. Basically it gives us information about how different the two pdfs are, which is very important in clustering. The same information cannot be obtained by using second-order measures.

3.3 Divergence Measures

In Chapter 2, the clustering problem was formulated as a distance between two distributions, but all the proposed measures were limited to second-order statistics (i.e., variance). Another useful entropy measure is the minimum cross-entropy measure, which gives the separation between two distributions [Kul59]. This is also called directed divergence, since most of these measures are not symmetrical, although they can be made symmetrical. Assume D(p, q) is a measure of the distance between the distributions p and q. If D(p, q) is not symmetric, then it can be made symmetric by introducing D'(p, q) = D(p, q) + D(q, p). Under certain conditions the minimization of a directed divergence measure is equivalent to the maximization of the entropy [Kap92]. The first divergence we will introduce is the Kullback-Leibler cross-entropy measure, which is defined as [Kul59]

D_KL(p, q) = ∑_{i=1}^{n} p_i ln( p_i / q_i )    (3.12)

where p = (p_1, p_2, …, p_n) and q = (q_1, q_2, …, q_n) are two probability distributions. The following are some important properties of the measure:

• D_KL(p, q) is a continuous function of p and q.
• D_KL(p, q) is permutationally symmetric.
• D_KL(p, q) ≥ 0, and it vanishes iff p = q.
• D_KL(p, q) ≠ D_KL(q, p)

The measure can also be formulated for continuous density functions,

D_KL(f, g) = ∫ f(x) ln( f(x) / g(x) ) dx    (3.13)

where it vanishes iff f(x) = g(x).
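A small numerical illustration of (3.12) and of the symmetrized form D'(p, q) = D(p, q) + D(q, p) is given below; the two example distributions are arbitrary and serve only to show the computation.

import numpy as np

def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence D_KL(p, q) = sum_i p_i ln(p_i / q_i)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                       # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]
    q = [0.4, 0.4, 0.2]
    print(kl_divergence(p, q))                        # directed divergence
    print(kl_divergence(p, q) + kl_divergence(q, p))  # symmetrized version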

3.3.1 The Relationship to Maximum Entropy Measure

The Kullback-Leibler divergence measure is used to measure the distance between two distributions. Where the second distribution is not given, it is natural to choose the distribution that has maximum entropy. When there are no constraints, we compare to the uniform distribution u. We will use the following measure to minimize:

D_KL(p, u) = ∑_{i=1}^{n} p_i ln( p_i / (1/n) )    (3.14)

In other words, we maximize

–∑_{i=1}^{n} p_i ln p_i    (3.15)

Thus, minimizing cross-entropy is equivalent to maximizing entropy when the distribution we are comparing to is the uniform distribution [Kap92]. Even though maximization of entropy can be thought of as a special case of the minimum cross-entropy principle, there is a conceptual difference between the two measures [Kap92]. The maximum entropy principle maximizes uncertainty, while the minimum cross-entropy principle minimizes a probabilistic distance between two distributions.

3.3.2 Other Entropy Measures

We are not restricted to Shannon's entropy definition; there are other entropy measures which are quite useful. One such measure is Renyi's entropy [Ren60], which is given as

H_R(X) = (1/(1–α)) ln( ∑_{k=1}^{n} p_k^α ),   α > 0, α ≠ 1    (3.16)

The Havrda-Charvat entropy is given as

H_HC(X) = (1/(1–α)) ( ∑_{k=1}^{n} p_k^α – 1 ),   α > 0, α ≠ 1    (3.17)

and

H_S(X) = lim_{α→1} H_R(X)    (3.18)

We will use Renyi's entropy measure for our derivation in the next chapter because of its better implementation properties. We can compare the three types of entropy in Table 3-1.

Table 3-1. The comparison of properties of three entropies

Properties                    Shannon's   Renyi's   H-C's
Continuous function           yes         yes       yes
Permutationally symmetric     yes         yes       yes
Monotonically increasing      yes         yes       yes
Recursivity                   yes         no        yes
Additivity                    yes         yes       no
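The implementation advantage of Renyi's entropy alluded to above is that, for the quadratic case (α = 2) combined with a Parzen density estimate using Gaussian kernels, the estimator reduces to a double sum over pairwise sample differences. The sketch below shows one standard way to write such an estimator in Python; the kernel width and the test data are arbitrary, and the exact estimator used in this work is the one derived in the next chapter, which may differ in normalization.

import numpy as np

def renyi_quadratic_entropy(x, sigma=0.1):
    """Estimate H_2 = -ln( integral f(x)^2 dx ) from samples via a Gaussian Parzen window.

    With Gaussian kernels the integral becomes a double sum of Gaussians
    evaluated at the pairwise sample differences."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    n, d = x.shape
    diff = x[:, None, :] - x[None, :, :]                  # pairwise differences
    sq = (diff ** 2).sum(axis=2)
    var = 2.0 * sigma ** 2                                # kernels convolved pairwise
    gauss = np.exp(-sq / (2.0 * var)) / (2.0 * np.pi * var) ** (d / 2.0)
    information_potential = gauss.sum() / (n * n)
    return -np.log(information_potential)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tight = rng.normal(0.0, 0.05, size=200)   # concentrated sample -> lower entropy
    spread = rng.normal(0.0, 0.50, size=200)  # spread-out sample -> higher entropy
    print(renyi_quadratic_entropy(tight), renyi_quadratic_entropy(spread))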

3.3.3 Other Divergence Measures

Another important measure of divergence is given by Bhattacharya [Bha43]. The distance D_B(f, g) is defined by

b(f, g) = ∫ √( f(x) g(x) ) dx    (3.19)

and

D B ( f , g ) = – ln b ( f , g )

(3.20)

D B ( f , g ) vanishes iff f = g almost everywhere. There is a non-symmetric measure, the so-called generalized Bhattacharya distance, or Chernoff distance [Che52], which is defined by



D_C(f, g) = –ln ∫ [f(x)]^{1–s} [g(x)]^s dx,   0 < s < 1    (3.21)


Figure 5-2. Simulated annealing algorithm

A typical feature of the algorithm is that, besides accepting improvements in cost, it also accepts, to a limited extent, deteriorations in cost. Initially, at large values of c, large deteriorations will be accepted; finally, as the value of c approaches 0, no deteriorations will be accepted at all. L_k should be sufficiently long so that the system reaches thermal equilibrium.
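A minimal Python sketch of this behavior is given below. It is a generic simulated annealing loop, not the exact procedure of Figure 5-2: the cooling factor, chain length, and stopping temperature are arbitrary assumptions for the example.

import math
import random

def simulated_annealing(cost, neighbor, state, c=1.0, L=100, alpha=0.95, c_min=1e-3):
    """At temperature c, a deterioration of size delta is accepted with
    probability exp(-delta / c); c is lowered until almost nothing is accepted."""
    best = state
    while c > c_min:
        for _ in range(L):                      # L plays the role of L_k
            cand = neighbor(state)
            delta = cost(cand) - cost(state)
            if delta <= 0 or random.random() < math.exp(-delta / c):
                state = cand
                if cost(state) < cost(best):
                    best = state
        c *= alpha                              # one possible cooling schedule
    return best

if __name__ == "__main__":
    random.seed(0)
    f = lambda x: (x - 3.0) ** 2 + math.sin(5 * x)       # toy cost with local minima
    step = lambda x: x + random.uniform(-0.5, 0.5)
    print(simulated_annealing(f, step, state=random.uniform(-10, 10)))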

5.3.1 A New Neighborhood Structure

We have some a priori information about the optimization problem. The problem is basically a clustering algorithm that tries to group pixels into regions. The pixels that belong to a class should be close to each other under some distance metric. This information will help us create another neighborhood structure and will eliminate many local minima. On top of this algorithm we will put a modified version of a simulated annealing algorithm, which gives us a chance to escape from local minima, if any. We start by looking at the problem closely. Assume that at an intermediate step the optimization reached

Figure 5-3. An instance

the following label assignment, which is shown in Figure 5-3. The ideal case is to label the group of points marked g2 as "Class2", so that the upper triangle will be labeled as "Class2" and the lower triangle will be labeled as "Class1". The variance of the Gaussian kernel is selected as σ = 0.08, and the value of the CEF function is 0.059448040 for the given configuration. When some of the labels in the group g2 are changed, the value of the CEF function will increase to 0.060655852, and when all the labels in the group g2 are changed to "Class2", the value of CEF will drop sharply to 0.049192752. This means that the 2-change N_2(p, q) neighborhood structure explained before will fail to label the samples in the group g2 correctly. This behavior and the local minima can also be seen in Figure 4-3. Of course this behavior disappears at a certain value of the variance, but there is no guarantee that lowering the value of the variance will always work for every data set. A better approach is to find a better neighborhood structure to escape from the local minima. In the previous example, one can make the following observation very easily: if we change the labels of the group of pixels g2 at the same time, then we can skip the local minimum. This brings two questions: how to choose the groups, and how big should they be?

5.3.2 Grouping Algorithm

We know the clustering algorithm should give the same label to samples that are close to each other in some distance metric. Instead of taking a fixed region around a sample, we used the following grouping algorithm. Let's assume that the group size is defined as KM. We will group samples from the same cluster starting from a large KM, which will be decreased during the iterations according to a predefined formula; this can be exponential or linear. The grouping algorithm is illustrated in Figure 5-4. Assume that the starting sample is x_1 and the subgroup size is 4. The first sample that is closest to x_1 is x_2, so

x_2 is included in the subgroup. The next sample that is closest to the initial sample x_1 is x_3, but we are looking for a sample that is closest to any sample in the subgroup, which in this case is x_4. The next sample closest to the subgroup is still not x_3 but x_5, which is close to x_4. The resulting subgroup { x_1, x_2, x_4, x_5 } is a more connected group than the one obtained by simply selecting the 4 pixels { x_1, x_2, x_3, x_4 } that are closest to the initial sample x_1. The grouping algorithm is more computationally intensive than taking the samples around the

Figure 5-4. Grouping

pixel, but it will help the algorithm escape from local minima. The initial cluster labels come from the random initialization at the beginning of the algorithm. This grouping is done starting from every pixel, and groups that are equal are eliminated. The grouping is done among the samples that have the same cluster label, and is done for each cluster label independently. When the grouping is finished for a certain group size, for all cluster labels, they are joined into a single big group, and they will be used in the optimization process instead of the original data set. We will obtain the set of groups seen in Figure 5-9. The difference of this grouping algorithm can be seen clearly using the data set in Figure 5-5. The group size is initialized as KM = N ⁄ C, where N is the total number of data samples and C is the number of clusters.

In Figure 5-6, the group is selected starting from the sample x_1, and the closest N samples are selected. The group shown in Figure 5-6 is selected using N = 10; this group is actually the kNN of point x_1. The proposed grouping algorithm, on the other hand, creates the group shown in Figure 5-7. As can be seen from both examples, the proposed grouping is more sensitive to the structure of the data, whereas the kNN method, a very common method, is less sensitive to it. Several grouping examples can be seen in Figure 5-8. Using the proposed method will create a better grouping structure for our clustering algorithm.

Figure 5-5. Data set for grouping algorithm

Figure 5-6. Grouping using 10 nearest samples (kNN)

Figure 5-7. Grouping using the proposed algorithm (group size is 10)

Figure 5-8. Examples using proposed grouping algorithm
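A minimal Python sketch of the greedy growth step described above is given below: starting from a seed sample, the subgroup repeatedly absorbs the same-labeled sample that is closest to any current member, rather than to the seed alone. The function and variable names are hypothetical, and the random data and labels only illustrate the mechanics.

import numpy as np

def grow_group(X, labels, seed_idx, group_size):
    """Grow a subgroup of samples sharing the seed's label, always adding the
    candidate closest to ANY current member (not just the seed)."""
    same = np.where(labels == labels[seed_idx])[0]
    members = [seed_idx]
    candidates = set(same.tolist()) - {seed_idx}
    while len(members) < group_size and candidates:
        cand = np.array(sorted(candidates))
        # distance of every candidate to its nearest current member
        d = np.min(np.linalg.norm(X[cand][:, None, :] - X[members][None, :, :], axis=2), axis=1)
        nxt = int(cand[int(np.argmin(d))])
        members.append(nxt)
        candidates.remove(nxt)
    return members

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 2))
    labels = rng.integers(0, 2, size=30)     # random initial cluster labels
    print(grow_group(X, labels, seed_idx=0, group_size=5))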

5.3.3 Optimization Algorithm

First we consider the case where the groups overlap, and propose an optimization algorithm. The optimization starts by choosing an initial group size KM, which is smaller

Groups ( Group size = 1 )
{
    GRP = CREATE_GROUPS ( InputData, KM, ClusterLabels )
    REPEAT UNTIL NOCHANGE
    {
        FOR i = 0 to SIZE(GRP)
        {
            Change the clustering labels of GRP(i) and record the improvement if any
            FOR j = i+1 to SIZE(GRP)
            {
                Change the clustering labels of GRP(i) and GRP(j) and record the improvement if any
            }
        }
    }
    ; Decrease the group size. This can be linear or exponential
    KM = GENERATE_NEWGROUPSIZE(KM)
}

Figure 5-10. Optimization algorithm

without any random decision. Of course we can do this for every pair, but this would end up being a complete enumeration of the possible solutions, which is not computationally feasible. We repeat calculating new cluster labels until no improvement is possible with the given group size KM. The next step is to reduce the group size KM and repeat the whole process until the group size is 1, which is equivalent to the 2-change algorithm with the N_2(p, q) neighborhood structure explained before. When the algorithm stops, it is a good idea to run one more pass by setting KM to the initial value and running the whole process again on the same data, with the previously obtained clustering as the initial condition; this can be repeated until no change is recorded. It is also possible to increase the group size starting from one up to a final value. Repeating the algorithm will help us escape from local minima, unlike previous attempts. Several optimization algorithms can be derived from the given algorithm. Let's assume that we are increasing the value of the group size KM, and at a certain group size we made an improvement. Instead of increasing the group size at the next step, we can decrease the group size until no further improvements occur in the cost function. When the improvements stop, we can continue to increase the group size. This variation is shown in Figure 5-11. It should be noted that none of these algorithms is guaranteed to reach the global optimum of the cost function.

INITIALIZE ( KM, var, ClusterLabels )
ClusterLabels = random()
WHILE ( KM >= 1 )
{
    CHECK FOR IMPROVEMENT
    IF NO IMPROVEMENT
        KM = KM - 1
    IF IMPROVEMENT
        KM = KM + 1
ENDWHILE

Figure 5-11. A variation of the algorithm


It is possible to record the change that makes the biggest improvement at the end of the outside loop, after calculating all possible improvements, instead of changing it immediately when an improvement is recorded. This may result in a slightly different solution.

5.3.4 Convergence

Since there is a finite number of combinations tested, and since the algorithm continues to work only when there is an improvement in the cost function, the algorithm is guaranteed to stop in a finite number of iterations. There is only one condition under which the algorithm may not stop: because of underflow errors, if the value of the function becomes zero, the algorithm will run forever, since no improvement can be made beyond zero. The integral is always greater than or equal to zero. This is a computational problem, not an algorithmic problem, so whenever the cost function becomes zero, the algorithm should be forced to stop.

5.4 Preliminary Result

We tested our algorithm using multiple clusters and the results are given in Figure 5-12. Three partial sinewaves were created, with Gaussian noise of different variance added, using 150 samples per cluster. The algorithm was run with KM set to 150, linearly annealed one by one until the group size reached one. The kernel size used in this experiment is 0.05 and the minimum value of the CEF function is 3.25E-05. Each symbol represents a different cluster. As we can see, the algorithm was able to find the natural boundaries of the clusters, although they are highly nonlinear. We also tested the algorithm with more overlapping clusters; the input data is given in Figure 5-13. We obtained meaningful results when we set the variance of the kernel such that the pdf estimation shows the structure of the data properly.

Figure 5-12. Output with three clusters

When we say meaningful, it means meaningful to the person that created the data, which may not always be what is given in the data. Another important property of all these results is that all of them represent a local optimum of the cost function; it is always possible that another solution with a lower value of the cost function exists. We also tested the algorithm with different variance values. Results are given in Figure 5-14. When the variance is high, single-point clusters are formed. As we dropped the variance, the results improved, and we obtained a result that mimics the data set generation. Points that are isolated may not represent the structure of the data as well as the other points, so it is possible that those points will form single-point clusters. This is a basic problem of the proposed cost function. It is possible to eliminate these points and run the algorithm again.


Figure 5-13. Overlapping boundaries

Another test set, where the clusters do not have the same number of points, is used to see the behavior of the algorithm. The input data can be seen in Figure 5-15, and the output of the algorithm in Figure 5-16. To get more information about the convergence of the algorithm, we collected some statistics each time an improvement occurred in the algorithm given in Figure 5-10. The statistics collected are the entropy of each class and the value of the CEF function. The value of the CEF function can be seen in Figure 5-17. The plot is the output of the algorithm during the minimization of the data given in Figure 5-15. The horizontal scale is compressed time and does not reflect the real time passed. In the beginning of the algorithm the interval between improvements was short, whereas as the calculations continue, the time interval between improvements gets bigger. The minimum value is not zero, but a very small value. The entropy of each class is a more interesting statistic and is given in Figure 5-18. Renyi's entropy is used to calculate the entropy of each cluster generated after each improvement.


Figure 5-14. Output with a variance of 0.026

Figure 5-15. Non-symmetric clusters

An interesting plot occurs when we plot the inverse of the entropies, which can be seen in Figure 5-19. Although the clusters are not symmetrical, the inverse entropies are almost a mirror reflection of each other. When we add them up, we get the plot in Figure 5-20. The interesting thing about this figure is that the function decreases in parallel with the CEF function.


Figure 5-16. Output with variance 0.021

Figure 5-17. CEF function


Figure 5-18. Entropies of each cluster

Figure 5-19. Inverse entropies of each cluster


Figure 5-20. Sum of inverse entropies

5.5 Comparison

We would like to collect the performance measures of the algorithm in several tables, where the algorithm is compared to a k-means clustering algorithm and to a neural network based classification algorithm. It should be mentioned that it is only meaningful to compare our algorithm with a neural network if the data set contains valley(s) between clusters; otherwise a neural network with enough neurons is capable of separating the samples even when there is no valley between them, because of the supervised learning scheme. To be able to use a supervised classification algorithm, we assume that the data is generated with known labels. We used the data in Figure 5-12, Figure 5-13 and Figure 5-15. We used a neural network topology with one hidden layer of 10 hidden nodes and two output nodes. The results show that the proposed algorithm is superior to the k-means algorithm and, although it is an unsupervised method, performed on par with a supervised classification algorithm.

Table 5-1. Results for k-means algorithm

DATA SET1    Class 1   Class 2   Class 3
Class 1      110       40        0
Class 2      25        90        35
Class 3      2         36        112

Table 5-2. Results for k-means algorithm

DATA SET2    Class 1   Class 2
Class 1      350       0
Class 2      69        221

Table 5-3. Results for k-means algorithm

DATA SET3    Class 1   Class 2
Class 1      130       20
Class 2      31        319

Table 5-4. Results for supervised classification algorithm

DATA SET1    Class 1   Class 2   Class 3
Class 1      150       0         0
Class 2      0         150       0
Class 3      0         0         150

Table 5-5. Results for supervised classification algorithm

DATA SET2    Class 1   Class 2
Class 1      342       8
Class 2      9         341

Table 5-6. Results for supervised classification algorithm

DATA SET3    Class 1   Class 2
Class 1      144       6
Class 2      9         341

Table 5-7. Results for the proposed algorithm

DATA SET1    Class 1   Class 2   Class 3
Class 1      150       0         0
Class 2      0         150       0
Class 3      0         0         150

Table 5-8. Results for the proposed algorithm

DATA SET2    Class 1   Class 2
Class 1      340       10
Class 2      12        338

Table 5-9. Results for the proposed algorithm

DATA SET3    Class 1   Class 2
Class 1      145       5
Class 2      9         341

Figure 5-21. Output of K-Means


Figure 5-22. Output of EM using 2 kernels


Figure 5-23. Output of the supervised classification

Figure 5-24. Output of the CEF with a variance of 0.026

CHAPTER 6 APPLICATIONS

6.1 Implementation of the IMAGETOOL program

6.1.1 PVWAVE Implementation

The complexity of MRI images requires a special interface in order to analyze, collect and visualize the data. Another reason to develop the interface is to integrate the methods with the interface so that the algorithms can be used practically with any MRI image, and the results can be reproduced and validated visually. As a development platform, PVWAVE was chosen because of its compatibility between different architectures and because of the tools provided to develop a GUI, combined with the powerful mathematical packages which are used in the algorithms. The program is called Imagetool. The main window of the program can be seen in Figure 6-1. Imagetool can read any MRI image which is stored in unstructured byte format. The MRI images can be viewed from three different axes, and moving along any axis slice by slice is possible using the right and left mouse buttons. The image size can be increased or decreased by an integer factor for easier observation of the brain structures. When the cursor is on the image, a display shows the current position of the mouse on the brain and the value of the pixel at that position. The contrast and brightness can be changed using sliders. Any part of the image can be selected as a 3-D volume and certain statistics, such as the mean and variance, can be calculated. The proposed segmentation algorithm is integrated with the tool, and any selected region can be segmented manually. The program is available as a Computational NeuroEngineering Laboratory (CNEL) internal report [Gok98].

Figure 6-1. Imagetool program main window

6.1.2 Tools Provided With the System

A comprehensive drawing tool is provided to hand-segment the image, which is useful for obtaining data labeled by an expert. The segmented data can be saved separately in ASCII format to be used by different programs. The main screen of the drawing tool can be seen in Figure 6-2. A free-hand drawing tool is provided with an option to connect the points. A spline interpolation can be applied to selected regions of the drawn boundary to fill gaps, if any. It is possible to paint inside the boundary provided the boundary is a closed curve. When the boundary is drawn across several slices, a 3-D view of the boundary is possible. To correct errors made during drawing, an UNDO/REDO function is provided back to the starting point. The boundary can be saved, or a saved boundary can be loaded to view/modify.

Figure 6-2. Drawing tool

To increase the efficiency of the data analysis, several different viewers are provided. The Axis Viewer (Figure 6-3) displays all three axes at the same time, whereas the Multiple Slices Viewer displays several slices at the same time (Figure 6-4). Finally, the 3-D Viewer of any selected region helps to visualize the data structure. The 3-D Viewer (Figure 6-5) is useful to visualize the data, and it is possible to view not just the data itself but also the segmented data. The displayed image can be rotated using the slide buttons. Figure 6-5 is obtained by selecting the whole data set.

Figure 6-3. Axis viewer

Figure 6-4. Multiple slices viewer

Figure 6-5. 3-D viewer

6.2 Testing on MR Images

6.2.1 Feature Extraction

If we don't have a good feature set, even the most powerful algorithm won't help us to identify different brain tissues. Although the topic of this thesis is not feature extraction, the following feature set was found to be very useful during the simulations. The first feature is the brain image itself. The second feature is calculated by taking 2x2 square regions starting from every point in the brain image and calculating the Renyi entropy of these regions. Notice that since the starting point is at every pixel, the regions are overlap-

ping. This calculation gives us more information about the edges; an illustration can be seen in Figure 6-6. Another important property of the calculation is that it measures the smoothness of the brain tissue. We propose this feature extraction method as a new edge detector, although the properties of the detector should be investigated further.

Figure 6-6. Entropy calculation
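One way to compute such a block-entropy feature is sketched below in Python: for every pixel, the Renyi quadratic entropy of the overlapping 2x2 block anchored at that pixel is estimated with Gaussian kernels. The kernel width, the synthetic test image, and the function name are assumptions made only for this illustration; the exact estimator, kernel variance, and any normalization used in the actual feature extraction may differ.

import numpy as np

def block_entropy_feature(img, sigma=4.0):
    """Renyi quadratic entropy of every overlapping 2x2 block (Gaussian kernels)."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    out = np.zeros((h - 1, w - 1))
    norm = 1.0 / np.sqrt(2.0 * np.pi * (2.0 * sigma ** 2))
    for i in range(h - 1):
        for j in range(w - 1):
            block = img[i:i + 2, j:j + 2].ravel()          # 4 samples per block
            diff = block[:, None] - block[None, :]
            ip = np.mean(norm * np.exp(-diff ** 2 / (4.0 * sigma ** 2)))
            out[i, j] = -np.log(ip)                        # entropy estimate of the block
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.normal(0.0, 1.0, (32, 32))
    image[:, 16:] += 50.0                                  # a synthetic intensity edge
    feat = block_entropy_feature(image)
    enhanced = image[:-1, :-1] * feat                      # multiply with the image itself
    print(feat.shape, enhanced.shape)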

6.2.2 Test Image

A small test image, given in Figure 6-7, is used to see the power of the algorithm. We selected a small region of size 30x30 from the brain MRI image given in Figure 6-12 and calculated a two-dimensional feature set using the entropy of 2x2 blocks and the image itself. The entropy of the blocks can be seen in Figure 6-10(a). We used the Renyi entropy derived before, with Gaussian kernels using a variance of 20.0. The feature set created using the entropy of the blocks enhances the edges between tissues; the feature set has a low value only on the edges. In order to differentiate between tissue classes we multiply the feature set with the image itself. To our surprise, the multiplication of the feature set with the image itself increases the contrast of the brain image, which can be seen in Figure 6-10(b). This may be due to the fact that the smoothness of the gray and white matter differs. The variance of the kernel used in the clustering procedure is 5.0 and the algorithm reached a minimum point of 2.08 × 10^-7. We can see the difference between the histograms of the original image and the enhanced image in Figure 6-8 and Figure 6-9.

Figure 6-7. Test image

Figure 6-8. Histogram of the original image

Figure 6-9. Histogram of the enhanced image

Figure 6-10. Entropy of the blocks and second feature set

The rest of the image is classified using the results obtained above. After generating the feature set, the distance of each pixel to each of the classes found using the small test

image is measured using the distance measure CEF developed in Chapter 4, and the pixel is assigned to the closest class. This type of clustering, where the clustering is done on a small subset of the data and the rest of the data is simply reclassified without running the clustering algorithm, is preferred because of the high computational cost of the minimization. After classifying all the pixels, the clustering algorithm can be run with a very small group size (i.e., 1 or 2) to check for further improvements.

Figure 6-11. Input test image and output of clustering algorithm

The full brain image can be seen in Figure 6-12 and the output of the clustering algorithm is shown in Figure 6-13. The white matter seems to be clustered very smoothly, except for a couple of discontinuous pixels. The CSF seems to be broken in many areas, but since the gray matter folds and touches other gray matter, this is no surprise. The algorithm missed a spot at the top of the brain, where the structure looks like white matter but is classified as gray matter. Running the algorithm on one slice has an influence on the results. Including more than one slice in the feature extraction and in the reclassification of the rest of the pixels will definitely change the results because of the improved estimation of the probability density. One way of achieving this is to take a 3-D block during the calculation of features. For example, we can use 2x2x2 blocks to calculate the feature vector, which is generated by measuring the entropy of each block.

Figure 6-12. Full brain image (single slice)

6.3 Validation

On visual inspection of the segmented brain images the results appear to be good, but we need a quantitative assessment. We adapted the validation method explained in Chapter 1, using the fact that the percentage of gray matter decreases with age while the percentage of white matter increases with age. The total cerebral volume does not change significantly after age 5, but the percentages of white matter and gray matter both change by about 1 percent per year [Rei96]. Detecting this change should be a good measure of the quality of the segmentation algorithm, although validation of the individual structures cannot be done using this method. We tested our algorithm on MR images of two children scanned at a two-year interval, with ages ranging from 5 to 10 years. Scan sequences were acquired in a 1.5T Siemens Magnetom using a quadrature head coil: a gradient echo volumetric acquisition "Turboflash" MP RAGE sequence (TR = 10 ms, TE = 4 ms, FA = 10°, 1 acquisition, 25 cm field of view, matrix = 130x256) that was reconstructed into a gapless series of 128 1.25-mm-thick images in the sagittal plane.

Figure 6-13. Output of clustering algorithm

6.3.1 Brain Surface Extraction

Since we would like to measure the change of the gray and white matter of the brain only, we removed the skull and the remaining tissues using an "Automated Brain Surface Extraction" program called BSE [San96] [San97], developed at the University of Southern California. The algorithm in this package uses a combination of non-linear smoothing, edge-finding and morphologic processing; the details can be found in [San97]. An example can be seen in Figure 6-14. The correct filter parameters are found by trial and error, since there is no universal way of setting the parameters. The extraction program usually

has difficulty removing the bottom part of the brain. This can be seen in Figure 6-15, where the algorithm couldn't remove the bottom part of the head. Setting the parameters to remove the bottom part usually results in holes in the brain itself. Since we don't want to compromise the brain structure, we applied the extraction to the images from top to bottom; an example can be seen in Figure 6-16.

Figure 6-14. Brain extraction sample 1

Figure 6-15. Brain extraction sample 2

6.3.2 Segmentation

The algorithm is applied to MR images of two different children of ages 5 years and 8 years at the first scan, and 7.5 years and 10.2 years at the second scan. Because of the large data set, the algorithm is applied to a small section of the brain, and afterwards the rest of the pixels are reclassified by measuring the CEF value between the unclassified pixels and the labeled pixels. The feature extraction is performed by using 2x2x2 blocks, calculating the entropy and mean of each block, and multiplying the entropy values with the image itself, creating a contrast-enhanced image which is used as a feature. The mean of each block becomes another feature.

Figure 6-16. Brain extraction sample

The values of the features are between 0 and 400.0. The variance of the kernel should not be so small that the distributions lose their smoothness, and it should not be so large that it causes over-smoothed distributions. The selected value for the variance is between 15.0 and 20.0. Selection of the variance is one particular aspect of the clustering algorithm that needs to be improved further. The group size is selected as N/3, where N is the total number of points in the training data set. The group size can be selected between 1 and N; the bigger the group size, the better the chance of escaping from local minima. Of course, the price is increased computation time.


6.3.3 Segmentation Results

Figure 6-17. Original, enhanced and segmented MR images (5.5 years, slice 50)

Figure 6-18. Original, enhanced and segmented MR images (5.5 years, slice 60)

Figure 6-19. Original, enhanced and segmented MR images (5.5 years, slice 65)


Figure 6-20. Original, enhanced and segmented MR images (5.5 years, slice 70)

Figure 6-21. Original, enhanced and segmented MR images (5.5 years, slice 80)

Figure 6-22. Original, enhanced and segmented MR images (7.5 years, slice 50)


Figure 6-23. Original, enhanced and segmented MR images (7.5 years, slice 60)

Figure 6-24. Original, enhanced and segmented MR images (7.5 years, slice 70)

Figure 6-25. Original, enhanced and segmented MR images (7.5 years, slice 75)


Figure 6-26. Original, enhanced and segmented MR images (7.5 years, slice 80)

Figure 6-27. Original, enhanced and segmented MR images (8 years, slice 60)

Figure 6-28. Original, enhanced and segmented MR images (8 years, slice 70)


Figure 6-29. Original, enhanced and segmented MR images (8 years, slice 75)

Figure 6-30. Original, enhanced and segmented MR images (8 years, slice 80)

Figure 6-31. Original, enhanced and segmented MR images (8 years, slice 85)


Figure 6-32. Original, enhanced and segmented MR images (10.2 years, slice 60)

Figure 6-33. Original, enhanced and segmented MR images (10.2 years, slice 70)

Figure 6-34. Original, enhanced and segmented MR images (10.2 years, slice 75)


Figure 6-35. Original, enhanced and segmented MR images (10.2 years, slice 80)

Figure 6-36. Original, enhanced and segmented MR images (10.2 years, slice 85)

6.4 Results

After all the images are segmented, the percentages of white matter, gray matter and CSF with respect to the cerebral volume are calculated and given in Table 6-1. The percentage of white matter relative to the cerebral volume increased, whereas the percentage of gray matter relative to the cerebral volume decreased. The results are in good agreement with the published results in [Rei96], which are given in Figure 6-37 and Figure 6-38.¹ We embedded our results in the figures as big black dots. The comparison shows that the clustering algorithm successfully segmented the brain image.

1. Figure 6-37 and Figure 6-38 are reprinted with the permission of Oxford University Press.


Figure 6-37. Change of gray matter

Figure 6-38. Change of white matter

Table 6-1. Percentages of brain matter

              white     gray      CSF
5.5 years     31.45%    57.97%    10.56%
7.5 years     33.37%    55.73%    9.39%% -->
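The percentages in Table 6-1 are simple ratios of voxel counts relative to the cerebral volume. A minimal sketch of that bookkeeping is given below; the label codes (1, 2, 3 for white matter, gray matter and CSF, anything else background) and the random test volume are assumptions for this example only.

import numpy as np

WHITE, GRAY, CSF = 1, 2, 3   # hypothetical label codes

def tissue_percentages(labels):
    """Percentage of each tissue class relative to the total cerebral volume."""
    labels = np.asarray(labels)
    cerebral = np.isin(labels, (WHITE, GRAY, CSF)).sum()
    return {name: 100.0 * (labels == code).sum() / cerebral
            for name, code in (("white", WHITE), ("gray", GRAY), ("CSF", CSF))}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    vol = rng.choice([0, WHITE, GRAY, CSF], size=(64, 64, 64), p=[0.4, 0.2, 0.3, 0.1])
    print(tissue_percentages(vol))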

CHAPTER 7 CONCLUSIONS AND FUTURE WORK

We would like to summarize the contributions made in this dissertation. The initial goal was to develop a better clustering algorithm which can be used to cluster nonlinearly separable clusters; second-order statistics are not sufficient for this purpose. In order to achieve this goal, we developed a cost function which can be used to measure cluster separability and which can be used for nonlinearly formed clusters. The cost function is easily computable, which reduces the normally high computational requirements of clustering algorithms. The cost function is developed using information-theoretic distance measures between probability density functions, and the relation of different measures to the clustering algorithm is investigated. The cost function is basically a valley seeking algorithm, where the distance measure is calculated from the data without requiring numerical integration; it is a data-driven method. An optimization procedure which is efficient and simple is also developed to be used with the proposed method, although the use of the optimization algorithm is not limited to the method proposed in this dissertation. We also found that certain distance measures are not suitable for use in our clustering algorithm. In the second part of the dissertation, we applied the algorithm to the segmentation of MR images. The segmentation was found to be very successful. There are certain areas where the algorithm can be improved further. It is possible to adapt the kernel shape dynamically according to the data. Having a fixed circular shape may not be suitable for all data sets. Instead of fixing the shape, we can change the shape

of the kernel using the samples around the pixel. We can use the samples already found during our grouping algorithm, and fit a Gaussian kernel to each group of pixels whenever the grouping is changed. This may improve the result of our clustering algorithm. The success of the algorithm depends on the estimation of the probability density function of each cluster generated during the iterations, which is controlled to some degree by the variance of the kernel used. A more systematic way should be developed to adjust the variance of the kernel; this is a problem related to probability density estimation and it is more general than the clustering problem. The algorithm can be used only in batch mode in its current form. This is an area which can be improved with an online optimization algorithm, which would help in cases where not all the data is available at the same time for batch learning. Running the algorithm on large data sets is very time-consuming; this is an area which needs to be improved. Clustering the data into more compact clusters will help to improve the efficiency of the method, although that brings up the question of how to adjust the compact cluster sizes.

REFERENCES

[Aar90]

Aarts, E., Simulated Annealing and Boltzmann Machines, John Wiley & Sons, New York, 1990.

[Ama77]

Amari, S.I., "Neural theory of association and concept-formation," Biological Cybernetics, V26, p175-185, 1977.

[Ash90]

Ashtari, M., Zito, J.L., Gold, B.I., Lieberman, J.A., Borenstein, M.T., Herman, P.G., “Computerized volume measurement of brain structure,” Invest Radiology, V25, p798-805, 1990.

[Bea92]

Bealy, D.M., Weaver, J.B., “Two applications of wavelet transforms in magnetic resonance imaging,” IEEE Transactions in Information Theory, V38, p840-860, 1992.

[Ben66]

Benson, Russell V., Euclidean Geometry and Convexity, McGraw-Hill, New York, 1966.

[Ben94a]

Benes, F.M., Turtle, M., Khan, Y., and Farol, P., "Myelination of a key relay zone in the hippocampal formation occurs in the human brain during childhood, adolescence, and adulthood," Archives of General Psychiatry, V51, p477-484, 1994.

[Ben94b]

Bensaid, A.M., Hall, L.O., Bezdek, J.C., Clarke, L.P., "Fuzzy cluster validity in magnetic resonance images," Proceedings of SPIE, V2167, p454-464, 1994.

[Bez93]

Bezdek, J. C., “Review of MRI Image Segmentation techniques using pattern recognition,” Medical Physics, V20, N4, p1033-1048, 1993.

[Bha43]

Bhattacharya, A., "On a measure of divergence between two statistical populations defined by their probability distributions," Bull. Cal. Math. Soc., V35, p99-109, 1943.


[Bis95]

Bishop, C., Neural Networks for Pattern Recognition, Oxford University Press, New York, 1995.

[Bit84]

Bittoun, J.A., “A computer algorithm for the simulation of any nuclear magnetic resonance imaging method,” Journal of Magnetic Resonance Imaging, V2, p113-120, 1984.

[Bom90]

Bomans, M., Hohne, K.H., Tiede, U., Riemer, M., "3-D segmentation of MR images of the head for 3-D display," IEEE Transactions on Medical Imaging, V9, p177-183, 1990.

[Bou94]

Bouman C.A., “A multiscale random field model for bayesian image segmentation,” IEEE Transactions on Image Processing, V3, p162-176, 1994

[Bra93]

Bradley, W.G., Yuh, W.T.C., Bydder, G.M., “Use of MR imaging contrast agents in the brain,” Journal of Magnetic Resonance Imaging, V3, p199-232, 1993.

[Bra94]

Brandt, M.E., Bohan, T.P., Kramer, L.A., Fletcher, J.M., “Estimation of CSF, white and gray matter volumes in hydrocephalic children using fuzzy clustering of MR images,” Computational Medical Imaging, V18, p25-34, 1994.

[Bro87]

Brody, B.A., Kinney, H.C. , Kloman, A.S. and Gilles, F.H., “Sequence of central nervous system myelination in human infancy.I. An autopsy study of myelination,” Journal of Neuropathology & Experimental Neurology V46, p283-301, 1987.

[Bro90]

Bronen, R.A., Sze, G., “Magnetic resonance imaging contrast agents: Theory and application to the central nervous system,” Journal of Neurosurgery, V73, p820-839, 1990.

[Bru93]

Brummer, M.E., Eisner, R.L., “Automatic Detection of Brain Contours in MRI Data Sets,” IEEE Trans. on Med. Imag., V12, N2, p152-166, 1993.

[Buc91]

Buchbinder, B.R., Belliveau, J.W., McKinstry, R.C., Aronen, H.J., Kennedy, M.S., “Functional MR imaging of primary brain tumors with PET correlation,” Society of Magnetic Resonance in Medicine, V1, 1991.


[Car88]

Carpenter, G.A., Grossberg, S., "The ART of adaptive pattern recognition by a self-organizing neural network," Computer, V21, N3, p77-88, March 1988.

[Car87a]

Carpenter, G.A., Grossberg, S., "A massively parallel architecture for a self-organizing neural pattern recognition machine," Computer Vision, Graphics, and Image Processing, V37, p54-115, 1987.

[Car87b]

Carpenter, G.A., Grossberg, S., "ART2: Self-organization of stable category recognition codes for analog input patterns," Applied Optics, V26, p4919-4930, 1987.

[Car90]

Carpenter, G.A., Grossberg, S., “ART3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures,” Neural Networks, V3, p129-152, 1990.

[Car91a]

Carpenter, G.A., Grossberg, S., "ART2-A: An adaptive resonance algorithm for rapid category learning and recognition," Neural Networks, V4, p493-504, 1991.

[Car91b]

Carpenter, G.A., Grossberg, S., "ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network," Neural Networks, V4, p565-588, 1991.

[Che52]

Chernoff, H., “A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations,” Ann. Math. Stat., V23, p493-507, 1952.

[Cla93]

Clarke, L.P., Velthuizen, R.P., Phuphanich, S., Shellenberg, J.D., Arrignton, J.A., “Stability of three supervised segmentation techniques,” Magnetic Resonance Imaging, V11, p95-106, 1993.

[Cli87]

Cline, H.E., Dumoulin, C.L., Hart, H.R., “3D reconstruction of the brain from magnetic resonance images using a connectivity algorithm,” Magnetic Resonance Imaging, V5, p345-352, 1987.

[Cli91]

Cline, H.E., Lorensen, W.E., Souza, S.P., Jolesz, F.A., Kikinis, R., Gerig, G., Kennedy, T.E., "3D surface rendered MR images of the brain and its vasculature," Computer Assisted Tomography, V15, p344-351, 1991.


[Chr81]

Christensen, R., Entropy Minimax Sourcebook, V1, Entropy Ltd., Lincoln, MA, 1981.

[Coh91]

Cohen, L., “On active contour models and balloons,” Computer Vision, Graphics and Image Processing, V53, p211-218, 1991.

[Cov91]

Cover T.M., Thomas, J.A., Elements of Information Theory, Wiley, New York, 1991.

[Dav96]

Davatzikos, C., Bryan, R.N., “Using a Deformable Surface Model to Obtain a Shape Representation of the Cortex,” Technical Report, Johns Hopkins University, Baltimore, 1996.

[Daw91]

Dawant, B.M., Ozkan, M., Zijdenbos, A., Margolin, R., "A computer environment for 2D and 3D quantitation of MR images using neural networks," Proceedings of the 13th IEEE Engineering in Medicine and Biology Society, V13, p64-65, 1991.

[Del91a]

Dellepiane, S., "Image segmentation: Errors, sensitivity and uncertainty," Proceedings of the 13th IEEE Engineering in Medicine and Biology Society, V13, p253-254, 1991.

[Del91b]

Dellepiane, S., Venturi, G., Vernazza, G., "A fuzzy model for the processing and recognition of MR pathologic images," Information Processing in Medical Imaging, p444-457, 1991.

[Dem77]

Dempster, A.P., "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B, V39, p1-38, 1977.

[Dud73]

Duda, R.O., Hart P.E., Pattern Classification and Scene Analysis, Wiley, New York, 1973.

[Eil90]

Eilbert, J.L., Gallistel, C.R., McEachron, D.L., “The variation in user drawn outlines on digital images,” Computational Medical Imaging, V14, p331-339, 1990.

[Fug70]

Fukunaga, K., Koontz, W.L.G., "A criterion and an algorithm for grouping data," IEEE Transactions on Computers, V19, p917-923, 1970.


[Fug90]

Fukunaga, K., Introduction to Statistical Pattern Recognition, Academic Press, New York, 1990.

[Gal93]

Galloway, R.L., Maciunas, R.J., Failinger, A.L., “Factors affecting perceived tumor volumes in magnetic resonance imaging,” Annals of Biomedical Engineering, V21, p367-375, 1993.

[Ger82]

Gersho, A., “On the structure of vector quantizers,” IEEE Transactions on Information Theory, V28, p157-166, 1982.

[Ger92a]

Gersho, A., Gray, R.M., Vector Quantization and Signal Compression, Kluwer, Norwell, MA, 1992.

[Ger92b]

Gerig, G., Martin, J., Kikinis, R., Kubler, O., Shenton, M., Jolesz, F.A., "Unsupervised tissue type segmentation of 3D dual-echo MR head data," Image Vision Computing, V10, p349-360, 1992.

[Gie99]

Giedd J.N., Blumental J., Jeffries N.O., Castellanos F.X., Liu H., Zijdenbos A., Paus T., Evans A.C., Rapoport J.L, “Brain development during childhood and adolescense: a longitudinal MRI study,” Nature Neuroscience, V2, p861-863, 1999.

[Gok98]

Gokcay, E. “A pvwave interface to visualize brain images: Imagetool,” CNEL internal report, University of Florida, 1998.

[Gok00]

Gokcay E., Principe J., “A new clustering evaluation function using Renyi’s information potential,” ICASSP 2000, Istanbul, Turkey, 2000.

[Gra84]

Gray, R.M., “Vector quantization,” IEEE ASSP Magazine, V1, p4-29, 1984.

[Gra90]

Gray, R.M., Entropy and Information Theory, Springer-Verlag, New York, 1990.

[Gre74]

Greenberg, M.J., Euclidean and non-Euclidean Geometries: Development and History, W. H. Freeman, San Francisco, 1974.

[Gro69]

Grossberg, S., “On learning and energy-entropy dependence in recurrent and nonrecurrent signed networks,” Journal of Statistical Physics, V1, p319-350, 1969.

[Gro76a]

Grossberg, S., “Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors,” Biological Cybernetics, V23, p121-134, 1976.

[Gro76b]

Grossberg, S., “Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, and illusions,” Biological Cybernetics, V23, p187-202, 1976.

[Gui97]

Guillemaud, R., Brady, M., “Estimating the bias field of MR images,” IEEE Trans. on Med. Imag., V16, N3, p238-251, 1997.

[Hal92]

Hall, L.O., Bensaid, A.M., Clarke, L.P., Velthuizen, R.P., Silbiger, M.S., Bezdek, J.C., “A comparison of neural network and fuzzy clustering techniques in segmenting magnetic resonance images of the brain,” IEEE Transactions on Neural Networks, V3, p672-682, 1992.

[Har85]

Haralick, R.M., Shapiro, L.G., “Image segmentation techniques,” Computer Vision, Graphics, and Image Processing, V29, p100-132, 1985.

[Har75]

Hartigan, J., Clustering Algorithms, Wiley, New York, 1975.

[Has95]

Hassoun, M.H., Fundamentals of Artificial Neural Networks, MIT, Cambridge, MA, 1995.

[Hay94]

Haykin, S., Neural Networks, IEEE Press, New Jersey, 1994.

[Heb49]

Hebb, D.O., The Organization of Behaviour:A Neuropsychological Theory, New York, Wiley, 1949.

[Hec89]

Hecht-Nielsen R., Neurocomputing, Addison-Wesley, Reading, MA, 1989.

[Hei93]

Heine, J.J., “Computer simulations of magnetic resonance imaging and spectroscopy,” MS Thesis, University of South Florida, Tampa, 1993.

[Hen93]

Hendrick, R.E., Haacke, E.M., “Basic Physics of MR contrast agents in the brain,” Journal of Magnetic Resonance Imaging, V3, p137-156, 1993.

[Hil93]

Hill, D.L.G., Hawkes, D.J., Hussain, Z., Green, S.E.M., Ruff, C.F., Robinson, G.P., “Accurate combination of CT and MR data of the head: Validation and applications in surgical and therapy planning,” Comput. Medical Imaging Graph., V17, p357-363, 1993.

[Hog65]

Hogg, R.V., Craig, A.T., Introduction to Mathematical Statistics, Macmillan, New York, 1965.

[Hu90]

Hu, X.P., Tan, K.K., Levin, D.N., Galhotra, S., Mullan, J.F., Hekmatpanah, J., Spire, J.P., “Three-dimensional magnetic resonance images of the brain: Application to neurosurgical planning,” Journal of Neurosurgery, V72, p433-440, 1990.

[Jac90]

Jack, C.R., Bentley, M.D., Twomey, C.K., Zinsmeister, A.R., “MR Imaging-based volume measurement of the hippocampal formation and anterior temporal lobe,” Radiology, V176, p205-209, 1990.

[Jac93]

Jackson, E.F., Narayana, P.A., Wolinsky, J.S., Doyle, T.J., “Accuracy and reproducibility in volumetric analysis of multiple sclerosis lesions,” Journal of Computer Assisted Tomography, V17, p200-205, 1993.

[Jai89]

Jain, A.K., Fundamentals of Digital Image Processing, Prentice Hall, Englewood Cliffs, NJ, 1989.

[Jay57]

Jaynes, E.T., “Information theory and statistical mechanics, I, II,” Physical Review, V106, p620-630, V108, p171-190, 1957.

[Kam93]

Kamada, K., Takeuchi, F., Kuriki, S., Oshiro, O., Houkin, K., Abe, H., “Functional neurosurgical simulation with brain surface magnetic resonance imaging and magnetoencephalography,” Neurosurgery, V33, p269-272, 1993.

[Kap92]

Kapur, J.N., Kesavan, H.K., Entropy Optimization Principles with Applications, Academic Press, Boston, 1992.

[Kap95]

Kapur, T., “Segmentation of brain tissue from magnetic resonance images,” Technical Report, MIT, Cambridge, 1995.

[Kar94]

Karhunen, J., “Optimization criteria and nonlinear PCA neural networks,” IEEE International Conference on Neural Networks, V2, p1241-1246, 1994.

[Kars90]

Karssemeijer, N., “A statistical method for automatic labeling of tissues in medical images,” Machine Vision and Applications, V3, p75-86, 1990.

[Kaz90]

Kazakos, D., Papantoni-Kazakos, P., Detection and Estimation, Computer Science Press, New York, 1990.

[Kir83]

Kirkpatrick S., Gelatt, C.D., Vecchi, M.P., "Optimization by simulated annealing," Science, V220, p671-680, 1983.

[Koh91]

Kohn, M.I., Tanna, N.K., Herman, G.T., Resnick, S.M., Mozley, P.D., Gur, R.E., Alavi, A., Zimmerman, R.A., Gur, R.C., “Analysis of brain and cerebrospinal fluid volumes with MR imaging: Part 1. Methods, reliability and validation,” Radiology, V178, p115-122, 1991.

[Koh95]

Kohonen, T., Self-Organizing Maps, Springer, New York, 1995.

[Kul59]

Kullback, S., Information Theory and Statistics, John Wiley, New York, 1959.

[Lia93]

Liang, Z., “Tissue classification and segmentation of MR images,” IEEE Eng. Med. Biol., V12, p81-85, 1993.

[Lia94]

Liang, Z., Macfall, J.R., “Parameter estimation and tissue segmentation from multispectral MR images,” IEEE Trans. on Med. Imag., V13, N3, p441-449, 1994.

[Lin80]

Linde, Y., Buzo, A., Gray, R.M., “An algorithm for vector quantizer design,” IEEE Transactions on Communications, V28, p84-95, 1980.

[Lin88]

Linsker, R., “Self-organization in a perceptual network,” Computer, p105-117, March 1988.

[Lip87]

Lippmann, R.P., “An introduction to computing with neural nets,” IEEE Magazine on Acoustics, Signal, and Speech Processing, V4, p4-22, 1987.

[Lip89]

Lippmann, R.P., “Review of neural networks for speech recognition,” Neural Computation, V1, p1-38, 1989.

[Mal73]

von der Malsburg, C., “Self-organization of orientation sensitive cells in the striate cortex,” Kybernetik, V14, p85-100, 1973.

[Mar80]

Marr, D., Hildreth, E., “Theory of edge detection,” Proceedings of the Royal Society of London, V207, p187-217, 1980.

[Mcc94]

McClain, K.J., Hazzle, J.D., “MR image selection for segmentation of the brain,” Journal of Magnetic Resonance Imaging, V4, 1994.

[Mcl88]

McLachlan, G.J., Basford, K.E., Mixture Models: Inference and Applications to Clustering, Marcel Dekker, Inc., New York, 1988.

[Mcl96]

McLachlan, G.J., Krishnan, T., The EM Algorithm and Extensions, John Wiley & Sons, Inc., New York, 1996.

[Mit94]

Mitchell, J.R., Karlik, S.J., Lee, D.H., Fenster, A., “Computer-assisted identification and quantification of multiple sclerosis lesions in MR imaging volumes in the brain,” Journal of Magnetic Resonance Imaging, V4, p197-208, 1994.

[Moo89]

Moore, B., “ART1 and pattern clustering,” Proceedings of the 1988 Connectionist Models Summer School, p174-185, 1989.

[Odo85]

O’Donnell, M., Edelstein, W.A., “NMR imaging in the presence of magnetic field inhomogeneities and gradient field nonlinearities,” Medical Physics, V12, p20-26, 1985.

[Oja82]

Oja, E., “A simplified neuron model as a principal component analyzer,” Journal of Mathematical Biology, V15, p267-273, 1982.

[Oja83]

Oja, E., “Neural networks, principal components, and subspaces,” International Journal of Neural Systems, V1, p61-68, 1983.

[Oja85]

Oja, E., Karhunen, J., “On stochastic approximation of the eigenvectors of the expectation of a random matrix,” Journal of Mathematical Analysis and Applications, V106, p69-84, 1985.

[Oja91]

Oja, E., “Data compression, feature extraction, and autoassociation in feedforward neural networks,” Artificial Neural Networks, V1, p737-745, 1991.

[Ozk93]

Ozkan, M., Dawant, B.M., “Neural-Network Based Segmentation of Multi-Modal Medical Images,” IEEE Transactions on Med. Imag., V12, N3, p534-544, 1993.

[Pal93]

Pal, N.R., Pal, S.K., “A review on image segmentation techniques,” Pattern Recognition, V26, N9, p1277-1294, 1993.

[Pao89]

Pao, Y.H., Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, Reading, MA, 1989.

[Pap65]

Papoulis, A., Probability, Random Variables and Stochastic Processes, McGraw-Hill, New York, 1965.

[Par62]

Parzen E., "On estimation of a probability density function and mode," Annals of Mathematical Statistics, V33, p1065-1076, 1962.

[Pec92]

Peck, D.J., Windham, J.P., Soltanian-Zadeh, H., Roebuck, J.R., “A fast and accurate algorithm for volume determination in MRI,” Medical Physics, V19, p599-605, 1992.

[Pel82]

Peli T., Malah, D., “A study of edge detection algorithms,” Computer Graphics and Image Processing, V20, p1-21, 1982.

[Pet93]

Peterson, J.S., Christoffersson, J.O., Golman, K., “MRI simulation using k-space formalism,” Magnetic Resonance Imaging, V11, p557-568, 1993.

[Phi95]

Phillips, W.E., Velthuizen, R.P., Phuphanich, S., Vilora, J., Hall, L.O., Clarke, L.P., Silbiger, M.L., “Application of fuzzy segmentation techniques for tissue differentiation in MR images of a hemorrhagic glioblastoma multiforme,” Magnetic Resonance Imaging, V13, p277-290, 1995.

[Pri90]

Principe, J.C., Euliano, N.R., Lefebvre, W.C., Neural and Adaptive Systems: Fundamentals Through Simulations, John Wiley & Sons, New York, 2000.

[Pri00]

Principe, J., Xu D., and Fisher, J., “Information theoretic learning,” in Unsupervised Adaptive Filtering, Ed. S. Haykin, John Wiley & Sons, New York, 2000.

[Puj93]

Pujol, J., Vendrell, P., Junque, C., Marti-Vilalta, J.L., Capdevila, A., “When does human brain development end? Evidence of corpus callosum growth up to adulthood,” Annals of Neurology, V34, p71-75, 1993.

[Raf90]

Raff, U., Newman, F.D., “Lesion detection in radiologic images using an autoassociative paradigm,” Medical Physics, V17, p926-928, 1990.

[Rei96]

Reiss A., Abrams M., Singer H., Ross J., Denckla M., “Brain development, gender and IQ in children: A volumetric imaging study,” Brain, V119, p1763-1774, 1996.

[Ren60]

Renyi A., "On measures of entropy and information," Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, p547-561, 1960.

[Rit87a]

Ritter, G.X., Wilson, J.N., “Image algebra: A unified approach to image processing,” SPIE Proc. on Medical Imaging, Newport Beach, CA, 1987.

[Rit87b]

Ritter, G.X., Shrader-Frechette, M., Wilson, J.N., “Image algebra: A rigorous and translucent way of expressing all image processing operations,” Proc. SPIE Southeastern Technical Symposium on Optics, Electro-Optics and Sensors, Orlando, FL, p116-121, 1987.

[Rit96]

Ritter, G.X., Wilson, J.N., Handbook of Computer Vision Algorithms in Image Algebra, CRC Press, Boca Raton, FL, 1996.

[Rub89]

Rubner, J., Tavan, P., “A self-organizing network for principal-component analysis,” Europhysics Letters, V10, p693-698, 1989.

[Rum85]

Rumelhart, D.E., Zipser, D., “Feature discovery by competitive learning,” Cognitive Science, V9, p75-112, 1985.

[Run89]

Runge, V.M., Enhanced Magnetic Resonance Imaging, C.V. Mosby Company, St. Louis, 1989.

[Rob94]

Robb, R.A., “Visualization methods for analysis of multimodality images,” Functional Neuroimaging: Technical Foundations, Academic Press, Orlando, FL, p181-190, 1994.

[Sah88]

Sahoo, P.K., Soltani, S., Wong, K.C., “A survey of thresholding techniques,” Computer Vision, Graphics and Image Processing, V41, p233-260, 1988.

[San96]

Sandor S. R., Leahy R. M., Timsari B., "Generating cortical constraints for MEG inverse procedures using MR volume data,” Proceedings BIOMAG96, Tenth International Conference on Biomagnetism, Santa Fe, New Mexico, Feb. 1996.

[San97]

Sandor, S., Leahy, R.M., "Surface-based labeling of cortical anatomy using a deformable database," IEEE Transactions on Medical Imaging, V16, N1, p41-54, February 1997.

[San89]

Sanger, T.D., “Optimal unsupervised learning in a single layer linear feedforward neural network,” Neural Networks, V2, p459-473, 1989.

[Sha48]

Shannon, C.E., “A mathematical theory of communication,” Bell System Technical Journal, V27, p379-423, 1948.

[Sha62]

Shannon, C.E., Weaver, W., The Mathematical Theory of Communication, The University of Illinois Press, Urbana, 1962.

[Sch94]

Schwartz, M.L., Sen, P.N., Mitra, P.P., “Simulations of pulsed field gradient spin-echo measurements in porous media,” Magnetic Resonance Imaging, V12, p241-244, 1994.

[Sum86]

Summers, R.M., Axel, L., Israel, S., “A computer simulation of nuclear magnetic resonance,” Magnetic Resonance Imaging, V3, p363-376, 1986.

[Tay93]

Taylor, J.G., Coombes, S., “Learning higher order correlations,” Neural Networks, V6, p423-427, 1993.

[Tju94]

Tjuvajev, J.G., Macapinlac, H.A., Daghighian, F., Scott, A.M., Ginos, J.Z., Finn, R.D., Kothari, P., Desai, R., Zhang, J., Beattie, B., Graham, M., Larson, S.M., Blasberg, R.G., “Imaging of brain tumor proliferative activity with Iodine-131-Iododeoxyuridine,” Journal of Nuclear Medicine, V35, p1407-1417, 1994.

[Vai00]

Vaidyanathan, M., Clarke, L.P., Velthuizen, R.P., Phuphanich, S., Bensaid, A.M., Hall, L.O., Silbiger, M.L., “Evaluation of MRI segmentation methods for tumor volume determination,” Magnetic Resonance Imaging (in press).

[Van88]

Vannier, M.W., Speidel, C.M., Rickman, D.L., “Magnetic resonance imaging multispectral tissue classification,” News Physiol. Sci., V3, p148-154, 1988.

[Van91a]

Vannier, M.W., Pilgram, T.K., Speidel, C.M., Neumann, L.R., “Validation of magnetic resonance imaging multispectral tissue classification,” Comput. Med. Imaging Graph., V15, p217-223, 1991.

[Van91b]

Van der Knaap, M.S., et al., “Myelination as an expression of the functional maturity of the brain,” Developmental Medicine and Child Neurology, V33, p849-857, 1991.

[Vel94]

Velthuizen, R.P., Clarke, L.P., “An interface for validation of MR image segmentations,” Proceedings of IEEE Engineering in Medicine and Biology Society, V16, p547-548, 1994.

[Wal92]

Wallace, C.J., Seland, T.P., Fong, T.C., “Multiple sclerosis: The impact of MR imaging,” AJR, V158, p849-857, 1992.

[Wan98]

Wang, Y., Adali, T., “Quantification and segmentation of brain tissues from MR images: A probabilistic neural network approach,” IEEE Trans. on Image Processing, V7, N8, p1165-1180, 1998.

[Wel96]

Wells, W.M., Grimson, W.E.L., “Adaptive segmentation of MRI data,” IEEE Trans. on Med. Imag., V15, N4, p429-442, 1996.

[Wit88]

Witkin, A., Kass, M., Terzopoulos, D., “Snakes: Active contour models,” International Journal of Computer Vision, V4, p321-331, 1988.

[Xu94]

Xu, L., “Theories of unsupervised learning: PCA and its nonlinear extensions,” IEEE International Conference on Neural Networks, V2, p1252-1257, 1994.

[Xu98]

Xu D., Principe J., “Learning from examples with quadratic mutual information,” Neural Networks for Signal Processing - Proceedings of the IEEE Workshop, IEEE, Piscataway, NJ, USA. p155-164, 1998.

[Zad92]

Zadeh, H.S., Windham, J.P., “A comparative analysis of several transformations for enhancement and segmentation of magnetic resonance image scene sequences,” IEEE Trans. on Med. Imag., V11, N3, p302-318, 1992.

[Zad96]

Zadeh, H.S., Windham, J.P., “Optimal linear transformation for MRI feature extraction,” IEEE Trans. on Med. Imag., V15, N6, p749-767, 1996.

[Zha90]

Zhang, J., “Multimodality imaging of brain structures for stereotactic surgery,” Radiology, V175, p435-441, 1990.

[Zij93]

Zijdenbos, A.P., Dawant, B.M., Margolin, R., “Measurement reliability and reproducibility in manual and semi-automatic MRI segmentation,” Proceedings of the IEEE-Engineering in Medicine and Biology Society, V15, p162-163, 1993.

BIOGRAPHICAL SKETCH

Erhan Gokcay was born in Istanbul, Turkey, on January 6, 1963. He attended “Istanbul Erkek Lisesi” High School in Istanbul, Turkey, from 1974 through 1981. In 1981, he began undergraduate studies at the Middle East Technical University in Ankara, Turkey, graduating with a B.S. in electrical and electronic engineering in 1986. In 1986, he began graduate studies at the Middle East Technical University, graduating with an M.S. degree in electrical and electronic engineering in 1991, and continued graduate studies toward a Ph.D. in computer engineering at the same university. In 1993 he was accepted to the University of Florida and continued his Ph.D. studies in the Computer and Information Sciences and Engineering Department at the University of Florida, in Gainesville, Florida. He worked as an application programmer at ASELSAN Military Electronics and Telecommunications Company, in Ankara, Turkey, and as a system programmer at STFA Enercom Computer Center, in Ankara, Turkey, from 1986 to 1990. He then worked as the technical manager at Tulip Computers in Ankara, Turkey, until 1991, where he was responsible for the installation and maintenance of the computer systems and for training the technical support team. He worked as a network administrator and hardware supervisor at Bilkent University, in Ankara, Turkey, from 1991 to 1993, where he was responsible for the installation, maintenance, and support of the computer systems and the campus-wide network. From 1993 until graduation, he worked as a system administrator in the CNEL and BME labs at the University of Florida, in Gainesville, Florida.

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

_________________________________________________
Jose C. Principe, Chairman
Professor of Electrical and Computer Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

_________________________________________________
John G. Harris
Associate Professor of Electrical and Computer Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

_________________________________________________
Christiana M. Leonard
Professor of Neuroscience

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

_________________________________________________
Joseph N. Wilson
Assistant Professor of Computer and Information Sciences and Engineering

I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.

_________________________________________________
William W. Edmonson
Assistant Professor of Electrical and Computer Engineering

This dissertation was submitted to the Graduate Faculty of the College of Engineering and to the Graduate School and was accepted as partial fulfillment of the requirements for the degree of Doctor of Philosophy.

August 2000

_________________________________________________
M. Jack Ohanian
Dean, College of Engineering

_________________________________________________
Winfred M. Phillips
Dean, Graduate School