COMPARISON OF NEURAL NETWORKS AND STATISTICAL METHODS IN MELANOMA CLASSIFICATION

H. Ganster, R. Rohrer, L. Paletta, T. Ebner, A. Pinz and M. Binder

Abstract: In this paper different strategies for classification are applied to automated melanoma recognition. First, features are selected with a statistical approach and then classified by a k-nearest-neighbour classifier and a multi-layer perceptron. A third classification is done with a multi-layer perceptron on the entire feature set using a pruning strategy. The results show that the performances achieved with both the neural net (74.5%) and the statistical approach (74.8%) are comparable to the recognition rates of dermatologists. Furthermore it is shown that pruning the neural net or selecting a feature subset impressively reduces the complexity of the classification process and improves the generalization behaviour by removing redundancy in the feature set.

1 Introduction

Melanoma is one of the most aggressive types of cancer. Since the curability of skin cancer by surgical excision is very high in its early stages, early recognition is of utmost importance. Therefore sophisticated techniques like epiluminescence microscopy (ELM [9]) are being investigated with computer-aided image analysis by several research groups [6, 5]. In the recent literature on melanoma recognition by image analysis, many publications on the segmentation of skin lesion images (e.g. [3, 7]) and on the design and calculation of features which describe the malignancy of a skin lesion (e.g. [4, 15]) can be found, but the evaluation of feature quality is almost neglected.

The study presented in this paper is a comparison between statistical and neural network approaches to classification. In the statistical approach features are selected, whereas in the neural network approach the network is pruned, in order to reduce the complexity of the classification process and thereby increase the classification performance through better generalization behaviour.

All ELM images used in the study were taken during routine checks at the Vienna General Hospital and have been labeled by dermatologists. By random choice 268 lesions were selected to form two equally sized classes. One class contains mainly atypical naevi and some early melanomas, while the other contains mostly common naevi and some nonmelanocytic lesions. After segmentation of the lesions, 33 radiometric and geometric features, developed in cooperation with dermatologists [5], were calculated.

In section 2 of this paper, the different approaches used in the study are introduced. The results of their application to skin lesion classification are discussed in section 3. Section 4 gives some conclusions on the results and final remarks on further work in the direction of automated analysis of skin lesions.

Affiliations: Institute for Computer Graphics and Vision, Technical University Graz, Münzgrabenstraße 11, A-8010 Graz, Austria, E-mail: [email protected]; Department of Dermatology, University of Vienna, Währinger Gürtel 18-20, A-1090 Vienna, Austria. This work was supported by the Austrian Science Foundation (FWF) under grant P11735-MED. The support of our medical partners from the Department of Dermatology at the Vienna General Hospital is gratefully acknowledged.

2 Methods

A fundamental problem of pattern recognition is the complexity of the classification, because a large number of features can be produced in the feature design process. In order to keep the complexity low, a reasonable number of useful features that approximate the properties of the pattern recognition task should be calculated, but no general theory exists for the design of such features. Nevertheless there exist methods to estimate their quality, and techniques that reduce the complexity by removing redundant information from the feature set. Various suboptimal algorithms, including sequential search procedures [2, 10], neural network techniques [1, 8] and genetic algorithms [14], have been suggested in order to increase the generalization ability and therefore the performance of the classification process.

In this paper we investigate three classification approaches. First, we perform feature subset selection by sequential forward floating selection (SFFS [2]) and classify the selected features with a k-nearest-neighbour (kNN) classifier. In the second approach the features selected by SFFS are used in a classification with a multi-layer perceptron (MLP [13]). The third approach is a classification with a neural network using optimal brain damage (OBD [1]) as a pruning strategy, starting from the entire feature set.

2.1 Feature selection with kNN classification

The first classification experiment uses SFFS [2] in order to derive a combination of features that correspond to different characteristics of the skin lesions. SFFS is a search method that uses alternating forward and backward search steps in order to find a nearly optimal feature subset. It starts with the empty set and in every forward step selects, from the entire feature set, the feature that performs best in combination with the already derived feature subset. In the backward step, the feature with the least significance is removed from the already derived subset. The method is called floating because the number of alternating forward and backward steps is derived dynamically from the properties of the feature subset (i.e. the recognition rate).
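The forward/backward loop just described, scored by the leave-one-out (LOO) recognition rate of a kNN classifier as in this study, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bookkeeping is simplified relative to [2], the toy data is synthetic, and k is reduced from the study's 17 so the example runs on a handful of samples.

```python
import numpy as np

def knn_loo_score(X, y, subset, k=3):
    """Leave-one-out recognition rate of a kNN classifier restricted
    to the feature columns in `subset` (the study uses k = 17; a
    smaller k keeps this toy example meaningful on tiny data)."""
    Xs = X[:, subset]
    correct = 0
    for i in range(len(y)):
        dist = np.linalg.norm(Xs - Xs[i], axis=1)
        dist[i] = np.inf                        # leave sample i out
        nearest = np.argsort(dist)[:k]          # k nearest neighbours
        votes = np.bincount(y[nearest], minlength=2)
        correct += int(np.argmax(votes) == y[i])
    return correct / len(y)

def sffs(n_features, evaluate, target_size):
    """Sequential forward floating selection (minimal sketch)."""
    subset, best_of_size = [], {}
    while len(subset) < target_size:
        # Forward step: add the feature that performs best together
        # with the already derived subset.
        remaining = [f for f in range(n_features) if f not in subset]
        subset.append(max(remaining, key=lambda f: evaluate(subset + [f])))
        best_of_size[len(subset)] = evaluate(subset)
        # Backward ("floating") steps: drop the least significant
        # feature while that beats the best subset of the smaller size.
        while len(subset) > 2:
            f_worst = max(subset, key=lambda f: evaluate([g for g in subset if g != f]))
            reduced = [g for g in subset if g != f_worst]
            score = evaluate(reduced)
            if score > best_of_size.get(len(reduced), float('-inf')):
                subset, best_of_size[len(reduced)] = reduced, score
            else:
                break
    return subset
```

On a toy problem where only the first two of three features separate the classes, `sffs(3, lambda s: knn_loo_score(X, y, s), 2)` recovers exactly those two features.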

The selected features are then used in a classification process with a kNN classifier. Since there is a relatively large overlap between the classes in the feature space, the number of neighbours needed is high. The value of k used in the experiments of the study is 17, but the actual number of neighbours used in the classifier has negligible influence on the achievable recognition rates if k is chosen in the interval from 15 to 35.

2.2 Neural network classification with selected features

The second classification approach uses a feed-forward multi-layer perceptron trained with error back-propagation. The inputs to the neural network are the features selected by the previous approach (four input units, cf. section 3.1). The network uses one hidden layer, whose number of units is determined by evaluating the classification performance. It turned out that the best recognition rates could be achieved using four hidden units (figure 3). The output layer consists of two nodes corresponding to the target classes, benign and malignant skin lesions.

2.3 Neural network using a pruning strategy
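Both this section and the previous one rely on a feed-forward MLP trained with error back-propagation on the sum-of-squares error. A minimal sketch of such a network, here with the 4-4-2 shape of section 2.2 (the training data is synthetic; in the study the four inputs would be the SFFS-selected features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Weights of a fully connected 4-4-2 MLP: four inputs (selected
# features), four hidden units, two outputs (benign / malignant).
W1 = rng.normal(0.0, 0.5, (4, 4)); b1 = np.zeros(4)
W2 = rng.normal(0.0, 0.5, (4, 2)); b2 = np.zeros(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X):
    h = np.tanh(X @ W1 + b1)        # hidden-layer activations
    return h, sigmoid(h @ W2 + b2)  # output-layer activations

def train_step(X, T, lr=0.01):
    """One batch of error back-propagation on the sum-of-squares error."""
    global W1, b1, W2, b2
    h, o = forward(X)
    delta_o = (o - T) * o * (1.0 - o)          # output-layer error signal
    delta_h = (delta_o @ W2.T) * (1.0 - h**2)  # back-propagated to hidden layer
    W2 -= lr * h.T @ delta_o; b2 -= lr * delta_o.sum(axis=0)
    W1 -= lr * X.T @ delta_h; b1 -= lr * delta_h.sum(axis=0)
    return 0.5 * np.sum((o - T) ** 2)

# Synthetic two-class problem in four dimensions.
X = rng.normal(size=(40, 4))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)
T = np.eye(2)[labels]  # one-hot targets

errors = [train_step(X, T) for _ in range(500)]
```

With a small learning rate the training error decreases steadily; the study additionally averages several random initializations, which this sketch omits.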

The neural network applied here is an adaptation of the approach by Hintz-Madsen et al. [8], where a feed-forward multi-layer perceptron is trained by error back-propagation. In order to reduce the complexity and improve the generalization of the classifier, OBD [1] is used as a pruning strategy. By this method the least significant edges of the architecture are iteratively removed until the classifier's performance ceases to increase. The initial network consists of an F-N-C architecture, where F denotes the cardinality of the complete feature set (33), representing the input units, and C is the number of class labels (2), i.e. output units. First the number of hidden units (N) of the fully connected network is reduced until it attains the maximum recognition rate. Then OBD is applied to provide a refined pruning strategy. On the basis of a trained MLP, it evaluates the saliency s_i of each free parameter in the network, i.e. each weight w_i [1]. The weights are sorted by saliency and the most irrelevant edges are removed. The network is retrained until a termination criterion is
(1) Detailed descriptions of several statistical methods for feature selection and a comparison on the melanoma application can be found in [12, 11].

Figure 1: Results of the feature selection obtained by application of the SFFS algorithm. As separability criterion the LOO performance of a 17-NN classifier was used; the behaviour on the test data was calculated by ten-fold cross-validation (spline-interpolated curves in the plot). The x-axis indicates the subset size; the y-axis shows the recognition rate. One marker series in the plot represents the performance of the best subset found by SFFS for a given subset size; the other shows the corresponding recognition rate on the test data.

satisfied. Considering the change in the error function E due to small changes in the weight values w_i + δw_i, the saliency s_i is determined by

    s_i = (1/2) H_ii w_i^2 ,        (1)

where H_ii are the diagonal terms of the Hessian of the architecture's free parameters, and E denotes the sum-of-squares error function. The results (section 3.3) confirm the efficiency of the pruning strategy by a significant increase in the classification performance.
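Equation (1) translates directly into a pruning step. A sketch, under the assumption that the diagonal Hessian terms H_ii have already been computed (e.g. by a second back-propagation pass as in [1]):

```python
import numpy as np

def obd_prune(weights, diag_hessian, n_remove):
    """Zero out the n_remove least salient weights of one layer,
    ranked by the OBD saliency s_i = 0.5 * H_ii * w_i**2 (eq. 1)."""
    w = weights.ravel()
    saliency = 0.5 * diag_hessian.ravel() * w ** 2
    prune_idx = np.argsort(saliency)[:n_remove]  # least significant edges
    pruned = w.copy()
    pruned[prune_idx] = 0.0
    return pruned.reshape(weights.shape)
```

For example, with a unit Hessian diagonal the two smallest-magnitude weights are the least salient and are removed first; in the full procedure the network is retrained after each such pruning step.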

3 Results

3.1 Feature selection with kNN classification

In the first experiment SFFS was applied in order to determine the optimum number of features with respect to the achievable performance (i.e. recognition rate), which is measured as the leave-one-out (LOO) estimate of a kNN classifier. The resulting performances on the feature subsets, evaluated on independent test data using ten-fold cross-validation (XVAL), are presented in figure 1. The results indicate that a recognition rate of 71.1% is achieved with four features, which

Figure 2: Resulting feature subsets of size four after the application of the SFFS algorithm, using the performance of a kNN classifier as separability criterion. The dashed bars indicate the four features which were selected using the whole dataset. The solid bars show the fraction of presence over the 10 XVAL runs for each feature (y-axis: fraction of presence, 0 to 1; x-axis: feature index).

does not increase above 75% even when many more features are included. The best performance of 74.8% was achieved at a subset size of 18. The choice of just four features is preferable, because the computational complexity of both the feature calculation step and the classifier itself decreases. For this subset size even a brute-force search is applicable. The best subset found by brute force yields a performance of 77.5% in feature selection, but on the independent test data its recognition rate is only 70.0%, which is lower than the performance of the suboptimal algorithm. Due to the XVAL procedure the SFFS was applied ten times to slightly different datasets, and therefore not always the identical feature subset was selected. This is illustrated in a bar diagram (figure 2) which contains, for each feature, the fraction of how often it was selected over the total number of XVAL runs. One can clearly see that feature 20 (value minimum) was chosen in almost every simulation, but for all other features the results are rather diverse. By grouping features with similar properties (e.g. number of spots and hue variance represent the `speckleness' of a lesion) it was found that in each case at least three different groups were chosen. The four most often selected features (value minimum, number of spots, polar variance, and saturation mean) describe the different aspects darkness, `speckleness', asymmetry, and colouring, which are known to be of great importance in medical diagnosis and coincide with ELM criteria.
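For subsets of only four features, the brute-force search mentioned above amounts to a direct enumeration of all candidate subsets. A sketch, with `evaluate` again standing in for the selection criterion (C(33, 4) = 40920 subsets, which is entirely feasible):

```python
from itertools import combinations

def brute_force_best_subset(features, evaluate, size=4):
    """Exhaustively score every feature subset of the given size and
    return the best one (ties resolved by enumeration order)."""
    return max((list(c) for c in combinations(features, size)),
               key=evaluate)
```

As the results above show, the exhaustively optimal subset on the selection criterion is not necessarily the best on independent test data, which is why the SFFS result is retained.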

(2) We refer to [5] for a detailed description of all 33 features and their calculation.
(3) ELM criteria: criteria which are used by dermatologists to diagnose epiluminescence microscopic images of skin lesions.
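The ten-fold cross-validation (XVAL) estimate used for all test-data performances in this section can be sketched as follows; `fit_predict` is a hypothetical wrapper for any of the classifiers in this study (kNN or MLP), not an interface from the paper:

```python
import numpy as np

def xval_recognition_rate(X, y, fit_predict, n_folds=10, seed=0):
    """Ten-fold cross-validated recognition rate: the fraction of
    correctly classified samples, pooled over all held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    correct = 0
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        preds = fit_predict(X[train], y[train], X[test])
        correct += int(np.sum(preds == y[test]))
    return correct / len(y)
```

Every sample is thus held out exactly once, so the rate is computed on data unseen by the classifier that predicted it.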

Figure 3: Classification performance under variation of the number of units in the hidden layer of the MLP (x-axis: number of hidden units, 2 to 14; y-axis: classification performance).

3.2 Neural network classification based on selected features

The neural network used in the second classification experiment was a fully connected MLP with four input units, representing the features selected in the previous experiment (cf. section 3.1), and two output units. The number of units in the hidden layer was derived by evaluating the classification performance, which was estimated by 10-fold XVAL, averaging 10 runs with different initializations in each XVAL step. The best overall recognition rate that could be achieved in this experiment was 71.3%, using four hidden units (figure 3).

3.3 Neural network with pruning strategy

The third approach started with a fully connected MLP having 33 input units and two output units. The number of units in the hidden layer was varied to find the architecture with the best recognition rate; in a series of experiments, three hidden units turned out to provide maximum performance. In order to reduce the complexity of the network we applied OBD to remove the least significant edges from the network. Since the data set is too small for generating independent training and test sets, XVAL was applied for the performance estimation on new data, as in the experiments described above. The parameters of the network were randomly initialized for several (5) runs and averaged over all XVAL steps. Figure 4 depicts the classification performance at different stages of the pruning process. Specifically, two curves of equal hidden layer complexity demonstrate the variance in the results, whereas the performance attains approximately the same maximum. It can be seen that the XVAL performance increases up to a removal of about 60 edges of the MLP. The best recognition rate was achieved after deletion of 54 edges. The connectionist architecture with the best classification performance is

Figure 4: Averaged recognition rates at several stages of the pruning process (x-axis: number of pruned edges; y-axis: classification performance). The classification performance increases with progressive pruning until the classifier falls short of sufficient complexity. Starting OBD on an optimized, fully connected architecture (a, 3 hidden units) outperforms a strategy starting with a more complex network (b, 4 hidden units). (a) depicts two runs to demonstrate the variance in the results.

Figure 5: Resulting network after pruning (54 edges have been removed).

depicted in figure 5; it consists of only 51 of the 105 edges of the fully connected network. Instead of first optimizing the hidden layer of the fully connected network, one could start the pruning process with a higher number of hidden units and leave the removal of these units to the OBD method. In our experiments this strategy turned out to give inferior results (figure 4) at a slightly increased classifier complexity. Attempts to apply OBD to a larger

architecture failed due to excessive computational costs. In the resulting optimized network, 6 of the 33 features have been omitted. The recognition rate of the network, calculated with 10 XVAL steps and averaging over 5 runs with different initializations in every step, is 74.5%.

4 Conclusion

In the present work the relevance of the 33 features developed by Ganster [5] in cooperation with dermatologists is investigated. Through the application of feature selection, the number of features essential for classification was optimized. The first experiment has shown that a very small subset of four well-selected features is sufficient to cover the major part of the discriminatory information provided by all features. Thus a significant reduction of the expense needed for feature calculation is achieved and redundancies are eliminated. At the same time the classifier complexity is reduced, which leads to better generalization behaviour. The subset found to be best consists of features that describe different properties relevant to the malignancy of skin lesions; a relation between these features and indicators used by dermatologists has been established [12]. The recognition rates for the selected features are about 71.1% for the kNN classifier and slightly higher (71.3%) for the neural network classification. The highest recognition rates obtained with the statistical approach (74.8%) and with the neural network (74.5%) are very promising and comparable to the rate of dermatologists. In our application the statistical approach is preferable, because the computational costs are higher for the neural network due to the multiple random initializations and the averaging process.

In conclusion, a careful selection of features or pruning of a neural network impressively reduces the redundancy in the feature set and the complexity of the classification process. Due to the reduced number of parameters in the classification process, the generalization behaviour of both classifiers is improved, which can be seen in the higher recognition rate of smaller subsets on the one hand and of the reduced network on the other.
In the near future an enlarged dataset (more than 3000 records) will be available, and therefore the very costly procedures of XVAL and LOO performance estimation can be replaced by classifications with separate training and test sets. Furthermore, the generalization ability and the classification performance on test data are expected to increase.

References

[1] Y. Le Cun, J. Denker, and S. A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2, pages 598-605. Morgan Kaufmann Publishers, San Mateo, CA, 1990.
[2] Pierre A. Devijver and Josef Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall International, Inc., London, 1982.
[3] Atam P. Dhawan and Anne Sicsu. Segmentation of images of skin lesions using color and texture information of surface pigmentation. Computerized Medical Imaging and Graphics, 16(3):163-177, 1992.
[4] K.-H. Franke, F. Gaßmann, and R. Gergs. Früherkennung von Hautkrebs durch Farbbildverarbeitung. In S. J. Pöppl and H. Handels, editors, Mustererkennung 1993. Springer Verlag, 1993.
[5] H. Ganster, M. Gelautz, A. Pinz, M. Binder, H. Pehamberger, M. Bammer, and J. Krocza. Initial results of automated melanoma recognition. In Gunilla Borgefors, editor, Theory and Applications of Image Analysis II, pages 343-354. World Scientific Publishing Co. Pte. Ltd., 1995.
[6] Adele Green, Nicholas Martin, John Pfitzner, Michael O'Rourke, and Ngaire Knight. Computer image analysis in the diagnosis of melanoma. Journal of the American Academy of Dermatology, 31(6):958-964, 1994.
[7] Gregory A. Hance, Scott E. Umbaugh, Randy H. Moss, and William V. Stoecker. Unsupervised color image segmentation. IEEE Engineering in Medicine and Biology, 15(1):104-111, 1996.
[8] Mads Hintz-Madsen, Lars Kai Hansen, Jan Larsen, Eric Olesen, and Krzysztof T. Drzewiecki. Detection of malignant melanoma using neural classifiers. In Proceedings of the International Conference on Engineering Applications of Neural Networks (EANN'96), London, UK, June 1996.
[9] Hubert Pehamberger, Michael Binder, Andreas Steiner, and Klaus Wolff. In vivo epiluminescence microscopy: Improvement of early diagnosis of melanoma. The Journal of Investigative Dermatology, 100(3):356S-362S, March 1993. Supplement.
[10] P. Pudil, J. Novovicova, and J. Kittler. Floating search methods in feature selection. Pattern Recognition Letters, 15:1119-1125, November 1994.
[11] Reinhard Rohrer. Die Qualität bildanalytischer Merkmale zur Melanomerkennung. Master's thesis, Institute for Computer Graphics and Vision, Technical University Graz, Austria, July 1997.
[12] Reinhard Rohrer, Harald Ganster, and Axel Pinz. Feature selection in melanoma recognition. To appear in Proceedings of ICPR'98.
[13] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533-536, 9 October 1986.
[14] W. Siedlecki and J. Sklansky. On automatic feature selection. International Journal of Pattern Recognition and Artificial Intelligence, 2:197-220, 1988.
[15] Hirotsugu Takiwaki, Shiro Shirai, Yasumori Watanabe, Kohji Nakagawa, and Seiji Arase. A rudimentary system for automatic discrimination among lesions on the basis of color analysis of video images. Journal of the American Academy of Dermatology, 32(4):600-604, 1995.