Investigation of 3D textural features' discriminating ability in diffuse lung disease quantification in MDCT

I. Mariolis, P. Korfiatis, L. Costaridou
Department of Medical Physics, University of Patras, Patras, Greece
[email protected]

Abstract—A current trend in lung CT image analysis is Computer Aided Diagnosis (CAD) schemes aiming at quantification of Diffuse Lung Disease (DLD) patterns. The majority of such schemes exploit textural features combined with supervised classification algorithms. In this direction, several 3D texture feature sets have been proposed. However, their discriminating ability has not been systematically evaluated, either in terms of individual feature sets or in conjunction with different classifiers. In this paper, four classification settings are evaluated in combination with the RLE feature set, commonly used in the literature, and the Laws feature set, employed here for the first time for DLD characterization. Furthermore, the combination of RLE and Laws features is examined using the same classification settings. Although both the RLE and Laws feature sets presented high discriminative ability for all classifiers considered (classification accuracy > 96.5%), their combination achieved even better results, yielding classification accuracy above 98.6%.

Keywords—Diffuse Lung Disease; Computed Tomography; 3D Texture; Laws Features; Run Length; Supervised Classification

I. INTRODUCTION

Diffuse Lung Disease (DLD), accounting for 15% of respiratory practice, represents a large and heterogeneous group of disorders primarily affecting the lung parenchyma [1], which can potentially lead to respiratory failure if therapy fails [2]. Computed Tomography (CT) is the modality of choice for the diagnosis of DLD and for the prediction of response to treatment. DLD is radiologically manifested as texture alteration of the lung parenchyma, and its clinical diagnosis in CT is based on assessment of the textural pattern type, extent and distribution within the lung. Whereas High Resolution CT (HRCT) protocols sample only 10% of the lung volume, volumetric Multi-Detector CT (MDCT) protocols, capable of capturing the entire lung volume, are emerging, lending themselves to computer-aided characterization and quantification of the entire extent of DLD. Quantification of DLD patterns by radiologists is characterized by high inter- and intra-observer variability, due to the lack of standardized criteria, and is further challenged by the volume of image data reviewed [1],[3]. Thus, Computer Aided Diagnosis (CAD) schemes for identification and characterization, i.e. quantification, are a necessity.

C. Kalogeropoulou¹, D. Daoussis², T. Petsas¹
¹Department of Radiology, ²Department of Internal Medicine
University Hospital of Patras, Patras, Greece

Proposed systems to date exploit supervised textural pattern classification [3]. Xu et al. [4] used 3D co-occurrence, run-length encoding (RLE), fractal and first order statistics features, combined with two classifiers (support vector machine and Bayesian), to differentiate between five lung tissue types in MDCT datasets. Zavaletta et al. [1] proposed a scheme based on histogram signatures extracted from lung Volumes Of Interest (VOIs), using the Earth Mover's Distance (EMD) similarity metric. Recently, Korfiatis et al. [5] exploited 3D co-occurrence features to differentiate between three lung tissue types in MDCT datasets using a k-Nearest Neighbor (k-NN) classifier. Although several 3D texture feature sets have been proposed, their exploitation is suboptimal, since the performance and robustness of individual feature sets, as well as of their combinations, in conjunction with different classifiers, have not been systematically evaluated. The aim of this paper is to investigate the discriminating ability of a commonly used 3D texture feature set, RLE, regarding DLD pattern differentiation in lung MDCT, and to compare it against the Laws 3D texture feature set, exploited for this task for the first time in this study. Classification into four different types of lung parenchyma patterns (normal, ground glass, reticular and honeycombing) is considered, exploiting k-NN, probabilistic neural network (PNN), Bayesian and multinomial logistic regression (MLR) multiclass classification schemes. In addition, the generalization ability of each classification scheme is investigated by means of 10-fold cross validation, while the statistical significance of differences between the estimated classification accuracies is examined using one-way ANOVA and Tukey's honestly significant difference criterion.
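The 10-fold cross-validation protocol mentioned above can be sketched in a few lines. The following is a generic, illustrative implementation; the function names, the shuffling seed and the fold-assignment details are assumptions, not the authors' code:

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Split sample indices into k disjoint folds for cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # distribute indices round-robin so fold sizes differ by at most one
    return [idx[f::k] for f in range(k)]

def cross_validate(n_samples, evaluate, k=10):
    """evaluate(train_idx, test_idx) -> accuracy; returns per-fold accuracies."""
    folds = k_fold_indices(n_samples, k)
    accs = []
    for f in range(k):
        test_idx = folds[f]
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        accs.append(evaluate(train_idx, test_idx))
    return accs
```

Each VOI is thus used exactly once for testing, and the per-fold accuracies feed the statistical comparisons described in the Results section.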

II. MATERIALS AND METHODS

A. Dataset
A pilot clinical case sample was acquired, consisting of 30 MDCT scans corresponding to 5 normal patients and 25 patients diagnosed with interstitial pneumonia (IP) secondary to connective tissue diseases, radiologically manifested with ground glass, reticular and honeycombing patterns (Fig. 1). MDCT scans were obtained with a multislice (16×) CT scanner (LightSpeed, GE) in the Department of Radiology at the University Hospital of Patras, Greece. Acquisition parameters of tube voltage, tube current and slice thickness were 140 kVp, 300 mA and 1.25 mm, respectively. The image matrix size was 512×512 pixels, with an average pixel size of 0.89 mm. The MDCT scans were used to extract VOIs for training the classifiers employed for IP pattern identification and characterization. These sets consisted of 1173 cubic VOIs of variable size, defined by an expert radiologist using a home-developed graphical user interface, representing patterns corresponding to reticular (458), ground glass opacities (195), honeycombing (249) and normal lung parenchyma (271).
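The cubic VOI sampling described above can be illustrated as follows. `extract_voi` is a hypothetical helper operating on a volume stored as nested lists, not the home-developed tool used in the study:

```python
def extract_voi(volume, center, vs):
    """Crop a cubic VOI of odd side vs centered at (z, y, x) from a
    3D volume stored as nested lists volume[z][y][x]."""
    r = vs // 2
    z, y, x = center
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    if not (r <= z < nz - r and r <= y < ny - r and r <= x < nx - r):
        raise ValueError("VOI exceeds volume bounds")
    # slice the cube plane by plane, row by row
    return [[row[x - r:x + r + 1] for row in plane[y - r:y + r + 1]]
            for plane in volume[z - r:z + r + 1]]
```

In practice each expert-defined VOI would be cropped this way from the segmented lung volume before texture features are computed.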

Fig. 1 CT appearance of (a) normal lung parenchyma, and DLD patterns: (b) ground glass, (c) reticular, (d) honeycombing.

B. Methods
1) Preprocessing: Prior to VOI selection, the lung parenchyma was segmented by a lung field segmentation algorithm adapted to DLD presence, followed by a vessel segmentation algorithm [5].
2) Texture analysis
a) 3D run length (RLE) feature set: Run length statistics capture the coarseness of texture in a specified direction. A run is defined as a string of consecutive voxels having the same gray level intensity along a given orientation. For a given VOI, a run length matrix is defined as follows: P(i,j) represents the number of runs with voxels of gray level intensity equal to i and run length equal to j along a given direction (13 directions considered). For each run length matrix, 11 features were calculated [6]. The mean and range of each feature over the 13 run length matrices (corresponding to the 13 directions) were calculated, comprising a total of 26 run length-based features.
b) Laws feature set: Textural features were extracted based on the method proposed by Laws [7]. According to this approach, textural features are extracted from VOIs that have been filtered by each one of the 125 3D Laws masks. The filtered volumes are characterized as texture energy volumes. From each of the 125 volumes, 5 first order statistics features were extracted, resulting in 625 texture features.
3) Feature selection: A statistical approach, stepwise discriminant analysis (SDA) [8], is employed to select the most discriminant features, reducing the dimensionality of each feature set considered.
4) Classification: Four different multiclass classifiers were used to assign to each lung parenchyma (LP) voxel a label of normal, ground glass, reticular or honeycombing, using the selected textural features as inputs. The features were scaled to zero mean and unit variance.
a) Probabilistic Neural Network (PNN) [9]: PNNs perform non-parametric classification by implementing the Parzen-window method. In the current study, the selected window functions follow normal distributions with different means but the same standard deviation σ for every training pattern. The value of σ was empirically defined.
b) k-Nearest Neighbor classifier (k-NN) [10]: Nearest Neighbor is a widely used non-parametric classifier, very similar to PNN. However, as opposed to PNN, it uses a window of varying size to approximate the densities involved. According to the k-NN rule, an unknown pattern is assigned to the majority class of its k nearest neighbors. In this work, the Euclidean metric was used as the distance function assessing the proximity of the neighbors, while the value of k was empirically defined during validation of the k-NN.
c) Bayesian classifier [11]: Bayesian decision theory finds the optimal decision boundary for a classification task based on the prior probabilities of each class and predefined class-conditional densities. In this study, multivariate normal class-conditional densities were considered, and the parameters of these densities were estimated by maximum likelihood. For the rest of this paper, this classification setting is denoted as the Normal Density Discriminant Functions (NDDF) classifier.
d) Multinomial logistic regression (MLR) [12]: When applied to pattern recognition problems, the output of an MLR model can be interpreted as an a-posteriori estimate of the probability that a given pattern belongs to each of the classes. The model's parameters, i.e. the regression coefficients used to produce the probability estimates, are typically estimated by means of maximum likelihood or Bayesian techniques.
All four classifiers are trained using supervised learning techniques, and every classifier has been tested using three different feature sets. The first two feature sets consist of the aforementioned RLE and Laws features, while the third set is the result of merging them into the unified RLE+Laws feature set. Thus, on the whole, twelve different CAD schemes have been examined, corresponding to every possible combination of the considered feature sets and classifiers.
5) Parameter selection: VOI size (VS) and the number of available gray levels in a VOI (NL) are parameters affecting CAD system performance. The selection of these parameters was based on the analysis of the overall accuracy of the systems, using the leave-one-out validation method. Namely, a grid search is performed, and the set of parameters corresponding to the highest overall accuracy is selected for each CAD scheme. In the case of the PNN classifier, the parameter σ is also determined via grid search, and the same applies to the number of neighbors k in the case of the k-NN classifier. Thus, the grid search determines two different types of parameters: VS and NL, used in the data selection step, and σ and k, used in the classification step. All parameters are determined simultaneously by a unified grid search, which yields the optimal parameters corresponding to the highest overall accuracy for each of the twelve CAD schemes considered. The grid considered 12 different VOI sizes, VS ∈ {11, 13, …, 33}, and 4 different gray level binnings, NL ∈ {32, 64, 128, 256}. For PNN, 19 different values of σ were considered, σ ∈ {1, 0.9, …, 0.1, 0.09, …, 0.01}, while in the case of the k-NN classifier, 16 different k values were examined, k ∈ {1, 3, …, 31}. Notice that k was allowed to take only odd values, in order to avoid tie votes.
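The run-length statistics described in the methods above can be sketched for a single direction. This is an illustrative pure-Python version, not the study's implementation: the study aggregates 13 directional matrices and computes the 11 features of [6], of which only Short Run Emphasis is shown here. The VOI is assumed already quantized to NL gray levels:

```python
from collections import defaultdict

def run_length_matrix(voi, direction):
    """Count gray-level runs in a quantized 3D VOI (nested lists voi[z][y][x])
    along one direction, e.g. (0, 0, 1); returns {(gray, length): count}.
    The first nonzero component of `direction` must be positive, so that the
    z/y/x scan order always reaches the start of a run first."""
    nz, ny, nx = len(voi), len(voi[0]), len(voi[0][0])
    dz, dy, dx = direction
    counted = set()          # voxels already absorbed into a run
    P = defaultdict(int)
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if (z, y, x) in counted:
                    continue
                g, length = voi[z][y][x], 1
                zz, yy, xx = z + dz, y + dy, x + dx
                while (0 <= zz < nz and 0 <= yy < ny and 0 <= xx < nx
                       and voi[zz][yy][xx] == g):
                    counted.add((zz, yy, xx))
                    length += 1
                    zz, yy, xx = zz + dz, yy + dy, xx + dx
                P[(g, length)] += 1
    return dict(P)

def short_run_emphasis(P):
    """SRE, one of the 11 run-length features of Galloway [6]."""
    n_runs = sum(P.values())
    return sum(c / (l * l) for (_, l), c in P.items()) / n_runs
```

Repeating this over the 13 canonical 3D directions and taking the mean and range of each feature yields the 26 RLE features used in the study.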

III. RESULTS

TABLE I. GRID SEARCH RESULTS: OPTIMAL PARAMETERS FOR EACH CAD SCHEME (VS, NL: DATA SELECTION PARAMETERS; σ, k: CLASSIFICATION PARAMETERS)

Classifier  Feature set   VS   NL    σ      k
PNN         RLE           29   64    0.06   -
PNN         Laws          33   64    0.06   -
PNN         RLE+Laws      33   256   0.09   -
k-NN        RLE           29   64    -      1
k-NN        Laws          29   32    -      1
k-NN        RLE+Laws      33   32    -      1
NDDF        RLE           29   128   -      -
NDDF        Laws          25   256   -      -
NDDF        RLE+Laws      29   32    -      -
MLR         RLE           25   64    -      -
MLR         Laws          31   32    -      -
MLR         RLE+Laws      33   64    -      -

The parameters of the twelve CAD schemes determined by the aforementioned grid search procedure are presented in Table I. The schemes were evaluated further using 10-fold cross validation [13]. The corresponding classification results are presented in Table II, where the mean overall Classification Accuracy (CA) over the folds for each CAD scheme is accompanied by the estimated standard deviation, followed by the sensitivity and specificity of each class. In order to compare the performance of the four classifiers, one-way ANOVA was performed separately for each feature set using the results of the 10-fold cross validation. In the case of the RLE features, no statistically significant difference in the accuracy of the four classifiers was found at the 95% confidence level. In the case of the Laws and RLE+Laws feature sets, statistically significant differences were detected, and a further multiple comparison was performed by applying Tukey's honestly significant difference criterion [14]. The results for the 95% confidence level are presented in Table III, where Difference in % CA denotes the difference between Classifier 1's mean CA % and Classifier 2's mean CA %, and the lower and upper bounds of the estimated confidence intervals are given in the adjacent columns.

TABLE III. STATISTICALLY SIGNIFICANT DIFFERENCES AMONG DIFFERENT CLASSIFIERS UTILIZING THE SAME FEATURES (95% CONFIDENCE INTERVAL)

Feature set  Classifier 1  Classifier 2  Lower  Difference in % CA  Upper
Laws         PNN           NDDF          0.89   2.90                4.91
Laws         k-NN          NDDF          0.43   2.44                4.45
Laws         PNN           MLR           0.05   1.75                3.45
RLE+Laws     PNN           MLR           0.51   1.47                2.42
RLE+Laws     k-NN          MLR           0.51   1.47                2.42
RLE+Laws     NDDF          MLR           0.28   1.24                2.19
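The one-way ANOVA underlying Tables III and IV reduces to computing an F statistic from the per-fold accuracies of the compared schemes. A minimal pure-Python sketch (the numbers in the usage below are synthetic, not the study's data):

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over per-fold accuracy lists,
    one list per classifier (group sizes may differ)."""
    k = len(groups)                                  # number of groups
    n = sum(len(g) for g in groups)                  # total observations
    grand = sum(sum(g) for g in groups) / n          # grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    return (ss_between / df_between) / (ss_within / df_within)
```

Significance is then judged against the F distribution with (k-1, n-k) degrees of freedom; Tukey's honestly significant difference criterion additionally relies on the studentized range distribution, which is omitted from this sketch.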

TABLE II. OVERALL ACCURACY AND CLASS-SPECIFIC SENSITIVITY AND SPECIFICITY OF THE TWELVE CAD SCHEMES (% MEAN ±S.D.; CLASS ORDER IN SENSITIVITY/SPECIFICITY COLUMNS: NORMAL / GROUND GLASS / RETICULAR / HONEYCOMBING)

Classifier  Feature set  Selected features  Overall accuracy  Sensitivity                                      Specificity
PNN         RLE          9                  99.1 ±1.05        98.2 ±3.1 / 98.3 ±3.2 / 99.9 ±0.3 / 100 ±0.0    99.5 ±0.9 / 99.3 ±1.0 / 100 ±0.0 / 99.5 ±1.4
PNN         Laws         15                 98.9 ±1.17        97.2 ±4.1 / 98.9 ±2.5 / 99.6 ±1.3 / 100 ±0.0    99.7 ±0.7 / 99.0 ±1.2 / 99.9 ±0.1 / 99.7 ±0.7
PNN         RLE+Laws     24                 99.8 ±0.50        100 ±0.0 / 99.2 ±2.3 / 99.9 ±0.4 / 100 ±0.0     100 ±0.0 / 100 ±0.1 / 99.8 ±0.6 / 100 ±0.0
k-NN        RLE          9                  98.6 ±1.25        98.2 ±3.1 / 98.3 ±3.2 / 99.9 ±0.3 / 100 ±0.0    99.5 ±0.9 / 99.3 ±1.0 / 100 ±0.0 / 99.5 ±1.4
k-NN        Laws         15                 99.5 ±0.78        97.2 ±4.1 / 98.9 ±2.5 / 99.6 ±1.3 / 100 ±0.0    99.7 ±0.7 / 99.0 ±1.2 / 99.9 ±0.1 / 99.7 ±0.7
k-NN        RLE+Laws     24                 99.8 ±0.58        100 ±0.0 / 99.2 ±2.3 / 99.9 ±0.4 / 100 ±0.0     100 ±0.0 / 100 ±0.1 / 99.8 ±0.6 / 100 ±0.0
NDDF        RLE          13                 97.7 ±1.62        99.9 ±0.4 / 92.9 ±5.7 / 95.5 ±4.8 / 100 ±0.0    100 ±0.0 / 98.8 ±1.3 / 98.1 ±1.5 / 99.9 ±0.3
NDDF        Laws         12                 96.5 ±1.91        98.3 ±2.7 / 95.1 ±5.4 / 96.3 ±4.8 / 99.9 ±0.4   99.1 ±1.2 / 99.1 ±1.3 / 98.9 ±1.3 / 100 ±0.1
NDDF        RLE+Laws     25                 99.6 ±0.74        99.7 ±0.9 / 96.6 ±4.4 / 97.2 ±4.0 / 99.7 ±0.9   99.5 ±1.0 / 99.5 ±0.9 / 99.4 ±1.0 / 99.7 ±0.7
MLR         RLE          13                 97.5 ±1.63        96.0 ±3.6 / 96.8 ±4.1 / 96.1 ±4.3 / 99.1 ±1.8   99.8 ±0.6 / 99.0 ±1.2 / 97.3 ±1.8 / 100 ±0.0
MLR         Laws         14                 97.6 ±1.77        99.5 ±1.4 / 96.1 ±5.2 / 98.9 ±2.8 / 99.5 ±1.4   99.9 ±0.3 / 99.7 ±0.7 / 98.8 ±1.5 / 100 ±0.0
MLR         RLE+Laws     27                 98.6 ±1.19        99.9 ±0.3 / 99.9 ±0.3 / 99.1 ±2.4 / 100 ±0.0    100 ±0.1 / 99.8 ±0.6 / 99.9 ±0.2 / 100 ±0.0
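The class-specific sensitivity and specificity figures reported in Table II are one-vs-rest quantities. A minimal sketch of how such values are computed from predicted and reference labels (illustrative code with hypothetical label names, not the authors' implementation):

```python
def class_sensitivity_specificity(y_true, y_pred, cls):
    """One-vs-rest sensitivity and specificity for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p != cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    return tp / (tp + fn), tn / (tn + fp)
```

Averaging these per-class figures over the 10 cross-validation folds yields the mean and standard deviation entries of Table II.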

The same multiple comparison procedure was used to assess the difference in CA when different feature sets are employed by the same classifier. No significant difference was found between the RLE and Laws features for any of the classifiers. However, when PNN is combined with the RLE+Laws features, the CA difference with respect to the corresponding PNN-RLE scheme is significant. The same conclusion was reached in the case of the NDDF classifier using the RLE+Laws features, which outperformed both RLE and Laws. These results, including the corresponding 95% confidence intervals, are presented in Table IV.

TABLE IV. STATISTICALLY SIGNIFICANT DIFFERENCES FOR THE SAME CLASSIFIER UTILIZING DIFFERENT FEATURES (95% CONFIDENCE INTERVAL)

Classifier  Feature set 1  Feature set 2  Lower  Difference in % CA  Upper
PNN         RLE+Laws       RLE            0.12   1.33                2.54
NDDF        RLE+Laws       RLE            0.60   1.82                3.05
NDDF        RLE+Laws       Laws           1.47   2.7                 3.92

IV. DISCUSSION

In this study, four classification settings combined with the RLE feature set, commonly used in the literature, and the Laws feature set, employed for the first time for DLD characterization, were evaluated. In addition to these two sets, their combination, the RLE+Laws feature set, was also investigated. Thus, twelve CAD schemes were considered.

The generalization ability of the classifier of each CAD scheme was investigated by means of 10-fold cross validation, while the statistical significance of differences between the estimated classification accuracies was examined using one-way ANOVA and Tukey's honestly significant difference criterion. Two experiments were conducted to examine the statistical significance of the differences between the reported performances of the considered schemes.

In the first experiment, the classifiers' performance was examined while the utilized feature set was held fixed. In the case of the RLE feature set, no statistically significant difference among the four classifiers was reported. For the Laws feature set, the highest difference in CA was reported between PNN and NDDF, with the k-NN and NDDF difference presenting a similar range. In the case of the RLE+Laws feature set, the highest difference in CA was reported for both PNN and k-NN against the MLR classifier, while a statistically significant difference was also detected between NDDF and MLR. As indicated by these results, k-NN and PNN present similar behavior, while the performance of NDDF and MLR differentiates in the case of RLE+Laws.

In the second experiment, the features' discriminating ability was investigated. Thus, the performance of each classifier was evaluated and compared for all three feature sets. In the case of the k-NN classifier, no statistically significant difference among the considered feature sets was reported; the same applies to the MLR classifier. In the case of PNN, the difference in the produced CA between the RLE+Laws feature set and the RLE feature set was statistically significant at the 95% confidence level. The same applies to NDDF; however, in that case a statistically significant difference between the RLE+Laws feature set and the Laws feature set was also reported. As indicated by these results, the performance of the PNN and NDDF classifiers increases when the original RLE and Laws features are combined. PNN performance presented stable behavior for a wide range of values of the parameter σ, while k-NN presented decreasing performance for k>1. These two classifiers, utilizing the RLE+Laws features, achieved overall the highest CA. Future work should consider investigation of statistical differences between the classification performances achieved utilizing different VOI size sampling and gray level binning.

V. CONCLUSION

Although both the RLE and Laws feature sets presented high discriminative ability for all classifiers considered (CA > 96.5%), their combination achieved even better results (CA > 98.6%). Furthermore, the PNN and k-NN classifiers seem to be the proper choice for DLD characterization, achieving the highest CA (99.8%) of all experiments.

ACKNOWLEDGMENT

This work was supported in part by the Caratheodory Programme (C.591) of the University of Patras.

REFERENCES

[1] V.A. Zavaletta, B.J. Bartholmai, R.A. Robb, "High resolution multidetector CT-aided tissue analysis and quantification of lung fibrosis," Academic Radiology, vol. 14, pp. 772-787, 2007.
[2] Z.A. Aziz, et al., "HRCT diagnosis of diffuse parenchymal lung disease: inter-observer variation," Thorax, vol. 59, pp. 506-511, 2004.
[3] I.C. Sluimer, A. Schilham, M. Prokop, B. van Ginneken, "Computer analysis of computer tomography scans of the lung: a survey," IEEE Transactions on Medical Imaging, vol. 25, pp. 385-405, 2006.
[4] Y. Xu, E.J. van Beek, Y. Hwanjo, J. Guo, G. McLennan, E. Hoffman, "Computer-aided classification of interstitial lung diseases via MDCT: 3D adaptive multiple feature method (3D AMFM)," Academic Radiology, vol. 13, pp. 969-978, 2006.
[5] P. Korfiatis, A. Karahaliou, A. Kazantzi, C. Kalogeropoulou, L. Costaridou, "Texture based identification and characterization of interstitial pneumonia patterns in lung multidetector CT," IEEE Transactions on Information Technology in Biomedicine, in press.
[6] M. Galloway, "Texture analysis using gray level run lengths," Computer Graphics and Image Processing, vol. 4, pp. 172-179, 1975.
[7] M.T. Suzuki, Y. Yaginuma, T. Yamada, Y. Shimizu, "A shape feature extraction method based on 3D convolution masks," Proc. Eighth IEEE International Symposium on Multimedia (ISM'06), pp. 837-844, 2006.
[8] K. Einslein, A. Ralston, H.S. Wilf, Statistical Methods for Digital Computers. New York: John Wiley & Sons, 1977.
[9] D. Specht, "Probabilistic neural networks," Neural Networks, vol. 3, pp. 109-118, 1990.
[10] E. Patrick and F. Fischer III, "A generalized k-nearest neighbor rule," Information and Control, vol. 16, pp. 128-152, 1970.
[11] J. Bernardo and A. Smith, Bayesian Theory. New York: Wiley, 1996.
[12] D. Hosmer and S. Lemeshow, Applied Logistic Regression, 2nd ed. New York: Wiley, 2000.
[13] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," Proc. 14th Int. Joint Conf. on Artificial Intelligence, vol. 2, pp. 1137-1143, 1995.
[14] Y. Hochberg and A. Tamhane, Multiple Comparison Procedures. Hoboken, NJ: John Wiley & Sons, 1987.
Hochberg and A. Tamhane, Multiple Comparison Procedures. Hoboken, NJ: John Wiley & Sons, 1987.