Chemometrics and Intelligent Laboratory Systems 138 (2014) 110–119


K-CM: A new artificial neural network. Application to supervised pattern recognition

M. Buscema a,b, V. Consonni c,⁎, D. Ballabio c, A. Mauri c, G. Massini a, M. Breda a, R. Todeschini c

a SEMEION, via Sersale 117, 00128 Roma, Italy
b Dept. Mathematical and Statistical Sciences, University of Colorado, 1250 14th Street, 80217-3364 Denver, CO, USA
c Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano-Bicocca, P.zza della Scienza 1, 20126 Milano, Italy

⁎ Corresponding author. Tel.: +39 02 64482818. E-mail address: [email protected] (V. Consonni).

Article info

Article history: Received 10 March 2014; Received in revised form 29 May 2014; Accepted 19 June 2014; Available online 28 June 2014.

Keywords: Artificial neural networks; Classification; k-NN; QSAR; Auto-CM; TWIST algorithm

Abstract

Artificial neural networks can currently be considered as one of the most important emerging tools in multivariate analysis due to their ability to deal with non-linear complex systems. In this work, a recently proposed neural network, called K-Contractive Map (K-CM), is presented and its classification performance is evaluated against other well-known classification methods. K-CM exploits the non-linear variable relationships provided by the Auto-CM neural network to obtain a fuzzy profiling of the samples and then applies the k-NN classifier to evaluate the class membership of samples. The Training with Input Selection and Testing (TWIST) algorithm is applied prior to K-CM to perform training/test data splitting for model parameter optimization and validation. This novel classification strategy was evaluated on ten different datasets and the obtained results were generally satisfactory.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Supervised pattern recognition methods for classification are currently applied in several research fields, such as food chemistry, analytical chemistry, metabolomics, process monitoring, medical sciences, pharmaceutical chemistry, chemical modeling and environmental monitoring, as well as social and economic sciences. Classification is one of the fundamental methodologies in multivariate analysis and basically consists in finding mathematical models capable of recognizing the membership of samples to their proper class [1,2]. Once a classification model has been trained, the membership of new samples to one of the defined classes can be predicted with a certain degree of accuracy. While the theory for multivariate calibration (regression) is extensive and has been exhaustively developed over the last thirty years, multivariate classification is less developed, even if several multivariate classification methods, based on different mathematical approaches, have been proposed so far [3–12].

Classification algorithms proposed in the literature have different characteristics and properties, so that they are able to face various potential issues related to class discrimination [13]. For example,


classification methods can be linear or non-linear (on the basis of the class decision boundary they can handle), probabilistic (if they are based on estimates of probability distributions), parametric or non-parametric, distance-based (if they require the calculation of distances between samples or between samples and class space representative points), pure or class-modeling methods (if boundaries to separate each specific class from the rest of the data space are defined), and global or local (if all data or just a few neighbors are used to decide upon class assignment).

Among traditional classifiers, artificial neural networks (ANNs) have increasing applications and can currently be considered as one of the most important emerging tools in multivariate analysis. One of the reasons for their success is their ability to solve both supervised and unsupervised problems and to deal with the non-linear relationships that are frequent in real datasets.

The present article aims at demonstrating the potential use in chemical research of a recently proposed classification method. This is a machine learning method that couples an artificial neural network for the non-linear optimization of the variable relationships and a fuzzy profiling of the samples with the TWIST algorithm, which is designed for optimal training/test data splitting and variable selection [14]. This novel classification strategy, called K-Contractive Map (K-CM), was tested on different benchmark datasets. The obtained results were discussed in comparison with those derived from other classical


classification methods, representing different families of the main algorithms for classification purposes. All methods were implemented using the WEKA software package and some in-house MATLAB functions.

2. Theory

2.1. K-CM method overview

The K-Contractive Map (K-CM) method is an application of the more general Contractive Map (CM) technique (Fig. 1), specifically designed to solve supervised pattern recognition problems with the aid of the k-nearest neighbor (k-NN) classifier [15]. This method is a supervised artificial neural network (ANN) that is capable of supporting its decisions by using a weighted semantic map of the samples evaluated in blind testing and a variable semantic graph, which shows how the relations among all the variables of the training set were organized during training. The method consists of the following main stages, as described in Fig. 1:

1. Learning of the Auto-Contractive Map (Auto-CM) on the training set; Auto-CM is a neural network that optimizes the non-linear relationships among all variables of the training set into a variable many-to-many relationship matrix (i.e., the hidden-output connection matrix w). The Auto-CM technology is well known in the literature for its excellent results in complex applications in medicine and bio-security [15];
2. A fuzzy profiling of the samples, by means of the z-transforms, through which the variables of a sample are redefined in terms of a fuzzy membership grade. This fuzzy membership is determined on the basis of the variable connection matrix w generated by the Auto-CM system at the end of the learning phase;
3. Class membership assignment, undertaken by means of the k-nearest neighbor classifier applied to the z-transformed dataset [4];
4. Data graphical visualization, performed by an algorithm for the organization of specific distance matrices derived from the training and test samples and redefined by the profiling algorithm.

2.2. Symbols and terminology

Table 1 collects the basic symbols and terminology used in the following sections to explain the K-CM theory.

2.3. Data preprocessing

Prior to the K-CM analysis, all the variables are range scaled on their minimum and maximum values so that each scaled variable is allowed to vary in the range 0–1. In addition, the class variable is unfolded into a binary vector of dimension G, where G is the number of different class values and 1 indicates the class membership.
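For illustration, this preprocessing step could be sketched as follows (a minimal sketch in Python; the function and variable names are ours and are not part of the K-CM software):

```python
import numpy as np

def preprocess(X, y):
    """Range-scale each variable to [0, 1] and unfold the class labels
    into a binary membership matrix of dimension G (one column per class)."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    X_scaled = (X - x_min) / (x_max - x_min)                    # each column now spans 0-1
    classes = np.unique(y)                                      # the G different class values
    Y_bin = (np.asarray(y)[:, None] == classes).astype(float)   # 1 marks class membership
    return X_scaled, Y_bin, classes
```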

Table 1
Symbols and terminology of the K-CM method.

p — Number of variables
n — Number of training samples
G — Number of classes
x_ij — jth variable (input) value of the ith sample
x_ij^[out](t) — jth output value of the ith sample at epoch t
v_j(t) — Connection weight between the jth input and the jth hidden node at epoch t
C — Contraction parameter
z_ij — jth z-transform of the ith sample
D_mi — Euclidean distance between the mth test sample and the ith training sample, based on the z-transforms and the class memberships
c_ig — True membership to the gth class of the ith training sample
j, k — Indices running on variables
i — Index running on training samples
m — Index running on test samples
g — Index running on class memberships
x_ij^[h](t) — jth hidden value of the ith sample at epoch t
w_jk(t) — Connection weight between the kth hidden and the jth output node at epoch t
α — Learning rate constant
d_mi — Euclidean distance between the mth test sample and the ith training sample, based on the z-transforms
M_m,m′ — Meta-distance between the test samples m and m′
ĉ_mg — Estimated membership to the gth class of the mth test sample

Fig. 1. Outline of the Contractive Map technique.

2.4. Auto-CM: unsupervised analysis of variable relationships

The Auto-Contractive Map (Auto-CM) is an unsupervised adaptive neural network with an architecture based on three layers of nodes [16–18]: an input layer that captures the signal from the environment, a hidden layer that modulates the signal within the network and an output layer that returns a response to the environment on the basis of the processing that occurred (Fig. 2). The three layers have the same number p of nodes, where p is the number of input variables. The connections between the input and the hidden layer are mono-dedicated, whereas the hidden and the output layers are completely connected. Each connection is assigned a weight; all the weights are initialized not randomly, but with the same value, close to zero in machine precision (if the data are float, 0.0000001). Let C be a positive real number not lower than 1, which is referred to as the contraction parameter; v, the vector of mono-dedicated connections between the input and hidden layer; and w, the matrix (p × p) of the connections between the hidden and the output layer.

The learning algorithm of Auto-CM may be summarized by a sequence of three characteristic steps:

1. Signal transfer from the input into the hidden layer:

$$x_{ij}^{[h]}(t) = x_{ij}\left(1 - \frac{v_j(t)}{C}\right) \qquad (1)$$

where the jth variable value x_ij of the ith object presented to the network at the time t is transformed into the corresponding hidden node value x_ij^[h](t) through the contractive factor;

2. Signal transfer from the hidden to the output layer:

$$Net_{ij}(t) = \sum_{k=1}^{p} x_{ik}^{[h]}(t)\left(1 - \frac{w_{jk}(t)}{C}\right) \qquad (2)$$

$$x_{ij}^{[out]}(t) = x_{ij}^{[h]}(t)\left(1 - \frac{Net_{ij}(t)}{C^2}\right) \qquad (3)$$

where the value of the output node x_ij^[out](t) is calculated through a double conceptual step. An initial operation accumulates a quantity, the net input Net_ij, that is the contraction of all the hidden nodes through the weights between the hidden and output layer. Then, a second operation calculates the output value by further contracting the corresponding value of the hidden node through the previously calculated net input for the output node.

3. Adaptation of the weights of the input/hidden connections and hidden/output connections:

$$\Delta v_j(t) = \left(1 - \frac{v_j(t)}{C}\right)\sum_{i=1}^{n}\left[x_{ij} - x_{ij}^{[h]}(t)\right] x_{ij} \qquad (4)$$

$$v_j(t+1) = v_j(t) + \alpha\,\Delta v_j(t) \qquad (5)$$

$$\Delta w_{jk}(t) = \left(1 - \frac{w_{jk}(t)}{C}\right)\sum_{i=1}^{n}\left[x_{ij}^{[h]}(t) - x_{ij}^{[out]}(t)\right] x_{ik}^{[h]}(t) \qquad (6)$$

$$w_{jk}(t+1) = w_{jk}(t) + \alpha\,\Delta w_{jk}(t) \qquad (7)$$

where α is the learning rate constant. During the learning phase, for each ith input object one calculates the output values and the contributions to the update of the connection weights, which are summed and applied at the end of the epoch. For the p mono-dedicated connections between the input and hidden layer, one considers the contractive factor, based on the current weight v_j(t), and the difference between the values of the corresponding input and hidden nodes, further modulated by the input node itself. Likewise, for the p² connections between the hidden and output layer, one calculates the contractive factor, based on the current weight w_jk(t), and the difference between the values of the corresponding hidden and output nodes, which is modulated by the term x_ik^[h](t). Note that this term makes the change in the connection w_jk(t) proportional to the quantity of energy liberated by the kth hidden node in favor of the jth output node.

The contraction parameter C has to be ≥ 1 to avoid Eq. (1) becoming null when v_j(t) becomes equal to C. Additionally, it is convenient to set C = √p because in this way Eqs. (2) and (3) become consistent and elegant (i.e., max[Net_ij(t)] may be p and then C = √p). From the equations above, it derives that the contractions establish a relationship of order between the layers:

$$x_{ij} \geq x_{ij}^{[h]}(t) \geq x_{ij}^{[out]}(t) \qquad (8)$$

as, in each step, the variable value is multiplied by a factor smaller than 1. The whole learning process, which essentially consists of a progressive adjustment of the connections aimed at the global minimization of energy, may be seen as a complex juxtaposition of phases of acceleration and deceleration of velocities of the learning signals (adaptations of w_jk(t) and v_j(t) inside the ANN connection matrix). During the training, the mono-dedicated weights v_j grow monotonically, and with different speeds, asymptotically towards the constant C:

$$\lim_{t\to\infty} v_j(t) = C \qquad (9)$$
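As an illustration of Eqs. (1)–(7), one learning epoch of Auto-CM could be sketched as follows; the SEMEION implementation is not public, so the function and variable names are ours and the sketch only reflects the equations above (X is assumed to be the range-scaled n × p training matrix):

```python
import numpy as np

def autocm_epoch(X, v, w, C, alpha):
    """One Auto-CM learning epoch (Eqs. (1)-(7)): weight updates are
    accumulated over all training samples and applied at the end."""
    n, p = X.shape
    dv = np.zeros(p)
    dw = np.zeros((p, p))
    for i in range(n):
        x = X[i]
        x_h = x * (1.0 - v / C)                           # Eq. (1): input -> hidden
        net = (x_h * (1.0 - w / C)).sum(axis=1)           # Eq. (2): net input of output node j
        x_out = x_h * (1.0 - net / C ** 2)                # Eq. (3): hidden -> output
        dv += (x - x_h) * x * (1.0 - v / C)               # Eq. (4): input/hidden updates
        dw += np.outer(x_h - x_out, x_h) * (1.0 - w / C)  # Eq. (6): hidden/output updates
    return v + alpha * dv, w + alpha * dw                 # Eqs. (5) and (7)
```

For the Itaoils example discussed in Section 7.4, the paper sets C = √p and α = 1/n; epochs are repeated until the weight updates vanish (Eqs. (9)–(11)).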

At the end of the training phase (Δw_jk = 0), any input vector of the training set will generate null hidden and output vectors [18]:

$$\lim_{t\to\infty} x_{ij}^{[h]}(t) = 0 \qquad (10)$$

$$\lim_{t\to\infty} x_{ij}^{[out]}(t) = 0 \qquad (11)$$

Fig. 2. Example of Auto-CM architecture with p = 4.

This process can be seen as a minimization of the training vector energy, which is represented by a function through which the trained


connections absorb completely the training input vectors. This process converges naturally and occurs with speed modulated by the input vectors, the learning rate constant α and the contraction constant C; after convergence, the process leaves its specific sign in the weights w_jk between the hidden and the output layer. Thus, it is reasonable to say that the weights w_jk are an unbiased estimate of the whole set of relationships among the input variables, without any overfitting.

2.5. z-Profiling algorithm: fuzzy transformation of samples

In this phase of the method, the connection weight matrix w^[ACM] generated by the Auto-CM network, which encodes the full set of relationships among the input variables, is used to transform the input vectors in a fuzzy manner through the following equation:

$$z_{ij} = x_{ij}\left(1 - \frac{1}{C^2}\sum_{k=1}^{p} x_{ik}\left(1 - \frac{w_{jk}^{[ACM]}}{C}\right)\right) \qquad (12)$$

where ACM indicates the final set of hidden-output connection weights obtained when Auto-CM converges at the end of the learning phase. The outcome of this algorithm is the translation of the original dataset X into a new fuzzy dataset Z, more informative than the previous one.

2.6. K-NN algorithm: supervised analysis of samples

In this phase of the K-CM method, the class membership of the test samples is determined according to the same principle as the k-nearest neighbor (k-NN) classifier. For each mth test sample, the samples used in the training phase define its fuzzy (gradient) value through the calculation of the z-transform:

$$z_{mj} = x_{mj}\left(1 - \frac{1}{C^2}\sum_{\substack{k=1 \\ k\neq j}}^{p} x_{mk}\left(1 - \frac{w_{jk}^{[ACM]}}{C}\right)\right) \qquad (13)$$
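A compact sketch of this z-profiling step (Eqs. (12)–(13)), again with hypothetical names and assuming w_acm is the converged p × p hidden-output weight matrix:

```python
import numpy as np

def z_profile(X, w_acm, C, exclude_self=False):
    """Fuzzy z-transform of the samples in X (rows), Eqs. (12)-(13);
    set exclude_self=True to skip the k = j term as in Eq. (13)."""
    Z = np.empty_like(X, dtype=float)
    p = X.shape[1]
    for j in range(p):
        factor = 1.0 - w_acm[j] / C          # (1 - w_jk/C) for k = 1..p
        if exclude_self:
            factor = factor.copy()
            factor[j] = 0.0                  # drop the k = j contribution
        inner = (X * factor).sum(axis=1)     # sum_k x_ik * (1 - w_jk/C)
        Z[:, j] = X[:, j] * (1.0 - inner / C ** 2)
    return Z
```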

Using the transforms z instead of the original variables x, one can compute the Euclidean distance between each test sample and each of the training samples as:

$$d_{mi} = \sqrt{\sum_{j=1}^{p}\left(z_{mj} - z_{ij}\right)^2} \qquad i = 1, \ldots, n; \; \forall m \qquad (14)$$

and finally identify the k nearest neighbors as those training samples associated with the minimum values of d_mi. In the case of k = 1, the test sample is assigned the class membership of the nearest training sample; for k > 1, the test sample is assigned the class membership according to the criterion of maximum class relative frequency over the k nearest neighbors.

2.7. The semantic map: data graphical visualization

Once the class memberships of all the test samples have been established, in the final stage of the K-CM method, the whole set of internal relationships among the test samples can be visualized by the semantic map. The specific algorithm by which K-CM carries out the graphic projection of samples on the semantic map is based on the definition of a Meta-Distance and is performed in two steps.

In the first step, the Euclidean distance of each test sample (input variables + estimated class) from all the samples used for training (input variables + real class) is calculated, as follows:

$$D_{mi} = \left[\sum_{j=1}^{p}\left(z_{mj} - z_{ij}\right)^2 + \sum_{g=1}^{G}\left(\hat{c}_{mg} - c_{ig}\right)^2\right]^{1/2} \qquad (15)$$

where ĉ and c are the estimated and real class memberships, respectively. Note that, unlike the true class membership, which can only be 0 or 1, the estimated class membership for a test sample is the relative frequency of a given class over the k nearest neighbors (i.e., the average class membership). These distance values between test and training samples can be collected into a rectangular matrix D, which is a new dataset that describes how distant all test samples are from each training sample, also considering their class memberships.

In the second step, the Euclidean distance between every pair of test samples is calculated on the basis of the matrix D of relative distances. We define as the Meta-Distance (M_m,m′) the new distance between the test samples m and m′ based on the distances they have from all the training samples:

$$M_{m,m'} = \sqrt{\sum_{i=1}^{n}\left(D_{mi} - D_{m'i}\right)^2} \qquad (16)$$
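A sketch of this two-step construction (Eqs. (15)–(16)); the names are ours, c_train holds the 0/1 training memberships and c_test the estimated memberships (relative class frequencies over the k neighbors):

```python
import numpy as np

def meta_distances(Z_test, Z_train, c_test, c_train):
    """Step 1: rectangular matrix D of test-training distances augmented with
    class memberships (Eq. (15)); step 2: square matrix M of meta-distances
    between test samples computed on the rows of D (Eq. (16))."""
    dz = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    dc = ((c_test[:, None, :] - c_train[None, :, :]) ** 2).sum(axis=2)
    D = np.sqrt(dz + dc)                                              # Eq. (15), shape (m, n)
    M = np.sqrt(((D[:, None, :] - D[None, :, :]) ** 2).sum(axis=2))   # Eq. (16), shape (m, m)
    return D, M
```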

Finally, the meta-distances of all the pairs of test samples are arranged into a symmetric square matrix M (m × m) that can be used to obtain the sample semantic graph, for instance by calculating a Minimum Spanning Tree (MST). The MST is often used as a filter on the distance matrix to highlight the most significant relationships between objects through the concepts of graph theory [19]. Given an undirected graph G, representing a matrix of distances d, with V vertices completely linked to each other and representing the objects being evaluated, the total number of edges (E) of the graph is:

$$E = \frac{V(V-1)}{2} \qquad (17)$$

The Minimum Spanning Tree problem is defined as follows: find an acyclic subset T of E that connects all of the vertices in the graph and whose total weight is minimized, where the total weight is given by:

$$d(T) = \sum_{i=1}^{V-1}\sum_{j=i+1}^{V} d_{ij} \qquad (18)$$

where T is called a spanning tree, and the MST is the T with the minimum sum of its edge weights:

$$MST = \min\{d(T_k)\} \qquad (19)$$
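The semantic map can then be obtained by filtering the meta-distance matrix M with a Minimum Spanning Tree; one possible sketch, assuming SciPy is available (the paper does not prescribe a specific MST routine):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def semantic_tree(M):
    """Edges (i, j, weight) of a Minimum Spanning Tree of the symmetric
    meta-distance matrix M, i.e. the T minimizing d(T) in Eqs. (18)-(19)."""
    mst = minimum_spanning_tree(np.asarray(M))   # keeps V - 1 of the V(V-1)/2 edges
    rows, cols = mst.nonzero()
    return [(int(i), int(j), float(mst[i, j])) for i, j in zip(rows, cols)]
```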

The MST is what we might call the nervous system of any dataset. Indeed, adding up the connections predicted by the MST, we obtain the total energy of the system. The MST selects only those connections that minimize this energy, that is, the only ones that are necessary to maintain a cohesive system.

3. Selected classification methods for comparison purposes

In this section, the algorithms used to compare the results of the novel K-CM method are briefly introduced.


3.1. Bayesian algorithms

The Bayesian algorithms are those based on Bayes' theorem, which states that, given a set of events that partition an event space, any event dependent on the event space enriches the knowledge of the initial events. Among the Bayesian algorithms, Linear Discriminant Analysis (LDA) is a probabilistic parametric classification method that performs dimensionality reduction by maximizing the variance between categories and minimizing the variance within categories [20]. LDA assumes that the classes have identical covariance matrices and fits a multivariate normal density to each group with a pooled estimate of the covariance. In this work, LDA was carried out with a-priori class probabilities proportional to the number of objects in each class.

3.2. Optimization algorithms: sequential minimal optimization

A Support Vector Machine (SVM) is a binary classifier that recognizes the hyperplane separating two different classes by maximizing the distance between the closest training examples [3]. Sequential minimal optimization (SMO) is an iterative algorithm used to solve the optimization problem described for the SVM by decomposing it into a series of sub-problems, each small enough to be solved analytically [21–23].

3.3. Regression algorithms

Partial Least Squares Discriminant Analysis (PLS-DA) is a classification method that combines the properties of Partial Least Squares Regression with the discrimination power of a classification technique [24]. It finds fundamental relations between the variables and the class vector by calculating latent variables (LVs), which are linear combinations of the original variables. In this study, PLS-DA models were optimized in cross-validation to select the optimal number of LVs. Logistic regression (Logistic) is a particular case of generalized linear regression applied when the dependent variable y is dichotomous [25,26]. The Multilayer Perceptron (MLP) is a generalization of the logistic regression model with a feed-forward flow and totally interconnected variables [25–27].

3.4. k-NN algorithms

The k-nearest neighbor (k-NN) classifier [4] is the simplest classification algorithm in the literature and is based on the similarity between objects in the dataset: samples are classified according to the majority of their k nearest neighbors in the multivariate data space. The similarity measure characterizes the type of k-NN used and can be computed in several ways. In the classical k-NN method, the Euclidean metric was used to measure the similarity between samples.
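As a reference for the comparison, a bare-bones Euclidean k-NN of this kind can be written in a few lines (illustrative sketch only):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    """Classical k-NN: assign each test sample the majority class of its
    k nearest training samples according to the Euclidean metric."""
    y_train = np.asarray(y_train)
    preds = []
    for x in X_test:
        d = np.sqrt(((X_train - x) ** 2).sum(axis=1))       # distances to all training samples
        neighbors = y_train[np.argsort(d)[:k]]              # labels of the k nearest neighbors
        preds.append(Counter(neighbors.tolist()).most_common(1)[0][0])
    return np.array(preds)
```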

3.5. Tree algorithms

Tree algorithms, or decision-making trees, rely on building a tree from the element's variables (nodes) and the possible values that they can take (branches) until one arrives at the leaves representing the class of the sample. The path from the root node to a leaf node through the arc values determines the path that a particular sample must take to reach its membership class. The tree construction is attained through recursive splitting of the dataset into smaller subsets, where each subset contains samples belonging to as few classes as possible. In each split (node), the partitioning is performed in such a way as to reduce the entropy (maximize the purity) of the new subsets. At each binary partitioning, the single variable that gives the purest subsets is selected and the partitioning can be formulated as a binary rule. This is the case of the Classification and Regression Trees (CART), which are a form of binary recursive partitioning based on univariate rule induction [8]. Random Forests (RF) were introduced by Leo Breiman to treat both classification and regression problems [28]. A Random Forest is defined as a collection of decision trees. The classifier takes the input variable vectors, collects the prediction of each tree in the forest, and assigns the class with the largest number of occurrences.

4. Case studies

Ten benchmark datasets were taken from the literature in order to extensively evaluate the classification performance of the novel K-CM method. The selected datasets cover a range of different characteristics in terms of number of samples, variables and classes. Table 2 collects these basic features for each dataset and the corresponding literature reference. For the Biodeg dataset, an additional external set of 194 samples was used to further evaluate the modeling performance of the classification methods under analysis [30].

Table 2
Characteristics of the benchmark datasets.

Dataset     Reference   No. samples   No. variables   No. classes
Apple       [29]        508           15              2
Biodeg      [30]        837           12              2
Blood       [31]        748           4               2
Diabetes    [32]        768           8               2
Digits      [33]        500           7               10
Itaoils     [34]        572           8               9
Perpot      [35]        100           2               2
Sediment    [36]        1413          16              2
Sulfa       [37]        50            7               2
Wines       [38]        178           13              3

5. TWIST algorithm

The Training with Input Selection and Testing (TWIST) algorithm is a complex evolutionary algorithm able to search for the best distribution of the global dataset into two optimally balanced subsets containing a minimum number of variables [14]. TWIST is comprised of two algorithms: the Input Selection (IS) algorithm for the choice of the best subset of variables and the Training & Testing Optimization (T&T) algorithm for the best splitting of the original dataset into two subsets for training and testing, respectively.

T&T is an evolutionary algorithm whose population expresses, after each generation, different hypotheses about the splitting of the global dataset into two subsets. At each generation, each individual of the genetic population indicates which samples of the global dataset have to be clustered into subset A and which ones into subset B. The fitness function for the choice of the best split of a dataset evaluates the agreement of the probability density functions of the two subsets, which have to be as similar as possible. Similar probability density functions are a prerequisite for pessimistic training and testing distribution. The optimized subset A and subset B can be used both for training and testing (learning from subset A and evaluating on subset B, and vice versa).

The Input Selection (IS) algorithm operates as a specific evolutionary wrapper system that responds to the need to reduce the dimensionality of the data by extracting the minimum number of variables necessary to conserve the most information available. To integrate the IS algorithm


with the T&T algorithm into one procedure (the TWIST algorithm), each individual of the genetic population is comprised of two vectors: 1) a vector of n components with Boolean values, n being the number of samples of the global dataset; a value of 1 indicates that the sample is assigned to subset A and a value of 0 that the sample is assigned to subset B; 2) a vector of p components with Boolean values, where p is the number of variables of the global dataset. In this case, when the value of a generic component of the vector is 1, the corresponding variable is retained in both subset A and subset B, while if the value is 0, the corresponding variable is removed. At the end of its evolution, TWIST will generate two subsets A/B of data with a very similar probability density distribution for pattern recognition and an optimal number of selected variables.

6. Software

Calculation of K-CM models and implementation of TWIST were carried out by dedicated software released by SEMEION [15]. Models based on logistic regression (Logistic), multilayer perceptron (MLP), random forest (RF) and support vector machine learning (SMO) were calculated by means of the WEKA package [39]. Linear Discriminant Analysis (LDA), Partial Least Squares Discriminant Analysis (PLS-DA), classification trees (CART) and classical k-nearest neighbor (k-NN) classification were performed with the Classification toolbox for MATLAB developed by the Milano Chemometrics & QSAR Research Group [40].

7. Results and discussion

To test the novel classification method K-CM, 10 benchmark datasets were selected with different features in terms of number of variables, samples and classes, and the relevance of the K-CM results was evaluated in comparison with 8 classical classification methods (k-NN, LDA, PLS-DA, CART, RF, MLP, SMO, Logistic). Depending on the method, each dataset was subjected to a different preprocessing and then split into two distinct data subsets A and B by the TWIST algorithm in order to implement a two-fold cross-validation for model parameter optimization and performance estimation. All the classification methods were applied following the same validation protocol and, for each method, two different models derived from the two different subsets A and B were retained. Finally, for the dataset Biodeg, the true predictive ability of the models was evaluated on a blind test set that was not used in any stage of the model development.

Fig. 3. Architecture of the z-profiling algorithm with p = 4.

7.1. Data scaling and transformation

To implement the k-nearest neighbors (k-NN) and PLS-DA methods, autoscaling was performed on each dataset. For all the other methods, range scaling 0–1 was used to bring all the variables onto the same unit scale.

7.2. Training/test splitting by TWIST

For parameter optimization and validation purposes, each dataset was split into two subsets A and B by using the TWIST algorithm [14]. The two subsets of data are well balanced and have very similar probability density distributions. In this work, only the T&T option was activated since no variable selection was required. The A/B data partition is shown in Table 3 for all the selected benchmark datasets.

7.3. Validation protocol and evaluation of the classification performance

For each dataset, classification models were validated on the basis of the following procedure:

a) Samples were split into two subsets (A and B) on the basis of the TWIST algorithm (Table 3).
b) Samples belonging to subset A were used to calibrate classification models.
c) Through the classification models calibrated on A, predictions for the samples of subset B were calculated. Then, the predictions of the samples in B were used to calculate the class sensitivities, that is, the ratio of the true positives over the total number of samples belonging to the class. The non-error rate achieved on B (NERB) was then calculated as the average of all class sensitivities.
d) New classification models were next calibrated with the samples of subset B and used to predict the samples of subset A. The non-error rate achieved on A (NERA) was then calculated as previously described.
e) For each model, the overall non-error rate (NER) was finally calculated as the average of NERA and NERB.

To implement K-CM, k-NN and PLS-DA, the optimal number of neighbors (k) and latent variables (LV) were selected by maximizing the overall non-error rate (NER).
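A minimal sketch of this two-fold protocol and of the NER computation, assuming a generic scikit-learn-style classifier with fit/predict methods (the names are ours):

```python
import numpy as np

def non_error_rate(y_true, y_pred, classes):
    """NER: arithmetic mean of the class sensitivities
    (true positives divided by the number of samples of each class)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean([np.mean(y_pred[y_true == g] == g) for g in classes])

def twofold_ner(model, X_A, y_A, X_B, y_B, classes):
    """Calibrate on A and predict B (NER_B), then swap the roles (NER_A);
    the overall NER is the average of the two."""
    ner_B = non_error_rate(y_B, model.fit(X_A, y_A).predict(X_B), classes)
    ner_A = non_error_rate(y_A, model.fit(X_B, y_B).predict(X_A), classes)
    return (ner_A + ner_B) / 2
```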

7.4. Example of K-CM analysis: Itaoils data set

Table 3
TWIST A/B partition of the benchmark datasets.

Dataset     Total no. samples   No. samples in A   No. samples in B
Apple       508                 196                312
Biodeg      837                 429                408
Blood       748                 380                368
Diabetes    768                 404                364
Digits      500                 266                234
Itaoils     572                 286                286
Perpot      100                 56                 44
Sediment    1413                722                691
Sulfa       50                  20                 30
Wines       178                 85                 93

The dataset Itaoils [34] was selected as the case study to explain the whole K-CM procedure. This dataset consists of 572 olive oil samples that are described by their content of eight fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, arachidic, linolenic, and eicosenoic); the concentrations of these fatty acids vary from up to 85% for oleic acid to as low as 0.01% for eicosenoic acid. Olive oil samples were collected from 9 different geographical regions: 4 from Southern Italy (North and South Apulia, Calabria, Sicily), 2 from Sardinia (Inland and Coastal) and 3 from Northern Italy (Umbria, East and West Liguria).

In the first stage of the procedure, the TWIST algorithm divided the dataset into the two subsets A and B, each comprised of 286 samples equally distributed in the 9 classes. Then, Auto-CM was run on each data subset using the following parameter settings: contraction parameter C = √p = √8 = 2.828 and learning rate constant α = 1/n = 1/286 = 3.496 × 10⁻³. During the learning phase, the 8 mono-dedicated connections v_j(t) between the input and hidden layer were updated at the end of each epoch on the basis of the difference between the values of the corresponding input and hidden nodes and further modulated by the input node itself.

Table 4
Auto-CM connection weight matrix estimated as average from subsets A and B. X1: palmitic acid; X2: palmitoleic acid; X3: stearic acid; X4: oleic acid; X5: linoleic acid; X6: arachidic acid; X7: linolenic acid; X8: eicosenoic acid.

       X1      X2      X3      X4      X5      X6      X7      X8
X1   0.9150  0.8930  0.7917  0.8243  0.9021  0.8895  0.9102  0.7938
X2   0.9060  0.9064  0.7809  0.8069  0.9063  0.8831  0.8965  0.7932
X3   0.8495  0.8232  0.8558  0.8429  0.8372  0.8655  0.8742  0.7731
X4   0.8440  0.8126  0.8089  0.9117  0.8474  0.8853  0.9027  0.7129
X5   0.8980  0.8903  0.7733  0.8199  0.9324  0.8818  0.9147  0.7498
X6   0.8876  0.8672  0.8077  0.8641  0.8845  0.9181  0.9206  0.8075
X7   0.8899  0.8611  0.7946  0.8626  0.8990  0.9031  0.9379  0.7749
X8   0.8582  0.8418  0.7763  0.7550  0.8231  0.8708  0.8629  0.8648

Fig. 4. Auto-CM input-hidden connection weights vj for the 8 variables of Itaoils dataset.

Fig. 4 shows the update trend of these connection weights during the entire learning phase. All the variable weights v_j(t) converged to the same value (i.e., the contraction parameter C = √8) after 600 epochs, with different convergence speeds. The variable convergence speed depends on the constant C and the learning coefficient α: the higher C and α, the faster the global convergence. Under these conditions, it was observed that the convergence speed is lower for variables less influenced by the other variables. During the learning phase, also the hidden-output connections w_jk(t) were updated at the end of each learning epoch on the basis of the current weight and the difference between the values of the corresponding hidden and output nodes. After convergence, all the linear and non-linear relationships between variables were encoded into the connection weight matrix, which is the final output of Auto-CM. Since Auto-CM was trained separately on subsets A and B of the Itaoils dataset, the final variable relationships were estimated as the average of the two derived connection weight matrices (Table 4).

Table 5
Itaoils dataset: class sensitivities for different values of k (no. of nearest neighbors) calculated as the average of the values derived from the K-CM model on training set A and the K-CM model on training set B. The last column collects the arithmetic mean of the 9 class sensitivities (NER). The highlighted line indicates the best K-CM model.

[Class sensitivities for k = 1–10 across the 9 classes, with the overall NER% in the last column; the individual cell values could not be reliably recovered from the source text.]

Table 6
Itaoils dataset: class sensitivities implementing K-CM on training set A to predict test set B (SnB%) and K-CM on training set B to predict test set A (SnA%) with k = 3. The last column collects the arithmetic mean of the 2 sensitivities for each class.

Class             No. samples A   No. samples B   SnB%    SnA%    Average class Sn%
North Apulia      13              12              100     92.3    96.2
Calabria          25              31              96.8    96.0    96.4
South Apulia      106             100             98.0    99.1    98.5
Sicily            17              19              84.2    75.6    77.4
Inland Sardinia   32              33              100     100     100
Coast Sardinia    18              15              100     100     100
Umbria            21              29              100     100     100
East Liguria      26              24              100     100     100
West Liguria      28              23              100     100     100
                                                  NERB: 97.7%     NERA: 95.3%     NER: 96.5%
                                                  Error no.: 6    Error no.: 8    Error no.: 14


The Auto-CM trained weights, w_jk, show:

▪ how each variable supports itself, w_jj, that is, how much each variable can be explained by the other variables; in other words, the low w_jj values for variables X3 and X8 indicate that these two variables are not much explained by the other variables; it can also be noted that these variables are those for which the input-hidden connection weights v_j converge last (see Fig. 4);
▪ how many non-symmetric associations between variables are present; for instance, variable X1 (i.e., palmitic acid) excites variable X8 (i.e., eicosenoic acid) less than variable X8 excites variable X1. This asymmetry shows how much each variable contributes to the explanation of the others, and this relationship, obviously, does not need to be symmetric.

From the results of Table 4, the most important observations are that variables X2, X3, X4, and X8 are the least influenced by the other variables (see the corresponding column values); looking at the row values, variable X2 contributes much to explaining variables X1 and X5, while variable X8 contributes to explaining X7 and X6. Finally, variable X7 is on average the most influenced by all the other variables.

Table 5 collects the average class sensitivities derived from the two K-CM models, one implemented on training set A and the other on training set B, varying the model parameter k (no. of nearest neighbors) from 1 to 10. The optimal k value was selected as that maximizing the final model NER%, which was calculated as the arithmetic mean of the 9 class sensitivities. Once the optimal model parameter had been selected, i.e., k = 3, the final K-CM model was calculated and the obtained classification results for each single class were collected in Table 6. It can be noted that the K-CM model fails to recognize the true class membership of 14 out of 572 samples and these failures only concern samples from Southern Italy (North and South Apulia, Calabria, Sicily). Previous studies on the classification of this dataset reported NER% ranging from 79.7 to 95.0 for different statistical techniques and training/test splittings [41]; the K-CM model gave NER% = 96.5, demonstrating that it is a valid approach to classification problems of this type.

A graphical visualization of the Itaoils samples and their distribution over the 9 considered Italian regions is achieved by the semantic map of Fig. 5, which is based on the meta-distances between samples. From this map, it is apparent that the majority of olive oil samples taken from the same region tend to cluster, forming well-defined branches of the tree. In addition, this map clearly shows the anomalous samples that are misclassified by the K-CM model.

7.5. Comparison of the classification results

In the last part of this research study, the novel method K-CM was tested on ten different benchmark datasets and its performance was evaluated against the results of the other well-known classification methods. To implement the K-CM, k-NN and PLS-DA methods, the optimal model parameter (i.e., the number k of nearest neighbors for K-CM and k-NN, and the number LV of latent variables for PLS-DA) was determined in cross-validation by maximization of the overall NER%. The optimal model parameters for the different datasets are collected in Table 7. The classification performance (overall NER%) of all the classifiers for each dataset is shown in Table 8. The classifiers are ranked on the basis of their average NER%; the average rank is reported in the first column of the table.

From the results of Table 8, it is apparent that the proposed method K-CM outperforms the other classifiers, being ranked first with an average rank close to 1 and an average NER of 89.2%. K-CM gives the highest NER for all the datasets except Digits, for which the best result is achieved by LDA (NER of 75.2%). For Perpot and Wines, K-CM has the same performance as k-NN (NER of 100%) and PLS-DA (NER of 98.6%), respectively. Considering the average NER%, the classification methods can be roughly divided into three different blocks: a) K-CM and k-NN are the

Fig. 5. Semantic map of the dataset Itaoils.


Table 7
Model parameters for the benchmark datasets. k: number of nearest neighbors; LV: number of PLS latent variables; No. var.: number of tree variables.

Method (parameter)   Apple   Biodeg   Blood   Diabetes   Digits   Itaoils   Perpot   Sediment   Sulfa   Wines
K-CM (k)             3       3        4       3          3        3         3        1          3       3
k-NN (k)             1       3        3       3          4        4         3        1          3       3
PLS-DA (LV)          9       2        4       5          7        7         1        2          1       4
CART (No. var.)      2-4     7-5      4-4     1-1        5-6      7-6       2-2      1-1        2-2     3-3

Table 8
Classification results (NER%) for the benchmark datasets.

Method    Average rank   Average NER%   Apple   Biodeg   Blood   Diabetes   Digits   Itaoils   Perpot   Sediment   Sulfa   Wines
K-CM      1.3            89.2           95.8    88.1     74.0    79.1       74.9     96.5      100      91.8       93.4    98.6
k-NN      2.9            87.2           94.8    87.5     66.9    76.8       73.8     95.3      100      89.2       89.7    97.7
RF        4.9            82.9           93.6    82.9     63.8    74.2       66.2     89.8      97.7     79.7       85.0    96.5
PLS-DA    4.5            81.1           95.7    79.5     71.5    75.4       72.5     76.3      86.7     80.4       74.7    98.6
MLP       5.8            81.0           93.1    79.6     63.5    73.3       71.6     91.9      93.1     75.0       70.9    97.7
CART      7.0            76.8           89.3    79.3     65.5    66.8       67.9     83.5      96.9     65.5       64.5    89.1
Logistic  5.9            76.4           95.0    81.5     55.5    73.0       72.7     87.2      84.6     64.4       52.2    98.3
LDA       6.3            76.0           94.1    77.5     50.0    72.4       75.2     92.3      85.8     64.9       50.0    97.9
SMO       6.4            74.7           90.7    80.0     54.1    74.1       75.0     86.3      87.8     50.8       50.0    98.1

methods with the best overall performance, with average NER of 89.2% and 87.2%, respectively; b) RF, PLS-DA and MLP perform in a similar way, with NER between 81.0% and 82.9%; c) CART, Logistic, LDA and SMO show the lowest performance, with NER between 74.7% and 76.8%. Considering the average rank, the same conclusions can be drawn; in this case, however, among the lowest-ranked methods, CART moves to the last position of the ranking.

Looking at the specific datasets, some interesting observations are:

a) for Apple and Wines, all the methods perform similarly, giving on average high NER: for Apple, the lowest and highest values are 89.3% (CART) and 95.8% (K-CM), respectively; for Wines, the lowest and highest values are 89.1% (CART) and 98.6% (K-CM and PLS-DA), respectively.
b) also for the Digits and Blood datasets, all the methods perform quite similarly but, in this case, they give on average low NER, not greater than 75%; considering Digits, the best result of 75.2% is achieved by LDA and the worst results are those of RF and CART, with NER of about 67%.
c) for the Sulfa and Sediment datasets, a significant difference in method performance can be observed: K-CM and k-NN give results of about 89% for Sulfa and 93% for Sediment, while the average NER of the remaining methods for these two datasets is around 64% and 69%, respectively.
d) for the dataset Perpot, the methods are clearly partitioned into two categories: methods that perform well (K-CM, k-NN, RF and CART), with NER in the range 96.9%–100%, and methods with low performance (PLS-DA, Logistic, LDA, SMO), with NER in the range 84.6%–87.8%.

Table 9
Model predictive abilities on the Biodeg external validation set. SnNRB: model sensitivity for class NRB; SnRB: model sensitivity for class RB.

Model                                        NER%   SnNRB   SnRB
Our previous best model (Consensus 1) [42]   86.6   90.3    82.9
K-CM                                         84.4   86.7    82.1
k-NN                                         84.3   87.1    81.4
Logistic                                     83.5   89.1    77.9
SMO                                          83.1   91.1    75.0
MLP                                          80.4   87.9    72.9
PLS-DA                                       80.2   75.4    85.0
RF                                           79.2   91.9    66.4
LDA                                          78.8   89.1    68.6
CART                                         71.5   87.9    55.0

K-CM and k-NN gave in general similar results. Only for the Blood dataset did K-CM outperform k-NN, with a NER about 7% higher than that of k-NN; in all the other cases, only a slight improvement of NER was observed.

A further analysis was undertaken to evaluate the predictive ability of the models on a blind external dataset. To this end, the models developed for the Biodeg dataset were considered since, for this dataset, an additional validation set was available [30]. It consists of 194 test chemicals, which are classified as 70 ready biodegradable (RB) and 124 not ready biodegradable (NRB); from the original test set of 218 chemicals, 24 were excluded as being outside the model applicability domain (AD). In our previous study [42], the best result achieved on this dataset was an NER not greater than 87%. This result was obtained by applying a consensus strategy based on three different models. Applying K-CM, the model NER reached 84.4%, which is not far from that obtained by the previous consensus model. The predictive abilities of all the considered classifiers are summarized in Table 9; once again, these results demonstrate that K-CM, along with the classical k-NN, shows the best performance among all the tested classifiers.

8. Conclusions

In this study, the theory of a novel supervised method for pattern recognition, named K-Contractive Map (K-CM), was presented. This method exploits the variable connection weights provided by the Auto-CM neural network strategy to obtain the z-transforms on which the k-NN classifier is applied for class membership evaluation.

K-CM opens the possibility for bottom-up algorithms to provide a symbolic explanation of their learning and behavior. The symbolic level of the K-CM system is not, however, a set of naïve "If…Then" rules traditionally applied to all training data: understanding such a symbolic level is trivial, because the functional non-linearity that interpolates the training set leads to an exponential growth of the explicit rules needed to derive a description. A more interesting symbolic level is the one which allows the explanation of the "fuzzy mental map" through which the learning algorithm is represented on the basis of the training data and, simultaneously, of the subjective similarities on which the same algorithm can operate on new cases. In other words, a bottom-up algorithm can work in a symbolic way when it is able to self-generate a mental representation (weighted graph) of what is learned (training set) and to dynamically place it in a chart of new experiences (test set), so as to induce it to reorganize the initial map.

The concept of over-fitting is important in machine learning. Over-fitting occurs when a model begins to memorize training data rather


than learning to generalize from trends. Efforts directed at making machine learning models more robust are therefore of relevance. K-CM is not prone to over-fitting. The problem is solved from the beginning because the K-CM equations do not perform a simple minimization of the error function between targets and ANN output ("find a small output error in any way"); K-CM works with a complex minimization of the energy function involving all the input variables as a set of parallel constraints ("find the output only when everybody agrees").

In the comparative study, K-CM showed the best classification performance in validation for most of the considered datasets and, on average, outperformed the other classification methods. Among all the tested methods, k-NN demonstrated similar performance to K-CM, the two being based on the same modeling principles; therefore, it can be concluded that all the methods that exploit the k-NN strategy are reliable methods for classification, and the novel K-CM can improve the classification results especially in those cases where non-linear relationships among variables are relevant.

Conflict of interest

There are no conflicts of interest.

References

[1] T. Hastie, R. Tibshirani, J.H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2009.
[2] B.K. Lavine, W.S. Rayens, Classification: basic concepts, in: S.D. Brown, R. Tauler, B. Walczak (Eds.), Comprehensive Chemometrics, Elsevier, Amsterdam (The Netherlands), 2014.
[3] V. Vapnik, The Support Vector Machine of Function Estimation, in: J.A.K. Suykens, J. Vandewalle (Eds.), Kluwer Academic Publishers, Boston (MA, USA), 1998.
[4] B.R. Kowalski, C.F. Bender, The k-nearest neighbor classification rule (Pattern Recognition) applied to nuclear magnetic resonance spectral interpretation, Anal. Chem. 44 (1972) 1405–1411.
[5] S. Wold, Pattern recognition by means of disjoint principal component models, Pattern Recogn. 8 (1976) 127–139.
[6] D. Coomans, M.P. Derde, I. Broeckaert, D.L. Massart, Potential methods in pattern recognition, Anal. Chim. Acta 133 (1981) 241–250.
[7] D.J. Hand, Discrimination and Classification, Wiley, Chichester (UK), 1981.
[8] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth, Inc., Monterey, CA (USA), 1984.
[9] M. James, Classification Algorithms, Collins, London (UK), 1985.
[10] M.P. Derde, D.L. Massart, UNEQ: a disjoint modelling technique for pattern recognition based on normal distribution, Anal. Chim. Acta 184 (1986) 33–51.
[11] I.E. Frank, DASCO: a new classification method, Chemometr. Intell. Lab. Syst. 4 (1988) 215–222.
[12] I.E. Frank, J.H. Friedman, Classification: oldtimers and newcomers, J. Chemom. 3 (1989) 463–475.
[13] D. Ballabio, R. Todeschini, Multivariate classification for qualitative analysis, in: Da-Wen Sun (Ed.), Infrared Spectroscopy for Food Quality Analysis and Control, Elsevier, Amsterdam (The Netherlands), 2009.
[14] M. Buscema, M. Breda, W. Lodwick, Training with input selection and testing (TWIST) algorithm: a significant advance in pattern recognition performance of machine learning, J. Intell. Learn. Syst. Appl. 5 (2013) 29–38.
[15] M. Buscema, Supervised ANNs and Artificial Organisms, Ver. 22.0, Semeion Software, Rome (Italy), 1999.
[16] M. Buscema, E. Grossi, D. Snowdon, P. Antuono, Auto-contractive maps: an artificial adaptive system for data mining. An application to Alzheimer disease, Curr. Alzheimer Res. 5 (2008) 481–498.


[17] M. Buscema, E. Grossi, The semantic connectivity map: an adapting self-organizing knowledge discovery method in data bases, Int. J. Data Min. Bioinform. 2 (2008) 362–404.
[18] M. Buscema, P.L. Sacco, Auto-contractive maps, the H function, and the maximally regular graph (MRG): a new methodology for data mining, in: V. Capecchi (Ed.), Applications of Mathematics in Models, Artificial Neural Networks and Arts, Springer Science + Business Media B.V., 2010.
[19] J.B. Kruskal, On the shortest spanning subtree of a graph and the traveling salesman problem, Proc. Am. Math. Soc. 7 (1956) 48–50.
[20] G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley, New York (NY, USA), 1992.
[21] J. Platt, Fast training of support vector machines using sequential minimal optimization, in: B. Schoelkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods — Support Vector Learning, MIT Press, Cambridge (MA, USA), 1998.
[22] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, K.R.-K. Murthy, Improvements to Platt's SMO algorithm for SVM classifier design, Neural Comput. 13 (2001) 637–649.
[23] S.S. Keerthi, E.G. Gilbert, Convergence of a generalized SMO algorithm for SVM classifier design, Mach. Learn. 46 (2002) 351–360.
[24] L. Ståhle, S. Wold, Partial least squares analysis with cross-validation for the two-class problem: a Monte Carlo study, J. Chemometr. 1 (1987) 185–196.
[25] S. Cessie, J.C. van Houwelingen, Ridge estimators in logistic regression, Appl. Stat. 41 (1992) 191–201.
[26] D.W. Hosmer, S. Leneshow, Applied Logistic Regression, 2nd ed., Wiley, New York (NY, USA), 2000.
[27] R. Collobert, S. Bengio, Links between perceptrons, MLPs and SVMs, Proc. of 21st Int. Conf. on Machine Learning (ICML), 2004.
[28] L. Breiman, Random forest, Mach. Learn. 45 (2001) 5–32.
[29] D. Ballabio, V. Consonni, F. Costa, Relationships between apple texture and rheological parameters by means of multivariate analysis, Chemometr. Intell. Lab. Syst. 111 (2012) 28–33.
[30] K. Mansouri, T. Ringsted, D. Ballabio, R. Todeschini, V. Consonni, Quantitative structure-activity relationship models for ready biodegradability of chemicals, J. Chem. Inf. Model. 53 (2013) 867–878.
[31] K.A. Baggerly, J.S. Morris, S.R. Edmonson, K.R. Coombes, Signal in noise: evaluating reported reproducibility of serum proteomic tests for ovarian cancer, J. Natl. Cancer Inst. 97 (2005) 307–309.
[32] J.W. Smith, J.E. Everhart, W.C. Dickson, W.C. Knowler, R.S. Johannes, Using the ADAP learning algorithm to forecast the onset of diabetes mellitus, Proc. Symp. Comput. Appl. Med. Care 9 (1988) 261–265.
[33] R. Todeschini, D. Ballabio, V. Consonni, A. Mauri, M. Pavan, CAIMAN (classification and influence matrix analysis): a new approach to the classification based on leverage-scaled functions, Chemometr. Intell. Lab. Syst. 87 (2007) 3–17.
[34] M. Forina, C. Armanino, S. Lanteri, E. Tiscornia, Classification of olive oils from their fatty acid composition, Food Research and Data Analysis, Applied Science Publishers, London (UK), 1983.
[35] M. Forina, Artificial Data Set, University of Genoa, 2005.
[36] M. Alvarez-Guerra, D. Ballabio, J.M. Amigo, J.R. Viguri, R. Bro, A chemometric approach to the environmental problem of predicting toxicity in contaminated sediments, J. Chemom. 24 (2010) 379–386.
[37] Y. Miyashita, Y. Takahashi, C. Takayama, T. Ohkubo, K. Fumatsu, S. Sasaki, Computer-assisted structure/taste studies on sulfamates by pattern recognition methods, Anal. Chim. Acta 184 (1986) 143–149.
[38] M. Forina, C. Armanino, M. Castino, M. Ubigli, Multivariate data analysis as discriminating method of the origin of wines, Vitis 25 (1986) 189–201.
[39] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutmann, I.H. Witten, The WEKA data mining software: an update, SIGKDD Explorations, Association for Computing Machinery, 2009.
[40] D. Ballabio, V. Consonni, Classification tools in chemistry. Part 1: linear models. PLS-DA, Anal. Methods 5 (2013) 3790–3798.
[41] J. Zupan, M. Novič, X. Li, J. Gasteiger, Classification of multicomponent analytical data of olive oils using different neural networks, Anal. Chim. Acta 292 (1994) 219–234.
[42] F. Sahigara, D. Ballabio, R. Todeschini, V. Consonni, Assessing the validity of QSARs for ready biodegradability of chemicals: an applicability domain perspective, Curr. Comput. Aided Drug Des. (2014), http://dx.doi.org/10.2174/1573409910666140410110241.
Castino, M. Ubigli, Multivariate data analysis as discriminating method of the origin of wines, Vitis 25 (1986) 189–201. [39] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutmann, I.H. Witten, The WEKA data mining software: an update, SIGKDD Explorations, Associaton for Computing Machinery, 2009. [40] D. Ballabio, V. Consonni, Classification tools in chemistry. Part 1: linear models. PLSDA, Anal. Methods 5 (2013) 3790–3798. [41] J. Zupan, M. Novič, X. Li, J. Gasteiger, Classification of multicomponent analytical data of olive oils using different neural networks, Anal. Chim. Acta. 292 (1994) 219–234. [42] F. Sahigara, D. Ballabio, R. Todeschini, V. Consonni, Assessing the validity of QSARs for ready biodegradability of chemicals: an applicability domain perspective, Curr. Comput. Aided Drug Des. (2014), http://dx.doi.org/10.2174/1573409910666140410110241.