Artificial Neural Networks for Nonlinear Regression and Classification

Alberto Landi, Paolo Piaggi

Marco Laurino, Danilo Menicucci

Dep. of Electrical Systems and Automation, University of Pisa, Italy. e-mail: [email protected], [email protected]

Extreme Centre, Scuola Superiore Sant'Anna, Pisa, Italy. e-mail: [email protected]

Abstract— Linear regression and classification techniques are very common in statistical data analysis, but they often extract only linear models from the data, which can be a limitation in real-data contexts. The aim of this study is to build an innovative procedure to overcome this limitation. Initially, a multiple linear regression analysis using the best-subset algorithm was performed to determine the variables that best predict the dependent variable. Based on the same selected variables, Artificial Neural Networks were employed to improve the prediction of the linear model, taking advantage of their nonlinear modeling capability. The linear and nonlinear models were compared in their classification (ROC curves) and prediction (cross-validation) tasks: the nonlinear model fitted the data better (36% vs. 10% of variance explained for the nonlinear and linear models, respectively) and provided more reliable accuracy and misclassification rates (70% and 30% vs. 66% and 34%, respectively).

Keywords—artificial neural networks; nonlinear regression; classification

I. INTRODUCTION

Methods in multivariate statistical analysis are essential for working with large amounts of biomedical data. In classical multivariate statistical analysis, there is a hierarchy of linear methods, including multiple linear regression and linear classification. The common drawback of these classical methods is that only linear structures can be correctly extracted from the data. This paper presents an innovative procedure for building a nonlinear regression model using Artificial Neural Networks (ANN). The performance of this nonlinear approach, in terms of prediction and classification, was evaluated and compared with a standard linear model on a real clinical dataset including more than 150 patients. Each patient record was composed of thirty independent (input, explanatory) variables and one dependent (output, response) variable, representing the percentage of success after the therapeutic action.

II. METHODS

At first, a best-subset algorithm applied to a multiple linear regression model was used to select the most significant predictors of the dependent variable. In this way, the number of variables used for the prediction was greatly reduced. A set of standard classification techniques (k-Nearest-Neighbor, Artificial Neural Networks for pattern recognition, Support Vector Machines and Naive Bayes classifiers) was applied in order to evaluate the classification performance of the selected independent variables: to this purpose, patients were divided into two groups according to quartiles of the dependent variable. An ad-hoc ANN was employed to perform a nonlinear regression using the selected independent variables and the dependent variable: a specific cost function provided a nonlinear formula achieving the best correlation between the dependent variable and the selected predictors. Finally, the linear and the nonlinear models were compared in standard prediction and classification tasks.

A. Best-subset algorithm and multiple linear regression model

As a first step, in order to explore multiple candidate models for classification, a standard multiple linear regression model [1] based on a best-subset algorithm [2] was determined. To avoid multicollinearity among the independent variables, the variable with the highest Variance Inflation Factor (VIF) was discarded [3]. This procedure was repeated until all remaining explanatory variables had a VIF score of less than five, corresponding to multicollinearity R_m^2 < 0.8, where R_m^2 is the coefficient of determination of the linear regression model calculated using that specific variable as the dependent variable and

VIF = \frac{1}{1 - R_m^2}    (1)

A best-subset (i.e., brute-force search) regression algorithm was then adopted. All possible combinations of explanatory variables (from one to four) were evaluated in the multiple linear regression framework with the same dependent variable, and the values of R^2, adjusted R^2 and p-value for each linear model were extracted. Among all models, the one with the highest adjusted R^2, the smallest standard deviation of the residuals and a p-value less than 0.05 was chosen. In order to obtain a robust model, only subsets with a number of independent variables ranging from 1 to 4 were considered, for computational reasons. Furthermore, the parsimony principle, stating that it is preferable to select the model with the smallest number of variables, was adopted. The regression coefficients β of the final model were a measure of the linear relationship between each independent variable and the dependent variable when the influences of the other predictors were partialled out or held constant.
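As an illustration of the two selection steps just described (VIF-based screening, Eq. (1), and exhaustive best-subset search), the following is a minimal NumPy sketch. It is not the authors' code: the omission of the p-value and residual-deviation checks and the adjusted-R^2 selection criterion as sole tie-breaker are simplifying assumptions.

```python
import itertools
import numpy as np

def r_squared(X, y):
    """Coefficient of determination of an OLS fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / (((y - y.mean()) ** 2).sum())

def drop_collinear(X, vif_limit=5.0):
    """Iteratively drop the column with the largest VIF until all VIFs are below vif_limit."""
    cols = list(range(X.shape[1]))
    while len(cols) > 1:
        vifs = []
        for j in cols:
            others = [c for c in cols if c != j]
            r2m = r_squared(X[:, others], X[:, j])   # regress column j on the remaining predictors
            vifs.append(1.0 / (1.0 - r2m))           # Eq. (1)
        worst = int(np.argmax(vifs))
        if vifs[worst] < vif_limit:
            break
        cols.pop(worst)
    return cols

def best_subset(X, y, max_vars=4):
    """Exhaustive search over subsets of 1..max_vars predictors; keep the highest adjusted R^2."""
    n = len(y)
    best = (-np.inf, None)
    for k in range(1, max_vars + 1):
        for subset in itertools.combinations(range(X.shape[1]), k):
            r2 = r_squared(X[:, list(subset)], y)
            adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
            if adj_r2 > best[0]:
                best = (adj_r2, subset)
    return best
```

Typical use, under the same assumptions, would be `kept = drop_collinear(X)` followed by `adj_r2, subset = best_subset(X[:, kept], y)`.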


The t-ratio test was used to evaluate the significance of each regression coefficient β. Finally, the standardized dependent variable was predicted through a formula based on the linear combination of the standardized regression coefficients β.

B. Standard classification methods

In order to validate the selected independent variables, they were applied to a more general classification task. We considered the variables as inputs for four well-known classification algorithms [4]. In this way, we could evaluate the information content of each variable and their relationships. In this study we used nonparametric classification techniques based on supervised machine learning. The proposed classifiers were:
1. k-Nearest-Neighbors (k-NN), for k=1 and k=5.
2. Artificial Neural Networks for pattern recognition (ANNpr).
3. Support Vector Machines (SVM), with linear and radial basis function (rbf) kernels.
4. Naive Bayes (nB) classifiers, in a kernel configuration.

The k-NN classifier assigns a test sample to the class of the majority of its k closest neighbors. The ANNpr classifier adapts the neural network framework to pattern recognition: each computational neuron calculates the sum of its inputs and applies a sigmoidal activation function to its output; the neurons were organized in three layers, and the back-propagation algorithm was used for training. SVMs are probably the most important development in supervised classification techniques of recent years: the main idea of SVM is to construct a hyperplane as a decision surface in such a way that the margin of separation between the two classes is maximized. The nB classifier is a probabilistic classifier based on the Bayes theorem; it assumes that the predictor variables are independent random variables, which makes it possible to compute the probabilities required by the Bayes formula from a training set.

For our purpose, all classifiers were trained for a binary classification task: we divided the clinical dataset into two target classes, using the first quartile of the dependent variable as the threshold.
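The classifier comparison above could be reproduced, for example, with scikit-learn; the paper does not state which implementation was used, so the estimators, hyperparameters, hold-out split and the Gaussian naive Bayes stand-in for the kernel naive Bayes are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# X: (n_patients, 4) selected predictors; y_cont: continuous dependent variable.
# Binary target as in the paper: positive class = score above the first quartile.
def make_binary_target(y_cont):
    return (y_cont > np.quantile(y_cont, 0.25)).astype(int)

classifiers = {
    "k-NN (k=1)":   KNeighborsClassifier(n_neighbors=1),
    "k-NN (k=5)":   KNeighborsClassifier(n_neighbors=5),
    "ANNpr":        MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (rbf)":    SVC(kernel="rbf"),
    "nB":           GaussianNB(),  # Gaussian stand-in for the kernel naive Bayes of the paper
}

def evaluate(X, y_cont, test_size=0.3, seed=0):
    """Train each classifier on a stratified split and report its true positive rate."""
    y = make_binary_target(y_cont)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              stratify=y, random_state=seed)
    scores = {}
    for name, clf in classifiers.items():
        clf.fit(X_tr, y_tr)
        pred = clf.predict(X_te)
        tp = np.sum((pred == 1) & (y_te == 1))      # correctly predicted positives
        scores[name] = tp / np.sum(y_te == 1)
    return scores
```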

C. Neural networks based nonlinear regression model: architecture and learning algorithm

The ANN [5-8] model used in this study was a Multi-Layer Perceptron (MLP), a feed-forward neural network for mapping sets of input data onto a set of appropriate outputs. An MLP is characterized by three layers of neurons (input layer, hidden layer and output layer) with nonlinear activation functions at the hidden layer units. In this study, in order to identify the best correlation between the independent variables and the dependent variable, two feed-forward MLPs were used: one for the nonlinear mapping of the independent variables (x) into a single predicted score named u, the other for the linear mapping of the dependent variable (y) into a predicted score named v. For computational reasons, the input variables were initially standardized by removing the mean value and dividing by the standard deviation of each variable. These two networks simultaneously mapped the inputs x and y to the scores u and v, respectively. A particular cost function forced the correlation between u and v to be maximized by finding the optimal values of the weights and biases.

In the first MLP (Fig. 1), the input layer consisted of the variables considered statistically significant by the previous best-subset algorithm; the hidden layer was characterized by a number of hidden neurons (e.g., five in Fig. 1) with a nonlinear sigmoid function, and the output layer was composed of a single output neuron (the nonlinear score u). In the second MLP (Fig. 2), a linear mapping of the dependent variable was performed: the linear mapping (i.e., using a linear activation function in the hidden layer units) was chosen to simplify the optimization algorithm and to allow comparison with the results obtained using linear models (based on standard multiple regression).

Figure 1. Architecture of the MLP model for calculating the nonlinear score u. The hyperbolic tangent function at the hidden layer units gives nonlinearity to the entire network.


Figure 2. Architecture of the MLP model for calculating the linear score v. The identity function at the hidden layer units yields a linear transformation of the dependent variable.


For both MLPs, the number of hidden neurons was determined through a trial-and-error process, following a general principle of parsimony, because no commonly accepted theory exists for determining the optimal number of neurons in the hidden layer: in detail, several runs (i.e., trainings of the MLP) with an increasing number of neurons were performed, and the number of hidden neurons was fixed when the correlation between u and v did not improve appreciably by adding further hidden units. For both MLPs, the input vectors x and y were mapped to the hidden-layer neuron vectors h_x and h_y as follows:

h_x = \tanh(W_x \cdot x + b_x), \qquad h_y = W_y \cdot y + b_y    (2)

where W_x and W_y are the weight matrices between the input layer and the hidden layer, and b_x and b_y are the bias vectors of the hidden layer units. The scores u and v were obtained from a linear combination of the hidden neuron vectors h_x and h_y, respectively, with

u = \tilde{W}_x \cdot h_x + \tilde{b}_x, \qquad v = \tilde{W}_y \cdot h_y + \tilde{b}_y    (3)
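A minimal NumPy illustration of the mappings (2)-(3), with four inputs and five hidden units as in Fig. 1. The weights here are random placeholders (in the actual method they result from the optimization described below), and the features-by-cases orientation of the data matrices is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_inputs, n_hidden = 150, 4, 5

# Standardized data, one column per patient (orientation is an assumption).
x = rng.standard_normal((n_inputs, n_samples))   # selected independent variables
y = rng.standard_normal((1, n_samples))          # dependent variable

# First MLP: nonlinear mapping of x to the score u, Eq. (2)-(3).
W_x, b_x = rng.standard_normal((n_hidden, n_inputs)), rng.standard_normal((n_hidden, 1))
Wt_x, bt_x = rng.standard_normal((1, n_hidden)), rng.standard_normal((1, 1))
h_x = np.tanh(W_x @ x + b_x)        # hidden layer with hyperbolic tangent activation
u = (Wt_x @ h_x + bt_x).ravel()     # nonlinear score u, one value per patient

# Second MLP: linear mapping of y to the score v (identity activation in the hidden layer).
W_y, b_y = rng.standard_normal((n_hidden, 1)), rng.standard_normal((n_hidden, 1))
Wt_y, bt_y = rng.standard_normal((1, n_hidden)), rng.standard_normal((1, 1))
h_y = W_y @ y + b_y                 # linear hidden layer
v = (Wt_y @ h_y + bt_y).ravel()     # linear score v
```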

To maximize the correlation between u and v, the specific cost function J = −corr(u, v) (where corr is the Pearson correlation coefficient) was minimized by finding the optimal values of W_x, W_y, b_x, b_y, \tilde{W}_x, \tilde{W}_y, \tilde{b}_x and \tilde{b}_y. In addition, we applied the constraints \langle u^2 \rangle = \langle v^2 \rangle = 1 and \langle u \rangle = \langle v \rangle = 0 (zero mean and unit variance for both scores), which were inserted into a modified cost function J_m:

J_m = -\mathrm{corr}(u, v) + |\langle u \rangle| + |\langle v \rangle| + |\langle u^2 \rangle - 1| + |\langle v^2 \rangle - 1|    (4)

A quasi-Newton algorithm carried out the nonlinear optimization. Because of the well-known problem of multiple local minima in the MLP cost function, there was no guarantee that the optimization algorithm would reach the global minimum: hence a number of runs mapping from (x, y) to (u, v) with random initial parameters were performed. The number of runs was fixed at 200, and the run attaining the lowest value of J_m was selected as the final solution. MLPs may suffer from overfitting [5]: to overcome this pitfall, 20% of the data were randomly selected as validation data and withheld from the training set of the MLPs; runs in which the correlation between u and v was lower for the validation data than for the training data were rejected, to avoid overfitted solutions.
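The following is a compact sketch of this training procedure (modified cost J_m of Eq. (4), BFGS quasi-Newton optimization, 200 random restarts, 20% validation holdout), assuming x of shape (n_inputs × n_cases) and y of shape (1 × n_cases). The parameter packing, the initialization scale and the choice of five hidden units are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

N_HIDDEN = 5  # illustrative; chosen by trial and error in the paper

def unpack(theta, n_in, n_hid=N_HIDDEN):
    """Split the flat parameter vector into the weights and biases of both MLPs."""
    shapes = [(n_hid, n_in), (n_hid, 1), (1, n_hid), (1, 1),   # W_x, b_x, W~_x, b~_x
              (n_hid, 1),   (n_hid, 1), (1, n_hid), (1, 1)]    # W_y, b_y, W~_y, b~_y
    parts, i = [], 0
    for r, c in shapes:
        parts.append(theta[i:i + r * c].reshape(r, c))
        i += r * c
    return parts

def scores(theta, x, y):
    """Forward pass of both networks, Eq. (2)-(3)."""
    W_x, b_x, Wt_x, bt_x, W_y, b_y, Wt_y, bt_y = unpack(theta, x.shape[0])
    u = Wt_x @ np.tanh(W_x @ x + b_x) + bt_x     # nonlinear branch
    v = Wt_y @ (W_y @ y + b_y) + bt_y            # linear branch
    return u.ravel(), v.ravel()

def modified_cost(theta, x, y):
    """Modified cost J_m of Eq. (4): maximize corr(u, v) with zero-mean, unit-variance scores."""
    u, v = scores(theta, x, y)
    corr = np.corrcoef(u, v)[0, 1]
    return (-corr + abs(u.mean()) + abs(v.mean())
            + abs((u ** 2).mean() - 1.0) + abs((v ** 2).mean() - 1.0))

def train(x, y, n_runs=200, seed=0):
    """Multi-restart quasi-Newton (BFGS) optimization with a 20% validation holdout."""
    rng = np.random.default_rng(seed)
    n_in, n_cases = x.shape
    n_par = N_HIDDEN * (n_in + 5) + 2            # total number of weights and biases
    perm = rng.permutation(n_cases)
    val, tr = perm[: n_cases // 5], perm[n_cases // 5:]
    best_cost, best_theta = np.inf, None
    for _ in range(n_runs):
        theta0 = 0.1 * rng.standard_normal(n_par)
        res = minimize(modified_cost, theta0, args=(x[:, tr], y[:, tr]), method="BFGS")
        u_tr, v_tr = scores(res.x, x[:, tr], y[:, tr])
        u_va, v_va = scores(res.x, x[:, val], y[:, val])
        # reject overfitted runs: validation correlation lower than training correlation
        if np.corrcoef(u_va, v_va)[0, 1] < np.corrcoef(u_tr, v_tr)[0, 1]:
            continue
        if res.fun < best_cost:
            best_cost, best_theta = res.fun, res.x
    return best_theta, best_cost
```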

This ANN architecture extended the regression performance of the linear model, which can still be recovered by replacing the nonlinear activation functions with identity functions, removing the nonlinear capability of the model.

D. Classification and prediction

The predictive performance of both models (linear and nonlinear) was evaluated by calculating the true positive fraction (TPF, or sensitivity) and the false positive fraction (FPF, equal to one minus the specificity). To this purpose, cases were divided into two groups according to quartiles of the dependent variable: cases with a score within the three highest quartiles were arbitrarily assigned to the positive group, while cases with a score within the lowest quartile were assigned to the negative group. Sensitivity was defined as the rate of cases correctly predicted in the positive group over those actually belonging to the positive group; specificity was defined as the rate of cases correctly predicted in the negative group over those actually belonging to the negative group. The sensitivity and specificity of both predicted scores (obtained from the linear and nonlinear models) in relation to the actual values were plotted for each possible cutoff of the predictive score in the so-called Receiver Operating Characteristic (ROC) curves, and the Area Under each ROC Curve (AUC) was estimated [4]. The AUC measures the discriminating accuracy of the linear or nonlinear model (i.e., the ability of the model to correctly classify cases into the positive or the negative group).

In order to check the generalization capability of both models on future, as-yet-unseen data, a k-fold cross-validation was also carried out [5]. In the k-fold cross-validation, the original dataset was partitioned into k subsets: for each cross-validation step, a single subset was retained as the test set and the remaining k−1 subsets were used as the training set. Cases were divided into four groups according to quartiles of the dependent variable: if the model correctly identified the quartile of the dependent variable, this was considered a success, otherwise a failure. The cross-validation process was then repeated k times, with each of the k subsets used exactly once as the test set. The whole k-fold procedure was repeated 100 times and the results were averaged to produce a global estimate expressed by a 4×4 confusion matrix. In this work a 3-fold cross-validation was applied, where each fold consisted of randomly selected samples.
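A sketch of this evaluation step, under assumptions: the AUC is computed with scikit-learn on the binary grouping defined above, and the repeated 3-fold cross-validation accumulates a 4×4 confusion matrix over quartile labels. The `fit_predict` callable and the use of the predicted scores' own quartiles are illustrative choices not specified in the paper.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def quartile_labels(values):
    """Assign each case to one of four groups according to the quartiles of `values`."""
    q = np.quantile(values, [0.25, 0.5, 0.75])
    return np.digitize(values, q)                 # integer labels 0..3

def auc_for_model(y, predicted_score):
    """AUC of a predicted score for the binary task (lowest quartile = negative group)."""
    positive = (y > np.quantile(y, 0.25)).astype(int)
    return roc_auc_score(positive, predicted_score)

def repeated_kfold_confusion(fit_predict, X, y, k=3, repeats=100, seed=0):
    """k-fold cross-validation repeated `repeats` times, averaged into a 4x4 confusion matrix.

    `fit_predict(X_train, y_train, X_test)` must return predicted scores for X_test."""
    rng = np.random.default_rng(seed)
    confusion = np.zeros((4, 4))
    actual_q = quartile_labels(y)
    for _ in range(repeats):
        folds = np.array_split(rng.permutation(len(y)), k)
        for f in range(k):
            test = folds[f]
            train = np.concatenate([folds[j] for j in range(k) if j != f])
            pred = fit_predict(X[train], y[train], X[test])
            pred_q = quartile_labels(pred)        # quartile of the predicted score (an assumption)
            for a, p in zip(actual_q[test], pred_q):
                confusion[a, p] += 1
    return confusion / repeats
```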


III. RESULTS

A. Best-subset algorithm and linear regression model

As a result of the best-subset algorithm, an ensemble of four independent variables (named Var. 1 to Var. 4) was selected. The four independent variables accounted for about 10% of the variance of the dependent variable in the multiple linear regression model; the Pearson correlation coefficient r, the coefficient of determination R^2 and the p-value are shown in Table I. Table II illustrates that the VIF values for this model varied between 1.418 for Var. 3 and 1.007 for Var. 1, far below the recommended level [9]: therefore, the VIF values suggested that the independent variables included in this model did not suffer from multicollinearity. A predicted standardized score was calculated through the formula:

−0.148·(Var. 1) − 0.239·(Var. 2) − 0.26·(Var. 3) + 0.182·(Var. 4)    (5)

based on the standardized β coefficients of the multiple regression model. A simple regression analysis was then conducted with the actual standardized score as the dependent variable and the predicted standardized score from (5) as the independent variable (Fig. 3).

B. Classification methods

The results of the classification stage, using the best subset of the four selected independent variables (Var. 1 to 4), are reported in Table III. The performance of each classifier was evaluated by the true positive rate.

Figure 3. Linear regression model. The figure shows predicted linear values on the x-axis vs. actual scores on the y-axis with the best-fit solid line; vertical and horizontal dotted lines denote quartiles of the predicted and actual scores, respectively, and black crosses indicate the centroids of the first and the last quartiles.

TABLE III. RESULTS OF CLASSIFICATION WITH SELECTED INDEPENDENT VARIABLE SUBSET

k-NN (k=1): 67.9%
k-NN (k=5): 73.1%
ANNpr: 72.6%
SVM (linear): 75.7%
SVM (rbf): 73.3%
nB: 74.6%

Table III illustrates that each classifier showed a good performance rate, especially the linear SVM classifier. These results supported our independent variable selection algorithm, reflecting the high information content of the selected variable subset.

C. Multi-layer perceptron based nonlinear model

TABLE I. SUMMARY OF LINEAR AND NONLINEAR MODELS

Model       r       R^2
Linear      0.326   0.107
Nonlinear   0.604   0.365

TABLE II. Indep. Var. Var. 1 Var. 2 Var. 3 Var. 4 * = p