
Proceedings of the 1st Annual World Conference of the Society for Industrial and Systems Engineering, Washington, D.C, USA September 16-18, 2012

Multivariate Analysis of Breast Cancer Prediction Parameters

Benjamin Schleich, Hema Sudarsanam, and Chanchal Saha
Department of Systems Science and Industrial Engineering
State University of New York at Binghamton
Binghamton, New York 13902

Corresponding author's Email: [email protected]

Author Note: The authors are with the Department of Systems Science and Industrial Engineering (SSIE) and work in the Watson Institute for Systems Excellence (WISE) and the Center for Autonomous Solar Power (CASP). The authors thank SSIE, WISE, and CASP for providing information, resources, and continuous support throughout the study period. Benjamin Schleich’s (corresponding author) tel.: +1-719-214-4924.

Abstract: This study explores breast cancer data with multivariate techniques. Breast cancer is the second leading fatal disease among women living in the USA, causing roughly 100 deaths per day. Therefore, this study aims to identify the independent decision parameters in the data set that account for the correct prediction of breast cancer. Logistic Regression Analysis (LRA) is performed to assess the probability of a correct diagnosis. A Principal Component Analysis (PCA) is also performed to determine whether the number of decision parameters can be reduced. The experimental results indicate that the LRA achieves a very high rate of correct validation, while the PCA suggests that the number of principal components can be reduced significantly. In the future, Factor Analysis (FA) can be conducted to find the latent relations among the variables, which might give a better understanding of the correlations among them.

Keywords: Breast cancer, Logistic regression analysis, Principal component analysis

1. Introduction

According to 2012 U.S. breast cancer statistics, breast cancer is one of the most common cancers among American women and the second leading fatal disease in women. About 39,520 women in the United States are reported to die of breast cancer. The number of deaths due to breast cancer has declined since the 1990s, mainly because of earlier diagnosis of the disease (Dong et al., 2011). Early diagnosis is considered the most important factor in reducing breast cancer mortality. Breast cancer diagnosis is a matter of high social and economic concern, which explains why many researchers work on more accurate models for the early prediction of breast cancer.

This study aims to identify the independent variables that help to predict breast cancer correctly. Historical data provided by the University of Wisconsin Hospital were analyzed. The data consist of nine independent variables (X1-X9) that describe the characteristics of the cancer and one dependent variable that identifies the type of tumor (malignant or benign). A malignant tumor is more dangerous than a benign tumor (Gregory, 2012). The variables X1 to X9 describe clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses, in that order.

In this study, Logistic Regression Analysis (LRA) is performed on the data to predict the probability of failure in breast cancer assessment. Principal Component Analysis (PCA) is conducted to reduce the number of variables needed for the diagnosis or prediction of breast cancer. LRA and PCA were chosen because the data showed a non-normal distribution.

The data were collected in eight samples (groups) recorded chronologically from 1989 to 1991. To eliminate dependences among groups, only the data from group one, which contains the largest pool of data among the eight samples, were used. Group one has 367 instances (cases) with a few missing values, nine independent variables, and one dependent variable. The dependent variable supports the diagnosis of breast cancer by indicating whether a tumor is benign or malignant, and the independent variables describe biological characteristics of the cells such as clump thickness, uniformity of cell size and shape, marginal adhesion, size of a single epithelial cell, bare nuclei, bland chromatin, and rate of cell division (mitoses). It is assumed that the observations in group one are independent of each other, so cases with missing values can be eliminated; after elimination, 353 instances were retained for further analysis.

ISBN: 97819384960-0-4


The remainder of this study is structured as follows: A brief review of literature is presented in Section 2. Methodology is depicted in Section 3. Experimental results are presented in Section 4. Finally, conclusions and future directions of the research are addressed in Section 5.

2. Literature Review

LRA has been applied in many fields. For example, Dong et al. (2011) predicted the failure probability of landslide dams, an important topic because an accurate prediction can save lives. Their proposed logistic model outperformed traditional models such as the geomorphic index (DBI), which was commonly used in this domain. While other approaches had a higher failure rate in prediction and did not generate clear results, the proposed LRA model performed very well in validation on several case studies. Garipagaoglu et al. (2009) studied obesity risk factors for Turkish children and also proposed an LRA model, which identified significant risk factors for obesity. Software testing is another area where LRA has been applied (Salem et al., 2004). The goal of that approach was to reduce the cost and time of software testing. However, the experimental results show that the predictability of the model was between 80% and 90%, which was judged insufficient to ensure high quality, reliability, and sustainability of software programs. The approach in this study is similar to that of Dong et al. (2011), as LRA is applied to identify which independent variables account for the correct prediction of breast cancer.

PCA has been used for dimensionality reduction of various medical diagnosis data sets so that the data can be processed further for study and a better understanding of the diseases. In 2007, Polat and Gunes used PCA to reduce the dimension of lung cancer data to 4 principal components that capture the maximum variation of the initial 57 input variables. A PCA-based technique has been reported to help in the accurate prediction of impedance cardio-vasography (which serves as a non-destructive screening procedure before the expensive and painful angiographic studies) for the diagnosis of Leriche’s syndrome (Karamchandani, 2009). The PCA technique eliminated the correlated information, which helped to improve the estimation performance for Leriche’s syndrome. PCA has also been used in a similar data pre-processing stage to capture the important features for the diagnosis of other diseases such as atherosclerosis and lung cancer (Latifoglu et al., 2008).

3. Methodology

LRA creates models with categorical and continuous variables and therefore, it can handle data where the multivariate normality assumption does not hold (Sharma, 1997). However, LRA can only be used when the predictor variables are independent. A set of independent variables is denoted by the vector $x' = (x_1, x_2, \ldots, x_n)$ and the linear combination of those independent variables can be written by the following equations:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \tag{1}$$

$$f(z) = \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}} = P_S \tag{2}$$

$$P_F = 1 - P_S = \frac{e^{-z}}{1 + e^{-z}} \tag{3}$$
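As a rough numerical illustration of Equations (1) to (3), the following Python sketch evaluates the logit and the two probabilities. The function names, coefficients, and input values are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math

def logit(beta0, betas, x):
    """Equation (1): z = beta0 + sum of beta_i * x_i."""
    return beta0 + sum(b * xi for b, xi in zip(betas, x))

def prob_success(z):
    """Equation (2): P_S = 1 / (1 + e^(-z)), arranged to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def prob_failure(z):
    """Equation (3): P_F = 1 - P_S = e^(-z) / (1 + e^(-z))."""
    return 1.0 - prob_success(z)

# Hypothetical coefficients and one hypothetical observation (illustration only).
beta0, betas = -2.0, [0.5, 0.25]
x = [3.0, 2.0]

z = logit(beta0, betas, x)   # -2.0 + 1.5 + 0.5 = 0.0
p_s = prob_success(z)        # 0.5 when z = 0
p_f = prob_failure(z)        # 0.5 when z = 0
print(z, p_s, p_f)
```

A logit of zero corresponds to equal probabilities of success and failure; positive values of z push the probability of success toward one, negative values toward zero.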

Here, the variable $z$ is referred to as the logit and is a measure of the total contribution of all predictor variables. The logistic function is given in Equation (2); it can only take values between zero and one. Equation (2) is usually referred to as the probability of success ($P_S$). The probability of failure ($P_F$) is expressed in Equation (3).

PCA, on the other hand, can be conducted on a mean-corrected or standardized data set. Standardizing the data is generally advisable because it removes the effect of differing variances among the variables. If the variables in the data set are strongly correlated, the new Principal Components (PCs) can be interpreted well. The new variables are uncorrelated linear combinations of the original variables (Sharma, 1996). Therefore, it is necessary to calculate the variance and correlation matrices of the original variables. The next step is to rotate the axes so that the first PC has the maximum variance. The eigenvalues are the variances of the new variables, the eigenvectors are the weights used to form the PC equations, and the PC scores are the numerical values of the PCs. The squared weights of each eigenvector sum to one and, because the PCs are orthogonal, the sum of the cross products of the weights of any two eigenvectors is zero.

The number of PCs to retain can be decided in several ways. For standardized data, PCs whose eigenvalues are greater than one are retained. For both mean-corrected and standardized data, the number can be read from the scree plot (eigenvalue vs. PC number) by looking for the elbow, or it can be obtained by retaining the statistically important components. Another important parameter in PCA is the loading, which is the simple correlation between an original variable and a new variable; original variables with a higher loading (> 0.5) have more influence on the PCs.

These key parameters can be calculated manually or with the help of software. In this study, SAS 9.1 was used to obtain the key parameters for the PCA. Finally, the experimental results were interpreted to support the objective of this study. MINITAB 16 was used to plot a few graphs, which are examined in the following sections.
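The PCA quantities described in this section (eigenvalues, eigenvector weights, the eigenvalue-greater-than-one rule, and loadings) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study, which relied on SAS 9.1: the data set below is hypothetical, all function names are made up for this example, and the eigenpairs are obtained with power iteration and deflation only to keep the sketch free of external libraries.

```python
import math

def correlation_matrix(rows):
    """Correlation matrix of observations given as equal-length rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(p)]
    # Standardize each variable, then average the cross products.
    z = [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]
    return [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
            for a in range(p)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dominant_eigenpair(m, iters=1000):
    """Largest eigenvalue and unit eigenvector via power iteration."""
    v = [float(i + 1) for i in range(len(m))]  # arbitrary start vector
    for _ in range(iters):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, matvec(m, v)))
    return eigenvalue, v

def principal_components(corr):
    """All eigenpairs, largest eigenvalue first, by repeated deflation."""
    m = [row[:] for row in corr]
    pairs = []
    for _ in range(len(corr)):
        lam, v = dominant_eigenpair(m)
        pairs.append((lam, v))
        # Remove the component just found so the next one becomes dominant.
        m = [[m[a][b] - lam * v[a] * v[b] for b in range(len(v))]
             for a in range(len(v))]
    return pairs

# Hypothetical data: 8 cases measured on 3 variables (illustration only).
data = [[1, 1, 5], [2, 1, 3], [3, 4, 8], [5, 4, 1],
        [6, 7, 6], [8, 7, 2], [9, 10, 7], [10, 9, 4]]

corr = correlation_matrix(data)
pairs = principal_components(corr)
eigenvalues = [lam for lam, _ in pairs]

# Eigenvalue-greater-than-one rule for standardized data.
retained = [k + 1 for k, lam in enumerate(eigenvalues) if lam > 1.0]

# Loading of variable j on PC k: eigenvector weight times sqrt(eigenvalue).
loadings = [[v[j] * math.sqrt(max(lam, 0.0)) for lam, v in pairs]
            for j in range(len(corr))]

print("eigenvalues:", eigenvalues)
print("retained PCs:", retained)
print("loadings:", loadings)
```

For standardized data the eigenvalues sum to the number of variables, which gives a quick sanity check on any PCA output.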

4. Experimental Results

4.1 Logistic Regression Analysis

The statistics for the model fit are presented below.

Table 1: Model Fit Statistics for LRA

Criterion    Intercept Only    Intercept and Covariates
AIC          489.862           97.653
SC           493.729           136.318
-2 Log L     487.862           77.653

Testing Global Null Hypothesis: BETA=0

Test                Chi-Square (ChiSq)    DF    Pr > ChiSq
Likelihood Ratio    410.209               9     < 0.001
Score               284.338               9     < 0.001
Wald                63.448                9     < 0.001

The likelihood ratio test statistic is equal to 410.209, which corresponds to a p-value less than 0.0001; hence, it can be said that the model is significant.

Table 2: Analysis of Maximum Likelihood Estimates

Parameter    DF    Estimate    Standard Error    Pr > ChiSq
Intercept    1     -8.654      1.252             < 0.001
X1           1     0.516       0.158             0.001
X2           1     -0.089      0.227             0.697
X3           1     0.349       0.244             0.153
X4           1     0.429       0.180             0.017
X5           1     0.128       0.174             0.464
X6           1     0.37        0.112             0.001
X7           1     0.221       0.217             0.311
X8           1     0.133       0.122             0.275
X9           1     0.426       0.315             0.176
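The figures in Tables 1 and 2 can be cross-checked with a short calculation. The Python sketch below assumes the usual definitions AIC = -2 Log L + 2k and SC = -2 Log L + k ln(n), with k the number of estimated parameters and n = 353 the number of retained cases; whether the software used exactly these forms is an assumption. The Wald chi-square for each coefficient is taken as the squared ratio of estimate to standard error with one degree of freedom. All variable and function names are illustrative, and the inputs are the rounded table values, so small differences from the reported p-values are expected.

```python
import math

n_obs = 353  # cases retained after removing missing values (Section 1)

# -2 Log L values from Table 1 and the number of parameters in each model.
neg2logl = {"intercept_only": 487.862, "with_covariates": 77.653}
n_params = {"intercept_only": 1, "with_covariates": 10}

def aic(m2ll, k):
    """Akaike information criterion (assumed form)."""
    return m2ll + 2 * k

def sc(m2ll, k, n):
    """Schwarz criterion (assumed form)."""
    return m2ll + k * math.log(n)

# Likelihood ratio chi-square: drop in -2 Log L when covariates are added.
lr_chisq = neg2logl["intercept_only"] - neg2logl["with_covariates"]

# Table 2 rows: (estimate, standard error, reported Pr > ChiSq).
table2 = {
    "X1": (0.516, 0.158, 0.001), "X2": (-0.089, 0.227, 0.697),
    "X3": (0.349, 0.244, 0.153), "X4": (0.429, 0.180, 0.017),
    "X5": (0.128, 0.174, 0.464), "X6": (0.37, 0.112, 0.001),
    "X7": (0.221, 0.217, 0.311), "X8": (0.133, 0.122, 0.275),
    "X9": (0.426, 0.315, 0.176),
}

def wald_test(estimate, std_err):
    """Wald chi-square with 1 DF and its upper-tail p-value."""
    w = (estimate / std_err) ** 2
    # For 1 DF, P(chi-square > w) = erfc(sqrt(w / 2)).
    return w, math.erfc(math.sqrt(w / 2.0))

significant = [name for name, (b, se, _) in table2.items()
               if wald_test(b, se)[1] < 0.05]

for model in neg2logl:
    print(model, aic(neg2logl[model], n_params[model]),
          sc(neg2logl[model], n_params[model], n_obs))
print("LR chi-square:", lr_chisq)
print("significant at 0.05:", significant)
```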

Table 2 shows that X1, X4, and X6 are significant, as their p-values are below the significance level of 0.05. The remaining variables do not have a significant effect on the predictability. For X1, the estimate (0.5163) divided by the standard error (0.1583) gives the t-value (3.26), and its square is the Wald chi-square value (10.6297). Based on the findings from Table 2, a stepwise logistic regression is conducted (as described in Sharma, 1996).

Table 3: Stepwise Logistic Regression with a 0.05 Significance Level

Step    Effect Entered    Effect Removed    DF    Number In    Score ChiSq
1       X6                                  1     1            224.2849
2       X3                                  1     2            95.0208
3       X1                                  1     3            21.5818
4       X4                                  1     4            11.1951

Pr > ChiSq