Bioinformatics Advance Access published February 24, 2005

BIOINFORMATICS

Collateral Missing Value Imputation: A New Robust Missing Value Estimation Algorithm For Microarray Data

Muhammad Shoaib B. Sehgal*, Iqbal Gondal and Laurence S. Dooley
Gippsland School of Computing and Information Technology, Monash University, VIC 3842, Australia

ABSTRACT

Motivation: Microarray data is used in a range of application areas in biology, though often it contains considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms, so there is a strong motivation to estimate them as accurately as possible prior to using these algorithms. While many imputation algorithms have been proposed, more robust techniques need to be developed so that further analysis of biological data can be accurately undertaken. In this paper, an innovative missing value imputation algorithm called Collateral Missing Value Estimation (CMVE) is presented, which uses multiple covariance-based imputation matrices for the final prediction of missing values. The matrices are computed and optimized using least square regression and linear programming methods.

Results: The new CMVE algorithm has been compared with existing estimation techniques including Bayesian Principal Component Analysis Imputation (BPCA), Least Square Impute (LSImpute) and K-Nearest Neighbour (KNN). All these methods were rigorously tested to estimate missing values in three separate non-time series (ovarian cancer based) datasets and one time series (yeast sporulation) dataset. Each method was quantitatively analyzed using the Normalized Root Mean Square (NRMS) error measure, covering a wide range of randomly introduced missing value probabilities from 0.01 to 0.2. Experiments were also undertaken on the Yeast dataset, which comprised 1.7% actual missing values, to test the hypothesis that CMVE performed better not only for randomly occurring but also for a real distribution of missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation of missing values compared to the other methods for both series types of data, for the same order of computational complexity. A concise theoretical framework has also been formulated to validate the improved performance of the CMVE algorithm.

Availability: The CMVE software is available on request from the authors.

Contact: [email protected]

1 INTRODUCTION

DNA microarrays are extensively used to probe the genetic expression of tens of thousands of genes under a variety of conditions, as well as in the study of many biological processes varying from human tumours (Shoaib et al, 2004-1) to yeast sporulation (Troyanskaya et al, 2001). There are several statistical, mathematical and machine learning algorithms (Gustavo et al, 2003; Ramaswamy et al, 2001; Shipp et al, 2002) that exploit this data, for instance for diagnosis (Furey et al, 2000; Brown et al, 1997), drug discovery and protein sequencing. The most commonly used methods include data dimension reduction techniques (Shoaib et al, 2004-5), class prediction techniques (Shoaib et al, 2004-2; Shoaib et al, 2004-3; Golub et al, 1999) and clustering methods (Munagala et al, 2004). Despite the wide usage of microarray data, they frequently contain missing values, with up to 90% of genes affected (Ouyang et al, 2004). Missing values can occur for various reasons such as spotting problems, slide scratches, blemishes on the chip, hybridization error, image corruption or simply dust on the slide (Oba et al, 2003). It has been proven (Shoaib et al, 2004-4; Acuna et al, 2004) that missing values affect class prediction and data dimension reduction techniques such as Support Vector Machines (SVM), Neural Networks (NN), Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). The problem can be managed in many different ways, from repeating the experiment, though this is often not feasible for economic reasons, through to simply ignoring the samples containing missing values, although this is inappropriate because usually only a very limited number of samples is available.
The best solution is to attempt to accurately estimate the missing values, but unfortunately most approaches use zero impute (replace the missing values by zero) or row average/median (replacement by the corresponding row average/median), neither of which takes advantage of data correlations, so leading to high estimation errors (Troyanskaya et al, 2001). Current research demonstrates that if the correlation between data is exploited then the missing value prediction error can be reduced significantly (Shoaib et al, 2004-4; Hellem et al, 2004). Several methods including K-Nearest Neighbour (KNN) Impute, Least Square Imputation (LSImpute) (Hellem et al, 2004) and Bayesian PCA (BPCA) (Oba et al, 2003) have been used, however the prediction error generated by these methods still impacts on the performance of statistical and machine learning algorithms, including class prediction, class discovery and differential gene identification algorithms (Shoaib et al, 2004-5). There is thus considerable potential to develop new techniques which provide minimal prediction errors for different types of microarray data, including both time and non-time series sequences. This paper presents a Collateral Missing Value Estimation (CMVE) algorithm which combines multiple value matrices for particular missing data and optimizes its parameters using linear programming and least square (LS) regression. CMVE is compared with other well-established techniques including KNN, LSImpute and BPCA, with their performance rigorously tested for the prediction of randomly introduced missing values, with probabilities ranging from 0.01 to 0.2, for the BRCA1, BRCA2 and Sporadic mutation microarray data (mutations present in ovarian cancer), which is non-time series data (Amir et al, 2001). The reason for introducing missing values is that the number of actual missing values in the BRCA1, BRCA2 and Sporadic mutation data is negligibly small compared to the size of the dataset: only 0.01%, 0.003% and 0.01% of values respectively. Since randomly introduced missing values may not be distributed in the same way as actual missing values (Oba et al, 2003), a separate experiment was performed, with CMVE and the other three estimation algorithms being applied to the Yeast sporulation time series dataset (Spellman et al, 1998), which contains 1.7% missing values.

© The Author (2005). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
The Normalized Root Mean Square (NRMS) error (Ouyang et al, 2004) metric was used to quantitatively evaluate the estimation performance of each technique, with the results demonstrating the improved accuracy and robustness of CMVE over a wide range of randomly introduced missing values. In addition, while computational complexity is not as critical a factor as accuracy for missing value imputation, because estimation is performed only once during data collection (Troyanskaya et al, 2001; Hellem et al, 2004), the order of computational complexity for CMVE proved to be exactly the same as for the LSImpute and KNN algorithms. The remainder of the paper is organized as follows: Section 2 presents a brief overview of existing estimation techniques, with their respective advantages and disadvantages, while the new CMVE algorithm and methodology is detailed in Section 3. Section 4 provides the theoretical framework for the improved performance of CMVE compared to the KNN, LSImpute and BPCA algorithms, while Section 5 fully analyses the respective estimation performance of all four imputation methods. Section 6 provides some conclusions.

2 OVERVIEW OF EXISTING MISSING VALUE ESTIMATION TECHNIQUES

The following convention is adopted for all the imputation algorithms described in this paper. The microarray data has the form of an m × n matrix Y, where m is the number of genes and n is the number of samples. The Y_IJ component of Y represents the expression level of gene I for sample J. An overview is now presented of the strengths and limitations of the three estimation techniques used for comparative purposes in assessing the performance of CMVE.

2.1 K-Nearest Neighbour (KNN) Estimation

The KNN method imputes missing values by selecting genes with expression values similar to the gene of interest (Troyanskaya et al, 2001). In order to estimate the missing value Y_IJ of gene I in sample J, k genes are selected whose expression vectors are similar to the genetic expression of I in samples other than J. The similarity between two expression vectors Y1 and Y2 is determined by the Euclidean distance d over the observed components:

d = ‖Y1 − Y2‖   (1)

The missing value is then estimated as the weighted average of the corresponding entries in the selected k expression vectors:

Ŷ_IJ = Σ_{i=1}^{k} W_i · X_i   (2)

W_i = 1 / (d_i · Δ), where Δ = Σ_{i=1}^{k} 1/d_i   (3)

and X is the input matrix containing the gene expressions. Equations (2) and (3) show that each gene contribution is weighted by the similarity of its expression to gene I. The Euclidean distance measure used by KNN is sensitive to outlier values which may be present in microarray data, although log-transforming the data significantly reduces their effect on gene similarity determination (Troyanskaya et al, 2001). The choice of a small k degrades the performance of the classifier, as the imputation process overemphasizes a few dominant genes in estimating the missing values. Conversely, a large neighbourhood may include genes that are significantly different from those containing missing values, so degrading the estimation process and commensurately the classifier's performance. Empirical results have demonstrated that for small datasets k=10 is the best choice (Acuna et al, 2004), while Troyanskaya et al (2001) observed that KNN is insensitive to values of k in the range 10 to 20. The computational complexity of KNN is O(m²n), where m and n are the number of genes and samples respectively, and while this is the same order as the LSImpute algorithm, Section 2.3 will show it is higher than BPCA. A key limitation of KNN is that it does not consider negative correlations between data, which can lead to estimation errors.
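The KNN scheme of (1)-(3) can be sketched as follows. This is our own illustrative reconstruction, not the authors' code: the function name and toy matrix are invented, and the inverse-distance weighting is an assumption consistent with (2)-(3).

```python
import numpy as np

def knn_impute(Y, I, J, k=2):
    """Estimate the missing value Y[I, J] as the inverse-distance-weighted
    average of the k genes most similar to gene I, where similarity is the
    Euclidean distance over the samples other than J, as in Eq. (1)."""
    obs = [j for j in range(Y.shape[1]) if j != J]   # samples other than J
    dists = []
    for g in range(Y.shape[0]):
        if g == I:
            continue
        d = np.linalg.norm(Y[I, obs] - Y[g, obs])    # Eq. (1)
        dists.append((d, g))
    dists.sort()
    neighbours = dists[:k]
    # Inverse-distance weights, normalized to sum to one (Eqs. 2-3)
    inv = np.array([1.0 / max(d, 1e-12) for d, _ in neighbours])
    w = inv / inv.sum()
    return sum(wi * Y[g, J] for wi, (_, g) in zip(w, neighbours))

# Toy 4-gene x 3-sample matrix; pretend Y[0, 2] is missing
Y = np.array([[1.0, 2.0, 3.0],
              [1.1, 2.1, 3.1],
              [0.9, 1.9, 2.9],
              [5.0, 1.0, 0.0]])
est = knn_impute(Y, I=0, J=2, k=2)   # genes 1 and 2 are the neighbours
```

With these toy values the two nearest neighbours are equidistant, so the estimate is simply the mean of their entries in sample J.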

2.2 Least Square Impute (LSImpute) Estimation

Least Square Impute (LSImpute) is a regression-based estimation method that exploits the correlation between genes. To estimate the missing value Y_IJ of gene I from the gene expression matrix Y, the k most correlated genes are firstly selected, whose expression vectors are similar to that of gene I in all samples except J and which contain non-missing values for gene I. The LS regression method then estimates the missing value Y_IJ. By having the flexibility to adjust the number of predictor genes k in the regression, LSImpute performs best when the data has a strong local correlation structure, for the same order of computational complexity, O(m²n), as KNN.
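A minimal sketch of this regression idea (our own illustration, not the LSImpute implementation): rank candidate predictors by absolute correlation with the target gene, fit a single-predictor least-squares regression on the observed samples, and predict the missing entry. Averaging several single-gene regressions is a simplifying assumption here.

```python
import numpy as np

def ls_impute(Y, I, J, k=1):
    """Estimate Y[I, J] by single-predictor least-squares regression on
    each of the k genes most correlated with gene I over the samples
    other than J, then average the predictions."""
    obs = [j for j in range(Y.shape[1]) if j != J]
    target = Y[I, obs]
    # Rank candidate predictor genes by absolute Pearson correlation
    cand = [g for g in range(Y.shape[0]) if g != I]
    cand.sort(key=lambda g: -abs(np.corrcoef(Y[g, obs], target)[0, 1]))
    preds = []
    for g in cand[:k]:
        x = Y[g, obs]
        beta = np.cov(x, target, ddof=1)[0, 1] / np.var(x, ddof=1)  # b = S_xy / S_xx
        alpha = target.mean() - beta * x.mean()                     # a = mean(y) - b*mean(x)
        preds.append(alpha + beta * Y[g, J])
    return float(np.mean(preds))

Y = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [9.0, 7.0, 5.0]])
est = ls_impute(Y, I=0, J=2, k=1)   # gene 1 is perfectly correlated with gene 0
```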

2.3 Bayesian Principal Component Analysis (BPCA) Based Estimation

Bayesian Principal Component Analysis (BPCA) estimates missing values Y_miss in the data matrix Y using those genes Y_obs having no missing values. The probabilistic Principal Component Analysis (PPCA) model is calculated using Bayes' theorem, with Bayesian estimation computing the posterior distribution of the model parameter θ and of the input matrix X containing the gene expression samples using:

p(θ, X | Y) ∝ p(Y, X | θ) p(θ)   (4)

where p(θ) is known as the prior distribution, which contributes an a priori preference for θ and X. Missing values are estimated using a Bayesian estimation algorithm which is executed for both θ and Y_miss (similar to the iterative Expectation Maximization algorithm) and calculates the posterior distributions q(θ) and q(Y_miss) for θ and Y_miss (Oba et al, 2003):

q(Y_miss) = p(Y_miss | Y_obs, θ_true)   (5)

where θ_true is the posterior estimate of the model parameter. Finally, the missing values in the gene expression matrix are imputed using:

Ŷ = ∫ Y_miss q(Y_miss) dY_miss   (6)

By exploiting only the global correlation in the datasets, BPCA has the advantage of prediction speed, incurring a computational complexity of O(mn), which is one degree less than for both KNN and LSImpute. For imputation purposes however, improved estimation accuracy is always a greater priority than speed.

3 THE COLLATERAL MISSING VALUE ESTIMATION (CMVE) ALGORITHM

The complete CMVE algorithm, which is detailed in Fig. 1, introduces the concept of multiple parallel estimations of missing values. For instance, if value Y_IJ of gene I and sample J is missing, multiple estimates (Θ1, Θ2 and Θ3) are generated and the final estimate is distilled from these. The covariance function is employed since, unlike KNN, it is unbiased in considering both positive and negative correlation values. The covariance function CoV is formally defined as:

CoV = (1/(n−1)) Σ_{i=1}^{n} (φ_i − φ̄)(ψ_i − ψ̄)   (7)

where φ is the predictor gene vector and ψ the expression vector of gene I, which has the missing values. The absolute diagonal covariance CoV is firstly computed for the gene vector ψ, where every gene except I is iteratively considered as φ (Step 2 in Fig. 1). The genes are then ordered with respect to their CoV values and the first k-ranked covariate genes Rk are selected, whose expression vectors have the most similarity to gene I from Y in all samples except J (Step 4). The LS regression method (Harvey et al, 2004) is then applied to estimate the value Θ1 for Y_IJ (Step 5) as:

Θ1 = α + βX + ε   (8)

where ε is the error term that minimizes the variance in the LS model (with parameters α and β). For a single regression, the estimates of α and β are respectively:

α̂ = Ȳ − β̂X̄   and   β̂ = S_xy / S_xx

where S_xy = (1/(n−1)) Σ_{J=1}^{n} (X_J − X̄)(Y_J − Ȳ) is the empirical covariance between X and Y, Y_J is the gene with the missing value and X_J is the predictor gene in Rk, and S_xx = (1/(n−1)) Σ_{J=1}^{n} (X_J − X̄)² is the empirical variance of X, with X̄ and Ȳ being the respective means over X_1, ..., X_n and Y_1, ..., Y_n. The LS estimate of Y given X is then expressed as:

Ŷ = Ȳ + (S_xy / S_xx)(X − X̄)
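As a quick numerical check of these estimators (toy numbers of our own choosing, not data from the paper), the empirical covariance and variance yield β̂ and α̂ directly:

```python
import numpy as np

# Toy predictor gene X and target gene Y over n = 4 samples, with Y = 2X
X = np.array([1.0, 2.0, 3.0, 4.0])
Yv = np.array([2.0, 4.0, 6.0, 8.0])

n = len(X)
S_xy = ((X - X.mean()) * (Yv - Yv.mean())).sum() / (n - 1)  # empirical covariance
S_xx = ((X - X.mean()) ** 2).sum() / (n - 1)                # empirical variance
beta = S_xy / S_xx                   # slope estimate, recovers 2
alpha = Yv.mean() - beta * X.mean()  # intercept estimate, recovers 0

# LS prediction of Y for a new predictor value X = 5
Y_hat = Yv.mean() + (S_xy / S_xx) * (5.0 - X.mean())
```

Since the toy data is exactly linear, the regression recovers the generating line and predicts Ŷ = 10 at X = 5.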

The two other missing value estimates, Θ2 and Θ3 (Step 6), are respectively given by:

Θ2 = Σ_{i=1}^{k} ζ_i + Σ_{i=1}^{k} ξ_i²   (9)


Pre-condition: Gene expression matrix Y(m, n), where m and n are the number of genes and samples respectively.
Post-condition: Y with no missing values.
Algorithm:
STEP 1 Locate missing value Y_IJ in gene I and sample J.
STEP 2 Compute the absolute covariance CoV of the expression vector of gene I using (7).
STEP 3 Rank the genes (rows) based on CoV.
STEP 4 Select the k most effective rows Rk.
STEP 5 Use these values of Rk to estimate Θ1 by (8).
STEP 6 Calculate Θ2 and Θ3 using (9) and (10).
STEP 7 Calculate missing value Y_IJ using (14) and use the imputed estimate in all future predictions.
STEP 8 Seek the next missing value Y_IJ and repeat STEPS 2-7 until all missing values in Y are estimated.
STEP 9 END

Fig. 1. The Collateral Missing Value Estimation (CMVE) algorithm

Θ3 = Σ_{i=1}^{k} (ζ^T × I) / k + ξ   (10)

where ζ is the vector that minimizes ε0 in (12) and ξ is the actual residual; these parameters are obtained from the Non-Negative Least Square (NNLS) algorithm (Charles et al, 1974). The objective is now to find a linear combination of models that best fits Rk and I. The objective function in NNLS minimizes, using linear programming techniques, the prediction error ε0 so that:

[ζ, ξ, ε0] = min(ε0)   (11)

i.e. min(ε0) is a function that locates the normal vector ζ with the minimum prediction error ε0 and residual ξ. The value of ε0 in (11) is obtained from:

ε0 = max(SV(Rk · ζ − I))   (12)

where SV are the singular values of the difference vector between the dot product of Rk with the prediction coefficients ζ, and the gene expression row I. The tolerance used in the linear programming to compute the vector ζ is given by:

Tol = k × n × max(SV(Rk)) × C   (13)

where k is the number of predictor genes, n the number of samples in the dataset and C is a normalization factor. The final estimate for Y_IJ is formed using:

Θ = α·Θ1 + β·Θ2 + γ·Θ3   (14)

where α = β = γ = 0.33 ensures an equal weighting of the respective estimates Θ1, Θ2 and Θ3. The rationale for this choice is that, as each estimate is highly data dependent, it avoids any bias towards one particular estimate. The reason (14) has a lower NRMS error is that the imputation matrix Θ1 uses LS regression, while matrices Θ2 and Θ3 use non-negative LS (NNLS), which is superior for estimating positively correlated values. NNLS is unable, however, to estimate negative values and, given microarray data possesses both negative and positive values, this was the motivation to embed the LSImpute-based matrix into the gene expression prediction, so combining the advantages of both algorithms to more accurately estimate the missing values.

4 THEORETICAL FOUNDATIONS OF CMVE

This section explores the theoretical principles underpinning why the CMVE algorithm provides better performance than the KNN, LSImpute and BPCA techniques in estimating missing values. For completeness, a computational complexity analysis of CMVE is provided and shown to be of exactly the same order as both LSImpute and KNN.

Proposition 1: KNN only considers positive correlations.

If there are two sets φ and ψ which are inversely proportional to each other, then the distance d between φ and ψ will be larger than for sets which are directly proportional to each other. Several distance functions can be used for KNN, the most common being the Euclidean distance:

d = √( Σ_{i=1}^{n} (φ_i − ψ_i)² )   (15)

so d is always higher when φ is inversely proportional to ψ than when they are directly proportional to each other.

Proposition 2: The CMVE algorithm considers both positive and negative correlation values.

Assume two sets φ and ψ that are inversely proportional, so CoV(φ, ψ) < 0. From (7) it is clear that if a high correlation exists between the gene values (either directly proportional with positive correlation, or inversely proportional with negative correlation) a higher absolute CoV value will result.

Proposition 3: The probability P(ε) of the normalized imputation error of missing values using CMVE is always less than that for BPCA, LSImpute and KNN.

The probability P(ε) of the normalized imputation error of a missing value for correlated data is directly proportional to the number of missing values M (Mclean et al,


2000). Assume P1 and P2 are the probabilities of the normalized imputation errors of the three comparative algorithms (ε1) and of CMVE (ε2) respectively, such that:

P1 = Σ_{i=0}^{M} P(ε1)·P(M) = M × P(ε1)·P(M)   (16)

P2 = Σ_{i=0}^{M} P(ε2)·P(i)   (17)

Since the comparative methods do not feed estimates into any future missing value predictions, such algorithms only consider M missing values for each prediction. In contrast, CMVE uses estimated values for the future prediction of missing values, so each estimate increases the number of predictor genes to be considered, while concomitantly decreasing the prediction probabilities P(i) in (17), so:

P2 < P1, such that P2 → 0 when i → 0, as P(i) = 0 for i = 0   (18)

Proposition 4: CMVE always has a lower estimation error of missing values in the case of transitive gene dependency (Gene A → B → C) than BPCA, LSImpute and KNN.

Assume that gene Ga1 is correlated with the set S1 such that:

Ga1 ∼ S1, where S1 = {Gb1, Gb2, ..., Gbn}   (19)

Similarly, gene Gb1 is correlated with S2:

Gb1 ∼ S2, where S2 = {Gc1, Gc2, ..., Gcn}   (20)

If the values of both Ga1 and Gb1 are missing, then Gb1 can be predicted using set S2 and subsequently used to predict Ga1 more accurately using S1, by including Gb1 rather than ignoring it. CMVE, unlike the other imputation techniques, considers estimated values in predicting future missing values. LSImpute replaces a gene's missing value with an average value to compute the CoV matrix (Hellem et al, 2004). The NRMS error using this approach is always higher than with CMVE, since each CMVE iteration (Fig. 1, Steps 1-7) lowers this error. KNN and BPCA assume that genes with missing values have no correlation with the gene being imputed, as they ignore these missing values while searching the estimation space. In contrast, CMVE includes these genes when searching for the most correlated gene. This may incur a small accumulative error in future predictions, but it will always be less than when either the average value of the gene is used or the gene is ignored entirely.

Proposition 5: CMVE generates a lower estimation error than BPCA when genes have dominant local correlation.

BPCA assumes only a global correlation structure and has a similar effect to selecting a high value of k for CMVE. Due to this assumption, BPCA does not provide accurate estimates when genes have dominant local correlation (Oba et al, 2003), because in predicting missing values, information from all genes is considered, many of which have little or no correlation with the gene with the missing value. In contrast, the CMVE variable k can be adjusted depending upon the type of the data, ensuring that only those genes with strong correlations are considered, which concomitantly reduces the estimation error. The empirical results presented in the next section demonstrate a value of k = 10 is suitable for local correlated data. Computational Complexity Analysis: The order of computational complexity for CMVE is exactly the same as for the KNN and LSImpute algorithms.

The critical operation for the CMVE, KNN and LSImpute algorithms is the search for the most correlated genes. These algorithms search for correlated genes with the gene, which has missing values. Each estimation takes linear time O(n), so for m genes and n samples the complexity order is O(m2n) for all algorithms. KNN uses a weighted average of k correlated genes to estimate the missing values, while CMVE and LSImpute use regression and linear programming for estimation, though these additional overheads are negligible compared to the time to search for the most correlated genes. Like KNN and LSImpute, CMVE also only searches once per estimation for correlated genes. As discussed in Section 2.3, BPCA has a computational complexity of O(mn) as it only considers the global correlation structure of the data. This is pyrrhic however, because the corresponding estimation accuracy is significantly inferior whenever data has a localized correlation structure.
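The estimation core described above can be sketched as follows. This is our own illustrative reconstruction under stated assumptions, not the released CMVE software: it ranks predictors by absolute covariance as in (7), forms an LS regression estimate for Θ1, forms one NNLS-based estimate (standing in for Θ2/Θ3, whose exact forms differ in the paper), and blends the two with equal weights in the spirit of (14). SciPy's `nnls` performs the non-negative least-squares step.

```python
import numpy as np
from scipy.optimize import nnls

def cmve_estimate(Y, I, J, k=2):
    """Sketch of one CMVE-style imputation for Y[I, J]."""
    obs = [j for j in range(Y.shape[1]) if j != J]
    target = Y[I, obs]
    # Steps 2-4: rank genes by absolute covariance with gene I, keep top k
    cand = [g for g in range(Y.shape[0]) if g != I]
    cand.sort(key=lambda g: -abs(np.cov(Y[g, obs], target, ddof=1)[0, 1]))
    Rk = cand[:k]
    # Step 5: Theta1 from single-gene LS regressions averaged over Rk
    t1 = []
    for g in Rk:
        x = Y[g, obs]
        beta = np.cov(x, target, ddof=1)[0, 1] / np.var(x, ddof=1)
        alpha = target.mean() - beta * x.mean()
        t1.append(alpha + beta * Y[g, J])
    theta1 = np.mean(t1)
    # Step 6 (simplified): non-negative coefficients zeta >= 0 fitted so that
    # A.zeta ~= target, then applied to the predictors' values in sample J
    A = Y[Rk][:, obs].T                  # samples x predictors
    zeta, _resid = nnls(A, target)
    theta2 = float(Y[Rk, J] @ zeta)
    # Eq. (14)-style equal-weight blend (two collateral estimates here, not three)
    return 0.5 * theta1 + 0.5 * theta2

Y = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 2.0, 3.0],
              [9.0, 1.0, 4.0]])
est = cmve_estimate(Y, I=0, J=2, k=2)
```

Blending a regression estimate with an NNLS estimate mirrors the motivation given for (14): NNLS handles positively correlated structure well, while the LS term can follow negative correlations.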

5 RESULTS ANALYSIS

To test the different imputation algorithms, four different types of microarray data were used, including both time series and non-time series data. The datasets contained 18, 16 and 27 samples of BRCA1, BRCA2 and Sporadic mutations (neither BRCA1 nor BRCA2) of ovarian cancer data (non-time series) respectively, and 77 samples of yeast sporulation data (time series). Each ovarian cancer data sample contained logarithmic microarray data for 6445 genes, while there were 6179 genetic expressions per sample for the yeast dataset. The rationale for selecting cancer data is that in such data some of the genes are up/down regulated, so it is very difficult to distinguish their expression levels from those of non-regulated genes. The missing value estimation techniques were tested by randomly removing data values and then computing the estimation error. In the experiments, between 1% and 5% of the values were removed from each dataset sample and the NRMS error ε was computed by:

ε = RMS(M − M_est) / RMS(M)   (21)

where M is the original data matrix and M_est is the estimated matrix using KNN, LSImpute, BPCA or CMVE. This particular metric was used for error estimation because ε = 1 for zero imputation (Ouyang et al, 2004). To compare the performance of the CMVE, KNN and LSImpute imputation algorithms, k=10 was used throughout the experiments. The rationale for this was that Troyanskaya et al (2001) observed that KNN was insensitive to values of k in the range 10 to 20, with the best estimation results being observed in this range, while Hellem et al (2004) also suggested using k=10 for LSImpute. Fig. 2 plots the minimum overall prediction error rates for CMVE over a range of k values for the different test datasets, with the results showing that k in the range 10 to 15 is the most appropriate. Lower k values include only a small set of correlated genes for prediction, leading to prediction errors as other correlated genes are ignored. Conversely, when k is high, genes which have either very little or no correlation with the gene having missing values will be included in the prediction, again leading to erroneous results (Troyanskaya et al, 2001).

Fig. 2. NRMS error over a wide range of k in the CMVE algorithm for 5% missing values

To fully test the robustness of the new CMVE algorithm, experiments were performed for missing values up to 20%. Figs 8(a) and (b) show the error values for 10% missing values which especially reveal (Fig. 8.a) the significant deterioration in the results of KNN for the Sporadic dataset. To clarify the performance of CMVE, Fig 8(b) plots the error results without KNN, which consistently confirm the lower error values compared with LSImpute and BPCA. Figs 9(a) and (b) show the corresponding results for 20% missing values, which again reveal the superiority and greater robustness of the CMVE algorithm for missing value imputation. Note, for the sake of clarity a logarithmic scale is used in Fig 9(b). Whenever there is a high number of missing values in a gene, sparse covariance matrices will ensue and with them, the increased likelihood of ill-conditioning. The CMVE algorithm avoids ill-conditioning by ensuring the removal of all genes with more than 20% missing values prior to imputation. As highlighted in Section 1, experiments performed upon datasets with randomly introduced missing values may not truly reflect the nature of actual microarray data missing values. All four imputation algorithms were therefore tested on the Yeast time series data containing 1.7% missing values. Since NRMS errors could not be calculated for these actual missing values, the gene value adjacent to the gene with the missing value was replaced prior to applying the imputation algorithms. This had the effect of a delay function, while retaining the same distribution of missing values. The results in Table 1 again confirm the superior performance of CMVE, particularly when an additional 4%, 5% and 10% of missing values are introduced into the data, with the corresponding average improvements being 60%, 72% and 64% respectively. The imputation results also reveal some other broader noteworthy issues. 
KNN for instance, performed better when missing values were randomly introduced because KNN only considers positive correlations and certain randomly introduced missing values will inevitably contain negative correlations with other genetic data. Similarly, LSImpute exhibited an improved performance compared to BPCA (Oba et al, 2003) confirming the discussion underpinning Proposition 5.
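The evaluation protocol can be reproduced in outline (our own sketch; the matrix, masking rate and zero-imputation baseline are placeholders): randomly hide a fraction of entries, impute them, and score with (21). Zero imputation is used here because, by construction, its NRMS error over the masked entries equals 1, which checks the metric.

```python
import numpy as np

def nrms_error(M, M_est):
    """Normalized RMS error of Eq. (21): RMS(M - M_est) / RMS(M)."""
    return np.sqrt(np.mean((M - M_est) ** 2)) / np.sqrt(np.mean(M ** 2))

rng = np.random.default_rng(0)
M = rng.normal(size=(100, 10))      # stand-in expression matrix
mask = rng.random(M.shape) < 0.05   # randomly hide ~5% of the entries

# Zero-imputation baseline: replace every masked entry by zero
M_zero = M.copy()
M_zero[mask] = 0.0

# Score only the entries that were actually hidden
err = nrms_error(M[mask], M_zero[mask])
```

Any of the four imputation methods can be substituted for the zero-fill step and scored with the same `nrms_error` call.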

Table 1. NRMS errors for the actual missing value distribution (AMVD) of 1.7% missing values and an additional 1% to 10% randomly introduced values in the Yeast dataset

Missing Values   BPCAImpute   KNNImpute   LSImpute   CMVE
AMVD             0.1485       0.0654      0.0130     0.0064
1%               0.0319       0.8930      0.0849     0.0030
2%               0.0569       0.5284      0.1555     0.0843
3%               0.0674       0.6232      0.1612     0.0547
4%               0.0846       0.9307      0.2003     0.0090
5%               0.0927       0.5821      0.2071     0.0091
10%              0.1756       0.8763      0.0638     0.0130

Fig. 3. NRMS error for 1% missing values for ovarian cancer data

Fig. 4. NRMS error for 2% missing values for ovarian cancer data

Fig. 5. NRMS error for 3% missing values for ovarian cancer data

Fig. 6. NRMS error for 4% missing values for ovarian cancer data

Fig. 7. NRMS error for 5% missing values for ovarian cancer data

Fig. 8(a). NRMS error for 10% missing values for ovarian cancer data

Fig. 8(b). NRMS error of CMVE, BPCA and LSImpute for 10% missing values for ovarian cancer data

Fig. 9(a). NRMS error for 20% missing values for ovarian cancer data

Fig. 9(b). NRMS error (log scale) for 20% missing values for ovarian cancer data

6 CONCLUSIONS

This paper has presented a new Collateral Missing Value Estimation (CMVE) algorithm based on the novel concept of multiple imputations. Experimental results confirmed that CMVE consistently provided superior estimation accuracy compared with the existing missing value imputation algorithms including KNN, LSImpute and BPCA. This performance improvement was especially evident when estimating higher numbers of missing values in both time series and non-time series data. The algorithm’s theoretical basis, which was the exploitation of a combination of global and local correlations in a given dataset, repeatedly proved to be a more effective and robust strategy than the distance function used by KNN, with no increase in the order of computational complexity for all values of k. The results corroborate the fact that CMVE can be successfully applied to accurately impute missing values prior to any microarray data experiment, crucially without any bias being introduced into the estimation process.

REFERENCES

Acuna E. and Rodriguez C. (2004) The treatment of missing values and its effect in the classifier accuracy. Classification, Clustering and Data Mining Applications, pp. 639-648.
Amir A.J., Yee C.J., Sotiriou C., Brantley K.R., Boyd J. and Liu E.T. (2002) Gene expression profiles of BRCA1-linked, BRCA2-linked, and Sporadic ovarian cancers. Journal of the National Cancer Institute, 94(13).
Brown W.N., Grundy D., Lin N., Cristianini C., Sugnet T., Furey S., Ares M. and Haussler D. (1997) Knowledge-based analysis of microarray gene expression data using support vector machines. Proc. Natl. Acad. Sci., 262-267.
Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D. and Lander E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537.
Gustavo B. and Monard C.M. (2003) An analysis of four missing data treatment methods for supervised learning. Applied Artificial Intelligence, 17(5-6):519-533.
Furey T.S., Cristianini N., Duffy N., Bednarski D.W., Schummer M. and Haussler D. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906-914.
Harvey M. and Arthur C. (2004) Fitting Models to Biological Data Using Linear and Nonlinear Regression. Oxford University Press.
Hellem B.T., Dysvik B. and Jonassen I. (2004) LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res., 32(3):e34.
Lawson C.L. and Hanson R.J. (1974) Solving Least Squares Problems. Prentice-Hall, Englewood Cliffs, N.J.
McLean (2000) The predictive approach to teaching statistics. Journal of Statistics Education, 8(3).
Munagala K., Tibshirani R. and Brown P.O. (2004) Cancer characterization and feature set extraction by discriminative margin clustering. BMC Bioinformatics, 5(1):21.
Oba S., Sato M.A., Takemasa I., Monden M., Matsubara K. and Ishii S. (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19(16):2088-2096.
Ouyang M., Welsh W.J. and Georgopoulos P. (2004) Gaussian mixture clustering and imputation of microarray data. Bioinformatics.
Ramaswamy S., Tamayo P., Rifkin R., Mukherjee S., Yeang C.H., Angelo M., Ladd C., Reich M., Latulippe E., Mesirov J.P., Poggio T., Gerald W., Loda M., Lander E.S. and Golub T.R. (2001) Multiclass cancer diagnosis using tumour gene expression signatures. Proc. Natl. Acad. Sci. USA, 98(26):15149-15154.
Shipp M.A., Ross K.N., Tamayo P., Weng A.P., Kutok J.L., Aguiar R.C., Gaasenbeek M., Angelo M., Reich M., Pinkus G.S., Ray T.S., Koval M.A., Last K.W., Norton A., Lister T.A., Mesirov J., Neuberg D.S., Lander E.S., Aster J.C. and Golub T.R. (2002) Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nat. Med., 8(1):68-74.
Shoaib M.B.S., Gondal I. and Dooley L. (2004-1) Support vector machine and generalized regression neural network based classification fusion models for cancer diagnosis. HIS '04, Japan.
Shoaib M.B.S., Gondal I. and Dooley L. (2004-2) A collimator neural network model for the classification of genetic data. ICBA '04, USA.
Shoaib M.B.S., Gondal I. and Dooley L. (2004-3) Communal neural network for ovarian cancer mutation classification. Complex '04, Australia.
Shoaib M.B.S., Gondal I. and Dooley L. (2004-4) K-ranked covariance based missing values estimation for microarray data classification. HIS '04, Japan.
Shoaib M.B.S., Gondal I. and Dooley L. (2004-5) Statistical neural networks and support vector machine for the classification of genetic mutations in ovarian cancer. IEEE CIBCB '04, USA.
Spellman P.T., Sherlock G., Zhang M.Q., Iyer V.R., Anders K., Eisen M., Brown P.O., Botstein D. and Futcher B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9:3273-3297.
Troyanskaya O., Cantor M., Sherlock G., Brown P., Hastie T., Tibshirani R., Botstein D. and Altman R.B. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17:520-525.