Missing Value Estimation for DNA Microarray Expression Data: Least Squares Imputation

Hyunsoo Kim (1), Gene H. Golub (2), and Haesun Park (1,*)

(1) Department of Computer Science and Engineering, University of Minnesota, 200 Union Street S.E., 4-192 EE/CS Building, Minneapolis, MN 55455, U.S.A.
(2) Computer Science Department, Stanford University, Gates Building 2B #280, Stanford, CA 94305-9025, U.S.A.

February 2, 2004

Abstract

Motivation: Gene expression microarray data sets often contain missing expression values. Robust missing value estimation methods are needed, since many algorithms for gene expression analysis require a complete matrix of gene array values. In this paper, imputation methods based on least squares and on cluster structure are proposed for estimating missing values in gene expression data; they exploit local and cluster structures in the data.

Methods: The proposed least squares based method (LSimpute) represents a target gene that has missing values as a linear combination of similar genes. The similar genes are chosen by k-nearest neighbors or by the concept of coherent genes, i.e. genes with large absolute values of the Pearson correlation coefficient. In addition, several cluster based imputation methods are proposed, including a method based on dimension reduction (DRimpute), which takes advantage of the cluster structure in the reduced dimensional space to obtain a more accurate clustering.

Results: LSimpute outperforms the Bayesian principal component analysis (BPCA) based method as well as the conventional weighted k-nearest neighbor method (KNNimpute) on all four data sets tested.

Availability: LSimpute is available from the authors upon request.

*Corresponding author: Haesun Park, E-mail: [email protected], Phone: 612-625-0041, Fax: 612-625-0572.

This material is based upon work supported by National Science Foundation Grants CCR-0204109 and ACI-0305543. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation (NSF).


Introduction

Gene expression data sets often contain missing values for various reasons. For example, the background and the signal may have similar intensities, the surface of the chip may not be planar, there may be dust on the slides, the probe may not be properly fixed on the chip or washed properly, or the hybridization step may fail. There are several approaches for estimating the missing values. Recently, the singular value decomposition based method (SVDimpute) and weighted k-nearest neighbors imputation (KNNimpute) have been introduced for missing value estimation (Troyanskaya et al., 2001). It has been shown that KNNimpute performs better on non-time series data or noisy time series data, while SVDimpute works well on time series data with low noise levels. Overall, the weighted k-nearest neighbors based imputation provides a more robust method for missing value estimation than the SVD based method.

Throughout the paper, we will use $G \in \mathbb{R}^{m \times n}$ to denote a gene expression data matrix with $m$ genes and $n$ experiments, and assume $m \gg n$. In the matrix $G$, a row $\mathbf{g}_i^T \in \mathbb{R}^{1 \times n}$ represents the expression of the $i$th gene in the $n$ experiments:

$$G = \begin{pmatrix} \mathbf{g}_1^T \\ \vdots \\ \mathbf{g}_m^T \end{pmatrix} \in \mathbb{R}^{m \times n}.$$
A missing value in the $j$th location of the $i$th gene is denoted as $\alpha$, i.e. $G(i, j) = \alpha$. For simplicity of description, all missing value estimation algorithms mentioned in this paper are first described assuming there is a missing value in the first position of the first gene, i.e.

$$G(1, 1) = \alpha,$$

and then general algorithms for our proposed missing value estimation methods for DNA microarray expression data are introduced.

The KNNimpute method (Troyanskaya et al., 2001) finds $k$ other genes with expressions most similar to that of $\mathbf{g}_1$ and with the values in their first positions not missing. The missing value is estimated as the weighted average of the values in the first positions of these $k$ closest genes, where the contribution of each gene is weighted by the similarity of its expression to that of $\mathbf{g}_1$. In the SVDimpute method (Troyanskaya et al., 2001), the SVD of the matrix $G'$, which is obtained after all missing values of $G$ are substituted by zeros or row averages, is computed. Then, using the $l$ most significant eigengenes of $G'$, where the specific value of $l$ is either predetermined or determined from the data set, a missing value $G(1, 1)$ in the first position of $\mathbf{g}_1$ is estimated by regressing this gene against the $l$ most significant eigengenes. Using the coefficients of the regression, the missing value is estimated as a linear combination of the values in the first positions of the eigengenes. When determining the regression coefficients, the missing first value of $\mathbf{g}_1$ and the first values of the eigengenes are not used. The above procedure is repeated until the total change of the matrix becomes insignificant, so the cost of SVDimpute grows with the number of iterations required to reach the convergence threshold. SVDimpute is useful for time series data with a low noise level. However, the SVD based imputation has several weaknesses, since the solution relies on all genes and experiments in the data set and considers neither local structure nor cluster structure. For non-time series data, a clear expression pattern may not exist, and for noisy data, the expression patterns of genes with smaller expression values may not be represented well by the dominant eigengenes.
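The following Python sketch illustrates the iterative eigengene regression just described; it is a simplified reading of SVDimpute rather than the reference implementation, and the rank $l$, iteration cap, and convergence tolerance are illustrative choices.

```python
import numpy as np

def svd_impute(G, l=5, n_iter=20, tol=1e-6):
    """Iterative SVD-based imputation in the spirit of SVDimpute: fill missing
    entries with row averages, then repeatedly regress each incomplete gene on
    the l most significant eigengenes and update the missing entries until the
    total change becomes insignificant."""
    G = G.copy()
    missing = np.isnan(G)
    row_means = np.nanmean(G, axis=1)
    G[missing] = np.take(row_means, np.where(missing)[0])   # initial fill by row averages
    for _ in range(n_iter):
        G_old = G.copy()
        # rows of Vt are the eigengenes (right singular vectors)
        _, _, Vt = np.linalg.svd(G, full_matrices=False)
        V = Vt[:l].T                                         # (n, l) basis of top eigengenes
        for i in np.where(missing.any(axis=1))[0]:
            obs = ~missing[i]
            # regression uses only the observed positions of the gene and eigengenes
            coef, *_ = np.linalg.lstsq(V[obs], G[i, obs], rcond=None)
            G[i, ~obs] = V[~obs] @ coef                      # estimate the missing positions
        if np.abs(G - G_old).sum() < tol:
            break
    return G
```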

Recently, Bayesian principal component analysis (BPCA), which simultaneously estimates a probabilistic model and latent variables within the framework of Bayesian inference, has been successfully applied to missing value estimation problems (Oba et al., 2002; Oba et al., 2003). A fixed rank approximation algorithm (FRAA) using the singular value decomposition has also been proposed (Friedland et al., 2003); however, FRAA does not outperform KNNimpute, even though it is more accurate than replacing missing values with zeros or with row means.

In this paper, we introduce least squares based imputation methods, as well as methods that exploit cluster structure and local relationships among genes, for estimating missing values in gene expression data. We investigate all of the proposed imputation methods on four different data sets and compare them with KNNimpute and with an estimation method based on BPCA.

Materials and Methods

Least Squares Based Imputation Algorithms

In this section, several least squares based imputation methods are proposed, in which a target gene that has missing values is represented as a linear combination of similar genes. Alternatively, a target experiment that has a missing value can be represented as a linear combination of the other experiments. One of these two approaches is used to estimate a missing value, depending on the relative sizes of the number of selected similar genes and the number of available experiments. Since only the genes with high similarity to the target gene are used, rather than all genes in the data set, we describe our methods as local least squares algorithms. Similarity measures based on the $L_2$ norm as well as on the Pearson correlation coefficient are investigated.

Local Least Squares Imputation (LLSimpute)

In our local least squares imputation (LLSimpute) method, to recover a missing value $\alpha$ in the first location of $\mathbf{g}_1$, i.e. $G(1, 1) = \alpha$, the $k$ nearest neighbor gene vectors of $\mathbf{g}_1$,

$$\mathbf{g}_{s_1}^T, \ldots, \mathbf{g}_{s_k}^T \in \mathbb{R}^{1 \times n},$$
are found, each of which contains no missing value in its first location or in the rest of the vector. The number of such neighbor genes available may vary from one missing value to another, since it depends on the missing values in the remaining gene vectors. Based on these neighboring gene vectors, the matrix $A \in \mathbb{R}^{k \times (n-1)}$ and the two vectors $\mathbf{b} \in \mathbb{R}^{k \times 1}$ and $\mathbf{w} \in \mathbb{R}^{(n-1) \times 1}$ are formed. The rows of the matrix $A$ are the $k$ nearest neighbor genes $\mathbf{g}_{s_j}^T \in \mathbb{R}^{1 \times n}$, $1 \le j \le k$, with their first values deleted; the elements of the vector $\mathbf{b}$ are the first values of the vectors $\mathbf{g}_{s_j}$; and the elements of the vector $\mathbf{w}$ are the $n-1$ known elements of the target gene $\mathbf{g}_1$, whose missing first item is deleted. After the matrix $A$ and the vectors $\mathbf{b}$ and $\mathbf{w}$ are formed, one of the following two least squares problems is solved, depending on the relative sizes of $k$ and $n-1$.

When $k < n-1$, we solve the least squares problem

$$\min_{\mathbf{x}} \| A^T \mathbf{x} - \mathbf{w} \|_2 . \qquad (1)$$

Then, the missing value $\alpha$ is estimated as a linear combination of the first values of the neighbor genes,

$$\alpha = \mathbf{b}^T \mathbf{x} .$$

In this case, as $k$ approaches $n-1$, genes that are less similar to the target gene become involved in the solution, and the performance may deteriorate. A more detailed discussion is presented in the Results and Discussion section.

When $k \ge n-1$, the least squares problem to solve is

$$\min_{\mathbf{x}} \| A \mathbf{x} - \mathbf{b} \|_2 . \qquad (2)$$

Then, the missing value is estimated as a linear combination of the values of the experiments other than the first experiment of the target gene $\mathbf{g}_1$, i.e.

$$\alpha = \mathbf{w}^T \mathbf{x} .$$

In this case, using more data points, i.e. more genes, is generally helpful for a better estimate of the missing value. To incorporate similarity weights for the $k$ nearest neighbors into Eqn. (2), the weighted least squares problem

$$\min_{\mathbf{x}} \| D (A \mathbf{x} - \mathbf{b}) \|_2 \qquad (3)$$

can be solved, where $D$ is a diagonal matrix of weights whose $i$th diagonal element $D(i, i)$, $1 \le i \le k$, represents the similarity of the $i$th neighbor gene to $\mathbf{g}_1$.
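The following Python sketch illustrates the two regression formulations of Eqns (1)-(3); it is a simplified reimplementation based on the description above, not the authors' code, and details such as tie-breaking and rescaling are omitted.

```python
import numpy as np

def llsimpute_entry(A, b, w, weights=None):
    """Estimate one missing value from its k nearest neighbor genes.

    A : (k, n-1) array, neighbor genes with the missing column removed
    b : (k,) array, neighbor values in the missing column
    w : (n-1,) array, known values of the target gene
    weights : optional (k,) array of similarity weights (diagonal of D in Eqn (3))
    """
    k, n_minus_1 = A.shape
    if k < n_minus_1:
        # Eqn (1): express the target gene's known part as a combination of the
        # neighbors, then apply the same coefficients to their first values.
        x, *_ = np.linalg.lstsq(A.T, w, rcond=None)
        return float(b @ x)
    if weights is not None:
        # Eqn (3): row-scale A and b by the similarity weights before solving.
        A, b = weights[:, None] * A, weights * b
    # Eqn (2): regress the missing column on the remaining experiments.
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(w @ x)
```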

In actual DNA microarray data, missing values occur in various locations. To estimate each element $G(i, j)$ in the $j$th position of the $i$th gene, we build the matrix $A$ and the vectors $\mathbf{b}$ and $\mathbf{w}$ and solve the corresponding least squares problem according to the conditions above. When building the matrix $A$ for a given missing value, all previously estimated entries are used as well.

Constrained Least Squares Imputation (CLSimpute)

In KNNimpute, the $k$ nearest neighbors of the gene with a missing value are first found as described earlier, and the missing value is then computed as a weighted average, where the weights are based on the similarity between the neighbors and $\mathbf{g}_1$, with higher weights assigned to the genes closer to $\mathbf{g}_1$. The choice of weights therefore plays a critical role. Our constrained least squares imputation (CLSimpute) algorithm is formulated to obtain optimal weights for the $k$ nearest neighbors by least squares, where the values of the weights are computed as part of the solution. As in LLSimpute, the problem is divided into two cases depending on the relative sizes of the number of neighboring genes $k$ and the number of experiments $n-1$. Using the matrix $A$ and the vectors $\mathbf{b}$ and $\mathbf{w}$ defined as in LLSimpute, when $k < n-1$ the following constrained least squares problem is solved:

$$\min_{\mathbf{x}} \| A^T \mathbf{x} - \mathbf{w} \|_2 \quad \text{subject to} \quad \sum_{i=1}^{k} x_i = 1, \quad x_i \ge 0, \quad i = 1, \ldots, k.$$

Then, the missing value is calculated by

$$\alpha = \mathbf{b}^T \mathbf{x} ,$$

where each component of the vector $\mathbf{x}$ is the optimized weight of the corresponding neighbor gene obtained by the least squares approach. When $k \ge n-1$, as in LLSimpute, the problem to solve is

$$\min_{\mathbf{x}} \| A \mathbf{x} - \mathbf{b} \|_2 ,$$

and the missing value is calculated by

$$\alpha = \mathbf{w}^T \mathbf{x} .$$
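A minimal sketch of the first (constrained) case is given below. The paper solves it with an active-set quadratic programming routine; here SciPy's SLSQP solver is used as a stand-in, and the constraint set (nonnegative weights summing to one) is our reading of the weighted-average formulation rather than a verbatim specification from the text.

```python
import numpy as np
from scipy.optimize import minimize

def clsimpute_entry(A, b, w):
    """Constrained least squares estimate of one missing value (k < n-1 case).

    A, b, w are defined as in LLSimpute; x is interpreted as a vector of
    nonnegative weights over the k neighbor genes that sums to one.
    """
    k = A.shape[0]
    objective = lambda x: np.sum((A.T @ x - w) ** 2)           # ||A^T x - w||^2
    constraints = ({"type": "eq", "fun": lambda x: np.sum(x) - 1.0},)
    bounds = [(0.0, None)] * k                                  # x_i >= 0
    x0 = np.full(k, 1.0 / k)                                    # start from uniform weights
    res = minimize(objective, x0, method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return float(b @ res.x)                                     # alpha = b^T x
```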

In our experiments, the constrained least squares problem was solved by quadratic programming using an active set method (Gill et al., 1981). As in LLSimpute, for each missing element in the DNA microarray expression data $G$, a matrix $A$ and vectors $\mathbf{b}$ and $\mathbf{w}$ are formed, and the corresponding constrained least squares problem is solved.

Least Squares Imputation with the Pearson Correlation Coefficient (LSPimpute)

In this subsection, another least squares based imputation method is introduced that takes advantage of similar genes selected by the Pearson correlation coefficient (Pearson, 1894). When there is a missing value in the first location of $\mathbf{g}_1$, the Pearson correlation coefficient between the two vectors $\mathbf{g}_1' = (g_{12}, \ldots, g_{1n})$ and $\mathbf{g}_i' = (g_{i2}, \ldots, g_{in})$ is defined as

$$r_{1i} = \frac{1}{n-1} \sum_{j=2}^{n} \left( \frac{g_{1j} - \bar{g}_1}{\sigma_1} \right) \left( \frac{g_{ij} - \bar{g}_i}{\sigma_i} \right) , \qquad (4)$$

where $\bar{g}_1$ is the average of the values in $\mathbf{g}_1'$ and $\sigma_1$ is their standard deviation (and similarly for $\bar{g}_i$ and $\sigma_i$). Components of $\mathbf{g}_i$ that contain missing values are not considered in computing the coefficients. We use the absolute values of the Pearson correlation coefficients, since highly correlated genes of opposite sign, i.e. with $r_{1i} < 0$, are also helpful in estimating missing values. In LSPimpute, missing values in the target gene are estimated by local least squares, where highly correlated genes are selected from the microarray data based on the Pearson correlation coefficients. First, the correlation coefficients between $\mathbf{g}_1$ and all other genes are computed. After sorting the absolute values of the coefficients, the genes whose absolute correlation with $\mathbf{g}_1$ exceeds a threshold value $r_\theta$ are selected; let $c$ denote the number of genes satisfying this criterion (for example, the number of genes whose absolute correlation coefficients are greater than 0.95), and let $c^*$ denote the number of these genes that do not have any missing values. If there is a sufficient number of very highly correlated genes without missing values, it is reasonable to estimate the missing value from these genes alone. Therefore, two cases are distinguished, according to $c^*$ and a threshold number of genes $c_\theta$, when building the corresponding least squares problem. When $c^*$ is greater than or equal to the threshold number of genes $c_\theta$, i.e. $c^* \ge c_\theta$, we use the least squares problem

$$\min_{\mathbf{x}} \| A^T \mathbf{x} - \mathbf{w} \|_2 , \qquad (5)$$

where the rows of the matrix $A \in \mathbb{R}^{c^* \times (n-1)}$ are the highly correlated genes that contain no missing value, with their first values deleted. After building the vector $\mathbf{b} \in \mathbb{R}^{c^* \times 1}$ from their first values, the missing value is obtained as $\alpha = \mathbf{b}^T \mathbf{x}$; that is, the missing value is estimated by a linear combination of the first values of the coherent genes. The threshold number of genes $c_\theta$ is determined from our numerical experiments.

When $c^* < c_\theta$, we use the least squares problem

$$\min_{\mathbf{x}} \| A \mathbf{x} - \mathbf{b} \|_2 , \qquad (6)$$

where the rows of $A \in \mathbb{R}^{c \times (n-1)}$ are correlated genes that contain no missing value, with their first values deleted, and $\mathbf{b} \in \mathbb{R}^{c \times 1}$ holds their first values. The number of similar genes $c$, i.e. the number of rows of the matrix $A$, is determined by numerical experiments; we describe the selection of the threshold values ($r_\theta$, $c_\theta$) in the Results and Discussion section. The missing value is then calculated by

$$\alpha = \mathbf{w}^T \mathbf{x} .$$

This approach estimates a missing value as a linear combination of the values of the experiments other than the first experiment of the target $\mathbf{g}_1$. The coefficients of the linear combination are determined by the above least squares problem, using the genes that exhibit similar behavior across the experimental conditions. When solving the least squares problem, the truncated SVD can be used to achieve noise reduction in estimating missing values. LSPimpute/SVD is a variation of LSPimpute that uses this noise reduction scheme, where a specific reduced rank has to be chosen. More discussion on this follows in the Results and Discussion section.
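As an illustration of the coherent-gene selection and the regression of Eqn (6), here is a hedged Python sketch; the threshold value and the fallback rule are illustrative choices, not the ones tuned in the paper.

```python
import numpy as np

def lspimpute_entry(G_complete, target_known, r_threshold=0.95, n_fallback=10):
    """Estimate the missing first-column value of a target gene.

    G_complete   : (p, n) array of genes with no missing values; column 0 is
                   the experiment in which the target gene's value is missing.
    target_known : (n-1,) array, the target gene's values in the other experiments.
    """
    rest = G_complete[:, 1:]
    # absolute Pearson correlation between the target and each complete gene
    corr = np.array([abs(np.corrcoef(target_known, row)[0, 1]) for row in rest])
    coherent = np.where(corr >= r_threshold)[0]
    if coherent.size == 0:
        coherent = np.argsort(corr)[-n_fallback:]   # fall back to the most correlated genes
    A = rest[coherent]                  # coherent genes, missing column removed
    b = G_complete[coherent, 0]         # their values in the missing column
    x, *_ = np.linalg.lstsq(A, b, rcond=None)        # Eqn (6)
    return float(target_known @ x)      # alpha = w^T x
```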

To estimate each missing element $G(i, j)$ in the $j$th position of the $i$th gene of the data matrix $G$, we build the matrix $A$ after computing the correlation coefficients between $\mathbf{g}_i$ and the other genes, and solve the corresponding least squares problem according to the conditions above.

Cluster Based Imputation Algorithms

In KNNimpute, the value estimated by the weighted average can be perturbed by genes from other clusters when the number of nearest neighbors, $k$, is larger than the number of genes in the cluster that contains the target gene with a missing value. To avoid this problem, we developed imputation algorithms that estimate missing values by a weighted average based on cluster structure as well as similarity. For simplicity of notation, we assume that the matrix $G \in \mathbb{R}^{m \times n}$ is partitioned into two submatrices,

$$G = \begin{pmatrix} G_* \\ \hat{G} \end{pmatrix} \in \mathbb{R}^{m \times n} ,$$

where each row of $G_*$ has at least one missing value and $\hat{G}$ has no missing values.
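Assuming missing entries are encoded as NaN, this partition can be written as a small NumPy helper (an illustrative sketch, not part of the original method description):

```python
import numpy as np

def partition_by_missing(G):
    """Split G into G_star (rows with at least one missing value)
    and G_hat (rows with no missing value)."""
    has_missing = np.isnan(G).any(axis=1)
    return G[has_missing], G[~has_missing]
```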

CLUSTERimpute

When the gene expression data has a cluster structure, the CLUSTERimpute missing value estimation method can be used. In CLUSTERimpute, the following steps are taken:

1. Cluster the matrix $\hat{G}$ using $K$-means (Lloyd, 1982), EM, or any other clustering algorithm.

2. For each gene $\mathbf{g}_i$ in $G_*$ that has missing values, and ignoring the components with missing values:

   (a) Determine the cluster $C_j$ to which $\mathbf{g}_i$ belongs.

   (b) Select the $k$ genes in $C_j$ most similar to $\mathbf{g}_i$.

   (c) Estimate the missing values by the weighted average of the chosen genes, where each selected gene is weighted by its similarity to $\mathbf{g}_i$. For genes from $\hat{G}$, apply additional weights.

To measure gene similarity, we use the Euclidean distance, since the log transformation has been found to sufficiently reduce the effect of outliers in gene expression data (Troyanskaya et al., 2001). For our experiments, we tested both the $K$-means and EM algorithms to obtain the cluster structure in the original input space. When the EM algorithm is used, a matrix in which all missing values are substituted by row averages serves as the input, so that rows containing missing values are not lost in the clustering step.

DRimpute

It is well known that conventional clustering algorithms such as expectation maximization (EM) and $K$-means are often trapped in a local minimum when handling high dimensional data. It is possible to avoid this problem by applying the clustering algorithm in a reduced dimensional space obtained by cluster structure preserving dimension reduction methods (Ding et al., 2002). DRimpute takes advantage of this approach to estimate missing values for DNA microarrays. When the number of clusters in the gene expression data is $K$, the missing value estimation with dimension reduction, DRimpute, works as follows.

1. Reduce the dimension of $\hat{G}$ by applying PCA or any other method.

2. Apply a clustering algorithm such as $K$-means or EM to obtain clusters in the reduced dimensional space. Use the cluster membership to construct cluster centroids in the original space. If the positions of the centroids in the original space have converged, go to Step 4.

3. Compute the dimension reducing transformation using a cluster structure preserving dimension reduction method such as Orthogonal Centroid (Park et al., 2003) or LDA/GSVD (Howland et al., 2003). Transform the data to the reduced dimensional space. Go to Step 2.

4. For each gene $\mathbf{g}_i$ in $G_*$ that has missing values:

   (a) Determine the cluster $C_j$ to which $\mathbf{g}_i$ belongs.

   (b) Select the $k$ genes in $C_j$ most similar to $\mathbf{g}_i$.

   (c) Estimate the missing values by a weighted average of the chosen genes, where each selected gene is weighted by its similarity to $\mathbf{g}_i$. For genes from $\hat{G}$, apply additional weights (a code sketch of this estimation step is given below).
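The estimation step shared by CLUSTERimpute and DRimpute (cluster the complete genes, then average the most similar genes within the target's cluster) can be sketched as follows; scikit-learn's KMeans stands in for the clustering step, the additional weights for genes from $\hat{G}$ are omitted, and the inverse-distance weighting is an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_impute(G, n_clusters=5, k=10):
    """Cluster-based weighted-average imputation (simplified sketch)."""
    G = G.copy()
    complete = ~np.isnan(G).any(axis=1)
    members = G[complete]
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(members)
    for i in np.where(~complete)[0]:
        known = ~np.isnan(G[i])
        # assign the incomplete gene to the nearest centroid on its known components
        c = np.argmin(((km.cluster_centers_[:, known] - G[i, known]) ** 2).sum(axis=1))
        pool = members[km.labels_ == c]
        d = np.linalg.norm(pool[:, known] - G[i, known], axis=1)
        order = np.argsort(d)[:k]
        wts = 1.0 / (d[order] + 1e-12)          # closer cluster members weigh more
        for j in np.where(~known)[0]:
            G[i, j] = np.dot(wts, pool[order, j]) / wts.sum()
    return G
```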

In Step 2, the cluster centroids in the original space are constructed from the cluster membership obtained by the clustering algorithm in the reduced dimensional space. The centroid vector $\mathbf{c}_j$ of the $j$th cluster in the original space can be calculated as

$$\mathbf{c}_j = \frac{1}{n_j} \sum_{i \in I_j} p(C_j \mid \mathbf{y}_i) \, \mathbf{g}_i , \qquad (7)$$

where $n_j$ is the number of genes in the $j$th cluster, $I_j$ is the set of indices of the genes that belong to the $j$th cluster in the reduced dimensional space, and $\mathbf{g}_i$ is the expression vector of the $i$th gene. Here $p(C_j \mid \mathbf{y}_i)$ is the probability that the point belongs to cluster $C_j$, where $\mathbf{y}_i$ is the $i$th point in the reduced space; if the $K$-means algorithm is used, this probability is always 1. For our experiments, we used both the $K$-means and EM algorithms to obtain the cluster structure in the reduced space, together with the Orthogonal Centroid dimension reduction method, for DRimpute.

SVDimpute Algorithms

In the SVDimpute method (Troyanskaya et al., 2001), all missing values are substituted by row averages or zeros, and then the SVD of the completed matrix is computed in order to obtain the eigengenes. There are alternative approaches. After removing all rows that contain a missing value, the SVD can be applied to a complete submatrix with no missing data (Alter et al., 2003). We also tried a hybrid of SVDimpute and KNNimpute: after all missing values are initially estimated by KNNimpute, the eigengenes are obtained by the SVD. We refer to these methods as SVDimpute/RowAverage, SVDimpute/Remove, and SVDimpute/kNN, respectively.

Results and Discussion

Four microarray data sets obtained from the Stanford Microarray Database (SMD) were used in our experiments. The first data set was obtained from an alpha-factor block release that was studied for the identification of cell-cycle regulated genes in the yeast Saccharomyces cerevisiae (Spellman et al., 1998); from it we built a complete data matrix of 4304 genes and 18 experiments (SP.ALPHA) that does not have any missing value, in order to assess the missing value estimation methods. The second data set came from an elutriation data set (Spellman et al., 1998), from which we built a complete matrix of 4304 genes and 14 experiments (SP.ELU). The 4304 genes originally had no missing values in either the alpha-factor block release set or the elutriation data set. The third data set was from 784 cell cycle regulated genes, classified by Spellman et al. (1998) into five classes, for the same 14 experiments as the second data set; after removing all gene rows that have missing values, we built a data set of 474 genes and 14 experiments (SP.CYCLE). We also built a fourth data set that has a large number of experiments, from a study of responses to environmental changes in yeast (Gasch et al., 2001). It contains 6361 genes and 156 experiments with time series of specific treatments; we built a complete matrix of 2641 genes and 44 experiments after removing experimental columns that have more than 8% missing values and then selecting gene rows that do not have any missing value (GA.ENV). SP.ALPHA and SP.ELU are the same data sets that were used in the study of BPCA (Oba et al., 2003). The SP.CYCLE data set was designed to test how well an imputation method can take advantage of strongly correlated genes in estimating missing values, and the GA.ENV data set was prepared to test how efficiently an imputation method can handle a large number of experiments. The normalized root mean squared (RMS) difference between the imputed matrix and the original matrix was used to assess the accuracy of the methods.

Given an $m$-gene by $n$-experiment original expression data matrix $G_{org}$ from the SMD, we prepared an initial complete matrix $G_{init} \in \mathbb{R}^{m' \times n}$ in which the $m'$ retained genes have no missing values ($m' \le m$). If there is no missing value in the original matrix $G_{org}$, then the initial matrix $G_{init}$ is set to $G_{org}$. Given this initial matrix, 5% of the data elements of $G_{init}$ are randomly selected and regarded as missing. To measure the performance of an algorithm, we computed the normalized root mean squared difference between the imputed matrix and the initial matrix $G_{init}$ (a sketch of this protocol is given below). For LSPimpute, the correlation threshold $r_\theta$ was fixed in advance; if fewer than $c_\theta$ genes satisfied it, the threshold was lowered so that enough coherent genes remained available. When the number of selected genes is too small, there may not be sufficient information about the relationships between the experiments; when it is too large, genes that have little correlation with the target gene influence the result. Through several numerical experiments, we chose the threshold values; the optimal values of $r_\theta$ and $c_\theta$ may vary with the data set, but the values we chose worked well for all four data sets. In LSPimpute/SVD, the reduced dimension of the truncated SVD was fixed in advance.
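A hedged sketch of this evaluation protocol is shown below; the exact normalization factor for the RMS error is not spelled out in this section, so dividing by the RMS value of the hidden entries is an assumption made for illustration.

```python
import numpy as np

def evaluate_imputer(G_init, impute_fn, missing_frac=0.05, seed=0):
    """Hide a fraction of the entries of a complete matrix, impute them with
    impute_fn (a function mapping a matrix with NaNs to a completed matrix),
    and return a normalized RMS error over the hidden entries."""
    rng = np.random.default_rng(seed)
    mask = rng.random(G_init.shape) < missing_frac      # ~5% of entries hidden
    G_missing = G_init.copy()
    G_missing[mask] = np.nan
    G_imputed = impute_fn(G_missing)
    err = G_imputed[mask] - G_init[mask]
    return float(np.sqrt(np.mean(err ** 2)) / np.sqrt(np.mean(G_init[mask] ** 2)))
```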

CLUSTERimpute and DRimpute take advantage of clusters for missing value estimation. KNNimpute also exploits local cluster structure by finding the nearest neighbors and taking their weighted average based on similarity: the influence of data points that are farther from the target gene, or that belong to clusters other than the one containing the target gene, is weakened by assigning them lower weights. The choice of weights is therefore important for KNNimpute, and we observed that the normalized RMS error of the unweighted average is much larger than that of KNNimpute in our experiments. While KNNimpute is relatively insensitive to the exact value of $k$ within the range of 10-20 neighbors (Figure 1), CLUSTERimpute also attains its minimum normalized RMS error in a similar range.

In Figure 1, we compare the normalized RMS errors of the missing value estimation methods discussed in this paper. For readability of the figures, we show only SVDimpute/kNN as a representative of the SVD based imputation methods, since there was no significant performance difference among the methods in that category: SVDimpute/RowAverage and SVDimpute/kNN gave slightly better results than SVDimpute/Remove. The missing value estimation based on Bayesian principal component analysis (BPCA) showed good performance on the SP.ELU data set (see Figure 2). However, LLSimpute outperformed both BPCA and KNNimpute on all four data sets when a large number of genes was used. In Figure 3, LLSimpute shows increasingly better performance as the number of genes used for estimating missing values on the SP.CYCLE data set grows; there, the normalized RMS error values of LSPimpute and BPCA were 0.594 and 0.733, respectively. For the constrained least squares imputation (CLSimpute), we observed that its performance is better than that of LLSimpute when $k < n-1$. However, it did not give significantly better results overall, since the least squares problem of Eqn. (1) depends strongly on the number of very similar genes, and the size of its over-determined system is usually much smaller than that of Eqn. (2). Even though LSPimpute/SVD was designed to deal with noisy data sets by using the truncated SVD, it did not give better results than LSPimpute on any of the four data sets. In Figure 4, the normalized RMS error values of LSPimpute and BPCA on the GA.ENV data set were 0.534 and 0.603, respectively; the figure also shows that LLSimpute works well even when the number of experiments is large. LSPimpute also significantly outperformed LLSimpute on the GA.ENV data set, even when LLSimpute took into account more than 400 similar genes.


This is because LSPimpute takes advantage of coherent genes, whereas LLSimpute relies on Euclidean distance based k-nearest neighbor genes.

We have introduced several missing value estimation methods and their variations, and in particular a least squares based method that uses the concept of coherent genes for missing value estimation in DNA microarray expression data. Once the coherent genes are identified, missing values can be estimated by representing the target gene that has missing values as a linear combination of the similar genes, or by representing the target experiment that has a missing value as a linear combination of the related experiments. Our results show that the most successful general missing value estimation method is based on representing a target experiment that has a missing value as a linear combination of the other experiments.

Acknowledgements

The authors would like to thank the University of Minnesota Supercomputing Institute (MSI) for providing the computing facilities. We also thank Dr. Shigeyuki Oba for providing data sets and for helpful discussions.

References

Alter, O., Brown, P. O. and Botstein, D. (2003) Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc. Natl Acad. Sci. USA, 100 (6), 3351-3356.

Ding, C., He, X., Zha, H. and Simon, H. D. (2002) Adaptive dimension reduction for clustering high dimensional data. In Proc. of the 2nd IEEE Int'l Conf. on Data Mining, Maebashi, Japan.

Friedland, S., Niknejad, A. and Chihara, L. (2003) A simultaneous reconstruction of missing data in DNA microarrays. Institute for Mathematics and its Applications Preprint Series, No. 1948.

Gasch, A. P., Huang, M., Metzner, S., Botstein, D., Elledge, S. J. and Brown, P. O. (2001) Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol. Biol. Cell, 12 (10), 2987-3003.

Gill, P. E., Murray, W. and Wright, M. H. (1981) Practical Optimization. Academic Press, London, UK.

Howland, P., Jeon, M. and Park, H. (2003) Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition. SIAM J. Matrix Anal. Appl., 25 (1), 165-179.

Lloyd, S. P. (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory, 28 (2), 129-137.

Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K. and Ishii, S. (2002) Missing value estimation using mixture of PCAs. In Proceedings of the International Conference on Artificial Neural Networks (ICANN 2002), pp. 492-497, Springer, LNCS 2415.

Oba, S., Sato, M., Takemasa, I., Monden, M., Matsubara, K. and Ishii, S. (2003) A Bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19 (16), 2088-2096.


Park, H., Jeon, M. and Rosen, J. B. (2003) Lower dimensional representation of text data based on centroids and least squares. BIT Numerical Mathematics, 42 (2), 1-22.

Pearson, K. (1894) Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London, 185, 71-110.

Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D. and Futcher, B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273-3297.

Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R. B. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17 (6), 520-525.


Figure 1: Comparison of the normalized RMS errors (y-axis) of the different estimation methods (KNNimpute, CLUSTERimpute, DRimpute, LLSimpute, CLSimpute, SVDimpute/kNN, BPCA, LSPimpute, LSPimpute/SVD) as a function of the number of genes used for estimating missing values (x-axis), on the SP.ALPHA data set of 4304 genes and 18 experiments with 5% of the data missing. For the cluster based algorithms, the number of clusters ($K$) was fixed at 5. The results of the methods that do not depend on the number of genes are shown on the y-axis.

Figure 2: Comparison of the normalized RMS errors (y-axis) of the different estimation methods (KNNimpute, CLUSTERimpute, DRimpute, LLSimpute, CLSimpute, SVDimpute/kNN, BPCA, LSPimpute, LSPimpute/SVD) as a function of the number of genes used for estimating missing values (x-axis), on the SP.ELU data set of 4304 genes and 14 experiments with 5% of the data missing. For the cluster based algorithms, the number of clusters ($K$) was fixed at 5. The results of the methods that do not depend on the number of genes are shown on the y-axis.

Figure 3: Comparison of the normalized RMS errors (y-axis) of the different estimation methods (KNNimpute, CLUSTERimpute, DRimpute, LLSimpute, CLSimpute, SVDimpute/kNN, BPCA, LSPimpute, LSPimpute/SVD) as a function of the number of genes used for estimating missing values (x-axis), on the SP.CYCLE data set of 474 genes and 14 experiments with 5% of the data missing. For the cluster based algorithms, the number of clusters ($K$) was fixed at 5. The results of the methods that do not depend on the number of genes are shown on the y-axis.

Figure 4: Comparison of the normalized RMS errors (y-axis) of the different estimation methods (KNNimpute, CLUSTERimpute, DRimpute, LLSimpute, CLSimpute, SVDimpute/kNN, BPCA, LSPimpute, LSPimpute/SVD) as a function of the number of genes used for estimating missing values (x-axis), on the GA.ENV data set of 2641 genes and 44 experiments with 5% of the data missing. For the cluster based algorithms, the number of clusters ($K$) was fixed at 5. The results of the methods that do not depend on the number of genes are shown on the y-axis.