Robust imputation method for missing values in microarray data

Yoon, Dankyu; Lee, Eun-Kyung; Park, Taesung

doi:10.1186/1471-2105-8-S2-S6

Volume 8 Supplement 2

Probabilistic Modeling and Machine Learning in Structural and Systems Biology

Research
Open access
Published: 03 May 2007

Robust imputation method for missing values in microarray data

Dankyu Yoon¹,
Eun-Kyung Lee² &
Taesung Park²

BMC Bioinformatics volume 8, Article number: S6 (2007) Cite this article

6386 Accesses
28 Citations
Metrics details

Abstract

Background

When analyzing microarray gene expression data, missing values are often encountered. Most multivariate statistical methods proposed for microarray data analysis cannot be applied when the data have missing values. Numerous imputation algorithms have been proposed to estimate the missing values. In this study, we develop a robust least squares estimation with principal components (RLSP) method by extending the local least square imputation (LLSimpute) method. The basic idea of our method is to employ quantile regression to estimate the missing values, using the estimated principal components of a selected set of similar genes.

Results

Using the normalized root mean squares error, the performance of the proposed method was evaluated and compared with other previously proposed imputation methods. The proposed RLSP method clearly outperformed the weighted k-nearest neighbors imputation (kNNimpute) method and LLSimpute method, and showed competitive results with Bayesian principal component analysis (BPCA) method.

Conclusion

Adapting the principal components of the selected genes and employing the quantile regression model improved the robustness and accuracy of missing value imputation. Thus, the proposed RLSP method is, according to our empirical studies, more robust and accurate than the widely used kNNimpute and LLSimpute methods.

Background

Microarray experiment technique has been successfully applied to a variety of biological studies including cancer classification, discovery of the unknown gene function, and identification of effects of a specific therapy. When analyzing microarray data, we often face missing values due to various factors such as scratches on the slide, spotting problems, dusts, experimental errors, and so on. In practice, every experiment contains missing entries and sometimes more than 90% of the genes in the microarray experiment are affected [1]. Moreover, most of the classic multivariate analysis methods for microarray data cannot be used when the data have missing values. Therefore, we need to treat missing values appropriately.

An easy way to handle missing data is to repeat the whole experiment. However, often it is not a realistic option secondary to economic limitations and/or scarcity of available biological material [2]. Accordingly, many missing value estimation methods have been developed. The weighted k-nearest neighbors imputation method (kNNimpute) selects genes with expression profiles similar to the gene of interest to impute missing values [3]. The singular value decomposition method (SVDimpute) employs a singular value decomposition to obtain to a set of mutually orthogonal patterns that can be linearly combined to approximate the expression of all genes in the data set [3]. In a comparative study presented by Troyanskaya et al. [3], kNNimpute is more robust and accurate than SVDimpute.

Least squares imputation (LSimpute) is a regression-based method using the correlation between both genes and arrays [2]. LSimpute showed best performance when data have a strong local correlation structure. Local least squares imputation (LLSimpute) is an extension of LSimpute method which selects k similar genes by L₂-norm or Pearson correlation and applies multiple regression to impute missing values [4].

Bayesian principal component analysis (BPCA) uses a Bayesian estimation algorithm to predict missing values [5]. BPCA suggests using the number of samples minus 1 as the number of principal axes. Since BPCA uses an EM-like repetitive algorithm to estimate missing values, it needs intensive computations to impute missing values. Gaussian mixture imputation (GMCimpute) estimates missing values using Gaussian mixture and model averaging [1]. Collateral missing value imputation (CMVE) [6] predicts missing values based on a multiple covariance-based imputation matrices and performs imputation using least square regression and linear programming methods.

Recently, several imputation methods using a priori information to impute missing values have been proposed such as a set theoretic framework approach based on projection onto convex sets (POCS) [7] and an approach based on the functional similarities of gene ontology [8]. While most traditional missing imputation methods treated spots as binary value such as missing or present, weighted nearest neighbours method (WeNNI) adopted a continuous spot quality weight for the missing value estimation [9].

Among these methods, kNNimpute, LSimpute and LLSimpute are most commonly used because they are easy to apply with less computational burdens. Note that LSimpute and LLSimpute are regression based methods and kNNimpute can also be regarded as a regression based method for the simple intercept model. In this paper, we focus on these regression based methods and present their improvements.

kNNimpute and LLSimpute both use the k selected genes to estimate missing values. Kim et al. [4] showed LLSimpute performed well for a large value of k, say over 200. However, it is inefficient to use such a large number of genes to estimate one missing value from a practical point of view. Furthermore, there is no guarantee that the selected k is sufficiently large enough for LLSimpute to perform well. Surprisingly, the performance of LLSimpute becomes very poor when k is close to the number of samples.

On the other hand, kNNimpute performs well with relatively small values of k. For example, kNNimpute suggests using 10 or 15 similar genes. However, kNNimpute performs poorly when k is too small or too large. Its performance depends on the sample size and the correlations between genes. Therefore, kNNimpute is negatively affected by a badly chosen k.

To overcome the limitations of these regression based imputation methods, we propose the robust least square estimation with principal components (RLSP) method. RLSP is an improved version of LLSimpute. We use the estimated principal components of the selected genes and apply quantile regression to estimate missing values with the estimated principal components. Note that the most imputation methods are not robust to outliers. RLSP performs well even when k is small. Moreover RLSP shows similar performance with LLSimpute when k is large. Therefore, RLSP is more robust to the choice of k.

The normalized root mean squared error (NRMSE) is used to evaluate the differences in performances between the proposed RLSP method and the other imputation methods for various missing rates [4]. The RLSP method clearly outperforms the LLSimpute method.

Methods

A whole gene expression profile is represented by a G × N matrix, Y, where the rows correspond to the genes, the columns correspond to the experiments (samples), and the entry Y_i,jis the expression level of gene i in experiment j. For simplicity, we assume that the target gene vector g* has a missing value at the first sample, denoted by α. For the k selected similar genes, let $g_{s_{j}}$ be a N × 1 vector consisting of the j th selected genes with its first element $w_{s_{j}}$ , where s_jdenote the index for representing k selected genes for j = 1,...,k. The selected similar genes have complete values without missing observations. Then,

(\begin{matrix} g^{* T} \\ g_{s_{1}}^{T} \\ g_{s_{2}}^{T} \\ ⋮ \\ g_{s_{k}}^{T} \end{matrix}) = (\begin{matrix} α & y_{1} & y_{2} & \dots & y_{N - 1} \\ w_{s_{1}} & x_{s_{1}, 1} & x_{s_{1}, 2} & \dots & x_{s_{1}, N - 1} \\ ⋮ & ⋮ & ⋮ & ⋮ & ⋮ \\ w_{s_{k}} & x_{s_{k}, 1} & x_{s_{k}, 2} & \dots & x_{s_{k}, N - 1} \end{matrix}) = (\begin{matrix} α & y^{T} \\ w & X \end{matrix}) (1)

where $w = {[w_{s_{1}}, w_{s_{2}}, \dots, w_{s_{k}}]}^{T}$ , and y = [y₁, y₂,...,y_N-1]^Tis a subvector of g* excluding the missing value α.

LLSimpute selects the k most similar genes using L₂-norm or Pearson correlation and applies multiple regression to impute missing values with a linear combination of the k selected genes. LLSimpute applies multiple regression in two ways.

The first model is

y = X^Tβ + ε

\hat{α} = w^{T} \hat{β} = w^{T} {(X X^{T})}^{-} X y (2)

and the second model is

w = X β* + ε*

{\hat{α}}^{*} = y^{T} {\hat{β}}^{*} = y^{T} {(X^{T} X)}^{-} X^{T} w (3)

where (XX^T)^- is the generalized inverse of (XX^T). If N is larger than k, (XX^T)^- in the first model is easier to calculate (X^TX)^- than in the second model and vice versa.

In case of multiple missing values in a gene, all missing components of each gene are excluded to find similar genes. Then, the vectors w and y^T, and matrix X are formed in a similar way as in the case of one missing entry, only with different dimensions.

LLSimpute showed a good performance for a relatively large value of k. However, if a value of k is close to the number of samples, LLSimpute performed poorly compared to other imputation methods. It is probably due to the multi-collinearity problem that LLSimpute performs poorly when k is small. The patterns of gene expression are highly correlated leading to the poor performance of multiple regression.

To overcome this limitation, we perform a regression with the principal components rather than the original data. Our technique utilizes the selection of two models in terms of k and applies the principal components analysis to the k selected genes. Also we consider the robustness to reduce the effects of the outliers by fitting robust regression.

The RLSP method consists of three parts: (1) selection of k similar genes, (2) principal component analysis with the k selected genes, and (3) robust regression analysis using these principal components. We describe these processes step by step.

STEP 1 : Selection of k similar genes

To impute a missing value α, the k similar genes are used for RLSP, where k is a pre-determined number. In LLSimpute, L₂-norm or Pearson correlation coefficient is used to select k similar genes. However, it is well known that L₂-norm and Pearson correlation coefficients are sensitive to outliers. In RLSP, we use L₁-norm as a distance measure to select the k similar genes for imputing the missing values of the gene g*,

d (g^{*}, g_{i}) = \sum_{j = 1}^{N - 1} | x_{i, j} - y_{j} | .

STEP 2 : Principal component

After selecting k similar genes, we perform the principal component analysis. We define two types of variance-covariance matrix. The first one is a k × k matrix

V = \sum_{i = 1}^{N - 1} \sum_{j = 1}^{N - 1} (x_{i} - \bar{x} .) {(x_{j} - \bar{x} .)}^{T} (4)

where $x_{i} = {[x_{s_{1}, i}, x_{s_{2}, i}, \dots, x_{s_{k}, i}]}^{T}, \bar{x} . = {[{\bar{x}}_{s_{1}, .}, {\bar{x}}_{s_{2}, .}, \dots, {\bar{x}}_{s_{k}, .}]}^{T}$ , and ${\bar{x}}_{s_{l}, .} = \frac{1}{N - 1} \sum_{i = 1}^{N - 1} x_{s_{l}, i}$ . The second type is a (N-1) × (N-1) matrix

V^{*} = \sum_{i = 1}^{k} \sum_{j = 1}^{k} (x_{i}^{*} - {\bar{x}}_{.}^{*}) {(x_{j}^{*} - {\bar{x}}_{.}^{*})}^{T} (5)

where $x_{i}^{*} = {[x_{s_{i}, 1}, x_{s_{i}, 2}, \dots, x_{s_{i}, N - 1}]}^{T}, {\bar{x}}_{.}^{*} = {[{\bar{x}}_{., 1}, {\bar{x}}_{., 2}, \dots, {\bar{x}}_{., N - 1}]}^{T}$ , and ${\bar{x}}_{., l} = \frac{1}{k} \sum_{i = 1}^{k} x_{s_{i}, l}$ .

When k is larger than N, the size of V matrix becomes too large to handle and it is not computationally efficient to derive the principal components. Therefore, we use V* instead of V and use a different type of regression. V corresponds to the first model and V* corresponds to the second model in LLS impute, respectively. Kim et al. [4] showed that the solutions based on V and V* are in fact the same. Let PC_x= {PC₁, PC₂,...,PC_k} be the principal components using V and $P C_{x}^{*} = {P C_{1}^{*}, P C_{2}^{*}, \dots, P C_{N - 1}^{*}}$ be the principal components using V*. Then, these principal components PC_xand $P C_{x}^{*}$ are used for the first type of regression (equation (2)) and the second type of regression model (equation (3)), respectively.

STEP 3 : Robust regression

We use PC₁, PC₂,...,PC_por $P C_{1}^{*}, P C_{2}^{*}, \dots, P C_{p}^{*}$ as new exploratory variables and fit the regression model in a robust manner, where p is the predetermined number of the principal components. The corresponding regression models are

y = \sum_{i = 1}^{p} P C_{i} β_{i} + ε (6)

and

w = \sum_{i = 1}^{p} P C_{i}^{*} β_{i}^{*} + ε^{*} (7)

In our method, we use a quantile regression to fit the regression model in a robust manner. Robust regression usually provides an alternative analysis to least square regression when fundamental assumptions such as normality or variance homogeneity are violated. The quantile regression using the 50th percentile estimates the model parameters by minimizing the sum of absolute values of the residuals [10]. It is estimated by minimizing $\sum_{j = 1}^{N - 1} | y_{j} - \sum_{i = 1}^{p} P C_{i j} β_{i} |$ or $\sum_{j = 1}^{k} | w_{s_{j}} - \sum_{i = 1}^{p} P C_{i j}^{*} β_{i}^{*} |$ . This way our analysis method can reject outliers and maintains robustness.

If p = k and the regression model is estimated by the least squares method, RLSP is the same as LLSimpute. When k is small, we recommend using p = 1 and fit the regression model y_i= β₁PC_1i+ ε_iusing the sum of least absolute deviations. The imputed value of α is defined by $\hat{α} = {\hat{β}}_{1} P C_{w}$ where PC_wis the projected data of w onto the direction of PC₁. For k much larger than N, we recommend using p close to the number of the sample size.

Results and discussion

Datasets

Four data sets are used for the comparative study: three Spellman data sets (ALPHA, ELU, and ALPHA+ELU, [5, 11]), and Gasch data set [12]. These data sets were also used in the comparative study of LLSimpute [4]. ALPHA dataset was obtained from α-factor block release studied for the identification of cell-cycle regulated genes in yeast Saccharomyces cerevisiae [13]. ELU dataset was elutriation dataset in the same study. After removing all the genes with missing values in the ALPHA and ELU datasets, we obtained complete data matrices that contain 4,304 genes and 18 experiments and 4,304 genes and 14 experiments, respectively. ALPHA+ELU dataset was used for the examination of the additional sample effects as studied Oba et al. [5]. Gasch dataset was obtained from the study of genomic expression responses to DNA damage [14]. After removing all genes with missing values, a complete data matrix with 2641 genes and 44 experiments was prepared for this study. For the simulation study, 1% and 5% missing observations were randomly generated in these data sets.

We use the normalized root mean squares error (NRMSE) to evaluate the performances of the missing value imputation approaches, computed by

NRMSE = \sqrt{m e a n [{(y_{guess} - y_{answer})}^{2}]} / s d (y_{answer}),

where y_guessis the imputed value and y_answerthe true value.

Experimental results

Figures 1 and 2 show the plots of NRMSE vs. k for the ELU Spellman data set, when the missing rates are 1% and 5%, respectively. We compare our RLSP with kNNimpute, LLSimpute and BPCA. For a large k, both LLSimpute and RLSP show highly competitive results and perform best compared to kNNimpute and BPCA. However, for a smaller k, RLSP performed much better than LLSimpute. LLSimpute shows a high peak when k is close to the number of samples. Because highly correlated k genes are usually selected in LLSimpute, the poor performance of LLSimpute is probably due to multi-collinearity of the selected k genes.

Figures 3 and 4 show the smallest NRMSE values for four data sets. Figure 3 represents the results of the case of 1% missing rate. LLSimpute and RLSP showed the similar results and the best performances in ALPHA and ELU data sets. For ALPHA+ELU and Gasch data, BPCA showed a little bit better performance than RLSP and LLSimpute, but it is competitive to LLSimpute and RLSP. Figure 4 shows the results of the case of the 5% missing rate. RLSP, LLSimpute and BPCA show competitive performances for ALPHA and ELU data sets. For large datasets such as ALPHA+ELU and Gasch data sets, BPCA showed a little bit better performance. kNNimpute showed the worst performance in all data sets.

We presume that the differences in the performance of missing value imputation methods are highly dependent on the data set as well as the value of k. When a moderate value of k is selected, say when k is close to the number of samples, the proposed RLSP method outperforms the LLSimpute method on all data sets (data not shown). As the missing rate increases, NRMSEs increase rapidly for all methods and the performances of all four methods become worse.

Conclusion

The proposed RLSP method was motivated by a similar idea to that of the LLSimpute method. Both methods use the information from the selected k genes to estimate missing observations. LLSimpute uses the least squares method using the selected k genes. On the other hand, RLSP uses the principal components of the selected genes instead of the original k genes, and employs the quantile regression model for a robust analysis. The use of the principal components leads to a large difference between the two methods. The performance of LLSimpute is poor when k is small (near the number of samples). Assuming that multi-collinearity is probably the main cause for the poor performance of LLSimpute, RLSP addresses this problem using the principal components and then applying the robust regression approach to reduce the effect of outliers. In summary, RLSP showed more stable performance than LLSimpute for all data sets in our comparative studies.

The performance of RLSP may depend on the value of p, the number of principal components. By varying the value of p, we examined its effect on the parameter estimators. The result showed that the performance of RLSP is optimal when the value of p is close to the number of the sample size (data not shown). A similar result was obtained from a previous study of the BPCA method [5]. However, since the imputation procedure is executed for each missing value of a gene, the optimal value of p may differ from gene to gene. Selecting an optimal value for each missing value requires intensive computation. Thus, we recommend using the number of sample size as the value of p for practical application, although we expect that an appropriate choice of p would improve the performance of the RLSP method.

In terms of computational efficiency, although RLSP and BPCA showed competitive results, BPCA required a higher computational demand due to the EM-like repetitive algorithm. In addition, RLSP seemed less computationally intensive than CMVE. However, a further study based on the same platform would be desirable for the systematic comparison.

The presented method consists of three separate steps, where the first step applied L1 metric to select similar genes, the second step performs PCA on the selected set, which is a L2 method, and finally in the third step L1 metric is applied again to perform robust regression. Among the several combinations of metrics, the proposed combination provided the minimum NRMSE and provided the most computationally efficient result.

The main motivation of the robust regression was to reduce the effect of outliers in estimation of missing observations. Our empirical studies demonstrated that the effect of outliers were not large enough to cause huge differences between robust regression and ordinary regression. Among the several robust regression methods including Tukey's bi-weight M-estimator, the proposed quantile regression using the 50th percentile provided the best result. However, a further study on selecting the better robust method is desirable.

Finally, most missing imputation methods for microarray data assume the simple missing data mechanism to be the so called 'missing completely at random' [15]. However, this mechanism may assume too much to be expected to hold in real applications. Therefore, more complicated methods are required for handling other possible missing data mechanisms. By incorporating the missing data mechanism or missing patterns in the microarray data, we could improve the performance of the missing imputation method.

References

Ouyang M, Welsh WJ, Georgopoulos P: Gaussian mixture clustering and imputation of microarray data. Bioinformatics 2004, 20(6):917–923. 10.1093/bioinformatics/bth007
Article CAS PubMed Google Scholar
Bo TH, Dysvik B, Jonassen I: LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res 2004, 32(3):e34. 10.1093/nar/gnh026
Article PubMed Central PubMed Google Scholar
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman R: Missing value estimation methods for DNA microarrys. Bioinformatics 2001, 17(6):520–525. 10.1093/bioinformatics/17.6.520
Article CAS PubMed Google Scholar
Kim H, Golub GH, Park H: Missing Value Estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics 2005, 21(2):187–198. 10.1093/bioinformatics/bth499
Article CAS PubMed Google Scholar
Oba S, Sato M, Takemasa I, Monden M, Matsubara K, Ishii S: A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 2003, 19(16):2088–2096. 10.1093/bioinformatics/btg287
Article CAS PubMed Google Scholar
Sehgal MS, Gondal I, Dooley LS: Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data. Bioinformatics 2005, 21(10):2417–2423. 10.1093/bioinformatics/bti345
Article CAS PubMed Google Scholar
Gan X, Liew AW, Yan H: Microarray missing data imputation based on a set theoretic framework and biological knowledge. Nucleic Acids Res 2006, 34(5):1608–1619. 10.1093/nar/gkl047
Article PubMed Central CAS PubMed Google Scholar
Tuikkala J, Elo L, Nevalainen OS, Aittokallio T: Improving missing value estimation in microarray data with gene ontology. Bioinformatics 2006, 22(5):566–572. 10.1093/bioinformatics/btk019
Article CAS PubMed Google Scholar
Johansson P, Hakkinen J: Improving missing value imputation of microarray data by using spot quality weights. BMC Bioinformatics 2006, 7():306. 10.1186/1471-2105-7-306
Article PubMed Central PubMed Google Scholar
Koenker R, Hallock K: Quantile Regression. Journal of Economic Perspectives 2001, 15: 143–156.
Article Google Scholar
Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 1998, 9: 3273–3297.
Article PubMed Central CAS PubMed Google Scholar
Gasch AP, Huang M, Metzner S, Bostein D, Elledge SJ, Brown PO: Genomic expression responses to DNA damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol Biol Cell 2001, 12: 2987–3003.
Article PubMed Central CAS PubMed Google Scholar
Yeast Cell Cycle Analysis Project[http://cellcycle-www.stanford.edu]
The web supplement to Gasch et al[http://www-genome.stanford.edu/Mec1]
Little RJA, Rubin DB: Statistical analysis with missing data. 2nd edition. Wiley, Hoboken, New Jersey; 2002.
Google Scholar

Download references

Acknowledgements

The authors would like to thank the editors and two anonymous referees whose comments were extremely helpful. The authors also would like to thank Dr. Sehgal for many constructive discussions on the CMVE program. The work was supported by the National Research Laboratory Program of Korea Science and Engineering Foundation (M10500000126) and the Brain Korea 21 Project of the Ministry of Education.

This article has been published as part of BMC Bioinformatics Volume 8, Supplement 2, 2007: Probabilistic Modeling and Machine Learning in Structural and Systems Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S2.

Author information

Authors and Affiliations

Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea
Dankyu Yoon
Department of Statistics, College of Natural Science, Seoul National University, San 56-1, Shin Lim-Dong, Kwanak-ku, Seoul, 151-742, Korea
Eun-Kyung Lee & Taesung Park

Authors

Dankyu Yoon
View author publications
You can also search for this author in PubMed Google Scholar
Eun-Kyung Lee
View author publications
You can also search for this author in PubMed Google Scholar
Taesung Park
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Taesung Park.

Additional information

Authors' contributions

All authors contributed to the development of RLSP method and the comparative studies with previously proposed imputation methods.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Yoon, D., Lee, EK. & Park, T. Robust imputation method for missing values in microarray data. BMC Bioinformatics 8 (Suppl 2), S6 (2007). https://doi.org/10.1186/1471-2105-8-S2-S6

Download citation

Published: 03 May 2007
DOI: https://doi.org/10.1186/1471-2105-8-S2-S6

Probabilistic Modeling and Machine Learning in Structural and Systems Biology

Robust imputation method for missing values in microarray data