The expression level of a gene can be regarded as an estimate of the amount of protein it produces in a given cellular state. Different technologies can be used to measure the expression level of a gene. One of the most important is microarray technology, which allows the simultaneous measurement of the expression levels of thousands of genes [1].
As with many types of experimental data, gene expression data obtained from microarray experiments often contain missing values (MVs) [2-5]. This can occur for several reasons: insufficient resolution, image corruption, fabrication errors, poor hybridization, or contaminants due to dust or scratches on the chip. However, many standard methods for gene expression data analysis, including some classification and clustering techniques, require a complete data matrix as input. Thus, in such a context, methods for handling missing data are needed.
The simplest way of dealing with MVs is to discard the observations that contain them. However, this method is practical only when (1) the data contain a relatively small number of observations containing MVs, or when (2) the analysis of the complete examples will not lead to a serious bias during the inference [6,7]. Neither is viable in the context of microarray data. For example, as pointed out in [5,8], it is common for gene expression data to have up to 5% MVs, which could affect up to 90% of the genes.
Thus, in the microarray setting, instead of repeating the biological experiment (too expensive) or discarding all observations with missing values (negative impact on downstream analyses), many MV imputation methods have been proposed in the literature [4,5]. MV imputation constitutes an entire class of procedures that aim to supply the MVs with estimated ones [7].
The simplest imputation algorithms consist of replacing the MVs by zero or by the corresponding row/column average [9]. However, as discussed in [2], because such methods do not take into account the correlation structure of the data, they tend to perform poorly in terms of estimation accuracy, i.e., how close the estimated value is to the missing value. More complex algorithms that employ gene correlations have been proposed to minimize this problem. These methods include, for example, the weighted k-nearest neighbor (WKNN) [2], local least squares (LLS) [10], expectation maximization approach (EM_array) [11] and Bayesian principal component analysis (BPCA) [12] procedures.
Indeed, in the past few years, several papers have presented systematic evaluations of different imputation algorithms [3-5,13-16]. In general, the validation is accomplished by calculating various performance indices about the relation between the imputed and the (known) original values [16]. Initially, most imputation algorithms were evaluated with an emphasis on the accuracy of their imputation (fidelity to the true expression values), using metrics such as the root mean squared error (RMSE).
However, it can be argued that the success of the expression value estimation should be evaluated also in more practical terms [15,16]. For instance, one can consider the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. In fact, in practical terms of the downstream objective of the experiment (e.g., clustering or classification), if the differences between the outcomes are biologically insignificant, then it is irrelevant whether an MVs imputation improvement is statistically significant or not. Furthermore, in most studies, missing values in microarray data sets are assumed to be missing at random. By following this assumption, the great majority of the aforementioned papers apply imputation methods considering a uniform distribution of missing values on genes and microarrays. This, however, is not a realistic assumption, since missing values tend to arise in a systematic manner in practice [16]. Therefore, these studies are not able to model gene- or array-specific artifacts that induce missing values. Because our analysis is based on an actual MV distribution, we are not susceptible to biases introduced by artificial imputation experiments.
Regarding related work, [17] evaluated the impact of six imputation methods on clustering analyses using eight yeast cDNA microarray datasets, including both time series and steady-state experiments. They performed a cluster analysis on the imputed data using the k-means algorithm. The partitions generated after imputation were compared with those obtained originally on the complete datasets. The authors showed that even when there are clear differences in the accuracy of imputation, as assessed using RMSE, such differences could become insignificant when the methods are evaluated in terms of how well they reproduce the original gene clusters, or in their biological interpretations.
With respect to classification, the most closely related research is that in [13], who applied three imputation algorithms, WKNN, LLS and BPCA, to five different cancer gene expression datasets. The classification accuracy was estimated using three different types of classifiers: support vector machine (SVM), k-nearest neighbor (kNN), and classification and regression tree (CART). They showed that, except for replacement by zeros, the imputation algorithm made little difference in classification.
A limitation of the work in [17] and [13] is that both were based on a small selection of datasets, and presented no statistical evidence for their conclusions. In contrast, we have performed the first systematic comparison of classical missing value imputation methods by following a typical microarray data analysis workflow. More specifically, we analyzed the impact of five missing data imputation methods on several clustering and classification algorithms applied to 12 cancer gene expression datasets. We followed the preprocessing protocol suggested in [18] to do so; that is, we first discarded the genes with missing data for more than some particular number of observations. Next, we replaced the remaining MVs using each of the five data imputation approaches, followed by filtering the genes with low variation. Moreover, for the first time in this field, we used a statistical framework to evaluate whether the differing imputation methods significantly affected the performances of the various classification/clustering methods.