- Research article
- Open Access
Missing value imputation improves clustering and interpretation of gene expression microarray data
- Johannes Tuikkala^{1},
- Laura L Elo^{2, 3},
- Olli S Nevalainen^{1} and
- Tero Aittokallio^{2, 3, 4}
https://doi.org/10.1186/1471-2105-9-202
© Tuikkala et al; licensee BioMed Central Ltd. 2008
- Received: 06 August 2007
- Accepted: 18 April 2008
- Published: 18 April 2008
Abstract
Background
Missing values frequently pose problems in gene expression microarray experiments as they can hinder downstream analysis of the datasets. While several missing value imputation approaches are available to the microarray users and new ones are constantly being developed, there is no general consensus on how to choose between the different methods since their performance seems to vary drastically depending on the dataset being used.
Results
We show that this discrepancy can mostly be attributed to the way in which imputation methods have traditionally been developed and evaluated. By comparing a number of advanced imputation methods on recent microarray datasets, we show that even when there are marked differences in the measurement-level imputation accuracies across the datasets, these differences become negligible when the methods are evaluated in terms of how well they can reproduce the original gene clusters or their biological interpretations. Regardless of the evaluation approach, however, imputation always gave better results than ignoring missing data points or replacing them with zeros or average values, emphasizing the continued importance of using more advanced imputation methods.
Conclusion
The results demonstrate that, while missing values are still severely complicating microarray data analysis, their impact on the discovery of biologically meaningful gene groups can – up to a certain degree – be reduced by using readily available and relatively fast imputation methods, such as the Bayesian Principal Components Algorithm (BPCA).
Keywords
- Gene Ontology
- Support Vector Regression
- Cluster Result
- Imputation Method
- Biological Interpretation
Background
During the past decade, microarray technology has become a major tool in functional genomics and biomedical research. It has been successfully used, for example, in genome-wide gene expression profiling [1], tumor classification [2], and construction of gene regulatory networks [3]. Gene expression data analysts currently have a wide range of computational tools available to them. Cluster analysis is typically one of the first exploratory tools used on a new gene expression microarray dataset [4]. It allows researchers to find natural groups of genes without any a priori information, providing computational predictions and hypotheses about functional roles of unknown genes for subsequent experimental testing. Clusters of genes are often given a biological interpretation using the Gene Ontology (GO) annotations which are significantly enriched for the genes in a given cluster. The success of such an analytical approach, however, heavily depends on the quality of the microarray data being analyzed.
Although gene expression microarrays have developed much during the past years, the technology is still rather error prone, resulting in datasets with compromised accuracy and coverage. In particular, the existence of missing values due to various experimental factors still remains a frequent problem especially in cDNA microarray experiments. If a complete dataset is required, as is the case for most clustering tools, data analysts typically have three options before carrying out analysis on the data: they can either discard the genes (or arrays) that contain missing data, replace missing data values with some constant (e.g. zero), or estimate (i.e. impute) values of missing data entries. Many imputation methods are available that utilize the information present in the non-missing part of the dataset. Such methods include, for example, the weighted k-nearest neighbour approach [5] and the local least squares imputation [6]. Alternatively, external information in the form of biological constraints [7], GO annotations [8] or additional microarray datasets [9], can be used to improve the accuracy of these traditional methods, provided relevant information is available for the given experimental system or study organism.
While most of the imputation algorithms currently being used have been evaluated only in terms of the similarity between the original and imputed data points, we argue that the success of preprocessing methods should ideally be evaluated also in other terms, for example, based on clustering results and their biological interpretation, that are of more practical importance for the biologist. The motivation is that, even though there are substantial differences in the imputation accuracy between the methods at the measurement level, some of these differences may be biologically insignificant or simply originate from measurement inaccuracies, unless they can also be observed at the next step of the data analysis. Moreover, the imputation methods have conventionally been developed and validated under the assumption that missing values occur completely at random. This assumption does not always hold in practice since the multiple experimental measurements (arrays) may involve variable technical and/or experimental conditions, such as differences in hybridization, media or time. Accordingly, the distribution of missing entries in many microarray experiments is highly non-random, which may have resulted in conclusions regarding the relative merits of the different imputation methods being drawn too hastily.
In this paper, we systematically evaluate a number of imputation strategies based on their ability to produce complete datasets for the needs of partitional clustering. We compare seven imputation algorithms for which ready-to-use implementation is freely available and easily accessible to the microarray community. Beyond the imputation accuracy on the measurement level, we evaluate the methods in terms of their ability to reproduce the gene partitions and the significant GO terms discovered from the original datasets. The imputation methods are compared on eight real microarray datasets, consisting of microarray designs often encountered in practice, such as time series and steady state experiments. The diversity of these datasets allows us to investigate the effects on the imputation results of various data properties, such as the number of measurements per gene, missing value rate and distribution, as well as their correlation structure. Some recommendations for the microarray data analysts on the use of imputation methods in different situations are given in the discussion.
Results
Table 1. Datasets.
Name | N | M | M_{C} | M_{F} | MV_{SD} | MV | Type | PC1 |
---|---|---|---|---|---|---|---|---|
Brauer05 | 19 | 6256 | 3924 | 3066 | 4.0 | 6.7% | MT | 54.9% |
Ronen05 | 26 | 7070 | 4916 | 2695 | 3.2 | 3.8% | MT | 51.1% |
Shapira04A | 23 | 4771 | 2970 | 2090 | 3.9 | 2.7% | TS | 62.0% |
Shapira04B | 14 | 4771 | 3340 | 2898 | 4.2 | 3.0% | TS | 54.1% |
Hirao03 | 8 | 6229 | 5913 | 259 | 0.7 | 0.9% | SS | 43.3% |
Yoshimoto02 | 24 | 6102 | 4379 | 2323 | 1.9 | 3.2% | MT | 64.7% |
Wyrick99 | 7 | 6180 | 6169 | 3600 | 0.0 | 0.0% | TS | 61.3% |
Spellman98E | 14 | 6075 | 5766 | 1094 | 0.4 | 0.4% | TS | 39.9% |
Table 2. Imputation methods. The running times were calculated for the Ronen05 dataset with 10% of missing values (an Intel C2D T7200 @ 2 GHz with 2 GB RAM was used).
Imputation method | Implementation | Running time | Reference | URL |
---|---|---|---|---|
Support Vector Regression (SVR) | C++ | 940 s | [17] | [32] |
Iterated Local Least Squares (iLLS) | Matlab | 938 s | [16] | [33] |
Local Least Squares (LLS) | Matlab | 334 s | [6] | [34] |
Bayesian Principal Component Algorithm (BPCA) | Matlab | 197 s | [18] | [35] |
k Nearest Neighbor (KNN) | C++ | 16 s | [5] | [36] |
Row Average (RAVG) | Matlab | < 1 s | [6] | [34] |
Zero imputation (ZERO) | Matlab | < 1 s | [5] | [37] |
Running times of the different imputation methods in an example dataset (Ronen05) are presented in Table 2 for the missing value rate of 10%. The Bayesian inference-based BPCA imputation was the fastest among the more advanced imputation methods (SVR, LLS, iLLS, and BPCA). However, the simple kNN was still about 10 times faster than BPCA. The running time of the original LLS method was about one third of that of its iterative version iLLS. The support vector-based SVR and the iterative iLLS algorithms were clearly the slowest imputation methods. On rare occasions, the iLLS method produced estimates for missing values which were up to ten times larger than the original values. This seems to suggest an anomaly in the implementation of the method or in its operation.
Agreement with the original data values
The imputation accuracy of the more advanced imputation methods (LLS, iLLS, BPCA, and SVR) depended heavily on the properties of the dataset being imputed, and therefore it was difficult to find a clear winner among these methods. However, if we count the number of times that a method was best or second best across all of the datasets, then iLLS and BPCA were the two best methods when assessed in terms of the NRMSE imputation accuracy. A limitation of the iLLS method is that its implementation produces, in some cases, inconsistent results which led to a large variation among the replicate datasets (see e.g. Hirao03 and Ronen05 in Figure 2). As the BPCA imputation was relatively fast and provided robust estimates over a wide range of different situations, it could be recommended in cases where performance is measured solely by imputation accuracy.
Agreement with the original clustering results
Effects of random missing value generation and cluster initialization
Discussion
We have carried out a practical comparison of recent missing value imputation methods by following the typical microarray data analysis workflow: the given dataset is first clustered and then the resulting gene groups are interpreted in terms of their enriched GO annotations. Our key finding was that clear and consistent differences can be found between the imputation methods when the imputation accuracy is evaluated at the measurement level using the NRMSE; however, similar differences were not found when the success of imputation was assessed in terms of ability to reproduce original clustering solutions or biological interpretations using the ADBP error on the gene groups or GO terms, respectively. The observed dependence of the NRMSE on the properties of a dataset can seriously bias the evaluation of imputation algorithms. In fact, it enables one to select a dataset that favours one's own imputation method. Another potential pitfall lies with the missing value distribution. If the imputation algorithm uses the assumption of completely random missing values, for example in parameter estimation, when in fact this is not the case, it may lead to sub-optimal imputation results.
Regardless of the evaluation approach used, our results strongly support earlier observations that imputation is always preferable to ignoring the missing values or replacing them with zeros or average values. In addition to the NRMSE validation, Jörnsten et al. examined the effect of imputation on the significance analysis of differential expression; they observed that missing values affect the detection of differentially expressed genes and that the more sophisticated methods, such as BPCA and LinCmb, are notably better than the simple RAVG or kNN imputation methods [10]. Scheel et al. also studied the influence of missing value imputation on the detection of differentially expressed genes from microarray data; they showed that the kNN imputation can lead to a greater loss of differentially expressed genes than if their LinImp method is used, and that intensity-dependent missing values have a more severe effect on the downstream analysis than missing values occurring completely at random [11]. Wang et al. studied the impact of imputation on the related downstream analysis, disease classification; they discovered that while the ZERO imputation resulted in poor classification accuracy, the kNN, LLS and BPCA imputation methods only varied slightly in terms of classification performance [12]. Shi et al. also studied the effect of missing value imputation on classification accuracy; they discovered that the BPCA and iLLS imputation methods could estimate the missing values well enough to reach the classification accuracy achieved on the original complete data [13].
To our knowledge, only one other study has investigated the effect of missing values and their imputation on the preservation of clustering solutions. de Brevern et al. concentrated on hierarchical clustering and the kNN imputation method only and did not consider biological interpretations of the clustering results; their main findings were that even a small amount of missing values may dramatically decrease the stability of hierarchical clustering algorithms and that the kNN imputation clearly improves this stability [14]. Our results are in good agreement with these findings. However, our specific aim was to investigate the effect of missing values on the partitional clustering algorithms, such as k-means, and to find out whether more advanced imputation methods, such as LLS, SVR and BPCA, can provide better clustering solutions than the traditional kNN approach. Our results suggest that BPCA provides fast, robust and accurate results, especially when the missing value rate is lower than 5%. None of the imputation methods could reasonably correct for the influence of missing values above this 5% threshold. In these cases, one should consider removing the genes with many missing values or repeating the experiments if possible.
A number of different clustering algorithms have been proposed for the exploratory investigation of gene expression microarray data [4]. We chose the k-means algorithm since it is very fast, unlike e.g. hierarchical clustering, is widely used among biologists, and produces results which are relatively straightforward to interpret. The fast running time of k-means was especially important because of the large number of clusterings performed. Recent comparative studies have also demonstrated that partitional clustering methods often produce more meaningful clustering solutions than hierarchical clustering methods [4]. To avoid using random starting points in k-means clustering, we used the KKZ initialization since it has been found to perform better than many other methods [15]. Due to the extensive computational requirements, we could not afford to do many random cluster initializations. In future studies, it would be interesting to compare more sophisticated clustering methods, such as fuzzy c-means, together with alternative approaches for selecting good initializations and appropriate numbers of clusters for each dataset.
Conclusion
Missing values have remained a frequent problem in microarray experiments and their adverse effect on the clustering is beyond the capacity of simple approaches, like ignoring the missing data points or replacing them with zeros or average values. Biological interpretation of the gene clusters can, up to a certain missing value rate, be preserved by using readily available and relatively fast imputation methods, such as the Bayesian Principal Components Algorithm (BPCA).
Methods
Imputation methods
Relative expression data produced by cDNA microarrays can be represented as an M × N matrix $G={({g}_{ij})}_{i,j=1}^{M,N}$, in which the entries of G are the log_{2} expression ratios for M genes ${g}_{i}={({g}_{ij})}_{j=1}^{N}$ and N measurements (i.e. the rows represent the genes and the columns stand for the different time points or conditions). In order to estimate a missing value in G, one typically first selects k genes that are somehow similar to the gene g_{i} with the missing entry. A number of statistical techniques are available to estimate the missing value on the basis of such neighbouring genes. In one of the earliest methods, known as the weighted k-Nearest Neighbour (kNN) imputation, the estimate is calculated as a weighted average over the corresponding values of the neighbouring genes [5]. The average Euclidean distance (Eq. 2) is typically applied both in the selection and weighting of the neighbours, although in principle any other dissimilarity measure between the genes could be used instead. In the comparisons between different missing value estimation methods, we used as references two simple imputation strategies: In Zero Imputation (ZERO), missing values are always replaced by a zero value, whereas Row Average (RAVG) replaces missing values by the average value of the non-missing values of the particular gene. These simple approaches are also being used as pre-processing steps in the more advanced imputation algorithms detailed below.
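The ZERO, RAVG and weighted kNN strategies described above can be illustrated with a minimal Python sketch. It is not the implementation compared in this study: missing entries are assumed to be encoded as NaN, the inverse-distance weighting is an assumption (the text only states that the distance is used for weighting), and the function names are hypothetical.

```python
import numpy as np

def zero_impute(G):
    """ZERO: replace every missing entry (NaN) with zero."""
    G = np.asarray(G, dtype=float)
    return np.where(np.isnan(G), 0.0, G)

def row_average_impute(G):
    """RAVG: replace missing entries with the mean of the gene's observed values."""
    G = np.asarray(G, dtype=float)
    row_means = np.nanmean(G, axis=1, keepdims=True)
    return np.where(np.isnan(G), row_means, G)

def knn_impute(G, k=10):
    """Weighted kNN sketch; inverse-distance weights are an assumption."""
    G = np.asarray(G, dtype=float)
    out = G.copy()
    for i, j in zip(*np.where(np.isnan(G))):
        scored = []
        for r in range(G.shape[0]):
            # A neighbour must differ from gene i and have a value in column j.
            if r == i or np.isnan(G[r, j]):
                continue
            shared = ~np.isnan(G[i]) & ~np.isnan(G[r])
            if shared.any():
                # Average Euclidean distance over the shared positions.
                d = float(np.sqrt(np.mean((G[i, shared] - G[r, shared]) ** 2)))
                scored.append((d, r))
        scored.sort()
        nearest = scored[:k]
        if not nearest:
            continue  # no usable neighbour: the entry stays missing
        dists = np.array([d for d, _ in nearest])
        values = np.array([G[r, j] for _, r in nearest])
        weights = 1.0 / (dists + 1e-12)  # small constant avoids division by zero
        out[i, j] = float(np.sum(weights * values) / np.sum(weights))
    return out
```

For a toy matrix such as `[[1, 2, nan], [1, 2, 3], [5, 6, 7]]`, the first gene is identical to the second on the shared positions, so `knn_impute(G, k=1)` copies the value 3 from that neighbour, whereas RAVG would fill in 1.5 and ZERO would fill in 0.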
The Local Least Squares (LLS) imputation is a regression-based estimation method that takes into account the local correlation structure of the data [6]. The algorithm can also automatically select the appropriate number of neighbouring genes. Its modified version, the iterated Local Least Squares (iLLS) imputation, uses iteratively data from each imputation step as an input for the next iteration step [16]. Both the LLS and iLLS methods use RAVG imputation of missing entries prior to the primary missing value estimation if the proportion of missing values is deemed too high to construct a reliable complete data matrix. In the Support Vector Regression (SVR) imputation, the neighbouring genes are first selected using a specific kernel function, the so-called radial basis function, and an estimate of the missing value is calculated using a relatively time-consuming quadratic optimization [17]. SVR uses the ZERO imputation of missing entries prior to the primary imputation but it also codes the entries so that the relative importance of ZERO imputed entries is attenuated when the SVR imputation has been completed for the other missing entries. Among these more advanced imputation methods, the Bayesian Principal Components Algorithm (BPCA) is appealing because, although it involves Bayesian estimation together with the iterative expectation maximization algorithm, its application is relatively fast and straightforward due to the fact that it does not contain any adjustable parameters if default settings are used [18]. In the BPCA method, all of the missing entries are first imputed with the RAVG imputation to obtain a complete data matrix.
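As an illustration of the regression idea behind LLS, the following sketch estimates a single missing entry of a target gene from k already-selected neighbouring genes. It assumes the commonly described least-squares formulation (the target gene is expressed as a linear combination of its neighbours on the observed columns, and the same coefficients are applied to the missing column); neighbour selection, the automatic choice of k, and the handling of neighbours with missing values are all omitted. The function name and argument layout are hypothetical.

```python
import numpy as np

def lls_estimate(target, neighbours, missing_col):
    """Estimate target[missing_col] from complete neighbour profiles.

    target      : 1-D array of length N with a missing value at missing_col
    neighbours  : k x N array of complete profiles of the selected neighbours
    """
    target = np.asarray(target, dtype=float)
    neighbours = np.asarray(neighbours, dtype=float)
    observed = np.ones(target.shape[0], dtype=bool)
    observed[missing_col] = False
    A = neighbours[:, observed]      # neighbour values on the observed columns
    b = neighbours[:, missing_col]   # neighbour values on the missing column
    w = target[observed]             # observed part of the target gene
    # Least-squares coefficients x such that A^T x approximates w.
    x, *_ = np.linalg.lstsq(A.T, w, rcond=None)
    return float(b @ x)
```

When the target gene is an exact linear combination of its neighbours, the sketch recovers the hidden value exactly; on real data the estimate is only as good as the local correlation structure allows.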
All of the imputation methods included in the comparisons are freely available on the internet (the links to the websites are listed in Table 2). BPCA, LLS, iLLS, ZERO and RAVG are implemented with Matlab, whereas SVR and kNN are implemented with C++. The kNN imputation method was run with default parameter (k = 10), and SVR with default parameters (c = 8, g = 0.125, r = 0.01), since their optimization is not trivial and could have biased the results. The automatic parameter estimator was used with LLS and iLLS. The other methods do not contain any free parameters.
Clustering method
1. The initial k centroids are chosen either randomly or using a given initialization rule.
2. Each gene is assigned to the cluster whose centroid is closest to its expression profile.
3. Centroids are re-calculated as arithmetic means of the genes belonging to the clusters.
Steps 2 and 3 of the algorithm are iterated until the clusters become stable or until a given number of steps have been performed. As the standard k-means algorithm does not include any rules for the selection of the initial centroids, we used the so-called KKZ (named after Katsavounidis, Kuo, and Zhang) initialization algorithm [19]. The KKZ method selects the first centroid C_{1} as the gene with maximal Euclidean norm, i.e. C_{1} ≡ argmax_{i} ||g_{i}||. Each of the subsequent centroids C_{j}, for j = 2,..., k, is iteratively selected as the gene whose distance to the closest centroid is maximal among the remaining genes. The KKZ method has turned out to be among the best initialization rules developed for the partitional clustering algorithms [15]. In addition to the KKZ initialization, we also tested random starting points in the k-means clustering (these results are provided in Additional Files 1 and 2). The k-means algorithm and KKZ initialization were implemented with the Java programming language.
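The three k-means steps and the KKZ rule described above can be sketched as follows. This is a minimal Python illustration rather than the Java implementation used in the study; the function names, the `max_iter` limit and the choice to keep the previous centroid of an empty cluster are assumptions.

```python
import numpy as np

def kkz_init(X, k):
    """KKZ: start from the profile with maximal Euclidean norm, then repeatedly
    add the profile whose distance to its closest chosen centroid is maximal."""
    X = np.asarray(X, dtype=float)
    first = int(np.argmax(np.linalg.norm(X, axis=1)))
    chosen = [first]
    min_dist = np.linalg.norm(X - X[first], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen].copy()

def kmeans(X, k, max_iter=100):
    """Plain k-means with KKZ initialization on a complete data matrix."""
    X = np.asarray(X, dtype=float)
    centroids = kkz_init(X, k)                      # step 1
    labels = np.full(X.shape[0], -1)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)           # step 2
        if np.array_equal(new_labels, labels):
            break                                   # clusters are stable
        labels = new_labels
        for c in range(k):                          # step 3
            members = X[labels == c]
            if len(members):                        # empty cluster keeps its centroid
                centroids[c] = members.mean(axis=0)
    return labels, centroids
```

Because KKZ is deterministic, repeated runs on the same complete matrix give the same partition, which is convenient when partitions of original and imputed datasets are to be compared.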
Dissimilarity measures
The Euclidean distance is the simple L_{2} norm on vectors and it measures the absolute squared difference between two gene profiles g_{i} and g_{j}. Since we use standardized gene expression profiles, the Euclidean distance and Pearson correlation yield identical clustering results.
In the average Euclidean distance $\overline{d(i,j)}$, which accounts for missing values, δ(g_{il}, g_{jl}) equals one if the values g_{il} and g_{jl} are both present in genes i and j, and zero otherwise. A is the number of cases where there is a missing value in either of the genes, i.e., $A=\sum_{l=1}^{N}(1-\delta(g_{il},g_{jl}))$.
The average Euclidean distance is also used in the selection of the neighbouring genes in the kNN imputation. The problem with this modified measure is its significant loss of information when the number of missing values increases. In the worst case when, for a given gene pair (i, j), N equals A (i.e. there is a missing value in either or both genes at every position), it is not possible to calculate the distance, even though half of the values of each gene may still be present. In this case, we define $\overline{d(i,j)}=\infty$.
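A minimal Python sketch of the average Euclidean distance is given below. It assumes the form $\overline{d(i,j)}=\sqrt{\frac{1}{N-A}\sum_{l=1}^{N}\delta(g_{il},g_{jl})(g_{il}-g_{jl})^{2}}$, which is consistent with the definitions of δ and A above and with the distance becoming undefined when N equals A, but the exact expression should be checked against the original equation. Missing values are assumed to be encoded as NaN and the function name is hypothetical.

```python
import math
import numpy as np

def average_euclidean_distance(gi, gj):
    """Average Euclidean distance over positions present in both profiles."""
    gi = np.asarray(gi, dtype=float)
    gj = np.asarray(gj, dtype=float)
    present = ~np.isnan(gi) & ~np.isnan(gj)   # delta(g_il, g_jl) == 1
    n_shared = int(present.sum())             # equals N - A
    if n_shared == 0:
        return math.inf                       # N == A: distance defined as infinity
    diff = gi[present] - gj[present]
    return float(np.sqrt(np.sum(diff ** 2) / n_shared))
```

The worst case described above is easy to reproduce: the profiles `[1, nan, 3, nan]` and `[nan, 2, nan, 4]` each retain half of their values, yet they share no position, so the function returns infinity.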
Validation methods
Here, b is the number of genes in the cluster, K is the number of genes in the cluster annotated with GO term t, B is the number of genes in the background distribution (i.e. the whole dataset), and T is the total number of genes in the background distribution annotated with GO term t [23]. The p-value can be interpreted as the probability of randomly finding the same or higher number of genes annotated with the particular GO term from the background gene list. We used the enrichment p-values to find between 1 and 20 most enriched GO terms for each cluster of partitions U and V. These clusters of GO terms were compared using the ADBP measure as for the clusters of genes, that is, we quantified the proportion of GO terms present in the original data that were also discovered in the imputed dataset.
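The enrichment p-value can be illustrated with a short Python sketch. It assumes that the p-value is the upper tail of a hypergeometric distribution, which matches the interpretation given above (the probability of randomly drawing at least K annotated genes among b genes from a background of B genes, T of which carry the term); the function name is hypothetical.

```python
from math import comb

def enrichment_p_value(K, b, T, B):
    """Probability of at least K annotated genes in a random cluster of size b.

    K : annotated genes in the cluster      b : genes in the cluster
    T : annotated genes in the background   B : genes in the background
    """
    upper = min(b, T)
    tail = sum(comb(T, x) * comb(B - T, b - x) for x in range(K, upper + 1))
    return tail / comb(B, b)
```

For example, with a background of B = 10 genes of which T = 3 carry the term, a cluster of b = 2 genes that are both annotated (K = 2) has a p-value of 3/45, i.e. about 0.067.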
where mean() stands for the arithmetic mean of the elements in its argument array. The normalization by the root mean squared original values results in the ZERO imputation always having an NRMSE value of one, thus providing a convenient reference error level when comparing different imputation methods and making the NRMSE results comparable among different datasets.
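A minimal Python sketch of such an NRMSE is shown below. It assumes that both the error and the normalizing term are computed over the artificially removed entries only, which is what makes the ZERO imputation score exactly one; the function and argument names are hypothetical.

```python
import numpy as np

def nrmse(original, imputed, missing_mask):
    """Root mean squared imputation error divided by the root mean squared
    original values, both taken over the entries flagged in missing_mask."""
    original = np.asarray(original, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    mask = np.asarray(missing_mask, dtype=bool)
    err = imputed[mask] - original[mask]
    return float(np.sqrt(np.mean(err ** 2)) / np.sqrt(np.mean(original[mask] ** 2)))
```

With this definition a perfect imputation gives 0, the ZERO imputation gives 1, and values above 1 indicate estimates that are worse than replacing the missing entries with zeros.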
Datasets
We used eight yeast cDNA microarray datasets for this study (see Table 1). The Brauer05 dataset is from a recently published study of the yeast response to glucose limitation and it contains multiple time series measured under different experimental conditions [24]. The Ronen05 dataset is from a study of the yeast response to glucose pulses when limiting galactose and contains two time series [25]. Shapira04A and Shapira04B are two different time series datasets from a study of the effect of oxidative stress on the yeast cell cycle [26]. The Hirao03 dataset comprises non-time series data from a study of the identification of selective inhibitors of NAD^{+}-dependent deacetylases [27]. The Yoshimoto02 dataset comprises multiple time series data from a study of the yeast genome-wide analysis of gene expression regulated by the calcineurin signalling pathway [28]. The Wyrick99 dataset is from a study of the effect of nucleosomes and silencing factors on global gene expression [29]. The Spellman98E dataset is the elutriation part of a study identifying cell cycle-regulated genes in yeast [30], which has been used in many comparisons of microarray analysis tools.
In gene expression clustering, a typical preprocessing step is to remove those genes which remain constant over all of the conditions under analysis, as the microarray experiment cannot reliably provide any insight into possible functional roles of such genes. Accordingly, each dataset was first filtered so that genes with low variation in expression values were left out. More specifically, we filtered out those genes i for which max_{j} g_{ij} - min_{j} g_{ij} < 1.5. Datasets were then normalized on a gene-by-gene basis so that the average expression ratio of each gene became 0 and the standard deviation 1. As the Euclidean distance and Pearson correlation are directly proportional to each other after such standardization, and thus yield similar clustering results, we used only the Euclidean distance measure in the results.
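The filtering and gene-wise standardization described above can be sketched in Python as follows. The NaN-aware handling of missing entries and the use of the population standard deviation (ddof = 0) are assumptions, and the function names are hypothetical. Under these assumptions, the squared Euclidean distance between two complete standardized profiles of length N equals 2N(1 - r), where r is their Pearson correlation, which is one way to read the relationship between the two measures stated above.

```python
import numpy as np

def filter_low_variation(G, min_range=1.5):
    """Keep genes whose max - min expression ratio is at least min_range."""
    G = np.asarray(G, dtype=float)
    spread = np.nanmax(G, axis=1) - np.nanmin(G, axis=1)
    return G[spread >= min_range]

def standardize_genes(G):
    """Scale each gene (row) to mean 0 and standard deviation 1, ignoring NaN."""
    G = np.asarray(G, dtype=float)
    mean = np.nanmean(G, axis=1, keepdims=True)
    std = np.nanstd(G, axis=1, keepdims=True)   # population std (ddof = 0)
    return (G - mean) / std
```

Since a distance ranking based on 2N(1 - r) is the same as a ranking based on decreasing correlation, restricting the analysis to the Euclidean distance loses little for standardized data.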
Declarations
Acknowledgements
The work was supported by the Academy of Finland (grant 203632) and the Graduate School in Computational Biology, Bioinformatics, and Biometry (ComBi). Part of the computations presented in this work were made with the help of the computing environment of the Finnish IT Center for Science (CSC). The authors thank Dr. Milla Kibble for the language review of the manuscript.
Authors’ Affiliations
References
- DeRisi JL, Iyer VR, Brown PO: Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 1997, 278: 680–686. 10.1126/science.278.5338.680
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov P, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–537. 10.1126/science.286.5439.531
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nature Genetics 1999, 22: 281–285. 10.1038/10343
- D'haeseleer P: How does gene expression clustering work? Nature Biotechnology 2005, 23: 1499–1501. 10.1038/nbt1205-1499
- Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB: Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17: 520–525. 10.1093/bioinformatics/17.6.520
- Kim H, Golub GH, Park H: Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics 2005, 21: 187–198. 10.1093/bioinformatics/bth499
- Gan X, Liew AWC, Yan H: Microarray missing data imputation based on a set theoretic framework and biological knowledge. Nucleic Acids Research 2006, 34: 1608–1619. 10.1093/nar/gkl047
- Tuikkala J, Elo L, Nevalainen OS, Aittokallio T: Improving missing value estimation in microarray data with gene ontology. Bioinformatics 2006, 22: 566–572. 10.1093/bioinformatics/btk019
- Hu J, Li H, Waterman MS, Zhou XJ: Integrative missing value estimation for microarray data. BMC Bioinformatics 2006, 7: 449. 10.1186/1471-2105-7-449
- Jörnsten R, Wang HY, Welsh WJ, Ouyang M: DNA microarray data imputation and significance analysis of differential expression. Bioinformatics 2005, 21: 4155–4161. 10.1093/bioinformatics/bti638
- Scheel I, Aldrin M, Glad I, Sorum R, Lyng H, Frigessi A: The influence of missing value imputation on detection of differentially expressed genes from microarray data. Bioinformatics 2005, 21: 4272–4279. 10.1093/bioinformatics/bti708
- Wang D, Lv Y, Guo Z, Li X, Li Y, Zhu J, Yang D, Xu J, Wang C, Rao S, Yang B: Effects of replacing the unreliable cDNA microarray measurements on the disease classification based on gene expression profiles and functional modules. Bioinformatics 2006, 22: 2883–2889. 10.1093/bioinformatics/btl339
- Shi Y, Cai Z, Lin G: Classification accuracy based microarray missing value imputation. In Bioinformatics Algorithms: Techniques and Applications. Edited by: Mandoiu I, Zelikovsky A. New Jersey: Wiley-Interscience; 2007:303–328.
- de Brevern AG, Hazout S, Malpertuy A: Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering. BMC Bioinformatics 2004, 5: 114. 10.1186/1471-2105-5-114
- He J, Lan M, Tan CL, Sung SY, Low HB: Initialization of cluster refinement algorithms: a review and comparative study. Proc IEEE Int Joint Conf on Neural Networks, IEEE 2004.
- Cai Z, Heydari M, Lin G: Iterated local least squares microarray missing value imputation. Journal of Bioinformatics and Computational Biology 2006, 4: 1–23. 10.1142/S0219720006002302
- Wang X, Li A, Jiang Z, Feng H: Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme. BMC Bioinformatics 2006, 7: 32. 10.1186/1471-2105-7-32
- Oba S, Sato MA, Takemasa I, Monden M, Matsubara K, Ishii S: A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 2003, 19: 2088–2096. 10.1093/bioinformatics/btg287
- Al-Daoud M, Roberts S: New methods for the initialisation of clusters. Pattern Recognition Letters 1996, 17: 451–455. 10.1016/0167-8655(95)00119-0
- Handl J, Knowles J, Kell D: Computational cluster validation in post-genomic data analysis. Bioinformatics 2005, 21: 3201–3212. 10.1093/bioinformatics/bti517
- Kuhn H: The Hungarian method for the assignment problem. Naval Res Logist Quart 1955, 2: 83–97. 10.1002/nav.3800020109
- Lange T, Roth V, Braun ML, Buhmann JM: Stability-based validation of clustering solutions. Neural Computation 2004, 16: 1299–1323. 10.1162/089976604773717621
- Kankainen M, Brader G, Törönen P, Palva ET, Holm L: Identifying functional gene sets from hierarchically clustered expression data: map of abiotic stress regulated genes in Arabidopsis thaliana. Nucleic Acids Research 2006, 34: e124. 10.1093/nar/gkl694
- Brauer MJ, Saldanha AJ, Dolinski K, Botstein D: Homeostatic Adjustment and Metabolic Remodeling in Glucose-limited Yeast Cultures. Molecular Biology of Cell 2005, 16: 2503–2517. 10.1091/mbc.E04-11-0968
- Ronen M, Botstein D: Transcriptional response of steady-state yeast cultures to transient perturbations in carbon source. PNAS 2005, 103: 389–394. 10.1073/pnas.0509978103
- Shapira M, Segal E, Botstein D: Disruption of Yeast Forkhead-associated Cell Cycle Transcription by Oxidative Stress. Molecular Biology of Cell 2004, 15: 5659–5669. 10.1091/mbc.E04-04-0340
- Hirao M, Posakony J, Nelson M, Hruby H, Jung M, Simon JA, Bedalov A: Identification of selective inhibitors of NAD+-dependent deacetylases using phenotypic screens in yeast. J Biol Chem 2003, 278: 52773–52782. 10.1074/jbc.M308966200
- Yoshimoto H, Saltsman K, Gasch AP, Li HX, Ogawa N, Botstein D, Brown PO, Cyert MS: Genome-wide analysis of gene expression regulated by the calcineurin/Crz1p signaling pathway in Saccharomyces cerevisiae. J Biol Chem 2002, 277: 31079–31088. 10.1074/jbc.M202718200
- Wyrick JJ, Holstege FC, Jennings EG, Causton HC, Shore D, Grunstein M, Lander ES, Young RA: Chromosomal landscape of nucleosome-dependent gene expression and silencing in yeast. Nature 1999, 402: 418–421. 10.1038/46567
- Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of Cell 1998, 9: 3273–3297.
- Feten G, Almøy T, Aastveit AH: Prediction of missing values in microarray and use of mixed models to evaluate the predictors. Statistical Applications in Genetics and Molecular Biology 2005, 4: 10. 10.2202/1544-6115.1120
- SVR algorithm[http://202.38.78.189/downloads/svrimpute.html]
- iLLS algorithm[http://www.cs.ualberta.ca/~ghlin/src/WebTools/imputation.php]
- LLS algorithm[http://www-users.cs.umn.edu/~hskim/tools.html]
- BPCA algorithm[http://hawaii.naist.jp/~shige-o/tools/]
- KNN algorithm[http://function.princeton.edu/knnimpute/]
- ZERO imputation[http://users.utu.fi/jotatu/zero.m]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.