 Research
 Open Access
 Published:
Optimal clustering with missing values
BMC Bioinformatics volume 20, Article number: 321 (2019)
Abstract
Background
Missing values frequently arise in modern biomedical studies due to various reasons, including missing tests or complex profiling technologies for different omics measurements. Missing values can complicate the application of clustering algorithms, whose goals are to group points based on some similarity criterion. A common practice for dealing with missing values in the context of clustering is to first impute the missing values, and then apply the clustering algorithm on the completed data.
Results
We consider missing values in the context of optimal clustering, which finds an optimal clustering operator with reference to an underlying random labeled point process (RLPP). We show how the missingvalue problem fits neatly into the overall framework of optimal clustering by incorporating the missing value mechanism into the random labeled point process and then marginalizing out the missingvalue process. In particular, we demonstrate the proposed framework for the Gaussian model with arbitrary covariance structures. Comprehensive experimental studies on both synthetic and realworld RNAseq data show the superior performance of the proposed optimal clustering with missing values when compared to various clustering approaches.
Conclusion
Optimal clustering with missing values obviates the need for imputationbased preprocessing of the data, while at the same time possessing smaller clustering errors.
Background
Clustering has been a mainstay of genomics since the early days of geneexpression microarrays [1]. For instance, expression profiles can be taken over various tissue samples and then clustered according to the expression levels for each sample, the aim being to discriminate pathologies based on their differential patterns of gene expression [2]. In particular, modelbased clustering, which assumes that the data are generated by a finite mixture of underlying probability distributions, has gained popularity over heuristic clustering algorithms, for which there is no concrete way of determining the number of clusters or the best clustering method [3]. Modelbased clustering methods [4] provide more robust criteria for selecting the appropriate number of clusters. For example, in a Bayesian framework, utilizing Bayes Factor can incorporate both a priori knowledge of different models, and goodness of fit of the parametric model to the observed data. Moreover, nonparametric models such as Dirichletprocess mixture models [5] provide a more flexible approach for clustering, by automatically learning the number of components. In smallsample settings, modelbased approaches that incorporate model uncertainty have proved successful in designing robust operators [6–9], and in objectivebased experiment design to expedite the discovery of such operators [10–12].
Whereas classification theory is grounded in featurelabel distributions with the error being the probability that the classifier mislabels a point [6, 13]; clustering algorithms operate on random labeled point processes (RLPPs) with error being the probability that a point will be placed into the wrong cluster (partition) [14]. An optimal (Bayes) clusterer minimizes the clustering error and can be found with respect to an appropriate representation of the cluster error [15].
A common problem in clustering is the existence of missing values. These are ubiquitous with highthroughput sequencing technologies, such as microarrays [16] and RNA sequencing (RNAseq) [17]. For instance, with microarrays, missing data can occur due to poor resolution, image corruption, or dust or scratches on the slide [18], while for RNAseq, the sequencing machine may fail to detect genes with low expression levels owing to the random sampling nature of sequencing technologies. As a result of these missing data mechanisms, gene expression data from microarray or RNAseq experiments are usually in the form of large matrices, with rows and columns corresponding to genes and experimental conditions or different subjects, respectively, with some values missing. Imputation methods, such as MICE [19], Amelia II [20] and missForest [21], are usually employed to complete the data matrix before clustering analysis; however, in smallsample settings, which are common in genomic applications, these methods face difficulties, including colinearity due to potential high correlation between genes in samples, which precludes the successful imputation of missing values.
In this paper we follow a different direction by incorporating the generation of missing values with the original generating random labeled point process, thereby producing a new RLPP that generates the actual observed points with missing values. The optimal clusterer in the context of missing values is obtained by marginalizing out the missing features in the new RLPP. One potential challenge arising here is that in the case of missing values with general patterns, conducting the marginalization can be computationally intractable, and hence resorting to approximation methods such as Monte Carlo integration is necessary.
Although the proposed framework for optimal clustering can incorporate the probabilistic modeling of arbitrary types of missing data mechanisms, to facilitate analysis, throughout this work we assume data are missing completely at random (MCAR) [22]. In this scenario, the parameters of the missingness mechanism are independent of other model parameters and therefore vanish after the expectation operation in the calculation of the posterior of label functions for clustering assignment.
We derive the optimal clusterer for different scenarios in which features are distributed according to multivariate Gaussian distributions. The performance of this clusterer is compared to various methods, including kPOD [23] and fuzzy cmeans with optimal completion strategy [24], which are methods for directly clustering data with missing values, and also kmeans [25], fuzzy cmeans [26] and hierarchical clustering [27] with the missing values imputed. Comprehensive simulations based on synthetic data show the superior performance of the proposed framework for clustering with missing values over a range of simulation setups. Moreover, evaluations based on RNAseq data further verify the superior performance of the proposed method in a realworld application with missing data.
Methods
Optimal clustering
Given a point set \(S \subset \mathbb {R}^{d}\), where d is the dimension of the space, denote the number of points in S by η(S). A random labeled point process (RLPP) is a pair (Ξ,Λ), where Ξ is a point process generating S and Λ generates random labels on point set S. Ξ maps from a probability space to \([N;\mathcal {N}]\), where N is the family of finite sequences in \(\mathbb {R}^{d}\) and \(\mathcal {N}\) is the smallest σalgebra on N such that for any Borel set B in \(\mathbb { R}^{d}\), the mapping S→η(S∩B) is measurable. A random labeling is a family, Λ={Φ_{S}:S∈N}, where Φ_{S} is a random label function on the point set S in N. Denoting the set of labels by L={1,2,...,l}, Φ_{S} has a probability mass function on L^{S} defined by P_{S}(ϕ_{S})=P(Φ_{S}=ϕ_{S}Ξ=S), where ϕ_{S}:S→L is a deterministic function assigning a label to each point in S.
A label operator λ maps point sets to label functions, λ(S)=ϕ_{S,λ}∈L^{S}. For any set S, label function ϕ_{S} and label operator λ, the label mismatch error is defined as
where I_{A} is an indicator function equal to 1 if A is true and 0 otherwise. The error of label function λ(S) is computed as \( \epsilon _{\lambda }(S)=\mathbb {E}_{\Phi _{S}}[\epsilon _{\lambda }(S,\phi _{S})S]\), and the error of label operator λ for the corresponding RLPP is then defined by \(\epsilon \lbrack \lambda ]=\mathbb {E}_{\Xi }\mathbb {E}_{\Phi _{\Xi }}[\epsilon _{\lambda }(\Xi,\phi _{\Xi })]\).
Clustering involves identifying partitions of a point set rather than the actual labeling, where a partition of S into l clusters has the form \( \mathcal {P}_{S} = \{S_{1},S_{2},...,S_{l} \}\) such that S_{i}’s are disjoint and \( S = \bigcup _{i=1}^{l} S_{i}\). A cluster operator ζ maps point sets to partitions, \(\zeta (S)=\mathcal {P}_{S,\zeta }\). Considering the label switching property of clustering operators, let us define F_{ζ} as the family of label operators that all induce the same partitions as the clustering operator ζ. More precisely, a label function ϕ_{S} induces partition \(\mathcal {P}_{S} = \{S_{1},S_{2},...,S_{l} \}\), if S_{i}={x∈S:ϕ_{S}(x)=l_{i}} for distinct l_{i}∈L. Thereby, λ∈F_{ζ} if and only if ϕ_{S,λ} induces the same partition as ζ(S) for all S∈N. For any set S, label function ϕ_{S} and cluster operator ζ, define the cluster mismatch error by
the error of partition ζ(S) by \(\epsilon _{\zeta }(S) = \mathbb {E}_{\Phi _{S}} [\epsilon _{\zeta }(S,\phi _{S})S ]\) and the error of cluster operator ζ for the RLPP by \(\epsilon [\zeta ] = \mathbb {E}_{\Xi } \mathbb {E}_{\Phi _{\Xi }} [\epsilon _{\zeta }(\Xi,\phi _{\Xi })]\).
As shown in [15], error definitions for partitions can be represented in terms of risk with intuitive cost functions. Specifically, define \(G_{\mathcal {P}_{S}}\) such that \(\phi _{S} \in G_{\mathcal {P}_{S}}\) if and only if ϕ_{S} induces \(\mathcal {P}_{S}\). The error of partition can be expressed as
where \(\mathcal {K}_{S}\) is the set of all possible partitions of S, \(P_{S}(\mathcal {P}_{S}) = \sum _{\phi _{S} \in G_{\mathcal {P}_{S}}} P_{S}(\phi _{S})\) is the probability mass function on partitions \(\mathcal {P}_{S}\) of S, and the partition cost function between partitions \(\mathcal {P}_{S}\) and \(\mathcal {Q}_{S}\) of S is defined as
with \(\phi _{S,\mathcal {P}_{S}}\) being any member of \(G_{\mathcal {P}_{S}}\). A Bayes cluster operator ζ^{∗} is a clusterer with the minimal error ε[ζ^{∗}], called the Bayes error, obtained by a Bayes partition, ζ^{∗}(S) for each set S∈N:
The Bayes clusterer can be solved for each fixed S individually. More specifically, the search space in the minimization and the set of partitions with known probabilities in the summation can be constrained to subsets of \(\mathcal {K}_{S}\), denoted by \(\mathcal {C}_{S}\) and \(\mathcal {R}_{S}\), respectively. We refer to \(\mathcal {C}_{S}\) and \(\mathcal {R}_{S}\) as the set of candidate partitions and the set of reference partitions, respectively. Following [15], we can search for the optimal clusterer based on both optimal and suboptimal procedures (detailed in “Results and discussion” section) with derived bounds that can be used to optimally reduce the size of \(\mathcal {C}_{S}\) and \(\mathcal {R}_{S}\).
Gaussian model with missing values
We consider an RLPP model that generates the points in the set S according to a Gaussian model, where features of x∈S can be missing completely at random due to a missing data mechanism independent of the RLPP. More precisely, the points x∈S with label ϕ_{S}(x)=i are drawn independently from a Gaussian distribution with parameters ρ_{i}={μ_{i},Σ_{i}}. Assuming n_{i} sample points with label i, we divide the observations into G_{i}≤n_{i} groups, where all n_{ig} points in group g have the same set, J_{ig}, of observed features with cardinality J_{ig}=d_{ig}. Denoting by S_{ig} the set of sample points in group g of label i, we represent the pattern of missing data in this group using a d_{ig}×d matrix M_{ig}, where each row is a ddimensional vector with a single nonzero element with value 1 corresponding to the observed feature’s index. Thus, the nonmissing portion of sample point x∈S_{ig}, i.e. M_{ig}x, has Gaussian distribution \( \mathrm {N}(M_{ig}\mu _{i},M_{ig}\Sigma _{i} M_{ig}^{T})\).
Given ρ={ρ_{1},ρ_{2},...,ρ_{l}} of independent parameters, to evaluate the posterior probability of random labeling function ϕ_{S}∈L^{S}, we have
where P(ϕ_{S}) is the prior probability on label functions, which we assume does not depend on the specific points in S.
Known means and covariances
When mean and covariance parameters of labelconditional distributions are known, the prior probability f(μ_{i},Σ_{i}) in (6) is a point mass at ρ_{i}={μ_{i},Σ_{i}}. Thus,
We define the groupg statistics of label i as
where m_{ig} and Ψ_{ig} are the sample mean and scatter matrix, employing only the observed n_{ig} data points in group g of label i. The posterior probability of labeling function (7) then can be expressed as
where \(\Sigma _{ig}=M_{ig}\Sigma _{i} M_{ig}^{T}\) is the covariance matrix corresponding to group g of label i.
Gaussian means and known covariances
Under this model, data points are generated according to Gaussians whose mean parameters are random and their covariance matrices are fixed. Specifically, for label i we have \(\mu _{i} \sim \mathrm {N}(m_{i},\frac {1}{\nu _{i}} \Sigma _{i})\), where ν_{i}>0 and m_{i} is a length d real vector. Thus the posterior of label function given the point set S can be derived as
By completing the square and using the normalization constant of multivariate Gaussian distribution, the integral in this equation can be expressed as
where
GaussianInverseWishart Means and Covariances
Under this model, data points are generated from Gaussian distributions with random mean and covariance parameters. More precisely, the parameters associated with label i are distributed as \({\mu _{i}\Sigma _{i} \sim \mathrm {N}\left (m_{i},\frac {1 }{\nu _{i}}\Sigma _{i}\right)}\) and Σ_{i}∼IW(κ_{i},Ψ_{i}), where the covariance has inverseWishart distribution
To compute the posterior probability of labeling function (6), we first marginalize out the mean parameters μ_{i} in a similar fashion to (10), obtaining
The integration in the above equation has no closed form solution, thus we resort to Monte Carlo integration for approximating it. Specifically, denoting the term in the brackets in Eq. (15) as g(Σ_{i}), we draw J samples \(\Sigma _{i}^{(j)} \sim \text {IW}(\kappa _{i},\Psi _{i})\), j=1,2,...,J, and then compute the integral as \(\frac {1}{J} \sum _{j=1}^{J} g(\Sigma _{i}^{(j)})\).
Results and discussion
The performance of the proposed method for optimal clustering with missing values at random is compared with some suboptimal versions, two other methods for clustering data with missing values, and also classical clustering algorithms with imputed missing values. The performance comparison is carried out on synthetic data generated from different Gaussian RLPP models with different missing probability setups, and also on a publicly available dataset of breast cancer generated by TCGA Research Network (https://cancergenome.nih.gov/). In our experiments, the results of the exact optimal solution for the RLPP with missing at random (Optimal) is provided for smaller point sets, i.e. wherever computationally feasible. We have also tested two suboptimal solutions, similar to the ideas in [15], for an RLPP with missing at random. In the first one (Subopt. Pmax), the set of reference partitions (\(\mathcal {R}_{S}\)) is restricted to a closed ball of a specified radius centered on the partition with the highest probability, where the distance of two partitions is defined as the minimum Hamming distance between labels inducing the partitions. In both Optimal and Pmax, the reference set is further constrained to the partitions that assign the correct number of points to each cluster, but the set of candidate partitions (\(\mathcal {C}_{S}\)) includes all the possible partitions of n points, i.e. 2^{n−1}. In the second suboptimal solution (Subopt. Pseed), a local search within Hamming distance at 1 is performed starting from five random initial partitions to approximately find the partition with (possibly local) maximum probability. Then the sets of reference and candidate partitions are constrained to the partitions with correct cluster sizes with a specified Hamming distance from the found (local) maximum probability partition. The bounds derived in [15] for reducing the set of candidate and reference partitions are used, where applicable, in Optimal, Pseed, and Pmax.
In all scenarios, kPOD and fuzzy cmeans with optimal completion strategy (FCMOCS) are directly applied to the data with missing values. In the simulations in [24], where FCMOCS is introduced, to initialize cluster centers, the authors apply ordinary fuzzy cmeans to the complete data, i.e. using knowledge of the missing values. To have a fair comparison with other methods, we calculate the initial cluster centers for FCMOCS by applying fuzzy cmeans to the subset of points with no missing features for lower missing rates. For higher missing rates we impute the missing values by the mean of the corresponding feature values across all points, and then apply fuzzy cmeans to all the points to initialize the cluster centers. In order to apply the classical algorithms, the missing values are imputed according to [28], by employing a multivariate Gibbs sampler that iteratively generates samples for missing values and parameters based on the observed data. The classical algorithms included in our experiments include kmeans (KM), fuzzy cmeans (FCM), hierarchical clustering with single linkage (Hier. (Si)), and hierarchical clustering with complete linkage (Hier. (Co)). Moreover, completely random clusterer (Random) results are also included for performance comparisons.
Simulated data
In the simulation analysis, the number of clusters is fixed at 2, and the dimensionality of the RLPPs (number of features) is set to 5. Additional results for 20 features are provided in Additional file 1. Point generation is done based on a Gaussian mixture model (GMM). Three different scenarios for the parameters of the GMM are considered: i) Fixed known means and covariances ii) Known covariances and unknown means with Gaussian distributions. iii) Unknown means and covariances with Gaussian inverseWishart distributions. We select the values of the parameters of the point generation process to have an approximate Bayes error of 0.15. The selected values are shown in Table 1. For the point set generation, the number of points from each cluster is fixed a priori. The distributions are first drawn from the assumed model, and then the points are generated based on the drawn distributions. A subset of the points’ features is randomly selected to be hidden based on missing at random with different missing probabilities. Four different setups for the number of points are considered in our simulation analysis: 10 points from each cluster (n_{1}=n_{2}=10), 12 points from one cluster and 8 points from the other cluster (n_{1}=12,n_{2}=8), 35 points from each cluster (n_{1}=n_{2}=35), and 42 points from one cluster and 28 points from the other cluster (n_{1}=42,n_{2}=28). When having unequal sized clusters, in half of the repetitions n_{1} points are generated from the first distribution and n_{2} points from the second distribution, and viceversa in the other half. In each simulation repetition, all clustering methods are applied to the points to generate a vector of labels that induces a twocluster partition. The predicted label vector by each method is compared with the true label vector of each point in the point set to calculate the error of that method on that point set. In other words, for each method the number of points assigned to a cluster different from their true one are counted (after accounting for the label switching issue) and divided by the total number of points (n=n_{1}+n_{2}) to calculate the clustering error of that method on the point set. These errors are averaged across all point sets in different repetitions to empirically estimate the clustering error of each method under a model and fixed missingvalue probability. In cases with n=70, since applying Optimal and Pmax is computationally prohibitive, we only provide the results for Pseed.
In Additional file 1, the average clustering errors are shown as a function of the Hamming distance threshold used to define the set of reference partitions in Pmax and Pseed, for different simulation scenarios. From the Figures in Additional file 1, we see that in all cases, the performances of Pmax and Pseed are quite insensitive to the set threshold of the Hamming distance for reference partitions. Note that in these types of figures all the other methods’ performances other than Pmax and Pseed are constant in each plot.
The average results for the fixed mean vectors and covariance matrices across 100 repetitions are shown in Fig. 1. Here, the Hamming distance threshold for reference partitions in Pmax and Pseed is fixed at 1. It can be seen that Optimal, Pmax, and Pseed outperform all the other methods in all the smaller sample size settings, and Pmax and Pseed have virtually the same performance as Optimal. For the larger sample size settings where only Pseed is applied, its superior performance compared with other methods is clear from the figure.
Figure 2 depicts the comparison results under the unknown mean vectors with Gaussian distributions and fixed covariance matrices averaged over 80 repetitions. The Hamming distance threshold in Pmax and Pseed is set to 2. For smaller sample sizes, Optimal, Pmax and Pseed have lower average errors than all the other methods. We can see that for balanced clusters the suboptimal and optimal solutions have very close performances, but for the unbalanced clusters case with higher missing probabilities the difference between Optimal and Pmax and Pseed gets noticeable. For larger sample sizes Pseed consistently outperforms the other methods, although for lower missing probabilities it has closer competitors. In all cases, as the missing probability increases, the superior performance of the proposed methods becomes more significant.
The average results under the unknown mean vectors and coavriance matrices with GaussianinverseWishart distribution over 40 repetitions are provided in Fig. 3. In the smaller sample size cases, the Hamming distance threshold in Pmax and Pseed is fixed at 8, and we can see that the proposed suboptimal (Pmax and Pseed) and optimal clustering with missing values have very close average errors, and all are much lower than the other methods’ errors. For larger sample sizes, only the results for missing probability equal to 0.15 are shown vs. the Hamming distance threshold used to define the reference partitions in Pseed. Again, Pseed performs better than the other methods.
RNAseq data
In this section the performance of the clustering methods are examined on a publicly available RNAseq dataset of breast cancer. The data is available on The Cancer Genome Atlas (TCGA) [29], and is procured by the R package TCGS2STAT [30]. It consists of matched tumor and normal samples, and includes 97 points from each. The original data are in terms of the number of sequence reads mapped to each gene. RNAseq data are integers, highly skewed and overdispersed [31–35]. Thus, we apply a variance stabilizing transformation (VST) [36] implemented in DESeq2 package [37], and transform data to a log2 scale that have been normalized with respect to library size. For all subsequent analysis, other than for calculating clustering errors, we assume that the labels of data are unknown. Feature selection is performed in a completely unsupervised manner, since in clustering no labels are known in practice. The top 10 genes in terms of variance to mean ratio of expression are picked as features to be used in clustering algorithms. In general, for setting prior hyperparameters, external sources of information like signaling pathways, where available, can be leveraged [38, 39]. Here, we only use a subset of the discarded gene expressions, i.e. the next 90 top genes (in terms of variance to mean ratio of expression), for prior hyperparameters calibration for the optimal/suboptimal approaches. We follow the approach in [40] and employ the method of moments for prior calibration, but unlike [40], a single set of hyperparameters is estimated and used for both clusters, since the labels of data are not available. It is well known that in small sample size settings, estimation of covariance matrices, scatter matrices and even mean vectors may be problematic. Therefore, similar to [40], we assume the following structure
and estimate five scalars (m, σ^{2}, ρ, κ, ν) from the data.
In each repetition, stratified sampling is done, i.e. n_{1} and n_{2} points are sampled randomly from each group (normal and tumor). When n_{1}≠n_{2}, in half of the repetitions n_{1} and n_{2} points are randomly selected from the normal and tumor samples, respectively, and viceversa in the other half. Prior calibration is performed in each repetition, and 15% of the selected features are considered as missing values. Similar to the experiments on the simulated data, the clustering error of each method in each iteration is calculated by comparing the predicted labels and true labels of the sampled points (accounting for label switching issue), and the average results over 40 repetitions are provided in Fig. 4. It can be seen that the proposed optimal clustering with missing values and its suboptimal versions outperform the other algorithms. It is worth noting that the performance of Pseed is more sensitive to the selected Hamming distance threshold for reference partitions compared with the results on simulated data.
Conclusion
The methodology employed in this paper is very natural. As with any signal processing problem, the basic problem is to find an optimal operator from a class of operators given the underlying random process and a cost function, which is often an error associated with operator performance. While it may not be possible to compute the optimal operator, one can at least employ suboptimal approximations to it while knowing the penalties associated with the approximations.
In this paper, we have, in effect, confronted an old problem in signal processing: If we wish to make a decision based on a noisy observed signal, is it better to filter the observed signal and then determine the optimal decision on the filtered signal, or to find the optimal decision based directly on the observed signal? The answer is the latter. The reason is that the latter approach is fully optimal relative to the actual observation process, whereas, even if in the first approach the filtering is optimal relative to the noise process, the first approach produces a composite of two actions, filter and decision, each of which is only optimal relative to a portion of the actual observation process. In the present situation involving clustering, in the standard imputationfollowedbyclustering approach, it is typically the case that neither the filter (imputation) nor the decision (clustering) is optimal, so that even more advantage is obtained by optimal clustering over the missingvalueadjusted RLPP.
Abbreviations
 FCMOCS:

Fuzzy cmeans with optimal completion strategy
 GMM:

Gaussian mixture model
 Hier. (Co):

Hierarchical clustering with complete linkage
 Hier. (Si):

Hierarchical clustering with single linkage
 MCAR:

Missing completely at random
 RLPP:

Random labeled point process
 RNAseq:

RNA sequencing
 TCGA:

The Cancer Genome Atlas
 VST:

Variance stabilizing transformation
References
 1
BenDor A, Shamir R, Yakhini Z. Clustering gene expression patterns. J Comput Biol. 1999; 6:281–97. https://doi.org/10.1089/106652799318274. PMID: 10582567.
 2
Bittner M, Meitzer P, Chen Y, Jiang Y, Seftor E, Hendrix M, Radmacher M, Simon R, Yakhini Z, BenDor A, Sampas N, Dougherty E, Wang E, Marincola F, Gooden C, Lueders J, Glatfelter A, Pollock P, Carpten J, Gillanders E, Leja D, Dietrich K, Beaudry C, Berens M, Alberts D, Sondak V, Hayward N, Trent J. Molecular classification of cutaneous malignant melanoma by gene expression profiling. J Comput Biol. 2000; 406(3):536–40.
 3
Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL. Modelbased clustering and data transformations for gene expression data. Bioinformatics. 2001; 17(10):977–87.
 4
Fraley C, Raftery AE. Modelbased clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2002; 97(458):611–31.
 5
MacEachern SN, Müller P. Estimating mixture of Dirichlet process models. J Comput Graph Stat. 1998; 7(2):223–38.
 6
Dalton LA, Dougherty ER. Optimal classifiers with minimum expected error within a Bayesian frameworkpart i: Discrete and Gaussian models. Pattern Recogn. 2013; 46(5):1301–14. https://doi.org/10.1016/j.patcog.2012.10.018.
 7
Imani M, BragaNeto UM. Control of gene regulatory networks with noisy measurements and uncertain inputs. IEEE Trans Control Netw Syst. 2018; 5(2):760–9. https://doi.org/10.1109/TCNS.2017.2746341.
 8
Karbalayghareh A, BragaNeto U, Dougherty ER. Intrinsically Bayesian robust classifier for singlecell gene expression trajectories in gene regulatory networks. BMC Syst Biol. 2018; 12(3):23.
 9
Imani M, BragaNeto U. Control of gene regulatory networks using Bayesian inverse reinforcement learning. IEEE/ACM Trans Comput Biol Bioinforma. 2018:1. https://doi.org/10.1109/TCBB.2018.2830357.
 10
Boluki S, Qian X, Dougherty ER. Experimental design via generalized mean objective cost of uncertainty. IEEE Access. 2019; 7:2223–30. https://doi.org/10.1109/ACCESS.2018.2886576.
 11
Broumand A, Esfahani MS, Yoon B. J., Dougherty ER. Discrete optimal Bayesian classification with errorconditioned sequential sampling. Pattern Recognit. 2015; 48(11):3766–82. https://doi.org/10.1016/j.patcog.2015.03.023.
 12
Talapatra A, Boluki S, Duong T, Qian X, Dougherty E, Arróyave R. Autonomous efficient experiment design for materials discovery with Bayesian model averaging. Phys Rev Mater. 2018; 2:113803. https://doi.org/10.1103/PhysRevMaterials.2.113803.
 13
Karbalayghareh A, Qian X, Dougherty ER. Optimal Bayesian transfer learning. IEEE Trans Signal Process. 2018; 66(14):3724–39. https://doi.org/10.1109/TSP.2018.2839583.
 14
Dougherty ER, Brun M. A probabilistic theory of clustering. Pattern Recogn. 2004; 37(5):917–25. https://doi.org/10.1016/j.patcog.2003.10.003.
 15
Dalton LA, Benalcázar ME, Brun M, Dougherty ER. Analytic representation of Bayes labeling and Bayes clustering operators for random labeled point processes. IEEE Trans Signal Process. 2015; 63(6):1605–20. https://doi.org/10.1109/TSP.2015.2399870.
 16
Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995; 270(5235):467–70.
 17
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNAseq. Nat Methods. 2008; 5(7):621.
 18
Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB. Missing value estimation methods for DNA microarrays. Bioinformatics. 2001; 17(6):520–5.
 19
van Buuren S, GroothuisOudshoorn K. mice: Multivariate Imputation by Chained Equations in R. J Stat Softw Artic. 2011; 45(3):1–67. https://doi.org/10.18637/jss.v045.i03.
 20
Honaker J, King G, Blackwell M, et al. Amelia ii: A program for missing data. J Stat Softw. 2011; 45(7):1–47.
 21
Stekhoven DJ, Bühlmann P. Missforest—nonparametric missing value imputation for mixedtype data. Bioinformatics. 2011; 28(1):112–8.
 22
Little RJ, Rubin DB. Statistical Analysis with Missing Data vol. 333.New Jersey: Wiley; 2014.
 23
Chi JT, Chi EC, Baraniuk RG. kpod: A method for kmeans clustering of missing data. Am Stat. 2016; 70(1):91–9. https://doi.org/10.1080/00031305.2015.1086685.
 24
Hathaway RJ, Bezdek JC. Fuzzy cmeans clustering of incomplete data. IEEE Trans Sys Man Cybern B (Cybern). 2001; 31(5):735–44. https://doi.org/10.1109/3477.956035.
 25
Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY. An efficient kmeans clustering algorithm: Analysis and implementation. IEEE Trans Pattern Anal Mach Intell. 2002; 24(7):881–92.
 26
Bezdek JC, Ehrlich R, Full W. FCM: The fuzzy cmeans clustering algorithm. Comput Geosci. 1984; 10(23):191–203.
 27
Johnson SC. Hierarchical clustering schemes. Psychometrika. 1967; 32(3):241–54.
 28
Dadaneh SZ, Dougherty ER, Qian X. Optimal Bayesian classification with missing values. IEEE Trans Signal Process. 2018; 66(16):4182–92.
 29
The Cancer Genome Atlas Research Network (TCGA). Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008; 455(7216):1061.
 30
Wan Y. W., Allen GI, Liu Z. TCGA2STAT: simple TCGA data access for integrated statistical analysis in R. Bioinformatics. 2015; 32(6):952–4.
 31
Dadaneh SZ, Zhou M, Qian X. Bayesian negative binomial regression for differential expression with confounding factors. Bioinformatics. 2018; 1. https://doi.org/10.1093/bioinformatics/bty330.
 32
Hajiramezanali E, Zamani Dadaneh S, Karbalayghareh A, Zhou M, Qian X. Bayesian multidomain learning for cancer subtype discovery from nextgeneration sequencing count data In: Bengio S, Wallach H, Larochelle H, Grauman K, CesaBianchi N, Garnett R, editors. Advances in Neural Information Processing Systems 31. Curran Associates, Inc.: 2018. p. 9115–24.
 33
Dadaneh SZ, Qian X, Zhou M. BNPSeq: Bayesian nonparametric differential expression analysis of sequencing count data. J Am Stat Assoc. 2018; 113(521):81–94.
 34
Hajiramezanali E, Dadaneh SZ, de Figueiredo P, Sze S. H., Zhou M, Qian X. Differential expression analysis of dynamical sequencing count data with a Gamma Markov chain. 2018. arXiv preprint arXiv:1803.02527.
 35
Broumand A, Hu T. A length bias corrected likelihood ratio test for the detection of differentially expressed pathways in RNAseq data. In: 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP): 2015. p. 1145–9. https://doi.org/10.1109/GlobalSIP.2015.7418377.
 36
Durbin BP, Hardin JS, Hawkins DM, Rocke DM. A variancestabilizing transformation for geneexpression microarray data. Bioinformatics. 2002; 18(suppl_1):105–10.
 37
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014; 15(12):550.
 38
Boluki S, Esfahani MS, Qian X, Dougherty ER. Constructing PathwayBased Priors within a Gaussian Mixture Model for Bayesian Regression and Classification. IEEE/ACM Trans Comput Biol Bioinforma. 2019; 16(2):524–37. https://doi.org/10.1109/TCBB.2017.2778715. ISSN: 15455963.
 39
Boluki S, Esfahani MS, Qian X, Dougherty ER. Incorporating biological prior knowledge for Bayesian learning via maximal knowledgedriven information priors. BMC Bioinformatics. 2017; 18(14):552. https://doi.org/10.1186/s1285901718934.
 40
Dalton LA, Dougherty ER. Application of the Bayesian MMSE estimator for classification error to gene expression microarray data. Bioinformatics. 2011; 27(13):1822–31.
Acknowledgment
We thank Texas A&M High Performance Research Computing for providing computational resources to perform experiments in this paper.
Funding
This work was funded in part by Awards CCF1553281 and IIS1812641 from the National Science Foundation, and a DMREF grant from the National Science Foundation, award number 1534534. The publication cost of this article was funded by Award IIS1812641 from the National Science Foundation.
Availability of data and materials
The publicly available real datasets analyzed during the current study have been generated by the TCGA Research Network https://cancergenome.nih.gov/.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 12, 2019: Selected original research articles from the Fifth International Workshop on Computational Network Biology: Modeling, Analysis and Control (CNBMAC 2018): Bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume20supplement12.
Author information
Affiliations
Contributions
SB and SZD developed the method, performed the experiments, and wrote the first draft. XQ and ERD proofread and edited the manuscript, and oversaw the project. All authors have read and approved final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
Supplementary materials. Additional figures are given in a single multipage PDF. (PDF 516 KB)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Boluki, S., Zamani Dadaneh, S., Qian, X. et al. Optimal clustering with missing values. BMC Bioinformatics 20, 321 (2019). https://doi.org/10.1186/s1285901928323
Published:
Keywords
 Clustering
 Missing data
 Optimal design
 Pattern recognition