Fitting Gaussian mixture models on incomplete data

Background
Bioinformatics investigators often gain insights by combining information across multiple, disparate data sets. Merging data from multiple sources frequently results in data sets that are incomplete or contain missing values. Although missing data are ubiquitous, existing implementations of Gaussian mixture models (GMMs) either cannot accommodate missing data, or do so by imposing simplifying assumptions that limit the applicability of the model. In the presence of missing data, a standard ad hoc practice is to perform complete case analysis or imputation prior to model fitting. Both approaches have serious drawbacks, potentially resulting in biased and unstable parameter estimates.

Results
Here we present missingness-aware Gaussian mixture models (MGMM), an R package for fitting GMMs in the presence of missing data. Unlike existing GMM implementations that can accommodate missing data, MGMM places no restrictions on the form of the covariance matrix. Using three case studies on real and simulated 'omics data sets, we demonstrate that, when the underlying data distribution is close to a GMM, MGMM is more effective at recovering the true cluster assignments than either the existing GMM implementations that accommodate missing data or fitting a standard GMM after state-of-the-art imputation. Moreover, MGMM provides an accurate assessment of cluster assignment uncertainty, even when the generative distribution is not a GMM.

Conclusion
Compared to state-of-the-art competitors, MGMM demonstrates a better ability to recover the true cluster assignments for a wide variety of data sets and a large range of missingness rates. MGMM provides the bioinformatics community with a powerful, easy-to-use, and statistically sound tool for performing clustering and density estimation in the presence of missing data. MGMM is publicly available as an R package on CRAN: https://CRAN.R-project.org/package=MGMM.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04740-9.


Objective function
The log likelihood of the observed data is:

\[
\ell_{\mathrm{obs}}(\pi, \theta) = \sum_{i=1}^{n} \ln \left\{ \sum_{j=1}^{k} \pi_j \, f\big(y_i^{\mathrm{obs}} \mid \mu_{\mathrm{obs},j}, \Sigma_{\mathrm{obs},j}\big) \right\}, \tag{1}
\]

where $y_i^{\mathrm{obs}}$ represents only the observed components of the complete-data vector $y_i$, and $\mu_{\mathrm{obs},j}$ and $\Sigma_{\mathrm{obs},j}$ denote the corresponding subvector of $\mu_j$ and submatrix of $\Sigma_j$. However, due to the presence of a sum within the logarithm, there are no closed-form solutions for optimizing (1) in general. Here we develop the objective function optimized by the ECM algorithm, which is a lower bound on the observed data log likelihood [3].
Define the $d \times d$ residual outer product matrix as:

\[
S_{ij} = (y_i - \mu_j)(y_i - \mu_j)^{\top}.
\]

Let $\theta = (\mu_1, \Sigma_1, \cdots, \mu_k, \Sigma_k)$ collect the means and covariances of the component normal distributions. Complete data refers to the case where all elements of $y_i$ and all cluster assignments $z_i$ are observed. The complete data log likelihood is:

\[
\ell(\pi, \theta) = \sum_{i=1}^{n} \sum_{j=1}^{k} \mathbb{1}\{z_i = j\} \left\{ \ln \pi_j - \frac{1}{2} \ln \det(\Sigma_j) - \frac{1}{2} \operatorname{tr}\big(\Sigma_j^{-1} S_{ij}\big) \right\} + \text{const}. \tag{2}
\]

Let $\pi^{(r)}$ and $\theta^{(r)}$ denote the current estimates of $\pi$ and $\theta$, and let $D_{\mathrm{obs}} = \cup_{i=1}^{n} \{y_i^{\mathrm{obs}}\}$ denote the observed data. The ECM objective is defined as the expectation of the complete data log likelihood (2) given the observed data and the current parameter states:

\[
Q(\pi, \theta \mid \pi^{(r)}, \theta^{(r)}) \equiv \mathbb{E}\big\{ \ell(\pi, \theta) \mid D_{\mathrm{obs}}; \pi^{(r)}, \theta^{(r)} \big\}.
\]
Define the responsibility of the $j$th cluster for the $i$th observation as the current conditional probability of membership given the observed data:

\[
\hat{\gamma}_{ij}^{(r)} \equiv \mathbb{P}\big( z_i = j \mid y_i^{\mathrm{obs}}; \pi^{(r)}, \theta^{(r)} \big). \tag{3}
\]

From Bayes' theorem, the responsibility is expressible as:

\[
\hat{\gamma}_{ij}^{(r)} = \frac{ \pi_j^{(r)} f\big(y_i^{\mathrm{obs}} \mid \mu_{\mathrm{obs},j}^{(r)}, \Sigma_{\mathrm{obs},j}^{(r)} \big) }{ \sum_{l=1}^{k} \pi_l^{(r)} f\big(y_i^{\mathrm{obs}} \mid \mu_{\mathrm{obs},l}^{(r)}, \Sigma_{\mathrm{obs},l}^{(r)} \big) }.
\]

Here $f(y_i^{\mathrm{obs}} \mid \mu_{\mathrm{obs},j}^{(r)}, \Sigma_{\mathrm{obs},j}^{(r)})$ denotes the multivariate normal density of the observed components of $y_i$ under the $j$th cluster. For the $i$th observation, define the $j$th working response vector as:

\[
\hat{y}_{ij}^{(r)} \equiv \big( y_i^{\mathrm{obs}}, \hat{y}_{ij}^{\mathrm{miss},(r)} \big), \tag{4}
\]

where $\hat{y}_{ij}^{\mathrm{miss},(r)}$ is the current conditional expectation of the missing elements of $y_i$ under the assumption that observation $i$ originated from cluster $j$:

\[
\hat{y}_{ij}^{\mathrm{miss},(r)} = \mathbb{E}\big( y_i^{\mathrm{miss}} \mid y_i^{\mathrm{obs}}, z_i = j; \theta^{(r)} \big).
\]

Finally, define the working residual outer product as:

\[
\hat{S}_{ij}^{(r)} \equiv \mathbb{E}\big( S_{ij} \mid y_i^{\mathrm{obs}}, z_i = j; \theta^{(r)} \big). \tag{5}
\]

The working residual outer product is the current conditional expectation of the residual outer product given the observed data. In terms of the responsibility and the working residual outer product, the ECM objective is:

\[
Q(\pi, \theta \mid \pi^{(r)}, \theta^{(r)}) = \sum_{i=1}^{n} \sum_{j=1}^{k} \hat{\gamma}_{ij}^{(r)} \left\{ \ln \pi_j - \frac{1}{2} \ln \det(\Sigma_j) - \frac{1}{2} \operatorname{tr}\big(\Sigma_j^{-1} \hat{S}_{ij}^{(r)}\big) \right\} + \text{const}. \tag{6}
\]

In contrast to the complete data log likelihood in (2), the ECM objective (6) is a function of the observed data only.
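To make the E-step quantities concrete, the following sketch computes the responsibilities (3) and working response vectors (4) for data with NaN-coded missing entries. This is an illustrative NumPy translation under simplified assumptions, not the MGMM implementation (which is written in R); constants common to all clusters are dropped from the log weights because they cancel in the ratio.

```python
import numpy as np

def e_step(y, pi, mus, sigmas):
    """E-step for a GMM with NaN-coded missing entries in y.

    Responsibilities, eq. (3), are computed from the marginal normal
    density of the observed coordinates; missing coordinates are filled
    with their conditional expectation under each cluster, eq. (4).
    """
    n, d = y.shape
    k = len(pi)
    gamma = np.zeros((n, k))      # responsibilities
    y_work = np.zeros((n, k, d))  # working response vectors
    for i in range(n):
        obs = ~np.isnan(y[i])
        log_w = np.zeros(k)
        for j in range(k):
            mu_o = mus[j][obs]
            S_oo = sigmas[j][np.ix_(obs, obs)]
            r = y[i, obs] - mu_o
            # Unnormalized log weight: log pi_j + log N(y_obs | .),
            # up to a constant shared by all clusters.
            _, logdet = np.linalg.slogdet(S_oo)
            log_w[j] = (np.log(pi[j]) - 0.5 * logdet
                        - 0.5 * r @ np.linalg.solve(S_oo, r))
            # Conditional expectation of missing coords given observed.
            y_work[i, j] = y[i]
            if (~obs).any():
                S_mo = sigmas[j][np.ix_(~obs, obs)]
                y_work[i, j, ~obs] = (mus[j][~obs]
                                      + S_mo @ np.linalg.solve(S_oo, r))
        log_w -= log_w.max()      # stabilize before exponentiating
        w = np.exp(log_w)
        gamma[i] = w / w.sum()    # Bayes' theorem, eq. (3)
    return gamma, y_work
```

With spherical covariances, the conditional expectation of a missing coordinate reduces to the cluster mean of that coordinate, which makes the sketch easy to sanity-check by hand.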

Optimization
ECM differs from classic EM in that the M-step is partitioned into a sequence of conditional maximizations [5]. In the present case, the means ($\mu_j$) are estimated first, followed by the covariances ($\Sigma_j$), and finally the cluster membership probabilities $\pi$. The advantage of ECM is that each conditional maximization is available in closed form.
The optimization procedure is outlined in Algorithm 1. Initial estimates of $\pi$ and $\theta$ are required. The approach adopted by MGMM is to perform an initial $k$-means clustering on the complete observations [1]. $\pi$ is estimated as the proportion of the complete observations assigned to each cluster, and $\theta$ is estimated by calculating the within-cluster means and covariances.
Update equations were derived by differentiating the ECM objective in (6) and solving the resulting score equations. The update for $\mu_j$ is:

\[
\mu_j^{(r+1)} = \frac{1}{n_j^{(r)}} \sum_{i=1}^{n} \hat{\gamma}_{ij}^{(r)} \hat{y}_{ij}^{(r)}, \tag{7}
\]

where $\hat{\gamma}_{ij}^{(r)}$ is the current responsibility (3) of the $j$th cluster for the $i$th observation, $n_j^{(r)} = \sum_{i=1}^{n} \hat{\gamma}_{ij}^{(r)}$ is the total responsibility of the $j$th cluster, and $\hat{y}_{ij}^{(r)}$ is the current working response vector (4). Note that (7) takes the form of a responsibility-weighted average of the working response vectors. Similarly, the update for $\Sigma_j$ is:

\[
\Sigma_j^{(r+1)} = \frac{1}{n_j^{(r)}} \sum_{i=1}^{n} \hat{\gamma}_{ij}^{(r)} \hat{S}_{ij}^{(r)}, \tag{8}
\]

which is a responsibility-weighted average of the working residual outer products (5). In accordance with the ECM scheme, the working residual outer products in (8) are evaluated at the updated means $\{\mu_j^{(r+1)}\}$. Finally, the update for $\pi_j$ is:

\[
\pi_j^{(r+1)} = \frac{n_j^{(r)}}{n}. \tag{9}
\]

After all parameters have been updated, the current value of the ECM objective (6) is recalculated, and the algorithm continues until the increment $Q^{(r+1)} - Q^{(r)}$ in the objective falls below a pre-specified tolerance $\epsilon$.
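The conditional maximizations (7)-(9) can be sketched as follows. This is an illustrative NumPy version, not the package code; it takes as given the responsibilities, working responses, and the conditional covariance of the missing coordinates (the term that, added to the outer product of working residuals, yields the working residual outer product (5)).

```python
import numpy as np

def m_step(gamma, y_work, cond_cov):
    """Conditional maximization updates (7)-(9) of the ECM algorithm.

    gamma    : (n, k) responsibilities, eq. (3)
    y_work   : (n, k, d) working response vectors, eq. (4)
    cond_cov : (n, k, d, d) conditional covariance of the missing
               coordinates given the observed ones (zero blocks for
               observed coordinates; all zeros for complete data)
    """
    n, k, d = y_work.shape
    n_j = gamma.sum(axis=0)  # total responsibility of each cluster
    # Eq. (7): responsibility-weighted average of working responses.
    mus = np.einsum('ij,ijd->jd', gamma, y_work) / n_j[:, None]
    # Working residual outer products, eq. (5), at the updated means:
    # outer product of working residuals plus conditional covariance.
    resid = y_work - mus[None, :, :]
    S = np.einsum('ijd,ije->ijde', resid, resid) + cond_cov
    # Eq. (8): responsibility-weighted average of eq. (5).
    sigmas = np.einsum('ij,ijde->jde', gamma, S) / n_j[:, None, None]
    # Eq. (9): updated cluster membership probabilities.
    pi = n_j / n
    return pi, mus, sigmas
```

For complete data (all conditional covariances zero and hard responsibilities), the updates reduce to the within-cluster sample means, covariances, and proportions, as expected.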

[Algorithm 1: ECM algorithm for GMM estimation in the presence of missing data. The steps alternate an E-step with conditional updates of the means, covariances, and cluster membership probabilities until convergence.]

Assumption on missingness
Standard approaches to addressing missing data, including complete case analysis and naive mean or median imputation, tacitly assume that the data are missing completely at random (MCAR). Under MCAR, whether an element of data is missing is completely independent of its value. However, unbiased estimation via the maximum likelihood-based approach presented here only requires that the data are missing at random (MAR), which is a weaker and more plausible assumption. Under MAR, whether an element of data is missing is independent of its value conditional on those elements of the data that are observed [4].
To elaborate on this assumption, let $R_{il} = 1$ if the $l$th component of the $i$th observation (that is, $Y_{il}$) is observed, and define the response indicator vector $r_i = (R_{i1}, \cdots, R_{ip})$. For the $i$th subject, let $Y_i^{\mathrm{obs}}$ denote the observed components of $Y_i$ and $Y_i^{\mathrm{miss}}$ the missing components. MAR requires that $r_i$ is conditionally independent of $Y_i^{\mathrm{miss}}$ given $Y_i^{\mathrm{obs}}$. When missingness depends on the unobserved values even after conditioning on the observed data, the data are missing not at random (MNAR). Unbiased estimation under MNAR requires knowledge of the missing data mechanism, which is typically not available.
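The distinction between MCAR and MAR, and the resulting bias of complete case analysis, can be illustrated with a small simulation (a hypothetical example constructed for this sketch, not taken from the paper). Missingness in one variable is driven entirely by a second, always-observed variable, so the data are MAR but not MCAR:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Bivariate data in which y2 depends on y1.
y1 = rng.standard_normal(n)
y2 = y1 + rng.standard_normal(n)

# MAR mechanism: whether y2 is missing depends only on the
# always-observed y1, never on the value of y2 itself.
missing = y1 > 0
y2_complete_cases = y2[~missing]

full_mean = y2.mean()              # close to the true mean of 0
cc_mean = y2_complete_cases.mean() # biased toward E[y2 | y1 < 0] < 0
```

Because the retained cases all have $y_1 < 0$, and $y_2$ is correlated with $y_1$, the complete-case mean of $y_2$ is systematically negative even though the true mean is zero; a likelihood-based method that conditions on the observed $y_1$ remains valid here.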

Selecting the number of clusters
In some settings, such as the RNA-seq data set, the number of clusters is suggested by the number of classes present (i.e. tumor types). In others, such as clustering GWAS summary statistics, the number of clusters is unknown. The ChooseK function from the MGMM package aims to assist in choosing the number of clusters $k$. For a range of possible $k$, bootstrap samples of the input data are drawn, a GMM is fit to each, and several cluster quality metrics are calculated. These include the Bayesian Information Criterion, the Calinski-Harabasz Index, the Davies-Bouldin Index, and the silhouette width. For each candidate $k$, the mean and standard error, across bootstrap data sets, of the cluster quality metrics are recorded. For each metric, the cluster number leading to the optimal quality, $k_{\mathrm{opt}}$, and the smallest cluster number whose quality was within 1 standard error of optimal, $k_{\mathrm{1se}}$, are determined. Which clustering criterion to optimize may depend on the application. Absent other considerations, we suggest optimizing the silhouette width, which typically does not select an excessive number of clusters, and using $k_{\mathrm{1se}}$ in the interest of parsimony. For additional discussion on choosing the number of clusters, see [2].
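The 1-standard-error selection rule described above can be sketched as follows. This is an illustrative Python version of the selection logic only (the ChooseK function itself lives in the R package); the inputs are assumed to be precomputed bootstrap replicates of a quality metric for each candidate $k$.

```python
import numpy as np

def select_k(k_values, metric_boot, larger_is_better=True):
    """Pick k_opt and k_1se from bootstrap replicates of a quality metric.

    k_values    : candidate cluster numbers, in increasing order
    metric_boot : dict mapping k -> sequence of bootstrap metric values
                  (e.g. silhouette widths)
    Returns (k_opt, k_1se): the k with the best mean metric, and the
    smallest k whose mean metric is within one standard error of it.
    """
    means = np.array([np.mean(metric_boot[k]) for k in k_values])
    ses = np.array([np.std(metric_boot[k], ddof=1)
                    / np.sqrt(len(metric_boot[k])) for k in k_values])
    if not larger_is_better:
        means = -means
    i_opt = int(np.argmax(means))
    threshold = means[i_opt] - ses[i_opt]
    # Smallest k whose mean quality clears the 1-SE threshold.
    i_1se = next(i for i in range(len(k_values)) if means[i] >= threshold)
    return k_values[i_opt], k_values[i_1se]
```

Preferring $k_{\mathrm{1se}}$ over $k_{\mathrm{opt}}$ trades a statistically negligible loss in quality for a simpler model, mirroring the parsimony argument in the text.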

Multiple imputation-based inference
Suppose the parameters $(\pi, \theta)$ of a GMM have been estimated from the observed data; however, interest ultimately lies in estimating $\vartheta$, whose estimator $\hat{\vartheta}(D)$ is a function of the complete data $D$. Let $\hat{\vartheta}_l$ denote the estimate obtained from the $l$th of $m$ imputed data sets, $\hat{V}_l$ the corresponding estimate of its sampling variance, and $\bar{\vartheta} = m^{-1} \sum_{l=1}^{m} \hat{\vartheta}_l$ the pooled point estimate. A valid estimate for the variance of $\bar{\vartheta}$ is obtained using Rubin's rules:

\[
\hat{V}(\bar{\vartheta}) = \frac{1}{m} \sum_{l=1}^{m} \hat{V}_l + \left(1 + \frac{1}{m}\right) \frac{1}{m-1} \sum_{l=1}^{m} \big(\hat{\vartheta}_l - \bar{\vartheta}\big)\big(\hat{\vartheta}_l - \bar{\vartheta}\big)^{\top}.
\]

Notice that the estimated sampling variance of $\bar{\vartheta}$ is the sum of two terms. The first term is a measure of within-imputation variability, whereas the second term is a measure of between-imputation variability. Using $\bar{\vartheta}$ and $\hat{V}(\bar{\vartheta})$, standard inference procedures may now be applied. For example, an asymptotically valid Wald statistic for assessing $H_0: \vartheta = \vartheta_0$ is:

\[
T_W = \big(\bar{\vartheta} - \vartheta_0\big)^{\top} \hat{V}(\bar{\vartheta})^{-1} \big(\bar{\vartheta} - \vartheta_0\big).
\]

Under the null hypothesis, $T_W$ follows an asymptotic $\chi^2_d(0)$ distribution with $d = \dim(\vartheta)$ degrees of freedom.
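Rubin's rules for a scalar parameter can be sketched numerically as follows (an illustrative Python version with hypothetical inputs, not the package's inference code):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool scalar estimates from m imputed data sets via Rubin's rules.

    estimates : length-m sequence of per-imputation point estimates
    variances : length-m sequence of per-imputation sampling variances
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    theta_bar = np.mean(estimates)
    within = np.mean(variances)          # within-imputation variability
    between = np.var(estimates, ddof=1)  # between-imputation variability
    total = within + (1 + 1 / m) * between
    return theta_bar, total

def wald_stat(theta_bar, total_var, theta_0=0.0):
    """Wald statistic for H0: theta = theta_0 in the scalar case."""
    return (theta_bar - theta_0) ** 2 / total_var
```

The between-imputation term inflates the variance to reflect the extra uncertainty introduced by the missing data; ignoring it (as single imputation does) yields anti-conservative inference.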

Additional analyses of GWAS summary statistics
Here we present two additional examples of clustering summary statistics from real GWAS of cardiovascular disease risk factors (Supplementary Figures 1 and 2). In both cases, the underlying data generating mechanism is unknown, but is unlikely to be a GMM. One cluster (green), corresponding to SNPs associated with BMI, is well differentiated, while the remaining two clusters (blue and gold) are poorly resolved, and it is not obvious in either case that there truly are three clusters present in the observed data. In this context, non-linear imputation, via kNN or random forests, followed by fitting a standard GMM, outperformed MGMM. This is not surprising: when the observed data arise from a distribution that is far from a GMM, using MGMM to circumvent imputation is unlikely to succeed. These examples underscore the point made in the conclusions that GMMs, and MGMM in particular, are not suited to all clustering tasks.

Comparison of the MICE-filtered and MGMM-filtered procedures
We compared the fraction of observations deemed unassignable by the MICE-filtered and MGMM-filtered procedures to ensure that filtering did not lead to an over-enrichment of complete observations, as this might have artificially increased the apparent performance. Supplementary Figure 3 displays the entropy distribution for the MICE and MGMM methods and the fraction of filtered data points by entropy threshold. Supplementary Figure 4 presents the fraction of observations deemed unassignable from the cancer RNA-Seq data set and the second 5-trait real GWAS benchmark. Note that, by design of the filtering procedure (see main material and methods), both methods remove the same proportion of observations, with minor variations due to the imprecision of the empirical distribution function for assignment entropy. As the missing data ratio increased, so too did the proportion of unassignable observations, likely due to a loss of information along coordinates important for differentiating the clusters. Supplementary Figure 5 verifies that the proportion of complete observations remaining after filtering was not dramatically increased by either MGMM or MICE. The black dashed line on each panel represents the expected fraction of complete observations for a given missingness, assuming the missingness occurs completely at random. For missingness $m$ and data of dimension $d$, this fraction is $f_{\mathrm{complete}} = (1 - m)^d$.
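As a check of the formula $f_{\mathrm{complete}} = (1 - m)^d$, the following sketch (illustrative, with arbitrary settings $m = 0.2$ and $d = 5$) compares the analytic fraction to a simulation under MCAR:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, n = 0.2, 5, 200_000

# Simulate MCAR missingness masks: each entry is independently
# observed with probability 1 - m.
observed = rng.random((n, d)) >= m

# Fraction of rows with no missing entries.
simulated = observed.all(axis=1).mean()

# Analytic fraction: f_complete = (1 - m)^d = 0.8^5 = 0.32768.
expected = (1 - m) ** d
```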

Omnibus association test
The omnibus test is a classic multi-trait test for assessing the association between a given SNP and any of a set of traits. Let $z$ denote a $d$-dimensional vector of Z-scores quantifying the association between the SNP and the $d$ traits. These are (asymptotically) normally distributed, with mean $\mu$ and covariance $\Sigma$:

\[
z \sim N(\mu, \Sigma).
\]

Note that covariance arises among the Z-scores when the traits are correlated. The null hypothesis of the omnibus test is that the means of all summary statistics are zero:

\[
H_0: \mu = 0.
\]

The alternative is that at least one mean is non-zero. The omnibus test statistic is the quadratic form:

\[
T_{\mathrm{omni}} = z^{\top} \Sigma^{-1} z.
\]

Under the null, $T_{\mathrm{omni}}$ follows a central $\chi^2_d(0)$ distribution with $d$ degrees of freedom:

\[
T_{\mathrm{omni}} \sim \chi^2_d(0).
\]
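The omnibus statistic can be sketched in a few lines (an illustrative Python version; a linear solve is used rather than forming $\Sigma^{-1}$ explicitly):

```python
import numpy as np

def omnibus_stat(z, sigma):
    """Omnibus test statistic T = z' Sigma^{-1} z for a vector of
    Z-scores z with covariance matrix sigma."""
    return float(z @ np.linalg.solve(sigma, z))
```

With $\Sigma = I$, the statistic reduces to the sum of squared Z-scores; with correlated traits, the inverse-covariance weighting down-weights redundant signals.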