Autoencoder-based cluster ensembles for single-cell RNA-seq data analysis
BMC Bioinformatics volume 20, Article number: 660 (2019)
Abstract
Background
Single-cell RNA-sequencing (scRNA-seq) is a transformative technology, allowing global transcriptomes of individual cells to be profiled with high accuracy. An essential task in scRNA-seq data analysis is the identification of cell types from complex samples or tissues profiled in an experiment. To this end, clustering has become a key computational technique for grouping cells based on their transcriptome profiles, enabling subsequent cell type identification from each cluster of cells. Due to the high feature-dimensionality of the transcriptome (i.e. the large number of measured genes in each cell) and because only a small fraction of genes are cell type-specific and therefore informative for generating cell type-specific clusters, clustering directly on the original feature/gene dimension may lead to uninformative clusters and hinder correct cell type identification.
Results
Here, we propose an autoencoder-based cluster ensemble framework in which we first take random subspace projections from the data, then compress each random projection to a low-dimensional space using an autoencoder artificial neural network, and finally apply ensemble clustering across all encoded datasets to generate clusters of cells. We employ four evaluation metrics to benchmark clustering performance and our experiments demonstrate that the proposed autoencoder-based cluster ensemble can lead to substantially improved cell type-specific clusters when applied with both the standard k-means clustering algorithm and a state-of-the-art kernel-based clustering algorithm (SIMLR) designed specifically for scRNA-seq data. Compared to directly using these clustering algorithms on the original datasets, the performance improvement in some cases is up to 100%, depending on the evaluation metric used.
Conclusions
Our results suggest that the proposed framework can facilitate more accurate cell type identification as well as other downstream analyses. The code for creating the proposed autoencoder-based cluster ensemble framework is freely available from https://github.com/gedcom/scCCESS
Background
Transcriptome profiling by single-cell RNA-sequencing (scRNA-seq) is a fast-emerging technology for studying complex tissues and biological systems at the single-cell level [1]. Identification of cell types present in a biological sample or system is a vital part of the scRNA-seq data analysis workflow [2]. The key computational technique for unbiased cell type identification from scRNA-seq data is unsupervised clustering [3]. Typically, this is achieved by using a clustering algorithm to partition cells in a scRNA-seq dataset into distinct groups and subsequently annotating each group with a cell type based on cell type marker genes and/or other biological knowledge of cell type characteristics [4].
Due to the critical role played by cell type identification for downstream analyses, significant effort has been devoted to tailoring standard clustering algorithms or developing new ones for scRNA-seq data clustering and cell type identification [5]. These include standard k-means clustering, hierarchical clustering, and variants that are specifically designed for scRNA-seq data (e.g. RaceID/RaceID2 [6], CIDR [7]) as well as more advanced methods that utilise likelihood-based mixture modelling (countClust) [8], density-based spatial clustering [9] and kernel-based single-cell clustering (SIMLR) [10]. Several studies have compared and summarised various clustering algorithms used for scRNA-seq data analysis [11–13].
One of the key challenges in scRNA-seq data clustering is handling specific characteristics of the data including high feature-dimensionality and high feature-redundancy. This is because typically a large number of genes are profiled in the experiment but only a small proportion of them are cell type-specific and therefore informative for cell type identification. Hence, clustering directly on the original high-dimensional feature space may result in suboptimal partitioning of the cells due to low signal-to-noise ratio. To reduce the high feature-dimensionality in scRNA-seq data for visualisation and downstream analyses, various dimension reduction techniques, both traditional and newly developed, have been applied. These include generic methods such as principal component analysis (PCA), independent component analysis (ICA), non-negative matrix factorization [14], and t-distributed stochastic neighbour embedding (t-SNE) [15], as well as other methods developed for scRNA-seq data, such as zero-inflated factor analysis (ZIFA) [16]. Recently, deep learning techniques such as scvis, a deep generative model [17], and scNN [18], a neural network model, were developed specifically for scRNA-seq data dimension reduction. While these new developments are primarily focused on scRNA-seq data visualisation, they represent the first applications of deep learning techniques for scRNA-seq data analysis.
Ensemble learning is an established field in machine learning and has a wide application in bioinformatics [19]. Ensemble clustering via random initialisation is a popular ensemble learning method for clustering [20]. While this approach was found to improve stability of the k-means clustering algorithm, it appeared to have a less consistent effect on clustering accuracy [21]. Ensemble clustering via random projection is an alternative ensemble learning method for clustering. This approach was applied to DNA microarray data analysis and resulted in improved clustering accuracy [22]. Weighted ensemble clustering combines multiple clustering outputs based on their respective quality [23]. Recently, cluster ensembles have been generated by combining outputs from different upstream processing and similarity metrics [24] or different clustering algorithms for cell type identification from scRNA-seq data [25, 26]. While these heuristic methods were found to be effective for improving clustering accuracy in cell type identification, they are ad hoc in nature and may not fully explore characteristics and biological signals in scRNA-seq data when clustering.
To extract biological signal from scRNA-seq data while at the same time addressing the issues of high feature-dimensionality and high feature-redundancy, here we propose an autoencoder-based cluster ensemble framework for the robust clustering of cells for cell type identification from scRNA-seq data. The proposed framework first randomly projects the original scRNA-seq datasets into subspaces to create ‘diversity’ [27] and then trains autoencoder networks to compress each such random projection to a low-dimensional feature space. Subsequently, clusterings are generated on all encoded datasets and consolidated into a final ensemble output.
The proposed framework of ensemble clustering via autoencoder-based dimension reduction and its application to scRNA-seq is a principled approach and the first of its kind. We demonstrate that (1) the autoencoder-reduced ensemble clustering of scRNA-seq data significantly improves clustering accuracy of cell types, whereas simple ensemble clustering without autoencoder-based dimension reduction showed no clear improvement; (2) improvement of clustering accuracy in general increases with the ensemble size; and (3) the proposed framework can improve cell type-specific clustering when applied using either the standard k-means clustering algorithm or a state-of-the-art kernel-based clustering algorithm (SIMLR) [10] specifically designed for scRNA-seq data analysis. This demonstrates that the proposed framework can be coupled with different clustering algorithms to facilitate accurate cell type identification and other downstream analyses.
Results
Hyperparameter optimisation for autoencoders
We undertook a grid search to optimise three hyperparameters: random projection size, encoded feature space size and autoencoder learning rate during backpropagation. The search was performed across four datasets (Table 2) using the ARI, NMI, FM and Jaccard index metrics (see the “Evaluation metrics” section). Together, the four metrics across four datasets made a total of sixteen dimensions across which to optimise. We used Pareto analysis [28] to select an appropriate combination of parameters from across all four optimisation datasets without giving priority to any single dataset or metric. A Pareto rank of 1 indicates that a combination of hyperparameters produced optimal clustering results on a selection of optimisation datasets, that is, no other combination performed at least as well on every metric and dataset in that selection and better on at least one.
Because the Pareto front becomes larger and more ambiguous as more datasets are included, we tested the robustness of each parameter set by obtaining the Pareto front for all possible combinations of 1, 2, 3 or all 4 datasets and counting the number of such combinations for which the given parameter set appears in the Pareto front (Fig. 1). We determined that the most robust high-accuracy results were obtained by selecting 2048 genes during random projection; producing an encoded feature space of 16 dimensions; and training the autoencoder using a learning rate of 0.001. All evaluation benchmarks were undertaken using this parameter combination and a hidden layer width of 128.
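The robustness count described above can be illustrated with a minimal Python sketch. The configuration tuples, dataset names and scores below are hypothetical values chosen only for illustration, and the function names are not taken from the published code; higher scores are assumed to be better for every metric.

```python
from itertools import combinations


def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Return the configurations whose score vectors no other configuration dominates."""
    return [cfg for cfg, s in scores.items()
            if not any(dominates(other, s) for other in scores.values())]


def robustness_counts(per_dataset, datasets):
    """Count, for each configuration, the dataset subsets whose Pareto front contains it."""
    counts = {cfg: 0 for cfg in per_dataset}
    for r in range(1, len(datasets) + 1):
        for subset in combinations(datasets, r):
            # Concatenate the metric scores of every dataset in this subset
            merged = {cfg: [m for d in subset for m in per_dataset[cfg][d]]
                      for cfg in per_dataset}
            for cfg in pareto_front(merged):
                counts[cfg] += 1
    return counts


# Hypothetical (projection size, encoded size, learning rate) configurations
c1, c2, c3 = (2048, 16, 0.001), (1024, 16, 0.001), (2048, 8, 0.01)
# Hypothetical (ARI, NMI) scores on two made-up datasets "A" and "B"
toy = {
    c1: {"A": (0.80, 0.75), "B": (0.50, 0.55)},
    c2: {"A": (0.70, 0.78), "B": (0.60, 0.65)},
    c3: {"A": (0.60, 0.60), "B": (0.40, 0.45)},
}
counts = robustness_counts(toy, ["A", "B"])
```

In this toy example the third configuration is dominated on every subset, so it never enters a Pareto front and would be discarded first.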
Ensemble of k-means clustering
We first asked if the ensemble of autoencoder-based clustering can improve upon the performance of a single clustering run on a single encoded dataset. To test this, we first used a standard k-means clustering algorithm (“Clustering algorithms” section) to create base clustering results and tested the performance of different ensemble sizes based on ARI, NMI, FM and Jaccard. Note that we ran the entire procedure multiple times to account for the variability in the clustering results. We found that in general the overall ensemble clustering performance improves as the number of base clustering runs increases (Fig. 2, light blue boxes) according to all four evaluation metrics and in all four datasets used for evaluation. These results demonstrate that the ensemble of autoencoder-based clustering framework can indeed improve cell type identification for the k-means clustering algorithm.
We wondered if such an improvement from ensemble clustering is independent of the random projection and autoencoder framework. Hence, we compared the performance of k-means clustering on the raw input expression matrix without using the autoencoder framework (that is, without applying the random projection and autoencoder steps) with different ensemble sizes. We found that the improvement in clustering performance was diminished in most cases (Fig. 2, red boxes), suggesting that the improved clustering performance is due to the random projection and autoencoder steps implemented in the proposed framework in addition to the ensemble step. Notably, the autoencoder framework also enhanced the data signal-to-noise ratio in most cases, as can be seen from the improved performance of autoencoder-based k-means clustering compared to direct k-means clustering on the raw input at the ensemble size of 1.
Another interesting observation is that the variance of the clustering outputs in general decreased with the increasing number of base clusterings (Fig. 2). While the ensemble of k-means clustering without random projection and autoencoder does not improve cell type identification accuracy, it does reduce the clustering variability, and therefore resulted in more stable and reproducible clustering outputs compared to a single run of k-means clustering. These results are consistent with previous findings [21]. In comparison, autoencoder-based ensemble clustering led to both more accurate cell type identification and a reduction of variability, both of which are desirable characteristics for scRNA-seq data analysis.
Autoencoder-based SIMLR ensemble
While the autoencoder-based cluster ensemble framework is able to improve the performance of a standard k-means clustering algorithm in both accuracy and reproducibility of cell type clustering, we wondered if such an ensemble framework could also improve the performance of a more advanced clustering algorithm. To this end, we applied the proposed framework to a state-of-the-art kernel-based clustering algorithm, SIMLR, designed specifically for cell type identification on scRNA-seq data. Because the computational complexity of SIMLR grows exponentially with the number of cells in a dataset, we focused our evaluation on the two smaller datasets (i.e. GSE84371 and GSE82187). Similar to k-means clustering, we found in these cases that the performance of the autoencoder-based SIMLR ensemble improved with increased ensemble size (Fig. 3). Clustering variability also generally decreased with larger ensemble sizes. These results demonstrate that the proposed autoencoder-based cluster ensemble framework can also lead to more accurate cell type identification and clustering reproducibility from scRNA-seq data for SIMLR.
While cell type clustering accuracy improves with larger ensemble sizes for both the k-means clustering algorithm and SIMLR (Figs. 2 and 3), we observed that this improvement plateaus at an ensemble size of 50 (Fig. 3). We therefore recommend an ensemble size of 50 as a good trade-off between clustering output quality and computational time. Note that the computational complexity of the proposed cluster ensemble framework increases linearly with respect to the ensemble size.
Performance comparison of autoencoder-based cluster ensemble
Typically, a single run of a clustering algorithm is used to identify cell types from a given scRNA-seq dataset. An interesting question is how much improvement the proposed autoencoder-based cluster ensemble offers compared to the common clustering procedure where a clustering algorithm is directly applied to raw gene expression data (that is, without random projection and autoencoder steps). To address this, we next quantified cell type clustering accuracy from the direct application of k-means and SIMLR clustering to the raw gene expression input and compared these with the autoencoder-based cluster ensemble of k-means and SIMLR, respectively. Note that an ensemble size of 50 was used for the cluster ensemble. Both k-means clustering and the random projection step in ensemble clustering are non-deterministic. While SIMLR contains technically non-deterministic steps (including a k-means step), we found that repeated runs on the same raw dataset with different random seeds produced identical clustering partitions. Consequently, SIMLR may be thought of as functionally deterministic. To account for variability in the clustering results for stochastic methods, we repeated clustering ten times and calculated the standard deviation across multiple runs. Table 1 summarises these results.
Specifically, the autoencoder-based k-means ensemble improved cell type clustering by an average of about 30% in the four evaluation datasets according to all four evaluation metrics (Table 1). Clustering variability was also typically smaller using the autoencoder-based k-means ensemble. Perhaps more strikingly, the cell type clustering accuracy as measured by ARI and Jaccard metrics for the autoencoder-based SIMLR ensemble improved by about 50% to 100% compared to using SIMLR alone on the raw expression matrix for the mouse neurons and striatum datasets. Moreover, we found that the cell type clustering accuracy of SIMLR is substantially better than that of the standard k-means clustering algorithm, suggesting SIMLR is indeed an effective clustering algorithm for scRNA-seq data analysis. Therefore, the further gain in clustering accuracy by coupling SIMLR with the proposed autoencoder-based cluster ensemble is of practical importance and will add to the state-of-the-art methods for scRNA-seq data analysis.
Comparison of autoencoder-based cluster ensemble with PCA-based clustering
We next compared the performance of the autoencoder-based cluster ensemble with PCA-based clustering. PCA is a commonly used dimension reduction method and has been widely used for reducing the high dimensionality of scRNA-seq data prior to clustering cell types. By benchmarking the performance of cell type clustering across the evaluation datasets, we found that in almost all cases the autoencoder-based cluster ensemble outperformed the PCA-based approach for both k-means clustering and SIMLR according to all four evaluation metrics (Fig. 4). We confirmed the statistical significance of these performance improvements using the Wilcoxon Rank Sum test. These results further demonstrate the utility of the autoencoder-based cluster ensemble for more accurate clustering of cell types in scRNA-seq datasets.
Discussion
There may be further opportunities to build on the proposed method:
Firstly, we performed hyperparameter optimization over four datasets searching for the most robust configuration. While our chosen configuration was the most consistently accurate over all possible combinations of these four datasets, we saw that it was not among the most accurate configurations for two of the optimization datasets individually. Additionally, there is no guarantee that this combination falls into a global or near-global optimum across scRNA-seq datasets in general, or that such an optimum exists. Devising a way to produce parameter configurations based on individual dataset characteristics without ground-truth labels may be an avenue for further exploration.
Secondly, the current iteration of our proposed method uses random subspace projection to reduce the dimension of datasets prior to autoencoder training. An additional direction for future research may include exploring other methods of basic dimension reduction, such as weighted gene selection based on variability or other metrics; this may be more useful in capturing cell type-specific characteristics by retaining genes containing more biological signal related to cell type.
Lastly, while the clustering algorithms k-means and SIMLR were used as independent components in our current proposed framework, an interesting direction for future work might be the development of an artificial neural network architecture and training method which performs simultaneous dimension reduction and clustering. An integrated approach such as this may facilitate the exploration of clustering output in the reduced feature space.
Conclusions
High-throughput scRNA-seq technology is transforming biological and medical research by allowing the global transcriptome profiles of individual cells from heterogeneous samples and tissues to be quantified with high precision. Cell type identification has become essential in scRNA-seq data analysis, and clustering has been the key computational approach used for this task. In this study, we have proposed an autoencoder-based ensemble clustering approach by incorporating several state-of-the-art techniques in a computational framework.
We evaluated the proposed clustering framework on its impact on the level and robustness of cell type identification accuracy using a collection of scRNA-seq datasets with predefined cell type annotations. Based on previously defined gold standards for each scRNA-seq dataset, we demonstrate that the proposed framework is highly effective for cell type identification. The application of the proposed framework to both a standard k-means clustering algorithm and a state-of-the-art kernel-based clustering algorithm, SIMLR, illustrates its generalisability and applicability to other clustering algorithms. We therefore envision the proposed framework being flexibly adopted into the common workflow for scRNA-seq data analysis.
Methods
The autoencoder-based cluster ensemble framework is summarised in Fig. 5a. The proposed framework accepts scRNA-seq data in the form of an N×M expression matrix (denoted as X) where N is the number of cells and M is the number of genes.
Dimension reduction by autoencoders
Genes are randomly selected from the input dataset to produce a set of “random projection” datasets X^{t} (t=1,...,T), each with a dimension of N×M^{′}. The purpose of this step is to create ‘diversity’ [27] in subsequent encodings and individual clusterings of these datasets to achieve a more robust consensus in the resulting ensemble. Following the random projection step, each matrix X^{t} is then used to train a fully connected autoencoder neural network. An autoencoder is an artificial neural network consisting of two subnetworks: an encoder and a decoder, intersecting at a ‘bottleneck’ layer of a smaller size than the original input. The network is trained to reconstruct the original input with minimum error, forcing the network to learn to encode the information contained within the smaller latent space of the output of the bottleneck layer [29].
In the autoencoders used with our framework, the encoder accepts samples of cell data from X as input. The encoder contains a single hidden layer and an output layer which produces reduced-dimension encodings of the aforementioned samples. The decoder subnetwork accepts these encoded samples as input, passing these through a single hidden layer and an output layer which produces reconstructions of the original samples. In both subnetworks, the activation function of the hidden layer is a ‘Leaky ReLU’ [30]; linear activation is applied to all other layers.
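Each autoencoder is trained by minimising reconstruction error using the mean squared error loss function:
\[ \mathrm{MSE}\left(X^{t}, X^{t^{\prime }}\right) = \frac{1}{N M^{\prime }} \sum_{i=1}^{N} \sum_{j=1}^{M^{\prime }} \left( x^{t}_{ij} - x^{t^{\prime }}_{ij} \right)^{2} \]
where X^{t} is the input expression matrix from the t^{th} random projection, \(X^{t^{\prime }}\) is the autoencoder’s reconstruction of X^{t}, and \(x^{t}_{ij}\) and \(x^{t^{\prime }}_{ij}\) are their respective entries for cell i and gene j.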
Following training, each matrix X^{t} is fed through its respective autoencoder and a low-dimensional encoded dataset is extracted from the encoder output. Training and hyperparameter optimisation of autoencoders are discussed in the “Hyperparameter optimisation for autoencoders” section.
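The random projection and encoding steps can be sketched in Python with NumPy as follows. This is an illustrative sketch under stated assumptions, not the published implementation: the layer structure, activations and mean squared error loss follow the description above, while the optimiser (plain full-batch gradient descent), the weight initialisation, the Leaky ReLU slope of 0.01, the function names and the toy data sizes are all assumptions made for the example.

```python
import numpy as np


def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)


def random_subspace(X, n_genes, rng):
    """Select n_genes columns (genes) of X at random, without replacement."""
    cols = rng.choice(X.shape[1], size=n_genes, replace=False)
    return X[:, cols]


def train_autoencoder(Xt, hidden=128, encoded=16, lr=0.001, epochs=100,
                      seed=0, slope=0.01):
    """Train input -> hidden -> bottleneck -> hidden -> output on MSE loss.

    Returns (encode, losses): a function mapping cells to their encodings
    and the reconstruction loss recorded at the start of each epoch.
    """
    rng = np.random.default_rng(seed)
    sizes = [Xt.shape[1], hidden, encoded, hidden, Xt.shape[1]]
    # Assumed initialisation: zero-mean Gaussian scaled by the fan-in
    weights = [rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(n_out) for n_out in sizes[1:]]
    # Leaky ReLU on the two hidden layers, linear on bottleneck and output
    leaky = [True, False, True, False]
    losses = []
    for _ in range(epochs):
        # Forward pass, keeping pre-activations for backpropagation
        acts, pre = [Xt], []
        for W, b, is_leaky in zip(weights, biases, leaky):
            z = acts[-1] @ W + b
            pre.append(z)
            acts.append(leaky_relu(z, slope) if is_leaky else z)
        err = acts[-1] - Xt
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: delta is the loss gradient w.r.t. a layer's output
        delta = 2.0 * err / err.size
        for k in reversed(range(4)):
            if leaky[k]:
                delta = delta * np.where(pre[k] > 0, 1.0, slope)
            grad_W, grad_b = acts[k].T @ delta, delta.sum(axis=0)
            delta = delta @ weights[k].T  # propagate before updating weights
            weights[k] -= lr * grad_W
            biases[k] -= lr * grad_b

    def encode(X_new):
        h = leaky_relu(X_new @ weights[0] + biases[0], slope)
        return h @ weights[1] + biases[1]

    return encode, losses


# Toy usage: 30 cells x 40 genes of synthetic data, 20 genes per projection
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 40))
Xt = random_subspace(X, n_genes=20, rng=rng)
encode, losses = train_autoencoder(Xt, hidden=8, encoded=2, lr=0.001, epochs=200)
Z = encode(Xt)  # encoded dataset with one row per cell
```

With the settings reported in the Results, the corresponding call would use 2048 genes, a hidden width of 128, an encoded size of 16 and a learning rate of 0.001; a practical implementation would more likely rely on a deep learning library with mini-batches.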
Clustering algorithms
To perform clustering on dimension-reduced datasets generated from autoencoders, we utilised both a standard k-means clustering algorithm with Lloyd’s implementation [32] and a kernel-based clustering algorithm (SIMLR) specifically designed for scRNA-seq data analysis [10].
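Given an initial set of random centres m_{1},...,m_{K}, and a distance matrix D (typically computed from Euclidean space), the k-means algorithm first finds the closest cluster centre for each cell based on its expression profile, with \(X^{e}=\{x^{e}_{1},..., x^{e}_{N}\}\):
\[ C_{k} = \left\{ x^{e}_{i} : \left\| x^{e}_{i} - m_{k} \right\|^{2} \le \left\| x^{e}_{i} - m_{j} \right\|^{2} \;\; \forall j,\ 1 \le j \le K \right\} \]
and then updates the cluster centres:
\[ m_{k} = \frac{1}{|C_{k}|} \sum_{x^{e}_{i} \in C_{k}} x^{e}_{i} \]
where C_{k} is the set of cells currently assigned to centre m_{k}. These two steps are repeated until the assignments no longer change. The output is the assignment of each cell, based on its expression profile x, to a cluster k∈{1,...,K}.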
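The two alternating steps of Lloyd's algorithm can be sketched in Python with NumPy as below. The function name, the stopping rule and the toy data are assumptions made for illustration; squared Euclidean distance is assumed throughout.

```python
import numpy as np


def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Alternate assignment and centre-update steps until the centres stop moving."""
    rng = np.random.default_rng(seed)
    # Initial centres: k distinct cells drawn at random
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance of every cell to every centre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centre becomes the mean of its assigned cells
        new_centres = centres.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's centre unchanged
                new_centres[j] = members.mean(axis=0)
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels, centres


# Toy usage: six cells in two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.2, 0.2], [0.4, 0.4],
                  [10.0, 10.0], [10.2, 10.2], [10.4, 10.4]])
labels, centres = lloyd_kmeans(X_toy, k=2)
```

Because the initial centres are random, repeated runs on real data can converge to different partitions, which is the variability the ensemble step is intended to reduce.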
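SIMLR calculates the distance matrix for cells using multiple kernels as follows:
\[ D(x_{i}, x_{j}) = 2 - 2 \sum_{l} w_{l} K_{l}(x_{i}, x_{j}) \]
where w_{l} is the weight of the l^{th} Gaussian kernel function K_{l}, which is defined for a pair of cells as follows:
\[ K(x_{i}, x_{j}) = \frac{1}{\epsilon_{ij} \sqrt{2\pi}} \exp\left( - \frac{\left\| x_{i} - x_{j} \right\|^{2}}{2 \epsilon_{ij}^{2}} \right) \]
where ε_{ij} is the variance and ‖x_{i}−x_{j}‖^{2} is the squared distance between cells i and j, calculated from their expression profiles x_{i} and x_{j}. To test the proposed framework, we utilised SIMLR (Version 1.8.0) implemented in Bioconductor (Release 3.7).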
The number of clusters to be created was set according to the number of predefined cell types/classes in each scRNA-seq dataset for both k-means clustering and SIMLR (see “Evaluation metrics” section). After obtaining individual clustering outputs (denoted as P^{t}, t=1,...,T) from either k-means clustering or SIMLR, a fixed-point algorithm for obtaining hard least squares Euclidean consensus partitions was applied to compute the consensus P_{E} of individual partitions [31]:
\[ P_{E} = \arg\min_{P} \sum_{t=1}^{T} w_{t} \, D(P^{t}, P)^{2} \]
in which w_{t} is the weight associated with the t^{th} individual clustering output and is set to 1 in our case, and D(P^{t},P)^{2} is the squared Euclidean function for computing the distance of an individual partition P^{t} from a candidate consensus partition P.
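The idea of a least squares consensus can be illustrated with a simplified Python sketch. It is written in the spirit of the fixed-point approach but is not the algorithm of [31]: partitions are represented as one-hot membership matrices, cluster labels are matched to the current consensus by brute-force search over permutations (feasible only for a small number of clusters), and the weighted mean membership is hardened by taking the largest entry per cell. All function names are hypothetical.

```python
from itertools import permutations

import numpy as np


def one_hot(labels, k):
    """Membership matrix with one row per cell and one column per cluster."""
    M = np.zeros((len(labels), k))
    M[np.arange(len(labels)), labels] = 1.0
    return M


def align(M, ref):
    """Permute the columns of M to minimise squared Euclidean distance to ref."""
    k = M.shape[1]
    best = min(permutations(range(k)),
               key=lambda p: ((M[:, list(p)] - ref) ** 2).sum())
    return M[:, list(best)]


def consensus_partition(partitions, k, weights=None, n_iter=20):
    """partitions: list of integer label sequences with values in 0..k-1."""
    memberships = [one_hot(np.asarray(p), k) for p in partitions]
    w = (np.ones(len(memberships)) if weights is None
         else np.asarray(weights, dtype=float))
    consensus = memberships[0]  # start from the first base clustering
    for _ in range(n_iter):
        aligned = [align(M, consensus) for M in memberships]
        mean = sum(wi * A for wi, A in zip(w, aligned)) / w.sum()
        new = one_hot(mean.argmax(axis=1), k)  # harden the mean membership
        if np.array_equal(new, consensus):
            break
        consensus = new
    return consensus.argmax(axis=1)


# Toy usage: four base clusterings of six cells; label names differ between runs
base = [[0, 0, 1, 1, 2, 2],
        [1, 1, 0, 0, 2, 2],
        [2, 2, 1, 1, 0, 0],
        [0, 0, 1, 1, 2, 1]]
final = consensus_partition(base, k=3)
```

For larger numbers of clusters the permutation search would normally be replaced by a linear assignment solver such as the Hungarian algorithm.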
Together, the proposed autoencoder-based cluster ensemble framework can be summarised in pseudocode as below.
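A minimal Python outline of these steps is given here as an illustration; it is a sketch rather than the authors' own listing. The encoder, base clusterer and consensus function are passed in as callables, and the stand-ins used in the toy example (a column slice, a threshold and a majority vote) are hypothetical placeholders for a trained autoencoder, k-means or SIMLR, and the consensus algorithm respectively.

```python
import numpy as np


def autoencoder_cluster_ensemble(X, k, ensemble_size, n_genes,
                                 encode_fn, cluster_fn, consensus_fn, seed=0):
    """Random subspace -> encode -> cluster, repeated, then consensus.

    X            : cells x genes expression matrix
    encode_fn    : maps a projected matrix to its low-dimensional encoding
    cluster_fn   : maps (encoding, k) to integer cluster labels
    consensus_fn : maps (list of label arrays, k) to consensus labels
    """
    rng = np.random.default_rng(seed)
    partitions = []
    for _ in range(ensemble_size):
        # Step 1: random subspace projection (random subset of genes)
        genes = rng.choice(X.shape[1], size=n_genes, replace=False)
        # Step 2: compress the projection to a low-dimensional encoding
        Z = encode_fn(X[:, genes])
        # Step 3: base clustering on the encoded dataset
        partitions.append(cluster_fn(Z, k))
    # Step 4: consolidate all base clusterings into one ensemble output
    return consensus_fn(partitions, k)


# Toy usage with placeholder components (not the real encoder or clusterer)
X_demo = np.vstack([np.zeros((3, 10)), np.full((3, 10), 5.0)])
ensemble_labels = autoencoder_cluster_ensemble(
    X_demo, k=2, ensemble_size=5, n_genes=4,
    encode_fn=lambda Xt: Xt[:, :2],
    cluster_fn=lambda Z, k: (Z.mean(axis=1) > 2.5).astype(int),
    consensus_fn=lambda parts, k: (np.mean(parts, axis=0) > 0.5).astype(int),
)
```

Passing the components as functions mirrors the observation that the framework can be coupled with different clustering algorithms.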
Data description and evaluation
This section summarises the scRNA-seq datasets and performance assessment metrics utilised for method evaluation.
scRNA-seq datasets
A collection of eight publicly available scRNA-seq datasets (Table 2) was used in this study. These datasets were downloaded from the NCBI GEO repository, the EMBL-EBI ArrayExpress repository, or the Broad Institute Single-Cell database portal. The log_{2}-transformed transcripts per million (TPM) or counts per million (CPM) values (as determined by the original publication for a given dataset) were used to quantify full-length gene expression for datasets generated by SMARTer or SMART-seq2 protocols. UMI-filtered counts were used to quantify gene expression for the Drop-seq datasets. Cell type identification was performed on all datasets using biological knowledge in their respective original publications, and we retain these annotations for evaluation purposes. For each dataset, genes detected in fewer than 20% of cells were removed. This step trims the number of genes and allows only those that are expressed in at least a subset of cells to be considered for subsequent analyses. Four datasets were used to optimise autoencoder hyperparameters. We present evaluation benchmarking results for four additional datasets.
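The gene filtering step can be sketched in Python with NumPy as follows. The sketch assumes that a gene counts as detected in a cell when its expression value is greater than zero, which the text does not state explicitly; the function name and toy matrix are illustrative.

```python
import numpy as np


def filter_genes(X, min_fraction=0.2):
    """Keep genes detected (assumed: value > 0) in at least min_fraction of cells.

    X is a cells x genes matrix. Returns the filtered matrix and the boolean
    mask of retained genes.
    """
    detected_fraction = (X > 0).mean(axis=0)
    keep = detected_fraction >= min_fraction
    return X[:, keep], keep


# Toy usage: 5 cells x 3 genes; the second gene is never detected
X_counts = np.array([[1, 0, 3],
                     [0, 0, 2],
                     [4, 0, 0],
                     [0, 0, 1],
                     [5, 0, 0]])
X_filtered, kept = filter_genes(X_counts)
```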
Evaluation metrics
A common approach to assess the performance of clustering methods for cell type identification in scRNA-seq data analysis is to compare the concordance of the clustering outputs of cells with a ‘gold standard’. As mentioned above, such a gold standard may be obtained from orthogonal information such as cell type marker genes and/or other biological knowledge of cell type characteristics. In this study, cell type annotations from their original publications are used as ‘gold standards’.
For each dataset, the number of clusters for both k-means clustering and SIMLR was set as the number of predefined classes based on its original publication, and the concordance between the clustering outputs and the ‘gold standard’ was measured using different metrics. Here we employed a panel of four evaluation metrics including Adjusted Rand index (ARI), normalized mutual information (NMI), Fowlkes-Mallows index (FM), and Jaccard index [40] (Fig. 6).
Let G, P be the cell partitions based on the gold standard and the clustering output respectively. We define a, the number of pairs of cells assigned to the same cell type in both partitions; b, the number of pairs of cells assigned to the same cell type in the first partition but to different cell types in the second partition; c, the number of pairs of cells assigned to different cell types in the first partition but to the same cell type in the second partition; and d, the number of pairs of cells assigned to different cell types in both partitions. ARI, FM, and Jaccard index can then be calculated as follows:
\[ \mathrm{ARI} = \frac{2(ad - bc)}{(a+b)(b+d) + (a+c)(c+d)} \]
\[ \mathrm{FM} = \sqrt{ \frac{a}{a+b} \cdot \frac{a}{a+c} } \]
\[ \mathrm{Jaccard} = \frac{a}{a+b+c} \]
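Let G={u_{1},u_{2},...,u_{K}} and P={v_{1},v_{2},...,v_{K}} denote the gold standard and the clustering partition across K classes, respectively. NMI is defined as the mutual information I(G,P) of G and P normalised by the entropies H(G) and H(P) of the two partitions. The mutual information is defined as
\[ I(G,P) = \sum_{i=1}^{K} \sum_{j=1}^{K} \frac{|u_{i} \cap v_{j}|}{N} \log \frac{N \, |u_{i} \cap v_{j}|}{|u_{i}| \, |v_{j}|} \]
and H(G) and H(P) are the entropy of partitions G and P calculated as
\[ H(G) = -\sum_{i=1}^{K} \frac{|u_{i}|}{N} \log \frac{|u_{i}|}{N}, \qquad H(P) = -\sum_{j=1}^{K} \frac{|v_{j}|}{N} \log \frac{|v_{j}|}{N} \]
where N is the total number of cells.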
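The four metrics can be computed from two label vectors with the short Python sketch below, which uses only the standard library. It assumes the standard pair-counting forms of ARI, FM and the Jaccard index. For NMI it assumes normalisation by the arithmetic mean of the two entropies, 2I(G,P)/(H(G)+H(P)), which is one common convention; the normalisation used in this study may differ. Function names are illustrative.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt


def pair_counts(gold, pred):
    """Count cell pairs by agreement: (a, b, c, d) as defined in the text."""
    a = b = c = d = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold, same_pred = gold[i] == gold[j], pred[i] == pred[j]
        if same_gold and same_pred:
            a += 1
        elif same_gold:
            b += 1
        elif same_pred:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Note: these formulas divide by zero for degenerate partitions
# (for example, when every cell is in a single cluster in both partitions).
def ari(a, b, c, d):
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))


def fm(a, b, c, d):
    return sqrt((a / (a + b)) * (a / (a + c)))


def jaccard(a, b, c, d):
    return a / (a + b + c)


def nmi(gold, pred):
    """Assumed normalisation: 2 * I(G, P) / (H(G) + H(P))."""
    n = len(gold)
    n_gold, n_pred = Counter(gold), Counter(pred)
    n_joint = Counter(zip(gold, pred))
    mi = sum(nij / n * log(n * nij / (n_gold[u] * n_pred[v]))
             for (u, v), nij in n_joint.items())
    h_gold = -sum(x / n * log(x / n) for x in n_gold.values())
    h_pred = -sum(x / n * log(x / n) for x in n_pred.values())
    return 2 * mi / (h_gold + h_pred)


# Toy usage: identical partitions with swapped label names
gold = [0, 0, 1, 1]
pred = [1, 1, 0, 0]
counts_abcd = pair_counts(gold, pred)
```

All four metrics compare groupings rather than label names, so a clustering that recovers the gold standard under different labels still scores 1.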
Availability of data and materials
The datasets generated and/or analysed during the current study are available from the GEO, EBI, and Broad repositories. Details of datasets are listed in Table 2.
Abbreviations
ARI: Adjusted Rand index
CPM: Counts per million
FM: Fowlkes-Mallows index
PCA: Principal component analysis
scRNA-seq: Single-cell RNA-seq
SIMLR: Kernel-based clustering algorithm
t-SNE: t-distributed stochastic neighbour embedding
TPM: Transcripts per million
References
Ziegenhain C, Vieth B, Parekh S, et al.Comparative analysis of singlecell rna sequencing methods. Mol Cell. 2017; 65(4):631–43.
Trapnell C. Defining cell types and states with singlecell genomics. Genome Res. 2015; 25(10):1491–8.
Bacher R, Kendziorski C. Design and computational analysis of singlecell rnasequencing experiments. Genome Biol. 2016; 17(1):63.
Kolodziejczyk AA, Kim JK, Svensson V, Marioni JC, Teichmann SA. The technology and biology of singlecell rna sequencing. Mol Cell. 2015; 58(4):610–20.
Kiselev VY, Andrews TS, Hemberg M. Challenges in unsupervised clustering of singlecell rnaseq data. Nat Rev Genet. 2019; 20:273–282.
Grün D, Lyubimova A, Kester L, et al.Singlecell messenger rna sequencing reveals rare intestinal cell types. Nature. 2015; 525(7568):251.
Lin P, Troup M, Ho JW. Cidr: Ultrafast and accurate clustering through imputation for singlecell rnaseq data. Genome Biol. 2017; 18(1):59.
Dey KK, Hsiao CJ, Stephens M. Visualizing the structure of rnaseq expression data using grade of membership models. PLoS Genet. 2017; 13(3):1006599.
Macosko EZ, Basu A, Satija R, et al.Highly parallel genomewide expression profiling of individual cells using nanoliter droplets. Cell. 2015; 161(5):1202–14.
Wang B, Zhu J, Pierson E, Ramazzotti D, Batzoglou S. Visualization and analysis of singlecell rnaseq data by kernelbased similarity learning. Nat Methods. 2017; 14(4):414.
Freytag S, Tian L, Lönnstedt I, Ng M, Bahlo M. Comparison of clustering tools in R for mediumsized 10 × Genomics singlecell RNAsequencing data. F1000Research. 2018; 7:1297. https://doi.org/10.12688/f1000research.15809.1.
Duò A, Robinson MD, Soneson C. A systematic performance evaluation of clustering methods for singlecell RNAseq data. F1000Research. 2018; 7:1141. https://doi.org/10.12688/f1000research.15666.1.
Kim T, Chen IR, Lin Y, Wang AYY, Yang JYH, Yang P. Impact of similarity metrics on singlecell rnaseq data clustering. Brief Bioinforma. 2018. https://doi.org/10.1093/bib/bby076.
Shao C, Höfer T. Robust classification of singlecell transcriptome data by nonnegative matrix factorization. Bioinformatics. 2017; 33(2):235–42.
Maaten Lvd, Hinton G. Visualizing data using tsne. J Mach Learn Res. 2008; 9(Nov):2579–605.
Pierson E, Yau C. Zifa: Dimensionality reduction for zeroinflated singlecell gene expression analysis. Genome Biol. 2015; 16(1):241.
Ding J, Condon A, Shah SP. Interpretable dimensionality reduction of single cell transcriptome data with deep generative models. Nat Commun. 2018; 9(1):2002.
Lin C, Jain S, Kim H, BarJoseph Z. Using neural networks for reducing the dimensions of singlecell rnaseq data. Nucleic Acids Res. 2017; 45(17):156.
Yang P, Hwa Yang Y, B Zhou B, Y Zomaya A. A review of ensemble methods in bioinformatics. Curr Bioinforma. 2010; 5(4):296–308.
VegaPons S, RuizShulcloper J. Int J Pattern Recogn Artif Intell. 2011; 25(03):337–72.
Kuncheva LI, Vetrov DP. Evaluation of stability of kmeans cluster ensembles with respect to random initialization. IEEE Trans Patt Anal Mach Intell. 2006; 28(11):1798–808.
Avogadri R, Valentini G. Fuzzy ensemble clustering based on random projections for dna microarray data analysis. Artif Intell Med. 2009; 45(23):173–83.
Ren Y, Domeniconi C, Zhang G, Yu G. Weightedobject ensemble clustering. In: Data Mining (ICDM), 2013 IEEE 13th International Conference On. IEEE: 2013. p. 627–36.
Kiselev VY, Kirschner K, Schaub M, et al.Sc3: consensus clustering of singlecell rnaseq data. Nat Methods. 2017; 14(5):483.
Yang Y, Huh R, Culpepper HW, Lin Y, Love MI, Li Y. Safeclustering: Singlecell aggregated (from ensemble) clustering for singlecell rnaseq data. Bioinformatics. 2018; 35(8):1269–77.
Risso D, Purvis L, Fletcher RB, et al.clusterexperiment and RSEC: A bioconductor package and framework for clustering of singlecell and other large gene expression datasets. PLoS Comput Biol. 2018; 14(9):1006378.
Kuncheva LI, Hadjitodorov ST. Using diversity in cluster ensembles. In: 2004 IEEE International Conference On Systems, Man and Cybernetics. IEEE: 2004. p. 1214–9. https://doi.org/10.1109/icsmc.2004.1399790.
Ngatchou P, Zarei A, ElSharkawi A. Pareto multi objective optimization. In: Intelligent Systems Application to Power Systems, 2005. Proceedings of the 13th International Conference On. Arlington: IEEE: 2005. p. 84–91.
Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science. 2006; 313(5786):504–7.
Maas AL, Hannun AY, Ng AY. Rectifier nonlinearities improve neural network acoustic models. In: ICML Workshop on Deep Learning for Audio, Speech and Language Processing. Atlanta: 2013.
Hornik K. A CLUE for cluster ensembles. J Stat Softw. 2005; 14(12):1–25.
Lloyd S. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982; 28(2):129–37.
Zeisel A, Muñoz-Manchado AB, Codeluppi S, et al. Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science. 2015; 347(6226):1138–42.
Deng Q, Ramsköld D, Reinius B, Sandberg R. Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells. Science. 2014; 343(6167):193–6.
Darmanis S, Sloan SA, Zhang Y, Enge M, Caneda C, Shuer LM, Gephart MGH, Barres BA, Quake SR. A survey of human brain transcriptome diversity at the single cell level. Proc Natl Acad Sci. 2015; 112(23):7285–90.
Petropoulos S, Edsgärd D, Reinius B, Deng Q, Panula SP, Codeluppi S, Reyes AP, Linnarsson S, Sandberg R, Lanner F. Single-cell RNA-seq reveals lineage and X chromosome dynamics in human preimplantation embryos. Cell. 2016; 165(4):1012–26.
Habib N, Li Y, Heidenreich M, et al. Div-Seq: Single-nucleus RNA-seq reveals dynamics of rare adult newborn neurons. Science. 2016; 353(6302):925–8.
Gokce O, Stanley GM, Treutlein B, et al. Cellular taxonomy of the mouse striatum as revealed by single-cell RNA-seq. Cell Rep. 2016; 16(4):1126–37.
Habib N, Avraham-Davidi I, Basu A, et al. Massively parallel single-nucleus RNA-seq with DroNc-seq. Nat Methods. 2017; 14(10):955.
Wagner S, Wagner D. Comparing clusterings: an overview. Karlsruhe: Universität Karlsruhe, Fakultät für Informatik; 2007.
Acknowledgements
The authors thank their colleagues at the School of Mathematics and Statistics and the School of Life and Environmental Sciences for informative discussions and valuable feedback.
About this supplement
This article has been published as part of BMC Bioinformatics, Volume 20 Supplement 19, 2019: 18th International Conference on Bioinformatics. The full contents of the supplement are available at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-19.
Funding
Publication of this supplement was funded by the Australian Research Council Discovery Early Career Researcher Award (DE170100759) and a National Health and Medical Research Council (NHMRC) Investigator Grant (1173469) to PY, the Australian Research Council Discovery Projects (DP170100654) to JYHY and PY, the NHMRC Career Development Fellowship (1105271) to JYHY, and the Australian Government Research Training Program Scholarship and the Judith and David Coffey Life Lab Gift scholarship to TAG.
Author information
Contributions
PY and TAG conceived the study. TAG led the experimental design and data analyses. TK contributed to the data curation and analyses. LN contributed to the algorithm design with input from DT. PY and TAG interpreted the experimental results with input from JGB and JYHY. PY and TAG wrote the manuscript with input from all authors. All authors reviewed and approved the final version of the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Geddes, T., Kim, T., Nan, L. et al. Autoencoder-based cluster ensembles for single-cell RNA-seq data analysis. BMC Bioinformatics 20 (Suppl 19), 660 (2019). https://doi.org/10.1186/s12859-019-3179-5
DOI: https://doi.org/10.1186/s12859-019-3179-5
Keywords
 Autoencoder
 Cluster ensemble
 Single cells
 scRNA-seq
 Single-cell transcriptome
 Cell type identification