To demonstrate the performance of pcaReduce method, we considered two single cell RNA-seq dataset examples. The first contains a collection of cells originating from diverse biological tissues [30]; and the second dataset the mouse sensory neuronal cells [27] discussed in the Introduction. These were selected as they contained pre-existing hierarchical cluster structures that can be used to assess unsupervised algorithmic performance. Here we show that pcaReduce can be applied to re-capture the known cellular hierarchies and we compare to other statistical techniques, which are commonly applied to address similar cell sub-typing problems. Below, all examples were implemented using the free statistical computing platform R (https://www.r-project.org).
Cells from disparate tissues
We obtained single cell RNA-seq dataset [30] for 300 cells whose transcriptional measurements were taken across 8,686 genes (see Additional file 1, Section A for further details on data preparation). The data were derived from 11 cell types: K562 – myeloid (chronic leukemia), HL60 – myeloid (acute leukemia), CRL-2339 – lymphoblastoid; iPS – pluripotent; CRL-2338 – epithelial, BJ – fibroblast (from human foreskin), Kera – foreskin keratinocyte; NPC – neural progenitor cells, GW(16, 21, 21+3) – gestational week (16,21, 21+3 weeks), fetal cortex (see Fig. 3
a). In addition, as specified in the original study by [30], these cell types could also be grouped into four disparate tissue sources: blood, stem, skin and neural tissues. We refer to these as the cell line-level and tissue-level classifications respectively and use these as ground-truth classes in our performance assessment; i.e. we will focus on data partitions into K=11 and K=4 clusters.
We applied pcaReduce to this dataset to construct a hierarchical clustering of cells. First, we initially projected the data into the subspace spanned by the first 30 principal components following a PCA and performed an initial K-means clustering to get initial cluster assignments (using K=31 clusters) [31]. After this, we applied different merging strategies to construct the cellular hierarchies: first, when merging is performed based on the most probable cluster merge value (see (i) in Algorithm 1) and, secondly, when merge candidates are probabilistically sampled (see (ii) in Algorithmic overview section). The former gives a single hierarchical clustering whilst the latter can give a range of candidates hierarchies based on repeated sampling.
We compared the hierarchical clustering given by pcaReduce for levels K=4 and K=11 to the true cell line and tissue level classifications respectively using the Adjusted Rand Index [32]. Note that, in Fig. 3
a, the projection of the eleven cell lines in two-dimensional principal component space cannot be separated into distinct groups. It is only possible to do this in higher dimensional representations. Figure 3
b illustrates the performance of pcaReduce using the sampling-based merge operation where each line corresponds to a single run of the method. Although, pcaReduce has no knowledge of the true number of cell line or tissue labels, the correspondence between the hierarchical clustering output of pcaReduce and the true classification peaks at around levels 4 and 11 of the hierarchies respectively which it discovers without any prior knowledge.
In order to gain an understanding of the misclassifications, we looked specifically at the most probable hierarchical structure identified using pcaReduce (Fig. 3
c). Compared to the known cell line and tissue labels (see Additional file 1: Figure S1), the 11-cluster structure given by pcaReduce did not fully differentiate the 11 cell types. This is not unsurprising since the 11 cell types included a set of three maturing neural cell types (GW16, GW21 and GW21+3) that are highly related. Interestingly, pcaReduce grouped these cell types together, which is not an entirely inappropriate operation since the expression variation between the maturing neural cells maybe relatively low compared to unrelated cell types. There was also some class splitting, for example, two sub-groups of K562 cells were identified. Figure 3
a qualitatively indicates that this may make sense as some K562 cells were closer in overall expression to HL60 cells than other K562 cells.
At the 4-cluster level the assignments given by pcaReduce gave some interesting group structures. The ground truth tissue-level classification assumed the existence of blood, neural, dermal and stem cell types but pcaReduce identified that the CRL-2338 and CRL-2339 cell lines should form a group. This is interesting as CRL-2338 is a cell line derived from a primary stage IIA, grade 3 invasive ductal carcinoma and CRL-2339 is a B lymphoblastoid cell line initiated from peripheral blood lymphocytes from the same patient. Pluripotent stem cells (iPS) were also grouped by pcaReduce with neural progenitor cells (NPC) which is also reasonable if we consider this a stem cell-like group. Overall, whilst pcaReduce did not give a 4-cluster classification that was identical to the original tissue classifications [30], the output produced are not nonsensical. In comparison, the output of standard hierarchical clustering failed to separate the cells both at the cell line and tissue level into any obvious structure (see Additional file 1: Figure S1).
In order to fully assess the performance of pcaReduce, we compared it to a set of alternative approaches (see Fig. 4). This includes popular methods such as: K-means, hierarchical clustering (HC), Mclust – mixture modelling for model-based clustering [33] combined with tSNE – a nonlinear dimensionality reduction/visualisation technique [22]; and recent single cell methodology – SNN-Cliq, which determines similarities between cells based on a shared nearest neighbours algorithm and performs single cell clustering using a graph-theoretical approach [25], and SC3 which uses spectral transformations of a cell-to-cell distance matrix followed by k-means and consensus-based clustering [26]. Details regarding all parameters and running specifications for each clustering approach are summarised in Additional file 1.
Figure 4 shows the relative performance of the different approaches. The score of t-SNE/Mclust (Method 11) was used as a benchmark and the true number of cell lines was given as a known parameter. The t-SNE algorithm is a frequently used non-linear dimensionality reduction technique for single cell expression analysis and Mclust is a well known and popular clustering algorithm (based on Gaussian mixture models). The combination provides considerable considering flexibility. When we compared the classifications at K=11 of the various hierarchical clustering methods, including our own pcaReduce, we found that the standard hierarchical clustering approaches did less well than pcaReduce. For example, Method 3 uses the same initial K-means cluster initialisation as pcaReduce and cluster merging criterion but does not use different data representations by removing principal components after each merge. This subtle alteration in methodology appears to make a fundamental difference in performance. Method 6 uses hierarchical clustering applied to the same initial principal component projection of the data as pcaReduce but performance is again low and highly dependent on distance measure used for the clustering (see Additional file 1: Figure S4).
Whilst some methods had comparable performance to pcaReduce in terms of capturing the cell line level classifications, their performance diminished for the tissue-level ones. Here, the benchmark used was Method 11 when the number of clusters (4) was given as an input parameter. Despite this advantage it failed to identify any clustering structure even closely resembling the ground truth we are using. Similarly, Method 7 uses hierarchical clustering applied to a t-SNE projection of the data, this had a reasonable ARANDI score for the cell line level classification but when the clusters were merged into 4 groups these had no correspondence to the ground truth. SC3 was able to achieve comparable performance to pcaReduce for specific range of values for a parameter d – the number of eigenvectors retained following the spectral transformation of the cell-to-cell distance matrix. However, the range of d that gave greatest concordance was not the default setting (0.04N<d<0.07N), see Additional file 1: Figure S2. Overall, pcaReduce gave consistently provide clustering results that were closest to the ground truth (see ARANDI score in Fig. 4) for both the cell line and tissue classifications. Its performance suggests that the gradual use of successively reduced dimensional representations of the data helps to merge clusters together in a sensible way.
Note, there is stochasticity in the clustering structures produced by pcaReduce due to the random initialisation provided by the K-means algorithm and probabilistic merge steps. Interestingly, motivated by the use of consensus clustering in SC3, we applied an ensemble clustering method across the pcaReduce clustering structures (see Additional file 1 for details of the methods used), the consensus clustering structure achieves a high level of concordance with the cell line and tissue level classification. Finally further details regarding sensitivity in initial selection of q – the initial number of clusters used – is shown in Additional file 1: Figure S3).
Mouse neuronal cells
We next returned to the mouse neuronal cell dataset discussed in the Introduction that contains measurements across 25,334 genes [27]. The study classified cells according to four principle neuronal groups: non-peptidergic nociceptor cells (NP), peptidergic nociceptor cells (PEP), neurofilament containing cells (NF), and tyrosine hydroxylase containing cells (TH) (Fig. 5
a). In addition to this, it was suggested that the NP, PEP and NF cells possessed further subtypes (Fig. 1
b, c). We now examined whether pcaReduce could recover these three layers of clustering structures within its hierarchical output without the use of marker genes.
We applied pcaReduce and computed the correspondence between its 4-, 8- and 11-cluster structures and those in the original study. Figure 5
b shows that the absolute classification accuracy was relatively low compared to the previous cell line experiment. This is unsurprising as the four pre-dominant neural cell groups form a complex cluster pattern in the subspace spanned by PC2-4 (see Fig. 5
a) and would be hard to segregate in an entirely unsupervised way as we propose. This is especially evident from PCA plots summarised in Additional file 1: Figure S5, where we plot pairwise combinations of various principle components and highlight cells that should correspond to neuronal subtypes: NF3 and NF4 (lower-left) and NP2 and NP3 (upper-right).
The performance of pcaReduce generally outperformed the other approaches we tried across the three clustering structures (see Additional file 1: Figure S6) except for Method 11 (t-SNE + Mclust), which used the true number of clusters by default and acts as an artificial benchmark, and SC3 (Method 12). Interestingly, classification performance was again increased by applying consensus clustering across the pcaReduce sampled clustering structures. Although this had greater effect for the 4-cluster structures than the 8- and 11-clusters. In the case of SC3, the classification performance was sensitive to the choice of the number of eigenvectors (d) used in the spectral transformation step (see Additional file 1: Figure S7). Strong classification performance was obtained using the default range of d but we note that this range was chosen based on optimizing against a number of single cell datasets and their ground truth classifications including this mouse neuronal dataset [26]. Outside of this default range, performance varied and could be similar to the levels obtained by other methods (Additional file 1: Figure S7).
Using the most probable structure given by pcaReduce, we noted that the three groups of the four top level groups are predominantly dominated by NP, TH and NF cells respectively (Fig. 5
c) matching groups in the classification by [27]. Marker gene expression patterns (obtained from [27]) for these three groups also corresponded to those found in the original study confirming their identities (Fig. 6). The main source of discordance comes from 66 NP cells being assigned to the same group as 60 PEP cells giving a combined group of these cells (NP/PEP) not present in the original classification. When we examined the expression of the marker genes, we discovered that the expression of these genes was strikingly similar between the NP and PEP groups found by [27] with the only major difference being complete zero expression of Mrgprd in the PEP group whilst only some NP cells show zero expression for this gene. Therefore it is perhaps unsurprising that cells from these two groups were merged by pcaReduce. Interestingly, the PEP and NP cells correspond to sub-classes of nociceptors (peptidergic and nonpeptidergic respectively). This combined NP/PEP groups does subsequently become partitioned as the number of clusters was allowed to increase into subgroups dominated by NP and PEP cells respectively. Note that the use of t-SNE – a non-linear dimensionality reduction technique – did not well-separate the four groups either (Additional file 1: Figure S8), and it would not be obvious, without known markers, how to delineate each group.
At the 11-cluster level pcaReduce identified multiple subgroups of TH cells that had high Th gene expression but possessed different patterns of expression in other marker genes (Additional file 1: Figure S9). In contrast, the original study [27] only possessed one Th group. This difference alone drives much of the classification discordance between pcaReduce and the original classes but this discordance may not be an “error” but simply a different choice of clustering compatible with the same data. Further differences are driven by the combined NP-PEP cluster generated by pcaReduce that is then propagated down the hierarchical structure. Finally, the decomposition of the NF cluster by pcaReduce, splits the NF group into three subgroups with striking similarity to the NF1, NF2/3 and NF4/5 groups in [27].