pcaReduce: hierarchical clustering of single cell transcriptional profiles

Background Advances in single cell genomics provide a way of routinely generating transcriptomics data at the single cell level. A frequent requirement of single cell expression analysis is the identification of novel patterns of heterogeneity across single cells that might explain complex cellular states or tissue composition. To date, classical statistical analysis tools have being routinely applied, but there is considerable scope for the development of novel statistical approaches that are better adapted to the challenges of inferring cellular hierarchies. Results We have developed a novel agglomerative clustering method that we call pcaReduce to generate a cell state hierarchy where each cluster branch is associated with a principal component of variation that can be used to differentiate two cell states. Using two real single cell datasets, we compared our approach to other commonly used statistical techniques, such as K-means and hierarchical clustering. We found that pcaReduce was able to give more consistent clustering structures when compared to broad and detailed cell type labels. Conclusions Our novel integration of principal components analysis and hierarchical clustering establishes a connection between the representation of the expression data and the number of cell types that can be discovered. In doing so we found that pcaReduce performs better than either technique in isolation in terms of characterising putative cell states. Our methodology is complimentary to other single cell clustering techniques and adds to a growing palette of single cell bioinformatics tools for profiling heterogeneous cell populations. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0984-y) contains supplementary material, which is available to authorized users.


Additional File
Note on low level single cell data processing and gene expression matrix preparation Cells from disparate tissues.
We have used R package called Rsubread [4] to assign mapped sequencing reads to genomic features, i.e. to perform transcript counting, which was achieved using function featureCounts.
In order to construct a gene expression matrix for higher level analysis, we have performed basic cell and gene filtering: (a) From 347 samples we have focused on a subset of 301 cells (a subset without ERCC validation cells and bulk sample), which were used in the main study by Pollen et al. [5]. In addition, we have removed one cell that had 0 expression levels across all genes, this left us with a 300 cells in total.
(b) For gene filtering we have used R package called edgeR [6], we kept those genes that fulfilled at least 100 counts per million (cpm) in at least 10 samples; this left us with 8686 genes in total. Lastly, we have transformed gene expression counts to a logarithmic values; more precisely, values x ij in matrix X were obtained by log 2 (x 0 ij + 1), where x 0 ij are read counts of a gene.

Mouse neuronal cells.
We have downloaded readily pre-processed dataset from the website http:// linnarssonlab.org/drg/. For the main study, we focused on 622 cells that were classified as neurones Figure S1: Comparison of pcaReduce and hierarchical clustering results on the disparate tissues data. (A) Hierarchical relationships between cells based on known cellular labels. (B) Hierarchical relationships between cells based on cellular labels identified by single run of pcaReduce under the most probable merging criterion. Also ARANDI scores between true clusterings, K = 4 and K = 11, and identified clusterings by pcaReduce are summarised in grey rectangles. (C) Hierarchical relationships between cells based on cellular labels identified by a standard hierarchical clustering (HC) algorithm; ARANDI scores between true clusterings, K = 4 and K = 11, and identified clusterings by HC are summarised in grey rectangles. CELL LINE EXAMPLE Figure S2: Variability of sc3 clustering structures Plots illustrate SC3 [3] performance for different parameter ranges d given in the table (inset) for the tissue (ks = 4) and cell line (ks = 11) level classifications. Orange points show the ARANDI scores between the true clustering structure and those identified by sc3. Blue points show the ARANDI scores computed between the pcaReduce ensemble clusters and those of SC3. In red we highlight the scores obtained using the default d range used in SC3.
All other Figure S5: Separability of mouse neuronal cell types in principal component space. Scatterplots illustrate the first 15 principle directions. In the left triangle NF3 and NF4 cells are highlighted in blue and dark blue points. In the right triangle NP2 and NP3 cells are highlighted in green and dark green points.   Figure S6: Comparison of classification performance on mouse neuronal cells. Boxplots 1-4 are based on 100 runs of each method, blue and green circles correspond to the outcome of consensus clustering of pcaReduce output across 100 runs with sampling and max merging settings respectively; methods 5-8 are evaluated based on all possible distance metrics; 9,11 based on all possible covariance structures, method 11 based on various parameter d ranges. (A) ARANDI score between true clustering, K = 4, and clusterings identified by Methods 1-11. (B) ARANDI score between the true clustering, K = 8, and clusterings identified by Methods 1-11. (C) ARANDI score between true clustering, K = 11, and clusterings identified by Methods 1-11.   Figure S9: Marker gene expression levels across 11 neuronal clusters identified using the pcaReduce algorithm.