GiniClust3: a fast and memory-efficient tool for rare cell type identification

Background: With the rapid development of single-cell RNA sequencing technology, it is possible to dissect cell-type composition at high resolution. A number of methods have been developed to identify rare cell types. However, existing methods do not scale to large datasets, which limits their utility. To overcome this limitation, we present a new software package, GiniClust3, an extension of GiniClust2 that is significantly faster and more memory-efficient than previous versions. Results: Using GiniClust3, it takes only about 7 h to identify both common and rare cell clusters in a dataset that contains more than one million cells. Cell type mapping and perturbation analyses show that GiniClust3 robustly identifies cell clusters. Conclusions: Taken together, these results suggest that GiniClust3 is a powerful tool for identifying both common and rare cell populations and can handle large datasets. GiniClust3 is implemented as an open-source Python package and is available at https://github.com/rdong08/GiniClust3.


Background
The rapid development of single-cell technologies has greatly enabled biologists to systematically characterize cellular heterogeneity (see reviews [1][2][3][4]). While many methods have been developed to identify cell types from single-cell transcriptomic data [5][6][7], most are designed to identify common cell types. As throughput increases, it is also of considerable interest to specifically identify rare cell types. Several methods have been developed for this purpose [8][9][10][11][12][13]; however, existing methods do not scale to very large datasets. Given that atlas-scale datasets may contain hundreds of thousands or even millions of cells [5][14][15][16], there is an urgent need to develop faster methods for rare cell type detection.
In previous work, we developed GiniClust to identify rare cell clusters, using a Gini index-based approach to select rare cell type-associated genes [11]. Recently, we extended the method to identify both common and rare cell clusters, using a cluster-aware, weighted ensemble clustering approach [12]. These methods have been used to analyze datasets containing up to 68,000 cells. Here we have further optimized the algorithm so that it can be used efficiently to analyze datasets containing over one million cells. Using a real single-cell RNA-seq dataset as an example, we show that this new extension, which we call GiniClust3, can efficiently and accurately identify both common and rare cell types.

Details of GiniClust3 pipeline
The overall strategy is similar to that of GiniClust2 [12]. The implementation of each step is optimized to improve computational and memory efficiency (Fig. 1a). Compared with GiniClust2, there are two major changes. First, we use the Leiden algorithm, which is suitable for large datasets, to replace DBSCAN in the clustering step. Second, we generate the consensus matrix at the cluster level of the Gini and Fano clustering results, instead of at the cell level. Both changes substantially increase computational efficiency. The details of the GiniClust3 pipeline are as follows.
Step 1: clustering cells using Gini index-based features
a. Gini index calculation and normalization. After data pre-processing, the Gini index for each gene is calculated as twice the area between the diagonal and the Lorenz curve, as described before [11]. Gini index values range from 0 to 1. The Gini index values are then normalized using a two-step LOESS regression procedure as described before. Genes with a Gini index value ≥ 0.6 and a p value < 0.0001 are labeled as high Gini genes and selected for further analysis.
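To make the definition above concrete, the following is a minimal sketch of the raw Gini index of one gene, computed as twice the area between the diagonal and the Lorenz curve using trapezoids. The function name `gini_index` and the toy expression vectors are illustrative assumptions, and the two-step LOESS normalization and p value calculation are not reproduced here.

```python
def gini_index(values):
    """Raw Gini index of non-negative expression values for one gene.

    Computed as twice the area between the diagonal and the Lorenz curve.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    cumulative = 0.0
    prev_share = 0.0
    area_under_lorenz = 0.0
    for x in xs:
        cumulative += x
        share = cumulative / total
        # Trapezoid between consecutive Lorenz points; each step has width 1/n.
        area_under_lorenz += (prev_share + share) / (2 * n)
        prev_share = share
    # The area under the diagonal is 0.5.
    return 2 * (0.5 - area_under_lorenz)


# Hypothetical genes measured in four cells.
ubiquitous_gene = [5, 5, 5, 5]   # evenly expressed
rare_marker = [0, 0, 0, 20]      # expressed in a single cell
print(gini_index(ubiquitous_gene), gini_index(rare_marker))
```

A gene expressed evenly across cells gives a value near 0, while a gene concentrated in very few cells gives a value approaching 1, which is why high Gini genes are informative for rare cell types.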
b. Cell cluster identification by the Leiden algorithm. In previous versions [11,12], DBSCAN was used for clustering. While DBSCAN is effective for identifying rare cell clusters, it is both time- and memory-consuming. In GiniClust3, we replace DBSCAN with the Leiden clustering algorithm [17], which is known for its improved computational efficiency. Alternatively, users can select the Louvain clustering algorithm [18] by setting "method = louvain". The neighbor size used for Gini index-based clustering of the mouse brain single-cell dataset is 15 (neighbors = 15). For smaller datasets, a lower neighbor size is recommended to identify rare clusters efficiently (default value = 5).
Step 2: clustering cells using Fano factor-based features
Highly variable genes are identified using Scanpy. These genes are used to identify common cell clusters by principal component analysis (PCA) followed by Leiden or Louvain clustering, using the default settings in Scanpy [7]. The neighbor size used for Fano factor-based clustering of the mouse brain single-cell dataset is 15 (neighbors = 15).
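As a reminder of the quantity behind this step, the Fano factor of a gene is its variance divided by its mean. The sketch below computes only this raw ratio. The function name `fano_factor` and the toy vectors are illustrative assumptions; Scanpy's own highly variable gene selection is more elaborate and is not reproduced here.

```python
from statistics import mean, pvariance


def fano_factor(values):
    """Variance-to-mean ratio of one gene's expression across cells."""
    m = mean(values)
    if m == 0:
        return 0.0
    return pvariance(values) / m


# Hypothetical genes measured in four cells.
stable_gene = [2, 2, 2, 2]
variable_gene = [0, 0, 0, 4]
print(fano_factor(stable_gene), fano_factor(variable_gene))
```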
Step 3: combining the clusters from steps 1 and 2 via a cluster-aware, weighted consensus clustering approach
The weighted consensus clustering method has been described before [12] and is used here with modifications. The connectivity of cells in the two clustering results ($P^G$ and $P^F$) is calculated. To improve computational efficiency, we keep one cell to represent all cells that share the same Gini and Fano cluster assignments. Thus, the computational cost depends on the number of Gini and Fano clusters rather than on the number of cells. We then calculate the consensus matrix based on these n representative cells from the different Gini and Fano clusters. If two cells are clustered in the same group, the connectivity is 1; otherwise, the connectivity is 0 (formula (a)). We set the cell-specific weights for the Fano factor-based clusters, $w^F$, as a constant, and the cell-specific weights for the Gini index-based clusters, $w^G$, as a logistic function of $x_i$ (formula (b)), where $x_i$ is the proportion of the GiniClust cluster for cell i, $\mu'$ is the rare cell type proportion at which GiniClust and Fano factor-based clustering methods have approximately the same ability to detect rare cell types, and $s'$ represents how quickly GiniClust loses its ability to detect rare cell types above $\mu'$.
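The cluster-level shortcut described above can be sketched as follows: cells with identical (Gini cluster, Fano cluster) assignments are collapsed to one representative, and the 0/1 connectivity matrices are built on the representatives only. The function names `representatives` and `connectivity` and the toy labels are illustrative assumptions, not the package's API.

```python
def representatives(gini_labels, fano_labels):
    """Collapse cells sharing the same (Gini, Fano) label pair to one representative.

    Returns the list of unique label pairs and, for each cell, the index of
    its representative in that list.
    """
    pair_to_rep = {}
    cell_to_rep = []
    for pair in zip(gini_labels, fano_labels):
        if pair not in pair_to_rep:
            pair_to_rep[pair] = len(pair_to_rep)
        cell_to_rep.append(pair_to_rep[pair])
    # Dict insertion order matches the representative indices.
    return list(pair_to_rep), cell_to_rep


def connectivity(labels):
    """Connectivity matrix: 1 if two items share a cluster label, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


# Toy example with five cells.
gini = ["g0", "g0", "g1", "g0", "g1"]
fano = ["f0", "f1", "f0", "f0", "f0"]
reps, cell_to_rep = representatives(gini, fano)
m_gini = connectivity([g for g, _ in reps])
m_fano = connectivity([f for _, f in reps])
print(len(reps), cell_to_rep)
```

With this compression, the matrices have one row per distinct label pair, so their size is bounded by the product of the Gini and Fano cluster counts regardless of how many cells are clustered.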
The cell pair-specific weights are first defined as in formula (c). Then, after normalization of $w^F$ and $w^G$ (formula (d)), the consensus value is calculated from the weights ($w^G_{ij}$ and $w^F_{ij}$) and the connectivities ($M_{ij}(P^G)$ and $M_{ij}(P^F)$) (formula (e)).
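The weighting scheme can be illustrated with a short sketch. Because formulas (b) to (e) are not reproduced in this text, the functional forms below are assumptions modeled on the earlier GiniClust2 description [12]: a logistic Gini weight that decreases with cluster proportion, pair weights taken as the maximum of the two cell weights, and a consensus value equal to the normalized weighted sum of the two connectivities. The parameter values (`mu=0.1`, `s=0.01`, `fano_weight=0.1`) and all function names are placeholders chosen for illustration.

```python
import math


def gini_weight(x, mu=0.1, s=0.01):
    """Assumed logistic weight: close to 1 for small clusters, close to 0 well above mu."""
    return 1.0 - 1.0 / (1.0 + math.exp(-(x - mu) / s))


def consensus(gini_labels, fano_labels, gini_props, fano_weight=0.1, mu=0.1, s=0.01):
    """Weighted consensus matrix over representative cells (assumed formulation)."""
    n = len(gini_labels)
    w_g = [gini_weight(x, mu, s) for x in gini_props]
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumed pair weight: the larger of the two cell-specific weights.
            wg = max(w_g[i], w_g[j])
            wf = fano_weight  # constant, so the pair weight equals the constant
            m_g = 1.0 if gini_labels[i] == gini_labels[j] else 0.0
            m_f = 1.0 if fano_labels[i] == fano_labels[j] else 0.0
            # Normalizing by (wg + wf) makes the two weights sum to 1.
            matrix[i][j] = (wg * m_g + wf * m_f) / (wg + wf)
    return matrix


# Three representatives: one from a tiny Gini cluster, two from a large one.
gini_labels = ["rare", "big", "big"]
fano_labels = ["a", "a", "b"]
gini_props = [0.001, 0.5, 0.5]  # proportion of each representative's Gini cluster
M = consensus(gini_labels, fano_labels, gini_props)
```

Under these assumptions, the Gini result dominates for pairs involving a small Gini cluster (the rare representative stays separated from the large cluster even though the Fano result groups them), while the Fano result dominates for pairs drawn from large Gini clusters.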
k-means clustering is applied to the consensus matrix $M_{ij}$, and the results are then converted back to single-cell level clustering. Finally, clusters containing < 1% of the cells are considered rare clusters.
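The last two operations, mapping cluster assignments of representatives back to individual cells and flagging clusters below the 1% threshold, can be sketched as follows. The helper names `expand_to_cells` and `rare_clusters` and the toy data are illustrative assumptions; the k-means step itself is not shown.

```python
from collections import Counter


def expand_to_cells(rep_cluster, cell_to_rep):
    """Assign each cell the final cluster of its representative."""
    return [rep_cluster[r] for r in cell_to_rep]


def rare_clusters(cell_labels, threshold=0.01):
    """Return clusters whose share of all cells is below the threshold."""
    n = len(cell_labels)
    counts = Counter(cell_labels)
    return {c: k / n for c, k in counts.items() if k / n < threshold}


# Toy example: 1000 cells, two representatives with final cluster labels.
rep_cluster = ["common", "rare"]
cell_to_rep = [0] * 995 + [1] * 5
labels = expand_to_cells(rep_cluster, cell_to_rep)
flagged = rare_clusters(labels)
print(sorted(flagged))
```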
Data source and pre-processing of the data
A mouse brain single-cell RNA-seq dataset was downloaded from the 10X Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons). This dataset contains 1.3 million cells obtained from the cortex, hippocampus and ventricular zones of E18 mice. Raw data were pre-processed using Scrublet [19] (version 0.2.1) with default settings to remove doublets. The resulting data were further filtered to remove genes expressed in fewer than ten cells and cells expressing fewer than 500 genes. A total of 1,244,774 cells and 21,493 genes passed this filter and were retained for further analysis. Raw UMI counts were normalized with Scanpy [7] using the following parameter setting: sc.pp.normalize_per_cell(counts_per_cell_after=1e4).
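The filtering and normalization logic can be sketched on a small dense matrix without any dependencies. This is an illustration of the thresholds only, not the Scanpy implementation: the function name `filter_and_normalize` is hypothetical, and it assumes that both filters are evaluated on the unfiltered matrix, since the text does not state the order in which they were applied.

```python
def filter_and_normalize(counts, min_cells=10, min_genes=500, target_sum=1e4):
    """Filter a cells-by-genes count matrix and scale each cell to target_sum.

    Assumption: gene and cell filters are both computed on the raw matrix.
    Returns (normalized rows, kept cell indices, kept gene indices).
    """
    n_genes = len(counts[0]) if counts else 0
    # Keep genes detected (count > 0) in at least min_cells cells.
    kept_genes = [
        g for g in range(n_genes)
        if sum(1 for row in counts if row[g] > 0) >= min_cells
    ]
    kept_cells = []
    normalized = []
    for i, row in enumerate(counts):
        # Keep cells with at least min_genes detected genes.
        if sum(1 for v in row if v > 0) < min_genes:
            continue
        sub = [row[g] for g in kept_genes]
        total = sum(sub)
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([v * scale for v in sub])
        kept_cells.append(i)
    return normalized, kept_cells, kept_genes


# Toy matrix (3 cells x 3 genes) with thresholds lowered to fit its size.
toy = [[1, 0, 3], [0, 0, 0], [2, 0, 2]]
norm, cells, genes = filter_and_normalize(toy, min_cells=2, min_genes=1)
print(cells, genes, norm)
```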

Results
Compared with GiniClust2, we made two major modifications to optimize performance. First, the time- and memory-consuming clustering method is replaced with a method suitable for large-scale datasets. Second, we speed up GiniClust3 by generating the consensus matrix at the cluster level rather than at the cell level. Both modifications substantially increase the speed and reduce the memory consumption of GiniClust3.
To test the utility of GiniClust3, we applied the method to a public single-cell RNA-seq dataset containing 1.3 million single cells obtained from three regions of the mouse brain (see Implementation for details). After filtering out lowly expressed genes and poor-quality cells (such as those likely to be doublets), a 1,244,774 cell-by-21,494 gene count matrix was left for further analysis. We next sought to characterize the identities of the cell populations using GiniClust3. A total of 16 common and 17 rare cell clusters (cell population < 1%) were identified (Fig. 1b, S1a), with the smallest cluster containing only 21 cells (cell population = 0.002%) (Fig. 1c and Table S1). Identifying both common and rare cell clusters took about 7 h and 103 GB of memory on a Xeon E5-2683 server with 56 threads and 640 GB of memory, indicating that GiniClust3 is suitable for analyzing very large datasets.
We then systematically evaluated time and memory consumption at different scales by randomly subsampling the 1.3 million-cell mouse brain scRNA-seq dataset to sizes ranging from 5 K to 1 M cells. Time and memory consumption scale almost linearly with cell number, as the regression slope is close to 1 in both cases (Fig. S1b, slope = 1.08 for running time; Fig. S1c, slope = 0.92 for memory usage). To evaluate the robustness of GiniClust3, we repeated the analysis using randomly subsampled data. To this end, 50% of the cells were randomly selected from the common clusters (≥ 1%). Since our main focus was to identify rare cell clusters, all cells assigned to the rare clusters (< 1%) identified above were retained. After repeating this subsampling procedure 10 times and applying GiniClust3 to the subsampled datasets, we found that most of the clusters in the subsampled datasets are consistent with the original ones; the median Normalized Mutual Information (NMI) is 0.81 (Fig. S1d). Taken together, these analyses show that GiniClust3 is a sensitive, accurate and efficient clustering method that can be used in many applications.
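The robustness test can be sketched in two parts: a subsampling step that keeps every cell from rare clusters and a random half of the cells from common clusters, and an NMI score between two labelings of the same cells. The function names, the fixed seed, and the toy labels are illustrative assumptions. NMI has several normalization variants; the sketch assumes normalization by the arithmetic mean of the two entropies, which may differ from the variant used for the reported value.

```python
import math
import random
from collections import Counter


def subsample(labels, rare_threshold=0.01, fraction=0.5, seed=0):
    """Keep all cells of rare clusters and a random fraction of the remaining cells."""
    n = len(labels)
    sizes = Counter(labels)
    rare = [i for i, lab in enumerate(labels) if sizes[lab] / n < rare_threshold]
    common = [i for i, lab in enumerate(labels) if sizes[lab] / n >= rare_threshold]
    rng = random.Random(seed)
    chosen = rng.sample(common, int(len(common) * fraction))
    return sorted(rare + chosen)


def nmi(a, b):
    """Normalized mutual information (arithmetic-mean normalization, natural log)."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))

    def entropy(counter):
        return -sum((k / n) * math.log(k / n) for k in counter.values())

    mi = sum(
        (k / n) * math.log(k * n / (ca[x] * cb[y])) for (x, y), k in cab.items()
    )
    h = (entropy(ca) + entropy(cb)) / 2
    return mi / h if h > 0 else 1.0


# Toy labeling: one common cluster (995 cells) and one rare cluster (5 cells).
original = ["big"] * 995 + ["tiny"] * 5
kept = subsample(original)
print(len(kept), nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```

In practice, the labels obtained on each subsample would be compared with the original labels restricted to the retained cells, and the NMI values summarized across the repeated runs.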

Conclusions
With technological development and protocol improvements, the scale of single-cell RNA-seq experiments is increasing exponentially [23], providing a great opportunity to identify previously unrecognized rare cell types. We have shown that GiniClust3 is an accurate and highly scalable method for detecting rare cell types in large single-cell RNA-seq datasets. GiniClust3 can identify both common and rare cell populations and can efficiently handle large datasets containing more than one million cells. This property is important for comprehensively identifying cell types in large datasets and may be particularly useful for atlas datasets in the future.