Boosting scRNA-seq data clustering by cluster-aware feature weighting

Background The rapid development of single-cell RNA sequencing (scRNA-seq) enables the exploration of cell heterogeneity, which is usually done by scRNA-seq data clustering. The essence of scRNA-seq data clustering is to group cells by measuring the similarities among genes/transcripts of cells. And the selection of features for cell similarity evaluation is of great importance, which will significantly impact clustering effectiveness and efficiency. Results In this paper, we propose a novel method called CaFew to select genes based on cluster-aware feature weighting. By optimizing the clustering objective function, CaFew obtains a feature weight matrix, which is further used for feature selection. The genes have large weights in at least one cluster or the genes whose weights vary greatly in different clusters are selected. Experiments on 8 real scRNA-seq datasets show that CaFew can obviously improve the clustering performance of existing scRNA-seq data clustering methods. Particularly, the combination of CaFew with SC3 achieves the state-of-art performance. Furthermore, CaFew also benefits the visualization of scRNA-seq data. Conclusion CaFew is an effective scRNA-seq data clustering method due to its gene selection mechanism based on cluster-aware feature weighting, and it is a useful tool for scRNA-seq data analysis.

a set of cells into a certain number of homogeneous groups, each of which is referred to as a cluster. Clustering has been applied to solving many biological problems such as detecting new cell types [8], cell lineage tracking [9][10][11], studying pathogenic mechanism [12], exploring drug responses and even disease diagnosis [13,14].
Up to now, a number of clustering methods have been proposed specifically for scRNA-seq data, which exploit the unique characteristics of scRNA-seq data, including high sparsity, dimensionality and noise that seriously challenge the conventional clustering algorithms [15][16][17]. Recently, Li, Guan and Zhou conducted a comprehensive survey and comparison study on scRNA-seq data clustering algorithms [18], where existing scRNA-seq data clustering methods are classified into six types: distance-based, densitybased, graph-based, matrix-based, model-based and deep learning-based.
Distance-based methods use distance-based clustering algorithms, such as k-means and hierarchical clustering. The majority of scRNA-seq clustering methods in the literature such as SC3 [19], SINCERA [20], CIDR [21], RaceID [22] and pcaReduce [23] fall into this category. Out of which, SC3 is a consensus clustering method that applies three kinds of similarity measurements and two kinds of feature transformation techniques to integrate clustering results [19]. SINCERA is a package for scRNA-seq data analysis, where some pre-process steps such as normalization and quality control are performed before clustering, and hierarchical clustering is applied to the similarity matrix generated by centered Pearson's correlation and average linkage [20]. CIDR performs hierarchical clustering on a few top principal coordinates obtained by principal coordinate analysis (PCoA) over the Euclidean distance matrix of samples [21]. pcaReduce is an agglomerative clustering approach that integrates principal components analysis and hierarchical clustering [23].
Density-based methods employ density-based clustering mechanisms such as the DBSCAN algorithm [24] and its variants [25]. For example, Jiang et al. [26] proposed a method named giniClust to identify rare cell types in scRNA-seq data. They applied the DBSCAN algorithm after selecting significantly different genes based on the Gini index. Graph-based methods first transform the data to a graph, over which a graph clustering algorithm is applied. Two examples of this type are SNNCliq [27] and Seurat [28], both discover sub-graphs on the Shared Nearest Neighbors (SNN) graph. The probability model-based methods cluster data based on a certain probability distribution or process. For example, DIMM-SC [29] is specifically proposed for processing dropletbased scRNA-seq data, based on the Dirichlet Mixture Model. Prabhakaran et al. [30] proposed another probability model-based method to correct technical variations based on the Dirichlet Process. The matrix-based methods first derive a matrix from the original scRNA-seq data, and then perform matrix splitting or decomposition to cluster data. Representative algorithms include the method based on nonnegative matrix factorization (NMF) [31] and the BackSPIN [32] based on sorting points into neighborhoods (SPIN).
With the growth of scRNA-seq data, deep learning (DL) is applicable to scRNA-seq data clustering. For example, Eraslan et al. [4] proposed a deep count auto-encoder network (DCA) to de-noise scRNA-seq data and then cluster the data with the features extracted by a multi-layer neural network. Lopez et al. [33] introduced the scVI method to derive probabilistic representations of scRNA-seq data from deep generative model of variational auto-encoder. DL-based scRNA-seq data clustering methods try to learn representations of scRNA-seq data through deep neural networks. In essence, they are owned by a kind of nonlinear feature transformation techniques.
Actually, most existing methods exploit the similarity between cells to do scRNA-seq data clustering, while similarity evaluation relies on the selection of genes. Consequently, how many and which genes are used for clustering is particularly important. Currently, there are mainly two types of gene-filtering strategies: (1) threshold-based approaches that select genes whose expression values satisfy a certain threshold, and (2) variation index-based approaches that measure the variation of gene expression values in different cells and then select genes with large variations.
For instance, SC3 uses a threshold-based approach to filter genes before clustering. Specifically, it removes genes/transcripts that are either expressed (expression value > 2) in less than X% of cells (rare genes/transcripts) or expressed (expression value > 0) in at least (100 − X) % of cells (ubiquitous genes/transcripts), where the default value of X is 6 [19]. Some additional methods like NMF-based clustering methods and RaceID also remove genes of low expressions [22,31]. As for the second type, the GiniClust algorithm selects high Gini-index genes by fitting the relationship between the Gini index and max gene expression level with LOESS regression [26]. BackSPIN selects the top 5000 genes as informative features based on coefficient of variation (CV), defined as the standard deviation divided by the mean [32]. Generally, threshold-based approaches can be regarded as a kind of simplified denoising methods, while variation index-based methods select differentially expressed genes. However, neither of them takes the goal of clustering into account.
In this paper, we propose a clustering-aware feature weighting method CaFew for scRNA-seq data clustering. First, by resolving the optimization problem of clustering, a weight matrix indicating the importance of features in different clusters is derived. Then, we select genes based on the weight matrix. Concretely, we select those genes with a relatively large weight in at least one cluster or a large weight variation across different clusters. Experiments over several benchmark datasets show that CaFew can effectively boost clustering performance. What is more, by combining CaFew with SC3 (denoted as "CaFew+SC3"), we achieve the state of the art performance. Finally, CaFew is also helpful for scRNA-seq data visualization.

Results
In this section, we evaluate CaFew in clustering scRNA-seq data. First, we introduce 8 publicly available scRNA-seq datasets and clustering evaluation metric. Then, we present the experimental results of CaFew on these datasets, including selected features, clustering accuracy and visualization.

Datasets and performance metric
We collect 8 publicly available scRNA-seq datasets with ground-truth cell type information. Table 1 presents the statistical information of these datasets, including the number of cells, clusters and genes and their sequencing protocols. We can note that these datasets range in size from dozens to thousands, with more than 15,000 genes/ transcripts. The number of cell types varies from 3 to 16. Units of gene/transcript levels include FPKM (Fragments Per Kilobase of exon model per Million mapped reads), CPM (Counts of exon model per Million mapped reads) and UMI (Unique Molecule Identifier). Specifically, UMI uses a direct measurement of transcript copies for each transcript [34], while the other two metrics normalize the raw read counts based on sequencing depth and gene length. In addition, these scRNA-seq data were generated from some representative sequencing platforms, such as Smart-Seq2 [35], Microwell-seq [36] and 10X [37] etc.
In our experiments, we use Adjusted Rand Index (ARI) to measure the clustering performance. Given the ground truth class assignments labels_true and the predicted class assignments labels_predict , ARI measures the similarity of these two assignments [38]. Concretely, the overlapping between two assignments can be summarized as a contingency table, which reports the intersection cardinality of each true-predicted cluster pair. ARI is calculated as follows: where m is the number of cells totally in the dataset, t ij is the value at the ith-row and the jth-column in the contingency table, a i is the sum of the ith-row of the contingency table, b j is the sum of the jth-column of the contingency table, and () denotes a binomial coefficient. ARI ranges from − 1 to 1, where a negative value means mismatch and '1' indicates a perfect match.

Feature selection results
As there are two screening steps in the CaFew algorithm, we present the number of genes remained after the first and second screening steps respectively as "#Genes-S1" and "#Genes-S2" in Table 2, where we also present the ratio of selected genes over the total genes.
We can notice that more than half of the features are removed after the first filtering step. For example, there are more than 20000 genes in GSE59892. Only 9894 features are retained after the first filtering step, accounting for 38.4% of the total genes. After the secondary screening step, only a few hundred features remain,  [37] which is less than 5% of the total genes. In conclusion, CaFew can select much fewer genes that are conductive to clustering.

Effect of gene selection on clustering
Here, to check whether the selected genes are more effective in exposing the cluster structures in datasets, we apply the Davies-Bouldin index (DBI) to the 8 datasets, which is defined as follows [39]: where d i is the average distance between each sample and the centroid of cluster i, d ij is the distance between the centroids of clusters i and j, and k is the number of clusters. Obviously, the smaller DBI is, the more compact the clusters. Table 3 presents the DBI values before and after gene selection on the 8 datasets. We can see that after using CaFew for feature selection, the DBI for all the datasets is significantly lower than that when all features are used. This result indicates that after selecting features based on feature weighting, the cluster structure is clearer, which shows that our feature selection is helpful to cluster.

Clustering performance
To demonstrate the effectiveness of CaFew, we compare the clustering performance before and after it is applied to different clustering methods, including traditional clustering algorithms and several high-performance clustering algorithms proposed specifically for scRNA-seq data.

Results of traditional clustering algorithms
We consider five traditional clustering algorithms, including k-means [40], PAM [41], DBSCAN [24], Hierarchical Clustering [42] and Gaussian mixture models [43]. After feature selection, these algorithms are applied to clustering the 8 datasets. Performance results are shown in Fig. 1. As shown in Fig. 1, most methods can get improved accuracy on some scRNA-seq datasets after using CaFew to select genes. Concretely, take dataset GSE59892 for example, four algorithms (k-means, PAM, Hierarchical Clustering and GMM) maintain their clustering accuracy after feature selection, while DBSCAN is improved. Especially, all methods' clustering accuracy is significantly improved on two datasets: GSE36552 and E-MTAB-2600. The reason is that CaFew is able to remove some noise genes. However, some methods get degraded clustering accuracy on some datasets after feature selection. This is because that some datasets own relatively complex cluster structures, and the traditional clustering algorithms cannot capture these structures when only hundreds of features are used.

Results of clustering methods specifically for scRNA-seq data
Based on the previous comparative studies on clustering algorithms for scRNA-seq data [44,45], we choose several representative methods to test the performance of CaFew. Since SC3 and Seurat achieve the state of the art clustering performance, we mainly test the effect of CaFew on them, and we call them "CaFew+SC3" and "CaFew +Seurat" after using CaFew for gene selection. The results are presented in Fig. 2, where ARI values of different methods are represented by different color bars.  Fig. 2, we can see that "CaFew+SC3" achieves better performance than other methods on most of the datasets. Its average ARI on the 8 datasets is 0.74, higher than that of the other methods. Concretely, SC3 gets an ARI of 0.94 on dataset GSE59892, while "CaFew+SC3" improves the ARI to 1. For the dataset GSE36552, "CaFew+SC3" performs best with an ARI of 0.89, which is higher than the AIR (0.64) of SC3 and the other six methods. Moreover, the dataset with the largest improvement on clustering accuracy is E-MTAB-2600, and the AIR is improved from 0.15 to 0.72. In general, except for GSE60361, the clustering accuracy of "CaFew+SC3" on the other datasets is better than that of SC3.
As for "CaFew +Seurat", its clustering accuracy on three scRNA-seq datasets is the same as that of Seurat, gets higher ARI values on four datasets and a lower ARI on only one dataset. We notice that the improvement on ARI for Seurat is not as significant as that for SC3, this is because that CaFew adopts a distance based clustering mechanism like k-means, so it is more beneficial to distance-based clustering methods such as SC3.

Visualization
Here we investigate whether CaFew can help with visualization. In our experiments, we adopt t-SNE [46] and UMAP [47] for scRNA-seq data visualization. Figure 3 displays the visualization results of four datasets before and after feature selection. Here, points of similar color belong to the same cluster.  Fig. 3, we can see that after selecting genes by CaFew, both t-SNE and UMAP can separate different clusters more apart, while making cells from the same cluster closer. This shows that CaFew is beneficial to clustering.
Take the dataset GSE59892 for example, there are 49 samples in 3 clusters with 25,737 features. When all features are utilized, some green and pink points mix with the orange points. After selecting genes by CaFew, the green, pink and orange points are clearly separated.

Discussion
It is well known that the clustering performance is heavily impacted by the characteristics of input samples. For example, DBSCAN can find clusters of any shape, while k-means assumes that clusters are convex shaped. The distribution of samples in different classes may also impact the clustering performance. If the sample distribution is extremely uneven, it is hard for clustering algorithms to find very small clusters such as rare cell types and tumor cells etc. Since CaFew adopts similar idea of k-means in the process of weight matrix calculation, the clustering performance on non-convex/uneven distribution data cannot get so significantly improved as on convex/even distribution data. With CaFew, the clustering performance of distance-based methods like k-means and SC3 can be considerably improved, but its effectiveness is not so obvious on the other types of methods like Seurat. Additionally, the pre-defined cluster number also affects clustering result. The number of clusters can be determined with prior knowledge or can be estimated by some specific computational approaches. CaFew does not address this issue, but directly uses the exact number of clusters in optimization.
For future work, on the one hand, we will explore alternative clustering optimization mechanism that is not restrict to distance-based clustering, and develop specific methods to determine the number of clusters in the framework of CaFew. On the other hand, we will try to integrate auxiliary biological information into CaFew, and extend this study to the field of differentially expressed gene analysis, especially the study of diseasespecific genes.

Conclusion
In this paper, we propose a novel algorithm CaFew to select features for scRNA-seq data clustering based on cluster-aware feature weighting. By solving the clustering optimization problem, CaFew first obtains the weight matrix W of features with regard to different clusters. Then, it filters out genes with small weight in all clusters or a small weight variation across all clusters. Extensive experiments on 8 real datasets show that selecting features with CaFew can boost clustering performance and the combination of CaFew and SC3 achieves the state of the art performance. Li et al. BMC Bioinformatics (2021) 22:130

Methods
In this section, we describe the CaFew method in detail. Figure 4 illustrates the pipeline of CaFew, which consists of three major steps: (1) Removing uninformative and redundant genes; (2) Deriving the feature weight matrix by solving the clustering optimization problem; And (3) selecting genes based on the weight matrix.

Data preprocessing
In general, scRNA-seq data is extremely sparse, with some genes of zero expression in a large number of cells. The generation of these zero values is due to that these genes are not expressed in these cells, or their genetic products are not detected (also called "dropout events" [17]). Therefore, raw scRNA-seq data are highly uncertain and noisy, which seriously impacts the downstream computational analysis. In the data preprocessing step, we directly delete genes that are expressed only in a small number of cells (less than 2%) [19].
On the other hand, the expression patterns of some genes are very close. If all of them are used for clustering, it only incurs a large amount of calculation cost, but contributes little to clustering. Therefore, we remove the redundant genes. By calculating the Pearson Correlation Coefficient (PCC) between genes, only one gene is conserved as the representative for genes with PCC greater than 0.99.

Cluster-aware feature weighting
As the importance of each feature (gene) is different across different clusters, we assign different weight to each feature over different clusters, and formulate the clustering objective function as follows [48][49][50]: where K means the number of clusters, n is the number of features (genes), χ = K k=1 χ k is the union of K clusters and χ k indicates the kth cluster. W is a K * n weight matrix and w kj indicates the weight of gene j in cluster k. A large w kj indicates that feature j is important to cluster k. d j ki means the distance between sample i and the center of cluster k on feature j. Equation (3) consists of two parts: the first part indicates the sum of distance between samples within each cluster under feature weighting, the second part is the sum of squares of weights, and δ k is a parameter to balance the two parts. By minimizing the first part, we can get compact clusters, and by minimizing the second part, we attempt to select as a small number of features as possible for each cluster.
To  sub-problems. And by setting the derivatives of J k with respect to w kj and k to zero, we can derive where the first part 1 n is the initial weight (i.e., all features are treated equally), the second part reflects the intra-cluster distance difference between the average over all features and the feature j. A positive value of the second part means feature j can make the intraclusters distance smaller and its weight will be become larger than 1 n . Conversely, a negative value of that part will reduce the weight of feature j in cluster k. δ k is evaluated in an iterative way as follows: where C δ is a constant, the superscripts (t) and (t − 1) indicate the current iteration t and the previous iteration (t − 1) , respectively. Denote the weighted distance between sample i and cluster center k as D ik , we have Each data point (cell) is assigned to the nearest cluster, that is, After the assignment of samples, the cluster centers are updated as follows: where x mj is the median value of feature j in cluster k. By using the median of samples (instead of the mean) to update the cluster center, clustering will be more robust to outliers [51]. Simply, we use Manhattan distance to evaluate d j ki , that is,

Gene selection based on feature weights
Different from previous feature selection methods, CaFew selects features from the perspective of clusters, instead of cells. On the one hand, feature weight reflects the importance of a feature to a cluster, so the features with large weights are more informative in clustering than those of small weight. On the other hand, a feature whose weight varies greatly among the clusters is usually a "marker" gene that is more conducive to distinguish cells. Based on these two observations, we propose the following strategies to select features.

Weight based screening
To remove features with small weight, we first calculate the maximum value of each feature in the weight matrix W. Then we divide these features into N groups according to their maximum values, where N=3.322 * log 10 (n) − 1 , according to the Empirical Sturges' formula [52]. After arranging the groups of genes in ascending order of their maximum values, we remove the first group of genes, and iteratively use Sturges' formula to group the remaining features until the number of features is less than 10000. One advantage of this method is that the features of the same interval can be kept as far as possible, instead of some important features being omitted due to the "violent cutting" like simply setting a threshold.

Weight-deviation based screening
To further select the "marker" genes for clustering, we measure the variation of feature weights across clusters by CV (defined as the ratio of standard deviation over mean).
Since there is strong correlation between mean and CV, we build a linear model to fit CV by mean: log(CV 2 ) = a * log10(mean) + b to choose the most significant features of variation. For each feature, we calculate the residual value d, which is defined as the difference between the true CV and the fitted value. Then, the residual value is normalized as z-score: (d − d)/δ , where d and δ are the mean and the standard deviation of d.
Finally, z-scores are converted into p-values with the assumption that all z-scores follow normal distribution. In our experiments, we select the features (genes) whose p ≤ 0.05.

The CaFew algorithm
The pseudo-code of CaFew is outlined in Algorithm 1. Lines 1-3 are for data preprocessing. Line 4 initializes the variables; Lines 5-12 are for deriving the weight matrix of genes; Lines 13-15 are for weight-based gene screening, which removes the genes of small weight; Lines 16-19 are for weight-deviation based screening, which filters genes with small weight variations across clusters.