Identifying cell types from single-cell data based on similarities and dissimilarities between cells

Background With the development of single-cell sequencing technology, revealing homogeneity and heterogeneity between cells has become a new area of computational systems biology research. However, the clustering of cell types becomes more complex with the mutual penetration between different types of cells and the instability of gene expression. One way of overcoming this problem is to group similar, related single cells together by means of clustering analysis methods. Although some methods such as spectral clustering can do well in the identification of cell types, they only consider the similarities between cells and ignore the influence of dissimilarities on clustering results. This may limit the performance of most conventional clustering algorithms in identifying clusters, so special methods need to be developed for high-dimensional sparse categorical data. Results Inspired by the observation that cells of the same type have similar gene expression patterns whereas cells of different types show dissimilar gene expression patterns, we improve the existing spectral clustering method so that it clusters single-cell data based on both similarities and dissimilarities between cells. The method first measures the similarity/dissimilarity among cells, then constructs the incidence matrix by fusing the similarity matrix with the dissimilarity matrix, and finally uses the eigenvectors of the Laplacian matrix derived from the incidence matrix to perform dimensionality reduction and employs the K-means algorithm in the low-dimensional space to achieve clustering. The proposed improved spectral clustering method is compared with the conventional spectral clustering method in recognizing cell types on several real single-cell RNA-seq datasets.
Conclusions In summary, we show that adding intercellular dissimilarity can effectively improve accuracy and robustness, and that the improved spectral clustering method outperforms the traditional spectral clustering method in grouping cells.


Background
In recent years, the development of single-cell sequencing technologies has opened a new point of view on a series of complex biological phenomena at the single-cell level [1]. Rich datasets produced with these technologies can be utilized to investigate differences in gene expression between individual cells, characterize cell types, and study heterogeneity within a cell line [2]. Nevertheless, different types of cells often infiltrate each other in traditional biological experiments [3]. An effective way of solving this problem is to group individual cells by clustering so that cells within the same cluster show highly similar patterns of gene expression.
The process of grouping cells based on single-cell data is an unsupervised clustering problem, and a collection of computational methods has been presented to address it, such as hierarchical analysis [4], K-means [5], principal component analysis (PCA) [6] and spectral clustering [7]. However, technical and biological issues bring great challenges, such as high noise, many missing values and high gene expression variability [8]. In addition, the number of genes assayed in scRNA-seq is much larger than the number of cells to be classified, which may cause the distances between cells to become similar. Accordingly, most traditional clustering algorithms lose their effectiveness in partitioning the cells into well-separated groups.
Considerable effort has gone into circumventing these problems in recent years, with many studies defining cell types on the basis of single-cell gene expression patterns. For example, Buettner et al. [9] presented a single-cell latent variable model to identify otherwise undetectable subpopulations of cells. Xu and Su used the concept of shared nearest neighbors and proposed an algorithm named shared nearest neighbor (SNN)-Cliq for grouping cells, which could generate desirable solutions with high accuracy and sensitivity [10]. Höfer and Shao adapted Nonnegative Matrix Factorization (NMF) [11,12] to study the problem of the unsupervised learning of cell subtypes from single-cell gene expression data [13]. Kiselev et al. [14] put forward single-cell consensus clustering (SC3), which combined all the different clustering outcomes into a consensus matrix and determined the final results by complete-linkage hierarchical clustering of the consensus matrix. Lin et al. [15] incorporated prior biological knowledge to test various neural network architectures and used these to obtain a reduced-dimension representation of the single-cell expression data for identifying a unique group of cells. Gao et al. [16] adopted a likelihood-based strategy using the two-state model of the stochastic gene transcription process and developed Clustering And Lineage Inference in Single Cell Transcriptional Analysis (CALISTA) for clustering and lineage inference analysis. Zheng et al. [17] drew inspiration from the self-expression of cells within the same group, imposed a non-negative and low-rank structure on the similarity matrix, and then proposed the SinNLRR method for scRNA-seq cell type detection. Zhu et al. [18] explored a method combining structure entropy and k nearest neighbors to identify cell subpopulations in scRNA-seq data. Jiang et al. [19] proposed a new cell similarity measure based on cell-pair differentiability correlation and further developed a variance-analysis-based clustering algorithm that can identify cell types accurately. For identifying cell subtypes, most of these approaches do reasonably well in some situations by employing feature selection or dimensionality reduction to reduce the noise of the original data and speed up the calculation [20].
Spectral clustering (SC), one of the most popular modern clustering algorithms, uses the first k eigenvectors of the Laplacian matrix derived from the similarity matrix to carry out dimensionality reduction for clustering. SC is easy to implement and can be realized efficiently with standard linear algebra methods [21]. Generally speaking, there are three methods for constructing a similarity matrix: ε-neighborhood, k-nearest neighbor, and fully connected. All of them rely on a distance or similarity measure, for which several choices are available, including Euclidean distance, Pearson's correlation, Spearman's correlation and the Gaussian similarity function. In general, the performance of clustering is quite sensitive to the choice of similarity measure. Recently, several computational methods have been proposed to improve the clustering performance of SC. For instance, Lu et al. [22] proposed a convex Sparse Spectral Clustering (SSC) model, which extended the traditional spectral clustering method with a sparse regularization, and the Pairwise Sparse Spectral Clustering (PSSC) method, which seeks to improve the clustering performance by leveraging multi-view information. Wang et al. [23] combined multiple kernels to best fit the structure of the data and employed a rank constraint on the learned cell-to-cell similarity together with graph diffusion in order to perform dimension reduction, clustering, and visualization. Park and Zhao utilized multiple doubly stochastic similarity matrices to learn a similarity matrix and imposed a sparse structure on the target matrix, followed by shrinking pairwise differences of the rows in the target matrix, to extend the spectral clustering algorithm [24].
Although these methods achieve promising results in identifying cell types, they only consider the impact of the positive similarities between cells on the clustering result and do not consider the impact of negative similarities. That is to say, only the similarities are considered, and the dissimilarities are overlooked. This may limit the effectiveness of clustering algorithms based on spectral analysis in grouping cells that belong to the same cell type. Yet the intuitive goal of SC is to divide the data points (representing single cells) into several groups such that points in the same group are similar and points in different groups are dissimilar to each other [21]. Hence, dissimilarities between single cells should not be ignored. In this study, we build an incidence matrix that considers similarities and dissimilarities between cells simultaneously and improve the spectral clustering method for partitioning cells. In our improved algorithm, we adopt the dissimilarity matrix to stress the dissimilarities between the natural groupings, and a parameter is adjusted to balance the similarity matrix and the dissimilarity matrix.
To investigate the performance of the improved method, we first apply it to breast cancer data to distinguish tumor cells, stromal cells, and immune cells and compare the results with the conventional SC. Then we apply it to four other scRNA-seq datasets whose cell labels are highly confident. Our results show that taking into account dissimilarities as well as similarities increases performance. Moreover, the clustering results indicate that the improved method achieves higher accuracy and strong robustness in identifying cell subpopulations.

Results
We applied the improved spectral clustering (ISC) method to several published single cell datasets. The results were compared with conventional spectral clustering by Purity, Rand Index (RI), Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI).

Breast cancer data
The first biological dataset we tested had RNA-seq data of 549 single cells. After the filtering steps described in the Data preprocessing section, 34 single cells with low sequencing quality were discarded. Among the remaining 515 single cells, 317 had been identified as epithelial breast cancer cells, 175 as tumor-associated immune cells and 23 as non-carcinoma stromal cells; these labels can be considered the gold standard. After strict quality control, 11986 genes were selected and normalized before the cells were clustered into distinct groups.
The parameter ω trades off the weight of similarity against dissimilarity in the incidence matrix and must lie between 0 and 1. The smaller the value of ω, the more emphasis is put on the similarity inside a cluster; in particular, when ω equals zero, the improved spectral clustering reduces to the conventional spectral clustering. The closer ω is to 1, the more attention is paid to the dissimilarity between clusters. With h and q fixed to 80, the performance of improved spectral clustering as ω varies is shown in Fig. 1. As ω grows, Purity, RI, ARI and NMI all remain steady at first, then increase sharply and reach their maximum values when ω is equal to 0.4; these indices then fall back quickly and finally rise again and become stable. The clustering results of improved spectral clustering (ω = 0.4) are better than those of conventional spectral clustering (ω = 0). This indicates that, when using spectral clustering, taking both the similarity within clusters and the dissimilarity between clusters into account performs at least as well as considering only the similarity within clusters.
Fig. 1 The performance of the improved spectral clustering with the variation of parameter ω when the values of h and q are fixed to 80. Purity, RI, ARI and NMI all reach their maximum values when ω is equal to 0.4
The implementation of the improved spectral clustering requires two other parameters, h and q, which represent the widths of the similar and dissimilar neighborhoods, respectively. The effect of each parameter on the clustering results is discussed here. If the number of cells is ns and the number of cell types is nt, we first round ns to the nearest hundred and divide it by 100 to obtain the step size ss, and then increase h from ss to 0.5 × ns/nt in steps of ss to study the influence of h. For example, for the 515 cells of 3 types in the breast cancer dataset, we consider h ∈ {5, 10, 15, 20, ..., 80, 85}; for each value of h, q is set equal to h or to h/2. Thus, the incidence matrix can be obtained by 32 different parameter combinations. The best performance of improved spectral clustering with different parameter combinations of h and q is listed in Table 1. When only the similarity is considered, ω is set to zero and the improved spectral clustering reduces to the conventional spectral clustering. When both similarity and dissimilarity are considered, ω is set to a non-zero value. Table 1 shows that improved spectral clustering performs better for the various combinations of h and q on the breast cancer dataset. Although the value of ω at which improved spectral clustering performs best differs between combinations of h and q, the results are robust, and the improved algorithm is largely insensitive to the values of h and q.
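The grid construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `parameter_grid` is hypothetical, the step size is taken as ns/100 rounded to the nearest integer, and the upper bound 0.5 × ns/nt is assumed to be snapped to the nearest multiple of the step, since that choice reproduces the two grids quoted in the text (5 to 85 for 515 cells of 3 types, 6 to 78 for 622 cells of 4 types). Integer division is assumed for q = h/2, consistent with the reported setting h = 15, q = 7.

```python
def parameter_grid(ns, nt):
    """Enumerate (h, q) pairs for ns cells and nt cell types (illustrative)."""
    ss = max(1, round(ns / 100))            # step size: 515 -> 5, 622 -> 6
    # Assumption: snap the bound 0.5 * ns / nt to the nearest multiple of ss.
    h_max = ss * round(0.5 * ns / nt / ss)
    grid = []
    for h in range(ss, h_max + 1, ss):
        for q in (h, h // 2):               # q = h, or q = h/2 (integer part)
            grid.append((h, q))
    return grid


bcc_grid = parameter_grid(515, 3)
h_values = sorted({h for h, _ in bcc_grid})
print(h_values[0], h_values[-1], len(h_values), len(bcc_grid))
```

For 515 cells and 3 cell types this enumeration gives 17 values of h and 34 (h, q) pairs; the text reports 32 combinations, so the authors' exact rule may differ in a detail that is not stated here.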
As the value of h increases, the conventional spectral clustering performs better and better. When h = 85 and q = 0, the conventional spectral clustering has its best performance, with Purity, RI, ARI and NMI values of 0.7417, 0.5544, 0.1077 and 0.2915, respectively. In contrast, improved spectral clustering shows stable performance whatever the values of h and q. When h = 15, q = 7 and ω = 0.2, the improved spectral clustering gives the best clustering results in terms of Purity and NMI, which are 0.9281 and 0.5784, respectively. When h = 80, q = 80 and ω = 0.4, the improved spectral clustering performs best in terms of RI and ARI, which are 0.7633 and 0.5252, respectively. Although the clustering results of improved spectral clustering are good, the ARI and NMI values are less satisfactory, as both remain below 0.6. A possible reason is that, among the three types of cells isolated from individual tumor tissues, tumor cells have distinct chromosomal expression patterns that recapitulate tumor-specific copy number variations, while immune cells and stromal cells have no apparent copy number variation patterns [3]. Separating the latter two types of cells is therefore somewhat difficult for a clustering method based on gene expression patterns.
Moreover, to determine whether the improved spectral clustering is significantly better than the conventional spectral clustering, we use the non-parametric one-tailed Wilcoxon rank sum test. The P-value of the test, shown in Fig. 2, is taken as the significance level of the difference between the two methods. The procedure is as follows. For given values of h and q, we first calculate the evaluation metrics of improved and conventional spectral clustering over the various values of ω and record the best performance of each. This process is repeated as h and q are changed together. The significance level of the test is then calculated as the proportion of the evaluation metrics of the conventional spectral clustering that exceed those of the improved spectral clustering. The comparison shows that the evaluation metrics of improved spectral clustering are significantly greater than those of conventional spectral clustering.

Other real data
We then compare our proposed improved spectral clustering with the conventional spectral clustering on four other single-cell RNA sequencing datasets featuring high-confidence cell labels. These datasets are derived from different single-cell RNA-seq techniques and are collected from human or mouse. Some cells are involved in different biological processes, some originate from different tissues, and some are derived from different cell lines [25][26][27][28]. All the original expression data were pre-processed in previous studies [23,24]. The dendritic cells (DCs) dataset consists of 251 cells at three different progenitor stages and 11834 genes that pass the gene filter step. The mixture of diverse single cells (MCs) dataset consists of 249 single cells captured from a mixture of 11 cell populations. After initial filtering steps similar to those for the DCs dataset above, 14805 genes [...] {6, 12, 18, 24, ..., 72, 78} in NCs. When h is given a fixed value, q is set equal to h or to h/2. The improved spectral clustering method with different combinations of parameters is then applied to cluster the cells in these datasets. Table 2 shows the best performance of traditional spectral clustering (ω = 0) and improved spectral clustering (ω ≠ 0). The four index values given in Table 2 show that the improved spectral clustering outperforms the conventional spectral clustering. With improved spectral clustering, Purity, RI, ARI and NMI all increase to some degree, the largest increase being 23.3%. Furthermore, on the MCs dataset, where the clustering results of conventional spectral clustering are already satisfactory, improved spectral clustering still gives better results.
Although ARI and NMI increase on the DCs dataset, they remain low. A possible reason is that, although progenitor populations retained protein-level expression of the surface markers associated with their respective progenitor stages, individual cells had already shifted transcriptionally toward the next step in differentiation, so that the gene-expression profiles of the developing dendritic cell subsets overlap significantly [25].

Discussion
Large volumes of single-cell data have emerged with the progress of next-generation sequencing technology, and taking full advantage of these rich data is very important. One of the most powerful applications of single-cell data is to define cell types by clustering analysis on the basis of gene expression patterns. The quality of the clustering affects the outcome of downstream analysis. To date, many clustering algorithms for identifying subtypes of cells have been proposed.
Owing to the high dimensionality of single-cell data, the differences among the distances between cells narrow. Thus, it is unreliable to define cell types directly on the basis of these high-dimensional data. Effective dimensionality reduction can make the measured distance between cells more accurate for cell clustering. For example, spectral clustering projects data into a lower-dimensional space based on the eigenvectors corresponding to the k smallest eigenvalues of the Laplacian matrix, which is derived from the incidence matrix. However, the usual method for constructing the incidence matrix only attaches importance to similarities between cells and overlooks the dissimilarities between them. The dissimilarities between cells capture the discrepancy in expression pattern between different cell types and strongly influence the identification of clusters. We therefore expected that incorporating the dissimilarities between cells would help to achieve better clustering results. In this study, the conventional spectral clustering method has been improved for clustering single cells by combining similarities and dissimilarities between cells. Furthermore, we applied this improved method to five published single-cell datasets including cells from different tissues, stages and cell lines. The results show that it performs better than conventional spectral clustering on several metrics. Through the integration of similarities and dissimilarities, the classification accuracy is improved. The performance of the proposed method under various parameter combinations also shows that it is more robust.
Although improved spectral clustering makes some progress in identifying cell types, there is still considerable room for improvement. Several challenging problems remain: which measure should be used to reflect the distance between cells, how to measure the similarities and dissimilarities reasonably, how many similar and dissimilar cells to choose when constructing the similarity matrix and the dissimilarity matrix, and how to balance the two matrices when constructing the incidence matrix. The answers to these questions depend on the specific data, and solving these problems will require data-driven approaches. In addition, predicting the number of clusters is a challenge. In the future, it would be interesting to develop a more effective clustering method by integrating improved spectral clustering with other computational analysis methods.

Conclusion
In this study, we have improved the conventional spectral clustering algorithm for separating single cells into distinct groups by incorporating dissimilarities between cells alongside similarities. We have shown that its performance is superior to that of the conventional spectral clustering method on several published single-cell datasets.

Data sources
In this study, we used five published single-cell datasets. We first focused on the analysis of primary breast cancer cells (BCCs). The original single-cell RNA sequencing data were downloaded from the NCBI GEO database under the accession code GSE75688 [3]. Eleven primary tumor specimens and two metastatic lymph nodes were collected and processed for single-cell RNA sequencing. In total, 549 single-cell cDNAs were subjected to RNA sequencing.
Then, we directly applied the improved algorithm to other processed single-cell gene expression datasets from previously published papers [23,24]. DCs arise from a cascade of progenitors that gradually differentiate in the bone marrow [25]. Schlitzer et al. used mRNA sequencing of 251 dendritic cell progenitors to investigate the transcriptomic relationships. Each of those dendritic cell progenitors was in one of the following three cellular states: macrophage dendritic cell progenitor, common dendritic cell progenitor, or pre-dendritic cell. Pollen et al. [26] made an unbiased analysis and comparison of 249 MCs with greater than 500,000 reads from 11 populations by microfluidic single-cell capture and low-coverage sequencing of many cells. Kolodziejczyk et al. [27] collected 704 single-cell transcriptomes of embryonic stem cells (ESCs) cultured in three different conditions (serum, 2i, and the alternative ground state a2i) and studied how different culture conditions influence the pluripotent states of ESCs. Usoskin et al. [28] used comprehensive transcriptome analysis of 622 single mouse neuronal cells (NCs) to identify four neuronal groups, revealing the diversity and complexity of the primary sensory system underlying somatic sensation. The basic information for the above-mentioned single-cell datasets is listed in Table 3.

Data preprocessing
To eliminate noise and missing data contained in the dataset, a data preprocessing procedure is carried out first. As shown in Fig. 3, it consists of the following steps.

Step 1: removing cells with low sequencing quality
The RNA-SeQC tool is used to remove cells with low-quality sequencing values [29]. If the number of total reads is less than 3,000,00, or the mapping rate is less than 50%, or the number of detected genes is less than 2000, or the portion of intergenic region is more than 30%, the cell is identified as an outlier and excluded from further analysis.

Step 2: filtering out genes with low expression values
First, genes with a transcripts per million (TPM) value less than 1 are considered unreliable and their values are replaced with 0; second, TPM values are log2-transformed after adding a value of 1 (log2(TPM + 1)) in order to reduce the effect of highly expressed genes; and third, genes expressed in < 10% of the bulk groups are discarded.

Step 3: normalizing gene expression data
For removing systematic variation in an experiment which affects the measured gene expression levels and examining relative expression levels, the gene expression data are first centered by subtracting the average expression of each gene from all cells, and then are divided by the variance of each gene from all cells.
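Steps 2 and 3 can be illustrated with a short NumPy sketch. It is only an approximation of the procedure under stated assumptions: the function name `preprocess` and the parameter `min_fraction` are hypothetical, "expressed" is taken to mean a non-zero value after thresholding, and the 10% filter is applied across the columns passed in, because the text does not spell out how the "bulk groups" are formed. The division by the per-gene variance follows the wording of Step 3.

```python
import numpy as np


def preprocess(tpm, min_fraction=0.10):
    """tpm: genes x cells array of TPM values (illustrative sketch)."""
    x = np.asarray(tpm, dtype=float).copy()
    x[x < 1.0] = 0.0                        # Step 2a: TPM < 1 is set to 0
    x = np.log2(x + 1.0)                    # Step 2b: log2(TPM + 1)
    # Step 2c (assumption): keep genes that are non-zero in at least
    # min_fraction of the supplied columns.
    keep = (x > 0).mean(axis=1) >= min_fraction
    x = x[keep]
    # Step 3: center each gene across cells, then divide by its variance.
    centered = x - x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    var[var == 0] = 1.0                     # guard against constant genes
    return centered / var, keep


toy = np.array([[0.5, 0.2, 0.9, 0.1],       # never reaches TPM 1
                [1.0, 3.0, 7.0, 15.0],      # expressed in every cell
                [0.0, 0.0, 0.0, 100.0]])    # expressed in one cell only
normalized, kept = preprocess(toy, min_fraction=0.5)
print(kept, normalized)
```

Cell-level quality control (Step 1) is left out because it depends on read-level statistics produced by an external tool.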

Improvement of spectral clustering
Let $P = \{p_1, p_2, \ldots, p_n\}$ denote a given set of data points, where each data point $p_i$ is an $r$-dimensional column vector, and let $S = (s_{ij}) \in R^{n \times n}$ be a symmetric similarity matrix, where $s_{ij} \ge 0$ measures the similarity between data points $p_i$ and $p_j$; a greater value of $s_{ij}$ indicates that $p_i$ and $p_j$ are more similar. In conventional spectral clustering, we seek a $k$-dimensional column feature vector $x_i$ for each data point $p_i$, where $k$ is far less than $r$, so that each data point can be represented by a $k$-dimensional feature vector. Intuitively, if two data points are more similar, their feature vectors should be closer to each other in the feature space. Therefore, the problem of finding $k$-dimensional feature vectors can be converted into the following optimization problem:

$$\min_{X} \sum_{i=1}^{n} \sum_{j=1}^{n} s_{ij} \lVert x_i - x_j \rVert^2 \quad \text{subject to } XX^{T} = I_k, \tag{1}$$

where $X = [x_1, x_2, \ldots, x_n] \in R^{k \times n}$ and $I_k$ is the $k \times k$ identity matrix. Let $D$ be the diagonal matrix whose $l$th diagonal entry equals the sum of all elements in the $l$th row of the similarity matrix; the Laplacian matrix is then $L = D - S$. Define the $n \times k$ feature matrix $M = [m_1, m_2, \ldots, m_k]$, where $m_j$ is the unit eigenvector corresponding to the $j$th smallest eigenvalue of the Laplacian matrix $L$, and let $x_i$ be the $i$th row of $M$. It can then be proved that $x_i$ $(i = 1, 2, \ldots, n)$ is the solution of optimization problem (1). With these $k$-dimensional features of all data points, any feature-based clustering method can be used to perform cluster analysis.
We improve the conventional spectral clustering by also taking the dissimilarities between data points into account. A symmetric dissimilarity matrix $DS = (ds_{ij}) \in R^{n \times n}$ is used to define the dissimilarities between data points, where $ds_{ij} \le 0$; the smaller this value, the more dissimilar the data points $p_i$ and $p_j$. We again seek a $k$-dimensional column feature vector $y_i$ for each data point $p_i$, where $k$ is far less than $r$. Analogously, if two data points are more dissimilar, their feature vectors should be more distant from each other in the feature space. Because $ds_{ij} \le 0$, this can be formulated as the following optimization problem:

$$\min_{Y} \sum_{i=1}^{n} \sum_{j=1}^{n} ds_{ij} \lVert y_i - y_j \rVert^2 \quad \text{subject to } YY^{T} = I_k, \tag{2}$$

where $Y = [y_1, y_2, \ldots, y_n] \in R^{k \times n}$. Considering the similar and dissimilar representation problems simultaneously, we seek a $k$-dimensional column feature vector $z_i$ for each data point $p_i$, where $k$ is far less than $r$. If two data points are more similar, their feature vectors should be closer to each other, while if two data points are more dissimilar, their feature vectors should be more distant from each other in the feature space. Therefore, problems (1) and (2) can be joined to obtain the following expression:

$$\min_{Z} \; (1 - \omega) \sum_{i=1}^{n} \sum_{j=1}^{n} s_{ij} \lVert z_i - z_j \rVert^2 + \omega \sum_{i=1}^{n} \sum_{j=1}^{n} ds_{ij} \lVert z_i - z_j \rVert^2 \quad \text{subject to } ZZ^{T} = I_k, \tag{3}$$

where $0 \le \omega \le 1$ is a parameter used to balance the similarity and dissimilarity described by the feature vectors. When $\omega = 0$, problem (3) reduces to optimization problem (1), and when $\omega = 1$, problem (3) reduces to optimization problem (2). $W = (1 - \omega)S + \omega DS = (w_{ij}) \in R^{n \times n}$ is a weighted symmetric incidence matrix that defines the relationship between data points: $w_{ij} > 0$ indicates that the data points $p_i$ and $p_j$ are similar, $w_{ij} < 0$ indicates that they are distant, and $w_{ij} = 0$ means that they are irrelevant. Let $Z = [z_1, z_2, \ldots, z_n] \in R^{k \times n}$. Problem (3) can then be written as

$$\min_{Z} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \lVert z_i - z_j \rVert^2 \quad \text{subject to } ZZ^{T} = I_k. \tag{4}$$

As with problem (1), the solution of problem (4) is given by the eigenvectors corresponding to the $k$ smallest eigenvalues of the generalized Laplacian matrix $L' = D' - W$, where $D'$ is the diagonal matrix of the row sums of $W$. We can then use any feature-based clustering algorithm on the first $k$ eigenvectors to cluster the data points.
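The link between problem (4) and an eigenvalue problem can be made explicit with a short derivation. The sketch below assumes only what the text states, namely that $W$ is symmetric and that the feature vectors $z_i$ are the columns of a $k \times n$ matrix $Z$ with $ZZ^{T} = I_k$; the symbol $d'_i$ for the $i$th row sum of $W$ is introduced here for convenience.

```latex
\begin{aligned}
\sum_{i,j} w_{ij}\,\lVert z_i - z_j\rVert^2
 &= \sum_{i,j} w_{ij}\left(\lVert z_i\rVert^2 + \lVert z_j\rVert^2 - 2\, z_i^{T} z_j\right) \\
 &= 2\sum_{i} d'_{i}\,\lVert z_i\rVert^2 - 2\sum_{i,j} w_{ij}\, z_i^{T} z_j ,
    \qquad d'_i = \sum_{j} w_{ij},\quad w_{ij} = w_{ji} \\
 &= 2\,\mathrm{tr}\!\left(Z D' Z^{T}\right) - 2\,\mathrm{tr}\!\left(Z W Z^{T}\right)
  = 2\,\mathrm{tr}\!\left(Z L' Z^{T}\right), \qquad L' = D' - W .
\end{aligned}
```

Under the constraint $ZZ^{T} = I_k$, the trace $\mathrm{tr}(Z L' Z^{T})$ of a symmetric matrix $L'$ is minimized when the rows of $Z$ are orthonormal eigenvectors belonging to its $k$ smallest eigenvalues. Note that $L'$ need not be positive semidefinite once negative weights are present, so some of these eigenvalues can be negative.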

Identifying cell types using improved spectral clustering
After preprocessing the single-cell dataset, constructing an appropriate incidence matrix is key to clustering single cells by improved spectral clustering. The detailed steps, depicted in Fig. 4, are as follows.

Quantifying pairwise similarities and dissimilarities
Spearman's rank correlation coefficient (denoted by the Greek letter ρ) is a non-parametric measure of correlation that assesses the relationship between two variables without assuming a particular distribution; we use it to measure the similarity/dissimilarity between cells. The ρ of two cells (i and j) is calculated as:

$$\rho(i, j) = 1 - \frac{6 \sum_{t=1}^{m} d_t^2}{m(m^2 - 1)},$$

where $m$ is the number of genes and $d_t$ represents the difference between the two numbers in the $t$th pair of gene ranks. ρ can vary between -1 and 1. The similarity $s(i, j) \ge 0$ and dissimilarity $ds(i, j) \le 0$ between cell i and cell j can then be calculated from the positive and negative values of ρ, respectively:

$$s(i, j) = \begin{cases} \rho(i, j), & \rho(i, j) > 0 \\ 0, & \text{otherwise} \end{cases} \qquad ds(i, j) = \begin{cases} \rho(i, j), & \rho(i, j) < 0 \\ 0, & \text{otherwise} \end{cases}$$

If the ρ between cell i and cell j is close to 1, the gene expression levels of the two cells tend to be relatively high or low simultaneously; in other words, cell i and cell j have similar gene expression patterns, and the higher the ρ, the greater the similarity. Likewise, if the ρ between cell i and cell j is close to -1, the gene expression levels of the two cells follow opposite trends; that is, there is a large dissimilarity between their gene expression patterns, and the lower the ρ, the stronger the dissimilarity. Besides Spearman's rank correlation coefficient, Pearson's correlation coefficient can be used to calculate the similarity/dissimilarity among cells.
Fig. 4 The framework of improved spectral clustering for grouping cells. Cell-cell similarity/dissimilarity networks were constructed and then integrated to build an incidence matrix. Selected features were used directly by the K-means algorithm to assign cells to clusters
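The pairwise quantities described above can be computed for all cells at once. The following NumPy sketch is illustrative only: the function names are hypothetical, ties in expression values are broken by order of appearance rather than by average ranks (so the result matches the classical ρ only when values within a cell are distinct), and the split of ρ into a non-negative similarity and a non-positive dissimilarity is an assumption inferred from the constraints that similarities are at least zero and dissimilarities at most zero.

```python
import numpy as np


def spearman_matrix(expr):
    """expr: genes x cells array. Returns the cells x cells matrix of rho."""
    expr = np.asarray(expr, dtype=float)
    m = expr.shape[0]
    # Rank the genes within each cell (1 = lowest expression).
    order = expr.argsort(axis=0, kind="stable")
    ranks = order.argsort(axis=0, kind="stable") + 1.0
    # Sum over genes of squared rank differences for every pair of cells,
    # expanded as |a|^2 + |b|^2 - 2 a.b to avoid a genes x cells x cells array.
    sq = (ranks ** 2).sum(axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (ranks.T @ ranks)
    return 1.0 - 6.0 * d2 / (m * (m ** 2 - 1))


def split_similarity(rho):
    """Assumed split: positive rho -> similarity, negative rho -> dissimilarity."""
    s = np.where(rho > 0, rho, 0.0)
    ds = np.where(rho < 0, rho, 0.0)
    return s, ds


expr = np.array([[1, 2, 5],
                 [2, 4, 4],
                 [3, 6, 3],
                 [4, 8, 2],
                 [5, 10, 1]])               # 5 genes x 3 cells
rho = spearman_matrix(expr)
s, ds = split_similarity(rho)
print(rho)
```

In the toy matrix, the first two cells rank their genes identically and the third ranks them in reverse, so the sketch returns ρ = 1 and ρ = -1 for those pairs.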

Constructing incidence matrix based on similarities and dissimilarities
For each cell i, the similarities between cell i and every other cell are sorted in descending order, and the dissimilarities between cell i and every other cell are sorted in ascending order. The similarity matrix $S = (s_{ij}) \in R^{n \times n}$ is designed as follows: for cell i and cell j, if cell i is among the top h similar cells of cell j, or cell j is among the top h similar cells of cell i, then $s_{ij} = s_{ji} = s(i, j) = s(j, i)$; otherwise, $s_{ij} = s_{ji} = 0$. Likewise, the dissimilarity matrix $DS = (ds_{ij}) \in R^{n \times n}$ is built as follows: for cell i and cell j, $ds_{ij} = ds_{ji} = ds(i, j) = ds(j, i)$ if $ds(i, j)$ is in the top q of the sorted dissimilarity list of cell i or $ds(j, i)$ is in the top q of the sorted dissimilarity list of cell j; otherwise, $ds_{ij} = ds_{ji} = 0$. The incidence matrix W is constructed by combining the similarity matrix S with the dissimilarity matrix DS using the following equation:

$$W = (1 - \omega) S + \omega DS,$$

where ω is selected from the set {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1} and is used to trade off the proportion of similarity and dissimilarity in the incidence matrix.
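The neighborhood rule and the weighted combination can be sketched as follows. This is a minimal NumPy illustration with a hypothetical function name, `build_incidence`; it takes dense matrices of pairwise similarities (non-negative) and dissimilarities (non-positive) as input, excludes each cell from its own neighbor lists, and breaks ties by index order, none of which is specified in the text.

```python
import numpy as np


def build_incidence(s, ds, h, q, omega):
    """Sparsify s (>= 0) and ds (<= 0) by top-h / top-q neighbours, then mix."""
    n = s.shape[0]
    s = np.array(s, dtype=float)
    ds = np.array(ds, dtype=float)
    np.fill_diagonal(s, 0.0)                # a cell is not its own neighbour
    np.fill_diagonal(ds, 0.0)
    top_s = np.zeros((n, n), dtype=bool)
    top_d = np.zeros((n, n), dtype=bool)
    for i in range(n):
        others = np.array([j for j in range(n) if j != i])
        # h most similar cells of cell i: largest similarity first
        top_s[i, others[np.argsort(-s[i, others], kind="stable")[:h]]] = True
        # q most dissimilar cells of cell i: most negative value first
        top_d[i, others[np.argsort(ds[i, others], kind="stable")[:q]]] = True
    # Keep an entry if either cell lists the other (the "or" rule in the text).
    S = np.where(top_s | top_s.T, s, 0.0)
    DS = np.where(top_d | top_d.T, ds, 0.0)
    return (1.0 - omega) * S + omega * DS


rho = np.array([[1.0, 0.9, -0.5, -0.8],
                [0.9, 1.0, -0.6, -0.7],
                [-0.5, -0.6, 1.0, 0.8],
                [-0.8, -0.7, 0.8, 1.0]])
W = build_incidence(np.where(rho > 0, rho, 0.0), np.where(rho < 0, rho, 0.0),
                    h=1, q=1, omega=0.5)
print(W)
```

With ω = 0 the function returns the sparsified similarity matrix alone, which corresponds to the conventional spectral clustering setting.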

Extracting feature vectors for K-means clustering
After constructing an incidence matrix W in the way described above, we obtain a generalized Laplacian matrix $L' = D' - W$, where $D'$ is a diagonal matrix with the row sums of W on the diagonal and zeros in the off-diagonal elements. If the number of clusters is k, the first k eigenvectors $u_1, u_2, \ldots, u_k$ of the generalized Laplacian matrix $L'$ are calculated. Let $u_1, u_2, \ldots, u_k$ be the columns of the matrix $U \in R^{n \times k}$; the ith row of U is then the feature vector corresponding to cell i. The k-means algorithm is then performed on these feature vectors to cluster the cells, using MATLAB's kmeans function.
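The feature extraction and clustering step can be sketched in NumPy. The code below is a stand-in rather than a reproduction of the authors' MATLAB workflow: the function names are hypothetical, "first k eigenvectors" is read as the eigenvectors of the k smallest eigenvalues, and the small Lloyd-style `kmeans` with a deterministic farthest-point initialization replaces MATLAB's kmeans purely to keep the example self-contained.

```python
import numpy as np


def spectral_features(W, k):
    """Rows of the returned n x k matrix are the per-cell feature vectors."""
    L = np.diag(W.sum(axis=1)) - W          # generalized Laplacian L' = D' - W
    _, vecs = np.linalg.eigh(L)             # eigenvalues come back ascending
    return vecs[:, :k]                      # eigenvectors of the k smallest


def kmeans(X, k, n_iter=100):
    """Plain Lloyd iterations with farthest-point initialization (stand-in)."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])       # next center: farthest point so far
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                        else centers[c] for c in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels


# Toy incidence matrix: two groups of three cells, positive weights inside a
# group and negative weights between groups.
W = np.full((6, 6), -0.5)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
labels = kmeans(spectral_features(W, k=2), k=2)
print(labels)
```

On the toy matrix the two groups end up with different labels; on real data, a k-means implementation with multiple random restarts would be the more usual choice.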

Evaluation metrics
In this study, four indices are employed to evaluate the performance of the improved and conventional spectral clustering algorithms: Purity, RI, ARI and NMI. Let the $C_U$-partition $U = \{U_1, U_2, \ldots, U_{C_U}\}$ be our calculated partition of the n data points $p_1, p_2, \ldots, p_n$, and let the $C_V$-partition $V = \{V_1, V_2, \ldots, V_{C_V}\}$ be the genuine partition. We can define the contingency table $T = (t_{ij}) \in R^{C_U \times C_V}$, where entry $t_{ij}$ is the number of data points that are in both cluster $U_i$ and cluster $V_j$. Each obtained cluster $U_i$ $(i = 1, 2, \ldots, C_U)$ is assigned to the cluster $V_j$ $(j = 1, 2, \ldots, C_V)$ that has the largest number in the ith row of the contingency table, and the accuracy of this assignment is then computed as the sum of the best-assigned entries in the contingency table divided by the total number of data points (N):

$$\text{Purity} = \frac{1}{N} \sum_{i=1}^{C_U} \max(t_{i\cdot}),$$

where $t_{i\cdot}$ denotes the elements in the ith row of the contingency table and max() is the largest element.
RI measures the fraction of pairs of data points that are classified in the same way in both clusterings, relative to the number of pairs of all data points. Thus, it is defined by:

$$\text{RI} = \frac{n_{00} + n_{11}}{\binom{n}{2}},$$

where $n_{00}$ denotes the number of pairs that are in different clusters under both U and V, and $n_{11}$ denotes the number of pairs that are in the same cluster under both U and V.
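ARI is the normalized difference between the RI and its expected value under the assumption of a generalized hypergeometric distribution as the null hypothesis [30]. Mathematically, it is defined as follows:

$$\text{ARI} = \frac{\sum_{i,j} \binom{t_{ij}}{2} - \left[ \sum_{i} \binom{t_{i\cdot}}{2} \sum_{j} \binom{t_{\cdot j}}{2} \right] \Big/ \binom{n}{2}}{\frac{1}{2} \left[ \sum_{i} \binom{t_{i\cdot}}{2} + \sum_{j} \binom{t_{\cdot j}}{2} \right] - \left[ \sum_{i} \binom{t_{i\cdot}}{2} \sum_{j} \binom{t_{\cdot j}}{2} \right] \Big/ \binom{n}{2}},$$

where $t_{i\cdot} = \sum_{j=1}^{C_V} t_{ij}$ is the sum of row i in the contingency table T and $t_{\cdot j} = \sum_{i=1}^{C_U} t_{ij}$ is the sum of column j in the contingency table T. The ARI ranges from -1 to 1. The larger the ARI, the better the quality of the clustering.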
NMI provides a sound normalized indication for the comparison of clusterings; it has its origin in information theory and is based on the notion of entropy [31]. It is defined as a ratio in which the numerator is the mutual information between V and U and the denominator is a normalizing term built from the entropies of the clusterings V and U.
We use these external indices to evaluate the agreement between the results of improved spectral clustering and the true clusters, and the agreement between the results of conventional spectral clustering and the true clusters, respectively. The more the agreement, the better the performance of the clustering method.
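The pair-counting indices can be computed directly from two label vectors. The sketch below uses only the Python standard library and hypothetical function names; it covers Purity, RI and ARI as described above and leaves out NMI, because the exact normalization of the mutual information is not recoverable from the text. It assumes that at least one of the two partitions has more than one cluster, since otherwise the ARI denominator is zero.

```python
from collections import Counter
from math import comb


def purity(pred, true):
    """Share of points falling in the best-matching true cluster per cluster."""
    best = {}
    for (u, _), count in Counter(zip(pred, true)).items():
        best[u] = max(best.get(u, 0), count)
    return sum(best.values()) / len(pred)


def rand_indices(pred, true):
    """Return (RI, ARI) computed from the contingency table of two labelings."""
    total = comb(len(pred), 2)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(pred, true)).values())
    sum_i = sum(comb(c, 2) for c in Counter(pred).values())   # row sums
    sum_j = sum(comb(c, 2) for c in Counter(true).values())   # column sums
    n11 = sum_ij                                # same cluster in both
    n00 = total - sum_i - sum_j + sum_ij        # different clusters in both
    ri = (n00 + n11) / total
    expected = sum_i * sum_j / total
    ari = (sum_ij - expected) / (0.5 * (sum_i + sum_j) - expected)
    return ri, ari


pred = [0, 0, 1, 1, 2, 2]
true = [0, 0, 0, 1, 1, 1]
print(purity(pred, true), rand_indices(pred, true))
```

For the toy labels the values are Purity = 5/6, RI = 2/3 and ARI = 8/33, which can be checked by hand from the contingency table.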