NDRindex: a method for the quality assessment of single-cell RNA-Seq preprocessing data

Background: Single-cell RNA sequencing can be used to determine cell types reliably, which benefits the medical field, as in the many recent studies on COVID-19. Generally, single-cell RNA data analysis pipelines include data normalization, dimensionality reduction, and unsupervised clustering. However, different normalization and dimensionality reduction methods significantly affect the results of clustering and cell type enrichment analysis. The choice of preprocessing path is crucial in scRNA-Seq data mining, because a proper preprocessing path can extract more important information from complex raw data and lead to more accurate clustering results. Results: We propose a method called NDRindex (Normalization and Dimensionality Reduction index) to evaluate the quality of the data produced by normalization and dimensionality reduction methods. The method includes a function that calculates the degree of data aggregation, which is the key to measuring data quality before clustering. For the five single-cell RNA sequencing datasets we tested, the results support the efficacy and accuracy of our index. Conclusions: The method addresses the lack of guidance in selecting preprocessing paths, and the results support its effectiveness and accuracy. Our research provides a useful indicator for the evaluation of RNA-Seq data.

Most single-cell RNA-seq data are sparse, with almost 90% of the measurements being zero, so we use dimension reduction methods to convert the high-dimensional data into low-dimensional data. Sammon mapping [14] and t-SNE [15] are dimension reduction methods that keep the data manifold unchanged, while principal component analysis (PCA) is designed to extract the important information. Methods like LSPCA [16] and ESPCA [17] combine traditional PCA with other algorithms to overcome the shortcomings of PCA. In addition, some clustering methods also provide normalization and dimensionality reduction methods, such as Seurat [18] and SC3 [5].
Different normalization and dimension reduction methods use different data processing algorithms and yield different clustering results. Ideally, normalization and dimension reduction should produce high-quality data whose aggregation results are meaningful. Completely random data has a poor clustering tendency and is therefore not conducive to clustering [19]. To address this problem, we propose NDRindex (Normalization and Dimensionality Reduction index) to evaluate the degree of data aggregation. After comparing all combinations of normalization and dimension reduction methods, the data with the highest NDRindex is selected for further clustering.

Implementation
As input, NDRindex requires a gene expression matrix, normalization methods and dimension reduction methods. To make this step easier, NDRindex includes five normalization methods (TMM, Linnorm, scale, scran and Seurat) and three dimension reduction methods (PCA, tSNE and Sammon).
NDRindex then evaluates the quality of the data produced by each combination. The preprocessed data with the highest NDRindex score is chosen, saved and output.
Finally, clustering techniques (k-means, hclust, etc.) are applied to the selected data, and the clustering result is output. The entire workflow is shown in Fig. 1.
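The selection step of this workflow can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function `select_best_combination` and its arguments are hypothetical names, the normalization and dimension reduction methods are passed in as plain callables, and `score` stands in for the NDRindex calculation.

```python
from itertools import product


def select_best_combination(matrix, normalizers, reducers, score):
    """Score every (normalization, reduction) pair and keep the best one.

    normalizers and reducers map a method name to a callable (hypothetical
    interface); score is any function where a higher value means better data.
    Returns (score, normalizer name, reducer name, processed data), or None
    if no combination was supplied.
    """
    best = None
    for (norm_name, normalize), (red_name, reduce_dims) in product(
        normalizers.items(), reducers.items()
    ):
        processed = reduce_dims(normalize(matrix))
        value = score(processed)
        # Keep the first combination seen, then any strictly better one.
        if best is None or value > best[0]:
            best = (value, norm_name, red_name, processed)
    return best
```

The processed data returned for the winning pair would then be passed to a clustering algorithm of the user's choice.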
The key to the NDRindex method is an algorithm for evaluating data quality. Not all data is suitable for clustering. If the data set does not contain natural clusters, the clustering results will be meaningless, so it is very important to analyze the clustering tendency of the data and evaluate its quality [19]. The NDRindex algorithm evaluates the clustering tendency by calculating the degree of aggregation of the data. The higher the degree of aggregation, the more points are distributed in a relatively small area, indicating the existence of natural clusters. However, assessing the degree of aggregation is a difficult problem. For example, consider two points 50 cm apart. If we regard points less than 5 cm apart as aggregated, the two points will be considered two clusters. If we regard points less than 500 cm apart as aggregated, the two points will probably be considered one cluster. Thus the degree of aggregation depends on both the distances between the points and the definition of aggregation. Based on these considerations, the NDRindex is designed as follows:
Step 1. Calculate the distance matrix and the 'average scale' of the data. Empirically, if the data spread over a larger area, the definition of 'aggregative' should be loosened; if there are more data points, the definition should be tightened. It is therefore assumed that the threshold for 'close' is proportional to the range of the data distribution and decreases with the number of data points. The 'average scale' of the data is defined as $M / \log_{10} n$, where $M$ is the lower quartile of the distances between all point pairs and represents the range of the data distribution, and $n$ is the number of samples in the data set. When the distance between two points is smaller than the 'average scale', they are considered 'close'.
Step 2. Cluster the data to find the point-gathering areas. NDRindex finds a number of clusters, each representing a point-gathering area.
Step 3. Calculate the final index. For each cluster, the average distance from all points to the geometric center is defined as the cluster radius. A smaller cluster radius indicates a smaller, denser point-gathering area and a higher degree of aggregation. The final index is therefore defined from the cluster radii, so that smaller radii give a higher NDRindex. To reduce randomness, NDRindex runs this algorithm 100 times and takes the average value as the final result.
The procedure above is summarized as pseudo-code in Fig. 2.
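Parts of Steps 1 and 3 can be sketched in Python using only the standard library. The sketch assumes that the 'average scale' is the lower-quartile pairwise distance M divided by log10(n), that distances are Euclidean, and that the lower quartile is computed with the inclusive method of `statistics.quantiles`. These choices, and all function names, are illustrative assumptions rather than the authors' code. The clustering of Step 2 and the formula that combines the radii into the final index are not reproduced here.

```python
import math
from itertools import combinations
from statistics import fmean, quantiles


def average_scale(points):
    """Assumed 'average scale': lower-quartile pairwise distance M over log10(n)."""
    n = len(points)
    if n < 3:
        raise ValueError("need at least three points to take a quartile of pair distances")
    distances = [math.dist(p, q) for p, q in combinations(points, 2)]
    lower_quartile = quantiles(distances, n=4, method="inclusive")[0]
    return lower_quartile / math.log10(n)


def is_close(p, q, scale):
    """Two points are 'close' when their distance is below the average scale."""
    return math.dist(p, q) < scale


def cluster_radius(cluster):
    """Mean distance from the points of one cluster to its geometric center."""
    dims = len(cluster[0])
    center = tuple(fmean(p[d] for p in cluster) for d in range(dims))
    return fmean(math.dist(p, center) for p in cluster)


def average_over_runs(score_once, runs=100):
    """Average a randomized score over repeated runs (100 in the text)."""
    return fmean(score_once() for _ in range(runs))
```

Points are expected as tuples of coordinates. With a fixed threshold, the 50 cm example from the text behaves as described: `is_close((0.0,), (50.0,), 5.0)` is false, while `is_close((0.0,), (50.0,), 500.0)` is true.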

Results
To evaluate the performance of NDRindex, we applied the method to simulated and real data sets. The simulated data sets contain data of different quality: some have obvious patterns and are suitable for clustering, while others are not. As shown in Fig. 3, our method clearly distinguishes them. For the real data, we selected five widely used single-cell RNA-Seq datasets, five normalization methods (TMM [6], Linnorm [11], scran [8], Seurat [18] and scale) and three dimension reduction methods (tSNE [15], PCA and Sammon [14]). We collected the output of each combination of methods, submitted them all to four typical clustering algorithms and compared the clustering results using the adjusted Rand index (ARI). As shown in Fig. 4, NDRindex chooses the data with the highest ARI in most cases, which indicates that it selects a good combination of methods. We submitted the data chosen by NDRindex to a hierarchical clustering algorithm and compared the result with five other methods (SC3 [5], pcaReduce [20], SNN-Cliq [21], SINCERA [22] and Seurat [18]) by ARI. As shown in Fig. 5, NDRindex achieves relatively high accuracy and stability.
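For reference, the ARI used in these comparisons can be computed from two label vectors as sketched below. This is a standard-library Python illustration of the usual pair-counting formula, not the code used in the study, and the function name `adjusted_rand_index` is chosen here for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a reference and an inferred clustering."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("need at least two samples")
    # Pairs of samples that share a cluster in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a cluster in each partition separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)
```

The index is 1 for identical partitions regardless of how the clusters are named, close to 0 for random agreement, and can be negative when agreement is worse than chance.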

Discussion
For any RNA-seq data, if at least one combination of normalization method and dimensionality reduction method is available and the user believes that a suitable combination exists among them, NDRindex is applicable, because it evaluates existing combinations of normalization and dimensionality reduction methods to find the best one. If neither a normalization method nor a dimensionality reduction method is defined, or the user cannot be sure that at least one of the combinations processes the data correctly, NDRindex is not applicable. For instance, consider a data set that consists of a homogeneous population of cells. If the user has multiple normalization methods and

Fig. 3 Results of NDRindex on simulated data. Each row shows one type of simulated data we tested; rows 1 to 4 are a two-dimensional normal distribution, a square, a hexagram and a random shape, respectively. In each row, columns a to c show data whose scale decreases in order, and column d is a line graph showing how NDRindex changes as the data scale decreases. As the data become more aggregated, NDRindex always becomes higher.

Fig. 4 Data quality assessment of the data chosen and not chosen by NDRindex. For each dataset, we tested five normalization methods (TMM, Linnorm, scran, Seurat, scale) and three dimensionality reduction methods (tSNE, PCA, Sammon). We took the result of each combination, submitted all twelve of them to four typical clustering methods and benchmarked the clustering results with ARI. Panels a to d show the results of the clustering methods kmeans, hclust, adpclust and ap_clust, respectively. Comparing the data chosen by NDRindex (red rectangle) with the data not chosen, we find that most of the chosen combinations achieve the highest ARI (orange rectangle) and nearly all chosen combinations achieve an ARI above the upper quartile (blue rectangle). This means that NDRindex does select high-quality data that is suitable for clustering.

Fig. 5 Comparison between NDRindex and other RNA-Seq processing methods. We submitted the data chosen by NDRindex to the hclust algorithm and compared the result with five other methods (SC3, pcaReduce, SNN-Cliq, SINCERA, Seurat) by ARI. We ran each method one hundred times; the dots represent the ARI between the inferred clusterings and the reference labels for each run, and the height of each rectangle represents the average ARI.