scAnnotatR: framework to accurately classify cell types in single-cell RNA-sequencing data

Background: Automatic cell type identification is essential to alleviate a key bottleneck in scRNA-seq data analysis. While most existing classification tools show good sensitivity and specificity, they often fail to adequately leave unclassified those cells that are missing from the used reference. Additionally, many tools do not scale to the continuously increasing size of current scRNA-seq datasets. Therefore, additional tools are needed to solve these challenges.

Results: scAnnotatR is a novel R package that provides a complete framework to classify cells in scRNA-seq datasets using pre-trained classifiers. It supports both Seurat and Bioconductor's SingleCellExperiment and is thereby compatible with the vast majority of R-based analysis workflows. scAnnotatR uses hierarchically organised SVMs to distinguish a specific cell type versus all others. It shows comparable or even superior accuracy, sensitivity and specificity compared to existing tools while being able to leave unknown cell types unclassified. Moreover, scAnnotatR is the only one of the best-performing tools able to process datasets containing more than 600,000 cells.

Conclusions: scAnnotatR is freely available on GitHub (https://github.com/grisslab/scAnnotatR) and through Bioconductor (from version 3.14). It is consistently among the best-performing tools in terms of classification accuracy while scaling to the largest datasets.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04574-5.


Supplementary Methods
The complete code to 1) prepare the test datasets, 2) run all tools, 3) evaluate the benchmark results, and 4) create all benchmark figures of the manuscript is available on CodeOcean at https://doi.org/10.24433/CO.8414972.v1.

Classification process
The core classification task is performed using linear SVM-based learning models on predefined sets of features through the caret R package. The SVM cost parameter C was kept constant at 1. Cross-validation and down-sampling were handled by caret during training. scAnnotatR further supports the concept of "child" and "parent" classifiers. Child classifiers are trained on and applied only to cells that were already positively classified by the respective parent classifier.
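For illustration, a minimal sketch of such a training call via caret; the object names (`train_data`, `labels`) and the number of cross-validation folds are assumptions, while the linear SVM, the fixed cost C = 1, cross-validation and down-sampling follow the description above:

```r
library(caret)

# Cross-validation and down-sampling are handled by caret during training;
# the number of folds here is an assumption.
fit_control <- trainControl(method = "cv", number = 10,
                            sampling = "down", classProbs = TRUE)

# Linear SVM on a predefined feature matrix `train_data` with binary labels
# `labels` (a factor with valid level names, e.g. "yes"/"no");
# the cost parameter C is fixed at 1.
svm_model <- train(x = train_data, y = labels,
                   method = "svmLinear",
                   trControl = fit_control,
                   tuneGrid = data.frame(C = 1))
```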
Internally, scAnnotatR continuously ensures that the classifier database is consistent and that child classifiers are only added to a database if the respective parent classifier is present. This structure can be visualized using the "visualize_tree" function, which uses the data.tree R package to store the hierarchical structure of cell types. Taking the database location provided by the user, the package automatically loads the database, retrieves the classifiers, derives the cell type relationships and creates a tree of cell types. Nodes of the tree correspond to main or parent cell types, and leaves correspond to cell types without children. Thereby, the user can get a quick overview of all available cell types.

scAnnotatR supports both Seurat and Bioconductor's SingleCellExperiment class objects as input to its central "classify_cells" function. The classification result is returned as new metadata slots in the input object storing the cell type(s) and the respective classification probabilities. This includes possible ambiguous cell type assignments with all corresponding candidates, the classification probability for every cell type, as well as the most probable cell type. The classification results can therefore be directly visualized and further analysed using the standard plotting functions and workflows of the respective scRNA-seq pipeline.
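As a sketch of a typical call, assuming a Seurat object `seu` and the models shipped with the package (argument and metadata column names follow the package vignette and should be checked against the installed version):

```r
library(scAnnotatR)

# Classify cells using all pre-trained classifiers shipped with the package
seu <- classify_cells(classify_obj = seu,
                      assay = "RNA",
                      cell_types = "all",
                      path_to_models = "default")

# The predictions are stored in the object's metadata and can be plotted
# with the pipeline's standard functions (column name per the vignette):
Seurat::DimPlot(seu, group.by = "most_probable_cell_type")
```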
Training new classifier process
scAnnotatR simplifies the task of training and evaluating new classifiers. The training process ("train_classifier" function) supports both Seurat and SingleCellExperiment objects as input and produces an scAnnotatR object. The user only has to specify which features are used for the training process. More general markers, which are expressed by a large cell population, increase the sensitivity of the models but decrease their specificity. More specific markers help to better distinguish the target cell type from all others but decrease the sensitivity of the trained models. In some cases, negative markers are also used to differentiate between closely related cell types. In low-depth sequencing data, the expression levels of marker genes may show very different distributions. To address this issue, the train_classifier function automatically performs a z-score transformation on the input data. A balancing process ensures that an equal number of cells is present in the target and the non-target class. Finally, the caret training function is used to train the classifier. Thereby, all steps required to train a new classifier are available in a single, simple-to-use function.
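A minimal sketch of such a training call, with a hypothetical marker set and a hypothetical training object `train_set` whose true labels are stored in the active identities (argument names as in the package vignette):

```r
library(scAnnotatR)

# Hypothetical marker genes for a B cell classifier
marker_genes_B <- c("CD19", "MS4A1", "CD79A", "CD79B")

set.seed(123)
clf_B <- train_classifier(train_obj = train_set,
                          cell_type = "B cells",
                          marker_genes = marker_genes_B,
                          assay = "RNA",
                          tag_slot = "active.ident")
```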
The hierarchical structure of classifiers is one of the key features of scAnnotatR. It allows us to 1) better reflect the natural biological relationships between cell types and/or between cell types and phenotypes, and 2) increase the prediction accuracy. For instance, a CD4+ T cell must be classified as positive by the T cell classifier, and a cell classified as negative by the T cell classifier can never be a CD4+ T cell. Moreover, the hierarchy enables us to select the most appropriate markers at each level. For example, marker genes helpful in differentiating T cells from all other cells are not helpful in differentiating CD4+ from CD8+ T cells. Finally, this feature may allow us to recognize unexpected sub-populations as mentioned in the manuscript (such as T cells not further classified).
The parent-child relationship of two classifiers is defined during the training process. Most commonly, this follows the developmental trajectory of cells. There are cases, though, where this is not directly possible. Here we recommend basing this decision on the most distinct markers available. For example, central memory CD4+ T cells could be defined as a child of either CD4+ T cells or central memory T cells. Since it is easier to differentiate CD4+ from CD8+ T cells, we recommend first training classifiers for these cell types and then creating child classifiers to identify the final central memory CD4+ T cells.
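Continuing the sketch above, a child classifier can be trained by passing the parent classifier during training (the `parent_classifier` argument follows the package vignette; the marker sets are hypothetical, canonical choices):

```r
# Hypothetical canonical markers
marker_genes_T   <- c("CD3D", "CD3E", "CD3G")
marker_genes_CD4 <- c("CD4", "IL7R")

# Parent: T cell classifier
clf_T <- train_classifier(train_obj = train_set, cell_type = "T cells",
                          marker_genes = marker_genes_T,
                          tag_slot = "active.ident")

# Child: CD4+ T cell classifier, trained on and applied only to cells
# positively classified by the parent T cell classifier
clf_CD4 <- train_classifier(train_obj = train_set, cell_type = "CD4+ T cells",
                            marker_genes = marker_genes_CD4,
                            tag_slot = "active.ident",
                            parent_classifier = clf_T)
```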
It is worth mentioning that the efficiency of the hierarchical classifier structure increases with the number of markers that parent and child share. Therefore, a fully differentiated phenotype or a highly distinctive subtype may not be suitable as a child of any other cell type if it has lost the majority of the markers found in its relatives. For instance, Langerhans cells (LC) have been proposed to be a subset of macrophages [1]. However, LC markers differ considerably from those of this potential parent: LCs express langerin (CD207) and CD1a, whereas macrophages express CD14 and/or CD16. In this case, we propose keeping the LC classifier independent of the macrophage classifier.
For that reason, we give users the option to train independent classifiers in addition to hierarchically structured ones.
The testing process through the "test_classifier" function follows a similar process as the training function. Using an independent test dataset as input, the test_classifier function calculates an overall AUC score as well as the accuracy, sensitivity and specificity of the classifier at the default probability threshold (0.5) and at thresholds from 0.1 to 0.9 in steps of 0.1, which simplifies tuning of the probability threshold. A setter for the prediction probability threshold is available to adjust the sensitivity and specificity of the trained models based on the specific research question or the user's particular interest.
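A sketch of the corresponding test call on an independent dataset `test_set`, using the classifier trained above; the threshold setter follows the package documentation and should be verified against the installed version:

```r
clf_B_test <- test_classifier(test_obj = test_set,
                              classifier = clf_B,
                              tag_slot = "active.ident")

# Adjust the prediction probability threshold, e.g. to favour specificity
p_thres(clf_B) <- 0.6
```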
Once the classification model's performance meets the user's expectations, scAnnotatR provides several functions to store the classifiers in a common database. These functions ensure that the database remains consistent with respect to parent and child classifiers. Thereby, scAnnotatR provides a complete infrastructure to train and evaluate new classifiers.
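For example, a newly trained classifier can be added to a local model database as sketched below (function and argument names as in the package vignette):

```r
# Store the new classifier in a local model database; scAnnotatR checks
# that the respective parent classifier is already present.
save_new_model(new_model = clf_B,
               path_to_models = tempdir(),   # placeholder location
               include.default = FALSE)
```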

Datasets
To benchmark the scalability and performance of the tools, we used five datasets: the joint Ding PBMC dataset, the Reynolds skin cell atlas, the Stephenson Covid-19 cell atlas, the Zilionis lung dataset, and the HIV1 dataset.

Data preprocessing and cell type assignment
scRNA-seq data (TPM counts) was downloaded from GEO for the Sade-Feldman and the Jerby-Arnon datasets. For the Sade-Feldman melanoma dataset, we preprocessed the data following the authors' approach: First, we filtered out mitochondrial genes. We then retained only cells expressing at least 1,000 features and only features expressed in at least 3 cells. Finally, we removed low-quality cells whose average housekeeping gene expression was below log2(TPM + 1) = 2.5. For the Jerby-Arnon dataset, we eliminated mitochondrial genes and filtered out cells with fewer than 1,000 expressed genes and genes expressed in fewer than 3 cells. The datasets were then normalized and scaled using the basic Seurat v3 pipeline, regressing out confounders such as patient (Sade-Feldman dataset) or sample and cohort (Jerby-Arnon dataset). Afterwards, the dimensionality of the data was reduced to the first 40 (Sade-Feldman dataset) or 45 (Jerby-Arnon dataset) principal components.
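As a sketch of this preprocessing for the Sade-Feldman dataset, assuming a TPM matrix `tpm` and a metadata column `patient` (object and column names are assumptions):

```r
library(Seurat)

seu <- CreateSeuratObject(counts = tpm, min.features = 1000, min.cells = 3)
seu <- NormalizeData(seu)
seu <- FindVariableFeatures(seu)
seu <- ScaleData(seu, vars.to.regress = "patient")  # regress out the patient confounder
seu <- RunPCA(seu, npcs = 40)                       # first 40 PCs for this dataset
```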
Clustering was performed with default parameters. Cell annotations in these datasets were manually assigned at the cluster level based on known canonical markers.
The HIV1 dataset was downloaded from the Single Cell Portal [21] and processed using Seurat's SCTransform pipeline, regressing out the donor identity. 50 principal components were used for clustering, and the cell labels provided by the authors were integrated into our Seurat object.
The human lung dataset by Zilionis et al. [22] is available through the scRNAseq R package [17] in the form of an SCE object. This dataset was used only for the runtime benchmark, so we kept as many cells as possible: only 73 cells with no expression across all genes were removed from the SCE object. The subsequent processing steps followed the pipeline proposed in Orchestrating Single-Cell Analysis with Bioconductor, Chapter 42 [23]. Specifically, the dataset was log-normalized by library size. We then blocked on the sample information (Library in the cell metadata) to remove batch effects while selecting the top 5,000 highly variable genes. We used fastMNN from the batchelor package to correct the batch effect at the level of the reduced dimensions. The MNN-corrected dimensions were then used for UMAP generation and cell clustering. The analysis stopped after the clustering step.
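A condensed sketch of this pipeline, assuming an SCE object `sce` with a `Library` column in its cell metadata (the clustering algorithm and the seed are assumptions):

```r
library(scater)
library(scran)
library(batchelor)

sce <- logNormCounts(sce)                                  # library-size log normalization
dec <- modelGeneVar(sce, block = sce$Library)              # block on sample when modelling variance
hvg <- getTopHVGs(dec, n = 5000)                           # top 5,000 highly variable genes

set.seed(42)
mnn <- fastMNN(sce, batch = sce$Library, subset.row = hvg) # batch correction on reduced dimensions
reducedDim(sce, "MNN") <- reducedDim(mnn, "corrected")

sce <- runUMAP(sce, dimred = "MNN")                        # UMAP on MNN-corrected dimensions
g <- buildSNNGraph(sce, use.dimred = "MNN")
sce$cluster <- factor(igraph::cluster_walktrap(g)$membership)
```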
The Reynolds skin cell atlas was downloaded via the Development Cell Atlas portal [24] in H5AD format. Using the scanpy workflow, we loaded the dataset into an AnnData object and processed and analyzed it based on the authors' description. We then converted the resulting AnnData object (without the embeddings) to a SingleCellExperiment object using the anndata2ri library in Python.
The SCE objects of the HIV1 dataset, the joint Ding PBMC dataset, the Zilionis lung dataset, and the Reynolds skin cell atlas were converted to Seurat objects, retaining the original PCA information, to additionally benchmark scPred.
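A sketch of such a conversion, assuming an SCE `sce` that holds counts, log-counts and a PCA embedding in its reducedDims:

```r
library(Seurat)

# Convert the SCE to a Seurat object; reduced dimensions such as the
# original PCA are carried over and can be reused by scPred.
seu <- as.Seurat(sce, counts = "counts", data = "logcounts")
```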
The Stephenson Covid-19 cell atlas data was log normalized and stored as an AnnData object in an H5AD-formatted file. As for the Reynolds skin cell atlas, this AnnData object was then converted into a SingleCellExperiment object for further processing in R. To make this dataset usable with scPred, we additionally performed the principal component analysis within the Bioconductor pipeline and finally converted the analyzed SCE object to a Seurat object.

Pretrained learning models for basic immune cells
Training and testing of classifiers were performed using the package's inbuilt functions (see above). We used multiple datasets to train and test our package's inbuilt classifiers.

Benchmarking discrete populations using the pancreas datasets
This benchmark was performed using the six pancreas datasets and a 6-fold cross-validation scheme. In each fold, one of the six datasets was used for training and the other five for testing.
For scAnnotatR, we used the same set of features to train the pancreatic alpha, beta, delta, gamma, ductal, and acinar classifiers throughout the six folds. For scClassify, classifiers were trained following the basic pipeline, using limma for feature selection, Pearson correlation as the similarity measure, and WKNN as the KNN algorithm. As Garnett requires a marker gene list in addition to a training set, we used two benchmarking strategies: in the first approach, we applied the same list of marker genes used for scAnnotatR to the six different training sets across the six folds; the second approach used the top 10 differentially expressed genes of the cell groups in the corresponding training set. SCINA uses only marker genes, so it received the same top 10 differentially expressed markers as Garnett's second benchmark. For all datasets, we used Seurat's FindAllMarkers function to find the top 10 differentially expressed genes. For scPred, the originally preprocessed data was provided as input in the form of a Seurat object, as scPred does not support other input objects. Additionally, we evaluated scPred after re-processing all datasets with Seurat's SCTransform pipeline throughout. The scpred_prediction data slot returned by the scPredict function was used as scPred's classification. For all other methods, we used the standard pipeline provided by the authors for training and testing. Cells with ambiguous assignments in the source data were excluded from the training and testing process.
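For example, the top 10 marker genes per cell group can be extracted as sketched below, assuming a Seurat object `train_seu` whose identities are set to the annotated cell types (the ranking column is `avg_log2FC` in recent Seurat versions; Seurat v3 used `avg_logFC`):

```r
library(Seurat)
library(dplyr)

markers <- FindAllMarkers(train_seu, only.pos = TRUE)
top10 <- markers %>%
  group_by(cluster) %>%
  slice_max(avg_log2FC, n = 10)   # 10 most differentially expressed genes per group
```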
The benchmark assesses three aspects of cell classification performance: 1) the overall accuracy of the classification, 2) how well each specific cell type was recognized, and 3) how often cells of types that were not part of the reference were misclassified.
The accuracy was calculated based on the total number of correct assignments relative to the total number of cells. Cells belonging to populations absent from the reference should remain unassigned; here, we refer to the proportion of such cells that are correctly left unassigned as the 'unknown population detection rate':

Equation 4:
$$\text{unknown population detection rate} = \frac{\text{number of cells of unknown populations left unassigned}}{\text{total number of cells of unknown populations}}$$
Ambiguous/intermediate cell assignments do not affect the unknown population detection rate. Therefore, only one unknown population detection rate is reported for cell classification tools providing ambiguous/intermediate predictions.
To ensure a fair comparison between tools regarding the unknown population detection rate, we additionally applied a confidence threshold to SciBet's predictions. Based on SciBet's documentation, we treated all predictions with a confidence below 0.4 (the default value) as 'unassigned'.
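As a sketch, assuming vectors `pred` and `conf` holding SciBet's predicted labels and the corresponding confidence scores (the exact structure of SciBet's output is an assumption here):

```r
# Treat low-confidence predictions as 'unassigned' (threshold 0.4)
pred[conf < 0.4] <- "unassigned"
```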

Benchmarking closely related populations
This benchmark was performed in a multiple-fold cross-validation scheme using two levels of annotation. In level one, the datasets are annotated with immune cell types at a low level of detail, while in level two, sub-cell types are annotated. Ten PBMC datasets were used in this benchmark: the PBMC 3k dataset from Zheng et al. [13], analyzed by Seurat v3.1 [19]; the PBMC 500 dataset, a subset of the PBMC 3k dataset preprocessed and analyzed by ILoReg v1.0 [15]; the PBMC dataset by Ding et al. [12], containing seven subsets corresponding to seven different sequencing protocols (Smartseq2, SeqWell, inDrop, CelSeq2, DropSeq, 10xv2, and 10xv3); and the SCP345 PBMC dataset [14]. The latter could not be included in the second level of the benchmark as it lacks detailed immune cell type annotations.
In both benchmark levels, scAnnotatR used the same feature sets as the in-built classifiers for the respective cell types. For scClassify, classifiers were trained using limma for feature selection, Pearson correlation as the similarity measure, and WKNN as the KNN algorithm. Garnett and SCINA used the top 10 differentially expressed genes found by Seurat's FindAllMarkers function. scPred was further evaluated using two additional approaches: one where training and testing datasets were processed by the same pipeline, and one where, additionally, all classifiers were optimized manually. The training and optimization process was run with all scPred default parameters. The scpred_prediction data slot returned by the scPredict function was used as scPred's classification. For all other methods, we used the standard pipeline provided by the authors for training and testing.
Cell types were classified in each dataset, and the overall accuracy, sensitivity and specificity were calculated per cell type as described above. Results are reported as averages across all cell types per dataset. For each dataset, we further assessed the unknown population detection rate (see above).
Several tools organise cell types in a hierarchical structure. For scAnnotatR and Garnett, the hierarchical structure of cell types was defined by the user during the training process (scAnnotatR) or when defining markers (Garnett). In level two of the benchmark, these two tools classify major/parent cell types (T cells, monocytes) before classifying their subtypes (CD4+/CD8+ T cells, CD14+/CD16+ monocytes). In this case, cells correctly assigned to their major cell type (only one predicted cell type per cell) but not classified by any of the child classifiers are also considered correct classifications.

Runtime estimation
We assessed the runtime on five specifically selected large datasets, ranging from 31,021 to 647,366 cells: the joint Ding PBMC dataset [12], the Reynolds skin cell atlas [4], the Stephenson Covid-19 cell atlas [5], the Zilionis lung dataset [22], and the HIV1 dataset [21]. The Sade-Feldman melanoma dataset was used for training. For Garnett and SCINA, the top 10 differentially expressed genes of the cell groups from the Sade-Feldman melanoma dataset were provided. For scAnnotatR, all package in-built models were applied. For scClassify classifier training, we tested both non-ensemble and ensemble learning; since the resulting predictions remained the same, we used the non-ensemble model in the runtime benchmark. For tools with separate training and prediction processes, we recorded only the time consumed by the prediction. For the other tools, the runtime of the main classification functions was recorded, even though those functions may integrate internal training steps. The prediction runtime was recorded as wall clock time.
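Wall clock time can be recorded as sketched below, shown here for scAnnotatR's prediction step (object names are assumptions):

```r
timing <- system.time(
  seu <- classify_cells(classify_obj = seu,
                        cell_types = "all",
                        path_to_models = "default")
)
timing["elapsed"]   # wall clock time in seconds
```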