We tested PCA and AEs to address the integration of different datasets. Specifically, we used all the AEs implemented in our scAEspy tool: VAE [17], an AE based only on the Maximum Mean Discrepancy (MMD) distance (called here MMDAE) [35], MMDVAE, Gaussian-mixture VAE (GMVAE), and two novel Gaussian-mixture AEs that we developed, called GMMMD and GMMMDVAE. In all the performed tests, the constrained versions of the following loss functions were used: Negative Binomial (NB), Poisson, zero-inflated NB (ZINB), and zero-inflated Poisson (ZIP). For GMVAE, GMMMD, and GMMMDVAE, we used a number of Gaussian distributions equal to the number of datasets to integrate. In addition, we tested the following configurations of the hidden layer and latent space to understand how the dimensions of the AEs might affect their performance: (256, 64), (256, 32), (256, 16), (128, 64), (128, 32), (128, 16), (64, 32), and (64, 16), where (H, L) denotes the number of neurons in the hidden layer (H) and in the latent space (L).
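For clarity, the grid of tested configurations can be summarised in a short Python sketch; the `build_autoencoder` factory mentioned below is purely hypothetical and only stands in for the actual scAEspy constructors.

```python
from itertools import product

# Tested AE variants, constrained loss functions, and (hidden, latent) sizes.
AE_TYPES = ["VAE", "MMDAE", "MMDVAE", "GMVAE", "GMMMD", "GMMMDVAE"]
LOSSES = ["NB", "Poisson", "ZINB", "ZIP"]
DIMENSIONS = [(256, 64), (256, 32), (256, 16),
              (128, 64), (128, 32), (128, 16),
              (64, 32), (64, 16)]
N_DATASETS = 2  # number of Gaussians for GMVAE, GMMMD, and GMMMDVAE

for ae_type, loss, (hidden, latent) in product(AE_TYPES, LOSSES, DIMENSIONS):
    # Hypothetical factory call standing in for the scAEspy constructors:
    # model = build_autoencoder(ae_type, loss, hidden, latent, n_gaussians=N_DATASETS)
    print(ae_type, loss, hidden, latent)
```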
In order to deal with possible batch-effects, we applied the following approaches, as suggested in [8, 11], which are among the most widely used batch-effect removal tools in the literature: Batch Balanced k-Nearest Neighbours (BBKNN) [36], Harmony [37], ComBat [38,39,40], and the Seurat implementation of the Canonical Correlation Analysis (CCA) [13]. Thus, we compared vanilla PCA and AEs, PCA and AEs followed by either BBKNN or Harmony (Fig. 2), ComBat, and CCA. For each batch-effect removal tool, we treated the different samples to be merged as batches.
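As an illustration of the PCA branch of this comparison, the following Scanpy-based sketch applies the three Python-accessible correction options (BBKNN, Harmony, and ComBat) to a toy AnnData object; the data, the "batch" labels, and the parameter values are placeholders, the three options are shown one after another only for brevity, and the CCA branch (run through Seurat in R) is omitted. The bbknn and harmonypy packages are required for the first two options.

```python
import numpy as np
import scanpy as sc
import scanpy.external as sce
from anndata import AnnData

# Toy stand-in for the merged datasets: 200 cells x 50 genes,
# with a "batch" column marking the samples to integrate.
adata = AnnData(np.random.poisson(1.0, size=(200, 50)).astype(np.float32))
adata.obs["batch"] = ["sample_1"] * 100 + ["sample_2"] * 100

sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=20)

# Option 1: PCA followed by BBKNN (batch-balanced neighbourhood graph).
sce.pp.bbknn(adata, batch_key="batch")

# Option 2: PCA followed by Harmony (corrected embedding in adata.obsm["X_pca_harmony"]).
sce.pp.harmony_integrate(adata, key="batch")
sc.pp.neighbors(adata, use_rep="X_pca_harmony")

# Option 3: ComBat correction of the expression matrix, followed by PCA again.
sc.pp.combat(adata, key="batch")
sc.pp.pca(adata, n_comps=20)
```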
The proposed strategies were compared on three publicly available datasets, namely: Peripheral Blood Mononuclear Cells (PBMCs), Pancreatic Islet Cells (PICs), and Mouse Cell Atlas (MCA), by using well-known clustering metrics (i.e., Adjusted Rand Index, Adjusted Mutual Information Index, Fowlkes Mallows Index, Homogeneity Score, Completeness Score, and V-Measure). It is worth mentioning that cell-types are generally identified manually by expert biologists, starting from an over- or under-clustering of the data, possibly followed by several rounds of sub-clustering of some clusters. Here, we evaluated how well the different strategies automatically separate the cells when the number of clusters is fixed to the number of cell-types manually identified by the authors of the original papers.
Datasets
Peripheral blood mononuclear cells
PBMCs from eight patients with systemic lupus erythematosus were collected and processed using the \(10\times\) Genomics Chromium platform [41]. The dataset is composed of a control group (6573 cells) and an interferon-\(\beta\) stimulated group (7466 cells). We considered the 8 distinct cell-types identified by the authors following a standard workflow [41]. The count matrices were downloaded from Seurat’s tutorial “Integrating stimulated vs. control PBMC datasets to learn cell-type specific responses” (https://satijalab.org/seurat/v3.0/immune_alignment.html).
Pancreatic islet cells
PIC datasets were generated independently using four different platforms: CEL-Seq [42] (1004 cells), CEL-Seq2 [43] (2285 cells), Fluidigm C1 [44] (638 cells), and Smart-Seq2 [45] (2394 cells). For our tests, we considered the 13 different cell-types across the datasets identified in [46] by applying PCA on the scaled integrated data matrix. The count matrices were downloaded from Seurat’s tutorial “Integration and Label Transfer” (https://satijalab.org/seurat/v3.0/integration.html).
Mouse cell atlas
MCA is composed of two different datasets. The former was generated by Han et al. [47] using Microwell-Seq (4239 cells), while the latter was generated by the Tabula Muris Consortium [48] using Smart-Seq2 (2715 cells). The 11 distinct cell-types with the highest number of cells, which are present in both datasets, were taken into account, as in [11]. The count matrices were downloaded from the public GitHub repository related to [11] (https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking).
Metrics
Adjusted rand index
The Rand Index (RI) is a similarity measure between the results obtained from the application of two different clustering methods. The first clustering method is used as ground truth (i.e., true clusters), while the second one has to be evaluated (i.e., predicted clusters). The RI is calculated by considering all pairs of samples, counting the pairs that are assigned to the same cluster or to different clusters in both the predicted and the true clusterings. The Adjusted RI (ARI) [49] is the “adjusted for chance” version of the RI. Its values vary in the range \([-1,1]\): a value close to 0 means a random assignment, independently of the number of clusters, while a value equal to 1 indicates that the clusters obtained with the two clustering approaches are identical. Negative values are obtained when the index is lower than the expected one.
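In its standard form, the adjustment for chance can be written as
\[
\mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]},
\]
where \(\mathbb{E}[\mathrm{RI}]\) denotes the value of the RI expected under a random assignment of the labels.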
Adjusted mutual information index
The Mutual Information Index (MII) [50] represents the mutual information of two random variables, that is, a measure of their mutual dependence. Specifically, it quantifies the amount of information about one random variable that can be gained by observing the other. The MII is closely related to the entropy of a random variable, which quantifies the expected “amount of information” contained in it. This index is used to measure the similarity between two labellings of the same data. Similarly to the ARI, the Adjusted MII (AMII) is “adjusted for chance”, and its values vary in the range [0, 1].
Fowlkes mallows index
The Fowlkes Mallows Index (FMI) [51] measures the similarity between the clusters obtained by using two different clustering approaches. It is defined as the geometric mean of precision and recall. Assuming that the first clustering approach is the ground truth, precision is the fraction of pairs grouped together by the second approach that are also grouped together in the ground truth, while recall is the fraction of pairs grouped together in the ground truth that are also grouped together by the second approach. The index ranges from 0 to 1.
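In terms of pair counts, denoting by TP the number of pairs of cells grouped together in both clusterings, by FP the pairs grouped together only by the second approach, and by FN the pairs grouped together only in the ground truth, the index can be written as
\[
\mathrm{FMI} = \sqrt{\frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}} = \frac{TP}{\sqrt{(TP+FP)\,(TP+FN)}}.
\]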
Homogeneity score
The result of the tested clustering approach satisfies homogeneity, measured by the Homogeneity Score (HS) [52], if all of its clusters contain only cells that belong to a single cell-type. Its values range from 0 to 1, where 1 indicates a perfectly homogeneous labelling. Notice that switching the true cluster labels with the predicted cluster labels yields the Completeness Score.
Completeness score
The result of the tested clustering approach satisfies completeness, measured by the Completeness Score (CS) [52], if all the cells that belong to a given cell-type are assigned to the same cluster. Its values range from 0 to 1, where 1 indicates a perfectly complete labelling. Notice that switching the true cluster labels with the predicted cluster labels yields the HS.
V-measure
The V-Measure (VM) [53] is the harmonic mean of HS and CS; it is equivalent to the normalised MII when the arithmetic mean is used as the aggregation function.
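All of the metrics described above are available in scikit-learn; the following minimal sketch, using toy labels rather than the actual annotations, shows how they could be computed from the manually annotated cell-types and the predicted clusters.

```python
from sklearn import metrics

# Toy example: ground-truth cell-types vs. predicted cluster assignments.
true_labels = ["B", "B", "T", "T", "NK", "NK"]
predicted_labels = [0, 0, 1, 1, 1, 2]

scores = {
    "ARI": metrics.adjusted_rand_score(true_labels, predicted_labels),
    "AMII": metrics.adjusted_mutual_info_score(true_labels, predicted_labels),
    "FMI": metrics.fowlkes_mallows_score(true_labels, predicted_labels),
    "HS": metrics.homogeneity_score(true_labels, predicted_labels),
    "CS": metrics.completeness_score(true_labels, predicted_labels),
    "VM": metrics.v_measure_score(true_labels, predicted_labels),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```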
Integration of multiple datasets obtained with the same sequencing platforms
Various scRNA-Seq platforms are currently available (e.g., droplet-based and plate-based [54,55,56,57,58,59,60,61,62,63]), and their integration is often challenging due to differences between biological sample batches as well as between the experimental platforms used. To test whether AEs can be effectively applied to combine multiple datasets generated using the same platform but under different experimental conditions, we used the PBMC datasets.
We merged the control and treated datasets by first using vanilla PCA and AEs, and then PCA and AEs followed by either BBKNN or Harmony, ComBat, and CCA. After the construction of the neighbourhood graphs, we performed a clustering step by using the Leiden algorithm [64]. Since 8 different cell-types were manually identified in the original paper [41], we selected the Leiden resolutions that allowed us to obtain 8 distinct clusters and calculated all the metrics described above. In what follows, the calculated values of all metrics are given in percentages (mean ± standard deviation). For each metric, the higher the value, the better the result.
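A possible sketch of this resolution search is shown below; it uses Scanpy's bundled pbmc68k_reduced dataset as a stand-in for the corrected PBMC data, and the resolution grid is purely illustrative (the leidenalg package is required).

```python
import numpy as np
import scanpy as sc

TARGET_CLUSTERS = 8  # number of manually identified cell-types in the PBMC datasets

# Stand-in data with a neighbourhood graph; in the actual analysis the graph
# would come from the AE (or PCA) space, possibly corrected by BBKNN/Harmony.
adata = sc.datasets.pbmc68k_reduced()
sc.pp.neighbors(adata)

# Scan a grid of Leiden resolutions and keep those yielding the target number of clusters.
selected_resolutions = []
for resolution in np.arange(0.1, 2.0, 0.05):
    sc.tl.leiden(adata, resolution=float(resolution), key_added="leiden_tmp")
    if adata.obs["leiden_tmp"].nunique() == TARGET_CLUSTERS:
        selected_resolutions.append(float(resolution))

print(selected_resolutions)
```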
Our analysis showed that the CCA-based approach, proposed in the Seurat library, achieved a mean ARI equal to \(73.49\%\) (\(\pm 1.52\%\)), ComBat reached a mean ARI of \(72.84\%\) (\(\pm 0.78\%\)), vanilla PCA had a mean ARI of \(68.90\%\) (\(\pm 0.86\%\)), PCA followed by BBKNN was able to obtain a mean ARI of \(83.65\%\) (\(\pm 0.81\%\)), while PCA followed by Harmony reached a mean ARI of \(82.83\%\) (\(\pm 1.20\%\)), as shown in Fig. 3a. Among all the tested AEs, MMDAE followed by Harmony (using the NB loss function, 256 neurons for the hidden layer, and 32 neurons for the latent space) achieved the best results, with a mean ARI equal to \(87.18\%\) (\(\pm 0.49\%\)). In order to assess whether the results obtained by the best AE were statistically different from those of the other approaches, we applied the Mann–Whitney U test with the Bonferroni correction [65,66,67]. In all the comparisons, MMDAE followed by Harmony had a p-value lower than 0.0001, confirming that the achieved results are statistically different from those achieved by the other approaches.
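The statistical comparison can be reproduced with SciPy, as sketched below; the ARI samples are illustrative placeholders, not the values reported above.

```python
from scipy.stats import mannwhitneyu

# Illustrative ARI values (%) from repeated runs; placeholders only.
ari_best_ae = [87.4, 86.9, 87.1, 87.6, 86.8]          # e.g., MMDAE + Harmony
competitors = {
    "PCA+BBKNN": [83.1, 84.2, 83.9, 83.4, 83.7],
    "PCA+Harmony": [82.1, 83.9, 82.5, 83.0, 84.1],
    "ComBat": [72.3, 73.5, 72.9, 71.8, 73.1],
}

alpha = 0.05
bonferroni_alpha = alpha / len(competitors)  # Bonferroni correction
for name, values in competitors.items():
    _, p_value = mannwhitneyu(ari_best_ae, values, alternative="two-sided")
    print(f"{name}: p={p_value:.4f}, significant={p_value < bonferroni_alpha}")
```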
Regarding the AMII, CCA had a mean value of \(66.46\%\) (\(\pm 0.50\%\)), ComBat achieved a mean value of \(70.95\%\) (\(\pm 0.82\%\)), vanilla PCA obtained a mean value of \(68.44\%\) (\(\pm 1.00\%\)), PCA followed by BBKNN reached a mean value of \(75.22\%\) (\(\pm 0.76\%\)), while PCA followed by Harmony reached a mean value of \(74.55\%\) (\(\pm 1.19\%\)). MMDAE followed by Harmony achieved the best results, with a mean value equal to \(78.61\%\) (\(\pm 0.29\%\)). MMDAE followed by Harmony also outperformed the other strategies in terms of FMI, HS, CS, and VM (see Additional file 2 and Fig. 4).
We also compared the results obtained by the best AE for each of the tested dimensions (H, L) in terms of ARI (Fig. 3b). GMMMD followed by Harmony (using the NB loss function) obtained the best results for the dimension (64, 16), GMMMD followed by Harmony (using the Poisson loss function) reached the best results for the dimensions (128, 16) and (256, 64), and GMMMD followed by BBKNN (using the NB loss function) achieved the best results for the dimension (128, 32). MMDAE followed by Harmony (using the NB loss function) was able to reach the best results for the dimensions (64, 32), (256, 16), and (256, 32), while MMDAE followed by Harmony (using the Poisson loss function) obtained the best results for the dimension (128, 64). Notice that we used two Gaussian distributions because we merged two different datasets.
In order to visually assess the quality of the separation of the manually annotated cell-types and of the identified clusters, we plotted them in the UMAP space generated from the MMDAE followed by Harmony embedding (Fig. 3c, d). Finally, we also plotted the two samples in the same UMAP space to visually assess the quality of their alignment (Fig. 5a); this plot confirms that the batch-effects were completely removed.
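The UMAP visualisations can be produced directly in Scanpy, as in the sketch below; the stand-in dataset, the synthetic "batch" column, and the column names are illustrative, whereas in the actual analysis the neighbourhood graph would be the MMDAE followed by Harmony one.

```python
import numpy as np
import scanpy as sc

# Stand-in data; the "cell_type" and "batch" columns are illustrative.
adata = sc.datasets.pbmc68k_reduced()
adata.obs["cell_type"] = adata.obs["bulk_labels"]
adata.obs["batch"] = np.random.choice(["control", "stimulated"], size=adata.n_obs)

sc.pp.neighbors(adata)
sc.tl.leiden(adata)

# UMAP embedding coloured by annotated cell-type, cluster, and sample of origin.
sc.tl.umap(adata)
sc.pl.umap(adata, color=["cell_type", "leiden", "batch"])
```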
Our analysis showed that clustering the neighbourhood graph generated from the AE latent spaces allowed for a better identification of the existing cell-types compared to the other approaches, thus confirming the ARI results.
Integration of multiple datasets obtained with different sequencing platforms
Combining datasets from different studies and scRNA-Seq platforms can be a powerful approach to obtain complete information about the biological system under investigation. However, when datasets generated with different platforms are combined, the high variability in the gene expression matrices can obscure the existing biological relationships. For example, the gene expression values are much higher in data acquired with plate-based methods (up to millions) than in those acquired with droplet-based methods (a few thousand). Thus, combining gene expression data that span several orders of magnitude is a difficult task that cannot be tackled by using linear approaches like PCA. To examine how well AEs perform in solving this issue, we combined four PIC datasets acquired with the CEL-Seq [56], CEL-Seq2 [57], Fluidigm C1 [63], and Smart-Seq2 [62] protocols.
We integrated the datasets by first using vanilla PCA and AEs, and then PCA and AEs followed by either BBKNN or Harmony, ComBat, and CCA. Since 13 cell-types were manually annotated for the PIC datasets in the original paper [46], we clustered the neighbourhood graphs using the Leiden algorithm, considering only the resolutions that allowed us to obtain 13 distinct clusters. We then calculated the ARI, AMII, FMI, HS, CS, and VM metrics. The calculated values of all metrics are given in percentages (mean ± standard deviation); for each metric, the higher the value, the better the result.
CCA had a very low mean ARI, i.e., \(5.45\%\) (\(\pm 0.22\%\)), ComBat obtained a mean ARI of \(76.20\%\) (\(\pm 3.06\%\)), vanilla PCA achieved a mean ARI of \(61.38\%\) (\(\pm 0.09\%\)), PCA followed by BBKNN reached a mean ARI of \(71.49\%\) (\(\pm 0.58\%\)), while PCA followed by Harmony was able to obtain a mean ARI of \(94.00\%\) (\(\pm 0.36\%\)), as shown in Fig. 6a. GMMMD followed by Harmony (using the NB loss function, 256 neurons for the hidden layer, and 32 neurons for the latent space) outperformed the other AEs, achieving a mean ARI equal to \(94.23\%\) (\(\pm 0.12\%\)). In all the comparisons, except for the one against PCA followed by Harmony, GMMMD followed by Harmony had a p-value lower than 0.0001, confirming that the achieved results are statistically different with respect to those obtained by the other approaches.
Similar results were achieved for the AMII metric: CCA reached a mean value equal to \(16.57\%\) (\(\pm 0.33\%\)), ComBat obtained a mean value of \(76.11\%\) (\(\pm 1.14\%\)), vanilla PCA reached a mean value of \(71.70\%\) (\(\pm 0.11\%\)), PCA followed by BBKNN achieved a mean value of \(77.55\%\) (\(\pm 0.65\%\)) and PCA followed by Harmony a mean value of \(91.17\%\) (\(\pm 0.47\%\)), while GMMMD followed by Harmony was able to reach a mean value equal to \(89.37\%\) (\(\pm 0.02\%\)). Considering the other measures, both PCA and GMMMD followed by Harmony obtained very similar results, outperforming the other strategies (see Additional file 3 and Fig. 7).
Considering the best AE for each of the tested dimensions (H, L) in terms of ARI (see Fig. 6b), GMMMD followed by Harmony (using the NB loss function) proved to be the best choice for the dimensions (128, 64) and (256, 32), while it obtained the best results for the dimension (64, 32) when the Poisson loss function was used. GMVAE followed by Harmony (using the Poisson loss function) reached the best results for the dimensions (128, 32) and (256, 16). MMDAE followed by Harmony achieved the best results for the dimensions (256, 64) and (64, 16), exploiting the NB loss function and the Poisson loss function, respectively. Finally, VAE followed by Harmony obtained the best results with the Poisson loss function for the dimension (128, 16). Note that we exploited four Gaussian distributions because we merged four different datasets.
The quality of the separation of the manually annotated cell-types and of the identified clusters can be visually evaluated in Fig. 6c, d. We finally visualised the cells (coloured by platform) using the UMAP space generated from the GMMMD followed by Harmony embedding (Fig. 5b) to confirm that the batch-effects among the samples sequenced with different platforms were correctly removed. Taken together, our analysis shows that GMMMD followed by Harmony can efficiently identify the “shared” cell-types across the different platforms owing to its ability to deal with the high variability in the gene expression matrices. We would like to highlight that PCA followed by Harmony was capable of achieving good results because the original clusters were obtained by applying a similar pipeline [46].
As a final test, we combined the two MCA datasets acquired with the Microwell-Seq [47] and Smart-Seq2 [62] protocols. We integrated the datasets in the same way as in the other two tests. We clustered the neighbourhood graphs using the Leiden algorithm, considering only the resolutions that allowed us to obtain 11 distinct clusters, because 11 distinct cell-types were manually annotated for the MCA datasets [47]. We then calculated all the metrics.
In such a case, MMDVAE followed by Harmony (using the Poisson loss function, 256 neurons for the hidden layer, and 64 neurons for the latent space) outperformed the other AEs as well as the other strategies, obtaining a mean ARI equal to \(79.50\%\) (\(\pm 0.02\%\)), as shown in Fig. 8a. ComBat achieved the worst mean ARI, i.e., \(54.13\%\) (\(\pm 4.22\%\)), CCA reached a mean ARI of \(57.62\%\) (\(\pm 0.75\%\)), vanilla PCA obtained a mean ARI of \(67.29\%\) (\(\pm 11.55\%\)), PCA followed by BBKNN had a similar mean ARI, that is, \(67.73\%\) (\(\pm 3.98\%\)), while PCA followed by Harmony achieved a mean ARI of \(66.08\%\) (\(\pm 0.11\%\)). MMDVAE followed by Harmony had a p-value lower than 0.0001 in all the tested comparisons. Considering the other metrics, MMDVAE followed by Harmony generally obtained better results compared to the other strategies (see Additional file 4, Figs. 8b and 9).
Comparing the best AE for each of the tested dimensions (H, L) in terms of ARI, the vanilla GMMMD with the NB loss function obtained the best results for the dimension (128, 16), while GMMMD followed by Harmony reached the best results for the dimensions (256, 32) and (64, 16), exploiting the Poisson loss function and the NB loss function, respectively. MMDAE followed by BBKNN (using the ZIP loss function) achieved the best results for the dimensions (256, 16) and (128, 64). MMDVAE followed by Harmony proved to be the best choice for the dimensions (128, 16) and (256, 64) when coupled with the NB loss function and the Poisson loss function, respectively. Finally, VAE followed by Harmony with the Poisson loss function obtained the best results for the dimension (64, 32). As for the integration of the PBMC datasets, we used two Gaussian distributions because we merged two different datasets.
Figure 8c, d show the UMAP embedding generated from the MMDVAE followed by Harmony space, coloured by the manually annotated cell-types and by the identified clusters, respectively, while Fig. 5c depicts the cells coloured by platform in the same UMAP space, confirming that the batch-effects between the two samples were correctly removed. In this case, the achieved results show that MMDVAE followed by Harmony was able to better identify the “shared” cell-types across the different platforms.
In order to test the computational performance of scAEspy, we considered a VAE with a single hidden layer composed of 256 neurons, a latent space of 32 neurons, and the NB loss function. All tests were executed on a laptop equipped with an Intel Core i7-1165G7 CPU (clock 2.8 GHz) and 16 GB of RAM, running Ubuntu 20.04 LTS. In all the tested datasets, we used the top 1000 highly variable genes (HVGs). The executions lasted 95.02 seconds, 101.05 seconds, and 223.45 seconds for PIC, MCA, and PBMC, respectively. Considering that PIC, MCA, and PBMC are composed of 6321, 6954, and 14039 cells, respectively, these timings indicate that scAEspy scales roughly linearly with the number of cells.
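This kind of measurement can be reproduced by wrapping the training call with a wall-clock timer, as in the sketch below; `train_vae` is a hypothetical stand-in for the actual scAEspy call, and the random count matrix only mimics the size of the PBMC data.

```python
import time
import numpy as np

def train_vae(counts):
    """Hypothetical stand-in for training the VAE (256 hidden neurons,
    32 latent neurons, NB loss); replace with the actual scAEspy call."""
    return None

# Stand-in count matrix: 14039 cells x top 1000 HVGs.
counts_matrix = np.random.poisson(1.0, size=(14039, 1000))

start = time.perf_counter()
train_vae(counts_matrix)
elapsed = time.perf_counter() - start
print(f"Training took {elapsed:.2f} seconds")
```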