Analysis of single-cell RNA sequencing data based on autoencoders

Background Single-cell RNA sequencing (scRNA-Seq) experiments are gaining ground to study the molecular processes that drive normal development as well as the onset of different pathologies. Finding an effective and efficient low-dimensional representation of the data is one of the most important steps in the downstream analysis of scRNA-Seq data, as it could provide a better identification of known or putatively novel cell-types. Another step that still poses a challenge is the integration of different scRNA-Seq datasets. Though standard computational pipelines to gain knowledge from scRNA-Seq data exist, a further improvement could be achieved by means of machine learning approaches. Results Autoencoders (AEs) have been effectively used to capture the non-linearities among gene interactions of scRNA-Seq data, so that the deployment of AE-based tools might represent the way forward in this context. We introduce here scAEspy, a unifying tool that embodies: (1) four of the most advanced AEs, (2) two novel AEs that we developed on purpose, (3) different loss functions. We show that scAEspy can be coupled with various batch-effect removal tools to integrate data by different scRNA-Seq platforms, in order to better identify the cell-types. We benchmarked scAEspy against the most used batch-effect removal tools, showing that our AE-based strategies outperform the existing solutions. Conclusions scAEspy is a user-friendly tool that enables using the most recent and promising AEs to analyse scRNA-Seq data by only setting up two user-defined parameters. Thanks to its modularity, scAEspy can be easily extended to accommodate new AEs to further improve the downstream analysis of scRNA-Seq data. Considering the relevant results we achieved, scAEspy can be considered as a starting point to build a more comprehensive toolkit designed to integrate multi single-cell omics. Supplementary information The online version contains supplementary material available at 10.1186/s12859-021-04150-3.

that aim at understanding the molecular processes driving normal development and the onset of pathologies [1,2]. This field of research continuously poses new computational questions that have to be addressed [3].
One of the most important steps in scRNA-Seq analysis is the clustering of cells into groups that correspond to known or putatively novel cell-types, by considering the expression of common sets of signature genes. However, this step still remains a challenging task because applying clustering approaches in high-dimensional spaces can generate misleading results, as the distance between most pairs of points is similar [4]. As a consequence, finding an effective and efficient low-dimensional representation of the data is one of the most crucial steps in the downstream analysis of scRNA-Seq data. A common workflow of downstream analysis, depicted in Fig. 1, includes two dimensionality reduction steps: (1) Principal Component Analysis (PCA) [5] for an initial reduction of the dimensions based on the Highly Variable Genes (HVGs), and (2) a non-linear dimensionality reduction approach-e.g., t-distributed Stochastic Neighbour Embedding (t-SNE) [6] or Uniform Manifold Approximation and Projection (UMAP) [7]-on the PCA space for visualisation purposes (e.g., showing the labelled clusters) [8,9]. In addition, when multiple scRNA-Seq datasets have to be combined for further analyses, the technical non-negligible batch-effects that may exist among the datasets must be taken into account [3,8,[10][11][12][13], making the dimensionality reduction even PCA on standardised data

Clustering Marker genes
Step (vii)

Quality Control
Step (i)

Highly variable genes
Step (iii) Fig. 1 A common workflow for the downstream analysis of scRNA-Seq data. The workflow includes the following seven steps: (i) quality control to remove low-quality cells that may add technical noise, which could obscure the real biological signals; (ii) normalisation and log-transformation; (iii) identification of the HVGs to reduce the dimensionality of the dataset by including only the most informative genes; (iv) standardisation of each gene to zero mean and unit variance; (v) dimensionality reduction generally obtained by applying PCA; (vi) clustering of the cells starting from the low-dimensional representation of the data that are used to annotate the obtained clusters (i.e., identification of known and putatively novel cell-types); (vii) data visualisation on the low-dimensional space generated by applying a non-linear approach (e.g., t-SNE or UMAP) on the reduced space calculated in step (v) more complicated and fundamental. Indeed, finding a salient batch corrected and a low dimensional embedding space can help to better partition and distinguish the various cell-types.
Although commonly used approaches for dimensionality reduction achieved good performance when applied to scRNA-Seq data [8], novel and more robust dimensionality reduction strategies should be used to account for the sparsity, intrinsic noise, unexpected dropout, and burst effects [3,14], as well as the low amounts of RNA that are typically present in single-cells. Ding et al. showed that low-dimensional representations of the original data learned using latent variable models preserve both the local and global neighbour structures of the original data [15]. Autoencoders (AEs) showed outstanding performance in this regard due to their ability to capture the strong non-linearities among the gene interactions existing in the high-dimensional expression space.

Autoencoders for denoising and dimensionality reduction
Deep Count AE network (DCA) was one of the first AE-based approach proposed to denoise scRNA-Seq datasets [16] by considering the count distribution, overdispersion, and sparsity of the data. DCA relies on a negative binomial noise model, with or without zero-inflation, to capture non-linear gene-gene dependencies. Starting from the vanilla version of the Variational AE (VAE) [17], several approaches have been proposed. Among them, single-cell Variational Inference (scVI) was the first scalable framework that allowed for a probabilistic representation and analysis of gene expression datasets [18]. scVI was built upon Deep Neural Networks (DNNs) and stochastic optimization to consider the information across similar cells and genes to approximate the distributions underlying the analysed gene expression data. This computational tool allows for coupling low-dimensional probabilistic representation of gene expression data with the downstream analysis to consider the measurement of uncertainty through a statistical model. Svensson et al. integrated a Linearly Decoded VAE (LDVAE) into scVI [19], enabling the identification of relationships among the cell representation coordinates and gene weights via a factor mode. Single-cell VAE (scVAE) was introduced to directly model the raw counts from RNA-seq data [20]. More importantly, the authors proposed a Gaussian-mixture model to better learn biologically plausible groupings of scRNA-Seq data on the latent space.
The framework called Decomposition using Hierarchical AE (scDHA) is composed of two modules [21]. The first module is a non-negative kernel AE able to provide a nonnegative, part-based denoised representation of the original data. During this step, the genes and the components having an insignificant contribution to the denoised representation of the data are removed. The second module is a stacked Bayesian self-learning network built upon the VAE. This specific module is used to project the denoised data into a low-dimensional space used during the downstream analysis. scDHA outperformed PCA, t-SNE, and UMAP in terms of silhouette index [22] on the tested datasets.
AEs coupled with disentanglement methods have been used to both improve the data representation and obtain better separation of the biological factors of variation in gene expression data [23]. In addition, a graph AE, consisting of graph convolutional layers, was developed to predict relationships between single-cells. This framework can be used to identify the cell-types in the dataset under analysis and discover the driver genes for the differentiation process. Wang et al. proposed a deep VAE for scRNA-Seq data named VASC [24], a deep multi-layer generative model that improves the dimensionality reduction and visualisation steps in an unsupervised manner. Thanks to its ability to model dropout events-that can hinder various downstream analysis steps (e.g., clustering analysis, differential expression analysis, inference of gene-to-gene relationships) by introducing a high number of zero counts in the expression matrices-and to find non-linear hierarchical representations of the data, VASC obtained superior performance with respect to four state-of-the-art dimensionality reduction and visualisation approaches [24].
Dimensionality Reduction with Adversarial VAE (DR-A) has been recently proposed to fulfil the dimensionality reduction step from a data-driven point of view [25]. Compared to the previous approaches, DR-A exploits an adversarial VAE-based framework, which is a recent variant of generative adversarial networks. DR-A generally obtained more accurate low-dimensional representation of scRNA-Seq data compared to stateof-the-art approaches (e.g., PCA, scVI, t-SNE, UMAP), leading to better clustering performance. Geddes et al. proposed an AE-based cluster ensemble framework to improve the clustering process [26]. As a first step, random subspace projections of the data are compressed onto a low-dimensional space by exploiting an AE, obtaining different encoded spaces. Then, an ensemble clustering approach is applied across all the encoded spaces to generate a more accurate clustering of the cells.

Autoencoders for the imputation of missing data
AutoImpute was proposed to deal with the insufficient quantities of starting RNA in the individual cells, a problem that generally leads to significant dropout events. As a consequence, the resulting gene expression matrices are sparse and contain a high number of zero counts. AutoImpute is an AE-based imputation method that works on sparse gene expression matrices, trying to learn the inherent distribution of the input data to assign the missing values [27]. scSVA was also proposed to identify and recover dropout events [28], which are imputed by fitting a mixed model of each possible cell-type. In addition, it performs an efficient feature extraction step of the high-dimensional scRNA-Seq data, obtaining a low-dimensional embedding. In the tests showed by the authors, scSVA was able to outperform different state-of-the-art and novel approaches (e.g., PCA, t-SNE, UMAP, VASC).
Other two methods based on non-parametric AEs were proposed to address the imputation problem [29]. Learning with AuToEncoder (LATE) relies on an AE that is directly trained on a gene expression matrix with parameters randomly generated, while TRANSfer learning with LATE (TRANSLATE) takes into consideration a reference gene expression dataset to estimate the parameters that are then used by LATE on the new gene expression matrix. LATE and TRANSLATE were able to obtain outstanding performance on both real and simulated data by recovering non-linear relationships in pairs of genes, allowing for a better identification and separation of the cell-types.
GraphSCI combines Graph convolution network and AE, and it is the first approach that systematically integrates gene-to-gene relationships with the gene expression data into a deep learning framework. GraphSCI is able to impute the dropout events by taking advantage of low-dimensional representations of similar cells and gene-gene interactions [30].
Generally, in the existing AEs the input data are usually codified in a specific format, making their integration into the existing scRNA-Seq analysis toolkits (e.g., Scanpy [31] and Seurat [32]) a difficult task. In addition, the existing tools are implemented in Keras (https:// github. com/ fchol let/ keras), TensorFlow [33], or PyTorch [34], and all the three libraries are thus required to run them. Finally, the currently available AEs cannot be directly exploited to obtain the latent space or to generate synthetic cells. In order to overcome the described limitations, we developed scAEspy, which is a unifying, userfriendly, and standalone tool that relies only on TensorFlow and allows easy access to different AEs by setting up only two user-defined parameters. scAEspy can be used on High-Performance Computing (HPC) infrastructures to speed-up its execution. It can be easily run on clusters of both Central Processing Units (CPUs) and Graphics Processing Units (GPUs). Indeed, it was designed and developed to be executed on multi-and many-core infrastructures. In addition, scAEspy gives access to the latent space, generated by the trained AE, which can be directly used to show the cells in this embedded space or as a starting point for other dimensionality reduction approaches (e.g., t-SNE and UMAP) as well as downstream analyses (e.g., batch-effect removal).
In this work, we show how scAEspy can be used to deal with the existing batch-effects among samples. Indeed, the application of batch-effect removal tools into the latent space ( Fig. 2) allowed us to outperform state-of-the-art methods as well as the same batch-effect removal tools applied on the PCA space. Finally, scAEspy implements different loss functions, which are fundamental to deal with different sequencing platforms.

Results
We tested PCA and AEs to address the integration of different datasets. Specifically, we used all the AEs implemented in our scAEspy tool: VAE [17], an AE only based on the Maximum Mean Discrepancy (MMD) distance (called here MMDAE) [35], MMD-VAE, Gaussian-mixture VAE (GMVAE), and two novel Gaussian-mixture AEs that we developed, called GMMMD and GMMMDVAE, respectively. In all the performed tests, the constrained versions of the following loss functions were used: Negative Binomial (NB), Poisson, zero-inflated NB (ZINB), zero-inflated Poisson (ZIP). We used a number of Gaussian distributions equal to the number of datasets to integrate for GMVAE, GMMMD, and GMMMDVAE. In addition, we tested the following configurations of the hidden layer and latent space to understand how the dimension of the AEs might potentially affect their performance: (256, 64), (256, 32), (256, 16), (128, 64), (128, 32), (128, 16), (64,32), and (64,16), where (H, L) represents the number of neurons composing the hidden layer (H neurons) and latent space (L neurons).
In order to deal with the possible batch-effects, we applied the following approaches, as suggested in [8,11] and being the most used batch-effect removal tools in the literature: Batch Balanced k-Nearest Neighbours (BBKNN) [36], Harmony [37], ComBat [38][39][40], and the Seurat implementation of the Canonical Correlation Analysis (CCA) [13]. Thus, we compared vanilla PCA and AEs, PCA and AEs followed by either BBKNN or Harmony (Fig. 2), ComBat, and CCA. For each batch-effect removal tool, we used the different samples that have to be merged as batches.
The proposed strategies were compared on three publicly available datasets, namely: Peripheral Blood Mononuclear Cells (PBMCs), Pancreatic Islet Cells (PICs), and Mouse Cell Atlas (MCA) by using well-known clustering metrics (i.e., Adjusted Rand Index, Adjusted Mutual Information Index, Fowlkes Mallows Index, Homogeneity Score, and V-Measure). It is worth mentioning that generally the cell-types are manually identified by expert biologists starting from an overclustering or underclustering of the data, possibly followed by different steps of subclustering of some clusters. Here, we evaluate how the different strategies are able to automatically separate the cells by fixing the number of clusters equal to the number of cell-types manually identified by the authors of the papers.   Specifically, they are selected within each sample separately and then merged to avoid the selection of batch-specific genes. scAEspy is used to reduce the HVG space (k dimensions), and the obtained latent space can be (i) used to calculate a t-SNE space, (ii) corrected by Harmony, and (iii) used to infer an uncorrected neighbourhood graph. The corrected latent space by Harmony is then used to build a neighbourhood graph, which is clustered by using the Leiden algorithm and used to calculate a UMAP space. Otherwise, BBKNN is applied to rebuild a uncorrected neighbourhood graph by taking into account the possible batch-effects. The corrected neighbourhood graph built by BBKNN is then clustered by using the Leiden algorithm and used to calculate a UMAP space. In order to assign the correct label to the obtained clusters, the marker genes are calculated by using the Mann-Whitney U test. Finally, the annotated clusters can be visualised in both t-SNE and UMAP space Tangherloni

Peripheral blood mononuclear cells
PBMCs from eight patients with systemic lupus erythematosus were collected and processed using the 10× Chromium Genomics platform [41]. The dataset is composed of a control group (6573 cells) and an interferon-β stimulated group (7466 cells). We considered the 8 distinct cell-types identified by the authors following a standard workflow [41]. The count matrices were downloaded from Seurat's tutorial "Integrating stimulated vs. control PBMC datasets to learn cell-type specific responses" (https:// satij alab. org/ seurat/ v3.0/ immune_ align ment. html).

Mouse cell atlas
MCA is composed of two different datasets. The former was generated by Han et al. [47] using Microwell-Seq (4239 cells) [47], while the latter by the Tabula Muris Consortium [48] using Smart-Seq2 (2715 cells). The 11 distinct cell-types with the highest number of cells, which were present in both datasets, have been taken into account as in [11]. The count matrices were downloaded from the public GitHub repository related to [11] (https:// github. com/ Jinmi aoChe nLab/ Batch-effect-remov al-bench marki ng).

Adjusted rand index
The Rand Index (RI) is a similarity measure between the results obtained from the application of two different clustering methods. The first clustering method is used as ground truth (i.e., true clusters), while the second one has to be evaluated (i.e., predicted clusters). The RI is calculated by considering all pairs of samples appearing in the clusters, namely, it counts the pairs that are assigned either to the same or different clusters in both the predicted and the true clusters. The Adjusted RI (ARI) [49] is the "adjusted for chance" version of the RI. Its values vary in the range [−1, 1] : a value close to 0 means a random assignment, independently of the number of clusters, while a value equal to 1 indicates that the clusters obtained with both clustering approaches are identical. Negative values are obtained if the index is less than the expected index.

Adjusted mutual information index
The Mutual Information Index (MII) [50] represents the mutual information of two random variables, which is a similarity measure of the mutual dependence between the two variables. Specifically, it is used to quantify the amount of information that can be gained by one random variable observing the other variable. The MII is strictly correlated with the entropy of a random variable, which quantifies the expected "amount of information" that is contained in a random variable. This index is used to measure the similarity between two labels of the same data. Similarly to ARI, the Adjusted MII (AMII) is "adjusted for chance" and its values vary in the range [0, 1].

Fowlkes mallows index
The Fowlkes Mallows Index (FMI) [51] measures the similarity between the clusters obtained by using two different clustering approaches. It is defined as the geometric mean between precision and recall. Assuming that the first clustering approach is the ground truth, the precision is the percentage of the results that are relevant, while the recall refers to the percentage of total relevant results correctly assigned by the second clustering approach. The index ranges from 0 to 1.

Homogeneity score
The result of the tested clustering approach satisfies the Homogeneity Score (HS) [52] if all of its clusters contain only cells that belong to a single cell-type. Its values range from 0 to 1, where 1 indicates perfectly homogeneous labelling. Notice that by switching true cluster labels with the predicted cluster labels, the Completeness Score is obtained.

Completeness score
The result of the tested clustering approach satisfies the Completeness Score (CS) [52] if all the cells that belong to a given cell-type are elements of the same cluster. Its values range from 0 to 1, where 1 indicates perfectly complete labelling. Notice that by switching true cluster labels with the predicted cluster labels, the HS is obtained.

V-measure
The V-Measure (VM) [53] is the harmonic mean between HS and CS; it is equivalent to MII when the arithmetic mean is used as aggregation function.

Integration of multiple datasets obtained with the same sequencing platforms
Nowadays, various scRNA-Seq platforms are currently available (e.g., droplet-based and plate-based [54][55][56][57][58][59][60][61][62][63]) and their integration is often challenging due to the differences in biological sample batches as well as to the used experimental platforms. To test whether AEs can be effectively applied to combine multiple datasets, generated using the same platform but under different experimental conditions, we used the PBMC datasets. We merged the control and treated datasets by first using vanilla PCA and AEs, and then PCA and AEs followed by either BBKNN or Harmony, ComBat, and CCA. After the construction of the neighbourhood graphs, we performed a clustering step by using the Leiden algorithm [64]. Since in the original paper 8 different cell-types were manually identified [41], we selected Leiden's resolutions that allowed us to obtain 8 distinct clusters and calculated all the metrics described above. In what follows, the calculated values of all metrics are given in percentages (mean ± standard deviation). For each metric, the higher the value the better the result.
Our analysis showed that the CCA-based approach, proposed in the Seurat library, achieved a mean ARI equal to 73.49% ( ±1.52% ), ComBat reached a mean ARI of 72.84% ( ±0.78% ), vanilla PCA had a mean ARI of 68.90% ( ±0.86% ), PCA followed by BBKNN was able to obtain a mean ARI of 83.65% ( ±0.81% ), while PCA followed by Harmony reached a mean ARI of 82.83% ( ±1.20% ), as shown in Fig. 3a. Among all the tested AEs, MMDAE followed by Harmony (using the NB loss function, 256 neurons for the hidden layer, and 32 neurons for the latent space) achieved the best results, with a mean ARI equal to 87.18% ( ±0.49% ). In order to assess whether any of the results obtained by the best AE were different from a statistical point of view, we applied the Mann-Whitney U test with the Bonferroni correction [65][66][67]. In all the comparisons, MMDAE followed by Harmony had a p-value lower than 0.0001, confirming that the achieved results are statistically different compared to those achieved by the other approaches.
Regarding the AMII, CCA had a mean value of 66.46% ( ±0.50% ), ComBat achieved a mean value of 70.95% ( ±0.82% ), vanilla PCA obtained a mean value of 68.44% ( ±1.00% ), PCA followed by BBKNN reached a mean value of 75.22% ( ±0.76% ), while PCA followed by Harmony reached a mean value of 74.55% ( ±1.19% ). MMDAE followed by Harmony had better results, with a mean value equal to 78.61% ( ±0.29% ). MMDAE followed by Harmony outperformed the other strategies also in terms of of FMS, HS, CS, and VM (see Additional file 2 and Fig. 4).
We also compared the results obtained by the best AE for each of the tested dimension (H, L) in terms of ARI (Fig. 3b). GMMMD followed by Harmony (using the NB loss function) obtained the best results for the dimension (64,16), GMMMD followed by Harmony (using the Poisson loss function) reached the best results for the dimensions (128, 16) and (256, 64), and GMMMD followed by BBKNN (using the NB loss function) achieved the best results for the dimension (128, 32). MMDAE followed by Harmony (using the NB loss function) was able to reach the best results for the dimensions (64, 32), (256, 16), and (256, 32), while MMDAE followed by Harmony (using the Poisson loss function) obtained the best result for the dimensions (128, 64). Notice that we used two Gaussian distributions because we merged two different datasets.
In order to visually assess the quality of the separation of the manually annotated cell-type and the found clusters, we plotted them in the UMAP space generated starting from the MMDAE followed by Harmony space (Fig. 3c, d). Finally, we also plotted the two samples in the same UMAP space to visually see the quality of the alignment between the two samples themselves (Fig. 5a). This plot confirms that the batcheffects were completely removed.
Our analysis showed that clustering the neighbourhood graph generated from AE spaces allowed for a better identification of the existing cell-types when compared to other approaches, thus confirming the ARI results.

Integration of multiple datasets obtained with different sequencing platforms
Combining datasets from different studies and scRNA-Seq platforms can be a powerful approach to obtain complete information about the biological system under investigation. However, when datasets generated with different platforms are combined, the high variability in the gene expression matrices can obscure the existing biological relationships. For example, the gene expression values are much higher in data acquired with plate-based methods (i.e., up to millions) than in those acquired with droplet-based methods (i.e., a few thousands). Thus, combining gene expression data that spread across several orders of magnitude is a difficult task that cannot be tackled by using linear approaches like PCA. To examine how well AEs perform in solving this issue, we combined four PIC datasets acquired with CEL-Seq [56], CEL-Seq2 [57], Fluidigm C1 [63], and Smart-Seq2 protocols [62]. We integrated the datasets by first using vanilla PCA and AEs, and then PCA and AEs followed by either BBKNN or Harmony, ComBat, and CCA. Since in the original paper 13 cell-types were manually annotated for the PIC datasets [46], we clustered the neighbourhood graphs using the Leiden algorithm considering only the resolutions that allowed us to obtain 13 distinct clusters. We then calculated ARI, in percentages (mean ± standard deviation); for each metric, the higher the value the better the result. CCA had a very low mean ARI, i.e., 5.45% ( ±0.22% ), ComBat obtained a mean ARI of 76.20% ( ±3.06% ), vanilla PCA achieved a mean ARI of 61.38% ( ±0.09% ), PCA followed by BBKNN reached a mean ARI of 71.49% ( ±0.58% ), while PCA followed by Harmony was able to obtain a mean ARI of 94.00% ( ±0.36% ), see Fig. 6a. GMMMD followed by Harmony (using the NB loss function, 256 neurons for the hidden layer, and 32 neurons for the latent space) outperformed the other AEs, achieving a mean ARI equal to 94.23% ( ±0.12% ). In all the comparisons, except for the one against PCA followed by Harmony, GMMMD followed by Harmony had a p-value lower than 0.0001, confirming that the achieved results are statistically different with respect to those obtained by the other approaches.
Similar results were achieved for the AMII metric: CCA reached a mean value equal to 16.57% ( ±0.33% ), ComBat obtained a mean value of 76.11% ( ±1.14% ), vanilla PCA reached a mean value of 71.70% ( ±0.11% ), PCA followed by BBKNN achieved a mean value of 77.55% ( ±0.65% ) and PCA followed by Harmony a mean value of 91.17% ( ±0.47% ), while GMMMD followed by Harmony was able to reach a mean value equal to 89.37% ( ±0.02% ). Considering the other measures, both PCA and GMMMD followed by Harmony obtained very similar results, outperforming the other strategies (see Additional file 3 and Fig. 7).
Considering the best AE for each of the tested dimension (H, L) in terms of ARI (see Fig. 6b), GMMMD followed by Harmony (using the NB loss function) resulted the best choice for the dimensions (128, 64) and (256, 32), while it obtained the best results for the dimension (64, 32) when the Poisson loss function was used. GMVAE followed by Harmony (using the Poisson loss function) reached the best results for the dimensions (128, 32) and (256, 16). MMDAE followed by Harmony achieved the best results for the dimensions (256, 64) and (64,16), exploiting the NB loss function and Poisson loss function, respectively. Finally, VAE followed by Harmony obtained the best results with the Poisson function for the dimension (128, 16). Note that we exploited four Gaussian distributions because we merged four different datasets.
The quality of the separation of the manually annotated cell-type and found clusters can be visually evaluated in Fig. 6c, d. We finally visualised the cells (coloured by platform) using the UMAP space generated from the GMMMD followed by Harmony space (Fig. 5b) to confirm that the batch-effects among the samples sequenced with different platforms were correctly removed. Taken together, our analysis shows that GMMMD followed by Harmony can efficiently identify the "shared" cell-types (See figure on next page.) Fig. 4 Boxplot showing the values of the calculated metrics using CCA, ComBat, PCA, MMDAE followed by Harmony with dimension (256, 32), PCA followed by BBKNN, and PCA followed by Harmony as well as by the best AE for each of the tested dimension (H, L), analysing the PBMC datasets. a AMII achieved by the different strategies. b AMII achieved by the best AE for each of the tested dimension. c FMS achieved by the different strategies. d FMS achieved by the best AE for each of the tested dimension. e HS achieved by the different strategies. f HS achieved by the best AE for each of the tested dimension. g CS achieved by the different strategies. h CS achieved by the best AE for each of the tested dimension. i VM achieved by the different strategies. j VM achieved by the best AE for each of the tested dimension. p-value ≤ 0.0001 (****); 0.0001 < p -value ≤ 0.001 (***); 0.001 < p-value ≤ 0.01 (**); 0.01 < p-value across the different platforms due to its ability to deal with the high variability in the gene expression matrices. We would like to highlight that PCA followed by Harmony was capable of achieving good results because the original clusters were obtained by applying a similar pipeline [46]. As a final test, we combined the two MCA datasets acquired with Microwell-Seq [47] and Smart-Seq2 protocols [62]. We integrated the datasets in the same way we did in the other two tests. We clustered the neighbourhood graphs using the Leiden algorithm considering only the resolutions that allowed us to obtain 11 distinct clusters because 11 distinct cell-types were manually annotated for the PIC datasets [47]. We then calculated all metrics.
In such a case, MMDVAE followed by Harmony (using the Poisson loss function, 256 neurons for the hidden layer, and 64 neurons for the latent space) outperformed the other AEs as well as the other strategies, obtaining a mean ARI equal to 79.50% ( ±0.02% ), as shown in Fig. 8a. ComBat achieved the worst mean ARI, i.e., 54.13% ( ±4.22% ), CCA reached a mean ARI of 57.62% ( ±0.75% ), vanilla PCA obtained a mean ARI of 67.29% ( ±11.55% ), PCA followed by BBKNN had a similar mean ARI, that is, 67.73% ( ±3.98% ), while PCA followed by Harmony achieved a mean ARI of 66.08% ( ±0.11% ). MMDVAE followed by Harmony had a p-value lower than 0.0001 in all the tested comparisons. Considering the other metrics, MMDVAE followed by Harmony generally obtained better results compared to the other strategies (see Additional file 4, Figs. 8b, and 9).
Comparing the best AE for each of the tested dimension (H, L) in terms of ARI, the vanilla GMMMD with the NB loss function obtained the best results for the dimension (128, 16), while GMMMD followed by Harmony reached the best results for the dimensions (256, 32) and (64,16), exploiting the Poisson loss function and NB loss function, respectively. MMDAE followed by BBKNN (using the ZIP loss function) achieved the best results for the dimensions (256, 16) and (128, 64), exploiting the NB loss function and Poisson loss function, respectively. MMDVAE followed by Harmony resulted the best choice for the dimensions (128, 16) and (256, 64) when coupled with the NB loss function and Poisson loss function, respectively. Finally, VAE followed by Harmony with the Poisson loss function obtained the best results for the dimension (64,32). As for the the integration of the PBMC datasets, we used two Gaussian distributions because we merged two different datasets. Figure 8c, d show the UMAP generated from the MMDVAE followed by Harmony space coloured by the manually annotated cell-type and found clusters, respectively, while Fig. 5c depicts the cells coloured by platform on the same UMAP space, confirming that the batch-effects between the two samples were correctly removed. In this case, the achieved results show that MMDVAE followed by Harmony was able to better identify the "shared" cell-types across the different platforms. In order to test the computational performance of scAEspy, we considered a VAE with a single hidden layer composed of 256 neurons, a latent space of 32 neurons, and a b c d

Discussion
Non-linear approaches for dimensionality reduction can be effectively used to capture the non-linearities among gene interactions that may exist in the high-dimensional expression space of scRNA-Seq data [15]. Among the different non-linear approaches, AEs showed outstanding performance, outperforming other approaches like UMAP and t-SNE. Several AE-based methods have been developed so far, but their integration with the common single-cell toolkits results in a difficult task because they usually require input data codified in a specific format. In addition, three different machine learning libraries are required to run them (i.e., Keras, TensorFlow, PyTorch).
Here, we proposed scAEspy, a unifying and user-friendly tool that allows the user to use the most recent and promising AEs (i.e., VAE, MMDAE, MMDVAE, and GMVAE). We also designed and developed GMMMD and GMMMDVAE, two novel AEs that combine MMDAE and MMDVAE with GMVAE to exploit more than one Gaussian distribution. We introduced a learnable prior distribution in the latent space to model the dimensionality of the subpopulations of cells composing the data, or to combine multiple samples.
We integrated AEs with both Harmony and BBKNN to remove the existing batcheffects among different datasets. Our results showed that exploiting the latent space to remove the existing batch-effects permits for better identification of the cell subpopulations. As a batch-effect removal tool, Harmony allowed for achieving better results than BBKNN in the majority of the cases. When different droplet-based data have to be combined, our GMMMD and the MMDAE, coupled with the constrained NB and Poisson loss functions, obtained the best results compared to all the other AEs. In order to combine and analyse multiple datasets, generated by using different scRNA-Seq platforms, both GMMMD and MMDVAE, mainly used together with the NB and Poisson loss functions, outperformed the other strategies. However, also GMVAE and the simple VAE obtained outstanding performance, highlighting that the Kullback-Leibler divergence function can become fundamental to handle data spreading various orders of magnitude, especially the high values (up to millions) introduced by plate-based methods. It is clear that using more than one Gaussian distribution allows for obtaining a better integration of the datasets and separation of the cell-types when more than two datasets have to be integrated, as clearly shown by the results reached on the PIC datasets.
As a good practice, we suggest filtering the gene expression matrices using the top HVGs (i.e., 1000), calculated separately within each batch and merged to avoid the selection of batch-specific genes. The original counts, corresponding to the top HVGs, should be the input data of scAEspy. GMMMD coupled with Harmony should be used to integrate different droplet-based datasets as well as to merge datasets generated by using different scRNA-Seq platforms. Harmony should be applied to the latent space produced by GMMMD to take into account the possible batches among the datasets. When a single dataset has to be analysed, the users should use MMDAE with droplet-based data, while MMDVAE should be preferred for data acquired with plate-based methods. In all cases, using a single hidden layer composed of 256 neurons, a latent space with 32 neurons, and the constrained NB or Poisson loss function generally allows for obtaining satisfactory results.
Considering the achieved results on the identification of the clusters, scAEspy can be used at the basis of methods that aim at automatically identifying the cell-types composing the scRNA-Seq datasets under analysis [68]. As a matter of fact, scAEspy coupled with BBKNN was successfully applied to integrate 15 different foetal human samples, enabling the identification of rare blood progenitor cells [69].

Conclusions
In this study, we proposed an AE-based and user-friendly tool, named scAEspy, which allows for using the most recent and promising AEs to analyse scRNA-Seq data. The user can select the desired AE by only setting up two user-defined parameters. Once the selected AE has been trained, it can be used to generate synthetic cells to increase the number of data for further downstream analyses (e.g., training classifiers). In scAEspy, the latent space is easily accessible and thus allows the user to perform different types of analysis, such as the correction of possible batch-effect in a reduced non-linear space or the inference of differentiation trajectories. In this case, the latent space can be utilised to generate the "pseudotime" that measures transcriptional changes that a cell undergoes during the dynamic process.
Thanks to its modularity, scAEspy can be extended to accommodate new AEs so that the user will be always able to utilise the latest and cutting-edge AEs [70], which can improve the downstream analysis of scRNA-Seq data. It is worth noticing that scAEspy can be used on HPC infrastructures, both based on CPUs and GPUs, to speed-up the computations. This is a crucial point when datasets composed of hundreds of thousands of cells are analysed. In such cases, the required running time drastically increases, so relying on HPC infrastructures is the best solution to incredibly reduce the prohibitive running time.

Future improvements
As an improvement, prior biological knowledge about genes from ontologies can be incorporated into scAEspy. Ontologies can introduce useful information into machine learning systems that are used to solve biological problems. They allow for integrating data from different omics (e.g., genomics, transcriptomics, proteomics, and metabolomics) as structured representations of semantic knowledge, which is commonly used for the representation of biological concepts. This approach has been successfully applied to predict the clinical targets from high-dimensional low-sample data [71]. Specifically, ontology embeddings are able to capture the semantic similarities among the genes, which can be exploited to sparsify the network connections. In addition, the Gene Ontology (GO) [72] can be exploited to interpret the extracted features from the latent spaces generated by the AEs, thus bringing an explanation to the learned representations of the gene expression data. As a possible example, g:Profiler [73] focusing on GO terms, Kyoto Encyclopedia of Genes and Genomes (KEGG), and Reactome can be used on the learned embeddings to investigate the joint effects of different gene sets within specific biological pathways. This approach can help the interpretability and explainability of the learned embeddings of the used AEs.

Integration of multi-omics data
Since AEs showed outstanding performance in the integration of multi-omics of cancer data [70], we plan to extend scAEspy to analyse other single-cell omics. For instance, AEs can be applied to analyse scATAC-seq, where the identification of the cell-types is still very difficult due to technical challenges [10,74]. scAEspy could be effectively applied to analyse disparate types of single-cell data from different points of view. The latent representations of different or combined single-cell omics can be used for further and more in-depth analyses. For instance, the application of other machine learning techniques (e.g., deep neural networks) to the latent representations could facilitate the identification of interesting patterns on gene expression or methylation data, as well as relationships among genomics variants. In that regard, scAEspy can be the starting point to build a more comprehensive toolkit designed to integrate multi single-cell omics as an integration and extension of the work proposed in [70].

Methods
We developed scAEspy so that it can be easily integrated into both Scanpy and Seurat pipelines, as it directly works on a gene expression matrix (see Fig. 2). We integrated into a single tool the latest and most powerful AEs designed to resolve the problems underlying scRNA-Seq data (e.g., sparsity, intrinsic noise, dropout events [3]). Specifically, scAEspy is comprised of six AEs, based on the VAE [17] and InfoVAE [35] architectures. The following most advanced AEs are included in scAEspy: VAE, MMDAE, MMDVAE, GMVAE, and two novel Gaussian-mixture AEs that we developed, called GMMMD and GMMMDVAE. GMMMD is a modification of the MMDAE where more than one Gaussian distribution is used to model different modes and only the MMD function is used as divergence function. GMMMDVAE is a combination of MMDVAE and GMVAE where both the MMD function [75] and the Kullback-Leibler divergence function [76] are used. scAEspy allows the user to exploit these six different AEs by setting up two user-defined parameters, α and , which are needed to balance the MMD and the Kullback-Leibler divergence functions. We designed and developed GMMMD and GMMMDVAE starting from InfoVAE [35] and scVAE [20]. In addition, a learnable mixture distribution was used for the prior distribution in the latent space, and also the marginal conditional distribution was defined to be a learnable mixture distribution with the same number of components as the prior distribution. Finally, the user can also select the following loss functions: NB, constrained NB, Poisson, constrained Poisson, ZINB, constrained ZINB, ZIP, constrained ZIP, and Mean Square Error (MSE).

The tested batch-effect removal tools
Originally proposed to deal with batch-effects in microarray gene expression data [38], ComBat has been successfully applied to analyse scRNA-Seq data [77]. Briefly, given a gene expression matrix, it is firstly standardised so that all genes have similar means and variances. Then, starting from the obtained standardised matrix, standard distributions are fitted using a Bayesian approach to estimate the existing batch-effects in the data. Finally, the original expression matrix is corrected using the computed batch-effect estimators. In our tests, we used the default parameter settings provided by the Scanpy function combat. We then applied PCA on the space obtained by the top k (here, we set k = 1000 ) HVGs calculated by using the function provided by Scanpy (v.1.4.5.1), where the top HVGs are separately selected within each batch and merged to avoid the selection of batch-specific genes. We calculated the first 50 components and applied the socalled "elbow method" to select the number of components for the downstream analysis [8]. The "elbow method" was applied taking into consideration the variance explained by each PCA component, which can be visualised using the Scanpy function pca_vari-ance_ratio. Including less informative PCA components might help in the identification of rare cell types, but they can also introduce noise that hinders the downstream analysis steps; thus, we opted to include only the most informative PCA components. Specifically, we used the first 18, 13, and 18 components for PBMC, PIC, and MCA datasets, respectively. After that, we calculated the neighbourhood graph by using the default parameter settings proposed in Scanpy. We clustered the obtained neighbourhood graphs with the Leiden algorithm by selecting the values of the resolution parameter such that the number of clusters was equal to the manually annotated clusters. Finally, all the metrics for each found resolution have been calculated.
As another batch-effect removal tool, we used the CCA-based approach proposed in the Seurat package (v.2.3.4) [13]. We applied both RunCCA and MultiCCA Seurat functions to integrate two batches and more than two batches, respectively. Firstly, we normalised and log-transformed the counts. Then, we calculated the top 1000 HVGs by using the function provided by Scanpy (v.1.4.5.1). We also scaled the log-transformed data to zero mean and unit variance. In both RunCCA and MultiCCA Seurat functions, as a first step, the CCA components were exploited to compute the linear combinations of the genes with the maximum correlation between the batches. We used the default number of CCA components (i.e., 20) provided by the functions RunCCA and Multi-CCA of the Seurat package, as also suggested in [36]. A dynamic time warping (Align-Subspace Seurat function), which accounts for population density changes, was then used to align the calculated vectors and obtain a single low-dimensional subspace where the batch-effects are corrected. We calculated the neighbourhood graph, using the default parameter settings proposed in Scanpy, starting from the aligned low-dimensional subspace. We clustered the built neighbourhood graphs with the Leiden algorithm as explained before. Finally, we calculated all the metrics for each found resolution.
We also applied Harmony [37] to remove the batch-effects. Starting from a reduced space (e.g., PCA space or latent space), Harmony exploits an iterative clustering-based procedure to remove the multiple-dataset-specific batch-effects. In each iteration, the following four steps are applied: (i) the cells are grouped into multiple-dataset clusters by exploiting a variant of the soft k-means clustering, which is a fast and flexible method developed to cluster single-cell data; (ii) a centroid is calculated for each cluster and for each specific dataset; (iii) using the calculated centroids, a correction factor is derived for each dataset; (iv) the correction factors are then used to correct each cell with a cell-specific factor.
As a further batch-effect removal tool, we applied BBKNN [36]. Polanski et al. [36] showed that BBKNN has comparable or better performance in removing batch-effects with respect to the CCA-based approach proposed in the Seurat package, Scanorama [78], and mnnCorrect [79]. In addition, BBKNN is a lightweight graph alignment method that requires minimal changes to the classical workflow. Indeed, it computes the k-nearest neighbours in a reduced space (e.g., PCA or latent space), where the nearest neighbours are identified in a batch-balanced manner using a user-defined distance (in our tests, we used the Euclidean distance). The neighbour information is transformed into connectivities to build a graph where all cells across batches are linked together. We used both Harmony and BBKNN to correct the PCA and AE spaces.
As a final step, we calculated the UMAP spaces starting from the built neighbourhood graphs and using the default parameter settings proposed in Scanpy, except for the initialisation of the low dimensional embedding (i.e., init_pos equal to random, and ran-dom_state equal to 10 of the umap function).

The proposed pipeline
We modified the workflow shown in Fig. 1 by replacing PCA with AEs (Fig. 2). We merged the gene expression matrices of E different samples ( E = 2 , E = 4 , and E = 2 for PBMC, PIC, and MCA datasets, respectively). We applied both PCA and AEs on the space obtained by the top 1000 HVGs calculated by using the implementation provided by Scanpy (v. 1.4.5.1).
For what concerns PCA, we firstly normalised and log-transformed the counts, then we applied a classic standardisation, that is, the distribution of the expression of each gene was scaled to zero mean and unit variance. We calculated the first 50 components; after that, we used the "elbow method" to select the first 12, 14, and 19 components for the PBMC, PIC, and MCA datasets, respectively. As in the case of the ComBat workflow, the "elbow method" was applied considering the variance explained by each PCA component.
Regarding AEs, we used the original counts since AEs showed to achieve better results when applied using the raw counts [20]. Indeed, using the counts allows for exploiting discrete probability distributions, such as Poisson and NB distributions, which obtained the best results in our tests. In all the tests presented here, we used a single hidden layer. In addition, we set 100 epochs, sigmoid activation functions, and a batch equal to 100 samples (i.e., cells). In all tests we used the Adam optimizer [80].
After that, we applied three different strategies (Fig. 2): (i) we calculated the neighbourhood graph in both PCA and AE spaces by using the default parameter settings proposed in Scanpy. Then, we clustered the obtained neighbourhood graphs with the Leiden algorithm as described before. Finally, we calculated all the metrics for each found resolution; (ii) we performed a similar analysis where we firstly corrected the PCA and AE spaces using Harmony [37] with the default parameter settings proposed in https:// github. com/ slowk ow/ harmo nypy; (iii) we performed the same analysis described in (i) by replacing the neighbourhood graphs with those generated using BBKNN, using the default parameter settings.
It is worth mentioning that in all the tested batch-effect removal tools, we used the different samples that have to be merged as batches. Specifically, the number of batches is equal to 2, 4, and 2 for the PBMC, PIC, and MCA datasets, respectively.

The generalised formulation of scAEspy
In this work, we used the notation proposed in [35] to extend MMDVAE with multiple Gaussian distributions as well as to introduce a learnable prior distribution in the latent space. The idea behind the introduction of learnable coefficients is that they might be suitable to model the diversity among the subpopulations of cells composing the data or to combine multiple samples or datasets.
We consider p * (x) as the unknown probability in the input space over which the optimisation problem is formulated, z is the latent representation of x with |z| ≤ |x| . The encoder is identified by a function e φ : x � → z , while the decoder by a function d θ : z � → x.
We remind that in VAEs, the input x is not mapped into a single point in the latent space, but it is represented by a probability distribution over the latent space. In what follows, we denote by q(z) any possible distribution in the latent space and by y ∈ {1, . . . , K } a categorical random variable, where K corresponds to the number of desired Gaussian distributions. As general strict divergence function, we considered the MMD(·) divergence function [75].
The ELBO term proposed in this work, which is the measure maximised during the training of AEs, is: where KL(·) is the Kullback-Leibler divergence [76] between two distributions. All the mathematical details required to derive the generalised formula shown in Eq. (1) can be found in the Additional file 1.

Availability and requirements
scAEspy is written in Python programming language (v.3.6.5) and it relies on Tensor-Flow (v.1.12.0), an open-source and massively used machine learning library [33].  scAEspy requires the following Python libraries: NumPy, SciKit-Learn, Matplotlib, and Seaborn. scAEspy's open-source code is available on GitLab: https:// gitlab. com/ cvejicgroup/ scaes py under the GPL-3 license. scAEspy can be easily installed using the Python package installer pip, which allows the user to use scAEspy either as a standalone tool exploiting the command line interface, or as a Python package that can be integrated into Python scripts and Jupyter Notebooks. The repository contains all the scripts, code and Jupyter Notebooks used to obtain the results shown in the paper. In the provided Jupyter Notebooks, we show how easy it is to integrate scAEspy and Scanpy, and how the data can be visualised and explored by using both scAEspy and Scanpy's functions. We also provide a detailed description of scAEspy's parameters so that it can be used by both novice and expert researchers for downstream analyses.