To demonstrate Asc-Seurat's functionalities, we analyzed the publicly available 10× Genomics’ 3k Peripheral Blood Mononuclear Cells (PBMC) dataset, showcasing the analysis of an individual sample. In addition, we used a second PBMC dataset to demonstrate the integrated analysis of multiple samples in Asc-Seurat. The second PBMC dataset was generated by Kang et al., 2018 and is distributed as part of the SeuratData package. It contains two samples and approximately fourteen thousand cells. Both samples contain a pool of PBMCs from eight patients. However, one sample was stimulated by treatment with IFN-β (Treatment), while the second sample is a control (Control). Moreover, we provide a detailed comparison among the available web applications. While these web applications partly overlap with Asc-Seurat’s capabilities, none to date comprises the range of essential tools available in Asc-Seurat.
Asc-Seurat use case 1—analysis of an individual sample
Loading the data, quality control, data normalization and clustering
For analysis using Asc-Seurat, all scRNA-seq datasets should be stored in a subdirectory inside the data/ directory, which is generated during the installation. Asc-Seurat’s interface displays the compatible files stored within the data/ folder, from which the data of interest can be selected. Next, users can provide a name for the project and define the initial parameters used to select the cells to be loaded in the web application. For the 10× Genomics’ PBMC dataset, we selected only cells expressing at least 200 genes, and only genes expressed in three or more cells. These parameters are fully adjustable in Asc-Seurat.
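The loading filter described above can be illustrated with a minimal sketch in plain Python. It is not Asc-Seurat's code: the toy matrix, the function name `filter_on_load`, and the order of the two steps (cells first, then genes) are assumptions made for illustration, and the thresholds are scaled down so that the effect is visible on four cells.

```python
# Toy count matrix (hypothetical): rows are genes, columns are cells (raw UMI counts).
counts = {
    "GENE_A": [3, 0, 1, 2],
    "GENE_B": [0, 0, 0, 5],
    "GENE_C": [1, 0, 4, 1],
    "GENE_D": [2, 1, 0, 3],
}
n_cells = 4


def filter_on_load(counts, n_cells, min_genes_per_cell, min_cells_per_gene):
    """Keep cells expressing enough genes, then genes detected in enough kept cells."""
    # Number of genes with a non-zero count in each cell.
    genes_per_cell = [
        sum(1 for row in counts.values() if row[j] > 0) for j in range(n_cells)
    ]
    kept_cells = [j for j in range(n_cells) if genes_per_cell[j] >= min_genes_per_cell]

    # Among the kept cells, keep genes detected in at least min_cells_per_gene cells.
    filtered = {}
    for gene, row in counts.items():
        kept_row = [row[j] for j in kept_cells]
        if sum(1 for c in kept_row if c > 0) >= min_cells_per_gene:
            filtered[gene] = kept_row
    return kept_cells, filtered


# The values used for the PBMC dataset would be 200 genes per cell and 3 cells per gene;
# the toy matrix uses 2 and 2 instead.
kept_cells, filtered = filter_on_load(counts, n_cells, 2, 2)
print(kept_cells)
print(sorted(filtered))
```

In this toy example, the second cell is dropped because it expresses a single gene, and GENE_B is dropped because only one of the remaining cells expresses it.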
After loading the dataset, a violin plot shows the distribution of the number of expressed genes, the number of Unique Molecular Identifiers (UMIs, i.e., independent transcripts), and the percentage of mitochondrial genes detected in each cell (Additional file 1: Fig. S1). Users can then define more restrictive parameters to remove undesirable cells based on the observed distribution. For the PBMC dataset, we selected only cells expressing more than 250 and fewer than 2500 genes. We also excluded cells with more than 5% of transcripts of mitochondrial origin (Additional file 1: Fig. S1).
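As an illustration of these quality-control thresholds, the following sketch applies them to a few hypothetical per-cell summaries. The cell records, field names, and the function `passes_qc` are invented for the example. Only the threshold values (more than 250 and fewer than 2500 genes, at most 5% mitochondrial transcripts) come from the text.

```python
# Hypothetical per-cell summaries: number of expressed genes, total UMIs,
# and UMIs assigned to mitochondrial genes.
cells = [
    {"id": "cell_1", "n_genes": 180, "n_umi": 400, "mito_umi": 8},
    {"id": "cell_2", "n_genes": 900, "n_umi": 2500, "mito_umi": 50},
    {"id": "cell_3", "n_genes": 1200, "n_umi": 4000, "mito_umi": 400},
    {"id": "cell_4", "n_genes": 3100, "n_umi": 12000, "mito_umi": 120},
]


def passes_qc(cell, min_genes=250, max_genes=2500, max_percent_mito=5.0):
    """Return True if the cell lies inside the gene-count window and the mito limit."""
    percent_mito = 100.0 * cell["mito_umi"] / cell["n_umi"]
    within_gene_window = min_genes < cell["n_genes"] < max_genes
    return within_gene_window and percent_mito <= max_percent_mito


kept = [cell["id"] for cell in cells if passes_qc(cell)]
print(kept)
```

Here cell_1 and cell_4 fall outside the gene-count window and cell_3 has 10% mitochondrial transcripts, so only cell_2 is retained.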
Subsequently, users select the normalization procedure to be applied to the dataset (log-normalization or SCTransform), as well as the parameters for dimension reduction using PCA. For the PBMC dataset, we performed log-normalization using a scale factor of 10,000. The dimension reduction by PCA was performed using the 2000 most variable genes selected by the “vst” method. Default values, suitable for most datasets, are provided. After executing the PCA, an elbow plot is generated to help users define how many principal components (PCs) should be used for clustering the data. For the PBMC dataset, we used the first 10 PCs (Additional file 1: Fig. S2).
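For readers unfamiliar with log-normalization, the sketch below shows the calculation as we understand it from Seurat's documentation: each count is divided by the cell's total counts, multiplied by the scale factor, and transformed with the natural logarithm of one plus that value. The function name `log_normalize` and the toy cell are assumptions for illustration; the scale factor of 10,000 is the value reported above.

```python
import math


def log_normalize(cell_counts, scale_factor=10_000):
    """Scale each count by the cell's total counts, then apply log(1 + x)."""
    total = sum(cell_counts.values())
    return {
        gene: math.log1p(count / total * scale_factor)
        for gene, count in cell_counts.items()
    }


# Hypothetical cell with ten UMIs spread over three genes.
cell = {"GENE_A": 2, "GENE_B": 0, "GENE_C": 8}
normalized = log_normalize(cell)
for gene, value in normalized.items():
    print(gene, round(value, 3))
```

Genes with zero counts stay at zero, and dividing by the cell total makes values comparable between cells with different sequencing depths.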
Before executing the clustering step, it is necessary to set the resolution parameter, which strongly influences the profile and number of clusters identified for a dataset. Selecting larger values will favor splitting cells into more clusters, while selecting smaller ones has the opposite effect. For the PBMC dataset, a resolution of 0.5 was selected, and nine clusters were identified (Additional file 1: Fig. S3).
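The effect of the resolution parameter can be made explicit under one common formulation of graph-based clustering, in which communities are chosen to maximize a modularity function with a resolution term. This formulation is an assumption used here for intuition. The text does not state the exact objective optimized. In the expression below, $A_{ij}$ is the edge weight between cells $i$ and $j$ in the neighbor graph, $k_i$ is the weighted degree of cell $i$, $m$ is the total edge weight, $c_i$ is the cluster of cell $i$, and $\gamma$ is the resolution.

```latex
Q(\gamma) = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \gamma \, \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

Increasing $\gamma$ strengthens the penalty on placing weakly connected cells in the same cluster, which favors more, smaller clusters. Decreasing it favors fewer, larger ones.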
Differential expression analysis and gene marker identification
Asc-Seurat provides an assortment of algorithms to identify gene markers for individual clusters or DEGs among clusters. As an example, we searched for gene markers for cluster 3 of the PBMC dataset. When using the non-parametric Wilcoxon rank-sum test, filtering for genes expressed in at least 10% of the cells in the cluster, with a (log) fold change higher than 0.25 and an adjusted p value smaller than 0.05, 397 gene markers were identified (Additional file 1: Fig. S4).
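The filtering criteria above can be sketched as a simple selection over a table of test results. The genes and values below are hypothetical, and the function `select_markers` is not part of Asc-Seurat. Only the three thresholds (10% of cells, log fold change of 0.25, adjusted p value of 0.05) are taken from the text.

```python
# Hypothetical test results for one cluster compared with all other cells.
results = [
    {"gene": "GENE_A", "pct_in_cluster": 0.85, "log_fc": 1.90, "p_adj": 1e-30},
    {"gene": "GENE_B", "pct_in_cluster": 0.05, "log_fc": 2.10, "p_adj": 1e-12},
    {"gene": "GENE_C", "pct_in_cluster": 0.40, "log_fc": 0.10, "p_adj": 1e-4},
    {"gene": "GENE_D", "pct_in_cluster": 0.30, "log_fc": 0.60, "p_adj": 0.20},
    {"gene": "GENE_E", "pct_in_cluster": 0.55, "log_fc": 0.45, "p_adj": 0.003},
]


def select_markers(results, min_pct=0.10, min_log_fc=0.25, max_p_adj=0.05):
    """Keep genes passing all three thresholds, most significant first."""
    selected = [
        r
        for r in results
        if r["pct_in_cluster"] >= min_pct
        and r["log_fc"] > min_log_fc
        and r["p_adj"] < max_p_adj
    ]
    return sorted(selected, key=lambda r: r["p_adj"])


markers = [r["gene"] for r in select_markers(results)]
print(markers)
```

In this toy table, GENE_B fails the expression-frequency filter, GENE_C the fold-change filter, and GENE_D the significance filter.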
Gene expression visualization
Asc-Seurat provides a variety of plots for gene expression visualization. From a list of selected genes, it is possible to visualize in a heatmap the averaged expression of each gene in each cluster (Fig. 2D) and, in a UMAP plot, the expression of the gene at the cell level (Fig. 2E). Moreover, violin plots (Fig. 2F) and dot plots (Fig. 2G) provide a tool for the visualization of the expression profile of each cluster, with emphasis on the inter-cluster comparison. As an example, we generated a heatmap plot for the five most significant markers identified in cluster 3 (Additional file 1: Fig. S5) and show their expression profile at the cell level (Additional file 1: Fig. S6) and the cluster level (Additional file 1: Fig. S7).
Trajectory inference and identification of genes defining the trajectory
Identifying genes affecting the developmental trajectory is critical for understanding how cells differentiate from one type to another. Therefore, after exploring the clusters, users may want to identify the developmental trajectory between cells in different clusters, subclusters, or states (e.g., cells responding to treatment). Moreover, it can be of interest to identify genes that vary in their expression within a trajectory.
To infer a developmental trajectory, users can either execute the capabilities of the embedded slingshot R package or select from dozens of models contained in dynverse. The choice of the model is important since some models are designed to perform well when the inferred trajectory follows a specific topology but perform poorly on others. After executing the analysis, three plots showing different representations of the inferred trajectory are generated (Fig. 4A). For the PBMC dataset, a developmental trajectory containing three lineages was identified using the nine clusters as input (Additional file 1: Fig. S8).
After inferring the developmental trajectory, it is possible to visualize the expression of genes of interest in the cells within the trajectory. Asc-Seurat provides two options for the visualization of gene expression within the trajectory: (1) the visualization of the same three trajectories represented in Fig. 4A, but colored by the gene expression (Fig. 4B), and (2) a heatmap displaying the expression of genes in each cell, ordered by the cell position within the trajectory (Fig. 4C).
For the PBMC dataset, we opted to show the 50 most significant DEGs within the trajectory, as ranked by their importance value estimated by dynverse (Additional file 1: Fig. S9). We selected three representative genes to show their expression using the two visualization options described above: NKG7, expressed in cells at the beginning of the trajectory, and LST1 and MS4A1, expressed in alternative branches in later parts of the trajectory (Additional file 1: Fig. S10).
Recovering functional annotation information and GO enrichment analysis
In many instances, users are interested in obtaining more information about a gene, or a set of genes, to support the interpretation of the data and the development of new hypotheses. For example, Asc-Seurat produces lists of gene markers, DEGs, and DEGs within a trajectory that might be of particular interest. By querying BioMart servers via the biomaRt package, Asc-Seurat allows users to recover the functional annotation for genes of several species. Furthermore, GO term enrichment analysis is also provided to verify whether one or more GO terms are over-represented or under-represented in a set of selected genes.
As an example, we executed the GO term enrichment analysis for the set of 50 most important DEGs within the trajectory inferred for the PBMC dataset, according to dynfeature’s importance value, using all expressed genes as the universe (background) of the analysis. We identified two terms related to the immune system as significantly enriched (Additional file 1: Fig. S11).
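To make the role of the gene universe concrete, the sketch below computes an over-representation p value with a hypergeometric test, one common way to test GO term enrichment. Whether Asc-Seurat uses exactly this test is not stated in this section, so the approach is an assumption. The function name and all numbers except the 50 selected genes are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(universe, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from a universe in which `annotated` genes carry the GO term."""
    max_overlap = min(annotated, selected)
    favorable = sum(
        comb(annotated, k) * comb(universe - annotated, selected - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favorable / comb(universe, selected)


# Hypothetical numbers: 10,000 expressed genes (the universe), 300 of them
# annotated with the term, and 8 of the 50 selected genes carrying it.
p_value = hypergeom_upper_tail(universe=10_000, annotated=300, selected=50, overlap=8)
print(p_value)
```

With these numbers only 1.5 annotated genes would be expected among the 50 by chance, so observing 8 gives a small p value. Restricting the universe to expressed genes, as done above, avoids inflating significance with genes that could never have been selected.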
Asc-Seurat use case 2—analysis of multiple samples using the integration approach
Using Seurat’s integration approach, the analysis of multiple samples is, in many ways, similar to the analysis of an individual sample. Therefore, while mentioning all required steps, we will focus on the steps where the analysis of multiple samples diverges the most when using Asc-Seurat.
Data loading, quality control, normalization, and integration
For the integration of multiple samples, the steps for loading the data differ from those used for a single sample. Users still need to add their datasets to the data/ directory, creating a subdirectory for each sample. However, users also need to provide a configuration file containing the parameter values for each sample. An example of the configuration file is generated during the installation. We also provide the configuration file used to integrate the two samples from the PBMC IFN-β dataset in Additional file 1: Table S1. These parameters include the name of the sample and the values used in the quality control. Therefore, users need to explore each sample individually and define these parameters before starting the integration of the samples. Moreover, within Asc-Seurat’s interface, users also need to select the normalization to be performed on the dataset and other parameters for the integration. The selected parameters for the PBMC IFN-β dataset are shown in Additional file 1: Fig. S12 and are extensively described in Asc-Seurat’s documentation.
Clustering, differential expression, and expression visualization
After the integration is completed, the analysis is similar to that described above for a single sample. A violin plot showing the distribution of cells is generated, and users can select stricter filtering parameters, then perform the PCA and clustering. For the PBMC IFN-β dataset, we did not apply cell filtering after the integration. Next, 20 PCs and a resolution of 0.5 were used for clustering, and 15 clusters were identified (Additional file 1: Fig. S13 and Additional file 1: Fig. S14).
Two significant differences exist when searching for gene markers or DEGs using multiple samples. First, the search for gene markers identifies those that are also conserved among samples. Second, it is possible to identify DEGs between samples for each cluster. For example, we identified 182 DEGs between the treatment and control for cluster 7 (Additional file 1: Fig. S15).
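The second point, comparing samples within each cluster, can be illustrated with a minimal sketch that groups cells by cluster and sample and computes a log2 fold change for one gene. This shows only the grouping logic, not the statistical test used to call DEGs. The cell values, the pseudocount, and the function name are assumptions for illustration.

```python
import math
from collections import defaultdict

# Hypothetical cells: (cluster, sample, normalized expression of one gene).
cells = [
    (7, "Control", 1.0),
    (7, "Control", 3.0),
    (7, "Treatment", 6.0),
    (7, "Treatment", 10.0),
    (2, "Control", 4.0),
    (2, "Treatment", 4.0),
]


def log2_fold_change_by_cluster(cells, pseudocount=1.0):
    """Per cluster, compare mean expression in Treatment versus Control."""
    groups = defaultdict(list)
    for cluster, sample, value in cells:
        groups[(cluster, sample)].append(value)

    result = {}
    for cluster in sorted({cluster for cluster, _, _ in cells}):
        treated = groups[(cluster, "Treatment")]
        control = groups[(cluster, "Control")]
        mean_treated = sum(treated) / len(treated)
        mean_control = sum(control) / len(control)
        # The pseudocount (an assumption) avoids division by zero for silent genes.
        result[cluster] = math.log2(
            (mean_treated + pseudocount) / (mean_control + pseudocount)
        )
    return result


fold_changes = log2_fold_change_by_cluster(cells)
print(fold_changes)
```

In this toy example the gene responds to treatment in cluster 7 but not in cluster 2, which is the kind of cluster-specific difference the per-cluster comparison is meant to reveal.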
In terms of expression visualization, the main difference when using an integrated dataset is that the UMAP plot showing the gene expression per cell is separated by sample, allowing a visual comparison between them. For example, we selected the five most significant DEGs that are more highly expressed in the treatment sample for cluster 7 (Additional file 1: Fig. S16 and Additional file 1: Fig. S17).
Trajectory inference and identification of genes defining the trajectory
For the trajectory inference, the analysis is conducted similarly for either an individual sample or an integrated dataset containing multiple samples. The only difference is that the user can indicate that the dataset contains multiple samples and, therefore, visualize the distribution of the cells within the trajectory colored by sample. The distribution of the cells of the PBMC IFN-β dataset within the trajectory, colored by sample, is shown in Additional file 1: Fig. S18.