 Research
 Open access
 Published:
eSPRESSO: topological clustering of singlecell transcriptomics data to reveal informative genes for spatio–temporal architectures of cells
BMC Bioinformatics volume 24, Article number: 252 (2023)
Abstract
Background
Bioinformatics capability to analyze spatio–temporal dynamics of gene expression is essential in understanding animal development. Animal cells are spatially organized as functional tissues where cellular gene expression data contain information that governs morphogenesis during the developmental process. Although several computational tissue reconstruction methods using transcriptomics data have been proposed, those methods have been ineffective in arranging cells in their correct positions in tissues or organs unless spatial information is explicitly provided.
Results
This study demonstrates stochastic selforganizing map clustering with Markov chain Monte Carlo calculations for optimizing informative genes effectively reconstruct any spatio–temporal topology of cells from their transcriptome profiles with only a coarse topological guideline. The method, eSPRESSO (enhanced SPatial REconstruction by Stochastic SelfOrganizing Map), provides a powerful in silico spatio–temporal tissue reconstruction capability, as confirmed by using human embryonic heart and mouse embryo, brain, embryonic heart, and liver lobule with generally high reproducibility (average max. accuracy = 92.0%), while revealing topologically informative genes, or spatial discriminator genes. Furthermore, eSPRESSO was used for temporal analysis of human pancreatic organoids to infer rational developmental trajectories with several candidate ‘temporal’ discriminator genes responsible for various cell type differentiations.
Conclusions
eSPRESSO provides a novel strategy for analyzing mechanisms underlying the spatio–temporal formation of cellular organizations.
Background
Analysis of biological functions and disease mechanisms based on highthroughput singlecell RNAsequencing (scRNAseq) is becoming a widely accepted and fundamental technique in cell biology [1,2,3]. In particular, methods for highthroughput, spatially resolved scRNAseq have been developed and are attracting attention as novel analytics in this field. Singlemolecule fluorescence in situ hybridization (FISH) [4] has been widely used for quantitating transcript numbers at singlecell resolution while preserving 3D locations of cells, often within the context of a diseased tissue of interest. Highly multiplexed methods such as seqFISH [5] or MERFISH [6] have been employed to measure the transcripts for over 10,000 genes of target cells in 3D locations. Highresolution, 2Dgrid primerbased RNA sequencing of tissues fixed on a glass plate has also been developed for mapping transcript abundance and drawing spatial cellular location maps after reconstructing 3D images using multiple 2D maps [7]. However, these methods are still in their infancy and require further improvement in terms of practical costs and convenience for wholeorgan research.
Alternatively, several computational methods for reconstructing 3D tissues by estimating the spatial positions of individual cells using gene expression data obtained by scRNAseq have been reported [8,9,10,11,12,13,14]. These methods can be roughly divided into two approaches: the landmark approach and the ab initio approach. The landmark approach estimates the 3D position of each cell on the basis of gene expression profiles while using the spatial information of marker genes obtained by other experiments such as in situ hybridization [8,9,10]. Conversely, the ab initio approach assigns each cell to a 3D space according to the principal component score calculated from gene expression profiles without using spatial reference information [11,12,13,14]. Although current principal component analysis (PCA)based methods may be insufficient for 3D reconstruction, an ab initio approach that does not depend on the spatial information of marker genes is promising and desired for 3D reconstruction. Other methods for the reconstruction of spatial relationships of cells from nonspatial scRNAseq data have been developed, including novoSpaRc [15], SpaOTsc [16], ScoMap [17], and CSOmap [18], which have provided new biological insights into spatial gene expression patterns and spatially informative genes within tissue. However, although these methods are able to project cell data to a 2D or 3D pseudo space or reference map, the reproducibility of real tissue structure is not thoroughly discussed. Advanced experimental techniques for the simultaneous acquisition of gene expression profiles and cell locations have also been developed and become widely used, such as 10x Visium, Slideseq [19, 20], HDST [21], and Stereoseq [22]. Accordingly, several spatial transcriptome analysis methods that use the output of the aforementioned techniques have been proposed, including SpaGCN [23], Squidpy [24], and SpatialID [25]. However, most of them are aimed at the segmentation of tissue and the detection of spatially informative genes, not at reconstruction.
Previously, we reported a novel 3D reconstruction method using SPRESSO (SPatial REconstruction by Stochastic SelfOrganizing Map) [26], which features gene selections based on gene sets from Gene Ontology (GO) [27, 28]. The method yielded high success rates of 3D reconstructions of mouse gastrula stage embryos and demonstrated a remarkable ability to identify spatial discriminator genes (SDGs) that contribute to differentiation and tissue morphogenesis. This method, however, was preliminary and simply projected four domains of mouse gastrula to only a cubic structure, and thus was inapplicable to more complicated tissue structures of organs such as the heart or pancreas.
In this work, we remove the limitations of structural presentations by introducing graphbased selforganizing map (SOM) clustering for the reconstruction of any topology of cell domains in tissues, as long as they can be drawn as network diagrams or graphs. The basic concept of the graphbased SOM clustering was reported in 1990 by Kohonen et al. [29] who introduced a Kohonen map in 1982 [30] as an artificial neural network and a computationally convenient abstraction building on the biological models of neuronal systems [31] or morphogenesis models by Turing [32]. Many useful topological structures other than square grids, such as hexagonal grids, toroidal grids whose opposite ends are seamlessly connected [33], have been introduced. However, graphnetwork representation is the simplest and most abstract yet comprehensive method for describing the relationships between cell domains in tissues. Here, we applied graphbased SOM clustering to various types of mouse and human organs to reconstruct or infer cellular organizations while revealing informative SDGs that can be ranked using novel virtual knockout (VKO) analysis. Furthermore, we extended graphbased SOM clustering to temporal analysis to elucidate the cell lineage trajectories of human pancreatic organoids. We used uniform manifold approximation and projection (UMAP) [34] to visually confirm topologies and draw new insights from the resultant topologies of cell domains.
Results
All the datasets used in this paper are summarized in Table 1.
Topological clustering of gene expression data by graphbased SOM with gene set optimization
Inspired by the original Kohonen SOM learning theory [30], we extended our previously developed method, SPRESSO [26], to graphbased eSPRESSO (enhancedSPRESSO) topological clustering, which is theoretically applicable to any celltocell relationships of tissues or organs if they are represented as a graph structure. The basic algorithm for topological clustering of cells is a combinatorial optimization to find the best gene sets to reproduce known structures, or a topological guidelines for learning the gene expression vectors of cells, as schematically shown in Fig. 1a. Given the original topology of cell domains, we can calculate the accuracy of topology reproducibility by counting the correct and incorrect edges between cells, either connected or unconnected, after a learning process. We can also calculate the clustering performance without considering the topology of the cell domains using the adjusted Rand index (ARI). As an integrated score of these two metrics, we calculate the clustering score defined by the weighted sum of the accuracy and ARI, where we set equal weights. We used the Markov chain Monte Carlo (MCMC) [35] method to optimize the best gene sets to attain the maximum topological clustering score. The evaluation and optimization processes require a known topological structure, which makes the method not fully unsupervised. However, the purpose of eSPRESSO is not simply to search for marker genes of known domains, but also to search for the best gene sets that reconstruct the adjacency relationship between the domains. Therefore, graphbased SOM learning considers tissue topology to detect spatially contributing genes showing expression gradients across neighboring cell domain networks. We previously tuned the SOM learning performance by introducing a stochasticlearning version of SOM, or stochasticSOM [26], which enhances learning efficiency in the later phase where the extent of learning ability usually decreases monotonically. In preliminary implementations with mouse gastrula embryo (E7.0), which use the same dataset as that in SPRESSO [26], eSPRESSO showed perfect reproducibility, i.e., the average of the maximum scores of ten runs was 2.00 (accuracy: 1.00 + ARI: 1.00) (Table 1). Furthermore, the 87 consensus SDGs that were found three or more times in the ten runs suggest biologically interesting genes that contribute to embryonic development, such as “cell adhesion”, “cell differentiation”, or “nervous system development”, by GO enrichment analysis (Additional file 1: Fig. S1).
We first tested the ability of eSPRESSO with mouse E7.5 gastrula embryo [36], whose topological structures are too complicated to be handled or reproduced by our previous SPRESSO, by generating a graph structure of seven cell domains (Ect1, Ect2, Ect3, PS, MA, MP, E1E2E3) derived from domaintodomain contact relationships in the original paper (Fig. 1a). We optimized the gene set to maximize topology reproducibility by 1000 steps of MCMC calculations. To enhance the efficient optimization and stable reproducibility, we performed replica exchange [37] based on eight parallel MCMC processes (Fig. 1b). In this study, we conducted ten runs of clustering, each of which randomly downsampled ten (or all if only fewer cells existed) cells from each of the domains, to calculate scores. We used gene expression data from 64 sections. The average of the maximum scores of ten runs was 1.93 (accuracy: 0.98 + ARI: 0.95) after the MCMC optimization, and the topological distance map which shows the shortest path distances of samples for the original (lower triangle) and reconstructed (upper triangle) structures is shown (Fig. 1c). A heatmap of gene expression data for the optimized 36 consensus SDGs that were found three or more times in the ten runs is also shown (Fig. 1d). The final topological clustering result for the 83 sections using SDGs was confirmed visually by UMAP (Fig. 1e and Additional file 2: Movie S1).
Reconstruction of simple layer and complicated structures by eSPRESSO
To evaluate the performance of eSPRESSO, we applied the method to a onedimensional layer and complicated topological structures of the cell domains. We first tried to reconstruct mouse liver lobule structure data, where nine layers of cell domains or tissues from a concentric circle exist, as provided by Halpern et al. [10] (Fig. 2a–d). eSPRESSO reconstructed the relationships of 90 cells that are randomly selected from each of the nine domains, and the average of the maximum scores of ten runs was 1.30 (accuracy: 0.82 + ARI: 0.48). The 66 resultant consensus SDGs that were found three or more times in the ten runs represent many known liverspecific genes such as cytochrome P450 (Cyp2f2, Cyp2e1, Cyp2a12, and Cyp2d9), fibrinogen alpha chain (Fga), and hepatocyte growth factor activator (Hgfac), and GO enrichment analysis provided the candidates for layerdependent functions of the liver lobules (Fig. 2c, d and Additional file 3: Table S1). Ten differentially expressed genes, Cdh1, Cyp2e1, Cyp2f2, Gas2, Glul, Gst3, Npr2, Pck1, Por, and Sds, between periportal and perivenous hepatocytes previously reported by Braeuning et al. [38] overlap with our consensus SDGs but many others are also found in eSPRESSO results. This was probably due to the difference between the differentially expressed genes in the two hepatocyte regions in their study and the gradiently expressed genes across the nine layers in our topological clustering. Indeed, the visual projection by UMAP clustering of all 1415 cells using the consensus SDGs indicated that there were clear layer structures although it did not reproduce the circular structure with this layer topology (Fig. 2b and Additional file 4: Movie S2). We also tested the topology reproducibility of the layer structures using scRNAseq data from two areas in the mouse brain neocortex: ALM (anterior lateral motor cortex) and VISp (primary visual cortex), which have three and four layers, respectively [39]. Both showed perfect reproducibility, namely, the average of the maximum scores of ten runs was 2.00 (accuracy: 1.00 + ARI: 1.00) (Table 1).
Next, we attempted to reconstruct complicated topological structures using eSPRESSO. As an example of organs with complicated tissue structures, we used spatial transcriptomics data of the human developmental heart [40]. We first reconstructed eight domain (TVM, CVM, Epi, CBIC, MMV, AMV, OTLV, and AM) structures for embryonic heart at postconceptional week (PCW) 4.5–5 or 6, and seven domain (the same domains as above except CBIC) structures for that at PCW 9 with the average of the maximum scores of ten runs being 1.23–1.50 (accuracy: 0.75–0.86 and ARI: 0.48–0.64) (Table 1). For PCW 9 embryonic heart analysis, we prepared two topological guidelines, dense and sparse network models (Additional file 1: Fig. S2). All of the seven domains were connected either directly or indirectly in the dense network, whereas two separated domain clusters (TVM–CVM–Epi–AM–OTLV and AMV–MMV) were generated in the sparse network (enclosed by red rectangles in Fig. 2e). Because of the uncertainty of the complicated domain relationships in the dense network model, we adopted the sparse network model for eSPRESSO for the 3D reconstruction of the embryonic heart structure. Surprisingly, the resultant UMAP view of all 1358 spots suggests that the two separated domains should be connected, which resembles the original PCW 9 embryonic heart data [40], where the small domain cluster (AMV–MMV) is appropriately located between the OTLV and the other domains (Fig. 2e, f and Additional file 4: Movie S2). This result indicates that eSPRESSO may be useful for reconstructing the spatial relationships of cells, even with limited topological information. Furthermore, the 70 consensus SDGs that were found three or more times in ten runs may suggest biologically interesting genes that contribute to the selforganization of the embryonic heart, such as “heart development” or “cell adhesion”, by GO enrichment analysis (Fig. 2g, h and Additional file 5: Table S2).
Application to virtual knockout experiment
One of the most promising applications of eSPRESSO is in silico VKO experiments because real knockout experiments in human or animal studies may result in lethality or evoke ethical problems. The development of the mouse heart is characterized by the spatially and temporally controlled expansion and differentiation of cardiogenic progenitor cells into an initially single primitive heart tube, from which the complex organ structure of the heart arises. We used scRNAseq data of mouse embryonic heart (E9.25) [41], which represent the pseudocircular structure of seven domains, of which five project from the anterior and posterior second heart fields (AHF and pSHF) (Fig. 3a). We performed ten runs of eSPRESSO clustering and reconstructed the topology with a high average maximum score of 1.79 (accuracy: 0.94 + ARI: 0.85). (Table 1). The visual projection by UMAP clustering of all 3331 cells using the 20 consensus SDGs with the same frequency as above shows a circular representation of tissue domains, which is consistent with the original structure (Fig. 3b and Additional file 6: Movie S3). The consensus SDGs include novel as well as known genes in mouse heart development or cardiac diseases, such as PICALM interacting mitotic regulator (Fam64a) [42], four and a half LIM domain protein 1 (Fhl1) [43], and others (Fig. 3c, d, Table 2 and Additional file 7: Table S3).
To understand the topological importance of SDGs, we performed VKO experiments with eSPRESSO topological reconstruction by deleting genes one by one. For each VKO, we ran topological clustering 100 times with different random seeds and calculated the mean score (Table 2). Among the 20 SDGs, desmin (Des), prothymosin alpha (Ptma), and cyclin D2 (Ccnd2) are the most influential genes for mouse heart structure, as indicated in the literature [44,45,46], which decreases the normal mean score of 1.143 to 1.05, 1.07, and 1.08, respectively. The visual inspection by UMAP clustering for individual VKO experiments (Des^{−}, Ptma^{−}, or Ccnd2^{−}) reveals that the reconstructed topologies had lower domain clustering resolutions with the same plotting parameters as normal SDGs (Fig. 3e and Additional file 6: Movie S3). We then performed multiple VKO experiments by deleting Des and Ptma, which dramatically decreased the mean score to 0.92. We further performed multiple VKO experiments by deleting all three genes, Des, Ptma, and Ccnd2, but the mean score did not decrease further; rather, it slightly increased to 0.97 (Table 2). The UMAP by multiple VKO experiments was indeed aberrant compared with that of normal SDGs, where the domains become unclear and fused (Fig. 3f, g and Additional file 6: Movie S3), indicating that the three genes may be vital to cardiac development. Indeed, the gene expression distribution of the three genes clearly and spatially indicates that the genes are reversely and complementarily upregulated across cardiac regions; Des and Ptma are mainly upregulated in the OFT and pSHF domains, respectively, and downregulated vice versa, whereas Ccnd2 is mainly upregulated in the LV and the pSHF domains but downregulated in the middle of both, that is, in the Atrial domain (Fig. 3h and Additional file 8: Movie S4). This tendency of the gene expression gradient was also confirmed in the reconstructed model of mouse developmental heart by novoSpaRc [15], namely Des and Ptma showed opposite gene expression patterns, and the expression region of Ccnd2 overlaps with that of Des and Ptma (Additional file 1: Fig. S3).
Developmental analysis of human pancreatic organoids by eSPRESSO
One of the powerful characteristics of eSPRESSO is demonstrated by the analysis of organs during the developmental processes. We used scRNAseq data of human pancreatic organoids from stages 3, 4, 5, and 6 by Veres et al. [47], and attempted to reconstruct developmental or temporal structures using cells from all stages. We first combined data from all four stages to generate a coarse topological guideline of ten cell types (PDX1^{+}, NKX61^{+}, NEUROG3^{+}, SCα, SST^{+}HHEX^{+}, FEV^{high}ISL^{low}, SCβ, SCEC, CHGA^{+}FOXJ1^{+}, and Exocrine) on the basis of the developmental model proposed in Figure 5 in the original paper [47] (Fig. 4a). The eSPRESSO clustering result showed very low topology reproducibility, namely, the average of the maximum scores of ten runs is 1.15 (accuracy: 0.76 + ARI: 0.39) (Table 1). In this case, the accuracy, which indicates topological relevance, may be sufficient, although not perfect, the ARI is unexpectedly low, which implies poor clustering results (Fig. 4b and Additional file 9: Movie S5). This is probably due to the data incongruity problem because many of the cells are assigned to the same cell type for different stages (such as SCα and SST^{+}HHEX^{+} in stages 4, 5, and 6; and SC β, SCEC, SST^{+}HHEX^{+}, CHGA^{+}FOXJ1^{+}, and Exocrine in stages 5 and 6) by the original paper. There are 83 consensus temporal discriminator genes (TDGs), including pancreasspecific genes such as insulin (INS) and insulin gene enhancer protein ISL1 (ISL1) [48] (Fig. 4c, d and Additional file 10: Table S4). The VKO analysis indicated that proglucagon (GCG), ISL1, and somatostatin (SST) were the most influential genes for pancreas development (Additional file 11: Table S5). Indeed, GCG is expressed in alpha cells [49], ISL1 is involved in embryogenesis of the islets of Langerhans [50], and SST proteins are produced in delta cells [51] (Fig. 4f and Additional file 9: Movie S5). In the reconstructed model of the human developmental pancreas using novoSpaRc, no significant differences were observed in the gene expression patterns of GCG, ISL1, and SST. In addition, subtle gene expression gradients, as shown by eSPRESSO were not observed using novoSpaRc (Additional file 1: Fig. S4).
Although the task was highly challenging, we devised a method for inferring a possible developmental model from the eSPRESSO results. We used consensus TDGs to generate a gene expression matrix for all the cells (83 genes by 20,205 cells). We then reduced the matrix to the original 18 cell domains or types (separated by individual stages) by taking the average within each cell domain. Using this reduced matrix, we inferred a sparse graphical Gaussian model (GGM) of cell type network based on gene expression covariance matrix using EBICglasso in qgraph package in R [52]. Considering the directions of the developmental stages, we further reduced the original 18 cell type network into a ten cell type developmental model by aggregating the same cell type from different stages if they have 0.2 or more partial correlations (Fig. 4e and Additional file 1: Fig. S5). Interestingly, the resultant developmental model was significantly different from the original model. For example, NEUROG3^{+} cells are precursors of SCβ and SCEC in the coarse guideline model, but these cells are also the precursors of SCα in the inferred model. Similarly, NKX61^{+} cells are the precursors of NEUROG3^{+} cells in both the coarse and inferred models, but they are also the precursors of Exocrine cells in the inferred model. These results were attributed to the actual gene expression gradients of the 83 TDGs. Indeed, a review paper on human pancreatic development indicated that NEUROG3^{+} cells can be the precursors of both SCα and SC β and that NKX61^{+} cells can be the precursors of duct (exocrine) cells [53]. Interestingly, however, scRNAseq analysis using Pdx1GFP or other gene transgenic mouse lines to trace the pancreatic lineage revealed two αcell generation pathways that directly generate α cells from early Pdx1^{+} cells, called the first wave, in addition to the known late α—and βlineage differentiation [54].
Discussion and conclusions
We developed a computational method that combines cell clustering using stochasticSOM under topological constraints and gene set optimization using MCMC calculations. In general, the reproducibility of the known topology of cell domains is successful (Table 1), and this method may innovatively add more information to the existing clustering method by finding spatio–temporally distributed discriminator genes (SDGs or TDGs). The S(T)DGs identified by eSPRESSO showed large overlaps with the cluster biomarkers selected by Seurat [55], and all of the S(T)DGs were contained in the cluster biomarkers in some cases (Additional file 1: Fig. S6). Because the number of Seurat’s marker genes in the default settings is much higher than the number of S(T)DGs, the results make sense. As expected, most of the S(T)DGs were found to be statistically significant in Seurat with the adjusted pvalues calculated by the Wilcoxon rank sum test between one domain and the other domains. (Additional file 1: Fig. S6). However, some genes did not show sufficient statistical significance, meaning that eSPRESSO was able to identify gene sets that were difficult to obtain by differentially expressed gene analysis with a simple statistical test. In addition, eSPRESSO was able to infer the developmental architecture of cell trajectories even with limited knowledge or coarse topological guidelines of the developmental model. Using in silico VKO analysis with eSPRESSO, it may be also useful to investigate the effects of individual genes on the topological structure, which may be difficult or impossible to accomplish in some cases in real experiments.
As mentioned in the background, direct experimental methods for the 3D reconstruction of tissues such as highly multiplexed FISH or direct spatially resolved transcriptomics have been proposed [5,6,7]. The primary purpose of these methods is to map or reconstruct cellular locations using gene expression data. Although these methods can contribute to detecting differentially expressed genes among distinct tissue or cell domains, they do not consider tissue topology, and thus, may miss the detection of spatially contributing genes that show expression gradients across neighboring cell domain networks. eSPRESSO may complement existing spatial transcriptomics methods by detecting gradiently or globally expressed genes that contribute to cell domain network organizations in a spatio–temporal manner. We would like to emphasize here that ARI may be high, but accuracy may be low for unstructured or complex tissues, but the results of human heart (PCW 9) analysis also show that eSPRESSO is effective in reconstructing the spatial relationship between domains even when the topology information is limited. To investigate the topological limitation of eSPRESSO, we performed computational experiments using synthetic singlecell datasets generated by the dyngen package [56]. Owing to the high calculation cost, we tested five of 14 dyngen singlecell transcription factor network models: linear, cycle, bifurcating, branching, and disconnected backbones [56], and generated their cellular topology graphs (Additional file 1: Figs. S7–10). eSPRESSO was able to reproduce the input topologies with nearperfect accuracy (i.e., the mean values of maximum accuracies and maximum ARIs were higher than 0.99 and 0.97, respectively) with SDGs of various expression gradient patterns regardless of the input topology graphs (Additional file 1: Figs. S9 and S10). Furthermore, when we created ten modified graphs for each dataset by randomly reversing the presence or absence of edges for each vertex pair, we observed that eSPRESSO is robust in general, that is, the mean maximum scores were higher than 1.85 for a low input cellular topology misspecification rate of 10% for all backbone configuration, and higher than 1.90 even for a high input misspecification rate of 50% except for bifurcating or branching backbone configuration (Additional file 1: Fig. S10).
eSPRESSO also has several benefits in the analysis of human organ development, where many genes give rise to homozygous lethality in knockout experiments. Although animal studies have presented evidence of candidate lethal genes in humans, it is not possible to reproduce gene knockout organs in humans because of ethical and technical reasons. Human organoids derived from the induced pluripotent stem (iPS) or embryonic stem (ES) cells show promise as an alternative approach for mimicking gene knockout organs in the human body. However, the occurrence of abnormalities is often stochastic and affected by other internal or environmental factors, and a large number of repeated experiments are sometimes required to reproduce the phenotypes [57]. In contrast, although our in silico VKO method is still in the preliminary stage, it does not require any finetuned protocols and reproduces the same results with a low calculation cost. In addition, such virtual experiments are possible not only for VKO experiments but also for virtual knockin (VKI) experiments, and both are a promising alternative to costly in vitro and in vivo experiments, because the experiments can be performed regardless of the number or combination of knockin and knockout genes.
SOM learning, first proposed by Kohonen and colleagues [30], is an established method in the field of clustering. They introduced graphbased SOM clustering, but this method has few biological applications [29]. The original SOM learning monotonically decreases the learning regions and efficiency, which causes problems in the clustering resolution in the later phase. We found that efficient scheduling for learning is necessary for practical application to current singlecell transcriptomics datasets. We introduced stochasticSOM, which has similar behavior to the Gibbs sampling approach where SOM clustering is continued by random sampling even in the late learning phase. This improvement dramatically enhanced the clustering resolution and enabled efficient implementation of graphbased SOM clustering.
Finally, we would like to stress that our current algorithms and analysis pipeline reported herein are still primitive and limited, and many improvements are necessary. First, there are cases in which certain difficulties remain in the creation of topology graphs. For a onedimensional layered structure, such as a mouse brain (ALM and VISp), or a relatively simple tissue structure, such as a mouse embryo, it is possible to easily create a topology graph based on general biological knowledge. However, it is sometimes difficult to accurately represent complex topological structures, such as the heart, in graphs. In fact, for the human heart dataset, we created a dense model that considers all indefinite adjacency relations as connected and a sparse model that considers those as not connected. In addition, graph resolution was not determined by itself, but depended on the purpose of the user’s analysis. As an example guideline for automatic determination, it may be possible to use the clustering results from Louvain’s method [58].
Second, the extraction of initial gene candidates is solely dependent on the Random Forest algorithm that detects feature genes by the Boruta package in R and may miss important SDGs or TDGs that overlap categories or cell domains in distant branches or leaves. Although the input genes to Boruta were selected on the basis of the frequency of nonzero expression in all cells and the standard deviation in the computational experiments, we observed no significant change in the structure reconstruction performance in our datasets even when highly variable genes selected simply by Seurat were used as the input genes to Boruta (Additional file 1: Fig. S11 and Additional file 12: Table S6). The replica exchange MCMC calculation is also a suboptimal approach, although it enhances combinatorial optimization to allow quick convergence to the global optimum, which sometimes fails to find the best gene set to reproduce the topology using SOM clustering. Furthermore, the selected feature gene sets differ across runs. Therefore, in the computational experiments, genes detected as SDGs three or more times in ten runs were used as consensus SDGs in subsequent analyses. This experimental approach has contributed to improving the reliability of SDGs even for singlecell and minibulk data, which are often noisy.
Third, it is difficult for eSPRESSO to determine whether the differences in gene expression patterns in SDGs are due to spatial distributions or cell type differences. To address this limitation, a more effective algorithm that detects hierarchical relationships of gene expression profiles should be developed. Furthermore, the computation time in this study ranged from 1 to 10 days on 2.60 GHz E52670 Intel Xeon CPU, depending on the amount and complexity of data. We expect that more efficient and finegrained algorithms for finding the global optimum in an acceptably short time may replace the modules in our pipeline in the future.
Methods
Data collection and preprocessing
To confirm the performance of eSPRESSO, 14 datasets of human and mouse transcriptomes were collected from seven papers [10, 36, 39,40,41, 47, 59]. Each dataset was used as a log10transformed expression profile, and genes whose expression values are greater than 1.0 in at least two samples and whose standard deviations across all samples are greater than 0.05 were extracted. Adjacent matrices of spatial domains were constructed on the basis of original papers and biological knowledge by the authors of this paper.
Mm embryo (E7.0)
Peng et al. collected transcriptome profiles of embryo sections by laser microdissection [59]. We downloaded and used E1 dataset from GEO (GSE65924) [60], which is one of the triplicates of single embryos. The dataset contains 41 sections (~ 20 cells per sample) with four spatial domains (d1: anterior, d2: lateraldistal, d3: lateralproximal, and d4: posterior). The expression values were saved as fragments per kilobase of exon per million mapped reads (FPKM).
Mm embryo (E7.5)
Peng et al. collected transcriptome profiles of embryo sections at various developmental stages by laser microdissection [36]. We downloaded reference samples of E7.5 from GEO (GSE120963). The dataset contains 83 sections (20–40 cells per sample) with nine spatial domains (Ect1–3: ectoderm, PS: primitive streak, MA: anterior mesoderm, MP: posterior mesoderm, E1–3: endoderm). To simplify the domain structure, we integrated E1–3 into En as a single domain. The expression values were saved as FPKM.
Mm brain (ALM and VISp)
Tasic et al. collected transcriptome profiles of two regions of adult mouse cortex: anterior lateral motor cortex (ALM) and primary visual cortex (VISp) [39]. We downloaded exon count datasets of ALM and VISp from GEO (GSE115746), respectively. For both datasets, we extracted 3809 and 7049 single cells belonging to clusters with distinct layer numbers (L2/3, L4, L5, L6, and L6b), respectively, where the L6b cells were merged to the L6 cluster. ALM consists of three spatial domains of L2/3, L5, and L6, whereas VISp consists of four spatial domains of L2/3 to L6. The expression values were saved as raw count data and transformed into counts per million mapped reads (CPM) values when analyzed.
Mm heart (E7.75, E8.25, and E9.25)
de Soysa et al. collected transcriptome profiles of mouse heart at three developmental stages: E7.75, E8.25, and E9.25 [41]. We downloaded source data of all developmental stages from the supplementary files of the original paper. In addition, we requested the expression profiles of sinus venosus (SV) and atria at E9.25 to the authors of the original paper for integration with the downloadable E9.25 data. For E7.75 data, the anterior heart field (AHF), the left ventricle (LV), atria (Atrial), and the posterior second heart field (pSHF) of wild type were extracted and the total number of single cells was 1259. For E8.25 and E9.25 data, 3331 and 3911 single cells of AHF, the SHFderived outflow tract (OFT), the right ventricle (RV), LV, atria (Atrial), SV, and pSHF were extracted, respectively. The expression values were saved as logtransformed unique molecular identifier (UMI) counts.
Mm liver
Halpern et al. collected transcriptome profiles of mouse liver and estimated their lobule coordinates by a panel of zonated landmark genes [10]. The authors provided a singlecell gene expression profile and a posterior probability matrix showing the probabilities of being the original layer for each single cell, and they can be downloaded as supplementary files of the original paper. In order to simplify the input data, we determined the layer that gives the maximum probability as the original layer for each single cell. This dataset consists of 1415 single cells with nine domains (layers 1–9). The expression values were saved as raw UMI counts and transformed into CPM when analyzed.
Hs pancreas organoid (S3–S6)
Veres et al. collected transcriptome profiles of human pancreas at four differentiation stages [47]. We downloaded the preprocessed gene expression profile of Protocol x1 from four stages 3–6 (S3–S6) from GEO (GSE114412), where profiles of replications were removed. Dataset S3 consists of 5955 single cells of PDX1^{+} progenitors (pdx1). For dataset S4, the expression profiles consist of 5273 single cells with five domains: NKX61^{+} progenitors (nkx61), NEUROG3^{+} progenitors (neurog3), SC\(\alpha\) (sc_alpha), SST^{+}HHEX^{+} (sst_hhex), and FEV^{high}ISL^{low} (fev_high_isl_low). Datasets S5 and S6 consist of 3926 and 5051 single cells from six domains: SC\(\beta\) (sc_beta), SC\(\alpha\) (scalpha), SCEC (scec), CHGA^{+}FOXJ1^{+} (chga_foxj1), SST^{+}HHEX^{+}, and Nonendocrine (exo). These datasets were merged into one dataset of 20,205 single cells with complementing missing gene expression values from raw count data provided by GEO (GSE114412). The expression values of these datasets were saved as UMI counts and transformed into CPM when analyzed.
Hs heart (PCW 4.5–5, PCW 6.5, and PCW 9)
Asp et al. collected transcriptome profiles of human heart at three developmental stages in the first trimester: postconception weeks (PCWs) 4.5–5, 6.5, and 9 [40]. Each expression profile and the corresponding metadata are downloadable from a data repository (https://www.spatialresearch.org). The spatial transcriptome data of PCW 4.5–5, PCW 6.5, and PCW 9 consist of 238, 1515, and 1358 spots of tissue sections. The PCW 4.5–5 and PCW 6.5 data cover eight domains: compact ventricular myocardium (CVM), trabecular ventricular myocardium (TVM), atrial myocardium (AM), outflow tract and large vessels (OTLV), atrioventricular mesenchyme and valves (AMV), mediastinal mesenchyme and vessels (MMV), cavities with blood and immune cells (CBIC), and epicardium (Epi), whereas the PCW 9 data cover seven domains (the same domains as above except CBIC).
Overview of eSPRESSO
Our previously reported SPRESSO [26], which is a 3D reconstruction method using SOM clustering and GObased feature gene selection, has achieved a high success rate in the 3D reconstruction of mouse gastrula structure and shown a remarkable ability to identify SDGs that contribute to differentiation and tissue morphogenesis. However, it projects the mouse gastrula structure into a simple cubic structure composed of four domains and is therefore inapplicable to more complex tissues. In addition, whereas the feature gene selection using GO has enabled us to search feature genes on the basis of their functions, it is difficult to expand the search space to combinations of GOs from the perspective of computation time because more than 40,000 GOs have been defined.
eSPRESSO is able to overcome the limitations of SPRESSO through the introduction of graphbased SOM clustering and gene set optimization by the MCMC framework. The graphbased SOM clustering enables the reconstruction of any topology of cell domains in tissues, as long as they can be drawn as network diagrams or graphs, thereby greatly expanding the applicability of eSPRESSO. Meanwhile, the gene set optimization by the MCMC framework enables more flexible search for feature gene sets that is not restricted to the definitions of GOs while limiting the search space to promising areas by combining with Boruta, a feature gene selection method. Details of graphbased SOM clustering and gene set optimization by the MCMC framework are described in the following sections.
Stochastic selforganizing map (stochasticSOM) clustering
The selforganizing map (SOM) is an unsupervised clustering method proposed by Kohonen [30]. In general, SOM projects input highdimensional data onto a limited number of output classes or units, so that different units with similar centroid vectors are placed close to each other in a mapping layer that is usually given in a 2D plane. Let \({\varvec{X}} = \left( {{\varvec{x}}_{1} ,{\varvec{x}}_{2} , \ldots ,{\varvec{x}}_{n} } \right)\) be a set of input samples with the \(p\)dimensional vectors, i.e., \({\varvec{x}}_{j} = \left( {x_{j1} ,x_{j2} , \ldots ,x_{jp} } \right)\) \(\left( {j = 1, 2, \ldots , n} \right)\). The mapping layer consists of \(k\) units, and their centroid vectors \({\varvec{m}}_{i} = \left( {m_{i1} , m_{i2} , \ldots ,m_{ip} } \right)\) \(\left( {i = 1, 2, \ldots , k} \right)\) are randomly initialized and assigned to each unit. The similarity between input sample \(j\) and all units \(i\) is defined by the Euclidean distance. First, the SOM algorithm finds the unit \(c\) with the highest similarity according to the following equation as the best matching unit (BMU).
where \(\left\ \cdot \right\\) denotes the Euclidean distance, or norm of a vector, and \(t\) is time step. The centroid vector \({\varvec{m}}_{i} \left( t \right)\) of all units of the mapping layer at time \(t\) is updated by the following equations.
where \(h_{ci} \left( t \right)\) is a neighborhood function that determines how much \({\varvec{m}}_{i} \left( t \right)\) receives the learning influence of \({\varvec{x}}_{j}\) when it is updated. \(\alpha \left( t \right)\) and \(\sigma \left( t \right)\) are the learning rate parameter and a function defining the radius of the neighboring region, respectively. In addition, \({\varvec{r}}_{c}\) and \({\varvec{r}}_{i}\) are the position vectors in the mapping layer of units \(c\) and \(i\). The SOM algorithm repeats updates of \({\varvec{m}}_{i}\) until the learning step \(t\) reaches \(T,\) which is given as a parameter for all input samples \(j\).
In the general SOM clustering, its result is affected by the order in which the samples are input. To eliminate this effect, the batchlearning SOM was also proposed [30]. In the batchlearning SOM, \({\varvec{m}}_{i} \left( t \right)\) is updated only after all samples are given by the following equations.
The general SOM learning often converges to local minima in early steps if the number of units in the mapping layer is extremely small. In order to increase the possibility of escaping from the local minima and reaching the global maxima, a stochasticSOM that introduces a random variable into the neighborhood function has been proposed, which makes the learning converge gradually [26]. The neighborhood function of the stochasticSOM is
where \({\text{rnd}}\left[ {0.5, 1} \right)\) is a function that generates uniform random values of at least 0.5 and less than 1.0.
Graphbased stochasticSOM clustering
A graph \(G = \left( {V,E} \right)\) is a pair of finite nonempty set \(V\) and finite set \(E \subseteq V \times V\). The elements \(u,v \in V\) of graph \(G\) are called vertices, and the elements \(e = \left\{ {u,v} \right\} \in E\) are called edges. The sets of vertices and edges of graph \(G\) are denoted as \(V\left( G \right)\) and \(E\left( G \right)\), and their numbers are denoted as \(\left {V\left( G \right)} \right\) and \(\left {E\left( G \right)} \right\), respectively. A path on \(G\) is a nonempty graph \(P\left( G \right) = \left( {V,E} \right)\), where \(V = \left\{ {v_{i} ,v_{i + 1} , \ldots ,v_{j} } \right\}\) and \(E = \left\{ {\left\{ {v_{i} ,v_{i + 1} } \right\},\left\{ {v_{i + 1} ,v_{i + 2} } \right\}, \ldots ,\left\{ {v_{j  1} ,v_{j} } \right\}} \right\}\), and all \(v_{k}\) values are distinct. The distance \(d_{G} \left( {u,v} \right)\) between two vertices \(u\) and \(v\) on \(G\) is given by the length (i.e., the number of edges) of the shortest path between \(u\) and \(v\). Graph \(G\) is often represented by square matrix \(A\left( G \right) = \left[ {a_{ij} } \right] \left( {i,j = 1,2, \ldots ,\left {V\left( G \right)} \right} \right)\) that shows the adjacency between vertices; this matrix is called an adjacency matrix, where \(a_{ij} = 1\) if \(\left\{ {v_{i} ,v_{j} } \right\} \in E\left( G \right),\) otherwise \(a_{ij} = 0\) for the vertices \(v_{i}\) and \(v_{j}\) corresponding to \(i\) and \(j\), respectively. Note that \(a_{ij} = 0\) when \(i\) and \(j\) are identical because selfloops are not assumed.
To improve the performance of stochasticSOM, we newly propose graphbased SOM (graphSOM) clustering. In the graphSOM, the mapping layer is given by graph \(G\) represented by the adjacency matrix \(A\left( G \right)\). Although in the general SOM, the distance between units \(i\) and \(j\) in the mapping layer is computed by the Euclidean distance \(\left\ {{\varvec{r}}_{i}  {\varvec{r}}_{j} } \right\\) between the corresponding position vectors \({\varvec{r}}_{i}\) and \({\varvec{r}}_{j}\), each unit corresponds to a vertex on \(G\) and the distance between the units is given by distance \(d_{G} \left( {v_{i} ,v_{j} } \right)\) between the vertices \(v_{i}\) and \(v_{j}\) on \(G\) in the graphSOM. Therefore, the neighborhood function \(h_{ci} \left( t \right)\) at time \(t\) of the stochastic graphSOM is given by the following equation.
Evaluation of graphSOM clustering results
The clustering result of the stochastic graphSOM is evaluated on the basis of two criteria: prediction accuracy (\(Accuracy\)) and adjusted Rand index (ARI). The score function is defined by the following equation:
where \(a\) is a constant parameter that adjusts the weight of ARI for Accuracy and \(a = 1.0\) is employed as the default setting of eSPRESSO.
Prediction accuracy: accuracy
For a pair of cell samples \(c_{i}\) and \(c_{j}\), let \(d_{i}\) and \(d_{j}\) be the true domains to which they belong, and let \(\hat{d}_{i}\) and \(\hat{d}_{j}\) be the domains to which they are estimated to belong by the stochastic graphSOM. Here, assuming that \(s_{xy}\) is an element of the adjacency matrix \(A\left( G \right)\) corresponding to the input graph \(G\), the prediction score \(s_{ij}\) for a pair \(\left\{ {c_{i} ,c_{j} } \right\}\) is given by the following equation
Therefore, the prediction accuracy \(Accuracy\) for all cell sample pairs is defined by the following equation
where n is the total number of cell samples.
Adjusted Rand index (ARI)
The adjusted Rand index (ARI) measures the similarity between two clustering results [61]. Let \({\mathcal{X}} = \left\{ {{\mathcal{X}}_{1} ,{\mathcal{X}}_{2} , \ldots ,{\mathcal{X}}_{m} } \right\}\) and \({\mathcal{Y}} = \left\{ {{\mathcal{Y}}_{1} ,{\mathcal{Y}}_{2} , \ldots ,{\mathcal{Y}}_{m} } \right\}\) be families of sets of cell samples, where \(m\) is the number of domains. The overlap of cell samples between \({\mathcal{X}}_{i}\) and \({\mathcal{Y}}_{j}\) is denoted by \(n_{ij} \left( { = \left {{\mathcal{X}}_{i} \cap {\mathcal{Y}}_{j} } \right} \right)\). The number of cell samples belonging to \({\mathcal{X}}_{i}\) (resp. \({\mathcal{Y}}_{j}\)) can be represented by using \(n_{ij}\) as \(a_{i} = \mathop \sum \limits_{j = 1}^{m} n_{ij}\) (resp. \(b_{j} = \mathop \sum \limits_{i = 1}^{m} n_{ij}\)). Therefore, ARI can be defined by the following equation:
where \(n\) is the total number of cell samples. Here, assuming that \({\mathcal{X}}\) and \({\mathcal{Y}}\) are the true family of domains and the family of domains estimated by the stochastic graphSOM, respectively, the similarity between the true domain classification and the estimated domain classification can be computed by ARI.
Optimization of cluster allocation of graphSOM
When applying eSPRESSO to tissues whose topological structure is unstructured or complex, GraphSOM results often have high ARI and low prediction accuracy because ARI itself calculates the clustering accuracy without considering the celltocell adjacency between the clusters. Therefore, eSPRESSO introduces an operation to optimize the allocation of clusters to the vertices on graphSOM by swapping the allocation while preserving the members of each cluster. The swapping operation occurs on the basis of the simulated annealing (SA) strategy [62]. A pair of clusters \(i\) and \(j\) is selected to be swapped according to the selection probability \(p_{i,j}\), which is weighted by the postswapping accuracy computed in advance.
where \(z_{i,j} = \frac{{acc_{i,j}  \mu }}{\sigma }\), and \(\mu\) and \(\sigma\) are the mean value and the standard deviation of the accuracy \(acc_{i,j}\) for the case that clusters \(i\) and \(j\) are swapped. \(m\) is the number of clusters and \(c\) is a constant parameter defined by \(c = \sqrt {m/2}\). Once a pair of swap candidates is selected, whether or not the swap is adopted is determined according to the following adoption probability \(p_{SA}\) based on the SA strategy.
where \(\Delta f\) is the difference in accuracy before and after the cluster swapping (i.e., \(a_{before}  a_{after}\)). \(T_{t}\) is a temperature parameter at the \(t\)th learning step, which exponentially decreases from 1.0 to 0.001. Finally, when clusters \(i\) and \(j\) are swapped, the centroid vectors of the corresponding vertices on the graphSOM are also swapped.
Optimization of gene set by Markov chain Monte Carlo (MCMC) framework
In order to obtain the spatial discriminator genes (SDGs), eSPRESSO employs Random Forestbased feature gene selection method Boruta [63] and replica exchange Markov chain Monte Carlobased gene set optimization.
Feature gene selection by Boruta
Kursa and Rudnicki proposed the Boruta method for a Random Forestbased algorithm for feature selection [63]. One of the properties of Boruta is to classify features into three classes: confirmed, tentative, and rejected, rather than order them. In eSPRESSO clustering, we selected from 64 to 2377 confirmed and tentative genes as features for each dataset by increasing the value of parameter `maxRuns` in the Boruta package of programming language R.
Replica exchange MCMC optimization
After obtaining the feature genes by Boruta, eSPRESSO searches for the optimum combination of the feature genes by the replica exchange MCMC framework and then outputs the SDGs. The replica exchange MCMC is an extended algorithm of MCMC for improving sampling efficiency [37]. In a general simulated annealing (SA) algorithm [62], which is one of the optimization algorithms based on MCMC sampling, there is only one temperature parameter that determines whether to adopt or reject the newly obtained sample, and the probability of being adopted is relatively high even for samples with a large energy difference when the temperature is high. However, it is difficult to get out of the local minima at a low temperature. As a result, the probability of being rejected increases and sampling efficiency decreases. In the replica exchange MCMC, multiple systems called replicas with different parameters are simulated at the same time, and the states of the replicas are exchanged between different temperatures according to the following exchange probability:
where \(E_{k}\) and \(T_{k}\) are the energy and the temperature of replica \(k\), respectively. By updating the variables of each replica at their respective temperatures and moving on the temperature axis at the same time, its longterm behavior can be regarded as a random walk.
In eSPRESSO, sampling in each replica is done by the following MCMC sampling algorithm.
 (Step 1):

Let \(G\) be an empty set and initialize the score of \(G\) with negative infinity and let \(G_{b}\) be the feature gene set obtained by Boruta.
 (Step 2):

Select \(n^{\prime}\) genes from \(G_{b}\) at uniformly random (i.e., according to equal probability) and let \(G\prime\) be the set of \(n^{\prime}\) genes.
 (Step 3):

Repeat the following procedures \(T\) times.

i.
Generate candidate gene sets \(\mathcal{G}= \left\{ {G_{cand}^{1} ,G_{cand}^{2} , \ldots ,G_{cand}^{N} } \right\}\) by \(G_{cand}^{i} = G^{\prime} \cup \left\{ {g_{i} } \right\}\) and \(G_{cand}^{j} = G\prime \setminus \left\{ {{\text{g}}_{{\text{j}}} } \right\}\) for all \(g_{i} \in G_{b} \setminus G^{\prime }\) and all \(g_{j} \in G^{\prime}\).

ii.
Remove gene sets \(G_{cand}^{i}\), which are already sampled.

iii.
If \(\mathcal{G}\) is an empty set, replace gene \(g_{j} \in G^{\prime}\) with another gene \(g_{k} \in G_{b} \setminus \left\{ {g_{j} } \right\}\) and add them to \(\mathcal{G}\).

iv.
For all \(G_{cand}^{i}\), execute the stochastic graphSOM clustering and compute the scores of the clustering results (see the Section Evaluation of graphSOM clustering results for more details).

v.
Define selection probability \(p_{i}\) for \(G_{cand}^{i}\) by
$$p_{i} = \frac{{\left( {{\text{exp}}\left( {z_{i} } \right)} \right)^{c} }}{{\mathop \sum \nolimits_{i = 1}^{N} \left( {{\text{exp}}\left( {z_{i} } \right)} \right)^{c} }},$$where \(z_{i} = \frac{{s_{i}  \mu }}{\sigma }\), and \(\mu\) and \(\sigma\) are the mean value and the standard deviation of the scores \(s_{i}\) of \(G_{cand}^{i}\), respectively. \(c\) is a constant parameter and \(c = \sqrt {\left \mathcal{G} \right/2}\) is employed in the computational experiments in this paper.

vi.
Determine whether to adopt or reject \(G_{cand}^{x}\) according to the adoption probability \(p_{SA}\) of the SA strategy.
$$p_{SA} = \left\{ {\begin{array}{*{20}c} {1 \left( {\Delta f \le 0} \right)} \\ {exp\left( {\frac{  \Delta f}{{T_{t} }}} \right) (\Delta f > 0)} \\ \end{array} } \right.$$Note that eSPRESSO handles the maximization problem for the score \(s\), so that the difference \(\Delta f\) is defined by \(\Delta f =  \left( {s_{x}  s} \right)\), where \(s_{x}\) and \(s\) are the scores of \(G_{cand}^{x}\) and \(G\), respectively.

vii.
Update \(G^{\prime}\) by \(G_{cand}^{x}\) if adopted, and update also \(G\) by \(G_{cand}^{x}\) if the score of \(G_{cand}^{x}\) is larger than that of \(G\).

i.
 (Step 4):

Output \(G\) as the optimized gene set.
Gene Ontology (GO) analysis
eSPRESSO generates (sub)optimal gene sets for topological clustering. We repeated eSPRESSO clustering with different random seeds and determined consensus spatial or temporal discriminator genes that are found three or more times in ten runs. Then, we used biomaRt package [64] in R4.0.5 to count the numbers of consensus genes and all genes assigned to each GO term in “Biological process” and “Molecular function” categories. Using phyper function in R, we calculated pvalues on the basis of the hypergeometric distribution model.
Virtual knockout analysis
To analyze the dependencies of the spatial or temporal discriminator genes, we performed eSPRESSO clustering 100 times with different seeds for the same data by removing genes in a combinatorial manner, and calculated the mean score. We ranked the genes according to the decrease of the mean score to find the most to least influential genes for clustering.
Glasso for sparse developmental network estimation
Once we obtained a gene by cell domain matrix by averaging the gene expression of consensus gene sets, we inferred a sparse graphical Gaussian model (GGM) of cell domain network based on gene expression covariance matrix using EBICglasso with gamma = 0.5 and nlambda = 1000 in qgraph package [65] in R.
Reconstruction of spatial arrangement by novoSpaRc
To compare the gene expression gradients in models reconstructed by eSPRESSO with those obtained by another approach, we reconstructed the mouse developmental heart and human developmental pancreas by novoSpaRc [15]. We employed a circle shape as a target space with the same number of locations as the number of cell samples and reconstructed the tissues using `reconstruct` function with alpha_linear = 0 and epsilon = 5e−3, where the human developmental pancreas data were downsampled to 5000 cells in advance.
Cluster biomarker detection by Seurat
eSPRESSO identifies SDGs or TDGs by graphbased SOM clustering and MCMC optimization. As an alternative approach, differential gene expression analysis acquires domainspecifically expressed genes without considering the adjacency between domains. To compare S(T)DGs and such domainspecifically expressed genes, we performed cluster biomarker detection by Seurat [55] using the `FindMarkers` function for all datasets. For each dataset, biomarkers with adjusted pvalues less than 0.05 and log2fold change greater than 0.25 were collected from each domain. Venn diagrams comparing the consensus S(T)DGs identified by eSPRESSO and the cluster biomarkers identified by Seurat were generated by the `VennDiagram` package in R. Furthermore, to clarify the statistical significance of those S(T)DGs when comparing between domains, we performed enrichment analysis of S(T)DGs for the gene lists ranked according to the adjusted pvalues obtained by the Wilcoxon rank sum test between one domain and the other domains for each dataset by using the `fgseaMultilevel` and `plotEnrichment` functions in `fgsea` package in R.
Synthetic singlecell datasets and topology graphs
In order to evaluate the performance of eSPRESSO, we performed additional computational experiments using synthetic singlecell datasets. Although eSPRESSO requires a set of a gene expression data and a topology graph of cell domains, as far as we know, there is no method to artificially create a topology graph on the basis of real singlecell dataset. Therefore, first, we created synthetic singlecell datasets by using `dyngen` in R package [56] with a standard procedure (i.e., applying `initialise_model` and `generate_dataset` functions), which can simulate biological data for 14 predefined transcription factor network models including linear, cycle, bifurcating, branching, and disconnected backbones configurations, and then generated the topology graphs of cells based on the Pearson correlation coefficients of these gene expression data. When generating the topology graphs, we first identified cell clusters by Louvain’s clustering [58] for each synthetic singlecell dataset after standard preprocessing by Seurat (i.e., applying `NormalizeData`, `FindVariableFeatures`, `ScaleData`, `RunPCA`, `FindNeighbors`, and `FindClusters` functions). Finally, we generated the topology graphs of cells using the Pearson correlation coefficients of the centroids of identified clusters, where the clustering cutoff was set between 0.70 and 0.85 so that the resulting topology graphs hold consistency, i.e., disconnected for disconnected dataset and connected for the other datasets.
Impact of misspecification of input topology
In order to clarify the impact of input topology misspecification on the prediction accuracy and the ARI, we performed randomization using the synthetic topology graphs. We selected five datasets derived from typical dyngen models with linear, cycle, bifurcating, branching, and disconnected backbones from the 14 synthetic datasets created in the previous section. Then, we created ten modified disconnected graphs for disconnected and ten connected graphs for the other datasets by reversing the presence or absence of edges for each vertex pair according to the randomization probability \(p\), which takes a value of 0.1, 0.2, or 0.5. When applying eSPRESSO, we calculated the prediction accuracy and the ARI for each synthetic dataset by using the modified topology instead of the original topology together with the synthetic expression count matrix as input.
Availability of data and materials
The proposed methods including feature gene selection, 3D reconstruction using stochasticSOM clustering, and visualization are implemented in R and available at https://github.com/tmorikuicr/espresso. All the gene expression data used in this study are included in the package.
Abbreviations
 ARI:

Adjusted Rand index
 BMU:

Best matching unit
 CPM:

Counts per million mapped reads
 ES:

Embryonic stem
 FISH:

Fluorescence in situ hybridization
 FPKM:

Fragments per kilobase of exon per million mapped reads
 GEO:

Gene Expression Omnibus
 GGM:

Graphical Gaussian model
 GO:

Gene Ontology
 iPS:

Induced pluripotent stem
 MCMC:

Markov chain Monte Carlo
 PCA:

Principal component analysis
 PCW:

Postconceptional week
 SA:

Simulated annealing
 scRNAseq:

Singlecell RNAsequencing
 SDG:

Spatial discriminator gene
 SOM:

Selforganizing map
 TDG:

Temporal discriminator gene
 UMAP:

Uniform manifold approximation and projection
 UMI:

Unique molecular identifier
 VKO:

Virtual knockout
References
Rossi MA, Basiri ML, McHenry JA, Kosyk O, Otis JM, van den Munkhof HE, et al. Obesity remodels activity and transcriptional state of a lateral hypothalamic brake on feeding. Science. 2019;364:1271–4.
Jackson HW, Fischer JR, Zanotelli VRT, Ali HR, Mechera R, Soysal SD, et al. The singlecell pathology landscape of breast cancer. Nature. 2020;578:615–20.
Litviňuková M, TalaveraLópez C, Maatz H, Reichart D, Worth CL, Lindberg EL, et al. Cells of the adult human heart. Nature. 2020;588:466–72.
Battich N, Stoeger T, Pelkmans L. Imagebased transcriptomics in thousands of single human cells at singlemolecule resolution. Nat Methods. 2013;10:1127–33.
Lubeck E, Cai L. Singlecell systems biology by superresolution imaging and combinatorial labeling. Nat Methods. 2012;9:743–8.
Chen KH, Boettiger AN, Moffitt JR, Wang S, Zhuang X. Spatially resolved, highly multiplexed RNA profiling in single cells. Science. 2015;348:aaa6090.
Ståhl PL, Salmén F, Vickovic S, Lundmark A, Navarro JF, Magnusson J, et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science. 2016;353:78–82.
Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of singlecell gene expression data. Nat Biotechnol. 2015;33:495–502.
Achim K, Pettit JB, Saraiva LR, Gavriouchkina D, Larsson T, Arendt D, et al. Highthroughput spatial mapping of singlecell RNAseq data to tissue of origin. Nat Biotechnol. 2015;33:503–9.
Halpern KB, Shenhav R, MatcovitchNatan O, Tóth B, Lemze D, Golan M, et al. Singlecell spatial reconstruction reveals global division of labour in the mammalian liver. Nature. 2017;542:352–6.
DurruthyDurruthy R, Gottlieb A, Hartman BH, Waldhaus J, Laske RD, Altman R, et al. Reconstruction of the mouse otocyst and early neuroblast lineage at singlecell resolution. Cell. 2014;157:964–78.
DurruthyDurruthy R, Gottlieb A, Heller S. 3D computational reconstruction of tissues with hollow spherical morphologies using singlecell gene expression data. Nat Protoc. 2015;10:459–74.
DurruthyDurruthy J, Wossidlo M, Pai S, Takahashi Y, Kang G, Omberg L, et al. Spatiotemporal reconstruction of the human blastocyst by singlecell geneexpression analysis informs induction of naive pluripotency. Dev Cell. 2016;38:100–15.
Li J, Luo H, Wang R, Lang J, Zhu S, Zhang Z, et al. Systematic reconstruction of molecular cascades regulating GP development using singlecell RNAseq. Cell Rep. 2016;15:1467–80.
Nitzan M, Karaiskos N, Friedman N, Rajewsky N. Gene expression cartography. Nature. 2019;576:132–7.
Cang Z, Nie Q. Inferring spatial and signaling relationships between cells from single cell transcriptomic data. Nat Commun. 2020;11:2084.
GonzálezBlas C, Quan XJ, DuranRomaña R, Taskiran II, Koldere D, Davie K, et al. Identification of genomic enhancers through spatial integration of singlecell transcriptomics and epigenomics. Mol Syst Biol. 2020;16:e9438.
Ren X, Zhong G, Zhang Q, Zhang L, Sun Y, Zhang Z. Reconstruction of cell spatial organization from singlecell RNA sequencing data based on ligandreceptor mediated selfassembly. Cell Res. 2020;30:763–78.
Rodriques SG, Stickels RR, Goeva A, Martin CA, Murray E, Vanderburg CR, et al. Slideseq: a scalable technology for measuring genomewide expression at high spatial resolution. Science. 2019;363:1463–7.
Stickels RR, Murray E, Kumar P, Li J, Marshall JL, Bella DJD, et al. Highly sensitive spatial transcriptomics at nearcellular resolution with SlideseqV2. Nat Biotechnol. 2021;39:313–9.
Vickovic S, Eraslan G, Salmén F, Klughammer J, Stenbeck L, Schapiro D, et al. Highdefinition spatial transcriptomics for in situ tissue profiling. Nat Methods. 2019;16:987–90.
Chen A, Liao S, Cheng M, Ma K, Wu L, Lai Y, et al. Spatiotemporal transcriptomic atlas of mouse organogenesis using DNA nanoballpatterned arrays. Cell. 2022;185:1777–92.
Hu J, Li X, Coleman K, Schroeder A, Ma N, Irwin DJ, et al. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nat Methods. 2021;18:1342–51.
Palla G, Spitzer H, Klein M, Fischer D, Schaar AC, Kuemmerle LB, et al. Squidpy: a scalable framework for spatial omics analysis. Nat Methods. 2022;19:171–8.
Shen R, Liu L, Wu Z, Zhang Y, Yuan Z, Guo J, et al. SpatialID: a cell typing method for spatially resolved transcriptomics via transfer learning and spatial embedding. Nat Commun. 2022;13:7640.
Mori T, Takaoka H, Yamane J, Alev C, Fujibuchi W. Novel computational model of gastrula morphogenesis to identify spatial discriminator genes by selforganizing map (SOM) clustering. Sci Rep. 2019;9:1–10.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. Nat Genet. 2000;25:25–9.
The Gene Ontology Consortium. Expansion of the gene ontology knowledgebase and resources. Nucleic Acids Res. 2017;45:D331–8.
Kangas JA, Kohonen TK, Laaksonen JT. Variants of selforganizing maps. IEEE Trans Neural Net. 1990;1:93–9.
Kohonen T. Selforganized formation of topologically correct feature maps. Biol Cybern. 1982;43:59–69.
von der Malsburg C. Selforganization of orientation sensitive cells in the striate cortex. Kybernetik. 1973;14:85–100.
Turing AM. The chemical basis of morphogenesis. Philos Trans R Soc Lond B Biol Sci. 1952;237:37–72.
Li X, Gasteiger J, Zupan J. On the topology distortion in selforganizing feature maps. Biol Cybern. 1993;70:189–98.
Becht E, McInnes L, Healy J, Dutertre CA, Kwok IWH, Ng LG, et al. Dimensionality reduction for visualizing singlecell data using UMAP. Nat Biotechnol. 2018;37:38–44.
Andrieu C, De Freitas N, Doucet A, Jordan MI. An Introduction to MCMC for machine learning. Mach Learn. 2003;50:5–43.
Peng G, Suo S, Cui G, Yu F, Wang R, Chen J, et al. Molecular architecture of lineage allocation and tissue organization in early mouse embryo. Nature. 2019;572:528–32.
Swendsen RH, Wang JS. Replica monte carlo simulation of spinglasses. Phys Rev Lett. 1986;57:2607.
Braeuning A, Ittrich C, Köhle C, Hailfinger S, Bonin M, Buchmann A, et al. Differential gene expression in periportal and perivenous mouse hepatocytes. FEBS J. 2006;273:5051–61.
Tasic B, Yao Z, Graybuck LT, Smith KA, Nguyen TN, Bertagnolli D, et al. Shared and distinct transcriptomic cell types across neocortical areas. Nature. 2018;563:72–8.
Asp M, Giacomello S, Larsson L, Wu C, Fürth D, Qian X, et al. A spatio–temporal organwide gene expression and cell atlas of the developing human heart. Cell. 2019;179:16471660.e19.
de Soysa TY, Ranade SS, Okawa S, Ravichandran S, Huang Y, Salunga HT, et al. Singlecell analysis of cardiogenesis reveals basis for organlevel developmental defects. Nature. 2019;572:120–4.
Hashimoto K, Kodama A, Honda T, Hanashima A, Ujihara Y, Murayama T, et al. Fam64a is a novel cell cycle promoter of hypoxic fetal cardiomyocytes in mice. Sci Rep. 2017;7:1–17.
Han S, Cui C, He H, Shen X, Chen Y, Wang Y, et al. FHL1 regulates myoblast differentiation and autophagy through its interaction with LC3. J Cell Physiol. 2020;235:4667–78.
Wang X, Osinska H, Gerdes AM, Robbins J. Desmin filaments and cardiac disease: establishing causality. J Card Fail. 2002;8:S287–92.
Gladka MM, Kohela A, Molenaar B, Versteeg D, Kooijman L, MonshouwerKloots J, et al. Cardiomyocytes stimulate angiogenesis after ischemic injury in a ZEB2dependent manner. Nat Commun. 2021;12:1–16.
Eghbali A, Dukes A, Toischer K, Hasenfuss G, Field LJ. Cell cyclemediated cardiac regeneration in the mouse heart. Curr Cardiol Rep. 2019;21:131.
Veres A, Faust AL, Bushnell HL, Engquist EN, Kenty JHR, Harb G, et al. Charting cellular identity during human in vitro βcell differentiation. Nature. 2019;569:368–73.
Zhang D, Jiang W, Liu M, Sui X, Yin X, Chen S, et al. Highly efficient differentiation of human ES cells and iPS cells into mature pancreatic insulinproducing cells. Cell Res. 2009;19:429–38.
Sandoval DA, D’Alessio DA. Physiology of proglucagon peptides: role of glucagon and GLP1 in health and disease. Physiol Rev. 2015;95:513–48.
Bethea M, Liu Y, Wade AK, Mullen R, Gupta R, Gelfanov V, et al. The isletexpressed Lhx1 transcription factor interacts with Islet1 and contributes to glucose homeostasis. Am J Physiol Endocrinol Metab. 2019;316:E397–409.
Haris B, Saraswathi S, Hussain K. Somatostatin analogues for the treatment of hyperinsulinaemic hypoglycaemia. Ther Adv Endocrinol Metab. 2020;11:2042018820965068.
Epskamp S, Fried EI. A tutorial on regularized partial correlation networks. Psychol Methods. 2018;23:617–34.
Baeyens L, Lemper M, Staels W, Groef SD, Leu ND, Heremans Y, et al. (Re)generating human beta cells: status, pitfalls, and perspectives. Physiol Rev. 2018;98:1143–67.
Yu XX, Qiu WL, Yang L, Zhang Y, He MY, Li LC, et al. Defining multistep cell fate decision pathways during pancreatic development at singlecell resolution. EMBO J. 2019;38: e100164.
Hao Y, Hao S, AndersenNissen E, Mauck WM III, Zheng S, Butler A, et al. Integrated analysis of multimodal singlecell data. Cell. 2021;184:35733587.e29.
Cannoodt R, Saelens W, Deconinck L, Saeys Y. Spearheading future omics analyses using dyngen, a multimodal simulator of single cells. Nat Commun. 2021;12(1):3942.
Bhaduri A, Andrews MG, Leon WM, Jung D, Shin D, Allen D, et al. Cell stress in cortical organoids impairs molecular subtype specification. Nature. 2020;578:142–8.
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech: Theory Exp. 2008;2008:P10008.
Peng G, Suo S, Chen J, Chen W, Liu C, Yu F, et al. Spatial transcriptome for the molecular annotation of lineage fates and cell identity in midgastrula mouse embryo. Dev Cell. 2016;36:681–97.
Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, et al. NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Res. 2013;41(Database issue):D991–5.
Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66:846–50.
Kirkpatrick S, Gelatt CD Jr, Vacchi MP. Optimization by simulated annealing. Science. 1983;220:671–80.
Kursa MB, Rudnicki WR. Feature selection with the Boruta package. J Stat Softw. 2010;36:1–13.
Durinck S, Spellman PT, Birney E, Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc. 2009;4:1184–91.
Epskamp S, Cramer AO, Waldorp LJ, Schmittmann VD, Borsboom D. qgraph: network visualizations of relationships in psychometric data. J Stat Softw. 2012;24(48):1–8.
Acknowledgements
The authors deeply appreciate Dr. Peter Karagiannis for kindly reviewing the manuscript. The supercomputing resource was provided by Human Genome Center, the Institute of Medical Science, the University of Tokyo (http://sc.hgc.jp/shirokane.html).
Funding
This work was partially supported by the Core Center for iPS Cell Research, the Research Center Network for Realization of Regenerative Medicine, the Japan Agency for Medical Research and Development (20bm0104001h0008/21bm0104001h0009), GrantsinAid for Scientific Research on Innovative Areas (17H06391, 17H06396), A (22H00532), B (21H02790), and Challenging Research (Exploratory) (22K19830) from the Ministry of Education, Culture, Sports, Science and Technology (MEXT), and the iPS Cell Research Fund.
Author information
Authors and Affiliations
Contributions
WF conceptualized and designed the study. TM and TA contributed to development the software. TM, KL, and TT performed the computational experiments. JY, CA, AK, KO, and JKY performed the data interpretation from a biological point of view. TM, HK and WF wrote the manuscript. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
TA was an Associate Editor of BMC Bioinformatics and the remaining authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1 Fig. S1.
GO mapping of SDGs in Biological Process for mouse E7.0 gastrula embryo. The pvalues determined on the basis of hypergeometric distribution statistics are shown on top of bars. Fig. S2. Topology graphs for 12 datasets used as input for eSPRESSO clustering in this study. Fig. S3. Gene expression gradients in reconstructed models by eSPRESSO (upper) and novoSpaRc (lower) for the top three important genes in VKO analysis of mouse developmental heart. Fig. S4. Gene expression gradients in reconstructed models by eSPRESSO (upper) and novoSpaRc (lower) for the top three important genes in VKO analysis of the human developmental pancreas. Fig. S5. GGM raw result by glasso for original 18 cell types at individual stages (left) and induced developmental model by reduced 10 cell types (right). Fig. S6. Venn diagrams representing consensus S(T)DGs identified by eSPRESSO and cluster biomarkers identified by Seurat, and enrichment plots of the S(T)DGs for the gene lists ranked according to the adjusted pvalues obtained by the Wilcoxon rank sum test between one domain and the other domains. P.adj and NES denote adjusted pvalue and normalized enrichment score, respectively. Fig. S7 Synthetic singlecell dataset created from a dyngen model with linear backbone of transcription factor module configuration, and a topology graph generated from the dataset. a Module network of linear backbone generated by dyngen. b Transcription factor and target gene regulatory network generated by dyngen. c UMAP of 1,000 cells of the synthetic dataset. Numbers in the legend are the cluster numbers identified by the Louvain method. d Topology graph generated from the synthetic dataset. Vertex labels indicate domain names corresponding to the cluster numbers in (c). Fig. S8. Topology graphs generated from synthetic singlecell datasets of predefined 14 backbones by dyngen. The dyngen parameters for generating each dataset are indicated by Seed, random seed; #TF, the number of transcription factors; #Targets, the number of target genes; and #HK, the number of housekeeping genes. Thresholds indicate the cutoff values for generating topology graphs from the correlation networks of identified domains. Fig. S9 Expression gradients of SDGs identified by eSPRESSO on topology graphs generated from the synthetic datasets. Vertex colors indicate gene expression levels (Zscore) of the cluster centroids. Fig. S10. Impact of misspecification of input topology. Each plot shows the mean values of maximum scores (= accuracy + ARI), maximum accuracies, and maximum ARIs, respectively, and 95% confidence intervals for ten experiments with increasing randomization probability. Fig. S11. Venn diagrams representing input genes to Boruta selected on the basis of the frequency of nonzero expression in all cells as well as the standard deviation (eSPRESSO) and highly variable genes by Seurat (Seurat).
Additional file 2 Movie S1. The final topological clustering result for the 83 sections using 36 consensus SDGs for mouse E7.5 gastrula embryo.
Additional file 3 Table S1.
66 consensus SDGs for E9.5 mouse liver and their GO enrichment analysis.
Additional file 4 Movie S2. The final topological clustering result for 1,415 cells with 66 consensus SDGs for mouse liver lobule structure data (left), and 1,358 spots with 70 consensus SDGs for PCW 9 human embryonic heart data (right).
Additional file 5 Table S2.
70 consensus SDGs for PCW 9 human heart and their GO enrichment analysis.
Additional file 6 Movie S3. The final topological clustering result for the 3,331 singlecells using 20 consensus SDGs for mouse E9.25 embryonic heart (top left). Topological clustering results for individual VKO experiments (Des^{−}, Ptma^{−}, or Ccnd2^{−}) and multiple VKO experiments (Des^{−}Ptma^{−}, and Des^{−}Ptma^{−}Ccnd2^{−}) are also shown.
Additional file 7 Table S3.
20 consensus SDGs for E9.25 mouse heart and their GO enrichment analysis.
Additional file 8 Movie S4. The gene expression distribution of Des, Ptma and Ccnd2 genes. Des and Ptma are mainly upregulated in OFT and pSHF domains, respectively, and downregulated vice versa, whereas Ccnd2 is mainly upregulated in LV and pSHF domains but downregulated in the middle of both, i.e., in Atrial domain.
Additional file 9 Movie S5. The final topological clustering result for the 20,205 singlecells using 83 consensus TDGs for 10 cell types in human developmental pancreas organoids (s3–s6). The gene expression distribution of the three most influential genes (GCG, ISL1, and SST) revealed by VKO analysis.
Additional file 10 Table S4.
83 consensus TDGs for s3–s6 human developmental pancreas organoids and their GO enrichment analysis.
Additional file 11 Table S5.
Virtual knockout genes and performances of pancreas developmental topology reconstructions by running topological clustering 100 times (score = accuracy + ARI).
Additional file 12 Table S6.
Datasets and performances of 3D structure reconstructions with highly variable genes selected by Seurat (score = accuracy + ARI).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Mori, T., Takase, T., Lan, KC. et al. eSPRESSO: topological clustering of singlecell transcriptomics data to reveal informative genes for spatio–temporal architectures of cells. BMC Bioinformatics 24, 252 (2023). https://doi.org/10.1186/s12859023053554
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859023053554