Integrating single-cell multi-omics data through the MinNet framework
As with other state-of-the-art integration methods, the MinNet framework follows the statistical concept of integrating omics data: Cells from different modalities are projected into the same latent space that captures the shared variance in all omics data. To generate this co-embedding space, the Siamese neural network simultaneously receives as inputs one cell from modality 1 (e.g., scRNA-seq) and another from modality 2 (e.g., scATAC-seq) and projects them into the same n-dimensional vectors using the encoder. To ensure this n-dimensional vector space is a good representation of the shared main biological variance, two losses are applied following the encoder.
The first and most important is the contrastive loss [22]. Here, we convert the concept of shared main variance to a more computationally feasible metric for the neural network – similarity and differences among cells. An ideal co-embedding space should be consistent with the original data on this metric: similar cells are close and very different cells are far away. Thus, our contrastive loss aims to reduce the distance between similar cells and separate different cells in the n-dimensional space. To achieve this goal, randomly chosen cell pairs are prepared before each training epoch for calculating either positive or negative contrastive loss. Positive pairs are the identical cells in the two modalities, and the loss is the Euclidian distance between each pair in the co-embedding spaces. Negative pairs are different cells sampled from the data, and the loss is calculated as a margin constant \(m\) minus the Euclidian distance. By training the model to minimize the loss, the distances between corresponding cells get smaller while the distances between negative pairs get larger. In this way, main biological variance is kept in the co-embedding space.
Usually, the margin value m is a constant for all negative pairs during Siamese neural network training. However, cells from different cell types are more diverse than those from the same cell type. Thus, to maintain these differences in the co-embedding space, we designed mas a flexible value depending on how much the cell pairs differ. (See Methods for technical details). Intuitively, cells that differ more pose higher variance, so a larger margin is assigned to separate them in the space. In contrast, similar cells pose little variance, so a small margin value is assigned (Fig. 1B). With the flexible margin, the datasets main variance will be better kept in the final integration space.
The second loss is cell-type classification loss. It is also designed to capture the main variance because it separates different cell types from each other. The output of the first encoder layer is sent to the label classification layer for cross-entropy loss calculation. Also, previous studies observed improved performance with this loss due to its ability to accelerate the optimization process [21].
This supervised model needs to be trained with paired multi-omics datasets from techniques like 10X Multiome, SHARE-seq [9], and SNARE-seq [8], which profile the transcriptome and chromatin accessibility simultaneously, or Cite-seq [23], which profiles transcriptome and epitopes in the same cells. The weighted sum of classification and contrastive loss is minimized during training to ensure optimized modality mixing and clustering. After training, the model can be easily applied in user’s target datasets.
We applied the framework to two tasks: transcriptome and chromatin accessibility data integration, which takes gene expression and gene activity score as its input; and transcriptome and epitope data integration taking gene expression and protein abundance. With the trained models, users can provide two simply normalized datasets and obtain the co-embedding space for downstream analysis, including aligning/pairing cells between modalities, unsupervised clustering, and cis-regulatory element inferring via pseudo-bulk generated from the embedding space (Fig. 1C).
Benchmarking shows that MinNet is robust and generalizable in alignment and clustering
To test the performance and generalizability of our two models trained on 10X Multiome bone marrow mononuclear cells (BMMC) data and Cite-seq BMMC data, we evaluated our method and compared it with existing ones, including GLUE [19], bindSC [24], Seurat v3 [13], Liger [16], and Liger's online version [25], on two untouched test sets from the NeurIPS 2021 Competition [26] in which the cell-to-cell correspondence is known. This large dataset has 10X Multiome and Cite-seq sequencing results from four sequencing sites and ten donors. Our models were trained on samples from some of the donors and three sequencing sites, leaving other donors untouched (test set 1) and the fourth sequencing site untouched (test set 2) (See Additional file 1: Table S1 for details).
After applying all algorithms to the benchmarking datasets, we evaluated all final integration results (see Additional file 1: Figs. S1–4 for the UMAP visualization) based on several metrics. First, we used the silhouette coefficient score [27] to measure the integration performance of the co-embedding space generated by the algorithms, focusing on how well modalities are mixing while cell types are separating from each other. Compared with other methods, MinNet attained a higher score in both modality mixing and clustering (Fig. 2A). The cell type silhouette coefficient indicates how well cell types are separated in the co-embedding space. In the real-world setting, when researchers have no labels for their dataset, they will use unsupervised clustering to annotate the cell types; a better separation will ensure more precise annotations. We tested this proposition by performing unsupervised clustering on the algorithms' embeddings and testing the consistency between unsupervised clusters and cell type annotations using the adjusted rand index [28] (Fig. 2B). Results show that MinNet-based clustering is the most concordant with the ground truth at the primary cell type level. Moreover, when subtypes were identified, our model was still competitive with the top models in 10X Multiome data and outperformed all models in Cite-seq data. We also evaluated cell type integration performance based on label transferring accuracy, another common task in actual practice when researchers want to transfer the annotated labels from one modality to the other. MinNet can accurately transfer most of the labels, even if the cell numbers are small, while methods like Seurat are biased toward the major cell types (Fig. 2C, E). With Silhouette score and label transfer accuracy, our model’s performance is validated at the cell type level.
Beyond the cell type resolution integration, single-cell level cell alignment is also essential in some cases, like cell type sub-typing and mini-bulk generation for downstream analysis. To evaluate this higher resolution performance, the FOSCTTM (Fraction of samples closer than the true match) score [29] was measured for all generated co-embeddings. MinNet ranked in first or second place in all four datasets, and performed especially well in Cite-seq data where it demonstrated a significant competitive advantage at single-cell resolution (Fig. 2D).
For broad usability, a supervised model must be generalizable. Our model’s success in integrating untouched donor datasets and untouched sequencing site datasets already demonstrated its generalizability, but we also wanted to test it with other tissues. First, we undertook the same evaluation using the 10X Multiome peripheral blood mononuclear cell (PBMC) dataset [30]. Based on silhouette scores, FOSCTTM scores, and label transfer accuracies (Additional file 1: Fig. S5A, C, D), our model still performed competitively and generated adequate co-embedding space (Additional file 1: Fig. S6). Though the model had never seen many of the cell types in the PBMC dataset, it still separated most cell types well, demonstrating its generalizability. This result is due to the model’s contrastive loss design, which learned the common sense of similar tasks rather than a specific task [31].
However, this generalizability was limited to similar tissues, such as BMMC and PBMC. We also applied the trained model to the 10X Multiome human brain dataset [32], an entirely different tissue. The resulting co-embedding space showed little biological information and failed to cluster well (Additional file 1: Fig. S5B). Therefore, we concluded that the generalizability of our algorithm could be expanded to similar tissues but not distinct ones. Nevertheless, this supervised approach can easily be trained on target tissues and has a higher specificity than other models. For example, the traditional machine learning models, including bindSC, Liger, and Seurat, have better label transfer accuracies on the PBMC dataset than on the BMMC dataset, which we believe is due to cell type balance. That is, when the numbers of cells in each cell type are relatively even, these methods perform well. But in cases like the BMMC datasets, which consist mostly of monocytes, label transfer of minor cell types is inaccurate. In contrast, our approach is not significantly influenced by unevenly distributed cell type sizes because of its superior specificity.
MinNet is superior in removing batch effect while maintaining biological variance
To distinguish between batch variance and biological variance, we trained our model with multiple batches from different donors and sequencing sites. While the training input was normalized data without batch correction, the contrastive loss was based on the graph after batch correction by ComBat implemented in Scanpy [33]. With this design, the model is required to produce the joint embedding that eliminates batch effects while retaining biological differences.
To test the performance of batch effect removal, we generated three more testing scenarios with the available benchmark datasets that represent real case practice problems. The first scenario tested all algorithms’ performance when both scRNA-seq and scATAC-seq experiments are performed independently on identical batches. The second and third scenarios tested the integration performance of scRNA-seq and scATAC-seq datasets profiled from different batches. In all three cases, we compared our model with those mentioned above.
The silhouette score and label transfer accuracy were chosen for evaluation since cells were different in the second and third cases and FOSCTTM score is not feasible. Figure 3A shows the co-embedding space of the first case. While mixing the omics data, MinNet successfully separated cell types and mixed the batches. Its performance is quantified and compared with other algorithms in Fig. 3B using cell type and batch silhouette scores. BindSC was better at mixing batches, but MinNet distinguished the cell type variance from batch variance to provide better cell type separation. In the last two tests, most algorithms generated a co-embedding space that mixed the batch well, meaning all the manifold alignments worked between two single-batch omics data (Additional file 1: Figs. S7, S8). However, when it came to the mixed batches, some algorithms failed due to their inability to remove the batch effect, resulting in small cell type silhouette scores (Fig. 3B). MinNet thus outperformed the other models in clustering and label transferring (Fig. 3C).
This evaluation is especially important as it mimics real-world practice in which researchers use independent profiling from different batches or even from independent, publicly available datasets. We also conducted testing on Cite-seq data in a similar case setting (Additional file 1: Fig. S3) and observed that none of the algorithms succeeded in mixing batches from different donors and sequencing sites, but they could mix batches from other donors and the same sequencing sites (Additional file 1: Fig. S4). We hypothesize that this result is due to the significant sequencing platform differences of Cite-seq technology and believe further investigation is warranted.
Model-based smoothing helps correlation-based cis-regulatory element inferring
After benchmarking our model, we demonstrated how integration can impact the downstream analysis and help discover the interplays among genetic layers.
Cis-regulatory elements, such as enhancers and promoters, are genomic regions that control development and physiology by regulating gene expression [34]. Inferring the regulation between open chromatin regions and gene expression is of great importance in understanding biological and disease processes. Usually, cis-regulatory element inferring is done by calculating the correlation between chromatin regions, e.g., Cicero [35]. With multi-omics data and integration methods available, we can calculate the correlation between regions and gene expression using the aligned cells, which is a more direct way of linking genome with transcriptome. Here, we implemented smoothing, mini-bulk generating, and cis-regulatory element inferring and validated the method using the 10X Multiome peripheral blood mononuclear cells (PBMC) dataset [30].
First, to account for the high dropout rate and noise [36] in single-cell data, we built functions to smooth [37] the data. Specifically, we complemented the missing values in cells based on their K nearest neighbors in our single-cell resolution co-embedding space to decrease the sparsity (Fig. 4A). After smoothing, we generated mini-bulk data before undertaking any downstream analysis.
To test how this smoothing and mini-bulk generating improve downstream analysis, we calculated the Spearman's correlations between genes and their 2 kb nearby peaks in mini-bulk data, which are believed to be positively correlated. Results show that non-smoothed raw mini-bulk data has a lower correlation level than true pair mini-bulk data correlation, meaning the dropout rate compromises downstream analysis when no cell correspondence is available between modalities (Fig. 4A middle). But when smoothing was applied to the five nearest neighbors, the correlation levels reached that of the true pair mini-bulk. The correlation was even higher than the true pair when the number of neighbors was increased. To demonstrate the importance of smoothing, we offer an example in Fig. 4B. Chr3:102402234–102402739 is in the TSS region of the gene FGF14, which means the pair should be positively correlated. But because of the high dropout rate, non-smoothed mini-bulk data showed a negative Spearman correlation coefficient. When we applied nearest neighbor complementation, their association became positive.
We further validated the model-based, cis-regulatory element-inferring approach with external Promoter Capture Hi-C (pcHi-C) evidence of the interaction between genome regions [38]. We applied non-smoothed, true pair, and smoothed mini-bulk data to calculate Spearman’s correlation between genes and their 150 kb nearby peaks. While the mean correlations of pcHi-C unsupported peak-gene pairs were not greatly increased by smoothing, the mean correlation level of pcHi-C supported pairs did increase. The difference in correlation between supported and unsupported pairs is clearly shown in the heatmap (Fig. 4C). Again, with only five nearest neighbors smoothing, the correlation reached the same level as true pair mini-bulk (Figure S9A), but the 0-25 k peak-gene pairs are non-distinguishable. We think the proximity of genes increased the co-openness even though they don't have a regulatory relationship.
The extremely high correlation peak-gene pairs yielded by this approach are worth further investigation because they indicate potential regulatory relationships. For example, LEF1 encodes the protein that can bind to a functionally important site in the T-cell receptor-alpha enhancer [39] and therefore shows a variant expression level in subtypes of T cells. Three peaks are within the 150 kb upstream of the LEF1 TSS region. In 5NN smoothing mini-bulk correlation, two peaks have high correlations with LEF1 expression (chr4-108170508–108173850: \({r}_{s}\) = 0.930; chr4-108315129–108315649: \({r}_{s}\) = 0.877) and are supported by pcHi-C evidence. The other has a low correlation and is not supported by pcHi-C (chr4-108301923–108302013: \({r}_{s}\)= 0.371). These results showed consistency with Hi-C and are validated by the data visualized in Fig. 4D. On the other hand, some unsupported correlations also showed potential regulatory relationships. CCR2 encoded protein is a receptor for monocyte chemoattractant protein-1, a chemokine that specifically mediates monocyte chemotaxis [40, 41]. It is a monocyte marker; thus, the expression varies in many PBMC cell types. Six peaks showed high correlation with CCR2, three were supported by Hi-C evidence (chr3-46206526–46210451: \({r}_{s}\) = 0.647; chr3-46297386–46301922: \({r}_{s}\) = 0.612; chr3-46212,074–46213996: \({r}_{s}\) = 0.612) and three were not (chr3-46317953–46318717:\({r}_{s}\) = 0.634; chr3-46312405–46313554: \({r}_{s}\) = 0.607; chr3-46228191–46229079: \({r}_{s}\) = 0.605). But when validated in the original data, we saw a correlation of all six peaks with the gene, indicating potential genomic links (Additional file 1: Additional file 1: Fig. S9B). Furthermore, the unsupported chr3-46312405–46313554, together with Hi-C supported chr3-46,206,526–46,210,451 and chr3-46297386–46301922, were enriched in the motif for STAT3 + IL-21 binding, providing further evidence supporting this finding. IL-21 is a known cytokine with diverse effects on immune cells, including CD4 + and CD8 + T cells, B cells, macrophages, monocytes, and dendritic cells [42]. Thus, CCR2 might be involved in IL-21-induced cell adhesion through these binding sites, which has been considered by other researchers [43].
MinNet provides missing analysis of COVID-19 multi-modal data
To demonstrate how our model can help study diseases, we applied it to a publicly available COVID-19 dataset [44] where healthy controls and patients with various Worldwide Health Organization (WHO) severity score-rated PBMC samples were profiled with independent scRNA-seq and scATAC-seq. We followed their scRNA-seq Differential Expressed Genes (DEGs) analysis using the PBMC trained- and BMMC trained-models and provided the missing part of the integration analysis to allow for more potential discoveries.
Preprocessed and normalized data were provided to either the PBMC or BMMC trained model to create the final co-embedding space (Fig. 5A). The final integration space separated cell types and mixed modalities well. The batch effect from different samples was removed, while the difference in severity was still clearly apparent, as shown in the UMAP colored by WHO severity score. We next evaluated the consistency between the cell type annotation and our clustering by calculating label transfer accuracy from scATAC-seq to scRNA-seq or the reverse direction (Fig. 5B). Except for cell types with only a few cells, the consistency was high between the two independent annotations and our embedding space. Both the PBMC and BMMC models were able to integrate the dataset because of their generalizability.
Using the model’s co-embedding space, we generated the mini-bulk data per cell type and then inferred the potential regulatory relationships between genes and peaks within a 150 k bp distance from the transcription starting point. Two cell types, Natural Killer (NK) cells and monocytes, were chosen for the integration analysis because they were the most dysfunctional cell types identified in the original study. Only DEGs by severity group were included in this analysis as compensation for their primary scRNA-seq DEG analysis. For example, in NK cells, the DEG HLA-DPB1 is correlated with chr6-32940846–32941346 (\({r}_{s}\) = 0.333, Fig. 5C, Additional file 1: Fig. S10A) and showed differences among WHO severity groups. In monocytes, the DEG FPR2 is associated with the peak chr19-51735953–51736453 (\({r}_{s}\) = 0.434, Fig. 5C, Additional file 1: Fig. S10B). These results were further validated with the raw data and could be potential regulatory sites for the DEGs.
Besides inferring the causal relationships between COVID-19 influenced peaks and genes, we can also compare the correlations among severity groups to discover dysfunction in severely infected groups. Thus, we generated the pseudo-bulk data per cell type for each severity group separately and calculated Spearman’s correlation between genes and peaks within a 150 k bp distance. The inconsistency in correlation level might indicate the dysfunction in regulation between genes and peaks. For example, in monocytes, EIF4B and its remote peaks chr12-52884739–52885239 are positively correlated in healthy control and moderately infected patients but are negatively correlated in severely infected patients (Fig. 5D). Such findings indicate that the peak's positive regulation is destroyed in severely infected patients. One potential rationale is that the enhancing protein was competitively replaced by another inhibitory protein during the enhancer and promoter interaction, leading to the negative correlation between openness and expression.