In the following, we will describe the details about the single-cell miRNA-mRNA co-sequencing data used, interpolating pseudo-cells in small-scale single-cell transcriptomics data, the identification of cell-specific miRNA-mRNA regulatory networks, and subsequent analysis of the identified single-cell miRNA regulation.
Single-cell miRNA-mRNA co-sequencing data
We obtain matched miRNA and mRNA co-sequencing expression data in 19 half K562 cells from Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) with accession number GSE114071. The K562 cells used are the first human chronic myelogenous leukemia (CML) cell line. For the duplicate miRNAs or mRNAs with the same gene symbols in the dataset, we compute the average expression values of them as their final expression values. Since gene expression variability may be a reason of non-genetic cell-to-cell heterogeneity [6], as a feature selection, we remove all the miRNAs and mRNAs with constant expression values (the standard deviation of their expression values in all single-cells is 0) across the 19 half K562 cells. The matched miRNA and mRNA expression data are then pre-processed by using \(\log_{2} (x + 1)\) transformation. As a result, we have the matched expression profiles of 212 miRNAs and 15,361 mRNAs in the 19 half K562 cells.
Interpolating pseudo-cells in small-scale single-cell transcriptomics data
When the number of samples in a dataset is small, it is not guaranteed that a good representation of the population can be inferred from the data. It is required in [8] that when applying the CSN method, to estimate the association of each miRNA-mRNA pair, the number of cells in the single-cell transcriptomics dataset used should be more than 100. Since the proposed CSmiR method is adapted from the CSN method, for small-scale single-cell transcriptomics dataset like the one with 19 K562 half cells, it is necessary to enlarge the number of cells.
After interpolating pseudo-cells into original single-cell transcriptomics data, the main challenge is that the distribution of each gene (miRNA or mRNA) and joint distribution of each miRNA-mRNA pair will not be changed. To tackle this problem, we need to guarantee that the proportion of each cell type in the interpolated pseudo-cells is the same as that in the real single-cells. That is to say, the cell-type of the interpolated pseudo-cells should be the same as that in the real single-cells, and the number of the interpolated pseudo-cells of each cell-type also increases with the same probability. Based on this, for each pseudo-cell, we sample from the original single-cell transcriptomics data, i.e. the 19 single-cells uniformly with replacement, to generate it. To meet the requirement of having at least 100 single-cells, the number of interpolated pseudo-cells between two adjacent half K562 cells is set to 5. Here, two K562 single-cells with adjacent sample IDs (generated by half-cell genomics method) are regarded as adjacent single-cells. As a result, for each run of bootstrapping, we obtain the expression profiles of 212 miRNAs and 15,361 mRNAs in 109 half K562 cells (including both real and pseudo half K562 cells). All the B bootstrapping datasets are used for subsequent analysis. In this work, the number of bootstrapping B is set to 100.
Identifying cell-specific miRNA-mRNA regulatory networks by integrating putative miRNA-mRNA binding information
To reconstruct cell-specific miRNA-mRNA regulatory networks for real cells from the given single-cell dataset (including both real cells and interpolated pseudo-cells), it is necessary to construct a reliable statistic to evaluate the association between miRNAs and mRNAs. By using the statistic, the identified cell-specific miRNA-mRNA regulatory networks should be robust in the case of high dropout rate (also called technical noise from single-cell sequencing technology) and adding new cells (pseudo-cells in this work). Based on this, for each cell (including both real cells and interpolated pseudo-cells) in the given single-cell dataset, we apply the statistic used in the CSN method [8] to build a miRNA-mRNA regulatory network. In the case of high dropout rate and adding new cells, it is demonstrated that the CSN method is robust in identifying cell-specific networks. Therefore, when building the network, we adapt the CSN method for the discovery of miRNA-mRNA regulation. Moreover, we use the putative miRNA-mRNA binding information from TargetScan v7.2 [18] as prior knowledge to constrain the computational space between miRNAs and mRNAs. Specifically, for each putative miRNA-mRNA pair miRr and mRt in cell k, we evaluate the association between the miRNA and mRNA using the statistical test as described in the following.
To estimate the association between miRr and mRt in cell k, the CSN method draws a scatter diagram using the expression values of miRr and mRt. As shown in Fig. 6, rk and tk denote expression values of miRr and mRt in cell k respectively, and the medium, light and dark grey boxes represent the neighborhood of rk, tk and (rk, tk) respectively. The number of points in the medium, light and dark grey boxes are nr(k), nt(k) and nrt(k) respectively. Then we construct the statistic,\(\rho_{rt}^{(k)}\) as:
$$\rho_{rt}^{(k)} = \frac{{n_{rt}^{(k)} }}{n} - \frac{{n_{r}^{(k)} }}{n} \cdot \frac{{n_{t}^{(k)} }}{n}$$
(1)
where n is the total number of cells in the given dataset, \(\frac{{n_{r}^{(k)} }}{n}\) and \(\frac{{n_{t}^{(k)} }}{n}\) are the marginal probabilities of the expression levels of miRr and mRt respectively \(\left( {\frac{{n_{r}^{(k)} }}{n} = \frac{{n_{t}^{(k)} }}{n} = 0.1} \right)\) as empirically suggested by the CSN method), and \(\frac{{n_{rt}^{(k)} }}{n}\) is the joint probability of miRr and mRt.
It has been proved in [8] that \(\rho_{rt}^{(k)}\) approximately follows a normal distribution, and the normalized statistic \(z_{rt}^{(k)}\) is:
$$\begin{aligned} z_{rt}^{(k)} & = \frac{{\rho_{rt}^{(k)} - \mu_{rt}^{(k)} }}{{\sigma_{rt}^{(k)} }} \\ & = \frac{{\sqrt {n - 1} \cdot (n \cdot n_{rt}^{(k)} - n_{r}^{(k)} n_{t}^{(k)} )}}{{\sqrt {n_{r}^{(k)} n_{t}^{(k)} (n - n_{r}^{(k)} )(n - n_{t}^{(k)} )} }} \\ \end{aligned}$$
(2)
where \(\mu_{rt}^{(k)} = 0\) and \(\sigma_{rt}^{(k)} = \sqrt {\frac{{n_{r}^{(k)} n_{t}^{(k)} (n - n_{r}^{(k)} )(n - n_{t}^{(k)} )}}{{n^{4} (n - 1)}}}\) are the mean value and standard deviation of \(\rho_{rt}^{(k)}\), respectively. \(z_{rt}^{(k)}\) obeys standard normal distribution with mean value of 0 and standard deviation of 1. Each \(z_{rt}^{(k)}\) value corresponds to a p value for evaluating the significance of the association between miRr and mRt. The smaller pvalue indicates that the miRNA miRr and the mRNA mRt are more likely to be associated with each other in cell k. Here, the significant p value cutoff is set to 0.01.
For example, if we have a single-cell transcriptomics dataset containing 100 cells, the association between miRr and mRt in cell k is calculated as follows. Figure 6 is the scatter diagram using the expression values of miRr and mRt. Then, we draw the two boxes near rk and tk based on the predetermined nr(k) and nt(k) (\(n_{r}^{(k)} = n_{t}^{(k)} = 0.1n = 10\)). The value of \(n_{rt}^{(k)}\) is 4 by counting the red points in the third box which is the intersection between the drawn two boxes. According to Eqs. (1) and (2), the association \(\rho_{rt}^{(k)}\) and normalized association \(z_{rt}^{(k)}\) between miRr and mRt in cell k is 0.03 and \(\sqrt {11}\). By using pnorm R function, the corresponding significance p value of \(z_{rt}^{(k)} { = }\sqrt {11}\) is 4.56E−04 (\(1 - pnorm(\sqrt {11} )\)). Given the significant p value cutoff of 0.01, the miRNA miRr and the mRNA mRt are regarded as to be associated with each other in cell k.
Unstable estimation between the miRNA miRr and the mRNA mRt in cell k caused by small number of samples is a challenge to CSmiR. It is known that bootstrapping is a re-sampling technique used to obtain a reasonably accurate estimate of the population, and can be used to tackle the small sample problem. Therefore, to tackle this issue, we regard the median value of all the normalized associations \(z_{rt}^{(k)}\) calculated in all the B runs of booststrapping as the final estimation of the asscociation between miRr and mRt in cell k. If the final association corresponds a small significance p value (i.e. less than 0.01), the miRNA miRr and the mRNA mRt are associated with each other in cell k. As we are only interested in the miRNA-mRNA regulatory networks for each of the real cells (in the K562 dataset, they are the 19 real half K562 cells in the original dataset), at the end of this stage, we only keep the cell-specific miRNA-mRNA regulatory networks for the real cells. It is noted that each cell-specific miRNA-mRNA regulatory network is a bipartite graph where nodes are miRNAs and mRNAs and an edge is pointing from a miRNA to a mRNA.
Downstream analysis with cell-specific networks
At the network level, it is known that gene regulatory network provides an insight into investigating gene regulation. In the same vein, the discovered cell-specific miRNA-mRNA regulatory networks in the previous step could also help to explore miRNA regulation. To explore cell-specific miRNA regulation, based on the identified cell-specific miRNA-mRNA regulatory networks, CSmiR conduct the following types of downstream analyses: (1) discovering conserved and rewired miRNA regulation, (2) single-cell clustering analysis, (3) cell–cell crosstalk analysis, and (4) functional analysis of miRNA regulation.
Discovering conserved and rewired miRNA regulation
In a cell, the regulation of some miRNAs is “on” whereas the regulation of some miRNAs is “off” [30], indicated by having outgoing edges from the miRNAs or having no outgoing edges in the cell-specific network, respectively. It is possible that the regulation of some miRNAs is “on” in multiple cells and some miRNA regulations only maintain “on” in one cell. This “on/off state” phenomenon could help reveal the heterogeneity and commonality of miRNA regulation across different cells. Assuming that each cell is characterized by miRNA regulation, the conserved and rewired miRNA regulation across different cells can reflect the commonality and heterogeneity of cells, respectively. In this work, we discover conserved and rewired miRNA regulation in terms of both miRNA-mRNA regulatory network and hub miRNAs. Previous studies [31, 32] have shown that nearly 20% of the nodes in a biological network are regarded as essential nodes. The essential nodes in a biological network are subject to several topological properties (e.g. node degree) or biological relevance. Here, for simplicity, we select the top 20% of miRNAs based on node degrees in each cell-specific miRNA-mRNA regulatory network as hub miRNAs. Normally, if a miRNA-mRNA interaction or hub miRNA exists in more single-cells, the miRNA-mRNA interaction or hub miRNA tends to be more conservative. Here, the miRNA-mRNA interactions or hub miRNAs that are always “on” in at least 17 real K562 cells (~ 90%, generally ranked as a highly conservative level) are defined as conserved interactions or hubs, and the miRNA-mRNA interactions or hub miRNAs that are “on” in only one K562 cell are defined as rewired interactions or hubs. By assembling the conserved and rewired miRNA-mRNA interactions or hubs, we can obtain conserved and rewired miRNA-mRNA regulatory networks or hub miRNAs, respectively. These networks and hubs could provide insights into the heterogeneity and similarity of miRNA regulation across different single-cells.
Single-cell clustering analysis
Clustering single-cells based on single-cell RNA sequencing data is a fundamental task to understand tissue complexity, e.g. the number of subtypes [33]. In this paper, instead of directly using single-cell RNA sequencing data, we can use cell–cell similarity matrices for clustering single-cells, i.e. clustering cells based on their similarities on miRNA-mRNA interactions or hub miRNAs.
To reveal the heterogeneity of miRNA regulation across different cells, we investigate cell–cell similarity in terms of their miRNA-mRNA regulatory networks. The smaller the similarity between two single-cells is, the more heterogeneous they are.
We consider two types of similarities between two single-cells: the similarity on miRNA-mRNA interactions in their networks and the similarity on hub miRNAs in their networks. Following the similarity calculation method in [34], we calculate the interaction similarity and hub miRNA similarity using Eq. (3) below.
$$sim_{ij} = \frac{{intersect(term_{i} ,term_{j} )}}{{min(term_{i} ,term_{j} )}}$$
(3)
where termi and termj denote the numbers of interactions or numbers of hub miRNAs in the cell-specific miRNA-mRNA regulatory networks of cells i and j, respectively, \(intersect(term_{i} ,term_{j} )\) denotes the number of miRNA-mRNA interactions or hub miRNAs common to the cell-specific miRNA-mRNA regulatory networks of cells i and j, and \(min(term_{i} ,term_{j} )\) returns the smaller value out of termi and termj, i.e. the smaller value out of the numbers of miRNA-mRNA interactions or the numbers of hub miRNAs in the cell-specific miRNA-mRNA regulatory networks of cells i and j.
For comparison, we also calculate the similarity between two single-cells based on single-cell expression data. The normalized Euclidean distance \(nor\_dis_{ij}\) between cells i and j is calculated as follows:
$$\begin{aligned} nor\_dis_{ij} & = \frac{{dis_{ij} - \min (dis)}}{\max (dis) - \min (dis)} \\ dis_{ij} & = \sqrt {(e_{i1} - e_{j1} )^{2} + \cdots (e_{ik} - e_{jk} )^{2} + \cdots + (e_{ig} - e_{jg} )^{2} } \\ dis & = (dis_{ij} ) \in {\mathbb{R}}^{m \times m} \\ \end{aligned}$$
(4)
where \(e_{ik}\) and \(e_{jk}\) denote the expression levels of gene k in cells i and j respectively, g is the total number of genes (miRNAs and mRNAs), m is the number of real K562 single-cells.
After calculating the similarity and the distance between each pair of real half K562 cells, we obtain two similarity matrices \(SI_{m \times m}\) and \(SH_{m \times m}\) (where m is the number of real cells) in terms of cell-specific miRNA-mRNA interactions and cell-specific hub miRNAs respectively, and one distance matrix \(dis_{m \times m}\) in terms of single-cell expression data. Based on the similarity and distance matrices, we can conduct single-cell clustering analysis, e.g. hierarchical clustering analysis.
Cell–cell crosstalk analysis
In addition to single-cell clustering analysis, the similarity matrices can also be used for cell–cell crosstalk analysis. The cell–cell crosstalk is an indirect cell–cell communication and plays an important role in biological systems. For instance, cell–cell crosstalk can influence gene expression patterns [35], and involve in the development and regeneration of the respiratory system as well [36]. For each cell–cell pair, a higher similarity means sharing more number of miRNA-mRNA interactions or hub miRNAs between two cells. Previous studies [37, 38] have shown that miRNAs and their targets play important roles in cell signaling pathways. Therefore, when the shared miRNA-mRNA interactions or hub miRNAs involve in cell signaling pathways, a higher similarity between the cell pair implies that the two cells share more common cell signaling pathways and have a higher probability of signaling with each other (crosstalk). Based on this assumption, empirically, we use the median similarity value of all cell–cell pairs in the interaction or hub miRNA similarity matrix as the cutoff to evaluate whether two cells have crosstalk relationship or not. That is, if the similarity value between celli and cellj is larger than the median similarity value, celli and cellj have a crosstalk relationship. Following the empirical principle, we can evaluate whether each cell–cell pair has a crosstalk relationship or not. After assembling the cell–cell crosstalk relationships in terms of miRNA-mRNA interactions or hub miRNAs, we can obtain a cell–cell crosstalk network.
Functional analysis of miRNA regulation
To validate and apply the identified cell-specific miRNA regulatory networks, we also conduct functional analysis of miRNA regulation at both network and module levels.
At the network level, we conduct functional validation of the cell-specific miRNA-mRNA regulatory networks by using third-party databases. Since there are no experimentally validated databases at single-cell level, we use two well-known experimentally validated databases named miRTarBase v8.0 [39] and TarBase v8.0 [40] at bulk-cell level for validation. Meanwhile, since the K562 cells used are closely associated with chronic myelogenous leukemia (CML), we collect a list of miRNAs and mRNAs associated with CML to investigate CML-related miRNA regulation. The list of CML-related miRNAs is from Human MicroRNA Disease Database HMDD v3.0 [41], and the list of CML-related mRNAs is from DisGeNET v5.0 [42], which is one of the largest publicly available collections of genes and variants associated to human diseases. We focus on identifying CML-related miRNA-mRNA pairs where the miRNAs and mRNAs individually are in the list of CML-related miRNAs and mRNAs.
At the module level, we discover miRNA-mRNA regulatory modules by using the biclique R package [43]. We consider each miRNA-mRNA regulatory module is a complete bipartite graph or a biclique, and the numbers of miRNAs and mRNAs in each module are at least 2 and 3, respectively. Here, a complete bipartite graph or a biclique is a special type of bipartite graph where every miRNA is connected to every mRNA. To understand potential biological implications associated with the identified miRNA-mRNA regulatory modules, we perform Gene Ontology (GO) [44], Kyoto Encyclopedia of Genes and Genomes (KEGG) [45], Reactome [46], Cancer hallmark [47], and Cell marker [48] enrichment analysis by using the clusterProfiler R package [49]. A GO, KEGG, Reactome, Hallmark or Cell marker term with adjusted p value < 0.05 (adjusted by Benjamini–Hochberg method) is regarded as a significant term. We also conduct CML enrichment analysis by using a hyper-geometric test to evaluate whether the miRNAs and mRNAs in each module are significantly enriched in CML or not. The significance p value of each module enriched in CML is calculated as:
$$p - value = 1 - \sum\limits_{x = 0}^{s - 1} {\frac{{\left( \begin{gathered} Q \hfill \\ x \hfill \\ \end{gathered} \right)\left( \begin{gathered} N - Q \hfill \\ M - x \hfill \\ \end{gathered} \right)}}{{\left( \begin{gathered} N \hfill \\ M \hfill \\ \end{gathered} \right)}}}$$
(5)
where N is the total number of genes (miRNAs and mRNAs) expressed in the dataset, Q represents the number of CML-related genes in the dataset, M is the total number of genes in each module, and s is the number of CML-related genes in each module. The cutoff of p value is set as 0.05.