A Markov random field model for network-based differential expression analysis of single-cell RNA-seq data

Background Recent development of single cell sequencing technologies has made it possible to identify genes with different expression (DE) levels at the cell type level between different groups of samples. In this article, we propose to borrow information through known biological networks to increase statistical power to identify differentially expressed genes (DEGs). Results We develop MRFscRNAseq, which is based on a Markov random field (MRF) model to appropriately accommodate gene network information as well as dependencies among cell types to identify cell-type specific DEGs. We implement an Expectation-Maximization (EM) algorithm with mean field-like approximation to estimate model parameters and a Gibbs sampler to infer DE status. Simulation study shows that our method has better power to detect cell-type specific DEGs than conventional methods while appropriately controlling type I error rate. The usefulness of our method is demonstrated through its application to study the pathogenesis and biological processes of idiopathic pulmonary fibrosis (IPF) using a single-cell RNA-sequencing (scRNA-seq) data set, which contains 18,150 protein-coding genes across 38 cell types on lung tissues from 32 IPF patients and 28 normal controls. Conclusions The proposed MRF model is implemented in the R package MRFscRNAseq available on GitHub. By utilizing gene-gene and cell-cell networks, our method increases statistical power to detect differentially expressed genes from scRNA-seq data. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04412-0.


SAMD3
LGALS3B LGM LGR RYR S100A1 S100A10 S100A11 S100A13 S100A7 S100A8 S100A9 S100B S100P S100Z  TAF   TAN   TBC1D4   TBC1D8   TBX   TCEAL5   TCEAL9   TCEAN   TCF1   TCF4   TCF7   TCH   TCL   TCO   TCT   TDR   TEC   TENT4   TENT5   TES   TEX   TFD   TFF   TFR   TGF   TGM   THBD   THBS   THR   TIA   TIG   TIMP1   TIMP3   TJP1   TJP2   TK1   TKT   TLE   TLR   TM4   TMEM10   TMEM13   TMEM176A   TMEM176B   TMEM18   TMEM51   TMEM52   TMS   TMX   TNC   TNF   TNFA   TNFRSF10B   TNFRSF10D   TNFRSF11A  TNFRSF11B   TNFRSF17   TNFRSF18   TNFRSF8   TNFSF10   TNFSF11   TNFSF14   TNFSF8   TNIK   TNIP   TNNI   TNNT   TNS   TO B   TO M   TO P   TOX   TP6   TPD52   TPD52L   TPM   TPS   TPX   TRAF1   TRAF3   TRIB   TRIM3   TRIM6   TRO   TRP   TSC   TSG   TSH   TSN   TSPAN1   TSPAN4   TSPAN5   TSPO   TSPY   TTC   TTL   TUBA   TUBB   TULP1   TULP2   TWI   TXK   TXL   TXN   TXNDC11   TXNDC16   TXNDC5   TYM Figure 2B. Simulation results for Scenario B. We varied the number of cells per cell type in each subject to be 100. The results are plotted in terms of sensitivity, specificity and FDR for two-sample t-test, the likelihood ratio test that adopts a logistic regression framework, Wilcoxon rank sum test, MAST, three pseudo-bulk methods: DESeq2, edgeR, and limma, two mixed model methods: dream and its updated version dream2, and the proposed MRF model. Each box-plot represents 100 replications.  Figure 2C. Simulation results for Scenario C. We varied the to be 3. The results are plotted in terms of sensitivity, specificity and FDR for two-sample t-test, the likelihood ratio test that adopts a logistic regression framework, Wilcoxon rank sum test, MAST, three pseudo-bulk methods: DESeq2, edgeR, and limma, two mixed model methods: dream and its updated version dream2, and the proposed MRF model. Each box-plot represents 100 replications.  Figure 2E. The impact of different adjusted p-value thresholds on sensitivity, specificity, and FDR. Four adjusted p-value cutoffs were used: 0.01, 0.05, 0.1, and 0.2. Sensitivity, specificity, and FDR are plotted for all ten methods under the simulation settings where = 0.2, and = 0.2 or = 0.4 in the main manuscript (corresponding to the first two cases in Figure 2).  Figure 3. The two additional cell networks we used in our analysis in order to assess the importance of cell network structures in the proposed MRF model. Models with cell network in (A) are labeled with C1, and models with cell network in (B) are labeled with C2.

Main No
The DE results are based on two-sample t-test.
Main w/ BioGrid Yes It uses test statistics from two-sample t-test as DE evidence, gene network from BioGrid database, and cell network in Figure 1 to build the MRF model.

Main w/ IntAct
Yes It uses test statistics from two-sample t-test as DE evidence, gene network from IntAct database, and cell network in Figure 1 to build the MRF model.

Wilcoxon No
The DE results are based on Wilcoxon test.
Wilcoxon w/ BioGrid Yes It uses test statistics from Wilcoxon test as DE evidence, gene network from BioGrid database, and cell network in Figure 1 to build the MRF model.

Wilcoxon w/ IntAct
Yes It uses test statistics from Wilcoxon test as DE evidence, gene network from IntAct database, and cell network in Figure 1 to build the MRF model.

MAST No
The DE results are based on MAST analysis.

MAST w/ BioGrid
Yes It uses test statistics from MAST analysis as DE evidence, gene network from BioGrid database, and cell network in Figure 1 to build the MRF model.

MAST w/ IntAct
Yes It uses test statistics from MAST analysis as DE evidence, gene network from IntAct database, and cell network in Figure 1 to build the MRF model.

Main w/ BioGrid C1 Yes
It uses test statistics from two-sample t-test as DE evidence, gene network from BioGrid database, and cell network C1 in Supplementary Figure 3 to build the MRF model.

Main w/ IntAct C1 Yes
It uses test statistics from two-sample t-test as DE evidence, gene network from IntAct database, and cell network C1 in Supplementary Figure 3 to build the MRF model.

Main w/ BioGrid C2 Yes
It uses test statistics from two-sample t-test as DE evidence, gene network from BioGrid database, and cell network C2 in Supplementary Figure 3 to build the MRF model.

Main w/ IntAct C2 Yes
It uses test statistics from two-sample t-test as DE evidence, gene network from IntAct database, and cell network C2 in Supplementary Figure 3 to build the MRF model. Figure 4. The parameter estimates for the Main MRF models, MAST MRF models, and Wilcoxon MRF models with BioGrid and IntAct gene networks. (A) shows the trace plots for , !"#" , and $"%% from the EM algorithm with mean filedlike approximation. (B) shows the parameter estimates for , !"#" , and $"%% and their corresponding 95% confidence intervals.