A novel method for predicting cell abundance based on single-cell RNA-seq data
BMC Bioinformatics volume 22, Article number: 281 (2021)
Abstract
Background
It is important to understand the cell-type composition of intact tissues and the proportion of each cell type, as changes in certain cell types can be the underlying cause of disease in humans. Although cell-type composition and proportions can be obtained by single-cell sequencing, it is currently expensive and cannot readily be applied in clinical studies involving a large number of subjects. It is therefore useful to combine a bulk RNA-seq dataset with a single-cell RNA-seq dataset and use deconvolution to obtain the cell-type composition of the tissue.
Results
By analyzing existing cell population prediction methods, we found that most of them need a cell-type-specific gene expression profile as the input signature matrix. However, in real applications, a suitable signature matrix is not always available. To solve this problem, we propose a novel method, named DCap, to predict cell abundance. DCap is a deconvolution method based on non-negative least squares. During the non-negative least squares calculation, DCap takes into account weights arising from the measurement noise of bulk RNA-seq and the calculation error of single-cell RNA-seq data, and performs a weighted iterative least-squares calculation. By weighting the bulk tissue gene expression matrix and the single-cell gene expression matrix, DCap minimizes the measurement error of bulk RNA-seq and also reduces errors resulting from differences in the number of expressed genes in the same type of cells in different samples. Evaluation tests show that DCap performs better in cell type abundance prediction than existing methods.
Conclusion
DCap solves the deconvolution problem using weighted non-negative least squares to predict cell type abundance in tissues. DCap gives better prediction results and does not require a signature matrix of cell-type-specific gene expression profiles to be prepared in advance. Using DCap, we can better study changes in cell proportions in diseased tissues and provide more information for the follow-up treatment of diseases.
Background
Biological tissues are often complex and consist of many morphologically similar cells and intercellular substances. For example, blood contains various cell types such as granulocytic, erythroid, megakaryocytic, and mononuclear cells [1]. It is important to understand the composition of cell types and their proportions in intact tissues, as changes in certain cell types might be the underlying causes of diseases in humans [2]. If we can describe how cell-type composition differs between diseases or between subjects, we can better understand the mechanism of a disease and better identify the cell targets for treating it [3, 4]. Based on single-cell RNA sequencing data, the composition of cell types and their proportions in intact tissues can be estimated. Given bulk RNA-seq data for a tissue and single-cell data for the corresponding cell types, the cell-type composition of the tissue can be predicted by deconvolution.
Bulk RNA-seq is a widely used sequencing method. It extracts RNA from all cells in the tissue and then breaks it into fragments [5]. The data obtained by bulk RNA-seq represent the average expression of genes across all cells in the tissue. In contrast, single-cell sequencing uses single-cell separation technology to isolate individual cells and optimized next-generation sequencing (NGS) technology to sequence them and obtain the gene expression profile of each cell [6]. Single-cell sequencing can reveal differences between cells in specific microenvironments, which facilitates the study of their functional differences. It helps us to study different cell types, which is of great benefit to developmental biology. Although single-cell sequencing can obtain the composition and abundance of cell types, it is too expensive to be applied in clinical studies involving numerous subjects. Therefore, there is a pressing need for a method that infers the proportion of each cell type in a tissue from known cell-type-specific gene expression profiles obtained from scRNA-seq data.
According to how deconvolution is implemented, existing methods can be broadly divided into two categories: non-negative least squares-based methods and support vector regression (SVR)-based methods.
The least-squares method is a mathematical optimization method. It finds the best-fitting function for the data by minimizing the sum of the squared errors between the fitted values and the observed data [7]. Several deconvolution methods are based on non-negative least squares, such as DeconRNASeq and MuSiC. DeconRNASeq [8] is an R package for deconvolution of heterogeneous tissues based on mRNA-Seq data. It uses a globally optimized non-negative decomposition algorithm to estimate the mixing proportions of different cell types in next-generation sequencing data through quadratic programming. The input of DeconRNASeq is a cell-type-specific gene expression matrix and a mixture gene expression matrix, and the output is a cell proportion matrix. MuSiC [9] is an R package that uses cell-type-specific gene expression from single-cell RNA sequencing data to characterize cell type compositions from bulk RNA-seq data in complex tissues. It uses weighted non-negative least squares (W-NNLS) to implement deconvolution. The input of MuSiC is a single-cell RNA-seq dataset and a tissue gene expression matrix obtained by bulk RNA-seq, and the output is a cell proportion matrix. MuSiC weights the non-negative least squares input matrix based on the variance of the expression of the same type of cells in different samples.
Support vector machine (SVM) is a supervised learning method used for classification and regression; its regression form is support vector regression (SVR) [10]. Several deconvolution methods are based on SVR, such as CIBERSORT, BSEQ-sc, and CPM. CIBERSORT [11] is a web-based tool that uses gene expression data to estimate cell type abundance in mixed cell populations. CIBERSORT provides a signature gene file named LM22, which covers 22 different types of immune cells. If the bulk data include only these cell types, users can use LM22 directly and obtain the deconvolution result. If other cell types are involved in the input, users need to upload their own signature gene file. BSEQ-sc [12] is an R package that obtains cell type proportions based on the CIBERSORT deconvolution step and integrates the obtained proportions into cell-type-specific differential analysis. CPM [13] is an R package that identifies cell abundance from bulk gene expression data of heterogeneous samples using deconvolution based on cell population mapping. To improve performance in the presence of a large number of reference profiles, CPM uses a consensus approach. It repeats the deconvolution N times on N different subsets of the reference profiles. The final predicted abundance is the average of the N results.
There are also cell abundance prediction methods that do not use deconvolution against a reference, such as UNDO and TIMER. UNDO [14] is an R package for unsupervised deconvolution of mixed expression matrices of tumor and stromal cells. It automatically detects cell-specific marker genes located on the scatter radii of the mixed gene expression, estimates the proportion of cells in each sample, and deconvolutes the mixed expression into cell-specific expression profiles. It does not require a signature matrix that provides the cell-type-specific gene expression profile. TIMER [15] is a web-based tool for systematically assessing the clinical impact of different immune cells in specific cancers. It estimates the abundance of six types of immune cells in the tumor microenvironment through a new statistical method.
The major limitation of existing methods is that users need to provide the signature matrix of cell-type-specific gene expression profiles. However, the signature matrix is not always available. Among the aforementioned methods, only MuSiC generates a signature matrix from single-cell data alone. We therefore improved the process of calculating the signature matrix and propose a better method, DCap (Deconvolution Cell abundance prediction).
Results
Experimental dataset
We used three datasets as experimental datasets, including two single-cell RNA sequencing datasets and one bulk RNA-seq dataset. Details are shown in Table 1.
Evaluation metrics
Three metrics are used for evaluation: root-mean-square deviation (RMSD), mean absolute difference (mAD), and the Pearson product-moment correlation coefficient (R).
Root-mean-square deviation
The root-mean-square deviation measures the difference between two sets of values. RMSD is applied to evaluate the error in the prediction. A smaller RMSD indicates that the predicted values are closer to the ground truth.
RMSD is calculated as:
\[\mathrm{RMSD} = \sqrt{\mathrm{avg}\left[ \left( \alpha - \hat{\alpha } \right) ^2 \right] }\]
where \(\alpha\) represents the true value and \(\hat{\alpha }\) represents the predicted value.
Mean absolute difference
The mean absolute difference is the average absolute difference between the predicted values and the ground truth. It is also used to express the quality of the predicted results. A smaller mAD indicates that the predicted values are closer to the ground truth.
mAD is calculated as:
\[\mathrm{mAD} = \mathrm{avg}\left[ \left| \alpha - \hat{\alpha } \right| \right]\]
where \(\alpha\) represents the true value and \(\hat{\alpha }\) represents the predicted value.
Pearson correlation coefficient
The Pearson product-moment correlation coefficient measures the degree of linear correlation between two variables; its value lies between \(-1\) and 1. A higher correlation between the predicted values and the ground truth indicates a better prediction result.
The Pearson correlation coefficient between two variables is their covariance divided by the product of their standard deviations. R is calculated as:
\[R = \frac{\mathrm{Cov}\left( \alpha , \hat{\alpha } \right) }{\sigma _{\alpha }\, \sigma _{\hat{\alpha }}}\]
where \(\alpha\) represents the true value, \(\hat{\alpha }\) represents the predicted value, and \(\sigma\) denotes the standard deviation.
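The three metrics can be illustrated with a short sketch. It assumes the true and predicted proportions are given as two flat lists of equal length (one entry per sample and cell type); the function names and the example numbers are illustrative and are not taken from the DCap implementation.

```python
import math

def rmsd(truth, pred):
    """Root-mean-square deviation between true and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(truth, pred)) / len(truth))

def mad(truth, pred):
    """Mean absolute difference between true and predicted values."""
    return sum(abs(a - b) for a, b in zip(truth, pred)) / len(truth)

def pearson_r(truth, pred):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(truth)
    mean_t = sum(truth) / n
    mean_p = sum(pred) / n
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(truth, pred))
    var_t = sum((a - mean_t) ** 2 for a in truth)
    var_p = sum((b - mean_p) ** 2 for b in pred)
    return cov / math.sqrt(var_t * var_p)

# Made-up proportions for three cell types in one sample.
truth = [0.5, 0.3, 0.2]
pred = [0.4, 0.4, 0.2]
print(rmsd(truth, pred), mad(truth, pred), pearson_r(truth, pred))
```

Lower RMSD and mAD and a higher R all point in the same direction, which is why the three are reported together.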
Performance evaluation on simulated dataset
To demonstrate and evaluate DCap, we first carried out simulation experiments. Two single-cell datasets, E-MTAB-5061 [16] and GSE81608 [17], were used in the simulation experiment.
Simulation dataset generation
The method has two inputs: a bulk RNA-seq dataset and a single-cell RNA-seq dataset. The single-cell RNA-seq dataset is E-MTAB-5061. We use another single-cell RNA-seq dataset, GSE81608, to generate the bulk RNA-seq dataset.
The GSE81608 dataset contains 18 samples (12 normal samples and 6 T2D samples). Treating each sample as one bulk RNA-seq sample, we obtain a dataset of 18 bulk RNA-seq samples. The gene expression matrices of all cells from the same sample are merged to obtain the gene expression of the bulk RNA-seq sample. Then, we record the number of cells of each type in each bulk RNA-seq sample to provide the ground truth for the subsequent evaluation.
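The construction of one simulated bulk sample can be sketched as follows. The sketch assumes that "merging" means summing the per-cell counts gene by gene, and that each cell is represented as a `(cell_type, {gene: count})` pair; both the data layout and the function name `make_pseudo_bulk` are illustrative assumptions.

```python
from collections import Counter

def make_pseudo_bulk(cells):
    """Build one simulated bulk sample from the single cells of one subject.

    cells: list of (cell_type, {gene: count}) pairs.
    Returns (bulk expression per gene, true proportion per cell type).
    """
    bulk = Counter()
    type_counts = Counter()
    for cell_type, expression in cells:
        bulk.update(expression)        # adds the counts gene by gene
        type_counts[cell_type] += 1    # ground truth: cells seen per type
    total_cells = sum(type_counts.values())
    proportions = {k: n / total_cells for k, n in type_counts.items()}
    return dict(bulk), proportions

# Made-up example: three cells from one sample.
sample = [
    ("alpha", {"GCG": 10, "INS": 0}),
    ("beta", {"GCG": 1, "INS": 20}),
    ("beta", {"INS": 15}),
]
bulk, truth = make_pseudo_bulk(sample)
print(bulk, truth)
```

Because the cell-type labels are known for every cell, the recorded proportions serve as the ground truth against which predicted abundances are scored.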
Experimental results
To benchmark the methods systematically, we first applied DCap and four other methods (non-negative least squares (NNLS), MuSiC, CIBERSORT, and BSEQ-sc) to the simulated dataset to obtain the predicted cell abundance. We used three metrics (RMSD, mAD, and R) to evaluate the results of the different methods. Table 2 shows that DCap performs best among the five methods on all three metrics: its RMSD and mAD values are the smallest, and its R value is the highest.
To compare with the ground truth, we visualize the ground truth and the prediction results of three algorithms (DCap, MuSiC, and NNLS) in Fig. 1. The results show that DCap performs best among the three methods. A heat map of the absolute difference between the predicted values and the ground truth is shown in Fig. 2.
Figure 2 shows that DCap is superior to the other two methods. To compare DCap with the other methods more clearly, we made a boxplot of the difference between the predicted value and the ground truth for each cell type, shown in Fig. 3. A smaller difference between the predicted value and the true value indicates a better result. Finally, we aggregated the absolute differences for each method and made a boxplot of the absolute difference per method in Fig. 4. Figure 4 shows that the total absolute difference between the predicted and true values is smallest for DCap. Overall, DCap performs better than the other methods.
Cell proportion prediction on real dataset
We applied the model to a real bulk RNA-seq dataset to analyze the proportions of various cell types in real tissues.
We used GSE50244 [18], a bulk RNA-seq dataset, and E-MTAB-5061, a single-cell RNA-seq dataset, as input. The GSE50244 dataset contains gene expression data of 89 islet samples.
By applying DCap and three other methods, we estimated the proportions of the six main cell types in the islet: alpha, beta, delta, gamma, acinar, and ductal, which together account for more than 90% of the islet’s cells. The relative abundance of the cell types is shown in Fig. 5.
The results show that the proportion of beta cells is the largest, which is in line with known biomedical knowledge. All four methods estimate the proportion of gamma cells to be the smallest.
Discussion
Type 2 diabetes mellitus (T2D) is generally diagnosed from the level of HbA1c: a patient whose HbA1c level is greater than 6.5% is diagnosed with T2D. As T2D progresses and the HbA1c level increases, the number of beta cells gradually decreases.
We evaluated the performance of DCap using the cell changes caused by T2D. From the proportion of beta cells in each islet tissue and the corresponding HbA1c level, a regression line can be obtained by linear regression. The quality of the fit can be measured by \(r^2\) and the p value. \(r^2\) ranges from 0 to 1, and the closer it is to 1, the better the fit. A smaller p value indicates a more reliable linear regression model. The regression models are shown in Fig. 6.
Figure 6 indicates that the proportion of beta cells predicted by DCap is correlated with the HbA1c level. DCap has a higher \(r^2\) and a smaller p value, which shows that its prediction results are generally better than those of the other three methods.
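The regression used for this comparison can be sketched in a few lines. The sketch fits an ordinary least-squares line of predicted beta-cell proportion against HbA1c and reports \(r^2\); the numbers are made up for illustration and are not results from the paper, and the p value is omitted because it requires a t-distribution (in practice a routine such as `scipy.stats.linregress` reports it).

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept, plus r^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)   # squared Pearson correlation
    return slope, intercept, r_squared

# Hypothetical values: HbA1c (%) and predicted beta-cell proportion per subject.
hba1c = [5.2, 5.6, 6.1, 6.8, 7.4]
beta_proportion = [0.58, 0.55, 0.49, 0.44, 0.37]
slope, intercept, r2 = linear_fit(hba1c, beta_proportion)
print(slope, intercept, r2)
```

A negative slope with \(r^2\) close to 1 would correspond to the expected loss of beta cells as HbA1c rises.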
Conclusion
We proposed a novel method, named DCap, to predict cell abundance. Unlike most other methods, DCap does not need a single-cell reference matrix prepared in advance, which reduces the difficulty of cell abundance prediction. It only needs bulk RNA-seq datasets of tissue gene expression and the corresponding single-cell RNA-seq datasets. The results show that DCap performs better than other methods. With DCap, we can better study changes in cell abundance in diseased tissues and provide more information for the subsequent treatment of diseases. Inspired by the success of deep learning methods in biomedical data analysis [19,20,21,22], we will apply deep learning methods to predict cell abundance in the future.
Method
The flow chart of DCap is shown in Fig. 7.
The inputs of DCap are a bulk RNA-seq dataset and a single-cell RNA-seq dataset. First, the single-cell dataset is used to obtain the single-cell gene expression matrix and the cross-sample variance matrix of the genes for the deconvolution. Then, the bulk tissue gene expression matrix and the single-cell gene expression matrix are used for deconvolution, and the weight matrix is calculated from these two matrices. Finally, the weight matrix is used for a weighted deconvolution, and these steps are repeated until the result converges.
Single-cell RNA-seq dataset processing
Single-cell RNA-seq technology measures gene expression profiles at the level of individual cells. A single-cell RNA-seq dataset often contains cells of multiple types from multiple samples (subjects). For example, the mouse kidney cell data from Park et al. [23] were derived from seven healthy mouse kidneys and contain 43,745 cells of 16 types, with expression values for 16,273 genes in each cell. Therefore, it is necessary to select cell types according to the input data to be deconvoluted. We then generate a single-cell gene expression matrix from the single-cell RNA-seq dataset. The generated matrix contains the expression profile of each gene in each cell type. Each row in the matrix represents a gene, and each column represents a cell type. The quality of the single-cell RNA-seq dataset processing is therefore important for predicting cell abundance.
Calculating average abundance matrix of genes
Each row of the average abundance matrix represents a gene. Each column represents a cell type. The value in the matrix represents the average abundance of a certain gene in a certain type of cell.
In tissue j, the relative abundance of gene g in cells of type k is \(\theta _{jg}^k\). \(Y_{jgc}\) is the number of mRNA molecules of gene g in cell c. \(C_j^k\) is the set of cell indices for cell type k. \(\theta _{jg}^k\) is calculated as:
\[\theta _{jg}^k = \frac{\sum _{c \in C_j^k} Y_{jgc}}{\sum _{c \in C_j^k} \sum _{g'=1}^{G} Y_{jg'c}}\]
where G is the number of genes.
The single-cell RNA-seq dataset contains tissues from multiple subjects, and \(\theta _{jg}^k\) differs between subjects. Therefore, we first calculate \(\theta _{jg}^k\) for the tissue cells of each subject. The final gene relative abundance \(\theta _{g}^{k'}\) is the average of \(\theta _{jg}^k\) across subjects. Because abnormal values may exist, we first identify them before calculating the final gene relative abundance.
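For a single subject, the relative abundance described above can be computed as in the following sketch. It assumes each cell is a `(cell_type, {gene: count})` pair and that relative abundance is the share of all mRNA counts in cells of type k that comes from gene g; the data layout and the function name are illustrative.

```python
def relative_abundance(cells, cell_type):
    """Per-gene share of the mRNA counts pooled over all cells of one type."""
    gene_totals = {}
    for ctype, expression in cells:
        if ctype != cell_type:
            continue
        for gene, count in expression.items():
            gene_totals[gene] = gene_totals.get(gene, 0) + count
    total = sum(gene_totals.values())
    if total == 0:
        return {}                      # no cells (or no counts) for this type
    return {gene: count / total for gene, count in gene_totals.items()}

# Made-up cells from one subject.
subject_cells = [
    ("beta", {"INS": 30, "GCG": 0}),
    ("beta", {"INS": 50, "GCG": 20}),
    ("alpha", {"GCG": 40}),
]
print(relative_abundance(subject_cells, "beta"))
```

By construction the values for one cell type sum to 1, so they describe the composition of that cell type's transcriptome rather than its absolute expression.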
As shown in Fig. 8a, all values of \(\theta\) are placed on a number axis. The K-means clustering method is used to group the values into clusters and find the center point \(\theta _c\). Then, the outliers are removed based on their distance from the center point. Let the distance threshold be \(\rho _{\theta }\); then
\[\theta _{g}^{k'} = \frac{1}{J_\theta } \sum _{j:\, \left| \theta _{jg}^k - \theta _c \right| < \rho _{\theta }} \theta _{jg}^k\]
where \(\left| {\theta _{jg}^k - \theta _c} \right| < \rho _{\theta }\) and \(J_\theta\) is the number of \(\theta\) values remaining after excluding outliers. Generally, the most suitable value of \(\rho _{\theta }\) is selected by grid search.
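The outlier filtering can be sketched as below. The text does not state how many clusters are used or which cluster supplies the center point, so the sketch makes two assumptions that should be read as such: K-means is run in one dimension with two clusters, and the center of the largest cluster is taken as \(\theta _c\). All names are illustrative.

```python
def dominant_center(values, k=2, iterations=50):
    """1-D K-means (Lloyd's algorithm); returns the mean of the largest cluster."""
    low, high = min(values), max(values)
    # Spread the initial centers evenly between the smallest and largest value.
    centers = [low + (high - low) * i / (k - 1) for i in range(k)] if k > 1 else [low]
    clusters = [list(values)]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    largest = max(clusters, key=len)
    return sum(largest) / len(largest)

def filtered_mean(values, rho):
    """Average of the values lying closer than rho to the dominant center."""
    center = dominant_center(values)
    kept = [v for v in values if abs(v - center) < rho]
    return sum(kept) / len(kept) if kept else center

# Made-up per-subject values for one gene and cell type; 0.50 is an outlier.
thetas = [0.10, 0.11, 0.09, 0.10, 0.50]
print(filtered_mean(thetas, rho=0.05))
```

With a threshold chosen by grid search, values far from the bulk of the subjects no longer pull the averaged reference value.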
Calculating cross-sample variance matrix of genes in different cell types
Rows of the cross-sample variance matrix represent genes, and columns represent cell types. Each value in the matrix is the variance, across samples, of the expression of a gene in a certain cell type.
In tissue j, the variance of gene g expression in different samples in cells of type k is \(V_{jg}^k\). \(V_{jg}^k\) is calculated as:
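One way to make this variance concrete is the sketch below. It assumes that the quantity being compared across samples is the per-subject relative abundance of each gene in one cell type, and that the population variance (dividing by the number of subjects) is used; both choices, like the function name, are assumptions for illustration rather than the documented DCap definition.

```python
def cross_sample_variance(abundance_by_subject):
    """Variance of each gene's relative abundance across subjects.

    abundance_by_subject: list with one {gene: relative abundance} dict per
    subject, all for the same cell type.
    """
    genes = set().union(*abundance_by_subject)
    n = len(abundance_by_subject)
    variance = {}
    for gene in genes:
        # A gene missing from a subject is treated as not expressed there.
        values = [subject.get(gene, 0.0) for subject in abundance_by_subject]
        mean = sum(values) / n
        variance[gene] = sum((v - mean) ** 2 for v in values) / n
    return variance

# Made-up relative abundances in beta cells of two subjects.
beta_by_subject = [{"INS": 0.8, "GCG": 0.2}, {"INS": 0.6, "GCG": 0.2}]
print(cross_sample_variance(beta_by_subject))
```

Genes with a small value here are stable across subjects, while genes with a large value are less trustworthy as reference markers.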
Calculating cell size for each cell type
The value in the cell size vector of each tissue represents the average number of RNA molecules for each cell type.
For tissue j, let \(m_{j}^k=\left| C_{j}^k \right|\) be the total number of cells of type k and \(S_{j}^k\) be the average of the total number of RNA molecules for cells of type k. \(S_j^k\) is calculated as:
\[S_j^k = \frac{\sum _{c \in C_j^k} \sum _{g=1}^{G} Y_{jgc}}{m_j^k}\]
\(S_{j}^k\) differs between subjects. Therefore, we first calculate \(S_{j}^k\) for each subject. The final cell size \(S^{k'}\) is the average of \(S_{j}^k\) across subjects. As shown in Fig. 8b, all values of S are placed on a number axis. The K-means clustering method is used to group the values into clusters and find the center point \(S_c\). Outliers are removed by the method introduced in the previous subsection.
Let the distance threshold be \(\rho _{s}\); then
\[S^{k'} = \frac{1}{J_S} \sum _{j:\, \left| S_{j}^k - S_c \right| < \rho _{s}} S_{j}^k\]
where \(\left| {S _{j}^k - S _c} \right| < \rho _{s}\) and \(J_S\) is the number of S values remaining after excluding outliers. Generally, the most suitable value of \(\rho _{s}\) is selected by grid search.
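The per-subject cell size can be sketched as follows, assuming each cell is a `(cell_type, {gene: count})` pair and that cell size is the mean total count per cell of the given type. The function name and data layout are illustrative.

```python
def cell_size(cells, cell_type):
    """Average total number of RNA molecules per cell of the given type."""
    totals = [sum(expression.values())
              for ctype, expression in cells if ctype == cell_type]
    return sum(totals) / len(totals) if totals else 0.0

# Made-up cells from one subject.
subject_cells = [
    ("beta", {"INS": 30, "GCG": 0}),
    ("beta", {"INS": 50, "GCG": 20}),
    ("alpha", {"GCG": 40}),
]
print(cell_size(subject_cells, "beta"))   # mean of the totals 30 and 70
```

Cell size matters because a cell type that carries more RNA per cell contributes more to the bulk signal than its share of cells alone would suggest.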
Calculating single-cell gene expression matrix
Rows of the single-cell gene expression matrix represent different genes. Columns represent different cell types. The values in the matrix represent the expression level of genes in a certain type of cell.
Let \(Y_{jg}\) be the total number of mRNA molecules of gene g in a given tissue j, consisting of K types of cells. \(Y_{jg}\) is calculated as:
\[Y_{jg} = \sum _{k=1}^{K} \sum _{c \in C_j^k} Y_{jgc}\]
Based on Eqs. (1)–(6), \(Y_{jg}\) can be represented as:
\[Y_{jg} = \sum _{k=1}^{K} m_j^k S_j^k \theta _{jg}^k\]
Let \({m_j} = \mathop \sum \limits _{k = 1}^K m_j^k\) be the total number of cells in tissue j. Let \(p_j^k = \frac{{m_j^k}}{{{m_j}}}\) be the proportion of cells of type k in tissue j. \(\frac{{{Y_{jg}}}}{{{m_j}}}\) is calculated as:
\[\frac{Y_{jg}}{m_j} = \sum _{k=1}^{K} p_j^k S_j^k \theta _{jg}^k\]
The gene expression level of gene g in cells of type k is \(X_g^k\). \(X_g^k\) is calculated as:
\[X_g^k = S^{k'} \theta _{g}^{k'}\]
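The relationship between the reference matrix and the bulk signal can be checked numerically. The sketch assumes that the expression level of a gene in a cell type is the product of that type's cell size and the gene's relative abundance, and that the bulk expression per cell is the proportion-weighted sum of these values; the dictionaries and function names are illustrative.

```python
def reference_matrix(abundance, size):
    """Expression level per gene and cell type: cell size times relative abundance.

    abundance: {cell_type: {gene: relative abundance}}, size: {cell_type: cell size}
    """
    return {k: {g: size[k] * t for g, t in genes.items()}
            for k, genes in abundance.items()}

def bulk_per_cell(reference, proportions, genes):
    """Noise-free bulk expression per cell: proportion-weighted sum over types."""
    return {g: sum(proportions[k] * reference[k].get(g, 0.0) for k in reference)
            for g in genes}

# Made-up inputs for two cell types and two genes.
abundance = {"alpha": {"GCG": 0.9, "INS": 0.1}, "beta": {"GCG": 0.2, "INS": 0.8}}
size = {"alpha": 100.0, "beta": 50.0}
X = reference_matrix(abundance, size)
print(bulk_per_cell(X, {"alpha": 0.25, "beta": 0.75}, ["GCG", "INS"]))
```

Deconvolution runs this calculation in reverse: given the bulk values and the reference matrix, it searches for the proportions.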
Weighted matrix equation derivation
Considering Eq. (6), in the absence of error, we can directly use \(Y_{jg}\) and \(X_g^k\) to find \(p_j^k\). However, in practice, there is measurement noise when bulk RNA-seq is used to obtain \(Y_{jg}\). Therefore, we need to modify Eq. (6). To guarantee the condition \(\mathop \sum \nolimits _{k = 1}^K p_j^k=1\), the adjustment parameter C is added to the equation.
where, \(\epsilon _{j g} \sim N\left( 0, \delta _{j g}^{2}\right)\) represents the measurement error of bulk RNAseq.
After \(X_{jg}\) and \(p_j\) are calculated, the variance between the actual value of \(Y_{jg}\) and the estimated value is:
In addition to the measurement error that occurs during bulk RNA-seq, there is also an error in generating the single-cell reference matrix \(X_g^k\). In different samples (e.g., the same tissue type derived from different subjects), cells of the same type have different gene expression levels.
We define a gene whose expression in the same cell type has a small variance between samples as an informative gene; its expression is stable in this cell type. Genes whose expression in the same cell type has a large variance between samples are defined as non-informative genes. Therefore, when the single-cell reference matrix is calculated across samples, the relative abundance of gene g in cells of type k may not take a single value.
Both types of error matter, and both may arise while the data are obtained. Their relative importance may differ between datasets. In DCap, the two types of error are given the same weight, and their sum is used as weight information to improve prediction accuracy. The variance of \(Y_{jg}\) given the estimated \(p_j\) is then:
where \(V_{gk}\) is the variance of the expression of gene g in different samples for type k cells.
Therefore, for the tissue j, \(w_{jg}\) is calculated as:
To handle the case \(Var\left[ {Y_{jg}} \mid {p_j} \right] =0\), the adjustment parameter n is added to Eq. (11) to calculate the final weight:
Weighting the two matrices during deconvolution can reduce errors and improve the accuracy of the estimates. However, in practice \(\delta _{jg}^2\) is unknown. Therefore, we start from non-negative least squares and iteratively re-estimate the weights until convergence.
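The iteration can be sketched as follows. This is a minimal illustration under stated assumptions, not the DCap implementation: the unknown bulk noise for each gene is approximated by the squared residual of the current fit, the reference error is taken as the sum over cell types of squared proportion times cross-sample variance, the weight is the reciprocal of their sum plus a small constant `n`, and the non-negative least squares step uses a simple cyclic coordinate descent. All function and parameter names are hypothetical.

```python
import math

def nnls_cd(C, a, sweeps=300):
    """Minimize sum_g (a[g] - sum_k C[g][k] * p[k])^2 over p >= 0 (coordinate descent)."""
    G, K = len(C), len(C[0])
    p = [0.0] * K
    residual = list(a)                       # equals a - C p for p = 0
    for _ in range(sweeps):
        for k in range(K):
            col_sq = sum(C[g][k] ** 2 for g in range(G))
            if col_sq == 0.0:
                continue
            grad = sum(C[g][k] * residual[g] for g in range(G))
            new_value = max(0.0, p[k] + grad / col_sq)   # clip at zero
            delta = new_value - p[k]
            if delta != 0.0:
                for g in range(G):
                    residual[g] -= C[g][k] * delta
                p[k] = new_value
    return p

def weighted_deconvolve(y, X, V, n=1e-6, max_rounds=20, tol=1e-8):
    """Iteratively reweighted NNLS; returns proportions normalized to sum to 1."""
    G, K = len(X), len(X[0])
    w = [1.0] * G                            # first round is plain NNLS
    p = [0.0] * K
    for _ in range(max_rounds):
        root_w = [math.sqrt(v) for v in w]
        C = [[root_w[g] * X[g][k] for k in range(K)] for g in range(G)]
        a = [root_w[g] * y[g] for g in range(G)]
        p_new = nnls_cd(C, a)
        for g in range(G):
            fit = sum(X[g][k] * p_new[k] for k in range(K))
            bulk_noise = (y[g] - fit) ** 2                 # assumed noise estimate
            reference_noise = sum(p_new[k] ** 2 * V[g][k] for k in range(K))
            w[g] = 1.0 / (bulk_noise + reference_noise + n)
        converged = max(abs(u - v) for u, v in zip(p, p_new)) < tol
        p = p_new
        if converged:
            break
    total = sum(p)
    return [v / total for v in p] if total > 0 else p

# Made-up example: three genes, two cell types, true proportions 0.25 and 0.75.
X = [[90.0, 10.0], [10.0, 40.0], [5.0, 5.0]]
V = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
y = [30.0, 32.5, 5.0]
print(weighted_deconvolve(y, X, V))
```

Genes that fit poorly or vary strongly between subjects receive small weights, so later rounds rely mostly on stable, well-explained genes.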
Deconvolution equation derivation
Based on Eqs. (6) and (7), \(Y_{jg}\) is calculated as:
Then we multiply both sides of Eq. (15) by the weights:
Let A, B, and C be three matrices, where \(A = \frac{{\sqrt{{w_{jg}}} {Y_j}}}{{{m_j}}}\), \(B=p_j\), \(C = \sqrt{{w_{jg}}} X\). The problem can then be defined as finding the non-negative B that attains \(\min _{B} \left\| A - CB \right\| ^2\), which is a least-squares problem.
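Written element by element, the least-squares problem above takes the following form. This is a sketch that follows directly from the definitions of A, B, and C, with the non-negativity constraint on the proportions made explicit; the index notation \(A_g\), \(C_{gk}\), \(B_k\) is introduced here only for illustration.

```latex
\[
A_g = \frac{\sqrt{w_{jg}}\, Y_{jg}}{m_j}, \qquad
C_{gk} = \sqrt{w_{jg}}\, X_g^k, \qquad
B_k = p_j^k
\]
\[
\hat{p}_j
= \arg\min_{B \ge 0} \sum_{g} \Bigl( A_g - \sum_{k=1}^{K} C_{gk} B_k \Bigr)^2
= \arg\min_{p_j \ge 0} \sum_{g} w_{jg}
  \Bigl( \frac{Y_{jg}}{m_j} - \sum_{k=1}^{K} p_j^k X_g^k \Bigr)^2
\]
```

The second form shows that scaling both sides by \(\sqrt{w_{jg}}\) is the same as weighting each gene's squared residual by \(w_{jg}\), so an ordinary non-negative least squares solver can be applied to the scaled matrices.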
After inputting the single-cell dataset, we use Eq. (10) to calculate the single-cell reference matrix.
The gene expression matrix Y usually contains gene expression of multiple tissues. We predict each tissue separately and integrate the results into one matrix.
Availability of data and materials
The data analyzed in this study are existing datasets, which are openly available at the locations cited in the reference section. The E-MTAB-5061 dataset has been deposited in ArrayExpress (EBI): https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-5061/ The GSE50244 dataset has been deposited in NCBI GEO: https://www.ncbi.nlm.nih.gov//geo/query/acc.cgi?acc=GSE50244 The GSE81608 dataset has been deposited in NCBI GEO: https://www.ncbi.nlm.nih.gov//geo/query/acc.cgi?acc=GSE81608
Abbreviations
SVR: Support vector regression
SVM: Support vector machine
NNLS: Non-negative least squares
RMSD: Root-mean-square deviation
mAD: Mean absolute difference
T2D: Type 2 diabetes
References
1. Kaiser CA, Krieger M, Lodish ABH. Molecular cell biology. San Francisco: WH Freeman; 2007.
2. Schelker M, Feau S, Du J, Ranu N, Klipp E, MacBeath G, Schoeberl B, Raue A. Estimation of immune cell content in tumour tissue using single-cell RNA-seq data. Nat Commun. 2017;8:2032.
3. Wang T, Peng Q, Liu B, Liu Y, Wang Y. An end-to-end heterogeneous graph representation learning-based framework for drug-target interaction prediction. Brief Bioinform. 2020;8:418.
4. Zhang Y, Dai H, Yun Y, Liu S, Shang X. Meta-knowledge dictionary learning on 1-bit response data for student knowledge diagnosis. Knowl Based Syst. 2020;205:106290.
5. Owens B. Genomics: the single life. Nat News. 2012;491:27.
6. Eberwine J, Sul JY, Bartfai T, Kim J. The promise of single-cell sequencing. Nat Methods. 2014;11(1):25.
7. Björck A. Least squares methods. In: Handbook of numerical analysis. 1990;1, pp. 465–652.
8. Gong T, Szustakowski JD. DeconRNASeq: a statistical framework for deconvolution of heterogeneous tissue samples based on mRNA-Seq data. Bioinformatics. 2013;29(8):1083–5.
9. Wang X, Park J, Susztak K, Zhang NR, Li M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat Commun. 2019;10(1):380.
10. Basak D, Pal S, Patranabis DC. Support vector regression. Neural Inf Process Lett Rev. 2007;11:203–24.
11. Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, Hoang CD, Diehn M, Alizadeh AA. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods. 2015;12(5):453.
12. Baron M, Veres A, Wolock SL, Faust AL, Gaujoux R, Vetere A, Ryu JH, Wagner BK, Shen-Orr SS, Klein AM. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 2016;3(4):346–360.e4.
13. Frishberg A, Peshes-Yaloz N, Cohn O, Rosentul D, Steuerman Y, Valadarsky L, Yankovitz G, Mandelboim M, Iraqi FA, Amit I. Cell composition analysis of bulk genomics using single-cell data. Nat Methods. 2019;16:327–32.
14. Wang N, Gong T, Clarke R, Chen L, Shih IM, Zhang Z, Levine DA, Xuan J, Wang Y. UNDO: a Bioconductor R package for unsupervised deconvolution of mixed gene expressions in tumor samples. Bioinformatics. 2015;31(1):137–9.
15. Li B, Severson E, Pignon JC, Zhao HQ, Li TW, Novak J, Jiang P, Shen H, Aster JC, Rodig S. Comprehensive analyses of tumor immunity: implications for cancer immunotherapy. Genome Biol. 2016;1(17):174.
16. Segerstolpe Å, Palasantza A, Eliasson P, Andersson EM, Andréasson AC, Sun X, Picelli S, Sabirsh A, Clausen M, Bjursell MK. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 2016;24(4):593–607.
17. Xin Y, Kim J, Okamoto H, Ni M, Wei Y, Adler C, Murphy AJ, Yancopoulos GD, Lin C, Gromada J. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell Metab. 2016;24(4):608–15.
18. Fadista J, Vikman P, Laakso EO, Mollet IG, Esguerra JL, Taneera J, Storm P, Osmark P, Ladenvall C, Prasad RB. Global genomic and transcriptomic analysis of human pancreatic islets reveals novel genes influencing glucose metabolism. Proc Natl Acad Sci. 2014;111(38):13924–9.
19. Peng J, Hui W, Li Q, Chen B, Hao J, Jiang Q, Shang X, Wei Z. A learning-based framework for miRNA-disease association prediction using neural networks. Bioinformatics. 2018;21:21.
20. Peng J, Wang X, Shang X. Combining gene ontology with deep neural networks to enhance the clustering of single cell RNA-seq data. BMC Bioinform. 2019;20:284.
21. Peng J, Xue H, Wei Z, Tuncali I, Hao J, Shang X. Integrating multi-network topology for gene function prediction using deep neural networks. Brief Bioinform. 2020;22(2):2096–105.
22. Peng J, Wang Y, Guan J, Li J, Han R, Hao J, Wei Z, Shang X. An end-to-end heterogeneous graph representation learning-based framework for drug-target interaction prediction. Brief Bioinform. 2021. https://doi.org/10.1093/bib/bbaa430.
23. Park J, Shrestha R, Qiu C, Kondo A, Huang S, Werth M, Li M, Barasch J, Suszták K. Single-cell transcriptomics of the mouse kidney reveals potential cellular targets of kidney disease. Science. 2018;360(6390):2131.
Acknowledgements
Not applicable.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 22 Supplement 9, 2021: Selected articles from the Biological Ontologies and Knowledge bases workshop 2019: part two. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-22-supplement-9.
Funding
Publication costs were funded by the National Natural Science Foundation of China (Nos. 61702421, U1811262, 61772426), the International Postdoctoral Fellowship Program (No. 20180029), the China Postdoctoral Science Foundation (No. 2017M610651), the Fundamental Research Funds for the Central Universities (No. 3102018zy033), and the Top International University Visiting Program for Outstanding Young Scholars of Northwestern Polytechnical University. The funding bodies had no role in the design, collection, and analysis of the research.
Author information
Contributions
JP and XS designed the algorithm; LH implemented the algorithm; JP and LH wrote this manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent to publish
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Peng, J., Han, L. & Shang, X. A novel method for predicting cell abundance based on single-cell RNA-seq data. BMC Bioinformatics 22, 281 (2021). https://doi.org/10.1186/s12859-021-04187-4
Keywords
 Deconvolution
 Bioinformatics
 Cell abundance prediction
 Weighted least squares