scEvoNet: a gradient boosting-based method for prediction of cell state evolution



Exploring the function or the developmental history of cells in various organisms provides insights into a given cell type's core molecular characteristics and putative evolutionary mechanisms. Numerous computational methods now exist for analyzing single-cell data and identifying cell states. These methods mostly rely on the expression of genes considered as markers for a given cell state. Yet, there is a lack of scRNA-seq computational tools to study the evolution of cell states, particularly how cell states change their molecular profiles. This can include novel gene activation or the novel deployment of programs already existing in other cell types, known as co-option.


Here we present scEvoNet, a Python tool for predicting cell type evolution in cross-species or cancer-related scRNA-seq datasets. ScEvoNet builds the confusion matrix of cell states and a bipartite network connecting genes and cell states. It allows a user to obtain a set of genes shared by the characteristic signature of two cell states even between distantly-related datasets. These genes can be used as indicators of either evolutionary divergence or co-option occurring during organism or tumor evolution. Our results on cancer and developmental datasets indicate that scEvoNet is a helpful tool for the initial screening of such genes as well as for measuring cell state similarities.


The scEvoNet package is implemented in Python and is freely available. Utilizing this framework and exploring the continuum of transcriptome states between developmental stages and species will help explain cell state dynamics.


Cells, the fundamental building blocks of multicellular organisms, display great diversity in complex organisms. They include differentiated, function-specific cells, the stem cells that renew them during the organism's lifetime, and all the transitional states between these two points. In disease, cell and tissue homeostasis is altered, leading to the appearance of new pathological and dysfunctional cells. During evolution, the diversification of cell types is caused by genomic individualization relying on fundamental evolutionary principles such as functional segregation, divergence, co-option of gene modules, and de novo gene emergence. Co-option of gene programs is a mechanism allowing the emergence of new functions in a cell type by using existing gene networks from other cell types [1, 2]. The understanding of cell biology emanates from describing cells by their functions, their gene expression, their interactions with their environment, and their lineage relationships. The emergence of single-cell RNA sequencing (scRNA-seq) began a new age of transcriptomic research, extending our understanding of cell heterogeneity and dynamics. Highly detailed atlases of cell types were produced for many tissues and organisms, in normal or pathological conditions [3,4,5,6,7]. Comparing those highly divergent datasets would allow asking key questions regarding the conservation of core genetic programs in distantly related cellular contexts, the origins of cellular diversity and its evolutionary mechanisms, or the transcriptional paths leading to disease. However, data obtained from various biological conditions and organisms are confounded by technical and biological batch effects, which greatly complicates their comparison [8, 9]. Thus, the forces shaping transcriptome dynamics remain poorly understood.
Another application of scRNA-seq in evolutionary biology is assessing tumor heterogeneity and tracking its transformation, as well as evaluating the selective evolution of tumors during therapy or metastatic progression [10]. scRNA-seq overcomes the constraints of classic bulk RNA sequencing by estimating the transcriptome at the single-cell level and characterizing the various cell types in the tumor microenvironment. Moreover, this allows a better understanding of the molecular mechanisms facilitating tumor occurrence. Although scRNA-seq could potentially reveal the somatic mutations arising during tumor evolution, the sparsity of scRNA-seq data [11] often prevents mutation calling (one of the main information sources for studying tumor evolution). Still, scRNA-seq of tumors can determine the dynamic changes in tumor heterogeneity and the transcriptional evolution of tumor cells during metastasis development [12].

Currently, no dedicated tool takes closely or distantly related scRNA-seq datasets as input to study the potential co-option and evolution of gene programs between different organisms during development and differentiation, or between tumor cells at different stages of tumor progression. Xu et al. [12] used Monocle [13] and scVelo [14] to study the transcriptome dynamics of malignant cells between the primary tumor and lymph node metastases. They also used NATMI [15] to generate a cell-to-cell receptor-ligand network in which edges are based on the expression of a ligand in one cell type and of its related receptor in another cell type. However, this latter strategy is not designed to study co-option in cancer, which is a crucial mechanism driving the molecular changes that propel tumor progression [16]. In another work, Pandey et al. used scRNA-seq to study the evolution of neuronal types by comparing cell types in larval and adult zebrafish. They used Random Forest to train a model for each cell type and applied each model to the cells, building a confusion matrix that maps cell types by the number of cells predicted with each model [17]. Yet, this strategy is not wrapped into a framework and is not designed to extract genes that are characteristic of cell type transitions. These genes could be considered the drivers of genomic individualization, which is the process of diversification of cell types during evolution. These genes might play a role in functional segregation, divergence, co-option of gene modules, and de novo gene emergence. Thus, we present scEvoNet, a method that builds a cell type-to-gene network using the Light Gradient Boosting Machine (LGBM) algorithm [18], overcoming domain effects (different species or different datasets) and the dropouts that are inherent to scRNA-seq data [19].
This tool predicts potentially co-opted genes together with genes characteristic of each cell state during development across species. Recently we showed the ability of a similar LGBM-based classifier to detect neural crest cells in distantly-related scRNA-seq datasets [20]. Despite technical batch effects (datasets were made in different laboratories with different technologies) and biological batch effects (datasets were from two evolutionarily distant organisms and at different developmental time points), we have achieved a high AUC score of 0.95 for classifying zebrafish cells with our frog-based NC model [20]. Here, we have expanded this method: scEvoNet applies to a variety of applications, e.g., between different time points during a given organism’s development, between species, and when comparing primary tumor and metastasis. We believe that scEvoNet will facilitate the study of cell state transitions in a variety of contexts and from highly divergent datasets.


The workflow of scEvoNet is illustrated in Fig. 1A. scEvoNet takes (1) an expression matrix and (2) a list of cell labels as input data for each organism or time point of interest. For each cell type provided by the user, scEvoNet generates an LGBM binary classifier (one cell type vs all other cells) in two steps (Fig. 1B). LGBM itself is a gradient boosting framework based on tree learning algorithms. First, scEvoNet trains a model using all genes in the dataset. From this model, it selects the top 3000 most important features (cell type-related genes) and uses only them to retrain the final cell type model, which is then used to generate the cell type confusion matrix. Restricting the final model to the top 3000 features is intended to reduce the batch effect. Here, the batch effect refers to variation in gene expression caused by technical differences between scRNA-seq datasets, such as different sets of genes being detected as a result of dropout. By reducing the number of genes used in the model, we aim to reduce the impact of the batch effect on model accuracy. We chose 3000 genes as a balance between minimizing the batch effect and retaining enough genes to classify cell types accurately. Thus, this is part of the domain adaptation that we perform to make the resulting model less dependent on genes that are missing from another dataset to which the model will be applied (e.g., 30% of the genes considered crucial for training an NC classifier on frog data were absent from the zebrafish dataset [20]).
Additionally, to reduce the effect of the different biological domains between datasets and of scRNA-seq data sparsity, we apply a sigmoid function that smooths expression values more flexibly than simple binarization, which has itself been shown to retain enough information for scRNA-seq data analysis [21]. For model training, we use early stopping with 10 rounds to avoid overfitting; this determines the actual number of estimators in each model. Overfitting occurs when a model becomes too complex and begins to fit the noise in the training data instead of the underlying pattern, leading to poor generalization performance on unseen data. Early stopping mitigates this issue by halting training before the model starts to learn the noise in the training data.
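
The two preprocessing ideas described above (sigmoid smoothing and restriction to the most important genes) can be illustrated with a minimal sketch. The function names, the `midpoint` and `steepness` parameters, and the toy values are assumptions made for this example only; the text does not specify the sigmoid parameters or the internal API used by scEvoNet.

```python
import math

def sigmoid_smooth(value, midpoint=1.0, steepness=2.0):
    """Map a non-negative expression value into the open interval (0, 1).

    midpoint and steepness are hypothetical parameters chosen for
    illustration; they are not taken from the scEvoNet source.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (value - midpoint)))

def top_features(importances, k=3000):
    """Return the k genes with the highest importance, best first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in ranked[:k]]

def restrict_and_smooth(cell, genes):
    """Keep only the selected genes; genes absent from the cell count as 0."""
    return [sigmoid_smooth(cell.get(gene, 0.0)) for gene in genes]

# Toy importances and one toy cell (hypothetical values).
importances = {"sox9": 120.0, "pax3": 95.0, "actb": 3.0, "tfap2a": 80.0}
selected = top_features(importances, k=3)

# tfap2a is missing from this cell, as can happen with dropout
# or when a gene is absent from the second dataset.
cell = {"sox9": 4.0, "actb": 10.0, "pax3": 0.0}
smoothed = restrict_and_smooth(cell, selected)

print(selected)
print([round(v, 3) for v in smoothed])
```

In this sketch a missing gene and a gene measured at zero receive the same smoothed value, which is one simple way to make a model tolerant of genes that are absent from the target dataset.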

Fig. 1
scEvoNet scheme. A scEvoNet takes a list of clusters and a matrix of expression for each sample as input. For each sample, it generates an object with cell type classifiers and top important features for each cluster from the provided set of clusters. In the final step, the tool builds a confusion matrix and a network of genes associated with each cell type. B We use the LGBM algorithm to produce a classifier for each cell type. To smooth the data in order to deal with the batch effects we apply the sigmoid function and use only the top important features to create the final cell type classifiers

Next, scEvoNet uses each cell type model to score cells from both datasets, which yields a confusion matrix of cell type-to-cell type comparisons. In the next step, scEvoNet builds a network in which the nodes are cell types or genes and cell types can only connect to genes. To do this, we first extract the top features (genes) that are important for each cell type (both positively and negatively correlated). Feature importance evaluates the contribution of each feature to the prediction accuracy of the model. It measures how much a feature affects the impurity of the target variable by quantifying the average decrease in impurity caused by splits of the data based on that feature. Once we have identified the important features for each cell type, we combine them into one main network containing all cell types and their related genes. This strategy is similar to GRNBoost2 [22], which outperformed many other tools in a recent benchmarking study [23]. GRNBoost2 generates a gene-gene network in a similar way, whereas scEvoNet extends the approach to all cell types in two datasets. Furthermore, scEvoNet implements a shortest-path search to generate a subnetwork of interest. For example, to study the evolution of a particular cell type, a user might request all the shortest paths (with a selected cut-off on their length) between two cell types, and scEvoNet will yield all the genes and cell types between these two nodes. It is also possible to specify how many closely related cell types should be included in the subnetwork of interest; in this case, the confusion matrix is used as the measure of cell type similarity. Each gene-to-cell type connection has a weight, which is an importance value (a score indicating how useful each feature was in building the boosted decision trees within the model), by which users can filter subnetworks.
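
The bipartite network and the shortest-path query described above can be sketched as follows. This is an illustration under stated assumptions, not the scEvoNet implementation: the package lists networkx as a requirement, whereas this sketch uses only the standard library, and the function names, cell type labels, and importance values are hypothetical.

```python
from collections import deque

def build_network(top_genes):
    """Build an undirected bipartite graph linking cell types to genes.

    top_genes maps a cell type to {gene: importance}. Nodes are tagged
    tuples, ("cell", name) or ("gene", name), so that cell types can only
    connect to genes and names cannot collide.
    """
    graph = {}
    for cell_type, genes in top_genes.items():
        cell_node = ("cell", cell_type)
        graph.setdefault(cell_node, set())
        for gene in genes:
            gene_node = ("gene", gene)
            graph[cell_node].add(gene_node)
            graph.setdefault(gene_node, set()).add(cell_node)
    return graph

def shortest_paths(graph, source, target, max_len=None):
    """Return all shortest paths from source to target (breadth-first search).

    max_len is an optional cut-off on the number of edges in a path.
    """
    dist = {source: 0}
    parents = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                parents[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                parents[nxt].append(node)  # another equally short route
    if target not in dist or (max_len is not None and dist[target] > max_len):
        return []
    paths = []

    def backtrack(node, suffix):
        if node == source:
            paths.append([source] + suffix)
            return
        for parent in parents[node]:
            backtrack(parent, [node] + suffix)

    backtrack(target, [])
    return sorted(paths)

# Toy top-feature lists (hypothetical importance values).
top_genes = {
    "x_neural_crest": {"pax3": 0.9, "sox9": 0.8, "snai2": 0.7},
    "m_neural_crest": {"pax3": 0.85, "sox9": 0.6, "foxd3": 0.5},
    "m_neural_plate": {"sox2": 0.9, "foxd3": 0.2},
}
graph = build_network(top_genes)
source = ("cell", "x_neural_crest")
target = ("cell", "m_neural_crest")
paths = shortest_paths(graph, source, target, max_len=2)
genes_on_paths = sorted({name for path in paths for kind, name in path if kind == "gene"})
print(genes_on_paths)
```

Because the graph is bipartite, every shortest path between two cell types alternates between cell type and gene nodes, so a path of two edges corresponds to a gene directly shared by both cell types, and longer paths pass through intermediate cell types.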


First, we applied scEvoNet to identify core characteristics during the evolution of neural crest (NC) cells using two different vertebrate organisms. The NC is a multipotent and migratory cell population unique to vertebrates and essential notably for the formation of pigment, the peripheral and enteric nervous systems, and craniofacial structures [24]. As input data, we used whole embryo scRNA-seq datasets for Xenopus tropicalis, a non-amniote tetrapod vertebrate, at early developmental stages (late gastrulation stage 12, and neurulation stages 13 (neural plate) and 14 (neural fold)) and for Mus musculus, a mammal, at a similar developmental stage (late gastrulation stage 8.25) [20, 25].

Before using scEvoNet, we tested how well an NC classifier trained on the Xenopus dataset recognizes NC cells in the whole embryo mouse dataset and obtained an AUC score of 0.89 (Fig. 2A, B). The NC classifier accurately identified neural crest cells in a dataset from a different organism with a low rate of false positives. This is significant because each classifier is used to generate a gene program, and the more accurately the cells of a given type are classified, the more accurate the gene program’s signature will be. Next, using published cell type annotations for these two whole embryo scRNA-seq datasets, we ran scEvoNet and obtained the confusion matrix (Fig. 2C). In this dataset of extended complexity, the highest similarity score for frog NC remains the mouse NC. Thereafter, we built a network of cell types and related genes. To identify genes that are highly conserved and cell type-specific in the evolution of the NC, we selected a subnetwork consisting of the shortest paths from Xenopus NC to mouse NC together with the top 3 closest cell types according to the confusion matrix. To obtain a larger subnetwork, number_of_shortest_paths = 300 was used. Subsequently, we determined several groups of genes differently related to cell types within this subnetwork (Fig. 2D). The first group includes genes that link NC and a closely related cell type in only one organism (e.g., NC and neural plate in frog: sox2, sox3 (neural plate markers [26]), snai2 (neural crest marker [27]), hes1, zic1 (neural border markers [20]); NC and midhindbrain in mouse: foxd3, gadd45a (neural crest marker [28]), mdk, ptn (neural crest-derived neuron markers [29])). If this gene expression signature were the consequence of the evolutionary divergence of function, this could be studied using scRNA-seq of the ancestor organisms.
The second group consists of genes that are characteristic of NC both in Xenopus and Mus musculus (pax3, tfap2c, tfap2a, tfap2b, sox9): all are known markers of NC or their progenitors [30]. The third group includes genes that are associated between the frog NC and mouse NC, and shared with the mouse NC-related cell types (as defined with confusion matrix of similarities): mafb, cldn6 for the neural plate; zic3, tfap2a for the midhindbrain. Thus, our tool was not only able to construct a matrix of similar cell types that can be used to study cell types similarities, but also defined three groups of genes that may have diverse roles in a cross-species transformation of the molecular profile of neural crest cells.
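
The AUC used above to evaluate the NC classifier can be computed directly from cell labels and predicted scores. The sketch below uses the rank-based formulation, in which the AUC equals the probability that a randomly chosen positive cell receives a higher score than a randomly chosen negative cell (ties count as one half). The function name and the toy labels and scores are hypothetical and unrelated to the reported value of 0.89.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels (1/0) and scores.

    Pairwise formulation: fraction of (positive, negative) pairs in which
    the positive example has the higher score. Quadratic in the number of
    cells, which is fine for an illustration.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: 1 = annotated NC cell, 0 = any other cell (hypothetical scores).
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.05]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Here 11 of the 12 positive-negative pairs are ranked correctly, so the sketch yields an AUC of 11/12; the single misranked pair is the NC cell scored 0.3 against the non-NC cell scored 0.4.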

Fig. 2
The development of the specific cell type between frog and mouse. A First UMAP represents highlighted annotated neural crest cells in the whole embryo dataset, second UMAP represents predicted neural crest cells with our classifier, and third UMAP represents predicted scores of our classifier (the predicted scores represent the probability of a given data point belonging to the NC class). B The AUC score for the neural crest classifier is 0.89 (it measures the ability of a model to distinguish between positive/negative classes by calculating the area under the ROC curve, which is a plot of the true positive rate against the false positive rate as the decision threshold is varied). C The confusion matrix for mouse and frog samples (prefix x_ is for Xenopus, prefix m_ is for mouse). The values in the confusion matrix are the correlations between two lists of scores for all cell type models. D Selecting the subnetwork of 300 shortest paths from Xenopus neural crest to mouse neural crest shows characteristic genes that are shared with closely-related cell types, such as mafb or cldn6 (group 3) between frog neural crest and mouse neural tube. It also reveals two groups of genes: genes from group 1 are organism-specific genes (frog neural crest and frog neural plate), and genes from group 2 are important genes for the specific cell type (NC) in both organisms (x-neural crest and m-neural crest)

Next, we applied scEvoNet to a human breast cancer metastasis dataset [12]. We selected a patient with available datasets for the primary tumor and the lymph node metastases. We used a standard Scanpy [31] pipeline to obtain clusters for both matrices (primary tumor and metastasis). The expression matrix was preprocessed by applying a threshold of 300 genes and 500 counts to filter out low quality data, and excluding very highly expressed genes from the computation of the normalization factor for each cell. Next, dimensionality reduction was performed using PCA and a neighborhood graph was computed, with the number of principal components selected based on the amount of variance explained by each PC and the number of neighbors specified manually. Finally, Leiden clustering [31] was applied to identify subpopulations of cells based on gene expression patterns. Marker genes from the source paper were used to annotate obtained clusters (Fig. 3A). First, using scEvoNet, we calculated the confusion matrix (Fig. 3B). As expected, we observed high connectivity between cells of the same type from the primary tumor or metastasis in lymph nodes, e.g., B cells, plasma cells, immune cells, macrophages, dendritic cells, and tumor cells. Next, to study cancer cell evolution, we used scEvoNet to discover common cluster-specific genes between the most distant malignant clusters from the primary tumor (cluster p_cancer_cells_cox6+) and metastasis (cluster m_cancer_cells_gapdh+). We found 14 genes that were directly connected to both cell types (Fig. 3C), among them malat1, levels of which inversely correlate with breast cancer progression and metastatic capacity [32], and b2m an important marker involved in carcinogenesis, invasion, and metastasis [33]. Among this list of genes directly connecting two tumor cell types were several mitochondrial genes. 
Although a common hypothesis relates the expression of mitochondrial genes to sample or data processing artifacts, growing evidence supports the importance of mitochondrial genes in cancer metastasis [34]. Additionally, the gene ontology (GO) analysis indicated a connection between these 14 genes and immune responses (Additional file 1). Next, we explored what other genes from close cell types might be involved in tumor evolution. To do so, we selected a subnetwork with all the genes related to cell types of interest and 5 similar cell types according to the confusion matrix obtained earlier. As a result, we determined two cancer cell types (metastatic and primary) that have as a common network neighbor the lymph node B cells, through genes hmgb1 and b2m. Interestingly, it was shown previously that exosomal hmgb1 promotes hepatocellular carcinoma immune evasion by stimulating TIM-1+ regulatory B cell expansion [35]. Also, blockade of the hmgb1 signaling pathway inhibits tumor growth in diffuse large B-cell lymphoma [36]. In another work, b2m specific B cells were defined as the most important prometastatic B cell cluster essentially contributing to distant metastasis in Clear Cell Renal Cell Carcinoma [37]. B2m is also an important element in the immune escape mechanism since a decrease in b2m expression reduces the number of antigens presented on the cell surface, including tumor-related antigens, which has been shown in particular in diffuse large B-cell lymphoma [38]. Thus, scEvoNet here provides a result supported by the literature, suggesting that users can retrieve meaningful gene candidates involved in tumor progression and immune escape in cancer.
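
Retrieving the genes that are directly connected to two clusters, as done above for the primary and metastatic cancer clusters, amounts to intersecting their important-gene lists and optionally filtering by edge weight. The following sketch assumes a simple dictionary representation; the function name, the `min_importance` parameter, and all importance values are hypothetical.

```python
def shared_genes(top_genes, type_a, type_b, min_importance=0.0):
    """Genes linked to both cell types, strongest shared link first.

    Only edges with importance >= min_importance are kept. Genes are
    ranked by the weaker of their two edge weights, then alphabetically.
    """
    edges_a = {g: w for g, w in top_genes[type_a].items() if w >= min_importance}
    edges_b = {g: w for g, w in top_genes[type_b].items() if w >= min_importance}
    common = set(edges_a) & set(edges_b)
    return sorted(common, key=lambda g: (-min(edges_a[g], edges_b[g]), g))

# Toy importance values (hypothetical); cluster names follow the text above.
top_genes = {
    "p_cancer_cells_cox6+": {"malat1": 40.0, "b2m": 25.0, "hmgb1": 5.0, "cox6c": 60.0},
    "m_cancer_cells_gapdh+": {"malat1": 30.0, "b2m": 35.0, "hmgb1": 8.0, "gapdh": 70.0},
}

all_shared = shared_genes(top_genes, "p_cancer_cells_cox6+", "m_cancer_cells_gapdh+")
strong_shared = shared_genes(
    top_genes, "p_cancer_cells_cox6+", "m_cancer_cells_gapdh+", min_importance=10.0
)
print(all_shared)
print(strong_shared)
```

Ranking by the weaker of the two weights is a design choice of this sketch: it favors genes that are important for both clusters over genes that are very important for only one of them.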

Fig. 3
Primary tumor vs metastasis comparison. A UMAPs for the primary human breast cancer (left) and metastasis in the lymph node (right). B The confusion matrix shows different rates of similarity between different clusters of cancer cells in primary tumor and metastasis (the p_ prefix is for primary, and the m_ prefix is for metastasis). The values in the confusion matrix are the correlations between two lists of scores for all cell type models. C Two subnetworks of the relation of the cluster of cancer cells in primary tumor and cancer cells in metastasis. On the left subnetwork, we show only genes related to some other cell types, on the right subnetwork we selected genes that are directly connected to clusters of interest

Our subsequent objective was to replicate and extend the findings of a prior investigation that compared the cell types in the zebrafish habenula between its larval and adult phases [17]. Clustering was performed as in the analysis of the breast cancer dataset above. We identified the same number of neuronal types, 15 in total for each dataset. First, we generated a confusion matrix of cell types and verified the association of comparable clusters as outlined in the original study (Fig. 4A). Overall, our findings aligned with the previously published results, including, for instance, the association of the larval kiss1+ cluster with multiple kiss1+ clusters in the adult sample (Fig. 4A, green stars). Furthermore, we established a link between the cluster of tubb5+ immature neurons and the rpl3+tubb5+ adult neurons, as reported in the original article (Fig. 4A, blue arrows; Fig. 4B, C). Our analysis also confirmed the close proximity of the habenula clusters expressing tac3a in larval and adult zebrafish (1a and 3y clusters, Fig. 4A, red arrows). To sum up, the first step of scEvoNet reproduced the results of a comprehensive comparison of clusters between the larval and adult phases. Moreover, our way of displaying the similarity between cell types compares clusters in two datasets simultaneously, which provides a more comprehensive view and allows the user to assess the similarity between clusters by comparing them to those of the same type or stage within and across datasets.
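
The figure legends describe the values of the confusion matrix as correlations between lists of scores from the cell type models. One plausible reading, sketched below, is that every model scores the same pooled set of cells and the similarity of two cell types is the Pearson correlation of their score lists. This interpretation, the function names, the model labels, and the toy scores are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant scores carry no similarity information
    return cov / math.sqrt(var_x * var_y)

def similarity_matrix(scores):
    """Pairwise correlations between the score lists of all models.

    scores maps a model name to its predicted scores over the same
    ordered list of cells (cells pooled from both datasets).
    """
    names = sorted(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}

# Toy scores of three models over six pooled cells (hypothetical values).
scores = {
    "1y_model": [0.9, 0.8, 0.1, 0.2, 0.85, 0.15],
    "3a_model": [0.8, 0.9, 0.2, 0.1, 0.9, 0.1],
    "6a_model": [0.1, 0.2, 0.9, 0.8, 0.1, 0.7],
}
matrix = similarity_matrix(scores)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Because all models, larval and adult, appear on both axes, such a matrix shows similarities within each dataset as well as across datasets, which is the property highlighted in the paragraph above.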

Fig. 4
Comparison of habenular neuron types between larval and adult phases. A The confusion matrix shows different rates of similarity between different clusters in larval and adult datasets (in the cluster prefix, a denotes adult, e.g., 3a; y denotes larval, e.g., 1y). Red arrows indicate larval and adult tac3a+ clusters, blue arrows indicate larval and adult tubb5+ clusters. Kiss1+ clusters are indicated by *. B The UMAPs display clustering and selected genes for the larval stage. C The UMAPs represent the adult zebrafish dataset. D The graph represents a subnetwork constructed from similar nerve cells in different datasets (3y_tac3a+ and 1a_tac3a+). It displays key genes involved in the transition and neighboring cell types which were selected for transition analysis. * indicates the rgma gene, linking 1a1/3y and 6a1 clusters

Furthermore, analysis of the bipartite network (generated in the second step of scEvoNet) revealed that, for instance, the rgma gene contributes significantly to a subgroup of cells that also co-express tac3a at the larval stage (cluster 3y) and to a cluster in adults that co-expresses gng8 and gng2 (cluster 6a; Fig. 4D). At the same time, expression of rgma decreases in a related 1a cluster in the adult samples (Fig. 4B, C). This change in the expression pattern of rgma suggests that the gene may have been re-used for a new function (or maintained in its original one) during the development of the nervous system across stages (from larval cluster 3y to adult cluster 6a). It is also possible that rgma is involved in multiple processes during nervous system development, and that its expression level and pattern are regulated by different factors at different stages. Further research is necessary to fully understand the role and function of rgma in the development of the nervous system. This example demonstrates the potential of scEvoNet to investigate the dynamics of gene program utilization by exploring how different gene programs are connected across time points.


The evolution of cell types and gene programs is one of the main focuses of developmental biology and is crucial for a better understanding of the origin of particular functions. For the moment, there is a lack of computational tools to address this question using the abundant scRNA-seq data available in public databases. We found only one existing approach, which is not wrapped into a usable framework (e.g., an R/Python package) and has only one application (cell state comparison), so it cannot be used to extract genes that are linked to cell type transitions, such as co-opted genes or genes conservatively important for several cell states [17]. In this manuscript, we present scEvoNet, a method for analyzing the evolution of cell states from highly sparse scRNA-seq data. Our study demonstrates the feasibility of using this method to examine transitions between different species, between stages, and from primary tumors to metastasis. With this tool, we re-discover a canonical gene signature that remains conserved through evolution, and also predict species-specific genes and new candidates associated with similar cell types. Our findings may indicate the co-option of genes or shared programs in closely related cell types. They also suggest the potential use of an immune escape mechanism in breast cancer metastasis, which has previously been shown in another cancer type. Yet, one limitation is that scEvoNet does not match gene sequences and only works with labels provided by the user, which can reduce the number of genes to be found between different cell types in cross-species comparison. scEvoNet has the potential to greatly advance the field of developmental biology by allowing for the study of cell type evolution and gene program switching. This can be achieved through stage-to-stage comparison and tracking the development of certain cell types at different time points.
With scEvoNet, it may be possible to uncover new insights into the origin of particular functions and the mechanisms behind cell type transitions.

The tool is adjustable and can be utilized for an initial screening strategy. It is compatible with AnnData object format used in the Scanpy Python package [31].

Availability of data and materials

We used public data and did not produce sequence data ourselves. Four scRNA-seq datasets were used: 1. Mus musculus whole embryo dataset [25], 2. Xenopus tropicalis whole embryo dataset [20], 3. Human breast cancer dataset [12], 4. Zebrafish larval and adult habenular dataset [17].

Code availability

Project name: scEvoNet
Project home:
Operating system(s): Mac OS, Linux
Programming language: Python 3
Other requirements: lightgbm ≥ 3.0.0, pandas ≥ 0.24, networkx ≥ 2.5
License: MIT license
Any restrictions to use by non-academics: MIT license



scRNA-seq: Single-cell RNA sequencing

LGBM: Light gradient boosting machine

NC: Neural crest


  1. Arendt D. The evolution of cell types in animals: emerging principles from molecular studies. Nat Rev Genet. 2008;9:868–82.

  2. Arendt D, Musser JM, Baker CVH, Bergman A, Cepko C, Erwin DH, et al. The origin and evolution of cell types. Nat Rev Genet. 2016;17:744–57.

  3. Wagner DE, Weinreb C, Collins ZM, Briggs JA, Megason SG, Klein AM. Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo. Science. 2018;360:981–7.

  4. Briggs JA, Weinreb C, Wagner DE, Megason S, Peshkin L, Kirschner MW, et al. The dynamics of gene expression in vertebrate embryogenesis at single-cell resolution. Science. 2018;360:eaar5780.

  5. Cao J, Packer JS, Ramani V, Cusanovich DA, Huynh C, Daza R, et al. Comprehensive single cell transcriptional profiling of a multicellular organism. Science. 2017;357:661–7.

  6. Haber AL, Biton M, Rogel N, Herbst RH, Shekhar K, Smillie C, et al. A single-cell survey of the small intestinal epithelium. Nature. 2017;551:333–9.

  7. Zeisel A, Hochgerner H, Lönnerberg P, Johnsson A, Memic F, van der Zwan J, et al. Molecular architecture of the mouse nervous system. Cell. 2018;174:999-1014.e22.

  8. Marioni JC, Arendt D. How single-cell genomics is changing evolutionary and developmental biology. Annu Rev Cell Dev Biol. 2017;33:537–53.

  9. Stuart T, Satija R. Integrative single-cell analysis. Nat Rev Genet. 2019;20:257–72.

  10. Saunders NA, Simpson F, Thompson EW, Hill MM, Endo-Munoz L, Leggatt G, et al. Role of intratumoural heterogeneity in cancer drug resistance: molecular and clinical perspectives. EMBO Mol Med. 2012;4:675–84.

  11. Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, et al. Eleven grand challenges in single-cell data science. Genome Biol. 2020;21:31.

  12. Xu K, Wang R, Xie H, Hu L, Wang C, Xu J, et al. Single-cell RNA sequencing reveals cell heterogeneity and transcriptome profile of breast cancer lymph node metastasis. Oncogenesis. 2021;10:1–12.

  13. Trapnell C, Cacchiarelli D, Grimsby J, Pokharel P, Li S, Morse M, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat Biotechnol. 2014;32:381–6.

  14. Bergen V, Lange M, Peidli S, Wolf FA, Theis FJ. Generalizing RNA velocity to transient cell states through dynamical modeling. Nat Biotechnol. 2020;38:1408–14.

  15. Hou R, Denisenko E, Ong HT, Ramilowski JA, Forrest ARR. Predicting cell-to-cell communication networks using NATMI. Nat Commun. 2020;11:5011.

  16. Billaud M, Santoro M. Is co-option a prevailing mechanism during cancer progression? Cancer Res. 2011;71:6572–5.

  17. Pandey S, Shekhar K, Regev A, Schier AF. Comprehensive identification and spatial mapping of habenular neuronal types using single-cell RNA-seq. Curr Biol. 2018;28:1052-1065.e7.

  18. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu TY. LightGBM: a highly efficient gradient boosting decision tree. In: Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R, editors. Advances in neural information processing systems. vol 30. Curran Associates, Inc.; 2017.

  19. Stegle O, Teichmann SA, Marioni JC. Computational and analytical challenges in single-cell transcriptomics. Nat Rev Genet. 2015;16:133–45.

  20. Kotov A, Alkobtawi M, Seal S, Kappès V, Ruiz SM, Arbès H, Harland R, Peshkin L, Monsoro-Burq AH. From neural border to migratory stage: a comprehensive single cell roadmap of the timing and regulatory logic driving cranial and vagal neural crest emergence. bioRxiv. 2022.

  21. Qiu P. Embracing the dropouts in single-cell RNA-seq analysis. Nat Commun. 2020;11:1169.

  22. Moerman T, Aibar Santos S, Bravo González-Blas C, Simm J, Moreau Y, Aerts J, et al. GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks. Bioinformatics. 2019;35:2159–61.

  23. Pratapa A, Jalihal AP, Law JN, Bharadwaj A, Murali TM. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nat Methods. 2020;17:147–54.

  24. Seal S, Monsoro-Burq AH. Insights into the early gene regulatory network controlling neural crest and placode fate choices at the neural border. Front Physiol. 2020;11:608812.

  25. Ibarra-Soria X, Jawaid W, Pijuan-Sala B, Ladopoulos V, Scialdone A, Jörg DJ, et al. Defining murine organogenesis at single-cell resolution reveals a role for the leukotriene pathway in regulating blood progenitor formation. Nat Cell Biol. 2018;20:127–34.

  26. Archer TC, Jin J, Casey ES. Interaction of Sox1, Sox2, Sox3 and Oct4 during primary neurogenesis. Dev Biol. 2011;350:429–40.

  27. Nieto MA. A snail tale and the chicken embryo. Int J Dev Biol. 2018;62:121–6.

  28. Kaufmann LT, Niehrs C. Gadd45a and Gadd45g regulate neural development and exit from pluripotency in Xenopus. Mech Dev. 2011;128:401–11.

  29. Cui R, Lwigale P. Expression of the heparin-binding growth factors Midkine and Pleiotrophin during ocular development. Gene Expr Patterns. 2019;32:28–37.

  30. Simões-Costa M, Bronner ME. Establishing neural crest identity: a gene regulatory recipe. Development. 2015;142:242–57.

  31. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19:15.

  32. Kim J, Piao H-L, Kim B-J, Yao F, Han Z, Wang Y, et al. Long noncoding RNA MALAT1 suppresses breast cancer metastasis. Nat Genet. 2018;50:1705–15.

  33. Liu C, Yang Z, Li D, Liu Z, Miao X, Yang L, et al. Overexpression of B2M and loss of ALK7 expression are associated with invasion, metastasis, and poor-prognosis of the pancreatic ductal adenocarcinoma. Cancer Biomark. 2015;15:735–43.

  34. Beadnell TC, Scheid AD, Vivian CJ, Welch DR. Roles of the mitochondrial genetics in cancer metastasis: not to be ignored any longer. Cancer Metastasis Rev. 2018;37:615–32.

  35. Ye L, Zhang Q, Cheng Y, Chen X, Wang G, Shi M, et al. Tumor-derived exosomal HMGB1 fosters hepatocellular carcinoma immune evasion by promoting TIM-1+ regulatory B cell expansion. J Immunother Cancer. 2018;6:145.

  36. Zhang T, Guan X-W, Gribben JG, Liu F-T, Jia L. Blockade of HMGB1 signaling pathway by ethyl pyruvate inhibits tumor growth in diffuse large B-cell lymphoma. Cell Death Dis. 2019;10:1–15.

  37. Yang F, Zhao J, Luo X, Li T, Wang Z, Wei Q, et al. Transcriptome profiling reveals B-lineage cells contribute to the poor prognosis and metastasis of clear cell renal cell carcinoma. Front Oncol. 2021;11:731896.

  38. Challa-Malladi M, Lieu YK, Califano O, Holmes A, Bhagat G, Murty VV, et al. Combined genetic inactivation of Beta2-microglobulin and CD58 reveals frequent escape from immune recognition in diffuse large B-cell lymphoma. Cancer Cell. 2011;20:728–40.


The authors are grateful to Drs. Igor Adameyko and Leon Peshkin for insightful scientific discussions and comments on the manuscript.


This project received funding from the European Union’s Horizon 2020 research and innovation program under Marie Skłodowska-Curie Grant agreement No 860635, NEUcrest ITN (to AHMB); Agence Nationale de la Recherche (ANR-21-CE13-0028; to AHMB) and Institut Universitaire de France (to AHMB); Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

Author information

Authors and Affiliations



AK, AHMB, and AZ conceived the project. AK designed the strategy, performed the work, did the programming, and wrote and edited the manuscript. AZ reviewed the code and edited the manuscript. AHMB supervised the project and edited the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Anne-Helene Monsoro-Burq.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

AZ is employed by EvoTec Company. The remaining authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

The results of the GO term analysis using ShinyGO 0.77 on multiple gene lists obtained with scEvoNet.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Kotov, A., Zinovyev, A. & Monsoro-Burq, AH. scEvoNet: a gradient boosting-based method for prediction of cell state evolution. BMC Bioinformatics 24, 83 (2023).
