EnsInfer: a simple ensemble approach to network inference outperforms any single method
BMC Bioinformatics volume 24, Article number: 114 (2023)
Abstract
This study evaluates both a variety of existing base causal inference methods and a variety of ensemble methods. We show that: (i) base network inference methods vary in their performance across different datasets, so a method that works poorly on one dataset may work well on another; (ii) a non-homogeneous ensemble method in the form of a Naive Bayes classifier leads overall to as good or better results than using the best single base method or any other ensemble method; (iii) for the best results, the ensemble method should integrate all methods that satisfy a statistical test of normality on training data. The resulting ensemble model EnsInfer easily integrates all kinds of RNA-seq data as well as new and existing inference methods. The paper categorizes and reviews state-of-the-art underlying methods, describes the EnsInfer ensemble approach in detail, and presents experimental results. The source code and data used will be made available to the community upon publication.
Introduction
Network inference
A gene regulatory network (GRN) consists of molecular regulators (including DNA segments, messenger RNAs, and transcription factors) in a cell and the causal links between regulators and gene targets. Causality here means that the regulator influences the RNA expression of the gene target. Network inference is the problem of identifying such causal links. In machine learning terms, since the set of regulator genes and target genes are given, the network inference problem can be viewed as a binary classification task to determine whether or not a potential regulatory edge between any pair of regulator and target gene exists.
Because network inference facilitates the understanding of biological systems at the level of molecular interactions, it potentially enables the designed repression or enhancement of groups of molecules. This has applications ranging from drug design and medical treatment to the reduced use of fertilizer in agriculture. Accurate network inference and functional validation are an ongoing challenge for systems biology. Over the past two decades, numerous gene regulatory network inference technologies have been proposed to tackle this problem from different perspectives [1,2,3,4,5,6].
Individual methods feed an ensemble method
Pratapa et al. [7] presented a framework called BEELINE to evaluate state-of-the-art network inference algorithms. The vast majority of the inference algorithms, including the ones we are going to incorporate into the ensemble approach, can be roughly categorized into three types:

1. Pairwise correlation models make use of various kinds of correlations between a target gene’s expression and potentially causal transcription factor expressions. PPCOR [8] computes the partial and semi-partial correlation coefficients for each gene pair. LEAP [9] calculates the Pearson coefficient of each gene pair in a time-series background that considers a time delay in regulatory response. PIDC [10] looks at the distribution of the gene expression and calculates the pairwise mutual information between distributions. SCRIBE [11] also looks at mutual information between gene expression distributions and, like LEAP, considers time-lagged correlation in time-series data. Finally, there is correlation on any set of steady-state data.

2. Tree-based models use random forests (or their close variants) to predict the gene expression of each target gene based on the expression of regulator genes (transcription factors). Such models then use feature importance to determine the weight of each regulator-target interaction. High weights correspond to regulatory edges. Examples include GENIE3 [2], a faster alternative GRNBoost2 [12], and the inference method OutPredict [13]. OutPredict also takes prior information (e.g., binding data) into account during training and test.

3. Ordinary differential equation (ODE)-based regression approaches model the time derivative of the target gene expression as a function of the expression of regulatory genes. Inferelator [1] is a regularized regression model that focuses on feature selection. Its latest iteration, Inferelator 3.0 [14], makes use of single-cell data to learn regulatory networks. SCODE [3] is a direct application of fast ODE-based regression. SINCERITIES [15] utilizes Kolmogorov-Smirnov test-based ridge regression. GRISLI [16] is an ODE solver that accounts for gene expression velocity.
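As a rough illustration of the first category in the list above, the sketch below scores every candidate regulator-target edge by the absolute Pearson correlation of the two expression profiles. It is a deliberately simplified stand-in and not the implementation of PPCOR, LEAP, or any other cited tool; the function name `correlation_edge_scores` and the toy data are assumptions made for this example.

```python
import numpy as np

def correlation_edge_scores(expr, tf_idx):
    """Score each (TF, target) pair by |Pearson r|.

    expr   : array of shape (n_samples, n_genes), one column per gene
    tf_idx : column indices of the regulator genes (hypothetical input format)
    """
    corr = np.corrcoef(expr, rowvar=False)  # gene-by-gene correlation matrix
    scores = {}
    for tf in tf_idx:
        for target in range(expr.shape[1]):
            if target != tf:  # skip self-edges
                scores[(tf, target)] = abs(corr[tf, target])
    return scores

# Toy data: gene 1 follows gene 0 closely, gene 2 is independent noise.
rng = np.random.default_rng(0)
tf0 = rng.normal(size=200)
noise = rng.normal(size=(200, 2))
expr = np.column_stack([tf0, 2.0 * tf0 + 0.1 * noise[:, 0], noise[:, 1]])

scores = correlation_edge_scores(expr, tf_idx=[0])
```

In this toy setting the edge from gene 0 to gene 1 receives a much higher score than the edge from gene 0 to gene 2. A score table of this shape (one confidence value per candidate edge) is the kind of output that every level 1 method contributes to the ensemble.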
The BEELINE benchmark of 12 different inference algorithms showed that while some algorithms generally perform better than others, there is no definitive best solution that can be applied to all datasets. Our approach complements theirs: in addition to studying the performance of individual algorithms (including some promising ones that they did not study), we show that an ensemble method that we call EnsInfer can obtain as good or better results than any single method and improves upon previous ensemble methods [17,18,19]. In vision and language applications, some work, such as [20, 21], uses clustering-based ensembles on large data to create balanced sets, which are then sent to distinct learners. In addition to showing the benefits of combining multiple inference methods, our pipeline also provides a practical combination strategy.
Materials and methods
Underlying network inference algorithms
Here we introduce the inference algorithms we used in this ensemble workflow. Our workflow and the open-source code we provide allow the easy incorporation of new inference algorithms.
Experimental setup: the data
All level 1 network inference algorithms take gene expression level data as input. There are two main sources for these data: synthetic data generated by simulation software with a given regulatory network, or transcriptome-wide RNA sequencing (RNA-seq) data from living organisms. These data can be measured in a temporal manner to constitute time-series data or measured in temporally unrelated discrete states to constitute steady-state data. RNA-seq data can also be classified into two categories: bulk RNA-seq data, which is obtained using all cells inside a sample tissue, and single-cell RNA-seq data, which examines the transcriptome information of a single cell [22]. Details about the gene expression datasets used in our experiments are listed below:

1. Synthetic data from the DREAM3 and DREAM4 in silico challenges consists of ten datasets, each with 100 genes and varying regulatory network structures [23, 24]. The gene expression data was generated by GeneNetWeaver, the software which provided data for the DREAM3 and DREAM4 challenges. Simulation settings were kept as the default DREAM4 challenge settings except that we generated five different time intervals between data points: 10 min, 20 min, 25 min, 50 min (default value), and 100 min. The benefit of using this synthetic data is that the underlying network is precisely known by construction.

2. Bacterial experimental RNA-seq data from B. subtilis (bulk RNA) containing 4218 genes and 239 transcription factors (TFs). The training and testing sets came from a network consisting of 154 TFs and 3144 regulatory edges [25].

3. Plant experimental RNA-seq data (bulk RNA, time-series) from Arabidopsis shoot tissue consisting of 2286 genes and 263 TFs. Both the training and testing sets came from a network consisting of 29 TFs and 4247 regulatory edges [26].

4. Mouse Embryonic Stem Cell (mESC) experimental single-cell RNA-seq data containing 500 genes and 47 TFs. The training and testing sets came from a functional interaction network consisting of 47 TFs and 3226 regulatory edges [27].

5. Human Embryonic Stem Cell (hESC) experimental single-cell RNA-seq data containing 1115 genes and 130 TFs. The training and testing sets came from a ChIP-Seq network consisting of 130 TFs and 3144 regulatory edges [28].
In this work, we have focused on either time-series bulk RNA-seq data or single-cell RNA-seq data for which pseudotime information is available. One reason is that some of the inference algorithms in the BEELINE framework require temporal information as input. The other is the well-known epistemological reason: steady-state data gives simultaneous correlation information, but does not clarify the causal relationship. By contrast, because causation moves forward in time, time-series datasets are more useful for causal network inference.
Ensemble approach
Because one single inference method may not (and, in fact, does not) suit all scenarios, we propose EnsInfer, an ensemble approach to the network inference problem: each individual network inference method works as a first-level learning algorithm that gives a set of predictions from the gene expression input. Then we train a second-level ensemble learning algorithm that combines results from those first-level learners. As first-level inference methods are all different from each other, this forms a heterogeneous stacking ensemble process [29, 30]. The end goal is the binary classification task of determining whether or not a potential regulating edge from transcription factor gene to target gene exists.
Thus, base network inference methods such as GENIE3 or Inferelator will work as Level 1 inference methods and individually predict whether some transcription factor TF regulates some target gene g by giving each possible edge a confidence score. The resulting edge predictions of all the level 1 inference methods can be fed into the second level ensemble learner. Previous ensemble approaches include a voting method [17, 18], but other approaches have been used for other applications: a random forest classifier, or a Naive Bayesian classifier. The pipeline is shown in Fig. 1.
Each level 1 inference method infers regulation based on all the given gene expression data. By contrast, the ensemble learner takes a training set consisting of a randomly chosen subset of regulators from gold standard (normally, experimentally verified present/absent) edges and creates a model whose input is the confidence score output of each level 1 inference method and whose output is a prediction about whether each potential edge is a true regulatory edge or not. One thing to note is that, for the sake of consistency across different methods, we use the confidence scores on all regulatory edges of each level 1 inference method, not just the highly confident edges. This benefits the level 2 ensemble efforts, because all information inferred from level 1 methods is preserved for level 2 models.
The ensemble method uses this model and the outputs of the level 1 inference methods to predict for each transcription factor in the test set, whether a given possible edge leaving that transcription factor corresponds to a true regulatory relation. This process translates well to real world applications, where EnsInfer learns from the known regulatory relations within an organism or tissue structure, and makes predictions for untested transcription factors.
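The two-level process described above can be sketched with scikit-learn's `GaussianNB`. This is a minimal sketch under stated assumptions, not the EnsInfer code itself: the level 1 confidence scores are simulated here (`simulate_scores` is a hypothetical helper), whereas in the real pipeline each column would hold the scores produced by one base inference method, and the training and test rows would come from disjoint sets of transcription factors.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def simulate_scores(n_edges, n_methods=5, edge_rate=0.2):
    """Toy stand-in for level 1 output: one score per method per candidate edge."""
    y = (rng.random(n_edges) < edge_rate).astype(int)  # 1 = gold standard edge
    # True edges get scores shifted upward; each method adds its own noise.
    X = y[:, None] * 1.0 + rng.normal(size=(n_edges, n_methods))
    return X, y

# Rows are candidate edges, columns are level 1 methods.
X_train, y_train = simulate_scores(600)  # edges leaving training TFs
X_test, y_test = simulate_scores(300)    # edges leaving held-out TFs

# Level 2 learner: Gaussian Naive Bayes over the level 1 confidence scores.
level2 = GaussianNB()
level2.fit(X_train, y_train)

# Posterior probability that each test edge exists, used to rank edges.
edge_prob = level2.predict_proba(X_test)[:, 1]
ranked_edges = np.argsort(-edge_prob)  # most confident edges first
```

The ranked list, rather than a hard 0/1 decision, is what a precision-recall evaluation consumes.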
We evaluated eight different models to function as level 2 ensemble models using synthetic data. Those models include: voting [17], logistic regression, logistic regression with stochastic gradient descent (SGD), Naive Bayes with a Gaussian kernel, support vector machines, k-nearest neighbors, random forest, adaptive boost trees, and XGBoost [31]. All models except XGBoost are provided by the scikit-learn Python package [32]. We used a separate DREAM4 dataset with 100 genes to perform hyperparameter tuning for all level 2 ensemble models. For each of the tunable ensemble models, a discrete set of hyperparameter combinations spanned by the common selections of core model parameters was cross-validated on this DREAM4 dataset. For each method, the best performing hyperparameter combination was used for the later level 2 comparison experiments. Details about the hyperparameter grid search and resulting best parameter settings for each model can be found in Additional file 1: Table S1.
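A cross-validated grid search of this kind can be sketched with scikit-learn's `GridSearchCV`. The grid below is hypothetical (the grids actually used are those in Additional file 1: Table S1), the data are simulated, and the choice of three folds and of average precision as the selection score are assumptions for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Simulated level 1 scores: 400 candidate edges, 4 methods.
rng = np.random.default_rng(1)
y = (rng.random(400) < 0.3).astype(int)
X = y[:, None] * 1.0 + rng.normal(size=(400, 4))

# Hypothetical search space for one level 2 model.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="average_precision",  # an AUPRC-style criterion
    cv=3,
)
search.fit(X, y)
best_params = search.best_params_
```

The selected `best_params` would then be fixed for the later comparison experiments.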
We compare the area under the precision-recall curve on the test data of the ensemble learner against that of the level 1 inference methods that have access to the same training data.
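One common estimator of the area under the precision-recall curve is average precision: the mean of the precision values measured at the rank of each true edge. The sketch below implements it in plain Python to make the definition concrete. Whether the paper used this estimator or a trapezoidal area is not stated, so treat this as an assumption; `sklearn.metrics.average_precision_score` offers a library version of the same quantity.

```python
def average_precision(labels, scores):
    """Average precision for binary labels (1 = true edge) and confidence scores."""
    # Rank candidate edges from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    if hits == 0:
        raise ValueError("labels contain no true edges")
    return precision_sum / hits

# Four candidate edges; the true edges are ranked 1st and 3rd.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(labels, scores)  # (1/1 + 2/3) / 2
```

A perfect ranking, with all true edges ahead of all false ones, gives a value of 1.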
Algorithmic workflow of the ensemble approach
All underlying inference algorithms were executed through the BEELINE framework proposed by [7], to which we added OutPredict and Inferelator, which were not included in the original BEELINE package.
The confidence scores of the underlying algorithms for each potential edge in the regulatory network became inputs to the level 2 ensemble model, as illustrated in Fig. 1. To compare the performance of different inference methods, we use the Area Under the Precision-Recall Curve (AUPRC) as the primary metric in all experiments. The reason for choosing AUPRC is that experimentalists can choose a high confidence cutoff to identify the most likely causal transcription factors for a given target gene. A comprehensive summary of the results can be found in Tables 1 and 2 for experiments on the DREAM in silico datasets and Fig. 2 for the three real world species.
For the in silico DREAM datasets, the underlying gold standard priors that define each regulatory network were divided into a 2:1 training/testing split, so there were twice as many regulators in training as in testing. Because the split was done with respect to the regulators, the training and testing sets share no common transcription factors. We believe splitting based on transcription factors is the correct approach, because experimental assays commonly overexpress or repress particular transcription factors. The practical goal is that if a species has some TFs with experimentally validated edges, then edges from untested TFs can be inferred.
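A regulator-wise split of this kind can be sketched as follows. The edge tuple layout `(tf, target, label)` and the function name `split_by_regulator` are assumptions made for this example; the key property is that every edge leaving a given transcription factor lands on the same side of the split.

```python
import random

def split_by_regulator(edges, train_fraction=2 / 3, seed=0):
    """Split gold standard edges so training and test sets share no TFs.

    edges: list of (tf, target, label) tuples (hypothetical layout).
    """
    tfs = sorted({tf for tf, _target, _label in edges})
    random.Random(seed).shuffle(tfs)
    n_train = round(len(tfs) * train_fraction)
    train_tfs = set(tfs[:n_train])
    train = [e for e in edges if e[0] in train_tfs]
    test = [e for e in edges if e[0] not in train_tfs]
    return train, test

# Toy gold standard: 9 TFs, 4 candidate targets each.
edges = [(f"TF{i}", f"gene{j}", (i + j) % 2) for i in range(9) for j in range(4)]
train_edges, test_edges = split_by_regulator(edges)

train_tfs = {e[0] for e in train_edges}
test_tfs = {e[0] for e in test_edges}
```

With nine regulators, six end up in training and three in testing, which matches the 2:1 ratio described above.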
For each dataset, we first applied 11 base-level inference methods to the training data both to determine a promising single method to apply to the test data and as an input to the construction of the ensemble model. Out of the 12 methods included in BEELINE, SINCERITIES, SINGE and SCNS either produced no output or exceeded the time limit of one week for one or more of the datasets. We applied those individual level 1 inference methods (not only the most promising ones from the training data) as well as the level 2 non-homogeneous ensemble models to the test set.
To assess ensemble models, we compared them with one another and with the best level 1 inference methods in both training and testing evaluations. As [17] have pointed out for the DREAM challenge, one simple yet (in DREAM at least) effective way to integrate multiple inference results is to rank potential edges according to their average rank given by all inference methods. We will also include this “community” method as a reference point for our ensemble models.
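The average-rank "community" reference can be sketched as below. This is an illustrative reading of the idea attributed to [17], not their code: each method ranks all candidate edges by its own confidence score, and edges are then ordered by their mean rank. Ties within a method are broken arbitrarily here, which is a simplification.

```python
import numpy as np

def average_rank(score_matrix):
    """Mean rank of each edge across methods (rank 1 = most confident).

    score_matrix: shape (n_edges, n_methods), higher score = more confident.
    """
    scores = np.asarray(score_matrix, dtype=float)
    n_edges, n_methods = scores.shape
    order = np.argsort(-scores, axis=0)  # per method: edge indices, best first
    ranks = np.empty_like(order)
    for j in range(n_methods):
        ranks[order[:, j], j] = np.arange(1, n_edges + 1)
    return ranks.mean(axis=1)

# Three candidate edges scored by two methods.
score_matrix = [
    [0.9, 0.8],  # edge 0: ranked first by both methods
    [0.1, 0.2],  # edge 1: ranked last by both methods
    [0.5, 0.3],  # edge 2: ranked second by both methods
]
mean_ranks = average_rank(score_matrix)
community_order = np.argsort(mean_ranks)  # final ordering, best first
```

Because only ranks are used, this combination needs no training data, which is what makes it a natural baseline for the trained level 2 models.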
The experiments on the DREAM in silico datasets focused on three objectives: (i) for each dataset, how well did the level 1 inference methods that performed best on the training set perform on the test set? (ii) how well did the ensemble learners perform on the test set? (iii) how did the level 1 inference method that performed best on the test set compare to the level 2 ensemble models? Note that the comparison of (iii) is unfair to the ensemble models, because there is no way to know a priori which level 1 inference method will perform best on a given test set, so choosing the best one gives an unfair advantage to the level 1 inference methods.
On the experimental datasets from real world species, similarly, four level 1 methods from BEELINE: GRNVBEM, GRISLI, SINGE and SCNS were not able to produce proper inference results due to time or memory constraints on the larger datasets (e.g. they did not finish after a week), hence were not included in the ensemble approach. We then applied the best performing level 2 ensemble models from the DREAM experiments to the available 10 base level inference methods. Furthermore, we varied the input to the level 2 ensemble models by including or excluding the results from the three most poorly performing level 1 inference methods.
Results
Base inference method performance
On the DREAM datasets, the performance of the algorithms featured in the BEELINE framework is consistent with the original paper [7]. GENIE3, GRNBOOST and PIDC performed the best among the algorithms the BEELINE authors tested. As it happened, the methods we added to the framework (Inferelator and OutPredict) outperformed those methods in many cases. Nevertheless, no individual level 1 inference method dominated the others, as seen in Table 1. We also note that while the best level 1 inference method in training is often the best algorithm in testing, that is not always the case.
Ensemble performance
The Naive Bayes model we used works on the assumption that the likelihood distribution of edge presence is Gaussian-like with respect to any given input’s confidence score. On the DREAM datasets, two of the eleven level 1 inference methods (GRISLI and PIDC) produced outputs that did not have a Gaussian-like distribution (reflected as a negative kurtosis). We therefore experimented with using all level 1 inference methods as input as well as using only the level 1 inference methods whose output distribution has positive kurtosis, to better accommodate the Naive Bayes model. The combined results are presented in Table 2. For most ensemble methods, using all 11 level 1 inference methods versus using 9 does not change the result. However, for Naive Bayes and logistic regression with stochastic gradient descent, eliminating those level 1 inference methods that produce non-Gaussian-like output helps. In fact, Naive Bayes is overall the winner across all tested models and configurations when the input is limited by the positive kurtosis filter. Logistic regression, random forest and adaptive boosting also performed favorably compared to the best performing level 1 inference methods in training as well as compared to the average rank of level 1 inference methods of [17].
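A positive-kurtosis filter along these lines can be sketched as follows. The sketch assumes that "kurtosis" means excess kurtosis (zero for a Gaussian), which is consistent with the text's mention of negative values but is not spelled out; the method names and simulated score distributions are placeholders. `scipy.stats.kurtosis` with its default settings computes the same quantity.

```python
import numpy as np

def excess_kurtosis(x):
    """Fourth standardized moment minus 3, so a Gaussian scores about 0."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m4 = np.mean(centered ** 4)
    return m4 / m2 ** 2 - 3.0

def positive_kurtosis_methods(scores_by_method):
    """Names of the methods whose score distribution has positive excess kurtosis."""
    return [name for name, s in scores_by_method.items() if excess_kurtosis(s) > 0]

rng = np.random.default_rng(0)
scores_by_method = {
    "method_a": rng.laplace(size=5000),  # heavy tails: excess kurtosis near +3
    "method_b": rng.uniform(size=5000),  # flat: excess kurtosis near -1.2
}
kept = positive_kurtosis_methods(scores_by_method)
```

Only the columns for the methods in `kept` would then be passed to the level 2 model.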
Hence, for real world experimental datasets, the four models: logistic regression, Naive Bayes, random forest and adaptive boosting were selected as level 2 ensemble models for analysis (see Fig. 2). Here the likelihood distributions of all results from the level 1 inference methods have a positive kurtosis measure, so all 10 of them were utilized for the level 2 ensemble methods.
Figure 2 shows that:

1. The Naive Bayes approach on inputs having positive kurtosis outperforms the other three ensemble methods, so our system EnsInfer uses Naive Bayes as the default option.

2. Including results from weak learners has a marginal impact (sometimes positive and sometimes negative) on the final ensemble performance. For the sake of simplicity, therefore, EnsInfer includes inputs from all available inference methods having positive kurtosis, even the weak ones.
Figure 3 shows that the Naive Bayes ensemble approach significantly (p-value < 0.05) outperformed the best level 1 method on B. subtilis and Arabidopsis. The ensemble method with all level 1 methods still had an advantage on the mESC data, although the performance gain was less statistically significant, with a p-value of 0.133. To calculate the p-value, we conservatively chose a non-parametric paired resampling approach [33] because we did not want to assume any particular distribution on the data. We used a paired test because we measured the AUPRC gain for each training/test split. (That is, the set of training/testing splits was established randomly and initially. Then, for the numerical experiments, each method used that set.) In the hESC case, the Naive Bayes ensemble method achieved approximately the same level of performance as the best level 1 method on the test set. As noted above, the best level 1 inference method for the test set cannot be known a priori (not even by looking at each method’s performance in training), so using an ensemble method gives high performance without having to know which level 1 inference method is best. The Naive Bayes approach also consistently outperformed the average voting ensemble approach [17].
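One way to carry out a non-parametric paired resampling test is a sign-flip randomization test on the per-split AUPRC gains, sketched below. The exact procedure followed from [33] is not described in the text, so this particular variant, the one-sided alternative, and the numbers in the example are all assumptions.

```python
import random

def paired_resampling_pvalue(gains, n_resamples=10000, seed=0):
    """One-sided p-value for 'the mean paired gain is greater than zero'.

    gains: per-split differences (ensemble AUPRC minus best level 1 AUPRC).
    Under the null hypothesis each difference is equally likely to be
    positive or negative, so its sign is flipped at random.
    """
    rng = random.Random(seed)
    n = len(gains)
    observed = sum(gains) / n
    at_least_as_large = 0
    for _ in range(n_resamples):
        resampled = sum(g if rng.random() < 0.5 else -g for g in gains) / n
        if resampled >= observed - 1e-12:  # tolerance for float rounding
            at_least_as_large += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_large + 1) / (n_resamples + 1)

# Made-up gains over ten training/test splits, all favoring the ensemble.
consistent_gains = [0.05, 0.04, 0.06, 0.03, 0.05, 0.07, 0.04, 0.05, 0.06, 0.05]
p_consistent = paired_resampling_pvalue(consistent_gains)

# Made-up gains with no clear direction.
mixed_gains = [0.05, -0.05, 0.02, -0.02]
p_mixed = paired_resampling_pvalue(mixed_gains)
```

Pairing matters here because every method is evaluated on the same set of splits, so split-to-split difficulty cancels out in the differences.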
EnsInfer: Compared to running a single inference method, the ensemble approach requires computational resources equal to the sum of those needed to run all base inference algorithms, plus the ensemble effort itself. However, all base inference methods can be executed in parallel, so the wall clock time of the level 1 inference process is just the time of the slowest method, which is often also single-threaded. The level 2 ensemble effort itself takes less than 1/10 of the time of the slowest base method, as shown in Additional file 2: Table S2. We can therefore conclude that EnsInfer’s wall clock time is close to that of the slowest base inference method.
Discussion
Consistent with [7], we find that no one inference method is best for all datasets tested in our study. However, a Naive Bayes level 2 ensemble model built from level 1 inference methods having positive kurtosis holds great promise as a general ensemble network inference approach and is thus the basis of EnsInfer. Naive Bayes may work better than more sophisticated Bayesian methods, because at the core of the Bayesian method, we need to estimate the likelihood distribution p(x | e), where x is the score given by a level 1 inference method and e is the existence of an edge. Since this generative process varies dramatically across different datasets and inference methods, the Gaussian assumption used by Naive Bayes is as good as any and keeps the model simple.
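For reference, the standard Gaussian Naive Bayes rule behind this argument can be written out as follows. The notation is introduced here for illustration and is not taken from the paper: x_i is the confidence score from level 1 method i (of m methods), e in {0, 1} indicates whether the edge exists, and mu_{i,e} and sigma_{i,e}^2 are the per-class mean and variance of method i's scores estimated from the training edges.

```latex
P(e \mid x_1, \dots, x_m) \;\propto\; P(e) \prod_{i=1}^{m} p(x_i \mid e),
\qquad
p(x_i \mid e) = \frac{1}{\sqrt{2\pi\sigma_{i,e}^{2}}}
\exp\!\left( -\frac{(x_i - \mu_{i,e})^{2}}{2\sigma_{i,e}^{2}} \right)
```

Only two parameters per method and class need to be estimated, which is one reason the model remains usable when few validated regulators are available.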
Please note, however, that there are cases when Naive Bayes does not improve on the best single inference method. This happens primarily when the results are little better than random. For example, the regulatory networks inferred from the single-cell human embryonic stem cell data from [7] were barely better than random using any base method in BEELINE. The ensemble does not improve that.
Naive Bayes works particularly well in a sparse data environment, which is often the case when experimental data is hard to come by. For example, there are only 29 experimentally validated transcription factors for Arabidopsis and 154 for B. subtilis. Another point in favor of Naive Bayes is that the size of the feature space (the number of outputs of the level 1 inference methods) is small. If the training dataset and feature space were larger, random forest-based approaches might do better. Our current investigation used roughly a dozen level 1 inference methods. Other promising new inference methods could be added, such as BiXGBoost and DeepSEM [4, 5]. A level 2 ensemble method might potentially require a feature selection step if many more inference algorithms were included.
Conclusion
The main overall benefit of the ensemble method EnsInfer is its robust and flexible nature. Instead of picking a network inference method and hoping that it will perform well on a dataset, EnsInfer uses a combination of state-of-the-art inference approaches and combines them using a simple Naive Bayes ensemble model. Because the ensemble approach essentially turns all the predictions from different inference algorithms into priors about each edge in the network, EnsInfer easily allows the integration of diverse kinds of data (e.g. bulk RNA-seq, single-cell RNA-seq) as well as new inference methods.
Availability of data and materials
All experimental data and source code for the ensemble process can be found at our GitHub repository: https://github.com/IcyFermion/network_inference_ensemble
References
1. Bonneau R, Reiss DJ, Shannon P, Facciotti M, Hood L, Baliga NS, Thorsson V. The inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo. Genome Biol. 2006;7(5):1–16.
2. Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. Inferring regulatory networks from expression data using tree-based methods. PLoS ONE. 2010;5(9):12776.
3. Matsumoto H, Kiryu H, Furusawa C, Ko MS, Ko SB, Gouda N, Hayashi T, Nikaido I. Scode: an efficient regulatory network inference algorithm from single-cell rna-seq during differentiation. Bioinformatics. 2017;33(15):2314–21.
4. Zheng R, Li M, Chen X, Wu FX, Pan Y, Wang J. Bixgboost: a scalable, flexible boosting-based method for reconstructing gene regulatory networks. Bioinformatics. 2019;35(11):1893–900.
5. Shu H, Zhou J, Lian Q, Li H, Zhao D, Zeng J, Ma J. Modeling gene regulatory networks using neural network architectures. Nat Comput Sci. 2021;1(7):491–501.
6. Zhao M, He W, Tang J, Zou Q, Guo F. A comprehensive overview and critical evaluation of gene regulatory network inference technologies. Brief Bioinform. 2021;22(5):009.
7. Pratapa A, Jalihal AP, Law JN, Bharadwaj A, Murali T. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nat Methods. 2020;17(2):147–54.
8. Kim S. ppcor: an r package for a fast calculation to semi-partial correlation coefficients. Commun Stat Appl Methods. 2015;22(6):665.
9. Specht AT, Li J. Leap: constructing gene co-expression networks for single-cell rna-sequencing data using pseudotime ordering. Bioinformatics. 2017;33(5):764–6.
10. Chan TE, Stumpf MP, Babtie AC. Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst. 2017;5(3):251–67.
11. Qiu X, Rahimzamani A, Wang L, Mao Q, Durham T, McFaline-Figueroa JL, Saunders L, Trapnell C, Kannan S. Towards inferring causal gene regulatory networks from single cell expression measurements. BioRxiv. 2018;426981.
12. Moerman T, Aibar Santos S, Bravo González-Blas C, Simm J, Moreau Y, Aerts J, Aerts S. Grnboost2 and arboreto: efficient and scalable inference of gene regulatory networks. Bioinformatics. 2019;35(12):2159–61.
13. Cirrone J, Brooks MD, Bonneau R, Coruzzi GM, Shasha DE. Outpredict: multiple datasets can improve prediction of expression and inference of causality. Sci Rep. 2020;10(1):1–9.
14. Gibbs CS, Jackson CA, Saldi GA, Shah A, Tjärnberg A, Watters A, De Veaux N, Tchourine K, Yi R, Hamamsy T, et al. Single-cell gene regulatory network inference at scale: the inferelator 3.0. BioRxiv. 2021.
15. Papili Gao N, Ud-Dean SM, Gandrillon O, Gunawan R. Sincerities: inferring gene regulatory networks from time-stamped single cell transcriptional expression profiles. Bioinformatics. 2018;34(2):258–66.
16. Aubin-Frankowski PC, Vert JP. Gene regulation inference from single-cell rna-seq data with linear differential equations and velocity inference. Bioinformatics. 2020;36(18):4774–80.
17. Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, Kellis M, Collins JJ, Stolovitzky G. Wisdom of crowds for robust gene network inference. Nat Methods. 2012;9(8):796–804.
18. Hill SM, Heiser LM, Cokelaer T, Unger M, Nesser NK, Carlin DE, Zhang Y, Sokolov A, Paull EO, Wong CK. Inferring causal molecular networks: empirical assessment through a community-based effort. Nat Methods. 2016;13(4):310–8.
19. Saint-Antoine MM, Singh A. Network inference in systems biology: recent developments, challenges, and applications. Curr Opin Biotechnol. 2020;63:89–98.
20. Jan Z, Verma B. Multiple strong and balanced cluster-based ensemble of deep learners. Pattern Recogn. 2020;107:107420.
21. Shahabadi MSE, Tabrizchi H, Rafsanjani MK, Gupta B, Palmieri F. A combination of clustering-based undersampling with ensemble methods for solving imbalanced class problem in intelligent systems. Technol Forecast Soc Chang. 2021;169:120796.
22. Stark R, Grzelak M, Hadfield J. Rna sequencing: the teenage years. Nat Rev Genet. 2019;20(11):631–56.
23. Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, Clarke ND, Altan-Bonnet G, Stolovitzky G. Towards a rigorous assessment of systems biology models: the dream3 challenges. PLoS ONE. 2010;5(2):9202.
24. Schaffter T, Marbach D, Floreano D. Genenetweaver: in silico benchmark generation and performance profiling of network inference methods. Bioinformatics. 2011;27(16):2263–70.
25. Arrieta-Ortiz ML, Hafemeister C, Bate AR, Chu T, Greenfield A, Shuster B, Barry SN, Gallitto M, Liu B, Kacmarczyk T. An experimentally supported model of the bacillus subtilis global transcriptional regulatory network. Mol Syst Biol. 2015;11(11):839.
26. Varala K, Marshall-Colón A, Cirrone J, Brooks MD, Pasquino AV, Léran S, Mittal S, Rock TM, Edwards MB, Kim GJ. Temporal transcriptional logic of dynamic regulatory networks underlying nitrogen signaling and use in plants. Proc Natl Acad Sci. 2018;115(25):6494–9.
27. Hayashi T, Ozaki H, Sasagawa Y, Umeda M, Danno H, Nikaido I. Single-cell full-length total rna sequencing uncovers dynamics of recursive splicing and enhancer rnas. Nat Commun. 2018;9(1):1–16.
28. Chu LF, Leng N, Zhang J, Hou Z, Mamott D, Vereide DT, Choi J, Kendziorski C, Stewart R, Thomson JA. Single-cell rna-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm. Genome Biol. 2016;17(1):1–20.
29. Wolpert DH. Stacked generalization. Neural Netw. 1992;5(2):241–59.
30. Aburomman AA, Reaz MBI. A survey of intrusion detection systems based on ensemble and hybrid classifiers. Comput Secur. 2017;65:135–52.
31. Chen T, Guestrin C. Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 2016. pp. 785–794.
32. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.
33. Shasha D, Wilson M. Statistics is easy! Synth Lect Math Stat. 2010;3(1):1–174.
Acknowledgements
We would like to acknowledge the suggestions and help of Jacopo Cirrone and Ji Huang.
Funding
This work was supported by the U.S. National Institutes of Health 1R01GM12175301A1; the U.S. National Science Foundation under Grants MCB1412232, IOS1339362, and MCB0929338; and by NYU WIRELESS. That support is greatly appreciated.
Author information
Contributions
BS and DS wrote the main manuscript text and GC helped with revisions in the biology part of the manuscript. BS carried out all the experiments and analysis for the paper. All figures were prepared by BS. All authors reviewed the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1: Table S1.
Parameter search space for all ensemble methods used in our experiments, and the optimal parameters obtained from tuning on a DREAM dataset.
Additional file 2: Table S2.
Execution time of various inference algorithms and ensemble methods used in this research. Time was measured for the mESC, hESC, Arabidopsis and B. subtilis datasets, on an Ubuntu 20.04 system with an AMD Ryzen™ 9 5900X CPU.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Shen, B., Coruzzi, G. & Shasha, D. EnsInfer: a simple ensemble approach to network inference outperforms any single method. BMC Bioinformatics 24, 114 (2023). https://doi.org/10.1186/s12859-023-05231-1
DOI: https://doi.org/10.1186/s12859-023-05231-1
Keywords
 Gene regulatory networks
 Machine learning
 Transcriptional regulation
 Non homogeneous ensemble