3off2: A network reconstruction algorithm based on 2-point and 3-point information statistics

Background The reconstruction of reliable graphical models from observational data is important in bioinformatics and other computational fields applying network reconstruction methods to large, yet finite datasets. The main network reconstruction approaches are either based on Bayesian scores, which enable the ranking of alternative Bayesian networks, or rely on the identification of structural independencies, which correspond to missing edges in the underlying network. Bayesian inference methods typically require heuristic search strategies, such as hill-climbing algorithms, to sample the super-exponential space of possible networks. By contrast, constraint-based methods, such as the PC and IC algorithms, are expected to run in polynomial time on sparse underlying graphs, provided that a correct list of conditional independencies is available. Yet, in practice, conditional independencies need to be ascertained from the available observational data, based on adjustable statistical significance levels, and are not robust to sampling noise from finite datasets. Results We propose a more robust approach to reconstruct graphical models from finite datasets. It combines constraint-based and Bayesian approaches to infer structural independencies based on the ranking of their most likely contributing nodes. In a nutshell, this local optimization scheme and corresponding 3off2 algorithm iteratively “take off” the most likely conditional 3-point information from the 2-point (mutual) information between each pair of nodes. Conditional independencies are thus derived by progressively collecting the most significant indirect contributions to all pairwise mutual information. The resulting network skeleton is then partially directed by orienting and propagating edge directions, based on the sign and magnitude of the conditional 3-point information of unshielded triples. The approach is shown to outperform both constraint-based and Bayesian inference methods on a range of benchmark networks. The 3off2 approach is then applied to the reconstruction of the hematopoiesis regulation network based on recent single cell expression data and is found to retrieve more experimentally ascertained regulations between transcription factors than with other available methods. Conclusions The novel information-theoretic approach and corresponding 3off2 algorithm combine constraint-based and Bayesian inference methods to reliably reconstruct graphical models, despite inherent sampling noise in finite datasets. In particular, experimentally verified interactions as well as novel predicted regulations are established on the hematopoiesis regulatory networks based on single cell expression data. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0856-x) contains supplementary material, which is available to authorized users.

• the PC algorithm [1] with stable skeleton implementation and majority rule orientations [2], as implemented in the pcalg package [3,4] • the Bayesian inference method using the hill-climbing heuristics implemented in the bnlearn package [5] • Aracne [6], an information-based inference approach, which iteratively prunes links with the weakest mutual information based on the Data Processing Inequality. We have used the Aracne implementation of the minet package [7] • the MMHC algorithm: an hybrid approach [8] combining constraint-based and Bayesian approaches by first identifying both parents and children of each node of the underlying graphical model and then performing a greedy Bayesian hill-climbing search restricted to the identified parents and children of each node. The MMHC method is implemented in the bnlearn package [5].
Comparisons are made between the reconstructed network (or its CPDAG for Bayesian and MMHC methods) and the CPDAG of the benchmark network (filled line). T P with incorrect orientation or incorrect non-orientation are counted as F P edges (see main text). In addition, the skeleton of the reconstructed network is also compared to the skeleton of the benchmark network (dashed line). Note that Aracne only predicts network skeletons. For sample sizes from N = 10 to 50, 000 data points, the methods have been tested on 50 replicates and the Figures S1-S26 give the average results over these multiple replicates. The five following benchmark networks have been considered (refer to the bnlearn package [5] for more details on these networks): HEPAR II 70 nodes, 123 links, 1,453 parameters, average degree 3.51, maximum in-degree 6 1 Evaluation of the PC method by significance level In this section, the results of the PC inference method [1], as implemented in the pcalg package [3,4], are evaluated using the following parameter values: • a significance level α = 0.001 (PC 1e-03) • a significance level α = 0.01 (PC 1e-02) • a significance level α = 0.1 (PC 1e-01) In particular, the stable implementation of the PC algorithm has been used, as well as the majority rule for the orientation and propagation steps [2]. As shown in Figures S1-S5, the PC inference method typically requires the significance level to be adjusted to larger values (α = 0.1) at small sample sizes and to smaller values (α = 0.01 or α = 0.001) on larger datasets to improve the CPDAG F-scores.

Evaluation of the Aracne reconstruction method
In this section, the results of the Aracne inference method, as implemented in the minet package [7], are evaluated using the following parameter values: • a threshold for the minimum difference in mutual information set to = 1/N (Aracne Eps 1/N) • a threshold for the minimum difference in mutual information set to = 0 (Aracne Eps 0) As shown in Figures S6-S10, small positive values for the threshold parameters for minimum difference in mutual information typically worsen F-scores.

Evaluation of the Bayesian methods by score
In this section, the results of the Bayesian inference method using a hill-climbing (HC) heuristics with 100 random restarts [9], as implemented in the bnlearn package [5], are evaluated using the following parameter values: • Bayesian Dirichlet equivalent (BDe) score (HC BDe) • Akaike Information Criteria (AIC) score (HC AIC) • Bayesian Information Criteria (BIC) score (HC BIC) As shown in Figures S11-S15, depending on the underlying causal network, one has to choose the most suitable score for the hill-climbing heuristic approach to output its best reconstruction. The AIC score should be preferred for INSURANCE ( Figure S15), the BIC score for the ALARM and HEPAR II (Figures S11 & S14) and the BDe score for the CHILD and BARLEY networks (Figures S13 & S12).  As discussed in the Methods section (see main text), Figures S16-S20 show that the best results on benchmark networks are obtained with the NML score. The MDL score leads to equivalent results, as expected, in the limit of very large datasets (see Appendix). However, with smaller datasets, the most reliable results with the MDL score are obtained using non-shifted instead of shifted 2-point and 3-point information terms in the 3off2 rank of individual edges, Eq. 21. This is because the MDL complexity tends to underestimate the importance of edges between nodes with many levels (see Appendix). For finite datasets, it easily leads to spurious conditional independencies, I (x; y|{ui}) < 0, when using shifted 2-point and 3-point information, Eq. 22, whereas using non-shifted information in the 3off2 ranks (Eq. 21) tends to limit the number of false negatives as early errors in {u i } can only increase I(x; y|{ui}) 0, in the end, in Eq. 23.

Evaluation of 3off2 against Bayesian and MMHC methods
In this section, the results of the 3off2 inference approach are evaluated against the Bayesian inference and the MMHC [8] methods using the following parameter values: • HC BIC: Bayesian inference using Bayesian Information Criteria (BIC) score and hill-climbing (HC) heuristics with 100 random restarts [9] as implemented in the bnlearn package [5] • MMHC BIC: Hybrid approach [8] combining constraint-based and Bayesian approaches by first identifying both parents and children of each node of the underlying graphical model and then performing a greedy Bayesian hill-climbing search restricted to the identified parents and children of each node. We have used the BIC criteria with significance parameter α = 0.1 (using BDe criteria and the range α = 0.001−0.1 gives very similar results for all tested benchmarks, not shown). The MMHC method is implemented in the bnlearn package [5].
• 3off2 MDL(rank I): 3off2 inference approach using Minimum Description Length (MDL) criteria and non-shifted 2-point and 3-point information terms in the rank of individual edges As discussed in Methods section (see main article), the MDL complexity using shifted 2-point and 3-point information terms leads the 3off2 inference approach to cumulate false negative edges with smaller datasets. Yet, the 3off2 algorithm gives good results when using the MDL criteria with non-shifted 2-point and 3-point information terms in the rank of individual edges. As shown in Figures S21-S25, CPDAG F-scores of the 3off2 reconstruction method are typically better or comparable to the Bayesian hill-climbing heuristics using the BIC score, except for the CHILD benchmark network, although 3off2 eventually reaches better results than the Bayesian inference approach for very large datasets (N > 20, 000).  • PC 1e-01: PC inference method [1] implemented in the pcalg package [3,4] with a significance level α = 0.1 • HC BIC: Bayesian inference using Bayesian Information Criteria (BIC) score and hill-climbing (HC) heuristics with 100 random restarts [9] implemented in the bnlearn package [5] • MMHC: Hybrid approach [8] combining constraint-based and Bayesian approaches by first identifying both parents and children of each node of the underlying graphical model and then performing a greedy Bayesian hill-climbing search restricted to the identified parents and children of each node. We have used the BIC criteria with significance parameter α = 0.1 (using BDe criteria and the range α = 0.001 − 0.1 gives very similar results for all tested benchmarks, not shown). The MMHC method is implemented in the bnlearn package [5].
As shown in Figure S26, the execution times for the 3off2 reconstruction method follow typically similar trends (although shifted by a roughly constant factor, ×5-10) as Bayesian hill-climbing heuristics or the fast MMHC hybrid method. By contrast, the PC algorithm, which is quite fast for small datasets, becomes significantly slower for larger datasets.   Table S1: Interactions reconstructed by 3off2 and four alternative methods on the subnetwork of 11 known regulatory interactions in hematopoiesis ( Figure 7). These calculations were made assuming that the 11 experimentally proven interactions correspond to the true regulatory network (in practice, we expect that some of the inferred edges counted as false positives might in fact turn out to be correctly predicted links). u indicates that the calculations were made on the undirected network, whereas d means directed network. The "Distance" measure corresponds to the number of "errors" (including orientations for the directed network comparison) over the total number of possible edges, i.e. Distance=2(F P + F N )/n(n − 1), where n = 10 is the number of nodes in the subnetwork including the 11 known regulatory interactions ( Figure 7).