To comprehensively evaluate the different methods, we simulated co-expression data from four real protein-protein interaction networks (Fig. 2) used in Allen et al. [5], which were selected from Keshava Prasad et al. [20]. A download link for the four real network structures is given in the Availability of data and materials section. Details of the generative model are discussed below. We investigated the performance of each method on data with various noise levels and sample sizes.

### Generative model

We simulated expression data from a Gaussian graphical model, the class of models most commonly used to infer gene association networks. Let **y**_{i}=(*y*_{i1},…,*y*_{ij},…,*y*_{ip}) denote the expression levels of the *p* genes observed in sample *i*. Each **y**_{i} was drawn independently from a zero-mean multivariate normal distribution **y**_{i}∼MN(**0**_{p}, *Σ*+*ε*^{2}**I**_{p×p}), where **0**_{p} denotes the *p*-dimensional zero vector and **I**_{p×p} denotes the *p*-by-*p* identity matrix. For the covariance matrix *Σ*, we generated its concentration matrix *Ω*=*Σ*^{−1} following Peng et al. [13]. The initial matrix *Ω* was created by setting

$${\begin{aligned} \omega_{jh}=\begin{cases} 1, & j = h \\ 0, & j \neq h,\ j \nsim h \\ 0.5\,\text{Uniform}(-1,-0.5)+0.5\,\text{Uniform}(0.5,1), & j \neq h,\ j \sim h, \end{cases} \end{aligned}} $$

where Uniform(*a*,*b*) represents the uniform distribution on the interval (*a*,*b*), *j*∼*h* indicates that there is an edge between genes *j* and *h*, and \(j\nsim h \) indicates otherwise. The network structure was chosen from one of the four real protein-protein interaction networks [20, 21], each of which is approximately scale-free (see Fig. 2). The non-zero elements of *Ω* were then rescaled to ensure positive definiteness. Specifically, for each row, we first summed the absolute values of the off-diagonal elements and divided each off-diagonal entry by 1.5 times this sum. Next, we averaged the rescaled matrix with its transpose to restore symmetry, and reset to 0.1 any non-zero entry whose absolute value was smaller than 0.1. The inverse of the final matrix was denoted **A**=*Ω*^{−1}, and each element of the covariance matrix *Σ* was obtained by standardizing **A** to unit diagonal, \(\sigma _{jh}=a_{jh}/\sqrt {a_{jj}a_{hh}}\). For the noise level *ε*, we considered three cases: *ε*=0, 0.1, and 0.5.
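The construction above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the adjacency-matrix input format, and the choice to preserve signs when resetting small entries to magnitude 0.1 are our assumptions.

```python
import numpy as np

def simulate_expression(adj, n_samples, noise_sd=0.1, seed=0):
    """Sketch of the generative model: build a concentration matrix
    Omega on a given network, rescale it toward positive definiteness,
    and sample from N(0, Sigma + noise_sd^2 * I).

    `adj` is a symmetric 0/1 adjacency matrix (an assumed input
    format); `noise_sd` plays the role of epsilon in the text.
    """
    rng = np.random.default_rng(seed)
    p = adj.shape[0]

    # Initial Omega: 1 on the diagonal; each edge gets a draw from an
    # equal mixture of Uniform(-1, -0.5) and Uniform(0.5, 1).
    omega = np.eye(p)
    for j, h in zip(*np.triu_indices(p, k=1)):
        if adj[j, h]:
            val = rng.choice([-1.0, 1.0]) * rng.uniform(0.5, 1.0)
            omega[j, h] = omega[h, j] = val

    # Rescale: divide each row's off-diagonal entries by 1.5 times the
    # sum of their absolute values, then symmetrize by averaging with
    # the transpose.
    off = omega - np.eye(p)
    row_sums = np.abs(off).sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # guard isolated genes
    scaled = off / (1.5 * row_sums)
    scaled = (scaled + scaled.T) / 2.0
    # Push small non-zero entries up to magnitude 0.1 (sign kept,
    # which is our assumption).
    small = (scaled != 0) & (np.abs(scaled) < 0.1)
    scaled[small] = 0.1 * np.sign(scaled[small])
    omega = scaled + np.eye(p)

    # Sigma: invert Omega and standardize A to unit diagonal.
    a = np.linalg.inv(omega)
    d = np.sqrt(np.diag(a))
    sigma = a / np.outer(d, d)

    cov = sigma + noise_sd**2 * np.eye(p)
    return rng.multivariate_normal(np.zeros(p), cov, size=n_samples)
```

The row rescaling makes the matrix strongly diagonally dominant, which is what keeps the final *Ω* positive definite for sparse networks.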

### Performance metric

We evaluated each method by plotting its receiver operating characteristic (ROC) curve and calculating the area under the ROC curve (AUC). As different methods generate different outputs, we used the corresponding approach for each to plot its ROC curve, allowing a fair comparison. GeneNet and BayesianGLASSO yield a continuous estimate of each partial correlation *ρ*_{jh} and require no tuning parameter; an edge between genes *j* and *h* was therefore declared if the absolute value of *ρ*_{jh} exceeded a given threshold, and the ROC curve was obtained by plotting true positive rates (TPRs) against false positive rates (FPRs) over a range of thresholds. For the mutual information-based methods, we chose the tuning parameter *α*=0.03 as suggested by the authors [17, 18]; an edge between genes *j* and *h* was declared if the estimated entropy exceeded a threshold, and the ROC curve was obtained in the same way. Note that we only included PCACMI in the simulation, since CMI2NI produced the same results as PCACMI. For the remaining methods, which require a tuning parameter, the ROC curves were obtained by plotting TPRs against FPRs over different choices of the tuning parameter.
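The threshold-swept ROC and AUC computation described above can be sketched as follows. The flattened per-pair score/label format and the function names are our own, for illustration only:

```python
import numpy as np

def edge_roc(scores, truth):
    """ROC points for network edge recovery.

    `scores` holds one continuous value per gene pair (e.g. an
    estimated partial correlation or entropy); `truth` marks which
    pairs are true edges.  Declaring an edge whenever |score| exceeds
    a threshold, and sweeping the threshold from high to low, traces
    out the ROC curve.
    """
    s = np.abs(np.asarray(scores, dtype=float))
    t = np.asarray(truth, dtype=bool)
    order = np.argsort(-s)              # largest scores first
    tp = np.cumsum(t[order])            # true positives at each cut
    fp = np.cumsum(~t[order])           # false positives at each cut
    tpr = np.concatenate(([0.0], tp / max(t.sum(), 1)))
    fpr = np.concatenate(([0.0], fp / max((~t).sum(), 1)))
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoid rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
```

A scorer that ranks every true edge above every non-edge attains AUC 1; one that inverts the ranking attains AUC 0.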

### Result summarization

As shown in the simulation results (Additional file 4: Figures S1-S8), BayesianGLASSO and ENA generally outperform the other methods, consistent with the literature [6, 16]. The mutual information-based methods are also competitive. NS, GLASSO, and GLASSO-SF, which share the same underlying strategy, achieve similar accuracy. GeneNet, the earliest of the methods, lags significantly behind. Not surprisingly, all methods lose power as the noise level increases or the sample size decreases.

We also logged the computational time of each method in Table S2 (Additional file 5). The Bayesian method consumed several orders of magnitude more time than the others and became impractical once the number of genes in the network reached the hundreds. Most other methods showed similar efficiency across the simulation settings, with the mutual information-based methods being slightly slower.