### Software functionality

NEArender possesses both core and ancillary functions for network enrichment analysis. It is implemented as an R package and described in details, beyond the official manual, in an R vignette https://cran.r-project.org/web/packages/NEArender/vignettes/NEArender_vignette.pdf

The input shall contain three components: 1) one or multiple user-defined (experimental or theoretical) gene sets which have to be functionally characterized, called altered gene sets (AGS); 2) a collection of functional gene sets (FGS) which would enable functional characterization through their known functions, and 3) a global network of functional coupling (NET). Input can be provided as either text files or pre-processed R lists and matrices. Figure 1 presents the relationship between the major components and steps of NEArender.

In fact, although we denote input in terms of “genes”, a range of functional nodes of a biological network can be analyzed with this algorithm, such as protein molecules, genomic regions that encode proteins, microRNAs, promoters, and enhancers etc. Nodes listed in AGS, FGS, and NET should employ the same ID format. The package performs network enrichment analysis through the fast binomial test as described in Methods and quantifies enrichment in each AGS-FGS pair with a number of statistics: chi-squared score, z-score, *p*-value, q-value (the *p*-value adjusted for multiple testing, a.k.a. false discovery rate [4], number of network edges that exist between any nodes of AGS and FGS (but neither within AGS nor within FGS), and respective number of AGS-FGS edges expected by chance, calculated with the binomial formula.

In order to run enrichment analysis, the user should prepare the input components. Since the AGS is the most dynamic and user-specific part of the input, functionality for AGS compilation and processing is most developed. First, AGSs can be prepared in advance externally, e.g. as a list of genomic variations reported in a given genome. Alternatively, the package functions can create AGSs from sample columns of an R matrix. This can be done with a number of different algorithms (two of these we use below in the section ‘Robustness of results obtained from different replicates’. Then the number of genes included in each AGS would be either data-driven (all significant genes) or pre-defined by user (top ranking ones regardless of significance). In addition, the algorithm ‘toprandom’ generates random AGSs of a user-set size. Finally, a special function allows direct creation of AGSs as full sample-specific sets of mutated genes. FGSs are usually imported from a file by listing all members of each functional set. Due to the high network density and, hence, statistical power of NEA, there is a special option, which is not available in GSEA: single genes can be treated as FGS as well. A full list of such single-gene FGSs can be created from all network nodes of NET with parameter as_genes_fgs, so that each FGS item in the output list contains just one gene. It is practical to use a large pre-compiled FGS collection, such as all ontology terms, pathways, or a union of resources, e.g. MSigDB database [5]. Alternatively, users can create custom single- or multi-gene FGS collections of their own. We note however that using relatively few FGSs and/or AGSs in one analysis (so that the total number of AGS-FGS tests is below a few hundreds) would not allow estimating the q-values properly.

While higher order topological biases are, as discussed above, of arguable importance for NEA, another network feature is vital for the enrichment evaluation used in this package. Namely, the parametric algorithm would produce unbiased estimates only in scale-free networks, i.e. where node connectivity values follow the power law distribution [6]. Being an almost ubiquitous feature in the full scale biological networks, this is still not the case in networks that are artificially constructed from e.g. ChIP-seq based collections of transcription factor binding events [7] or from computationally predicted microRNA-transcript targeting data [8]. If such network components are desirable, we recommend employing software that involves network permutation tests, i.e. the randomization, such as in [2, 3]. Therefore, the package enables evaluating network topology for scale-freeness and second-order dependencies (described below) as well as benchmarking alternative NETs using either standard or custom FGSs, as described in [9]. Briefly, the benchmark consists of as many test cases as there are FGS members in total (multiple occurrences of the same genes in different FGS are treated separately). For each such gene, the procedure tests the null hypothesis of the gene not being an FGS member. The true positive or false negative result is assigned if the gene receives an NEA score above or below a certain threshold, respectively. In parallel, randomly picked genes with close node degree values are tested against the same FGS to estimate specificity via the false positive versus true negative ratio. The counts of alternative test outcomes TP, TN, FP, and FN at variable NEA thresholds are used for plotting ROC curves.

### Comparison to randomization based algorithm

We compared the *p*-value distributions between the two methods (Fig. 2). The both were capable of adjusting *p*-values for multiple testing, but we omit the adjusted value analysis here since our GS test set of 330 FGS (mostly KEGG pathways with addition of GO terms and other sets related to cancer, cytokine signaling, inter-cellular communications etc. [10] abounded with highly functionally similar GS pairs and thus the fraction of significant q-values was ~30%. We instead focus on the *p*-values in order not to miss important distribution details. It is apparent that the estimates from network randomization become more consistent with the growing number of randomization runs, from *N* = 3 to *N* = 300. The third column of Fig. 2 displays strong and asymptotically increasing correlation between *p*-values from the network randomization z-test (NRZ) and chi-squared binomial formula (CSB) (Spearman *r* = {0.83; 0.94; 0.97; 0.98; 0.99} for *N* = {3; 10; 30; 100; 300}, respectively). Since CSB employed a deterministic procedure, we assume that all dispersion of *p*-value points in the NRZ-CSB space should be attributed to sampling errors by the stochastic NRZ algorithm. CSB and NRZ *p*-values converged sufficiently well only at *N* > 30. According to the quantile analysis with Q-Q plots (2nd column), CSB *p*-values were more conservative than those of NRZ. Next, the latter was somewhat less sensitive than CSB in regard of small GS, especially of those where both GSs were small, below 30 or 10 nodes altogether (green and blue lines in the first column). Again, NRZ *p*-values from the computationally feasible but insufficient *N* = {3;10; 30} exhibited more disagreement with respective CSB values. In addition to the sums *N*
_{AGS} + *N*
_{FGS}, we stratified the Q-Q plots by the sum of node degrees *C*
_{AGS} + *C*
_{FGS} and by the actual number of edges between two GSs, *N*
_{edges} (Additional file 1: Figure S1A and B). The sum *C*
_{AGS} + *C*
_{FGS} appeared a strongly biasing factor (although *N* and *C* expectedly correlated in the test set). The influence of *N*
_{edges} was almost non-existent.

Since there was no gold standard for network enrichment, we could not unambiguously conclude that CSB was more precise than NRZ. However it seems to be preferable because of its, in general, more conservative estimates and, in particular, higher sensitivity on small GS as well as the convergence of NRZ at higher values of N. The NRZ-specific bias on small GSs likely occurred because between-GS connectivity values could not take on negative values, hence the left tail of the distribution of *N*
_{edges} was compressed and standard deviation must be underestimated.

Thus while using NRZ, the deviation from CSB values and bias due to small GS size were considerably large at *N* < 100. However, performing many randomizations might make the procedure prohibitively slow. While using NRZ, the computation time grows linearly with the number of randomizations. For example on a server running 2.6.32-642.el6.x86_64 (OS Scientific Linux),one randomization of the given network took around 3 min plus 20 min for counting connectivity in the given FGS collection, so that the total time for 100 randomizations was 100*(20 + 3) = 2300 min. For the most direct comparison, the same perl software [9] in the CSB mode, i.e. without network randomization, run 15 min. In the both cases, the perl process occupied between 300 and 400 MB RAM. Running R package NEArender required less than 5 min and below 200 MB RAM. We note that in many practical tasks the number of either AGS or FGS could be significantly reduced. However as stated above, the current package is mostly meant for generating NEA scores that could be used as predictive features in machine learning - and those AGSxFGS matrices are supposed to be large.

### Robustness of results obtained from different replicates

Statistically underpowered experimental designs are not welcomed in the scientific community but, due to the lack of resources, are still practiced. In particular, it is common to represent patients in large cohort screens with single samples [11, 12]. Both the GSEA and NEA analyses measure enrichment in order to summarize signals from individual genes' to the level of pathways and biological processes. This feature suggests a potential increase in robustness of conclusions in experiments that lack replicates. We decided to investigate this robustness of using single-gene expression values versus GSEA and NEA under different scenarios using cell transcriptome data from the FANTOM5 CAGE RNA sequencing set [13]. In order to make results comparable between NEA and GSEA, we use the binomial version of the latter [1], which allowed applying the analyses to the same input gene sets. Commonly, statistical power is analyzed via extrapolating variance estimates [14, 15]. Since GSEA and NEA do not provide such estimates, we instead watched consistency of conclusions drawn from individual replicates.

The samples, in 3 to 5 biological (distinct healthy individuals) replicates, originated from different fibroblast, epithelial, and smooth muscle cells. In total, 43 replicates from 11 cell types quantified RNA transcript tags mapped to 16620 genes. This provided a broad range of degrees of dissimilarity between type-specific transcriptomes. The dissimilarity was evaluated as a fraction of significantly DE genes in fully replicated designs and ranged from 2.4% (“fibroblast.periodontal” vs. “fibroblast.gingivial”) to 54.8% (“smooth.muscle.umbilical.artery” vs. “gingival.epithelial”) DE genes. This set-up allowed us to model situations of analyzing DE of each gene *g* using single samples, e.g. *g*
_{
Ai
} vs. *g*
_{
Bj
} on samples *i* and *j* from cell types *A* and *B*, and then using all available replicates for *A* and *B* (*N*
_{
A
}, *N*
_{
B
}): \( {g}_{A\left\{1\dots {N}_A\right\}} \) vs. \( {g}_{B\left\{1\dots {N}_B\right\}} \). Note that calculating DE *p*-values was thus possible only in the latter case. Otherwise, DE could only be ranked by values of fold change between *A*
_{
i
} and *B*
_{
j
}. We quantified DE in all the 55 possible *A*
_{
i
} - *B*
_{
j
} pairs between the 11 cell types. As an example, if there were *N*
_{
A
} = 3 and *N*
_{
B
} = 5 replicates, then we could model DE measurements on 5*3 = 15 pairs of single samples plus one analysis \( {g}_{A\left\{1\dots {N}_A\right\}} \) vs. \( {g}_{B\left\{1\dots {N}_B\right\}} \). In the following, we review preservation of DE estimates across different cell samples contrasts analyzed with three different methods: the default DE analysis on individual genes, GSEA, and NEA. The preservation was evaluated in the form of Spearman rank correlation coefficients between: 1) fold change values of individual genes for gene-wise analysis“as is”,2) *P*-values of FGS enrichment scores obtained with binomial GSEA, and 3) *P*-values of FGS enrichment scores obtained with CSB NEA.

The example scatterplots at Fig. 3 (a, c, and e) display agreement between the same replicate pairs of the same contrast between fibroblast gingivial vs. tenocyte cells. The fold change values and *p*-values, although being different in their nature and scale, allowed us to rank the results in the same way as a biological researcher would prioritize them. We note here that in GSEA and NEA neither *p*-values nor enrichment scores (being mutually rank-invariant) explicitly provide sampling errors similar to between-sample variance in the DE analysis. All such comparisons between sample pairs were quantified by Spearman rank correlations and used as individual points in the plots on the right (b, d, and f). One observation from the plots C, and E was that regardless of the correlation strength, NEA possessed a much higher statistical power to detect enrichment – which has been discussed in details by Alexeyenko et al. in 2012 [2]. Indeed, no GSEA *p*-values appeared significant after the Bonferroni adjustment for multiple testing in either replicate pair (brown dotted vertical and horizontal lines). For comparison, almost 1/3rd of the NEA scores were significant in the both replicate pairs.

GSEA and NEA were each tested using six alternative variants of AGS, generated for each contrast: a) lists of top 100, 300, and 1000 DE genes ranked by fold change and b) lists of genes with absolute log
_{
2
}
(fold change) values exceeding 1, 2, and 4. AGS sizes of the form (b) were not fixed and depended on the magnitude of the difference between the two transcriptomes. When using all available replicates in the analyses (a), the genes were ranked by *p*-value with functions lmFit and eBayes in R package limma [16]. In (b) we required that, in addition to the fold change cutoff, the adjusted *p*-values did not exceed 0.05.

Overall, the scores from NEA agreed with each other much better than those from GSEA (Fig. 3, g and h). By investigating the AGS- and method-specific scatterplots for AGSs of type”abs (log
_{
2
}
(fold change)) > 2” (Fig. 3, b, d, and f; respective plots for all the six AGS types are available as Additional file 1: Figure S3), we observed that the level of correlation between the analyses (Y-axis) depended on strength of the pairwise difference between cell type transcriptomes, expressed as the fraction of DE genes (X-axis). Ideally, we would observe independence of the difference strength (so that ranks fully correlate regardless of fold change, *p*-value, FDR etc.) and perfect correlation between DE results from both individual sample pairs and replicated analyses. However, it was unrealistic to expect large and efficient AGSs from few (e.g. 5…7%) significantly DE genes. Still, we can conclude that NEA output was considerably closer to the desirable pattern than those of the gene-wise and GSEA analyses. Indeed, in addition to the stronger overall correlation, NEA performed particularly well on weakly different cell types pairs (see the regression lines’ intercepts with Y-axes and the rank *r* values in the bottom right corners in Fig. 3, b, d, and f). The only exception could be found for AGSs top100 (Additional file 1: Figure S3), where GSEA appeared superior over NEA. However, this option would be practically unusable in GSEA because of its low statistical power on small gene sets and, hence, few significant enrichment values.

We also could compare consistency of results obtained via i) individual sample pairs against each other (“*g*
_{
Ai
} vs. *g*
_{
Bj
}”against “*g*
_{
Am
} vs. *g*
_{
Bn
}”) and ii) results from individual sample pairs against the replicated analysis (“*g*
_{
Ai
} vs. *g*
_{
Bj
}”against “\( {g}_{A\left\{1\dots {N}_A\right\}} \) vs. \( {g}_{B\left\{1\dots {N}_B\right\}} \)”). One can also see that the option (ii), i.e. the comparison of non-replicated to replicated analyses (colored points), often exhibited poorer correlation than comparisons (i) (black points), especially for GSEA (the same pattern can be seen in the boxplots of Additional file 1: Figure S2).