INBIA: a boosting methodology for proteomic network inference

Background The analysis of tissue-specific protein interaction networks and their functional enrichment in pathological and normal tissues provides insights on the etiology of diseases. The Pan-cancer proteomic project, in The Cancer Genome Atlas, collects protein expressions in human cancers and it is a reference resource for the functional study of cancers. However, established protocols to infer interaction networks from protein expressions are still missing. Results We have developed a methodology called Inference Network Based on iRefIndex Analysis (INBIA) to accurately correlate proteomic inferred relations to protein-protein interaction (PPI) networks. INBIA makes use of 14 network inference methods on protein expressions related to 16 cancer types. It uses as reference model the iRefIndex human PPI network. Predictions are validated through non-interacting and tissue specific PPI networks resources. The first, Negatome, takes into account likely non-interacting proteins by combining both structure properties and literature mining. The latter, TissueNet and GIANT, report experimentally verified PPIs in more than 50 human tissues. The reliability of the proposed methodology is assessed by comparing INBIA with PERA, a tool which infers protein interaction networks from Pathway Commons, by both functional and topological analysis. Conclusion Results show that INBIA is a valuable approach to predict proteomic interactions in pathological conditions starting from the current knowledge of human protein interactions. Electronic supplementary material The online version of this article (10.1186/s12859-018-2183-5) contains supplementary material, which is available to authorized users.


Methods and parameters
We selected 14 network inference methods based on several statistical measures, such as: correlation, mutual information, entropy maximization, regression. The methods were applied to reverse-phase protein arrays (RPPA) datasets retrieved from TCPA containing expression levels of 190 proteins and phosphoproteins in 16 cancer types (Table S1). Below we report the description of the procedures used to execute the methods. The complete analysis was implemented in R language. Table  S2 reports the used R packages and parameters.

Network inference methods description
Concerning correlation, we used Pearson, Spearman, and Partial Correlation. The last measure has been computed using the inverse of the Pearson correlation matrix through the R function cor2pcor which makes use of Singular Value Decomposition (SVD) [1]. Among correlation-based approaches, the Weighted correlation network analysis (WGCNA) yields module detection, network of co-expressed genes, gene significance evaluation and topological analysis [2]. We used the Topological Overlap Matrix (TOM) similarity [3] to predict putative edges between proteins in the datasets. The analysis have been carried out by considering the best power computed with pickSoftThreshold for correlation network reconstruction such that it resembles a scale-free topology. The blockwiseModules function reconstructs the network from data and finds module of interconnected nodes with minimum module size of 30. The dynamic tree cut approach to select best cut was also assessed with cutreeDynamic function by using average linkage hierarchical clustering of dissimilarity TOM matrix (1 − T OM ). The Ridge and Lasso regression methods compute penalized estimates of the regression coefficients which tend to vanish when using L2 or L1 norm, respectively. These methods are applied to the inverse covariance matrix. Ridge method computed from ridge.net creates a vector lambda of 1, 000 penalty terms that depends on the number of samples n and the number of splits in k-fold cross-validation, then applies ridge regression. The advantage of Lasso regression represents its low computational complexity. For this reason, it has been applied to high dimensional data [4]. On the other hand the main shortcoming to work on biological data is that it needs space reduction procedures to reduce false positives. The implementation through adalasso.net yields best coefficient from cross-validation-optimal regression, then computes partial correlation.  WGCNA TOMsimilarityFromExpr networkType = "unsigned", TOMType = "unsigned", power = best.power WGCNA Graphical lasso solves the lack of reductions procedures by using the block coordinate descent algorithm. Elastic Net combines the Lasso and Ridge penalties achieving better results compared to Lasso with several real world data [5]. Complementary approaches, e.g. Shrinkage (Gene net) and Partial Least Square Regression, try to estimate covariance matrices in order to infer dependencies among genes. Results were filtered according to the significance of the method's predictions. In particular, for each predicted protein-protein edge a p-value, based on Graphical Gaussian Model [1], is computed. Graphical Gaussian models are used to describe gene association networks from partial correlation. The original model hypothesizes that the genomic data follows a multivariate normal distribution and has positive definite covariance matrix. It states that correlation coefficients represent direct or undirect regulations by common genes. The higher the partial correlation between two genes is the more is the strength of the direct interaction between them. It further computes the partial correlation matrix from the inversion of the correlation coefficients. This model has been improved for small sample size in [1] by estimating partial correlation coefficients from the inversion of the correlation matrix with Moore-Penrose pseudoinverse matrix and then used bootstrap aggregation to stabilize the estimator.
ARACNE [6], CLR [7], and MRNET [8] are based on the estimation of the mutual information between variables. ARACNE exploits the data processing inequality (DPI) to delete false positives (i.e. indirect edges). It has been applied both on synthetic and biological gene expression data to create large networks with more than 120, 000 gene interactions. CLR computes the distribution of the values of mutual information for all the pairs of variables. Next, it applies the relevance network approach to filter non-significant interactions. Minimum-redundancy-maximumrelevance (MRNET) algorithm selects the variables with maximum relevance in terms of mutual information. The indirect interactions caused by interdependence among selected variables are penalized through a minimum redundancy criterion. Table S3 reports the topological measures in the networks inferred by our approach INBIA with respect to those computed by PERA [9]. For each undirected network we computed:

Inferred Networks Properties and Statistics
1 Number of nodes; 2 Number of edges; 3 Edge density: the ratio of the network edges and the total possible edges; 4 Transitivity: the global transitivity represents the number of triangles divided by the number of triples in the network; 5 Diameter: the maximum length of the shortest path between any two vertices in the network; 6 Degree centralization: obtained from vertices degree, represents the level of network centralization normalized by the maximum centrality level; 7 Closeness centrality: similar to degree centralization, is the normalized closeness centralization index, where the closeness for each vertex is defined as is the shortest path connecting nodes u and v in the network; 8 Betweenness centrality: for each vertex v, the fraction of shortest paths between every other two nodes in the networks that contain v as an intermediate vertex; 9 Mean distance: the average of the length of all network's shortest paths. Table S4 reports the F-measure, i.e. a combination of precision and recall, considering direct (k = 1) and indirect (k = 2, 3, 4) interactions, where k represents the length of paths within the gold standard connecting the proteins. For each tissue, the best method is selected by computing the maximum F-measure across all methods.
To assess the quality of predicted edges, inferred networks are compared with tissue-specific networks from TissueNet v.2 [10] and GIANT [11]. For example, in order to find the best match between cancer and normal counterpart we sought if a specific cancer can occur in that tissue, e.g. BLCA corresponds to urinary bladder, BRCA to mammary gland, GBM to brain, PRAD to prostate gland, and so on. Every file, e.g. urinary bladder in GIANT, contains in each row an edge between two proteins and the corresponding class as defined in [11]. We verified the overlap of predicted PPIs from INBIA with those contained in the normal counterpart in GIANT and collected the classes. Cancer types were associated to tissues in TissueNet v.2 and GIANT following the search keywords reported in Table S5.  Table S6 reports the number of interactions predicted by the compared methods (INBIA and PERA) and the number of such intersections present in TissueNet. The differences of overlaps among INBIA and PERA was statistically significant with p-value < 0.001 (by using T-test).
The Figure S1 reports the consensus networks for INBIA (a) and PERA (b) constructed by considering common predictions in all predicted cancers network (for each cancer type we take the ensemble network). Each node size in the INBIA's (a) and PERA's (b) network corresponds to collective influence and nodes are colored in red if they belong to MDS. We further analyzed the INBIA consensus network by considering OncoPanel [12], a custom targeted next-generation sequencing assay for cancer composed by 282 genes. Only 49 genes out of 282 are present in TCPA datasets but, interestingly, the consensus network obtained in Figure S1 contains a total of 42 genes and 15 of these are also present in OncoPanel. Figure S2 represents the labeled INBIA consensus network. The complete results of this analysis are reported in Additional File 5. Table S7 reports gene set enrichment analysis for all genes in the TCPA dataset. Most genes related to TCPA proteins overlap with genes up-regulated by activa-  tion of the PI3K/AKT/mTOR pathway and mediating the programmed cell death by activation of caspases [13]. Other significant enriched gene sets are hypoxia, that contain genes up-regulated in response to low oxygen levels and genes downregulated in response to ultraviolet radiation [13]. Nodes are valuated associating their function to a measure called node collective influence and to the beloging of  the node to the minimum dominating set (MDS). MDS is the minimal subset of nodes such that every other node in the network that do not belong to this set is directly connected to a MDS node [14]. The proposed solution to find MDS follows a greedy approach. At each step it selects the node c with the maximum degree, sets it as 'visited' and adds it to the current MDS. Then c's neighbors, i.e. nodes directly connected to c, are set to 'visited' too and removed from the set of nodes to consider at the next step. The procedure ends when every node is visited. It has been proved that the approximation error is less than ln(δ) + 2 where δ is the maximum degree in the input graph [15]. For each tissue, the network of PPI predictions related to the best method was selected, then its MDS was computed. The collective influence (CI) measure, first introduced in [16], finds influencers within a network. These nodes have a topologically key role in the spreading of information along the entire network. This can be used to stop the diffusion of an epidemic disease. The CI measure reveals the importance of 'weak' nodes that  have low degree but also takes into account the contribution of neighbors with high degree called hubs. In PPI networks related to a specific disease, the influencers can be proteins with a biological function related to the initiation of the signaling process for that pathology.

Table S7
Gene set enrichment analysis of all TCPA genes performed with MSigDB. The proteins are mostly involved in PI3K/AKT/mTOR pathway and apoptosis regulation. K is the number of genes in Gene Set, k the number of genes in overlap, p-value and FDR q-value are non-corrected and corrected significativity of the enrichment, respectively. Let δBall(i, r) for node i be the set of nodes at distance r from node i through a shortest path in graph G. Then, collective influence is: where d G (v) is the degree of node v in graph G. As suggested in [16], the CI for node i is more informative when r ≥ 1 and r is lower than the diameter of the network, otherwise CI vanishes. For these reasons, we set r = 2. For each tissue, the network of PPI predictions related to the best method was selected, then we computed the collective influence for each protein. The higher is the CI value the higher is the number of hubs a protein is connected to. Additional File 5 reports the lists of MDSs and collective influence of nodes, for each cancer PPI network inferred by our approach and PERA.