A protein network refinement method based on module discovery and biological information

Pan, Li; Wang, Haoyue; Yang, Bo; Li, Wenbin

doi:10.1186/s12859-024-05772-z

Published: 20 April 2024


BMC Bioinformatics volume 25, Article number: 157 (2024)


Abstract

Background

The identification of essential proteins helps in understanding the minimum requirements for cell survival and development, and supports the discovery of drug targets and the prevention of disease. Node ranking methods are a common way to identify essential proteins, but the poor data quality of the underlying protein–protein interaction network (PIN) limits their identification accuracy. Researchers have therefore constructed refined networks that take into account certain biological properties of interacting protein pairs, in order to improve the performance of node ranking methods on the PIN. Studies show that proteins in a complex are more likely to be essential than proteins not present in a complex. However, existing PIN refinement methods usually ignore modularity.

Methods

Based on this, we proposed a network refinement method based on module discovery and biological information. First, the maximal connected subgraph of the PIN is extracted and divided into modules using the Fast-unfolding algorithm; then, critical modules are detected according to the orthologous information, subcellular localization information and topological information within each module; finally, a more refined network (CM-PIN) is constructed from the identified critical modules.

Results

To evaluate the effectiveness of the proposed method, we applied 12 typical node ranking methods (LAC, DC, DMNC, NC, TP, LID, CC, BC, PR, LR, PeC, WDC) and compared their overall performance on the CM-PIN with that on the S-PIN, D-PIN and RD-PIN. The experimental results showed that the CM-PIN performed best in terms of the number of essential proteins identified, the precision-recall curve, the Jackknifing method and other criteria, and that it helps to identify essential proteins more accurately.


Background

Proteins are the most significant components of living organisms. They participate in gene regulation and cellular metabolism and are the main bearers of life activities. Proteins are subdivided into essential and non-essential proteins. Essential proteins are particularly important for life activities, and their absence can lead to the failure of the organism to survive [1]. In addition, essential proteins are associated with human disease-causing genes, and their identification and analysis can help in the design of drug targets.

Early studies of essential proteins were mainly conducted by wet-lab experimental methods such as RNA interference [2], single gene knockout [3] and conditional gene knockout [4], which are often expensive and time-consuming. The identification of essential proteins by computational methods has therefore become the current trend.

Node ranking methods are commonly used to identify essential proteins in the protein–protein interaction network (PIN). Initially, researchers used network-based centrality methods to identify essential proteins in the original PIN (static PIN) [5], such as degree centrality (DC) [6], local average connectivity centrality (LAC) [7], node clustering centrality (NC) [8], maximum neighborhood component density centrality (DMNC) [9], topological potential centrality (TP) [10], neighbor interaction density centrality (LID) [11], closeness centrality (CC) [12], betweenness centrality (BC) [13], pagerank centrality (PR) [14], leaderrank centrality (LR) [15], etc.

However, the centrality methods use only the topological features of protein interaction networks to assess the importance of proteins, and thus it is difficult for them to reach the desired predictive performance. In recent years, researchers have tended to integrate multiple kinds of biological information about proteins to identify essential proteins more accurately. For example, Li et al. [16] and Tang et al. [17] proposed the PeC and the WDC methods by integrating the degree of co-expression between protein pairs in gene expression profiles and the edge clustering coefficients of their interactions. Qin et al. [18] proposed the LBCC method, which is based on network topological features and protein complexes; Li et al. [19] pointed out that proteins in a complex are more likely to be essential than proteins not present in a complex, and they proposed the UC method by combining protein complexes and topological features of PINs. Lei et al. [20] proposed the PCSD method, which fuses the degree of protein complex involvement and subgraph density. Zhong et al. [21] used a dynamic threshold method to binarize gene expression values and proposed the JDC method, which combines the co-expression states and edge clustering coefficients of protein pairs at multiple time points.

Although these node ranking methods have made great progress in identifying essential proteins, most of them require the use of topological information of proteins in the PIN for identification of essential proteins, especially network-based centrality methods, which are highly dependent on the accuracy of the underlying PINs. However, most of the PINs obtained from high-throughput experiments have been found to contain false positives or false negatives [22], which may somewhat interfere with the identification accuracy of essential proteins by most node ranking methods.

To improve the identification accuracy of essential proteins, some researchers used biological information of proteins to filter out unreliable interactions in the PIN, thereby constructing a refined PIN on which node ranking methods identify essential proteins. For example, starting from the static PIN (S-PIN), Xiao et al. [23] removed unreliable interactions by determining from gene expression data whether the two proteins of a pair were active at the same time, and constructed a once-refined PIN (D-PIN). Subsequently, Li et al. [24] further removed unreliable interactions from the D-PIN by determining whether the two proteins of a pair appeared in the same subcellular compartment, and constructed a twice-refined PIN (RD-PIN).

Nevertheless, some researchers pointed out that PINs have modular characteristics [25,26,27]: the essentiality of a protein is related not only to the protein itself but also to the functional module in which the protein is located, and proteins within a module are more similar to each other than to proteins in other modules. Furthermore, Zotenko et al. [28] found that in PINs, a large number of essential proteins may be present in highly dense functional modules. The aforementioned refinement studies focused only on the edges between protein nodes, ignoring the modularity of PINs. Therefore, how to better use the modularity of PINs to construct an efficient PIN and improve the performance of node ranking methods is still a question worth exploring.

For the identification of community structure in complex networks, researchers have proposed a series of module discovery algorithms. For example, algorithms based on modularity [29, 30] and information-theoretic framework [31] can divide non-overlapping modules in complex networks; while the modules discovered by using clique-percolation based [32] and edge-clustering based [33] methods can be overlapping. In particular, in recent studies, some researchers have made use of network structure and node attributes to cluster complex networks more accurately [34,35,36]. For example, Hu et al. [35] and Yang et al. [36], developed two fuzzy-based graph clustering algorithms that well take into account the key dependencies between node embedding and resulting clustering. In our study, a modularity-based Fast-unfolding algorithm was used to partition PINs into modules and analyze the differences between modules.

We found that the biological and topological information contained in different modules of PIN varies greatly. For example, some modules are dense but contain few essential proteins, which may be counterproductive for identifying essential proteins in the PIN. Therefore, the identification and selection of critical modules is of great significance for the construction of higher quality PINs. That is to say, if the network can be refined properly in combination with the modularity of the PIN, the performance of the node ranking method in the PIN may be improved more effectively.

Based on this, in this paper, we proposed a network refinement method based on module discovery and biological information to improve the identification accuracy of essential proteins for node ranking methods. For a given PIN, the idea is, firstly, to remove the interactions in the small connected subgraphs of the PIN; secondly, to divide the maximal connected subgraph into several closely connected modules by the modularity-based Fast-unfolding algorithm; thirdly, to select the critical modules by combining orthologous information and subcellular localization information of proteins with the topological features of each module; finally, to construct a more refined PIN (CM-PIN) from the selected critical modules.

To evaluate the effectiveness of the network refinement method proposed in this paper, two species, Saccharomyces cerevisiae and Homo sapiens, were used for validation. We applied 12 node ranking methods (LAC, DC, DMNC, NC, TP, LID, CC, BC, PR, LR, PeC, WDC) on the S-PIN, D-PIN, and RD-PIN, and compared the results with those on the CM-PINs derived from each of these networks. The experimental results showed that the 12 node ranking methods performed best on the CM-PIN in terms of the number of essential proteins identified at top 100–600, the Jackknifing method, the area under the precision-recall curves, sensitivity, specificity, positive predictive value, negative predictive value, F-measure, Matthews correlation coefficient and accuracy. These results indicate that the network refinement method proposed in this paper yields a more efficient PIN, which helps to improve the identification accuracy of essential proteins for node ranking methods, and that it is superior to the existing refinement networks (D-PIN and RD-PIN).

Methods

In this section, first, we describe how to build the three protein interaction networks: S-PIN, D-PIN, and RD-PIN. Second, we describe how to screen the critical modules using the biological information of proteins and the topological features of each module, and how to construct CM-PINs on the S-PIN, D-PIN, and RD-PIN respectively. The overall steps of this approach are shown in Fig. 1.

S-PIN, D-PIN and RD-PIN

A static protein–protein interaction network (S-PIN) [37,38,39], is an undirected graph G_S = (V_S, E_S), where V_S represents the set of proteins and E_S represents the set of protein interactions.

A dynamic protein–protein interaction network (D-PIN) [23] is an edge-induced subgraph G_D = (V_D, E_D) of the S-PIN determined by the gene expression levels of proteins, where V_D = V_S and E_D ⊆ E_S. Let e_ik denote the gene expression level of v_i at time point t_k. If e_ik is greater than τ_i, then v_i is active at time point t_k. For any (v_i, v_j) ∈ E_S, if both v_i and v_j are active at the same time point t_k, the interaction between them is preserved in E_D; otherwise it is removed from E_D. The activity threshold τ_i of protein v_i was calculated by using the following equations [25]:

$$\tau_{i} = \mu_{i} + \sigma_{i}$$

(1)

$$\mu_{i} = \frac{{\sum\nolimits_{k = 1}^{n} {e_{ik} } }}{n}$$

(2)

$$\sigma_{i} = \sqrt {\frac{{\sum\nolimits_{k = 1}^{n} {(e_{ik} - \mu_{i} )^{2} } }}{n}}$$

(3)

where μ_i denotes the mean of the n time-point gene expression level values of protein v_i and σ_i is the standard deviation of these values. In this paper, n = 36 for Saccharomyces cerevisiae and n = 64 for Homo sapiens.
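As an illustration of Eqs. (1)–(3) and the D-PIN edge rule, the following is a minimal Python sketch, not the authors' implementation. The function names, the dictionary-based data layout and the toy expression values are assumptions made for this example, as is the choice to treat a protein without expression data as never active.

```python
from statistics import mean, pstdev

def activity_threshold(expr):
    """tau_i = mu_i + sigma_i, using the population standard deviation (divide by n)."""
    return mean(expr) + pstdev(expr)

def active_time_points(expr):
    """Indices k of the time points at which e_ik > tau_i."""
    tau = activity_threshold(expr)
    return {k for k, e in enumerate(expr) if e > tau}

def build_d_pin(edges, expression):
    """Keep (u, v) only if both proteins are active at one or more shared time points.

    Assumption: a protein with no expression profile is treated as never active.
    """
    active = {p: active_time_points(e) for p, e in expression.items()}
    return [(u, v) for u, v in edges
            if active.get(u, set()) & active.get(v, set())]

# Hypothetical toy data: three proteins, four time points.
expression = {
    "A": [1.0, 1.0, 9.0, 1.0],
    "B": [2.0, 2.0, 8.0, 2.0],
    "C": [7.0, 1.0, 1.0, 1.0],
}
edges = [("A", "B"), ("A", "C"), ("B", "C")]
d_pin_edges = build_d_pin(edges, expression)
```

In this toy example A and B are both active only at the third time point and C only at the first, so only the interaction between A and B is kept.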

A refined dynamic protein–protein interaction network (RD-PIN) [24] is an edge-induced subgraph G_RD = (V_RD, E_RD) of the D-PIN determined by the subcellular localization information of proteins, where V_RD = V_D and E_RD ⊆ E_D. Let L(v_i) = {l₁(v_i), …, l_m(v_i), …, l_r(v_i)} be the subcellular localization statuses of protein v_i, where r = 11 is the number of subcellular compartments. If v_i is in the mth subcellular compartment, then l_m(v_i) = 1, otherwise l_m(v_i) = 0. For any (v_i, v_j) ∈ E_D, the interaction between v_i and v_j is preserved in E_RD only if there is a compartment m with l_m(v_i) = l_m(v_j) = 1; otherwise it is removed from E_RD.
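The RD-PIN rule amounts to keeping an edge only when its two proteins share a compartment. A minimal sketch, assuming localization is stored as a set of compartment names per protein (a hypothetical layout) and that a protein without localization data shares no compartment with any other:

```python
def build_rd_pin(d_pin_edges, compartments):
    """Keep an edge only if its two proteins share a subcellular compartment.

    Assumption: proteins missing from `compartments` have an empty set.
    """
    return [(u, v) for u, v in d_pin_edges
            if compartments.get(u, set()) & compartments.get(v, set())]

# Hypothetical toy annotation.
compartments = {
    "A": {"nucleus", "cytosol"},
    "B": {"nucleus"},
    "C": {"mitochondrion"},
}
rd_pin_edges = build_rd_pin([("A", "B"), ("B", "C")], compartments)
```

Here A and B both occur in the nucleus, so their interaction is retained, while B and C share no compartment and their interaction is dropped.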

Construction of the CM-PIN

The construction of the CM-PIN consists of four steps (the following steps are consistent with Fig. 1):

Step 1: retaining interactions in the maximal connected subgraph, that is, to remove the interactions in the remaining small connected subgraphs of the given PIN;
Step 2: module discovery based on the Fast-unfolding algorithm, that is, to divide the obtained maximal connected subgraph into several modules using the Fast-unfolding algorithm;
Step 3: detecting critical modules, that is, to screen out critical modules by using biological and topological information of modules;
Step 4: refining the protein–protein interaction network, that is, to remove the interactions of non-critical modules from the original PIN and construct the CM-PIN.

The construction process of the CM-PIN is described in the following algorithm.

Step 1: retaining interactions in maximal connected subgraphs

It has been found that PINs have scale-free properties [40, 41]: the degrees of the nodes in a PIN obey a power-law distribution. A PIN is typically a disconnected graph consisting of several connected subgraphs. Most of the proteins and their interactions lie in one maximal connected subgraph, while the remaining connected subgraphs contain very few proteins and interactions. As shown in Table 1, we counted the proportion of interactions in the maximal connected subgraphs of the YDIP, YBioGRID and HDIP datasets relative to all interactions in the original networks.

Table 1 The proportion of interactions in the maximal connected subgraphs to the original network interactions on YDIP, YBioGRID and HDIP datasets
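Step 1 can be sketched with a breadth-first search over the edge list. The code below is an illustrative sketch only; the function names are hypothetical, and choosing the largest component by node count is an assumption, since the text does not state the criterion.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Return the connected components of an undirected edge list as node sets."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components

def largest_component_edges(edges):
    """Edges of the largest component (by node count, an assumption) and their share."""
    largest = max(connected_components(edges), key=len)
    kept = [(u, v) for u, v in edges if u in largest]
    return kept, len(kept) / len(edges)

# Hypothetical toy network: a triangle plus one isolated pair.
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "E")]
kept_edges, proportion = largest_component_edges(toy_edges)
```

The returned proportion corresponds to the quantity reported in Table 1; for the toy network it is 3 of 4 interactions.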


Step 2: module discovery based on Fast-unfolding algorithm

It has been shown that PINs have modular properties [25, 26], and the modularity reflects the presence of highly connected protein clusters in PINs. Clustering the protein interaction network is an effective way to delineate modules. In this paper, the Fast-unfolding module discovery algorithm, a hierarchical clustering method, is used to divide the PIN into modules.

The purpose of module partitioning is to make the connections within modules tighter and the connections between modules sparser. To evaluate the quality of a module division, Newman et al. [29] proposed the concept of modularity. Defining e_ii as the fraction of all edges in the network that connect two nodes within module i, and a_i as the fraction of all edge ends in the network that are attached to nodes within module i, the modularity Q can be expressed as:

$$Q = \sum\nolimits_{i} {(e_{ii} - a_{i}^{2} )}$$

(4)

A larger modularity indicates tighter connections within modules and, conversely, a smaller modularity indicates sparser connections within modules. The division of modules is considered optimal when the modularity Q reaches its maximum value.
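Equation (4) can be evaluated directly from an edge list and a module assignment. The sketch below assumes an unweighted, undirected network and reads a_i as the fraction of edge ends attached to module i (degree sum divided by twice the number of edges), which is the usual convention for Newman's modularity; the function name and toy network are hypothetical.

```python
from collections import defaultdict

def modularity(edges, module_of):
    """Q = sum_i (e_ii - a_i^2) for an unweighted, undirected edge list."""
    m = len(edges)
    inside = defaultdict(int)  # number of edges with both ends in module i
    ends = defaultdict(int)    # number of edge ends attached to module i
    for u, v in edges:
        cu, cv = module_of[u], module_of[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)

# Hypothetical toy network: two triangles joined by one bridge edge.
bridge_edges = [("A", "B"), ("B", "C"), ("A", "C"),
                ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")]
two_modules = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_module = {node: 0 for node in two_modules}
q_two = modularity(bridge_edges, two_modules)
q_one = modularity(bridge_edges, one_module)
```

For this toy network, splitting into the two triangles gives Q = 2 × (3/7 − (7/14)²) = 5/14, whereas placing all nodes in a single module gives Q = 0.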

Blondel et al. [30] proposed the Fast-unfolding algorithm for discovering module structures in large networks, a heuristic algorithm based on modularity optimization. Compared with traditional module discovery algorithms, Fast-unfolding has lower time complexity on large-scale networks and gives stable module partitions, which is why it was chosen to partition modules in this paper. The implementation steps of the Fast-unfolding algorithm are as follows. First, initialization: assign each protein node to its own module. Second, for each protein node, try to move it into the module of one of its neighboring nodes, calculate the modularity Q after the move, and check whether the difference ΔQ between the modularity after and before the move is positive; if it is positive, accept the move, otherwise abandon it. Third, repeat the above process until the modularity Q can no longer be increased. The division of modules is then complete, C = {c₁, c₂, …, c_i, …, c_m} is the set of modules, and m is the number of modules. It is worth noting that the resulting modules are non-overlapping.
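The three steps listed above can be sketched as a simple local-moving loop. This is a deliberately naive illustration, not the implementation used in the paper: it recomputes Q from scratch for every trial move instead of using an incremental ΔQ formula, it covers only the node-moving steps described in the text, and all names are hypothetical.

```python
from collections import defaultdict

def modularity(edges, module_of):
    """Q = sum_i (e_ii - a_i^2) for an unweighted, undirected edge list."""
    m = len(edges)
    inside, ends = defaultdict(int), defaultdict(int)
    for u, v in edges:
        cu, cv = module_of[u], module_of[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            inside[cu] += 1
    return sum(inside[c] / m - (ends[c] / (2 * m)) ** 2 for c in ends)

def local_moving(edges):
    """Greedy node moves that strictly increase Q, repeated until none is found."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    module_of = {node: node for node in adj}  # step 1: one module per node
    improved = True
    while improved:                            # step 3: repeat until Q stops rising
        improved = False
        for node in sorted(adj):               # node labels must be sortable
            original = module_of[node]
            best_module, best_q = original, modularity(edges, module_of)
            # step 2: try the module of every neighbour, keep the best positive gain
            for candidate in sorted({module_of[nb] for nb in adj[node]}):
                module_of[node] = candidate
                q = modularity(edges, module_of)
                if q > best_q + 1e-12:
                    best_module, best_q = candidate, q
            module_of[node] = best_module
            if best_module != original:
                improved = True
    return module_of

# Hypothetical toy network: two triangles joined by one bridge edge.
bridge_edges = [("A", "B"), ("B", "C"), ("A", "C"),
                ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")]
found = local_moving(bridge_edges)
```

On the toy network the loop separates the two triangles. For real PINs a maintained library implementation of the Louvain method would be preferable to this quadratic-time sketch.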

Step 3: detecting critical modules

To determine the importance of each module, we used three features (i.e., orthologous information, subcellular localization information, and topological information of the module) to score each module in the PIN.

(1) Determine the importance of modules using orthologous information of proteins.

Studies have shown that essential proteins evolve much more slowly than non-essential proteins [42], i.e., essential proteins are more conserved. We assume that modules containing more conserved proteins are more likely to be critical, and the conservation of a protein is reflected mainly in its orthologous information. Therefore, we calculate the Pearson correlation coefficient between each module and the orthologous information of the proteins in the PIN as the first score of the module. For protein v_j, let O(v_j) represent the set of reference organisms in which at least one orthologous protein pair including v_j occurs; |O(v_j)| is the orthologous score of v_j, and the vector consisting of the orthologous scores of all proteins in the PIN is denoted by y. For a module c_i, its membership vector x_i contains only 0 and 1: the jth entry x_ij is 1 if protein v_j is in module c_i and 0 otherwise. The Pearson correlation coefficient PC(c_i) between module c_i and the orthologous scores is:

$$PC(c_{i} ) = \frac{{\sum\nolimits_{j = 1}^{n} {(x_{ij} - \mu_{x_{i}} )(y_{j} - \mu_{y} )} }}{{\sqrt {\sum\nolimits_{j = 1}^{n} {(x_{ij} - \mu_{x_{i}} )^{2} } \sum\nolimits_{j = 1}^{n} {(y_{j} - \mu_{y} )^{2} } } }}$$

(5)

where n is the number of proteins in the PIN, and μ_{x_i} and μ_y are the mean values of x_i and y. Thus, the set of possible critical modules selected based on the orthologous information of the proteins within the module is denoted as C_orth = {c_i|PC(c_i) ≥ th₁}, where th₁ is a threshold value.
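The first module score is a plain Pearson correlation between a 0/1 membership vector and the orthologous scores. A minimal sketch follows; the function name and toy scores are hypothetical, and returning 0.0 when either vector has zero variance is an assumption, since the text does not cover that case.

```python
from math import sqrt

def module_pearson(module_members, orth_score):
    """Pearson correlation between module membership (0/1) and orthologous scores."""
    proteins = list(orth_score)
    x = [1.0 if p in module_members else 0.0 for p in proteins]
    y = [float(orth_score[p]) for p in proteins]
    n = len(proteins)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y))
    var_x = sum((a - mu_x) ** 2 for a in x)
    var_y = sum((b - mu_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: the correlation is undefined, report no association
    return cov / sqrt(var_x * var_y)

# Hypothetical orthologous scores |O(v)| for four proteins.
orth_score = {"A": 10, "B": 8, "C": 1, "D": 1}
pc_conserved = module_pearson({"A", "B"}, orth_score)
pc_other = module_pearson({"C", "D"}, orth_score)
```

A module made of the two highly conserved proteins receives a strongly positive score, and its complement receives the opposite sign, so a threshold th₁ above zero would keep only the first.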

(2) Determine the importance of modules using subcellular localization information of proteins.

The importance of a protein is related not only to its orthologous information but also to its subcellular localization information, which can identify the critical modules in the PIN from another perspective. We counted the number of times proteins and essential proteins were present in each subcellular compartment, and found that proteins, including essential proteins, were most widely distributed in the nucleus. Therefore, we assumed that the more often the proteins within a module were present in the nucleus, the more likely that module was critical. For the module c_i, we calculate the number of times the proteins in module c_i occur in the nucleus, normalized by the number of nodes in the module, as its second score, denoted by NSL(c_i):

$$NSL(c_{i} ) = \frac{{N(c_{i} )}}{{n(c_{i} )}}$$

(6)

where N(c_i) is the number of times the protein within the module appears in the nucleus and n(c_i) is the number of nodes within the module. The set of the possible critical modules selected based on the subcellular localization information of the proteins within the module is represented by C_sub = {c_i|NSL(c_i) ≥ th₂}, where th₂ is a threshold value.

(3) Determine the importance of modules using topological characteristics of modules.

To identify the importance of the module, we also used the topological characteristics of each module in the network. It has been pointed out that a large number of essential proteins may exist in highly dense functional modules [28]. Thus, we thought that the richer the interactions within the module, the more likely it is to play an important role in the whole network, so we calculated the topological characteristics of module c_i as its third score, denoted by TF(c_i):

$$TF(c_{i} ) = \frac{{I(c_{i} ) - O(c_{i} )}}{{n(c_{i} )}}$$

(7)

where I(c_i) is the number of interactions inside module c_i, O(c_i) is the number of interactions between module c_i and other modules, and n(c_i) is the number of nodes of module c_i. According to this topological score, modules whose TF value does not exceed th₃ are selected as the set of potentially non-critical modules, that is, C_topo = {c_i|TF(c_i) ≤ th₃}, where th₃ is a threshold value.
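Equations (6) and (7) can be computed per module as follows. This is a sketch under assumptions: `nucleus_count` is a hypothetical mapping from each protein to the number of times it is recorded in the nucleus, proteins missing from it count as zero, and the network is given as an undirected edge list.

```python
def nsl_score(module_members, nucleus_count):
    """NSL(c_i) = N(c_i) / n(c_i): nuclear occurrences per node in the module."""
    total = sum(nucleus_count.get(p, 0) for p in module_members)
    return total / len(module_members)

def tf_score(module_members, edges):
    """TF(c_i) = (I(c_i) - O(c_i)) / n(c_i)."""
    inside = sum(1 for u, v in edges
                 if u in module_members and v in module_members)
    crossing = sum(1 for u, v in edges
                   if (u in module_members) != (v in module_members))
    return (inside - crossing) / len(module_members)

# Hypothetical toy network: two triangles joined by one bridge edge.
bridge_edges = [("A", "B"), ("B", "C"), ("A", "C"),
                ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")]
module = {"A", "B", "C"}
nucleus_count = {"A": 1, "B": 1, "C": 0}
nsl = nsl_score(module, nucleus_count)
tf = tf_score(module, bridge_edges)
```

The triangle module has three internal interactions and one interaction leaving it, so TF = (3 − 1)/3; two of its three proteins are recorded in the nucleus once each, so NSL = 2/3.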

Step 4: refining the protein–protein interaction network

Finally, we integrated the above three features of the modules to obtain the final selected critical modules, that is, C_critical = C_orth ∪ (C_sub \ C_topo), where \ denotes set difference. For a PIN (S-PIN, D-PIN or RD-PIN) G = (V, E) and any (v_i, v_j) ∈ E, if v_i and v_j are both in critical modules of C_critical, their interaction is retained; otherwise it is removed from E. This yields the refined edge set E_CM and thus a more refined CM-PIN, G_CM = (V_CM, E_CM), where V_CM = V.
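The selection and refinement step can be sketched with ordinary set operations. The sketch reads the "/" in the definition of C_critical as set difference and keeps an edge whenever both endpoints lie in critical modules, not necessarily the same one; both readings are assumptions about the text. The score values and thresholds below are illustrative only and are not the thresholds used in the paper.

```python
def critical_modules(scores, th1, th2, th3):
    """C_critical = C_orth | (C_sub - C_topo), with '/' read as set difference."""
    c_orth = {c for c, s in scores.items() if s["PC"] >= th1}
    c_sub = {c for c, s in scores.items() if s["NSL"] >= th2}
    c_topo = {c for c, s in scores.items() if s["TF"] <= th3}
    return c_orth | (c_sub - c_topo)

def refine_edges(edges, module_of, critical):
    """Keep an edge only if both endpoints belong to critical modules."""
    return [(u, v) for u, v in edges
            if module_of.get(u) in critical and module_of.get(v) in critical]

# Hypothetical per-module scores and thresholds.
scores = {
    1: {"PC": 0.20, "NSL": 0.1, "TF": -1.0},
    2: {"PC": -0.10, "NSL": 0.8, "TF": 0.5},
    3: {"PC": -0.10, "NSL": 0.9, "TF": -2.0},
}
critical = critical_modules(scores, th1=0.1, th2=0.5, th3=-1.5)
module_of = {"A": 1, "B": 2, "C": 3}
cm_edges = refine_edges([("A", "B"), ("B", "C")], module_of, critical)
```

Module 1 is selected through its orthologous score, module 2 through its nuclear score, and module 3 is excluded because its topological score falls at or below th₃ despite a high nuclear score.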

Experiment and discussion

Materials and datasets

We first performed a complete experiment using the Saccharomyces cerevisiae dataset, as this dataset is currently the most complete of all species and has been widely used to test various methods for identifying essential proteins. Then, we used the Homo sapiens dataset to verify the validity of the proposed method.

Protein–protein interaction datasets and essential proteins

The two protein–protein interaction datasets from Saccharomyces cerevisiae used in this paper were downloaded from YDIP [43] and YBioGRID [44], which contain 15,166 and 52,833 interactions, respectively, covering 4746 and 5616 proteins. A dataset of protein–protein interactions from Homo sapiens was downloaded from HDIP [45], which contains 6892 interactions covering 4615 proteins. Essential proteins were collected from the following data sets [46,47,48]: DEG, MIPS, SGD, OGEE. The YDIP, YBioGRID, and HDIP datasets contain 1130, 1199 and 726 essential proteins, respectively.

Other biological information

(1) Gene expression profile: The gene expression profiles of the yeast and human datasets were downloaded from GSE3431 [49] and GSE86354 [50], respectively, containing 6,777 and 18,912 proteins. The GSE3431 dataset records observations at 36 time points during three successive metabolic cycles, and the GSE86354 dataset records expression profiles across 8 tissues, comprising 64 time points. (2) Subcellular localization information: Subcellular localization information for both species was downloaded from the COMPARTMENTS dataset [51], which contains 11 subcellular compartments for each species. (3) Orthologous information: Information on orthologous proteins of yeast and human was taken from Version 7 [52] and Version 8 [53] of the InParanoid database, which contain 100 and 162 genome-wide paired comparison sets, respectively.

Node ranking methods

To verify the performance of the CM-PIN, we used 12 typical node ranking methods (DC [6], LAC [7], NC [8], DMNC [9], TP [10], LID [11], CC [12], BC [13], PR [14], LR [15], PeC [16], WDC [17]) and compared their performance in identifying essential proteins on the CM-PIN with that on the S-PIN and two existing refinement networks (D-PIN [23] and RD-PIN [24]). A node ranking method first calculates the importance scores of all protein nodes in the network according to its formula, then ranks the proteins in descending order of importance score, and finally treats a set of the highest-ranked proteins as essential proteins.

Experimental results and analysis on Saccharomyces cerevisiae

Analysis of the number of essential proteins identification

To show that the network refinement method proposed in this paper can effectively increase the number of essential proteins identified by each node ranking method, we constructed CM-PINs from the S-PIN, D-PIN and RD-PIN of the YDIP and YBioGRID datasets, respectively. The numbers of essential proteins identified by node ranking methods at top 100, top 200, top 300, top 400, top 500, and top 600 on the CM-PIN were compared with those on the S-PIN, D-PIN, and RD-PIN, as shown in Tables 2 and 3. We denote the CM-PIN refined from the S-PIN (D-PIN or RD-PIN) by CM-PIN(S) (CM-PIN(D) or CM-PIN(RD)), and mark the optimal item in bold when comparing two or more items in all subsequent tables.

Table 2 Comparison of the number of essential proteins identified by 12 node ranking methods on the S-PIN, D-PIN, RD-PIN and the CM-PIN at top 100–600 on YDIP dataset


Table 3 Comparison of the number of essential proteins identified by 12 node ranking methods on the S-PIN, D-PIN, RD-PIN and the CM-PIN at top 100–600 on YBioGRID dataset


It can be seen that the CM-PIN significantly improves the identification accuracy of essential proteins by node ranking methods on the yeast datasets, whether it is built on the static PIN or on a refined PIN, and the values at top 100 to top 600 on the CM-PIN are higher than those on the other three existing PINs. Compared with the corresponding original PINs, the average improvement ratio of the 12 node ranking methods at top 600 on the YDIP and YBioGRID datasets was: 9.82% and 20.58% for the CM-PIN refined on the S-PIN; 11.30% and 15.15% for the CM-PIN refined on the D-PIN; 9.65% and 7.79% for the CM-PIN refined on the RD-PIN. Some node ranking methods improved even more: for example, compared with the S-PIN, the BC method improved by 18.22% at top 600 on the CM-PIN on the YDIP dataset; compared with the D-PIN, the CC method improved by 56.74% at top 600 on the CM-PIN on the YBioGRID dataset. In addition, the LID method identified 405 essential proteins at top 600 on the CM-PIN refined on the RD-PIN on the YDIP dataset, which is a very high identification accuracy. These results illustrate the effectiveness of our method and indicate that the CM-PIN is a more refined and effective network.

It is worth noting that the focus of this paper is to improve the overall performance of node ranking methods, so we pay more attention to the accuracy of these methods at top 1130 for YDIP (top 1199 for YBioGRID, or top 726 for HDIP). The accuracy at top 100 also increases to some extent in this case. On the other hand, if the goal is to improve performance at top 100, good accuracy at top 100 can also be achieved by adjusting the parameters of our method appropriately. For example, when setting the parameters th₁ = 0.1, th₂ = 2, and th₃ = −2, the CM-PIN(RD) for YBioGRID can significantly improve the top 100 values of the node ranking methods. However, their top 1199 values then decline to a certain extent. Therefore, readers can strengthen a specific performance index by adjusting the parameters according to their own concerns.

Validated by using the Jackknifing method

In order to evaluate the overall performance of the CM-PIN more comprehensively, we used the Jackknifing method [24, 54]. The horizontal axis of the Jackknifing plot indicates the number of top-ranked proteins in the network and the vertical axis represents the number of essential proteins among these top-ranked proteins. Figures 2 and 3 show the number of essential proteins among the top K highest scoring proteins for each node ranking method on the S-PIN, D-PIN, RD-PIN and CM-PIN (for each node ranking method, the CM-PIN on which it performs best among the three CM-PINs is selected). Here, K is the number of essential proteins, with K = 1130 and K = 1199 on YDIP and YBioGRID respectively. On the CM-PIN, the Jackknifing curves of these methods all lie above those of the other three networks on both yeast datasets, and the differences are significant, whether for neighborhood-based, path-based or eigenvector-based centrality methods, or for the node ranking methods that integrate multiple kinds of biological information. This further demonstrates that the network refinement method in this paper is effective in removing noise and false positives from protein interaction networks and that the CM-PIN is a more efficient network.
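The Jackknifing curve described above is a cumulative count of essential proteins along a ranking. A minimal sketch, with hypothetical names and toy data:

```python
def jackknife_curve(ranked_proteins, essential, top_k):
    """Cumulative number of essential proteins among the top 1..top_k ranked proteins."""
    curve, hits = [], 0
    for protein in ranked_proteins[:top_k]:
        if protein in essential:
            hits += 1
        curve.append(hits)
    return curve

# Hypothetical ranking (descending importance score) and essential set.
ranked = ["P1", "P2", "P3", "P4", "P5"]
essential = {"P1", "P3", "P4"}
curve = jackknife_curve(ranked, essential, top_k=4)
```

The last value of the curve is the number of essential proteins among the top K, which is the quantity reported at top 100 to top 600 in Tables 2 and 3.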

Analysis of precision-recall curves

The identification of essential proteins is a sample imbalance problem: the number of negative samples (non-essential proteins) is much larger than the number of positive samples (essential proteins). When identifying essential proteins, we tend to be more concerned with how many positive samples (essential proteins) can be identified [55]. Therefore, to assess the significance of the CM-PIN, we used precision-recall curves to compare the efficiency of essential protein identification of the 12 node ranking methods (see Figs. 4 and 5). The vertical axis (precision) of the precision-recall curve reflects the proportion of true positives among the examples that the classifier labels positive, and the horizontal axis (recall) reflects the proportion of all positive examples that the classifier labels positive. We further calculated the area under the precision-recall curve (PRAUC), as shown in Table 4. Both the precision-recall curves and the PRAUC values were best on the CM-PIN for the two yeast datasets. The improvement rate of the PRAUC value of the 12 node ranking methods on the CM-PIN on YDIP and YBioGRID was: 3.28%-18.29% and 7.18%-54.62% relative to the S-PIN; 5.85%-17.36% and 6.81%-38.55% relative to the D-PIN; 4.61%-15.70% and 0.50%-11.63% relative to the RD-PIN. These results again support the validity of the CM-PIN.
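A precision-recall curve for a ranking can be traced by sweeping the cut-off K over the ranked list. The text does not state how PRAUC was integrated, so the trapezoidal rule used below, and the choice to start the curve at recall 0 with the first observed precision, are assumptions of this sketch; the names and toy data are hypothetical.

```python
def precision_recall_points(ranked_proteins, essential):
    """(recall, precision) after each cut-off K = 1..len(ranked_proteins)."""
    points, true_positives = [], 0
    for k, protein in enumerate(ranked_proteins, start=1):
        if protein in essential:
            true_positives += 1
        points.append((true_positives / len(essential), true_positives / k))
    return points

def pr_auc(points):
    """Trapezoidal area under the curve (an assumed integration rule)."""
    area, prev_recall, prev_precision = 0.0, 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area

# Hypothetical ranking and essential set.
ranked = ["P1", "P2", "P3", "P4"]
essential = {"P1", "P3"}
points = precision_recall_points(ranked, essential)
auc = pr_auc(points)
```

Other integration rules, such as step-wise average precision, give slightly different values, so PRAUC figures are comparable only when computed the same way.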

Table 4 Comparison of PRAUC values of 12 node ranking methods on the S-PIN, D-PIN, RD-PIN and their corresponding CM-PIN on YDIP and YBioGRID datasets


Validated by accuracy

To further evaluate the overall performance of the CM-PIN and the accuracy of essential protein identification, we used the following seven evaluation metrics: sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure (FM), Matthews correlation coefficient (MCC) and accuracy (ACC). Sensitivity is calculated in the same way as recall, and positive predictive value in the same way as precision. The top K proteins in descending order of importance score were assumed to be essential proteins (K = 1130 and K = 1199 are the numbers of essential proteins for YDIP and YBioGRID), and the calculation formulas are as follows:

$$SN = \frac{TP}{{TP + FN}}$$

(8)

$$SP = \frac{TN}{{FP + TN}}$$

(9)

$$PPV = \frac{TP}{{TP + FP}}$$

(10)

$$NPV = \frac{TN}{{TN + FN}}$$

(11)

$$FM = \frac{2 \times SN \times PPV}{{SN + PPV}}$$

(12)

$$MCC = \frac{TP \times TN - FP \times FN}{{\sqrt {(TP + FP)(TP + FN)(TN + FP)(TN + FN)} }}$$

(13)

$$ACC = \frac{TP + TN}{{TP + TN + FP + FN}}$$

(14)

where TP is the number of essential proteins correctly predicted as essential, FP is the number of non-essential proteins incorrectly predicted as essential, TN is the number of non-essential proteins correctly predicted as non-essential, and FN is the number of essential proteins incorrectly predicted as non-essential.
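Equations (8)–(14) follow directly from the four counts once the top K ranked proteins are labelled essential. A minimal sketch with hypothetical names and toy data; it assumes none of the denominators is zero.

```python
from math import sqrt

def confusion_counts(ranked_proteins, essential, top_k):
    """TP, FP, TN, FN when the top_k ranked proteins are predicted essential."""
    predicted = set(ranked_proteins[:top_k])
    rest = set(ranked_proteins[top_k:])
    tp = len(predicted & essential)
    fp = len(predicted - essential)
    fn = len(rest & essential)
    tn = len(rest - essential)
    return tp, fp, tn, fn

def evaluation_metrics(tp, fp, tn, fn):
    """The seven metrics of Eqs. (8)-(14); assumes non-zero denominators."""
    sn = tp / (tp + fn)
    sp = tn / (fp + tn)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    fm = 2 * sn * ppv / (sn + ppv)
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    acc = (tp + tn) / (tp + tn + fp + fn)
    return {"SN": sn, "SP": sp, "PPV": ppv, "NPV": npv,
            "FM": fm, "MCC": mcc, "ACC": acc}

# Hypothetical ranking of six proteins, three of which are essential.
ranked = ["P1", "P2", "P3", "P4", "P5", "P6"]
essential = {"P1", "P2", "P5"}
counts = confusion_counts(ranked, essential, top_k=3)
metrics = evaluation_metrics(*counts)
```

When K equals the number of essential proteins in the ranked list, as in this example, FP equals FN, so SN and PPV coincide.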

Tables 5 and 6 compare the 12 node ranking methods on the seven metrics across the S-PIN, D-PIN, RD-PIN and CM-PIN (RD). On both yeast datasets, the seven metrics of the 12 node ranking methods on the CM-PIN are all better than those on the other three networks, which indicates that refining the network by modules, as proposed in this paper, is feasible and can effectively improve the identification accuracy of essential proteins.

Table 5 Comparison of seven evaluation indices for 12 node ranking methods on the YDIP dataset


Table 6 Comparison of seven evaluation indices for 12 node ranking methods on the YBioGRID dataset


Selection and analysis of thresholds

In this section, taking the RD-PIN of YDIP as an example, we first describe the concrete steps for constructing the CM-PIN from the RD-PIN and the motivation for using the modular structure of the PIN to refine the network. We then analyze how the thresholds are selected. Finally, we list the thresholds used by all the CM-PINs built on the two yeast datasets in this paper.

On the YDIP dataset, the Fast-unfolding algorithm achieved its optimal partition at modularity Q = 0.7408, dividing the RD-PIN into 26 modules. Using the biological information of the proteins and the topological information of the modules, we calculated three metrics for each module in the RD-PIN: PC, NSL, and TF (see Table 7). We also examined the number and proportion of essential proteins in each module and found that they vary between modules: modules with sparse internal interactions or with little biologically important information contain few essential proteins and are potential non-critical modules. For example, the NSL values of modules 1, 24, and 26 are zero, which means that none of their proteins appear in the nucleus compartment, so these modules are likely to be classified as non-critical after threshold screening. Therefore, to obtain a more effective network, we need to identify the more critical modules and remove some of the interactions in modules with little biological and topological information.
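
To make the modularity value Q concrete, the sketch below evaluates the standard Newman-Girvan modularity of a given partition of an unweighted, undirected graph, which is the quantity the Fast-unfolding algorithm seeks to maximize. The function name, the edge-list input format and the toy graph are assumptions for illustration; the sketch scores a partition but does not perform the module discovery itself.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def modularity(edges: Iterable[Tuple[Hashable, Hashable]],
               module_of: Dict[Hashable, int]) -> float:
    """Q = sum over modules c of [ L_c / m - (D_c / (2m))^2 ].

    L_c: number of edges inside module c, D_c: sum of node degrees in c,
    m: total number of edges. Assumes no self-loops or duplicate edges.
    """
    edges = list(edges)
    m = len(edges)
    if m == 0:
        raise ValueError("graph has no edges")
    intra = defaultdict(int)        # L_c
    degree_sum = defaultdict(int)   # D_c
    for u, v in edges:
        cu, cv = module_of[u], module_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy graph: two triangles joined by a single edge (made-up node names).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_modules = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_module = {node: 0 for node in two_modules}
q_two = modularity(edges, two_modules)
q_one = modularity(edges, one_module)
```

Placing each triangle in its own module gives Q = 5/14, whereas merging everything into one module gives Q = 0, which is why a modularity-maximizing method prefers the split.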

Table 7 Biological and topological characterization of each module in the RD-PIN on YDIP dataset


To understand how the thresholds affect the selection of critical modules and the performance of the network, we chose candidate values according to the distribution of the three metrics across modules: th₁ ∈ {−0.02, −0.005, 0.015}, th₂ ∈ {1.5, 2}, th₃ ∈ {0.25, 0.5}. Table 8 lists the identification accuracy of essential proteins obtained with different threshold values (the results in the table are for LID on the different networks). When th₁ and th₂ were small and th₃ was large, more critical modules were selected. In this case, a large amount of noise remained in the network and the improvement in identification accuracy was not significant. For example, when th₁ = −0.02, th₂ = 1.5 and th₃ = 0.5, the identification accuracy at top 600 and the PRAUC improved compared with the RD-PIN, but the identification accuracy at top 1130 was lower than on the RD-PIN. In contrast, when th₁ and th₂ were larger, fewer critical modules were selected. In this case, critical parts of the network may have been removed, and the improvement was again not optimal. For example, when th₁ = 0.015, th₂ = 2 and th₃ = 0.5, the identification accuracy of LID at top 1130 on the CM-PIN was still lower than on the RD-PIN. Of the three thresholds, th₁ and th₂ have the greater impact on module selection, because biological information can better assist in identifying essential proteins than the topological information of the network. The optimal CM-PIN on the YDIP dataset is obtained when th₁ = −0.005, th₂ = 2 and th₃ = 0.25.
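
One way to organize such a threshold sweep is to enumerate the Cartesian product of the candidate values and score each setting. The sketch below is only an illustration: `best_setting` and the `score` callback are hypothetical names, and the toy score is a stand-in for whatever evaluation (for example, rebuilding the CM-PIN and measuring PRAUC) is actually used. This section does not state whether Table 8 covers every combination.

```python
from itertools import product
from typing import Callable, List, Tuple

# Candidate values quoted in the text.
TH1 = (-0.02, -0.005, 0.015)
TH2 = (1.5, 2.0)
TH3 = (0.25, 0.5)

Setting = Tuple[float, float, float]


def threshold_grid() -> List[Setting]:
    """All (th1, th2, th3) combinations of the candidate values."""
    return list(product(TH1, TH2, TH3))


def best_setting(grid: List[Setting], score: Callable[[Setting], float]) -> Setting:
    """Return the setting with the highest score (hypothetical scoring callback)."""
    return max(grid, key=score)


grid = threshold_grid()

# Toy score for demonstration only: it has no biological meaning.
toy_best = best_setting(grid, score=lambda setting: sum(setting))
```

With three, two and two candidate values, the grid contains 12 settings, which is small enough to evaluate exhaustively.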

Table 8 The variation of the effect of thresholds on the selection of critical modules and the performance of the network


Finally, Table 9 lists the selection thresholds and module information of the CM-PINs constructed on the two yeast datasets in this paper.

Table 9 The selection thresholds and module information of CM-PINs constructed in YDIP and YBioGRID datasets


Analysis of reasons for the improvement of identification accuracy of essential proteins

To explain why the identification accuracy of each node ranking method is higher on the CM-PIN than on the other three networks (S-PIN, D-PIN, RD-PIN), we calculated, for each node ranking method, the ratio of essential proteins among the proteins that differ between the top 600 on the CM-PIN and the top 600 on each of the other three networks, as shown in Fig. 6. On the CM-PIN, each node ranking method identifies some essential proteins that it cannot identify on the other three networks. Even compared with the RD-PIN, the best of the three networks, some node ranking methods identify a large share of different essential proteins in the top 600 on the CM-PIN; for example, CC identifies 31.3% different essential proteins on the CM-PIN that it cannot identify on the RD-PIN. This contributes to the higher essential protein identification accuracy of each node ranking method on the CM-PIN.
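
The comparison of different proteins can be sketched as follows, under one reading of the analysis: the "different proteins" are taken to be those in the top N of one ranking but not in the top N of the other, and the ratio is the fraction of them that are essential. The function name and this interpretation are assumptions, not a description of how Fig. 6 was produced.

```python
from typing import Sequence, Set, Tuple


def different_essential_ratio(ranked_a: Sequence[str],
                              ranked_b: Sequence[str],
                              essential: Set[str],
                              top: int = 600) -> Tuple[int, float]:
    """Count proteins in the top of ranking A but not of ranking B,
    and return (count, fraction of those proteins that are essential)."""
    different = set(ranked_a[:top]) - set(ranked_b[:top])
    if not different:
        return 0, 0.0
    n_essential = len(different & essential)
    return len(different), n_essential / len(different)


# Toy rankings with made-up protein names; top is reduced to 3 for readability.
ranking_on_cm_pin = ["a", "b", "c", "d"]
ranking_on_rd_pin = ["a", "e", "f", "b"]
essential = {"b", "x"}
count, ratio = different_essential_ratio(
    ranking_on_cm_pin, ranking_on_rd_pin, essential, top=3)
```

Here the two top-3 lists share only "a", so two proteins differ and one of them is essential.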

Validated on Homo sapiens

To further verify whether the proposed network refinement method is also effective in other species, we constructed the corresponding CM-PINs from the S-PIN, D-PIN and RD-PIN of the Homo sapiens dataset (Table 10 lists the module information and selected thresholds of the CM-PIN obtained from each network) and compared the performance of the 12 node ranking methods on these networks (see Table 11). The 12 node ranking methods perform best on the CM-PIN in almost all cases. Because the HDIP dataset contains fewer raw interactions, the performance of the node ranking methods on the twice-refined PIN (RD-PIN) is inferior to that on the once-refined PIN (D-PIN). This is why individual indices of the WDC method on the CM-PIN (refined from the RD-PIN) are inferior to those on the RD-PIN. Compared with the S-PIN, D-PIN and RD-PIN, the CM-PINs improve the PRAUC values of the 12 node ranking methods by 14.37%-47.57%, 6.41%-24.90% and 11.23%-28.11%, respectively. These results suggest that the network refinement method in this paper is applicable to more than one species and can improve the performance of node ranking methods by producing a more effective network, the CM-PIN.

Table 10 The selection thresholds and module information of CM-PINs constructed on HDIP dataset


Table 11 Comparison of various evaluation indicators of 12 node ranking methods on the S-PIN, D-PIN, RD-PIN and the CM-PIN on HDIP dataset (top 100/top 600/MCC/FM/ACC/PRAUC)


Conclusions and perspectives

In this paper, we proposed a protein interaction network refinement method based on module discovery and biological information. First, we extract the maximal connected subgraph of a given PIN and use the Fast-unfolding module discovery algorithm to divide it into modules. Second, we select critical modules by using protein orthologous information, subcellular localization information, and the topological information of each module in the PIN. Third, we construct a more refined network (CM-PIN) from the identified critical modules.

To verify the effectiveness of this method, we constructed CM-PINs from three networks (S-PIN, D-PIN and RD-PIN) of two species (Saccharomyces cerevisiae and Homo sapiens) and compared the performance of 12 node ranking methods (LAC, DC, DMNC, NC, TP, LID, CC, BC, PR, LR, PeC, WDC) on the CM-PIN with that on the three networks. In terms of the number of essential proteins identified in the top 100 to top 600, the Jackknifing method, the area under the precision-recall curve (PRAUC), sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure (FM), Matthews correlation coefficient (MCC) and accuracy (ACC), the node ranking methods perform better on the CM-PIN than on the S-PIN, D-PIN and RD-PIN. On the three datasets, YDIP and YBioGRID for Saccharomyces cerevisiae and HDIP for Homo sapiens, the highest improvement in the PRAUC value of a node ranking method on the CM-PIN was 18.29%, 54.62% and 47.57% relative to the S-PIN; 17.36%, 38.55% and 24.90% relative to the D-PIN; and 15.70%, 11.63% and 28.11% relative to the RD-PIN. These results indicate that the CM-PIN can effectively filter out false positives and false negatives and is thus a higher-quality network.

In future work, we will consider contributing further to the identification of essential proteins, the elucidation of disease mechanisms and the design of targeted drugs from the following three perspectives. First, from the perspective of network refinement, the modular characteristics of the network can be combined with other factors to construct a more effective network. For example, other biological information about proteins, such as structure or annotation information, can be used to further refine unreliable interactions within critical modules. Second, from the perspective of module discovery, different module discovery algorithms, such as clustering algorithms based on biological sequences [56] and attribute graphs [57], can be tried to obtain more accurate partitions of protein–protein interaction networks. Third, the modules discovered or the critical modules detected in a protein–protein interaction network can also be used as features to assist with other biological problems, for example, the classification of Golgi proteins [58], the classification of microorganisms' function proteins [59], and the design of protein acetylation sites [60].

Availability of data and materials

The datasets used in this study, including PINs, gene expression profiles, subcellular localization information, orthologous information, and standard essential proteins, are from public databases (DIP: http://dip.doe-mbi.ucla.edu; BioGRID: https://thebiogrid.org). The source code for the CM-PIN method and all datasets used in this paper have been uploaded to: https://github.com/paopaopig/The-construction-of-the-CM-PIN..git

Abbreviations

PIN: Protein–protein interaction network
S-PIN: A network constructed from raw protein–protein interaction dataset
D-PIN: A network refined by S-PIN and gene expression profiles
RD-PIN: A network refined by D-PIN and subcellular localization information
CM-PIN: A refined network based on module discovery and biological information

References

1. Winzeler EA, Shoemaker DD, Astromoff A, et al. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science. 1999;285(5429):901–6.
2. Cullen LM, Arndt GM. Genome-wide screening for gene function using RNAi in mammalian cells. Immunol Cell Biol. 2005;83(3):217–23.
3. Giaever G, Chu AM, Ni L, et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature. 2002;418(6896):387–91.
4. Roemer T, Jiang B, et al. Large-scale essential gene identification in Candida albicans and applications to antifungal drug discovery. Mol Microbiol. 2010;50(1):167–81.
5. Li X, Li W, Zeng M, et al. Network-based methods for predicting essential genes or proteins: a survey. Brief Bioinform. 2020;21(2):566–83.
6. Jeong HM, Mason SP, Barabasi AL, et al. Lethality and centrality in protein networks. Nature. 2001;411(6833):41–2.
7. Li M, Wang J, Chen X, et al. A local average connectivity-based method for identifying essential proteins from the network level. Comput Biol Chem. 2011;35(3):143–50.
8. Wang J, Li M, Wang H, et al. Identification of essential proteins based on edge clustering coefficient. IEEE/ACM Trans Comput Biol Bioinf. 2012;9(4):1070–80.
9. Lin CY, Chin CH, Wu HH, et al. Hubba: hub objects analyzer—a framework of interactome hubs identification for network biology. Nucleic Acids Res. 2008;36(suppl_2):W438–43.
10. Li M, Lu Y, Wang J, Wu FX, Pan Y. A topology potential-based method for identifying essential proteins from PPI networks. IEEE/ACM Trans Comput Biol Bioinf. 2015;12(2):372–83.
11. Qi Y, Luo J. Prediction of essential proteins based on local interaction density. IEEE/ACM Trans Comput Biol Bioinf. 2015;13(6):1170–82.
12. Wuchty S, Stadler PF. Centers of complex networks. J Theor Biol. 2003;223:45–53.
13. Joy MP, Brock A, Ingber DE, et al. High-betweenness proteins in the yeast protein interaction network. J Biomed Biotechnol. 2005;2:96–103.
14. Brin S, Page L. The anatomy of a large-scale hypertextual Web search engine. Comput Netw ISDN Syst. 1998;30(1–7):107–17.
15. Lü L, Zhang YC, Yeung CH, et al. Leaders in social networks, the delicious case. PLoS ONE. 2011;6(6):e21202.
16. Li M, Zhang H, Wang J, et al. A new essential protein discovery method based on the integration of protein–protein interaction and gene expression data. BMC Syst Biol. 2012;6(1):1–9.
17. Tang X, Wang J, Zhong J, et al. Predicting essential proteins based on weighted degree centrality. IEEE/ACM Trans Comput Biol Bioinf. 2013;11(2):407–18.
18. Qin C, Sun Y, Dong Y. A new method for identifying essential proteins based on network topology properties and protein complexes. PLoS ONE. 2016;11(8):e0161042.
19. Li M, Lu Y, Niu Z, Wu F. United complex centrality for identification of essential proteins from PPI networks. IEEE/ACM Trans Comput Biol Bioinf. 2017;14(2):370–80.
20. Lei X, Yang X. A new method for predicting essential proteins based on participation degree in protein complex and subgraph density. PLoS ONE. 2018;13(6):e0198998.
21. Zhong J, Tang C, Peng W, et al. A novel essential protein identification method based on PPI networks and gene expression data. BMC Bioinform. 2021;22(1):1–21.
22. Von Mering C, Krause R, Snel B, et al. Comparative assessment of large-scale data sets of protein–protein interactions. Nature. 2002;417(6887):399–403.
23. Xiao Q, Wang J, Peng X, et al. Identifying essential proteins from active PPI networks constructed with dynamic gene expression. BMC Genomics. 2015;16(3):1–7.
24. Li M, Ni P, Chen X, et al. Construction of refined protein interaction network for predicting essential proteins. IEEE/ACM Trans Comput Biol Bioinf. 2017;16(4):1386–97.
25. Meng X, Li W, Peng X, et al. Protein interaction networks: centrality, modularity, dynamics, and applications. Front Comp Sci. 2021;15(6):1–17.
26. Mitra K, Carvunis AR, Ramesh SK, et al. Integrative approaches for finding modular structure in biological networks. Nat Rev Genet. 2013;14(10):719–32.
27. Hart GT, Lee I, Marcotte EM. A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality. BMC Bioinform. 2007;8(1):1–11.
28. Zotenko E, Mestre J, O’Leary DP, et al. Why do hubs in the yeast protein interaction network tend to be essential: reexamining the connection between the network topology and essentiality. PLoS Comput Biol. 2008;4(8):e1000140.
29. Newman MEJ, Girvan M. Finding and evaluating community structure in networks. Phys Rev E. 2004;69(2):026113.
30. Blondel VD, Guillaume JL, Lambiotte R, et al. Fast unfolding of communities in large networks. J Stat Mech: Theory Exp. 2008;2008(10):P10008.
31. Lancichinetti A, Fortunato S. Community detection algorithms: a comparative analysis. Phys Rev E. 2009;80(5):056117.
32. Palla G, Derényi I, Farkas I, et al. Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005;435(7043):814–8.
33. Li M, Meng X, Zheng R, et al. Identification of protein complexes by using a spatial and temporal active protein interaction network. IEEE/ACM Trans Comput Biol Bioinf. 2017;17(3):817–27.
34. Hu L, Pan X, Tang Z, et al. A fast fuzzy clustering algorithm for complex networks via a generalized momentum method. IEEE Trans Fuzzy Syst. 2021;30(9):3473–85.
35. Hu L, Yang Y, Tang Z, et al. FCAN-MOPSO: an improved fuzzy-based graph clustering algorithm for complex networks with multi-objective particle swarm optimization. IEEE Trans Fuzzy Syst. 2023.
36. Yang Y, Su X, Zhao B, et al. Fuzzy-based deep attributed graph clustering. IEEE Trans Fuzzy Syst. 2023.
37. Zhang Z, Ruan J, Gao J, et al. Predicting essential proteins from protein-protein interactions using order statistics. J Theor Biol. 2019;480:274–83.
38. Wang H, Pan L, Sun J, et al. Centrality combination method based on feature selection for protein interaction networks. IEEE Access. 2022;10:112028–42.
39. Li B, Pan L, Sun J, et al. A node ranking method based on multiple layers for dynamic protein interaction networks. IEEE Access. 2022;10:93326–37.
40. Barabasi AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004;5(2):101–13.
41. Nacher JC, Hayashida M, Akutsu T. Emergence of scale-free distribution in protein–protein interaction networks based on random selection of interacting domain pairs. Biosystems. 2009;95(2):155–9.
42. Zhao B, Wang J, Li X, et al. Essential protein discovery based on a combination of modularity and conservatism. Methods. 2016;110:54–63.
43. Salwinski L, Miller CS, Smith AJ, et al. The database of interacting Proteins: 2004 update. Nucleic Acids Res. 2004;32:D449–51.
44. Stark C, Breitkreutz BJ, Chatr-Aryamontri A, et al. The BioGRID interaction database: 2011 update. Nucleic Acids Res. 2010;39(suppl_1):D698–D704.
45. Schapke J, Tavares A, Recamonde-Mendoza M. Epgat: gene essentiality prediction with graph attention networks. IEEE/ACM Trans Comput Biol Bioinf. 2021;19(3):1615–26.
46. Zhang R, Lin Y. DEG 5.0, a database of essential genes in both prokaryotes and eukaryotes. Nucleic Acids Res. 2009;37(suppl_1):D455–D458.
47. Mewes HW, Frishman D, Mayer KFX, et al. MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res. 2006;34(suppl_1):D169–72.
48. Chen WH, Lu G, Chen X, et al. OGEE v2: an update of the online gene essentiality database with special focus on differentially essential genes in human cancer cell lines. Nucleic Acids Res. 2016:gkw1013.
49. Tu BP, Andrzej K, Maga R, et al. Logic of the yeast metabolic cycle: temporal compartmentalization of cellular processes. Science. 2005;310(5751):1152–8.
50. Aran D, Camarda R, Odegaard J, et al. Comprehensive analysis of normal adjacent to tumor transcriptomes. Nat Commun. 2017;8(1):1077.
51. Binder JX, Pletscher-Frankild S, Tsafou K, et al. COMPARTMENTS: unification and visualization of protein subcellular localization evidence. Database. 2014.
52. Östlund G, Schmitt T, Forslund K, et al. InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res. 2010;38(suppl_1):D196–D203.
53. Sonnhammer ELL, Östlund G. InParanoid 8: orthology analysis between 273 proteomes, mostly eukaryotic. Nucleic Acids Res. 2015;43(D1):D234–9.
54. Holman AG, Davis PJ, Foster JM, et al. Computational prediction of essential genes in an unculturable endosymbiotic bacterium, Wolbachia of Brugia malayi. BMC Microbiol. 2009;9(1):243.
55. Meng X, Li W, Xiang J, et al. Temporal-spatial analysis of the essentiality of hub proteins in protein-protein interaction networks. IEEE Trans Netw Sci Eng. 2022;9(5):3504–14.
56. Li G, Zhao B, Su X, et al. Discovering consensus regions for interpretable identification of RNA N6-methyladenosine modification sites via graph contrastive clustering. IEEE J Biomed Health Inform. 2024.
57. Hu L, Pan X, Yan H, et al. Exploiting higher-order patterns for community detection in attributed graphs. Integr Comput-Aided Eng. 2021;28(2):207–18.
58. Bao W, Gu Y, Chen B. Golgi_DF: golgi proteins classification with deep forest. Front Neurosci. 2023;17:1197824.
59. Bao W, Liu Y, Chen B. Oral_voting_transfer: classification of oral microorganisms’ function proteins with voting transfer model. Front Microbiol. 2024;14:1277121.
60. Bao W, Yang B. Protein acetylation sites with complex-valued polynomial model. Front Comput Sci. 2024;18(3):183904.


Acknowledgements

The authors are very grateful to all reviewers for their suggestions on this paper.

Funding

This work was supported by the Hunan Provincial Natural Science Foundation of China under Grants 2024JJ7213, 2024JJ7208 and 2024JJ7207.

Author information

Li Pan and Haoyue Wang contributed equally to this work.

Authors and Affiliations

Hunan Institute of Science and Technology, Yueyang, 414006, China
Li Pan, Haoyue Wang, Bo Yang & Wenbin Li
Hunan Engineering Research Center of Multimodal Health Sensing and Intelligent Analysis, Yueyang, 414006, China
Li Pan & Bo Yang


Contributions

Li Pan and Haoyue Wang contributed equally to this work. All authors read and approved the manuscript.

Corresponding authors

Correspondence to Haoyue Wang or Wenbin Li.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article

Cite this article

Pan, L., Wang, H., Yang, B. et al. A protein network refinement method based on module discovery and biological information. BMC Bioinformatics 25, 157 (2024). https://doi.org/10.1186/s12859-024-05772-z


Received: 21 February 2024
Accepted: 10 April 2024
Published: 20 April 2024
DOI: https://doi.org/10.1186/s12859-024-05772-z
