A seed expansion-based method to identify essential proteins by integrating protein–protein interaction subnetworks and multiple biological characteristics
BMC Bioinformatics volume 24, Article number: 452 (2023)
Abstract
Background
The identification of essential proteins is of great significance in biology and pathology. However, protein–protein interaction (PPI) data obtained through high-throughput technology include a high number of false positives. To overcome this limitation, numerous computational algorithms based on biological characteristics and topological features have been proposed to identify essential proteins.
Results
In this paper, we propose a novel method named SESN for identifying essential proteins. It is a seed expansion method based on PPI subnetworks and multiple biological characteristics. Firstly, SESN utilizes gene expression data to construct PPI subnetworks. Secondly, seed expansion is performed simultaneously in each subnetwork, and the expansion process is based on the topological features of predicted essential proteins. Thirdly, the error correction mechanism is based on multiple biological characteristics and the entire PPI network. Finally, SESN analyzes the impact of each biological characteristic, including protein complex, gene expression data, GO annotations, and subcellular localization, and adopts the biological data with the best experimental results. The output of SESN is a set of predicted essential proteins.
Conclusions
The analysis of each component of SESN indicates the effectiveness of all components. We conduct comparison experiments using three datasets from two species, and the experimental results demonstrate that SESN achieves superior performance compared to other methods.
Background
Essential proteins are crucial and indispensable for cellular activities [1]. The identification of essential proteins promotes an understanding of the minimal requirements for cell survival and reproduction. The study of essential proteins is beneficial for discovering pathogenic genes and generating novel approaches for disease treatment [2, 3]. The identification of essential proteins therefore plays a crucial role in advancing research and development in the fields of biology and pathology.
Experimental methods for identifying essential proteins include single gene knockouts [4], gene knockdown [5], and RNA interference [6]. Although these methods have high accuracy, the experiments are expensive, time-consuming, and inefficient, and they remain limited to certain species. With the rapid development of bioinformatics, a large amount of PPI data has been measured through high-throughput technology, which makes research at the PPI network level possible. Research based on PPI networks has become a focal point in the field of bioinformatics [7]. However, PPI data obtained through high-throughput technology include a high rate of false positives [8, 9]. To reduce their impact, researchers have attempted various methods to construct weighted PPI networks that remove false positive interactions. These network-based methods have proved effective in the identification of essential proteins [10].
Some researchers concentrate on the identification of essential proteins based on the topology of the PPI network. Topology-based methods can generally be divided into three categories: local topology-based, global topology-based and multi-topology-based. Local topology-based methods assess a protein’s essentiality through its local neighborhood, such as Degree Centrality (DC) [11], Eigenvector Centrality (EC) [12], Local Average Connectivity (LAC) [13], and Neighborhood Centrality (NC) [14]. Global topology-based methods, including Betweenness Centrality (BC) [15], Closeness Centrality (CC) [16], Information Centrality (IC) [17], and Subgraph Centrality (SC) [18], measure topological properties globally based on characteristics of paths or shortest paths between proteins. All the above-mentioned approaches are included in CytoNCA [19], which is a plugin of Cytoscape. Multi-topology-based methods combine various topological characteristics. For example, SIGEP [20] presents a p-value calculation method, which utilizes network topology characteristics (degree and local clustering coefficient) as test statistics and can outperform the aforementioned methods. Nonetheless, all these topology-based methods ignore the topology characteristics of predicted essential proteins.
Predicting essential proteins only by using network topology ignores the biological properties of proteins. In recent years, researchers have discovered that the biological characteristics of proteins are closely related to their essentiality. PeC [21] and JDC [22] identify essential proteins by integrating PPI networks and gene expression data. LNSPF [23] identifies essential proteins based on gene expression data, subcellular localization, homologous information and topological features. RSG [24] is an essential protein prediction method based on RNA-Seq, subcellular localization, and GO annotation datasets; its experimental results cover two species (Saccharomyces cerevisiae and Drosophila melanogaster). RWEP [25] adopts a random walk algorithm and integrates topological and biological properties to determine protein essentiality in PPI networks, and it outperforms PeC and RSG in predicting essential proteins. RWEP incorporates multiple biological properties to enhance the efficiency of essential protein prediction. However, it is unclear which biological data is the most effective. CPPK and CEPPK [26] predict essential proteins by integrating network topology, gene expression data, and certain essential proteins as prior knowledge. However, the performance of CPPK depends excessively on the number of essential proteins used as prior knowledge. NCCO [27] combines orthology datasets from the species S. cerevisiae and E. coli with network topology to predict essential proteins. RWO [28] utilizes orthologous relationships between yeast and human PPI networks. All these methods that integrate PPI network topology with biological data are more effective than those based solely on network topology. However, the specific impact of each type of biological data, and of the individual components of these methods, on the final prediction results remains unknown.
Researchers have also proposed deep learning frameworks that integrate biological features and network topology features to identify essential proteins. DeepEP [29] utilizes multiscale convolutional neural networks to extract biological features from gene expression profiles, while the node2vec [30] technique is applied to automatically learn topological features from PPI networks. These features are then concatenated to predict essential proteins. Zeng et al. [31] propose a deep learning framework for automatically learning biological features without prior knowledge. They employ the node2vec technique to automatically acquire a richer representation of the PPI network topology. Bidirectional long short-term memory cells [32] are employed to capture non-local relationships in gene expression data. Additionally, they utilize a high-dimensional indicator vector to characterize biological features related to subcellular localization. Yue et al. [33] propose a deep learning framework for predicting essential proteins by integrating features obtained from the PPI network, subcellular localization, and gene expression profiles.
In recent years, some studies have been dedicated to constructing PPI subnetworks through gene expression analysis in order to infer the activity of protein interactions. TSPIN [34] constructs a network by using gene expression data and subcellular localization to identify essential proteins. TPWDPIN [35] mines protein complexes from weighted dynamic PPI subnetworks constructed by gene expression data. Inspired by this, we construct PPI subnetworks by gene expression data and perform the process of seed expansion in these subnetworks.
In this research, we propose an effective method for identifying essential proteins, called SESN. SESN is a seed expansion method based on PPI subnetworks and biological characteristics. The PPI network forms an undirected graph, where proteins serve as nodes and protein–protein interactions as edges. To filter false positive interactions in the PPI network, we integrate multiple biological characteristics to weight the edges and nodes of the PPI network and construct PPI subnetworks based on gene expression data. Seed expansion is performed simultaneously in each subnetwork, and the expansion results of all the subnetworks are summarized to the whole PPI network. To avoid relying solely on essential proteins, we do not select seeds from the essential proteins dataset. Instead, each subnetwork randomly selects a protein as a seed. The expansion process is based on the topological features of the predicted essential proteins in each subnetwork. In this process, we select the protein that is most closely related to the predicted essential proteins and add it to the set of predicted essential proteins. The error correction mechanism filters out proteins that have been expanded but exhibit low essentiality. The weight of a protein in the whole PPI network represents the essentiality of this protein. To ensure the ongoing expansion of the set of predicted essential proteins, after removing a protein, we expand the protein with the highest weight that is strongly associated with the predicted essential proteins. SESN evaluates the influence of biological data on experimental outcomes and identifies the most effective data to achieve optimal results. The output of SESN consists of a set of predicted essential proteins expanded by the seeds of all subnetworks. Proteins that expand earlier are given higher rankings. Comparative experiments are conducted across three datasets from two species.
The experimental results demonstrate that, when compared with other methods (CPPK, CEPPK, RWEP, SIGEP, RWO and TSPIN), SESN achieves the best results across three datasets. Analysis of each component within SESN reveals that all components are effective, with particular emphasis on the error correction mechanism.
The contributions of SESN are outlined as follows: (1) SESN constructs weighted PPI subnetworks by integrating multiple biological data and conducts simultaneous seed expansion within each subnetwork. (2) The seed expansion process integrates the topological characteristics of predicted essential proteins in the subnetworks. (3) The error correction mechanism integrates the topological characteristics of predicted essential proteins in the whole PPI network. (4) SESN selects the biological data yielding the best experimental results and integrates multiple biological characteristics to assign weights to both PPI subnetworks and the whole PPI network.
The overall process of SESN is shown in Fig. 2, which provides an example to illustrate the process of seed expansion. The green part represents the initialization of the weighted PPI network and the weighted subnetworks. The detailed process of constructing weighted subnetworks is shown in Fig. 1. The yellow part represents the seed expansion process, and the yellow rounded rectangle on the right illustrates the expansion process of K (the definition of K is provided in section ‘Initialization of the seed set and the weight of node and edge’). Initially, there are 4 subnetworks, so K is initialized with 4 nodes named node 1, 2, 3 and 4. Then, the expansion process is performed simultaneously in these 4 subnetworks. The node with the highest weight is chosen and added to K. Following this, the error correction mechanism filters out the node with the lowest weight in K and introduces node m1. Among all the neighbors of K, m1 is the node with the largest weight and the closest connection to K. The expansion of K persists until its length reaches the output length n. The blue part illustrates the experimental process of SESN. To prove the superiority and effectiveness of SESN, we compare SESN with other methods and analyze each component of SESN.
Methods
Experimental datasets
To prove the superiority of SESN across different species, experiments are conducted on Saccharomyces cerevisiae and Drosophila melanogaster. We utilize PPI datasets, essential proteins, protein complexes, gene expression data, GO annotations and subcellular localization. Additionally, we perform ID mapping across different datasets using UniProt (https://www.uniprot.org/) as a reference.
PPI datasets For Saccharomyces cerevisiae, the PPI datasets can be downloaded from DIP [36] (version of 20101010) and BioGRID [37]. For Drosophila melanogaster, the PPI dataset is downloaded from BioGRID; to distinguish it from the Saccharomyces cerevisiae datasets, it is denoted as fruitfly. The number of proteins, the number of essential proteins and other relevant information for these datasets are presented in Table 1.
Essential proteins For Saccharomyces cerevisiae, essential proteins are selected from MIPS [38], SGD [39], DEG [40], and OGEE [41], and there are 1285 essential proteins in total. For Drosophila melanogaster, essential proteins are selected from DEG and OGEE. After ID mapping and the removal of duplicate proteins, the fruitfly PPI dataset contains 493 essential proteins.
Protein complex For Saccharomyces cerevisiae, protein complexes are collected from MIPS, SGD, ALOY [42], and CYC2008 [43, 44]. Only protein complexes containing two or more proteins are retained, resulting in a total of 745 protein complexes in the final dataset. For Drosophila melanogaster, protein complexes are obtained from APMS [45]. After mapping these complexes with the fruitfly PPI datasets, the dataset encompasses 1637 protein complexes.
Gene expression data Gene expression data of Saccharomyces cerevisiae and Drosophila melanogaster can be downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/browse/) with accession GSE3431 [46] and GSE7763 [47], respectively. The probe data matrix for Saccharomyces cerevisiae consists of 9335 rows, while the probe data matrix for Drosophila melanogaster comprises 18952 rows. To map with PPI datasets, we download SOFT formatted family files from GEO. In cases where multiple probe data correspond to a single ID of PPI datasets, we take the average value of multiple probe data. After preprocessing, we obtain 4981, 5318, and 7378 gene expression data for DIP, BioGRID, and fruitfly, respectively.
GO annotations and Subcellular localization For Saccharomyces cerevisiae, GO annotation data is available from (https://downloads.yeastgenome.org/curation/literature/go_slim_mapping.tab). For Drosophila melanogaster, GO annotation data is extracted from the COMPARTMENTS database [48]. Subcellular localization is downloaded from the knowledge channel of the COMPARTMENTS database.
Gene expression data-based method for constructing PPI subnetworks
Gene expression data is presented in the form of an expression matrix, where each row represents the expression level of a protein, and each column corresponds to the expression level of a sample point. The number of sample points varies across different species. In the case of Saccharomyces cerevisiae, there are 12 sample points, while for Drosophila melanogaster, there are 34 sample points. Each sample point corresponds to an average gene expression value.
For gene g of Saccharomyces cerevisiae, the average gene expression value at sample point i can be expressed as Eq. 1.
\[Ge_{i}(g) = \frac{expr_{i}(g) + expr_{i+12}(g) + expr_{i+24}(g)}{3}, \quad i \in [1,12] \tag{1}\]
where \(expr_{i}(g)\) represents the gene expression value from the expression matrix, and i denotes the sample point number. The gene expression data of Saccharomyces cerevisiae comprises three cell cycles, each containing 12 time points. Each sample point corresponds to the average gene expression value at a specific time point across the three cell cycles. For gene g of Drosophila melanogaster, the average gene expression value at sample point i can be expressed as Eq. 2.
\[Ge_{i}(g) = \frac{1}{4}\sum_{j=1}^{4} expr_{4(i-1)+j}(g), \quad i \in [1,34] \tag{2}\]
where \(expr_{i}(g)\) represents the gene expression value in the expression matrix, and i signifies the sample point number. The gene expression data for Drosophila melanogaster consists of 136 column values, with every set of 4 columns corresponding to 4 repeated experiments for one sample.
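The two averaging schemes described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the column layout (yeast columns ordered cycle by cycle, fruit fly repeats stored in consecutive columns) is an assumption drawn from the description in the text.

```python
def average_cycles(row, n_points=12, n_cycles=3):
    """Yeast-style averaging: time point i is averaged over the cell cycles,
    i.e. over columns i, i + 12 and i + 24 (assumed cycle-by-cycle layout)."""
    if len(row) != n_points * n_cycles:
        raise ValueError("unexpected number of columns")
    return [
        sum(row[i + c * n_points] for c in range(n_cycles)) / n_cycles
        for i in range(n_points)
    ]


def average_replicates(row, n_repeats=4):
    """Fruit-fly-style averaging: each consecutive block of 4 columns holds the
    repeated experiments of one sample (assumed layout) and is averaged."""
    if len(row) % n_repeats != 0:
        raise ValueError("column count is not a multiple of the repeat count")
    return [
        sum(row[j:j + n_repeats]) / n_repeats
        for j in range(0, len(row), n_repeats)
    ]
```

With 36 yeast columns the first function returns 12 values, and with 136 fruit fly columns the second returns 34 values, matching the number of sample points stated in the text.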
Based on gene expression data, we can construct PPI subnetworks. Saccharomyces cerevisiae contains 12 subnetworks, and Drosophila melanogaster contains 34 subnetworks. PPI networks can be abstracted into a graph \(G=(V,E)\), where V is a set of nodes and E is a set of edges. Proteins are abstracted into nodes, and protein–protein interactions are abstracted into edges. The subnetworks based on gene expression data are denoted as \(G_{i}\), so that G can be represented by the collection \(G=\{G_{1},G_{2},\dots ,G_{i},\dots ,G_{n}\}\), where n represents the number of sample points of gene expression data. Each \(G_{i}=(V_{i},E_{i})\) is a subnetwork of G, with \(V_{i} \subseteq V\), \(E_{i} \subseteq E\). For any edge \(e \in E\), the two proteins in e are denoted as \(v_{a}\) and \(v_{b}\). Only if both \(v_{a}\) and \(v_{b}\) are actively expressed at sample point i will e be added to \(E_{i}\). This approach effectively filters out noisy edges from subnetwork \(G_{i}\).
If the gene expression value at a sample point is greater than a threshold, the corresponding protein is considered to be active at this sample point. To determine this threshold, the 3-sigma model calculates the active expression threshold of each protein according to the characteristics of its expression value curve [49]. For gene g, the arithmetic mean and standard deviation of its gene expression data are Avg(g) and \(\sigma (g)\), respectively. Avg(g) and \(\sigma (g)\) can be expressed as follows:
where n is the number of sample points of gene expression data. The value of \(\sigma (g)\) reflects the fluctuation of gene expression data. The k-sigma (k=1,2,3) threshold is calculated by the three-sigma method [50,51,52,53], which is defined as Eq. 5.
where Avg(g) and \(\sigma (g)\) are calculated by Eqs. 3 and 4, respectively. If \(\sigma (g)\) is very small, \(Ge_{i}(g)\) is close to Avg(g), and \(Thr_{k}(g)\) is close to Avg(g). Conversely, if \(\sigma (g)\) is very large, \(Ge_{i}(g)\) is not concentrated around Avg(g), but represents a set of strongly oscillating data. In such cases, \(Thr_{k}(g)\) is close to \(Avg(g)+k \cdot \sigma (g)\), where k is a multiple of \(\sigma (g)\). \(Thr_{k}(g)\) is positively correlated with k: a larger k results in a higher \(Thr_{k}(g)\). When \(k=3\), \(Thr_{k}(g)\) achieves the highest confidence. For instance, if \(Ge_{i}(g)\ge Thr_{3}(g)\), \(Ap_{i}(g)\) takes its largest value, 0.99 (as defined by Eq. 6).
It is assumed that a set of gene expression data follows a probability distribution similar to the normal distribution. If this assumption is correct and the mean and standard deviation of this group of data are denoted as \(\mu\) and \(\sigma\), respectively, then \(\textrm{P}\{|x-\mu| <3 \sigma \} \approx 0.99\), \(\textrm{P}\{|x-\mu| <2 \sigma \} \approx 0.95\), and \(\textrm{P}\{|x-\mu| < \sigma \} \approx 0.68\). Based on this theory, the probability of active expression of gene g at sample point i can be calculated as follows:
To further measure the reliability of protein interaction edges in each subnetwork, we construct weighted subgraphs. For an edge \(e=(v,u) \in E_{i}\) in the weighted subgraph \(G_{i}=(V_{i},E_{i},W_{i})\), where protein v corresponds to gene v and protein u corresponds to gene u, we define \(W_{i}(v,u)=Ap_{i}(v) \cdot Ap_{i}(u)\). The weight of an edge represents the probability that both gene v and gene u are active. ID mapping between genes and proteins is described in section ‘Experimental datasets’.
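The thresholding and edge-weighting steps of this section can be sketched as follows. The sketch rests on several assumptions that are not confirmed by the text here: the threshold takes the form \(Avg(g) + k\sigma(g)(1 - 1/(1+\sigma^{2}(g)))\), which is the form commonly used in the three-sigma literature and matches the limiting behaviour described above; \(\sigma(g)\) is the sample standard deviation (denominator n - 1); a gene below \(Thr_{1}(g)\) receives probability 0; and an edge is kept only when its weight is positive. All function names are hypothetical.

```python
import math


def k_sigma_threshold(values, k):
    """Thr_k(g), assuming the form Avg + k * sigma * (1 - 1 / (1 + sigma^2))."""
    n = len(values)
    avg = sum(values) / n
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sigma = math.sqrt(sum((x - avg) ** 2 for x in values) / (n - 1))
    return avg + k * sigma * (1.0 - 1.0 / (1.0 + sigma ** 2))


def active_probabilities(values):
    """Ap_i(g) for every sample point i: 0.99, 0.95, 0.68, or 0.0 (assumed)."""
    thr = {k: k_sigma_threshold(values, k) for k in (1, 2, 3)}
    probs = []
    for x in values:
        if x >= thr[3]:
            probs.append(0.99)
        elif x >= thr[2]:
            probs.append(0.95)
        elif x >= thr[1]:
            probs.append(0.68)
        else:
            probs.append(0.0)
    return probs


def weighted_subnetworks(edges, expression):
    """Build one edge-weight dict per sample point.

    edges: iterable of (v, u) pairs from the whole PPI network.
    expression: dict mapping a protein to its list of Ge_i values.
    Edge (v, u) enters subnetwork i with weight Ap_i(v) * Ap_i(u) when both
    endpoints are active (probability > 0) at sample point i.
    """
    ap = {g: active_probabilities(vals) for g, vals in expression.items()}
    n = len(next(iter(ap.values())))
    subnets = [dict() for _ in range(n)]
    for v, u in edges:
        if v not in ap or u not in ap:
            continue  # no expression data for one endpoint
        for i in range(n):
            w = ap[v][i] * ap[u][i]
            if w > 0.0:
                subnets[i][(v, u)] = w
    return subnets
```

Because the weight is a product of two probabilities from {0.68, 0.95, 0.99}, every retained edge weight lies in (0, 1], and an edge disappears from a subnetwork as soon as either endpoint falls below its 1-sigma threshold.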
Biological data-based method for weighting proteins and protein–protein interactions
The essentiality of a protein is associated with some biological data, such as protein complex, gene expression data, GO annotations and subcellular localization. We utilize multiple types of biological data to characterize the essentiality of protein–protein interactions.
GO annotations
GO terms annotate the functional properties of a protein. For two interacting proteins, the more GO terms they share, the more similar their functions are and the greater the weight of their interaction is [35]. The weight of an edge based on GO annotations is denoted as Eq. 7.
where \(GO_{v}\) is the set of GO terms of protein v. We use GOW(v, u) to assign the weight of the edge (v, u).
Gene expression
The interaction between two proteins can be weighted based on the strength of their co-expression, as demonstrated in previous studies [21]. The weight is determined by the Pearson correlation coefficient (PCC) calculated from gene expression data [21, 54]. PCC is denoted as follows:
\[PCC(X,Y) = \frac{\sum _{i=1}^{n}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum _{i=1}^{n}(X_{i}-\bar{X})^{2}}\,\sqrt{\sum _{i=1}^{n}(Y_{i}-\bar{Y})^{2}}} \tag{8}\]
where X and Y correspond to the gene expression data of protein v and protein u respectively, and \(\bar{X}\) and \(\bar{Y}\) are their means. \(X=\{X_{1},X_{2},\dots ,X_{i},\dots ,X_{n}\}\), \(Y=\{Y_{1},Y_{2},\dots ,Y_{i},\dots ,Y_{n}\}\), and n is the number of sample points of gene expression data, which is defined in section ‘Gene expression data-based method for constructing PPI subnetworks’. \(X_{i} = Ge_{i}(v)\) and \(Y_{i} = Ge_{i}(u)\), where gene v and gene u correspond to protein v and protein u. For Saccharomyces cerevisiae, \(X_{i}\) and \(Y_{i}\) are defined by Eq. 1, and for Drosophila melanogaster, \(X_{i}\) and \(Y_{i}\) are defined by Eq. 2.
Since PCC ranges over \([-1,+1]\), it needs to be standardized. GW(v, u) is the standardization of PCC, which is denoted as Eq. 9.
where PCC(X, Y) is denoted as Eq. 8. GW(v, u) ranges over \([0,+1]\). We finally use GW(v, u) to calculate the weight of the edge (v, u).
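A minimal sketch of the co-expression weight follows. The Pearson coefficient is the standard definition; the rescaling \((PCC+1)/2\) is an assumption chosen because it maps \([-1,+1]\) onto \([0,+1]\) as the text requires, and may differ from the paper's Eq. 9. Returning 0 for a constant profile is also a choice made only for this illustration.

```python
import math


def pcc(x, y):
    """Pearson correlation coefficient of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumption: a constant profile is treated as uncorrelated
    return cov / (norm_x * norm_y)


def gw(x, y):
    """Co-expression edge weight in [0, 1]; the (PCC + 1) / 2 rescaling is assumed."""
    return (pcc(x, y) + 1.0) / 2.0
```

Under this rescaling, perfectly anti-correlated profiles get weight 0, uncorrelated profiles get 0.5, and perfectly correlated profiles get 1.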
Protein complex
Proteins typically carry out biological functions through participation in protein complexes. A protein’s likelihood of being essential often increases with the number of protein complexes it is involved in, as highlighted in previous studies [55, 56]. Consequently, the count of protein complexes in which a protein is located can reflect its essentiality. \(PC_{v}\) denotes the number of protein complexes in which the protein v is located. Additionally, \(PC_{max} = max\left( PC_{v}\right) , (v \in V)\). The weight of edge (v, u) is denoted as Eq. 10.
Subcellular localization
The essentiality of proteins is related to their subcellular localizations, and some subcellular localizations have a strong correlation with the essentiality of proteins [57, 58]. In this section, we first select the subcellular localizations that are more relevant to essential proteins. These selected subcellular localizations are then weighted according to their importance. Lastly, we score each protein’s essentiality by the subcellular localizations in which it appears.
Subcellular localizations usually contain 11 compartments [48]. For each of the 11 subcellular localizations, we calculate the proportion EPI as follows:
\[EPI_{subi} = \frac{EP_{subi}}{P_{subi}} \tag{11}\]
where subi is the index of the 11 subcellular localizations, \(EP_{subi}\) is the number of essential proteins in subi, and \(P_{subi}\) is the number of proteins in subi. For the 11 subcellular localizations of different datasets, the proportion EPI is shown in Fig. 3.
As shown in Fig. 3, the proportion EPI of some subcellular localizations, such as Nucleus, Cytosol, and Cytoskeleton, is significantly higher than that of others. On the contrary, the proportion EPI of Peroxisome and Extracellular region is significantly lower than that of others. This characteristic is present in different species, such as Saccharomyces cerevisiae and Drosophila melanogaster. Therefore, we select the subcellular localizations to be used based on the DIP dataset. The threshold EPthre is denoted as Eq. 12.
\[EPthre = \frac{ep}{p} \tag{12}\]
where ep is the number of essential proteins in the DIP dataset, and p is the number of proteins in the DIP dataset. If the EPI of a subcellular localization is greater than the threshold EPthre, this subcellular localization is added to the set SC. \(SC = \{Nucleus, Cytosol, Cytoskeleton, Endoplasmic reticulum, Golgi apparatus\}\). For every selected subcellular localization \(SC_{i}\), we score it by the number of proteins in it, as denoted by Eq. 13.
\[SCS_{i} = \frac{NSC_{i}}{NSC_{max}} \tag{13}\]
where \(NSC_{i}\) is the number of proteins in \(SC_{i}\), \(NSC_{max} = max(NSC_{i})\), and \(SCS_{i}\) ranges over \([0,+1]\). Protein v is weighted based on the subcellular localization score, and its weight is denoted by Eq. 14.
\[SW_{v} = \frac{SSC_{v}}{SSC_{max}} \tag{14}\]
where \(SSC_{v}=\sum \limits _{v \in SC_{i}} SCS_{i}\) and \(SSC_{max} = max\left( SSC_{v}\right) , (v \in V)\). \(SW_{v}\) ranges over \([0,+1]\).
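The selection and scoring of subcellular localizations can be sketched as below. The data layout (a dict from compartment name to a set of proteins) and the function names are hypothetical; the ratios follow the definitions of EPI, EPthre, \(SCS_{i}\) and \(SW_{v}\) given in the text.

```python
def select_compartments(members, essential, proteins):
    """Keep compartments whose essential-protein proportion (EPI) exceeds the
    network-wide proportion EPthre = ep / p."""
    ep_thre = len(essential & proteins) / len(proteins)
    selected = {}
    for name, prots in members.items():
        prots = prots & proteins
        if prots and len(prots & essential) / len(prots) > ep_thre:
            selected[name] = prots
    return selected


def localization_weights(selected, proteins):
    """SW_v for every protein: the summed scores SCS_i of the selected
    compartments containing v, normalized by the largest such sum."""
    if not selected:
        return {v: 0.0 for v in proteins}
    nsc_max = max(len(p) for p in selected.values())
    scs = {name: len(p) / nsc_max for name, p in selected.items()}
    ssc = {
        v: sum(score for name, score in scs.items() if v in selected[name])
        for v in proteins
    }
    ssc_max = max(ssc.values())
    return {v: (ssc[v] / ssc_max if ssc_max > 0 else 0.0) for v in proteins}
```

A protein that appears in none of the selected compartments receives weight 0, while the protein with the largest summed compartment score receives weight 1.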
Seed expansion method based on subnetworks and biological data
Essential proteins have a close relationship with each other [26]. CPPK predicts essential proteins by integrating network topology and some essential proteins as prior knowledge. However, the performance of CPPK depends excessively on the number of essential proteins used as prior knowledge. To tackle this problem, we randomly select a protein as a seed in each subnetwork, and the expansion process is based on the seed set of each subnetwork. This integrates the topological characteristics of predicted essential proteins in subnetworks. The error correction mechanism filters out proteins that have been expanded but are of low essentiality. This mechanism integrates the topological characteristics of predicted essential proteins in the whole PPI network, and the weight of each node is based on biological data. To provide a more realistic representation of protein interactions, we divide the protein–protein interaction network into several subnetworks based on gene expression data. In this section, we execute seed expansion in each subnetwork simultaneously and summarize the expansion results to the whole PPI network. The detailed process of seed expansion is presented in Algorithm 1.
Initialization of the seed set and the weight of node and edge
For the whole PPI network, the set of predicted essential proteins is initialized as K. For the PPI subnetworks, the sets of predicted essential proteins are initialized as \(K_i\), where \(i\in [1,m]\), and m is the number of subnetworks. A protein is randomly chosen from each subnetwork and added to \(K_i\), ensuring that the initial \(K_i\) are pairwise disjoint. The initial K is formed by taking the union of all the initial \(K_i\) sets, expressed as \(K = \bigcup _{i=1}^m K_i\). K is the set of predicted essential proteins, and it continues expanding until its length reaches the output length n. In other words, the length of the output K is n. The value of n is initialized as \(\frac{|V|}{4}\) for Saccharomyces cerevisiae and \(\frac{|V|}{10}\) for Drosophila melanogaster. When \(|K| = n\), the proteins within K constitute all the essential proteins predicted by SESN. Furthermore, the ranking of a protein within K is higher if it was added earlier.
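The initialization described above can be sketched as follows. The function names, the `species` labels and the use of floor division for the output length are assumptions made for illustration; K is kept as an ordered list so that earlier additions rank higher.

```python
import random


def init_seed_sets(sub_nodes, rng=None):
    """Pick one random seed per subnetwork so that all seeds are distinct.

    sub_nodes: list of node sets V_i, one per subnetwork.
    Returns (K, K_sub): the ordered global list K and the per-subnetwork sets K_i.
    """
    rng = rng or random.Random()
    K, K_sub = [], []
    for nodes in sub_nodes:
        free = sorted(set(nodes) - set(K))  # sorted for reproducibility
        if not free:
            raise ValueError("no unused protein left in this subnetwork")
        seed = rng.choice(free)
        K.append(seed)
        K_sub.append({seed})
    return K, K_sub


def output_length(n_proteins, species):
    """Output length n: |V| / 4 for yeast, |V| / 10 for fruit fly
    (floor division is an assumption)."""
    return n_proteins // (4 if species == "yeast" else 10)
```

Passing a seeded `random.Random` instance makes a run repeatable, which helps when comparing the effect of the later error correction step on different random starts.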
\(Nei\_K_i\) is the union of the neighbor sets of all proteins in \(K_i\). Subsequently, any protein within \(Nei\_K_i\) that is also present in K is removed.
\[Nei\_K_i = \bigcup _{u \in K_i} N_{u} - K \tag{15}\]
where \(N_{u}\) is the set of neighbors of protein u. By construction, \(Nei\_K_i \cap K = \emptyset\), that is, the intersection of \(Nei\_K_i\) and K is empty. Similarly, \(Nei\_K\) is formed by combining the neighbor sets of all proteins in K. Subsequently, any protein within \(Nei\_K\) that is also present in K is removed.
\[Nei\_K = \bigcup _{u \in K} N_{u} - K \tag{16}\]
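The neighbor sets \(Nei\_K_i\) and \(Nei\_K\) are plain set operations. A minimal sketch, assuming the network is stored as a dict from each protein to the set of its neighbors (the name `frontier` is hypothetical):

```python
def frontier(adj, members, K):
    """Union of the neighbor sets of `members`, minus everything already in K."""
    out = set()
    for u in members:
        out |= adj.get(u, set())
    return out - set(K)
```

Calling `frontier(adj_i, K_i, K)` on a subnetwork adjacency gives \(Nei\_K_i\), and `frontier(adj, K, K)` on the whole network gives \(Nei\_K\).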
\(score\_initial\) describes the essentiality of a protein in the whole PPI network. We initialize a weight matrix, defined as Eq. 17.
\[W_{v,u} = GOW(v,u) \cdot GW(v,u) \cdot PCW(v,u) \tag{17}\]
where GOW(v, u) is denoted as Eq. 7, GW(v, u) is denoted as Eq. 9, and PCW(v, u) is denoted as Eq. 10. \(W_{v,u}\) is the weight of edge (v, u). We then initialize the weight of each protein by Eq. 18.
where \(SW_{v}\) is denoted as Eq. 14, and \(W_{v,u}\) is denoted as Eq. 17.
\(Wmatrix_i(v,u)\) describes the weight of the protein interaction (v, u) in subnetwork i. Here, i is the index of the subnetwork, with \(i \in [1, m]\). For Saccharomyces cerevisiae, \(m=12\), and for Drosophila melanogaster, \(m=34\). \(Wmatrix_i(v,u)\) ranges over \([0,+1]\) and is denoted as follows:
where GOW(v, u) is denoted as Eq. 7, \(Ap_{i}(v)\) and \(Ap_{i}(u)\) are denoted as Eq. 6, PCW(v, u) is denoted as Eq. 10, and \(SW_{v}\) and \(SW_{u}\) are denoted as Eq. 14.
Seed expansion in subnetworks
The expansion process is performed simultaneously in all subnetworks and terminates when \(|K| = n\). If \(|K| < n\), we select a protein called \(max\_Wmatrix\_node\) from \(Nei\_K_i\), where \(i\in [1,m]\). Since \(Nei\_K_i \cap K = \emptyset\), the protein selected from \(Nei\_K_i\) is not in K. For each protein (denoted as Nei) in \(Nei\_K_i\), we calculate \(score\_W_{Nei}\) based on Wmatrix. \(score\_W_{Nei}\) is defined as follows:
where \(useNei = N_{Nei} \cap K\), and \(N_{Nei}\) is the set of neighbors of protein Nei. Among the neighbors of protein Nei, only those within K are taken into consideration. The more closely protein Nei is connected with the set useNei, the higher the value of \(score\_W_{Nei}\). This approach allows us to effectively leverage the topological characteristics of predicted essential proteins within subnetworks.
The selection process of protein \(max\_Wmatrix\_node\) consists of two steps. In the first step, we find the maximum \(score\_W_{Nei}\) in each subnetwork. For all \(Nei \in Nei\_K_i\), the maximum \(score\_W_{Nei}\) is denoted as \(sub\_max\_Wmatrix_i\). In the second step, after calculating all the subnetworks, we gather all the \(sub\_max\_Wmatrix_i\) values, where \(i\in [1,m]\). The maximum value among these is denoted as \(max\_Wmatrix\), and its corresponding node is \(max\_Wmatrix\_node\).
We add the node \(max\_Wmatrix\_node\) to K if it is not already in K, and if it exists in \(V_i\), we also add it to \(K_i\). Whenever a node is added to or deleted from K, the same change is applied to the corresponding \(K_i\). Among \(Nei\_K_i\) of all the subnetworks, \(max\_Wmatrix\_node\) is the node that has the closest connections to the predicted essential proteins set K and possesses crucial biological characteristics.
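One expansion step across all subnetworks can be sketched as follows. This is an illustration under stated assumptions rather than the paper's Algorithm 1: \(score\_W_{Nei}\) is taken to be the sum of \(Wmatrix_i\) over the edges between Nei and useNei, neighbors are taken within the subnetwork, and the data structures (adjacency dicts, edge weights keyed by `frozenset`) and the function name are hypothetical.

```python
def expansion_step(sub_adj, sub_w, K, K_sub):
    """Add the best-scoring frontier node over all subnetworks to K.

    sub_adj[i]: dict node -> set of neighbors in subnetwork i.
    sub_w[i]:   dict frozenset({v, u}) -> Wmatrix_i(v, u).
    K:          ordered list of predicted essential proteins (modified in place).
    K_sub[i]:   set K_i for subnetwork i (modified in place).
    Returns the added node, or None when no candidate exists.
    """
    in_k = set(K)
    best_node, best_score = None, float("-inf")
    for i, adj in enumerate(sub_adj):
        # Nei_K_i: neighbors of K_i in subnetwork i that are not yet in K.
        candidates = set()
        for u in K_sub[i]:
            candidates |= adj.get(u, set())
        candidates -= in_k
        for nei in candidates:
            use_nei = adj.get(nei, set()) & in_k
            # Assumed score: total weight of the edges linking nei to K.
            score = sum(sub_w[i].get(frozenset((nei, u)), 0.0) for u in use_nei)
            if score > best_score:
                best_node, best_score = nei, score
    if best_node is None:
        return None
    K.append(best_node)
    for i, adj in enumerate(sub_adj):
        if best_node in adj:  # the node exists in V_i
            K_sub[i].add(best_node)
    return best_node
```

Repeating this step until `len(K)` reaches the output length n reproduces the outer loop described in the text; ties between equal scores are broken arbitrarily in this sketch.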
Seed expansion method with error correction mechanism
The initialization of \(K_i\) and K is random, and the expansion process is based on \(Nei\_K_i\) of the subnetworks. If the essentiality of the seed in \(K_i\) is low, then there is a high probability that the essentiality of its neighboring nodes will also be low. In other words, the essentiality of the node selected from \(Nei\_K_i\) will also be low. Therefore, we add an error correction mechanism to filter out the nodes with low essentiality in K. The error correction mechanism is based on \(score\_initial\) calculated by Eq. 18, which describes the essentiality of a protein in the whole PPI network.
The error correction mechanism consists of two main steps. In the first step, we find two nodes based on \(score\_initial\). The first node is \(min\_initial\_node\), which has the minimum \(score\_initial\) (denoted as \(min\_initial\)) of all proteins in K that have not been removed before. The second node is \(max\_initial\_node\), which has the maximum \(score\_initial\) (denoted as \(max\_initial\)) of all proteins in \(Nei\_K\). Since \(Nei\_K \cap K = \emptyset\), \(max\_initial\_node\) is not in K. Of all proteins in \(Nei\_K\), \(max\_initial\_node\) has the most important topological and biological characteristics. The selection of \(min\_initial\_node\) and \(max\_initial\_node\) is based on the whole PPI network. In the second step, we add one node to K while removing another node from K. If \(max\_initial > min\_initial\), we add \(max\_initial\_node\) to K if it is not already in K. Moreover, if \(min\_initial\_node\) has not been removed from K before, we remove it. A node in K can only be removed once. Subsequently, we update \(K_i\), and the state of \(min\_initial\_node\) is altered to ‘removed’. To ensure that \(|K|\) increases monotonically with the number of iterations, we remove \(min\_initial\_node\) from K if and only if \(max\_initial\_node\) has been added to K during this particular iteration.
The error correction mechanism is based on \(score\_initial\), which is the weight based on biological data. The selection of \(max\_initial\_node\) integrates the topological characteristics of predicted essential proteins in the whole PPI network.
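The two steps of the error correction mechanism can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `error_correction_step`, the dictionary `score_initial`, and the set `removed` are hypothetical, the neighbor set `nei_K` is assumed to be supplied by the caller and disjoint from `K`, and the update of the per-subnetwork sets \(K_i\) is omitted.

```python
def error_correction_step(K, nei_K, score_initial, removed):
    """Perform one error-correction step in place.

    K             -- set of currently predicted essential proteins
    nei_K         -- set of neighbors of K (assumed disjoint from K)
    score_initial -- dict mapping protein -> score (hypothetical container)
    removed       -- set of proteins that were already removed once
    Returns True if a swap happened, False otherwise.
    """
    # Step 1: pick the weakest not-yet-removed member of K and the
    # strongest neighbor of K, both ranked by score_initial.
    candidates = [v for v in K if v not in removed]
    if not candidates or not nei_K:
        return False
    min_node = min(candidates, key=lambda v: score_initial[v])
    max_node = max(nei_K, key=lambda v: score_initial[v])

    # Step 2: swap only if the neighbor scores strictly higher.
    if score_initial[max_node] <= score_initial[min_node]:
        return False
    K.add(max_node)
    # The removal is tied to the addition, so |K| never shrinks here.
    K.discard(min_node)
    removed.add(min_node)  # a node can be removed only once
    return True


# Toy usage with made-up scores.
K = {"a", "b", "c"}
removed = set()
scores = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.7, "e": 0.1}
swapped = error_correction_step(K, {"d", "e"}, scores, removed)
```

In this toy run, "b" (the lowest-scoring member of K) is exchanged for "d" (the highest-scoring neighbor), so the size of K is unchanged by the correction itself and can only grow through seed expansion.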
Analysis of biological data
We integrate protein complex, gene expression data, GO annotations and subcellular localization into the seed expansion process. More specifically, each biological characteristic is employed to initialize \(score\_initial\) and Wmatrix. In order to analyze the effect of each biological characteristic on the final prediction results, we delete the weighting method based on biological data one by one. For example, in Eq. 18, let \(GOW(v,u) = 1\). In other words, we no longer use GO annotations to weight Wmatrix, while everything else remains the same. Specifically, Eqs. 18 and 19 are redefined as Eqs. 21 and 22 as follows:
where \(W_{v,u}\) of Eq. 17 is redefined as \(GOW(v,u)^{\alpha _1} \cdot GW(v,u)^{\alpha _2} \cdot PCW(v,u)^{\alpha _3}\) and \(SW_{v}\) of Eq. 14 is redefined as \(SW_{v}^{\alpha _4}\).
where \(Ap_{i}(v,u) = Ap_{i}(v) \cdot Ap_{i}(u)\), \(Ap_{i}(v)\) and \(Ap_{i}(u)\) are denoted as Eq. 6, \(SW(v,u) = SW_{v} \cdot SW_{u}\), \(SW_{v}\) and \(SW_{u}\) are denoted as Eq. 14. \(\alpha _1\) to \(\alpha _8\) determine which biological data will be deleted.
The value set \((\alpha _1,\alpha _2,...\alpha _8) = \{(0,1,...,1),(1,0,...,1),...,(1,1,...,0),(1,1,...,1)\}\) includes 9 groups of values, of which the ninth group consists entirely of 1s. In the first eight groups, \(\alpha _1\) to \(\alpha _8\) take the value 0 in sequence, while the remaining seven values are all 1s. For example, the first group is (0, 1, 1, 1, 1, 1, 1, 1).
Table 2 compares the statistical measures of the 9 methods. The 9 columns of Table 2 correspond to the 9 groups of values in \((\alpha _1,\alpha _2,...\alpha _8)\). The definition of the statistical measures can be found in the section ‘Statistical measures’. The first eight columns correspond, in order, to deleting GO annotations, gene expression data, protein complex, and subcellular localization from the initialization of \(score\_initial\) and then from the initialization of Wmatrix; the ninth column uses all biological data. Figure 4 shows the jackknife curves for the three aforementioned datasets. The labels \(initial\_go\), \(initial\_gene\), \(initial\_com\), \(initial\_sub\), \(W\_go\), \(W\_gene\), \(W\_com\), \(W\_sub\) and all correspond to the 9 groups of values in \((\alpha _1,\alpha _2,...\alpha _8)\). The statistical measures and the jackknife curves lead to the same conclusions. For Saccharomyces cerevisiae, the DIP and BioGRID datasets achieve the best results when Wmatrix is initialized without protein complex. The value set \((\alpha _1,\alpha _2,...\alpha _8) = (1,1,1,1,1,1,0,1)\) corresponds to these optimal results, and we employ it in subsequent experiments. For Drosophila melanogaster, we achieve the best results when \(score\_initial\) is initialized without subcellular localization, which corresponds to the value set \((\alpha _1,\alpha _2,...\alpha _8) = (1,1,1,0,1,1,1,1)\). We also employ this value set in the follow-on experiments. The method with the best results is SESN.
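The ablation scheme can be illustrated with a short Python sketch. It enumerates the 9 groups of \((\alpha _1,\alpha _2,...\alpha _8)\) and shows why an exponent of 0 deletes a weight: any factor raised to the power 0 equals 1. The function names `alpha_groups` and `edge_weight` are hypothetical, and `edge_weight` only reproduces the product \(GOW(v,u)^{\alpha _1} \cdot GW(v,u)^{\alpha _2} \cdot PCW(v,u)^{\alpha _3}\) given in the text, not the full Eqs. 21 and 22.

```python
# Labels in the order given in the text; "all" is the group of all 1s.
LABELS = ["initial_go", "initial_gene", "initial_com", "initial_sub",
          "W_go", "W_gene", "W_com", "W_sub", "all"]


def alpha_groups(n=8):
    """Return n groups with a single 0 in positions 1..n, then the all-ones group."""
    groups = [tuple(0 if j == i else 1 for j in range(n)) for i in range(n)]
    groups.append(tuple([1] * n))
    return groups


def edge_weight(gow, gw, pcw, a1, a2, a3):
    """Product of the three edge weights; an exponent of 0 turns a factor into 1."""
    return (gow ** a1) * (gw ** a2) * (pcw ** a3)


groups = dict(zip(LABELS, alpha_groups()))
for label, alphas in groups.items():
    print(label, alphas)
```

With this ordering, `groups["W_com"]` is the value set reported as best for the yeast datasets and `groups["initial_sub"]` is the one reported as best for the fruitfly dataset.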
Experimental results and discussion
Statistical measures
We compare the performance of our method with other identification methods using six statistical measures. These statistical measures can also be used to analyze the effect of each component and of each type of biological data on the final results. We define sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure (F), and accuracy (ACC) as follows: \(SN = \frac{TP}{TP+FN}\), \(SP = \frac{TN}{TN+FP}\), \(PPV = \frac{TP}{TP+FP}\), \(NPV = \frac{TN}{TN+FN}\), \(F = \frac{2 \cdot SN \cdot PPV}{SN+PPV}\), \(ACC = \frac{TP+TN}{TP+FP+TN+FN}\), where TP is the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives. The larger these statistical measures, the higher the accuracy of the corresponding essential protein identification method.
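As a worked illustration of these definitions, the following Python sketch computes the six measures from the four confusion-matrix counts. The function name `statistical_measures` and the example counts are hypothetical, and the sketch assumes that every denominator is non-zero.

```python
def statistical_measures(tp, fp, tn, fn):
    """Compute SN, SP, PPV, NPV, F and ACC from confusion-matrix counts.

    Assumes all denominators are non-zero (no guard against empty classes).
    """
    sn = tp / (tp + fn)            # sensitivity
    sp = tn / (tn + fp)            # specificity
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    f = 2 * sn * ppv / (sn + ppv)  # F-measure
    acc = (tp + tn) / (tp + fp + tn + fn)
    return {"SN": sn, "SP": sp, "PPV": ppv, "NPV": npv, "F": f, "ACC": acc}


# Made-up counts, for illustration only.
measures = statistical_measures(tp=40, fp=10, tn=30, fn=20)
```

For these made-up counts, SP is 0.75, PPV is 0.8, NPV is 0.6 and ACC is 0.7; SN is 40/60 and F is 80/110.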
Jackknife curves
We plot jackknife curves to display the number of true positives for essential proteins in the predicted set of essential proteins as the ranking increases. We consider a protein’s ranking to be higher if it is added to the set K earlier.
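A jackknife curve of this kind can be computed with a cumulative count, as in the following Python sketch. The function name `jackknife_curve` and the toy protein identifiers are hypothetical; `ranked` is assumed to list proteins in the order in which they were added to K.

```python
def jackknife_curve(ranked, essential):
    """Return the cumulative number of true essential proteins.

    ranked    -- proteins ordered from highest to lowest rank
    essential -- set of known essential proteins (ground truth)
    The i-th value is the number of true positives among the top i+1 proteins.
    """
    counts = []
    hits = 0
    for protein in ranked:
        if protein in essential:
            hits += 1
        counts.append(hits)
    return counts


# Toy example: p1 and p3 are essential.
curve = jackknife_curve(["p1", "p2", "p3", "p4"], {"p1", "p3"})
```

Plotting `curve` against the rank index gives the jackknife curve; a method whose curve lies above another's recovers more true essential proteins at the same cut-off.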
Analysis of each component
In order to validate the effectiveness of each component of SESN, we remove one or several components. Specifically, we remove the subnetworks component, the error correction mechanism, the seed expansion component, and the subcellular localization selection. When we remove the subnetworks component, the process of seed expansion is based on the whole network; this method is named \(rm\_sub\). When we remove the error correction mechanism, the seed expansion is based only on Wmatrix; this method is named \(rm\_correction\). When we remove the seed expansion component, the process of essential protein identification is based only on Wmatrix or \(score\_initial\). For Wmatrix, we initialize the weight of a protein based on the following equation: \(score\_Wmatrix_{v} = \sum _{u \in N_{v}} Wmatrix_{v,u}\), where Wmatrix is defined in Eq. 19. This method is named Wmatrix. The method based only on \(score\_initial\) is named \(score\_initial\), where the weight of the protein is based on Eq. 18. In the section ‘Subcellular localization’, we select some subcellular localizations that are highly correlated with essential proteins. To validate the effectiveness of this component, we use all subcellular localizations; this method is named \(all\_subcellular\).
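The Wmatrix baseline described above reduces to a weighted-degree ranking, which the following Python sketch illustrates. The nested-dictionary representation of Wmatrix, the function names, and the toy weights are assumptions made for illustration; the actual weights come from Eq. 19.

```python
def score_wmatrix(wmatrix):
    """Sum the edge weights of each protein over its neighbors N_v.

    wmatrix -- dict of dicts, wmatrix[v][u] is the weight of edge (v, u)
    """
    return {v: sum(neighbors.values()) for v, neighbors in wmatrix.items()}


def rank_proteins(scores):
    """Order proteins from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Toy symmetric network with made-up weights.
wmatrix = {
    "a": {"b": 0.5, "c": 0.25},
    "b": {"a": 0.5},
    "c": {"a": 0.25},
}
scores = score_wmatrix(wmatrix)
ranking = rank_proteins(scores)
```

Here protein "a" receives the score 0.75 and is ranked first, followed by "b" and "c".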
As shown in Table 3 and Fig. 5, compared with \(rm\_sub\), \(rm\_correction\), Wmatrix, \(score\_initial\), and \(all\_subcellular\), SESN achieves the best results on all three datasets. All components contribute to the performance of SESN. For DIP and BioGRID, every ablated variant shows a significant gap to SESN, especially \(rm\_correction\), which indicates that the error correction mechanism is the most effective component. For Drosophila melanogaster, \(without\_correction\) corresponds to \(rm\_correction\) in Table 3. All components are effective except \(all\_subcellular\). The statistical measures of SESN are slightly higher than those of \(all\_subcellular\), and their jackknife curves nearly coincide, which means that the selection of subcellular localization plays a small role in this case. It also indicates that the unselected subcellular localizations do not help in predicting essential proteins.
Analysis of the performance of SESN and other methods
To validate the performance of SESN, we compare it with other methods: CPPK, CEPPK, RWEP, SIGEP, RWO and TSPIN.
The algorithms CPPK and CEPPK are based on the whole protein–protein interaction data. To filter false positive interactions, SESN integrates multiple biological characteristics to construct weighted PPI subnetworks. CPPK and CEPPK randomly select k (\(k=100\)) known essential proteins as prior knowledge and add them to a set K. However, the performance of CPPK and CEPPK depends excessively on the number of essential proteins. SESN does not use essential proteins as prior knowledge. We randomly select a protein in each subnetwork and add it to a set \(K_i\), with \(K = \bigcup _{i=1}^m K_i\). Unlike in CPPK and CEPPK, K is the set of predicted essential proteins, not known essential proteins. When CPPK and CEPPK perform node expansion on set K within the whole PPI network, they select the neighbor node of K with the highest score and add it to K. Because this node expansion is based on the neighbor nodes of K, it considers only the topological characteristics of set K within the whole PPI network. SESN considers the topological characteristics of predicted essential proteins in both \(K_i\) and K. The seed expansion process of SESN is based on the neighbor nodes of \(K_i\), which integrates the topological characteristics of predicted essential proteins in the subnetworks. The error correction mechanism of SESN is based on the neighbor nodes of K, which integrates the topological characteristics of predicted essential proteins in the whole PPI network. CPPK and CEPPK have only been applied to Saccharomyces cerevisiae, where they randomly select 100 essential proteins as prior knowledge. For Drosophila melanogaster, the fruitfly dataset has only 493 essential proteins, so we randomly select 20 essential proteins as prior knowledge.
As with SESN, the earlier a protein is selected as a predicted essential protein, the higher we regard its score. Although SESN does not use essential proteins as prior knowledge, it detects essential proteins more effectively than CPPK and CEPPK.
RWEP integrates the same biological properties as SESN. As shown in Table 5 and Fig. 7, to achieve optimal results, the parameter \(\lambda\) of RWEP is set to 0.2, 0.1, and 0.9 for the DIP, BioGRID, and fruitfly datasets, respectively. SESN analyzes the effect of each type of biological data on the final prediction results and adopts the biological data that give the best results. The experimental results show that SESN outperforms RWEP significantly, demonstrating the effectiveness of analyzing biological data from a different perspective.
SIGEP presents a p-value calculation method in which both degree and local clustering coefficient are used as test statistics. Proteins are sorted according to their p-values. SIGEP does not integrate any biological data, and its performance is inferior to that of SESN, which indicates that the integration of biological data improves the performance of SESN.
RWO uses orthologous relationships to connect yeast and human PPI. Since RWO does not give the orthologous relationships applied to the fruitfly, we compare RWO and SESN on yeast datasets: DIP and BioGRID. SESN does not integrate orthologous relationships, but its performance is significantly better than RWO.
The statistical measures of these methods are shown in Table 4. On all three datasets, the values of the six statistical measures of SESN are the highest among these methods. The jackknife curves of these methods are shown in Fig. 6, revealing that SESN performs significantly better than the other methods.
TSPIN is an algorithm aimed at optimizing PPI networks. TSPIN refines the PPI network by removing edges from it. The initial PPI network is denoted as PPI. By utilizing TSPIN, specific edges are removed from the PPI network to generate the refined network, which is designated as TSPPI. To assess the effectiveness of TSPIN for SESN, we feed the TSPPI network into SESN and execute all the SESN steps. This combined approach is denoted as TSPINSESN. The distinction between TSPINSESN and SESN lies solely in the input network. The TSPPI network employed by TSPINSESN is a subset of the PPI network used by SESN, so the two algorithms operate on different datasets. To ensure the validity of the comparison, we apply the following two procedures to the output results of the two algorithms. 1: For the output results of the SESN algorithm, only the proteins appearing in the TSPPI network are retained. In this manner, the experimental results of TSPINSESN and SESN are both based on the TSPPI network. The statistical measures and jackknife curves of SESN and TSPINSESN on the TSPPI network are shown in Table 6 and Fig. 8. The experimental results show that the TSPINSESN algorithm does not improve the identification accuracy of essential proteins in the TSPPI network. TSPINSESN and SESN yield identical experimental results for the BioGRID and fruitfly datasets, whereas for the DIP dataset TSPINSESN performs worse. 2: We assign a score of 0 to proteins that have been removed from the PPI network, and subsequently add these proteins along with their scores to the output results of the TSPINSESN algorithm. In this manner, the experimental results of both TSPINSESN and SESN are based on the PPI network. The statistical measures and jackknife curves of SESN and TSPINSESN on the PPI network are shown in Table 7 and Fig. 9.
The experimental results show that the SESN algorithm performs better than TSPINSESN. In summary, the TSPIN algorithm is ineffective for SESN.
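The two alignment procedures used in the TSPIN comparison can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the dictionary representation of the output scores, and the toy protein identifiers are hypothetical.

```python
def restrict_to_network(scores, network_nodes):
    """Procedure 1: keep only the proteins that appear in the refined network."""
    return {p: s for p, s in scores.items() if p in network_nodes}


def pad_removed_proteins(scores, all_nodes, default=0.0):
    """Procedure 2: give proteins missing from the output a score of 0."""
    padded = dict(scores)
    for protein in all_nodes:
        padded.setdefault(protein, default)
    return padded


# Toy data: the full PPI network has four proteins, the refined one has three.
ppi_nodes = {"p1", "p2", "p3", "p4"}
tsppi_nodes = {"p1", "p2", "p3"}
sesn_scores = {"p1": 0.9, "p2": 0.4, "p3": 0.7, "p4": 0.6}
tspin_sesn_scores = {"p1": 0.8, "p2": 0.5, "p3": 0.6}

sesn_on_tsppi = restrict_to_network(sesn_scores, tsppi_nodes)
tspin_sesn_on_ppi = pad_removed_proteins(tspin_sesn_scores, ppi_nodes)
```

After either procedure, both outputs cover the same set of proteins, so the statistical measures and jackknife curves are computed over a common basis.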
Conclusions
Essential proteins are crucial for maintaining vital biological functions. Identifying essential proteins is of great significance for biology and pathology. In recent years, a large number of algorithms based on protein–protein interaction (PPI) networks have been proposed to identify essential proteins. However, PPI data obtained through high-throughput technology often contain many false positives, which seriously affects the accuracy of identifying essential proteins. Therefore, further research is needed to improve the accuracy of essential protein identification.
In this paper, we propose a novel method named SESN for identifying essential proteins. SESN is a seed expansion method based on protein–protein interaction (PPI) subnetworks and biological characteristics. To filter out false positive interactions in PPI networks, SESN constructs PPI subnetworks using gene expression data. Seed expansion is performed simultaneously in each subnetwork, where each subnetwork randomly selects a protein as a seed, and the expansion results are summarized for the entire PPI network. The error correction mechanism filters out low-essentiality proteins that have been added during expansion. SESN adopts the biological data combination with the best experimental results. The output of SESN is a set of predicted essential proteins.
The analysis of each component of SESN shows that all components are effective, especially the error correction mechanism. The comparison experiments are conducted on three datasets from two species (DIP, BioGRID, fruitfly). The experimental results show that, compared with other methods (CPPK, CEPPK, RWEP, SIGEP, RWO, and TSPIN), SESN achieves the best results on all three datasets. SESN may provide a useful tool for future research on the prediction of essential proteins.
Availability of data and materials
The processed dataset and source code are available at https://github.com/zhaohe555/SESN.git
Abbreviations
PPI: Protein–protein interaction
GO: Gene ontology
SESN: A seed expansion method based on PPI subnetworks and multiple biological data to identify essential proteins
Acknowledgements
The author thanks the anonymous reviewers for their comments and suggestions. Additionally, the author would like to thank all the teachers and students who participated in this research for their guidance and assistance.
Funding
This work was supported by the National Natural Science Foundation of China [grant numbers 62372208, 61772226]; the Science and Technology Development Program of Jilin Province [grant number 20210204133YY]; and the Key Laboratory for Symbol Computation and Knowledge Engineering of the National Education Ministry of China, Jilin University.
Author information
Contributions
HZ obtained and processed datasets. HZ and GL designed the new method, SESN. GL, and XC provided suggestions and analyzed the results. HZ wrote the manuscript. HZ, GL, and XC reviewed and edited this manuscript. All authors contributed to this work and approved the submitted version.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zhao, H., Liu, G. & Cao, X. A seed expansion-based method to identify essential proteins by integrating protein–protein interaction subnetworks and multiple biological characteristics. BMC Bioinformatics 24, 452 (2023). https://doi.org/10.1186/s12859-023-05583-8