Identification of essential proteins based on edge features and the fusion of multiple-source biological information
BMC Bioinformatics volume 24, Article number: 203 (2023)
Abstract
Background
A major current focus in the analysis of protein–protein interaction (PPI) data is how to identify essential proteins. As massive PPI data are available, this warrants the design of efficient computing methods for identifying essential proteins. Previous studies have achieved considerable performance. However, as a consequence of the features of high noise and structural complexity in PPIs, it is still a challenge to further upgrade the performance of the identification methods.
Methods
This paper proposes an identification method, named CTF, which identifies essential proteins based on edge features including h-quasi-cliques and uv-triangle graphs and the fusion of multiple-source information. We first design an edge-weight function, named EWCT, for computing the topological scores of proteins based on quasi-cliques and triangle graphs. Then, we generate an edge-weighted PPI network using EWCT and dynamic PPI data. Finally, we compute the essentiality of proteins by the fusion of topological scores and three scores of biological information.
Results
We evaluated the performance of the CTF method by comparing it with 16 other methods, such as MON, PeC, TEGS, and LBCC. The experimental results on three datasets of Saccharomyces cerevisiae show that CTF outperforms the state-of-the-art methods. Moreover, our results indicate that fusing other biological information helps improve the accuracy of identification.
Background
Proteins are the material basis of life activities. They can be divided into essential and non-essential proteins. The cell becomes nonfunctional or dysfunctional when essential proteins are knocked out [1]. Identification of essential proteins can help us uncover the mechanisms of cell aging and aging-related diseases and is of great significance to disease diagnosis and drug design [2].
Essential proteins have been identified by biological experimental approaches and computing methods. The advantage of biological experimental methods, such as gene knockout, conditional knockout, and RNA interference [3], is high reliability, but the disadvantages are that they are time-consuming and expensive [4]. With the rapid development of high-throughput experimental methods, protein–protein interaction (PPI) data have been enriched. Consequently, it is possible to identify essential proteins using computing methods [5].
Interactions among proteins can be modeled by a simple graph where a vertex corresponds to a protein and an edge to an interaction, also called a protein–protein interaction network (PIN). In a PIN, highly connected vertices tend to be essential based on the centrality–lethality rule proposed by Jeong et al. [6]. Accordingly, computing methods identify essential proteins by the topological features of PINs [7]. For these methods, centrality measures are crucial. Much research in recent years has focused on centrality measures, such as degree centrality (DC) [8], betweenness centrality (BC) [9], closeness centrality (CC) [10], subgraph centrality (SC) [11], eigenvector centrality (EC) [12], information centrality (IC) [13], local average centrality (LAC) [14], and neighbor centrality (NC) [15]. It should also be noted that, according to previous research, existing centrality measures cannot identify all essential proteins because of noise in PINs, limitations of the centrality measures themselves, and other reasons [16]. It remains challenging to develop novel centrality measures to further improve the performance of the identification methods [17].
Besides centrality measures, previous research shows that fusing multisource biological information [18], such as GO annotations, protein complexes, gene expression profiles, and subcellular localization, is helpful for identifying essential proteins. Methods that use such information can be generally grouped into three categories: edge weight methods, PIN reconstruction methods, and fusion methods.
The basic idea of edge weight methods is to identify essential proteins via an edge-weighted PIN, whose edges are weighted based on topological features and biological information. Edge-weighted PINs can be obtained via the fusion of gene expression profiles, such as the methods proposed by Tang et al. (WDC) [19], Zhang et al. (CoEWC) [20], Li et al. (PeC) [21], and Zhong et al. (JDC) [22]. GO annotations are another kind of biological information used to assign a weight to an edge [23], for example, the method GEG presented by Zhang et al. [24]. Previous studies have demonstrated that the number of protein domain types contained in a protein is highly correlated with its essentiality, for example, the model NPRI developed by Chen et al. [25]. Based on the relation between the orthology and essential proteins, Peng et al. proposed the method ION [26]. Recently, to further enhance accuracy, some methods generate an edge-weighted PIN by simultaneously fusing several kinds of biological information, such as esPOS [27], TEO [28], and TEGS [29].
To decrease the influence of noise or incompleteness inherently existing in PINs, the key point of PIN reconstruction methods is to reconstruct a PIN using biological information. In the study of Wang et al., a dynamic PIN (DPIN), which consists of a series of time-sequenced subnetworks that are static PINs, was constructed by combining gene expression data with PINs for denoising PINs [30]. The WPDINM model proposed by Meng et al. estimates the essentiality of proteins based on subcellular localization, orthologous information, and a novel weighted protein–domain interaction network constructed by PINs and gene expression profiles [31]. On the basis of the relations between protein functions and subcellular localization, Li et al. presented the SPP method [32]. Zhao et al. presented two methods, DSN and MON, by integrating PINs, protein domains, gene expression profiles, orthologous proteins, and subcellular localization information [33, 34].
The fundamental strategy of fusion methods is to identify essential proteins through weighted scores computed using other kinds of biological information or other methods, which are complementary, that is, the essential protein sets identified by these methods are different [18, 21, 35]. By fusion of PINs, orthologous proteins, and subcellular localization, the SON method was presented by Li et al. [36]. The LIDC method proposed by Luo et al. computes weighted scores using PINs and protein complex information [37]. Based on the TEGS method, Zhang et al. proposed the CEGSO method through fusing subcellular locations and two other methods [5], namely, IDC [37] and NOS. Based on the combination of local density, BC and IDC, Qin et al. presented the LBCC method [38].
Although all the previously mentioned identification methods have demonstrated good performance, they suffer from some disadvantages, and there is room for enhancement. For the methods based on centrality measures, the limitation is that these measures are not sufficient to fully characterize the features of essential proteins. There remains a need for efficient centrality measures that can compute the essentiality of both lowly and highly connected proteins, because lowly connected proteins may be essential and highly connected proteins may not be. For example, there are 321 essential proteins with at most 3 interactions and 809 non-essential proteins whose number of interactions is greater than average in the DIP dataset (see Section “Experiments and discussions”), which contains 1167 essential proteins out of 5093 proteins. This example is inconsistent with the assumption that highly connected proteins tend to be essential. Therefore, how to design a method to identify the two types of proteins by deeply analyzing the topological features of PINs is still an important question. For the methods based on fusing multi-source biological information, it is still a challenge to identify more inherent potential relations between essential proteins and biological properties in different kinds of biological information.
To tackle the limitations mentioned above, we present a novel method for identifying essential proteins, named CTF (the identification method of essential proteins based on edge features including h-quasi-cliques and uv-triangle graphs, and the fusion of multiple-source biological information). To our knowledge, it is the first time that the concepts of h-quasi-cliques and uv-triangle graphs are considered in the identification of essential proteins. The contributions of this paper are summarized as follows.
1. For constructing an edge-weighted PIN, we propose a function, named EWCT (the edge weight function based on the edge features h-quasi-cliques and uv-triangle graphs, combined with GO annotations), to weight edges.
2. To denoise PINs and further enhance performance, we construct an edge-weighted PIN using EWCT and a DPIN.
3. To further enhance accuracy, the CTF method computes three essentiality scores of proteins using three kinds of biological information, namely, protein complexes, subcellular localization, and orthologous information, and the CTF method is tuned by optimizing the weights of the different essentiality scores.
To verify the effectiveness and superiority of CTF, we design experiments on three different yeast PINs and compare CTF with 16 methods, including MON, PeC, TEGS, and LBCC. The results show that CTF has higher performance than the other methods.
Definitions and notations
Let us introduce some notations and terminologies before describing the CTF method in detail. A PIN is typically modeled by a simple graph \(G = (V, E)\) with a set of vertices V and a set of edges E, where vertices and edges represent proteins and interactions, respectively. For an edge \(e \in E\) incident on u and v, denote the edge e by \(e = (u, v)\) or (u, v), and we say that u and v are “adjacent” or u is a “neighbor” of v. The kth-order neighbors of vertex u are a set of vertices whose shortest path distances to u are equal to k, and the kth-order nearest neighbors of protein u are a set of vertices whose shortest path distances to u are less than or equal to k. In this paper, for convenience, we interchangeably use the terms “vertex” and “protein” without any confusion because of the one-to-one mapping between the vertex set and the protein set and similarly for “edge” and “interaction”.
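The two neighbor notions above can be illustrated with a breadth-first search. The following is a minimal sketch, assuming the PIN is stored as a dictionary mapping each protein to the set of its neighbors; the function names and the toy network are illustrative and not part of the original method. The sketch also assumes that u itself (distance 0) is excluded from its kth-order nearest neighbors.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distance (number of edges) from source to each reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj.get(x, ()):
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def kth_order_neighbors(adj, u, k):
    """Vertices whose shortest-path distance to u is exactly k."""
    return {v for v, d in bfs_distances(adj, u).items() if d == k}


def kth_order_nearest_neighbors(adj, u, k):
    """Vertices whose distance to u is at most k (ASSUMPTION: u itself is excluded)."""
    return {v for v, d in bfs_distances(adj, u).items() if 1 <= d <= k}


# Toy network with hypothetical protein labels: a - b - c - d, plus b - e
toy_adj = {
    "a": {"b"},
    "b": {"a", "c", "e"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": {"b"},
}
second_order = kth_order_neighbors(toy_adj, "a", 2)
within_two = kth_order_nearest_neighbors(toy_adj, "a", 2)
```

In the toy network, the second-order neighbors of a are c and e, and its second-order nearest neighbors additionally include b.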
In a simple graph \(G = (V, E)\), the “degree” of a vertex u is the number of edges incident on it. Let d(v) denote the degree of v, and N(v) denote the set of neighbors of v. The union of N(u) and N(v), denoted by \(N(u) \cup N(v)\), is the set of vertices that are in N(u) or N(v) or both N(u) and N(v), and the intersection of N(u) and N(v), denoted by \(N(u) \cap N(v)\), is the set of vertices that are in both N(u) and N(v). The set \(N(u) \cap N(v)\) is called the common neighbor set of u and v.
An edge-weighted graph is a graph that has a number, called a weight, associated with each edge. We denote the weight of the edge e incident on vertices u and v by w(e(u, v)).
Given a simple graph \(G = (V, E)\), G is a clique if u is adjacent to v for any two distinct vertices u and v of V. Therefore, a clique with n vertices has \((n * (n - 1)) / 2\) edges. The maximal clique problem is to find a clique that is not contained in any other clique in a graph. In real-world contexts, we need to relax the clique problem to an almost-clique problem, that is, to finding dense incomplete graphs, also called quasi-cliques, which generalize the notion of cliques. In our method, we define a variant of cliques: h-quasi-cliques.
Definition 1
(h-quasi-clique) For a simple graph G with n vertices, G is an h-quasi-clique if the number of edges in G is greater than or equal to \((n * (n - 1)) / 4\), that is, half the number of edges of a clique with n vertices.
Given a simple graph \(G = (V, E)\), if, for each \(v \in V\), G contains at least one subgraph that is a triangle and contains vertex v, we say that G is a triangle graph. A variant of a triangle graph is a uv-triangle graph.
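Definition 1 reduces to a single edge-count comparison. The sketch below assumes a simple undirected graph given as a vertex collection and an edge list; the function name is illustrative.

```python
def is_h_quasi_clique(vertices, edges):
    """Check Definition 1: |E| >= n * (n - 1) / 4 for a simple undirected graph."""
    n = len(set(vertices))
    # Treat (u, v) and (v, u) as the same undirected edge.
    edge_set = {frozenset(e) for e in edges}
    # Compare 4 * |E| with n * (n - 1) to stay in integer arithmetic.
    return 4 * len(edge_set) >= n * (n - 1)


# A path on 4 vertices has 3 edges, and 3 >= 4 * 3 / 4 = 3, so it qualifies.
path4 = is_h_quasi_clique("abcd", [("a", "b"), ("b", "c"), ("c", "d")])
# A path on 5 vertices has 4 edges, and 4 < 5 * 4 / 4 = 5, so it does not.
path5 = is_h_quasi_clique("abcde", [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")])
```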
Definition 2
(uv-triangle graph) Given a simple graph \(G = (V, E)\), we say that G is a uv-triangle graph if it satisfies the uv-triangle condition: there exists an edge \(e = (u, v)\) such that, for each vertex \(w \in V\), \(\{w, w_1, w_2\}\) induces a triangle in G for some \(w_1, w_2 \in \{u, v\} \cup (N(u) \cap N(v))\). Each such triangle is called a triangle graphlet of G.
For example, Fig. 1 illustrates a subgraph that is an h-quasi-clique and is also a uv-triangle graph, where the blue vertices belong to \(N(u) \cap N(v)\), and the gray vertices belong to \((N(u) \cup N(v)) - (N(u) \cap N(v))\).
For a graph G, if G is an h-quasi-clique and is also a uv-triangle graph, the density of the edges in G is much higher and can be used to measure the edge density of the subgraph.
Methods
Previous studies have shown that there are several strategies to upgrade the performance of the essential protein identification methods. The first one is to design novel centrality measures, which can provide crucial insights on the topological features of PINs. The second strategy is to denoise PINs to increase the precision of the interactions [29]. Another one is to identify essential proteins based on the fusion of other kinds of biological information or other kinds of identification methods.
In this study, we present a new identification method based on a new centrality measure, DPINs, and the fusion of three kinds of biological information, namely, protein complex, subcellular location, and orthologous information, as shown in Fig. 2.
Edge-weight function
There are four scores in the CTF method. The first one is a topological score computed based on an edge-weighted PIN. To construct an edge-weighted PIN, we first propose the EWCT function for the assignment of weights to edges.
The central idea of EWCT is to assign weights to the edges of PINs based on the edge features of the PINs and GO annotations. The topological features used in EWCT are h-quasi-cliques and uv-triangle graphs.
Theorem 1
Given a PIN \(G_p = (V_p, E_p)\), for \((u, v) \in E_p\), let \(C_1 = N(u) \cap N(v)\), and \(C_2 = N(w_1) \cup N(w_2)\), where \(w_1 \in \{u, v\}\) and \(w_2 \in C_1\). Let \(G_{uv} = (V_{uv}, E_{uv})\) be the induced subgraph on the vertex set \(\{u, v\} \cup C_1 \cup C_2\). If \(|V_{uv}| < 8\), then \(G_{uv}\) is an h-quasi-clique, and it is also a uv-triangle graph.
Proof
We first show that \(G_{uv}\) is an h-quasi-clique.
The number of edges in \(G_{uv}\) is computed below. Let \(n = |V_{uv}|\), \(n_1 = |C_1|, n_2 = |C_2|, w \in \{u, v\}, v_1 \in C_1\), and \(v_2 \in N(w) \cap N(v_1) \subseteq C_2\). Consequently, we have that \(n = n_1 + n_2 + 2\). Observe that vertices u, v, and \(v_1\) are vertices of a triangle in \(G_p\), and the number of these triangles is \(n_1\); vertices w, \(v_1\), and \(v_2\) are vertices of a triangle in \(G_p\), and the number of these triangles is \(n_2\). Therefore, the number of edges in \(G_{uv}\) is at least \(2n_1 + 2n_2 + 1 = 2n - 3\). The triangles formed by vertices u, v, and \(v_1\) or w, \(v_1\), and \(v_2\) are triangle graphlets of \(G_{uv}\).
In addition, for the clique \(C_{uv} = (V_c, E_c)\) on the vertex set \(\{u, v\} \cup C_1 \cup C_2\), we have \(|E_c| = n(n - 1) / 2 = (n_1 + n_2 + 2)(n_1 + n_2 + 1) / 2\).
Since n is an integer and \(0< n < 8\), Eq. (1) holds.
Thus, \(G_{uv}\) is an h-quasi-clique by Definition 1.
By the construction of \(G_{uv}\) and Definition 2, we get that \(G_{uv}\) is a uv-triangle graph. The theorem follows. \(\square\)
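Eq. (1) is not reproduced in this text. One reading consistent with the proof, offered here as an assumption rather than as the original equation, is that the lower bound \(2n - 3\) on the number of edges is at least half the edge count of the clique on the same vertices. Under that reading, the admissible range of n can be checked directly:

```latex
\[
2n - 3 \;\ge\; \frac{n(n-1)}{4}
\;\Longleftrightarrow\; 8n - 12 \ge n^2 - n
\;\Longleftrightarrow\; n^2 - 9n + 12 \le 0
\;\Longleftrightarrow\; \frac{9 - \sqrt{33}}{2} \le n \le \frac{9 + \sqrt{33}}{2}.
\]
```

Since \((9 - \sqrt{33})/2 \approx 1.63\) and \((9 + \sqrt{33})/2 \approx 7.37\), the inequality holds exactly for the integers \(2 \le n \le 7\). The lower end is not restrictive because \(V_{uv}\) always contains u and v, so \(n \ge 2\), and the upper end matches the condition \(|V_{uv}| < 8\) in Theorem 1.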
To the best of our knowledge, the average degree in a PIN is about 8, and about 60–85% of the proteins in a PIN have degrees less than or equal to 7, as shown in Table 1, which describes the degree properties of vertices in five PINs, namely, Gavin, Krogan, DIP, MIPS, and MBD. We may conclude that, in most cases, the vertex number of a maximal clique in a PIN is lower than 8, and the vertex number of \(G_{uv}\) is lower than 7. Hence, a PIN satisfies the conditions of Theorem 1 in most cases, that is, \(G_{uv}\) is an h-quasi-clique and is also a uv-triangle graph.
The important observation is that \(G_{uv}\) is characterized by the richness of triangle graphlets. The edge feature of \(G_{uv}\) can be used to compute the weight of (u, v).
To define the function EWCT, the two definitions below are used.
Definition 3
(Half of the Common Neighbors) For two vertices u and v in a PIN, the half of the common neighbors (HCN) of u and v is defined as Eq. (2).
Definition 4
(Summation of All Neighbor Supports) For two vertices u and v in a PIN, the summation of all neighbor supports (SANS) is the summation of the product of HCN(u, w) and HCN(w, v), where w is a common neighbor of u and v.
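Definition 4 can be sketched in code once HCN is available. Because Eq. (2) is not reproduced in this text, the sketch below takes HCN as a pluggable function; the default `hcn_assumed`, which returns half the number of common neighbors, is only a guess based on the name of the measure and may differ from the original definition. The graph representation (a dictionary of neighbor sets) and all function names are likewise illustrative.

```python
def common_neighbors(adj, u, v):
    """Common neighbor set N(u) ∩ N(v)."""
    return adj[u] & adj[v]


def hcn_assumed(adj, u, v):
    """ASSUMPTION: half the number of common neighbors; Eq. (2) may differ."""
    return len(common_neighbors(adj, u, v)) / 2


def sans(adj, u, v, hcn=hcn_assumed):
    """Definition 4: sum over common neighbors w of HCN(u, w) * HCN(w, v)."""
    return sum(
        hcn(adj, u, w) * hcn(adj, w, v) for w in common_neighbors(adj, u, v)
    )


# Toy graph: a clique on the four vertices u, v, x, y.
k4 = {p: {q for q in "uvxy" if q != p} for p in "uvxy"}
sans_uv = sans(k4, "u", "v")
```

In the 4-clique, u and v share the two common neighbors x and y, and every pair of vertices has two common neighbors, so under the assumed HCN each term of the sum is 1 × 1 and the SANS value of (u, v) is 2.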
Note that, as illustrated above, the subgraph induced by the vertex set \(\{u, v\} \cup (N(u) \cap N(v)) \cup ((N(u) \cap N(w)) \cup (N(v) \cap N(w)))\) is an h-quasi-clique in most cases and is also a uv-triangle graph.
On the basis of HCN and SANS, we define the function EWCT by Eq. (4) used to compute the importance of edge \(e = (u, v)\). In addition, GO annotations can be used to adjust the weights of the edges as stated above. We use the function Go(v, u) proposed by Wang [39] to adjust the edge weights, where the value of Go(v, u) is between 0 and 1.
For two vertices u and v in a PIN, the function EWCT is defined as Eq. (4), where the divisor in Eq. (4) is used to balance the difference of the neighbor numbers for different vertices.
The value of EWCT(u, v) is highly correlated with the two edge features, namely, h-quasi-cliques and uv-triangle graphs.
For example, previous studies have shown that the neighborhood topology of a PIN is highly correlated with the essentiality of proteins. Based on the neighborhood topology of a PIN, four kinds of subgraphs occur frequently in PINs as shown in Fig. 3, called \(T_1\)-Graph, \(T_2\)-Graph, \(T_3\)-Graph, and \(T_4\)-Graph, where the edge \(e = (u, v)\) will be assigned a weight. As detailed in Fig. 3, the solid edges are the characterizing edges used to compute the weight of e. The features of these four graphlets are described in Table 2. If we only consider topological features by omitting GO annotations in Eq. (4), that is, set Go(u, v) to 1, the EWCT values of the edge e in \(T_1\)-Graph, \(T_2\)-Graph, \(T_3\)-Graph, and \(T_4\)-Graph are 0, 0, 0.2, and 1.2, respectively. That is, a higher EWCT value indicates a more important edge.
Furthermore, we also analyze the computational complexity of the EWCT method. The basic operation of EWCT is to compute the common neighbor set of u and v, that is, \(N(u) \cap N(v)\). Therefore, the computational complexity of EWCT for one edge is \(O(d(u) \times \log (d(v)))\). To compute the weights for all \(e \in E\), the computational complexity is \(O(|E| \times d(u) \times \log (d(v)))\). As the average degree in a PIN is about 8, the EWCT function can be efficiently computed.
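One way to obtain the stated bound for the common neighbor computation is to keep each neighbor list sorted and to binary-search one list for every element of the other. The sketch below assumes that representation; it is an illustration of the complexity argument, not necessarily how the original implementation computes the intersection, and the function names are hypothetical.

```python
from bisect import bisect_left


def common_neighbors_sorted(nu, nv):
    """Common elements of two sorted neighbor lists.

    Iterates over nu and binary-searches nv, which costs
    O(len(nu) * log(len(nv))) comparisons.
    """
    common = []
    for w in nu:
        i = bisect_left(nv, w)
        if i < len(nv) and nv[i] == w:
            common.append(w)
    return common


def all_edge_common_neighbors(sorted_adj):
    """Common neighbor list of every undirected edge, keyed by (u, v) with u < v."""
    result = {}
    for u, nbrs in sorted_adj.items():
        for v in nbrs:
            if u < v:  # visit each undirected edge once
                result[(u, v)] = common_neighbors_sorted(nbrs, sorted_adj[v])
    return result


# Toy network with hypothetical labels; every neighbor list is sorted.
toy = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
per_edge = all_edge_common_neighbors(toy)
```

Iterating over the shorter of the two lists and searching the longer one keeps the per-edge cost low for hub proteins.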
Construction of an edge-weighted PIN
It is well known that PINs obtained through high-throughput methods have a high level of noise, which leads to difficulties in identifying essential proteins. In addition, interactions among proteins are dynamic in a cell, whereas a PIN, also called a static PIN to distinguish it from a DPIN, cannot reflect the dynamic feature of interactions.
To tackle these two problems, especially the noise in the form of false positives, we construct a DPIN by combining static PINs with gene expression profiles. This paper applies the 3-sigma method proposed by Wang et al. to construct DPINs [30].
A DPIN is defined as a 4-tuple \(DG = (V, E, T, \text {ATE})\), where V and E correspond to proteins and interactions of PINs, respectively, \(T = \{T_i | 1 \le i \le n\}\) is a set of active time points for proteins, and ATE is a function whose value is the active time attribute set of proteins. A snapshot of a DPIN is defined as a 3-tuple \(DG_{i} = (V_i, E_i, \text {ATE}(u, v, T_i))\), where \(V_i \subseteq V\) and \(E_i \subseteq E\) are the vertices and edges that are active at time point \(T_i \in T\), \(\text {ATE}(u, v, T_i)\) is used to compute the active probability of vertices u and v in \(V_i\) at time point \(T_i\), and \(i \in [1,|T|]\).
Given a DPIN subnetwork \(DG_i = (V_i, E_i, {\text {ATE}}(u, v, T_i))\), the weight of edge (u, v) is computed using the function \(\text {EWD}(u, v, T_i)\) as Eq. (5). Recall that gene expression profiles are used to construct DPINs, and the gene expression profiles used in our experiments contain 12 time points per cycle. Therefore, the number of active time points is 12 for a gene in a cycle, that is, \(|T| = 12\).
As detailed in Algorithm 1, the method CEP (construction of an edge-weighted PIN) is used to construct an edge-weighted PIN. CEP contains 12 iterations, and each iteration processes a DPIN subnetwork and consists of two major steps. First, we compute the EWD values by Eq. (5); then, we delete the trivial edges.
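The snapshot construction can be sketched as follows: an edge belongs to the snapshot at time point \(T_i\) when both of its endpoints are active at \(T_i\). The activity threshold used here, \(\mu + 3\sigma(1 - F)\) with \(F = 1/(1 + \sigma^2)\), is the form commonly quoted for the 3-sigma method; it is not given in this text, so it should be read as an assumption, as should the use of the population standard deviation, the strict comparison, and all function names and toy expression values.

```python
from statistics import mean, pstdev


def active_threshold(profile):
    """ASSUMED 3-sigma style threshold: mu + 3 * sigma * (1 - F), F = 1 / (1 + sigma^2)."""
    mu, sigma = mean(profile), pstdev(profile)
    f = 1.0 / (1.0 + sigma ** 2)
    return mu + 3.0 * sigma * (1.0 - f)


def active_time_points(profile):
    """Indices of time points at which the expression value exceeds the threshold."""
    th = active_threshold(profile)
    return {i for i, x in enumerate(profile) if x > th}


def snapshots(edges, expression, n_points):
    """Edge list of each snapshot: both endpoints must be active at that time point."""
    active = {p: active_time_points(expr) for p, expr in expression.items()}
    return [
        [(u, v) for u, v in edges
         if i in active.get(u, set()) and i in active.get(v, set())]
        for i in range(n_points)
    ]


# Hypothetical expression profiles with 12 time points each.
expr = {
    "A": [0.1] * 6 + [0.5] * 6,              # active at time points 6..11
    "B": [0.5] * 3 + [0.1] * 6 + [0.5] * 3,  # active at 0..2 and 9..11
    "C": [0.2] * 12,                         # flat profile, never active
}
snaps = snapshots([("A", "B"), ("A", "C")], expr, 12)
ab_active = [i for i, s in enumerate(snaps) if ("A", "B") in s]
```

In this toy example the interaction (A, B) survives only in the last three snapshots, and (A, C) is removed from every snapshot because C is never active.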
The interactions with high weights tend to connect essential proteins. After obtaining an edge-weighted PIN, it will be used to compute the topological score of a protein.
Essentiality scores based on edge features
For protein u in an edge-weighted PIN, the topological score function defined by Eq. (6), named TS(u), is used to compute the topological score of u based on the weights of the edges incident on u.
Normally, the range of TS(u) is from 0 to 100. Accordingly, if the value of TS(u) is too high, it is treated as an abnormal value. In fact, most of the proteins with too high topology scores are not essential, and their topology scores are assigned 0 by a threshold. In practice, we take 1000 as the threshold of TS(u). For example, as shown in Fig. 4a and b, respectively, there are 32 and 25 high-score proteins arranged in circles, whose scores are greater than 1000 in the Gavin dataset. The subgraph induced by these proteins is a quasi-clique. The quasi-clique with 1 essential protein has 32 vertices and 458 edges in Fig. 4a, and the quasi-clique with 3 essential proteins has 25 vertices and 289 edges in Fig. 4b. For the proteins arranged in a circle, their topology scores are set to zero.
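The thresholding step can be sketched as follows. Eq. (6) is not reproduced in this text, so the sketch assumes that TS(u) is the sum of the weights of the edges incident on u; only the cut-off of 1000 and the reset to zero are taken from the description above. The function name and the toy weights are hypothetical.

```python
def topological_scores(weighted_edges, outlier_threshold=1000.0):
    """Per-protein score from an edge-weighted PIN, with abnormal scores set to 0.

    ASSUMPTION: the raw score is the sum of incident edge weights;
    the original Eq. (6) may differ.
    """
    raw = {}
    for (u, v), w in weighted_edges.items():
        raw[u] = raw.get(u, 0.0) + w
        raw[v] = raw.get(v, 0.0) + w
    # Scores above the threshold are treated as abnormal and reset to zero.
    return {p: (0.0 if s > outlier_threshold else s) for p, s in raw.items()}


# Hypothetical edge weights chosen so that one protein exceeds the threshold.
toy_weights = {("a", "b"): 600.0, ("a", "c"): 500.0, ("b", "c"): 10.0}
ts = topological_scores(toy_weights)
```

Here protein a accumulates a raw score of 1100, which exceeds 1000 and is therefore reset to 0, while b and c keep their scores of 610 and 510.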
Essentiality scores based on biological information
As pointed out above, previous studies indicate that the use of biological information can improve the accuracy of essential protein identification. This paper applies three kinds of biological information, namely, protein complexes, subcellular localizations, and orthologous information.
A protein complex is a group of proteins that mutually interact, that is, protein complexes are substructures of a PIN. For a protein in a complex, the essentiality highly positively correlates with the participation degree [40].
Subcellular localization information is vital to understand the functions of proteins and is easily obtained. From a biological view, two proteins can interact only if they are located in the same subcellular compartment [41]. Subcellular localization information can be used to reduce the noise in PINs and is helpful for further improvement of identification accuracy.
Because orthologous proteins have evolved from a common ancestor, they often perform the same function. The SON method proposed by Li et al. applied orthologous information, subcellular localization, and PINs to identify essential proteins [36]. Some previous studies also showed that the identification accuracy of essential proteins could be improved using orthologous information.
Based on these reports, this paper identifies essential proteins by the fusion of three kinds of biological information mentioned above.
CTF method
Comparisons of the essential protein sets identified by the methods TS, IDC, SCIS, and NOS show that these methods are complementary. In this paper, we first compute the essentiality score of each protein from four scores, namely, the topological score TS and three biological information scores, as shown in Eq. (7), where IDC(u), SCIS(u), and NOS(u) are obtained from protein complexes, subcellular localizations, and orthologous information, respectively. These four scores are combined via a linear combination. Then, we rank proteins by essentiality scores in descending order. The higher-ranked proteins are more likely to be essential proteins, that is, we can choose the top k proteins as essential candidates.
Note that the value of NOS ranges from 0 to 1 in practice. By contrast, those of TS, IDC, and SCIS range from 0 to 100, that is, the value of NOS is much smaller than those of TS, IDC, and SCIS. Therefore, the value of NOS is amplified 100-fold in Eq. (7) to bring the four scores to the same scale.
The parameter \(\alpha \in\) [0, 1] is used to tune the relative contributions of the four components TS, IDC, NOS, and SCIS. If \(\alpha\) is set to 1, the essentiality score is determined by TS and IDC, and if \(\alpha\) is set to 0, the essentiality score is determined by NOS and SCIS. If \(\alpha\) is between 0 and 1, essentiality scores are computed according to the percentages of TS, IDC, NOS, and SCIS. In CTF, \(\alpha\) is set to 0.4, and the reason is described in Subsection “Parameter settings”.
The details of the CTF method are described in Algorithm 2.
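The fusion step can be illustrated with a short sketch. Eq. (7) is not reproduced in this text; the form used below, \(\alpha (TS + IDC) + (1 - \alpha)(100 \cdot NOS + SCIS)\), is one linear combination consistent with the description above (TS and IDC dominate at \(\alpha = 1\), NOS and SCIS at \(\alpha = 0\), and NOS is amplified 100-fold) and should be treated as an assumption. The function names are hypothetical.

```python
def ctf_score(ts, idc, scis, nos, alpha=0.4):
    """ASSUMED fusion: alpha * (TS + IDC) + (1 - alpha) * (100 * NOS + SCIS)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * (ts + idc) + (1.0 - alpha) * (100.0 * nos + scis)


def rank_proteins(scores):
    """Protein identifiers sorted by essentiality score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical component scores for a single protein.
only_topology_and_complex = ctf_score(ts=10.0, idc=20.0, scis=30.0, nos=0.5, alpha=1.0)
only_orthology_and_location = ctf_score(ts=10.0, idc=20.0, scis=30.0, nos=0.5, alpha=0.0)
ranking = rank_proteins({"p1": 5.0, "p2": 9.0, "p3": 7.0})
```

With these hypothetical values, \(\alpha = 1\) yields 10 + 20 = 30 and \(\alpha = 0\) yields 100 × 0.5 + 30 = 80, which mirrors the two limiting cases described for the parameter.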
Experiments and discussions
Experimental data
In this study, multiple biological datasets from the baker’s yeast Saccharomyces cerevisiae are used, namely, PINs, GO annotations, gene expression profiles, subcellular localizations, protein complexes, orthologous information, and standard essential proteins. Saccharomyces cerevisiae has been widely used for essential protein studies because it is one of the most intensively studied organisms in molecular and cell biology, and it contains the most complete PPIs and rich biological information. Therefore, we evaluate the performance of CTF based on Saccharomyces cerevisiae datasets as shown in Table 3.
Comparisons with other methods
To show the advantage of our method CTF, three kinds of comparison are used, namely, statistical measures, the top k proteins, and receiver operating characteristic (ROC) and precision-recall (PR) curves.
Comparisons of statistical measures
For comparisons of CTF with some other existing algorithms, six statistical measures are employed, namely, sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure (F), and accuracy (ACC). These measures are commonly used to measure the performance of essential protein identification. Let TP and TN denote the number of samples of the essential and non-essential proteins, which are identified correctly, respectively, and FN and FP denote the number of samples of the essential proteins and non-essential proteins, which are identified wrongly, respectively. These measures mentioned above are described as shown in Eqs. (8–13).
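Eqs. (8–13) are not reproduced in this text. The sketch below uses the standard definitions of the six measures in terms of TP, TN, FP, and FN, with the F-measure taken as the harmonic mean of SN and PPV, which is the usual convention and is assumed here. The function name and the example counts are illustrative, and all denominators are assumed to be non-zero.

```python
def statistical_measures(tp, tn, fp, fn):
    """Standard confusion-matrix measures; assumes every denominator is non-zero."""
    sn = tp / (tp + fn)            # sensitivity (true positive rate)
    sp = tn / (tn + fp)            # specificity
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    f = 2 * sn * ppv / (sn + ppv)  # F-measure: harmonic mean of SN and PPV
    acc = (tp + tn) / (tp + tn + fp + fn)
    return {"SN": sn, "SP": sp, "PPV": ppv, "NPV": npv, "F": f, "ACC": acc}


# Hypothetical counts, not taken from the paper's tables.
m = statistical_measures(tp=30, tn=40, fp=20, fn=10)
```

For these hypothetical counts, SN is 30/40 = 0.75, PPV is 30/50 = 0.6, NPV is 40/50 = 0.8, and ACC is 70/100 = 0.7.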
According to previously published studies, about 20–30% of all proteins are essential in a PIN. Therefore, we choose the top 25% as essential proteins and the others as non-essential proteins. For CTF, the lowest scores of essential proteins are 21.36 in DIP (1167th), 21.2 in Krogan (929th), and 23.355 in Gavin (714th). The average of the lowest scores is 21.97 in three datasets. Therefore, we take 22 as the threshold for CTF.
If we only use the threshold to choose essential proteins, for some datasets, the size of the candidate set may be inappropriate. Therefore, the evaluation model of this paper is described as follows. Let s be the number of the essential candidates chosen by a threshold and r be 25% of the size of the dataset, then we choose the top \((s + r)/2\) as the essential candidates. Actually, experiment results show that the evaluation model is better than the simple threshold model or the top k model.
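The evaluation model above can be sketched as follows. The threshold of 22 and the 25% fraction come from the text; whether a score equal to the threshold counts as a candidate, and how \((s + r)/2\) and 25% of the dataset size are rounded, are not stated, so the choices below (inclusive comparison, rounding r, integer division) are assumptions, as are the function name and toy scores.

```python
def select_candidates(scores, threshold=22.0, fraction=0.25):
    """Top (s + r) / 2 proteins, where s counts scores meeting the threshold
    and r is the given fraction of the dataset size."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    s = sum(1 for p in ranked if scores[p] >= threshold)  # ASSUMPTION: inclusive
    r = round(fraction * len(ranked))                      # ASSUMPTION: rounded
    k = (s + r) // 2                                       # ASSUMPTION: floor
    return ranked[:k]


# Hypothetical scores for eight proteins.
toy_scores = {"p1": 50, "p2": 40, "p3": 30, "p4": 25, "p5": 23, "p6": 10, "p7": 5, "p8": 1}
candidates = select_candidates(toy_scores)
```

In the toy data, five proteins meet the threshold (s = 5) and 25% of eight proteins is two (r = 2), so the top (5 + 2) // 2 = 3 proteins are selected.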
We compare CTF with 14 existing methods, including MON, JDC, and LBCC on the DIP, Krogan, and Gavin datasets. The results are shown in Tables 4, 5 and 6.
The comparison results show that CTF outperforms the other methods on DIP (Table 4) and Krogan (Table 5), and CTF outperforms other methods in terms of three measures, namely, SN, NPV, and F-measure on Gavin (Table 6). Therefore, the CTF method has better performance than the other existing methods.
Comparisons of top k proteins
Similar to most comparisons, we also carry out comparisons of the top k proteins between CTF and other methods. We first rank proteins by essential scores in descending order, then choose the top k proteins as essential candidates and determine how many of these are essential.
To evaluate the performance of CTF, we compare it with 16 methods, namely, NC, PeC, WDC, ION, CoEWC, LAC, GEG, SON, LBCC, TEO, esPOS, TEGS, JDC, DSN, MON, and CEGSO on the DIP, Krogan, and Gavin datasets. The results are listed in Tables 7, 8, and 9, which show the number of essential proteins among the top k-ranked proteins, where k is set to 100, 200, 300, 400, 500, and 600. The results show that CTF outperforms the other compared methods in more than half of all cases.
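The top k comparison amounts to counting known essential proteins among the k highest-scoring proteins. A minimal sketch is given below; the function name and toy data are illustrative, and ties are broken by the stable sort order, which is an arbitrary choice.

```python
def essential_in_top_k(scores, essential, ks=(100, 200, 300, 400, 500, 600)):
    """Number of known essential proteins among the top k ranked proteins, per k."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {k: sum(1 for p in ranked[:k] if p in essential) for k in ks}


# Hypothetical scores and gold-standard set, with small k values for illustration.
toy_scores = {"p1": 9.0, "p2": 8.0, "p3": 7.0, "p4": 6.0, "p5": 5.0}
toy_essential = {"p1", "p3", "p5"}
counts = essential_in_top_k(toy_scores, toy_essential, ks=(1, 2, 4))
```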
Comparison of ROC and PR curves
ROC and PR curves are commonly used to visually evaluate the performance of identification methods. A ROC curve is a graphical plot created by plotting the true positive rate (TPR, also called the sensitivity (SN), represented as Eq. (8)) against the false positive rate (FPR, represented as Eq. (14)), and a PR curve is a graphical plot created by plotting the TPR against the PPV.
As stated above, the proteins obtained by the methods are ranked by their scores in descending order. We choose the score of the kth protein as the threshold for CTF. The top k proteins are put into the positive set, which is the candidate set of essential proteins, and the others are put into the negative set, which is the candidate set of non-essential proteins, where \(1 \le k \le 5093\) on the DIP data, \(1 \le k \le 3672\) on the Krogan data, and \(1 \le k \le 1855\) on the Gavin data. Then, the values of TPR, FPR, and PPV are calculated and plotted in the ROC and PR curves.
The area under the ROC or PR curve (AUC) is a measure used to evaluate the performance of identification methods. In general, a larger AUC value means better identification performance. The AUC values of ROC and PR for CTF and other existing methods are illustrated in Fig. 5.
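The procedure described above, sweeping k over the ranked list and recording TPR, FPR, and PPV at each cut-off, can be sketched as follows. The FPR is taken as FP / (FP + TN), the standard definition, since Eq. (14) is not reproduced here. The area is approximated with the trapezoidal rule, which is one common choice and not necessarily the one used for Fig. 5. The sketch assumes at least one essential and one non-essential protein, and the function names are hypothetical.

```python
def roc_pr_points(scores, essential):
    """ROC points (FPR, TPR) and PR points (recall, precision) for k = 1..n."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    pos = sum(1 for p in ranked if p in essential)
    neg = len(ranked) - pos
    tp = fp = 0
    roc, pr = [(0.0, 0.0)], []
    for p in ranked:  # the first k proteins form the positive set
        if p in essential:
            tp += 1
        else:
            fp += 1
        roc.append((fp / neg, tp / pos))
        pr.append((tp / pos, tp / (tp + fp)))
    return roc, pr


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points ordered by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical perfect ranking: both essential proteins are ranked first.
toy_scores = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
roc, pr = roc_pr_points(toy_scores, essential={"a", "b"})
roc_auc = trapezoid_area(roc)
```

For the perfect toy ranking, the ROC curve rises to TPR = 1 before any false positive appears, so the area under it is 1.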
Figure 5 indicates that CTF is very effective. In ROC analysis, CTF (blue) outperforms the other existing methods on three datasets as shown in Fig. 5a–c, and for PR analysis, CTF (blue) also outperforms the other existing methods on DIP and Krogan as shown in Fig. 5d and e. CTF has good performance on Gavin as shown in Fig. 5f. According to the AUC values annotated in Fig. 5, the AUC values for CTF are significantly higher than those of the other existing methods.
Ablation study
To elucidate the contributions of the CTF method, we perform an ablation study to investigate whether the EWCT-based measure TS and the usage of DPINs provide improvements in the identification performance. For investigating the effect of TS, we only use TS scores to identify essential proteins, and for investigating the effect of DPINs, we use static PINs instead of DPINs to compute the TS scores of the proteins.
Effect of the EWCT-based measure TS
To investigate the effects of the TS measure, we conduct an ablation study by removing three scores, namely, IDC, SCIS, and NOS, from CTF. That is, we only use the topological scores computed by TS to identify the essential proteins and compare the results with other centrality measures, such as BC, SC, and LAC. The results in Tables 10 and 11 show that TS can identify more essential proteins than the other five centrality measures in most cases (83%) on static PINs and in all cases on DPINs. That is, TS outperforms other centrality measures, such as BC, SC, and LAC.
Further analysis indicates that some proteins are identified as essential by the TS measure but as non-essential by other centrality measures, such as BC, SC, and LAC. The common feature of these proteins is that they have low connectivity (degree) but are surrounded by many triangle graphlets formed by their second-order nearest neighbors. For example, as shown in Fig. 6, the proteins YPL217C in DIP, YAL034W-A in Gavin, and YHR065C in Krogan are identified as essential by TS but as non-essential by BC, SC, and LAC.
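The observation above, low degree combined with many triangles in the surrounding neighborhood, can be illustrated on a toy graph. The sketch below is not the TS or EWCT computation; it only contrasts node degree with the number of triangles among a node's first- and second-order neighbors. The helper names, the toy edges, and the reading of "second-order nearest neighbors" as all nodes within two hops are assumptions made for illustration, and the brute-force triangle count is only suitable for small examples.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Graph = Dict[str, Set[str]]


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected adjacency map from an edge list."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def two_hop_neighbourhood(graph: Graph, node: str) -> Set[str]:
    """All nodes within two hops of `node`, excluding the node itself."""
    first = graph[node]
    second = set().union(*(graph[n] for n in first))
    return (first | second) - {node}


def triangles_within(graph: Graph, nodes: Set[str]) -> int:
    """Count triangles whose three vertices all lie in `nodes` (brute force)."""
    count = 0
    for a, b, c in combinations(sorted(nodes), 3):
        if b in graph[a] and c in graph[a] and c in graph[b]:
            count += 1
    return count


# Hypothetical toy network: P has a single neighbor Q, which sits in a dense
# cluster; H is a hub whose five neighbors are not connected to each other.
toy_edges = [
    ("P", "Q"), ("Q", "X"), ("Q", "Y"), ("Q", "Z"),
    ("X", "Y"), ("X", "Z"), ("Y", "Z"),
    ("H", "h1"), ("H", "h2"), ("H", "h3"), ("H", "h4"), ("H", "h5"),
]
toy = build_graph(toy_edges)
for node in ("P", "H"):
    degree = len(toy[node])
    triangles = triangles_within(toy, two_hop_neighbourhood(toy, node))
    print(node, "degree:", degree, "triangles within two hops:", triangles)
```

In this toy network, a purely degree-based ranking would favor H over P, whereas a measure that is sensitive to triangles in the neighborhood would favor P.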
Effect of DPINs
To demonstrate the effect of DPINs on the performance of CTF, we conducted ablation experiments that identify essential proteins using DPINs and static PINs, respectively. As shown in Table 12, CTF identifies more essential proteins with DPINs than with static PINs. That is, DPINs play an important role in the performance of CTF.
Parameter settings
To balance the weights of the different components in CTF and thereby improve accuracy, a proportional parameter \(\alpha \in [0.1, 0.9]\) is adopted. Table 13 shows the number of essential proteins among the top k proteins, where k is set to 100, 200, 300, 400, 500, and 600 on the three datasets, and \(\alpha\) is set to 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9. The highest number of essential proteins in each case is shown in bold. From the numbers in Table 13, we find that CTF performs best when \(\alpha\) is set to 0.4.
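A parameter sweep of this kind can be sketched as follows. The fusion rule used here, a plain linear combination \(\alpha \cdot \text{TS} + (1 - \alpha) \cdot \text{bio}\), is an assumption made only so that the example runs; it stands in for the actual CTF fusion defined earlier in the paper. The function names and the toy scores are likewise hypothetical.

```python
from typing import Dict, List, Sequence, Set


def fuse(ts: Dict[str, float], bio: Dict[str, float], alpha: float) -> Dict[str, float]:
    """Assumed linear fusion of a topological score and a biological score."""
    return {p: alpha * ts[p] + (1 - alpha) * bio.get(p, 0.0) for p in ts}


def top_k_hits(scores: Dict[str, float], essential: Set[str], k: int) -> int:
    """Number of known essential proteins among the k highest-scoring proteins."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sum(1 for p in ranked[:k] if p in essential)


def sweep_alpha(
    ts: Dict[str, float],
    bio: Dict[str, float],
    essential: Set[str],
    ks: Sequence[int],
    alphas: Sequence[float],
) -> Dict[float, List[int]]:
    """For each alpha, count essential proteins in the top k for every k in ks."""
    return {a: [top_k_hits(fuse(ts, bio, a), essential, k) for k in ks] for a in alphas}


# Toy data (hypothetical); the paper uses k = 100, ..., 600 on real PINs.
toy_ts = {"A": 0.9, "B": 0.2, "C": 0.5, "D": 0.1}
toy_bio = {"A": 0.1, "B": 0.9, "C": 0.6, "D": 0.2}
toy_known = {"B", "C"}
alphas = [round(i / 10, 1) for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
table = sweep_alpha(toy_ts, toy_bio, toy_known, ks=[1, 2], alphas=alphas)
for a, hits in table.items():
    print(a, hits)
```

Each row of the printed output corresponds to one value of \(\alpha\) and each column to one value of k, which mirrors the layout described for Table 13.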
Conclusion
Essential proteins are very important for living organism survival, disease diagnosis and treatment, and drug design. The massively increasing number of PINs has enabled us to identify essential proteins using computing methods. To further improve the accuracy of identification, better centrality measures and the fusion of biological information are two crucial techniques.
In this paper, we presented the CTF method, based on h-quasi-cliques, uv-triangle graphs, and the fusion of three kinds of biological information. CTF first constructs an edge-weighted PIN to compute the topological scores of proteins and then computes the other three essentiality scores on the basis of three kinds of biological information. The analysis and experiments indicate that CTF has the following advantages. First, our method proposes the EWCT function for constructing an edge-weighted PIN, which is used to compute the topological scores of proteins based on h-quasi-cliques, uv-triangle graphs, and GO annotations. EWCT provides insight into the inherent topological features of essential proteins. Second, to reduce the noise in PINs, CTF constructs the edge-weighted PIN using DPINs. Third, CTF further improves the accuracy of identification through the fusion of three kinds of biological information. The experimental results on three PIN datasets show that CTF achieves higher performance than other existing methods in terms of six statistical measures, including sensitivity, specificity, and F-measure.
Defining a good centrality measure based on the topological features of PINs remains an important issue, and denoising PINs is another. In future work, we plan to design better centrality measures and to denoise PINs for identifying essential proteins.
Availability of data and materials
The datasets used in this study, including PINs, GO annotations, gene expression profiles, subcellular localizations, protein complexes, orthologous information, and standard essential proteins, are from public databases. The source code of the CTF method is available from the corresponding author upon request.
Abbreviations
- CEP: Construction of an edge-weighted PIN
- CTF: Identification method of essential proteins based on edge features including h-quasi-Cliques and uv-triangle graphs, and the Fusion of multiple-source biological information
- DPIN: Dynamic PIN
- EWCT: Edge weight function based on edge features h-quasi-Cliques and uv-triangle graphs by combining with GO annotations
- HCN: Half of the common neighbors
- SANS: Summation of all neighbor supports
- TS(\(\cdot\)): Topological score function
References
Giaever G, Chu AM, Ni L, Connelly C, Riles L, Véronneau S, Dow S, Lucau-Danila A, Anderson K, André B. Functional profiling of the Saccharomyces cerevisiae genome. Nature. 2002;418(6896):387–91.
Lu X, Wang X, Ding L, Li J, Gao Y, He K. frdriver: a functional region driver identification for protein sequence. IEEE/ACM Trans Comput Biol Bioinform. 2020;18(5):1773–83.
Cullen LM, Arndt GM. Genome-wide screening for gene function using RNAi in mammalian cells. Immunol Cell Biol. 2005;83(3):217–23.
Lu X, Qian X, Li X, Miao Q, Peng S. Dmcm: a data-adaptive mutation clustering method to identify cancer-related mutation clusters. Bioinformatics. 2019;35(3):389–97.
Zhang W, Xue X, Xie C, Li Y, Liu J, Chen H, Li G. Cegso: boosting essential proteins prediction by integrating protein complex, gene expression, gene ontology, subcellular localization and orthology information. Interdiscip Sci: Comput Life Sci. 2021;13(3):349–61.
Jeong H, Mason SP, Barabási A-L, Oltvai ZN. Lethality and centrality in protein networks. Nature. 2001;411(6833):41–2.
Zotenko E, Mestre J, O’Leary DP, Przytycka TM. Why do hubs in the yeast protein interaction network tend to be essential: reexamining the connection between the network topology and essentiality. PLoS Comput Biol. 2008;4(8):1000140.
Hahn MW, Kern AD. Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Mol Biol Evol. 2005;22(4):803–6.
Joy MP, Brock A, Ingber DE, Huang S. High-betweenness proteins in the yeast protein interaction network. J Biomed Biotechnol. 2005;2005(2):96.
Wuchty S, Stadler PF. Centers of complex networks. J Theor Biol. 2003;223(1):45–53.
Estrada E, Rodriguez-Velazquez JA. Subgraph centrality in complex networks. Phys Rev E. 2005;71(5): 056103.
Bonacich P. Power and centrality: a family of measures. Am J Sociol. 1987;92(5):1170–82.
Stephenson K, Zelen M. Rethinking centrality: methods and examples. Soc Netw. 1989;11(1):1–37.
Li M, Lu Y, Wang J, Wu F-X, Pan Y. A topology potential-based method for identifying essential proteins from PPI networks. IEEE/ACM Trans Comput Biol Bioinform. 2014;12(2):372–83.
Wang J, Li M, Wang H, Pan Y. Identification of essential proteins based on edge clustering coefficient. IEEE/ACM Trans Comput Biol Bioinform. 2011;9(4):1070–80.
Li S, Chen Z, He X, Zhang Z, Pei T, Tan Y, Wang L. An iteration method for identifying yeast essential proteins from weighted PPI network based on topological and functional features of proteins. IEEE Access. 2020;8:90792–804.
He X, Kuang L, Chen Z, Tan Y, Wang L. Method for identifying essential proteins by key features of proteins in a novel protein-domain network. Front Genet. 2021;12:1081.
Zeng M, Li M, Fei Z, Wu F-X, Li Y, Pan Y, Wang J. A deep learning framework for identifying essential proteins by integrating multiple types of biological information. IEEE/ACM Trans Comput Biol Bioinform. 2019;18(1):296–305.
Tang X, Wang J, Pan Y. Identifying essential proteins via integration of protein interaction and gene expression data. In: 2012 IEEE International Conference on Bioinformatics and Biomedicine. IEEE; 2012. p. 1–4.
Zhang X, Xu J, Xiao W-X. A new method for the discovery of essential proteins. PLoS ONE. 2013;8(3):58763.
Li M, Zhang H, Wang J-X, Pan Y. A new essential protein discovery method based on the integration of protein-protein interaction and gene expression data. BMC Syst Biol. 2012;6(1):1–9.
Zhong J, Tang C, Peng W, Xie M, Sun Y, Tang Q, Xiao Q, Yang J. A novel essential protein identification method based on PPI networks and gene expression data. BMC Bioinform. 2021;22(1):1–21.
Wang JZ, Du Z, Payattakool R, Yu PS, Chen C-F. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–81.
Zhang W, Xu J, Li X, Zou X. A new method for identifying essential proteins by measuring co-expression and functional similarity. IEEE Trans Nanobiosci. 2016;15(8):939–45.
Chen Z, Meng Z, Liu C, Wang X, Kuang L, Pei T, Wang L. A novel model for predicting essential proteins based on heterogeneous protein-domain network. IEEE Access. 2020;8:8946–58.
Peng W, Wang J, Wang W, Liu Q, Wu F-X, Pan Y. Iteration method for predicting essential proteins based on orthology and protein-protein interaction networks. BMC Syst Biol. 2012;6(1):1–17.
Zhang Z, Ruan J, Gao J, Wu F-X. Predicting essential proteins from protein–protein interactions using order statistics. J Theor Biol. 2019;480:274–83.
Zhang W, Xu J, Li Y, Zou X. Detecting essential proteins based on network topology, gene expression data, and gene ontology information. IEEE/ACM Trans Comput Biol Bioinform. 2016;15(1):109–16.
Zhang W, Xu J, Zou X. Predicting essential proteins by integrating network topology, subcellular localization information, gene expression profile and GO annotation data. IEEE/ACM Trans Comput Biol Bioinform. 2019;17(6):2053–61.
Wang J, Peng X, Li M, Pan Y. Construction and application of dynamic protein interaction network based on time course gene expression data. Proteomics. 2013;13(2):301–12.
Meng Z, Kuang L, Chen Z, Zhang Z, Tan Y, Li X, Wang L. Method for essential protein prediction based on a novel weighted protein-domain interaction network. Front Genet. 2021;12: 645932.
Li M, Li W, Wu F-X, Pan Y, Wang J. Identifying essential proteins based on sub-network partition and prioritization by integrating subcellular localization information. J Theor Biol. 2018;447:65–73.
Zhao B, Hu S, Liu X, Xiong H, Han X, Zhang Z, Li X, Wang L. A novel computational approach for identifying essential proteins from multiplex biological networks. Front Genet. 2020;11:343.
Zhao B, Han X, Liu X, Luo Y, Hu S, Zhang Z, Wang L. A novel method to predict essential proteins based on diffusion distance networks. IEEE Access. 2020;8:29385–94.
Yue Y, Ye C, Peng P-Y, Zhai H-X, Ahmad I, Xia C, Wu Y-Z, Zhang Y-H. A deep learning framework for identifying essential proteins based on multiple biological information. BMC Bioinform. 2022;23(1):1–27.
Li G, Li M, Wang J, Wu J, Wu F-X, Pan Y. Predicting essential proteins based on subcellular localization, orthology and PPI networks. BMC Bioinform. 2016;17(8):571–81.
Luo J, Qi Y. Identification of essential proteins based on a new combination of local interaction density and protein complexes. PLoS ONE. 2015;10(6):0131418.
Qin C, Sun Y, Dong Y. A new method for identifying essential proteins based on network topology properties and protein complexes. PLoS ONE. 2016;11(8):0161042.
Wang R, Wang C, Liu G. A novel graph clustering method with a greedy heuristic search algorithm for mining protein complexes from dynamic and static PPI networks. Inform Sci. 2020;522:275–98.
Yang Z, Liu P-Q, Fei Z-J, Liu C. Essential protein identification method based on structural holes and fusion of multiple data sources. Comput Sci. 2020;47(11A):40–5.
Fei Z, Liu P, Guo J, Yang Z, Liu C. Essential protein identification algorithm based on weighted subnetwork participation degree and multi-source information fusion. Appl Res Comput. 2022;39(1):163–9.
Acknowledgements
The authors are very grateful for the fruitful discussions with the members of the Intelligent Algorithm and Software Laboratory of Shandong University. The authors would like to thank Prof. Fei Guo of Central South University for her help in revising the manuscript. Thanks also go to Dr. Zhenzhen Yan of Shandong Technology and Business University for polishing the language of the manuscript.
Funding
This work is supported by the National Natural Science Foundation of China (No. 62176140), the Natural Science Foundation of Shandong Province (No. ZR2022MA076), and the Education Quality Improvement Plan for Graduate Students of Shandong Province (Nos. SDYKC19199, SDYJG21211).
Author information
Contributions
PL conceived and supervised the study. PL and CL conceptualized and designed the method. CL was responsible for the implementation. PL and CL drafted the manuscript together. YM, JG, FL, WC, and FZ participated in discussion and conceptualization as well as revising the draft. All authors read and approved the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Liu, P., Liu, C., Mao, Y. et al. Identification of essential proteins based on edge features and the fusion of multiple-source biological information. BMC Bioinformatics 24, 203 (2023). https://doi.org/10.1186/s12859-023-05315-y
DOI: https://doi.org/10.1186/s12859-023-05315-y