An iteration model for identifying essential proteins by combining comprehensive PPI network with biological information

Background Essential proteins have great impacts on cell survival and development, and play important roles in disease analysis and new drug design. However, since identifying essential proteins through biological experiments is inefficient and costly, there is an urgent need for automated and accurate detection methods. In recent years, the recognition of essential proteins in protein–protein interaction (PPI) networks has become a research hotspot, and many computational models for predicting essential proteins have been proposed. Results To achieve higher prediction performance, this paper proposes a new prediction model called TGSO. In TGSO, a protein aggregation degree network is first constructed by adopting a node density measurement for complex networks. Simultaneously, a protein co-expression interaction network is constructed by combining gene expression information with network connectivity, and a protein co-localization interaction network is constructed from subcellular localization data. These three newly constructed networks are then integrated into a comprehensive protein–protein interaction network. Finally, starting from homology information, scores are calculated iteratively for all proteins and used to estimate their importance. To evaluate the identification performance of TGSO, we compared it with 13 recent competitive methods on three yeast databases. Experimental results show that TGSO achieves identification accuracies of 94%, 82% and 72% among the top 1%, 5% and 10% of candidate proteins respectively, which are to some degree superior to those of the state-of-the-art competitive models.
Conclusions We constructed a comprehensive interaction network from multi-source data to reduce the noise and errors in the initial PPI network, and combined it with an iterative method to improve the accuracy of essential protein prediction. The results suggest that TGSO may also be conducive to the future development of essential protein recognition.


Background
Numerous studies have shown that essential proteins play important roles in human biological processes. The lack of essential proteins will affect cell growth and development seriously, and the functions of the protein complexes will be lost as well. Essential protein prediction is not only of great significance to the researches on life science, but also able to provide valuable information to the treatment of diseases and the design of new drugs [1][2][3][4]. Traditionally, essential proteins are identified by medical experiments, such as RNA interference (RNAi) [5,6] and gene knockout [7]. Chen et al. described a method for identifying essential genes of Streptococcus sanguis SK36 strain using whole-genome deletion mutations [8]. Ji et al. used antisense technology to construct a controllable gene expression system, and conducted a comprehensive genome analysis of Staphylococcus aureus, an important human pathogen [9]. In [10,11], the necessity of each gene in the genome is analyzed by the method of sequencing the targeted insertion site of the transposon. However, these biological experiments are not only time-consuming, but also costly and inefficient. Hence, automated and accurate detection methods become necessary. Up to now, many computational models for identifying essential proteins have been developed successively. For instance, Yu et al. found the correlations between bottlenecks and essential proteins, where bottlenecks were defined as proteins with high degrees of centrality [12]. Based on the modular nature of a protein essentiality, Li Min et al. 
proposed a method to identify essential proteins based on local average connectivity [13]. They also proposed a protein network recognition method based on topological potential [14]; the basic idea is to treat each protein in the network as a material particle, generate a potential field around it, and calculate the topological potential of each protein to determine its importance. Jeong et al. introduced the centrality-lethality rule to describe the connection between network topology and essential proteins [15]. Since then, many centrality-based methods have been designed, including Degree Centrality (DC) [16], Information Centrality (IC) [17], Eigenvector Centrality (EC) [18], Subgraph Centrality (SC) [19], Betweenness Centrality (BC) [20], Closeness Centrality (CC) [21] and Neighbor Centrality (NC) [22]. However, although these centrality-based methods are more efficient than traditional biological experiments, their recognition abilities are still not satisfactory, since PPI networks contain a lot of noise in the form of false negatives and false positives [23,24]. Therefore, to further improve the performance of identification models, biological data including GO (Gene Ontology) annotations, gene expression profiles, subcellular localization data and protein domain data have been integrated with PPI networks to identify essential proteins. For example, by calculating the co-expression and edge clustering coefficient between nodes, and thereby integrating PPI networks with gene expression data, Li et al. established a prediction method called Pec [25] to infer potential essential proteins. Zhang et al. proposed a computational model named CoEWC, which integrates the clustering coefficient and gene co-expression properties of nodes, captures the common features of essential proteins in both date hubs and party hubs, and achieves good prediction performance [26]. Zhao et al. designed a model called POEM, which combines network topology with gene expression profiles to reduce the negative impact of PPI noise. Unlike other methods, POEM pays more attention to predicting essential biological modules and uses computational methods to determine the date hubs and party hubs [27]. Zhao et al. argued that constructing only a single network easily ignores differences in biological characteristics and functional relevance, and conceals the inherent properties of heterogeneous data. Hence, they combined PPI data with multiple kinds of biological data to construct a heterogeneous network for predicting essential proteins [28].
The GO database is the largest source of information about gene function in the world [29], and it has often been adopted to mine functional similarities between proteins. For instance, Kim et al. found that pruning PPI networks with informative GO terms can improve the prediction performance of models [30]. Zhang et al. combined PPI data with GO annotations and protein domain information to construct a three-dimensional tensor, inferred essential proteins through an extended HITS model [31], and obtained better performance. Meta-heuristic algorithms are characterized by high robustness, low complexity and good optimization ability. Inspired by this, Lei et al. applied an intelligent evolutionary optimization algorithm and proposed a new method for predicting essential proteins in PPI networks based on artificial fish swarm optimization [32]. Zhang et al. defined a new measure for characterizing subcellular location information and, based on data fusion, proposed a new predictive model called TEGS [33,34]. Lei et al. designed a model called RSG, which identifies essential proteins by combining RNA-seq data, instead of gene expression data, with GO annotations and subcellular localization [35]; it measures protein importance not only by connectivity but also by co-expression level and functional similarity.
Machine learning has also been applied in the field of essential protein identification. By using features from DNA and protein sequence data, Zhang et al. proposed a deep learning-based network embedding method to automatically learn features and use the features to train deep neural networks to predict human essential genes [36]. Zeng et al. proposed the Ess-NEXG model, which used RNA-seq, subcellular localization, orthology and other information to construct a reliable weighting network, and captured topological features through node2vec, and finally used a classifier to make predictions [37].
Considering that essential proteins are more conserved in evolution than non-essential proteins, Peng et al. proposed an iterative method named ION to predict essential proteins by integrating orthology with the PPI network [38]. Zhang et al. introduced a prediction method called OGN; in addition to the common topological attributes and co-expression probability of protein nodes in date hubs and party hubs, OGN adds orthologous scores to the calculation of protein importance scores [39]. Lei et al. designed a method called PCSD for identifying essential proteins based on the degree of protein participation in protein complexes and the density of subgraphs [40]. Li et al. developed a prediction model called NCCO to identify potential essential proteins by extending the Pareto optimal consensus model (EPOC) [41]. Zhang et al. designed a method based on a dynamic PPI network (FDP): first, a series of active PPI networks is constructed, one for each time point; these networks are then merged one by one according to the similarity between them; finally, ranking scores are assigned to proteins in consideration of homology and topological properties [42]. In our previous work, we proposed an iterative method called CVIM, which first integrates the topological characteristics of the PPI network based on the entropy weight method and then uses an iterative model to score and predict essential proteins based on orthologous information [43].
In this paper, unlike the above models, a novel centrality-based method called TGSO is proposed, which combines biological data, including gene expression data, orthologous information and subcellular localization data, with the topological information of a newly constructed comprehensive PPI network. In TGSO, a new measure named DBN (Density between nodes) is first designed to calculate node density in complex networks, which characterizes the structural association between nodes, and a protein aggregation degree interaction network (ADN) is constructed from it. Next, a protein co-expression interaction network (CEN) is constructed by using the Pearson correlation coefficient to measure protein co-expression from gene expression data. A protein co-localization interaction network (CLN) is likewise obtained from subcellular localization data. These three interaction networks are then integrated into a comprehensive PPI network (PCIN). Finally, on the PCIN, an iterative method is designed to predict potential essential proteins, using orthology information as the initial scores of proteins. To estimate the identification performance of TGSO, intensive experiments were carried out, and the results show that TGSO achieves more satisfactory prediction performance than state-of-the-art competitive prediction models such as DC [16], IC [17], EC [18], SC [19], BC [20], CC [21], NC [22], PEC [25], CoEWC [26], POEM [27], ION [38], TEGS [34] and CVIM [43] on two different databases.

Method
As illustrated in Fig. 1, the procedure of TGSO mainly includes the following five steps:
Step 1: Construction of the ADN (the protein Aggregation Degree interaction Network).
Step 2: Construction of the CEN (the protein Co-Expression interaction Network).
Step 3: Construction of the CLN (the protein Co-Location interaction Network).
Step 4: Construction of the PCIN (the Protein Comprehensive Interaction Network).
Step 5: Construction of the TGSO.
Let G = (V, E) denote the PPI network downloaded from database D, where V = {p_1, p_2, ..., p_N} is the set of protein nodes, E is the set of edges in the network, and N is the total number of proteins. As shown in Fig. 1, the matrix A = (a_ij) of size N × N is the adjacency matrix of the network, where a_ij = 1 if and only if there exists an edge e(p_i, p_j) between p_i and p_j in E, and a_ij = 0 otherwise.
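As an illustration of the adjacency matrix A defined above, the following minimal Python sketch builds a symmetric 0/1 matrix from an edge list. The function name `build_adjacency` and the toy protein identifiers are hypothetical and are not taken from the original work.

```python
def build_adjacency(proteins, edges):
    """Return the symmetric 0/1 adjacency matrix of an undirected PPI network."""
    index = {p: i for i, p in enumerate(proteins)}
    n = len(proteins)
    a = [[0] * n for _ in range(n)]
    for u, v in edges:
        i, j = index[u], index[v]
        if i == j:
            continue  # self-interactions carry no edge in this sketch
        a[i][j] = 1
        a[j][i] = 1  # undirected network: keep the matrix symmetric
    return a


# Toy network (hypothetical identifiers): a triangle plus one isolated protein.
proteins = ["p1", "p2", "p3", "p4"]
edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p1")]
A = build_adjacency(proteins, edges)
```

A dense list-of-lists is used only for readability; for a network with thousands of proteins a sparse representation would be the more practical choice.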

Construction of the ADN
Recent research shows that the degree of connection between essential proteins is often higher than that between non-essential proteins [44], and that essential proteins can form tightly connected molecular modules [33]. Hence, based on the modular nature of essential proteins, for each edge e(u, v) we design a local metric called DBN (Density between nodes), given by formula (1), to measure the interaction between u and v in the original PPI network G. Here, N_G(u) = {v | ∃e(u, v) ∈ E, v ∈ V} is the set of neighboring nodes of the protein node u in G, and |N_G(u)| is the total number of neighboring nodes of u in G. From formula (1) we obtain a new matrix DBN, on the basis of which we construct a new weighted PPI network, defined as the protein Aggregation Degree interaction Network (ADN).

Construction of the CEN
Gene expression refers to the process by which the genetic information in genes is used to synthesize functional gene products. Gene expression products are usually proteins, but the expression products of non-protein-coding genes, such as transfer RNA (tRNA) or small nuclear RNA (snRNA) genes, are functional RNAs. Over a period of time, essential proteins may show similar expression patterns. According to the studies of Horyu et al. [45], the Pearson correlation coefficient (PCC) is suitable for measuring the similarities between gene expression profiles. Hence, based on the PCC, for any pair of proteins u and v, we calculate the similarity between them by formula (2). Here, Exp(u, i) is the expression level of the protein u at the i-th time point, and for any given protein u, its expression information over a series of n different time points constitutes a vector Exp(u) = {Exp(u, 1), Exp(u, 2), ..., Exp(u, n)}. In addition, $\overline{Exp}(u)$ is the average expression value of the protein u, and σ(u) is the standard deviation of the gene expression of the protein u.
Fig. 1 The initial PPI network, combined with subcellular localization data, gene expression data and network topology information, is integrated into the comprehensive protein interaction network; this network and the protein conservatism scores are then put into the iterative model to obtain the final protein scores.
Existing studies illustrate that the essentiality of proteins is related both to the proteins or genes themselves and to the molecular modules they belong to [46,47], and that an essential biological module consists of a large number of essential proteins that are highly connected and shared between biological functions [48]. Based on these findings, for any pair of proteins u and v, we measure the interaction between them in the original PPI network G by formula (3). Based on formula (3), we can construct another weighted PPI network, namely the protein co-expression interaction network (CEN).
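The Pearson correlation coefficient used in this section can be sketched in a few lines of Python. This is the standard sample PCC, computed with the mean and sample standard deviation of each expression vector. The 1/(n − 1) normalization, the function name `pearson` and the convention of returning 0.0 for a constant profile are assumptions of this sketch, since formula (2) is not reproduced here.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation between two equally long expression vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (n - 1 in the denominator, an assumption).
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # constant profile: correlation undefined, treated as 0 here
    return sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
               for a, b in zip(x, y)) / (n - 1)


# Toy expression profiles over three time points (hypothetical values).
same_trend = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite_trend = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Two profiles that rise together give a value close to 1, and two that move in opposite directions give a value close to −1. How the PCC is then combined with network connectivity to weight an edge of the CEN is defined by formula (3) and is not modelled here.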

Construction of the CLN
Researches show that protein interactions in human bodies tend to coexist in the same cell compartment or adjacent cell compartments [49]. And it has been demonstrated that the introduction of subcellular localization information is of great help in screening essential proteins [28,34,35].
As shown in Fig. 2, the nucleus contains the largest number of essential proteins, while the Extracellular and Peroxisome locations contain only a small number. Moreover, individual subcellular locations hold similar amounts of essential proteins across the three datasets. For example, the Nucleus accounts for about 40% of essential proteins in DIP, Essential, and Krogan. Recent research has found that 76% of protein-protein interactions in yeast cells occur within the same subcellular compartment [50]. In many cases, the function of a complex as a whole is more important than the function of an individual protein, and essential proteins tend to form protein complexes to perform important functions together [46,47]. Hence, in order to distinguish the importance of different subcellular localizations, for any given subcellular location i, we define the total number of subcellular species related to i by formula (4). Here, sub(i) represents the number of protein nodes associated with the subcellular location i in the database. For any given protein u, we then define its self-localization score by formula (5). Here, L(u) is the collection of all subcellular localizations possessed by u. Based on formula (5), for any pair of proteins u and v, we further obtain the co-localization score between them by formula (6). According to formula (6), we can construct a new weighted PPI network, the protein Co-Localization interaction Network (CLN).

Construction of the PCIN
Based on the three newly constructed weighted PPI networks ADN, CEN and CLN, for any given protein u, we can obtain a unique score for u by formula (7). According to formula (7), for any two given proteins i and j, we can define a comprehensive interaction between them by formula (8), where N is the total number of protein nodes.
Fig. 2 In the three data sets Essential, DIP and Krogan, the percentage of essential proteins in each of the 11 subcellular locations is represented by the thickness of the line, and the length of the outer ring represents the total proportion of essential proteins in that subcellular location.

Construction of the TGSO
Peng et al. [38] found that the essentiality of a protein is closely related to its degree of conservation. Figure 3 shows brief results of using conservatism scores alone to screen for essential proteins. The accuracy of this score reached 76% in the top 1% on the three databases. The conservatism score therefore plays an important role in the recognition of essential proteins, and we use it as the initial score vector of the proteins.
For any given protein u_i, let I(i) denote its homology score. Following [38], I(i) is given by Eq. (9), where S is the set of reference organisms used to obtain orthologous information for the nodes in V, s denotes an element of S, |S| denotes the number of elements in S, and X_s is the subset of V whose elements have orthologs in organism s.
Then we can obtain the conservatism score O_score(i) corresponding to u_i based on the original PPI network G by formula (10). Based on formula (10), for all N proteins p_1, p_2, ..., p_N in G, we obtain their initial scores P_0 by formula (11). Finally, based on these initial scores and the newly constructed weighted comprehensive PPI network PCIN, we iterate, following the weighted PageRank algorithm [51], to obtain the critical scores of all proteins in G by formula (12). Here, the parameter α (0 ≤ α ≤ 1) adjusts the proportion of the initial scores P_0 and the scores P_t of the last iteration. Based on the above descriptions, the general flowchart of our prediction algorithm TGSO can be described as follows:
Step 5: Obtain the initial score vector P_0 according to formula (11);
Step 6: Let t = 0; obtain P_1 according to formula (12);
Step 7: Let t = t + 1; obtain P_{t+1} according to formula (12);
Step 8: Repeat Step 7 until ||P_{t+1} − P_t|| / |E| < γ;
Step 9: Sort proteins by the value of P in descending order;
Step 10: Output the top K percent of sorted proteins.
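The iterative scoring loop in the steps above can be sketched as follows. Because formula (12) is not reproduced here, the update rule is an assumption: a PageRank-style blend P_{t+1} = α·P_0 + (1 − α)·M·P_t, where M is the column-normalized weight matrix of the network. Which of the two terms α multiplies, the column normalization and the Euclidean norm in the stopping rule are all choices made for this sketch, and the names `iterate_scores` and `top_percent` are hypothetical.

```python
import math


def iterate_scores(weights, p0, alpha=0.3, gamma=1e-10, n_edges=1, max_iter=1000):
    """Iterate P_{t+1} = alpha * P_0 + (1 - alpha) * M @ P_t (assumed update rule)."""
    n = len(p0)
    # Column-normalise the weights so each node distributes a total of 1 (assumption).
    col_sums = [sum(weights[i][j] for i in range(n)) for j in range(n)]
    m = [[weights[i][j] / col_sums[j] if col_sums[j] else 0.0
          for j in range(n)] for i in range(n)]
    p = list(p0)
    for _ in range(max_iter):
        nxt = [alpha * p0[i] + (1 - alpha) * sum(m[i][j] * p[j] for j in range(n))
               for i in range(n)]
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(nxt, p)))
        p = nxt
        if diff / n_edges < gamma:  # stopping rule of Step 8
            break
    return p


def top_percent(names, scores, k_percent):
    """Sort proteins by score (descending) and keep the top k_percent (Steps 9-10)."""
    ranked = sorted(zip(names, scores), key=lambda t: t[1], reverse=True)
    k = math.ceil(len(names) * k_percent / 100)
    return [name for name, _ in ranked[:k]]


# Toy weighted network with 3 proteins and 2 edges (hypothetical values).
weights = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
scores = iterate_scores(weights, [1.0, 0.0, 0.0], alpha=0.3, n_edges=2)
```

With a column-normalized matrix and α > 0, the update is a contraction, so the loop converges regardless of the starting vector. This property belongs to the assumed update rule and is not a claim about formula (12).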

Experimental data
In order to estimate the identification performance of TGSO, in this section, we will compare it with 13 different state-of-the-art competitive prediction models illustrated in the following Table 1.
Since Saccharomyces cerevisiae has the most complete PPI data and rich biological information data, and is widely used to evaluate essential protein prediction models, we evaluate the performance of TGSO on three Saccharomyces cerevisiae databases: the DIP database [52], the Krogan database [53], and the Gavin database [54]. After filtering out repeated interactions and self-interactions, as shown in Table 2, we obtained 5093 proteins and 24,743 interactions from the DIP database, 3672 proteins and 14,317 interactions from the Krogan database, and 1855 proteins and 7669 interactions from the Gavin database.
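The filtering step mentioned above, removing repeated interactions and self-interactions, can be sketched as follows. The function name `clean_interactions` and the identifiers are hypothetical. The sketch treats the network as undirected, so (a, b) and (b, a) count as the same interaction, which is an assumption about how duplicates were defined.

```python
def clean_interactions(pairs):
    """Drop self-interactions and repeated (unordered) protein pairs, keeping order."""
    seen = set()
    cleaned = []
    for u, v in pairs:
        if u == v:
            continue  # self-interaction
        key = frozenset((u, v))  # unordered pair: (u, v) equals (v, u)
        if key in seen:
            continue  # repeated interaction
        seen.add(key)
        cleaned.append((u, v))
    return cleaned


raw = [("a", "b"), ("b", "a"), ("a", "a"), ("b", "c")]
interactions = clean_interactions(raw)
proteins = sorted({p for pair in interactions for p in pair})
```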
Moreover, as a benchmark dataset for testing the accuracy of different identification models, a set of 1293 essential genes is derived from the MIPS [55], the Saccharomyces Genome Database(SGD) [56], the Saccharomyces Genome Deletion Project Database (SGDP) [57], and the Database of Essential Genes (DEG) [58] simultaneously. In addition, the gene expression data of Saccharomyces cerevisiae is obtained from the work proposed by Tu et al. [59], which contains 6777 gene products and 36 samples. The orthologous information is downloaded from the InParanoid database (Version 7) [60]. Besides, as illustrated in above Fig. 2, we derived eleven subcellular locations related to eukaryotic cells from the COMPARTMENTS database [61,62] as well.
Finally, in order to evaluate the uniqueness and efficiency of TGSO, in this section, we will first adopt different measurements such as accuracy, jackknife, Precision Recall regression curve (PR-curves) and Receiver Operating Characteristic curve (ROC) to compare TGSO with 13 competitive prediction models shown in Table 1 comprehensively. And then, we will further estimate the effect of the parameter α on the performance of TGSO.

Comparisons between TGSO and 13 representative methods
In this section, two datasets, downloaded from the DIP database and the Krogan database respectively, are adopted to compare TGSO with the 13 competitive prediction models listed in Table 1. Fig. 4 and Table 3 show the comparison results on the DIP database and the Krogan database respectively. From Fig. 4, it is easy to see that among the top 1% (51) of candidate essential proteins, TGSO screened out 48 true essential proteins, an accuracy of 94%. Among the top 5% (255) and 10% (510) of candidates, TGSO identified 208 and 368 true essential proteins, accuracies of 82% and 72% respectively.
Compared with traditional centrality-based methods such as DC, IC, EC, SC, BC, CC and NC, TGSO detects clearly more true essential proteins. In particular, in the top 1% and 5% of candidate essential proteins, TGSO predicts twice as many true essential proteins as every centrality method except NC. In the top 10% of predicted essential proteins, the prediction accuracy of TGSO is 77.78%, 75.24%, 88.72%, 88.72%, 102.2%, 90.67% and 30.5% higher than that of DC, IC, EC, SC, BC, CC and NC respectively. Moreover, compared with methods that combine PPI networks with multiple kinds of biological data, such as Pec, CoEWC, ION, POEM and CVIM, TGSO still achieves the highest prediction accuracy in every range from the top 1% to the top 25% of candidate essential proteins. Therefore, the results show that TGSO is the best predictor on the DIP database.
Table 3 shows that TGSO achieves similar prediction performance on the Krogan database. For instance, among the top 1% (37) of candidate essential proteins, TGSO detected 35 true essential proteins, an accuracy of 95%, while in the top 15% (551) of candidates, TGSO still achieves an accuracy of 66.06%, which is 76.70% higher than that of the worst-performing method, CC, and 11.31% and 13.40% higher than those of the best-performing methods, CVIM and TEGS, among these 13 competitive models. Furthermore, as the number of candidate proteins increases, the accuracy of all prediction models inevitably decreases, but in the top 25%, TGSO detects 515 true essential proteins, still well above the 479 detected by CVIM and the 480 detected by ION. Hence, we conclude that TGSO achieves the best identification performance on both the Krogan database and the DIP database compared with these 13 competitive state-of-the-art prediction models.
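The arithmetic behind these top-k accuracies can be checked with a short Python sketch. The candidate counts 51, 255 and 510 are consistent with rounding 1%, 5% and 10% of the 5093 DIP proteins up to the next integer. That rounding rule is inferred from the reported numbers and is not stated in the text, and the function names are hypothetical.

```python
def candidate_count(n_proteins, percent):
    """Number of candidates in the top `percent`, rounded up (inferred rule)."""
    return -(-n_proteins * percent // 100)  # integer ceiling division


def accuracy_percent(true_essential, candidates):
    """Share of candidates that are true essential proteins, in percent."""
    return 100 * true_essential / candidates


dip_proteins = 5093
reported_hits = {1: 48, 5: 208, 10: 368}  # true essential proteins found by TGSO
results = {
    percent: (candidate_count(dip_proteins, percent),
              round(accuracy_percent(hits, candidate_count(dip_proteins, percent))))
    for percent, hits in reported_hits.items()
}
```

Each entry of `results` pairs the number of candidates with the rounded accuracy, so the reported figures can be compared directly with the values in the text.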

Validation with jackknife methodology
To evaluate the TGSO model more comprehensively, we extracted the top 1000 proteins ranked by the importance score calculated by TGSO, and used the jackknife methodology [63] to evaluate TGSO's ability to place experimentally validated essential proteins at the top of the ranking. The X-axis represents the ordered proteome of an organism, arranged from left to right from the strongest to the weakest prediction of importance. The Y-axis is the cumulative count of essential proteins encountered while traversing the ordered proteome from left to right. Figs. 5 and 6 illustrate the comparison results. Fig. 5a shows that TGSO achieves better performance than the centrality-based methods DC, IC, EC, SC, BC, CC and NC. Fig. 5b shows that the prediction performance of TGSO is also significantly better than that of the methods based on multiple biological data, such as Pec, CoEWC, POEM and ION. Although the curves of TGSO, CVIM and TEGS partially overlap, once the number of candidate proteins increases to about 600, the prediction performance of TGSO becomes significantly higher than that of both CVIM and TEGS, which indicates that TGSO is superior to both. In addition, Fig. 6a, b shows that TGSO achieves better performance than all 13 competitive methods. In particular, compared with the methods that combine PPI networks with multiple biological data, once the number of candidate essential proteins reaches 300, TGSO achieves much better performance than all of them.
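The jackknife curve described above is a running count of known essential proteins along the ranked list. A minimal sketch follows; the function name `jackknife_curve` and the toy identifiers are hypothetical.

```python
def jackknife_curve(ranked, essential, limit=None):
    """Cumulative count of essential proteins along a ranking (best first).

    `limit` restricts the curve to the top-ranked proteins, e.g. 1000.
    """
    counts = []
    total = 0
    for protein in ranked[:limit]:
        if protein in essential:
            total += 1
        counts.append(total)  # y-value at this x-position of the curve
    return counts


ranking = ["a", "b", "c", "d"]  # strongest prediction first
known_essential = {"a", "c"}
curve = jackknife_curve(ranking, known_essential)
```

A method whose curve lies above another at a given x-position has found more essential proteins among its top x candidates.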

Validation by precision-recall curves and ROC curves
In this section, we further use the receiver operating characteristic (ROC) curve to evaluate the performance of TGSO. The larger the area under the ROC curve (AUC), the better the performance of the model; AUC = 0.5 corresponds to random performance [64][65][66]. In the three yeast databases, DIP, Krogan and Gavin, the proportion of essential proteins is small: the ratio of non-essential to essential proteins is about 3 to 1. Studies show that, when dealing with highly skewed datasets, the precision-recall (PR) curve can provide more information about the performance of an algorithm [67]. Therefore, in this section, we also adopt PR curves to compare TGSO with the 13 competitive methods. As shown in Figs. 7 and 8, the AUCs achieved by TGSO are much higher than those of the competitive methods on both the DIP database and the Krogan database. However, Figs. 7b and 8b show that the curves of TGSO and CVIM overlap slightly. Hence, to further compare TGSO and CVIM, we also adopt the F1-score, and the comparison results are shown in Table 4. Table 4 shows that both the AUC and the F1-score achieved by TGSO are higher than those of the 13 competitive methods on both the DIP database and the Krogan database. Therefore, it is reasonable to believe that TGSO performs better than all these state-of-the-art methods.
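For reference, the two summary measures used in this section can be sketched in plain Python. The AUC is computed through its rank interpretation: the probability that a randomly chosen essential protein scores higher than a randomly chosen non-essential one, with ties counted as one half. The F1-score is computed for a fixed set of predicted proteins. How that set was chosen for Table 4 is not stated, so the cut-off is left to the caller, and the function names are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (essential, non-essential) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Quadratic pair count: fine for a sketch, slow for whole proteomes.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def f1_score(predicted, essential):
    """Harmonic mean of precision and recall for a set of predicted proteins."""
    true_positives = len(predicted & essential)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(essential)
    return 2 * precision * recall / (precision + recall)


perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
tied = roc_auc([0.5, 0.5], [1, 0])
f1 = f1_score({"a", "b"}, {"a", "c"})
```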

Difference analysis of TGSO and 13 competitive methods
To better reflect the uniqueness of TGSO and its differences from the existing competitive methods, we further compare TGSO with the 13 competing prediction models on the top 200 ranked proteins of the DIP database in this section. The comparison results are given in Tables 5 and 6. In Tables 5 and 6, M_i represents one of the 13 competitive models, |TGSO ∩ M_i| denotes the number of key proteins screened by both TGSO and M_i, and |TGSO − M_i| denotes the number of key proteins found by TGSO but not by M_i. Tables 5 and 6 show that TGSO can screen out new key proteins that cannot be discovered by any of these 13 competing methods. In addition, the fourth and fifth columns of Tables 5 and 6 show that the proportion of true essential proteins among those screened by TGSO alone is much higher than the proportion among those screened alone by any of the 13 competing methods, which is further demonstrated by the results illustrated in Fig. 9.
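The quantities |TGSO ∩ M_i| and |TGSO − M_i| are plain set operations on the two top-200 lists. The sketch below computes them together with the share of true essential proteins among the proteins found by only one method. The function name, dictionary keys and toy identifiers are hypothetical, and the exact column layout of Tables 5 and 6 is not reproduced.

```python
def difference_stats(tgso_top, other_top, essential):
    """Overlap and difference statistics between two top-ranked protein sets."""
    shared = tgso_top & other_top
    only_tgso = tgso_top - other_top
    only_other = other_top - tgso_top

    def essential_fraction(group):
        return len(group & essential) / len(group) if group else 0.0

    return {
        "shared": len(shared),                      # |TGSO ∩ M_i|
        "only_tgso": len(only_tgso),                # |TGSO − M_i|
        "only_tgso_essential_fraction": essential_fraction(only_tgso),
        "only_other_essential_fraction": essential_fraction(only_other),
    }


stats = difference_stats({"a", "b", "c", "d"},
                         {"c", "d", "e", "f"},
                         essential={"a", "b", "c", "e"})
```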

General applicability of TGSO
In order to prove the applicability of TGSO, we will further execute some simple tests and comparisons based on the Gavin database in this section, and the experimental results are shown in the following Table 7.
As can be seen from Table 7, compared with the 13 competing methods, TGSO achieves the best predictive performance in every range from the top 1% to the top 25% of candidate essential proteins, which demonstrates that TGSO is the best prediction model among these competitive models and has wide applicability.
Table 7 Number of essential proteins predicted by TGSO and 13 methods based on the Gavin database

Effects of parameter on performance of TGSO
In this section, we analyze the influence of the parameter α on the performance of TGSO. In TGSO, the parameter α, with a value between 0 and 1, adjusts the relative weight of the comprehensive interaction network PCIN and the protein conservatism score. In the experiments, we vary α to study its influence. As shown in Table 8, on the DIP database, when α is 0.2, TGSO attains its maximum values of 48 and 671 in the top 1% and the top 25% respectively. When α is 0.4, there are two maximum values, 48 and 487. When α is 0.3, TGSO reaches its maximum in the top 1%, the top 10% and the top 20%. Therefore, on DIP, 0.3 is the best parameter value. In addition, Table 9 shows that on the Krogan database, for α from 0.1 to 0.4, TGSO detects a maximum of 35 true essential proteins in the top 1% of candidates, an accuracy of 95%. When α is set to 0.2, TGSO achieves the best accuracy in the top 1% and the top 25% of candidates. When α is set to 0.3 or 0.4, TGSO achieves the best performance in two intervals respectively. Therefore, on the Krogan database, TGSO achieves the best performance when α is set to 0.2, 0.3 or 0.4. From Table 10, we find that for α between 0.1 and 0.4, only 0.3 attains two maximum values. To sum up, based on these three databases, we set α to 0.3 in the experiments comparing TGSO with the state-of-the-art competitive models in this article.

Albation study
The previous comparative experiments confirmed that TGSO can effectively improve the performance of identifying essential proteins and is superior to existing methods in all aspects. In the design process of TGSO, three kinds of protein interaction networks such as ADN, CEN and CLN were involved from different perspectives. In order to analyze the positive contributions of these networks to the predictive performance of TGSO, we designed the ablation experiment as follows: The initial PPI network is used as the control group, and the experimental groups are ADN, CEN and CLN. All groups are set with the same parameters for iterative calculation, and the optimal result of each group is taken as the representative value of the group. The three evaluation indicators of accuracy, AUC, and F1-score are compared, and the accuracy experimental results obtained are shown in Table 11. It can be seen from above Table 11 that in DIP, the initial PPI network contains a lot of noisy data, which leads to poor recognition results. The new network topology of ADN has improved the initial PPI to a certain extent. Among these three kinds of networks, CEN, the protein co-expression network, has a greater improvement in the accuracy of the interval.
In addition, we examined the performance of the individual networks on the ROC and PR curves. In the PR chart, the area under the curve of the CEN network was larger than that of the other single networks. In the ROC chart, CLN performed even better. From the ROC and PR curves, we calculated the AUC and F1-score values of the different network models; detailed results are shown in Table 12 and Fig. 10. Table 12 shows that, on the DIP database, the AUC value of CLN is 0.763, which is higher than that of Init (0.692), ADN (0.718) and CEN (0.738). Likewise, on the Krogan and Gavin databases, CLN achieves the maximum values of AUC and F1-score. These results suggest that the CLN network, that is, the subcellular co-localization data, may have played the most critical role in the network construction of our prediction model. Our analysis indicates that CLN is important because it captures the tendency of essential proteins to perform important functions collaboratively in the same subcellular location, and it therefore contributes positively to the performance of TGSO. The experimental results also show that the integrated interaction network PCIN has higher recognition accuracy than any single network, since it can balance the advantages and disadvantages of multiple networks and eliminate noisy data. Moreover, TGSO achieves satisfactory performance under multiple evaluation frameworks, including the PR curve, the ROC curve, AUC and F1-score, which further supports the rationality of network integration.
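For readers who wish to reproduce this kind of comparison, the following minimal sketch shows one way to compute AUC and F1-score from a ranked list of protein scores. It rests on assumptions: AUC is computed in its rank-based form (the probability that a randomly chosen essential protein is scored above a randomly chosen non-essential one, with ties counted as one half), and the F1-score is computed by treating the top k ranked proteins as predicted essential. The cutoff rule and the function names are hypothetical and may differ from those used for Table 12.

```python
def roc_auc(scores, essential):
    """Rank-based AUC: P(score of essential protein > score of non-essential protein).

    scores    : dict mapping protein id -> score
    essential : collection of known essential protein ids
    """
    essential = set(essential)
    pos = [s for p, s in scores.items() if p in essential]
    neg = [s for p, s in scores.items() if p not in essential]
    if not pos or not neg:
        raise ValueError("need at least one essential and one non-essential protein")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties contribute one half
    return wins / (len(pos) * len(neg))


def f1_at_top_k(scores, essential, k):
    """F1-score when the k highest-scoring proteins are predicted essential (assumed cutoff)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    predicted = set(ranked[:k])
    actual = set(essential) & set(scores)  # only proteins present in the network
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

Applying both functions to the scores obtained from each network (Init, ADN, CEN, CLN and PCIN) with the same essential protein set would yield one AUC and one F1-score per network for side-by-side comparison. The pairwise loop in `roc_auc` is quadratic in the number of proteins, which is acceptable for networks with a few thousand nodes.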

Discussion
Essential proteins are indispensable for sustaining life activities. In recent years, the development of computational methods for essential protein recognition has become a research hotspot, and many researchers have developed algorithms based on PPI networks. With the gradual improvement of high-throughput biological data, more efficient prediction models have been proposed that screen essential proteins by combining PPI networks with biological data such as subcellular localization and orthology information. Inspired by this, we first designed a subcellular co-localization score index based on the subcellular data of proteins and a co-expression index based on gene expression data. We then designed a novel detection method called TGSO to identify essential proteins based on the fusion of multiple data sources. Comparative experiments confirmed that TGSO outperforms the existing methods considered.
Moreover, methods such as CVIM and TEGS combine PPI network topology with additional biological information in a way similar to TGSO. Although the numbers of essential proteins among the top 200 ranked proteins are similar, the specific essential proteins detected by TGSO are very different from those detected by TEGS and CVIM. During the experiments, we tried to combine the features selected by these models with the features in TGSO, but the recognition performance after fusing these features was not ideal. Our analysis suggests that this may be because the factors that make a protein essential are highly diverse. For example, in TEGS, the importance of a protein is predicted by combining GO annotation with homology prediction and subcellular localization data. However, many GO annotations are assigned on the basis of orthology predictions; that is, an annotation supported by published experimental evidence in one species is transferred to the orthologous proteins in other species. If annotations transferred through predicted orthology are not excluded, TEGS becomes highly redundant. In CVIM, gene expression and network topology information are adopted, but subcellular location information is not considered. Moreover, the entropy weight method is used only to integrate topological features, and since topological features often contain a lot of noisy data, the effectiveness of CVIM is limited. In general, TGSO can achieve better predictive performance. In the future, we will carry out a more in-depth analysis and look for more informative features that capture the key proteins found by different methods, so as to improve the recognition rate of TGSO.