A novel essential protein identification method based on PPI networks and gene expression data

Background Some methods for identifying essential proteins achieve better results by using biological information, and gene expression data is commonly used for this purpose. However, gene expression data is prone to fluctuations, which may affect the accuracy of essential protein identification. Therefore, we propose an essential protein identification method based on gene expression and PPI network data that calculates the similarity of the "active" and "inactive" states of gene expression in a cluster of the PPI network. Our experiments show that the method can improve the accuracy of essential protein prediction.
Results In this paper, we propose a new measure named JDC, which is based on PPI network data and gene expression data. The JDC method uses a dynamic threshold to binarize gene expression data. It then combines degree centrality and the Jaccard similarity index to calculate a JDC score for each protein in the PPI network. We benchmark the JDC method on four organisms and evaluate it with ROC analysis, modular analysis, jackknife analysis, overlapping analysis, top analysis, and accuracy analysis. The results show that JDC performs better than DC, IC, EC, SC, BC, CC, NC, PeC, and WDC. We also compare JDC with the NF-PIN and TS-PIN methods, which predict essential proteins through active PPI networks constructed from dynamic gene expression.
Conclusions We demonstrate that the new centrality measure, JDC, is more efficient than state-of-the-art prediction methods with the same input. The main ideas behind JDC are as follows: (1) Essential proteins generally lie in densely connected clusters of the PPI network. (2) Binarizing gene expression data can screen out fluctuations in gene expression profiles. (3) The essentiality of a protein depends on the similarity of the "active" and "inactive" states of gene expression in a cluster of the PPI network.

expression [21]. Recently, Li et al. construct the TS-PIN dynamic network by combining gene expression profiles and subcellular localization information to predict essential proteins [22]. Li et al. introduce a sub-network partition method to predict essential proteins by using subcellular localization information [23]. Fan et al. adopt an improved PageRank algorithm to identify essential proteins based on gene expression and subcellular localization information [24]. Lei et al. incorporate multiple biological characteristics, including the PPI network, GO annotation data, subcellular localization information, and protein complex information, to identify essential proteins by using random walk algorithms [25]. Zhang et al. propose a method to predict essential proteins by fusing dynamic PPI networks [26]. Li et al. identify essential proteins by computing each protein's topology potential [27]. Peng et al. propose the UDoNC method to predict essential proteins [28].
On the other hand, some prediction methods adopt supervised learning and use machine learning algorithms such as SVM, Random Tree, RBF network, and Naïve Bayes to identify essential proteins. Supervised methods often perform better than unsupervised methods at detecting essential proteins. Gustafson et al. propose using Naïve Bayes to identify essential proteins based on gene expression data and topological features of the PPI network [29]. Hwang et al. construct an SVM classifier by using some biological features (such as ORF, ST, PHY) and some topological features (such as DC, BD, CC) of the PPI network [30]. Zhong et al. adopt the GEP method and an XGBFEMF framework to predict essential proteins [31,32]. Deng et al. predict essential proteins by combining a Naïve Bayes classifier, a C4.5 decision tree, CN2 rules, and a logistic regression model [33]. Kim et al. adopt machine learning methods to predict essential proteins by using topological properties of the GO-pruned PPI network [34]. Recently, Zeng et al. design a deep learning framework for the prediction of essential proteins [35].
Methods based on the PPI network and gene expression data may, to some extent, eliminate the false positives and false negatives in protein interaction data. However, a gene expression profile is a set of values with large fluctuations that may affect prediction performance. In studying complex biological systems, Niehrs et al. point out that the "on" and "off" states of genes at different times play an important role in biological development [36]. To introduce the "on" and "off" states of genes, we propose an essential protein prediction method, named JDC, based on PPI data and gene expression data, which uses the essential Degree Centrality with the Jaccard similarity index. JDC can eliminate the fluctuations of gene expression data by calculating the similarity of the "active" and "inactive" states of gene expression in a cluster of the PPI network. Compared with state-of-the-art methods on four organisms, our method is more accurate and has higher specificity and sensitivity. Figure 1 illustrates an example of using JDC to predict essential proteins. The JDC algorithm incorporates gene expression information with PPI network data. The whole process of JDC includes the following steps. (1) ECC is used to characterize, from a topological perspective, the probability that two proteins are in the same cluster. (2) A dynamic threshold is set to binarize gene expression data and filter out the fluctuations in gene expression profiles. (3) The Jaccard similarity index measures the similarity of two proteins in terms of the "active" and "inactive" states of their gene expression profiles. (4) JDC scores are calculated by integrating the ECC values and the Jaccard similarity index. Following these steps, we apply top rank analysis to the JDC values to verify the performance of our method.

Experimental datasets
We collected data for four organisms to evaluate the JDC method: Saccharomyces cerevisiae (Baker's yeast), Escherichia coli (E.coli), Drosophila melanogaster (Fly), and Homo sapiens (Human).
The PPI data of Yeast and E.coli were obtained from the DIP database. The PPI network of E.coli has 2727 proteins and 11,803 edges after filtering out self-interactions and repeated interactions. The PPI network of Yeast has 5093 proteins and 24,743 edges. The PPI data of Fly and Human were downloaded from the BioGRID database. The raw Fly dataset has 76,480 edges and 9217 nodes, and the raw Human dataset has 504,848 edges and 18,009 nodes. After converting the ids and filtering out self-interactions and repeated interactions, the Fly network has 37,992 edges and 6481 nodes, and the Human network has 348,871 edges and 15,721 nodes. Essential proteins were integrated from four databases: MIPS [37], SGD [38], DEG [39], and SGDP [40]. There are 1167 essential proteins in the Yeast PPI network. Of the 2727 proteins in the E.coli network, 254 are essential. The essential proteins of Fly and Human were obtained from the OGEE database. There are 408 essential proteins and 13,373 non-essential proteins in the Fly dataset. The number of essential genes in Human is 7123.
The gene expression data were downloaded from the NCBI Gene Expression Omnibus website. After pretreatment and normalization, the Yeast data contain 6777 gene products and 36 samples. The gene expression data of E.coli were downloaded from the same website. After removing redundant data, the E.coli gene expression data contain 7312 genes and 8 samples. GSE67547 contains the gene expression profiles of Fly, with 11,952 genes and 66 samples, whereas GSE86354 contains human tissue-specific RNA-seq expression profiles obtained by high-throughput sequencing.
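The filtering of self-interactions and repeated interactions described above can be sketched as follows. This is a minimal illustration, not the preprocessing script used for the datasets; the function name `clean_edges` and the toy protein ids are assumptions, and interactions are assumed to be undirected.

```python
def clean_edges(raw_edges):
    """Drop self-interactions and repeated interactions from an edge list.

    Interactions are treated as undirected (an assumption), so
    (a, b) and (b, a) count as the same edge.
    """
    unique = set()
    for a, b in raw_edges:
        if a == b:
            continue  # self-interaction
        unique.add(frozenset((a, b)))
    # Return a deterministic, sorted list of (protein, protein) pairs.
    return sorted(tuple(sorted(edge)) for edge in unique)


# Hypothetical toy input: one duplicate and one self-interaction.
raw = [("P1", "P2"), ("P2", "P1"), ("P3", "P3"), ("P2", "P3")]
edges = clean_edges(raw)
print(edges)
```

The node count of the filtered network is then the number of distinct proteins that still appear in at least one edge.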

Edge clustering coefficient (ECC)
Radicchi et al. first propose the edge clustering coefficient, which is an important topological feature in computational networks [41]. Wang et al. adopt the edge clustering coefficient to predict essential proteins in the yeast PPI network and achieve a good detection effect [42]. The advantage of the edge clustering coefficient is that it describes the clustering characteristics of PPI networks from the perspective of topology. We adopt the ECC shown in formula (1) to calculate the topological attribute of two nodes i and j:
$$ECC(i,j) = \frac{z_{i,j}}{\min(k_i - 1,\; k_j - 1)} \tag{1}$$
where $z_{i,j}$ denotes the number of actual triangles that include the edge (i, j) in the PPI network, $k_i$ and $k_j$ are the degrees of nodes i and j, and $\min(k_i - 1, k_j - 1)$ is the number of possible triangles, determined by the smaller of the two degrees. ECC is used to describe how tightly two proteins are connected. The larger the ECC value is, the more likely the two connected proteins are in the same cluster. Thus, the PPI network is divided into multiple clusters by calculating the ECC value of each pair of interacting proteins.
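As a small illustration of the edge clustering coefficient, the sketch below counts the triangles through an edge as the number of common neighbors of its two endpoints. The helper names and the toy network are assumptions, and returning 0.0 when no triangle is possible (an endpoint of degree 1) is a convention chosen here, not one stated in the text.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build undirected adjacency sets from a list of (a, b) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def ecc(adj, i, j):
    """Edge clustering coefficient of the edge (i, j).

    Triangles through (i, j) are the common neighbors of i and j.
    Assumed convention: 0.0 when min(k_i - 1, k_j - 1) is zero.
    """
    triangles = len(adj[i] & adj[j])
    possible = min(len(adj[i]) - 1, len(adj[j]) - 1)
    return triangles / possible if possible > 0 else 0.0


# Hypothetical toy network: a triangle A-B-C with a pendant node D on C.
adj = build_adjacency([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")])
print(ecc(adj, "A", "B"))  # edge inside the triangle
print(ecc(adj, "C", "D"))  # pendant edge, no triangle possible
```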

Binarization of gene expression data
Gene expression data are continuous values produced by microarray experiments. However, gene expression values from high-throughput experiments are prone to large fluctuations. Sahoo et al. performed a Boolean analysis of mouse B cell gene expression data to understand gene regulation and gene function [43]. To eliminate the fluctuation of gene expression, in this paper we use a threshold strategy to convert the continuous values to discrete state values, and then characterize gene expression data with "active" and "inactive" states.
In this paper, we select one sigma value close to the mean value as the threshold for screening the "active" and "inactive" states of gene expression. Formula (2) is the mean of the gene expression data, Formula (3) is the standard deviation of gene expression, and Formula (4) is the volatility of gene expression. The threshold parameter is defined in Formula (5). In these formulas, $s_{i,t}$ is the expression value of protein i at time point t, U(i) is the mean of the expression values of protein i, σ(i) is the standard deviation of the expression data of protein i, V(i) is the volatility of the expression values of protein i, and G(i) is the threshold parameter for the expression values of protein i.
S denotes the matrix constructed from the gene expression data, N is the number of genes, and M is the number of time points:
$$S = \begin{pmatrix} s_{1,1} & \cdots & s_{1,M} \\ \vdots & \ddots & \vdots \\ s_{N,1} & \cdots & s_{N,M} \end{pmatrix} \tag{6}$$
where $s_{i,t}$ is the expression level of protein i at time t. If the value of $s_{i,t}$ is higher than the threshold G(i), the gene expression is "active" and defined as "1". If the value of $s_{i,t}$ is not higher than the threshold G(i), the gene expression is "inactive" and defined as "0". The calculation formula is as follows:
$$s'_{i,t} = \begin{cases} 1, & s_{i,t} > G(i) \\ 0, & s_{i,t} \le G(i) \end{cases} \tag{7}$$
where $s'_{i,t}$ is the activity of protein i at time t. S is then updated to a matrix of Boolean values. In this way, the gene expression data are transformed into Boolean values that reflect the "active" and "inactive" states of gene expression.
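The binarization rule can be illustrated with a short sketch. The comparison against a per-gene threshold follows the rule stated above. The threshold itself is an assumption here: `one_sigma_thresholds` uses the mean plus one sample standard deviation as a stand-in for the threshold parameter G(i), whose exact definition is not reproduced in this text. All function names and expression values are hypothetical.

```python
from statistics import mean, stdev


def one_sigma_thresholds(expression):
    """ASSUMPTION: mean + one sample standard deviation per gene.

    This is only a stand-in for the threshold parameter G(i); the
    paper's exact threshold formula is not reproduced here.
    """
    return {gene: mean(vals) + stdev(vals) for gene, vals in expression.items()}


def binarize(expression, thresholds):
    """Map each value to 1 ("active") if it is strictly above the
    gene's threshold, otherwise to 0 ("inactive")."""
    return {
        gene: [1 if v > thresholds[gene] else 0 for v in vals]
        for gene, vals in expression.items()
    }


# Hypothetical expression profiles over four time points.
expression = {
    "g1": [1.0, 1.0, 1.0, 5.0],  # one clear peak
    "g2": [2.0, 2.0, 2.0, 2.0],  # flat profile
}
states = binarize(expression, one_sigma_thresholds(expression))
print(states)
```

Because the comparison is strict, a flat profile never exceeds its own threshold and is "inactive" at every time point.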

Jaccard similarity index
The Jaccard coefficient is generally used to measure the similarity of two discrete objects. Numanagic et al. proposed the SEDEF framework based on the Jaccard coefficient, which can accurately predict segmental duplications (SDs) [44]. Wallace et al. introduced the Jaccard coefficient into the prediction of disease-disease relationships and deduced the information of the interaction network [45]. In this paper, we compare the co-expression of two interacting proteins with the Jaccard coefficient. The Jaccard coefficient of the edge (i, j) is defined as:
$$Jaccard(i,j) = \frac{|S_i \cap S_j|}{|S_i \cup S_j|} \tag{8}$$
where $S_i$ and $S_j$ represent the Boolean values of the gene expression data of gene i and gene j. The intersection counts the time points at which both genes are active, and the union counts the time points at which at least one of them is active. The Jaccard coefficient lies between 0 and 1. Here, we define this value as the similarity of active expression between gene i and gene j in a cluster of the PPI network.
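A minimal sketch of the Jaccard coefficient on two Boolean expression vectors is given below. The function name and the example vectors are assumptions, and returning 0.0 when neither gene is ever active (an empty union) is a convention chosen here rather than one stated in the text.

```python
def jaccard(s_i, s_j):
    """Jaccard coefficient of two equal-length Boolean state vectors.

    Intersection: time points where both genes are active.
    Union: time points where at least one gene is active.
    Assumed convention: 0.0 when the union is empty.
    """
    both = sum(1 for a, b in zip(s_i, s_j) if a and b)
    either = sum(1 for a, b in zip(s_i, s_j) if a or b)
    return both / either if either else 0.0


# Hypothetical states over four time points.
print(jaccard([1, 1, 0, 0], [1, 0, 1, 0]))  # active together once, three active points
print(jaccard([1, 1, 0, 0], [1, 1, 0, 0]))  # identical activity patterns
```

Time points at which both genes are inactive do not enter the ratio, so two genes are not rewarded for being silent together.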

JDC measure index
Genes with similar functions often exhibit similar expression patterns, which is known as the "guilt-by-association" principle [46]. Based on the edge clustering coefficient (ECC) and the Jaccard coefficient (Jaccard), we propose a new measure, the essential Degree Centrality with Jaccard similarity index (JDC). It describes the clustering degree of two proteins from both topological and biological perspectives. We define the clustering degree of an edge (i, j) in the PPI network as follows:
$$J_c(i,j) = Jaccard(i,j) \times ECC(i,j) \tag{9}$$
For protein i, we define its JDC value as the sum of the probabilities that the protein and each of its neighbors belong to the same cluster:
$$JDC(i) = \sum_{j \in D_i} J_c(i,j) \tag{10}$$
where $D_i$ denotes the set of all neighbors of node i. Node i and its neighbors are thus treated as a cluster. The values measured by JDC depend on the similarity of the "active" and "inactive" states of gene expression in a cluster of the PPI network.
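The combination of the two quantities into a per-protein score can be sketched end to end on a toy network. This is an illustrative sketch rather than the authors' implementation: the function names, the toy edges, and the Boolean states are assumptions, every protein is assumed to have an expression profile, and both helper functions return 0.0 when their denominator is zero.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def ecc(adj, i, j):
    """Triangles through (i, j) over min(k_i - 1, k_j - 1); 0.0 if that is zero."""
    possible = min(len(adj[i]) - 1, len(adj[j]) - 1)
    return len(adj[i] & adj[j]) / possible if possible > 0 else 0.0


def jaccard(s_i, s_j):
    """Jaccard coefficient of Boolean state vectors; 0.0 if neither is active."""
    both = sum(1 for a, b in zip(s_i, s_j) if a and b)
    either = sum(1 for a, b in zip(s_i, s_j) if a or b)
    return both / either if either else 0.0


def jdc_scores(edges, states):
    """Score each protein by summing Jaccard * ECC over its neighbors."""
    adj = build_adjacency(edges)
    return {
        i: sum(jaccard(states[i], states[j]) * ecc(adj, i, j) for j in adj[i])
        for i in adj
    }


# Hypothetical toy data: a triangle A-B-C plus a pendant node D.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
states = {
    "A": [1, 1, 0, 0],
    "B": [1, 1, 0, 0],
    "C": [1, 0, 0, 0],
    "D": [0, 0, 1, 1],
}
scores = jdc_scores(edges, states)
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
```

In this toy example the pendant node D scores zero because its only edge lies in no triangle and its activity never coincides with that of C, while A and B score highest because they are both topologically clustered and co-active.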
In this paper, we propose an essential protein identification method based on PPI data and gene expression. The advantage of this method is that the calculation is simple, and the performance of JDC is better than some state-of-the-art prediction methods.

ROC curves and its AUC analysis
In this section, we adopt receiver operating characteristic (ROC) curves to evaluate the global performance of each method. The comparison results are shown in Fig. 2.
As shown in Fig. 2, the ROC curve of JDC lies above those of the other prediction methods almost everywhere. The areas under the ROC curve (AUC) on the two datasets are 0.6996 and 0.6999, respectively, which are the highest values among all methods. The ROC results obtained by the ten methods demonstrate that JDC is more suitable for predicting essential proteins.
To show that our method has better performance, we focus on comparing JDC with WDC and PeC, because these methods use the same input data. Li and Tang have introduced the Pearson correlation coefficient to weight the PPI network based on ECC, which effectively reduced false positives and false negatives in the PPI network on Yeast data [12,13]. Compared with those methods, JDC not only takes the false positive and false negative PPI data into consideration, but also introduces the "active" and "inactive" states of gene expression. On the yeast dataset, the AUC of JDC is higher than that of WDC and PeC by 0.0112 and 0.0665, respectively. Similar results are obtained on the E.coli dataset. The advantage of introducing discrete states is that it eliminates fluctuations in gene expression data, especially when one of two genes has a particularly high expression value that would otherwise distort the similarity value. JDC considers the co-expression state of connected genes at multiple time points, while WDC and PeC compare the similarity of the specific expression values of the two genes at different times.
To further compare the performance of JDC, WDC and PeC, we analyze the ROC curve based on the top 20% of proteins ranked by each method. The ROC curves are shown in Fig. 3. As can be seen from Fig. 3, the AUC of JDC is higher than that of WDC and PeC on both the yeast and E.coli datasets.
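For readers who want to reproduce an AUC value from a ranked list, the sketch below uses the rank-based (Mann-Whitney) formulation: the AUC equals the probability that a randomly chosen essential protein scores higher than a randomly chosen non-essential one, with ties counted as one half. The function name and the example scores are assumptions, and this is not necessarily the procedure used to produce Fig. 2.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: one score per protein (higher = more likely essential).
    labels: 1 for essential, 0 for non-essential.
    Quadratic in the number of proteins, which is fine for a sketch.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one essential and one non-essential protein")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four proteins, two of them essential.
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(auc)
```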

Accuracy analysis
In Formulas (11)-(17), TP denotes the number of true-positive proteins, FP denotes the number of false-positive proteins, TN denotes the number of true-negative proteins, and FN denotes the number of false-negative proteins. In this paper, a true positive is a real essential protein that is correctly predicted as essential, a false positive is a non-essential protein that is predicted as essential, a true negative is a non-essential protein that is correctly predicted as non-essential, and a false negative is an essential protein that is predicted as non-essential. The results on the Yeast and E.coli data are in Table 1. The lower the FPR, the better the method. The FPR value of JDC is the lowest of all methods on both datasets.
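The four counts and a few measures derived from them can be sketched as follows. The selection of measures shown here (sensitivity, specificity, FPR, accuracy) uses their standard definitions and is an assumption; it is not claimed to match the paper's Formulas (11)-(17) one to one. Treating the top k ranked proteins as the predicted essential set is also an assumption, as are all names and data.

```python
def confusion_counts(ranked, essential, k):
    """Count TP, FP, TN, FN when the top-k ranked proteins are
    predicted essential and all remaining proteins non-essential."""
    essential = set(essential) & set(ranked)  # ignore proteins not ranked
    predicted = set(ranked[:k])
    tp = len(predicted & essential)
    fp = len(predicted) - tp
    fn = len(essential) - tp
    tn = len(ranked) - tp - fp - fn
    return tp, fp, tn, fn


def standard_measures(tp, fp, tn, fn):
    """Standard definitions; assumes both classes are non-empty."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical ranking of six proteins, three of them essential.
ranked = ["p1", "p2", "p3", "p4", "p5", "p6"]
counts = confusion_counts(ranked, {"p1", "p3", "p5"}, k=2)
measures = standard_measures(*counts)
print(counts, measures)
```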

Top analysis and overlapping analysis
To further validate the performance of JDC, we adopt a top analysis that takes the top-ranked proteins of each method at several percentages (top 1%, 5%, 10%, 15%, 20%, and 25%) and counts how many of them are essential proteins. The experimental results are shown in Figs. 4 and 5.
Fig. 4 Comparison of the number of essential proteins among the top 1%, 5%, 10%, 15%, 20% and 25% of proteins ranked by JDC and by other methods on yeast data. a Top 1%. b Top 5%. c Top 10%. d Top 15%. e Top 20%. f Top 25%
As shown in Fig. 4a for the Yeast data, when we select the top 1% ranked proteins, JDC and other methods (DC, IC, EC, SC, BC, CC, NC, PeC, and WDC) identify 45, 22, 24, 24, 24, 24, 32, 40 and 36 essential proteins, respectively. Compared with the centrality methods, the number of essential proteins that JDC identifies increases by at least 43%. Compared with PeC and WDC, JDC improves by 12.5% and 25%, respectively. In Fig. 5, JDC identifies 10, 47, 75, 94, 123 and 141 essential proteins in the top 1%, 5%, 10%, 15%, 20% and 25% of proteins on E.coli data. This shows that the JDC method is better than the other methods at 5%, 10%, 20% and 25%.
Fig. 5 Comparison of the number of essential proteins among the top 1%, 5%, 10%, 15%, 20% and 25% of proteins ranked by JDC and by other methods on E.coli data. a Top 1%. b Top 5%. c Top 10%. d Top 15%. e Top 20%. f Top 25%
To find the differences and overlaps among the essential proteins identified by each method, we select the top 100 proteins ranked by each method on yeast data and investigate the overlapping relationships. Table 2 shows the intersection and difference of the results between JDC and each of the other methods, and lists the corresponding numbers and proportions of non-essential and essential proteins.
Here, |JDC ∩ C_i| denotes the number of proteins identified by both JDC and another prediction method C_i, and |C_i − JDC| denotes the number of proteins identified by C_i but not by JDC. As can be seen from Table 2, the number of non-essential proteins identified by JDC is smaller than that of the other methods, and the proportion of essential proteins is much higher than that of the other methods. Take BC as an example. The number of proteins in |C_i − JDC| for BC is 85. The percentage of essential proteins among them is 42.35%, while 78.82% of the corresponding proteins identified by JDC are essential. This means that JDC can identify more essential proteins that BC cannot.
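The top analysis and the overlap analysis can both be expressed with a few set operations, as in the sketch below. The names, the toy rankings, and the rounding rule (the top p% of n proteins is taken as the ceiling of n·p/100) are assumptions; the text does not state how fractional counts are handled.

```python
def top_percent_hits(ranked, essential, percent):
    """Number of essential proteins among the top `percent`% of a ranking.

    ASSUMPTION: the cut-off is ceil(n * percent / 100), computed with
    integers to avoid floating-point rounding.
    """
    k = -(-len(ranked) * percent // 100)  # integer ceiling division
    return len(set(ranked[:k]) & set(essential))


def overlap_analysis(rank_a, rank_b, essential, k):
    """Compare the top-k sets of two methods: size of the overlap and the
    fraction of essential proteins among those found by only one method."""
    a, b = set(rank_a[:k]), set(rank_b[:k])
    only_a, only_b = a - b, b - a

    def essential_fraction(group):
        return len(group & set(essential)) / len(group) if group else 0.0

    return {
        "common": len(a & b),
        "only_a_essential": essential_fraction(only_a),
        "only_b_essential": essential_fraction(only_b),
    }


# Hypothetical rankings of ten proteins by two methods.
rank_jdc = ["p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9", "p10"]
rank_other = ["p3", "p9", "p1", "p10", "p2", "p4", "p5", "p6", "p7", "p8"]
essential = {"p1", "p2", "p4"}

hits = top_percent_hits(rank_jdc, essential, 20)
overlap = overlap_analysis(rank_jdc, rank_other, essential, k=3)
print(hits, overlap)
```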

Jackknife analysis
Holman et al. devised a jackknife strategy that tests the performance of ranking methods [47]. We use this method to evaluate JDC and the other nine essential protein prediction methods. For each prediction method, we rank the proteins by score and count the cumulative number of true essential proteins among the top-ranked candidates. The jackknife curves of the ten essential protein prediction methods are plotted in Fig. 6, where the vertical axis represents the cumulative count of essential proteins and the horizontal axis represents the number of predicted essential proteins. The jackknife curve of the JDC method is higher than those of the other nine methods (DC, IC, EC, SC, BC, CC, NC, WDC, and PeC). The results of the jackknife analysis show that JDC is superior to the other prediction methods in identifying essential proteins. The advantage of JDC is that it can overcome the volatility of the gene expression data.
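A jackknife curve of the kind described above can be computed as a running count along the ranked list. The sketch below assumes that the curve value at position n is the number of true essential proteins among the first n ranked candidates; the function name and the toy data are hypothetical.

```python
from itertools import accumulate


def jackknife_curve(ranked, essential):
    """Cumulative count of true essential proteins along a ranking.

    The n-th entry (1-based) is the number of essential proteins
    among the top n candidates.
    """
    essential = set(essential)
    return list(accumulate(1 if p in essential else 0 for p in ranked))


# Hypothetical ranking of four proteins, two of them essential.
curve = jackknife_curve(["p1", "p2", "p3", "p4"], {"p1", "p3"})
print(curve)
```

A method whose curve lies above another's at every position finds at least as many essential proteins for any cut-off on the ranked list.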

Modularity analysis
Hart et al. indicate that the importance of proteins is related not to the proteins themselves but to specific protein complexes [48]. Zotenko et al. further demonstrate that functional protein modules contain a large number of essential proteins [49]. To verify this conclusion, we select the top 100 proteins ranked by JDC and construct a small PPI network module from those proteins and their neighbor proteins. The result is shown in Fig. 7. The top 100 proteins of JDC include 80 essential proteins (yellow nodes in Fig. 7a) and form 17 functional modules according to the Markov Cluster procedure (MCL) [50]. For WDC, the same analysis finds 68 essential proteins (yellow nodes in Fig. 7b) and 14 functional modules. The modularity of JDC is more pronounced than that of WDC. Besides, most of the essential proteins are hubs in the network, as shown in Fig. 7a, which is consistent with the views of He et al. [51]. To compare the functional modules, we perform GO enrichment analysis by using the website http://geneontology.org/. With the JDC method, 11 out of 17 functional modules have a p-value less than 0.05, whereas with WDC, 6 out of 14 functional modules have a p-value less than 0.05.

Results using fly and human dataset
To further demonstrate the advantage of our method, we compare JDC with the PeC and WDC methods on two other organisms: Fly and Human. Because the gene profiles for Human are RNA-seq expression data with tissue-specific labels, we select the datasets of two kinds of tissues for further analysis. The results on the Fly and Human datasets are listed in Table 3, which shows the number of essential proteins among the top 100, 200, 300, 400, 500, and 600 essential candidates ranked by JDC, PeC and WDC. JDC achieves the best performance in almost all cases, which indicates that JDC improves over the other methods across different organisms.

Comparison with dynamic network framework
In the previous sections, we compared JDC with various essential protein prediction methods that are based on the static PPI network. The experimental results show that our method can improve the accuracy of essential protein prediction. To further demonstrate the advantage of our method, we compare it with methods that are based on dynamic PPI networks. The results are shown in Tables 4 and 5.
Methods based on a dynamic PPI network can effectively improve the accuracy of essential protein identification for DC, EC, SC, BC, CC, IC, LAC, and NC. As shown in Table 4, when the top 100, 200, 300, 400, 500, and 600 proteins are selected, JDC identifies 80, 153, 224, 267, 315, and 355 essential proteins, respectively. As can be seen from Table 4, our method is better than the other prediction methods at the top 200, 300, 400, 500, and 600. Compared with TS-PIN, which incorporates subcellular localization information, our method also achieves similar results. As shown in Tables 4 and 5, our method exceeds the compared methods in 5 of the top-ranked sets in each table, which indicates that JDC is an effective prediction method for essential proteins.

Discussion
The difference between JDC and PeC or WDC lies in how the PPI network is weighted. PeC and WDC both adopt the Pearson product-moment correlation coefficient to measure the similarity between two sets of gene expression values. However, gene expression data are continuous values that are prone to fluctuations, which may affect prediction performance. JDC uses Boolean values to represent the "on/off" states of genes at different times in biological development, and adopts the Jaccard similarity index to measure the similarity between genes. JDC considers the co-expression state of connected genes at multiple time points, while WDC and PeC compare the similarity of the specific expression values of the two genes at different times. Based on the results from Figs. 2 and 3, the ROC curve of JDC is almost always the best on the yeast dataset, and it shows similar results on the E.coli dataset when FPR is less than 0.4. The results suggest that JDC has better sensitivity than WDC and PeC.
Recently, some computational methods for essential protein prediction have been proposed that employ a variety of biological data, including sequence, orthology, evolution, expression, and subcellular localization information. We have further compared JDC with recently developed methods that predict essential proteins by using multiple kinds of biological information. SPP adopts a strategy of sub-network partition and prioritization to predict essential proteins by fusing the PPI network and subcellular localization data, and identifies 84, 153, 210, 261, 314, and 362 essential proteins in the respective top sets. Compared with SPP, the results of JDC are improved by 6.25%, 2.32%, and 0.32% in the top 300, 400, and 500 essential candidates, respectively. In the top 100 and 600, SPP generates better results than JDC. The results indicate that both subcellular localization data and gene expression data can often improve the accuracy of essential protein prediction. NCCO fuses the PPI network and orthology information to predict essential proteins by integrating NCC (Neighborhood Closeness Centrality) and OS (Orthologous Scores). The result of JDC is better than that of NCC alone. Orthology information is adopted to assess the conservation of proteins. Many essential proteins of Yeast are conserved compared with non-essential proteins, so OS is a useful feature for NCCO in predicting essential proteins, and NCCO exhibits higher accuracy than JDC. RWEP uses a random walk algorithm to identify essential proteins by fusing the PPI network and biological properties including subcellular localization information, gene expression, complex information, and GO annotation information. Compared with RWEP, JDC achieves the better result at the top 1%, while the optimal results of RWEP are better than those of JDC at the top 5%-20%.
To obtain its optimal results, RWEP adopts a parameter that adjusts the contribution of a protein's own score and its neighbors' scores. This parameter needs to be tuned, but it is difficult to choose the best value for different datasets, and different values have a great influence on the experimental results. In summary, fusing more biological data can improve the effectiveness of methods for identifying essential proteins.

Conclusions
In this study, we propose a new essential protein identification algorithm named JDC, based on PPI networks and gene expression data. JDC eliminates the influence of fluctuations in gene expression data by calculating the similarity of the "active" and "inactive" states of gene expression in a cluster of the PPI network. Compared with nine prediction methods that use the static PPI network and two dynamic prediction methods, JDC is an effective essential protein prediction method. As future work, essential proteins could be predicted more accurately by further utilizing time-series gene expression datasets. For time-series data, dynamic methods can be used to refine the PPI network into a more reliable one, and the method can be revised to segment the time series and, within each segment, construct a static network from binarized gene expression data. The new method would combine the advantages of both dynamic network methods and the JDC method.