A virus–target host proteins recognition method based on integrated complexes data and seed extension

Background Targeted drugs play an important role in the clinical treatment of viral diseases. Virus-encoded proteins are widely used as the targets of such drugs. However, drugs aimed at viral proteins cannot cope with the drug resistance caused by viral mutation, and they ignore the importance of host proteins for virus replication. Some methods use the interactions between viruses and their host proteins to predict potential virus–target host proteins, which are less susceptible to viral mutation. However, these methods consider only the network topology between the virus and the host proteins and ignore the influence of protein complexes. Therefore, we introduce protein complexes, which are less susceptible to the drug resistance of mutated viruses; this helps recognize unknown virus–target host proteins and reduce the cost of disease treatment. Results Since protein complexes contain virus–target host proteins, it is reasonable to predict virus–target human proteins from the perspective of protein complexes. We propose a coverage clustering-core-subsidiary protein complex recognition method named CCA-SE that integrates the known virus–target host proteins, the human protein–protein interaction network, and the known human protein complexes. The proposed method aims to obtain potential unknown virus–target human host proteins. We list some of the predicted targets after validating our results in enrichment experiments. Conclusions The proposed CCA-SE method consists of two parts: CCA, which recognizes protein complexes, and SE, which selects seed nodes as the cores of protein complexes through seed expansion. The experimental results indicate that CCA-SE achieves effective recognition of virus–target host proteins. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04792-x.

However, most existing methods ignore the inherent biological structure of protein complexes, and many treat complexes simply as dense subgraphs. Few of them, such as that of Babak Khorsand [18], consider the modularity of PPI networks, let alone search for potential virus-target proteins within protein complexes. Here we adopt CCA to recognize protein complexes, showing the feasibility of seed selection, and then use CCA-SE to predict potential virus-target human proteins. First, a network representation learning technique called node2vec is used to learn a dense vector for each vertex that represents its topological information. Second, we introduce a new seed selection strategy based on the edge clustering coefficient (ECC) and node degree, and the core structure of protein complexes is detected based on SE. Then, according to the network topology, we design a fitness function to identify protein complexes with various densities and modularity. Finally, we apply CCA-SE to predict 2019-nCov-target proteins, which play a fundamental role in detecting virus drug targets.

Principle of the CCA-SE method
We developed the CCA-SE method based on the principle of a coverage clustering-core-subsidiary structure. The method contains two parts: CCA was used to recognize protein complexes, and SE was applied to select seed nodes. In particular, we introduced a new selection strategy for seed nodes. We clustered the seed nodes to obtain the core structure of protein complexes. Then, based on density and modularity, we expanded the core structure to form protein complexes. Next, we integrated the downloaded complexes data with our results. Finally, we calculated the similarity between unknown human proteins and known virus-target proteins in the same protein complexes, and we defined a scoring function to rank potential virus-target human proteins. Figure 1 shows the algorithm.

Construction of virus-host PPI network
We constructed a virus-target human PPI network based on the human protein interaction database HIPPIE and the dataset of human proteins involved in 2019-nCov infection. In typical virus-host protein studies, researchers often regard the direct neighbors of host proteins as potential host proteins [19]. Therefore, we represent the human PPI network as G = (V, E), where V includes not only the host proteins directly targeted by 2019-nCov but also the direct neighbors of those host proteins, and E is the set of human protein-protein interactions. After removing noisy data, we obtained 2308 human proteins. Details are reported in the "Methods" section.
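The construction of G = (V, E) described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `build_host_network` and the toy protein labels are hypothetical, and the noise filter used here (dropping self-interactions and duplicate edges) is an assumption, since the text does not specify how noisy data were removed.

```python
def build_host_network(ppi_edges, host_proteins):
    """Return (V, E): host proteins with at least one interaction,
    their direct neighbors, and the interactions among those nodes."""
    hosts = set(host_proteins)
    nodes = set()
    for a, b in ppi_edges:
        if a == b:
            continue  # assumed noise filter: ignore self-interactions
        if a in hosts or b in hosts:
            nodes.update((a, b))  # a host protein and its direct neighbor
    # frozenset makes edges undirected and removes duplicates such as (B, A)
    edges = {frozenset((a, b)) for a, b in ppi_edges
             if a != b and a in nodes and b in nodes}
    return nodes, edges


# Toy example with hypothetical protein labels
ppi = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("A", "A"), ("B", "A")]
V, E = build_host_network(ppi, {"B"})
print(sorted(V), len(E))
```

Here protein D is excluded because it is two steps away from the only host protein B, which mirrors the restriction of V to host proteins and their direct neighbors.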
Node2Vec [20] not only places similar nodes closer together in the vector space but also retains the network structure and captures the diverse connectivity patterns between nodes. We applied Node2Vec to the PPI network to extract hidden information and used a low-dimensional vector to represent each node.
We extracted the hidden-layer weights learned on the PPI network and represented each protein node with a 128-dimensional vector [21]. In Fig. 2, the nodes include not only the proteins targeted by the virus but also the proteins known to interact directly with virus-host proteins.

Selection of seed nodes
In a real protein network, many protein complexes overlap and share multiple nodes, so core nodes and overlapping nodes cannot be distinguished simply by degree [22]. Therefore, we use two topological properties, ECC and node degree, to evaluate the importance of nodes. The score of each protein is computed from its degree and the sum of its ECC values. Following an existing work [9], we consider a protein node to be a seed protein if its score is greater than the average score, which achieves the best results. The scoring function for each protein u is defined in Eq. (1), and Table 1 supports the feasibility of this choice.
where sum_ECC(u) denotes the sum of the ECC values of node u and degree(u) denotes the degree of node u. Equation (1) takes the network topology into account and, by dividing the sum of ECC by the node degree, reduces the score of overlapping nodes. This prevents overlapping nodes from being mistaken for seed nodes and improves the accuracy of seed selection. Table 1 is based on our results. "Score" is the threshold score used when choosing seed nodes, and "Seed" is the number of seeds. We use 0.1698, which is the average score, as the seed selection threshold. As Table 1 shows, thresholds either higher or lower than the average yield worse precision, recall, and F-score than the average score does.
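The seed selection rule can be illustrated with a small sketch. It assumes that Eq. (1) is the sum of the ECC values of a node's edges divided by the node degree, as the description above suggests, and it assumes one common ECC definition (the number of triangles through an edge divided by the maximum possible number). The function names and the toy graph are hypothetical.

```python
from collections import defaultdict


def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def ecc(adj, u, v):
    """Assumed ECC: common neighbors of (u, v) over min(deg(u)-1, deg(v)-1)."""
    possible = min(len(adj[u]) - 1, len(adj[v]) - 1)
    if possible <= 0:
        return 0.0
    return len(adj[u] & adj[v]) / possible


def seed_scores(edges):
    """Assumed form of Eq. (1): sum of ECC over incident edges, divided by degree."""
    adj = adjacency(edges)
    return {u: sum(ecc(adj, u, v) for v in adj[u]) / len(adj[u]) for u in adj}


def select_seeds(edges):
    """Keep the nodes whose score is greater than the average score."""
    scores = seed_scores(edges)
    mean = sum(scores.values()) / len(scores)
    return {u for u, s in scores.items() if s > mean}


# Toy graph: a triangle A-B-C with a tail C-D-E
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E")]
print(sorted(select_seeds(edges)))  # ['A', 'B', 'C']
```

In the toy graph, node C has the highest degree but a lower score than A and B because one of its edges lies outside the triangle, which shows how dividing by degree penalizes nodes that bridge different regions.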
The Gene Ontology (GO) database can be used to predict and analyze gene function. Usually, a gene or gene product is annotated by one or more GO terms, so the similarity between genes can be calculated to analyze and predict gene function. Traditional coverage clustering algorithms (CCA) [23] are less time-consuming, can handle large amounts of data, and produce clusters with no overlapping regions. However, without a reasonable clustering radius, these methods generate many false positives. Therefore, we use GO functional similarity to obtain a reasonable radius and improve clustering accuracy. The clustering radius is computed as follows.
We denote by r_x the distance from each unlearned seed node to the clustering center, based on the Euclidean distance. We redefine the distance as shown in Eq. (2). The clustering radius of each final cluster is obtained by Eq. (3), where D denotes the set of seed nodes that have not yet been learned. Previous work has shown that high-density subgraphs in PPI networks tend to form protein complexes [24]. In addition, a subgraph whose internal weight is much greater than its external weight is more likely to form a protein complex. Therefore, density [25] and modularity [26] are two important factors in determining whether a subgraph can form a protein complex. We introduce a new measure to evaluate protein complexes. Here, degree_in(C_u) is the number of internal edges of the complex, degree_out(C_u) is the number of edges connecting the complex to the rest of the network, E_Cu is the number of all edges in the complex, and V_Cu is the number of all nodes in the complex. Moreover, we use a parameter t to balance the weight of subgraph modularity and subgraph density, as shown in Eq. (4).
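To make the role of the balance parameter t concrete, the sketch below combines a density term and a modularity term for a candidate subgraph. The exact form of Eq. (4) is not reproduced here, so every formula in the code is an assumption: density is taken as 2|E|/(|V|(|V|-1)), modularity as degree_in/(degree_in + degree_out), and the two are mixed linearly with weight t. The function name `fitness` and the toy graph are hypothetical.

```python
def fitness(cluster, adj, t=0.5):
    """Assumed fitness: t * modularity + (1 - t) * density of a candidate complex."""
    cluster = set(cluster)
    n = len(cluster)
    if n < 2:
        return 0.0
    # Each internal edge is seen from both endpoints, hence the division by 2
    deg_in = sum(len(adj[u] & cluster) for u in cluster) // 2
    # Edges that leave the cluster
    deg_out = sum(len(adj[u] - cluster) for u in cluster)
    density = 2 * deg_in / (n * (n - 1))
    total = deg_in + deg_out
    modularity = deg_in / total if total else 0.0
    return t * modularity + (1 - t) * density


# Toy graph: a triangle A-B-C with a tail C-D-E
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
print(fitness({"A", "B", "C"}, adj))  # 0.875
```

With t = 0.5 the triangle scores higher than the triangle plus its tail node D, because adding D lowers the density more than it raises the modularity.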

Obtaining potential virus-host proteins
We obtain a set of protein complex cores. To form complete protein complexes, we add subsidiary structures to each core with the following steps, which we name SE: (i) obtain the first-order neighbor nodes of each protein complex core from the network as candidate proteins of the complex; (ii) calculate the sum of the functional similarity between the core and each neighbor node based on edge weight; (iii) rank the nodes by functional similarity score and complete the protein complex by adding affiliated nodes, recalculating the local score of the protein complex each time a new node is added; (iv) repeat (iii) until no more nodes can improve the local score, then terminate the expansion of the protein complex core. Protein complexes are molecular polymers that participate in the same functional region at the same time and in the same space and have direct or indirect effects on one another [27]. Studies have shown that the network is modular and that proteins in the same complex are more likely to undertake the same life activity or act in the same pathway [28]. In addition, the more host proteins a complex contains, the more likely the complex is to participate in life activities such as virus replication [29].
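Steps (i) to (iv) amount to a greedy expansion loop, which can be sketched as follows. The local score used by the authors is not given in this section, so the example uses a simple stand-in (internal edges divided by internal plus boundary edges); that stand-in, the function names, the toy graph, and the edge weights are all assumptions for illustration.

```python
def local_score(cluster, adj):
    """Stand-in local score (assumption): internal / (internal + boundary) edges."""
    internal = sum(len(adj[u] & cluster) for u in cluster) / 2
    boundary = sum(len(adj[u] - cluster) for u in cluster)
    total = internal + boundary
    return internal / total if total else 0.0


def expand_core(core, adj, weight, score=local_score):
    """Greedy seed extension following steps (i)-(iv)."""
    core = set(core)
    complex_ = set(core)
    # (i) first-order neighbors of the core are the candidates
    candidates = set().union(*(adj[u] for u in core)) - core

    # (ii) summed functional similarity (edge weight) between candidate and core
    def sim_to_core(v):
        return sum(weight.get(frozenset((u, v)), 0.0) for u in core & adj[v])

    # (iii) rank candidates by similarity, ties broken by name for determinism
    ranked = sorted(candidates, key=lambda v: (-sim_to_core(v), v))
    # (iii)-(iv) add a node only if it improves the local score; stop when none does
    improved = True
    while improved:
        improved = False
        for v in ranked:
            if v not in complex_ and score(complex_ | {v}, adj) > score(complex_, adj):
                complex_.add(v)
                improved = True
    return complex_


adj = {
    "A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B", "D"},
    "D": {"B", "C", "E"}, "E": {"D", "F"}, "F": {"E"},
}
weight = {frozenset(("B", "D")): 0.8, frozenset(("C", "D")): 0.6}
print(sorted(expand_core({"A", "B", "C"}, adj, weight)))  # ['A', 'B', 'C', 'D']
```

Node D is attached because it raises the local score from 0.6 to about 0.83, while E is never considered because it is not a first-order neighbor of the core.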
We obtained 255 protein complexes on the human PPI network dataset. Because these data are incomplete, we downloaded all human complexes data from the CORUM database, comprising 1948 protein complexes. We then integrated them with our results, keeping the protein complexes that have more than three nodes and contain known host proteins, and deleting redundant data. This yielded 455 human complexes, sixty of which contain more than two known host proteins. The more known host proteins a complex contains, the more likely it is to participate in viral survival and reproduction. Since proteins in the same complex have similar phenotypes, the remaining proteins may also be required for virus replication. Table 2 lists some protein complexes, in which Complex_size is the size of the module, Host_protein is the host protein of the virus, and Complexes is the collection of all proteins.
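The integration step can be sketched as a filter over the union of predicted and curated complexes. The size threshold is exposed as a parameter because the exact cutoff is an assumption, redundancy is interpreted here as exact duplication of the member set (also an assumption), and the function name and toy data are hypothetical.

```python
def integrate_complexes(predicted, curated, hosts, min_size=4):
    """Merge two complex collections, keeping complexes that are large enough,
    contain at least one known host protein, and are not exact duplicates."""
    hosts = set(hosts)
    seen = set()
    merged = []
    for members in list(predicted) + list(curated):
        key = frozenset(members)
        if len(key) < min_size:
            continue  # too small (threshold is an assumption)
        if not key & hosts:
            continue  # no known host protein in this complex
        if key in seen:
            continue  # redundant: same member set already kept
        seen.add(key)
        merged.append(key)
    return merged


predicted = [{"A", "B", "C", "D"}, {"A", "B"}]
curated = [{"D", "C", "B", "A"}, {"E", "F", "G", "H"}, {"A", "E", "F", "G", "H"}]
merged = integrate_complexes(predicted, curated, hosts={"A"})
print(len(merged))  # 2
```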
Based on Table 2, we can speculate that the unknown human proteins in a complex are closely related to the known virus-host proteins in the same complex. The table also shows that not only are the other proteins in a module related to the virus, but the module itself is also related to the virus, jointly carrying out biological functions that promote viral reproduction. Therefore, these remaining proteins are the potential virus-target proteins we seek. In short, each of the complexes contains virus-target host proteins and is associated with the virus to some degree. We then use the protein complexes and Node2Vec to calculate the similarity between the remaining unknown proteins and the known host proteins, and score the unknown proteins as potential target proteins. The score is computed as follows.
w_c denotes the proportion of known host proteins in the complex to which the potential target protein C belongs, n_c denotes the number of known host proteins in that complex, and N denotes the total number of proteins in the complex. Accordingly, w_c is obtained by Eq. (8) as the ratio of n_c to N.
The second term in Eq. (7) is the sum of the similarities between the target protein C and the known host proteins in its complex. The formula is as follows.
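A minimal sketch of the candidate score follows. It assumes that the two terms of Eq. (7) are added, that w_c is the fraction of known host proteins in the complex, and that similarity is the cosine similarity of the Node2Vec vectors; the combination rule and the similarity measure are assumptions, and the function names and two-dimensional toy embeddings are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0


def candidate_score(candidate, members, hosts, emb):
    """Assumed score: host-protein proportion plus summed similarity to hosts."""
    known = [p for p in members if p in hosts]
    w_c = len(known) / len(members)  # n_c / N
    similarity = sum(cosine(emb[candidate], emb[h]) for h in known)
    return w_c + similarity


members = ["P1", "P2", "P3", "P4"]
hosts = {"P1", "P2"}
emb = {"P1": [1.0, 0.0], "P2": [0.0, 1.0], "P3": [1.0, 0.0], "P4": [1.0, 1.0]}
ranking = sorted((p for p in members if p not in hosts),
                 key=lambda p: candidate_score(p, members, hosts, emb),
                 reverse=True)
print(ranking)  # ['P4', 'P3']
```

In practice the embeddings would be the 128-dimensional Node2Vec vectors described earlier, and candidates would be ranked across all complexes rather than within one.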

Statistical model
We define the functional similarity between two interacting proteins a and b as GO_similarity, as follows.
Based on this definition of protein functional similarity, the adjacency matrix A_ij of graph G can be expressed as follows, where e_ij equals GO_similarity.
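The weighted adjacency matrix can be sketched once a concrete similarity is chosen. The definition of GO_similarity is not reproduced in this section, so the example substitutes the Jaccard index of the two annotation sets, which is a common choice but an assumption here. The function names and the placeholder term labels t1 to t4 are hypothetical.

```python
def go_similarity(go_a, go_b):
    """Stand-in GO similarity (assumption): Jaccard index of two annotation sets."""
    a, b = set(go_a), set(go_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_adjacency(proteins, edges, annotations):
    """Symmetric matrix with A[i][j] = GO similarity for interacting pairs, else 0."""
    index = {p: i for i, p in enumerate(proteins)}
    matrix = [[0.0] * len(proteins) for _ in proteins]
    for a, b in edges:
        w = go_similarity(annotations.get(a, ()), annotations.get(b, ()))
        matrix[index[a]][index[b]] = w
        matrix[index[b]][index[a]] = w
    return matrix


proteins = ["P1", "P2", "P3"]
edges = [("P1", "P2"), ("P2", "P3")]
annotations = {"P1": {"t1", "t2"}, "P2": {"t2", "t3"}, "P3": {"t4"}}
A = weighted_adjacency(proteins, edges, annotations)
print(A[0][1])
```

Non-interacting pairs keep a weight of zero regardless of their annotations, so the matrix remains a weighted version of the original PPI topology.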

Dataset
Based on the latest study of the 2019-nCov virus-host protein interaction map [30], we obtained 332 high-confidence 2019-nCov-human protein-protein interactions. We mapped the proteins to UniProt IDs through the UniProt database and obtained 256 key host factors for our experiments. To reduce the impact of data redundancy on the results, we used the human PPI data in HIPPIE, which integrates interaction data from multiple databases, including MINT, HPRD, and BioGrid. HIPPIE is a widely used PPI database. We therefore downloaded 65,536 PPI records from the HIPPIE database, involving 11,564 human proteins.

Weighting protein networks by GO annotations
GO is an internationally standardized gene function classification system. It consists of a predefined set of GO terms that delimit and describe the functions of gene products. GO terms provide the logical structure and relationships of biological functions and are classified into biological process (BP), molecular function (MF), and cellular component (CC). GO annotations [31] associate gene products with the GO terms that describe their functions. We use G = (V, E) to represent the protein network, where V is the set of proteins and E is the set of protein-protein interactions. The specific steps are as follows:
(i) We assume that protein a has N GO annotation sets on BP, M GO annotation sets on MF, and K GO annotation sets on CC.
(ii) According to the hierarchical structure of GO, we calculate all parent annotation sets of protein a under the different categories, add them to the original annotation set GO(a), and remove redundant data. The functional annotation of protein a is obtained as follows.
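Step (ii) is a closure over the GO hierarchy, sketched below. The `parents` mapping stands in for the GO parent relations, and the term labels t1 to t5 are placeholders rather than real GO identifiers; both, along with the function names, are assumptions for illustration.

```python
def with_ancestors(terms, parents):
    """Return the annotation set extended with every ancestor term, no duplicates."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))  # walk up to the parent terms
    return closed


def full_annotation(bp, mf, cc, parents):
    """GO(a): union of the ancestor-closed BP, MF, and CC annotation sets."""
    return (with_ancestors(bp, parents)
            | with_ancestors(mf, parents)
            | with_ancestors(cc, parents))


# Toy hierarchy: t3 -> t2 -> t1, and t4 has two parents
parents = {"t3": ["t2"], "t2": ["t1"], "t4": ["t1", "t2"]}
print(sorted(with_ancestors({"t3", "t4"}, parents)))  # ['t1', 't2', 't3', 't4']
```

Using a set for the result removes redundant terms automatically, which matters because a GO term can be reached through more than one parent.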

Cross-validation method
We adopt cross-validation to tune the parameters on a moderately sized source dataset and apply them to practical problems. Furthermore, we use K-fold cross-validation to evaluate the experimental results.
First, we divide the identified host protein complexes into a training set and a validation set at a ratio of 8:2. To determine how many validation proteins appear in the top k, we conduct ten groups of control experiments and use that ratio as the final classification standard for target proteins. To reduce data redundancy, we delete candidate proteins with a score of zero in the final ranking. Moreover, k ranges from 0 to 100 with a step of 10. The cross-validation shows that the prediction results can be divided into four categories, as shown in Table 3.
In this section, the true-positive rate (TPR) and false-positive rate (FPR) are obtained from the four categories in Table 3, as shown in Eqs. (16) and (17). TPR indicates the prediction coverage of our method.
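The four categories and the two rates can be computed from a ranked candidate list as sketched below. The sketch uses the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN), which Eqs. (16) and (17) are assumed to follow. The function names are hypothetical, and the code assumes every held-out positive appears somewhere in the ranked list.

```python
def confusion_at_k(ranked, positives, k_percent):
    """Treat the top k% of the ranking as predicted targets and count TP, FP, FN, TN."""
    cutoff = len(ranked) * k_percent // 100
    predicted = set(ranked[:cutoff])
    positives = set(positives)
    tp = len(predicted & positives)
    fp = len(predicted) - tp
    fn = len(positives - predicted)
    tn = len(ranked) - tp - fp - fn
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Standard true-positive and false-positive rates."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


ranked = list("ABCDEFGHIJ")   # candidates sorted by descending score
held_out = {"A", "C", "H"}    # validation host proteins
print(confusion_at_k(ranked, held_out, 40))  # (2, 2, 1, 5)
```

Sweeping `k_percent` from 0 to 100 in steps of 10 yields one (FPR, TPR) pair per threshold, which is the input needed for a ROC curve.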

Selection of parameter k
We analyze the influence of the parameter k and then select the best value. We let k vary from 0 to 100. The ROC curve is shown in Fig. 3. The prediction results are divided into two categories: the top k% of the ranked proteins are considered predicted potential target proteins, and the remaining (100-k)% are not. Fig. 3 shows that when k is 40, CCA-SE successfully predicts 22 known host proteins. In the host PPI network described above, 80% of the proteins interact with nonstructural proteins encoded by 2019-nCov. In addition, this result shows that the predicted protein scores are high when the value of k increases, indicating that Eq. (7) helps to improve the ranking of candidate proteins. We plot the AUC values to compare the groups of experiments. As shown in Fig. 4, the AUC is relatively stable across the ten groups of control experiments, at around 0.81, indicating that the algorithm still performs well even when the amount of data is smaller than that of the actual biological network. In summary, in the subsequent prediction of potential virus-target proteins, we set k to 40 to achieve the best experimental results.
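The area under a ROC curve built from a handful of k thresholds can be approximated with the trapezoidal rule, as in the sketch below. The function name and the sample points are hypothetical and are not taken from Fig. 3 or Fig. 4.

```python
def auc_trapezoid(points):
    """Trapezoidal area under a ROC curve given (FPR, TPR) points.
    The end points (0, 0) and (1, 1) are added if they are missing."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# Hypothetical ROC points, one per threshold
roc_points = [(0.2, 0.6), (0.5, 0.9)]
print(round(auc_trapezoid(roc_points), 2))  # 0.76
```

A curve that follows the diagonal gives 0.5, so values around 0.81 as reported for the ten control groups indicate a ranking clearly better than random ordering.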

Comparative experiment of data integration
To make a horizontal comparison and evaluate the applicability of the integrated data [32], we compare the results obtained when the complexes recognized by CCA-SE are added to the public data with the results obtained from the public human protein complexes dataset alone. We again use TPR and FPR as evaluation indicators. Table 4 shows that the TPR obtained with only the public human protein complexes is 68.67%, which indicates that the integrated protein complexes data improve accuracy and make the biological data more comprehensive. It also suggests that the complexes recognized by CCA-SE are biologically meaningful.

Performing GO enrichment analysis on prediction results
After biological data are obtained, differential gene analysis between samples is a necessary step for reading the genetic information. The resulting genes need to be annotated because their number may be large and they are difficult to compare directly. A common method divides these genes or proteins into several categories, where one category corresponds to one GO term. This process is called enrichment analysis. Commonly used enrichment analysis methods include GO analysis and KEGG analysis.
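The statistical idea behind enrichment analysis can be illustrated with a one-sided hypergeometric test, which asks how likely it is to see at least the observed number of annotated proteins in a sample by chance. This is a generic sketch, not the exact statistic of any particular tool (DAVID, for example, uses a modified Fisher exact test); the function name and the toy counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) when `sample` proteins are drawn without replacement from
    `population` proteins, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(comb(annotated, i) * comb(population - annotated, sample - i)
               for i in range(hits, upper + 1))
    return tail / total


# Toy counts: 20 background proteins, 5 carry the term, 5 predicted, 3 of them annotated
p = enrichment_p_value(population=20, annotated=5, sample=5, hits=3)
print(round(p, 4))  # 0.0726
```

A term is usually reported as enriched when this probability falls below a chosen threshold such as 0.05, often after correction for testing many terms.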
We integrate the known virus-target host proteins with the human proteins that interact with them directly and apply CCA-SE to the whole network. Our predicted results are listed in "Additional file 1" under the name "new targets". Among the top ten predicted proteins, eight have been confirmed: Q13617, Q15370, P62877, Q15369, P09884, Q14181, P35658, and P78406. It has been reported that 87% of them bind to non-structural proteins of the virus. These non-structural proteins are cleaved by 3CLPro, a protease that is necessary for the reproduction of 2019-nCov. Moreover, researchers have not found similar cleavage sites in the human body, so targeting 3CLPro as a drug target has medical value. Based on the above analysis, CCA-SE can recognize virus-target host proteins. Table 5 lists some of the potential target proteins.
To verify the accuracy of CCA-SE in predicting candidate virus-target proteins, we conduct gene enrichment analysis only for the top 50 potential target proteins. Table 6 lists the enrichment analysis results for the high-scoring proteins. All listed GO terms have a p value of less than 0.05. The results in Table 6 show that most predicted GO annotations of the target proteins are related to functions and processes such as protein binding, enzyme binding, transcription factor binding, transcription factor activity, protein kinase binding, apoptosis, and proliferation. One example is GO:0005515, which belongs to MF. Previous studies have found that the receptor-binding domain (RBD) of the spike protein encoded by 2019-nCov contains six key amino acids: L455, F486, Q493, S494, N501, and Y505. The RBD can bind to the ACE2 protein of human lung epithelial cells, so we can infer that the host proteins with this functional annotation are strongly related to these residues. Another example is GO:0016032, which plays an important role in viral infection and relates to viral genome replication and the assembly of progeny virus particles. Moreover, Babak Khorsand [18] listed the most central nodes in the human interactions of 2019-nCov, namely Q86VP6, Q92905, Q13573, and P01106. Our results include Q92905, Q13573, and P01106, and both studies performed experiments on the same datasets. The predicted potential 2019-nCov-target host proteins thus show significant enrichment, which demonstrates the accuracy of our prediction based on a molecular network.
Once the differential gene information is available, gene enrichment analysis can be used to discover the biological pathways that play a key role in biological processes, giving a clearer view of gene function and a better understanding of the underlying molecular mechanisms. KEGG pathway analysis [33] uses pathway databases and human-related pathways to analyze the predicted proteins.
In Fig. 5, C1-C6 represent the results of the KEGG pathway enrichment analysis. C1 means Annotation Cluster 1, C2 means Annotation Cluster 2, and so on. We sort the results in descending order of Enrichment Score, and "other" includes the genes that do not belong to any of the clusters. Because these genes do not show functional characteristics in our pathway analysis, we consider them less important in our potential virus-target experiments. According to Fig. 5, a total of 42 pathways (p-value < 0.05) are obtained by screening the high-scoring proteins. They mainly include the TNF signaling pathway, the T cell receptor (TCR) signaling pathway, and the MAPK pathway, which are related to cell cycle and inflammatory immune regulation; the PI3K-Akt signaling pathway and the HIF-1 signaling pathway, which are related to the regulation of pulmonary fibrosis; the renal cell carcinoma pathway related to viral diseases; and the human immunodeficiency virus 1 infection pathway. These pathways demonstrate that although some predicted host proteins do not directly interact with virus-encoded proteins, they are closely related to the pathogenesis of the virus.
Existing studies have shown that the MAPK pathway is related to cell growth and mutations. TNF is mainly produced by T cells and NK cells, both of which are closely related to inflammation. They can release signals that interact with specific receptors on the cell surface, making them conservative, so that MAPK-JNK, 5-lipoxygenase, and other signaling pathways are activated. This disrupts inflammation-related cytokines, for example through abnormal expression of gp130 and IL-1, and promotes the human inflammatory response [3,4].
The PI3K-Akt signaling pathway is closely related to the renal cell carcinoma pathway, as shown in Fig. 6. Tyrosine kinase receptors can activate the phosphatidylinositol 3-kinase (PI3K) signaling pathways, which are related to cell proliferation and apoptosis. When PI3Ks are activated, they produce a messenger that binds to the signal protein PDK1, which contains a PH domain. Through phosphorylation, the Akt signaling protein is activated to form PI3K-Akt. This protein can also phosphorylate and regulate the downstream factor mammalian target of rapamycin (mTOR) [34], thereby activating the mTORC1 pathway, which participates in the expression of T cytokines in the immune system and promotes the enhancement of the immune system.

Conclusion
In this paper, we proposed CCA-SE, a method for recognizing protein complexes. The protein complexes obtained by CCA-SE were integrated with known human protein complexes to obtain a more reliable protein complexes dataset, and we then defined a scoring function to obtain potential target proteins. To predict virus-target human proteins, the scoring function takes into account not only the relationship between the protein complexes and the virus-encoded proteins but also the protein itself. Moreover, we verified the effectiveness of CCA-SE on the biological network under different parameter settings. The selected target proteins were also imported into the DAVID v6.7 database (https://david.ncifcrf.gov/), where we conducted GO function enrichment analysis and KEGG signaling pathway enrichment analysis. The analysis explained the correlation between the results predicted by CCA-SE and the life processes of virus infection and replication, and supported the accuracy of the predictions from a biological perspective. The experimental results showed that CCA-SE can effectively recognize the human proteins targeted by 2019-nCov and can play a fundamental role in detecting virus drug targets.