IILLS: predicting virus-receptor interactions based on similarity and semi-supervised learning

Background Viral infectious diseases are a serious threat to human health. Receptor binding is the first step in the viral infection of hosts. To treat human viral infectious diseases more effectively, hidden virus-receptor interactions must be discovered. However, current computational methods for predicting virus-receptor interactions are limited. Results In this study, we propose a new computational method (IILLS) to predict virus-receptor interactions, based on an Initial Interaction scores method via the neighbors and the Laplacian regularized Least Square algorithm. IILLS integrates the known virus-receptor interactions and the amino acid sequences of receptors. The similarity of viruses is calculated by the Gaussian Interaction Profile (GIP) kernel. For receptors, we compute both the GIP similarity and the sequence similarity and, according to the prediction results, use the sequence similarity as the final receptor similarity. 10-fold cross validation (10CV) and leave-one-out cross validation (LOOCV) are used to assess the prediction performance of our method. We also compare our method with three competing methods (BRWH, LapRLS, CMF). Conclusion The experimental results show that IILLS achieves AUC values of 0.8675 and 0.9061 with 10CV and LOOCV, respectively, which are higher than those of the competing methods. In addition, the case studies further indicate that IILLS is effective for virus-receptor interaction prediction.


*Correspondence: duangh@csu.edu.cn. School of Computer Science and Engineering, Central South University, 932 South Lushan Rd, 410083 ChangSha, China

Background
Viruses are the most abundant biological entities on the planet and are widely distributed in living organisms and in the environment [1,2]. In particular, they are an important part of the human microbiome, which is closely related to human health and disease [3]. Hundreds of human diseases are caused by viruses [4], such as Ebola virus (EBOV) [5], Zika virus [6], American Machupo virus (MACV), Guanarito virus (GTOV), Sabia virus (SABV) and Junin virus (JUNV) [7]. In marine environments, viruses can kill up to 40% of the standing stock of prokaryotes daily [8]. In addition, virus infections can cause cellular and physiological changes in host cells, such as altering genomic sequences and disrupting host functions [9,10].
When viruses contact the surface of host cells, the infection process starts [11]. In general, receptor binding is considered the first step in the viral infection of host cells [12]. Viruses can use diverse types of molecules to attach to and enter cells, and specificity and affinity are the main factors that determine this use [13]. With the development of high-throughput technologies, many studies indicate that some molecules, including proteins, are receptors of viruses [14], as are carbohydrates and lipids [15]. Furthermore, the virus-receptor interaction is a dynamic process: it can evolve over the course of an infection, and virus variants with distinct receptor-binding specificity and tropism can appear [13]. To help understand the interaction mechanism between viruses and receptors, Zhang et al. constructed a database of mammalian virus-receptor interactions, called viralReceptor [16]. ViralReceptor consists of 128 viral species or sub-species, 119 mammalian receptors and 268 interaction pairs between them. In addition, structural and functional analysis of receptors provides a theoretical basis for discovering new virus-receptor interactions, with relevant features including protein domains, a higher level of N-glycosylation, a higher ratio of self-interaction, and so on [16].
In this study, we propose a computational method (IILLS) to predict virus-receptor interactions, based on an Initial Interaction scores method via the neighbors and the Laplacian regularized Least Square algorithm (a semi-supervised learning method). IILLS integrates the known virus-receptor interactions and the amino acid sequences of receptors to compute similarities of viruses and receptors. IILLS then uses the Laplacian regularized Least Square algorithm and neighbor-based initial interaction scores to construct the computational model. We conduct 10-fold cross validation (10CV) and leave-one-out cross validation (LOOCV) to assess the prediction performance of IILLS and compare it with three other methods. IILLS achieves the best prediction performance in terms of AUC (the area under the ROC curve), with AUC values of 0.8675 and 0.9061 in 10CV and LOOCV, respectively. The case study results also show that IILLS is an effective virus-receptor prediction method.
We also provide IILLS via a web server for predicting virus-receptor interactions. The input of this web server is a receptor amino acid sequence or a txt file with multiple sequences in FASTA format. When a single sequence is submitted, the prediction result is displayed after submission. When a txt file of sequences is uploaded, the prediction results are sent by email as a link to the result page, so an email address must be provided. In addition, a job ID is assigned after each submission, and the user can also use this job ID to obtain the prediction result from the web server.

Materials
We download the known mammalian virus-receptor interactions from the viralReceptor database and extract the human virus-receptor interactions as the benchmark dataset. It includes 104 virus species or sub-species, 74 receptors and 211 interaction pairs between viruses and receptors. The detailed node degree distributions of viruses and receptors in this standard virus-receptor interaction network are shown in Figs. 1 and 2. The degree of a node is the number of edges in the virus-receptor interaction network that have this node as an endpoint. Each color represents the proportion of viruses (receptors) that have the same node degree. In Fig. 1, the node degrees of the 104 viruses range from 1 to 8, and the corresponding proportions are 56.7%, 19.2%, 8.7%, 6.7%, 1.9%, 3.8%, 1.0% and 1.9%, respectively. In Fig. 2, each color represents the proportion of receptors with the same node degree. For example, the red color indicates that 8.1% of all receptors have a node degree of 4.
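The reported proportions can be cross-checked against the dataset size. The short sketch below converts the Fig. 1 percentages back into virus counts (the counts are inferred here by rounding and are not stated in the text) and checks that they agree with 104 viruses and 211 interaction pairs.

```python
# Proportions of viruses with node degree 1..8, as reported for Fig. 1.
percentages = [56.7, 19.2, 8.7, 6.7, 1.9, 3.8, 1.0, 1.9]
n_viruses = 104

# Inferred counts per degree (an assumption: rounding the percentages back).
counts = [round(p / 100 * n_viruses) for p in percentages]

# Every edge has exactly one virus endpoint, so the sum of virus degrees
# equals the number of interaction pairs.
total_viruses = sum(counts)
total_edges = sum(degree * c for degree, c in enumerate(counts, start=1))

print(counts)         # [59, 20, 9, 7, 2, 4, 1, 2]
print(total_viruses)  # 104
print(total_edges)    # 211
```

The degree sum of 211 matches the number of interaction pairs in the benchmark dataset, so the percentages are internally consistent.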

Similarity of viruses
Based on the assumption that similar viruses exhibit similar interaction profiles with receptors [17][18][19][20], we use the Gaussian Interaction Profile (GIP) similarity to measure the virus similarity. Let $V = \{v_1, v_2, ..., v_{N_v}\}$ be the set of $N_v$ viruses, $P = \{p_1, p_2, ..., p_{N_p}\}$ be the set of $N_p$ receptors, and $Y \in \mathbb{R}^{N_v \times N_p}$ be the adjacency matrix of the bipartite graph that describes the known virus-receptor associations. When the virus $v_i$ and receptor $p_j$ have a known interaction, the value of $y_{ij}$ is 1, and otherwise 0. The GIP similarity of viruses $v_1$ and $v_2$ can be computed as follows:
\[ S_v(v_1, v_2) = \exp\left(-\gamma_v \lVert y_{v_1} - y_{v_2} \rVert^2\right) \]
\[ \gamma_v = \gamma'_v \Big/ \left( \frac{1}{N_v} \sum_{i=1}^{N_v} \lVert y_{v_i} \rVert^2 \right) \]
in which $y_{v_1} = \{y_{11}, y_{12}, ..., y_{1N_p}\}$ and $y_{v_2} = \{y_{21}, y_{22}, ..., y_{2N_p}\}$ are the interaction profiles of virus $v_1$ and virus $v_2$, respectively. The parameter $\gamma_v$ regulates the kernel bandwidth. The value of the bandwidth parameter $\gamma'_v$ can be set by cross validation. In this study, $\gamma'_v$ is set to 1 according to previous successful studies [17,21,22] and the analysis of its influence on prediction performance under 10-fold cross validation.
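A minimal sketch of the GIP computation is given below. It assumes the standard Gaussian kernel form, exp(-γ‖y_i − y_j‖²), with the bandwidth γ obtained by dividing the user-set parameter (1 in this study) by the mean squared norm of the interaction profiles. The function name `gip_similarity` and the toy matrix are illustrative, not part of IILLS.

```python
import numpy as np

def gip_similarity(Y, gamma_prime=1.0):
    """GIP kernel between the rows (interaction profiles) of a binary matrix Y."""
    Y = np.asarray(Y, dtype=float)
    sq_norms = np.sum(Y ** 2, axis=1)
    # Bandwidth: gamma' normalized by the mean squared norm of the profiles
    # (assumes at least one known interaction, so the mean is non-zero).
    gamma = gamma_prime / np.mean(sq_norms)
    # Pairwise squared Euclidean distances between profiles.
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Y @ Y.T)
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative values
    return np.exp(-gamma * sq_dist)

# Toy adjacency matrix: 3 viruses (rows) x 3 receptors (columns).
Y_toy = np.array([[1, 0, 1],
                  [1, 0, 0],
                  [0, 1, 0]])
S_v = gip_similarity(Y_toy)        # virus similarity from the rows of Y
S_p_gip = gip_similarity(Y_toy.T)  # receptor similarity from the columns of Y
```

Applying the same function to the transpose of Y gives the receptor GIP similarity, since receptor profiles are the columns of Y.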

Similarity of receptors
In this study, we use two methods to measure the receptor similarity: the GIP similarity and the amino acid sequence similarity. The GIP similarity of receptors is also computed from the known interactions of receptors. Specifically, for receptors $p_1$ and $p_2$, the GIP similarity can be calculated as follows:
\[ S_{gip}(p_1, p_2) = \exp\left(-\gamma_p \lVert y_{p_1} - y_{p_2} \rVert^2\right) \]
\[ \gamma_p = \gamma'_p \Big/ \left( \frac{1}{N_p} \sum_{j=1}^{N_p} \lVert y_{p_j} \rVert^2 \right) \]
in which $y_{p_1} = \{y_{11}, y_{21}, ..., y_{N_v 1}\}^T$ is the interaction profile of receptor $p_1$ and $y_{p_2} = \{y_{12}, y_{22}, ..., y_{N_v 2}\}^T$ is the interaction profile of receptor $p_2$. The parameter $\gamma_p$ controls the kernel bandwidth, and the parameter $\gamma'_p$ is also set to 1.
In addition, we compute the sequence similarity between receptors. First, we download the amino acid sequences of receptors from the KEGG GENE database [23]. The receptor sequence similarity is then computed as the normalized Smith-Waterman score [24,25]. For receptors $p_1$ and $p_2$, the sequence similarity can be calculated as follows:
\[ S_{seq}(p_1, p_2) = \frac{SW(p_1, p_2)}{\sqrt{SW(p_1, p_1)}\,\sqrt{SW(p_2, p_2)}} \]
in which $SW(p_1, p_2)$ is the original Smith-Waterman score between receptor $p_1$ and receptor $p_2$.
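The normalization can be illustrated with a small self-contained sketch. It assumes the common normalization of a Smith-Waterman score by the geometric mean of the two self-alignment scores. The scoring scheme (match +2, mismatch -1, linear gap -1) and the example peptides are placeholders chosen for brevity; the substitution matrix and gap penalties actually used are not stated here, and a production pipeline would typically use an amino acid substitution matrix.

```python
import math

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def normalized_sw(a, b):
    """SW(a, b) divided by sqrt(SW(a, a) * SW(b, b)); inputs must be non-empty."""
    return smith_waterman(a, b) / math.sqrt(
        smith_waterman(a, a) * smith_waterman(b, b)
    )

# Hypothetical short peptides, used only to exercise the functions.
same = normalized_sw("MKTAYIAK", "MKTAYIAK")
partial = normalized_sw("MKTAYIAK", "MKTAY")
```

With this normalization, a sequence compared with itself scores 1, and every other pair falls between 0 and 1.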
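Based on the GIP similarity and the sequence similarity of receptors, we construct the final similarity of receptors $S_p$ as follows:
\[ S_p = \alpha S_{gip} + (1 - \alpha) S_{seq} \]
where $S_{gip}$ and $S_{seq}$ are the GIP similarity matrix and the sequence similarity matrix of receptors, and α is the weight parameter.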
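As a small illustration, the weighted combination can be written as below. The sketch assumes a convex combination in which α = 0 keeps only the sequence similarity and α = 1 keeps only the GIP similarity, which is how the parameter analysis later in the paper describes the two extremes. The function name and toy matrices are hypothetical.

```python
import numpy as np

def combine_receptor_similarity(S_gip, S_seq, alpha):
    """Weighted receptor similarity: alpha * GIP + (1 - alpha) * sequence."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    S_gip = np.asarray(S_gip, dtype=float)
    S_seq = np.asarray(S_seq, dtype=float)
    return alpha * S_gip + (1.0 - alpha) * S_seq

# Toy 2 x 2 similarity matrices (illustrative values only).
S_gip_toy = np.array([[1.0, 0.2], [0.2, 1.0]])
S_seq_toy = np.array([[1.0, 0.6], [0.6, 1.0]])
S_p_half = combine_receptor_similarity(S_gip_toy, S_seq_toy, alpha=0.5)
```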

Initialized interaction profiles for new viruses and receptors
The quality of the known virus-receptor interactions has an important impact on the performance of the prediction method. In this study, we set initial interaction scores for viruses (receptors) that have no known interaction with any receptor (virus). Inspired by the KNN method, we take into consideration the interaction profiles of all neighbors that have known interactions. For example, the initial interaction score between a new virus $v_i$ and receptor $p_j$ is calculated from the interaction profiles of the neighboring viruses $v_l$ and from $S_v^{(il)}$, the GIP similarity between viruses $v_i$ and $v_l$.
Similarly, we apply the same model to calculate the interaction profiles of new receptors. Specifically, the initial interaction score between virus $v_i$ and a new receptor $p_j$ is calculated from the interaction profiles of the neighboring receptors $p_l$ and from $S_p^{(jl)}$, the final similarity between receptors $p_j$ and $p_l$.
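One plausible way to realize this initialization is a similarity-weighted average over the neighbors that have known interactions. The sketch below is written under that assumption; the exact weighting and normalization used by IILLS may differ, and the function name `initialize_new_rows` is hypothetical.

```python
import numpy as np

def initialize_new_rows(Y, S):
    """Fill all-zero rows of Y with similarity-weighted averages of known rows.

    Y: interaction matrix whose rows are the entities to initialize.
    S: similarity matrix between those entities.
    Assumption: weights are normalized by the sum of similarities.
    """
    Y = np.asarray(Y, dtype=float)
    S = np.asarray(S, dtype=float)
    known = Y.sum(axis=1) > 0  # entities with at least one known interaction
    Y_init = Y.copy()
    for i in np.flatnonzero(~known):
        weights = S[i, known]
        total = weights.sum()
        if total > 0:
            Y_init[i] = weights @ Y[known] / total
    return Y_init

# Toy example: the third virus has no known interaction.
Y_toy = np.array([[1, 0], [0, 1], [0, 0]])
S_toy = np.array([[1.0, 0.2, 0.6],
                  [0.2, 1.0, 0.2],
                  [0.6, 0.2, 1.0]])
Y_filled = initialize_new_rows(Y_toy, S_toy)
```

For new receptors, the same function can be applied to the transpose of Y with the receptor similarity matrix, and the result transposed back.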

Laplacian regularized least square for virus-receptor interaction prediction
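Inspired by successful applications of the Laplacian regularized Least Square (LapRLS) model in predicting drug-target interactions [26][27][28], we adopt the LapRLS model to predict virus-receptor interactions. After obtaining the similarity matrices, we construct the normalized Laplacian matrices for viruses and receptors as follows:
\[ L_v = D_v^{-1/2} (D_v - S_v) D_v^{-1/2}, \qquad L_p = D_p^{-1/2} (D_p - S_p) D_p^{-1/2} \]
in which $D_v$ and $D_p$ are the diagonal matrices whose diagonal entries are the row sums of $S_v$ and $S_p$, respectively. For viruses and receptors, the prediction matrices $F_v$ and $F_p$ are calculated from the LapRLS model by minimizing the following cost functions:
\[ \min_{F_v} \; \lVert Y - F_v \rVert_F^2 + \beta_v \, \mathrm{tr}\left(F_v^T L_v F_v\right) \]
\[ \min_{F_p} \; \lVert Y^T - F_p \rVert_F^2 + \beta_p \, \mathrm{tr}\left(F_p^T L_p F_p\right) \]
in which tr(.) is the trace of a matrix, $Y$ is the adjacency matrix of the known virus-receptor interactions, $L_v$ and $L_p$ are the normalized Laplacian matrices of the virus similarity and the receptor similarity, and $\lVert . \rVert_F$ is the Frobenius norm. $\beta_v$ and $\beta_p$ are the trade-off parameters and are set to 1. According to previous studies [29], the computation model can be solved by:
\[ F_v^* = (I + \beta_v L_v)^{-1} Y, \qquad F_p^* = (I + \beta_p L_p)^{-1} Y^T \]
Finally, we obtain the virus-receptor interaction prediction matrix $F^*$ as the mean of the results for viruses and receptors:
\[ F^* = \frac{F_v^* + (F_p^*)^T}{2} \]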
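The prediction step can be sketched as follows. The sketch assumes the cost function ‖Y − F‖²_F + β·tr(FᵀLF) for each side, whose minimizer is obtained by solving the linear system (I + βL)F = Y. The function names and toy matrices are illustrative, and the code is not claimed to reproduce the IILLS implementation.

```python
import numpy as np

def normalized_laplacian(S):
    """Return D^{-1/2} (D - S) D^{-1/2} for a symmetric similarity matrix S."""
    S = np.asarray(S, dtype=float)
    d_inv_sqrt = 1.0 / np.sqrt(S.sum(axis=1))  # assumes every row sum is > 0
    return np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]

def laprls_predict(Y, S_v, S_p, beta_v=1.0, beta_p=1.0):
    """Average of the virus-side and receptor-side LapRLS score matrices."""
    Y = np.asarray(Y, dtype=float)
    L_v = normalized_laplacian(S_v)
    L_p = normalized_laplacian(S_p)
    # Minimizer of ||Y - F||_F^2 + beta * tr(F^T L F): solve (I + beta L) F = Y.
    F_v = np.linalg.solve(np.eye(L_v.shape[0]) + beta_v * L_v, Y)
    F_p = np.linalg.solve(np.eye(L_p.shape[0]) + beta_p * L_p, Y.T)
    return 0.5 * (F_v + F_p.T)

# Toy data: 3 viruses x 2 receptors, with hand-picked similarity matrices.
Y_toy = np.array([[1, 0], [1, 1], [0, 1]])
S_v_toy = np.array([[1.0, 0.5, 0.1],
                    [0.5, 1.0, 0.5],
                    [0.1, 0.5, 1.0]])
S_p_toy = np.array([[1.0, 0.3], [0.3, 1.0]])
F_star = laprls_predict(Y_toy, S_v_toy, S_p_toy)
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse; because the normalized Laplacian of a non-negative symmetric similarity matrix is positive semi-definite, the system matrix is invertible for any β ≥ 0.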

Performance evaluation
In order to assess the prediction performance of IILLS, we conduct 10CV and LOOCV, with the AUC as the evaluation metric. We compare our method with three other methods: BRWH [30], LapRLS [26] and CMF [31]. Figure 3 shows the prediction performance of the four methods in 10CV. Compared with the other methods (BRWH: 0.7959, LapRLS: 0.7577, CMF: 0.7128), IILLS achieves the best prediction performance with an AUC value of 0.8675. Figure 4 also shows that IILLS is superior to the other methods in terms of AUC values in LOOCV (IILLS: 0.9061, BRWH:

Analyzing receptor similarity
In this study, we also analyze how the receptor similarity, built from the GIP similarity and the sequence similarity, affects the prediction performance through the parameter α. We conduct 10CV and LOOCV to compute the prediction performance. Table 1 shows the 10CV prediction performance for values of α ranging from 0 to 1.0 in increments of 0.1. We can see from Table 1 that our method obtains the best 10CV prediction performance when only the sequence similarity is used (α = 0), and that the AUC value decreases slightly as α increases from 0 to 1.0. Table 2 shows the LOOCV prediction performance for the same values of α. We can see from Table 2 that our method also obtains the best LOOCV prediction performance when only the sequence similarity is used (α = 0), again with a slight decrease in AUC as α increases from 0 to 1.0. Therefore, we set α to 0 in this study.
In addition, we provide the ROC curves of our method for three values of the parameter α. The first case uses only the sequence similarity of receptors (α = 0). The second uses only the GIP similarity of receptors (α = 1.0). The third uses the mean of the GIP similarity and the sequence similarity of receptors (α = 0.5). Figures 5 and 6 show the prediction performance of IILLS under these three receptor similarities in 10CV and LOOCV, respectively. We can also see from Figs. 5 and 6 that IILLS achieves the best prediction performance when only the sequence similarity is used.
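For reference, the AUC used throughout this evaluation can be computed directly from labels and predicted scores as the probability that a randomly chosen positive pair is scored higher than a randomly chosen negative pair. The sketch below uses that pairwise definition. How held-out pairs are pooled and ranked within each cross validation fold is not specified here, so the helper `auc_score` and its toy inputs are only illustrative.

```python
import numpy as np

def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]
    neg = scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative label")
    # Count positive/negative pairs ranked correctly; ties count as one half.
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: 1 = known interaction, 0 = unknown pair.
example_auc = auc_score([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
print(example_auc)  # 0.75
```

The pairwise form is quadratic in the number of pairs, which is acceptable for a matrix of 104 by 74 entries; a rank-based formula would be preferable for much larger datasets.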

Parameter analysis for $\gamma'_v$
In this section, we analyze the parameter $\gamma'_v$. Because the effect of $\gamma'_p$ is similar to that of $\gamma'_v$, we set $\gamma'_p = \gamma'_v$. Using only the sequence similarity for receptors, Table 3 shows the 10CV prediction performance for the value set (0.25, 0.5, 1, 2, 4) of $\gamma'_v$. We can see from Table 3 that our method obtains the best 10CV prediction performance when $\gamma'_v$ is set to 2, but the AUC value at $\gamma'_v = 2$ is only slightly better than the AUC value at $\gamma'_v = 1$. Therefore, we keep $\gamma'_v = 1$ as the default value, based on the previous successful studies and the 10CV results.
Table 2 The LOOCV prediction performance for values of α ranging from 0 to 1.0 in increments of 0.1; the best result is in bold face

Case studies
In order to further evaluate the prediction performance of IILLS in applications, we analyze the ability of our method to discover new virus-receptor interactions. The extracted human virus-receptor interactions are used as the benchmark dataset. Table 4 shows the validation results of the top 10 virus-receptor interactions predicted by IILLS. We can see from Table 4 that 5 of the 10 predicted interactions are validated by previous studies. C-type lectin domain family 4 member M (CLEC4M, also called L-SIGN or CD209L) is equipped with a carbohydrate recognition domain (CRD) that mediates the recognition of fucose and high-mannose glycans in a Ca2+-dependent manner. These carbohydrate structures can be found in multiple pathogens, such as Lassa virus and Ebola virus, among others [32,33]. CD209 is also a known receptor of SARS-CoV, human coronaviruses and 229E, although the disease caused by SARS-CoV differs from the diseases caused by the known human coronaviruses and 229E [34]. CD209 (also called DC-SIGN) is related to CLEC4M and is a C-type lectin involved in both innate and adaptive immunity. These lectins are known to bind to multiple pathogens and to function as cellular receptors for various viruses, such as Dengue virus [35]. Rift Valley fever virus (RVFV) uses L-SIGN to infect cells that express the lectin ectopically [32,36]. Phleboviruses such as Uukuniemi virus (UUKV) can also exploit L-SIGN for infection [32,36].

Discussion
With the development of high-throughput sequencing technology and microbiology, many studies have shown that microbes have key impacts on human health and disease. Furthermore, viruses are an important part of the human microbiome and are also the direct cause of infectious diseases, such as those caused by Sabia virus. Receptor binding is the first step in the viral infection of host cells. Therefore, in order to systematically understand the mechanisms of virus-receptor interaction and to improve the diagnosis and treatment of infectious diseases, effective methods for identifying new virus-receptor interactions need to be developed.

Conclusion
In this study, we develop a computational method (IILLS) to predict human virus-receptor interactions from the known virus-receptor interactions and the amino acid sequences of receptors. First, IILLS computes the virus similarity by the GIP kernel. We then calculate the receptor GIP kernel similarity and the receptor sequence similarity. Based on the experimental results, the sequence similarity is used as the final receptor similarity. IILLS uses the Laplacian regularized Least Square (LapRLS) model to predict potential virus-receptor interactions, and it further improves the prediction performance by adding an initial interaction scores step for new viruses and receptors. In terms of AUC with 10CV and LOOCV, IILLS achieves better prediction performance than the three competing methods. The case studies also show that IILLS can effectively predict virus-receptor interactions, and it may help control viral infectious diseases in the future.
However, IILLS still has some limitations. First, the virus similarity is calculated only by the GIP kernel from the known virus-receptor interactions; more relevant biological information, such as sequence information, should be considered. Second, other methods of integrating the receptor similarities should be considered in the future. Finally, other recent matrix factorization methods, such as DNRLMF-MDA [37], DRRS [38], SIMCLDA [39] and BNNR [40], should also be considered. We would therefore like to develop a more effective method for predicting virus-receptor interactions by addressing these limitations in the future.