Predicting virus-host association by Kernelized logistic matrix factorization and similarity network fusion

Background Viruses are closely related to bacteria and human diseases. It is of great significance to predict associations between viruses and hosts for understanding the dynamics and complex functional networks in microbial community. With the rapid development of the metagenomics sequencing, some methods based on sequence similarity and genomic homology have been used to predict associations between viruses and hosts. However, the known virus-host association network was ignored in these methods. Results We proposed a kernelized logistic matrix factorization with integrating different information to predict potential virus-host associations on the heterogeneous network (ILMF-VH) which is constructed by connecting a virus network with a host network based on known virus-host associations. The virus network is constructed based on oligonucleotide frequency measurement, and the host network is constructed by integrating oligonucleotide frequency similarity and Gaussian interaction profile kernel similarity through similarity network fusion. The host prediction accuracy of our method is better than other methods. In addition, case studies show that the host of crAssphage predicted by ILMF-VH is consistent with presumed host in previous studies, and another potential host Escherichia coli is also predicted. Conclusions The proposed model is an effective computational tool for predicting interactions between viruses and hosts effectively, and it has great potential for discovering novel hosts of viruses.

well as robust growth of target host strains [8], which is usually difficult to achieve in experiments. The isolation method based on culturing bacteria is inefficient to identify viruses, and it only identifies relatively fewer viruses. Nowadays, discoveries of unknown viruses have been greatly accelerated by metagenomic shotgun sequencing, but unlike viral isolation, viral sequences assembled from metagenomics usually fails to directly obtain hosts infected by them. For example, crAssphage is a highly abundant human enterovirus, which may play an important role in the human intestinal tract, but the cultivation of crAssphage in the laboratory is still not achievable, so its hosts and biological function has not yet been identified [9]. As more and more metagenomic sequencing datasets are available, it is urgent to propose effective culture-free methods to identify new viruses and their hosts.
Recognizing hosts infected by viruses is important for understanding dynamics of viruses and their effects on microbial communities. Recently, some computational methods have been used to infer associations between viruses and hosts. Edwards [10] et al. introduced three types of virus-host association prediction methods, including sequence homology [6,11,12], abundance profile co-occurrence [13] and sequence composition [14][15][16]. As for virus-host association prediction methods based on sequence homology, homologies between new viruses and potential hosts are limited, because they depend on whether hosts of the query virus exist in the host genome database. Abundance profile method is based on co-variation, but significant co-variation does not necessarily represent real interaction. Because there is usually a time delay in dynamic interactions between viruses and hosts, many interactions depending on timescale sampling may not be detected. Sequence composition is based on codon usage or short pairs of nucleotides (k-mers) shared by viruses and hosts to predict which hosts the virus infects. Ahlgren et al. proposed 11 measurements of oligonucleotide frequency (ONF) such as d Ã 2 to calculate k-mers distances between viruses and hosts [17]. This method achieves good results in host prediction accuracy at the genus level, but less than 40% at the species level. In addition, previous human microbial community studies relied on independent bacterial and viral communities, i.e. they were divided into two separate network communities [2,5], which could not capture complex dynamics of virus-host interactions.
In this paper, we propose a logistic matrix factorization algorithm based on integrating multi-information on the heterogeneous network to predict potential virus-host associations (ILMF-VH). The main differences from previous studies are that our proposed method combines information of three networks to form a virus-host heterogeneous network and applies similar network fusion (SNF) to integrate multiple host information for constructing the host-host similarity network. We used the benchmark data of viral and bacterial genomes in NCBI, and verified that ILMF-VH obtained best performance compared with recent five network-based methods under five-fold cross validation. Moreover, the host prediction accuracy is 63.66% which is 24.66 and 13.29% higher than two recently proposed virus-host association prediction methods respectively, and it is 0.49% higher than our previous approach [18]. In addition, the host of crAssphage inferred by our algorithm includes putative host Bacteroides obtained from previous studies [9,19], and another potential host Escherichia coli is also suggested. Because previous studies have shown that Escherichia coli is associated with human intestinal diseases, such as diarrhea [20], our research indicates that crAssphage may be closely related to these diseases, and this proves that our approach is effective in predicting novel virus-host associations.

Data sets
We used the data adopted by Ahlgren et al. which collected accession numbers and taxonomies of 1427 viruses and 31,986 hosts. For the initial analysis, we selected a subset including 352 viruses whose hosts were at strain level [17]. In addition, we downloaded the benchmark datasets provided by Edwards et al. including accession numbers and taxonomies of 820 viruses and 2699 hosts [21]. Based on accession numbers of viruses and hosts, we have written scripts to obtain their whole genome sequences from NCBI. In terms of each virus, their known virus-host associations are obtained through the 'isolate host = 'or 'host = 'fields in the viral annotation file. The genome of crAssphage in the human intestinal metagenomic is downloaded from NCBI and the accession number is JQ995537.1 [19].

Methods
As for our model, the virus set and host set are represented by V ¼ fv 1 ; v 2 ; …; v N v g and H ¼ fh 1 ; h 2 ; …; h N h g, where N v and N h represent the number of viruses and hosts, respectively. The associations between viruses and hosts are defined as an adjacency matrix Y ∈R N v ÂN h , if a virus v i is known to be associated with a host h j , then y ij is set to 1; otherwise, y ij is set to 0. In terms of elements in the adjacency matrix Y, the negative and positive interactions between viruses and hosts are represented by 0 and 1, respectively. In this work, firstly, we define a set of viruses which are positively related to hosts as V þ ¼ fv i j P N v i¼1 y ij > 0; ∀1≤ i ≤ N v g , then a set of viruses which are negatively related to hosts is defined as V − = V\V + . Next, a set of hosts which are positively related to viruses is defined as H þ ¼ fh j j P N h j¼1 y ij > 0; ∀1 ≤ j ≤N h g, and a set of hosts which are negatively related to viruses is defined as H − = H\H + . Finally, similarities between viruses are calculated by oligonucleotide frequency (ONF) measures and expressed by S v ∈R N v ÂN v ; similarities between hosts are calculated by integrating ONF measures and Gaussian interaction profile (GIP) kernel similarity based on SNF model, and expressed by S h ∈R N h ÂN h .

Oligonucleotide frequency measures for viruses and hosts
Recently, dissimilarity measurements based on k-mer frequencies have been applied to infer relationships between genomic sequences [17]. Here, based on the hypothesis that similar viruses or hosts share similar k-mer patterns, we calculated k-mer similarities between viral genomic sequences to measure correlations between viruses. Similarly, k-mer similarities between hosts' genomic sequences are calculated to measure correlations between hosts. According to previous research [17], d Ã 2 [22] has a good performance in calculating k-mer similarity and k is set to 6, so we calculate the distance between k-mer frequency vectors of each pair of viruses or hosts. Finally, the virus-virus similarity matrix S v and the host-host similarity matrix S(onf) h can be obtained.

Gaussian interaction profile kernel similarity for hosts
Zou et al. [23] calculated the GIP kernel similarity between microbes based on the known disease-microbe association matrix and achieved good results. Apart from sequence similarities of hosts, based on the assumption that similar hosts exhibit similar patterns with viruses, we apply GIP kernel similarity to measure associations between hosts. There are two steps to calculate GIP kernel similarity. First, the interaction profile IP(h i ) of host h i is the i-th column of the adjacency matrix Y, which is a binary relationship vector representing associations between a host h i and each virus. The GIP kernel similarity between host h i and h j is calculated from their interaction profiles and defined as [24]: This is a kernel that represents the similarities between hosts. These kernels are called Gaussian kernels. The parameter γ h is used to control the kernel bandwidth and defined as: Here, N h is the number of hosts. According to the previous study [25], we simply set r 0 h to 1.

Integrated similarity for hosts
The associations between hosts are measured by calculating ONF measures and GIP kernel similarity between hosts, respectively. Here, we introduce similar network fusion (SNF) [26] to integrate two host similarity networks. The SNF includes following three main steps. First, the edge weights of each host similarity network are represented by a N h × N h matrix S h , respectively. Then, as for each similarity network, a normalized weight matrix p can be obtained by the following formula [26]: Here S(i, j) is the matrix element of S h . Then, k nearest neighbor (KNN) is used to measure the local relationship as follows: N i represents the number of neighbors in the host. This method filters out low-similar edges.
Let P (v) and KNN (v) represent similar matrices of the above two hosts, respectively. The process of SNF is an iterative update of similarity matrices, which corresponds to each data type as follows [26]: This step updates the matrix P (v) when m parallel exchange diffusion processes are generated on m host networks. In this paper, we have two types of host similar matrices, so m is set to 2. The final similarity matrix that integrates all data types is defined as follows: Construction of heterogeneous networks

Kernelized logistic matrix factorization
We developed a kernelized logistic matrix factorization algorithm based on network similarity fusion for predicting virus-host associations, and the flowchart of ILMF-VH model is shown in the Fig. 1. First, the binary matrix Y is decomposed into W ∈R N v Âk and ∈R N h Âk , so viruses and hosts are mapped to the shared potential lowdimensional space. Seq(v i , h j ) represents the ONF similarity between each pair of virus and host, and we integrate this sequence similarity information into the associated probability p ij , which represents association probability of virus-host pair (v i , h j ) and is defined as the logistic function: It is hypothesized that known relationships between viruses and hosts provide useful information for virus-host association prediction. Current importance weighting methods have been proven to be effective for personalized recommendations and drug-target interaction predictions [27,28]. We apply the weight constant c to control the level of importance between each known and unknown associations. According to previous studies, c is set to 5. The conditional probability of Y is defined as: In this work, we also use the neighborhood regularization method to regularize the logistic matrix factorization algorithm [28]. The nearest neighbors of virus v i and host h i are defined as N(v i ) ∈ V\v i and N(h j ) ∈ H\h i , N(v i ) and N(h i ) represent the K 1 neighbors of the virus v i or the host h i , respectively. K 1 is set to 5 according to the experiment. The neighborhood information of viruses and hosts is represented by the adjacency matrices A and B, respectively. In terms of matrix A, if virus v m ∈ N(v i ), a im ¼ s v im , otherwise a im = 0; in terms of Fig. 1 The flow chart of ILMF-VH model matrix B, if host h n ∈ N(h j ), b jn ¼ s h jn , otherwise b jn = 0 . The main purpose of virus-host association prediction is to minimize distances between v i /h i and nearest neighbors N(v i )/N(h i ). We should try to minimize the following objective formula: Where tr(•) is the trace of the matrix, Our goal is to find the minimum of the following objective functions: Where , σ v and σ h are expressed as the variance of Gaussian distribution of viruses and hosts, respectively. ‖•‖ F represents the Frobenius norm of the matrix, and W and H are randomly initialized using a Gaussian distribution with a mean of 0 and standard deviation of 1 ffi ffi r p . We use the AdaGrad algorithm [29] to solve the optimization problem of Eq. (11).
When learning vectors W and H, vectors of the negative virus group or host group are learned only based on negative associations in the training process. However, some unknown virus-host associations may exist potential correlations. Based on previous studies, we replaced the vector of a negative virus/host with a linear combination of its neighbors in the positive set [28]. Here, we build K 2 nearest neighbor sets for each virus and host separately and K 2 is set to 5, according to the experimental study. We use N + (v i )/ N + (h j ) to express K 2 nearest neighbors of v i ∈ V − /h j ∈ H − in V + /H + . Therefore, w i and h j in Eq. (7) are corrected to: Evaluation metrics Based on the heterogeneous network constructed by the above method, we compare the AUC [30] and AUPR [31] of ILMF-VH and recent five network-based algorithms by five times five-fold cross-validation to evaluate their performances. Then, based on previous studies [10,17], we evaluated our virus-host association prediction methods by host prediction accuracy on a benchmark dataset including 820 viruses genomes. The host prediction accuracy refers to the percentage of the virus which is predicted to have the same host taxonomy level as known hosts of the query virus.

Performance evaluation of different based-network methods
In order to assess the performance of our model, we trained datasets including 352 viruses and 71 hosts to obtain model parameters and tested our model on benchmark datasets including 820 viruses and 2699 hosts. In addition, we compare ILMF-VH model with five recently proposed network-based methods (LMFH-VH [18], NetLapRLS [29], KBMF2K [32], BLM -NII [33], CMF [34]) through five-fold cross-validation in the dataset containing 352 viruses. In each round of five-fold cross-validation, one-fifth of the virus-host associations are set to test data, and corresponding elements in the adjacency matrix Y are set to 0, the other four subsets are used as training data. It should be noted that in each round of five-fold cross-validation experiment, when virus-host relationships are set to 0, the Y matrix has been changed, so each time we need to recalculate GIP kernel similarities between hosts, and then kernel similarities can be fused with ONF similarities of hosts by applying SNF model to obtain updated host-host similarities. In addition, according to previous studies [28,[32][33][34][35], the range of parameter settings for each method is shown in the Table 1. Here, we use a random search strategy [36] for each model to select optimal parameters. Table 2 shows the AUC values and AUPR values obtained by six methods in the data sets including 352 viruses. The results showed that ILMF-VH achieved the best performance and AUC value and AUPR value are 0.9202 and 0.6243, respectively. This result demonstrates the effectiveness of our model in virus-host association prediction.

Sensitivity analysis of parameter values
As seen in Additional file 1: Figure S1- Figure S4, these figures show AUPR values obtained by ILMF-VH model corresponding to different parameter settings. We also tested effects of different K value (the number of neighbors of KNN) of SNF model on AUPR values (Additional file 1: Figure S5). So, we mainly analyze five parameters of ILMF-VH and the number of neighbors K of SNF model.
More specificity, we analyse the change trend of AUPR values with different factorization factor k used for matrix factorization. As shown in Additional file 1: Figure S1, the optimal value of k is 100 and average AUPR value of ILMF-VH is 0.6305 under five-fold cross validation. In addition, we also study impacts of regularization parameters α and β used for neighborhood smoothing in the prediction procedure. Additional file 1: Figure S2 shows the change trend of AUPR values under different α and β. The optimal values of α and β are 0.0625 and 0.25, respectively. When α > 0.0625 and β > 0.25, corresponding AUPR values begin to decrease. These results emphasize that neighbor regularization has a certain impact on the virus-host prediction model. Moreover, we also analyse effects of λ on the prediction procedure. Here, λ ¼ λ v ¼ 1 , σ v and σ h represent the variance of Gaussian distribution of viruses and hosts, respectively. As shown in Additional file 1: Figure S3, the AUPR value becomes larger gradually with the increase of λ, and when λ equals 2, AUPR reaches optimal value. Additional file 1: Figure S4 shows the variation trend of AUPR when learning rate parameters γ is set to different values. When γ equals 0.25, AUPR takes the optimal value; when γ increases, the AUPR value begins to decrease, so γ is set to 0.25. Furthermore, we also analyzed influences of different neighbor parameter K of SNF model on AUPR values. As shown in Additional file 1: Figure S5, the AUPR value reaches the optimal value when K is set to 5; when K increases again, the AUPR value begins to decrease, so the optimal value of K is 5.

Comparison of ILMF-VH and previous virus-host prediction studies
In this work, we apply the ILMF-VH method to the benchmark dataset including 820 viruses and 2699 complete bacterial genomes. First, we calculate scores between each virus and candidate hosts. The higher the predicted score, the more likely the virus is infected by the host. Here, the highest ranked host is identified as the predicted result of the given virus, and if the predicted host is the same as known host of the given virus at the species level, the predicted host is considered as a correct one. Figure 2 shows the host prediction accuracy of four types of methods include abundance profile cooccurrence, sequence homology, sequence composition, and network-based. The result shown that ILMF-VH achieved the highest host prediction accuracy (58.90%) compared with other three types of methods.
In order to further improve the host prediction accuracy, we apply a consensus method [17] to our method. We believe that the most frequent host species in the top n predicted hosts of a virus can be classified as the host taxon of the given virus. The prediction accuracy is highest at n = 5, therefore, we selected the most frequent classification among the top 5 hosts as the host taxon of the query virus. As shown in Fig. 2, when a consensus strategy is applied to our model, the host prediction accuracy can be increased to 63.66%, which is 24.66%, 13.29 and 0.49% higher than three proposed virus-host prediction methods [10,17,18], respectively.
As for the general situation, when a new virus lacks host information, we can use the ILMF-VH method to predict its potential hosts. First, we constructed a virushost network based on known virus-host associations; then GIP kernel similarities between hosts based on known virus-host associations can be calculated, and these GIP kernel similarities and ONF similarities of hosts are integrated through SNF model, so the host similarity network can be constructed. At the same time, we can calculate ONF similarities of whole genome sequences between the new virus and other viruses in the   [17] predicted potential hosts of crAssphage based on sequence similarities between crAssphage and candidates hosts; WANG et al. [9] used the Markov random field integration network to predict potential hosts of crAssphage. They all suggested that bacteria belonging to Bacteroidetes are the host of crAssphage. According to previous study [19], crAssphage is a virus that is widely found in the human gut genome, but we know very little about its biological significance and hosts of crAssphage, due to the difficulty of culturing crAssphage. Different methods have been proposed to predict Fig. 2 The host prediction accuracy of four types of methods for benchmark datasets including 820 viruses and 2699 hosts hosts of the given virus, our information integration algorithm validates the host of crAssphage which was found in previous studies and also predicts another potential host Escherichia coli. As for each virus, the candidate host is ordered according to predicted association scores obtained by ILMF-VH algorithm. In this paper, we assume that if the known candidate host of a virus v j is h i , another new host h k at the same taxon level as the host h i may be a potential host of the virus v j . At the same time, the higher the predicted score of the candidate host h k , the more likely it is to have a potential correlation with the query virus. In the case study, we added the whole genome sequence of crAssphage to the similarity network containing 820 viruses, that is, similarities between the crAssphage and 820 virus sequences can be calculated based on ONF measurement, thus a new virus-virus similarity network can be constructed. Apart from that, we also add links between crAssphage and 2699 hosts to the virus-host network to build a new virus-host association network. Based on ONF measurement and known associations between viruses and hosts, we used our algorithm to obtain predicted scores between crAssphage and 2699 candidate hosts.
Our approach supports the previous conclusion that candidate hosts belonging to Bacteroides are potential hosts of crAssphage. As for the top 50 predicted hosts of crAssphage, there were three hosts belonging to phylum Bacteroidetes and were ranked 4th, 44th and 50th: Cardinium endosymbiont of Encarsia pergandiella, Weeksella virosa, and Tannerella forsythia. Our prediction model also inferred that Escherichia coli belonging to phylum Proteobacteria is the potential host of crAssphage, and Escherichia coli ranks highest among 2699 hosts. A possible explanation for its highest predicted score is that the alignment-free similarity score between crAssphage and Escherichia coli is 0.6568, which is higher than the average score (0.6096) between the virus and all candidate hosts. Therefore, sequence alignment is an important part of extracting virus-host association signal, and it provides an efficient contribution indicator for this prediction result.
Our algorithm predicted host of crAssphage based on the metagenomics sequencing data, which is identical to the putative host at phylum level in previous studies. In addition, another potential host Escherichia coli is also inferred. Recent studies have shown that [20] most Escherichia coli strains grow harmlessly in the gut and rarely cause diseases in healthy individuals. However, many pathogenic strains can cause diarrhea or extraintestinal disease in both healthy and immunocompromised individuals. Our experimental results suggest that crAssphage may play an important role in these diseases. In general, our algorithmic model is effective in predicting potential hosts of new viruses.

Conclusion and outlook
Viral infection usually results in changes in the ecosystem function of host cells. Virus-host association studies can reveal complex virus-host network interactions and are important for understanding of microorganism diversity. Despite this, although some methods for virus-host association prediction have been proposed, the host prediction accuracy at the species level cannot be achieved very well and these methods need to be improved.
We present an effective method ILMF-VH for predicting virus-host associations. We performed the best performance compared to recent five network-based methods by five-fold cross-validation. Secondly, we compared the host prediction accuracy with several recently proposed virushost association prediction methods [10,17]. Our method obtained the highest host prediction accuracy (63.66%). Finally, we analyzed our method's abilities to predict potential hosts for the given virus. As for the crAssphage, our predicted hosts are corresponding to previous studies, and predicted another host Escherichia coli is associated with intestinal diseases. In general, it is important to study virushost associations. Our research not only has potential to predict hosts of viruses, but also can be applied to predict virus-host associations.
Although some results have been achieved so far, there are still some problems that can be further studied in the future. First, the biology characteristics of viruses and hosts are abundant and varied. Apart from whole genome sequences, protein, amino acid, abundance profile and other related information might also have contribution to the prediction model. It needs further research to study what information provides a reliable basis for virus-host association prediction, and extracting appropriate characteristics of viruses and hosts are important for predicting results. Here, we integrate genome sequence information and known virus-host associations. In the future researches, we will consider adding different information sources of viruses and hosts to analyze impacts of different characteristics on prediction results.