Alignment of biological networks by integer linear programming: virus-host protein-protein interaction networks

Background The alignment of protein-protein interaction networks was recently formulated as an integer quadratic programming problem, along with a linearization that can be solved by integer linear programming software tools. However, the resulting integer linear program has a huge number of variables and constraints, rendering it of no practical use. Results We present a compact integer linear programming reformulation of the protein-protein interaction network alignment problem, which can be solved using state-of-the-art mathematical modeling and integer linear programming software tools, along with empirical results showing that small biological networks, such as virus-host protein-protein interaction networks, can be aligned in a reasonable amount of time on a personal computer and the resulting alignments are structurally coherent and biologically meaningful. Conclusions The implementation of the integer linear programming reformulation using current mathematical modeling and integer linear programming software tools provided biologically meaningful alignments of virus-host protein-protein interaction networks.


Background
Many meaningful questions in molecular biology have been successfully answered through their translation into alignment problems for different mathematical structures. From simple structures, such as genomic or proteomic sequences, to richer structures, such as complex networks or whole biological systems, pairwise and multiple alignment have been used to compare these structures, inferring features and new biological relations from their alignment.
Several methods and software tools have been already introduced for the alignment of biological networks, including protein-protein interaction networks, metabolic pathways, and gene regulatory networks. They are addressed to solve interesting biological questions, such as the inference of protein-protein interactions and protein functions, the regulation of biological processes, and the metabolic capabilities of microorganisms. The alignment and analysis of protein-protein interaction networks has become a key ingredient to obtain functional orthologs and discover protein-protein interactions and their associated functions, as well as evolutionary conserved assembly pathways of protein complexes.
In the general network setting, and hence also in the particular case of protein-protein interaction networks, an alignment between two networks is an injective, but possibly partial, mapping from the set of nodes in one network (the source network) to the set of nodes in the other network (the target network). When the mapping defining the alignment has as domain the whole set of nodes of the source network, the alignment becomes an embedding of the source into the target network. Since biological networks are large networks, with hundreds to thousands of nodes and edges, most of the techniques developed for their alignment [1][2][3][4][5] are heuristic, and the alignments obtained by applying these techniques to the same biological networks often differ considerably and do not provide a true, consensus alignment. On the other hand, an exact solution to the network alignment problem can be obtained by an integer quadratic programming formulation [6], but its linearization [7] has a huge number of binary variables and constraints.
In this paper, we present a compact integer linear programming reformulation of the protein-protein interaction network alignment problem, which can be solved using state-of-the-art mathematical modeling and integer linear programming software tools. We also present empirical results showing that small biological networks, such as the virus-host protein-protein interaction networks in the STRING Viruses database [8], can be aligned in a reasonable amount of time on a personal computer and the resulting alignments are structurally coherent and biologically meaningful.

Results
The STRING Viruses database [8] contains sequences for 9,660,620 viral and host proteins and protein-protein interaction data for 230 viruses and 3 hosts: Homo sapiens (11,437,065 interactions), Saccharomyces cerevisiae (2,007,278 interactions), and Escherichia coli (1,166,900 interactions). We downloaded from STRING Viruses the virus-host protein-protein interaction data for Homo sapiens and all the protein sequence data (see Availability of data and materials).
Each of the protein-protein interactions is annotated with a combined score, an indicator of confidence ranging from 0 to 1, where a combined score of 0.5 indicates that roughly every second interaction might be a false positive. Therefore, we discarded any protein-protein interaction with a combined score under 0.510, keeping only interactions in the last 10% of the distribution of combined scores. Also, host-host protein-protein interactions were discarded since the alignment purpose in this experiment is the relation between the proteins of a virus and its host. Nevertheless, in a more general setting host-host protein-protein interactions can be also considered. Further, we discarded the smallest networks, those with 64 or less interactions, and focused our alignment experiments on the remaining 25 largest networks, which have between 56 and 735 viral and host proteins and between 65 and 957 virus-host protein-protein interactions. These networks are listed in Table S1, ranked by the number of interactions, and the viral proteins involved in them are listed in Tables S2-S26 [see Additional file 1].
Then, we aligned all possible pairs of these 25 networks. Due to the symmetry of the network alignment problem, we actually aligned 25 · 24/2 = 300 pairs of networks. We performed each of these 300 alignments using the compact integer linear programming formulation presented in this paper with AMPL version 2018.10.22 [9] and Gurobi Optimizer version 8.1.0, and also with some of the most popular protein-protein interaction network alignment tools: PINALOG [1], SPINAL [2], HubAlign [3], L-GRAAL [4], and AligNet [5], using default parameters for all of them. All of the alignments where computed using a personal computer with an Intel Core i7-8550U quad-core processor at 1.80 GHz and 32 GB of memory running Ubuntu 18.04 LTS. We took either the optimal solution or the best feasible solution that could be computed within a solver time limit of 60 minutes.
While our method is aimed at finding exact solutions to the problem of aligning protein-protein interaction networks, all of the aforementioned protein-protein interaction network alignment tools use an heuristic algorithm to obtain the final alignment. The general idea behind all of these alignment tools, is to define a node similarity measure that combine the similarity of the protein sequences with some network structure similarity. Then, the actual alignment is obtained based on node similarity. More precisely, in PINALOG, network structures are "communities, " which are scored and aligned based on a node similarity score that combines protein sequence similarity and GO terms, and aligned communities are extended to obtain the network alignment. In SPINAL, node similarity score is defined based on sequence similarity of nodes and of their neighbours, this score is iterated until some stability is reached, and the network alignment is obtained by a greedy, seed-and-extend approach. In HubAlign, network structures are "hubs" and "bottlenecks, " a score or weight is assigned to each node and edge of a network using an iterative minimum-degree heuristic algorithm to measure the topological and functional importance of a node (that is, the likelihood of being a hub or bottleneck), and the network alignment is obtained by choosing protein pairs with high alignment score by, again, a greedy, seed-and-extend approach. In L-GRAAL, node similarity is measured by considering 2-node to 4-node graphlet (connected subgraph) degree similarity and, based on node similarity, seeds are obtained using Integer Linear Programming (ILP) and Lagrangian relaxation and then extended to a network alignment using a greedy heuristic algorithm. Last, but not least, in AligNet, an overlapping clustering for every node in every network is computed. Then, all clusters pairs are aligned and scored based on sequence similarity of proteins and their neighbours. Finally, the clusters in one network are aligned with the clusters in the other network using the Hungarian algorithm, and local network alignments are first obtained as solutions to weighted bipartite hypergraph problem instances and then extended to a global network alignment.
In order to evaluate the alignments we considered the work reported in [10,11] where several topological coherence and biological coherence measures were proposed for the comparison of protein-protein interaction network alignment methods and tools. It is shown in [11] that there is a strong correlation among the various topological coherence measures and also among the various biological coherence measures, while there is a weak correlation between the topological coherence measures and the biological coherence measures. Therefore, we have chosen one topological coherence measure and one biological coherence measure for assessing the quality of virus-host protein-protein interaction network alignments: EC, the edge correctness score, defined as the ratio of the interactions that are preserved by the alignment over the total number of interactions [10], and the sequence similarity score, a measure of functional coherence (FC), defined as the normalized sum of the sequence similarities (correlation of amino acid composition [12]) of the aligned proteins.
In Fig. 1 we show the boxplot of the edge correctness scores obtained for every alignment with the six alignment methods and tools considered in this study. We can observe there that L-GRAAL and our ILP method obtained the best results, with mean EC scores of 0.83 and 0.78, respectively. As far as biological coherence goes, in Fig. 2 we observe that PINALOG, being the alignment tool with the lowest EC scores, is the tool that reached the highest FC scores, with a mean FC score of 0.92, followed by our ILP method, with a mean FC score of 0.90. We can also observe that, as stated in [10,11], some alignment methods and tools obtain either high EC scores but low FC scores, or low EC scores but high FC scores. As a measure of a balance between topological and biological coherence, we took the mean of the EC and FC scores, whose boxplot we show in Fig. 3. We can  observe that, again, L-GRAAL and our ILP method obtained the best scores, followed by HubAlign and AligNet. Table 1 shows the mean edge correctness and sequence similarity scores for all the six alignment approaches considered in this text, for the 300 pairs of virus-host proteinprotein interaction networks from the STRING Viruses database described above. Moreover, Table 2 illustrates the trade-off between the conservation of interactions and the alignment of similar proteins, for a subset of 45 pairs of virus-host protein-protein interaction networks, as a function of a parameter λ ∈[ 0, 1] that controls the balance between  protein similarity scores and protein-protein interaction weights in our model (see the "Methods" section for more details). With λ = 0, we obtain an alignment with the highest topological coherence but with the lowest biological coherence, while λ = 1 produces an alignment with the lowest topological coherence but with the highest biological coherence.
In order to measure the amount of variation or dispersion of the EC and FC scores used to evaluate the topological and biological coherence of the alignments, we introduced some noise to the virus-host protein-protein interaction networks by randomly adding and deleting 5% of the interactions. We computed 10,000 alignments between 100 random perturbations of the Marburg marburgvirus (taxid 11269) and 100 random perturbations of the Zaire ebolavirus (taxid 186538) virus-host protein-protein interaction networks. The mean and standard deviation of the EC and FC scores are 0.955413 and 0.012193 for the EC score and 0.991356 and 0.003269 for the FC score. That is, small perturbations of the virus-host protein-protein interaction networks produced small variations of the EC and FC scores.
While these results are based on a particular view of sequence similarity as correlation of amino acid composition, as mentioned above, it is possible to use the protein-protein interaction network alignment method with any measure of sequence similarity, including alignment-free measures such as the Euclidean distance between k-mer frequencies [12] and also alignment-based measures such as a normalized global alignment score. Table 3 shows the mean edge correctness and sequence similarity scores for different measures of sequence similarity (Euclidean distance between k-mer frequencies, for k between 1 and 4, and normalized global alignment score), for a subset of 45 pairs of virus-host proteinprotein interaction networks with λ = 0.5. The higher the value of k, the lower the mean sequence similarity score, with normalized global alignment giving the lowest score, but the mean edge correctness score is unaffected by the choice of sequence similarity measure.

Discussion
To reinforce the statement that the integer linear programming formulation of the network alignment problem provides biologically meaningful alignments of virus-host protein-protein interaction networks, we analyzed the alignments in term of agreement on virus taxonomy. Namely, we considered the taxonomy classification of the virus in every virus-host protein-protein interaction network and assumed that the highest alignment scores must be obtained when considering closely related viruses. Indeed, Table 4 shows that the best alignment (measured by the mean value of edge correctness and sequence similarity) for each of the 25 virus-host protein-protein interaction networks in Table 5, correspond to a network in the same Baltimore class [13] for 21 of the 25 best alignments. Table 5 also shows the taxonomy classification of the 25 viruses considered in our study. As a matter of fact, in class I (double-stranded DNA viruses), the Alphapapillomavirus 9 network is best aligned with the Human alphaherpesvirus 2 network; the Human betaherpesvirus 5 network is best aligned with the Human betaherpesvirus 6B network; the Human alphaherpesvirus 3 network is best aligned with the Human alphaherpesvirus 1 network; the Human alphaherpesvirus 1 and Human alphaherpesvirus 2 networks are best aligned with each other; and the Human betaherpesvirus 6A and Human betaherpesvirus 6B networks are also best aligned with each other.
In class IV (positive-sense single-stranded RNA viruses), the Human coronavirus 229E network is best aligned with the SARS-related coronavirus network. In class V (negativesense single-stranded RNA viruses), the Influenza A virus network is best aligned with the Marburg marburgvirus network; the Human orthopneumovirus, Mumps rubulavirus, and Hendra henipavirus networks are best aligned with the Human metapneumovirus network; the Marburg marburgvirus and Zaire ebolavirus networks are best aligned with each other; and the Human metapneumovirus and Measles morbillivirus networks are also best aligned with each other.

Conclusions
The compact integer linear programming reformulation of the protein-protein interaction network alignment problem can also be applied to similar alignment problems on graph-based representations of molecular structures, such as metabolic pathways and gene regulatory networks. The application to virus-host protein-protein interaction networks provided high scored alignments in both network topology and biological coherence, which constitutes evidence that the alignments obtained with this approach are biologically meaningful.
The alignment of virus-host protein-protein interaction networks may contribute to discover the effect of viral infection to their host. New databases with virus information have been created in the last years from the analysis of new metagenomics data [14][15][16]. Table 5 The virus-host protein-protein interaction networks for human viruses in the STRING Viruses database considered in our study However, one of the problems to deal with nowadays is to understand the mechanism by which viruses infect a host and to determine the viral proteins interacting with host proteins that are responsible for such an infection. New sets of Gene Ontology classes have been developed that are applicable to microbes and their hosts, improving both coverage and quality in this area of the Gene Ontology [17]. Therefore, the alignment of virus-host protein-protein interactions can reveal a useful tool to predict new functions of viral proteins related to host infection, as it has been proven to be useful for inferring new protein functions.

Methods
The following notation will be used in this section. A protein-protein interaction network is represented by means of an undirected graph G = (V , E), where each node v ∈ V corresponds to a protein and each edge {u, v} ∈ E corresponds to an interaction between the proteins represented by the nodes u ∈ V and v ∈ V . Let G = (V , E) and G = V , E be the two protein-protein interaction networks to be aligned, let V = {v 1 , . . . , v m } and V = v 1 , . . . , v n be their respective sets of nodes and A = a ij and B = (b k ) be their respective adjacency matrices. Let S = (s ik ) be a similarity matrix between the nodes of the two networks, with each s ik the similarity score of v i ∈ V and v k ∈ V .
An alignment of G and G can be represented by a binary matrix X = (x ik ), where x ik = 1 if the i-th node, v i , of the first network is aligned with the k-th node, v k , of the second network, and x ik = 0 otherwise. Then, the protein-protein interaction network alignment problem has the following simple integer quadratic programming (IQP) formulation in terms of the binary variables x ik [6].
In this problem's objective function, λ is a parameter, with 0 λ 1, that controls the balance between protein similarity scores and protein-protein interaction weights: only node scores are considered when λ = 1, and only edge scores are taken into account when λ = 0. Constraints (Q2) and (Q3) enforce that, for every i = 1, . . . , m, at most one x ik is equal to 1 (that is, that the matrix X = (x ik ) defines a, possibly partial, mapping) and that, for every k = 1, . . . , n, at most one x ik is equal to 1 (that is, that the mapping defined by X is injective) and hence that the matrix X defines an alignment between the networks G and G , given by v i , v k ∈ V × V : x ik = 1 .
The objective function above comes from the PathBLAST [18] idea that protein-protein network alignment be based on a log-probability-like criterion, with matching terms corresponding to both proteins and interactions [6]. The first sum in the objective function, represents the number of edges that are preserved by the alignment; that is, of pairs of This quadratic formulation has a linearization with O m 2 n 2 binary variables and constraints [7], of no practical use with current integer linear programming software tools such as IBM ILOG CPLEX Optimization Studio or Gurobi Optimizer. We present next a much more compact linearization, with only O(mn) binary variables, integer variables, and constraints, along the lines of a well-known linearization of the quadratic assignment problem [19][20][21].
In addition to the binary variables x ik above, we introduce an integer variable y ik for each v i ∈ V and each v k ∈ V . Each such new variable y ik is intended to represent Indeed, if x i 0 k 0 = 1 and y i 0 k 0 < m j=1 n =1 a i 0 j b k 0 x j 0 , then the pair of matrices (x ik ) , ŷ ik withŷ ik = y ik except forŷ i 0 k 0 = m j=1 n =1 a i 0 j b k 0 x j , still satisfies constraints (L1) to (L5) and it has a larger value of the objective function in Problem ILP, which would contradict the assumption that ((x ik ), (y ik )) is a solution to problem ILP.
This implies that, when λ < 1, if ((x ik ), (y ik )) is a solution to problem ILP, then a ij b k x j for every i = 1, . . . , m and k = 1, . . . , n. Since the constraints on (x ik ) are the same in both problems, we conclude that (x ik ) is a solution to problem IQP.
Therefore, a solution ((x ik ), (y ik )) of the linear reformulation ILP of the alignment problem defines an alignment between the mapped proteins in the two networks via v i , v k ∈ V × V : x ik = 1 .