Exploring biological interaction networks with tailored weighted quasi-bicliques

Background Biological networks provide fundamental insights into the functional characterization of genes and their products, the characterization of DNA-protein interactions, the identification of regulatory mechanisms, and other biological tasks. Due to the experimental and biological complexity, their computational exploitation faces many algorithmic challenges. Results We introduce novel weighted quasi-biclique problems to identify functional modules in biological networks when represented by bipartite graphs. In difference to previous quasi-biclique problems, we include biological interaction levels by using edge-weighted quasi-bicliques. While we prove that our problems are NP-hard, we also describe IP formulations to compute exact solutions for moderately sized networks. Conclusions We verify the effectiveness of our IP solutions using both simulation and empirical data. The simulation shows high quasi-biclique recall rates, and the empirical data corroborate the abilities of our weighted quasi-bicliques in extracting features and recovering missing interactions from biological networks.


Introduction
Cellular processes such as transcription, replication, metabolic catalyses, or the transport of substances are carried out by molecules that are associated in functional modules, and are often realized as physical interaction within protein complexes. These physical interactions form molecular networks. Analyzing these networks is a thriving field (e.g. [1]) and has extensive implications for a host of issues in biology, pharmacology [1], and medicine [2]. Capturing the modularities of molecular networks accurately will gain insights into cellular processes and gene function. Yet, before such modularities can be reliably inferred, challenging computational problems have to be overcome.
These computational problems typically result from incomplete and error-prone networks that largely obfuscate the reliable identification of modules [3,4]. Often, molecular interactions can not be measured to the accuracy of the genome sequences, leaving some guesswork in identifying modularities correctly. Some molecular interactions are highly transient and can only be measured indirectly, while others withstand denaturing agents. Functional interaction does not even have to be realized via physical interactions. Thus, computational methods for capturing modularity can not directly rely on presence or absence of interactions in molecular networks and need to be able to cope with substantial error rates.
Unweighted quasi-biclique approaches have been used in the past to identify modularity in protein interaction networks when presented as bipartite graphs that are spanned between different features of proteins, e.g. binding sites and domain content function [3,5]. An example is depicted in Figure 1. While these approaches aim to solve NP-hard problems using heuristics, they were able to identify some highly interactive protein complexes [6,7].
Unweighted quasi-biclique approaches are sensitive to the quantitative uncertainties intrinsic to molecular networks. Interactions are only represented by an unweighted edge in the bipartite graph if they are above some userspecified threshold. Therefore, unweighted quasi-biclique approaches are prone to disregard many of the invaluable interactions that are below the threshold, and treat all interactions above the threshold the same. Further, some interactions may or may not be represented due to some seemingly insignificant error in the measurement. Consequently, many crucial modules may be concealed and remain undetected by using unweighted quasi-biclique approaches.
Here we introduce novel weighted quasi-biclique problems by using bipartite graphs where edges are weighted by the level of the corresponding interactions, e.g., Figure 1. We show that these problems are, similar to their unweighted versions, NP-hard. However, in practice, exact Integer Programming (IP) formulations can efficiently tackle many NP-hard real-world problems [8]. Therefore, we describe exact Integer Programming (IP) solutions for our weighted quasi-biclique problems. Furthermore our IP solutions exploit the sparseness of molecular networks when represented as bipartite graphs. This allows to verify the ability of our IP solutions using a moderately sized genetic network [9], and simulation studies. In addition our IP solutions can provide exact results for instances of the unweighted quasibiclique approaches that were previously not available.

Related work
Maximal bicliques in biological networks are selfcontained elements characterizing functional modules.
In protein interaction networks they manifest as interactive protein complexes (e.g., [3,7]). Bipartite graphs are graphs whose vertices can be bi-partitioned into sets X and Y such that each edge is incident to vertices in X and Y. A biclique is a subgraph of a bipartite graph where every vertex in one partition is connected to every vertex in the other partition by an edge. A biclique is maximal if it is not properly contained in any other biclique, and it is maximum if no other maximal bicliques have larger total edge weights. The problem of finding maximum bicliques is well studied in the literature of graph theory and is known to be NP-complete [10] and effective heuristics for this problem have been described and used in various applications [11]. However, bicliques are too stringent for identifying modules in real world networks [12]. For example a module is not identified through a biclique that is incomplete by one single edge. Quasi-bicliques are partially incomplete bicliques that overcome this limitation. They allow a specified maximum number of edges to be missed in order to form a biclique [13]. While quasi-bicliques are less stringent for the identification of modules, they might contain genes that are interacting with only a few or none other of the genes. Such situations occur when the missing edges are not homogeneously distributed throughout. The δ-quasi-bicliques (δ-QB) [14] allow to control the distribution of missing edges by setting lower bounds, parametrized by δ, on the minimum number of incident edges to vertices in each of the vertex sets in a δ-QB.

Our contributions
Here we define a "weighted" version of δ-QB, called a,bweighted quasi-bicliques (a,b-WQB), to improve on the identification of modules in molecular networks by using the interaction levels between genes. Thus, a,b-WQB's may be better applicable to handle noisy data sets as they distribute the overall missing information across the vertices of the quasi-biclique as shown in our simulations. We define two versions of a,b-WQB's, each in terms of the amount of weight the quasi-biclique is allowed to miss. The two different versions of weighted quasi-bicliques provide flexibility in choosing the missing weight. For the first version, called the percentage version, we define the missing weight in terms of the percentage of number of vertices in the quasi biclique. While for the second version, called the constant version, the missing weight is defined as a constant. The need for constant version weighted quasi-bicliques arises from the fact that, for certain applications, the weight allowed to be missed does not depend on any of the graph parameters and is a constant. Finding a maximum a,b-WQB in a given edge weighted bipartite graph is NP-hard, since it is a generalization of the NP-hard problem to find δ-QBs in unweighted bipartite graphs [3]. We also introduce a "query" version of the maximum a, b-WQB problem that allows biologists to focus their analyzes on genes of their particular interest. Given a network and specific genes from this network, called query, the query problem is to find a maximum weighted a,b-WQB that includes the query. We prove this problem to be NP-hard. While the maximum a,b-WQB problem and its query version are NP-hard, we provide exact IP solutions to solve both problems. By reducing the number of required variables and exploiting the sparseness of bipartite graphs representing molecular networks, our solutions solve moderate-sized instances. This allows us to verify the applicability of a, b-WQB by analyzing the most complete data set of genetic interactions available for the Eukaryotic model organism Saccharomyces cerevisiae. Our results not only extract meaningful yet unexpected quasi-bicliques under functional classes, but also suggest higher possibilities of recovering missing interactions not presented in the input. A preliminary version of this work appeared in ISBRA 2011 [15]. In this paper, we extend the usage of the parameters a and b such that the edge weight threshold can be either a ratio of the a,b-WQB size or a constant. The time complexity and the application of this extended a,b-WQB are both discussed.

Results and discussion
Before analyzing our findings in biological networks, we first introduce formal definitions of weighted-quasi bicliques (WQB) and then discuss the results of applying the WQB as a data mining tool.

Preliminaries
A bipartite graph, denoted by (U + V, E), is a graph whose vertex set can be partitioned into the sets U and V such that its edge set E consists only of edges {u, v} where u U and v V (U and V are independent sets). Let G := (U + V, E) be a bipartite graph. The graph G is called complete if for any two vertices u U and v V there is an edge {u, v} E. A biclique in G is a pair (U', V') that induces a complete bipartite subgraph in G, where U' ⊆ U and V' ⊆ V. Since any subgraph induced by a biclique is a complete bipartite graph, we use the two terms interchangeably. A pair (U, V) includes another pair (U', V') if U' ⊆ U and V' ⊆ V. In such case, we also say that the pair (U', V') is included in (U, V). A pair (U, V ) is non-empty if both U and V are non-empty. A weighted bipartite graph, denoted by (U + V, E, ω), is a complete bipartite graph (U + V, E) with a weight function ω : Maximum weighted quasi-biclique (a,b-WQB) problem and satisfies the two properties: Definition 2 (a,b-WQB C ). Let G := (U + V, E, ω) and a,b [0, ∞). A constant version a,b-weighted quasi-biclique, denoted as a,b-WQB C , in G is a nonempty pair (U', V') that is included in (U, V ) and satisfies the two properties: In either version, the weight of an a,b-WQB is defined as the sum of all its edge weights.
, if its weight is at least as much as the weight of any other a,b-WQB P(C) in G for given values of a and b Problem 1 (a,b-WQB P(C) ). Instance: A weighted bipartite graph G := (U + V, E, ω), and values a, b [0, 1]([0, ∞)).
Find: A maximum weighted a,b-WQB P(C) in G. Note that, we use the same notation (a,b-WQB P(C) ) for a, b-weighted quasi-biclique and maximum weighted a, b-weighted quasi-biclique problem of either version. The context in which we use the notation will make the difference clear. Also, when we just say a,b-WQB, the context will make clear if we are referring to percentage version or constant version or both.

Query problem
A common requirement in the analysis of networks is to provide the environment of a certain group of genes, which translates into finding the maximum weighted a, b-WQB P(C) which includes a specific set of vertices. We call this the query problem and is defined as follows.
, and a pair (P, Q) included in (U, V).
Find: The a,b-WQB P(C) which includes (P, Q) and has a weight greater than or equal to the weight of any a,b-WQB P(C) which includes (P, Q).

Experiment results
Finding appropriate values for a and b is a critical part in an application. The a and b values allow the user to custom-tailor the search based on the weight distribution and the expected findings of the particular application. Typically, quasi-bicliques of different a and b values have to be analyzed by the domain experts in order to optimize the findings. We use simulated data sets to explore the problem of finding right a and b values. We then use our IP model to explore a,b-WQB's in a real world application, a recent data set of functional groups formed in genetic interactions. The filtered data set, compared to the raw data, served to investigate the role of non-existing edges in the input bipartite graph. While mathematically equivalent in the modeling step, a non-edge in a experimentally generated network represents either a true non-interaction or a false-negative. Assuming the input consists of meaningful features, our preliminary results show that a,b-WQB's may recover missing edges with potentially higher weights better than δ-quasi-bicliques.

Simulations
As part of simulation studies, we try to retrieve a known maximum weighted quasi biclique from a weighted bipartite graph using both versions of a,b-WQB's. In each simulation experiment we do the following. The pair (U, V) represents the vertices of a weighted bipartite graph G. We randomly choose U' ⊆ U and V' ⊆ V as vertices of the known quasi biclique in G. The sizes of both U' and V' are set the same and is picked randomly, but is limited to a specific percentage of the total vertices on each side. Random edges between the vertices of U' and V' in G are introduced according to a pre-determined edge density d.
The edges between vertices of U\U' and V\V' of G are also generated randomly according to a pre-determined density d'. The edge weights of the known quasi-biclique (U', V') are determined by a Gaussian distribution with a mean mn and standard deviation dev. Weights of the edges of G not present in the quasi-biclique are also determined by a Gaussian distribution with a lower mean mn' and standard deviation dev'. We now retrieve maximum weighted a,b-WQB P and a,b-WQB C from G by using specific values a and b calculated as described below.
For retrieving a,b-WQB P , the values a and b are chosen in two different ways. As part of the simulation we evaluate the performance of both methods. The first method sets both a and b to the mean of the weights of the edges of the quasi biclique. In the second method, a and b are calculated as given below: Similarly, for retrieving a,b-WQB C , the values a and b are calculated as given below.
The ILP models of the corresponding a,b-WQB problems are generated in Python, and solved in Gurobi 4 [16] on a PC with an Intel Core2 Quad 2.4 GHz CPU with 8 GB memory.
For the evaluation, let (U", V") represent the maximum weighted a,b-WQB returned by the ILP model. The percentage of the vertices of U' in U" is called the recall of U'. Similarly, the percentage of vertices of V' in V" is called the recall of V'. The recalls of U' and V' are our evaluation criteria. For a specific graph sizes experiments were run by varying the values mn and mn'. The values dev and dev' were set 0.1. The densities d and d' are set to 0.8 and 0.2. The experiments were run for graphs of size 16 × 16, 32 × 32 and 40 × 40. Each experiment is repeated thrice and the average number of recalled vertices is calculated. The recall of the experiments can be seen in Table 1. For each graph size the first two columns represent the recall values for percentage version a,b-WQB's and the third column represents the recall value for constant version a,b-WQB. As the difference between the means increases, so does the average recall. The second method of choosing a and b for percentage version a,b-WQB yields a consistently higher recall.

Genetic interaction networks
A comprehensive set of genetic interaction and functional annotation published recently by Costanzo et al. [9] is amongst the best single data sources for weighted biological networks. The aim of our application is to identify the maximum weighted quasi-bicliques consisting of genes in different functional classes in the Costanzo dataset. The absolute values of the interaction score ε, are used as the edge weights. Values greater than 1 are rounded off to 1. Any gene present in both the functional classes A and B is represented as different vertices in the partitions U A and V B and the edge between those vertices is given a weight of 1. We build LP models of both a,b-WQB versions for the bipartite graphs to identify the maximum weighted quasi-bicliques.

Biological interpretation and examples
Genes with high degree and strong links dominate the results. In several instances, the quasi-bicliques are trivial in the sense that only one gene is present in U', and it is linked to more than 20 genes in V'. Such quasibicliques are maximal by definition but provide limited insight. A minimum of m = 2 genes per subset was included as an additional constraint to the LP model. It might be sensible to implement such restrictions in the application in general.
We observed the following with the maximum weighted a,b-WQB P 's in the data sets. Given the low overall weight, the a,b-WQB P 's generated with the parameters a = b = 0.1 were the most revealing. Though we found many interesting quasi-bicliques in the 153 bipartite graphs, we only present a couple of them here. A notable latent set that was obtained identified genes involved in amino acid biosynthesis (SER2, THR4, HOM6, URE2) and was found to form a 4 × 10 maximum weighted quasi-biclique with genes coding for proteins of the translation machinery, elongation factors in particular (ELP2, ELP3, ELP4, ELP6, STP1, YPL102C, DEG1, RPL35A, IKI3, RPP1A). These connections, to our knowledge, are not described and one might speculate that this is a way how translation is coupled to the amino-acid biosynthesis. In some cases the maximum weighted quasi-biclique is centered around the genes that are annotated in more than one functional class as they provide strong weights. These genes are involved in mitochondrial to nucleus-signaling and are examples where our approach recovers known facts. Using the query approach, it is possible to obtain quasi-bicliques around a gene set of interest quickly and extend the approach proteins of interest.
Maximum weighted a,b-WQB C 's generated from the data sets with parameters a = b = 5 reveal the following. Genetic interaction networks allow to study proteincoding genes as well as genes that might only code for RNAs. A noteworthy example was discovered in the comparison of genes involved in nuclear transport and those with an unknown bioprocess revealed proteins that are part of the nuclear pore transport (POM34, NUP60, NUP157, THP2 and POM152). They interact with a number of genes that are lined up on chromosome 15 (YML033W, YML034C (SRC1) and YML035C-A) as well as and YDR431W. Most of these genes they interact with are annotated as "dubious" in the current version of the Yeast Genome Database SGD [17]. SRC1 overlaps with another uncharacterized gene YML034C-A. It would be possible that locus codes of a long RNA are involved in nuclear transport.

Recovering missing edges
The published data sets have edges under different thresholds removed. To sample such missing edges, we calculate the average weight of all the edges removed in the 153 bipartite graphs (generated above), and the calculated average weight is 0.0522.
For each of the 153 maximum weighted quasi-bicliques of either version, the missing edges induced by the quasibicliques are then identified, and the average missing edge weight e of each is calculated. The average missing edge weight e is always greater than 0.0522. In other words, we observe that a missing edge in a maximum weighted quasi-biclique has a higher expected weight  Recall of vertices in the simulation. For every experiment, the value in the A U (A V ) column represents the average recall percentage of U'(V'). The results are different from the preliminary version due to randomness in the simulation experiments.
than the weight of a randomly selected missing edge. This happens when the a and b values are chosen to derive a,b-WQB's which are more dense in terms of weight.
(2) D05/M2: δ-QB with δ = 0.5; m = 2. Comparing the averages of e from A005/M2 to A05/ M2 (please see Table 2), we see a steady increase. Since a and b can be seen as expected edge weights of the resulting QB, the changing in e shows that QB's identify sub-graphs of expected edge weights. However, we do not see a similar pattern in the constant versions. Overall, in this particular experiment data set, the removed edge weights are at most 0.16, hence e can never approach closely to the parameter a no matter how lenient the parameters are.

Time complexity
Here we prove the NP-hardness of the a,b-WQB P(C) problem by a reduction from the maximum edge biclique problem. Note that the query P(c) problem is a generalization of a,b-WQB P(C) problem and hence, is also NP-hard. Lemma 1. The a,b-WQB P(C) problem is NP-hard. Proof. Given a bipartite graph G := (U + V, E) and an integer k, the maximum edge biclique problem asks if G contains a biclique with atleast k edges. The maximum edge biclique problem is NP-complete [10]. Let G' E or is set to 0 otherwise. Note that, there is a biclique with k edges in G if and only if the maximum weighted a,b-WQB P in G' has a weight of atleast k when a and b are set to 1. Similarly, there is a biclique with k edges in G if and only if the maximum weighted a,b-WQB C in G' has a weight of atleast k when a and b are set to 0. Therefore, the a,b-WQB P(C) problem is NP-hard.
We now prove that checking for the existence of a percentage a,b-WQB in a bipartite graph is NP-complete. Note that, checking the existence of a constant version a,b-WQB in a bipartite graph can be done in polynomial time. For rest of the section we only refer to percentage version a,b-WQB's.
Problem 4 (One sided existence). The series of reductions to prove the hardness of the existence problem are as follows. We first reduce the partition problem, which is NP-complete [18], to the modified existence problem. The modified existence problem is then reduced to the one sided existence problem. The one sided existence problem reduces to the existence problem.
Lemma 2. The modified existence problem is NPcomplete.
Proof. The proof of MO-WQB NP can be briefly described in the following.
Given a weighted bipartite graph G(U + V, E, Ω) and a pair (U', V') included in (U, V), it can be verified in polynomial time if the pair (U', V') satisfies the all the MO-WQB constraints for G. So, the modified existence problem belongs to class NP. The reduction from partition problem is as follows.
We are left to show that partition ≤ p MO-WQB. Given a finite set A, and a size s(a) Z + associated with every element a of A, the partition problem asks if A can be partitioned into two sets (A 1 , A 2 ) such that a. Construction: Let SUM be the sum of sizes of all elements in A. Build a modified weighted bipartite graph G := (U + V, E, Ω) as follows. For every element a in A there is a corresponding vertex u a in U. The set V contains two vertices v + and v -. For every vertex u a U, Ω(u a , v + ) = s(a)/(2 × SUM) and Ω(u a , v -) = -s(a)/(2 × SUM). Add an additional vertex u sum to set U. Set Ω(u sum , v + ) to -1/4 and Ω(u sum , v -) to 1/4. Note that, the weights assigned to edges of G satisfy the constraint on Ω for a modified weighted bipartite graph. b. ⇒: Let (A 1 , A 2 ) be a partition of A such that the sum of the sizes of elements in A 1 is equal to the sum of the sizes of elements in A 2 . Let U 1 = {u a : a A 1 }. The sum of weights of all edges from v + to the vertices in U 1 is equal to 1/4. Let U' = U 1 ∪ u sum . The sum of weights of all edges from v + to vertices of U' is 0. Similarly, the sum of weights of all edges from vto vertices of U' is 0. Thus, (U', V) is a MO-WQB of G.
⇐: Let (P, V) be a MO-WQB of G. The edge from vto u sum is the only positive weighted edge from vertex v -. So, P will contain vertex u ∑ .
Since Ω(v + , u sum ) is negative, set P will also contain vertices from Uu sum . The sum of the weights of edges from vto vertices in Pu sum cannot be smaller than -1/4. Similarly, the sum of the weights of edges from v + to vertices in Pu sum cannot be smaller than 1/4. So, the sum of all elements in A corresponding to the vertices in Pu sum should be equal to SUM/2. This proves that if G contains a MO-WQB, set A can be partitioned.
Hence, the modified existence problem is NPcomplete.
Lemma 3. The one sided existence problem is NPcomplete.
Proof. The proof of one sided existence NP is omitted for brevity. Next we show MO-WQB ≤ p one sided existence. We prove this problem to be NP-complete by a reduction from the modified existence problem. The reduction is as follows. a. Construction: Let G := (U + V, E, Ω) be the modified weighted bipartite graph in an instance of the modified existence problem. We build a graph G' := (U + V, E, ω) for an instance of one sided existence problem from G. Notice that the partition and vertices remain the same. If the weight of every edge in the G is non negative, set a = b = 0 and ω(u, v) = Ω(u, v) for every edge (u, v) E. Otherwise, set a and b to |x| and ω(u, v) = Ω(u, v)x for every edge (u, v) E, where x is the minimum edge weight in G. b. ⇒ and ⇐: Let (U', V) be a MO-WQB of graph G. If weights of all edges in G are non negative, the constraints for both the problems are the same. If G has negative weighted edges, the constraints of both the problems will be the same when a,b and ω for the one sided existence problem instance are set as mentioned in the construction. It can be seen that there is a MO-WQB in G if and only if there is a a,b-WQB P in the graph G' which includes the pair (∅, B).
This proves that the one sided existence problem is NP-complete.  1] be the parameters of the one sided existence problem. We build the weighted bipartite graph G = (U p + V, E, ω) for the instance of existence problem as follows. First, set G = G'. For every vertex u U, let S u denote the sum of the weights of all edges incident on u. Delete every vertex u U whose S u is less than a|V|. Let (U p + V) denote the remaining vertices, and E' represent the remaining edges in G. For the instance of the existence problem, set a = 0 is not the same set as V, the pair (U', V) is still a a,b-WQB P in G' and it includes the pair (∅, V).

IP formulations for the a,b-WQB problem
Although greedy approaches are often used in problems of a similar structure, e.g., multi-dimensional knapsack [19], δ-QB [3], in our experiments, both greedy and randomized approach did not identify solutions close enough to the exact solutions. In our experiments, simple greedy and randomized solutions yielded accuracies ranging from 60% to 95% depending on various parameters without performance guarantee. Hence we consider that it is rather important here to find exact solutions in order to demonstrate the usefulness of a,b-WQB's. Here we present integer programming (IP) formulations solving the a,b-WQB problem in exact solutions.
Due to the similarity in formulating constraints between a,b-WQB C and a,b-WQB P , we start by formulating a solution to a,b-WQB P . Our initial IP requires quadratic constraints, which are then replaced by linear constraints such that it can be solved by various optimization software packages. Our final formulation is further improved by adopting the implication rule to simplify variables involved. This improved formulation requires variables and constraints linear to the number of input edges, and thus, suits better for sparse graphs. Throughout the section, unless stated otherwise, G := (U + V, E, ω) represents a weighted bipartite graph, and G' = (U', V') represents the maximum weighted a,b-WQB of G and E' represents the edges induced by G' in G.

Quadratic programming
The integer program to find the solution G' can be formulated as follows.
Binary variables: x u , s.t. x u = 1 iff u ∈ U for each u ∈ U (1) Subject to: Maximize The quadratic terms in the constraints are necessary because, a and b thresholds apply only to vertices in U' and V'. This formulation uses variables and constraints linear to the size of input vertices, i.e., O(|U| + |V|). Since solving a quadratic program usually requires a proprietary solver, we reformulate the program so that all expressions are linear.

Converted linear programming
A standard approach to convert a quadratic program to a linear one is introducing auxiliary variables to replace the quadratic terms. Here we introduce a binary variable y uv for every edge (u, v) in G, such that, y uv = 1 if and only if x u = x v = 1, i.e., the edge (u, v) is in G'. The linear program to find the solution G' is formulated as follows.
Binary variables: Same as in (1) and (2) y uv , s.t., y uv = 1 iff x u = x v = 1 for all (u, v) ∈ E (6) Subject to: v∈V ω(u, v) · y uv ≥ α v∈V y uv for all u ∈ U (9) u∈U ω(u, v) · y uv ≥ β u∈U y uv for all v ∈ V (10) Expressions (7) and (8) state the condition that y uv = 1 if and only of x u = x v = 1. Expression (8) ensures that, for any edge whose end points (u, v) are chosen to be in G', y uv is set to 1. Due to the use of y uv variables, this formulation requires O(|U||V|) variables and constraints.

Improved linear programming
Observe that constraint (7) becomes trivial if y uv = 0. In other words, this constraint formulates implications, e.g., for binary variables p and q, the expression p ≤ q is equivalent to p q. Expanding on this idea, we eliminate the requirement of variables y uv in constraints (9) and (10) in the next formulation while sharing the rest of the aforementioned linear program.
There is a variable x v for every vertex v in G. There is a variable y uv for every edge (u, v) in G whose weight is not 0. The variable y uv is set to 1 if and only if both x u and x v are set to 1. For any vertex u U (v V), the variable x u (x v ) is set to 1 if and only if vertex u (v) is in G'. Constraint (12) can also be explained as follows. If x u = 1, the constraint transforms to the second constraint in the a,b-WQB Definition. If x u = 0, constraint (12) becomes trivial. Constraint (13) can be explained in a similar manner.
Generalized formulation for a,b-WQB P and a,b-WQB C Recall that the difference between the two problems a, b-WQB P and a,b-WQB C is in the edge weight summation which we can combine as the following properties: (1) ∀u U' : ∑ v V' ω(u, v) ≥ a P |V'| -a C , and (2) ∀v V' : ∑ u U' ω(u, v) ≥ b P |U'| -b C , where (a P , b P ) and (a C , b C ) are the parameters given in a,b-WQB P and a, b-WQB C respectively. Following the same reasoning in the previous paragraphs, linear constraints (12) and (13) are now updated as the following.
Subject to: v∈V (ω(u, v) − α P )x v + α C ≥ |V|(x u − 1) for all u ∈ U (14) u∈U (ω(u, v) − β P )x u + β C ≥ |U|(x v − 1) for all v ∈ V (15) As a results, the problem instance is a a,b-WQB C problem if (a P , b P ) = (1, 1), and it is a a,b-WQB P problem if (a C , a C ) = (0, 0). Note that the formulation does not require either condition to present; it essentially defines a generalization of a,b-WQB problems when all 4 parameters are valid and non-zero.
If there are n vertices in U and m vertices in V, there will be a total of m + n + 2k constraints and m + n + k variables where k is the number of edges whose weight is not equal to 0. The above formulations can be extended to solve the query problem by adding an additional constraint x v = 1 to the formulation, for every vertex v P ∪ Q. Similar constraints also help us explore sub optimal solutions, e.g., excluding known vertices in subsequent solutions, or provide a lowerbound of required query items in the optimal solution.

Conclusions
We address noise and incompleteness in biological networks by introducing graph-theoretical optimization problems that identify variations of novel weighted quasi-bicliques. These quasi-biclique problems incorporate biological interaction levels in different analytical settings and exhibit improvements over un-weighted quasi-bicliques. To meet demands of biologists we also provide a query version of (weighted) quasi-biclique problems. We prove that our problems are NP-hard, and describe IP formulations that can tackle moderate sized problem instances. Simulations and empirical data solved by our IP formulation suggest that our weighted quasi-biclique problems are applicable to various other biological networks.
Future work will concentrate on the design of algorithms for solving large-scale instances of weighted quasi-biclique problems within guaranteed bounds. Greedy approaches may result in effective heuristics that can analyze ever-growing biological networks. A practical extension to the query problem is the development of an efficient enumeration of all maximal a,b-WQB's.