Efficient α, β-motif finder for identification of phenotype-related functional modules
© Schmidt et al; licensee BioMed Central Ltd. 2011
Received: 7 July 2011
Accepted: 11 November 2011
Published: 11 November 2011
Microbial communities in their natural environments exhibit phenotypes that can directly cause particular diseases, convert biomass or wastewater to energy, or degrade various environmental contaminants. Understanding how these communities realize specific phenotypic traits (e.g., carbon fixation, hydrogen production) is critical for addressing health, bioremediation, or bioenergy problems.
In this paper, we describe a graph-theoretical method for in silico prediction of the cellular subsystems that are related to the expression of a target phenotype. The proposed (α, β)-motif finder approach allows for identification of these phenotype-related subsystems that, in addition to metabolic subsystems, could include their regulators, sensors, transporters, and even uncharacterized proteins. By comparing dozens of genome-scale networks of functionally associated proteins, our method efficiently identifies those statistically significant functional modules that are in at least α networks of phenotype-expressing organisms but appear in no more than β networks of organisms that do not exhibit the target phenotype. It has been shown via various experiments that the enumerated modules are indeed related to phenotype-expression when tested with different target phenotypes like hydrogen production, motility, aerobic respiration, and acid-tolerance.
Thus, we have proposed a methodology that can identify potential statistically significant phenotype-related functional modules. The functional module is modeled as an (α, β)-clique, where α and β are two criteria introduced in this work. We also propose a novel network model, called the two-typed, divided network. The new network model and the criteria make the problem tractable even while very large networks are being compared. The code can be downloaded from http://www.freescience.org/cs/ABClique/
Identifying and understanding cellular subsystems (or functional modules) responsible for the expression of a phenotype can assist genetic engineers with determining which genes to introduce or modify  in order to aid (or inhibit) the phenotype expression in an organism. Identification of such subsystems (or functional modules) is often performed as a computational search for specific network structures, or network motifs, in network models of biological data [2, 3].
A type of network model, called protein functional association network, lends itself to the identification of functional modules. In protein functional association networks, a pair of vertices representing proteins is connected by an edge if the genes that encode these proteins are functionally associated, i.e, needed for the same function . Genes required for the same function may co-occur in the same operon, co-express under similar conditions, or be involved in gene fusion events . Evidence of these phenomena can be empirically observed and used to predict the functional association of two genes.
Functional modules, in which all pairs of proteins are functionally associated, can be modeled as maximal cliques in the context of protein functional association networks [5, 6]. Maximal cliques have been recognized for: (a) finding biologically more relevant protein complexes, with more than 10% improvement in their functional homogeneity when compared to clusters ; and (b) reducing the noise in the data . Among all maximal cliques, we are only interested in those that potentially help an organism express a particular phenotype, and this requires additional signals.
One signal that can be observed in biological networks is the evolutionary conservation of a functional module. A module that is phenotype-related is more likely to be conserved in phenotype-expressing organisms than phenotype-non-expressing organisms [8, 9].
With the nave approach (or the brute-force method), identifying maximal cliques that are statistically biased towards being present in the networks of phenotype-expressing organisms requires at a minimum comparision of all maximal cliques across all the organismal networks considered. For any given network, the number of maximal cliques can be exponential in terms of the network size. Thus, a multi-way comparison for such exponential spaces is impractical.
Results and Discussion
Phenotypes used to create the various networks used for experimentation
Number of Vertices
Number of Intra-Edges
Number of Inter-Edges
TCA Cycle Expression
In this section, the biological relevance of (α, β)-cliques is demonstrated by comparing the sets of genes predicted to be phenotype-related functional modules to known phenotype-related functional modules.
Network Edge Threshold Selection
Our network edge threshold selection strategy aims to optimize the method performance on a small validation set of COGs that are known to be associated with the target phenotype. This prior knowledge is derived from published literature. Starting with the most conservative value (e.g., 999) of the edge threshold for the COG-COG organismal network in STRING database, we generate a set of networks till we reach a threshold of 500 with a 50 step difference. For each network, we calculate the accuracy of identifying known phenotype-related COGs from the validation set. We select the first threshold that ensures at least 75% accuracy. Additionally, we check the distribution of the number of unique COGs obtained per a given edge threshold and find the threshold that shows the maximum change in the number of COGs from the previous threshold and then analyze a few thresholds above and below that value using the prior knowledge information. See Additional file 2 for the details of the edge threshold experiment for the motility phenotype. The prior knowledge, or the validation set, for the motility phenotype was obtained from Liu et al with p-value ≤ 0.05.
To identify phenotype-related functional modules for each of the 5 phenotypes (biohydrogen production, acid-tolerance, TCA expression, aerobic respiration, and motility), phylogenetically diverse (as much as possible) microorganisms representative of each phenotype were identified. Each phenotype was treated separately, and so the sets of organisms vary in each dataset (see Additional file 1). In some cases, for example, dark fermentative biohydrogen production and acid-tolerance, individual organisms were present across multiple datasets. Selection of organisms was based solely on the ability of each organism to express at least one phenotype discussed in the paper and not on the ability of each organism to express all phenotypes discussed in the paper. The organisms were chosen by reviewing the existing literature. The aerobic respiration and motility phenotypes were selected due to the availability of a number of completely sequenced and annotated genomes for the organisms exhibiting the phenotypes. In addition, to predict and distinguish between phenotypes that are highly similar, we selected the TCA expression phenotype.
In recent years, focus has shifted towards development of metabolically engineered organisms capable of expressing desired phenotypic traits. For the case of biohydrogen production using wastewater, this concept could be applied to create a mixed microbial community that is "ideal'' for enhancing hydrogen production. This is particularly important, since multiple phenotypes are necessary to optimize overall hydrogen yields, and not a single hydrogen-producing microorganism has been identified that is capable of expressing all these phenotypes. Thus, phenotypes important for hydrogen production using wastewater and waste materials were analyzed. Therefore, phenotypes selected for the study related to biohydrogen production using wastewater include:
Hydrogen Production: In order to understand metabolic and cellular processes involved in expression and regulation of the hydrogen production phenotype, microorganisms, representative of all three types (dark fermentation, light fermentation, and bio-photolysis) of hydrogen production are selected to identify phenotype-related functional modules.
Acid-tolerance (pH = 4.5-6.5): Acid-tolerant organisms are those capable of growing in slightly acidic and acidic conditions (pH 4-6). Similar to acidophiles, acid-tolerant organisms have developed metabolic and cellular acid tolerance response (ATR) systems to protect themselves when exposed to acid environments . For hydrogen producers, the presence of ATR systems is extremely important, particularly, with respect to acidogenesis. During acidogenesis, organic acids (e.g., butyrate and acetate) are produced, thus lowering the pH level in the medium. In solventogenic organisms, such as Clastridium acetobutylicum, the change in pH results in a metabolic shift from acidogenesis to solventogenesis. As a result, the organism will stop producing acetate and butyrate and will generate solvents (e.g., acetone and butanol). To prevent metabolic shifts and maintain conversion of glucose (or other sugar compound) to hydrogen at maximum yields, organisms need to be able to tolerate acidic pH conditions.
For acid-tolerence phenotype, we selected a subset of species, since a diverse, large set of completely sequenced acid-tolerant microorganisms was not available at the time of the study. Prior to the experiment a number of acid-tolerant organisms were identified through literature review. However, many of the organisms' genome sequences were not completely sequenced. To ensure the best predictions, a criterion for organism selection was the presence of their sequenced and annotated genomes. Unfortunately, the organisms we identified are only representative of a small group of acid-tolerant bacteria consisting of nine Firmicutes and one Proteobacteria. As such, results obtained are somewhat biased towards acid-tolerant Firmicutes.
The set of (α, β)-cliques was enumerated for three different statistically significant (p-value less than 0.005) α, β-values: (7,0) (see Additional file 3), (8,1) (see Additional file 4), and (9,2) (see Additional file 5). The enumerated (α, β)-cliques were able to identify sets of COGs that were known to be associated with hydrogen production.
A functional module identified by the (α, β)-motif finder for hydrogen production phenotype
Hydrogenase maturation factor
Hydrogenase maturation factor
Hydrogenase maturation factor
Hydrogenase maturation factor
Based on published studies of crystal structures for hydrogenase maturation proteins, the presence and coordinated interaction between the proteins are essential for synthesis of [NiFe]-hydrogenase. In this study, we found similar evidence of functional associations between HypCDEF proteins. This is shown in one of the modules identified as associated with hydrogenase. While associations between maturation proteins have been well characterized in model organisms [15, 18], detailed molecular analysis of [NiFe]-hydrogenase structures and their associated proteins has not been conducted across all phenotype-expressing organisms. Based on results obtained, it can be hypothesized that HypCDEF proteins are related to hydrogen producing organisms and will not be present in hydrogen non-producing organisms.
A functional module associated with nitrogenase formation identified by the (α, β)-motif finder
Uncharacterized NAD(FAD)-dependent dehydrogenase
Threonine dehydrogenase and related Zn-dependent dehdyrogenases
Nitrogenase molybdenum-iron protein
Although the presence of these two proteins is essential for nitrogen-fixation and biological hydrogen production, association of other genes may play an important role regulating Nif genes. Examples include proteins such as cysteine sulfinate desulfinase (COG1104; NifS) and nitrogen regulatory protein PII (COG0347; GlnK), which are involved in synthesis of the Fe-S cluster and regulation of proteins responsible for nitrogen metabolism , respectively. For both GlnK and NifS, the (α, β)-motif finder predicted associations between each COG group and Nif proteins. Specifically, we noted the association of NifH with the regulatory protein PII (GlnK). In nitrogen fixing organisms, GlnK is described as a key signal transducer in NifA in some organisms and regulatory protein in the transcription of the nitrogenase protein NifH in other organisms . In this study, the association between COG groups related to NifH and GlnK supports experimental evidence that PII proteins are involved in inactivation of nitrogenase across a number of nitrogen-fixing species. In addition, identification of this COG-COG associations suggests that PII proteins may play a vital role in hydrogen production via nitrogenase.
When a set of acid-tolerant microorganisms (Phylum Firmicutes and Proteobacteria) were used by the (α, β)-motif finder (see Additional file 6), the two main mechanisms, lysine/arginine decarboxylase and arginine deaminase, associated with acid-tolerance were not identified. However, eleven types of COGs associated with amino acid transporters were identified, suggesting that amino acid transport is highly related to the set of phenotype-expressing organisms in this study. Within microorganisms, amino acid transporters can participate in a number of metabolic and cellular processes, such as energy metabolism and protein synthesis. In organisms exposed to acid stress, decarboxylation of the two amino acids, lysine and arginine, is reported as two mechanisms for neutralization of internal pH [13, 24, 25]. During the neutralization process via arginine decarboxylation, antiporters responsible for replacing the argmatine generated from arginine with another arginine brought in from the surrounding environment . In another system, arginine deaminase, ammonia is generated to help protect against acid stress . From our knowledge of these systems, production or uptake of amino acids by microorganisms may play an important role in regulating intracellular pH levels.
A Functional module identified by the α, β-motif finder for acid tolerance phenotype
Amino acid transporter
COGs associated with TCA cycle enzymes and whether any of them are part of an (α, β)-clique enumerated in either two-typed, divided network
Associated COG IDs
COG0473, COG0538, COG2838, COG4579
COG0022, COG0508, COG0567, COG1071, COG1249
COG0479, COG1053, COG2009, COG2142
To determine if this functional module could be modeled as an (α, β)-clique, the set of (α, β)-cliques with significant α, β-values were enumerated in both networks. In the TvrT_800 network, this was defined as any α, β-value that had a p-value less than 0.005. In the AvAn_999, this was determined as any α, β-value that had a p-value less than 1 * 10-5.
The set of COGs modeled by an (α, β)-clique enumerated in the TvrT_800 network.
Succinyl-CoA synthetase, beta subunit
Succinyl-CoA synthetase, alpha subunit
Succinate dehydrogenase/fumarate reductase, Fe-S protein subunit
2-oxoglutarate dehydrogenase complex, dehydrogenase (E1) component
Succinate dehydrogenase/fumarate reductase, flavoprotein subunit
Pyruvate/2-oxoglutarate dehydrogenase complex, dihydrolipoamide dehydrogenase (E3) component
The motility phenotype was examined to observe how both the (α, β)-clique model and the enumeration algorithm performed when a large number of organisms were used. The two-typed, divided network used in this experiment (MvnM_999) was constructed using 85 functional association networks from motile organisms and 56 functional association networks from non-motile organisms. The (α, β)-cliques were enumerated in the phylogenetic functional association network for all α, β-values that had a p-value of less than 1 * 10-5. This resulted in 95 unique (α, β)-cliques (see Additional file 9).
The enumerated set of (α, β)-cliques contained 38 (α, β)-cliques that consisted entirely of flagella-related proteins. Additionally, five (α, β)-cliques were enumerated that contained chemotaxis-related proteins. Flagella are cellular structures that enable the movement of microorganisms, while chemotaxis is a chemical process that determines how the microorganism moves in response to its environment . These immediate results suggest that at least some of the remaining 52 (α, β)-cliques include COGs that are motility-related.
Effects of Phylogenetic Diversity
In addition to introducing the phylogenetic diversity scoring function S p () (see Methods section), we performed an experimental study to test the robustness of our scoring method. For the hydrogen production phenotype, we grouped our input set of 17 organisms into 13 groups based on their genus. For each experiment, we randomly selected one organism from each genus, then chose a random subset (of size 8-13) of these organisms, and finally ran the (α, β)-motif finder algorithm on each subset. We then compared the results of these experiments and found that based on both the module's significance score and the module's phylogenetic score, the top 10 most significant modules identified remained unchanged. The organisms used in the experiments and the results for various experiments are avaliable in Additional files 10 and 11, respectively.
Effect of (α, β)-criterion on the Number of Maximal Cliques
The results in this section describe the effect that different values of α and β have on the number of maximal cliques output when the algorithm is run on a given two-typed, divided network. The (α, β)-cliques corresponding to all possible values of α and β from two-typed, divided network of two phenotypes, hydrogen production and TCA cycle expression were enumerated (see Additional files 12 and 13) to analyze the effect.
Effect of Search Bounds on Algorithm Runtime
In this paper, we proposed a methodology that can identify evolutionarily conserved functional modules that are likely associated with the expression of a target phenotype. The structure of the functional modules considered in the paper is a maximal clique. The task of enumerating the maximal cliques in a network and comparing maximal cliques across multiple networks to identify those biased towards phenotype-expressing organisms can be computationally intractable for large networks. In this paper, we have introduced a novel type of network model called the two-typed, divided network and the (α, β)-criterion that together help move the problem to a tractable space with the modules known to be related to the target phenotypes considered. Additional files 14, 15, 16, 17, 18, 19 and 20 provide additional information about the (α, β)-cliques identified for the five target phenotypes.
Build the two-typed, divided network model which involves combining the individual organismal networks into the single network.
Identify statistically significant α and β values using hypergeometric test.
Run the (α, β)-motif finder algorithm to identify the phenotype-related functional modules.
Assign a phylogenetic diversity score to each functional module identified in the previous step.
Run Hendrix et al  algorithm using the functional modules identified and each of the organismal networks used in the (α - β)-motif finder as input, to account for the missing edges in the organismal networks.
Two-typed, Divided Network Model
In a divided network N = (D, V, E), the set of vertices V is completely divided amongst non-intersecting subsets of vertices, called divisions, which are represented by the set of subsets D. Edges in a divided network can be either inter-organismal edges or intra-organismal edges. If an edge connects two vertices that belong to different divisions, then the edge is considered an inter-organismal edge. If an edge connects two vertices that belong to the same organism, then the edge is considered an intra-organismal edge.
a 1 and a 2 are functionally associated in A (Figure 2.1);
b 1 and b 2 are functionally associated in B (Figure 2.1);
a 1 and b 1 are orthologous proteins (Figure 2.2);
a 2 and b 2 are orthologous proteins (Figure 2.2);
The concept of a organism-type is introduced to differentiate the organisms according to the expression of the phenotype. Each organism will be associated with a type t. A phenotype-expressing organism will belong to the positive type (t = p) and a phenotype-non-expressing organism will belong to the negative type (t = n). Due to the binary nature of phenotype expression and non-expression, the network model is called a two-typed, divided network, N t . The set of organisms that have the same type t are referred to as a type-set of t, T(t). For a given clique S from N t and type t, a type-count function c(S, t) returns the number of organisms in the type-set T(t) that contain at least one vertex in S. This condition is sufficient to check if the conserved clique is present in an organism.
A clique in the two-typed, divided network (Figure 2.4) is made up of a set of cliques (e.g, the clique in Figure 2.4 is made up of two cliques of three vertices each: one clique from Organism A and the other clique from Organism B), where each individual organism contributes exactly one clique or nothing. Thus, the type-count function only needs to check if an organism is participating in the current two-typed, divided network clique or not. In other words, if any one vertex from the organism is present in a two-typed, divided network clique, then the organism contributes a clique. Thus, checking if at least one vertex of the clique is in the organism is sufficient.
The (α, β)-criterion is introduced to reduce the number of maximal cliques that need to be compared across organismal networks and to identify those modules that are highly biased towards phenotype-expressing organisms. A subnetwork S satisfies the α-criterion, if the type count value c(S, p) ≥ α. A subnetwork S satisfies the β-criteria, if the type count value c(S, n) ≤ β. A subnetwork S satisfies the (α, β)-criterion if and only if it satisfies both the α-criterion and the β-criterion.
Parameter Threshold Selection
Thresholds for statistically significant values of α and β parameters are identified using the hypergeometric test. Given a population P (i.e., the set of all input organisms), let S be the set of successes (i.e., all the phenotype expressing organisms) in the population, let X be a sample from P and Y be the set of successes in the sample X. The null hypothesis states that a random draw of X organisms from P with |S| successes will yield |Y| successes. The alternate hypothesis states that a random draw of X organisms from P with |S| successes will not yield |Y| successes.
If P is the set of all input organisms, S is the set of all phenotype-expressing organisms, and S - P is the set of all phenotype non-expressing organisms, then hypergeometric test provides the p-value for all α values from 1 to |S| in combination with all β values from 0 to |S - P|. For each test, the α + β becomes the size of the sample X and the α value becomes Y, while S and P are the same as described earlier. We choose all (α, β) pairs with α ≥ β that have p-value less than or equal to a specified threshold (e.g., p-value is less than 0.005).
Additionally, for each β value, we choose only the smallest α value such that the associated pair (α, β) satisfies the given threshold for its p-value. This way only non-redundant pairs are considered and (α, β)-cliques are identified if the maximal clique exists in at least α of positive networks. Thus, if a maximal clique exists for α0, then it is also part of the resulting set for α0 + 1 for the same value of β. Thus, the smallest α value will contain all of the possible cliques for the given β value.
In the hydrogen production phenotype experiment, |P| = 17, |S| = 9 and |S - p| = 8 and a p-value threshold of 0.005, we identify all those non-redundant (α, β) pairs with p-value ≤ 0.005. Additional file 21 contains these results for the hydrogen production phenotype.
Note that this significance test does not take into account the phylogenetic similarity or phylogenetic diversity of the organisms in the input set and in the selected sample. It only provides the candidate (α, β) pairs used for running the algorithm. The additional scoring of the identified (α, β)-clique based on phylogenetic diversity of the organisms is further applied, as described below.
Phenotype-related Functional Module Identification
If it is assumed that a functional module in a single organism will form a maximal clique in the organism's functional association network, then a functional module that is evolutionarily conserved will form a maximal clique in the two-typed, divided network (Figure 2). A functional module is present in two different organisms if there is a one-to-one mapping between the proteins of the module in one organism and the proteins of the module in the other organism. The one-to-one mapping is obtained by using orthology (Figure 2.2). By virtue of the existance of intra-organism edges and the addition of the inter-organismal edges (based on the two criteria discussed earlier), a conserved functional module, will form a maximal clique in the two-typed, divided network (Figure 2.4).
We hypothesize that conserved functional modules that are related to phenotype-expression will likely form maximal cliques and satisfy the (α, β)-criterion in the two-typed, divided network. Given a large enough α-value, the α-criterion will ensure that the maximal clique is present in enough phenotype-expressing organisms to likely represent a phenotype-related functional module. The β-criterion will ensure that the functional module is less likely related to the phenotype-non-expression. The maximal cliques that satisfy the (α, β)-criteria are referred to as (α, β)-cliques.
The main distinction to make here is that not every maximal clique in the two-typed, divided network is an (α, β)-clique. A maximal clique is also an (α, β)-clique if at least α of the phenotype-expressing organisms and no more than β phenotype non-expressing organisms participate in the maximal clique identified from the two-typed, divided network. Hence, as we enumerate the maximal cliques, we keep track of the the number of phenotype expressing and phenotype non-expressing organisms that participate in this clique and only output those that satisfy both the α and β criteria. A second distinction to make is that a maximal clique of the two-typed, divided network is not the relevant motif; instead, it is a set of maximal cliques (motifs), at most one from each organism. These maximal cliques are equivalent to each other and are representatives of a conserved phenotype-related motif.
α, β-motif Finder
Input: CLIQUE - The set of vertices in the current clique
Input: CAND - The set of vertices that can be added to the set to form a new clique
Input: NOT - The set of vertices that, if added to the set, would form redundant cliques
1 if CAND is empty then
2 if NOT is empty AND c(CLIQUE, p) ≥ α AND c(CLIQUE, n) ≤ β then
3 Output CLIQUE;
5 current = First vertex in CAND;
6 while current ≠ null do
7 if c(CLIQUE, n) ≤ β then
8 NEWCLIQUE = CLIQUE + current;
9 forall vertices v in CAND do
10 if isConnected(v, current) then
11 NEWCAND + = v;
12 forall vertices u in NOT do
13 if isConnected(u, current) then
14 NEWNOT + = u;
15 if c(NEWCLIQUE ∪ NEWCAND, p) ≥ α then
16 Call α, β-motif Finder(NEWCLIQUE, NEWCAND, NEWNOT);
17 CAND = CAND - current;
18 NOT = NOT + current;
19 if CAND has more vertices then
20 current = Next vertex in CAND;
22 current = null;
Algorithm 1: (α, β)-motif Finder Algorithm
The pseudocode of a recursive enumeration algorithm for (α, β)-motif finder is given in Algorithm 1. The algorithm is a modification of the maximal clique enumeration algorithm of Bron and Kerbosch , which will be referred to as the BK algorithm (see Additional file 22). There are two key modifications introduced to generate these phenotype-related modules. The first is the introduction of the type-count function c(S, t) dicussed in methods section. The function is used in Line 2 to restrict the output of the algorithm to (α, β)-cliques.
The second modification is the introduction of two bounds to reduce the search space and to make the algorithm more efficient. As the BK algorithm traverses the search tree it keeps track of three arrays. First is the CLIQUE array, which contains all of the vertices already in the clique. Second is the CAND array that keeps track of all of the vertices that could be added to CLIQUE to form a new larger clique. The third is the NOT array, which contains vertices that, if added to CLIQUE would only identify maximal cliques that have previously been enumerated. The first bound on Line 7 reduces the search space using the value of c(CLIQUE, n). If c(CLIQUE, n) is greater than β, then there is no reason to continue expanding the subtree of the current search node as the β criterion has already been violated. The second bound on Line 15 reduces the search space using the value of c(CLIQUE, p). For any given search node, the maximum c(CLIQUE, n) value that could exist for any child of the search node is c(NEWCLIQUE ∪ NEWCAND, p). If this value is less than α, then there is no reason for continuing the subtree expansion for the current search node.
Statistical Significance of the (α, β)-cliques
Statistical significance of the (α, β)-clique being related to the phenotype-expression is quantified by calculating its bias towards the phenotype expressing organismal networks. This significance is calculated using hypergeometric probability, where the population is the total set of organisms in the experiment, the number of successes in the population is the number of phenotype expressing organisms in the experiment, the sample size is the total number of organisms that the (α, β)-clique is found in, and the successes in the sample is the number of phenotype expressing organisms the (α, β)-clique is found in.
Phylogenetic Diversity Score of the (α, β)-cliques
where δ(i, j) is the phylogenetic distance between organism i and organism j. The function S p () rewards the module if the phylogenetic distance between the phenotype-expressing organisms is large, i.e, the module is conserved among a set of diverse phenotype-expressing organisms. The phylogenetic distance information was downloaded from the IMG (Integrated Microbial Genomes) database . Section ''Effects of Phylogenetic Diversity'' and Additional file 23 provide the results of the phylogenetic score calculation and the robustness of the scoring function.
Handling Missing Data in Input Graphs
With real data, it is quite possible to have missing information, for instance, unobserved protein interactions or missed orthologies, that would lead to missing edges and/or vertices in the network model of this data. Handling possibly missing edges becomes imporant, because the phenotype-related functional modules that we mine are modeled as maximal cliques. Hence, missed information in any one organism could lead to identification of cliques that do not model the complete phenotype-related sub-system.
To address this drawback, we introduce a post-processing step utilizing our quasi-clique mining algorithm with knowledge priors (DENSE) . The algorithm takes as input a set of query vertices as knowledge priors and identifies those subgraphs that are "enriched'' by the vertices from the query set and meet a certain density (in terms of the number of edges) threshold. We supply all the identified phenotype-related modules as input to this algorithm and identify those extended functional modules that are more loosely connected, or form a quasi-clique. The density of the modules identified and the number of query vertices that will enrich the modules are controlled by the two parameters, γ and μ, respectively. The results of running the algorithm using the functional modules identified for the hydrogen production phenotype (α = 9, β = 2)and the functional association network of Clostridium acetobutylicum ATCC 824 with μ = 0.5 and γ = 0.5 can be found in Additional file 24.
Comparison of the unique COGs identified by the two algorithms (see Additional file 25) indicates that the DENSE algorithm was able to identify key hydrogen producing-related COGS missed by the (α, β)-motif finder algorithm. One example of a COG missed includes the iron only hydrogenase (COG4624). Enzyme complex associated with this COG plays a key role in the generation of biological hydrogen in microorganisms. In our initial results (see Section Hydrogen Production), the (α, β)-motif finder identified COGs involved in maturation and activation of [NiFe]-hydrogenases. While [NiFe]-hydrogenases are important for regulating removal of biological H2(g) in organisms, it is Fe-only hydrogenases that are responsible for the production of H2(g). As such, identification of motifs containing Fe-only hydrogenases is important for understanding how these genes interact with other genes and sub-networks. Other potentially important COGs included those indirectly associated with hydrogen production. These included COGs involved in the formation of acetate and Acetyl-CoA(COG0282, COG1013, COG1014). While these COGs do not lead to the direct formation of H2(g), they are involved in metabolic pathways that result in the production of H2(g). For example, COG0282 corresponds to acetate kinase, the last enzyme in acidogenesis. During the overall pathway for acidogenesis, hydrogen gas is produced. Although acetate kinase only catalyzes the conversion of acetyl-phosphate to acetate and not hydrogen, it is an indicator of acidogenesis. Hence, acetate production can be considered a phenotype-related pathway for hydrogen production.
Time and Space Complexity of the (α, β)-motif Finder
Runtime and Memory Usage in the Worst Case
Total time (sec)
Memory Usage (MB)
The memory usage by the algorithm, in the worst case, for the graphs used in the various experiments is documented in Table 7. As we discussed in , the Bron and Kerbosh algorithm, while exploring the graph in the depth-first manner, keeps track of the paths it has already visited and does not revisit them. Additionally, the algorithm uses a stack data structure, and hence the memory requirement is polynomial in the size of the input graph. However, the two-typed, divided network has more edges than the union of the edges across the organism-specific networks. The constructed network will have the same number of vertices, but it will have a superset of the edges in the union network. It will contain all of the edges in the union network, plus the inter-organismal edges, which do not exist in the union network. In the worst-case, the number of inter-organismal edges will be in O(N2 * k), where k is the number of organisms and N is the number of nodes per organism-specific network. The alternative to adding these inter-organismal edges would be comparing the cliques enumerated in each organism-specific network with those enumerated in another organism-specific network. This would require O(C k ) time, where C is the number of maximal cliques in an organism-specific network, which is likely significantly larger than O(N2 * k).
This work was supported in part by the U.S. Department of Energy, Office of Science, the Office of Advanced Scientific Computing Research (ASCR) and the Office of Biological and Environmental Research (BER) and the U.S. National Science Foundation (Expeditions in Computing). The work by AR was supported by the Delores Auzenne Fellowship and the Alfred P. Sloan Minority PhD Scholarship Program. Oak Ridge National Laboratory is managed by UT-Battelle for the LLC U.S. D.O.E. under contract no. DEAC05-00OR22725.
- Benfey PN, Mitchell-Olds T: From genotype to phenotype: systems biology meets natural variation. Science 2008, 320(5875):495–497. 10.1126/science.1153716PubMed CentralView ArticlePubMedGoogle Scholar
- Sharan R, Ulitsky I, Shamir R: Network-based prediction of protein function. Mol Syst Biol 2007, 3: 88.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhu X, Gerstein M, Snyder M: Getting connected: analysis and principles of biological networks. Genes Dev 2007, 21(9):1010–1024. 10.1101/gad.1528707View ArticlePubMedGoogle Scholar
- Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, Bork P, von Mering C: STRING 8-a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res 2009, (37 Database):D412-D416.Google Scholar
- Li X, Tan S, Foo C, Ng S: Interaction graph mining for protein complexes using local clique merging. Genome Inform 2005, 16(2):260–269.PubMedGoogle Scholar
- Zhang B, Park B, Karpinets T, Samatova NF: From pull-down data to protein interaction networks and complexes with biological relevance. Bioinformatics 2008, 24(7):979–86. 10.1093/bioinformatics/btn036View ArticlePubMedGoogle Scholar
- Tabb D, Thompson M, Khalsa-Moyers G, VerBerkmoes N, McDonald W: MS2Grouper: Group assessment and synthetic replacement of duplicate proteomic tandem mass spectra. J Am Soc Mass Spectrom 2005, 16: 1250–1261. 10.1016/j.jasms.2005.04.010View ArticlePubMedGoogle Scholar
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning Protein Functions by Comparative Genome Analysis: Protein Phylogenetic Profiles. Proc Natl Acad Sci 1999, 96(8):4285–4288. 10.1073/pnas.96.8.4285PubMed CentralView ArticlePubMedGoogle Scholar
- Levesque M, Shasha D, Kim W, Surette MG, Benfey PN: Trait-to-Gene A Computational Method for Predicting the Function of Uncharacterized Genes. Curr Biol 2003, 13(2):129–133. 10.1016/S0960-9822(03)00009-5View ArticlePubMedGoogle Scholar
- Slonim N, Elemento O, Tavazoie S: Ab initio genotype-phenotype association reveals intrinsic modularity in genetic networks. Mol Syst Biol 2006., 2: (2006.2005) (2006.2005)Google Scholar
- Tatusov R, Fedorova N, Jackson J, Jacobs A, Kiryutin B, Koonin E, Krylov D, Mazumder R, Mekhedov S, Nikolskaya A, Rao BS, Smirnov S, Sverdlov A, Vasudevan S, Wolf Y, Yin J, Natale D: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4: 41. 10.1186/1471-2105-4-41PubMed CentralView ArticlePubMedGoogle Scholar
- Liu Y, Li J, Sam L, Goh CS, Gerstein M, Lussier YA: An Integrative Genomic Approach to Uncover Molecular Mechanisms of Prokaryotic Traits. PLoS Comput Biol 2006, 2(11):e159. 10.1371/journal.pcbi.0020159PubMed CentralView ArticlePubMedGoogle Scholar
- Foster JW: Microbial Response to Acid Stress. In Bacterial Stress Responses. Edited by: Storz G, Hengge-Aronis R. Washington, D.C.: ASM Press; 2000:99–116.Google Scholar
- Vignais PM, Billoud B, Meyer J: Classification and phylogeny of hydrogenases. FEMS Microbiol Rev 2001, 25(4):455–501.View ArticlePubMedGoogle Scholar
- Butland G, Zhang Jw, Yang W, Sheung A, Wong P, Greenbalt JF, Emili A, Zamble DB: Interactions of the Escherichia coli hydrogenase biosynthetic proteins: HybG complex formation. FEBS Letters 2006, 580: 677–681. 10.1016/j.febslet.2005.12.063View ArticlePubMedGoogle Scholar
- Shomura Y, Komori H, Miyabe N, Tomiyama M, Shibata N, Higuchi Y: Crystal Structures of Hydrogenase Maturation Protein HypE in the Apo and ATP-bound Forms. J Mol Biol 2007, 372(4):1045–1054. 10.1016/j.jmb.2007.07.023View ArticlePubMedGoogle Scholar
- Blokesch M, Albracht SPJ, Matzanke BF, Drapal NM, Jacobi A, Bock A: The Complex Between Hydrogenase-maturation Proteins HypC and HypD is an Intermediate in the Supply of Cyanide to the Active Site Iron of [NiFe]-Hydrogenases. J Mol Biol 2004, 344: 155–167. 10.1016/j.jmb.2004.09.040View ArticlePubMedGoogle Scholar
- Butland G, Peregrin-Alvarez JM, Li J, Yang W, Yang X, Canadien V, Starostine A, Richards D, Beattie B, Krogan N, Davey M, Parkinson J, Greenblatt J, Emili A: Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature 2005, 433(7025):531–537. 10.1038/nature03239View ArticlePubMedGoogle Scholar
- Rey FE, Heiniger EK, Harwood CS: Redirection of metabolism for biological hydrogen production. Appl Environ Microbiol 2007, 73(5):1665–1671. 10.1128/AEM.02565-06PubMed CentralView ArticlePubMedGoogle Scholar
- Fani R, Gallo R, Lio P: Molecular evolution of nitrogen fixation: the evolutionary history of the nifD, nifK, nifE, and nifN genes. J Mol Evol 2000, 51: 1–11.PubMedGoogle Scholar
- Rey FE, Oda Y, Harwood CS: Regulation of uptake hydrogenase and effects of hydrogen utilization on gene expression in Rhodopseudomonas palustris. J Bacteriol 2006, 188(17):6143–6152. 10.1128/JB.00381-06PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang C, Liu S, Zhou Y: Fast and accurate method for identifying high-quality protein-interaction modules by clique merging and its application to yeast. J Proteome Res 2006, 5(4):801–807. 10.1021/pr050366gView ArticlePubMedGoogle Scholar
- Atkinson MR, Blauwkamp TA, Ninfa AJ: Context-Dependent Functions of the PII and GlnK Signal Transduction Proteins in Escherichia coli. J Bacteriol 2002, 184(19):5364–5375. 10.1128/JB.184.19.5364-5375.2002PubMed CentralView ArticlePubMedGoogle Scholar
- Borden JR, Jones SW, Indurthi D, Chen Y, Terry Papoutsakis E: A genomic-library based discovery of a novel, possibly synthetic, acid-tolerance mechanism in Clostridium acetobutylicum involving non-coding RNAs and ribosomal RNA processing. Metab Eng 2010, 12(3):268–281. 10.1016/j.ymben.2009.12.004PubMed CentralView ArticlePubMedGoogle Scholar
- Foster JW: Escherichia coli acid resistance: tales of an amateur acidophile. Nat Rev Microbiol 2004, 2: 898–907. 10.1038/nrmicro1021View ArticlePubMedGoogle Scholar
- Steffes C, Ellis J, Wu J, Rosen BP: The lysP gene encodes the lysine-specific permease. J Bacteriol 1992, 174(10):3242–3249.PubMed CentralPubMedGoogle Scholar
- Chou HT, Hegazy M, Lu CD: L-Lysine Catabolism Is Controlled by L-Arginine and ArgR in Pseudomonas aeruginosa PAO1. J Bacteriol 2010, 192(22):5874–5880. 10.1128/JB.00673-10PubMed CentralView ArticlePubMedGoogle Scholar
- White D: The Physiology and Biochemistry of Prokaryotes. 3rd edition. Oxford University Press, USA; 2006.Google Scholar
- Hendrix W, Rocha AM, Elmore MT, Trien J, Samatova NF: Discovery of Enriched Biological Motifs Using Knowledge Priors with Application to Biohydrogen Production. In BIOCOMP. Edited by: Arabnia HR, Tran QN, Chang R, He M, Marsh A, Solo AMG, Yang JY. CSREA Press; 2010:17–23.Google Scholar
- Bron C, Kerbosch J: Algorithm 457: Finding All Cliques of an Undirected Graph. Commun ACM 1973, 16(9):575–577. 10.1145/362342.362367View ArticleGoogle Scholar
- Markowitz VM, Korzeniewski F, Palaniappan K, Szeto E, Werner G, Padki A, Zhao X, Dubchak I, Hugenholtz P, Anderson I, Lykidis A, Mavromatis K, Ivanova N, Kyrpides NC: The integrated microbial genomes (IMG) system. Nucleic Acids Res 2005, 34(suppl 1):D344-D348.PubMed CentralGoogle Scholar
- Moon JW, Moser L: On cliques in graphs. Israel J Math 1965, 3: 23–28. 10.1007/BF02760024View ArticleGoogle Scholar
- Schmidt MC, Samatova NF, Thomas K, Park B: A scalable, parallel algorithm for maximal clique enumeration. J Parallel Distr Com 2009, 69(4):417–428. 10.1016/j.jpdc.2009.01.003View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.