Genome-scale analysis to the impact of gene deletion on the metabolism of E. coli: constraint-based simulation approach

Background Genome-scale models of metabolism have only been analyzed with the constraint-based modelling philosophy. Some gene deletion studies on in silico organism models at genome-scale have been made, but most of them were from the aspects of distinguishing lethal and non-lethal genes or growth rate. The impact of gene deletion on flux redistribution, the functions and characters of key genes, and the performance of different reactions in entire gene deletion still lack research. Results Three main researches have been done into the metabolism of E. coli in gene deletion. The first work was about finding key genes and subsystems: First, by calculating the deletion impact p of whole 1261 genes, one by one, on the metabolic flux redistribution of E. coli_iAF1260, we can find that p is more detailed in describing the change of organism's metabolism. Next, we sought out 195 important (high-p) genes, and they are more than essential genes (growth rate f becomes zero if deleting). So we speculated that under some circumstances and when an important gene is deleted, a big change in the metabolic system of E. coli has taken place and E. coli may use other reaction ways to strive to live. Further, by determining the functional subsystems to which 195 key genes belong, we found that their distribution to subsystems was not even and most of them were related to just three subsystems and that all of the 8 important but not essential genes appear just in "Oxidative Phosphorylation". Our second work was about p's three characters: We analyzed the correlation between p and d (connection degree of one gene) and the correlation between p and vgene (flux sum controlled by one gene), and found that both of them are not of linear correlation, but the correlation between p and f is of highly linear correlation. The third work was about highly-affected reactions: We found 16 reactions with more than 2000 Rg value (measuring the impact that a reaction is gotten in the whole 1261 gene deletion). We speculated that highly-affected reactions involve in the metabolism of basic biomasses. Conclusion To sum up, these results we obtained have biological significances and our researches will shed new light on the future researches.


Background
Since various 'omics' datasets are becoming available, biology has transited from a data-poor to a data-rich environment. This has underscored the need for systems analysis in biology and systems biology has become a rapidly growing field as well [1].
A change in mathematical modelling philosophy is also necessitated, and that is based on building and validating in silico models. Modern biological models need to meet new sets of criteria: organism-specific, data-driven, easily scalable, and so on. Many modelling approaches, such as kinetic, stochastic and cybernetic approaches, are currently being used to model cellular processes. Owing to the computational complexity and the large number of parameters needed, it is currently difficult to use these methods to model genome-scale networks. To date, genome-scale models of metabolism have only been analyzed with the constraint-based modelling philosophy [2,3]. Genome-scale network models of diverse cellular processes such as signal transduction, transcriptional regulation and metabolism have been generated. Gene-protein-reaction (GPR) associated models can keep track of associations between genes, proteins, and reactions [4], and there have been several genome-scale GPR models, such as E. coli [4,5], S. aureus [6], H. pylori [7], M. barkeri [8], S. cerevisiae [9] and B. subtilis [10]. A reconstruction is herein defined as the list of biochemical reactions occurring in a particular cellular system and the associations between these reactions and relevant proteins, transcripts and genes [2]. A reconstruction can include the assumptions necessary for computational simulation, such as maximum reaction rates and nutrient uptake rates [11].
Computer simulations of complex biological systems become essential as soon as the computational capability become available. As reconstructed networks have been made publicly available, researchers around the world have undertaken new computational studies using these networks [12]. Many researches apply a core set of basic in silico methods and often also describe novel methods to investigate different models. An extensive set of methods for analyzing these genome-scale models have been developed and have been applied to study a growing number of biological problems [12]. But as we have mentioned above, as yet, genome-scale models of metabolism have only been analyzed with the constraint-based philosophy [2,3].
The in silico models can be applied to generate novel, testable and often quantitative predictions of cellular behaviors [13]. The impact of a gene deletion experiment on cellular behavior can be simulated in a manner similar to linear optimization of growth [14]. The results can be used to guide the design of informative confirmation experiments and will be helpful for metabolic engineering. Some gene deletion studies on the genome-scale in silico models of organisms have been made [4][5][6][7][8][9][10][15][16][17][18][19], but most of them are from the standpoints of distinguishing lethal and non-lethal genes or growth rate [4][5][6][7][8][9][10][15][16][17][18][19][20][21][22]. The impact of gene deletion on flux redistribution, the characters and functions of key genes, and the performance of different reactions in entire gene deletion still lack research.
In this paper, in the part of results, we have done three research works. The first one: First, we calculated flux distribution of E. coli_iAF1260. Then we calculated the deletion impact of whole 1261 genes (using p to describe the deletion impact of one gene), one by one, on the metabolic flux redistribution of E. coli_iAF1260. Next, we sought out the important genes that most greatly affect the metabolic flux distribution, and furthermore determined their functional subsystems. The second one: We analyzed the correlation between p (describing deletion impact of one gene) and f (describing growth rate in the deletion of 1261 genes), the correlation between p and d (connection degree of one gene) and the correlation between p and v gene (flux sum controlled by one gene). The third one: We made research into what are the reactions affected most greatly in the whole 1261 gene deletion (using Rg to measure the impact). In the part of methods and materials, we introduced the GPR model, some properties of the in silico model of E. coli_iAF1260 (SBML (Systems Biology Markup Language) format) and the method of constraintbased analysis.

Results and discussion
Metabolic flux distribution of E. coli_ iAF 1260 As a base for the later comparing research, we here calculate the flux distribution of E. coli_iAF1260. What we use is E. coli_iAF1260_ flux1.xml, one of the two SBML files that are presented with the reconstruction of E. coli [5]. The computational method we use is flux balance analysis (FBA) [11], one of the fundamental genome-scale phenotypic calculations, which can simulate cellular growth. FBA is based on linear optimization of an objective function, which typically is biomass formation. Given an uptake rate for key nutrients and the biomass composition of the cell (usually in mmol component gDW -1 and defined in the biomass objective function), the maximum possible growth rate of the cells can be predicted in silico. We use the COBRA toolbox [11] to carry out this computation of FBA. The flux distribution of E. coli_iAF1260 is illustrated in Figure 1.

Impact of gene deletion on the metabolic flux redistribution and key genes
As our first work, we now do research into the impact of gene deletion on the metabolic system of E. coli. First we calculate the deletion impact of 1261 genes, further seek out important genes and functional subsystems to which these key genes respectively belong.

1) Impact of gene deletion on the metabolic flux redistribution and key genes that affect metabolism most greatly
There are 1261 genes in the model of E. coli_iAF1260. If a single gene is associated with multiple reactions, the deletion of that gene will result in the removal of all associated reactions. On the other hand, a reaction that can be catalyzed by multiple non-interacting gene products will not be removed in a single gene deletion. By the aid of the COBRA toolbox [11], we can calculate the impact of their deletion. We define the impact of one gene deletion on the whole metabolic flux redistribution as p Where v i and are respectively the flux value of i-th reaction of the model of E. coli_iAF1260 before and after a single gene deleting and R is the whole reaction set. In most of the researches on gene deletion [4][5][6][7][8][9][10][15][16][17][18][19][20][21][22], the change of growth rate f is often used to describe the impact of gene deletion. The reason why we define p as the impact of gene deletion is that we believe it is more detailed in describing the change of organism's metabolism. p has considered the flux change taking place at every reaction, and it uses the square sum of the difference between v i and . Otherwise, f is just a whole measure and it does not distinguish the flux change taking place at every reaction. Figure 2 shows the deletion impact of these 1261 genes. Table 1 gives p scopes, gene numbers falling within these scopes and their corresponding percentages that these genes take. Figure 3 shows the deletion impact of these 1261 genes to the growth rate f of E. coli. Every deletion of these 1261 genes will entail a new f.
We define those genes with p>9800 as key genes or highp genes, and there are 195 genes in total. There are 187 cases in which f = 0, their corresponding genes are usually called essential genes or zero-f genes, and all of their p are >9800. These 187 so-called essential genes are consistent with previous literatures [5], except "s0001" which is not included in the report of Ref. [5]. The left 8 genes with p > 9800 &f ≠ 0 are shown in Table 2 with bold text, and we call them INE (Important but Not Essential) genes. Additional file 1 provides the details. Comparing with experiment observation [22], six (b3731, b3733, b3734, b3735, b3736, b3738, b3731) of the 8 INE genes are essential genes; Comparing with experiment observation [23], two (b3731, b3736) of the 8 INE genes are essential genes. At the same time, two genes (b0529 and b3956) are reported as essential genes in Ref. [5], but they are not key genes as Figure 1 Flux distribution of E. coli_iAF1260. X-axis indicating every reaction in rxns (the order is as the same as in rxns, total 2382) and y-axis indicating the value of its corresponding flux (unit is mmol gDW -1 h -1 ). Rxns is the reaction set in the model.
The deletion impact p of 1261 genes of the E. coli_iAF1260 model Figure 2 The deletion impact p of 1261 genes of the E. coli_iAF1260 model. X-axis indicating every gene in 1261 genes (the order is as the same as in genes, total 1261) and y-axis indicating its impact p. Genes is the set of genes in model. for our computation, while b3956 is reported as nonessential gene both in Ref. [22,23] and b0529 is reported as nonessential gene both in Ref. [22]. From these comparisons, we can find that p has an advantage over f in describing the change of organism's metabolism.
We also note that there are 8 genes with p>9800 &f ≠ 0. Based on the fact, we can speculate that, under some circumstances and when an important gene is deleted, a big change in the metabolic system of E. coli has taken place and E. coli may use other reaction ways to strive to live. This may reflect the robustness of the metabolic networks of microbes. It is also an important and interesting conclusion.

2) Functional subsystems to which these key genes belong
If a gene catalyzes a reaction which belongs to a certain subsystem, we say that the gene belongs to the subsystem. Functional subsystems about important genes in the metabolic system of micro-organism are seldom reported. We have hereinabove defined those genes with p>9800 as key genes. We now list the functional subsystems to which every key gene belongs, 23 subsystems in total, and several genes appear in more than one subsystem, shown in Table  2 We can find that the distribution to subsystems of these 195 key genes is not even and most of them are related to "Cofactor and Prosthetic Group Biosynthesis", "Cell Envelope Biosynthesis" and "Purine and Pyrimidine Biosynthesis" subsystems, especially CPGB. We can also find that all of the important but not essential (INE) genes, 8 in total, appear in "Oxidative Phosphorylation".
The reason for many high-p genes just belonging to several metabolic subsystems maybe is in that these subsystems involve many reactions and provide supports for other subsystems; The reason for INE genes just belonging to "Oxidative Phosphorylation (OP)" subsystem probably is in that the permissibility which E. coli use other reaction ways to carry out this kind of metabolism, under the given media condition, takes place on OP subsystem.

Analysis to the three characters of p
As our second work, we now begin research into some properties of the metabolic network of E. coli, i.e., three characters of p. Some properties about the metabolic network of micro-organisms have been reported in literatures [15][16][17][18][19][20][21][22]. Because the measure we defined is different, our research will provide further evidences to the properties about the metabolic network. Figure 4 is the scatter diagram (p, f), total 1261 data pairs. Many data pairs are superposition and locate at the same place, so there aren't lots of points in the figure. From the diagram, we can easily find that the relationship between p and f is of highly linear correlation. High p corresponds to low f.

3) Correlation between p and v gene (flux sum controlled by every gene)
We define the flux sum controlled by every gene as Where v j is the flux value of j-th reaction of the model of E. coli_iAF1260 before a single gene deleting and R gene is the reaction set controlled by the given gene. We can easily compute out the flux sum v gene of every gene in those 1261 genes of the E. coli_iAF1260 model, as illustrated in Figure  7. From the figure, we can find that some but not many genes have high v gene value, but will they affect metabolic flux distribution greatly? Figure 8 is the scatter diagram (v gene , p), 1261 data pairs in total, and many data pairs are superposition. From the diagram, we can also find that the relationship between v gene and p is not of linear correlation as well. So genes with high v gene and genes with low v gene are equally important to the metabolism of E. coli_iAF1260.

Impact of gene deletion on every metabolic reaction
As our third work, we now make research into what are the reactions affected most greatly in the whole 1261 gene deletion. Highly-affected reactions (HAR) are often neglected in many researches in literatures about gene deletion study.

1) Impact of gene deletion on every metabolic reaction
There are 2382 reactions in the in silico model of E. coli_iAF1260. We define Rg to measure the impact that a reaction is gotten in the whole 1261 gene deletion. v The controlled reaction number of every gene in 1261 genes of the E. coli_iAF1260 model Figure 7 The controlled reaction number of every gene in 1261 genes of the E. coli_iAF1260 model. X-axis indicating every gene in 1261 genes (the order is as the same as in genes, total 1261) and y-axis indicating the number of its controlled reactions.
The scatter diagram (d, p) Figure 6 The scatter diagram (d, p). X-axis indicating d (connection degree of every gene) and y-axis indicating the corresponding gene impact p.
The related reaction number of every gene in 1261 genes of the E. coli_iAF1260 model Figure 5 The related reaction number of every gene in 1261 genes of the E. coli_iAF1260 model. X-axis indicating every gene in 1261 genes (the order is as the same as in genes, total 1261) and y-axis indicating the number of its related reactions.
The scatter diagram (p, f) Figure 4 The scatter diagram (p, f). X-axis indicating p and y-axis indicating f, total 1261 data pairs. Many data pairs locate at the same points.
Where v 0 and v k are respectively the flux value of a certain reaction of the model of E. coli_iAF1260 before and after k-th gene deleting, and G is the set of whole 1261 genes.  Table 3 shows Rg scopes, corresponding reaction number within these scopes and the percentages that these reactions take. In the following section, we will determine what the highly-affected reactions are.
Why are these 16 reactions more sensitive to gene deletion? Maybe, it is due to the fact that they involve in the metabolism of basic biomasses such as H 2 O, ATP, O 2 , NADH.

Conclusion
In this paper, we have done three main researches into the metabolism of E. coli in gene deletion. The first was to find its important genes and the corresponding belonging subsystems, the second was to analyze the characters of p, and the third was to find its highly-affected reactions in gene deletion.
To the first work: We used p to describe the impact which gene deletion entailed. Our first finding was that maybe p is more detailed than f in describing the change of organism's metabolism in gene deletion. After calculating the deletion impact of 1261 genes, we sought out 195 important genes (high p genes, p >9800), and they are more than essential genes (f = 0 genes). So our second finding was that under some circumstances and when an important gene is deleted, the metabolic system of E. coli has greatly changed and E. coli may use other reaction ways to strive to live. The third finding was that the distribution to subsystems of these 195 key genes is not even and most of them are related to about three subsystems ("Cofactor and Prosthetic Group Biosynthesis", "Cell Envelope Biosynthesis" and "Purine and Pyrimidine Biosynthesis") and that all of the 8 important but not essential (INE) genesappear just in "Oxidative Phosphorylation" subsystem. We have also tried to give some explanations.
To the second work: We have done research into p's three characters, i.e. its relationship with f, d, v gene . We found that p-f correlation was of highly linear correlation, while both of the p-d correlation and the p-v gene correlation were not of linear correlation. Our research can provide further evidences to the properties about the metabolic network, because the measure we defined is different.
To the third work: We defined Rg to measure the impact that a reaction is gotten in the whole 1261 gene deletion. We calculated the Rg value of each 2382 reactions and gave a statistics to the Rg scopes and the corresponding reaction number. Finally, we sought out 16 reactions with more than 2000 Rg value. We have also tried to give an explanation, i.e., these highly-affected reactions involve in the metabolism of basic biomasses.
In summary, because the in silico model of E. coli_iAF1260 is credible, we can conclude that the results we obtained have biological significances and that the researches we have done will shed new light on the future research. As a next step, we will try more media conditions to the The Rg of each 2382 reactions of E. coli_iAF1260 Figure 9 The Rg of each 2382 reactions of E. coli_iAF1260. Xaxis indicating every reaction in 2382 reactions (the order is as the same as in rxns, total 2382) and y-axis indicating its corresponding Rg value.
The scatter diagram (v gene , p) Figure 8 The scatter diagram (v gene , p). X-axis indicating v gene (the flux sum controlled by every gene) and y-axis indicating the impact, p.
research on E. coli, and will also do similar work on other organisms and compare them with the case of E. coli.

Gene-protein-reaction (GPR) associated model
The association between genes and reactions is not a oneto-one relationship. Many genes may encode subunits of a protein which catalyze one reaction, while there are genes that encode so-called promiscuous enzymes that can catalyze several different reactions. So it is necessary to keep track of associations between genes, proteins, and reactions and to distinguish "&" and "OR" associations in GPR models. Examples of different types of GPR associations are illustrated in Ref. [4,14].

GPR model structure of E. coli_iAF1260
The in silico model that we use is E. coli_iAF1260 [5], a metabolic reconstruction consisting of the chemical reactions that transport and interconvert metabolites within E. coli K-12 MG1655. This network reconstruction was based on a previous reconstruction, termed E. coli_iJR904 [4]. The general features of E. coli_iAF1260 are shown in Ref. [5].
SBML format file to the model E. coli_iAF1260 can be downloaded from the supplementary information of Ref. [5]. There are two SBML files that are presented with the reconstruction, each containing a different flux distribution XML files. SBML file properties are given in the supplementary of Ref. [5]. The dimensions of rxns, mets, and genes are respectively 2382, 1668, 1261.
The minimal media of in silico model is an important aspect. The computational minimal media of E. coli_iAF1260 is also included in the supplementary information of Ref. [5]. In the method of constraint-based analysis, the biomass objective function (BOF) should be defined. The BOF was generated by defining all of the major and essential constituents that make up the cellular biomass content of E. coli [5].
Gene-protein-reaction associations embodied in rxn-GeneMat matrix, which is a matrix with as many rows as there are reactions in the model and as many columns as there are genes in the model. The ith row and jth columncontains a one if the jth gene in genes is associated with the ith reaction in rxns and zero otherwise.

Methodology of constraint-based analysis 1) Constraint-based analysis
In silico modelling and simulation of genome-scale biological systems are different from that practiced in the physicochemical sciences. A network can fundamentally have many different states or many different solutions. Which states (or solutions) are picked is up to the cell and based on the selection pressure experienced, and such choices can change over time. Therefore, constraint-based approaches [2,3] to the analysis of complex biological systems have proven to be very useful. The differences between the physicochemical sciences and the physical sciences or engineering are illustrated in Ref. [14]. All theory-based considerations (i.e., engineering and physics) lead one to attempt to seek an "exact" solution, and typically computed based on the laws of physics and chemistry. However, constraint-based considerations (as in biology) are useful. Not only can a network have many different behaviors that are picked based on the evolutionary history of the organism, but also these networks can carry out the same function in many different and equivalent ways [14].

2) Representation of reconstructed metabolic network
Before calculation and simulation, the reconstructed metabolic network must be represented mathematically. The stoichiometric matrix, S, is the centerpiece of a mathematical representation of genome-scale metabolic networks. It represents each reaction as a column and each metabolite as a row, where each numerical element is the corresponding stoichiometric coefficient.
An upper and lower bound for the allowable flux through each reaction also requires defining. This represents the lowest and highest reaction rate possible for each reaction. The set of upper and lower bounds is represented as two separate vectors, each containing as many components as there are columns in S, and in the same order. In many cases, reversible reactions are defined to have an arbitrary large upper bound and an arbitrarily large negative lower bound. Irreversible reactions have a lower bound that is nonnegative, usually zero.
In order to predict meaningful fluxes, setting upper and lower bounds is especially important for exchange reactions which serve to uptake compounds to the cell or secrete compounds from the cell. The lower bound of exchange reaction column must be a finite negative number using this orientation (e.g., glucose). The upper bound of exchange reaction column must be greater than zero. At least one of the reactions in the model must have a constrained lower/upper bound, and typically, the substrate (e.g., glucose or oxygen) uptake rates are set to experimentally measured values. The upper and lower bounds for exchange reactions are quantitative in silico representations of the growth media environment.

3) Biomass objective function (BOF) and minimal media
The constraint-based approach is based on the assumption that cells strive to maximize their growth rate. This assumption which provides an acceptable starting point for many types of computations is satisfied by simulating maximal production of the molecules required to make new cells (biomass precursor molecules). In spite of their limitations, the predictive power of genome-scale models of metabolic networks has been demonstrated in diverse situations through careful experimentation [11].
The biomass objective function (the function v growth , see below) is a special reaction taking as substrates of all biomass metabolites, ATP and water and producing ADP, protons, and phosphate (as a result of the non-growth associated ATP maintenance requirement) [6].
The minimal media is determined computationally with the systematic testing of distinct inputs. Different combinations of molecules are allowed to enter the reaction network until the minimal group that allowed biomass production, or non-zero Z (see below), was found [6]. It is only concerned that some amount of biomass production is calculated but do not discriminate between extremely slow, inefficient growth and rapid growth.

4) Computation of phenotypic states
In genome-scale metabolic networks, the fluxes within a cell usually cannot be uniquely calculated because a range of feasible values exist when fluxes are subjected to known constraints. Flux balance analysis (FBA) is used to find optimal growth phenotypes. Briefly, a large-scale linear programming is used to find a complete set of metabolic fluxes (v) that are consistent with steady-state condition (eq. 4) and reaction rate bounds (eq. 5), and at the same time maximize the biomass objective function in the defined ratio. This corresponds to the following linear programming problem [6]: Where S is the stoichiometric matrix, and α i and β i define the bounds through each reaction v i . The flux range was set arbitrarily high for all internal reactions so that no internal reaction restricted the network, with the exception of irreversible reactions, which have a minimum flux of zero. The inputs to the system were restricted to a minimal media.
The value of Z computed with the above procedure can either be zero (predicting no growth) or greater than zero (corresponding to cellular growth) depending on the inputs and outputs that are allowed, according to the nutrients provided in the media.

5) Gene deletion study
The effect of a gene deletion experiment on cellular growth can be simulated in a manner similar to linear optimization of growth [5,11]. Gene-reaction associations model the logical relationship between genes and their corresponding reactions. If a single gene is associated with multiple reactions, the deletion of that gene will result in the removal of all associated reactions, i.e. to simultaneously restrict the fluxes (upper and lower flux bounds) of these reactions to zero prior to computing maximal biomass objective function. On the other hand, a reaction that can be catalyzed by multiple non-interacting gene products will not be removed in a single gene deletion. The possible results from a simulation of a single gene deletion are unchanged maximal growth (nonlethal), reduced maximal growth or no growth (lethal). Those genes were considered essential if no biomass could be produced without their usage.