A study on multi-omic oscillations in Escherichia coli metabolic networks

Background Two important challenges in the analysis of molecular biology information are data (multi-omic information) integration and the detection of patterns across large scale molecular networks and sequences. They are are actually coupled beause the integration of omic information may provide better means to detect multi-omic patterns that could reveal multi-scale or emerging properties at the phenotype levels. Results Here we address the problem of integrating various types of molecular information (a large collection of gene expression and sequence data, codon usage and protein abundances) to analyse the E.coli metabolic response to treatments at the whole network level. Our algorithm, MORA (Multi-omic relations adjacency) is able to detect patterns which may represent metabolic network motifs at pathway and supra pathway levels which could hint at some functional role. We provide a description and insights on the algorithm by testing it on a large database of responses to antibiotics. Along with the algorithm MORA, a novel model for the analysis of oscillating multi-omics has been proposed. Interestingly, the resulting analysis suggests that some motifs reveal recurring oscillating or position variation patterns on multi-omics metabolic networks. Our framework, implemented in R, provides effective and friendly means to design intervention scenarios on real data. By analysing how multi-omics data build up multi-scale phenotypes, the software allows to compare and test metabolic models, design new pathways or redesign existing metabolic pathways and validate in silico metabolic models using nearby species. Conclusions The integration of multi-omic data reveals that E.coli multi-omic metabolic networks contain position dependent and recurring patterns which could provide clues of long range correlations in the bacterial genome. Electronic supplementary material The online version of this article (10.1186/s12859-018-2175-5) contains supplementary material, which is available to authorized users.

1 S1 -Criteria for deciding the number of mRN As/cell The number of mRN As/cell is constantly object of study because of several experimental settings. Bartolomaus et all. provide two mRN As/cell quantities in different experimental settings: upon osmotic stress in the minimum medium, the number of mRNAs decreases from ≈ 2400 mRNA copies/cell to ≈ 1600 mRNA copies/cell (this is our lower bound). Instead, upon an heat stress in nutritionally rich medium, it is recorded a reduction of copies/cell from ≈ 7800 mRNAs to ≈ 7200 mRNAs [1] (this is our upper bound). Steady state conditions minus treatment conditions on average lead to a value between ≈ 2000 mRN As/cell and ≈ 7500 mRN As/cell. Furthermore, E.coli under exponential growth at medium growth rate is known to contain about 3000 mRNA copies/cell [2]. Thus, considering the different growth conditions, our scaling coefficient for the normalisation is considered as a fraction 1 of ≈ 3800 mRNA copies/number on the summatory of totalRN A for each microarray replicate. Note that this scaling factor does not change the mR-NAs fold change and it is simply a change of measure units.
2 S2 -Criteria for the trasformation of protein abundance ppm in proteins/cell Protein abundance furnished in ppm is transformed in molecules cell . The first step is to compute the scaling factor to normalize the data. Then it is computed the molecular weight gram/mole for each protein: it is accomplished extracting the molecular weight of each amino acid, and then applying equation 1 according to the involved chemical quantities : ppm i * gram cell · molecules mole n j ppm j · gram mole j = molecules cell (1) In this case, at the numerator the total mass of proteins estimated per cell is of 2.35e6 grammes per cell [2] and the constant implied in the conversion of molecules per mole is the number of Avogadro. The denominator of the equation represents the summation of the expressed protein abundances grammes/moleperppm. Then the scaled protein abundance is obtained in molecules/cell via the multiplication of the scaling factor times the i-th value of protein abundance expressed in ppm (ppm i ).

S-Static and dynamic multi-omics
We can separate omic sources in two category: the first one represent the data that could be intended as variable because are changing due to the effect of treatments, the second one represent the data that are static because does not change during perturbations. In all the cases omic data subject to perturbations (for example protein abundance or mRNA expression levels) and static omic data (like codon usage) maintain the same schema and the same identifier over all the sources. The names of the proteins and the structure of the pathways are extracted from the KEGG REST API [3] and EcoCyc [4]. The protein pathways of KEGG [5] can be used for the extraction and association of other types of information extracted from NCBI [6] and EcoCyc and vice-versa. From Ecocyc are extracted the information about the operons and protein complexes, co-regulations, enzymatic functions and their direction of reaction: the latter will be of central importance when reconstructing a protein centric metabolic network. The NCBI information outlines genes and their positions on the DNA double strand; moreover, they give us information about the direction, 5'-3' or 3'-5', of these genes in the double strand that are presented as + and − in my multi-omic space. From this source it is extracted also the position of the genes on the double strand which is an intrinsic information fundamental for the definition of the multi-layer structure. For each gene name are downloaded the DNA sequences and it is computed the codon usage. We have described till now static omic sources. Instead, variable omic sources are extracted from NCBI Gene Expression Omnibus (GEO) [7] and protein abundance database PaxDB [8]. In particular, we have used a web-service called GEO2R, that in turn is based on GEOquery [9] and limma [10]. GEO2R is a very useful web-service through which it is possible to generate R [11] code for obtaining information about microarray experiments and gene expression profiles with a user-friendly web interface. From this service it is possible to obtain mRNA amounts in steady state conditions (controls) and after perturbations (treatments). Instead from PaxDb it is extracted the protein abundance in standard conditions. Leveraging the services of NCBI are unified the names that on the selected microarray are obsolete, thus avoiding the lack of collinear schema information. Coupling informations about mRNA controls and protein abundance it is inferred a protein variation (pv ) as it is described in the paper. 4 S4 -Correctness of the relative and absolute score for oscillating multi-omics on sequences 4.1 Theorem 1 For each mov the relative multi-omic sequence score (equation 2) is just the number of adjacent couple of values divided by N : Th 1: The maximal score from the equation 2 is obtained if and only if on a multi-omic pattern there are fully oscillating multi-omics.
Proof: The maximal score corresponds exactly to the expected value of a random variable X ( see equation 3) defined as follows: each of the events | e j − e j+1 | can fall in 0 or 1 showing or not oscillating multi-omcis; therefore, this event is represented by a binary random variable X i for all the 0 ≤ i ≤ div which represent the chances of seeing an oscillation of multi-omic values.
Then, for each X i , the expected number E[X i ] of occurrences of an oscillation is equal to E[ and it is similar to take into consideration a bound to the load balancing problem [12]. The expected number ∀ 0 ≤ i ≤ div of seeing a fixed antidyadic effect on N classes for the div independent chances is a balls and bins problem where it is considered the max loading of a fixed machine [13] and in equation 4 it follows by leveraging the property of linearity for the i.i.d. X i : The correctness of this result is not difficult to prove. It is given for induction on the expected value E[X]. The base of the induction is for i = div then for E[X div ] = 1 N we arrive to E[0] = 0. For hypothesis a single oscillation is founded with a chance of 1 N that corresponds to the chance of a ball to fall in a fixed bin. For our random variable of Equation 3 there are two outcomes: either a ball falls into the fixed bin and X i adds 1 for the property of linearity of the expected value, or it adds 0. Therefore, we have: resembling equation 5 it is proved that E[X div ] = 1 N +E[X div−1 ] and resolving this one for each i corresponds to obtain equation 4. In conclusion, the score of equation 2 in the case that all the observed multi-omic values in a sequence are oscillating (maximal score) it is equal to the expected value of X defined by equation 4.

Theorem 2
The absolute score 6 is obtained dividing the relative score (equation 2) by the number of divisors thus showing an absolute score oscillation measurment, as written in equation 6.
Th 2: The maximal absolute score on equation 6 is just the expected value of X i and it is referred to the expectation of seeing an ideal oscillation where each 1 follows a 0 or vice-versa (i.e. mov id : Prof : In the following equation 7, where l − 1 = div it is proved the relation with the maximal absolute score and the expected value of X i : 5 S5 -Multi-omic layers integration 5.0.1 Genomic layer: the role of codon usage One of the most relevant omics considered is the codon usage. The latter is widely proved being a fundamental component showing relevant patterns on the genome [14,15]. The codon adaptation index (CAI) [16] is an index of non-uniform codon use [17]. The CAI of each gene sequence of l codons is a measure of the bias in codon usage and it is defined as follows: where the product of w k = w(i k ) is considered over the DNA code under examination. The quantity w(i k ) for a codon i at position k is given by the relative frequency of the codon i with respect to the most used codon for the same amino acid in a set of highly expressed genes (i.e. ribosomal proteins). In organisms the synonymous codons are not chosen randomly but they follow a rule, thuus it is more common that the beginning of the gene is more composed of rare codons. This is due to the influence of the machinery of translation that is conditioned by the presence of rare tRNAs [18,19]. The genes that present high values of CAI, for instance, ribosomal and major outer membrane genes whose products are required in large quantity with respect to other proteins for high duplication rates, are the effective fitness bottleneck of the bacterial duplication machinery. Until a few years ago, most of the researchers believed that the codon usage could be used to study the elongation rate in the translation process, being the pairing between codon and anticodon of the diffusing tRNA, a rate-limiting step compared with the peptide bond formation and translocation steps [20]. Recent discoveries based on ribosome profiling and next generation sequences seem to contrast the previous demonstration proving that rare codons, the so-called slow codons, are generally translated at similar average speed than abundant codons [21,22,23] . It remains de-facto that all the translational machines, and not only if we think to the duplication machinery, are dominated by regulation mechanisms. Furthermore, if the codon usage, in particular, the CAI, is not a rate-limiting index it does not affect the fact that a priori shows some new regularities in our multi-omic spaces and it is an oscillating omic on multi-omic patterns.

Transcriptomic layer: mRNA amounts
The omics extracted in the transcriptomic layer are the mRNA amounts and their fold changes (f c) caused to the effect of antibiotics. Two compendia of micro arrays are considered: The first compendium extracted from GEO with accession number GSE6836 is the one of Faith et al. [24] and contains 121 experimental conditions. The second compendia is of Suzuki et al. [25] with GEO accession number GSE59408 is based on the study E.coli resistance to 10 different antibiotics. From these compendia are selected 69 experiments between those one with a reasonable statistical reliability [26,27] and a 'control vs treatment' expression matrix design. Reliabilities, described in equation 9, are fundamental in the definition of the Spearman's correct correlation (S. correction). The latter is leveraged for the compendia analysis within-studies and between-studies because is a stable form of correlation that take into account the noise propagation and stabilize the analyses considering microarray replicates. Then S. correction within-studies is considered on replicates of the same experiment in one of the possible compendia. S. correction between-studies is utilized when we compare mRNAs studies that come from different compendia between two different experiments ( Figure 1 (red and blue arrows))) and in between-studies of the same exper-iment but between replicates of controls and treatments ( Figure 1 (green  arrow)). Within-studies and between-studies median values of Spearman's correct correlations and Pearson's correlation (P.correction) are computed for each compendia. The results are reported in the Table 1. Once it is con- Table 1: Median and standard deviation (±) within-studies (inter relation of all the i-th (i) microarray replicated measures within the control (C) or/and treatment (T)). Median and standard deviation (±) between-studies (cross relation between the replicated measures of the i-th (i ) microarray and the j-th (j ) microarray considering control (C) or/and treatment (T)). Median and standard deviation are above all Spearman's correction (S.correction) for attenuation and sample Pearson's correlation (P.correlation) in separate columns. (see also Figure 1) Within Studies Between Studies sidered the Spearman's correct correlation through these data it is possible to prove that the selected experiments could be integrable and robust with respect the noise. After all the median values in Table 1 reports a maintained coherence within experiments reporting small variations between replicates of control and treatments. That's mean in specific contexts that antibiotics in the most of the cases considered presents a specific cellular target, thus altering the mRNA expression of a little group of genes (Table 1 The total mRNA extracted from the platforms is normalised to the averaged value of mRN As/cell considering ≈ 3800 mRNA copies/number as scaling factor over all the compendia ( see also supplementary material S1 ).

Proteomic layer: protein abundance and protein variation
E.coli steady-state protein abundance is downloaded from the PaxDb webservice [8]. We have considered all the studies with a coverage ≥ 98 % on the whole genome. In this way it is obtained the protein abundance in steady state. Then we need to extract the protein variation. The mRNAs amount and their relative fold changes play a central role in the evaluation of proteinlevel variation and, due to the heterogeneity of data, it is not a process so obvious. Furthermore, in the literature, there are a lot of methodologies for determining the variation of protein abundance caused by a treatment i.e. mRNAs levels, codon usage and amino-acid usage [28,29]. In more published works mRNA and protein abundance are showed to be correlated weakly, with a coefficient of determination R 2 ≈ 0.17 − 0.47 [30]. However, the lack of correlation it is imputed several control factors and due to the presence of noisy data [31]. It is possible to obtain a relevant correlation between protein abundance and mRNA if it is accounted the noise and separated from the rest of the information [27,32]. Frost and Thompson [33] discussed several methods for the regression dilution bias. Then, protein variation will be inferred through a noise-robust linear model stabilized with reliabilities. In the figure 2 the pipe-line of the method adopted is shown: steady-state protein abundance it is indicated as B and controls mRNAs amount as A. In this setting, the observations A and B are considered with additive noise. Then A = α + n α with respectively A ≈ N (µ, σ 2 A ) and noise n α ≈ N (0, σ 2 α ) with α the true value. Steady-steate protein abundance B are defined as the latter with B = β +n β . Consider A as a predictor of the mRNA controls true value α and B as the dependent variable. A and B are normally distributed. We are interested into the reliability of A that is defined as the variance ratio of the true value and the observed value: When the true value it is unknown the definition of reliability above remains theoretical. Even if it is possible to reach an estimation of ψ α through the Pearson coefficient of correlation ρ. Then, each couple of observed values 8 (i,j) of A are considered as follows: ψ α = ρ A i A j . Reliability functions are furnished by Csardi et al. [27] in R. The reliability ψ α will be multiplied to an estimated slope according to some models in literature [34,33]. Once the reliability it is obtained we can compute the protein variation leveraging a random effect model [35]. The random effect model considered is based on a linear regression with a random estimation (log-likelihood) for the slopes and intercepts. It is given as independent variable a selected group of mRNAs replicates in steady-state conditions (controls). The latter are chosen above all the 69 experiment considering those one that once perturbed are differentially expressed with a specific threshold γ and that maintain an acceptable reliability (ψ α > 0.5). The threshold γ permits to filter the differentially expressed genes by a variable fold change f c (±f c, 0.1 ≤ f c ≤ 1.2) and a significant p-value ( ≤ 0.05 with Bonferroni correction). From these selected groups are removed the outliers. The predictors left are those biologically valid, because are those one that are really expected to change between experiments. Then, the latter could predict in a correct way the protein variation. If γ is more restrictive the reliability of A decrement making not affordable the predictors. In fact increasing the number of measurement, then the reliability coefficient increases: where γ is the factor by which the test length is increased: it is proportionally related to the reliability. In a classical linear model for a j-th observation the mRNA predictors and the steady state protein abundance it is represented in the following form: B j = mA j + j , j ≈ N (0, σ 2 ) and j = 1, ..., n. Note that in this way it is not possible to consider the mRNA replicates decreasing the stability with respect the noise. Instead, the relationship in a random effect model it is repeated for N=8 controls times the replicates. Then the random effect model it is defined with i = 1, ..., N and a shared variable θ i as follows: Due to the Wilks's theorem [36] we can compare, with a log likelihood ratio test, the null model B ij = mA ij + ij (who does not take into account the random effect) with the random effect model B ij = θ i +mA ij + ij [37,32] ( an approximated chi-square distribution with an associated p-value which 9 returns the goodness of fit). This procedure is repeated each time varying the fold change, thus obtaining a relevant f c threshold that could make the model significant. As showed in table 2 the optimal log fold change threshold is of 0.7 and it could be individuated at the seventh row of the table. At this threshold the random effect model says that mRNA control replicates (predictors) affect protein abundances (dependent variable) with a (χ 2 (3) = 19.53, p = 0.00002), increasing it by about 2.4e + 24 molecules/cell ±9.17e + 23 (standard errors). Each θ i will present its own slope and intercept then, it is possible to obtain a list of slopes and intercepts. Then, from this model, it is estimated a median intercept ofĉ = −1.15e + 22 and a median slope of m t = 2.98e + 24. The latter, as said before, is corrected with its associated reliability ( Ψ α = 0.529) and become ofm = 1.25e + 24. Thus taking the parametersm andĉ we can have an estimation of the protein variation index pv j that dimensionally is coherent with molecules/cell and is a function of the mRNA amounts x t molecules/cell: PaxDB's steady state protein abundance represented part per million (ppm) is transformed molecules/cell as described in the supplementary material S2.

Metabolomic layer: a novel integrated network
In this work the whole metabolic network is represented as a protein-centric network topology [38]. We have integrated in R the most updated E.coli metabolic network (furnished in supplementary material) from multiple sources: EcoCyc and KEGG. The network extracted from the Ecocyc smart tables is of V = 1321 proteins and E = 366778 reactions. Moreover, are integrated the KEGG pathways obtaining a network of V = 1005 proteins and E = 3394 edges. Ecocyc and KEGG networks are merged forming a network of V = 1644 proteins and E = 369863 edges. Each protein complex in the table of reactions is considered as a monomer by its individual reaction. Since we are interested on the relations of multi-omics projected on this structure we prove that this novel integrated network responds to the most well known topology properties present in literature.
It is studied the scale-free property of this network and we could asses, according to other studies [39,40], that the integrated protein-centric metabolic network presents features of a power law degree (d) distribution Figure 3 with its typical parameter α = 2.7. After that, the network topology is investigated across the features that characterise this metabolic network as a small world network: the clustering coefficient (CC) and the average path length (APL). In particular EcoCyc integrated network presents a CC = 0.95 and AP L = 2.26, for what concern the novel metabolic network integrated ( KEGG + EcoCyc) the CC = 0.84 and have an average path length of 2.71 proteins for shortest path. In both cases are presented some typical features of small-world networks: an high CC and a low APL [41]. Erdós-Rényi [42] models could be considered as a yardstick for deciding if we are dealing with a small-world network. In fact, it has been proven that real networks are small world networks, once compared to random networks, if presents the same number of edges and nodes, a similar ALP and a higher CC [43]. 100 Erdós-Rényi random networks of the same dimension of the integrated network are generated thus computing the CC median values and error (0.5103229 ± 0.0002) and the ALPs (1.86 ± 0.000005) with the conclusion that enjoys the small-world properties. If in one hand random networks are useful to describe the property of small-world between multi-omics pro-jected on metabolic structures, on the other hand they not describe perfectly the real network under examination. In a certain sense, Poisson and exponential degree distribution could explain the evolution theory of proteins and compounds [38] because they randomly add on each epoch t new nodes to networks. Thus, some studies on E.coli based on these distributions account an average of 9.76 links per node [44]. In this setting, proteins are considered to accomplish the same charge of work not clearly explaining the complexity of the metabolic structures, because, for example, it is not well modeled the presence of hubs [45,46]. Alternatively in scale-free networks is accounted a preferential attachment [43] that, in turn, show the high specificity of the enzyme and reactions and where it is possible to recognize hubs. It is important to underline that topology networks are helpful in describing the interaction between nodes but they do not describe the internal node characteristics.

S-MORA: toy example
Adjacency influences are considered on a toy model ( figure 4 (a)). A network of nodes, named with a number from 1 to 13, is showed. The nodes are ordered following their relative positions forming this sequence: 3-1-2-4-5-6-7-8-10-11-13-9-12 (in pathways the positions are taken considering the distance by the origin of replication). According to the algorithm MORA, in the figure the nodes are showed with a more/less influence on the sequence (reciprocal influence) and their relative adjacency weights (more/less a node is influent, more/less its diameter increases on the network). For example, nodes 3-1-2 are adjacent and linked in the network showing relevant adjacent weights. Node 7 is the most influencing node on the sequence, due to the effect of its direct adjacency with nodes 6 and 8 on the sequence and its direct links to 6 and 8 on the network (ψ = 2). Furthermore, node 7 is involved in 2 specific shortest paths: in fact, the algorithm increases its weight, counting the influence effects of adjacent nodes 5-6 and 8-10 in the sequence, that on the network represent the shortest paths of three nodes (5-7-6) and (8-7-10) (ψ = 3) (node 7 is present in both). Influences are considered on undirected networks. Nodes with more/less influence are linked with their associated structural/reciprocal influence colour. Structural/reciprocal influence (value of inf l ) are divided and coloured forming the group of the most adjacent nodes (those with a weight ≥ median( inf l )) and that of the less adjacent nodes. (figure 4 (b) (c)). In these plots 2 extreme structural conditions are tested with a sequence equal to 1-2-3-4-5, where the first plot is a clique and the second a broken clique (snaked).

7
S7 -A concrete example for multi-omic metabolic network motifs: E.coli Glycolysis In Figure 5 part (a) is shown an example of multi-omic oscillations on the pathway for the E.coli Glycolysis. Blue and red nodes show oscillating multiomic values. Orange edges link nodes with the anti-dyadic effect (i.e. those with oscillating multi-omics), instead red and blue edges show the dyadic effect. In this case the anti-dyadic effect magnitude is of m 01 = 1.99 and the dyadic effect magnitude of m 1100 = 2.14. The different node sizes depict a less or more adjacent influences (computed with MORA). The reciprocal influence RI of Glycolysis with respect its sequence pattern is equal to 1. Then more adjacent nodes are those with adjacency weights greater than 1 and less adjacent nodes are those ≤ 1. Figure 5 part (b) shows multi-omic oscillations on patterns. The values are shown on a normalised scale and not yet binary discretized. The dots of several colours show a variation in oscillation due to the effect of the 69 considered treatments. The groups of operons and protein complexes that take part to the pathway are shown in the blue and red bands. The Glycolysis multi-omic pattern similarity to an ideal oscillating sequence is of σ obs = 0.62. In Figure 6 part (a) are shown the multiomic oscillations on the E.coli Glycolysis with path extensions. In this case, new reactions are added, and consequently also new nodes to the pathway and new multi-omics on the pattern. In this case, the anti-dyadic effect magnitude is m 01 = 1.35 and the dyadic effect magnitude m 00−11 = 1.46. With respect to figure 5 both the effects are lowered, but the network still maintains a significant amount of oscillating multi-omics ( m 01 > 1 ). In this case, the RI is = 2 showing that there are more adjacent influences with respect to the Glycolysis pathway without path extensions. In Figure 6 part (b) are shown multi-omic oscillations on patterns with the insertion of path extensions. With respect the standard conditions without modifications, where σ obs was = 0.6, while now σ obs is increased to 0.7413793. For analyzing the effect of treatments please refer to the tables of Additional File 1.  Table 1. Correlations are studied to show some characteristics of the microarray experiment design and to test the robustness of the dataset before executing other types of analysis.