Robust detection and verification of linear relationships to generate metabolic networks using estimates of technical errors
© Kose et al. 2007
Received: 25 July 2006
Accepted: 21 May 2007
Published: 21 May 2007
The size and magnitude of the metabolome, the ratios between individual metabolites and the response of metabolic networks are controlled by multiple cellular factors. Tight control over metabolite ratios will be reflected by linear relationships between pairs of metabolites due to the flexibility of metabolic pathways. Hence, unbiased detection and validation of linear metabolic variance can be interpreted in terms of biological control. For robust analyses, criteria for rejecting or accepting linearities need to be developed despite technical measurement errors. The entirety of all pair-wise linear metabolic relationships then yields insights into the network of cellular regulation.
The Bayesian law was applied for detecting linearities that are validated by explaining the residuals by the degree of technical measurement error. Test statistics were developed and the algorithm was tested on simulated data using 3–150 samples and 0–100% technical error. Under the null hypothesis of the existence of a linear relationship, type I error rates remained below 5% for data sets consisting of more than four samples, whereas the type II error rate rose quickly with increasing technical error. Conversely, a filter was developed to balance the error rates in the opposite direction. A minimum of 20 biological replicates is recommended if technical errors remain below 20% relative standard deviation and if thresholds for false error rates are acceptable at less than 5%. The algorithm was proven to be robust against outliers, unlike Pearson's correlations.
The algorithm facilitates finding linear relationships in complex datasets, which is radically different from estimating linearity parameters from given linear relationships. Without the filter, it provides high sensitivity and fair specificity. With the filter activated, it yields high specificity but only fair sensitivity. Total error rates are more favorable with the filter deactivated, and hence metabolomic networks should be generated without the filter. In addition, Bayesian likelihoods facilitate the detection of multiple linear dependencies between two variables. This property of the algorithm enables its use as a discovery tool and for generating novel hypotheses about the existence of otherwise hidden biological factors.
In recent years, time course analyses of metabolic perturbations have become more important for understanding metabolic networks based on experimental data [1, 2]. One way to analyze metabolic networks is to systematically investigate linear relationships between all analyzed metabolites (variables), followed by constructing networks from positively identified components and eventually comparing network topologies between different physiological or genetic conditions [4, 5]. Simulations of metabolic reactions have shown that even stochastic influences on metabolism may result in linear metabolic co-regulation, because initial metabolic perturbations can be propagated and enhanced through the cellular biochemical network. Such linear co-regulation of pairs of metabolites may point to changes in biochemical control (chemical equilibrium, mass conservation, asymmetric control distribution) as well as to transcriptional regulation. Variance in metabolite levels can be caused by three different factors:
(I) concentrations change, and hence variance increases, because experimental conditions are intentionally altered, for example by changing environmental parameters such as external nutrients or by using different genotypes,
(II) metabolite data are found to vary in a stochastic manner caused by the imprecision of the analytical method used for acquiring the data, and
(III) interestingly, even under very controlled environmental conditions, a high degree of biological variation is found for metabolite levels due to stochastic biological events that trickle through the biochemical network and thus reflect the underlying control structure at this particular biological condition.
Therefore, if enough biological replicates are analyzed for a given organism at a given physiological situation, the metabolic phenotype can be investigated not only by its average metabolic values, but also by a snapshot of its metabolic network. However, biologists often do not know the inherent biological variability in advance and hence tend to use just a few independent biological replicates based on preliminary power analyses. The resulting data may be sufficient to estimate arithmetic means of metabolite levels but do not enable analysis of the linear control structure between different biological conditions. One of the challenges in calculating linearity networks is to compute the likelihood or significance of the presence of a truly linear relationship, with the aim of excluding both false negative and false positive detections of linearities.
Estimating optimal linearity parameters was solved decades ago for cases in which linear dependence of variables could be reasoned from background knowledge. However, in metabolic data sets, the control structure of metabolites is unknown a priori. Therefore, two fundamental questions need to be answered:
(a) For which pairs of variables can a linear relationship be hypothesized?
(b) Are there subsets of data that reflect differences in the linear behavior of variables? For example, linearity may be given for only one group of data but absent in another group, or the linearity parameters may differ between these groups.
Linear relationships must be detected in an unbiased and observer-independent manner.
Subsets of data need to be grouped according to the presence of (multiple) linear relationships.
Criteria have to be applied that verify linear hypotheses based on test statistics.
Technical errors: varying degrees of analytical-chemical measurement error and missing data have to be accounted for.
In particular, the potential presence of multiple linear relationships and the independence of both variables pose problems for simple regression analyses. As a substitute for regression, the degree of correlation has been used for detecting linear relationships, despite the fact that correlation only relates the covariance to the total variance but does not verify genuine linearities. Moreover, Pearson's correlation coefficients lack robustness against outliers, especially for multivariate datasets, and a number of different approaches have been suggested to link estimates to better test statistics. In practice, however, empirical or heuristic thresholds are used to distinguish strong from weak correlations, but no mathematical basis exists on which such thresholds can safely be founded. In some cases, p-values from Student's t-statistics have been used in an effort to validate Pearson's correlations. Unfortunately, such p-values only describe the significance of the non-randomness of data pairs but do not test the hypothesis that data pairs can be described by (single or multiple) linear functions. Consequently, correlation networks based on Pearson's correlations may be strongly distorted.
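As a brief illustration of this lack of robustness, the following sketch (hypothetical values, assuming NumPy is available; not data from this study) shows how a single aberrant sample can turn two unrelated variables into an apparently strong Pearson correlation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two metabolites with no true linear relationship (independent noise).
x = rng.normal(size=20)
y = rng.normal(size=20)
r_clean = np.corrcoef(x, y)[0, 1]

# A single extreme outlier (e.g. one mis-integrated peak) is enough to
# push the correlation coefficient towards +1.
x_out = np.append(x, 50.0)
y_out = np.append(y, 50.0)
r_outlier = np.corrcoef(x_out, y_out)[0, 1]

print(f"r without outlier: {r_clean:.2f}")    # close to 0
print(f"r with one outlier: {r_outlier:.2f}")  # close to 1
```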
A further approach uses partial correlations that deconvolute contributions by additional parameters in order to reduce the list of correlations to basic dependencies, which may provide a link from correlation to causality [22, 23]. This method is valuable for investigating the control structure within a given correlation network, but it does not remove the principal robustness problem of correlation estimates. Simple correlation coefficients always decline with increasing variance that is introduced by method errors during data acquisition. In contrast, partial correlation coefficients may increase, decrease, or even change their algebraic sign with increasing method errors. In order to remedy this situation, scientists tend to select high Pearson's correlation thresholds, which implies that the variance caused by method errors is small in relation to the biological variance. The latter assumption is often true when comparing widely differing metabolic phenotypes such as certain mutant genotypes, or severe stress conditions such as acute (metabolic) diseases in comparison to healthy states. However, metabolic theory predicts that even incremental changes in enzymatic properties can have large effects on metabolic control, especially when multiple enzymes are affected. Such changes might be too subtle to cause large differences in average concentrations but would still affect the pathway control structure and, hence, linearities in pairs of metabolite data. Consequently, the metabolic control structure can only be assessed with a robust tool for linearity detection.
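For reference, a first-order partial correlation can be computed from the three pairwise Pearson coefficients using the standard textbook formula; the sketch below uses illustrative values (not results from this study) to show how a strong pairwise correlation can shrink once a shared third variable is controlled for:

```python
import numpy as np

def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y after removing the
    linear contribution of a third variable z."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# A strong pairwise correlation largely explained by a shared driver z.
print(partial_corr(r_xy=0.80, r_xz=0.85, r_yz=0.90))  # ~0.15
```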
We here present a different approach. Using the Bayesian law, a likelihood formula is derived that is based on information about the measurement error of a specific technical method. This formula is then transformed in a way that allows searching for local maxima of linear parameters within the total hypothesis space. Such likelihood maxima are subsequently assessed for residuals of the corresponding linearity parameters using simulated test statistics. We demonstrate the power of this approach using a synthetic data set with a given set of true linear relationships, which is subsequently subjected to both increasing technical errors and increasing numbers of samples.
In what follows we collect the coefficients of the above equation in a vector $\alpha = (\alpha_1, \ldots, \alpha_n)^t$ and express the linear relationship as $\alpha^t x_{\bullet j} = \beta$. By $x_{\bullet j}$ we denote the metabolic profile of the $j$th sample. The entirety of the parameters $\alpha$ and $\beta$ is denoted by A. The probability p(A) is the a priori probability of the parameters $\alpha$ and $\beta$ before the measurement has been performed. Because no preference can be given, the a priori probability is constant for all A. The same is true for p(B), with B representing the measured metabolite concentrations. Therefore, p(A | B) is the likelihood for parameters A if the pair of variables B is given. We have used an unbiased approach here, assuming random and unrelated technical errors; we cannot know beforehand whether a certain metabolite will be detectable, nor how large its concentration might be. These assumptions result in constant probabilities p(A) and p(B), because otherwise certain values of A and B would be more likely than others. From the Bayesian law it can be concluded that
$$p(A \mid B) = c \cdot p(B \mid A). \qquad (6)$$
The constant c can be neglected because the objective is to compare different linear hypotheses. Consequently, a hypothetical metabolic profile given a linear hypothesis has the same (relative) probability as a linear hypothesis given a measured metabolic profile. The expression p(B | A) is therefore the probability that the measured metabolic profile B is obtained for a given set of parameters A, which can be calculated using the function that describes the probability distribution of the technical error. We are now in a position to state the following general theorem:
Let $x = (x_1, \ldots, x_n)^t$ contain the measurements of n metabolites in a biological sample with technical errors that follow a Gaussian distribution with covariance matrix $\Sigma$. Let an $(n-N)$-dimensional surface in the n-dimensional metabolite space be defined by the equations
$$\alpha_{k1} x_1 + \ldots + \alpha_{kn} x_n = \beta_k \quad \text{for } k = 1, \ldots, N. \qquad (7)$$
The coefficients of these equations comprise a matrix $\alpha = (\alpha_{ki})$ and a vector $\beta = (\beta_1, \ldots, \beta_N)^t$. The matrix elements can also be arranged in vectors $\alpha_k = (\alpha_{k1}, \ldots, \alpha_{kn})^t$. It is assumed that the hyperplanes defined by (7) are pairwise orthogonal with respect to the covariance matrix, i.e.
$$\alpha_k^t \Sigma \alpha_l = 0 \quad \text{for } k, l = 1, \ldots, N \text{ and } k \neq l. \qquad (8)$$
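To make the role of condition (8) explicit: with Gaussian technical errors, the projections $\alpha_k^t x$ are jointly Gaussian with covariances $\alpha_k^t \Sigma \alpha_l$, so (8) renders them mutually independent and the probability of a measured profile given all N constraints factorizes into one-dimensional Gaussian terms. The display below is a sketch of this consequence under the stated assumptions, not a reproduction of the theorem's exact formulation:

$$p(B \mid A) \;\propto\; \prod_{k=1}^{N} \exp\!\left(-\,\frac{(\alpha_k^t x - \beta_k)^2}{2\, \alpha_k^t \Sigma\, \alpha_k}\right).$$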
Maximizing p(A | B) or L(A | B) gives the maximum likelihood. The resulting estimator for the considered linear relationship is called the simple ML-estimator. The likelihood takes values between 0 and 1. Likelihoods differ from probabilities in that p(B | A) is maximized in order to find the most likely parameters for a given hypothesis, here a linear hypothesis.
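As an illustration of such a maximization for a single pair of variables, the sketch below scans a grid of linear hypotheses $\cos\theta \cdot x + \sin\theta \cdot y = \beta$ and records the log-likelihood of each, assuming independent Gaussian technical errors with known standard deviations. The grid resolution, the parametrization by an angle, and the per-point Gaussian term are illustrative assumptions, not the authors' published implementation; local maxima of the resulting surface correspond to candidate (and possibly multiple) linear relationships, which would still have to pass the residual test statistics before being accepted.

```python
import numpy as np

def linearity_scan(x, y, sd_x, sd_y, n_angles=180, n_offsets=200):
    """Scan linear hypotheses cos(t)*x + sin(t)*y = beta for one pair of
    variables and return the log-likelihood surface over (angle, offset).
    Local maxima of this surface are candidate linear relationships."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    thetas = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    surface = np.empty((n_angles, n_offsets))
    for i, t in enumerate(thetas):
        a1, a2 = np.cos(t), np.sin(t)
        proj = a1 * x + a2 * y                       # alpha^t x for each sample
        var = (a1 * sd_x) ** 2 + (a2 * sd_y) ** 2    # alpha^t Sigma alpha
        betas = np.linspace(proj.min(), proj.max(), n_offsets)
        # Gaussian log-likelihood of each offset, summed over all samples
        surface[i] = [np.sum(-0.5 * (proj - b) ** 2 / var) for b in betas]
    return thetas, surface
```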
Additional File 2 shows the impact of the magnitude of the constant c on the total likelihood. It is demonstrated empirically that the total likelihood is not decreased by any c ≥ 1. In what follows we fix the constant at the value c = 1 and consider an adapted ML-estimator that is constructed by maximization of $L_1(A \mid B)$.
Additional considerations are outlined for the case of missing data (NaNs, not-a-number), which are often found in metabolomic data sets. In such cases, the probability function p(B | A) needs to be adapted. This function then represents the information about the missing value: for example, data could be lost due to measurement instrument malfunctions, or the variable (i.e. a metabolite level) might be below the limit of detection in a given biological situation. In both cases, the probability density function is uniform, i.e. the probability is constant over a certain range. For the case of 'below detection limit', the probability density is limited to a bounded range, whereas for the case of 'instrument malfunction' the probability density approaches zero at all levels. For both cases, however, the integral over the density is one (we must assume a false negative metabolite detection). If we knew the true cause of the missing value (i.e. either false negative or true negative), the correct probability density function could be modeled. For now, however, we need to assume an infinite technical error, which requires adding the maximal likelihood of ln 2 for these data points. Accordingly, missing data do not have a diminishing impact on L. An extreme case of a data set that exclusively comprises missing data would result in a maximal likelihood for arbitrary linear hypotheses. This interpretation is correct because all hypotheses would be equally probable and could not be denied, which, however, results in an interpretive power of zero. Consequently, for real cases a maximal number of missing values needs to be defined in order to deny linear hypotheses that lack explanatory power. The upper limit on the number of such missing data points has to be set by the user, who may draw on further biological or analytical background information for individual metabolite pairs.
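A minimal sketch of this bookkeeping is given below. The function name, the NaN encoding of missing values, and the hard rejection rule are illustrative assumptions; the constant ln 2 contribution per missing value and the user-defined ceiling follow the description above. The per-sample terms would hold the logarithms of the Gaussian contributions computed as sketched earlier, with NaN marking a missing measurement.

```python
import math

def hypothesis_log_likelihood(per_sample_terms, max_missing):
    """Accumulate per-sample log-likelihood contributions for one linear
    hypothesis. Missing measurements (NaN) are treated as having infinite
    technical error and contribute the maximal per-point term ln 2, so they
    never diminish L; hypotheses resting on too many missing values are
    rejected outright."""
    n_missing = sum(1 for t in per_sample_terms if math.isnan(t))
    if n_missing > max_missing:
        return None  # hypothesis denied: insufficient explanatory power
    return sum(math.log(2) if math.isnan(t) else t for t in per_sample_terms)
```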
The adapted ML-estimator considers technical errors.
The adapted ML-estimator detects linear patterns and groups subsets of data accordingly.
The adapted ML-estimator is robust against outliers.
The adapted ML-estimator relies on background information on missing values and therefore does not distort interpretations.
Therefore, the adapted ML-estimator realizes a solution to several of the challenges of unbiased and robust detection of multiple linear hypotheses in complex data sets.
We here assume that linear relationships between two metabolites are obscured only by technical errors and not by other biological factors, so the degree of noise is induced by technical errors alone. In order to test the algorithm described above, a data set was simulated that closely reflects this problem. The model data set comprised 200 variables which were grouped into 20 clusters of equal size. All variables within a cluster were described by a linear relationship y = ax + b, but between clusters no linearity was modeled apart from random relationships. For each test, a different number of samples was drawn to mimic experimental data from metabolomic snapshots, and various levels of technical error were added to the modeled measurements. Technical errors were assumed to follow a Gaussian distribution. In total, the data set was investigated for 19,900 pair-wise relationships, of which 900 were modeled to be described by linear relationships, in order to assess false positive and false negative error rates. The parameters were varied exhaustively over sample sizes of 3–150 samples and relative technical errors of 0.1–100%. Error rates were determined four times for each combination of 'number of samples' and 'technical error', and the average of these four determinations was taken. Technical errors may be divided into an absolute error and a relative error. The absolute error, for example, is constituted by the resolution of an analytical instrument or by a constant background caused by chemical impurities of reagents and solvents. Such errors can be carefully controlled in validated chemical procedures and are usually less important than relative errors. In most cases in metabolomics, the total technical error is dominated by relative errors that scale with the true value, originating for example from sample storage, extraction and sample preparation procedures, and from cross-contamination and carry-over between samples. Technical errors can be estimated by reproducing all sample preparation steps multiple times from small aliquots of a larger homogenized pool, followed by data acquisition. The magnitude of relative errors varies with the vulnerability of the compounds to alteration during the sample preparation and measurement process. However, for the sake of clarity, identical technical errors were used in the simulations for each pair of metabolites.
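The sketch below reproduces the structure of such a simulation. The 200 variables, 20 equal-sized clusters, within-cluster linearity y = ax + b and relative Gaussian technical errors follow the description above; the chosen sample size, error level, slope and intercept ranges and the use of one latent driver per cluster are illustrative assumptions rather than the study's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples = 30         # varied from 3 to 150 in the study
n_vars, n_clusters = 200, 20
cluster_size = n_vars // n_clusters   # 10 variables per cluster
rel_error = 0.10       # relative technical error (varied 0.1-100%)

data = np.empty((n_samples, n_vars))
for c in range(n_clusters):
    # one latent driver per cluster; all members are linear in it (y = a*x + b)
    driver = rng.uniform(1.0, 10.0, size=n_samples)
    for i in range(cluster_size):
        a, b = rng.uniform(0.5, 2.0), rng.uniform(0.0, 5.0)
        true_vals = a * driver + b
        # Gaussian technical error proportional to the true value
        data[:, c * cluster_size + i] = true_vals * (1 + rel_error * rng.normal(size=n_samples))

# 200 variables -> 19,900 unordered pairs; 20 * C(10, 2) = 900 truly linear pairs
```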
Using the technical error together with a maximum likelihood assessment of linearity parameters and verification by simulated test statistics enables robust detection and verification of linear relationships in complex data sets. An implementation of this algorithm will enable biologists to calculate and compare linearity networks in metabolomic or other multivariate data sets, from which biological hypotheses may be derived. The algorithm can be adjusted with respect to the ratio of type I and type II errors depending on the biological focus of a study. It is highly advisable to use more than 20 biological replicates for each condition that is to be tested in a biological experimental design of genotypes × environments (G × E), unless advances in analytical chemistry and instrumentation decrease the overall technical error to very low levels, i.e. below 5%. Even the existence of more than one linear relationship per pair of variables can be detected using the maximum likelihood algorithm, which has so far been hard to compute with classical approaches.
The work was funded by the NIEHS through the R01 project ES13932 granted to OF and by a fellowship granted to FK by the Max-Planck Society, Germany. Helpful comments by Joachim Selbig are appreciated.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.