# Robust detection and verification of linear relationships to generate metabolic networks using estimates of technical errors

Frank Kose^{1}, Jan Budczies^{2}, Matthias Holschneider^{3} and Oliver Fiehn^{4}

**8**:162

https://doi.org/10.1186/1471-2105-8-162

© Kose et al; licensee BioMed Central Ltd. 2007

**Received:** 25 July 2006

**Accepted:** 21 May 2007

**Published:** 21 May 2007

## Abstract

### Background

The size and magnitude of the metabolome, the ratios between individual metabolites and the response of metabolic networks are controlled by multiple cellular factors. A tight control over metabolite ratios will be reflected by a linear relationship between pairs of metabolites due to the flexibility of metabolic pathways. Hence, unbiased detection and validation of linear metabolic variance can be interpreted in terms of biological control. For robust analyses, criteria for rejecting or accepting linearities need to be developed despite technical measurement errors. The entirety of all pairwise linear metabolic relationships then yields insights into the network of cellular regulation.

### Results

The Bayesian law was applied to detect linearities, which are then validated by explaining the residuals by the degree of technical measurement error. Test statistics were developed and the algorithm was tested on simulated data using 3–150 samples and 0–100% technical error. Under the null hypothesis that a linear relationship exists, type I errors remained below 5% for data sets consisting of more than four samples, whereas the type II error rate rose quickly with increasing technical errors. Conversely, a filter was developed to balance the error rates in the opposite direction. A minimum of 20 biological replicates is recommended if technical errors remain below 20% relative standard deviation and if error rates of less than 5% are required. The algorithm proved to be robust against outliers, unlike Pearson's correlations.

### Conclusion

The algorithm facilitates finding linear relationships in complex datasets, which is radically different from estimating linearity parameters from given linear relationships. Without the filter, it provides high sensitivity and fair specificity. If the filter is activated, it yields high specificity but only fair sensitivity. Total error rates are more favorable with the filter deactivated, and hence metabolomic networks should be generated without the filter. In addition, Bayesian likelihoods facilitate the detection of multiple linear dependencies between two variables. This property of the algorithm enables its use as a discovery tool and for generating novel hypotheses about the existence of otherwise hidden biological factors.

## Background

- (I) concentrations change, and variance hence increases, when the experimental conditions are changed intentionally, for example by altering environmental parameters such as external nutrients or by using different genotypes [8],

- (II) metabolite data are found to vary in a stochastic manner caused by the imprecision of the analytical method [9] used for acquiring metabolite data and

- (III) interestingly, even under very controlled environmental conditions, a high degree of biological variation is found for metabolite levels due to stochastic biological events that trickle through the biochemical network and thus reflect the underlying control structure at this particular biological condition [6].

Therefore, if enough biological replicates are analyzed for a given organism at a given physiological situation, the metabolic phenotype can be investigated not only by its corresponding average metabolic values, but also by a snapshot of its corresponding metabolic network. However, biologists often do not know the inherent biological variability in advance and hence tend to use just a few independent biological replicates based on preliminary power analysis. Resulting data may be sufficient to estimate arithmetic means of metabolic levels but do not enable analyzing the linear control structure between different biological conditions. One of the challenges for calculating linearity networks is to compute the likelihood or significance of the presence of a truly linear relationship, with the aim of excluding both false negative and false positive detections of linearities.

*a priori*. Therefore, two fundamental questions need to be answered:

- (a) For which pairs of variables can a linear relationship be hypothesized?

- (b) Are there subsets of data that reflect differences in linear behavior of variables? For example, linearity may be given for only one group of data but absent in another group, or the linearity parameters between these groups may be different.

- (1) Linear relationships must be detected in an unbiased and observer-independent manner.

- (2) Subsets of data need to be grouped according to the presence of (multiple) linear relationships.

- (3) Criteria have to be applied that verify linear hypotheses based on test statistics.

- (4) Technical errors: varying degrees of analytical-chemical measurement error and missing data have to be accounted for.

In particular, the potential presence of multiple linear relationships and the independence of both variables pose problems for simple regression analyses. As a substitute for regression, the degree of correlation has been used for detecting linear relationships, despite the fact that correlation only relates the covariance to the total variance and does not verify genuine linearities. Moreover, Pearson's correlation coefficients lack robustness against outliers, especially for multivariate datasets, and a number of different approaches have been suggested to link estimates to better test statistics [18]. In practice, however, empirical or heuristic thresholds are used to distinguish strong from weak correlations, but no mathematical basis exists on which such thresholds can safely be founded. In some cases, *p*-values from Student's statistics have been used in an effort to validate Pearson's correlations [19]. Unfortunately, such *p*-values only describe the significance of the non-randomness of data pairs but do not test the hypothesis that data pairs can be described by one or several linear functions. Consequently, correlation networks based on Pearson's correlations may be strongly distorted [20].

A further approach uses partial correlations, which deconvolute contributions from additional parameters in order to reduce the list of correlations to basic dependencies [21] and which may provide a link from correlation to causality [22, 23]. This method is valuable for investigating the control structure within a given correlation network, but it does not remove the principal robustness problem of correlation estimates. Simple correlation coefficients always decline with increasing variance introduced by method errors during data acquisition. In contrast, partial correlation coefficients may increase, decrease or even change their algebraic sign with increasing method errors [20]. In order to remedy this situation, scientists tend to select high thresholds for Pearson's correlation [24], which implies that the variance caused by method errors is small in relation to the biological variance. The latter assumption is often true when comparing widely different metabolic phenotypes such as certain mutant genotypes, or severe stress conditions such as acute (metabolic) diseases in comparison to healthy states. However, metabolic theory predicts that even incremental changes in enzymatic properties can have large effects on metabolic control, especially when multiple enzymes are affected [25]. Such changes might be too subtle to cause large differences in average concentrations but would still affect the pathway control structure and hence the linearities in pairs of metabolite data. Consequently, the metabolic control structure can only be assessed with a robust tool for linearity detection.
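The two weaknesses of Pearson's correlation named in this section, attenuation by method error and sensitivity to a single outlier, can be illustrated with a small simulation. This is a minimal sketch on synthetic data; the slope, the error levels, the outlier value and the helper names `pearson_r` and `with_relative_error` are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def with_relative_error(values, rel_sd, rng):
    """Multiply each value by (1 + rel_sd * z) with z drawn from N(0, 1)."""
    return values * (1.0 + rel_sd * rng.standard_normal(values.shape))

rng = np.random.default_rng(0)
x_true = rng.uniform(1.0, 10.0, 1000)
y_true = 2.0 * x_true + 1.0            # exact linear relationship (illustrative)

# Same underlying relationship, two levels of relative technical error
r_low = pearson_r(with_relative_error(x_true, 0.05, rng),
                  with_relative_error(y_true, 0.05, rng))
r_high = pearson_r(with_relative_error(x_true, 0.50, rng),
                   with_relative_error(y_true, 0.50, rng))

# A single gross outlier in an otherwise perfectly linear set of 20 points
x_small = np.arange(1.0, 21.0)
y_small = 2.0 * x_small + 1.0
y_outlier = y_small.copy()
y_outlier[0] = 1000.0
r_clean = pearson_r(x_small, y_small)
r_outlier = pearson_r(x_small, y_outlier)

print(f"r at 5% error: {r_low:.2f}, r at 50% error: {r_high:.2f}")
print(f"r without outlier: {r_clean:.2f}, r with one outlier: {r_outlier:.2f}")
```

In this sketch the linear relationship is identical in all cases; only the error level or a single data point changes, which is why a fixed correlation threshold cannot serve as a test for linearity.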

We here present a different approach. Using the Bayesian law [26], a likelihood formula is derived that is based on information about the measurement error of a specific technical method. This formula is then transformed in a way that allows searching for local maxima of linear parameters within the total hypothesis space. Such likelihood maxima are subsequently assessed via the residuals of the corresponding linearity parameters using simulated test statistics. We demonstrate the power of this approach using a synthetic data set with a given set of true linear relationships, which is subsequently subjected to both increasing technical errors and an increasing number of samples.

## Results and Discussion

### (1) A model for the technical error in metabolomic data

Let {*x*_{ij}} denote the entirety of *n* metabolite measurements in a collection of *m* samples. The measurements can be arranged in a matrix *x*_{ij} with rows *i* = 1, ..., *n* that refer to the metabolites and columns *j* = 1, ..., *m* that refer to the samples. Each measurement results from the sum of the true metabolite content ${{x}^{\prime}}_{ij}$ and a technical error *e*_{ij},

$$x_{ij} = {{x}^{\prime}}_{ij} + e_{ij}.$$

The technical errors *e*_{ij} include the chemical-analytical error, but can also include a contribution from different storage conditions or storage times of the biomaterial after its extraction. The technical variance of the *j*th measurement can be derived from a probability density function *ρ*_{j} that reflects knowledge about sample storage and data acquisition. For missing data, it is only known that these can be expected in a defined range but with uniform probability distribution. For non-missing data, the technical error is modeled by a multivariate normal distribution that is centered around zero. More precisely, the probability density for the technical error *e*_{j} = (*e*_{1j}, ..., *e*_{nj})^{t} of the *j*th metabolic profile is given by

$$\rho_j(e_j) = \frac{1}{\sqrt{(2\pi)^{n}\,\det \Sigma_j}}\,\exp\left(-\tfrac{1}{2}\,e_j^{t}\,\Sigma_j^{-1}\,e_j\right).$$

The covariance matrix Σ_{j} can be estimated from the covariance matrix of replicated measurements of the same biomaterial. In practice, correlations between the technical errors of different metabolites are often disregarded, leading to a model with diagonal variance matrices ${\Sigma}_{j}=\text{diag}\left({\sigma}_{1j}^{2},\mathrm{...},{\sigma}_{nj}^{2}\right)$. Further, one often works with fixed absolute or fixed relative errors, that is, with standard deviations *σ*_{ij} that are either independent of or proportional to the metabolite content, respectively.
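The diagonal error model can be sketched in a few lines. The example below simulates replicate measurements of one homogenized pool with a fixed relative error and recovers the per-metabolite technical standard deviation from the replicates. The function names, the pool values and the way absolute and relative contributions are combined (as independent variances) are assumptions made for illustration only.

```python
import numpy as np

def estimate_technical_sd(replicates):
    """Per-metabolite technical standard deviation from replicate profiles.

    replicates has shape (n_metabolites, n_replicates). Correlations between
    metabolites are ignored, which corresponds to a diagonal covariance matrix.
    """
    replicates = np.asarray(replicates, dtype=float)
    return replicates.std(axis=1, ddof=1)

def add_technical_error(x_true, rel_sd=0.0, abs_sd=0.0, rng=None):
    """Return x = x' + e with independent zero-mean Gaussian errors.

    Assumption: absolute and relative contributions are independent, so
    their variances add.
    """
    rng = np.random.default_rng() if rng is None else rng
    x_true = np.asarray(x_true, dtype=float)
    sd = np.sqrt(abs_sd**2 + (rel_sd * x_true) ** 2)
    return x_true + sd * rng.standard_normal(x_true.shape)

rng = np.random.default_rng(1)
pool = np.array([[10.0], [50.0], [200.0]])       # true content of 3 metabolites
replicates = add_technical_error(np.repeat(pool, 2000, axis=1), rel_sd=0.10, rng=rng)

sd_hat = estimate_technical_sd(replicates)       # estimated sigma_i
rel_sd_hat = sd_hat / replicates.mean(axis=1)    # estimated relative error
print(np.round(rel_sd_hat, 3))
```

With a purely relative error, the estimated standard deviation scales with the metabolite content while the relative error stays close to the simulated 10% for all three metabolites.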

### (2) Maximum Likelihood (ML) function for a general linear problem

In what follows we collect the coefficients of the above equation in a vector *α* = (*α*_{1}, ..., *α*_{n})^{t} and express the linear relationship as *α*^{t}*x*_{•j} = *β*. By *x*_{•j} we denote the metabolic profile of the *j*th sample. The entirety of the parameters *α* and *β* is denoted by *A*. The probability *p*(*A*) is the *a priori* probability of the parameters *α* and *β* before the measurement has been performed. Because no preference can be given, the *a priori* probability is constant for all *A*. The same is true for *p*(*B*), with *B* representing the measured metabolite concentrations. Therefore, *p*(*A* | *B*) is the likelihood for the parameters *A* if the pair of variables *B* is given. We have used an unbiased approach here, assuming random and unrelated technical errors, and we cannot know beforehand whether a certain metabolite will be detectable or not, or how large the concentration of such a metabolite could be. These assumptions result in constant probabilities *p*(*A*) and *p*(*B*), because otherwise certain values for *A* and *B* would be more likely than others. From the Bayesian law it can be concluded that

*p* (*A* | *B*) = *c* · *p* (*B* | *A*).

The constant value *c* can be neglected because the objective is to compare different linear hypotheses. Consequently, a hypothetical metabolic profile has the same probability at a given linear hypothesis as a hypothetical linear hypothesis at a given metabolic profile. The expression *p* (*B*|*A*) is therefore the probability that the measured metabolic profile *B* is determined at a given set of parameters *A*, which can be calculated using the function which describes the probability distribution of the technical error. We are now in position to state the following general theorem:

*Let x* = (*x*_{1}, ..., *x*_{n})^{t} *include the measurements of n metabolites in a biological sample with technical errors that follow a Gaussian distribution with covariance matrix Σ. Let a (n-N)-dimensional surface in the n-dimensional metabolite space be defined by the equations*

*α*_{k1}*x*_{1} + ... + *α*_{kn}*x*_{n} = *β*_{k} *for k* = 1, ..., *N*. (8)

*The coefficients of these equations comprise a matrix α = (α*_{ki}*) and a vector β = (β*_{1}*, ..., β*_{N}*)*^{t}*. The matrix elements can also be arranged in vectors α*_{k} = (*α*_{k1}, ..., *α*_{kn})^{t}. *It is assumed that the hyperplanes defined by (8) are orthogonal in pairs with respect to the covariance matrix, i.e.*

*α*_{k}^{t} Σ *α*_{l} = 0 *for k*, *l* = 1, ..., *N and k* ≠ *l*.

*Then, the likelihood for the metabolite concentrations to lie on the (n-N)-dimensional surface is given by*

*σ*_{1}, *σ*_{2} and technical covariance *σ*_{12}. Then, the likelihood for the metabolite concentration to lie on the straight line *α*_{1}*x*_{1} + *α*_{2}*x*_{2} = *β* is given by

*A* = {*α*_{ki}, *β*_{k}} after measurement of the metabolomic data *B* = {*x*_{ij}} the result

Maximizing *p*(*A*|*B*) or *L*(*A*|*B*) gives the maximum likelihood. The resulting estimator for the considered linear relationship is called the simple ML-estimator. The likelihood takes values between 0 and 1. Likelihoods differ from probabilities [29] in that *p*(*B*|*A*) is maximized in order to find the most likely parameters for a given hypothesis, here a linear hypothesis.
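For the two-metabolite case, the form of the per-sample likelihood can be sketched from the error model alone. The following is a worked derivation under stated assumptions, not a reproduction of the original display formulas: *σ*_{1} and *σ*_{2} are taken to be the technical standard deviations, the true contents are assumed to satisfy the linear relationship exactly, and each factor is scaled so that its maximum is 1 (this normalisation is an assumption, chosen to agree with the statement that the likelihood takes values between 0 and 1).

$$
\begin{aligned}
r_j &= \alpha_1 x_{1j} + \alpha_2 x_{2j} - \beta = \alpha_1 e_{1j} + \alpha_2 e_{2j},\\
\operatorname{Var}(r_j) &= \alpha^{t}\,\Sigma\,\alpha = \alpha_1^{2}\sigma_1^{2} + 2\,\alpha_1\alpha_2\,\sigma_{12} + \alpha_2^{2}\sigma_2^{2},\\
p(x_{\bullet j}\mid A) &\propto \exp\left(-\frac{r_j^{2}}{2\,\alpha^{t}\,\Sigma\,\alpha}\right).
\end{aligned}
$$

The residual *r*_{j} is a linear combination of Gaussian errors and is therefore itself Gaussian with variance *α*^{t}Σ*α*. Under the simple ML-estimator the likelihood of the whole data set *B* is the product of these factors over all samples *j*, so its logarithm is the sum of the exponents.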

### (3) An adapted Maximum Likelihood estimator for robust verification of linear hypotheses

*p*(*B*|*A*) can never become larger than one of its factors, and it comprises exactly one global maximum. Consequently, just a single outlier may decrease *p*(*B*|*A*) significantly. Unfortunately, outliers are frequently found in biological data sets due to both the multitude of factors in biological cells and the complexity of data acquisition methods that may result in false positive data points. Furthermore, it is still unclear how many linear relationships exist for a given pair of variables. Both issues are addressed by introducing a decoupling term, the constant *c*, to the likelihood function,

Additional File 2 gives the impact of the magnitude of the constant *c* on the total likelihood. It is demonstrated in an empirical way that the total likelihood is not decreased by any *c* ≥ 1. In what follows we fix the constant at the value *c* = 1 and consider an adapted ML-estimator that is constructed by maximization of *L*_{1}(*A* | *B*).
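The effect of the decoupling constant can be illustrated numerically. The display formula for the adapted likelihood is not reproduced in this text, so the sketch below assumes the form *L*_{c} = Σ_{j} ln(*c* + *p*_{j}), where *p*_{j} is the per-sample likelihood scaled to a maximum of 1. This assumed form agrees with the statements that each data point contributes at most ln 2 for *c* = 1. All function names and numerical values are illustrative.

```python
import numpy as np

def point_likelihoods(x1, x2, alpha, beta, sigma1, sigma2, sigma12=0.0):
    """Per-sample likelihood (maximum 1) of lying on alpha1*x1 + alpha2*x2 = beta."""
    a1, a2 = alpha
    resid = a1 * np.asarray(x1, float) + a2 * np.asarray(x2, float) - beta
    var = a1**2 * sigma1**2 + 2.0 * a1 * a2 * sigma12 + a2**2 * sigma2**2
    return np.exp(-resid**2 / (2.0 * var))

def simple_log_likelihood(p):
    """Simple ML-estimator: logarithm of the product of all factors."""
    with np.errstate(divide="ignore"):
        return float(np.sum(np.log(p)))

def adapted_log_likelihood(p, c=1.0):
    """Assumed form of the adapted ML-estimator: sum over ln(c + p_j)."""
    return float(np.sum(np.log(c + p)))

# 20 samples exactly on the line x2 = 2*x1 + 1, written as 2*x1 - x2 = -1
x1 = np.linspace(1.0, 10.0, 20)
x2 = 2.0 * x1 + 1.0
alpha, beta = (2.0, -1.0), -1.0
p_clean = point_likelihoods(x1, x2, alpha, beta, sigma1=0.1, sigma2=0.1)

# The same data with one outlier
x2_out = x2.copy()
x2_out[5] += 5.0
p_out = point_likelihoods(x1, x2_out, alpha, beta, sigma1=0.1, sigma2=0.1)

drop_simple = simple_log_likelihood(p_clean) - simple_log_likelihood(p_out)
drop_adapted = adapted_log_likelihood(p_clean) - adapted_log_likelihood(p_out)
print(f"loss by one outlier: simple {drop_simple:.1f}, adapted {drop_adapted:.3f}")
```

Under this assumed form, a single outlier can lower the adapted likelihood by no more than ln 2, whereas the loss of the simple estimator grows with the squared distance of the outlier from the line.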

*L* of ln 2 in the worst case. Figure 1 shows two plots, each representing 30 data pairs. In the upper panels, the 30 data pairs follow a hypothetical linear function with an additional modeled analytical measurement error. The lower panels represent a data set in which 15 of the 30 data points are shifted by a constant value, so that two linear functions exist. For each of the two examples, likelihood distributions are given across a part of the hypothesis space for the simple ML estimator (middle panels B) and the adapted ML estimator (right-hand panels C). For the data set that comprises two likely linear functions, the simple ML estimator only registers a shift of its single maximum and fails to detect two local maxima corresponding to the two linearities. In contrast, the adapted ML estimator correctly identifies both local maxima and is thus able to detect the most likely parameters for both linear functions. In fact, the local likelihood maximum of the original single linear function does not shift for the adapted ML estimator when 15 data points are shifted by a constant value; the shift only decreases the maximal attainable likelihood. If all measured data are assigned to different linear hypotheses according to their corresponding likelihoods, criterion (2) given above in the section 'Background' is fulfilled: subsets of data are now grouped according to the presence of (multiple) linear relationships.
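A one-dimensional scan through the hypothesis space reproduces the qualitative behavior described for figure 1. The sketch below is a simplified setting, not the original computation: only the intercept is scanned at a known slope, only *x*_{2} carries an error, the adapted likelihood is assumed to be Σ_{j} ln(1 + *p*_{j}), and evenly spread deviations stand in for random technical error so that the example is reproducible.

```python
import numpy as np

slope, sd = 2.0, 0.2                          # assumed slope and technical sd of x2
x1 = np.tile(np.linspace(0.5, 10.0, 15), 2)   # 30 data pairs
deviation = sd * np.tile(np.linspace(-1.5, 1.5, 15), 2)
shift = np.repeat([0.0, 5.0], 15)             # 15 of 30 points transposed by +5
x2 = slope * x1 + 1.0 + shift + deviation

intercepts = np.linspace(-2.0, 9.0, 221)      # scanned part of the hypothesis space
resid = x2[None, :] - slope * x1[None, :] - intercepts[:, None]
log_p = -resid**2 / (2.0 * sd**2)             # ln of the per-sample likelihood

simple = log_p.sum(axis=1)                    # simple ML: ln of the product
adapted = np.log1p(np.exp(log_p)).sum(axis=1) # assumed adapted form with c = 1

def local_maxima(values):
    """Indices of strict interior local maxima of a 1-D array."""
    v = np.asarray(values)
    return [i for i in range(1, len(v) - 1) if v[i] > v[i - 1] and v[i] > v[i + 1]]

simple_peaks = intercepts[local_maxima(simple)]
adapted_peaks = intercepts[local_maxima(adapted)]
print("simple ML maxima at intercepts: ", np.round(simple_peaks, 2))
print("adapted ML maxima at intercepts:", np.round(adapted_peaks, 2))
```

In this setting the simple estimator has a single maximum between the two groups, while the adapted estimator keeps one maximum at each of the two modeled intercepts (1 and 6).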

Additional considerations are outlined for the case of missing data (NaNs, not-a-number), which are often found in metabolomic data sets. In such cases, the probability function *p*(*B*|*A*) needs to be adapted. This function then represents the information about the missing value: for example, data could be lost due to malfunctions of the measurement instrument, or the variable (i.e. a metabolite level) might be below the limit of detection in a given biological situation. In both cases, the probability density function is uniform, i.e. the probability is constant in a certain range. For the case of 'below detection limit', the probability density is confined to a limited range; for the case of 'instrument malfunction', the probability density is zero at all levels. However, for both cases the integral over the whole range is one (we must assume a false negative metabolite detection). If we knew the true cause of the missing value (i.e. either false negative or true negative), the correct probability density function could be modeled. For now, however, we need to assume an infinite technical error, which requires adding the maximal likelihood of ln 2 for these data points. Accordingly, missing data do not have a diminishing impact on *L*. In the extreme case of a data set that exclusively comprises missing data, the likelihood would be maximal for arbitrary linear hypotheses. This interpretation is correct because all hypotheses would be equally probable and could not be rejected, which, however, results in an interpretive power of zero. Consequently, for real cases a maximal number of missing values needs to be defined in order to reject any linear hypothesis that might only be due to missing explanatory power. The upper limit of the number of such missing data has to be set by the user, who may draw on further biological or analytical background information for individual metabolite pairs.
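The treatment of missing values can be sketched as follows. The example assumes the adapted likelihood has the form Σ_{j} ln(1 + *p*_{j}), adds ln 2 for every sample with a missing value, and refuses to evaluate a pair when a user-defined limit of missing values is exceeded. The function name, the parameter `max_missing` and the simplification that only *x*_{2} carries an error are illustrative assumptions.

```python
import numpy as np

LN2 = np.log(2.0)

def adapted_log_likelihood_with_missing(x1, x2, slope, intercept, sd, max_missing):
    """Adapted log-likelihood (assumed form, c = 1) with NaN handling.

    Returns None when more than max_missing samples lack a value, so that a
    high likelihood cannot arise from missing explanatory power alone.
    """
    x1 = np.asarray(x1, float)
    x2 = np.asarray(x2, float)
    missing = np.isnan(x1) | np.isnan(x2)
    if missing.sum() > max_missing:
        return None
    resid = x2[~missing] - slope * x1[~missing] - intercept
    p = np.exp(-resid**2 / (2.0 * sd**2))
    # A missing sample cannot contradict the hypothesis: it adds the maximum ln 2.
    return float(np.log1p(p).sum() + missing.sum() * LN2)

x1 = [1.0, 2.0, np.nan, 4.0]
x2 = [3.0, 5.0, 7.0, np.nan]     # observed pairs lie on x2 = 2*x1 + 1

tolerant = adapted_log_likelihood_with_missing(x1, x2, 2.0, 1.0, sd=0.1, max_missing=2)
strict = adapted_log_likelihood_with_missing(x1, x2, 2.0, 1.0, sd=0.1, max_missing=1)
print(tolerant, strict)
```

With two of four samples missing, the tolerant call returns the maximal value 4 ln 2 although only two pairs support the hypothesis, which is why an upper limit on missing values is needed.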

- (i) The adapted ML-estimator considers technical errors.

- (ii) The adapted ML-estimator detects linear patterns and groups subsets of data accordingly.

- (iii) The adapted ML-estimator is robust against outliers.

- (iv) The adapted ML-estimator relies on background information on missing values and therefore does not distort interpretations.

Therefore, the adapted ML-estimator realizes a solution to several of the challenges of unbiased and robust detection of multiple linear hypotheses in complex data sets.

### (4) Algorithm for the detection of linear relationships

*L*, resulting in false discovery rates for which limits can be set. The core of the algorithm determines the local maximum of a likelihood distribution, which is subsequently compared to the limits of a test statistic. It can be assumed that this maximum will be the global maximum, since all measured data will be explained by the tested linear relationship. However, outliers will reduce the likelihood drastically. Consequently, data are only considered if their residuals with respect to the hypothetical linearity are small. The 2*σ* interval was chosen to exclude outliers at 95% confidence. The data inside the 2*σ* confidence interval are denoted by *B*_{return}. The likelihood is subsequently normalized to the number *m*_{return} of these data. The parameter *m*_{return} is the number of samples that were returned as belonging to a linear function despite deviations that are due to the contribution of unrelated variance. Each data point contributes a value of at most ln 2 to the likelihood function, resulting in the normalized likelihood

*m*_{return} and *L*_{return}, for which test statistics can be determined based on randomly selected true linear relationships.

*A*_{max} denotes the parameters for which the maximum likelihood is attained. The distributions of *m*_{return} and *L*_{return} were assessed by Monte Carlo simulations: for each sample size ranging from 3 to 150 data points we generated 25,000 random data sets, and test statistics were derived for each sample size *m*. The data sets were generated by selecting an arbitrary linear function and a random selection of data points corresponding to this linear function. Technical errors were sampled from Gaussian distributions and added to the data points. After localizing the maximum value of *L*_{1}, one determines all samples which belong to the corresponding linear function. Based on *L*_{1} and *m*_{return}, the value of *L*_{return} is determined as given above. The frequency distributions of *L*_{return} for different values of *m*_{return} are shown for the example of *m* = 20 samples (figure 2). Figure 2 demonstrates that the *L*_{return} distributions varied for different *m*_{return} values, and consequently, corresponding test statistics were established that set the limits for rejecting the null hypothesis at a false negative error rate of ≤ 5% for each of the *m*_{return} values.
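The core steps described in this section (locate the likelihood maximum, keep the data within 2*σ*, normalize) can be sketched for one pair of metabolites. This is a minimal illustration under several assumptions that are not taken from the original implementation: the adapted likelihood is Σ_{j} ln(1 + *p*_{j}), the normalized likelihood is *L*_{1}(*A*_{max} | *B*_{return}) / (*m*_{return} ln 2), technical errors are uncorrelated, and the maximum is found by a plain grid search over lines in normal form. The function name `detect_line` and the grid sizes are hypothetical.

```python
import numpy as np

def detect_line(x1, x2, sd1, sd2, n_theta=180, n_beta=401):
    """Grid search for the maximum of the assumed adapted likelihood L_1.

    Lines are written as cos(theta)*x1 + sin(theta)*x2 = beta.
    Returns (theta, beta, m_return, L_return).
    """
    x1 = np.asarray(x1, float)
    x2 = np.asarray(x2, float)
    radius = np.hypot(x1, x2).max()
    betas = np.linspace(-radius, radius, n_beta)
    best = (-np.inf, 0.0, 0.0)
    for theta in np.linspace(0.0, np.pi, n_theta, endpoint=False):
        a1, a2 = np.cos(theta), np.sin(theta)
        var = a1**2 * sd1**2 + a2**2 * sd2**2      # alpha^t Sigma alpha, diagonal Sigma
        resid = a1 * x1[None, :] + a2 * x2[None, :] - betas[:, None]
        L1 = np.log1p(np.exp(-resid**2 / (2.0 * var))).sum(axis=1)
        k = int(np.argmax(L1))
        if L1[k] > best[0]:
            best = (L1[k], theta, betas[k])
    _, theta, beta = best
    a1, a2 = np.cos(theta), np.sin(theta)
    var = a1**2 * sd1**2 + a2**2 * sd2**2
    resid = a1 * x1 + a2 * x2 - beta
    inside = np.abs(resid) <= 2.0 * np.sqrt(var)   # B_return: data within 2 sigma
    m_return = int(inside.sum())
    L_inside = np.log1p(np.exp(-resid[inside] ** 2 / (2.0 * var))).sum()
    L_return = float(L_inside / (m_return * np.log(2.0))) if m_return else 0.0
    return float(theta), float(beta), m_return, L_return

# 20 error-free points on the line x2 = x1 plus one outlier at (10, 40)
t = np.arange(1.0, 21.0)
x1 = np.append(t, 10.0)
x2 = np.append(t, 40.0)
theta, beta, m_return, L_return = detect_line(x1, x2, sd1=0.5, sd2=0.5)
print(np.degrees(theta), beta, m_return, L_return)
```

The example uses error-free points so that the outcome can be checked by hand: the line *x*_{2} = *x*_{1} corresponds to *θ* = 135° and *β* = 0, the outlier falls outside the 2*σ* band, and the 20 remaining points give a normalized likelihood of 1. In a Monte Carlo setting as described above, the same function would be applied to simulated true linear data to obtain the distribution of the normalized likelihood for each number of returned samples; a finer grid or a local optimizer would be needed when the technical error is small relative to the grid spacing.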

### (5) Determination of false positive and false negative error rates

*x* is given as

We here assume that linear relationships between two metabolites are obscured only by technical errors and not by other biological factors, so that the noise is induced by technical errors alone. In order to test the algorithm described above, a data set was simulated that closely describes the problem. This model data set comprised 200 variables which were grouped into 20 clusters of equal size. All variables within a cluster were described by a linear relationship *y* = *ax* + *b*, but between clusters no linearity was modeled apart from random relationships. For each test, a different number of samples was taken to represent experimental data from metabolomic snapshots, and various levels of technical error were added to the modeled measurements. Technical errors were assumed to follow a Gaussian distribution. In total, the data set was investigated for 19,900 pairwise relationships, of which 900 were modeled as linear relationships, in order to assess false positive and false negative error rates. The parameters were varied in an exhaustive permutation of the sample size from 3–150 samples and relative technical errors from 0.1–100%. Error rates were determined four times for each combination of 'number of samples' and 'technical errors', and the average of these four determinations was taken. Technical errors may be divided into an absolute error and a relative error. The absolute error, for example, is constituted by the resolution of an analytical instrument or by a constant background caused by chemical impurities of reagents and solvents. Such errors can be carefully controlled in validated chemical procedures and are usually less important than relative errors. In most cases in metabolomics, the total technical error is dominated by relative errors that relate to the true value, originating for example from sample storage, extraction and sample preparation procedures, and from cross-contamination and carry-over between samples. Technical errors can be estimated by repeating all sample preparation steps multiple times on small aliquots of a larger homogenized pool, followed by data acquisition. The magnitude of relative errors varies with the susceptibility of the compounds to alteration during the sample preparation and measurement process. However, for the sake of clarity, identical technical errors were used in the simulations for each pair of metabolites.
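The structure of the model data set can be sketched as follows. The counts (200 variables, 20 clusters, 19,900 pairs, 900 truly linear pairs) follow from the description above; the value ranges for the hidden cluster variable and for the coefficients *a* and *b*, as well as the function name `simulate_dataset`, are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def simulate_dataset(n_samples, rel_error, n_clusters=20, cluster_size=10, rng=None):
    """Simulate clustered variables that are linearly related within each cluster."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.repeat(np.arange(n_clusters), cluster_size)
    data = np.empty((labels.size, n_samples))
    for c in range(n_clusters):
        base = rng.uniform(1.0, 100.0, n_samples)   # hidden driver of the cluster (assumed range)
        for v in np.flatnonzero(labels == c):
            a, b = rng.uniform(0.5, 2.0), rng.uniform(0.0, 10.0)
            data[v] = a * base + b                  # y = a*x + b within the cluster
    # Relative Gaussian technical error, identical level for all variables
    noisy = data * (1.0 + rel_error * rng.standard_normal(data.shape))
    return noisy, labels

data, labels = simulate_dataset(n_samples=20, rel_error=0.10,
                                rng=np.random.default_rng(3))

pairs = list(combinations(range(labels.size), 2))
n_true = int(sum(labels[i] == labels[j] for i, j in pairs))
print(data.shape, len(pairs), n_true)
```

The pair counts can be checked by hand: 200 × 199 / 2 = 19,900 pairs in total, and 20 clusters × (10 × 9 / 2) = 900 pairs within clusters.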

*L*_{return} and *m*_{return}, the 5% threshold for the false positive rates would be reached at higher technical error rates. However, the minimal error rate of false negatives would increase simultaneously. Consequently, type I and type II error rates could in principle be balanced by adapting the thresholds for *L*_{return} and *m*_{return} in a qualitative manner. Nevertheless, the total error rate can only be improved by decreasing the technical error or by increasing the number of samples taken into account.
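The bookkeeping behind these error rates can be sketched in a few lines. Because the null hypothesis is the existence of a linear relationship, a type I error is a false negative (a true linearity is rejected) and a type II error is a false positive. The denominators are not spelled out in this text; the sketch assumes that each rate is taken relative to the number of truly linear and truly non-linear pairs, respectively, and the function name is hypothetical.

```python
import numpy as np

def error_rates(detected, truly_linear):
    """Return (false negative rate, false positive rate, total error rate)."""
    detected = np.asarray(detected, dtype=bool)
    truly_linear = np.asarray(truly_linear, dtype=bool)
    false_neg = np.sum(~detected & truly_linear) / truly_linear.sum()
    false_pos = np.sum(detected & ~truly_linear) / (~truly_linear).sum()
    total = np.mean(detected != truly_linear)
    return float(false_neg), float(false_pos), float(total)

# Five pairs: three truly linear, two not
detected = [True, True, False, False, True]
truly_linear = [True, False, True, False, True]
fn_rate, fp_rate, total_rate = error_rates(detected, truly_linear)
print(fn_rate, fp_rate, total_rate)
```

Shifting the acceptance thresholds moves pairs between the two error classes, while the total rate depends on how well the two populations are separated, which is governed by the technical error and the number of samples.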

*L*_{return} and *m*_{return} were adjusted to tolerate an outlier rate of 5% (one of 20 samples). Outliers were modelled at a distance of 2*σ* to 1000*σ* from the true linearity. It was found that outliers were generally recognized more easily when they were very distant from the linearity. Despite the additional outliers, false positive and false negative error rates were found to be almost identical to those in figure 3. The number of false negative linearity detections remained below 5% without the filter in all cases, and conversely, the false positive rate remained unchanged with the filter activated. Hence, the use of Bayesian likelihood estimations enables robust detection and verification of linear relationships in an unbiased way and in complex datasets. If more outliers are present in a data set, these may actually constitute several local likelihood maxima, as shown in figure 1C (lower panel) or in the figures in Additional File 2. Such additional linear relationships may be revealed if several outliers follow a different linear function, and hence yield unexpected hypotheses of cellular regulation. Finding and validating such multiple linearities has so far been hard to accomplish with classical tools but is now feasible with the algorithm presented here. The algorithm has been implemented in a stand-alone software solution. For data sets with a size of 200 variables × 150 samples, robust linearity networks are generated in around 10 min of computing time on a personal computer with 512 MB RAM and a 3.5 GHz processor. The actual computing time will vary from 3.5–15 min, depending on the actual linearity structure of the data set. More efficient implementations of the algorithm, specifically for the search for global likelihood maxima, can certainly be developed to reduce the computational run time. However, acquiring metabolomic data for 150 samples (from growth of biological organisms, harvesting, sample processing and data acquisition to data processing) takes time on the order of weeks, which surely justifies computational efforts on the order of minutes on standard personal computers.

## Conclusion

Use of the technical error together with a maximum likelihood assessment of linearity parameters and verification by simulated test statistics enables robust detection and verification of linear relationships in complex data sets. An implementation of this algorithm will enable biologists to calculate and compare linearity networks in metabolomic or other multivariate data sets, from which biological hypotheses may be derived. The algorithm can be modified with respect to the ratio of type I and type II errors depending on the biological focus of a study. It is highly advised to use more than 20 biological replicates for each condition that is to be tested in a biological experimental design of *genotypes x environments* (*G x E*), unless advances in analytical chemistry and instrumentation decrease the overall technical error to very low levels, i.e. below 5%. Even the existence of more than one linear relationship per pair of variables can be detected using the maximum likelihood algorithm, which has so far been hard to compute with classical approaches.

## Declarations

### Acknowledgements

The work was funded by the NIEHS through the R01 project ES13932 granted to OF and by a fellowship granted to FK by the Max-Planck Society, Germany. Helpful comments by Joachim Selbig are appreciated.

## References

1. Morgenthal K, Weckwerth W, Steuer R: Metabolomic networks in plants: Transitions from pattern recognition to biological interpretation. Biosystems. 2006, 83 (2–3): 108-117. 10.1016/j.biosystems.2005.05.017.
2. Ratcliffe RG, Shachar-Hill Y: Measuring multiple fluxes through plant metabolic networks. Plant Journal. 2006, 45 (4): 490-511. 10.1111/j.1365-313X.2005.02649.x.
3. Kose F, Weckwerth W, Linke T, Fiehn O: Visualizing plant metabolomic correlation networks using clique-metabolite matrices. Bioinformatics. 2001, 17 (12): 1198-1208. 10.1093/bioinformatics/17.12.1198.
4. Fiehn O: Metabolic networks of Cucurbita maxima phloem. Phytochemistry. 2003, 62 (6): 875-886. 10.1016/S0031-9422(02)00715-X.
5. Weckwerth W, Loureiro ME, Wenzel K, Fiehn O: Differential metabolic networks unravel the effects of silent plant phenotypes. Proceedings of the National Academy of Sciences of the United States of America. 2004, 101 (20): 7809-7814. 10.1073/pnas.0303415101.
6. Steuer R, Kurths J, Fiehn O, Weckwerth W: Observing and interpreting correlations in metabolomic networks. Bioinformatics. 2003, 19 (8): 1019-1026. 10.1093/bioinformatics/btg120.
7. Camacho D, de la Fuente A, Mendes P: The origin of correlations in metabolomics data. Metabolomics. 2005, 1: 53-63. 10.1007/s11306-005-1107-3.
8. Lin H, Bennett GN, San KY: Chemostat culture characterization of Escherichia coli mutant strains metabolically engineered for aerobic succinate production: A study of the modified metabolic network based on metabolite profile, enzyme activity, and gene expression profile. Metabolic Engineering. 2005, 7 (5–6): 337-352. 10.1016/j.ymben.2005.06.002.
9. Grubbs FE: Errors of Measurement, Precision, Accuracy and Statistical Comparison of Measuring-Instruments. Technometrics. 1973, 15 (1): 53-66. 10.2307/1266824.
10. Tocher JF: Pigmentation survey of school children in Scotland. Biometrika. 1908, 6: A1-A67. 10.2307/2331470.
11. Horton NJ, Laird NM: Maximum likelihood analysis of generalized linear models with missing covariates. Statistical Methods in Medical Research. 1999, 8 (1): 37-50. 10.1191/096228099673120862.
12. Lindsey KJ: Applying generalized linear models. 1997, New York: Springer, 1
13. Davis PL: Aspects of robust linear regression. Annals of Statistics. 1993, 21 (4): 1843-1899.
14. Andrews DF: Robust method for multiple linear-regression. Technometrics. 1974, 16 (4): 523-531. 10.2307/1267603.
15. Wald A: The fitting of straight lines if both variables are subject to error. Annals of Mathematical Statistics. 1940, 11: 284-300.
16. Berkson J: Are There 2 Regressions. Journal of the American Statistical Association. 1950, 45 (250): 164-180. 10.2307/2280676.
17. Scheffe H: Fitting Straight-Lines When One Variable Is Controlled. Journal of the American Statistical Association. 1958, 53 (281): 106-117. 10.2307/2282571.
18. Cressie N, Read TRC: Pearsons-X2 and the Loglikelihood Ratio Statistic-G2 – a Comparative Review. International Statistical Review. 1989, 57 (1): 19-43.
19. Urbanczyk-Wochniak E, Luedemann A, Kopka J, Selbig J, Roessner-Tunali U, Willmitzer L, Fernie AR: Parallel analysis of transcript and metabolic profiles: a new approach in systems biology. Embo Reports. 2003, 4 (10): 989-993. 10.1038/sj.embor.embor944.
20. Chen YP, Popovich PM: Correlation: Parametric and nonparametric measures. 2002, Sage Publications, 1
21. de la Fuente A, Bing N, Hoeschele I, Mendes P: Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics. 2004, 20 (18): 3565-3574. 10.1093/bioinformatics/bth445.
22. Perl J: Causality: Models, reasoning and inference. 2000, New York: Cambridge University Press, 1
23. Wright S: Correlation and causation Part I. Method of path coefficients. Journal of Agricultural Research. 1920, 20: 0557-0585.
24. Morgenthal K, Wienkoop S, Scholz M, Selbig J, Weckwerth W: Correlative GC-TOF-MS-based metabolite profiling and LC-MS-based protein profiling reveal time-related systemic regulation of metabolite-protein networks and improve pattern recognition for multiple biomarker selection. Metabolomics. 2005, 1: 109-121. 10.1007/s11306-005-4430-9.
25. Thomas S, Fell DA: The role of multiple enzyme activation in metabolic flux control. Advances in Enzyme Regulation. 1998, 38: 65-85. 10.1016/S0065-2571(97)00012-5.
26. Bayes T: An Essay Towards Solving a Problem in the Doctrine of Chances. Biometrika. 1958, 45 (3-4): 296-315. 10.1093/biomet/45.3-4.296.
27. Lee PM: Bayesian statistics: An introduction. 1989, New York: Oxford University Press
28. Box GEP, Tiao GC: Bayesian inference in statistical analysis. 1973, Reading, MA: Addison-Wesley Publishing Company
29. Fisher RA: On the 'probable error' of a coefficient of correlation deduced from a small sample. Metron. 1921, 1: 1-32.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.