- Methodology article
- Open access
- Published:

# Least-squares methods for identifying biochemical regulatory networks from noisy measurements

*BMC Bioinformatics*
**volume 8**, Article number: 8 (2007)

## Abstract

### Background

We consider the problem of identifying the dynamic interactions in biochemical networks from noisy experimental data. Typically, approaches for solving this problem make use of an estimation algorithm such as the well-known linear Least-Squares (LS) estimation technique. We demonstrate that when time-series measurements are corrupted by white noise and/or drift noise, more accurate and reliable identification of network interactions can be achieved by employing an estimation algorithm known as Constrained Total Least Squares (CTLS). The Total Least Squares (TLS) technique is a generalised least squares method to solve an overdetermined set of equations whose coefficients are noisy. The CTLS is a natural extension of TLS to the case where the noise components of the coefficients are correlated, as is usually the case with time-series measurements of concentrations and expression profiles in gene networks.

### Results

The superior performance of the CTLS method in identifying network interactions is demonstrated on three examples: a genetic network containing four genes, a network describing p53 activity and *mdm2* messenger RNA interactions, and a recently proposed kinetic model for interleukin (IL)-6 and (IL)-12b messenger RNA expression as a function of ATF3 and NF-*κ* B promoter binding. For the first example, the CTLS significantly reduces the errors in the estimation of the Jacobian for the gene network. For the second, the CTLS reduces the errors from the measurements that are corrupted by white noise and the effect of neglected kinetics. For the third, it allows the correct identification, from noisy data, of the negative regulation of (IL)-6 and (IL)-12b by ATF3.

### Conclusion

The significant improvements in performance demonstrated by the CTLS method under the wide range of conditions tested here, including different levels and types of measurement noise and different numbers of data points, suggests that its application will enable more accurate and reliable identification and modelling of biochemical networks.

## Background

A key objective of Systems Biology research is to move from a qualitative to a quantitative understanding of cellular signalling and gene networks. Motivated by recent advances in high-throughput genomics and proteomics analysis, and the resulting explosive growth in the amount of data available for analysis, much effort is currently focused on developing reliable methods for inferring the structural and functional organisation of biochemical networks from data obtained by time-series measurements – see for example [1–6] and references therein.

Interactions between components of biological networks can conveniently be represented by weighted, directed graphs, where the nodes correspond to the biochemical components, and the edges, represented as arrows with weights attached, indicate the direct quantitative effect that a change in one component has on another component [1]. The weights are, in general, nonlinear functions that represent often largely unknown reaction kinetics, and it is therefore not usually practical to directly determine these weights from experimental data. This is particularly the case for gene networks whose structures are poorly understood in general, even qualitatively. In such cases, a useful approach is to consider the biochemical network behaviour about some steady-state, and assume that it behaves linearly for small deviations from this steady-state [2, 3]. With this assumption, the network weights become constants, quantifying the reactions between the components in the neighbourhood of the steady-state. An interaction matrix, known as the Jacobian, is then obtained by grouping the constant weights into a matrix.

Several different approaches for determining the Jacobian of a network from time-series data have recently appeared in the literature [1–6]. A common feature of all these approaches is that the network is perturbed in some way, and then data are collected from time-series measurements of one or more components of the network. In [1], an approach was proposed which can handle very general types of system perturbations, such as gene knockouts and inhibitor additions. For these types of perturbations, the exact size, as well as the direct effect of the perturbations will be largely unknown, and therefore the method also allows the determination of the perturbation itself from the data. Another advantage of the approach of [1] is that the effect of unsteady-state initial conditions can be treated as an unknown perturbation and hence also estimated from the data. This removes the requirement for the system to be in a steady-state with known activities and concentrations when the perturbation is applied.

Another common feature of almost all the approaches for reverse engineering biomolecular networks so far proposed in the literature is that they employ some estimation algorithm to infer network structure from the measurement data. In [1], for example, the basis of the method for simultaneous estimation of the system states and parameter perturbations is a linear least-squares algorithm. A significant limitation of most such algorithms is that they do not take account of the noise that is inevitably present in the measurement data. Indeed, in the results presented in [1], it was observed that significant levels of noise in the measurement data could lead to quite large errors in the estimated Jacobian matrix.

In data from most biological experiments, the error associated with each measurement is substantial. The amount of measurement noise is often poorly defined but arises from 1) errors inherent in the measurement technique; 2) errors in the time a measurement is made (with absolute and drift components); and 3) biological variation in the behaviour of cells or organisms in the assay. Inaccuracy in measurements, leading to noise in the data available for analysis, can, in theory, be addressed by improvement of techniques and by replication. In practice, however, improving measurement quality or increasing replication is often not possible because it can involve slower sampling or result in the inclusion of more biological variation (e.g. through adding parallel cultures, or repeating experiments on different days). Therefore, it is critical to develop analytical approaches which allow robust identification of interactions in biochemical networks from data with a substantial, but poorly-defined, noise component. Such approaches are also valuable in reverse, i.e. in suggesting how experimental sampling strategies can be improved to provide optimal data in terms of both number and accuracy of data points.

Given the ubiquity of measurement noise in biological data, there is clearly a need for advanced estimation algorithms which can explicitly, and in some sense optimally, take such noise into account when producing estimates of the network interactions. In this paper, we consider two such extensions of the classical Least Squares (LS) algorithm, namely the Total Least Squares (TLS) [7, 8], and the Constrained Total Least Squares (CTLS) [9, 10] algorithm. The CTLS algorithm, in particular, is shown to be ideally suited to the problem of accurately and reliably identifying functional interactions between network components from noisy data. While both of these algorithms are now routinely used in advanced signal and image processing applications, we believe that this is the first time that their usefulness in Systems Biology has been highlighted.

## Results and Discussion

In this section, the performance of the three algorithms described above is tested on an *in silico* four-gene network example, on a high fidelity *in silico* p53 and mdm2 interaction model and on an example of interleukin (IL)-6 and IL- 12b interactions with activating transcription factor 3 (ATF3) and Rel (a component of NF-*κ* B) based on *in vivo* data.

All computations were performed on a 3.06 GHz Pentium IV machine with 1.00 GB of RAM using Windows XP Professional, MATLAB 7.2, and the MATLAB Optimisation Toolbox Version 3.0.4.

### A Four-Gene Network Model

A four-gene network example is presented in the supplementary material of [2]. This network was used as a test-bed to evaluate the performance of network identification approaches in both [1] and [2]. The differential equations for the gene network are given by

where *x*_{
i
}(*t*) is the concentration of mRNA_{
i
}, for *i* = 1, 2, 3, 4, the first term and the second term on the right hand side of the equations represent the rate of transcription and the rate of degradation of each mRNA, respectively, and each maximal enzyme rate is given by {V}_{1}^{s} = 5, {V}_{2}^{s} = 3.5, {V}_{3}^{s} = 3, {V}_{4}^{s} = 4, {V}_{1}^{d} = 200, {V}_{2}^{d} = 500, {V}_{3}^{d} = 150, {V}_{4}^{d} = 500, with units of nM · h^{-1}. The Michaelis constants are given by *K*_{14a}= 1.6, *K*_{24a}= 1.6, *K*_{32a}= 1.5, *K*_{43a}= 0.15, *K*_{12i}= 0.5, *K*_{31i}= 0.7, *K*_{1d}= 30, *K*_{2d}= 60, *K*_{3d}= 10, *K*_{4d}= 50, in units of nM, and *A*_{14} = 4, *A*_{24} = 4, *A*_{32} = 5, *A*_{43} = 2, *n*_{12} = 1, *n*_{14} = 2, *n*_{24} = 2, *n*_{31} = 1, *n*_{32} = 2, *n*_{43} = 2. In the model, gene interactions result in nonlinear dependencies of transcription rates on other mRNA concentrations, which act as communicating intermediaries. The corresponding gene network for this example is shown in Figure 1.

For this example, the level of perturbation for {V}_{i}^{s} for *i* = 1, 2, 3, 4 from the nominal values is 100% and the measurement noise is assumed to be zero-mean white gaussian with variance equal to the square of the equilibrium times 0.02, where the equilibrium states are given by {x}_{1}^{\text{eq}} = 0.4920, {x}_{2}^{\text{eq}} = 0.6052, {x}_{3}^{\text{eq}} = 0.1866, and {x}_{4}^{\text{eq}} = 0.6514. The number of experiments is four. In each experiment {V}_{i}^{s} is perturbed in the negative direction, i.e. inhibited, and the data sampling time is 0.01 h (36s). The true (simulated) values of *x*_{
i
}(*t*) together with the noisy measurements for this example are shown in Figure 2.

The three different least squares algorithms are tested for different numbers of data points per experiment, i.e., 3, 6, 9, 12, 21, 30, 60, and the quality of the Jacobian estimations was evaluated according to a number of different definitions of estimation error, which are discussed in the Methods section. The results generated from 1000 Monte-Carlo simulations are given in Table 1.

Note that the estimation errors of the TLS for the cases of very few data points, i.e. 6, 9, and 12 are larger than the errors from the standard least squares algorithm. This is because, as discussed later in the Methods section, the TLS algorithm requires a minimum number of data points to work properly. For the case of only 3 data points, all three algorithms provide the same result, since in this case the set of equations to be solved is not over-determined, i.e. there is a single unique solution. Excluding this case, the CTLS reduces the mean of the relative magnitude error for each element of the Jacobian, i.e. *ε*_{
M
}, by an average of 27% compared with the standard least squares technique, over all the different cases considered. This improvement rises to 37% when the four cases with the fewest data points are removed. The variance of the error is reduced by an average of 25.6%, excluding the first three cases. For the sign estimation error, *ε*_{
S
}, all three methods give a similar level of performance – the reason for this is easy to see, however, by considering the true Jacobian of the network:

Clearly, the Jacobian contains no terms which are very close to zero and therefore the signs of the estimates for each term will be very similar for all three methods. The CTLS almost always gives the best performance in the root mean square sense. A common feature of the results presented in Table 1 is that the accuracy of the estimate improves with increasing numbers of data points. However, beyond a certain critical number of data points, there is no further improvement in the quality of the estimate using any algorithm. The fact that for certain error measures the estimate dis-improves slightly for a large number of data points is due to the biased nature of the least squares solution. It is well known that when *A* and *b* in *Ax* = *b* are statistically independent, the least square solution has no bias error. However, when they are not independent, which is the case here, the solution has a bias in general. Moreover, the level of bias is generally a nonlinear function of the number of measurements and hence the bias error may not decrease monotonically [11]. From Table 1, it is clear that using too few data points can generate huge errors in the inferred network. However, since the error does not decrease monotonically with larger number of data points, it may be better to increase the accuracy of the data while obtaining fewer data points rather than sacrificing accuracy to obtain many more data points. In this specific case, the optimal number of data points seems to be between 21 and 60. Finally, to evaluate the effect of drift noise, which is a common form of noise in biochemical measurements, each algorithm was again evaluated using Monte-Carlo simulations. In this case, the measurements are given by [12]

Δ\tilde{x} = Δ*x*_{
k
}+ *v*_{
k
}+ *b*_{
k
} (3)

for *k* = 1, 2, ..., *L* where *v*_{
k
}is white noise and *b*_{
k
}is the drift noise. The drift noise, which is also called Brownian motion or random walk, is modelled as follows:

*b*_{k+1}= *b*_{
k
}+ *w*_{
k
}Δ*T* (4)

for *k* = 1, 2, ..., *L* - 1 where *b*_{1} equals to 0 and *w*_{
k
}is white noise. The dimensions of *b*_{
k
}and *w*_{
k
}are *n* × 1 and the variance of the *i*-th element of *w*_{
k
}is (*x*_{
i
}^{eq}*γ*)^{2} for *i* = 1, 2, ..., *n*. For this example, *n* is equal to 4. Monte-Carlo simulation results for various levels of drift noise are shown in Tables 2 and 3, for 12 and 21 data points per experiment, respectively. Again, for each case, and for all different levels of drift noise considered, the CTLS algorithm generally yields significant reductions in the Jacobian estimation error. By comparing Tables 2 and 3, it can be seen that the errors with 21 data points are smaller than the ones with 12 data points. However, if the number of data points is increased by too much, adverse effects will occur, as the bias error is stronger as time increases and hence the later data are effected more strongly by bias noise. Hence, in the presence of bias noise it is also important to try to choose the optimal number of data points so that the noise does not increase the estimation error too much.

### p53 and *mdm2* messenger RNA expression

The negative feedback interactions between the tumor suppressor p53 and the oncogene Mdm2 have been the subject of much attention in the recent Systems Biology literature. p53 protein activates *mdm2* and the Mdm2 protein in turn negatively regulates p53, forming a negative feedback loop. The nature of the oscillations in p53 have been observed to vary significantly from cell to cell. The period and the amplitude variability seem to stem from low frequency noise in the protein production rate, [13]. A detailed mathematical model for the single-cell response of p53 to ionising radiation, which includes the levels of p53 and transcription of the *mdm2* gene, the corresponding protein levels of mdm2 and the activation kinetics of the protein p53 and ataxia telangiectasia mutated (ATM), is presented in [14]. The model accurately replicates the experimentally observed phenomenon that the number, but not the frequency or amplitude, of p53 oscillations depends on the radiation dose. The full model includes not only deterministic but also some stochastic dynamics, which represent the stochastic behaviour of the number of double-strand break complexes. To formulate a realistic problem to demonstrate the performance of the constrained total least-squares algorithm, we considered the following scenario: First, the p53 activity and *mdm2* gene expression levels, as shown in Figure 7A of [14], are the only measurements available. Second, we do not know the activity of ATM and other proteins including p53. Finally, the relation between these two genes is to be estimated by the Jacobian estimation algorithm. The kinetics for the two gene expression levels are given by

where the transcription rate of p53 is invariant and leads to a constant mRNA steady state and p53*(*t* - *τ*_{1}) is the activated p53 protein level, which is phosphorylated by the phosphorylated ATM. The effect of the activated p53 on *mdm2* is delayed by *τ*_{1}. Since this protein activation part is unknown, the effect of p53 on *mdm2* is hidden in the measurements and the estimated Jacobian will be effected accordingly. The unknown kinetics includes negative feedback interactions between p53 and Mdm2 proteins and the stochastic dynamics of double-strand break complexes – see [14] for more details. The true Jacobian is simply given by

where the numbers are given in [14].

The perturbation level on p53 is negative 10%, the measurement sampling time is 2 hours and white noise is added to the measurement data. Note that the true states converge to a steady state after around 16 samples, i.e. 30 hours. The true and the measured values of the perturbed gene expression levels are shown in Figure 3. The estimation results are shown in Table 4. In most cases, the average errors for all measures produced by the CTLS are the smallest. Similarly to the previous example, the error reduction stops after the number of data points reaches around 8. Hence, further reductions in the estimation error for this example would require the application of more accurate detection methodologies. Since there are some ignored kinetics, even with virtually no measurement error data the Jacobian still gives some connections between p53 and *mdm2*. Then, we may conclude that there are some additional kinetics, which have not been discovered yet, and this may motivate further experiments to elucidate these hidden regulatory mechanisms.

*IL6* and *IL12* messenger RNA expression

Proper regulation of the innate immune system is crucial for host survival, and is mediated, in part, by cytokines that are secreted by macrophages. In particular, breakdown of immune system regulatory mechanisms can lead to inflammatory disease. Immune system control is extraordinarily complex, making it an obvious candidate for investigation using Systems Biology approaches. The biochemical network through which interleukin (IL)-6 and IL- 12b interact with activating transcription factor 3 (ATF3) and Rel (a component of NF-*κ* B) forms an important part of the innate immune system response [15]. A kinetic model for the expression of *IL6* mRNA by *ATF3* and *Rel* was recently proposed in [15] as follows:

where *τ* = 600/ln(2), *β*_{Rel} = 7.8, *β*_{ATF3} = -4.9, [*Il6*] is the *Il6* mRNA expression level, and [*Rel*] and [*ATF3*] are the level of *Rel* and *ATF3*, respectively. The second part in the right hand side of the kinetic equation is a sigmoidal function that incorporates lower and upper bounds on *Il6* expression, *τ* is given by *T*^{1/2}/ln(2) and *T*^{1/2} is a typical mRNA half-life in mammalian cells, and *β*_{Rel} and *β*_{ATF3} represent the relative contributions of *Rel* and *ATF3* in the levels of *Il6* transcription, respectively. In [15], this kinetic model was developed to match the experimental data shown in Figure 4 using a least squares regression. Similarly, a kinetic model for *IL12* is given by

where *τ* = 600/ln(2), *β*_{Rel} = 18.5, *β*_{ATF3} = -9.6, and [*IL12*] is the level of *Il12* mRNA expression. Similar interpretations to those for *Il6* can be applied to this equation. Unlike with the previous example, since this is a model based on real data from a partially understood biochemical network, the true Jacobian is unknown, and therefore the estimation error cannot be evaluated explicitly. However, using the proposed kinetic models we can obtain the following ratio:

and therefore we can partially validate the Jacobian estimation from the data against the proposed model by checking the value of this ratio. The equivalent ratio for the case of *IL12* is -1.93. Note that in the context of this example, the negative sign of this value is crucial since it corresponds to a negative feedback role for *ATF3*, which was the main finding presented in [15].

In [15], to obtain the data shown in Figure 4 wild type mice were stimulated (or perturbed) by 10 ng ml^{-1} lipopolysaccharide (LPS). The data was sampled at intervals of 10 minutes but the original data at 180 and 300 minutes were not given, hence, they are interpolated for our study to make all data equally spaced in time. Of course, the measurement data will definitely include some noise and the direct calculation of the Jacobian using the conventional least squares may therefore produce biased/inaccurate results. Note that since the number of states is 3, the number of perturbations is 1, and the number of data points for each state is 7, there is relatively little data with which to accurately estimate the Jacobian for this particular example. However, using the various least squares algorithms, we tried to extract the maximum amount of information from the given set of experimental data. In addition, note that since the equilibrium point is not given, the measurements we have are not relative measurements Δ\tilde{x}_{
k
}but absolute measurements \tilde{x}_{
k
}. This presents no difficulty, however, since in the framework of [1], the problem formulation to estimate the Jacobian using \tilde{x}_{
k
}is exactly the same as the one for Δ\tilde{x}_{
k
}– see [1] for more details. For *Il6* the key result for this example is that the standard least squares algorithm gives the wrong (positive) sign for the ratio defined above, whereas the more advanced algorithms give the correct sign. The correct ratio of *Rel* and *ATF3* to *Il6* is -1.59 and the estimated values computed with the LS, TLS, and CTLS algorithms are 1.43, -3.73, and -6.35, respectively. Thus, only by using the TLS or CTLS algorithms can the negative regulation effect of ATF3 proposed in [15] be confirmed from the noisy data presented in the paper. For *Il12*, the ratio calculated from each method, i.e., LS, TLS, and CTLS, is -4.53, -2.46, and -1.98, respectively. Therefore, in this case all three algorithms predict the negative regulation role of ATF3 correctly. However, the ratio computed from the CTLS, -1.98, is by far the closest to the true value (-1.93) predicted by the model. The improved parameterization achieved using the CTLS algorithm means that additional factors, in particular API and perhaps chromatin remodelling factors, can be added to the model with only limited and targeted new biological data.

## Conclusion

We have considered the problem of identifying the dynamic interactions in biochemical networks from noisy data. Since time-series measurements of, for example, concentrations and expression profiles in gene networks, are almost guaranteed to be corrupted by significant levels of noise, algorithms are required which explicitly take this noise into account when computing estimates of quantitative interactions in biochemical networks. The TLS and CTLS algorithms are extensions of the widely used least squares approach which optimally deal with the presence of uncorrelated and correlated noise in the measurements, respectively. Since noise in time-series measurements from biological experiments are generally correlated, the CTLS approach is ideally suited to estimation problems of this type.

The superior performance of the CTLS method in identifying network interactions was demonstrated on three examples: a genetic network containing four genes, a high fidelity p53 and *mdm2* interaction network, and a recently proposed kinetic model for interleukin (IL)-6 and (IL)-12b messenger RNA expression as a function of ATF3 and NF-*κ* B promoter binding. For the first example, the CTLS has significantly reduced the errors in the estimation of the Jacobian for the gene network. For the second, the CTLS shows similarly superior performance over the other least-squares methods to estimate the Jacobian from the measurements of a high fidelity gene network with neglected kinetics. For the third, it has allowed the correct identification, from noisy data, of the negative regulation of (IL)-6 and (IL)-12b by ATF3. The ability to take into account errors from various noise sources when identifying biochemical networks is valuable in informing decisions about the optimal numbers of data measurements that are required. While the use of very few data points generates huge errors, the errors do not decrease monotonically with much larger numbers of points. Thus, the calculations presented in this paper provide a rational basis for the design of experiments, in particular regarding the required frequency and accuracy of sampling. This is a very important issue in practice, since for some experiments there will be a practical upper limit on the number of data points per experiment (e.g. some measurements may take a certain amount of time to make; or if a measurement uses 5% of a starting culture which changes properties if scaled by more than 2-fold, one can only ever obtain 10 measurements). Secondly, the number of data points is often a trade-off with accuracy of the measurements. Typically, one would schedule a single day to do an experiment and then the choice is often between making more, noisier measurements or fewer, more accurate ones (since it is usually impossible to make conditions on two different days exactly the same.)

The excellent performance of the CTLS method compared to LS and TLS approaches (or the implicit assumption of perfect data) under the wide range of conditions tested here – including different levels of noise, different numbers of data points, and with drift – suggests that its application will enable better identification and modelling of biochemical networks. In the future, explicit analysis with the CTLS method is likely to increase the number of parameters that can be included in a model, even where there is limited knowledge of the noise levels and their source.

## Methods

### Problem Formulation

In general, the dynamics of many biochemical networks can be modelled as a nonlinear differential equation [1]

(*t*) = *f* (*x*(*t*)) (10)

where \dot{x}(*t*) is the time derivative of *x*(*t*), i.e. *dx* (*t*)/*dt*, and *x*(*t*) is an element of ℝ^{n}where ℝ^{n}is the real *n*-dimensional space. Note that the symbol *x* is used for two purposes in this paper, one for the unknown in the linear equation *Ax* = *b* and the other for the state of an ordinary differential equation. To distinguish between them, the state vector of the ordinary differential equation will always be written as *x*(*t*), i.e. as a function of time. In the above differential equation, *f* (·) is a nonlinear function, which satisfies the conditions for the existence and uniqueness of the solution of the ordinary differential equation. In biochemical networks *f* (·) is often also a function of some experimentally adjustable parameters such as kinetic rate constants or gene transcription rates [1].

If the above system is perturbed at an equilibrium point, *x*_{0}, that satisfies *f* (*x*_{0}) = 0, then the state around the equilibrium point is also perturbed by Δ*x*, and satisfies the following linear ordinary differential equation:

Δ\dot{x} (*t*) = *F* Δ *x*(*t*) + *u*(*t*). (11)

In the following, *u*(*t*) is assumed to be a constant, however, all results also hold in the case of a time-varying *u* (*t*) [1]. From the above equation, it is clear that the matrix *F* reveals the relations between each state in *x* (*t*) around the equilibrium point. The main purpose of this paper is to provide a more efficient and reliable method for estimating the matrix *F*, known as the Jacobian of *f* (*x*(*t*)) at *x* (*t*) = *x*_{0}, when the measurement data are effected by noise. Since the measurement data are recorded at discrete time intervals, we reformulate the continuous time system as a discrete time system given by

Δ*x*_{k + 1}= Φ Δ*x*_{
k
}+ *u* (12)

where Δ*x*_{k + 1}and Δ*x*_{
k
}are the sampled versions of the corresponding states in the continuous system and Φ is an *n* × *n* matrix in ℝ^{n × n}. Whenever Φ is determined, the original *F* can be recovered using the following relation [1]:

where Δ*T* is the sampling time and log(Φ) is the solution of *e*^{X}= Φ. Note, however, that when Φ is very poorly estimated, it could happen that log(Φ) becomes a complex matrix. To avoid this biologically meaningless result, the following approximation can be used [5]:

where *I*_{
n
}is the *n* × *n* identity matrix. Even with this approximation, however, there could exist numerically ill-conditioned problems if Φ + *I*_{
n
}is close to singular and thereby not invertible. In such situations the following Euler approximation can be used instead [1]:

although in this case the Jacobian *F* will be very sensitive to the magnitude of Δ*T*. Clearly, however, if the transformations using (13) and (14) fail then Φ is not a very good estimate and, hence, the resulting *F* from (15) cannot be expected to be close to the true Jacobian.

Let us consider *L* noisy measurements of Δ*x*_{
k
}. Then we have

Δ\tilde{x}_{
k
}= Δ*x*_{
k
}+ *v*_{
k
}for *k* = 1, 2, ..., *L* (16)

where *v*_{
k
}is a zero-mean white noise vector in ℝ^{n}. That is, *E* (*v*_{
k
}) = 0 and *E* (*v*_{
i
}{v}_{j}^{T}) = 0 for *i* ≠ *j* where *E* (·) is the Expectation. The Expectation is defined through a corresponding probability density function, i.e. *E* (*v*_{
k
}) = ∫_{Ω} *v*_{
k
}*p*(*ω*) *d* Ω where *p* (*ω*) is the probability density function and Ω is the sample space. In simple terms, *E* (*v*_{
k
}) can be regarded as the average of *v*_{
k
}and *E* ({v}_{k}^{2}) as the variance of *v*_{
k
}.

Assuming that the number of measurements, *L*, is greater than *n* + 2, from the relations given in (12) and (16) we have

Δ*x*_{
k
}= Φ Δ*x*_{k - 1}+ *u* = Φ Δ \tilde{x}_{k - 1}+ *u* - Φ*v*_{k - 1} (17)

for *k* = 1, 2, ..., *L*. However, since the true values of Δ*x*_{
k
}are not known, the left hand side of (17) must be replaced by the corresponding measured values:

Δ\tilde{x}_{
k
}- *v*_{
k
}= Φ Δ\tilde{x}_{k - 1}+ *u* - Φ *v*_{k - 1} (18)

for *k* = 1, 2, ..., *L*. The above can be written in a matrix form as follows:

where **1**_{1 × (L - 1)}and **0**_{1 × (L - 1)}are 1 × (*L* - 1) vectors with elements of 1 or 0, and

for *i* <*j* (*i* and *j* are positive integer numbers).

To formulate the problem in a standard least squares form, i.e. *Ax* = *b*, where *A* and *b* are measurements and *x* is to be estimated, we make the following definitions:

for *i* = 1, 2, ..., *n*. Now, the *i*-th row of the matrix [Φ *u*] in (19) can be written in the standard form as follows:

(*A* + Δ*A*) *x* = *b* + Δ*b* (22)

where

for *i* = 1, 2, ..., *n*. Δ*A* and Δ*b* are unknown correction terms caused by the noise in the data. The above problem is solved *n*-times to obtain the estimate of all the rows in the matrix Φ. For the case of multiple experiments, the details of the formulation of the problem are given in the additional file (See Additional File 1).

### Estimating the Jacobian in the Presence of Noise

Consider first the simple scalar version of the least squares problem in the absence of measurement noise, given by *a** *x* = *b**. In this case it is easy to see that the exact solution is given by *x** = *b**/*a** for *a** ≠ 0. Now, let the measurements of *a** and *b** be corrupted by noise as follows: *a* = *a** + *v*_{1} and *b* = *b** + *v*_{2} where *a** and *b** are the true values, *v*_{1} and *v*_{2} are the unknown measurement noise, and *a* and *b* are the (known) measurements of *a** and *b**. In this case, the standard least squares solution for just one set of measurements is

Using the binomial theorem, the denominator of the above expression can be expanded as follows:

Then, the least squares solution becomes

Now, if both *v*_{1} and *v*_{2} are independent zero-mean white gaussian noises, this implies that E(*v*_{1}) = 0, E(*v*_{2}) = 0, E(*v*_{1} *v*_{2}) = 0, *etc*. Thus, taking the Expectation on both sides gives

where {\sigma}_{1}^{2} is the variance of *v*_{1}. It is clear from the above relation that if the noise *v*_{1} is not present, the least squares solution gives the true solution in the average sense. However, when *a** is corrupted by noise, the solution is not optimal but biased proportional to the variance of the noise. To correct this situation, the problem is now set up with so-called correction terms as follows:

(*a* + Δ*a*) *x* = *b* + Δ*b*. (28)

The Total Least Squares (TLS) technique was developed to solve exactly this problem by finding the correction terms Δ*a* and Δ*b*. The correction terms are obtained by minimising ||Δ*a* Δ*b*||_{F}, while simultaneously satisfying the above relation (28) where || · ||_{
F
}is the Frobenius norm defined by ||*A*||_{
F
}= \sqrt{\text{tr}(A{A}^{T})} for a matrix *A* in which tr(*AA*^{T}) is the trace of the matrix, i.e. the sum of the diagonal terms. For the case of one measurement, as given above, the cost minimised by the TLS is given by

||Δ*a* Δ*b*||_{
F
}= Δ*a*^{2} + Δ*b*^{2}. (29)

For a higher number of measurements, i.e. *Ax* = *b*, the solution from the TLS is given by

*x*_{TLS} = (*A*^{T}*A* - *λ*^{2} *I*)^{-1} *A*^{T} *b* (30)

where *λ* is the smallest singular value of [*A b*] and the derivation can be found in the additional file (See Additional File 1) or [7]. On the other hand, the conventional least squares solution is given as follows:

*x*_{LS} = (*A*^{T} *A*)^{-1} *A*^{T}*b* (31)

and thus it essentially has the same error as shown in (27). The TLS technique tries to find the correction terms for *A* such that the bias error, which stems from the inaccuracy in *A*, is reduced. Hence, the quality of *x*_{TLS} depends on how close the estimated *λ* is to the true correction term. The TLS solution is always guaranteed to be as good or better than the least squares solution in the root mean square sense, if the number of measurements is sufficient to allow the algorithm to compute a reasonable approximation of the true correction term. If the number of measurements is too small, however, the TLS solution may not be better than the conventional least squares solution.

One of the main assumptions behind the TLS technique is that the two noise terms *v*_{1} and *v*_{2} are independent. However, if they are not independent but related to each other in some way, this knowledge about the structure of the problem should be used in estimating the solution. For example, if *v*_{1} = *v*_{2}, the least squares solution is approximated by

The optimal solution for this case is the Constrained Total Least Squares (CTLS) technique. If it is known that *v*_{1} = *v*_{2} then Δ*a* must be equal to Δ*b*. Hence, instead of minimising the Frobenius norm of ||Δ*a* Δ*b*||_{F}, a more appropriate cost for this problem would be Δ*a*^{2} instead of Δ*a*^{2} + Δ*b*^{2}. The CTLS algorithm exploits the knowledge that the true correction terms must be of the form Δ*a* = -*v*_{1} and Δ*b* = -*v*_{1}. As a result, the CTLS technique searches for the correction term, which is a minimum in the 2-norm sense, i.e. {\Vert {v}_{1}\Vert}_{2}^{2}, and simultaneously satisfies the constraint (28) where {\Vert {v}_{1}\Vert}_{2}^{2}={v}_{1}^{T}{v}_{1}. Other than different cost functions for each method, the main difference between the TLS and the CTLS is the dimension of the correction term search space. For one set of measurements, for example, the TLS searches for the correction terms Δ*a* and Δ*b* in a two dimensional space, while the CTLS, on the other hand, searches for the single correction term Δ*a* or Δ*b* in the minimal (one dimensional) space. The CTLS formulation can finally be reduced to the following minimisation problem:

where *C* is constructed from the measurements and *H*_{
x
}is given in a special form which is a function of the structure of the correction terms and also of *x* – the details can be found in the additional file (See Additional File 1) or in [10]. The minimisation problem solved by the CTLS algorithm is nonlinear since *H*_{
x
}is a function of *x* and in general will not be convex – therefore the quality of the resulting solution will depend on the initial guess for *x*. Of course the simplest way to obtain a good initial guess is to use the solution provided by the LS algorithm. If this solution is not too far away from the true solution, then the minimisation problem can be efficiently solved by some local minimisation algorithm such as Newton's method, Sequential Quadratic Programming, *etc* [16]. However, in some cases it may happen that the LS solution does not provide a good initial guess and, as a result, the minimisation algorithm may produce very large values for Δ*a* and Δ*b*. In general, large magnitudes of the correction terms correspond to incorrect solutions because if Δ*a* and Δ*b* are large compared to the magnitude of *a* and *b*, then Δ*a* and Δ*b* become dominant and the solution could be any number. For example, if *a** and *b** are equal to 1 and the measurements, i.e. *a* and *b*, are 1.1 and 0.9, respectively, the correct correction terms are -0.1 and 0.1. However, if the correction terms are given as 1 × 10^{10} and -1 × 10^{100}, then the solution is approximately -1 × 10^{90}, which is too far from the true solution, 1. Hence, whenever the final solution produced by the CTLS algorithm is drastically different from the initial guess in terms of the magnitude, it is advisable to impose some bounds on the magnitude of the correction terms. To do this, the above unconstrained minimisation problem may be solved with the following constraint on *x*:

*x*_{0} - *h* |*x*_{0}| ≤ *x* ≤ *x*_{0} + *h* |*x*_{0}| (34)

where *h* is a small positive constant. Numerically, constrained optimisation problems are much harder to solve than unconstrained optimisation problems, and hence, the original unconstrained problem should generally be solved first, with the above constraints only being imposed if they are necessary to compute a reasonable solution.

Finally, we note that there are many cases of biochemical experimental data where the matrix inversions required in (31) and (33) are not possible, i.e., they are singular or very close to singular. This is usually a result of a too large sampling time which results in the *A* matrix not having full-rank. One way to fix this situation is by removing the dependent parts of *A* by using a singular value decomposition. After eliminating the dependent parts, the matrix inversions in (31) and (33) are feasible and the problems are again well-defined. More details can be found in [1]. Full details of the mathematics involved in the solutions of the TLS and CTLS problems are given in the additional files (See Additional File 1), together with complete MATLAB programmes for the solution of each algorithm (See Additional File 2).

### Evaluating the Jacobian Estimation Error

The main reason for estimating the Jacobian is to gain a quantitative understanding of the local structure of the biochemical network. Hence, correct estimation of each element of the matrix is important, since each element provides a measure of the (local) functional interaction between two nodes in the network. From this point of view, the estimation error, *ε*_{
M
}, can be naturally defined as follows:

where *N*_{1} is the number of non-zero elements in the true *F*, *N*_{2} is the number of zero elements in the true *F*, *F* is given by

*F* := (*f*_{
ij
})_{n × n}, (36)

where \widehat{f}_{
ij
}and *f*_{
ij
}are the *i*-th row, *j*-th column element of \widehat{F} and *F*, respectively, and \widehat{F} is the estimated matrix whose form is the same as *F*. The above definition is a slight modification of the one proposed in [1] where *β*_{
ij
}= 0 for all *i* = 1, 2, ..., *n* - 1, *n* was made and *j* = 1, 2, ..., *n* - 1, *n* (i.e. it effectively ignores errors in the estimation of the zero elements of the Jacobian).

In [5], two alternative measures of the error in the Jacobian estimation, *r*_{
z
}and *r*_{
nz
}, are defined. These measures quantify the error in the zero elements and non-zero elements of the estimated matrix, respectively. To do this, the elements of the estimated Jacobian are sorted according to their absolute values. For a given positive integer *n*_{
h
}, the smallest *n*_{
h
}elements of the estimated Jacobian are then set to zero. *r*_{
z
}is then defined as the ratio of the number of zero elements in the estimated Jacobian to the number of zero elements in the true Jacobian. *r*_{
nz
}is defined as the ratio of the number of non-zero elements in the estimated Jacobian whose signs are the same as the ones in the true Jacobian, to the number of non-zero elements in the true Jacobian. In this paper, we consider a slightly more compact version of this measure and define an error measure *ε*_{
S
}as follows:

where sign(*a*) is a function that has the sign of *a* as its value, i.e., -1, 0, 1, for *a* < 0, *a* = 0, and *a* > 0, respectively. Finally, the third possible error definition is based on the Frobenius norm of a matrix:

*ε*_{
F
}:= ||\widehat{F} - *F*||_{
F
}. (39)

This error measure arises naturally in the context of the TLS problem since this approach exactly minimises the Frobenius norm of the correction terms arising from the noise in the data, Δ*A* and Δ*b*.

## References

Schmidt H, Cho KH, Jacobsen EW: Identification of small scale biochemical networks based on general type system perturbations. FEBS Journal 2005, 272(9):2141–2151. 10.1111/j.1742-4658.2005.04605.x

Kholodenko BN, Kiyatkin A, Bruggeman FJ, Sontag E, Westerhoff HV: Untangling the wires: A strategy to trace functional interactions in signaling and gene networks. Proceedings of the National Academy of Sciences 2002, 99(20):12841–12846. 10.1073/pnas.192442699

Sontag E, Kiyatkin A, Kholodenko BN: Inferring dynamic architecture of cellular networks using time series of gene expression, protein and metabolite data. Bioinformatics 2004, 20(12):1877–1886. 10.1093/bioinformatics/bth173

Tegner J, Yeung MKS, Hasty J, Collins JJ: Reverse engineering gene networks: Integrating genetic perturbations with dynamical modelling. Proceedings of the National Academy of Sciences 2003, 100(10):5944–5949. 10.1073/pnas.0933416100

Bansal M, Gatta GD, di Bernardo D: Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 2006, 22(7):815–822. 10.1093/bioinformatics/btl003

Cho KH, Choo SM, Wellstead P, Wolkenhauer O: A unified framework for unraveling the functional interaction structure of a biomolecular network based on stimulus-response experimental data. FEBS Letters 2005, 579: 4520–4528. 10.1016/j.febslet.2005.07.025

Golub GH, Loan CFV: An analysis of the total least squares problem. SIAM Journal on Numerical Analysis 1980, 17(6):883–893. 10.1137/0717073

Huffel SV, Vandewalle J: The Total Least Squares Problem: Computational Aspects and Analysis. Frontiers in Applied Mathematics series SIAM 1991., 9:

Abatzoglou T, Mendel J: Constrained total least squares. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing 1987, 12: 1485–1488.

Abatzoglou TJ, Mendel JM, Harada GA: The constrained total least squares technique and its application to harmonic superresolution. IEEE Transactions on Signal Processing 1991, 39(5):1070–1087. 10.1109/78.80955

Mendel JM: Lessons in estimation theory for signal processing, communications, and control. Englewood Cliffs, New Jersey 07632, USA: Prentice Hall, Inc; 1995.

Maybeck PS: Stochastic Models, Estimation, and Control. Volume 1. Arlington, VA: Navtech Book & Software Store; 1994.

Geva-Zatorsky N, Rosenfeld N, Itzkovitz S, Milo R, Sigal A, Dekel E, Yarnitzky T, Liron Y, Polak P, Lahav G, Alon U: Oscillations and variability in the p53 system. Molecular Systems Biology 2006., 2(33):

Ma L, Wagner J, Rice JJ, Hu W, Levine AJ, Stolovitzky GA: A plausible model for the digital response of p53 to DNA damage. Proceedings of the National Academy of Sciences 2005, 102(40):14266–14271. 10.1073/pnas.0501352102

Gilchrist M, Thorsson V, Li B, Rust AG, Korb M, Kennedy K, Hai T, Bolouri H, Aderem A: Systems biology approaches identify ATF3 as a negative regulator of Toll-like recepter 4. Nature 2006, 441(11):173–178. 10.1038/nature04768

MathWorks: Optimization Toolbox (Version 3) For Use With MATLAB. 3 Apple Hill Drive, Natick, MA, 01760–2098, USA: The MathWorks, Inc; 2006.

## Acknowledgements

This work was carried out under the UK EPSRC Research Grant GR/S61874/01 and BBSRC Research Grant BB/D015340/1. It was also supported by the Korean Ministry of Science and Technology through Korean Systems Biology Research Grant (M10503010001-05N030100111) and the 21C Frontier Microbial Genomics and Application Center Program (Grant MG05-0204-3-0), in part by 2005-B0000002 from the Korea Bio-Hub Program of Korea Ministry of Commerce, Industry & Energy, and by a grant from NITR/KOREA FDA for the National Toxicology Program in Korea (KNTP). K.-H. Cho was supported by the second stage Brain Korea 21 Project in 2006.

## Author information

### Authors and Affiliations

### Corresponding author

## Additional information

### Authors' contributions

JK derived the mathematical details, implemented and tested the algorithms under the supervision of DGB and IP. DGB, IP, and KHC checked the mathematical derivations. JK and DGB wrote the first draft of the manuscript. PHH and KHC provided biological interpretations of the results and all authors contributed to the final manuscript.

## Electronic supplementary material

### 12859_2006_1380_MOESM1_ESM.pdf

Additional file 1: Detailed mathematical descriptions of the least squares, total least squares, and constrained total least squares algorithms, for the multiple experiments case, are provided in this file. (PDF 64 KB)

### 12859_2006_1380_MOESM2_ESM.zip

Additional file 2: This is a standard zip compressed file. It can be uncompressed using freely available software, such as winzip. In unix, it can be uncompressed using the *unzip* command. This file includes the MATLAB source files to run all the calculation for the examples in this paper. To run the files, the MATLAB and Optimization toolboxes for MATLAB are required. More details about each file can be found in "readme.txt". (ZIP 11 KB)

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

**Open Access**
This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License (
https://creativecommons.org/licenses/by/2.0
), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Kim, J., Bates, D.G., Postlethwaite, I. *et al.* Least-squares methods for identifying biochemical regulatory networks from noisy measurements.
*BMC Bioinformatics* **8**, 8 (2007). https://doi.org/10.1186/1471-2105-8-8

Received:

Accepted:

Published:

DOI: https://doi.org/10.1186/1471-2105-8-8