When analysing learning procedure's efficiency, the procedure should be applied to the data generated by a known network, which then might be compared with the inferred one. To this aim, most studies apply procedures to gene expression data and compare inferred interactions with those found in biological literature. The disadvantage of this approach is that our knowledge of the structures of real biological networks is far from being complete even in the most deeply investigated organisms. Although many interactions between genes are known, there are very few results stating the absence of some interactions. Thus no conclusion can be drawn from the fact that the procedure inferred unknown interaction. The above disadvantage is no longer present when data are generated from a mathematical model simulating real networks. However, a danger of this approach is that the employed model simplifies the real process, losing important biological features. In that case, analysis of a learning procedure is aimed at its ability to infer an artificial model rather than real biological networks.
Husmeier in [17] suggests that a satisfactory compromise between the above two extremes is to apply the learning procedure to data generated by a system of ordinary differential equations.
Basic model
In the present study we generate data using the model proposed in [21]. The model consists of 54 species of molecules, representing 10 genes with their transcription factors, promoters, mRNAs, proteins and protein dimers, connected through 97 elementary reactions, including transcription factor binding, transcription, translation, dimerization, mRNA degradation and protein degradation. The dynamics of the model is described by the system of ordinary differential equations of the following form:
where t represents time, k
x
are kinetic constants of related reactions, [X] means concentration, pX, mX, X and X2 are promoter, mRNA, protein and dimer X, respectively, finally X·Y stands for a transcription factor bound to a promoter.
The system is composed of structures reported in the biological literature [22–24], i.e. a hysteretic oscillator, a genetic switch, cascades and a ligand binding mechanism that influences transcription (during the simulation, the ligand is injected for a short time). The whole network is shown in Fig. 1(a).
The total time of each simulation is set to 5000 minutes. At time 1000 minutes the ligand is injected for 10 minutes, changing the expression levels of all genes. The influence of the injection to expression decays throughout the interval 1500–3200 minutes (depending on the gene), but at time 2400 minutes system dynamics becomes similar to the initial one.
All the equations and parameters of the model, as well as the MATLAB code to integrate it, are available in the supplementary materials to [21].
This model is chosen for two reasons. First, differential equations formalism and biological origin of the structure guarantee realistic simulations. Second, small size of the system (note that, according to microarray experiments data, the learning procedure will be provided with mRNA concentrations only) allows to avoid a noise arising from heuristic search methods, which are necessary when dealing with large networks. Such noise might influence comparison of methods.
Modified models
Since genes G and K from the model are regulated by the same gene C and have the same kinetic constants, trajectories of their concentrations are identical. Consequently, their contributions to the regulatory interactions are indistinguishable, given the expression data. For this reason, Husmeier in [17] tests efficiency of DBN based learning techniques using the model slightly modified by identifying both genes.
In the present study we introduce perturbations to the model. It is done by replacing the differential equation regarding the mRNA of a perturbed gene by the following equation:
which makes the concentration exponentially converging to c. Taking c = 0 yields system with gene knocked out, while setting c to maximal (with respect to the basic system) expression level of a gene makes it overexpressed. 21 simulations altogether are executed: one simulation with the basic system and 20 simulations with one gene knocked out or overexpressed.
Octave scripts for generating expression time series are available in the supplementary materials [25].
Sampling and discretization
Husmeier in [17] chooses for his test 12 time points in equal length intervals between 1100 and 1600 minutes. He argues that more information is contained in the data derived from the system which is driven out of equilibrium by the ligand injection. In our tests Husmeier's choice is repeated and other intervals are tried, as reported below.
Before applying our learning procedure (as well as Husmeier's), the expression levels need to be discretize. One of the simplest methods of discretizing is partition of the interval of real numbers covering mRNAs concentrations of each gene into subintervals of equal length, relevant to particular discretized values. Another strategy is to base the discretization procedure on the meanings of introduced discrete expression levels (e.g. 'on'-'off' or 'under-expressed'-'baseline'-'over-expressed').
Husmeier in [17] applies the former approach with 3 discretized expression levels, and we follow him in the case of unperturbed data.
The specificity of perturbed data suggests the latter approach. The simulation of an unperturbed system specifies the reference point for expression levels of perturbed data. Thus discretization consists in comparing each concentration value from a perturbed system simulation with the concentration value of the same gene at the same time point of the unperturbed system simulation. When the values are close to each other, i.e.
the expression level is set to 'baseline', otherwise it is set to 'over-' or 'under-expressed'. The log-ratios of concentration values in knockout simulations are shown in Fig. 2. The threshold 0.5 seems to minimize the loss of information, inevitable in the discretization process. However, this choice, as well as the choice of the number of thresholds, is arbitrary.
Discretized expression data files are available in the supplementary materials [25].