Reverse engineering gene regulatory network based on complex-valued ordinary differential equation model

Background The growing researches of molecular biology reveal that complex life phenomena have the ability to demonstrating various types of interactions in the level of genomics. To establish the interactions between genes or proteins and understand the intrinsic mechanisms of biological systems have become an urgent need and study hotspot. Results In order to forecast gene expression data and identify more accurate gene regulatory network, complex-valued version of ordinary differential equation (CVODE) is proposed in this paper. In order to optimize CVODE model, a complex-valued hybrid evolutionary method based on Grammar-guided genetic programming and complex-valued firefly algorithm is presented. Conclusions When tested on three real gene expression datasets from E. coli and Human Cell, the experiment results suggest that CVODE model could improve 20–50% prediction accuracy of gene expression data, which could also infer more true-positive regulatory relationships and less false-positive regulations than ordinary differential equation.

could reflect the regulatory orders of genes and the important information of regulatory objects [7]. With the development of DNA microarray or chip technology, a large amount of gene expression profiles have been generated, which provide a condition for identifying gene regulatory networks using computational models [8][9][10].
Some computational approaches have been developed to describe biology networks [11][12][13][14][15][16][17], especially the regulatory relationships between genes with gene expression time series, such as directed graph [18], Boolean network [3,19], Bayesian network [20,21], differential equation [22], neural network [23][24][25], stochastic equation [26,27], etc. Differential equation model is more conducive to describing the concentration evolution of biological macromolecules such as RNA and protein over time. Therefore, this model has been widely utilized in pharmacokinetics, enzymology and gene regulatory network construction [28][29][30][31][32][33]. For each target gene, its corresponding differential equation is identified, in which the dependent variables (genes) indicate the corresponding regulatory factors of target gene. Hohm and Zitzler resolved the issue of how to optimize the parameters of ordinary differential equation (ODE) for GRN inference [34]. Tian et al. proposed stochastic delay differential equations to describe the time-delayed regulatory relationships in GRN [35]. Gebert and Jong proposed piecewise linear differential equation to identify the most relevant regulating interactions in GRN [36,37]. Zhang et al. proposed single-index ODE model and clipped absolute deviation penalty penalized function to infer network structure [38]. Matsumoto et al. proposed ODE models to identify GRN with single-cell RNA-Seq data [39]. Some special nonlinear ODE models also have been proposed to infer GRN, such as S-system model [40][41][42].
Recently the complex versions of many models have been proposed due to their potential to optimize more easily, better generalization characteristics, faster learning and higher noise-robust memory mechanisms [43][44][45]. You and Hong presented multilayer feedforward neural network (MFNN) with complex-valued activation function to model QAM signals of different constellation sizes [46]. Deng et al. also proposed complex-valued radial basis function neural network (CVRBFNN) to deal with QAM signals and the results showed that CVRBFNN performed better than functional link artificial neural network [47]. Hu et al. discussed the global stability of the complex-valued and delayed version of recurrent neural network [48]. Trabelsi et al. proposed complex-valued convolutional feed-forward networks [49]. Yang et al. proposed a novel complex-valued method based on mathematical expression to resolve real-valued prediction and classification problems [50].
In order to forecast gene expression data and identify GRN accurately, complex-valued ordinary differential equation (CVODE) is presented in this paper. Grammar-guided genetic programming (GGGP) is utilized to evolve the structure of CVODE and complex-valued firefly algorithm (CFA) is proposed to search the optimal complex-valued parameters of model. Three real gene expression datasets are applied to test the inference performances of our proposed methods.

Results and discussion
Three real gene expression data sets are utilized to test the performance of our proposed method. All the real gene expression data need to be normalized, which is defined as follows.
where g min and g max are the minimum and maximum of dataset, respectively.
In our experiment, ODE and CVODE models are both applied to predict gene expression data and infer gene regulatory networks with three real gene expression datasets. Our proposed hybrid evolutionary algorithm is utilized to infer ODE and CVODE models and the parameters are the same. In GGGP, population size is set to 50, crossover rate is set to 0.9 and mutation rate is set to 0.1. In CFA, population size is set as 100, and step size is set as 0.02. The parameters are selected according to previous research results [51][52][53]. Sensitive ( S n ) and Specificity ( S p ) are utilized to evaluate the inferred gene regulatory networks. S n represents the proportion of the positive samples to be identified correctly in all positive ones. S p denotes the proportion of the negative samples to be identified correctly in all negative ones.

SOS DNA repair network
SOS DNA damage repair network is selected as the first real gene regulatory network, which contains six main genes (uvrD, lexA, umuD, recA, uvrA and polB). The real structure of SOS DNA damage repair network is depicted in Fig. 1. The gene expression data used in this experiment are from Uri Alon [54,55]. The DNA repair process of E. coli is observed by ultraviolet irradiation. The expression levels of six genes at different time points are measured.
The datasets from the first three experiments are utilized to optimize the CVODE models. The last dataset is utilized to test. With hybrid evolutionary algorithm the corresponding CVODE models of six genes are described as follows. Variables x 1 , x 2 , . . . , x 6 represent six genes: uvrD, lexA, umuD, recA, uvrA and polB, respectively. ODE models are also utilized to infer SOS network. The predicted results of ODE and CVODE for six genes are depicted in Fig. 2. From Fig. 2, it could be seen that the results of CVODE predicted are closer to real ones than ODE. In order to compare clearly, the prediction RMSE values of ODE and CVODE for six genes are listed in Table 1, which reveals that CVODE model has smaller prediction RMSE values than ODE for six genes. On average CVODE could decrease 34.4% RMSE value.    In order to test the modeling performance of CVODE further, the fitness values of ODE and CVODE against generations are plotted in Fig. 3. From Fig. 3, we can see   Fig. 4. Compared Fig. 1 with Fig. 4, it could be seen that CVODE model could identify 7 true-positive (real) regulations and 17 false-positive regulatory relationships, while ODE model could identify 5 true-positive relationships and 18 false-positive relationships. The performances of S n and S p of SOS networks inferred by ODE and CVODE are listed in Table 2. In terms of S n , CVODE is 39.99% higher than ODE. In terms of S p , our method is 9.1% higher than ODE. The results reveal that our method could identify all true-positive regulatory relationships and less false-positive regulations than ODE model.

Human cell time-series data
Cell cycle refers to the whole the process from the beginning of one cell division to the end of the next division. The second real gene expression time series data are from the genes periodically expressed, which were supplied by Whitfield [56]. The experiment includes five periods and 1134 genes. The sub dataset extracted in this part includes five genes and 46 time points, in which the first 40 time points are utilized to search the optimal CVODE models and the rest time points are utilized to test the performance. By applying our method, we have obtained the following CVODE models. Variables x 1 , x 2 , . . . , x 5 represent five genes, respectively.
The prediction errors of ODE and CVODE are depicted in Fig. 5, which reveals that the prediction errors of CVODE are closer to zero than ones of ODE. The predicted RMSE values of ODE and CVODE are listed in Table 3. For the predicted results of five genes, CVODE has smaller prediction RMSE values than ODE, which show that CVODE could more accurately predict gene expression data. On average CVODE could decrease 49.07% RMSE value. (3)

E. coli database
The third real gene expression dataset is extracted from polymicrobial probe database (version 4 build 6), which contains 4297 genes and 907 biology experiments [57]. The used dataset contains the first 35 experiments, in which the first 30 time points are utilized to search the optimal CVODE models and the rest time points are utilized as testing data. Eight genes (Crp, araC, nagC, chbC, araE, araA, chbA and chbF) are chosen and the real network structure is from RegulonDB [58] containing 15 regulations, which is described in Fig. 6.  Through our proposed hybrid evolutionary method, we could obtain the CVODE models as Eqs. (4). Variables x 1 , x 2 , . . . , x 8 represent genes Crp, araC, nagC, chbC, araE, araA, chbA and chbF, respectively. The predicted errors of ODE and CVODE are depicted in Fig. 7, which reveal that the predicted errors of CVODE are closer to zero than ones of ODE. The predicted RMSE values of ODE and CVODE are listed in Table 4. For Gene1, Gene2, Gene4, Gene7 and Gene8, CVODE has smaller predicted RMSE values than ODE. For Gene3, Gene5 and Gene6, ODE performs better than CVODE. On average CVODE could improve 20.69% prediction accuracy and more accurately forecast gene expression data than ODE.

Fig. 7 Prediction errors of ODE (blue lines) and CVODE (red lines) with E. coli database
According to the optimal CVODE models, the inferred E. coli sub-network is described in Fig. 8b. ODE models are also utilized to infer E. coli and the result is also depicted in Fig. 8a. Compared with real network structure, we can see that CVODE model could identify 14 true-positive regulations and ODE model could identify 12 true-positive relationships. The performances of S n and S p of E. coli networks inferred by ODE and CVODE are listed in Table 5. From the inferred results, it could be seen that CVODE could identify more accurately gene regulatory network than ODE model.

Conclusions
In this paper, we have presented complex-valued ODE model to identify the regulations among genes for gene regulatory network inference. Grammar-guided genetic programming and complex-valued firefly algorithm are proposed to search the optimal structure and parameters of CVODE. Three real gene expression datasets are utilized and the results reveal that CVODE model could not only improve 20%-50% prediction accuracy of gene expression data, but also identify more true-positive regulatory relationships than ODE.
Our proposed method has some advantages including (1) compared with ODE, CVODE has the complex-valued structures, constants and coefficients, which could improve the modeling ability; (2) GGGP overcomes the shortcomings of GP and CFA has more population diversity and faster convergence than the traditional firefly algorithm, so our proposed hybrid optimization method based on GGGP and CFA could search the optimal model faster; (3) because of the equipment and sample tissues, the expression data may contain noise during the data collection process. Complexvalued methods have higher noise-robust memory mechanisms, so CVODE model has better performance than ODE.
In the further, CVODE will be applied for real large-scale GRN inference. The optimization of complex-valued model needs many computing resources, so the parallel technologies will improve the learning speed of model in the following work.

Complex-valued ordinary differential equation model
Complex-valued ordinary differential equation (CVODE) model is a variant of ODE model, whose coefficients and functions are complex-valued. Its expression is given as followed [59]. Fig. 9 The derived tree of CVODE model sin z + cos z − z where Z is complex-valued independent variable, β is complex-valued coefficient vector, t is the time point and F (·) is complex-valued unknown function.

Structure optimization
Grammar-guided genetic programming (GGGP) is an improved version of genetic programming (GP), which overcomes the shortcomings of GP, such as the generation and preservation of valid programs, small search space, and complex genetic operators [51][52][53][54][55][56][57][58][59][60]. GGGP is presented to evolve the structure of CVODE model. In GGGP, context-free grammar (CFG) model is utilized to guide the evolutionary process of GP in order to obtain the optimal solution faster.
CFG model contains a quadruple, which is represented as G = {N , T , P, } , where N is non-terminal symbol set, T is terminal symbol set, P is production rule set and is beginning symbol set. Four sets satisfy the conditions: N ∩ T = φ and ∈ N . An element in production rule set is represented as x → y , where x ∈ N , and y ∈ N ∪ T . In order to represent an example of CVODE model sin z + cos z − z , four sets of CFG model are defined in advance. N = {s, exp, op, pre, var} , T = {sin, cos, +, −, z} , = {s} , and P is given as The derived tree of CVODE model sin z + cos z − z is obtained by generating sentence through context-free grammar, which is depicted in Fig. 9.
In GGGP, the individuals come from the derived trees of the grammar models. As the same as GP, GGGP also has genetic operators to evolve the derived trees, which contain selection, replication, crossover and mutation. Selection and replication mechanisms are consistent with GP. For crossover mechanism, two derived trees are randomly selected, and two sub-trees with the same source characteristics are selected to be crossed, whose root nodes have the same non-terminal symbols. For mutation mechanism, an internal node is selected randomly. If the node is a non-terminal node, the node is retained and the sub-tree with this node as the root node is deleted and replaced by a sub-tree created by syntax rules. If this node is the terminal node, it is replaced by a new terminal node created randomly.

Parameter optimization
Complex-valued firefly algorithm (CFA) is an efficient complex-valued evolutionary algorithm, which is based on the mutual attraction and movement processes of complex-valued fireflies [52]. Compared with many traditional evolutionary algorithms, CFA has some advantages, such as simple design, few parameters, strong robustness, high population diversity and fast convergence.
In CFA, the positions of firefly populations are complex, so the real parts and imaginary parts of the positions of firefly populations need to be optimized in parallel. In CVODE model, complex-valued constants and coefficients need to be optimized. In this paper, CFA is presented to evolve the parameters of CVODE. The detail optimization process of parameters of CVODE with CFA is given in Algorithm 1.

The flowchart of GRN inference with CVODE
In this paper, CVODE is utilized to infer gene regulatory network, in which GGGP and CFA are proposed to optimize CVODE model. Suppose that gene expression data [D 1 , D 2 , . . . , D m ] contains m genes and each gene contains n time points ( D i = [D 1 i , D 2 i , . . . , D n i ] ). The inference flowchart of GRN with m genes by our proposed algorithm is described in Fig. 10 The detailed process is given as follows.
(1) Due to that the input data of CVODE model are complex-valued, gene expression data need to convert into complex data using Algorithm 2 before GRN inference. (2) The regulatory relationships of each target gene are inferred by CVODE model independently. For gene i, with gene expression data, the process of searching the optimal CVODE model is introduced as followed.
1. Initialize N CVODE individuls. 2. Calculate the fitness values of N CVODE individuals. The complex outputs need to be converted into real values by Algorithm 3. 3. GGGP is applied to search the optimal structure of CVODE, which is introduced in Sect. 2.2. At some iterations, CFA is utilized to search the optimal complex-valued parameters of CVODE, whose structure is fixed. 4. If the satisfied condition is achieved, the optimal process stops; otherwise, go to step 2).
(3) If the optimal CVODE contains independent variables, the corresponding genes of independent variables could regulate the target gene. For example gene j is included in the optimal CVODE dZ i dt = f (Z j ) , which means that gene i is regulated by gene j. (4) i + +. If the regulatory relationships of all genes have been identified, create overall gene regulatory network; otherwise go to step (2).