Our work focuses on methods for reverse engineering which operate on time-resolved gene expression experiments (in terms of mRNA concentrations). We define a time series profile for a gene measured over n time points as a sequence of expression values x =<x1 , ..., x
n
>, where each x
i
, 1 ≤ i ≤ n, corresponds to a distinct time point t
i
. In addition, let each of the m genes be represented by r time-resolved replicates over n time points. Here, we use the mean of r = 6 replicates, resulting in an m × n data matrix M. Let M
i
, 1 ≤ i ≤ m, denote the ith row of a matrix M, which corresponds to the time-resolved expression profile of the ith gene.
The general reverse engineering method based on a particular similarity measure μ and a scoring scheme F operating on the data matrix M is given in Algorithm 1.
Input:
M, matrix with m rows (genes) and n columns (time points),
μ, similarity measure,
F, scoring scheme
Output:
m × m adjacency matrix, A, of the reconstructed network G
1 foreach gene i, i ∈ {1,..., m} do
2 foreach gene j, j ∈{1,..., m}, j ≠ i do
3 w
ij
←|μ(M
i
, M
j
) |
4 end
5 end
6 C : c
ij
← w
ij
· f
ij
;
7 chose a threshold τ ;
8 a
ij
← 1 if c
ij
>τ;
9 a
ij
← 0 if c
ij
≤ τ;
Algorithm 1: General reverse engineering method based on a similarity measure μ and a scoring scheme F. f
ij
∈ F, c
ij
∈ C, a
ij
∈ A, and w
ij
∈ W, where W is the matrix obtained by applying μ on all pairs of rows of the given data matrix M.
The evaluation of the scoring schemes and measures is generally performed in R[40] using available packages, as noted in the manuscript. Additionally, several C routines were developed in order to improve computational speed.
In what follows, we describe the procedure for generating the synthetic time series data, and present the definitions of the similarity measures and scoring schemes used in the comparative analysis. Furthermore, the details of the ROC analysis are briefly reviewed.
Synthetic data sets
The evaluation of the existing methods for reverse engineering gene regulatory networks often employs real time-resolved expression data. However, these data include the convoluted effects of regulons (genes under regulation by the same regulatory protein) and stimulons (genes under regulation by the same external influence), which renders it challenging to realistically assess the performance of investigated methods. Moreover, not every regulatory subnetwork leads to expression of the participating genes over the measured time period and particular condition of interest. These facts lead to a lack of control when using transcriptomics time series data sets for network inference.
Following the example of the DREAM challenge [2, 14], in this comparison study we use synthetically generated data sets to overcome the described disadvantages. The usage of these synthetic data, in contrast to real measurements, enables us to directly compare the performance of different reconstruction tools, since the topology and dynamic of the underlying network is known a priori. In particular, we use the synthetic network generator, SynTReN [41, 42], which creates synthetic transcriptional regulatory networks by providing the edges of the network, as well as information about the interaction (activating, repressing, or dual). Additionally, the generator produces simulated gene expression data of the associated mRNA concentrations for each gene, based on Michaelis-Menten and Hill kinetics, which approximates experimental expression measurements. These expression data sets are uniformly sampled in time.
In SynTReN, the levels for three types of noise are user definable: (1) biological noise, corresponding to biological variability given by the stochastic variations in gene expression, (2) experimental noise, corresponding to the technical variability, and (3) noise on correlated inputs, which accounts for the influence of several activated genes on a regulated gene. Note that different noise levels are included in this study, but we make no distinction between the strength of the three noise types.
In particular, we generate regulatory networks from the GRN of E. coli and S. cerevisiae, using the cluster addition strategy to select a connected subgraph: In each iteration, a node is randomly chosen and added to the graph together with all its neighbors. This strategy is chosen since it is an efficient method to extract a subnetwork that approximates well the topology of the source network. The results presented in this work are obtained for subnetworks of distinct sizes which differ in degree and clustering coefficient. In particular, we investigate an E. coli subnetwork of 100 genes including 121 links where 10 of the genes code for transcription factors. It is characterized by an average degree of 2.42 and a clustering coefficient of 0.016. Additionally, we examine two more networks: (1) An E. coli subnetwork of 200 genes (34 coding for transcription factors) that includes 303 links and is characterized by an average degree of 3.03 and a clustering coefficient of 0.019, and (2) a S. cerevisiae subnetwork of 100 genes (14 coding for transcription factors) that includes 123 links and is characterized by an average degree of 2.46 and a clustering coefficient of 0.026. The degree distributions are shown in Additional file 1, Figure S14. Synthetic gene expression data with 10 biological (n = 10 time points) and 6 technical (r = 6) replicates have been generated from the particular networks. We use the means of the technical replicates for the interaction analysis. An example of a generated network (100 in E. coli used for our investigations), and the simulated gene expression data sets are visualized in Figure 7.
Interpolation
Since transcriptomics time series data usually consist of expression at a few, possibly non-uniformly sampled time points, data interpolation is often the first pre-processing step. Different techniques, including linear [43] and nonlinear interpolation methods, such as cubic- [33] and b-splines [25] have been applied [11]. Although these methods extend the available data, they also introduce artefacts due to over-fitting. For this reasons, we preclude from interpolating the synthetic time series data (whenever possible) for ranking of the used similarity measures. Nevertheless, for reason of comprehensiveness, we consider the effect of interpolation using cubic splines on the investigated similarity measures and scoring schemes of highest rank in absence of noise.
Similarity measures
Reverse engineering of regulatory networks relies on the inference of interrelationships among genes, based on similarity measures. In general, given two time series x and y over n time points, a similarity measure is given by the mapping μ : Rn × Rn → I, where I, I ⊆ R. The so-defined pairwise similarity measure detects (non)linear relationships between two variables, in the considered case, between two gene expression time series. The definition allows for the measure to be symmetric, which is not commonly the case for gene regulatory interactions. Moreover, if two genes are linked indirectly via a third gene, the pairwise measure can not distinguish direct from the indirect relationships and hence additional false positive links will be introduced in the network reconstruction.
Nevertheless, the definition of the similarity measure can be extended to conditional and partial measures, incorporating the possibility to exclude the influence of a third gene. Conditional similarity measures are more general, since they do not rely on specific assumptions on the probability distribution (deduced from the time series associated with a discrete random variable), but estimate the distribution which in turn impedes the computation of the measure from short time series. On the other hand, partial measures can indicated conditional independence reliable only for multivariate Gaussian variables.
To be able to discern the direction of a putative interaction and hopefully eliminate any spurious effects, we consider the conditional and partial variants for several of the measures detailed below. We term the basic pairwise measures as simple, in comparison to their conditional and partial variants. An overview on the measures included in this study is given in Table 1.
Measures operating on vectors
Some of the standard measures used for determining gene regulatory interactions are based on the calculation of the distance between expression time series regarded as vectors. In the following, x and y will denote the vectors <x1, ..., x
n
> and <y1, ..., y
n
>, respectively. Our study includes:
L
s
norm:
This distance measure for vectors x and y is defined as follows:
In our study, s = 10, which corresponds to the number of available time points.
Euclidean distance
Furthermore, we consider the well-known Euclidian distance, which is a special case of the Ls norm, with s = 2. Therefore, it is defined as
Manhattan distance
We also study the performance of the Manhattan distance which represents the shortest path between two points, placed on a rectangular grid, and is analogous to the L1 norm:
Dynamic time warping (DTW)
In addition, we investigate the performance of the DTW, which to our knowledge, has not been applied to the problem of gene regulatory network inference, but rather on clustering genes expression data [43, 44]. The DTW-based measure relies on finding the optimal (least cumulative) distance mapping a given time series into a reference time series, where both sequences may vary in time and/or speed. It was originally developed for speech recognition [45, 46], but has been recently used for different data mining tasks in medicine and bioinformatics [43, 47]. The concept of DTW is sketched in Figure 8 for two short time series with 4 time points each. In the first step of the DTW algorithm, local distances (e.g., Euclidean or Manhattan distance) for all pairs of time points are calculated. Then, the time series are mapped into each other by linking various time points, such that each point is included at least once and the sum over the lengths of all those links is minimal (optimal alignment path). Here, we use the DTW as implemented in the R-package "d tw" [48–50], with the Euclidean as point-wise local distance, and different step patterns which indicate the local constraints of the alignment paths. We include three different step patterns, namely
-
symmetric1
-
symmetric2
-
and asymmetric
to find an optimal alignment. Here μ
EC
denotes the local (Euclidean) distance, and the measure μ
W
the cumulative distance (representing the minimum sum of local distances along the alignment paths).
The resulting matrix of cross-distances D contains the pairwise calculated distance measures (μ
EC
, μ
L
, μ
MA
, or μ
W
, as defined above) and is, in all cases, normalized by the largest value occurring in the matrix as follows:
The similarity measure is then defined by:
Measures operating on random variables
Despite the representation of the expression time series as vectors, time series x = <x1, ..., x
n
> can be associated with a discrete random variable X with probability distribution p(x), x ∈ X that is approximated by the frequency via standard binning arguments. This representation allows to calculate several widely used similarity measures, such as correlation and information-theoretic measures. Note that the temporal information is lost by this representation of time series data. We first review the most commonly used measures in detail:
Pearson correlation
This similarity measure quantifies the linear relationship between the random variables X and Y, corresponding to two time series x and y:
where E denotes expectation
If the variables are independent, the correlation coefficient is μ
p
= 0, but the opposite is not true, as this coefficient is sensitive mainly to linear dependencies. Note that μ
P
receives values in the interval [-1, 1] and is symmetric.
Conditional Pearson correlation (CPC)
By using the conditional expectation value
where p(x|y), x ∈ X, y ∈ Y is the conditional probability distribution, one can provide the following definition for CPC:
Thus, the conditional correlation among the time series x and y of the corresponding genes, eliminating the influence of all other genes is defined as
Partial Pearson correlation
Analogously, one could also consider
where
The residuals are calculated following Eq. (16) making a linear regression of x (respectively y) depending on z:
Rank correlations can be used as a more general measure of interdependence, not restricted to a linear relationship. Even though they measure a different type of relationship than the product moment correlation coefficient, like the previous correlation measures, these are also defined in the interval [-1, 1].
Spearman's rank correlation
This type of correlation is based on the rank distribution of the expression values:
where R(x) is the rank of x.
Kendall's rank correlation
Another rank correlation is
with n
c
being the number of concordant pairs, and n
d
the number of discordant pairs of the rank sets.
It is common to regard the rank correlation coefficients (especially Spearman's rank correlation) as alternatives to Pearson's coefficient, since they could either reduce the amount of calculation or make the coefficient less sensitive to non-normality of distributions. Nevertheless, they quantify different types of association.
Unlike most of the measures discussed here, the correlation measures do not only provide an information about whether two genes are interacting, but also whether it is an activating or repressing relationship. As the latter information is outside of the interest of the current study, only the absolute value (respectively the square) of the correlation coefficient is taken into account.
Additionally to the correlation, information-theoretic measures are also defined using random variables as relevant representation for expression time series.
Mutual information
One of the most commonly used measures for inferring interdependencies between two subunits of a system is the mutual information (MI) [16]. Intuitively, MI measures the information content that two random variables X and Y share. The simple mutual information can then be expressed in terms of the marginal entropies H(X) and H(Y ), and the joint entropy H(X, Y ) using the definition of the Shannon entropy
which quantifies the uncertainty associated with a random variable. Hence the simple MI is defined as
It includes also non-linear interrelations, but same as the other simple measures, the simple MI cannot be used to distinguish between direct and indirect relations.
Conditional mutual information
However, if we replace the marginal and joint entropies by the conditional analogs, H(X|Z), H(Y|Z) and H(X, Y|Z), we can eliminate the influence of a third variable Z. Hence, we calculate the minimal information shared by the time series x and y of two genes conditioned on each z
k
, k = 1, ..., m. Thus the conditional--sometimes also referred to as partial [51]--mutual information (CMI) can be written as:
Where
The degree of interaction is indicated by the values of μ
I
or
normalized by the largest value occurring among all pairs of genes.
Mutual/conditional coarse-grained information rate
Another approach based on information-theoretic aspects are the coarse-grained measures. Here, instead of approximating the exact entropies of time series, relative measures of "information creation" are used to study the interrelationship of two (sub)systems. Thus, for this purpose, the calculation of coarse-grained entropy rates [52] is used to replace the approximation of the Kolmogorov-Sinai entropy (metric entropy of a dynamical system): First, a time lag l
max
is determined such that
among all analyzed data sets. Then, the coarse-grained information rate (CIR) is given by the norm of the mutual information
Usually the parameter l
min
and Δl (difference between consecutive time lags) can be set to one, and thus the CIR becomes
Hence, the mutual coarse-grained information rate (MCIR) is defined as
whereas the conditional coarse-grained information rate (CCIR) as
with
Finally, a normalization by the largest value occurring among all pairs of genes is performed. These (normalized) coarse-grained information rates are then used to indicate the degree of interaction.
Model-based measures
Granger causality
A rather new approach for inferring gene regulatory networks is the Granger causality (GC). Given the time series x and y, two linear autoregressive (AR) models are estimated, both including the past of x, and additionally, one of them including the past of y. In order to determine the optimal order q of the AR model, which denotes the number of past time points which have to be included, we use the function "VARselect" from the R-package "vars" [53, 54] based on the Akaike information criterion (AIC) [55]. The AIC is a measure of the goodness of a fit of an estimated statistical model, deduced as a tool for model selection. In the general case, the AIC is defined as:
where u is the number of parameters in the statistical model, and L is the maximized value of the likelihood function for the estimated model.
With properly selected AR models, the part of the variance in the data which is explained by one model in comparison to the other one, provides an information on the causal relationship. This comparison can be formulated in terms of an index.
Thus the Granger Causality index, denoted by μ
G
for the simple linear measure, as defined in [34, 35] via the covariance σ, is:
and can be inferred from the AR models:
where a11i, a21iand a22iare the parameters of the models and u1t, respectively u2trepresents white noise.
Conditional/partial Granger causality
As for the previous measures, we use the conditional and partial (linear) Granger causality measures (CGC and PGC) as defined in [34, 35], in order to identify existing indirect relationships. Hence, the AR models are formulated as
for the conditional, and, in addition,
for the partial Granger causality, with a11i, a12i, a21i, a22i, a23i, a31i, a32i, a41i, a42iand a43ibeing the parameter of the models and u1t, u2t, u3tand u4trepresenting noise terms.
Using the Eqs. (33) and (34), the conditional Granger causality index is then defined as:
where
and, using the Eqs. (33) to (36), the partial Granger causality index is
where
The degree of interaction is indicated by the Granger causality index normalized by the largest value occurring among all pairs of genes.
Measures operating on symbolic dynamics
Despite the promising applications of interaction measures based on symbolic dynamics in various fields, they have not yet been employed for reverse engineering gene regulatory networks. For instance, in standard nonlinear time series analysis, the usage of symbolic dynamics to uncover patterns of interactions, especially from short data sets [56], has proven as a valuable tool. Therefore, we explore the potential of symbolic dynamics for the problem at hand by using the principle of order patterns. By this principle, as described in [38], the time series are transformed into symbol sequences. An order pattern π of dimension δ is defined by the discrete order sequence of the time series x and has the length δ. Hence, the time series can be symbolized using order patterns following:
where l is the time lag. In terms of gene regulatory network reconstruction, we need to choose a certain number of time points and rank them according to their expression value in order to obtain the order pattern. Then, each possible ordering corresponds to a predefined symbol.
This concept is illustrated in Figure 9 for time series composed of n = 4 time points. As we are dealing with very short time series here, we consider all possible combinations of the chosen number of time points. For instance, for the time series of length n = 4 and an order pattern of dimension δ = 3, we define symbols (order patterns π
k
) for the following groups of time points: (1, 2, 3), (1, 2, 4), (1, 3, 4) and (2, 3, 4), shown in the left panels of Figure 9. Next, we define a symbol sequence
where
denotes the order pattern obtained for gene i from the k-th group of time points and
is the length of the symbol sequence.
In this work, we usually choose the dimension δ such that the length T becomes maximal, given a time series of length n (i.e., δ = 5 for n = 10). However, if the symbol sequences are calculated for longer time series, this approach is not applicable anymore. This is due to the fact that calculations in R are not possible because of the fast growing length of the symbol vectors. Hence, we use for the time series of 20 time points order pattern of dimension 6 instead of the dimension 10, which would lead to the symbol vector of maximal length.
Symbol sequence similarity
Using the above described approach, we infer the interdependency of two genes as follows: Given a certain number δ of time points we define a vector P containing all possible permutations of the ranking, and assign a symbol (order pattern π
k
) to each of them. Next, we define a vector
, using the same symbols as for P, but assigned to the reversed ranking. Now, we count the pattern overlap of two symbol sequences S(i)and S(j)to evaluate the symbol sequence similarity, p1, assuming both time series are interrelated (Eq. (44)), respectively p2 if we assume anti-interrelation (Eq. (45)).
We choose the maximal value of the two frequencies p1 and p2
to define the symbol sequence similarity (SySim).
Mutual information of symbol vectors
We calculate the mutual information of the symbol vectors (SymMI) of maximal length by:
In addition, we consider the mean of the symbol sequence similarity and the mutual information (of the symbol vectors) (SymSimMI)
This is further extended to include symbolic dynamics based on a slope comparison (order patterns for pairs of time points), where we consider
-
the symbol sequence similarity for pairs (
pairs) as a similarity measure
-
and the conditional entropies for pairs (
pairs) as a distance measure, with
.
A novel measure -- the residual mutual information
Estimating entropies from short time series is imprecise, hence the estimation of the mutual information and, in particular, its conditional counterpart, suffers the same disadvantage. On the other hand, the simple mutual information is not able to distinguish between direct and indirect links. Therefore, in order to overcome the encountered problem, we propose a novel partial measure -- the residual mutual information (RMI) defined as:
where
analogously to the idea of partial correlation (the residuals are calculated in the same way as for the partial correlation in Eq. (16)). The degree of interaction in the complex network is then indicated by the values of
, normalized by the largest value occurring among all pairs of genes.
Applied to short data sets, we expect that the residual mutual information performs much better in discriminating indirect links than the conditional MI, as we can abandon the estimation of additional conditional probabilities. Hence, the measure is more robust to effects of small sample size. Furthermore, we expect that the RMI's performance ranges between those of the simple and the conditional mutual information for long time series, since in contrast to the CMI, we eliminate here only the linear influence of the variable Z on X and Y. We postpone the confirmation of this claim for further theoretic analysis.
Scoring schemes
Once a chosen similarity measure has been applied on a given data matrix, there are several possibilities to score the resulting "weights" of putative interactions. In this sense, a scoring scheme F is a matrix of dimensions m × m. Let W denote the matrix obtained by applying μ on all pairs of rows of a given data matrix M, for w
ij
≥ 0, ∀i, j. The scores from a given scoring scheme and similarity measure can then be represented by a matrix C calculated from the Hadamard element-wise product of W and F, such that c
ij
= w
ij
· f
ij
.
To unravel the linkage of genes we apply the relevance network algorithm (Algorithm 1) using different scoring schemes and measures. In principle, all of the measures can be combined with any scoring schemes, but we restrict our investigations to the most commonly used.
IDentity (ID)
The identity scoring scheme corresponds to the basic relevance network [26] approach: Given a specific measure, a particular threshold τ is set in order to account for "true" links between elements in the network. The matrix F is the unit matrix (f
ij
= 1) in this case. Therefore, for a symmetric similarity measure, the identity scoring scheme cannot infer directionality of interactions. We test the performance of all measures mentioned above in combination with this scoring scheme.
Context Likelihood of Relatedness (CLR)
As a second type of scoring scheme, often used for the reconstruction of GRN, we consider the CLR, which is an extension to the basic relevance network approach. Once weights w
ij
have been assigned for each pair of genes according to the strength of interaction inferred from a particular measure, a score is derived, related to the empirical distribution of the values in W. Thus, the matrix F obtains the form
where
and σ
i
(σ
j
) are the mean and standard deviation of the empirical distribution of w
ik
(w
jk
), k = 1, ..., m. The links having c
ij
<τ (with c
ij
= w
ij
· f
ij
and τ a predefined threshold) are removed for the network reconstruction.
Here, we use the CLR as implemented in the R-package "minet"[57, 58], which uses either the simple mutual information or a squared correlation matrix (Pearson's, Spearman's or Kendall's) to measure the strength of interaction among genes. We note that the CLR algorithm cannot infer directionality from symmetric measures.
Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE)
Furthermore, we investigate the Algorithm for the Reconstruction of Accurate Cellular NEtworks, referred to as ARACNE, and include its performance in the current comparison study. The ARACNE is based on the data processing inequality, which states that post-processing cannot increase the amount of information. Hence it follows that:
when gene i and k are not directly linked, but this goes through j, where x
i
, x
j
and x
k
are the expression time series of these genes. In this manner, the algorithm discriminates indirect links. ARACNE is a relevance network algorithm as illustrated in Algorithm 1 as well. First, weights w
ij
(normalized to the interval [0, 1]) are assigned to each pair of nodes. Then the scoring scheme operates as follows: For each triplet of nodes the edge having the lowest weight will be removed (its weight is set to zero), if the difference between the two lowest weights is above a threshold τ
d
, as that interaction is interpreted as indirect. In this manner, the matrix F obtains the form:
Moreover, the ARACNE removes all edges satisfying c
ij
<τ, where τ is a predefined threshold.
As for the CLR algorithm, we rely on the "minet"-package, using the simple mutual information or a squared correlation matrix to determine the weights. By default, the two thresholds are set to zero. The ARACNE does not distinguish the direction of a link from a symmetric measure as well.
Maximum Relevance/minimum redundancy NETwork (MRNET)
As another example of a relevance network algorithm, we consider the Maximum Relevance/minimum redundancy NETwork (MRNET) [59]. This scoring scheme performs series of supervised maximum relevance/minimum redundancy (MRMR) gene selection procedures, where the expression of each gene in turn plays the role of the target output y = x
i
, with V = x\x
i
being the set of the expression data of the input variables, and x the set of the expression levels of all genes. Given the set M of selected variables and pairwise weights w
ij|M
, the criterion updates M by choosing the variable
that maximizes the score
where
is a redundancy term, and u
j
= w
ji
is a relevance term. This scheme therefore assigns higher rank to direct interactions, whereas indirect interactions (redundant information with the direct ones) should receive lower rank. Thus, the matrix F is defined as:
Finally, all edges whose score c
ij
lies below a predefined threshold τ will be removed.
The implementation of the MRNET in the "minet"-package, which we used in our study, assigns the weights based on the pairwise simple mutual information or a squared correlation among the time series of two genes (normalized to the largest value occurring among the pairs). Also this algorithm is not able to infer directionality from symmetric measures.
Time Shift (TS)
In nonlinear time series analysis, the shifting of time series is a common way to infer the directionality of causal relationships. As the driving system by definition has to act first, shifting its time series forward in time (relative to the time series of the response system) should increase the similarity of both time series. Comparing the values of a particular measure for different time shifts gives then a hint on the direction of the interaction. Thus, the time shift scoring scheme starts with a cubic spline interpolation for each pair of genes expression time series. Then the series of the second gene is shifted against that of the first gene. If x and y are two expression time series stored in the ith and jth row of the data matrix M, and
and
are the related interpolated time series, we can then define the shifted time series as
and
where N is the length of the interpolated time series and N
shift
is the assumed shift of
versus
, with N
shift
∈ Z and N
shift
∈ [-0.1 · N, 0.1 · N]. Next,
is evaluated for all possible values of N
shift
, resulting in a vector
(for not significant values of μ the corresponding entry will be set equal 0). The scoring is now given by
In case the largest significant value of the measure is obtained for a negative shift, the regulatory direction from the first to the second gene is kept, while the opposite direction is preserved if the largest significant value is obtained for a positive shift. Furthermore, both regulatory directions are kept, if the maximum arises for a shift of zero or multiple opposed shift values or in the case when no significant value exists. The scoring scheme aims at providing a hint on the directionality, because the absolute values of the calculated correlations on the delayed time series are rather biased as the data sets are quite short.
In the next step of Algorithm 1, the information regarding the directionality are combined with the weight of interaction inferred from a particular measure (c
ij
= w
ij
· f
ij
). Hence, the weights for the unlikely direction are set to zero in order to break symmetries, and thus reduce the number of false positives. Finally, all edges with c
ij
<τ are removed, where τ is a particular threshold.
We test that scoring scheme using the absolute value of of the correlation coefficients μ
P
(Pearson) and μ
S
(Spearman) for pairs of the shifted expression series, where the significance level was set to α = 0.01 and only absolute values of correlation larger 0.9 have been taken into account. The choice of the measure to infer the weights in the first step of Algorithm 1 is independent of that and includes here the mean of sequence similarity and mutual information of symbols, as well as Spearman's and Pearson's correlation. Furthermore, this scoring scheme is applied in addition to (or after) another scoring scheme (e.g., ID, CLR, or AWE). It is important to note that in contrast to the previously described modifications of the algorithm, the scoring scheme we propose here allows to investigate the directionality, also when symmetric measures are considered.
A novel scoring scheme - Asymmetric WEighting (AWE)
Most of the measures used to infer the degree of interaction between pairs of genes, such as correlations or the mutual information, are symmetric. Hence, when applied in symmetric algorithms they are not able to unravel the regulatory dependences, since these measures do not distinguish the direction of the interaction. Thus, we introduce an asymmetric weighting based on topological aspects, for the complete set of pairwise weights obtained from a particular measure, and implement it according to Algorithm 1. In particular, we compute a matrix of weights, where the columns represent the genes which are regulated, and the rows stand for the genes which regulate other genes. The scoring value is then calculated by dividing each row entry by the sum of the corresponding column values. The scoring scheme (and the corresponding matrix F) is defined by:
Hence, the probability that the jth gene is regulated sums up to one:
. Here, the score indicates that how likely a gene is regulating another one depends not only on the strength of interactions, but also on its amount.
Eventually, if c
ij
≥ τ the edge is introduced, otherwise it is omitted.
We test the asymmetric weighting on the matrix W inferred from the symbolic dynamics measures.
ROC analysis
In order to rank the performance of the different similarity measures and scoring schemes, we evaluate to which extent each of them accurately reconstructs the underlying network of regulatory interactions. To this end, we use the receiver operating characteristic (ROC) analysis [60], as it provides indices to value the reconstruction efficiency among all the measures and scoring schemes under study. The ROC analysis is a tool for visualizing, organizing, and selecting classifiers based on their performance in terms of a cost/benefit analysis. For this purpose the ROC space is defined by the false positive rate, fpr, and the true positive rate, tpr, which depict the relative trade-offs between true positives tp (benefits) and false positives fp (costs). An overview on important quantities in ROC analysis is given in Table 3.
While discrete classifiers lead to just a single point in the ROC space, classifiers such as the similarity measures studied in this work produce probability values of how likely an instance belongs to a certain class. Here, the classification depends on a predefined threshold. The ROC curve is then produced by continuously tuning this threshold, which on the other hand can be suggestive on the performance of the measures. However, a well-defined rating is not always possible "by eye". Therefore, different summary statistics are common, for example the area under the ROC curve (AUC(ROC)) or the YOUDEN index (YOUDEN = max(tpr - fpr)) [61]. Another standard evaluation plot in the field of the ROC analysis is the precision/recall graph (PvsR), which is based on the comparison between the true edges and the inferred ones. Hence, it highlights the precision of the reconstruction, and does not suffer from the typically large number of false positives in a gene regulatory network reconstruction. We thus give the summary statistic using the area under the precision-recall curve (AUC(PvsR)) as well as the ROC curve. An efficient implementation of the ROC analysis is provided by the R-package "ROCR" [62].
All ROC curves are evaluated with respect to the underlying GRN, which is a directed graph. As several of the scoring schemes/measures do not distinguish whether the regulation is directed from gene i to gene j or vise versa, some of the false positives will follow from the missing information on the directionality. However, since the network under study is a sparse one, this additional false positives barely carry a weight.