Mathematical definitions
The gene expression matrix EGCTis composed of genes G = {g1, ⋯, g|G|}, conditions C = {c1, ⋯, c|C|} and time-points , were |G| denotes the number of genes, |C| the number of conditions and |T
c
| the number of time-points measured under the condition c. If |T
c
| varies throughout the dataset, only independent response modules can be mined, whereas, coherent and independent response modules can be mined in datasets where all conditions contain d samples. Egctrefers to the expression value of gene g under condition c at time-point t. The vector egcTspecifies the profile of gene g under condition c over all time-points |T
c
| (equation 1). Using this notation, a row (egCT), containing all conditions and time-points of a single gene, is defined in equation 2 and a column (eGcT), containing all genes and time-points under a single condition, in equation 3.
Further, we define the average trajectory over all conditions C
m
and all genes G
m
, as well as the average trajectory over all genes G
m
within one condition . Each trajectory contributing to the average is assigned a weight. This weight is specified for each gene and condition by the vectors WGand WC. The denominator normalizes the average profile in accordance to these weight vectors:
To quantify the similarity of two genes we apply the Pearson distance ρ to their profile. This distance can be related to the Pearson correlation coefficient r by the simple formula ρ = 1 - r. Now, we can provide a mathematical definition for the coherent and independent response modules, where M refers to a module containing the genes G
M
and the conditions C
M
.
Coherent modules
For the coherent modules the Pearson distance of each condition c, containing the genes G
m
, to the average trajectory (equation 4) must not exceed the threshold τ
C
. Accordingly, the Pearson distance of every gene g, under the conditions C
m
, to the average trajectory (equation 4) must not exceed the threshold τ
G
. Coherent modules are defined as follows:
Independent response modules
In case we wish to mine independent response modules instead of coherent modules, comparisons are restricted to be within conditions. This restriction is imposed because comparisons between conditions are only desirable if one wants to align the profile of different conditions, as it is done for the coherent modules. Thus, for each condition c ∈ C
m
we average over the Pearson distance of each trajectory contained in G
m
to the average trajectory of all genes contained in G
m
(equation 5). This average must not exceed the threshold τ
C
. Accordingly, for each gene g ∈ G
m
we average over the distance of each trajectory contained in C
m
to the average trajectory of all genes contained in G
m
. This average must not exceed the threshold τ
G
. Independent response modules are defined as follows:
Single response modules
As a special case of coherent or independent response modules, the single response module type can be defined by setting the number of conditions to one.
Note, that all definitions are symmetric with respect to the genes and conditions. For the independent response modules no assumption is made regarding the comparability of the expression values among the conditions. This enables the comparison of experiments with different dimensions, time intervals, and normalization protocols.
Preprocessing
Prior to the preprocessing procedure, genes are filtered from the dataset if they do not have a two-fold difference under at least one condition. This step aims at removing noise signals and unaffected genes. For the Arabidopsis thaliana dataset, a control measurement was available as reference point. For the Homo sapiens multiple sclerosis dataset no control measurements were available. Here we used the first measurement as a reference for all time-points.
EDISA is designed to refine initial modules sampled from prefiltered datasets. Such modules could be randomly drawn from the dataset. However, this leads to a predilection for strong signals, which is a recognized problem of the ISA approach. Therefore, before applying the module refinement, samples are drawn from the dataset with the aim of creating initial modules which follow a certain trajectory representing the signal of a module, while the set of all samples covers a broad range of signals. The approach proceeds by randomly sampling seed genes and, according to the Pearson distance, selecting its s - 1 nearest neighbors within one of the conditions, where s is the desired sample size. To cover a broad range of signals we are not interested in drawing the same genes too frequently. Therefore, we draw genes without replacement to obtain module samples of size s = 30.
Module scores
The mathematical definitions of the modules specify the set of all modules of a desired type. However, to mine for modules contained in the set, we need to define a scoring function, in accordance with the module definition. This scoring function is employed throughout the iterative procedure of the EDISA algorithm. Analogous to the coherent module definition, the scoring function for coherent modules is defined:
Analogous to the independent response module definition, the scoring function for independent response modules is defined as follows:
Iteration
At each iteration step i the iteration scheme applies a filter to remove those genes and conditions from Giand Ci, which do not meet the module criteria. This results in new gene and condition sets Gi+1and Ci+1, for the next iteration step i + 1. In order to explicitly mine for either coherent or independent response modules, the score of each module is computed with the respective scoring function. This procedure implies that genes and conditions are treated equally.
Assume, we are searching for coherent modules. Then, given the current Giand Ci, the score for each gene and condition in the module is computed using the scoring function S
co
:
Ci+1= {∀c ∈ Ci|S
co
(GicT) <τ
C
}
Given Giand Ci+1the next iteration step is:
Gi+1= {∀g ∈ Gi|S
co
(gCi+1T) <τ
G
}
To search for independent response modules, the scoring function S
ir
is applied.
Ci+1= {∀c ∈ Ci|S
ir
(GicT <τ
C
}
Given Giand Ci+1the iteration step is:
Gi+1= {∀g ∈ Gi|S
ir
(gCi+1T) <τ
G
}
These iteration formulas are repeated until Gi= Gi+1and Ci= Ci+1holds.
Average trajectory calculation
The initial sample drawn by the preprocessing step has a fixed size s. Often, the size of this sample is larger than the number of genes contained in a module. Thus a small signal is embedded into a relatively large amount of background noise, which is likely to dominate the average. To account for this effect, a weighted average trajectory is used (equation 4), which takes advantage of the preprocessing procedure. Thereby, s genes are selected iteratively, where at every step the gene with the smallest Pearson distance to the seed gene g is added. To assign a weight to each gene, we generate a weighting vector WGby drawing s samples from the exponential distribution with λ = 1. These weights are sorted and the highest weight is associated with the seed gene g. Then they descend according to their Pearson distance to g from 1 to s.
Automatically refining the parameters τ
C
and τ
G
The mathematical definition of the modules defines a set of modules by applying global thresholds τ
G
and τ
C
. However, initial modules are often fuzzy and contain random genes, disrupting the average of the final module. Thus, initially a less restrictive threshold is needed, which, as the iteration proceeds, can be decreased to narrow the module down to its dense core. This refinement is only employed during the iteration procedure and does not affect τ
G
and τ
C
in the postprocessing.
The adaptation of τ
C
and τ
G
is accomplished by applying a k-means clustering with k = 2 at each iteration step. Thus, k-means separates the genes or conditions into two sets, one which should remain in the module and another which should be discarded. For both sets the module definitions are applied to calculate the minimal acceptable values of τ
G
and τ
C
. The new thresholds are then set to the minimal τ
G
and τ
C
, respectively. Given these thresholds, the iteration formula refines the modules in accordance with the clusters determined by k-means. The thresholds are left unaffected if k-means is unable to partition the modules.
This adaptation procedure increased the performance of EDISA significantly on the synthetic dataset.
Postprocessing
The EDISA approach draws a large number of random samples. It is inevitable that this approach can yield the same module a number of times. Furthermore, a maximal module may be found along with numerous copies of its submodules. Consequently, for a proper evaluation of the results, the sampled modules are merged.
First the merging procedure filters out all modules with a value above a specified threshold τ (equation 8 or 9). Then, a k-means clustering, with 10 restarts, is performed on the remaining modules. The clustering operates on the pairwise Pearson distances of the centroids, so that similar centroids are clustered. The parameter k is set to the number of principal components which explain for 95% of the variation in the centroid distance matrix. All modules that cluster together are tested for inclusion and all modules are discarded which are subsets of other modules. This inclusion test could also be performed without clustering the modules, however, the clustering procedure provides a significant runtime improvement. The merged modules are refined by two extension steps. The first extension step considers all genes in the dataset and adds them to the module if their correlation to the average module trajectory is below the threshold τ
G
. Accordingly, the second extension step considers for every condition whether it should be added to the module, by matching its average correlation against τ
C
. Both extension steps are carried out in accordance with equations 6 and 7. After the extension step a final filter is applied, removing all modules which have an overlap (according to equation 14) of more than 75% with another module.
Organization of the modules
A requirement for module definitions is that the modules are allowed to overlap. To visualize the number of genes shared by different modules, we represent their relationship by a graph, in which the edges indicate the degree of overlap between two modules. The edge weight is calculated by equation 14. A weight of 0 indicates no overlap and 1 indicates module identity. Edges with a value below 0.15 are not drawn.
Evaluation on the synthetic dataset
To evaluate the EDISA approach, we applied it to a synthetic dataset with implanted modules. To obtain a score for recovered modules, each module M is compared against the most similar implanted module (equation 15). A perfect correspondence of the recovered and implanted modules results in a score of 1, whereas completely disjoint modules score with 0. The equation employed for this score is similar to the previously introduced postprocessing equation (equation 14). Two sets of modules R1 = (M1, ... M|R 1|) and R2 = (M1, ... M|R 2|) are compared based on the genes G
n
and conditions C
n
which are part of the respective modules.
Analysis of cis-regulatory elements
The analysis on cis-regulatory elements has been carried out with the RSA-tools package described by van Helden et al. [45, 46], which is based on the frequency of a motif over its respective background frequency. Motifs found significantly enriched in the 1100 bp upstream region of the translation start site (ATG) were subsequently compared to the PLACE database [47] to identify motifs of known regulatory function.
Datasets
Synthetic
The synthetic dataset contains 1,000 genes measured over 10 conditions with 6 time-points each. The measurements were generated by drawing the first time-point from a normal distribution with a mean of 5 and a variance of 0.3. The remaining time-points were sampled from a normal distribution with a variance of 0.3 and a mean centered at the first time-point. Into this background model four modules were implanted. Each module contained 50 genes and three of the modules were overlapping. In the case of perfectly correlated modules, the noise level within the modules is σ = 0. To introduce artificial variance to the modules, noise was added to the modules by a normal distribution with different standard deviations σ = (0.1, 0.3, 0.5, 0.7, 0.9) [see additional file 1].
Homo sapiens multiple sclerosis dataset
The dataset was generated in the course of a pharmacological study analyzing the response of multiple sclerosis patients to IFN-β treatment [25]. Peripheral blood of 14 multiple sclerosis patients was obtained and the measurements were conducted before as well as 1, 2, 4, 8, 24, 48, 120, and 168 h after the treatment. This dataset was obtained from the authors of the Guttman et al. publication [25].
Arabidopsis thaliana abiotic stress dataset
The Arabidopsis thaliana dataset is provided by the AFGN (Arabidopsis Functional Genomics Network) and available at TAIR [22]. Within this project, 9 time-series experiments were conducted [26]. Among these, we extracted a group of abiotic stress stimuli for the tissues shoot and root, as well as the respective control measurements. The stress conditions and their reference numbers at TAIR are (cold: ME00325, osmotic: ME00327, salt: ME00328, drought: ME00338, genotoxic: ME00326, uv-b: ME00329, wound: ME00330, heat: ME00339). Each of these time-series contains 6 to 9 measurements with two biological replicates.
The signals were normalized with GCRMA [48], the biological replicates were averaged and finally the log2 was taken.