Definitions
Time series gene expression dataset (D): Such a dataset can be modeled as a G × C × T matrix, of which each element d
ijk
corresponds to the expression value of the ith gene over the jth sample and across the kth time point where i ∈ (g
1,g
2,....,g
G
), j ∈ (c
1,c
2,....,c
C
), k ∈ (t
1,t
2,....,t
T
).
Tricluster (M): A tricluster can be defined as a sub-matrix M(I,J,K) = [ m
ijk
], where i ∈ I, j ∈ J, k ∈ K. Sub-matrix M represents a subset of genes (I) that have similar expression profiles over a subset of samples (J) during a subset of time points (K).
Perfect shifting tricluster: A tricluster M(I,J,K) is called perfect shifting tricluster if each element of the tricluster is represented as: m
ijk
=Γ+α
i
+β
j
+η
k
, where Γ is a constant value of the tricluster and α
i
, β
j
and η
k
are the shifting factors of ith gene, jth sample, kth time point respectively.
Mean squared residue: Mean squared residue score (MSR) of shifting tricluster M(I,J,K) can be modeled as [8, 9]
$$ \begin{aligned} MSR =&\, \frac{1}{|I||J||K|} \sum_{i \in I, j \in J, k \in K} r^{2}_{ijk} = \frac{1}{|I||J||K|} \sum_{i \in I, j \in J, k \in K}\\ &(m_{ijk} - m_{iJK} - m_{IjK} - m_{IJk} + 2m_{IJK})^{2}, \end{aligned} $$
((1))
where the mean of the ith gene is \( m_{\textit {iJK}} = \frac {1}{|J||K|}\sum _{j \in J, k \in K} m_{\textit {ijk}}\), the mean of the jth sample is \(m_{\textit {IjK}} = \frac {1}{|I||K|} \sum _{i \in I, k \in K} m_{\textit {ijk}}\), the mean of the kth time point is \(m_{\textit {IJk}} = \frac {1}{|I||J|} \sum _{i \in I, j \in J} m_{\textit {ijk}}\), and the mean of tricluster is \(m_{\textit {IJK}} = \frac {1}{|I||J||K|} \sum _{i \in I, j \in J, k \in K} m_{\textit {ijk}}\).
The MSR score of a tricluster represents the level of coherence among the elements of the tricluster. Hence a lower MSR score means better quality of a tricluster. For a perfect shifting tricluster the MSR score is zero. If we use some global normalization, like min–max normalization globally on the whole dataset, it does not affect the algorithm. Moreover, it can be shown that gene–wise Z-normalization only on a tricluster does not affect the MSR score. However, when we apply similar normalization on the whole dataset, it affects the triclusters, and in turn affects our algorithm. Still we prefer to normalize the dataset in order to eliminate the variability in gene expression profiles due to experimental errors and noises and as normalization reduces the effects of scaling patterns, scaling patterns could also be identified partially.
Steps of EMOA- δ-TRIMAX
The steps the of δ-TRIMAX algorithm [8, 9] have been described in the Additional file 1. Figure 2 shows the steps of our proposed EMOA- δ-TRIMAX algorithm.
Multi-objective optimization problem
The multi-objective optimization problem is equivalent to finding the vector \(\bar {x}^{*} = [x_{1}^{*},x_{2}^{*},\ldots,x_{n}^{*}]^{T}\) of decision variables that satisfies a number of equality and inequality constraints by optimizing the vector function \(\bar {f}(\bar {x}) = [f_{1}(\bar {x}),f_{2}(\bar {x}),\ldots,f_{r}(\bar {x})]^{T}\) subject to some constraints. Here the constraints correspond to the feasible region F that holds all the acceptable solutions; \(\bar {x}^{*}\) stands for an optimal solution. For a minimization problem, Pareto optimality can be formally delineated as: A decision vector \(\bar {x}^{*}\) is referred to as Pareto optimal if and only if there is no \(\bar {x}\) such that ∀i∈{1,2,..,r}, \(f_{i}(\bar {x}) \leq f_{i}(\bar {x}^{*})\) and ∃i∈{1,2,…,r}, \(f_{i}(\bar {x}) \textless f_{i}(\bar {x}^{*})\). In words, \(\bar {x}^{*}\) is called Pareto optimal if there exists no possible vector \(\bar {x}\) that induces a diminution of some criterion without a contemporaneous increase of at least one other criterion [11, 14].
Genetic algorithm
A genetic algorithm is a search heuristic that imitates the process of Darwinian evolution [11, 14]. Here the population is generated randomly and consists of a set of chromosomes that encode the parameters of the search space. A fitness function corresponds to the objective function to be optimized and is used to estimate the goodness of each chromosome in the population. Genetic operators such as selection, crossover and mutation are used to evolve subsequent generations. If some particular criterion is met or the maximum generation limit is reached, then the algorithm finishes its execution.
Encoding chromosome
Each chromosome is represented by a binary string that has three parts. A chromosome encodes a possible tricluster. For a time series gene expression dataset having G number of genes, C number of samples and T number of time points, the first G bits correspond to genes, the next C bits represent the samples and the last T positions stand for the time points. Hence each string is represented by (G+C+T) bits, having a value either 1 or 0. A value 1 means the corresponding gene or sample or time point is a member of the tricluster. Suppose for a 3D gene expression dataset having 10 genes, 5 samples and 8 time points, a string {10010011100011101010101} represents that genes {g
1,g
4,g
7,g
8,g
9}, samples {s
3,s
4,s
5} and time points {t
2,t
4,t
6,t
8} are the members of the tricluster. The initial population consists of a set of randomly generated chromosomes. Retrieval of overlapping genes belonging to several triclusters are guaranteed by the step of chromosome encoding. As each bit of a chromosome in the population represents the presence or absence of genes, replicates and time points in one resultant tricluster, often we could find an overlap between the positions of any two chromosomes containing a value 1. Thus different chromosomes can encode overlapping triclusters. Some genes and/ or samples and/ or time points could be added to the initial population inspite of lying far away from the feature space. To remove such nodes from the population, δ-TRIMAX has been used as a local search heuristic.
Objective functions
After applying δ-TRIMAX as a local search heuristic, each string in the population represents one δ-tricluster having an MSR below a threshold δ. Now we compute values of the following three objective functions for each individual in the population. The first objective function is
$$ f_{1} = \frac{MSR}{\delta}, $$
((2))
where MSR is the mean squared residue score of one tricluster. Hence, f
1 needs to be minimized. The second objective function is
$$ f_{2} = \frac{\left|I\right| \ast \left|J\right| \ast \left|K\right|}{\left|G\right| \ast \left|C\right| \ast \left|T\right|}, $$
((3))
where (|I|∗|J|∗|K|) is the volume of the tricluster and (|G|∗|C|∗|T|) is the volume of the dataset. Our goal is to maximize the value of f
2. Finally the third objective function is
$$ f_{3} = \left| 1-\frac{6\sum {d_{i}^{2}}}{n(n^{2}-1)} \right|, $$
((4))
where d
i
is the difference between the ranks of average expression values (sorted either in ascending or descending order) over a subset of samples at ith time point of each pair of genes in one tricluster and n is the number of time points in that tricluster. Here the goal is to maximize the non-parametric Spearman correlation coefficient (f
3) [15] of the resultant triclusters.
Motivations of objective functions
As the aim of our proposed algorithm is to find triclusters having a lower MSR score and a higher volume, the first two objective functions (f
1 and f
2) ensure to accomplish those goals. Moreover the objective function f
3 is used to maximize the correlation coefficients among genes belonging to the resultant triclusters. We have taken the absolute values of the correlation coefficients just considering the fact that coregulated genes can be both up- and down-regulated by the transcription factors across a subset of time points.
Genetic operators
Here, non-dominated sorting and crowding distance are used for fitness assignment and comparison [11]. A crossover is a generalization of several mutations performed at once, which we have not applied in this work [16]. Instead, we have used bit string mutations with a high mutation probability to generate offspring population from a parent population. In this case, the mutation occurs at random positions through bit flips. For instance, for a binary string {1011010010} we generate a random number ranges from 0 to 1 for each bit of the string. If this random number for a particular bit is less than or equal to the mutation probability, mutation occurs and the value 1 or 0 is changed to a value 0 or 1, respectively. The mutation probability remains same for each of the bits of chromosome. After applying the mutation operator on each individual of the population, some genes/samples/time points can be added to the population that are lying far away from the feature space. To cope with this problem we have applied δ-TRIMAX as a local search heuristic.
Elitism
We have included elitism to keep track of non-dominated Pareto optimal solutions after each generation [11]. Stopping criteria is measured by the convergence metric delineated in equation (8).
Tricluster eigengene
We applied the singular value decomposition method (SVD) on the expression data of each resultant tricluster to detect its eigengene [17]. For instance, \(X^{i}_{g \times (c \ast t)}\) stands for the expression matrix of ith tricluster, where g, c and t represent the number of genes, samples and time points of ith tricluster. Now we apply SVD on the data matrix (normalized to mean=0 and variance=1). Now, the SVD of ith tricluster can be represented as,
$$ X^{i} = UDV^{T}, $$
((5))
where U and V are the orthogonal matrices. U
i is a g ∗ (c ∗ t) matrix with orthonormal columns, V
i is a (c ∗ t) × (c ∗ t) orthogonal matrix and D
i is (c ∗ t) × (c ∗ t) diagonal matrix of singular values. Assuming that singular values in matrix D
i are arranged in non-decreasing order, we can represent the eigengene of the ith tricluster by the first column of matrix V
i, i.e.
$$ E^{i} ={V^{i}_{1}}. $$
((6))
KEGG pathway enrichment
To establish the biological significance of the genes belonging to each resultant tricluster for both datasets we have performed a KEGG pathway enrichment analysis using the GOStats package in R with a p-value cutoff (BH-corrected p-value) of 0.05 [18, 19].
TFBS enrichment analysis
Genes that exhibit similar expression profiles are supposed to be regulated by the same mechanism. To analyze the potential co-regulation of co-expressed genes, we have done a transcription factor binding site (TFBS) enrichment analysis using the TRANSFAC library (version 2012.2) [20]. Here we used 52 million TFBS predictions that are conserved between human, mouse, dog and cow [21]. Out of these 52 million conserved TFBSs we have selected the highest-scoring 1 % for each TRANSFAC matrix to identify the most specific regulator (transcription factor) - target interactions. We have applied a hypergeometric test and Benjamini Yekutieli-FDR for p-value correction to find over-represented binding sites (p-value ≤ 0.05) in the upstream regions of genes belonging to each tricluster [22, 23].
Datasets
Description of the artificial datasets
Artificial dataset 1 (AD1):
First, we have applied the proposed algorithm to an artificial dataset containing 1000 genes, 5 samples and 4 time points. We have then embedded 3 perfect shifting triclusters (standard deviation (σ)=0) of size 100 × 4 × 4, 80 × 4 × 4 and 60 × 4 × 4 into the dataset. In the next step, we have implanted 3 noisy triclusters with different levels of noise (σ=0.1,0.3,0.5,0.7,0.9) into the synthetic dataset.
Artificial dataset 2 (AD2):
Moreover, we have generated another artificial dataset which contains 200 genes, 10 replicates and 10 time points. Afterwards, we have implanted 3 perfect shifting triclusters (standard deviation (σ)=0) of size 50 × 3 × 3, 50 × 3 × 3 and 50 × 3 × 3 into the dataset. In the next step, we have added different levels of noise (σ=0.1,0.3,0.5,0.7,0.9) into the synthetic dataset.
Artificial dataset 3 (AD3):
To evaluate the performance of the proposed algorithm in case of the datasets containing different number of time points, we have generated three additional artificial datasets of size 200 (genes) × 10 (replicates) × 20 (time points), 200 (genes) × 10 (replicates) × 25 (time points) and 200 (genes) × 10 (replicates) × 30(time points) in which we have embedded 3 perfect shifting triclusters of size 30 × 3 × 8, 30 × 3 × 6 and 30 × 3 × 4.
Artificial dataset 4 (AD4):
In order to show the performance of the algorithm for the dataset containing missing values, we have randomly deleted the values of 0.5 %, 1 %, 1.5 % and 2 % of all elements of one artificial dataset of size 200 × 10 × 20 containing three triclusters of size 30 × 3 × 8, 30 × 3 × 6 and 30 × 3 × 4.
Description of real-life datasets
Dataset 1:
In this work, this previously published dataset has only been used for comparing the performance of the proposed algorithm with that of the other existing triclustering algorithms since one of the algorithms we wanted to compare our approach with, OPTricluster, can only be efficiently applied to a short time series gene expression dataset and thus, was not suitable to be used for dataset 2 (see below) [7]. Dataset 1 holds 54675 Affymetrix human genome U133 plus 2.0 probe ids, 3 samples and 4 time points (0, 3, 6 and 12 hours) (GSE11324) [24]. The goal of this experiment was to determine cis-regulatory sites in previously uncharted genome regions, responsible for conveying estrogen responses, and to identify the cooperating transcription factors that also contribute to estrogen signaling in MCF7 breast cancer cells.
Dataset 2:
This dataset contains 48803 Illumina HumanWG-6 v3.0 probe ids, 3 replicates and 12 time points (days 0, 3, 7, 10, 14, 20, 28, 35, 45, 60, 90 and 120) (GSE35671) [12]. All these replicates are independent of each other. The aim of this study was to provide insights into the molecular regulation of hiPSC differentiation to cardiomyocytes.
Dataset 3:
This experiment was carried out to study the dynamics of expression profiles of 54675 Affymetrix human genome U133 plus 2.0 probe ids in response to IFN-beta-1b treatment across four time points over 6 patients (GSE46280) [25].