 Methodology Article
 Open access
 Published:
Minimum redundancy maximum relevance feature selection approach for temporal gene expression data
BMC Bioinformatics volume 18, Article number: 9 (2017)
Abstract
Background
Feature selection, aiming to identify a subset of features among a possibly large set of features that are relevant for predicting a response, is an important preprocessing step in machine learning. In gene expression studies this is not a trivial task for several reasons, including potential temporal character of data. However, most feature selection approaches developed for microarray data cannot handle multivariate temporal data without previous data flattening, which results in loss of temporal information.
We propose a temporal minimum redundancy  maximum relevance (TMRMR) feature selection approach, which is able to handle multivariate temporal data without previous data flattening. In the proposed approach we compute relevance of a gene by averaging Fstatistic values calculated across individual time steps, and we compute redundancy between genes by using a dynamical time warping approach.
Results
The proposed method is evaluated on three temporal gene expression datasets from human viral challenge studies. Obtained results show that the proposed method outperforms alternatives widely used in gene expression studies. In particular, the proposed method achieved improvement in accuracy in 34 out of 54 experiments, while the other methods outperformed it in no more than 4 experiments.
Conclusion
We developed a filterbased feature selection method for temporal gene expression data based on maximum relevance and minimum redundancy criteria. The proposed method incorporates temporal information by combining relevance, which is calculated as an average Fstatistic value across different time steps, with redundancy, which is calculated by employing dynamical time warping approach. As evident in our experiments, incorporating the temporal information into the feature selection process leads to selection of more discriminative features.
Background
Feature selection approaches can be roughly categorized into filterbased methods [1], wrapperbased methods [2] and embedded methods [3]. Filterbased methods perform feature selection independently from the learning process. On the other hand, wrapperbased and embedded methods combine feature selection and the learning process in order to select an optimal subset of features. This combined process usually requires the use of nested cross validation procedure which may lead to increased computational cost and possible overfit, especially when a small number of observations is available, which is often the case in gene expression datasets. Therefore, we focus on filterbased feature selection approaches in this paper.
A challenge in gene expression studies is the identification of discriminative genes, which may be later used as predictors (inputs) to classification models. Removing irrelevant features may lead to improved accuracy and increased interpretability of the classification model. However, this task is challenging, especially when data have temporal characteristics. Various feature selection approaches have been developed for microarray data [4–6]. However, most of these methods cannot handle multivariate temporal data without data flattening, which is the process that transforms a temporal data into a single matrix and results in loss of temporal information.
Several feature selection approaches for temporal data have been proposed recently. For instance, [7] proposed a marginbased feature selection approach for temporal data, where the original feature space was transformed into weighted feature space to perform optimization in order to maximize temporal margin in this weighted feature space. However, redundancy among features was not considered. Following the same intuition, in [8, 9] authors proposed an approach, where they project the data to another space to learn new features (factors or principal component). However, the methods are for dimension reduction, rather than feature selection which is our focus in this paper. The Multitask Lasso method [10, 11] employs group lasso regularization based on the L _{2,1}norm penalty for feature selection, thus ensuring all regression models at different time points to share a common set of features. This method removes redundant features by reducing their weights (coefficients) to zero but the approach belongs to the embedded feature selection methods (the search for an optimal subset of features is built into the classifier construction) rather than filtertype methods.
A special group of filterbased feature selection approaches tends to simultaneously select highly predictive but uncorrelated features. An example is the Maximum Relevance Minimum Redundancy (mRMR) algorithm developed for feature selection of microarray data [12]. It tends to select a subset of features having the most correlation with a class (relevance) and the least correlation between themselves (redundancy). In this algorithm, the features are ranked according to the minimalredundancymaximalrelevance criteria. Relevance can be calculated by using the Fstatistic (for continuous features) or mutual information (for discrete features) and redundancy can be calculated by using Pearson correlation coefficient (for continuous features) or mutual information (for discrete features). In [13] authors proposed the MIFSND algorithm, which selects features according to the minimalredundancymaximalrelevance criteria by using an optimization algorithm known as Nondominated Sorting Genetic AlgorithmII [14]. When selecting features, instead of using the calculated values for relevance and redundancy (e.g., Fstatistic and Pearson correlation coefficient), authors used domination count and dominated count, which account for the rank in the sorted list of calculated relevance and the rank in the sorted list of calculated redundancy, respectively. In [15], authors proposed an approach, where they select one representative gene from each group/cluster with the objective that the selected genes are jointly discriminative. This approach requires features to be previously clustered based on correlation or domain knowledge (e.g., molecular functions, gene ontology, etc.). By clustering genes this algorithm prevents selection of redundant features. All these algorithms tend to select highly predictive uncorrelated features and require a preprocessing approach to perform temporal data flattening.
In this paper, we propose a temporal minimum redundancy  maximum relevance (TMRMR) feature selection approach, which is able to handle multivariate temporal data without data flattening. We preserve the idea of maximum relevance and minimum redundancy criteria [12] but we change evaluation procedure for relevance and redundancy. In the proposed approach, we compute the relevance of a gene by averaging the Fstatistic values calculated across individual time steps, and redundancy between genes by using the dynamical time warping (DTW) approach. The proposed methodology, tested on three temporal gene expression datasets from viral studies, outperforms the alternatives used in this study.
Methods
mRMR algorithm and data flattening
The mRMR is a feature selection approach that tends to select features with a high correlation with the class (output) and a low correlation between themselves. For continuous features, the Fstatistic can be used to calculate correlation with the class (relevance) and the Pearson correlation coefficient can be used to calculate correlation between features (redundancy). Thereafter, features are selected one by one by applying a greedy search to maximize the objective function, which is a function of relevance and redundancy. Two commonly used types of the objective function are MID (Mutual Information Difference criterion) and MIQ (Mutual Information Quotient criterion) representing the difference or the quotient of relevance and redundancy, respectively. For temporal data, mRMR feature selection approach requires some preprocessing techniques that flatten temporal data into a single matrix in advance. This may result in a loss of possibly important information among temporal data (such as temporal order information). A common way for data flattening used as a preprocessing step to mRMR is depicted in Fig. 1.
TMRMR algorithm
In this study, we preserve the idea of the mRMR algorithm by maximizing the objective function, which includes relevance and redundancy, but we adapt it to handle multivariate temporal data without flattening (Fig. 2).
Let us denote by \(D={{\left \{ {{X}_{i}},{{c}_{i}} \right \}}_{i=1,...,N}}\) the dataset with N individuals. \({{X}_{i}}\in {{\mathbf {R} }^{G\times T}}\) represents G observed genes measured at T time steps for individual i. \({{c}_{i}}\in \left \{ 1,...,K \right \}\) represents the class label for individual i. Let us also denote by \({{g}_{j}}\in {{\mathbf {R} }^{N\times T}},j=1,...,G\) the N×T matrix of gene expression data for jth gene. We represent relevance of a gene g _{ j } by calculating the Fstatistic at each time step and then combining these values by using an appropriate aggregation operator. A number of aggregation operators may be applicable here, such as median, arithmetic mean, geometric mean, maximum or even an approach that combines aggregation operators [16]. However, we aim to choose an operator that is most appropriate for the observed problem, i.e., that is able to capture different gene expression patterns between groups even if differences are present in just a small fraction of the observed time period. Median is more robust to outliers than arithmetic mean which is a nice property, however in some cases it may fail to detect different levels of gene expressions. For instance, some genes may have different expression between groups in just a short time period following infection (e.g., initial two time points with large Fstatistic values) and thereafter the differences between groups become insignificant (e.g., the next five time points with small or even neglectable Fstatistic values). In such a case, operators like median and geometric mean would fail to detect different gene expression behavior between groups. The maximum Fstatistic value may be more appropriate in this case. However, this operator is based on a single Fstatistic value (maximal) and it neglects all other values corresponding to other time steps. On the other hand, arithmetic mean, although it will be affected with several small Fstatistic values corresponding to the time points where differences in gene expression values between groups do not exist, will have a significant value. In addition, we implemented the Multilayer Aggregation (MLA) method from [16] to combine arithmetic mean, geometric mean and median for aggregation of Fstatistic values corresponding to different time steps, however, it did not improve results significantly and it reduced robustness of the proposed feature selection methods. For these reasons, we choose the arithmetic mean operator to aggregate Fstatistic values calculated across all time steps into a single value representing the total gene relevance:
where \(g_{j}^{\left (t \right)}\) is an Ndimensional vector containing gene expression data of a gene g _{ j } at the tth time step, c is a classification variable with K possible classes, n _{ k } is the number of observations belonging to the kth class, \(\bar {g}_{j}^{\left (t \right)}\) is the average value of \(g_{j}^{\left (t \right)}\) in all tissue samples, \(\bar {g}_{j,(k)}^{\left (t \right)}\) is the average value of \(g_{j}^{\left (t \right)}\) in all tissue samples belonging to the kth class, and \(g_{j,l,(k)}^{\left (t \right)}\) is the gene expression value of lth sample belonging to the kth class.
By using Eq. 1, we quantify correlation of a gene g _{ j } with a class at each time step t. Thereafter, we calculate the overall relevance of the gene g _{ j } (Eq. 2) by averaging relevance (Fstatistic) values calculated for all time steps. Here, it should be noted that relevance calculated in this way differs from relevance calculated on flattened data. For instance, it may happen that for some phenotype 1 expression values for a certain gene have increasing trend (let say from 0 to 1) and for phenotype 2 symmetric decreasing trend (from 1 to 0). In this case, data flattening may lead to low interclass variance and therefore to low relevance. On the other hand, relevance calculated by using Eqs. (1)(2) should be able capture the different trends of gene expression data for the two phenotypes.
In the proposed approach for temporal feature selection we calculate redundancy by using DTW, which is an efficient algorithm for measuring similarity between two temporal sequences that may vary in time or speed. DTW uses “elastic” alignment and is able to capture similarity between curves even if they are out of phase in time (in such cases Euclidean and Manhattan distance measures, which align corresponding time points, would fail to detect similarity).
An issue with the mRMR algorithm is the possible selection of irrelevant features, which is possible especially in the first few iterations of the algorithm. For instance, based on the MIQ criterion the second feature may be selected simply because it is totally different from the first one (feature with the highest relevance) although it may be irrelevant. Thereafter, this problem is further propagated since a selected irrelevant feature affects selection of the next ones. In order to solve this issue, we introduced hyperparameter α, which controls the number of the top relevant features (according to the average Fstatistic value calculated by using Eqs. (1)(2)) included in the feature selection process. This means that we choose the next nonredundant feature from only the top α G relevant genes (where G is the total number of genes). For each two genes g _{ i } and g _{ j }, belonging to the group of the α G most relevant features, we calculate N×N distance matrix D (Fig. 2) whose elements represent DTW distances between rows in matrices g _{ i } and g _{ j } (e.g., D _{ pq } represents DTW distance between pth row in matrix g _{ i } and qth row in matrix g _{ j }). After computing the distance matrix D we calculate redundancy by using one of the following two approaches:
In Eq. 3 R _{ c } represents redundancy calculated by using DTW distances between every pair of rows in matrices g _{ i } and g _{ j }, while in Eq. 4 R _{ m } represents redundancy calculated by using only DTW distances between corresponding rows in matrices g _{ i } and g _{ j }.
Although DTW is able to capture similarity between curves that are out of phase in time it may fail to capture similarity between curves fluctuating in a similar manner but with different offsets and amplitudes. For instance, one signal may fluctuate with amplitude between 5 and 10, while another signal may fluctuate in a similar manner but with larger amplitude between 30 and 40. In order to deal with this issue, prior to evaluation of distance matrix D for each pair of genes g _{ i } and g _{ j }, all gene expression temporal sequences were normalized by the zscore normalization (Fig. 2) which is often used as a preprocessing step to DTW [17–19]:
where g _{ i,p } is a time series corresponding to ith gene and pth observation (patient), and \({{\bar {g}}_{i,p}}\) and σ _{ i,p } are the average value and standard deviation of this time series. Zscore normalization ’translates’ gene expression time series to fluctuate around the same (zero) offset and removes differences in amplitudes. Thereafter, the gene expression time series differ only in shape which is exactly what we are interested in when calculating redundancy.
After the normalization of gene expression temporal sequences, for each pair of genes g _{ i } and g _{ j } distance matrix D is calculated. Each entry of D is calculated by using DTW approach:
where d t w() is the function which calculates the DTW distance between temporal sequences g _{ i,p } and g _{ j,q }.
As in mRMR [12], the proposed algorithm starts by selecting one feature (gene) having the largest relevance calculated by using Eq. 2. Thereafter, algorithm performs greedy search and adds one feature in each iteration according to the MIQ criterion:
where S is a subset of already selected genes extended with gene g _{ k } and \(\left  S \right \) is the number of features in S, F is the average Fstatistic value across different time steps (Eq. 2), and R is either R _{ c } (Eq. 3) or R _{ m } (Eq. 4). Depending on the choice of the redundancy measure (R _{ c } or R _{ m }), in this paper we propose two versions of the TMRMR algorithm: (1) TMRMRC, using R _{ c } as a measure of redundancy and (2) TMRMRM, using R _{ m } as a measure of redundancy. Figure 3 shows the pseudocode of the proposed TMRMRC and TMRMRM algorithms.
Solution to the optimization problem given in Eq. 7 requires O(α G m) computational complexity, where m is the number of genes selected. Taking into account that the computational complexity of the DTW algorithm is O(T ^{2}) then the total time complexities of the TMRMRC and TMRMRM algorithms are O(α G m T ^{2} N ^{2}) and O(α G m T ^{2} N), respectively. Both proposed algorithms require more computational complexity than the original mRMR algorithm whose computational complexity is O(G m T N) for the temporal gene expression dataset. However, in cases where it is necessary to reduce execution time of the proposed algorithms (e.g. datasets with large number of time points T), their computational complexity may be reduced through parameter α. In addition, we can further speed up the proposed algorithms by utilizing an approximate DTW that has a linear time complexity [20], however, it is out of the current manuscript’s scope.
Implementation
Both, the TMRMRC and TMRMRM algorithms are implemented by using MATLAB software. DTW is implemented by using dynamic time warping package [21]. Our software takes as input a set of temporally aligned gene expression data and provides the ranked list of the top genes as the output. The number of genes to be selected is specified by a user. Source code is freely available at: https://github.com/radovicmiloskg/TMRMR.git.
Results and discussion
Dataset description
In this study, we evaluated the proposed feature selected approach by comparing it with alternatives on three independent gene expression datasets from human viral challenge studies [22]. These datasets contain gene expression data for 17, 20 and 19 human volunteers, who were infected with H3N2 influenza, rhinovirus (HRV) and respiratory syncytial virus (RSV), respectively. A summary of the datasets is given in Table 1.
In each dataset, subjects were classified based on severity of reaction to infection into “symptomatic” and “asymptomatic” groups. In particular, symptoms were recorded twice daily and classified based on modified Jackson Score [23]. Patients with a modified Jackson score larger than or equal to 6 over the quarantine period were denoted as “symptomatic”. Gene expression measurements were collected temporally, starting at baseline (24 hours prior to inoculation with virus) and thereafter at a certain time points following experimental procedure which is described in detail in [22], making a total of 16, 14 and 21 timepoint measurements for H3N2, HRV and RSV datasets, respectively.
Comparison methods
We compared the proposed TMRMRC and TMRMRM methods with four popular stateoftheart feature selection approaches, widely used for extraction of the most informative features from gene expression data:

1.
mRMR: This algorithm tends to select a subset of features having the most correlation with the class (output) and the least correlation between themselves [12]. It ranks features according to the minimalredundancymaximalrelevance criterion which is based on mutual information.

2.
Fstatistic: ANOVA is one of the most widely applied techniques in microarray data analysis [24]. This approach selects features simply according to the Fstatistic value (which is the statistic for ANOVA). It prefers to select features having small intraclass variances and large interclass variance.

3.
ReliefF: One of the most successful and most widely used feature selection approaches which is based on the idea that a good feature should have similar values in observations belonging to the same class and different values in observations belonging to different classes [25]. It choses instances randomly, finds their nearest neighbors from the same and the opposite class(es), and weights features according to their distances (more weight is given to features that discriminate the instances from neighbors of different class(es) and do not discriminate the instances from neighbors of the same class).

4.
Multitask Lasso (MTLASSO): This method represents one of the stateoftheart methods for temporal feature selection [10, 11]. It employs the group lasso regularization based on the L _{2,1}norm penalty for feature selection, thus ensuring that all regression models at different time points (tasks) share a common set of features. The method is implemented by using the MALSAR software package [26].
Performance evaluation procedure
We evaluated the feature selection approaches by calculating the classification accuracy of the three classifiers:

1.
Knearest neighbors (KNN): Instancebased lazy learning algorithm which predicts the class of a testing observation that is dominant among the K most similar examples (nearest neighbors) in the problem space.

2.
Naive Bayes classifier (NB): A probabilistic classifier based on applying Bayes’ theorem, which is often used for classification of gene expression data [12, 27].

3.
Support vector machine (SVM): A discriminative classifier, which uses a kernel trick to transform the input data space in order to create a separating hyperplane. In this study, we used linear SVM because previous studies have proved its effectiveness in gene expression classification problems [28].
For evaluation of the three classifiers, the 5fold cross validation procedure was used, where, in each iteration, observations belonging to the leftout fold were used for testing purposes (test set), while the remaining observations were used for feature selection followed by classifier training (training set). In each iteration of the cross validation procedure we optimized parameters of the classifiers by applying nested 5fold cross validation procedure on the training set. In this way optimal values of parameters \(C\in \left \{ {{10}^{3}},{{10}^{2}},...,{{10}^{3}} \right \}\) for SVM and \(K\in \left \{ 1,3,5,7 \right \}\) for KNN were selected. Here it should be noted that the test data were never used for feature selection and classifiers training (including optimization of classifiers parameters  nested 5fold cross validation procedure). In addition, the optimal value of the hyperparameter α can be estimated in a nested crossvalidation procedure. However, due to the fact that datasets used in this study contain a huge number of features (12023) measured in a large number of time points (1421 depending on a dataset), it was too time consuming to use the nested crossvalidation to select the value of α. Thus, we simply fixed the α parameter to 0.3 in all experiments ensuring that all selected genes come from the pool of the top 30% relevant genes (3610 genes).
All three gene expression datasets used in this study are balanced, and therefore classification accuracy may serve as a good metric for comparison of TMRMRC and TMRMRM with other baseline feature selection methods. Prior to feature selection and evaluation, missing values in all three datasets were imputed by linear interpolation. In addition, gene expression values for each gene were normalized to the range [ 0,1] by using minmax normalization. All methods were implemented by using MATLAB software.
Classification accuracy on gene expression data
The proposed TMRMRC and TMRMRM feature selection approaches were compared with four baseline feature selection algorithms according to the evaluation procedure described in the previous section. By using the 5fold cross validation procedure, the accuracy of KNN, NB and SVM classifiers was calculated for the top m={1,10,20,30,40,50} genes.
Table 2 shows the results for the three datasets H3N2, HRV and RSV, respectively. It clearly shows that both TMRMRC and TMRMRM methods outperformed alternatives in most cases. More precisely, both TMRMRC and TMRMRM algorithms achieved improvement in 34 out of 54 cases when compared to alternatives (with 12 tie results). When comparing the two proposed approaches, TMRMRC outperformed TMRMRM in most cases (165 in favor of TMRMRC and 33 tie results). These results reveal that redundancy calculated by using DTW distances between every pair of time series from gene expression matrices (R _{ c }) significantly contributes to prediction accuracy. In addition, we calculated the average accuracy of the three classifiers over all datasets (last row in Table 2). These values show that, on average, both proposed methods outperformed alternatives in all cases (for all classifiers and all m values). This indicates that the proposed methods have selected the most discriminative features.
For each m value, we tested whether the proposed TMRMRC approach (which outperformed the TMRMRM) statistically significantly outperforms other methods. For this purpose, we applied Welch’s ttest on the results given in Table 2 and found that the accuracy of the proposed TMRMRC method is statistically more significant than other four baseline methods in 17 out of 24 cases (α=0.05).
Results given in Table 2 are depicted in Fig. 4. In this figure the accuracy is plotted as a function of m for all classifiers and for all datasets. This figure clearly shows that in most cases both, TMRMRC and TMRMRM approaches, outperform baseline methods for most values of m. This figure also shows that, among the four tested baseline feature selection approaches, Fstatistic outperformed the others in most cases including mRMR. Since mRMR uses Fstatistic as a measure of relevance we can conclude that minimum redundancy condition, calculated as a Pearson correlation coefficient, hurts its performance. On the other hand, the proposed TMRMRC and TMRMRM methods achieved highest accuracy by combining relevance, calculated as an average Fstatistic value across different time steps, with redundancy, calculated by employing DTW and thus succeeded to capture some additional information hidden in temporal characteristics of the data.
The accuracy of the DTW algorithm may degrade considerably when operating on expression profiles with not enough data points which is often the case in gene expression datasets. This may limit the applicability of the proposed TMRMRC and TMRMRM algorithms in such cases and, for this purpose, we performed analysis on how reducing the number of time points affects performance of the proposed methods comparing to baseline approaches. We repeated the same evaluation procedure but with reduced number of time points T=3, T=5 and T=7 for all three datasets. We select the following time points for evaluation purposes: initial time point, end time point and equally distant time points between them (e.g. t _{1}, t _{ T/2}, t _{ T }). Due to the space limitation, in Table 3 and Fig. 5 we show only results averaged over all datasets. The obtained results show that the reduction of time points did not affect the performance of the TMRMRC algorithm, which outperformed all alternatives in all cases (for all classifiers and for all T and m values). On the other hand, the TMRMRM algorithm showed improvement in all but 3 cases from which 2 occurred when the number of time points was set to 3 (T=3) and the remaining one occurred when the number of time points was set to 5 (T=5). This confirms the fact that a limited number of time points negatively affects the DTW approach and consequently the TMRMRM algorithm, nevertheless, the proposed method showed improvement in most cases when comparing to baseline approaches. This leads to the conclusion that in cases with a limited number of time points the TMRMRC approach, which is computationally more expensive, might be more appropriate than the TMRMRM approach.
Gene ontology overrepresentation analysis
We have performed gene ontology overrepresentation analysis to find gene ontology (GO) terms that are overrepresented within the subset of selected genes. For this purpose we used annotations for the top 50 genes selected by the TMRMRC algorithm from each of the three datasets used in this study (the full list of selected genes, together with error bars for the two groups, symptomatic and asymptomatic, is given in Additional file 1). Selected genes from each dataset were independently submitted to the PANTHER (protein annotation through evolutionary relationship) classification system (http://www.pantherdb.org/) which extracted significantly overrepresented biological processes. For each of the three datasets, the top 5 GO terms are reported in Table 4. The last column in the table is pvalue corrected based on the Bonferroni procedure.
We can see from Table 4 that most of GO terms that are overrepresented in all datasets are related to immune response to viral infection. This is consistent with the fact that the three gene expression datasets originate from human viral challenge studies where human volunteers were infected with H3N2 influenza, rhinovirus (HRV) and respiratory syncytial virus (RSV), respectively.
Robustness
In order to compare robustness of the proposed TMRMRC and TMRMRM feature selection methods with other baseline approaches used in this study, we calculated the Spearman’s rank correlation coefficient (ρ), Tanimoto distance (T _{ dist }) and number of features shared across all folds of the 5fold cross validation procedure (N _{ shared }) for the top 50 selected features (m=50). For each method, Fig. 6 shows the average value of each stability measure across all datasets (H3N2, HRV and RSV) and across all tested numbers of time points (T=3, T=5, T=7 and T=T _{ all }, where T _{ all }∈{16,14,21}). This figure clearly shows that, on average, the TMRMRC feature selection method is the most stable one according to each of the three measures (N _{ shared }=15, ρ=0.40 and T _{ dist }=0.33). The second most stable method is ReliefF (N _{ shared }=10.08, ρ=0.36 and T _{ dist }=0.32) which appears to be more stable than the TMRMRM algorithm (N _{ shared }=9.66, ρ=0.31 and T _{ dist }=0.25), while the least stable method is mRMR (N _{ shared }=2.16, ρ=0.03 and T _{ dist }=0.12). Since both the mRMR and the TMRMRC algorithms are based on maximum relevance and minimum redundancy criteria, we can conclude that combining relevance, calculated as an average Fstatistic value across different time steps, with redundancy, calculated by employing DTW significantly improves robustness for temporal data.
Conclusion
We presented filterbased feature selection methods for temporal gene expression data. The proposed methods utilize the maximum relevance and minimum redundancy criteria which were originally introduced by the mRMR algorithm. In order to handle multivariate temporal data without previous data flattening we modified the evaluation procedure for relevance and redundancy. Concretely, in the proposed methods we calculate the relevance of a gene by averaging Fstatistic values calculated across individual time steps and redundancy between genes by using dynamical time warping. The proposed methods have been tested on three temporal gene expression datasets from viral studies. We showed that TMRMRC and TMRMRM proposed methods outperformed alternatives in most cases. In addition, we evaluated the proposed approaches on a reduced number of time points and showed that they achieved improvement in most cases when compared to alternatives. In the future, we will focus on optimization of minimumredundancymaximumrelevance criteria and investigate applicability of various optimization algorithms, other than greedy search used in this study.
Abbreviations
 DTW:

Dynamical time warping
 GO:

Gene ontology
 H3N2:

H3N2 influenza virus
 HRV:

Rhinovirus
 KNN:

Knearest neighbors
 MID:

Mutual Information difference criterion
 MIQ:

Mutual Information quotient criterion
 mRMR:

Maximum relevance minimum redundancy algorithm
 MTLASSO:

multitask Lasso
 NB:

Naive bayes classifier
 PANTHER:

Protein annotation through evolutionary relationship
 RSV:

Respiratory syncytial virus
 SVM:

Support vector machine
 TMRMR:

Temporal minimum redundancy maximum relevance feature selection method
References
Yu L, Liu H. Feature selection for highdimensional data: A fast correlationbased filter solution In: Fawcett T, Mishra N, editors. Proceedings of the 20th International Conference on Machine Learning (ICML03). Menlo Park: The AAAI Press: 2003. p. 856–63.
Kohavi R, John GH. Relevance wrappers for feature subset selection. Artif. Intell. 1997; 97(1):273–324.
Lal TN, Chapelle O, Weston J, Elisseeff A. Embedded Methods In: Guyon I, Nikravesh M, Gunn S, Zadeh LA, editors. Feature Extraction: Foundations and Applications. Berlin: Springer: 2006. p. 137–65.
Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007; 23(19):2507–517.
BolónCanedo V, SánchezMaroño N, AlonsoBetanzos A, Benítez JM, Herrera F. A review of microarray datasets and applied feature selection methods. Information Sciences. 2014; 282:111–35.
Hira ZM, Gillies DF. A review of feature selection and feature extraction methods applied on microarray data. Adv. Bioinformatics. 2015; 2015:1–13.
Lou Q, Obradovic Z. Analysis of temporal highdimensional gene expression data for identifying informative biomarker candidates. In: 2012 IEEE 12th International Conference on Data Mining. Washington: IEEE Computer Society: 2012. p. 996–1001.
Chen B, Chen M, Paisley J, Zaas A, Woods C, Ginsburg GS, Hero A, Lucas J, Dunson D, Carin L. Bayesian inference of the number of factors in geneexpression analysis: application to human virus challenge studies. BMC Bioinformatics. 2010; 11(1):1–16.
Chen M, Zaas A, Woods C, Ginsburg GS, Lucas J, Dunson D, Carin L. Predicting viral infection from highdimensional biomarker trajectories. J Am Stat Assoc. 2011; 106(496):1259–1279.
Argyriou A, Evgeniou T, Pontil M. Multitask feature learning In: Scholkopf B, Platt JC, Hoffman T, editors. Advances in Neural Information Processing Systems 19. Cambridge: MIT Press: 2007. p. 41–8.
Nie F, Huang H, Cai X, Ding CH. Efficient and robust feature selection via joint L2,1norms minimization In: Lafferty JD, Williams CKI, ShaweTaylor J, Zemel RS, Culotta A, editors. Advances in Neural Information Processing Systems 23. Red Hook, NY: Curran Associates, Inc.: 2010. p. 1813–1821.
Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. J Bioinforma Comput Biol. 2005; 03(02):185–205.
Hoque N, Bhattacharyya DK, Kalita JK. Mifsnd: A mutual informationbased feature selection method. Expert Syst Appl. 2014; 41(14):6371–385.
Deb K, Agrawal S, Pratap A, Meyarivan T. In: Schoenauer M, Deb K, Rudolph G, Yao X, Lutton E, Merelo JJ, Schwefel HP, (eds).A Fast Elitist Nondominated Sorting Genetic Algorithm for Multiobjective Optimization: NSGAII. Berlin, Heidelberg: Springer; 2000, pp. 849–58.
Ghalwash MF, Cao XH, Stojkovic I, Obradovic Z. Structured feature selection using coordinate descent optimization. BMC Bioinformatics. 2016; 17(1):1–14.
Elena T, Veselka B. Nonparametric recursive aggregation process. Kybernetika. 2004; 40(1):51–70.
Petitjean F, Ketterlin A, Gançarski P. A global averaging method for dynamic time warping, with applications to clustering. Pattern Recognition. 2011; 44(3):678–93.
Ratanamahatana CA, Tohlong P In: Sugimoto S, Hunter J, Rauber A, Morishima A, editors. Speech Audio Retrieval Using Voice Query. Berlin: Springer: 2006. p. 494–7.
Rakthanmanon T, Campana B, Mueen A, Batista G, Westover B, Zhu Q, Zakaria J, Keogh E. Searching and mining trillions of time series subsequences under dynamic time warping. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’12. New York: ACM: 2012. p. 262–70.
Salvador S, Chan P. Toward accurate dynamic time warping in linear time and space. Intell Data Anal. 2007; 11(5):561–80.
Wang Q. Dynamic Time Warping (DTW). 2013. http://www.mathworks.com/matlabcentral/fileexchange/43156dynamictimewarpingdtw. Accessed 25 Feb 2016.
Zaas AK, Chen M, Varkey J, Veldman T, III AOH, Lucas J, Huang Y, Turner R, Gilbert A, LambkinWilliams R, Øien NC, Nicholson B, Kingsmore S, Carin L, Woods CW, Ginsburg GS. Gene expression signatures diagnose influenza and other symptomatic respiratory viral infections in humans. Cell Host & Microbe. 2009; 6(3):207–17.
G J, HF D, IG S, AV B. Transmission of the common cold to volunteers under controlled conditions: I. the common cold as a clinical entity. AMA Archives of Internal Medicine. 1958; 101(2):267–78.
Peyman J, Francisco A. An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors. BMC Med Inform Decis Mak. 2006; 6:27.
Kira K, Rendell LA. A practical approach to feature selection. In: Proceedings of the Ninth International Workshop on Machine Learning. ML92. San Francisco: Morgan Kaufmann Publishers Inc.: 1992. p. 249–56.
Zhou J, Chen J, Ye J. MALSAR: MultitAsk Learning via StructurAl Regularization. 2012. http://www.public.asu.edu/%7Ejye02/Software/MALSAR. Accessed 25 Feb 2016.
Fan L, Poh KL, Zhou P. A sequential feature extraction approach for naïve bayes classification of microarray data. Expert Syst. Appl. 2009; 36(6):9919–923.
Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Machine Learning. 2002; 46(1):389–422.
Acknowledgements
Not Applicable.
Funding
This material is based upon work partially supported by the Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO) under Contract No. W911NF16C0050, and partially supported by DARPA grant No. 660011114183 negotiated by SSC Pacific grant, and Serbian Ministry of Education, Science and Technological Development grants III41007 and ON174028.
Availability of data and materials
The datasets used in this study and MATLAB 8.5 source code for TMRMRC and TMRMRM algorithms are publically available at: https://github.com/radovicmiloskg/TMRMR.git.
Authors’ contributions
MR developed and implemented the computational methods, and conducted the experiments, supervised by ZO and NF. MR wrote the manuscript and discussed and analyzed the results with MG. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
All the datasets used in this study were previously published by other authors and are publicly available (the raw data are available in GEO under accession no. GSE17156). Thus, this study does not need to be reviewed by any ethics committee.
Author information
Authors and Affiliations
Corresponding author
Additional file
Additional file 1
Supplementary materials. The supplementary PDF file contains relevant information omitted from the main manuscript such as: (1) the ranked list of the top 50 genes selected by the TMRMRC approach for H3N2, HRV and RSV datasets, respectively and (2) error bars for the two groups, symptomatic and asymptomatic, for the top genes selected from the three datasets. (DOCX 240 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Radovic, M., Ghalwash, M., Filipovic, N. et al. Minimum redundancy maximum relevance feature selection approach for temporal gene expression data. BMC Bioinformatics 18, 9 (2017). https://doi.org/10.1186/s1285901614239
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901614239