Improving clustering with metabolic pathway data
- Diego H Milone†1,
- Georgina Stegmayer†1,
- Mariana López2,
- Laura Kamenetzky2 and
- Fernando Carrari2
© Milone et al.; licensee BioMed Central Ltd. 2014
Received: 30 April 2013
Accepted: 25 March 2014
Published: 10 April 2014
It is a common practice in bioinformatics to validate each group returned by a clustering algorithm through manual analysis, according to a-priori biological knowledge. This procedure helps to find functionally related patterns and to propose hypotheses about their behavior and the biological processes involved. This knowledge, however, is used only as a second step, after the data have been clustered according to their expression patterns alone. It would therefore be very useful to improve the clustering of biological data by incorporating prior knowledge into the cluster formation itself, in order to enhance the biological value of the clusters.
A novel training algorithm for clustering is presented, which evaluates the biological internal connections of the data points while the clusters are being formed. Within this training algorithm, the calculation of distances among data points and neuron centroids includes a new term based on information from well-known metabolic pathways. Standard self-organizing map (SOM) training and the biologically-inspired SOM (bSOM) training were tested with two real data sets of transcripts and metabolites from the species Solanum lycopersicum and Arabidopsis thaliana. Classical data mining validation measures were used to evaluate the clustering solutions obtained by both algorithms. Moreover, a new measure that takes into account the biological connectivity of the clusters was applied. The results show important improvements in convergence and performance for bSOM in comparison to standard SOM training, in particular from the application point of view.
Analyses of the clusters obtained with bSOM indicate that including biological information during training can increase the biological value of the clusters found with the proposed method. It is worth highlighting that this has effectively improved the results, which can simplify their further analysis.
The algorithm is available as a web-demo at http://fich.unl.edu.ar/sinc/web-demo/bsom-lite/. The source code and the data sets supporting the results of this article are available at http://sourceforge.net/projects/sourcesinc/files/bsom.
Keywords: Clustering, SOM training, Pathway data
In the biology field, clustering is implemented under the guilt-by-association principle, that is to say, the assumption that compounds involved in a biological process behave similarly under the control of the same regulatory networks. It is assumed that if a metabolic compound with unknown function varies in a similar fashion to a known metabolite from a defined metabolic pathway, it can be inferred that the unknown element is also likely to be involved in the same pathway. Therefore, a cluster that groups some metabolites indicates that they may be connected within common metabolic pathways. This pathway-based approach to identifying metabolic traits yields further biological hypotheses that have to be tested through the design of biological experiments (wet experiments). From this perspective, it could be useful to perform a detailed inspection of the patterns inside a cluster to determine memberships to known metabolic pathways.
Due to the limitations of traditional algorithms, computational intelligence has been recently applied to bioinformatics with promising results [5, 6]. For example, self-organizing maps (SOMs) are a special class of neural networks that use competitive learning. SOMs can represent complex high-dimensional input patterns in a simpler low-dimensional discrete map, with prototype vectors that can be visualized in a two-dimensional lattice structure, while preserving the proximity relationships of the original data as much as possible. SOMs have been used for unsupervised clustering of transcriptome profiles [8, 9] as well as metabolites. For example, SOM clustering has been used for the analysis of Arabidopsis thaliana datasets, helping to validate the hypothesis of a metabolic mechanism responding to sulfur deficiency. SOMs have also been recently proposed for the integration and knowledge discovery of coordinated variations in transcriptomics and metabolomics data, and a software tool for SOM application has been designed to support the data mining of datasets derived from different databases, providing a user-friendly interface and several visualization tools that are easy to understand for non-expert users.
When evaluating a clustering solution, it is a common (and necessary) practice to validate each group returned by a clustering algorithm through manual analysis and visual inspection, according to a-priori biological knowledge. Traditionally, the known annotations are used only as a second step, after data have been clustered according to their variation patterns. Only those clusters in which many genes (and proteins/metabolites) are annotated within the same category (for example, the same MapMan BIN or Gene Ontology (GO) terms) are then selected for further analysis [16–19]. For each pattern, its annotations and memberships to well-known metabolic pathways are generally assessed. The results obtained after inspecting each cluster by hand may indicate functionally related patterns. Automatic post-clustering validation proposals like “gene set enrichment analysis” focus on groups of genes that share common biological function, chromosomal location or regulation. Similarly, metrics derived from Protein-Protein Interactions (PPI) can be used in combination with genomic data to validate clusters with respect to their biological relevance. These metrics, however, can only be applied to clusters of genes. Recently, a biologically inspired validity measure that can be applied not only to groups of genes but also to genes and metabolites together has been proposed.
There is a growing interest in improving the cluster analysis of biological data by incorporating such prior basic knowledge into the clustering itself, in order to increase the biological meaning of the clusters that are subjected to later scrutiny. In the past few years, several methods have been introduced with that aim, since integrating a biological similarity measure or biological information into a clustering method can potentially enhance the performance of the clustering, as a result of the good correlation between biological similarity and gene co-expression levels [23, 24]. For example, one proposed distance function combines information from expression data and the proximity of the proteins in a metabolic pathway network. A similar approach uses a graph based on the GO structure. Another work proposed shrinking the distances between pairs of genes sharing a common annotation. In fact, the distance measure between two genes can be modified to be a linear combination of the similarity of their expression profiles and their functional similarity [28–30]. Moreover, a classical clustering method can be modified to work with such a newly defined metric, for example, by slicing a hierarchical clustering tree obtained from a gene dataset to get clusters that are as consistent as possible with well-known gene annotations. Heterogeneous genomic data have also been used within a clustering algorithm with the aim of identifying highly correlated genes more effectively than when using only expression data or a single data source. Most of these clustering methods use only the annotations provided by the GO ontology or its hierarchical structure, through similarity measures between terms. Although GO is heavily used in systems biology, redundancy and problems with stability over time have been recently indicated. Besides, this information cannot be associated with other molecular entities such as metabolites: it can be used for genes and their products only. Additionally, many genes are currently unannotated, and this situation is generally handled by excluding them from the analysis or by considering them as exceptional cases.
In summary, the integration of -omics measurements with additional relevant biological information is expected to improve the quality and the biological significance of unsupervised clustering. This paper proposes and illustrates this integrative principle, not only for genomic data but also for metabolic and integrated datasets. We present a novel training algorithm that incorporates biological similarities derived from metabolic pathway information and demonstrate that its application improves the quality of the clustering. This new approach weights the biological connectivity of the patterns (genes and/or metabolites) during training of the clustering method. This is achieved through a new term for the biological assessment of the clusters while they are being formed. The algorithm takes into account not only the classical Euclidean distance between patterns, but also a biological term assessed by means of the number of common pathways. The proposed approach was tested on a set of transcriptome and metabolome data from Solanum lycopersicum and Arabidopsis thaliana, showing improved cluster formation when using the proposed biologically inspired SOM (bSOM), in comparison to the standard SOM training (sSOM). This improvement is demonstrated by the increase of biological connections in the clusters found by bSOM and by the biological analysis of those clusters.
In the following section we explain in detail the new biologically-inspired algorithm for SOM training. After that, the validation measures used for performance comparison among training algorithms are presented. Finally, the datasets used for SOM training are described.
Improved SOM training using metabolic pathways
SOM clustering is based on nodes (neurons) that compete in response to a given input. Inputs are fully connected to the output nodes. Each output node corresponds to a cluster and is associated with a prototype or synaptic weight vector . Given an input pattern, competition among neurons takes place, when their similarity (or distance) to the input is computed. Thus, the neurons in the output layer compete with each other, and only the closest to the input becomes activated or fired. The weight vector of this winning neuron is further moved towards (closer to) the input pattern. This competitive learning paradigm allows learning for the neuron that best matches the given input pattern and it is also known as winner-takes-all learning .
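The competitive step described above can be illustrated with a minimal sketch. The function names, the toy weight vectors and the learning rate of 0.5 are illustrative assumptions, not taken from the bSOM source code.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_winner(pattern, weights):
    """Index of the neuron whose weight vector is closest to the pattern."""
    return min(range(len(weights)), key=lambda m: euclidean(pattern, weights[m]))

def move_towards(weight, pattern, rate):
    """Move a weight vector a fraction `rate` of the way towards the pattern."""
    return [w + rate * (x - w) for w, x in zip(weight, pattern)]

# Toy example: three neurons with 2-D prototypes and one input pattern.
weights = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
pattern = [1.2, 0.8]
winner = find_winner(pattern, weights)            # only the closest neuron fires
weights[winner] = move_towards(weights[winner], pattern, rate=0.5)
```

Here only the winning prototype is moved; the neighborhood update used by SOM is a separate step.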
When competition among the neurons is complete, SOM updates not only the weight vector of the winning neuron but also a set of weights within its neighborhood, according to a neighborhood function Λ. This function defines the neurons that will be affected by the changes in the winning neuron. We have used the standard square neighborhood. Thus, for example, if the radius of the neighborhood is 1, all the 8 neurons adjacent to the winning one will be updated as well. At the beginning of training, Λ has a radius equal to a quarter of the size of the map. During training, this radius is reduced linearly with the training epochs until it reaches 0 (that is to say, at this point only the winning neuron is updated). The rate of the modifications at the different neurons is a monotonically decreasing scalar function of the training epochs. Its exact form is not important as long as its value is large at the beginning of the process and is gradually reduced to a fraction of that value in successive steps.
The goal of SOMs is to represent complex high-dimensional input patterns in a simpler low-dimensional discrete map, with prototype vectors that can be located in a two-dimensional lattice structure, while preserving the proximity relationships of the original data as much as possible. SOM structures the output nodes (neurons) in such a way that nodes in closer proximity are more similar to each other than to other nodes that are farther apart. Once training has finished, input patterns are projected onto the lattice of adjacent neurons connected to each other through the neighborhood function, giving a clear topology of how the network fits the input space. In this projection, an input pattern is associated with a neuron (cluster) simply according to its minimum distance to all neuron prototypes.
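A small sketch of the neighborhood and its linearly shrinking radius may help to fix ideas. The function names and the rounding of the radius to an integer are assumptions made for illustration; the text only states that the radius starts at a quarter of the map size and decreases linearly to 0.

```python
def radius_at(epoch, n_epochs, map_side):
    """Radius shrinking linearly from map_side/4 (epoch 0) to 0 (last epoch).

    Rounding to the nearest integer is an assumption of this sketch.
    """
    if n_epochs <= 1:
        return 0
    r0 = map_side / 4.0
    return int(round(r0 * (1.0 - epoch / (n_epochs - 1))))

def square_neighborhood(winner, radius, rows, cols):
    """Grid positions within `radius` rows and columns of the winner, clipped to the map."""
    wr, wc = winner
    return [(r, c)
            for r in range(max(0, wr - radius), min(rows, wr + radius + 1))
            for c in range(max(0, wc - radius), min(cols, wc + radius + 1))]

# With radius 1, the winner and its 8 surrounding neurons are updated.
cells = square_neighborhood((2, 2), 1, rows=5, cols=5)
```

With radius 0 the list contains only the winner, which matches the final phase of training described above.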
In bSOM, the distance d ℓ m between a pattern ℓ and a neuron m combines the classical Euclidean distance with a biological term b ℓ m, weighted by a parameter α. The biological term is computed from two quantities: πℓ∉m is the average number of biological connections among all the patterns clustered in the neuron m not including the pattern ℓ; and πℓ∈m is the average number of biological connections among all the patterns clustered in the neuron m including the pattern ℓ. The average biological connections are calculated using a metabolic pathways connection matrix ρ, where each element ρ i j is the number of metabolic pathways that involve both the pattern in row i and the pattern in column j. This is calculated by simply counting the number of pathways in common, following the same procedure for metabolites as well as for transcripts.
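The construction of ρ and of the average number of connections in a cluster can be sketched as follows. The pathway identifiers "p1" and "p2" are hypothetical placeholders, and averaging over all distinct pairs of patterns in the cluster is an assumption about how the average is taken.

```python
from itertools import combinations

def pathway_connection_matrix(pathways):
    """rho[i][j] = number of pathways shared by pattern i and pattern j.

    `pathways` is a list with one set of pathway identifiers per pattern.
    The diagonal is left at 0 because only pairs of distinct patterns are used.
    """
    n = len(pathways)
    rho = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        rho[i][j] = rho[j][i] = len(pathways[i] & pathways[j])
    return rho

def average_connections(rho, members):
    """Mean of rho over all distinct pairs in `members` (0.0 with fewer than two)."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0
    return sum(rho[i][j] for i, j in pairs) / len(pairs)

# Four patterns; the last one has no pathway annotation.
pathways = [{"p1", "p2"}, {"p2"}, {"p1", "p2"}, set()]
rho = pathway_connection_matrix(pathways)
avg_without = average_connections(rho, [0, 2])      # cluster without pattern 1
avg_with = average_connections(rho, [0, 1, 2])      # cluster including pattern 1
```

In this toy case, adding pattern 1 lowers the average from 2.0 to 4/3, since it shares only one pathway with each of the other two patterns.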
The biological term b ℓ m measures how close (or distant) a pattern ℓ is to a neuron m, in terms of the improvement in the average number of common pathways in that cluster. When a pattern has b ℓ m >0 with respect to neuron m, assigning the pattern ℓ to the neuron m would decrease the average number of common pathways among all the data patterns clustered in that neuron. Instead, if b ℓ m <0, assigning the pattern ℓ to the neuron m would increase the average number of common pathways, thereby increasing the biological value of that cluster. The parameter α is used to balance the two goals: when α=0, d ℓ m becomes the classical Euclidean distance and the algorithm becomes the standard SOM clustering (sSOM); when α=1, the algorithm completely disregards the expression measures and groups data only according to biological connections. In principle, there is no single optimum α: it depends on the weight given to the related biological information in the final analysis.
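One form consistent with this description is a convex combination, d = (1 − α)·e + α·b, where e is the Euclidean distance and b is the difference between the average connections without and with the pattern. The sketch below assumes exactly that form; any normalization of the two terms used by the actual algorithm is not shown here, and the function names are illustrative.

```python
import math

def biological_term(avg_without, avg_with):
    """Positive when adding the pattern lowers the cluster's average shared pathways."""
    return avg_without - avg_with

def combined_distance(pattern, weight, avg_without, avg_with, alpha):
    """Assumed form: (1 - alpha) * Euclidean distance + alpha * biological term."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    euclid = math.sqrt(sum((x - w) ** 2 for x, w in zip(pattern, weight)))
    return (1.0 - alpha) * euclid + alpha * biological_term(avg_without, avg_with)

# alpha = 0 reduces to the Euclidean distance (sSOM behavior);
# alpha = 1 uses only the biological term.
d_ssom = combined_distance([3.0, 4.0], [0.0, 0.0], 1.0, 2.0, alpha=0.0)
d_bio = combined_distance([3.0, 4.0], [0.0, 0.0], 1.0, 2.0, alpha=1.0)
d_mix = combined_distance([3.0, 4.0], [0.0, 0.0], 1.0, 2.0, alpha=0.5)
```

In the example the pattern would raise the cluster average from 1.0 to 2.0, so the biological term is negative and pulls the pattern towards that neuron as α grows.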
After the application of an unsupervised mining technique, it is quite difficult to validate the obtained results. A set of objective measures can be used to quantify the quality of the clusters obtained by different available methods, and summaries of the different types of validation measures that can be used to qualify a clustering solution are available in the literature. A new kind of biological measure, which evaluates the metabolic connections existing in the clustering partition found, is presented as well. In this study we have used:
The first measure quantifies intracluster compactness or homogeneity. For a global measure of compactness, the average over all k clusters is calculated. Values close to 0 indicate more compact clusters.
The second measure quantifies the degree of separation between individual clusters, as the mean Euclidean distance among cluster centroids; a value close to 1 indicates more separated clusters.
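As an illustration of these two measures, the sketch below computes compactness as the mean member-to-centroid distance averaged over clusters, and separation as the mean distance between centroids. These are common textbook forms and are assumptions here: the exact formulas and any normalization used in the study (for instance, the scaling that makes separation approach 1) are not reproduced.

```python
import math
from itertools import combinations

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def mean_compactness(clusters):
    """Average over non-empty clusters of the mean member-to-centroid distance."""
    per_cluster = []
    for pts in clusters:
        if pts:
            c = centroid(pts)
            per_cluster.append(sum(dist(p, c) for p in pts) / len(pts))
    return sum(per_cluster) / len(per_cluster)

def mean_separation(clusters):
    """Mean distance between centroids; needs at least two non-empty clusters."""
    cents = [centroid(pts) for pts in clusters if pts]
    pairs = list(combinations(cents, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

clusters = [[[0.0, 0.0], [0.0, 2.0]], [[4.0, 1.0], [6.0, 1.0]]]
compactness = mean_compactness(clusters)
separation = mean_separation(clusters)
```

In this toy partition every point lies at distance 1 from its centroid and the two centroids are 5 units apart.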
The Davies-Bouldin (DB) index is a combination of the previous two measures and a popular metric for evaluating clustering algorithms. The DB index is a function of the ratio of the sum of within-cluster scatter to between-cluster separation. It is an indication of cluster overlap; therefore, a DB value close to 0 indicates that the clusters are compact and far from each other.
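A minimal sketch of the standard Davies-Bouldin definition follows: for each cluster, the worst ratio of summed scatter to centroid distance against any other cluster is taken, and these values are averaged. The helper names are illustrative, and the sketch assumes non-empty clusters with distinct centroids.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def davies_bouldin(clusters):
    """Average over clusters of the largest (s_i + s_j) / d(c_i, c_j) ratio."""
    cents = [centroid(pts) for pts in clusters]
    scatter = [sum(dist(p, c) for p in pts) / len(pts)
               for pts, c in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k

clusters = [[[0.0, 0.0], [0.0, 2.0]], [[4.0, 1.0], [6.0, 1.0]]]
db = davies_bouldin(clusters)
```

Here both clusters have scatter 1 and their centroids are 5 apart, so each ratio is 2/5.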
The Dunn (D) index combines dissimilarity between clusters and their diameters, based on the idea of identifying cluster sets that are compact and well separated. It measures inter-cluster distances (separation) over intra-cluster distances (compactness). If a clustering partition contains well-separated clusters, the distances among them are usually large and their diameters are expected to be small. Therefore, a larger D value means a better cluster configuration.
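The following sketch uses one common variant of the Dunn index: the smallest distance between points of different clusters divided by the largest cluster diameter. Other variants exist (for example, using centroid distances), so the choice here is an assumption; it also requires at least two clusters and at least one cluster with two or more points.

```python
import math
from itertools import combinations, product

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dunn_index(clusters):
    """Minimum between-cluster point distance over maximum within-cluster diameter."""
    min_between = min(dist(p, q)
                      for a, b in combinations(clusters, 2)
                      for p, q in product(a, b))
    max_diameter = max(dist(p, q)
                       for pts in clusters
                       for p, q in combinations(pts, 2))
    return min_between / max_diameter

clusters = [[[0.0, 0.0], [0.0, 2.0]], [[4.0, 1.0], [6.0, 1.0]]]
d_index = dunn_index(clusters)
```

For this partition the closest pair of points from different clusters is √17 apart and both diameters equal 2.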
Biological internal connectivity
is the number of all the possible shared pathways among patterns grouped in cluster m and any other pattern in the dataset. A value close to 0 indicates more biologically significant clusters. For this measure, non empty and annotated clusters are taken into account.
Global Measure for Linked Clustering (GMLC)
To evaluate both the coherence and the biological significance of clusters found over biological datasets, we have used the G measure, which is a biologically-inspired validity measure for the comparison of clustering methods over metabolic datasets. It is defined as the sum of two terms: a measure of the flatness of the distribution of patterns along clusters, which indicates whether the data samples have been coherently grouped when having a sign-inverted value, and the biological internal connectivity measure, as previously explained.
In this subsection, the datasets used for SOM training are described. The Kyoto Encyclopedia of Genes and Genomes (KEGG) [39, 40] pathway database was used for calculation of the biological connectivity. All pathways in which the measured elements participated have been considered.
The first biological dataset used in this paper involves metabolic and transcriptional profiles from Introgression Lines (ILs) of Solanum lycopersicum. The ILs harbor, at certain chromosomes segments, introgressed portions of the wild species (Solanum pennellii). After log-transforming the expression values over the entire dataset, genes with no significant change were discarded from further analysis. As a result of the pre-processing and selection steps, 1159 genes were selected. The metabolic data were obtained analyzing polar extracts of tomato fruits, through Gas Chromatography coupled to Mass Spectrometry (GC-MS). The metabolite profiling technique used allows the identification of approximately 80 primary metabolic compounds. For each metabolite in each IL, the log ratio of the mean of the replicates was calculated. In the selection step only 70 metabolites (having log ratio greater than 0.1) were kept for data integration and cluster analysis. Further details on data selection can be found on . This data set has a size of 1229 data points.
The second biological dataset comprises primary metabolites and transcripts measured in Arabidopsis thaliana leaves. The integrated analysis of this data is aimed at studying the effects of the cold on circadian regulated genes in this plant . In this study we included metabolites and transcripts under light-dark cycles at two control temperatures (20°C and 4°C). Genes involved in diurnal cycle and cold-stress responses were selected for further study. More details on how the data were processed, filtered and normalized can be found in . A total of 1549 genes and 51 metabolites were used in the integrated analysis, resulting in a total of 1600 data patterns.
Results and discussion
Validation measures for SOM training: standard (sSOM) versus biological (bSOM) for metabolic datasets
With respect to the measures that take into account the biological information associated with the clusters obtained, the biological internal connectivity measure is clearly and consistently improved as α increases for the proposed algorithm when compared to sSOM, for all configurations and both datasets. As can be expected, at low α the improvement is small, but when α increases the clusters are more biologically connected, which is directly reflected by this measure, reaching the best possible result for this index at the maximum α considered here. The significance of these results has been statistically tested by performing 100 re-samplings of 90% of the metabolites in both datasets, for all the methods (sSOM vs. bSOM with different α). An ANOVA was performed to test the null hypothesis that the difference among the clustering results for the biological connectivity measure with different training methods is not significant. The analysis revealed that the results in the table show significant differences (p<0.001). Finally, the G measure, which evaluates in a single index not only cluster quality but also biological content, remains almost unchanged or even improves. For the first data set, G has almost the same value in all configurations. As α is increased in bSOM, G values improve for the second data set, even at maximum α. In general, it can be stated that while a balance between homogeneity and coherence is maintained, an improvement in the biological connectivity of the clusters can be achieved.
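The re-sampling and the one-way ANOVA statistic can be sketched with the standard library alone. The function names are illustrative, and the sketch stops at the F statistic: turning it into a p-value requires the F distribution, which is not part of the Python standard library.

```python
import random

def resample(items, fraction, rng):
    """Random subset holding `fraction` of the items, drawn without replacement."""
    size = int(round(fraction * len(items)))
    return rng.sample(items, size)

def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of scores."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# One re-sampling of 90% of 70 metabolite indices.
subset = resample(list(range(70)), 0.9, random.Random(0))

# Toy scores for three training methods (made-up numbers, for illustration only).
f_stat = anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
```

In practice each group would hold the connectivity scores obtained over the 100 re-samplings for one training method.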
Validation measures for SOM training: standard (sSOM) and biological (bSOM) for the full datasets
Considering now only the measures that evaluate the biological quality of the solutions (biological internal connectivity and G), both present better results for bSOM than for sSOM in both datasets. The biological connectivity of the clusters is effectively improved even when both distances (Euclidean and biological) are equally weighted (α=0.5). The G measure also consistently obtains better scores as α increases, in all configurations tested for each map. This means that enhanced clustering results can be achieved when using bSOM rather than sSOM, not only with respect to cluster quality but also from a biological point of view.
For the full Arabidopsis dataset, we have also calculated the biological homogeneity index (BHI) for sSOM and bSOM, which measures how biologically homogeneous the obtained clusters are. BHI evaluates whether genes in the same cluster are also part of the same functional classes according to GO annotations. The BHI score obtained for sSOM was 6.49%. For bSOM with the same α values reported in Table 2, the BHI scores were 6.57, 6.68 and 7.53%. As can be seen, this independent measure also indicates that better biological clusters can be obtained with the proposed algorithm.
Detail of patterns and common pathways for sSOM vs. bSOM
From a quantitative point of view, it can be seen that in general bSOM can increase the number of common pathways in the clusters for the same number of elements. In particular, in Cluster A the number of common pathways among cluster elements is maintained, although bSOM achieves that result with fewer cluster elements. In Cluster B, a higher number of common metabolic pathways was obtained for the same number of elements. In Cluster C, a better grouping allows finding common biological information, which could not be achieved by using the standard training algorithm. Finally, Cluster D exemplifies how, for the same number of elements with related biological information in a cluster, more common pathways can be found by bSOM (note that although the cluster found by bSOM has 5 elements, only 3 of them participate in known pathways).
The previous examples suggest that bSOM is able to better group the amino acids glycine, serine, threonine, valine, leucine, isoleucine, lysine and arginine within clusters, considering the number of biochemical pathways they are involved in. For instance, bSOM grouped serine, threonine, valine and isoleucine within cluster A, and glycine, arginine and lysine in a separate cluster (B). In this case, bSOM takes into account the possibility that the co-variation of valine and isoleucine can also be affected by their degradative pathway (ko00280). Another example of the usefulness of bSOM is given by clusters C and D. In the first case, bSOM grouped two transcripts that both encode a beta-galactosidase precursor. This grouping is to be expected, whether the two transcripts are derived from the same gene or from different loci. In cluster D, glutamate, proline and sucrose were grouped together with two transcripts. One of these transcripts (LE23B16) encodes a putative calcium-dependent protein kinase (CDPK). Although the exact mechanism by which this protein could be related to the variation of the above-mentioned metabolites is not known, the role of different CDPKs in the control of primary plant metabolism is well documented.
In this paper we presented a new training algorithm for self-organizing maps (bSOM) over biological datasets. A new biologically-inspired term, which considers common pathways, is added to the calculation of the distances among data points and neuron centroids. This term evaluates the internal connections of the data samples in terms of their membership in known pathways. The proposed training algorithm was tested on two datasets involving Solanum lycopersicum and Arabidopsis thaliana transcripts and metabolites. Classical data mining validation measures were used to qualify the clustering solutions obtained with both algorithms, as well as a new measure that takes into account the biological significance of the clusters found. The new algorithm showed important improvements in performance in comparison to standard SOM training. It is worth highlighting that the inclusion of biological information during training has effectively improved the results. This increases the biological value of the clusters found and simplifies their further analysis. Future work will involve expanding the range of additional biological sources that could be used in combination with clustering algorithms.
Project name: bSOM.
Source code and data sets: http://sourceforge.net/projects/sourcesinc/files/bsom
License: open source, free for academic use.
DM and GS proposed and implemented the clustering algorithm, and wrote the manuscript. ML, LK and FC contributed with motivations and useful discussions, provided the case study dataset and revised the manuscript. All authors read and approved the final manuscript.
This work was supported by National Scientific and Technical Research Council [PIP 2013 #117], INTA [PNBIO #1131022], National University of Litoral [CAI+D 2011 #548] and National Agency for the Promotion of Science and Technology [PICT 2011 #2440, PAE #37122, PICT 2011-1171].
- Wolfe CJ, Kohane IS, Butte AJ: Systematic survey reveals general applicability of “guilt-by-association” within gene coexpression networks. BMC Bioinformatics. 2005, 6: 227-237. 10.1186/1471-2105-6-227.View ArticlePubMed CentralPubMedGoogle Scholar
- Lacroix V, Cottret L, Thebault P, Sagot MF: An Introduction to Metabolic Networks and Their Structural Analysis. IEEE/ACM Trans Comput Biol Bioinform. 2008, 5 (4): 594-617.View ArticlePubMedGoogle Scholar
- Usadel B, Obayashi T, Mutwil M, Giorgi F, Bassel G, Tanimoto M, Chow A, Steinhauser D, Persson S, Provart N: Co-expression tools for plant biology: opportunities for hypothesis generation and caveats. Plant, Cell & Environ. 2009, 32 (12): 1633-1651. 10.1111/j.1365-3040.2009.02040.x.View ArticleGoogle Scholar
- Tohge T, Fernie A: Combining genetic diversity, informatics and metabolomics to facilitate annotation of plant gene function. Nat Protoc. 2010, 5 (6): 1210-1227. 10.1038/nprot.2010.82.View ArticlePubMedGoogle Scholar
- Tasoulis D, Plagianakos V, Vrahatis M: Computational Intelligence in Bioinformatics, Volume 94 of Studies in Computational Intelligence. 2008, Berlin: SpringerGoogle Scholar
- Fogel G, Corne D, Pan Y: Computational Intelligence in Bioinformatics. 2007, Piscataway: Wiley-IEEE PressView ArticleGoogle Scholar
- Kohonen T: Essentials of the self-organizing map. Neural Netw. 2013, 37 (37): 52-65.View ArticlePubMedGoogle Scholar
- Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E, Golub T: Interpreting patterns of gene expression with self-organizing maps: Methods and applications to hematopoietic differentiation. Proc Natl Acad Sci USA. 1999, 96: 2907-2912. 10.1073/pnas.96.6.2907.View ArticlePubMed CentralPubMedGoogle Scholar
- Wang J, Delabie J, Aasheim H, Smeland E, Myklebost O: Clustering of the SOM easily reveals distinct gene expression patterns: results of a reanalysis of lymphoma study. BMC Bioinformatics. 2002, 3: 36-46. 10.1186/1471-2105-3-36.View ArticlePubMed CentralPubMedGoogle Scholar
- Allen E, Moing A, Ebbels TM, Maucourt M, Tomos AD, Rolin D, Hooks MA: Correlation Network Analysis reveals a sequential reorganization of metabolic and transcriptional states during germination and gene-metabolite relationships in developing seedlings of Arabidopsis. BMC Syst Biol. 2010, 4: 62-72. 10.1186/1752-0509-4-62.View ArticlePubMed CentralPubMedGoogle Scholar
- Hirai M, Klein M, Fujikawa Y, Yano M, Goodenowe D, Yamazaki Y, Kanaya S, Nakamura Y, Kitayama M, Suzuki H, Sakurai N, Shibata D, Tokuhisa J, Reichelt M, Gershenzon J, Saito K: Elucidation of gene-to-gene and metabolite-to-gene networks in arabidopsis by integration of metabolomics and transcriptomics. J Biol Chem. 2005, 280 (27): 25590-25595. 10.1074/jbc.M502332200.View ArticlePubMedGoogle Scholar
- Stegmayer G, Milone D, Kamenetzky L, Lopez M, Carrari F: Neural Network Model for Integration and Visualization of Introgressed Genome and Metabolite Data. IEEE International Joint Conference on Neural Networks. 2009, Piscataway: IEEE Computational Intelligence Society, 3177-3183.Google Scholar
- Milone D, Stegmayer G, Kamenetzky L, Lopez M, Giovannoni J, Lee JM, Carrari F: *omeSOM: a software for integration, clustering and visualization of transcriptional and metabolite data mined from interspecific crosses of crop plants. BMC Bioinformatics. 2010, 11: 438-448. 10.1186/1471-2105-11-438.View ArticlePubMed CentralPubMedGoogle Scholar
- Usadel B, Poree F, Nagel A, Lohse M, Czedik-Eysenberg A, Stitt M: A guide to using MapMan to visualize and compare Omics data in plants: a case study in the crop species, Maize. Plant Cell Environ. 2009, 32: 1211-1229. 10.1111/j.1365-3040.2009.01978.x.View ArticlePubMedGoogle Scholar
- Ashburner M: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-9. 10.1038/75556.View ArticlePubMed CentralPubMedGoogle Scholar
- Buehler E, Sachs J, Shao K, Bagchi A, Ungar L: The CRASSS plug-in for integrating annotation data with hierarchical clustering results. Bioinformatics. 2004, 20 (17): 3266-3269. 10.1093/bioinformatics/bth362.View ArticlePubMedGoogle Scholar
- Curtis RK, Oresic M, Vidal-Puig A: Pathways to the analysis of microarray data. Trends Biotechnol. 2005, 23 (8): 429-435. 10.1016/j.tibtech.2005.05.011.View ArticlePubMedGoogle Scholar
- Doherty J, Carmichael L, Mills J: GOurmet: a tool for quantitative comparison and visualization of gene expression profiles based on gene ontology (GO) distributions. BMC Bioinformatics. 2006, 7: 1-9. 10.1186/1471-2105-7-1.View ArticleGoogle Scholar
- Toronen P: Selection of informative clusters from hierarchical cluster tree with gene classes. BMC Bioinformatics. 2004, 5: 32-10.1186/1471-2105-5-32.View ArticlePubMed CentralPubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Nat Acad Sci USA. 2005, 102 (43): 15545-15550. 10.1073/pnas.0506580102.View ArticlePubMed CentralPubMedGoogle Scholar
- Lanckriet GRG, Bie TD, Cristianini N, Jordan MI, Noble WS: A statistical framework for genomic data fusion. Bioinformatics. 2004, 20 (16): 2626-2635. 10.1093/bioinformatics/bth294.View ArticlePubMedGoogle Scholar
- Stegmayer G, Milone DH, Kamenetzky L, Lopez MG, Carrari F: A biologically inspired validity measure for comparison of clustering methods over metabolic data sets. IEEE/ACM Trans Comput Biology Bioinform. 2012, 9 (3): 706-716.View ArticleGoogle Scholar
- Dotan-Cohen D, Kasif S, Melkman AA: Seeing the forest for the trees: using the Gene Ontology to restructure hierarchical clustering. Bioinformatics. 2009, 35 (14): 1789-1795.View ArticleGoogle Scholar
- Wang H, Azuaje F, Bodenreider O, Dopazo J: Gene expression correlation and gene ontology-based similarity: an assessment of quantitative relationships. CIBCB ’04. Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology. 2004, Piscataway: IEEE Press, 25-31.
- Hanisch D, Zien A, Zimmer R, Lengauer T: Co-clustering of biological networks and gene expression data. ISMB (Supplement of Bioinformatics). 2002, Oxford: Oxford University Press, 145-154.
- Cheng J, Cline M, Martin J, Finkelstein D, Awad T, Kulp D, Siani-Rose MA: A knowledge-based clustering algorithm driven by gene ontology. J Biopharm Stat. 2004, 14 (3): 687-700. 10.1081/BIP-200025659.
- Huang D, Pan W: Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics. 2006, 22 (10): 1259-1268. 10.1093/bioinformatics/btl065.
- Speer N, Spieth C, Zell A: A memetic co-clustering algorithm for gene expression profiles and biological annotation. Proc. of Congress on Evolutionary Computation (CEC), Volume 2. 2004, Piscataway: IEEE Press, 1631-8.
- Kustra R, Zagdanski A: Data-fusion in clustering microarray data: Balancing discovery and interpretability. IEEE/ACM Trans Comput Biol Bioinform. 2010, 7: 50-63.
- Diaz N, Ruiz J: GO-based functional dissimilarity of gene sets. BMC Bioinformatics. 2011, 12: 360. 10.1186/1471-2105-12-360.
- Dotan-Cohen D, Melkman AA, Kasif S: Hierarchical tree snipping: clustering guided by prior knowledge. Bioinformatics. 2007, 23 (24): 3335-3342. 10.1093/bioinformatics/btm526.
- Kasturi J, Acharya R: Clustering of diverse genomic data using information fusion. Bioinformatics. 2005, 21 (4): 423-429. 10.1093/bioinformatics/bti186.
- Gillis J, Pavlidis P: Assessing identity, redundancy and confounds in Gene Ontology annotations over time. Bioinformatics. 2013, 10.1093/bioinformatics/bts727.
- Xu R, Wunsch DC: Clustering. 2009, Piscataway: Wiley and IEEE Press.
- Haykin S: Neural Networks: A Comprehensive Foundation (3rd Edition). 2007, Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
- Kohonen T, Schroeder MR, Huang TS: Self-Organizing Maps. 2005, New York: Springer-Verlag.
- Handl J, Knowles J, Kell DB: Computational cluster validation in post-genomic data analysis. Bioinformatics. 2005, 21 (15): 3201-3212. 10.1093/bioinformatics/bti517.
- Davies D, Bouldin D: A cluster separation measure. IEEE Trans Pattern Anal Mach Intell. 1979, 1 (4): 224-227.
- KEGG PATHWAY Database. http://www.genome.jp/kegg/pathway.html
- Kanehisa M, Goto S: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28: 27-30. 10.1093/nar/28.1.27.
- Espinoza C, Degenkolbe T, Caldana C, Zuther E, Leisse A, Willmitzer L, Hincha D, Hannah M: Interaction with Diurnal and Circadian regulation results in dynamic metabolic and transcriptional changes during cold acclimation in Arabidopsis. PloS one. 2010, 5 (11): 1-19.
- Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a dataset via the Gap statistic. J R Stat Soc B. 2001, 63: 411-423. 10.1111/1467-9868.00293.
- Rubel O, Weber G, Huang MY, Bethel EW, Biggin M, Fowlkes C, Hendriks CL, Keranen S, Eisen M, Knowles D, Malik J, Hagen H, Hamann B: Integrating data clustering and visualization for the analysis of 3D gene expression data. IEEE/ACM Trans Comput Biol Bioinform. 2010, 7: 64-79.
- Datta S, Datta S: Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. 2006, 7 (7): 397.
- Plaxton WC, McManus MT, Moorhead GBG, Templeton GW, Tran HT: Role of protein kinases, phosphatases and 14-3-3 proteins in the control of primary plant metabolism. Ann Plant Rev. 2007, 22: 121-149.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.