Integrating gene expression and GO classification for PCA by preclustering
© De Haan et al; licensee BioMed Central Ltd. 2010
Received: 20 May 2009
Accepted: 26 March 2010
Published: 26 March 2010
Gene expression data can be analyzed by summarizing groups of individual gene expression profiles based on GO annotation information. The mean expression profile per group can then be used to identify interesting GO categories in relation to the experimental settings. However, the expression profiles present in GO classes are often heterogeneous, i.e., there are several different expression profiles within one class. As a result, important experimental findings can be obscured because the summarizing profile does not seem to be of interest. We propose to tackle this problem by finding homogeneous subclasses within GO categories: preclustering.
Two microarray datasets are analyzed. First, a selection of genes from a well-known Saccharomyces cerevisiae dataset is used. The GO class "cell wall organization and biogenesis" is shown as a specific example. After preclustering, this term can be associated with different phases in the cell cycle, where it could not be associated with a specific phase previously. Second, a dataset of differentiation of human Mesenchymal Stem Cells (MSC) into osteoblasts is used. For this dataset results are shown in which the GO term "skeletal development" is a specific example of a heterogeneous GO class for which better associations can be made after preclustering. The Intra Cluster Correlation (ICC), a measure of cluster tightness, is applied to identify relevant clusters.
We show that this method leads to an improved interpretability of results in Principal Component Analysis.
With the advent of large gene expression experiments, new methods of analysis have become necessary to extract relevant information from the data. Exploratory data analysis methods like cluster analysis are regularly used to examine the expression profiles [1–3]. Other methods use annotation information and look for overrepresentation in sets of significantly regulated genes [4–6]. A next step would be to associate relevant profiles with annotation information and experimental variables simultaneously. In this paper we will show advances in finding associations between annotation categories and experimental variables in microarray experiments.
One of the most extensive and systematic methods of categorizing information about genes is the Gene Ontology (GO) database . A problem when relating GO classes with expression profiles is the fact that the genes in these functional classes can have diverse expression profiles. This could mean that a class is not responding to the experimental factors and is not related to the specific biological settings. However, a second possibility is that interesting subgroups are silenced by other heterogeneous or anti-correlated expression profiles present within the class. This may obscure interesting relations. To address this problem, we propose to cluster the expression profiles of genes in every category, and select relevant clusters before applying Principal Component Analysis (PCA; ).
PCA has been applied frequently to explore the microarray data in a low-dimensional space [9, 10]. Either genes or arrays are described with so called Principal Components, in order to assess relations between arrays or to identify genes with similar expression profiles. The technique is very versatile and can easily cope with large datasets. Work done by Alter et al.  is an example of the application of PCA to reduce the dimensionality of microarray data. PCA was applied to the Yeast Cell Cycle dataset of Spellman et al. , with each gene as an individual object. We will use the same dataset, but will focus on improvements in the application of PCA to find relations between specified classes of genes and phases in the cell cycle. The work by Goeman et al.  is an example of the direct association between annotation information and data analysis. A global test is introduced, determining the relation between a global expression pattern of a group of genes and a clinical outcome of interest. The global expression pattern summarising a group of genes is a method to perform research, based on previous research stored in databases like for instance GO. A second example of summarization of annotation categories is from Chen and Wang . In this paper, gene expression data with prior biological knowledge are integrated by constructing "supergenes" for each gene category by summarizing information from genes related to outcome using a modified principal component analysis (PCA) method. Instead of using genes, these supergenes representing information from each gene category were used in further analysis. Both methods [13, 14] indicate that analysing the data on the level improves the results of predictions.
We propose to perform clustering of expression profiles present in individual GO classes before applying PCA. This is a reversal of the order with respect to more common analyses, where clusters of genes with similar expression profiles are mapped to GO terms. An indication of the improvements of GO class descriptions after cluster analysis is given in Figure 1B, C and 1D. After clustering with model-based clustering  these three new subgroups with more distinct profiles can be formed for this class. The new subclasses will not be discarded because the mean correlation of each newly formed class is higher and the role of this process will not be obscured any longer.
The advantages of the cluster analysis will be shown by comparing results from PCA performed with and without preclustering. The Saccharomyces cerevisiae dataset and a Mesenchymal Stem Cells (MSC) dataset will be used. The intra-class correlation of clusters will be analyzed to identify better defined subgroups, and two specific examples of GO categories benefiting from the clustering will be given.
All calculations are performed in R . GO information is obtained from the R data packages "Yeast" and "hgu133a".
Two datasets are used to show the advantages of the method proposed here. The first dataset is the well known Saccharomyces cerevisiae Cell Cycle dataset of Spellman et al. , from here on referred to as the Yeast Cell Cycle (YCC) dataset. The focus is on a subset of 800 genes involved in the cell cycle . From this set, genes are selected for which GO annotation information is available. Only GO classes which contain at least 4 genes are considered, which are members of the Biological Process definition in GO. The 24 time points of the cdc15 synchronization method in the experiment are used. As a result of these choices, a data matrix of 24 time points by 348 genes is obtained.
The second gene expression experiment analyzed in this paper, was performed on human mesenchymal stem cells, triggered to undergo osteogenic differentiation (E. Piek et al., manuscript in preparation). This is a time series dataset with multiple osteogenic treatments: Dexamethasone (DEX), Bone Morphogenetic Protein (BMP), Vitamin D3 (VIT) and an Untreated control (UNT). The hybridizations were performed with Affymetrix Human Genome U133 A GeneChips . The dataset was used previously to demonstrate the advantages of applying Principal Component Analysis (PCA) on interaction terms from an Analysis of Variance for a large dataset . The MSC dataset has four dimensions (genes, treatments, time points and replicates). Rather than focussing on the expression values, as in the YCC dataset, we analyze the "Gene-Treatment" two-way interaction matrix from the ANOVA model, described in de Haan et al. . It represents the relations of genes with specific treatments - in this case, we want to extend this to describe the relation of specific GO (sub)classes and treatments. Starting from the interaction matrix, genes are selected having GO annotations, where the corresponding GO classes contain at least 4 genes and have the definition Biological Process. This gives us a data matrix of 4 treatments times 11974 genes. Furthermore, a subset of this set is considered containing 50 relevant GO categories. These categories have been defined beforehand by domain experts.
Principal Component Analysis and biplots
where U (an n by n score matrix) and V (an m by m loading matrix) are orthogonal and Λ is an n by m matrix containing the so called singular values. The superscript T means the transpose of the matrix V. For the biplot , the n × 2 matrix U2 of the first 2 PCs represents the objects in the data. The m × 2 matrix containing the first 2 loading vectors represents the variables in the data. A biplot is then constructed by plotting U2 and V2 in the same graph.
Combining experimental data with GO information
In order to relate GO information with experimental data, either expression profiles or interaction effects from ANOVA, we should combine both entities. Here, we use matrices to represent expression information (E) and GO category information (G). In both matrices, rows correspond to genes. Columns in E indicate time points (as in the YCC dataset) or interactions (the MSC data). In matrix G, columns represent GO terms in a binary coding: if a gene is annotated for a specific GO term the corresponding value in the matrix is 1, otherwise it is 0.
The resulting data table X is then scaled column-wise so that the sum of each column is one. The X matrix can now be considered to contain the mean expression profiles - or interaction profiles - for all GO categories of interest.
Finally, PCA is performed on the X table containing combined information sources. The results can be visualized with a biplot . In the biplots the GO classes are shown as points; cell cycle phases and treatments are indicated as arrows. The biplot allows the GO terms to be correlated with the treatments or the cell cycle phases.
In order to prevent interesting profiles within one GO category to cancel out, we propose to perform a cluster analysis on all GO categories of interest individually, provided they contain enough genes. The actual clustering is performed on the submatrix of the gene expression matrix E, that corresponds to all columns and genes within a go cluster. Based upon the results of the gene expression clustering, new GO groups (one per cluster) are added to the GO matrix. This step is indicated with the term "preclustering". In the remainder of the paper, we will only consider GO (sub)categories containing at least four genes. In principle, any clustering method can be used. We have chosen to cluster the genes, present in each GO class, with model-based clustering  because this is one of the few methods giving an indication of the optimal number of clusters automatically. When this optimal number equals one, the category is not split; if it is larger than one, the GO class is split into several subclasses. Model-based clustering is available in several R packages, such as flexmix and mclust. Which method is used is not very important - the main motivation is to achieve an automatic assessment of the optimal number of clusters. In our scripts, we have used the 2002 version of mclust (available as mclust02). Model-based clustering has been applied to gene expression data before (e.g., ). The data are described as a mixture of (normal) distributions. The method applies a number of different models, identified by more or less stringent constraints. Parameters for these models include different variations of shape, volume, and orientation of the clusters. Selection of the best model is performed using the Bayesian Information Criterion (BIC, ). This index corrects for several parameters used in the clustering - for instance the number of clusters - to enable a fair comparison between the results of different models. The clustering of the model with the best BIC was chosen with a minimum of 1, and maximum of 8 clusters; it is not to be expected that a higher number of clusters is needed in practice, but one can easily check whether the results improve upon increasing this upper limit. The steps in the preclustering algorithm are displayed below:
1. Construct the E matrix of expression values and the G matrix containing GO category information.
2. Set maximum number of clusters and minimum number of genes in cluster (here 8 and 4 respectively)
3. For each GO class:
3a. select the submatrix from matrix E that contains only the genes from that particular GO class
3b. perform gene-wise clustering of the submatrix and select the best clustering
3c. accept only clusters with at least the minimum number of genes
4. Define a new G matrix containing either the original GO classes or, when clusters have been detected, the GO subclasses - one column is used for each (sub)class
5. Multiply E and the new G matrix to obtain the X matrix
6. Perform column-wise scaling of X by dividing by the number of genes in each subclass
7. Perform PCA on the scaled X matrix and show the biplot
Here, C i is an element in the lower triangle of the correlation matrix of the genes in a cluster. The measure is also used by Busold et al.  but was not specifically given a name. By using the ICC it is possible to assess the quality of the newly formed subgroups from a GO class: high values indicate tight clusters. An additional advantage of the ICC over a measure like cluster volume is that uninteresting clusters around zero, which may cover a small volume and therefore would seem interesting according to the volume criterion, usually show only low ICC values.
The result of the preclustering stage is a new set of GO categories, nested in the original categories. These new categories now are used as matrix G in Equation 2 to obtain matrix X, the focus of attention. As a consequence, one original GO category now may have several representatives in this matrix, and in that case will show up multiple times in PCA plots. The matrix E containing the expression values is not changed by the preclustering algorithm.
Results and Discussion
Number of GO categories, before and after preclustering.
# GO cl.
# > 4
MSC (50 cl.)
Clearly, for the MSC data set and the relevant subset of 50 GO categories, many GO categories are heterogeneous with respect to the expression data. For the subset, for example, all GO categories are split, leading to 100 new categories. Of these 100, 79 contain more than four genes. For the much smaller and simpler YCC data set, there is still a group of 12 GO categories that is split, leading in total to 70 categories each containing more than four genes.
As an example of a heterogeneous GO class benefiting from preclustering the term "cell wall organization and biogenesis" is taken. The three subclasses are shown with asterisks in Figure 2. The expression profiles and clusters for this term can be seen in Figure 1 in the introduction. The term is chosen because it is expected to be involved in the cell cycle, where the cell wall has to be organized and assembled during cell division. The new subclasses generated with clustering have an increased ICC compared to the ICC of the whole class. The profiles are more specific now and will not cancel out in the analysis.
The consequences of preclustering are shown by a number of differences between Figures 3A and 3B. The term "cell wall organization and biogenesis", represented with a star, is used as an example showing interesting changes. When the whole GO class is used (Figure 3A), the term seems uninteresting and is close to the center of the PCA biplot. It is hard to associate it with a specific phase of the cell cycle. When the different profiles are separated by the clustering however, the new subgroups can be associated with phases G1, M, and between M and G2.
Because of the large number of GO terms for this dataset we will focus on the terms involved in cell differentiation and osteogenesis. Out of a list of 50 GO terms which could be expected to be responsive to the conditions and the setup of the experiment, 24 terms were present in our dataset 2, taking into account our criteria for the minimum number of genes in a class. For these 24 classes, clustering of the expression profiles in the original GO classes gives rise to 79 classes. These subclasses are marked with black dots in Figure 4. A large number of new subclasses show an increased ICC; they all lie in the vertical band described above. There is also a number of new subclasses which has a decreased ICC.
The number of subclasses associated with cell differentiation and osteogenesis has increased from 24 original GO classes to 79 subclasses, as mentioned before. The GO term "skeletal development" is shown as an example of a category which benefits from preclustering. The group is marked with a star symbol in Figure 5A. After preclustering, six subclasses are identified. As can be seen in Figure 5A, the original term could not be associated with a specific treatment and is present in the center of the plot. This has changed dramatically after clustering, as shown in Figure 5B. One subgroup is still in the center of the plot, but the other subclasses can be associated with individual treatments. Two subgroups can specifically be correlated with osteogenic treatments, one for VIT and one for BMP. These groups are even lying on the corresponding arrows for the loadings. A number of genes within the subgroups corresponds with information known from biological literature. The new subgroup lying on the arrow of the BMP contains the gene MSX1 for instance, which has been proven to be induced by BMP in mice . An example of a gene present in the newly formed subgroup lying on the loading of the VIT treatment is MSX2, which is known to be regulated by vitamin D . Finally, there are two subgroups which lie in the direction of the untreated arrow and one final class which could possibly be correlated with DEX.
Table of 24 GO terms which were used to generate the ROC plot.
cell cycle arrest
regulation of cell growth
negative regulation of progression through cell cycle
Wnt receptor signaling pathway
positive regulation of cell proliferation
transmembrane receptor protein serine/threonine kinase signaling pathway
inactivation of MAPK activity
regulation of Wnt receptor signaling pathway
regulation of cell differentiation
activation of MAPK activity
positive regulation of MAPK activity
extracellular matrix organization and biogenesis
regulation of development
transforming growth factor beta receptor signaling pathway
regulation of cell proliferation
We have shown a simple and general method to relate expression levels, either directly, or after an ANOVA, to GO categories at all levels in the hierarchy simultaneously. The crucial step is the realization that genes, although they are in the same GO category, may show different profiles. As a result, the application of preclustering leads to more differentiated information. The method is generic; we have opted to use model-based clustering, but in principle any other clustering method can be used, provided that it is possible to automatically generate a reasonable estimate of the number of clusters. The method may be adjusted in a number of ways. The lower limit on the number of genes for a GO category to be considered in the analysis is rather arbitrary. For other problems and datasets this can be adjusted depending on the questions for the specific dataset. The same is true for the cutoff for selecting GO categories. It is possible to inadvertently remove categories which are of interest to your dataset, but are very small.
Other techniques have been proposed for relating measurements to annotation information. An example of relating class information and experimental factors is shown by Jeffery et al. , who use transcription factor binding sites information. These are sites in the promoters of genes to which a transcription factor can bind. The transcription factors play an important role in the regulatory networks of gene expression. A method called Correspondence Analysis (CA; ) was applied to relate classes and experimental information. CA is similar to PCA; the algorithm can reduce the multidimensional data matrix to a lower dimension containing the important information from the data. In the standard way of applying CA, the dependency of values in the rows on the columns of a contingency table for two sets of conditions is investigated. After scaling of both the rows and columns, PCA is applied to the scaled data.
For the analysis of gene expression data, the first articles which applied CA treated the matrix with expression values as a contingency table. The aim was then to associate genes with variables. Kishino et al.  applied CA to show the relationship between genes and tissues of a dataset which was designed to investigate colon cancer . Fellenberg et al.  also showed the possibility to investigate the associations between genes and variables. The application of CA to incorporate GO information as class information in the visualization of results was shown by Busold et al. . Datasets investigating glucose metabolism and from human pancreatic adenocarcinomas were used.
For our data, CA and PCA would lead to very similar results, because of the fact that the column means and row means, used for scaling in CA, are very similar, something that is also true for the data used in, e.g., . In Figure 6, the CA curves would completely overlap with the PCA curves and therefore have been left out. Since PCA is more easy to interpret, we prefer it for this case. Preclustering of GO information as presented here could also reveal interesting findings otherwise discarded in methods using class information like CA.
GO information (and other annotation information) is to some extent limited, because not all genes are annotated. This will have a limiting effect on the size of clusters for less well known GO categories; if a certain GO category in the database is not described correctly, or extensively enough, it will obviously be hard to link this GO category with the experimental factors of interest. However, the number of annotations will increase rapidly so that the method will only gain in importance.
With the division of GO classes in clusters with similar profiles, it is possible that one of the new classes gets a high ICC, but remains in the center of the biplot. Such a subgroup will contain relatively flat profiles; nevertheless, it is beneficial that it is clustered separately, to prevent other interesting subgroups to be obscured. One final result of the preclustering is that a category can be found more than once in the same PCA biplot. This can make it more difficult to draw conclusions for a single GO category as a whole. On the other hand, the results will be more indicative of what is really happening within a GO class.
The advantages of clustering heterogeneous GO classes have been shown here for two real gene expression datasets, the well-known YCC dataset and the MSC dataset. The results show that preclustering yields an increased number of interesting groups deviating from the center of the PCA biplot. In this center the less interesting groups with flat profiles are present. In future, more formal selection criteria could be used to identify interesting GO classes and newly formed subclasses, based on statistical significance.
These properties of preclustering allow for a better association of GO categories with phases or treatments, because interesting subgroups which are obscured by different profiles are separated from each other. New meaningful relations are discovered which would not have been found otherwise with PCA. For the GO processes "cell wall organization and biogenesis" (GO:0007047) and "skeletal development" (GO:0001501) this is explicitly shown.
The first author acknowledges MSD for financial support.
- Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 1998, 95: 14863–14868. 10.1073/pnas.95.25.14863View ArticleGoogle Scholar
- Tavazoie S, Hughes J, Campbell M, Cho R, Church G: Systematic determination of genetic network architecture. Nature genetics 1999, 22: 281–285. 10.1038/10343View ArticlePubMedGoogle Scholar
- Yeung K, Fraley C, Murua A, Raftery A, Ruzzo W: Model-based clustering and data transformations for gene expresison data. Bioinformatics 2001, 17: 977–987. 10.1093/bioinformatics/17.10.977View ArticlePubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 2005, 102: 15545–15550. 10.1073/pnas.0506580102View ArticleGoogle Scholar
- Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations and open problems. Bioinformatics 2005, 21: 3587–3595. 10.1093/bioinformatics/bti565View ArticlePubMedPubMed CentralGoogle Scholar
- Alexa A, Rahnenführer J, Lengauer T: Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 2006, 22: 1600–1607. 10.1093/bioinformatics/btl140View ArticlePubMedGoogle Scholar
- The Gene Ontology Consortium: Gene Ontology: Tool for the Unifaction of Biology. Nature Genetics 2000, 25: 25–29. 10.1038/75556View ArticlePubMed CentralGoogle Scholar
- Jackson J: A users guide to principal components. Wiley & Sons, New York; 1991. full_textView ArticleGoogle Scholar
- Holter N, Mitra M, Maritan A, Cieplak M, Banavar J, Fedoroff N: Fundamental patterns underlying gene expression profiles: Simplicity from complexity. Proceedings of the National Academy of Sciences 2000, 97: 8409–8414. 10.1073/pnas.150242097View ArticleGoogle Scholar
- Raychaudhuri S, Stuart J, Altman R: Principal components analysis to summarize microarray experiments: application to sporulation time series. Pacific Symposium on Biocomputing 2000, 455–466.Google Scholar
- Alter O, Brown P, Botstein D: Singular value decomposition for genome-wide expression data processing and modeling. Proceedings of the National Academy of Sciences 2000, 97: 10101–10106. 10.1073/pnas.97.18.10101View ArticleGoogle Scholar
- Spellman P, Sherlock G, Zhang M, Iyer V, Anders K, Eisen M, Brown P, Botstein D, Futcher B: Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 1998, 9: 3273–3297.View ArticlePubMedPubMed CentralGoogle Scholar
- Goeman J, Geer S, de Kort F, van Houwelingen H: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004, 20: 93–99. 10.1093/bioinformatics/btg382View ArticlePubMedGoogle Scholar
- Chen X, Wang L: Integrating biological knowledge with gene expression profiles for survival prediction of cancer. Journal of Computational Biology 2009, 16: 265–278. 10.1089/cmb.2008.12TTView ArticlePubMedPubMed CentralGoogle Scholar
- Busold C, Winter S, Hauser N, Bauer A, Dippon J, Hoheisel J, Fellenberg K: Integration of GO annotations in Correspondence Analysis: facilitating the interpretation of microarray data. Bioinformatics 2005, 21: 2424–2429. 10.1093/bioinformatics/bti367View ArticlePubMedGoogle Scholar
- Fraley C, Raftery A: Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 2002, 97: 611–631. 10.1198/016214502760047131View ArticleGoogle Scholar
- R Development Core Team:R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2008. [http://www.R-project.org]Google Scholar
- Lipshutz RJ, Fodor S, Gingeras T, Lockhart D: High Density Synthetic Oligonucleotide Arrays. Nature Genetics 1999, 21: 20–24. 10.1038/4447View ArticlePubMedGoogle Scholar
- de Haan J, Wehrens R, Bauerschmidt S, Piek E, van Schaik R, Buydens L: Interpretation of ANOVA models for microarray data using PCA. Bioinformatics 2007, 23: 184–190. 10.1093/bioinformatics/btl572View ArticlePubMedGoogle Scholar
- Gabriel K: The biplot graphic display of matrices with application to principal component analysis. Biometrika 1971, 58(3):453–467. 10.1093/biomet/58.3.453View ArticleGoogle Scholar
- Schwarz G: Estimating the dimension of a model. Ann Statist 1978, 6: 461–464. 10.1214/aos/1176344136View ArticleGoogle Scholar
- Binato R, Martinez CA, Robert B, Abdelhay E: SMAD 8 binding to mice Msx1 basal promoter is required for transcriptional activation. Proceedings of the National Academy of Sciences 2006, 393: 141–150.Google Scholar
- Lian J, Sein J, Stein G, Montecino M, van Wijnen A, Javed A, Gutierrez S: Contributions of nuclear architecture and chromatin to vitamin D-dependent transcriptional control of the rat osteocalcin gene. Steroids 2001, 66: 159–170. 10.1016/S0039-128X(00)00160-4View ArticlePubMedGoogle Scholar
- Jeffery I, Madden S, McGettigan P, Perriere G, Culhane A, Higgins D: Integrating transcription factor binding site information with gene expression datasets. Bioinformatics 2007, 23: 298–305. 10.1093/bioinformatics/btl597View ArticlePubMedGoogle Scholar
- Greenacre M: Theory and applications of correspondence analysis. London, Academic Press; 1984.Google Scholar
- Kishino H, Waddel P: Correspondence Analysis of Genes and Tissue Types and Finding Genetic Links from Microarray Data. Genome Informatics 2000, 11: 83–95.PubMedGoogle Scholar
- Alon U, Barkai N, Notterman D, Gish K, Ybarra S, Mack D, Levine A: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences 1999, 96: 6745–6750. 10.1073/pnas.96.12.6745View ArticleGoogle Scholar
- Fellenberg K, Hauser N, Brors B, Neutzner A, Hoheisel J, Vingron M: Correspondence analysis applied to microarray data. Proceedings of the National Academy of Sciences 2001, 98: 10781–10786. 10.1073/pnas.181597298View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.