Computational analysis of gene expression space associated with metastatic cancer
© Ptitsyn; licensee BioMed Central Ltd. 2009
Published: 8 October 2009
Prostate carcinoma is among the most common types of cancer affecting hundreds of thousands people every year. Once the metastatic form of prostate carcinoma is documented, the majority of patients die from their tumors as opposed to other causes. The key to successful treatment is in the earliest possible diagnosis, as well as understanding the molecular mechanisms of metastatic progression. A number of recent studies have identified multiple biomarkers for metastatic progression. However, most of the studies consider only direct comparison between metastatic and non-metastatic classes of samples.
We propose an alternative concept of analysis that considers the entire multidimensional space of gene expression and identifies the partition of this space in which metastatic development is possible. To apply this concept in cancer gene expression studies we utilize a modification of high-dimension natural taxonomy algorithm FOREL. Our analysis of microarray data containing primary and metastatic cancer samples has revealed not only differentially expressed genes, but also relations between different groups of primary and metastatic cancer. Metastatic samples tend to occupy a distinct partition of gene expression space. Further pathway analysis suggests that this partition is delineated by a specific pattern of gene expression in cytoskeleton remodeling, cell adhesion and apoptosis/cell survival pathways. We compare our findings with both report of original analysis and recent studies in molecular mechanism of metastasis.
Our analysis indicates a single molecular mechanism of metastasis. The new approach does not contradict previously reported findings, but reveals important details unattainable with traditional methodology.
The utility of microarrays in cancer research has been recognized as this technology became available. Early studies  exploring the benefits of simultaneous estimation of activity for thousands of genes have been followed by hundreds of publications reporting important discoveries in cancer biology. In the past decade, design of both experiment and analysis pipelines have settled into a "classic" template with certain variations reflecting specific goals of the study. Samples are collected from contrast groups, such as normal and cancerous tissue or primary and metastatic tumors and comparison is conducted between such groups. The entire data set is reduced to a smaller, more manageable number of genes (features), informative in comparing pre-defined classes. Statistically significant differential genes, after adjustment for possible false discovery rate (FDR), are selected for further analysis. Additionally, the whole set or a subset of selected genes are subjected to hierarchical cluster analysis to highlight the difference in expression pattern between classes of samples. The goal of cluster analysis in this case is to identify the subset of genes with the highest potential to serve as predictive biomarkers. Some papers follow an alternative approach and start the analysis with unsupervised clustering procedures [2–5]. The results show "molecular classification" of cancer into subtypes not necessarily following traditional histopathology classification. Studies show a difference in survival rate for the patients that belong to different subtypes. However, the nature of these subtypes has proved to be hard to interpret and even harder to bring into clinical practice. At least partially, the utility of molecular sub-typing of cancer samples is hampered by the intrinsic limitations of the analysis.
Classification is typically based on the set of pre-selected "informative" features. In turn, the "informative" property of each feature/gene is estimated by variability in the general population or between pre-defined classes of samples. This pre-selection introduces bias into the overall picture of gene expression. Rarely expressed and marginally differential genes tend to be filtered out before their relation to the biology of cancer could be established on the grounds of signal intensity and variation alone, without any consideration of correlated gene expression and gene interaction. As a result, the final interpretation of results is based on a biased subset of relatively highly expressed genes depleted of early switches and enriched with non-specific downstream effectors.
In earlier publications we proposed an unsupervised classification algorithm that does not require dimensionality reduction. The algorithm was tested on cancer gene expression data  and was essential for understanding of patterns of gene expression associated with progressive insulin resistance in skeletal muscle . On the other hand, we have also proposed a new approach to the selection of biomarkers based on systems biology, allowing inclusion of marker genes less differential on their own, but closely interlinked in the context of biological pathways. A similar approach has also revealed important biomarkers for diagnosis of persistent bovine diarrhea . However, these studies relied on pre-defined sample classification and didn't attempt any cluster analysis. This paper proposes to join high-dimension unsupervised clustering with a systems biology approach in attempt to elucidate the molecular mechanisms of metastasis.
Metastasis is the deadliest development in cancer, responsible for most cancer-related deaths. Metastatic cells acquire mobility and spread through the blood and lymphatic system to form colonies in other organs. There are multiple theories about the origin of metastasis [9–13], but our own previous studies suggest that primary and metastatic tumors share the same basic metabolism. Differences between primary and metastatic tumor samples compared across multiple tissues of origin are associated with cytoskeleton remodeling, antigen representation, extracellular matrix and some other pathways. In this study, the data is limited to just one set of experiments comparing primary and metastatic samples of prostate cancer. The original publication of this data  presents the state of the art in both functional genomics and bioinformatics of cancer. However, it still leaves open the question we seek to solve in this paper.
Results and discussion
The FOREL algorithm tends to break large uneven continuums into clusters the same way the human eye tends to separate constellations on the night sky (even though they in fact belong to the same Milky Way galaxy). This property is created by consecutive extraction of clusters from the overall population of samples rather than simultaneous partitioning (for example in k-means family of algorithms). On each step, FOREL identifies the best cluster, but clusters extracted on different steps are not necessarily separable from each other and can even overlap in space. This property of the algorithm creates the opportunity to probe the entire space of features (genes) for areas of higher or lower density, concentration or depletion of samples of certain class (metastatic, for example) and then reconstruct relative positions for such areas. Interpretation of cluster juxtaposition allows us to make a few important observations. The congregation of clusters made of non-metastatic samples occupies a large space and the clusters are not clearly separable from each other or forming any geometrical pattern. The observed "cloud" is consistent with a single class with relatively high variability. Three clusters containing metastatic samples are also likely to represent the same continuous class, but opposite to non-metastatic they occupy a compact area, show less variation and demonstrate an elongated pattern. In other words, metastatic tumors are more constrained in the ratios at which different genes can be co-expressed, possibly more tightly regulated and originating from a limited to a single origin and a single path of development. It remains for future studies to determine whether this elongated trend reflects the progression of metastatic transformation and can be used to estimate the age of the metastasis from the event that trigger the transformation. For resolution, these questions need more information than is contained in this data, but the missing data could be acquired in future experiments.
The original data set includes 3 non-cancerous, 23 primary tumor and 9 metastatic tumor samples collected at MSKCC between 1993 and 1999. Samples were snap-frozen in liquid nitrogen and stored at -80°C. Each sample was histologically examined. Cells of interest were manually selected from the frozen block, trimming away non-neoplastic tissues. RNA was extracted by homogenization using RNAeasy kif (Qiagen, Valencia, CA) and evaluated for integrity by denaturing agarose gel. Complementary DNA was synthesized from total RNA using t7-promoter-tagged oligo(dT) primer. Gene expression was estimated by hybridization to Affymetrix U95 microarrays. MAS 4.0 (average difference) algorithm was used to quantify expression values.
where O(i) is a cluster of n i elements. Clusters are extracted from the general population in order of their statistical fitness (see Cluster validation). This perception is fundamentally different from the popular k-means algorithm, which shares certain similarity with FOREL, but in k-means concept the whole data set is a sum of distinct classes rather then a union. FOREL clusters can partially or completely overlap in space or even share the same centroid, but can be separated as long as they differ in other statistical characteristics, for example density. In a nutshell, a white and a yolk of an egg would be separate classes by FOREL, while inseparable by k-means. Other hypersphere-based algorithms such as k-means imply Gaussian distribution of objects (phenotypes) in clusters . FOREL is more flexible and does not require such assumptions. FOREL effectively combines the best features of k-means and hierarchical clustering approaches for the price of increased computation complexity. However, the performance of our C++ implementation is acceptable; up to a few hundreds of microarrays can be clustered on a PC within one hour. The algorithm starts with positioning of a hypersphere with a radius R0 and a center C0 in a certain coordinate, which can be one of the objects or a centroid of pre-defined cluster or any other point of interest in the expression space. Position of the "formal element" element is calculated as a center of mass of all objects, for which the distance ρ i (C i ) ≤ R i . After the mass center of all captured objects is calculated, its center is moved to the new position. If new objects are found within the radius from the new position, they are added to the provisional cluster and the mass center is recalculated. This process is repeated until the no more objects can be added on the current step of the algorithm and the hypersphere stops.
Our version of FOREL consists of alternating steps of cluster isolation and cluster validation. Each completed walks of a hypersphere with R n and a center C m produces a provisional cluster O(R n C m ), which is temporarily stored. We perform an exhaustive coverage of the data trying each element of the original data set as potential starting point. For each starting point we perform series of clustering steps with different hypersphere radius, ranging between minimum R = min(D(C i , C j ))+μ and maximum R = min(D(C i , C j ))- Here D(C i , C j ) is a distance (for example, Euclidean) between any two objects in the data set and μ is a margin, introduced to reduce computation time. The step of R increment is also a parameter. The resulting provisional clusters are fuzzy subsets, each captured by a hypersphere with specific radius as it moves gradually from the starting to the resting point. The validity of the provisional cluster can be verified by a statistical utility measure based on density, variance, sum of inter-cluster distances, etc (see  for review). If the cluster meets the selection criteria, it is removed from the original data set and the process reiterated until no more statistically valid (according to the chosen metric) unclassified objects are left or the best provisional cluster does not satisfy the minimal fitness requirement. Current implementation accommodates a few different metrics for computational cluster validation, but only two metrics have been applied in this study, density-based:
This metric is a reasonable compromise between precision and performance, which has proven to be effective in analysis of microarray data .
Application of FOREL family algorithms has a number of potential pitfalls which have to be considered. Clusters produced by FOREL may overlap in space partially or completely, thus should not be assumed separable by ANOVA. In fact further analysis of relations between FOREL clusters may reveal continuous trend, like in was in case of gene expression patterns in skeletal muscle of diabetic and non-diabetic patients .
The implementation developed by A. Ptitsyn  employs a complete test of every object as a possible cluster seed or hypersphere starting point. By default the current version of the program implements an iterative solution: all possible radii are tested with a certain step within a limited range. The step (or precision) is derived from the analysis of variation of distances within the whole data set. The range is defined by the minimal and maximal distance found within the whole dataset. These are extreme values, with a radius less than a minimal distance, the algorithms can produce only singletons, and on the other hand with radius equal to the maximal distance, all objects are guaranteed to fall into one large cluster. The best radius is one that produces a provisional cluster with the best quality. "Best" is an ambiguous term which can be defined differently in different versions of FOREL. Current implementation has a parameter which allows to choose between quality estimated through density or connectivity of cluster elements (see "Cluster validation"). Cutting percentile margins from the possible radius range can reduce the computation demands of the program. By default 20% of the range is cut from both minimal and maximal radius values. The "brute force" approach to computational cluster validation implemented in current version (see "Cluster validation") provides more reliable results compared to re-sampling, but results in considerably longer execution times. Typical running time for FOREL clustering does not exceed a few minutes on an average Pentium 4 PC. Depending on the validation metric applied and the parameter settings complete clustering of a large data set (up to a few hundred microarrays) data can take up to a few hours. The demand for computational power is significantly mitigated by effective C++ implementation and generally affordable, considering the time required to collect such data.
Current implementation of FOREL clustering algorithm runs on the standard PC under MS Windows (Win32 console application) or Linux. FOREL execution time is practically unaffected by the dimensionality of the data, but can be sensitive to the number of objects (samples) in the set.
For cluster visualization dimensionality is reduced from over 12,000 down to 3 in two consecutive steps. First, the results of FOREL cluster analysis are used to identify centroids of non-singleton clusters. These points are used as anchors to reduce the dimensionality down to the number of identified clusters. Second, principal component analysis is applied to reduced the "centroid space" down to the space of three first principal components. Spotfire Decisionsite (TIBCO Software, Palo Alto, CA) is used for graphical depiction of clusters. The resulting picture reflects juxtaposition of clusters based on geometrical distances between cluster centroids and cluster radius. Individual position of samples inside each cluster is lost in dimensionality reduction, but guaranteed to be inside the space delimited by cluster radius.
Selection of differentially expressed genes
A set of differentially expressed genes was selected using University of Pittsburgh Gene Expression Data Analysis suite (GEDA). The software is available for downloading from http://bioinformatics.upmc.edu/GE2/GEDA.html. For selection, we applied the standard J5 metric with threshold 4 and optional 4 iteration of Jackknife procedure to reduce the number of false-positive differential genes. Both J5 metric and threshold parameter are standard pre-set values recommended by the developers. We did not attempt to estimate the confidence level of individual genes and used J5 not as a statistical test, but as a selection procedure providing a shortlist of genes deviating from the expected average value and enriched with differential genes. This metric has obvious limitation and could not be recommended for a direct substitute of t-test. However, it is also free from assumptions of independence and normal distribution of intensities (or gene abundance estimates) required for t-test. Used for preliminary selection followed by computational validation and pathway analysis J5 produces unbiased and sensible results. The details of tandem application of preliminary J5 inference with selective pathway analysis are discussed in Ptitsyn et al .
Functional annotation and pathway analysis
Analysis of biological pathways was performed using MetaCore software (GeneGo Inc.), Ingenuity Pathways Analysis (IPA, Ingenuity Systems Inc.). Significance of a particular pathway represented in a given list of genes is estimated by Fisher's exact test with adjustments to current database contents. The GeneGo and IPA databases are accessed online and the contents (including Canonic Maps, Molecular Functions, Gene Interactions etc.) are frequently updated by the corps of curators reading research publications and extracting information related to all forms of interaction among genes and chemical compounds. Consequently, the results of pathway analyses performed at different times may differ in details. The maps of canonic signaling and metabolic pathways are fuzzy, but tested for significance independently. As a result, the list of significant pathways often enriched with redundant pathways overlapping by majority of components. In comparison between such lists of pathways Metacore distinguishes "similar" and "identical" pathways as shown in Figure 3. The interpretation of the meaning of statistically significant pathways relies on the knowledge of the biological function underlying the pathway maps and cross-comparison between two independent pathway databases (IPA and Metacore).
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 11, 2009: Proceedings of the Sixth Annual MCBIOS Conference. Transformational Bioinformatics: Delivering Value from Genomes. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S11.
- Golub TR, et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286(5439):531–7. 10.1126/science.286.5439.531View ArticlePubMedGoogle Scholar
- Garber ME, et al.: Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci USA 2001, 98(24):13784–9. 10.1073/pnas.241500798PubMed CentralView ArticlePubMedGoogle Scholar
- Su AI, et al.: Molecular classification of human carcinomas by use of gene expression signatures. Cancer Res 2001, 61(20):7388–93.PubMedGoogle Scholar
- Virtanen C, et al.: Integrated classification of lung tumors and cell lines by expression profiling. Proc Natl Acad Sci USA 2002, 99(19):12357–62. 10.1073/pnas.192240599PubMed CentralView ArticlePubMedGoogle Scholar
- Bhattacharjee A, et al.: Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 2001, 98(24):13790–5. 10.1073/pnas.191502998PubMed CentralView ArticlePubMedGoogle Scholar
- Ptitsyn A: Class discovery analysis of the lung cancer gene expression data. DNA Cell Biol 2004, 23(10):715–21. 10.1089/dna.2004.23.715View ArticlePubMedGoogle Scholar
- Ptitsyn A, et al.: Unsupervised clustering of gene expression data points at hypoxia as possible trigger for metabolic syndrome. BMC Genomics 2006, 7: 318. 10.1186/1471-2164-7-318PubMed CentralView ArticlePubMedGoogle Scholar
- Smirnova NP, et al.: Persistent fetal infection with bovine viral diarrhea virus differentially affects maternal blood cell signal transduction pathways. Physiol Genomics 2009, 36(3):129–39.PubMed CentralView ArticlePubMedGoogle Scholar
- Onn A, Fidler IJ: Metastatic potential of human neoplasms. In Vivo 2002, 16(6):423–9.PubMedGoogle Scholar
- Fidler IJ, Kripke ML: Metastasis results from preexisting variant cells within a malignant tumor. Science 1977, 197(4306):893–5. 10.1126/science.887927View ArticlePubMedGoogle Scholar
- Fidler IJ: The pathogenesis of cancer metastasis: the 'seed and soil' hypothesis revisited. Nat Rev Cancer 2003, 3(6):453–8. 10.1038/nrc1098View ArticlePubMedGoogle Scholar
- Liu R, et al.: The prognostic role of a gene signature from tumorigenic breast-cancer cells. N Engl J Med 2007, 356(3):217–26. 10.1056/NEJMoa063994View ArticlePubMedGoogle Scholar
- Parker B, Sukumar S: Distant metastasis in breast cancer: molecular mechanisms and therapeutic targets. Cancer Biol Ther 2003, 2(1):14–21.View ArticlePubMedGoogle Scholar
- Ptitsyn AA, Weil MM, Thamm DH: Systems biology approach to identification of biomarkers for metastatic progression in cancer. BMC Bioinformatics 2008, 9(Suppl 9):S8. 10.1186/1471-2105-9-S9-S8PubMed CentralView ArticlePubMedGoogle Scholar
- LaTulippe E, et al.: Comprehensive gene expression analysis of prostate cancer reveals distinct transcriptional programs associated with metastatic disease. Cancer Res 2002, 62(15):4499–506.PubMedGoogle Scholar
- Wartburg O: Uber den Stoffwechsel der Carcinomzelle. Klin Wochenschr 1925, 4: 534–536. 10.1007/BF01726151View ArticleGoogle Scholar
- Sreekumar A, et al.: Metabolomic profiles delineate potential role for sarcosine in prostate cancer progression. Nature 2009, 457(7231):910–4. 10.1038/nature07762PubMed CentralView ArticlePubMedGoogle Scholar
- Zagoruiko NG: Methods of pattern recognition and their application. Moskow: Soviet Radio; 1972.Google Scholar
- Zagoruiko N, Elkina V, Lbov G: Algorithms for recognition of empiric patterns. Nauka, Novosibirsk; 1985.Google Scholar
- Aivazyan S, et al.: Applied Statistics: Classification and Reduction of Dimensionality. Moscow: Finansy and Statistika; 1989.Google Scholar
- Grabusts P, Borisov A: Using Grid-clustering Methods in Data Classification. Proceedings of International Conference on Parallel Computing in Electrical Engineering. Warsaw, POLAND 2002.Google Scholar
- Kovalerchuk B, Vityaev E: Discovering Law-like Regularities in Financial Time Series. Journal of Computational Intellligence in Finance 1998, 6(3):12–26.Google Scholar
- Kovalerchuk B, Vityaev E: Data Mining in Finance: Advances in Relational and Hybrid methods. Boston, London, Dordrecht: Kluwer Academic Publishers; 2000.Google Scholar
- Gurevich I, et al.: Information Technology for the Morphological Analysis of the Lymphoid Cell Nuclei. In Lecture Notes in Computer Science. Springer-Verlag GmbH; 2003:541–548.Google Scholar
- Rosa GJM, C.R.P.D.G.: Robust Linear Mixed Models with Normal/Independent Distributions and Bayesian MCMC Implementation. Biometrical Journal 2003, 45(5):573–590. 10.1002/bimj.200390034View ArticleGoogle Scholar
- Handl J, Knowles J, Kell DB: Computational cluster validation in post-genomic data analysis. Bioinformatics 2005, 21(15):3201–12. 10.1093/bioinformatics/bti517View ArticlePubMedGoogle Scholar
- Patel S, Lyons-Weiler J: caGEDA: a web application for the integrated analysis of global gene expression patterns in cancer. Appl Bioinformatics 2004, 3(1):49–62. 10.2165/00822942-200403010-00007View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.