Mining metastasis related genes by primary-secondary tumor comparisons from large-scale databases
© Kim and Lee; licensee BioMed Central Ltd. 2009
Published: 19 March 2009
Metastasis is the most dangerous step in cancer progression and causes more than 90% of cancer deaths. Although many researchers have studied the biological features and characteristics of metastasis, most of its genetic-level processes remain uncertain. Some studies have succeeded in elucidating metastasis-related genes and pathways and in using them to predict the prognosis of cancer patients, but it remains questionable whether the resulting genes or pathways contain enough information and whether noise features have been controlled appropriately.
We defined four tumor classes that differ in tumor characteristics such as tissue origin, cellular environment, and metastatic ability. We conducted a set of comparisons among the four classes and then searched for genes that are consistently up- or down-regulated across all of the relevant comparisons.
We identified four sets of genes that are consistently differentially expressed in the comparisons, each of which denotes one of four cellular characteristics: liver tissue, colon tissue, liver viability, and metastasis. We found that our candidate genes for tissue specificity are consistent with the TiGER database. We also found that the metastasis candidate genes from our method were more consistent with the known biological background and more independent of other noise features.
We propose a new method for identifying metastasis-related genes from a large-scale database. The proposed method attempts to minimize the influence of factors other than metastatic ability, including tissue originality and tissue viability, by using the results of metastasis-unrelated test combinations to constrain the outcome.
Cancer metastasis is the spread of a tumor from its primary organ to other parts of the body or to non-adjacent organs. Over the last several decades, advances in cancer treatment and surgical technology have greatly increased the survival rate of cancer patients, but the treatment of metastasis still remains beyond current medical capability. Once cancer cells have been disseminated to distant organs through lymph or blood vessels, they always have the potential to re-colonize and form secondary tumors. Furthermore, the newly generated tumors are themselves able to seed further metastases. For these reasons, metastasis is the cause of about 90% of deaths from solid tumors. The biology of metastasis has been studied for more than 100 years since Stephen Paget first proposed the 'seed and soil' hypothesis. During or after the complex genetic changes of tumorigenesis in normal cells, a small portion of tumor cells acquire additional abilities. It is generally accepted that a tumor cell has to pass many obstacles and overcome harsh conditions [4, 5]. For example, the new environment hardly supplies the metastasized tumor cells with the hormones or ligands that are indispensable for cellular growth and proliferation. This means that metastasized tumor cells need to rearrange their genetic contents to live without those signalling proteins. Tumor cells also face physical barriers including basement membranes (BM), extracellular matrices (EM) and vessel walls. Here, cells that have acquired higher motility, the ability to survive detachment, and the ability to change their physical and biological characteristics through epithelial-mesenchymal transition (EMT) gain an advantage over the other tumor cells and move on to the next metastasis barriers.
There are many other chemical and physical barriers in the whole metastasis procedure, including intravasation (getting into a vessel), high fluid pressure in vessels, scattered immune cells, and extravasation (getting out of a vessel). A micrometastasis is a microscopic secondary tumor that results when a set of primary tumor cells succeeds in clearing all of the barriers above. Forming an outgrowing tumor in the secondary site is extremely hard because the whole sequence of events is a series of long odds. Even when a micrometastasis settles in the new site, it usually dies because of the inhospitable environment surrounding the cells or lies dormant due to the lack of suitable growth factors. The metastasized tumor cells in the secondary site have therefore been chosen by selective pressures to have all the abilities required for metastasis [6, 7]. These winner cells are sometimes called 'decathlon champions'.
Many researchers have tried to explain metastatic procedures at the genetic level, either in small-scale experiments or from large-scale expression data. Wang et al. identified a 76-gene signature from expression data of 286 lymph-node-negative breast cancer patients. They used unsupervised clustering to classify good and bad prognosis. Tomlins et al. tried to identify gene sets related to the progression of prostate cancer using the 'molecular concept map'. Their result showed state-related 'molecular concepts' from normal prostate tissues to PIN (prostatic intraepithelial neoplasia), PCA (prostate cancer), and metastasis. Edelman et al. used GSEA analysis with 71 prostate samples consisting of 22 benign, 32 PCA, and 17 metastatic tissues. They proposed several gene sets that change significantly in the steps n → p (normal to prostate cancer) and p → m (prostate cancer to metastasis). In these genetic-level studies, researchers succeeded in clearly representing metastasis-related gene sets or pathways and in validating their results with classification tests.
Returning to the nature of metastasis biology, however, two substantial questions emerge, especially about the sample comparison step. First, do the metastasis samples really have metastasis characteristics? In Wang's work, the samples in the metastasis class are not actually metastasis samples; they are primary tumor samples from patients who later turned out to have a bad prognosis. In other work, the samples used to represent metastasis are usually tumor samples from the very organ where the primary tumor occurred. The only difference is that the patients from whom the metastasis samples were taken had metastatic tumors in other organs. It is seriously doubtful whether a sample taken from part of a primary tumor has metastatic abilities; the cells with metastatic abilities may already have moved out to other organs, leaving behind only the cells without those abilities. Second, have other metastasis-independent features been eliminated in the comparison between two samples from two distinct organs? When a sample from a primary tumor in one organ is compared with a sample from a metastatic tumor in another organ, several elements affect the result of the comparison, such as tissue specificity, the tissue's environmental viability, and the subtype of cancer. The resulting gene set can hardly be expected to represent metastatic characteristics only; large parts of the gene set might have been selected for other reasons.
In this paper, we present how to alleviate the noise effects and the lack of information in metastasis gene finding procedures using multiple, controlled analyses. We used a large-scale expression profile database with rich clinical information, expO (expression project for Oncology). With the clinical information, samples are categorized into several distinct sets. We investigated each set and tagged it with its intrinsic characteristics: metastatic ability, tissue specificity, and organ-dependent viability. Any two sets can be chosen for comparison, and the result conveys different information depending on how the characteristics of the selected sets differ.
We describe the data sets and scoring functions in this section. First, the expression database and preprocess procedures are explained. Second, methods for converting the probe level expression values to gene and context scores are described.
Expression profile database
The expO (expression project for Oncology) database is an archive of tumor samples with detailed clinical information. The IGC (International Genomics Consortium) has established a uniform system for obtaining and processing tissue samples for molecular characterization studies. As of 2008-07-14, the expO database contained 1911 tumor samples, and it provides the expression data through NCBI's public gene expression database (GEO, GSE2109 series). Many clinical attributes are associated with the gene expression data, such as the patient's age, gender, ethnic background, tobacco use, alcohol consumption and familial history of cancer. Furthermore, cancer-specific information including pathological and clinical TNM stages, cancer grades, primary tumor sites and relapse information is also available. This standardization of both the microarray platform and the clinical description greatly contributes to the convenience of users.
Thirteen samples were assigned to class A using the query below against the expO relational database.
SELECT * FROM expo WHERE source = 'liver' AND primarysite = 'liver' AND histology NOT LIKE '%metastatic%'
From the query result, the GSM203676 sample was removed because it turned out to be a relapsed cancer. Class A therefore consists of 12 primary liver tumors, collected from the liver, without any relapse or metastasis.
Twenty samples were assigned to class B using the query below.
SELECT * FROM expo WHERE source = 'liver' AND primarysite = 'colon' AND histology LIKE '%metastatic%'
The histology phrase ensures that the primary colon tumor has metastasized to liver. Without the phrase, the secondary liver cancer might be thought to be another independent primary liver tumor after the primary colon tumor occurred.
A total of 188 samples were assigned to class C using the query below.
SELECT * FROM expo WHERE source = 'colon' AND primarysite = 'colon' AND histology NOT LIKE '%metastatic%' AND pathologicalM = '0'
The pathologicalM field denotes the doctor's decision about the tumor's metastatic aspect.
Fourteen highly heterogeneous samples were assigned to class D. We first extracted 16 non-liver metastatic tumors that originated from the colon using the query below.
SELECT * FROM expo WHERE primarysite = 'colon' AND source != 'liver' AND source != 'colon' AND histology LIKE '%metastatic'
From the 16 query results, two samples (GSM102484, GSM137952) were removed due to the adjacency of the metastasized organs (small intestine and peritoneum). The remaining samples' secondary tumor sites included ovary, lung and bladder.
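The class assignment above can be sketched against a small in-memory table. This is only an illustration: the table layout (a single `expo` table with a `gsm` column) and all six rows are hypothetical, and the manual curation steps (removing the relapsed sample and the two samples from adjacent organs) are not modelled. The WHERE clauses are taken from the four queries in the text.

```python
import sqlite3

# Hypothetical miniature of the expO table; every row is invented for illustration.
ROWS = [
    # (gsm, source, primarysite, histology, pathologicalM)
    ("S1", "liver", "liver", "hepatocellular carcinoma", "0"),
    ("S2", "liver", "colon", "metastatic adenocarcinoma", "1"),
    ("S3", "colon", "colon", "adenocarcinoma", "0"),
    ("S4", "colon", "colon", "adenocarcinoma", "1"),
    ("S5", "ovary", "colon", "adenocarcinoma, metastatic", "1"),
    ("S6", "lung", "lung", "squamous cell carcinoma", "0"),
]

# WHERE clauses of the four class queries given in the text.
CLASS_FILTERS = {
    "A": "source = 'liver' AND primarysite = 'liver' "
         "AND histology NOT LIKE '%metastatic%'",
    "B": "source = 'liver' AND primarysite = 'colon' "
         "AND histology LIKE '%metastatic%'",
    "C": "source = 'colon' AND primarysite = 'colon' "
         "AND histology NOT LIKE '%metastatic%' AND pathologicalM = '0'",
    "D": "primarysite = 'colon' AND source != 'liver' "
         "AND source != 'colon' AND histology LIKE '%metastatic'",
}


def assign_classes(rows):
    """Return {class label: [sample ids]} for the toy table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE expo (gsm TEXT, source TEXT, primarysite TEXT, "
        "histology TEXT, pathologicalM TEXT)"
    )
    conn.executemany("INSERT INTO expo VALUES (?, ?, ?, ?, ?)", rows)
    classes = {}
    for label, where in CLASS_FILTERS.items():
        cursor = conn.execute(f"SELECT gsm FROM expo WHERE {where} ORDER BY gsm")
        classes[label] = [row[0] for row in cursor.fetchall()]
    conn.close()
    return classes
```

In this toy table, sample S4 falls into no class because its pathologicalM value is not '0', which is the role that field plays in the class C query.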
Class definition
Characteristics of each tumor class were assessed by four criteria: metastatic ability, colon tissue specificity, liver tissue specificity and viability in the liver environment. For example, class A has no metastatic ability and no colon tissue specificity, but it has liver tissue specificity and viability in the liver environment, whereas class B has metastatic ability, colon tissue specificity (the tumor originated from colon cells) and viability in the liver environment but no liver tissue specificity.
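Comparison combinations of classes and expected characteristic differences. O marks a characteristic expected to differ between the two classes; - marks no expected difference.
Comparison | Tissue specificity | Metastatic ability | Environmental viability
A – B | O (liver VS. colon) | O (only in B) | -
A – C | O (liver VS. colon) | - | O (liver only VS. colon only)
A – D | O (liver VS. colon) | O (only in D) | O (liver only VS. anywhere but liver)
B – C | - | O (only in B) | O (colon and liver VS. colon only)
B – D | - | - | O (colon and liver VS. anywhere but liver)
C – D | - | O (only in D) | O (colon only VS. anywhere but liver)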
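The expected differences listed above follow mechanically from the class definitions, which the sketch below tries to make explicit. The trait names (`metastatic`, `tissue`, `environment`) and the value `"other"` for the class D collection sites are labels chosen here for illustration, not identifiers from the paper.

```python
from itertools import combinations

# Traits per class, transcribed from the class definition in the text.
# "environment" is the organ the sample was collected from; "other" stands
# for the non-liver, non-colon sites of class D (an illustrative label).
TRAITS = {
    "A": {"metastatic": False, "tissue": "liver", "environment": "liver"},
    "B": {"metastatic": True, "tissue": "colon", "environment": "liver"},
    "C": {"metastatic": False, "tissue": "colon", "environment": "colon"},
    "D": {"metastatic": True, "tissue": "colon", "environment": "other"},
}


def expected_differences():
    """Map each class pair to the set of traits on which the two classes differ."""
    diffs = {}
    for x, y in combinations(sorted(TRAITS), 2):
        diffs[(x, y)] = {t for t in TRAITS[x] if TRAITS[x][t] != TRAITS[y][t]}
    return diffs


def pairs_differing_in(trait):
    """List the class pairs whose comparison is expected to reflect `trait`."""
    return [pair for pair, d in expected_differences().items() if trait in d]
```

Under these assumptions, `pairs_differing_in("metastatic")` yields the four comparisons A–B, A–D, B–C and C–D, that is, the comparisons in which exactly one of the two classes is metastatic.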
Scoring and analysis
where μ is the mean, n is the number of samples, and σ is the standard deviation of the expression values of gene i in each class. If the score $s_i^{AB}$ is greater than 0, gene i is up-regulated in class A relative to class B.
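The defining equation of the score is not reproduced in this text, so the exact form cannot be confirmed here. As a hedged illustration only, the sketch below uses a Welch-style statistic, which is one common way to combine the three named quantities (mean, sample count, standard deviation) and which is positive when the gene is higher in the first class. The function name `deg_score` and the choice of statistic are assumptions, not the authors' stated formula.

```python
from math import sqrt
from statistics import mean, stdev


def deg_score(values_a, values_b):
    """Assumed Welch-style score: (mean_A - mean_B) / sqrt(sd_A^2/n_A + sd_B^2/n_B).

    Each class needs at least two samples, and at least one class must have
    non-zero variance; otherwise the denominator is zero.
    """
    n_a, n_b = len(values_a), len(values_b)
    pooled = stdev(values_a) ** 2 / n_a + stdev(values_b) ** 2 / n_b
    return (mean(values_a) - mean(values_b)) / sqrt(pooled)
```

Swapping the two classes flips the sign of this statistic, which matches the antisymmetry the text relies on when it reverses a comparison.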
Gathering class combinations with a consistent context gives an overview of specific characteristics. We named the series of class combinations a context vector. For example, a metastasis context vector of gene i is defined as below.
$$v^{m}_{i} = \left(s_i^{BA},\ s_i^{BC},\ s_i^{DA},\ s_i^{DC}\right)$$
where $s_i^{BA} = -s_i^{AB}$, $s_i^{DA} = -s_i^{AD}$, and $s_i^{DC} = -s_i^{CD}$.
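A context vector can be assembled from the six pairwise scores by looking up each oriented pair and negating the stored score when only the reverse orientation is available. In the sketch below the metastasis vector follows the definition above; the liver and liver viability vectors are inferred from the later discussion of their shared elements, and the colon vector is an assumption made by symmetry with the liver one. All numeric values in the usage example are invented.

```python
# Oriented class pairs per characteristic. Only "metastasis" is given
# explicitly in the text; the other three are inferred or assumed.
CONTEXT_PAIRS = {
    "metastasis": [("B", "A"), ("B", "C"), ("D", "A"), ("D", "C")],
    "liver": [("A", "B"), ("A", "C"), ("A", "D")],
    "liver_viability": [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")],
    "colon": [("B", "A"), ("C", "A"), ("D", "A")],  # assumed by symmetry
}


def signed_score(scores, x, y):
    """Return s^{XY}, using s^{XY} = -s^{YX} when only (Y, X) is stored."""
    if (x, y) in scores:
        return scores[(x, y)]
    return -scores[(y, x)]


def context_vector(scores, characteristic):
    """Build the context vector of one gene for one characteristic."""
    return tuple(signed_score(scores, x, y) for x, y in CONTEXT_PAIRS[characteristic])


# Invented pairwise scores of a single gene, stored in alphabetical orientation.
example_scores = {
    ("A", "B"): -0.5, ("A", "C"): 2.0, ("A", "D"): 1.5,
    ("B", "C"): 3.0, ("B", "D"): 0.5, ("C", "D"): -1.0,
}
```

Storing each comparison once and deriving the reverse orientation on demand keeps the vectors consistent with each other, since every vector reads from the same six numbers.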
Because we cannot simply conclude that each element reflects one specific characteristic, we need to check that the elements of a context vector agree in direction. For example, $s_i^{AB}$ is used in three context vectors (in its $s_i^{BA}$ form in $v^{m}_{i}$ and $v^{l}_{i}$), so a positive $s_i^{AB}$ on its own is compatible with three readings:
Gene i is down-regulated in metastatic tumors
Gene i is down-regulated in colon tissues
Gene i is up-regulated in liver tissues
where $v^{\tau}_{i}$ is the context vector of gene i for the characteristic τ, and $c_{v^{\tau}_{i}}$ is the consistency factor of the vector $v^{\tau}_{i}$.
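The equation that combines the context vector with its consistency factor is not reproduced in this text. The sketch below shows one way such a gate could work, under two loudly stated assumptions: the consistency factor is +1 when all elements are positive, -1 when all are negative and 0 otherwise, and the magnitude is the sum of absolute element values. Only the zeroing of inconsistent vectors is supported by the text (the FGG example in the results); the aggregation rule is hypothetical.

```python
def consistency_factor(vector):
    """Assumed gate: +1 if all elements agree as positive, -1 if all negative, else 0."""
    if all(v > 0 for v in vector):
        return 1
    if all(v < 0 for v in vector):
        return -1
    return 0


def characteristic_score(vector):
    """Hypothetical f(i, tau): consistency factor times the summed magnitudes."""
    return consistency_factor(vector) * sum(abs(v) for v in vector)
```

With a gate of this kind, a single element of the wrong sign is enough to zero the score: a vector such as (-0.0058, 4.0, 5.0), where only the first value is taken from the text and the others are invented, scores 0 even though two of its elements are strongly positive.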
Results and discussion
For each gene, the score for dependency on a specific characteristic, f(i, τ), was calculated for each of the four characteristics τ. For example, the scores of the gene HPN (hepsin, transmembrane protease) with respect to the four characteristics (metastasis, colon tissue, liver tissue, and liver viability) were 0, -100.0, 100.0, and 2179 respectively, which can be interpreted to mean that HPN is not related to metastasis but is closely related to cellular viability in the liver, as Klezovitch et al. have reported. The top genes for each characteristic were identified. We discuss the top genes and their reliability case by case.
Colon and liver tissue
Top 10 genes of four characteristics.
Receptor-interacting serine-threonine kinase 3
Fucosyltransferase 2 (secretor status included)
Transmembrane protease, serine 4
Gap junction protein, beta 3, 31 kDa (connexin 31)
Chromosome 1 open reading frame 106
Usher syndrome 1C (autosomal recessive, severe)
Intraflagellar transport 172 homolog (Chlamydomonas)
Discs, large homolog 3 (neuroendocrine-dlg, Drosophila)
Tubulin, beta 2A
Quaking homolog, KH domain RNA binding
Family with sequence similarity 107, member B
RAB7, member RAS oncogene family-like 1
CAP, adenylate cyclase-associated protein 2
Interleukin 6 signal transducer
Protein tyrosine phosphatase-like B
Cytoplasmic polyadenylation element binding protein
Complement component 4A (Rodgers blood group)
Fibrinogen gamma chain
Serpin peptidase inhibitor, clade A
Fibrinogen alpha chain
Lysosomal-associated membrane protein 2
Fibrinogen beta chain
Phospholipase A2 receptor 1, 180 kDa
Triggering receptor expressed on myeloid cells 2
Heparan sulfate 3-O-sulfotransferase 2
ST6 beta-galactosamide alpha-2,6-sialyltransferase 2
ATPase, class V, type 10A
RAB11 family interacting protein 3
Chromosome 20 open reading frame 12
We concluded that the extracted colon and liver tissue-specific genes do not contain any universal tissue-specific genes. Instead, the result includes genes that are up- or down-regulated relative to the other tissues. This result is nevertheless sufficient to offset any bias caused by the tissue differences.
The same analysis was applied to the liver viability characteristic. The mean score was 6.9 with a standard deviation of 513.4. Surprisingly, 9 of the top 10 genes were registered in TiGER's liver-specific gene database (see Table 2). The top genes, including ALB (albumin, liver viability score 6216, ranked 2nd), FGG (fibrinogen gamma chain, liver viability score 6097, ranked 3rd) and SERPINA3 (serpin peptidase inhibitor, liver viability score 5958, ranked 4th), are well known as liver-specific genes. Although FGG and HP/HPR (Homo sapiens haptoglobin/haptoglobin-related protein) are significantly up-regulated in the tissues from the liver environment, their liver scores were all zero. In the case of FGG, the DEG score from A↔B was -0.0058, making the final score 0. Similarly, the zero score of HP/HPR was due to the negative DEG score in A↔B.
The differences between the two liver-related scores need to be examined. One seeks signs coming from the characteristics of liver tissue, while the other seeks signs coming from the liver environment. Because the liver context vector and the liver viability context vector share two elements, $s_i^{AC}$ and $s_i^{AD}$, we can restrict our attention to the elements $s_i^{AB}$, $s_i^{BC}$, and $s_i^{BD}$. We found that $s_i^{AB}$ hardly catches liver-specific genes. Even though the B samples originated from colon tissue, their expression pattern resembled that of liver tissue. Cancer cells undergo increased genetic and epigenetic mutation, and their genetic instability sometimes helps them by providing variety and perpetuity. During the metastasis procedure, colon cancer cells that have acquired invasion and metastasis abilities possibly also activate liver-specific genes before or after they form a micrometastasis. The tissue-specific gene list of the TiGER database was established using ESTs (Expressed Sequence Tags) from sample tissues. The genes in the list may therefore be identified not by tissue originality but by the activity of core genes that enable a cell to live in the liver. This is well shown by the score $s_i^{BC}$; even though B and C both derive from colon cells, the liver-specific genes are mostly up-regulated in B. Meanwhile, we cannot rule out the possibility that the samples collected from the liver contain surrounding normal liver tissue, which would make the entire result more ambiguous.
B↔C comparison and modified result.
B vs. C
Liver and metastasis dependency
HP /// HPR
Modified with liver viability
Cancer and metastasis
To demonstrate the improvement of the result, we compared two gene lists: one from B↔C alone and one from B↔C post-processed with our method. B↔C (liver metastasis of colon cancer versus primary colon cancer) is a commonly used comparison for extracting metastatic signatures. As shown in Table 3, a result from B↔C contains liver viability characteristics as well as metastasis characteristics. We therefore normalized both the B↔C scores and the liver viability scores and looked for genes whose B↔C scores are high but whose liver viability scores are low. Because the liver viability context vector already contains $s_i^{BC}$, we removed that element from the context vector for this analysis. As we expected, the Pearson correlation of the two scores was 0.469, indicating that a large part of the B↔C comparison result is due to liver viability characteristics. As shown in Table 3, almost half of the top-scored genes from the B↔C comparison alone were liver-specific genes, whereas there was no liver-specific gene in the top 20 gene list of the modified result. All of the original top 10 genes, including COLEC11 (collectin sub-family member 11), FGG (fibrinogen gamma chain), ART4 (ADP-ribosyltransferase 4), ALB (albumin), and HP (haptoglobin), were removed in the modified result. Instead, metastasis candidate genes including AOAH (acyloxyacyl hydrolase), EPO (erythropoietin) and MSR1 (macrophage scavenger receptor 1, involved in inflammation pathways) were newly included. We expect that the remaining genes can be validated in further work.
We propose a new method for identifying metastasis-related genes from a large-scale database. The proposed method attempts to minimize the influence of factors other than metastasis, including tissue originality and tissue viability, by using the results of metastasis-unrelated test combinations to constrain the outcome. We presented tissue-specific and tissue viability-related genes and validated them using the tissue specificity database TiGER. Finally, we presented metastasis candidate genes by calculating the differences between the normalized metastasis and liver viability scores. We would like to extend the experiments to other tissues using the remaining records of the database and to further validate the result by constructing classifiers.
This work was supported by the Korean Systems Biology Program (No. M10309020000-03B5002-00000) and the National Research Lab. Program (No. 2006-01508) from the Ministry of Education, Science and Technology through the Korea Science and Engineering Foundation. The efforts of the International Genomics Consortium (IGC) and expO (expression project for Oncology) are gratefully acknowledged. We would like to thank the CHUNG Moon Soul Center for BioInformation and BioElectronics for providing research facilities.
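The post-processing step described above (normalize both scores, check their correlation, then favor genes with a high B↔C score and a low liver viability score) can be sketched as follows. The text does not state which normalization was used, so z-scoring is an assumption here, as are the function names; the ranking uses the difference of the two normalized scores.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize to zero mean and unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def pearson(xs, ys):
    """Pearson correlation computed as the mean product of z-scores."""
    return sum(a * b for a, b in zip(zscores(xs), zscores(ys))) / len(xs)


def rank_by_difference(genes, bc_scores, viability_scores):
    """Order genes by normalized B-vs-C score minus normalized viability score."""
    z_bc, z_via = zscores(bc_scores), zscores(viability_scores)
    diff = {g: b - v for g, b, v in zip(genes, z_bc, z_via)}
    return sorted(genes, key=lambda g: diff[g], reverse=True)
```

For instance, with invented scores `bc = [10, 9, 1, 2]` and `viability = [10, 1, 1, 2]` for four genes, the second gene ranks first because it is high in B↔C but low in liver viability, while the first gene, high in both, drops to the bottom.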
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 3, 2009: Second International Workshop on Data and Text Mining in Bioinformatics (DTMBio) 2008. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S3.
- Welch HG, Schwartz LM, Woloshin S: Are Increasing 5-Year Survival Rates Evidence of Success Against Cancer?. JAMA. 2000, 283: 2975-2978. 10.1001/jama.283.22.2975.
- Catherine R, Tait DDKH: Do metastases metastasize?. Journal of Pathology. 2004, 203: 515-518. 10.1002/path.1544.
- Paget S: The Distribution of Secondary Growths in Cancer of the Breast. The Lancet. 1889, 133: 571-573. 10.1016/S0140-6736(00)49915-0.
- Mehlen P, Puisieux A: Metastasis: a question of life or death. Nat Rev Cancer. 2006, 6: 449-458. 10.1038/nrc1886.
- Gupta GP, Massague J: Cancer Metastasis: Building a Framework. Cell. 2006, 127: 679-695. 10.1016/j.cell.2006.11.001.
- Fidler IJ: The Pathogenesis of Cancer Metastasis: the 'Seed and Soil' Hypothesis Revisited. Nat Rev Cancer. 2003, 3: 453-458. 10.1038/nrc1098.
- Merlo LMF, Pepper JW, Reid BJ, Maley CC: Cancer as an evolutionary and ecological process. Nat Rev Cancer. 2006, 6: 924-935. 10.1038/nrc2013.
- Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005, 365: 671-679.
- Tomlins SA, Mehra R, Rhodes DR, Cao X, Wang L, Dhanasekaran SM, Kalyana-Sundaram S, Wei JT, Rubin MA, Pienta KJ: Integrative molecular concept modeling of prostate cancer progression. Nat Genet. 2007, 39: 41-51. 10.1038/ng1935.
- Edelman EJ, Guinney J, Chi J-T, Febbo PG, Mukherjee S: Modeling Cancer Progression via Pathway Dependencies. PLoS Computational Biology. 2008, 4: e28. 10.1371/journal.pcbi.0040028.
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.
- Expression Project for Oncology (expO). The International Genomics Consortium, [http://www.intgen.org/expo.cfm]
- Edgar R, Barrett T: NCBI GEO standards and services for microarray data. Nat Biotechnol. 2006, 24: 1471-1472. 10.1038/nbt1206-1471.
- Klezovitch O, Chevillet J, Mirosevich J, Roberts RL, Matusik RJ, Vasioukhin V: Hepsin promotes prostate cancer progression and metastasis. Cancer Cell. 2004, 6: 185-195. 10.1016/j.ccr.2004.07.008.
- Liu X, Yu X, Zack D, Zhu H, Qian J: TiGER: A database for tissue-specific gene expression and regulation. BMC Bioinformatics. 2008, 9: 271. 10.1186/1471-2105-9-271.
- Granata F, Petraroli A, Boilard E, Bezzine S, Bollinger J, Del Vecchio L, Gelb MH, Lambeau G, Marone G, Triggiani M: Activation of Cytokine Production by Secreted Phospholipase A2 in Human Lung Macrophages Expressing the M-Type Receptor. J Immunol. 2005, 174: 464-474.
- Lu H, Ouyang W, Huang C: Inflammation, a Key Event in Cancer Development. Mol Cancer Res. 2006, 4: 221-233. 10.1158/1541-7786.MCR-05-0261.
- DeNardo D, Johansson M, Coussens L: Immune cells as mediators of solid tumor metastasis. Cancer Metastasis Rev. 2008, 27: 11-18. 10.1007/s10555-007-9100-0.
- Bouchon A, Dietrich J, Colonna M: Cutting Edge: Inflammatory Responses Can Be Triggered by TREM-1, a Novel Receptor Expressed on Neutrophils and Monocytes. J Immunol. 2000, 164: 4991-4995.
- Wang B, Pelletier J, Massaad MJ, Herscovics A, Shore GC: The Yeast Split-Ubiquitin Membrane Protein Two-Hybrid Screen Identifies BAP31 as a Regulator of the Turnover of Endoplasmic Reticulum-Associated Protein Tyrosine Phosphatase-Like B. Mol Cell Biol. 2004, 24: 2767-2778. 10.1128/MCB.24.7.2767-2778.2004.
- Nguyen M, Breckenridge DG, Ducret A, Shore GC: Caspase-Resistant BAP31 Inhibits Fas-Mediated Apoptotic Membrane Fragmentation and Release of Cytochrome c from Mitochondria. Mol Cell Biol. 2000, 20: 6731-6740. 10.1128/MCB.20.18.6731-6740.2000.
- Wang W, Eddy R, Condeelis J: The cofilin pathway in breast cancer invasion and metastasis. Nat Rev Cancer. 2007, 7: 429-440. 10.1038/nrc2148.
- Yang S: Gene amplifications at chromosome 7 of the human gastric cancer genome. International journal of molecular medicine. 2007, 20: 225-231.
- Lai SY, Childs EE, Xi S, Coppelli FM, Gooding WE, Wells A, Ferris RL, Grandis JR: Erythropoietin-mediated activation of JAK-STAT signaling contributes to cellular invasion in head and neck squamous cell carcinoma. Oncogene. 2005, 24: 4442-4449. 10.1038/sj.onc.1208635.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.