
Mining metastasis related genes by primary-secondary tumor comparisons from large-scale databases



Metastasis is the most dangerous step in cancer progression and causes more than 90% of cancer deaths. Although many researchers have studied the biological features of metastasis, most of its genetic-level processes remain uncertain. Some studies have succeeded in identifying metastasis-related genes and pathways and in using them to predict patient prognosis, but it is still an open question whether the resulting genes or pathways contain enough information and whether noise features have been controlled appropriately.


We defined four tumor classes that differ in tissue of origin, cellular environment, and metastatic ability. We then ran a set of comparisons among the four classes and searched for genes that are consistently up- or down-regulated across all of the comparisons.


We identified four sets of genes that are consistently differentially expressed in the comparisons, each representing one of four cellular characteristics: liver tissue, colon tissue, liver viability, and metastasis. Our candidate genes for tissue specificity are consistent with the TiGER database. The metastasis candidate genes from our method were also more consistent with the known biological background and more independent of other noise features.


We propose a new method for identifying metastasis-related genes from a large-scale database. The method attempts to minimize the influence of factors other than metastatic ability, including tissue of origin and tissue viability, by constraining the result with test combinations that are unrelated to metastasis.


Cancer metastasis is the spread of a tumor from its primary organ to other parts of the body or to non-adjacent organs. Over the last several decades, advances in cancer treatment and surgical technology have greatly increased the survival rate of cancer patients [1], but the treatment of metastasis still remains beyond medical capability. Once cancer cells have disseminated to distant organs through lymph or blood vessels, they always have the potential to re-colonize and form secondary tumors. Furthermore, the newly formed tumors are themselves able to seed a second metastasis [2]. For these reasons, metastasis is the cause of about 90% of deaths from solid tumors. The biology of metastasis has been studied for more than 100 years, since Stephen Paget first proposed the 'seed and soil' hypothesis [3]. During or after the complex genetic changes of tumorigenesis in normal cells, a small portion of tumor cells acquire additional abilities. It is generally accepted that a tumor cell has to pass many obstacles and overcome harsh conditions [4, 5]. For example, the new environment hardly supplies the metastasized tumor cells with the hormones or ligands that are indispensable for cellular growth and proliferation, which means that metastasized tumor cells need to rearrange their genetic program to live without those signalling proteins. Tumor cells also face physical barriers, including basement membranes (BM), extracellular matrices (EM) and vessel walls. Here, cells that have acquired higher motility, the ability to survive detachment, and the ability to change their physical and biological characteristics through epithelial-mesenchymal transition (EMT) gain an advantage over the other tumor cells and move on to the next metastasis barrier.
There are many other chemical and physical barriers in the overall metastatic process, including intravasation (entering a vessel), high fluid pressure in vessels, scattered immune cells, and extravasation (exiting a vessel). A micrometastasis is a microscopic secondary tumor that results when primary tumor cells succeed in clearing all of the barriers above. Forming an outgrowing tumor at the secondary site is extremely hard because the whole sequence of hurdles is a series of long odds. Even when a micrometastasis settles at the new site, it usually dies in the inhospitable environment surrounding the cells or lies dormant for lack of suitable growth factors. The metastasized tumor cells at the secondary site have therefore been chosen by selective pressure to have all the abilities needed for metastasis [6, 7]. These winner cells are sometimes called 'decathlon champions'.

Many researchers have tried to explain metastatic processes at the genetic level, either in small-scale experiments or from large-scale expression data. Wang et al. identified a 76-gene signature using expression data from 286 lymph-node-negative breast cancer patients [8]. They used unsupervised clustering to classify good and bad prognosis. Tomlins et al. tried to identify gene sets related to prostate cancer progression using the 'molecular concept map' [9]. Their result showed state-related 'molecular concepts' from normal prostate tissue to PIN (prostatic intraepithelial neoplasia), PCA (prostate cancer), and metastasis. Edelman et al. [10] applied GSEA [11] to 71 prostate samples consisting of 22 benign, 32 PCA, and 17 metastatic tissues. They proposed several gene sets that change significantly in the steps n → p (normal to prostate cancer) and p → m (prostate cancer to metastasis). In these genetic-level studies, researchers succeeded in clearly presenting metastasis-related gene sets or pathways and in validating their results with classification tests.

Returning to the nature of metastasis biology, however, two substantial questions emerge, especially about the sample comparison step. First, do the metastasis samples really have metastatic characteristics? In Wang's work, the samples in the metastasis class are not actually metastasis samples; they are primary tumor samples from patients who later showed bad prognoses. In other work, the samples used to represent metastasis are usually tumor samples from the very organ where the primary tumor occurred; the only difference is that the patients they were taken from had metastatic tumors in other organs. It is seriously doubtful whether a sample of part of a primary tumor has metastatic abilities: the cells with metastatic abilities may already have moved out to other organs, leaving behind only cells without those abilities. Second, have metastasis-independent features been eliminated when two samples from two distinct organs are compared? When a sample from a primary tumor in one organ is compared with a sample from a metastatic tumor in another organ, several factors should affect the result, such as tissue specificity, the tissue's environmental viability, and the cancer subtype. A resulting gene set can hardly be expected to represent metastatic characteristics only; large parts of it might have been selected for other reasons.

In this paper, we present a way to alleviate the noise effects and the lack of information in metastasis gene finding procedures using multiple controlled analyses. We used a large-scale expression profile database with rich clinical information, expO (expression project for Oncology) [12]. Using the clinical information, samples are categorized into several distinct sets. We investigated each set and tagged it with its intrinsic characteristics: metastatic ability, tissue specificity, and organ-dependent viability. Any two sets can be chosen for comparison, and the result carries different information depending on the differences between the selected sets' characteristics.


We describe the data sets and scoring functions in this section. First, the expression database and preprocessing procedures are explained. Second, the methods for converting probe-level expression values to gene and context scores are described.

Data preparation

Expression profile database

The expO (expression project for Oncology) database is an archive of tumor samples with detailed clinical information. The IGC (International Genomics Consortium) has established a uniform system for obtaining and processing tissue samples for molecular characterization studies. As of 2008-07-14, the expO database contained 1911 tumor samples, and it provides the expression data through NCBI's public gene expression database [13] (GEO, GSE2109 series). Many clinical attributes are associated with the gene expression data, such as the patient's age, gender, ethnic background, tobacco use, alcohol consumption and family history of cancer. Cancer-specific information, including pathological and clinical TNM stages, cancer grades, primary tumor sites and relapse information, is also available. This standardization of both the microarray platform and the clinical description is a great convenience to users.

Database construction

Although the expO database provides a lot of useful information with a high level of standardization, its large size and text-based format (SOFT format) make it inconvenient to analyze the data freely. To address these problems, we constructed a relational database from the expO contents (Figure 1, top). First, we extracted the clinical information from the SOFT-formatted flat files using parsers and loaded it into a data table (MySQL 5.0, Red Hat Linux platform). The expression data table of each record was kept in separate storage, split by individual GSM entry. Second, we constructed a web-based database (developed with JSP and JSTL) in which a user can fetch the required data using SQL query statements (Figure 1, bottom). Finally, we built a program that automatically generates the input files used in GSEA analyses (GCT and CLS files).
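The input-file generation step can be illustrated with a minimal sketch. It is not the program described above; it only assumes the publicly documented GCT (version 1.2) and categorical CLS layouts used by GSEA. The function names, sample identifiers and expression values are hypothetical.

```python
import io


def write_gct(out, gene_rows, sample_ids):
    """Write an expression matrix in the GCT v1.2 layout.

    gene_rows: list of (gene_name, description, [one value per sample]).
    """
    out.write("#1.2\n")
    out.write(f"{len(gene_rows)}\t{len(sample_ids)}\n")
    out.write("NAME\tDescription\t" + "\t".join(sample_ids) + "\n")
    for name, description, values in gene_rows:
        if len(values) != len(sample_ids):
            raise ValueError(f"{name}: expected {len(sample_ids)} values")
        row = [name, description] + [str(v) for v in values]
        out.write("\t".join(row) + "\n")


def write_cls(out, labels):
    """Write a categorical CLS file; labels follow the GCT column order."""
    classes = list(dict.fromkeys(labels))  # unique names, first-seen order
    out.write(f"{len(labels)} {len(classes)} 1\n")
    out.write("# " + " ".join(classes) + "\n")
    out.write(" ".join(labels) + "\n")


# Hypothetical two-gene, three-sample data set (two class A, one class B)
gct, cls = io.StringIO(), io.StringIO()
write_gct(
    gct,
    [("GENE1", "na", [1.5, 2.0, 3.25]), ("GENE2", "na", [7.0, 6.5, 0.5])],
    ["S1", "S2", "S3"],
)
write_cls(cls, ["A", "A", "B"])
print(gct.getvalue())
print(cls.getvalue())
```

Writing to file-like objects keeps the sketch independent of any storage layout; real use would pass open file handles instead of `io.StringIO`.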

Figure 1

Relational and online database for expO. The flat-file expO database was parsed and loaded into a MySQL-based relational database. The database schema is shown in the upper figure. Online access to the database (lower figure) is available on

Experimental design

Class definition

Using the relational database and the GSEA input preparation program, we set up four data classes that differ in organ and metastatic ability. The analysis is built around the metastasis of colon cancer to the liver. Colon and liver were selected because they had relatively more samples than the other organs. The four classes are named A (a primary tumor in liver), B (a metastatic tumor in liver from a primary colon tumor), C (a primary tumor in colon) and D (a metastatic tumor from a primary colon tumor in an organ other than liver). The locations of the four tumor classes are depicted in Figure 2. For entry details with clinical information, see Additional file 1.

Figure 2

Tumor class diagram. Each of the four classes is described in this figure. A is a primary tumor that arose in liver. C is a primary tumor that arose in colon. B and D are both metastatic tumors disseminated from primary colon tumors. Primary tumors are denoted by blue circles and metastatic tumors by red circles.

Class A

13 samples were assigned to class A using the query below against the expO relational database.

SELECT * FROM expo WHERE source = 'liver' AND primarysite = 'liver' AND histology NOT LIKE '%metastatic%'

From the query result, the GSM203676 sample was removed because it turned out to be a relapsed cancer. Class A therefore contains 12 primary liver tumors, collected from liver, without any relapse or metastasis.

Class B

20 samples were assigned to class B using the query below.

SELECT * FROM expo WHERE source = 'liver' AND primarysite = 'colon' AND histology LIKE '%metastatic%'

The histology condition ensures that the primary colon tumor has metastasized to liver. Without it, the secondary liver tumor could be mistaken for an independent primary liver tumor that arose after the primary colon tumor.

Class C

188 samples were assigned to class C using the query below.

SELECT * FROM expo WHERE source = 'colon' AND primarysite = 'colon' AND histology NOT LIKE '%metastatic%' AND pathologicalM = '0'

The pathologicalM field records the clinician's assessment of the tumor's metastatic status.

Class D

14 highly heterogeneous samples were assigned to class D. We first extracted 16 non-liver metastatic tumors that originated from colon.

SELECT * FROM expo WHERE primarysite = 'colon' AND source != 'liver' AND source != 'colon' AND histology LIKE '%metastatic'

From the 16 query results, two samples (GSM102484, GSM137952) were removed because the metastasized organs (small intestine and peritoneum) are adjacent to the colon. The secondary tumor sites of the remaining samples included ovary, lung and bladder.
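The four class queries can be tried end to end with a small sketch. It uses an in-memory SQLite table as a stand-in for the MySQL database described above; the `gsm` column, the sample identifiers and all rows are hypothetical, and only the column names used in the queries come from the text. The manual curation steps (removing relapsed or adjacent-organ samples) are not reproduced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expo (gsm TEXT, source TEXT, primarysite TEXT, "
    "histology TEXT, pathologicalM TEXT)"
)
# Hypothetical rows: one per class, plus one that matches no class
rows = [
    ("S1", "liver", "liver", "hepatocellular carcinoma", "0"),
    ("S2", "liver", "colon", "adenocarcinoma, metastatic", "1"),
    ("S3", "colon", "colon", "adenocarcinoma", "0"),
    ("S4", "lung", "colon", "adenocarcinoma, metastatic", "1"),
    ("S5", "colon", "colon", "adenocarcinoma", "1"),  # excluded: M is not 0
]
conn.executemany("INSERT INTO expo VALUES (?, ?, ?, ?, ?)", rows)

# WHERE clauses of the four class queries given in the text
CLASS_FILTERS = {
    "A": "source = 'liver' AND primarysite = 'liver' "
         "AND histology NOT LIKE '%metastatic%'",
    "B": "source = 'liver' AND primarysite = 'colon' "
         "AND histology LIKE '%metastatic%'",
    "C": "source = 'colon' AND primarysite = 'colon' "
         "AND histology NOT LIKE '%metastatic%' AND pathologicalM = '0'",
    "D": "primarysite = 'colon' AND source != 'liver' "
         "AND source != 'colon' AND histology LIKE '%metastatic'",
}

classes = {}
for name, where in CLASS_FILTERS.items():
    cursor = conn.execute(f"SELECT gsm FROM expo WHERE {where} ORDER BY gsm")
    classes[name] = [gsm for (gsm,) in cursor]

print(classes)
```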

The characteristics of each tumor class were assessed on four criteria: metastatic ability, colon tissue specificity, liver tissue specificity and viability in the liver environment. For example, class A has no metastatic ability and no colon tissue specificity, but it has liver tissue specificity and viability in the liver environment, whereas class B has metastatic ability, colon tissue specificity (the tumor originated from colon cells) and viability in the liver environment, but no liver tissue specificity.

Genes that are differentially expressed between two distinct classes represent the characteristic gaps between those classes (see Table 1). When we compare class A with class B, the resulting genes are expected to reflect three kinds of characteristic differences: metastatic ability (from B), colon tissue specificity (from B) and liver tissue specificity (from A). So, if a gene α is up-regulated in class B, we expect that α plays a role in metastasis or in colon-tissue-related activities.

Table 1 Comparison combinations of classes and expected characteristic differences.
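The expected characteristic differences for each class pair can be derived mechanically from the class characteristics. The sketch below is illustrative only: the rows for classes A and B follow the description in the text, while the rows for C and D are our reading of the class definitions and of the context vectors defined in the next section, and should be treated as assumptions.

```python
from itertools import combinations

TRAITS = ("metastatic ability", "colon tissue", "liver tissue", "liver viability")

# 1 = the class carries the trait, 0 = it does not.
# A and B are stated in the text; C and D are inferred (assumption).
CLASS_TRAITS = {
    "A": (0, 0, 1, 1),
    "B": (1, 1, 0, 1),
    "C": (0, 1, 0, 0),
    "D": (1, 1, 0, 0),
}


def trait_differences(x, y):
    """Map each trait that differs between classes x and y to the class that has it."""
    return {
        trait: (x if in_x else y)
        for trait, in_x, in_y in zip(TRAITS, CLASS_TRAITS[x], CLASS_TRAITS[y])
        if in_x != in_y
    }


for x, y in combinations("ABCD", 2):
    print(f"{x} vs {y}: {trait_differences(x, y)}")
```

Under these assumptions, the A versus B comparison mixes three traits, B versus C mixes metastatic ability with liver viability, and C versus D differs in metastatic ability only.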

Scoring and analysis

Differential expression scores were calculated based on a t-test. The score $s^i_{AB}$, the enrichment of gene i in class A compared with class B, is obtained from the equation below.

$$s^i_{AB} = \frac{\mu^i_A - \mu^i_B}{\sqrt{\dfrac{(\sigma^i_A)^2}{n_A} + \dfrac{(\sigma^i_B)^2}{n_B}}}$$

where μ is the mean, n is the number of samples, and σ is the standard deviation. If $s^i_{AB}$ is greater than 0, gene i is up-regulated in class A.
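The score can be computed per gene as in the following sketch. It assumes that σ is the sample standard deviation (n - 1 denominator), which the text does not specify, and the expression values are made up for illustration. The function name `de_score` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def de_score(values_a, values_b):
    """t-like score of one gene: positive when the gene is higher in class A."""
    n_a, n_b = len(values_a), len(values_b)
    # variance() is the sample variance (n - 1 denominator), an assumption here
    standard_error = sqrt(variance(values_a) / n_a + variance(values_b) / n_b)
    return (mean(values_a) - mean(values_b)) / standard_error


# Hypothetical expression values of one gene in two classes
expr_a = [8.1, 7.9, 8.4, 8.0]
expr_b = [6.2, 6.8, 6.5, 6.1]
s_ab = de_score(expr_a, expr_b)
s_ba = de_score(expr_b, expr_a)
print(s_ab, s_ba)
```

Swapping the two classes only flips the sign, so the six unordered comparisons are enough to obtain every ordered score. A gene with zero variance in both classes would cause a division by zero and needs separate handling.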

Gathering class combinations with a consistent context gives an overview of specific characteristics. We named the series of class combinations a context vector. For example, a metastasis context vector of gene i is defined as below.

$$\mathbf{v}^i_m = (s^i_{BA},\ s^i_{BC},\ s^i_{DA},\ s^i_{DC})$$

where $s^i_{BA} = -s^i_{AB}$, $s^i_{DA} = -s^i_{AD}$, and $s^i_{DC} = -s^i_{CD}$.

Each element of the metastasis context vector denotes how strongly gene i is up-regulated in a metastatic tumor class compared with a primary tumor class. The larger the elements, the stronger the dependency on metastasis that the context vector indicates. Likewise, we can define three other context vectors: the colon tissue context vector, the liver tissue context vector and the liver viability context vector.

$$
\begin{aligned}
\mathbf{v}^i_c &= (s^i_{BA},\ s^i_{CA},\ s^i_{DA}) \\
\mathbf{v}^i_l &= (s^i_{AB},\ s^i_{AC},\ s^i_{AD}) \\
\mathbf{v}^i_v &= (s^i_{AC},\ s^i_{AD},\ s^i_{BC},\ s^i_{BD})
\end{aligned}
$$
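The four context vectors defined above can be assembled from the six pairwise scores of a gene, using the sign relation between a pair and its reverse. The following is a minimal sketch; the score values are hypothetical and the helper names are ours.

```python
# Ordered class pairs that make up each context vector (from the definitions above)
CONTEXT_PAIRS = {
    "metastasis": ("BA", "BC", "DA", "DC"),
    "colon tissue": ("BA", "CA", "DA"),
    "liver tissue": ("AB", "AC", "AD"),
    "liver viability": ("AC", "AD", "BC", "BD"),
}


def pair_score(scores, pair):
    """Return s_XY, using s_XY = -s_YX when only the reverse pair was computed."""
    if pair in scores:
        return scores[pair]
    return -scores[pair[::-1]]


def context_vectors(scores):
    """Build all four context vectors of one gene from its pairwise scores."""
    return {
        name: tuple(pair_score(scores, pair) for pair in pairs)
        for name, pairs in CONTEXT_PAIRS.items()
    }


# Hypothetical scores of one gene for the six computed comparisons
gene_scores = {"AB": 2.0, "AC": 3.0, "AD": 1.5, "BC": 0.5, "BD": -0.4, "CD": -1.0}
vectors = context_vectors(gene_scores)
for name, vector in vectors.items():
    print(name, vector)
```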

Because we cannot simply conclude that each element reflects one specific characteristic, we need to check the consistency of the elements' directionality. For example, $s^i_{AB}$ is used in three context vectors (as $s^i_{BA}$ in $\mathbf{v}^i_m$ and $\mathbf{v}^i_c$, and as $s^i_{AB}$ in $\mathbf{v}^i_l$).

A high $s^i_{AB}$ score can be explained by one of the three hypotheses below:

  i. Gene i is down-regulated in metastatic tumors

  ii. Gene i is down-regulated in colon tissues

  iii. Gene i is up-regulated in liver tissues

Now we check whether gene i is up- or down-regulated in the other class combinations. If the $s^i_{AC}$ and $s^i_{AD}$ scores are also high, all of the elements of the liver tissue context vector $\mathbf{v}^i_l$ are positive, indicating that hypothesis iii (gene i is up-regulated in liver tissues) would be correct. We define the consistency factor c as follows.

$$
c_{\mathbf{v}^i} =
\begin{cases}
+1 & \text{(every element of } \mathbf{v}^i > 0\text{)} \\
0 & \text{(} \mathbf{v}^i \text{ has both signs)} \\
-1 & \text{(every element of } \mathbf{v}^i < 0\text{)}
\end{cases}
$$

The final score of gene i's dependency on a specific characteristic τ is,

$$f(i, \tau) = c_{\mathbf{v}^i_\tau} \prod_x \left| v^i_{\tau,x} \right|$$

where $\mathbf{v}^i_\tau$ is the context vector of gene i for the characteristic τ, $v^i_{\tau,x}$ is its x-th element, and $c_{\mathbf{v}^i_\tau}$ is the consistency factor of that vector.
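A minimal sketch of the consistency factor and the final score follows. It assumes that the final score multiplies the consistency factor by the product of the absolute element values, and it treats a vector containing an exact zero as inconsistent, a case the definition does not cover. The re-scaling by powers of 10 mentioned in the results is not included, and all vectors are hypothetical.

```python
from math import prod


def consistency(vector):
    """+1 if all elements are positive, -1 if all are negative, otherwise 0."""
    if all(x > 0 for x in vector):
        return 1
    if all(x < 0 for x in vector):
        return -1
    return 0


def final_score(vector):
    """Consistency factor times the product of absolute element values."""
    return consistency(vector) * prod(abs(x) for x in vector)


print(final_score((2.0, 3.0, 1.5)))       # all positive
print(final_score((-2.0, -3.0, -1.5)))    # all negative
print(final_score((-0.0058, 5.0, 4.0)))   # one slightly negative element
```

The last call shows why a single element with the opposite sign, however small, sets the whole score to zero.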

Results and discussion

We scored 20606 genes using the function described in the previous section. A total of 54675 probe sets from the Affymetrix U133 Plus 2.0 chip were matched to their corresponding genes using the Collapse Dataset tool of the GSEA v2 program. When several probe sets matched one gene, we used the maximum value of the probes. Enrichment scores were calculated for six comparisons (A↔B, A↔C, A↔D, B↔C, B↔D, C↔D), and the differentially expressed genes of each comparison were displayed as a heat map. The heat map of A↔B is shown in Figure 3. The remaining heat maps are shown in Additional file 2.
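The probe-to-gene collapsing step can be sketched as follows. This is not the GSEA tool itself; it assumes that the maximum is taken per sample across the probe sets of a gene, and the probe identifiers, gene names and values are hypothetical.

```python
def collapse_to_genes(probe_values, probe_to_gene):
    """Collapse probe-level rows to gene-level rows, keeping the per-sample maximum."""
    genes = {}
    for probe, values in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probes without a gene annotation are dropped
        if gene in genes:
            genes[gene] = [max(a, b) for a, b in zip(genes[gene], values)]
        else:
            genes[gene] = list(values)
    return genes


# Hypothetical probe-level data for two samples
probe_values = {"p1": [1.0, 5.0], "p2": [3.0, 2.0], "p3": [4.0, 4.0], "p4": [9.0, 9.0]}
probe_to_gene = {"p1": "G1", "p2": "G1", "p3": "G2"}  # p4 has no annotation
gene_values = collapse_to_genes(probe_values, probe_to_gene)
print(gene_values)
```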

Figure 3

Heat map of the A↔B comparison. Differentially expressed genes are shown in a heat map. Sample A (left cluster) is from primary liver tissues, and sample B (right cluster) is from liver metastases of colon cancer. As shown in Table 1, this result contains information on metastatic ability and tissue specificity (liver versus colon tissue). The heat maps of the other comparison combinations are included in Additional file 2.

For each gene, the score for dependency on a specific characteristic, f(i, τ), was calculated for each of the four characteristics τ. For example, the scores of the gene HPN (hepsin, a transmembrane protease) with respect to the four characteristics (metastasis, colon tissue, liver tissue, and liver viability) were 0, -100.0, 100.0, and 2179 respectively. This can be interpreted as the HPN gene not being related to metastasis but being closely related to cellular viability in liver, as Klezovitch et al. have reported [14]. The top genes for each characteristic were identified. We discuss the top genes and their reliability case by case.

Colon and liver tissue

As described in the Experimental design section, the colon context vector is exactly the sign-reversed liver context vector, because every tissue used for the sample data was exclusively either colon tissue or liver tissue. Of the 20606 genes, 6691 scored zero for these characteristics (their context vectors contain elements of both signs). RIPK3 (receptor-interacting serine-threonine kinase 3) scored highest at 1674.1, and KLHL5 (kelch-like 5) scored lowest at -1610.9. Scores were re-scaled for readability (multiplied by a power of 10 that depends on the size of the context vector). The mean score was 11.27 with a standard deviation of 85.4. The top 10 genes for all characteristics are shown in Table 2. We used the TiGER database (Tissue-specific Gene Expression and Regulation) [15] to validate the tissue specificity of the resulting genes. For colon, the top 10 genes were all highly expressed in colon compared with liver. TMPRSS4 (colon score 1442.7, ranked 4th) and HLXB9 (Homo sapiens homeobox HB9, colon score 1125.7, ranked 8th) are registered in the TiGER colon-specific gene list. For liver, however, no gene in the top 10 liver score list is registered in the database.

Table 2 Top 10 genes of four characteristics.

We concluded that the extracted colon/liver tissue-specific genes do not include any universally tissue-specific genes. Instead, the result includes genes that are up- or down-regulated relative to the other tissue. This result is nevertheless sufficient to offset any bias caused by the tissue differences.

Liver viability

The same analysis was applied to the liver viability characteristic. The mean score was 6.9 with a standard deviation of 513.4. Surprisingly, 9 of the top 10 genes are registered in TiGER's liver-specific gene database (see Table 2). The top genes, including ALB (albumin, liver viability score 6216, ranked 2nd), FGG (fibrinogen gamma chain, liver viability score 6097, ranked 3rd) and SERPINA3 (serpin peptidase inhibitor, liver viability score 5958, ranked 4th), are well known as liver-specific genes. Although FGG and HP/HPR (Homo sapiens haptoglobin/haptoglobin-related protein) are significantly up-regulated in the tissues from the liver environment, their liver tissue scores were all zero. In the case of FGG, the DEG score from A↔B was -0.0058, which made the final score 0. Similarly, the zero score of HP/HPR was due to a negative DEG score in A↔B.

The difference between the two liver-related scores needs to be examined. One looks for signals coming from the characteristics of liver tissue, while the other looks for signals from the liver environment. Because the liver tissue context vector and the liver viability context vector share two elements, $s^i_{AC}$ and $s^i_{AD}$, we need to pay attention only to the $s^i_{AB}$, $s^i_{BC}$, and $s^i_{BD}$ elements. We found that $s^i_{AB}$ hardly catches liver-specific genes. Even though the B samples originated from colon tissue, their expression pattern resembled that of liver tissue. Cancer cells undergo increased genetic and epigenetic changes, and their genetic instability sometimes helps them by providing variety and persistence. During metastasis, colon cancer cells that have acquired invasive and metastatic abilities may activate liver-specific genes before or after they form a micrometastasis. The tissue-specific gene list of the TiGER database was established using ESTs (expressed sequence tags) from sample tissues. The genes in the list may therefore be identified not by tissue origin but by the activity of core genes that enable a cell to live in the liver. This is well shown by the score $s^i_{BC}$: even though B and C both derive from colon cells, the liver-specific genes are mostly up-regulated in B. Meanwhile, we cannot rule out the possibility that the samples collected from liver contain surrounding normal liver tissue, which would make the results more ambiguous.


The result shows the top 10 up- and down-regulated genes in metastasis samples (Table 3; down-regulated genes are not shown). The mean score was 5.45 with a standard deviation of 46.87. Unfortunately, no biological process is known for the top-scored gene MGC16121 (hypothetical gene, metastasis score 2032, ranked 1st). PLA2R1 (phospholipase A2 receptor 1, metastasis score 1405, ranked 2nd) is known to act as a receptor for the phospholipase sPLA2-IB and also to bind snake PA2-like toxins. Binding of sPLA2-IB induces various effects [16], including activation of the MAPK cascade, which induces cell proliferation and inflammatory reactions; these are well-known contributors to metastasis because they increase cellular motility and angiogenesis [17, 18]. TREM2 (triggering receptor expressed on myeloid cells 2, metastasis score 1405, ranked 3rd) is also known to have a role in chronic inflammation and to stimulate the production of constitutive chemokines and cytokines [19]. On the other hand, PTPLB (protein tyrosine phosphatase-like member b, metastasis score -1189, the most down-regulated gene) is significantly down-regulated in all metastasis samples. The main function of PTPLB is not well understood, so a direct relation to metastasis is hard to establish. However, PTPLB is known to interact with BAP31 [20], which is involved in CASP8-mediated apoptosis [21], an important pathway in tumorigenesis and metastasis resistance [22].

Table 3 B↔C comparison and modified result.

To demonstrate the improvement, we compared two gene lists: one from the plain B↔C comparison and one from B↔C post-processed with our method. B↔C (liver metastasis of colon cancer versus primary colon cancer) is a commonly used comparison for extracting metastatic signatures. As shown in Table 3, a result from B↔C contains liver viability characteristics as well as metastasis characteristics. We therefore normalized both the B↔C scores and the liver viability scores and looked for genes whose B↔C scores are high but whose liver viability scores are low. Because the liver viability context vector already contains $s^i_{BC}$, we removed that element from the context vector for this analysis. As expected, the Pearson correlation of the two scores was 0.469, indicating that a large part of the B↔C comparison result is due to liver viability characteristics. As shown in Table 3, almost half of the top-scored genes from the B↔C comparison alone were liver-specific genes, whereas the top 20 gene list of the modified result contained no liver-specific gene. All of the top 10 genes from the plain comparison, including COLEC11 (collectin sub-family member 11), FGG (fibrinogen gamma chain), ART4 (ADP-ribosyltransferase 4), ALB (albumin), and HP (haptoglobin), were removed in the modified result. Instead, metastasis candidate genes including AOAH [23] (acyloxyacyl hydrolase), EPO [24] (erythropoietin) and MSR1 (macrophage scavenger receptor 1, involved in inflammation pathways) were newly included. We expect that the other genes can be validated further.
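The post-processing idea can be sketched as follows. The text does not give the exact normalization or selection rule, so this sketch assumes z-score normalization and ranks genes by the difference between the normalized B↔C score and the normalized liver viability score. Gene names and score values are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def pearson(xs, ys):
    """Pearson correlation as the mean product of population z-scores."""
    return sum(a * b for a, b in zip(zscores(xs), zscores(ys))) / len(xs)


def rank_adjusted(genes, bc_scores, viability_scores):
    """Rank genes by normalized B-vs-C score minus normalized liver viability score."""
    z_bc, z_v = zscores(bc_scores), zscores(viability_scores)
    adjusted = {g: b - v for g, b, v in zip(genes, z_bc, z_v)}
    return sorted(adjusted, key=adjusted.get, reverse=True)


# Hypothetical scores: g1 is high in both lists, g2 is high in B-vs-C only
genes = ["g1", "g2", "g3", "g4"]
bc = [5.0, 4.0, 1.0, 0.0]
viability = [6.0, 0.0, 1.0, 1.0]
print(pearson(bc, viability))
ranking = rank_adjusted(genes, bc, viability)
print(ranking)
```

In this toy input, g1 tops the plain B↔C list but drops after the adjustment because its liver viability score is equally high, while g2 moves to the top.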


We proposed a new method for identifying metastasis-related genes from a large-scale database. The method attempts to minimize the influence of factors other than metastasis, including tissue of origin and tissue viability, by constraining the result with test combinations that are unrelated to metastasis. We presented tissue-specific and tissue-viability-related genes and validated them using the tissue specificity database TiGER. Finally, we presented metastasis candidate genes by calculating the differences between normalized metastasis and liver viability scores. We would like to extend the experiments to other tissues using the remaining records of the database and to validate the result further by constructing classifiers.


  1. Welch HG, Schwartz LM, Woloshin S: Are Increasing 5-Year Survival Rates Evidence of Success Against Cancer?. JAMA. 2000, 283: 2975-2978. 10.1001/jama.283.22.2975.

  2. Catherine R, Tait DDKH: Do metastases metastasize?. Journal of Pathology. 2004, 203: 515-518. 10.1002/path.1544.

  3. Paget S: The Distribution of Secondary Growths in Cancer of the Breast. The Lancet. 1889, 133: 571-573. 10.1016/S0140-6736(00)49915-0.

  4. Mehlen P, Puisieux A: Metastasis: a question of life or death. Nat Rev Cancer. 2006, 6: 449-458. 10.1038/nrc1886.

  5. Gupta GP, Massague J: Cancer Metastasis: Building a Framework. Cell. 2006, 127: 679-695. 10.1016/j.cell.2006.11.001.

  6. Fidler IJ: The Pathogenesis of Cancer Metastasis: the 'Seed and Soil' Hypothesis Revisited. Nat Rev Cancer. 2003, 3: 453-458. 10.1038/nrc1098.

  7. Merlo LMF, Pepper JW, Reid BJ, Maley CC: Cancer as an evolutionary and ecological process. Nat Rev Cancer. 2006, 6: 924-935. 10.1038/nrc2013.

  8. Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005, 365: 671-679.

  9. Tomlins SA, Mehra R, Rhodes DR, Cao X, Wang L, Dhanasekaran SM, Kalyana-Sundaram S, Wei JT, Rubin MA, Pienta KJ: Integrative molecular concept modeling of prostate cancer progression. Nat Genet. 2007, 39: 41-51. 10.1038/ng1935.

  10. Edelman EJ, Guinney J, Chi J-T, Febbo PG, Mukherjee S: Modeling Cancer Progression via Pathway Dependencies. PLoS Computational Biology. 2008, 4: e28-10.1371/journal.pcbi.0040028.

  11. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.

  12. Expression Project for Oncology (expO). The International Genomics Consortium, []

  13. Edgar R, Barrett T: NCBI GEO standards and services for microarray data. Nat Biotechnol. 2006, 24: 1471-1472. 10.1038/nbt1206-1471.

  14. Klezovitch O, Chevillet J, Mirosevich J, Roberts RL, Matusik RJ, Vasioukhin V: Hepsin promotes prostate cancer progression and metastasis. Cancer Cell. 2004, 6: 185-195. 10.1016/j.ccr.2004.07.008.

  15. Liu X, Yu X, Zack D, Zhu H, Qian J: TiGER: A database for tissue-specific gene expression and regulation. BMC Bioinformatics. 2008, 9: 271-10.1186/1471-2105-9-271.

  16. Granata F, Petraroli A, Boilard E, Bezzine S, Bollinger J, Del Vecchio L, Gelb MH, Lambeau G, Marone G, Triggiani M: Activation of Cytokine Production by Secreted Phospholipase A2 in Human Lung Macrophages Expressing the M-Type Receptor. J Immunol. 2005, 174: 464-474.

  17. Lu H, Ouyang W, Huang C: Inflammation, a Key Event in Cancer Development. Mol Cancer Res. 2006, 4: 221-233. 10.1158/1541-7786.MCR-05-0261.

  18. DeNardo D, Johansson M, Coussens L: Immune cells as mediators of solid tumor metastasis. Cancer Metastasis Rev. 2008, 27: 11-18. 10.1007/s10555-007-9100-0.

  19. Bouchon A, Dietrich J, Colonna M: Cutting Edge: Inflammatory Responses Can Be Triggered by TREM-1, a Novel Receptor Expressed on Neutrophils and Monocytes. J Immunol. 2000, 164: 4991-4995.

  20. Wang B, Pelletier J, Massaad MJ, Herscovics A, Shore GC: The Yeast Split-Ubiquitin Membrane Protein Two-Hybrid Screen Identifies BAP31 as a Regulator of the Turnover of Endoplasmic Reticulum-Associated Protein Tyrosine Phosphatase-Like B. Mol Cell Biol. 2004, 24: 2767-2778. 10.1128/MCB.24.7.2767-2778.2004.

  21. Nguyen M, Breckenridge DG, Ducret A, Shore GC: Caspase-Resistant BAP31 Inhibits Fas-Mediated Apoptotic Membrane Fragmentation and Release of Cytochrome c from Mitochondria. Mol Cell Biol. 2000, 20: 6731-6740. 10.1128/MCB.20.18.6731-6740.2000.

  22. Wang W, Eddy R, Condeelis J: The cofilin pathway in breast cancer invasion and metastasis. Nat Rev Cancer. 2007, 7: 429-440. 10.1038/nrc2148.

  23. Yang S: Gene amplifications at chromosome 7 of the human gastric cancer genome. International Journal of Molecular Medicine. 2007, 20: 225-231.

  24. Lai SY, Childs EE, Xi S, Coppelli FM, Gooding WE, Wells A, Ferris RL, Grandis JR: Erythropoietin-mediated activation of JAK-STAT signaling contributes to cellular invasion in head and neck squamous cell carcinoma. Oncogene. 2005, 24: 4442-4449. 10.1038/sj.onc.1208635.

This work was supported by the Korean Systems Biology Program (No. M10309020000-03B5002-00000) and the National Research Lab. Program (No. 2006-01508) from the Ministry of Education, Science and Technology through the Korea Science and Engineering Foundation. The efforts of the International Genomics Consortium (IGC) and expO (Expression Project for Oncology) are gratefully acknowledged. We thank the CHUNG Moon Soul Center for BioInformation and BioElectronics for providing research facilities.

This article has been published as part of BMC Bioinformatics Volume 10 Supplement 3, 2009: Second International Workshop on Data and Text Mining in Bioinformatics (DTMBio) 2008. The full contents of the supplement are available online at

Author information

Corresponding author

Correspondence to Doheon Lee.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SK developed the fundamental idea of the work, performed the experiments, validated the results, and wrote the manuscript. DL evaluated and revised the idea and supervised the preparation of the manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Kim, S., Lee, D. Mining metastasis related genes by primary-secondary tumor comparisons from large-scale databases. BMC Bioinformatics 10 (Suppl 3), S2 (2009).
