A bi-ordering approach to linking gene expression with clinical annotations in gastric cancer
- Fan Shi^{1, 2}Email author,
- Christopher Leckie^{1, 2},
- Geoff MacIntyre^{1, 2},
- Izhak Haviv^{3},
- Alex Boussioutas^{4} and
- Adam Kowalczyk^{1, 2}
https://doi.org/10.1186/1471-2105-11-477
© Shi et al; licensee BioMed Central Ltd. 2010
Received: 19 April 2010
Accepted: 23 September 2010
Published: 23 September 2010
Abstract
Background
In the study of cancer genomics, gene expression microarrays, which measure thousands of genes in a single assay, provide abundant information for the investigation of interesting genes or biological pathways. However, in order to analyze the large number of noisy measurements in microarrays, effective and efficient bioinformatics techniques are needed to identify the associations between genes and relevant phenotypes. Moreover, systematic tests are needed to validate the statistical and biological significance of those discoveries.
Results
In this paper, we develop a robust and efficient method for exploratory analysis of microarray data, which produces a number of different orderings (rankings) of both genes and samples (reflecting correlation among those genes and samples). The core algorithm is closely related to biclustering, and so we first compare its performance with several existing biclustering algorithms on two real datasets - gastric cancer and lymphoma datasets. We then show on the gastric cancer data that the sample orderings generated by our method are highly statistically significant with respect to the histological classification of samples by using the Jonckheere trend test, while the gene modules are biologically significant with respect to biological processes (from the Gene Ontology). In particular, some of the gene modules associated with biclusters are closely linked to gastric cancer tumorigenesis reported in previous literature, while others are potentially novel discoveries.
Conclusion
In conclusion, we have developed an effective and efficient method, Bi-Ordering Analysis, to detect informative patterns in gene expression microarrays by ranking genes and samples. In addition, a number of evaluation metrics were applied to assess both the statistical and biological significance of the resulting bi-orderings. The methodology was validated on gastric cancer and lymphoma datasets.
Keywords
1 Background
A typical aim of exploratory analysis of genomics data is to identify potentially interesting genes and pathways that warrant further investigation. There is a critical need to streamline the analysis in order to support continuing advances in high throughput genomics methods such as gene expression microarrays, which measure thousands of genes in a single assay and are the focus of this paper. However, such assays provide noisy and incomplete measurements, which require sophisticated bioinformatics techniques to identify statistically and biologically significant associations between genes and relevant phenotypes of interest.
Unsupervised analysis techniques cluster data without using prior information on the labels of samples. This enables the discovery of novel histological subtypes. However, a major limitation of traditional clustering algorithms for this task is that they cluster either genes or samples into non-overlapping groups, based on the similarity of gene expression across all samples for gene clustering, or all genes per sample in sample clustering. This limits the ability to find groups of genes that are "co-correlated" across only a subset of samples, or participate in multiple cellular pathways. A related open issue is how to evaluate the statistical significance of the clusters.
In spite of such limitations, there are examples of remarkable biologically significant discoveries. One such case revisited in this paper is the analysis of gastric cancer data [1]. The original paper used hierarchical clustering of both 7383 genes and 124 gastric cancer samples (malignant and pre-malignant). By inspecting the "heat map" they observed a number of prominent biclusters, which were linked to various aspects of cancer etiology. However, the approach [1] was heavily dependent on manual inspection to identify the groupings. In particular, several sets of co-expressed genes were not grouped together by hierarchical clustering, and needed to be grouped manually by expert analysis. Moreover, it is difficult to assess whether such clusters are robust to any changes, and whether different clustering attempts converge to a stable result. Consequently, there is a need for techniques that can guide such a process of discovering significant and worthwhile hypotheses for follow-up analysis.
Biclustering, also called co-clustering, is a promising technique proposed for the automated discovery of highly correlated subsets of genes across a subset of samples. The concept of "biclustering" was first introduced by [2] and has been the subject of several surveys [3, 4]. Many techniques have been used for finding biclusters with different objective functions, such as "SAMBA" using graphic models [5], biclustering by Gibbs sampling [6], the Order-Preserving Submatrix algorithm (OPSM [7]), biclustering using maximum-similarity between genes [8], the Iterative Signature Algorithm (ISA [9]), and biclustering using linear geometry [10]. Recently, several studies have applied biclustering to more specific bioinformatics areas, such as local multiple sequence alignment of RNA [11] and e-CCC-Biclustering for gene expression time-series data [12]. Several of these representative biclustering methods will be used as a basis for comparison in this paper.
This paper proposes a technique for exploratory biclustering analysis, which combines biclustering with an evaluation of the statistical significance and biological relevance of such biclusters. There are four main contributions that we make in this paper. First, we introduce a novel algorithm, called bi-ordering, which is in some respects a member of the family of biclustering techniques. This algorithm is benchmarked against several relevant biclustering algorithms in the literature [2, 5, 6, 9]. Second, we extend an existing statistic based on the hyper-geometric distribution to a generalized statistic for evaluating the saturation of phenotypes in biclusters, called the Multiple-Class-Saturation (MCS) metric. In addition, we apply the Jonckheere trend test [13] to evaluate the significance of the correlation between ordered samples and clinical annotations. Third, we assess the stability of the observed results by assessing the size of their "basin of attraction" as follows. In our experiments, random initializations of the algorithm yield many unique biclusters, which are then grouped into a manageable number of families of very similar outcomes (called a "super-bicluster") by a secondary clustering of the biclusters. The size of these super-biclusters provides a measure of bicluster "stability". We find that our technique is able to find a small set of highly stable super-biclusters, which correspond to distinct histopathological types in an existing gastric cancer dataset [1]. We have also applied our approach to analyze a lymphoma dataset [14]. Fourth, we demonstrate that the discovered super-biclusters have associated Gene Ontology (GO) terms with very significant p-values, which can provide a basis for the biological interpretation of the gene modules.
In Section 2, we introduce our core algorithm BOA, as well as the statistics for its evaluation. In Section 3, we validate the BOA algorithm on the gastric cancer and lymphoma datasets and compare the results to other algorithms. The biological interpretations of the results from the gastric cancer datasets are discussed in Section 4, 5 and the Additional file 1. Finally, we conclude the paper in Section 6.
2 Methods
In this section, we introduce a protocol for identifying and characterizing modules of genes and orderings of samples that exhibit high statistical, biological and clinical significance. Our protocol, named Bi-ordering Exploratory Analysis (BEA), comprises six main stages as described below.
Bi-ordering Exploratory Analysis
- 1.
Generate biclusters with ordered genes and samples using BOA algorithm outlined below.
- 2.
Merge similar biclusters into "super-biclusters" to identify robust modules of co-expressed genes.
- 3.
Annotate biclusters with histological and biological attributes to support their interpretation.
- 4.
Generate figures of merit (i.e., p-values) for each bicluster:
- (a)
Over-representation of histological categories;
- (b)
Concordance of sample orderings with various phenotype gradients;
- (c)
Gene Ontology annotations.
- 5.
Evaluate biological significance of results.
Now we briefly elaborate on selected key stages of the above protocol.
2.1 Bi-ordering Analysis
In this section, we describe Bi-ordering Analysis and its details.
2.1.1 Algorithm
Algorithm 1 Bi-Ordering Analysis (BOA)
- 1.
Normalize data: first, for each gene across all samples, normalize to median = 0 and std = 1, then repeat this for each sample across all genes.
- 2.
Initialization: Select a subset of samples ∅ ≠ S ⊂ {1, ⋯, n_{ S } }
- 3.
Repeat the Steps a-d below until convergence (i.e., G and S stabilise):
- (a)
Update gene scores f(g) ← ⟨x_{ gs } ⟩_{s∈S}for g = 1, ⋯, n_{ G } ,
- (b)
Select genes: $\text{G}\leftarrow \{g;f(g)-{\u3008f(g)\u3009}_{g=1,\cdots ,{n}_{G}}>{\theta}_{G}/\sqrt{\left|S\right|}\}$,
- (c)
Update sample scores h(s) ← ⟨x_{ gs } ⟩_{g∈G}for S = 1, ⋯, n_{ S } ,
- (d)
Select samples: $S\leftarrow \{s;h(s)-{\u3008h(s)\u3009}_{s=1,\cdots ,{n}_{S}}>{\theta}_{S}/\sqrt{\left|G\right|}\}$,
Output: The gene scores f(g), sample scores h(s), and a bicluster B = (G, S), which is determined by applying the cut-off thresholds θ_{ G } and θ_{ S } to the scores f(g) and h(s), respectively.
According to the iterative process of step 3 in the BOA algorithm, we aim to select a submatrix with coherently high gene expression levels across selected samples. In particular, we aim to detect differentially expressed genes between phenotypes. Consequently, it is important to normalize the data in step 1, otherwise, the genes with high expression levels across most samples will dominate the selection procedure. The above algorithm is similar to ISA [9], which is discussed in more detail in Section 3.2.
2.1.2 Variants of BOA
- 1.
Under-expressed biclusters - This variant finds significantly down-regulated genes by selecting genes and samples with f(g) and h(s) less than a threshold in steps 3.b and 3.d, respectively.
- 2.
Fixed-size biclusters - We could also select G and S of fixed size rather than using the cut-off thresholds in each iteration. The attraction is that the algorithm is guaranteed to converge, which in practice may not always hold for the previous two options. The formal proof of the convergence for this variant is included in the Additional file 1.
2.1.3 Ordering scores
An important ability of the BOA algorithm is to assign an ordering score h(s) to samples and f(g) to genes in a bicluster. The sample score h(s) orders all samples according to the average expression level across a subset of genes in the current bicluster, while the gene score f(g) orders all genes according to a subset of samples. The sample score h(s) relates a bicluster to the clinical annotations of samples, especially for multi-class attributes, such as the cancer progression stage, or continuous attributes, such as the survival time. These are discussed in detail in Section 2.3.
2.2 Super-biclustering (SBC)
The similarity measure could also be defined on samples, though here we focus on genes, which are the dominant and far more complex dimension to handle in gene expression microarray datasets. Based on the similarity measure in Equation 1, a hierarchy of unique biclusters is constructed. We then use a similarity threshold of 0.5 to extract a few top groups of biclusters, which are super-biclusters, from this hierarchy. For convenience of evaluation, a typical bicluster is selected as a prototype in each resulting SBC. Without loss of generality, we choose the bicluster that covers the most genes in its SBC as the prototype, and so any further numerical evaluation is based on these prototypes when we refer to a SBC.
2.3 Figures of merit
In order to evaluate the statistical and biological significance of the resulting biclusters, we employ three statistical methods and their corresponding p-values.
2.3.1 Saturation metric of samples
The homogeneity of phenotypes of samples in a bicluster can be evaluated if we are given a prior classification of each sample (e.g., its cancer subtype) as the label. Ideally, each bicluster should be enriched with samples in one or a few more similar classes, e.g., normal or tumor samples. For the purpose of quantification, we use the p-value of the hypergeometric distribution to evaluate the purity of biclusters according to the phenotypes of samples. Previously [5], a measure of homogeneity using the hypergeometric distribution was applied to the single most abundant class within a bicluster. However, if some genes are co-expressed across multiple classes, calculating p-values on a single class is not an adequate representation of accuracy. To address this limitation, we extend this measure to a more generalized form where the significance is calculated for a group of classes to determine the dominant class(es). We refer to the original statistic used in [5] and our generalized statistic as Single-Class Saturation (SCS) and Multiple-Class Saturation (MCS) metrics, respectively. The calculation of MCS p-values based on the hypergeometric distribution is given in Equation 2 below.
are the numbers of samples in the dataset and in the bicluster B annotated with any class in $\left\{{C}_{{i}_{1}},\cdots ,{C}_{{i}_{r}}\right\}$, respectively. This quantity calculates the probability of observing k or more samples classified in $\left\{{C}_{{i}_{1}},\cdots ,{C}_{{i}_{r}}\right\}$ in a bicluster B.
In our evaluation, we generate the full set of combinations of all sample classes from {C_{1}, ⋯, C_{ q } } and compute p_{ MCS } for each bicluster and each combination, so that we could discover any potential associations between gene sets and a group of phenotypes. Finally, we select the subset of classes that corresponds to the most significant p-value for each bicluster in the evaluation in Section 3. Note that the SCS is a special case of the MCS. We compute a p-value with respect to each individual class, and then select the single class that corresponds to the best p-value for each bicluster.
2.3.2 Jonckheere's trend test
for this purpose.
from which the p-values can be estimated.
2.3.3 Gene Ontology Annotations
Given that each gene's expression in a bicluster is highly similar with respect to other genes in the bicluster, it is expected that the collection of genes as a whole are likely to be involved in some related biological processes. In order to determine this, the structured vocabulary of the Gene Ontology (GO) [15] is used to help uncover the biological processes represented by each of the biclusters. As each gene can be annotated with one or more terms within the GO, we can determine which GO terms are statistically over-represented within a group of genes. We use an existing tool GOstat [16] to determine the statistically over-represented terms within each bicluster for the biological process branch of the GO.
2.4 Efficiency
One of the advantages of the BOA algorithm is its efficiency. The time complexity in each iteration is $\mathcal{O}$(n_{ G } + n_{ S } ), since only averaging operations for computing the gene score f(g) and sample score h(s) are required. Practically, the number of iterations for generating a single bicluster is usually no more than 10, and the number of initializations is 1000 in our experiments.
3 Results
In this section, we analyze the performance of our algorithm on a real gene expression dataset, namely the gastric cancer dataset in [1]. The main reason for this choice is the availability of local expertise in the biology of this disease. We compare the performance of our algorithm in terms of SCS and MCS in Section 2.3 to the results obtained from the algorithms in [2, 5, 6, 9] by using the parameter settings recommended in those papers, including the normalization method specified in each algorithm, or by observing the best results obtained under different parameter settings. The evaluation using Jonckheere's test, the Gene Ontology and the biological relevance of the results for gastric cancer are discussed in detail in Section 4. In addition, we also apply BOA to another lymphoma dataset for validation [14].
3.1 Results of BOA on Gastric Cancer dataset
After applying gene filtering as described in [1], we have n_{ G } = 7383 gene expressions evaluated for n_{ S } = 124 human tissue samples. Excluding two singletons, there are six different phenotypes in the data, of which three are subtypes of gastric cancer: 35 diffuse (DGC), 22 intestinal (IGC), 7 mixed (MGC); and the other three phenotypes are pre-malignant conditions: 26 chronic gastritis (CG), 22 intestinal metaplasia (IM) and 10 normal, e.g., non-inflamed mucosa tissue removed during surgery for the gastric cancer. Now we briefly discuss the algorithmic aspects and setup of the experiment.
First, we generated a set of 1000 initializations, which were 1000 subsets of samples selected by the method described in Section 2.2. The actual number of initializing samples for gastric cancer data ranged from 14 to 41 across 1000 subsets. As described in Section 2.2, each sample is randomly selected with a probability of 0.2 for inclusion in the initial subset of samples. Note that other selection probabilities of 0.4, 0.6 and 0.8 have been tested, but the results were largely insensitive to changes in this parameter.
Note that in the BOA algorithm, there are other alternative normalization methods that can be used, i.e., using mean = 0 instead of median = 0 for centering the genes and samples. Here, we followed the normalization method used in [1] for the sake of a fair comparison with their manual analysis. In addition, we have found that there is very little numerical difference between normalizing by median = 0 and normalizing by mean = 0 on the dataset we have studied.
Second, we applied BOA to the gastric cancer data using 11 different pairs of thresholds: (θ_{G}, θ_{ S } ) = (5.0, 3.0), (4.0, 4.0), (5.0, 4.0), (5.5, 4.0), (6.0, 4.0), (5.0, 4.5), (5.5, 4.5), (5.0, 5.0), (5.5, 5.0), (6.0, 5.0), (5.5, 5.5), which used the same set of initializations.
These threshold settings were limited to this range since they produced biclusters of moderate size. For all biclusters across the 11 pairs, the minimum and maximum number of genes were 21 and 816, respectively. We have also tried several other groups of thresholds on the dataset, but the generated biclusters are not very informative if the thresholds are too large or too small.
Super-biclusters in gastric cancer dataset
#Converging | P-values | Most significant biological process | ||||
---|---|---|---|---|---|---|
SBC | SBC | Prototype | MCS | Malignancy Score | GO | |
SBC1 | 11 | 6 | 9.4 × 10^{-4} | 1.8 × 10^{-13} | 5.1 × 10^{-9} | epidermis development |
SBC2 | 188 | 7 | 1.0 × 10^{-8} | 7.1 × 10^{-7} | lipid metabolic process | |
SBC3 | 2 | 1 | 1.5 × 10^{-6} | 5.5 × 10^{-8} | 3.2 × 10^{-32} | immune system process |
SBC4 | 96 | 2 | 1.8 × 10^{-1} | 2.0 × 10^{-53} | immune system process | |
SBC5 | 15 | 15 | 1.1 × 10^{-18} | 7.7 × 10^{-21} | 1.8 × 10^{-14} | cell cycle process |
SBC6 | 328 | 11 | 3.0 × 10^{-7} | 4.9 × 10^{-8} | 1.8 × 10^{-20} | multicellular organismal process |
SBC7 | 359 | 229 | 4.0 × 10^{-14} | -5.4 × 10^{-22} | 3.2 × 10^{-22} | gen. of precursor metabolic & energy |
SBC8 | 1 | 1 | 3.0 × 10^{-10} | -5.2 × 10^{-8} | 2.2 × 10^{-2} | lipid metabolic process |
However, there are two limitations of calculating SCS or MCS. First, these measures cannot deal with the case of continuous annotations of samples. Second, the significance of SCS and MCS are affected by the choice of cut-off threshold on samples, especially when the sample orderings h(s) change smoothly. Thus, we also used Jonckheere's test to overcome these limitations. We first allocated a "Malignancy Score" y(s) to each sample s following the expert advice: y(s) = 1 for normal, 2 for CG, 3 for IM and finally 4 for any gastric cancer (DGC, IGC or MGC sample). We then tested the significance of the agreement of the samples ordered according to the h(s) score generated by the BOA algorithm with this progression y(s). For the prototype of SBC7, the malignancy scores show an increasing trend from normal (y(s) = 1) to malignant samples (y(s) = 4) along the ascending ordered gene expression levels, which results in a directional p-value of 5.35 × 10^{-22}.
Over-represented GO terms in gastric cancer dataset
ID | P-value | Biological process |
---|---|---|
GO:0006091 | 3.24 × 10^{-22} | generation of precursor metabolites and energy |
GO:0006119 | 3.68 × 10^{-18} | oxidative phosphorylation |
GO:0006118 | 3.48 × 10^{-12} | electron transport |
GO:0042773 | 8.12 × 10^{-12} | oxidative phosphorylation#ATP synthesis coupled electron transport |
GO:0042775 | 8.12 × 10^{-12} | organelle ATP synthesis coupled electron transport |
More biological details of the gene modules and evaluation statistics for different SBCs are discussed in the next section.
3.2 Comparison with other algorithms
Cheng & Church's method was chosen as a comparison because it is similar to our BOA method in two aspects. First, they both aim to identify "constant biclusters" [3]. Second, they all use an iterative process to refine a bicluster. However, the BOA algorithm has the following advantages that differ from Cheng & Church's method. First, in every step of the iterative process, BOA ranks all the genes and samples based on the current subset of samples and genes, so that the converged biclusters are local optima solutions that imply the relevance between genes and samples. The relation between genes and samples is a desirable property in biology, especially when studying the continuous annotations of samples. Second, the super-biclustering is also an important part of our framework, which identifies similar biclusters and then creates representative prototypes of biclusters, so that more meaningful biclusters could be found. In addition, to make a fair comparison, we also applied Cheng & Church's method to the gastric cancer dataset that has been normalized by the procedure described in the BOA method. The resulting biclusters of the two methods were evaluated by the saturation metrics and reported in the Additional file 1.
The BOA algorithm is very similar to ISA. However, the main objective of ISA is discerning "co-regulated" gene modules, while the association with phenotype classes (conditions) is not important, whereas it is of prime interest for our medical application. The main formal differences resulting in the different performance are: (i) ISA starts with an initialisation of a subset of genes; (ii) the two sided test is used for the selection of samples; (iii) samples are weighted, with possibly negative weights, so different conditions, say with up-regulated and down-regulated genes, can be joined in the same bicluster. Consequently, ISA aims at generating "constant column" biclusters while BOA's objective is a "constant" bicluster [3].
Figure 2 shows that BOA generates more significant biclusters in terms of SCS and MCS.
Our evaluation of GO annotations for both ISA [9] and biclustering by Gibbs sampling [6] show that they are capable of generating biclusters of significance comparable to BOA (details of values are not shown). These algorithms generated 6 and 5 SBCs, respectively, with similar gene sets to the SBCs of BOA. For example, the GO annotations "generation of precursor metabolites and energy" and "oxidative phosphorylation" significantly associated with SBC7 of BOA whose p-values are 3 × 10^{-22} and 4 × 10^{-18} (in Table 1) are also found by the ISA algorithm with p-values of 3 × 10^{-8} and 4 × 10^{-6} and Gibbs algorithm with p-values 1 × 10^{-30} and 5 × 10^{-13}. Similarly, the "multicellular organismal process" and "multicellular organismal development" annotations (significant for diffuse-type gastric cancer) in SBC6 of BOA, were also found by the ISA and Gibbs algorithms. However, we have observed that the BOA algorithm usually has better performance than either ISA or Gibbs in terms of Jonckheere's test, in particular, in the case of the evaluation of the "Malignancy Score".
3.3 Validation on lymphoma dataset
4 Biological Analysis of Gastric Cancer
In this section, we focus on validating the biological significance of our findings for the gastric cancer dataset.
4.1 Gene modules compared with previous study
Comparison with previous literature
Region in[1] | SBC1 | SBC2 | SBC3 | SBC4 | SBC5 | SBC6 | SBC7 | SBC8 | ||
---|---|---|---|---|---|---|---|---|---|---|
Symbol | Annotation | No. Genes | 41 | 217 | 194 | 158 | 227 | 409 | 515 | 146 |
B | Mitochondrial | 665 | 0 | 0 | 0 | 0 | 0 | 1 | 416 | 9 |
D1-D3 | Proliferation | 201 | 0 | 0 | 0 | 0 | 76 | 0 | 0 | 0 |
E | Intestinal | 294 | 1 | 81 | 0 | 0 | 0 | 0 | 1 | 44 |
F | Intestinal | 157 | 0 | 112 | 0 | 0 | 7 | 1 | 0 | 27 |
G | Squamous | 37 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
H | Inflamation | 330 | 7 | 0 | 117 | 135 | 9 | 7 | 0 | 30 |
K | Extra cellular matrix | 877 | 3 | 0 | 67 | 0 | 74 | 392 | 1 | 0 |
4.2 Biological relevance for gastric cancer
In Table 3 we then considered the significance of these super-biclusters in terms of the three types of figures of merit discussed in Section 2.3, namely, the SCS and MSC p-values, the p-value of the over-represented GO annotations, and the p-value of the Jonckheere test on the order of the progression of the cancer in the samples. We have discussed the assignment of malignancy scores y(s) and tested the significance of the agreement between y(s) and sample orderings h(s) in Section 3.1. Table 1 shows the numerical results of these statistics. The heat map of SBC7 (Figure 1) shows that the ordering induced by the bicluster has a clear negative correlation with the malignancy score of the samples. The h(s) for SBC5 and SBC7 and to a lesser extent SBC3 are very significantly correlated with y(s). More biological relevance is discussed in the Discussion section.
5 Discussion
Based on the results of our experiments, we now consider the biological significance of our findings. The generated results including the GO and clinical correlation were analysed by expert biologists and clinicians. We quote them to some extent as a proof that the formal data processing protocols as discussed here can lead to the generation of significant biological hypotheses warranting follow-up wet lab experiments. The BOA algorithm has shed new light on preexisting themes in gastric cancer etiology. The resulting bi-orderings represent successive steps in cancer progression and distinct histopathological types of the disease. Specifically, SBC1 represents epithelial morphology, typical to squamous samples; SBC2 and SBC8 are typical intestinal lipid metabolism signatures, observed in intestinal metaplasia pre-malignant samples; SBC3 and SBC4 represent a novel split of the inflammatory signature that in [1] were merged as one signature; SBC5 represents the proliferation signature described in [1] for intestinal type gastric cancer; SBC6 reflects the extracellular matrix deposition typical to diffuse type cancer, and elevated in all cancer samples compared to pre-malignant samples; SBC7 represents the metabolic stress observed in chronic gastritis samples, possibly due to elevated H. Pylori infection. There are also other observations which are potentially novel discoveries. They are available in the Additional file 1.
6 Conclusion
In this paper we have presented a novel method of bi-ordering genes and samples from microarray data, together with two statistical techniques for evaluating the significance of the generated groupings and orderings of multiple histological samples. In comparison to several existing algorithms in the literature, our method is able to generate highly robust and statistically significant gene modules with respect to sample histological annotations on a gastric cancer dataset. The results of our analysis closely match reported theories of gastric cancer tumorgenesis, and have helped to identify promising hypotheses for further investigation in cancer research. We also show that other biclustering algorithms can also be utilized as a basis of exploratory bi-ordering analysis of genomic data.
Declarations
Acknowledgements
National ICT Australia (NICTA) is funded by the Australian Government's Department of Communications, Information Technology and the Arts and the Australian Council through Backing Australia's Ability and the ICT Center of Excellence program. This paper is an extended version of a previous paper in the 2nd International Workshop on Machine Learning in Systems Biology.
Authors’ Affiliations
References
- Boussioutas A: Distinctive Patterns of Gene Expression in Premalignant Gastric Mucosa and Gastric Cancer. Cancer Research 2003, (63):2569–2577.Google Scholar
- Cheng Y, Church GM: Biclustering of expression data. Proceedings of International Conference on Intelligent Systems for Molecular Biology 2000, 8: 93–103.Google Scholar
- Madeira SC, Oliveira AL: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM transactions on computational biology and bioinformatics/IEEE, ACM 2004, 1: 24–45. 10.1109/TCBB.2004.2View ArticlePubMedGoogle Scholar
- Prelić A, Bleuler S, Zimmermann P, Wille A, Buhlmann P, Gruissem W, Hennig L, Thiele L, Zitzler E: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 2006, 22(9):1122–1129. 10.1093/bioinformatics/btl060View ArticlePubMedGoogle Scholar
- Tanay A, Sharan R, Shamir R: Discovering statistically significant biclusters in gene expression data. Bioinformatics 2002., 18(Suppl 1):Google Scholar
- Sheng Q, Moreau Y, De Moor B: Biclustering microarray data by Gibbs sampling. Bioinformatics 2003., 19(Suppl 2):Google Scholar
- Ben-Dor A, Chor B, Karp R, Yakhini Z: Discovering local structure in gene expression data: the order-preserving submatrix problem. Journal of computational biology a journal of computational molecular cell biology 2003, 10(3–4):373–384.View ArticlePubMedGoogle Scholar
- Liu , Xiaowen , Wang , Lusheng : Computing the maximum similarity bi-clusters of gene expression data. Bioinformatics 2007, 23: 50–56. 10.1093/bioinformatics/btl560View ArticlePubMedGoogle Scholar
- Ihmels J, Bergmann S, Barkai N: Defining transcription modules using large-scale gene expression data. Bioinformatics 2004, 20(13):1993–2003. 10.1093/bioinformatics/bth166View ArticlePubMedGoogle Scholar
- Gan X, Liew AW, Yan H: Discovering biclusters in gene expression data based on high-dimensional linear geometries. BMC Bioinformatics 2008, 9: 209+. 10.1186/1471-2105-9-209View ArticlePubMedPubMed CentralGoogle Scholar
- Wang S, Gutell RR, Miranker DP: Biclustering as a method for RNA local multiple sequence alignment. Bioinformatics 2007, 23(24):3289–3296. 10.1093/bioinformatics/btm485View ArticlePubMedPubMed CentralGoogle Scholar
- Madeira S, Oliveira A: A polynomial time biclustering algorithm for finding approximate expression patterns in gene expression time series. Algorithms for Molecular Biology 2009, 4: 8+. 10.1186/1748-7188-4-8View ArticlePubMedPubMed CentralGoogle Scholar
- Jonckheere A: A test of significance for the relation between m rankings and k ranked categories. The British Journal of Statistical Psychology 1954, 7: 93–100.View ArticleGoogle Scholar
- Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403(6769):503–511. 10.1038/35000501View ArticlePubMedGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology. Nat Genet 2000, 25: 25–29. 10.1038/75556View ArticlePubMedPubMed CentralGoogle Scholar
- Beissbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 2004, 20(9):1464–1465. 10.1093/bioinformatics/bth088View ArticlePubMedGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 1995, 57: 289–300.Google Scholar
- Sharan R, Maron-Katz A, Shamir R: CLICK and EXPANDER: a system for clustering and visualizing gene expression data. Bioinformatics 2003, 19(14):1787–1799. 10.1093/bioinformatics/btg232View ArticlePubMedGoogle Scholar
- Barkow S, Bleuler S, Prelic A, Zimmermann P, Zitzler E: BicAT: a biclustering analysis toolbox. Bioinformatics 2006, 22(10):1282–1283. 10.1093/bioinformatics/btl099View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.