Individualized markers optimize class prediction of microarray data
© Pavlidis and Poirazi; licensee BioMed Central Ltd. 2006
Received: 11 April 2006
Accepted: 14 July 2006
Published: 14 July 2006
Identification of molecular markers for the classification of microarray data is a challenging task. Despite the evident dissimilarity in various characteristics of biological samples belonging to the same category, most marker-selection and classification methods do not consider this variability. In general, feature selection methods aim at identifying a common set of genes whose combined expression profiles can accurately predict the category of all samples. Here, we argue that this simplified approach is often unable to capture the complexity of a disease phenotype, and we propose an alternative method that takes into account the individuality of each patient sample.
Instead of using the same features for the classification of all samples, the proposed technique starts by creating a pool of informative gene-features. For each sample, the method selects a subset of these features whose expression profiles are most likely to accurately predict the sample's category. Different subsets are utilized for different samples and the outcomes are combined in a hierarchical framework for the classification of all samples. Moreover, this approach can innately identify subgroups of samples within a given class which share common feature sets thus highlighting the effect of individuality on gene expression.
In addition to high classification accuracy, the proposed method offers a more individualized approach for the identification of biological markers, which may help in better understanding the molecular background of a disease and emphasize the need for more flexible medical interventions.
discriminate between different disease types
predict the outcome of a disease
detect sub-categories or states of a disease
pin down independent and possibly unknown processes which are involved in the generation or the progression of a disease.
Several marker (or feature) selection methods have been used in gene expression studies utilizing microarray technology. Among these, filter methods, in which the selection is independent of the optimization criteria of the classifier, are most frequently used. Such methods have the advantage of being cost-effective and easy to implement, which makes them very attractive for microarray experiments where the number of features is on the order of thousands. Frequently used filter methods include the two-sample t-test [14–22], Signal-to-Noise, TNoM, ICED and the z-test, to name just a few. Wrapper methods, on the other hand, use criteria similar to those of the classifier in order to select optimal features, thus maximizing classification capacity. Recursive Feature Elimination is an example of a wrapper method used on microarray data. A recent study comparing the performance of filter vs. wrapper methods on microarray data showed that the latter achieve higher performance than the former, but the improvement is accompanied by a considerable cost in computational complexity. While a number of other feature selection methods have been applied to microarray data, only the aforementioned filter methods are discussed here, as they are more relevant to the present work. All of these methods have a number of shortcomings that are particularly important when applied to microarray data. For example, a basic assumption of the t-test and Signal-to-Noise methods is that the data follow a normal distribution, an assumption which is not always valid for microarray experiments. In fact, a recent publication showed that a yeast gene expression dataset is better modeled by an alpha distribution (a = 1.3). The main difference between Signal-to-Noise and the t-test is that the former gives a larger penalty to genes with high expression variance in both classes, as opposed to just one.
However, this kind of expression variability might be important for biological samples, where only a given condition may influence the expression of certain marker genes. The main drawback of these methods is that they both assume a global behavior of a marker gene across all samples of the same class, which is an oversimplified assumption for biological samples.
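To make the two filter scores discussed above concrete, the following minimal sketch computes them for a single gene. The function names and the toy values are illustrative assumptions, and the t statistic is written in its unequal-variance form, which is only one of several variants used in practice. Both scores summarize a gene by one global separation between the classes, which is the behavior criticized above.

```python
from math import sqrt
from statistics import mean, stdev


def signal_to_noise(class0, class1):
    """Signal-to-Noise score: difference of class means over the sum of class SDs."""
    return (mean(class1) - mean(class0)) / (stdev(class0) + stdev(class1))


def t_statistic(class0, class1):
    """Two-sample t statistic, unequal-variance form (an assumed variant)."""
    var0, var1 = stdev(class0) ** 2, stdev(class1) ** 2
    return (mean(class1) - mean(class0)) / sqrt(var0 / len(class0) + var1 / len(class1))


# Toy expression values of one gene in two classes (invented for illustration).
class0 = [1.0, 2.0, 3.0]
class1 = [5.0, 6.0, 7.0]
s2n = signal_to_noise(class0, class1)
t = t_statistic(class0, class1)
```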
protein degradation and stability
immortalization and senescence
chemical and radiation- induced mutagenesis
the stress response (NIH-CE)
Samples belonging to the same class may have considerably different gene expression profiles. This variability is two-fold. First, each sample may be best characterized by a (partially or entirely) different set of marker genes. Second, each sample may be best characterized by a different expression range of a common set of gene features. To address this kind of variability, we construct a pool of genes, hereby termed "informative genes", each of which carries important information with respect to the categorization of some, but not necessarily all, samples. Each informative gene comprises one or more "Consistent Expression Regions" (CERs) that accurately predict the category of certain samples (see Methods).
Variability among samples of the same class may indicate the existence of unknown subgroups that should be treated separately. The proposed method is particularly suitable for this kind of variability, as it has an innate property to identify subgroups based on their characterization by a common subset of genes or gene expression regions.
Results and discussion
Comparison of Proposed Method Performance against Existing Methods on Publicly Available Datasets. As shown in the table, the proposed method achieves a high classification performance on all datasets tested. In particular, the performance is superior to that of the referenced method in 3/6 datasets and matches that of the referenced method in the remaining 3 datasets. The last column shows the ratio of genes selected by both ours and the cross reference method over the total number of gene-features in the cross reference method. Abbreviations: S2N: Signal to Noise, CC: Correlation Coefficient, NA: Neighborhood Analysis, FA: Factor Analysis, 2-tail T: 2-Tail Student test, ER: Expression Ratio, K-NN: K-Nearest Neighbors. *Outlier samples for this dataset were omitted from the classification in both the reference and our method.
Cross Reference Method
4/4, 3/3, 8/8
AML vs. ALL leukemia results
Identification of AML sub-groups: "Failure" vs. "Success" discrimination
Identification of ALL sub-groups: B-cell vs. T-cell sample discrimination
ALL/MLL/AML leukemia results
Breast cancer results
Central Nervous System (CNS) results
The Central Nervous System data set contains 60 biopsy samples taken from patients with various tumor types including medulloblastomas, primitive neuroectodermal tumours (PNETs), atypical teratoid/rhabdoid tumours (AT/RTs) and malignant gliomas. Samples were obtained before the patients received treatment and are accompanied by clinical follow-up. Survivors are patients who are alive after treatment, while failures are those who succumbed to their disease. For a significance percentage p_s = 2%, a total of 443 genes (170, 82, 51 and 140 of 1st, 2nd, 3rd and 4th order, respectively) were selected to discriminate between poor and good treatment outcome. When only first order genes were used in the classifier, the method achieved 44/60 correct classifications. Notice that first order genes roughly resemble the genes selected by Signal-to-Noise or the t-test, as done in the original publication, which explains the similar classification accuracy (47/60). However, using only higher order genes (n > 1) resulted in a significant improvement in discrimination accuracy, with 55/60 correct class assignments. This improvement may be due to the complexity of the particular tumors, which belong to at least four different categories and can only be captured by more complex (higher order) gene features. Figure 6 in [Additional file 1] provides some supporting evidence for this hypothesis. As evident in the figure, the fraction of higher order genes among all selected features is consistently larger in the CNS dataset than in the two other datasets (ALL/AML and Breast Cancer), in which the proportion of 1st order genes is much larger. The statistically significant presence of higher order genes in the CNS data, along with the improved classification capacity achieved with these features, suggests an important discriminatory and perhaps biological role.
A list of six representative higher order genes selected in the CNS dataset is included in Table 4 of [Additional file 1].
Given a set of tissue-specific microarray experiments performed under different conditions, this work presents a new method for identifying genes that can explain, or are affected by, these conditions. Such informative genes are shown not only to accurately discriminate between different disease types or stages, but also to reveal the existence of known or new sub-groupings within a main category and to pinpoint molecular mechanisms that are likely to support these groupings.
The utilization of higher order (multiple-region) genes often results in a significant improvement in discrimination performance. A good example is the classification of Central Nervous System samples into two classes representing poor vs. good treatment outcome, where the utilization of higher order genes alone results in significantly higher accuracy compared to the use of first order genes. Note that first order (single-region) genes are similar to those detected by the Signal-to-Noise, t-test or ICED methods, as they often have a single threshold for classification. It is likely, however, that treatment outcome in these patients depends heavily on genes with a more complex expression pattern that differentially characterizes the heterogeneous group of CNS tumors used in this study. A comparison between the AML/ALL Leukemia, Breast Cancer and Central Nervous System datasets, all of which were generated using the same microarray chips, reveals several interesting differences in the number and order of selected genes. First order genes comprise nearly 70% of the total number of selected genes in both the Breast Cancer and AML/ALL Leukemia datasets, but less than 40% in the Central Nervous System dataset. In contrast, more higher order (3rd and 4th) genes are selected in the Central Nervous System dataset than in the other two, supporting the hypothesis that treatment outcome for CNS tumor patients is characterized by complex gene expression patterns (see Figure 6 in [Additional file 1]). Moreover, a number of higher order genes selected by our method have been associated with CNS tumors and treatment outcome. Interesting examples include the gene encoding the CD70/CD27 ligand, the antiapoptotic gene seladin-1, the gene coding for the interleukin-1 receptor (IL1R1) and the gene coding for the Ser/Thr protein kinase CDK5 (see Table 4 in [Additional file 1]).
CD70 is a member of the Tumor Necrosis Factor family which is highly expressed in human brain tumors and was recently shown to play an immune stimulatory role, preventing tumor growth in vivo, that encourages its application in tumor immunotherapy. The interleukin-1 receptor (IL1R1) is a membrane protein which is variably expressed in different brain tumors and has also been suggested to play a role in brain immunotherapy of astrocytomas. The antiapoptotic gene seladin-1, which is implicated in Alzheimer's disease and cholesterol metabolism, was also found to integrate the cellular response to oncogenic and oxidative stress. This gene was recently found to be downregulated in adrenocortical adenomas and carcinomas, while its differential expression in pituitary adenomas has been suggested to be associated with a different apoptotic response to somatostatin analogs. Cyclin-dependent kinase 5 (CDK5) is a proline-directed protein kinase that is most active in the CNS and has been implicated in certain neurodegenerative diseases. It was recently shown to facilitate the progression of apoptosis by regulating the activity of the tumor suppressor protein p53, the expression of which has been associated with poor prognosis in primary CNS diffuse large B-cell lymphoma. In addition, overexpression of both p53 and bcl-2 proteins has been associated with ominous prognosis in pediatric glioblastoma multiforme tumours. Taken together, these findings suggest that genes with heterogeneous expression detected by our method are not simply the result of technically or biologically irrelevant variation, but can have an important biological role.
In conclusion, this work describes a new method for the identification of informative genes that takes into account inherent genetic variation in disease samples which may be characteristic of certain sub-groups within a disease category. This relatively simple approach, in conjunction with a committee voting classifier allows for improved class prediction as well as identification of interesting disease sub-groups. More importantly, our method allows the detection of marker genes that support these sub-groupings, thus possibly shedding some light on the underlying molecular mechanisms involved in disease related processes and providing a new tool that may facilitate efforts towards individualized medicine.
Identification of informative genes and construction of gene pool
The proposed algorithm uses a training set comprised of labeled samples belonging to two categories (0 or 1) to construct a pool of informative genes that exhibit Consistent Expression Regions (CERs). CERs are defined as the intervals enclosing the expression (sorted in ascending order) of a given gene in a significant number of training samples which belong to the same category. Examples of informative genes and associated CERs are shown in Figure 2. The consistency of a CER is given by the number of these majority samples in the CER, normalized by the size of their respective category. Only genes with at least one CER whose consistency value is greater than a statistically defined threshold, p_s, are used to construct the pool. The order of each informative gene with respect to a category (0 or 1) reflects the number of class-specific CERs it contains (for more details about the estimation of consistency thresholds and gene orders see [Additional file 1]).
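The CER definition above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: a CER is taken to be a maximal run of consecutive same-class samples in the sorted expression profile, and the statistically derived threshold is replaced by a plain number. The names `find_cers` and `min_consistency` and the toy data are hypothetical.

```python
from itertools import groupby


def find_cers(expr, labels, min_consistency):
    """Return the CERs of one gene as (low, high, cls, consistency) tuples.

    A CER is approximated here by a maximal run of same-class samples in the
    ascending expression profile; min_consistency stands in for the
    statistically defined threshold (an assumption of this sketch).
    """
    class_size = {c: labels.count(c) for c in set(labels)}
    ordered = sorted(zip(expr, labels))  # ascending expression
    cers = []
    for cls, run in groupby(ordered, key=lambda pair: pair[1]):
        values = [value for value, _ in run]
        consistency = len(values) / class_size[cls]
        if consistency > min_consistency:
            cers.append((values[0], values[-1], cls, consistency))
    return cers


# Toy gene: class 1 samples sit at both extremes, class 0 in the middle.
expr = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1, 2.0, 2.1, 2.2, 0.5]
labels = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0]
cers = find_cers(expr, labels, min_consistency=0.4)

# Order of the gene with respect to class 1 = number of class-1 CERs.
order_class1 = sum(1 for cer in cers if cer[2] == 1)
```

In this toy profile the gene would be a second order gene for class 1 and a first order gene for class 0.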
The outcome of this step is the identification of category-specific classifiers formed by expression regions in the profiles of selected genes as opposed to a single expression threshold defined by most existing feature selection methods. As a result, a gene exhibiting a class-specific CER can be used to reliably assign a label of the same class to any new sample in which its expression lies within the boundaries of this CER. Note that any informative gene can produce several different regional classifiers, according to its CER assortment.
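A regional classifier of this kind can be sketched as a simple interval lookup. The function name `cer_vote`, the tuple layout and the boundary values are illustrative assumptions; in particular, returning `None` when the expression falls outside every CER is this sketch's way of saying that the gene does not vote for that sample.

```python
def cer_vote(value, cers):
    """Return the class of the CER containing `value`, or None if none does.

    `cers` is a list of (low, high, cls) tuples for one informative gene.
    """
    for low, high, cls in cers:
        if low <= value <= high:
            return cls
    return None


# Hypothetical CERs of one gene: two regions for class 1, one for class 0.
gene_cers = [(0.1, 0.3, 1), (0.5, 1.1, 0), (2.0, 2.2, 1)]

vote_low = cer_vote(0.25, gene_cers)      # inside the first class-1 region
vote_mid = cer_vote(0.8, gene_cers)       # inside the class-0 region
vote_outside = cer_vote(1.5, gene_cers)   # between regions: no vote
```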
The contribution of these thresholded expression regions could be twofold. First, their mapping to a limited sample number of the same class may provide insights about the complexity of a given disease category. For example, CERs of the same category may reflect differences in the order and/or extent that various cancer-associated molecular processes are utilized to induce qualitatively the same phenotype but with a different gene expression pattern. It is thus conceivable that CERs can detect subgroups within a single class. Second, the classification accuracy of these regions, which is overlooked by existing methods focusing on the expression profile of a gene as a whole, can be used to construct a potentially more powerful classifier that takes into account the individuality of different samples.
Class prediction using CERs and hierarchical clustering
To classify unseen samples into their respective categories, the method combines subsets of informative genes to form an aggregate classifier. For a two-class problem, the aggregate classifier consists of two -possibly overlapping- lists of informative genes. Each list consists of the set of informative genes that contain CERs specific to each class. If an informative gene contains at least two CERs, each corresponding to a different class (Figure 2), it serves as a classifier for both classes. For the categorization of each new sample the method proceeds as follows: first, the subsets of genes that are able to predict its category are retrieved from each class-specific list. Their respective CERs are then used to assign a class label to the new sample thus generating two lists of 0 and 1 votes, respectively. At this stage, the votes correspond to the informative genes containing the CER and not the CER itself. The procedure is repeated for all unseen samples and the class assignments for each sample are fed to a modified Manhattan distance to estimate dissimilarity scores between samples. Specifically, the distance between two samples a and b is defined as:
D(a, b) = T - C(a, b) (1)
where T is the total number of informative genes which constitute the aggregate classifier and C(a, b) is the number of genes that give the same vote (0 or 1) for both samples a and b. Alternatively, a similarity score between samples a and b is given by C(a, b). The determination of similarities between samples is graphically illustrated in Figure 3. Finally, the dissimilarity scores between all samples are fed into the publicly available phylogenetic software MEGA2 to build a hierarchical tree. The method's performance is measured as its discrimination capacity on the set of unseen samples.
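The distance of Equation (1) can be illustrated with a small sketch. The vote lists below are invented, and the handling of genes that give no vote for a sample (represented as `None` and never counted as agreement) is an assumption of this sketch, since the text does not spell it out. Only the pairwise dissimilarities are computed; building the tree itself is left to external software, as in the original workflow.

```python
from itertools import combinations


def same_vote_count(votes_a, votes_b):
    """C(a, b): number of genes giving the same 0/1 vote for both samples."""
    return sum(1 for va, vb in zip(votes_a, votes_b)
               if va is not None and va == vb)


def distance(votes_a, votes_b):
    """D(a, b) = T - C(a, b), with T the number of genes in the classifier."""
    total_genes = len(votes_a)
    return total_genes - same_vote_count(votes_a, votes_b)


# One vote per informative gene; None marks a gene that does not vote.
votes = {
    "s1": [0, 0, 1, None],
    "s2": [0, 1, 1, None],
    "s3": [0, 0, 1, 1],
}

# Pairwise dissimilarities between distinct samples.
dissimilarity = {
    (a, b): distance(votes[a], votes[b]) for a, b in combinations(votes, 2)
}
```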
Comparison to Entropy-based methods
Since the proposed method may sound similar to Entropy-based discretization methods frequently used in machine learning problems, we include a comparison between these two approaches.
In an information-based framework, Shannon's entropy can be used for the evaluation of the information content of a given gene. The entropy of a particular k-class input (k possible states) is given by the formula:
$$H = -\sum_{i=1}^{k} p_i \log_2 p_i \qquad (2)$$
where p_i corresponds to the probability of state i. In the analysis presented here k equals 2, since all datasets were broken down into two-class discrimination problems.
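As a quick numerical illustration of Shannon's entropy for the two-class case (k = 2), the sketch below evaluates it for a perfectly mixed and a perfectly pure set of samples. The base-2 logarithm, the function name and the convention that states with zero probability contribute nothing are assumptions of this sketch.

```python
from math import log2


def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution; zero-probability states are skipped."""
    return -sum(p * log2(p) for p in probs if p > 0)


mixed = shannon_entropy([0.5, 0.5])    # both classes equally likely
pure = shannon_entropy([1.0, 0.0])     # only one class present
skewed = shannon_entropy([0.9, 0.1])   # mostly one class
```

A region containing samples of a single class has zero entropy, which is the situation a CER requires by construction.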
Entropy-based methods consider all possible states (classes) of the input data in order to estimate discretization thresholds and identify informative genes. On the contrary, the method presented here detects genes whose expression within a well defined region consistently maps to a single class, without taking into account the remaining classes.
Entropy-based methods usually search for a single expression threshold that minimizes the entropy of a discretized gene. Although multi-interval discretization can be achieved with iterative application of entropy-based minimization methods, such an approach has a high computational cost. The basic idea of multi-interval discretization is to partition a range of real values into a number of disjoint intervals such that the entropy of the intervals is minimal. However, this approach also considers minimization criteria that involve both classes for each interval.
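For comparison, a generic single-threshold entropy discretization can be sketched as below. This is a textbook-style illustration, not the specific procedure of any cited method; the function names and toy profiles are hypothetical. The second toy gene, whose class 1 samples lie at both extremes, shows why a single cut cannot reach zero entropy for a multi-region profile.

```python
from math import log2


def class_entropy(labels):
    """Shannon entropy (bits) of the class labels in one interval."""
    n = len(labels)
    entropy = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        entropy -= p * log2(p)
    return entropy


def best_threshold(expr, labels):
    """Return (weighted entropy, cut) for the best single expression threshold."""
    ordered = sorted(zip(expr, labels))
    n = len(ordered)
    best = (float("inf"), None)
    for i in range(1, n):
        if ordered[i - 1][0] == ordered[i][0]:
            continue  # no cut between identical expression values
        left = [cls for _, cls in ordered[:i]]
        right = [cls for _, cls in ordered[i:]]
        score = (len(left) * class_entropy(left)
                 + len(right) * class_entropy(right)) / n
        if score < best[0]:
            best = (score, (ordered[i - 1][0] + ordered[i][0]) / 2)
    return best


# A gene separable by one threshold, and a gene with class 1 at both extremes.
single_region = best_threshold([1, 2, 3, 7, 8, 9], [0, 0, 0, 1, 1, 1])
multi_region = best_threshold([1, 2, 3, 4, 5, 6], [1, 1, 0, 0, 1, 1])
```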
Finally, the method presented here does not utilize any minimization criterion but searches for statistically significant homogeneous regions (CERs) in which no other state can occur as opposed to entropy-based methods where a few instances of other states are allowed.
Detection of sample sub-groups
Identification of sub-groups within a given disease class can be of major importance as it may pinpoint patient subcategories that respond differentially to a given treatment. To detect such sub-groups, the method utilizes informative genes that inherently separate a main class into two or more clusters by grouping different subsets of samples in different CERs. An example is given by gene VII in Figure 2, whose expression profile separates class 1 samples into two distinct sub-groups. Down-regulation of this gene is characteristic of the first while up-regulation is characteristic of the second sub-group. For the detection of within class sub-groups, the method combines all higher order informative genes (i.e. genes that contain more than one CER corresponding to the same class) or just a selected subset of them.
Using all genes with multiple single-class CERs
In this approach, all informative genes that contain at least two CERs specific to the same class (n > 1) are utilized. For each informative gene, samples that lie within different CERs are assigned to different sub-groups, using a voting scheme similar to that of the class prediction task. However, in contrast to the class prediction task, each CER of the same gene now offers a different vote, which can take a value ranging from 0 to the gene order. The resulting voting lists are then used along with the modified Manhattan distance to construct a dendrogram. This approach is particularly suitable for datasets in which the actual number of class categories is larger than originally suggested, as for example in the classification of ALL/MLL/AML leukemia samples shown in Figure 6.
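The sub-group voting scheme can be sketched as follows. The text only states that votes range from 0 to the gene order; interpreting 0 as "outside every CER" and 1..n as the index of the matching CER is an assumption of this sketch, as are the function name and the toy values.

```python
def subgroup_vote(value, class_cers):
    """Return the 1-based index of the same-class CER containing `value`, else 0.

    `class_cers` lists the (low, high) CERs of one gene for a single class,
    in ascending order of expression.
    """
    for index, (low, high) in enumerate(class_cers, start=1):
        if low <= value <= high:
            return index
    return 0


# Hypothetical second order gene: two class-1 CERs (down- and up-regulated).
gene_cers = [(0.1, 0.3), (2.0, 2.2)]

# Expression of this gene in three class-1 samples.
sample_votes = [subgroup_vote(v, gene_cers) for v in (0.2, 2.1, 1.0)]
```

Samples receiving different non-zero votes from the same gene are candidates for different sub-groups of the class.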
Using a tight set of genes
In the second approach, only genes of the same order are used to identify sample sub-groups (as shown in Figures 4 and 5). Conversely, the method identifies genes that support a pre-existing sub-clustering of samples. These sub-groups, which may be irrelevant to the original classification, are often supported by only a small (tight) set of genes. A tight set consists of same-order genes whose expression in the same, significantly large, set of samples is bounded by exactly one CER per gene. More importantly, their expression in the remaining samples must range over several other CERs. The reasoning behind this second constraint is that a gene which clusters in one region samples that lie in various regions of other genes tends to disrupt the grouping achieved by these genes and should thus be omitted from the set. Notice that this procedure is applied to each class separately, and thus the genes used are all specific to the same class. For a more detailed explanation regarding the construction of tight sets of genes, see [Additional file 1].
Comparison to bi-clustering methods
The approach described above for the identification of sample and gene subgroups is fundamentally different from existing bi-clustering methods. Bi-clustering methods allow for the identification of sets of genes that share compatible expression patterns across subsets of samples. These methods group samples and genes simultaneously. In the usual bi-clustering notation, if A is an expression matrix with X genes and Y conditions, then a_ij represents the expression of gene i under condition j. I ⊂ X and J ⊂ Y denote a subset of genes and conditions, respectively. The pair (I, J) specifies a submatrix (bi-cluster) A_IJ, and H(I, J) represents the following mean squared residue score:
$$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} \left(a_{ij} - a_{iJ} - a_{Ij} + a_{IJ}\right)^2 \qquad (3)$$
where a_iJ and a_Ij are the means of row i and column j of the submatrix, respectively, and a_IJ is the mean of all its entries.
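The mean squared residue can be illustrated numerically with the sketch below, which assumes its usual definition: each entry is compared against its row mean, its column mean and the overall mean of the submatrix. The function name and the toy matrices are illustrative. A submatrix whose rows differ only by an additive shift is perfectly coherent and scores zero.

```python
def mean_squared_residue(a, rows, cols):
    """Mean squared residue of the submatrix of `a` given by `rows` and `cols`."""
    row_mean = {i: sum(a[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(a[i][j] for i in rows) / len(rows) for j in cols}
    overall = sum(row_mean[i] for i in rows) / len(rows)
    residue_sum = sum(
        (a[i][j] - row_mean[i] - col_mean[j] + overall) ** 2
        for i in rows for j in cols
    )
    return residue_sum / (len(rows) * len(cols))


# Rows that differ by a constant shift: a coherent bi-cluster.
additive = [[1.0, 2.0, 3.0],
            [2.0, 3.0, 4.0],
            [4.0, 5.0, 6.0]]
score_coherent = mean_squared_residue(additive, rows=[0, 1, 2], cols=[0, 1, 2])

# Anti-correlated rows: a poor bi-cluster.
checker = [[1.0, 0.0],
           [0.0, 1.0]]
score_incoherent = mean_squared_residue(checker, rows=[0, 1], cols=[0, 1])
```

This score depends on the expression values themselves, whereas the CER-based grouping described here depends only on which samples share a region.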
In bi-clustering methods, selected genes within a bi-cluster must share similar expression profiles. In the proposed method these genes are only required to map approximately the same set of samples within a single CER, irrespective of the expression values contained in this CER. We term these regions "significantly overlapping."
In addition to this similarity criterion, the proposed method demands that selected genes do not contain more than one significantly overlapping CER for a specific subset of samples. It is, however, possible to have several significantly overlapping CERs within the same subset of genes as long as they contain distinct subsets of samples. As a result, a set of genes containing two clusters of CERs can represent two different sample sub-groups.
The above criteria identify subsets of genes, each of which groups together a subset of samples in a single CER, irrespective of co-expression constraints. If CERs were thought of as independent features, the proposed method would resemble bi-clustering approaches except for the co-expression requirement, which is not a prerequisite here.
We thank members of our lab for helpful discussions and comments on the manuscript. We thank Anastasis Oulas, Alkiviadis Simeonidis and Babis Papamanthou for their technical advice during the development of the algorithm. This work was supported by the EMBO Young Investigator Award (P. Poirazi) and the IKY Foundation (P. Pavlidis).
- Felipe MS, Andrade RV, Arraes FB, Nicola AM, Maranhao AQ, Torres FA, Silva-Pereira I, Pocas-Fonseca MJ, Campos EG, Moraes LM, Andrade PA, Tavares AH, Silva SS, Kyaw CM, Souza DP, Network P, Pereira M, Jesuino RS, Andrade EV, Parente JA, Oliveira GS, Barbosa MS, Martins NF, Fachin AL, Cardoso RS, Passos GA, Almeida NF, Walter ME, Soares CM, Carvalho MJ, Brigido MM: Transcriptional profiles of the human pathogenic fungus Paracoccidioides brasiliensis in mycelium and yeast cells. J Biol Chem 2005, 280(26):24706–14. doi:10.1074/jbc.M500625200
- Ferrando AA, Look AT: DNA microarrays in the diagnosis and management of acute lymphoblastic leukemia. Int J Hematol 2004, 80(5):395–400. doi:10.1532/IJH97.04137
- Kolch W, Mischak H, Pitt AR: The molecular make-up of a tumour: proteomics in cancer research. Clin Sci (Lond) 2005, 108(5):369–83.
- Li Y, Li Y, Tang R, Xu H, Qiu M, Chen Q, Chen J, Fu Z, Ying K, Xie Y, Mao Y: Discovery and analysis of hepatocellular carcinoma genes using cDNA microarrays. J Cancer Res Clin Oncol 2002, 128(7):369–79. doi:10.1007/s00432-002-0347-0
- Nambiar S, Mirmohammadsadegh A, Doroudi R, Gustrau A, Marini A, Roeder G, Ruzicka T, Hengge UR: Signaling networks in cutaneous melanoma metastasis identified by complementary DNA microarrays. Arch Dermatol 2005, 141(2):165–73. doi:10.1001/archderm.141.2.165
- Reiss J, Bonin M, Schwegler H, Sass JO, Garattini E, Wagner S, Lee HJ, Engel W, Riess O, Schwarz G: The pathogenesis of molybdenum cofactor deficiency, its delay by maternal clearance, and its expression pattern in microarray analysis. Mol Genet Metab 2005, 85:12–20. doi:10.1016/j.ymgme.2005.01.008
- Ring BZ, Ross DT: Microarrays and molecular markers for tumor classification. Genome Biol 2002, 3(5):comment2005.
- Sriuranpong V, Mutirangura A, Gillespie JW, Patel V, Amornphimoltham P, Molinolo AA, Kerekhanjanarong V, Supanakorn S, Supiyaphun P, Rangdaeng S, Voravud N, Gutkind JS: Global gene expression profile of nasopharyngeal carcinoma by laser capture microdissection and complementary DNA microarrays. Clin Cancer Res 2004, 10(15):4944–58. doi:10.1158/1078-0432.CCR-03-0757
- Steinau M, Lee DR, Rajeevan MS, Vernon SD, Ruffin MT, Unger ER: Gene expression profile of cervical tissue compared to exfoliated cells: impact on biomarker discovery. BMC Genomics 2005, 6:64. doi:10.1186/1471-2164-6-64
- Steller S, Angenendt P, Cahill DJ, Heuberger S, Lehrach H, Kreutzberger J: Bacterial protein microarrays for identification of new potential diagnostic markers for Neisseria meningitidis infections. Proteomics 2005, 5(8):2048–55. doi:10.1002/pmic.200401097
- Callagy G, Pharoah P, Chin SF, Sangan T, Daigo Y, Jackson L, Caldas C: Identification and validation of prognostic markers in breast cancer with the complementary use of array-CGH and tissue microarrays. J Pathol 2005, 205(3):388–96. doi:10.1002/path.1694
- Chen Y, Miller C, Mosher R, Zhao X, Deeds J, Morrissey M, Bryant B, Yang D, Meyer R, Cronin F, Gostout BS, Smith-McCune K, Schlegel R: Identification of cervical cancer markers by cDNA and tissue microarrays. Cancer Res 2003, 63(8):1927–35.
- Iacobuzio-Donahue CA, Maitra A, Olsen M, Lowe AW, van Heek NT, Rosty C, Walter K, Sato N, Parker A, Ashfaq R, Jaffee E, Ryu B, Jones J, Eshleman JR, Yeo CJ, Cameron JL, Kern SE, Hruban RH, Brown PO, Goggins M: Exploration of global gene expression patterns in pancreatic adenocarcinoma using cDNA microarrays. Am J Pathol 2003, 162(4):1151–62.
- Arfin SM, Long AD, Ito ET, Tolleri L, Riehle MM, Paegle ES, Hatfield GW: Global gene expression profiling in Escherichia coli K12. The effects of integration host factor. J Biol Chem 2000, 275(38):29672–84. doi:10.1074/jbc.M002247200
- Baldi P, Long AD: A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001, 17(6):509–19. doi:10.1093/bioinformatics/17.6.509
- Chen X, Cheung ST, So S, Fan ST, Barry C, Higgins J, Lai KM, Ji J, Dudoit S, Ng IO, Van De Rijn M, Botstein D, Brown PO: Gene expression patterns in human liver cancers. Mol Biol Cell 2002, 13(6):1929–39. doi:10.1091/mbc.02-02-0023.
- Lenburg ME, Liou LS, Gerry NP, Frampton GM, Cohen HT, Christman MF: Previously unidentified changes in renal cell carcinoma gene expression identified by parametric analysis of microarray data. BMC Cancer 2003, 3:31. doi:10.1186/1471-2407-3-31
- Ryder MI, Hyun W, Loomer P, Haqq C: Alteration of gene expression profiles of peripheral mononuclear blood cells by tobacco smoke: implications for periodontal diseases. Oral Microbiol Immunol 2004, 19:39–49. doi:10.1046/j.0902-0055.2003.00110.x
- Sanchez-Carbayo M, Socci ND, Lozano JJ, Li W, Charytonowicz E, Belbin TJ, Prystowsky MB, Ortiz AR, Childs G, Cordon-Cardo C: Gene discovery in bladder cancer progression using cDNA microarrays. Am J Pathol 2003, 163(2):505–16.
- Tanaka TS, Jaradat SA, Lim MK, Kargul GJ, Wang X, Grahovac MJ, Pantano S, Sano Y, Piao Y, Nagaraja R, Doi H, Wood WH 3rd, Becker KG, Ko MS: Genome-wide expression profiling of mid-gestation placenta and embryo using a 15,000 mouse developmental cDNA microarray. Proc Natl Acad Sci USA 2000, 97(16):9127–32. doi:10.1073/pnas.97.16.9127
- Varma S, Simon R: Iterative class discovery and feature selection using Minimal Spanning Trees. BMC Bioinformatics 2004, 5:126. doi:10.1186/1471-2105-5-126
- von Heydebreck A, Huber W, Poustka A, Vingron M: Identifying splits with clear separation: a new class discovery method for gene expression data. Bioinformatics 2001, 17(Suppl 1):S107–14.
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286(5439):531–7. doi:10.1126/science.286.5439.531
- Ben-Dor A, Friedman N, Yakhini Z: Overabundance Analysis and Class Discovery in Gene Expression Data. RECOMB 2001.
- Bijlani R, Cheng Y, Pearce DA, Brooks AI, Ogihara M: Prediction of biologically significant components from microarray data: Independently Consistent Expression Discriminator (ICED). Bioinformatics 2003, 19:62–70. doi:10.1093/bioinformatics/19.1.62
- Thomas JG, Olson JM, Tapscott SJ, Zhao LP: An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles. Genome Res 2001, 11(7):1227–36. doi:10.1101/gr.165101
- Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. BIOWulf Technical Report 2000.
- Inza I, Larranaga P, Blanco R, Cerrolaza AJ: Filter versus wrapper gene selection approaches in DNA microarray domains. Artif Intell Med 2004, 31(2):91–103. doi:10.1016/j.artmed.2004.01.007
- Bloch K, Arce G: Nonlinear Correlation For The Analysis Of Gene Expression Data. In Workshop on Genomic Signal Processing and Statistics. Raleigh, North Carolina; 2002.Google Scholar
- Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V: Feature selection for SVMs. In Advances in Neural Information Processing Systems 13. MIT Press; 2001.Google Scholar
- Datasets URL[http://sdmc.lit.org.sg/GEDatasets/Datasets]
- Dabney AR: Classification of microarrays to nearest centroids. Bioinformatics 2005, 21(22):4148–54. [1367–4803 (Print) Journal Article] [1367-4803 (Print) Journal Article] 10.1093/bioinformatics/bti681View ArticlePubMedGoogle Scholar
- Diaz-Uriarte R, Alvarez de Andres S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006, 7: 3. [1471–2105 (Electronic) Evaluation Studies Journal Article] [1471-2105 (Electronic) Evaluation Studies Journal Article] 10.1186/1471-2105-7-3PubMed CentralView ArticlePubMedGoogle Scholar
- Li J, Liu H, Ng SK, Wong L: Discovery of significant rules for classifying cancer diagnosis data. Bioinformatics 2003, 19(Suppl 2):II93-II102. [1367–4803 (Print) Journal Article] [1367-4803 (Print) Journal Article]View ArticlePubMedGoogle Scholar
- Liu X, Krishnan A, Mondry A: An entropy-based gene selection method for cancer classification using microarray data. BMC Bioinformatics 2005, 6: 76. [1471–2105 (Electronic) Journal Article] [1471-2105 (Electronic) Journal Article] 10.1186/1471-2105-6-76PubMed CentralView ArticlePubMedGoogle Scholar
- Martella F: Classification of microarray data with factor mixture models. Bioinformatics 2006, 22(2):202–8. [1367–4803 (Print) Evaluation Studies Journal Article] [1367-4803 (Print) Evaluation Studies Journal Article] 10.1093/bioinformatics/bti779View ArticlePubMedGoogle Scholar
- Shevade SK, Keerthi SS: A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics 2003, 19(17):2246–53. [1367–4803 (Print) Evaluation Studies Journal Article Validation Studies] [1367-4803 (Print) Evaluation Studies Journal Article Validation Studies] 10.1093/bioinformatics/btg308View ArticlePubMedGoogle Scholar
- Wang Y, Makedon FS, Ford JC, Pearlman J: HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data. Bioinformatics 2005, 21(8):1530–7. [1367–4803 (Print) Evaluation Studies Journal Article] [1367-4803 (Print) Evaluation Studies Journal Article] 10.1093/bioinformatics/bti192View ArticlePubMedGoogle Scholar
- Cancer Program, Broad Institute[http://www.genome.wi.mit.edu/MPR]
- Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ: MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet 2002, 30: 41–7. [1061–4036 (Print) Journal Article] [1061-4036 (Print) Journal Article] 10.1038/ng765View ArticlePubMedGoogle Scholar
- Ross ME, Mahfouz R, Onciu M, Liu HC, Zhou X, Song G, Shurtleff SA, Pounds S, Cheng C, Ma J, Ribeiro RC, Rubnitz JE, Girtman K, Williams WK, Raimondi SC, Liang DC, Shih LY, Pui CH, Downing JR: Gene expression profiling of pediatric acute myelogenous leukemia. Blood 2004, 104(12):3679–87. [0006–4971 Journal Article] [0006-4971 Journal Article] 10.1182/blood-2004-03-1154View ArticlePubMedGoogle Scholar
- West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JJA, Marks JR, Nevins JR: Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci USA 2001, 98(20):11462–7. [0027–8424 Journal Article] [0027-8424 Journal Article] 10.1073/pnas.201162998PubMed CentralView ArticlePubMedGoogle Scholar
- Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JY, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, Golub TR: Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 2002, 415(6870):436–42. [0028–0836 Journal Article] [0028-0836 Journal Article] 10.1038/415436aView ArticlePubMedGoogle Scholar
- Yanagi Y, Yoshikai Y, Leggett K, Clark SP, Aleksander I, Mak TW: A human T cell-specific cDNA clone encodes a protein having extensive homology to immunoglobulin chains. Nature 1984, 308(5955):145–9. [0028–0836 Journal Article] [0028-0836 Journal Article] 10.1038/308145a0View ArticlePubMedGoogle Scholar
- Motz C, Martin H, Krimmer T, Rassow J: Bcl-2 and porin follow different pathways of TOM-dependent insertion into the mitochondrial outer membrane. J Mol Biol 2002, 323(4):729–38. [0022–2836 (Print) Journal Article] [0022-2836 (Print) Journal Article] 10.1016/S0022-2836(02)00995-6View ArticlePubMedGoogle Scholar
- Schleiff E, Shore G, Goping I: Human mitochondrial import receptor Tom20p. Use of glutathione to reveal specific interactions between Tom20-glutathione S-transferase and mitochondrial precursor proteins. FEBS Lett 1997.Google Scholar
- Karakas T, Maurer U, Weidmann E, Miething CC, Hoelzer D, Bergmann L: High expression of bcl-2 mRNA as a determinant of poor prognosis acute myeloid leukemia. Ann Oncol 1998, 9(2):159–165. 10.1023/A:1008255511404View ArticlePubMedGoogle Scholar
- Salomons G, Smets L, Verwijs-Janssen M, Hart A, Haarman E, Kaspers G, Wering E, Der Does-Van Den Berg A, WA K: Bcl-2 family members in childhood acute lymphoblastic leukemia: relationships with features at presentation, in vitro and in vivo drug response and long-term clinical outcome. Leukemia 1999, 13(10):1574–80. 10.1038/sj/leu/2401529View ArticlePubMedGoogle Scholar
- Coustan-Smith E, Kitanaka A, Pui C, McNinch L, Evans W, Raimondi S, Behm F, Arico M, D C: Clinical relevance of BCL-2 overexpression in childhood acute lymphoblastic leukemia. Blood 1996, 87(3):1140–6.PubMedGoogle Scholar
- Held-Feindt J, Mentlein R: CD70/CD27 ligand, a member of the TNF family, is expressed in human brain tumors. Int J Cancer 2002, 98(3):352–56. 10.1002/ijc.10207View ArticlePubMedGoogle Scholar
- Aulwurm S, Wischhusen J, Friese M, Borst J, M W: Immune stimulatory effects of CD70 override CD70-mediated immune cell apoptosis in rodent glioma models and confer long-lasting antiglioma immunity in vivo. Int J Cancer 2006, 118(7):1728–35. 10.1002/ijc.21544View ArticlePubMedGoogle Scholar
- Ilyin S, Gonzalez-Gomez I, Gilles F, Plata-Salaman C: Interleukin-1 alpha (IL-1 alpha), IL-1 beta, IL-1 receptor type I, IL-1 receptor antagonist, and TGF-beta 1 mRNAs in pediatric astrocytomas, ependymomas, and primitive neuroectodermal tumors. Mol Chem Neuropathol 1998, 33(2):125–37.View ArticlePubMedGoogle Scholar
- Ilyin S, Gonzalez-Gomez I, Romanovicht A, Gayle D, Gilles F, Plata-Salaman C: Autoregulation of the interleukin-1 system and cytokine-cytokine interactions in primary human astrocytoma cells. Brain Res Bull 2000, 51: 29–34. 10.1016/S0361-9230(99)00190-2View ArticlePubMedGoogle Scholar
- Wu C, Miloslavskaya I, Demontis S, Maestro R, Galaktionov K: Regulation of cellular response to oncogenic and oxidative stress by Seladin-1. Nature 2004, 432(7017):640–5. 10.1038/nature03173View ArticlePubMedGoogle Scholar
- Luciani P, Ferruzzi P, Arnaldi G, Crescioli C, Benvenuti S, Nesi G, Valeri A, Greeve I, Serio M, Mannelli M, Peri A: Expression of the novel adrenocorticotropin-responsive gene selective Alzheimer's disease indicator-1 in the normal adrenal cortex and in adrenocortical adenomas and carcinomas. J Clin Endocrinol Metab 2004, 89(3):1332–9. 10.1210/jc.2003-031065View ArticlePubMedGoogle Scholar
- Luciani P, Gelmini S, Ferrante E, Lania A, Benvenuti S, Baglioni S, Mantovani G, Cellai I, Ammannati F, Spada A, Serio M, Peri A: Expression of the antiapoptotic gene seladin-1 and octreotide-induced apoptosis in growth hormone-secreting and nonfunctioning pituitary adenomas. J Clin Endocrinol Metab 2005, 90(11):6156–61. 10.1210/jc.2005-0633View ArticlePubMedGoogle Scholar
- Zhang J, Krishnamurthy P, Johnson G: Cdk5 phosphorylates p53 and regulates its activity. J Neurochem 2002, 81(2):307–13. 10.1046/j.1471-4159.2002.00824.xView ArticlePubMedGoogle Scholar
- Chang C, Kampalath B, Schultz C, Bunyi-Teopengco E, Logan B, Eshoa C, Dincer A, Perkins S: Expression of p53, c-Myc, or Bcl-6 suggests a poor prognosis in primary central nervous system diffuse large B-cell lymphoma among immunocompetent individuals. Arch Pathol Lab Med 2003, 127(2):208–12.PubMedGoogle Scholar
- Ganigi P, Santosh V, Anandh B, Chandramouli B, Sastry Kolluri V: Expression of p53, EGFR, pRb and bcl-2 proteins in pediatric glioblastoma multiforme: a study of 54 patients. Pediatr Neurosurg 2005, 41(6):292–9. 10.1159/000088731View ArticlePubMedGoogle Scholar
- Lyons-Weiler J, Patel S, Bhattacharya S: A classification-based machine learning approach for the analysis of genome-wide expression data. Genome Res 2003, 13(3):503–12. [1088–9051 Journal Article] [1088-9051 Journal Article] 10.1101/gr.104003PubMed CentralView ArticlePubMedGoogle Scholar
- Kumar S, Tamura K, Jakobsen IB, Nei M: MEGA2: molecular evolutionary genetics analysis software. Bioinformatics 2001, 17(12):1244–5. [1367–4803 Journal Article] [1367-4803 Journal Article] 10.1093/bioinformatics/17.12.1244View ArticlePubMedGoogle Scholar
- The Mathematical theory of Communication. University of Illinois Press; 1949.
- Wong A, Chiu D: Synthesizing Statistical Knowledge from Incomplete Mixed-Mode Data. IEEE Trans Pattern Analysis and Machine Intelligence 1987.Google Scholar
- Catlett J: On Changing Continuous Attributes into Ordered Discrete Attributes. Machine Learning-EWSL-91, Proc. European Working Session on Learning 1991.Google Scholar
- Fayyad U, Irani K: Multi-interval discretization of continuous-valued attributes for statistical learning. Proc of the 13th International Joint Conference on Artificial Intelligence 1993, 1022–1029.Google Scholar
- Liu X, Krishnan A, Mondry A: An entropy-based gene selection method for cancer classification using microarray data. BMC Bioinformatics 2005, 6: 76. [1471–2105 (Electronic) Journal Article] [1471-2105 (Electronic) Journal Article] 10.1186/1471-2105-6-76PubMed CentralView ArticlePubMedGoogle Scholar
- Yan X, Deng M, Fung WK, Qian M: Detecting differentially expressed genes by relative entropy. J Theor Biol 2005, 234(3):395–402. [0022–5193 (Print) Journal Article] [0022-5193 (Print) Journal Article] 10.1016/j.jtbi.2004.11.039View ArticlePubMedGoogle Scholar
- Cheng Y, Church GM: Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol 2000, 8: 93–103. [1553–0833 (Print) Journal Article] [1553-0833 (Print) Journal Article]PubMedGoogle Scholar
- Sofware Download Site[http://www.imbb.forth.gr/people/poirazi/software.html]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.