Expression profiles of switch-like genes accurately classify tissue and infectious disease phenotypes in model-based classification
© Gormley and Tozeren; licensee BioMed Central Ltd. 2008
Received: 10 June 2008
Accepted: 17 November 2008
Published: 17 November 2008
Large-scale compilation of gene expression microarray datasets across diverse biological phenotypes provided a means of gathering a priori knowledge in the form of identification and annotation of bimodal genes in the human and mouse genomes. These switch-like genes consist of 15% of known human genes, and are enriched with genes coding for extracellular and membrane proteins. It is of interest to determine the prediction potential of bimodal genes for class discovery in large-scale datasets.
Use of a model-based clustering algorithm accurately classified more than 400 microarray samples into 19 different tissue types on the basis of bimodal gene expression. Bimodal expression patterns were also highly effective in differentiating between infectious diseases in model-based clustering of microarray data. Supervised classification with feature selection restricted to switch-like genes also recognized tissue specific and infectious disease specific signatures in independent test datasets reserved for validation. Determination of "on" and "off" states of switch-like genes in various tissues and diseases allowed for the identification of activated/deactivated pathways. Activated switch-like genes in neural, skeletal muscle and cardiac muscle tissue tend to have tissue-specific roles. A majority of activated genes in infectious disease are involved in processes related to the immune response.
Switch-like bimodal gene sets capture genome-wide signatures from microarray data in health and infectious disease. A subset of bimodal genes coding for extracellular and membrane proteins are associated with tissue specificity, indicating a potential role for them as biomarkers provided that expression is altered in the onset of disease. Furthermore, we provide evidence that bimodal genes are involved in temporally and spatially active mechanisms including tissue-specific functions and response of the immune system to invading pathogens.
Gene expression is controlled over a wide range at the transcript level through complex interplay between epigenetic modifications, DNA regulatory proteins, and microRNA molecules [1–3]. Genome-wide screening of expression profiles has provided an expansive perspective on gene regulation in health and disease. For example, identification of constitutively expressed housekeeping genes has aided in the inference of sets of minimal processes required for basic cellular function [4, 5]. Similarly, we have identified and annotated genes with switch-like expression profiles in the mouse and human, using large microarray datasets of healthy tissue . Genes with switch-like expression profiles represent fifteen percent of the human gene population. Classification of samples on the basis of bimodal or switch-like gene expression may give insight into temporally and spatially active mechanisms that contribute to phenotypic diversity. Given the variable expression of switch-like genes, they may also provide a viable candidate gene set for the detection of clinically relevant expression signatures in a feature space with reduced dimensionality.
The high-dimensionality inherent in genome-wide quantification makes extracting meaningful biological information from gene expression datasets a difficult task. Early attempts at genome-wide expression analysis used unsupervised clustering methods to identify groups of genes or conditions with similar expression profiles [7–9]. Biological insight can be derived from the observation that functionally related or co-regulated genes often cluster together. Supervised classification methods require datasets in which the class of the samples is known in advance. Statistical hypothesis testing [10, 11] is used to identify groups of genes that exhibit changes in expression associated with class distinction. Significant genes can be used to build decision rules to predict the class of unseen samples [12–14]. Unsupervised classification is better suited for class discovery whereas supervised classification is tailored for class prediction. In both of these complimentary approaches, dimension reduction can lead to increased classification accuracy.
Many simple unsupervised learning algorithms rely on distance metrics to either partition profiles into distinct groups [15, 16] or build clusters from pair-wise distances in a nested, hierarchical fashion . The optimal number of clusters must be defined heuristically or in advance and confidence in cluster membership is difficult to determine. Model-based clustering provides the necessary statistical framework to address these concerns while allowing for class discovery. In model-based clustering, it is assumed that similar expression profiles are generated as draws from a set of multivariate Gaussian random variables. Clusters are identified by fitting the parameters of the cluster-specific distributions to the data. Expectation-maximization [17–19] or Bayesian methods [20–22] are used for optimization. Estimation of the number of clusters as well as the incorporation of confidence in cluster membership is implicit in this process.
Methods such as unsupervised, supervised and model-based classification provide the means to evaluate switch-like gene expression patterns in high-dimensional datasets profiling diverse biological conditions. For this purpose, we compiled two large-scale gene expression microarray datasets from publicly available data repositories. The first dataset included samples spanning nineteen different tissue types from healthy donors. The second dataset included samples from donors with one of a number of infectious diseases including HIV-1 infection, hepatitis C, influenza, and malaria. Our results demonstrate that switch-like genes exhibit tissue and disease-specific expression signatures. Dimension reduction of genome-wide expression data through the identification of switch-like genes enabled highly accurate classification of samples into tissue-specific and disease-specific clusters. Moreover, analysis of activated switch-like genes in various disease and tissue types revealed that these genes participate in specialized or temporally active mechanisms. Further study of genes in the switch-like gene set may provide biologically significant information about the molecular basis of phenotype distinction.
Three hundred bimodal genes classify nineteen tissue types with high accuracy in model-based classification
Microarray datasets used in this study
Tissue Phenotype Data
No. of Samples
Gene Expression Omnibus/Array Express Accn. #
GSE3526, GSE8514, GSE2316
GSE3526, GSE7621, GSE7307, GSE2361, E_AFMX-11, E-TABM-20,
E-TABM-176, GSE8671, GSE9254, GSE9452
GSE1133, GSE2361, GSE3419, GSE3526, GSE7307
E_AFMX-11, E-MIMR-27, GSE1133, GSE2240, GSE2361, GSE3526, GSE3585, GSE7307
E_AFMX-11, GSE2004, GSE2361, GSE3526, GSE7392
E_AFMX-11, GSE2004, GSE3526, GSE6764
E-MEXP-231, GSE10072, GSE1133, GSE2361, GSE3526
E-TABM-66, GSE2361, GSE3526, GSE7307, GSE7904
GSE10760, GSE2328, GSE3526, GSE5110, GSE6798, GSE7307, GSE9103,
GSE2361, GSE3526, GSE6008, GSE7307
GSE1133, GSE2361, GSE7307
GSE7462, GSE8608, GSE8668, GSE8762, GSE9692
GSE2004, GSE2361, GSE3526, GSE7307
GSE2361, GSE3526, GSE7307
E_AFMX-11, GSE1133, GSE2361, GSE3218, GSE3526, GSE7307, GSE7808
GSE1133, GSE2361, GSE7307
No. of Samples
Gene Expression Omnibus/Array Express Accn. #
Adjusted Rand Index compares observed partitions with true classification of samples in tissue phenotype data
All bimodal genes
Enrichment analysis reveals tissue-specific functions of "on" genes in brain, skeletal muscle, cardiac muscle, and lung tissue
GO categories significantly enriched with "on" genes in brain tissue
▪ Neuron migration
▪ Actin binding
▪ GTPase activity
▪ Ion transport
▪ Microtubule associated complex
▪ Transmembrane receptor protein tyrosine
▪ Negative regulation of microtubule depolymerization
▪ Structural molecule activity
▪ Cell adhesion
▪ Strucutural constituent of cytoskeleton
▪ Neuron adhesion
▪ Integral to membrane
▪ Ion channel activity
▪ Transmembrane receptor protein tyrosine phosphatase signaling pathway
▪ Structural constituent of myelin sheath
▪ Synaptic transmission
▪ Cell junction
▪ Neuromuscular synaptic transmission
▪ Nervous system development
▪ Growth cone
▪ Central nervous system development
▪ Postsynaptic membrane
▪ Neuron recognition
▪ Anterograde axon cargo transport
▪ Neuron differentiation
Model-based classification of infectious disease and immune response signature
GO categories significantly enriched with "on" genes in infectious disease
▪ Immune response1, 2, 3, 4, 5
▪ B cell receptor complex1, 2, 4, 5
▪ Antigen binding1, 2, 4, 5
▪ Humoral immune response by circulating immunoglobin1, 2, 4, 5
▪ Immunoglobulin complex, circulating1, 2, 4, 5
▪ Succinate dehydrogenase activity2,3,4
▪ Positive regulation of B cell proliferation1, 2, 4, 5
▪ Perinuclear region of cytoplasm1, 2, 4, 5
▪ RNA binding3
▪ Early endosome to late endosome transport1, 2, 4, 5
▪ External side of plasma membrane1,4
▪ Structural constituent of cytoskeleton3
▪ Positive regulation of peptidyl-tyrosine phosphorylation1, 2, 4, 5
▪ Membrane fraction4,5
▪ Protein binding3
▪ B cell receptor signaling pathway1, 2, 4, 5
▪ Activation of MAPK activity1, 2, 4
▪ dehydrogenase activity5
▪ tRNA aminoacylation for protein translation1,4
▪ Actin cytoskeleton3
▪ Endopeptidase inhibitor activity5
▪ Antigen processing and Presentation1,4
▪ Extracellular region5
▪ Structural molecule activity5
▪ DNA methylation3
▪ Proteinaceous extracellular matrix5
▪ Extracellular matrix structural constituent5
▪ Translational initiation3
▪ Negative regulation of protein kinase activity3
▪ Defense response3
▪ Inflammatory response4
▪ Hemocyte development4
▪ Cell-cell adhesion4
▪ Pyridine nucleotide biosynthetic process4
▪ Respiratory burst4
▪ Response to calcium ion3,4
▪ Tricarboxylic acid cycle5
▪ Cell adhesion5
▪ Blood coagulation5
▪ Sensory perception of sound3,5
Supervised classification with bimodal genes capture tissue specific and infectious disease specific signatures in microarray data
We used simulated microarray data in order to gain insights on which parameters of supervised classification are determinant of the classification accuracy in datasets considered in this study. Supervised classification of simulated gene expression profiles illustrated the strong dependence of prediction accuracy on sample size, extent of separation between bimodal peaks and the number of informative genes. Classification accuracy generally improved as expression profiles became more bimodal. Increased sample size and decreased number of informative genes also resulted in more accurate classification.
Development and subsequent commercialization of microarray platforms has led to extensive investigation of global gene expression profiles in health and disease. Expression profiling of diverse healthy tissues provides a comprehensive perspective of the range of transcriptional regulation under physiologic conditions [42–44]. Similarly, identification of gene expression signatures indicative of disease subtypes improves our understanding of the molecular basis of pathology [7, 8, 45]. Small sample size and the large number of measurements for each sample are among the limiting factors that hinder the effectiveness of gene expression profiling and drive the development of new analytical methods.
Unsupervised clustering of microarray data classifies samples in an unbiased manner according to similarity in gene expression profiles. Adaptation of model-based clustering to low sample size, high dimensional datasets  and formalization of statistical approaches for selecting the optimum number of clusters  represent significant advances. In this study, we used these advanced methods to cluster and classify infectious disease and tissue phenotypes in large scale microarray data using a reduced set of 1265 switch-like genes [6, 47]. Switch-like genes are identified through the detection of bimodal gene expression patterns across diverse biological conditions. Switch-like genes are likely to be under strict transcriptional regulation and are statistically enriched for cell membrane and extracellular proteins .
We demonstrated that model-based clustering of switch-like gene expression patterns differentiates between tissue phenotypes in a microarray dataset with tissue-specific sample sizes ranging from 5 to nearly 100. Because model-based clustering operates on the assumption that samples are drawn from multivariate Gaussian distributions, the method is particularly well-suited for the analysis of bimodal gene expression profiles. Distance-based unsupervised classification methods such as Kmeans and hierarchical clustering also led to accurate classification for tissues with large sample sizes (> 25) but had little differentiation potential at small sample sizes. The decrease in classification accuracy observed with the use of distance-based clustering may be due to estimation of the number of clusters via the gap statistic . Incorporating optimization of the number of clusters into the model fitting process likely improves the performance of model-based clustering [48, 49] such that tissue types with smaller sample sizes are resolved into separate clusters.
A set of 300 bimodal genes expressed on the extracellular matrix or the plasma membrane is sufficient to accurately differentiate between nineteen different tissue types in model-based clustering even at 5 microarray samples for tissue type. This set of genes includes those that code for membrane-bound integrin proteins and ECM proteins belonging to collagen, laminin, and fibronectin families. Genes expressed in the "on" mode in brain tissue and the "off" mode in muscle tissue largely coded for neural-specific cell adhesion molecules. Supervised classification has the potential to further reduce the set of 300 bimodal genes to biomarker sets when considering biomarkers for tissue-specific diseases. Accurate classification with the subset of bimodal genes presented in this article demonstrate the importance of cell/ECM interactions in tissue differentiation  and will prove useful as a priori knowledge in the analysis of microarray data produced by different laboratories.
Our study showed that the bimodal gene set identified using microarray data associated with healthy tissue is highly effective in differentiating between microarray data from tissues infected by various infectious diseases such as the HIV-1 infection, hepatitis C, influenza and malaria. The classification was unsupervised and the disease signature was conserved across laboratories. Moreover, bimodal gene sets differentiated between liver and blood cell tissues infected with the same hepatitis virus. The identification of bimodal genes expressed in the activated state in various infectious diseases and subsequent enrichment analysis with KEGG pathways provide biological context to the perturbation of various cell signaling networks induced by invading viruses. In the infectious disease states investigated here, bimodal genes expressed in the "on" mode were related to both innate and antigen-mediated immune responses.
It should be noted that other gene sets determined by feature selection may be even more discriminative of the phenotypes included in this analysis than the switch genes under consideration. Our intent in this study was not to identify discriminative genes but rather to use unsupervised clustering to determine whether switch-like expression patterns are associated with phenotype and whether previously identified switch-like genes could be used a priori to reduce the feature space in microarray analysis. The large body of evidence presented in this work points to the success of switch-like gene sets in capturing biologically-relevant gene expression signatures from microarray data.
Given the demonstrated biological relevance of bimodal expression patterns, it would be worthwhile to determine the clinical relevance of switch-like gene annotation. Identification of bimodal genes expressed in the activated state in complex diseases such as autism, diabetes and cancer may provide a method for dimension reduction in the identification of disease-related single nucleotide polymorphisms (SNPs)  and expression quantitative trait loci (eQTL) [51, 52] in genome-wide association studies. Both gene sequences and promoter regions of bimodal genes expressed in the high mode identified from large scale microarray data could be searched for SNPs and eQTL linked to the onset of disease or disease progression. Further studies are needed to investigate the full potential of clinically relevant classification using switch-like gene annotation from microarray data.
In this study, we showed that a priori knowledge gained from compilation of large-scale microarray datasets from multiple laboratories containing at least 400 samples for each gene in the array could be successfully used in reducing the dimension of features in microarray analysis. We reduced dimensionality by focusing on a set of genes with bimodal expression patterns, i.e. genes that adopt either an "on" or "off" mode of expression and are tightly regulated at the transcript level. Detection of bimodality using expectation maximization revealed a list of 1265 bimodal genes in the human genome. A subset of 300 bimodal genes was sufficient to differentiate between nineteen different tissue signatures even in small sample sizes. These genes code for proteins either on the cell membrane or at the extracellular matrix. Such proteins can be identified in tissue using fluorescence, Q dots and other methods and as such are candidate biomarkers for specific tissues.
The set of bimodal genes are capable of capturing infectious disease signatures from microarray data corresponding to hepatitis C, influenza, HIV-1 infection and malaria. Disease-specific expression patterns of bimodal genes suggest that infection by different pathogens may initiate different host responses mediated by switch-like gene expression. Determination of "on" and "off" states of switch-like genes in various tissues and diseases allowed for the identification of activated/deactivated pathways that are consistent with existing research data. Classification accuracy was exceptional even with class-specific sample sizes between ten and twenty arrays. The use of a priori knowledge from public microarray datasets in the form of bimodal gene sets has clinical implications in disease subtype classification. Genome-wide association studies for SNP discovery linked to complex diseases such as autism and cancer could potentially benefit from dimension reduction by focusing on regions of DNA that code for switch-like genes and their promoter regions.
Microarray datasets used in this study were compiled from the online public repositories Gene Expression Omnibus (GEO)  and Array Express (AE)  as described in additional file2. All datasets were profiled on the HGU133A or its recently expanded version, the HGU133plus2 Affymetrix platforms. The datasets used in the study are shown in Table 1. Accession numbers of arrays used in this study are listed in Additional File 3 with corresponding phenotype information.
Datasets were first filtered such that only the 22,277 probe sets common to both the HGU133A and HGU133plus2 platforms were retained. Reference robust multi-chip averaging (refRMA)  was used for normalization. RefRMA is an adaptation of the classic RMA approach  that is better suited for large datasets. RMA background adjustment was applied to each array and then the arrays were normalized by fitting probe level intensities for each chip to an empirical distribution obtained by applying quantile normalization to an 800-array training set . Probe affinity effects were estimated by median polishing on the training set and used to adjust the normalized probe level measures. Following these steps, probe set expression values were derived from the median value of constituent probe level intensities.
Probe set annotation
Probe sets were annotated using Entrez Gene ID, Ensembl accession number, gene symbol, Gene Ontology terms  and KEGG pathways . Gene identifiers and gene ontology terms were obtained from the HGU133plus2 annotation information on the Affymetrix website in March 2008. KEGG pathway annotations were obtained from the KEGG ftp site on April 28th, 2008.
Identification of bimodal genes
Bimodal genes were identified in expression data of healthy tissues using a statistical method previously applied in the detection of switch-like behavior among mouse  and human  genes. The expectation maximization method thus employed has also been used to detect bimodality in blood glucose concentrations [59, 60]. For each gene, we tested the hypothesis that the expression distribution fits a two-component Gaussian mixture model versus the null hypothesis that expression follows a single normal distribution. To correct for skewness observed in expression profiles, we used the box-cox transformation  as described in detail in our previous work [6, 47]. The distribution of box-cox parameters over all genes was centered at zero and approximately normally distributed, suggesting that the degree of skewness is small for a majority of genes. Parameters of the two-component mixture model were fit using expectation maximization . Parameters of the single normal distribution were estimated from gene-specific sample means and standard deviations. The modified log-likelihood ratio test statistic -2logλ was used to reject the null hypothesis. As in our previous work [6, 47], p-values were generated by evaluating the chi-square distribution with six degrees of freedom at the values of the test statistic. Genes with p-values less than 0.001 were selected as candidate bimodal genes. This subset of switch-like genes was further reduced by restricting the standardized area of intersection between the distributions of the component Gaussians to 10 percent . This reduction assured bimodality with significant distance between the two peaks, resulting in a list of 1265 bimodal genes. A subset of 300 bimodal genes was obtained by identifying genes with either plasma membrane and/or extracellular membrane among their cell compartment GO categories.
Identification of "on" genes in brain, skeletal muscle, cardiac muscle, lung and infectious disease phenotypes
Bimodal gene expression values were binarized by defining a gene-specific threshold at the intersection of the probability density functions of the two-component mixture models . Expression values above this threshold are described as "high" or "on". Bimodal genes in the "on" state in a majority of samples of a given phenotype were identified using a Bernoulli process . Each observation or sample was modeled as an independent trial. Success was defined as expression in the "on" mode. P-values were calculated from the binomial distribution with an equal probability of success and failure. A value of p < = 0.01 indicates a significant association between bimodal gene expression and phenotype.
Gene sets characterized by KEGG pathways and GO terms were analyzed to identify functional categories enriched in sets of bimodal genes biased to the "on" or "off" mode in healthy and disease phenotypes. We assessed the enrichment of functional gene sets by comparing the number of "on" or "off" genes observed in a particular functional group to the number expected by chance . The hypergeometric test was used to assign significance to the enriched functional gene sets. In functional enrichment, p-values less than 0.001 were considered significant.
Two distance-based clustering algorithms, Kmeans  and hierarchical clustering , were implemented in the R statistical environment in order to classify tissue samples into groups with similar expressions of bimodal genes. In both cases, we used Euclidean distance as the distance metric. In our implementation of Kmeans, we ran ten iterations with different initial cluster centroid locations and retained the cluster partition associated with the minimal within-cluster sum of squares. In hierarchical clustering, we used complete linkage to define the distance between clusters and observations . A single cluster solution was obtained from the resulting dendrogram by cutting the tree at a level which produced the desired number of clusters. In both of these algorithms, the data-driven optimal number of clusters was determined using the gap statistic, as described below.
Definition of the number of clusters in distance-based clustering
in which W kb *, (b = 1, 2,... B and k = 1, 2,... 25) and W k are within-cluster sums of squares of the reference and observed datasets respectively. The estimated number of clusters
and sd k is the standard deviation of log(W kb *).
Model-based subspace clustering
A model-based clustering algorithm , developed for the analysis of comparative genomic hybridization data, was used to cluster tissue samples on the basis of bimodal gene expression. In this approach, clusters are identified by finding an optimal partition of samples into K groups defined by cluster-specific multivariate Gaussian distributions. It is assumed that clusters can be differentiated by shifts in the mean expression values for a subset of genes and samples. Each sample is modeled as follows:y i = μ + r i × δ i + ε i
where fn-1 is the empirical distribution of Θ1,..., Θn and α is a constant. This process potentially results in less than n unique draws from the baseline distribution and therefore naturally leads to clustering. Parameters of the model are fit from the data using a Gibbs sampling algorithm . We ran the model-based clustering algorithm  in the R statistical environment on 25 parallel Markov chains with 250 iterations each. We found that each chain quickly converged to equally likely, unique solutions, indicating a multi-modal posterior distribution. To obtain an approximation of the true posterior distribution, we took the average of the cluster partition with the highest log-likelihood from each chain as reported elsewhere [20, 21].
Pairwise posterior probabilities
in which c i (i = 1,..., n samples) is a vector indicating which cluster sample i is assigned to. Although the pairwise posterior probability is a useful measure in itself, it does not provide a single cluster partition. For this purpose, a distance metric (D ij ) was defined from the pairwise posterior probabilities equal to D ij = 1 - P ij . A unique cluster partition can then be found using the complete linkage method, such that cluster objects are maximally separated between clusters.
Quantifying the agreement between observed clusters and known phenotype
Contingency table comparing two partitions
n R 1
n R 2
n.. = n
A multi-class supervised learning scheme was used to classify tissue samples on the basis of bimodal gene expression. In binary classification of microarray data, training data was used to rank features by a two-class test statistic . Discriminative genes were selected from the top of this ranked list. A decision rule associated with class distinction in the set of training samples was defined on the basis of the expression of the selected genes. The decision rule was then evaluated on an independent set of samples. To extend the supervised learning scheme to multiple class problems, we trained separate classifiers to identify tissue samples of each class vs. all others . Results are based on 100 independent iterations of the following training and testing procedure. Prior to classification, datasets were divided into training and testing sets in a class-proportional manner such that two-thirds of the samples in each class were used for training and one-third for testing. For the j th classifier (j = 1,..., number of classes), training samples in class j were assigned to class 1. All other samples were assigned to class 0. Discriminative bimodal genes were identified from the training data according to the ratio of within class to between class sums of squares . Diagonal linear discriminant analysis was used to define the distances between test sample i and samples in class 0 (d co ) and class 1 (dc 1), respectively . A confidence measure, defined from 0 to 1, was calculated as d co /(d co +dc 1). Values close to 0/1 indicate low/high confidence that test sample i belongs to class j. Confidence measures were compared from each classifier and test sample i was assigned to the class associated with the highest confidence.
in which β1 ... β M are the defined regression coefficients and x1, i... xM, iare the expression values of the informative genes in the i th observation . The simulated dataset was completed by drawing the response variable y i on the basis of π i (y i = 1 iff π i > 0.5). In this manner, the relationship between the j th gene and the response variable y i can be specified exactly (i.e. the value of β), independent of the sample distribution of gene j.
Adjusted Rand Index
Kyoto Encyclopedia of Genes and Genomes
peripheral blood mononuclear cells
human immunodeficiency virus
reference robust multi-array average
single nucleotide polymorphism
expression quantitative trait loci
area under the receiver operating characteristic curve.
This study was supported by the National Institute of Health (NIH) grant #232240 and by the National Science Foundation (NSF) grant # 235327.
- Arora A, Simpson DA: Individual mRNA expression profiles reveal the effects of specific microRNAs. Genome biology 2008, 9(5):R82. 10.1186/gb-2008-9-5-r82PubMed CentralView ArticlePubMedGoogle Scholar
- Hobert O: Gene regulation by transcription factors and microRNAs. Science (New York, NY) 2008, 319(5871):1785–1786.View ArticleGoogle Scholar
- Jaenisch R, Bird A: Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nature genetics 2003, 33(Suppl):245–254. 10.1038/ng1089View ArticlePubMedGoogle Scholar
- Hsiao LL, Dangond F, Yoshida T, Hong R, Jensen RV, Misra J, Dillon W, Lee KF, Clark KE, Haverty P, et al.: A compendium of gene expression in normal human tissues. Physiological genomics 2001, 7(2):97–104.View ArticlePubMedGoogle Scholar
- Warrington JA, Nair A, Mahadevappa M, Tsyganskaya M: Comparison of human adult and fetal expression and identification of 535 housekeeping/maintenance genes. Physiological genomics 2000, 2(3):143–147.PubMedGoogle Scholar
- Ertel A, Tozeren A: Human switch-like genes and their regulation via transcription initiation and histone methylation. BMC Genomics 2008. (in review)Google Scholar
- Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, et al.: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000, 403(6769):503–511. 10.1038/35000501View ArticlePubMedGoogle Scholar
- Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America 1999, 96(12):6745–6750. 10.1073/pnas.96.12.6745PubMed CentralView ArticlePubMedGoogle Scholar
- Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the United States of America 1998, 95(25):14863–14868. 10.1073/pnas.95.25.14863PubMed CentralView ArticlePubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al.: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(43):15545–15550. 10.1073/pnas.0506580102PubMed CentralView ArticlePubMedGoogle Scholar
- Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences of the United States of America 2001, 98(9):5116–5121. 10.1073/pnas.091062498PubMed CentralView ArticlePubMedGoogle Scholar
- van't Veer LJ, Dai H, Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, Kooy K, Marton MJ, Witteveen AT, et al.: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415(6871):530–536. 10.1038/415530aView ArticleGoogle Scholar
- Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijer-van Gelder ME, Yu J, et al.: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005, 365(9460):671–679.View ArticlePubMedGoogle Scholar
- West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA Jr, Marks JR, Nevins JR: Predicting the clinical status of human breast cancer by using gene expression profiles. Proceedings of the National Academy of Sciences of the United States of America 2001, 98(20):11462–11467. 10.1073/pnas.201162998PubMed CentralView ArticlePubMedGoogle Scholar
- Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture. Nature genetics 1999, 22(3):281–285. 10.1038/10343View ArticlePubMedGoogle Scholar
- Toronen P, Kolehmainen M, Wong G, Castren E: Analysis of gene expression data using self-organizing maps. FEBS letters 1999, 451(2):142–146. 10.1016/S0014-5793(99)00524-4View ArticlePubMedGoogle Scholar
- Ghosh D, Chinnaiyan AM: Mixture modelling of gene expression data from microarray experiments. Bioinformatics (Oxford, England) 2002, 18(2):275–286. 10.1093/bioinformatics/18.2.275View ArticleGoogle Scholar
- McLachlan GJ, Bean RW, Peel D: A mixture model-based approach to the clustering of microarray expression data. Bioinformatics (Oxford, England) 2002, 18(3):413–422. 10.1093/bioinformatics/18.3.413View ArticleGoogle Scholar
- Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL: Model-based clustering and data transformations for gene expression data. Bioinformatics (Oxford, England) 2001, 17(10):977–987. 10.1093/bioinformatics/17.10.977View ArticleGoogle Scholar
- Joshi A, Peer Y, Michoel T: Analysis of a Gibbs sampler method for model-based clustering of gene expression data. Bioinformatics (Oxford, England) 2008, 24(2):176–183. 10.1093/bioinformatics/btm562View ArticleGoogle Scholar
- Medvedovic M, Yeung KY, Bumgarner RE: Bayesian mixture model based clustering of replicated microarray data. Bioinformatics (Oxford, England) 2004, 20(8):1222–1232. 10.1093/bioinformatics/bth068View ArticleGoogle Scholar
- Qin ZS: Clustering microarray gene expression data using weighted Chinese restaurant process. Bioinformatics (Oxford, England) 2006, 22(16):1988–1997. 10.1093/bioinformatics/btl284View ArticleGoogle Scholar
- Hoff PD: Model-based subspace clustering. Bayesian Analysis 2006, 1(2):321–344.View ArticleGoogle Scholar
- Hynes RO: Integrins: bidirectional, allosteric signaling machines. Cell 2002, 110(6):673–687. 10.1016/S0092-8674(02)00971-6View ArticlePubMedGoogle Scholar
- Nelson CM, Bissell MJ: Of extracellular matrix, scaffolds, and signaling: tissue architecture regulates development, homeostasis, and cancer. Annual review of cell and developmental biology 2006, 22: 287–309. 10.1146/annurev.cellbio.22.010305.104315PubMed CentralView ArticlePubMedGoogle Scholar
- Bon G, Folgiero V, Di Carlo S, Sacchi A, Falcioni R: Involvement of alpha6beta4 integrin in the mechanisms that regulate breast cancer progression. Breast Cancer Res 2007, 9(1):203. 10.1186/bcr1651PubMed CentralView ArticlePubMedGoogle Scholar
- Buttery RC, Rintoul RC, Sethi T: Small cell lung cancer: the importance of the extracellular matrix. The international journal of biochemistry & cell biology 2004, 36(7):1154–1160. 10.1016/S1357-2725(03)00261-9View ArticleGoogle Scholar
- van Horssen J, Dijkstra CD, de Vries HE: The extracellular matrix in multiple sclerosis pathology. Journal of neurochemistry 2007, 103(4):1293–1301. 10.1111/j.1471-4159.2007.04897.xView ArticlePubMedGoogle Scholar
- Yu WM, Yu H, Chen ZL: Laminins in peripheral nerve development and muscular dystrophy. Molecular neurobiology 2007, 35(3):288–297. 10.1007/s12035-007-0026-xView ArticlePubMedGoogle Scholar
- Yamada S, Nelson WJ: Synapses: sites of cell recognition, adhesion, and functional specification. Annual review of biochemistry 2007, 76: 267–294. 10.1146/annurev.biochem.75.103004.142811PubMed CentralView ArticlePubMedGoogle Scholar
- Scherer SS, Arroyo EJ: Recent progress on the molecular organization of myelinated axons. J Peripher Nerv Syst 2002, 7(1):1–12. 10.1046/j.1529-8027.2002.02001.xView ArticlePubMedGoogle Scholar
- Janeway CA Jr, Medzhitov R: Innate immune recognition. Annual review of immunology 2002, 20: 197–216. 10.1146/annurev.immunol.20.083001.084359View ArticlePubMedGoogle Scholar
- Pasare C, Medzhitov R: Toll-like receptors: linking innate and adaptive immunity. Microbes and infection/Institut Pasteur 2004, 6(15):1382–1387.View ArticlePubMedGoogle Scholar
- Bureau C, Bernad J, Chaouche N, Orfila C, Beraud M, Gonindard C, Alric L, Vinel JP, Pipy B: Nonstructural 3 protein of hepatitis C virus triggers an oxidative burst in human monocytes via activation of NADPH oxidase. The Journal of biological chemistry 2001, 276(25):23077–23083. 10.1074/jbc.M100698200View ArticlePubMedGoogle Scholar
- Guida M, D'Elia G, Benvestito S, Casamassima A, Micelli G, Quaranta M, Moschetta R, De Lena M, Lorusso V: Hepatitis C virus infection in patients with B-cell lymphoproliferative disorders. Leukemia 2002, 16(10):2162–2163. 10.1038/sj.leu.2402626View ArticlePubMedGoogle Scholar
- Landau DA, Saadoun D, Calabrese LH, Cacoub P: The pathophysiology of HCV induced B-cell clonal disorders. Autoimmunity reviews 2007, 6(8):581–587. 10.1016/j.autrev.2007.03.010View ArticlePubMedGoogle Scholar
- Lindenschmidt EG, Granato CH, Katzner K, Laufs R: Evidence for limited humoral immunoglobulin M antibody response to hepatitis B core antigen during acute and chronic hepatitis B virus infections. Journal of clinical microbiology 1985, 21(6):1000–1003.PubMed CentralPubMedGoogle Scholar
- Sarantis H, Gray-Owen SD: The specific innate immune receptor CEACAM3 triggers neutrophil bactericidal activities via a Syk kinase-dependent pathway. Cellular microbiology 2007, 9(9):2167–2180. 10.1111/j.1462-5822.2007.00947.xView ArticlePubMedGoogle Scholar
- Anand AR, Ganju RK: HIV-1 gp120-mediated apoptosis of T cells is regulated by the membrane tyrosine phosphatase CD45. The Journal of biological chemistry 2006, 281(18):12289–12299. 10.1074/jbc.M511786200View ArticlePubMedGoogle Scholar
- Barat C, Tremblay MJ: Engagement of CD43 enhances human immunodeficiency virus type 1 transcriptional activity and virus production that is induced upon TCR/CD3 stimulation. The Journal of biological chemistry 2002, 277(32):28714–28724. 10.1074/jbc.M111935200View ArticlePubMedGoogle Scholar
- Perfettini JL, Roumier T, Castedo M, Larochette N, Boya P, Raynal B, Lazar V, Ciccosanti F, Nardacci R, Penninger J, et al.: NF-kappaB and p53 are the dominant apoptosis-inducing transcription factors elicited by the HIV-1 envelope. The Journal of experimental medicine 2004, 199(5):629–640. 10.1084/jem.20031216PubMed CentralView ArticlePubMedGoogle Scholar
- Shyamsundar R, Kim YH, Higgins JP, Montgomery K, Jorden M, Sethuraman A, Rijn M, Botstein D, Brown PO, Pollack JR: A DNA microarray survey of gene expression in normal human tissues. Genome biology 2005, 6(3):R22. 10.1186/gb-2005-6-3-r22PubMed CentralView ArticlePubMedGoogle Scholar
- Whitney AR, Diehn M, Popper SJ, Alizadeh AA, Boldrick JC, Relman DA, Brown PO: Individuality and variation in gene expression patterns in human blood. Proceedings of the National Academy of Sciences of the United States of America 2003, 100(4):1896–1901. 10.1073/pnas.252784499PubMed CentralView ArticlePubMedGoogle Scholar
- Yanai I, Benjamin H, Shmoish M, Chalifa-Caspi V, Shklar M, Ophir R, Bar-Even A, Horn-Saban S, Safran M, Domany E, et al.: Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification. Bioinformatics (Oxford, England) 2005, 21(5):650–659. 10.1093/bioinformatics/bti042View ArticleGoogle Scholar
- Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286(5439):531–537. 10.1126/science.286.5439.531View ArticlePubMedGoogle Scholar
- Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society Series B-Statistical Methodology 2001, 63: 411–423. 10.1111/1467-9868.00293View ArticleGoogle Scholar
- Ertel A, Tozeren A: Switch-like genes populate cell communication pathways and are enriched for extracellular proteins. BMC Genomics 2008, 9: 3. 10.1186/1471-2164-9-3PubMed CentralView ArticlePubMedGoogle Scholar
- Yeung KY, Medvedovic M, Bumgarner RE: Clustering gene-expression data with repeated measurements. Genome Biology 2003, 4: R34. 10.1186/gb-2003-4-5-r34PubMed CentralView ArticlePubMedGoogle Scholar
- Thalamuthu A, Mukhopadhyay I, Zheng X, Tseng GC: Evaluation and comparision of gene clustering methods in microarray analysis. Bioinformatics 2006, 22: 2405–2412. 10.1093/bioinformatics/btl406View ArticlePubMedGoogle Scholar
- The Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447(7145):661–678. 10.1038/nature05911PubMed CentralView ArticleGoogle Scholar
- Stranger BE, Forrest MS, Clark AG, Minichiello MJ, Deutsch S, Lyle R, Hunt S, Kahl B, Antonarakis SE, Tavare S, et al.: Genome-wide associations of gene expression variation in humans. PLoS genetics 2005, 1(6):e78. 10.1371/journal.pgen.0010078PubMed CentralView ArticlePubMedGoogle Scholar
- Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, et al.: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science 2007, 315(5813):848–853. 10.1126/science.1136678PubMed CentralView ArticlePubMedGoogle Scholar
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research 2002, 30(1):207–210. 10.1093/nar/30.1.207PubMed CentralView ArticlePubMedGoogle Scholar
- Parkinson H, Kapushesky M, Shojatalab M, Abeygunawardena N, Coulson R, Farne A, Holloway E, Kolesnykov N, Lilja P, Lukk M, et al.: ArrayExpress – a public database of microarray experiments and gene expression profiles. Nucleic acids research 2007, (35 Database):D747–750. 10.1093/nar/gkl995Google Scholar
- Katz S, Irizarry RA, Lin X, Tripputi M, Porter MW: A summarization approach for Affymetrix GeneChip data using a reference training set from a large, biologically diverse database. BMC bioinformatics 2006, 7: 464. 10.1186/1471-2105-7-464PubMed CentralView ArticlePubMedGoogle Scholar
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (Oxford, England) 2003, 4(2):249–264.View ArticleGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature genetics 2000, 25(1):25–29. 10.1038/75556PubMed CentralView ArticlePubMedGoogle Scholar
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research 2000, 28(1):27–30. 10.1093/nar/28.1.27PubMed CentralView ArticlePubMedGoogle Scholar
- Fan J, May SJ, Zhou Y, Barrett-Connor E: Bimodality of 2-h plasma glucose distributions in whites: the Rancho Bernardo study. Diabetes care 2005, 28(6):1451–1456. 10.2337/diacare.28.6.1451View ArticlePubMedGoogle Scholar
- Lim TO, Bakri R, Morad Z, Hamid MA: Bimodality in blood glucose distribution: is it universal? Diabetes care 2002, 25(12):2212–2217. 10.2337/diacare.25.12.2212View ArticlePubMedGoogle Scholar
- Maclean CJ, Morton NE, Elston RC, Yee S: Skewness in commingled distributions. Biometrics 1976, 32(3):695–699. 10.2307/2529760View ArticlePubMedGoogle Scholar
- Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM alogrithm. Journal of the Royal Statistical Society 1977, 39: 1–38.Google Scholar
- Zhang B, Kirov S, Snoddy J: WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic acids research 2005, (33 Web Server):W741–748. 10.1093/nar/gki475Google Scholar
- Hartigan JA, Wong MA: A K-means clustering algorithm. Applied Statistics 1979, 28: 100–108. 10.2307/2346830View ArticleGoogle Scholar
- Gibbons FD, Roth FP: Judging the quality of gene expression-based clustering methods using gene annotation. Genome Research 2002, 12: 1574–1581. 10.1101/gr.397002PubMed CentralView ArticlePubMedGoogle Scholar
- Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics (Oxford, England) 2002, 18(9):1194–1206. 10.1093/bioinformatics/18.9.1194View ArticleGoogle Scholar
- Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 2002, 97(457):77–87. 10.1198/016214502753479248View ArticleGoogle Scholar
- Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, et al.: Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences of the United States of America 2001, 98(26):15149–15154. 10.1073/pnas.211566398PubMed CentralView ArticlePubMedGoogle Scholar
- Venables WN, Ripley BD: Modern Applied Statistics with S. New York: Springer; 2002.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.