Literature-aided meta-analysis of microarray data: a compendium study on muscle development and disease
© Jelier et al; licensee BioMed Central Ltd. 2008
Received: 28 May 2008
Accepted: 24 June 2008
Published: 24 June 2008
Comparative analysis of expression microarray studies is difficult due to the large influence of technical factors on experimental outcome. Still, the identified differentially expressed genes may hint at the same biological processes. However, manually curated assignment of genes to biological processes, such as pursued by the Gene Ontology (GO) consortium, is incomplete and limited. We hypothesised that automatic association of genes with biological processes through thesaurus-controlled mining of Medline abstracts would be more effective. Therefore, we developed a novel algorithm (LAMA: Literature-Aided Meta-Analysis) to quantify the similarity between transcriptomics studies. We evaluated our algorithm on a large compendium of 102 microarray studies published in the field of muscle development and disease, and compared it to similarity measures based on gene overlap and over-representation of biological processes assigned by GO.
While the overlap in both genes and overrepresented GO-terms was poor, LAMA retrieved many more biologically meaningful links between studies, with substantially lower influence of technical factors. LAMA correctly grouped muscular dystrophy, regeneration and myositis studies, and linked patient and corresponding mouse model studies. LAMA also retrieves the connecting biological concepts. Among other new discoveries, we associated cullin proteins, a class of ubiquitinylation proteins, with genes down-regulated during muscle regeneration, whereas ubiquitinylation was previously reported to be activated during the inverse process: muscle atrophy.
Our literature-based association analysis is capable of finding hidden common biological denominators in microarray studies, and circumvents the need for raw data analysis or curated gene annotation databases.
The comparative analysis of expression microarray studies can refine conclusions and interpretations from individual studies and can be used to identify previously uncharacterized parallels between studies [1, 2]. However, such analyses are hampered by the large influences of biological variation between specimens (see e.g. Eid-Dor et al. ), and technical differences between the studies [4–9] on the identified differentially expressed genes. The varying technical factors include: differences in experimental procedures for the collection of the biological material and for RNA amplification and labeling , differences in sampling times and the DNA microarray platform used (see Kuo et al.  for a recent platform comparison), and the applied statistical analysis .
To overcome this hurdle, it has been suggested that studies should be compared at the level of perturbed biological processes [10, 11]. This could be more robust as different genes may hint at the same process. To identify perturbed biological processes, methodologies have been developed in recent years by analysis of the correlated behaviour of groups of genes with a similar biological function [11–13]. A limitation when using these approaches is that to identify which genes share a biological function, we are currently largely dependent on the ontology-based annotation of genes in manually curated databases. Due to the labor intensive manual curation effort, these databases are necessarily highly focused and notoriously incomplete (see e.g. Khatri et al. ). The best known public databases are the Gene Ontology (GO) annotation project  for biological process, molecular function and cellular localization, and KEGG  for metabolic pathways.
In the present study we introduce an approach to compare expression profiling studies based on perturbed biological processes. Instead of using manually curated gene annotation databases we base our approach on gene associations automatically derived from literature. To identify gene associations, concept profiles are generated for all genes . A concept profile is a weighted list of biological concepts that characterizes the set of documents associated to a gene. Subsequently concept profiles are compared to identify gene associations: pairs of genes strongly associated to the same biological concepts. Finally, DNA microarray datasets are compared based on the observed number of gene associations between the sets of differentially expressed genes. We call our approach LAMA (Literature-Aided Meta-Analysis).
We evaluate our methodology on a compendium of 102 DNA microarray studies published in the field of muscle development and disease, and compare it to analyses based on gene overlap and the classical group overrepresentation analysis. The compendium contains a very diverse set of datasets: patient versus control studies for different myopathies; studies in animal disease models and studies in cultured muscle cells. The studies were performed on 22 different microarray chip types, and in three different organisms: human, mouse and rat. The considerable influence of the statistical analysis on the identified differentially expressed genes , indicates that, ideally, a standardized statistical analysis should precede any comparison between datasets. Unfortunately, raw data is required for such an analysis and they are often unavailable (see also Larsson et al. ). Therefore, we relied on the reported lists of differentially expressed genes, which should be useful for initial comparisons of microarray studies [18, 19] and, at least, were judged by the authors to be biologically relevant. In our evaluation, we first take a directed approach: we measured to which extent the approaches could reproduce a manual clustering of a selection of datasets. Second, we perform an exploratory clustering of all datasets, and characterize and interpret the identified clusters.
Study selection and data retrieval
The 102 microarray datasets in the compendium are represented and annotated [see Additional file 1] and the data underlying the analyses is included [see Additional file 2]. The compendium was extracted from 53 publications and 6 in-house studies. The datasets include studies on myoblast differentiation as an in vitro model for muscle development and regeneration, studies on gene expression differences between different types of skeletal muscles, skeletal muscle disease (including induced muscular atrophy), the effect of exercise and ageing and the treatment with drugs, growth factors or lipid infusion. The compendium was limited to studies in human (N = 37), mouse (N = 51), and rat (N = 13), but included one study in monkey performed with a human DNA microarray platform. To allow for a direct comparison of datasets from different organisms, homologous genes were mapped to each other according to the NCBI's homologene database .
Frequency of differential expression per gene
Pair-wise analysis of similarity between DNA microarray datasets
First, we compared the DNA microarray studies to each other based on gene identity. For every study, the genes interrogated by the DNA microarray platform were separated into three categories: upregulated, downregulated and not up- or downregulated. With the kappa statistic  we measured the chance-corrected level of agreement in the three categories between two studies. By performing a kappa statistic based test  we found that of the 5151 possible dataset pairs only 307 (6%) have an above chance level of agreement (p < 0.05). This is in line with our conclusion of limited overlap in the previous section. Second, our LAMA method found significant associations (p < 0.05) between 2732 (53%) pairs of studies, which indicates that considerably more similarities between datasets are identified with our text-derived gene associations than based on gene list overlap.
Third, we compared datasets based on over-represented GO codes. We tested over-representation of biological processes in the up and down lists of our datasets separately with a cutoff of p < 0.05 (hypergeometric test). Subsequently, we evaluated the similarity between datasets with the kappa measure: Only 18% of the dataset pairs had a significant overlap (p < 0.05), whereas 34% of scores was very low (<0). The used GO over-representation p-value cutoff is overly permissive, as with the high number of tests, we should correct for multiple testing. But when we corrected for multiple testing (Benjamini and Hochberg's method , same chance level), we found not any over-represented GO code for 43 of the 102 studies. With this cut-off, 85% of the kappa scores was 0 or lower and only 212 dataset pairs (4%) had a significant association at the 0.05 significance level. The poor overlap between datasets is partly caused by the fact that it is less likely to identify an over-represented GO code if the gene list is small.
Reproduction of a manual clustering
The dysferlinopathy group has much higher classification rates with LAMA than in the kappa analysis. The studies in this group were more heterogeneous than the other groups in several aspects: it contained human and mouse studies, different mouse strains and differently aged mice, and four different microarray platforms were used in the six studies contained in this group. The human study that compared limb-girdle muscular dystrophy (LGMD) type 2B patients to controls (dataset 16) was much better classified to the dysferlinopathy group with the LAMA based approach than with the kappa based approach (AUC 0.73 vs 0.48). Datasets 16 (human) and 75a (dysferlin-deficient SJL mice versus controls) have no differentially expressed genes in common. Nevertheless, the LAMA score is the lowest possible score given the number of Monte Carlo simulations, which indicates there is a highly significant over-representation of gene associations. We identified "macrophage" as the most important shared concept between dysferlin-deficiency in humans and mice. Indeed, many of the identified associations are between genes known to be expressed in macrophages and macrophage infiltration is an important feature in both the LGMD patients pathology and the mouse model .
The slightly worse performance of the LAMA method in the classification of the myositis studies was due to strong associations with the group of dystrophinopathy studies. The two groups were found connected through concepts pertaining to inflammatory processes. This is reflective of the pronounced inflammatory component in both dystrophinopathies [21, 22, 31–33] and myositis patients. For the extraocular group the lower performance for LAMA is explained by the comparatively poor scores between datasets 4a and both 14a and b, while datasets 4b, 14a and 14b had high pairwise scores. Dataset 4a only contains 13 genes up-regulated genes, which limits the power for the LAMA analysis. The list shared only 2 genes with 14a and b, but still the kappa score was comparatively high due to the limited overlap overall. Classification for the GO-based over-representation analysis (p < 0.05, no correction for multiple testing) showed poor results: performance was considerably worse than based on gene list overlap for 5 of the 7 groups, a similar score was obtained for the dysferlinopathy group and a slightly better score for the ageing group. These results and a view on the shared GO codes indicated the used test condition was too lenient and spurious GO codes were assigned. Based on these results and the poor results presented in the previous section, we do not discuss the clustering based on the method here, but include the classification results and the clustering as supplementary material [see Additional file 3].
Classification of new studies
To demonstrate the utility of our approach for the interpretation of gene lists from new experiments, we compared to our manual grouping the gene lists from a recent paper on dy/dy mice . These mice have a muscular dystrophy as a consequence of a genetic defect in alpha-2 laminin. Our LAMA-approach classifies this study with high confidence in the dystrophinopathy group (AUC = 0.83). This is correct given the pathology of these mice and the two genetic deficiencies affecting the same macromolecular protein complex. The shared biological concepts between this dataset and a dataset from mdx mice (dataset 1; ), were the infiltration of macrophages and differential expression of collagens, metalloproteinases, cathepsins, and HLA-antigens.
Dataset clustering based on kappa statistic
Apart from the large dystrophy/myositis cluster of studies, there is only very limited overlap between the gene lists from the studies, as expected based on the pair-wise analysis presented above. Cluster 2 contains two studies investigating the spared EOM muscle in dystrophin-deficient mdx mice and the expression profiles of the diaphragm and hind limb muscles of presymptomatic mdx mice. Cluster 3 contains 4 highly overlapping studies (datasets 5b-d) from the same paper on developmental changes in the EOM muscle. Again, clusters 2 and 3 contain only studies done by the same group on the same platform.
Dataset clustering based on LAMA
Characterizing concepts for clusters identified through LASSO analysis.
Biological Concepts (Up)
Biological Concepts (Down)
Atrophy – PABPN1 overexpression
Amino acyl tRNA synthetases, spermidine, polyamines, spermine, eukaryotic initiation factors
Platelet-derived growth factor, transforming growth factor-beta, insulin-like growth factor binding proteins
Adipocytes, acyl CoA dehydrogenase
Cyclins, keratin, cyclin-dependent kinases
Troponin, matrix metalloproteases
Mitogen activated protein kinases, insulin, ERK1 acitivity, phosphorylation
Dystrophin defiency in EOM muscle
Chemokine, chemokine receptor
T-lymphocyte, phosphotransferases, phosphorylation, mitogen-activated protein kinases, integrins, cell cycle
Cullin proteins, mitogen-activated protein kinases, ligase
Troponin, tropomyosin, nemaline myopathies, sarcomeres, myosin heavy chain, calsequestrin
Inhibitor of differentiation proteins, E2F transcription factors, proteoglycan, cell cycle proteins
Leptin, desaturase, myosin heavy chains, neural cell adhesion molecules
Mitogen-activated protein kinases
Heterogeneous nuclear ribonucleoproteins, protein sumoylation, small nuclear ribonucleoprotein
Cluster 1 was not found by the kappa-based clustering. Analysis of the underlying concept associations revealed similarities between the molecular processes during induced atrophy in cultured myoblasts (dataset 63) and in vivo models for muscular atrophy, i.e. during hind limb suspension or in space-flown rats (datasets 28a, 28b and 71). Amongst others, there is an interesting set of non-overlapping members of the semaphorin family shared between atrophy studies. Semaphorins are presumably involved in cell-cell contacts in neuronal cells  (during axon regeneration) but also in fusing myoblasts . For cluster 1 we observe an increase in metabolic activity (both glycolytic and fatty acid oxidation) and a downregulation of extracellular matrix proteins: These processes seem to be revelant to age-related changes in EOM muscle (datasets 5b-d; subcluster 1b), and diverse myopathies mitochondrial encephelo myopathy (dataset 41), nemaline myopathy (dataset 34), and oculopharyngeal muscular dystrophy (OPMD; dataset 57). Cluster 3 contains all the ageing and sarcopenia studies. Interestingly, also a cell model for over-expression of the polyadenylation factor PABPN1 (also responsible for OPMD, dataset 35) is found in this cluster. In this case, the differences in RNA metabolism induced in the cell model aid the interpretation of the molecular phenotype observed in the ageing studies. Differences in RNA processing and splicing during ageing were also noted by the authors of the ageing studies [41, 42].
The overlap between the gene lists of the microarray datasets in our compendium was limited, even though the studied phenomena were closely related. In addition, studies performed in the same laboratory or on the same microarray platform were more likely to demonstrate overlap than studies where more heterogeneous technologies and analysis approaches were used. The comparative analysis of the datasets through literature-derived gene associations resulted in the finding of many biologically relevant associations, and was more biology- than technology-driven. Both the analysis of the hierarchical clusterings and the reproduction of the manual clustering revealed that the LAMA method identified useful associations between datasets that were not retrieved by looking at gene overlap. Our method found these associations through correctly retrieved shared biological processes between the datasets.
Standard exploratory analysis based on an over-representation analysis of GO categories was not very powerful for our compendium, as shown by the lack of overlap between studies and the poor classification results. In general, the hypergeometric test will not often identify over-represented GO categories when short gene lists are analyzed. Yet, our association-based method was still able to find useful associations between datasets, even in cases where not a single shared over-represented GO code was found. Also, the associations we use cover a much broader range than GO. Indeed, not all of the concepts in table 1 are covered by the GO thesaurus (e.g. leptin). In addition, even if an appropriate GO term exists, it may not have been assigned any genes yet (e.g., cullin deneddylation).
Our broad network of associations increases our sensitivity for identifying interesting associations between datasets. We chose to use associations derived from literature to optimize for serendipity, but the network of associations could be taken from any source, including GO. An important feature of this approach is that by modulating the associations that are taken up in the network, the specificity and sensitivity of the found associations between datasets can be controlled.
Tomlins et al.  performed an over-representation analysis on gene lists representing the different stages of prostate cancer and used identified over-represented gene groups to compare the disease stages. The basis for their analysis was a database of 14000 groups of genes that share a relevant characteristic, or "molecular concept". They do not report low recall or lack of overlap of over-represented "molecular concepts". The likely explanation is that their meticulous sample preparation and highly standardized data generation and analysis avoided lack of overlap at the gene level and short gene lists. Clearly access to the raw data is commendable for exploratory studies, and standardized data generation is extremely useful given the high levels of variance observed with microarray experiments. It should be noted though that, besides the limited availability of raw data, the statistical models for the analysis of the raw data are hard to standardize. The choice for a statistical model depends on the design of the study, e.g. time course experiments or group comparisons, and technical factors, such as whether a one or two color microarray is used. Therefore perhaps the strongest point of our approach is that even with a wide range of study designs and statistical evaluations, across various platforms and species, a useful and insightful exploratory study was possible.
The issue of comparability between studies has been addressed for meta-analyses with a different objective than ours; the aggregation of information obtained from different DNA microarray studies [44–47]. It has been suggested that DNA microarray datasets could well be compared by the use of rankings of genes based on the level of significance of differential expression [47, 48]. Indeed, also GSEA, a current popular method to test whether a set of genes is associated with the experimental variable is based on gene rankings . It would be an interesting extension of the current work to use rankings in the dataset comparisons. A rank-based approach could be adapted to incorporate our text-based associations between genes. Also in this case, information on the statistical ranks is, however, frequently unavailable.
A limitation of the current LAMA analysis is that it relies on simulations to derive a measure to compare datasets. Simulations are computationally intensive, and have a resolution proportional to and limited by the number of performed iterations. The results presented here can be considered a proof of the utility of our approach, and a logical next step is to derive a model-based approximation as an alternative to our simulation-based measure. This would also avoid the necessity for the setting of a threshold on the gene association score. The currently applied threshold of 1% appears to be optimal [see Additional file 4]. When lowering the threshold, results will be more similar to the kappa method, as identity relations will start to dominate, whereas raising the threshold introduces noise and spurious connections.
The compendium of studies showed limited overlap on gene ids, and a bias towards higher overlap between studies with technical similarities. The over-representation analysis based on GO categories was not very helpful in comparing studies, due to limited sensitivity and the incompleteness of the manually curated gene annotations. Compared to these approaches LAMA provided more biology- than technology-driven results and identified more biologically relevant associations between datasets. As the shared biological processes between studies could also be easily recognized, we believe LAMA is a powerful approach for the comparative meta-analysis of DNA microarray datasets.
In our meta-analysis, we included DNA microarray studies on skeletal muscle development and/or disease. The compendium was limited to studies in human, mouse, and rat. Studies were included till December 2005. From each paper, lists of up- and downregulated genes were extracted from the tables reported in the paper or in the supplementary data. The compendium is not complete. For some of the studies, data could not be retrieved and requests for gene lists to the authors were unsuccessful. Since a full list of genes interrogated by each platform was essential for statistical analysis, studies on home-made arrays for which this information was not available had to be omitted as well. All probes on the array were mapped to Entrez Gene IDs. To be able to compare gene lists from the different organisms we mapped homologous genes to each other based on NCBI's HomoloGene database .
Comparing DNA microarray experiments based on gene identity
where P(A) is the proportion of times that the two experiments give the same tag to a gene and P(E) is the expected proportion of times that the experiments give the same tag to a dataset. When calculating kappa we only consider the genes that are present on both platforms. If the DNA microarray datasets show identical results (P(A) = 1) then κ = 1. If the agreement is close to the level of agreement expected to occur by chance (P(A) ≈ P(E)) then κ ≈ 0.
Recognizing references to concepts in texts
The corpus of literature for our experiments consisted of 3,160,002 MEDLINE abstracts, selected with the PubMed query "(protein OR gene) AND mammals". We used titles, MeSH headings, and abstracts. Stop words were removed and words were stemmed to their uninflected form by the LVG normalizer . We used a thesaurus to identify concepts in texts. The thesaurus was composed of two parts: the 2006AC version of the UMLS thesaurus  and a gene thesaurus derived from multiple databases. The gene thesaurus was a combination of gene names from the rat genome database , mouse genome database , and a human gene thesaurus from several databases . Homologous genes between the three species were mapped to each other using NCBI's HomoloGene database . In order to exclude irrelevant concepts, two molecular biologists created a list of UMLS semantic types [see Additional file 5 for the complete list] relevant for biological information about genes. All concepts with other semantic types were removed from the thesaurus. Following Aronson , the UMLS thesaurus was also adapted for efficient natural language processing, avoiding overly ambiguous or duplicate terms, and terms that are very unlikely to be found in natural text. The gene thesaurus was expanded by rewrite rules to take into account common spelling variations . For instance, numbers were replaced with roman numerals and vice versa, and hyphens before numbers at the end of gene symbols were inserted or removed (e.g. "WAF1" was rewritten as "WAF-1" and added as a synonym).
Concept profile methodology
For every gene in our thesaurus that we identified in at least 5 documents, we characterized the documents in which the gene occurs with a concept profile. A concept profile of a concept i, for instance a gene, is an M-dimensional vector w i = (wi 1, wi 2, ⋯, w iM ) where M is the number of concepts in the thesaurus. The weight w ij for a concept j in this profile indicates the strength of its association to the concept i. The weights in a concept profile for concept i are derived from the set of documents in which concept i occurs, D i , which is a subset of the total set of documents D.
where O and O i represent the number of concept occurrences in D and D i resp.; o j and o ij represent the number of occurrences of concept j in D and D i resp. The uncertainty coefficient is a normalized variant of the mutual information measure. The symmetric coefficient is the weighted average of the two assymmetric uncertainty coefficients: 1. the proportion of information in Y explained by knowledge of X and 2. the proportion of information in X explained by knowledge of Y.
Literature-based comparison of gene lists
The similarity of the concept profiles of two genes was measured with the cosine similarity score . If the similarity score exceeded a threshold, then the two genes were considered to have an association. For our experiments here, we calculated the similarity score for all pairs of genes for which a concept profile was available; the highest scoring 1% of pairs were taken as associations. A justification for the use of this threshold is given in the supplementary material. Subsequently, the associations are used to compare two gene lists. To do this, two gene lists were considered as separate sets of nodes, and the number of associations between the two were counted. We assessed how uncommon the observed number of associations was, by means of a distribution representing unassociated genelists. This distribution was estimated based on Monte Carlo simulations. For each simulation we performed the following two steps: 1. For each gene list we randomly selected a number of genes equal to the size of the gene list. These genes were selected from the genes present on the appropriate DNA microarray platform for which we had a concept profile available. 2. The number of connections between the two new gene lists were counted. Using the empirical distribution we subsequently estimated the chance of observing the given number of associations or more.
For each DNA microarray experiment we retrieved two gene lists, the upregulated and the downregulated genes. When comparing two experiments, p-values were computed for the two up and down lists; the final LAMA score was obtained by taking the log of the product of the two p-values. The log of the p-value product is commonly used in meta-analysis and is known as Fisher's method . This score has been shown to follow a chi-square distribution, and can be used to derive a p-value. Here the measure is used to identify strongly associated datasets.
In order to interpret the LAMA score between two datasets we developed a computer program. The program shows for every gene in one set the associations that connect it to the other set. The biomedical concepts that underlie the gene associations could readily be retrieved and traced back to the literature through an incorporated version of Anni, a tool we published earlier . To annotate a cluster of datasets we calculate the percentual contribution to the number of annotations for every gene, averaged over all dataset comparisons between cluster members. Subsequently we identified descriptive concepts for the cluster by retrieving concepts strongly associated to the top-ranking genes through the Anni annotation view. For table 1 we used as cutoffs 0.2% for selecting the genes and concepts in the annotation view were selected when their contribution was larger than 1. For brevity genes were excluded from this table. Of partially redundant concepts (e.g. "heterogeneous nuclear ribonucleoproteins" and "heterogeneous nuclear ribonucleoproteins activity") only the highest scoring concept was shown.
High-throughput analysis of over-represented GO-terms
To evaluate if a set of differentially expressed genes shows an over-representation of genes belonging to a certain biological process, molecular function or cellular localization, as annotated by the Gene Ontology (GO) consortium , a hypergeometric test is commonly used, see e.g. the web tools DAVID  and GOTM . We used the HyperGTest from the GOstats 2.0.4 package from the bioconductor open source software platform . Only GO terms from the branch "Biological Process" subset of GO-terms were evaluated, since this was most relevant to the biological problem. To perform the test, an annotation package was built per species with the AnnBuilder 1.12.0 package in R, for the concatenated list of Entrez Gene identifiers represented by the relevant platforms. We analyzed up and down regulated gene lists separately. Similar to how we calculated the similarity of gene lists based on gene identifiers, the kappa statistic was used to calculate the similarity of significantly overrepresented GO-terms (p < 0.05) in the up-and downregulated gene sets from two microarray datasets.
Reproduction of a manual clustering
We tested to which extent a manual grouping based on studied biological phenomena (cf. table 1) was reflected by the pair-wise similarity dataset scores by performing a classification experiment: The association measures were used to produce a ranking of the set of studies relative to one so-called seed study. All studies in turn served as a seed, producing a ranking for each of the other studies in the groups. Studies from the same group as the seed study were considered positive, studies from other groups negative cases. Based on the sorted list of positive and negative cases we constructed for each study a receiver operating characteristics (ROC) curve . The area under the curve (AUC) was used as a performance measure . An AUC of 1 represents perfect ordering, i.e. the studies from the same group as the selected study hold the top ranks, and an AUC of 0.5 is the expected score for a random ordering .
Clustering DNA microarray data experiments
The DNA microarray studies can be compared to each other through the LAMA and kappa measures. To identify patterns in these associations we clustered the studies through agglomerative hierarchical clustering and subsequently annotated the identified clusters. For this purpose the LAMA scores were -log10 transformed.
The area under the receiver operating characteristics (ROC) curve
limb-girdle muscular dystrophy
oculopharyngeal muscular dystrophy.
This work was conducted within the Centre for Medical Systems Biology (CMSB), established by the Netherlands Genomics Initiative/Netherlands Organisation for Scientific Research (NGI/NWO). RJ was supported by the ErasmusMC Breedtestrategie.
- Rhodes DR, Chinnaiyan AM: Integrative analysis of the cancer transcriptome. Nat Genet 2005, 37(Suppl):S31-S37. 10.1038/ng1570View ArticlePubMedGoogle Scholar
- Larsson O, Wennmalm K, Sandberg R: Comparative microarray analysis. OMICS 2006, 10(3):381–397. 10.1089/omi.2006.10.381View ArticlePubMedGoogle Scholar
- Ein-Dor L, Kela I, Getz G, Givol D, Domany E: Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 2005, 21(2):171–178. 10.1093/bioinformatics/bth469View ArticlePubMedGoogle Scholar
- Tan PK, Downey TJ, Spitznagel EL, Xu P, Fu D, Dimitrov DS, Lempicki RA, Raaka BM, Cam MC: Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res 2003, 31(19):5676–5684. 10.1093/nar/gkg763PubMed CentralView ArticlePubMedGoogle Scholar
- Mah N, Thelin A, Lu T, Nikolaus S, Kuehbacher T, Gurbuz Y, Eickhoff H, Kloeppel G, Lehrach H, Mellgaerd B, Costello CM, Schreiber S: A comparison of oligonucleotide and cDNA-based microarray systems. Physiol Genomics 2004, 16(3):361–370. 10.1152/physiolgenomics.00080.2003View ArticlePubMedGoogle Scholar
- Shi L, Tong W, Fang H, Scherf U, Han J, Puri RK, Frueh FW, Goodsaid FM, Guo L, Su Z, Han T, Fuscoe JC, Xu ZA, Patterson TA, Hong H, Xie Q, Perkins RG, Chen JJ, Casciano DA: Cross-platform comparability of microarray technology: intra-platform consistency and appropriate data analysis procedures are essential. BMC Bioinformatics 2005, 6(Suppl 2):S12. 10.1186/1471-2105-6-S2-S12PubMed CentralView ArticlePubMedGoogle Scholar
- Nimgaonkar A, Sanoudou D, Butte AJ, Haslett JN, Kunkel LM, Beggs AH, Kohane IS: Reproducibility of gene expression across generations of Affymetrix microarrays. BMC Bioinformatics 2003, 4: 27. 10.1186/1471-2105-4-27PubMed CentralView ArticlePubMedGoogle Scholar
- Draghici S, Khatri P, Eklund AC, Szallasi Z: Reliability and reproducibility issues in DNA microarray measurements. Trends Genet 2006, 22(2):101–109. 10.1016/j.tig.2005.12.005PubMed CentralView ArticlePubMedGoogle Scholar
- Kuo WP, Liu F, Trimarchi J, Punzo C, Lombardi M, Sarang J, Whipple ME, Maysuria M, Serikawa K, Lee SY, McCrann D, Kang J, Shearstone JR, Burke J, Park DJ, Wang X, Rector TL, Ricciardi-Castagnoli P, Perrin S, Choi S, Bumgarner R, Kim JH, Short GF, Freeman MW, Seed B, Jensen R, Church GM, Hovig E, Cepko CL, Park P, Ohno-Machado L, Jenssen TK: A sequence-oriented comparison of gene expression measurements across different hybridization-based technologies. Nat Biotechnol 2006, 24(7):832–840. 10.1038/nbt1217View ArticlePubMedGoogle Scholar
- Manoli T, Gretz N, Groene HJ, Kenzelmann M, Eils R, Brors B: Group testing for pathway analysis improves comparability of different microarray datasets. Bioinformatics 2006, 22(20):2500–2506. 10.1093/bioinformatics/btl424View ArticlePubMedGoogle Scholar
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA 2005, 102(43):15545–15550. 10.1073/pnas.0506580102PubMed CentralView ArticlePubMedGoogle Scholar
- Goeman JJ, Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004, 20: 93–99. 10.1093/bioinformatics/btg382View ArticlePubMedGoogle Scholar
- Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression. Genomics 2003, 81(2):98–104. 10.1016/S0888-7543(02)00021-6View ArticlePubMedGoogle Scholar
- Khatri P, Done B, Rao A, Done A, Draghici S: A semantic analysis of the annotations of the human genome. Bioinformatics 2005, 21(16):3416–3421. 10.1093/bioinformatics/bti538PubMed CentralView ArticlePubMedGoogle Scholar
- Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, (32 Database):D262-D266. 10.1093/nar/gkh021
- Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 2000, 28: 27–30. 10.1093/nar/28.1.27PubMed CentralView ArticlePubMedGoogle Scholar
- Jelier R, Jenster G, Dorssers LCJ, Wouters B, Hendriksen P, Mons B, Delwel R, Kors JA: Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation. BMC Bioinformatics 2007, 8: 14. 10.1186/1471-2105-8-14PubMed CentralView ArticlePubMedGoogle Scholar
- Cahan P, Ahmad AM, Burke H, Fu S, Lai Y, Florea L, Dharker N, Kobrinski T, Kale P, McCaffrey TA: List of lists-annotated (LOLA): a database for annotation and comparison of published microarray gene lists. Gene 2005, 360: 78–82. 10.1016/j.gene.2005.07.008View ArticlePubMedGoogle Scholar
- Finocchiaro G, Mancuso F, Muller H: Mining published lists of cancer related microarray experiments: identification of a gene expression signature having a critical role in cell-cycle control. BMC Bioinformatics 2005, 6(Suppl 4):S14. 10.1186/1471-2105-6-S4-S14PubMed CentralView ArticlePubMedGoogle Scholar
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Geer LY, Kapustin Y, Khovayko O, Landsman D, Lipman DJ, Madden TL, Maglott DR, Ostell J, Miller V, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Sirotkin K, Souvorov A, Starchenko G, Tatusov RL, Tatusova TA, Wagner L, Yaschenko E: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2007, (35 Database):D5–12. 10.1093/nar/gkl1031
- Turk R, Sterrenburg E, Wees CGC, de Meijer EJ, de Menezes RX, Groh S, Campbell KP, Noguchi S, van Ommen GJB, den Dunnen JT, 't Hoen PAC: Common pathological mechanisms in mouse models for muscular dystrophies. FASEB J 2006, 20: 127–129.PubMedGoogle Scholar
- Porter JD, Khanna S, Kaminski HJ, Rao JS, Merriam AP, Richmonds CR, Leahy P, Li J, Guo W, Andrade FH: A chronic in ammatory response dominates the skeletal muscle molecular signature in dystrophin-deficient mdx mice. Hum Mol Genet 2002, 11(3):263–272. 10.1093/hmg/11.3.263View ArticlePubMedGoogle Scholar
- Li P, Oparil S, Feng W, Chen YF: Hypoxia-responsive growth factors upregulate periostin and osteopontin expression via distinct signaling pathways in rat pulmonary arterial smooth muscle cells. J Appl Physiol 2004, 97(4):1550–8. discussion 1549 10.1152/japplphysiol.01311.2003View ArticlePubMedGoogle Scholar
- Wang D, Oparil S, Feng JA, Li P, Perry G, Chen LB, Dai M, John SWM, Chen YF: Effects of pressure overload on extracellular matrix expression in the heart of the atrial natriuretic peptide-null mouse. Hypertension 2003, 42: 88–95. 10.1161/01.HYP.0000074905.22908.A6View ArticlePubMedGoogle Scholar
- Kii I, Amizuka N, Minqi L, Kitajima S, Saga Y, Kudo A: Periostin is an extracellular matrix protein required for eruption of incisors in mice. Biochem Biophys Res Commun 2006, 342(3):766–772. 10.1016/j.bbrc.2006.02.016View ArticlePubMedGoogle Scholar
- Trueblood NA, Xie Z, Communal C, Sam F, Ngoy S, Liaw L, Jenkins AW, Wang J, Sawyer DB, Bing OH, Apstein CS, Colucci WS, Singh K: Exaggerated left ventricular dilation and reduced collagen deposition after myocardial infarction in mice lacking osteopontin. Circ Res 2001, 88(10):1080–1087. 10.1161/hh1001.090842View ArticlePubMedGoogle Scholar
- Fleiss J: Measuring nominal scale agreement among many raters. Psychological Bulletin 1971, 76: 378–382. 10.1037/h0031619View ArticleGoogle Scholar
- Siegel S, Castellan N: Nonparametric statistics for the behavorial sciences. McGraw-Hill, New York; 1988.Google Scholar
- Hochberg Y, Benjamini Y: More powerful procedures for multiple significance testing. Stat Med 1990, 9(7):811–818. 10.1002/sim.4780090710View ArticlePubMedGoogle Scholar
- Nemoto H, Konno S, Nakazora H, Miura H, Kurihara T: Histological and immunohistological changes of the skeletal muscles in older SJL/J mice. Eur Neurol 2007, 57: 19–25. 10.1159/000097005View ArticlePubMedGoogle Scholar
- Chen YW, Nagaraju K, Bakay M, McIntyre O, Rawat R, Shi R, Hoffman EP: Early onset of in ammation and later involvement of TGFbeta in Duchenne muscular dystrophy. Neurology 2005, 65(6):826–834. 10.1212/01.wnl.0000173836.09176.c4View ArticlePubMedGoogle Scholar
- Deconinck N, Dan B: Pathophysiology of duchenne muscular dystrophy: current hypotheses. Pediatr Neurol 2007, 36: 1–7. 10.1016/j.pediatrneurol.2006.09.016View ArticlePubMedGoogle Scholar
- Turk R, Sterrenburg E, de Meijer EJ, van Ommen GJB, den Dunnen JT, 't Hoen PAC: Muscle regeneration in dystrophin-deficient mdx mice studied by gene expression profiling. BMC Genomics 2005, 6: 98. 10.1186/1471-2164-6-98PubMed CentralView ArticlePubMedGoogle Scholar
- van Lunteren E, Moyer M, Leahy P: Gene expression profiling of diaphragm muscle in alpha2-laminin (merosin)-deficient dy/dy dystrophic mice. Physiol Genomics 2006, 25: 85–95. 10.1152/physiolgenomics.00226.2005View ArticlePubMedGoogle Scholar
- Bakay M, Zhao P, Chen J, Hoffman EP: A web-accessible complete transcriptome of normal human and DMD muscle. Neuromuscul Disord 2002, 12(Suppl 1):S125-S141. 10.1016/S0960-8966(02)00093-7View ArticlePubMedGoogle Scholar
- Boer JM, de Meijer EJ, Mank EM, van Ommen GB, den Dunnen JT: Expression profiling in stably regenerating skeletal muscle of dystrophin-deficient mdx mice. Neuromuscul Disord 2002, 12(Suppl 1):S118-S124. 10.1016/S0960-8966(02)00092-5View ArticlePubMedGoogle Scholar
- Cao PR, Kim HJ, Lecker SH: Ubiquitin-protein ligases in muscle wasting. Int J Biochem Cell Biol 2005, 37(10):2088–2097. 10.1016/j.biocel.2004.11.010View ArticlePubMedGoogle Scholar
- Glass DJ: Molecular mechanisms modulating muscle mass. Trends Mol Med 2003, 9(8):344–350. 10.1016/S1471-4914(03)00138-2View ArticlePubMedGoogle Scholar
- Pasterkamp RJ, Verhaagen J: Semaphorins in axon regeneration: developmental guidance molecules gone wrong? Philos Trans R Soc Lond B Biol Sci 2006, 361(1473):1499–1511. 10.1098/rstb.2006.1892PubMed CentralView ArticlePubMedGoogle Scholar
- Ko JA, Gondo T, Inagaki S, Inui M: Requirement of the transmembrane semaphorin Sema4C for myogenic differentiation. FEBS Lett 2005, 579(10):2236–2242. 10.1016/j.febslet.2005.03.022View ArticlePubMedGoogle Scholar
- Welle S, Brooks AI, Delehanty JM, Needler N, Thornton CA: Gene expression profile of aging in human muscle. Physiol Genomics 2003, 14(2):149–159.View ArticlePubMedGoogle Scholar
- Welle S, Brooks AI, Delehanty JM, Needler N, Bhatt K, Shah B, Thornton CA: Skeletal muscle gene expression profiles in 20–29 year old and 65–71 year old women. Exp Gerontol 2004, 39(3):369–377. 10.1016/j.exger.2003.11.011View ArticlePubMedGoogle Scholar
- Tomlins SA, Mehra R, Rhodes DR, Cao X, Wang L, Dhanasekaran SM, Kalyana-Sundaram S, Wei JT, Rubin MA, Pienta KJ, Shah RB, Chinnaiyan AM: Integrative molecular concept modeling of prostate cancer progression. Nat Genet 2007, 39: 41–51. 10.1038/ng1935View ArticlePubMedGoogle Scholar
- Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM: Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci USA 2004, 101(25):9309–9314. 10.1073/pnas.0401994101PubMed CentralView ArticlePubMedGoogle Scholar
- Wang J, Coombes KR, Highsmith WE, Keating MJ, Abruzzo LV: Differences in gene expression between B-cell chronic lymphocytic leukemia and normal B cells: a meta-analysis of three microarray studies. Bioinformatics 2004, 20(17):3166–3178. 10.1093/bioinformatics/bth381View ArticlePubMedGoogle Scholar
- Parmigiani G, Garrett-Mayer ES, Anbazhagan R, Gabrielson E: A cross-study comparison of gene expression studies for the molecular classification of lung cancer. Clin Cancer Res 2004, 10(9):2922–2927. 10.1158/1078-0432.CCR-03-0490View ArticlePubMedGoogle Scholar
- DeConde RP, Hawley S, Falcon S, Clegg N, Knudsen B, Etzioni R: Combining results of microarray experiments: a rank aggregation approach. Stat Appl Genet Mol Biol 2006, 5: Article15.PubMedGoogle Scholar
- Yuen T, Wurmbach E, Pfeffer RL, Ebersole BJ, Sealfon SC: Accuracy and calibration of commercial oligonucleotide and custom cDNA microarrays. Nucleic Acids Res 2002, 30(10):e48. 10.1093/nar/30.10.e48PubMed CentralView ArticlePubMedGoogle Scholar
- McCray AT, Srinivasan S, Browne AC: Lexical methods for managing variation in biomedical terminologies. Proc Annu Symp Comput Appl Med Care 1994, 235–239.Google Scholar
- Bodenreider O: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004, (32 Database):D267-D270. 10.1093/nar/gkh061
- Rat Genome Database Web Site, Medical College of Wisconsin, Milwaukee, Wisconsin World Wide Web 2006. [http://rgd.mcw.edu/]
- Mouse Genome Database (MGD), Mouse Genome Informatics Web Site, The Jackson Laboratory, Bar Harbor, Maine World Wide Web 2006. [http://www.informatics.jax.org]
- Kors J, Schuemie M, Schijvenaars B, Weeber M, Mons B: Combination of genetic databases for improving identification of genes and proteins in text. Biolink Conference, Detroit 2005.Google Scholar
- Aronson AR: Filtering the UMLS metathesaurus for MetaMap. Tech rep, National Library of Medicine; 2006.Google Scholar
- Schuemie MJ, Mons B, Weeber M, Kors JA: Evaluation of techniques for increasing recall in a dictionary approach to gene and protein name identification. J Biomed Inform 2007, 40(3):316–324. 10.1016/j.jbi.2006.09.002View ArticlePubMedGoogle Scholar
- Goodman L, Kruskal W: Measures of association for cross classifications. Springer-Verlag, New York; 1979.View ArticleGoogle Scholar
- Jelier R, Schuemie M, Roes P, Van Mulligen E, Kors J: Literature-based concept profiles for gene annotation: the issue of weighting. Int J of Med Inform 2008, 77: 354–362. 10.1016/j.ijmedinf.2007.07.004View ArticleGoogle Scholar
- Salton G: Automatic text processing: The transformation, analysis, and retrieval of information by computer. Addison-Wesley, Reading, MA; 1989.Google Scholar
- Fisher R: Combining independent tests of significance. American Statistician 1948, 2: 30. 10.2307/2681650Google Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25: 25–29. 10.1038/75556PubMed CentralView ArticlePubMedGoogle Scholar
- Dennis G, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003, 4(5):P3. 10.1186/gb-2003-4-5-p3View ArticlePubMedGoogle Scholar
- Zhang B, Kirov S, Snoddy J: WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res 2005, (33 Web Server):W741-W748. 10.1093/nar/gki475
- Gentleman R, Carey V, Bates D, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini A, Sawitzki G, Smith C, Smyth G, Tierney L, Yang J, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5: R80. 10.1186/gb-2004-5-10-r80PubMed CentralView ArticlePubMedGoogle Scholar
- Swets JA: Measuring the accuracy of diagnostic systems. Science 1988, 240(4857):1285–1293. 10.1126/science.3287615View ArticlePubMedGoogle Scholar
- Hanley JA, McNeil BJ: The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143: 29–36.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.