- Software
- Open Access
sdef: an R package to synthesize lists of significant features in related experiments
- Marta Blangiardo^{1}Email author,
- Alberto Cassese^{2} and
- Sylvia Richardson^{1}
https://doi.org/10.1186/1471-2105-11-270
© Blangiardo et al; licensee BioMed Central Ltd. 2010
- Received: 5 January 2010
- Accepted: 20 May 2010
- Published: 20 May 2010
Abstract
Background
In microarray studies researchers are often interested in the comparison of relevant quantities between two or more similar experiments, involving different treatments, tissues, or species. Typically each experiment reports measures of significance (e.g. p-values) or other measures that rank its features (e.g genes). Our objective is to find a list of features that are significant in all experiments, to be further investigated. In this paper we present an R package called sdef, that allows the user to quantify the evidence of communality between the experiments using previously proposed statistical methods based on the ranked lists of p-values. sdef implements two approaches that address this objective: the first is a permutation test of the maximal ratio of observed to expected common features under the hypothesis of independence between the experiments. The second approach, set in a Bayesian framework, is more flexible as it takes into account the uncertainty on the number of genes differentially expressed in each experiment.
Results
We used sdef to re-analyze publicly available data i) on Type 2 diabetes susceptibility in mice on liver and skeletal muscle (two experiments); ii) on molecular similarities between mammalian sexes (three experiments). For the first example, we found between 68 and 104 genes commonly perturbed between the two tissues, using the two methods described above, and enrichment of the inflammation pathways, which are related to obesity and diabetes. For the second example, looking at three lists of features, we found 110 genes commonly perturbed between the three tissues, using the same two methods, and enrichment on genes involved in cell development.
Conclusions
sdef is an R package that provides researchers with an easy and powerful methodology to find lists of features commonly perturbed in two or more experiments to be further investigated. The package is provided with plots and tables to help the user visualize and interpret the results. The Windows, Linux and MacOS versions of the package, together with the documentation are available on the website http://cran.r-project.org/web/packages/sdef/index.html.
Keywords
- Posterior Distribution
- Bayesian Model
- Credibility Interval
- Null Distribution
- Common Gene
Background
In microarray experiments, a commonly encountered problem is the comparison of two or more similar experiments that involve different tissue/treatment/species, with the aim of finding a list of common features perturbed in all experiments. This list should highlight a restricted set of interesting features to be further investigated and validated by direct experimentation. A natural way to proceed considers the intersection of ranked lists of features from each experiment. Here the rank is based on the p-values associated with each experiment, but the same methodology could be applied to other measures of interest as long as they have a common scale across the experiments (e.g. correlation coefficient). Depending on the threshold chosen to declare a gene significant in each list, intersected lists of different size can be produced. The methods implemented in this package give effective ways to derive a meaningful threshold and to return one common list. To statistically assess the intersection lists, we have proposed a novel method [1], which is based on an association ratio quantifying the departure from the null hypothesis of independence between the lists. Several testing procedures were presented in [1]. The first one tests by permutations the maximal ratio between the number of significant features observed in common between the experiments and the number in common under the hypothesis of independence. The second procedure is formulated in a Bayesian framework. It uses a multinomial distribution to model the joint distribution of significant features in the set of experiments. From the output of the Bayesian analysis, several criteria for selecting the intersection list were investigated in an extensive simulation study and compared on the basis of false positives and false negatives [1].
In this paper we describe an R package, called sdef, that enables the user to perform the two procedures proposed, returns a table with the list of genes in common and some illustrative plots.
Implementation
For the sake of clarity, we now briefly recall the methodology on which sdef is based and describe the functions of the package in the setup of two related experiments, presented in the section "Illustrative analysis: Type 2 diabetes susceptibility in mice". However, we stress that the package deals with any number of lists and we include an example about molecular similarities between mammalian sexes for three tissues (section "Illustrative example: molecular similarities between mammalian sexes") sdef only requires as input the p-values associated with the comparison performed in each experiment. In order to make the description more concrete, we phrase it in the context of differential expression (i.e. when the biological focus is on finding genes differentially expressed between two experimental conditions, e.g. in two tissues or in two species), but we emphasize that sdef can be used to synthesize any lists of features of interest, for instance to compare two or more relevance networks and to build a list of significant pairwise associations that are common to the two networks.
Frequentist Test of Maximal Association Ratio
It quantifies the strength of association between the lists in terms of the ratio of observed to expected, to avoid multiple testing issues. We focus attention on the ordinal statistic T(h_{ max }) = max_{ h }T (h) which represents the maximal deviation from the null model of independence between the two experiments. This maximum value is associated with a threshold h_{ max }on the probability measure and with a number O_{11} (h_{ max }) of genes in common which can be selected for further investigations and mined for relevant biological pathways.
The value of the ordinal statistic T(h_{ max }) is tested through a Monte Carlo permutation test and its significance is returned by a Monte Carlo p-value.
Data format for sdef: two lists.
Gene | List.Pval1 | List.Pval2 |
---|---|---|
100005_at | 0.936421204 | 0.91858576 |
100007_at | 0.876117486 | 0.95866826 |
100011_at | 0.410755946 | 0.06171335 |
100016_at | 0.166471395 | 0.76881385 |
100024_at | 0.008681877 | 0.11661176 |
... | ... | ... |
Data format for sdef: three lists. The table presents the typical data format required by sdef using the mice data described in the section "Illustrative analysis: molecular similarities between mammalian sexes" (three lists).
Gene | List.Pval1 | List.Pval2 | List.Pval3 |
---|---|---|---|
1415670_at | 0.01310184 | 0.78514374 | 0.3635318 |
1415671_at | 0.15744532 | 0.40366007 | 0.9661227 |
1415672_at | 0.01613549 | 0.96078200 | 0.1406895 |
141567_at | 0.45965033 | 0.35167466 | 0.6622451 |
1415674_a_at | 0.97597216 | 0.90075596 | 0.7839352 |
1415675_at | 0.15111598 | 0.06903487 | 0.1528421 |
... | ... | ... | ... |
Bayesian Model for Association Ratio
As the model is conjugate, it is easy to sample from the posterior distribution of R(h) given the data and to compute CI(h), the two sided Credibility Intervals for each R(h) as well as the median of the posterior distribution, Median(R(h)) for the desired level.
With the aim of obtaining a common list we propose to use the posterior distribution of R(h) to derive two thresholds, h_{ max }and h_{2} , which characterize respectively two decision rules. The first rule searches for the strongest deviation from independence and it is very specific (few false positives). It is obtained as the maximum of Median(R(h)), called R(h_{ max }) over the subset of credibility intervals which do not include the value 1 and it is equivalent to T(h_{ max }) in the frequentist framework. The second rule uses the largest threshold h where the number of genes called in common at least doubles the number of genes expected in common under independence (Median(R(h)) ≥ 2 = R(h_{2} )). It leads to a fair balance between specificity and sensitivity. See [1] for the details about the simulation studies set up to evaluate the errors associated with the two decision rules.
Results
Common genes found using sdef: two lists.
Rule | T(h) | R(h) | CI _{95%} | O _{11} | O _{1+} | O _{+1} |
---|---|---|---|---|---|---|
h_{ max }= 0.02 | 2.51 | 2.51 | 2.04 - 3.00 | 68 | 264 | 299 |
h_{2} = 0.04 | 2.11 | 1.81 - 2.44 | 104 | 351 | 410 |
Common genes found using sdef: three lists. The table shows a summary of the information on the degree of similarity between the experiments for the mice data described in the section "Illustrative analysis: molecular similarities between mammalian sexes" (three lists). It is obtained running the function createTable. It contains the rule (h_{ max }as h_{2} does not apply to this data as R(h) does not reach 2), T(h), R(h) with its credibility interval, the number of genes in common and the number of differentially expressed genes in each experiment.
Rule | T(h) | R(h) | CI _{95%} | O _{11} | O _{1++} | O _{+1+} | O _{++1} |
---|---|---|---|---|---|---|---|
h_{ max }(freq & Bayesian) = 0.12 | 1.67 | 1.69 | 1.41 - 2.03 | 110 | 1337 | 2126 | 973 |
Finally, extractFeatures.T and extractFeatures.R return the list of the common genes when h_{ max }, h_{2} or an additional user defined threshold has been selected. It also creates a .csv file with the same information which can be used for further investigation, for instance to be included in softwares that perform gene enrichment (e.g. [2, 3]).
Illustrative analysis: Type 2 diabetes susceptibility in mice
We used sdef to re-analyze a publicly available experiment to evaluate the Type 2 diabetes susceptibility in obese and normal mice in different tissues. We focused attention on the differential expression between normal and obese mice in liver and skeletal muscle. The data are available at http://www.ncbi.nlm.nih.gov/geo, accession number GDS1443. The starting point of our methodology and the input for the R package is the matrix of p-values, where each row correspond to a gene (2912) and each column identifies one experiment (2 tissues). We normalized the data using the RMA function [4] implemented in the Affy R package [5] and applied Cyber-T [6] to obtain a list of p-values for each tissue. The format of the data matrix is presented in Table 1.
- 1.
Firstly we explore the similarities between the differential expression of the two tissues through the Frequentist model. For each threshold we calculate the value of the ratio T(h)
> Th <- ratio(data)
- i)
a list with the number of differentially expressed genes in each experiment for each h, the values of the ratio T(h) and the number of genes found in common:
> Th
$h
[1] 0.01 0.02 0.03 ...
$DE
list1 list2
0.01 199 233
0.02 264 299
0.03 305 348
...
$ratios
ratio
0.01 2.449328
0.02 2.508564
0.03 2.277143
...
$common
genes in common
0.01 39
0.02 68
0.03 83
...
- 2.
To compute a p-value for T(h_{ max }) under the hypothesis of independence between the experiments we test T(h_{ max }) using the Monte Carlo method based on permutations:
> MC <- Tmc(Th)
- i)
an empirical p-value which provides the strength of the evidence that the two experiments are associated:
> MC
pvalue < 0.001
- 3.
We ran the Bayesian model, which is less computationally intensive (it takes 12 minutes to do 1000 iterations on a Dell Precision workstation with 2GB of RAM):
> Rh <- baymod(Th)
- i)
a table containing the posterior estimate of R(h) and its 95% credibility interval for each h:
> Rh
2.5% Median 97.5%
1.8263361 2.404265 3.038746
2.0271394 2.503913 3.088150
...
- 4.
Finally the list of genes in common using h_{2} as threshold is obtained:
> genes.R <- extractFeatures.R$rule2
$rule2
Names List.Pval1 List.Pval2
100064_f_at 6.123493e-03 5.005709e-03
100151_at 2.255893e-03 1.454567e-03
100436_at 2.698470e-02 1.199453e-03
...
Focusing attention on this list, CsnK2a2, a casein kinase 2 and Lgals3, a galactin, have been linked to inflammatory conditions in the literature [7, 8], while atf3 (activating transcription factor 3) and Btg1 (B-cell translocation gene 1, anti-proliferative) are stress-related genes; both inflammation and stress are triggered by obesity and diabetes. Moreover, dbp (D site albumin promoter binding protein) has been previously related to diabetes in liver and heart [9], while Enpp2 (autoxin) is associated to severe type 2 diabetes and linked to obesity-associated pathologies in adipose tissues [10]. Our results indicate that the role of these genes is conserved in different tissues, suggesting a systemic response that should be further investigated. sdef thus gives a powerful data mining tool to suggest or confirm hypotheses that require the simultaneous consideration of several experiments.
Illustrative analysis: molecular similarities between mammalian sexes
sdef deals with any number of lists and we provide an example on three lists, re-analyzing a publicly available experiment about molecular similarities between mammalian sexes [11], which focuses attention on several tissues (hypothalamus, kidney and liver). The data are available at http://www.ncbi.nlm.nih.gov/geo, accession number GSE1147-GSE1148.
The matrix with the p-values contains 3 columns: i) p-values of differential expression between male and female mice in kidney, p-values of differential expression between male and female mice in liver, p-values of differential expression between male and female mice in reproductive system. We normalized the data using the RMA function [4] implemented in the Affy R package [5] and applied Cyber-T [6] to obtain a list of p-values for each tissue. We focused attention only on the present genes obtained using the mas5call function implemented in the Affy package. The total number of genes is 6477. The format of the data matrix is presented in Table 2.
The implementation of this example does not differ from what has been presented for two lists, as automatically the package recognizes the number of lists to be used by the number of columns in the data input. For this reason we do not repeat the code illustration, but we focus attention on the results. Note that this example is available as part of the R package (Example3Lists function).
Conclusion
sdef is a collection of functions to perform the comparison of two or more lists of features from similar experiments with the purpose of finding common ones to be further investigated. It is easy to use and since it needs only the lists of p-values as inputs it can be used to obtain results at different levels (gene level, biological function level) allowing the user to customize it to answer different types of biological questions. The methodology and the package can be applied also when a measure different from p-value (e.g. fold change) is used to rank the features in the experiments. However, this has an impact on the selection of the thresholds: fold changes, for instance, vary for each experiment and researchers should define a global range of values that is sensible for synthesizing all the comparisons of interest. Nevertheless the conclusions from the models would not be different using different measures of ranking, as the list of common features obtained will still contain interesting features, only based on a different measure (e.g. fold-change).
In this paper the frequentist and Bayesian approach are treated as two subsequent steps of the analysis, but we want to stress that they can be used independently from one another. The frequentist approach is an easy way to investigate the trend of T(h) and to identify how many features are found in common for different thresholds, but assessing the significance of T(h_{ max }) is extremely time consuming. Moreover, it only considers one rule (h_{ max }), which is more conservative and has been shown to be more affected by false negatives. The main advantage of the Bayesian approach is that it returns more accurate results through h_{2} and is characterized by larger lists of common features, that include all the common genes found using the frequentist approach. h_{2} is less affected by false negatives, but in [1] we showed that also the number of false positives remain relatively small. In addition, the Bayesian approach is extremely flexible, allowing the user to define custom thresholds, different from h_{ max }and h_{2} .
Since our methodology identifies features perturbed in two or more experiments, the proportion of false positives tends to be very small (it was around 0.5%-1.5% in the simulation presented in [1]) and the proportion is reduced as the number of lists increases. To explicitly control for false positives on the experiments under study, the user could get an estimate of the false discovery rate for each features (for instance using the method proposed by Storey in [15]) and use that as ranking statistic.
At present the package does not extend to investigate more complex patterns of association between two or more lists, for example by considering features which are perturbed only in a subset of the experiments and not in the others. This would require a modification of the methodology described in [1], which is currently under way and we plan to extend the package in the future to answer a variety of composite questions.
Availability and requirements
Project name : Synthesizing Differential Expressed Genes (sdef package)
Project home page : http://cran.r-project.org/web/packages/sdef/index.html
Operating systems : Windows, Linux, MacOS
Programming language : R
Other requirements : None
License : GNU2
Any restrictions to use by non-academics : None
Declarations
Acknowledgements
MB started this work while funded by a Wellcome Trust Functional thematic award PC 2910_DHCT. SR acknowledges partial support from BBSRC grant 28 EGM 16093, from BBSRC grant BB/E 020372/1, from MRC grant G 600609 and from MRC grant P 07008_DFHM. AC finalized the package while visiting the Imperial College Department of Epidemiology and Biostatistics.
Authors’ Affiliations
References
- Blangiardo M, Richardson S: Statistical tools for synthesizing lists of differentially expressed features in microarray experiments. Genome Biology 2007, 8: R54. 10.1186/gb-2007-8-4-r54View ArticlePubMedPubMed CentralGoogle Scholar
- Al-Shahrour F, Minguez P, Tarraga J, Montaner D, Alloza E, Vaquerizas J, Conde L, Blaschke C, Vera J, Dopazo J: BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments. Nucleic Acids Research (Web Server issue) 2006, 34: W472-W476. 10.1093/nar/gkl172View ArticleGoogle Scholar
- Subramanian A, Tamayoa P, Mootha V, Mukherjeed S, Eberta B, Gillettea M, Paulovichg A, Pomeroyh S, Goluba T, Landera E, Mesirov J: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 2005, 43(102):15545–15550. 10.1073/pnas.0506580102View ArticleGoogle Scholar
- Bolstad B, Irizarry R, Astrand M, Speed T: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185–193. 10.1093/bioinformatics/19.2.185View ArticlePubMedGoogle Scholar
- Affymetrix Statistical Algorithms Description Document 2001.Google Scholar
- Baldi P, Long A: A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001, 17: 509–519. 10.1093/bioinformatics/17.6.509View ArticlePubMedGoogle Scholar
- Torkamania A, Topola E, Schork N: Pathway analysis of seven common diseases assessed by genome-wide association. Genomics 2008, 92(5):265–272. 10.1016/j.ygeno.2008.07.011View ArticleGoogle Scholar
- Mzhavia N, Yu S, Ikeda S, Chu T, Goldberg I, Dansky H: Neuronatin: A New Inflammation Gene Expressed on the Aortic Endothelium of Diabetic Mice. Diabetes 2008, 57: 2774–2783. 10.2337/db07-1746View ArticlePubMedPubMed CentralGoogle Scholar
- Iveta Herichová I, Michal Zeman M, Stebelová K, Ravingerová T: Effect of streptozotocin-induced diabetes on daily expression of per2 and dbp in the heart and liver and melatonin rhythm in the pineal gland of Wistar rat. Molecular and Cellular Biochemistry 2005, 270(1–2):223–229. 10.1007/s11010-005-5323-yView ArticlePubMedGoogle Scholar
- Boucher J, Quilliot D, Pradère JP, Simon MF, Grès S, Guigné C, Prévot D, Ferry G, Boutin J, Carpéné C, Valet P, Saulnier-Blache JS: Potential involvement of adipocyte insulin resistance in obesity-associated up-regulation of adipocyte lysophospholipase D/autotaxin expression. Diabetologia 2005, 48(3):569–577. 10.1007/s00125-004-1660-8View ArticlePubMedPubMed CentralGoogle Scholar
- Rinn J, Rozowsky J, Laurenzi I, Petersen P, Zou ZWK, Gerstein M, Snyder1 M: Major Molecular Differences between Mammalian Sexes Are Involved in Drug Metabolism and Renal Function. Developmental Cell 2004, 6: 791–800. 10.1016/j.devcel.2004.05.005View ArticlePubMedGoogle Scholar
- Prieto I, Pezzi N, Buesa J, Kremer L, Barthelemy I, Carreiro C, Roncal F, Martinez A, Gomez L, Fernandez R, Martinez A, Barbero J: STAG2 and Rad21 mammalian mitotic cohesins are implicated in meiosis. EMBO Rep 2002, 3(6):543–550. 10.1093/embo-reports/kvf108View ArticlePubMedPubMed CentralGoogle Scholar
- Kim S, Xu B, Kastan M: Involvement of the cohesin protein, Smc1, in Atm-dependent and independent responses to DNA damage. Genes Dev 2002, 16(5):560–570. 10.1101/gad.970602View ArticlePubMedPubMed CentralGoogle Scholar
- Ybot-Gonzalez P, Copp A, Greene N: Expression pattern of glypican-4 suggests multiple roles during mouse development. Developmental Dynamics 2005, 233(3):1013–1017. 10.1002/dvdy.20383View ArticlePubMedGoogle Scholar
- Storey J: The positive false discovery rate: a Bayesian interpretation and the q-value. Annals of Statistics 2003, 31(6):2013–2035. 10.1214/aos/1074290335View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.