Gene set enrichment analysis for multiple continuous phenotypes
 Xiaoming Wang^{1}Email author,
 Saumyadipta Pyne^{2, 3} and
 Irina Dinu^{1}
DOI: 10.1186/1471210515260
© Wang et al.; licensee BioMed Central Ltd. 2014
Received: 21 March 2014
Accepted: 25 July 2014
Published: 3 August 2014
Abstract
Background
Gene set analysis (GSA) methods test the association of sets of genes with phenotypes in gene expression microarray studies. While GSA methods on a single binary or categorical phenotype abounds, little attention has been paid to the case of a continuous phenotype, and there is no method to accommodate correlated multiple continuous phenotypes.
Result
We propose here an extension of the linear combination test (LCT) to its new version for multiple continuous phenotypes, incorporating correlations among gene expressions of functionally related gene sets, as well as correlations among multiple phenotypes. Further, we extend our new method to its nonlinear version, referred as nonlinear combination test (NLCT), to test potential nonlinear association of gene sets with multiple phenotypes. Simulation study and a real microarray example demonstrate the practical aspects of the proposed methods.
Conclusion
The proposed approaches are effective in controlling type I errors and powerful in testing associations between genesets and multiple continuous phenotypes. They are both computationally effective. Naively (univariately) analyzing a group of multiple correlated phenotypes could be dangerous. Rcodes to perform LCT and NLCT for multiple continuous phenotypes are available at http://www.ualberta.ca/~yyasui/homepage.html.
Keywords
DNA microarrays Gene expression Linear combination test Nonlinear combination test Geneset analysisBackground
Microarray data analysis at an individual gene level usually leads to a large list of significant genes, even after multiple comparison adjustment has been made. The process of trying to interpret such a large list of genes is difficult. Molecular biologists have put together lists of genes grouped by function, such as biological pathways or sets of genes. Various pathways or gene sets databases have been compiled, for example, Kyoto Encyclopedia of Genes and Genomes (KEGG) [1, 2], Gene Ontology [3], Biocarta [4] and Molecular Signature Data Base [5]. There has been a shift in focus from gene level analysis to pathway level, or gene set level, with many Gene Set Analysis (GSA) methods being proposed in the past decade. The most popular one is Gene Set Enrichment Analysis (GSEA) [6]. Extensive reviews and methodological discussions were given by Goeman and Buhlmann [7] and Nam and Kim [8].
While GSA methods on a single binary or categorical phenotype abounds, little attention has been paid to the case of a continuous phenotype, and there are no methods to accommodate correlated multiple continuous phenotypes. Such correlated continuous variables are measured routinely in many important clinicopathological observations such as lung functions, tumor size or measurements of marker proteins. A naïve approach to analyzing such data with existing GSA methods would be to categorize the continuous phenotypes into two or more discrete classes, as well as analyze the multiple correlated phenotypes univariately, i.e., one at a time. Such artificial categorization and univariate analyses may lead to less efficiency in geneset analysis and even cause inaccurate identification of significant gene sets, especially if the multiple phenotypes exhibit relatively higher correlations.
There is an important methodological distinction between the competitive and selfcontained GSA approaches [6, 7]. For a binary phenotype, e.g., competitive methods use gene permutation to test whether or not the association of the phenotype with a gene set is similar to its association with the other gene sets (the “Q1 hypothesis”), while selfcontained methods employ sample permutation to test the equality of the means of the two vectors of geneset expressions which correspond to the two phenotype groups (the “Q2 hypothesis”). Here, we focused on the selfcontained methods. Unlike the gene permutation strategy, sample permutation preserves correlation structure within gene sets and correlation structure within phenotypes  a key property that we wish to fully take advantage in the proposed GSA methods.
To the best of our knowledge, although correlations among genes in gene sets have long been observed, correlation structure was considered only in a few GSA methods. These were the modified Hotelling’s T^{2} test for categorical phenotype [9], and the linear combination test (LCT) for binary phenotype [10] and for continuous phenotype [11]. It has been realized that incorporation of correlations among gene expressions in a GSA approach can significantly improve efficiency of the analysis [9]; however, it could also spell a heavy computational burden. The linear combination test was designed to incorporate correlations among gene expressions while overcome the computational burden. In the case of binary phenotype, it has been showed that LCT was much more computationally efficient than the modified Hotelling’s T^{2} test and approximated its superior power very well [10]; in the case of continuous phenotype, it has been showed that LCT was superior in power to the other GSA methods under compare [11].
We propose here an extension of LCT to its new version for multiple continuous phenotypes, incorporating correlations among gene expressions of functionally related gene sets, as well as correlations among multiple phenotypes. Further, we extend the new method to its nonlinear version, referred as nonlinear combination test (NLCT), to test potential nonlinear association between gene sets and multiple phenotypes, especially recommended for analyzing relatively larger microarrays. The extension strategy can also be used for other GSA tools for continuous phenotype/phenotypes, such as Global Test [12]. The rest of the article is organized as follows. In section 2 we give detailed derivations of the two proposed GSA methods. In section 3, we used a simulation study to show the practice aspects of these two proposed methods using various settings on sample size, geneset size, and correlation level among genes and among phenotypes. Section 4 presents the performances of the proposed methods on a real gene expression microarray data from prostate tumor samples of AfricanAmerican prostate cancer patients [13].
Method
Linear combination test for multiple continuous phenotypes
Consider a microarray study on n subjects, with measurements on expressions of a predefined set of P genes X = (x_{1}, …, x_{ p })^{ T } and measurements on a group of q continuous phenotypes Y = (y_{1}, …, y_{ q })^{ T }. Suppose columns in both X and Y are centered and scaled across the subjects. We are interested in testing whether there is a significant linear relationship between the gene set X and the group of phenotypes Y. The null hypothesis to be tested is that expressions of the genes in the predefined gene set X are linearly independent with the phenotypes Y. The multivariate null hypothesis can be expressed linearly and univariately as
H_{ 0 }: There is no association between any of the linear combinations of x_{1}, …, x_{ p }and any of the linear combinations of y_{1}, …, y_{ q }.
To test the linear relationship, let Z(X, A) = a_{1}x_{1} + ⋯ + a_{ p }x_{ p } be a linear combination of x_{1}, …, x_{ p }, and Z(Y, B) = b_{1}y_{1} + ⋯ + b_{ q }y_{ q } a linear combination of y_{1}, …, y_{ q }, where A ∈ R^{ p } and B ∈ R^{ q } represent the coefficient vectors of a_{ i }'s and b_{ j }'s, respectively. For given coefficient vectors A and B of the combination coefficients, we can focus on testing whether the combination Z(X,A) is associated with the combination Z(Y,B). This is a classical correlation test and a commonly used test statistic is based on measuring the Pearson correlation between Z(X,A) and Z(Y,B), i.e. $\rho $ = $\rho $(Z(X,A),Z(Y,B)). If both X and Y are normally distributed, then the statistic $t=\rho \sqrt{\left(n2\right)/\left(1{\rho}^{2}\right)}$ follows a Student's tdistribution with degrees of freedom n 2 under the null hypothesis [14]. This also holds approximately if the observed values are nonnormal, provided sample size n is large enough [15].
The old version of LCT for single continuous phenotype [11] is a special case of it.
where $\alpha ={D}_{X}^{1/2}{U}^{T}A$, $\beta ={D}_{Y}^{1/2}{V}^{T}B$ and ${\Sigma}_{\tilde{X}\tilde{Y}}$ is the covariance matrix between $\tilde{X}$ and $\tilde{Y}$, with its (i,j) entry being $cov\left({\tilde{x}}_{i},{\tilde{y}}_{j}\right)$.
The optimization problem in (4) can be solved in two steps. Firstly, for a given β, find the optimal α, which is proportional to ${\Sigma}_{\tilde{X}\tilde{Y}}\beta $; secondly, substitute the optimal α into (4), and find the global optimal β, which is proportional to the first eigenvector of the matrix ${\Sigma}_{{}_{\tilde{X}\tilde{Y}}}^{T}{\Sigma}_{{}_{\tilde{X}\tilde{Y}}}$ corresponding to the largest eigenvalue. We note that the value of T^{2 *} equals to the largest eigenvalue of either the q × q matrix ${\Sigma}_{{}_{\tilde{X}\tilde{Y}}}^{T}{\Sigma}_{{}_{\tilde{X}\tilde{Y}}}$or the p × p matrix ${\Sigma}_{{}_{\tilde{Y}\tilde{X}}}^{T}{\Sigma}_{{}_{\tilde{Y}\tilde{X}}}$. The cost for getting the largest eigenvalue is low, providing min(p,q) is not big.
The computation advantage is obvious when sample permutations are used to calculate pvalue of the test. Since sample permutation changes neither the correlation structure within gene sets nor the correlation structure within phenotypes, so that we don’t need to repeat the same eigenvalue decompositions of the two shrinkage covariance matrices in (3) for the permuted data, but only for the original one. In fact, after performing the eigenvalue decompositions for the two shrinkage covariance matrices ${\Sigma}_{\mathit{XX}}^{*}$ and ${\Sigma}_{\mathit{YY}}^{*}$ and creating two groups of orthogonal basis vectors $\tilde{X}$ and $\tilde{Y}$, permutations can be done only on $\tilde{Y}$directly, instead of on the original phenoltype Y.
Nonlinear combination test for multiple continuous phenotypes
The proposed LCT method assumes a linear relationship between the genes in a gene set and the phenotypes. So do almost all the selfcontained GSA approaches that have been proposed in the literature. The reason for us to focus on testing linear relationship is mainly for simplicity of the method. When we have access to limited data points, a simpler approach could be more reliable than a complex/flexible one. If a larger sample size is available or if there is evidence that the relationships between gene sets and phenotypes could be nonlinear/nonmonotone, we may consider relaxing the linearity assumption, and testing more general null hypnoses, i.e.,
H _{ 0 } ^{ * } : there is no relationship between genes in the gene set and the phenotypes.
The linear combination test proposed can be easily adapted to test nonlinear relationships between genes in a gene set and phenotypes, by using nonparametric techniques. The main idea here is to apply a nonlinear transformation to the vectors of genes X, then use linear test methods to check if there is a significant linear relationship between the nonlinear transformation of X and the phenotypes Y. This strategy is similar to that of ‘basis expansion’ which is widely adopted in regression/discrimination analyses [17]. Some widely used nonlinear transformations are polynomial transformations of single or multiple genes to achieve higherorder Taylor expansions; cubic splines or wavelets transformations of single genes. We note that the same transformation strategy can be applied to the phenotypes Y. We prefer to leave Y untransformed to avoid higher flexibility of the method, which requires larger sample size as well as higher computational costs. In our NLCT test method used in the simulation study and the real microarray example study, we transform each gene in a gene set to a natural cubic spline with the degree of freedom set at 5.
Simulation study design
Our simulation study was designed to check performance of both LCT and NLCT methods. More specifically, we focused on the typeIerror performance and the power performance of the proposed tests, by varying geneset size, sample size, and correlation levels among genes and among phenotypes.
where β_{p × q} is a coefficient matrix, and ϵ_{n × q} the error matrix generated from a multivariate normal distribution. The correlation between each pair of the errors was set at p so that each pair of the columns in Y is correlated with correlation ρ. In the null model, used to check the size of the tests, we set all entries of β to 0, so that columns in X are not correlated with columns in Y. In the alternative model, used to check the power of the tests, we randomly selected five rows and three columns of the coefficient matrix, and set the corresponding fifteen entries to a common value μ, ranging from 0 to 5, with an increment of 0.25. The rest of the entries in the coefficient matrix were set at 0. We noted that the five selected columns of X are correlated with the three selected columns of Y, and that the correlation increases with μ. We used various sample sizes and geneset sizes, including large p and small n, a scenario which is common in geneset analysis. Because the LCT and NLCT procedures are in fact tests of correlation between two groups of variables, ignoring which is gene group and which is phenotype group, we set q as fix and changed p in the simulation design. The simulation data were replicated 1,000 times in each model. The pvalues were calculated based on 1,000 permutations.
Results
Simulation study
Type I errors of the multiple version of LCT and NLCT GSA methods, with dimension of the multiple phenotypes set at q = 10
Method  ρ  n = 20  n = 50  

p = 20  p = 50  p = 100  p = 20  p = 50  p = 100  
LCT  0.0  0.050  0.047  0.047  0.052  0.043  0.047 
0.3  0.051  0.053  0.050  0.055  0.045  0.043  
0.6  0.050  0.054  0.042  0.051  0.042  0.048  
0.9  0.053  0.052  0.044  0.058  0.050  0.046  
NLCT  0.0  0.041  0.039  0.035  0.062  0.049  0.058 
0.3  0.042  0.047  0.051  0.061  0.052  0.043  
0.6  0.044  0.062  0.047  0.052  0.049  0.042  
0.9  0.050  0.060  0.049  0.044  0.053  0.051 
We have two considerations for choosing q = 10. Firstly, the method is designed for q multiple continuous phenotypes and we wanted to show performance of LCT for a relatively large number of phenotypes, such as 10. We note that q = 1 reduces to our previous publication on LCT for a univariate phenotype. We reason that simulations for q in the range of 2 to 9 would give similar or even better performance than q = 10. Second, the method does not distinguish the “input” variables and “output” variables. It is in fact a correlation test, and from pure statistic point of view, there is no difference in testing results (or pvalues) if one considers “X as genes and Y as phenotypes”, or if one considers “Y as genes and X as phenotypes”. Hence, our simulation scenarios with p = 20/50/100/200 (see upright panel of Figure 1) can also be viewed as scenarios of q = 20/50/100/200.
Application
Leptin is a 16kDa protein hormone that plays a key role in regulating energy intake and expenditure, including appetite and hunger, metabolism and behavior. It is one of the most important adiposederived hormones [18]. Adiponectin (also refer to as GBP28, apMI, AdipoQ and Acrp30) is a protein which in humans is encoded by ADIPOQ gene [19]. It is involved in regulating glucose level and fatty acid oxidation. Both leptin and adiponectin are wellknown markers of human obesity [20–24]. They are hormones associated with various metabolic and inflammatory conditions. Interestingly, while leptin transcript levels are found to be overexpressed in obese subjects, adiponectin is generally underexpressed, and these may be observed not just in adipose but also in other tissues. We therefore considered using the levels of these dual markers as a multiphenotype for application of LCT.
We applied both LCT and NLCT to analyze a real Affymetrix microarray dataset consisting of genomewide transcriptomic measurements of prostate tumor samples from AfricanAmerican prostate cancer patients [13]. The purpose of the real microarray study was to see the performance of the new approaches on testing association between genesets and expressions of human leptin gene (LEP) and adiponection gene (ADIPOQ). The gene expression measurements were used as surrogate phenotypes, since the blood serum measurements were not available. The publicly available data were downloaded from Gene Expression Omnibus [25] [GEO: GSE6956]. The data that we used in the present study are part of a larger microarray study of the immunobiological differences in prostate cancer tumors between AfricanAmerican and EuropeanAmerican men. Because the LEP and ADIPOQ expression levels may be different between the two groups, we used only the data from the AfricanAmerican group to examine the LCT and NLCT methods. For our analysis, we used the expression values of 13,233 genes measured in tumor samples from 33 patients. The tumor samples were resected adenocarcinomas from patients who had not received any therapy before prostatectomy and were obtained from the National Cancer Institute Cooperative Prostate Cancer Tissue Resource (CPCTR) and the Department of Pathology at the University of Maryland. According to Wallace et al. [13], the macro dissected CPCTR tumor specimens were reviewed by a CPCTRassociated pathologist who confirmed the presence of tumors in the specimens. The tissues were collected between 2002 and 2004 at four different sites. The median age of patients with prostatectomy was 61 and the median prostatespecific antigen (PSA) concentration at diagnosis was 6.1 ng/ml. Fiftyfive percent of the tumors were stage pT2, and 45% were stage pT3 or more. Detailed RNA extraction, labeling and hybridization protocols were as described previously [13].
For comprehensive geneset analysis, the C2 catalog from MsigDB [6] consisting of 1,892 gene sets were used, including metabolic and signaling pathways from major pathway databases, gene signatures from biomedical literature including 340 PubMed articles, as well as other gene sets compiled from published mammalian microarray studies. 1,846 gene sets with size range from 5 to 500 were used in our analysis. Each gene set was tested, using both LCT and NLCT approach, for its association with the LEP and ADIPOQ expression measurements.
Percentages of gene sets with pvalues less than 0.005, 0.01, 0.05 and 0.10, which from LCT/NLCT test for univariate phenotype LEP and ADIPOQ , and multiple phenotypes ( LEP, ADIPOQ )
Method  Pvalue  

≤.005  ≤.01  ≤.05  ≤.10  
LCT for LEP  2.8  4.5  19.9  36.1 
LCT for ADIPOQ  0.4  0.9  3.1  6.3 
LCT for (LEP, ADIPOQ)  0.9  1.5  8.6  18.6 
NLCT for LEP  0.6  1.4  7.6  16.0 
NLCT for ADIPOQ  0.3  0.7  3.8  10.1 
NLCT for (LEP, ADIPOQ)  0.3  0.8  5.1  11.0 
Gene sets with LCT pvalues for multiple phenotypes ( LEP , ADIPOQ ) less than 0.05, while pvalues for univariate phenotype LEP and ADIPOQ are larger than 0.05
Geneset name  Geneset size  LEP pvalue  ADIPOQpvalue  ( ADIPOQ, LEP) pvalue 

YEN_MYC_WT  8  0.061  0.199  0.034 
GLUCONEOGENESIS  50  0.195  0.192  0.036 
BYSTRYKH_HSC_BRAIN_TRANS_GLOCUS  144  0.118  0.157  0.047 
PENG_LEUCINE_DN  135  0.186  0.189  0.035 
AMINOSUGARS_METABOLISM  14  0.223  0.341  0.039 
PENTOSE_PHOSPHATE_PATHWAY  21  0.192  0.098  0.044 
ZELLER_MYC_UP  22  0.09  0.262  0.032 
POMEROY_DESMOPLASIC_VS_CLASSIC_MD_DN  38  0.093  0.079  0.011 
FBW7PATHWAY  8  0.088  0.314  0.048 
GSK3PATHWAY  24  0.18  0.225  0.028 
GOLDRATH_CELLCYCLE  31  0.21  0.105  0.048 
STREPTOMYCIN_BIOSYNTHESIS  8  0.094  0.072  0.013 
FRUCTOSE_AND_MANNOSE_METABOLISM  24  0.168  0.061  0.007 
GLYCOLYSISPATHWAY  8  0.137  0.076  0.013 
UBIQUINONE_BIOSYNTHESIS  12  0.086  0.051  0.013 
HOFMANN_MANTEL_LYMPHOMA_VS_LYMPH_NODES_UP  45  0.053  0.135  0.022 
HOGERKORP_CD44_DN  22  0.057  0.504  0.05 
CROMER_HYPOPHARYNGEAL_MET_VS_NON_DN  72  0.11  0.178  0.05 
RUTELLA_HEPATGFSNDCS_UP  144  0.058  0.136  0.045 
METHOTREXATE_PROBCELL_DN  11  0.102  0.132  0.033 
GENOTOXINS_4HRS_DISCR  33  0.194  0.121  0.03 
HTERT_UP  57  0.083  0.089  0.036 
METHOTREXATE_PROBCELL_UP  14  0.147  0.111  0.046 
CAMPTOTHECIN_PROBCELL_UP  17  0.085  0.159  0.038 
UV_UNIQUE_FIBRO_UP  20  0.058  0.343  0.014 
CITED1_KO_HET_DN  29  0.097  0.157  0.012 
HEATSHOCK_YOUNG_UP  11  0.113  0.268  0.042 
HSA00051_FRUCTOSE_AND_MANNOSE_METABOLISM  35  0.148  0.075  0.044 
HSA00052_GALACTOSE_METABOLISM  27  0.053  0.277  0.044 
HSA00521_STREPTOMYCIN_BIOSYNTHESIS  10  0.099  0.124  0.026 
HSA01030_GLYCAN_STRUCTURES_BIOSYNTHESIS_1  83  0.106  0.25  0.049 
HSA04080_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION  218  0.103  0.116  0.044 
HSA04120_UBIQUITIN_MEDIATED_PROTEOLYSIS  34  0.124  0.082  0.041 
Discussion
We focused here on selfcontained GSA methods. We note that competitive and selfcontained methods test different hypotheses, and therefore it is important for the user to make an informed choice based on the hypothesis of interest and their understanding of the limitations of the two approaches (see reviews by Nam and Kim [8] and Dinu et al. [27]). An important limitation of the selfcontained approaches is that only a few genes can drive the association between the gene set and the phenotypes. In such cases, posthoc analysis can be used to reduce the gene set to a core subset associated with the phenotypes. Such an analysis has been reported in simulations and in a real example for a single binary phenotype [27].
Our proposed method is useful for testing associations between sets of genes or pathways and correlated multiple continuous phenotypes. These are often measured in molecular epidemiology studies that include clinicopathological measurements of tissue features such as tumor size and staining based readouts; cellular characteristics indicated by the amount of lymphocytic infiltration in a tumor environment; and subjectspecific measurements involving diagnostic or prognostic marker protein or metabolite concentrations. The LCT approach may still need to be adjusted for a mixture of continuous and categorical covariates. The LEP and ADIPOQ levels in the prostate tumor example that we have considered may also have been influenced by patientspecific covariates such as body mass index (BMI), age, and/or smoking status. We note that smoking status did not show a significant association with LEP expression levels (pvalue = 0.36), or ADIPOQ expressions levels (pvalue = 0.52) in our data, and BMI and age data were not available for our analysis.
The LCT approach can be used for both univariate and multivariate analyses. From the real data analysis, we can see that the univariate LCT for LEP is more sensitive/powerful in comparing to the multiple LCT. Generally speaking, if we knew previously that a subset of the group of phenotypes is more highly associated with the gene sets than the rest of phenotypes, then focusing on the subset of the phenotypes will gain higher power for the test, for further information is incorporated in the testing. Here, we want to point out that naively (univariately) analyzing a group of multiple correlated phenotypes will lead to problems. In the real data analysis, for controlling type I error (e.g. 0.05), it is hard to set a threshold for the two univariate tests, because of correlations between LEP and ADIPOQ. If we can assume that the two phenotypes are independent, we can set a common threshold (roughly as 0.02532057) for them. We then get 209 (11.32%) significant gene sets tested by the naïve approach, but not including 67 (3.63%) of the 159 (8.61%) significant gene sets tested by the multiple LCT. This indicates that naïve approach can identify only gene sets associated with one of the two multiple phenotypes, instead of their combination.
LCT methods rely on the linearity assumption. To check the linearity assumption, exploratory data analysis should be used prior to running a formal inference. However, a small sample size which is common in microarray studies, would limit a thorough check for nonlinearities. In the case of small sample size, we prefer using LCT instead of NLCT. The latter is suitable for relatively larger sample sizes and in the case linear assumption does not hold. Our simulation and real microarray studies indicated LCT methods perform very well for small sample sizes. The question of how small is small is debatable and depends largely on the study design. In the case of a binary/categorical phenotype, at least five samples per group are desirable. In the case of continuous phenotypes, assessing significance based on less than 10 samples is dangerous; an alternative would be to rely upon representations that are more descriptive/exploratory in nature. In terms of computation, both LCT and NLCT are highly efficient compared to other GSA methods, especially given the incorporation of the covariance matrix into the estimations.
We noticed that high correlation among genes and/or phenotypes enhances the testing power of LCT and NLCT. To understand this phenomenon, we need to distinguish correlation testing from regression modeling. In the later, we try to explain variance of the dependent variable by a group of predictors. So it is better for the predictors to be linearly independent, since high correlation among them may reduce rank(X), i.e. the real number of predictors. In the former, we are trying to find two linear combinations of genes and phenotypes respectively, with highest correlation between them. High correlation among genes may reduce p = rank(X), i.e. the real dimension of genes; and high correlation among phenotypes may reduce q = rank(Y), i.e. the real dimension of phenotypes. The smaller the dimensions p and q the easier to test the correlation between genes and phenotypes.
Conclusions
Our proposed LCT and NLCT approaches are effective in controlling type I errors and powerful in testing associations between genesets and multiple continuous phenotypes. They are both computationally effective. Naively (univariately) analyzing a group of multiple correlated phenotypes, i.e., ignoring correlation structure among phenotypes, could be dangerous.
Availability and requirements
Project name: Linear Combination Test for GeneSet Analysis of Multiple Continuous Phenotypes
Project home page: http://www.ualberta.ca/~yyasui/homepage.html
Operating system(s): Microsoft Windows XP
Programming language: R 2.10.1
Abbreviations
 GSA:

Gene set analysis
 LCT:

Linear combination test
 NLCT:

Nonlinear combination test.
Declarations
Acknowledgements
We thank the editor and reviewers for helpful guidance and comments, which have improved the exposition of this paper substantially. SP is supported by Ramalingaswami Fellowship of DBT and grants from MoS&PI, GoI and DST CMS (Project No. SR/SA/MS:516/07 dated 21/04/2008).
Authors’ Affiliations
References
 Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M: KEGG for integration and interpretation of largescale molecular datasets. Nucleic Acids Res. 2012, 40: D109D114.View ArticlePubMed CentralPubMedGoogle Scholar
 Kanehisa M, Goto S: KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28: 2730.View ArticlePubMed CentralPubMedGoogle Scholar
 Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, IsselTarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 2000, 25: 2529.View ArticlePubMed CentralPubMedGoogle Scholar
 Nishimura D: BioCarta. Biotech Software & Internet Report. 2001, 2 (3): 117120.View ArticleGoogle Scholar
 Liberzon A, Subramanian A, Pinchback R, Thorvaldsdóttir H, Tamayo P, Mesirov JP: Molecular signature database (MSigDB) 3.0. Bioinformatics. 2011, 27 (12): 17391740.View ArticlePubMed CentralPubMedGoogle Scholar
 Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. Proc Natl Acad Sci U S A. 2005, 102: 1554515550.View ArticlePubMed CentralPubMedGoogle Scholar
 Goeman JJ, Buhlmann P: Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007, 23: 980987.View ArticlePubMedGoogle Scholar
 Nam D, Kim SY: Geneset approach for expression pattern analysis. Brief Bioinform. 2008, 9 (5): 189197.View ArticlePubMedGoogle Scholar
 Tsai C, Chen JJ: Multivariate analysis of variance test for gene set analysis. Bioinformatics. 2009, 25 (7): 897903.View ArticlePubMedGoogle Scholar
 Wang X, Dinu I, Liu W, Yasui Y: Linear Combination Test for Hierarchical Gene Set Analysis. Stat Appl Genet Mol Biol. 2011, 10 (1): Article 13PubMedGoogle Scholar
 Dinu I, Wang X, Vatanpour S, Kelemen LE, Vatanpour S, Pyne S: Linear combination test for gene set analysis of a continuous phenotype. BMC Bioinformatics. 2013, 14: 212View ArticlePubMed CentralPubMedGoogle Scholar
 Goeman JJ, van de Geer SA, de Kort F, van Houwelingen HC: A global test for groups of genes: testing association with a clinical outcome. Bioinformatics. 2004, 20: 9399.View ArticlePubMedGoogle Scholar
 Wallace TA, Prueitt RL, Yi MH, Yi M, Howe TM, Gillespie JW, Yfantis HG, Stephens RM, Caporaso NE, Loffredo CA, Ambs S: Tumor Immunobiological Differences in Prostate Cancer between AfricanAmerican and EuropeanAmerican Men. Cancer Res. 2008, 68 (3): 927936.View ArticlePubMedGoogle Scholar
 Rahman NA: A Course in Theoretical Statistics. 1968, Charles Griffin and CompanyGoogle Scholar
 Kendall MG, Stuart A: The Advanced Theory of Statistics, Volume 2: Inference and Relationship. 1973, London: Griffin, 3Google Scholar
 Schäfer J, Strimmer K: A shrinkage approach to largescale covariance matrix estimation and implications for functional genomics. Statist Appl Genet Mol Biol. 2005, 4: Article 32Google Scholar
 Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. 2009, New York: Springer, 2View ArticleGoogle Scholar
 Brennan AM, Mantzoros CS: Drug Insight: the role of leptin in human physiology and pathophysiology–emerging clinical applications. Nat Clin Pract Endocrinol Metab. 2006, 2 (6): 318327.View ArticlePubMedGoogle Scholar
 Maeda K, Okubo K, Shimomura I, Funahashi T, Matsuzawa Y, Matsubara K: cDNA cloning and expression of a novel adipose specific collagenlike factor, apM1 (AdiPose Most abundant Gene transcript 1). Biochem Biophys Res Commun. 1996, 221 (2): 286289.View ArticlePubMedGoogle Scholar
 Chang S, Hursting SD, Contois JH, Strom SS, Yamamura Y, Babaian RJ, Troncoso P, Scardino PS, Wheeler TM, Amos CI, Spitz MR: Leptin and prostate cancer. Prostate. 2001, 46 (1): 6267.View ArticlePubMedGoogle Scholar
 Saglam K, Aydur E, Yilmaz M, Göktaş S: Leptin influences cellular differentiation and progression in prostate cancer. J Urol. 2003, 169 (4): 130811.View ArticlePubMedGoogle Scholar
 Singh SK, Grifson JJ, Mavuduru RS, Agarwal MM, Mandal AK, Jha V: Serum leptin: A marker of prostate cancer irrespective of obesity. Cancer Biomarkers. 2010, 7 (1): 1115.PubMedGoogle Scholar
 Goktas S, Yilmaz MI, Caglar K, Sonmez A, Kilic S, Bedir S: Prostate cancer and adiponectin. Urology. 2005, 65 (6): 11681172.View ArticlePubMedGoogle Scholar
 Bub JD, Miyazaki T, Iwamoto Y: Adiponectin as a growth inhibitor in prostate cancer cells. Biochem Biophys Res Commun. 2006, 340 (4): 11581166.View ArticlePubMedGoogle Scholar
 Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30 (1): 207210.View ArticlePubMed CentralPubMedGoogle Scholar
 Storey JD: A direct approach to false discovery rates. J R Stat Soc Series B Stat Methodol. 2002, 64: 479498.View ArticleGoogle Scholar
 Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulsky KS, Halloran PF, Yasui Y: Gene Set Analysis and Reduction. Brief Bioinform. 2009, 10 (1): 2434.View ArticlePubMed CentralPubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.