 Methodology article
 Open Access
 Published:
Probabilistic prioritization of candidate pathway association with pathway score
BMC Bioinformatics volume 19, Article number: 391 (2018)
Abstract
Background
Current methods for geneset or pathway analysis are usually designed to test the enrichment of a single geneset. Once the analysis is carried out for each of the sets under study, a list of significant sets can be obtained. However, if one wishes to further prioritize the importance or strength of association of these sets, no such quantitative measure is available. Using the magnitude of pvalue to rank the pathways may not be appropriate because pvalue is not a measure for strength of significance. In addition, when testing each pathway, these analyses are often implicitly affected by the number of differentially expressed genes included in the set and/or affected by the dependence among genes.
Results
Here we propose a twostage procedure to prioritize the pathways/genesets. In the first stage we develop a pathwaylevel measure with three properties. First, it contains all genes (differentially expressed or not) in the same set, and summarizes the collective effect of all genes per sample. Second, this pathway score accounts for the correlation between genes by synchronizing their correlation directions. Third, the score includes a rank transformation to enhance the variation among samples as well as to avoid the influence of extreme heterogeneity among genes. In the second stage, all scores are included simultaneously in a Bayesian logistic regression model which can evaluate the strength of association for each set and rank the sets based on posterior probabilities. Simulations from Gaussian distributions and human microarray data, and a breast cancer study with RNASeq are considered for demonstration and comparison with other existing methods.
Conclusions
The proposed summary pathway score provides for each sample an overall evaluation of gene expression in a geneset. It demonstrates the advantages of including all genes in the set and the synchronization of correlation direction. The simultaneous utilization of all pathwaylevel scores in a Bayesian model not only offers a probabilistic evaluation and ranking of the pathway association but also presents good accuracy in identifying the topranking pathways. The resulting recommendation list of ranked pathways can be a reference for potential target therapy or for future allocation of research resources.
Background
To evaluate the enrichment of a pathway or geneset under consideration, several methods for pathway analysis (PA) or geneset analysis (GSA) have been proposed over the past decades, including the overrepresentation analysis (ORA), significance analysis of function and expression [1], geneset enrichment analysis (GSEA) [2], global test [3], and signaling pathway impact analysis (SPIA) [4, 5] (more reviews in [6,7,8]). The existence of the enrichment of the pathway or geneset, often a gene ontology term, is sometimes interpreted as the association between the phenotype and the set. A significantly enriched pathway or geneset would then be recommended for further investigation of subset analysis or target therapy. When several significant pathways are available, these sets may need to be prioritized for future research or for better allocation of limited resources. Two problems arise, however. The first one is that the genesets identified by different procedures may not be consistent with each other [6, 9], and the second is the lack of a measure to quantify the strength of association of each set.
The problem in reproducibility can be caused by the discrepancy between the statistical assumptions underlying the approaches and the biological reality of the genegene relationship. For instance, genes in the same pathway are often considered independent in several GSAs; while they can correlate with each other because they participate in the same or related biological functions [10,11,12]. This correlation can inflate type I error rates and reduce power of both univariate and multivariate tests [8, 13, 14]. Another issue of concern is the condition on genes to be included in GSAs. Some analyses including ORA utilize only genes that are differentially expressed (DE), while excluding those exerting mild or weak effect. For instance, ORA uses hypergeometric test for GSA. As pointed out by Rahmatallah and colleagues [9], the power of a geneset analysis may be influenced by the number of DE genes in that set.
To quantify the strength of association, a common practice is to order the sets based on pvalues resulting from a certain GSA that is applied to each individual set. Since pvalue is defined for data more extreme than the observed assuming the null hypothesis is true, its magnitude would not be proper to serve as a quantitative measure for the strength of association [15]; therefore the ranking based on pvalues would be inappropriate.
To solve these problems, we propose first to summarize the gene expression levels, whether DE or not, in the same pathway with a rank transformation adjusted by direction of correlation. The rank is applied on all samples per gene to depict the relative magnitude of a gene across samples, and the sign of correlation is used to incorporate and synchronize the genegene relationship. This procedure is conducted for all gene nodes in the pathway, including those in subpathways. We next adopt the Bayesian regression machine to model the degree of association between the pathway and disease status, where the prioritization of competing pathways is carried out based on conditional probabilities. The use of Bayesian model for GSA was considered earlier in [16] for DNA methylation profiling. The rest of the paper is organized as follows. The formulation of the proposed pathway score and the construction of the Bayesian model will be introduced in Section “Methods”. In Section “Results and discussion” we demonstrate the performance of the procedure with simulation studies. The simulated gene expressions are generated either from multivariate Gaussian distributions or from public human expression data to reserve the dependence among genes. The evaluation of performance is based on the type I error rate, percentage of correctly ranking the genesets, and the ability to detect the associated pathways. In the same Section we also apply the proposed methodology on a study of highgrade ductal carcinoma in situ with RNASeq data and six competing pathways, followed by discussion and conclusion. Note that here we consider a pathway also a geneset and will interchange the words pathway and geneset to refer to a set of genes under investigation.
Methods
Suppose there are N samples and M genes or gene nodes in the study. Let G_{nm} denote the expression value of the nth sample in the mth gene, where n = 1, …, N and m = 1, …, M, and let the N × M matrix G contain all expression values, where its column vector is denoted as G_{·m} = (G_{1m}, G_{2m}, …, G_{Nm})^{t} for the expression of all samples in the mth gene. The rank function is next applied on each gene (column) vector respectively. That is, each column vector in G is replaced with the vector r(G_{·m}) = (r(G_{1m}), r(G_{2m}), …, r(G_{Nm}))^{t}.
Next we establish the relationship between genes by first selecting a reference gene, denoting its gene vector as G_{·R}, computing the correlation between this gene and every other column G_{·m}, m = 1, …, M, in G, and recording the direction of the correlation between G_{·R} and G_{·m} with the sign function S(G_{·R}, G_{·m}). That is, the value of S(G_{·R}, G_{·m}) = sign (corr(G_{·R}, G_{·m})) is 1 if they are positively correlated and 1 otherwise. The choice of a fixed reference gene in this procedure is to adjust all correlation directions from the same base unit, i.e., the reference gene in our case, and to avoid cancellation when no base is considered.
Pathway score
Suppose there are K competing pathways, let C_{k} contain the indices of genes (or gene symbols) in the kth pathway, k = 1, …, K. If a gene appears in more than one node in the pathway, the frequency of its index is identical to the number of its appearance. Let its cardinality C_{k} denote the number of elements in C_{k}. That is, C_{k} is the size of the kth pathway. In this pathway, a reference gene is first selected and then a standardized pathway score p_{nk} is defined to summarize the expression values for the nth sample as
Note that Q_{nk} is the average ranks of the expression levels with signs for the nth sample and p_{nk} is the standardized score so that the (p_{n1}, p_{n2}, …, p_{nK}) are comparable among the K competing pathways. The standard deviation sd(Q_{1k}, …, Q_{Nk}) in the denominator is calculated across samples n = 1, …, N for each fixed k. After all pathway scores are computed for each sample, the values are stored in the N × K matrix P.
This proposed pathway score has several advantages. First, the pathway score summarizes for each sample the gene expression through the rank transformation so that the quantity is robust to extreme expression values, as oppose to the direct average. This transformation also standardizes the variability across genes as well as enlarges the heterogeneity. In addition, the product of the function S and ranked expression r(G_{nm}) in the score integrates all genes by adjusting the direction of correlation between any single gene and the reference. In other words, depending on the direction of the reference gene, the quantity becomes extreme when many genes in the pathway are simultaneously overexpressed or underexpressed. This function S can be considered as a synchronizing factor.
Strength of association and prioritization
To evaluate the strength of association of the K pathways, a generalized linear model with a logit link g in the casecontrol setting is employed,
The Y_{n} stands for the disease status of the nth sample, p_{nk} is the standardized pathway score defined above with the corresponding coefficient β_{k}, and X_{n} contains other nongenetic explanatory variables associated with this sample.
For the regression coefficient β_{k}, we adopt the maximum posterior probability P^{(k)} = max {P(β_{k} > 0 Y, X, P), P(β_{k} < 0 Y, X, P)} as a probabilistic evaluation of the strength of association between the kth pathway and the disease status. Here Y is the column vector containing disease status of all samples, X contains X_{1}, X_{2}, …, X_{N}, and P is defined as above. The value P^{(k)} ranges between 0.5 and 1. It represents the degree of association: A value closer to 1 implies a stronger association between the set and the disease status; while a value closer to 0.5 indicates weak or no association. Take two competing pathways k_{1} and k_{2} for example, if \( {P}^{\left({k}_1\right)} \) is larger than \( {P}^{\left({k}_2\right)} \), it implies a larger degree of association between k_{1} and the disease status than that between k_{2} and Y. This quantity can now be used to prioritize the K competing pathways.
The computation of P^{(k)} as well as the Markov chain Monte Carlo (MCMC) posterior samples of β_{k} are carried out with an R package R2OpenBUGS to evaluate the posterior probability P(β_{k} > 0 Y, X, P) and P(β_{k} < 0 Y, X, P). The code and specification of the full Bayesian model including the distributions of prior and hyperparameters are provided in Additional files 1 and 2.
Results and discussion
Simulation settings
In the following simulation studies, we compare the Bayesian approach with other methods such as GSEA, ORA, global test, frequentist logistic regression with the proposed pathway score (denoted as Logistic (ps)), frequentist logistic regression with the average expression level as the pathway score (Logistic (sum)), and the Fisher’s method. First we generated either 50 or 100 gene expression levels from a multivariate normal distribution with assigned mean, variance, and correlation ρ to examine the type I error. The disease status was next determined based on the logistic regression model described above with the intercept β_{0} set at 0.01 for a prevalence of 1% and all other regression coefficients set at 0 for no association, or at other given values if association is assumed (described below when data were generated from human genome data for power evaluation). In each replication, 50 cases and 50 controls were generated and the total number of replications is 1000. The value of ρ was selected from 0, 0.1, and 0.3 to mimic the independence, weak, and mild correlation among genes in the same set. The pvalues for the nonBayesian methods were computed based on asymptotics (if applicable) or 1000 permutations and their significance level was set at 0.05; while for the Bayesian approach, the threshold was set at 0.99 for P^{(k)} based on 5000 posterior samples. The gene is defined as DE if its singlemarker test results in p < 0.05.
In addition, we simulated real gene expression data from a large breast cancer study [17] to preserve the relationship among genes. This study contained 13,751 gene expressions from 623 subjects with primary breast cancer. The expression levels were collected from microarray experiments and can be downloaded from Gene Expression Omnibus (GEO) repository (accession number GSE48091). Again, either 50 or 100 genes were randomly selected from this expression data to form a geneset and to determine the disease status, followed by the analysis with each method to test for association.
Evaluation of type I error rate
Table 1 lists the type I error rates for each method but SPIA, because SPIA is designed for known pathways and not hypothetical ones. Under the null hypothesis of no pathway association, we note first that the type I error rate does not change much across different values of ρ, regardless of the pathway size (50 or 100 genes per set). Second, most tests show reasonable rates: the rates of Logistic (ps), Logistic (sum), GSEA and Fisher’s test are slightly smaller than 0.05, the rates under the Global test are around 0.03 with 50 genes per set and 0.05 with 100 genes, the rates under the Bayesian approach are around 0.02, but that under ORA tends to be as large as 0.1. However, when the data were generated from GSE48091, an inflated type I error rate is apparent for each method, though of different degree. Only the Bayesian regression approach can maintain a rate smaller than 0.05.
Evaluation of accuracy performance in setting IV
Next we evaluate the power in terms of the accuracy in selecting the most influential pathway. In other words, the analysis is considered making the correct decision if the pvalue corresponding to the truly most influential pathway is the smallest; while in the Bayesian approach, the P^{(k)} has to be the largest. Again we simulated real gene expression data from GSE48091 to preserve the correlation among genes. Furthermore, to compare with SPIA, we deliberately selected p53, JakSTAT, mTOR, and taste transduction pathways, denoted by C_{1}, C_{2}, C_{3}, and C_{4}, as the four competing pathways. The contents of these pathways and their subpathways all follow the definition in the R package SPIA. A pathway with a regression coefficient β=1 or 2 is considered exerting strong association, since it corresponds to the odds ratios of log(1) = 2.718 and log(2) = 7.389. Similarly, β=0.5, 0.1, 0.01, and 0.001 correspond to 1.649, 1.105, 1.010, and 1.001, and are considered as of moderate, weak, and almostnone association effect. Five simulation settings (IV) are included and the values of β are displayed in Fig. 1.
Figure 2a demonstrates the accuracy of the 1000 replications in selecting the most influential pathway. For instance, in setting I, the accuracy is defined as the percentage of selecting set C_{1} as the top pathway, since C_{1} corresponds to the largest absolute regression coefficient 2 in Fig. 1. That is, in each replication, if the pvalue for C_{1} is smaller than pvalues for C_{2} − C_{4}, respectively, it is counted as accurate for the method. For Bayesian approach, this replication is considered accurate if P^{(1)} > P^{(k)} for all k = 2, 3, 4. While in setting II, the accuracy is the percentage that C_{2} is selected, since its absolute coefficient 2 is the largest. Among the five settings, generally the Bayesian approach and Logistic (ps) perform the best, except in setting II they are slightly behind the Fisher’s method by only 0.3%. The advantage of the Bayesian approach and Logistic (ps) decreases in setting V; the accuracy of every method is between 20 and 30%. This is under expectation, since all four competing pathways exert weak and similar effects (coefficients between 0.1 and 0.001), making them less differentiable from each other.
The larger accuracy of the Bayesian approach and Logistic (ps) demonstrates the usefulness of the summary measure, the pathway score. In addition, the comparison between Logistic (ps) and Logistic (sum) implies the advantage of incorporating the correlation direction and rank information, which leads to a better performance. Note that here GSEA, SPIA, and Fisher’s method remain in the second best group in identifying the topranking pathway. Fisher’s method uses the Chisquare as the asymptotic distribution where significance occurs frequently when the pathway size is large and leads to a large degree of freedom. This may explain why it tends to reach the significant result. The ORA does not perform well because it considers only DE genes and no correlation among genes is included in the analysis.
In Fig. 2b we demonstrate the percentage of correctly selecting the top two ranking pathways. The accuracy evaluation decreases for all methods. Again, the Bayesian approach and Logistic (ps) perform better than the other methods. It appears that these two have the ability to make the correct selection as long as the coefficients are large and separable, such as the first four settings IIV.
Evaluation of accuracy performance under singlemarker association (setting VIIX)
In addition to the above pathwaylevel generation of association, we also consider the scenario of traditional singlemarker association, where the gene, instead of the pathway score, is assigned with an effect to associate with the disease. In other words, the disease probability for the ith subject is linked (via a logit scale) to the linear combination \( \sum \limits_{j\in {C}_k}{\beta}_{kj}{G}_{ij} \), where C_{k} contains the index of genes in the associated pathway, β_{kj} is the effect size of the jth gene in this set, and G_{ij} is the gene expression of the corresponding gene in this subject. In this scenario, nonzero effect can be assigned to a subset of genes in this set. A randomly selected M percent of genes in C_{1} are assigned with n_{1}β_{1j} = 0.5, M percent of genes in C_{2} are assigned with n_{2}β_{1j} = 1; while the rest 100 − M percent in C_{1} and C_{2}, and all genes in C_{3} and C_{4} are assigned with n_{k}β_{kj} = 0.01. n_{k} is the number of causal genes. The four values, 20, 50, 80, and 100, of M correspond to the four settings VI, VII, VIII, and IX, respectively.
The performance of accuracy based on 100 replications is displayed in Fig. 2c for selecting the correct top ranking pathway. All methods perform similarly well, where the global test is slightly less powerful (but still around 0.80). In Fig. 2d for selecting the correct top two ranking pathways, all methods perform poorly; the largest power is around 0.50 for GSEA. In other words, no method presents clear advantage.
Evaluation of power performance for individual pathway
An alternative way to evaluate the performance of these competing methods is to examine their ability to correctly identify each of the associated pathway/geneset. Using the threshold suggested earlier (i.e., the 0.05 for pvalue and 0.99 for the maximum posterior probability P^{(k)}), we display in Fig. 3 the percentage of detected association for each pathway. Only settings I and VI are presented here because all show similar patterns. We therefore select I from the first group of simulation design (settings IV) and VI from the second group (VIIX).
When the pathway exerts strong association (C_{1},C_{2} in I and C_{2} in VI), four methods (Bayesian, Logistic (ps), Global, and Fisher’s) attain consistently a large power. When the pathway effect is moderate or weak (C_{3} in I and C_{1} in VI), only Global test and Fisher’s method can detect the association. However, these two methods incorrectly identify, with a high frequency, the pathway with null (almostnone) effects (C_{4} in I and C_{3},C_{4} in VI). In summary, the Bayesian and Logistic (ps) perform similarly well, but the Bayesian approach has a slightly smaller error (C_{4} in I and C_{3},C_{4} in VI).
Application: Highgrade DCIS study
We applied the proposed method to a breast cancer study of pure highgrade ductal carcinoma in situ (DCIS). This study included 25 breast cancer patients and 10 normal controls [18], and applied the nextgeneration sequencing (NGS) technique to quantify the gene expression. The RNASeq data are freely available from National Center for Biotechnology Information (NCBI) GEO database (accession number GSE69240). The data contained read counts from 16,532 genes. Six competing pathways (p53, estrogen, JakSTAT, mTOR, oocyte meiosis, and taste transduction) were selected specifically for pathway ranking. The first five pathways have been reported to be associated with breast cancer [19]; while the last one was not and is considered here as the null for comparison. The contents of these pathways follow the definition in Kyoto Encyclopedia of Genes and Genomes (KEGG).
To demonstrate the effect of rank transformation, we display in Fig. 4 the heatmap of gene counts for the JakSTAT signaling pathway, where Fig. 4a contains the original RNASeq data and Fig. 4b includes the ranks of gene counts. The pattern of the relative magnitude does not change, but the contrast in Fig. 4b is much stronger than that in Fig. 4a. In the lower panel, Fig. 4c plots the summation of all gene counts in this pathway for each sample and Fig. 4d shows the value of the proposed pathway score. It can be observed that the 35 summation values in the left tend to overlap with each other; while the pathway scores in the right seem to discriminate better the 10 controls versus the 25 patients.
The six pathways were then investigated under the Bayesian approach, Logistic (ps), Logistic (sum), GSEA, SPIA, ORA, global test, and Fisher’s method, respectively. Table 2 lists either the P^{(k)} under the Bayesian model, or the pvalues under other tests. Note that the DE genes here are defined if they pass the singlemarker test with Bonferroni correction and the foldchange not between 0.5 and 2. Under each method, the largest and smallest values are in boldface, corresponding to the most influential (topranked) and the least influential pathway.
For the Bayesian ranking scheme with all six pathways simultaneously included in the same model, JakSTAT is the most influential because its P^{(k)} is clearly the largest and around 90%; while the taste transduction is identified the last one with P^{(k)} close to 50%. For the other tests, the results are contradictory. They can identify one from the first five sets as the most influential, but they may select one from this group as the least important. They are not consistent in identifying the taste transduction pathway as the least important geneset. GSEA, SPIA and ORA identify mTOR as the least influential pathway; whereas Logistic (sum) recommends mTOR and taste transduction being the most influential. Furthermore, both the global test and Fisher’s method cannot differentiate the six pathways, since the pvalues are extremely small that it is not meaningful to compare the relative magnitude. The Logistic (ps) cannot be conducted here because the proposed pathway score separate the two groups (healthy vs. disease) almost perfectly, leading to failure in estimating the effect size. For Logistic (sum), the six pathwayspecific covariates based on summation of all expression levels cannot be included in the logistic regression model simultaneously due to the small sample size (relative to the number of pathways), therefore the pvalues in the Table are derived when only one covariate is considered in the analysis.
The distributions of pathway effect, the regression coefficient, are displayed in Fig. 5 with boxplots of their MCMC posterior samples. The top ranking pathway (the largest P^{(k)}) is JakSTAT; its pathway score is positively associated with a higher chance of disease risk. This is expected because our proposed score synchronizes directions of all expression counts according to the reference gene, thus follows the same effect direction of the reference. Here IL17D is the reference and it does show more expression counts in patients than in controls. This reference is a member of the interleukin 17 family (IL17) that has been reported to significantly link to tumor progression including the invasive ductal carcinoma, the most common type of breast cancer [20,21,22]. In other words, a larger value of this summary pathway score implies a higher risk of disease.
The second group of boxplots in Fig. 5 includes three pathways, p53, estrogen, and mTOR, sharing similar values of P^{(k)} (between 0.72 and 0.74). The degree of association of these three pathways is less significant than that of JakSTAT, but certainly more substantial than Oocyte meiosis and taste transduction. The least P^{(k)} is 0.554 for taste transduction, representing a symmetric distribution around zero, which is expected under our hypothesis of no association.
Conclusions
In this research, we have constructed a measure to summarize the geneset activity for each subject. This quantity takes into consideration all genes in the same set, accounts for the relationship among genes in terms of correlation direction, and can enhance the contrast of each gene across samples. This measure is then applied in a Bayesian model to evaluate the strength of association between this geneset and the phenotype of interest, where the posterior probability can represent the degree of association. If multiple genesets are of interest, the corresponding probabilities can be ordered to prioritize the genesets for future studies. In contrast, other methods consider only one set at a time and use pvalues for ranking.
There are other advantages in using the Bayesian regression model. The first is its interpretability. The exponent of the regression coefficient is the odds ratio: an odds ratio greater than 1 implies an increase in risk, while a value smaller than 1 implies a decrease. In addition, since the pathway score synchronizes all genes with the reference, the direction of change in pathway risk (i.e., the sign of the regression coefficient) is the same as that of the reference gene. The second advantage is its flexibility. This model can account for quantitative traits when the function g is replaced with the identity link, and can be extended to survival analysis and pedigree studies. Furthermore, other covariates like demographic and environmental variables can be included to account for other effects.
For the choice of the reference gene in the geneset, our criteria include (1) the hub gene, with many neighboring genes being either its up or downregulating entities; (2) the one acting as an upstream gene in the set, particularly in a signaling pathway; and (3) the gene with a large variation in expression level. For the breast cancer NGS study, we deliberately selected TP53, ADCY1, IL17D, TNF, CALML5, and TAS1R3 as the reference for p53, Estrogen, JakSTAT, mTOR, oocyte meiosis, and taste transduction pathways, respectively. Following the criteria, these genes are all connected to many nodes in the same pathway, locate in the upstream, and show greater variation then others. There may be more than one such gene in each set. For example, for the JakSTAT pathway, we have tried LEP, CNTFR, GHR, IIL7R, IL20RB, and IL23A and the results all support the same top ranking pathway. Table 3 lists the corresponding probability P^{(k)} when these genes are used as the reference, as well as the posterior mean for the regression coefficient. Notice that the sign of the mean is the same as that of the reference gene, i.e., it is positive if the gene is overexpressed in the diseased group and negative if underexpressed. A systematic investigation would be necessary to find an optimal choice.
Several issues are worth noting here. First, when the purpose is to screen a large collection of genesets and not to prioritize a limited set of candidate pathways, the current Bayesian model may not be able to accommodate all sets in a single model, especially when the sample size is not large enough to support statistical inference. In this case, a preselection procedure is advised. One may adopt the proposed pathway score as a preselection tool in a frequentist logistic regression model for binary response outcomes or in the usual linear regression for quantitative response variables. Based on the performance of singlepathway test in the simulations, although Logistic (ps) has the tendency to provide false positive results, it is easy to use with large power. The resulting sets can next enter the Bayesian procedure for prioritization. The use of other tests like ORA and SPIA in the preselection stage would need special attention in the number of DE genes. The set tends to be significant if the number is large. This relationship is demonstrated in Fig. 6 for the breast cancer NGS study. The pvalue from such singlepathway test, however, cannot replace the ranking procedure. Inference based on a joint model, such as the one proposed in this research, would be preferred.
Second, our proposed procedure is in spirit closer to a selfcontained than a competitive test, when adopting the definition in Goeman and Buhlmann [10]. This is because the problem we are dealing with is the association strength of genesets. In other words, only sets included in the analysis are under investigation. These sets need not to compete with genes outside the sets. On the other hand, our procedure goes beyond a selfcontained test, because we are trying to evaluate the degree of association, not just to test if the association exists. The future aim is to enlarge the list of candidate pathways when all, not just candidates, are included for exploratory and screening purposes.
Abbreviations
 DCIS:

Ductal carcinoma in situ
 DE:

Differentially expressed
 GEO:

Gene expression omnibus
 GSA:

Geneset analysis
 GSEA:

Geneset enrichment analysis
 KEGG:

Kyoto Encyclopedia of Genes and Genomes
 MCMC:

Markov chain Monte Carlo
 NCBI:

National Center for Biotechnology Information
 NGS:

Nextgeneration sequencing
 ORA:

Overrepresentation analysis
 PA:

Pathway analysis
 RNASeq:

RNAsequencing
 SPIA:

Signaling pathway impact analysis
References
 1.
Barry WT, Nobel AB, Wright FA. Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics. 2005;21:1943–9.
 2.
Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. Proc Natl Acad Sci. 2005;102:15545–50.
 3.
Goeman JJ, van de Geer SA, de Kort F, et al. A global test for groups of genes: testing association with clinical outcome. Bioinformatics. 2004;20:93–9.
 4.
Draghici S, Purvesh K, Tarca AL, et al. A systems biology approach for pathway level analysis. Genome Res. 2007;17:1537–45.
 5.
Tarca T, Draghici S, Khatri P, et al. A novel signaling pathway impact analysis. Bioinformatics. 2009;25:75–82.
 6.
Ackermann M, Strimmer K. A general modular framework for gene set enrichment analysis. BMC Bioinformatics. 2009;10:47.
 7.
Maciejewski H. Gene set analysis methods: statistical models and methodological differences. Brief Bioinform. 2014;15:504–18.
 8.
de Leeuw CA, Neale BM, Heskes T, et al. The statistical properties of geneset analysis. Nat Rev Genet. 2016;17:353–64.
 9.
Rahmatallah Y, EmmertStreib F, Glazko G. Gene set analysis approaches for RNAseq data: performance evaluation and application guideline. Brief Bioinform. 2016;17:393–407.
 10.
Goeman JJ, Buhlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23:980–7.
 11.
Gatti DM, Barry WT, Nobel AB, et al. Heading down the wrong pathway: on the influence of correlation within gene sets. BMC Genomics. 2010;11:574.
 12.
Montaner D, Minguez P, AlShahrour F, et al. Gene set internal coherence in the context of functional profiling. BMC Genomics. 2009;10:197.
 13.
Glazko G, EmmertStreib F. Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics. 2009;25:2348–54.
 14.
Rahmatallah Y, EmmertStreib F, Glazko G. Comparative evaluation of gene set analysis approaches for RNASeq data. BMC Bioinformatics. 2014;15:397.
 15.
Nuzzo R. Statistical errors. Nature. 2014;506:150–2.
 16.
Chang CW, Lu TP, She CX, et al. Geneset analysis with CGI information for differential DNA methylation profiling. Sci Rep. 2016;6:24666.
 17.
Cunha SI, Bocci M, Lovrot J, et al. Endothelial ALK1 is a therapeutic target to block metastatic dissemination of breast cancer. Cancer Res. 2015;75:2445–56.
 18.
Abba MC, Gong T, Lu Y, et al. A molecular portrait of highgrade ductal carcinoma in situ. Cancer Res. 2015;75:3980–90.
 19.
Eroles P, Bosch A, PerezFidalgo JA, et al. Molecular biology in breast cancer: intrinsic subtypes and signaling pathways. Cancer Treat Rev. 2012;38:698–707.
 20.
Starnes T, Broxmeyer HE, Robertson MJ, et al. Cutting edge: IL17D, a novel member of the IL17 family, stimulates cytokine production and inhibits hemopoiesis. J Immunol. 2002;169:642–6.
 21.
O’Sullivan T, SaddawiKonefka R, Gross E, et al. Interleukin17D mediates tumor rejection through recruitment of natural kill cells. Cell Rep. 2014;7:989–98.
 22.
Benevides L, da Fonseca DM, Donate PB, et al. IL17 promotes mammary tumor progression by changing the behavior of tumor cells and eliciting tumorigenic neutrophils recruitment. Cancer Res. 2015;75:3788–99.
Funding
This work was supported in part by Ministry of Science and Technology (MOST 106–2314B002 097 MY3).
Availability of data and materials
The gene expression data with accession number GSE48091 and RNASeq data (GSE69240) are both publicly available from the GEO repository at NCBI. The R code for computing the pathway score and conducting the Bayesian analysis is in Additional file 1. The R document and examples are in Additional file 2.
Author information
Affiliations
Contributions
SJL and CKH designed the method. SJL implemented the simulation studies and conducted data analysis. TPL and CKH supervised the design and analysis of the methods. SJL, TPL, and CKH prepared the manuscript. QYY conducted the analysis of KEGG pathways and prepared the R document. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable. All analyses were performed either on simulation data or on publicly available data from the Gene Expression Omnibus.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
R code for computing the pathway score and conducting Bayesian analysis. (TXT 9 kb)
Additional file 2:
the R document for the analysis of two hypothetical examples. (PDF 214 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Lin, SJ., Lu, TP., Yu, QY. et al. Probabilistic prioritization of candidate pathway association with pathway score. BMC Bioinformatics 19, 391 (2018). https://doi.org/10.1186/s128590182411z
Received:
Accepted:
Published:
Keywords
 Association study
 Bayesian logistic regression
 Competing pathways
 Differentially expressed genes
 Geneset analysis
 Pathway ranking
 Pahtway score