As shown in Fig.1, MGOGP takes gene expression datasets, gene modules, known disease genes and gene ontology annotation information [36] as input, and the ranked genes as output. The main parts including: module importance measure, module-specific gene importance measure, module rank and module-specific gene prioritization, and global cancer-related gene prioritization. Figure 2 schematically illustrates these steps in detail.
First, obtain functional gene modules; then get the global ranking of all modules and the local ranking of all module-specific genes based on their importance; finally, the rank fusion algorithm further gives all genes a global rank.
Input datasets
As shown in Fig. 1, MGOGP takes gene expression datasets, gene modules, known disease-related genes and gene ontology annotation information as input. In this study, all gene modules are downloaded from GSEA website (http://software.broadinstitute.org/gsea/downloads.jsp). All GO ontologies of genes are downloaded from GeneCards [37, 38]. Information of relationships between GO terms are got from Gene Ontology Consortium website.
Module importance measure
We measure the importance of a module by: the number of differentially expressed genes in the module, the number of differential correlations between module genes and the basic importance of the module itself.
We use DESeq2 for gene differential expression analysis [3, 35, 39, 40]. If genes with padj(g
i
) value bigger than the threshold value μ, we set Se(g
i
) = 0. Otherwise, we set Se(g
i
) = 1, which means the gene g
i
is a candidate differential expression gene. Se(g
i
) is defined as follows:
$$ Se\left({g}_i\right)=\left\{\begin{array}{l}0,\mathrm{if}\; padj\left({g}_i\right)>\mu \\ {}1,\mathrm{else}\end{array}\right. $$
(1)
To further improve the statistical significance of the selected candidate differential expression genes, we applied a multiple random sampling strategy. As defined in Eq. 2.
$$ DEG\left({g}_i\right)=\left\{\begin{array}{l}0,\mathrm{if}\frac{1}{s}\sum \limits_{s=1}^S Se\left({g}_i\right)<\omega \\ {}1,\mathrm{else}\end{array}\right. $$
(2)
Where S is the number of sampling; ω is a threshold value; if a gene g
i
is selected as a differential expression gene we set DEG(g
i
) = 1, Otherwise, we set DEG(g
i
) = 0.
We define Ncr(m
j
) as the ratio of differential expression genes in the module m
j
as shown in Eq. 3:
$$ {\displaystyle \begin{array}{l} Ncr\left({m}_j\right)=\frac{\sum_{i=1}^N DEG\left({g}_i\right)}{N}\\ {}j\in 1,2,3,\dots, M\end{array}} $$
(3)
Where, g
i
is the ith gene in the module m
j
; N is the total number of genes in the module m
j
; DEG(g
i
) is defined in Eq. 2.
Next, for each pair of genes in the module m
j
, two correlation values are calculated using normal and tumor samples respectively. As defined in Eqs. 4 and 5 respectively.
$$ {r}_N\left({g}_i,{g}_h\right)=\frac{\sum_{l=1}^L\left({x}_l-\overline{x}\right)\left({y}_l-\overline{y}\right)}{\sqrt{\sum_{l=1}^L{\left({x}_l-\overline{x}\right)}^2{\left({y}_l-\overline{y}\right)}^2}} $$
(4)
r
N
(g
i
, g
h
) is the Pearson correlation value between gene g
i
and gene g
h
across all normal samples. L is the normal sample number.
$$ {r}_T\left({g}_i,{g}_h\right)=\frac{\sum_{q=1}^Q\left({x}_q-\overline{x}\right)\left({y}_q-\overline{y}\right)}{\sqrt{\sum_{q=1}^Q{\left({x}_q-\overline{x}\right)}^2{\left({y}_q-\overline{x}\right)}^2}} $$
(5)
r
T
(g
i
, g
h
) is the Pearson correlation value between gene g
i
and gene g
h
across all tumor samples. Q is the tumor sample number.
To test whether the correlation coefficient between gene g
i
and gene g
h
is differentially correlated, we test whether r
T
(g
i
, g
h
) and r
N
(g
i
, g
h
) are significantly different. The two correlation coefficients are changed to Z
N
(g
i
, g
h
) and Z
T
(g
i
, g
h
) respectively.
$$ {Z}_N\left({g}_i,{g}_h\right)=\frac{1}{2}\log \frac{1+{r}_N\left({g}_i,{g}_h\right)}{1-{r}_N\left({g}_i,{g}_h\right)} $$
(6)
Similarly, r
T
(g
i
, g
h
) is changed to Z
T
(g
i
, g
h
) as Eq. (6). The differential correlation is tested based on Fisher’s z-test [41]. As defined in Eq. (7):
$$ Z=\frac{Z_N\left({g}_i,{g}_h\right)-{Z}_T\left({g}_i,{g}_h\right)}{\sqrt{\frac{1}{L-3}+\frac{1}{Q-3}}} $$
(7)
The Z value has an approximately Gaussian distribution under the null hypothesis [41]. If the fdr value of a gene is bigger than the threshold value υ, we set Sc(g
i
, g
h
) = 0, otherwise we set Sc(g
i
, g
h
) = 1, which means the correlation coefficient is a potential differential correlation. Sc(g
i
, g
h
) is defined as follows:
$$ Sc\left({g}_i,{g}_h\right)=\left\{\begin{array}{l}0,\mathrm{if}\; fdr\left({g}_i,{g}_h\right)>\upsilon \\ {}1,\mathrm{else}\end{array}\right. $$
(8)
Where fdr(g
i
, g
h
) is the local false-discovery rate (fdr) derived from fdrtool package [42]; υ is a threshold value.
As the way we find differential expression genes, we retain only those significantly changed correlations. As defined in Eq. 9:
$$ DEE\left({g}_i,{g}_h\right)=\left\{\begin{array}{l}0,\mathrm{if}\frac{1}{s}\sum \limits_{s=1}^S Sc\left({g}_i,{g}_h\right)<\delta \\ {}1,\mathrm{else}\end{array}\right. $$
(9)
Where S is the number of sampling; δ is a threshold value; we set DEE(g
i
, g
h
) = 1 if the gene g
i
and g
h
are differentially correlated. Otherwise, we set DEE(g
i
, g
h
) = 0.
We define Ecr(m
j
) as the ratio of differential correlations among genes in the module m
j
. Ecr(m
j
) is defined in Eq. 10:
$$ {\displaystyle \begin{array}{l} Ecr\left({m}_j\right)=\frac{\sum_{k=1}^K DEE\left({g}_i,{g}_h\right)}{K}\\ {}K=\frac{N\left(N-1\right)}{2}\; and\;i,h\in 1,2,3,\dots, N\end{array}} $$
(10)
K and N is the edge number and the gene number of the module m
j
respectively.
We measure the basic importance of a module by calculating the ratio of known disease genes in a module, as shown in Eq. 11:
$$ info\left({m}_j\right)=\left( num\left({d}_j\right)+1\right)/N $$
(11)
num(d
j
) is the number of known disease genes in the module m
j
; N is the number of genes in the module m
j
.
The module importance is defined in Eq. 12.
$$ {\displaystyle \begin{array}{l}p\left({m}_j\right)=\left(\left( Ncr\left({m}_j\right)+ Ecr\left({m}_j\right)\right)/2\right)\ast info\left({m}_j\right)\ \\ {}\ \mathrm{j}\in 1,2,3\dots, \mathrm{M}\end{array}} $$
(12)
where m
j
means the jth module; M is the total number of modules.
Module-specific gene importance measure
We measure the importance of a gene (p(g
i
)) in the module by measuring: the gene’s differential expression value, the number of differential correlations between the gene and all other module genes and the basic importance of the gene itself.
The number of differential correlations (CorC(g
i
)) between the gene g
i
and all other genes in the same module is calculated as in Eq. 13.
$$ {\displaystyle \begin{array}{l} CorC\left({g}_i\right)=\frac{\sum_{h=1,h\ne i}^{N-1} Sc\left({g}_i,{g}_h\right)}{N-1}\\ {}i,h\in 1,2,3,\dots, N,{g}_i\in {m}_j\\ {}j\in 1,2,3,\dots, M\end{array}} $$
(13)
N is the number of genes in the module m
j
; M is the total module number.
Finally, the basic importance of a gene is determined by the gene ontology-based fuzzy measure similarity values between the gene and all known disease gene (if exist) in the same module. As shown in Eq. 14.
$$ {\displaystyle \begin{array}{l} info\left({m}_j\_{g}_i\right)=\\ {}\left\{\begin{array}{l}0, if\; num\left({m}_j\_{d}_h\right)=0\\ {}1, if\;{g}_i\; is\;a\; known\kern0.17em disease\kern0.17em gene\kern0.17em itself\\ {}{\sum}_{h=1}^{num\left({m}_j\_{d}_h\right)}{S}_{FMS}\left({g}_i,{m}_j\_{d}_h\right)/ num\left({m}_j\_{d}_h\right), else\end{array}\right.\end{array}} $$
(14)
num(m
j
_d
h
) is the number of known disease genes in the module m
j
. If num(m
j
_d
h
) = 0, which means no known disease gene in the module m
j
, we set info(m
j
_g
i
) = 0. If g
i
itself is a known disease gene, we set info(m
j
_g
i
) = 1. Otherwise, we calculate the gene importance value based on the fuzzy similarity measure between the gene and all the known disease gene in the module m
j
. S
FMS
(m
j
_g
i
, m
j
_d
h
) is defined in Eq. 15, as in [43]:
$$ {S}_{FMS}\left({\mathrm{m}}_j\_{g}_i,{\mathrm{m}}_j\_{d}_h\right)=\frac{Sm_i\left({T}_{{\mathrm{m}}_j\_{g}_i}\cap {T}_{m_j\_{d}_h}\right)+{Sm}_h\left({T}_{{\mathrm{m}}_j\_{g}_i}\cap {T}_{m_j\_{d}_h}\right)}{2} $$
(15)
Where Sm
i
is the Sugeno measure [43] defined on GO terms of gene m
j
_g
i
and Sm
h
is the Sugeno measure defined on GO terms of module disease gene m
j
_d
h
.
Let \( {T}_{{\mathrm{m}}_j\_{g}_i} \) is the set of GO annotation terms of gene m
j
_g
i
, Sm
i
, is a real value function, satisfying [44]:
-
1)
\( {Sm}_i\left({T}_{{\mathrm{m}}_j\_{g}_i}\right)=0, if\;{T}_{{\mathrm{m}}_j\_{g}_i}=\varnothing, else\;{Sm}_i\left({T}_{{\mathrm{m}}_j\_{g}_i}\right)=1. \)
-
2)
\( {Sm}_i\left({T}_{{\mathrm{m}}_j\_{g}_i}\right)\le Sm\left({T}_{m_j\_{d}_h}\right)\; if\;{T}_{{\mathrm{m}}_j\_{g}_i}\subseteq {T}_{m_j\_{d}_h} \)
-
3)
For all \( {T}_A,{T}_B\subseteq {T}_{{\mathrm{m}}_j\_{g}_i} \) with T
A
∩ T
B
= Φ
\( {\displaystyle \begin{array}{l}{Sm}_i\left({T}_A\cup {T}_B\right)={Sm}_i\left({T}_A\right)+{Sm}_i\left({T}_B\right)\\ {}+\lambda {Sm}_i\left({T}_A\right){Sm}_i\left({T}_B\right),\lambda >-1\end{array}} \)
For a given gene annotation set \( {T}_{{\mathrm{m}}_j\_{g}_i} \), the parameter λ of its Sugeno fuzzy measure can be uniquely solved as in Eq. 16:
$$ \left(1+\lambda \right)=\prod \limits_{i=1}^n\left(1+\lambda {Sm}_i\right) $$
(16)
This equation has a unique solution for λ > −1. Let Sm
k
= Sm({T
k
}). The mapping T
k
→ Sm
k
is called a fuzzy density function. The fuzzy density value, Sm
k
, is interpreted as the importance of the single information source T
k
in determining the similarity of two genes. As defined in Eq. 17:
$$ {Sm}_k=-\ln \left(p\left({T}_k\right)/\underset{T_j\in {T}_{g_i}}{\max}\left\{-\ln \left(p\left({T}_j\right)\right)\right\}\right) $$
(17)
Where p(T
k
) is defined in Eq. 18:
$$ {\displaystyle \begin{array}{l}p\left({T}_k\right)=\\ {}\left(\frac{count\left({T}_k+ children\kern0.17em of\;{T}_k\; in\kern0.17em corpus\right)}{count\left( all\; GO\; term\mathrm{s}\;\mathrm{in}\; corpus\right)}\right)\\ {}1\le k\le \mid {T}_{g_i}\mid \end{array}} $$
(18)
The importance of gene (p(g
i
)) in a module is defined in Eq. 19.
$$ {\displaystyle \begin{array}{l}p\left({g}_i\right)= padj\left({g}_i\right)+ CorC\left({g}_i\right)+ info\left({g}_i\right)\\ {}i\in 1,2,3,\dots, N,{g}_i\in {m}_j\end{array}} $$
(19)
N is the number of genes in the module m
j
.
Global gene ranking
Most genes deploy their functions in the context of sophisticated functional modules [45, 46]. Therefore, the global rank of a gene need be decided by its own importance and the importance of its affiliated module. As in [34], a rank fusion strategy is used to fuse the local rank of genes in each module into a global rank.
The rank fusion strategy is a recursive process. It decides the rank of the nth gene based on all the top-ranked n − 1 genes. We define i as the number of genes having already obtained their global ranking in the recursive process of rank fusion, m(i, j) as the number of top i genes located in the module j after having determined the top i genes. t(i, j) as the expectation of the number of top i genes located in the module j. e(i, j) as the expectation of probability that the i + 1 globally ranked genes come from the module j. We use the module importance value p(m
j
) as the probability of a disease-related gene comes from it. The relationship between i, m(i, j), t(i, j) and p(m
j
) is shown in Eq. 20:
$$ {\displaystyle \begin{array}{l}t\left(i,j\right)= ip\left({m}_j\right)\\ {}e\left(i,j\right)=t\left(i+1,j\right)-m\left(i,j\right)\end{array}} $$
(20)
Initially, the first ranked gene in the module with highest importance value is chosen as the top 1 gene in the gene’s global rank, because all genes in each module have been ranked from big to small according to their importance value. Let i as the number of genes having obtained their global ranking, to decide the i + 1 ranked gene, we need to find the module with the biggest e(i, j) value, because e(i, j) indicates the expectation of probability that the i + 1 globally ranked genes from module j. So the genes ranked m(i, j) + 1 in the module j will be chosen as the top i + 1 ranked gene, because in the module j, top m(i, j) genes has obtained the global ranking. Repeat the process until all genes get ranked. As shown in Fig. 3 (in Additional file 1).