As shown in Fig.1, MGOGP takes gene expression datasets, gene modules, known disease genes and gene ontology annotation information [36] as input, and the ranked genes as output. The main parts including: module importance measure, modulespecific gene importance measure, module rank and modulespecific gene prioritization, and global cancerrelated gene prioritization. Figure 2 schematically illustrates these steps in detail.
First, obtain functional gene modules; then get the global ranking of all modules and the local ranking of all modulespecific genes based on their importance; finally, the rank fusion algorithm further gives all genes a global rank.
Input datasets
As shown in Fig. 1, MGOGP takes gene expression datasets, gene modules, known diseaserelated genes and gene ontology annotation information as input. In this study, all gene modules are downloaded from GSEA website (http://software.broadinstitute.org/gsea/downloads.jsp). All GO ontologies of genes are downloaded from GeneCards [37, 38]. Information of relationships between GO terms are got from Gene Ontology Consortium website.
Module importance measure
We measure the importance of a module by: the number of differentially expressed genes in the module, the number of differential correlations between module genes and the basic importance of the module itself.
We use DESeq2 for gene differential expression analysis [3, 35, 39, 40]. If genes with padj(g_{
i
}) value bigger than the threshold value μ, we set Se(g_{
i
}) = 0. Otherwise, we set Se(g_{
i
}) = 1, which means the gene g_{
i
} is a candidate differential expression gene. Se(g_{
i
}) is defined as follows:
$$ Se\left({g}_i\right)=\left\{\begin{array}{l}0,\mathrm{if}\; padj\left({g}_i\right)>\mu \\ {}1,\mathrm{else}\end{array}\right. $$
(1)
To further improve the statistical significance of the selected candidate differential expression genes, we applied a multiple random sampling strategy. As defined in Eq. 2.
$$ DEG\left({g}_i\right)=\left\{\begin{array}{l}0,\mathrm{if}\frac{1}{s}\sum \limits_{s=1}^S Se\left({g}_i\right)<\omega \\ {}1,\mathrm{else}\end{array}\right. $$
(2)
Where S is the number of sampling; ω is a threshold value; if a gene g_{
i
} is selected as a differential expression gene we set DEG(g_{
i
}) = 1, Otherwise, we set DEG(g_{
i
}) = 0.
We define Ncr(m_{
j
}) as the ratio of differential expression genes in the module m_{
j
} as shown in Eq. 3:
$$ {\displaystyle \begin{array}{l} Ncr\left({m}_j\right)=\frac{\sum_{i=1}^N DEG\left({g}_i\right)}{N}\\ {}j\in 1,2,3,\dots, M\end{array}} $$
(3)
Where, g_{
i
} is the ith gene in the module m_{
j
}; N is the total number of genes in the module m_{
j
}; DEG(g_{
i
}) is defined in Eq. 2.
Next, for each pair of genes in the module m_{
j
}, two correlation values are calculated using normal and tumor samples respectively. As defined in Eqs. 4 and 5 respectively.
$$ {r}_N\left({g}_i,{g}_h\right)=\frac{\sum_{l=1}^L\left({x}_l\overline{x}\right)\left({y}_l\overline{y}\right)}{\sqrt{\sum_{l=1}^L{\left({x}_l\overline{x}\right)}^2{\left({y}_l\overline{y}\right)}^2}} $$
(4)
r_{
N
}(g_{
i
}, g_{
h
}) is the Pearson correlation value between gene g_{
i
} and gene g_{
h
} across all normal samples. L is the normal sample number.
$$ {r}_T\left({g}_i,{g}_h\right)=\frac{\sum_{q=1}^Q\left({x}_q\overline{x}\right)\left({y}_q\overline{y}\right)}{\sqrt{\sum_{q=1}^Q{\left({x}_q\overline{x}\right)}^2{\left({y}_q\overline{x}\right)}^2}} $$
(5)
r_{
T
}(g_{
i
}, g_{
h
}) is the Pearson correlation value between gene g_{
i
} and gene g_{
h
} across all tumor samples. Q is the tumor sample number.
To test whether the correlation coefficient between gene g_{
i
} and gene g_{
h
} is differentially correlated, we test whether r_{
T
}(g_{
i
}, g_{
h
}) and r_{
N
}(g_{
i
}, g_{
h
}) are significantly different. The two correlation coefficients are changed to Z_{
N
}(g_{
i
}, g_{
h
}) and Z_{
T
}(g_{
i
}, g_{
h
}) respectively.
$$ {Z}_N\left({g}_i,{g}_h\right)=\frac{1}{2}\log \frac{1+{r}_N\left({g}_i,{g}_h\right)}{1{r}_N\left({g}_i,{g}_h\right)} $$
(6)
Similarly, r_{
T
}(g_{
i
}, g_{
h
}) is changed to Z_{
T
}(g_{
i
}, g_{
h
}) as Eq. (6). The differential correlation is tested based on Fisher’s ztest [41]. As defined in Eq. (7):
$$ Z=\frac{Z_N\left({g}_i,{g}_h\right){Z}_T\left({g}_i,{g}_h\right)}{\sqrt{\frac{1}{L3}+\frac{1}{Q3}}} $$
(7)
The Z value has an approximately Gaussian distribution under the null hypothesis [41]. If the fdr value of a gene is bigger than the threshold value υ, we set Sc(g_{
i
}, g_{
h
}) = 0, otherwise we set Sc(g_{
i
}, g_{
h
}) = 1, which means the correlation coefficient is a potential differential correlation. Sc(g_{
i
}, g_{
h
}) is defined as follows:
$$ Sc\left({g}_i,{g}_h\right)=\left\{\begin{array}{l}0,\mathrm{if}\; fdr\left({g}_i,{g}_h\right)>\upsilon \\ {}1,\mathrm{else}\end{array}\right. $$
(8)
Where fdr(g_{
i
}, g_{
h
}) is the local falsediscovery rate (fdr) derived from fdrtool package [42]; υ is a threshold value.
As the way we find differential expression genes, we retain only those significantly changed correlations. As defined in Eq. 9:
$$ DEE\left({g}_i,{g}_h\right)=\left\{\begin{array}{l}0,\mathrm{if}\frac{1}{s}\sum \limits_{s=1}^S Sc\left({g}_i,{g}_h\right)<\delta \\ {}1,\mathrm{else}\end{array}\right. $$
(9)
Where S is the number of sampling; δ is a threshold value; we set DEE(g_{
i
}, g_{
h
}) = 1 if the gene g_{
i
} and g_{
h
} are differentially correlated. Otherwise, we set DEE(g_{
i
}, g_{
h
}) = 0.
We define Ecr(m_{
j
}) as the ratio of differential correlations among genes in the module m_{
j
}. Ecr(m_{
j
}) is defined in Eq. 10:
$$ {\displaystyle \begin{array}{l} Ecr\left({m}_j\right)=\frac{\sum_{k=1}^K DEE\left({g}_i,{g}_h\right)}{K}\\ {}K=\frac{N\left(N1\right)}{2}\; and\;i,h\in 1,2,3,\dots, N\end{array}} $$
(10)
K and N is the edge number and the gene number of the module m_{
j
} respectively.
We measure the basic importance of a module by calculating the ratio of known disease genes in a module, as shown in Eq. 11:
$$ info\left({m}_j\right)=\left( num\left({d}_j\right)+1\right)/N $$
(11)
num(d_{
j
}) is the number of known disease genes in the module m_{
j
}; N is the number of genes in the module m_{
j
}.
The module importance is defined in Eq. 12.
$$ {\displaystyle \begin{array}{l}p\left({m}_j\right)=\left(\left( Ncr\left({m}_j\right)+ Ecr\left({m}_j\right)\right)/2\right)\ast info\left({m}_j\right)\ \\ {}\ \mathrm{j}\in 1,2,3\dots, \mathrm{M}\end{array}} $$
(12)
where m_{
j
} means the jth module; M is the total number of modules.
Modulespecific gene importance measure
We measure the importance of a gene (p(g_{
i
})) in the module by measuring: the gene’s differential expression value, the number of differential correlations between the gene and all other module genes and the basic importance of the gene itself.
The number of differential correlations (CorC(g_{
i
})) between the gene g_{
i
} and all other genes in the same module is calculated as in Eq. 13.
$$ {\displaystyle \begin{array}{l} CorC\left({g}_i\right)=\frac{\sum_{h=1,h\ne i}^{N1} Sc\left({g}_i,{g}_h\right)}{N1}\\ {}i,h\in 1,2,3,\dots, N,{g}_i\in {m}_j\\ {}j\in 1,2,3,\dots, M\end{array}} $$
(13)
N is the number of genes in the module m_{
j
}; M is the total module number.
Finally, the basic importance of a gene is determined by the gene ontologybased fuzzy measure similarity values between the gene and all known disease gene (if exist) in the same module. As shown in Eq. 14.
$$ {\displaystyle \begin{array}{l} info\left({m}_j\_{g}_i\right)=\\ {}\left\{\begin{array}{l}0, if\; num\left({m}_j\_{d}_h\right)=0\\ {}1, if\;{g}_i\; is\;a\; known\kern0.17em disease\kern0.17em gene\kern0.17em itself\\ {}{\sum}_{h=1}^{num\left({m}_j\_{d}_h\right)}{S}_{FMS}\left({g}_i,{m}_j\_{d}_h\right)/ num\left({m}_j\_{d}_h\right), else\end{array}\right.\end{array}} $$
(14)
num(m_{
j
}_d_{
h
}) is the number of known disease genes in the module m_{
j
}. If num(m_{
j
}_d_{
h
}) = 0, which means no known disease gene in the module m_{
j
}, we set info(m_{
j
}_g_{
i
}) = 0. If g_{
i
} itself is a known disease gene, we set info(m_{
j
}_g_{
i
}) = 1. Otherwise, we calculate the gene importance value based on the fuzzy similarity measure between the gene and all the known disease gene in the module m_{
j
}. S_{
FMS
}(m_{
j
}_g_{
i
}, m_{
j
}_d_{
h
}) is defined in Eq. 15, as in [43]:
$$ {S}_{FMS}\left({\mathrm{m}}_j\_{g}_i,{\mathrm{m}}_j\_{d}_h\right)=\frac{Sm_i\left({T}_{{\mathrm{m}}_j\_{g}_i}\cap {T}_{m_j\_{d}_h}\right)+{Sm}_h\left({T}_{{\mathrm{m}}_j\_{g}_i}\cap {T}_{m_j\_{d}_h}\right)}{2} $$
(15)
Where Sm_{
i
} is the Sugeno measure [43] defined on GO terms of gene m_{
j
}_g_{
i
} and Sm_{
h
} is the Sugeno measure defined on GO terms of module disease gene m_{
j
}_d_{
h
}.
Let \( {T}_{{\mathrm{m}}_j\_{g}_i} \) is the set of GO annotation terms of gene m_{
j
}_g_{
i
}, Sm_{
i
}, is a real value function, satisfying [44]:

1)
\( {Sm}_i\left({T}_{{\mathrm{m}}_j\_{g}_i}\right)=0, if\;{T}_{{\mathrm{m}}_j\_{g}_i}=\varnothing, else\;{Sm}_i\left({T}_{{\mathrm{m}}_j\_{g}_i}\right)=1. \)

2)
\( {Sm}_i\left({T}_{{\mathrm{m}}_j\_{g}_i}\right)\le Sm\left({T}_{m_j\_{d}_h}\right)\; if\;{T}_{{\mathrm{m}}_j\_{g}_i}\subseteq {T}_{m_j\_{d}_h} \)

3)
For all \( {T}_A,{T}_B\subseteq {T}_{{\mathrm{m}}_j\_{g}_i} \) with T_{
A
} ∩ T_{
B
} = Φ
\( {\displaystyle \begin{array}{l}{Sm}_i\left({T}_A\cup {T}_B\right)={Sm}_i\left({T}_A\right)+{Sm}_i\left({T}_B\right)\\ {}+\lambda {Sm}_i\left({T}_A\right){Sm}_i\left({T}_B\right),\lambda >1\end{array}} \)
For a given gene annotation set \( {T}_{{\mathrm{m}}_j\_{g}_i} \), the parameter λ of its Sugeno fuzzy measure can be uniquely solved as in Eq. 16:
$$ \left(1+\lambda \right)=\prod \limits_{i=1}^n\left(1+\lambda {Sm}_i\right) $$
(16)
This equation has a unique solution for λ > −1. Let Sm_{
k
} = Sm({T_{
k
}}). The mapping T_{
k
} → Sm_{
k
} is called a fuzzy density function. The fuzzy density value, Sm_{
k
}, is interpreted as the importance of the single information source T_{
k
} in determining the similarity of two genes. As defined in Eq. 17:
$$ {Sm}_k=\ln \left(p\left({T}_k\right)/\underset{T_j\in {T}_{g_i}}{\max}\left\{\ln \left(p\left({T}_j\right)\right)\right\}\right) $$
(17)
Where p(T_{
k
}) is defined in Eq. 18:
$$ {\displaystyle \begin{array}{l}p\left({T}_k\right)=\\ {}\left(\frac{count\left({T}_k+ children\kern0.17em of\;{T}_k\; in\kern0.17em corpus\right)}{count\left( all\; GO\; term\mathrm{s}\;\mathrm{in}\; corpus\right)}\right)\\ {}1\le k\le \mid {T}_{g_i}\mid \end{array}} $$
(18)
The importance of gene (p(g_{
i
})) in a module is defined in Eq. 19.
$$ {\displaystyle \begin{array}{l}p\left({g}_i\right)= padj\left({g}_i\right)+ CorC\left({g}_i\right)+ info\left({g}_i\right)\\ {}i\in 1,2,3,\dots, N,{g}_i\in {m}_j\end{array}} $$
(19)
N is the number of genes in the module m_{
j
}.
Global gene ranking
Most genes deploy their functions in the context of sophisticated functional modules [45, 46]. Therefore, the global rank of a gene need be decided by its own importance and the importance of its affiliated module. As in [34], a rank fusion strategy is used to fuse the local rank of genes in each module into a global rank.
The rank fusion strategy is a recursive process. It decides the rank of the nth gene based on all the topranked n − 1 genes. We define i as the number of genes having already obtained their global ranking in the recursive process of rank fusion, m(i, j) as the number of top i genes located in the module j after having determined the top i genes. t(i, j) as the expectation of the number of top i genes located in the module j. e(i, j) as the expectation of probability that the i + 1 globally ranked genes come from the module j. We use the module importance value p(m_{
j
}) as the probability of a diseaserelated gene comes from it. The relationship between i, m(i, j), t(i, j) and p(m_{
j
}) is shown in Eq. 20:
$$ {\displaystyle \begin{array}{l}t\left(i,j\right)= ip\left({m}_j\right)\\ {}e\left(i,j\right)=t\left(i+1,j\right)m\left(i,j\right)\end{array}} $$
(20)
Initially, the first ranked gene in the module with highest importance value is chosen as the top 1 gene in the gene’s global rank, because all genes in each module have been ranked from big to small according to their importance value. Let i as the number of genes having obtained their global ranking, to decide the i + 1 ranked gene, we need to find the module with the biggest e(i, j) value, because e(i, j) indicates the expectation of probability that the i + 1 globally ranked genes from module j. So the genes ranked m(i, j) + 1 in the module j will be chosen as the top i + 1 ranked gene, because in the module j, top m(i, j) genes has obtained the global ranking. Repeat the process until all genes get ranked. As shown in Fig. 3 (in Additional file 1).