NaviGO is available as a web-based tool at http://kiharalab.org/web/navigo. The source code is available on GitHub at https://github.com/kiharalab/NaviGO and https://github.com/kiharalab/GOVisualizer.
GO similarity/association scores
In NaviGO, six scores can be used to quantify the similarity or association of GO terms. Three scores quantify the semantic similarity of GO terms: Resnik’s, Lin’s, and the relevance semantic similarity score. The other three scores, CAS, PAS, and IAS, quantify GO associations. Detailed explanations of the scores are provided in separate sections below.
To quantify the functional similarity of two genes, the funsim score [4, 5] is used. Funsim of two sets of terms, i.e. the GO annotations of two genes, is calculated from an all-by-all similarity matrix, where each entry is the similarity score of one GO term pair, computed with the score of the user’s choice.
CAS and PAS
We previously developed two function association scores, the Co-occurrence Association Score (CAS) and the PubMed Association Score (PAS) [3]. CAS quantifies the frequency of co-occurring GO terms within the gene annotations in the GOA database, while PAS considers the co-occurrence of GO terms in PubMed abstracts. A characteristic that differentiates the two scores from other methods is that they can be defined for cross-domain associations between GO terms, i.e. terms from Molecular Function (MF) and Biological Process (BP), those from MF and Cellular Component (CC), and those from BP and CC.
$$ CAS\left(i, j\right)=\frac{\frac{c\left(i, j\right)}{\sum_{ij} c\left(i, j\right)}}{\left(\frac{c(i)}{\sum_k c(k)}\right)\left(\frac{c(j)}{\sum_k c(k)}\right)} $$
(1)
where c(i,j) is the number of sequences in the database that are annotated with both GO terms i and j. Similarly, c(i) and c(j) are the total numbers of sequences annotated with GO term i and GO term j, respectively. The numerator of Eq. 1, \( \frac{c\left(i, j\right)}{\sum_{ij} c\left(i, j\right)} \), is essentially the fraction of sequences that are annotated with the two particular GO terms, i and j, among all the sequences in the database. The denominator multiplies the fraction (probability) of sequences in the database that are annotated with GO term i by the fraction of sequences that are annotated with GO term j. Thus, it is the expected fraction of sequences in the database carrying both annotations, i and j, if i and j were assigned to sequences at random. Altogether, CAS quantifies how often two GO terms i and j co-annotate sequences relative to random chance. CAS = 1 means that i and j are co-annotated as often as expected by random chance, and a larger value indicates that i and j are correlated in gene annotation.
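As an illustration of Eq. 1, the following is a minimal Python sketch, not the NaviGO implementation. It assumes that annotations are given as a mapping from a sequence identifier to a set of GO terms, and that the sum over pairs in the numerator runs over unordered term pairs; the function name `cas_scores` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cas_scores(annotations):
    """Compute CAS (Eq. 1) for every co-occurring GO term pair.

    annotations: dict mapping a sequence id to the set of its GO terms.
    Assumption: the pair sum in Eq. 1 is taken over unordered pairs.
    """
    single = Counter()  # c(i): sequences annotated with term i
    pair = Counter()    # c(i, j): sequences annotated with both i and j
    for terms in annotations.values():
        single.update(terms)
        pair.update(frozenset(p) for p in combinations(sorted(terms), 2))
    total_single = sum(single.values())  # sum over k of c(k)
    total_pair = sum(pair.values())      # sum over pairs of c(i, j)
    scores = {}
    for key, c_ij in pair.items():
        i, j = sorted(key)
        observed = c_ij / total_pair
        expected = (single[i] / total_single) * (single[j] / total_single)
        scores[(i, j)] = observed / expected
    return scores

# Hypothetical toy annotation set, for illustration only.
toy_annotations = {
    "seq1": {"GO:A", "GO:B"},
    "seq2": {"GO:A", "GO:B"},
    "seq3": {"GO:A", "GO:C"},
    "seq4": {"GO:C"},
}
toy_cas = cas_scores(toy_annotations)
```

Pairs that never co-occur are simply absent from the returned dictionary, which corresponds to CAS = 0.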
Similarly, PAS is defined as:
$$ PAS\left(i, j\right)=\frac{\frac{Pub\left(i, j\right)}{\sum_{i,j} Pub\left(i, j\right)}}{\left(\frac{Pub(i)}{\sum_k Pub(k)}\right)\left(\frac{Pub(j)}{\sum_k Pub(k)}\right)}=\frac{Pub\left(i, j\right)}{Pub(i)\,Pub(j)}\cdot \frac{\left(\sum_k Pub(k)\right)^2}{\sum_{k,l} Pub\left(k, l\right)} $$
(2)
Here, Pub(i,j) is the number of PubMed abstracts that contain both GO terms i and j. Similarly, Pub(i) and Pub(j) are the numbers of abstracts that contain GO term i and GO term j, respectively. The numerator of Eq. 2, \( \frac{Pub\left(i, j\right)}{\sum_{i,j} Pub\left(i, j\right)} \), is the fraction of abstracts that mention the two particular GO terms, i and j, among all the abstracts in the PubMed database. The denominator multiplies the fraction (probability) of abstracts in PubMed that mention GO term i by the fraction of abstracts that mention GO term j. Thus, it is the expected fraction of abstracts in the database that mention both GO terms, i and j, if i and j appeared in abstracts at random. Altogether, PAS quantifies how often two GO terms i and j are co-mentioned in PubMed abstracts relative to random chance. PAS = 1 means that GO terms i and j are co-mentioned as often as expected by chance, i.e. they are not related, and a larger value indicates that i and j are related and frequently co-mentioned in biological contexts. Importantly, GO terms that do not have a high functional similarity score (Resnik’s, Lin’s, or the relevance similarity score) can still have a high CAS or PAS. High PAS and CAS imply that proteins annotated with the GO terms are functionally related and play roles in the same biological context, e.g. the same pathway.
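The two forms of Eq. 2 can be checked against each other numerically. The sketch below assumes that abstract counts are already available as dictionaries (how GO terms are matched in abstract text is outside its scope); the function names and the counts are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def pas(pub_pair, pub_single, i, j):
    """Left-hand form of Eq. 2: observed co-mention fraction over expected."""
    total_pair = sum(pub_pair.values())      # sum over pairs of Pub(i, j)
    total_single = sum(pub_single.values())  # sum over k of Pub(k)
    observed = pub_pair.get(frozenset((i, j)), 0) / total_pair
    expected = (pub_single[i] / total_single) * (pub_single[j] / total_single)
    return observed / expected

def pas_factored(pub_pair, pub_single, i, j):
    """Right-hand form of Eq. 2: pair-specific ratio times a corpus constant."""
    constant = sum(pub_single.values()) ** 2 / sum(pub_pair.values())
    ratio = pub_pair.get(frozenset((i, j)), 0) / (pub_single[i] * pub_single[j])
    return ratio * constant

# Hypothetical abstract counts, for illustration only.
toy_pub_single = {"GO:A": 40, "GO:B": 10, "GO:C": 50}
toy_pub_pair = {frozenset(("GO:A", "GO:B")): 8, frozenset(("GO:A", "GO:C")): 2}
```

With these toy counts, the pair GO:A/GO:B scores well above 1, whereas GO:A/GO:C sits at the chance level of 1. The factored form makes explicit that the second fraction is the same for every pair, so the ranking of pairs depends only on Pub(i,j)/(Pub(i)Pub(j)).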
IAS
The Interaction Association Score (IAS) [7] captures the propensity of GO term pairs to occur in interacting proteins by counting the number of GO term pairs that occur in interacting proteins, normalized by random chance. Thus, a high IAS between the GO terms of two proteins indicates a high possibility that the two proteins interact with each other. The GO_IAS for each GO term pair was computed as follows:
$$ GO\_ I A S\left( GOx, GOy\right)=\frac{\frac{N\left( GOx, GOy\right)}{\# T. Edges}}{\left(\frac{N(GOx)}{\# T. Nodes}\right)\left(\frac{N(GOy)}{\# T. Nodes}\right)} $$
(3)
where N(GOx, GOy) is the number of times the GO term pair GOx and GOy occurs in interacting proteins in the PPI networks, #T.Edges is the total number of interactions (edges) in the PPI networks, N(GOx) and N(GOy) are the numbers of times GOx and GOy independently occur in proteins in the networks, and #T.Nodes is the total number of proteins in the PPI networks. Figure 9 shows an example of a small PPI network. This network has 5 edges between 5 proteins; 3 proteins are annotated with GO:1, and 2 proteins with GO:2. There are 2 edges that connect GO:1 and GO:2 (P1 to P2 and P2 to P4). From this network, GO_IAS for GO:1 and GO:2 is computed as (2/5)/((3/5)(2/5)) = 1.67. Similar to PAS and CAS, IAS quantifies how often two GO terms are observed in physically interacting proteins in a protein-protein interaction network relative to the number of observations expected by random chance. If two proteins are annotated with GO terms that have a high IAS, it suggests that the proteins may physically interact with each other.
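The worked example can be reproduced with a short Python sketch of Eq. 3. The figure itself is not reproduced here, so the network below is an assumption: the text fixes only the counts and the two GO:1 to GO:2 edges (P1 to P2 and P2 to P4), while the remaining three edges and the assignment of terms to P3 and P5 are hypothetical choices consistent with those counts. The function name `go_ias` is also hypothetical.

```python
def go_ias(edges, annotations, go_x, go_y):
    """Eq. 3 for one GO term pair on an undirected PPI network.

    edges: list of (protein, protein) tuples.
    annotations: dict mapping each protein in the network to its GO terms.
    """
    n_edges = len(edges)        # #T.Edges
    n_nodes = len(annotations)  # #T.Nodes
    n_x = sum(go_x in terms for terms in annotations.values())  # N(GOx)
    n_y = sum(go_y in terms for terms in annotations.values())  # N(GOy)
    # N(GOx, GOy): edges whose endpoints carry GOx on one side, GOy on the other.
    n_xy = sum(
        (go_x in annotations[u] and go_y in annotations[v])
        or (go_y in annotations[u] and go_x in annotations[v])
        for u, v in edges
    )
    return (n_xy / n_edges) / ((n_x / n_nodes) * (n_y / n_nodes))

# Assumed topology: only the edges P1-P2 and P2-P4 come from the text.
toy_terms = {
    "P1": {"GO:2"}, "P2": {"GO:1"}, "P3": {"GO:1"},
    "P4": {"GO:2"}, "P5": {"GO:1"},
}
toy_edges = [("P1", "P2"), ("P2", "P4"), ("P2", "P3"), ("P3", "P5"), ("P1", "P4")]
toy_ias = go_ias(toy_edges, toy_terms, "GO:1", "GO:2")
```

Under these assumptions `toy_ias` evaluates to (2/5)/((3/5)(2/5)), i.e. about 1.67, matching the value in the text.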
A significant difference between CAS, PAS, and IAS and the conventional GO functional similarity scores described in the next section is that the former three scores quantify the functional relevance of GO term pairs in biological contexts: co-annotation to genes (CAS), co-mention in PubMed abstracts (PAS), and occurrence in interacting protein pairs (IAS). By design, these scores are capable of identifying proteins in the same pathways (CAS, PAS) [3] and physically interacting proteins (IAS) [7]. The correlation of CAS/PAS/IAS with regular functional similarity scores (below) is not very high [3, 7], because proteins in the same pathway and physically interacting proteins do not necessarily have similar functions.
Resnik, Lin’s, and relevance similarity scores
For quantifying GO term similarity, NaviGO provides three score options. Resnik’s similarity score [5] measures the semantic similarity of a GO term pair according to the information content of the lowest common ancestor (LCA) of the pair, while Lin’s similarity score is based on the information content of both the LCA and the queried GO term pair [3].
$$ sim_{Resnik}\left(c_1, c_2\right)=\max_{c\in S\left(c_1, c_2\right)}\left(-\log p(c)\right) $$
(4)
$$ sim_{Lin}\left(c_1, c_2\right)=\max_{c\in S\left(c_1, c_2\right)}\left(\frac{2\cdot \log p(c)}{\log p\left(c_1\right)+\log p\left(c_2\right)}\right) $$
(5)
Here, p(c) is the probability of a GO term c, defined as the fraction of occurrences of c in the GO database, and S(c1,c2) is the set of common ancestors of the GO terms c1 and c2. The root of the ontology has a probability of 1.0.
The relevance semantic similarity score (simRel) [4], which computes the functional similarity of a pair of GO terms, c1 and c2, is defined as:
$$ sim_{Rel}\left(c_1, c_2\right)=\max_{c\in S\left(c_1, c_2\right)}\left(\frac{2\cdot \log p(c)}{\log p\left(c_1\right)+\log p\left(c_2\right)}\cdot \left(1-p(c)\right)\right) $$
(6)
The first factor considers the depth of the common ancestor c relative to the average depth of the two terms c1 and c2, while the second factor takes into account how rare it is to identify the common ancestor c by chance.
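The three scores in Eqs. 4 to 6 can be sketched together in Python, since they share the set of common ancestors. This is an illustration on a hypothetical five-term ontology with made-up probabilities, not NaviGO code. It assumes that a term counts as its own ancestor, and it returns 0 for Lin’s score and simRel when both terms are the root, where the ratio is undefined.

```python
import math

def ancestors(term, parents):
    """Return the term itself plus all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def similarities(c1, c2, parents, p):
    """Return (Resnik, Lin, simRel) for terms c1 and c2 following Eqs. 4-6."""
    common = ancestors(c1, parents) & ancestors(c2, parents)  # S(c1, c2)
    resnik = max(-math.log(p[c]) for c in common)
    denom = math.log(p[c1]) + math.log(p[c2])
    if denom == 0:  # both terms are the root: ratio undefined, report 0
        return resnik, 0.0, 0.0
    lin = max(2 * math.log(p[c]) / denom for c in common)
    rel = max(2 * math.log(p[c]) / denom * (1 - p[c]) for c in common)
    return resnik, lin, rel

# Hypothetical ontology (child -> parents) and term probabilities.
toy_parents = {"root": [], "A": ["root"], "B": ["A"], "C": ["A"], "D": ["root"]}
toy_p = {"root": 1.0, "A": 0.2, "B": 0.05, "C": 0.1, "D": 0.3}
toy_sims = similarities("B", "C", toy_parents, toy_p)
```

For the sibling terms B and C, the most informative common ancestor is A, so Resnik’s score equals -log 0.2 and simRel is Lin’s score scaled by 1 - 0.2. Terms B and D share only the root, so all three scores are 0.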
Functional similarity score of gene pairs
To quantify the functional similarity of two annotated genes, we used the funsim score [4, 5]. The funsim score of two sets of terms, \( GO^A \) and \( GO^B \) for genes A and B, of sizes N and M, respectively, is calculated from an all-by-all similarity matrix \( s_{ij} \).
$$ s_{ij}=sim\left(GO_i^A, GO_j^B\right),\quad \forall i\in \left\{1, \ldots, N\right\},\ \forall j\in \left\{1, \ldots, M\right\} $$
(7)
For \( sim\left(GO_i^A, GO_j^B\right) \), the relevance similarity score is usually used, but other scores can be used, too. Since the relevance similarity score is defined only for GO pairs of the same category, a matrix is computed separately for each of the three categories, BP, CC, and MF:
$$ GO_{score}=\max \left(\frac{1}{N}\sum_{i=1}^{N}\max_{1\le j\le M} s_{ij},\ \frac{1}{M}\sum_{j=1}^{M}\max_{1\le i\le N} s_{ij}\right) $$
(8)
GOscore is computed for each of the three categories, giving MFscore, BPscore, and CCscore. Finally, the funsim score is computed as
$$ funsim=\frac{1}{3}\left[{\left(\frac{MFscore}{ \max (MFscore)}\right)}^2+{\left(\frac{BPscore}{ \max (BPscore)}\right)}^2+{\left(\frac{CCscore}{ \max (CCscore)}\right)}^2\right] $$
(9)
where max(GOscore) = 1 (the maximum possible GOscore), so that the funsim score ranges over [0, 1].
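Eqs. 8 and 9 can be sketched in a few lines of Python. The sketch assumes that each per-category similarity matrix is a non-empty list of rows (rows for the terms of gene A, columns for the terms of gene B) and does not cover the case where a gene has no annotation in a category; the function names and the toy values are hypothetical.

```python
def go_score(sim_matrix):
    """Eq. 8: the larger of the row-wise and column-wise mean best-match scores."""
    n, m = len(sim_matrix), len(sim_matrix[0])
    # Average, over the terms of gene A, of the best match in gene B.
    row_avg = sum(max(row) for row in sim_matrix) / n
    # Average, over the terms of gene B, of the best match in gene A.
    col_avg = sum(max(sim_matrix[i][j] for i in range(n)) for j in range(m)) / m
    return max(row_avg, col_avg)

def funsim(mf_score, bp_score, cc_score, max_score=1.0):
    """Eq. 9: mean of the squared, normalized category scores."""
    return sum((s / max_score) ** 2 for s in (mf_score, bp_score, cc_score)) / 3

# Hypothetical similarity matrix for one category (N = 3 terms, M = 2 terms).
toy_matrix = [[0.9, 0.1],
              [0.2, 0.4],
              [0.3, 0.8]]
toy_go_score = go_score(toy_matrix)
toy_funsim = funsim(toy_go_score, 0.5, 1.0)
```

In the toy matrix the row-wise average is 0.7 and the column-wise average is 0.85, so the category score is 0.85. Squaring the category scores in Eq. 9 makes funsim penalize low scores in any one category more strongly than a plain average would.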
Gene ontology enrichment analysis
The probability that exactly k proteins in the cluster are annotated with a GO term X is given by the hypergeometric distribution:
$$ f\left(k; N, m, n\right)=\frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}} $$
(10)
where k is the number of proteins in the cluster annotated with X, N is the number of annotated proteins in the organism, m is the number of proteins in the organism annotated with X, and n is the number of annotated proteins in the cluster. To calculate a p-value for overrepresentation of a term, we use the following equation:
$$ P_{hg}(X)=\sum_{i=k}^{n} f\left(i; N, m, n\right) $$
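The enrichment p-value can be sketched with the Python standard library alone (`math.comb`, available from Python 3.8). This is an illustration rather than the NaviGO implementation; the function names and the toy counts are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb

def hypergeom_pmf(k, N, m, n):
    """Eq. 10: probability that exactly k of the n cluster proteins carry term X.

    N: annotated proteins in the organism; m: proteins annotated with X;
    n: annotated proteins in the cluster.
    """
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

def enrichment_p_value(k, N, m, n):
    """Eq. 11: probability of observing k or more proteins with term X."""
    # comb(m, i) is 0 for i > m, so terms beyond min(n, m) contribute nothing.
    return sum(hypergeom_pmf(i, N, m, n) for i in range(k, n + 1))

# Hypothetical counts: 20 annotated proteins, 5 of them with term X,
# and a cluster of 4 proteins of which 2 carry X.
toy_p_value = enrichment_p_value(k=2, N=20, m=5, n=4)
```

With these counts the three tail terms are 1050/4845, 150/4845 and 5/4845, so the p-value is 1205/4845, roughly 0.25, which would not indicate overrepresentation.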
(11)