- Research
- Open Access
- Published:

# Computational discovery of direct associations between GO terms and protein domains

*BMC Bioinformatics*
**volume 19**, Article number: 413 (2018)

## Abstract

### Background

Families of related proteins and their different functions may be described systematically using common classifications and ontologies such as Pfam and GO (Gene Ontology), for example. However, many proteins consist of multiple domains, and each domain, or some combination of domains, can be responsible for a particular molecular function. Therefore, identifying which domains should be associated with a specific function is a non-trivial task.

### Results

We describe a general approach for the computational discovery of associations between different sets of annotations by formalising the problem as a bipartite graph enrichment problem in the setting of a tripartite graph. We call this approach “CODAC” (for COmputational Discovery of Direct Associations using Common Neighbours). As one application of this approach, we describe “GODomainMiner” for associating GO terms with protein domains. We used GODomainMiner to predict GO-domain associations between each of the 3 GO ontology namespaces (MF, BP, and CC) and the Pfam, CATH, and SCOP domain classifications. Overall, GODomainMiner yields average enrichments of 15-, 41- and 25-fold GO-domain associations compared to the existing GO annotations in these 3 domain classifications, respectively.

### Conclusions

These associations could potentially be used to annotate many of the protein chains in the Protein Databank and protein sequences in UniProt whose domain composition is known but which currently lack GO annotation.

## Background

Proteins are macromolecules which carry out many biological functions in living organisms. At the molecular level, protein functions are often performed by highly conserved structural regions identified from sequence or structure alignments, which may be classified into families of domains. Because many protein domains fold into characteristic three-dimensional (3D) structures, there is often a close relationship between protein structure and protein function [1]. Currently, the Pfam database is one of the most widely used sequence-based classifications of protein domains and domain families [2]. The CATH [3] and SCOP [4] databases are examples of structural domain classifications. As well as sequence-based and structure-based classifications, proteins may also be classified according to their function. For example, the Gene Ontology (GO) [5] consists of a controlled vocabulary of GO terms which describe the gene products in a cell. Each GO term has a name, a distinct alphanumeric identifier, and a “namespace” (ontology) which has one of the following 3 values: biological process (BP), molecular function (MF), or cellular component (CC). The GO ontology is structured as a rooted Directed Acyclic Graph (rDAG) in which terms are nodes connected by different hierarchical relations. However, most protein domain classification systems annotate domains only according to the entire protein to which it belongs. One interesting exception is the dcGO database [6] which provides multiple ontological annotations (such as GO) for protein domains. Nonetheless, we found that there are several manually curated GO-Pfam associations from InterPro [7] which are not present in dcGO. Indeed, from the results of a previous version of our approach [8, 9], we estimated that dcGO associations can only annotate 43% of the unannotated structures in the Protein Databank (PDB) [10].

More generally, there are many millions of protein sequences that currently lack GO annotations. On the other hand, only a relatively small number of distinct protein domain families exist, which are re-used and combined in different ways in different proteins. Indeed, compared to the vast number of different sequences that exist, current domain classifications contain of the order of only 15,000 distinct protein domain families. Therefore, it is natural to suppose that if known protein structure and sequence annotations could be assigned GO terms at the domain level, many of these annotations could be transferred to a potentially very large number of unannotated proteins. However, we emphasize here that our aim is to discover functional annotations for protein domains themselves rather than entire protein sequences, in order to improve domain description and classification by combining structural and functional features. Nonetheless, even the task of associating GO terms with protein domains is a non-trivial problem because, except for single-domain proteins where the mapping is obvious, many different kinds of relationships can occur (see Fig. 1).

We described an early version of the approach presented here for assigning Enzyme Commission (EC) numbers to Pfam domains [9]. Because our new GODomainMiner approach [11] aims to answer a similar problem, with GO terms replacing EC numbers, we decided to generalise the overall approach under the name of CODAC (for COmputational Discovery of Direct Associations using Common Neighbours). Firstly, the problem is formalised as a bipartite graph enrichment problem in the setting of a tripartite graph. The core CODAC algorithm solves this problem using the vector cosine similarity model, from which it creates new weighted edges between items of the bipartite graph on the basis of their graph neighbourhood similarity. This approach is augmented using techniques to handle the problems of multiple data sources, bias due to identical items, the influence of the hierarchical organisation of the GO ontology, and statistical significance. Here, the overall approach is applied to 9 different bipartite graphs involving the 3 GO ontologies (BP, MF, and CC) and 3 popular protein domain classifications (Pfam, CATH, and SCOP). Our results show that the GO-domain associations discovered by this approach represent an average of 15-, 41- and 25-fold increase in the number of edges on the concerned bipartite graphs. These newly discovered associations are compared with existing associations from InterPro and those predicted by dcGO, and a selected subset of one-to-one associations is analysed from a biological point of view.

## Methods

### Tripartite graph model

In graph theory, a *k*-partite graph is a graph whose vertices can be partitioned into *k* disjoint subsets, such that in each subset no two vertices are connected. If *k*=2, the graph is called a bipartite graph (or bigraph), and if *k*=3 it is called a tripartite graph. The CODAC approach is designed to solve problems of bipartite graph enrichment within a tripartite graph framework. The main intuition is to calculate new weighted edges between two sets of items which already contain reliable but sparse associations, and which are indirectly connected through common associations with a third set of items.

Let \(\mathcal {G}^{\null } (X, Y, Z, {E}^{\null })\) be a tripartite graph where *X*, *Y* and *Z* are 3 sets of items and *E*is the set of all edges connecting *X*, *Y* and *Z* in the input configuration. Let us consider 3 bipartite subgraphs of \(\mathcal {G}^{\null }\), denoted as \(\mathcal {G}_{1} (X, Z, E_{1})\), \(\mathcal {G}_{2} (Y, Z, E_{2})\), and \(\mathcal {G}_{3}^{\null } (X,Y, E_{3}^{\null })\). We now assume that the set of edges \(E_{3}^{\null }\) is incomplete, and that the aim is to compute new edges between items of *X* and items of *Y* in order to generate \(\mathcal {G}_{3}^{*} \left (X,Y, E_{3}^{*}\right)\) which together with \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\) will make the final tripartite graph, \(\mathcal {G}^{*} (X, Y, Z, E^{*})\), where *E*^{∗} denotes an enriched set of edges. New edges may be discovered by exploiting the existing edge distributions in \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\). For example, if items *x*_{i} of *X* and *y*_{j} of *Y* share the same (or almost the same) set of neighbours {*z*_{k}} in *Z*, then it may be supposed that an edge might exist between *x*_{i} and *y*_{j}. Figure 2 illustrates the discovery of a candidate edge between *x*_{2} and *y*_{2} because these items are associated with the same subset of items { *z*_{1}, *z*_{3}, *z*_{4}} from *Z*. Candidate edges found in this way are then scored and filtered, as described in more detail below.

It is now possible to instantiate our model with a set of MF GO terms (*X*), a set of Pfam domains (*Y*), and a set of UniProtKB/SwissProt sequences (*Z*). *E*_{1} is the set of edges derived from the MF GO annotation of UniProtKB/SwissProt sequences, *E*_{2} is the set of edges derived from the domain contents of UniProtKB/SwissProt sequences, and \(E_{3}^{\null }\) is the set of edges derived from the InterPro manually curated MF GO annotations of Pfam domains. In this case, our aim is to produce \(E_{3}^{*}\), which will contain an enriched set of MF GO-Pfam associations weighted by their neighbourhood similarity score.

### Biadjacency representation of bigraphs

While graphs allow complex relationships to be visualised easily, analysing graphs computationally can be very time-consuming. In our approach it is convenient to represent each bigraph as a bi-adjacency matrix, in which a matrix element has a value of 1 or 0 according to whether the corresponding pair of nodes is connected or not.

Given a tripartite graph \(\mathcal {G}^{\null } (X, Y, Z, {E}^{\null })\) as input, the core CODAC algorithm divides it into two bigraphs \(\mathcal {G}_{1} (X, Z, E_{1})\) and \(\mathcal {G}_{2} (Y, Z, E_{2})\). A procedure named *Cosine* calculates a cosine similarity matrix *C* between items of *X* and items of *Y* using the two biadjacency matrices *M*_{1} (of dimension |*X*|×|*Z*|) and *M*_{2} (dimension |*Y*|×|*Z*|), derived from \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\), respectively. These matrices are then row-normalised to give matrices *U*_{1} and *U*_{2}. Each element of the matrix \(C = U_{1} \times U_{2}^{T}\) thus represents a cosine similarity between an item *x* of *X* and an item *y* of *Y*, according to the number of common associations with the items in *Z*.

The main procedure called *PredictAssociations* determines a similarity threshold *T* for filtering the raw scores in *C* to produce *C*^{∗}. The matrix *C*^{∗} can be interpreted as the weighted biadjacency matrix of the enriched bigraph \(\mathcal {G}_{3}^{*}\left (X, Y, E_{3}^{*}\right)\) and therefore used to predict new weighted associations between items of *X* and *Y*. Pseudocode for the core CODAC algorithm is presented in Algorithm 1.

### Gold standard of positive and negative examples

In order to determine an edge similarity threshold, we need to define a “gold standard” set of positive and negative examples of associations. Here, we take all of the \(P = |E_{3}^{\null }|\) existing associations present in \(\mathcal {G}_{3}^{\null }\) as positive examples. To create negative examples, we shuffle the edges of \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\) in order to rearrange in a random way all edges between *X* and *Z*, and between *Y* and *Z*. During shuffling, the node degrees of each *x*_{i}, *y*_{j} and *z*_{k} is kept constant, and the shuffled edges are constrained not to overlap the original edges. The shuffled graphs are denoted by \(\mathcal {G}_{1}^{\#}\) and \(\mathcal {G}_{2}^{\#}\), from which a new shuffled cosine similarity matrix, *C*^{#}, may be calculated. This matrix is then used to select |*N*|=|*P*| negative examples at random. Taken together, the *P* positive and *N* negative examples constitute our “Gold Standard” dataset.

### Determining the score threshold

We randomly split the Gold Standard dataset into two groups with equal distributions of positive and negative examples to give a “Training” and a “Test” subset. We then rank the scores of all members of the Training subset, and label them “positive” or “negative” according to a score threshold that is varied from 0.0 to 1.0 in steps of 0.001. This allows us to determine the numbers of true positive (*TP*), false positive (*FP*), true negative (*TN*), and false negative (*FN*) predictions for each threshold. We then calculate the recall, *R*=*T**P*/(*T**P*+*F**N*), precision, *P*=*T**P*/(*T**P*+*F**P*), and the F-measure, *F*_{1}=2*R**P*/(*P*+*R*). The similarity threshold *T* that gives the best F-measure with the Training subset is verified using the Test subset and retained to calculate a filtered cosine similarity matrix, *C*^{∗}, according to \(C_{i,j}^{*} = C_{i,j}\) if *C*_{i,j}>*T* or if the (*x*_{i},*y*_{j}) edge already exists in *E*_{3}, otherwise, \(C_{i,j}^{*} = 0\).

### Combining multiple datasets

There may often be more than one configuration for a graph \(\mathcal {G}^{\null }\), that has the same \(\mathcal {G}_{3}^{\null }\) but different *Z*, *E*_{1}, and *E*_{2} in \(\mathcal {G}_{1}\) and \(\mathcal {G}_{2}\). In our instantiation this corresponds to the fact that GO terms and Pfam domains can be indirectly connected either through UniProtKB/SwissProt sequences [12] or through PDB chains in SIFTS [13]. To handle multiple datasets, each input tripartite graph is processed separately to calculate its respective cosine similarity matrix *C*^{d}. The cosine similarity scores are then combined as a weighted average to give a consensus similarity matrix, *CS* (Algorithm 2). Whenever there is no data for a given pair (*x*,*y*) in an input graph, the corresponding score \(C_{x, y}^{d}\) is set to zero.

Receiver-operator-characteristic (ROC) analysis provides an objective way to measure the ability of an information retrieval system to retrieve positive documents as first ranked, i.e. with the best scores [14]. One advantage of ROC-based approaches is that they are rather insensitive to the particular numbers of the positive and negative instances used [15]. Here, in order to find the best values for the dataset weights *w*_{d}, each weight is varied from 1 to 10 in steps of 0.1, and for each combination of weights a ROC performance curve is calculated using the complete ranked list of consensus scores and our Gold Standard set of positive examples. The combination of weights that gives the largest area under the curve (AUC) is selected and used to calculate the best consensus similarity matrix *CS*. Then, the *PredictAssociations* procedure determines the best threshold to filter the consensus similarity matrix *CS* and to deduce the resulting enriched bipartite graph \(\mathcal {G}_{3}^{*}\).

### Bipartite graph extension with hierarchy of classes

Ontologies are often described as taxonomic hierarchies of classes, as is the case for the GO gene ontology [5]. Thus, if one of the input graphs contains items from a hierarchical ontology, important relationships between the ancestors of a term and its neighbour(s) could be missed because they are generally not mentioned explicitly in the data. For example, if a vertex *x* from set *X* represents a term in an ontology and has a neighbour *z* in set *Z*, it is quite possible that all of the ancestors of *x* present in *X* should also have *z* as neighbour. If requested by the user, whenever an edge (*x*,*z*) is found where *z* is annotated with an ontology term *x*, then CODAC will add additional edges between item *z* and all parents of *x* present in *X*. This is illustrated in Fig. 3.

### Clustering graph edges

A possible source of bias in any data mining approach is the existence of redundant items in the input. This is especially the case for protein entries in UniProt where it is quite possible to have entries with different identifiers but identical amino-acid sequences. In order to deal with this possibility, CODAC groups all items in *Z* into clusters having 100% identity. Each cluster is represented by a unique cluster identifier (*CID*). As shown in Algorithm 3, all source edges (*x*,*z*_{i}) and (*y*,*z*_{j}) from *E*_{1} and *E*_{2} in which identical *z*_{i} and *z*_{j} belong to the same *CID*, are merged into unique (*x*,*C**I**D*) and (*y*,*C**I**D*) edges, producing \(\mathcal {G}_{1}^{Cl}\) and \(\mathcal {G}_{2}^{Cl}\), the reduced bipartite graphs that serve as input to the CODAC core approach. It should be noted that the 100% sequence identity threshold may be reduced to 99% or lower if desired. As illustrated in Fig. 4, grouping identical items into clusters of 100% identity can be very beneficial for recovering missing edges.

### Calculating statistically significant edges in \(E_{3}^{*}\)

While our approach provides a systematic way to predict edges in \(\mathcal {G}_{3}^{*}\), it is important to calculate a probability, or “*p*-value”, for finding an edge simply by chance. For example, it is reasonable to suppose that an edge (*x*,*y*) might be predicted at random if *x* and *y* are each highly connected to many items in *Z*. In order to estimate the probability of finding edges by chance, one could generate multiple random graphs by shuffling the edges of a given input graph, as described above for constructing the Gold Standard *Negative* examples. However, this is quite impractical given the very large numbers of items in *X*, *Y*, and *Z* and the complexity of the filtering procedure that would have to be repeated for each shuffled version of the dataset. Instead, we assume that the probability for finding an edge (*x*,*y*) by random chance is given by a hypergeometric distribution of the number of common neighbours (*x*,*z*) and (*y*,*z*). Letting *N*_{z} denote the total number of items in *Z*, *N*_{x} the number of neighbours of *x* in *Z*, and *N*_{y} the number of neighbours of *y* in *Z*, the hypergeometric probability distribution is given by

where \(p(K \geqslant K_{x,y})\) is the predicted probability of having a number, *K*, equal to or greater than the observed number *K*_{x,y} of common neighbours *z* of both *x* and *y*. Because this *p*-value test is applied to a large number of (*x*,*y*) edges in \(\mathcal {G}_{3}^{*}\), we apply a Bonferoni correction to take into account the so-called family-wise error rate [16]. Therefore, letting \(|E_{3}^{*}|\) denote the total number of edges tested, we consider any *p*-value less than \( 0.05 / |E_{3}^{*}|\) as denoting a statistically significant edge.

### Classification into gold, silver, and bronze associations

While the above consensus scores and *p*-values give objective measures of the quality of predicted associations, from a user’s point of view it is often convenient to provide a simple and memorable quality scale. Therefore, we classify a predicted association as “Gold” if all of the individual data source *p*-values for this association are statistically significant.

A predicted association is classed as “Silver” if more than half of the data source *p*-values are statistically significant. Otherwise, it is classed as a “Bronze” association.

## Results and discussion

### GODomainMiner data preparation

In this paper, the CODAC approach is applied to discover new weighted GO-domain associations. In our \(\mathcal {G}^{\null } (X, Y, Z, {E}^{\null })\) tripartite graph model, the set *X* corresponds to one of the MF, BP or CC GO namespaces, and *Y* corresponds to one of the Pfam, CATH, or SCOP protein domain classifications. For each of the 9 combinations of *X* and *Y*, 3 data sources were selected to provide common neighbours (*Z*) of the items in *X* and *Y*, namely: (i) SIFTS providing curated PDB chain associations, (ii) UniProtKB/SwissProt (SP) providing curated UniProt entries, and (iii) UniProtKB/TrEMBL (TR) providing non-curated automatically annotated UniProt sequences.

Flat data files of SIFTS (June 2017), UniProt (June 2017), and InterPro (version 63.0) were downloaded and parsed using in-house Python scripts. Associations between PDB chains and GO terms, and associations between PDB chains and protein domains (Pfam, CATH, and SCOP) were extracted from the SIFTS data. All CATH and SCOP domain families were transformed into their corresponding superfamilies, and all Pfam “repeat” and “motif” domain types were discarded. Associations between UniProt sequence accession numbers (ANs) and GO terms and AN-Pfam associations (as well as AN-CATH and AN-SCOP associations) were extracted from the UniProtKB/SwissProt and UniProtKB/TrEMBL sections of UniProt to give two datasets of UniProtKB/SwissProt associations and UniProtKB/TrEMBL associations, respectively. Then, using the evidence code of the GO term, the associations in the SIFTS, UniProtKB/SwissProt, and UniProtKB/TrEMBL datasets were divided into two groups, namely one group for which the GO term evidence code indicated manual curation, and one group for GO terms with evidence code “inferred from electronic annotation” (IEA). Here, the resulting 6 datasets are called SIFTS, SIFTS-IEA, SP, SP-IEA, TR, and TR-IEA. Thus, there are 6 input tripartite graphs for each of the 9 combinations of the *X* and *Y* source datasets. All PDB chain IDs and UniProt ANs having identical sequences were clustered using the Uniref non-redundant cluster annotations [17].

We do not make any distinction between the various possible manual evidence codes. However, we note that the GO_REF field for IEA currently covers 12 annotations sources, namely InterPro2GO, UniProt Keywords2GO, UniProt Subcellular Location2GO, EC2GO, UniRule2GO, UniPathway2GO, Ensembl Compara, Ensembl Fungi, Ensembl Metazoa, Ensembl Plants, Ensembl Protists, and the Gene Ontology Consortium. Of these, the largest number of annotations come from InterPro2GO and UniProt Keywords2GO, which each provide around 169 million associations in UniProtKB. It should be noted that, only 34%, 4%, and 5% of the InterPro2GO annotations are GO-Pfam, GO-CATH, and GO-SCOP associations, respectively.

### Dataset weights and threshold scores

For each of the 9 settings of this study, the weights assigned to each dataset have been optimised. The procedure is described in Methods (Algorithm 2) and is based on a ROC-plot analysis of the ranking of our Gold-Standard InterPro-based positive examples versus all other associations computed from all the datasets and considered as background. Then the best threshold is determined from the consensus scores calculated with the optimised set of weights, using the Gold Standard Training and Test subsets of positive and negative examples. Table 1 shows that our procedure gives greater weight to GO-Pfam associations from the IEA sections of the SIFTS, UniProtKB/SwissProt, and UniProtKB/TrEMBL than to associations from the experimental and manually curated sections of SIFTS and UniProtKB/SwissProt datasets. In order to investigate this further, we re-calculated the AUC-based weight optimization with all IEA weights forced to zero (Additional file 1). This caused our optimal AUC to fall from around 0.96 to less than 0.60. This reflects the fact that in this setting, we do not consider the propagated InterPro2GO annotations in UniProtKB, and consequently GODomainMiner retrieves fewer Gold-Standard associations. However, as IEA annotations are extracted from several other data sources as well as InterPro, setting the IEA weight to zero also excludes these other data sources (refer to previous section). We therefore decided to include all IEA data in the rest of this study.

### Analysis of algorithm complexity

Because we exploit existing UniProt cluster IDs to form clusters of similar protein sequences and to eliminate duplicate sequences, the computational cost in the initial data preparation stage scales as approximately *O*(*s*×*c*), where *s* is the number of sequences and *c* is the number of UniProt clusters. The scoring stage then scales as *O*(*g*×*d*), where *g* is the number of GO terms and *d* is the number of domains. Here, the largest calculation is to find GO BP-Pfam associations. This takes around 12 hours on one CPU core of an Intel Xeon E5-2630 2.40 GHz workstation with 128 Gb memory.

### Analysis of calculated GO-Pfam associations

Summaries of our calculated GO MF-domain, BP-domain, and CC-domain associations are shown in Tables 2, 3, and 4, respectively. These tables show the numbers of distinct GO terms and domain entries (in units of thousands) involved in associations for the 6 source datasets, the filtered GODomainMiner predictions and the InterPro dataset of positive associations. In these tables, the total numbers of GO-Pfam associations found by GODomainMiner refer only to most-specific GO terms in each branch of a GO hierarchy. In other words, if a domain is associated to a GO term and to one or more of its parent terms, only the most-specific (non-parent) term is counted as a found association.

The overlap between the GODomainMiner predictions and InterPro is shown in the last row of these tables (here, a match at any GO level is counted as a common association). The high percentage of overlap between GODomainMiner and InterPro (from 91% to more than 99%) reflects the fact that our method is calibrated to recover as many as possible correct InterPro associations. Nevertheless it also shows that a small percentage of the InterPro associations have consensus scores below our calculated score threshold, revealing the role of human rather than data-driven knowledge in the definition of such associations.

Overall, our approach yields a total of 32,881 MF GO-Pfam associations (shown as 33×10^{3} in Table 2) that include 3968 associations already present in InterPro (2657 specific term matches plus 1311 parent term matches). This corresponds to an enrichment of about 8-fold in MF GO-Pfam associations. Similar calculations give fold-enrichments of about 22 and 13 for MF GO associations with CATH and SCOP domain superfamilies, respectively. For BP GO terms, we find fold-enrichments of 20, 50, and 31 for associations with Pfam, CATH, and SCOP domains, respectively, and for CC GO terms the fold-enrichments are 17, 52, and 31, respectively. A comparison with the Pfam2GO associations from the Gene Ontology website was also performed. It reveals that GODomainMiner retrieves 3966, 3541, and 2055 MF, BP, and CC GO-Pfam associations that were provided by Pfam2GO, respectively. On the other hand, it finds 99 out of 187 MF GO-Pfam associations, 108 out of 256 BP GO-Pfam associations, and 29 out of 65 CC GO-Pfam associations which are present in Pfam2GO but which are not in the InterPro database.

These results indicate that GODomainMiner discovers many new associations compared to Pfam2Go and InterPro. This can be explained by the fact that our program does not make any consideration about the possible usage of these associations for protein annotation, whereas InterPro policy is to retain only those GO-domain associations that can be transferred to all the proteins containing a given domain [18].

### Distribution of GO-domain associations per GO term and per domain

Figure 5a shows the average numbers of MF, BP, and CC GO-Pfam associations per GO term and Pfam entry for associations in InterPro (green) and those calculated by GODomainMiner when counting the most-specific GO terms assigned to a domain (purple).

GODomainMiner generally predicts more associations per GO term and per Pfam domain than exist in InterPro. For example (top panel), GODomainMiner predicts that each MF GO term and each Pfam entry are associated with an average of 5.2 domains and 4.0 MF GO terms, respectively, compared to averages of 3.9 domains and 1.3 MF GO terms in InterPro, respectively. For BP and CC GO terms we see similar enrichments from GODomainMiner compared with InterPro, with ratios of 5.4 versus 3.5 and 16.9 versus 6.8 associations per GO term, and 8.2 versus 1.17 and 4.5 versus 1.1 associations per Pfam, respectively. These results demonstrate that GODomainMiner produces a considerable enrichment in the number of annotations compared to InterPro. They also support the notion that many Pfam domains participate in different functions, either as singleton domains or as components of multi-domain proteins.

The bar charts in Fig. 5b show the distributions of GO terms (shown in orange) and Pfam entries (in blue) according to the number of associations they are involved in. For example, considering the first two bars in part B, it can be seen that some 2100 MF, 3500 BP, and 320 CC GO terms and 2600, 2300, and 2800 Pfam domains are involved in only one GO-Pfam association. The remainder of this figure shows that many GO terms and Pfam domains are involved in two or more associations, which supports the notion that complex many-to-many relationships exist between GO terms and domains (Fig. 1). More precisely, Fig. 5b indicates that the number of Pfam domains involved in only one GO BP-Pfam association is less than the number of Pfam domains involved in only one MF-Pfam association. This is consistent with the notion that a domain most likely has one function but it can be involved in several processes. Moreover, on average, twice as many BP terms are associated to Pfam domains as MF and CC terms (Fig. 5a), which demonstrates the complexity of assigning GO BP terms to Pfam domains. On the other hand, this ratio is consistent with the idea that GO BP terms describe the cooperation of one or more individial molecular functions to achieve a particular biological purpose [19]. Similar results for GO-CATH and GO-SCOP associations are shown in Additional file 1.

Finally, Table 5 shows the distribution of GODomainMiner predicted associations according to our Gold, Silver, and Bronze classification, along with the degree of overlap with the InterPro reference dataset. Since the Gold class represents associations with statistically significant *p*-values, it is interesting to see that the majority (68%) of our predicted MF GO-Pfam associations common with InterPro fall in this class. Overall, we calculate that 47% of the GODomainMiner MF GO-Pfam associations and 33% of the predicted BP and CC associations are of Gold quality. The quality of GO predictions for CATH and SCOP classifications also follow very similar paths (see Additional file 2).

### Comparison with GO-domain associations from dcGO

In order to compare the GODomainMiner results with those obtained from dcGO [6], we extracted the Pfam2GO associations from the dcGO website [20]. To avoid the complexity of comparing GO annotations at different levels in the rDAG, our comparison mainly focuses on GO-domain associations in which GO terms are leaves of the GO rDAG. GODomainMiner contains a total of 515,582 GO-Pfam associations regardless of their level in GO hierarchy, of which 79,589 involve leaf GO terms (comprising 21,410 MF, 36,814 BP, and 21,365 CC GO-Pfam associations). The Pfam2GO dataset from dcGO contains a total of 720,534 associations, of which 62,779 involve leaf GO terms (comprising 5939 MF, 24,334 BP, and 32,506 CC associations). Thus, the numbers of associations in GODomainMiner and Pfam2GO are broadly comparable. However, when considering the leaf levels of all 3 ontologies, Fig. 6 shows that only 11,138 GO-Pfam associations are common between GODomainMiner and dcGO (overlap region B, about 14% of the GODomainMiner set and 18% of the dcGO set). Looking at the overlap with InterPro, which contains 2799 leaf level GO-Pfam associations, GODomainMiner shares 2744 associations (98%) with InterPro, while dcGO shares only 724 associations (26%; overlap C). This shows that GODomainMiner gives a greater coverage of the InterPro reference set than dcGO. Although this is perhaps not surprising since InterPro was used to calibrate GODomainMiner, the high agreement between GODomainMiner and InterPro gives a good indication of the reliability of other associations predicted by GODomainMiner.

We also compared GO-SCOP associations predicted by GODomainMiner with the SCOP2GO database from dcGO and with InterPro. Overall, GODomainMiner calculates a total of 19,708 leaf GO-SCOP associations, compared to 2445 such associations in SCOP2GO and 422 in InterPro. Of these, 845 GO-SCOP associations are common to GODomainMiner and SCOP2GO. Also, 421 (i.e. *99.75%* of InterPro set) GODomainMiner associations overlap with InterPro, whereas only 55 (*13%* of InterPro set) SCOP2GO associations from dcGO are found in InterPro. This confirms the trend observed for GO-Pfam associations, in favor of a much better coverage by GODomainMiner than by dcGO, of the InterPro reference set.

### Biological assessment of new discovered GO-Pfam associations

It would certainly be a very tedious task to validate manually the huge number of new GO-domain associations proposed by the GODomainMiner approach. For this reason, we decided to check manually a small subset of these associations, namely the strict one-to-one and many-to-one GO-domain associations in which one or several GO terms are uniquely associated with one domain and where this domain is not associated with any other GO terms. Such associations can easily be used to assess the novelty and biological consistency of knowledge discovered through our approach. All lists of strict one-to-one and many-to-one associations found in the 9 settings of this study are available on the GODomainMiner website.

For the sake of brevity, we review here only the one-to-one and many-to-one MF GO-Pfam associations. We obtained 125 one-to-one MF GO-Pfam associations with consensus scores ranging from 0.9704 to 0.0052, 75 associations in the gold category (all *p*-values significant), 30 and 20 in the silver and bronze categories, respectively. From the 125 associations, 30 are already known in InterPro (21 from the gold category) and 95 are new (54 from the gold category). Manual checking of the MF GO terms and Pfam domain names led us to distinguish 5 situations (see the examples in Table 6). (i) The MF GO terms and Pfam domains descriptions are almost identical (34 associations). Such associations are trivial but only 16 of them are reported in InterPro, probably because the remaining 18 escaped automatic retrieval due to unpredictable spelling differences. (ii) The MF GO term is more specific than the Pfam domain description (21 associations including 3 from InterPro). (iii) The Pfam description is more specific than the MF GO term (11 associations including 3 from InterPro). (iv) The MF GO term and the Pfam descriptions are quite different (51 associations including 8 from InterPro). Such associations are likely the most interesting to provide to the expert for further analyses. (v) The Pfam domain has no known function (8 associations not present in InterPro). These 8 associations are listed in Table 6 as examples of new knowledge discovered by the CODAC approach. We expect that many further novel associations between MF GO terms and yet uncharacterized domains may be mined from the complete MF GO-Pfam dataset which contains more than 3400 associations concerning so-called DUF (Domain of Unknown Function) or UPF (Uncharacterized Protein Family) Pfam domains.

Concerning the strict many-to-one MF GO-Pfam associations, we identified 30 such Pfam domains, most of which have only two associated GO terms. This results in 55 associations of which 7 are known in InterPro (6 gold and 1 bronze) and 48 are new (33 gold, 8 silver and 7 bronze). For one Pfam domain only (CobS, PF02654) the two GO terms are known already in InterPro. For 5 other Pfam domains, one of the GO terms is known in InterPro and the other one is new. New MF GO-Pfam associations generally give lower scores than known InterPro associations. However, in some cases this suggests an alternative substrate for the domain activity which may be interesting to investigate. For example, for Pfam domain Mqo (PF06039 Malate:quinone oxidoreductase), GO:0052589 (malate dehydrogenase (menaquinone) activity) is found in addition to GO:0008924 (malate dehydrogenase (quinone) activity). The remaining 24 Pfam domains all have new GO MF annotations that do not exist in InterPro. Interestingly, in some cases a different more general InterPro annotation exists, as in the case of PF07722 domain Peptidase_C2 which GODomainMiner associates with GO:0034722 (gamma-glutamyl-peptidase activity) and with GO:0033969 (gamma-glutamyl-gamma-aminobutyrate hydrolase) activity, whereas the InterPro annotation is simply GO:0016787 (hydrolase activity).

### Implications for protein sequence annotation

It is natural to suppose that predicted GO-domain associations could help to annotate entire protein sequences. However, it does not automatically follow that GO-domain associations are directly transferable to sequences because the function of a particular protein can depend on, for example, its domain architecture, organism, cell type, and cellular location [18]. Therefore, an automatic domain-based sequence annotation system should take such factors into account by, e.g., constructing and applying filtering rules that take into account the taxa and cellular environment of each protein sequence to be annotated.

In any case, it is reasonable to expect that the difference in specificity compared to InterPro annotations will likely prevent many of the GODomainMiner annotations from being transferred directly to all proteins that match a given domain. However, there is no doubt that the newly discovered associations should contribute to the generation of new rules to annotate protein sequences. Nonetheless, the domain-level functional annotations predicted by GODomainMiner should first be subjected to further benchmarking in order to validate their usefulness. We recently participated in the 2017 round of the CAFA (Critical Assessment of Functional Annotation) community experiment [21], in which we applied taxa-based filtering of GODomainMiner annotations. However, the evaluation of this CAFA edition has not yet been published. Participation in future CAFA editions will allow GODomainMiner’s annotations to be assessed according to community standards.

## Conclusion

We have presented a systematic approach called CODAC for mining associations from datasets that can be represented as tripartite graphs. We have presented one implementation of this approach called GODomainMiner, for predicting associations between GO terms and protein domains. This was achieved by first collecting existing Pfam, CATH, and SCOP domain annotations of protein chains and sequences on one hand and MF, BP, and CC GO term annotations on the other. We then applied our method to find a list of direct associations between GO terms and domains. Considering only the most-specific GO terms, our approach yields an enrichment of about 15-fold in the number of GO-Pfam associations that currently exist in InterPro. A selected subset of one-to-one and many-to-one associations has been analyzed from a biological point of view, and these all appear to be highly meaningful and consistent with available knowledge. Nonetheless, there remains a need for the associations predicted by our approach to be validated more extensively, and we plan to test our approach thoroughly in the next CAFA community experiment.

## References

- 1
Berg JM, Tymoczko JL, Stryer L. Protein structure and function. New York: WH Freeman; 2002.

- 2
Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, L SEL, Tate J, Punta M. Pfam: the protein families database. Nucleic Acids Res. 2014; 42(D1):222–30.

- 3
Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH – a hierarchic classification of protein domain structures. Structure. 1997; 5(8):1093–109.

- 4
Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995; 247(4):536–40.

- 5
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.Gene ontology: tool for the unification of biology. Nat Genet. 2000; 25(1):25–9.

- 6
Fang H, Gough J. dcGO: database of domain-centric ontologies on functions, phenotypes, diseases and more. Nucleic Acids Res. 2013; 41(D1):536–44.

- 7
Mitchell A, Chang HY, Daugherty L, Fraser M, Hunter S, Lopez R, McAnulla C, McMenamin C, Nuka G, Pesseat S, Sangrador-Vegas A, Scheremetjew M, Rato C, Yong S-Y, Bateman A, Punta M, Attwood TK, Sigrist CJA, Redaschi N, Rivoire C, Xenarios I, Kahn D, Guyot D, Bork P, Letunic I, Gough J, Oates M, Haft D, Huang H, Natale DA, Wu CH, Orengo C, Sillitoe I, Mi H, Paul D, Thomas PD, D FR. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 2015; 43(D1):213–21.

- 8
Alborzi SZ, Devignes M-D, RITCHIE DW. Ec-psi: Associating enzyme commission numbers with pfam domains. bioRxiv. 2015. https://doi.org/10.1101/022343. http://biorxiv.org/content/early/2015/07/10/022343.full.pdf.

- 9
Alborzi SZ, Devignes M-D, Ritchie DW. Ecdomainminer: discovering hidden associations between enzyme commission numbers and pfam domains. BMC Bioinformatics. 2017; 18(1):107.

- 10
Gutmanas A, Alhroub Y, Battle GM, Berrisford JM, Bochet E, Conroy MJ, Dana JM, Montecelo MAF, van Ginkel G, Gore SP, Haslam P, Hatherley R, Hendrickx PMS, Hirshberg M, Lagerstedt I, Mir S, Mukhopadhyay A, Oldfield TJ, Patwardhan A, Rinaldi L, Sahni G, Sanz-García E, Sen S, Slowley RA, Velankar S, Wainwright ME, Kleywegt GJ. PDBe: protein data bank in europe. Nucleic Acids Res. 2014; 42(D1):285–91.

- 11
Alborzi SZ, Devignes M-D, Ritchie DW. Associating gene ontology terms with Pfam protein domains In: Rojas I, Ortuño F, editors. Bioinformatics and Biomedical Engineering. IWBBIO 2017, vol. LNCS 10209. Cham: Springer: 2017. p. 127–138.

- 12
The UniProt Consortium. The universal protein resource (UniProt) in 2010. Nucleic Acids Res. 2010; 38(suppl 1):142–8.

- 13
Velankar S, Dana JM, Jacobsen J, van Ginkel G, Gane PJ, Luo J, Oldfield TJ, O’Donovan C, Martin MJ, Kleywegt GJ. SIFTS: structure integration with function, taxonomy and sequences resource. Nucleic Acids Res. 2013; 41(D1):483–9.

- 14
Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. New York: Cambridge University Press; 2008.

- 15
Chawla NV, Japkowicz N, Kotcz A. Special issue on learning from imbalanced data sets. ACM Sigkdd Explor Newsl. 2004; 6(1):1–6.

- 16
Cui X, Churchill GA. Statistical tests for differential expression in cdna microarray experiments. Genome Biol. 2003; 4(4):210.

- 17
Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: domprehensive and non-redundant UniProt reference clusters. Bioinformatics. 2007; 23(10):1282–8.

- 18
Sangrador-Vegas A, Mitchell AL, Chang H-Y, Yong S-Y, Finn RD. Go annotation in interpro: why stability does not indicate accuracy in a sea of changing annotations. Database. 2016; 2016:027. https://doi.org/10.1093/database/baw027.

- 19
Bargsten JW, Severing EI, Nap J-P, Sanchez-Perez GF, van Dijk AD. Biological process annotation of proteins across the plant kingdom. Curr Plant Biol. 2014; 1:73–82.

- 20
dcGO. http://supfam.org/SUPERFAMILY/dcGO/. Accessed 28 July 2017.

- 21
Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, et al.A large-scale evaluation of computational protein function prediction. Nat Methods. 2013; 10(3):221–7.

## Acknowledgements

The authors wish to thank Jean-Sebastien Sereni for careful reading of the graph modeling section.

### Funding

This work and the publication of this article were funded by Agence Nationale de la Recherche (grant numbers: ANR-11-MONU-006-02 and ANR-15-RHUS-0004), Inria Nancy – Grand Est, and Région Lorraine (CPER IT2MP).

### Availability of data and materials

The GODomainMiner results can be accessed with a web browser at http://godm.loria.fr/.

### About this supplement

This article has been published as part of *BMC Bioinformatics Volume 19 Supplement 14, 2018: Selected articles from the 5th International Work-Conference on Bioinformatics and Biomedical Engineering: bioinformatics*. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-19-supplement-14.

## Author information

### Affiliations

### Contributions

SZA designed the study and was involved in data processing and management, analysis and testing. MDD was involved in biological interpretation. SZA, MDD, and DWR worked together on the mathematical modeling of the approach, discussed the results and wrote the manuscript. All authors have read and approved the final manuscript.

### Corresponding author

## Ethics declarations

### Ethics approval and consent to participate

Not applicable.

### Consent for publication

Not applicable.

### Competing interests

The authors declare that they have no competing interests.

### Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Additional files

### Additional file 1

Supplementary figures. (PDF 93 kb)

### Additional file 2

Supplementary tables. (PDF 36 kb)

## Rights and permissions

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

## About this article

### Cite this article

Alborzi, S., Ritchie, D. & Devignes, MD. Computational discovery of direct associations between GO terms and protein domains.
*BMC Bioinformatics* **19, **413 (2018). https://doi.org/10.1186/s12859-018-2380-2

Published:

### Keywords

- Protein structure
- Protein domain
- Protein function
- Gene ontology
- Vector similarity