
Research

Open

 Published:
Computational discovery of direct associations between GO terms and protein domains
BMC Bioinformaticsvolume 19, Article number: 413 (2018)
Abstract
Background
Families of related proteins and their different functions may be described systematically using common classifications and ontologies such as Pfam and GO (Gene Ontology), for example. However, many proteins consist of multiple domains, and each domain, or some combination of domains, can be responsible for a particular molecular function. Therefore, identifying which domains should be associated with a specific function is a nontrivial task.
Results
We describe a general approach for the computational discovery of associations between different sets of annotations by formalising the problem as a bipartite graph enrichment problem in the setting of a tripartite graph. We call this approach “CODAC” (for COmputational Discovery of Direct Associations using Common Neighbours). As one application of this approach, we describe “GODomainMiner” for associating GO terms with protein domains. We used GODomainMiner to predict GOdomain associations between each of the 3 GO ontology namespaces (MF, BP, and CC) and the Pfam, CATH, and SCOP domain classifications. Overall, GODomainMiner yields average enrichments of 15, 41 and 25fold GOdomain associations compared to the existing GO annotations in these 3 domain classifications, respectively.
Conclusions
These associations could potentially be used to annotate many of the protein chains in the Protein Databank and protein sequences in UniProt whose domain composition is known but which currently lack GO annotation.
Background
Proteins are macromolecules which carry out many biological functions in living organisms. At the molecular level, protein functions are often performed by highly conserved structural regions identified from sequence or structure alignments, which may be classified into families of domains. Because many protein domains fold into characteristic threedimensional (3D) structures, there is often a close relationship between protein structure and protein function [1]. Currently, the Pfam database is one of the most widely used sequencebased classifications of protein domains and domain families [2]. The CATH [3] and SCOP [4] databases are examples of structural domain classifications. As well as sequencebased and structurebased classifications, proteins may also be classified according to their function. For example, the Gene Ontology (GO) [5] consists of a controlled vocabulary of GO terms which describe the gene products in a cell. Each GO term has a name, a distinct alphanumeric identifier, and a “namespace” (ontology) which has one of the following 3 values: biological process (BP), molecular function (MF), or cellular component (CC). The GO ontology is structured as a rooted Directed Acyclic Graph (rDAG) in which terms are nodes connected by different hierarchical relations. However, most protein domain classification systems annotate domains only according to the entire protein to which it belongs. One interesting exception is the dcGO database [6] which provides multiple ontological annotations (such as GO) for protein domains. Nonetheless, we found that there are several manually curated GOPfam associations from InterPro [7] which are not present in dcGO. Indeed, from the results of a previous version of our approach [8, 9], we estimated that dcGO associations can only annotate 43% of the unannotated structures in the Protein Databank (PDB) [10].
More generally, there are many millions of protein sequences that currently lack GO annotations. On the other hand, only a relatively small number of distinct protein domain families exist, which are reused and combined in different ways in different proteins. Indeed, compared to the vast number of different sequences that exist, current domain classifications contain of the order of only 15,000 distinct protein domain families. Therefore, it is natural to suppose that if known protein structure and sequence annotations could be assigned GO terms at the domain level, many of these annotations could be transferred to a potentially very large number of unannotated proteins. However, we emphasize here that our aim is to discover functional annotations for protein domains themselves rather than entire protein sequences, in order to improve domain description and classification by combining structural and functional features. Nonetheless, even the task of associating GO terms with protein domains is a nontrivial problem because, except for singledomain proteins where the mapping is obvious, many different kinds of relationships can occur (see Fig. 1).
We described an early version of the approach presented here for assigning Enzyme Commission (EC) numbers to Pfam domains [9]. Because our new GODomainMiner approach [11] aims to answer a similar problem, with GO terms replacing EC numbers, we decided to generalise the overall approach under the name of CODAC (for COmputational Discovery of Direct Associations using Common Neighbours). Firstly, the problem is formalised as a bipartite graph enrichment problem in the setting of a tripartite graph. The core CODAC algorithm solves this problem using the vector cosine similarity model, from which it creates new weighted edges between items of the bipartite graph on the basis of their graph neighbourhood similarity. This approach is augmented using techniques to handle the problems of multiple data sources, bias due to identical items, the influence of the hierarchical organisation of the GO ontology, and statistical significance. Here, the overall approach is applied to 9 different bipartite graphs involving the 3 GO ontologies (BP, MF, and CC) and 3 popular protein domain classifications (Pfam, CATH, and SCOP). Our results show that the GOdomain associations discovered by this approach represent an average of 15, 41 and 25fold increase in the number of edges on the concerned bipartite graphs. These newly discovered associations are compared with existing associations from InterPro and those predicted by dcGO, and a selected subset of onetoone associations is analysed from a biological point of view.
Methods
Tripartite graph model
In graph theory, a kpartite graph is a graph whose vertices can be partitioned into k disjoint subsets, such that in each subset no two vertices are connected. If k=2, the graph is called a bipartite graph (or bigraph), and if k=3 it is called a tripartite graph. The CODAC approach is designed to solve problems of bipartite graph enrichment within a tripartite graph framework. The main intuition is to calculate new weighted edges between two sets of items which already contain reliable but sparse associations, and which are indirectly connected through common associations with a third set of items.
Let $\mathcal {G}^{\null } (X, Y, Z, {E}^{\null })$ be a tripartite graph where X, Y and Z are 3 sets of items and Eis the set of all edges connecting X, Y and Z in the input configuration. Let us consider 3 bipartite subgraphs of $\mathcal {G}^{\null }$, denoted as $\mathcal {G}_{1} (X, Z, E_{1})$, $\mathcal {G}_{2} (Y, Z, E_{2})$, and $\mathcal {G}_{3}^{\null } (X,Y, E_{3}^{\null })$. We now assume that the set of edges $E_{3}^{\null }$ is incomplete, and that the aim is to compute new edges between items of X and items of Y in order to generate $\mathcal {G}_{3}^{*} \left (X,Y, E_{3}^{*}\right)$ which together with $\mathcal {G}_{1}$ and $\mathcal {G}_{2}$ will make the final tripartite graph, $\mathcal {G}^{*} (X, Y, Z, E^{*})$, where E^{∗} denotes an enriched set of edges. New edges may be discovered by exploiting the existing edge distributions in $\mathcal {G}_{1}$ and $\mathcal {G}_{2}$. For example, if items x_{i} of X and y_{j} of Y share the same (or almost the same) set of neighbours {z_{k}} in Z, then it may be supposed that an edge might exist between x_{i} and y_{j}. Figure 2 illustrates the discovery of a candidate edge between x_{2} and y_{2} because these items are associated with the same subset of items { z_{1}, z_{3}, z_{4}} from Z. Candidate edges found in this way are then scored and filtered, as described in more detail below.
It is now possible to instantiate our model with a set of MF GO terms (X), a set of Pfam domains (Y), and a set of UniProtKB/SwissProt sequences (Z). E_{1} is the set of edges derived from the MF GO annotation of UniProtKB/SwissProt sequences, E_{2} is the set of edges derived from the domain contents of UniProtKB/SwissProt sequences, and $E_{3}^{\null }$ is the set of edges derived from the InterPro manually curated MF GO annotations of Pfam domains. In this case, our aim is to produce $E_{3}^{*}$, which will contain an enriched set of MF GOPfam associations weighted by their neighbourhood similarity score.
Biadjacency representation of bigraphs
While graphs allow complex relationships to be visualised easily, analysing graphs computationally can be very timeconsuming. In our approach it is convenient to represent each bigraph as a biadjacency matrix, in which a matrix element has a value of 1 or 0 according to whether the corresponding pair of nodes is connected or not.
Given a tripartite graph $\mathcal {G}^{\null } (X, Y, Z, {E}^{\null })$ as input, the core CODAC algorithm divides it into two bigraphs $\mathcal {G}_{1} (X, Z, E_{1})$ and $\mathcal {G}_{2} (Y, Z, E_{2})$. A procedure named Cosine calculates a cosine similarity matrix C between items of X and items of Y using the two biadjacency matrices M_{1} (of dimension X×Z) and M_{2} (dimension Y×Z), derived from $\mathcal {G}_{1}$ and $\mathcal {G}_{2}$, respectively. These matrices are then rownormalised to give matrices U_{1} and U_{2}. Each element of the matrix $C = U_{1} \times U_{2}^{T}$ thus represents a cosine similarity between an item x of X and an item y of Y, according to the number of common associations with the items in Z.
The main procedure called PredictAssociations determines a similarity threshold T for filtering the raw scores in C to produce C^{∗}. The matrix C^{∗} can be interpreted as the weighted biadjacency matrix of the enriched bigraph $\mathcal {G}_{3}^{*}\left (X, Y, E_{3}^{*}\right)$ and therefore used to predict new weighted associations between items of X and Y. Pseudocode for the core CODAC algorithm is presented in Algorithm 1.
Gold standard of positive and negative examples
In order to determine an edge similarity threshold, we need to define a “gold standard” set of positive and negative examples of associations. Here, we take all of the $P = E_{3}^{\null }$ existing associations present in $\mathcal {G}_{3}^{\null }$ as positive examples. To create negative examples, we shuffle the edges of $\mathcal {G}_{1}$ and $\mathcal {G}_{2}$ in order to rearrange in a random way all edges between X and Z, and between Y and Z. During shuffling, the node degrees of each x_{i}, y_{j} and z_{k} is kept constant, and the shuffled edges are constrained not to overlap the original edges. The shuffled graphs are denoted by $\mathcal {G}_{1}^{\#}$ and $\mathcal {G}_{2}^{\#}$, from which a new shuffled cosine similarity matrix, C^{#}, may be calculated. This matrix is then used to select N=P negative examples at random. Taken together, the P positive and N negative examples constitute our “Gold Standard” dataset.
Determining the score threshold
We randomly split the Gold Standard dataset into two groups with equal distributions of positive and negative examples to give a “Training” and a “Test” subset. We then rank the scores of all members of the Training subset, and label them “positive” or “negative” according to a score threshold that is varied from 0.0 to 1.0 in steps of 0.001. This allows us to determine the numbers of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) predictions for each threshold. We then calculate the recall, R=TP/(TP+FN), precision, P=TP/(TP+FP), and the Fmeasure, F_{1}=2RP/(P+R). The similarity threshold T that gives the best Fmeasure with the Training subset is verified using the Test subset and retained to calculate a filtered cosine similarity matrix, C^{∗}, according to $C_{i,j}^{*} = C_{i,j}$ if C_{i,j}>T or if the (x_{i},y_{j}) edge already exists in E_{3}, otherwise, $C_{i,j}^{*} = 0$.
Combining multiple datasets
There may often be more than one configuration for a graph $\mathcal {G}^{\null }$, that has the same $\mathcal {G}_{3}^{\null }$ but different Z, E_{1}, and E_{2} in $\mathcal {G}_{1}$ and $\mathcal {G}_{2}$. In our instantiation this corresponds to the fact that GO terms and Pfam domains can be indirectly connected either through UniProtKB/SwissProt sequences [12] or through PDB chains in SIFTS [13]. To handle multiple datasets, each input tripartite graph is processed separately to calculate its respective cosine similarity matrix C^{d}. The cosine similarity scores are then combined as a weighted average to give a consensus similarity matrix, CS (Algorithm 2). Whenever there is no data for a given pair (x,y) in an input graph, the corresponding score $C_{x, y}^{d}$ is set to zero.
Receiveroperatorcharacteristic (ROC) analysis provides an objective way to measure the ability of an information retrieval system to retrieve positive documents as first ranked, i.e. with the best scores [14]. One advantage of ROCbased approaches is that they are rather insensitive to the particular numbers of the positive and negative instances used [15]. Here, in order to find the best values for the dataset weights w_{d}, each weight is varied from 1 to 10 in steps of 0.1, and for each combination of weights a ROC performance curve is calculated using the complete ranked list of consensus scores and our Gold Standard set of positive examples. The combination of weights that gives the largest area under the curve (AUC) is selected and used to calculate the best consensus similarity matrix CS. Then, the PredictAssociations procedure determines the best threshold to filter the consensus similarity matrix CS and to deduce the resulting enriched bipartite graph $\mathcal {G}_{3}^{*}$.
Bipartite graph extension with hierarchy of classes
Ontologies are often described as taxonomic hierarchies of classes, as is the case for the GO gene ontology [5]. Thus, if one of the input graphs contains items from a hierarchical ontology, important relationships between the ancestors of a term and its neighbour(s) could be missed because they are generally not mentioned explicitly in the data. For example, if a vertex x from set X represents a term in an ontology and has a neighbour z in set Z, it is quite possible that all of the ancestors of x present in X should also have z as neighbour. If requested by the user, whenever an edge (x,z) is found where z is annotated with an ontology term x, then CODAC will add additional edges between item z and all parents of x present in X. This is illustrated in Fig. 3.
Clustering graph edges
A possible source of bias in any data mining approach is the existence of redundant items in the input. This is especially the case for protein entries in UniProt where it is quite possible to have entries with different identifiers but identical aminoacid sequences. In order to deal with this possibility, CODAC groups all items in Z into clusters having 100% identity. Each cluster is represented by a unique cluster identifier (CID). As shown in Algorithm 3, all source edges (x,z_{i}) and (y,z_{j}) from E_{1} and E_{2} in which identical z_{i} and z_{j} belong to the same CID, are merged into unique (x,CID) and (y,CID) edges, producing $\mathcal {G}_{1}^{Cl}$ and $\mathcal {G}_{2}^{Cl}$, the reduced bipartite graphs that serve as input to the CODAC core approach. It should be noted that the 100% sequence identity threshold may be reduced to 99% or lower if desired. As illustrated in Fig. 4, grouping identical items into clusters of 100% identity can be very beneficial for recovering missing edges.
Calculating statistically significant edges in $E_{3}^{*}$
While our approach provides a systematic way to predict edges in $\mathcal {G}_{3}^{*}$, it is important to calculate a probability, or “pvalue”, for finding an edge simply by chance. For example, it is reasonable to suppose that an edge (x,y) might be predicted at random if x and y are each highly connected to many items in Z. In order to estimate the probability of finding edges by chance, one could generate multiple random graphs by shuffling the edges of a given input graph, as described above for constructing the Gold Standard Negative examples. However, this is quite impractical given the very large numbers of items in X, Y, and Z and the complexity of the filtering procedure that would have to be repeated for each shuffled version of the dataset. Instead, we assume that the probability for finding an edge (x,y) by random chance is given by a hypergeometric distribution of the number of common neighbours (x,z) and (y,z). Letting N_{z} denote the total number of items in Z, N_{x} the number of neighbours of x in Z, and N_{y} the number of neighbours of y in Z, the hypergeometric probability distribution is given by
where $p(K \geqslant K_{x,y})$ is the predicted probability of having a number, K, equal to or greater than the observed number K_{x,y} of common neighbours z of both x and y. Because this pvalue test is applied to a large number of (x,y) edges in $\mathcal {G}_{3}^{*}$, we apply a Bonferoni correction to take into account the socalled familywise error rate [16]. Therefore, letting $E_{3}^{*}$ denote the total number of edges tested, we consider any pvalue less than $ 0.05 / E_{3}^{*}$ as denoting a statistically significant edge.
Classification into gold, silver, and bronze associations
While the above consensus scores and pvalues give objective measures of the quality of predicted associations, from a user’s point of view it is often convenient to provide a simple and memorable quality scale. Therefore, we classify a predicted association as “Gold” if all of the individual data source pvalues for this association are statistically significant.
A predicted association is classed as “Silver” if more than half of the data source pvalues are statistically significant. Otherwise, it is classed as a “Bronze” association.
Results and discussion
GODomainMiner data preparation
In this paper, the CODAC approach is applied to discover new weighted GOdomain associations. In our $\mathcal {G}^{\null } (X, Y, Z, {E}^{\null })$ tripartite graph model, the set X corresponds to one of the MF, BP or CC GO namespaces, and Y corresponds to one of the Pfam, CATH, or SCOP protein domain classifications. For each of the 9 combinations of X and Y, 3 data sources were selected to provide common neighbours (Z) of the items in X and Y, namely: (i) SIFTS providing curated PDB chain associations, (ii) UniProtKB/SwissProt (SP) providing curated UniProt entries, and (iii) UniProtKB/TrEMBL (TR) providing noncurated automatically annotated UniProt sequences.
Flat data files of SIFTS (June 2017), UniProt (June 2017), and InterPro (version 63.0) were downloaded and parsed using inhouse Python scripts. Associations between PDB chains and GO terms, and associations between PDB chains and protein domains (Pfam, CATH, and SCOP) were extracted from the SIFTS data. All CATH and SCOP domain families were transformed into their corresponding superfamilies, and all Pfam “repeat” and “motif” domain types were discarded. Associations between UniProt sequence accession numbers (ANs) and GO terms and ANPfam associations (as well as ANCATH and ANSCOP associations) were extracted from the UniProtKB/SwissProt and UniProtKB/TrEMBL sections of UniProt to give two datasets of UniProtKB/SwissProt associations and UniProtKB/TrEMBL associations, respectively. Then, using the evidence code of the GO term, the associations in the SIFTS, UniProtKB/SwissProt, and UniProtKB/TrEMBL datasets were divided into two groups, namely one group for which the GO term evidence code indicated manual curation, and one group for GO terms with evidence code “inferred from electronic annotation” (IEA). Here, the resulting 6 datasets are called SIFTS, SIFTSIEA, SP, SPIEA, TR, and TRIEA. Thus, there are 6 input tripartite graphs for each of the 9 combinations of the X and Y source datasets. All PDB chain IDs and UniProt ANs having identical sequences were clustered using the Uniref nonredundant cluster annotations [17].
We do not make any distinction between the various possible manual evidence codes. However, we note that the GO_REF field for IEA currently covers 12 annotations sources, namely InterPro2GO, UniProt Keywords2GO, UniProt Subcellular Location2GO, EC2GO, UniRule2GO, UniPathway2GO, Ensembl Compara, Ensembl Fungi, Ensembl Metazoa, Ensembl Plants, Ensembl Protists, and the Gene Ontology Consortium. Of these, the largest number of annotations come from InterPro2GO and UniProt Keywords2GO, which each provide around 169 million associations in UniProtKB. It should be noted that, only 34%, 4%, and 5% of the InterPro2GO annotations are GOPfam, GOCATH, and GOSCOP associations, respectively.
Dataset weights and threshold scores
For each of the 9 settings of this study, the weights assigned to each dataset have been optimised. The procedure is described in Methods (Algorithm 2) and is based on a ROCplot analysis of the ranking of our GoldStandard InterProbased positive examples versus all other associations computed from all the datasets and considered as background. Then the best threshold is determined from the consensus scores calculated with the optimised set of weights, using the Gold Standard Training and Test subsets of positive and negative examples. Table 1 shows that our procedure gives greater weight to GOPfam associations from the IEA sections of the SIFTS, UniProtKB/SwissProt, and UniProtKB/TrEMBL than to associations from the experimental and manually curated sections of SIFTS and UniProtKB/SwissProt datasets. In order to investigate this further, we recalculated the AUCbased weight optimization with all IEA weights forced to zero (Additional file 1). This caused our optimal AUC to fall from around 0.96 to less than 0.60. This reflects the fact that in this setting, we do not consider the propagated InterPro2GO annotations in UniProtKB, and consequently GODomainMiner retrieves fewer GoldStandard associations. However, as IEA annotations are extracted from several other data sources as well as InterPro, setting the IEA weight to zero also excludes these other data sources (refer to previous section). We therefore decided to include all IEA data in the rest of this study.
Analysis of algorithm complexity
Because we exploit existing UniProt cluster IDs to form clusters of similar protein sequences and to eliminate duplicate sequences, the computational cost in the initial data preparation stage scales as approximately O(s×c), where s is the number of sequences and c is the number of UniProt clusters. The scoring stage then scales as O(g×d), where g is the number of GO terms and d is the number of domains. Here, the largest calculation is to find GO BPPfam associations. This takes around 12 hours on one CPU core of an Intel Xeon E52630 2.40 GHz workstation with 128 Gb memory.
Analysis of calculated GOPfam associations
Summaries of our calculated GO MFdomain, BPdomain, and CCdomain associations are shown in Tables 2, 3, and 4, respectively. These tables show the numbers of distinct GO terms and domain entries (in units of thousands) involved in associations for the 6 source datasets, the filtered GODomainMiner predictions and the InterPro dataset of positive associations. In these tables, the total numbers of GOPfam associations found by GODomainMiner refer only to mostspecific GO terms in each branch of a GO hierarchy. In other words, if a domain is associated to a GO term and to one or more of its parent terms, only the mostspecific (nonparent) term is counted as a found association.
The overlap between the GODomainMiner predictions and InterPro is shown in the last row of these tables (here, a match at any GO level is counted as a common association). The high percentage of overlap between GODomainMiner and InterPro (from 91% to more than 99%) reflects the fact that our method is calibrated to recover as many as possible correct InterPro associations. Nevertheless it also shows that a small percentage of the InterPro associations have consensus scores below our calculated score threshold, revealing the role of human rather than datadriven knowledge in the definition of such associations.
Overall, our approach yields a total of 32,881 MF GOPfam associations (shown as 33×10^{3} in Table 2) that include 3968 associations already present in InterPro (2657 specific term matches plus 1311 parent term matches). This corresponds to an enrichment of about 8fold in MF GOPfam associations. Similar calculations give foldenrichments of about 22 and 13 for MF GO associations with CATH and SCOP domain superfamilies, respectively. For BP GO terms, we find foldenrichments of 20, 50, and 31 for associations with Pfam, CATH, and SCOP domains, respectively, and for CC GO terms the foldenrichments are 17, 52, and 31, respectively. A comparison with the Pfam2GO associations from the Gene Ontology website was also performed. It reveals that GODomainMiner retrieves 3966, 3541, and 2055 MF, BP, and CC GOPfam associations that were provided by Pfam2GO, respectively. On the other hand, it finds 99 out of 187 MF GOPfam associations, 108 out of 256 BP GOPfam associations, and 29 out of 65 CC GOPfam associations which are present in Pfam2GO but which are not in the InterPro database.
These results indicate that GODomainMiner discovers many new associations compared to Pfam2Go and InterPro. This can be explained by the fact that our program does not make any consideration about the possible usage of these associations for protein annotation, whereas InterPro policy is to retain only those GOdomain associations that can be transferred to all the proteins containing a given domain [18].
Distribution of GOdomain associations per GO term and per domain
Figure 5a shows the average numbers of MF, BP, and CC GOPfam associations per GO term and Pfam entry for associations in InterPro (green) and those calculated by GODomainMiner when counting the mostspecific GO terms assigned to a domain (purple).
GODomainMiner generally predicts more associations per GO term and per Pfam domain than exist in InterPro. For example (top panel), GODomainMiner predicts that each MF GO term and each Pfam entry are associated with an average of 5.2 domains and 4.0 MF GO terms, respectively, compared to averages of 3.9 domains and 1.3 MF GO terms in InterPro, respectively. For BP and CC GO terms we see similar enrichments from GODomainMiner compared with InterPro, with ratios of 5.4 versus 3.5 and 16.9 versus 6.8 associations per GO term, and 8.2 versus 1.17 and 4.5 versus 1.1 associations per Pfam, respectively. These results demonstrate that GODomainMiner produces a considerable enrichment in the number of annotations compared to InterPro. They also support the notion that many Pfam domains participate in different functions, either as singleton domains or as components of multidomain proteins.
The bar charts in Fig. 5b show the distributions of GO terms (shown in orange) and Pfam entries (in blue) according to the number of associations they are involved in. For example, considering the first two bars in part B, it can be seen that some 2100 MF, 3500 BP, and 320 CC GO terms and 2600, 2300, and 2800 Pfam domains are involved in only one GOPfam association. The remainder of this figure shows that many GO terms and Pfam domains are involved in two or more associations, which supports the notion that complex manytomany relationships exist between GO terms and domains (Fig. 1). More precisely, Fig. 5b indicates that the number of Pfam domains involved in only one GO BPPfam association is less than the number of Pfam domains involved in only one MFPfam association. This is consistent with the notion that a domain most likely has one function but it can be involved in several processes. Moreover, on average, twice as many BP terms are associated to Pfam domains as MF and CC terms (Fig. 5a), which demonstrates the complexity of assigning GO BP terms to Pfam domains. On the other hand, this ratio is consistent with the idea that GO BP terms describe the cooperation of one or more individial molecular functions to achieve a particular biological purpose [19]. Similar results for GOCATH and GOSCOP associations are shown in Additional file 1.
Finally, Table 5 shows the distribution of GODomainMiner predicted associations according to our Gold, Silver, and Bronze classification, along with the degree of overlap with the InterPro reference dataset. Since the Gold class represents associations with statistically significant pvalues, it is interesting to see that the majority (68%) of our predicted MF GOPfam associations common with InterPro fall in this class. Overall, we calculate that 47% of the GODomainMiner MF GOPfam associations and 33% of the predicted BP and CC associations are of Gold quality. The quality of GO predictions for CATH and SCOP classifications also follow very similar paths (see Additional file 2).
Comparison with GOdomain associations from dcGO
In order to compare the GODomainMiner results with those obtained from dcGO [6], we extracted the Pfam2GO associations from the dcGO website [20]. To avoid the complexity of comparing GO annotations at different levels in the rDAG, our comparison mainly focuses on GOdomain associations in which GO terms are leaves of the GO rDAG. GODomainMiner contains a total of 515,582 GOPfam associations regardless of their level in GO hierarchy, of which 79,589 involve leaf GO terms (comprising 21,410 MF, 36,814 BP, and 21,365 CC GOPfam associations). The Pfam2GO dataset from dcGO contains a total of 720,534 associations, of which 62,779 involve leaf GO terms (comprising 5939 MF, 24,334 BP, and 32,506 CC associations). Thus, the numbers of associations in GODomainMiner and Pfam2GO are broadly comparable. However, when considering the leaf levels of all 3 ontologies, Fig. 6 shows that only 11,138 GOPfam associations are common between GODomainMiner and dcGO (overlap region B, about 14% of the GODomainMiner set and 18% of the dcGO set). Looking at the overlap with InterPro, which contains 2799 leaf level GOPfam associations, GODomainMiner shares 2744 associations (98%) with InterPro, while dcGO shares only 724 associations (26%; overlap C). This shows that GODomainMiner gives a greater coverage of the InterPro reference set than dcGO. Although this is perhaps not surprising since InterPro was used to calibrate GODomainMiner, the high agreement between GODomainMiner and InterPro gives a good indication of the reliability of other associations predicted by GODomainMiner.
We also compared GOSCOP associations predicted by GODomainMiner with the SCOP2GO database from dcGO and with InterPro. Overall, GODomainMiner calculates a total of 19,708 leaf GOSCOP associations, compared to 2445 such associations in SCOP2GO and 422 in InterPro. Of these, 845 GOSCOP associations are common to GODomainMiner and SCOP2GO. Also, 421 (i.e. 99.75% of InterPro set) GODomainMiner associations overlap with InterPro, whereas only 55 (13% of InterPro set) SCOP2GO associations from dcGO are found in InterPro. This confirms the trend observed for GOPfam associations, in favor of a much better coverage by GODomainMiner than by dcGO, of the InterPro reference set.
Biological assessment of new discovered GOPfam associations
It would certainly be a very tedious task to validate manually the huge number of new GOdomain associations proposed by the GODomainMiner approach. For this reason, we decided to check manually a small subset of these associations, namely the strict onetoone and manytoone GOdomain associations in which one or several GO terms are uniquely associated with one domain and where this domain is not associated with any other GO terms. Such associations can easily be used to assess the novelty and biological consistency of knowledge discovered through our approach. All lists of strict onetoone and manytoone associations found in the 9 settings of this study are available on the GODomainMiner website.
For the sake of brevity, we review here only the onetoone and manytoone MF GOPfam associations. We obtained 125 onetoone MF GOPfam associations with consensus scores ranging from 0.9704 to 0.0052, 75 associations in the gold category (all pvalues significant), 30 and 20 in the silver and bronze categories, respectively. From the 125 associations, 30 are already known in InterPro (21 from the gold category) and 95 are new (54 from the gold category). Manual checking of the MF GO terms and Pfam domain names led us to distinguish 5 situations (see the examples in Table 6). (i) The MF GO terms and Pfam domains descriptions are almost identical (34 associations). Such associations are trivial but only 16 of them are reported in InterPro, probably because the remaining 18 escaped automatic retrieval due to unpredictable spelling differences. (ii) The MF GO term is more specific than the Pfam domain description (21 associations including 3 from InterPro). (iii) The Pfam description is more specific than the MF GO term (11 associations including 3 from InterPro). (iv) The MF GO term and the Pfam descriptions are quite different (51 associations including 8 from InterPro). Such associations are likely the most interesting to provide to the expert for further analyses. (v) The Pfam domain has no known function (8 associations not present in InterPro). These 8 associations are listed in Table 6 as examples of new knowledge discovered by the CODAC approach. We expect that many further novel associations between MF GO terms and yet uncharacterized domains may be mined from the complete MF GOPfam dataset which contains more than 3400 associations concerning socalled DUF (Domain of Unknown Function) or UPF (Uncharacterized Protein Family) Pfam domains.
Concerning the strict manytoone MF GOPfam associations, we identified 30 such Pfam domains, most of which have only two associated GO terms. This results in 55 associations of which 7 are known in InterPro (6 gold and 1 bronze) and 48 are new (33 gold, 8 silver and 7 bronze). For one Pfam domain only (CobS, PF02654) the two GO terms are known already in InterPro. For 5 other Pfam domains, one of the GO terms is known in InterPro and the other one is new. New MF GOPfam associations generally give lower scores than known InterPro associations. However, in some cases this suggests an alternative substrate for the domain activity which may be interesting to investigate. For example, for Pfam domain Mqo (PF06039 Malate:quinone oxidoreductase), GO:0052589 (malate dehydrogenase (menaquinone) activity) is found in addition to GO:0008924 (malate dehydrogenase (quinone) activity). The remaining 24 Pfam domains all have new GO MF annotations that do not exist in InterPro. Interestingly, in some cases a different more general InterPro annotation exists, as in the case of PF07722 domain Peptidase_C2 which GODomainMiner associates with GO:0034722 (gammaglutamylpeptidase activity) and with GO:0033969 (gammaglutamylgammaaminobutyrate hydrolase) activity, whereas the InterPro annotation is simply GO:0016787 (hydrolase activity).
Implications for protein sequence annotation
It is natural to suppose that predicted GOdomain associations could help to annotate entire protein sequences. However, it does not automatically follow that GOdomain associations are directly transferable to sequences because the function of a particular protein can depend on, for example, its domain architecture, organism, cell type, and cellular location [18]. Therefore, an automatic domainbased sequence annotation system should take such factors into account by, e.g., constructing and applying filtering rules that take into account the taxa and cellular environment of each protein sequence to be annotated.
In any case, it is reasonable to expect that the difference in specificity compared to InterPro annotations will likely prevent many of the GODomainMiner annotations from being transferred directly to all proteins that match a given domain. However, there is no doubt that the newly discovered associations should contribute to the generation of new rules to annotate protein sequences. Nonetheless, the domainlevel functional annotations predicted by GODomainMiner should first be subjected to further benchmarking in order to validate their usefulness. We recently participated in the 2017 round of the CAFA (Critical Assessment of Functional Annotation) community experiment [21], in which we applied taxabased filtering of GODomainMiner annotations. However, the evaluation of this CAFA edition has not yet been published. Participation in future CAFA editions will allow GODomainMiner’s annotations to be assessed according to community standards.
Conclusion
We have presented a systematic approach called CODAC for mining associations from datasets that can be represented as tripartite graphs. We have presented one implementation of this approach called GODomainMiner, for predicting associations between GO terms and protein domains. This was achieved by first collecting existing Pfam, CATH, and SCOP domain annotations of protein chains and sequences on one hand and MF, BP, and CC GO term annotations on the other. We then applied our method to find a list of direct associations between GO terms and domains. Considering only the mostspecific GO terms, our approach yields an enrichment of about 15fold in the number of GOPfam associations that currently exist in InterPro. A selected subset of onetoone and manytoone associations has been analyzed from a biological point of view, and these all appear to be highly meaningful and consistent with available knowledge. Nonetheless, there remains a need for the associations predicted by our approach to be validated more extensively, and we plan to test our approach thoroughly in the next CAFA community experiment.
References
 1
Berg JM, Tymoczko JL, Stryer L. Protein structure and function. New York: WH Freeman; 2002.
 2
Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, Heger A, Hetherington K, Holm L, Mistry J, L SEL, Tate J, Punta M. Pfam: the protein families database. Nucleic Acids Res. 2014; 42(D1):222–30.
 3
Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH – a hierarchic classification of protein domain structures. Structure. 1997; 5(8):1093–109.
 4
Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995; 247(4):536–40.
 5
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.Gene ontology: tool for the unification of biology. Nat Genet. 2000; 25(1):25–9.
 6
Fang H, Gough J. dcGO: database of domaincentric ontologies on functions, phenotypes, diseases and more. Nucleic Acids Res. 2013; 41(D1):536–44.
 7
Mitchell A, Chang HY, Daugherty L, Fraser M, Hunter S, Lopez R, McAnulla C, McMenamin C, Nuka G, Pesseat S, SangradorVegas A, Scheremetjew M, Rato C, Yong SY, Bateman A, Punta M, Attwood TK, Sigrist CJA, Redaschi N, Rivoire C, Xenarios I, Kahn D, Guyot D, Bork P, Letunic I, Gough J, Oates M, Haft D, Huang H, Natale DA, Wu CH, Orengo C, Sillitoe I, Mi H, Paul D, Thomas PD, D FR. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 2015; 43(D1):213–21.
 8
Alborzi SZ, Devignes MD, RITCHIE DW. Ecpsi: Associating enzyme commission numbers with pfam domains. bioRxiv. 2015. https://doi.org/10.1101/022343. http://biorxiv.org/content/early/2015/07/10/022343.full.pdf.
 9
Alborzi SZ, Devignes MD, Ritchie DW. Ecdomainminer: discovering hidden associations between enzyme commission numbers and pfam domains. BMC Bioinformatics. 2017; 18(1):107.
 10
Gutmanas A, Alhroub Y, Battle GM, Berrisford JM, Bochet E, Conroy MJ, Dana JM, Montecelo MAF, van Ginkel G, Gore SP, Haslam P, Hatherley R, Hendrickx PMS, Hirshberg M, Lagerstedt I, Mir S, Mukhopadhyay A, Oldfield TJ, Patwardhan A, Rinaldi L, Sahni G, SanzGarcía E, Sen S, Slowley RA, Velankar S, Wainwright ME, Kleywegt GJ. PDBe: protein data bank in europe. Nucleic Acids Res. 2014; 42(D1):285–91.
 11
Alborzi SZ, Devignes MD, Ritchie DW. Associating gene ontology terms with Pfam protein domains In: Rojas I, Ortuño F, editors. Bioinformatics and Biomedical Engineering. IWBBIO 2017, vol. LNCS 10209. Cham: Springer: 2017. p. 127–138.
 12
The UniProt Consortium. The universal protein resource (UniProt) in 2010. Nucleic Acids Res. 2010; 38(suppl 1):142–8.
 13
Velankar S, Dana JM, Jacobsen J, van Ginkel G, Gane PJ, Luo J, Oldfield TJ, O’Donovan C, Martin MJ, Kleywegt GJ. SIFTS: structure integration with function, taxonomy and sequences resource. Nucleic Acids Res. 2013; 41(D1):483–9.
 14
Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. New York: Cambridge University Press; 2008.
 15
Chawla NV, Japkowicz N, Kotcz A. Special issue on learning from imbalanced data sets. ACM Sigkdd Explor Newsl. 2004; 6(1):1–6.
 16
Cui X, Churchill GA. Statistical tests for differential expression in cdna microarray experiments. Genome Biol. 2003; 4(4):210.
 17
Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH. UniRef: domprehensive and nonredundant UniProt reference clusters. Bioinformatics. 2007; 23(10):1282–8.
 18
SangradorVegas A, Mitchell AL, Chang HY, Yong SY, Finn RD. Go annotation in interpro: why stability does not indicate accuracy in a sea of changing annotations. Database. 2016; 2016:027. https://doi.org/10.1093/database/baw027.
 19
Bargsten JW, Severing EI, Nap JP, SanchezPerez GF, van Dijk AD. Biological process annotation of proteins across the plant kingdom. Curr Plant Biol. 2014; 1:73–82.
 20
dcGO. http://supfam.org/SUPERFAMILY/dcGO/. Accessed 28 July 2017.
 21
Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, BenHur A, et al.A largescale evaluation of computational protein function prediction. Nat Methods. 2013; 10(3):221–7.
Acknowledgements
The authors wish to thank JeanSebastien Sereni for careful reading of the graph modeling section.
Funding
This work and the publication of this article were funded by Agence Nationale de la Recherche (grant numbers: ANR11MONU00602 and ANR15RHUS0004), Inria Nancy – Grand Est, and Région Lorraine (CPER IT2MP).
Availability of data and materials
The GODomainMiner results can be accessed with a web browser at http://godm.loria.fr/.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 19 Supplement 14, 2018: Selected articles from the 5th International WorkConference on Bioinformatics and Biomedical Engineering: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume19supplement14.
Author information
Affiliations
Contributions
SZA designed the study and was involved in data processing and management, analysis and testing. MDD was involved in biological interpretation. SZA, MDD, and DWR worked together on the mathematical modeling of the approach, discussed the results and wrote the manuscript. All authors have read and approved the final manuscript.
Corresponding author
Correspondence to MarieDominique Devignes.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1
Supplementary figures. (PDF 93 kb)
Additional file 2
Supplementary tables. (PDF 36 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Published
DOI
Keywords
 Protein structure
 Protein domain
 Protein function
 Gene ontology
 Vector similarity