Algorithm
The steps followed to obtain the Metric Model are schematically represented in Figure 1.
First, for each Molecular Function term we create a profile vector that represents its presence/absence in different Intepro entries (Figure 1, box 1). These vectors resemble the 'phylogenetic profiles' used to encode the proteins present in different organisms and to detect protein relationships [18, 19].
Initially, we started with 1532 MF-GO terms present in 5535 Intepro entries. Additionally, we included the semantic relationships represented by the Gene Ontology DAG by assigning the same Interpro domain to the parent(s) of a given GO term. The profiles were checked to detect the terms that were associated exclusively to one Interpro entry and to ensure that this entry was not annotated with any other term. Any such profiles were removed because they do not help to extract relationships between the terms. After filtering, we obtained a matrix of 1778 Interpro entries with 1392 MF-GO terms. In a second step (Figure 1, box 2), we built a matrix of co-occurrences of GO terms in Interpro entries. The occurrences were accumulated through all the profiles and we obtained the total mutual occurrences in the universe of the 1392 terms. Each co-occurrence vector describes a MF-GO term in relation to the rest of the MF-GO terms, which enables it to be used as a feature vector in the application of statistical learning techniques.
Third, the similarity between the terms was calculated using the cosine distance between their corresponding co-occurrence vectors (Figure 1, box 3). The similarity matrix S was obtained by crossing the vectors all-against-all (as graphically represented in Figure 2A) and the functional groups were obtained by the clustering of S. Full details of the Similarity matrix calculus are available in the Methods section. Finally we applied a Spectral Clustering algorithm [20, 21] as it performs a dimensional reduction of the data (Figure 1, box 4). The general ideas behind Spectral Clustering methods are introduced in the 'Spectral Clustering' subsection from the appendix. This approach improved the search for functional groups in the MF-GO terms space.
Spectral Clustering considers S as the Adjacency Matrix of a normalized weighted graph G, where the nodes stand for the MF-GO terms linked by the similarity values. Thus, the clustering problem is transformed into a partitioning graph problem. We only considered the graph comprised of terms that were connected with significant relationships, that is those connected by a pairwise similarity greater than a manually selected threshold value (see Methods, 'Similarity Matrix' subsection). After imposing this constraint, we obtained 995 MF-GO terms from the total of approximately 7500 terms integrated in the released version of this work.
We have also considered the NJW adaptation of Spectral Clustering (NJM-SC) by Ng, Jordan and Weiss [21], which is summarized in the general scheme in Figure 3 (see the 'NJW Spectral Clustering Algorithm' subsection from the appendix). The algorithm calculates a Transition Probability Matrix, P, from a N × N Similarity matrix, S, that represents the probability of transit from one node to another in the graph. P is diagonalized and its K first eigenvectors are stacked and normalized in a new K × K matrix, Y. The rows of Y can be treated as N vectors K dimensional. Therefore, NJM-SC projects the MF-GO terms (nodes of G) onto points in a K dimensional space. Subsequently, the terms can be grouped with any standard clustering technique. K was thus selected as the number of clusters in the optimum partition of G. The optimization procedure is presented in detail in the Methods section. The resulting number of optimal groups was 93.
Finally, from the vector projections of the MF-GO terms, we built a dendrogram with an Agglomerative Hierarchical Clustering algorithm [22] (Figure 1, box 5). The tree obtained ('Functional Tree') defines a distance D
f
between any two MF-GO terms from the set of 995 (see Additional file 1). The distance for two terms was the minimum height of their common nodes. From a mathematical point of view, D
f
satisfies the topological properties that induces a metric space (see 'Properties of a Metric Space' from the appendix). So, the metric generated by the Functional Tree establishes a 'distance scheme' that provides a measure of the closeness of any two MF-GO terms within the tree.
Testing
Functional Groups
The nodes of the Functional Tree are divided into groups imposing the number of clusters obtained in the optimization step. The 93 groups of Molecular Function terms are inspected and 20 groups with highly homogeneous biological function are detected. In the Functional Tree (Figure 4, see Additional file 1), the functionally homogeneous groups are coloured and ranked, and the labels assigned are shown with their rank number.
Some of these groups were very specific, like the group containing the 21 amino acyl-tRNA ligase activities. This group includes all the tRNA ligases and no other GO term, and hence the automatic clustering algorithm achieved a perfect segregation of this functional group (group 2).
Another big group mostly composed by activities related to hydrolysis (hydrolases, peptidases, nucleases, lipases) was labelled as group 1. Although this group was homogeneous for this activity, the coverage was not perfect since other hydrolases lay outside of this group. For example, group 7, which was far from group 2 in the tree, was mainly comprised of peptidase activities.
Interestingly, many different activities associated with DNA processing tended to cluster together despite the fact that they were apparently unrelated (i.e.: transcription factors and enzymes involved in DNA metabolism, DNA ligases, topoisomerases, etc... – group 3). As for the hydrolases case commented above, although this group contained only DNA-related activities and other DNA-related functional terms were not included in this group.
Most of the kinases of small metabolic compounds were clustered in a large group (group 15), while protein kinases were more widespread even though some of them clustered together in group 6. Many membrane transporters of apparently different nature (transporters for inorganic ions, drugs, proteins, etc...) were also clustered together in homogeneous groups.
All the 'protein inhibitor' activities within the dataset were clustered together in a homogeneous group (group 20), which is interesting given that the proteins they inhibit are of a very different nature (phos-phatases, ribonucleases, proteases, etc...).
Functional clustering was also evident for many other GO terms: methyl-transferases, phosphorybo-syltransferases, peptidases, some peptidic hormones, neurotransmitter receptors, phosphate-hydrolases, hy-dratases/dehydrateses, adenylyltransferases, etc....
For other clusters, this functional 'homogeneity' was not so evident. For example, oxidoreductases were spread across many groups even though some groups contained oxidoreductase activities only (group 19). This dispersion could be explained by the fact that this function is present in proteins with a very different evolutionary origin. Similarly, activities related to RNA metabolism were spread among the different clusters, except the tRNA-ligases discussed above.
In general, the clustering represented by the tree makes sense, meaning that GO molecular functions that are intuitively 'similar' were close in the tree and vice versa. This emphasise that the metric represented by the tree can be used to quantify functional similarities.
The Functional Tree as Metric Model
The functional distance D
f
defined from the Functional Tree allows a new quantitative analysis of the functional relationship between gene products. To assess D
f
as a functional similarity measure we correlated sequence and annotated function similarity over a set of aligned pairs of yeast proteins. The benchmark set has been selected by applying a very restrictive criterion to obtain a high reliable set of annotated proteins. The selection process (see Methods section) takes as quality assay the evidence codes in GOA. In this work, we picked only those sequences that had been functionally characterized either by experimental assay (IDA evidence code) or by traceable published works (TAS evidence code) and whose GO terms were included in our functional tree.
The functional distance between proteins (through their sets of annotated terms) is calculated using the hausdorff definition. The details are exposed in the 'Functional Comparison between Gene Products' subsection from Methods. The distance values are represented against the sequence similarity (Figure 5A). Lord semantic similarity D
s
was also implemented and represented in Figure 5B. Note that the Lord distance values are normalised in order to analize the metric derived in this work with respect to Lord's model.
To compare the models the mean distance values for each bin of sequence identity are superposed in figure 5C. In average, both approaches correlate well with sequence similarity and exhibit a similar trend for homologous pairs. This is partly due to the homology-based mechanism of annotation that transfers directly a source set of MF-GO terms to many homologous sequences.
In consequence, more than 84% of the alignments with sequence similarity values greater than 80% share the same annotations. This lack of richness in the annotations limits further analysis of the methods. However, we can observe that D
f
and D
s
show a different behavior. The distance space is discretized into three well-defined groups (Figure 5A) whereas the semantic similarity values produced a great spread.
These natural 'cut-offs' allow classifying the pairs into three categories with biological meaning that can be roughly labelled as 'closely functionally related' (distances less than 0.1), 'not related at all' (more than 0.9) and 'divergence in functionality' (in the intermediate interval with distances between 0.5 and 0.7). This partition results from the structure of the clusters (Figure 6B) showing small intra-group and large intergroup distances. This is in part due to 'biological' reasons but is also affected by the function transfer by sequence homology.
The repetitive and persistent presence of the same MF-GO terms in the Intepro domains indicates clear functional associations of the terms but it is also originated by the usage of a reduced set of annotations producing redundancy in the functional information of the sequences and low coverage with respect to the total number of terms in the ontology (1532 from a total of 7417 MF-GO terms).
In addition, the Functional Distance Model D
f
becomes very useful from the perspective of recovering proteins functionally similar to a query, as it provides new associations between the terms inferred from the homology information in the database entries. These new links enrich the ontology relationships among the terms. This is the case of group 3 (analized in the 'Functional Groups' from Testing section) whose MF-GO terms are spread across different lineages of the ontology involving DNA-related activities. These associations are very visible in many Pfam domains (Hormone receptor, Sigma-70 factor, Ets-domain, HSF-DNA binding, GATA-type transcription activator etc ...) but are not detected with a criteria based on the semantic proximity of the terms. Group 3 is partially represented in Figure 7 showing the relations of some terms in the ontology. Some terms of the group, such as 'DNA binding' and 'specific transcriptional repressor activity', are very distant in the DAG and share the root as common ancestor. This produces a semantic distance of 1. Other terms like 'transcription factor activity' and 'DNA replication origin binding' share the node 'DNA binding' that is three levels apart from to the root.
The benchmark set includes some example pairs in which D
f
assigns close distances to functionally related pairs while D
s
does not. One is the pair formed by [Uniprot:P20134] and [Uniprot:P10961] that shares 50% sequence identity. The first is a transcriptional repressor and activator annotated with the term GO:0016566 ('specific transcriptional repressor activity'). The second is a trimeric heat shock transcription factor annotated with GO:0003700 ('transcription factor activity'). Both are characterized by HSF-type DNA-binding Pfam domain. The relative posititions of their annotated terms in the DAG can be checked in Figure 7. D
s
is 0.76 indicating a weak relation between the proteins. However, D
f
situates the pair into the 'closely functionally related' region because the terms belong to the cluster 3 described before.
Other similar example is the pair formed by the protein kinases [Uniprot:P32801] and [Uniprot:P41808]. The proteins are characterized by protein kinase pfam domain and are annotated respectively with GO:0004674 ('protein serine/threonine kinase activity') and GO:0004707 ('MAP kinase activity'). Both terms belong to group 6 (Functional Groups subsection). So, as in the example before, the distance D
f
is 0 whereas D
s
is 0.6. Although these terms are close in the ontology ('protein serine/threonine kinase activity' is ancestor of 'MAP kinase activity' and separated only by two depth levels), the Lord model assigns such a value distance because the shared parent ('protein serine/threonine kinase activity') is referred many times in the gene association.goa human file. According to Lord definition, high probable terms carry low information content producing high distance values in the comparison of terms. In the case of the kinases pair, the probability introduces a bias that shifts the semantic distance value to a region that indicates, as the example before, a weak relation between the proteins.
Finally, the Functional Distance model is sensitive enough to detect subtle differences in the pairs [Uniprot:P15700]/[Uniprot:P07170] and [Uniprot:P15700]/[Uniprot:P26364] that are explained by sequence analyses. In both cases, the members of the pair were annotated with GO:0004849 ('uridine kinase activity') and GO:0004017 ('adenylate kinase activity') respectively. Lord's Semantic Model produced a distance of 0.37 between these proteins, indicating semantical relation. In fact, the aforementioned terms are close in the GO hierarchy, and the deepest common parent shared by both GO terms, two levels above, is 'nucleobase, nucleoside, nucleotide kinase activity' (GO:0019205). However, our Functional Distance located that pair of GO terms within the intermediate interval at a distance of 0.65. A thorough analysis of the sequences revealed that [Uniprot:P15700] has the 'ADK' Pfam domain. Adenylate kinases are phosphotransferases with well conserved ADK domains that include an important arginine which inactivates the enzyme if mutated, and an aspartate that is located in the catalytic cleft and that forms a crucial salt bridge. However, in the particular case of [Uniprot:P07170] and [Uniprot:P26364], the putative ADK domain is interrupted by another PFAM domain, the ADK lid. Looking at the sequence of this particular region, the ADK domain boundaries were not clearly delineated due to a high degree of divergence in the active site. So, in this example our metric is able to capture the 'functional difference' between these two proteins due to the inserted domain.
Implementation
The Spectral Clustering algorithm is implemented in Matlab 7.4.0 using the clustering functions available in the Statistics Toolbox. Lord's model is implemented in Python 2.5, and Python was also used to calculate the functional and semantic distances.