How to inherit statistically validated annotation within BAR+ protein clusters
© Piovesan et al.; licensee BioMed Central Ltd. 2013
Published: 28 February 2013
In the genomic era a key issue is protein annotation, namely how to endow protein sequences, upon translation from the corresponding genes, with structural and functional features. This operation is routinely done electronically by deriving and integrating information from previous knowledge. The reference database for protein sequences is UniProtKB, which is divided into two sections: UniProtKB/TrEMBL, which is automatically annotated and not reviewed, and UniProtKB/Swiss-Prot, which is manually annotated and reviewed. The annotation process is essentially based on sequence similarity search. The question therefore arises as to what extent annotation based on transfer by inheritance is valuable, and specifically whether it is possible to statistically validate inherited features when little homology exists between the target sequence and its template(s).
In this paper we address the problem of annotating protein sequences in a statistically validated manner, taking UniProtKB as the reference annotation resource. The test case is the set of 48,298 proteins recently released by the Critical Assessment of Function Annotations (CAFA) organization. We show that, after validation, we can transfer Gene Ontology (GO) terms of the three main categories and Pfam domains to about 68% and 72% of the sequences, respectively. This is possible after alignment of the CAFA sequences against BAR+, our annotation resource, which discriminates between statistically validated and not statistically validated annotation. By comparing with a direct UniProtKB annotation, we find that besides validating the annotation of some 78% of the CAFA set, we assign new and statistically validated annotation to 14.8% of the sequences and find new structural templates for about 25% of the chains, half of which share less than 30% sequence identity with the corresponding template/s.
Inheritance of annotation by transfer generally requires a careful selection of the identity value between the target and the template in order to transfer structural and/or functional features. Here we prove that even remote homologs can be safely endowed with structural templates and GO and/or Pfam terms, provided that annotation is done within clusters collecting cluster-related protein sequences and where a statistical validation of the shared structural and functional features is possible.
When a new protein sequence becomes available, the problem of its annotation arises. Most of our expertise in endowing the new sequence with structural and functional features is based on similarity search [1–4]. Methods are mainly based on the knowledge that structure is more conserved than sequence through evolution and that structural alignment is conserved as long as sequence identity (SI) is ≥ 30% over the alignment length. This was originally observed by Chothia and Lesk [5] and has been revisited from time to time as the number of proteins solved at atomic resolution and deposited in the Protein Data Bank (PDB) has increased [6]. The observation underlies one of the most popular methods for computing the three-dimensional structure of a target on a template, when one is found, after a sequence similarity search against the PDB [7]. Recently, maps of the protein structure space have revealed a fundamental relationship between protein structure and function [8]. When a target sequence aligns well with a template of known structure, its functional properties can be derived on the basis of structural conservation. Proteins sharing some 40-60% sequence identity are also likely to share similar function [9, 10].
However a problem is at hand: how to recognize structural and functional templates when sequence identity is below 30%. In this case proteins are categorized as distantly related to their homologous counterparts, since they may perform the same function, and possibly be endowed with the same structure, while sharing very little sequence homology [11, 12]. To this purpose methods have been developed that try to capture local sequence conservation by modeling conserved structural and functional protein domains. The most popular is Pfam ([13], http://pfam.sanger.ac.uk). In this case function can be inferred when a protein is significantly retained by a specific Pfam model, which is again based on a local sequence-to-profile alignment and its scoring. SUPERFAMILY (http://supfam.cs.bris.ac.uk/SUPERFAMILY), based on hidden Markov models like Pfam, has recently been modified to specifically address the problem of function assignment by including a domain-based Gene Ontology [14].
When function is to be assigned only on the basis of sequence, the problem remains unsolved, since very little is known about the relationship between sequence similarity and transfer of function [1, 9]. Functions can be described with specific terms following the Gene Ontology vocabulary, which comprises three main functional branches: Molecular Function (MFO), Biological Process (BPO), and Cellular Component (CCO) [15]. UniProtKB, the largest resource of protein sequences, curates automatically annotated protein records ([16], http://www.uniprot.org/help/biocuration). Here annotation integrates previous knowledge on protein structure and function from various sources, when available, again mainly based on sequence similarity search (UniProtKB/TrEMBL). Eventually the records are manually curated (UniProtKB/SwissProt). However, out of the over 18 million sequence entries presently available (Release 2011_12 of 14-Dec-2011), 75% are proteins inferred by homology or predicted, whose features in most instances are far from being attributed even with computational methods.
Several methods have been developed to predict protein function from structures and sequences. They try to infer features from selected and well annotated sets of proteins by means of different computational approaches, including machine learning, and generally aim at integrating different sources of information (for recent reviews see [17, 18]).
Here we take advantage of the recently released set of proteins selected by CAFA (http://biofunctionprediction.org/) for function prediction in order to discuss how inheritance of annotation can be statistically validated. Validation, when possible, is an added value to the annotation process. For this we developed BAR+ [19, 20], a non-hierarchical clustering annotation procedure that allows different types of annotation by means of a cluster-mediated transfer of annotation. We also show that our method allows a gain of annotation over a direct Pfam prediction and GOA electronic annotation (http://www.ebi.ac.uk/GOA/).
Databases and methods
The test set includes 48,298 sequences made available during the 2011 CAFA experiment (CAFA set, http://biofunctionprediction.org). Of these, 41,003 sequences (85% of the CAFA set) could be mapped to UniProtKB Release 2010_05 (CAFA/UniProtKB set); 96% of the CAFA/UniProtKB set were manually curated (UniProtKB/SwissProt) and 2,047 proteins also have a PDB structure; 13,684 of the set are proteins inferred from homology and predicted. We found that 44,495 sequences (92% of the CAFA set) could be mapped into BAR+ (CAFA/BAR+ set).
BAR+, the Bologna Annotation Resource, is our annotation system (BAR+ is available at http://bar.biocomp.unibo.it/bar2.0/). BAR+ allows transfer of validated annotation [19, 20]. The method relies on the concept that sequences can inherit the same function/s and structure from their counterparts, provided that they fall into a cluster endowed with validated annotations. BAR+ is based on a clustering procedure with the constraint that sequence identity (SI) is ≥ 40% on at least 90% of the pairwise alignment overlap (Coverage, Cov). Clusters in BAR+, as previously reported, fall into three main categories of annotation, plus clusters with no annotation: PDB [with or without SCOP (*)] and GO and/or Pfam; PDB (*) without GO and/or Pfam; and GO and/or Pfam without PDB (*). Each category can further comprise clusters where GO and Pfam functional annotations are or are not statistically significant (see below). Depending on the categories of annotation in the cluster, and provided that they are statistically validated, all new targets that fall into a cluster can inherit statistically validated annotations by transfer.
To generate BAR+ clusters we analyzed a total of over 13 million protein sequences from 988 genomes and UniProtKB release 2010_05. The BAR+ cluster building pipeline starts with an all-against-all sequence comparison with BLAST in a GRID environment. The alignment results are then regarded as an undirected graph where nodes are proteins and links are allowed only between chains that are at least 40% identical over at least 90% of the alignment length. All the connected nodes fall within the same cluster; when a cluster incorporates a UniProtKB entry, it inherits its annotations (GO and Pfam terms, PDB structures, SCOP classifications). Within a cluster, GO and Pfam terms are statistically validated by means of a procedure that includes P-value evaluation with a Bonferroni correction and an estimate of the significance threshold value after a bootstrapping procedure; validated terms are those endowed with P-values < 0.01. Clusters can contain distantly related proteins, which can therefore be annotated with high confidence and can also inherit a structural template, if present. In BAR+, when PDB templates are present within a cluster, profile HMMs (Hidden Markov Models) are computed on the basis of sequence-to-structure alignment and are cluster associated (Cluster-HMM).
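The cluster-building and validation steps described above can be sketched as follows. This is a minimal illustration and not the BAR+ implementation: the tuple layout of `hits`, the function names, and the placeholder term labels are hypothetical. The text specifies only a P-value with a Bonferroni correction and a 0.01 cutoff, so the one-sided hypergeometric test and the choice of correcting by the number of terms in the cluster are assumptions, and the bootstrap estimate of the significance threshold is not reproduced.

```python
from collections import defaultdict
from math import comb

SI_MIN = 40.0   # percent sequence identity threshold stated in the text
COV_MIN = 90.0  # percent alignment coverage threshold stated in the text


def build_clusters(hits):
    """Connected components of the similarity graph (union-find).

    `hits` holds (protein_a, protein_b, identity_pct, coverage_pct) tuples;
    this layout is an assumption made for the example.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, identity, coverage in hits:
        root_a, root_b = find(a), find(b)  # registers both nodes
        if identity >= SI_MIN and coverage >= COV_MIN and root_a != root_b:
            parent[root_a] = root_b        # link allowed: merge components

    clusters = defaultdict(set)
    for protein in list(parent):
        clusters[find(protein)].add(protein)
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


def hypergeom_sf(k, n_cluster, n_term, n_total):
    """P(X >= k) when n_cluster sequences are drawn from n_total,
    n_term of which carry the term (assumed test, see lead-in)."""
    upper = min(n_cluster, n_term)
    favourable = sum(comb(n_term, i) * comb(n_total - n_term, n_cluster - i)
                     for i in range(k, upper + 1))
    return favourable / comb(n_total, n_cluster)


def validated_terms(cluster_counts, cluster_size, background_counts,
                    n_total, alpha=0.01):
    """Terms whose Bonferroni-corrected P-value falls below alpha."""
    n_tests = len(cluster_counts)  # assumed correction factor
    validated = {}
    for term, k in cluster_counts.items():
        p = hypergeom_sf(k, cluster_size, background_counts[term], n_total)
        p_corrected = min(1.0, p * n_tests)
        if p_corrected < alpha:
            validated[term] = p_corrected
    return validated
```

Because clusters are connected components, two proteins can share a cluster without meeting the thresholds pairwise, as long as a chain of qualifying links joins them. This is how distantly related sequences end up in the same cluster.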
Results and discussion
BAR+ contains clusters with statistically validated annotation
70% of the 13,495,736 sequences of BAR+ are collected in 913,762 clusters (the number of sequences in a cluster ranges from 2 to 87,893). Interestingly, 87% of the clusters contain sequences whose standard deviation of the protein length is ≤ 5 residues. 1.2% of the clusters, containing 23% of the whole set, also contain PDB structures and are endowed with a cluster-specific structural HMM. The remaining 30% of the sequences are singletons, which can still carry along structural and/or functional information.
Within BAR+, inheritance of validated annotation is possible only when a given sequence, after alignment against BAR+, finds a counterpart whose Sequence Identity (SI) is ≥ 40% over at least 90% of the pairwise alignment overlap (Coverage, Cov).
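The inheritance rule above can be illustrated with a short sketch. The `Hit` record, the function names, and the dictionary of per-cluster corrected P-values are hypothetical. Choosing the single best qualifying hit when several clusters match is an assumption made here, not something stated in the text.

```python
from typing import Dict, List, NamedTuple, Optional


class Hit(NamedTuple):
    """One alignment of a target against a clustered sequence (hypothetical)."""
    target: str
    member: str
    cluster_id: str
    identity_pct: float
    coverage_pct: float


def assign_cluster(hits: List[Hit], si_min: float = 40.0,
                   cov_min: float = 90.0) -> Optional[str]:
    """Cluster of the best hit meeting both thresholds, or None."""
    passing = [h for h in hits
               if h.identity_pct >= si_min and h.coverage_pct >= cov_min]
    if not passing:
        return None
    best = max(passing, key=lambda h: (h.identity_pct, h.coverage_pct))
    return best.cluster_id


def inherit_annotation(hits: List[Hit],
                       cluster_terms: Dict[str, Dict[str, float]],
                       si_min: float = 40.0, cov_min: float = 90.0,
                       alpha: float = 0.01) -> dict:
    """Split the terms of the assigned cluster into validated and not validated.

    `cluster_terms` maps cluster id -> {term: corrected P-value}.
    """
    cluster_id = assign_cluster(hits, si_min, cov_min)
    if cluster_id is None:
        return {"cluster": None, "validated": [], "not_validated": []}
    terms = cluster_terms.get(cluster_id, {})
    return {
        "cluster": cluster_id,
        "validated": sorted(t for t, p in terms.items() if p < alpha),
        "not_validated": sorted(t for t, p in terms.items() if p >= alpha),
    }
```

Passing `cov_min=70.0` corresponds to a relaxed coverage requirement, while the defaults reproduce the SI ≥ 40% and Cov ≥ 90% constraint.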
Inheritance of statistically validated annotation
[Table 1: Annotating the CAFA set with BAR+. Table body not recovered; surviving labels: MFO OR BPO; ALL-O OR Pfam.]
To explore the relevance of the alignment length to the annotation system, we decreased the Cov value to ≥ 70% while keeping SI ≥ 40%. In this case the number of annotated CAFA targets increased by only 3% (Table 1), suggesting that the original 90% Cov value together with SI ≥ 40% ensures that most of the CAFA set is already retained within validated clusters.
With our method it is also possible to model distantly related targets that fall into a cluster by aligning them to the template/s in the cluster by means of a cluster HMM, as previously described. In this way about 25% of the CAFA set also inherits a PDB structural template/s (11,935 sequences, Table 1), and about 50% of these targets share a sequence identity with the template structure of the cluster lower than 30% (12.5% of the CAFA set). Concomitantly the sequence also inherits validated Pfam domains and GO terms, and this allows a validation of the functional annotation directly on the computed protein structure.
Comparison with direct UniProtKB annotation
[Table 2: Comparing UniProtKB direct annotation with BAR+ annotation. Table body not recovered; surviving labels: Sequences with validated annotation; Sequences with new validated annotation.]
5,215 clusters are also endowed with a cluster HMM, suitable for aligning a target with the corresponding template/s; through these clusters 11,935 sequences can also inherit a structure (Table 2). Interestingly, 50% of these sequences have a sequence identity to the corresponding template lower than 30%.
BAR+ web site
For the present analysis, BAR+ was updated by distinguishing two sets of clusters: those that are endowed with a statistically validated annotation (labeled with a yellow star), and those that are not statistically validated. A sequence can inherit annotation from a cluster in a statistically validated manner when upon alignment it falls into a statistically validated cluster; however, for a sequence falling into BAR+ clusters the web site also provides all the cluster-associated terms that are not validated. This also holds when the target aligns to BAR+ singletons. Each cluster endowed with PDB templates is also endowed with a cluster HMM based alignment, which allows building the corresponding three-dimensional protein structure for each sequence falling in the cluster. BAR+ is freely available at http://bar.biocomp.unibo.it/bar2.0/.
Functional annotation of protein sequences is one of the most important issues in annotation processes. When annotation is done electronically, mainly based on sequence similarity search, a robust validation process can help in the inheritance of Pfam and GO terms by transfer of annotation. Using our cluster-centric BAR+ annotation system and adopting as a test case the recently released CAFA set of sequences, we can annotate 84.9% of the CAFA set, and 77.7% of the set in a validated manner.
UniProtKB annotates 77.1% of the CAFA set with GO and Pfam terms (Table 2). By comparison, we validate 10,628 terms for 62.9% of the sequences, we increase the annotation for 7.6% of the set with 2,930 additional validated terms, and we annotate without validation the remaining 6.6% of the set.
Considering also that 7.2% of the CAFA set is newly annotated with validation, the gain in annotation within BAR+ is 14.8% with respect to UniProtKB, suggesting again that cluster specificity for a sequence is a necessary filter to inherit functional and structural features from well known proteins.
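The percentages quoted in this section can be cross-checked with a few lines of arithmetic. The variable names are illustrative, and reading the 7.2% as sequences that lack GO and Pfam terms in UniProtKB is our interpretation of "newly annotated".

```python
# Percentages of the CAFA set as quoted in the text.
validated_existing = 62.9   # UniProtKB annotation validated in BAR+
extended_validated = 7.6    # annotated sequences gaining extra validated terms
not_validated = 6.6         # annotated, but without validation
newly_validated = 7.2       # newly annotated with validation (interpretation)

# Share annotated by UniProtKB with GO and Pfam terms.
uniprot_annotated = round(validated_existing + extended_validated + not_validated, 1)

# Gain of BAR+ over UniProtKB: extended plus newly annotated sequences.
gain = round(extended_validated + newly_validated, 1)

# Share of the set carrying validated annotation.
validated_total = round(validated_existing + extended_validated + newly_validated, 1)

print(uniprot_annotated, gain, validated_total)
```

The three sums agree with the 77.1%, 14.8% and 77.7% figures reported in the text.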
Furthermore, we can endow some 25% of the whole CAFA set with structural models. About 50% of the proteins that inherit a structural model in BAR+ share less than 30% sequence identity with the template/s, indicating that with our procedure distantly related homologs can also be safely annotated.
RC thanks the following grants: PRIN 2009 project 009WXT45Y (Italian Ministry for University and Research: MIUR), COST BMBS Action TD1101 (European Union RTD Framework Programme), and PON project PON01_02249 (Italian Ministry for University and Research: MIUR). DP is a recipient of a PhD fellowship from the Ministry of the Italian University and Research. GP is a recipient of a research contract from CIRI.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 3, 2013: Proceedings of Automated Function Prediction SIG 2011 featuring the CAFA Challenge: Critical Assessment of Function Annotations. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S3.
- Lesk AM: Introduction to Bioinformatics. 2008, Oxford: Oxford University Press, 3rd edition.
- Loewenstein Y, Raimondo D, Redfern OC, Watson J, Frishman D, Linial M, Orengo C, Thornton J, Tramontano A: Protein function annotation by homology-based inference. Genome Biology. 2009, 10: 207. 10.1186/gb-2009-10-2-207.
- Petryszak R, Kretschmann E, Wieser D, Apweiler R: The predictive power of the CluSTr database. Bioinformatics. 2005, 21: 3604-3609. 10.1093/bioinformatics/bti542.
- Kaplan N, Sasson O, Inbar U, Friedlich M, Fromer M, Fleischer H, Portugaly E, Linial N, Linial M: ProtoNet 4.0: a hierarchical classification of one million protein sequences. Nucleic Acids Research. 2005, 33: D216-D218.
- Chothia C, Lesk AM: The relation between the divergence of sequence and structure in proteins. EMBO J. 1986, 5: 823-826.
- Rost B: Twilight zone of protein sequence alignments. Protein Eng. 1999, 12: 85-94. 10.1093/protein/12.2.85.
- Sánchez R, Pieper U, Melo F, Eswar N, Martí-Renom MA, Madhusudhan MS, Mirković N, Sali A: Protein structure modeling for structural genomics. Nat Struct Biol. 2000, 7: 986-990.
- Osadchy M, Kolodny R: Maps of protein structure space reveal a fundamental relationship between protein structure and function. Proc Natl Acad Sci USA. 2011, 108: 12301-6. 10.1073/pnas.1102727108.
- Rost B: Enzyme function less conserved than anticipated. J Mol Biol. 2002, 318: 595-608. 10.1016/S0022-2836(02)00016-5.
- Tian W, Skolnick J: How well is enzyme function conserved as a function of pairwise sequence identity?. J Mol Biol. 2003, 333: 863-882. 10.1016/j.jmb.2003.08.057.
- Dietmann S, Fernandez-Fuentes N, Holm L: Automated detection of remote homology. Curr Opin Struct Biol. 2002, 12: 362-367. 10.1016/S0959-440X(02)00332-9.
- Fariselli P, Rossi I, Capriotti E, Casadio R: The WWWH of remote homolog detection: the state of the art. Brief Bioinform. 2007, 8: 78-87.
- Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunesekaran P, Ceric G, Forslund K, Holm L, Sonnhammer EL, Eddy SR, Bateman A: The Pfam protein families database. Nucleic Acids Res. 2010, 38: D211-222. 10.1093/nar/gkp985.
- de Lima Morais DA, Fang H, Rackham OJ, Wilson D, Pethica R, Chothia C, Gough J: SUPERFAMILY 1.75 including a domain-centric gene ontology method. Nucleic Acids Res. 2011, 39: D427-34. 10.1093/nar/gkq1130.
- The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
- The UniProt Consortium: Ongoing and future developments at the Universal Protein Resource. Nucleic Acids Res. 2011, 39: D214-D219.
- Clark WT, Radivojac P: Analysis of protein function and its prediction from amino acid sequence. Proteins. 2011, 79: 2086-96. 10.1002/prot.23029.
- Rentzsch R, Orengo CA: Protein function prediction--the power of multiplicity. Trends Biotechnol. 2009, 27: 210-9. 10.1016/j.tibtech.2009.01.002.
- Bartoli L, Montanucci L, Fronza R, Martelli PL, Fariselli P, Carota L, Donvito G, Maggi G, Casadio R: The Bologna Annotation Resource: a non-hierarchical method for the functional and structural annotation of protein sequences relying on a comparative large-scale genome analysis. J Proteome Res. 2009, 8: 4362-4371. 10.1021/pr900204r.
- Piovesan D, Martelli PL, Fariselli P, Zauli A, Rossi I, Casadio R: BAR-PLUS: the Bologna Annotation Resource Plus for functional and structural annotation of protein sequences. Nucleic Acids Res. 2011, 39: W197-W202. 10.1093/nar/gkr292.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.