AutoFACT: An Automatic Functional Annotation and Classification Tool
© Koski et al; licensee BioMed Central Ltd. 2005
Received: 02 March 2005
Accepted: 16 June 2005
Published: 16 June 2005
Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets.
We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, GeneOntology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated at only 1–2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%.
AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at http://megasun.bch.umontreal.ca/Software/AutoFACT.htm.
Automatic functional annotation is essential for high-throughput sequencing projects. Typically, large datasets undergo annotation by means of "annotation jamborees", where groups of experts are assigned to manually annotate a designated portion of an organism's genome. More recently, various tools have become available to streamline this process [1–9]. However, these tools have limitations: many require web-submission of data, need substantial manual intervention [1, 4], supply only a single output format, are part of a large sequence analysis package and, most importantly, do not combine a broad range of information resources. To address these shortcomings, we developed a new annotation pipeline, which we term "AutoFACT".
Unique to AutoFACT is its hierarchical filtering system for determining the most informative functional annotation. This paper describes AutoFACT's functional assignment capabilities, outlining the procedure for annotating unknown nucleotide or protein sequence data. We assess the validity of AutoFACT by comparing its annotations with the existing annotations of four phylogenetically diverse organisms, including human, yeast and both eukaryotic and bacterial pathogens. AutoFACT has been applied to the EST sequencing project of Acanthamoeba castellanii, a free-living soil amoeba and opportunistic human pathogen. This example highlights AutoFACT's performance, which yields a ~50% increase in functional annotations over a top-BLAST-hit approach against NCBI's non-redundant database or against UniProt's expert-annotated UniRef90 database.
AutoFACT is a command-line-driven program written in PERL for LINUX/UNIX operating systems. It uses BioPerl modules to parse and analyze BLAST reports. Average annotation time is 2.5 hours for 5000 sequences of approximately 500 bp in length on a desktop workstation (BLAST time not included). A web version of AutoFACT is available where users can submit up to 10 sequences at a time for annotation. For large sequencing projects, it is recommended that the user download and install the local version of AutoFACT.
AutoFACT annotation classes
Hit to LSU or SSU rRNA database
Hit to UniRef, nr, KEGG and/or COG
Hit is informative
Hits share common informative terms
Hit to Pfam or Smart
Hit to est_others
"[Functionally Annotated] protein"
"[Domain name]-containing protein"
Databases searched and classification information assigned by AutoFACT
European Ribosomal Database: large subunit (LSU) ribosomal RNAs, small subunit (SSU) ribosomal RNAs
UniProt's UniRef 90: GeneOntology terms, Enzyme Commission numbers, locus names
Clusters of Orthologous Groups (COG)
Kyoto Encyclopedia of Genes and Genomes (KEGG): metabolic pathways, Enzyme Commission numbers, locus names
Protein Families Database (Pfam)
NCBI's non-redundant database (nr)
NCBI's est_others database
Figure 1 outlines the AutoFACT methodology. When analyzing nucleotide data, AutoFACT begins by using BLAST to search the nucleotide sequences in the input file against the set of user-specified databases. If a match to the rRNA dataset is found with a minimum match length and percent sequence identity (default: 50 bp and 84% identity), the sequence is classified as a "ribosomal RNA". If no match is found, the sequence is then searched against the remaining set of user-specified databases. In step 2 (or step 1 for protein data), description lines of significant hits, based on a user-specified bit score cutoff (default <40), are examined for the presence of functionally uninformative terms such as 'hypothetical', 'unknown', 'chromosome', etc. When a hit contains an uninformative term, the next-best hit is examined, and so on, until a description line without uninformative terms is found, e.g. 'proton-transporting ATP synthase'. The user specifies the number of top BLAST hits the program should filter. In step 3, a search for common terms among the informative hits from each database is performed. For annotation transfer, the user specifies a database order of importance so that informative terms from the first database are searched against informative terms from the remaining databases in a given order. For example, if the user specifies the database order as UniRef90, nr, KEGG and COG, informative terms in the informative hit from UniRef90 are first searched for matches to the informative hits from the other databases. If a match is found between at least one informative term from the UniRef90 hit and at least one other informative database hit (e.g., 'proton-transporting ATP synthase' matches 'H+-pumping ATP synthase'), the description line of the UniRef90 hit is assigned to the input sequence.
If there are no matches to UniRef90 terms, the informative terms from the informative hit of the next database (nr, in this example) are then queried in the same way as above, until a functionally informative description line has been assigned to the sequence.
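Steps 2 and 3 can be illustrated with a short Python sketch. This is not AutoFACT's PERL code: the function names, the word-level tokenization, the GENERIC word list, the default of ten hits and the reading of the bit score cutoff as "ignore hits scoring below 40" are all assumptions made for illustration, and the stop list contains only the three uninformative terms named in the text.

```python
import re

# Stop list: the text names only these three terms and implies more ("etc.").
UNINFORMATIVE = {"hypothetical", "unknown", "chromosome"}
# Assumption: words too generic to count as a shared informative term.
GENERIC = {"protein", "related", "cluster", "putative"}


def terms(description):
    """Lower-case word set of a description line, ignoring [organism] tags."""
    text = re.sub(r"\[[^\]]*\]", " ", description.lower())
    return set(re.findall(r"[a-z0-9+]+", text))


def first_informative(hits, bit_cutoff=40.0, max_hits=10):
    """Step 2: return the best-scoring description without uninformative terms.

    `hits` is a list of (description, bit_score) pairs, best hit first.
    Only the top `max_hits` hits are examined (hypothetical default).
    """
    for description, bits in hits[:max_hits]:
        if bits >= bit_cutoff and not terms(description) & UNINFORMATIVE:
            return description
    return None


def assign_description(informative_by_db, db_order):
    """Step 3: walk the databases in order of importance and return the first
    informative description that shares a term with a later database's hit."""
    for i, db in enumerate(db_order):
        description = informative_by_db.get(db)
        if description is None:
            continue
        own_terms = terms(description) - GENERIC
        for other_db in db_order[i + 1:]:
            other = informative_by_db.get(other_db)
            if other and own_terms & (terms(other) - GENERIC):
                return db, description
    return None


# Made-up BLAST hits for one query sequence.
blast = {
    "UniRef90": [("Hypothetical protein related cluster", 210.0),
                 ("Proton-transporting ATP synthase related cluster", 198.0)],
    "nr": [("H+-pumping ATP synthase [Escherichia coli]", 190.0)],
}
order = ["UniRef90", "nr", "KEGG", "COG"]
informative = {db: first_informative(hits) for db, hits in blast.items()}
print(assign_description(informative, order))
```

With these made-up hits, the top UniRef90 hit is skipped as uninformative and the second UniRef90 description is selected, because it shares the words 'atp' and 'synthase' with the nr hit. A return value of None corresponds to the case handled in step 4.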
Database description line formats from ACL00000101 BLAST hits
ATP synthase beta chain related cluster
ATP synthase subunit beta [Salmonella typhimurium]
ATP synthase beta chain [Erwinia carotovora subsp. atroseptica SCRI1043] emb|CAG77407.1| ATP synthase beta chain [Erwinia carotovora subsp. atroseptica SCRI1043]
atpD; membrane-bound ATP synthase, F1 sector, beta-subunit [EC:126.96.36.199] [KO:K02112]
[C] COG0055 F0F1-type ATP synthase, beta subunit
AutoFACT proceeds to step 4 when there are no common informative terms between any of the databases, or when only uninformative hits are found. In this step, a sequence with significant similarity to one or more sequences in the Pfam or SMART databases is classified as a '[domain name]-containing protein' or a 'multi-domain-containing protein'. A sequence containing no domains is simply classified as an 'unassigned protein'.
A sequence is also classified as a '[domain name]-containing protein' when the only significant hit is to a domain database. It is considered 'unclassified' when no hits are found to any of the specified databases. When EST sequences are being annotated, the last step in the annotation pipeline is to check the sequence against NCBI's est_others database. If a significant match is found, the sequence is classified as an 'unknown EST'; otherwise it remains 'unclassified'.
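The order in which the annotation classes are tried can be summarized as a small decision function. The Python sketch below is only one reading of the description above, not AutoFACT's actual code: the function name and arguments are hypothetical, and it assumes that 'multi-domain-containing protein' applies when more than one distinct domain is found.

```python
def classify(rrna_match=None, description=None, domains=(),
             has_protein_hit=False, is_est=False, est_others_hit=False,
             min_length=50, min_identity=84.0):
    """Return the annotation class for one sequence.

    rrna_match      -- (match_length_bp, percent_identity) of the best rRNA hit
    description     -- informative description line from step 3, if any
    domains         -- names of Pfam/SMART domains with significant hits
    has_protein_hit -- True if UniRef, nr, KEGG or COG gave any significant hit
    """
    # Step 1: rRNA check, using the default thresholds quoted in the text.
    if rrna_match is not None:
        length, identity = rrna_match
        if length >= min_length and identity >= min_identity:
            return "ribosomal RNA"
    # Steps 2-3: an informative description shared across databases.
    if description is not None:
        return description
    # Step 4: fall back to domain hits.
    if len(domains) == 1:
        return f"{domains[0]}-containing protein"
    if len(domains) > 1:
        return "multi-domain-containing protein"
    if has_protein_hit:
        return "unassigned protein"
    # Final check for ESTs with no hit to any other database.
    if is_est and est_others_hit:
        return "unknown EST"
    return "unclassified"


print(classify(rrna_match=(120, 91.5)))
print(classify(domains=["BRCT"], has_protein_hit=True))
print(classify(is_est=True, est_others_hit=True))
```

Keeping the checks in a single ordered cascade makes the precedence explicit: a domain hit is consulted only after the description-based steps have failed, and the est_others check only after every other database has come up empty.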
In step 5, functionally annotated sequences are then classified according to KEGG pathways, COG functional groups, Enzyme Commission (EC) numbers, GeneOntology (GO) terms and locus names. Putative KEGG pathways are assigned if an informative term from the automatically assigned description line matches a term in the informative KEGG hit. The same reasoning is used to assign putative COG functional categories. EC numbers are assigned in one of two ways, either from parsing the KEGG description line or by mapping the accession number of the informative UniRef hit to an enzyme via ExPASy's enzyme.dat file. GO terms are assigned by mapping the UniRef accession number of the informative hit via the gene_association.goa_uniprot file.
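Two of these look-ups are easy to sketch in Python. The example below is illustrative and is not taken from AutoFACT: it assumes that a KEGG description line carries its EC numbers in an '[EC:...]' tag, and that the GO association file is tab-delimited with the protein accession in the second column and the GO identifier in the fifth, as in the standard gene association format. The function names, the description line and the accessions are made up.

```python
import re
from collections import defaultdict

EC_TAG = re.compile(r"\[EC:([^\]]+)\]")


def ec_numbers_from_kegg(description):
    """Return the EC numbers listed in the [EC:...] tag of a KEGG description."""
    match = EC_TAG.search(description)
    return match.group(1).split() if match else []


def load_go_terms(lines):
    """Map accession -> set of GO identifiers from gene-association lines."""
    go_terms = defaultdict(set)
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 4:
            go_terms[fields[1]].add(fields[4])
    return go_terms


# Illustrative inputs; the accession and GO identifiers are placeholders.
kegg_line = "adh; alcohol dehydrogenase [EC:1.1.1.1] [KO:K00001]"
goa_lines = [
    "!gaf-version: 1.0",
    "UniProt\tQ00000\tEXAMPLE\t\tGO:0000001\tPMID:0\tIEA",
    "UniProt\tQ00000\tEXAMPLE\t\tGO:0000002\tPMID:0\tIEA",
]

print(ec_numbers_from_kegg(kegg_line))
print(sorted(load_go_terms(goa_lines)["Q00000"]))
```

In practice the lines would be read from the downloaded association file, and the UniRef cluster identifier of the informative hit would first have to be reduced to a plain accession number.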
Comparison of Human Ensembl annotations to AutoFACT revealed no significant differences in annotation assignments. There were 2/200 (1%) sequences that AutoFACT annotated as 'unassigned protein', either because the only BLAST hits were to other human sequences or because the informative terms could not be matched across database sources. Had we been less strict in our annotation criteria and considered hits to the same species as informative, AutoFACT would then have assigned the same annotations as Ensembl to these two sequences. The high similarity between annotation results is primarily due to the fact that the source of most of the Ensembl annotations is UniProt/SWISSPROT, which AutoFACT also uses via UniRef90, the database of highest importance in the AutoFACT database order.
Differences found between AutoFACT and PEDANT annotations for Saccharomyces cerevisiae
AutoFACT % Identity
vacuolar aspartic protease
GON1; possible rho-like GTPase involved in secretory vesicle transport
SSZ1 – regulator protein involved in pleiotropic drug resistance
INM1 – inositol-1(or 4)-monophosphatase
*Protein qutG related cluster
DSE2 – glucan 1,3-beta-glucosidase activity
ECM34 – involved in cell wall biogenesis and architecture
DUP domain-containing protein
SPC72 – Stu2p Interactant
*Repeat organellar protein related cluster
THP2 – subunit of the THO complex, which appears to functionally connect transcription elongation with mitotic recombination
*Myosin heavy chain related cluster
RTT107 – Establishes Silent Chromatin
BRCT domain-containing protein
OPI1 – negative regulator of phospholipid biosynthesis pathway
UTP9 – U3 snoRNP protein
Borrelia_orfA domain-containing protein
Differences found between AutoFACT and TIGR preliminary annotations for Plasmodium falciparum
TIGR Preliminary Annotation
AutoFACT % Identity
PF14_0675 reticulocyte binding protein 2 homolog B, putative Reticulocyte Binding protein;
PF14_0655 RNA helicase-1, putative
Eukaryotic translation initiation factor 4A related cluster
PF14_0530 ferlin, putative
heat shock protein DNAJ pfj4
PF14_0112 POM1, putative
Twinkle related cluster
PF14_0078 HAP protein
Asp domain-containing protein
PF14_0036 acid phosphatase, putative
Metallophos domain-containing protein
PF14_0015 aminopeptidase, putative
hydrolase, alpha/beta fold family
PF14_0382 metalloendopeptidase, putative
Differences found between AutoFACT and GeneQuiz annotations for Rickettsia prowazekii
Gene Quiz Annotation
AutoFACT % Identity
PKM101 CONJUGATION PROTEINS (TRAL), (TRAM), (TRAA), (TRAB), (TRAC), (TRAB), (TRAC), (TRAD), (TRAN), (TRAE), (TRAO), (TRAF), (TRAG), ENTRY EXCLUSION PROTEIN (EEX), (KIKA), (KORB), (KORA) AND ENDONUCLEASE (NUC) GENES, COMPLETE CDS (TRAM) (TRAB) (TRAB) (TRA
VIRB4 PROTEIN related cluster
NEMPA PROTEIN PRECURSOR.
Aspartyl/glutamyl-tRNA(Asn/Gln) amidotransferase subunit B related cluster
D-STEREOSPECIFIC PEPTIDE HYDROLASE PRECURSOR.
Penicillin binding protein 4* related cluster
NADH-UBIQUINONE OXIDOREDUCTASE CHAIN 2 (EC 188.8.131.52).
Heme exporter protein B related cluster
NADH DEHYDROGENASE SUBUNIT 2.
*HyfB domain-containing protein related cluster
VIRB8 PROTEIN related cluster
CONJUGAL TRANSFER PROTEIN TRBI.
VIRB10 PROTEIN related cluster
CONJUGAL TRANSFER PROTEIN TRAG.
VIRD4 PROTEIN related cluster
LPS BIOSYNTHESIS RFBU RELATED PROTEIN.
*Glycosyltransferase related cluster
Case Study: Acanthamoeba castellanii
AutoFACT is currently used by the Protist EST Program (PEP), a pan-Canadian genomics initiative involving investigators at six Canadian universities. The objective of PEP is to survey, through EST sequencing, the expressed portions of the genomes of a phylogenetically comprehensive selection of protists (30–40 of these mostly unicellular eukaryotes).
AutoFACT annotations for each organism mentioned above can be viewed at http://megasun.bch.umontreal.ca/Software/AutoFACT.htm
To efficiently and fully exploit the wealth of sequence data currently available, thorough and informative functional annotations are paramount. Considering the ever-growing number of EST sequencing projects, it becomes increasingly important to fully automate the annotation process and to make optimal use of the various available annotation resources and databases. Because no two annotation systems are exactly alike, choice of system is very much dependent on the user's end goal.
AutoFACT uses a hierarchical filtering system for determining the most informative functional annotation. It provides a means of classification by identifying EC numbers, KEGG pathways, COG functional classes and GeneOntology terms. AutoFACT supplies three different output formats and a log file, which are versatile and adaptable to user requirements. Importantly, it allows users to maintain data locally, whereas many other systems require sequence submission elsewhere for annotation. By combining multiple resources, AutoFACT associates sequences with a broad range of biological classifications and has proven to be very powerful for annotating both EST and protein sequence data. The A. castellanii case study shows that in comparison to the 'quick and easy' top-BLAST-hit approach against either NCBI's nr or UniProt's UniRef databases, AutoFACT substantially improves functional annotations of sequence data. Comparisons to other well-established annotation pipelines show that AutoFACT performs as well as, and in some cases better than, the alternatives. We have also demonstrated that AutoFACT exhibits an equivalent level of performance (1–2% error rate) when it is used to annotate sequences across different domains of life.
Finally, we caution that over-prediction is common when using sequence similarity to infer protein function. Examples of similar sequences that do not share the same or even related functions have been documented. Automatic annotations therefore may require further validation in certain cases.
Availability and requirements
Project name: AutoFACT
Project homepage: http://megasun.bch.umontreal.ca/Software/AutoFACT.htm
Operating system(s): LINUX/UNIX
Programming language: PERL
Other requirements: BioPerl and BLAST
License: GNU General Public License (GPL)
Any restrictions to use by non-academics: None
This work has been conducted in the context of the Protist EST Program (PEP) and is supported by Genome Canada/Atlantic/Quebec. We thank Eric Wang and Pierre Rioux for their suggestions and script testing. Thank you to Amy Hauth for her critique of the manuscript and ever-helpful discussions, to Emmet O'Brien and Beatrice Roure for their useful feedback and to BioneQ for access to their high-performance computer cluster. Computer resources financed by a grant from the Canadian Institutes of Health Research (CIHR, grant # MOP15331) have also been used in this work.
- Almeida LG, Paixao R, Souza RC, Da Costa GC, Barrientos FJ, Dos Santos MT, De Almeida DF, Vasconcelos AT: A system for automated bacterial (genome) integrated annotation--SABIA. Bioinformatics 2004.
- Ayoubi P, Jin X, Leite S, Liu X, Martajaja J, Abduraham A, Wan Q, Yan W, Misawa E, Prade RA: PipeOnline 2.0: automated EST processing and functional data sorting. Nucleic Acids Res 2002, 30: 4761–4769. 10.1093/nar/gkf585
- Buerstedde JM, Prill F: FOUNTAIN: a JAVA open-source package to assist large sequencing projects. BMC Bioinformatics 2001, 2: 6. 10.1186/1471-2105-2-6
- Wyman SK, Jansen RK, Boore JL: Automatic annotation of organellar genomes with DOGMA. Bioinformatics 2004, 20: 3252–3255. 10.1093/bioinformatics/bth352
- Andrade MA, Brown NP, Leroy C, Hoersch S, de Daruvar A, Reich C, Franchini A, Tamames J, Valencia A, Ouzounis C, Sander C: Automated genome sequence analysis and annotation. Bioinformatics 1999, 15: 391–412. 10.1093/bioinformatics/15.5.391
- Abascal F, Valencia A: Automatic annotation of protein function based on family identification. Proteins 2003, 53: 683–692. 10.1002/prot.10449
- Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Res 2004, 14: 988–995. 10.1101/gr.1865504
- Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M: The Ensembl automatic gene annotation system. Genome Res 2004, 14: 942–950. 10.1101/gr.1858004
- Moller S, Leser U, Fleischmann W, Apweiler R: EDITtoTrEMBL: a distributed approach to high-quality automated protein sequence annotation. Bioinformatics 1999, 15: 219–227. 10.1093/bioinformatics/15.3.219
- bioperl.org. www.bioperl.org
- NCBI BLAST. www.ncbi.nih.gov/BLAST
- NCBI Education BLAST info Glossary. www.ncib.nlm.nih.gov/Education/BLASTinfo/glossary2.html
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410. 10.1006/jmbi.1990.9999
- Barrett AJ: Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). Enzyme Nomenclature. Recommendations 1992. Supplement 4: corrections and additions (1997). Eur J Biochem 1997, 250: 1–6. 10.1111/j.1432-1033.1997.0269a.x
- Gasteiger E, Gattiker A, Hoogland C, Ivanyi I, Appel RD, Bairoch A: ExPASy: The proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res 2003, 31: 3784–3788. 10.1093/nar/gkg563
- Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32 Database issue: D262–6. 10.1093/nar/gkh021
- GFF: An exchange format for Feature Description. www.sanger.ac.uk/Software/GFF
- Frishman D, Albermann K, Hani J, Heumann K, Metanomski A, Zollner A, Mewes HW: Functional and structural genomics using PEDANT. Bioinformatics 2001, 17: 44–57. 10.1093/bioinformatics/17.1.44
- Frishman D, Mokrejs M, Kosykh D, Kastenmuller G, Kolesov G, Zubrzycki I, Gruber C, Geier B, Kaps A, Albermann K, Volz A, Wagner C, Fellenberg M, Heumann K, Mewes HW: The PEDANT genome database. Nucleic Acids Res 2003, 31: 207–211. 10.1093/nar/gkg005
- Gardner MJ, Hall N, Fung E, White O, Berriman M, Hyman RW, Carlton JM, Pain A, Nelson KE, Bowman S, Paulsen IT, James K, Eisen JA, Rutherford K, Salzberg SL, Craig A, Kyes S, Chan MS, Nene V, Shallom SJ, Suh B, Peterson J, Angiuoli S, Pertea M, Allen J, Selengut J, Haft D, Mather MW, Vaidya AB, Martin DM, Fairlamb AH, Fraunholz MJ, Roos DS, Ralph SA, McFadden GI, Cummings LM, Subramanian GM, Mungall C, Venter JC, Carucci DJ, Hoffman SL, Newbold C, Davis RW, Fraser CM, Barrell B: Genome sequence of the human malaria parasite Plasmodium falciparum. Nature 2002, 419: 498–511. 10.1038/nature01097
- Andersson SG, Zomorodipour A, Andersson JO, Sicheritz-Ponten T, Alsmark UC, Podowski RM, Naslund AK, Eriksson AS, Winkler HH, Kurland CG: The genome sequence of Rickettsia prowazekii and the origin of mitochondria. Nature 1998, 396: 133–140. 10.1038/24094
- GeneQuiz; Rickettsia prowazekii. http://jura.ebi.ac.uk:8765/ext-genequiz/genomes/rp0006
- PEPdb Pub; The Protist EST Database. http://amoebidia.bcm.umontreal.ca/public/pepdb/agrm.php
- Kurosky A BDRLTHTBHREAMSBBHFWM: Covalent structure of human haptoglobin: a serine protease homolog. Proc Natl Acad Sci U S A 1980, 77: 3388–3392.
- Wuyts J, Perriere G, Van De Peer Y: The European ribosomal RNA database. Nucleic Acids Res 2004, 32: D101–3. 10.1093/nar/gkh065
- Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 2004, 32 Database issue: D115–9. 10.1093/nar/gkh131
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4: 41. 10.1186/1471-2105-4-41
- Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278: 631–637. 10.1126/science.278.5338.631
- Kanehisa M: A database for post-genome analysis. Trends Genet 1997, 13: 375–376. 10.1016/S0168-9525(97)01223-7
- Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 2000, 28: 27–30. 10.1093/nar/28.1.27
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res 2004, 32 Database issue: D138–41. 10.1093/nar/gkh121
- Schultz J, Milpetz F, Bork P, Ponting CP: SMART, a simple modular architecture research tool: identification of signaling domains. Proc Natl Acad Sci U S A 1998, 95: 5857–5864. 10.1073/pnas.95.11.5857
- National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.