Similarity-based gene detection: using COGs to find evolutionarily-conserved ORFs
© Powell and Hutchison; licensee BioMed Central Ltd. 2006
Received: 03 August 2005
Accepted: 19 January 2006
Published: 19 January 2006
Experimental verification of gene products has not kept pace with the rapid growth of microbial sequence information. However, existing annotations of gene locations contain sufficient information to screen for probable errors. Furthermore, comparisons among genomes become more informative as more genomes are examined. We studied all open reading frames (ORFs) of at least 30 codons from the genomes of 27 sequenced bacterial strains. We grouped the potential peptide sequences encoded from the ORFs by forming Clusters of Orthologous Groups (COGs). We used this grouping in order to find homologous relationships that would not be distinguishable from noise when using simple BLAST searches. Although COG analysis was initially developed to group annotated genes, we applied it to the task of grouping anonymous DNA sequences that may encode proteins.
"Mixed COGs" of ORFs (clusters in which some sequences correspond to annotated genes and some do not) are attractive targets when seeking errors of gene predicion. Examination of mixed COGs reveals some situations in which genes appear to have been missed in current annotations and a smaller number of regions that appear to have been annotated as gene loci erroneously. This technique can also be used to detect potential pseudogenes or sequencing errors. Our method uses an adjustable parameter for degree of conservation among the studied genomes (stringency). We detail results for one level of stringency at which we found 83 potential genes which had not previously been identified, 60 potential pseudogenes, and 7 sequences with existing gene annotations that are probably incorrect.
Systematic study of sequence conservation offers a way to improve existing annotations by identifying potentially homologous regions where the annotation of the presence or absence of a gene is inconsistent among genomes.
The rapidly growing amount of genomic sequence information necessitates tools for its annotation. Although predicting bacterial genes is in many ways simpler than predicting eukaryotic genes, it is clear that there remains room for improvement in the bacterial case. Several groups have undertaken efforts to re-annotate specific genomes [1–3], often finding a small but significant number of errors in existing annotation of gene loci. The presence of these errors has motivated the effort of some groups to systematically revise the gene annotations in public databases as a continuous process [4, 5].
Because technology for genome sequencing is much more mature than proteomic analysis, only a small fraction of annotated bacterial gene products have been detected as protein; most have been annotated using only computational methods. Although methods for detecting and identifying all proteins in a cell are being developed [3, 6, 7] and incorporated into annotations of newly-sequenced genomes , these techniques are currently limited by the ability to express all of the polypeptides in an organism and separate them into fractions with low enough complexity for analysis. It is still useful to refine our computational predictions so that we can make targeted searches for potential proteins.
Accuracy of gene identification is particularly important in studies of the gene content of a genome as a whole. Studies of phyletic patterns of gene presence , the extent of horizontal gene transfer among genomes, the entire set of protein structures encoded by a genome , and the components of a "minimal genome" [11, 12] are all predicated on an accurate catalog of the genes within an organism. Because these studies involve comparing the presence or absence of genes among several organisms, it is particularly important that all of the genes present be identified. Insights in these areas of study could impact our understanding of bacterial evolution physiology and pathogenicity. As an example, in the initial report of the Mycoplasma mobile genome sequence the correlation of presence or absence of certain genes with a presence or absence of a specific phenotypic characteristic (motility) among nine species was used to suggest genes which might confer that phenotype .
Methods for predicting protein-coding genes are often divided into intrinsic and extrinsic classes [13, 14]. Intrinsic methods only use evidence from within the primary sequence of a genome. This evidence may include i) the presence of a relatively long frame uninterrupted by a stop codon, ii) the statistical pattern of polynucleotide stretches that match the typical frequencies present in other coding regions of the organism and iii) the existence of appropriate non-coding control elements. It may be difficult to identify some small genes using the first two types of evidence; small genes can be difficult to distinguish from open reading frames that occur by chance, and in such short regions, sequence characteristics may be affected stochastically. Gene finding methods which use sequence characteristics or control elements often need to be tuned for the specific organism studied, and in many cases several statistical models of coding regions may need to be developed in a single organism .
Extrinsic methods use information from comparisons of genomes. These analyses originally used simple pairwise comparisons among potential protein-coding regions. Harrison et al.  examined ORFs of 15 or more codons in 65 microbial genome sequences, using BLAST E-value of less than 10-4 to indicate similarity suggestive of conserved function. Other searches use sequence alignments of a protein family as a query against all possible translations of the genome of interest. This alignment may be specified beforehand (e.g. using a Pfam protein family) or developed as part of the search as by PSI-BLAST . Pair hidden Markov models use a pairwise sequence alignment coupled to a hidden Markov model to more precisely determine the amino termini of protein-coding genes . Programs have been developed that use pairwise alignment of syntenic regions to predict gene structure in eukaryotes . The ratio of synonymous to nonsynonymous substitutions between pairs of putative genes can be used to examine whether there is selection for protein-coding function, but this requires sequences from closely related organisms . The majority of bacterial genome annotations have used intrinsic methods at least initially to predict the presence of genes. Extrinsic gene prediction methods serve as a useful complement to intrinsic methods because independent information is used to make the same prediction.
In this study we describe a systematic extension to examining similarity shared among several genomes using a modification of the analysis of Clusters of Orthologous Groups (COGs). COGs were developed to cluster annotated genes into functionally related groups in order to facilitate the transfer of functional annotations among organisms . Here we use COGs to cluster open reading frames as a means of recognizing genes. An advantage of the COG analysis is that no explicit threshold for sequence similarity is used; genes that are missed in pairwise comparisons may be detected. Because the focus is on annotation of gene location rather than gene function, we are not concerned with finding genes that are strictly orthologous. Evidence of homology to another gene is sufficient to imply that a region is a gene, and we make no effort to avoid the joining of COGs which may occur due to a gene fusion . The idea of stringency  of COGs expands upon the initial COG definition by requiring an adjustable of connectedness for grouping genes. As the number of studied organisms increases, the stringency can further filter some similarities which may have occurred by chance. ORFs that do not correspond to annotated genes but that nonetheless have conserved sequences present in several genomes are likely to be protein-coding genes that have been missed by current annotations.
A cluster of ORFs can be examined with regards to its multiple sequence alignment, the network of similarities among the ORFs, and the respective genomic contexts of the ORFs in the cluster. These characteristics of ORF clusters can be used to screen existing gene predictions for potential errors. The extrinsic nature of the use of COGs of ORFs is complementary to the intrinsic methods that have been used in producing the majority of gene loci annotations. Because our strategy has different strengths and weaknesses from the intrinsic methods, it may be expected to pick up some genes that other methods have missed.
Genomes included in this study
# of genes annotateda
# of ORFsb >30 aa
Bacillus halodurans strain C-125
Bacillus subtilis subsp. subtilis str. 168
Borrelia burgdorferi str. B31
Chlamydia trachomatis strain D/UW-3/CX
Chlamydophila pneumoniae CWL029
Clostridium acetobutylicum strain ATCC 824
Clostridium perfringens str. 13
Escherichia coli K12
Escherichia coli O157:H7 EDL933
Haemophilus influenzae Rd KW20
Helicobacter pylori J99
Mycoplasma gallisepticum str. R
Mycoplasma genitalium strain G-37
Mycoplasma mobile strain 163K
Mycoplasma mycoides subsp. mycoides SC
Mycoplasma penetrans strain HF-2
Mycoplasma pneumoniae strain M129
Mycoplasma pulmonis (Sabin 1941) Freundt 1955
Neisseria meningitidis serogroup B strain MC58
Pseudomonas aeruginosa str. PAO1
Rickettsia prowazekii da Rocha-Lima 1916
Salmonella enterica subsp. enterica serovar Typhi
Streptococcus pneumoniae str. R6
Treponema pallidum subsp. pallidum str. Nichols
Ureaplasma urealyticum biovar 2
Vibrio cholerae serotype O1 biotype ElTor strain N16961
Vibrio cholerae serotype O1 biotype ElTor strain N16961
We wanted to detect fragments of genes and genes which use start codons other than ATG, so we used a very general definition of an ORF: any frame of length at least 30 codons (90nucleotides) uninterrupted by stop codons. The software developed as part of this study, SPROCKET, can be used to detect probable errors in existing gene annotations.
Gene discovery using conservation of potential peptide sequence
Extrinsic gene-finding methods are based on the duality that, given sufficient evolutionary distance, conserved sequences are likely to be functional and that functional sequences are likely to be conserved. The conserved regions may represent control elements or may encode functional RNA molecules or proteins. We examined the extent to which multi-species sequence conservation could be used to detect the presence of protein-coding genes. We considered two systems for classifying ORFs as genes or not-genes. The first system was based solely on existence of a COG containing the ORF. If we determined that an ORF was a member of a COG then we classified it as a gene, otherwise we classified it as not being a gene. Our second classification system used existing gene annotations for all genomes except the one containing the ORF to be classified, reducing the number of false positives. In this system we classified an ORF as a gene if we found it in a COG containing at least one ORF from another genome that was annotated as a gene. In both systems the COG stringency controlled the extent of conservation required for classification.
We compared these gene predictions to existing gene annotations using sensitivity/specificity analysis. Although there are errors in existing gene annotations (as discussed below) the current annotations represent the expert consensus. Sensitivity for a gene classification system is the number of correct gene predictions divided by the number of actual genes. The specificity is the number of true negative predictions divided by the total number of non-genes.
Screening existing gene predictions for errors
ORFs in Majority-annotated mixed COGs of stringency 6 that may represent missed genes
ORF COG ida
Annotated gene(s) present in COGc
ORF COG ida
Annotated gene(s) present in COGc
Potential genes missed in current annotations
Potential genes missed in current annotations (continued)
Gene annotated in different framed
The potential amino acid sequences of 83 of these ORFs contain regions that have substantial similarity to the multiple-sequence alignment of the annotated genes that are also members of the COG (Table 2). We judge that these ORFs are likely to represent genes missed in current annotations. 5 of the 83 candidate genes involve instances where there were two ORFs with approximately equal length present in two different strands; however the opposite strand had previously been chosen as coding.
ORFs in majority-annotated mixed COGs that do not appear to represent missed genes
ORF COG ida
Annotated gene(s) present in COGc
ORF COG ida
Annotated gene(s) present in COGc
Existing annotation of pseudogene
Frameshift 3' fragmentc
Annotated in GenomeReviews but with different stop
Frameshift 5' fragmentc
Fragments around stop codons (nonsense)c
Minority-annotated mixed COGs of stringency 6
aORF COG id
Annotated locus tag
Explanation for similarity
ambiguous--smc may be annotated as too long
opposite strand dnaG
opposite strand tRNA cluster
opposite strand tRNA cluster
opposite strand annotated gene
opposite strand annotated gene
opposite strand RNA-gene (scr) in Bhal, Bsub
ribosomal protein in opposite strand
opposite strand ribosomal protein
opposite strand ribosomal protein
region upstream of gene is opposite pyrG
opposite strand pgk
Mixed COGs containing ORFs from Mycoplasma genitalium that do not correspond to annotated genes
Homologous to genes in 12 other organisms, some annotated as N-utilization substance
Deletion of 'C' at 237175 joins this to the gene (MG199) annotated at 236591-237084. Together the joined fragments are similar to ribonuclease genes. [GenBank:AAGX01000004.1]
Deletion of 'G' at 416710 joins this to fragment at 416661-416939. Together the joined fragments are similar to acyl carrier protein diesterases. [GenBank:AAGX01000016.1]
Insertion of 'T' at 290983 joins this fragment to the gene (MG243) annotated at 290922-291326. Together the joined fragments are similar to hypothetical genes in M. pneumonia, M. gallisepticum, and U. parvum . [GenBank:AAGX01000005.1]
Peptide sequence similarity as a gene discovery technique
Although studies of intrinsic gene prediction report higher accuracies (for example, GeneMark is reported to have detected genes with sensitivity 98.3% and specificity 91.3% averaged over eight prokaryotes , this fact should be considered with the caveat that those algorithms or algorithms similar to them were used to produce the initial annotations. Because intrinsic methods require statistical models of coding sequence to be tuned to a specific organism, they may miss recently acquired genes for which selection has not yet altered the polynucleotide frequencies to match the new host organism. Intrinsic methods may also miss small genes for which there is insufficient nucleotide sequence to provide a statistically significant result. Most of the genes missed by our method are likely to be species specific among the organisms studied, (i.e. they are ORFans  within the context of the analyzed genomes). Our method may be more able to pick up genes that may have been horizontally transferred when compared to intrinsic methods.
A different choice of size cutoff for consideration of ORFs would affect the accuracy of our method because longer ORFs are more likely to represent genes but genes smaller than the cutoff would be excluded. An increase in ORF size cutoff would result in an increase in specificity with a corresponding decrease in sensitivity. The fact that we were able to detect some genes that had been missed in prior genome annotations is in part because we used a very small size cutoff for consideration of which ORFs may be genes. The newly detected genes we report are disproportionately small (27% are shorter than 100 codons, compared to 11% of annotated genes).
A comparison of Figures 1A and 1B shows that incorporating gene information from other organisms greatly increases the specificity (reducing false positives) of similarity-based gene identification, with a much smaller decrease in sensitivity. Classification using stringency 2 COGs (i.e. single linkage clusters) is highly sensitive but not very specific. Increasing stringency to three (the COGs as described by Tatusov et al. ) causes the largest difference in specificity. The difference in specificity between stringencies two and three is even more pronounced when incorporating gene information from the other organisms.
The two gene prediction systems shown in Figure 1 are in general most accurate for the small genomes (Mycoplasma genitalium, Haemophilus influenzae, other Mycoplasma species). The high accuracies for Mollicutes may be explained by the presence of many members of this class in the set of organisms studied. The genes for Escherichia coli strain K12 can also be relatively well predicted, perhaps due to the fact that it is so well studied. Classification of E. coli genes performs particularly well in the second classification system, when information about annotated genes from other genomes is included. This may be due to the fact that the characteristics of genes of E. coli have been used to find those other genes in other organisms.
Limitations of using sequence conservation for gene prediction
Genes with little conservation among the studied genomes or which are only present in a few genomes cannot be detected using sequence conservation, leading to false negatives. Sequence similarity can be due to reasons other than selection due to protein-coding function. This can lead to false positive gene predictions. Sequences that do not encode protein may be similar because: i) they are conserved for other reasons (non-coding control elements, RNA-genes) or ii) because there is insufficient evolutionary distance between a pair of studied species. The latter can be the case for pseudogenes – a region in one genome may contain a gene while the corresponding region in another genome may not actually code for protein but may not have accumulated enough mutations to prevent detection of homology.
Examples of gene prediction inconsistencies in COGs of ORFs
The ORFs in Mycoplasma penetrans from 1316960 to 1317088 on the minus strand and in Mycoplasma mycoides from 830742 to 830915 on the minus strand do not contain annotated genes. They are however members of a COG at stringency 6 (COG id 199 in Table 2) in which the majority of ORFs (25 of 27) correspond to annotated genes. The annotated members of the COG encode the 50S ribosomal protein L36. The sequence identities from the M. penetrans ORF to the annotated genes in the COG range from 39.0% to 83.3%. For the M. mycoides ORF the range is from 35.2% to 75.7%. Neither of the ORFs have any interruption in coding potential compared to the annotated genes in the COG. It is likely that these ORFs contain genes that were missed in the initial annotations of their organisms. The peptides they encode are less than 40 amino acids long, and this may account for the fact that the genes had not been previously detected.
The Vibrio cholerae ORF from 637551 to 638246 in the plus strand of chromosome I is present in a COG of stringency 6 (COG id 1826 in Table 3) in which 11 ORFs correspond to annotated genes. The V. cholerae ORF is only about 60% as long as the annotated genes, and the potential peptide sequence it encodes aligns to the amino terminal region of the annotated genes. Examination of the genomic context of the members of this COG reveals a nearby V. cholerae ORF from positions 638126 to 638788 which could encode peptide which would align the to carboxy terminus of the annotated genes of this COG. This suggests the presence of a frameshift or sequencing error in the region encompassing the two V. cholerae ORFs.
The secG gene annotated in M. genitalium, M. penetrans, M. pneumoniae and U. parvum has homologs in M. gallisepticum, M. mobile, M. mycoides and M. pulmonis . The M. genitalium homolog was detected by curators of the Genome Reviews database (it is not identified in the EMBL genome file). This group of homologs is present in COG id 3175 in Table 2. The M. gallisepticum homolog is identified as a potential gene, but given the annotation "unique hypothetical". The M. mobile and M. mycoides homologs are not identified as genes. In M. pulmonis, the ORF homologous to secG (genomic coordinates 412829 to 413074 on the forward strand) is not identified as a gene, but overlaps the locus MYPU_3500 which is on the opposite strand. It is likely that this is an error in the existing annotation – that the secG homolog is a real gene and the MYPU_3500 locus, while being slightly longer, does not actually encode a peptide sequence in the cell.
Several groups have used computational techniques to find pseudogenes in prokaryotes. Intrinsic techniques are poorly suited for identifying pseudogenes. Pseudogenes may lack characteristics of protein-coding sequence in a particular organism because they have resulted from recent horizontal transfer  or because there is a lack of selective pressure to maintain the characteristics. As a result, extrinsic techniques are common when searching for pseudogenes. Liu et al.  sought pseudogenes using SwissProt entries as queries in FastX searches against genome sequences. They used a fixed similarity cutoff score of 0.01 to define significant matches. The lack of selective pressure and resulting genetic drift that makes it difficult to find pseudogenes by intrinsic methods can also make it difficult to find homology between a pseudogene and the gene from which it was derived in disparate genome sequences. Lerat and Ochman  considered sets of closely related organisms and used thresholds for inferred homology as strict as TBLASTN E-value < 10-15 and protein identity >79%. By using COGs we were able to detect much more distant homology. There were 232 COGs of stringency-6 that contained at least one best-hit with a BLASTP E-value > 1.0.
Potential extensions and modifications to the methods of this study
One obvious extension of this study is to apply the strategy described here to more genome sequences. The scalability of the methods described in this study are limited primarily by the initial BLASTP searches of translations of the ORFs. Because all pairwise comparisons are performed, this step scales as the square of the number of ORFs among the genome sequences studied, which can be approximated by the square of the number of genome sequences. As of June 2005, there are 211 bacterial and 21 archaeal genome sequences published , about 8.9 times the number of genome sequences analyzed here. The similarity search comparison step is time consuming but highly parallel. Because each similarity search is independent of the other searches, the process can be carried out among many CPUs. New genome sequences can be added to the study incrementally.
One shortcut that could mitigate scaling issues would be to form a standard set of COGs of the various stringencies among only the annotated genes. All of the ORFs in a test genome sequence could then be compared to the members of the standard COGs. The annotated genes of the genomes studied here are only 5.75% of the ORFs of 30 or more codons, greatly reducing the number of comparisons that would need to be performed. This shortcut has a few drawbacks: i) it will not be possible to detect genes which are not present in enough of the genomes in the standard set (depending on stringency) and ii) it will not be possible to find the 'minority-annotated' mixed COGs that can indicate over-prediction of genes.
The varying stringency COGs produced using the techniques of this study could also be used in other analyses. The multiple sequence alignments of COGs could be used to help define which of several potential translational start sites may be used in a given gene. Stringency-three COGs which contain members in many organisms have been used to indicate genes which may be essential on account of their pervasiveness. Higher stringency COGs show not only that corresponding elements are present in many organisms but also that between most or all of pairs of organisms the elements are best-hit pairs of each other.
Relationship to proteomic studies
Jaffe et al.  revisited the annotation of genes in Mycoplasma pneumoniae, incorporating evidence of peptides detected using multi-dimensional chromatography followed by analysis by mass-spectrometry. They detected evidence for 16 proteins which could not be associated with annotated genes. One of these (from 207448 to 207717 in the (+) strand) was present in COGs up to stringency 10. Two more (from 250021 to 250293 and from 415490 to 416032, both on the (+) strand) were present in COGs of stringency 3. The others were only present in COGs of stringency 2. The Jaffe et al. study was aided by the fact that M. pneumoniae has a simple lifestyle and grows in a relatively static natural environment. Although some change in M. pneumoniae gene expression is reported in response to heat shock , it is thought that most of its genes are expressed constitutively. In organisms with more complex niches or lifestyles that may involve growing in multiple hosts or environments, the shotgun proteomic approach will require exposing the organisms to multiple conditions in order to induce detectable expression of all proteins.
In this paper, we describe a method that can be used in combination with existing techniques for detecting protein-coding gene sequences in bacterial genomes. Our method is extrinsic an individual in that it incorporates pairwise sequence similarities among several genomes.
The methods we describe can also be used to screen existing gene predictions. The "mixed COGs", in which some open reading frames correspond to annotated genes and some do not, are attractive targets for further study. Such COGs exist even when requiring best-hit similarity pairings among many organisms, and we list the COGs of this type that exist at stringency six among 27 sequenced bacterial strains. The methods we describe can be used to generate hypotheses about the presence of specific genes that may have been missed in existing annotations. Such a hypothesis could be evaluated by a targeted search for the expected protein product based on predicted protein characteristics.
Sequence preparation, comparison and Best Hit determination
Sequences and existing annotations for genomes under study (Table 1) were obtained from the Genome Reviews database  version 25.0 . The annotations in the Genome Review database include those from the EMBL sequence files with corresponding accessions as well as some genes that were identified on the basis of BLAST similarity to sequences in UniProt. We located open reading frames (ORFs) using the criterion of at least 90 nucleotides (30 codons) between in-frame stop codons (for codon usage tables appropriate to each organism – the included Mollicute species do not use UGA as a stop codon).
Sequence libraries composed of all of the ORFs in the sequences of included genomes were searched using each ORF as a query. These all-against-all searches were performed using WU-BLAST (BLASTP 2.0 MP-WashU [06-Apr-2005] [macosx-10.2-g4-ILP32F64 2005-04-06T17:46:37], BLOSUM62 similarity matrix, filtered with SEG, Smith-Waterman alignment used in the scoring phase). The top ten hits from each query to a library of ORFs from another organism were recorded provided the BLAST e-score was less than or equal to 20. There were 84009520 BLAST hits which met these criteria.
For each ORF, compare it to all ORFs in another organism, recording the best hit (BeT), provided that hit meets the loose stringency cutoff mentioned above (e-score less than 20).
If an ORF a has ORF b as its best hit in another organism and b has a as its best hit when the reciprocal similarity search is performed then a and b are said to have a "bidirectional" or "congruent" best-hit relationship. A pair of vertices will have edges connecting them if the ORFs they represent have a bidirectional best-hit relationship. For purposes of this step, ORFs the two strains of E. coli were not compared to each other.
For a given stringency n, cliques (complete graphs – wherein all nodes are connected to all other nodes) of size n are found. Cliques are joined when they share a sub-clique of size n-1 maximally-joined sets of cliques form a COG.
Like the extended COGs used in the STRING database , the COGs in this study are 'non-supervised'; we have not performed any manual curation subsequent to COG production. A gene fusion may result in the merging of two disparate COGs. Although such an artifact affects the use of COGs in functional annotation, it is not problematic when using COGs for gene recognition so we made no effort to avoid these merging events. The BLAST searches resulted in 2649524 best-hit pairs involving 891039 ORFs. Only 38 best-hit pairs involved a similarity with E-value as high as 20, and none of these best-hit pairs held together a COG of stringency three or higher.
A summary view of the COGs shows the number of members of a COG and how many of those members share a stop codon with an annotated gene. Individual COGs can be examined in more detail by seeing a list of their members, a force-directed graph layout showing best-hit relationships among these members, an alignment of the potential peptide sequences of the ORFs as produced by CLUSTALW  or a graphical view of the regions of the genomes around the ORFs which are in a COG.
Availability and Requirements
Project name: SPROCKET
Project home page: http://groove.med.unc.edu/sprocket
Operating system(s): Platform independent
Programming language: C++ (COG construction), Java (COG viewing)
Other requirements: Java 1.4 or higher, PostgreSQL (only required if data stored locally)
License: GNU GPL (COG construction), BSD (COG viewing)
This work was supported by a Public Health Service research grant subcontracted to UNC-CH (Clyde A. Hutchison III) from the Berkeley Structural Genomics Center (GM62412, Sung-Hou Kim). We thank Morgan Giddings for computational resources and Jack Snoeyink and Craig Falls for algorithmic insights into calculating high-stringency COGs.
- Dandekar T, Huynen M, Regula JT, Ueberle B, Zimmermann CU, Andrade MA, Doerks T, Sanchez-Pulido L, Snel B, Suyama M, Yuan YP, Herrmann R, Bork P: Re-annotating the Mycoplasma pneumoniae genome sequence: adding value, function and reading frames. Nucleic Acids Res 2000, 28(17):3278–3288. 10.1093/nar/28.17.3278PubMed CentralView ArticlePubMedGoogle Scholar
- Iliopoulos I, Tsoka S, Andrade MA, Enright AJ, Carroll M, Poullet P, Promponas V, Liakopoulos T, Palaios G, Pasquier C, Hamodrakas S, Tamames J, Yagnik AT, Tramontano A, Devos D, Blaschke C, Valencia A, Brett D, Martin D, Leroy C, Rigoutsos I, Sander C, Ouzounis CA: Evaluation of annotation strategies using an entire genome sequence. Bioinformatics 2003, 19(6):717–726. 10.1093/bioinformatics/btg077View ArticlePubMedGoogle Scholar
- Jaffe JD, Berg HC, Church GM: Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics 2004, 4(1):59–77. 10.1002/pmic.200300511View ArticlePubMedGoogle Scholar
- Kersey P, Bower L, Morris L, Horne A, Petryszak R, Kanz C, Kanapin A, Das U, Michoud K, Phan I, Gattiker A, Kulikova T, Faruque N, Duggan K, McLaren P, Reimholz B, Duret L, Penel S, Reuter I, Apweiler R: Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucleic Acids Res 2005, 33(Database issue):D297–302. 10.1093/nar/gki039PubMed CentralView ArticlePubMedGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 2005, 33(Database issue):D501–4. 10.1093/nar/gki025PubMed CentralView ArticlePubMedGoogle Scholar
- Arthur JW, Wilkins MR: Using proteomics to mine genome sequences. J Proteome Res 2004, 3(3):393–402. 10.1021/pr034056eView ArticlePubMedGoogle Scholar
- Giddings MC, Shah AA, Gesteland R, Moore B: Genome-based peptide fingerprint scanning. Proc Natl Acad Sci U S A 2003, 100(1):20–25. 10.1073/pnas.0136893100PubMed CentralView ArticlePubMedGoogle Scholar
- Jaffe JD, Stange-Thomann N, Smith C, DeCaprio D, Fisher S, Butler J, Calvo S, Elkins T, FitzGerald MG, Hafez N, Kodira CD, Major J, Wang S, Wilkinson J, Nicol R, Nusbaum C, Birren B, Berg HC, Church GM: The complete genome and proteome of Mycoplasma mobile. Genome Res 2004, 14(8):1447–1461. 10.1101/gr.2674004PubMed CentralView ArticlePubMedGoogle Scholar
- Wu J, Kasif S, DeLisi C: Identification of functional links between genes using phylogenetic profiles. Bioinformatics 2003, 19(12):1524–1530. 10.1093/bioinformatics/btg187View ArticlePubMedGoogle Scholar
- Kim SH, Shin DH, Choi IG, Schulze-Gahmen U, Chen S, Kim R: Structure-based functional inference in structural genomics. J Struct Funct Genomics 2003, 4(2–3):129–135. 10.1023/A:1026200610644View ArticlePubMedGoogle Scholar
- Hutchison CA, Peterson SN, Gill SR, Cline RT, White O, Fraser CM, Smith HO, Venter JC: Global transposon mutagenesis and a minimal Mycoplasma genome. Science 1999, 286(5447):2165–2169. 10.1126/science.286.5447.2165View ArticlePubMedGoogle Scholar
- Mushegian AR, Koonin EV: A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci U S A 1996, 93(19):10268–10273. 10.1073/pnas.93.19.10268PubMed CentralView ArticlePubMedGoogle Scholar
- Fickett JW: Finding genes by computer: the state of the art. Trends Genet 1996, 12(8):316–320. 10.1016/0168-9525(96)10038-XView ArticlePubMedGoogle Scholar
- Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, McKenney K, Sutton G, Fitzhugh W, Fields C, Gocayne JD, Scott J, Shirley R, Liu LI, Glodek A, Kelley JM, Weidman JF, Phillips CA, Spriggs T, Hedblom E, Cotton MD, Utterback TR, Hanna MC, Nguyn DT, Saudek DM, Brandon RC, Fine LD, Frichtman JL, Fuhrmann JL, Geoghagen NSM, Gnehm CL, McDonald LA, Small KV, Fraser CM, Smith HO, Venter JC: Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995, 269(5223):496–512.View ArticlePubMedGoogle Scholar
- Karlin S, Mrazek J, Campbell AM: Codon usages in different gene classes of the Escherichia coli genome. Mol Microbiol 1998, 29(6):1341–1355. 10.1046/j.1365-2958.1998.01008.xView ArticlePubMedGoogle Scholar
- Harrison PM, Carriero N, Liu Y, Gerstein M: A "polyORFomic" analysis of prokaryote genomes using disabled-homology filtering reveals conserved but undiscovered short ORFs. J Mol Biol 2003, 333(5):885–892. 10.1016/j.jmb.2003.09.016View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Pachter L, Alexandersson M, Cawley S: Applications of generalized pair hidden Markov models to alignment and gene finding problems. J Comput Biol 2002, 9(2):389–399. 10.1089/10665270252935520View ArticlePubMedGoogle Scholar
- Brent MR, Guigo R: Recent advances in gene structure prediction. Curr Opin Struct Biol 2004, 14(3):264–272. 10.1016/j.sbi.2004.05.007View ArticlePubMedGoogle Scholar
- Ochman H: Distinguishing the ORFs from the ELFs: short bacterial genes and the annotation of genomes. Trends Genet 2002, 18(7):335–337. 10.1016/S0168-9525(02)02668-9View ArticlePubMedGoogle Scholar
- Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278(5338):631–637. 10.1126/science.278.5338.631View ArticlePubMedGoogle Scholar
- Thomas JW, Touchman JW, Blakesley RW, Bouffard GG, Beckstrom-Sternberg SM, Margulies EH, Blanchette M, Siepel AC, Thomas PJ, McDowell JC, Maskeri B, Hansen NF, Schwartz MS, Weber RJ, Kent WJ, Karolchik D, Bruen TC, Bevan R, Cutler DJ, Schwartz S, Elnitski L, Idol JR, Prasad AB, Lee-Lin SQ, Maduro VV, Summers TJ, Portnoy ME, Dietrich NL, Akhter N, Ayele K, Benjamin B, Cariaga K, Brinkley CP, Brooks SY, Granite S, Guan X, Gupta J, Haghighi P, Ho SL, Huang MC, Karlins E, Laric PL, Legaspi R, Lim MJ, Maduro QL, Masiello CA, Mastrian SD, McCloskey JC, Pearson R, Stantripop S, Tiongson EE, Tran JT, Tsurgeon C, Vogt JL, Walker MA, Wetherby KD, Wiggins LS, Young AC, Zhang LH, Osoegawa K, Zhu B, Zhao B, Shu CL, De Jong PJ, Lawrence CE, Smit AF, Chakravarti A, Haussler D, Green P, Miller W, Green ED: Comparative analyses of multi-species sequences from targeted genomic regions. Nature 2003, 424(6950):788–793. 10.1038/nature01858View ArticlePubMedGoogle Scholar
- Montague MG, Hutchison CA: Gene content phylogeny of herpesviruses. Proc Natl Acad Sci U S A 2000, 97(10):5334–5339. 10.1073/pnas.97.10.5334PubMed CentralView ArticlePubMedGoogle Scholar
- BSGC: Berkeley Structural Genomics Center.[http://www.strgen.org]
- GenomeReviews: Genome Reviews.[http://www.ebi.ac.uk/GenomeReviews]
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, McDade KE, McKenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM: Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005, 437(7057):376–380.PubMed CentralPubMedGoogle Scholar
- Besemer J, Lomsadze A, Borodovsky M: GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 2001, 29(12):2607–2618. 10.1093/nar/29.12.2607PubMed CentralView ArticlePubMedGoogle Scholar
- Siew N, Fischer D: Analysis of singleton ORFans in fully sequenced microbial genomes. Proteins 2003, 53(2):241–251. 10.1002/prot.10423View ArticlePubMedGoogle Scholar
- Liu Y, Harrison PM, Kunin V, Gerstein M: Comprehensive analysis of pseudogenes in prokaryotes: widespread gene decay and failure of putative horizontally transferred genes. Genome Biol 2004, 5(9):R64. 10.1186/gb-2004-5-9-r64PubMed CentralView ArticlePubMedGoogle Scholar
- Lerat E, Ochman H: Recognizing the pseudogenes in bacterial genomes. Nucleic Acids Res 2005, 33(10):3125–3132. 10.1093/nar/gki631PubMed CentralView ArticlePubMedGoogle Scholar
- GOLD: GOLD: Genomes OnLine Database.[http://www.genomesonline.org]
- Weiner J, Zimmerman CU, Gohlmann HW, Herrmann R: Transcription profiles of the bacterium Mycoplasma pneumoniae grown at different temperatures. Nucleic Acids Res 2003, 31(21):6306–6320. 10.1093/nar/gkg841PubMed CentralView ArticlePubMedGoogle Scholar
- Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV: The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res 2001, 29(1):22–28. 10.1093/nar/29.1.22PubMed CentralView ArticlePubMedGoogle Scholar
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4(1):41. 10.1186/1471-2105-4-41PubMed CentralView ArticlePubMedGoogle Scholar
- von Mering C, Huynen M, Jaeggi D, Schmidt S, Bork P, Snel B: STRING: a database of predicted functional associations between proteins. Nucleic Acids Res 2003, 31(1):258–261. 10.1093/nar/gkg034PubMed CentralView ArticlePubMedGoogle Scholar
- OBDA: OBDA Main Page.[http://obda.open-bio.org]
- Biojava: BioJava.[http://biojava.org]
- Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 2003, 31(13):3497–3500. 10.1093/nar/gkg500PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.