Quantitative measures for the management and comparison of annotated genomes
© Eilbeck et al; licensee BioMed Central Ltd. 2009
Received: 20 October 2008
Accepted: 23 February 2009
Published: 23 February 2009
The ever-increasing number of sequenced and annotated genomes has made management of their annotations a significant undertaking, especially for large eukaryotic genomes containing many thousands of genes. Typically, changes in gene and transcript numbers are used to summarize changes from release to release, but these measures say nothing about changes to individual annotations, nor do they provide any means to identify annotations in need of manual review.
In response, we have developed a suite of quantitative measures to better characterize changes to a genome's annotations between releases, and to prioritize problematic annotations for manual review. We have applied these measures to the annotations of five eukaryotic genomes over multiple releases – H. sapiens, M. musculus, D. melanogaster, A. gambiae, and C. elegans.
Our results provide the first detailed, historical overview of how these genomes' annotations have changed over the years, and demonstrate the usefulness of these measures for genome annotation management.
The number of sequenced and annotated genomes is rapidly increasing. There are currently 925 published genomes and 3185 genome sequencing projects underway. Of those underway, over 900 are eukaryotic, genomes whose large size and intron-containing genes complicate annotation. Even assuming as few as 10,000 genes/genome, these new eukaryotic genomes alone will add more than nine million annotations to GenBank. Tools to manage and analyze these gene annotations are badly needed. Consider too that next-generation sequencing technologies will soon make it possible for individual labs to sequence and annotate genomes; the number of gene annotations could thus well exceed one billion in a few years' time.
Gene annotations are not static entities, and how best to manage them is a complex and challenging problem. Gene annotations must be tracked from release to release, and problematic annotations identified, reviewed and modified. By nature this is a comparative process. Standardization of formats and database schemas has helped matters greatly. The Sequence Ontology and GMOD projects, for example, provide tools and standards that promote database interoperability. This in turn has made possible common formats for data exchange such as CHADO XML and gff3. The result has been an ever-proliferating number of groups annotating and redistributing their own annotations, independent of the annotation pipelines used by GenBank. Examples include not only model organism databases such as those for C. elegans and D. melanogaster, but also emerging model organisms such as the planarian S. mediterranea. The growing number of annotation providers – and users – is creating a pressing need for tools and techniques for gene annotation management and analysis.
Today, most annotation management and comparison at the whole-genome scale is restricted to analyses of basic traits – for example, differences between releases are usually evaluated in terms of gene and transcript numbers. Though indisputably useful, these simple statistics tell only part of the story. Comparisons of different genomes' annotations also suffer from a paucity of measures, with most studies restricted to analyses of protein alignments [8–10]. Here too, new measures of comparison are needed, measures that move beyond the amino acid sequences and take into account other aspects of the annotations such as similarities in intron-exon structures and patterns of alternative splicing.
Some previous work has been done in this area. The Sequence Ontology project, for example, has created a categorization system for alternative splicing that can identify problematic annotations for later manual review. The DEBD and ASTRA projects have also proposed genome-wide categorizations of alternative splicing using graph-based approaches. In principle these classification systems could be used for whole-genome annotation management, but to our knowledge they have not yet been applied for this purpose. Furthermore, useful as qualitative classification systems are, quantitative metrics are also needed – measures akin to the sensitivity, specificity and accuracy metrics used by the gene-prediction community to evaluate gene-finder performance. These measures have seen wide use [14–16]. However, they also have recognized shortcomings. Indeed, the recent eGASP contest concluded with a call for new performance measures for alternative splicing and UTR prediction. Moreover, these measures are designed for evaluating gene-prediction algorithms. The problems faced in annotation management are similar in spirit, but distinct enough to require different measures and software. In response to these issues, we have formulated a set of metrics for annotation comparison.
We introduce two new measures to evaluate changes to annotations across releases: Annotation Turnover, and Annotation Edit Distance. Annotation Turnover tracks the addition and deletion of gene annotations from release to release. We show that tracking annotations in this manner supplements traditional gene and transcript counts, allowing the detection of 'resurrection events' – cases where an annotation is created in one release, later deleted, and then after a lapse of one or more releases a new annotation is created at the old genomic location, with no reference to the previous annotation.
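The notion of a resurrection event can be illustrated with a minimal sketch. It assumes that, for a given genomic location, we already know whether an annotation was present in each successive release; the function name `is_resurrection` and the boolean presence list are hypothetical and are not part of the authors' software. The sketch only detects the present, absent, present pattern and does not check whether the new annotation refers to the earlier one.

```python
from typing import Sequence


def is_resurrection(present: Sequence[bool]) -> bool:
    """Return True if an annotation is present, then absent for one or
    more releases, then present again (releases in chronological order)."""
    seen_present = False  # an annotation existed in some earlier release
    seen_gap = False      # it was then missing from at least one release
    for is_present in present:
        if is_present:
            if seen_gap:
                return True
            seen_present = True
        elif seen_present:
            seen_gap = True
    return False
```

For example, a presence pattern of `[True, False, True]` would be flagged, whereas an annotation that is simply deleted (`[True, True, False]`) or created late (`[False, True, True]`) would not.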
We use a second, complementary measure, called Annotation Edit Distance (AED), to quantify the changes to individual annotations from release to release. AED is similar to performance measures employed by the gene-prediction community, but takes into account aspects of annotations not well addressed by conventional sensitivity/specificity measures, such as alternative splicing. AED complements Annotation Turnover and gene and transcript numbers in that it measures structural changes to an annotation. Two releases can differ dramatically from one another, with every annotation's intron-exon structure having been revised, yet still have identical gene and transcript numbers and no Annotation Turnover; AED provides a means to distinguish between a new release with no changes, and one wherein the intron-exon coordinates alone have been altered. Moreover, it provides a means to quantify the extent of these changes.
We also introduce a new measure for quantifying the complexity of alternative splicing, which we call Splice Complexity. Those in the field of gene annotation often speak of one gene as having a more complex pattern of alternative splicing than another. For example, a gene with 20 transcripts, each with different combinations of exons, is said to be more complex than a gene producing two transcripts that differ from one another by only a few nucleotides at their 5' ends. Splice Complexity provides a means to quantify transcriptional complexity; moreover, because it is independent of sequence homology, Splice Complexity can be used to compare any alternatively spliced gene to any other. This makes possible novel, global comparisons of alternative splicing across genomes. We have used Splice Complexity in conjunction with a classification scheme for alternatively spliced genes developed by the Sequence Ontology project in order to obtain a global perspective on alternative splicing in different genomes. These novel analyses suggest that the complexity and mode of alternative splicing vary considerably amongst the different genomes in our collection.
In total we have analyzed over 500,000 annotations in this study. To our knowledge this is the largest meta-analysis of gene annotations ever undertaken. Our results reveal both global differences among the annotations of different genomes and unexpected similarities – demonstrating the utility of these new measures for whole-genome annotation management and for comparative genomics studies.
Our analyses fall into two classes – intra-genome comparisons of annotations that track and summarize genome-wide changes in annotations from release to release, and inter-genome comparisons that compare and contrast the annotations of different genomes to one another. We chose five annotated genomes for these analyses: Homo sapiens, Mus musculus, Drosophila melanogaster, Anopheles gambiae, and Caenorhabditis elegans. For D. melanogaster and C. elegans we used gff3 releases from FlyBase and WormBase respectively. For H. sapiens, M. musculus and A. gambiae we used GenBank releases. We also took practical issues into account when choosing which releases to analyze, such as completeness and usability. The early gff3 releases of FlyBase and WormBase, for example, were alpha releases designed to troubleshoot the release process; in some cases this precluded effective analyses of some aspects of their contents. In total we analyzed six human GenBank releases (33–36.2), five M. musculus GenBank releases (30–36.1), four D. melanogaster FlyBase releases (3.2.2-5.1), and five C. elegans WormBase releases (WS100-WS176). We also included an A. gambiae release (08/2007) from GenBank in some of our analyses. See Additional File 1 for details of the dataset.
Annotation Edit Distance
[Table: percentage of genes in the current release with a history of modification (AED > 0); percentage of genes in the latest release that have been modified n times in their past; average AED per revised transcript.]
H. sapiens and M. musculus annotations are also undergoing considerable revision from release to release. 55% of current human annotations (release 36.2) have been modified at least once since 2003, with an average AED/revised transcript of 0.086. Substantial numbers of mouse annotations have also undergone revision. 29% of annotations in release 36.1 (current at time of writing) have been modified at least once since their creation. Finally, as Figure 2 makes clear, mouse release 36.1 is somewhat atypical in that no transcript coordinates were altered, though the CDS coordinates of 51 transcripts were changed. In addition, release 36.1 saw the deletion of 487 genes and 501 transcripts (Additional File 1).
These results show how AED naturally supplements gene and transcript numbers. Consideration of gene and transcript numbers alone, for example, would lead one to believe that the C. elegans and D. melanogaster annotations are both relatively static, when in fact the C. elegans annotations are evolving rapidly compared to those of D. melanogaster. Considering AED in conjunction with gene and transcript counts also makes it clear that the dynamics of the two invertebrate annotation sets differ markedly from the vertebrate ones, which are characterized by large fluctuations in both gene and transcript numbers – and AED.
The H. sapiens and M. musculus genomes have undergone higher rates of annotation turnover than either of the two invertebrate genomes (Figure 3). Less than 60% of annotations present in human release 33 and mouse release 30 were still in existence by the next release, i.e. human release 34.1 and mouse release 32.1. Most of the turnover was due to annotation deletion: between April and October 2003, human gene counts fell by 28% (Additional File 1), and between February and October 2003 mouse gene numbers fell by 20%. Since this early clean-up, mouse and human gene numbers have risen by 10%. Interestingly, about 1 in 3 (30%) of the new mouse genes are resurrections of release 30 annotations deleted from release 32; this is the underlying cause of the upward trend in the blue line since release 32.1 in Figure 3.
We also measured Annotation Turnover for the human and mouse Refseq NM and NR annotations. These are shown as dotted lines in the human and mouse panels in Figure 3. Turnover rates for these curated annotations are much lower. For both genomes, over 95% of Refseq NMs and NRs present in the first releases in our collection (2003) were still present in the last (2006). Likewise, more than 90% of 2006 human and mouse Refseq NMs and NRs were present as long ago as 2003. Thus, the NMs and NRs paint a very different picture of annotation turnover, one that closely resembles the C. elegans and D. melanogaster turnover data (Figure 3), making it clear that most of the turnover in human and mouse genomes has been due to addition and deletion of automatically generated annotations for which there is little experimental support.
Alternatively spliced genes pose special challenges for annotation efforts. Because they are not predicted by most gene finders, and are predicted with poor accuracy by those that do, alternatively spliced transcripts are generally the product of manual annotation efforts. As such, they provide an important indication of the extent and completeness of active curation efforts. 15% of human genes (release 36.2), 7% of mouse (release 36.1), 24% of D. melanogaster, 9% of mosquito, and 19% of C. elegans genes have more than one annotated transcript (Additional File 1).
There has been a strong trend towards ever increasing numbers of alternatively spliced annotations from release to release for every genome in our collection (Additional File 1). Although this trend illustrates the growing focus on the annotation of alternatively spliced genes, it says nothing about how the contents of alternatively spliced annotations have evolved from release to release and how they differ between genomes. We have undertaken two analyses to address these points. First, we classified alternatively spliced annotations using a scheme developed by the Sequence Ontology. We also used a measure we term Splice Complexity (see Methods) to quantify the complexity of each alternatively spliced annotation.
Previous work on the D. melanogaster genome has shown that the vast majority of its alternatively spliced genes belong to the 0:0:N class. We find that this trend also holds true for every genome in our collection (Figure 4B). However, the M. musculus and A. gambiae genomes are enriched for problematic annotations: 33% of A. gambiae and 15% of M. musculus alternatively spliced genes, for example, consist of transcripts lacking any exon borders in common (cf. 0:N:0 class in Figure 4A). The high percentage of such genes in the Mus and Anopheles genomes indicates that these annotations are in need of review, as many of them may be mis-annotated. Conventional release statistics such as gene and transcript numbers or percentages of alt-spliced genes cannot reveal trends such as these. Thus, these results highlight the usefulness of the SO classification scheme for annotation management.
To further characterize alternatively spliced genes, we developed a new measure that we term Splice Complexity. Splice Complexity provides a means to quantify (rather than classify) the complexity of a genome's alternatively spliced annotations; it thus naturally complements existing classification systems such as the Sequence Ontology's and graph-based splicing schemes [11, 12]. The Methods section describes in detail how Splice Complexity is calculated.
Interestingly, the H. sapiens, D. melanogaster and C. elegans alternatively spliced annotations all have very similar distributions of Splice Complexity, whereas the M. musculus and A. gambiae genomes are biased towards higher frequencies of splice-complex annotations (Figure 5, upper panel). The SO-based classifications shown in the lower panel of Figure 5 suggest an explanation for these differences. Relative to the other three genomes, M. musculus and A. gambiae annotations tend to have higher Splice Complexities because they contain more annotations that belong to problematic SO classes. Moreover, the enrichment of these problematic classes grows steadily more pronounced as their Splice Complexity increases (Figure 5, lower panel). These results once again illustrate the utility of our measures for annotation management and meta-analysis and how they complement the SO schema: they provide a global overview of an entire genome's alt-spliced genes, and direct comparisons between genomes reveal an excess of problematic – likely incorrect – annotations in the mouse and mosquito genomes that should be subjected to manual review.
Most complex alternatively spliced annotations.
unc-43 – UNCoordinated family member
Dscam – Down syndrome cell adhesion molecule
LOC628147 similar to zinc finger protein 709
CREM – cAMP responsive element modulator
Conservation of Alternative Splicing
Orthologous gene annotations with greatest difference in splice complexity.
We have used a variety of new approaches to investigate the annotations of five large eukaryotic genomes, four of them across multiple releases. Our meta-analyses provide novel, global perspectives on the contents of more than 500,000 annotations and their evolution over a period of several years. These analyses have brought to light previously unknown differences and unexpected similarities between their annotations, and allowed us to tease apart differences due to annotation practice from those due to underlying biology. We have also shown how analyses combining Splice Complexity and the Sequence Ontology's classification system can be used to identify and prioritize likely mis-annotated genes for manual review.
Our analyses of Annotation Turnover show that the H. sapiens and M. musculus annotations are characterized by very high rates of turnover. The major cause of turnover in both genomes appears to be incremental changes in the NCBI's annotation protocols, especially as regards pseudogene identification. Since 2003, far fewer annotations have been deleted from either vertebrate genome, and gene addition has been the dominant trend, with some of the added genes being resurrected from earlier releases. This is especially true for mouse, wherein gene numbers rose by 17% between releases 34 and 35. Once again the cause appears to be changing annotation methodologies. Between these two releases the NCBI's gene prediction program, Gnomon, was altered to use a new repeat masking program and to incorporate protein alignments to the genome. This resulted in an increase in gene models in Build 35 compared to Build 34. For both vertebrate genomes, turnover of Refseq NM and NR annotations has been much lower (Figure 3); these form a stable core amid a continuous flux of more ephemeral annotations.
The high turnover rates characteristic of the human and mouse genomes stand in stark contrast to the more static D. melanogaster and C. elegans genomes. Almost 99% of D. melanogaster annotations present in the omnibus 3.2 release are still present in some form today. C. elegans gene numbers are also quite stable, with rates of gene addition and deletion almost balanced – 90% of annotations present in 2003 were still present in 2007 (WS176) and vice versa. The stability of gene numbers in both organisms is certainly not due to neglect. Genome-wide searches for new protein coding genes followed by PCR-verification have been undertaken in both animals [28, 29].
We used Annotation Edit Distance (Figure 2) to measure active curation independently of annotation turnover. Whereas the D. melanogaster annotations are undergoing little revision, the C. elegans, H. sapiens and M. musculus annotations have undergone significant revision with each release. 58% of C. elegans annotations and 55% of human annotations, for example, have been altered since 2003; by comparison only 6% of D. melanogaster annotations have been altered during this time. These results show how Annotation Edit Distance can be used to assess the intensity of annotation curation efforts among different databases.
Our analyses of alternatively spliced genes indicate that these are incompletely annotated in every genome in our collection. Despite the fact that alternative splicing is a trait frequently shared among orthologous genes [25, 30–32], this trend is poorly captured by the current crop of annotations. For example, estimates based on EST data suggest that around 50% of D. melanogaster and A. gambiae alternative exons are conserved. At time of writing, however, only 6.4% of melanogaster-gambiae orthologous genes are alternatively spliced in both genomes. Likewise, only 2.6% of orthologous human-mouse annotations are alternatively spliced in both genomes, considerably less than the published estimate of 40% based upon EST analyses [30, 31]. We did, however, detect a weak but statistically significant tendency for human-mouse and melanogaster-gambiae orthologs to both be alternatively spliced when either member is. There is also a statistically significant correlation in their Splice Complexities. These facts suggest that the current crop of annotations has begun to capture the conserved aspects of alternative splicing, but that much progress remains possible. Certainly, a rigorous review of alternative splicing patterns among orthologous genes could do much to improve the annotation of all four genomes.
Our analyses using the Sequence Ontology classification system revealed genome-specific differences in the frequencies of different modes of alternative splicing. M. musculus and A. gambiae, for example, are highly enriched for genes whose transcripts share no exon borders in common. Our Splice Complexity based analyses complement these findings: unexpectedly, the human, C. elegans and D. melanogaster distributions are all very similar to one another despite the vast evolutionary distances separating these genomes (Figure 5, top panel). This may indicate that common selective forces govern the transcriptional complexity of alternatively spliced genes. Inconsistent with this hypothesis, however, the M. musculus and A. gambiae Splice Complexity distributions are skewed towards higher values due to enrichment for genes with unusual modes of alternative splicing. We believe that annotation quality, rather than biology, is the likely cause of the skew towards higher Splice Complexities in these two genomes. If this explanation is correct, the mouse and mosquito distributions should converge upon those of the other three genomes as their annotations mature. But whatever the cause – mis-annotation or fundamental differences in biology – the in silico identification of these unusual annotations is the first step toward their review and experimental investigation.
Although the information encoded in genomic DNA provides a foundation for modern medicine, genome sequences in themselves are not very useful. Their value is dependent upon identifying and annotating the genes they contain. Incomplete and incorrect annotations poison every experiment that employs them. In light of these considerations, accurate and complete genome annotation seems a laudable and achievable goal, especially for model organisms. Because the datasets are so large and complex, in silico methods for annotation management must necessarily play a major role in this process. In response, we have formulated three new measures for annotation management – Annotation Turnover, Annotation Edit Distance and Splice Complexity – and used them to investigate the annotations of five genomes. Our results show how these measures can be used to better monitor changes to a genome's annotations from release to release; to compare the magnitude of curation efforts among different genome databases; and to identify and prioritize problematic annotations for manual review.
Tracking annotations from release to release
Reciprocal best hits are commonly used to identify orthologous genes, even over large evolutionary distances [10, 33]. This approach is also effective for tracking annotations from assembly to assembly, as intra-genome differences are meager in comparison to cross-genome differences. In order to determine the accuracy of this procedure, we used two complementary approaches. First, we searched the first and last release from each genome against themselves. We found that on average, 98.7% of genes were their own reciprocal best hits; this percentage demonstrates that paralogs, repeats and low complexity sequence have little impact on the accuracy of the reciprocal best hits procedure. We used a second procedure to assess the impact of greater release distance on accuracy. To do so, we identified reciprocal best hits between each release and its closest two temporal neighbors, and used these data to populate a graph of reciprocal best hits from release to release for each genome. We then compared the reciprocal best hits obtained by traversing this graph from start to end with those obtained by searching the most current release directly against the earliest release. The trace-based approach recovered a subset (91%) of the reciprocal best hits obtained by the direct search. For D. melanogaster and C. elegans the percentage was 100% and 98%, respectively. For H. sapiens and M. musculus it was 85% and 79%. Resurrection of previously deleted genes lowered percentages in the two vertebrate genomes. For these reasons we conclude that simply blasting releases against one another is the preferred means of tracking annotations across releases. All searches were performed with the following WU-BLAST command: blastn <db> <query> -filter=seg -cpus=1 -W=30 -N=-10 -mformat=2 -B=1 E=1e-6 -gspmax=5 T=1000 -wink=30.
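The reciprocal best hits idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the BLAST output has already been parsed into `(query_id, subject_id, score)` tuples, and the names `Hit`, `best_hits` and `reciprocal_best_hits` are hypothetical.

```python
from typing import Dict, Iterable, Tuple

# Assumed record layout for a parsed hit: (query_id, subject_id, score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject (first seen wins ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> Dict[str, str]:
    """Return pairs a -> b where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}
```

Given hits from searching one release against another and vice versa, a transcript is retained only when the best-hit relationship holds in both directions; a transcript whose best hit prefers some other transcript is dropped.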
Changes to the underlying assembly complicate analyses of annotation change. We therefore sought to segregate changes to annotations resulting solely from curation from those resulting from changes to the underlying assembly. To do so, we first identified versions of the same annotation in sequential pairs of releases using a reciprocal best hits approach. We then compared the underlying genomic sequences (including a flanking region of 500 bp) for each gene version-pair. If there was any change to the underlying genomic sequence, these annotations were flagged as altered due to assembly change. We found that the impact of assembly changes on existing annotations varied widely from genome to genome and from release to release (Additional File 3). The D. melanogaster and C. elegans assemblies were the most static; on average only 0.42% of D. melanogaster and 0.30% of C. elegans genes experienced changes to their underlying DNA sequences from release to release. The H. sapiens and M. musculus assemblies were more labile. On average 3% of H. sapiens and 18% of M. musculus annotations underwent assembly-induced coordinate changes from release to release. For both vertebrate genomes the vast majority of these occurred between early releases. In M. musculus, for example, the underlying genomic sequences of 30% of release 30 (02/2003) annotations had been altered by release 32.1 (10/2003), whereas the percentage fell to 15% between releases 34.1 (05/2005) and 35.1 (09/2005) and to only 0.05% between releases 35.1 (09/2005) and 36.1 (05/2006). H. sapiens followed a similar trend (averaging around 3% per release), with the exception of release 35.1, which had a higher percentage (5%).
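The sequence comparison used to flag assembly-induced changes might look roughly like the following sketch. It assumes each release's chromosome sequence is available as a string and that gene spans are 0-based, half-open coordinates; both conventions, and the names `flanked_region` and `altered_by_assembly`, are assumptions made for illustration. Strand is ignored, and only the 500 bp flank comes from the text.

```python
from typing import Tuple

FLANK = 500  # flanking bases on each side, as stated in the text

Span = Tuple[int, int]  # assumed convention: 0-based, half-open [start, end)


def flanked_region(chrom_seq: str, start: int, end: int,
                   flank: int = FLANK) -> str:
    """Return the gene's genomic sequence plus up to `flank` bases per side."""
    return chrom_seq[max(0, start - flank):min(len(chrom_seq), end + flank)]


def altered_by_assembly(old_seq: str, old_span: Span,
                        new_seq: str, new_span: Span,
                        flank: int = FLANK) -> bool:
    """Flag a gene version-pair whose underlying genomic sequence differs."""
    old_region = flanked_region(old_seq, old_span[0], old_span[1], flank)
    new_region = flanked_region(new_seq, new_span[0], new_span[1], flank)
    return old_region != new_region
```

Under this sketch, a gene whose coordinates shift because of an insertion far upstream is not flagged, since its local sequence is unchanged, whereas any base change within the gene or its flanks is.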
Calculating Annotation Edit Distance
Sensitivity, Specificity, and Accuracy are commonly used to measure gene-finder performance relative to some standard, usually a reference annotation that is well supported by experimental evidence. Sensitivity (SN) is the fraction of the reference feature predicted, whereas Specificity (SP) is the fraction of the prediction overlapping the reference feature. Both measures can be calculated for any feature class, e.g. transcripts, exons or introns; and the calculations can be performed at the nucleotide level or, if greater stringency is desired, as the fraction of the features predicted exactly. SN and SP are often combined into a single measure called Accuracy (AC). Several formulations of accuracy are in use. Some of these take true negatives into account; others do not. In practice, it can be difficult to determine the scope of true negatives for genome annotations, as these can be considered as limited to some flanking region around the gene in question, the entire intergenic region or even the rest of the genome. Including true negatives in the accuracy calculation also complicates inter-genome comparisons. For example, gene-prediction accuracy will tend to be higher for those genomes with large introns and intergenic regions. For these reasons we have used a simple average, (SN + SP)/2, to measure accuracy.
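As an illustration of the nucleotide-level definitions above, the following sketch computes SN, SP and the simple average AC = (SN + SP)/2 for two features given as lists of exon intervals. The interval convention (0-based, half-open) and the names `covered_bases` and `sn_sp_ac` are assumptions made for this example. How AED is derived from these quantities is not spelled out in this section, so the sketch stops at AC.

```python
from typing import Iterable, Set, Tuple

Interval = Tuple[int, int]  # assumed convention: 0-based, half-open [start, end)


def covered_bases(intervals: Iterable[Interval]) -> Set[int]:
    """Return the set of nucleotide positions covered by the intervals."""
    bases: Set[int] = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def sn_sp_ac(reference: Iterable[Interval],
             prediction: Iterable[Interval]) -> Tuple[float, float, float]:
    """Nucleotide-level sensitivity, specificity and AC = (SN + SP) / 2."""
    ref = covered_bases(reference)
    pred = covered_bases(prediction)
    if not ref or not pred:
        raise ValueError("both features must cover at least one base")
    shared = len(ref & pred)
    sn = shared / len(ref)   # fraction of the reference that is predicted
    sp = shared / len(pred)  # fraction of the prediction overlapping the reference
    return sn, sp, (sn + sp) / 2
```

For instance, if the reference has exons covering 200 bases and the prediction covers 150 of those bases and nothing else, then SN = 0.75, SP = 1.0 and AC = 0.875. Using a set of positions keeps the sketch short; for whole-genome work an interval-arithmetic approach would be more memory efficient.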
From release to release annotations are added, deleted, split and merged. Because gene numbers only tally the ratio of additions to deletions they give little insight into the process of annotation turnover. An obvious approach to investigating annotation turnover would be to follow gene IDs from release to release, but in practice this proved problematic for some of the earlier releases. Instead, we used a reciprocal best-hits approach to investigate the process of annotation turnover, as this provides a general method not dependent upon ID history data, which is not always available. For each genome's collection of releases we searched the transcripts from the most recent release in our collection against the earlier releases and vice versa. If one or more of a gene's alternative transcripts had a reciprocal best hit to a transcript in an earlier release, that gene was considered present in that release, but only so long as all its transcripts' reciprocal best hits were to the same target gene and vice versa. We used exactly the same procedure to track annotations in the other direction as well, i.e. from the first release for each genome forward through subsequent releases.
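The gene-level presence rule described above can be sketched as follows. The sketch assumes transcript-level reciprocal best hits have already been computed as a mapping from transcripts in one release to transcripts in the other, and that each transcript's parent gene is known; the function name `tracked_gene` and the dictionary layouts are hypothetical, and the reading of "vice versa" as "all hits back from the target gene must return to the same gene" is an interpretation.

```python
from typing import Dict, Optional


def tracked_gene(gene: str,
                 tx_to_gene_new: Dict[str, str],
                 tx_to_gene_old: Dict[str, str],
                 rbh: Dict[str, str]) -> Optional[str]:
    """Return the gene in the other release that `gene` corresponds to, or None.

    `rbh` maps transcripts of the release containing `gene` to their
    reciprocal-best-hit transcripts in the other release.
    """
    # Genes hit by any of this gene's transcripts.
    targets = {
        tx_to_gene_old[rbh[tx]]
        for tx, parent in tx_to_gene_new.items()
        if parent == gene and tx in rbh
    }
    if len(targets) != 1:  # no hit at all, or hits spread over several genes
        return None
    target = next(iter(targets))
    # "Vice versa": every hit involving the target gene must come from `gene`.
    sources = {
        tx_to_gene_new[new_tx]
        for new_tx, old_tx in rbh.items()
        if tx_to_gene_old[old_tx] == target
    }
    return target if sources == {gene} else None
```

With this rule, a gene whose transcripts hit two different genes (a split) or a gene that shares its target with another gene (a merge) is not counted as present, which matches the conservative intent of the description.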
H. sapiens: GenBank releases 33 (04/2003), 34.1 (10/2003), 34.2 (01/2004), 34.3 (03/2004), 35.1 (08/2004), 36.1 (03/2006), 36.2 (09/2006). M. musculus: GenBank releases 30 (02/2003), 32.1 (10/2003), 33.1 (09/2004), 34.1 (05/2005), 35.1 (09/2005), 36.1 (05/2006). D. melanogaster: FlyBase releases 3.1 (10/2004), 4.2 (09/2005), 4.3 (03/2006), 5.1 (12/2006). C. elegans: WormBase releases WS100 (05/2003), WS130 (09/2004), WS150 (11/2005), WS160 (07/2006), WS176 (06/2007). A. gambiae: GenBank release (downloaded 08/2007). The H. sapiens, M. musculus and A. gambiae releases were downloaded from ftp://ftp.ncbi.nih.gov/genomes. The D. melanogaster releases were downloaded from http://www.flybase.org, and the C. elegans releases from http://www.wormbase.org.
GenBank releases were converted to Chaos-XML prior to processing using cx_genbank2chaos.pl (http://fruitfly.org/data/chaos-xml/). Older gff3 releases from WormBase were brought forward to current gff3 specifications (http://www.sequenceontology.org/gff3.shtml) using the scripts ws100_forward, ws130_forward, and ws150_foward. Bulk gff3 files for D. melanogaster and C. elegans chromosomes were split into individual annotations along with their accompanying nucleotide sequence using the cgl-gff3 library (http://www.yandell-lab.org). Annotation Edit Distances and Splice Complexities were calculated at the nucleotide level using the scripts splice_distance_nucleo and splice_complexity_nucelo respectively. All code is available at http://www.yandell-lab.org/publications/supp_data/anno_measures.html. After download, the bundle should be uncompressed. A README details requirements and the installation procedure.
This work was supported in part by NIH/NHGRI R01HG004341 to KE and NIH/NHGRI R01HG004694 to MY. The authors would also like to thank M. Ashburner, I. Korf, M. Metzstein, G. Miklos, and M. Reese for many helpful comments on an earlier version of the manuscript.
- Liolios K, Tavernarakis N, Hugenholtz P, Kyrpides NC: The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic acids research 2006, (34 Database):D332–334. 10.1093/nar/gkj145
- Eilbeck K, Lewis SE, Mungall CJ, Yandell M, Stein L, Durbin R, Ashburner M: The Sequence Ontology: a tool for the unification of genome annotations. Genome biology 2005, 6(5):R44. 10.1186/gb-2005-6-5-r44PubMed CentralView ArticlePubMedGoogle Scholar
- Generic Model Organism Database [http://www.gmod.org]
- Mungall CJ, Emmert DB: A Chado case study: an ontology-based modular schema for representing genome-associated biological information. Bioinformatics (Oxford, England) 2007, 23(13):i337–346. 10.1093/bioinformatics/btm189
- Generic Feature Format 3 [http://www.sequenceontology.org/gff3.shtml]
- Robb SM, Ross E, Alvarado AS: SmedGD: the Schmidtea mediterranea genome database. Nucleic acids research 2008, (36 Database):D599–606.
- Misra S, Crosby MA, Mungall CJ, Matthews BB, Campbell KS, Hradecky P, Huang Y, Kaminker JS, Millburn GH, Prochnik SE, et al.: Annotation of the Drosophila melanogaster euchromatic genome: a systematic review. Genome biology 2002, 3(12):RESEARCH0083. 10.1186/gb-2002-3-12-research0083
- Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson CR, Hariharan IK, Fortini ME, Li PW, Apweiler R, Fleischmann W, et al.: Comparative genomics of the eukaryotes. Science 2000, 287(5461):2204–2215. 10.1126/science.287.5461.2204
- Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al.: The sequence of the human genome. Science 2001, 291(5507):1304–1351. 10.1126/science.1058040
- Yandell M, Mungall CJ, Smith C, Prochnik S, Kaminker J, Hartzell G, Lewis S, Rubin GM: Large-scale trends in the evolution of gene structures within 11 animal genomes. PLoS Comput Biol 2006, 2(3):e15. 10.1371/journal.pcbi.0020015
- Lee C, Atanelov L, Modrek B, Xing Y: ASAP: the Alternative Splicing Annotation Project. Nucleic acids research 2003, 31(1):101–105. 10.1093/nar/gkg029
- Nagasaki H, Arita M, Nishizawa T, Suwa M, Gotoh O: Automated classification of alternative splicing and transcriptional initiation and construction of visual database of classified patterns. Bioinformatics (Oxford, England) 2006, 22(10):1211–1216. 10.1093/bioinformatics/btl067
- Burset M, Guigo R: Evaluation of gene structure prediction programs. Genomics 1996, 34(3):353–367. 10.1006/geno.1996.0298
- Reese MG, Hartzell G, Harris NL, Ohler U, Abril JF, Lewis SE: Genome annotation assessment in Drosophila melanogaster. Genome research 2000, 10(4):483–501. 10.1101/gr.10.4.483
- Guigo R, Reese MG: EGASP: collaboration through competition to find human genes. Nature methods 2005, 2(8):575–577. 10.1038/nmeth0805-575
- Reese MG, Guigo R: EGASP: Introduction. Genome biology 2006, 7 Suppl 1: S1.1-S1.3. 10.1186/gb-2006-7-s1-s1
- Crosby MA, Goodman JL, Strelets VB, Zhang P, Gelbart WM: FlyBase: genomes by the dozen. Nucleic acids research 2007, (35 Database):D486–491. 10.1093/nar/gkl827
- Bieri T, Blasiar D, Ozersky P, Antoshechkin I, Bastiani C, Canaran P, Chan J, Chen N, Chen WJ, Davis P, et al.: WormBase: new content and better access. Nucleic acids research 2007, (35 Database):D506–510. 10.1093/nar/gkl818
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank. Nucleic acids research 2007, (35 Database):D21–25. 10.1093/nar/gkl986
- Celniker SE, Rubin GM: The Drosophila melanogaster genome. Annual review of genomics and human genetics 2003, 4: 89–117. 10.1146/annurev.genom.4.070802.110323
- Celniker SE, Wheeler DA, Kronmiller B, Carlson JW, Halpern A, Patel S, Adams M, Champe M, Dugan SP, Frise E, et al.: Finishing a whole-genome shotgun: release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome biology 2002, 3(12):RESEARCH0079. 10.1186/gb-2002-3-12-research0079
- Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research 2005, (33 Database):D501–504.
- Guigo R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, et al.: EGASP: the human ENCODE Genome Annotation Assessment Project. Genome biology 2006, 7(Suppl 1):S2.1–31. 10.1186/gb-2006-7-s1-s2
- Schmucker D, Clemens JC, Shu H, Worby CA, Xiao J, Muda M, Dixon JE, Zipursky SL: Drosophila Dscam is an axon guidance receptor exhibiting extraordinary molecular diversity. Cell 2000, 101(6):671–684. 10.1016/S0092-8674(00)80878-8
- Malko DB, Makeev VJ, Mironov AA, Gelfand MS: Evolution of exon-intron structure and alternative splicing in fruit flies and malarial mosquito genomes. Genome research 2006, 16(4):505–509. 10.1101/gr.4236606
- Spearman's rank correlation coefficient [http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient]
- National Center for Biotechnology Information [http://www.ncbi.nlm.nih.gov/genome/guide/human/release_notes.html]
- Yandell M, Bailey AM, Misra S, Shu S, Wiel C, Evans-Holm M, Celniker SE, Rubin GM: A computational and experimental approach to validating annotations and gene predictions in the Drosophila melanogaster genome. Proceedings of the National Academy of Sciences of the United States of America 2005, 102(5):1566–1571. 10.1073/pnas.0409421102
- Wei C, Lamesch P, Arumugam M, Rosenberg J, Hu P, Vidal M, Brent MR: Closing in on the C. elegans ORFeome by cloning TWINSCAN predictions. Genome Res 2005, 15(4):577–582. 10.1101/gr.3329005
- Thanaraj TA, Clark F, Muilu J: Conservation of human alternative splice events in mouse. Nucleic Acids Res 2003, 31(10):2544–2552. 10.1093/nar/gkg355
- Nurtdinov RN, Artamonova II, Mironov AA, Gelfand MS: Low conservation of alternative splicing patterns in the human and mouse genomes. Hum Mol Genet 2003, 12(11):1313–1320. 10.1093/hmg/ddg137
- Modrek B, Lee C: A genomic view of alternative splicing. Nature genetics 2002, 30(1):13–19. 10.1038/ng0102-13
- Remm M, Storm CE, Sonnhammer EL: Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 2001, 314(5):1041–1052. 10.1006/jmbi.2000.5197
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.