Assessment and refinement of eukaryotic gene structure prediction with gene-structure-aware multiple protein sequence alignment
© Gotoh et al.; licensee BioMed Central Ltd. 2014
Received: 19 March 2014
Accepted: 9 June 2014
Published: 14 June 2014
Accurate computational identification of eukaryotic gene organization is a long-standing problem. Despite the fundamental importance of precise annotation of genes encoded in newly sequenced genomes, the accuracy of predicted gene structures has not been critically evaluated, mostly due to the scarcity of proper assessment methods.
We present a gene-structure-aware multiple sequence alignment method for gene prediction using amino acid sequences translated from homologous genes from many genomes. The approach provides rich information concerning the reliability of each predicted gene structure. We have also devised an iterative method that attempts to improve the structures of suspiciously predicted genes based on a spliced alignment algorithm using consensus sequences or reliable homologs as templates. Application of our methods to cytochrome P450 and ribosomal proteins from 47 plant genomes indicated that 50 ~ 60 % of the annotated gene structures are likely to contain some defects. Whereas more than half of the defect-containing genes may be intrinsically broken, i.e. they are pseudogenes or gene fragments, located in unfinished sequencing areas, or corresponding to non-productive isoforms, the defects found in a majority of the remaining gene candidates can be remedied by our iterative refinement method.
Refinement of eukaryotic gene structures mediated by gene-structure-aware multiple protein sequence alignment is a useful strategy to dramatically improve the overall prediction quality of a set of homologous genes. Our method will be applicable to various families of protein-coding genes if their domain structures are evolutionarily stable. It is also feasible to apply our method to gene families from all kingdoms of life, not just plants.
KeywordsGenome annotation Gene prediction Gene structure Multiple sequence alignment Spliced alignment Cytochrome P450 Ribosomal proteins
Recent progress in DNA sequencing technologies is making it possible to determine thousands of eukaryotic genomic sequences [1, 2]. On an even larger scale, the China National Genebank has launched the Three Million Genomes (3 M) Project to sequence one million human genomes, one million plant and animal genomes and one million micro-ecosystem genomes . To gain full use of the genomic information, the next step after sequencing is to annotate genes encoded in each genome. Several genome annotation pipelines have been established and actively used in a number of genome projects [4–6]. One of the most important products of genome annotation is a set of amino acid sequences translated from predicted protein-coding genes. Obviously, the quality of amino acid sequences, which in turn depends on the accuracy of gene prediction, profoundly affects the reliability of downstream analyses, such as functional implications, 3D-structure prediction of proteins, and evolutionary inference of the genes and species. The recently emerging method of co-evolutionary 3D structure prediction requires even larger numbers of accurate protein sequences [7–9] than the traditional template-based homology modelling methods .
Currently, three major categories of methods are used in computational prediction of protein-coding genes [11, 12]. The ab inito methods rely on only the statistical features of the “query” genomic sequence to be analysed; the de novo or comparative genomic methods use the information of sequence conservation and variation between the query genome and one or more related “reference” genome(s); and the transcript-dependent methods use known transcript sequences, such as full-length cDNAs, ESTs, and proteins of cognate or related genomes as templates. The accuracy and coverage of gene prediction can be improved by combining several lines of information, and present annotation pipelines usually adopt several distinct categories of methods depending on the available information. Although human intervention can further improve the quality of annotation [13, 14], manual annotation by experts is impractical to apply to all the genomes whose number is growing rapidly. Thus, the vast majority of genome annotations rely on high-throughput automated methods. However, assessment of the quality of the products of such procedures has not been well explored, mostly due to the scarcity of proper assessment methods.
In this work, we present gene-structure-aware multiple protein sequence alignment (GSA-MPSA) as a powerful tool to evaluate and refine a set of homologous (orthologous and paralogous) gene structures. A GSA-MPSA is constructed from translated amino-acid sequences supplemented with the location and phase of introns along the coding sequence (CDS) of the parental (predicted) genes. Our approach is based on the facts that (i) long insertions/deletions are rare among closely related homologous protein sequences and (ii) positions of introns in a set of homologous genes are generally well conserved . Thus, we can expect high reliability of gene predictions if we observe few gaps and a concordant distribution of intron positions in the GSA-MPSA. Conversely, existence of long gaps or a discordant distribution of intron positions is a strong indicator of some defects including annotation errors.
Although many MPSA methods have been developed so far ( for the latest reviews), none is designed to incorporate the information about exon-intron organization of the parental genes. Hence, we extended our iterative MPSA method Prrn [17, 18] so that conserved intron positions should gain an additional bonus in the objective function to be optimized. The resultant MPSA labelled with intron positions is parsed to see the distribution of indels and also the distribution of the concordant/discordant intron sites with the help of an outlier statistical analysis. We have also extended our spliced alignment program Aln  so that it can use a generalized profile as the template. Generalized profiles have been used in internal routines of Prrn  to detect and evaluate gaps exactly and efficiently, where “generalized” means that various lengths of internal gaps, as well as ordinary residues, are treated in the form of a profile . We have implemented a tool named “Refgs.pl” (refine gene structures) that tries to successively improve individual gene structures of the members constituting the GSA-MPSA by running Prrn and Aln in an iterative manner. To test the performance of Refgs.pl, we applied our method to the cytochrome P450 and ribosomal protein genes annotated in 47 plant genome datasets, most of which were obtained from Phytozome Ver. 9.0 . These gene (super-) families were chosen as representatives of two distinct categories of genes, a divergent multi-gene superfamily that codes for enzymes and a set of unique gene families that code for individual subunits of a large protein-RNA complex, respectively. The results indicate that Refgs.pl is highly effective at detecting various kinds of defects including annotation errors, a majority of which can be corrected by the iterative refinement procedure. External tests with EST mapping and manual inspection indicated that nearly 99% of exon-intron boundaries are correctly assigned after a series of refinements by Refgs.pl, and the overall quality of gene prediction is comparable with that obtained by specialists in the field.
Performance of spliced alignment with a profile template
Assessment and refinement of gene structures by Refgs.pl
EST-based and manual assessments
Most genome annotation pipelines utilize the results of EST mapping on the relevant genome as a component of evidence-based prediction [31, 32]. Hence, the false rates assessed through our mapping results are expected to be small. In fact, the sensitivity (SN) and specificity (SP) at the intron level (both ends should be correct) exceed 96% and 94%, respectively (Additional file 2: Table S1A, B), when all retrieved sequences before filtering were examined. The two steps of filtering improved both SN and SP by ~1% (Table S1C, D). Although marginal, CR-M1-PR refinement with MaxCluster = 25 or 50 further improved SN and SP (Table S1E, F) despite the fact that no information about EST mapping is incorporated in the refinement process.
For human and several model organisms [38–40], the catalogue of their protein-coding genes is nearly complete. The situation is completely different for the vast majority of organisms whose genomic sequences are recently revealed. Although high-throughput gene annotation pipelines are widely used, the accuracy of such products has not been well studied. To our knowledge, MisPred [41, 42] is the sole practical tool to search for potentially erroneous annotations. To evaluate the reliability of a prediction, MisPred examines macroscopic features of the predicted amino acid sequences. Accordingly, MisPred is not necessarily effective to discover individual defects, especially subtle defects. Moreover, MisPred affords no way to correct annotation errors it finds. We have shown here that a GSA-MPSA constructed from close homologues provides rich information about the reliability of each predicted gene structure. Thanks to the recent progress in many genome projects undertaken in parallel, collection of appropriate homologs is now much easier than a few years ago. We have also shown that GSA-MPSA can facilitate not only evaluation but also refinement of the predicted gene structures. The improvement in the quality of prediction is demonstrated from various viewpoints as shown in Figures 3 and 4. However, a more direct impression would be obtained by looking into the alignments themselves. Additional file 1: Figure S1 presents several examples of GSA-MPSAs of the same genes before and after refinement. The smaller numbers of indels and better aligned intron positions after refinement compared with those in the original GSA-MPSAs will give an intuitive support for the efficacy of our refinement procedure.
In essence, our refinement strategy relies on the consensus or the “decision by majority” rule. In general, consensus-based approaches fail when the initial predictions are scarce, poor, or highly heterogeneous. In fact, of the 484 clusters of P450 and 1217 clusters of ribosomal proteins with MaxCluster = 50, the CR-M1-PR refinement worsened the nWSP scores for 20 and 43 clusters (211 and 368 sequences involved), whereas the nWSP scores remained unchanged for 38 and 119 clusters (164 and 580 sequences) and improved for the rest of the 426 and 1055 clusters (8083 and 13470 sequences), respectively. In addition, 126 P450 and 459 ribosomal protein sequences didn’t belong to any cluster due to the lower limit in the minimal size of clusters and had no chance to be evaluated or refined. Roughly speaking, therefore, our refinement procedure is effective or neutral for about 93 ~ 96% but inadequate for about 4 ~ 7% of sequences that passed the two filtering criteria (Methods). (Note that the fraction of inadequate cases for ribosomal proteins is inflated as the initial filtering imposed no limitation in length.) One potential way to cope with these difficult cases would be to improve the profile method so that more distant homologs can be used as the templates, or pre-annotated sequences from human curated datasets could be used. In addition, the 1000 plant transcriptome project now in progress  will provide a broad sampling of plant gene diversity including distant homologs currently lacking as templates.
Numbers of sequences at various stages of data processing
Last stage R-type
Last stage Q-type
As our method does not directly refer to the available cDNA/EST information, the assessment based on the EST mapping results can be regarded as an external test. We consider that the sensitivity and specificity of 97 ~ 98% at the intron level and 98 ~ 99% at the junction level are quite high. One reason for this high accuracy might be because the original annotations had already incorporated EST mapping information and some annotations had been manually curated. It is an interesting next theme to examine how our approach performs without relying on existing annotations.
In the present study, we analysed ca. 104 P450 and ribosomal protein sequences. Although the Phytozome dataset are generally well organized, the formats of gene annotation of various resources are not completely uniform, rendering the preparation of the initial sets of amino acid sequences supplemented with parental gene structure information rather laborious. Once the initial setting was completed, however, the entire process including clustering, assessment, and refinement finished within an hour on our computer system (CentOS 6.2, Xeon E5-2687 W, 3.1GHz, 8 × 2 cores, 64GB memory, 2 TB hard disk) when MaxCluster is set to 50 or smaller (Figures 3 and 4). It was feasible because the post-clustering processes are executable in parallel, where we used ten cores/threads in the present experiments. Thus, it would not be much additional burden to analyse an order of magnitude larger number of sequences in a personal computational environment, and this may be scaled further with an institutional computer system.
We have shown that GSA-MPSA-mediated refinement of eukaryotic gene structures is a useful automated strategy to dramatically improve the overall prediction quality of a set of homologous genes. Our method will be applicable to various families of protein-coding genes if their domain structures are evolutionarily stable. It is also feasible to apply our method to gene families in any kingdoms of life. The C++ source codes of Aln, Prrn, and Spaln, the engines of our strategy, are available from our web site . The Perl scripts of Refgs.pl that organizes the refinement process together with source codes/scripts of the associated programs are also made available from the same site as Prrn/Aln.
Data retrieval and preparation of initial GSA-MPSA
Methods for GSA-MPSA and spliced alignment
The original Prrn algorithm  attempts to maximize the weighted sum-of-pairs score by doubly-nested iterative refinement methods using the exact group-to-sequence or group-to-group pairwise alignment algorithm (Algorithm D in ) at each iterative step. To adapt Prrn to the present task, we have made several simplifications and extensions. First, a bonus is added to the objective function when the intron positions match with each other along the conceptual CDSs of an aligned pair of members. A match of intron positions means that not only the aligned columns but also the phases must be identical. Second, a double affine gap penalty can be used by option, while the forward algorithm is simplified compared with that of Prime , an extended version of Prrn that can handle long gaps more adequately than Prrn. Third, instead of the exact algorithm, an approximate algorithm (Algorithm C in ) is used at each step of pairwise alignment. With the Algorithm C, opening of gaps is evaluated exactly but the time-consuming optimization step is replaced by a more economical greedy method. Fourth, recalculation of pair weights was suppressed, i.e. the outer loop is run only once. Finally, Prrn now supports an “update” option with which the sequences with the same identifiers as newly given ones are first removed from the older MSA, new members are added one by one to the existing MSA, pair weights are calculated, and then iterative refinement is performed as usual. This option is particularly useful to keep consistency between the GSA-MPSA and the member sequences revised by new prediction of gene structures.
To align a genomic sequence and an MSA/GSA-MPSA, as well as a single amino acid sequence, we extended our spliced alignment program Aln  so that it can handle a generalized profile. Aln also adopts Algorithm C as the dynamic programming-based alignment engine when MSA/GSA-MPSA is assigned as the template. Matching intron positions are given a bonus as described above and earlier . For most of the genomes we examined, species-specific parameter sets were available by the method described earlier . If that was not the case, the parameter set of the evolutionarily closest species was used. Aln shares many subroutines with Spaln in common, and is essentially the same as Spaln with the –Q0 option when a single protein sequence is given as the template. However, unlike Spaln, Aln does not currently support the anchoring-based faster calculation mode.
Assessment of predicted gene structures with GSA-MPSA
To find unusually long insertions or deletions in the given GSA-MPSA, we first locate conserved blocks within the GSA-MPSA by a heuristic score-based method  (Figure 8B). The lengths (numbers of residues) of the individual members within each variable region flanked by or outside such conserved blocks are subjected to recursive outlier analyses with Dixon’s method . The significance level α is adjustable but α = 0.1 is used throughout this study. Long deletions at either end are specifically marked to indicate that the target genomic region should be extended more than usual in the next prediction cycle. Within each variable region, average sequence divergence between one member and the rest of the members is calculated in turn, and the set of divergence values are also subjected to the outlier analyses. This test is incorporated primarily to detect compensating frame shifts and also to detect alternatively spliced exons. At each time of outlier detection in either of the tests, a “defect point” of one is added to the corresponding member. The defect point is also incremented by one if the host gene has at least one “lonesome” or “discordant” intron, where a lonesome intron is an intron that finds no mate at the same position in other members (Figure 8B). The predicted genes were further examined for frame shift(s), premature termination codon(s), or ambiguous residue(s). Every such abnormality contributes to the defect point total by two. If the total defect point is zero, we assign the label “reliable” or “R”, if greater than zero and less than a given threshold, MaxDefect, we assign “questionable” or “Q”, and otherwise we assign “Pseudogene” or “P” to that member. We used MaxDefect = 2 in this study unless specifically remarked. In addition, we calculated several quantities, including the total number of outliers, total number of lonesome introns, variance in sequence lengths within the variable regions, and normalized sum-of-pairs (nSP) or normalized weighted sum-of-pairs (nWSP) score of the alignments. The normalization is taken with respect to both the number of sequence pairs in an MPSA and the length (number of columns) of the MPSA.
In addition to the GSA-MPSA-based assessment mentioned above, we also tried two other approaches. One is an automated method based on available EST/cDNA sequences; the EST/cDNA sequences were mapped on the cognate genomic sequence by Spaln  with the –LS option for local similarity search, and exon/intron boundaries thus inferred were compared with those of predicted genes. The other approach is manual inspection of various data, e.g. intactness of important linear motifs conserved among nearly all P450 protein sequences [57, 58]. Because of the intrinsically subjective nature and the high human cost, we experimentally applied this approach to only P450 genes in two representative genomes (peach and maize).
Iterative refinement of gene structures by Refgs.pl
Refgs.pl is a Perl script that achieves a cycle of assessments by Prrn and alignments by Aln until predicted gene structures no longer change, no “Q” gene remains, or up to a pre-specified number of times. We designed three modes which differ from one another by the choice of the templates. The “minus one” (M1) mode withdraws one sequence from the existing GSA-MPSA, and the structure of the corresponding gene is re-examined using the profile constructed from all the rest of the members. The GSA-MPSA is updated with the amino acid sequence translated from the revised gene, and the process is repeated in turn for every member. The “closest reliable” (CR) and the “reliable profile” (PR) modes update only the genes that are assigned “Q”; the former selects the template that is most similar to the gene in question from the “R” category, whereas the latter uses the profile derived from all the sequences in the “R” category as the template. We also examined all possible combinations of the three modes up to three series so that no consecutive modes should be identical, i.e. M1, CR, PR, M1-CR, …, PR-CR, M1-CR-M1, …, PR-CR-PR.
Expressed sequence tag
Gene-structure-aware multiple protein sequence alignment
Normalized sum-of-pairs score
Normalized weighted sum-of-pairs score.
We thank Dr. H. Iwata (Kyushu University) for a help in preparation of species-specific parameters for Aln and Spaln. This work was supported by the National Science Foundation [IOS-1242275 to D.R.N. and subaward (No. 9000009989) to O.G.], and JSPS KAKENHI B [Grant Number 22310124 to O.G.].
- Haussler D, O’Brien SJ, Ryder OA: Genome 10 K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species. J Hered. 2009, 100: 659-674.View ArticleGoogle Scholar
- Grigoriev IV, Nikitin R, Haridas S, Kuo A, Ohm R, Otillar R, Riley R, Salamov A, Zhao X, Korzeniewski F, Smirnova T, Nordberg H, Dubchak I, Shabalov I: MycoCosm portal: gearing up for 1000 fungal genomes. Nucleic Acids Res. 2014, 42: D699-D704. 10.1093/nar/gkt1183.View ArticlePubMed CentralPubMedGoogle Scholar
- 3-Million Genomes Project. http://www.nationalgenebank.org/en/research.html,
- Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M: The Ensembl automatic gene annotation system. Genome Res. 2004, 14: 942-950. 10.1101/gr.1858004.View ArticlePubMed CentralPubMedGoogle Scholar
- Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR, Wortman JR: Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome biology. 2008, 9: R7-10.1186/gb-2008-9-1-r7.View ArticlePubMed CentralPubMedGoogle Scholar
- Madupu R, Brinkac LM, Harrow J, Wilming LG, Bohme U, Lamesch P, Hannick LI: Meeting report: a workshop on Best Practices in Genome Annotation. Database. 2010, 2010: baq001-View ArticlePubMed CentralPubMedGoogle Scholar
- Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, Zecchina R, Onuchic JN, Hwa T, Weigt M: Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci U S A. 2011, 108: E1293-E1301. 10.1073/pnas.1111471108.View ArticlePubMed CentralPubMedGoogle Scholar
- Jones DT, Buchan DW, Cozzetto D, Pontil M: PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics. 2012, 28: 184-190. 10.1093/bioinformatics/btr638.View ArticlePubMedGoogle Scholar
- Kamisetty H, Ovchinnikov S, Baker D: Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci U S A. 2013, 110: 15674-15679. 10.1073/pnas.1314045110.View ArticlePubMed CentralPubMedGoogle Scholar
- Sali A, Blundell TL: Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993, 234: 779-815. 10.1006/jmbi.1993.1626.View ArticlePubMedGoogle Scholar
- Harrow J, Nagy A, Reymond A, Alioto T, Patthy L, Antonarakis SE, Guigo R: Identifying protein-coding genes in genomic sequences. Genome Biol. 2009, 10: 201-10.1186/gb-2009-10-1-201.View ArticlePubMed CentralPubMedGoogle Scholar
- Yandell M, Ence D: A beginner's guide to eukaryotic genome annotation. Nat Rev Genet. 2012, 13: 329-342. 10.1038/nrg3174.View ArticlePubMedGoogle Scholar
- Ashurst JL, Chen CK, Gilbert JG, Jekosch K, Keenan S, Meidl P, Searle SM, Stalker J, Storey R, Trevanion S, Wilming L, Hubbard T: The Vertebrate Genome Annotation (Vega) database. Nucleic Acids Res. 2005, 33: D459-D465.View ArticlePubMed CentralPubMedGoogle Scholar
- Estill JC, Bennetzen JL: The DAWGPAWS pipeline for the annotation of genes and transposable elements in plant genomes. Plant methods. 2009, 5: 8-10.1186/1746-4811-5-8.View ArticlePubMed CentralPubMedGoogle Scholar
- Rogozin IB, Carmel L, Csuros M, Koonin EV: Origin and evolution of spliceosomal introns. Biology direct. 2012, 7: 11-10.1186/1745-6150-7-11.View ArticlePubMed CentralPubMedGoogle Scholar
- Russell DJ: Methods in Molecular Biology, Volume 1079. Multiple sequence alignment methods. 2013, New York: Humana Press – SpringerGoogle Scholar
- Gotoh O: Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol. 1996, 264: 823-838. 10.1006/jmbi.1996.0679.View ArticlePubMedGoogle Scholar
- Gotoh O: Multiple sequence alignment: algorithms and applications. Adv Biophys. 1999, 36: 159-206.View ArticlePubMedGoogle Scholar
- Gotoh O: Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps. Bioinformatics. 2000, 16: 190-202. 10.1093/bioinformatics/16.3.190.View ArticlePubMedGoogle Scholar
- Gotoh O: Further improvement in methods of group-to-group sequence alignment with generalized profile operations. Comput Appl Biosci. 1994, 10: 379-387.PubMedGoogle Scholar
- Gribskov M, McLachlan AD, Eisenberg D: Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A. 1987, 84: 4355-4358. 10.1073/pnas.84.13.4355.View ArticlePubMed CentralPubMedGoogle Scholar
- Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N, Rokhsar DS: Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012, 40: D1178-D1186. 10.1093/nar/gkr944.View ArticlePubMed CentralPubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.View ArticlePubMed CentralPubMedGoogle Scholar
- Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999, 292: 195-202. 10.1006/jmbi.1999.3091.View ArticlePubMedGoogle Scholar
- Soding J, Remmert M: Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Curr Opin Struct Biol. 2011, 21: 404-411. 10.1016/j.sbi.2011.03.005.View ArticlePubMedGoogle Scholar
- Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005, 33: 511-518. 10.1093/nar/gki198.View ArticlePubMed CentralPubMedGoogle Scholar
- Iwata H, Gotoh O: Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features. Nucleic Acids Res. 2012, 40: e161-10.1093/nar/gks708.View ArticlePubMed CentralPubMedGoogle Scholar
- Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Res. 2004, 14: 988-995. 10.1101/gr.1865504.View ArticlePubMed CentralPubMedGoogle Scholar
- Meyer IM, Durbin R: Gene structure conservation aids similarity based gene prediction. Nucleic Acids Res. 2004, 32: 776-783. 10.1093/nar/gkh211.View ArticlePubMed CentralPubMedGoogle Scholar
- Gotoh O: Direct mapping and alignment of protein sequences onto genomic sequence. Bioinformatics. 2008, 24: 2438-2444. 10.1093/bioinformatics/btn460.View ArticlePubMedGoogle Scholar
- Stanke M, Tzvetkova A, Morgenstern B: AUGUSTUS at EGASP: using EST, protein and genomic alignments for improved gene prediction in the human genome. Genome Biol. 2006, 7 (1): S11-11–18View ArticlePubMed CentralPubMedGoogle Scholar
- Allen JE, Salzberg SL: JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics. 2005, 21: 3596-3603. 10.1093/bioinformatics/bti609.View ArticlePubMedGoogle Scholar
- Nagasaki H, Arita M, Nishizawa T, Suwa M, Gotoh O: Species-specific variation of alternative splicing and transcriptional initiation in six eukaryotes. Gene. 2005, 364: 53-62.View ArticlePubMedGoogle Scholar
- Wang BB, Brendel V: Genomewide comparative analysis of alternative splicing in plants. Proc Natl Acad Sci U S A. 2006, 103: 7175-7180. 10.1073/pnas.0602039103.View ArticlePubMed CentralPubMedGoogle Scholar
- Verde I, Abbott AG, Scalabrin S, Jung S, Shu S, Marroni F, Zhebentyayeva T, Dettori MT, Grimwood J, Cattonaro F, Zuccolo A, Rossini L, Jenkins J, Vendramin E, Meisel LA, Decroocq V, Sosinski B, Prochnik S, Mitros T, Policriti A, Cipriani G, Dondini L, Ficklin S, Goodstein DM, Xuan P, Del Fabbro C, Aramini V, Copetti D, Gonzalez S, Horner DS, et al: The high-quality draft genome of peach (Prunus persica) identifies unique patterns of genetic diversity, domestication and genome evolution. Nat Genet. 2013, 45: 487-494. 10.1038/ng.2586.View ArticlePubMedGoogle Scholar
- Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, Liang C, Zhang J, Fulton L, Graves TA, Minx P, Reily AD, Courtney L, Kruchowski SS, Tomlinson C, Strong C, Delehaunty K, Fronick C, Courtney B, Rock SM, Belter E, Du F, Kim K, Abbott RM, Cotton M, Levy A, Marchetto P, Ochoa K, Jackson SM, Gillam B, et al: The B73 maize genome: complexity, diversity, and dynamics. Science. 2009, 326: 1112-1115. 10.1126/science.1178534.View ArticlePubMedGoogle Scholar
- Liang C, Mao L, Ware D, Stein L: Evidence-based gene predictions in plant genomes. Genome Res. 2009, 19: 1912-1923. 10.1101/gr.088997.108.View ArticlePubMed CentralPubMedGoogle Scholar
- Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G, Sethuraman A, Weng S, Botstein D, Cherry JM: Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res. 2002, 30: 69-72. 10.1093/nar/30.1.69.View ArticlePubMed CentralPubMedGoogle Scholar
- Yook K, Harris TW, Bieri T, Cabunoc A, Chan J, Chen WJ, Davis P, de la Cruz N, Duong A, Fang R, Ganesan U, Grove C, Howe K, Kadam S, Kishore R, Lee R, Li Y, Muller HM, Nakamura C, Nash B, Ozersky P, Paulini M, Raciti D, Rangarajan A, Schindelman G, Shi X, Schwarz EM, Ann Tuli M, Van Auken K, Wang D, et al: WormBase 2012: more genomes, more data, new website. Nucleic Acids Res. 2012, 40: D735-D741. 10.1093/nar/gkr954.View ArticlePubMed CentralPubMedGoogle Scholar
- Bernstein BE, Birney E, Dunham I, Green ED, Gunter C, Snyder M: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012, 489: 57-74. 10.1038/nature11247.View ArticleGoogle Scholar
- Nagy A, Hegyi H, Farkas K, Tordai H, Kozma E, Banyai L, Patthy L: Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics. 2008, 9: 353-10.1186/1471-2105-9-353.View ArticlePubMed CentralPubMedGoogle Scholar
- Nagy A, Patthy L: MisPred: a resource for identification of erroneous protein sequences in public databases. Database. 2013, 2013: bat053-View ArticlePubMed CentralPubMedGoogle Scholar
- The 1KP Project. http://onekp.com/project.html,
- Alignment Program. http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user/,
- UniGene. http://www.ncbi.nlm.nih.gov/unigene,
- GenBank. http://www.ncbi.nlm.nih.gov/genbank,
- Dong Q, Schlueter SD, Brendel V: PlantGDB, plant genome database and analysis tools. Nucleic Acids Res. 2004, 32: D354-D359. 10.1093/nar/gkh046.View ArticlePubMed CentralPubMedGoogle Scholar
- Gotoh O: A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence. Nucleic Acids Res. 2008, 36: 2630-2638. 10.1093/nar/gkn105.View ArticlePubMed CentralPubMedGoogle Scholar
- Jones DT, Taylor WR, Thornton JM: The rapid generation of mutation data matrices from protein sequences. Comp Appl Biosci. 1992, 8: 275-282.PubMedGoogle Scholar
- Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. Atlas of protein sequence and structure, Volume 3. Edited by: Dayhoff MO. 1978, Silver Spring, ML: National Biomedical Research Foundation, 345-352. 5Google Scholar
- Nelson DR, Kamataki T, Waxman DJ, Guengerich FP, Estabrook RW, Feyereisen R, Gonzalez FJ, Coon MJ, Gunsalus IC, Gotoh O, Okuda K, Nebert DW: The P450 superfamily: update on new sequences, gene mapping, accession numbers, early trivial names of enzymes, and nomenclature. DNA Cell Biol. 1993, 12: 1-51. 10.1089/dna.1993.12.1.View ArticlePubMedGoogle Scholar
- Gotoh O: Optimal alignment between groups of sequences and its application to multiple sequence alignment. Comp Appl Biosci. 1993, 9: 361-370.PubMedGoogle Scholar
- Yamada S, Gotoh O, Yamana H: Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost. BMC Bioinformatics. 2006, 7: 524-10.1186/1471-2105-7-524.View ArticlePubMed CentralPubMedGoogle Scholar
- Iwata H, Gotoh O: Comparative analysis of information contents relevant to recognition of introns in many species. BMC Genomics. 2011, 12: 45-10.1186/1471-2164-12-45.View ArticlePubMed CentralPubMedGoogle Scholar
- Yamada S, Gotoh O, Yamana H: Improvement in Speed and Accuracy of Multiple Sequence Alignment Program PRIME. Inform Media Tech. 2009, 4: 317-327.Google Scholar
- Verma SP, Quiroz-Ruiz A: Critical values for six Dixon tests for outliers in normal samples up to sizes 100, and applications in science and engineering. Revista Mexicana de Ciencias Geológicas. 2006, 23: 133-161.Google Scholar
- Nelson DR, Strobel HW: Secondary structure prediction of 52 membrane-bound cytochromes P450 shows a strong structural similarity to P450cam. Biochemistry. 1989, 28: 656-660. 10.1021/bi00428a036.View ArticlePubMedGoogle Scholar
- Gotoh O: Substrate recognition sites in cytochrome P450 family 2 (CYP2) proteins inferred from comparative analyses of amino acid and coding nucleotide sequences. J Biol Chem. 1992, 267: 83-90.PubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.