A method of precise mRNA/DNA homology-based gene structure prediction
© Churbanov et al; licensee BioMed Central Ltd. 2005
Received: 08 March 2005
Accepted: 21 October 2005
Published: 21 October 2005
Accurate and automatic gene finding and structural prediction is a common problem in bioinformatics, and applications need to be capable of handling non-canonical splice sites, micro-exons and partial gene structure predictions that span across several genomic clones.
We present a mRNA/DNA homology based gene structure prediction tool, GIGOgene. We use a new affine gap penalty splice-enhanced global alignment algorithm running in linear memory for a high quality annotation of splice sites. Our tool includes a novel algorithm to assemble partial gene structure predictions using interval graphs. GIGOgene exhibited a sensitivity of 99.08% and a specificity of 99.98% on the Genie learning set, and demonstrated a higher quality of gene structural prediction when compared to Sim4, est2genome, Spidey, Galahad and BLAT, including when genes contained micro-exons and non-canonical splice sites. GIGOgene showed an acceptable loss of prediction quality when confronted with a noisy Genie learning set simulating ESTs.
GIGOgene shows a higher quality of gene structure prediction for mRNA/DNA spliced alignment when compared to other available tools.
A vast amount of genomic data, including most of the human genome , is now available in publicly accessible databases, and the deposition of additional data continues at a rapid pace. Genomic data requires meticulous interpretation and annotation for meaningful information to be extracted. Genes, the most important functional blocks in the human genome, require exact structural annotation for future biological experiments such as reverse genetics and microarray experiments.
Most of the human genes have piecewise structure with a number of exons separated by introns. Introns are excised from original gene transcripts (pre-mRNA) to form mature mRNA. By aligning mRNA with originating genomic clones, we can reconstruct gene structure.
Several fast and efficient tools, such as BLAST  and BLASTX , were introduced in the early 90's to search databases for homologous blocks, an essential component of all gene structural prediction algorithms. An original method of gene structure prediction based on a set of protein-DNA blocks , implemented in GeneBuilder, was followed by Procrustes implementation . Later, there were numerous implementations exploiting the idea of homology-based gene structure prediction, including GeneSeqer with SplicePredictor , AAT , EbEST , ESTMAP , TAP , Sim4 , Spidey , GrailEXP Galahad , BLAT  and est2genome . Other genome annotation software is described in [16, 17].
Homology-based methods of gene structure prediction, referred to as spliced alignment, are often classified according to the homology type they employ (DNA/DNA, DNA/mRNA, DNA/Protein, etc.) ; frequently, programs employ more than one homology type. The purpose of a spliced alignment algorithm is to explore all possible assemblies of potential exons (blocks) to find a chain of exons which best fits an mRNA target sequence.
In this paper we discuss GIGOgene, a gene structure prediction tool. GIGOgene, like existing spliced alignment software [11, 16], can deal with repeating domains, paralogs and pseudogenes. In addition, GIGOgene is capable of combining structural prediction of a gene from partial gene models that span across several genomic clones. The key to GIGOgene higher precision, in the case of mRNA/DNA spliced alignment, is in the use of new splice-enhanced affine gap penalty global alignment for noise-tolerant recovery of exon-intron boundaries, including non-canonical splice sites, with simultaneous prediction of short exons. GIGOgene uses a filtering step to remove suboptimal blocks for better prediction quality.
Before we proceed with formal description of methods, we need to define a High-scoring Segment Pair (HSP), otherwise known as a block. In the context of this paper, an HSP is a statistically significant alignment between segments (subsequences) in DNA and mRNA obtained from a BLASTN result. Parameters characterizing HSPs include location in the mRNA query and in the DNA target sequence, and different quality values such as expectation value (E), percent identity, and score.
Below, we provide a brief description of the steps in our gene structural prediction process. Some of these steps are self-explanatory, while others are considered in detail in the following subsections:
Step 1 Align curated mRNA sequence(s) with DNA target sequence database using BLASTN .
Step 2 Parse the BLASTN output and select genomic clones that score above 200 bits with an expectation value of no more than 1e 1. Through experimentation we have determined that these values are optimal for recovery of most of the essential HSP sets needed for further analysis. These values can be easily adjusted.
Step 3 By pairwise comparative analysis of an HSP set for each selected genomic clone, exclude HSPs with mRNA segments totally within other larger mRNA HSP segments. The longest HSP is assumed to contain the true exonic boundaries; shorter subHSPs usually result from paralogous and pseudogenic matches.
Step 4 Disambiguate the HSP sequences for all the selected clones, as discussed [see Subsection Algorithm for an unambiguous HSP sequences allocation]. The result of this step is a set of unambiguous HSP sequences.
Step 5 Build an interval graph of overlapping unambiguous HSP sequences. The interval graph captures intersection relations of nodes (unambiguous HSP sequences) as we put edges between nodes when nodes belong to different genomic clones, while their mRNA composite segments intersect. Edges between HSP sequences from the same genomic clone are not allowed.
Step 7 Compact the interval graph, as discussed [see Subsection Joining unambiguous HSP sequences]. This results in the biggest composite genomic clone containing the maximum number of possible exons.
Step 8 Use splice-enhanced affine gap penalty global alignment to identify possible intron/exon boundaries in the composite genomic clone, as discussed [see Subsection Splice-enhanced affine gap penalty global alignment].
Step 9 Extract intron and exon segments from the composite genomic clone and print a report.
Algorithm for an unambiguous HSP sequences allocation
Transcripts are always linear. Thus, we require the set of HSPs to be sequential (we refer to this as the sequential rule).
Splicing of pre-mRNA does not introduce any alternations in the order of exons.
Alternative splicing does not affect the order of exons in a gene.
The similarity of homologous fragments of a gene gradually decreases due to sporadic mutations. As a result, HSPs from the real gene usually have higher scores than HSPs from corresponding pseudogene(s) or paralog(s), as schematically shown in Figure 2. We thus reject unambiguous HSP sequences with average percent identities below a certain threshold; a threshold of 97% produced good results in our experiments with mRNAs. Threshold value could be easily adjusted, if needed, to find gene structure with distant homologs, such as Expressed Sequence Tags (ESTs).
Pre-mRNA splicing results in mature mRNA, with exons arranged side by side. In the case of an HSP sequence containing potential exons, we require the entire mRNA transcript to be covered with segments continuously, without breaks.
In our experiments, we noticed cases in which HSPs from the real gene match have a smaller identity compared to HSPs originating from paralogs and pseudogenes. Thus, the disambiguation procedure finds the unambiguous HSP sequence covering the longest mRNA segment with the minimum number of HSPs. For the HSP sequences of equal length with the same number of HSPs, we compare the maximum total weight where the weight of an HSP is a tradeoff between its identity and size:
weight = size·m100-x (1)
Here x is the BLASTN-assigned percent identity for an HSP, size is the HSP length, and m < 1 is the decay rate to ensure substantial weight loss for identity lower than the threshold value. The value m = 0.85 produced good results in our experiments. Weight function (1) characterizes the importance of any given HSP in a global solution.
For this algorithm we must allocate a score matrix F of dimensionality 2 × (# of RNA segments) × (# of DNA segments) and a matrix C of the same size to record the intermediate HSP sequences in the dynamic programming procedure. For the convenience of indexing we introduce aliases M ← F0 and Iy ← F1. We ignore matrix Ix as being unnecessary. The boolean function CONNECTED(i, j) determines the overlap between segments i and j in the transcript.
Joining unambiguous HSP sequences
Although the produced graphs may have different degrees of density, in our experiments they were not sparse enough to use Johnson's modification , which runs in O(V2 log V + VE).
To connect the nodes, we solve the following maximization problem:
Figure 7 shows the dynamic programming procedure, after we initialize matrix D. Upon completion of the procedure, we extract the maximum element from matrix D n and recover the solution. The COMBINATORIALCONNECT function used to find the best combination of HSP sequences is shown in Figure 8.
Splice-enhanced affine gap penalty global alignment
In order to identify precise intron/exon boundaries in a genomic clone, a modified Needleman-Wunch global alignment algorithm with affine gap penalty is used to create a spliced alignment between segments of query and target sequences.
The basic Needleman-Wunch algorithm provides a scattered (i.e. frequently interrupted) mRNA/DNA alignment pattern, with no clear indication of exon/intron boundaries. With affine gap penalties, we penalize the score each time we break an alignment ; this provides an alignment clustered within putative exons, but usually without precise indications of exon/intron boundaries. The addition of sensor information (GT/AG, AT/AC or similar rules ) results in precise gene structural prediction.
In our algorithm, we introduce the match state M (Figure 9), which uses the BLASTN scoring matrix. The state Iy corresponds to a gap penalty in the mRNA transcript, while the other states correspond to forming gaps in the genomic clone and have nucleotide-specific score deductions. Gap-opening matrices dA and dG express score preferences to open a gap with either nucleotide A or G, respectively; d is a generic gap-opening penalty; and e is a generic gap-extending penalty. Typically, the cost of extending a gap is set to be five to ten times lower than the cost for opening a gap. Gap-extension penalty matrices eA, eC, eG and eT express scoring preference to extend gaps with nucleotides A, C, G and T, respectively.
In order to save running time, we use anchors – short nucleotide segments from mRNA and DNA expected to contain exon/intron boundary fragments with donor/acceptor signals. A normal anchor does not have mismatches in state M. If a mismatch is encountered, it may mean a short exon is present. If necessary, the anchor can be expanded and the spliced alignment rerun with two full exons and intron between to identify possible short exons or address sequencing errors, as discussed [see Subsection Advantages of using splice-enhanced affine gap penalty global alignment in gene structure prediction].
According to our model, introducing or extending a gap is straightforward using the GT/AG rule. The penalty becomes severe if we try inserting a gap without the rule; we would rather use higher-extension-penalty state Ix for short gaps frequently resulting from sequencing errors. The AT/AC rule works in much the same way, except with a higher score penalty.
We implement the affine gap penalty spliced alignment algorithm in a linear memory of size S(m + n) and running time O(n × m), where n is the size of a DNA fragment and m is the size of an mRNA fragment.
Run the spliced alignment ALIGN(0...n, 0...m) to find indexes of u and v (the split points for a recursive call). Here u is the vertical median index, and v is the horizontal index of a point where the optimal traceback intersects the median.
Restore the matrix context for the recursive calls and prior-state information for proper backtracking.
Make the recursive calls for nucleotide segments ALIGN(0...u, 0...v) and ALIGN(u...n, v...m), etc.
If either of the nucleotide segments' lengths in a recursive call is less than or equal to 1, call the ordinary spliced alignment for these pieces to get the alignment states.
More detailed explanation of the spliced alignment algorithm we use is in .
Advantages of using splice-enhanced affine gap penalty global alignment in gene structure prediction
There are several advantages of using splice-enhanced affine gap penalty global alignment, discussed [see Subsection Splice-enhanced affine gap penalty global alignment], for gene structure prediction:
ability to recover canonical and non-canonical splice sites;
noise-tolerant prediction of splice sites;
ability to recover short exons;
ability to handle low complexity regions in genomic DNA, if sorted out by dust filtering.
A splice site usually happens on the boundaries of HSPs, but in most cases mRNA segments of neighboring HSPs overlap with no clear indication of a splice site. Recovery of a splice site could be formulated as a combinatorial problem of finding optimal exon/intron boundaries in HSPs' overlap vicinity. A dynamic programming approach, such as the splice-enhanced affine gap penalty global alignment we use, allows us to consider all possible rearrangements around HSPs' overlap to pick optimal splice sites in polynomial time.
neighboring HSPs overlap too much, so that we can't reliably identify a splice site with small anchors;
there is a sequencing error adjacent to a splice site;
short exons are present.
All of these cases require additional application of splice-enhanced affine gap penalty global alignment with anchors expanded to include the entire HSP segments for maximum error tolerance, as shown in Figure 11.
Experiments with Genie learning set
GIGOgene was tested, along with Spidey, est2genome, Sim4, Galahad and BLAT on 462 mRNA transcripts of the human Genie multi-exon annotated learning set http://www.fruitfly.org/sequence/human-datasets.html/. We used transcripts corresponding to mRNA or CDS features in the Genie learning set annotation.
Sensitivity (ESn) and specificity (ESp) were calculated according to the formulas
Comparative exon-level sensitivity and specificity for different programs on human Genie learning set
This test is designed as evidence of general prediction quality of different gene structure annotation tools. GIGOgene has the highest sensitivity and specificity in this case, which highlights advantages of the approach we use.
Experiments with micro-exon detection
Micro-exon gene set comparative level sensitivity and specificity for different programs
This study shows that the GIGOgene program has the highest structural prediction sensitivity and specificity in this case. BLAT recovered 96.69% true exonic boundaries in the micro-exonic set, while other programs had fraction of true splice sites recovered no more than 90%, i.e. they most likely miss micro-exon(s) from their prediction.
Experiments with non-canonical splice sites
Non-canonical gene set comparative level sensitivity and specificity
In this study est2genome made a mistake in annotating virtually every non-canonical splice site while reinforcing canonical splice rule. Although BLAT was very sensitive in this experiment, it makes mistakes occasionally.
Simulated EST experiment
Noisy Genie experiment
With simulated EST study our program performed worse than Sim4 and est2genome, about as well as Galahad, and substantially better than BLAT and Spidey, the programs that were specifically designed for mRNA/DNA spliced alignment. The reason for substantial quality loss with GIGOgene is in splice site annotation strategy. If we get a number of nucleotide inserts between exon boundaries in mRNA, they can be easily interpreted as micro-exon(s) with non-canonical splice sites, rather than reinforcing the GT-AG rule in a genomic clone as Sim4 and EST2genome do. That is why these two applications have rather poor performance in micro-exonic testing [see Subsection Experiments with micro-exon detection], where they sacrifice micro-exons to reinforce canonical splice rule.
Comparative time in seconds required by Pentium IV computer to annotate a set of genes containing micro-exons. BLASTN running time is included in GIGOgene timing.
Run time comparison on the set of micro-exons indicates that our program runs faster than est2genome but slower than other tools we have looked at. By using splice-enhanced affine gap penalty global alignment we traded execution time for quality, compare to simpler heuristics used to predict splice sites in other tools.
Chromosome 22 experiment
For this experiment we chose human chromosome 22 whole draft sequence NC_000022.8 from NCBI Genbank. A total of 506 transcripts were mapped to the chromosome by parsing human RefSeq flatfiles, but only 430 transcripts have corresponding genes annotated in NCBI Genbank.
Chromosome 22 prediction quality for 430 mapped transcripts with structural annotation
We report exon-level comparative performance of BLAT and GIGOgene in Table 6.
Results of BLAT and GIGOgene comparison on Chromosome 22 whole draft sequence annotation agree well with the previously observed tendency: with GIGOgene, gene structural prediction takes longer, compared to BLAT, and has higher prediction quality.
Using a homology-based approach, we have designed a program for eukaryotic gene structural annotation. In case of mRNA/DNA spliced alignment we have been able to improve on exon-level sensitivity and specificity by addressing several possibilities of error. Program domain is limited to mRNA/DNA spliced alignment with a reasonable fraction of sequencing errors. Experiments on running time position our tool as a relatively slow utility for annotating specific cases of gene structural prediction.
Several published spliced alignment algorithms were mentioned [see Section Background]. Our splice-enhanced affine gap penalty global alignment in some ways similar to the spliced alignment of protein/DNA blocks described in the Procrustes paper . The key differences in our implementation is that it works in linear memory and is effective in annotation of both canonical and non-canonical splice sites. Compared to protein-DNA alignment, it has finer granularity, which translates to smaller possibility for incorrect structural prediction, especially for micro-exons. We can also annotate both CDS and UTR regions, while protein-DNA homology programs, such as Procrustes  and Genomescan , are limited to CDS region only.
The stand-alone program version, web implementation interface, test results and manual for GIGOgene are available at http://bioinformatics.ist.unomaha.edu/~achurban/.
Availability and requirements
Project name: Good In Good Out gene structural prediction tool (GIGOgene)
Project home page: http://bioinformatics.ist.unomaha.edu/~achurban/.
Operating system: Platform independent
Programming language: Java
Other requirements: Java 1.4.1 or higher
License: GNU Lesser General Public Licence
We would like to thank members of the Bioinformatics Group at the University of Nebraska at Omaha who provided useful feedback on our progress and program. This work was supported by the NIH grant number P20 RR16469 from the INBRE program of National Center for Research Resource.
- IHGSC: Initial sequencing and analysis of the human genome. Nature 2001, 409: 860–921.View ArticleGoogle Scholar
- Altschul WG, Miller W, Myers E, Lipman D: Basic Local Alignment Search Tool. Journal of Molecular Biology 1990, 215: 403–410.View ArticlePubMedGoogle Scholar
- Gish W, States DJ: Identification of protein coding regions by database similarity search. Nature Genetics 1993, 3: 266–272.View ArticlePubMedGoogle Scholar
- Rogozin IB, Milanesi L, Kolchanov NA: Gene structure prediction using information on homologous protein sequence. Comput Appl Biosci 1996, 12(3):161–170.PubMedGoogle Scholar
- Gelfand MS, Mironov AA, Pevzner PA: Gene recognition via spliced sequence alignment. Proc Natl Acad Sci USA 1996, 93: 9061–9066.PubMed CentralView ArticlePubMedGoogle Scholar
- Usuka J, Zhu W, Brendel V: Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics 2000, 16: 203–211.View ArticlePubMedGoogle Scholar
- Huang X, Adams MD, Zhou H, Kerlavage AR: A tool for analyzing and annotating genomic sequences. Genomics 1997, 46: 37–45.View ArticlePubMedGoogle Scholar
- Jiang J, Jacob HJ: EbEST: an automated tool using expressed sequence tags to delineate gene structure. Genome Res 1998, 8(3):268–275.PubMed CentralView ArticlePubMedGoogle Scholar
- Milanesi L, Rogozin IB: ESTMAP: a system for expressed sequence tags mapping on genomic sequences. IEEE Trans Nanobioscience 2003, 2(2):75–78.View ArticlePubMedGoogle Scholar
- Kan Z, Rouchka EC, Gish WR, States DJ: Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res 2001, 11(5):889–900.PubMed CentralView ArticlePubMedGoogle Scholar
- Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Research 1998, 8: 967–974.PubMed CentralPubMedGoogle Scholar
- Wheelan SJ, Church DM, Ostell JM: Spidey: A tool for mRNA-to-Genomic Alignment. Genome Research 2001, 11: 1952–1957.PubMed CentralPubMedGoogle Scholar
- Xu Y, Uberbacher EC: Automated Gene Identification in Large-Scale Genomic Sequences. Journal of Computational Biology 1997, 4(3):325–338.View ArticlePubMedGoogle Scholar
- Kent WJ: BLAT-the BLAST-like alignment tool. Genome Research 2002, 12(4):656–664.PubMed CentralView ArticlePubMedGoogle Scholar
- Mott R: EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. CABIOS 1997, 13(7):477–478.PubMedGoogle Scholar
- Mathé C, Sagot MF, Schiex T, Rouzé P: Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids research 2002, 30: 4103–4117.PubMed CentralView ArticlePubMedGoogle Scholar
- Miller W: Comparison of genomic DNA sequences: solved and unsolved problems. Bioinformatics 2001, 17(5):391–397.View ArticlePubMedGoogle Scholar
- Tchourbanov A, Quest D, Ali H, Pauley M, Norgren R: A New Approach for Gene Annotation Using Unambiguous Sequence Joining. In Proceedings of the Computational Systems Bioinformatics (CSB'03). IEEE Computer society; 2003:353–362.Google Scholar
- Cormen TH, Leiserson CE, Rivest RL, Stein C: Introduction to Algorithms. 2nd edition. MIT Press; 2001.Google Scholar
- Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Cambridge University Press; 1998.View ArticleGoogle Scholar
- Burge CB, Padgett RA, Sharp PA: Evolutionary fates and origins of U12-type introns. Molecular Cell 1998, 2: 773–785.View ArticlePubMedGoogle Scholar
- Hirschberg DS: A linear-space algorithm for computing maximal common subsequences. Communications of the ACM 1975, 18: 341–343.View ArticleGoogle Scholar
- Myers EW, Miller W: Optimal alignments in linear space. Computer Applications in the Bio-sciences 1988, 4: 11–17.Google Scholar
- Volfovsky N, Haas BJ, Salzberg SL: Computational Discovery of Internal Micro-Exons. Genome Research 2003, 13: 1216–1221.PubMed CentralView ArticlePubMedGoogle Scholar
- Burset M, Seledtsov IA, Solovyev VV: Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Research 2000, 28(21):4364–4375.PubMed CentralView ArticlePubMedGoogle Scholar
- Yeh RF, Lim LP, Burge CB: Computational inference of homologous gene structures in the human genome. Genome Research 2001, 11: 803–816.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.