Localizing triplet periodicity in DNA and cDNA sequences
© Wang and Stein. 2010
Received: 3 June 2010
Accepted: 8 November 2010
Published: 8 November 2010
Skip to main content
© Wang and Stein. 2010
Received: 3 June 2010
Accepted: 8 November 2010
Published: 8 November 2010
The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) due to fact that coding exons contain a series of three nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied on each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in non-coding regions. This property has been used in identifying the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size affects the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans.
Using both simulated TP signals and the real C. elegans sequence F56F11 as an example, we demonstrate that, (1) Modified Wavelet Transform (MWT) can better define the boundary of TP region than the conventional Short Time Fourier Transform (STFT); (2) The scale parameter (a) of MWT determines the precision of TP boundary localization: bigger values of a give sharper TP boundaries but result in a lower signal to noise ratio; (3) RNA splicing sites have weaker TP signals than coding region; (4) TP signals in coding region can be destroyed or recovered by frame-shift mutations; (5) 6 bp periodicities in introns and intergenic region can generate false positive signals and it can be removed with 6 bp MWT.
MWT can provide more precise TP boundaries than STFT and the boundaries can be further refined by bigger scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally-introduced frame-shift mutations help recover TP signal that have been lost by possible ancient frame-shifts. More importantly, TP signal has the potential to be used to detect the splice junctions in fully spliced mRNA sequence.
Most protein-coding genes in eukaryotic cells are composed of alternating introns and exons. With the exception of the extreme 5' and 3' ends of the first and last exons, which contain un-translated regions (UTRs), exons encode protein sequence. These are so-called "coding designated segments" or CDS. Combination of any three nucleotide bases is a codon that encodes one amino acid residue (except three stop codons). CDS regions are known to exhibit a periodic organization of three bases, a triplet periodicity (TP) , which is generally considered a good preliminary indicator of protein-coding exon locations. TP based approaches for prediction of CDS [2–6] is model-independent since it does not need to be trained. On the other hand, model-based methods, like Markov chains , tend to be more precise by training a supervised classifier based on database of previously known organisms' genomic information, though it is usually not trivial to construct negative samples for training purposes. However, when sequenced organisms have coding segments that are not represented in the currently available database, TP based methods may complement model-based systems.
TP signals can be detected by applying windowed Fourier Transform, also called Short Time Fourier Transform (STFT), along DNA sequences . However, the ability of the STFT to identify the precise boundaries of the TP signal is limited by its requirement of an arbitrarily chosen window size over which the spectrum is calculated. It has been shown that the choice of different window lengths directly affects the prediction accuracy . The window size problem of STFT is the result of the resolution tradeoff between the time and frequency domains: FT is applied to the data from time zero to end with high resolution in the frequency domain but none in time domain; STFT (or windowed FT) captures the time domain resolution but loses some resolution in the frequency domain. A natural improvement on STFT is the wavelet transform (WT) that allows one to balance resolution at any time and frequency . However direct application of WT is limited by the fact that the coding regions with TP will present the same frequency (1/3) under different scales . For this reason, the ability of WT to automatically capture different frequencies is not appropriate here. Here we used a Modified Wavelet Transform (MWT) algorithm to show: (1) how TP boundary can be defined with greater accuracy; (2) how the TP profile can be used to infer the splice junctions in mature mRNA sequences; (3) how ancient frame-shift mutations may explain the loss of TP signals in coding regions; and (4) how the 6 bp periodicity in some intronic and intragenic regions can be identified and corrected in order to reduce false positive identifications of coding regions.
For a biological test case, we used the sequenced contig F56F11 of C. elegans, a 43 kbp sequence containing six protein-coding genes and two non-coding genes. The 8000 bp-long gene F56F11.4 which starts at contig position 7021 (GenBank access number AF0099922, positions 7021-15020) has been used extensively as a test case for TP detection techniques [5, 6, 10]. This gene contains five CDS of length greater than 100 bp located at positions 928-1039, 2528-2857, 4114-4377, 5465-5644, and 7255-7605 relative to the start of the gene (a short alternatively-spliced sixth exon at the extreme 5' end contains a coding region that is too short to be analyzed by these techniques).
Where m is a non-negative integer and different p values simulates period of 3, 6 or 9 patterns from base position 300 to 600.
We choose a window size N which is a multiple of p, and slide it across the DNA sequence. The PSD[N/p] value given by STFT is recorded for each base located at the center of the sliding window. The TP property of a DNA sequence implies that PSD[N/3] (when p = 3) values peak in coding regions. Thus the plot of PSD[N/3] against base position will distinguish CDS from non-coding regions. Later we will refer PSD[N/3] simply as PSD for each base position.
Such modification ensures that the frequency part ( ) is independent of scale parameter a since t = (x - k)/a. For practical reason, the length (or vanishing moments) of the wavelet analysis function (N) is taken as 1200 unless specified: a longer length will give higher precision but takes longer to compute. Now if ω0 is taken as N/b, it can precisely capture b base periodicity. The normalization factor in equation (4) is critical to ensure a constant norm in the space L 2(R) of square integrable functions, which in turn ensures equal areas under the MWT for two equal-amplitude Fourier components in the power spectrum. Without the normalization factor, MWT is similar to the modified Gabor Transform . PSD(a, b) can be calculated by equation (3) after obtaining U by equation (4) to capture b base periodicity under scale a. b is chosen as 3 for capturing TP signal and a is 5 unless specified. Later we will also refer PSD(a, b) simply as PSD for each base position.
From Figure 1(a), it can be seen that MWT with larger scale parameters gives a sharper boundary but with a lower peak amplitude, making it less tolerant to noise. For real DNA sequences, larger scale parameters (equivalent to a smaller window size) will result in more noise in introns and intergenetic regions since smoothing is performed over smaller region. A hybrid approach can be taken by using a smaller scale parameter to locate candidate coding regions and then refining the boundaries of these regions by re-running the MWT with a larger value of scale. Combining this technique with the well known GT-AG rule, which governs the great majority of Type I eukaryotic splice junctions , and open reading frame identification, might enable precise identification of the CDS boundaries.
More interestingly, Figure 3 shows that, under the larger value of scale (a = 5), individual exons can be distinguished from each other even when they are merged together as they would appear in fully-spliced mRNA. The exceptions are the second and last exons, each of which contains two distinct PSD peaks (Figure 2 and Figure 3).
The most straightforward explanation for this result is that the diminution of TP signal around splice junctions is due to the different TP patterns between adjacent exons. The presence of splicing cis-regulatory elements in the exon, such as exonic splice enhancers , which locally distort the pattern of evolutionary constraint, might contribute to the dimunution. However, the valleys are still observed after removing up to 90 bp on each side of the splice junction (Additional file 1, Figure S1), suggesting that these elements, if present, are not the dominating factors.
A frame-shift mutation is a genetic mutation caused by insertion or deletion of 3n + 1 or 3n + 2 nucleotides from the coding region of a DNA sequence. Such mutations change the downstream codons which in turn changes the final protein product [14, 15]. Such mutations will break the TP pattern within the exon by altering the phase of the codon bias around the mutation point. Here we examined whether the TP signal can be used to detect frame-shift mutations and vice versa.
Interestingly, nucleotide alignments between the C. elegans genome and related nematode species  detects indels around these regions (Additional file 1, Figure S4 and Additional file 1, Figure S5), providing independent support for ancient frameshift mutations at these sites.
It has been pointed out that p bp periodicity peak will be observed in the power spectrum even if the sequence only has 2p bp periodicity when Fourier Transform is used . This is an inherent character of FT and MWT since, for example, if there is a peak at k = N/6, there will be a peak at k = N/3 using equation (2) or (4). This drawback needs to be corrected when either FT or MWT is applied to the identification of DNA triplet periodicity.
Several hypotheses have been advanced to explain the origin of the TP in coding sequence [10, 18, 19]. The simplest explanation is codon bias, which increases the probability that nucleotide triplets will appear in the same phase, but the matter is far from settled. Further it remains unclear why some CDS do not have apparent TP signal, and why some non-CDS segments have strong TP signals. Our analysis provides useful insights into TP properties at both the algorithmic and biological perspecives. It shows that MWT can better precisely delimit coding boundaries than STFT, and that a simple procedure of introducing artificial frame-shift mutations into protein-coding candidate regions can recover signal that was lost by presumptive ancient frame-shift mutations. While, this step might end up identifying regions that have lost coding capability during evolution, the identification of ancient coding sequences may still be of interest for comparative genomics.
In and of itself, the TP property is inadequate for gene prediction. However, it may be a useful adjunct to other techniques, particularly in the identification of protein-coding genes that are unusual in one way or another. For example, a recent study  showed that highly tissue-specific protein coding exons can be discovered via massive RNA-sequencing of 69 lymphoblastoid cell lines derived from unrelated Nigerian individuals that were not predicted by conventional model-based gene prediction algorithms. One can speculate that some of these exons remained undiscovered because they depart from the expectations of the model. Protein coding segment prediction methods based on the model-independent TP property would provide a valuable complement to the conventional gene prediction algorithms.
Using simulated and real life sequencing data, we demonstrated that the MWT scale parameter can be reduced to give more robust prediction of broad TP regions or increased to provide higher resolution of the TP boundary. One way to achieve both will be running MWT with a smaller scale to identify TP regions followed by re-running the algorithm using larger values of the scale. Though it is not clear whether the edges of TP are completely consistent with the edges of protein coding regions (exons), TP edges could be further refined by known knowledge, such as GT-AG rule if done carefully.
The most interesting result stemming from this analysis is the diminishing TP signal around the splice junction in mature mRNA sequences. The valley remains after deleting up to 90 bp of sequence immediately up and downstream of the junction. This implies that an extended region surrounding the splice site is under a different set of evolutionary constraints which reduces the TP signal. An alternative hypothesis is that areas of weak TP signal are favoured sites for intron birth. It is interesting to note that Additional file 1, Figure S4 shows that the last exon of gene F56F11.4 contains two different TP patterns even after making artificial frame-shift mutations. Perhaps this region once had an intron and lost it; alternatively this region might have an increased probability of acquiring an intron over the course of future evolution. In support of the first hypothesis, we note that there is a 657 bp deletion at the site of the TP valley in C. elegans relative to P. pacificus (Additional file 1, Figure S4), suggesting the presence of an intron in the common ancestor of these two nematodes.
MWT is a promising method for capturing triplet periodicity in DNA sequence. Artificial 'frame-shift' mutations and correction for the effect of the 6 bp periodicity signal could further improve the prediction. We also hypothesize that TP property of exons might carry evolution evidence about frame-shift mutations and the separation of exons by introns.
We would like to thank Dr. Chunlao Tang, Sharon Wei and Andrew Olson for helpful discussions with LW. This research was supported in part by WormBase grant P41 HG02223 from National Institutes of Health and iPlant grant #EF-0735191 from National Science Foundation Plant Cyberinfrastructure program to LDS.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.