Identification of exonic regions in DNA sequences using cross-correlation and noise suppression by discrete wavelet transform
- Omid Abbasi^{1},
- Ali Rostami^{1, 2} and
- Ghader Karimian^{3}
https://doi.org/10.1186/1471-2105-12-430
© Abbasi et al; licensee BioMed Central Ltd. 2011
Received: 14 June 2011
Accepted: 3 November 2011
Published: 3 November 2011
Abstract
Background
The identification of protein coding regions (exons) in DNA sequences using signal processing techniques is an important component of bioinformatics and biological signal processing. In this paper, a new method is presented for the identification of exonic regions in DNA sequences. This method is based on the cross-correlation technique that can identify periodic regions in DNA sequences.
Results
The method reduces the dependency of identification accuracy on window length. The proposed algorithm is applied to different eukaryotic datasets and the output results are compared with those of other established methods. The proposed method increased the accuracy of exon detection by 4% to 41% relative to the most common digital signal processing methods for exon prediction.
Conclusions
We demonstrated that periodic signals can be estimated using cross-correlation. In addition, discrete wavelet transform (DWT) can minimise noise while maintaining the signal. The proposed algorithm, which combines cross-correlation and DWT, significantly increases the accuracy of exonic region identification.
Background
Several explanations for the existence of the period-3 property have been presented in [2, 3] and [4]. Some codons participate more in protein synthesis than others, giving rise to repetitions of a specific type of codon in the genome [4]. For example, a large number of GCA codons in the exonic regions leads to greater repetition of the G, C and A nucleotides in the first, second and third codon positions, respectively. In other words, the G, C and A nucleotides exhibit the period-3 property in the exonic regions.
Gene finding methods based on genetic characteristics, such as promoters, CpG islands, and start and stop codons, tend to be of insufficient accuracy [5]. The characterization of coding and noncoding regions based on nucleotide statistics inside codons is described by Bernaola et al., who employed a 12-symbol alphabet to identify the borders between coding and noncoding regions [6]. Later, Nicorici and Astola segmented the DNA sequence into coding and noncoding regions using recursive entropic segmentation and stop-codon statistics [7].
The use of signal processing techniques to identify exonic regions based on the period-3 property offers new opportunities for gene finding. Tiwari et al. used the Fourier transform spectrum to achieve this goal [8]. In Tiwari's method, the discrete Fourier transform (DFT) energy at a central frequency is calculated for a fixed-length window, and the window is slid across the numerical sequence. Vaidyanathan [9] identified protein coding regions using an anti-notch filter that magnifies regions with the period-3 property. Datta and Asif [10] presented a new algorithm using DFT theory with a Bartlett window. In another signal processing method, Akhtar [11] applied time domain algorithms, namely the average magnitude difference function (AMDF) and the time domain periodogram (TDP), to identify the period-3 property. Several gene finding methods based on digital signal processing (DSP) techniques have thus been developed, but their accuracy is low and requires improvement.
In this paper, a new algorithm based on cross-correlation theory is presented. We show that the algorithm enhances the accuracy of the identification while reducing noise. The noisy waveform is cross-correlated with a periodic impulse train to provide the estimated signal. Discrete wavelet transform is applied to remove extra frequencies.
The remainder of the paper is organized as follows: in the Methods section, the application of the cross-correlation to obtain the periodic signal plus noise is described, together with the period-3 behaviour detection using cross-correlation theory. The final part of this section details the use of wavelet transform to remove noise. The datasets used are introduced in the Dataset section. Thereafter, evaluation measures are introduced for the measurement and comparison of various methods. Finally, in the Results and Discussion section, the results of the proposed algorithm are compared with those of the most common digital signal processing algorithms for exon prediction, in both time and frequency domains.
Methods
Cross-correlation
The discrete nature of DNA and the existence of period-3 behaviour in the exonic regions render it suitable for analysis by signal processing algorithms. We present an algorithm for the identification of the period-3 component based on cross-correlation techniques. The theory of cross-correlation is briefly explained below.
To estimate a periodic waveform that is contaminated with noise, this waveform is cross-correlated with an adjustable template waveform; the template waveform is adjusted until the cross-correlation is maximized. The resulting template is an estimate of the signal term of the periodic waveform.
In our approach, a noisy waveform is cross-correlated with a periodic impulse train of period equal to that of the signal.
As $N_{\delta} \to \infty$, $\frac{1}{N_{\delta}}\sum_{k=0}^{N_{\delta}} q[kN_{p}] \to 0$, and therefore $r_{s\delta} \to s(0)$.
from which the periodic signal without noise can be extracted [12].
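The idea can be illustrated with a minimal sketch, assuming a noisy waveform x[n] = s[n] + q[n] with known period N_p. Cross-correlating with a unit impulse train amounts to averaging samples spaced N_p apart at each lag, so zero-mean noise tends toward zero while the periodic term survives. The function name `impulse_train_estimate` and the test pattern are hypothetical and are not taken from the paper.

```python
import random

def impulse_train_estimate(x, period):
    """Cross-correlate x with a unit impulse train of the given period.

    For each lag in [0, period), average x[lag], x[lag + period],
    x[lag + 2 * period], ...  Zero-mean noise averages toward 0, so the
    result approaches one period of the underlying periodic signal.
    """
    n_impulses = len(x) // period  # number of impulses that fit (N_delta)
    return [
        sum(x[lag + k * period] for k in range(n_impulses)) / n_impulses
        for lag in range(period)
    ]

# Hypothetical test signal: a period-3 pattern buried in Gaussian noise.
random.seed(0)
pattern = [1.0, -0.5, 0.25]
noisy = [pattern[n % 3] + random.gauss(0.0, 1.0) for n in range(6000)]

estimate = impulse_train_estimate(noisy, period=3)
print([round(v, 2) for v in estimate])
```

With 2,000 impulses the residual noise in each estimated sample has a standard deviation of roughly 1/sqrt(2000), which shows why a longer impulse train gives a cleaner estimate.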
Identification of exonic regions
- 1. DNA sequences are converted into numerical sequences.
- 2. FIR filter is applied to the numerical sequences representing DNA sequences.
- 3. Cross-correlation is applied to the filtered numerical sequences.
- 4. The noise effect is removed using discrete wavelet transform.
1. Numerical conversion of the DNA sequences
To apply DSP techniques to a DNA sequence and find nucleotide regions exhibiting period-3 behaviour, the DNA sequence is first mapped onto numerical sequences. The simplest conversion method derives four binary indicator sequences, I_{ A } [n], I_{ T } [n], I_{ C } [n] and I_{ G } [n], from the DNA sequence. In this mapping, the presence or absence of the respective nucleotide at the nth position is represented by '1' and '0', respectively. For example, given a section of DNA sequence ATCCGATATTC, the binary sequence of the nucleotide A, denoted I_{ A } [n], is [10000101000]. The binary sequences for the other three nucleotides T, C and G are found similarly [13].
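The binary mapping can be sketched in a few lines of Python. The function name `indicator_sequences` is hypothetical; the example reuses the ATCCGATATTC fragment from the text.

```python
def indicator_sequences(dna):
    """Return the four binary indicator sequences for a DNA string.

    The result maps each base to a list holding 1 where that base occurs
    and 0 elsewhere.
    """
    dna = dna.upper()
    return {base: [1 if ch == base else 0 for ch in dna] for base in "ATCG"}

seqs = indicator_sequences("ATCCGATATTC")
i_a = "".join(str(bit) for bit in seqs["A"])
print(i_a)  # 10000101000
```

At every position exactly one of the four sequences is 1, so the four sequences together carry the same information as the original string.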
2. Applying FIR filter to the numerical sequences
After mapping the DNA sequence onto its binary numerical sequence, the binary sequence is passed through a Hamming window based FIR filter of order 8 with central frequency set to 2π/3, to emphasize period-3 property in the exonic regions. Lack of distortions in FIR filters is one reason for their preferred use over IIR filters in medical applications [12].
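One simple way to obtain such a filter is sketched below. It assumes a cosine at 2π/3 shaped by a 9-tap Hamming window (order 8); the paper does not state the exact design procedure or bandwidth, so this design and the names `period3_fir`, `fir_filter` and `gain` are illustrative assumptions only.

```python
import cmath
import math

def hamming(n_taps):
    """Symmetric Hamming window of length n_taps."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_taps - 1))
            for n in range(n_taps)]

def period3_fir(order=8, centre=2 * math.pi / 3):
    """Band-pass taps: a cosine at `centre` shaped by a Hamming window.

    Assumed design, not necessarily the one used in the paper.
    """
    mid = order / 2
    return [w * math.cos(centre * (n - mid))
            for n, w in enumerate(hamming(order + 1))]

def fir_filter(x, taps):
    """Causal direct-form FIR filtering with zero initial state."""
    return [
        sum(taps[k] * x[n - k] for k in range(len(taps)) if n - k >= 0)
        for n in range(len(x))
    ]

def gain(taps, omega):
    """Magnitude of the frequency response at angular frequency omega."""
    return abs(sum(h * cmath.exp(-1j * omega * n) for n, h in enumerate(taps)))

taps = period3_fir()
print(round(gain(taps, 2 * math.pi / 3), 3), round(gain(taps, 0.0), 3))
```

Because the taps are symmetric, the filter has linear phase. With only nine taps the passband is broad, so the filter emphasises the neighbourhood of 2π/3 relative to low frequencies rather than isolating a single frequency.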
3. Applying cross-correlation theory to the numerical sequences
Most previous methods have used a window of fixed length to find the regions in DNA sequences exhibiting period-3 property. In such methods, the window length directly affects the accuracy of the identification. Typically, an appropriate window length is considered to lie within the range 240-351 (window lengths are multiples of three to reflect the codon structure). Short length windows increase noise, while long length windows tend to miss short exonic regions.
In this energy spectrum, a peak corresponds to the presence of a period-3 component on that region, implying that the region is exonic.
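For reference, the fixed-window baseline that this step is contrasted with can be sketched as a sliding-window DFT evaluated at frequency 2π/3, with the squared magnitudes summed over the four indicator sequences. This is a sketch of the conventional approach described in the Background, not of the proposed cross-correlation method; the function name `period3_power`, the step size and the toy sequence are assumptions.

```python
import cmath
import math

def period3_power(dna, window=351, step=3):
    """Sliding-window DFT power at 2*pi/3, summed over A, T, C and G.

    `window` should be a multiple of three to reflect the codon structure.
    """
    omega = 2 * math.pi / 3
    twiddle = [cmath.exp(-1j * omega * n) for n in range(window)]
    powers = []
    for start in range(0, len(dna) - window + 1, step):
        chunk = dna[start:start + window]
        total = 0.0
        for base in "ATCG":
            coeff = sum(twiddle[n] for n, ch in enumerate(chunk) if ch == base)
            total += abs(coeff) ** 2
        powers.append(total)
    return powers

# Toy sequence: a strongly period-3 stretch followed by a non-periodic one.
toy = "GCA" * 200 + "A" * 600
powers = period3_power(toy, window=240)
print(round(powers[0]), round(powers[-1]))
```

The first window lies entirely in the repeated GCA stretch and produces a large value, whereas the last window contains no period-3 component and produces a value near zero. Shrinking `window` makes the estimate noisier on real data, which is the trade-off discussed above.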
4. Decreasing the noise using discrete wavelet transform
Decreasing noise increases the accuracy of exonic region identification. As seen from equation (6), a small window size, required for the detection of small exons, will not diminish noise sufficiently. Hence we apply discrete wavelet transform (DWT) to decrease the noise in the output spectrum.
DWT has been used for de-noising in various signal processing applications. In protein coding region detection, the Haar wavelet has previously been employed for noise suppression [14]. Our proposed algorithm uses the Dmey (discrete Meyer) wavelet to remove noise and thereby increase the accuracy of the exonic region identification.
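The general shape of wavelet de-noising (decompose, shrink the detail coefficients, reconstruct) is sketched below. To stay self-contained, the sketch swaps in a single-level Haar transform in place of the Dmey wavelet used in the paper; the soft-threshold rule, the threshold value and all function names are assumptions for illustration.

```python
import math

def haar_dwt(x):
    """Single-level Haar DWT of an even-length sequence."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def soft_threshold(coeffs, t):
    """Shrink each coefficient toward zero by t."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def denoise(x, threshold):
    """Keep the approximation, shrink the details, then reconstruct."""
    approx, detail = haar_dwt(x)
    return haar_idwt(approx, soft_threshold(detail, threshold))

spectrum = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]  # hypothetical output spectrum
smoothed = denoise(spectrum, threshold=0.5)
print([round(v, 2) for v in smoothed])
```

Small sample-to-sample fluctuations end up in the detail coefficients and are removed, while the large step between the low and high regions is carried by the approximation coefficients and is preserved. A multi-level decomposition with a smoother wavelet follows the same pattern.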
Datasets
Standard datasets are used to compare the efficacy of different algorithms at identifying exonic regions. The exon and intron positions in these datasets are known, so the exon positions detected by the DSP methods can be compared with the annotated positions. The proposed algorithm is first applied to chromosome III of Caenorhabditis elegans [NCBI Reference Sequence: NC_003281.8], containing a total of 13783681 nucleotides with 8172 coding regions, and the results are compared with those of other popular methods. The results of the proposed algorithm for the sequence F56F11.4 of C. elegans (comprising 8,000 nucleotides) are presented separately. This sequence has five exonic regions at positions 928-1039, 2528-2857, 4114-4377, 5465-5644, and 7255-7605. Also analysed in this paper are the BG570 [16] and HMR195 [17] datasets. BG570 is a genomic test dataset of 570 single-gene vertebrate sequences prepared by Burset and Guigo [16]. HMR195 comprises 195 single-gene human, mouse, and rat sequences selected in 2001 by Rogic et al. [17] to test and evaluate the performance of gene structure prediction algorithms.
Evaluation Measures
Other evaluation parameters have also been described for DSP-based gene finding. One of the most popular evaluation measures is the Receiver Operating Characteristic (ROC) curve. By selecting different threshold levels, the number of true positives (TP) and the corresponding number of false positives (FP) are calculated at each threshold, and the ROC curve is constructed from these TP and FP pairs. The area under the ROC curve (AUC) is used as an evaluation measure; the greater the AUC, the higher the accuracy of the gene finding algorithm [18]. Another way to compare identification accuracy between methods is to calculate the specificity at different sensitivities. Since intronic and intergenic regions make up the majority of most genomes, the number of false positives also provides a useful comparison measure [19].
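A minimal sketch of the ROC construction follows, assuming one score per nucleotide and a 0/1 label marking whether the nucleotide is exonic. The curve is expressed here as true-positive rate against false-positive rate and integrated with the trapezoidal rule; the function names and the toy data are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data: higher scores should indicate exonic positions (label 1).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))
print(area)  # 0.8125
```

An AUC of 0.5 corresponds to a score that carries no information, and 1.0 to a score that separates exonic from non-exonic positions perfectly.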
Threshold Selection Method
To discriminate between coding and noncoding regions, a threshold is imposed on the output power spectrum. The selection of a proper threshold can optimise the accuracy of the identification; however, the calculation of an optimum threshold value itself raises problems [20]. Therefore, in this paper, the sensitivity, specificity and approximate correlation measures are defined by changing the threshold level, to accurately compare different methods. In this section, we discuss implementation of the threshold selection.
where meanP_{ 3e } and sdP_{ 3e } represent respectively the mean and standard deviation of the period-3 values obtained from the exon sequences of a training set, and meanP_{ 3i } and sdP_{ 3i } represent respectively the mean and standard deviation of the period-3 values obtained from the intron sequences of the same training set.
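Once a threshold has been fixed, the nucleotide-level measures reported in the result tables can be computed from the thresholded output. The sketch below assumes the commonly used Burset and Guigo definitions [16]: sensitivity TP/(TP+FN), specificity TP/(TP+FP), and the approximate correlation as twice the average of four conditional probabilities, shifted to lie in [-1, 1]. Whether the paper uses exactly these forms is an assumption, and the function names and toy data are hypothetical.

```python
def confusion_counts(predicted, actual):
    """Count TP, FP, FN and TN at nucleotide level."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    return tp, fp, fn, tn

def evaluate(scores, actual, threshold):
    """Return (Sn, Sp, AC) for a given threshold.

    Assumes all four denominators below are non-zero.
    """
    predicted = [s >= threshold for s in scores]
    tp, fp, fn, tn = confusion_counts(predicted, actual)
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    ac = 0.5 * (tp / (tp + fn) + tp / (tp + fp)
                + tn / (tn + fp) + tn / (tn + fn)) - 1
    return sn, sp, ac

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actual = [1, 1, 0, 1, 0, 1, 0, 0]
sn, sp, ac = evaluate(scores, actual, threshold=0.5)
print(sn, sp, ac)  # 0.75 0.75 0.5
```

Lowering the threshold raises the sensitivity at the cost of more false positives, which is why the methods are compared at matched sensitivity levels.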
Results and Discussion
Table 1. Comparison of different methods using the sequence F56F11.4
Method | S_{n} | S_{p} | AC |
---|---|---|---|
DFT | 0.80 | 0.17 | 0.08 |
AN filter | 0.80 | 0.23 | 0.25 |
Asif | 0.80 | 0.18 | 0.12 |
AMDF | 0.80 | 0.20 | 0.19 |
TDP | 0.80 | 0.49 | 0.55 |
Cross-correlation (Proposed) | 0.80 | 0.82 | 0.78 |
Table 2. Evaluation of different methods using chromosome III of C. elegans
Methods | AUC | FP (S_{n} = 20%) | S_{p} % (S_{n} = 20%) | AC (S_{n} = 20%) | FP (S_{n} = 40%) | S_{p} % (S_{n} = 40%) | AC (S_{n} = 40%) | FP (S_{n} = 60%) | S_{p} % (S_{n} = 60%) | AC (S_{n} = 60%) |
---|---|---|---|---|---|---|---|---|---|---|
AN filter | 0.6471 | 157 | 71 | 0.17 | 372 | 66.3 | 0.21 | 727 | 60 | 0.20 |
TDP | 0.6115 | 196 | 70 | 0.15 | 436 | 65 | 0.18 | 796 | 59 | 0.19 |
Cross-correlation (proposed) | 0.6891 | 134 | 76.5 | 0.20 | 302 | 70.9 | 0.25 | 610 | 61 | 0.26 |
Table 3. Evaluation of different methods using HMR195 and BG570 genomic datasets
Methods | BG570 AUC | BG570 FP (S_{n} = 10%) | BG570 S_{p} % (S_{n} = 10%) | BG570 FP (S_{n} = 30%) | BG570 S_{p} % (S_{n} = 30%) | BG570 FP (S_{n} = 50%) | BG570 S_{p} % (S_{n} = 50%) | HMR195 AUC | HMR195 FP (S_{n} = 10%) | HMR195 S_{p} % (S_{n} = 10%) | HMR195 FP (S_{n} = 30%) | HMR195 S_{p} % (S_{n} = 30%) | HMR195 FP (S_{n} = 50%) | HMR195 S_{p} % (S_{n} = 50%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DFT | 0.6540 | 279 | 45.8 | 767 | 43.3 | 1412 | 34.3 | 0.6782 | 438 | 51.5 | 1184 | 45 | 2064 | 41.7 |
AN filter | 0.6765 | 121 | 55 | 499 | 49.7 | 1103 | 36.7 | 0.7615 | 151 | 64.4 | 526 | 57.4 | 1217 | 51.1 |
Asif | 0.5748 | 140 | 34.2 | 330 | 31.7 | 554 | 29.1 | 0.6261 | 214 | 47.1 | 473 | 44.6 | 787 | 39.9 |
AMDF | 0.6600 | 340 | 40.8 | 770 | 39.4 | 1309 | 35.3 | 0.6980 | 410 | 47.9 | 1010 | 46.8 | 1821 | 43.3 |
TDP | 0.7560 | 160 | 62 | 408 | 56 | 805 | 49.4 | 0.7850 | 262 | 64.8 | 627 | 60.4 | 1128 | 56 |
Cross-correlation (proposed) | 0.8143 | 81 | 75.5 | 244 | 69 | 547 | 61 | 0.8250 | 124 | 71 | 382 | 67 | 841 | 59 |
Table 4. Approximate correlation measures for HMR195 and BG570 genomic datasets
Method | BG570 S_{n} | BG570 S_{p} | BG570 AC | HMR195 S_{n} | HMR195 S_{p} | HMR195 AC |
---|---|---|---|---|---|---|
DFT | 0.80 | 0.28 | 0.18 | 0.80 | 0.31 | 0.18 |
AN filter | 0.80 | 0.26 | 0.17 | 0.80 | 0.39 | 0.32 |
Asif | 0.80 | 0.25 | 0.10 | 0.80 | 0.30 | 0.15 |
AMDF | 0.80 | 0.29 | 0.20 | 0.80 | 0.37 | 0.27 |
TDP | 0.80 | 0.37 | 0.31 | 0.80 | 0.44 | 0.38 |
Cross-correlation (Proposed) | 0.80 | 0.43 | 0.40 | 0.80 | 0.47 | 0.45 |
Conclusions
This paper presents a new algorithm based on cross-correlation theory, designed to increase the accuracy of exonic region identification. The FIR filter makes it easier to identify the exonic regions. The main advantage of the proposed method is its reduced dependency on the window length, which results from the reduction in noise. The ability to detect small exonic regions is another advantage of this algorithm. The final step of the algorithm uses the discrete wavelet transform to reduce noise. Compared with established time and frequency domain methods, the proposed algorithm improves the area under the ROC curve by 4% to 41% for the HMR195 and BG570 datasets. The proposed method also reduces the number of nucleotides incorrectly predicted as exonic. This decrease in the number of false positives is responsible for the increase in specificity; for example, at a sensitivity of 30%, the proposed algorithm yielded a 15% to 85% improvement in specificity over the other tested methods. As can be seen from Tables 3 and 4, our algorithm significantly improves the accuracy of exonic region identification.
Declarations
Acknowledgements
The first author, Mr. Omid Abbasi, would like to thank Mr. Omid Omrani, an expert in genetic laboratory techniques in the Department of Natural Sciences, University of Tabriz, Tabriz, Iran.
References
- Fickett JW: Recognition of protein coding regions in DNA sequences. Nucl Acids Res 1982, 10: 5303–5318. 10.1093/nar/10.17.5303
- Trifonov E: Elucidating sequence codes: three codes for evolution. Ann NY Acad Sci 1999, 870: 330–338. 10.1111/j.1749-6632.1999.tb08894.x
- Eskesen ST, Eskesen FN, Kinghorn B, Ruvinsky A: Periodicity of DNA in exons. BMC Molecular Biology 2004.
- Chang CQ, Fung PCW, Hung YS: Improved Gene Prediction by Resampling-based Spectral Analysis of DNA Sequence. In Proceedings of the 5th International Conference on Information Technology and Application in Biomedicine: 30–31 May 2008. Shenzhen, China; 2008.
- Salamov AA, Solovyev VV: Ab initio gene finding in Drosophila genomic DNA. Genome Res 2000, 10: 516–522. 10.1101/gr.10.4.516
- Bernaola-Galvan P, Grosse I, Carpena P, Oliver JL, Roman-Roldan R, Stanley HE: Finding borders between coding and noncoding DNA regions by an entropic segmentation method. Phys Rev Lett 2000, 85(6):1342–1345. 10.1103/PhysRevLett.85.1342
- Nicorici D, Astola J: Segmentation of DNA into coding and noncoding regions based on recursive entropic segmentation and stop-codon statistics. Journal of Applied Signal Processing, Special issue in Genomic Signal Processing 2004, 1(1):81–91.
- Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R: Prediction of probable genes by Fourier analysis of genomic sequences. Comput Appl Biosci 1997, 13: 263–270.
- Vaidyanathan PP, Yoon BJ: The role of signal-processing concepts in genomics and proteomics. Journal of the Franklin Institute 2004, 341: 111–135. 10.1016/j.jfranklin.2003.12.001
- Datta S, Asif A: A Fast DFT-Based Gene Prediction Algorithm for Identification of Protein Coding Regions. Proceedings of the 30th International Conference on Acoustics, Speech, and Signal Processing 2005.
- Akhtar M, Epps J, Ambikairajah E: Signal Processing in Sequence Analysis: Advances in Eukaryotic Gene Prediction. IEEE Journal of Selected Topics in Signal Processing 2008, 2: 310–321.
- Ifeachor E, Jervis B: Digital Signal Processing: A Practical Approach. Prentice Hall Press; 2002.
- Voss RF: Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys Rev Lett 1992, 68: 3805–3808. 10.1103/PhysRevLett.68.3805
- George TP, Thomas T: Discrete wavelet transform de-noising in eukaryotic gene splicing. BMC Bioinformatics 2010, 11: S50. 10.1186/1471-2105-11-S1-S50
- Weeks M: Digital Signal Processing Using MATLAB and Wavelets. Infinity Science Press LLC; 2007.
- Burset M, Guigo R: Evaluation of gene structure prediction programs. Genomics 1996, 34: 353–367. 10.1006/geno.1996.0298
- Rogic S, Mackworth AK, Ouellette BF: Evaluation of gene-finding programs on mammalian sequences. Genome Res 2001, 11: 817–832. 10.1101/gr.147901
- Akhtar M, Ambikairajah E, Epps J: Detection of period-3 behavior in genomic sequences using singular value decomposition. Proceedings of the International Conference on Emerging Technologies: 17–18 September 2005; Islamabad, Pakistan 2005.
- Burge C: Modeling dependencies in pre-mRNA splicing signals. In Computational Methods in Molecular Biology. Edited by: Salzberg SL, Searls DB, Kasif S. Elsevier Science; 1998:129–164.
- Akhtar M, Ambikairajah E, Epps J: GMM-based classification of genomic sequences. IEEE 15th International Conference on Digital Signal Processing (Cardiff, UK) 2007.
- Kwan JYY, Kwan BYM, Kwan HK: Spectral analysis of numerical exon and intron sequences. Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2010) Workshops, Hong Kong; 2010.
- Agrawal A, Mittal A, Jain R, Takkar R: An adaptive fuzzy thresholding algorithm for exon prediction. In IEEE International Conference on Electro/Information Technology, 2008. EIT; 2008.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.