Volume 11 Supplement 1
Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010)
Discrete wavelet transform de-noising in eukaryotic gene splicing
- Tina P George^{1} and
- Tessamma Thomas^{2}
https://doi.org/10.1186/1471-2105-11-S1-S50
© George and Thomas; licensee BioMed Central Ltd. 2010
Published: 18 January 2010
Abstract
Background
This paper compares the most common digital signal processing methods of exon prediction in eukaryotes and proposes a technique for noise suppression in exon prediction. The specimen used here, which has relevance in medical research, was taken from the public genomic database GenBank.
Methods
Exon prediction was carried out using the following digital signal processing methods: the binary method, the EIIP (electron-ion interaction pseudopotential) method and filter methods. Under the filter method, two filter designs and two approaches using these designs were tried. The discrete wavelet transform was used for de-noising the exon plots.
Results
The exon prediction results from the methods mentioned above that give values closest to those found in the NCBI database are presented. The exon plot de-noised using the discrete wavelet transform is also given.
Conclusion
Alterations to the proven methods, as made by the authors, improve the performance of exon prediction algorithms. It has also been shown that the discrete wavelet transform is an effective de-noising tool that can be used with exon prediction algorithms.
Background
Methods
Many digital signal processing methods have been tried for genomic data analysis with proven results; [1, 4–6, 9, 10] are but a few examples of such published work.
Exon prediction using the DFT - The binary method
gives better results than the one given in equation 5. The period-3 property of a DNA sequence implies that the DFT coefficient corresponding to k = N/3 is large. Thus, if we take N to be a multiple of 3 and plot S[k], we should see a peak at the sample value k = N/3. Instead of evaluating the DFT of the full-length sequence, the DFTs of several of its subsequences (the STFT) were computed for better time-domain resolution, sliding the window by one entry in the sequence at a time.
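The sliding-window evaluation of the period-3 peak can be sketched as follows. This is a minimal illustration and not the authors' MATLAB code: the expression for S[k] preferred in this paper is not reproduced in the text, so the sketch assumes the commonly used sum of squared DFT magnitudes of the four binary indicator sequences. All function names are hypothetical, and the default window length of 240 is the value reported in the Results for C elegans.

```python
import cmath


def indicator_sequences(dna):
    """Return the four binary indicator sequences (one per base) of a DNA string."""
    dna = dna.upper()
    return {base: [1.0 if ch == base else 0.0 for ch in dna] for base in "ACGT"}


def dft_coefficient(x, k):
    """Single DFT coefficient X[k] of the sequence x."""
    n_total = len(x)
    return sum(value * cmath.exp(-2j * cmath.pi * k * n / n_total)
               for n, value in enumerate(x))


def period3_power(window):
    """Assumed measure: sum over the four bases of |X_b[N/3]|^2 for one window."""
    n_total = len(window)
    if n_total % 3 != 0:
        raise ValueError("window length must be a multiple of 3")
    k = n_total // 3
    seqs = indicator_sequences(window)
    return sum(abs(dft_coefficient(seqs[base], k)) ** 2 for base in "ACGT")


def exon_plot(dna, window_len=240):
    """Slide a rectangular window one nucleotide at a time (STFT at k = N/3)."""
    return [period3_power(dna[i:i + window_len])
            for i in range(len(dna) - window_len + 1)]
```

A sequence with perfect period-3 structure gives a large value at k = N/3, whereas a homopolymer gives a value close to zero, which is the contrast the exon plots rely on.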
Exon prediction using the DFT - The EIIP method
When Se[k] is plotted against k, it reveals a peak at k = N/3 for a coding region, whereas no such peak is observable for a noncoding region. Rectangular windows were used in this work to evaluate the STFT by breaking the long sequence into subsequences.
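The EIIP variant replaces the four binary sequences with a single numerical sequence. The sketch below is illustrative only: it assumes the EIIP values commonly quoted in the literature for the four nucleotides and takes Se[k] to be the squared magnitude of the DFT coefficient at k = N/3. The function names are hypothetical.

```python
import cmath

# EIIP values commonly quoted for the four nucleotides (assumed here).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}


def eiip_period3_power(window):
    """Squared magnitude of the DFT coefficient at k = N/3 of the EIIP sequence."""
    n_total = len(window)
    if n_total % 3 != 0:
        raise ValueError("window length must be a multiple of 3")
    k = n_total // 3
    x = [EIIP[ch] for ch in window.upper()]
    coeff = sum(value * cmath.exp(-2j * cmath.pi * k * n / n_total)
                for n, value in enumerate(x))
    return abs(coeff) ** 2


def eiip_exon_plot(dna, window_len=240):
    """Rectangular window slid one nucleotide at a time over the sequence."""
    return [eiip_period3_power(dna[i:i + window_len])
            for i in range(len(dna) - window_len + 1)]
```

Only one DFT coefficient per window is needed here, compared with four in the binary method.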
Exon prediction using digital filters
A plot of this function is a preliminary indicator of coding regions. Filter 2, designed by the authors [6], gives better results than this anti-notch filter, which is called Filter 1 in this paper.
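The idea of filtering the indicator sequences with a narrow band-pass response centred at 2π/3 can be sketched as follows. The transfer functions of Filter 1 and Filter 2 are not reproduced in the text, so this sketch uses a generic second-order resonator with poles at radius·exp(±j2π/3) as a stand-in; it is neither of the authors' designs, and the pole radius of 0.99 and all names are assumptions.

```python
def resonator(x, radius=0.99):
    """Generic two-pole resonator centred at 2*pi/3 (stand-in, not Filter 1 or 2).

    Denominator 1 - 2*radius*cos(2*pi/3)*z^-1 + radius^2*z^-2 with cos(2*pi/3) = -1/2,
    giving y[n] = x[n] - radius*y[n-1] - radius^2*y[n-2].
    """
    y1 = y2 = 0.0
    out = []
    for value in x:
        y = value - radius * y1 - radius * radius * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def filter_exon_plot(dna, radius=0.99):
    """Sum of squared filter outputs over the four binary indicator sequences."""
    dna = dna.upper()
    total = [0.0] * len(dna)
    for base in "ACGT":
        indicator = [1.0 if ch == base else 0.0 for ch in dna]
        for n, y in enumerate(resonator(indicator, radius)):
            total[n] += y * y
    return total
```

A pole radius closer to 1 narrows the pass band around 2π/3 but lengthens the transient, which is one reason the effective window differs between the filter and DFT approaches.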
Filter 1
Filter 2
Reduced computation technique in filter method
DWT to improve gene splicing techniques
Although the above gene splicing methods give results, better noise reduction and prediction accuracy are desirable. A statistically optimal null filter to improve exon prediction has been suggested by Kakumani et al. [10]. Here we have tried to improve the accuracy of a gene splicing algorithm using the Discrete Wavelet Transform (DWT). In the DWT [12], the signal is passed through a series of high-pass and low-pass filters to analyze the respective frequencies, followed by scaling. The scale is changed by upsampling and downsampling (subsampling) operations. Subsampling reduces the sampling rate by removing some of the samples of the signal. Upsampling increases the sampling rate of a signal by adding new samples. The filtering involved can be explained as follows. If a signal has a maximum frequency component of 1000 Hz, then half-band low-pass filtering removes all frequencies above 500 Hz. Recall, however, that for discrete signals the frequency ω is expressed in radians. Accordingly, the sampling frequency of the signal is equal to 2F_{m} Hz in the analog domain and 2π radians in terms of discrete radial frequency. Therefore, the highest frequency component in a discrete signal will be π radians. Hz is not appropriate for discrete signals, but is used here for clarity.
The bandwidth of the signal at every level is marked on the figure as "f". The DWT of the original signal is obtained by concatenating all coefficients starting from the last level of decomposition (the remaining two samples, in this case) and has the same number of coefficients as the original signal. Unlike the Fourier transform, the DWT does not lose the time localization of these frequencies, which is a key advantage. Good time resolution is obtained at high frequencies, and good frequency resolution at low frequencies. All algorithms mentioned in this work were implemented using MATLAB.
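The filter-and-subsample structure described above can be sketched with the Haar wavelet, which the Results section names as the wavelet used. This is an illustrative Python sketch rather than the authors' MATLAB implementation; the function names are hypothetical and the orthonormal 1/√2 scaling is an assumption.

```python
import math


def haar_step(signal):
    """One level: half-band low-pass and high-pass filtering, then subsampling by 2."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_dwt(signal, levels):
    """Concatenate coefficients starting from the last level of decomposition."""
    approx = list(signal)
    details = []
    for _ in range(levels):
        if len(approx) % 2 != 0:
            raise ValueError("length must be even at every level")
        approx, detail = haar_step(approx)
        details.append(detail)
    coeffs = approx
    for detail in reversed(details):  # coarsest detail first, finest last
        coeffs = coeffs + detail
    return coeffs
```

Each level halves the number of approximation samples, so the concatenated output has exactly as many coefficients as the input signal.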
Results
Binary method and EIIP method
The results obtained are the exon plots shown in Figures 6 and 7 respectively. Of the gene splicing algorithms mentioned here, the ones that make use of the DFT are the binary method and the EIIP method. C elegans gives the best result for a window length of 240; the exon boundaries are better defined with this window. Although a window size of 351 reduces inter-exon noise, the exon boundaries tend to shift; this result is not shown here.
Filter method
Tabulation of results. Exon locations of [GenBank:AF099922] as given in the NCBI database, and those obtained using the various DSP methods discussed here.
Exon locations obtained for C elegans
Binary method | EIIP method | Filter 1 | Filter 2 | Reduced computation with Filter 1 | Reduced computation with Filter 2 | NCBI ranges |
---|---|---|---|---|---|---|
7921-8021(100) | 7821-8021(200) | 7921-8021(100) | 7821-8021(200) | 7821-8021(200) | 7841-8021(180) | 7947-8059(112) |
9521-9821(300) | 9521-9821(300) | 9521-9821(300) | 9521-9851(330) | 9521-9871(350) | 9521-9851(330) | 9548-9879(331) |
11021-11221(200) | 11021-11221(200) | 11021-11221(200) | 10921-11221(300) | 11021-11271(250) | 10921-11221(300) | 11134-11397(263) |
12321-12521(200) | 12421-12621(200) | 12421-12621(200) | 12321-12541(220) | 12321-12521(200) | 12321-12541(220) | 12485-12664(179) |
14281-14621(340) | 14221-14621(400) | 14221-14621(400) | 14221-14621(400) | 14221-14621(400) | 14221-14621(400) | 14275-14625(350) |
DWT to improve gene splicing techniques
Figure 12 shows the detail coefficients and Figure 13 the approximation coefficients of the Haar decomposition. The final exon plot obtained after DWT treatment is given in Figure 14. Notice that the power levels corresponding to the first exon, whose half-power values in exon plot 6 (Figure 11) were almost equal to the noise levels, have been accentuated so that the exon region can no longer be mistaken for an intron region. As the desired signals corresponding to exon peaks lie in the lower region of the spectrum, spanning the 0 to π/2 range of a discrete frequency interval of -π to π, a single-level decomposition and reconstruction was sufficient here. As already mentioned, the region of the genomic sequence of C elegans analyzed has 8000 nucleotides, from 7021 to 15021. The exon plots in Figures 6 to 14 show 8000 nucleotide locations with the exons depicted as spectral peaks. The exon boundaries obtained after de-noising with the DWT are the same as those obtained with the reduced computation technique using Filter 2, because the exon plot obtained with that method was used for the subsequent wavelet decomposition and reconstruction. Hence the exon boundaries are not tabulated separately for the de-noised result.
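A single-level Haar decomposition and reconstruction of an exon plot can be sketched as below. The text does not state how the detail coefficients were treated before reconstruction, so this sketch assumes they are simply discarded (set to zero), which keeps only the lower half of the spectrum. The function name is hypothetical, and the sketch is not the authors' MATLAB code.

```python
import math


def haar_denoise_one_level(signal):
    """Single-level Haar decomposition; reconstruct from approximation coefficients only."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    # Analysis: approximation coefficients (low-pass branch, subsampled by 2).
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    # Synthesis with the detail coefficients assumed to be zero:
    # x[2i] = (a + 0) / sqrt(2) and x[2i + 1] = (a - 0) / sqrt(2).
    smoothed = []
    for a in approx:
        smoothed.extend([a * s, a * s])
    return smoothed
```

Under this assumption each pair of neighbouring samples is replaced by its mean, so sample-to-sample fluctuations are suppressed while the positions of broad exon peaks are preserved.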
Discussion
The DFT is a conventional frequency analysis tool. Instead of evaluating the DFT of the full-length sequence, the DFTs of several of its subsequences, i.e. the STFT, were computed for better time-domain resolution, sliding the window by one entry in the sequence at a time. It is well known that using the STFT increases resolution in the time domain. For the first two methods, most of the literature gives 351 as the window size, especially for C elegans. The authors have found, however, that the best window size varies with the method adopted and the DNA sequence analyzed. With the DFT used for frequency analysis, the window found to yield the better result was 240. The better result obtained with the single-peaking IIR filter over the one described in [5] can be attributed to the higher attenuation in the stop band of the filter. The use of such a filter has given less noise without using the subsequent filter bank mentioned in [5]. The DWT is a far more popular signal processing tool today, with considerable potential; however, it has been used only for noise suppression here. A review of the literature did not reveal a formal, randomized comparison of each of the engineering methods mentioned here with non-engineering approaches; hence such a comparison is not presented.
Conclusion
In this paper the authors have shown that appropriate alterations to the classical methods of exon prediction yield better results. For C elegans [GenBank:AF099922], the window size for the binary and EIIP methods has been found to be 240, whereas for the digital filter method it is 450, as against the 351 mentioned in most of the literature. The window size should therefore be selected depending on the method of analysis and on the sequence analyzed. Filter 1, as it is called in this paper, is the common filter found in the literature [3, 5]; Filter 2 has been designed by the authors. The results show that the Filter 2 design performs considerably better. We have proposed the DWT for de-noising exon plots, and it has been shown to be a suitable de-noising tool for use with exon prediction algorithms.
Declarations
Acknowledgements
I would like to thank the authorities of the Department of Electronics, Cochin University of Science and Technology, Kerala, India, for permitting me to carry out this work under the guidance of Dr. Tessamma Thomas, the second author.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.
Authors’ Affiliations
References
- Vaidyanathan PP: Genomics and Proteomics: A signal processor's tour. IEEE Circuits and Systems Magazine, Fourth quarter 2004, 6–29. 10.1109/MCAS.2004.1371584
- Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R: Prediction of probable genes by Fourier analysis of genomic sequences. CABIOS 1997, 13(3):263–270.
- Trifonov EN, Sussman JL: The pitch of chromatin DNA is reflected in its nucleotide sequence. Proceedings of the Nat Acad Sci, USA 1980, 77: 3816–3820. 10.1073/pnas.77.7.3816
- Anastassiou D: Frequency-domain analysis of bio-molecular sequences. Bioinformatics 2000, 16(12):1073–1081. 10.1093/bioinformatics/16.12.1073
- Vaidyanathan PP, Yoon BJ: Gene and exon prediction using all-pass based filters. ieeexplore.org
- Fox TW, Carreira A: A Digital Signal Processing Method for Gene Prediction with Improved Noise Suppression. EURASIP Journal on Applied Signal Processing 2004, 1: 108–114.
- George TP, Thomas T: Improvements in Gene Splicing and Gene Comparison for Anomaly Detection. In Report of The M Tech Semester IV Project work 2009 April, done at Department of Electronics. Cochin University of Science And Technology, Kerala, India.
- George TP, Thomas T: Exon Prediction Methods in Eukaryotes. In Report of The M Tech Semester III Project work 2008 December, done at Department of Electronics. Cochin University of Science And Technology, Kerala, India.
- Nair AS, Sreenadhan S: A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation, Open access hypothesis 2006, 197–202.
- Kakumani R, Devabhaktuni V, Ahmad MO: Prediction of protein coding regions in DNA using a model based approach. IEEE Signal Processing Magazine 2009.
- Proakis JG, Manolakis D: Digital Signal Processing. Fourth edition. Prentice-Hall of India, Pvt. Ltd; 2007:454–461.
- Soman KP, Ramachandran KI: Insights into Wavelets - From Theory to Practice. Second edition. Prentice-Hall of India, Pvt. Ltd; 2004.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.