Identification of exonic regions in DNA sequences using cross-correlation and noise suppression by discrete wavelet transform

Background The identification of protein coding regions (exons) in DNA sequences using signal processing techniques is an important component of bioinformatics and biological signal processing. In this paper, a new method is presented for the identification of exonic regions in DNA sequences. This method is based on the cross-correlation technique that can identify periodic regions in DNA sequences. Results The method reduces the dependency of window length on identification accuracy. The proposed algorithm is applied to different eukaryotic datasets and the output results are compared with those of other established methods. The proposed method increased the accuracy of exon detection by 4% to 41% relative to the most common digital signal processing methods for exon prediction. Conclusions We demonstrated that periodic signals can be estimated using cross-correlation. In addition, discrete wavelet transform (DWT) can minimise noise while maintaining the signal. The proposed algorithm, which combines cross-correlation and DWT, significantly increases the accuracy of exonic region identification.


Background
When the DNA sequence of a new eukaryotic organism is synthesized, the exonic (protein coding) regions must be distinguished from the introns (see Figure 1 for a schematic of genome arrangement). The protein coding regions of DNA have been observed to exhibit a period-3 property due to the non-uniform codon usage in the translation of codons into amino acids [1]. The aim of this paper is to use this property to identify exonic regions.
Several reasons for the existence of period-3 property have been presented in [2,3] and [4]. Some codons participate more in protein synthesis than others, giving rise to repetitions of a specific type of codon in the genome [4]. For example, the existence of a large number of GCA codons in the exonic regions gives greater repetition of G, C and A nucleotides in the first, second and third codon position, respectively. In other words, the G, C and A nucleotides exhibit period-3 property in the exonic regions.
Gene finding methods based on genetic characteristics, such as promoter, CpG Island, start and stop codon etc., tend to be of insufficient accuracy [5]. The characterization of coding and noncoding regions based on nucleotide statistics inside codons is described by Bernaola et al., who employed a 12-symbol alphabet to identify the borders between coding and noncoding regions [6]. Later, Nicorici and Astola segmented the DNA sequence into coding and noncoding regions using recursive entropic segmentation and stop-codon statistics [7].
The use of signal processing techniques to identify exonic regions based on the period-3 property offers new opportunities for gene finding. Tiawari used Fourier transform spectrum to achieve this goal [8]. In Tiawari's method, the discrete Fourier transform (DFT) energy at a central frequency is calculated for a fixed length window, and the window is slid across the numerical sequence. Vaidyanathan [9] identified protein coding regions using an anti-notch filter which magnified regions with period-3 property. Datta and Asif [10] presented a new algorithm using DFT theory with a Bartlett * Correspondence: Karimian@tabrizu.ac.ir 3 Faculty of Electrical and Computer Engineering, University of Tabriz, Tabriz 5166614761, Iran Full list of author information is available at the end of the article window. In another signal processing method, Akhtar [11] applied time domain algorithms, average magnitude difference function and time domain periodogram algorithms to identify period-3 property. Some gene finding methods based on digital signal processing (DSP) techniques have been developed but the accuracy of these methods is low and requires improvement.
In this paper, a new algorithm based on cross-correlation theory is presented. We show that the algorithm enhances the accuracy of the identification while reducing noise. The noisy waveform is cross-correlated with a periodic impulse train to provide the estimated signal. Discrete wavelet transform is applied to remove extra frequencies.
The remainder of the paper is organized as follows: in the Methods section, the application of the cross-correlation to obtain the periodic signal plus noise is described, together with the period-3 behaviour detection using cross-correlation theory. The final part of this section details the use of wavelet transform to remove noise. The datasets used are introduced in the Dataset section. Thereafter, evaluation measures are introduced for the measurement and comparison of various methods. Finally, in the Results and Discussion section, the results of the proposed algorithm are compared with those of the most common digital signal processing algorithms for exon prediction, in both time and frequency domains.

Cross-correlation
The discrete nature of DNA and the existence of period-3 behaviour in the exonic regions render it suitable for analysis by signal processing algorithms. We present an algorithm for the identification of the period-3 component based on cross-correlation techniques. The theory of cross-correlation theory is briefly explained below.
Correlation between two waveforms, x 1 [n] and x 2 [n], each of length N, is defined as [12]: To estimate a periodic waveform that is contaminated with noise, this waveform is cross-correlated with an adjustable template waveform; the template waveform is adjusted until the cross-correlation is maximized. The resulting template is an estimate of the signal term of the periodic waveform.
In our approach, a noisy waveform is cross-correlated with a periodic impulse train of period equal to that of the signal.
Let the signal of period N p points (N p <N) be s[n] and the noise be q[n]; therefore, the noisy waveform is S[n] = s[n]+q[n]. Periodic impulse train used for the crosscorrelation is denoted δ[n-kN p ], k = 0, 1, 2,..., N δ , where N δ is the number of impulses. Then Where j represents the lag, defined as the number of sampling points by which δ is shifted to the left. For j = 0, and remembering that δ[n-kN p ] = 0 for all n ≠ kN p Since the signal is periodic, s[n+kN p ] = s[n], and equation (3) becomes or As N δ ∞, 1 N δ N δ k=0 q[kN p ] → 0, and therefore r sδ s(0). Now, the periodic impulse train is shifted on the signal by an amount depending on j. Thus equation (5) can be written for all j's: From equation (6), it can be concluded that from which the periodic signal without noise can be extracted [12].

Identification of exonic regions
In this section, a new algorithm using the cross-correlation is proposed for the identification of exonic regions. The algorithm proceeds via the following steps: 1. DNA sequences are converted into numerical sequences.
2. FIR filter is applied to the numerical sequences representing DNA sequences.
3. Cross-correlation is applied to the filtered numerical sequences.
4. The noise effect is removed using discrete wavelet transform. Figure 2 represents these steps as a block diagram. Each step is explained in detail below:

Numerical conversion of the DNA sequences
To apply DSP techniques to the DNA sequence to find nucleotide regions exhibiting period-3 behaviour, the DNA sequence is first mapped onto the numerical sequence. The simplest conversion method maps four numerical sequences I A [n], I T [n], I C [n] and I G [n] from DNA sequences in binary format. In this mapping, the presence or absence of the respective nucleotides at the nth position is represented by '1' and '0', respectively. For example, given a section of DNA sequence ATCC-GATATTC, the binary sequence of the nucleotide A, denoted I A [n], is [10000101000]. The binary sequences for the other three nucleotides T, C and G are found similarly [13].

Applying FIR filter to the numerical sequences
After mapping the DNA sequence onto its binary numerical sequence, the binary sequence is passed through a Hamming window based FIR filter of order 8 with central frequency set to 2π/3, to emphasize period-3 property in the exonic regions. Lack of distortions in FIR filters is one reason for their preferred use over IIR filters in medical applications [12].

Applying cross-correlation theory to the numerical sequences
Most previous methods have used a window of fixed length to find the regions in DNA sequences exhibiting period-3 property. In such methods, the window length directly affects the accuracy of the identification. Typically, an appropriate window length is considered to lie within the range 240-351 (window lengths are multiples of three to reflect the codon structure). Short length windows increase noise, while long length windows tend to miss short exonic regions.
In our proposed method, the cross-correlation between the numerical DNA sequence and an impulse train of periodicity 3 (N p = 3) is calculated to identify regions in the DNA sequence with period-3 behaviour. The length of the impulse train is set at 270. The impulse train signal length plays the same role as the window length in previous approaches. Following the cross-correlation calculation, the impulse train slides across the numerical sequence by an amount j.
Different energy levels of the period-3 components exist in binary sequences M A , M T , M C and M G . Thus, the output energy spectrum is the combination of the four separate outputs In this energy spectrum, a peak corresponds to the presence of a period-3 component on that region, implying that the region is exonic.

Decreasing the noise using discrete wavelet transform
Decreasing noise increases the accuracy of exonic region identification. As seen from equation (6), a small window size, required for the detection of small exons, will not diminish noise sufficiently. Hence we apply discrete wavelet transform (DWT) to decrease the noise in the output spectrum.
DWT has been used for de-noising in various signal processing applications. In protein coding region detection, Haar wavelet has previously been employed for noise suppression [14]. Our proposed algorithm uses Dmey wavelet to remove noise and thereby increase the accuracy of the exonic region identification.
To this end, by down-sampling the output of low pass and high pass filters, samples are divided into two signals; high frequency samples (detail signals) and low frequency samples (approximation signals), each embracing half the number of samples as the original signal. Figure 3 shows this procedure operating over three levels.
The signal x[n] is passed first through the high pass filter, h[n], then through the low pass filter, g[n] [14,15].
Approximation and detail signals for the output power spectrum of the sequence F56F11.4 (GenBank access number AF099922) at positions 7021-15020 are shown in Figures 4b and 4c. By removing the detail signal and considering only the approximation signal, the extra frequencies are removed and the output power spectrum is smoothed. Therefore, the noise effect is decreased, while the accuracy of the identification is enhanced.  [16] and HMR190 [17] datasets. BG570 is a genomic test dataset of 570 single gene vertebrate sequences prepared by Burset and Guigo [16]. HMR195 comprises 195 single-gene human, mouse, and rat sequences  [17] to test and evaluate the performance of gene structure prediction algorithms.

Evaluation Measures
To accurately compare different methods, the evaluation is performed at the nucleotide level. In the identification of exonic regions using DSP techniques, some parameters are defined by changing the threshold level in the output spectrum. Those parameters which make the comparison possible are defined in this section. In the identification step, the number of nucleotides correctly predicted as exons is denoted true positive (represented by TP), while the number of nucleotides correctly predicted as introns is denoted true negative (represented by TN). Similarly, the number of intron nucleotides predicted as exon nucleotide is the false positive (FP) value, while the number of exon nucleotides predicted as intron nucleotides is the false negative (FN) value. From these four defined quantities, the sensitivity and specificity parameters are determined as follows [16]: The sensitivity S n is the proportion of exon nucleotides that have been correctly predicted as exons, and the specificity S p is the proportion of predicted exon nucleotides that actually exist in the exonic regions. These parameters alone are not suitable for evaluation because at high sensitivity, the specificity is low and vice versa. Therefore, another measure known as the approximate correlation (AC) has been defined. This parameter combines sensitivity and specificity as shown [16].
In applying DSP techniques to gene searching, other parameters have been described. A most popular evaluation measure is the Receiver Operating Characteristic (ROC) curve. By selecting different threshold levels, different values of TP for a given FP are calculated at each threshold and the ROC curve is constructed from the various TPs and their corresponding FPs. The area under the ROC curve (AUC) is used as an evaluation measure; the greater the AUC, the higher the accuracy of the gene finding algorithm [18]. Another means by which to compare identification accuracy between methods is the calculation of specificity for different sensibilities. Since the majority of genomes comprise intronic and intergenic regions, the calculation of FP can provide a useful comparison measure [19].

Threshold Selection Method
To discriminate between coding and noncoding regions, a threshold is imposed on the output power spectrum. The selection of a proper threshold can optimise the accuracy of the identification; however, the calculation of an optimum threshold value itself raises problems [20]. Therefore, in this paper, the sensitivity, specificity and approximate correlation measures are defined by changing the threshold level, to accurately compare different methods. In this section, we discuss implementation of the threshold selection.
To select an appropriate threshold, the method of Kwan et al. [21] is used. The mean and standard deviation of the period-3 values determined from a training set of exon and intron sequences are used to calculate the threshold level T, defined as: where meanP 3e and sdP 3e represent respectively the mean and standard deviation of the period-3 values obtained from the exon sequences of a training set, and meanP 3i and sdP 3i represent respectively the mean and standard deviation of the period-3 values obtained from the intron sequences of the same training set.
The 1000 multi exon genes from chromosome III of C. elegans provide data for training. The calculated threshold level is 61. This threshold level was applied to the F56F11.4 gene in chromosome III of C. elegans as shown in Figure 5.a. Clearly, at this threshold, all five regions are correctly identified as coding regions. However, there also exist small non-coding regions around position 2000 which are misidentified as coding regions. Since the characteristics of the DNA sequence can change significantly at different positions, even within the same dataset, a static threshold may yield incorrect identifications at some positions. Therefore, an adaptive threshold selection algorithm such as that described in [22] is required for exon prediction. In Tables (1), (2), (3), and (4) our proposed algorithm is compared with other algorithms over a range of thresholds.

Results and Discussion
In this section, the results of the proposed algorithm are compared with those of established methods, namely, Average Magnitude Difference Function (AMDF), Time Domain Periodogram (TDP) [11], Anti-Notch filter (AN filter) [9], Fourier Transform Spectrum (DFT) [8] and Asif [10]. As mentioned in the previous section, to evaluate and compare the results, measures such as the area under the ROC curve, the specificity and the number of false negatives in a particular sensitivity are computed. The approximate correlation measure for different threshold levels is also calculated. Our proposed algorithm is first applied to the gene sequence F56F11.4. In Figure 5a, the output power spectrum computed by equation (12) is displayed. The output power spectrum of the AN filter, TDP and DFT methods is shown in Figures  5.b, c and 5d respectively. Exonic regions that should be identified in this figure are marked as shaded regions. It should be noted that Figure 5.a is the output of equation (12) after de-noising with DWT, as shown in Figure 6.b. The strongest feature of our proposed algorithm is the noise reduction. Not only is the noise reduced by increasing the window length, but the small length exonic regions can be identified. Unlike the established methods, the accuracy of identification does not decrease by changing the window length up to a specific value. Figure 7 shows the effect of changing window length on area under the ROC curve for the F56F11.4 sequence. According to this curve, the identification accuracy of our algorithm is fixed for window lengths ranging from 150 to 510 bp, whereas that of the other tested methods depends on window length. The window length varies according to gene length. The decreasing noise effect and magnification of the period-3 component under FIR filtering causes the peaks to coincide with exon positions and enables detection of small exons, such as the first exon in F56F11.4 (shown in Figure 5.a).
In Table 1, the approximate correlation and specificity for specified sensitivities are presented for our proposed method and for the other tested methods (using gene sequence F56F11.4). We observe that our algorithm yields the highest value of both parameters.
The proposed algorithm is then applied to chromosome III of Caenorhabditis elegans [NCBI Reference Sequence: NC_003281.8], comprising 13783681 nucleotides with 8172 coding regions, and the results are again compared with the outputs of other popular methods. Different evaluation measures for the proposed algorithm, AN filter and TDP methods are shown in Table  2. Clearly, the proposed method outperforms AN filter and TDP methods. It achieves a larger area under the  ROC curve, fewer false positives and higher specificities and approximate correlation compared with AN filtering and TDP. By way of illustration, at a sensitivity of 20% the false positive output of our algorithm is 134 bp compared with 157 bp and 196 bp for AN filtering and TDP, respectively. In addition, our proposed method exhibits relative improvements of 3% and 5% respectively over AN filter and TDP methods in the approximate correlation measure.
The proposed algorithm was finally applied to the HMR195 and the BG570 datasets. The output results are shown in Table 3. With regard to the HMR195 dataset, our algorithm outputs the least number of nucleotides incorrectly identified as exons. At a sensitivity of 30%, the number of false positives in the cross-correlation method is improved by a factor of 1.24 relative to the next-best performing method, Asif. Our proposed algorithm shows relative improvements of 21%, 8.3%, 24%, 18% and 4% over the DFT, AN filter, Asif, AMDF and TDP methods respectively, in terms of the area under the ROC curve. Similar superiority of our proposed algorithm is apparent for the BG570 dataset. Table 4 shows the AC measure of our proposed method in addition to the other tested methods. At a sensitivity of 80%, the AC measure for the proposed method is 40% in the BG570 database, while that of TDP (yielding the highest AC of the established methods) is 31%. Finally, from Figures 8 and 9, illustrating the ROC's of the proposed and other methods, it is obvious that the proposed method's area under curve in both datasets is the highest of all the tested methods. This implies that our proposed algorithm is superior to the other methods at identifying exonic gene regions.

Conclusions
This paper presents a new algorithm based on crosscorrelation theory, designed to increase the accuracy of exonic region identification. The FIR filter makes it easier to identify the exonic regions. The main advantage of the proposed method is its reduced dependency on the window length as a result of the decreasing noise effect. The ability to detect small   Table 2 shows the area under the ROC curve (AUC), the number of false positives (FP), the approximate correlation (AC) and specificity (S p ) at the same sensitivity (S n ) for our proposed method and for other methods using the chromosome III of C. elegans.
exonic regions is another advantage of this algorithm. The final step of the algorithm utilizes the discrete wavelet transform to reduce noise. Compared with established time and frequency domain methods, the proposed algorithm yields improvements ranging from 4% to 41% in terms of the area under the ROC curve for the HMR195 and BG570 datasets. Our proposed method also minimises the number of nucleotides incorrectly predicted as exonic. This decrease in the number of false positives is responsible for the increase in specificity; for example, at a sensitivity of 30%, our proposed algorithm yielded 15% to 85% improvement in specificity over other tested methods.
As can be seen from Tables 3 and 4, our algorithm confers significant improvement on the accuracy of exonic region identification.    Table 4 shows the approximate correlation (AC) measure for our proposed method and for other methods for a given sensitivity (S n ) and specificity (S p ).