Many digital signal processing methods have been tried for genomic data analysis with proven results [1, 4–6, 9, 10], are but a few examples of such published work.
Exon prediction using the DFT - The binary method
This method [4] uses the binary indicator sequences obtained as described above, the DSP tool used being the DFT. As per the classical definition [11], DFT of a sequence x[n], of length N, is itself another sequence X[k], of the same length N.
The sequence X[k] provides a measure of the frequency content at "frequency" k, which corresponds to an underlying period of N/k samples. Using the above definition the U
A
[k] , U
T
[k] , U
C
[k], and U
G
[k] are the DFTs of the binary indicator sequences u
A
[n], u
T
[n], u
C
[n], and u
G
[n], respectively and then it follows that:
As already mentioned here, a = t = c = g = 1.
has been used as a measure of the total power spectral content of the DNA character string, at "frequency" k. But we've found [8] that
gives better results than the one given in equation 5. The period-3 property of a DNA sequence implies that the DFT coefficients corresponding to k = N/3 is large. Thus if we take N to be a multiple of 3 and plot S[k] then we should see a peak at the sample value k = N/3. Instead of evaluating the DFT of a full-length sequence, DFTs of several of its subsequences, (STFT) was computed for better time domain resolution by sliding the window by one entry in the sequence.
Exon prediction using the DFT - The EIIP method
In this method, described in [11], letters of the DNA sequence A, T, C, G are replaced with the electron ion interaction pseudo-potentials(EIIP) of nucleotides. If we substitute the EIIP values in x[n], we get a numerical sequence the 'EIIP indicator sequence', xe[n] which represents the distribution of the free electrons' energies along the DNA sequence, for A, T, C, G the values are 0.126, 0.1335, 0.134, 0.0806 respectively. Next, DFT is evaluated and the corresponding value of the power spectrum is
When Se[k] is plotted against k, it reveals a peak at N/3 for a coding region and no such peak is observable for a noncoding region. Rectangular windows were used in this work, for evaluating the STFT by breaking up the long sequence into subsequences.
Exon prediction using digital filters
Digital filtering methods used for identification of exons make use of the period-3 behaviour [5] coding regions. The output of an antinotch filter, with a sharp gain at the frequency 2π/3 provides this information as a function of base location. This filter [5] has an impulse response w(n) given by,
Let H(z) be a narrow band bandpass or anti-notch digital filter with a sharp passband cantered at ω0 = 2π/3. With the indicator sequence x
G
(n) taken as input, let y
G
(n) denote its output. In the coding regions, the sequence x
G
(n) is expected to have a period-3 component, which means that it has large energy in the filter passband. So the output y
G
(n) should be comparatively large in the coding regions as demonstrated in Figure 2. With similar notation for the other bases, define
A plot of this function is a preliminary indicator of coding regions. Filter 2 designed by the authors [6] gives better result than this anti-notch filter-Filter 1, called so in this paper.
Filter 1
The design [5] starts by considering second order all-pass filter, A(z).
Now consider a filter bank with two filters G(z) and H(z) defined as,
Then G(z) has the form
G(z) is a notch filter with a zero at the frequency w0.
H(z) is the required anti-notch filter with magnitude and phase responses as in Figure 3.
Filter 2
Filter 2, in this paper is an IIR single peaking filter with the peak frequency at 2π/3. This was designed using the built-in utility of MATLAB [6]. Its magnitude and phase response is shown in Figure 4.
Reduced computation technique in filter method
The number of digital filter operations can be reduced from four to one [6] by creating a new signal that encapsulates the entire DNA sequence uA+C+T+G(n) = auA(n) + cuC(n) + tuT(n) + guG(n) where a, c, t, and g are real-valued parameters.. A long DNA sequence can be approximated using a two-symbol representation, where one symbol is either A or T and the other symbol is either C or G as they are complimentary to each other. Also capitalizes on the strong periodicity exhibited by the G sequence. In this case, the signal becomes
DWT to improve gene splicing techniques
The above methods of gene splicing, though give results, better reduction in noise and accuracy of prediction is desired. The statistically optimal null filter to improve prediction of exons has been suggested by Kakumani et. al. [10]. Here we've tried to improve the accuracy of a gene splicing algorithm using the Discrete Wavelet Transform (DWT). In DWT [12], the signal is passed through a series of high and low pass filters to analyze the respective frequencies followed by a scaling. The scale is changed by upsampling and downsampling (subsampling) operations. Subsampling reduces the sampling rate, or removes some of the samples of the signal. Upsampling increases the sampling rate of a signal by adds new samples. Filtering involved is explained as follows. If a signal has a maximum of 1000 Hz component, then half band low-pass filtering removes all the frequencies above 500 Hz. However it is to be recalled that with discrete signals, frequency ω is expressed in terms of radians. Accordingly, the sampling frequency of the signal is equal to 2F
m
, Hz, in the analog domain and 2π radians in terms of discrete radial frequency. Therefore, the highest frequency component in a discrete signal will be π radians. Hz is not appropriate for discrete signals, but used for clarity of the idea.
Decomposition of the signal into different frequency bands is obtained by successive high pass and low pass filtering of the time domain signal. The original signal x[n] is first passed through a halfband, highpass filter g[n] and a lowpass filter h[n]. After the filtering, half of the samples can be eliminated ie. subsampled by 2, by discarding every other sample. This constitutes one level of decomposition, mathematically expressed as:
h[n] and g[n] are the sample sequences or impulse responses and yhigh[k] and ylow[k] are the outputs of the highpass and lowpass filters, respectively, after subsampling by 2. This decomposition halves the time resolution since only half the number of samples now characterizes the entire signal. However, this operation doubles the frequency resolution, since the frequency band of the signal now spans only half the previous frequency band. The above procedure, known as subband coding, can be repeated for further decomposition. At every level, the filtering and subsampling will result in half the number of samples (and hence half the time resolution) and half the frequency band spanned (and hence double the frequency resolution). Figure 5 illustrates this procedure.
The bandwidth of the signal at every level is marked on the figure as "f". The DWT of the original signal is obtained by concatenating all coefficients starting from the last level of decomposition (remaining two samples, in this case) and will have the same number of coefficients as the original signal. The difference of this from the Fourier transform is that the time localization of these frequencies will not be lost, a key advantage. Good time resolution is obtained at high frequencies, and good frequency resolution at low frequencies. All algorithms mentioned in this work were implemented using MATLAB.