- Methodology article
- Open Access

# Nonlinear preprocessing method for detecting peaks from gas chromatograms

- Byonghyo Shim
^{1}, - Hyeyoung Min
^{2}and - Sungroh Yoon
^{3}Email author

**10**:378

https://doi.org/10.1186/1471-2105-10-378

© Shim et al; licensee BioMed Central Ltd. 2009

**Received:**17 October 2008**Accepted:**18 November 2009**Published:**18 November 2009

## Abstract

### Background

The problem of locating valid peaks from data corrupted by noise frequently arises while analyzing experimental data. In various biological and chemical data analysis tasks, peak detection thus constitutes a critical preprocessing step that greatly affects downstream analysis and eventual quality of experiments. Many existing techniques require the users to adjust parameters by trial and error, which is error-prone, time-consuming and often leads to incorrect analysis results. Worse, conventional approaches tend to report an excessive number of false alarms by finding fictitious peaks generated by mere noise.

### Results

We have designed a novel peak detection method that can significantly reduce parameter sensitivity, yet providing excellent peak detection performance and negligible false alarm rates from gas chromatographic data. The key feature of our new algorithm is the successive use of peak enhancement algorithms that are deliberately designed for a gradual improvement of peak detection quality. We tested our approach with real gas chromatograms as well as intentionally contaminated spectra that contain Gaussian or speckle-type noise.

### Conclusion

Our results demonstrate that the proposed method can achieve near perfect peak detection performance while maintaining very small false alarm probabilities in case of gas chromatograms. Given the fact that biological signals appear in the form of peaks in various experimental data and that the propose method can easily be extended to such data, our approach will be a useful and robust tool that can help researchers highlight valid signals in their noisy measurements.

## Keywords

- False Alarm
- False Alarm Rate
- Wavelet Coefficient
- Peak Detection
- Continuous Wavelet Transform

## Background

In order to filter out false peaks and selectively detect valid ones, several preprocessing techniques such as thresholding or lowpass filtering (LPF) have been employed [3]. However, in many situations, these approaches either filter out valid peak information or fail to reject false peaks. Additionally, choosing appropriate parameters (e.g., cutoff frequency, rejection threshold) is mostly based on trial-and-error, so it is not uncommon to manually adjust these parameters. Indeed, since the peak detection is a blind problem and no prior knowledge on the information is given, tuning parameters has been a nontrivial task causing high false alarm and misdetection rates.

In this paper, we propose a novel peak detection method that is much less sensitive to parameter choices than conventional techniques, yet produces a very robust and accurate detection performance on noisy experimental data. The key feature in our method is a nonlinear preprocessing to suppress the noise and to strengthen the peak signal. Two major ingredients achieving this objective are geometric mean filtering (GMF) and wavelet domain denoising. A function of multiplying the observation and the reference signal for achieving correlation gain has been widely used in communication and signal processing [4]. Since no desired reference is available for the peak detection problem, the shifted version of experimental signal itself can serve as an approximate reference to obtain a coherent amplification of the peak signal.

Furthermore, since the output of GMF suppresses randomly fluctuating noise better than the conventional arithmetic-mean based filtering, the proposed GMF effectively differentiates the peak information and the noise. Once the GMF is finished, further cleaning of noise i.e., a denoising operation, is performed in the wavelet domain. Due to the increased frequency resolution of wavelet transform in low frequencies compared to the Fourier transform [5], peak information frequencies can easily be localized by their energy, and thus the noise and unwanted signal can nicely be separated from the peak. It is worth pointing out that there is a close relationship between the proposed approach and the denoising studies in image processing [6, 7] in the sense that both filter out noise in the wavelet domain. While the image denoising needs to consider the low-energy wavelet coefficients for preserving the image shape, such is unnecessary for the peak detection problem since the unique goal is the identification of peak points. In fact, since the peak information is rarely located in wavelet coefficients with small energy, cleaning of those coefficients will rather help, in particular, to suppress sharp and narrow-shaped false peaks so-called *speckles*. When the GMF and wavelet domain denoising are finished, we clearly observe the suppression of the noise magnitude. Due to this clear distinction between the signal and noise, employing a nonlinear operation (amplification followed by slicing) can remove substantial amount of noise and hence facilitate the peak collection operation.

We test the proposed method with gas chromatography data and show that the proposed approach exhibits excellent peak detection performance with small false alarm probabilities. Further, we demonstrate the robustness of the proposed method using the extended scenario in which artificial noise, viz. speckles and Gaussian noise, is added into the data.

## Proposed method

### Peak signal model

where *g* [*n*] and *v* [*n*] are the peak and noise signal, respectively. In order to design a systematic peak detection method, we should rely upon minimum guidelines on the characteristics of peak signal to detect.

For this purpose, we employ the following assumptions on the peak signal *g* [*n*] and peak point *n*_{
p
}.

**A.1)** *g* [*n*] is gradually changing in a local interval *I*_{
p
}= {*n*_{
p
}- *δ*, ⋯, *n*_{
p
}+ *δ*} around peak point *n*_{
p
}. That is, |*g*[*n*_{1}] - *g*[*n*_{0}]| <*ϵ*_{0} for adjacent values of *n*_{1}, *n*_{0} ∈ *I*_{
p
}, where *ϵ*_{0} is a pre-defined small constant.

**A.2)** The magnitude of peak signal is highest in *I*_{
p
}. In other words, *n*_{
p
}, a valid peak point, should satisfy
where
.

**A.3)** *g*[*n*] is monotonically increasing in the local interval [*n*_{
p
}- *δ* *n*_{
p
}) and monotonically decreasing in (*n*_{
p
}*n*_{
p
}+ *δ*].

Notice that, since we cannot discriminate the signal *g* [*n*] and noise *v* [*n*], the assumptions we described in **A.1)** and **A.3)** are rather conceptual. Further, we have no clue on *ϵ*_{0} and *δ*, which are observation dependent parameters. In spite of this, these assumptions play an important role in our preprocessing since they provide useful guidance on the algorithm design.

### Geometric mean filtering (GMF)

*k*+ 1)-tap GMF [

*n*] for the input sequence

*x*[

*n*] is defined as

For filtering out the noise from the data, the correlation between *x* [*n*] and shifted version *x* [*n* - *i*] is exploited. As an example, consider the 3-tap (*k* = 1) GMF filter. For notational convenience, we denote the value of *g* [*n*] at *n* = *n*_{0} by *g*_{0} and *v* [*n*] at *n* = *n*_{0} - 1, *n*_{0}, and *n*_{0} + 1 by *v*_{-1}, *v*_{0}, and *v*_{1}, respectively.

*g*[

*n*] is gradually changing by

**A.1)**, the observations at

*n*

_{0},

*n*

_{0}- 1, and

*n*

_{0}+ 1 are

*ϵ*

_{ i }<

*ϵ*

_{0}for

*i*= 1, 2. The GMF of

*r*[

*n*

_{0}- 1],

*r*[

*n*

_{0}], and

*r*[

*n*

_{0}+ 1] is

*v*[

*n*], the coefficient associated with , which is essentially the sum of zero mean random variables, gets close to zero. Denoting this term by

*δ*, (6) can be written as

*δ*decreases and ( ) comes close to unity, and thus [

*n*] well approximates

*g*

_{0}. Figure 3 illustrates the 5-tap GMF filter output for gradually changing function .

*g*[

*n*] ~

*g*

_{0}is dominant over

*v*[

*n*] in

*I*

_{ p }by the assumptions

**A.1)**and

**A.2)**, then

*y*[

*n*] ~

*g*

_{0}so that the arithmetic mean is similar to the geometric mean by the arithmetic-geometric mean inequality. However, for the samples in ,

*g*[

*n*] is not dominant any more, and the geometric mean value becomes noticeably smaller than the arithmetic mean value in . Figure 4 illustrates this behavior for a randomly generated sequence. Owing to the function of lowpass filtering as well as the suppression of randomly fluctuating noise, the GMF output becomes more amenable to the subsequent denoising operation.

### Wavelet domain denoising

For further suppression of GMF filtered output, the wavelet transform is employed. The wavelet transform lends itself to separation between the signal and noise thanks to the inherent use of multi-resolution techniques by which different frequencies are analyzed with different resolutions [5]. In fact, it is far more efficient for extracting peak signals than the Fourier transform that only provides a constant resolution.

**A.1)**, the filtered noise signal [

*n*] is now limited globally, i.e., | [

*n*]| <

*ϵ*

_{0}. Thus, the noise power is clearly insignificant compared to the signal power and the signal and noise spectrum become more distinct in the transformed domain in their magnitude. In order to eliminate [

*n*], we take the wavelet transform of [

*n*] and then do the thresholding of the spectrum in the wavelet domain. Due to the enhanced frequency resolution in low frequencies, wavelet coefficients of the peak signal, mostly located in low frequencies, are well localized, and the wavelet coefficient of [

*n*] becomes

*k*),

*G*(

*k*), and (

*k*) are the wavelet coefficients of [

*n*],

*g*[

*n*], and [

*n*], respectively. By applying the GMF function, the peak signal energy is compacted into a few wavelet coefficients, and the noise contributes mainly to the rest of insignificant coefficients. Hence, to denoise , we use the simplified assumption that low-energy coefficients are mostly due to the noise, whereas high-energy coefficients are mainly from the peak signal. A proper denoising strategy in this model is 1) to remove the low-energy wavelet coefficients substantially, and 2) to retain or modify slightly the high-energy coefficients. Although these assumptions might not strictly be true, they are sufficient for our purpose since the effect of small loss in the peak signal energy is minimal. For denoising from , we use the soft-threshold estimator [9] given by

*I*

_{ α }is the set of index

*α*satisfying , and |

*A*| is the cardinality of set

*A*. As a trivial case, if

*α*= 100, then

*T*

_{ h }returns to the mean absolute. The reason why we use the percentile mean is to control the threshold so that no valid peak signal is being erased. If some peak signal magnitude is very high, then the mean absolute value will also be large so that valid peaks with relatively small magnitude might be erased. We observe from the empirical test that 90 ~95 percentile generates satisfying results. Due to the removal of low-energy wavelet coefficients (| (

*k*)| ≤

*T*

_{ h }), the time-domain noise signal is suppressed substantially, as shown in Figure 5, and the signal after the inverse wavelet transform is safely modeled as

where .

### Nonlinear signal amplification

As a final step of the preprocessing stage, nonlinear signal amplification of
[*n*] is applied. Nonlinear signal amplification refers to the magnification of peak and the annihilation of noise signals followed by slicing. This operation is useful since it provides the last cleanup of the residual noise before the final peak collection step.

Since the peak collection is done via the zero-crossing operation after the derivative, of note is that the differential operator might amplify abrupt noise fluctuations, even though the magnitude of noise is very small. In fact, it is one of the primary reasons for high false alarm rates.

*x*) =

*x*for

*x*≥ 0 and 0 otherwise. Clearly, will amplify [

*n*] greater than

*E*[ ] and suppress [

*n*] smaller than

*E*[ ]. Because for the pure noise (when

*g*[

*n*] = 0) and also , by the proper choice of

*c*

_{0}, the argument inside the pos function becomes negative, ending up being zero after this pos function. For the choice of

*c*

_{0}, it would be ideal to use . However, since for most cases, 0.1 ≤

*c*

_{0}≤ 1 would be enough as a simple choice. In Figure 6, we plot a result of Eg. (13) for simulated data with

*c*

_{0}= 1. As clearly shown in the figure, the noise with insignificant magnitude, which is mostly noise in real application, is removed thereby suppressing the false alarm rate significantly.

### Peak collection

In the absence of noise, the peak detection problem is equivalent to the problem of finding local maxima, and hence the points *t*_{0} satisfying
become the solutions for continuous-time signals *f*(*t*) [8]. In the case of discrete sequence *f* [*n*], the difference Δ_{f [n]}= *f* [*n*] - *f* [*n* - 1] is being used instead of the derivative. Also, since no point satisfying Δ_{f [n]}= 0 might exist, zero-crossing detection is indispensable. That is, if Δ_{f [n]}> 0 and Δ_{f [n+1]}< 0, *n* or *n* + 1 is chosen as the peak point depending on their magnitude.

## Results and Discussion

### Test setup

*r*[

*n*] as and that of preprocessed data [

*n*] as , the detection probability (

*P*

_{ D }) and the modified false alarm probability (

*P*

_{ FA }) are defined as

#### Real GC data

*μ*m film thickness; Hewlett Packard, Palo Alto, USA) and flame-ionization detector (FID). Oven temperature program starting with a 15°C/min ramp from 150 to 180°C, followed by a ramp to 240°C at 5°C/min was employed. Helium was used as carrier gas. The flow rate was maintained at 3 ml/min for separations by using a mass flow controller, and the head pressure was set to 42.9 psi. The inlet temperature of the GC was 280°C. The injection volume was 0.2

*μ*l. The temperature of the FID was set to 280°C. System control and data evaluation were done using Autochro-3000 Data System. The minimum detection level (MDL) of the FID employed is less than 3.2 carbon pg/sec (dodecane) and the sensitivity is 19 mCoulomb/sec. It also has a linear dynamic range of 10

^{7}. The chromatograms used were measured at 0.00085 min intervals and each sample was analyzed for 14 min.

#### Simulated GC data

For checking the performance in noisier conditions, we additionally test the cases in which Gaussian and speckle-type random noise signals are intentionally added into the GC data used in [1]. As mentioned in the original article, samples were analyzed on a Varian 3400 GC (Varian Instrument, Palo Alto, CA) equipped with a 100 m × 0.32 mm SP2380 (Supelco Inc., Bellefonte, PA) capillary column and flame-ionization detector (FID). Helium was used as carrier gas.

#### Availability

The source code of the proposed method and the data used for validation are available at http://dna.korea.ac.kr/pub/gcpeak/.

### Experimental results

*x*-axis in the curves indicates retention time (unit in minutes) and

*y*-axis represents intensity (unit in millivolts). In the plot, 11 true peaks in the pre-determined positions are marked by black circles and the peak positions detected by each technique by red diamonds. Although the LPF-based processing shows good peak detection performance, we observe that this method detects too many invalid peaks resulting in high false peak probabilities. Since the thresholding in the wavelet domain without the proper distinction of peak signal and noise removes too much energy (including the energy of

*g*[

*n*]), the thresholding-based approaches tend to miss many valid peaks. Hence, even though the false alarm probability is moderate, this method is not desirable due to its poor detection performance. Interestingly, the proposed method is the best among the tested methods and displays the perfect performance (

*P*

_{ D }= 1 and

*P*

_{ FA }= 0).

*P*

_{ D }and those in horizontal axis represent 10 × P

_{ FA }. Clearly, it would be the best if the data is located in the upper left corner (

*P*

_{ D }= 1 and

*P*

_{ FA }= 0). Notice that, since the number of data samples is much larger than that of valid peaks,

*P*

_{ FA }is very small even though the absolute number is considerable. To take this point into account, we display the scaled false alarm probability (10 ×

*P*

_{ FA }) in the horizontal axis of Figures 8, 9, 10, 11, 12, 13 and 14.

As shown in Figure 8, CWT and thresholding techniques show slightly better false alarm probability than the proposed method but their detection performance is unsatisfactory. Whereas, the proposed method shows excellent *P*_{
D
}while maintaining small *P*_{
FA
}.

Next, we consider a scenario where speckle noise is added into GC data. In fact, this scenario models the instrumental noise from experimental devices or contamination due to impurities. To this end, we added speckle noise to the GC data from [1], which was described earlier in the 'Simulated GC data' section. In our experiments, we test three cases where the number of speckles are |
|, 2|
|, and 4|
|, respectively. In addition, for some selected speckle points, we add the random noise chosen from the distribution *N*(*E* [*r*]; Var(*r*)). For each test case, we generate 10 random files. As shown in Figures 9, 10 and 11, we observe the general tendency that *P*_{
D
}is getting worse as the noise level increases. The LPF method has a clear benefit in *P*_{
D
}over the thresholding method. However, as the number of noise and the filter bandwidth increase, the performance degradation of the LPF technique becomes conspicuous. Whereas, the proposed method and CWT are insensitive to the speckle noise and provide excellent results. In particular, the proposed method is almost insensitive to the noise (*P*_{
D
}= 1 for all 30 cases) and shows the best performance for all the tested cases.

Finally, we test a scenario where the GC data is corrupted by Gaussian noise. This scenario models the thermal noise of instrumental devices or temperature variations. We added Gaussian noise into the GC data from [1] described in 'Simulated GC data' above. In our experiments, we test three cases (0.1%, 0.2% and 0.5% of *E* [|*y*|^{2}] as noise power) and generate 10 random files for each case. Due to the corruption of the whole data, as shown in Figures 12, 13 and 14, the performance degradation is in general much severer than that in the speckle type scenario. In particular, we observe degradation in *P*_{
D
}of CWT as well as *P*_{
FA
}of the LPF method. In contrast, the proposed method maintains the near perfect performance at 0.1% and 0.2% noise additions. Even at 0.5% noise addition, the proposed method outperforms all tested methods, providing perfect detection performance and small false alarm rates.

## Discussion

In this work, we addressed the problem of detecting peaks from noisy experimental data in a robust manner. The key ingredients of our approach to achieving this goal are geometric mean filtering (GMF), wavelet domain denoising, and nonlinear signal amplification. The GMF technique conducts the first round suppression of noise, and the wavelet domain denoising then performs the filtering of low-magnitude and high frequency noise. In the nonlinear signal amplification stage, noise clean-up is achieved by zeroing out the residual noise. From our experimental studies on the GC data, we observed that the proposed method shows near perfect peak detection and false alarm performance and that it is the best among the compared methods. Although the setup in this study is primarily for the GC data (including the extended tests with Gaussian and speckle-type contaminated GC data), we could observe that the proposed method can be extended to other types of experimental data as well. However, for the detection of speckle-type peaks appearing in, for instance, mass spectrometry experiments, the assumptions of the proposed method **(A.1)** and **A.3)**) need to be modified. In fact, regarding this extension, there are interesting directions worth pursuing. Our GMF relies only on data with integer delays. Hence, the result might not be desirable when the peak duration is very short, as in the case of speckles. In this case, it might be better to use non-integer delays by applying non-integer interpolated GMF. In addition, when the contamination level is severe, it would be a reasonable choice to use the cascade of the proposed method and supervised learning by which additional reliability might be inserted into the peak detection.

## Conclusion

We have devised a computational method for detecting signals appearing in the form of peaks from noisy experimental observations. Compared with previous techniques, the proposed method is unique in the sense that (1) it requires much less efforts to tune algorithm parameters and (2) its false detection rate is significantly lower, yet maintaining near perfect peak detection performance. We tested the proposed technique extensively with actual data obtained from gas chromatography experiments. In addition, in order to demonstrate the robustness of our approach, we deliberately incorporated two types of noise (speckle and Gaussian) into the original data and tested our technique with the data. In all the experimental studies we conducted, the proposed technique outperformed the alternatives we tried in terms of true and false positive rates and sensitivity to parameters. Given the fact that researchers are very much interested in isolating meaningful signals accurately from their experimental data in an automated and robust manner, we believe that the proposed method can lead to a significant contribution to the field.

## Declarations

### Acknowledgements

This work was supported in part by the second BK21 project and in part by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2009-0060369 and No. 2009-0079888). We would like to thank Prof. Jungbae Kim, Seung-hyun Jun, Seunghak Yu, Jay S. Lee and Guangtao Ge for their help.

## Authors’ Affiliations

## References

- Adlof R, Duval S, Emken E: Biosynthesis of Conjugated Linoleic Acid in Humans.
*Lipids*2000, 35(2):131–135. 10.1007/BF02664761View ArticlePubMedGoogle Scholar - Das R, Kudaravalli M, Jonikas M, Laederach A, Fong R, Schwans JP, Baker D, Piccirilli JA, Altman RB, Herschlag D: Structural inference of native and partially folded RNA by high-throughput contact mapping.
*PNAS*2008, 105(11):4144–4149. 10.1073/pnas.0709032105PubMed CentralView ArticlePubMedGoogle Scholar - Antoniou A:
*Digital Filters - Analysis, Design, and Applications*. 2nd edition. New York: McGraw Hill; 1993.Google Scholar - Scharf LL:
*Statistical Signal Processing*. Addison Wesley; 1991.Google Scholar - Daubechies I: The wavelet transform, time-frequency localization and signal analysis.
*IEEE Trans Inform Theory*1990, 36: 961–1005. 10.1109/18.57199View ArticleGoogle Scholar - Donoho DL, Johnstone IM: Ideal spatial adaptation via wavelet shrinkage.
*Biometrika*1994, 81: 425–455. 10.1093/biomet/81.3.425View ArticleGoogle Scholar - S G, Chang BY, Vetterli M: Spatially adaptive wavelet thresholding with context modeling for image denoising.
*IEEE Trans on Image Processing*2000, 9: 1522–1531. 10.1109/83.862630View ArticleGoogle Scholar - Bertsekas DP:
*Nonlinear Programming*. 2nd edition. Massachusetts: Athena Scientific; 1999.Google Scholar - Donoho DL, Johnstone IM: Threshold selection for wavelet shrinkage of noisy data.
*Annual Conf. of the IEEE Engineering in Medicine and Biological Society*1994, 24a-25a.View ArticleGoogle Scholar - Du P, Kibbe WA, Lin SM: Improved peak detection in mass spectrum by incorporating continuous wavelet transform-based pattern matching.
*Bioinformatics*2006, 22(17):2059–2065. 10.1093/bioinformatics/btl355View ArticlePubMedGoogle Scholar - A Cohen ID, Feauveau JC: Biorthogonal bases of compactly supported wavelets.
*Communications on Pure and Applied Math*1992, 45: 485–560. 10.1002/cpa.3160450502View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.