Identification of biomarkers from mass spectrometry data using a "common" peak approach
 Tadayoshi Fushiki^{1},
 Hironori Fujisawa^{1} and
 Shinto Eguchi^{1}
DOI: 10.1186/1471-2105-7-358
© Fushiki et al; licensee BioMed Central Ltd. 2006
Received: 12 March 2006
Accepted: 26 July 2006
Published: 26 July 2006
Abstract
Background
Proteomic data obtained from mass spectrometry have attracted great interest for the detection of early-stage cancer. However, as mass spectrometry data are high-dimensional, identification of biomarkers is a key problem.
Results
This paper proposes the use of "common" peaks in data as biomarkers. Analysis is conducted as follows: data preprocessing, identification of biomarkers, and application of AdaBoost to construct a classification function. Informative "common" peaks are selected by AdaBoost. AsymBoost is also examined to balance false negatives and false positives. The effectiveness of the approach is demonstrated using an ovarian cancer dataset.
Conclusion
Continuous covariates and discrete covariates can be used in the present approach. The difference between the result for the continuous covariates and that for the discrete covariates was investigated in detail. In the example considered here, both covariates provide a good prediction, but it seems that they provide different kinds of information. We can obtain more information on the structure of the data by integrating both results.
Background
Mass spectrometry is being used to generate protein profiles from human serum, and proteomic data obtained from mass spectrometry have attracted great interest for the detection of early-stage cancer (for example, [1–3]). Recent advancements in proteomics come from the development of protein mass spectrometry. Matrix-assisted laser desorption and ionization (MALDI) and surface-enhanced laser desorption/ionization (SELDI) mass spectrometry provide high-resolution measurements. Mass spectrometry data are ideally continuous data. As with microarray data, a method is required that can deal with data that are high-dimensional but have a small sample size. An effective methodology for identifying biomarkers in high-dimensional data is thus an important problem.
Some methodologies have been proposed for the identification of biomarkers from ideally continuous mass spectrometry data. One approach is to use peaks in the data to identify biomarkers. Yasui et al.[4] and Tibshirani et al.[5] adopted this approach. Another approach is based on binning of data. Yu et al.[6] and Geurts et al.[7] analyzed mass spectrometry data after binning the data. The methodology presented in this paper adopts the former approach, since peaks in mass spectrometry data are considered to represent biological information. Our idea is that "common" peaks within the sample might contain useful information. This means that a peak seen for only one subject may be noise, whereas a peak exhibited by many subjects might be useful. In this study, an ovarian cancer dataset was analyzed as follows: (1) preprocessing, (2) peak detection, (3) identification of biomarkers, (4) classification. AdaBoost was used to construct a classification function. AsymBoost was also examined for balancing the false negatives and the false positives. The effectiveness of the approach is evaluated using validation data.
The proposed approach is closely related to that of Yasui et al.[4]. In [4], biomarkers were specified through classification by AdaBoost. The present approach differs in that "common" peaks are extracted before classification to specify biomarkers. By specifying biomarkers before classification, the dimension of the covariates used in classification becomes smaller, which we consider preferable. Whereas Yasui et al.[4] used discrete covariates, both continuous covariates and discrete covariates can be used in the present approach, depending on the situation. Furthermore, we recommend that the results obtained using discrete covariates be compared with those obtained using continuous covariates. In the example considered here, the two kinds of covariates yield different kinds of informative features, so that more information on the structure of the data can be obtained by using both.
Results and discussion
Dataset
The range of m/z-values in the dataset is approximately [0, 20000]. However, the frequency of the peaks is too high in the interval [0, 1500], and in some cases it is difficult to derive information from peaks in that interval. Therefore, the dataset is analyzed only in the interval [1500, 20000], as in [4].
As is often the case in statistical learning [8], the dataset is divided into two sets: a training dataset that consists of 73 controls and 130 ovarian cancer patients, and a test dataset that consists of 18 controls and 32 ovarian cancer patients. The sizes of the training and test datasets are denoted by N and n, respectively (N = 203 and n = 50). The method proposed in this paper is trained using the training dataset, and the performance of the trained scheme is checked using the test dataset.
Preprocessing
Proteomic data obtained from mass spectrometry are often inaccurate in some senses. For example, the mass/charge axis shift is a big problem in many cases [9, 10]. Therefore preprocessing of the data is very important. Preprocessing methods have recently been proposed by Wong et al.[9] and Jeffries [10].
In this paper, preprocessing of the dataset is performed using SpecAlign [9] as follows: (i) subtract baseline, (ii) generate spectrum average, (iii) spectra alignment (peak matching method). It should be noted that it is difficult to align spectra perfectly even if an alignment algorithm is used. This problem is revisited in the section Identification of biomarkers.
Peak detection
The peak detection rule of Yasui et al.[4] is adopted here. An m/z point is regarded as a peak if it takes the maximum value in its k-nearest neighborhood. If k is small, a point is easily recognized as a peak. An appropriate k can be selected by examining several values of k, as done in Yasui et al.[4]. We empirically set k = 10. In this study, a slightly small k is used, since only the "common" peak is considered as a biomarker.
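The rule above can be illustrated with a minimal sketch. It assumes that the "k-nearest neighborhood" of a point means the k sampled points on each side of it (the text does not say whether k counts one side or both), and the function name `detect_peaks` is hypothetical.

```python
import numpy as np

def detect_peaks(intensity, k=10):
    """Return indices whose intensity is the maximum of its neighborhood.

    Assumption: the neighborhood of index i is the k sampled points on
    each side of i, clipped at the ends of the spectrum.
    """
    intensity = np.asarray(intensity, dtype=float)
    n = len(intensity)
    peaks = []
    for i in range(n):
        window = intensity[max(0, i - k):min(n, i + k + 1)]
        # Require a non-flat window so constant stretches are not peaks.
        if intensity[i] == window.max() and intensity[i] > window.min():
            peaks.append(i)
    return np.array(peaks, dtype=int)

# Toy spectrum: a smaller k accepts more points as peaks.
signal = [0, 1, 3, 1, 0, 0, 2, 5, 2, 0, 0]
peaks_small_k = detect_peaks(signal, k=2)
peaks_large_k = detect_peaks(signal, k=5)
```

In this sketch, tied maxima on a plateau are all reported as peaks; a stricter rule would keep only one of them.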
Identification of biomarkers
Suppose that some individuals have a peak at a certain m/z-value, m*. It is then expected that there exists a protein related to the ion corresponding to m*. Therefore, m* may be a biomarker that can be used to judge whether an individual is affected or not. But a peak exhibited by only one subject may just be noise. The peaks "commonly" exhibited by many subjects are thus candidates for biomarkers. However, there remains the problem that the m/z-values are not perfectly aligned in general, so that the above idea cannot be applied directly. A method for identifying such "common" peaks that overcomes this problem of imperfect alignment is derived in the following.

In general, peaks cannot be aligned perfectly even if some alignment algorithm is applied, as stated in the section of Preprocessing. In the "average of peaks," even if peaks are not aligned perfectly, they can be added because they have "width" σ (p_{i,j}) (Figs. 2 (d) and 2 (e)).

Another possible approach is to use the average of intensities (Fig. 2 (b)). However, we think that the "average of peaks" is more effective in mass spectrometry data (Fig. 2 (e)). "Common" peaks with small intensities can be found easily in Fig. 3 (b), whereas it is difficult to find such "small common" peaks in Fig. 3 (a). Furthermore, the difference between controls and ovarian cancer patients can be seen more clearly in the "average of peaks" than in the average of intensities (Fig. 4).
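The idea of adding peaks that have a "width" can be sketched as follows. This is an illustration under stated assumptions, not the exact definition used in this study: each detected peak p is assumed to be replaced by a Gaussian bump of height one and width σ(p) proportional to p, and biomarkers are assumed to be the local maxima of the resulting average that exceed the threshold h_th. The functions `average_of_peaks` and `common_peaks` and the parameter `rel_width` are hypothetical.

```python
import numpy as np

def average_of_peaks(grid, peak_lists, rel_width=0.001):
    """Average over subjects of smoothed peak indicators on an m/z grid.

    Assumptions: Gaussian bump shape and sigma(p) = rel_width * p.
    peak_lists holds one list of peak m/z-values per subject.
    """
    grid = np.asarray(grid, dtype=float)
    total = np.zeros_like(grid)
    for peaks in peak_lists:
        for p in peaks:
            sigma = rel_width * p
            total += np.exp(-0.5 * ((grid - p) / sigma) ** 2)
    return total / len(peak_lists)

def common_peaks(grid, avg, h_th):
    """m/z-values at interior local maxima of avg that exceed h_th."""
    grid = np.asarray(grid, dtype=float)
    avg = np.asarray(avg, dtype=float)
    is_max = (avg[1:-1] > avg[:-2]) & (avg[1:-1] >= avg[2:])
    idx = np.where(is_max & (avg[1:-1] > h_th))[0] + 1
    return grid[idx]

# Three subjects share a slightly misaligned peak near m/z 2000;
# only one subject has a peak at m/z 2050.
grid = np.linspace(1900.0, 2100.0, 2001)
avg = average_of_peaks(grid, [[1999.0], [2000.0], [2001.0, 2050.0]])
strict = common_peaks(grid, avg, h_th=0.5)  # keeps only the shared peak
loose = common_peaks(grid, avg, h_th=0.2)   # also keeps the lone peak
```

Because the three misaligned peaks overlap within their widths, they reinforce one another and produce a single high maximum, whereas the peak seen in one subject contributes only 1/3 to the average.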
There are many ways to reduce the number of biomarkers determined by the above procedure. One way is to select biomarkers that are effective in classification. In this study, however, we simply used the biomarkers obtained by the above procedure.
The covariates are extracted from the data as discrete variables and/or continuous variables. Let m_{1}, m_{2}, ⋯ be the m/z-values of the biomarkers. The discrete covariate is obtained by searching for a peak within a window around the biomarker, i.e.,
x_{j} indicates whether a peak exists within [(1 − ρ)m_{j}, (1 + ρ)m_{j}].
The continuous covariate is the maximum value of the intensity within the window, i.e.,
x_{j} = the maximum value of the intensity within [(1 − ρ)m_{j}, (1 + ρ)m_{j}].
In this study, ρ = 0.002.
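A minimal sketch of extracting both kinds of covariates for one spectrum is given below. The 0/1 coding of the discrete covariate and the value 0.0 returned for an empty window are assumptions made for illustration, and the function name `extract_covariates` is hypothetical.

```python
import numpy as np

def extract_covariates(mz, intensity, peak_idx, biomarkers, rho=0.002):
    """Return (discrete, continuous) covariate vectors for one spectrum.

    mz, intensity: sampled spectrum; peak_idx: indices of detected peaks;
    biomarkers: m/z-values m_j. The window is [(1 - rho) m_j, (1 + rho) m_j].
    """
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    peak_mz = mz[np.asarray(peak_idx, dtype=int)]
    discrete = np.zeros(len(biomarkers), dtype=int)
    continuous = np.zeros(len(biomarkers))
    for j, m in enumerate(biomarkers):
        lo, hi = (1 - rho) * m, (1 + rho) * m
        # Discrete: 1 if some detected peak lies in the window (assumed coding).
        discrete[j] = int(np.any((peak_mz >= lo) & (peak_mz <= hi)))
        # Continuous: maximum intensity in the window (0.0 if it is empty).
        in_window = (mz >= lo) & (mz <= hi)
        if in_window.any():
            continuous[j] = intensity[in_window].max()
    return discrete, continuous

# Toy spectrum sampled at integer m/z from 1990 to 2010.
mz = np.arange(1990.0, 2011.0)
intensity = np.zeros_like(mz)
intensity[11] = 5.0   # detected peak at m/z 2001
intensity[15] = 1.5   # bump at m/z 2005 that was not detected as a peak
d, c = extract_covariates(mz, intensity, peak_idx=[11], biomarkers=[2000.0, 2006.0])
```

The second biomarker shows how the two covariates can differ: no peak was detected in its window, so the discrete covariate is 0, while the continuous covariate still records the intensity found there.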
AdaBoost
The present objective is to find the important features of a peak pattern associated with a disease on the basis of peak identification in proteomic data. We introduce AdaBoost for the extraction of informative patterns in the feature space based on examples consisting of N pairs of the feature vector and the label. In this context, the feature vector is obtained from peak intensities over the detected m/z-values for a subject, and the label expresses the disease status of the subject. For pattern classification, one of two cases is employed: the feature vector is composed of either discrete or continuous values, as discussed in the preceding section.
Ensemble learning has been studied in machine learning. AdaBoost [11] is one of the most efficient learning methods in ensemble learning. As explained below, AdaBoost provides a classification function by a linear combination of weak learners. The AdaBoost algorithm can be regarded as a sequential minimization algorithm for the exponential loss function.
Let x be a feature vector and y a binary label with values +1 and −1. A classification function f(x) is then used to predict the label y: if f(x) is positive (or negative), the label is predicted as +1 (or −1). Suppose that a class of classification functions $\mathcal{F}$ = {f} is provided. In AdaBoost, a classification function f in $\mathcal{F}$ is called a weak learner. A new classification function F is constructed by taking a linear combination of classification functions in $\mathcal{F}$, i.e.,
$$F(x; \beta) = \sum_{t=1}^{T} \beta_t f_t(x), \qquad (1)$$
where β = (β_{1}, ⋯, β_{T}) is a weight vector and f_{t} ∈ $\mathcal{F}$ for t = 1, ⋯, T. The sign of F(x; β) provides a label prediction of y. This is a rule of majority vote by T classification functions f_{1}(x), ⋯, f_{T}(x) with weights β_{1}, ⋯, β_{T}. Consider a problem in which weights β_{1}, ⋯, β_{T} and classification functions f_{1}(x), ⋯, f_{T}(x) are optimally combined based on N given examples (x_{1}, y_{1}), ⋯, (x_{N}, y_{N}). AdaBoost aims to solve the problem by minimizing the exponential loss defined by
$$L(\beta) = \frac{1}{N} \sum_{i=1}^{N} \exp\{-y_i F(x_i; \beta)\}. \qquad (2)$$
AdaBoost does not jointly provide the optimal solution, but offers a sophisticated learning algorithm with a sequential structure: at step t, the best weak learner f_{t}(x) is selected in the first stage, and the best scalar weight β_{t} is determined in the second stage.
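The two stages can be written out explicitly for weak learners taking values in {+1, −1}. The following is the standard derivation for discrete AdaBoost, given here as a worked illustration; the symbols $F_{t-1}$, $w_i^{(t)}$ and $\epsilon_t$ are notation introduced for this sketch. Let $F_{t-1}(x) = \sum_{s=1}^{t-1} \beta_s f_s(x)$ and let the weights $w_i^{(t)} \propto \exp\{-y_i F_{t-1}(x_i)\}$ be normalized to sum to one. Up to a factor that does not depend on $(\beta, f)$, the exponential loss of $F_{t-1} + \beta f$ is

$$\sum_{i=1}^{N} w_i^{(t)} \exp\{-\beta y_i f(x_i)\} = e^{-\beta}\,(1 - \epsilon_t(f)) + e^{\beta}\,\epsilon_t(f), \qquad \epsilon_t(f) = \sum_{i=1}^{N} w_i^{(t)}\, I(f(x_i) \neq y_i).$$

Minimizing over $\beta$ for fixed $f$ gives

$$\beta = \frac{1}{2} \log \frac{1 - \epsilon_t(f)}{\epsilon_t(f)}, \qquad \min_{\beta}\ \left\{ e^{-\beta}(1 - \epsilon_t(f)) + e^{\beta} \epsilon_t(f) \right\} = 2\sqrt{\epsilon_t(f)\,(1 - \epsilon_t(f))}.$$

The minimum value is increasing in $\epsilon_t(f)$ for $\epsilon_t(f) < 1/2$, so the first stage selects $f_t = \arg\min_{f \in \mathcal{F}} \epsilon_t(f)$ and the second stage sets $\beta_t = \frac{1}{2} \log \{(1 - \epsilon_t(f_t)) / \epsilon_t(f_t)\}$.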
In this study, decision stumps were used as weak learners. A decision stump is a naive classification function in which, for a subject with a feature vector x of peak intensities, the label is predicted by observing whether a certain peak intensity is larger than a predetermined value or not. Accordingly, the set of weak learners is relatively large, but all of the weak learners are literally weak, since they respond only to a peak pattern. The set of the decision stumps is denoted by
{d_{j}(x) = sign(x_{j} − b) | j = 1, ⋯, J, −∞ < b < ∞}.
AdaBoost efficiently integrates the set of weak learners by sequential minimization of the exponential loss. As a result, the learning process of AdaBoost can be traced, and the final classification function can be re-expressed as the sum of the peak pattern functions F_{j}(x), where F_{j}(x) collects the terms β_{t}f_{t}(x) whose weak learner f_{t} is a decision stump on the jth covariate,
$$F(x; \beta) = \sum_{j=1}^{J} F_j(x), \qquad F_j(x) = \sum_{t \,:\, f_t \text{ uses } x_j} \beta_t f_t(x),$$
in which the sum of the coefficients β_{t} over these steps is referred to as the score S_{j} [12]. In this way, the score S_{j} expresses the degree of importance of the jth peak in terms of its contribution to the final classification function built up during the learning process.
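The learning process and the scores can be sketched as follows. This is a minimal illustration of AdaBoost with decision stumps, not the code used in this study. Two choices are assumptions of the sketch: each stump carries a polarity s ∈ {+1, −1}, so that the weak learner is s·sign(x_j − b) and every β_t is non-negative, and sign(0) is treated as −1. The function names are hypothetical.

```python
import numpy as np

def best_stump(X, y, w):
    """Return (error, j, b, s) of the stump with the smallest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        vals = np.unique(X[:, j])
        # Thresholds: one below the minimum, then midpoints of distinct values.
        cands = np.concatenate(([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2.0))
        for b in cands:
            pred = np.where(X[:, j] > b, 1, -1)
            err = w[pred != y].sum()
            for s, e in ((1, err), (-1, 1.0 - err)):
                if e < best[0]:
                    best = (e, j, b, s)
    return best

def adaboost_stumps(X, y, T=50):
    """Run T boosting steps; return the stump list and the scores S_j."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    N, J = X.shape
    w = np.full(N, 1.0 / N)            # uniform initial weights
    model, scores = [], np.zeros(J)
    for _ in range(T):
        err, j, b, s = best_stump(X, y, w)       # stage 1: pick weak learner
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        beta = 0.5 * np.log((1.0 - err) / err)   # stage 2: pick its weight
        pred = s * np.where(X[:, j] > b, 1, -1)
        w = w * np.exp(-beta * y * pred)         # emphasize misclassified cases
        w /= w.sum()
        model.append((beta, j, b, s))
        scores[j] += beta                        # S_j: sum of beta_t using x_j
    return model, scores

def predict(model, X):
    """Sign of the weighted vote F(x) over all stumps."""
    X = np.asarray(X, dtype=float)
    F = np.zeros(X.shape[0])
    for beta, j, b, s in model:
        F += beta * s * np.where(X[:, j] > b, 1, -1)
    return np.where(F > 0, 1, -1)

# Toy data: covariate 1 separates the classes, covariate 0 is noise.
X = np.array([[0, 3], [1, 4], [0, 5], [1, 0], [0, 1], [1, 2]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])
model, scores = adaboost_stumps(X, y, T=3)
```

In the toy data only the second covariate is ever selected, so its score is positive while the score of the noise covariate stays at zero.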
Table 1. Test errors, false negatives and false positives.
| covariates | test error | false negative | false positive |
|---|---|---|---|
| discrete | 0.06 | 0.11 | 0.03 |
| continuous | 0.06 | 0.11 | 0.03 |
In Table 1, the test error with discrete covariates and that with continuous covariates are the same.
Ten highest values of S_{j} in discrete case.
| peak (m/z) | S_{j} |
|---|---|
| 3675 | 1.42 |
| 13651 | 1.06 |
| 7247 | 1.02 |
| 4907 | 1.00 |
| 7539 | 0.82 |
| 8569 | 0.76 |
| 4792 | 0.65 |
| 16668 | 0.62 |
| 1849 | 0.61 |
| 4728 | 0.54 |
Ten highest values of S_{j} in continuous case.
| peak (m/z) | S_{j} |
|---|---|
| 17095 | 3.98 |
| 14798 | 2.46 |
| 7247 | 1.97 |
| 2095 | 1.68 |
| 4029 | 1.62 |
| 5271 | 1.56 |
| 8039 | 1.38 |
| 4118 | 1.19 |
| 4773 | 1.07 |
| 15044 | 1.05 |
Test error when h_{th} varies in discrete case.
| h_{th} | test error | number of variables |
|---|---|---|
| 0.0 | 0.04 | 232 |
| 0.1 | 0.06 | 146 |
| 0.2 | 0.08 | 128 |
| 0.4 | 0.16 | 102 |
| 0.8 | 0.1 | 69 |
Test error when h_{th} varies in continuous case.
| h_{th} | test error | number of variables |
|---|---|---|
| 0.0 | 0.06 | 232 |
| 0.1 | 0.06 | 146 |
| 0.2 | 0.06 | 128 |
| 0.4 | 0.08 | 102 |
| 0.8 | 0.12 | 69 |
In Figs. 5 and 6, the false negatives are much larger than the false positives, but this is not a desirable result. In order to suppress false negatives, AsymBoost [13] may be useful. In AdaBoost, the loss function is given by (2), but in AsymBoost, the loss function is given by
where each initial weight w_{ i }is set as follows:
Conclusion
We proposed a methodology for identifying biomarkers from high-dimensional mass spectrometry data. "Common" peaks in the data are regarded as biomarkers. The number of biomarkers can be changed by varying the value of h_{th}, a threshold that controls how "common" a peak must be to be regarded as a biomarker. By identifying biomarkers, the number of covariates is reduced, so that classification is facilitated. We can select discrete or continuous covariates depending on the situation.
The effectiveness of our approach was demonstrated through application to an ovarian cancer dataset. It was shown that a prediction function with high performance can be obtained by a simple application of AdaBoost.
A simple method was used to analyze data in this study. In general, however, a more sophisticated method may be required to extract a covariate at a biomarker. For example, when discrete covariates are extracted, we can use peaks obtained by a stricter rule or we can extract variables effective for classification [5]. When continuous variables are extracted, we can use the intensity at the m/z-value nearest to a biomarker, or the average of intensities within a window including a biomarker.
In this paper, the difference between the result for the continuous covariates and that for the discrete covariates was investigated in detail. In the example, the result obtained using continuous covariates appeared to be based on a difference in the intensity under the condition that the peak exists, whereas the result obtained using discrete covariates was likely to be based on a coarser difference, namely whether the peak exists at the m/z-value. In general, whether discrete or continuous covariates are better depends on the data. If the value of the intensity in the data is reliable, it may be better to use continuous covariates. If not, it may be better to use discrete covariates. We consider that, in almost all practical studies, both kinds of covariates should be examined and the results compared and inspected in detail. We conclude that we can obtain more information on the structure of the data by integrating both results.
Declarations
Acknowledgements
We would like to thank Masaaki Matsuura for his helpful comments on this study. This work was supported by Transdisciplinary Research Integration Center, Research Organization of Information and Systems.
References
1. Petricoin EF III, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA: Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002, 359: 572–577. 10.1016/S0140-6736(02)07746-2
2. Hanash S: Disease proteomics. Nature 2003, 422: 226–232. 10.1038/nature01514
3. Wu B, Abbott T, Fishman D, McMurray W, Mor G, Stone K, Ward D, Williams K, Zhao H: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 2003, 19: 1636–1643. 10.1093/bioinformatics/btg210
4. Yasui Y, Pepe M, Thompson ML, Adam BL, Wright GL Jr, Qu Y, Potter JD, Winget M, Thornquist M, Feng Z: A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics 2003, 4: 449–463. 10.1093/biostatistics/4.3.449
5. Tibshirani R, Hastie T, Narasimhan B, Soltys S, Shi G, Koong A, Le QT: Sample classification from protein mass spectrometry, by 'peak probability contrasts'. Bioinformatics 2004, 20: 3034–3044. 10.1093/bioinformatics/bth357
6. Yu JS, Ongarello S, Fiedler R, Chen XW, Toffolo G, Cobelli C, Trajanoski Z: Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data. Bioinformatics 2005, 21: 2200–2209. 10.1093/bioinformatics/bti370
7. Geurts P, Fillet M, de Seny D, Meuwis MA, Malaise M, Merville MP, Wehenkel L: Proteomic mass spectra classification using decision tree based ensemble methods. Bioinformatics 2005, 21: 3138–3145. 10.1093/bioinformatics/bti494
8. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. New York: Springer-Verlag; 2001.
9. Wong JWH, Cagney G, Cartwright HM: SpecAlign – processing and alignment of mass spectra datasets. Bioinformatics 2005, 21: 2088–2090. 10.1093/bioinformatics/bti300
10. Jeffries N: Algorithms for alignment of mass spectrometry proteomic data. Bioinformatics 2005, 21: 3066–3073. 10.1093/bioinformatics/bti482
11. Freund Y, Schapire R: A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 1997, 55: 119–139. 10.1006/jcss.1997.1504
12. Takenouchi T, Ushijima M, Eguchi S: GroupAdaBoost for selecting important genes. IEEE 5th Symposium on Bioinformatics and Bioengineering 2005, 218–226.
13. Viola P, Jones M: Fast and robust classification using asymmetric adaboost and a detector cascade. Neural Information Processing Systems 14:
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.