Skip to main content

Ovarian cancer classification based on dimensionality reduction for SELDI-TOF data



Recent advances in proteomics technologies such as SELDI-TOF mass spectrometry has shown promise in the detection of early stage cancers. However, dimensionality reduction and classification are considerable challenges in statistical machine learning. We therefore propose a novel approach for dimensionality reduction and tested it using published high-resolution SELDI-TOF data for ovarian cancer.


We propose a method based on statistical moments to reduce feature dimensions. After refining and t-testing, SELDI-TOF data are divided into several intervals. Four statistical moments (mean, variance, skewness and kurtosis) are calculated for each interval and are used as representative variables. The high dimensionality of the data can thus be rapidly reduced. To improve efficiency and classification performance, the data are further used in kernel PLS models. The method achieved average sensitivity of 0.9950, specificity of 0.9916, accuracy of 0.9935 and a correlation coefficient of 0.9869 for 100 five-fold cross validations. Furthermore, only one control was misclassified in leave-one-out cross validation.


The proposed method is suitable for analyzing high-throughput proteomics data.


Recent advances in proteomics technologies have enabled large-scale analysis of complex protein expression patterns, protein-protein interactions and posttranslational modifications. One of these technologies, surface-enhanced laser desorption/ionization time-of-flight (SELDI-TOF) mass spectrometry has shown great potential in early disease diagnosis and prevention [1, 2]. The majority of studies on SELDI-TOF analysis have focused on the classification problem. The potential efficacy of SELDI-TOF serum protein profiling for cancer classification has been demonstrated by the identification of biomarkers for ovarian cancer [36], breast cancer[7], prostate cancer[8], liver cancer[9], lung cancer[10].

Data sets are obtained for biological samples collected from different patients classified in different classes (e.g. disease versus control, disease A versus disease B, successful versus unsuccessful treatment) for mass spectrometry data after sample fractionation under different physical conditions (so-called chip surfaces). The mass spectrometer typically provides signal intensities in a range of m/z ratios up to 20,000 Da. This leads to a vector of 5000-20,000 numerical values for each mass spectrum. In practice, for a given patient these data can be obtained for a sample pre-processed on several different chip surfaces for several replicates, thus potentially leading to over 100,000 numerical variables per patient. Although the number of variables can be very high in these applications, the number of patients and samples is typically rather small. This leads to rather untypical pattern problems in which the number of input variables is several orders of magnitude higher than the number of samples.

Although significant progress has been made in proteomics technologies, the development of tools for the analysis and interpretation of the large amounts of data produced remains a challenge. Some methods have been reported for class discrimination based on SELDI mass spectra [11, 12]. Vlahou et al. [13] tested the classification and regression tree (CART) method for discrimination of ovarian cancer from benign diseases and healthy controls and achieved a cross-validation accuracy of 81.5%. Another approach was via tree-based methods [14]. Li et al. [15] applied the GA/KNN method to SELDI proteomics data analysis and achieved 95% accuracy (range 90-100%) for unaffected specimens and 98% (range 90-100%) for cancer specimens. Purohit et al. [16] combined information from multiple multivariate models to accurately classify type-diabetes and control subjects, with 88.9% specificity and 90.0% sensitivity. Zhang et al. [17] developed a recursive support vector machine (R-SVM) algorithm to select important genes and biomarkers for the classification of noisy data and used it for mass spectrometry and microarray data.

SELDI-TOF mass spectrometry data also consist of tens of thousands of m/z ratios per specimen and an intensity level for each m/z ratio. Thus, dimensionality reduction is a critical stage before discrimination using such data. Traditionally, there are two types of methods used to reduce dimensionality. One is variable selection and the other is variable transformation such as linear or nonlinear combination of variables. Variable selection techniques do not alter the original representation of the variables, but merely select a subset of variables derived from the large set of profiles. A classifier is then built using the reduced input set. In the context of classification, feature selection techniques can be organized into three categories, depending on how they combine the feature selection search with the construction of the classification model: filter methods, wrapper methods and embedded methods [18]. Variable transformation methods are used to construct a small set of aggregate variables such that every variable combines many inputs in the profile, including methods based on projection or compression such as principal component analysis and kernel transformation. Regardless of the type of method adopted, reduction of high dimensionality to an appropriate low dimension can substantially influence the results. Yu et al. [19] developed a four-step strategy for data preprocessing and achieved average sensitivity of 97.38% and specificity of 93.30%. Hauskrecht et al. [20] used univariate feature selection strategies with a heuristic based on multivariate decorrelation filtering to improve the classification performance for a pancreatic cancer data set. Bhanot et al. [21] proposed a novel method for phenotype identification involving stringent noise analysis and a filtering procedure combined with the results of several machine learning tools to produce a robust predictor. They were able to distinguish cancer from non-cancer cases with sensitivity of 90.31% and specificity of 98.81%. From a clinical viewpoint, such positive results demonstrate the potential of this new bioassay technology.

The majority of the above-mentioned studies on analysis of proteomics data have focused on one feature reduction approach. To improve the classification performance for ovarian cancer, we use a different method based on statistical moments to transform data. The aim of our method is not only to preserve the data properties, but also to reduce the variables. Four statistical moments (mean, variance, skewness and kurtosis) are used to describe the data to reduce the feature dimensions and extract the characteristics of the data at the same time. A kernel algorithm could be useful in mining high-dimensional SELDI-TOF data. We demonstrate the applicability of the kernel PLS (KPLS) to SELDI-TOF data analysis and show that the kernel is capable of reliable discrimination between cancer and healthy samples. Dimensionality reduction and classification can be carried out simultaneously using KPLS. Good results were obtained and only one control was misclassified in the leave-one-out cross validation.


Data preprocessing

In general, MS data are characterized by small numbers of very high-dimensional samples. There are potentially tens of thousands of intact and cleaved proteins in the human serum proteome, so finding a single elusive disease-related protein is very tedious, requiring the laborious separation and identification of every biomarker.

For the raw high-resolution SELDI-TOF ovarian data set provided by the National Cancer Institute (available at, the data can be written as S = {(x i , y i ) x i Rm, y i = ± 1, i = 1, 2, ..., n}, where x i is an intensity vector according to a sorted sequence of m/z ratios and y i is the class label of x i (−1 for non-cancer, +1 for cancer). When the feature space has high dimensionality, feature selection and transformation become crucial as the first step towards pattern recognition. For the raw high-resolution SELDI-TOF ovarian data set for 95 control samples and 121 cancer samples, the dimensionality of the original feature space is >370,000.

In the present study, the data were preprocessed using three steps, (1) refining (2) t-testing and (3) statistical transformation. All the procedures are independent of the particular classifier that will be used later.


The data are aligned according to the sorted union of m/z values into an intensity frame with missing data. The data rows represent m/z values and columns represent samples or observations. Any m/z values that do not have certain desired properties are removed from the data set, thus excluding m/z values that have missing values in one of the columns. Then the data are refined and the missing data are ignored. The dimensionality is thus reduced to 39,905. This is defined as set A.


For each m/z ratio, we compare the distribution of the data using a two-sided t-test (the null hypothesis H0 is that cancer and non-cancer have the same distribution) at a significance level of 1%. The data obtained are defined as set B. The dimensionality of the feature space is reduced from 39,905 to 24,545.

Figure 1 depicts a mass spectrum for serum from a cancer patient. Although the spectrum of set B resembles that of set A, more than 15000 m/z values that are not significantly different between cancer and non-cancer have been filtered.

Figure 1

Mass spectra for set A (top) and set B (bottom). A mass spectrum for serum from a cancer patient, with m/z values on the horizontal axis and intensity values indicating the relative ion abundance on the vertical axis.

Data transformation based on statistical moments

The mass data were first divided into several intervals. Waveform segments are better than individual m/z values in representing the characteristics of MS data. Thus, we selected variables that could represent the characteristics of each waveform segment. After several experiments, statistical moments were selected for further analysis.

The mean, variance, skewness and kurtosis were calculated for each interval. The mean is the sum of the observations divided by the number of observations, which describes the central location of the data. The variance is the expected value of the square of the deviation from its own mean. It can be expressed as the average of the square of the distance of each data point from the mean. The variance describes the spread of the data. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. The skewness is a measure of symmetry or, more precisely, a lack of symmetry. Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. Data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.

For univariate data Y 1 , Y 2 , ..., Y N , the following formulas hold, where is the mean, S is the standard deviation, and N is the number of data points:

The calculated statistical moments, which describe the characteristics of the spectrum from a statistical point of view, were then used as new variables for classification. Figure 2 depicts the statistical moment transformation for set A.

Figure 2

Statistical moment transformation for set A. After division of the data into several intervals, the statistical moments were calculated for one of the intervals (for example m/z 7600-8000) and used as new variables to represent the characteristics of the spectrum.

It is important to note that the results are influenced by the window width. The interval width was optimized based on modeling results. Figure 3 shows the results for different window widths. The best results were obtained for a width of 40 for set A and 50 for set B. Therefore, after transformation based on statistical moments, sets A and B were transformed to sets C and D, reducing the dimensionality to 3992 and 1964, respectively.

Figure 3

Results for different window widths. Ac/CC results for set A (top) and set B (bottom) for different window widths.

Classifications by KPLS

SVM was originally proposed and developed by Vapnik [22] and has been applied in various areas. KPLS, like SVM, is a learning method based on a kernel function. Rosipal [23] introduced KPLS as a nonlinear extension to the linear PLS method [24] and KPLS is a powerful machine learning tool for classification and regression. KPLS comprises construction of a kernel matrix and PLS. The difference between KPLS and traditional PLS is that XXT is replaced by a kernel matrix. Therefore, the relationships among variables are replaced by those among samples.

The fundamental principle of a kernel is the feature space F (φ: x i RN → Φ (x i ) F) in which the original data items are embedded [23]. In this space linear relations are explored among images of the data items. However, these feature images φ(x i ) are not necessary and are also impossible to obtain. Thus, pairwise inner products are computed from the original data items using kernel function to reveal the relations implicitly.

Given a set of vectors X = (x1, ... n), where n denotes the number of samples and κ denotes kernel function; K is defined as the kernel matrix whose entries are generated by such pattern:

The training kernel matrix is symmetric and square, whereas the testing kernel matrix is usually not because the vectors involved in computing the inner product are separate from the training set and testing set.

To centralize the mapped data in a feature space F we can simply apply the following procedures [25]:

where I is an n-dimensional identity matrix and 1 n , 1 nt represent the vectors whose elements are 1 of length n and n t , respectively. K t is the (n t × n) "test" matrix whose elements are K ij = K(x i , x j ), where and are the testing and training points, respectively.

Before applying KPLS, identification of a proper kernel function and optimal kernel parameters is crucial. Commonly used kernel functions include linear, polynomial, radial basis and sigmoid functions. Different kernel functions are selected for different problems. Currently, there is no systematic methodology for the selection of kernel functions. In the present study, experiments revealed that a polynomial kernel function was better than other functions:

The parameter p denotes the degree of the polynomial and was defined as p = 3 in experiments for the present study. The parameter r denotes the relative weighting of the different degree mononomials. Here r is fixed as 1 for simplicity under the assumption that all subsets corresponding to each mononomial make an equal contribution to the kernel.

Results and discussions

Accuracy (Ac), sensitivity (Sn) and specificity (Sp) are often used to evaluate prediction systems. The correlation coefficient (CC), which ranges from -1 to +1, can also be used. The closer the CC value is to +1, the better is the prediction system. Sn, Sp, Ac and CC are expressed in terms of true positive (TP), false negative (FN), true negative (TN) and false positive (FP) rates:

Results for combinations of the four statistical moments are listed in Table 1. For each five-fold validation, the random experiment was repeated 100 times independently. The mean plays a very important role in classification. Poor results were obtained using skewness and kurtosis separately and the best results were obtained when the four statistical moments were all used for calculation.

Table 1 Five-fold cross validation of statistical transformation

Tables 2, 3, 4 and 5 list the validation results for data sets A-D, respectively. The classification performance was assessed using five-fold cross-validation, five-fold proportional validation (randomly selecting 80% controls and 80% cancers as the training set and testing the classifier on the remaining samples) and leave-one-out cross-validation. For each five-fold validation, the random experiment was repeated 100 times independently. Comparison of the standard deviation for five-fold cross and proportional validations reveals that KPLS is stable for classification. KPLS avoids the SVM overfitting problem reported by Yu et al. [19]. The SVM results indicate that overfitting seems to be more serious for the control samples, whereas KPLS leads to good results for both control and cancer samples. Besides decreasing the computational complexity, the procedure cleans the original data and explores their category traits for classification. All four sets yield good classification results, with the best performance observed for set D in all validation tests. In leave-one-out cross-validation, only one control was misclassified.

Table 2 Validation for set A
Table 3 Validation for set B
Table 4 Validation for set C
Table 5 Validation for set D

In order to evaluate the applicability of the proposed method to other data sets, the high resolution pancreatic cancer premalignant data downloaded from National Cancer Institute were classified by the method. The classification accuracy using seven different classifiers and three kinds of feature selection methods ranged from 0.5611 to 0.7500 [26]. The 10-fold cross-validation results in Table 6 (repeated 100 times independently) showed that the proposed preprocessing method performed well. The data using statistical moments preprocessing have better accuracies than those without preprocessing and those in reference, thus suggest the utility of the method in proteomics data analysis more generally.

Table 6 Results of pancreatic cancer premalignant data

The number of proteomics data samples is much less than the number of variables. Therefore, there are some difficulties in data analysis: useless variables interfere with the results and useful information is hidden. After data preprocessing to reduce the dimensionality, KPLS is used as a variable nonlinear recombination method in which the relationships among variables are replaced by those among samples. Thus, both samples and variables are considered in the classification and good results are obtained.

Statistic moments are applicable to many different domains, for example, aspects of image processing, ranging from invariant pattern recognition and image encoding to pose estimation. When applied to images, statistic moments describe the image content (or distribution) with respect to its axes. They are designed to capture both global and detailed geometric information about the image. Recently, a novel method, named statistical moment-based method, was proposed for structural damage detection, which was believed to be sensitive to local structural damage but insensitive to measurement noise [27]. Four statistical moments transformation yield novel variables and reduce dimension of feature space by reconstruct the data. In this paper, the result showed that the statistic moments can not only reduce the dimensionality of the feature space but also reflect the differences between the cancer and the healthy samples. Using the technology of reconstruction data, such as statistic moments, a novel insight will be created into large biological data and good performance will be achieved in other fields of biological statistics.


Serum proteomic profiling is a new approach to cancer diagnosis. It confronts a challenging environment, as it combines measurement technologies that are new in the clinical setting with novel approaches to processing and interpreting high dimensional data. Statistical moments' transformation for feature reduction has been playing active roles in analyzing such data. By efficient preprocessing of high-resolution ovarian MS data, KPLS achieve a satisfied performance of classifying cancer and the healthy. On the one hand, data preprocessing reduces the dimensionality of feature space; on the other hand KPLS extrudes the most significant category traits for the coming classification. The proposed method is general enough that it can be adapted to other proteomics data such as MALDI-TOF data and even NMR metabolomics profiling.


  1. 1.

    Diamandis EP: Proteomic patterns in serum and identification of ovarian cancer. Lancet 2002, 360(9327):170. 10.1016/S0140-6736(02)09390-X

    CAS  Article  PubMed  Google Scholar 

  2. 2.

    Wulfkuhle JD, Liotta LA, Petricoin EF: Proteomic applications for the early detection of cancer. Nat Rev Cancer 2003, 3(4):267–275. 10.1038/nrc1043

    CAS  Article  PubMed  Google Scholar 

  3. 3.

    Petricoin E, Ardekani A, Hitt B, Levine P, Fusaro V, Steinberg S, Mills G, Somone C, Fishman D, Kohn E, et al.: Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002, 359(9306):572–577. 10.1016/S0140-6736(02)07746-2

    CAS  Article  PubMed  Google Scholar 

  4. 4.

    Sorace JM, Zhan M: A data review and re-assessment of ovarian cancer serum proteomic profiling. BMC bioinformatics 2003, 4(1):24. 10.1186/1471-2105-4-24

    Article  PubMed  PubMed Central  Google Scholar 

  5. 5.

    Zhang H, Kong B, Qu X, Jia L, Deng B, Yang Q: Biomarker discovery for ovarian cancer using SELDI-TOF-MS. Gynecol Oncol 2006, 102(1):61–66. 10.1016/j.ygyno.2005.11.029

    CAS  Article  PubMed  Google Scholar 

  6. 6.

    Kong F, Nicole White C, Xiao X, Feng Y, Xu C, He D, Zhang Z, Yu Y: Using proteomic approaches to identify new biomarkers for detection and monitoring of ovarian cancer. Gynecol Oncol 2006, 100(2):247–253. 10.1016/j.ygyno.2005.08.051

    CAS  Article  PubMed  Google Scholar 

  7. 7.

    Carter D, Douglass JF, Cornellison CD, Retter MW, Johnson JC, Bennington AA, Fleming TP, Reed SG, Houghton RL, Diamond DL, et al.: Purification and characterization of the mammaglobin/lipophilin B complex, a promising diagnostic marker for breast cancer. Biochemistry 2002, 41(21):6714–6722. 10.1021/bi0159884

    CAS  Article  PubMed  Google Scholar 

  8. 8.

    Petricoin ER, Ornstein DK, Paweletz CP, Ardekani A, Hackett PS, Hitt BA, Velassco A, Trucco C, Wiegand L, Wood K, et al.: Serum proteomic patterns for detection of prostate cancer. J Natl Cancer Inst 2002, 94(20):1576–1578.

    CAS  Article  PubMed  Google Scholar 

  9. 9.

    Poon TCW, Yip T, Chan ATC, Yip C, Yip V, Mok TSK, Lee CCY, Leung TWT, Ho SKW, Johnson PJ: Comprehensive proteomic profiling identifies serum proteomic signatures for detection of hepatocellular carcinoma and its subtypes. Clin Chem 2003, 49(5):752–760. 10.1373/49.5.752

    CAS  Article  PubMed  Google Scholar 

  10. 10.

    Zhukov TA, Johanson RA, Cantor AB, Clark RA, Tockman MS: Discovery of distinct protein profiles specific for lung tumors and pre-malignant lung lesions by SELDI mass spectrometry. Lung Cancer 2003, 40(3):267–279.

    Article  PubMed  Google Scholar 

  11. 11.

    Ball G, Mian S, Holding F, Allibone RO, Lowe J, Ali S, Li G, McCardle S, Ellis IO, Creaser C: An integrated approach utilizing artificial neural networks and SELDI mass spectrometry for the classification of human tumours and rapid identification of potential biomarkers. Bioinformatics 2002, 18(3):395. 10.1093/bioinformatics/18.3.395

    CAS  Article  PubMed  Google Scholar 

  12. 12.

    Adam BL, Qu Y, Davis JW, Ward MD, Clements MA, Cazares LH, Semmes OJ, Schellhammer PF, Yasui Y, Feng Z: Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res 2002, 62(13):3609.

    CAS  PubMed  Google Scholar 

  13. 13.

    Vlahou A, Schorge JO, Gregory BW, Coleman RL: Diagnosis of ovarian cancer using decision tree classification of mass spectral data. Journal of Biomedicine and Biotechnology 2003, 2003: 308–314. 10.1155/S1110724303210032

    Article  PubMed  PubMed Central  Google Scholar 

  14. 14.

    Qu Y, Adam BL, Yasui Y, Ward MD, Cazares LH, Schellhammer PF, Feng Z, Semmes OJ, Wright GL Jr: Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clin Chem 2002, 48(10):1835–1843.

    CAS  PubMed  Google Scholar 

  15. 15.

    Li L, Umbach DM, Terry P, Taylor JA: Application of the GA/KNN method to SELDI proteomics data. Bioinformatics 2004, 20(10):1638–1640. 10.1093/bioinformatics/bth098

    CAS  Article  PubMed  Google Scholar 

  16. 16.

    Purohit S, Podolsky R, Schatz D, Muir A, Hopkins D, Huang Y, She J: Assessing the utility of SELDI-TOF and model averaging for serum proteomic biomarker discovery. Proteomics 2006, 6(24):6405–6415. 10.1002/pmic.200600420

    CAS  Article  PubMed  Google Scholar 

  17. 17.

    Zhang X, Lu X, Shi Q, Xu XQ, Leung HC, Harris LN, Iglehart JD, Miron A, Liu JS, Wong WH: Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics 2006, 7: 197. 10.1186/1471-2105-7-197

    Article  PubMed  PubMed Central  Google Scholar 

  18. 18.

    Saeys Y, Inza I, Larranaga P: A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23(19):2507–2517. 10.1093/bioinformatics/btm344

    CAS  Article  PubMed  Google Scholar 

  19. 19.

    Yu JS, Ongarello S, Fiedler R, Chen XW, Toffolo G, Cobelli C, Trajanoski Z: Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data. Bioinformatics 2005, 21(10):2200. 10.1093/bioinformatics/bti370

    CAS  Article  PubMed  Google Scholar 

  20. 20.

    Hauskrecht M, Pelikan R, Malehorn DE, Bigbee WL, Lotze MT, Zeh HJ, Whitcomb DC, Lyons-Weiler J: Feature selection for classification of SELDI-TOF-MS proteomic profiles. Applied Bioinformatics 2005, 4(4):227–246. 10.2165/00822942-200504040-00003

    CAS  Article  PubMed  Google Scholar 

  21. 21.

    Bhanot G, Alexe G, Venkataraghavan B, Levine AJ: A robust meta-classification strategy for cancer detection from MS data. Proteomics 2006, 6(2):592–604. 10.1002/pmic.200500192

    CAS  Article  PubMed  Google Scholar 

  22. 22.

    Vapnik V: The nature of statistical learning theory. New York 1995.

    Google Scholar 

  23. 23.

    Rosipal R, Trejo LJ: Kernel partial least squares regression in Reproducing Kernel Hilbert Space. J Mach Learn Res 2002, 2(2):97–123. 10.1162/15324430260185556

    Google Scholar 

  24. 24.

    Lindgren F, Geladi P, Wold S: The Kernel Algorithm For PLS. J Chemometr 1993, 7(1):45–59. 10.1002/cem.1180070104

    CAS  Article  Google Scholar 

  25. 25.

    Scholkopf B, Smola A, Muller K: Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 1998, 10(5):1299–1319. 10.1162/089976698300017467

    Article  Google Scholar 

  26. 26.

    Ge G, Wong GW: Classification of premalignant pancreatic cancer mass-spectrometry data using decision tree ensembles. BMC Bioinformatics 2008, 9: 275. 10.1186/1471-2105-9-275

    Article  PubMed  PubMed Central  Google Scholar 

  27. 27.

    Xu YL, Zhang J, Li JC, Xia Y: Experimental Investigation on Statistical Moment-based Structural Damage Detection Method. Structural Health Monitoring 2009, 8(6):555–571. 10.1177/1475921709341011

    Article  Google Scholar 

Download references


The authors would like to thank the reviewers for their valuable comments. This project was supported by National Natural Science Foundation of China (20675057, 20705024 and 30900832).

Author information



Corresponding author

Correspondence to Tong-Hua Li.

Additional information

Authors' contributions

KLT performed the data analysis and wrote the manuscript, WWX evaluated the results. KC participated in the study design. THL conceived and supervised the study. All authors have read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

Reprints and Permissions

About this article

Cite this article

Tang, K., Li, T., Xiong, W. et al. Ovarian cancer classification based on dimensionality reduction for SELDI-TOF data. BMC Bioinformatics 11, 109 (2010).

Download citation


  • Feature Selection
  • Kernel Function
  • Statistical Moment
  • Kernel Matrix
  • Proteomics Data Analysis