A data review and re-assessment of ovarian cancer serum proteomic profiling
© Sorace and Zhan 2003
Received: 28 March 2003
Accepted: 9 June 2003
Published: 9 June 2003
The early detection of ovarian cancer has the potential to dramatically reduce mortality. Recently, the use of mass spectrometry to develop profiles of patient serum proteins, combined with advanced data mining algorithms, has been reported as a promising method to achieve this goal. In this report, we analyze the Ovarian Dataset 8-7-02 downloaded from the Clinical Proteomics Program Databank website, using nonparametric statistics and stepwise discriminant analysis to develop rules to diagnose patients, as well as to understand general patterns in the data that may guide future research.
The mass spectrometry serum profiles derived from cancer and control subjects exhibited numerous statistical differences. For example, a Wilcoxon test comparing the intensity at each of the 15,154 mass-to-charge (M/Z) values between cancer and control subjects detected 3,591 M/Z values whose intensities differed with a p-value of 10⁻⁶ or less. The region containing the M/Z values of greatest statistical difference between cancer and controls occurred at M/Z values less than 500. For example, the M/Z values of 2.7921478 and 245.53704 could be used to significantly separate the cancer from control groups. Three other sets of M/Z values were developed using a training set that could distinguish between cancer and control subjects in a test set with 100% sensitivity and specificity.
The ability to discriminate between cancer and control subjects based on the M/Z values of 2.7921478 and 245.53704 reveals the existence of a significant non-biologic experimental bias between these two groups. This bias may invalidate attempts to use this dataset to find patterns of reproducible diagnostic value. To minimize false discovery, results using mass spectrometry and data mining algorithms should be carefully reviewed and benchmarked with routine statistical methods.
The early diagnosis of ovarian cancer has the potential to dramatically reduce the mortality associated with this disease. Recently, surface-enhanced laser desorption/ionization (SELDI) time-of-flight mass spectrometry profiling of patient serum proteins, combined with advanced data mining algorithms to detect protein patterns associated with malignancy, has been reported as a promising approach to early cancer detection [1–5]. Several reports have detailed the ability of this proteomic method to distinguish cancer from non-cancer samples in ovarian cancer [6–8], prostate cancer [9–13], and bladder cancer [13, 14]. Much of the effort in these analyses has focused on the use of a variety of data mining tools, such as the evaluation of prostate cancer using peaks in the mass-to-charge (M/Z) region between 2 K and 40 K combined with boosted decision tree analysis, to try to detect patterns that allow the diagnosis of cancer versus non-cancer. The use of similar technology to evaluate bladder cancer has also been reported [13, 14]. Thus, this field represents an active area of current research. For example, a recent report by the Clinical Proteomics Program Databank has demonstrated that the use of genetic algorithms coupled with clustering analysis has resulted in rule sets that can predict ovarian cancers (including samples from patients with stage 1 disease) with 100% sensitivity and 96% specificity. These results have been extended by the same group to include a larger series of ovarian cancer patients as well as prostate cancer patients [7, 9]. The Clinical Proteomics Program Databank has provided three sets of ovarian cancer data to the scientific community without restriction. These data sets include Lancet Ovarian Data 2-16-02, used in the study noted above. This study consisted of a total of 100 control, 100 cancer, and 16 benign disease samples run on a Ciphergen H4 ProteinChip array (since discontinued).
The samples were manually processed, and the data were posted after baseline subtraction. The second data set, Ovarian Dataset 4-3-02, consists of the same samples as the first, but the samples were run on a Ciphergen WCX2 ProteinChip array. The samples were manually prepared and the data were posted with baseline subtraction. A model diagnostic rule based on this dataset is published on the website, but no data are given regarding the rule's sensitivity or specificity. In this report, we analyze the third data set, Ovarian Dataset 8-7-02, and corresponding sample information downloaded from the Clinical Proteomics Program Databank website. This set of data consists of serum profiles of 162 subjects with ovarian cancer and 91 non-cancer control subjects. The cancer group may be further divided into 28 stage 1 patients, 20 stage 2 patients, 99 stage 3 patients, 12 stage 4 patients, and 3 patients with no stage specified. For each subject, a set of data consisting of intensities at 15,154 distinct M/Z values ranging from 0.0000786 to 19995.513 was available for analysis. This dataset was constructed using the Ciphergen WCX2 ProteinChip array. All the steps of preparing the chips for sample analysis were performed robotically, and the raw data without baseline subtraction were posted for download. A model rule claiming 100% sensitivity and specificity is also given. Additional details of experimental data collection may be found at the Clinical Proteomics Data Bank. In addition to the various methods of preparing and running the samples on the mass spectrometer, the optimal steps in processing the raw data from the mass spectrometer for further analysis have not been standardized and remain a fertile area for investigation. We chose the deliberately simple strategy of applying the Wilcoxon test to the raw data to better understand the underlying properties of the data set. We consider this simple approach a "benchmark method" to which other methods can be compared.
Further, we used the Wilcoxon test and stepwise discriminant analysis on a training subset consisting of 80 cancer patients and 45 controls, randomly chosen from the original data set, to develop rules to classify a test set consisting of the remaining cancer and control subjects. Disease classifiers of high sensitivity and specificity could be readily constructed by visual inspection and manual binning of M/Z values based on the p-values of the Wilcoxon test, combined with classical stepwise discriminant analysis. The ability of these rules to classify disease and normal samples was comparable to that of the model rule published for this dataset at the Clinical Proteomics Program Databank website, which was developed using a proprietary genetic algorithm. Further, in examining all M/Z values, the M/Z values that discriminated best between ovarian cancer and control were all found to be less than 500, an area of the spectrum often discarded as noise. These findings are useful for several reasons. First, the statistical methods used in this study are readily available, widely understood, and can be cheaply implemented. Secondly, a vast amount of mathematical research and practical experience underlies their interpretation. Finally, they can be used to discover unexpected patterns present in the data set. These patterns may be missed by machine learning methods that are narrowly focused on diagnostic classification and do not present the researcher with a broad overview of the data. As a result of these traditional studies, a better understanding of the weaknesses and possible strengths of serum proteomic profiling becomes apparent.
Development of Diagnostic Rule 1.
Wilcoxon p-value Training Set
Entire Data Set
Development of Diagnostic Rule 2.
Classification Rule 1 based on the intensities at the 7 M/Z values:
a 1 = (1303,5.66302,5.48787,19.60743,-8.88828,-30.47983,-0.34510)',
a 2 = (1413,6.44028,6.36701,20.84677,-11.04580,-32.04436,0.03553)',
c 1 = -2984,
c 2 = -3521.
Let X be the vector that represents the intensities from a subject at the 7 M/Z values:
2.7921478, 245.53704, 261.8864, 418.1136, 435.0751, 464.3617, 4003.645. Classify X into the cancer group if
(a 1 - a 2)' X + (c 1 - c 2) ≥ 0;
Otherwise classify X into the control group.
Classification Rule 2 based on the intensities at the 13 M/Z values:
a 1 = (24.58884,-15.09887,0.95772,-3.16411,-10.93854,31.7966,8.11259,6.59602,-53.15727, -7.49888,149.18784,399.67258,112.83481)',
a 2 = (26.26343,-19.06632,1.07975,-1.89482,-9.83188,29.47779,7.69470,8.54597,-49.99409, -7.32192,142.08982,389.53254,116.55493)',
c 1 = -1386,
c 2 = -1312.
Let X be the vector that represents the intensities from a subject at the following 13 M/Z values: 2665.397, 3969.469, 3991.844, 4003.645, 4027.3, 4056.967, 4744.889, 6801.495, 7756.437, 8349.266, 14796.14, 15955.47, 17034.05.
Classify X into cancer if
(a 1 - a 2)' X + (c 1 - c 2) ≥ 0.
Otherwise classify X into control.
Classification Rule 3 based on the intensities at the 7 M/Z values:
a 1 = (6.04377,3.42186,-1.99804,0.23374,2.46593,-1.87559,17.37384)',
a 2 = (7.35920,1.86527,-1.32486,0.92386, 1.18336,4.76619,9.95349)',
c 1 = -218.19592,
c 2 = -254.33039.
Let X be the column vector that represents the intensities from a subject at the following 7 M/Z values:
418.1136, 435.0751, 464.3617,4003.645, 4906.962, 6599.823, 6801.495.
Classify X into cancer if
(a 1 - a 2)' X + (c 1 - c 2) ≥ 0.
Otherwise classify X into control.
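All three rules above have the same form: a linear score (a 1 - a 2)' X + (c 1 - c 2) is compared with zero. The following is a minimal Python sketch of how such a rule might be applied, using the Rule 3 coefficients as printed. The function names `discriminant_score` and `classify` are illustrative and do not come from the original analysis, which was carried out in SAS, and the example intensity vectors are hypothetical rather than taken from the dataset.

```python
# Coefficients of Classification Rule 3, copied from the text above.
A1 = [6.04377, 3.42186, -1.99804, 0.23374, 2.46593, -1.87559, 17.37384]
A2 = [7.35920, 1.86527, -1.32486, 0.92386, 1.18336, 4.76619, 9.95349]
C1 = -218.19592
C2 = -254.33039


def discriminant_score(x, a1, a2, c1, c2):
    """Return (a1 - a2)' x + (c1 - c2) for an intensity vector x."""
    if not (len(x) == len(a1) == len(a2)):
        raise ValueError("x, a1 and a2 must have the same length")
    return sum((p - q) * v for p, q, v in zip(a1, a2, x)) + (c1 - c2)


def classify(x, a1, a2, c1, c2):
    """Assign 'cancer' when the score is >= 0, otherwise 'control'."""
    return "cancer" if discriminant_score(x, a1, a2, c1, c2) >= 0 else "control"


# Hypothetical intensity vectors (NOT real spectra), ordered as the 7 M/Z
# values listed for Rule 3.
example_a = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
example_b = [0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0]
label_a = classify(example_a, A1, A2, C1, C2)
label_b = classify(example_b, A1, A2, C1, C2)
```

Because only the differences a 1 - a 2 and c 1 - c 2 enter the decision, each rule reduces to a single weight vector and a single threshold.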
Rule 1 with M/Z values of 2.8234234, 222.41828, 410.13727, 417.73207, 435.07512, 4027.2999, and 8035.0581, achieved 100% sensitivity and specificity on both the test and training sets.
Rule 2, with M/Z values of 3676.3951, 3937.7816, 4003.6449, 4440.095, 5269.0367, 10511.699, 14182.82, and 17019.433, achieved 100% sensitivity and specificity on the training set. However, sensitivity and specificity fell on the test set to 96.25% and 91.11%, respectively.
Rule 3, with M/Z values of 417.73207, 435.07512, 2666.361, 2674.0769, 3937.7816, 3991.8435, 4821.0481, 4839.2088, 5269.0367, 7627.1183, 14182.82, and 17019.433, achieved 100% sensitivity and specificity on the training set. On the test set it achieved a sensitivity of 100% and a specificity of 97.8%. We have used a strategy identical to that used in Rule 1 to further analyze these data. First, a randomly ordered list of cancer spectra and a randomly ordered list of control spectra were prepared. Next, we assigned the first 20% of each list to a test set and the remaining 80% to a training set. The process was repeated five times, assigning the next consecutive 20% of each list to the test set on each occasion. The results were very similar to those above, with all five rules achieving 100% sensitivity and specificity. These data are posted as additional data file Supplement1.xls.
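The rotating 20%/80% split described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' procedure: the function names, the fixed random seeds, and the way a remainder is distributed when a list does not divide evenly by five are all choices made here, since the text does not specify them.

```python
import random


def rotating_splits(items, n_folds=5, seed=0):
    """Shuffle items once, then yield (training, test) pairs in which each
    test set is the next consecutive block of the shuffled list."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    # Block boundaries; integer arithmetic spreads any remainder over folds
    # (an assumption, the source does not say how remainders were handled).
    bounds = [i * n // n_folds for i in range(n_folds + 1)]
    for k in range(n_folds):
        lo, hi = bounds[k], bounds[k + 1]
        yield shuffled[:lo] + shuffled[hi:], shuffled[lo:hi]


def paired_rotating_splits(cancer, control, n_folds=5, seed=0):
    """Apply the same rotation separately to cancer and control spectra so
    that every fold contains both groups."""
    cancer_folds = rotating_splits(cancer, n_folds, seed)
    control_folds = rotating_splits(control, n_folds, seed + 1)
    for (c_train, c_test), (n_train, n_test) in zip(cancer_folds, control_folds):
        yield c_train, n_train, c_test, n_test


# Placeholder identifiers standing in for the 162 cancer and 91 control spectra.
cancer_ids = list(range(162))
control_ids = list(range(1000, 1091))
folds = list(paired_rotating_splits(cancer_ids, control_ids))
```

Splitting the two groups separately keeps the cancer-to-control ratio roughly constant across folds, and every spectrum appears in exactly one test set.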
Clinical Proteomics Program Databank Example Ovarian Rule.
Consecutive M/Z Bin
Normalization and Permutation Analysis of Low M/Z Values
M/Z = 2760.6685, p = 0.24
M/Z = 19643.409, p = 0.52
M/Z = 6631.7043, p = 9.0 × 10⁻⁴
M/Z = 14051.976, p = 1.8 × 10⁻⁸
M/Z = 3497.5508, p = 1.4 × 10⁻⁶
M/Z = 464.3617 with a p-value less than 6.8 × 10⁻³⁵, that correlates with a shoulder in a secondary peak at about 463, that is decreased in cancer patients (average intensity 17.5 for cancer versus 23.6 for controls).
M/Z = 435.0751 with a p-value of less than 3.9 × 10⁻³⁷, that corresponds to a peak with increased intensity in cancer (average intensity 33 for cancer versus 25.5 for controls).
M/Z = 417.73207 with a p-value less than 6.2 × 10⁻³⁵, that corresponds to a peak that is decreased in cancer (average intensity 39.5 for cancer versus 47.4 for controls).
The identity of the molecules responsible for these differences cannot be determined from this data. However, it is possible that in some cases they may relate to the LPA family of molecules, or to alterations in proteins that bind LPA family members.
Other explanations for the presence of statistically significant bands of low M/Z include degradation products of higher molecular weight macromolecules or a matrix effect. For example, if a set of proteins exist that are expressed at different levels between cancer and control subjects but have a common domain, then a common product ion of lower M/Z may be generated that would represent a summation of all the changes in expression of the group of proteins, and might thus have greater statistical significance than the changes associated with any single high M/Z value. Similarly, a set of low M/Z molecules (e.g., energy-absorbing molecule or matrix) that interacts differently in a protein environment that differs markedly between cancer and control could hypothetically generate a similar phenomenon. However, it is difficult to apply any of the above explanations to the very low M/Z values such as 2.7921478 and 245.53704, although in the last case an extremely small organic molecule is possible.
Alternatively, there may be some unexpected experimental bias or systematic error that accounts for low M/Z discrimination. This could occur at any experimental step, and might include medication or lifestyle change that occurs in patients who learn they have a cancer diagnosis, variation in sample collection, processing and preservation, as well as bias introduced at the time of analysis. In the case of LPA, increased plasma levels may be associated with platelet activation. Another group trying to repeat the observations of increased levels of LPA associated with ovarian cancer concluded that there was no diagnostic value in the assay, and suggested that the discrepant findings were possibly related to the different sample centrifugation protocols used by the two groups to remove platelets from the samples prior to analysis. However, LPA continues to be actively evaluated for its clinical utility.
Serum proteomic profiling is a new approach to cancer diagnosis. However, it confronts a challenging environment, as it combines measurement technologies that are new in the clinical setting with novel approaches to processing and interpreting high dimensional data. Further, controlling large clinical studies can be challenging even in more established settings. Nevertheless, it represents an advance in the ability to diagnose and understand illness. The results presented in this study are useful for several reasons. First, in regard to disease classification, advanced data mining techniques should be benchmarked against traditional methods when possible. Further, identical training sets should be defined for such a comparison, as results may vary depending on the samples chosen for inclusion in the training set. The development of disease classifiers using routine analysis proved to be straightforward, and resulted in excellent performance in both the test and training sets (e.g. 100% sensitivity and specificity for Rules 1 and 3 in the first training set). In particular, these preliminary data suggest that these two rules may be specific enough to scale to larger population trials without generating an unacceptably high false positive rate. This study also confirms that a classifier could be developed with M/Z values greater than 2000. This indicates that information regarding the difference between cancer and control is present throughout the entire M/Z region studied, a result entirely consistent with the observed Wilcoxon test p-values. Secondly, routine analysis allows investigators to rapidly review the data for general trends, and to correlate the findings with other information. The finding of significant discrimination between cancer and control groups at low M/Z values indicates that attention should be focused on this region.
In particular, if experimental bias and noise effects can be excluded, this region may prove to offer the optimum for ovarian cancer diagnostic test development. On the other hand, if bias cannot be excluded, the possibility must be entertained that higher M/Z values may also have been similarly affected. In order to address these issues, consideration may be given to using mass spectrometry methods with increased sensitivity in the low M/Z region. The experimental conditions used to physically bind the serum samples to the chip prior to analysis may also prove critical, and should be consistent with those used in collecting the current data set. Also, the possibility that the changes in the low M/Z region may represent an additive effect caused by differing protein environments between cancer and normal may be approached by intentionally spiking samples with panels of known proteins, and determining if there is an effect on the spectra in the low M/Z region. The use of internal standards to normalize this type of experimental system in general may also be considered. As with all clinical test development, confirmation of results in independent laboratories running blinded samples will remain the gold standard in ruling out the possible effects of bias, unless the sample set itself contains the bias. Particular attention should be paid to pre-analytic causes of bias that may influence the serum proteome. In particular the coagulation and complement systems should be considered as potential sources of noise in this context, as both are activated during serum sample collection and generate low molecular weight products. These products are undesirable for two reasons. First, if a putative tumor biomarker (e.g. LPA) is a member of a pathway altered during serum sample collection, changes between plasma levels of cancer and control subjects may be obscured. Secondly, the generation of activation products may simply complicate the spectrum. 
Also, sample collection practices should be rigorously defined, and include submitting matched control and cancer samples from all centers participating in the study. Matching for age and menopausal status should be considered. For example, in the data set used in this study, the mean age of the control group was 47 years and that of the cancer group was 60 years. It is noteworthy that the average age of menopause is approximately 51 years. This may introduce a bias in the results reported in this study as well as all others derived from this dataset. Finally, the steps associated with sample collection, processing, and binding to the chip may represent a particularly fertile area for research. Any combination of such steps may significantly alter the molecular subset of the sample that can be successfully analyzed.
However, the ability to discriminate between cancer and control based on the M/Z values of 2.79 and 245.5 reveals the presence of a significant experimental bias not related to disease pathology, which likely involves machine noise and matrix effects. This is particularly true of the M/Z value at 2.79, which represents a bias of the mass spectrometer instrument itself. If this is the case, the higher M/Z regions may also be affected. These findings indicate that any rule derived from this data set, including the ones presented in this paper, may be detecting differences in experimental bias and not disease pathology. Investigators in this field may minimize their chances of false discovery by careful experimental design and by using routine statistical methods both to overview the data (in an intentional search for bias) and as a benchmark for comparison with other data mining algorithms.
A training set was formed by randomly sampling 45 spectra out of the 91 controls and 80 spectra out of the 162 cancer cases (see Figure 1). Those spectra that were in the original data set but not in the training set formed the 'test' set. A two-sided Wilcoxon test was used to compare the intensity between the controls and cancers in the training set at each of the 15,154 M/Z values. We then selected a subset of the M/Z values with the lowest Wilcoxon test p-values (see the Results section for details). The selected M/Z values were sorted, and runs of consecutive M/Z values were grouped into bins. A separation of at least one M/Z value was required to start the next bin. The lowest p-value in each bin was selected and the corresponding M/Z value was used in stepwise discriminant analysis to determine the subset of M/Z values that best discriminated cancer from control in the training set. The resulting classification criteria were applied to the test data set, and sensitivity and specificity were computed. All the analyses were performed in SAS Version 8.2 (a statistical package from SAS Institute Inc., Cary, NC, USA) on a personal computer. The Wilcoxon test was performed using the NPAR1WAY procedure, stepwise discriminant analysis using the STEPDISC procedure, and discriminant analysis using the DISCRIM procedure in SAS.
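Two steps of the procedure above lend themselves to a short illustration: grouping selected M/Z values into bins of consecutive values and keeping the lowest p-value in each bin, and computing sensitivity and specificity from predicted labels. The Python sketch below is an assumption-laden illustration, not the SAS code used in the study. It assumes each M/Z value is identified by its integer position on the 15,154-point M/Z axis, so that "consecutive" means adjacent positions. The function names, positions, p-values, and labels are all hypothetical.

```python
def bin_consecutive(selected_positions, p_values):
    """Group selected M/Z positions into runs of adjacent positions and
    return, for each run, (positions_in_bin, position_with_lowest_p)."""
    bins = []
    for pos in sorted(selected_positions):
        if bins and pos == bins[-1][-1] + 1:
            bins[-1].append(pos)   # adjacent to previous position: same bin
        else:
            bins.append([pos])     # a gap of at least one position: new bin
    return [(b, min(b, key=lambda i: p_values[i])) for b in bins]


def sensitivity_specificity(true_labels, predicted_labels, positive="cancer"):
    """Return (sensitivity, specificity) as fractions between 0 and 1.
    Assumes at least one positive and one negative case are present."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical p-values keyed by position on the M/Z axis.
example_p = {10: 1e-20, 11: 1e-25, 12: 1e-22, 40: 1e-30, 41: 1e-21, 90: 1e-19}
example_bins = bin_consecutive(example_p.keys(), example_p)
representatives = [best for _, best in example_bins]

# Hypothetical labels for eight subjects.
truth = ["cancer"] * 4 + ["control"] * 4
predicted = ["cancer", "cancer", "cancer", "control",
             "control", "control", "control", "cancer"]
sens, spec = sensitivity_specificity(truth, predicted)
```

Keeping one representative per bin avoids passing several adjacent, highly correlated M/Z values to the stepwise discriminant analysis.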
To normalize the data, the procedure outlined by the Clinical Proteomics Program Databank was used. The cancer and control values for each M/Z were given respective labels, and the data were then pooled and normalized using the formula NV = (V - Min)/(Max - Min). In this expression, V is the raw intensity, Min is the minimum intensity of the pooled samples, Max is the maximum intensity of the pooled samples, and NV is the normalized value. Using this procedure, the normalized intensities all fall between 0 and 1. The data points were then sorted back into cancer and controls, and the p-values were calculated.
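The normalization formula NV = (V - Min)/(Max - Min) can be illustrated with a minimal Python sketch. It follows the description above for a single M/Z value: label, pool, rescale, then separate the groups again. The function name `normalize_pooled` and the intensity values are hypothetical, and the handling of the degenerate case Max = Min is an assumption, since the text does not address it.

```python
def normalize_pooled(cancer_values, control_values):
    """Min-max normalize pooled intensities at one M/Z value and return the
    normalized cancer and control values as two separate lists."""
    # Label each value so the groups can be separated again after pooling.
    pooled = ([(v, "cancer") for v in cancer_values]
              + [(v, "control") for v in control_values])
    lo = min(v for v, _ in pooled)
    hi = max(v for v, _ in pooled)
    if hi == lo:
        # Assumption: treat a constant signal as an error rather than
        # dividing by zero.
        raise ValueError("all pooled intensities are identical")
    normalized = [((v - lo) / (hi - lo), label) for v, label in pooled]
    cancer_nv = [nv for nv, label in normalized if label == "cancer"]
    control_nv = [nv for nv, label in normalized if label == "control"]
    return cancer_nv, control_nv


# Hypothetical raw intensities at one M/Z value.
cancer_nv, control_nv = normalize_pooled([33.0, 25.0], [17.0, 21.0])
```

In this example the pooled minimum is 17 and the pooled maximum is 33, so the value 25 maps to (25 - 17)/(33 - 17) = 0.5.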
We thank Christopher Gocke MD, Jules J. Berman PhD MD, G. William Moore MD PhD, and Robert Rohwer PhD for their critical reading of the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.