On the analysis of glycomics mass spectrometry data via the regularized area under the ROC curve
 Jingjing Ye^{1},
 Hao Liu^{2}Email author,
 Crystal Kirmiz^{3},
 Carlito B Lebrilla^{3} and
 David M Rocke^{4}
DOI: 10.1186/147121058477
© Ye et al; licensee BioMed Central Ltd. 2007
Received: 18 May 2007
Accepted: 12 December 2007
Published: 12 December 2007
Abstract
Background
Novel molecular and statistical methods are in rising demand for disease diagnosis and prognosis with the help of recent advanced biotechnology. Highresolution mass spectrometry (MS) is one of those biotechnologies that are highly promising to improve health outcome. Previous literatures have identified some proteomics biomarkers that can distinguish healthy patients from cancer patients using MS data. In this paper, an MS study is demonstrated which uses glycomics to identify ovarian cancer. Glycomics is the study of glycans and glycoproteins. The glycans on the proteins may deviate between a cancer cell and a normal cell and may be visible in the blood. Highresolution MS has been applied to measure relative abundances of potential glycan biomarkers in human serum. Multiple potential glycan biomarkers are measured in MS spectra. With the objection of maximizing the empirical area under the ROC curve (AUC), an analysis method was considered which combines potential glycan biomarkers for the diagnosis of cancer.
Results
Maximizing the empirical AUC of glycomics MS data is a largedimensional optimization problem. The technical difficulty is that the empirical AUC function is not continuous. Instead, it is in fact an empirical 0–1 loss function with a large number of linear predictors. An approach was investigated that regularizes the area under the ROC curve while replacing the 0–1 loss function with a smooth surrogate function. The constrained threshold gradient descent regularization algorithm was applied, where the regularization parameters were chosen by the crossvalidation method, and the confidence intervals of the regression parameters were estimated by the bootstrap method. The method is called TGDRAUC algorithm. The properties of the approach were studied through a numerical simulation study, which incorporates the positive values of mass spectrometry data with the correlations between measurements within person. The simulation proved asymptotic properties that estimated AUC approaches the true AUC. Finally, mass spectrometry data of serum glycan for ovarian cancer diagnosis was analyzed. The optimal combination based on TGDRAUC algorithm yields plausible result and the detected biomarkers are confirmed based on biological evidence.
Conclusion
The TGDRAUC algorithm relaxes the normality and independence assumptions from previous literatures. In addition to its flexibility and easy interpretability, the algorithm yields good performance in combining potential biomarkers and is computationally feasible. Thus, the approach of TGDRAUC is a plausible algorithm to classify disease status on the basis of multiple biomarkers.
Background
With rapidly developing biotechnology, the use of highthroughput clinical laboratory data to detect disease conditions and predict patients' outcomes is becoming a reality for medical practice. These technologies include microarray, mass spectrometry applied to proteomics, and new imaging modalities, which have been engaged in research on detecting clinical disease, predicting patients' responses to different treatments and evaluating the prognosis of patients with disease [1].
Among those new biotechnologies, mass spectrometry (MS) is used increasingly for protein profiling in cancer research. The basic goal is to predict cancer on the basis of peptide/protein abundance from the MS data. Recent literatures on cancer classification using MS have identified some potential protein biomarkers in serum to distinguish cancer from normal samples (Baggerly et al.[2], Wagner et al.[3], Adam et al.[4]). However, sensitivity and reproducibility remains as a major concern in making the protein technology reliable [5].
As an alternative, glycomics is proposed as a new trend for biomarker detection at the end of 20th century. Glycomics is the study of glycans (oligosaccharides), and glycoproteins. Apweiler et al.[6] estimated that at least 50% of human proteins are glycosylated. Since glycans play crucial roles in cell communication and signalling events [7], they may be implicated in cancer. Compared to potential protein or peptide biomarkers, oligosaccharides are highly sensitive to biochemical environment and are more easily identified and quantified [8]. Therefore, in a study conducted in this lab, clinical glycomics is used to identify potential biomarkers for the early detection of ovarian cancer.
Ovarian cancer is one of the most deadly types of cancer among women [8]. Many investigators believe that early detection of ovarian cancer would improve the patients' survival. CA 125 is the only FDA approved biomarker for the early detection of ovarian cancer. However, it has unreasonably low sensitivity and specificity. For instance, only 50% women with Stage I ovarian cancer will have an elevated CA 125 and many benign conditions can cause elevated levels [8].
A number of recent studies have implemented machine learning algorithms to classify highdimensional MS cancer data. Artificial neural networks (ANNs) were utilized by Ball et al.[10], Lancashire et al.[11] and Mian et al.[12], to discriminate different tumor states. Fushiki et al. [13] explored the efficient learning algorithm AdaBoost to extract potential biomarkers for classifying MS cancer from control samples. Decision tree based ensemble methods were proposed by Geurts et al.[14] to identify biomarkers for inflammatory diseases. Other algorithms, such as support vector machine (SVM) used by Xiong et al.[15], random forest (RF) applied by Wu et al.[16] and linear discriminant analysis (LDA) employed by Miketova et al.[17] and Lilien et al.[18], were also studied. Comparisons among algorithms in a case study of ovarian cancer classification were evaluated by Datta and DePadilla [19] and Wu et al.[16]. All of the above studies aim to discover the potential MS biomarkers that can distinguish one group from another. However, for prognostic and diagnostic purposes, how to combine those MS biomarkers and whether or not the combination is optimal are not addressed in those studies. Thus, an objective of this research is to consider a statistical method that combines the highdimensional MS measurements into a single score to classify cancer status jointly with suitable preprocessing of the data.
There are several studies on combining biomarkers. Su and Liu [20] studied the case where markers follow a multivariate normal distribution. They gave a closed form of optimal solution to the linear parameters. Normality is not suitable for mass spectrometry data because measurements of relative abundance are always positive. Pepe and Thompson [21] considered linearly combining two biomarkers by optimizing the area under the ROC curve. The method was developed only for lowdimensional situation. And it is not trivial to generalize the approach to highdimensional case. Ma and Huang [22] applied Pepe and Thompson's idea of optimizing AUC to microarray experiment. They used multivariate normal distribution in the simulation study and assumed independence between biomarkers, which is not true for mass spectrometry data. In addition, they implemented the threshold gradient algorithm, first proposed by Friedman and Popescu [23], without correctly recognizing the regularization parameter. In this work, the question on how to combine the highdimensional mass spectrometry predictors into a single score for the purpose of classification is addressed. The performance of a classifier by maximizing the area under the ROC curve for linearly combining the biomarkers is evaluated. The technical difficulty of this optimization problem is that the empirical AUC function is not differentiable. The objective function is in fact an empirical 0–1 loss function with a large number of linear predictors, and it is well known that such optimization problem is illimposed. An approach that regularizes the area under the ROC curve while replacing the 0–1 loss function with a sigmoid function was investigated. A constrained threshold gradient descent regularization algorithm, which is first introduced by Friedman and Popescu [23], to stabilize the estimates is applied. In Friedman and Popescu, they demonstrated their algorithm in a quadratic objective function. In this study, their objective function is replaced with the area under the ROC. A simulation is also conducted on mass spectrometry data under the casecontrol design that will generate joint distribution of diseased samples and normal samples to evaluate this algorithm.
The article is organized as the following. Simulation study is described in the Testing Section which describes how effective the proposed TGDRAUC approach is. The Implementation Section is for real mass spectrometry ovarian cancer data analysis after a description of our preprocessing method. TGDRAUC method is applied to lowdimensional and highdimensional ovarian cancer data. In the Conclusion and Discussion Section, it is concluded that the TGDRAUC algorithm is appropriate in the analysis of mass spectrometry glycomic data. A detailed description for TGDRAUC algorithm is in Method Section. The definition and properties of the ROC curve are reviewed. The area under the ROC curve as the objective function for maximizing the performance of the classifier is proposed. Furthermore, several sigmoid functions that replace the 0–1 loss function are introduced and a simple comparison among the sigmoid functions is shown. Threshold gradient direct regularization algorithm is explained after selection of sigmoid function as well as the detailed algorithm for parameter estimations.
Results
Testing
Testing of the TGDRAUC algorithm was demonstrated through a simulation study. Since the mass spectrometry measures the relative abundance of molecules, the measurement is always positive. Hence, a positive distribution is a reasonable choice for data simulation. In contrast to Ma and Huang [22], who simulated data under normal distribution and assumed independence among biomarkers, exponential distribution was chosen to generate the simulation data. Since a better classifier is desired, the data used for simulation were chosen so that the true AUC equals to 0.95. The simulation is generated as the following:

Denote X as normal patient, and m is the number of normal patients. Denote Y as the disease patient, and n is the number of diseased patients. For simplicity, we choose m = n. The dimension of the biomarkers is denoted as p.

Simulate X_{ i }as an exponential distribution with parameter λ = 1, i = 1, ..., n.

Generate a Bernoulli trial B(1, 0.95) for n times.

The data for the diseased patients are generated as Y_{ i }= X_{ i }+ 1 when Bernoulli trial is 1; Y_{ i }= max{0, X_{ i } 1} when Bernoulli trail is 0.
The number of the replication was chosen to be 500. The data is generated as joint distribution of X and Y. The true probability is P(X <Y) = 0.95, no matter what the linear combination β is. The goal of the simulation study is to show that the maximizer of empirical AUC by TGDR regularization is in fact our targeted maximization problem (4) (see Additional file 1, 2 for examples of simulated data).
Simulation Study Result. Summary statistics of the simulation results of TGDRAUC algorithm.
p  n  Bias  Mean of Empirical AUC  Median of AUC  Standard Error 

10  5  0.0088  0.9588  1  0.0857 
10  10  0.01674  0.96674  1  0.0535 
10  25  0.0087  0.9587  0.96  0.0375 
10  50  0.0009  0.9509  0.9576  0.0300 
25  5  0.0166  0.9666  1  0.0779 
25  10  0.0226  0.9726  1  0.047 
25  25  0.016  0.966  0.968  0.0338 
25  50  0.007  0.957  0.96  0.0273 
50  5  0.0172  0.9672  1  0.0795 
50  10  0.028  0.978  1  0.0473 
50  25  0.0218  0.9718  0.9808  0.0311 
50  50  0.0086  0.9586  0.96  0.0268 
The simulation conducted is for a relatively small sample size and a small number of biomarkers. From Table 1, the regularization of TGDR tends to overestimate the AUC when p is less than n. As sample size increases, the estimated AUC by regularization approximates the true AUC. Furthermore, as sample size increases, the standard error stabilizes to be around 0.03.
Table 1 gives a guideline on how to use the TGDRAUC algorithm for different situations. When a larger sample size n than biomarker p is observed, the algorithm is trustworthy, in the sense that the estimated AUC approximated the true AUC and the best ratio for p/n is around 1/5.0. However, when the number of biomarkers is much larger than the sample size, it remains unclear whether the TGDRAUC algorithm tends to overestimate the AUC.
Implementation
Two real MS data sets are analyzed, one lowdimensional and another highdimensional real ovarian cancer data. The lowdimensional data is preprocessed, and highdimensional data is the mass spectrometry raw data. So highdimensional data will be preprocessed first before it is carried on any further analysis. The highdimensional data is a subset of the lowdimensional data because of missing raw data files. The TGDRAUC algorithm is applied to both data sets for cancer diagnosis. All of the programming were done in C and R (see Additional file 2, 3, 4 for algorithm codes).
Lowdimensional Ovarian Cancer Data Analysis
Ovarian Cancer Data (with the bootstrap standard error in parenthesis). Implementation results of TGDRAUC algorithm to MS ovarian cancer data.
Estimators  Lowdimensional  Highdimensional 

Empirical AUC  0.953656(0.00126)  0.994987(0.0002527) 
τ  1  0.1 
λ  0.081559  0.000436 
Highdimensional Ovarian Cancer Data Analysis
In this section, the analysis starts from the raw ovarian cancer data. The problem with the raw mass spectrometry data is that the data is highdimensional, so extracting useful information is crucial. For this highdimensional data set, there were 19 normal patients and 21 cancer patients. Each patient had three measurements, called 10%, 20% and 40% fractions, corresponding to different sample extraction methods carried out prior to the mass spectrometry experiments. For each spectrum, the raw data contained 500,000 data points. The mass spectrometer's manufacturer(Varian FTMS Systems, Lake Forest, CA) provided their software for peakselection of individual spectra from the 500,000 data points. The problem with their peakidentification is that it is based on an individual spectrum, meaning that for any two spectra, peaks are selected at different masstocharge values. Therefore, the peaks are not consistent between samples and not trustworthy for cancer diagnosis. Before any data analysis is performed, the preprocessing of the data is critical.
Preprocessing HighDimensional Data
First, the peaks selected by the instrument's software for the spectra were used. The selectedpeaks were grouped into a matrix, with each column corresponding to one spectrum and each row corresponding to one distinct masstocharge (m/z) value. If the spectrum did not have the peak at the m/z value, a zero was replaced to indicate missing value in the matrix. However, this resulted in many zeros in the matrix. The corresponding raw data was chosen to be substituted into the zero intensity. In this way, much more similar information could be included as the raw data.
However, the scale of raw data was not the same as the corresponding selectedpeaks data file. The ratio factor between the raw and its corresponding selectedpeaks file needed to be estimated. The estimation of the ratio factor was done as the following: find the nearest point in the raw data to its corresponding selectedpeaks data, defined as the absolute distance between the corresponding m/z values. Two cases may happen: if the nearest m/z value in the raw data is unique, the ratio is calculated by the raw data intensity to the selectedpeaks data intensity at the nearest value; if the nearest value is not unique, the ratio is calculated by the averaged intensities of raw data at those nearest values to the selectedpeaks data intensity. For each file, the ratio was then averaged to give the unique factor estimation.
Using the estimated ratio for each file, the intensities from raw data could then be calculated to fill in the data matrix. There, again, may be two scenarios: if the closest m/z in raw data is unique to the selectedpeaks data, then substitute in the corresponding raw data intensity with the adjustment by its ratio factor; if the closest m/z in raw data is not unique, then substitute in the maximum intensities of those closest m/z raw data to the corresponding column of the data matrix, with adjusting by its ratio factor. This was chosen as the maximum intensity because we wanted to include the strongest signal to be substituted in the data matrix.
After imputing the data as above, the data matrix was formed with each spectrum as one column of the matrix and intensities at all same m/z values. Before any statistical analysis could be completed, each column of the data matrix was further normalized by dividing total ion current of the corresponding raw data intensities to make sure the comparison of the spectrum would be made on the same level. An arbitrary factor 100000 times the intensities in the data matrix was to amplify the normalized intensities to a reasonable magnitude. Because the data variation was dependent on the mean, log transform was carried out on the data matrix. An arbitrary 0.00000001 was added in the intensities to ensure valid log transformation. The 10%, 20% and 40% fractions were combined by adding intensities up at each m/z value to group the data into one patient as one column in the data matrix.
After appropriate preprocessing of normalization and log transformation, the intensities for MS data are assumed to approximately meet the ttest requirements. We then performed a ttest to each m/z value on factor whether the patient had cancer or not. The pvalues of the tests were recorded. The false discovery rate (FDR) by Benjamini and Hochberg [25] was applied to the ttest pvalues to adjust to multitest problem. Only adjusted pvalues less than 0.05 were selected out to be potential biomarkers. As a result, 1228 biomarkers were selected for the highdimensional data case.
Highdimensional Data analysis result
A 3fold crossvalidation of the TGDRAUC algorithm was applied to the data. The result is listed as the second column in Table 2. The estimated empirical AUC was almost perfect, close to 1. The bootstrap method was applied to estimate the empirical AUC confident interval for 1228 biomarkers. 500 bootstrap data sets were generated. The bagged empirical AUC for the given optimal pair of λ and τ was calculated from the bootstrap sample. The estimated standard error was 0.0002527. The confident interval for bootstrap was (0.9945, 0.9955). The highdimensional data had a small sample size compared to a large number of biomarkers, which suggests that the estimated empirical AUC may be overestimated by the simulation.
Comparisons of selected biomarkers between Hyun Joo An, et al. [8] and the TGDRAUC algorithm. The biomarkers in Hyun Joo An, et al. [8] were selected based on their biochemical properties.
Observed Mass  Oligosaccharide Composition  Estimated linear coefficient 

347.10  2Hex  0.138 
388.14  1HexNAc:1Hex  0.462 
509.17  3Hex  0.628 
550.21  1HexNAc:2Hex  0.077 
712.28  3Hex:1HexNAc  0.6299 
772.31  2Hex:1HexNAc:1Hex  0.084 
874.36  4Hex:1HexNAc  0.069 
915.38  3Hex:2HexNAc  0.2724 
975.43  2Hex:2HexNAc:1Hex  0.031 
1077.47  4Hex:2HexNAc  0.204 
1137.51  3Hex:2HexNAc:1Hex  0.0888 
1280.62  4Hex:3HexNAc  0.0424 
1442.72  5Hex:3HexNAc  0.0635 
1502.74  4Hex:3HexNAc:1Hex  0.0449 
Conclusion and Discussion
The key contribution of this work is that the optimal rules, for purpose of classifying disease status on the basis of multiple biomarkers, are based on the maximization of the area under the ROC subjected to constrained threshold gradient direct algorithm. The approach presented here relaxes the normality assumption and the approach of Pepe and Thompson [21] is generalized. The analysis is applied to the highdimensional mass spectrometry glycomics data. In contrast to Ma and Huang [22], the simulation data is generated based on the joint distribution of disease samples and control samples with nonnormality assumption, which is more appropriate for mass spectrometry data. This simulation is able to assess the difference between the maximization of the empirical AUC and the target AUC. The simulation proves the asymptotic properties that estimated TGDRAUC approaches the true AUC when the sample size is increasing compared to the dimensionality of p biomarkers. When applied to the real ovarian cancer data, the algorithm also provides the buildin dimension reduction technique. For the highdimensional ovarian cancer data, we can detect the 63 most important biomarkers among the total of 1228 biomarkers with simultaneously estimating the linear combination coefficients. The selected 63 biomarkers match very well with the 14 peaks preselected based on biological evidence. The algorithm is a nonparametric approach, very flexible and easy to interpret. The algorithm is aimed to have optimal classification. The resulting AUC of the linear combination is plausible and should be optimal among all other possible combinations. The computation of TGDRAUC is computational feasible of highdimensional data. This highdimensional analysis evaluates more than 1000 biomarkers in the algorithm and essentially could consider more.
Although the result of simulation cannot guarantee the estimated AUC comes close enough to the true AUC in large p biomarker number for small sample size situation, the algorithm still provides enough information in recognizing potential biomarkers. Since the performance of small dimension of biomarkers in the large sample size scenario gives excellent overall result, iteratively combining the biomarkers in highdimensional and reducing the dimension of biomarker to some reasonable size is considered. With the reduced dimension, the biomarkers are combined again using the TGDRAUC algorithm. The algorithm would continue until the ratio of number of biomarker and the sample size are in the comfortable zone suggested by our simulation. The TGDRAUC algorithm proves to be a promising algorithm and might be recommended in combining mass spectrometry biomarker analysis for cancer diagnosis.
Methods
ROC Curve
A casecontrol study is considered where the main outcome is binary denoted as D, where D = 1 as the case and D = 0 as the control. Denote the relative abundance of the p glycomics R_{p × 1}= (R_{1}, ..., R_{ p })^{ T }. We consider the linear combination score of the form
L_{ β }(R) = β'R = β_{1}R_{1} + β_{2}R_{2} + ... + β_{ p }R_{ p }
where ${\beta}_{1\times p}^{T}=({\beta}_{1},\mathrm{...},{\beta}_{p})$ is an unknown pvector parameter and R serves as the classification predictors. The classification rule is constructed by β'R. To be more specific, we classify D = 1 if β'R ≥ c and D = 0 otherwise, for a cutoff value c. By varying the cutoff value c, we obtain the Receiver Operating Characteristic(ROC) curve.
ROC is a graphical plot of the sensitivity and 1specificity, also known as true positive rate (TPR) and false positive rate (FPR), respectively. The TPR and FPR are defined by
TPR(c) = Pr(β'R ≥ cD = 1),
FPR(c) = Pr(β'R ≥ cD = 0)
for any cutoff value c. By varying the discrimination value of c, the TPR and FPR are plotted to generate the ROC curve, which is a twodimensional plot of FPR(c) vs TPR(c) with ∞ ≤ c ≤ +∞. There is a balance between TPR and FPR. A completely random predictor would give a straight line at an angle of 45 degrees from the horizontal, from bottom left to top right, because as the threshold is raised, there would be equal numbers of true and false positives. ROC above the nodiscrimination line would be preferred with better classification as the line closer to the upper leftcorner point (0,1).
The empirical AUC is the same as the form of MannWhitney test statistics. The optimal estimator $\widehat{\beta}$ is then defined as the maximizer of AUC(β).
The Sigmoid Function
where Y, X and index i, j are defined in (5).
Since the exponential part in sigmoid function may result in unbounded situation when larger r is selected or the data itself may be large, the exponential part was chosen to be controlled by normalizing the data in the following way: denote Z_{ ji }= Y_{ j } X_{ i }, i = 1, ..., m, j = 1, ..., n. Z is a pairwise difference matrix between is the disease and normal patients. We then normalize Z as Z/Z_{2}, where ${\Vert Z\Vert}_{2}^{2}={\displaystyle {\sum}_{i=1}^{p}{Z}_{i}^{2}}$ and p is the dimension of Z.
Threshold Gradient Direct Regularization (TGDR)
where P(β) is the penalties. There are several candidate penalty terms, ${P}_{1}(\beta )={\displaystyle {\sum}_{i=1}^{p}\left{\beta}_{i}\right}$, ${P}_{2}(\beta )={\displaystyle {\sum}_{i=1}^{p}{\beta}_{i}^{2}}$ and P_{∞}(β) = max_{i = 1, ..., p}β_{ i }, where p is the number of parameters. We choose the quadratic penalty term ${P}_{2}(\beta )={\displaystyle {\sum}_{i=1}^{p}{\beta}_{i}^{2}}$ because it is the most common one and use this in our following analysis.
Let ν denotes the starting value on the path and Δν as the increment on the path. To implement the algorithm, we select Δν = 0.01. For any given r of the sigmoid function, a threshold τ is varying between 0 and 1. τ was chosen in the algorithm to be a vector {0,0.1,0.2,...,0.9,1}. The TGDR algorithm performs the following iteration steps:

Initialize β_{ i }= 0.01 and ν = 0, for i = 1, ..., p.

Compute the negative gradient g_{ i }(ν) = ∂G(β; λ)/∂β_{ i }evaluatedat β_{ i }(ν). Denote g_{ i }(ν) as the ith component of g(ν). If max_{ i }{g_{ i }(ν)} = 0, stop the iteration.

Compute vector of f_{ i }(ν) as the ith component of f(ν); f_{ i }(ν) = I{g_{ i }(ν) ≥ τ·max_{ l }g_{ l }(ν), l = 1, ..., p}.

Update β_{ i }(ν + Δν) = β_{ i }(ν) + Δν × g_{ i }(ν) × f_{ i }(ν). Replace by ν + Δν.

Repeat step 2–4 until β converges, which means ${\sum}_{i=1}^{p}{({\beta}_{i}^{(k+1)}{\beta}_{i}^{(k)})}^{2}}\le \epsilon $. ε is a preselect small number and k is the number of iteration steps. We choose ε = 1 × 10^{8}.
The tuning parameter or the threshold τ controls the distribution of estimator $\widehat{\beta}$. When τ = 0, the estimator β is updated on every gradient and therefore the converged estimator is close to ridge regression (RR); while τ = 1, the estimated β is only updated on the maximum gradients, and the result is roughly corresponding to LASSO (Least Absolute Shrinkage and Selection Operator, [28]). The τ values in between 0 and 1 create the estimators more diverse than RR, but less than LASSO.
Double CrossValidation
where ${{\widehat{\beta}}^{\prime}}_{{k}_{1}(j){k}_{2}(i)}$ is calculated by (7) when the k_{1}(j) part of Y and k_{2}(i) part of X data are removed. The optimal λ* is found by minimizing (8).
The function in (8) reduces the pdimensional optimization problem to be onedimensional of minimizing λ. The golden section search method [29] was implemented, which does not require the calculation of the derivative, as the optimization method to search for the minimal value between 0 to a large number.
For any given τ in {0,0.1,0.2,...,0.9,1}, we run a double kfold crossvalidation. Denote τ_{ l }, l = 1, ..., 11 as the lth component in the vector {0,0.1,0.2,...,0.9,1}. After selection of regularized ${\widehat{\lambda}}_{l}$ estimator for given τ_{ l }, l = 1, ..., 11, the optimal $\widehat{\tau}$ is chosen to be min{CV(${\widehat{\lambda}}_{l}$), l = 1, ..., 11}, where CV(λ) is evaluated by substituting the regularized ${\widehat{\lambda}}_{l}$ in (8). One may also adapt Stone's twostage crossvalidation [30], but it will be computationally intensive.
Positive constraints
Because of the positive nature of the mass spectrometry data, a positive constraint on the parameter β is reasonable. The objective function (7) is minimized subjected to the constraints β_{ k }≥ 0 for k = 1, ..., p. Hence, the estimation $\widehat{\beta}$ will result in some exact zero coefficients when the optimization hits the positive constraint boundary, which means that the biomarker has no contribution to maximize the AUC.
Declarations
Acknowledgements
The statistical part of work is funded by NIH grant: NIH NHGRI R01HG003352, NIH NIEHS P42ES04699 and NIH NCI P30CA093373. The biological part of work is funded by NIH grant: NIH RO1 GM49077. In addition, the authors thank two anonymous referees for their insightful comments.
Authors’ Affiliations
References
 Pepe M, Cai T, Longton GM: Combining Predictors for Classification using the Area Under the Receiver Operating Characteristic Curve. Biometrics. 2006, 62: 221229. 10.1111/j.15410420.2005.00420.x.View ArticlePubMed
 Baggerly A, Morris J, Wang J, Gold D, Xiao L, Coombes K: A comprehensive approach to the analysis of matrixassited laser desorption/ionizationtime of flight proteomics spectra from serum samples. Proteomics. 2003, 3: 16671672. 10.1002/pmic.200300522.View ArticlePubMed
 Wagner M, Naik D, Pothen A: Protocols for disease classification from mass spectrometry data. Proteomics. 2003, 3: 16921698. 10.1002/pmic.200300519.View ArticlePubMed
 Adam B, Qu Y, Davis JW, Ward MD, Clements MA, Cazares LH, Semmes OJ, Schellhammer PF, Yasui Y, Feng Z, Wright GL: Serum Protein Fingerprinting Coupled with a Patternmatching Algorithm Distinguishes Prostate Cancer from Benign Prostate Hyperplasia and Healthy Men. Cancer Research. 2002, 62: 36093614.PubMed
 Baggerly KA, Morris JS, Coombes KR: Reproducibility of SELDITOF protein patterns in serum: comparing datasets from different experiments. Bioinformatics. 2004, 20 (5): 777785. 10.1093/bioinformatics/btg484.View ArticlePubMed
 Apweiler R, Hermjakob H, Sharon N: On the Frequency of Protein Glycosylation, as deduced from analysis of the SWISSPROT database. Biochimica et Biophysica Acta. 1999, 1473 (1): 48.View ArticlePubMed
 Varki A: Biological roles of oligosaccharides: all of the theories are correct. Glycobiology. 1993, 3 (2): 97130. 10.1093/glycob/3.2.97.View ArticlePubMed
 An H, Miyamoto S, Lancaster K, Kirmiz C, Li B, Lam K, Leiserowitz G, Lebrilla C: Profiling of Glycans in Serum for the Discovery of Potential Biomarkers for Ovarian Cancer. Journal of Proteome Research. 2006, 5: 16261635. 10.1021/pr060010k.View ArticlePubMed
 Pepe M, Etzioni R, Feng Z, Potter J, Thompson M, Thornquist M, Winget M, Yasui Y: Phases of biomarker development for early detection of cancer. Journal of the National Cancer Institute. 2001, 93 (14): 10541061. 10.1093/jnci/93.14.1054.View ArticlePubMed
 Ball G, Mian S, Holding F, Allibone RO, Lowe J, Ali S, Li G, McCardle S, Ellis IO, Creaser C, Rees RC: An integrated approach utilizing artificial neural networks and SELDI mass spectrometry for the classification of human tumours and rapid identification of potential biomarkers. Bioinformatics. 2002, 18 (3): 395404. 10.1093/bioinformatics/18.3.395.View ArticlePubMed
 Lancashire LJ, Mian S, Rees RC, Ball GR: Preliminary artificial neural network analysis of SELDI mass spectrometry data for the classification of melanoma tissue. 17th European Simulation Multiconference, Nottingham. 2003, Society for Modeling and Simulation International, SCS European Publishing House, Erlanger, Germany, 131135.
 Mian S, Ball G, Hornbuckle J, Holding F, Carmichael J, Ellis I, Ali S, Li G, McArdle S, Creaser C, Rees R: A prototype methodology combining surfaceenhanced laser desorption/ionization protein chip technology and artificial neural network algorithms to predict the chemoresponsiveness of breast cancer cell linear exposed to Paclitaxel and Doxorubicin under in vitro condition. Proteomics. 2003, 3: 17251737. 10.1002/pmic.200300526.View ArticlePubMed
 Fushiki T, Fujisawa H, Eguchi S: Identification of biomarkers from mass spectrometry data using a "common" peak approach. BMC Bioinformatics. 2006, 7 (358):
 Geurts P, Fillet M, Seny D, Meuwis M, Malaise M, Merville M, Wehenkel L: Proteomics mass spectra classification using decision tree based ensemble methods. Bioinformatics. 2005, 21 (15): 31383145. 10.1093/bioinformatics/bti494.View ArticlePubMed
 Xiong X, Fang X, Zhao J: Biomarker identification by feature wrappers. Genome Research. 2001, 11: 18781887.PubMed CentralPubMed
 Wu B, Abbott T, Fishman D, McMurray W, Mor G, Stone K, Ward D, Williams K, Zhao H: Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics. 2003, 19 (13): 16361643. 10.1093/bioinformatics/btg210.View ArticlePubMed
 Miketova P, AbbasHawka C, Hadfiled T: Microorganism gramtype differentiation of whole cells based on pyrolysis highresolution mass spectrometry data. Journal of Analytical and Applied Pyrolysis. 2003, 67: 109122. 10.1016/S01652370(02)000190.View Article
 Lilien RH, Farid H, Donald BR: Probabilistic disease classification of expressiondependent proteomic data from mass spectrometry of human serum. Journal of Computational Biology. 2003, 10: 925946. 10.1089/106652703322756159.View ArticlePubMed
 Datta S, DePadilla LM: Feature selection and machine learning with mass spectrometry data for distinguishing cancer and noncancer samples. Statistical Methodology. 2006, 3: 7992. 10.1016/j.stamet.2005.09.006.View Article
 Su J, Liu J: Linear Combinations of Multiple Diagnostic Markers. Journal of the American Statistical Association. 1993, 88 (424): 13501355. 10.2307/2291276.View Article
 Pepe M, Thompson M: Combining diagnostic test results to increase accuracy. Biostatistics. 2000, 1 (2): 123140. 10.1093/biostatistics/1.2.123.View ArticlePubMed
 Ma S, Huang J: Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics. 2005, 21: 43564362. 10.1093/bioinformatics/bti724.View ArticlePubMed
 Friedman J, Popescu B: Gradient Directed Regularization for linear regression and classification. Technical report. 2004, Department of Statistics, Stanford University, CA, [http://wwwstat.stanford.edu/jhf/ftp/path.pdf]
 Dudoit S, Fridlyand J, Speed T: Comparison of discrimination methods for tumor classification based on microarray data. Journal of the American Statistical Association. 2002, 97: 7787. 10.1198/016214502753479248.View Article
 Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of Royal Statistical Society B. 1995, 57 (1): 289300.
 McIntosh M, Pepe M: Combining Several Screening Tests: Optimality of the Risk Score. Biometrics. 2002, 58: 657664. 10.1111/j.0006341X.2002.00657.x.View ArticlePubMed
 Bamber D: The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology. 1975, 12: 387415. 10.1016/00222496(75)900012.View Article
 Tibshirani R: Regression shrinkage and selection via lasso. Journal of the Royal Statistical Society B. 1996, 58: 267288.
 Press W, Teukolsky S, Vetterling W, Flannery B: Golden Section Search in One Dimension. Numerical Recipies in C: the Art of Scientific Computing. 1992, Combridge University Press, 2
 Stone M: Crossvalidatory choice and assessment of statistical predictions. Journal of Royal Statistical Society. 1974, 36: 111147.
 Gui J, Li H: Threshold gradient descent method for censored data regression with applications in pharmacogenomics. Pac Symp Biocomput. 2005, : 272283. full_text.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.