MultiTGDR, a multiclass regularization method, identifies the metabolic profiles of hepatocellular carcinoma and cirrhosis infected with hepatitis B or hepatitis C virus
BMC Bioinformatics volume 15, Article number: 97 (2014)
Abstract
Background
Over the last decade, metabolomics has evolved into a mainstream enterprise utilized by many laboratories globally. Like other “omics” data, metabolomics data are characterized by a small sample size relative to the number of features evaluated. Thus, selecting an optimal subset of features together with a supervised classifier is imperative. We extended an existing feature selection algorithm, threshold gradient descent regularization (TGDR), to handle multiclass classification of “omics” data, and proposed two such extensions, referred to as multiTGDR. Both multiTGDR frameworks were used to analyze a metabolomics dataset that compares the metabolic profiles of hepatocellular carcinoma (HCC) infected with hepatitis B (HBV) or C virus (HCV) with those of cirrhosis induced by HBV/HCV infection; the goal was to improve early-stage diagnosis of HCC.
Results
We applied two multiTGDR frameworks to the HCC metabolomics data; they determine the TGDR thresholds either globally across classes or locally for each class. The multiTGDR global model selected 45 metabolites with a 0% misclassification rate (the error rate on the training data) and a 3.82% 5-fold cross-validation (CV5) predictive error rate. MultiTGDR local selected 48 metabolites with a 0% misclassification rate and a 5.34% CV5 error rate.
Conclusions
One important advantage of multiTGDR local is that it allows inference on which features are related specifically to which class or classes. Thus, we recommend multiTGDR local because it has predictive performance similar to that of multiTGDR global and requires the same computing time, but may also provide class-specific inference.
Background
Feature selection algorithms, which select a subset of the most relevant features for the underlying data mining tasks, are commonly used in combination with classifier construction to analyze “omics” data or data with high-dimensional input variables. The benefits of feature selection include minimizing model overfitting, improved predictive performance, and computational efficiency. It may also provide insights on potential targets that relate to the fundamental differences among different classes or subtypes of a biological process [1]. Threshold Gradient Descent Regularization (TGDR) [2], one such algorithm, has been explored and implemented by us [3–5] extensively because it possesses several key advantages, as described in our previous paper [4].
Currently, multiclass classification, where an observation needs to be categorized into more than two classes, has attracted increasing attention in the statistics and bioinformatics literature [6–10]. Its popularity may be attributed to the fact that multiclass classification is commonly encountered in real-world applications. For example, multiple classes can represent different tumor types or different responses to a therapy. According to Li et al. [6], multiclass classification can be roughly divided into two types. One type includes classification algorithms that can be directly extended to handle multiclass cases, and the other type includes algorithms that arise from the decomposition of multiclass problems into a series of binary ones.
While a series of binary TGDRs can be easily constructed to accomplish multiclass classification, it is more desirable to extend TGDR directly to the multiclass case, since this approach substantially reduces the number of classifiers that must be trained. The major technical difficulty associated with such an extension of TGDR involves defining an overall threshold for a feature across different classes, which is not addressed in the original TGDR framework [2, 11]. In our previous work [4], we introduced one approach, referred to as multiTGDR, for determining the threshold function. We applied the proposed multiTGDR framework to two real-world datasets generated on the Affymetrix HG-U133 Plus 2 platform and demonstrated that multiTGDR was superior in terms of predictive accuracy and parsimony compared to its binary counterparts (i.e., the one-versus-another schema). In this paper, we propose a more general method to determine the threshold function, which allows the threshold function to be class-specific.
Metabolomics is the “…systematic study of the unique chemical fingerprints that specific cellular processes leave behind” [12]. Over the last decade, metabolomics has evolved into a mainstream scientific approach practiced by many laboratories globally. The information conveyed in metabolomics data can provide insight for a variety of applications such as biomarker identification, clinical toxicology, and drug discovery and development [13]. Like other “omics” data, metabolomics data typically have a small sample size relative to the number of features (usually hundreds of metabolites after peak alignment). Therefore, it is crucial to implement feature selection. However, metabolomics data analysis is less standardized than other “omics” data analysis (e.g., microarray and Next-Generation Sequencing [NGS]) due to its complexity. Consequently, many of the existing feature selection algorithms have not been explored and implemented in metabolomics data analyses. Only a few algorithms have been proposed to specifically analyze mass spectrometry (MS) data [14]. Reviews on feature selection algorithms that may be used in metabolomics data analyses have been reported [1, 15].
Partial Least Squares-Discriminant Analysis (PLS-DA) is a very popular multivariate analysis tool that is commonly used in metabolomics data analyses to identify informative metabolites for many distinct purposes, such as the diagnosis or prognosis of a disease [16–19]. Notably, the success of the standalone software SIMCA-P (http://www.umetrics.com) has boosted the application of PLS-DA in metabolomics data analyses. As a supervised method, PLS-DA rotates the Principal Component Analysis (PCA) components by using the class membership information to achieve a better separation between the classes of samples. Similar to PCA, the results from PLS-DA are based on linear combinations of all metabolites, or at least on linear combinations of the metabolites that remain after naively leaving out those with a small variable influence on the projection (VIP, a weighted sum of PLS loadings). This approach not only lacks a ready biological interpretation, but also does not provide valid statistics that can be used to evaluate its predictive performance. To obtain such statistics, an extra classifier is desirable in PLS-DA. For example, the study by Student and Fujarewicz [10] obtained the accuracy of PLS-DA by implementing an additional support vector machine (SVM) classifier. Furthermore, the absence of predictive rules makes the results of PLS-DA less practical, because in clinical practice physicians would prefer a score (e.g., posterior probabilities) that quantifies a patient’s status. Therefore, an explicit predictive rule is essential for metabolomics to become a diagnostic tool.
In this paper, we investigate the use of two multiTGDR approaches to analyze mass spectrometry metabolomics data. Hepatocellular carcinoma (HCC) is the most common type of liver cancer. Most cases of HCC are secondary to either a viral hepatitis (hepatitis B or C) or cirrhosis [20]. Currently, the gold standards for diagnosis (e.g., ultrasonography and alpha-fetoprotein [AFP]) have been reported to lack satisfactory sensitivity and specificity for identifying HCC at early stages [21, 22]. Since metabolomics can comprehensively monitor changes in small molecules and systematically provide insight into metabolic deregulation [23, 24], researchers are employing metabolomics techniques to elucidate the difference between HCC and cirrhosis [19]. The identification of metabolic profiles for HCC/cirrhosis infected with HBV or HCV may help discriminate between the HCC, cirrhosis, and normal classes and achieve accurate diagnosis of HCC at early stages. Moreover, the analyses presented in this paper also provide motivation for developing feature selection algorithms specifically for metabolomics data, and for applying existing algorithms to metabolomics data.
Methods
The experimental data
The metabolic profiling study included 30 patients with cirrhotic liver disease (22 infected with HBV and 8 with HCV), 70 patients with HCC (39 with HBV infection and 31 with HCV infection), and 31 healthy volunteers. All of them provided written informed consent, and the ethics committee of Jilin University approved this study. Detailed descriptions of the study design, experimental procedures, and LC-MS metabolomics data collection were reported in [19].
Preprocessing procedures
Raw data were imported into Databridge (Waters, U.K.) for data format transformation. The resulting NetCDF files were imported into XCMS software for peak extraction and alignment. Then the peaks, comprising 384 metabolites (indexed by the combination of m/z and retention time, with their corresponding peak intensities) and 131 samples, were exported to an Excel file. The peak intensity values were log transformed so that the distribution of the transformed intensity values for each metabolite was approximately normal. Zeros in peak intensity (corresponding to no peaks) were replaced by a nominal value (i.e., 0.01) before log transformation to avoid the creation of missing values. Several other replacement values (i.e., 0.001, 0.005, 0.02, 0.05) were examined to evaluate whether different nominal values would affect the results, and no difference was found. Finally, peak intensity values were centered and scaled to have a mean of 0 and a variance of 1. The resulting matrix was used in the two multiTGDR frameworks for the classification analysis.
Compound identification was achieved by comparing the accurate mass of compounds against the Human Metabolome Database, HMDB version 3 (http://www.hmdb.ca).
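The transformation described above can be sketched for a single metabolite as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `preprocess`, the use of the natural logarithm, and the use of the population standard deviation are all assumptions, since the text specifies neither the log base nor the variance estimator.

```python
import math
from statistics import mean, pstdev

def preprocess(intensities, nominal=0.01):
    """Log-transform and standardize the peak intensities of one metabolite."""
    # Replace zeros (no peak detected) with a nominal value before taking logs.
    logged = [math.log(x if x > 0 else nominal) for x in intensities]
    mu = mean(logged)
    sd = pstdev(logged)  # assumption: population SD; fails if all values are equal
    # Center to mean 0 and scale to variance 1.
    return [(v - mu) / sd for v in logged]

# Toy intensities for one metabolite across four samples (made-up numbers).
z = preprocess([0.0, 12.5, 40.0, 3.2])
```

Rerunning the same call with `nominal` set to 0.001, 0.005, 0.02, or 0.05 mirrors the sensitivity check mentioned in the text.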
Methods
Here, we omit the description of binary TGDR. Interested readers may refer to the original papers [2, 11] for the detailed descriptions on binary TGDR. We present two multiclass TGDR frameworks with emphasis on the specific modifications made on the overall threshold functions to handle the multiclass problem.
Extension to multiclass classification
In the multiclass case, we have a set of C−1 binary variables Y_{ci}, which are the indicators for class c on subject i (i = 1,…,n, where n is the total number of subjects); i.e., Y_{ci} equals 1 if the i^{th} subject belongs to class c and zero otherwise. C is the number of classes (C ≥ 3), and X_{1},…,X_{n} represent the feature values of the individual subjects. Notably, X_{i} is a vector of length G and thus X is an n×G matrix, with X_{ij} the intensity value of feature j (j = 1,…,G) on subject i. The log-likelihood function is defined as
$$R\left(\beta\right)=\sum_{i=1}^{n}\left[\sum_{c=1}^{C-1}Y_{ci}\left(\beta_{c0}+X_{i}^{T}\beta_{c}\right)-\log\left(1+\sum_{c=1}^{C-1}\exp\left(\beta_{c0}+X_{i}^{T}\beta_{c}\right)\right)\right]$$
β_{c0}s (c = 1,…,C−1) are unknown intercepts, which are not subject to regularization. β_{c} = (β_{c1},…, β_{cG}) are the corresponding coefficients for the intensity values of the metabolites under consideration. In an ‘omics’ experiment, most of those coefficients are assumed to be zero, implying that the corresponding features are noninformative in explaining the difference across classes. In the multiclass case, the threshold functions of every feature (i.e., metabolites in our application) in TGDR need to be redefined across classes. In previous work [4], we proposed an extension of TGDR as described below.
Method 1
Denote Δv as the small positive increment (e.g., 0.01) in ordinary gradient descent search and v_{ k } = k×Δv as the index for the point along the parameter path after k steps. Let β(v_{ k }) denote the parameter estimate of β corresponding to v_{ k }. For a fixed threshold 0 ≤τ≤ 1, our proposed TGDR algorithm for multiclass cases is given as follows:

1. Initialize β(0) = 0 and v_{0} = 0.
2. With the current estimate β, compute the negative gradient matrix g(v) = −∂R(β)/∂β, with its (c,j)^{th} component denoted g_{cj}(v).
3. a) Let f_{c}(v) represent the threshold vector of size G for class c (c = 1,…,C−1), with its j^{th} component calculated as
$$f_{cj}\left(v\right)=I\left(\left|g_{cj}\left(v\right)\right|\ge \tau \times \underset{l\in \beta_{c}}{\max}\left|g_{cl}\left(v\right)\right|\right)\quad \forall j\in \beta_{c}$$
   b) Then, the j^{th} feature-specific threshold function is defined as
$$f_{j}\left(v\right)=\underset{c}{\max}\left(f_{cj}\left(v\right)\right)$$
4. Update β(v+Δv) = β(v) − Δv×g(v)×f(v) and update v by v+Δv, where the (c, j)^{th} component of the product of f and g is g_{cj}(v) × f_{j}(v).
5. Steps 2-4 are iterated K times. The number of iterations K is determined by cross-validation.
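The five steps can be sketched in plain Python as below. This is an illustrative sketch under several assumptions rather than the authors' R implementation: R(β) is taken to be the multinomial logistic log-likelihood with the last class as reference, step 4 is read as a step that increases R, the intercepts are updated at every iteration and excluded from the maximum in step 3a, and the names `class_probs` and `multi_tgdr_global` are hypothetical.

```python
import math

def class_probs(x, b0, B):
    """Posterior probabilities of classes 1..C-1 (class C is the reference)."""
    eta = [b0[c] + sum(bj * xj for bj, xj in zip(B[c], x)) for c in range(len(B))]
    m = max(0.0, max(eta))                      # shift for numerical stability
    expo = [math.exp(e - m) for e in eta]
    denom = math.exp(-m) + sum(expo)
    return [e / denom for e in expo]

def multi_tgdr_global(X, y, n_classes, tau=1.0, dv=0.01, K=200):
    """X: list of feature rows; y: labels 0..C-1, with label C-1 as reference."""
    G, C1 = len(X[0]), n_classes - 1
    b0 = [0.0] * C1                             # intercepts, never thresholded
    B = [[0.0] * G for _ in range(C1)]          # step 1: beta(0) = 0
    for _ in range(K):                          # step 5: iterate K times
        # Step 2: g = -dR/dbeta, with R the log-likelihood.
        g0 = [0.0] * C1
        g = [[0.0] * G for _ in range(C1)]
        for xi, yi in zip(X, y):
            p = class_probs(xi, b0, B)
            for c in range(C1):
                r = (1.0 if yi == c else 0.0) - p[c]   # dR/d(eta_c), subject i
                g0[c] -= r
                for j in range(G):
                    g[c][j] -= r * xi[j]
        # Step 3a: class-wise indicators; step 3b: maximum over classes.
        f = [0] * G
        for c in range(C1):
            gmax = max(abs(v) for v in g[c])
            for j in range(G):
                if abs(g[c][j]) >= tau * gmax:
                    f[j] = 1
        # Step 4: beta <- beta - dv * g * f (intercepts are always updated).
        for c in range(C1):
            b0[c] -= dv * g0[c]
            for j in range(G):
                B[c][j] -= dv * g[c][j] * f[j]
    return b0, B

# Toy data: feature 0 marks class 0, feature 1 marks class 1, feature 2 is constant.
X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0]]
y = [0, 0, 1, 1, 2, 2]
b0, B = multi_tgdr_global(X, y, n_classes=3, tau=1.0, dv=0.01, K=50)
selected = [j for j in range(3) if any(B[c][j] != 0.0 for c in range(2))]
```

In this toy run the constant feature never passes the threshold, so its coefficients stay at exactly zero, while a feature that passes in one class is updated in both classes, which is the defining behavior of the global variant.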
As in binary TGDR, all metabolites are assumed to be noninformative at the initial stage. Parameters τ and k are the tuning parameters, and thus jointly determine the properties of the estimated coefficients, including the selection of features and their corresponding magnitudes. τ can be regarded as a threshold because it determines how the βs are updated in each iteration. Two extreme cases are: if τ=0, all coefficients are nonzero for all values of k; and if τ=1, multiTGDR moves in the direction of one (if the gradient of the intercept term has the largest absolute value) or two covariates in each iteration. Consequently, the nonzero coefficients are few at the early iterations. With increasing k, an increasing number of βs deviate from zero until all of them eventually enter the model. Both τ and K are determined by cross-validation [25].
In this framework, when a feature is selected in one comparison, it will appear in the remaining comparisons even though it may not be informative in those comparisons. This is analogous to the multivariate regression model setting, where the same set of covariates is used for each response even though some of them may not be statistically significant. Alternatively, we may choose to force small estimated coefficients to zero in the last step. Then, the set of selected features becomes different for each comparison. This framework is referred to as multiTGDR global herein.
On the other hand, one may ask why f_{j} is not set to the minimum of the f_{cj}s instead of their maximum. In that case, there is no update until a feature has large enough gradients in all C−1 comparisons. Therefore, only features that are informative in all comparisons will be chosen, resulting in an optimal feature set that is used to classify all classes simultaneously. This is in conflict with the hypothesis that a good feature set consists of features highly correlated with a class but uncorrelated with other classes, which was confirmed by Hall [26]. Moreover, the performance of such a choice has been shown to be generally less favorable than that of one-versus-another or one-versus-the-rest binary ensembles [10].
Method 2
Instead of an overall threshold function for the j^{th} feature, a threshold function specific to the c^{th} class is used to select features. This modification corresponds to using the threshold from step 3a of the multiTGDR global framework directly. Thus, a feature is not necessarily selected in other comparisons when it is updated in one comparison. As a result, different sets of selected features are associated with different classes. This framework is herein referred to as multiTGDR local. Figure 1 shows the flowcharts of multiTGDR global and multiTGDR local, and pinpoints the difference between the two frameworks.
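The difference between the two thresholding rules can be illustrated on a small gradient matrix. The sketch below is only an illustration; the function name `threshold_masks` and the numbers in `g` are made up for the example.

```python
def threshold_masks(g, tau=1.0):
    """g: (C-1) x G gradient matrix. Returns the (local, global) 0/1 masks."""
    local = []
    for row in g:
        gmax = max(abs(v) for v in row)
        # Class-specific indicator f_cj, used as-is by the local variant.
        local.append([1 if abs(v) >= tau * gmax else 0 for v in row])
    # Global variant: a feature passes if it passes in any class (max over c).
    glob = [max(col) for col in zip(*local)]
    return local, glob

g = [[-4.0, 1.0, 0.5],
     [0.2, -3.0, 0.1]]
local, glob = threshold_masks(g, tau=1.0)
# local is [[1, 0, 0], [0, 1, 0]]; glob is [1, 1, 0]
```

With the local masks, feature 0 is updated only in the first comparison and feature 1 only in the second; with the global mask, both features are updated in both comparisons.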
In the above two frameworks, we treat τ as a uniform tuning parameter across classes, which can certainly be relaxed so that τ may take different values for each class, allowing different degrees of regularization for different comparisons. However, for the “omics” data where the number of features is much larger than the number of samples, in our experience τ=1 tends to give the most reasonable results. Firstly, it has the harshest threshold, resulting in the smallest set of selected features. Secondly, the predictive performance may be improved by discarding those noninformative or redundant features.
Bagging and Brier score
The bagging [27] procedure was used to filter out the possible noise from a single run of multiTGDR, so that better model parsimony can be achieved. The benefits of bagging include, but are not limited to, avoidance of overfitting, improved prediction, and manageable experimental verification. In many applications, e.g., [10], bootstrap resampling/bagging is mainly used to evaluate the stability of a classifier.
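The bagging step can be sketched as counting how often each feature is selected across bootstrap resamples and then keeping the features whose frequency reaches a cutoff. This is a sketch under assumptions: `select` is a hypothetical callback standing in for a full multiTGDR run on the resampled subjects, and the number of resamples is arbitrary.

```python
import random

def bagging_frequency(n_samples, n_features, select, n_boot=100, seed=0):
    """Fraction of bootstrap resamples in which each feature is selected."""
    rng = random.Random(seed)
    counts = [0] * n_features
    for _ in range(n_boot):
        # Draw subjects with replacement.
        idx = [rng.randrange(n_samples) for _ in range(n_samples)]
        for j in select(idx):       # indices of features chosen on this resample
            counts[j] += 1
    return [c / n_boot for c in counts]

def toy_select(idx):
    # Hypothetical selector: feature 0 is always chosen, feature 1 only
    # when subject 0 is in the resample, feature 2 never.
    return [0, 1] if 0 in idx else [0]

freq = bagging_frequency(n_samples=5, n_features=3, select=toy_select)
kept = [j for j, fr in enumerate(freq) if fr >= 0.40]   # 40% frequency cutoff
```

The 40% cutoff matches the bagging frequency threshold reported later in the results; other cutoffs trade parsimony against the risk of discarding informative features.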
Besides the traditionally used confusion matrix and misclassification rate, the generalized Brier score (GBS) [7] was also calculated to evaluate the predictive performance of the two multiTGDR frameworks, since it incorporates the extra information provided by the estimated posterior probabilities. Additional details on the trimming performed for both bagging and the Brier score in multiclass classification were discussed in a previous study [4].
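As an illustration of how posterior probabilities enter such a score, the sketch below uses one common form of the multiclass Brier score: the squared differences between the class indicators and the predicted probabilities, summed over subjects and classes and halved. The exact normalization used in [7] and [4] is not stated in this text, so this form is an assumption, as is the function name.

```python
def generalized_brier_score(y, probs, n_classes):
    """Half the summed squared error between class indicators and probabilities."""
    total = 0.0
    for yi, pi in zip(y, probs):
        total += sum(((1.0 if c == yi else 0.0) - pi[c]) ** 2
                     for c in range(n_classes))
    return total / 2.0

# Two subjects, three classes: perfect versus uninformative predictions.
perfect = generalized_brier_score([0, 2], [[1, 0, 0], [0, 0, 1]], 3)
uniform = generalized_brier_score([0, 2], [[1/3, 1/3, 1/3]] * 2, 3)
```

Lower values indicate better calibrated and more confident correct predictions, which is why the score can separate two models that share the same misclassification rate.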
Statistical language and packages
The statistical analysis was carried out in the R language, version 2.15 (http://www.r-project.org). R code for multiTGDR is available upon request.
Results and discussion
Synthesized data
In order to study the empirical performance of both multiTGDR frameworks, we used the real values for metabolites of the HCC/cirrhosis data (384 metabolites and 131 samples) but assigned the class membership according to predetermined logit functions f. Specifically, the logit functions for classes 2 and 3, with class 1 as reference, were given by the following relationships for the two synthesized datasets,
Simulation 1
where the logits for class 2 and 3 depend only on features X_{1} ~ X_{5}, but differ in the direction and magnitudes of the association.
Simulation 2
where the logits for classes 2 and 3 are two functions with different parameters in the second simulation.
By this means, the true relevant features (i.e., X_{1}, X_{2}, X_{3}, X_{4}, X_{5}) are known and performance comparisons can be made between the multiTGDRs and PLS-DA. Here, the PLS-DA analysis was carried out in SIMCA-P+ version 12.0 (http://www.umetrics.com). A feature was eliminated unless it had a VIP value larger than 1 in either of the first two PLS components. The results are given in Table 1.
In summary, the true relevant features were successfully identified by all methods. The predictive performance of both multiTGDR frameworks was superior to that of PLS-DA. Even after bagging, the final models for both multiTGDRs included substantially more features than the true ones, which might indicate that there is room for further improvement in the multiTGDR frameworks and other relevant algorithms.
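The idea of assigning class membership from predetermined logits on real feature values can be sketched as follows. The coefficients below are hypothetical placeholders and are not the values used in the simulations; drawing the class at random from the resulting probabilities (rather than taking the most probable class) is also an assumption, as is the function name `assign_class`.

```python
import math
import random

def assign_class(x, coefs, rng):
    """Draw a class in 1..C from logits relative to class 1 (the reference).

    coefs holds one (intercept, weights) pair for each of classes 2..C.
    """
    eta = [b0 + sum(w * xj for w, xj in zip(ws, x)) for b0, ws in coefs]
    expo = [math.exp(e) for e in eta]
    denom = 1.0 + sum(expo)
    probs = [1.0 / denom] + [e / denom for e in expo]
    # Inverse-CDF draw from the class probabilities.
    u, cum = rng.random(), 0.0
    for k, p in enumerate(probs, start=1):
        cum += p
        if u < cum:
            return k
    return len(probs)

rng = random.Random(1)
# Hypothetical coefficients on X1..X5; NOT the values used in the paper.
coefs = [(0.0, [1.0, -1.0, 0.5, 0.0, 0.0]),
         (0.0, [-1.0, 1.0, 0.0, 0.5, -0.5])]
labels = [assign_class([0.3, -1.2, 0.8, 0.1, -0.4], coefs, rng) for _ in range(20)]
```

Because only the first five features enter the logits, any other feature that a method selects on such data can be counted as a false positive.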
Real data
A metabolomics study was conducted with the objectives of identifying potentially important differential metabolites related to HCC pathogenesis and early diagnosis, and thus providing an explicit predictive rule that can aid a physician’s diagnosis on HCC and cirrhosis. There were 131 subjects (70 with HCC, 30 with cirrhosis, and 31 normal controls, respectively) and 384 metabolites in this study. Additional details on this motivating study have been presented in [19]. Figure 2 outlines the schema of this study.
Performance of multiTGDR
In Figure 3, the cross-validation scores showed minimal change for both frameworks, especially for k > 500, so the final iteration number K in both multiTGDR global and multiTGDR local was chosen as 500. Table 2 presents the results of the two multiTGDR approaches. MultiTGDR global selected 45 metabolites with a 0% misclassification rate and a 3.82% 5-fold cross-validation predictive error rate. With the cutoff of bagging frequency fixed at 40%, 30 metabolites were retained in the final model (Model 1_w), which had a 0% misclassification rate and a slight improvement in GBS. On the other hand, multiTGDR local selected 48 metabolites with a 0% misclassification rate and a 5.34% CV5 error rate. After applying bagging, 29 metabolites were retained in the final model (Model 2_w), with a slight decrease in GBS. Interestingly, the metabolites in Model 1_w and Model 2_w are almost the same (25 overlap). Model 1_w identified 5 extra metabolites and Model 2_w identified 4 such metabolites. Table 3 presents the overlapping metabolites and the extra metabolites identified by each multiTGDR framework.
Comparison with PLS-DA analysis
The data had also been analyzed using PLS-DA [19]. There, the potential markers were chosen based on the loading plot of PLS-DA, then evaluated by the VIP of the first two components in PLS-DA and further confirmed by t-tests. We compared the metabolites selected by the original analysis with those selected by the multiTGDR frameworks (using the whole data on which the original PLS-DA was conducted). There are only 4 overlaps between multiTGDR global and PLS-DA, and 5 overlaps between multiTGDR local and PLS-DA (indicated by * in Table 3).
In order to compare the results obtained from PLS-DA with those from multiTGDR, we used the 42 metabolites selected by PLS-DA (as shown in Table 2 of [19]) and used a naïve Bayes model as a classifier to calculate the posterior probabilities for PLS-DA. The performance of PLS-DA is shown in Table 2. In summary, the metabolites selected by multiTGDR have a better predictive performance than those selected by PLS-DA.
Evaluation on the effect of preprocessing filtering
Moderated t-tests using limma [28] were conducted to identify the differential metabolites between HCC/cirrhosis and normal, in order to examine the effects of prefiltering. The cutoff for the false discovery rate (FDR) was chosen as 0.05. There were 94 downregulated and 104 upregulated metabolites in the comparison of cirrhosis to normal, and 63 downregulated and 186 upregulated metabolites for HCC to normal. In total, 302 unique differentially expressed metabolites were identified by those t-tests. Only 4 metabolites from the final classifier models (i.e., Model 1_w and Model 2_w) were missing from this set. We then reran both multiTGDRs with those 302 differentially expressed metabolites. The corresponding results are shown in Table 2B and Figure 4. They show that the performance of both multiTGDRs decreased on the filtered data, but not substantially. To conclude, prefiltering may save considerable computational time with marginal impact on predictive performance.
Conclusions
Metabolites selected by multiTGDR may provide biological insight into HCC/cirrhosis. According to the descriptions of the selected metabolites given by the HMDB, some interesting observations were made. First, furoic acid is a metabolite produced from furfural via oxidation. Furfural is a confirmed animal carcinogen with unknown relevance to humans, and it has been suggested as a substance that produces hepatic cirrhosis [29, 30]. Here, both multiTGDR versions selected furoic acid, while the coefficients of the two comparisons (i.e., HCC versus normal, cirrhosis versus normal) are in opposite directions. Second, a significant decrease of isoxanthopterin has been identified in cancer patients [31]; however, the multiTGDR results show an increase instead. It is well known that careful control of the participants’ intake before a metabolomics experiment is difficult. With that in mind, many of the HCC subjects may have received therapeutic treatments that might increase the level of isoxanthopterin, with residual levels present despite strict diet and intake control during the metabolomics study. In addition, overdosage of interventions for cancer patients is possible, especially in a developing country like China. Thus, the accumulation of isoxanthopterin in HCC patients may be a result of long-term overdosage of relevant drugs. Meanwhile, we do not exclude the possibility that HCC has its own unique characteristics in terms of isoxanthopterin and is consequently different from other cancer types. Further investigation of the biological explanation of the selected metabolites is definitely warranted. Here, our focus is to present the multiTGDR frameworks and to demonstrate their applications in metabolomics.
With the aid of a feature selection algorithm like multiTGDR (an algorithm that can provide an explicit predictive rule and compute the posterior probabilities of class membership), it is possible to design a diagnostic kit that examines the selected metabolites in a clinical setting with higher sensitivity and specificity. This kit would allow HCC developed from HCV/HBV infections to be discriminated from cirrhosis with HCV/HBV infections, which is highly desirable and of scientific importance. One limitation of our application is that, since the proportion of diseased persons in an observational study may not reflect disease prevalence in the population, care must be taken in both model construction and evaluation. To ensure that a multiTGDR model can correctly classify persons in the general population, one approach is to obtain weights based on the ratio between the proportion of diseased persons in the population and that in the study. A comprehensive investigation of these issues is the focus of our future research.
Two extensions of TGDR are proposed here for multiclass classification problems. By training only one classifier, we specifically address the suboptimality associated with dividing multiclass classification into individual binary pairs. The performance of multiTGDR global was previously shown by us to be excellent [4] using simulated data and two microarray data sets. Compared to multiTGDR global, multiTGDR local has an almost identical predictive performance in the HCC metabolomics data (in both the simulated data and the real data). Additionally, we conducted extra simulations to verify the validity of multiTGDR local and compared its performance with multiTGDR global. The results (included in Additional file 1: Supplementary materials) show that both multiTGDR algorithms can identify the true relevant features and discard the irrelevant features. Identical predictive performances are also observed even in cases where some of the features are highly correlated with the relevant features. Intuitively, we hypothesized that multiTGDR global should perform better in cases where the classes share more similarity. This means that the same set of features is shared across different classes, but the magnitudes of the association differ, which may correspond to different stages of a disease. In contrast, multiTGDR local may be optimal in cases where no similarity between the classes is present. This means that completely different sets are selected across different classes, which may represent different diseases. Interestingly, the results from the simulations do not support this hypothesis. Finally, we also examined whether multiTGDR local is associated with less computation time, since it does not need to compute the overall threshold function f_{ j }(v). However, based on our current experience with the simulations and real-world applications, we found the computational effort of these two approaches to be comparable.
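The weighting idea in the limitation above can be sketched as a ratio of class proportions. The study counts are those reported for this dataset; the population proportions below are hypothetical placeholders chosen only for illustration, and the function name is an assumption.

```python
def prevalence_weights(study_counts, population_props):
    """Weight per class: population proportion divided by study proportion."""
    n = sum(study_counts.values())
    return {k: population_props[k] / (study_counts[k] / n) for k in study_counts}

study_counts = {"HCC": 70, "cirrhosis": 30, "normal": 31}
# Hypothetical population proportions; NOT estimates from the paper.
population_props = {"HCC": 0.01, "cirrhosis": 0.04, "normal": 0.95}
weights = prevalence_weights(study_counts, population_props)
```

Such weights could, for example, multiply each subject's contribution to the log-likelihood or to the error rate, so that over-sampled diseased classes count less than they do in the raw study data.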
This may be due to the fact that compared to the computation of many gradients at each iteration, the computation of maximum on f_{ cj }(v) is negligible. One obvious advantage of multiTGDR local is that it may provide us with information on which feature is related to which class/classes.
To conclude, we recommend the use of the multiTGDR frameworks for multiclass classifications on “omics” data because they have excellent predictive capacity. The researchers may choose to run both or either of the multiTGDR frameworks based on their research hypotheses and data type.
Abbreviations
TGDR: Threshold gradient descent regularization
multiTGDR global and local: Threshold gradient descent regularization for multiple classes (global: version 1 and local: version 2)
HCC: Hepatocellular carcinoma
CVX: X-fold cross-validation
BF: Bagging frequency
HMDB: Human Metabolome Database
GBS: Generalized Brier score
PCA: Principal Component Analysis
PLS-DA: Partial Least Squares-Discriminant Analysis.
References
 1.
Saeys Y, Inza I, Larrañaga P: A review of feature selection techniques in bioinformatics. Bioinformatics. 2007, 23: 25072517. 10.1093/bioinformatics/btm344.
 2.
Friedman JH: Gradient Directed Regularization for Linear Regression and Classification. 2004, Techinical report
 3.
Tian S, Krueger JG, Li K, Jabbari A, Brodmerkel C, Lowes MA, SuárezFariñas M: Metaanalysis derived (MAD) transcriptome of psoriasis defines the “core” pathogenesis of disease. PLoS One. 2012, 7: e4427410.1371/journal.pone.0044274.
 4.
Tian S, SuárezFariñas M: MultiTGDR: a regularization method for multiclass classification in microarray experiments. PLoS One. 2013, 8: e7830210.1371/journal.pone.0078302.
 5.
Tian S, Suárezfariñas M: HierarchicalTGDR: combining biological hierarchy with a regularization method for multiclass classification of lung cancer samples via highthroughput geneexpression data. Syst Biomed. 2013, 1: 93102.
 6.
Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004, 20: 24292437. 10.1093/bioinformatics/bth267.
 7.
Yeung KY, Bumgarner RE, Raftery AE: Bayesian model averaging: development of an improved multiclass, gene selection and classification tool for microarray data. Bioinformatics. 2005, 21: 23942402. 10.1093/bioinformatics/bti319.
 8.
Zhang ML, Zhou ZH: MLKNN: a lazy learning approach to multilabel learning. Pattern Recognit. 2007, 40: 20382048. 10.1016/j.patcog.2006.12.019.
 9.
Vens C, Struyf J, Schietgat L, Džeroski S, Blockeel H: Decision trees for hierarchical multilabel classification. Mach Learn. 2008, 73: 185214. 10.1007/s1099400850773.
 10.
Student S, Fujarewicz K: Stable feature selection and classification algorithms for multiclass microarray data. Biol Direct. 2012, 7: 3310.1186/17456150733.
 11.
Ma S, Huang J: Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics. 2005, 21: 43564362. 10.1093/bioinformatics/bti724.
 12.
Daviss B: Growing pains for metabolomics. Science. 2005, 19: 2528.
 13.
Wishart DS, Knox C, Guo AC, Eisner R, Young N, Gautam B, Hau DD, Psychogios N, Dong E, Bouatra S, Mandal R, Sinelnikov I, Xia J, Jia L, Cruz JA, Lim E, Sobsey CA, Shrivastava S, Huang P, Liu P, Fang L, Peng J, Fradette R, Cheng D, Tzur D, Clements M, Lewis A, De Souza A, Zuniga A, Dawe M, et al: HMDB: a knowledgebase for the human metabolome. Nucleic Acids Res. 2009, 37: D603D610. 10.1093/nar/gkn810.
 14.
Noble WS, MacCoss MJ: Computational and statistical analysis of protein mass spectrometry data. PLoS Comput Biol. 2012, 8: e100229610.1371/journal.pcbi.1002296.
 15.
Baumgartner C, Osl M, Netzer M, Baumgartner D: Bioinformaticdriven search for metabolic biomarkers in disease. J Clin Bioinform. 2011, 1: 210.1186/2043911312.
 16.
Ramadan Z, Jacobs D, Grigorov M, Kochhar S: Metabolic profiling using principal component analysis, discriminant partial least squares, and genetic algorithms. Talanta. 2006, 68: 16831691. 10.1016/j.talanta.2005.08.042.
 17.
Chen M, Ni Y, Duan H, Qiu Y, Guo C, Jiao Y, Shi H, Su M, Jia W: Mass spectrometrybased metabolic profiling of rat urine associated with general toxicity induced by the multiglycoside of Tripterygium wilfordii Hook. f. Chem Res Toxicol. 2008, 21: 288294. 10.1021/tx7002905.
 18.
Chen J, Zhang X, Cao R, Lu X, Zhao S, Fekete A, Huang Q, Schmitt-Kopplin P, Wang Y, Xu Z, Wan X, Wu X, Zhao N, Xu C, Xu G: Serum 27-nor-5β-cholestane-3,7,12,24,25 pentol glucuronide discovered by metabolomics as potential diagnostic biomarker for epithelium ovarian cancer. J Proteome Res. 2011, 10: 2625-2632. 10.1021/pr200173q.
 19.
Zhou L, Ding L, Yin P, Lu X, Wang X, Niu J, Gao P, Xu G: Serum metabolic profiling study of hepatocellular carcinoma infected with hepatitis B or hepatitis C virus by using liquid chromatography-mass spectrometry. J Proteome Res. 2012, 11: 5433-5442. 10.1021/pr300683a.
 20.
Kumar V, Fausto N, Abbas A: Robbins & Cotran Pathologic Basis of Disease. 2005, Philadelphia: Elsevier Saunders, 7th edition.
 21.
Chen L, Ho DWY, Lee NPY, Sun S, Lam B, Wong KF, Yi X, Lau GK, Ng EWY, Poon TCW, Lai PBS, Cai Z, Peng J, Leng X, Poon RTP, Luk JM: Enhanced detection of early hepatocellular carcinoma by serum SELDI-TOF proteomic signature combined with alpha-fetoprotein marker. Ann Surg Oncol. 2010, 17: 2518-2525. 10.1245/s10434-010-1038-8.
 22.
Colli A, Casazza G, Massironi S, Colucci A, Conte D, Duca P: Accuracy of ultrasonography, spiral CT, magnetic resonance, and alpha-fetoprotein in diagnosing hepatocellular carcinoma: a systematic review. Am J Gastroenterol. 2006, 101: 513-523. 10.1111/j.1572-0241.2006.00467.x.
 23.
Nicholson JK, Lindon JC, Holmes E: “Metabonomics”: understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data. Xenobiotica. 1999, 29: 1181-1189. 10.1080/004982599238047.
 24.
Van der Greef J, Stroobant P, van der Heijden R: The role of analytical sciences in medical systems biology. Curr Opin Chem Biol. 2004, 8: 559-565. 10.1016/j.cbpa.2004.08.013.
 25.
Geisser S: Predictive Inference: An Introduction. 1993, New York: Chapman & Hall
 26.
Hall MA: Correlation-based Feature Selection for Machine Learning. 1999, Waikato University, Computer Science Department
 27.
Breiman L: Bagging predictors. Mach Learn. 1996, 24: 123-140.
 28.
Smyth G: Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. 2005, New York: Springer, 397-420.
 29.
Tan ZB, Tonks CE, O’Donnell GE, Geyer R: An improved HPLC analysis of the metabolite furoic acid in the urine of workers occupationally exposed to furfural. J Anal Toxicol. 2003, 27: 43-46. 10.1093/jat/27.1.43.
 30.
Shimizu A, Kanisawa M: Experimental studies on hepatic cirrhosis and hepatocarcinogenesis. I. Production of hepatic cirrhosis by furfural administration. Acta Pathol Jpn. 1986, 36: 1027-1038.
 31.
Lord JL, de Peyster A, Quintana PJE, Metzger RP: Cytotoxicity of xanthopterin and isoxanthopterin in MCF-7 cells. Cancer Lett. 2005, 222: 119-124. 10.1016/j.canlet.2004.09.009.
Acknowledgements
This study was supported by the Natural Science Foundation of China (Nos. 81172727 and 81202377). ST was also partially supported by a seed fund from Jilin University (No. 450060491885). We are grateful to two reviewers for their helpful comments and to Catherine Anthony for scientific editing. We especially thank Drs. Margaret MacDonald and Ype De Jong of the Rockefeller University for helpful discussions.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
Conceived and designed the study: ST, JQN. Analyzed the data: ST, CW. Interpreted data analysis and results: HHC, ST, CW, XMW, JJ. Contributed materials/analysis tools: XMW, JQN, JJ. Wrote the paper: ST, HHC, JJ, JQN, CW. All authors reviewed and approved the final manuscript.
Keywords
 Threshold gradient descent regularization (TGDR)
 Multiclass classification
 Metabolic profile
 Hepatocellular carcinoma (HCC)
 Feature selection
 Metabolomics
 Omics data