The experimental data
The study included 30 patients with cirrhotic liver disease (22 infected with HBV and 8 with HCV), 70 patients with HCC (39 with HBV infection and 31 with HCV infection), and 31 healthy volunteers recruited into the metabolic profiling study. All participants provided written informed consent, and the ethics committee of Jilin University approved the study. Detailed descriptions of the study design, experimental procedures, and LC-MS metabolomics data collection are reported in [19].
Pre-processing procedures
Raw data were imported into Databridge (Waters, U.K.) for data format conversion. The resulting NetCDF files were imported into the XCMS software for peak extraction and alignment. The peaks, comprising 384 metabolites (indexed by the combination of m/z and retention time, together with their corresponding peak intensities) across 131 samples, were then exported to an Excel file. The peak intensity values were log-transformed so that the distribution of the transformed intensities for each metabolite was approximately normal. Zeros in peak intensity (corresponding to no detected peak) were replaced by a nominal value (i.e., 0.01) before log transformation, to avoid creating missing values. Several other replacement values (i.e., 0.001, 0.005, 0.02, 0.05) were examined to evaluate whether the choice of nominal value would affect the results, and no difference was found. Finally, the peak intensity values were centered and scaled to have a mean of 0 and a variance of 1. The resulting matrix was used in the two multi-TGDR frameworks for the classification analysis.
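As a concrete illustration, the pre-processing chain (zero replacement, log transformation, centering, and scaling) can be sketched as follows. This is our own minimal NumPy sketch, not the pipeline used in the study (the original analysis was carried out in R); the function name and the `zero_fill` parameter are ours:

```python
import numpy as np

def preprocess(intensity, zero_fill=0.01):
    """Log-transform peak intensities, then standardize each metabolite.

    intensity : (n_samples, n_metabolites) array of raw peak intensities.
    Zeros (no peak detected) are replaced by `zero_fill` before the log
    transformation so that no missing values are created.
    """
    x = np.where(intensity == 0, zero_fill, intensity).astype(float)
    x = np.log(x)            # log transformation
    x -= x.mean(axis=0)      # center each metabolite to mean 0
    x /= x.std(axis=0)       # scale each metabolite to variance 1
    return x
```

Applied to the full data set, this yields the 131 × 384 standardized matrix used as input to the multi-TGDR frameworks.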
Compound identification was achieved by comparing the accurate masses of compounds against the Human Metabolome Database (HMDB version 3, http://www.hmdb.ca).
Methods
Here, we omit the description of binary TGDR; interested readers may refer to the original papers [2, 11] for detailed descriptions. We present two multi-class TGDR frameworks, with emphasis on the specific modifications made to the overall threshold functions to handle the multi-class problem.
Extension to multi-class classification
In the multi-class case, we have a set of C−1 binary variables Y_ci, which are the indicators for class c on subject i (i = 1,…,n, where n is the total number of subjects); i.e., Y_ci equals 1 if the ith subject belongs to class c and 0 otherwise. C is the number of classes (C ≥ 3), and X_i represents the feature values of subject i. Notably, X_i is a vector of length G, so X is an n×G matrix with X_ij the intensity value of feature j (j = 1,…,G) on subject i. The log-likelihood function is defined as
R(β) = Σ_i [ Σ_c Y_ci (β_c0 + X_i β_cᵀ) − log(1 + Σ_c exp(β_c0 + X_i β_cᵀ)) ],   (1)

where the outer sum runs over subjects i = 1,…,n and the inner sums over classes c = 1,…,C−1.
The β_c0 (c = 1,…,C−1) are unknown intercepts that are not subject to regularization, and β_c = (β_c1,…,β_cG) are the corresponding coefficients for the intensity values of the metabolites under consideration. In an 'omics' experiment, most of these betas are assumed to be zero, implying that the corresponding features are non-informative in explaining the differences across classes. In the multi-class case, the threshold function of each feature (i.e., each metabolite in our application) in TGDR needs to be redefined across classes. In previous work [4], we proposed an extension of TGDR, as described below.
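Under these definitions, the log-likelihood in Eq. (1) can be evaluated directly. The following sketch is ours (in Python/NumPy, whereas the actual analysis used R) and applies a numerically stable log-sum-exp to the normalizing term:

```python
import numpy as np

def log_likelihood(X, Y, b0, B):
    """Multinomial log-likelihood R(beta) with C-1 reference-coded classes.

    X  : (n, G) feature matrix
    Y  : (n, C-1) binary indicators, Y[i, c] = 1 if subject i is in class c
    b0 : (C-1,) intercepts (not regularized)
    B  : (C-1, G) coefficient matrix
    """
    eta = X @ B.T + b0                        # (n, C-1) linear predictors
    # log(1 + sum_c exp(eta_ic)) per subject, computed stably
    m = np.maximum(eta.max(axis=1), 0.0)
    logz = m + np.log(np.exp(-m) + np.exp(eta - m[:, None]).sum(axis=1))
    return float((Y * eta).sum() - logz.sum())
```

At β = 0 every class (including the reference class) gets probability 1/C, so the log-likelihood reduces to −n·log(C), a convenient sanity check.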
Method 1
Denote Δv as the small positive increment (e.g., 0.01) used in ordinary gradient descent search, and v_k = k×Δv as the index for the point along the parameter path after k steps. Let β(v_k) denote the parameter estimate of β corresponding to v_k. For a fixed threshold 0 ≤ τ ≤ 1, our proposed TGDR algorithm for multi-class cases is given as follows:
1. Initialize β(0) = 0 and v_0 = 0.

2. With the current estimate β, compute the negative gradient matrix g(v) = −∂R(β)/∂β, with its (c,j)th component denoted g_cj(v).

3. a) Let f_c(v) denote the threshold vector of size G for class c (c = 1,…,C−1), with its jth component calculated as f_cj(v) = I{|g_cj(v)| ≥ τ × max_l |g_cl(v)|}, where I{·} is the indicator function.

   b) Then, the jth feature-specific threshold function is defined as f_j(v) = max_c f_cj(v).

4. Update β(v+Δv) = β(v) − Δv×g(v)×f(v) and update v to v+Δv, where the (c,j)th component of the product of f and g is g_cj(v) × f_j(v).

5. Iterate steps 2–4 K times; the number of iterations K is determined by cross-validation.
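Steps 1–5 can be sketched as follows. This is our own bare illustration of the global framework, assuming the threshold functions above (an indicator on |g_cj| relative to the class-wise maximum gradient, collapsed across classes by a maximum); function and variable names are ours, and the cross-validation of τ and K is omitted:

```python
import numpy as np

def multi_tgdr_global(X, Y, tau=1.0, dv=0.01, K=100):
    """Sketch of multi-TGDR global (steps 1-5 above).

    X : (n, G) standardized feature matrix; Y : (n, C-1) class indicators.
    Returns intercepts b0 (C-1,) and coefficient matrix B (C-1, G).
    """
    n, G = X.shape
    C1 = Y.shape[1]                       # C - 1 indicator columns
    b0 = np.zeros(C1)
    B = np.zeros((C1, G))                 # step 1: beta(0) = 0
    for _ in range(K):                    # step 5: iterate K times
        # step 2: negative gradient g = -dR/dbeta of the log-likelihood
        eta = np.clip(X @ B.T + b0, -500, 500)
        e = np.exp(eta)
        p = e / (1.0 + e.sum(axis=1, keepdims=True))  # P(class c | X_i)
        resid = Y - p                     # (n, C-1)
        g = -(resid.T @ X)                # (C-1, G)
        g0 = -resid.sum(axis=0)           # intercept gradients
        # step 3a: class-specific thresholds f_cj
        absg = np.abs(g)
        fc = absg >= tau * absg.max(axis=1, keepdims=True)
        # step 3b: overall feature threshold f_j = max_c f_cj
        f = fc.any(axis=0)                # (G,) boolean
        # step 4: thresholded gradient descent update
        B -= dv * g * f                   # f broadcasts over classes
        b0 -= dv * g0                     # intercepts: no thresholding
    return b0, B
```

Note that once a feature passes the threshold in any class, the multiplication by f updates its coefficients in every comparison, which is exactly the "global" behavior discussed below.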
As in binary TGDR, all metabolites are assumed to be non-informative at the initial stage. The parameters τ and K are tuning parameters and jointly determine the properties of the estimated coefficients, including which features are selected and the magnitudes of their coefficients. τ can be regarded as a threshold because it determines how the βs are updated in each iteration. Two extreme cases illustrate this: if τ = 0, all coefficients are non-zero for all values of k; if τ = 1, multi-TGDR moves in the direction of only one covariate (if the gradient of an intercept term has the largest absolute value) or two covariates in each iteration, so the non-zero coefficients are few in the early iterations. With increasing k, an increasing number of βs deviate from zero until all of them eventually enter the model. Both τ and K are determined by cross-validation [25].
In this framework, when a feature is selected in one comparison, it also appears in the remaining comparisons, even if it is not informative there. This is analogous to the multivariate regression setting, where the same set of covariates is used for each response even though some of them may not be statistically significant. Alternatively, one may force small estimated coefficients to zero in the last step, in which case the sets of selected features differ across comparisons. This framework is referred to herein as multi-TGDR global.
On the other hand, one may ask why f_j is not set to the minimum of the f_cj's instead of their maximum. In that case, a feature is not updated until its gradients are large enough in all C−1 comparisons, so only features that are informative in every comparison are chosen, yielding a single feature set used to classify all classes simultaneously. This conflicts with the hypothesis, confirmed by Hall [26], that a good feature set consists of features highly correlated with one class but uncorrelated with the others. Moreover, the performance of such a strategy has been shown to be generally less favorable than that of one-versus-one or one-versus-rest binary ensembles [10].
Method 2
Instead of using an overall threshold function for the jth feature, the cth-class-specific threshold function (i.e., f_c(v) from step 3a) is used directly to select features, replacing the overall threshold of the multi-TGDR global framework. Thus, a feature updated in one comparison is not necessarily selected in the other comparisons, and as a result different sets of selected features are associated with different classes. This framework is referred to herein as multi-TGDR local. Figure 1 shows the flowcharts of multi-TGDR global and multi-TGDR local and pinpoints the difference between the two frameworks.
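In code terms, the local variant changes only the thresholding step: the class-specific matrix f_cj multiplies the gradient directly, instead of being collapsed into a single per-feature threshold. A minimal sketch of that thresholding step, with our own names and under the same reconstructed threshold definition:

```python
import numpy as np

def local_threshold(g, tau=1.0):
    """Class-specific threshold matrix f_cj for multi-TGDR local.

    g : (C-1, G) negative-gradient matrix. Returns a boolean matrix of the
    same shape, so an update of the form B -= dv * g * fc touches feature j
    in class c only when |g_cj| is within a factor tau of that class's
    largest absolute gradient.
    """
    absg = np.abs(g)
    return absg >= tau * absg.max(axis=1, keepdims=True)
```

With τ = 1, each class updates only the feature(s) attaining its own maximal gradient, so the selected feature sets can differ across classes.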
In the above two frameworks, we treat τ as a tuning parameter uniform across classes; this can certainly be relaxed so that τ takes a different value for each class, allowing different degrees of regularization for different comparisons. However, for 'omics' data, where the number of features is much larger than the number of samples, in our experience τ = 1 tends to give the most reasonable results. First, it imposes the harshest threshold, yielding the smallest set of selected features. Second, predictive performance may be improved by discarding non-informative or redundant features.
Bagging and Brier score
A bagging procedure [27] was used to reduce the noise from a single run of multi-TGDR, so that a more parsimonious model can be obtained. The benefits of bagging include, but are not limited to, avoidance of over-fitting, improved prediction, and manageable experimental verification. In many applications, e.g., [10], bootstrap resampling/bagging is mainly used to evaluate the stability of a classifier.
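One simple way to carry out such a procedure is to refit the classifier on bootstrap resamples and record how often each feature is selected. The sketch below is our own generic illustration (names and the `fit` callback are ours), not the specific trimmed bagging scheme of [4]:

```python
import numpy as np

def bag_feature_frequency(X, Y, fit, n_boot=100, seed=0):
    """Bagging sketch: refit on bootstrap resamples of the subjects and
    record the fraction of resamples in which each feature receives a
    non-zero coefficient.

    fit : any function (X, Y) -> coefficient matrix of shape (C-1, G).
    """
    rng = np.random.default_rng(seed)
    n, G = X.shape
    freq = np.zeros(G)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample subjects with replacement
        B = fit(X[idx], Y[idx])
        freq += (B != 0).any(axis=0)       # was feature j selected this run?
    return freq / n_boot                   # selection frequency per feature
```

Features with low selection frequency across resamples are candidates for trimming from the final model.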
Besides the traditionally used confusion matrix and misclassification rate, the generalized Brier score (GBS) [7] was also calculated to evaluate the predictive performance of the two multi-TGDR frameworks, absorbing the extra information provided by the estimated posterior probabilities. Additional details on the trimming applied to both bagging and the Brier score for multi-class classification were discussed in a previous study [4].
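A common multi-class form of the Brier score, consistent with the description above, averages over subjects the squared distance between the one-hot class indicator and the estimated posterior probabilities; the sketch below assumes this form (the exact definition used in [7] may differ in scaling):

```python
import numpy as np

def generalized_brier(Y_full, P):
    """Generalized Brier score over all C classes.

    Y_full : (n, C) one-hot class indicator matrix
    P      : (n, C) estimated posterior probabilities (rows sum to 1)
    Lower is better; a perfect probabilistic classifier scores 0.
    """
    return float(((Y_full - P) ** 2).sum(axis=1).mean())
```

Unlike the misclassification rate, this score penalizes confident wrong predictions more than hesitant ones, which is the "extra information" the posterior probabilities provide.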
Statistical language and packages
The statistical analysis was carried out in R version 2.15 (http://www.r-project.org). R code for multi-TGDR is available upon request.