Greedy feature selection for glycan chromatography data with the generalized Dirichlet distribution
 Marie C Galligan^{1, 2}Email author,
 Radka Saldova^{2},
 Matthew P Campbell^{3},
 Pauline M Rudd^{2} and
 Thomas B Murphy^{1}
DOI: 10.1186/1471210514155
© Galligan et al.; licensee BioMed Central Ltd. 2013
Received: 5 March 2012
Accepted: 20 March 2013
Published: 7 May 2013
Abstract
Background
Glycoproteins are involved in a diverse range of biochemical and biological processes. Changes in protein glycosylation are believed to occur in many diseases, particularly during cancer initiation and progression. The identification of biomarkers for human disease states is becoming increasingly important, as early detection is key to improving survival and recovery rates. To this end, the serum glycome has been proposed as a potential source of biomarkers for different types of cancers.
Highthroughput hydrophilic interaction liquid chromatography (HILIC) technology for glycan analysis allows for the detailed quantification of the glycan content in human serum. However, the experimental data from this analysis is compositional by nature. Compositional data are subject to a constantsum constraint, which restricts the sample space to a simplex. Statistical analysis of glycan chromatography datasets should account for their unusual mathematical properties.
As the volume of glycan HILIC data being produced increases, there is a considerable need for a framework to support appropriate statistical analysis. Proposed here is a methodology for feature selection in compositional data. The principal objective is to provide a template for the analysis of glycan chromatography data that may be used to identify potential glycan biomarkers.
Results
A greedy search algorithm, based on the generalized Dirichlet distribution, is carried out over the feature space to search for the set of “grouping variables” that best discriminate between known group structures in the data, modelling the compositional variables using beta distributions. The algorithm is applied to two glycan chromatography datasets. Statistical classification methods are used to test the ability of the selected features to differentiate between known groups in the data. Two wellknown methods are used for comparison: correlationbased feature selection (CFS) and recursive partitioning (rpart). CFS is a feature selection method, while recursive partitioning is a learning tree algorithm that has been used for feature selection in the past.
Conclusions
The proposed feature selection method performs well for both glycan chromatography datasets. It is computationally slower, but results in a lower misclassification rate and a higher sensitivity rate than both correlationbased feature selection and the classification tree method.
Keywords
Compositional data Beta distribution Generalized Dirichlet distribution Variable selection Feature selection Correlationbased feature selection Recursive partitioning Glycobiology Glycan HILIC Chromatography dataBackground
In the statistical literature, a composition is a vector of nonnegative elements that are constrained to sum to a constant. Compositional data are composed of such vectors. They represent parts of a whole and are typically expressed as proportions or percentages. The variables in a composition are often referred to as components. Compositional data arise naturally in many disciplines, such as in plant ecology [1], archaeometry [2], and geology [3]. Notwithstanding this fact, is not uncommon for statistical analysis to be carried out without regard to the compositional nature of the data. The constantsum constraint on the data restricts the sample space to a simplex and also induces spurious correlation between components [4], with the result that traditional statistical methods such as multivariate analysis of variance (MANOVA), pairwise correlations, and discriminant analysis are not directly suitable for these data.
Aitchison [3] provides great insight into the special considerations required in compositional data analysis, advocating the use of a logratio approach. This has met with much success, in the statistical and geological communitites in particular. Others have since built on his work, making available a collection of methods that are easily accessible for compositional data analysis.
We propose a feature selection method for compositional data. Notably little research appears to have been conducted into feature selection for compositions to date. This methodology was developed with a specific application in mind; feature selection for hydrophilic interaction liquid chromatography (HILIC) data from glycan analysis.
Glycans are complex sugar chains that are present in all cells. They can exist either in free form or are covalently bound to other macromolecules, such as proteins or lipids [5]. The diversity and complexity of these structures means that they have a broad range of functions, playing a structural role as well as being involved in most physiological processes [5]. Glycosylation is important in the growth and development of a cell, tumour growth and metastasis, immune recognition and response, anticoagulation, communication between cells, and microbial pathogenesis [6]. Glycans are generally attached to proteins through a nitrogen atom (Nglycans) or an oxygen atom (Oglycans).
Glycobiology has great potential for biomarker discovery, as it has been relatively unexploited in comparison with genomics and proteomics [7]. Alterations in the glycosylation profiles of proteins have been observed during the pathogenesis of many different diseases; including cancer, congenital disorders of glycosylation and inflammatory conditions such as rheumatoid arthritis and schizophrenia [8].
A feature selection methodology for these data would provide a useful tool for biomarker research. One reason for this is that it could reduce the time and cost associated with further analysis. To identify the exact glycan structures corresponding to each chromatographic peak, further experimental analysis is required. To reduce the expense incurred from the addition of costly enzymes to the sample and the time required for detailed quantitative analysis, it would be extremely beneficial to be able to select a smaller subset of seemingly informative peaks for further refinement.
The level of refinement of the profile (the number of chromatographic peaks) is dependent on experimental conditions. The datasets demonstrated in this paper are from profiles consisting of 17 (lung cancer data) and 24 (prostate cancer data) chromatographic peaks. The dimensionality of glycan chromatography datasets is expected to increase in the future, as more advanced techniques have already become available [10]. For the purposes of biomarker discovery, it will become more important to have a methodology available for selecting subsets of chromatographic peaks that differ between control/disease groups.
Galligan et al [11] compared three suitable models for the classification of glycan chromatography data and found that modelling the data using the logratio approach of Aitchison [3] gave satisfactory results. A disadvantage is that fitting this model to compositional data requires transformation of the data, making interpretation of the model difficult in terms of the raw data (the glycan peaks). Proposed here is a feature selection methodology based on Connor and Mosimann’s generalized Dirichlet distribution [12] and its marginal, the beta distribution. This is an extension of the Dirichlet distribution that has almost double the number of parameters and allows for more flexible modelling of compositional data. The Dirichlet class is a natural parametric family for modelling data in a simplex sample space and therefore, no data transformation is required. There has been much interest in this class of models, and several other extensions of the ordinary Dirichlet distribution have been explored, such as the hyper Dirichlet [13] and the nested Dirichlet [14] distributions. The Dirichlet class has also been used for fitting regression models [15, 16] and time series models to compositional data [17]. In addition, Wang et al. [18] proposed a dimension reduction technique for compositional data, using properties of the Dirichlet distribution. They project compositional data onto a lower dimensional simplex space, finding an “optimal” projection, defined as that which maximizes the estimated Dirichlet precision on the reduced data. A major advantage of modelling compositional data using the Dirichlet distribution is that transformation of the data is not required, hence, the results are directly interpretable in terms of the original variables. This is a desirable property for feature selection, as features can be directly selected from the model.
Raftery and Dean [19] propose a methodology for variable selection integrated with modelbased clustering. They use the headlong search strategy proposed by Badsberg [20] to search over the feature space. They add and remove features during the modelbuilding process using a comparison of the Bayesian Information Criterion (BIC) [21] for proposed models. Murphy, Dean, and Raftery [22] extend this variable selection methodology for use with supervised learning problems, specifically with modelbased discriminant analysis. These approaches formulate the problem of variable selection as a modelselection problem, whereby the features and the appropriate model are selected simultaneously.
We propose a generalized Dirichlet feature selection (GDFS) method, that is an adaptation of the above methods that is suitable for use with compositional data in a supervised learning problem. This method could also be easily adapted for use with unsupervised classification methods, such as modelbased clustering. A greedy search algorithm traverses the feature space, selecting a “grouping model” at each step from a set of target generalized Dirichlet models, using the BIC for model selection and including a backwards step in the algorithm to avoid getting trapped at local maxima. At each iteration, a set of chromatographic peaks are selected as the current optimal set of “grouping variables”. Convergence is declared when no further proposed changes in the current set of selected features are accepted. The selected features are those peaks that appear to contain information about the group structure in the data, and further experimental analysis could be carried out to identify the glycan structures corresponding to these chromatographic peaks.
This method is applied to two glycan chromatography datasets; from the lung cancer study conducted by Arnold et al. [23] and from the prostate cancer study of Saldova et al. [24].
The GDFS method is compared with two wellknown feature selection techniques: correlationbased feature selection (CFS) developed by Hall [26] and a recursive partitioning method (rpart) for the construction of classification trees, developed by Breiman et al. [27]. Neither method makes implicit assumptions about the distribution of the data, so both are suitable for use with compositions. Recursive partitioning builds a classification tree using a selected subset of features. It is a nonparametric method that has been used in the past for feature selection in compositional data [2, 28]. Correlationbased feature selection, widely used in the machine learning community and elsewhere [29, 30], is applied to a discretized form of the data. It involves a best first search over the feature space to select the set of features with the highest “merit”, a heuristic used to measure the predictive ability of a feature subset.
Methods
Described in detail here is the proposed statistical methodology for feature selection in compositional data. This includes an introduction to the Dirichlet, beta, and generalized Dirichlet distributions, algorithmic details of the GDFS method for feature selection, a brief discussion of the two feature selection methods used for comparison and a description of the statistical classification methods employed for model validation.
Relevant information is also provided on the two glycan chromatography datasets used to test the proposed statistical methodology, along with analytical details on the glycan analysis used to collect these datasets.
Statistical methods
The generalized Dirichlet distribution
Connor and Mosimann [12] propose the generalized Dirichlet distribution as a more flexible extension of the ordinary Dirichlet distribution for modelling compositional data with a unitsum constraint. This section introduces the Dirichlet distribution, followed by a description of how the Dirichlet is extended to obtain the Generalized Dirichlet model.
is the Gamma function.
Further details on parameter estimation are given in the next subsection.
are mutually independent. The generalized Dirichlet distribution results from making the additional assumption that the marginal distributions of the elements of $\stackrel{~}{\mathbf{Y}}$ are beta distributions. Note that the last component of $\stackrel{~}{\mathbf{Y}}$ is degenerate since it is equal to one.
where B(α_{ j },β_{ j })=Γ(α_{ j })Γ(β_{ j })/Γ(α_{ j }+β_{ j }) is the beta function, s_{i,j−1} is the sum of the first j−1 compositional components for observation i, and $\prod _{j=1}^{p1}1/(1{s}_{i,j1})$ is the Jacobian term resulting from the change of variable. For a full derivation of this probability density function, please refer to Appendix B. Derivation of the generalized Dirichlet probability density function. In the special case where β_{j−1}=α_{ j }+β_{ j } for j=1,2,…,p−1 and writing α_{ p }=β_{p−1}, this model simplifies to the ordinary Dirichlet distribution given by Equation 1.
where θ=(α_{1},β_{1},…,α_{p−1},β_{p−1}) is the generalized Dirichlet parameter vector.
Note that the ordering of generalized Dirichlet components is important. A particular ordering of compositional variables may be completely neutral, while another ordering of the same variables may not be [12]. Therefore, if a compositional vector (Y_{1},Y_{2},Y_{3}) follows a generalized Dirichlet distribution, a permutation of its components such as (Y_{2},Y_{1},Y_{3}) may not.
The generalized Dirichlet model is more intuitive when viewed as a tree structure. This is well explained by Null [14], who relates the generalized and nested Dirichlet distributions. Representing a generalized Dirichlet random vector as a tree structure, the compositional components are assigned to be leaves in the tree and a set of p−2 interior nodes are introduced. Each “nest” in the tree comprises of a leaf node (or original compositional component) and an interior node (or “nesting variable”) whose value is the sum of the leaf nodes nested below (or equivalently, one minus the sum of leaf nodes not nested beneath). The first component of the generalized Dirichlet vector is at the top of the tree structure, and successive components are nested underneath. For example, the third component is nested under the second. The nest at the bottom level of the tree consists of two leaf nodes only. The variables in each nest are beta distributed, conditional on the value of the parent (interior) node above.
and the Jacobian term 1/(Y_{2}+Y_{3}) for the change of variable.
and the Jacobian term 1/(Y_{1}+Y_{2}).
Maximum likelihood estimation for the generalized Dirichlet distribution
The maximum likelihood estimates for a generalized Dirichlet distribution with p components are obtained via the estimation of parameters for the p−1 independent beta distributions from which the probability density function is comprised. As mentioned in the previous section, parameter estimates for the beta distribution can be obtained in the same manner as those for a Dirichlet distribution, since the beta distribution is a special case of the Dirichlet distribution.
Since maximum likelihood estimates for a Dirichlet distribution cannot be obtained in closed form, the fixedpoint iteration method proposed by Minka [31] is used here to numerically approximate the beta MLEs.
A tolerance level of 0.003 was used here, as it appeared sufficient for convergence upon examination of log likelihood sequences.
The parameter vector for a generalized Dirichlet distribution for a composition with p components is written (α_{1},β_{1},α_{2},β_{2},…,α_{p−1},β_{p−1}), where each pair (α_{ j },β_{ j }) are the parameters for the beta distribution of the jth nest in the tree structure, with j=1 corresponding to the top level and j=p−1 corresponding to the bottom level.
Feature selection using the generalized Dirichlet distribution
This section describes the GDFS method for compositional data. Let Y=(Y_{1},Y_{2},…,Y_{ p }) denote a unitsum compositional random vector. Let Z be a random variable indicating the group to which an observation Y=y_{ i } belongs, so that z_{ i }=g if y_{ i } belongs to group g. Components in Y that contain group information will therefore be dependent on Z.
 1.
Y ^{(c)}: variables that contain group information (that are currently in the grouping model)
 2.
Y ^{(o)}: variables that do not contain group information (that are currently omitted from the grouping model)
 3.
Y ^{(p)}: the proposal variable (the variable proposed for addition or removal from the current grouping set, Y ^{(c)})
When proposing to add (remove) Y^{(p)} to (from) the grouping model, two models can be considered; the first is a model where part (i i) of Equation (20) depends on Z, indicating that Y^{(p)} contains group information. The second is a model where the density function in part (i i) does not depend on Z, indicating that the proposal variable, Y^{(p)}, does not contain group information and should be excluded from the grouping set.
which is independent of Y^{(c)} since (Y^{(c)},Y^{(p)},Y^{(o)}) is completely neutral. Therefore, the density of the proposal variable Y^{(p)} is the product of a beta distribution and the Jacobian term, 1/(1−S^{(c)}).
Interestingly enough, the notion of partitioning variables into independent subspaces of components has previously been considered in independent subspace analysis (ISA), which has been applied to feature extraction problems in the past [37].
Proposal to add a component to Y^{(c)}
At every iteration of the greedy search algorithm, it is proposed to add a component to the grouping set, considering each of the currently omitted components. The decision of whether a proposed component Y^{(p)} contains group information is made by comparing a grouping and nongrouping model for Y^{(p)}. In the grouping model, Y^{(p)} is dependent on Z, and in the nongrouping model it is not.
The parameters for the grouping model are group dependent and must be estimated separately for each group.
If the grouping model for the proposal variable provides a better fit than the nongrouping model, then the proposal variable should considered for addition to the grouping set, Y^{(c)}. Note that if it is added to Y^{(c)}, it should be added to the end of Y^{(c)} rather than the beginning, to indicate that it is nested underneath variables that were added before it. This is necessary for the model to be a generalized Dirichlet distribution, considering the specified grouping and nongrouping model structure.
Proposal to remove a component from Y^{(c)}
A proposal to remove a variable from the grouping model is also included at each iteration. This could potentially reduce the possibility of getting stuck at a local maxima. The decision of whether to remove a proposed component Y^{(p)} from the grouping set Y^{(c)} is made by comparing a grouping and nongrouping model for Y^{(p)}. In the grouping model, Y^{(p)} depends on the group information vector Z, and in the nongrouping model it does not.
For the remove step, the “grouping model” considered is the generalized Dirichlet model fitted to the current set of grouping variables, denoted here as Y^{(c,p)}. This notation is used here to indicate that Y^{(p)} is included amongst the grouping set, in the ordering specified by the currently fitted generalized Dirichlet model for the grouping set. It differentiates from the notation (Y^{(c)},Y^{(p)}), which indicates that component Y^{(p)} is definitely at the end of the vector (i.e. in the bottom nest of the generalized Dirichlet tree structure). If Y^{(c,p)} follows a generalized Dirichlet distribution, this does not imply that (Y^{(c)},Y^{(p)}) is also generalized Dirichlet distributed. When proposing to remove a component from the grouping set, the generalized Dirichlet model fitted to Y^{(c)} must also be considered in the comparison of grouping and nongrouping models. This is because removing the proposal variable from the generalized tree structure could result in a different generalized Dirichlet tree structure for the set of grouping variables.
and the probability density functions for each are calculated from Equation 26. For the set of grouping variables, the parameter vector is indexed by g to indicate that they are estimated separately for each group g. At every remove step, each of the components currently in the grouping model are considered as remove proposals. If the grouping model provides a better fit than the nongrouping model, this can be considered as evidence for retaining the proposal variable in the grouping set. If the converse is true, there is evidence for removing the proposal variable from the grouping set.
Selected feature model
When the partition {Y^{(c)},Y^{(o)}} is found that is considered to be optimal, the “grouping model” is the generalized Dirichlet model currently fitted to Y^{(c)}. Note that this is equivalent to fitting a generalized Dirichlet model to (Y^{(c)},1−S^{(c)}), since the component 1−S^{(c)} is degenerate (it is equal to the sum of the omitted variables).
The parameters for this grouping model should be estimated separately for each group, since these are the components that are considered to be dependent on Z.
Bayesian Information Criterion (BIC)
where $\ell (\widehat{\alpha},\widehat{\beta};\mathbf{y})$ is the log likelihood function given in Equation 5, evaluated at the maximum likelihood estimates of the parameters, $\alpha =\widehat{\alpha}$ and $\beta =\widehat{\beta}$. The BIC prevents model overfitting by using a penalty for model complexity, number of parameters× logn. In comparing two models, that with the larger BIC is preferable.
A positive value for BIC_{diff} provides evidence in favour of grouping model, M_{ G R }, over the nongrouping model, M_{ N G R }. The larger the difference in BIC, the more statistical evidence there is in favour of including Y^{(p)} in the set of grouping variables.
where $\ell (\widehat{\mathit{\theta}};\mathbf{y})$ is the log likelihood function of the generalized Dirichlet distribution given by Equation (9), evaluated at $\mathit{\theta}=\widehat{\mathit{\theta}}$, the maximum likelihood estimate for the parameter vector, 2(p−1) is the number of estimated parameters, and n is the number of samples. This BIC can be used to compare generalized Dirichlet distributions fitted to different orderings/permutations of the grouping variable set Y^{(c)}, as outlined in the following section.
Algorithm outline
 1.
INITIALIZATION: Initially assign all variables to the nongrouping set, and then add a single variable to the grouping set. The decision of which variable to add is made via a comparison of BIC differences for grouping and nongrouping models for each variable. The variable with the maximum BIC difference is added to the grouping model. If all of the BIC differences are negative, the variable with the least negative BIC difference is added. Add a second variable to the grouping model in a similar manner. If this second add move is not made, the algorithm will terminate after the first iteration if the BIC difference was negative for the first variable added (as the variable will be removed and no further add moves will be made).
 2.
ADD STEP: Propose to add a variable to the grouping model. The decision of whether to add a variable to the grouping model is made via a BIC comparison for grouping and nongrouping models for each variable in Y ^{(o)}, the nongrouping set. If any of the BIC differences for these models is positive, add the variable with the largest positive BIC difference to the grouping set. A positive BIC difference provides evidence that a variable contributes group information to the model. If all BIC differences are negative, reject the proposal to add a variable to the grouping model. If the proposal to add a variable is accepted, this variable is added to the end of Y ^{(c)}. This means that it will be located in the bottom nest of the generalized Dirichlet tree structure fitted to the grouping variables.
 3.
REMOVE STEP: Propose to remove a variable from the grouping model. The decision of whether to remove a variable is made via a BIC comparison for grouping and nongrouping models for each variable currently included in the grouping set, Y ^{(c)}. A negative BIC difference provides evidence that a variable does not contribute group information to the model. If the BIC difference is negative for any of these variables, remove the variable with the largest negative BIC difference from the grouping set, and add it to the nongrouping set Y ^{(o)}. If all BIC differences are positive, reject the proposal to remove a variable from the grouping model.
 4.
PERMUTE STEP: If there are two or more variables in the grouping model, propose to permute order of the components in Y ^{(c)}. Permuting the order of Y ^{(c)} will change the generalized Dirichlet tree structure and will result in a different generalized Dirichlet model for the set of grouping variables. Set MAXPERM to be the maximum number of permutations to be considered at any iteration. Used here was a maximum of 60 permutations. Setting a maximum is necessary for computational efficiency, because if there are m variables in the grouping set, the number of possible permutations is m! and increases quickly as more variables are added to the grouping set. The number of permutations, NPERM, considered at a particular iteration is defined as the minimum of m! and MAXPERM. Calculate the BIC for the currently fitted generalized Dirichlet model, and then fit generalized Dirichlet models to NPERM randomly generated permutations of the grouping variables, Y ^{(c)}. Let the permutation with the largest BIC be the proposal model. If the proposal model has a larger BIC than the current generalized Dirichlet model, let the proposal model be the current model for the grouping variables. As an example, if (Y _{2},Y _{4}) is the current grouping set, evaluate the BIC of the current generalized Dirichlet model, fitted to (Y _{2},Y _{4},1−Y _{2}−Y _{4}). Consider the permutation (Y _{4},Y _{2}). Evaluate the BIC of a generalized Dirichlet distribution fitted to (Y _{4},Y _{2},1−Y _{2}−Y _{4}). If this is larger than the current BIC, then let the current grouping set be (Y _{4},Y _{2}).
 5.
TERMINATION: Iterate over steps 2 to 4 until an add, remove, and permute proposal are rejected in succession. The selected components Y ^{(c)} at this point, and their selected ordering, are the optimal feature set to be returned from the algorithm.
Competing feature selection methods
Two alternative methods for feature selection were applied to the glycan chromatography data for comparison with the proposed GDFS method. The first is the correlationbased feature selection (CFS) algorithm developed by Hall [26], while the second is a classification tree method developed by Breiman et al. [27]. These methods do not make implicit assumptions about the distribution of the data, and so they are both suitable for compositional data analysis. A brief outline of each method is provided here.
is the conditional entropy of Y given X. Cover and Thomas [39] provide a comprehensive review of information theory, including a detailed discussion on entropy. Here, p(yx) is the proportion of observations observed at level y of Y, within level x of X.
where $\overline{{r}_{\mathrm{ff}}}$ is the average symmetrical uncertainty between all pairs of features in S, $\overline{{r}_{\mathrm{cf}}}$ is the average symmetrical uncertainty between the features and the class variable, and k is the number of features in S. The idea behind using the merit as a heuristic for feature selection, is that a “good” set of features will be highly correlated with the class variable, but not highly correlated with each other.
The bestfirst search algorithm starts at an empty node (corresponding to an empty feature set). Predecessors of the current node, S, are the set of nodes generated by adding each of the currently omitted features to the feature set at the current node. The algorithm terminates when five consecutive fully expanded nodes have provided no improvement in the merit score, or when all nodes have been visited (typically only happens where the feature space is of low dimensionality). In the case where the merit score does not improve from zero, an empty feature set should be returned. Pseudocode for the correlation based feature selection algorithm is provided in Appendix C. Construction of classification trees using recursive partitioning. Further technical details are given by Hall [26].
To obtain measures of classification performance for the chosen feature set, Dirichlet distributions are fitted to each group, using the selected feature set. Posterior probabilities are calculated using a maximum a posteriori (MAP) classification rule. Further details are provided in the section below.
rpart: Baxter and Jackson [2], Ranganathan and Borges [1]], and Vermeesch [[28] use classification tree methods for compositional data analysis. The method used here is the same as that applied by Baxter and Jackson [2], a recursive partitioning algorithm developed by Breiman et al. [27]. Model fitting is carried out by the rpart package in R [40]. A brief summary of the methodology employed by rpart is included in Appendix B. Derivation of the generalized Dirichlet probability density function. More technical details of the recursive partitioning algorithm and the software implementation may be found in the technical report by Therneau and Atkinson [41].
Classification and selection bias
Each of the feature selection methods described above choose a set of “grouping” features and return a set of posterior group probabilities for the observations in Y. Statistical classification is used to measure how well a selected feature set separates the set of known groups, to determine whether the feature selection algorithm has chosen a “good” feature set. Maximum a posteriori (MAP) classifications, calculated from the selected feature set, are used assign observations to groups.
 1.
Observation j (test data) is omitted from the data, and feature selection is carried out on the remaining observations (training data).
 2.
For each group of observations in the training data, a generalized Dirichlet distribution is fitted to the selected feature set, Y ^{(c)}.
 3.Using the fitted generalized Dirichlet models fitted to Y ^{(c)} for each group defined by Z, posterior group probabilities are calculated for observation i using Bayes rule:$\begin{array}{ll}\phantom{\rule{23.0pt}{0ex}}P\left(g\left\right.{\mathbf{y}}_{i}^{\left(c\right)}\right)\phantom{\rule{0.3em}{0ex}}& =\phantom{\rule{0.3em}{0ex}}\frac{{\tau}_{g}\phantom{\rule{2.77626pt}{0ex}}f({\mathbf{y}}_{i}^{\left(c\right)}\left\right.{z}_{i}=g)}{\sum _{j=1}^{G}{\tau}_{j}\phantom{\rule{2.77626pt}{0ex}}f({\mathbf{y}}_{i}^{\left(c\right)}\left\right.{z}_{i}=j)}\text{for}g=1,2,\dots ,G\phantom{\rule{2em}{0ex}}\end{array}$(35)
where G is the number of groups in the data.
 4.Observation i is classified to some group g using the MAP classification rule, so observation y _{ i } is assigned to group g if$\begin{array}{l}g={\text{argmax}}_{j}P\left(j\left\right.{\mathbf{y}}_{i}^{\left(c\right)}\right)\text{for}\phantom{\rule{0.3em}{0ex}}j=1,2,\dots ,\mathrm{G.}\end{array}$(36)
Because the glycan chromatography datasets are from observational studies, the number of patients in each group is not representative of the general population, and so we assume here that the prior probability of group membership is equal across all groups. Selection bias is avoided, since observation j is not used in the selection of the feature set that it is classified under.
The above procedure is repeated for each observation i. The crossvalidated error rate is then calculated as the proportion of observations that were incorrectly classified by the above method.
For correlationbased feature selection, classifications are obtained in the same manner, except that the assumption is made that the selected grouping features are distributed according to a Dirichlet distribution, rather than a generalized Dirichlet distribution. Then in step 2, a Dirichlet distribution is fitted to the grouping features with observation i omitted, and at steps 3, the probability density functions in Equation 35 correspond to the Dirichlet rather than generalized Dirichlet distributions fitted to each group.
For the recursive partitioning (rpart) method the same steps for the classification of observations are used, with the exception of steps 2 and 3. Posterior probabilities for a classification tree are defined within the tree construction process. Each leaf in the tree has an associated set of posterior probabilities for each group, corresponding to the proportions of observations in the training data that belonged to each group, that were classified to that leaf node. Posterior group probabilities are obtained for a new observation by dropping it down the tree until it reaches a leaf node. The posterior group probabilities for that observation are the class proportions assigned to that leaf during the building of the tree. These probabilities are used in place of those obtained from steps 2 and 3 in the above algorithm.
Measures of classification performance
Classification results for each feature selection method are reported via a crosstabulation of the true and predicted group memberships. Also included are the following measures of classification performance:
Crossvalidation error
the proportion of observations incorrectly classified, calculated by the proportion of observations on the offdiagonal of the confusion matrix.
kappa
Cohen’s kappa statistic [43] is another measure of class agreement, recording the proportion of observations correctly classified, corrected for classification by chance. It is calculated as κ=(O−E_{ c h a n c e })/(1−E_{ c h a n c e }), where O is the observed or actual proportion of observations correctly classified and E_{ c h a n c e } is the expected proportion of observations that would be classified correctly by chance. If all observations are correctly classified, then κ=1. If the classification performance is no better than what one could expect by chance, κ≤0.
Sensitivity
the proportion of true positives. In assessing the diagnostic accuracy of a test, the sensitivity is measured by the proportion of disease cases correctly diagnosed by the test. For life threatening diseases, a test with high sensitivity is vitally important.
ROC curves
ROC curves allow for the visualization of the true positive rate (sensitivity) against the false positive rate (1  specificity) of a classifier, where the probability threshold for classification is varied over the interval [0,1]. ROC curves and their corresponding AUC (area under the ROC curve) values are commonly reported in the biological sciences, so these are also included as performance measures for each of the feature selection methods.
AUC:
area under a ROC curve. Values range between 0 and 1, with larger values indicating better classification performance. Fawcett [44] gives a useful interpretation of the AUC, as being equivalent to the probability that a randomly chosen disease case will be ranked higher than a randomly chosen control, by the classifier.
Software
All statistical analyses were carried out using R version 2.13 [40]. ROC curves were constructed and AUC values estimated using the ROCR package in R [45], while classification trees were fitted using the rpart package.
Nglycan analysis
Nglycan analysis was carried out on two datasets, from a lung cancer study and a prostate cancer study. Samples were obtained with ethical consent from their respective sources. The glycan analysis was carried out using HILIC. Details on experimental conditions are provided below.
Lung cancer serum samples
Serum samples from preoperative patients diagnosed with lung cancer and cancerfree healthy volunteers were obtained from Fox Chase, Cancer Center, Philadelphia, USA under IRB approved protocols. They were from both males and females. Patient sera (20 from each stage  I, II, IIIA, IIIB, IV) were examined alongside 84 agematched control sera from donors who did not have cancer.
Nglycan analysis was carried out by HILIC flouresence using a 60minute method. The glycan HILIC profiles produced were integrated over a set of 17 glycan peaks, resulting in a 17 part compositional vector for each observation. An example of one such glycan HILIC profile is shown in Figure 1. Further details on the analysis may be found in Arnold et al. [23].
Prostate cancer serum samples
Samples were collected with consent from prostate cancer patients before undergoing radical prostatectomy and from men with benign prostate hyperplasia (BPH) following a standard operating procedure, which is part of the Prostate Cancer Research Consortium BioResource. Ethical consent was granted from respective Hospital ethics committee of the consortium. Blood samples (10 mL) were collected into anticoagulantfree tubes. Samples were coded and transported on ice to the laboratory. The tubes were centrifuged at 2500 rpm at 20°C for 10 min. within a 30 min. time frame. Serum from each patient sample was then collected, aliquoted, and stored at 80°C until time of analysis. Each serum sample underwent no more than three freeze/thaw cycles prior to analysis. Nglycan analysis was carried out by HILIC fluorescence using a 60 minute method. The glycan HILIC profiles produced were integrated over a set of 24 glycan peaks, resulting in a 24 part compositional vector for each observation in our data. An example of one such glycan HILIC profile is shown in Figure 2. Further details on the data collection and analysis may be found in Saldova et al. [24].
Nglycan analysis for patients with lung and prostate cancer
Nglycans were released from serum using the highthroughput method described by Royle et al. [9]. Briefly, serum samples were reduced and alkylated in 96well plates, and then they were immobilized in SDSgel blocks and were washed. The Nlinked glycans were released using peptide Nglycanase F (1000 U/mL; EC 3.5.1.52) as described previously [46, 47]. Glycans were fluorescently labeled with 2aminobenzamide (2 AB) by reductive amination [46] (LudgerTag 2AB labeling kit LudgerLtd., Abingdon, UK). HILIC was performed using a TSKGel Amide80 column (Anachem, Luton, Bedfordshire, UK) on a 2695 Alliance separation module (Waters, Milford, MA) equipped with a Waters temperature control module and a Waters 2475 (lung cancer data) or 474 (prostate cancer data) fluorescence detector. Solvent A was 50 mM formic acid which was adjusted to pH 4.4 with ammonia solution. Solvent B was acetonitrile. The column temperature was set to 30°C. Gradient conditions were as follows: 60 min. method  a linear gradient of 35 to 47% solvent A over 48 min. at a flow rate of 0.8 mL/min, followed by 1 min. at 47 to 100% A and 4 min. at 100% A, returning to 35% A over 1 min., and then finishing with 35% A for 6 min. [9]. Samples were injected in 80% (lung cancer data) or 65% (prostate cancer data) acetonitrile. Fluorescence was measured at 420 nm with excitation at 330 nm. Royle et al. [9] described the Nglycosylation in human serum in detail and showed that there are 117 Nglycans present.
This method enables the analysis of glycan isoforms based on sequence and linkage (for example, core α16 fucosylation can be distinguished from α13 linked outer arm fucosylation). Glycan size and linkage result in a specific elution position that can be converted to glucose units (GUs) using a dextran hydrolysate standard ladder [9].
Glycan HILIC peaks were integrated and relative peak areas calculated using the Waters Empower 3 chromatography data software. Thus, each serum sample generates a data observation, consisting of the set of relative proportional peaks areas from a glycan HILIC profile.
Results and discussion
The proposed GDFS method was applied to the lung and prostate cancer datasets. The results are compared with those of two wellestablished feature selection methods; correlation based feature selection and recursive partitioning (rpart).
For the lung cancer dataset, the group structure is redefined as control versus cancer. All three feature selection methods perform reasonably well. The GDFS method gives the best performance with a classification rate of approximately 75% and an AUCvalue of 0.83.
Two different group structures are considered for the prostate cancer dataset. Feature selection was carried out on the data, with cases grouped as control or prostate cancer, to determine whether any features could have diagnostic value for prostate cancer. However, none of the feature selection methods were successful at classification. CFS chooses no features most of the time, while the other two methods produce feature selection with very poor classification performance.
The other research question of interest for the prostate cancer dataset was whether glycosylation could be used as a marker of disease progression. Thus, feature selection was also applied to the prostate cancer samples, classified as into Gleason 5 and Gleason 7 cases. A Gleason score of 7 indicates a more advanced cancer of the prostate.
The results from these analyses are shown here. Following the discussion of these feature selection results is a note on the computational complexity of the GDFS method used and its behaviour in moving to higher dimensions.
Lung cancer data
Jemal et al. [48] reported that lung cancer is the most common cancer globally, responsible for 1.4 million deaths each year. It has a very poor 5year survival rate of 816%, that is mainly attributable to the disease only presenting symptoms when it reaches an advanced stage [49]. Early stage detection of lung cancer could greatly improve the outlook of patients. Ghosal et al. [49] highlight that, in an attempt to reduce the mortality rates of this disease, much research has been carried out in the area of lung cancer screening and biomarker discovery. Serum biomarkers would provide a noninvasive method for cancer diagnosis. However, although a number of potential biomarkers have been identified, none todate seem to have adequate sensitivity, specificity or reproducibility to be used in clinical diagnostics.
Arnold et al. [23] conducted a study to investigate alterations or irregularities that occur in the serum Nglycome of lung cancer patients. The main objective was to identify a set of glycan structures that have biomarker potential.
Feature selection was carried out on the glycan chromatography dataset from this study using the proposed GDFS method, as well as two competing methods, CFS and rpart. The results are compared here.
Feature selection for lung cancer data
Feature selection for lung cancer data
GDFS  CFS  rpart  Predominant glycans(GDFS method)  

Peak 1  ✗  ✗  ✗  
Peak 2  ✓  ✗  ✗  A2 
Peak 3  ✓  ✗  ✗  FA2 
Peak 4  ✓  ✗  ✗  FA2B, A2[3]G1, A2[6]G1, M5 
Peak 5  ✓  ✓  ✗  FA2[3]G1, FA2[6]G1, FA2[3]BG1, FA2[6]BG1 
Peak 6  ✓  ✓  ✗  A2G2, A2BG2, A2[3]G1S1, A2[6]G1S1 
Peak 7  ✓  ✓  ✗  FA2G2, FA2BG2,FA2[3]G1S1, FA2[6]G1S1 
Peak 8  ✓  ✗  ✗  A2G2S1, A2BG2S1 
Peak 9  ✓  ✓  ✗  A3G3S2, A3BG3S2, A2F1G2S2 
Peak 10  ✗  ✗  ✗  
Peak 11  ✗  ✗  ✗  
Peak 12  ✓  ✓  ✓  A3G3S2, A3BG3S2, A2F1G2S2 
Peak 13  ✗  ✗  ✗  
Peak 14  ✓  ✓  ✗  A3F1G3S3 
Peak 15  ✗  ✗  ✗  
Peak 16  ✗  ✗  ✗  
Peak 17  ✓  ✓  ✗  A4G4LacS4, A4F2G3S4 
Lung cancer data classifications
GDFS  CFS  rpart  

Control  Cancer  Control  Cancer  Control  Cancer  
True groups  Control  69  15  66  18  74  10 
Cancer  32  68  40  60  39  61 
Lung cancer data classification performance
Crossvalidation error  Kappa  Sensitivity  AUC  

GDFS  0.255  0.493  0.680  0.830 
CFS  0.315  0.378  0.600  0.757 
rpart  0.266  0.478  0.610  0.562 
Prostate cancer data
Jemal et al. [48] observed that, globally, prostate cancer is the second most frequently diagnosed cancer in males and is the sixth most common cause of cancer death in males, based on figures from 2008. Prostate cancer is one of the most commonly diagnosed cancers in men. Prostate specific antigen (PSA) is a glycoprotein that is currently used as a clinical biomarker for this disease, but this glycoprotein is lacking in sensitivity and specificity. In fact, the U.S. Preventative Services Task Force (USPSTF) have recently issued a draft recommendation against PSA screening [52], after concluding that PSAbased screening for prostate cancer results in little or no reduction in the prostate cancerspecific mortaility. They also suggest that the screening may do more harm than good, due to the harms associated with evaluations or treatment carried out subsequent to screening. Several other potential biomarkers for this disease have been identified, but none that appear to be sensitive or specific enough for clinical use. Thus, there is an urgent need for further developments in this area.
Saldova et al. [24] conducted a study to investigate whether patterns of glycosylation are useful in differentiating between cases of prostate cancer and benign prostate hyperplasia (BPH). BPH is an enlargement of the prostate gland and is very common in men, especially as they age. BPH can present similar symptoms to prostate cancer and is also associated with elevated PSA levels. It would be extremely useful to identify a biomarker that can distinguish between these conditions. The study by Saldova et al. [24] was carried out using 34 prostate cancer cases (consisting of 17 cases with Gleason score 5 and 17 cases with Gleason score 7) and 13 men with BPH. The Gleason score is a currently used measure of disease severity. It ranges from 2 to 10, with a higher score indicating a more advanced stage of disease.
Variable selection for prostate cancer data  cancer vs. BPH
Variable selection for prostate cancer data (prostate cancer vs. BPH)
GDFS  CFS  rpart  Predominant glycans (GDFS method)  

Peak 1  ✗  ✗  ✗  
Peak 2  ✗  ✗  ✗  
Peak 3  ✗  ✗  ✗  
Peak 4  ✗  ✗  ✗  
Peak 5  ✗  ✗  ✗  
Peak 6*  ✗  ✗  ✗  FA2[3]G1, FA2[6]BG1 
Peak 7  ✗  ✗  ✗  
Peak 8  ✗  ✗  ✗  
Peak 9  ✗  ✗  ✗  
Peak 10  ✓  ✗  ✗  FA2G2, FA2[6]G1S1, FA2[6]BG1S1 
Peak 11  ✗  ✗  ✗  
Peak 12  ✗  ✗  ✗  
Peak 13**  ✗  ✗  ✗  A2BG2S1 
Peaks 14  24  ✗  ✗  ✗ 
Prostate cancer data classifications (prostate cancer vs. BPH)
GDFS  CFS  rpart  

BPH  Cancer  BPH  Cancer  BPH  Cancer  
True groups  BPH  1  12      7  6 
Cancer  13  21      20  14 
Prostate cancer data classification performance (prostate cancer vs. BPH)
Crossvalidation error  Kappa  Sensitivity  AUC  

GDFS  0.532  0.298  0.618  0.274 
rpart  0.553  0.037  0.412  0.371 
Variable selection for prostate cancer data  disease progression
In addition to the separation of BPH from prostate cancer samples, it is desirable to see whether the serum Nglycan profile changes as prostate cancer progresses. Gleason scores are assigned to prostate cancer cases based on the microscopic appearance of the cancerous tissue. They range from 2 to 10, with grade 10 having the worst prognosis. Feature selection was carried out on the prostate cancer samples from the study by Saldova et al. [24], to investigate whether there differences in the chromatograms of the 17 Gleason 5 and 17 Gleason 7 cases. Three feature selection methods are compared; the proposed GDFS method, correlationbased feature selection (CFS), and a recursive partitioning (rpart).
Variable selection for prostate cancer data (Gleason 5 vs. Gleason 7)
GDFS  CFS  rpart  Predominant glycans (GDFS method)  

Peaks 114  ✗  ✗  ✗  
Peak 15  ✓  ✓  ✗  FA2BG2S1, A3G3 
Peak 16  ✗  ✗  ✗  
Peak 17  ✗  ✗  ✗  
Peak 18  ✗  ✗  ✗  
Peak 19  ✓  ✓  ✗  A3G3S2 
Peak 20  ✗  ✗  ✗  
Peak 21  ✓  ✓  ✗  A3G3S3 
Peak 22  ✗  ✗  ✗  
Peak 23  ✓  ✗  ✗  A4G4S4 
Peak 24*  ✓  ✓  ✗  A4F1G4S4 
Prostate cancer data classifications (Gleason 5 vs. Gleason 7)
GDFS  CFS  rpart  

Gleason 5  Gleason 7  Gleason 5  Gleason 7  Gleason 5  Gleason 7  
True groups  Gleason 5  11  6  11  6  11  6 
Gleason 7  4  13  9  8  8  9 
Prostate cancer data classification performance (Gleason 5 vs. Gleason 7)
Crossvalidation error  Kappa  Sensitivity  AUC  

GDFS  0.294  0.412  0.765  0.785 
CFS  0.441  0.118  0.471  0.585 
rpart  0.412  0.176  0.529  0.512 
Search strategy and computational complexity
For a dataset of dimension p, the cardinality of the feature space increases exponentially with p. An exhaustive search over this space would involve an evaluation of all possible solutions and for this problem has complexity 2^{ p }. That is, for p variables, there are 2^{ p } possible solutions to the feature selection problem. An exhaustive search would certainly be possible for a relatively small number of variables, but the computational complexity increases quickly. A dataset with 24 variables has 16,777,216 possible solutions in the feature selection problem!
Glycan chromatography data being produced is of a relatively low dimensionality at present. It has been found that there are 117 glycans in human serum [9], and therefore, it can be expected that the number of variables in the glycan chromatography data will increase as technology becomes more advanced. For example, Bones et al. [10] recently showed that ultra performance liquid chromatography (UPLC) allows for the quantification of the glycan pool by a chromatogram consisting of 53 glycan peaks, under certain experimental conditions.
Computational efficiency of the GDFS method compared with CFS and rpart
p  GDFS  CFS  rpart  

10  3.29  (0)  2.065  (0)  0.02  (2) 
20  52.246  (0)  11.527  (2)  0.023  (6) 
30  113.702  (0)  23.395  (5)  0.029  (9) 
40  249.751  (2)  30.866  (8)  0.038  (12) 
50  498.445  (1)  83.885  (10)  0.043  (16) 
60  609.841  (0)  415.525  (4)  0.05  (19) 
70  962.695  (2)  828.434  (3)  0.083  (22) 
80  1902.347  (0)  696.083  (10)  0.068  (26) 
90  1516.234  (1)  1286.167  (9)  0.078  (28) 
100  2059.3  (1)  812.16  (17)  0.096  (31) 
Murphy, Dean, and Raftery [22] used the headlong search strategy proposed by Badsberg [20] in their variable selection. They add (or remove) the first variable whose BIC difference is greater than (or less than) a prespecified value. This removes the necessity to search through all variables at each iteration, reducing computational time dramatically over an ordinary greedy search strategy. However, they state that the variables selected using this method may change depending on the initial ordering of variables in the dataset. They preferred this approach, as they had over 1000 variables to consider in their application. Since glycan chromatography datasets are relatively lowdimensional, we avoid this issue by using considering all possibilities of variables to add or remove at each iteration.
Conclusions
Biomarker discovery is of the utmost importance for disease discovery and treatment. The field of glycobiology shows great potential in this area and is continually improving technologies to advance research into the identification and validation of glycan biomarkers. Glycan analysis is commonly carried out using highthroughput chromatography techniques, that give rise to compositional data. The compositional nature of the data is commonly ignored in statistical analysis, mainly due to lack of awareness of the special considerations that are required for the analysis of such data.
There is a substantial need in the field of glycobiology for a statistical toolbox of suitable methods for dealing with the compositional glycan chromatography data. This article hopes to contribute a novel method for feature selection that could be used for identifying sets of potential biomarkers. The method carries out a greedy search over the space of all possible sets of features, seeking the set of features that best discriminates between a set of defined groups in the data. The generalized Dirichlet distribution and its marginal, the beta distribution, are used to model compositional components (variables), since they suitable for proportional data. The BIC is used for model selection.
This methodology was tested on two glycan chromatography datasets, from the lung cancer study by Arnold; et al. [23] and the prostate cancer study by Saldova et al. [24]. Two other wellestablished methods were applied to these datasets for comparison  correlation based feature selection (CFS) and a recursive partitioning method for classification tree construction (rpart package in R [40]). For the lung cancer dataset, a set of 11 peaks are consistently identified by the GDFS method as differing between the lung cancer and clinical control cases (Table 1). Peaks 12, 14, and 17, included in this selected feature set, contain the sialyl Lewis x (SLe ^{ X }) epitope, which is known to be increased in cancer and important for cancer progression [25]. For the prostate cancer dataset, peaks 10 and 13 are consistently identified by the GDFS method as potential glycan biomarkers for differentiating between BPH and prostate cancer. peak 10 contains corefucosylated biantennary glycans, and peaks 10 and 13 contain bisected biantennary glycans. Our findings are consistent with previous results showing that corefucosylation is altered in cancer and bisects are decreased in cancer [53]. Regarding separation of different disease stages, five Nglycan peaks were selected by the GDFS method (peaks 15, 19, 21, 23, and 24) as differing between Gleason 5 and Gleason 7 cases. This indicates a decrease in triantennary trigalactosylated glycans and in tetraantennary tetrasialylated outer arm fucosylated glycans and an increase in tetraantennary tetrasialylated glycans in Gleason 7 compared with Gleason 5 prostate cancer patients [24].
In general, the proposed GDFS method outperformed both CFS and rpart on classification performance, although it is somewhat slower computationally. Importantly, the sensitivity of the classifiers was largest for the GDFS method in all cases, meaning that more of the actual lung cancer cases were detected. From our results, we conclude that the proposed GDFS method provides a useful tool for feature selection in compositional glycan chromatography data.
This method has been developed specifically with glycan chromatography data in mind and accounts for the special constraints on a compositional dataset, since the data are modelled in a simplex sample space. It has been used for feature selection in the context of supervised learning, where the data have a known group structure, but may easily be extended for use with unsupervised learning methods, such as modelbased clustering, as in Raftery and Dean [19].
Appendix
A. change of variable rule
where h^{ ′ }(y) is the derivative of $\stackrel{~}{y}=h\left(y\right)$ with respect to y.
B. Derivation of the generalized Dirichlet probability density function
are mutually independent, where S_{0}=1 and ${S}_{j}=\sum _{m=1}^{j}{S}_{m}$ for m=1,2,…,p. Note that the last component of $\stackrel{~}{\mathbf{Y}}$ is degenerate, since it is equal to one. Since Y is a generalized Dirichlet random vector, the marginal distributions of the elements of $\stackrel{~}{\mathbf{Y}}$ are beta distributions, so that ${\stackrel{~}{Y}}_{j}\sim \text{beta}({\alpha}_{j},{\beta}_{j})$ for j=1,2,…,p−1.
C. Construction of classification trees using recursive partitioning
Briefly, the classification trees fitted here are constructed by splitting observations into subsets that give the best separation between the set of known groups in the data. Subsets of observations are represented by nodes in the classification tree. Each node is labelled by the predominant class of observations at that node, and the misclassification error for any node is then the proportion of observations at that node that don’t belong to the predominant class.
All observations are included in the root node. A binary split of the observations at a given node is made by selecting the feature (variable), and the split threshold for that feature that give the best separation of the classes. Here, the feature set and cut threshold are selected to minimize the Gini index, a measure of “impurity”, or the average misclassification error for the child nodes resulting from a binary split. Then the observations at this node are split into two child nodes according to whether their observed values of the selected feature lie above or below the split threshold. The process is then repeated for the resulting child nodes.
where α is a penalty term that controls the size of the tree. The final tree is then chosen as the “pruned” version of the full tree, that minimizes the cost complexity, R_{ α }. The value of α is estimated using the “1SE” rule proposed by Brieman et al. [27]. If α∈[0,∞], then this interval may be partitioned into a set of subintervals (I_{1},I_{2},…,I_{ k }), such that any value of α in the interval I_{ j } will give rise to the same subtree obtained from pruning the expanded tree by minimizing R_{ α }. The rpart function provides the crossvalidated risk (average R_{ α } value from tenfold crossvalidation) along with its standard error, evaluated at the range of complexity parameters α equal to geometric means of the maximum and minimum values for each interval (I_{1},I_{2},…,I_{ k }). Any cost complexity score within one standard error of the minimum is then marked as being equivalent to the minimum. The optimum value of the complexity parameter α is the one the gives is the simplest of set of models at the “minimum” cost complexity (or in other words, the largest value of α, since α is a penalty for complexity).
D. Pseudocode for correlationbased feature selection
Abbreviations
 HILIC:

Hydrophilic interaction liquid chromatography
 CFS:

Correlationbased feature selection
 rpart:

recursive partitioning method for the construction of classification trees
 GDFS:

Generalized Dirichlet feature selection
 BIC:

Bayesian information criterion
Declarations
Acknowledgements
This work was funded by the Irish Research Council for Science, Engineering and Technology (IRCSET). Lung cancer samples were collected at Fox Chase Cancer Centre. Glycan analysis of the lung cancer samples was supported by the Oxford Glycobiology Institute endowment fund, EUROCarbDB, RIDS contract number: 011952. Analysis of the prostate cancer data was supported by a grant from the Irish Cancer Society, via the Prostate Cancer Research Consortium. R.S. acknowledges funding from the European Union Seventh Framework Programme (FP7/20072013) under grant agreement n° 260600 (“GlycoHIT”).
Authors’ Affiliations
References
 Ranganathan Y, Borges RM: To transform or not to transform: That is the dilemma in the statistical analysis of plant volatiles. Plant Signal Behav. 2011, 6: 113116. 10.4161/psb.6.1.14191.PubMed CentralView ArticlePubMed
 Baxter MJ, Jackson CM: Variable selection in artefact compositional studies. Archaeometry. 2001, 43 (2): 253268. 10.1111/14754754.00017.View Article
 Aitchison J: The Statistical Analysis of Compositional Data. 1986, Caldwell: The Blackburn PressView Article
 Pearson K: On a form of spurious correlation which may arise when indices are used in the measurement of organs. Proc R Soc London. 1897, 60: 489498.View Article
 Taylor ME, Drickamer K: Introduction to Glycobiology. 2003, USA: Oxford University Press
 Raman R, Raguram S, Venkataraman G, Paulson JC, Sasisekharan R: Glycomics: an integrated systems approach to structurefunction relationships of glycans. Nature Methods. 2005, 2 (11): 817824. 10.1038/nmeth807.View ArticlePubMed
 Packer NH, von der Lieth CW, AokiKinoshita KF, Lebrilla CB, Paulson JC, Raman R, Rudd PM, Sasisekharan R, Taniguchi N, York WS: Frontiers in glycomics: bioinformatics and biomarkers in disease. An NIH White paper prepared from discussions by the focus groups at a workshop on the NIH campus, Bethesda MD (September 1113, 2006).eptember 1113, 2006). Proteomics. 2008, 8: 820. 10.1002/pmic.200700917.View ArticlePubMed
 Struwe WB, Cosgrave EFJ, Byrne JC, Saldova R, Rudd PM: Glycoproteomics in health and disease. Functional and Structural Proteomics of Glycoproteins. Edited by: Owens RJ. 2011, Netherlands: Springer, 138.
 Royle L, Campbell MP, Radcliffe CM, White DM, Harvey DJ, Abrahams JL, Kim YG, Henry GW, Shadick NA, Weinblatt ME, Lee DM, Rudd PM, Dwek RA: HPLCbased analysis of serum NGlycans on a 96Well plate platform with dedicated database software. Anal Biochem. 2008, 376: 112. 10.1016/j.ab.2007.12.012.View ArticlePubMed
 Bones J, Mittermayr S, O’Donoghue N, Guttman A, Rudd PM: Ultra performance liquid chromatographic profiling of serum NGlycans for fast and efficient identification of cancer associated alterations in glycosylation. Anal Chem. 2010, 82 (24): 1020810215. 10.1021/ac102860w.View ArticlePubMed
 Galligan M, Campbell MP, Saldova R, Rudd PM, Murphy TB: Application of compositional models for glycan HILIC data. Proceedings of the 4th International Workshop on Compositional Data Analysis. Edited by: Ortega MI, TolosanaDelgado R, Egozcue JJ, TolosanaDelgado R , Ortega MI . 2011, [http://congress.cimne.com/codawork11/Admin/Files/FilePaper/p51.pdf],
 Connor RJ, Mosimann JE: Concepts of independence for proportions with a generalization of the Dirichlet distribution. J Am Stat Assoc. 1969, 64 (325): 194206. 10.1080/01621459.1969.10500963.View Article
 Dennis SY: On the hyperDirichlet type 1 and hyperLiouville distributions. Commun Stat Theory Methods. 1991, 20 (12): 40694081. 10.1080/03610929108830757.View Article
 Null B: The nested Dirichlet distribution: properties and applications. 2008, [Working paper. Department of Management Science and Engineering, Stanford University]
 Hijazi RH, Jernigan RW: Modelling compositional data using Dirichlet regression models. J Appl Probability Stat. 2009, 4: 7791.
 Gueorguievaa R, Rosenheckb R, Zelterman D: Dirichlet component regression and its applications to psychiatric data. Comput Stat Data Anal. 2008, 52 (12): 53445355. 10.1016/j.csda.2008.05.030.View Article
 Grunwald GK, Raftery AE, Guttorp P: Time series of continuous proportions. J R Stat Soc Ser B. 1993, 55: 103116.
 Wang HY, Yang Q, Qin H, Zha H: Dirichlet component analysis: feature extraction for compositional data. The 25th International Conference on Machine Learning (ICML). 2008, Helsinki:
 Raftery AE, Dean N: Variable selection for modelbased clustering. J Am Stat Assoc. 2006, 101 (473): 168178. 10.1198/016214506000000113.View Article
 Badsberg JH: Model search in contingency table by CoCo. Dodge, Y. and Whittaker, J. Edited by: Neuchatel, COMPSTAT 1992, Computational Statistics, Computational Statistics , COMPSTAT 1992 , Neuchatel . Physica Verlag: Heidelberg, Vol. 1, 251256.
 Schwarz G: Estimating the dimension of a model. Ann Stat. 1978, 6 (2): 461464. 10.1214/aos/1176344136.View Article
 Murphy TB, Dean N, Raftery AE: Variable selection and updating in modelbased discriminant analysis for highdimensional data with food authenticity applications. Ann Appl Stat. 2010, 4: 396421.PubMed CentralView ArticlePubMed
 Arnold JN, Saldova R, Galligan MC, Murphy TB, MimuraKimura Y, Telford JE, Godwin AK, Rudd PM: Novel glycan biomarkers for the detection of lung cancer. J Proteome Res. 2011, 10 (4): 17551764. 10.1021/pr101034t.View ArticlePubMed
 Saldova R, Fan Y, Fitzpatrick JM, Watson RWG, Rudd PM: Core fucosylation and α23 sialylation in serum Nglycome is significantly increased in prostate cancer comparing to benign prostate hyperplasia. Glycobiology. 2011, 21 (2): 195205. 10.1093/glycob/cwq147.View ArticlePubMed
 Arnold JN, Saldova R, Hamid UMA, Rudd PM: Evaluation of the serum Nlinked glycome for the diagnosis of cancer and chronic inflammation. Proteomics. 2008, 8 (16): 32843293. 10.1002/pmic.200800163.View ArticlePubMed
 Hall MA, Smith LA: Feature subset selection: A Correlation Based Filter Approach. International Conference on Neural Information Processing and Intelligent Information Systems. 1997, Berlin: Springer, 855858.
 Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. 1984, Monterey: Wadsworth & Brooks/Cole Advanced Books & Software
 Vermeesch P: Tectonic discrimination of basalts with classification trees. Geochimica et Cosmochimica Acta. 2006, 70 (7): 18391848. 10.1016/j.gca.2005.12.016.View Article
 Pirooznia M, Yang JY, Yang MQ, Youping D: A comparative study of different machine learning methods on microarray gene expression data. BMC Genomics. 2008, 9 (Suppl 1): S1310.1186/147121649S1S13.PubMed CentralView ArticlePubMed
 Peek AS: Improving model predictions for RNA interference activities that use support vector machine regression by combining and filtering features. BMC Bioinformatics. 2007, 8: 18210.1186/147121058182.PubMed CentralView ArticlePubMed
 Minka TP: Estimating a Dirichlet distribution. Tech Rep M.I.T. 2000, http://research.microsoft.com/enus/um/people/minka/papers/dirichlet/minkadirichlet.pdf,
 Ronning G: Maximum likelihood estimation of Dirichlet distributions. J Stat Comput Simul. 1989, 32 (4): 215221. 10.1080/00949658908811178.View Article
 Lindstrom MJ, Bates DM: NewtonRaphson and EM algorithms for linear mixedeffects models for repeated measures data. J Am Stat Assoc. 1988, 83: 10141022.
 Böhning D, Dietz E, Schaub R, Schlattmann P, Lindsay B: The distribution of the likelihood ratio for mixtures of densitites from the oneparameter exponential family. Ann Inst Stat Math. 1994, 46 (2): 373388. 10.1007/BF01720593.View Article
 Lindsay BG: Mixture models: theory, geometry and applications. NSFCBMS Regional Conference Series in Probability and Statistics, Volume 5. 1995, Hayward: Institute of Mathematical Statistics
 McNicholas PD, Murphy TB, McDaid AF, Frost D: Serial and parallel implementations of modelbased clustering via parsimonious Gaussian mixture models. Comput Stat Data Anal. 2010, 54 (3): 711723. 10.1016/j.csda.2009.02.011.View Article
 Ekenel HK, Sankur B: Feature selection in the independent component subspace for face recognition. Pattern Recognit Lett. 2004, 25 (12): 13771388. 10.1016/j.patrec.2004.05.013.View Article
 Fayyad UM, Irani KB: Multiinterval discretization of continuousvalued attributes for classification learning. “Proceedings of the 13th International Joint Conference on Artificial Intelligence”, Volume 2. 1993, Chambery: Morgan Kaufmann, 10221027.
 Cover TM, Thomas JA: Elements of Information Theory. 1991, Hoboken, New Jersey and published simultaneously in Canada: John Wiley & Sons Inc.View Article
 R Development Core Team:: R: A Language and Environment for Statistical Computing. 2011, Vienna: R Foundation for Statistical Computing, [http://www.Rproject.org] [ISBN 3900051070],
 Therneau TM, Atkinson B: rpart: Recursive Partitioning. 2012, http://mayoresearch.mayo.edu/mayo/research/biostat/splusfunctions.cfm [R package version 3.152. R port by Brian Ripley]
 Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray geneexpression data. Proc Natl Acad Sci USA. 2002, 99 (10): 65626566. 10.1073/pnas.102102699.PubMed CentralView ArticlePubMed
 Cohen J: A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960, 20: 3746. 10.1177/001316446002000104.View Article
 Fawcett T: An introduction to ROC analysis. Pattern Recognit Lett. 2006, 27 (8): 861874. 10.1016/j.patrec.2005.10.010.View Article
 Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics. 2005, 21 (20): 39403941. 10.1093/bioinformatics/bti623.View ArticlePubMed
 Bigge JC, Patel TP, Bruce JA, Goulding PN, Charles SM, Parekh RB: Nonselective and efficient fluorescent labeling of glycans using 2amino benzamide and anthranilic acid. Anal Biochem. 1995, 230 (2): 229238. 10.1006/abio.1995.1468.View ArticlePubMed
 Kuster B, Wheeler SF, Hunter AP, Dwek RA, Harvey DJ: Sequencing of Nlinked oligosaccharides directly from protein gels: ingel deglycosylation followed by matrixassisted laser desorption/ionization mass spectrometry and normalphase highperformance liquid chromatography. Analy Biochem. 1997, 250: 82101. 10.1006/abio.1997.2199.View Article
 Jemal A, Bray F, Center MM, Ferlay J: Global cancer statistics. CA: A Cancer J Clin. 2011, 61 (2): 6990. 10.3322/caac.20107.
 Ghosal R, Kloer P, Lewis KE: A review of novel biological tools used in screening for the early detection of lung cancer. Postgrad Med J. 2009, 85 (1005): 358363. 10.1136/pgmj.2008.076307.View ArticlePubMed
 Harvey DJ, Merry AH, Royle L, Campbell MP, Dwek RA, Rudd PM: Proposal for a standard system for drawing structural diagrams of N and Olinked carbohydrates and related compounds. Proteomics. 2009, 9 (15): 37963801. 10.1002/pmic.200900096.View ArticlePubMed
 Campbell MP, Royle L, Radcliffe CM, Dwek RA, Rudd PM: GlycoBase and autoGU: tools for HPLCbased glycan analysis. Bioinformatics. 2008, 24 (9): 12141216. 10.1093/bioinformatics/btn090.View ArticlePubMed
 Chou R, Croswell JM, Dana T, Bougatsos C, Blazina I, Fu R, Gleitsmann K, Koenig HC, Lam C, Maltz A, Rugge JB, Lin K: Screening for prostate cancer: a review of the evidence for the U.S. preventive services task force. Ann Intern Med. 2011, 155: 762771. 10.7326/000348191551120111206000375.View ArticlePubMed
 Marino K, Saldova R, Adamczyk B, Rudd PM: Changes in serum Nglycosylation profiles: functional significance and potential for diagnostics. Carbohydr Chem: Chem Biol Approaches. 2011, in press
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.