Volume 8 Supplement 5

## Articles selected from posters presented at the Tenth Annual International Conference on Research in Computational Biology

# On consensus biomarker selection

- Janusz Dutkowski
^{1}Email author and - Anna Gambin
^{1}Email author

**8(Suppl 5)**:S5

**DOI: **10.1186/1471-2105-8-S5-S5

© Dutkowski and Gambin; licensee BioMed Central Ltd. 2007

**Published: **24 May 2007

## Abstract

### Background

Recent development of mass spectrometry technology enabled the analysis of complex peptide mixtures. A lot of effort is currently devoted to the identification of biomarkers in human body fluids like serum or plasma, based on which new diagnostic tests for different diseases could be constructed. Various biomarker selection procedures have been exploited in recent studies. It has been noted that they often lead to different biomarker lists and as a consequence, the patient classification may also vary.

### Results

Here we propose a new approach to the biomarker selection problem: to apply several competing feature ranking procedures and compute a consensus list of features based on their outcomes. We validate our methods on two proteomic datasets for the diagnosis of ovarian and prostate cancer.

### Conclusion

The proposed methodology can improve the classification results and at the same time provide a unified biomarker list for further biological examinations and interpretation.

## Background

There is great hope among clinical proteomics researchers that mass spectrometry (MS) will soon become a powerful diagnostic tool. Extensive research has been conducted on statistical learning methods for disease prediction from MS data (see e.g. [1–9]). The data considered in this context are mainly spectra of complex peptide mixtures, such as plasma or serum samples. These spectra contain signals of thousands of peptides. The dimension of the input space is therefore very high, while the number of available samples is relatively small (a few hundred at most). Furthermore, only a small fraction of the peptides may potentially be significant in determining the health state of the patient. Most classification algorithms do not perform well in this setting, and for the ones that do (e.g. random forests) the results become difficult to interpret. Here we consider two paradigms for reducing the dimension of the data and identifying informative variables: feature selection and feature extraction. We review commonly used feature selection and extraction procedures and propose a new approach based on aggregating the preferences of several competing methods.

We present two novel solutions to the biomarker selection problem, both of which aim at unifying the preferences of a number of feature scoring functions. The first approach is based on computing a consensus ranking, given the individual rankings from several scoring functions. This is often referred to as rank aggregation. This problem was formulated in the context of the Web search engines in [13] and several heuristics have been proposed therein for this task. In our application the consensus is found as the stationary distribution of an appropriately defined Markov chain over the set of features.

A different way of reducing the dimension of the data is by feature extraction. Instead of selecting a subset of the original features, the aim is to construct a group of new features that optimize a given objective. Often the method of choice is principal component analysis (PCA), which retains as much of the original variance of the data as possible, with the condition that the output variables are uncorrelated. Typically PCA is applied to all of the variables to extract a small number of new variables, which convey the greatest amount of the variance. We argue that in the case of MS data (and many other similar applications) it is unrealistic to assume that the directions of the greatest variance of the data capture the differences between the observed health conditions. Instead many other factors, independent of the given patient classification, like age, diet, or sample processing may contribute to the variance of the data. For best results and interpretability we propose to apply PCA only to a selected group of features. This selection is based on the outcome of a number of scoring functions.

## Results and discussion

### Markov chain rank aggregation

To evaluate the performance of the Markov chain rank aggregation algorithm we first scored each feature using the four scoring functions, thus obtaining four rankings. Next, we selected top 100 features from each list and aggregated the resulting partial rankings using the Markov chain method. Note that this process was repeated for each training block of the data in the cross-validation scheme. Best results were obtained using the MC_{4} transition matrix for the prostate cancer dataset and MC_{1} transition matrix in the case of the ovarian cancer dataset (the respective transition matrices are defined in Section Methods). Figures 2 and 3 compare the classifier accuracy (total number of correct class predictions divided by the total number of predictions) obtained using different feature selection methods and their aggregation for the two datasets. It can be observed that consensus ranking found by the MC rank aggregation performed well compared to the separate feature rankings. It did not achieve the highest overall score, but was clearly better than most of the input rankings it was computed from. In case of the prostate cancer dataset (see Fig. 2) the classification accuracy for features selected with MC_{4} closely followed the results for the best input method, while all other input methods performed much worse. In case of the ovarian cancer data (see Fig. 3), MC_{1} also outperformed three of four input methods. Best features selected with the t-statistic yielded the highest prediction accuracy on this dataset. However, the accuracy of most classifiers significantly decreased when more features from this ranking were used. The MC_{1} ranking displayed a more stable behavior increasing the prediction accuracy as more features were included.

### Consensus feature extraction *via* PCA

To evaluate the performance of the proposed consensus feature extraction method, for each cross-validation split we took a union of 100 best features from each ranking and applied PCA to the data reduced to only those features. A specified number of the outcome consensus features (sorted by the decreasing eigenvalues) was chosen each time to construct the classifier. Due to the limited number of the input variables, PCA outputs only a small number of significant features. We decided to take only those which constitute for at least 0.1% of the variance. The performance of standard PCA and the proposed "Consensus" version for the two datasets are presented in Figures 2 and 3. Features obtained from Consensus PCA for the prostate cancer dataset (see Fig. 2) were far more informative than the ones extracted using regular PCA (judging by classifier performance). The accuracy achieved using a small number of the consensus features also compared favorably with the results for a large set of features selected by the best input ranking function. In the case of the ovarian cancer data using Consensus PCA significantly improved the performance of the decision tree classifier. Remaining classifiers yielded better predictions using the standard PCA version. We also note that in the case of this dataset using standard principal components, which convey the variance of all features, overall resulted in more accurate predictions than using features selected by ranking methods (most evident with the SVM classifier). Based on these observations we suspect that in the case of this dataset lower ranked features provide additional information useful for class prediction. This would explain the lower accuracy of the classifiers constructed using the consensus features, which only convey information about the top 100 features from each method.

## Conclusion

We have proposed and tested two biomarker selection methods: one based on rank aggregation and second applying PCA to the informative variables selected by different scoring procedures. Both methods raise the possibility of identifying predictors supported by several competing feature ranking procedures. Although we focused on the analysis of MS data, our methods can just as well be applied to extract consensus predictors from other large-scale experiments (e.g. gene expression microarray data).

In order to confirm the relevance of predictors found by any computational means one must study their underlying biological function. We are currently involved in two proteomic projects carried out by the Laboratory of Mass Spectrometry of the Polish Academy of Science and the Warsaw Oncology Center, which aim at identifying prognostic biomarkers for cystic-fibrosis and diagnostic biomarkers for colorectal cancer. These studies will provide us a chance to validate our procedures with respect to their ability to identify more biologically meaningful predictors and less false positives.

## Methods

We will denote our dataset with a *p* by *n* matrix, where *p* is the number of intensity measurements and *n* is the number of samples. In a typical MS dataset the intensities correspond to *p* mass to charge (m/z) ratios, which determine the set of observed variables. Sometimes additional experimental techniques such as liquid chromatography (LC) are applied to distinguish between substances with the same mass, thus the observed variables may be further refined by these measurements. The rows of the data matrix contain the values of the observed variables (also called features) for each of the *n* samples. Given a set of training samples with known classification, our goal is to build a classifier which will be able to predict the class of a new sample. In the following discussion we will describe the subsequent steps leading to the construction of a classifier which uses consensus features (potentially biomarkers) computed by one of our methods. These steps include feature selection using various ranking functions, rank aggregation or consensus feature extraction and supervised learning.

### Biomarker selection by ranking functions

We begin by examining four feature selection methods. The first one considered here is the t-statistic (TT) [10], which measures the normalized differences of the means of a given feature in two groups (e.g. cancer and healthy group). The peak probability contrast method (PPC) for classification of MS samples was introduced in [7]. Here we exploit only the feature ranking part of the algorithm. It proceeds by determining for each feature the split point, which maximally discriminates between the two given groups of samples. The best split point is chosen among the quantiles of feature values (i.e. peak heights at a given position), (see [7] for details). As another way of determining feature importance, we consider the use of the mutual information measure (MI) [11]. Each peptide signal is treated as a discrete source which conveys information about the decision variable (class of the sample). Features with the greatest mutual information with the decision variable are considered the most important. The forth ranking statistic considered here is the feature importance measure computed by the random forest (RF) classification algorithm [12]. Note that we also use random forests as a classifier (see Section Classification).

*k*of the variables (biomarkers) for further processing. One serious drawback of this approach is that different selection procedures result in different biomarker rankings (see Figures 5 and 6) and yield different classification rules. Each single selection method may provide valuable information about the features. Therefore it is reasonable to combine the information from several methods. In the following we describe two methods for aggregating the preferences of multiple scoring functions.

### Markov chain rank aggregation

The first approach proposed here is to compute an aggregated ranking from the outcomes of several feature selection procedures. The mathematical problem of rank aggregation was originally formulated in the context of Web search engines [13]. The idea is to start with several *partial rankings*, and produce one *consensus list* being the aggregation of them. Considered rankings are partial in two senses: several features, consecutive in the ranking list, can have the same label (i.e. they are equally ranked), and often we are interested only in the top *k* items from each list.

Rank aggregation problem has different formalizations based on various optimality criteria. Probably one of the most natural criteria – optimization of the average Kendall distance (the bubble sort distance between two lists) has been proven to be NP-hard in [13]. To cope with the complexity of the problem an efficient heuristic approach based on calculating the stationary distribution of an appropriately defined Markov chain has been proposed. The states of the chain correspond to the features ranked by various scoring functions and the transition probabilities depend on the position of the features in the given partial rankings. The aggregated consensus ranking is obtained as the list of states sorted by their stationary probabilities.

In this study we consider two Markov chains from [13] (MC_{1} and MC_{4}). The transition matrices for the two Markov chains are defined as follows:

MC_{1} – If the current state is feature *P*, the next state is chosen uniformly from the multiset of all features that were ranked higher (or equal to) *P* by some feature selection method that selected *P*. The main idea is that in each step we move from the current feature to a better feature, allowing about $\frac{1}{k}$ probability of staying in the same state, where *k* is roughly the average rank of the current feature.

MC_{4} – If the current state is *P*, then the next state is chosen as follows: first pick a state *Q* uniformly from the union of all ranked features. If *Q* is ranked better than *P* by the *majority* of the methods, that selected both *P* and *Q*, then go to *Q*, else stay in *P*.

In the case of chain MC_{1}, we observed that its specific structure complies well with the framework developed in [17], where the Markov chain transition matrix *L* is assumed to have a form

*L* = *L*_{0} + *ε*·*L*_{1} + *ε*^{2}·*L*_{2} + ... + *ε*^{
k
}·*L*_{
k
}, (1)

*ε*∈ (0, 1). This structure reflects the different strengths of interactions between the vertices of the Markov chain underlying graph (see Figure 7), and gives rise to an efficient algorithm for approximating the stationary distribution. The approximation algorithm is based on the idea of grouping together closely related states. Different interaction strengths (transition probabilities of the order of

*ε*

^{ n },

*n*= 1 ...

*k*from Eq. (1)) result in several grouping phases. During the first grouping phase all the top ranked features (the ones that were never ranked lower than any other feature outside this group and have transition probabilities between themselves above a certain threshold) are lumped together. Second phase continues with the smaller state space and determines the next set of top ranked features, which are substituted by a single clump in the third phase, etc. States which are joined in the last phase have the lowest stationary probability. For each clump of grouped states the stationary distribution is calculated separately using the Grassmann-Taksar-Heyman (GTH) algorithm [18] and appropriately updated during the subsequent phases of the algorithm.

To ensure that all states will be grouped at some stage, we set the transition probability from the top to the bottom ranked features from each ranking list to a small value *δ* > 0. Recall that in the transition matrix of MC_{1} the probabilities from lower ranked features to higher ranked are also greater than 0. This ensures that the entire underlying graph is strongly connected, which is a prerequisite of the approximation algorithm. The probability value *δ* is chosen appropriately to guarantee that the corresponding interactions are taken into account during the last grouping phase, hence it has negligible influence on the final result.

The structure of the graph underlying the Markov chain MC_{1} is presented in Figure 7. Depicted is the nested family of sets grouped together during the consecutive phases of the algorithm.

Readers interested in the details of the approximation algorithm are referred to [19], (an extended version of it is available at [20]).

_{1}. The structure of MC

_{4}does not allow for efficient aggregation, but because of the limited number of features selected from each input ranking, it is still tractable. Table 1 summarizes the efficiency of Markov chain aggregation method. For MC

_{1}we have 7 grouping stages; the sizes of resulting groups vary from 4 to 87. For MC

_{4}all 242 states are grouped together in the first phase and the stationary distribution vector is calculated exactly using the GTH algorithm [18]. The significant speed-up obtained with the approximation algorithm is crucial when aggregation is applied to complete rankings of features from massive datasets.

Markov chain grouping comparison.

MC | # stages | groups sizes | time (in sec) |
---|---|---|---|

MC | 7 | 4, 15, 46, 51, 87, 14, 25 | 1.8 |

MC | 1 | 242 | 21.7 |

### Consensus feature extraction

The second method proposed here is based on principal component analysis (PCA) [21]. PCA is a projection method, which seeks linear combinations of the original variables with maximal variance. The *i*-th projection vector (orthogonal to all previous projection vectors) is given by

where **Σ** is the covariance matrix. The solution to (2) is given by the eigenvector of **Σ**, which corresponds to the *i*-th largest eigenvalue. PCA is particularly useful for reduction and interpretation of biological data from high-throughput technologies like mass spectrometry and micro-arrays. It is an unsupervised method, so it does not take into account the classification of the samples. As a characteristic of high-throughput biological experiments, large part of the variance of the data is possibly unrelated to class assignments. Various biological and processing factors can contribute to the overall diversity of the samples. Hence, for classification purposes, the aim should not be to preserve the overall variance, but rather to preserve the variance between classes. In order to boost the discrimination power of the extracted features, we apply PCA only to the group of the most discriminative variables. This also allows for easier interpretability of the resulting features (i.e. determining the contributions of the original variables), as the linear combinations are taken over a much smaller set of vectors. We take the union of the sets of top *k* features from each of the considered feature selection methods to generate the set of discriminative features and apply PCA to the reduced data matrix to extract a set of uncorrelated *consensus features*.

### Classification

We use four well-known supervised learning algorithms in our experiments, namely: linear discriminant analysis (LDA), support vector machines (SVM), random forests (RF), and decision trees (DT). All these methods have been compared in [8]. We provide only a brief description of SVM and RF below. For details and description of LDA and DT see e.g. [22, 23].

#### Support Vector Machines

Support Vector Machines were introduced by Vapnik [24]. The method seeks the optimal hyperplane separating two classes. In the case of linearly separable data, the optimal hyperplane can be found by solving a linear optimization problem. Often the training points are not perfectly linearly separable. SVM deals with this problem by mapping the data points into higher-dimensional space. For details refer to [25].

#### Random forest

It has been found that aggregating classifiers built from perturbed versions of the training set could substantially improve prediction accuracy. Random forest ([12]) is a very effective classifier which exploits this idea, by constructing an ensemble of classification trees and basing the decision on the majority vote. Each tree is build on a bootstrap sample of the training data and random feature selection is applied at each node of the tree.

## Abbreviations

- •
**BPH**: -
Benign Prostatic Hyperplasia

- •
**DT**: -
Decision Trees

- •
**GTH**: -
Grassmann-Taksar-Heyman Algorithm

- •
**LC**: -
Liquid Chromatography

- •
**LDA**: -
Linear Discriminant Analysis

- •
**MALDI-TOF**: -
Matrix-Assisted Laser Desorption/Ionization Time-of-Flight Spectrometry

- •
**MC**: -
Markov Chain

- •
**MI**: -
Mutual Information

- •
**MS**: -
Mass Spectrometry

- •
**PCA**: -
Principal Component Analysis

- •
**PPC**: -
Peak Probability Contrasts

- •
**RF**: -
Random Forest

- •
**SELDI-TOF**: -
Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Spectrometry

- •
**SVM**: -
Support Vector Machines

- •
**TT**: -
T-Test

## Declarations

### Acknowledgements

We would like to acknowledge the helpful discussions with many colleagues from the Laboratory of Mass Spectrometry, as well as professor Jerzy Ostrowski from the Warsaw Oncology Center. The research described in this paper was partially supported by Polish Ministry of Education and Science grants KBN-3 T11F 021 28 and PBZ-KBN-088/P04/2003.

This article has been published as part of *BMC Bioinformatics* Volume 8, Supplement 5, 2007: Articles selected from posters presented at the Tenth Annual International Conference on Research in Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S5.

## Authors’ Affiliations

## References

- Adam BL, Qu Y, Davis JW, Ward MD, Clements MA, Cazares LH, Semmes OJ, Schellhammer PF, Yasui Y, Feng Z, Wright GLJ:
**Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men.***Cancer Research*2002,**62:**3609–3614.PubMedGoogle Scholar - Geurts P, Fillet M, de Seny D, Meuwis MA, Malaise M, Merville MP, Wehenkel L:
**Proteomic mass spectra classification using decision tree based ensemble methods.***Bioinformatics*2005,**21:**3138–3145. 10.1093/bioinformatics/bti494View ArticlePubMedGoogle Scholar - Jacobs IJ, Menon U:
**Progress and challenges in screening for early detection of ovarian cancer.***Mol Cell Proteomics*2004,**3:**355–366. 10.1074/mcp.R400006-MCP200View ArticlePubMedGoogle Scholar - Lilien RH, Farid H, Donald BR:
**Probabilistic disease classification of expression-dependent proteomic data from mass spectrometry of human serum.***Journal of Computational Biology*2003,**10:**925–946. 10.1089/106652703322756159View ArticlePubMedGoogle Scholar - Li J, Zhang Z, Rosenzweig J, Wang YY, Chan DW:
**Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer.***Clinical Chemistry*2002,**48:**1296–1304.PubMedGoogle Scholar - Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA:
**Use of proteomic patterns in serum to identify ovarian cancer.***Lancet*2002,**359:**572–577. 10.1016/S0140-6736(02)07746-2View ArticlePubMedGoogle Scholar - Tibshirani R, Hastie T, Narasimhan B, Soltys S, Shi G, Koong A, Le QT:
**Sample classification from protein mass spectrometry, by 'peak probability contrasts'.***Bioinformatics*2004,**20:**3034–3044. 10.1093/bioinformatics/bth357View ArticlePubMedGoogle Scholar - Wu B, Abbott T, Fishman D, McMurray W, Mor G, Stone K, Ward D, Williams K, Zhao H:
**Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data.***Bioinformatics*2003,**19:**1636–1643. 10.1093/bioinformatics/btg210View ArticlePubMedGoogle Scholar - Yu JS, Ongarello S, Fiedler R, Chen XW, Toffolo G, Cobelli C, Trajanoski Z:
**Ovarian cancer identification based on dimensionality reduction for high-throughput mass spectrometry data.***Bioinformatics*2005,**21:**2200–2209. 10.1093/bioinformatics/bti370View ArticlePubMedGoogle Scholar - Speed T:
*Statistical Analysis of Gene Expression Microarray Data*. Chapman & Hall/CRC; 2003.View ArticleGoogle Scholar - Jones GA, Jones JM:
*Information and Coding Theory*. Springer Verlag; 2000.View ArticleGoogle Scholar - Breiman L:
**Random Forests.***Machine Learning*2001,**45:**123–140. 10.1023/A:1010950718922View ArticleGoogle Scholar - Dwork C, Kumar R, Naor M, Sivakumar D:
**Rank aggregation methods for the Web.**In*Proceedings of the 10th international conference on World Wide Web Hong Kong*. Elsevier Science; 2001:613–622.Google Scholar - Gambin A, Dutkowski J, Karczmarski J, Kluge B, Kowalczyk K, Ostrowski J, Poznański J, Tiuryn J, Bakun M, Dadlez M:
**Automated reduction and interpretation of multidimensional mass spectra for analysis of complex peptide mixtures.***International Journal of Mass Spectrometry*2007,**260:**20–30. 10.1016/j.ijms.2006.06.011View ArticleGoogle Scholar - Smith CA, Want EJ, O'Maille G, Abagyan R, Siuzdak G:
**XCMS: Processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching and identification.***Analytical Chemistry*2006,**78:**779–787. 10.1021/ac051437yView ArticlePubMedGoogle Scholar - Luksza M, Kluge B, Ostrowski J, Karczmarski J, Gambin A:
**Effcient Model-Based Clustering for LC-MS Data.**In*Proceedings of the 6th Workshop on Algorithms in Bioinformatics 2006, LNBI*. Springer-Verlag; 2006:613–622.Google Scholar - Pokarowski P:
**Directed forests with applications to algorithms related to Markov chains.***Applicationes Mathematicae*1999,**26:**395–414.Google Scholar - Grassmann WK, Taksar MI, Heyman DP:
**Regenerative analysis and steady state distributions for Markov chains.***Operations Research*1985,**33:**1107–1116.View ArticleGoogle Scholar - Gambin A, Pokarowski P:
**A combinatorial aggregation algorithm for stationary distribution of a large Markov chain.**In*Proceedings of the 13th International Symposium on Fundamentals of Computation Theory 2001, Riga, Latvia, LNCS 2138*. Edited by: Freivalds R. Springer; 2001:384–388.Google Scholar - Gambin A, Pokarowski P:
**Aggregation Algorithms for Markov Chains with Large State Space.**[http://bioputer.mimuw.edu.pl/papers/aggr.pdf] - Jolliffe IT:
*Principal component analysis*. Springer Verlag; 1986.View ArticleGoogle Scholar - Hastie T, Tibshirani R, Friedman JH:
*The Elements of Statistical Learning*. Springer Verlag; 2001.View ArticleGoogle Scholar - Venables WN, Ripley BD:
*Modern Applied Statistics with S*. Forth edition. Springer Verlag; 2002.View ArticleGoogle Scholar - Vapnik V:
*Statistical Learning Theory*. New York: Wiley-Interscience; 1998.Google Scholar - Scholkopf B, Smola AJ:
*Learning with Kernels*. Cambridge, MA: MIT Press; 2002.Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.