Prediction using step-wise L1, L2 regularization and feature selection for small data sets with large number of features
© Demir-Kavuk et al; licensee BioMed Central Ltd. 2011
Received: 6 July 2011
Accepted: 25 October 2011
Published: 25 October 2011
Machine learning methods are nowadays used for many biological prediction problems involving drugs, ligands or polypeptide segments of a protein. In order to build a prediction model a so called training data set of molecules with measured target properties is needed. For many such problems the size of the training data set is limited as measurements have to be performed in a wet lab. Furthermore, the considered problems are often complex, such that it is not clear which molecular descriptors (features) may be suitable to establish a strong correlation with the target property. In many applications all available descriptors are used. This can lead to difficult machine learning problems, when thousands of descriptors are considered and only few (e.g. below hundred) molecules are available for training.
The CoEPrA contest provides four data sets, which are typical for biological regression problems (few molecules in the training data set and thousands of descriptors). We applied the same two-step training procedure for all four regression tasks. In the first stage, we used optimized L1 regularization to select the most relevant features. Thus, the initial set of more than 6,000 features was reduced to about 50. In the second stage, we used only the selected features from the preceding stage applying a milder L2 regularization, which generally yielded further improvement of prediction performance. Our linear model employed a soft loss function which minimizes the influence of outliers.
The proposed two-step method showed good results on all four CoEPrA regression tasks. Thus, it may be useful for many other biological prediction problems where for training only a small number of molecules are available, which are described by thousands of descriptors.
Nowadays, empirical methods of machine learning are widely used in life sciences and related sciences such as chemistry, biochemistry, pharmacy, and medicinal diagnostics. They can be used to predict the value of a target property in focus such as the competence of a molecular compound to fulfill a specific function. For this purpose, regression methods correlate specific molecular properties (features or descriptors) of molecular compounds with the desired target property. In the simplest case, the regression uses an objective function whose number of parameters is as large as the number of considered features. Usually the objective function contains also a so-called regularization term. It penalizes model details of unnecessary complexity, focuses on the most relevant features, and thus avoids over-fitting of the data used for training (parameter optimization) . The most commonly used regularization methods are L1 regularization, also known as Lasso  and L2 regularization also known as ridge regression . As penalty term, the L1 regularization adds the sum of the absolute values of the model parameters to the objective function whereas the L2 regularization adds the sum of the squares of them. Due to its inherent linear dependence on the model parameters, regularization with L1 disables irrelevant features leading to sparse sets of features. Thus, L1 regularization combines efficient feature selection and model generation into one single optimization step.
In recent years, considerable advancements were made in high throughput techniques to generate for a large number of relevant molecular compounds the target values in focus. Nevertheless, for many problems, where molecular target values need to be predicted by empirical machine learning methods, the amount of data is often scarce. Hence, in a typical prediction scenario the number of compounds with known target values can be very small (say 100 or even much below), while the number of potentially relevant features (and thus also the number of model parameters), which need to be employed initially is often large (1000 or even more). Under such circumstances, overtraining would be unavoidable unless specific precautions are applied to control and reduce the number of features and thus also the number of parameters. Becoming aware of these problems, the biological research community has started to pay increasingly attention to feature selection techniques. There are various feature selection strategies, which all have their pros and contras. Recently, feature selection using Lasso (L1) regularization has gained research interests due to its simplicity and several modifications of the original L1 regularization have been developed [4, 5].
In this study we apply in a two-step regularization procedure where first L1 and than L2 regularization is applied, using L1 regularization for feature selection only. With the remaining selected features, the final model achieves higher accuracy, if it is build with L2 regularization only. In spite of its simplicity, L1 regularization requires special solvers, since the derivatives of the L1 regularization term are not defined at vanishing parameter values. A number of highly optimized solvers are available that can minimize the objective function in presence of L1 regularization [6–10]. However, the provided implementations are limited to a certain programming language and loss functions complicating re-implementation and modification of the original algorithm. In this work, the simple Rprop algorithm  is used to approximate the optimal L1 regularized solution leading to good results. We apply the two-step method mentioned above to the prediction tasks of CoEPrA (Co mparative E valuation of Pr ediction A lgorithms) modeling competition of 2006 . These data sets are characterized by few data points (~80) with a large number of features (~5000). The data sets of CoEPrA 2006 contain octo- and nona-peptides relevant to MHC class I binding which play an important role in the immune response of mammals. The purpose of this competition was to facilitate testing and comparison of various classification and regression algorithms for biological active molecules using blind prediction. The participant groups in CoEPrA applied various methods to the data sets at each task, and the organizers evaluated all collected predictions and then announce the rank of the participants on web site. All data sets can be obtained free of charge from the CoEPrA web site .
The training sets have been used to build prediction models using the proposed two-step learning method. The models have then been used to predict the provided test sets. All experiments were performed on a single core of an Intel Xeon X5670 CPU running at 2.93Ghz. Typically the whole optimization of a prediction model for one of the four CoEPrA regularization tasks required only 10 minutes of CPU time. Thereby the most CPU time demanding step is the optimization of the regularization parameters λ1,2, involving the average over five different ten-fold cross validations for the eight and thirteen considered candidate values of λ1 and λ2, respectively. The CPU time increases linearly with the number of features and the number of molecules in the training data set. Furthermore, the number of cross validation rounds may be decreased for large data sets. Hence, the proposed method is also applicable for much larger data sets. Finally, using the optimized prediction model requires only milliseconds of CPU time per molecule to be predicted. Hence, it can be used as a high through-put method.
The winner of CoEPrA regression task II and IV, and the second best for task III is the group of Curt Breneman. Their method combined the provided physico-chemical features with 147 RECON (TAE/RECON electron-density derived) descriptors. Feature selection has been performed using principal component analysis. Their final models contained 34, 148 and 148 features in task II, III and IV respectively. Wit Jakuczun, who ranked first, second and third for task III, I and IV, respectively, used a Random Forest approach for feature selection and model building . For task III, his prediction model used 115 relevant features. Almost all methods used by the CoEPrA participants applied an independent feature selection step before the final model building. It is worth mentioning that none of the participants of the CoEPrA contest was able to rank best for all four regression tasks. Our prediction method provides good scores using consistently the same method for all four CoEPrA regression tasks. It serves a double purpose namely a dramatic reduction of the number of used features by feature selection and a subsequent parameter optimization in a two-step learning procedure.
Feature selection via L1 regularization
Dependence of number of features and prediction performance
In this study we applied a two-step approach using first feature selection and subsequent model building. In the first stage L1 regularization is used to filter out redundant and irrelevant features. The remaining features are used in a second stage of model building in conjunction with L2 regularization. For both steps (p = 1, 2) an appropriate regularization strength, governed by λp, is crucial. If the regularization strength is too small, unimportant features may get a strong influence on the resulting prediction model. If it is too large, also relevant features will be removed resulting in poorer prediction performance. There are various ways to optimize the λp values. In this work the λp optimization has been done in an additional step using k times n-fold cross validation.
With the L1 regularization, we were able to select about one hundred or less features of the initial feature set. However, the set of selected features varied strongly with the regularization strength used to build the prediction model. We explained this behavior by the fact that these regression tasks employ feature sets with high redundancies as we used many physico-chemical features to describe each amino acid. Hence, several of these properties may be used equivalently to discriminate between amino acids. However, not only the selected number of features but also the prediction performance of the first stage depends strongly on the parameter value used for L1 regularization.
Applying L2 regularization for the selected features, we were able to achieve an even higher prediction performance for three of the four CoEPrA regression tasks. Hence, the feature selection with L1 regularization should always be followed by an L2 regularized model-building step.
Instead of a highly optimized solver to overcome the singular behavior occurring with L1 regularization, we made use of the simple Rprop algorithm . Although the Rrpop method is not able to set the feature weights exactly to zero the optimal solution is well approximated. Setting a threshold near zero and removing all features with an absolute weight below this threshold seems to be a simple alternative and may also be useful for other machine learning approaches.
The choice of loss function, eq. (3), can have a large influence on prediction results. Squared error loss functions, most often used, try to recall each data point of the training set as accurate as possible. Hence, errors in the training set may have a large influence on the predictor. In this work a loss function is used which increases very smoothly with increasing prediction errors weakening the influence of potential outliers.
The data sets of the CoEPrA contest are particularly valuable, since they offer the possibility to compare the own approach with a larger number of alternative approaches from different groups on equal footing. In one case (task I) our approach surpasses the best result obtained in the CoEPrA concourse, while in two cases (task II and IV) our results are at second rank. For task III the feature selection with L1 regularization was seemingly too rigorous. Hence, the results in the second stage were of lower quality than in the first stage where the second rank was obtained. However, we like to point out the following. Although, we did not make explicit use of knowledge on the prediction set, we knew them ahead. This can have a subtle influence on the details of the procedures we were selecting in the present study and may thus have provided a hidden advantage.
Previous studies have used reduced sets of descriptors to encode amino acid sequences [16–18]. These small descriptor sets were generated using multivariate statistical methods to reduce the dimension of the feature space by P rincipal C omponent A nalysis (PCA) . In PCA the main assumption is that the variance of a feature represents its information content. Hence, PCA tries to project the original features into a lower dimensional space such that the variance of data in the lower dimensional space is maximized. The advantage of these generated small feature sets is that they can be used instantly to build models using few descriptors. The proposed L1/L2 method on the other hand has to be performed for each specific task (here MHC binding affinity). This results in highly optimized feature sets, which we expect to perform better than feature sets obtained with a generally valid reduction scheme. It is also possible to use PCA for a specific task similar to conventional feature selection. In this case PCA is performed on the training data of the considered problem. We used PCA analysis in this manner in previous unpublished studies. PCA was able to reduce the number of features drastically. Still, using conventional feature selection resulted in better predictive power in most cases. This may be explained by the fact that PCA combines the original descriptors such that the variance in the reduced space is maximized. However, features with large variance carry not necessarily the most useful ones for a specific learning task. Furthermore, it has to be kept in mind that PCA features in the lower dimensional space are linear combinations of the original descriptors. Hence, interpretation of selected descriptors may be much less intuitive than by the use of a classical feature selection method.
The limitations of wet labs to generate larger data sets for a particular problem of interest may be due to various reasons. Mostly measurements may be expensive or complicated and time consuming to perform such as in vivo experiments. Another reason may be that the regarded problem is a quite new one. In all these cases in silico methods are highly requested as these are able to easily predict the desired target property of new compounds. Without precautions many machine learning methods fail in such situations. Hence, specialized methods are desired. Our proposed method achieved good prediction results for the four CoEPrA regression tasks. Furthermore, the number of molecular descriptors has been reduced drastically for the final prediction models. The CoEPrA data sets are representative for many biological classification and regression problems where small data sets of less than hundred are described by thousands of descriptors. Hence, we expect the proposed method to be applicable for many other machine learning tasks having same conditions.
Overview of used data sets.
In order to build a predictor the given oligo-peptides have to be transformed into a multidimensional computer readable representation. For this purpose a fixed number of features (molecular descriptors) collected in a vector are generated for each oligo-peptide of the training and test data sets. There are several possibilities to extract feature vectors from peptide sequences. A simple method for encoding protein sequences is to assume that all 20 native amino acids have the same degree of pair-wise similarity. This so-called sparse encoding represents each amino acid of the oligo-peptide by sub-vectors of 24 components. The first 20 components are used to identify one of the 20 native amino acids. The four additional components (*,B,X,Z) are normally set to zero. However, if chemical or crystallographic analyses of the peptide or protein yield no conclusive information on the identity of the residue, one of the four additional components is set to unity. The components *,B,X,Z refer to a gap, uncertainty in asparagine/aspartate, unspecified/unknown amino acid, and uncertainty in glutamine/glutamate, respectively. For the CoEPrA data sets, this results in 216 and 192 features for nona- and octo-peptides, respectively. Since in the CoEPrA data sets all amino acids are specified, the last four components of the corresponding sub-vectors are always zero.
Ignoring chemical similarities between amino acids is a quite abrasive way to describe sequences, as it is well known that some amino acid pairs share more similarity than others do. Hence, to account for chemical similarities between amino acids BLOSUM matrices (BLOcks of Amino Acid SUbstitution Matrix)  have also been used to generate feature vectors. BLOSUM matrices are commonly used to score the similarity of aligned sequences. They are generated from multiple sequence alignments of protein sequences comparing the number of divergent sequences. The off-diagonal elements refer to specific amino acid pairs. As more positive a BLOSUM matrix element is, as more likely can the corresponding amino acid pair interchange in a point mutation. There are several versions of BLOSUM matrices, i.e. BLOSUM40, BLOSUM62, and BLOSUM90. The digits XX after BLOSUMXX stand for the maximum percentage sequence identity of the sequences used to generate the matrix. Hence, matrices with a large XX value are suitable to compare closely related sequences whereas matrices with a small XX value are suitable to compare distantly related sequences. The BLOSUM62 matrix yielded good performances in previous research  and has been used in many sequence alignment applications . Therefore, in this study the BLOSUM62 matrix has been used, where every amino acid is coded using the corresponding row of the BLOSUM matrix. Analog to sparse encoding the resulting number of features for each nona- and octo-peptide using BLOSUM matrices is 216 and 192, respectively.
BLOSUM encoding describes the similarity between amino acids by a single value. A more detailed description can be achieved using physico-chemical descriptors for each amino acid type. The CoEPrA organizers provided a set of physico-chemical features for all four data sets. These features describe each amino acid by 643 physico-chemical properties taken from literature . This results in a total number of 5787 or 5144 features for nona- or octo-peptides, respectively.
The different encoding techniques have their own advantages and disadvantages. Hence, a combination of the presented three encoding types (physico-chemical, sparse encoding and BLOSUM encoding) has been used in this work which resulted in 5787+216+216 = 6219 and 5144+192+192 = 5528 features for nona- or octo-peptides, respectively. Every feature encoding technique represents a different property and therefore the features are in different units. Thus, it is important to normalize the feature vectors before training a classifier. For that purpose all features with a standard deviation of zero are removed as these features do not contain any information. The remaining features are shifted and scaled such that each feature possesses a mean of zero and a standard deviation of one. The features of the test set are scaled according to the parameters derived from the training set. An overview of all used data sets together with the number of generated features is shown in Table 2.
Linear scoring function
This function grows very slowly with deviations of the scoring function f from the target value m and is therefore less sensitive to outliers than a quadratic loss function that is commonly used. Once a hyperplane is determined, binding affinities (pIC50 values) of oligo-peptides to a MHC receptor can be calculated using the scoring function, eq. (2).
Regularization and feature selection
Without further knowledge, it is difficult to decide which features describe the real world objects best for a particular prediction problem. Hence, generally all features, which can be computed for the specific task, are used resulting in high dimensional feature vectors. On the other hand, the available amount of data points is quite limited especially for some biological problems. The "curse of dimensionality" states that the volume of the feature space increases rapidly as more features are added . However, in many classification and regression tasks only a comparatively small number of data are available. Hence, in most cases the high dimensional feature space will be almost empty. This makes it difficult to build a model that is able to generalize for unknown data, while it may work well for the training data. Using only relevant descriptors furthermore accelerates the whole prediction process and reduces the storage space needed. Models that only depend on few features are also much easier to interpret than models with hundreds or thousands of parameters and may allow gaining more insights to the underlying chemical processes allowing chemist to modify a compound amplifying favorable properties and weakening unfavorable ones. Hence, the goal is to reduce the dimension of the feature vector space by filtering irrelevant and redundant information.
The regularization terms used in the present study are with p = 1 and p = 2 called L1 (Lasso) and L2 (Ridge regression) regularization, respectively. Because of the quadratic form of the L2 regularization term none of the model weights will be set exactly to zero during learning and hence no explicit feature selection is done. On the other hand, Lasso regularization leads to sparse models due to its linear form where the weights of many features are set to zero rigorously, resulting in efficient feature selection.
In this work, these two regularization techniques have been used in a two-step procedure. At stage 1, the parameters are optimized using an objective function with L1 regularization. At this stage, irrelevant and redundant features are filtered out. The remaining features are used in the second stage to build a model using only L2 regularization. It is obvious that for this procedure the regularization parameters λ p will have a large influence on the results and are therefore chosen carefully. In this work, both parameter values are determined automatically by evaluating the prediction performance observing the error in k times n-fold cross validation. For that purpose, the training set is randomly divided into n parts. One part is retained as a validation set, while the other n-1 parts are used to train the prediction model. This process is repeated n times such that each part is used exactly once as validation set. The whole procedure is repeated k times and the average over these k times n-fold validation errors is the estimated performance. The λ p value that reveals the smallest validation error is then used to train the prediction model for the whole training set. In this work we set k = 5, n = 10 and the candidate sets λ1 ∈ [0.001, 0.005, 0.01, 0.05, 0.08, 0.1, 0.2, 0.3] and λ2 ∈ [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.93].
When using a least square loss function with L2 regularization the minimum of the objective function can be obtained with exact numeric solving a corresponding set of linear equations  with the Cholesky decomposition algorithm . For all other loss functions, the parameters minimizing the objective function can only be determined approximately using for instance a gradient descent procedure.
When using the L1 regularization the objective function possesses a discontinuous derivative whenever a parameter reaches zero. Various highly optimized algorithms are capable of minimizing an objective function involving the L1 regularization term. However, the provided implementations are limited to certain programming languages and loss function types complicating re-implementation and modification of the original algorithm. In this work the simple Resilient propagation (Rprop) algorithm  is used to approximate the optimal L1 regularized solution. The Rprop algorithm is a quite simple though effective minimization procedure, which can be implemented in just a few lines of code. Using Rprop in conjunction with a linear regularization term does not lead to weights exactly set to zero. Nevertheless, many weights are set to nearly zero values approximating the exact solution quite well. Hence, these features can be filtered out using a low threshold. In our work, all features with an absolute weight below 10-8 after the first training stage using L1 regularization have been removed before proceeding with stage 2.
where and are the measured value provided by CoEPrA and the predicted value for peptide i, respectively, and < pIC50exp > is the mean of all measured values. CoEPrA task IV has been evaluated using the Spearman Rank Correlation Coefficient (SRCC) . However, in order to simplify comparison of prediction performances with other tasks q2 has been computed as well for task IV.
The authors like to thank Dr. Ovidiu Ivanciuc for designing the CoEPrA contest in 2006. This work was supported by the International Research Training Group (IRTG) on "Genomics and Systems Biology of Molecular Networks" (GRK1360, German Research Foundation (DFG)).
- Demir-Kavuk O, Riedesel H, Knapp EW: Exploring classification strategies with the CoEPrA 2006 contest. Bioinformatics 26(5):603–9.Google Scholar
- Tibshirani R: Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society Series B-Methodological 1996, 58(1):267–288.Google Scholar
- Hoerl AE, Kennard RW: Ridge Regression - Biased Estimation For Nonorthogonal Problems. Technometrics 1970, 12(1):55. 10.2307/1267351View ArticleGoogle Scholar
- Zou H, Hastie T: Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society B 2005, 67: 301–320. 10.1111/j.1467-9868.2005.00503.xView ArticleGoogle Scholar
- Xu ZongBen, Z H, Wang Yao, Chang XiangYu, Yong Liang: L1/2 regularizer. SCIENCE CHINA 2010, 53(6):1159–1169.Google Scholar
- Andrew G, Gao J: Scalable training of L1-regularized log-linear models. ICML '™07 2007.Google Scholar
- Lee S, Lee H, Abbeel P, Ng A: Efficient L1 Regularized Logistic Regression. Proceedings of the Twenty-First National Conference on Artificial Intelligence (AAAI-06) 2006.Google Scholar
- Goodman J: Exponential Priors for Maximum Entropy Models . Proceedings of HLTNAACL 2004 2003.Google Scholar
- Roth V: The generalized LASSO. IEEE Trans Neural Netw 2004, 15(1):16–28. 10.1109/TNN.2003.809398View ArticlePubMedGoogle Scholar
- Perkins S, Theiler J: Online feature selection using grafting. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003. AAAI Press; 21–24.Google Scholar
- Riedmiller M, Braun H: A direct adaptive method for faster backpropagation learning: The Rprop algorithm. Proceedings of the IEEE International Conference on Neural Networks 1993, 586–591.View ArticleGoogle Scholar
- Breiman L: Random Forests. Machine Learning 2001, 45(1):5–32. 10.1023/A:1010933404324View ArticleGoogle Scholar
- Zou H: The adaptive lasso and its oracle properties. In Journal of the American Statistical Association. ASA; 2006:1418–1429.Google Scholar
- Fan JaLR: Variable selection via nonconcave penalized likelihood and its oracle properties. In Journal of the American Statistical Association. ASA; 2001:1348–1360.Google Scholar
- Atchley WR, Zhao J, Fernandes AD, Druke T: Solving the protein sequence metric problem. Proc Natl Acad Sci USA 2005, 102(18):6395–400. 10.1073/pnas.0408677102PubMed CentralView ArticlePubMedGoogle Scholar
- Georgiev AG: Interpretable numerical descriptors of amino acid space. J Comput Biol 2009, 16(5):703–23. 10.1089/cmb.2008.0173View ArticlePubMedGoogle Scholar
- Venkatarajan MS, Braun W: New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physicalâ€"chemical properties. Journal of Molecular Modeling 2001, 7(12):445–453. 10.1007/s00894-001-0058-5View ArticleGoogle Scholar
- Pearson K: On lines and planes of closest fit to systems of points in space. Philosophical Magazine 1901, 2(6):559–572.View ArticleGoogle Scholar
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89(22):10915–9. 10.1073/pnas.89.22.10915PubMed CentralView ArticlePubMedGoogle Scholar
- Altschul SF, et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Kawashima S, Ogata H, Kanehisa M: AAindex: Amino Acid Index Database. Nucleic Acids Res 1999, 27(1):368–369. 10.1093/nar/27.1.368PubMed CentralView ArticlePubMedGoogle Scholar
- Bellman R: Adaptive control processes - A guided tour. In Adaptive control processes - A guided tour. Princeton University Press; 1961:255.Google Scholar
- Shlens J: A Tutorial on Principal Component Analysis. 2005.Google Scholar
- Hansen L, et al.: Controlling feature selection in random forests of decision trees using a genetic algorithm: classification of class I MHC peptides. Comb Chem High Throughput Screen 2009, 12(5):514–9. 10.2174/138620709788488984View ArticlePubMedGoogle Scholar
- Patil D, et al.: Feature selection and classification employing hybrid ant colony optimization/random forest methodology. Comb Chem High Throughput Screen 2009, 12(5):507–13. 10.2174/138620709788488993View ArticlePubMedGoogle Scholar
- Riedesel H, Kolbeck B, Schmetzer O, Knapp EW: Peptide binding at class I major histocompatibility complex scored with linear functions and support vector machines. Genome Inform 2004, 15(1):198–212.PubMedGoogle Scholar
- Bau D III, Trefethen LN: Numerical linear algebra. Philadelphia: Society for Industrial and Applied Mathematics; 1997.Google Scholar
- Spearman C: The proof and measurement of association between two things. By C. Spearman, 1904. Am J Psychol 1987, 100(3–4):441–71.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.