Improving protein fold recognition by random forest
© Jo and Cheng; licensee BioMed Central Ltd. 2014
Published: 21 October 2014
Recognizing the correct structural fold for a target protein among known template protein structures (i.e. fold recognition) is essential for template-based protein structure modeling. Since fold recognition can be cast as a binary classification problem of predicting whether or not the unknown fold of a target protein is similar to an already known template protein structure in a library, machine learning methods have been applied effectively to tackle it. In this work, we developed RF-Fold, which uses random forest, one of the most powerful and scalable machine learning classification methods, to recognize protein folds.
RF-Fold consists of hundreds of decision trees that can be trained efficiently on very large datasets to make accurate predictions even on a highly imbalanced dataset. We evaluated RF-Fold through cross-validation on the standard Lindahl's benchmark dataset comprising 976 × 975 target-template protein pairs. Compared with 17 different fold recognition methods, RF-Fold performs comparably to the best of them at all three difficulty levels, from the easiest family level through the medium-hard superfamily level to the hardest fold level. Based on the top-one template protein ranked by RF-Fold, the correct recognition rate is 84.5%, 63.4%, and 40.8% at the family, superfamily, and fold levels, respectively. Based on the top-five template proteins ranked by RF-Fold, the correct recognition rate increases to 91.5%, 79.3%, and 58.3% at the three levels, respectively.
The good performance achieved by RF-Fold demonstrates the effectiveness of random forest for protein fold recognition.
Proteins are the fundamental functional units in living systems. Knowledge of protein tertiary (three-dimensional) structures at the molecular level is necessary to understand protein function. However, due to the significant cost of experimentally determining the tertiary structures of proteins, the number of known 3D protein structures is about 200 times smaller than the number of known protein sequences [1, 2]. Therefore, it is important to develop computational methods to predict protein structures from protein sequences. Recognizing a known structure that is similar to the unknown structure of a target protein (i.e. fold recognition) is an important step of the template-based protein structure modeling approach, which uses the known structure as a template to construct a structural model for the target protein [4, 5].
Since the number of unique protein structures appears to be limited (e.g., several thousand) according to structural analyses of all the tertiary protein structures in the Protein Data Bank (PDB), it is possible to identify one correct template structure (fold) for a large portion of target proteins. This is particularly the case if a target protein has significant sequence identity with one of the template proteins with a known tertiary structure. Fold recognition becomes very challenging when the sequence identity between the target protein and the template proteins is low, i.e., in the twilight zone. Numerous research endeavors have been devoted to developing sensitive methods that improve fold recognition in the twilight zone. Machine learning methods have been used to tackle the problem effectively by casting fold recognition as a binary classification problem: deciding whether or not a target protein shares the same structural fold with a template protein in a protein structure library [6–8].
Given a number of features describing the pairwise similarity between two proteins (e.g., a target protein and a template protein), the objective of the classification is to predict whether the two proteins share a similar tertiary structure (fold). The problem can be divided into three difficulty levels, ranging from the easiest family level (i.e. two proteins belonging to the same family), through the superfamily level, to the hardest fold level. This division roughly corresponds to the decrease in sequence identity between the two proteins. Proteins sharing similar structures have relatively high sequence similarity if they are in the same family, moderate or little sequence similarity if in the same superfamily, and almost no sequence similarity if only in the same fold.
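As a concrete illustration, the three difficulty levels can be read off a pair of SCOP identifiers, which encode class.fold.superfamily.family (e.g. "a.1.1.2"). The helper below is our own sketch, not part of RF-Fold; it returns the deepest classification level shared by two domains:

```python
def pair_level(sccs_a, sccs_b):
    """Classify a protein pair by the deepest shared SCOP level.

    SCOP sccs strings look like 'a.1.1.2' (class.fold.superfamily.family).
    Returns 'family', 'superfamily', 'fold', or None (no shared fold).
    """
    a, b = sccs_a.split("."), sccs_b.split(".")
    if a[:4] == b[:4]:
        return "family"        # same family (easiest level)
    if a[:3] == b[:3]:
        return "superfamily"   # same superfamily, different family
    if a[:2] == b[:2]:
        return "fold"          # same fold only (hardest level)
    return None                # structurally unrelated pair

print(pair_level("a.1.1.2", "a.1.1.3"))  # superfamily
print(pair_level("a.1.1.2", "a.1.2.1"))  # fold
```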
Random forest is one of the most powerful machine learning methods, known for its good interpretability and its efficiency in handling very large training datasets. Random forest grows a large number of decision trees, each based on a subset of randomly selected features and a fraction of randomly selected training data points. All the trained trees are applied to a new data point to make a prediction, and the majority vote of the ensemble is used as the final prediction for that data point. Averaging the decisions of a large number of trees makes random forest robust against noisy data, irrelevant features, and imbalanced class distributions. Random forest has delivered excellent performance across a broad range of classification tasks: it compares favorably with other ensemble classifiers such as AdaBoost, and its performance is generally comparable to other state-of-the-art classifiers such as the Support Vector Machine (SVM). Random forest has been used extensively in a wide variety of domains [12–14], including protein fold classification [15–17], which is related to, but different from, protein fold recognition. The fold recognition problem addressed in this paper is to recognize proteins that have tertiary structures similar to those of target proteins, whereas protein fold classification [15–17] assigns a single protein sequence to one of a number of structural folds. In contrast, we applied random forest to classify whether a pair of proteins (one target protein and one template protein) shares the same structure. The classification scores are then used to rank template proteins by their structural relevance (i.e. the classification score) to a target protein.
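The final ranking step described above is straightforward to sketch: for each target, templates are sorted by descending classification score. The score values and template names below are hypothetical, purely for illustration:

```python
import numpy as np

def rank_templates(scores, template_ids, top_k=5):
    """Rank template proteins for one target by descending classifier score.

    scores: one classification score per (target, template) pair.
    Returns the template ids of the top_k highest-scoring templates.
    """
    order = np.argsort(scores)[::-1]           # indices, highest score first
    return [template_ids[i] for i in order[:top_k]]

# Hypothetical scores for one target against four templates
ids = ["T1", "T2", "T3", "T4"]
print(rank_templates(np.array([0.12, 0.87, 0.55, 0.90]), ids, top_k=2))
# ['T4', 'T2']
```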
Many methods have been developed to improve the accuracy of recognizing structurally similar folds when there is little sequence similarity between a target and a template protein, such as PSI-BLAST, HMMER, SAM-T98, SSHMM, THREADER, FUGUE, SPARKS, SP3, HHpred, FOLDpro, SP4, SP5, RAPTOR, SPARKS-X, and BoostThreader.
In this work, we applied the random forest method (i.e. RF-Fold) to address the fold recognition problem and evaluated its performance on the standard Lindahl's dataset, on which many previously established methods had been benchmarked. In comparison with 17 existing methods, RF-Fold's performance was comparable to that of the state-of-the-art methods, demonstrating the effectiveness of the random forest method in protein fold recognition.
Random forest method for protein fold recognition
The decision tree method for classification has been widely used in many domains due to its simplicity and good interpretability since Breiman et al. introduced it in 1984. However, the accuracy of a single decision tree is often lower than that of more advanced classification methods such as support vector machines or neural networks, which limits its application in accuracy-critical domains. More recent developments in decision tree methodology found that an ensemble of decision trees constructed from randomly selected features and training data not only often yields significantly higher accuracy than a single decision tree [34, 35], but also often surpasses the accuracy of the most advanced machine learning methods. This approach is called random forest. Random forest is a meta-learning algorithm for classification consisting of a bag of separately trained decision trees. It therefore inherits the advantages of decision tree methods, such as easy training, fast prediction, and good interpretability. Because random forest selects a random subset of input features to construct each decision tree, the averaged prediction of a sufficient number of decision trees is robust against the presence of irrelevant features, which partially accounts for its good accuracy. Furthermore, the random selection of a subset of training data to train each tree also leads to an ensemble of decision trees that is resistant to noise and to a disproportionate class distribution in the training data.
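The recipe described above (a bootstrap sample of the training data per tree, a random feature subset at each split, and a majority vote at prediction time) can be sketched as follows. This is an illustrative re-implementation built on scikit-learn decision trees, not the code used for RF-Fold, and the toy data are our own:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def fit_forest(X, y, n_trees=50):
    """Bagged ensemble of decision trees: each tree is trained on a
    bootstrap sample of the rows, and max_features='sqrt' makes each
    split consider only a random subset of the features."""
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), len(X))   # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 30)))
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest

def predict_forest(forest, X):
    """Majority vote over all trees in the ensemble."""
    votes = np.stack([t.predict(X) for t in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Toy check on a linearly separable binary problem
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
forest = fit_forest(X, y)
print(predict_forest(forest, X[:5]))
```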
Data set and features
We trained and tested RF-Fold on the FOLDpro dataset. The FOLDpro dataset used the proteins in Lindahl's benchmark dataset derived from the SCOP database (version 1.39). Lindahl's dataset includes 976 proteins, among which 555 proteins have at least one positive match with other proteins at the family level, 434 proteins at the superfamily level, and 321 proteins at the fold level. The pairwise sequence identity of any pair in the dataset is at most 40%. In the FOLDpro dataset, 84 features were extracted for each of the 976 × 975 distinct protein pairs in order to classify whether a pair of proteins (one target/query protein and one template protein) shares the same structure at the family, superfamily, or fold level. The features were extracted using existing, general-purpose alignment tools as well as protein structure prediction programs in five categories: sequence/family information, sequence-sequence alignment, sequence-profile alignment, profile-profile alignment, and structural information. For the sequence/family information features, the compositions of single amino acids (monomers) and ordered pairs of amino acids (dimers) were computed and transformed into similarity scores using the cosine, correlation, and Gaussian kernel functions. For sequence-sequence alignment features, PALIGN and CLUSTALW were used to extract pairwise features associated with the sequence alignment scores of a pair of proteins. For sequence-profile alignment features, PSI-BLAST, HMMER's hmmsearch, and IMPALA were used to extract profile-sequence alignment features between the target profile and the template sequence. For profile-profile alignment features, five profile-profile alignment tools (CLUSTALW, COACH of LOBSTER, COMPASS, HHSearch, and PRC (Profile Comparer, http://supfam.org/PRC)) were used to align target and template profiles to obtain profile-profile alignment scores.
For structural features, based on the global profile-profile alignments obtained with LOBSTER, structural features of query proteins predicted using the SCRATCH suite [45–49] were compared with those of template proteins to obtain structural compatibility scores.
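For illustration, the first feature category (composition vectors turned into kernel similarity scores) can be sketched as follows. This is a monomer-only sketch; the kernel parameters used by FOLDpro are not specified here, so the Gaussian width below is an assumed placeholder:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def monomer_composition(seq):
    """20-dimensional amino-acid composition vector (frequencies)."""
    counts = np.array([seq.count(a) for a in AA], dtype=float)
    return counts / max(len(seq), 1)

def cosine_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def correlation_sim(u, v):
    return float(np.corrcoef(u, v)[0, 1])

def gaussian_sim(u, v, gamma=1.0):   # gamma is an assumed placeholder width
    return float(np.exp(-gamma * np.sum((u - v) ** 2)))

# Three pairwise similarity features for a (query, template) pair
q = monomer_composition("ACDEFGHIK")
t = monomer_composition("ACDEFGHIKL")
features = [cosine_sim(q, t), correlation_sim(q, t), gaussian_sim(q, t)]
```

The dimer (ordered amino-acid pair) composition works the same way over a 400-dimensional count vector.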
The small portion of pairs belonging to the same protein family, superfamily, or fold was labelled as positive examples because they shared the same structural folds. The vast majority of protein pairs that did not have structural similarities were labelled as negative examples.
Training and benchmarking
We divided all protein pairs into 10 equal-size subsets for 10-fold cross-validation, placing all the target-template pairs associated with the same target protein into the same subset. Nine subsets were used for training and the remaining subset for validation, and we removed all the pairs in the training dataset that used targets in the test dataset as templates. This procedure was repeated 10 times, and the sensitivity and specificity of fold recognition were computed across the 10 trials. We also compared RF-Fold with 17 other methods by the fold recognition rates for top-one and top-five ranked templates, as in [5, 32]. Using the same evaluation procedure as in [5, 29–32], we calculated the sensitivity by taking as predictions the top-one or the top-five template proteins ranked for each target protein by classification score. Here, sensitivity was defined as the percentage of target proteins (with at least one possible hit) having at least one correct template ranked 1st, or within the top 5 [5, 32].
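The top-one/top-five sensitivity defined above can be computed as in the following sketch; the tuple layout is our own assumption for illustration, not the FOLDpro data format:

```python
from collections import defaultdict

def topk_sensitivity(pairs, k=1):
    """pairs: iterable of (target, template, score, is_correct) tuples.

    A target counts as a hit if any correct template ranks within its
    top k by score. Only targets with at least one possible correct
    template are counted, matching the evaluation described above.
    """
    by_target = defaultdict(list)
    for target, template, score, ok in pairs:
        by_target[target].append((score, ok))
    hits = total = 0
    for rows in by_target.values():
        if not any(ok for _, ok in rows):
            continue                         # no possible hit for this target
        total += 1
        rows.sort(key=lambda r: r[0], reverse=True)
        if any(ok for _, ok in rows[:k]):
            hits += 1
    return hits / total

# Tiny example: target A misses at top-1, target B hits at top-1
pairs = [("A", "t1", 0.9, False), ("A", "t2", 0.8, True),
         ("B", "t1", 0.7, True), ("B", "t2", 0.6, False)]
print(topk_sensitivity(pairs, k=1))  # 0.5
```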
Comparison of random forest with a single decision tree
Table 1: Error rate (%) of the random forest and of a single decision tree on the fold recognition dataset.
Effects of data imbalance on random forest
It is difficult to train a classifier on a highly imbalanced dataset in which one or more classes are extremely under-represented; the significant drawbacks of training on data with an imbalanced class distribution have been widely reported. The FOLDpro dataset is heavily imbalanced: it has 7,438 positive examples versus 944,162 negative examples, a ratio between the majority class and the minority class of roughly 127:1. Training on such a dataset is difficult for most machine learning methods in general.
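A common remedy, and the one the following experiment varies, is to down-sample the negative class to a fixed negative-to-positive ratio before training. A minimal sketch (the ratio value is arbitrary):

```python
import numpy as np

def downsample_negatives(X, y, neg_pos_ratio=5, seed=0):
    """Keep all positive examples and a random subset of negatives so
    that #negatives = neg_pos_ratio * #positives, reducing the class
    imbalance the classifier sees during training."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_keep = min(len(neg), neg_pos_ratio * len(pos))
    keep = np.concatenate([pos, rng.choice(neg, n_keep, replace=False)])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Toy imbalanced set: 10 positives, 1000 negatives -> 10 positives, 50 negatives
y = np.array([1] * 10 + [0] * 1000)
X = np.zeros((len(y), 4))
X_bal, y_bal = downsample_negatives(X, y, neg_pos_ratio=5)
print(len(y_bal), int(y_bal.sum()))  # 60 10
```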
Table 2: The number of template folds correctly predicted by random forest at the family, superfamily, and fold levels under various ratios of negative to positive training examples.
Effect of the number of features
Comparing RF-Fold with existing fold recognition methods
Table 3: The sensitivity of 18 methods on Lindahl's dataset.
RF-Fold performed better than most of the methods in Table 3 and comparably to RAPTOR, SPARKS-X, and BoostThreader. Compared with RAPTOR, RF-Fold shows some improvement in accuracy in most settings, though it performed worse at the top-one family level and the top-five fold level. Compared with SPARKS-X, RF-Fold was less accurate at the fold level, but more accurate at the other two levels. Compared with BoostThreader, RF-Fold was less accurate in top-one predictions at all three levels, but more accurate in top-five predictions at all three levels.
Availability of RF-Fold software and source code
To facilitate the reuse and further development of the RF-Fold method, the online web service for fold recognition, the source code of the random forest learning and classification programs, the scripts for generating pairwise features for a pair of proteins, the scripts for evaluating fold recognition results, and the training and test datasets are released at http://calla.rnet.missouri.edu/rf-fold/. The readme.txt file describes how to train and test the random forest method for fold recognition (the RF_learn and RF_classify programs), how to evaluate performance on the benchmark dataset (Calculate-lindahl-Top1-Top5.sh), the datasets used for cross-validation, and the scripts used to generate the 84 pairwise features for a pair of proteins (32 Perl scripts in the scripts_feature_generation sub-directory). Based on the documentation and programs, any user can create his/her own training and test datasets and train and test his/her own random forest classifier for protein fold recognition from scratch. The software, source code, and data are released under the GNU General Public License; anyone can freely reuse them for any purpose (e.g., protein fold recognition, homology detection, and protein tertiary structure prediction). Any technical problems may be addressed to the corresponding author by email. Based on users' feedback, additional documentation, utility programs, test examples, and data will be added to facilitate the development of random forest methods for protein fold recognition.
In this study, we developed a random forest method (RF-Fold) to recognize protein folds. The method was systematically validated by varying the input features and the class distribution of the training datasets on a standard fold recognition dataset. The random forest consisting of 500 decision trees yielded a lower error rate than a single decision tree on a highly imbalanced dataset, and it delivered a good, steady performance regardless of the ratio of negative to positive examples. Compared with 17 other fold recognition methods, the performance of RF-Fold is generally comparable to the best performance. The results achieved by RF-Fold demonstrate the effectiveness of the random forest algorithm in protein fold recognition. In the future, we plan to further evaluate the performance of RF-Fold on a standard protein homology detection dataset and on independent CASP datasets, and to build a protein tertiary structure prediction web server based on RF-Fold for the community to use. Furthermore, the sensitivity of RF-Fold on the hardest fold recognition problem at the fold level is still relatively low (e.g. 40.8% for top-one predictions and 58.3% for top-five predictions), which is one of the major bottlenecks of template-based protein structure modeling. We will incorporate more informative features into RF-Fold to address this problem in the future.
The work was partially supported by an NIH grant (R01GM093123) to JC.
The publication charges for this article were funded by NIH grant (R01GM093123) to JC. Any opinions, findings, and conclusions expressed in this article are those of the authors and do not necessarily reflect the views of the National Institutes of Health.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 11, 2014: Proceedings of the 11th Annual MCBIOS Conference. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S11.
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
- Bairoch A, Bougueleret L, Altairac S, Amendolia V, Auchincloss A, Puy GA, Axelsen K, Baratin D, Blatter M, Boeckmann B: The universal protein resource (UniProt). Nucleic Acids Res. 2008, 36: D190-D195. 10.1093/nar/gkn141.
- Cheng J: A Multi-Template Combination Algorithm for Protein Comparative Modeling. BMC Structural Biology. 2008, 8: 18. 10.1186/1472-6807-8-18.
- Jones DT, Taylor WR, Thornton JM: A new approach to protein fold recognition. Nature. 1992, 358: 86-89. 10.1038/358086a0.
- Cheng J, Baldi P: A Machine Learning Information Retrieval Approach to Protein Fold Recognition. Bioinformatics. 2006, 22: 1456-1463. 10.1093/bioinformatics/btl102.
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995, 247: 536-540.
- Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH: a hierarchic classification of protein domain structures. Structure. 1997, 5: 1093-1108. 10.1016/S0969-2126(97)00260-8.
- Cheng J, Tegge AN, Baldi P: Machine learning methods for protein structure prediction. IEEE Rev Biomed Eng. 2008, 41-49.
- Breiman L: Random Forests. Machine Learning. 2001, 45: 5-32. 10.1023/A:1010933404324.
- Freund Y, Schapire RE: A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence. 1999, 14: 771-780.
- Livingston F: Implementation of Breiman's random forest machine learning algorithm. Machine Learning Journal Paper. 2005, ECE591Q.
- Lariviere B, Van den Poel D: Predicting Customer Retention and Profitability by Using Random Forests and Regression Forests Techniques. Journal of Expert Systems with Applications. 2005, 29 (2): 472-482. 10.1016/j.eswa.2005.04.043.
- Xu P, Jelinek F: Random Forests and the Data Sparseness Problem in Language Modeling. Journal of Computer Speech and Language. 2007, 21 (1): 105-152.
- Peters J, De Baets B, Verhoest NEC, Samson R, Degroeve S, De Becker P, Huybrechts W: Random Forests as a Tool for Ecohydrological Distribution Modelling. Journal of Ecological Modelling. 2007, 207 (2-4): 304-318. 10.1016/j.ecolmodel.2007.05.011.
- Dehzangi A, Phon-Amnuaisuk S, Dehzangi O: Using Random Forest for Protein Fold Prediction Problem: An Empirical Study. Journal of Information Science and Engineering. 2010, 26: 1941-1956.
- Chen K, Kurgan L: PFRES: protein fold classification by using evolutionary information and predicted secondary structure. Bioinformatics. 2007, 23 (21): 2843-2850. 10.1093/bioinformatics/btm475.
- Jain P, Garibaldi JM, Hirst JD: Supervised machine learning algorithms for protein structure classification. Computational Biology and Chemistry. 2009, 33 (3): 216-223. 10.1016/j.compbiolchem.2009.04.004.
- Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25: 3389-3402. 10.1093/nar/25.17.3389.
- Eddy S: Profile hidden Markov models. Bioinformatics. 1998, 14: 755-763. 10.1093/bioinformatics/14.9.755.
- Karplus K, Barrett C, Hughey R: Hidden Markov models for detecting remote protein homologies. Bioinformatics. 1998, 14: 846-856. 10.1093/bioinformatics/14.10.846.
- Hargbo J, Elofsson A: A study of hidden Markov models that use predicted secondary structures for fold recognition. Proteins. 1999, 36: 68-87. 10.1002/(SICI)1097-0134(19990701)36:1<68::AID-PROT6>3.0.CO;2-1.
- Jones D, Taylor W, Thornton J: A new approach to protein fold recognition. Nature. 1992, 358: 86-89. 10.1038/358086a0.
- Shi J, Blundell T, Mizuguchi K: FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol. 2001, 310: 243-257. 10.1006/jmbi.2001.4762.
- Zhou H, Zhou Y: Single-body residue-level knowledge-based energy score combined with sequence-profile and secondary structure information for fold recognition. Proteins. 2004, 55: 1005-1013. 10.1002/prot.20007.
- Zhou H, Zhou Y: Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments. Proteins. 2005, 58: 321-328.
- Söding J: Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005, 21 (7): 951-960. 10.1093/bioinformatics/bti125.
- Liu S, Zhang C, Liang S, Zhou Y: Fold recognition by concurrent use of solvent accessibility and residue depth. Proteins. 2007, 68 (3): 636-645. 10.1002/prot.21459.
- Zhang W, Liu S, Zhou Y: SP5: improving protein fold recognition by using torsion angle profiles and profile-based gap penalty model. PLoS One. 2008, 3 (6): e2325. 10.1371/journal.pone.0002325.
- Xu J, Li M, Kim D, Xu Y: RAPTOR: optimal protein threading by linear programming. Journal of Bioinformatics and Computational Biology. 2003, 1 (1): 95-117. 10.1142/S0219720003000186.
- Yang Y, Faraggi E, Zhao H, Zhou Y: Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of the query and corresponding native properties of templates. Bioinformatics. 2011, 27 (15): 2076-2082. 10.1093/bioinformatics/btr350.
- Peng J, Xu J: Boosting Protein Threading Accuracy. Res Comput Mol Biol. 2009, 5541: 31-45. 10.1007/978-3-642-02008-7_3.
- Lindahl E, Elofsson A: Identification of related proteins on family, superfamily and fold level. J Mol Biol. 2000, 295: 613-625. 10.1006/jmbi.1999.3377.
- Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. 1984, New York: Chapman and Hall.
- Schapire RE: The strength of weak learnability. Machine Learning. 1990, 5 (2): 197-227.
- Ho TK: Random decision forests. Proceedings of the 3rd Int'l Conf on Document Analysis and Recognition: 14-18 August 1995, Montreal. 1995, 278-282.
- Chawla NV, Japkowicz N, Kotcz A: Editorial: special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsletter. 2004, 6 (1): 1-6. 10.1145/1007730.1007733.
- Liaw A, Wiener M: Classification and Regression by randomForest. R News. 2002, 2: 18-22.
- Ohlson T, Wallner B, Elofsson A: Profile-profile methods provide improved fold-recognition: a study of different profile-profile alignment methods. Proteins. 2004, 57: 188-197. 10.1002/prot.20184.
- Thompson J, Higgins D, Gibson T: CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680. 10.1093/nar/22.22.4673.
- Eddy S: Profile hidden Markov models. Bioinformatics. 1998, 14: 755-763. 10.1093/bioinformatics/14.9.755.
- Schaffer A, Wolf Y, Ponting C, Koonin E, Aravind L, Altschul S: IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics. 1999, 15: 1000-1011. 10.1093/bioinformatics/15.12.1000.
- Edgar R, Sjolander K: COACH: profile-profile alignment of protein families using hidden Markov models. Bioinformatics. 2004, 20: 1309-1318. 10.1093/bioinformatics/bth091.
- Sadreyev R, Grishin N: COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol. 2003, 326: 317-336. 10.1016/S0022-2836(02)01371-2.
- Soding J: Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005, 21: 951-960. 10.1093/bioinformatics/bti125.
- Pollastri G, Baldi P, Fariselli P, Casadio R: Prediction of coordination number and relative solvent accessibility in proteins. Proteins. 2001, 47 (2): 142-153.
- Pollastri G, Przybylski D, Rost B, Baldi P: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins. 2001, 47 (2): 228-235.
- Pollastri G, Baldi P: Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners. Bioinformatics. 2002, 18 (Suppl 3): S62-S70.
- Cheng J, Randall A, Sweredoski M, Baldi P: SCRATCH: a protein structure and structural feature prediction server. Nucleic Acids Res. 2005, 33: W72-76. 10.1093/nar/gki396.
- Cheng J, Baldi P: Three-stage prediction of protein beta-sheets by neural networks, alignments, and graph algorithms. Bioinformatics. 2005, 21 (Suppl 1): i75-i84. 10.1093/bioinformatics/bti1004.
- Liao L, Noble WS: Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. Journal of Computational Biology. 2003, 10 (6): 857-868. 10.1089/106652703322756113.
- Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A: Critical assessment of methods of protein structure prediction - Round VII. Proteins. 2007, 69 (S8): 3-9. 10.1002/prot.21767.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.