Parameter advising for multiple sequence alignment
© DeBlasio and Kececioglu; licensee BioMed Central Ltd. 2015
Published: 28 January 2015
While the multiple sequence alignment output by an aligner depends strongly on the parameter values used for its alignment scoring function (i.e. the choice of gap penalties and substitution scores), most users rely on the single default parameter setting. A different parameter setting, however, might yield a much higher-quality alignment for a specific set of input sequences. The problem of picking a good choice of parameter values for a given set of input sequences is called parameter advising. A parameter advisor has two ingredients: (i) a set of parameter choices to select from, and (ii) an estimator that estimates the accuracy of a computed alignment. The parameter advisor then picks the parameter choice from the set whose resulting alignment has the highest estimated accuracy.
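The advising loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `advise`, the callables `align` and `estimate`, and the toy parameter names and scores are all hypothetical stand-ins for a real aligner and a real accuracy estimator.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Param = TypeVar("Param")          # a parameter choice (e.g. matrix + gap penalties)
Alignment = TypeVar("Alignment")  # whatever the aligner returns


def advise(
    parameter_set: Iterable[Param],
    align: Callable[[Param], Alignment],
    estimate: Callable[[Alignment], float],
) -> Tuple[Param, Alignment, float]:
    """Return the parameter choice whose alignment has the highest estimated accuracy."""
    best = None
    for choice in parameter_set:
        alignment = align(choice)    # run the aligner under this parameter choice
        score = estimate(alignment)  # estimated (not true) accuracy
        if best is None or score > best[2]:
            best = (choice, alignment, score)
    if best is None:
        raise ValueError("parameter_set must contain at least one choice")
    return best


# Hypothetical toy data: estimated accuracy of the alignment produced
# under each (made-up) parameter choice.
toy_estimates = {"default": 0.62, "low-gap-open": 0.71, "high-gap-open": 0.55}

choice, alignment, score = advise(
    toy_estimates,                           # iterating a dict yields its keys
    align=lambda p: ("alignment", p),        # stand-in for a real aligner call
    estimate=lambda a: toy_estimates[a[1]],  # stand-in for a real estimator
)
```

In this toy run the advisor selects "low-gap-open", since its alignment has the highest estimated accuracy. Note that the advisor never sees true accuracy; its quality depends entirely on how well the estimator ranks the candidate alignments.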
Our estimator Facet (Feature-based Accuracy Estimator) is a linear combination of real-valued feature functions of an alignment. We assume that the feature functions are given, as well as the universe of parameter choices from which the advisor's set is drawn. For this scenario we define the problem of learning an optimal advisor as finding the best possible parameter set for a collection of training data of reference alignments. Learning optimal advisor sets is NP-complete. For the advisor sets problem, we develop a greedy approximation algorithm that finds near-optimal sets of size at most k given an optimal solution of size ℓ<k. For the advisor estimator problem, we have an efficient method for finding the coefficients of the estimator that performs well in practice [2, 3].
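One way to picture the greedy set-learning step is the sketch below. It is an illustration under stated assumptions, not the published algorithm: it assumes training data given as per-benchmark dictionaries of estimated accuracy (`est`) and true accuracy (`acc`) for each parameter choice, seeds the set with an exhaustively found best set of size ℓ, and stops early when no addition improves training accuracy. All names, the tie-breaking rule, the stopping rule, and the toy numbers are hypothetical.

```python
from itertools import combinations


def advising_accuracy(param_set, est, acc):
    """Mean true accuracy obtained when, on each benchmark, the advisor
    picks the parameter in param_set with the highest estimated accuracy."""
    total = 0.0
    for est_row, acc_row in zip(est, acc):
        # sorted() makes tie-breaking deterministic (an assumption of this sketch)
        chosen = max(sorted(param_set), key=lambda p: est_row[p])
        total += acc_row[chosen]
    return total / len(acc)


def best_exhaustive(universe, size, est, acc):
    """Best set of exactly `size` parameters, by brute force (small sizes only)."""
    candidates = (frozenset(c) for c in combinations(sorted(universe), size))
    return max(candidates, key=lambda s: advising_accuracy(s, est, acc))


def greedy_advisor_set(universe, k, ell, est, acc):
    """Grow an optimal size-ell seed greedily to a set of size at most k."""
    current = set(best_exhaustive(universe, ell, est, acc))
    while len(current) < k:
        remaining = sorted(set(universe) - current)
        if not remaining:
            break
        best_p = max(remaining,
                     key=lambda p: advising_accuracy(current | {p}, est, acc))
        # Stop early if no addition helps; the result then has fewer than k members.
        if (advising_accuracy(current | {best_p}, est, acc)
                <= advising_accuracy(current, est, acc)):
            break
        current.add(best_p)
    return current


# Hypothetical training data: three benchmarks, three parameter choices.
est = [{"a": 0.9, "b": 0.5, "c": 0.4},
       {"a": 0.3, "b": 0.8, "c": 0.2},
       {"a": 0.5, "b": 0.4, "c": 0.9}]
acc = [{"a": 0.8, "b": 0.6, "c": 0.5},
       {"a": 0.4, "b": 0.9, "c": 0.3},
       {"a": 0.5, "b": 0.5, "c": 0.2}]

learned = greedy_advisor_set({"a", "b", "c"}, k=3, ell=1, est=est, acc=acc)
```

On this toy data the seed is {"b"}, adding "a" raises training accuracy, and adding "c" would lower it because the estimator overrates "c" on the third benchmark, so the learned set has two members even though k=3. A larger set is not automatically better when the estimator is imperfect.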
Our tool Facet (Feature-based Accuracy Estimator) is an easy-to-use, open-source utility for estimating the accuracy of a protein multiple sequence alignment. Facet computes the estimated accuracy of an alignment as a linear combination of real-valued feature functions. We considered 12 features, of which we found an optimal subset of 5 that provides the best performance for alignment advising. Many of the most useful features use information about protein secondary structure. We find the coefficients by fitting the difference in estimator values to the difference in true accuracy for pairs of examples where the correct alignment is known. This "difference fitting" approach is computationally efficient and yields an estimator that works well for advising.
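The idea behind difference fitting can be illustrated with a small least-squares sketch. The text does not say which error norm or solver is used, so the squared-error objective and the hand-written linear solver below are assumptions made for illustration; the function names, the two made-up features, and the toy accuracies are likewise hypothetical.

```python
def estimate(coeffs, features):
    """Linear estimator: weighted sum of an alignment's feature values."""
    return sum(c * f for c, f in zip(coeffs, features))


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("feature differences do not determine the coefficients")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_by_differences(examples, pairs):
    """Least-squares coefficients c minimizing, over the given pairs (i, j),
    the squared gap between c . (f_i - f_j) and (accuracy_i - accuracy_j).

    examples: list of (feature_vector, true_accuracy)
    pairs:    list of index pairs into examples
    """
    t = len(examples[0][0])
    dtd = [[0.0] * t for _ in range(t)]  # normal-equation matrix D^T D
    dty = [0.0] * t                      # right-hand side D^T y
    for i, j in pairs:
        (fi, ai), (fj, aj) = examples[i], examples[j]
        d = [x - y for x, y in zip(fi, fj)]  # feature difference
        y = ai - aj                          # true-accuracy difference
        for r in range(t):
            dty[r] += d[r] * y
            for c in range(t):
                dtd[r][c] += d[r] * d[c]
    return solve_linear_system(dtd, dty)


# Hypothetical examples with two features; accuracies were generated as
# 0.3*f1 + 0.7*f2 + 0.1, so a constant offset is present in the data.
examples = [((0.2, 0.5), 0.51), ((0.8, 0.1), 0.41), ((0.4, 0.9), 0.85)]
coeffs = fit_by_differences(examples, pairs=[(0, 1), (0, 2), (1, 2)])
```

Here the fitted coefficients come out at approximately (0.3, 0.7) even though the toy accuracies carry a constant offset of 0.1: the offset cancels in every difference. This reflects why fitting differences suits advising, where only the relative ranking of candidate alignments matters, not the absolute value of the estimate.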
The Facet website provides parameter sets that can be used with the Opal aligner (namely substitution matrices and affine gap penalties), as well as scripts for structure prediction.
While the new problem of learning optimal parameter sets for an advisor is NP-complete, in practice our greedy approximation algorithm efficiently learns parameter sets that are remarkably close to optimal. Moreover, these parameter sets significantly boost the accuracy of an aligner compared to a single default parameter choice, when advising using the best accuracy estimators from the literature.
- DeBlasio DF, Kececioglu JD: Learning Parameter Sets for Alignment Advising. Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB). 2014, 10.1145/2649387.2649448.
- DeBlasio DF, Wheeler TJ, Kececioglu JD: Estimating the accuracy of multiple alignments and its use in parameter advising. Proceedings of the 16th Conference on Research in Computational Molecular Biology (RECOMB). 2012, 45-59. 10.1007/978-3-642-29627-7_5.
- Kececioglu JD, DeBlasio DF: Accuracy Estimation and Parameter Advising for Protein Multiple Sequence Alignment. Journal of Computational Biology. 2013, 20 (4): 259-279. 10.1089/cmb.2013.0007.
- Wheeler TJ, Kececioglu JD: Multiple alignment by aligning alignments. Bioinformatics. 2007, 23 (13): 559-68. 10.1093/bioinformatics/btm226.
- Wheeler TJ, Kececioglu JD: Opal: multiple sequence alignment software, Version 2.1.0. 2012, [http://opal.cs.arizona.edu]
- Chang JM, Tommaso PD, Notredame C: TCS: A new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Molecular Biology and Evolution. 2014, 10.1093/molbev/msu117.
- Lassmann T, Sonnhammer ELL: Automatic assessment of alignment quality. Nucleic Acids Research. 2005, 33 (22): 7120-7128. 10.1093/nar/gki1020.
- Ahola V, Aittokallio T, Vihinen M, Uusipaikka E: Model-based prediction of sequence alignment quality. Bioinformatics. 2008, 24 (19): 2165-2171. 10.1093/bioinformatics/btn414.
- DeBlasio DF, Kececioglu JD: Facet: software for accuracy estimation of protein multiple sequence alignments, Version 1.1. 2014, [http://facet.cs.arizona.edu]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.