Volume 16 Supplement 2

## Highlights from the Tenth International Society for Computational Biology (ISCB) Student Council Symposium 2014

# Parameter advising for multiple sequence alignment

- Dan DeBlasio^{1} (Email author)
- John Kececioglu^{1}

**16(Suppl 2)**:A3

https://doi.org/10.1186/1471-2105-16-S2-A3

© DeBlasio and Kececioglu; licensee BioMed Central Ltd. 2015

**Published:** 28 January 2015

## Background

While the multiple sequence alignment output by an aligner strongly depends on the parameter values used for its alignment scoring function (i.e., the choice of gap penalties and substitution scores), most users rely on the single default parameter setting. A different parameter setting, however, might yield a much higher-quality alignment for a specific set of input sequences. The problem of picking a good choice of parameter values for a given set of input sequences is called parameter advising. A *parameter advisor* has two ingredients: (i) a *set* of parameter choices to select from, and (ii) an *estimator* that estimates the accuracy of a computed alignment. The parameter advisor then picks the parameter choice from the set whose resulting alignment has the highest estimated accuracy.
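The advising loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function `advise`, the stand-ins `toy_align` and `toy_estimate`, and the numbers in `TOY_ESTIMATES` are all hypothetical and exist only to show the control flow (align once per parameter choice, estimate accuracy, keep the best).

```python
from typing import Callable, Iterable, Sequence, Tuple, Any

def advise(
    sequences: Sequence[str],
    parameter_set: Iterable[Any],
    align: Callable[[Sequence[str], Any], Any],
    estimate: Callable[[Any], float],
) -> Tuple[Any, Any, float]:
    """Return (parameter choice, alignment, estimated accuracy) for the
    choice whose alignment has the highest estimated accuracy."""
    best = None
    for params in parameter_set:
        alignment = align(sequences, params)   # one alignment per choice
        score = estimate(alignment)            # estimated, not true, accuracy
        if best is None or score > best[2]:
            best = (params, alignment, score)
    if best is None:
        raise ValueError("parameter_set must contain at least one choice")
    return best

# Hypothetical stand-ins: keys play the role of (gap-open, gap-extension)
# penalties, and the values are made-up estimator outputs.
TOY_ESTIMATES = {(10, 1): 0.62, (12, 2): 0.81, (8, 4): 0.47}

def toy_align(sequences, params):
    # A real aligner would compute an alignment; here we only tag the choice.
    return ("alignment-for", params)

def toy_estimate(alignment):
    return TOY_ESTIMATES[alignment[1]]

choice, alignment, score = advise(
    ["ACDE", "ACE"], list(TOY_ESTIMATES), toy_align, toy_estimate
)
```

In this toy run the advisor selects `(12, 2)`, the choice with the largest estimate. Note that the advisor never sees true accuracy, so its quality depends entirely on how well the estimator ranks the candidate alignments.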

Our estimator Facet (**F**eature-based **Ac**curacy **E**s**t**imator) is a linear combination of real-valued feature functions of an alignment. We assume that the feature functions are given, as well as the universe of parameter choices from which the advisor's set is drawn. For this scenario we define the problem of learning an optimal advisor as finding the best possible parameter set for a collection of training data of reference alignments. Learning optimal advisor sets is NP-complete [1]. For the advisor set problem, we develop a greedy $\frac{\ell}{k}$-approximation algorithm that finds near-optimal sets of size at most *k*, given an optimal solution of size ℓ < *k*. For the advisor estimator problem, we have an efficient method for finding the coefficients of the estimator that performs well in practice [2, 3].
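A greedy set-building procedure of this kind might look like the following sketch. It is written under stated assumptions rather than taken from [1]: parameter choices are integer indices, `estimated[b][p]` and `true_acc[b][p]` are hypothetical training tables of estimated and true accuracy for benchmark `b` under choice `p`, ties are broken by list order, and the loop stops early when no addition strictly improves training accuracy. The seed set of size ℓ is found by brute force, which is only practical for small ℓ.

```python
from itertools import combinations

def advising_accuracy(param_set, estimated, true_acc):
    """Average true accuracy obtained when, for each benchmark, the advisor
    picks the choice in param_set with the highest estimated accuracy."""
    total = 0.0
    for est_row, acc_row in zip(estimated, true_acc):
        chosen = max(param_set, key=lambda p: est_row[p])
        total += acc_row[chosen]
    return total / len(estimated)

def best_exact_set(universe, size, estimated, true_acc):
    """Brute-force optimal set of the given (small) size."""
    return max(
        combinations(sorted(universe), size),
        key=lambda s: advising_accuracy(s, estimated, true_acc),
    )

def greedy_advisor_set(universe, k, estimated, true_acc, ell=1):
    """Start from an optimal set of size ell, then greedily add the choice
    that most improves advising accuracy until the set has size k."""
    chosen = list(best_exact_set(universe, ell, estimated, true_acc))
    while len(chosen) < k:
        candidates = [p for p in sorted(universe) if p not in chosen]
        if not candidates:
            break
        best = max(
            candidates,
            key=lambda p: advising_accuracy(chosen + [p], estimated, true_acc),
        )
        current = advising_accuracy(chosen, estimated, true_acc)
        if advising_accuracy(chosen + [best], estimated, true_acc) <= current:
            break  # assumption: stop when no candidate strictly helps
        chosen.append(best)
    return chosen

# Made-up training data: 3 benchmarks (rows) by 3 parameter choices (columns).
estimated = [[0.9, 0.5, 0.1], [0.2, 0.8, 0.3], [0.4, 0.3, 0.7]]
true_acc = [[0.8, 0.6, 0.2], [0.3, 0.9, 0.4], [0.5, 0.4, 0.6]]

pair = greedy_advisor_set(range(3), 2, estimated, true_acc)
triple = greedy_advisor_set(range(3), 3, estimated, true_acc)
```

On this toy data the best single choice is index 1, and the greedy step then adds index 0 and finally index 2. The objective is evaluated with true accuracies of the alignments the estimator selects, so a choice with high true accuracy only helps if the estimator actually ranks it first on some benchmarks.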

## Results

### Parameter advising

*k* = 10, where the benchmarks are assigned to bins based on their accuracy using a default parameter choice; the figure also shows the accuracies when using a single default parameter choice, and an oracle. The number of benchmarks per bin is indicated above the columns. An *oracle* is an advisor that knows the true accuracy of an alignment; its accuracy is shown by the dotted line, which gives the performance of a perfect advisor. Notice that in many cases the performance of the estimator is close to the oracle. This is most clear on the bin which has lowest average accuracy, where advising increases the average accuracy by almost 20% compared to using a single default parameter.

*k* of the greedy advisor set. Greedy advisor sets found by the approximation algorithm are augmented from the exact set of cardinality ℓ = 1 (namely, the best single parameter choice). Notice that Facet (the topmost curve in the plot) continues to increase in advising accuracy up to cardinality *k* = 6. Notice also that while all of the advisors reach a plateau, for Facet this occurs at a greater cardinality and accuracy than for other estimators.

### Accuracy estimation

Our tool Facet (**F**eature-based **Ac**curacy **E**s**t**imator) [9] is an easy-to-use, open-source utility for estimating the accuracy of a protein multiple sequence alignment. Facet computes the estimated accuracy of an alignment as a linear combination of real-valued feature functions. We considered 12 features, from which we found an optimal subset of 5 that provides the best performance for alignment advising. Many of the most useful features utilize information about protein secondary structure. We find coefficients by fitting the difference in estimator values to the difference in true accuracy for pairs of examples where the correct alignment is known. This "difference fitting" approach is computationally efficient and yields an estimator that works well for advising.
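To make the idea of difference fitting concrete, here is a small sketch of a linear estimator whose coefficients are fitted so that differences in estimated accuracy match differences in true accuracy over pairs of examples. It is an illustration under assumptions, not Facet's actual procedure: the names `linear_estimate` and `fit_by_differences` are hypothetical, the feature values are made up, and plain gradient descent on squared error is substituted for whatever optimization the published method [2, 3] uses.

```python
from itertools import combinations

def linear_estimate(coefficients, features):
    """Estimated accuracy as a linear combination of feature values."""
    return sum(c * f for c, f in zip(coefficients, features))

def fit_by_differences(feature_vectors, true_accuracies,
                       steps=500, learning_rate=1.0):
    """Fit coefficients so that, for each pair of examples (i, j), the
    difference in estimates approximates the difference in true accuracy.
    Assumption: mean squared error minimized by gradient descent."""
    dim = len(feature_vectors[0])
    pairs = list(combinations(range(len(feature_vectors)), 2))
    coefficients = [0.0] * dim
    for _ in range(steps):
        gradient = [0.0] * dim
        for i, j in pairs:
            feature_diff = [a - b for a, b in
                            zip(feature_vectors[i], feature_vectors[j])]
            accuracy_diff = true_accuracies[i] - true_accuracies[j]
            residual = linear_estimate(coefficients, feature_diff) - accuracy_diff
            for d in range(dim):
                gradient[d] += 2.0 * residual * feature_diff[d] / len(pairs)
        coefficients = [c - learning_rate * g
                        for c, g in zip(coefficients, gradient)]
    return coefficients

# Made-up examples with two features each; the "true" accuracies are
# generated from known coefficients (0.7, 0.3) so the fit can be checked.
feature_vectors = [(0.9, 0.2), (0.4, 0.8), (0.1, 0.5), (0.6, 0.6)]
true_accuracies = [0.7 * f1 + 0.3 * f2 for f1, f2 in feature_vectors]

fitted = fit_by_differences(feature_vectors, true_accuracies)
```

Fitting on differences rather than on absolute values targets what an advisor needs: it only has to rank candidate alignments of the same sequences correctly, not predict their accuracy exactly. The learning rate and step count here are tuned to this tiny data set and would need adjusting for real feature values.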

The Facet website provides parameter sets that can be used with the Opal aligner (namely substitution matrices and affine gap penalties), as well as scripts for structure prediction.

## Conclusion

While the new problem of learning optimal parameter sets for an advisor is NP-complete, in practice our greedy approximation algorithm efficiently learns parameter sets that are remarkably close to optimal. Moreover, these parameter sets significantly boost the accuracy of an aligner compared to a single default parameter choice, when advising using the best accuracy estimators from the literature.

## References

1. DeBlasio DF, Kececioglu JD: Learning Parameter Sets for Alignment Advising. Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB). 2014, 10.1145/2649387.2649448.
2. DeBlasio DF, Wheeler TJ, Kececioglu JD: Estimating the accuracy of multiple alignments and its use in parameter advising. Proceedings of the 16th Conference on Research in Computational Molecular Biology (RECOMB). 2012, 45-59. 10.1007/978-3-642-29627-7_5.
3. Kececioglu JD, DeBlasio DF: Accuracy Estimation and Parameter Advising for Protein Multiple Sequence Alignment. Journal of Computational Biology. 2013, 20 (4): 259-279. 10.1089/cmb.2013.0007.
4. Wheeler TJ, Kececioglu JD: Multiple alignment by aligning alignments. Bioinformatics. 2007, 23 (13): 559-68. 10.1093/bioinformatics/btm226.
5. Wheeler TJ, Kececioglu JD: Opal: multiple sequence alignment software, Version 2.1.0. 2012, [http://opal.cs.arizona.edu]
6. Chang JM, Tommaso PD, Notredame C: TCS: A new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Molecular Biology and Evolution. 2014, 10.1093/molbev/msu117.
7. Lassmann T, Sonnhammer ELL: Automatic assessment of alignment quality. Nucleic Acids Research. 2005, 33 (22): 7120-7128. 10.1093/nar/gki1020.
8. Ahola V, Aittokallio T, Vihinen M, Uusipaikka E: Model-based prediction of sequence alignment quality. Bioinformatics. 2008, 24 (19): 2165-2171. 10.1093/bioinformatics/btn414.
9. DeBlasio DF, Kececioglu JD: Facet: software for accuracy estimation of protein multiple sequence alignments, Version 1.1. 2014, [http://facet.cs.arizona.edu]

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.