Skylign: a tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models
© Wheeler et al.; licensee BioMed Central Ltd. 2014
Received: 23 September 2013
Accepted: 7 January 2014
Published: 13 January 2014
Logos are commonly used in molecular biology to provide a compact graphical representation of the conservation pattern of a set of sequences. They render the information contained in sequence alignments or profile hidden Markov models by drawing a stack of letters for each position, where the height of the stack corresponds to the conservation at that position, and the height of each letter within a stack depends on the frequency of that letter at that position.
We present a new tool and web server, called Skylign, which provides a unified framework for creating logos for both sequence alignments and profile hidden Markov models. In addition to static image files, Skylign creates a novel interactive logo plot for inclusion in web pages. These interactive logos enable scrolling, zooming, and inspection of underlying values. Skylign can avoid sampling bias in sequence alignments by down-weighting redundant sequences and by combining observed counts with informed priors. It also simplifies the representation of gap parameters, and can optionally scale letter heights based on alternate calculations of the conservation of a position.
Skylign is available as a website, a scriptable web service with a RESTful interface, and as a software package for download. Skylign’s interactive logos are easily incorporated into a web page with just a few lines of HTML markup. Skylign may be found at http://skylign.org.
Alignments and profile hidden Markov models
Alignments of multiple biological sequences play an important role in a wide range of bioinformatics applications, and are used to represent sequence families that range in size from DNA binding site motifs to full length proteins, ribosomal RNAs, and autonomous transposable elements. In an alignment, sequences are organized such that each column contains amino acids (or nucleotides) related by descent or shared functional constraint. The distributions of letters will typically vary from column to column. These patterns can reveal important characteristics of the sequence family, for example highlighting sites vital to conformation or ligand binding.
A sequence alignment can be used to produce a profile hidden Markov model (profile HMM). Profile HMMs provide a formal probabilistic framework for sequence comparison [1–3], leveraging the information contained in a sequence alignment to improve detection of distantly related sequences [4, 5]. They are, for example, used in the annotation of both protein domains [6–9] and genomic sequence derived from ancient transposable element expansions [10].
Consider a family of related sequences, and an alignment of a subset of those sequences. For each column, we can think of the observed letters as having been sampled from the distribution, $\vec{p}$, of letters at that position among all members of the sequence family. One approach to estimating $\vec{p}$ for a column is to compute a maximum likelihood estimate directly from observed counts at that column. An alternative is to try to improve the estimate using sequence weighting (relative [11] and absolute [12]) and mixture Dirichlet priors [2, 13–15]. The latter approach is used in computing position-specific letter distributions for profile HMMs [16, 17].
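The two estimation strategies can be contrasted with a minimal sketch. It assumes a DNA alphabet and, for brevity, a single Dirichlet prior with hypothetical pseudocounts `alpha`, rather than the mixture Dirichlet priors used in practice; the function names are illustrative and are not part of Skylign or HMMER.

```python
from collections import Counter

DNA = "ACGT"

def ml_estimate(column, alphabet=DNA):
    """Maximum likelihood estimate: observed letter frequencies (gaps ignored).

    Assumes the column contains at least one letter from the alphabet.
    """
    counts = Counter(c for c in column if c in alphabet)
    total = sum(counts.values())
    return [counts[a] / total for a in alphabet]

def dirichlet_posterior_mean(column, alpha, alphabet=DNA):
    """Posterior mean estimate under a single Dirichlet prior.

    alpha holds one pseudocount per letter (a simplifying assumption;
    a mixture prior would combine several such components).
    """
    counts = Counter(c for c in column if c in alphabet)
    total = sum(counts.values()) + sum(alpha)
    return [(counts[a] + a_i) / total for a, a_i in zip(alphabet, alpha)]

column = "AAAAAAG-"  # one alignment column: 7 letters and 1 gap
print(ml_estimate(column))
print(dirichlet_posterior_mean(column, alpha=[0.5, 0.5, 0.5, 0.5]))
```

With only seven observed letters, the maximum likelihood estimate assigns probability zero to C and T, while the prior-based estimate keeps a small probability for letters that were simply not sampled.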
In an alignment, a subset of the columns will be consensus columns, in which most sequences are represented by a letter, rather than a gap character. In a typical profile HMM, a model position is created for each consensus column, and non-consensus columns are treated as insertions relative to model positions. As with letters, the per-position gap distributions may be estimated from observed or weighted counts, or combined with a Dirichlet prior.
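As an illustration of how consensus columns might be identified, the sketch below marks a column as consensus when at least a given fraction of sequences carry a letter there. The threshold value of 0.5, the gap characters, and the function name are assumptions made for this example, not values taken from the text.

```python
def consensus_columns(alignment, min_occupancy=0.5, gap_chars="-."):
    """Return indices of columns where enough sequences have a letter.

    alignment is a list of equal-length strings; min_occupancy is a
    hypothetical threshold on the fraction of non-gap characters.
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    consensus = []
    for col in range(n_cols):
        letters = sum(1 for seq in alignment if seq[col] not in gap_chars)
        if letters / n_seqs >= min_occupancy:
            consensus.append(col)
    return consensus

alignment = [
    "AC-GT",
    "AC-GT",
    "ACTG-",
    "A--GT",
]
# Column 2 holds a letter in only one of four sequences, so it would be
# treated as an insertion relative to the model positions.
print(consensus_columns(alignment))
```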
A logo provides a compact graphical representation of an alignment, representing each column with a stack of letters. The total height of each stack corresponds to a measure of the invariance of the column – typically, it is the information content of that position. The height of each letter within a stack depends on the frequency of that letter at that position. Logos were originally devised to represent the extent of letter conservation in each column of an alignment [18, 19], and were later generalized to show letter and gap probabilities of a profile HMM [20].
Consider an alphabet $A$ consisting of $L$ letters, $a_1$ through $a_L$ ($L$ is 4 for DNA, and 20 for amino acids). For a given column in an alignment, we capture the estimated column distribution as a length-$L$ vector $\vec{p}$, such that $p_i$ is the probability of observing letter $a_i$ at that column. We define the length-$L$ vector $\vec{q}$ to be the background distribution over letters in $A$, such that $q_i$ is the background probability of observing letter $a_i$, typically based on letter frequency in a large set of representative sequences.
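To make the relationship between a column distribution and stack height concrete, the following sketch computes information content as the relative entropy (Kullback-Leibler divergence) of the column distribution from the background, in bits, and divides that height among letters in proportion to their probabilities. Treating information content as relative entropy is an assumption of this example, as are the function names and the uniform DNA background.

```python
import math

def information_content(p, q):
    """Relative entropy of column distribution p from background q, in bits.

    Letters with zero probability contribute nothing to the sum.
    """
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def letter_heights(p, q):
    """Split the stack height among letters in proportion to p."""
    stack = information_content(p, q)
    return [p_i * stack for p_i in p]

q = [0.25, 0.25, 0.25, 0.25]       # assumed uniform background over A, C, G, T

conserved = [1.0, 0.0, 0.0, 0.0]   # a single letter in every sequence
split = [0.5, 0.5, 0.0, 0.0]       # two equally likely letters
uniform = [0.25, 0.25, 0.25, 0.25] # no preference at all

for p in (conserved, split, uniform):
    print(information_content(p, q), letter_heights(p, q))
```

Under these assumptions a fully conserved DNA column reaches the maximum of 2 bits, a column split evenly between two letters reaches 1 bit, and a column matching the background has height 0.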
Relationship between DNA letter distribution and information content
We present a software tool and associated web service, called Skylign, which offers several advantages over existing logo tools. It can generate both a static image file and a new interactive web plot that supports scrolling, zooming, and inspection of values underlying each letter stack. Skylign also produces a simplified representation of per-position gap probabilities, and optionally reduces visual clutter by including only overrepresented letters in a stack. Skylign’s interactive logos are robust and fast for alignments with length in the thousands, such as those representing many transposable element families.
An important implementation detail is that Skylign produces logos for both profile HMMs and multiple sequence alignments in a unified framework. Profile logos are plotted using the per-position distributions of the profile HMM. For alignment logos, the column distributions can be estimated from observed counts, from weighted counts, or from posterior probabilities after combining with a Dirichlet mixture prior. Estimation based on weights and priors is performed by explicitly producing a profile HMM using the hmmbuild tool within HMMER3.1 [17].
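The idea of weighted counts can be sketched with a simplified form of position-based sequence weighting, in which sequences that repeat common letters receive less weight than sequences that contribute rare ones. This is an illustrative sketch only: the function names, the gap handling, and the normalization (weights summing to the number of sequences) are assumptions, and hmmbuild's own weighting steps are more involved.

```python
from collections import Counter

def position_based_weights(alignment, gap_chars="-."):
    """Simplified position-based relative sequence weights.

    In each column with k distinct letters, a sequence whose letter is
    shared by n sequences gains 1 / (k * n). Gaps are skipped.
    """
    n_seqs = len(alignment)
    weights = [0.0] * n_seqs
    for col in range(len(alignment[0])):
        letters = [seq[col] for seq in alignment]
        counts = Counter(c for c in letters if c not in gap_chars)
        k = len(counts)
        for i, c in enumerate(letters):
            if c not in gap_chars:
                weights[i] += 1.0 / (k * counts[c])
    total = sum(weights)
    # Assumed normalization: weights sum to the number of sequences.
    return [w * n_seqs / total for w in weights]

def weighted_counts(alignment, weights, col, alphabet="ACGT"):
    """Per-letter counts at one column, with each sequence counted by its weight."""
    counts = dict.fromkeys(alphabet, 0.0)
    for seq, w in zip(alignment, weights):
        if seq[col] in counts:
            counts[seq[col]] += w
    return counts

alignment = ["ACGT", "ACGT", "ACGT", "TCGA"]  # three redundant sequences, one divergent
weights = position_based_weights(alignment)
print(weights)
print(weighted_counts(alignment, weights, col=0))
```

In this toy alignment the three identical sequences are down-weighted relative to the divergent one, so the weighted count of T at the first column is larger than its raw count of 1.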
In the following sections, we describe implementation details, compare alternative visualization approaches, and illustrate the utility of these logos. Skylign can be accessed as a web service at http://skylign.org, and the Skylign software package may be downloaded for independent installation.
Results and discussion
Several logo web servers have been released since the introduction of logos [20, 22–24], each with their own enhancements to logo presentation. In the course of developing websites for sequence homology search and annotation, we identified a need for interactive web-enabled logos that could efficiently render very long logos, and offer alternate letter height options, improved visualization of per-position gap parameters, and the ability to inspect underlying values. We developed Skylign to meet these needs.
Web-enabled interactive logos
Skylign may be used in a variety of ways to create an image or interactive logo. The simplest option is to use the website submission form. Skylign also offers a web service via a RESTful interface [25], enabling scripted logo creation. Finally, the Skylign package may be downloaded for local installation. Instructions for all of these options are available at http://skylign.org.
Position-specific gap parameters
Occupancy: the probability of observing a letter at position k. If we call this value occ(k), the probability of observing a gap character (part of a deletion relative to the model) is 1 − occ(k).
Insert probability: the probability of observing one or more letters inserted between the letter corresponding to position k and the letter corresponding to position (k + 1).
Insert length: the expected length of an insertion following position k, if one is observed. For mathematical convenience, profile HMMs model insertions as having a geometric length distribution with position-specific parameter ϵ and mean length 1/(1 − ϵ).
The latter two are only relevant for profile logos, since Skylign creates a logo position for each non-empty column in the alignment when producing an alignment logo.
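The arithmetic behind these gap parameters can be checked with a short sketch. It assumes the geometric length distribution takes the form P(length = n) = (1 − ϵ)ϵ^(n−1) for n ≥ 1, which is the parameterization consistent with the stated mean of 1/(1 − ϵ); the function names are illustrative.

```python
def gap_probability(occupancy):
    """Probability of a gap character at a position: 1 - occ(k)."""
    return 1.0 - occupancy

def expected_insert_length(epsilon):
    """Closed-form mean of the geometric insert length distribution."""
    if not 0.0 <= epsilon < 1.0:
        raise ValueError("epsilon must be in [0, 1)")
    return 1.0 / (1.0 - epsilon)

def expected_insert_length_by_sum(epsilon, max_len=10_000):
    """Numerical check: sum n * P(length = n) over a long truncated range.

    Assumes P(length = n) = (1 - epsilon) * epsilon ** (n - 1), n >= 1.
    """
    return sum(n * (1.0 - epsilon) * epsilon ** (n - 1)
               for n in range(1, max_len + 1))

epsilon = 0.75
print(gap_probability(0.9))                    # chance of a deletion at this position
print(expected_insert_length(epsilon))         # closed form
print(expected_insert_length_by_sum(epsilon))  # truncated series, should agree closely
```

For ϵ = 0.75 the closed form gives a mean insert length of 4, and the truncated series agrees to within floating-point error.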
Unified framework for profile logos and alignment logos
Logo height options
Skylign also offers an option to produce a different sort of logo in which the height of each letter is its score, $s_i$. Only positive-scoring letters are included in the stack, as demonstrated in Figure 5C. We find this logo useful, for example, when inspecting per-position scores of an alignment of a sequence to a profile HMM. It is important to emphasize that the height of a score stack does not have any inherent meaning – it is simply the sum of all letter heights. In the interactive web logo, clicking a column reveals a list of scores for all letters of the alphabet, including those with negative scores.
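A score-based stack can be sketched as follows. The definition of the score is not given in this passage, so the example assumes a log-odds score in bits, log2 of the column probability over the background probability; that assumption, the uniform background, and the function names are all illustrative.

```python
import math

def letter_scores(p, q):
    """Assumed log-odds score per letter, in bits; -inf when p_i is zero."""
    return [math.log2(p_i / q_i) if p_i > 0 else float("-inf")
            for p_i, q_i in zip(p, q)]

def score_stack(p, q, alphabet="ACGT"):
    """Keep only positive-scoring letters; each letter's height is its score."""
    return [(a, s) for a, s in zip(alphabet, letter_scores(p, q)) if s > 0]

q = [0.25, 0.25, 0.25, 0.25]  # assumed uniform background
p = [0.7, 0.1, 0.1, 0.1]      # column dominated by A

stack = score_stack(p, q)
print(stack)                       # only A scores above zero here
print(sum(s for _, s in stack))    # stack height: the sum of letter heights
print(letter_scores(p, q))         # all scores, including negative ones
```

The last line mirrors what the interactive logo exposes when a column is clicked: the full list of scores, including those too low to appear in the stack.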
Logos have long been used to visually represent the position-specific patterns of conservation in sequence alignments and profile HMMs. We developed Skylign with the aim of enabling interactive manipulation and inspection of logos, while offering a variety of logo variants for alignments and profiles. The result is a logo tool that supports scrolling, zooming, inspection of underlying values, and mapping between logo positions and alignment columns. Skylign simplifies the representation of gap parameters, offers alternate calculations to determine letter heights, and can overcome sampling bias by down-weighting redundant sequences and by combining observed counts with informed priors.
Skylign’s interactive logos are easily incorporated into a web page, and we have already used them in our HMMER and Dfam web servers, presenting logos for both protein and DNA profile HMMs [10, 27]. We anticipate that Skylign will be used to create logos, either in advance or on the fly, for other sites that present data related to multiple sequence alignments or profile HMMs.
Availability and requirements
Skylign can be accessed as a web server and web service, and may be downloaded for local use at http://skylign.org.
Institutional support was provided by Howard Hughes Medical Institute Janelia Farm Research Campus. We thank the reviewers for their helpful comments, and Tom Jones and Sean Eddy for their insightful feedback during development of the software and manuscript.
1. Krogh A, Brown M, Mian IS, Sjolander K, Haussler D: Hidden Markov models in computational biology. J Mol Biol. 1993, 235: 1501-1531.
2. Karplus K, Barrett C, Hughey R: Hidden Markov models for detecting remote protein homologies. Bioinformatics. 1998, 14: 846-856. 10.1093/bioinformatics/14.10.846.
3. Eddy SR: A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol. 2008, 4: e1000069. 10.1371/journal.pcbi.1000069.
4. Eddy SR: Accelerated profile HMM searches. PLoS Comput Biol. 2011, 7: e1002195. 10.1371/journal.pcbi.1002195.
5. Wheeler TJ, Eddy SR: nhmmer: DNA homology search with profile HMMs. Bioinformatics. 2013, 29: 2487-2489. 10.1093/bioinformatics/btt403.
6. de Lima Morais DA, Fang H, Rackham OJL, Wilson D, Pethica R, Chothia C, Gough J: SUPERFAMILY 1.75 including a domain-centric gene ontology method. Nucleic Acids Res. 2011, 39: D427-D434. 10.1093/nar/gkq1130.
7. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD: The Pfam protein families database. Nucleic Acids Res. 2011, 40: D290-D301.
8. Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, Bateman A, Bernard T, Binns D, Bork P, Burge S, de Castro E, Coggill P, Corbett M, Das U, Daugherty L, Duquenne L, Finn RD, Fraser M, Gough J, Haft D, Hulo N, Kahn D, Kelly E, Letunic I, Lonsdale D, Lopez R, Madera M, Maslen J, McAnulla C, McDowall J, et al: InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. 2012, 40: D306-D312. 10.1093/nar/gkr948.
9. Lees J, Yeats C, Perkins J, Sillitoe I, Rentzsch R, Dessailly BH, Orengo C: Gene3D: a domain-based resource for comparative genomics, functional annotation and protein network analysis. Nucleic Acids Res. 2012, 40: D465-D471. 10.1093/nar/gkr1181.
10. Wheeler TJ, Clements J, Eddy SR, Hubley R, Jones TA, Jurka J, Smit AFA, Finn RD: Dfam: a database of repetitive DNA based on profile hidden Markov models. Nucleic Acids Res. 2013, 41: D70-D82. 10.1093/nar/gks1265.
11. Henikoff S, Henikoff JG: Position-based sequence weights. J Mol Biol. 1994, 243: 574-578. 10.1016/0022-2836(94)90032-9.
12. Johnson S: Remote Protein Homology Detection Using Hidden Markov Models. PhD thesis. 2006, St. Louis: Washington University.
13. Brown M, Hughey R, Krogh A, Mian IS, Sjolander K, Haussler D: Using Dirichlet mixture priors to derive hidden Markov models for protein families. Proceedings of the First International Conference on Intelligent Systems for Molecular Biology. 1993, 1: 47-55.
14. Sjolander K, Karplus K, Brown M, Hughey R, Krogh A, Mian IS, Haussler D: Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Bioinformatics. 1996, 12: 327-345. 10.1093/bioinformatics/12.4.327.
15. MacKay DJC: Information theory, inference, and learning algorithms. 2003, Cambridge: Cambridge Univ Press.
16. Hughey R, Krogh A: SAM: Sequence alignment and modeling software system. 1995, Santa Cruz: University of California.
17. Eddy SR, Wheeler TJ: HMMER User's Guide, version 3.1. 2013, http://hmmer.janelia.org/
18. Schneider TD, Stormo GD, Gold L, Ehrenfeucht A: Information content of binding sites on nucleotide sequences. J Mol Biol. 1986, 188: 415-431. 10.1016/0022-2836(86)90165-8.
19. Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990, 18: 6097-6100. 10.1093/nar/18.20.6097.
20. Schuster-Böckler B, Schultz J, Rahmann S: HMM logos for visualization of protein families. BMC Bioinformatics. 2004, 5: 7. 10.1186/1471-2105-5-7.
21. Kullback S, Leibler RA: On information and sufficiency. Ann Math Stat. 1951, 22: 79-86. 10.1214/aoms/1177729694.
22. Gorodkin J, Heyer LJ, Brunak S, Stormo GD: Displaying the information contents of structural RNA alignments: the structure logos. Comput Applic Biosci. 1997, 13: 583-586.
23. Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14: 1188-1190. 10.1101/gr.849004.
24. Thomsen MCF, Nielsen M: Seq2Logo: a method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion. Nucleic Acids Res. 2012, 40: W281-W287. 10.1093/nar/gks469.
25. Fielding RT, Taylor RN: Principled design of the modern Web architecture. ACM Transactions on Internet Technology. 2002, 2: 115-150. 10.1145/514183.514185.
26. Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A: Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res. 2010, 39: D141-D145.
27. Finn RD, Clements J, Eddy SR: HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011, 39: W29-W37. 10.1093/nar/gkr367.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.