CircularLogo: A lightweight web application to visualize intra-motif dependencies
BMC Bioinformatics volume 18, Article number: 269 (2017)
The sequence logo has been widely used to represent DNA or RNA motifs for more than three decades. Despite its intelligibility and intuitiveness, the traditional sequence logo is unable to display the intra-motif dependencies and therefore is insufficient to fully characterize nucleotide motifs. Many methods have been developed to quantify the intra-motif dependencies, but fewer tools are available for visualization.
CircularLogo is an innovative web application that is specifically designed to visualize and interactively explore intra-motif dependencies.
Many DNA and RNA binding proteins recognize their binding sites through specific nucleotide patterns called motifs. Motif sites bound by the same protein do not necessarily have the same sequence but typically share consensus sequence patterns. Several methods have been developed to statistically model the position-specific consensus and diversity of nucleotide motifs using the position weight matrix (PWM) or position-specific scoring matrix (PSSM) [1, 2]. These mathematical representations are usually visualized using sequence logos, which depict the consensus and diversity of each motif residue as a stack of nucleotide symbols. The height of each symbol within the stack indicates its relative frequency, and the total height of the stack is scaled to the information content of that position [3, 4].
Traditional PWM and PSSM representations assume statistical independence between the nucleotides of a motif. However, this assumption is not fully justified, and accumulated evidence indicates the existence of intra-motif dependencies [5,6,7,8]. For example, an analysis of wild-type and mutant Zif268 (EGR-1) zinc fingers using microarray binding experiments suggested that the nucleotides within a transcription factor binding site (TFBS) should not be treated independently. In addition, intra-motif dependencies were revealed by a comprehensive experiment examining the binding specificities of 104 distinct DNA binding proteins in mouse. When taken into consideration, intra-motif dependencies can substantially improve the accuracy of de novo motif discovery. Therefore, many statistical methods have been developed to characterize intra-motif dependencies, including the generalized weight matrix model, the sparse local inhomogeneous mixture model (Slim), the transcription factor flexible model based on hidden Markov models (TFFMs), the binding energy model (BEM), and the inhomogeneous parsimonious Markov model (PMM). However, the most commonly used visualization tools, such as WebLogo and Seq2Logo, are incapable of displaying these intra-motif dependencies.
Only a handful of tools, such as CorreLogo, enoLOGOS, and ELRM, are capable of visualizing positional dependencies [16,17,18]. CorreLogo depicts mutual information from DNA or RNA alignments using three-dimensional sequence logos generated via VRML and JVX. However, CorreLogo’s three-dimensional graphs are difficult to interpret because of the complex and distorted perspective introduced by the third dimension. ELRM generates static graphs to visualize intra-motif dependencies. It separates “base features” from “association features” and therefore fails to integrate nucleotide diversity and dependencies in a single view. In addition, ELRM can only measure dependence with its own built-in method. Similar to ELRM, enoLOGOS represents the dependency between different positions using a matrix plot underneath the nucleotide logo. While pLogo allows users to visualize correlations with a particular nucleotide position, it does not provide an overall view of intra-motif dependencies. Finally, all of these tools lack functionality for users to explore and interpret the data interactively.
JSON-Graph specifications of nucleotide motif representation
We used the JSON-Graph format to describe nucleotide motifs in order to make them intelligible and malleable. The schema of the JSON-Graph format is described below.
The contents within the two outer curly braces describe a DNA or RNA motif. Specifically, the “id” keyword specifies the name of the motif. The “background” keyword designates the nucleotide frequencies (in the order A, T, C and G) of the relevant genomic background. For example, when studying motifs in the human genome, these frequencies are computed from the human reference genome. By default, each is set to 0.25, representing equal frequencies. The “pseudocounts” keyword represents the extra nucleotide counts added to each position of the motif to avoid division-by-zero errors in small data sets; these are set to 0.25 for each nucleotide by default. The “nodes” section describes properties of the motif residues using the following keywords: a) the “index” keyword specifies the sequential order (anticlockwise) of the nucleotide stacks; b) the “label” keyword denotes the identity of each nucleotide stack; c) the “bit” keyword gives the information content calculated for each nucleotide stack; d) the “base” keyword lists the four nucleotides sorted in increasing order of their frequencies, which are given by the “freq” keyword. The “links” section describes the pairwise dependencies between nucleotide stacks using the following keywords: a) the “source” and “target” keywords denote the start and end positions of a link; b) the “value” keyword sets the width of the link, which is proportional to the strength of dependence between the two linked positions.
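As a rough illustration of the keywords described above, a motif file might be assembled as in the following Python sketch. The exact nesting, the sample values, and the helper name make_motif are assumptions made for this example and may differ from the schema CircularLogo actually expects.

```python
import json

def make_motif(motif_id, nodes, links, background=None, pseudocounts=None):
    """Assemble a motif description using the keywords named in the text.

    Hypothetical helper: the layout is an assumption, not the official schema.
    """
    uniform = [0.25, 0.25, 0.25, 0.25]  # order A, T, C, G as stated in the text
    return {
        "id": motif_id,
        "background": background or list(uniform),
        "pseudocounts": pseudocounts or list(uniform),
        "nodes": nodes,
        "links": links,
    }

# Two made-up positions. "base" is sorted by increasing "freq"; "bit" is
# 2 minus the Shannon entropy of "freq" (no small-sample correction).
nodes = [
    {"index": 0, "label": "1", "bit": 0.98,
     "base": ["C", "T", "G", "A"], "freq": [0.05, 0.05, 0.10, 0.80]},
    {"index": 1, "label": "2", "bit": 0.15,
     "base": ["A", "C", "G", "T"], "freq": [0.10, 0.20, 0.30, 0.40]},
]
# One made-up link; "value" controls the ribbon width.
links = [{"source": 0, "target": 1, "value": 0.4}]

motif = make_motif("example_motif", nodes, links)
text = json.dumps(motif, indent=2)
print(text)
```

Keeping the file as plain JSON means dependencies computed by any external method can be written into the “links” section without touching the rest of the description.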
CircularLogo web server
The CircularLogo web application uses the NGINX (https://www.nginx.com/) web server with the uWSGI (https://pypi.python.org/pypi/uWSGI) gateway interface to handle multiple concurrent client requests. The application is hosted on Amazon Elastic Compute Cloud (Amazon EC2).
Measure intra-motif dependencies using χ² statistic
We implemented two metrics to calculate the dependence between a pair of nucleotide positions: mutual information and the χ² statistic. The χ² statistic is widely used to test the independence of two categorical variables, and the corresponding Q score is a natural measure of dependency between two events that quantifies their co-incidence as follows. Assume that a DNA motif is l nucleotides long and is built from N sequences. For two given positions i and j within the motif (1 ≤ i ≤ l, 1 ≤ j ≤ l, i ≠ j), the observed di-nucleotide frequency is denoted O_ij and is obtained by counting di-nucleotide combinations in the N input sequences. The expected di-nucleotide frequency is denoted E_ij. The χ² statistic is then calculated as:
$$
\chi^2 = \sum_{k=1}^{m} \frac{\left(O_{ij}^{(k)} - E_{ij}^{(k)}\right)^2}{E_{ij}^{(k)}}
$$
Here, k indexes the possible di-nucleotides and m is their total number (4² = 16).
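The χ² computation described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: expected counts are derived from the single-position counts under independence, no pseudocounts are applied, and di-nucleotides with zero expected count are skipped. The function name chi2_pair is hypothetical and is not taken from the CircularLogo source.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"

def chi2_pair(seqs, i, j):
    """Chi-square statistic between columns i and j (0-based) of aligned sequences."""
    n = len(seqs)
    observed = Counter((s[i], s[j]) for s in seqs)   # di-nucleotide counts
    count_i = Counter(s[i] for s in seqs)            # single-position counts
    count_j = Counter(s[j] for s in seqs)
    chi2 = 0.0
    for a, b in product(BASES, repeat=2):            # all 16 di-nucleotides
        # Assumption: expectation under independence of the two columns.
        expected = count_i[a] * count_j[b] / n
        if expected > 0:
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    return chi2

# Toy alignments: perfectly coupled columns versus fully independent columns.
paired = ["AA", "CC", "GG", "TT"] * 5
independent = [a + b for a, b in product(BASES, repeat=2)]
print(chi2_pair(paired, 0, 1))
print(chi2_pair(independent, 0, 1))
```

On the toy data, the coupled columns give a large statistic while the independent columns give zero, which is the behaviour the link widths are meant to encode.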
Measure intra-motif dependencies using mutual information
The second built-in approach to measuring dependence is mutual information. This metric quantifies the mutual dependence between two discrete random variables X (X = [A, C, G, T]) and Y (Y = [A, C, G, T]) and is defined as:
$$
I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}
$$
Here, x (x ∈ [A, C, G, T]) and y (y ∈ [A, C, G, T]) represent nucleotides at the two nucleotide stacks X and Y, respectively; p(x) and p(y) denote the nucleotide frequencies of x and y, and p(x, y) denotes the frequency of the dinucleotide (xy) formed from X and Y. The significance of the dependency between two positions was evaluated using Chebyshev’s inequality: if the observed mutual information lies K standard deviations above the value expected under a random background model, then P ≤ 1/K².
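The mutual information and the Chebyshev bound described above can be sketched as follows. The function names are hypothetical, frequencies are plain counts without pseudocounts, and the text does not say how the background mean and standard deviation are estimated, so they are simply passed in as arguments here.

```python
import math
from collections import Counter

def mutual_information(seqs, i, j):
    """Mutual information (bits) between columns i and j of aligned sequences."""
    n = len(seqs)
    pair_counts = Counter((s[i], s[j]) for s in seqs)
    count_i = Counter(s[i] for s in seqs)
    count_j = Counter(s[j] for s in seqs)
    mi = 0.0
    for (x, y), c in pair_counts.items():
        # p(x,y) / (p(x) p(y)) rewritten in terms of raw counts.
        mi += (c / n) * math.log2(c * n / (count_i[x] * count_j[y]))
    return mi

def chebyshev_bound(observed, background_mean, background_sd):
    """Upper bound on P for a value K standard deviations above the background mean."""
    k = (observed - background_mean) / background_sd
    if k <= 0:
        return 1.0  # at or below the background mean: no evidence of dependence
    return min(1.0, 1.0 / k ** 2)

paired = ["AA", "CC", "GG", "TT"] * 5   # perfectly coupled toy columns
print(mutual_information(paired, 0, 1))
print(chebyshev_bound(1.0, 0.1, 0.3))   # made-up background mean and sd
```

Chebyshev’s inequality makes no distributional assumption, so the bound is conservative: a link three standard deviations above background is only guaranteed P ≤ 1/9.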
HNF6 motif analysis
HNF6 ChIP-exo data were obtained from ArrayExpress (accession number E-MTAB-2060; http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-2060/) and processed with MACE, and HNF6 binding sites were extracted. The 5549 65-nucleotide sequences (20 upstream nucleotides + 25-nucleotide HNF6 binding site + 20 downstream nucleotides) were published at https://sourceforge.net/projects/circularlogo/files/test/. All sequences were aligned by the HNF6 motif, which spans position 29 to position 36.
tRNA sequence analysis
A total of 1114 tRNA sequences were downloaded from the RFAM database in RFAM ‘seed’ alignment format (accession # RF00005; https://correlogo.ncifcrf.gov/ccrnp/trnafull.html). After excluding sequences with gaps in the alignment, 291 sequences were used as the final dataset to generate the circular logo of tRNA (https://sourceforge.net/projects/circularlogo/files/test/). Mutual information was used as the metric to measure intra-motif dependencies. The weakest 33% of links were filtered out.
Synthesized DNA fragments of splice sites and branch-points for analysis
We synthesized DNA fragments by concatenating the 5′ donor site (16 bp), branch-point (21 bp) and 3′ acceptor site (16 bp) to represent the splicing motif. Briefly, a total of 59,359 predefined, high-confidence human branch-points were downloaded from the supplementary data of the original study. We excluded introns with multiple branch-points, small introns (<1 kb) and introns with a small gap (≤25 bp) between the branch-point and the acceptor site. For each of the remaining introns, we first extracted the 6 bp upstream and 10 bp downstream of the 5′ donor site. Second, we extracted a 21 bp DNA sequence encompassing the branch-point by extending 10 bp both upstream and downstream of the branch-point. Third, we extracted the 10 bp upstream and 6 bp downstream of the 3′ acceptor site. Finally, we concatenated these three DNA sequences in the order “5′ donor site–branch-point–3′ acceptor site” to form a 53 bp DNA fragment. We used a final set of 10,316 DNA fragments to generate the circular logo (https://sourceforge.net/projects/circularlogo/files/test/).
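The windowing and concatenation steps above can be sketched as follows. The coordinate convention is an assumption made for this example: all positions are 0-based indices into one sequence string, donor is the first intron base, branch is the branch-point base, acceptor is the first base of the downstream exon, and the gap is approximated as acceptor minus branch. The function names are hypothetical.

```python
def keep_intron(donor, branch, acceptor, min_intron=1000, max_small_gap=25):
    """Apply the size filters from the text (introns < 1 kb or gap <= 25 bp are dropped)."""
    intron_length = acceptor - donor
    gap = acceptor - branch   # assumption: simple coordinate difference
    return intron_length >= min_intron and gap > max_small_gap

def splice_fragment(seq, donor, branch, acceptor):
    """Build the 53 bp fragment: donor (16 bp) + branch-point (21 bp) + acceptor (16 bp)."""
    if donor - 6 < 0 or acceptor + 6 > len(seq):
        raise ValueError("sequence too short for the requested windows")
    donor_part = seq[donor - 6: donor + 10]           # 6 bp upstream + 10 bp downstream
    branch_part = seq[branch - 10: branch + 11]       # 10 bp + branch-point + 10 bp
    acceptor_part = seq[acceptor - 10: acceptor + 6]  # 10 bp upstream + 6 bp downstream
    fragment = donor_part + branch_part + acceptor_part
    if len(fragment) != 53:
        raise ValueError("windows overlap a sequence boundary")
    return fragment

# Synthetic 2 kb sequence with made-up coordinates, for illustration only.
seq = "".join("ACGT"[k % 4] for k in range(2000))
fragment = splice_fragment(seq, donor=100, branch=1900, acceptor=1950)
print(len(fragment), fragment)
```

Because the three windows come from distant parts of the intron, links that cross the 16 bp and 37 bp boundaries of the fragment represent dependencies between sites rather than within a site.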
Circular nucleotide motif
Unlike traditional sequence logos, which display motif residues on a two-dimensional Cartesian coordinate system (with the x-axis denoting the position of residue stacks and the y-axis denoting the information content), CircularLogo visualizes motifs using a polar coordinate system that facilitates the display of pairwise intra-motif dependencies with linked ribbons (Fig. 1). Since traditional PWM or PSSM representations do not preserve intra-motif dependency information, we use JSON-Graph as the main input format for CircularLogo. When the input file is in JSON-Graph format with pre-calculated nucleotide frequencies and dependencies, CircularLogo simply transforms this file into a pictorial representation. CircularLogo also accepts motif sequences in FASTA format as input. In this scenario, CircularLogo transforms the FASTA information into JSON-Graph format by calculating the intra-motif dependencies using the built-in χ² statistic or mutual information metric, and determines the height of each nucleotide stack in the same way as WebLogo. In brief, CircularLogo generates a sector for each motif position and draws the nucleotide stack within that sector based on the information content and relative frequencies of the nucleotides. All sectors are arranged into a circular layout. The width of each linked arc indicates the strength of the dependency between the corresponding pair of nucleotide positions.
CircularLogo allows users to interactively adjust a variety of parameters to explore intra-motif dependencies and fine-tune the appearance of the final output. For example, any nucleotide in the genome has a certain level of dependency with its immediate neighbors. Such dependencies are considered background noise since they are unlikely to be biologically meaningful. CircularLogo automatically filters out weak links according to a user-specified p-value and also provides a slider bar that lets users filter links interactively.
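The stack-height and sector layout just described can be sketched as follows. This is a simplified sketch: information content is taken as 2 bits minus the Shannon entropy, the small-sample correction and non-uniform backgrounds used by logo generators are omitted, sectors are given equal angular width, and the function names are hypothetical.

```python
import math

def stack_heights(freqs):
    """Return (information content in bits, per-base heights) for one position.

    freqs maps each base to its relative frequency; values should sum to 1.
    """
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    info = 2.0 - entropy   # log2(4) minus entropy, no small-sample correction
    # Least frequent base first, so the most frequent one is drawn last.
    ordered = sorted(freqs.items(), key=lambda item: item[1])
    return info, {base: p * info for base, p in ordered}

def sector_angles(n_positions):
    """Equal-width (start, end) angles in radians, increasing anticlockwise."""
    step = 2 * math.pi / n_positions
    return [(k * step, (k + 1) * step) for k in range(n_positions)]

info, heights = stack_heights({"A": 0.80, "G": 0.10, "C": 0.05, "T": 0.05})
print(round(info, 3), heights)
print(sector_angles(4))
```

Each base’s height is its frequency times the information content of the position, so the heights within a stack always sum to that position’s information content.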
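Fraction-based filtering of weak links, such as the removal of the weakest 33% of links in the tRNA analysis, can be sketched as below. The function name drop_weakest and the rank-based tie handling are assumptions; the p-value filter mentioned above is a separate criterion that this sketch does not implement.

```python
def drop_weakest(links, fraction):
    """Remove the given fraction of links with the smallest "value"."""
    ranked = sorted(links, key=lambda link: link["value"])
    n_drop = int(len(ranked) * fraction)   # rounds down
    return ranked[n_drop:]

# Ten made-up links with values 1..10.
links = [{"source": 0, "target": k, "value": float(k)} for k in range(1, 11)]
kept = drop_weakest(links, 0.33)
print([link["value"] for link in kept])
```

A slider can be modelled as repeated calls with different fractions on the same unfiltered list, which keeps the original links intact.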
Nucleotide dependencies within HNF6 motif
HNF6 (also known as ONECUT1) is a transcription factor that regulates the expression of genes involved in a variety of cellular processes. The exact protein-DNA binding boundaries of HNF6 in the mouse genome were previously defined by our group. A total of 5549 binding sites, each 25 nucleotides long, were used to explore the intra-motif dependencies. Each binding site was also extended by 20 nucleotides upstream and downstream in order to estimate the background dependency level. Pairwise dependencies between all 65 positions are displayed in Fig. 1a. As expected, dependencies between positions within the HNF6 binding site (i.e. nucleotides between the 29th and 36th positions) were much higher than those in the flanking regions (Fig. 1b). Figure 1c shows background links involving node 5 (i.e. the 5th position of the input DNA sequence). Figure 1d shows dependencies involving node 33 within the HNF6 binding site after spurious links were removed.
Nucleotide dependencies within tRNAs
Transfer RNA (tRNA) is involved in translating messenger RNA (mRNA) into an amino acid sequence. Its typical cloverleaf secondary structure is composed of the D-loop, anticodon loop, variable loop and TΨC loop, as well as four base-paired stems between these loops (Fig. 2a). The nucleotides within stems are less conserved than those in loops, but base pairing within stems is required for structural stability. Thus, we expect higher positional dependencies between nucleotides within stems than between those within loops. We used CircularLogo, with mutual information as the measure of dependence, to generate the tRNA circular motif. After filtering out weak links (lower 33%), we observed four apparent clusters of connected links corresponding to the four stems (Fig. 2b). Compared with the motif logo generated by enoLOGOS (http://www.benoslab.pitt.edu/cgi-bin/enologos/enologos.cgi) from the same dataset, CircularLogo provided a more intuitive view of the dependencies within the four stems (Fig. 2c). Figure 2b also shows that nucleotides within three loops (D-loop, anticodon loop, and TΨC loop) exhibited much higher sequence conservation than nucleotides located in stems, suggesting that the loops are the main functional domains of tRNA. For example, the D-loop is the recognition site of aminoacyl-tRNA synthetase, an enzyme involved in aminoacylation of the tRNA molecule [22, 23], and the TΨC loop is the recognition site of the ribosome.
Nucleotide dependencies between splicing sites and branch site in eukaryotic introns
Splicing is a critical step during pre-mRNA processing, in which introns are removed and exons are joined together by the spliceosome complex. Eukaryotic genes contain three splicing motifs that are essential for successful intron excision: an almost invariant 5′-splice site (donor site), a 3′-splice site (acceptor site) and the branch site, which lies about 20–50 bp upstream of the acceptor site. Generally, two successive biochemical reactions are involved in spliceosomal splicing. First, a specific branch-point nucleotide within the intron, defined during spliceosome assembly, performs a nucleophilic attack on the 5′-splice donor site to form a lariat intermediate. Second, the released 5′-exon attacks the 3′-splice acceptor site to excise the lariat structure and join the adjacent exons. Recently, Mercer et al. identified 59,359 high-confidence human branch-points using a high-throughput sequencing technique. These reliable sites provide an opportunity to investigate how these elements interact with each other. We extracted the motif DNA sequences (see Implementation section) and explored their nucleotide dependencies using CircularLogo with the χ² statistic approach (Fig. 3). After filtering out weak links, we found strong dependencies among the three sites (donor site, branch-point and acceptor site). In addition, CircularLogo revealed interactions between the polypyrimidine tract and the two splice sites (donor site and acceptor site).
New statistical models and experimental approaches are being developed for measuring intra-motif dependency. CircularLogo uses a plain-text, JSON-Graph-formatted file to describe DNA/RNA motifs, which enables users to generate a customized JSON-Graph file containing positional dependencies pre-calculated by methods of their choice.
When raw sequences are given to CircularLogo, it provides two approaches (the χ² statistic and mutual information) for measuring positional dependency. Both of these methods, although commonly used, are biased and unable to quantify dependencies between highly conserved nucleotide stacks (e.g. invariable sites) [6, 25]. This problem can be addressed by providing as many sequences as possible in order to capture the low-frequency variants at those highly conserved sites. This is feasible thanks to genome-wide, high-throughput screening technologies. For example, researchers usually identify tens of thousands of potential TFBSs using ChIP-seq or similar technologies. After retrieving the potential TFBSs from ChIP-seq data, a researcher can align them using the predicted DNA motif and give the final alignment file as input to CircularLogo. We recommend that a FASTA input file contain at least 25 sequences.
It is worth noting that the χ² statistic and mutual information are two different measures of dependence, each suited for use under different conditions. Essentially, the χ² statistic measures the co-occurrence of nucleotides at two different positions. Hence, the χ² method is suited for measuring dependency between two conserved (i.e. less variable) positions, but it has limited power to measure dependency between two highly variable positions, where the dinucleotide frequencies are close to background (i.e. 1/16) and the χ² statistic approaches 0. In contrast, mutual information measures the reduction in uncertainty about the nucleotide frequencies at one position given some knowledge of the nucleotide frequencies at another position. For a pair of highly conserved positions that are dominated by particular nucleotides, the information content of each position and the mutual information between them approach 0 bits. Hence, mutual information is suited for measuring dependency between two highly variable positions.
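As a hypothetical worked example of this contrast (the numbers are invented for illustration and are not taken from the datasets above), consider two conserved positions across N = 100 sequences in which 98 sequences carry the pair AC and 2 carry GT. Taking expected counts from the single-position counts under independence, which is an assumption of this example, and skipping di-nucleotides whose expected count is zero:

$$
I(X;Y) = 0.98 \log_2 \frac{0.98}{0.98 \times 0.98} + 0.02 \log_2 \frac{0.02}{0.02 \times 0.02} \approx 0.029 + 0.113 \approx 0.14 \text{ bits}
$$

$$
\chi^2 = \frac{(98 - 96.04)^2}{96.04} + \frac{(0 - 1.96)^2}{1.96} + \frac{(0 - 1.96)^2}{1.96} + \frac{(2 - 0.04)^2}{0.04} = 0.04 + 1.96 + 1.96 + 96.04 = 100
$$

The two positions are perfectly coupled, and the χ² statistic reflects this by reaching N. The mutual information, however, cannot exceed the entropy of either position, which is only about 0.14 bits here because both positions are nearly invariant.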
Visualization is key to efficient data exploration and effective communication in scientific research. CircularLogo is an innovative tool that offers a panoramic view of DNA or RNA motifs while taking intra-site dependencies into account. We demonstrated the utility and practicality of this tool using examples in which CircularLogo depicted complex dependencies within motifs and revealed biomolecular structure (such as the stem structures in tRNA) in an effective manner.
BEM: Binding energy model
JSON: JavaScript object notation
JVX: JavaView geometry file
MACE: Model-based analysis of ChIP-exo
MEME: Multiple EM for motif elicitation
PMM: Inhomogeneous parsimonious Markov model
PSSM: Position-specific scoring matrix
PWM: Position weight matrix
TFBS: Transcription factor binding site
TFFM: Transcription factor flexible model
VRML: Virtual reality modeling language
1. Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000;16:16–23.
2. Boeva V. Analysis of genomic sequence motifs for deciphering transcription factor binding and transcriptional regulation in eukaryotic cells. Front Genet. 2016;7:24.
3. Crooks GE, Hon G, Chandonia J-M, Brenner SE. WebLogo: a sequence logo generator. Genome Res. 2004;14:1188–90.
4. O’Shea JP, Chou MF, Quader SA, Ryan JK, Church GM, Schwartz D. pLogo: a probabilistic approach to visualizing sequence motifs. Nat Methods. 2013;10:1211–2.
5. Bulyk ML, Johnson PLF, Church GM. Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res. 2002;30:1255–61.
6. Eggeling R, Gohr A, Keilwagen J, Mohr M, Posch S, Smith AD, et al. On the value of intra-motif dependencies of human insulator protein CTCF. PLoS ONE. 2014;9:e85629.
7. Man TK, Stormo GD. Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res. 2001;29:2471–8.
8. Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, et al. Diversity and complexity in DNA recognition by transcription factors. Science. 2009;324:1720–3.
9. Grau J, Posch S, Grosse I, Keilwagen J. A general approach for discriminative de novo motif discovery from high-throughput data. Nucleic Acids Res. 2013;41:e197.
10. Zhou Q, Liu JS. Modeling within-motif dependence for transcription factor binding site predictions. Bioinformatics. 2004;20:909–16.
11. Keilwagen J, Grau J. Varying levels of complexity in transcription factor binding motifs. Nucleic Acids Res. 2015;43:e119.
12. Mathelier A, Wasserman WW. The next generation of transcription factor binding site prediction. PLoS Comput Biol. 2013;9:e1003214.
13. Zhao Y, Ruan S, Pandey M, Stormo GD. Improved models for transcription factor binding site identification using nonindependent interactions. Genetics. 2012;191:781–90.
14. Eggeling R, Roos T, Myllymäki P, Grosse I. Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data. BMC Bioinformatics. 2015;16:375.
15. Thomsen MCF, Nielsen M. Seq2Logo: a method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion. Nucleic Acids Res. 2012;40:W281–7.
16. Bindewald E, Schneider TD, Shapiro BA. CorreLogo: an online server for 3D sequence logos of RNA and DNA alignments. Nucleic Acids Res. 2006;34:W405–11.
17. Yang C, Chang C-H. Exploring comprehensive within-motif dependence of transcription factor binding in Escherichia coli. Sci Rep. 2015;5:17021.
18. Workman CT, Yin Y, Corcoran DL, Ideker T, Stormo GD, Benos PV. enoLOGOS: a versatile web tool for energy normalized sequence logos. Nucleic Acids Res. 2005;33:W389–92.
19. Wang L, Chen J, Wang C, Uusküla-Reimand L, Chen K, Medina-Rivera A, et al. MACE: model based analysis of ChIP-exo. Nucleic Acids Res. 2014;42:e156.
20. Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR. Rfam: an RNA family database. Nucleic Acids Res. 2003;31:439–41.
21. Mercer TR, Clark MB, Andersen SB, Brunck ME, Haerty W, Crawford J, Taft RJ, Nielsen LK, Dinger ME, Mattick JS. Genome-wide discovery of human splicing branchpoints. Genome Res. 2015;25:290–303.
22. Smith D, Yarus M. Transfer RNA structure and coding specificity. I. Evidence that a D-arm mutation reduces tRNA dissociation from the ribosome. J Mol Biol. 1989;206:489–501.
23. Hardt WD, Schlegl J, Erdmann VA, Hartmann RK. Role of the D arm and the anticodon arm in tRNA recognition by eubacterial and eukaryotic RNase P enzymes. Biochemistry. 1993;32:13046–53.
24. Lee Y, Rio DC. Mechanisms and regulation of alternative pre-mRNA splicing. Annu Rev Biochem. 2015;84:291–323.
25. Paninski L. Estimation of entropy and mutual information. Neural Comput. 2003;15:1191–253.
This work was partly supported by the Mayo Clinic Center for Individualized Medicine. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Availability and requirements
CircularLogo (http://circularlogo.sourceforge.net/) is implemented in Python and Django and is released under the GNU General Public License (GPLv2). The CircularLogo web server (http://bioinformaticstools.mayo.edu/circularlogo/index.html) is hosted on Amazon Elastic Compute Cloud and uses the NGINX web server with the uWSGI gateway interface to handle multiple concurrent client requests. Local installation of CircularLogo on Linux, Mac OS X and Windows systems requires the following modules: python2.7.10 (https://www.python.org/downloads/release/python-2710/), Django (https://www.djangoproject.com/), biopython (https://github.com/biopython/biopython.github.io/), numpy (http://www.numpy.org/) and scipy (https://www.scipy.org/). The source code and the datasets analyzed during the current study are available at https://sourceforge.net/projects/circularlogo/files/.
LW and JPK conceived the study. ZY and TM implemented CircularLogo software and performed the analysis. MK built CircularLogo web server. LW, ZY, SD and JPK wrote the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate