DiffLogo: a comparative visualization of sequence motifs

Background
For three decades, sequence logos have been the de facto standard for the visualization of sequence motifs in biology and bioinformatics. Reasons for this success story are their simplicity and clarity. The number of inferred and published motifs grows with the number of data sets and motif extraction algorithms. Hence, it becomes more and more important to perceive differences between motifs. However, motif differences are hard to detect from individual sequence logos in the case of multiple motifs for one transcription factor, highly similar binding motifs of different transcription factors, or multiple motifs for one protein domain.

Results
Here, we present DiffLogo, a freely available, extensible, and user-friendly R package for visualizing motif differences. DiffLogo is capable of showing differences between DNA motifs as well as protein motifs in a pair-wise manner, resulting in publication-ready figures. In the case of more than two motifs, DiffLogo is capable of visualizing pair-wise differences in a tabular form. Here, the motifs are ordered by similarity, and the difference logos are colored for clarity. We demonstrate the benefit of DiffLogo on CTCF motifs from different human cell lines, on E-box motifs of three basic helix-loop-helix transcription factors as examples for the comparison of DNA motifs, and on F-box domains from three different families as an example for the comparison of protein motifs.

Conclusions
DiffLogo provides an intuitive visualization of motif differences. It enables the illustration and investigation of differences between highly similar motifs such as binding patterns of transcription factors for different cell types, treatments, and algorithmic approaches.

Electronic supplementary material
The online version of this article (doi:10.1186/s12859-015-0767-x) contains supplementary material, which is available to authorized users.

Figure S1: Comparison of nine CTCF motifs from the cell lines HepG2, MCF7, HUVEC, ProgFib, NHEK, K562, HeLa-S3, H1-hESC, and GM12878. We plot all pair-wise difference logos and represent the distance between each pair of motifs by the background color from green (similar) to red (dissimilar). We plot the sequence logo of each motif as well as the leaf-ordered cluster tree above.

Figure S2: Comparison of F-box domain motifs using DiffLogo. Comparison of the F-box domain from the kingdoms bacteria, metazoa, fungi, and viridiplantae. We plot all pair-wise difference logos and represent the distance between each pair of motifs by the background color from green (similar) to red (dissimilar). We plot the sequence logo of each motif as well as the leaf-ordered cluster tree above. The motifs of metazoa and fungi are highly similar. Other pair-wise comparisons show substantial differences.

CTCF with and without Clustering
The impact of clustering with optimal leaf ordering on the resulting grid of pair-wise comparisons.
Figure S3: Influence of clustering on the appearance of the difference logo grid: (a) with clustering disabled, (b) with clustering enabled and optimal leaf ordering disabled, and (c) with clustering and optimal leaf ordering enabled.

Figure S3 shows the importance of clustering, especially when comparing more than four motifs. Without clustering, it is hard to recognize groups of similar or dissimilar motifs (Figure S3a). When clustering is enabled but optimal leaf ordering is disabled, larger groups of similar motifs can be detected, but details are still hard to perceive (Figure S3b). When both clustering and optimal leaf ordering are enabled, it is easy to see which two motifs are the most dissimilar and to recognize groups of motifs (Figure S3c). Even within these groups, it is easy to determine the two motifs that differ the most.

Alternative combinations of stack heights and symbol weights
We consider two motifs represented by two PWMs $p$ and $q$. The height of symbol $a$ in the symbol stack at position $\ell$ of the difference logo is denoted $H_{\ell,a}$ and given by

$$H_{\ell,a} = r_{\ell,a} \cdot H_{\ell},$$

where $H_{\ell}$ represents the height of the symbol stack at position $\ell$ and the weight $r_{\ell,a}$ represents the proportion of symbol $a \in A$ in the symbol stack at position $\ell$, where $A$ is the alphabet. We calculate $H_{\ell,a}$ for different measures $H_{\ell}$ and $r_{\ell,a}$ to emphasize different facets of distribution differences. We propose various alternatives to calculate the measures $H_{\ell}$ and $r_{\ell,a}$ as follows (illustrated in supplementary Table S1). In the following sections, the information content of a PWM $p$ at position $\ell$ is denoted $H^{p}_{\ell}$ and given by

$$H^{p}_{\ell} = \log_2(|A|) + \sum_{a \in A} p_{\ell,a} \log_2(p_{\ell,a}),$$

where $p_{\ell,a}$ is the probability of symbol $a$ at position $\ell$ in PWM $p$. $H^{q}_{\ell}$ is defined analogously.
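The two definitions above can be illustrated with a minimal Python sketch. It is not taken from the DiffLogo R package; the function names `information_content` and `symbol_heights` are hypothetical, and a PWM column is assumed to be a plain list of probabilities that sums to one.

```python
import math

def information_content(column):
    """Information content (bits) of one PWM column:
    log2(|A|) + sum_a p_a * log2(p_a)."""
    # Symbols with probability 0 contribute nothing (p * log2(p) -> 0).
    entropy_term = sum(p * math.log2(p) for p in column if p > 0)
    return math.log2(len(column)) + entropy_term

def symbol_heights(stack_height, weights):
    """Height of every symbol in one stack: H_{l,a} = r_{l,a} * H_l."""
    return [r * stack_height for r in weights]

uniform = [0.25, 0.25, 0.25, 0.25]    # no preference among A, C, G, T
conserved = [1.0, 0.0, 0.0, 0.0]      # perfectly conserved position

print(information_content(uniform))    # 0.0
print(information_content(conserved))  # 2.0
print(symbol_heights(2.0, [0.5, -0.5, 0.0, 0.0]))  # [1.0, -1.0, 0.0, 0.0]
```

For a DNA alphabet the information content ranges from 0 bits (uniform column) to 2 bits (fully conserved column), which is the familiar height range of a classic sequence logo.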

Jensen-Shannon divergence
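The Jensen-Shannon divergence is a measure for the difference of two probability distributions based on information theory. The Jensen-Shannon divergence at position $\ell$ is denoted by $H^{(i)}_{\ell}$ and given by

$$H^{(i)}_{\ell} = \frac{1}{2} \sum_{a \in A} p_{\ell,a} \log_2 \frac{p_{\ell,a}}{m_{\ell,a}} + \frac{1}{2} \sum_{a \in A} q_{\ell,a} \log_2 \frac{q_{\ell,a}}{m_{\ell,a}},$$

where $m_{\ell,a} = \frac{1}{2}(p_{\ell,a} + q_{\ell,a})$. $H^{(i)}_{\ell}$ is symmetric and limited to $[0, 1]$. This measure especially emphasizes large distribution differences.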
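As an illustration, the Jensen-Shannon divergence of two PWM columns can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the package code: the function name `js_divergence` is hypothetical, both columns are assumed to be probability lists over the same alphabet, and base-2 logarithms are used so that the result lies in [0, 1].

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) of two PWM columns p and q."""
    # Mixture distribution m_a = (p_a + q_a) / 2.
    m = [(pa + qa) / 2 for pa, qa in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with x_a == 0 vanish.
        # y_a > 0 whenever x_a > 0 because y is the mixture m.
        return sum(xa * math.log2(xa / ya) for xa, ya in zip(x, y) if xa > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

only_a = [1.0, 0.0, 0.0, 0.0]
only_c = [0.0, 1.0, 0.0, 0.0]

print(js_divergence(only_a, only_a))  # 0.0  (identical columns)
print(js_divergence(only_a, only_c))  # 1.0  (disjoint columns)
```

The two calls show both ends of the range: identical columns give 0 and columns without any shared symbol give 1.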

Change of information content (stack)
The change of information content (stack) is a measure for the absolute change of information content between two probability distributions. The change of information content (stack) at position $\ell$ is denoted by $H^{(ii)}_{\ell}$ and given by

$$H^{(ii)}_{\ell} = \sum_{a \in A} \left| H^{p}_{\ell} \, p_{\ell,a} - H^{q}_{\ell} \, q_{\ell,a} \right|.$$

$H^{(ii)}_{\ell}$ is symmetric and limited to $[0, 2 \log_2(|A|)]$. This measure especially emphasizes large changes of information content.
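A minimal Python sketch of this stack height follows. It rests on two assumptions that should be read as such: the symbol-specific information content of symbol $a$ is taken to be the column's information content times the symbol probability (as in a classic sequence logo), and the stack height is taken to be the sum of the absolute symbol-specific differences, which is consistent with the stated upper limit of $2 \log_2(|A|)$. The function names are hypothetical.

```python
import math

def information_content(column):
    """Information content (bits) of one PWM column."""
    return math.log2(len(column)) + sum(
        p * math.log2(p) for p in column if p > 0
    )

def ic_change_stack(p, q):
    """Sum over symbols of |H^p * p_a - H^q * q_a| (assumed definition)."""
    hp = information_content(p)
    hq = information_content(q)
    return sum(abs(hp * pa - hq * qa) for pa, qa in zip(p, q))

only_a = [1.0, 0.0, 0.0, 0.0]
only_c = [0.0, 1.0, 0.0, 0.0]

print(ic_change_stack(only_a, only_a))  # 0.0
print(ic_change_stack(only_a, only_c))  # 4.0, i.e. 2 * log2(4)
```

Two fully conserved DNA columns with different symbols reach the upper limit of 4 bits, because 2 bits of information are lost for one symbol and 2 bits are gained for the other.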

Relative change of information content
The relative change of information content is a measure for the absolute change of information content relative to the average information content of the two probability distributions. The relative change of information content at position $\ell$ is denoted by $H^{(iii)}_{\ell}$. $H^{(iii)}_{\ell}$ is symmetric and limited to $[0, 2 \log_2(|A|)]$. This measure especially emphasizes large changes of information content relative to the information content of the given distributions.

Change of probabilities (stack)
The change of probabilities (stack) is a measure for the absolute change of probabilities between two probability distributions. The change of probabilities (stack) at position $\ell$ is denoted by $H^{(iv)}_{\ell}$ and given by

$$H^{(iv)}_{\ell} = \sum_{a \in A} \left| p_{\ell,a} - q_{\ell,a} \right|.$$

Change of probability (symbol)
The change of probability (symbol) is a measure for the change of symbol-specific probability relative to the sum of absolute symbol-specific probability differences of the given probability distributions. The change of probability (symbol) of symbol $a$ at position $\ell$ is denoted by $r^{(i)}_{\ell,a}$ and given by

$$r^{(i)}_{\ell,a} = \frac{p_{\ell,a} - q_{\ell,a}}{\sum_{a' \in A} \left| p_{\ell,a'} - q_{\ell,a'} \right|}.$$

$r^{(i)}_{\ell,a}$ is antisymmetric and limited to $[-\frac{1}{2}, \frac{1}{2}]$. This measure especially emphasizes a large change of symbol-probability. For each position of the difference logo, the height of the symbol stack with positive measures $r_{\ell,a}$ equals the height of the symbol stack with negative measures $r_{\ell,a}$, because each gain of symbol-probability implies a loss of probability for the remaining symbols and vice versa.
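The probability-based stack height and symbol weight can be sketched together in Python. This is an illustrative sketch rather than the package implementation: the function names are hypothetical, the stack height is assumed to be the sum of absolute probability differences, and returning all-zero weights for identical columns is a convention chosen here to avoid a division by zero.

```python
def probability_change_stack(p, q):
    """Sum of absolute symbol-specific probability differences."""
    return sum(abs(pa - qa) for pa, qa in zip(p, q))

def probability_change_weights(p, q):
    """Signed probability difference per symbol, normalized by the
    sum of absolute differences; values lie in [-1/2, 1/2]."""
    total = probability_change_stack(p, q)
    if total == 0:
        # Identical columns: nothing to draw (assumed convention).
        return [0.0] * len(p)
    return [(pa - qa) / total for pa, qa in zip(p, q)]

p = [0.75, 0.25, 0.0, 0.0]
q = [0.25, 0.75, 0.0, 0.0]

print(probability_change_stack(p, q))    # 1.0
print(probability_change_weights(p, q))  # [0.5, -0.5, 0.0, 0.0]
```

In the example, the positive weights and the negative weights each sum to one half in absolute value, so the upper and the lower part of the symbol stack have equal height.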

Change of information content (symbol)
The change of information content (symbol) is a measure for the symbol-specific change of information content relative to the sum of absolute symbol-specific differences of information content of the given probability distributions. The change of information content (symbol) of symbol $a$ at position $\ell$ is denoted by $r^{(ii)}_{\ell,a}$ and given by

$$r^{(ii)}_{\ell,a} = \frac{H^{p}_{\ell} \, p_{\ell,a} - H^{q}_{\ell} \, q_{\ell,a}}{\sum_{a' \in A} \left| H^{p}_{\ell} \, p_{\ell,a'} - H^{q}_{\ell} \, q_{\ell,a'} \right|}.$$

$r^{(ii)}_{\ell,a}$ is antisymmetric and limited to $[-1, 1]$. This measure especially emphasizes a large change of symbol-specific information content.
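The following Python sketch illustrates this symbol weight. It assumes that the symbol-specific information content is the column's information content multiplied by the symbol probability, as in a classic sequence logo; the function names and the all-zero result for columns without any difference are choices made for this example only.

```python
import math

def information_content(column):
    """Information content (bits) of one PWM column."""
    return math.log2(len(column)) + sum(
        p * math.log2(p) for p in column if p > 0
    )

def ic_change_weights(p, q):
    """Signed symbol-specific change of information content, normalized
    by the sum of absolute changes; values lie in [-1, 1]."""
    hp = information_content(p)
    hq = information_content(q)
    diffs = [hp * pa - hq * qa for pa, qa in zip(p, q)]
    total = sum(abs(d) for d in diffs)
    if total == 0:
        # No symbol-specific change at all (assumed convention).
        return [0.0] * len(p)
    return [d / total for d in diffs]

conserved = [1.0, 0.0, 0.0, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]

print(ic_change_weights(conserved, uniform))  # [1.0, 0.0, 0.0, 0.0]
print(ic_change_weights(uniform, conserved))  # [-1.0, 0.0, 0.0, 0.0]
```

Unlike the probability-based weight, a single symbol can take the full weight of 1 here: a uniform column carries no information, so the whole change is attributed to the conserved symbol.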

Table S1: Comparison of different stack heights and symbol weights using two pairs of CTCF motifs. We compare the four measures 'Jensen-Shannon divergence' (row 1), 'Change of information content (stack)' (row 2), 'Relative change of information content' (row 3), and 'Change of probabilities (stack)' (row 4) for the stack heights and the two measures 'Change of probability (symbol)' (column 1) and 'Change of information content (symbol)' (column 2) for the symbol weights. Depending on the measures used, we emphasize different facets of distribution differences and, consequently, the difference logos change dramatically.

Figure S4: STAMP. Stack of sequence logos of the CTCF motifs for GM12878, K562, H1-hESC, and HUVEC generated by STAMP.