JSON-Graph specifications of nucleotide motif representation
We used the JSON-Graph format to describe nucleotide motifs in order to make them intelligible and malleable. The schema of the JSON-Graph format is illustrated below:
The contents within the two curly braces describe a DNA or RNA motif. Specifically, the “id” keyword specifies the name of the motif. The “background” keyword designates the nucleotide frequencies (in the order A, T, C and G) of the relevant genomic background. For example, when studying motifs in the human genome, these frequencies are computed from the human reference genome and used as the background distribution. By default, they are set to 0.25, representing equal frequencies. The “pseudocounts” keyword represents the extra nucleotide counts added to each position of the motif to avoid division-by-zero errors in small data sets; these are set to 0.25 for each nucleotide by default. The “nodes” section describes various properties of motif residues using the following keywords: a) the “index” keyword specifies the sequential order (anticlockwise) of the nucleotide stacks; b) the “label” keyword denotes the identity of each nucleotide stack; c) the “bit” keyword gives the information content calculated for each nucleotide stack; d) the “base” keyword lists the four nucleotides sorted in ascending order of their frequencies, which are given by the “freq” keyword. The “links” section describes the pairwise dependencies between nucleotide stacks using the following keywords: a) the “source” and “target” keywords denote the start and end positions of the linked nucleotide stacks; b) the “value” keyword indicates the width of the link, which is proportional to the strength of dependence between the two linked positions.
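To make the keyword descriptions above concrete, the following minimal sketch assembles a two-position motif as a Python dictionary and serializes it to JSON. It is an illustration under stated assumptions rather than the schema shipped with CircularLogo: the motif name, the frequencies, the link value, the 0-based indices, the use of plain lists for “background” and “pseudocounts”, and the helper `make_node` are all hypothetical, and the “bit” value is computed here simply as 2 minus the Shannon entropy (uniform background, no small-sample correction).

```python
import json
import math


def make_node(index, label, freqs):
    """Build one "nodes" entry from a {base: frequency} mapping (hypothetical helper)."""
    # Sort bases in ascending order of frequency, as described for "base"/"freq".
    ordered = sorted(freqs.items(), key=lambda kv: kv[1])
    # Assumption: information content = log2(4) - Shannon entropy of the stack.
    bit = 2.0 + sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return {
        "index": index,
        "label": label,
        "bit": round(bit, 4),
        "base": [base for base, _ in ordered],
        "freq": [p for _, p in ordered],
    }


# All values below are made up for illustration only.
motif = {
    "id": "example_motif",
    "background": [0.25, 0.25, 0.25, 0.25],    # order: A, T, C, G
    "pseudocounts": [0.25, 0.25, 0.25, 0.25],  # order: A, T, C, G
    "nodes": [
        make_node(0, "1", {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}),
        make_node(1, "2", {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}),
    ],
    "links": [
        {"source": 0, "target": 1, "value": 0.8},
    ],
}

print(json.dumps(motif, indent=2))
```

Keeping “base” and “freq” as parallel lists sorted by frequency mirrors the description above, so the most frequent nucleotide is always the last element of each stack.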
CircularLogo web server
The CircularLogo web application uses the NGINX (https://www.nginx.com/) web server with the uWSGI (https://pypi.python.org/pypi/uWSGI) gateway interface to handle multiple concurrent client requests. The application is hosted on Amazon Elastic Compute Cloud (Amazon EC2).
Measure intra-motif dependencies using the χ² statistic
We implemented two metrics to calculate the dependence between a pair of nucleotide positions: mutual information and the χ² statistic. The χ² statistic is widely used to test the independence of two categorical variables, and the corresponding Q score is a natural measure of the dependency between two events. It is computed as follows. Let us assume that a DNA motif is l nucleotides long and is built from N sequences. For any two positions i and j within the motif (1 ≤ i ≤ l, 1 ≤ j ≤ l, i ≠ j), the observed di-nucleotide frequency is denoted as O_ij, which can be obtained by counting di-nucleotide combinations in the N input sequences. The expected di-nucleotide frequency is denoted as E_ij. The χ² statistic score is then calculated as:
$$ Q=\sum_{k=1}^{m}\frac{\left(O_{ij}^{k}-E_{ij}^{k}\right)^{2}}{E_{ij}^{k}},\quad Q\sim \chi^{2}\left(m-1\right),\quad m=16,\quad O_{ij}\in \left[AA, AT, AC, AG,\dots \right] $$
Here, m is the total number of possible di-nucleotides (4² = 16).
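As an illustration of the Q score, the sketch below computes it for two columns of a set of aligned sequences. It rests on assumptions not stated in the text: the expected count of each di-nucleotide is taken as the product of the two single-position counts divided by N (the usual independence expectation), positions are 0-based, and cells whose expected count is zero are skipped rather than smoothed with pseudocounts. The function name `chi_square_q` is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def chi_square_q(sequences, i, j):
    """Q score between columns i and j (0-based) of equal-length sequences."""
    n = len(sequences)
    observed = Counter(seq[i] + seq[j] for seq in sequences)
    count_i = Counter(seq[i] for seq in sequences)
    count_j = Counter(seq[j] for seq in sequences)

    q = 0.0
    for a, b in product(BASES, repeat=2):  # the m = 16 di-nucleotides
        # Assumption: expected count under independence of the two positions.
        expected = count_i[a] * count_j[b] / n
        if expected == 0:
            continue  # skipped here; pseudocounts would avoid this case
        q += (observed[a + b] - expected) ** 2 / expected
    return q


# Two toy alignments: fully coupled columns versus fully independent columns.
coupled = ["AA", "CC", "GG", "TT"] * 5
independent = [a + b for a, b in product(BASES, repeat=2)]
print(chi_square_q(coupled, 0, 1), chi_square_q(independent, 0, 1))
```

In the coupled toy alignment every sequence carries the same base at both positions, so Q is large; in the independent one every di-nucleotide occurs exactly as often as expected, so Q is zero.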
Measure intra-motif dependencies using mutual information
The second built-in metric of dependence is mutual information. This metric quantifies the mutual dependence between two discrete random variables X (X = [A, C, G, T]) and Y (Y = [A, C, G, T]), and it is defined as:
$$ I\left(X;Y\right)=\sum_{y\in Y}\sum_{x\in X} p\left(x,y\right)\log\left(\frac{p\left(x,y\right)}{p\left(x\right)p\left(y\right)}\right) $$
Here, x (x ∈ [A, C, G, T]) and y (y ∈ [A, C, G, T]) represent nucleotides at two nucleotide stacks X and Y, respectively. p(x) and p(y) denote the nucleotide frequencies of x and y, and p(x, y) denotes the frequency of the di-nucleotide xy formed from X and Y. The significance of the dependency between two positions was evaluated using Chebyshev’s inequality: if the observed mutual information exceeds the value expected under a random background model by K standard deviations, then P ≤ 1/K².
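The mutual information and the Chebyshev bound can be sketched as follows. Several choices here are assumptions rather than details given in the text: the logarithm is taken to base 2, frequencies are estimated without pseudocounts, and the random background is approximated by shuffling one of the two columns a fixed number of times. The function names and the `n_shuffles` and `seed` parameters are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(sequences, i, j):
    """I(X;Y) in bits between columns i and j (0-based) of aligned sequences."""
    n = len(sequences)
    pair = Counter((s[i], s[j]) for s in sequences)
    px = Counter(s[i] for s in sequences)
    py = Counter(s[j] for s in sequences)
    mi = 0.0
    for (x, y), count in pair.items():
        pxy = count / n
        mi += pxy * math.log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi


def chebyshev_p_bound(sequences, i, j, n_shuffles=200, seed=0):
    """Upper bound P <= 1/K^2 for the observed mutual information."""
    rng = random.Random(seed)
    observed = mutual_information(sequences, i, j)
    col_i = [s[i] for s in sequences]
    col_j = [s[j] for s in sequences]

    # Assumption: the null model is obtained by shuffling column j.
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(col_j)
        shuffled = [a + b for a, b in zip(col_i, col_j)]
        null.append(mutual_information(shuffled, 0, 1))

    mean = sum(null) / len(null)
    sd = math.sqrt(sum((v - mean) ** 2 for v in null) / (len(null) - 1))
    if sd == 0 or observed <= mean:
        return 1.0  # the bound is uninformative in this case
    k = (observed - mean) / sd  # distance from the null mean in stdev units
    return min(1.0, 1.0 / k ** 2)


coupled = ["AA", "CC", "GG", "TT"] * 5
print(mutual_information(coupled, 0, 1), chebyshev_p_bound(coupled, 0, 1))
```

Chebyshev’s inequality makes no distributional assumption about the null values, which is why the bound is conservative compared with a parametric test.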
HNF6 motif analysis
HNF6 ChIP-exo data were obtained from ArrayExpress (accession number E-MTAB-2060; http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-2060/) and processed with MACE [19], and HNF6 binding sites were extracted. The 5549 sequences of 65 nucleotides each (20 nucleotides upstream + 25-nucleotide HNF6 binding site + 20 nucleotides downstream) were published at https://sourceforge.net/projects/circularlogo/files/test/. All sequences were aligned by the HNF6 motif, which spans position 29 to position 36.
tRNA sequence analysis
A total of 1114 tRNA sequences were downloaded from the RFAM database [20] in RFAM ‘seed’ alignment format (accession # RF00005; https://correlogo.ncifcrf.gov/ccrnp/trnafull.html). After excluding sequences with gaps in the alignment, 291 sequences were used as the final dataset to generate the circular logo of tRNA (https://sourceforge.net/projects/circularlogo/files/test/). Mutual information was used as the metric to measure intra-motif dependencies. The lowest-scoring 33% of links were filtered out.
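The two filtering steps in this analysis (removing gapped sequences and discarding the weakest links) might look like the sketch below. It is only one possible reading of the text: gaps are assumed to be encoded as "-" or ".", and "the lowest 33%" is interpreted as the bottom 33% of links by rank of their "value" field. Both function names are hypothetical.

```python
def drop_gapped(alignment, gap_chars="-."):
    """Keep only aligned sequences that contain no gap character."""
    return [seq for seq in alignment if not any(ch in gap_chars for ch in seq)]


def filter_weak_links(links, fraction=0.33):
    """Drop the lowest-valued `fraction` of links (by rank), keep the rest."""
    ranked = sorted(links, key=lambda link: link["value"])
    n_drop = int(len(ranked) * fraction)  # rounds down
    return ranked[n_drop:]


# Toy data, made up for illustration.
alignment = ["GCAUUGG", "GC-UUGG", "GCA.UGG"]
links = [{"source": s, "target": s + 1, "value": v}
         for s, v in enumerate([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])]

print(drop_gapped(alignment))
print(filter_weak_links(links))
```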
Synthesized DNA fragments of splice sites and branch-points for analysis
We represented the splicing motif with synthesized DNA fragments built by concatenating the 5′ donor site (16 bp), the branch-point (21 bp) and the 3′ acceptor site (16 bp). Briefly, a total of 59,359 predefined, high-confidence human branch-points were downloaded from the supplementary data of the study [21]. We excluded introns with multiple branch-points, small introns (<1 kb) and introns with a small gap (≤25 bp) between the branch-point and the acceptor site. For each of the remaining introns, we first extracted 6 bp upstream and 10 bp downstream of the 5′ donor site. Second, we extracted a 21 bp DNA sequence encompassing the branch-point by extending 10 bp both upstream and downstream of the branch-point. Third, we extracted 10 bp upstream and 6 bp downstream of the 3′ acceptor site. Finally, we concatenated these three DNA sequences in the order “5′ donor site–branch-point–3′ acceptor site” to form a 53 bp DNA fragment. We used a final set of 10,316 DNA fragments to generate the circular logo (https://sourceforge.net/projects/circularlogo/files/test/).
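The fragment construction can be sketched as a single function. The coordinate conventions are assumptions made for this example and are not specified in the text: the sequence is given in transcript orientation, `donor` is the 0-based index of the first intronic base, `branch` is the index of the branch-point nucleotide, and `acceptor` is the index of the first exonic base after the intron. The function name `splice_fragment` is hypothetical, and the check for introns with multiple branch-points is left out.

```python
def splice_fragment(seq, donor, branch, acceptor):
    """Return the 53 bp donor + branch-point + acceptor fragment, or None."""
    # Exclusion criteria from the text, under the coordinate assumptions above.
    if acceptor - donor < 1000:      # small intron (<1 kb)
        return None
    if acceptor - branch <= 25:      # small gap between branch-point and acceptor
        return None
    if donor < 6 or acceptor + 6 > len(seq):
        return None                  # not enough flanking sequence

    donor_part = seq[donor - 6:donor + 10]          # 6 bp upstream + 10 bp downstream
    branch_part = seq[branch - 10:branch + 11]      # 10 bp + branch-point + 10 bp
    acceptor_part = seq[acceptor - 10:acceptor + 6] # 10 bp upstream + 6 bp downstream
    return donor_part + branch_part + acceptor_part


# Toy sequence: 10 exonic A's, a 1200 bp intron of G's, 10 exonic C's.
toy = "A" * 10 + "G" * 1200 + "C" * 10
fragment = splice_fragment(toy, donor=10, branch=1180, acceptor=1210)
print(len(fragment), fragment)
```

The three slices have lengths 16, 21 and 16, which gives the 53 bp fragment described above.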