The exceptional genomic word symmetry along DNA sequences
 Vera Afreixo^{1, 3, 4}Email author,
 João M. O. S. Rodrigues^{2, 4},
 Carlos A. C. Bastos^{2, 4} and
 Raquel M. Silva^{3, 4}
https://doi.org/10.1186/s1285901609050
© Afreixo et al. 2016
Received: 21 September 2015
Accepted: 19 January 2016
Published: 3 February 2016
Abstract
Background
The second Chargaff’s parity rule and its extensions are recognized as universal phenomena in DNA sequences. However, parity of the frequencies of reverse complementary oligonucleotides could be a mere consequence of the single nucleotide parity rule, if nucleotide independence is assumed. Exceptional symmetry (symmetry beyond that expected under an independent nucleotide assumption) was proposed previously as a meaningful measure of the extension of the second parity rule to oligonucleotides. The global exceptional symmetry was detected in long and short genomes.
Results
To explore the exceptional genomic word symmetry along the genome sequences, we propose a sliding window method to extract the values of exceptional symmetry (for all words or by word groups). We compare the exceptional symmetry effect size distribution in all human chromosomes against control scenarios (positive and negative controls), testing the differences and performing a residual analysis. We explore local exceptional symmetry in equivalent composition word groups, and find that the behaviour of the local exceptional symmetry depends on the word group.
Conclusions
We conclude that the exceptional symmetry is a local phenomenon in genome sequences, with distinct characteristics along the sequence of each chromosome. The local exceptional symmetry along the genomic sequences shows outlying segments, and those segments have high biological annotation density.
Keywords
Background
Chargaff’s first parity rule states that, in any sequence of doublestranded DNA molecules, the total number of complementary nucleotides is exactly equal [1]. Chargaff’s second parity rule states that those quantities are almost equal in a single strand of DNA [2–4], and this phenomenon holds in almost all living organisms.
The extension to the second parity rule is also known as single strand symmetry phenomenon. The single strand symmetry states that, in each DNA strand, the proportion of an oligonucleotide should be similar to that of its reverse complement [5–8]. There is no knowledge about why the parity is needed and there is no consensual explanation for the occurrence of the single strand phenomenon. There are some attempts to explain the phenomenon related with the species evolution process, for example: stemloops hypothesis [9]; duplication followed by inversion hypothesis [10]; inversions and inverted transposition hypothesis [11]; no strand bias [12]; original trait of the primordial genome [8].
Powdel and others [13] studied the symmetry phenomenon in nonoverlapping regions of DNA of specific size. They analysed the frequency distributions of the local abundance of oligonucleotides along a single strand of DNA, and found that the frequency distributions of reverse complementary oligonucleotides tend to be statistically similar. Afreixo et al. [14] introduced a new symmetry measure, which emphasizes that the frequency of an oligonucleotide is more similar to the frequency of its reverse complement than to the frequencies of other equivalent composition oligonucleotides. They also identified several word groups with a strong exceptional symmetry. Here, we have applied this measure to find genomic regions with very strong exceptional symmetry effect and to characterize their nonuniform behaviour. We observed exceptional symmetry throughout the human genome. Moreover, some regions showed outlying exceptional symmetry, and those are enriched in proteincoding annotated genes.
Methods
Materials
We analysed the whole human genome, reference assembly build 37.3, available from the website of the National Center for Biotechnology Information. In our data processing, the chromosomes were processed as separate sequences, words were counted with overlap. We also produced and used random control experiments. Those experiments tried to mimic some features of each human chromosome and contained the same number of base pairs of the corresponding chromosome (see ‘Control experiments’ subsection).
We obtained the coding sequences (cds file) for all the transcripts of the human genome (release 75) from Ensembl (http://www.ensembl.org/), to use in coding vs noncoding region classification.
Exceptional genomic word symmetry
In a previous work, we proposed the concept of exceptional genomic word symmetry in equivalent composition groups (ECG), and globally [14]. Exceptional symmetry is a refinement of Chargaff’s second parity rule that highlights the words whose frequencies of occurrence are similar to those of their reversed complements, but are dissimilar to the frequencies of occurrence of other words with equivalent composition. Words of equal length are defined to have equivalent composition if they contain the same number of nucleotides A or T.
Some words are equal to their reverse complement. We denote these as self symmetric words (SSW). We also define a symmetric word pair as the set composed by one word w and the corresponding reverse complement word w ^{′}, with (w ^{′})^{′}=w.

N _{ m } the number of elements in G _{ m }.

\(N^{\mathit {SSW}}_{m}\) the number of elements in G _{ m } which are self symmetric words.

\({N^{0}_{m}}\) the number of symmetric word pairs in G _{ m }, excluding the SSWs, such that both words in the pair are absent from the nucleotide sequence under study.

n _{ w } the frequency of occurrence of word w in a nucleotide sequence.

N _{ m } the frequency of occurrence of words from group G _{ m } in a nucleotide sequence.
According to the exceptional symmetry concept, if V R(G _{ m })≈1, there is no exceptional symmetry, but if V R(G _{ m })≫1, there is exceptional symmetry.
where \({X^{2}_{i}}=\sum _{m \in \{0,\ldots,k\}}{{X^{2}_{i}}(G_{m})}\), \(df_{i}=1+\sum _{m \in \{0,\ldots,k\}}{\left (df_{i}(G_{m})+1\right)}\) and i∈{s,u}.
The exceptional genomic word symmetry values were determined in all nonoverlapping subchromosomal regions (windows) of several specific sizes (1000 bp, 2000 bp, 5000 bp and corresponding multiples of 10, up to the size of the chromosomes). The starting window size (1000 bp) was established taking into account the maximum word size under study (k = 10) and the expected number of words in each ECG assuming uniform word distribution: as expected value we fixed at least one word in the smallest ECGs, G _{0} and G _{ k }. However, note that for large k, the shorter windows (1000 bp and 2000 bp) may not include enough ocurrences in the smallest ECGs to provide a good estimate of V R(G _{ m }).
Control experiments

random (rnd): assuming independence and using the human chromosome nucleotide composition as input. There are small differences between the frequencies of occurrence of complementary nucleotides. Moreover, in this scenario the expected probabilities of the reverse complements are not equal but there are words in an ECG (e.g. ATT, TAT, TTA) with equal expected probabilities.
Input: nucleotide probabilities (π _{ A }, π _{ C }, π _{ G }, π _{ T }, where π _{ w } denotes the probability of w).

random symmetric (sym): assuming independence and using the same composition for complementary nucleotides as input. In this scenario the expected probabilities of ECG words are the same.
Input: nucleotide probabilities (π _{ A }, π _{ C }, π _{ G }, π _{ T }, subject to \({\pi _{w}}={\pi _{w^{\prime }}}\phantom {\dot {i}\!}\) with w∈{A,C,G,T}).

random with firstorder dependence (mrnd): assuming first order Markov structure using the human chromosome nucleotide and dinucleotide composition as inputs.
Input: matrix of nucleotide transition probabilities (\(\mathbf {P}= \,[\pi _{K_{1}K_{2}\phantom {\dot {i}\!}}/\pi _{K_{1}}]\phantom {\dot {i}\!}\) with K _{1}, K _{2}∈{A,C,G,T}) and initial probabilities (π _{ A }, π _{ C }, π _{ G }, π _{ T }).

random exceptional symmetric with firstorder dependence (msym): assuming first order Markov structure using the human chromosome nucleotide and dinucleotide composition and using the same composition for inverted complement dinucleotides as inputs.
Input: matrix of nucleotide transition probabilities (\(\mathbf {P}=\,[\pi _{K_{1}K_{2}}/\pi _{K_{1}}]\) with K _{1}, K _{2}∈{A,C,G,T}, subject to \({\pi _{w}}={\pi _{w^{\prime }}}\phantom {\dot {i}\!}\) with w∈{A A,A C,…,T T}) and initial probabilities (π _{ A }, π _{ C }, π _{ G }, π _{ T }, subject to \({\pi _{w}}={\pi _{w^{\prime }}\phantom {\dot {i}\!}}\) with w∈{A,C,G,T}).
Coding region classification
We extracted the start and end positions of all known coding sequences from the Ensembl cds file whose gene biotype was “protein coding”. For genes with multiple transcripts, the gene start position was considered as the minimum start position of all the transcripts of that gene, and the end position as the maximum end position of the same transcripts. For each chromosome, for a given word length k and window size, windows that intercept a gene were labeled as coding neighbourhood windows, windows that do not intercept any gene were labeled as noncoding windows.
Isochores region classification
We used the IsoSegmenter program [15] with the default parameters, to classify the human genome in isochore families: L1, L2, H1, H2, H3. For each chromosome, for a given word length k and window size, windows fully included in an isochore were labeled with the corresponding isochore family. Windows spanning more than one isochore were discarded.
DNA segmentation procedure
In order to evaluate the association between the local exceptional symmetry values and their biological relevance we propose a threshold based method to perform DNA segmentation into high and low exceptional symmetry regions.
To perform the DNA segmentation on the exceptional symmetry profile (the sequence of exceptional symmetry values, also referred to as V R sequences), we need to choose an adequate window size and word length. The window size and the word length which show the widest diversity of local behaviours along the sequence have the potential to perform a good sequence segmentation. So, to explore the variability of local behaviours we evaluate the strict stationarity using the Kolmogorov Smirnov (KS) statistic.

the V R chromosome sequence (the sequence of V R values in each chromosome) is divided in successive nonoverlapping subsequences (V R subsequences) with a fixed length (50, 100, 200);

for each word length and for each window size, we compute the KS statistic between the V R distribution of each subsequence and the V R distribution of its complete chromosome sequence;

to characterise the lack of stationarity in each exceptional symmetry experiment (defined by the window size and the word length) we compute the average of all KS statistics obtained from V R subsequences.

the window size and word length of the exceptional symmetry experiment with the highest average of KS values are chosen.

we determined the quartiles of the V R chromosome sequence;

we calculated the outlier threshold as the third quartile (Q _{3}) plus 1.5 times the interquartile range (I Q R): Q _{3}+1.5∗I Q R;

we identified the windows with V R≥Q _{3}+1.5∗I Q R as the regions with very high local exceptional symmetry (outlying regions) and the other regions with V R<Q _{3}+1.5∗I Q R as the regions without very high local exceptional symmetry (nonoutlying regions).
Functional annotation enrichments
where \({n^{A}_{S}}\) denotes the number of annotations in the subset S. We used the chisquare test to evaluate if the annotations are equally distributed in the two subsets. To better evaluate the diferences between both subsets, we used the adjusted residual analysis. Under the homogeneity hypothesis the adjusted residuals have a standard normal distribution [16].
Results and discussion
In this study, we analysed local exceptional word symmetry in the complete human genome. In particular, we analysed words of lengths up to 10 in all human chromosomes. We performed a sliding window analysis in terms of exceptional symmetry (V R). We obtained, when possible, results for the following window sizes: 10^{ l }, 2×10^{ l }, 5×10^{ l } base pairs, with l∈{3,4,5,6,7,8}.
We performed our analysis using five ACGT sequence types: real human chromosomes, and corresponding simulated sequences generated according to four distinct random scenarios. For each fixed window size and word length we determined the exceptional symmetry (V R) and symmetry \(\left ({X^{2}_{s}}\right)\) values. Each of these experiments is characterized by median and median absolute deviation values.
To evaluate the effect of chromosome type, window size and word length on the local exceptional symmetry behaviour, we considered the window V R median values of each ACGT sequence (chromosomes or corresponding random chromosomes).
Window size effect
The random sequences msym and mrand were generated with forced exceptional symmetry under stationary behaviour, and an increasing tendency was observed on their local exceptional symmetry values as a function of the window size. For the random sequences without exceptional symmetry (sym and rnd) the local V R values are nearly constant (see Fig. 4).
All human chromosomes exhibit increased exceptional symmetry with increasing window lengths. In general, the behaviour is similar to the random sequences with first order Markov structure. However, we can observe higher values in the first order Markov sequences than in the human sequences.
Local exceptional symmetry stationarity
Local exceptional ECG symmetry
As G _{0} and G _{ k } are the sets with fewer elements, higher variability in V R results is expected, and this was confirmed in all sequences under study (results not shown). We verified that almost all human ECGs have higher V R values than the random scenarios.
Segmentation
Outlying segments description by human chromosome for k=7 and 20,000 window size
Chr  1  2  3  4  5  6  7  8  9  10  11  12 

Outlying segments %  4.8  6.1  6.4  7.6  7.7  6.5  9.1  6.8  4.7  5.3  6.0  6.5 
A D R  3.5  3.9  4.1  4.3  3.8  3.6  2.9  3.5  4.0  3.3  3.3  6.9 
χ ^{2} test (pvalue)  **  **  **  **  **  **  **  **  **  **  **  ** 
ϕ∗100 %  9  9  10  14  13  14  12  14  13  12  13  12 
Chr  13  14  15  16  17  18  19  20  21  22  X  Y 
Outlying segments %  5.6  4.5  4.4  3.5  3.0  6.7  1.5  5.5  5.0  2.6  8.6  7.0 
A D R  5.5  4.3  3.8  3.1  2.2  4.5  1.2  2.5  5.2  2.5  3.8  1.5 
χ ^{2} test (pvalue)  **  **  **  **  **  **  **  0.003  **  1  **  0.038 
ϕ∗100  16  15  11  7  6  14  8  9  15  8  14  19 
Annotation results
To characterize the chromosome features associated with the outlying segments of local exceptional symmetry, we have performed annotation enrichment analyses. Table 1 presents the annotation density ratio (A D R, Eq. 3) of the outlying segments vs nonoutlying segments by chromosome, as defined by the DNA segmentation procedure described in the ‘Methods’ section. We observe that in the human genome the A D R values are higher than 1 for all chromosomes, which means that the density of annotation in outlying segments is higher than in nonoutlying segments (average value equal to 3.6 and standard deviation equal to 1.2). Table 1 also shows the pvalues of the chisquare test for homogeneity of annotation types between outlying and nonoutlying segments. The pvalues were adjusted using the Holm–Bonferroni method [17]. Almost all chromosomes display significant diferences in annotation between segment types (outlying vs nonoutlying). In chromosome 22, however, the difference was not considered significant, perhaps due to the low percentage of outlying segments and chromosome size. Still, the dissimilarity effect between outlying and nonoutlying annotations is present in all chromosomes (phi measure (ϕ) range between 0.06 to 0.19).
Conclusion
The local exceptional symmetry profile provides a numerical signature along genomic sequences. The proposed procedure to analyse local exceptional symmetry in the human genome can be applied to any genomic sequence as a segmentation procedure and also as a genomic signature. The results obtained in this work suggest that for the human genome there is an optimal word length and window size to explore the local exceptional symmetry (7 and 20,000 bp, respectively).
The local exceptional symmetry in the human genome is very dissimilar from random scenarios (both with independent symbols or first order Markov structure) showing, as expected, a nonstationary behaviour. Globally, the human genome exhibits high local exceptional symmetry values, which for some word lengths are lower than the values for positive control experiments, but higher than the values for negative control experiments.
The global statistical pattern (location and dispersion values), which is obtained from the exceptional symmetry profiles, is present in all chromosomes of the human genome. The local profile is chromosome specific and the regions with very high exceptional symmetry values are strongly associated with the presence of protein coding genes, although noncoding regions also present exceptional symmetry. Additionally, the local exceptional symmetry values are positively correlated with the GC content as defined by isochore families.
Declarations
Acknowledgements
This work was supported by Portuguese funds through the iBiMED  Institute of Biomedicine, IEETA  Institute of Electronics and Telematics Engineering of Aveiro and the Portuguese Foundation for Science and Technology (“FCT–Fundação para a Ciência e a Tecnologia”), within projects: COMPETE/FEDER UID/BIM/04501/2013 and PEstOE/EEI/UI0127/2014.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Chargaff E. Chemical specificity of nucleic acids and mechanism of their enzymatic degradation. Experientia. 1950; 6(6):201–9.View ArticlePubMedGoogle Scholar
 Rudner R, Karkas JD, Chargaff E. Proc Nat Acad Sci USA. 1968; 60(2):630–5.Google Scholar
 Karkas JD, Rudner R, Chargaff E. Separation of B, subtilis DNA into complementary strands. II. template functions and composition as determined by transcription with RNA polymerase. Proc Nat Acad Sci USA. 1968; 60(3):915–20.PubMed CentralView ArticlePubMedGoogle Scholar
 Rudner R, Karkas JD, Chargaff E. Proc Nat Acad Sci USA. 1968; 60(3):921–2.Google Scholar
 Forsdyke DR. Evolutionary Bioinformatics. New York: Springer; 2011.View ArticleGoogle Scholar
 Qi D, Cuticchia AJ. Compositional symmetries in complete genomes. Bioinformatics. 2001; 17(6):557–9.View ArticlePubMedGoogle Scholar
 Kong SG, Fan WL, Chen HD, Hsu ZT, Zhou N, Zheng B, et al. Inverse symmetry in complete genomes and wholegenome inverse duplication. PLoS ONE. 2009; 4(11):7553.View ArticleGoogle Scholar
 Zhang SH, Huang YZ. Limited contribution of stemloop potential to symmetry of singlestranded genomic DNA. Bioinformatics. 2010; 26(4):478–85.View ArticlePubMedGoogle Scholar
 Forsdyke DR, Bell SJ. Purine loading, stemloops and Chargaff’s second parity rule: a discussion of the application of elementary principles to early chemical observations. Appl Bioinformatics. 2004; 3(1):3–8.View ArticlePubMedGoogle Scholar
 Baisnée PF, Hampson S, Baldi P. Why are complementary DNA strands symmetric?Bioinformatics. 2002; 18(8):1021–33.View ArticlePubMedGoogle Scholar
 AlbrechtBuehler G. Inversions and inverted transpositions as the basis for an almost universal “format” of genome sequences. Genomics. 2007; 90:297–305.View ArticlePubMedGoogle Scholar
 Lobry JR, Lobry C. Evolution of dna base composition under nostrandbias conditions when the substitution rates are not constant. Mol Biol Evol. 1999; 16:719–23.View ArticlePubMedGoogle Scholar
 Powdel B, Satapathy S, Kumar A, Jha P, Buragohain A, Borah M, et al. A study in entire chromosomes of violations of the intrastrand parity of complementary nucleotides (chargaff’s second parity rule). DNA Res. 2009; 16:325–43.PubMed CentralView ArticlePubMedGoogle Scholar
 Afreixo V, Rodrigues JAMOS, Bastos CAC. Analysis of singlestrand exceptional word symmetry in the human genome: new measures. Biostatistics. 2015; 16(2):209–21.View ArticlePubMedGoogle Scholar
 Cozzi P, Milanesi L, Bernardi G. Segmenting the human genome into isochores. Evol Bioinformatics. 2015; 11:253–61.Google Scholar
 Agresti A. Categorical Data Analysis. New York: Wiley; 2002.View ArticleGoogle Scholar
 Holm S. A simple sequentially rejective multiple test procedure. Scand J Stat. 1979; 6(2):65–70.Google Scholar