Statistical modelling of CG interdistance across multiple organisms

Background Statistical approaches to genetic sequences have revealed helpful to gain deeper insight into biological and structural functionalities, using ideas coming from information theory and stochastic modelling of symbolic sequences. In particular, previous analyses on CG dinucleotide position along the genome allowed to highlight its epigenetic role in DNA methylation, showing a different distribution tail as compared to other dinucleotides. In this paper we extend the analysis to the whole CG distance distribution over a selected set of higher-order organisms. Then we apply the best fitting probability density function to a large range of organisms (>4400) of different complexity (from bacteria to mammals) and we characterize some emerging global features. Results We find that the Gamma distribution is optimal for the selected subset as compared to a group of several distributions, chosen for their physical meaning or because recently used in literature for similar studies. The parameters of this distribution, when applied to our larger set of organisms, allows to highlight some biologically relavant features for the considered organism classes, that can be useful also for classification purposes. Conclusions The quantification of statistical properties of CG dinucleotide positioning along the genome is confirmed as a useful tool to characterize broad classes of organisms, spanning the whole range of biological complexity. Electronic supplementary material The online version of this article (10.1186/s12859-018-2303-2) contains supplementary material, which is available to authorized users.


Background
Recent studies revealed that dinucleotide interdistances can be a powerful tool for detecting DNA properties [1,2], such as the identification of CpG islands [3] and the characterization of epigenomic regulation through methylation [4,5]. In a previous paper [4], we higlighted a peculiar feature of mammals CG dinucleotides: the tail of CG interdistance distributions showed an exponential decay, at difference with non CG's which had a heavier tail more similar to a power law. This might be due to the specific role that CGs play inside mammals genomes, since they are the preferential sites of methylation, a fundamental epigenetic mechanism involved in gene regulation [6][7][8][9][10] and structural conformation of chromatine *Correspondence: gastone.castellani@unibo.it 1 Department of Physics and Astronomy, University of Bologna, Bologna, Italy Full list of author information is available at the end of the article [11,12]. In light of these preliminary observations, we believe that a characterization of the complete CG distribution would provide a better comprehension of their role inside genomes of all organisms, with the idea that similar functionalities should share similar statistical properties. Moreover, the identified distribution can be the basis for hypothesizing specific physical models to describe the observed DNA sequence characteristics.
We previously noticed that the distinction between CG and non-CG interdistance distributions is less sharp in non-mammal organisms, by considering a set of 21 genomes, belonging to 10 mammal and 11 non-mammal organisms [4]. We have now extended the study to CG interdistance distributions from 4425 genomes, belonging to a wide range of organism categories (bacteria, protozoa, plants, fungi, invertebrates, mammal and non-mammal vertebrates) in order to better understand the heterogenous scenario found among non-mammals and to obtain a global picture associated to this particular feature.

Data
The organism DNA sequences were downloaded from GenBank NCBI database [13]. We defined a subset of organisms, namely the DNA sequences of 9 mammal model organisms: Bos taurus, Canis familiaris, Equus caballus, Homo sapiens, Macaca mulatta, Mus musculus, Ornithorhynchus anatinus, Pan troglodytes and Rattus norvegicus, to test the goodness of fit of the chosen probability density functions, since in a previous work [4] they showed very homogeneous characteristics in terms of CG distribution.
An extended analysis was then performed on a dataset composed of 4425 genomes (see Additional file 1 for a detailed list on organisms and measured parameters), selected among 7 of the 11 categories represented on the NCBI database: bacteria, fungi, invertebrates, plants, protozoa, mammal vertebrates and non-mammal vertebrates (see Table 1). In order to ensure minimal quality criteria on the reconstructed genome sequences, we chose to study only fasta files at chromosome and scaffold levels, discarding those for which only contigs were available.

Computation of CG interdistance distributions
The first step of our analysis consisted in the estimation of CG interdistance relative frequency distributionsp(τ ) in the selected organism set. We pre-processed the data by extracting the longest sequence from each genome, except sex chromosomes [4], and removing the unknown bases, identified with the "N" symbol in the fasta files. This operation did not affect the computation ofp(τ ), because the ratio of N inside the sequences was in general low (see Table 2 and Additional file 1) and they were mainly located contiguously at the centromere and telomere regions, thus producing only a very small number of  large distances (that could eventually be easily removed from the analysis). Subsequently we found the positions x j of each CG dinucleotide inside the sequence, and we calculated the distance between two consecutive CG as τ j = x j+1 −x j ; finally, for each distance value τ , we counted its abundance along the sequence and estimated its relative frequencyp(τ ), as described in Eq. 1. In this way we obtained a relative frequency distribution that we called CG interdistance distribution.

Choice of best distribution
In order to find a complete characterization of mammal CG distribution, we firstly representedp(τ ) for the 9 mammal model organisms in semilogarithmic scale.
In this way, we immediately recognized an exponentially decaying trend in the tails (not shown, see Supplementary Materials in [4]), which led us to consider the following functions: exponential and double exponential distributions, which can be associated to physical processes respectively governed by a single and a double characteristic scale (that would correspond to characteristic CG distances along the genome); stretched exponential and gamma distributions, which are related to physical processes involving both a characteristic scale and a power-law trend [14][15][16][17][18][19][20][21][22][23]. We also took into account the q-exponential distribution, as suggested by a recent work [5] that studied CG interdistance distributions on a small interval of about 0 − 300 dinucleotide distance values for human genome. In our study we consider the whole distance distribution up to about 2000 nucleotides for the same organism, and of the same order of magnitude for the other higher-order organisms of the considered subset. The proposed distributions were fitted to the data by using a non-linear least square method (fit function, Mathworks Matlab software).
We noticed that the extreme region of the right tail of our CG distributions adversely affected fit results, due to poor sampling (see Additional file 1 for details), therefore we decided to exclude from the fit procedure all distances beyond the 90th percentile (leaving an interval of distances from 0 up to about 1000 − 2000 bases in all 9 higher-order organisms). The goodness of fit was initially estimated by r 2 parameter (Eq. 7), defined as: where SSR represents the sum of squares of the regression and SST the sum of squares about the mean, also called total sum of squares. Due to the large number of distances fitted for these organisms, any correction for sample size to the goodness of fit estimation was not relevant. A comparison of r 2 values allowed to discard some distributions with a clear low fitting performance. In order to find the best fitting distribution among the remaining, we considered additionally the mean value of residual distribution (reported in Table 3), that allowed a further discrimination, also supported by visual inspection (see Additional file 1).

Multiple genome analysis
Once obtained the best fitting probability density function for the mammal organism set, we applied it to all organisms chosen for our analysis. The fit parameters associated to the best distribution, together with the goodness-of-fit parameters, were used to describe the analyzed organisms, individually or grouped by category, allowing to obtain a global picture from a point of view of organism complexity. We expected that genomes with similar CG interdistance distributions would show similar fit parameter values, reflecting similarities in the functional roles of CG dinucleotides in these organisms. Even if for some organism categories the chosen distribution is not optimal as for the initial subset, we hypothesize that organisms with similar distributions (even if not corresponding to the chosen one) should present similar parameters anyway, allowing a global classification with a unified approach. Anyway, to filter out possible fit errors due to bad genome sequence reconstruction, we only considered for our analyses the organisms which goodness-of-fit exceeded a value r 2 = 0.9. With this filter we discarded on average about 15% of our genomes (from 2% in bacteria to 25% in non-mammal vertebrates), homogeneously distributed along the considered categories, resulting in 3857 genomes left for our analysis.

Results
Goodness-of-fit parameters showed that gamma distribution (Eq. 6) is the function that best describes CG interdistance distribution for the 9 mammal subset (see Fig. 1 for the case of human genome). In particular, if we look at r 2 values in Table 4, we can see that the worst fit results are given by q-exponential distribution, since the corresponding r 2 values are the lowest ones, followed by single exponential distribution. The choice of best fit distribution among the remaining was more difficult, because r 2 values were very similar or even identical. Therefore, we also considered the mean values of residual distribution, that provided a clear distinction among the considered distributions (see Table 3), with values around 10 −11 for gamma fit, 10 −8 for stretched exponential fit, 10 −7 for double exponential fit, 10 −8 for exponential fit and 10 −1 for q-exponential fit. These values confirmed that q-exponential was the worst fitting distribution, and showed that gamma is the best fit function for mammal CG interdistance distributions (see Table 5 for fit results). Looking at Fig. 2, we notice that b is the parameter that mainly discriminates between the organism categories while the value a of the power term in gamma distribution is equally spread across all organisms of all categories (see also Fig. 3). Furthermore, b values seem to increase with the "biological complexity" of the considered categories, being minimum for bacteria and protozoa, and maximum for vertebrates (higher in mammals than in non-mammals) and with an intermediate value for invertebrates. Vertebrate categories have a median value of b in the range 200 − 300, while it is an order of magnitude lower for bacteria (about 30). We remark that this value is very close to the typical length of DNA enveloped around a histone (146 bp envelope around histone octamer plus a linker region summing up to about 200-220 bp), thus there might be a relation between DNA enveloping around histones and our observation in term of CG distances, even if we cannot provide an explanation for this.
Since we are considering a large class of organisms, with DNA sequence size differing by several orders of magnitude (from 10 8 for mammals to 10 4 − 10 5 for bacteria and protozoa), we checked if b parameter could be associated with the length of the analyzed genomic sequence. This does not seem the case, since the Pearson's correlation coefficient r between the logarithm of b and the logarithm of the length of the analyzed genome sequences is very close to zero: r = −0.12.  In light of these observations, we also tested whether the gamma scale parameter (i.e., b) could depend on CG density inside the sequence (number of CG dinucleotides with respect to sequence length), representing b as a function of %CG in double logarithmic scale (see Fig. 4). In a simple null model, the average distance between dinucleotides should decrease proportionally to the inverse of dinucleotide density inside the sequence, thus with a slope equal to −1 in double logarithmic plot. Therefore, we fitted the b vs %CG double logarithmic plot to a straight line using linear least square method, obtaining the results shown in Table 6. We observe that the relation between b and %CG is in general very close to the fitted lines for each organism category, with average value of Pearson's coefficient r = −0.65 (minimum correlation r MIN = −0.54 for invertebrates, maximum correlation r MAX = −0.75 for protozoa). From this analysis we can identify two groups of organisms, according to values of the coefficient m, corresponding to the slope of the line in log-log plot and thus to the exponent of the polynomial relation b ∝ %CG m : bacteria, plants, fungi, protozoa and invertebrates have an exponent approximately equal to −1, while mammal vertebrates and non-mammal vertebrates have a smaller exponent in absolute value closer to 0.5, significantly different from the others in terms of 95% confidence interval. Some organism categories thus seem to verify the null model hypothesis, while for vertebrates the significant deviation from the null model suggests a different mechanism for CG dinucleotide placement along the genome rather than a "maximum entropy" process.

Discussion
A possible biological interpretation of this grouping could be a different role of CG methylation in these two classes of organisms. CG methylation is known to be an important mechanism in higher-order organisms (like vertebrates, that in our analysis show a slope significantly smaller than −1), with an active role on gene transcription regulation [24]. For most of the biological categories that showed an exponent close to −1 it is not clear how (or even if ) the CG methylation mechanism is used [25][26][27], since in some cases different nucleotide sequences are involved in methyl group binding (like the GATC motif in E. Coli, or other motifs in plants [28]) and in general is not used for gene regulation, if not only during embryonic development [29]. We speculate that a characterization of CG distribution parameters for a specific organism could be an index to hypothesize a role of CG methylation at a single organism level, even if we did not go further in the analysis in this direction. In order to extend the range of applications, we think that the method developed in this work can be applied to further repeated genomic sequences (e.g. transcription-factor-binding-site motifs mapped in ENCODE project [30] and repeated sequences associated to transposable elements [31]) in order to gain a deeper insight into DNA properties of

Conclusions
We considered several probability density functions to fit the CG interdistance distribution of a selected set of mammal organisms, and we observed that it is best described by a Gamma distribution. Applying this function on a wide set of organisms, taken from different taxonomic categories, we noticed that the scale parameter b of the Gamma distribution could be associated to the biological complexity of the organism category, increasing from bacteria to vertebrates. Moreover, we tested for possible factors affecting this parameter, like genome sequence length and CG density. While the first was not related to our observations, the second revealed stronger correlations; in particular, for a group of organisms, comprising those of minor biological complexity (bacteria, protozoa, fungi, invertebrates and plants), the relation between b and CG density could be explained by a minimal null model, while for higher order organisms (vertebrates) this null model did not explain the observations. We argue that this difference could be related to the different role that CG methylation plays in these classes of organisms.