Dataset of archaeal and bacterial genomes
Microbial genome assemblies
The study was conducted on a dataset of genome assemblies from 13,883 bacterial and 295 archaeal strains, for a total of 14,178 complete genomes selected (for the list see https://doi.org/10.6084/m9.figshare.15035619.v1), from the NCBI Genbank Genomes database (https://ftp.ncbi.nlm.nih.gov/genomes/genbank/), among the complete genome assemblies available in June 2019 that had also passed RefSeq quality checks. Assembly metadata, such as taxonomical identifiers, sources and assembly statuses, were extracted from the summary file available at
https://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt. The complete nucleotide sequences of each assembly (available in fasta format in .fna files) were tested for putative gene coding sequences with the microbial genome annotation tool Prokka v1.12 [39] with default software settings. The only exception was assembly GCA_002761215.1, corresponding to Candidatus Gracilibacteria bacterium HOT-871, which follows the alternative version of the genetic code known as translation table 25, where the stop codon UGA encodes for the amino acid glycin. In this case, Prokka was run with the option --gcode 25.
Annotation with Pfam domain coding regions
All putative gene sequences identified by Prokka were further examined for the presence of protein domain coding regions using hmmscan v3.1b2 [40], with setting --eval 0.001. Hmmscan was run against the target database Pfam release 32.0 (https://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/Pfam-A.hmm.gz, [28]), a reference collection of Hidden Markov Model profiles for 17,929 protein domain families. The threshold to define whether a gene contained a significant sequence match for a putative Pfam domain was placed at single domain e-value < 0.001. A further filtering step was implemented to deal with cases where multiple domains from the same Pfam domain clan matched within the same gene sequence. These clans are groups of domain families with shared ancestry that have undergone independent evolution, but still retain high sequence similarity. This means that, in presence of a true match for a member of a clan, other members of the same clan can also align with the same region, albeit with lower scores. Whenever this happened, only the match with the lowest e-value was accepted. Finally, as hmmscan is often not able to determine the exact boundaries of single Pfam domain coding regions within gene sequences, the chromosomal positions of all predicted domains were approximated to the ones of their corresponding gene coding sequences.
Identification of known functionally interacting domains
The gene ontology (GO) database [41] is an online resource that provides a series of relationships (annotations) between genes or protein domains and some standard labels (Gene Ontology terms) describing their known molecular or cellular functions. All predicted Pfam protein domains from the 14,178 genome assemblies were annotated with the corresponding Gene Ontology term(s), when available, using the pfam2GO mapping table version 2019/06/01 14:44:40 downloaded from http://geneontology.org/external2go/pfam2go. From these annotations, we selected only those corresponding to Gene Ontology Biological Process (GOBP) terms, using the GO terms description file available at http://purl.obolibrary.org/obo/go.obo. Any two domains found in the same assembly that shared at least one identical GOBP annotation were labelled as functionally interacting.
Phylogenetic analysis
A phylogenetic tree describing the evolutionary relationships between the 14,178 examined microbial strains was built from the sequence of 107 protein coding genes (see Additional file 1: Table S1 for a list of housekeeping protein coding genes from archaea and bacteria) previously identified as present in most known bacteria and archaea, such as genes encoding for ribosomal proteins or sub-units of the RNA polymerase [42]. This analysis was carried out using the software wrapper bcgTree v1.0.8 [43]. In detail, the protein sequences of the predicted homologs of these genes in all genome assemblies were identified with hmmscan, concatenated together and aligned with Clustal Omega v1.2.1 [44]. The alignment was refined by selecting only the best aligned blocks using Gblocks v.0.91b [45], and processed to build a Maximum-Likelihood tree with the software FastTreeMP v2.1.8 [46]. The resulting tree was rooted using the archaea subclade as an out-group. The currently accepted taxonomic classification (species, genus, phylum, etc) for each assembly in the dataset was determined by mapping each taxonomic identifier to the current taxonomy of all living beings available on the NCBI database (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz).
Modelling of clustering domains in individual organisms
An overall visual summary of the complete analysis pipeline is reported in Fig. 9.
Chromosomal distances of domain pairs
The probabilistic models of chromosomal distances of clustering domains were obtained by analysing each of the 14,178 genome assemblies independently and only on their main chromosome, which was defined as the longest DNA molecule found in each assembly. The choice to restrict to the main chromosome was made to reduce noise caused by the presence of minor DNA elements such as plasmids, where most domains map very close to each other simply simply because of the small size of the DNA molecule itself. In every chromosome i, we measured the chromosomal distance (expressed in number of base pairs) between the two members of all possible pairings (domain a, domain b) obtainable from the Pfam domains located on the chromosome. This distance (defined in Eq. 1) includes the size of both involved domain coding sequences and of any DNA fragment located between the two.
In genome assemblies from bacteria and archaea, gene (and domain) coordinates are conventionally assigned beginning at the origin of replication (located immediately upstream to the predicted homolog of dnaA gene) and increasing along the leading strand of the DNA molecule, travelled clockwise, until the origin of replication is reached again. Due to the circular shape of most of these chromosomes, however, domains that appear to be very far according to the values of their coordinates can be very close if the chromosome is examined in an anti-clockwise manner. Accordingly, for each pair (domain a, domain b) found on chromosome i, the corresponding chromosomal distance was defined as the size, expressed in number of base pairs, of the smallest DNA fragment delimited by and including the two domains measured by travelling the chromosome either clockwise or anti-clockwise (as shown in Fig. 10):
$$\begin{aligned} \text{dist}_{(a,b)}^{(i)} = \min (\text{end}_{b}^{(i)} - \text{start}_{a}^{(i)} + 1, L^{(i)} - \text{start}_{b}^{(i)} + \text{end}_{a}^{(i)} + 1), \end{aligned}$$
(1)
where \(\text{start}_{a}^{(i)}\), \(\text{end}_{a}^{(i)}\) and \(\text{start}_{b}^{(i)}\), \(\text{end}_{b}^{(i)}\) are the start and end coordinates of domain a and domain b on chromosome i, respectively, and \(L^{(i)}\) is the total length of chromosome i, expressed in number of base pairs. These distances are symmetrical, in that \(\text{dist}_{(a,b)}^{(i)}\) (clockwise) = \(\text{dist}_{(b,a)}^{(i)}\) (counterclockwise); therefore, the order in which the two domains are located on the chromosome is not relevant. Note that, when two domains were located within the same gene, the value of \(\text{dist}_{(a,b)}^{(i)}\) corresponded to the size of the gene itself. To avoid over-representation of the same distance due to the presence of multi-domain protein coding genes, all distances corresponding to multiple pairs of domains mapping within the same pair of genes were only counted once.
Clustering model based on chromosomal positions
After examining the distributions of chromosomal distances between functionally interacting pairs, we hypothesised these distributions to be the result of the existence of two distinct sub-populations of pairs that can be modelled using two distinct probability distributions. The chromosomal distances of the first sub-population, representing functionally interacting domains grouping together on the chromosome (clustering pairs), were modelled using an exponential distribution with rate parameter \(\lambda ^{(i)}\):
$$\begin{aligned} \text{Exp}\Big (\text{dist}_{(a,b)}^{(i)} ; \lambda ^{(i)}\Big ) = \lambda ^{(i)} \cdot \exp {\displaystyle \left( -\lambda ^{(i)} \cdot \text{dist}_{(a,b)}^{(i)}\right) } \; \qquad \text{for~dist}_{(a,b)}^{(i)} \ge 0. \end{aligned}$$
Thus clustering pairs are modelled as being likely to be close together on the chromosome with the probability of chromosomal distance between them exponentially decaying with increasing distance. The value of the rate parameter of this distribution was initially considered as characteristic of each chromosome i and unknown.
The second sub-population, representing domains that are involved in the same cellular process but whose coding regions are scattered on the chromosome (non-clustering pairs), was modelled with a uniform distribution between zero and half of the total length of chromosome i:
$$\begin{aligned} \text{U}\bigg (\text{dist}_{(a,b)}^{(i)} ; 0, \frac{L^{(i)}}{2}\bigg ) = \frac{2}{L^{(i)}}\; \qquad \text{for} 0 \le x \le \frac{{L^{(i)}}}{{2}}. \end{aligned}$$
i.e. all chromosomal distances between non-clustering pairs are modelled as being equally probable.
The likelihood of the observed chromosomal distances between the two members of functionally interacting pairsFootnote 2 was then defined using a mixture model consisting of the weighted sum of the these two distributions:
$$\begin{aligned} p(\text{dist}_{(a,b)}^{(i)} \vert a, b \; \text{functionally interacting})& = \phi ^{(i)} \cdot \text{Exp}\Big (\text{dist}_{(a,b)}^{(i)} ; \lambda ^{(i)}\Big ) + \\&(1 - \phi ^{(i)}) \cdot \text{U}\Big (\text{dist}_{(a,b)}^{(i)} ; 0, L^{(i)}/2\Big ), \end{aligned}$$
where the weights \(\phi ^{(i)}\) and \((1 - \phi ^{(i)})\) represent the proportion of clustering and non-clustering pairs, respectively, out of the total number of functionally interacting pairs found on chromosome i. The allocation of the observed functionally interacting pairs to the clustering or non-clustering population - and consequently the value of \(\phi ^{(i)}\)—was unknown at this stage.
Parameter estimation for the clustering model
The values of the two parameters characterising the above mixture model, namely the rate parameter of the exponential distribution \(\lambda ^{(i)}\) and the proportion of clustering pairs over the total functionally interacting pairs \(\phi ^{(i)}\), were initially regarded as unknown and characteristic of each chromosome i. As these values are not obtainable analytically, their maximum-likelihood estimates were computed with the Expectation-Maximisation (EM) algorithm [47]. This optimisation algorithm is a heuristic procedure commonly employed in statistics to obtain parameter estimates for models with latent properties, that is models where the data display characteristics that can not be directly measured but can be inferred from other variables. In this case, the hidden feature of the data was whether the two members of each functionally interacting pair were part of the same cluster. This property was included in the model by assigning to each pair a binary latent variable \(z_{(a,b)}^{(i)}\), that indicated whether the pair was likely to be part of cluster (\(z_{(a,b)}^{(i)} = 1\)) or not (\(z_{(a,b)}^{(i)} = 0\)). The values of these indicator variables were initially unknown.
The EM algorithm takes its name by the fact that it computes maximum likelihood estimates by iterating through two steps (Expectation and Maximisation), usually followed by a third step that checks for convergence. In the implementation used in this study, the algorithm started with two initial guesses for the two parameters (called \({\hat{\lambda }}^{(i, 0)}\) and \({\hat{\phi }}^{(i, 0)}\)) and updated them by repeating the following three steps for \(t =\) 1, 2, \(\dots\), up to a maximum of 1,000 iterations:
(1) Expectation step Given the most recent estimates for the two parameters of the model, called \({\hat{\lambda }}^{(i, t-1)}\) and \({\hat{\phi }}^{(i, t-1)}\), update the estimates for the latent variables of each pair (domain a, domain b) on chromosome i according to the following equation:
$$\begin{aligned} \hat{z}^{(i, t)}_{(a,b)} = \frac{{\hat{\phi }}^{(i, t-1)} \cdot \text{Exp}\Big (\text{dist}_{(a,b)}^{(i)} ; {\hat{\lambda }}^{(i, t-1)}\Big )}{{\hat{\phi }}^{(i, t-1)} \cdot \text{Exp}\Big (\text{dist}_{(a,b)}^{(i)} ; {\hat{\lambda }}^{(i, t-1)}\Big ) + (1 - {\hat{\phi }}^{(i, t-1)}) \cdot \text{U}\Big (\text{dist}_{(a,b)}^{(i)} ; 0, L^{(i)}/2\Big )}, \end{aligned}$$
where \(\text{dist}_{(a,b)}^{(i)}\) is the distance of the two members of the pair on chromosome i, obtained according to Eq. 1, and \(L^{(i)}\) is the size (in base pairs) of the same chromosome. This is the expected value of the indicator variable of whether the pair are clustering or non-clustering, given the current estimates of the model parameters.
(1) Maximisation step Compute new values \({\hat{\lambda }}^{(i, t)}, {\hat{\phi }}^{(i, t)}\) that maximise the expected complete data log-likelihood given the current estimates of the latent variables, that can be written as:
$$\begin{aligned} \begin{aligned} Q\big (\lambda ^{(i)}, \phi ^{(i)} \mid {\hat{\lambda }}^{(i, t-1)}, {\hat{\phi }}^{(i, t-1)}\big )&= \sum _{a \ne b}{\Big (\hat{z}^{(i, t)}_{(a,b)} \cdot \log \Big (\text{Exp}\Big (\text{dist}_{(a,b)}^{(i)} ; \lambda ^{(i)}\Big )\Big )\Big )} \\&\qquad + \sum _{a \ne b} {\Big (\Big (1 - \hat{z}^{(i, t)}_{(a,b)}\Big ) \cdot \log \Big (\text{U}\Big (\text{dist}_{(a,b)}^{(i)} ; 0, {L^{(i)}}/{2}\Big )\Big )\Big )}. \end{aligned} \end{aligned}$$
In detail, the value of \({\hat{\lambda }}^{(i, t)}\) that maximises the above complete data log-likelihood function is obtained by taking the partial derivative of the above function with respect to \(\lambda ^{(i)}\) and setting it equal to zero, yielding:
$$\begin{aligned} {\hat{\lambda }}^{(i, t)} = \frac{\sum _{a \ne b}{\hat{z}^{(i, t)}_{(a,b)}}}{\sum _{a \ne b}{\Big (\hat{z}^{(i, t)}_{(a,b)} \cdot \text{dist}_{(a,b)}^{(i)}}\Big )}. \end{aligned}$$
The update estimate for \({\hat{\phi }}^{(t)}\) is then obtained by maximising the complete data log-likelihood with respect to \(\phi ^{(i)}\), which results in the computation of the mean of the new estimates of all the latent variables obtained in the estimation step:
$$\begin{aligned} {\hat{\phi }}^{(i, t)} = \frac{\sum _{a \ne b}{\hat{z}^{(i, t)}_{(a,b)}}}{\# \hat{z}^{(i, t)}_{(a,b)}}, \end{aligned}$$
where \(\# \hat{z}^{(i, t)}_{(a,b)}\) is the number latent indicator variables (i.e. the number of domain pairs in species i) These are the maximum likelihood estimates of the model parameters, given the current expected value of the indicator variable on whether each pair is clustering or non-clustering.
(3) Convergence check Compute the value of the marginal log-likelihood of the model at iteration t (\(\text{MLL}^{(i, t)}\)) given the updated parameters estimates \({\hat{\phi }}^{(i, t)}\) and \({\hat{\lambda }}^{(i, t)}\):
$$\begin{aligned} \text{MLL}^{(i, t)} = \sum _{a \ne b}{\log \Big ({\hat{\phi }}^{(i, t)} \cdot \text{Exp}\Big (\text{dist}_{(a,b)}^{(i)} ; {\hat{\lambda }}^{(i, t)}\Big )} + (1 - {\hat{\phi }}^{(i, t)}) \cdot \text{U}\Big (\text{dist}_{(a,b)}^{(i)} ; 0, {L^{(i)}}/{2}\Big )\Big ) \end{aligned}$$
If the difference between the two consecutive values of \(\text{MLL}^{(i,t)}\) is lower than a certain \(\epsilon\) (here set to \(10^{-16}\)) i.e. \(\text{MLL}^{(i,t)} - \text{MLL}^{(i,t-1)} < \epsilon\) stop iterating and take \({\hat{\phi }}^{(i, t)}, {\hat{\lambda }}^{(i, t)}\) as the final maximum likelihood estimates for \(\phi ^{(i)}, \lambda ^{(i)}\). Otherwise, move to the next iteration.
Sampling of initial values
The EM algorithm is guaranteed to increase the complete data log-likelihood of the model, and consequently its marginal log likelihood, with every iteration. The complete data log-likelihood function, however, has multiple local maxima and the algorithm will converge to only one of them, which strictly depends on the value of the initial parameter guesses and does not necessarily correspond to the global solution to the maximisation problem. As the algorithm is completely deterministic, multiple runs with the same set of starting values will always converge to the same solution. A way to address this issue is by re-running the algorithm with different initial parameter guesses and taking the estimates that give the highest final marginal log-likelihood. Accordingly, the EM algorithm was repeated with 100 different combinations of initial guesses for \(\lambda _i\) and \(\phi _i\), obtained by combining ten samples for each parameter.
In the case of \({\hat{\lambda }}^{(i, 0)}\), an initial estimate was obtained with the method of moments assuming that all functionally interacting pairs were exponentially distributed, that corresponds to computing the reciprocal of the mean pairwise distance on each chromosome:
$$\begin{aligned} {\hat{\lambda }}^{(i, 0)} = \frac{\#\text{dist}_{(a,b)}^{(i)}}{\sum _{a \ne b}{\text{dist}_{(a,b)}^{(i)}}}, \end{aligned}$$
where \(\#\text{dist}_{(a,b)}^{(i)}\) denotes the total number of pairwise distances in species i. Further guesses were then obtained by sampling ten logarithmically spaced values that ranged from three orders of magnitude below to three orders of magnitude above the value obtained with the method of moments. The initial guesses for the fraction of functionally interacting domains that were also clustering, \({\hat{\phi }}^{(i)}_0\), was obtained by sampling ten logarithmically spaced values between 0.001 and 0.1. This interval was chosen based on the a priori assumption, derived from visual analysis of the data, that the proportion of clustered pairs was going to be somewhere within this interval.
Implementation and data visualisation
All above described analyses were implemented in a collection of scripts written in the programming language Python v3.6. The computational time required for the analysis of this extensive dataset was optimised by allocating the processing of individual assemblies to independent CPU cores working in parallel. Moreover, most numerical calculations were carried out using the library NumPy v.1.15.2 that allows for efficient vectorised processing of data. The import and handling of Genbank files and phylogenetic trees took advantage of the libraries Biopython v.1.72 [48] and ete3 v3.1.1 [49], respectively. Summary plots were generated with the libraries Matplotlib v.3.0.0 and seaborn 0.9.0. The scripts are available at https://github.com/selenocysteine/conserved-chromosomal-clusters.