The detection of gene homology performed by PanDelos is divided into 5 main steps that combine a candidate selection based on k-dictionaries, with a refinement procedure, developed by means of network analysis. Firstly, an optimal value of word length k is chosen according to properties of the input collection of genomes. Consequently, genes are compared and candidate homologous pairs are selected. The selection is firstly applied by setting a minimum amount of intersection between the k-dictionaries of two genes. Then, the generalized Jaccard similarity is used to measure the similarity between genes in order to extract bidirectional best hits. The extraction produces a homology network from which, at the end of a refinement procedure, gene families are retrieved. Figure 1 gives an overview of the overall schema.
In what follows, we first describe the details of PanDelos together with the engineering and extension of existing data structures that allows PanDelos to reach high performance and efficiency.
Basic notation
A gene is represented as a string s over the amino acid alphabet Γ, s=a1a2…ah, with ai∈Γ for 1≤i≤h. The k-mers of s are the substrings of s having length k. A sequence s, having length |s|, contains |s|−k+1 occurrences of k-mers. A k-mer w may occurs several times within s. The number of times that w occurs in s is called the multiplicity of w in s and it is denoted by cs(w). The k-dictionary Dk(s) of s is given by the set of all distinct k-mers occurring in s:
$$D_{k}(s) = { s[i..i+k] : 1 \leq i \leq |s| - k}, $$
where s[i..i+k] is the substring of s starting at position i and ending after k positions.
Given a population of n individual genomes, we denote by \(\mathbb {G}^{i} = \{s_{1}, s_{2},\dots,s_{m} \}\) the set of genes of the i-th individual. The genetic length of \(\mathbb {G}^{i}\) is given by the sum of the lengths of the genes in \(\mathbb {G}^{i}\) and it is denoted by \(\langle \mathbb {G}^{i} \rangle \). On the contrary, when whole DNA sequences are taken into account the genomic length of the i-th individual, |Gi|, is given by the total length of the DNA sequence. In what follows, we use the term genome to indicate both a DNA sequence G and the corresponding set of genes \(\mathbb {G}\). The context will suggest the intended appropriated meaning.
Choosing an optimal word length for gene dictionary construction
A dictionary-based measure is highly sensitive to the length k of the words that compose the dictionary. In analyzing whole genome sequences, a crucial resolution is given by k=log4|G|, where 4 is the cardinality of the nucleotide alphabet [24, 25]. This value was proven to reveal structural laws that emerge from the maximum entropic difference between real genomes with random ones of the same length. In our case, genes are represented by the amino acid sequences of the proteins they encode, thus the alphabet to be considered is Γ rather than the 4-symbols nucleotide alphabet. Thus, we take into account the set of genetic sequences belonging to all the n input genomes by setting the value of the optimal word length k as:
$$k = {log}_{|\Gamma|} \sum\limits_{i=1}^{n} \langle \mathbb{G}^{i} \rangle. $$
Selection of candidate gene pairs
Unfortunately, no theory exists to define a non-empirical threshold regarding the application of the Jaccard similarity in the context of gene comparison. Thus, a preliminary step filters pairs of gene candidate to be homologous. The intersection coverage of the dictionaries of two genes is used as a criterion of relational relevance between sequences. The criterion requires that the k-mers of Dk(s∩t)=Dk(s)∩Dk(t) have to occur in s and t with a minimal percentage.
PanDelos creates a set CH of candidate homologous genes by computing, for each pair of genes s and t, \(s \in \mathbb {G}^{i}\) and \(t \in \mathbb {G}^{j}\), the percentage of k-mer occurrences of s that belong to \(\widehat {D_{k}}(s, t)\). It is given by
$$p_{k}(s \rightarrow t) = \frac{{\sum\nolimits}_{w \in D_{k}(s, \cap t)} c_{s}(w)}{|s| - k + 1}. $$
PanDelos considers as homologous two genes s,t such that both pk(s→t) and pk(t→s) must overcome the minimum amount of 2/k.
The threshold 2/k is not empirically defined, but motivated by an argument that we will briefly outline. If we consider that from a sequence s we can extract at most |s|/k distinct non-overlapping k-mers, then we realize that, when the average multiplicity of k-mers in s is close to 1, this fraction is close to 1/k of the number of all k-mer occurrences of s. However, the lack of overlap denies any possibility of reconstructing of s from such a k-dictionary, because in this case there is no indication on how the different k-mers must be arranged to form s. Therefore, we assume that a minimum amount of overlap between consecutive k-mers extracted from s is given by doubling the above fraction 1/k. This argument suggests us to fix as 2/k the threshold of pk(s→t) and pk(t→s). In conclusion, s and t are considered homologous candidate genes if: pk(s→t)≥2/k and pk(t→s)≥2/k.
Dictionary based gene sequence similarity detection
For each pair of genomes, \(\mathbb {G}^{i}\) and \(\mathbb {G}^{j}\), and for each candidate pair of genes (s,t), such that \(s \in \mathbb {G}^{i}\) and \(t \in \mathbb {G}^{j}\), PanDelos computes their sequences similarity by applying a generalized Jaccard similarity among the k-dictionaries. Note that, in the search for paralogous genes, i is equal to j.
Given two sequences, s and t, and Dk(s∪t)=Dk(s)∪Dk(t) the union of their k-dictionaries, PanDelos uses the following generalized Jaccard similarity Jk(s,t) on k-mer multiplicities:
$$J_{k}(s,t) = \frac{ {\sum\nolimits}_{w \in D_{k}(s \cup t)} min(c_{s}(w), c_{t}(w)) }{{\sum\nolimits}_{w \in D_{k}(s \cup t)} max(c_{s}(w), c_{t}(w))}. $$
It takes values in the interval [0,1]. It is independent of the lengths of the compared sequences and thus it is suitable for comparing sets of sequences having a wide range of lengths.
Extraction of gene pairs by bidirectional similarity
In order to obtain the set CHO of orthologous candidate genes, PanDelos computes bidirectional best hits (BBHs) on genes in CH.
Given a gene \(s \in \mathbb {G}^{i}\), the set of best hits of s towards a genome \(\mathbb {G}^{j}\) is given by:
$$BH\left(s, \mathbb{G}^{j}\right) = \left\{ t \in \mathbb{G}^{j} : J_{k}(s,t) = \underset{v \in \mathbb{G}^{j}}{\text{max} } J_{k}(s,v) \right\}. $$
The set of bidirectional best hits of s towards \(\mathbb {G}^{j}\) is given by:
$$BBH\left(s, \mathbb{G}^{j}\right) \,=\, \left\{ t \in \mathbb{G}^{j} : t \in \!BH\left(s,\mathbb{G}^{j}\right) \text{and}\ s \in BH\left(t, \mathbb{G}^{i}\right) \right\} $$
Only genes involved in at least one BBH are kept in CHO.
The BBH strategy is commonly used in pan-genomic analyses, however, it may capture sequences having low similarity. This behavior especially arises with singleton sequences. Two unrelated singletons of the two genomes may form a BBH simply because no orthologs exist and they are reciprocally the best match. PanDelos avoids these cases by performing the BBH strategy only on genes in CH, i.e. on candidate genes ‘similar enough’.
In order to obtain the set CHP of paralogous candidate genes, at the end of every 1-vs-1 genome comparison, the minimum score of BBH orthologous candidate genes is used to infer new paralogous. Recalling that PanDelos has compared each genome to it self, the intra-genome BBH with a score equal to or greater than the minimum inter-genomes BBH score (orthology score) are accounted as paralogous. This rule states that the score accounting for orthologous sequences can be used as threshold to define two genes as paralogous because their similarity is strong at least as the minimum trusted similarity between orthologs.
Gene family detection by network coherence refinement
PanDelos constructs an undirected weighted network from homology information, where each vertex is labelled with a pair \((s, \mathbb {G}_{i})\) formed by a candidate gene and the genome to which it belongs, and where an edge connects two vertices if they are in CHO or CHP. The edge weights are the scores computed applying the generalized Jaccard similarity on candidate genes.
The network may be formed by several connected components which are the starting homologous candidate gene families. A connected component is defined inconsistent if it contains two genes belonging to the same genome that are not accounted as paralogs, namely which are not connected by an edge. The inconsistency is resolved by recursively splitting the component into subgroups until a set of consistent subgroups is reached. PanDelos uses the Girvan-Newman algorithm for community detection which calculates the betweenness centrality along the components and progressively removes the edge with the highest centrality [26]. PanDelos normalizes the edge weights by means of the maximum weight present in each connected components.
The resulting pan-genomic structure is given by the final set of consistent connected components plus singleton genes, i.e. the singleton vertices in the network. Components containing genes of all genomes represent the core of the pan-genome. The other components contain the dispensable genes.
Data structures engineering for fast similarity computation
Limitations of enhanced suffix arrays for pan-genome computations
Given a string s, a suffix array (SA) [27] reports the lexicographically ordered suffixes of s equipped with their start position in s. Substring search by means of SA can be sped up by performing binary searches. An enhanced suffix array (ESA) [28] is a combination of SA with the LCP (Longest Common Prefix) array giving the length of the longest common prefix of a suffix with that one lexicographically preceding it. An ESA allows for efficient recovery of the k-mers multiplicities [29]. The values of the LCP array define contiguous regions of the ESA array, called LCP-intervals, which identify all the occurrences of k-mers. Additionally, an array of length N reports for each suffix the distance from its start to the first forward occurrence of a N symbol [30]. The N is used to represent positions in s that must be discarded in dictionary operation. The ESA structure performs k-mer enumeration in linear time by just doubling the memory requirement of simple SA. Since each k-mer must be checked for N inclusion, this verification increases the time complexity by a factor of k. However, with the additional N array, the complexity remains linear. Figure 2a gives an example of ESA+N structure that has been built for the string WLLPPP, and illustrates LCP intervals of 1-mers and 2-mers of the string.
Methodologies for efficient similarity calculation in sets of strings have been developed by means of suffix trees [31]. This type of data structure inspired the develop of suffix arrays as a memory efficient implementation that does not increase time requirements. However, enhanced suffix arrays are useful for single reference string analysis, but result less suitable to be applied to sets of strings.
Given two sequences, s and t, the comparison of their k-dictionaries can be performed by listing the k-mers of s and searching for them in t. Taking into account an ESA+N structure, the k-mers listing is performed in |s| time, and the search of each k-mer in t takes \(\mathcal {O}(k \cdot log(|t|))\) time. Since at most |s| distinct k-mers are in s, the overall time is \(\mathcal {O}(|s| \cdot k \cdot log(|t|))\).
The search described above takes into account only Dk(s), but t may contain k-mers not listed in the dictionary of s, thus the time requirement is doubled because Dk(t) must also be scanned. In 1-vs-1 genome comparison, the process must be repeated for every pair of genes belonging to the two genomes, resulting in a highly expensive procedure. In the next section, we address a way to efficiently improve the procedure.
PanDelos data structure engineering
We extend the ESA+N structure [30] in order to speed up the comparison of k-dictionaries when multiple sequences are taken into account. The goal is to compute the generalized Jaccard similarities between a set of sequences simultaneously.
The generalized Jaccard similarity between s and t can be expressed as:
$$J_{k}(s,t) = \frac{a}{b + c}, $$
where a is the sum of the minimum multiplicities of k-mers shared by the two sequences, b is the sum of the maximum multiplicities of the shared k-mers, and c is the sum of the multiplicities of k-mers appearing only in one of the two sequences.
Given a and b for every pair of sequences, then c is obtained as
$$c = (|s| + |t| - 2k + 2) - (a + b), $$
where (|s|+|t|−2k+2) is the sum of multiplicity of all k-mers in s and t. Therefore it can be rewritten as:
$$J_{k}(s,t) = \frac{a}{|s| + |t| - 2k + 2 - a}. $$
Given two genomes, \(\mathbb {G}^{1}=\{s_{1}, \dots, s_{n}\}\) and \(\mathbb {G}^{2}=\{t_{1}, \dots, t_{m}\}\), genes are concatenate in a single global sequence s1·N·⋯·N·sn·N·t1·N·⋯·N·tm. An ESA+N structure is built, and the concatenation by N symbols ensures that extracted k-mers do not cross between gene sequences. The data structure is extended with a further array, called SID. Given an LCP-interval, that represents a specific k-mer, the content of the corresponding interval in the SID array reports the identifiers of the sequences in which the k-mer is present. Moreover, the number of time a sequence identifier is repeated within the interval corresponds to the multiplicity of the k-mer within the specific sequence.
For each pair of sequences involved in the interval, we compute the sum of minima (the a term in the generalized Jaccard formula) by computing the partial of such sums and storing them in a matrix M
$$M[i,j] = \sum\limits_{w \in D_{k}\left(s_{i} \cup t_{j}\right)} min\left(c_{s_{i}}(w), c_{t_{j}}(w)\right), $$
for 1≤i≤n and 1≤j≤m.
Once the matrix is filled, the similarities are finally computed by the formula:
$$J\left(s_{i},t_{j}\right) = \frac{ M[i,j] }{ \left|s_{i}\right| + \left|t_{j}\right| -2k +2 - M[i,j]}. $$
In this way, we avoid comparisons between sequences that do no share any k-mers, and eliminate the logarithmic factor of searching k-mers into multiple suffix arrays, one for each gene sequence.
Similarly, two additional matrices are stored in order to calculate candidate homologous genes: P1[i,j] reporting the percentage of multiplicities of k-mers in sishared with tj, and P2[i,j] storing the vice versa:
$$P_{1}[i,j] = \sum\limits_{w \in D_{k}(s_{i} \cap t_{j}) }\frac{c_{s_{i}}(w)}{|s|-k+1} $$
$$P_{2}[i,j] = \sum\limits_{w \in D_{k}(s_{i} \cap t_{j})} \frac{c_{t_{j}}(w)}{|t|-k+1}. $$
Figure 2 reports an example for four sequences that are concatenated into a single one. Then 2-mers, and their associated sequences identifiers, are retrieved from the data structure. In the example, two genomes are compared. The first genome contains the sequences s1 and s2, and the second genome contains the sequences t1 and t2. The word length 2 is chosen as best k value, thus sequences are compared by means of the multiplicities of the 2-mers they contain. Ideally, the matrix (c) has to be computed in order to calculate the Jaccard similarity between the sequences. For higher values of k, the storage and update of such matrix may require high computational efforts, thus its rows are computed on the fly by identifying k-mer intervals along the indexing structure and by iterating them. A linear iteration over the structure lists the 2-mers, together with the number of times they appear within each original sequence. During the iteration the three matrices M, P1 and P2, in Fig. 2d, e and f, are updated. After that every 2-mer have been iterated, Jaccard similarity are computed by means of the M matrix, while coverage percentage are computed by means of the P1 and P2 matrices.