eCAMBer requires as its input a set of genome sequences and annotations for multiple bacterial genomes. It should be noted, however, that eCAMBer supports automatic download of bacterial annotations from the PATRIC [2] database and, as an option, it allows the use of Prodigal to generate the input annotations. It works in two phases. In the first phase it uses BLAST+ [30] to transfer each gene annotation among multiple strains. Based on the results of this procedure, homologous multigene clusters are identified. In the second phase eCAMBer applies subsequently the procedures for refinement, TIS voting and clean up. Figure 1 presents a schematic view of these subsequent procedures of eCAMBer.
The main improvements in eCAMBer as compared to CAMBer [11] are:
-
Significant speed up of the closure procedure for unifying genome annotations among bacterial strains;
-
Modified refinement procedure for splitting homologous gene families into orthologous gene clusters;
-
New TIS voting procedure for selecting the most reliable TIS;
-
New clean up procedure for removal of gene clusters that are likely to be gene annotation errors propagated during the closure procedure.
Here, we describe the details of the above listed procedures. The default values for parameters introduced below were chosen arbitrarily. However, based on our experiments, the program is robust for other choices of the parameters from a reasonable spectrum. eCAMBer allows users to specify values of all the parameters.
The closure procedure
The closure procedure is the first step of eCAMBer. The input consists of genome sequences and genome annotations for a set of closely related bacterial strains. In this procedure gene annotations are iteratively transferred among the set of considered strains, until no new ORFs (open reading frames) are identified.More precisely, a gene annotation is transferred to a new location if its BLAST hit extended to the nearest in-frame stop codon is acceptable. Analogous to CAMBer, a BLAST hit extension to the nearest stop codon is acceptableif it satisfies the following conditions:
-
The hit has one of the appropriate start codons: ATG, GTG, TTG, or the same start codon as in the query sequence;
-
The hit has its beginning aligned with the beginning of the query sequence;
-
The BLAST e-value score is below a given threshold e
t
(in the default setting e
t
=10−10);
-
The ratio of the length of the extended hit to the query length is less than 1+p
t
and greater than 1−p
t
, where p
t
is a given threshold (in the default setting p
t
=0.2);
-
The percentage of identity of the hit (calculated as the number of identities divided by the query sequence length, times 100) is above a length-dependent threshold given by the adaptation of the HSSP curve introduced in our previous work [11], defined by the parameter n
t
(in the default setting n
t
=60.5).
In this procedure eCAMBer, unlike CAMBer, takes advantage of working with closely related genomes. In contrast to the old approach, in each iteration, instead of using each ORF sequence as a query, it first identifies groups of ORFs with exactly identical sequences. This approach avoids use of the same ORF sequence multiple times as a BLAST query.
The pseudocode for the closure procedure implemented in eCAMber is given in Algorithm ??, which we now describe in more details. First, we start with the set of input annotations , for each strain s in the set of considered strains S. Each ORF annotation (or simply ORF) is defined by a tuple (start, end, strand, contig, strain). Then, in ith iteration we compute the set of BLAST queries Qi as the set of distinct ORF sequences among all strains, which have not been used as BLAST queries yet. Next, we calculate in parallel, for each strain, BLAST results for all sequence queries in Qi. For each strain s∈S, all acceptable BLAST hit extensions are added to the strain annotations, defining . Next, the set of newly identified sequences across all genomes Hi is computed, which is then used to update the set of BLAST queries for the next iteration Qi+1←Hi∖Di, where Di denotes the set of all distinct sequences before the ith annotation. The procedure stops when no new ORF sequences are identified, hence Qi=∅. For each strain s∈S, we denote by the set of annotations produced by the closure procedure above. We further denote by Ac the set of all ORFs produced by the closure procedure.
Here, we also recall the notion of a multigene, introduced in our previous work [11], to account for the situation when multiple ORFs share the same stop codon in the annotations produced during the closure procedure. These ORFs are called multigene elements and represent putative gene translation units. Each multigene is represented by a tuple (e n d,s t r a n d,c o n t i g,s t r a i n,e l t s), where elts is the set of ORFs constituting the multigene. Also, for each strain s∈S, we denote by the set of multigenes resulting from the closure procedure.
Figure 2 presents a schematic view of the implementation of the closure procedure in eCAMBer.
The careful reader may also notice two important differences between the closure procedure in CAMBer and eCAMBer. In particular, eCAMBer uses unique ORF sequences, rather than ORF annotations, as queries against all strain genomes and, thus, does not repeat a BLAST query when the same ORF sequence corresponds to multiple ORF annotations. In contrast, firstly, CAMBer uses all ORF sequences as queries and, thus, may repeat a query BLAST several times. Secondly, CAMBer BLASTs a query against all strains’ genomes except the strain from which the query is taken. The second difference may potentially lead to different outcomes generated by these two approaches.
Since BLAST computations are the most time-consuming operation in each iteration of the closure procedure, we express the time complexity of one iteration of the closure procedure by the number of performed BLAST computations. Let k=|S| denote the number of considered strains and let be the maximal number of gene annotations per strain, in iteration i. Let, d=|Di| denote the number of distinct gene sequences among all gene annotations in all considered strains. Then, the time complexity of one iteration of the closure procedure implemented in eCAMBer can be expressed as O(d·k), whereas it is O(n·k2) for CAMBer. Here, it should be noted that, potentially, if every annotated ORF sequence in S is different, then . However, as our case study experiments show, d is usually much smaller than n·k (see Figure 3).
Importantly, the number of I/O operations per iteration is also significantly decreased, from O(n·k2) in CAMBer to O(k) in eCAMBer.
Consolidation graphs
Having the closure procedure computed we represent its results in the form of graph structures, called consolidation graphs.
First, we introduce the conceptual representation, called the ORF consolidation graph. In this graph G
O
=(V
O
,E
O
), each node o∈V
O
represents an ORF annotation in , for some s∈S. There is an undirected edge {o1,o2}∈E
O
between a pair of ORFs, if there is an acceptable BLAST hit from the sequence of o1 to o2 or from the sequence of o2 to o1. We additionally assume, that there are no self-edges, i. e. o1≠o2.
Second, we recall the definition of the multigene consolidation graph, introduced in our previous work [11]. In this graph G
M
=(V
M
,E
M
) each node m∈V
M
represents a multigene in , for some s∈S. There is an undirected edge {m1,m2}∈E
M
between a pair of multigenes, if there is a pair of ORFs o1∈e l t s(m1) and o2∈e l t s(m2), such that there is an edge between them in the ORF consolidation graph (i.e., such that {o1,o2}∈E
O
).
Finally, we introduce the sequence consolidation graph, which is the structure used in the implementation of eCAMBer, as it is a compact representation of the information stored in the ORF consolidation graph and the multigene consolidation graph. In this graph G
S
=(V
S
,E
S
,E
B
), nodes represent distinct ORF sequences. There are two types of edges, E
B
called BLAST-hit edges, and E
S
called shared-end edges. There is an undirected shared-end edge {x,y}∈E
S
between a pair of sequence nodes if there is a multigene having two elements with these sequences. There is an undirected BLAST-hit edge {x,y}∈E
B
between a pair of sequence nodes if there is an acceptable BLAST hit from x to an ORF with sequence y, or if there is an acceptable BLAST from y to an ORF with sequence x.
Figure 4 illustrates the correspondence between the ORF consolidation graph, sequence consolidation graph and the multigene consolidation graph.
Homologous gene clusters
The second step of eCAMBer is to determine homologous gene families as connected components of the multigene consolidation graph G
M
. There is a natural one-to-one correspondence between the connected components of the multigene consolidation graph and the connected components of the sequence consolidation graph (the latter connected components are obtained by taking the union of E
S
and E
B
). So, in eCAMBer, we do this using connected components of the sequence consolidation graph G
S
, because it tends to be smaller for closely related genomes. The obtained set of homologous gene families is represented as a set of disjoint multigene clusters, denoted by C
M
.
Refinement procedure
The third step of eCAMBer is the refinement procedure. The goal of the refinement procedure is splitting the homologous gene families, represented by multigene clusters, to obtain anchors. We call a multigene cluster an anchor, if it includes at most one multigene for every strain. Analogously, we call a multigene cluster non-anchor, if there is a strain which includes at least two multigenes in the cluster. Multigenes in the same anchor are potentially orthologous to each other, whereas a non-anchor contains at least two multigenes that are non-orthologous. Following CAMBer, we use genomic context information to decompose non-anchors into smaller multigene clusters that can emerge as anchors, as described below.
The input for the refinement procedure consists of the set of multigene clusters C
M
, the sequence consolidation graph G
M
, and the multigene annotations , for each strain s∈S. We start with classifying the set C
M
of multigene clusters into two disjoint sets of anchors and non-anchors, denoted C
A
and C
N
, respectively. We also sort all multigenes within strain contigs by positions of their stop codons. We reconstruct the subgraph of the multigene consolidation graph, called the refinement graph. In this graph G
R
=(V
R
,E
R
), nodes V
R
are constituted by the subset of multigenes, which belong to non-anchor clusters. There is an edge {m1,m2}∈E
R
, between a pair of multigenes coming from different strains, if there is an edge {m1,m2}∈E
M
, and the two multigenes belong to the same multigene cluster. By we denote the subset of edges between multigenes from a pair of strains s1 and s2. We omit details of the reconstruction of the refinement graph for brevity.
Then, for each unordered pair of strains {s1,s2} we perform the following procedure in parallel. First, for each multigene m we identify a pair of its nearest neighbours which belong to anchors with a multigene element present on the opposite strain. Such left and right neighbours of m are denoted as and , respectively. Then, for each edge we check whether it is supported in the sense that it satisfies one of the following conditions: (i) it connects multigenes belonging to a cluster, such that m1 and m2 are its only elements in strains s1 and s2; (ii) the corresponding pairs and belong to the same anchor; (iii) the corresponding pairs and belong to the same anchor. If any of the four neighbours does not exist we substitute it with a dummy node, which virtually belongs to any anchor.
Finally, we obtain the refined graph by removal of unsupported edges from G
R
. Then, the set of connected components C
R
of defines the set of multigene clusters after the split. Finally, we update the set of multigene clusters as .
The careful reader may also notice the differences between the refinement procedures implemented in CAMBer and eCAMBer. First, the refinement procedure in CAMBer performs in iterations until no multigene clusters can be split. In eCAMBer the refinement procedure consists of only one iteration. However, since the input and output for the procedure are of the same type, it can be used multiple times, until no new clusters are split. Second, the condition for an edge to be supported in eCAMBer is more relaxed than that in CAMBer. Both approaches, for a pair of multigenes on different strains, identify pairs of their nearest left and right neighbour multigenes (belonging to anchor clusters with elements on both strains). However, CAMBer checks the actual presence of edges between the neighbours, whereas eCAMBer only checks if the identified neighbours match the same pair of clusters. This approach allows eCAMBer to avoid a costly reconstruction of the whole multigene consolidation graph.
TIS voting procedure
The fourth step of eCAMBer is the TIS voting procedure. The goal of the TIS voting procedure is to select the most reliable TIS for each multigene. To do this we implement an approach based on the concept of majority voting. This strategy has also been used to improve genome annotation accuracy in several recent studies [24, 31].
In this procedure, for each multigene m in each multigene cluster , we try to find a TIS (originally annotated or transferred) that belongs to a connected component of the ORF consolidation graph, where the connected component satisfies the following two conditions: (i) it has TISs (originally annotated or transferred) present in at least 80% of the multigenes in c; and (ii) it has TISs originally annotated in at least 50% of the multigenes in c, or it has TISs originally annotated in at least twice the number of multigenes in c than all other connected components in c. If such a TIS is found, it is selected as the TIS for m. If such a TIS is not found, but m has an originally annotated TIS, then the originally annotated TIS is selected as the TIS for m. If both of these two cases cannot be applied, the TIS corresponding to the longest ORF in the multigene m is selected. After the TIS voting procedure, every multigene has exactly one TIS selected. Thus, we obtain unambiguous TIS annotation for every gene.
Note that the connected components of the sequence consolidation graph—after shared-end edges have been removed— are in a natural one-to-one correspondence with the connected components in the ORF consolidation graph. So in eCAMBer, we implement the TIS voting procedure using the sequence consolidation graph, as it tends to be smaller for closely related genomes.
Clean up procedure
The last step of eCAMBer is the clean up procedure, which is designed to filter out multigene clusters which are likely due to gene annotation errors propagated during the closure procedure.
The input for this procedure consists of the set of multigene clusters and multigene annotations , for each strain s∈S. For each multigene cluster we compute the following features: (i) l, the median multigene length in c; (ii) p, the ratio of the number of strains with at least one element from c to the total number of strains; (iii) r, the ratio of the number of strains with at least one originally annotated multigene to the total number of strains with at least one element from c; (iv) v, the ratio of the number of multigenes in the cluster that are overlapped by a longer multigene to the total number of multigenes in the cluster.
Then, we update the set of multigene clusters , by removing of multigene clusters for which ( or ) and (l<150 or v>0.5).
Other features
In order to make eCAMBer more user friendly we have added a functionality for downloading genome sequences and genome annotations from the PATRIC database, for the set of selected strains within a species. The downloaded data is automatically formatted as input for eCAMBer. Additionally, eCAMBer integrates Prodigal to generate input gene annotations.
Furthermore, eCAMBer generates output compatible with CAMberVis [27], a tool for simultaneous visualization of multiple genome annotations of bacterial strains. CAMBerVis also handles visualization of genome annotation inconsistencies.