eCAMBer requires as its input a set of genome sequences and annotations for multiple bacterial genomes. It should be noted, however, that eCAMBer supports automatic download of bacterial annotations from the PATRIC [2] database and, as an option, it allows the use of Prodigal to generate the input annotations. It works in two phases. In the first phase it uses BLAST+ [30] to transfer each gene annotation among multiple strains. Based on the results of this procedure, homologous multigene clusters are identified. In the second phase eCAMBer applies subsequently the procedures for refinement, TIS voting and clean up. Figure 1 presents a schematic view of these subsequent procedures of eCAMBer.
The main improvements in eCAMBer as compared to CAMBer [11] are:

Significant speed up of the closure procedure for unifying genome annotations among bacterial strains;

Modified refinement procedure for splitting homologous gene families into orthologous gene clusters;

New TIS voting procedure for selecting the most reliable TIS;

New clean up procedure for removal of gene clusters that are likely to be gene annotation errors propagated during the closure procedure.
Here, we describe the details of the above listed procedures. The default values for parameters introduced below were chosen arbitrarily. However, based on our experiments, the program is robust for other choices of the parameters from a reasonable spectrum. eCAMBer allows users to specify values of all the parameters.
The closure procedure
The closure procedure is the first step of eCAMBer. The input consists of genome sequences and genome annotations for a set of closely related bacterial strains. In this procedure gene annotations are iteratively transferred among the set of considered strains, until no new ORFs (open reading frames) are identified.More precisely, a gene annotation is transferred to a new location if its BLAST hit extended to the nearest inframe stop codon is acceptable. Analogous to CAMBer, a BLAST hit extension to the nearest stop codon is acceptableif it satisfies the following conditions:

The hit has one of the appropriate start codons: ATG, GTG, TTG, or the same start codon as in the query sequence;

The hit has its beginning aligned with the beginning of the query sequence;

The BLAST evalue score is below a given threshold e_{
t
} (in the default setting e_{
t
}=10^{−10});

The ratio of the length of the extended hit to the query length is less than 1+p_{
t
} and greater than 1−p_{
t
}, where p_{
t
} is a given threshold (in the default setting p_{
t
}=0.2);

The percentage of identity of the hit (calculated as the number of identities divided by the query sequence length, times 100) is above a lengthdependent threshold given by the adaptation of the HSSP curve introduced in our previous work [11], defined by the parameter n_{
t
} (in the default setting n_{
t
}=60.5).
In this procedure eCAMBer, unlike CAMBer, takes advantage of working with closely related genomes. In contrast to the old approach, in each iteration, instead of using each ORF sequence as a query, it first identifies groups of ORFs with exactly identical sequences. This approach avoids use of the same ORF sequence multiple times as a BLAST query.
The pseudocode for the closure procedure implemented in eCAMber is given in Algorithm ??, which we now describe in more details. First, we start with the set of input annotations {A}_{s}^{0}, for each strain s in the set of considered strains S. Each ORF annotation (or simply ORF) is defined by a tuple (start, end, strand, contig, strain). Then, in ith iteration we compute the set of BLAST queries Q^{i} as the set of distinct ORF sequences among all strains, which have not been used as BLAST queries yet. Next, we calculate in parallel, for each strain, BLAST results for all sequence queries in Q^{i}. For each strain s∈S, all acceptable BLAST hit extensions {H}_{s}^{i} are added to the strain annotations, defining {A}_{s}^{i+1}\leftarrow {A}_{s}^{i}\cup {H}_{s}^{i}. Next, the set of newly identified sequences across all genomes H^{i} is computed, which is then used to update the set of BLAST queries for the next iteration Q^{i+1}←H^{i}∖D^{i}, where D^{i} denotes the set of all distinct sequences before the ith annotation. The procedure stops when no new ORF sequences are identified, hence Q^{i}=∅. For each strain s∈S, we denote by {A}_{s}^{c} the set of annotations produced by the closure procedure above. We further denote by A^{c} the set of all ORFs produced by the closure procedure.
Here, we also recall the notion of a multigene, introduced in our previous work [11], to account for the situation when multiple ORFs share the same stop codon in the annotations produced during the closure procedure. These ORFs are called multigene elements and represent putative gene translation units. Each multigene is represented by a tuple (e n d,s t r a n d,c o n t i g,s t r a i n,e l t s), where elts is the set of ORFs constituting the multigene. Also, for each strain s∈S, we denote by {M}_{s}^{c} the set of multigenes resulting from the closure procedure.
Figure 2 presents a schematic view of the implementation of the closure procedure in eCAMBer.
The careful reader may also notice two important differences between the closure procedure in CAMBer and eCAMBer. In particular, eCAMBer uses unique ORF sequences, rather than ORF annotations, as queries against all strain genomes and, thus, does not repeat a BLAST query when the same ORF sequence corresponds to multiple ORF annotations. In contrast, firstly, CAMBer uses all ORF sequences as queries and, thus, may repeat a query BLAST several times. Secondly, CAMBer BLASTs a query against all strains’ genomes except the strain from which the query is taken. The second difference may potentially lead to different outcomes generated by these two approaches.
Since BLAST computations are the most timeconsuming operation in each iteration of the closure procedure, we express the time complexity of one iteration of the closure procedure by the number of performed BLAST computations. Let k=S denote the number of considered strains and let n={\mathit{\text{max}}}_{s\in S}\left{A}_{s}^{i}\right be the maximal number of gene annotations per strain, in iteration i. Let, d=D^{i} denote the number of distinct gene sequences among all gene annotations in all considered strains. Then, the time complexity of one iteration of the closure procedure implemented in eCAMBer can be expressed as O(d·k), whereas it is O(n·k^{2}) for CAMBer. Here, it should be noted that, potentially, if every annotated ORF sequence in S is different, then \left{D}^{i}\right=\sum _{s\in S}\left{A}_{s}^{i}\right=O(n\xb7k). However, as our case study experiments show, d is usually much smaller than n·k (see Figure 3).
Importantly, the number of I/O operations per iteration is also significantly decreased, from O(n·k^{2}) in CAMBer to O(k) in eCAMBer.
Consolidation graphs
Having the closure procedure computed we represent its results in the form of graph structures, called consolidation graphs.
First, we introduce the conceptual representation, called the ORF consolidation graph. In this graph G_{
O
}=(V_{
O
},E_{
O
}), each node o∈V_{
O
} represents an ORF annotation in {A}_{s}^{c}, for some s∈S. There is an undirected edge {o_{1},o_{2}}∈E_{
O
} between a pair of ORFs, if there is an acceptable BLAST hit from the sequence of o_{1} to o_{2} or from the sequence of o_{2} to o_{1}. We additionally assume, that there are no selfedges, i. e. o_{1}≠o_{2}.
Second, we recall the definition of the multigene consolidation graph, introduced in our previous work [11]. In this graph G_{
M
}=(V_{
M
},E_{
M
}) each node m∈V_{
M
} represents a multigene in {M}_{s}^{c}, for some s∈S. There is an undirected edge {m_{1},m_{2}}∈E_{
M
} between a pair of multigenes, if there is a pair of ORFs o_{1}∈e l t s(m_{1}) and o_{2}∈e l t s(m_{2}), such that there is an edge between them in the ORF consolidation graph (i.e., such that {o_{1},o_{2}}∈E_{
O
}).
Finally, we introduce the sequence consolidation graph, which is the structure used in the implementation of eCAMBer, as it is a compact representation of the information stored in the ORF consolidation graph and the multigene consolidation graph. In this graph G_{
S
}=(V_{
S
},E_{
S
},E_{
B
}), nodes represent distinct ORF sequences. There are two types of edges, E_{
B
} called BLASThit edges, and E_{
S
} called sharedend edges. There is an undirected sharedend edge {x,y}∈E_{
S
} between a pair of sequence nodes if there is a multigene having two elements with these sequences. There is an undirected BLASThit edge {x,y}∈E_{
B
} between a pair of sequence nodes if there is an acceptable BLAST hit from x to an ORF with sequence y, or if there is an acceptable BLAST from y to an ORF with sequence x.
Figure 4 illustrates the correspondence between the ORF consolidation graph, sequence consolidation graph and the multigene consolidation graph.
Homologous gene clusters
The second step of eCAMBer is to determine homologous gene families as connected components of the multigene consolidation graph G_{
M
}. There is a natural onetoone correspondence between the connected components of the multigene consolidation graph and the connected components of the sequence consolidation graph (the latter connected components are obtained by taking the union of E_{
S
} and E_{
B
}). So, in eCAMBer, we do this using connected components of the sequence consolidation graph G_{
S
}, because it tends to be smaller for closely related genomes. The obtained set of homologous gene families is represented as a set of disjoint multigene clusters, denoted by C_{
M
}.
Refinement procedure
The third step of eCAMBer is the refinement procedure. The goal of the refinement procedure is splitting the homologous gene families, represented by multigene clusters, to obtain anchors. We call a multigene cluster an anchor, if it includes at most one multigene for every strain. Analogously, we call a multigene cluster nonanchor, if there is a strain which includes at least two multigenes in the cluster. Multigenes in the same anchor are potentially orthologous to each other, whereas a nonanchor contains at least two multigenes that are nonorthologous. Following CAMBer, we use genomic context information to decompose nonanchors into smaller multigene clusters that can emerge as anchors, as described below.
The input for the refinement procedure consists of the set of multigene clusters C_{
M
}, the sequence consolidation graph G_{
M
}, and the multigene annotations {M}_{s}^{c}, for each strain s∈S. We start with classifying the set C_{
M
} of multigene clusters into two disjoint sets of anchors and nonanchors, denoted C_{
A
} and C_{
N
}, respectively. We also sort all multigenes within strain contigs by positions of their stop codons. We reconstruct the subgraph of the multigene consolidation graph, called the refinement graph. In this graph G_{
R
}=(V_{
R
},E_{
R
}), nodes V_{
R
} are constituted by the subset of multigenes, which belong to nonanchor clusters. There is an edge {m_{1},m_{2}}∈E_{
R
}, between a pair of multigenes coming from different strains, if there is an edge {m_{1},m_{2}}∈E_{
M
}, and the two multigenes belong to the same multigene cluster. By {E}_{R}^{\{{s}_{1},{s}_{2}\}} we denote the subset of edges between multigenes from a pair of strains s_{1} and s_{2}. We omit details of the reconstruction of the refinement graph for brevity.
Then, for each unordered pair of strains {s_{1},s_{2}} we perform the following procedure in parallel. First, for each multigene m we identify a pair of its nearest neighbours which belong to anchors with a multigene element present on the opposite strain. Such left and right neighbours of m are denoted as {l}_{m}^{\{{s}_{1},{s}_{2}\}} and {r}_{m}^{\{{s}_{1},{s}_{2}\}}, respectively. Then, for each edge \{{m}_{1},{m}_{2}\}\in {E}_{R}^{\{{s}_{1},{s}_{2}\}} we check whether it is supported in the sense that it satisfies one of the following conditions: (i) it connects multigenes belonging to a cluster, such that m_{1} and m_{2} are its only elements in strains s_{1} and s_{2}; (ii) the corresponding pairs ({l}_{{m}_{1}}^{\{{s}_{1},{s}_{2}\}},{l}_{{m}_{2}}^{\{{s}_{1},{s}_{2}\}}) and ({r}_{{m}_{1}}^{\{{s}_{1},{s}_{2}\}},{r}_{{m}_{2}}^{\{{s}_{1},{s}_{2}\}}) belong to the same anchor; (iii) the corresponding pairs ({l}_{{m}_{1}}^{\{{s}_{1},{s}_{2}\}},{r}_{{m}_{2}}^{\{{s}_{1},{s}_{2}\}}) and ({r}_{{m}_{1}}^{\{{s}_{1},{s}_{2}\}},{l}_{{m}_{2}}^{\{{s}_{1},{s}_{2}\}}) belong to the same anchor. If any of the four neighbours does not exist we substitute it with a dummy node, which virtually belongs to any anchor.
Finally, we obtain the refined graph{G}_{R}^{\ast} by removal of unsupported edges from G_{
R
}. Then, the set of connected components C_{
R
} of {G}_{R}^{\ast} defines the set of multigene clusters after the split. Finally, we update the set of multigene clusters as {C}_{M}^{\ast}\leftarrow ({C}_{M}\setminus {C}_{N})\cup {C}_{R}.
The careful reader may also notice the differences between the refinement procedures implemented in CAMBer and eCAMBer. First, the refinement procedure in CAMBer performs in iterations until no multigene clusters can be split. In eCAMBer the refinement procedure consists of only one iteration. However, since the input and output for the procedure are of the same type, it can be used multiple times, until no new clusters are split. Second, the condition for an edge to be supported in eCAMBer is more relaxed than that in CAMBer. Both approaches, for a pair of multigenes on different strains, identify pairs of their nearest left and right neighbour multigenes (belonging to anchor clusters with elements on both strains). However, CAMBer checks the actual presence of edges between the neighbours, whereas eCAMBer only checks if the identified neighbours match the same pair of clusters. This approach allows eCAMBer to avoid a costly reconstruction of the whole multigene consolidation graph.
TIS voting procedure
The fourth step of eCAMBer is the TIS voting procedure. The goal of the TIS voting procedure is to select the most reliable TIS for each multigene. To do this we implement an approach based on the concept of majority voting. This strategy has also been used to improve genome annotation accuracy in several recent studies [24, 31].
In this procedure, for each multigene m in each multigene cluster c\in {C}_{M}^{\ast}, we try to find a TIS (originally annotated or transferred) that belongs to a connected component of the ORF consolidation graph, where the connected component satisfies the following two conditions: (i) it has TISs (originally annotated or transferred) present in at least 80% of the multigenes in c; and (ii) it has TISs originally annotated in at least 50% of the multigenes in c, or it has TISs originally annotated in at least twice the number of multigenes in c than all other connected components in c. If such a TIS is found, it is selected as the TIS for m. If such a TIS is not found, but m has an originally annotated TIS, then the originally annotated TIS is selected as the TIS for m. If both of these two cases cannot be applied, the TIS corresponding to the longest ORF in the multigene m is selected. After the TIS voting procedure, every multigene has exactly one TIS selected. Thus, we obtain unambiguous TIS annotation for every gene.
Note that the connected components of the sequence consolidation graph—after sharedend edges have been removed— are in a natural onetoone correspondence with the connected components in the ORF consolidation graph. So in eCAMBer, we implement the TIS voting procedure using the sequence consolidation graph, as it tends to be smaller for closely related genomes.
Clean up procedure
The last step of eCAMBer is the clean up procedure, which is designed to filter out multigene clusters which are likely due to gene annotation errors propagated during the closure procedure.
The input for this procedure consists of the set of multigene clusters {C}_{M}^{\ast} and multigene annotations {M}_{s}^{c}, for each strain s∈S. For each multigene cluster c\in {C}_{M}^{\ast} we compute the following features: (i) l, the median multigene length in c; (ii) p, the ratio of the number of strains with at least one element from c to the total number of strains; (iii) r, the ratio of the number of strains with at least one originally annotated multigene to the total number of strains with at least one element from c; (iv) v, the ratio of the number of multigenes in the cluster that are overlapped by a longer multigene to the total number of multigenes in the cluster.
Then, we update the set of multigene clusters {C}_{M}^{\ast}, by removing of multigene clusters for which (p<\frac{1}{3} or r<\frac{1}{3}) and (l<150 or v>0.5).
Other features
In order to make eCAMBer more user friendly we have added a functionality for downloading genome sequences and genome annotations from the PATRIC database, for the set of selected strains within a species. The downloaded data is automatically formatted as input for eCAMBer. Additionally, eCAMBer integrates Prodigal to generate input gene annotations.
Furthermore, eCAMBer generates output compatible with CAMberVis [27], a tool for simultaneous visualization of multiple genome annotations of bacterial strains. CAMBerVis also handles visualization of genome annotation inconsistencies.