DELEAT has been designed as a pipeline which takes a GenBank annotation file (.gb) as initial input and performs a series of analyses resulting in a list of candidate genome deletions, as well as some complementary information useful for the genome reduction process (Fig. 1). The different steps in the pipeline communicate through modified GenBank files which are compatible with genome visualisation tools such as Artemis [38], so that results from every step can be inspected by the user. DELEAT has been implemented for Linux systems, but can be run on any platform by means of a Docker image built from the provided Dockerfile. Software dependencies are detailed in the “Availability and requirements” section.
Step 1: predict-essentiality
The first step calculates an essentiality score for each annotated gene in the analysed genome (Fig. 1). This score lies in the interval [0, 1] and represents the gene's probability of belonging to the class "essential", which is calculated from multiple gene features (see section "Gene essentiality classifier"). These features are:
Strand
In prokaryotes, essential genes are preferentially located on the leading strand of the chromosome, which corresponds to the "+" strand between the ori and ter coordinates (origin and terminus of replication, respectively), and to the "−" strand on the other half (ter → ori). These coordinates are therefore needed to determine this gene feature. If they are not provided by the user, ori and ter are calculated automatically by the cumulative GC skew method [39].
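As an illustration of the general idea behind this method (not DELEAT's actual implementation), the GC skew (G − C)/(G + C) can be accumulated along the sequence in windows, with the minimum and maximum of the cumulative curve approximating ori and ter; the window size below is an arbitrary choice.

```python
# Minimal sketch of ori/ter estimation by cumulative GC skew
# (illustrative only; window size is an arbitrary choice).
def estimate_ori_ter(sequence, window=1000):
    cumulative, total = [], 0.0
    for start in range(0, len(sequence) - window + 1, window):
        win = sequence[start:start + window].upper()
        g, c = win.count("G"), win.count("C")
        total += (g - c) / (g + c) if (g + c) > 0 else 0.0
        cumulative.append(total)
    # ori is approximated by the minimum of the cumulative skew, ter by the maximum
    ori = cumulative.index(min(cumulative)) * window
    ter = cumulative.index(max(cumulative)) * window
    return ori, ter
```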
Phyletic retention (essential orthologs)
The Geptop 2.0 algorithm [32] was replicated for the calculation of this feature. For each protein in the genome, a BLASTp search is run against all reference proteomes and orthologs are identified by the reciprocal best hit (RBH) method. Once ortholog pairs are identified, a partial essentiality score is defined for protein i as:
\(S_{i} = \sum_{j = 1}^{N} \frac{M_{ij}}{D_{j}}\)
where j is the reference proteome index, N is the total number of reference proteomes, the "mapping score" \(M_{ij}\) is 1 if protein i has an essential ortholog in reference proteome j and 0 otherwise, and \(D_{j}\) is the evolutionary distance between the query species and reference species j. This distance is calculated by the composition vector method [40], based on K-mer frequencies in the proteome (here, K = 6). The composition vectors of all reference proteomes are pre-computed and provided with DELEAT in JSON format, in order to speed up the analysis. Once phyletic retention scores are obtained, they are scaled to the [0, 1] interval.
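For illustration, once RBH orthologs and essentiality labels have been resolved, the partial score can be computed as in the sketch below; the data structures and function names are hypothetical and are not part of DELEAT.

```python
# Illustrative sketch of the phyletic retention score S_i (hypothetical data structures).
def phyletic_retention(protein_id, reference_proteomes):
    """reference_proteomes: list of dicts with keys
    'essential_orthologs' (query protein ids with an essential RBH ortholog in proteome j)
    and 'distance' (composition-vector distance D_j to the query species)."""
    score = 0.0
    for proteome in reference_proteomes:
        m_ij = 1.0 if protein_id in proteome["essential_orthologs"] else 0.0
        score += m_ij / proteome["distance"]
    return score

# Scores for all proteins are later rescaled to [0, 1], e.g.:
# scaled = [(s - min(scores)) / (max(scores) - min(scores)) for s in scores]
```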
Codon usage, G + C content, length and hydrophobicity score
These gene features are calculated directly from the nucleotide sequences using the tool CodonW [41].
Values for these six gene features are then pre-processed following the same steps as for the model training set, and a classifier is used to obtain essentiality scores for each gene annotated with a locus tag (see section "Gene essentiality classifier"). RNA genes and pseudogenes are assigned scores of 1 and 0, respectively, without the need for gene feature calculation. Essentiality scores are used for downstream analyses in the pipeline, and they are added to the genome annotation, resulting in a modified-1 GenBank file (.gbm1), which can be edited to manually correct scores if necessary. Results from step 1 are also exported in CSV format.
Step 2: define-deletions
The second step takes a modified-1 GenBank file as input, together with arguments L (minimum desired deletion length) and E (score threshold for considering a gene as essential). We provide general guidelines on how to adjust these parameters in the Discussion section. A list of deletions is proposed, where a deletion is defined as a region equal to or longer than L which does not contain any gene with an essentiality score equal to or higher than E. A default non-coding margin of 200 nucleotides around essential genes is taken into account to avoid disruption of cis-regulatory elements, although the size of this margin may also be modified at run time. Proposed deletions are added to the genome annotation, resulting in a modified-2 GenBank file (.gbm2). A CSV file is also created with information about each deletion: name, coordinates, fraction of genome coverage, and gene content (counts of pseudogenes, hypothetical proteins, and annotated genes).
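As a rough illustration of this definition, the sketch below proposes deletions as the gaps left between margin-padded essential genes; it assumes linear coordinates and simplified inputs, and is not DELEAT's actual code.

```python
# Illustrative sketch: propose deletions as regions >= L free of essential genes
# (gene coordinates and margin handling are simplified; circularity is ignored).
def propose_deletions(genes, genome_length, L, E, margin=200):
    """genes: list of (start, end, essentiality_score) tuples, sorted by start."""
    # Keep genes considered essential and pad them with the non-coding margin
    blocked = [(max(0, s - margin), min(genome_length, e + margin))
               for s, e, score in genes if score >= E]
    deletions, previous_end = [], 0
    for start, end in blocked:
        if start - previous_end >= L:
            deletions.append((previous_end, start))
        previous_end = max(previous_end, end)
    if genome_length - previous_end >= L:
        deletions.append((previous_end, genome_length))
    return deletions
```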
Step 3: revise-deletions
At this point, the user must manually curate the list of proposed deletions by accepting, rejecting, modifying or renaming each one, and/or adding new ones. Then, step 3 is run in order to update both the deletion data table and the annotation, which becomes a modified-3 GenBank file (.gbm3).
Step 4: summarise
Once deletions are designed, the last steps in the pipeline are dedicated to providing complementary information about the genome reduction process. This fourth step builds a circular genome map in which the original genome is compared to the fully reduced one, using the tool pyCircos (github.com/ponnhide/pyCircos). It also generates a report (in plain text format) which summarises the deletion design process. This report includes the selected L and E parameters, the number of genes classified by essentiality and gene type (RNA, hypothetical, annotated, pseudogene, total), and basic data about the proposed deletions: total number of deletions, total size (in kb and % of genome) and total number of genes to be removed. Finally, an optimal deletion order is proposed according to the rule of minimising replichore imbalance at each step, since a sufficiently large size difference between the two replichores (the genome halves delimited by ori and ter) has been found to have deleterious effects on replication [1].
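The ordering rule can be pictured with a simple greedy sketch: at every step, apply the remaining deletion that leaves the two replichores most balanced. The simplifying assumptions (each deletion lies entirely within one replichore, sizes given in bp) and all names are illustrative and do not reproduce DELEAT's exact implementation.

```python
# Greedy sketch of the deletion-ordering rule: at each step, pick the remaining
# deletion that minimises the size difference between the two replichores.
def order_deletions(deletions, replichore_sizes):
    """deletions: dict name -> (length, replichore_index); replichore_sizes: [size_1, size_2]."""
    sizes = list(replichore_sizes)
    remaining, order = dict(deletions), []
    while remaining:
        def imbalance_after(item):
            _, (length, repl) = item
            new = sizes.copy()
            new[repl] -= length
            return abs(new[0] - new[1])
        name, (length, repl) = min(remaining.items(), key=imbalance_after)
        sizes[repl] -= length
        order.append(name)
        del remaining[name]
    return order
```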
Step 5: design-all-primers
This final step is optional and serves the purpose of designing PCR primers to carry out deletions by the megapriming method [42]. This consists of amplifying the regions flanking the desired deletion, which are then concatenated and cloned into a plasmid to be introduced into the target organism, where the cloned sequence is integrated by homologous recombination and the region in between is eliminated from the genome. Primer design for this purpose must follow a set of rules: the flanking regions should span approximately 800 bp to facilitate recombination, and the cloned product must contain neither targets of the restriction enzyme used for cloning nor sequences that occur repeatedly in the genome. Furthermore, appropriate tail sequences must be added to the primers in order for the method to function correctly (i.e. an adequate restriction site for cloning and complementary sequences for megapriming). Given these restrictions, an algorithm was designed that automatically calculates optimal primer pairs based on the coordinates of a deletion (Additional file 1: Figure S1). It makes use of Primer3-py (libnano.github.io/primer3-py), a Python API for the primer design tool Primer3 [43]. The primer design process is detailed in an output log file which includes oligonucleotide sequences with added tails, as well as their coordinates in the genome. The flanking region sequences are also saved to a FASTA file. This step of the pipeline can be run either manually for specific deletions (design-primers) or automatically for all deletions output by step 3 (design-all-primers).
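For orientation, the sketch below shows how a primer pair for a single ~800 bp flank might be requested through Primer3-py; the function name follows recent primer3-py releases, and the coordinates, product size range, result keys and tail sequence are illustrative assumptions, not DELEAT's actual settings.

```python
import primer3  # primer3-py; design_primers is exposed in recent releases

# Illustrative sketch: design a primer pair for one ~800 bp flanking region of a deletion.
def design_flank_primers(genome_seq, flank_start, flank_end, tail="GAATTC"):
    template = str(genome_seq[flank_start:flank_end])
    result = primer3.bindings.design_primers(
        {   # sequence-specific arguments
            "SEQUENCE_ID": "flank",
            "SEQUENCE_TEMPLATE": template,
        },
        {   # global design settings (example values)
            "PRIMER_NUM_RETURN": 1,
            "PRIMER_PRODUCT_SIZE_RANGE": [[700, 900]],  # ~800 bp to aid recombination
        },
    )
    # Result keys follow Primer3's output tag convention
    left = result["PRIMER_LEFT_0_SEQUENCE"]
    right = result["PRIMER_RIGHT_0_SEQUENCE"]
    # Tails (e.g. a restriction site for cloning) are prepended to the oligos
    return tail + left, tail + right
```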
DELEAT usage documentation and code examples can be found in the GitHub repository (see “Availability and requirements” section).
Gene essentiality classifier: training and evaluation
A logistic regression model was trained in order to classify genes according to essentiality, based on the six gene features listed in "Step 1: predict-essentiality". The Python package scikit-learn [44] (version 0.23.1) was used for model training and fine-tuning, prediction and evaluation. We followed a supervised machine learning approach, using as reference data the DEG 15.2 datasets, in which genes are labelled as essential or non-essential (binary classification). The logistic regression algorithm was chosen because it belongs to the family of linear models, which are widely used due to their interpretability and efficiency with very large datasets [45]. As such, it can be trained on the large dataset used here without the need for special computational resources, unlike other models. In addition, this algorithm is particularly well suited because it bases classification decisions on a probability value, which for this application is interpreted, and used downstream, as an "essentiality score" that allows filtering with any arbitrary threshold.
Gene feature selection was based both on the available literature and on model evaluation. Among the gene features obtainable from CodonW calculations, a minimal set that optimises model performance was selected. As a measure of codon usage, we chose the effective number of codons Nc [46] because it does not require a reference codon usage table. From the available CodonW features we also selected GC content, protein length, and the hydrophobicity indicator Gravy. In addition, phyletic retention (as defined by the Geptop 2.0 algorithm) and strand location were determined. Once this gene feature list was defined, the complete table of all genes vs. all features was computed for each of the 30 reference organisms. The concatenation of these 30 tables was taken as our reference dataset, which was split into a 60% training set and a 40% test set after random shuffling.
A model training pipeline was defined as follows. First, because the feature Nc sometimes cannot be calculated for short genes, missing values are imputed (by default, with the mean of the observed values). Then, all data are scaled to the [0, 1] interval, given that the regularised logistic regression model requires all variables to be in the same range. Finally, the model is trained on the training dataset, using L2 regularisation to avoid overfitting and class-weight rebalancing to account for class imbalance (non-essential genes greatly outnumber essential ones). After designing this training protocol, a grid search was carried out to find the optimal value of C, the regularisation strength parameter.
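A pipeline of this kind can be sketched in scikit-learn as shown below; the hyperparameter grid, variable names and column layout are illustrative, not the exact values used for DELEAT.

```python
# Sketch of an imputation -> scaling -> logistic regression pipeline with a grid
# search over the regularisation strength C (illustrative values).
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),          # fill missing Nc values
    ("scale", MinMaxScaler()),                           # rescale all features to [0, 1]
    ("clf", LogisticRegression(penalty="l2",             # L2-regularised logistic regression
                               class_weight="balanced",  # rebalance essential/non-essential
                               max_iter=1000)),
])

grid = GridSearchCV(pipeline, {"clf__C": [0.01, 0.1, 1, 10, 100]},
                    scoring="roc_auc", cv=5)
# grid.fit(X_train, y_train)  # X_train: genes x 6 features, y_train: essentiality labels
```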
Model evaluation was based on the area under the ROC curve (AUC), a metric well suited to datasets with class imbalance and a standard in the gene essentiality classification literature. Evaluation was carried out with three different approaches: calculating the AUC on the held-out test set, with fivefold cross-validation, and with leave-one-species-out cross-validation.
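As a sketch of the last of these approaches, leave-one-species-out evaluation can be expressed with scikit-learn's LeaveOneGroupOut splitter; this assumes the pipeline object from the previous sketch together with a full feature matrix X, labels y and a per-gene array of organism identifiers, all of which are illustrative.

```python
# Sketch of leave-one-species-out cross-validation (reuses the `pipeline`
# object defined in the previous sketch; X, y and groups are illustrative).
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# groups: one organism identifier per gene, so each fold holds out one species
auc_per_species = cross_val_score(pipeline, X, y, groups=groups,
                                  scoring="roc_auc", cv=LeaveOneGroupOut())
print(np.mean(auc_per_species))
```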