
Galba: genome annotation with miniprot and AUGUSTUS

Abstract

Background

The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes.

Results

Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments.

Conclusions

Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.


Background

The Earth Biogenome Project (EBP) aims to sequence and annotate all eukaryotic life on Earth within ten years [1]. It has brought about an explosion of genomic data: for instance, the Wellcome Sanger Institute alone currently aims at sequencing and assembling 60 genomes per day. This provides an unprecedented opportunity to study the diversity of life on Earth. Generating genome assemblies is now easier than ever thanks to cheaper sequencing, e.g. with Nanopore technology (for a review of the technology, see [2]). However, while the number of available genomes continues to increase rapidly, the annotation of protein-coding genes remains a bottleneck in the analysis of these data [3]. This is, for instance, obvious from screening through Data Note Genome Announcements at Wellcome Open Research [4], or from counting genomes and their annotations at NCBI Genomes, where, on April 3rd 2023, only 23% of 28,754 species were listed with at least one annotated coding sequence (CDS) [5].

Genome annotation remains a bottleneck because it is currently not a straightforward task. Large centers, such as Ensembl at EBI or the NCBI, face computational and human resource bottlenecks when applying their in-house annotation pipelines to all incoming genomes, while small and less experienced teams simply might not know where to start because not all annotation pipelines work equally well in all genomes.

BRAKER3 [6], a pipeline that combines the gene prediction tools GeneMark-ETP [7] and AUGUSTUS [8, 9] for fully automated structural genome annotation with short read transcriptome data (RNA-Seq) and a large database of proteins (such as an OrthoDB clade partition [10]) was recently demonstrated to have high accuracy for the particular input scenario of genome file, RNA-Seq short read data, and a protein database. However, despite the EBP encouraging the sequencing of transcriptomes alongside genomes [3], it can be difficult to obtain RNA-Seq data for some organisms for logistical or financial reasons, or an initial genome annotation can be desired before a transcriptome is sequenced. Also, some genes may not be expressed in tissues being sequenced and thus do not have RNA-Seq support. Conservation species often need to be annotated for gene-level genetic load estimation, frequently lacking RNA-Seq data. In invasomics, annotation of protein coding genes is of particular importance for exploratory gene drive studies, and generating probes for expression and localization studies. For both, high-quality rapid annotation is essential to move towards downstream analyses.

In the absence of transcriptome evidence, it is a common procedure to annotate novel genomes by leveraging spliced alignment information of proteins from related species to the target genome. Since the resulting alignments usually only cover a fraction of all existing genes in a genome and do not cover untranslated regions (UTRs), protein alignments are commonly combined with gene prediction tools that employ statistical models (e.g. AUGUSTUS, SNAP [11], and variants of GeneMark [12,13,14]) to identify the remaining genes as well as possible. MAKER [15,16,17] was an early pipeline that automated this for the gene prediction step (though it lacks automated training of gene predictors). FunAnnotate [18] was originally designed to train gene finders using RNA-Seq data but also provides a workaround for protein input on fungi. It has since also been applied to other eukaryotic genomes (a random example: [19]). In contrast to these pipelines, which usually use evidence from one or a small number of donor proteomes, BRAKER2 [20] is a pipeline that leverages a large database of proteins with GeneMark-EP [13] and AUGUSTUS to predict protein-coding genes. BRAKER2 fully automates the training of GeneMark-EP and AUGUSTUS in novel genomes. BRAKER2 was previously demonstrated to have higher accuracy than MAKER [20].

In order to allow for the alignment of a large number of protein sequences in a reasonable time, GeneMark-EP first runs self-training GeneMark-ES [12, 14] to generate genomic seeds. Subsequently, DIAMOND [21] quickly returns hits of proteins against those initial candidate protein-coding sequences found in the genome, and Spaln [22, 23] is applied to run accurate spliced-alignment of the best matching protein sequences against the genomic seeds. BRAKER2 executes one iteration of this process to expand the genomic seed space by AUGUSTUS predictions. This complex sub-pipeline is called ProtHint and was introduced to make the alignment of a large database of proteins against the genome for evidence generation computationally feasible on desktop machines. BRAKER2 generally achieves high accuracy in small and medium-sized genomes. In large genomes (e.g., the genome of a chicken or mouse), self-training GeneMark-ES performs poorly during seed generation, leading to lower prediction accuracy of BRAKER2.

With the appearance of miniprot [24], a very fast and accurate tool for spliced-aligning proteins to genome sequences, the question arose whether it is necessary to run a complicated pipeline such as ProtHint in order to generate evidence and training genes to annotate novel genomes with protein evidence with high accuracy. Moreover, miniprot has no problems processing average vertebrate-sized genomes and therefore promises to overcome the main shortcoming of BRAKER2 in terms of accuracy in large genomes.

With regard to the EBP, we expect the appearance of a large number of genomes for which suitable reference proteomes for running BRAKER2 will not be fully available. BRAKER2 requires a large protein database input; it usually fails to run with reference proteins of only one species because its components, ProtHint and GeneMark-EP, rely heavily on evidence derived from multiple alignments (requiring \(\ge 4\) supporting alignments to classify a hint as high-confidence). This hinders BRAKER2’s ability to annotate genomes of poorly sequenced clades where often only one reference relative is available.

In order to address these open questions and challenges, we designed GALBA. GALBA is a fully automated pipeline that takes protein sequences of one or many species and a genome sequence as input, aligns the proteins to the genome with miniprot, trains AUGUSTUS, and then predicts genes with AUGUSTUS using the protein evidence. In this manuscript, we describe the GALBA pipeline and evaluate its accuracy in 14 genomes with existing reference annotation. Further, we present three use cases of de novo genome annotation in insects, vertebrates, and one land plant. We also evaluate the effect of merging GALBA and BRAKER2 gene sets with TSEBRA [25], the transcript selector for BRAKER.

Our pipeline is fully open source, containerized, and addresses the critical need for accurate gene annotation in large newly sequenced genomes. We believe that GALBA will greatly facilitate genome annotation for diverse organisms and is thus a valuable resource for the scientific community.

Results

We first briefly describe the GALBA pipeline and the effect of several features on gene prediction accuracy. Subsequently, we present accuracy results of the final software in 14 species. Further, we present three different use cases for GALBA.

GALBA pipeline

GALBA is a pipeline that connects three main components to predict protein coding genes: Firstly, we employ miniprot [24] to splice-align input protein sequences to the genome, and then use miniprothint [26] to score the resulting alignments and categorize the evidence into low- and high-confidence classes. We utilize the high-confidence alignment-derived genes with the highest alignment score per locus to train the gene prediction tool AUGUSTUS [8, 9]. Subsequently, we run AUGUSTUS with the Python package Pygustus to predict genes using the protein evidence in multithreading mode. After the first round of prediction, we select genes with 100% evidence support according to AUGUSTUS for a second round of training, while all other predicted genes are used to delineate flanking intergenic regions for the training of parameters for non-coding sequences. Then, we obtain the final set of predicted genes by AUGUSTUS (see Fig. 1). The idea of GALBA is that training AUGUSTUS on the basis of miniprot alignments will enable AUGUSTUS (with hints) to obtain a gene set that is more accurate and more complete than the miniprot alignments on their own. We show that GALBA works as expected in terms of accuracy with respect to reference annotations on the example of 14 species in Additional file 1: Table S10. This is also reflected by the drastically increasing complete BUSCOs when moving from training gene set to AUGUSTUS gene set within GALBA (see Additional file 1: Table S12).
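The selection of one training gene per locus can be illustrated with a minimal sketch. It is not GALBA's actual implementation (GALBA is written in Perl); the `Alignment` record, its field names, and the definition of a locus as a maximal run of overlapping alignments on the same sequence and strand are assumptions made only for this illustration.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one high-confidence protein alignment."""
    seqid: str     # contig or chromosome name
    strand: str    # "+" or "-"
    start: int     # 1-based inclusive start on the genome
    end: int       # 1-based inclusive end on the genome
    score: float   # alignment score
    protein: str   # identifier of the aligned protein


def best_per_locus(alignments):
    """Return the highest-scoring alignment of each locus.

    Assumption: a locus is a maximal run of alignments that overlap
    on the same sequence and strand.
    """
    ordered = sorted(alignments, key=lambda a: (a.seqid, a.strand, a.start, a.end))
    best = []
    for _, group in groupby(ordered, key=lambda a: (a.seqid, a.strand)):
        locus_end = None
        current_best = None
        for aln in group:
            if locus_end is None or aln.start > locus_end:
                # No overlap with the open locus: close it and start a new one.
                if current_best is not None:
                    best.append(current_best)
                current_best, locus_end = aln, aln.end
            else:
                # Overlap: extend the locus and keep the better-scoring alignment.
                locus_end = max(locus_end, aln.end)
                if aln.score > current_best.score:
                    current_best = aln
        if current_best is not None:
            best.append(current_best)
    return best
```

In this sketch, two overlapping alignments on the same strand compete for one locus, while an alignment on the opposite strand forms its own locus.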

Fig. 1

The GALBA pipeline. Miniprot performs rapid spliced alignment of proteins against the genome. Subsequently, miniprothint (2) scores and classifies these alignments. Training genes for AUGUSTUS are generated from the best high quality miniprot alignment per locus (1). After training, AUGUSTUS predicts genes using the alignment evidence generated by miniprothint. AUGUSTUS parameters are refined by one iteration of training (3). The numbering of steps in the figure caption corresponds to the order in which steps were introduced into GALBA during development, see Additional file 1: Results section S4.1

GALBA was implemented in Perl, building on the existing codebase of BRAKER [27].

Effect of mutation rate from reference to target

GALBA is designed to be used with reference proteomes of (possibly several) closely related species. As expected, spliced protein-to-genome alignment with miniprot works better the lower the mutation rate from donor to target is. We provide results of GALBA runs with single-species reference protein inputs in D. melanogaster next to a phylogenetic tree that indicates mutation rates, to give users a reference for how similar a donor species should be to achieve good results with GALBA (see Fig. 2).

Fig. 2

Gene prediction of GALBA provided with either a proteome of a single reference species (corresponding to phylogenetic tree from [57]), or executed with a combination of the species listed on the right. BRAKER2 can only be executed with a certain level of redundancy in the protein reference set, and results are therefore only provided for the combined protein input set

When executed using all annotated proteins of the target species itself, GALBA achieves a gene F1 of 79.5% (F1-scores are in this manuscript defined as \(\frac{2 \cdot \text {Sensitivity} \cdot \text {Specificity}}{ \text {Sensitivity} + \text {Specificity}}\)). When moving to D. ananassae, the accuracy drops by \(\sim\)7.5% points. Gene F1 does not drop below 63.6% when moving away to D. grimshawi, and even with Musca domestica input, GALBA maintains an accuracy of 57%. Interestingly, accuracy is restored to 71% when using a combined input of five protein donors. This last experiment can in fact also be performed with BRAKER2, which scores 3% points higher accuracy compared to GALBA.
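The F1 definition given above translates directly into a few lines of Python. The function name and the handling of the degenerate case (both inputs zero) are our own choices for this sketch; inputs may be given as fractions or as percentages, as long as both use the same scale.

```python
def f1_score(sensitivity, specificity):
    """Harmonic mean of sensitivity and specificity, as defined in the text.

    Returns 0.0 when both inputs are zero (an assumption of this sketch,
    made to avoid a division by zero).
    """
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)


# Illustrative values only, not taken from the paper's tables.
print(f1_score(90.0, 70.0))
```

Because F1 is a harmonic mean, it is pulled towards the lower of the two values, so a gene set with high sensitivity but many false positives still receives a moderate score.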

Accuracy in genomes with reference annotation

We provide accuracy results measured in genomes of 14 species by comparison to existing annotations (see Figs. 3 and 4 for sensitivity and specificity on gene level, and Table 1 for F1-scores for gene, transcript, and exon levels). The annotations of the small model organisms Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster have undergone extensive curation [28], and thus we believe that benchmarking on these data sets gives a realistic estimate of the true accuracy of gene prediction pipelines. Annotations of the other species are much less reliable. Therefore, we report gene prediction sensitivity measured on two more reliable subsets created by selecting transcripts that (1) are complete and have all introns supported by RNA-Seq mapping (Additional file 1: Table S3); (2) have identical gene structures in two distinct reference annotations (Additional file 1: Table S4).
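For readers unfamiliar with these measures, the following sketch computes sensitivity and specificity on the exon level, which is the simplest case. It assumes the convention common in gene prediction benchmarks, where specificity is the percentage of predicted features that are correct; the evaluation scripts actually used for this manuscript are not reproduced here. The function name and the tuple layout of an exon are hypothetical.

```python
def sensitivity_specificity(predicted, reference):
    """Exon-level sensitivity and specificity in percent.

    Each exon is a hashable tuple such as (seqid, start, end, strand).
    An exon counts as correct only if it matches a reference exon exactly.
    Assumed convention: specificity = correct predictions / all predictions.
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    sensitivity = 100.0 * true_positives / len(reference) if reference else 0.0
    specificity = 100.0 * true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity
```

Gene- and transcript-level measures follow the same idea but compare complete gene structures rather than individual exons.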

Fig. 3

Sensitivity and Specificity on gene level in 7 genomes smaller than 500 Mb. We show accuracy of miniprot raw alignments, AUGUSTUS ab initio trained on filtered miniprot alignments, GALBA (AUGUSTUS with hints by miniprot), BRAKER2, GeneMark-EP, GeneMark-ES, and a combination of the GALBA and BRAKER2 gene sets by TSEBRA (labelled as TSEBRA G+B)

Fig. 4

Sensitivity and Specificity on gene level in 7 genomes larger than 500 Mb. We show accuracy of miniprot raw alignments, AUGUSTUS ab initio trained on filtered miniprot alignments, GALBA (AUGUSTUS with hints by miniprot), BRAKER2, GeneMark-EP, GeneMark-ES, and a combination of the GALBA and BRAKER2 gene sets by TSEBRA (labelled as TSEBRA G+B)

Table 1 F1-scores of gene predictions for the genomes of 14 different species

We decided to show GALBA and BRAKER2 results with identical multi-species protein input side-by-side. Since users of BRAKER2 may be familiar with the Transcript Selector for BRAKER (TSEBRA) for combining several gene sets, we also provide TSEBRA results for which the GALBA and BRAKER2 outputs including their evidence were combined, enforcing the predictions by GALBA to avoid a drop of all transcripts without support by evidence. In large vertebrate genomes, GALBA shows a large improvement in accuracy compared to BRAKER2 (between 10 and 30% points in the gene F1-score). In small and medium-sized genomes, BRAKER2 is usually superior to GALBA. In A. thaliana, D. melanogaster, M. truncatula, P. tepidariorum, R. prolixus, and T. nigroviridis, BRAKER2 is \(\ge\)5% more accurate on the gene level than GALBA. GALBA shows particularly poor accuracy in C. elegans (17% points less than BRAKER2) and P. trichocarpa (7% points less than BRAKER2). In B. terrestris and S. lycopersicum, GALBA performs marginally better than BRAKER2.

This general impression also holds when looking at the subset of multi-exon genes that are supported by RNA-Seq from VARUS sampling (see Additional file 1: Table S3), and when inspecting Sensitivity in the subset of genes that are supported by more than one annotation provider (see Additional file 1: Table S4). In large vertebrate genomes, GALBA here achieves astonishing exon F1-scores of \(>90\%\), and gene F1-scores \(>70\)%, outperforming BRAKER2 by up to 42% points on the gene level.

Since BRAKER2 was originally designed to run with a large database of proteins instead of hand-picked proteomes of a few closely related species, we show BRAKER2 results with OrthoDB v11 partitions for different taxonomic clades (Arthropoda, Metazoa, Vertebrates, Viridiplantae), excluding proteins of the target species, and adding the hand-picked proteomes of close relatives by concatenation. This input does not change accuracy results much (see Additional file 1: Table S7). To the best of our knowledge, BRAKER2 is the most suitable pipeline for annotation scenarios where closer relatives have not yet been sequenced and annotated. Therefore, we also provide BRAKER2 results with OrthoDB partitions, excluding proteins of species that are in the same taxonomic order as the target species (see Footnote 1). In M. truncatula, P. tepidariorum, P. trichocarpa, and T. nigroviridis, BRAKER2 is even more accurate than GALBA using the remotely related protein set (see Additional file 1: Table S7).

It is an interesting question whether combining the GALBA and BRAKER2 gene sets (with the same protein input) with TSEBRA provides increased or restored accuracy. In general, TSEBRA tends to increase the ratio of mono-exonic to multi-exonic genes (see Table 2 and Additional file 1: Figure S5). In species where GALBA and BRAKER2 show initially comparable accuracy, TSEBRA application usually increases the accuracy by a few percentage points. However, if the GALBA gene prediction accuracy is particularly poor (e.g., in the case of C. elegans), then TSEBRA does not fully restore accuracy to that of the better gene finder (here BRAKER2). For large vertebrate genomes, the TSEBRA approach consistently yields very good results (despite increasing the number of single-exon genes), although the effect varies between about 1% point on gene level in D. rerio and 13% points in M. musculus.

Table 2 Ratios of mono-exonic to multi-exonic genes per species

Since GALBA may also be executed with a single reference proteome, we provide results of such experiments, using the closest relative from our selection of protein donor species. Using a single protein donor instead of a set of several with GALBA usually leads to a decrease in accuracy (on average 4% points gene F1). This effect is less pronounced in species where GALBA performs comparably poorly (e.g., R. prolixus or P. tepidariorum).

We also report results of FunAnnotate (see Additional file 1: Table S7) with the same protein and genome input as GALBA and BRAKER2. These results are not directly comparable because FunAnnotate requires specification of a seed species for training AUGUSTUS and of a BUSCO [30] lineage, and accuracy results may heavily depend on the selection of these (the seed species and BUSCO lineages used here are listed in Additional file 1: Table S6). FunAnnotate was competitive with GALBA (and BRAKER2) only in the case of predicting genes in A. thaliana.

Use case examples

GALBA is widely applicable to eukaryotic genomes of different sizes and assembly quality. In the following, we present three use cases.

Insect genomes

We compare annotation results for four Hymenoptera species across three pipelines: GALBA, BRAKER2, and FunAnnotate. For this, we select three high-quality wasp genomes from [31], Vespula vulgaris, V. germanica, and V. pensylvanica, previously annotated using FunAnnotate with multiple rounds of annotation polishing, and one additional wasp genome generated by short-read assembly, Polistes dominula [32] (see Table 6). The input proteome for all three pipelines consisted of UniProt Swiss-Prot [33] release 2023_01, combined with published proteomes from RefSeq [34] release 104 of Apis mellifera HA v3.1 [35] and Polistes canadensis [36].

Compared to the other pipelines, GALBA consistently predicts the most genes. BUSCO scores are comparable with BRAKER2 and higher than FunAnnotate (see Table 3). GeneValidator [37], which scores individual protein predictions, serves as a broader metric for analyzing genome annotation results. GALBA predicts more high-quality proteins; however, the lower quartile for GALBA is always 0, while for BRAKER2 the average lower quartile is 39.3. Taken together, this shows that GALBA predicts a larger number of both high-quality and low-quality proteins. Both pipelines outperform FunAnnotate in every metric. However, FunAnnotate was designed for use with RNA-Seq data (on fungi), so this is to be expected.

Table 3 Summary across four Hymenopteran insect genomes and de novo annotation pipelines

Vertebrate genomes

Three years ago, the Zoonomia consortium presented a large whole-genome alignment of various vertebrates [38]. Many of the genomes in this alignment have not been annotated for protein-coding genes until today. Most of the unannotated assemblies in the alignment were produced by short-read genome sequencing and are thus fragmented and incomplete (reflected by a low N50, a very large number of scaffolds, and BUSCO completeness far below 100%), and for many species, there is no transcriptome data available in the Sequence Read Archive [39]. We de novo annotated all whale and dolphin assemblies from that alignment that lack RNA-Seq evidence (see Table 6). The selected reference protein sets are listed in Additional file 1: Table S1.

We were able to apply multi-threaded GALBA to these genomes without any problems. GALBA predicted between 53k and 78k genes in these assemblies. The ratio of mono- to multi-exonic genes suggests an overprediction of single-exon genes. It should be noted that AUGUSTUS is capable of predicting incomplete genes that span sequence borders, and that the high single-exon count is not caused by genome fragmentation alone. Removing all incomplete genes from the prediction does not substantially decrease the mono:multi ratio (data not shown). BUSCO-completeness of predicted genes is comparable to the BUSCO-completeness of the corresponding genomic assemblies (see Table 4 and Additional file 1: Figures S3 and S2). OMArk [40], a tool that provides an estimate on annotation quality for a much larger set of conserved genes than BUSCO, also indicates a high level of completeness in these genomes (see Additional file 1: Table S8). However, the number of unexpected duplicate HOGs is large for these annotations. The consistency report of OMArk shows that the predicted genes are to a large extent possibly incomplete/fragmented (which is here likely caused by the genome assembly quality).

Table 4 Summary of protein-coding gene structures predicted in the previously unannotated whale and dolphin genomes of Zoonomia [38], and in Coix aquatica

Plant genome

We chose the genome of the plant Coix aquatica [41] (see Table 6) to demonstrate the ability of GALBA to de novo annotate large chromosome-scaffolded genomes. This species is one of many that currently lack an annotation of protein-coding genes at NCBI Genomes (even though the publication [41] describes an annotation approach and statistics on predicted protein coding genes), and there is no RNA-Seq data of this species available at the Sequence Read Archive (even though [41] report having used RNA-Seq data for annotation). In practice, a scientist focused on Coix aquatica would request the gene set from the authors of [41], but here, we took it as a de novo annotation example. The four reference proteomes used with GALBA are listed in Additional file 1: Table S1.

GALBA predicted 93k genes with a mono- to multi-exonic gene ratio of 1.07 in Coix aquatica. This is an overprediction compared to the number of 39,629 genes reported by [41]. However, the BUSCO sensitivity in the GALBA gene set is very high at \(\sim\)98% and comparable to BUSCO completeness of the assembly (see Additional file 1: Figure S4). OMArk also attests to a high degree of HOG completeness. Compared to the whale and dolphin gene predictions, the predictions in this plant genome show a much lower degree of fragmentation (see Additional file 1: Table S8). About half of the predicted proteins are placed as inconsistent, and most of these are identified by fragmented hits. GALBA here provided a quick and simple means to obtain a gene set.

Runtime

We report the wallclock time of running GALBA on D. melanogaster using proteins of D. ananassae, D. pseudoobscura, D. willistoni, D. virilis, and D. grimshawi on an HPC node with Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz using 48 threads. A complete GALBA run took 3:24 h. A full BRAKER2 run on the same node took 3:03 h. The most time-consuming step of GALBA (and BRAKER2) is often the metaparameter optimization for AUGUSTUS. This step can optionally be disabled (--skipOptimize), leading to slightly lower prediction accuracy in most cases. Without this optimization step, a GALBA run with the same input data took 0:44 h.

As a second example, we report wallclock time of 8:52 h for de novo annotation of the Coix aquatica genome on an HPC node with Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz using 72 threads (including metaparameter optimization). On the same data set and architecture, BRAKER2 required 11:11 h.

Discussion

The accuracy results obtained with GALBA are far from perfect when compared to reference annotations. However, GALBA provides substantially higher accuracy than BRAKER2 in the genomes of large vertebrates because GeneMark-ES within BRAKER2 performs poorly in such genomes when generating seed regions for spliced-alignment of proteins to the genome. We estimate that, to date, \(\sim\)1k unannotated genomes without transcriptome data could benefit from structural annotation with GALBA (see Additional file 1: Methods S3.7).

In smaller genomes, BRAKER2 remains superior because with the GeneMark-ES seeding process, it is able to produce protein to genome alignments with a higher specificity than miniprot (compare Fig. 5 and Additional file 1: Table S11).

Fig. 5

Network plot of gene F1 accuracy for (clockwise starting from the top, increasing genome sizes) insects, metazoa, plants, and vertebrates. We show accuracy of GALBA and its intermediate product miniprot, and of BRAKER2 and its intermediate GeneMark-ES and GeneMark-EP gene sets. Accuracy of the combiner TSEBRA combining the final gene sets of both GALBA and BRAKER2 is also shown as TSEBRA G+B

Further, we demonstrate that GALBA can process highly fragmented as well as large genomes in multi-threading mode, mainly attributed to the usage of Pygustus. We expect the Pygustus approach to be adopted in BRAKER to improve stability.

Implementing pipelines that leverage protein-to-genome alignment for training and running gene finders is not straightforward. In this work, we once more demonstrate that alignment scoring is crucial for achieving high gene prediction accuracy when protein evidence is used as the sole extrinsic evidence source.

While neither GALBA nor BRAKER2 can compete with pipelines that integrate RNA-Seq as an additional source of evidence, such as BRAKER3, GALBA is a valuable addition to closing the annotation gap for already deposited genomes and for future genomes generated within the EBP for which RNA-Seq data is not available.

Combining multiple gene sets commonly yields higher accuracy than using a single gene set of a single gene predictor. However, we caution users of genome annotation pipelines that combining gene sets from different sources may not always lead to improved accuracy. Recommended estimates for gene set quality are BUSCO Sensitivity, the number of predicted genes, and the mono-to-multi-exon gene ratio.

Both GALBA and BRAKER2 tend to heavily overpredict single-exon genes, most likely a result of incorrectly splitting genes. For plants, a desired mono- to multi-exonic gene ratio of 0.2 was recently postulated by [42]. This particular ratio certainly does not hold for non-plant species, and also the reference annotations of plants used in this manuscript often deviated from that recommendation. Nevertheless, GALBA, BRAKER2, and TSEBRA output may benefit from downstream mono-exonic gene filtering. The EBP would benefit from future developments to address the split gene problem in pipelines for fully automated annotation of protein-coding genes.
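The mono- to multi-exonic gene ratio is easy to compute once exon counts per transcript are known. The sketch below is a minimal illustration; the input layout and the rule that a gene counts as mono-exonic only if all of its transcripts have exactly one exon are assumptions of this sketch, not necessarily the definition used for Table 2.

```python
def mono_multi_ratio(exon_counts_by_gene):
    """Ratio of mono-exonic to multi-exonic genes.

    exon_counts_by_gene maps a gene identifier to a list holding the
    number of exons of each of its transcripts.
    Assumption: a gene is mono-exonic if every transcript has one exon.
    """
    mono = multi = 0
    for counts in exon_counts_by_gene.values():
        if not counts:
            continue  # skip genes without any transcript
        if max(counts) == 1:
            mono += 1
        else:
            multi += 1
    if multi == 0:
        # Ratio is undefined without multi-exonic genes.
        return float("inf") if mono else 0.0
    return mono / multi
```

A value far above the ratio seen in a trusted annotation of a related species may indicate split genes or spurious single-exon predictions, which is the situation where downstream mono-exonic gene filtering could help.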

GeMoMa is a different approach towards an accurate mapping of annotated protein-coding genes from one species to the genome of another [43,44,45]. GeMoMa does not work with protein sequence input in FASTA format but requires a gff3 or gtf file with the annotation of a related species. It was previously shown that GeMoMa has higher base Sensitivity in the human genome using the zebrafish annotation as the donor, while miniprot has higher base Sensitivity in the fruit fly when using the mosquito annotation as input. It is to be expected that a pipeline such as GALBA will yield more accurate results using GeMoMa instead of miniprot if GeMoMa achieves higher accuracy with a given input scenario. We have previously demonstrated that combining GeMoMa with BRAKER [46] and TSEBRA can be beneficial for annotating plant and insect genomes [47,48,49]. Particularly for larger genomes, it is worth replacing BRAKER2 with GALBA in such workflows in the future.

Recently, Helixer demonstrated the potential of modern machine learning for genome annotation [50]. Its accuracy is not yet competitive, possibly because these methods do not currently allow for the integration of extrinsic evidence. However, we believe that once an improved and more accurate gene finder on the basis of modern machine learning technology has been trained, it will be of great advantage not only in terms of accuracy, but also in terms of reduced runtime compared to methods such as GALBA.

We intend to expand GALBA in the future. For example, we might incorporate Helixer for faster trimming of the flanking regions of training genes for AUGUSTUS. Also, there is room for improvement in the hints generation given that the protein donors for GALBA might not always be closely related (see Additional file 1: Table S2).

There is a substantial gap in data processing between producing a GALBA (or BRAKER2) output and submission of the annotation to e.g. NCBI Genomes. This gap is already addressed in FunAnnotate, and also to some extent in MOSGA, a web service that executes BRAKER [51]. We expect the definition of a new standard for third-party genome annotation tagging in the foreseeable future. We will then adapt GALBA to produce an annotation that matches this novel standard in order to facilitate genome annotation tagging.

Conclusions

GALBA is an easy-to-use pipeline for the annotation of protein coding genes. It has competitive accuracy, in particular, it is superior to the BRAKER2 pipeline in the annotation of large vertebrate genomes.

Methods

Sequences for accuracy estimation

For estimating prediction accuracy of gene prediction tools, genomes with an already existing annotation are required. Here, we resort to using the genomes and annotations of 14 species (see Table 5), collected from two previous publications. Data of Arabidopsis thaliana, Bombus terrestris, Caenorhabditis elegans, Drosophila melanogaster, Rhodnius prolixus, Parasteatoda tepidariorum, Populus trichocarpa, Medicago truncatula, Solanum lycopersicum, and Xenopus tropicalis were prepared as described in [20] (see Footnote 2); the annotation-supporting RNA-Seq evidence is described in [53]. In addition, we used the following genomes and annotations from [7] (see Footnote 3): Danio rerio, Gallus gallus, and Mus musculus. For each species, a transcript was defined as reliable either if at least two annotation providers report it identically, or if all of its introns are supported by a spliced alignment from RNA-Seq evidence sampled with VARUS [55].

Table 5 Summary of genomes and annotations used for accuracy evaluation

As protein input, we manually selected the reference protein sets listed in Additional file 1: Table S1 from NCBI Genomes. These include close relatives of the target species. In short, we used NCBI Taxonomy [56] to identify species that are closely related to the target species and that have a protein sequence set originating from nuclear genome annotation. In order to enable a direct comparison with BRAKER2 (which cannot be executed with a protein set from only one reference species), we made sure to pick at least three protein sets for annotating each species.

Since GALBA is a pipeline that may also be executed with only one reference proteome, we also present accuracy with such single-species protein sets. In general, we selected the closest relative. The exception is Drosophila melanogaster, where we excluded D. simulans and D. erecta both from the combined protein set and from selection as single-species reference, because they have fewer than 0.2 expected mutations per genomic site and are thus extremely similar to the target species (see Fig. 2).

Successful generation of high-quality protein to genome alignments depends on the phylogenetic distance between donor and target species. We demonstrate this by evaluating GALBA in single-reference-mode on D. melanogaster, using protein donor species arranged on a phylogenetic tree from [57].

Software

All software versions used to generate results in this manuscript are listed in Additional file 1: Table S5.

Miniprot extensions

Miniprot was modified to output detailed residue alignment in a compact custom format to facilitate alignment parsing for scoring with miniprothint. An example of this format is shown in Additional file 1: Figure S1. Further, a new option -I was introduced that automatically sets the maximal size of introns to \(3.6\cdot \sqrt{\text {genomeSize}}\). On the Drosophila-Anopheles benchmark dataset used in the miniprot paper [24], the new feature doubles the alignment speed and reduces the number of spurious introns by 16.3% at the cost of missing 0.5% of introns that are longer than the threshold.
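As a rough illustration of the intron-size rule behind the -I option, the following sketch evaluates the formula above for two example genome sizes. The function name and the example sizes are made up for this illustration, and any rounding or clamping that miniprot may apply internally is not modeled here.

```python
import math


def max_intron_size(genome_size: int) -> int:
    """Maximal intron size according to 3.6 * sqrt(genome size)."""
    return round(3.6 * math.sqrt(genome_size))


# Hypothetical genome sizes in base pairs, chosen only for illustration.
for size in (100_000_000, 400_000_000):
    print(size, max_intron_size(size))
```

The square root makes the threshold grow slowly with genome size: a fourfold larger genome only doubles the maximal intron size.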

Miniprothint

During early development of GALBA, it became clear that miniprot (like any spliced aligner) may produce spurious alignments if the reference proteins originate from distantly related species (compare Additional file 1: Table S2). Furthermore, conflicting alignments of homologous proteins from multiple donor species negatively impacted the quality of the AUGUSTUS training gene set. To solve these problems, we wrote an alignment scorer—here called miniprothint—that scores all predicted introns by computing the intron border alignment (IBA) and the intron mapping coverage (IMC) scores. Briefly, the IBA score characterizes the conservation of exons adjacent to the scored intron, with larger weights given to parts close to the donor and acceptor splice sites. The IMC score counts how many times a given intron was exactly mapped by spliced alignments of distinct target proteins. See [58], pages 20 and 21, for a precise definition of both scores.
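The IMC score described above can be sketched in a few lines. This is a simplified reading of the verbal description, not the miniprothint implementation; the precise definition is given in [58]. The input format (protein identifier plus a list of intron coordinates) and all names are assumptions made for this example.

```python
from collections import defaultdict


def intron_mapping_coverage(alignments):
    """Count, for each intron, the distinct proteins that map it exactly.

    alignments: iterable of (protein_id, introns), where each intron is a
    hashable tuple (seq_id, start, end, strand). This format is hypothetical.
    """
    supporters = defaultdict(set)
    for protein_id, introns in alignments:
        for intron in introns:
            # A set ensures that each protein is counted once per intron.
            supporters[intron].add(protein_id)
    return {intron: len(proteins) for intron, proteins in supporters.items()}


alignments = [
    ("protA", [("chr1", 100, 200, "+"), ("chr1", 400, 500, "+")]),
    ("protB", [("chr1", 100, 200, "+")]),
    ("protB", [("chr1", 100, 200, "+")]),  # second alignment of the same protein
    ("protC", [("chr1", 100, 205, "+")]),  # boundary differs, so not an exact match
]
imc = intron_mapping_coverage(alignments)
```

In this toy input, the intron at 100 to 200 is supported by two distinct proteins, while the intron with the shifted boundary counts as a separate intron with a single supporter.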

Based on these scores, miniprothint discards the least reliable evidence and separates the remaining evidence into two classes: high- and low-confidence (see Additional file 1: Figure S6 for more details). High-confidence evidence is used to select training gene candidates for AUGUSTUS and is enforced during gene prediction with AUGUSTUS. Low-confidence evidence is supplied to AUGUSTUS in the form of prediction hints. In comparison to the scoring defined in [58], miniprothint adds penalties for in-frame stop codons and frameshifts (common in the alignments of remote homologs) and significantly improves the computational speed of alignment scoring. The speed improvements are, in part, achieved by taking advantage of miniprot’s compact alignment format (see Additional file 1: Figure S1).

Iterative training

When generating putative training genes for AUGUSTUS from any kind of extrinsic evidence, typically only some of the gene structures that actually exist in the genome will be identified. Otherwise, one would not need to train a gene finder to find the others. In the case of AUGUSTUS, training genes are excised from the genome with flanking and hopefully truly intergenic regions. There is a certain risk that a flanking region will, in fact, carry parts of neighboring genes. Using such “contaminated” intergenic regions can lead to sub-optimal training results. Therefore, we implemented the training of AUGUSTUS in GALBA as follows (as suggested, e.g., in [9]):

  1. etraining on the original training genes derived from evidence with possibly contaminated flanking regions

  2. prediction of genes with the evidence by AUGUSTUS after initial training

  3. selection of predicted genes with 100% evidence support, other genes are only eliminated from flanking regions

  4. etraining with training genes with filtered flanking regions that are free of predicted genes

  5. optimize_augustus.pl for metaparameter optimization
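The idea behind the filtered flanking regions can be illustrated with a small interval computation. This is only a sketch of one possible way to obtain flanks that are free of predicted genes, namely by trimming each flank at the nearest predicted neighbor; it is not taken from the GALBA code, and the function name, coordinate convention, and flank length are assumptions.

```python
def trim_flanks(gene, flank, predicted, seq_len):
    """Return (left, right) bounds of the region excised around a training gene.

    gene:      (start, end) of the training gene, 1-based inclusive
    flank:     desired flank length on each side (hypothetical parameter)
    predicted: (start, end) intervals of other predicted genes on the sequence
    seq_len:   length of the sequence

    Predicted genes that overlap the training gene itself are ignored here.
    """
    g_start, g_end = gene
    left = max(1, g_start - flank)
    right = min(seq_len, g_end + flank)
    for p_start, p_end in predicted:
        if p_end < g_start and p_end >= left:
            left = p_end + 1       # upstream neighbor reaches into the left flank
        elif p_start > g_end and p_start <= right:
            right = p_start - 1    # downstream neighbor reaches into the right flank
    return left, right


region = trim_flanks(
    gene=(1000, 2000),
    flank=500,
    predicted=[(300, 700), (2200, 2600), (5000, 6000)],
    seq_len=10_000,
)
```

In this example both flanks are shortened because a predicted gene reaches into each of them, whereas the distant gene at 5000 to 6000 has no effect.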

Multithreading AUGUSTUS

AUGUSTUS is not multithreaded, and the gene prediction and metaparameter optimization steps can have a relatively long running time. To address this issue, the BRAKER pipelines split the genome into individual sequence files and execute AUGUSTUS using the Perl module Parallel::ForkManager. However, this approach can strain the file system when dealing with highly fragmented genomes, as a large number of files needs to be generated.

To overcome this limitation, we developed Pygustus, a Python wrapper for AUGUSTUS that supports parallel execution. This allows for multithreading of AUGUSTUS prediction on genomes of any size and fragmentation level. Large chromosomes are split into overlapping chunks that are small enough for fast parallel execution. The overlaps are introduced to prevent the truncation of genes. Conversely, many short sequences are joined into temporary FASTA files, so that the number of files stays small enough not to strain the file system. Pygustus decides automatically, and invisibly to the user, which sequences to split or join, and assemblies may contain very many small sequences and a few very large sequences at the same time. The annotation is then done in parallel, and the redundancies in annotations from overlapping runs are removed.
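The two operations described above, splitting long sequences into overlapping chunks and joining short sequences into a small number of files, can be sketched as follows. This is a generic illustration and does not use or reproduce the Pygustus API; the function names, the chunk size, the overlap, and the greedy batching strategy are assumptions.

```python
def split_into_chunks(length, chunk_size, overlap):
    """Return 0-based half-open windows covering [0, length) with an overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, length)
        chunks.append((start, end))
        if end == length:
            break
        # Step back by the overlap so that genes at a chunk border are
        # fully contained in at least one of the neighboring chunks.
        start = end - overlap
    return chunks


def join_short_sequences(lengths, max_total):
    """Greedily group short sequences into batches of at most max_total bases.

    lengths: list of (sequence_name, sequence_length) pairs
    """
    batches, current, total = [], [], 0
    for name, length in lengths:
        if current and total + length > max_total:
            batches.append(current)
            current, total = [], 0
        current.append(name)
        total += length
    if current:
        batches.append(current)
    return batches


chunks = split_into_chunks(length=250, chunk_size=100, overlap=20)
batches = join_short_sequences(
    [("s1", 40), ("s2", 50), ("s3", 30), ("s4", 90)], max_total=100
)
```

Each chunk or batch could then be handed to a separate AUGUSTUS process. Predictions from overlapping chunks would still need to be merged and deduplicated afterwards, which this sketch does not cover.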

In GALBA, we use Pygustus to multithread AUGUSTUS predictions, thereby enabling efficient genome annotation without compromising the file system. This approach can be particularly useful for researchers dealing with large and complex genomes, where computational efficiency is critical.

Repeat masking

The genomes of the 14 species used for accuracy assessment were previously masked for repeats in [13] and [7]. In short, species-specific repeat libraries were generated with RepeatModeler2 [59]. Subsequently, the genomes were masked with RepeatMasker [60] using those libraries. For vertebrate genomes, an additional step of masking with TandemRepeatsFinder [61] was performed (Footnote 4).

The same approach was adopted for each whale and dolphin genome (including the TandemRepeatsFinder step). The additional TandemRepeatsFinder step was not applied to the insects and the plant in Table 6. For Polistes dominula, we used repeat masking as provided by NCBI Genomes. Genomes of Vespula species were masked with RepeatModeler and RepeatMasker as described in [31].

Table 6 Genomes de novo annotated with GALBA using reference protein sets listed in Additional file 1: Table S1 as use cases that demonstrate the applicability of GALBA

Accuracy evaluation

For selected genomes, we used the existing reference annotation to assess Sensitivity (Footnote 5) and Specificity (Footnote 6) of predictions by GALBA, BRAKER2, FunAnnotate, and TSEBRA at the gene, transcript, and exon level. For this purpose, we used the script compute_accuracies.sh that is part of the BRAKER code. To summarize Sensitivity and Specificity, we computed the F1-score as

$$\begin{aligned} \frac{2\cdot \text {Sensitivity} \cdot \text {Specificity}}{\text {Sensitivity} + \text {Specificity}}. \end{aligned}$$
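As a small worked example, the following sketch computes Sensitivity, Specificity (as defined in Footnotes 5 and 6, i.e. relative to false negatives and false positives, respectively), and the resulting F1-score from raw counts. The function names and the counts are invented for illustration; this is not the compute_accuracies.sh script.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True Positives / (True Positives + False Negatives)."""
    return tp / (tp + fn)


def specificity(tp: int, fp: int) -> float:
    """True Positives / (True Positives + False Positives)."""
    return tp / (tp + fp)


def f1_score(sn: float, sp: float) -> float:
    """Harmonic mean of Sensitivity and Specificity."""
    if sn + sp == 0:
        return 0.0
    return 2 * sn * sp / (sn + sp)


# Hypothetical counts: 80 correct predictions, 20 missed, 20 wrong.
sn = sensitivity(tp=80, fn=20)
sp = specificity(tp=80, fp=20)
f1 = f1_score(sn, sp)
```

Because the F1-score is a harmonic mean, it stays low if either of the two measures is low, which makes it a stricter summary than the arithmetic mean.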

Prediction quality estimation

For estimating the quality of gene prediction in previously unannotated genomes, we provide BUSCO Sensitivity of both genomes and predicted proteomes [30], and OMArk results [40]. For BUSCO assessment of assembly and proteome completeness in the insect use case, we used hymenoptera_odb10. In dolphins and whales, we used the vertebrate_odb10 lineage. For Coix aquatica, we used poales_odb10. Further, we report basic metrics such as the number of predicted genes, the number of transcripts, the recently suggested mono-exonic to multi-exonic gene ratio [42], and the maximum number of exons per gene across all predicted genes.
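The last two basic metrics can be computed directly from the number of exons per gene. The sketch below assumes that one exon count per gene is already available (how this count is derived for genes with several transcripts is left open); the function name and the example counts are hypothetical.

```python
def mono_to_multi_ratio(exon_counts):
    """Ratio of mono-exonic genes to multi-exonic genes.

    exon_counts: list with one exon count per predicted gene (assumed input)
    """
    mono = sum(1 for n in exon_counts if n == 1)
    multi = sum(1 for n in exon_counts if n > 1)
    if multi == 0:
        raise ValueError("no multi-exonic genes, ratio is undefined")
    return mono / multi


# Invented exon counts for seven genes.
exon_counts = [1, 1, 3, 5, 2, 1, 8]
ratio = mono_to_multi_ratio(exon_counts)
max_exons = max(exon_counts)
```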

To provide a more fine-grained view of the insect annotation use case, we use GeneValidator [37], which scores the predicted proteins against a reference set by length, coverage, and conserved regions, and identifies putative merges. Each predicted protein receives an individual score, with 90 being considered a good prediction, and a score of 0 indicating a very poor prediction or a lack of BLAST hits to the reference proteome from which to estimate potential lengths and conserved regions. In this instance, we use our input proteome for the prediction tools (Swiss-Prot and RefSeq of A. mellifera and P. canadensis), consisting of 611,968 proteins.

Assembly statistics

We used seqstats and BUSCO to report basic assembly metrics (see Additional file 1: Methods).

Availability of data and materials

The datasets analysed during the current study have previously been made available. Data sources are summarized in the GALBA-data repository at https://github.com/KatharinaHoff/GALBA-data. Requests for data from this study can be directed to the corresponding author.

Notes

  1. For this, we used the orthodb-clades pipeline [29] to generate the protein sets.

  2. Genomes, repeat masking, and annotation processing are documented at [52].

  3. Described at [54].

  4. See [53].

  5. \(\frac{\text {True Positives}}{\text {True Positives}+\text {False Negatives}}\).

  6. \(\frac{\text {True Positives}}{\text {True Positives}+\text {False Positives}}\).

References

  1. Lewin HA, Robinson GE, Kress WJ, Baker WJ, Coddington J, Crandall KA, Durbin R, Edwards SV, Forest F, Gilbert MTP, et al. Earth BioGenome project: sequencing life for the future of life. Proc Natl Acad Sci. 2018;115(17):4325–33.

  2. Wang Y, Zhao Y, Bollas A, Wang Y, Au KF. Nanopore sequencing technology, bioinformatics and applications. Nat Biotechnol. 2021;39(11):1348–65.

  3. Lawniczak MK, Durbin R, Flicek P, Lindblad-Toh K, Wei X, Archibald JM, Baker WJ, Belov K, Blaxter ML, Marques Bonet T, et al. Standards recommendations for the Earth BioGenome Project. Proc Natl Acad Sci. 2022;119(4):2115639118.

  4. Hope H, Willis S, Markie M, Elliott L. Wellcome Open Research. https://wellcomeopenresearch.org/browse/articles. Accessed 10 April 2023. 2023.

  5. National Center for Biotechnology Information. NCBI Genomes. https://www.ncbi.nlm.nih.gov/genome/browse#!/eukaryotes/. Accessed 10 April 2023. 2023.

  6. Gabriel L, Hoff KJ, Bruna T, Lomsadze A, Borodovsky M, Stanke M. The BRAKER3 genome annotation pipeline. Plant and Animal Genomes Conference. 2023;30.

  7. Bruna T, Lomsadze A, Borodovsky M. GeneMark-ETP: automatic gene finding in eukaryotic genomes in consistence with extrinsic data. bioRxiv. 2023. https://doi.org/10.1101/2023.01.13.524024.

  8. Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008;24(5):637–44.

  9. Hoff KJ, Stanke M. Predicting genes in single genomes with AUGUSTUS. Curr Protoc Bioinform. 2019;65(1):57.

  10. Kuznetsov D, Tegenfeldt F, Manni M, Seppey M, Berkeley M, Kriventseva EV, Zdobnov EM. OrthoDB v11: annotation of orthologs in the widest sampling of organismal diversity. Nucleic Acids Res. 2023;51(D1):445–51.

  11. Korf I. Gene finding in novel genomes. BMC Bioinform. 2004;5(1):1–9.

  12. Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res. 2008;18(12):1979–90.

  13. Bruna T, Lomsadze A, Borodovsky M. GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins. NAR Genom Bioinform. 2020;2(2):026.

  14. Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 2005;33(20):6494–506.

  15. Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Alvarado AS, Yandell M. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18(1):188–96.

  16. Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinform. 2011;12(1):1–14.

  17. Campbell MS, Holt C, Moore B, Yandell M. Genome annotation and curation using MAKER and MAKER-P. Curr Protoc Bioinform. 2014;48(1):4–11.

  18. FunAnnotate. https://github.com/nextgenusfs/funannotate. Accessed 10 April 2023. 2023.

  19. Ranz JM, González PM, Clifton BD, Nazario-Yepiz NO, Hernández-Cervantes PL, Palma-Martínez MJ, Valdivia DI, Jiménez-Kaufman A, Lu MM, Markow TA, et al. A de novo transcriptional atlas in danaus plexippus reveals variability in dosage compensation across tissues. Commun Biol. 2021;4(1):791.

  20. Bruna T, Hoff KJ, Lomsadze A, Stanke M, Borodovsky M. BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genom Bioinform. 2021;3(1):108.

  21. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015;12(1):59–60.

  22. Gotoh O. Direct mapping and alignment of protein sequences onto genomic sequence. Bioinformatics. 2008;24(21):2438–44.

  23. Iwata H, Gotoh O. Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features. Nucleic Acids Res. 2012;40(20):161–161.

  24. Li H. Protein-to-genome alignment with miniprot. Bioinformatics. 2023;39(1):014.

  25. Gabriel L, Hoff KJ, Bruna T, Borodovsky M, Stanke M. TSEBRA: transcript selector for BRAKER. BMC Bioinform. 2021;22(1):1–12.

  26. Bruna T. miniprothint. https://github.com/tomasbruna/miniprothint.git. Accessed 10 April 2023. 2023.

  27. Gabriel L, Bruna T, Hoff KJ, Lomsadze A, Borodovsky M, Stanke M. BRAKER. https://github.com/Gaius-Augustus/BRAKER. Accessed 10 April 2023. 2023.

  28. Yandell M, Ence D. A beginner’s guide to eukaryotic genome annotation. Nat Rev Genet. 2012;13(5):329–42.

  29. Bruna T. OrthoDB-clades. https://github.com/tomasbruna/orthodb-clades. Accessed 10 April 2023. 2023.

  30. Manni M, Berkeley MR, Seppey M, Zdobnov EM. BUSCO: assessing genomic data quality and beyond. Curr Protoc. 2021;1(12):323.

  31. Harrop TW, Guhlin J, McLaughlin GM, Permina E, Stockwell P, Gilligan J, Le Lec MF, Gruber MA, Quinn O, Lovegrove M, et al. High-quality assemblies for three invasive social wasps from the Vespula genus. G3: Genes Genom Genet. 2020;10(10):3479–88.

  32. Standage DS, Berens AJ, Glastad KM, Severin AJ, Brendel VP, Toth AL. Genome, transcriptome and methylome sequencing of a primitively eusocial wasp reveal a greatly reduced dna methylation system in a social insect. Mol Ecol. 2016;25(8):1769–84.

  33. Uniprot. The universal protein knowledgebase in 2023. Nucleic Acids Res. 2023;51(D1):523–31.

  34. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D, et al. Reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):733–45.

  35. Wallberg A, Bunikis I, Pettersson OV, Mosbech M-B, Childers AK, Evans JD, Mikheyev AS, Robertson HM, Robinson GE, Webster MT. A hybrid de novo genome assembly of the honeybee, apis mellifera, with chromosome-length scaffolds. BMC Genom. 2019;20:1–19.

  36. Patalano S, Vlasova A, Wyatt C, Ewels P, Camara F, Ferreira PG, Asher CL, Jurkowski TP, Segonds-Pichon A, Bachman M, et al. Molecular signatures of plastic phenotypes in two eusocial insect species with simple societies. Proc Natl Acad Sci. 2015;112(45):13970–5.

  37. Drăgan M-A, Moghul I, Priyam A, Bustos C, Wurm Y. GeneValidator: identify problems with protein-coding gene predictions. Bioinformatics. 2016;32(10):1559–61.

  38. A comparative genomics multitool for scientific discovery and conservation. Nature. 2020;587(7833):240–245.

  39. Katz K, Shutov O, Lapoint R, Kimelman M, Brister JR, O’Sullivan C. The sequence read archive: a decade more of explosive growth. Nucleic Acids Res. 2022;50(D1):387–90.

  40. Nevers Y, Rossier V, Train C, Altenhoff AM, Dessimoz C, Glover N. Multifaceted quality assessment of gene repertoire annotation with OMArk. bioRxiv, 2022;2022–11.

  41. Guo C, Wang Y, Yang A, He J, Xiao C, Lv S, Han F, Yuan Y, Yuan Y, Dong X, et al. The coix genome provides insights into panicoideae evolution and papery hull domestication. Mol Plant. 2020;13(2):309–20.

  42. Vuruputoor VS, Monyak D, Fetter KC, Webster C, Bhattarai A, Shrestha B, Zaman S, Bennett J, McEvoy SL, Caballero M, et al. Welcome to the big leaves: best practices for improving genome annotation in non-model plant genomes. bioRxiv. 2022. https://doi.org/10.1101/2022.10.03.510643

  43. Keilwagen J, Wenk M, Erickson JL, Schattat MH, Grau J, Hartung F. Using intron position conservation for homology-based gene prediction. Nucleic Acids Res. 2016;44(9):89–89.

  44. Keilwagen J, Hartung F, Paulini M, Twardziok SO, Grau J. Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. BMC Bioinform. 2018;19:1–12.

  45. Keilwagen J, Hartung F, Grau J. GeMoMa: homology-based gene prediction utilizing intron position conservation and RNA-seq data. Gene prediction: Methods Protoc. 2019;161–177.

  46. Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics. 2016;32(5):767–9.

  47. Errbii M, Keilwagen J, Hoff KJ, Steffen R, Altmüller J, Oettler J, Schrader L. Transposable elements and introgression introduce genetic variation in the invasive ant Cardiocondyla obscurior. Mol Ecol. 2021;30(23):6211–28.

  48. Wöhner TW, Emeriewen OF, Wittenberg AH, Schneiders H, Vrijenhoek I, Halász J, Hrotkó K, Hoff KJ, Gabriel L, Lempe J, et al. The draft chromosome-level genome assembly of tetraploid ground cherry (Prunus fruticosa Pall.) from long reads. Genomics. 2021;113(6):4173–83.

  49. Woehner TW, Emeriewen OF, Wittenberg AH, Nijbroek K, Wang RP, Blom E-J, Keilwagen J, Berner T, Hoff KJ, Gabriel L, et al. The structure of the tetraploid sour cherry ’Schattenmorelle’(Prunus cerasus L.) genome reveals insights into its segmental allopolyploid nature. bioRxiv, 2023–03;2023.

  50. Stiehler F, Steinborn M, Scholz S, Dey D, Weber AP, Denton AK. Helixer: cross-species gene annotation of large eukaryotic genomes using deep learning. Bioinformatics. 2020;36(22–23):5291–8.

  51. Martin R, Hackl T, Hattab G, Fischer MG, Heider D. Mosga: modular open-source genome annotator. Bioinformatics. 2020;36(22–23):5514–5.

  52. Bruna T. EukSpecies-BRAKER2. https://github.com/gatech-genemark/EukSpecies-BRAKER2. Accessed 10 April 2023. 2023.

  53. Bruna T. BRAKER2-exp. https://github.com/gatech-genemark/BRAKER2-exp. Accessed 10 April 2023. 2023.

  54. Bruna T. GeneMark-ETP-exp. https://github.com/gatech-genemark/GeneMark-ETP-exp. Accessed 10 April 2023. 2023.

  55. Stanke M, Bruhn W, Becker F, Hoff KJ. VARUS: sampling complementary RNA reads from the sequence read archive. BMC Bioinform. 2019;20:1–7.

  56. Schoch CL, Ciufo S, Domrachev M, Hotton CL, Kannan S, Khovanskaya R, Leipe D, Mcveigh R, O’Neill K, Robbertse B, et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database. 2020;2020:baaa062.

  57. König S, Romoth L, Stanke M. Comparative genome annotation. Comp Genom Methods Protoc 2018;189–212.

  58. Bruna T. Unsupervised algorithms for automated gene prediction in novel eukaryotic genomes. Ph.D thesis, Georgia Institute of Technology. 2022.

  59. Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci. 2020;117(17):9451–7.

  60. Chen N. Using Repeat Masker to identify repetitive elements in genomic sequences. Curr Protoc Bioinform. 2004;5(1):4–10.

  61. Benson G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 1999;27(2):573–80.

Acknowledgements

We thank Stefan Kemnitz from the University of Greifswald Computing Center for support in designing the software container. We thank Felix Becker for help with publishing python packages to PyPI.

Funding

Open Access funding enabled and organized by Projekt DEAL. The position of L.G. is funded by the US National Institute of Health Grant GM128145 to M.S. The Ph.D. project of N.N. is partially funded by German Research Foundation Grant 277249973 to K.J.H. The positions of N.N. and M.E. are partially funded by Project Data Competency granted to K.J.H. and M.S. by the government of Mecklenburg-Vorpommern. H.L. is supported by US National Institute of Health Grant R01HG010040. D.H. was funded by German Research Foundation Grant 391397397 to S.H. and M.S. Funding bodies did not play any role in the design of the study or collection, analysis, or interpretation of data or in writing the manuscript.

Author information

Contributions

T.B. developed miniprot boundary scorer and miniprothint; H.L. modified miniprot; N.N. evaluated intron accuracy on data sets that gave rise to the development of miniprothint; D.H. implemented Pygustus; M.E. ran FunAnnotate and participated in experimental design; J.G. contributed use case; S.H. and M.S. supervised Pygustus development; L.G. provided BRAKER2 ODB results; T.B., H.L., and K.J.H. conceptualized the pipeline; K.J.H. and T.B. implemented the pipeline; all authors wrote the manuscript.

Corresponding author

Correspondence to Katharina J. Hoff.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Supplementary Material.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Brůna, T., Li, H., Guhlin, J. et al. Galba: genome annotation with miniprot and AUGUSTUS. BMC Bioinformatics 24, 327 (2023). https://doi.org/10.1186/s12859-023-05449-z

  • DOI: https://doi.org/10.1186/s12859-023-05449-z

Keywords