Skip to main content

FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences

Abstract

Background

Gene annotation in eukaryotes is a non-trivial task that requires meticulous analysis of accumulated transcript data. Challenges include transcriptionally active regions of the genome that contain overlapping genes, genes that produce numerous transcripts, transposable elements and numerous diverse sequence repeats. Currently available gene annotation software applications depend on pre-constructed full-length gene sequence assemblies which are not guaranteed to be error-free. The origins of these sequences are often uncertain, making it difficult to identify and rectify errors in them. This hinders the creation of an accurate and holistic representation of the transcriptomic landscape across multiple tissue types and experimental conditions. Therefore, to gauge the extent of diversity in gene structures, a comprehensive analysis of genome-wide expression data is imperative.

Results

We present FINDER, a fully automated computational tool that optimizes the entire process of annotating genes and transcript structures. Unlike current state-of-the-art pipelines, FINDER automates the RNA-Seq pre-processing step by working directly with raw sequence reads and optimizes gene prediction from BRAKER2 by supplementing these reads with associated proteins. The FINDER pipeline (1) reports transcripts and recognizes genes that are expressed under specific conditions, (2) generates all possible alternatively spliced transcripts from expressed RNA-Seq data, (3) analyzes read coverage patterns to modify existing transcript models and create new ones, and (4) scores genes as high- or low-confidence based on the available evidence across multiple datasets. We demonstrate the ability of FINDER to automatically annotate a diverse pool of genomes from eight species.

Conclusions

FINDER takes a completely automated approach to annotate genes directly from raw expression data. It is capable of processing eukaryotic genomes of all sizes and requires no manual supervision—ideal for bench researchers with limited experience in handling computational tools.

Background

Recent advances in sequencing technology enable the construction of chromosomal-level assemblies for even non-model organisms. As of December 2020, genomes of 16,108 eukaryotes, 295,784 prokaryotes, 41,936 viruses, 26,079 plasmids and 17,820 organelles are sequenced and available through GenBank [1], a considerable increase over the 1,500 sequences reported two decades ago (see Additional file 1: Fig. S1). Therefore, to annotate the ever-rising number of genome sequences, annotation software applications need to be fast, accurate, and designed to handle large amounts of expression data to facilitate discovery of novel genes across different conditions [2,3,4,5]. Extensive analysis of this available data is the key to achieving exhaustive gene discovery by analyzing samples from multiple tissues and conditions, obviating the need for additional sequencing.

Genome annotation is the process of identifying transcriptionally active regions of the genome and defining gene structures. Decoding the correct structures of genes is essential since several downstream applications rely on accurate annotations: detecting interactions between proteins [6,7,8,9,10,11,12,13,14], identifying post-translational modifications [15,16,17,18,19,20,21,22,23], mining effectors [24,25,26,27,28], and determining protein structure [29,30,31,32]. Although we have seen a significant improvement in genome sequencing technology, annotation methods continue to underperform [33, 34]. Obtaining accurate gene annotations is challenging, especially in recently sequenced non-model organisms. The presence of sequences exchanged through horizontal gene transfer in such genomes and the existence of fragmented assemblies make it difficult to predict gene structures [35]. Multiple groups working on the same species have different and oftentimes conflicting annotations that are difficult to merge into a common consensus.

The early 2000s saw initial genome annotation attempts with the introduction of PASA [36], which was developed to map full-length transcripts and Expressed Sequence Tags (ESTs) in order to annotate genomes. In parallel, FGENESH [37, 38], GeneGenerator [39], mGene [40] and GeneSeqer [41] were introduced which predicted gene structures directly from genome sequence. Tools such as MAKER [42,43,44,45] and PASA [36] closely depend on pre-assembled full-length transcripts to generate annotations. ESTs and/or de novo assembled transcriptomes have been often provided as inputs to these tools to generate annotations [46,47,48,49,50,51,52]. Transcripts constructed via de novo [53,54,55,56,57] or genome-guided [58,59,60,61,62,63] approaches are sensitive to the nature of the assembler and its parameter settings. Such assemblers report sequences that are highly similar to one another, making the process of sifting the correct assemblies from artefacts difficult. This issue is moderately mitigated by BRAKER2 [64, 65], which uses read splice information instead of full-length assemblies to predict gene structures and has been shown to perform better than de novo approaches [66]. BRAKER2 entails a round of unsupervised gene predictions using GeneMark-ET [67] generating ab-initio gene predictions followed by a second round of training by AUGUSTUS [68] using a subset of the gene models created by GeneMark-ET [64]. All variations of MAKER (MAKER, MAKER2 and MAKER-P) use a combination of AUGUSTUS [68] and SNAP [69] to generate gene predictions. Unlike BRAKER2 or PASA, users need to run MAKER for multiple rounds to improve annotation. With no standard technique to optimize the number of rounds, users often undertake a trial-and-error approach to decide what data is supplied to MAKER in each execution round. These unguided choices can create different annotations based on the same data sets. Thus, current approaches report either incomplete genes and/or derive annotations that are missing alternatively spliced transcripts. In addition to MAKER, BRAKER, PASA there is another gene annotator -GeMoMa [70] that use protein coding genes from a reference genome and transfers those to the target genome. Since it does not optimally use information from expression data, it has not been compared with the other gene annotators discussed in this manuscript.

To overcome the drawbacks described above, we developed FINDER, a new, automated annotation pipeline that downloads RNA-Seq data from NCBI SRA [71], conducts genome-guided assembly of short reads, predicts gene structure, and annotates genes. FINDER annotates both untranslated and coding regions of genes, categorizes transcripts based on the tissue/conditions where they are expressed, and outputs a complete set of alternatively spliced transcripts. FINDER analyzes the spatial expression profile of each transcript to redefine its boundaries and/or even create newer transcripts and employs an optimized strategy to locate transcripts housing micro-exons. Finally, gene models predicted by BRAKER2 are incorporated into the annotation along with assemblies generated by PsiCLASS [63]. We show that FINDER outperforms state-of-the-art annotation tools in constructing accurate gene structures, when executed with the same expression data.

Implementation

The detailed workflow of FINDER is outlined in Fig. 1. The pipeline accepts metadata via a comma-separated values (csv) file (see Additional file 2: Table S1). Users can verify the input data using the `verifyInputsToFINDER` utility (Please check Sect. 1.5.1 of Additional file 9). Both single-end and paired-end data are accepted. The pipeline automatically downloads RNA-Seq data from NCBI SRA or the samples can be accessed locally. Multiple rounds of alignment are conducted using STAR [72, 73] with short reads, thus ensuring the capture of tissue-specific splice junctions and ultimately generates the most comprehensive set of alternatively spliced transcripts. FINDER uses PsiCLASS [63] to generate transcripts both at the tissue level and consolidates them to produce a consensus annotation. It employs change-point detection (CPD) using coverage data to polish intron/exon boundaries if needed. Polished transcripts are then supplied to GeneMarkS-T [74] to predict protein coding regions. In addition to constructing genes from expression data, FINDER uses BRAKER2 [65] to predict genes de novo. Finally, gene models are assigned scores that reflect the confidence of prediction and evidence across different data sets. Throughout the pipeline run, intermediate temporary data is removed to optimize space usage. Proper logging of executions is implemented through ruffus [75].

Fig. 1
figure1

FINDER workflow. FINDER assembles short reads from RNA-Seq expression data, collected from multiple tissues and conditions, to generate full-length transcripts using PsiCLASS. Short read coverage profile is used to polish the structure of the transcripts to enhance the quality of annotation. GeneMarkS-T is used to predict coding regions of the transcripts. Gene models predicted by BRAKER2 and models obtained by mapping proteins are added to the gene models constructed from RNA-Seq data. Additionally, FINDER outputs the tissues where each transcript is expressed allowing users to work with tissue-specific transcripts. FINDER categorizes transcripts into two confidence levels depending on the available supporting evidence and depth of coverage. (Generated using Microsoft PowerPoint v16.47)

Read alignments to the genome

Reads from each sample are aligned to the genome using STAR [73]. FINDER accepts the location of the genomic STAR indices. If indices are not provided, then FINDER will generate them locally. FINDER implements multiple strategies to detect as many correct splice-junctions as possible. Several studies use a multi-step approach where splice junctions are detected in the first pass and then those junctions are used to guide the alignments in future passes [76, 77]. FINDER employs a similar strategy to align reads and obtain the most confident splice junctions in each tissue type and/or condition by conducting mapping in four passes (Please check Sect. 1.3 of Additional file 9 for more details).

Annotating transcripts with micro-exons

Certain genes in eukaryotes have micro-exons (i.e., exons with fewer than 50 nucleotides) [78,79,80,81] which impart important biological properties both in plants [82,83,84,85,86] and animals [87,88,89,90,91]. FINDER uses OLego [92] to map the reads which were reported unmapped by STAR, because OLego optimizes micro-exon sensitivity by checking intron signatures when no hits of seed sequences (~ 14 nt) are found. It is configured to align reads to exons of minimum length 2, with a minimum and maximum intron size of 20 and 10 K respectively.

Generating exon-exon transcript structure annotation with PsiCLASS

Alignments reported by STAR and OLego are combined and provided as input to PsiCLASS [63]. Unlike traditional assemblers, PsiCLASS accepts alignments from multiple samples at the same time. It generates annotations for each sample and one consolidated gene annotation for all the samples. FINDER runs PsiCLASS with the—bamGroup option enabled which instructs PsiCLASS to preserve tissue/condition specific features. It is a fast meta-assembler generating 350 samples of output in less than three hours while running on 30 cores and consumes less than 50 GB of memory.

Polishing gene structures to optimize gene discovery

Gene structure annotations reported by PsiCLASS were polished to generate the best assemblies. Annotations generated by assemblers often have three kinds of errors that impact accuracy: (1) presence of redundant transcripts that are proper subsets of other transcripts, (2) multiple transcripts on the same strand merged into one, and (3) transcripts with ill-defined exon boundaries. Most assemblers ignore such cases to boost the speed of operation. Developing solutions to deal with these kinds of errors increases the number of correct structural annotations thereby improving downstream analysis.

FINDER uses different algorithmic and statistical approaches to deal with the above cases. To eliminate redundant transcripts, exon–intron structure of all transcripts is compared with each other to retain only unique transcripts. Even though eukaryotes possess large genomes, certain genes/transcripts are closely packed and are overlapping (Fig. 2). Reads originating from one of those genes often map to nearby overlapping genes making the task of distinctly recognizing the transcripts very challenging.

Fig. 2
figure2

FINDER implements changepoint analysis of read coverages to modify existing gene models and/or generate new ones. Changepoint analysis is a statistical technique to assess alterations in trends over time. The same approach has been used to analyze read coverage patterns of a genome, where the data is distributed spatially. a Two Arabidopsis thaliana genes AT1G42960.1 and AT1G42970.1 are present within 50 base pairs of each other on the positive strand. Reads originating from the end exons of either genes bleed into each other resulting in PsiCLASS to merge the two gene models. Changepoint analysis recognizes the fall the read coverage and reports a position within the exon where the trough exists. This information is used to split up the gene models. b A similar issue exists with closely spaced genes residing on opposite strands. The end exons (highlighted with a red box) for a transcript extend up to the nearest intron of the adjacent transcript. Changepoint analysis is used to determine the actual end/start of transcript based on the read coverage. (Screenshot obtained from Integrative Genomics Viewer and figure generated using Microsoft PowerPoint v16.47)

FINDER is configured to use changepoint detection (CPD) analysis to detect the descent in read coverage at the junction of two overlapping transcripts. Statistical CPD is a procedure to detect changes in the probability distribution of a stochastic process. Typically, CPD is widely used to detect changes in time series [93,94,95,96,97], but can be extended to other applications as well [98, 99]. We have found that even though CPD was developed under the assumption of normality, it can also be used where normality is violated.

In the first step in FINDER’s CPD, short read alignments to the genome are converted into number of read counts per nucleotide using bedtools [100]. A custom python script is used to transfer the per nucleotide coverage data from the genome to the transcriptome reported by PsiCLASS. Each internal exon is considered as a potential site for the presence of changepoints if there exist premature stop codons in all the three frame translations. CPD only considers exons that have a high chance of housing a changepoint, thereby reducing duration of operation. The coverage pattern of each exon is probed to detect changepoints. The data has been modeled using an exponential distribution, and binary segmentation has been used to determines the changepoints in the exonic coverage using the ‘changepoints’ package [101]. Read coverage of exons mimics a time series where each nucleotide position of an exon can be assumed to be a single unit of time. Coverage patterns of exons, suspected to be merged, contain a characteristic depression in the signal to split the gene models (Fig. 2a). Overlapping transcripts on opposite strands sometimes share a common exon (Fig. 2b). This negatively impacts precision since the boundaries of the predicted transcript exceed the boundaries of the transcript in the reference annotation. FINDER trims the transcript boundaries, using the changepoints, to better model the RNA-Seq coverage (Fig. 2b). These strategies improve the annotation by increasing the transcript F1 scores (Table 1).

Table 1 Sensitivity, Specificity and F1 scores of transcripts generated by multiple gene annotation pipelines for three model organisms—Arabidopsis thaliana, Oryza sativa and Zea mays

De novo gene prediction from expression data and proteins from closely related species

Certain genes are expressed only under specific tissues and conditions [102]. However, constructing an exhaustive set of genes expressed across all possible tissues and conditions is a daunting task due to the mammoth volume of potential expression data. Hence, approaches that can predict structures of unknown genes using information obtained from known genes are needed. Within the FINDER framework, we used BRAKER2 [103] to predict the structure of protein coding genes. The pipeline is provided with alignment files generated by STAR and an optional, user-provided protein data file. If the previous execution fails, a second execution of BRAKER2 is launched without protein information. Genes predicted by BRAKER2 are compared to the genes obtained from expression data. To prevent too many false positives, predictions made by BRAKER2 are considered high confidence, only if those are supported by expression level or protein level evidence.

In addition to RNA-Seq data, FINDER also uses protein data (when provided), in two ways (1) to assess the veracity of the transcript models generated by BRAKER2, and (2) to align those proteins not recognized by BRAKER2 or PsiCLASS. Protein coding genes obtained from expression data and predicted by BRAKER2 are BLASTed [104] to the protein set provided by the user. Proteins not encountering any hits are aligned to the genome using exonerate [105] with a minimum threshold of 90% similarity. These alignments are augmented to the final set of gene predictions. Since these transcripts are obtained solely from proteins, they lack UTR sequences.

Prediction of coding regions

We leveraged GeneMarkS-T [74] to predict protein-coding regions of genes constructed from expression data. GTF files are first converted to FASTA files using the provided genome. Those FASTA files are supplied to GeneMarkS-T as inputs. GeneMarkS-T outputs coding sequence for the transcripts. CDS annotations are incorporated into the final GTF file by converting the transcriptomic coordinates to genomic coordinates.

Tissue/condition specific transcripts/gene models

Most eukaryotic genes have multiple isoforms which are derived from alternative transcripts. Expression of different transcripts can occur under different conditions in different tissues at different time points. FINDER compares assembled transcripts from each condition and prints out an association between each transcript and the provided tissue/condition (Additional file 9: Sect. 1.5).

Scoring gene models

FINDER groups genes into multiple categories based on supporting evidence. Genes that are expressed in RNA-Seq datasets, predicted by BRAKER2, and have protein evidence, are put into the high-confidence gene set. BRAKER2-predicted genes with no evidence of expression and/or proteins are treated as low confidence genes. FINDER expects a soft masked genome since it is a BRAKER2 requirement. Genes which are located in the repeat regions are marked as such and moved to the set of low-confidence genes.

Results and discussion

Choice of species for comparison

We tested the performance of FINDER primarily on three well-annotated plant organisms—Arabidopsis thaliana [106], Oryza sativa [107,108,109] and Zea mays [110, 111]. The genomes assemblies of these model organisms have been frequently updated and are almost complete with telomere-to-telomere sequences with fewer gaps and unknown nucleotides. In addition, their gene annotations have undergone regular improvement by mining the large number of RNA-Seq datasets available in the literature. Also, The Arabidopsis Information Resource (TAIR) provides a five-star rating system based on available evidence for each gene. Such a system offers a platform to test the quality of gene annotation software. For further evaluation, and to ensure that FINDER is able to annotate a wider range of genome types, we selected the following additional species to test: Caenorhabditis elegans [112], Drosophila melanogaster [113, 114], Homo sapiens [115, 116], Hordeum vulgare [117], and Triticum aestivum [117,118,119,120]). The genomes of these species range from small (C. elegans, D. melanogaster, A. thaliana), medium (O. sativa), to large (H. sapiens, Z. mays, H. vulgare, and T. aestivum). Finally, we evaluated FINDER on three different versions of Z. mays annotations—RefSeq [121], AGPv3 [111, 122] and AGPv4 [110, 123].

Metrics to assess quality of annotation

We used four metrics to compare the quality of annotations generated by each pipeline: (1) Annotation Edit Distance (AED) [42, 43, 124], (2) sensitivity, (3) specificity, and (4) F1 score. Although these metrics could be computed both at the nucleotide- and exon-level we chose to make comparisons at the transcript level since it encompasses bases, exons, and introns. An AED score of 0 indicates complete agreement of the predicted annotation with the reference, and a score of 1 denotes that the reference has not been identified in the annotation. A transcript is considered to be “recognized” only when all its intron definitions agree with at least one transcript from the predicted set. We used the Mikado “compare” utility to compare the predictions with the reference annotations [125]. A highly sensitive annotation is one that can correctly recognize more reference transcripts. A set of annotations has high specificity when it reports minimal incorrect transcripts. For an annotation to be of good quality, both sensitivity and specificity should be high. A balanced metric is the F1 score which is the harmonic mean of sensitivity and specificity. While AED provides a good numeric assessment of how well the ground truth evidence is represented in an annotation, when individually used, it fails to capture the extent to which false positives are reported. Hence, F1 score complements AED since it incorporates both specificity and sensitivity. For evaluation purposes, we assume that the annotations achieved through community efforts are the ground truth and contain no errors.

FINDER generates more accurate gene models than BRAKER2, MAKER2 and PASA

FINDER leverages expression data to construct transcript models and employs statistical changepoint detection to enhance their structures (see “Implementation” section). Both MAKER2 and PASA were run with transcript sequences reported by PsiCLASS.

To assess FINDER’s performance, we compared the AED scores of transcript models generated by FINDER with those generated by other commonly used annotation methods. As shown in Fig. 3a, d, g, the violin plots for FINDER are broader at the base, indicating a greater number of transcripts with lower AED scores as compared to BRAKER2, MAKER, and PASA. We compared the FINDER AED scores with the AED scores reported by other pipelines using Wilcoxon’s signed rank test (More details in Additional file 9: Sect. 2.5). For all organisms (Fig. 3, Additional file 1: Figs. S2–S5 and Additional file 3: Table S2), the AED scores reported by FINDER were significantly lower (p_value < 0.01) than that of any other pipeline. Figure 3c, f, i, shows a stacked bar plot to represent the fraction of transcripts in each category of AED values. In all the cases, a higher percentage of transcripts reported by FINDER have lower AED scores (Additional file 1: Figs. S2–S5). This indicates that FINDER is capable of constructing gene structures that better comply with the reference annotations.

Fig. 3
figure3

Comparison of performance of predicted annotations in three model species—ac A. thaliana, df O. sativa and gi Z. mays. Annotation Edit Distance (AED) is an assessment of how well predicted annotations agree with the evidence and was used as a quality control metric. A value of 0 denotes complete agreement of two annotations while a value of 1 denotes that the ‘gold standard’ reference annotation was not detected. Transcripts from ‘gold standard’ reference annotations that are not detected in any of the predicted annotations are removed from analysis. a, d, g Distribution of AED scores. Violin plots wider at the base indicate high density of annotations with lower AED. FINDER was able to create gene models having lowest AED resulting in a wide base. Gene models generated by FINDER were enhanced by adding predictions made by BRAKER and including protein evidence. Wilcoxon’s signed rank test was used to compare the AED scores between FINDER and other annotating pipelines. The “***” symbol implies that the AED scores of FINDER gene models were significantly lesser (p_value < 0.01) than the AED scores of the gene models reported by other pipelines. b, e, h Bar plot of F1 score of multiple approaches of annotation. Having a high nucleotide F1 (Base F1) or a high exon F1 score is not sufficient to conclude a good annotation. High value of transcript F1 score is indicative of good gene models with high sensitivity and high specificity. c, f, i Stacked bar plot showing percentage of transcripts in each of the four groups of AEDs. Higher number of transcripts to low AED denotes better annotation. In each of the three species, FINDER was able to generate a higher percentage of transcripts with low AED compared to other techniques of annotation. (Generated using ggplot2 v3.3.3)

High-quality exhaustive annotations predict the fewest false positives thereby boosting the transcript F1 score. The transcript F1 scores of the gene models that were reported by FINDER for A. thaliana, O. sativa and Z. mays were higher than the models generated by BRAKER2, MAKER, and PASA (Fig. 3b, e, h). This same trend is observed for other tested organisms where FINDER was successful in detecting nucleotides, exons, introns, transcripts and genes (Table 1, Additional file 1: Figs. S2–S5 and Additional file 3: Table S2). MAKER2 and BRAKER2 registered a high specificity for most of the organisms because fewer transcripts were reported than FINDER. MAKER2 and BRAKER2 also had lower F1 scores, indicating less sensitivity than FINDER. Additionally, we compared the CDS regions of genes reported by FINDER with those of BRAKER2. For most of the organisms, FINDER generated transcript models with a higher F1 score (Additional file 4: Table S3). These results show that the better performance of FINDER is ensured not only due to the presence of UTRs but also due to enhanced CDS structure of gene models.

Finally, including BRAKER2 predictions and protein sequences to FINDER enhanced the gene model predictions. About 15% of the gene models reported by BRAKER2, those having high sequence similarity with the provided protein sequences were included in the final annotations (Table 2). As shown in Table 1 and Additional file 5: Table S4, including evidence at the protein level led to the identification of more genes.

Table 2 Improvement in overall gene recognition by adding gene models predicted by BRAKER2 and aligning protein sequences

Unlike BRAKER2, FINDER does not assume a homogeneous nucleotide composition of the genome [103]. FINDER outperforms BRAKER2 while constructing gene models in complex organisms like H. sapiens, H. vulgare, and Z. mays since assemblers generating transcriptomes from alignments do not require a genome to possess homogeneous nucleotide composition.

FINDER in itself is restricted to annotate genes only in regions of the genome that are transcriptionally active. Recognizing that BRAKER2, being a gene predictor, can construct gene models in transcriptionally silent regions of the genome, FINDER is designed to incorporate the gene models predicted by BRAKER2 into the final annotations.

Distinct gene groups are accurately annotated with FINDER

Although eukaryotic genes differ from one another in terms of location, structure and the isoforms they encode, most annotation pipelines annotate and evaluate gene predictions with a global and uniform approach. The problem arises when these variances prompt each pipeline to perform differently on dissimilar groups of genes. To avoid this pitfall, we created groups of genes and transcripts based on various criteria (Table 3) and compared the performance of FINDER with BRAKER2, MAKER, and PASA for each of these sets.

Table 3 Classification of gene models into different groups based on their relative location to other genes, number of isoforms and other criteria

On the set of UTR-containing transcripts, FINDER reported the best transcript F1 scores (Fig. 4, Additional file 1: Figs. S6, S7). Unlike BRAKER2, FINDER uses GeneMark S/T to predict CDS from the transcript sequences assembled by PsiCLASS and can hence annotate UTR regions. For most of the organisms, BRAKER2 and MAKER2 gene models register a low transcript F1 score in this category of genes. Next, we tested the performance of the annotation pipelines on transcripts that are closely located in the genome. On this set of transcripts, FINDER reported the best F1 transcript score for A. thaliana, O. sativa, and Z. mays (Fig. 4), and comparable scores for D. melanogaster (Additional file 1: Fig. S6), H. vulgare (Additional file 1: Fig. S8), and C. elegans (Additional file 1: Fig. S7) with BRAKER2. Most eukaryotic genes have multiple isoforms which differ from one another by their exon–intron definition. Splice sites and coverage information provides clues to construct such alternatively spliced transcripts. We selected genes with more than one transcript to check how well each annotation pipeline was able to detect transcript isoforms. For this case, FINDER was able to generate the best transcript structures with the highest transcript F1 score among all the pipelines gene annotation software applications (Fig. 4 and Additional file 1: Figs. S6–S9). Surprisingly, BRAKER2 fared poorly in this category despite training with all the detected splice sites from RNA-Seq data. This demonstrates that FINDER is capable of leveraging both intron splice sites and read coverages to report best transcript structures. For H. sapiens, PASA was able to generate the best transcript structures across all categories of transcripts. Adding transcripts from BRAKER2 and protein evidence improved the transcript F1 score for all the organisms, signifying the importance of incorporating de novo gene models and protein evidence.

Fig. 4
figure4

FINDER versus other pipelines on different groups of genes in three model species—a A. thaliana, b O. sativa, c Z. mays. F1 score is the harmonic mean between sensitivity and specificity. Higher F1 score indicates better agreement with the reference transcript models. We created groups of transcripts that have similar characteristics as shown in the y-axis legend. A pool of transcripts was created containing multi-exonic transcript predictions, from each pipeline, that has a complete intron chain match with at least one reference annotation. Mono exonic transcripts were considered if at least 80% of the nucleotides overlap with one reference annotation. Transcript F1 scores, for each of the annotation pipelines, have been plotted as a bar graph. Even though all annotation pipelines are designed to serve the same purpose of annotating genomes, each pipeline adopts a different strategy. Each strategy has its own merits and demerits that lead to better annotation of a certain category of genes. This plot helps understand the performance of each annotation pipeline on different categories. The symbol “#” denotes the best annotator in each gene group. (Generated using ggplot2 v3.3.3)

BRAKER2 generated the best transcript annotation for the set of transcripts with a single exon (Fig. 4a, b and Additional file 1: Figs. S6–S9). Such transcripts, devoid of any introns, are difficult to construct from RNA-Seq alone. Also, the direction of the splice sites infers the direction of a transcript. Without any introns, such a single-exon transcript has to be probed for a CDS sequences' presence to infer directionality. BRAKER2 was configured to optimally predict only CDS regions of genes, hence, it performs well with the set of transcripts that have missing UTRs for organisms with small and moderate sized genomes (Fig. 4a, b and Additional file 1: Fig. S6–S9). The average number of transcripts per gene reported by BRAKER2 is lower than FINDER. While this boosts specificity, it compromises recall since BRAKER2 is not sensitive to detecting alternatively spliced transcripts. Hence, BRAKER2 accomplishes the best F1 score when tested on a set of single-transcript genes but performs poorly on a set of multi-transcript genes (Fig. 4a, b and Additional file 1: Figs. S6–S9).

Performance comparison on TAIR’s 5-star System

In order to assess the performance of the annotation pipelines on groups of genes constructed from varying levels of evidence, we used the TAIR10 5-star system. TAIR associates a quality score to each A. thaliana transcript based on the evidence used to construct the models, with five stars designating the best evidence and zero stars the least [126]. The three categories with limited evidence (< 3 stars) have fewer than 3,000 transcripts each. BRAKER2′s performance, on the genes in these three categories, was slightly better than the rest of the annotation pipelines (Fig. 5). The other two categories (five star and four star) have 9,067 and 18,374 transcripts respectively. In both of these categories, FINDER was able to detect more transcripts than any other annotation pipeline. 51.5% and 86.4% of genes in the 5-star and 4-star category respectively were multi-exonic. In both these categories, FINDER correctly constructed more gene models compared to any other annotation pipeline (Fig. 5). FINDER reported 80% of the gene models belonging to the 4-star category—18% more than BRAKER2 (Fig. 5). Hence, it is evident from this analysis that FINDER can reconstruct the structures of most of the genes that are well-supported by underlying evidence.

Fig. 5
figure5

Performance of annotation pipelines on gene groups of Arabidopsis thaliana generated by TAIR10. The Arabidopsis Information Resource (TAIR) group has created a quality ranking system to indicate the level of confidence in an annotated gene/transcript. The ranking system has five levels (denoted by stars). Higher number of stars denote the availability of more information to generate the gene structure. Here we display the percentage of transcripts in each category that was identified by a particular annotation pipeline. A high percentage of identified transcripts indicate higher sensitivity and hence a better annotation. The number below each legend in the x-axis denote the number of genes in that respective group. The “#” denotes the predictor which detected the maximum number of transcripts within each group. (Generated using ggplot2 v3.3.3)

Improving transcript annotations using changepoint analysis

The co-location of multiple overlapping genes on the genome strands makes it difficult to correctly annotate their structures (see “Polishing gene structures to optimize gene discovery” section). FINDER employs changepoint detection (CPD) [101] to split the merged transcripts reported by PsiCLASS (Fig. 2). To gauge the magnitude of improvement in transcript structures brought about by the application of CPD, we compared the accuracy of the predicted transcriptome before and after implementing CPD based on read coverage. As shown in Table 4 and Additional file 6: Table S5, implementing the CPD improved both specificity and sensitivity in organisms with small or medium-sized genomes. In A. thaliana, the transcript F1 scores increased from 40.78 to 45.95 (Table 4 and Additional file 6: Table S5) and in C. elegans it increased from 40 to 50. In large genomes, the improvement was not as significant, mainly because there are only a few genes that overlap with one another.

Table 4 Comparison of specificity, sensitivity and F1 scores of transcripts assemblies generated by Strawberry, Scallop, Stringtie, PsiCLASS and FINDER for three model organisms—A. thaliana, O. sativa and Z. mays

PsiCLASS meta-assembly works better than other approaches

We explored three popularly used software applications for merging transcriptome assemblies—StringTie-merge [77, 127,128,129,130,131,132,133], TACO [134,135,136,137,138,139] and Cuffmerge [140,141,142,143,144,145] to combine 116 A. thaliana assemblies constructed by StringTie [59], Scallop [61] and Strawberry [60] (Please check Sect. 3 of Additional file 9 for more details). The best assembly was reported by StringTie-merge and was hence used for all other organisms. We compared the accuracy of the consensus transcript models generated by StringTie-merge with the transcript models reported by PsiCLASS [63]. As depicted in Table 4 and Additional file 6: Table S5, PsiCLASS generated the best transcript models for all organisms registering the highest transcript F1 score improving upon the StringTie models by up to 15%. Hence, FINDER uses only PsiCLASS to generate assemblies from short-read data.

Impact of missing untranslated region on annotation of transcripts

Gene transcription is triggered by adherence of a transcription factor in the promoter region of a gene. Promoters are typically located within 1,000 bp upstream of a gene’s transcription start site (TSS) [146,147,148]. Determining the TSS from sequencing data is best facilitated by RAMPAGE [149, 150] or CAGE-Seq [151], but this data is usually unavailable due to constraints imposed by cost and time. Nevertheless, a good estimate can be obtained from RNA-Seq data by assuming the start coordinates of the assembled genes as the TSS. Thus, researchers often localize their investigation to a section 500–1000 bp upstream of the assumed TSS [152, 153]. Without 5′ UTR annotation it is impossible to deduce a good approximation of the TSS. This leads to conducting promoter mining in a completely incorrect genome location. To assess the quality of 5′ UTR annotation, we plotted the difference of TSS between the reference genes and the genes reported by BRAKER2 and FINDER using a violin plot (Fig. 6). Further, we applied Wilcoxon’s rank-sum test and found that the TSS distances reported by FINDER were significantly less than that of BRAKER2 for A. thaliana and Z. mays. Interestingly, for O. sativa, BRAKER2 generated better gene models for more transcripts. Over 25% of reference gene models in O. sativa have no UTRs annotated which is higher compared to 15% UTR-less gene models in A. thaliana and Z. mays. This result illustrates that more FINDER transcripts have a TSS closer to the evidence as compared to the TSS of the transcripts reported by BRAKER2. This is an expected result since BRAKER2 was configured to annotate only CDS regions of transcripts. Table 5 highlights the number of transcripts that have better agreement with the reference TSS for FINDER and BRAKER2.

Fig. 6
figure6

Comparison of distance between transcription start sites of gene models predicted by BRAKER2 and FINDER. Violin plots of the distribution of the distance between the actual transcription start site (TSS) and the predicted transcription start site. In a set of well annotation complete gene structures, a higher fraction of genes is expected to have low deviation from the actual TSS. We considered genes that were reported in either BRAKER or FINDER for this analysis. Wilcoxon’s rank sum test was used to compare the TSS distances between FINDER and BRAKER2. The “***” symbol implies that TSS distance for FINDER gene models was significantly less than BRAKER2 gene models. (Generated using ggplot2 v3.3.3)

Table 5 Use of RNA-Seq evidence to improve annotation of untranslated regions to aid in promoter mining and epigenetic studies

Enhancing ground truth annotations by extending untranslated regions

Official annotations of several model organisms, used as ground truth for this study, contain transcripts with missing UTR sequences. Even though UTRs do not code for proteins, they are relevant segments of a transcript involved in several important biological processes like mRNA translation [154,155,156], regulation of expression [157,158,159,160,161]] and a number of diseases [162,163,164,165,166]. In the A. thaliana TAIR10 annotations, there are 7,888 transcripts missing either UTR; 50% of these had a rating below 2 stars.

PacBio (Menlo Park, CA) offers long-read sequencing that contain both CDS and UTRs. Therefore, we used the PacBio annotations instead of the incomplete TAIR10 transcripts to assess FINDER’s performance on transcripts that were missing UTRs (Please refer to Sect. 2.6 in Additional file 9 for more details). Out of the 7,888 TAIR10 transcripts with missing UTRs, 113 transcripts were found both in the PacBio data and the 116 short-read RNA-Seq samples. We compared the FINDER annotations against these 113 transcripts. FINDER annotations were able to recall 91.55% of the nucleotides in 113 transcripts of TAIR10 and 97.86% of PacBio transcripts. The specificity of the FINDER annotations is markedly higher with PacBio transcripts (79.67%) compared to TAIR10 transcripts (72.14%). This demonstrates that FINDER enhances and improves upon the existing annotation.

The TRITEX H. vulgare annotation (Morex version r2) [117], released by the International Barley Sequencing Consortium (IBSC), is devoid of UTRs. We used FINDER to update and enrich the existing annotations by flanking the CDS region with UTRs on both sides. To verify the accuracy of the gene models reported by FINDER, we used PacBio full-length mRNA sequences derived from a time course of powdery mildew infected barley leaf tissue [167, 168]. A total of 7,352 gene models from IBSC, FINDER, and PacBio had a complete intron-chain match with each other. The gene structures for more than 93% (6,886 out of 7,352) of the FINDER models were improved when compared to PacBio full-length sequences (Additional file 7: Table S6). The highest F1 score achieved was 87.16. This shows that FINDER is capable of constructing accurate gene structures constituting both CDS and UTRs.

Evaluating performance with different annotations of Zea mays

Z. mays is an important model organism for crops and has been one of the most studied plants for genetics by researchers in several different fields [169,170,171,172]. Genes have been annotated in multiple ways using different kinds of data, resulting in substantial differences in gene structures [122]. Here we compare three alternative annotation sets of Z. mays—RefSeq, AGPv3, and AGPv4 and the performance of FINDER surpassed all three approaches. The transcript F1 score for FINDER gene models compared against the NCBI gene models were 43.48, whereas the F1 scores for AGPv3 and AGPv4 were 26.69 and 22.51 respectively. We observed the same trend for other annotation pipelines and reported a higher transcript F1 score for NCBI than the AGP annotations (Table 1 and Additional file 3: Table S2). Hence, FINDER generated high-quality gene structures with high transcript F1 scores for different Z. mays annotations.

Evaluating FINDER on different clades reported by Phylostratr

Genes in each organism can be categorized by their evolutionary history [173, 174]. We used Phylostratr [175] to classify genes into evolutionary strata. Here we present our results on the three model organisms—A. thaliana, O. sativa, and Z. mays. For all three, FINDER was able to accurately detect more genes in highly populated strata (Fig. 7). The performance of FINDER and PASA was comparable in strata with few genes. It was surprising to note that BRAKER2 was unable to identify highly conserved genes (those from the “cellular organisms” strata) since those would be easier to predict than organism specific genes. This demonstrates that FINDER is capable of effectively constructing genes from different evolutionary backgrounds.

Fig. 7
figure7

Assessment of annotation pipelines on genes from each phylostrata—Genes from three model species—a Arabidopsis thaliana, b Oryza sativa and c Zea mays, were allocated into evolutionary classes using Phylostratr. The number of genes correctly constructed by each pipeline was computed and plotted as a bar graph. Numbers below each stratum indicate the number of genes allocated to that strata. Strata having genes fewer than 500 are not shown in the graph. (Generated using ggplot2 v3.3.3)

FINDER constructs gene models for polyploid genomes

Being a general-purpose genome annotator, in addition to diploid organisms, FINDER can annotate the genomes of polyploid organisms. We generated gene structures of Triticum aestivum, a hexaploid with 120,744 annotated genes and 146,597 transcripts [117]. FINDER was able to detect 48,129 transcripts (39.9%). Out of the 130,582 transcripts predicted by FINDER, 48,104 (36.83%) matched perfectly with at least one reference annotation.

Conclusion

Identifying genes on chromosomes and deducing their structures from a plethora of evidence has been undertaken in multiple ways, with each method having advantages and disadvantages. Herein, we propose FINDER—an entirely automated, general-purpose pipeline to annotate genes in eukaryotic genomes. FINDER (1) implements an optimized mapping strategy that reduces the number of spurious mappings, (2) produces complete full-length transcripts comprising UTRs while identifying transcripts with micro-exons, (3) employs statistical CPD to modify gene boundaries and construct new genes, (4) reports more alternatively spliced transcripts as compared to other state-of-the-art annotation pipelines, and (5) assigns confidence classes to each transcript based on the evidence(s) that were used to construct those.

While FINDER’s performance has been superior to other gene annotation softwares, all the gene models reported by FINDER are predicted. Hence, a validation is necessary to ensure false positives are detected and removed. Also, future versions of FINDER will offer functionalities to leverage data from CAGE-Seq and Ribo-Seq to better annotate transcription start site and translation start sites respectively.

With a wide variety of available data for annotation, researchers often struggle to manage and optimize their usage. Several gene annotation software also offer users complicated configurations without providing substantial guidance. FINDER makes the job of gene annotation easy for bench scientists by automating the entire process from RNA-Seq data processing to gene prediction. Since FINDER does not assume the ploidy or the nucleotide composition of a genome, it could be applied to derive gene structures for a wide range of species, including non-model organisms. FINDER constructs gene models primarily from RNA-Seq data and is therefore capable of constructing tissue- and/or condition- specific isoforms which would have been impossible to obtain from ESTs only. FINDER supersedes the performance of existing software applications by utilizing read coverage information to fine-tune gene model boundaries. Instead of removing low-quality transcripts, FINDER flags them as low confidence—giving users the choice of using them as they seem fit. As a proof of concept, we provided evidence that using read coverage signal indeed enhances gene structures in a diverse set of organisms. Thus, we are confident that FINDER will pave the way for improved gene structure annotation in the future.

Availability and requirements

Project name: FINDER.

Project home page: https://github.com/sagnikbanerjee15/Finder.

Operating system(s): Linux, MacOS.

Programming language: Python, C, C++, Perl, Shell.

License: MIT.

Other software requirements: All software requirements are listed in https://github.com/sagnikbanerjee15/Finder/blob/master/environment.yml.

Any restrictions to use by non-academics: MIT licensing restrictions apply.

Availability of data and materials

FINDER can be accessed from https://github.com/sagnikbanerjee15/Finder. RNA-Seq samples used for annotation is included in Additional file 8: Table S7. Barley PacBio sequences have been deposited in NCBI (Project Id: GSE165730).

Abbreviations

ESTs:

Expressed sequence tags

NGS:

Next generation sequencing

NCBI:

National Center for Biotechnology Information

SRA:

Sequence read archive

UTR:

Untranslated regions

CSV:

Comma separated values

AED:

Annotation edit distance

CPD:

Changepoint detection

TSS:

Transcription start site

CDS:

Coding sequence

CPU:

Central processing unit

cDNA:

Complementary DNA

References

  1. 1.

    Genome List-Genome-NCBI. https://www.ncbi.nlm.nih.gov/genome/browse/#!/overview/. Accessed 12 Jan 2021.

  2. 2.

    Morganti S, Tarantino P, Ferraro E, D’Amico P, Viale G, Trapani D, et al. Complexity of genome sequencing and reporting: next generation sequencing (NGS) technologies and implementation of precision medicine in real life. Crit Rev Oncol Hematol. 2019;133:171–82.

    PubMed  Article  PubMed Central  Google Scholar 

  3. 3.

    Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER. The next-generation sequencing revolution and its impact on genomics. Cell. 2013;155:27–38.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  4. 4.

    Phillips KA, Douglas MP. The global market for next-generation sequencing tests continues its torrid pace. J Precis Med. 2018;2018:4.

    Google Scholar 

  5. 5.

    Kulski JK. Next-generation sequencing—an overview of the history, tools, and “Omic” applications. Next Generation Sequencing–Advances, Applications and Challenges. 2016;3–60.

  6. 6.

    Banerjee S, Mitra B, Chatterjee A, Santra A, Chatterjee B. Identification of relevant physico chemical properties of amino acids with respect to protein glycosylation prediction. In: Computing and Communication (IEMCON), 2015 International Conference and Workshop on. IEEE; 2015. p. 1–7.

  7. 7.

    Banerjee S, Basu S, Nasipuri M. Big Data Analytics and Its Prospects in Computational Proteomics. In: Information systems design and intelligent applications. Springer; 2015. p. 591–8.

  8. 8.

    Banerjee S, Velásquez-Zapata V, Fuerst G, Elmore JM, Wise RP, Elmore M. NGPINT: a next-generation protein–protein interaction software. Brief Bioinform. 2020;2020:1–14. https://doi.org/10.1093/bib/bbaa351.

    Article  Google Scholar 

  9. 9.

    Rao VS, Srinivas K, Sujini GN, Kumar GN. Protein–protein interaction detection: methods and analysis. Int J Proteom. 2014;2014:147648.

    Google Scholar 

  10. 10.

    Patel S, Tripathi R, Kumari V, Varadwaj P. DeepInteract: deep neural network based protein–protein interaction prediction tool. Curr Bioinform. 2017;12:551–7.

    CAS  Article  Google Scholar 

  11. 11.

    Chen M, Ju CJ-T, Zhou G, Chen X, Zhang T, Chang K-W, et al. Multifaceted protein–protein interaction prediction based on siamese residual rcnn. Bioinformatics. 2019;35:305–14.

    Article  CAS  Google Scholar 

  12. 12.

    Yang S, Li H, He H, Zhou Y, Zhang Z. Critical assessment and performance improvement of plant–pathogen protein–protein interaction prediction methods. Brief Bioinform. 2019;20:274–87.

    PubMed  Article  CAS  Google Scholar 

  13. 13.

    Li Y, Ilie L. SPRINT: ultrafast protein–protein interaction prediction of the entire human interactome. BMC Bioinform. 2017;18:485.

    Article  CAS  Google Scholar 

  14. 14.

    Velásquez-Zapata V, Elmore JM, Banerjee S, Dorman KS, Wise RP. Next-generation yeast-two-hybrid analysis with Y2H-SCORES identifies novel interactors of the MLA immune receptor. PLoS Comput Biol 2021.

  15. 15.

    Banerjee S, Ghosh D, Basu S, Nasipuri M. JUPred_MLP: Prediction of phosphorylation sites using a consensus of MLP classifiers. 2016.

  16. 16.

    Banerjee S, Ghosh D, Basu S, Nasipuri M. JUPred_SVM : Prediction of Phosphorylation Sites using a consensus of SVM classifiers. In: Proceedings of Fifth International Conference on Soft Computing for Problem Solving. Springer; 2016. p. 1–8.

  17. 17.

    Banerjee S, Nag S, Tapadar S, Ghosh S, Guha S, Bakshi S. Improving protein protein interaction prediction by choosing appropriate physiochemical properties of amino acids. In: Computing and Communication (IEMCON), 2015 International Conference and Workshop on. IEEE; 2015. p. 1–8.

  18. 18.

    Banerjee S, Basu S, Ghosh D, Nasipuri M. PhospredRF: Prediction of protein phosphorylation sites using a consensus of random forest classifiers. In: Computing and Communication (IEMCON), 2015 International Conference and Workshop on. IEEE; 2015. p. 1–7.

  19. 19.

    Luo F, Wang M, Liu Y, Zhao X-M, Li A. DeepPhos: prediction of protein phosphorylation sites with deep learning. Bioinformatics. 2019;35:2766–73. https://doi.org/10.1093/bioinformatics/bty1051.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  20. 20.

    Li F, Li C, Marquez-Lago TT, Leier A, Akutsu T, Purcell AW, et al. Quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome. Bioinformatics. 2018;34:4223–31.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  21. 21.

    Song J, Wang H, Wang J, Leier A, Marquez-Lago T, Yang B, et al. PhosphoPredict: a bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection. Sci Rep. 2017;7:1–19.

    Article  CAS  Google Scholar 

  22. 22.

    Chen H, Xue Y, Huang N, Yao X, Sun Z. MeMo: a web tool for prediction of protein methylation modifications. Nucl Acids Res. 2006;34 suppl_2:W249–53.

  23. 23.

    Eisenhaber B, Eisenhaber F. Prediction of posttranslational modification of proteins from their amino acid sequence. In: Data mining techniques for the life sciences. Springer; 2010. p. 365–84.

  24. 24.

    Elmore MG, Banerjee S, Pedley KF, Ruck A, Whitham SA. De novo transcriptome of Phakopsora pachyrhizi uncovers putative effector repertoire during infection. Physiol Mol Plant Pathol. 2020;110:101464.

    CAS  Article  Google Scholar 

  25. 25.

    Frantzeskakis L, Kracher B, Kusch S, Yoshikawa-Maekawa M, Bauer S, Pedersen C, et al. Signatures of host specialization and a recent transposable element burst in the dynamic one-speed genome of the fungal barley powdery mildew pathogen. BMC Genomics. 2018;19:381. https://doi.org/10.1186/s12864-018-4750-6.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  26. 26.

    Sperschneider J. Machine learning in plant–pathogen interactions: empowering biological predictions from field scale to genome scale. New Phytologist. 2019;nph.15771. https://doi.org/10.1111/nph.15771.

  27. 27.

    Sperschneider J, Dodds PN, Singh KB, Taylor JM. ApoplastP: prediction of effectors and plant proteins in the apoplast using machine learning. New Phytologist. 2017.

  28. 28.

    Sperschneider J, Gardiner DM, Dodds PN, Tini F, Covarelli L, Singh KB, et al. EffectorP: predicting fungal effector proteins from secretomes using machine learning. New Phytol. 2016;210:743–61. https://doi.org/10.1111/nph.13794.

    CAS  Article  PubMed  Google Scholar 

  29. 29.

    Magnan CN, Baldi P. SSpro/ACCpro 5: Almost Perfect Prediction of Protein Secondary Structure and Relative Solvent Accessibility Using Profiles, Machine Learning, and Structural Similarity. Bioinformatics. 2014;:btu352.

  30. 30.

    McGuffin LJ, Bryson K, Jones DT. The PSIPRED protein structure prediction server. Bioinformatics (Oxford, England). 2000;16:404–5.

    CAS  Article  Google Scholar 

  31. 31.

    Laskowski RA, Watson JD, Thornton JM. Protein function prediction using local 3D templates. J Mol Biol. 2005;351:614–26. https://doi.org/10.1016/j.jmb.2005.05.067.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  32. 32.

    Banerjee S, Guha S, Dutta A, Dutta S. Improvement of protein disorder prediction by brainstorming consensus. In: Computing and Communication (IEMCON), 2015 International Conference and Workshop on. IEEE; 2015. p. 1–7.

  33. 33.

    Salzberg SL. Next-generation genome annotation: we still struggle to get it right. BioMed Central; 2019. https://doi.org/10.1186/s13059-019-1715-2.

  34. 34.

    del Angel VD, Hjerde E, Sterck L, Capella-Gutierrez S, Notredame C, Pettersson OV, et al. Ten steps to get started in genome assembly and annotation. F1000Research. 2018;7.

  35. 35.

    Richards S. Full disclosure: genome assembly is still hard. PLoS Biol. 2018;16:e2005894.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  36. 36.

    Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucl Acids Res. 2003;31:5654–66.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  37. 37.

    Salamov A, Solovyev V. Fgenesh multiple gene prediction program; 1998.

  38. 38.

    Solovyev V, Kosarev P, Seledsov I, Vorobyev D. Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol. 2006;7:S10.

    PubMed  PubMed Central  Article  Google Scholar 

  39. 39.

    Kleffe J, Hermann K, Vahrson W, Wittig B, Brendel V. GeneGenerator—a flexible algorithm for gene prediction and its application to maize sequences. Bioinformatics (Oxford). 1998;14:232–43.

    CAS  Article  Google Scholar 

  40. 40.

    Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, et al. mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Res. 2009;19:2133–43.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  41. 41.

    Schlueter SD, Dong Q, Brendel V. GeneSeqer@ PlantGDB: gene structure prediction in plant genomes. Nucl Acids Res. 2003;31:3597–600.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  42. 42.

    Cantarel BL, Korf I, Robb SMCC, Parra G, Ross E, Moore B, et al. MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18:188–96. https://doi.org/10.1101/gr.6743907.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  43. 43.

    Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinform. 2011;12:491.

    Article  Google Scholar 

  44. 44.

    Campbell MS, Law M, Holt C, Stein JC, Moghe GD, Hufnagel DE, et al. MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol. 2014;164:513–24.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  45. 45.

    Campbell MS, Holt C, Moore B, Yandell M. Genome annotation and curation using MAKER and MAKER-P. Curr Protoc Bioinform. 2014;48:4–11. https://doi.org/10.1002/0471250953.bi0411s48.

    Article  Google Scholar 

  46. 46.

    Vonk FJ, Casewell NR, Henkel CV, Heimberg AM, Jansen HJ, McCleary RJR, et al. The king cobra genome reveals dynamic gene evolution and adaptation in the snake venom system. Proc Natl Acad Sci. 2013;110:20651–6.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  47. 47.

    Keane M, Semeiks J, Webb AE, Li YI, Quesada V, Craig T, et al. Insights into the evolution of longevity from the bowhead whale genome. Cell Rep. 2015;10:112–22.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  48. 48.

    Zhang J, Fu X-X, Li R-Q, Zhao X, Liu Y, Li M-H, et al. The hornwort genome and early land plant evolution. Nature plants. 2020;6:107–18.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  49. 49.

    Gray MW, Burger G, Derelle R, Klimeš V, Leger MM, Sarrasin M, et al. The draft nuclear genome sequence and predicted mitochondrial proteome of Andalucia godoyi, a protist with the most gene-rich and bacteria-like mitochondrial genome. BMC Biol. 2020;18:1–35.

    Article  CAS  Google Scholar 

  50. 50.

    Peng C, Ren J-L, Deng C, Jiang D, Wang J, Qu J, et al. The genome of Shaw’s sea snake (Hydrophis curtus) reveals secondary adaptation to its marine environment. Mol Biol Evol; 2020.

  51. 51.

    Weitemier K, Straub SCK, Fishbein M, Bailey CD, Cronn RC, Liston A. A draft genome and transcriptome of common milkweed (Asclepias syriaca) as resources for evolutionary, ecological, and molecular studies in milkweeds and Apocynaceae. PeerJ. 2019;7:e7649.

    PubMed  PubMed Central  Article  Google Scholar 

  52. 52.

    Zhang J, Zhang X, Tang H, Zhang Q, Hua X, Ma X, et al. Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L. Nat Genet. 2018;50:1565–73.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  53. 53.

    Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–9.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  54. 54.

    Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat Biotechnol. 2011;29:644.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  55. 55.

    Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–52.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  56. 56.

    Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19:455–77.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  57. 57.

    Xie Y, Wu G, Tang J, Luo R, Patterson J, Liu S, et al. SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics. 2014;30:1660–6.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  58. 58.

    Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  59. 59.

    Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33:290–5. https://doi.org/10.1038/nbt.3122.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  60. 60.

    Liu R, Dickerson J. Strawberry: fast and accurate genome-guided transcript reconstruction and quantification from RNA-Seq. PLoS Comput Biol. 2017;13:e1005851.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  61. 61.

    Shao M, Kingsford C. Accurate assembly of transcripts through phase-preserving graph decomposition. Nat Biotechnol. 2017;35:1167–9. https://doi.org/10.1038/nbt.4020.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  62. 62.

    Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 2019;20:1–13.

    Article  CAS  Google Scholar 

  63. 63.

    Song L, Sabunciyan S, Yang G, Florea L. A multi-sample approach increases the accuracy of transcript assembly. Nat Commun. 2019;10:5000. https://doi.org/10.1038/s41467-019-12990-0.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  64. 64.

    Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS: Table 1. Bioinformatics. 2016;32:767–9. https://doi.org/10.1093/bioinformatics/btv661.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  65. 65.

    Hoff KJ, Lomsadze A, Borodovsky M, Stanke M. Whole-genome annotation with BRAKER. In: Gene prediction. Springer; 2019. p. 65–95.

  66. 66.

    Steijger T, Abril JF, Engström PG, Kokocinski F, Akerman M, Alioto T, et al. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods. 2013;10:1177–84.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  67. 67.

    Lomsadze A, Burns PD, Borodovsky M. Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucl Acids Res. 2014;42:e119–e119. https://doi.org/10.1093/nar/gku557.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  68. 68.

    Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008;24:637–44.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  69. 69.

    Korf I. Gene finding in novel genomes. BMC Bioinform. 2004;5:59.

    Article  Google Scholar 

  70. 70.

    Keilwagen J, Hartung F, Grau J. GeMoMa: Homology-based gene prediction utilizing intron position conservation and RNA-seq data. In: Methods in molecular biology. 2019.

  71. 71.

    Leinonen R, Sugawara H, Shumway M, Collaboration INSD. The sequence read archive. Nucl Acids Res. 2010;39 suppl_1:D19–21.

  72. 72.

    Dobin A, Gingeras TR, Spring C, Flores R, Sampson J, Knight R, et al. Mapping RNA-seq with STAR. Curr Protoc Bioinform. 2016;51:586–97.

    Google Scholar 

  73. 73.

    Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  74. 74.

    Tang S, Lomsadze A, Borodovsky M. Identification of protein coding regions in RNA transcripts. Nucleic Acids Res. 2015;43:e78. https://doi.org/10.1093/nar/gkv227.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  75. 75.

    Goodstadt L. Ruffus: a lightweight Python library for computational pipelines. Bioinformatics. 2010;26:2778–9.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  76. 76.

    Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, Alioto T, et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods. 2013;10:1185–91.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  77. 77.

    Rapazote-Flores P, Bayer M, Milne L, Mayer C-D, Fuller J, Guo W, et al. BaRTv1.0: an improved barley reference transcript dataset to determine accurate changes in the barley transcriptome using RNA-seq. BMC Genomics. 2019;20:1–17.

    Article  CAS  Google Scholar 

  78. 78.

    Ustianenko D, Weyn-Vanhentenryck SM, Zhang C. Microexons: discovery, regulation, and function. Wiley Interdiscip Rev RNA. 2017;8:e1418.

    Article  CAS  Google Scholar 

  79. 79.

    Curry-Hyde A, Chen BJ, Mills JD, Janitz M. Microexons: novel regulators of the transcriptome. J Hum Transcript. 2018;2:1–6.

    Article  Google Scholar 

  80. 80.

    Wen F, Li F, Xia H, Lu X, Zhang X, Li Y. The impact of very short alternative splicing on protein structures and functions in the human genome. Trends Genet. 2004;20:232–6.

    CAS  PubMed  Article  Google Scholar 

  81. 81.

    Sakharkar MK, Chow VTK, Kangueane P. Distributions of exons and introns in the human genome. silico biology. 2004;4:387–93.

    CAS  Google Scholar 

  82. 82.

    Mano F, Aoyanagi T, Kozaki A. Atypical splicing accompanied by skipping conserved micro-exons produces unique WRINKLED1, an AP2 domain transcription factor in rice plants. Plants. 2019;8:207.

    CAS  PubMed Central  Article  PubMed  Google Scholar 

  83. 83.

    Song Q, Lv F, Tahir ul Qamar M, Xing F, Zhou R, Li H, et al. Identification and analysis of micro-exon genes in the rice genome. Int J Mol Sci. 2019;20:2685.

    CAS  PubMed Central  Article  PubMed  Google Scholar 

  84. 84.

    Bulman S, Ridgway HJ, Eady C, Conner AJ. Intron-rich gene structure in the intracellular plant parasite Plasmodiophora brassicae. Protist. 2007;158:423–33.

    CAS  PubMed  Article  Google Scholar 

  85. 85.

    Wang X. Protein and proteome atlas for plants under stresses: new highlights and ways for integrated Omics in post-genomics era; 2019.

  86. 86.

    Guo L, Liu C-M. A single-nucleotide exon found in Arabidopsis. Sci Rep. 2015;5:18087.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  87. 87.

    Gonatopoulos-Pournatzis T, Wu M, Braunschweig U, Roth J, Han H, Best AJ, et al. Genome-wide CRISPR-Cas9 interrogation of splicing networks reveals a mechanism for recognition of autism-misregulated neuronal microexons. Mol Cell. 2018;72:510–24.

    CAS  PubMed  Article  Google Scholar 

  88. 88.

    Consortium Gte. Human genomics. The human transcriptome across tissues and individuals. Science. 2015;348:660–5.

  89. 89.

    Irimia M, Weatheritt RJ, Ellis JD, Parikshak NN, Gonatopoulos-Pournatzis T, Babor M, et al. A highly conserved program of neuronal microexons is misregulated in autistic brains. Cell. 2014;159:1511–23.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  90. 90.

    Torres-Méndez A, Bonnal S, Marquez Y, Roth J, Iglesias M, Permanyer J, et al. A novel protein domain in an ancestral splicing factor drove the evolution of neural microexons. Nat Ecol Evol. 2019;3:691–701.

    PubMed  Article  PubMed Central  Google Scholar 

  91. 91.

    Parras A, Anta H, Santos-Galindo M, Swarup V, Elorza A, Nieto-González JL, et al. Autism-like phenotype and risk gene mRNA deadenylation by CPEB4 mis-splicing. Nature. 2018;560:441–6.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  92. 92.

    Wu J, Anczukow O, Krainer AR, Zhang MQ, Zhang C. OLego: fast and sensitive mapping of spliced mRNA-Seq reads using small seeds. Nucleic Acids Res. 2013;41:5149–63.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  93. 93.

    Kawahara Y, Sugiyama M. Change-point detection in time-series data by direct density-ratio estimation. In: Proceedings of the 2009 SIAM International Conference on Data Mining. SIAM; 2009. p. 389–400.

  94. 94.

    Lund R, Wang XL, Lu QQ, Reeves J, Gallagher C, Feng Y. Changepoint detection in periodic and autocorrelated time series. J Clim. 2007;20:5178–90.

    Article  Google Scholar 

  95. 95.

    Kawahara Y, Yairi T, Machida K. Change-point detection in time-series data based on subspace identification. In: Seventh IEEE international conference on data mining (ICDM 2007). IEEE; 2007. p. 559–64.

  96. 96.

    Takeuchi J, Yamanishi K. A unifying framework for detecting outliers and change points from time series. IEEE Trans Knowl Data Eng. 2006;18:482–92.

    Article  Google Scholar 

  97. 97.

    Aalvik Stranden S. A supervised sliding window approach for change point detection in multivariate time series; 2020.

  98. 98.

    Tartakovsky AG, Rozovskii BL, Blazek RB, Kim H. A novel approach to detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods. IEEE Trans Signal Process. 2006;54:3372–82.

    Article  Google Scholar 

  99. 99.

    Klanderman MC, Newhart KB, Cath TY, Hering AS. Fault isolation for a complex decentralized waste water treatment facility. J R Stat Soc Ser C. 2020;69:931–51.

    Article  Google Scholar 

  100. 100.

    Quinlan AR. BEDTools: the Swiss-army tool for genome feature analysis. Curr Protoc Bioinformatics. 2014;47:11–2.

    PubMed  PubMed Central  Article  Google Scholar 

  101. 101.

    Killick R, Eckley I. changepoint: an R package for changepoint analysis. J Stat Softw. 2014;58:1–19.

    Article  Google Scholar 

  102. 102.

    Xiang S, Huang Z, Wang T, Han Z, Christina YY, Ni D, et al. Condition-specific gene co-expression network mining identifies key pathways and regulators in the brain tissue of Alzheimer’s disease patients. BMC Med Genomics. 2018;11:115.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  103. 103.

    Bruna T, Hoff K, Stanke M, Lomsadze A, Borodovsky M. BRAKER2: Automatic Eukaryotic Genome Annotation with GeneMark-EP+ and AUGUSTUS Supported by a Protein Database. bioRxiv. 2020.

  104. 104.

    Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.

    CAS  PubMed  Article  Google Scholar 

  105. 105.

    Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinform. 2005;6:31.

    Article  CAS  Google Scholar 

  106. 106.

    Cheng C, Krishnakumar V, Chan AP, Thibaud-Nissen F, Schobel S, Town CD. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 2017;89:789–804.

    CAS  PubMed  Article  Google Scholar 

  107. 107.

    Li J-Y, Wang J, Zeigler RS. The 3,000 rice genomes project: new opportunities and challenges for future rice research. Gigascience. 2014;3:2047–217.

    Google Scholar 

  108. 108.

    Duitama J, Silva A, Sanabria Y, Cruz DF, Quintero C, Ballen C, et al. Whole genome sequencing of elite rice cultivars as a comprehensive information resource for marker assisted selection. PLoS ONE. 2015;10:e0124617.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  109. 109.

    Zhao K, Wright M, Kimball J, Eizenga G, McClung A, Kovach M, et al. Genomic diversity and introgression in O. sativa reveal the impact of domestication and breeding on the rice genome. PLoS ONE. 2010;5:e10780.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  110. 110.

    Wang B, Tseng E, Regulski M, Clark TA, Hon T, Jiao Y, et al. Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing. Nat Commun. 2016;7:11708. https://doi.org/10.1038/ncomms11708.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  111. 111.

    Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, et al. The B73 maize genome: complexity, diversity, and dynamics. Science. 2009;326:1112–5.

  112. 112.

    The_C_elegans_Sequencing_Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science. 1998;282:2012–8.

  113. 113.

    Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res. 2008;18:1979–90. https://doi.org/10.1101/gr.081612.108.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  114. 114.

    Drosophila_consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007;450:203.

  115. 115.

    International_Human_Genome_Sequencing_consortium. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.

  116. 116.

    Hood L, Rowen L. The human genome project: big science transforms biology and medicine. Genome Med. 2013;5:79.

    PubMed  PubMed Central  Article  Google Scholar 

  117. 117.

    Monat C, Padmarasu S, Lux T, Wicker T, Gundlach H, Himmelbach A, et al. TRITEX: chromosome-scale sequence assembly of Triticeae genomes with open-source tools. Genome Biol. 2019;20:284.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  118. 118.

    Appels R, Eversole K, Stein N, Feuillet C, Keller B, Rogers J, et al. Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science. 2018;361.

  119. 119.

    Krasileva KV, Vasquez-Gross HA, Howell T, Bailey P, Paraiso F, Clissold L, et al. Uncovering hidden variation in polyploid wheat. Proc Natl Acad Sci. 2017;114:E913–21.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  120. 120.

    Clavijo BJ, Venturini L, Schudoma C, Accinelli GG, Kaithakottil G, Wright J, et al. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations. Genome Res. 2017;27:885–96.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  121. 121.

    Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, et al. RefSeq: an update on mammalian reference sequences. Nucl Acids Res. 2014;42:D756–63.

    CAS  PubMed  Article  Google Scholar 

  122. 122.

    Tello-Ruiz MK, Naithani S, Stein JC, Gupta P, Campbell M, Olson A, et al. Gramene 2018: unifying comparative genomics and pathway resources for plant research. Nucl Acids Res. 2018;46:D1181–9.

    CAS  PubMed  Article  Google Scholar 

  123. 123.

    Jiao Y, Peluso P, Shi J, Liang T, Stitzer MC, Wang B, et al. Improved maize reference genome with single-molecule technologies. Nature. 2017;546:524–7.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  124. 124.

    Eilbeck K, Moore B, Holt C, Yandell M. Quantitative measures for the management and comparison of annotated genomes. BMC Bioinform. 2009;10:67.

    Article  CAS  Google Scholar 

  125. 125.

    Venturini L, Caim S, Kaithakottil GG, Mapleson DL, Swarbreck D. Leveraging multiple transcriptome assembly methods for improved gene structure annotation. GigaScience. 2018;7. https://doi.org/10.1093/gigascience/giy093.

  126. 126.

    TAIR. Documentation for the TAIR gene model and exon confidence ranking system. 2009. http://plantta.jcvi.org/. Accessed 9 Oct 2020.

  127. 127.

    Sreenivasamurthy SK, Madugundu AK, Patil AH, Dey G, Mohanty AK, Kumar M, et al. Mosquito-borne diseases and Omics: tissue-restricted expression and alternative splicing revealed by transcriptome profiling of Anopheles stephensi. Omics J Integr Biol. 2017;21:488–97.

    CAS  Article  Google Scholar 

  128. 128.

    Azlan A, Obeidat SM, Yunus MA, Azzam G. Transcriptome profiles and novel lncRNA identification of Aedes aegypti cells in response to dengue virus serotype 1. BioRxiv. 2018;:422170.

  129. 129.

    Azlan A, Halim MA, Azzam G. Genome-wide identification and characterization of long intergenic noncoding RNAs in the regenerative flatworm Macrostomum lignano. Genomics. 2020;112:1273–81.

    CAS  PubMed  Article  Google Scholar 

  130. 130.

    Qi S, Akter S, Li S. Identification of Novel lincRNA and Co-Expression Network Analysis Using RNA-Sequencing Data in Plants. In: Plant long non-coding RNAs. Springer; 2019. p. 207–21.

  131. 131.

    Beisel NS, Noble J, Barbazuk WB, Paul A-L, Ferl RJ. Spaceflight-induced alternative splicing during seedling development in Arabidopsis thaliana. NPJ Micrograv. 2019;5:1–5.

    Article  Google Scholar 

  132. 132.

    Wang C, Wallerman O, Arendt M-L, Sundstrom E, Karlsson A, Nordin J, et al. A new long-read dog assembly uncovers thousands of exons and functional elements missing in the previous reference. bioRxiv. 2020.

  133. 133.

    Liu S, Aagaard A, Bechsgaard J, Bilde T. DNA methylation patterns in the social spider. Stegodyphus dumicola Genes. 2019;10:137.

    CAS  Google Scholar 

  134. 134.

    Wu S, Gao S, Wang S, Meng J, Wickham J, Luo S, et al. A reference genome of bursaphelenchus mucronatus provides new resources for revealing its displacement by pinewood nematode. Genes. 2020;11:570.

    CAS  PubMed Central  Article  Google Scholar 

  135. 135.

    Wang P, Luo Y, Huang J, Gao S, Zhu G, Dang Z, et al. The genome evolution and domestication of tropical fruit mango. Genome Biol. 2020;21:1–17.

    Article  CAS  Google Scholar 

  136. 136.

    Cieślik M, Chinnaiyan AM. Cancer transcriptome profiling at the juncture of clinical translation. Nat Rev Genet. 2018;19:93.

    PubMed  Article  CAS  PubMed Central  Google Scholar 

  137. 137.

    Lorenzi L, Avila Cobos F, Decock A, Everaert C, Helsmoortel H, Lefever S, et al. Long noncoding RNA expression profiling in cancer: challenges and opportunities. Genes Chromosom Cancer. 2019;58:191–9.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  138. 138.

    Yang J, Moeinzadeh M-H, Kuhl H, Helmuth J, Xiao P, Haas S, et al. Haplotype-resolved sweet potato genome traces back its hexaploidization history. Nature plants. 2017;3:696–703.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  139. 139.

    Sun Z, Nair A, Chen X, Prodduturi N, Wang J, Kocher J-P. UClncR: ultrafast and comprehensive long non-coding RNA detection from RNA-seq. Sci Rep. 2017;7:1–10.

    Article  CAS  Google Scholar 

  140. 140.

    Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7:562–78.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  141. 141.

    Arrigoni A, Ranzani V, Rossetti G, Panzeri I, Abrignani S, Bonnal RJP, et al. Analysis RNA-seq and Noncoding RNA. In: Polycomb group proteins. Springer; 2016. p. 125–35.

  142. 142.

    Ghosh S, Chan C-KK. Analysis of RNA-Seq data using TopHat and Cufflinks. In: Plant Bioinformatics. Springer; 2016. p. 339–61.

  143. 143.

    Qi X, Xie S, Liu Y, Yi F, Yu J. Genome-wide annotation of genes and noncoding RNAs of foxtail millet in response to simulated drought stress by deep sequencing. Plant Mol Biol. 2013;83:459–73.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  144. 144.

    Marchant A, Mougel F, Mendonça V, Quartier M, Jacquin-Joly E, da Rosa JA, et al. Comparing de novo and reference-based transcriptome assembly strategies by applying them to the blood-sucking bug Rhodnius prolixus. Insect Biochem Mol Biol. 2016;69:25–33.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  145. 145.

    Li A, Zhang J, Zhou Z, Wang L, Liu Y, Liu Y. ALDB: a domestic-animal long noncoding RNA database. PLoS ONE. 2015;10:e0124003.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  146. 146.

    Cooper SJ, Trinklein ND, Anton ED, Nguyen L, Myers RM. Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res. 2006;16:1–10.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  147. 147.

    Brown RH, Gross SS, Brent MR. Begin at the beginning: predicting genes with 5′ UTRs. Genome Res. 2005;15:742–7.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  148. 148.

    Ohler U, Liao G, Niemann H, Rubin GM. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002;3:research0087–1.

  149. 149.

    Batut P, Gingeras TR. RAMPAGE: promoter activity profiling by paired-end sequencing of 5′-complete cDNAs. Curr Protoc Mol Biol. 2013;104:25B-B11.

    Article  Google Scholar 

  150. 150.

    Adiconis X, Haber AL, Simmons SK, Levy Moonshine A, Ji Z, Busby MA, et al. Comprehensive comparative analysis of 5′-end RNA-sequencing methods. Nat Methods. 2018;15:505–11.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  151. 151.

    Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci. 2003;100:15776–81.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  152. 152.

    Holmqvist E, Wright PR, Li L, Bischler T, Barquist L, Reinhardt R, et al. Global RNA recognition patterns of post-transcriptional regulators Hfq and CsrA revealed by UV crosslinking in vivo. EMBO J. 2016;35:991–1011.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  153. 153.

    Hickman R, van Verk MC, van Dijken AJH, Mendes MP, Vroegop-Vos IA, Caarls L, et al. Architecture and dynamics of the jasmonic acid gene regulatory network. Plant Cell Online. 2017;:tpc-00958.

  154. 154.

    Jackson RJ, Standart N. Do the poly (A) tail and 3′ untranslated region control mRNA translation? Cell. 1990;62:15–24.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  155. 155.

    Meijer HA, Thomas AAM. Control of eukaryotic protein synthesis by upstream open reading frames in the 5′-untranslated region of an mRNA. Biochem J. 2002;367:1–11.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  156. 156.

    Miller GM, Madras BK. Polymorphisms in the 3′-untranslated region of human and monkey dopamine transporter genes affect reporter gene expression. Mol Psychiatry. 2002;7:44–55.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  157. 157.

    Wu S, Huang S, Ding J, Zhao Y, Liang L, Liu T, et al. Multiple microRNAs modulate p21Cip1/Waf1 expression by directly targeting its 3′ untranslated region. Oncogene. 2010;29:2302–8.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  158. 158.

    Dixon DA, Kaplan CD, McIntyre TM, Zimmerman GA, Prescott SM. Post-transcriptional control of cyclooxygenase-2 gene expression The role of the 3′-untranslated region. J Biol Chem. 2000;275:11750–7.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  159. 159.

    Gu S, Jin L, Zhang F, Sarnow P, Kay MA. Biological basis for restriction of microRNA targets to the 3′ untranslated region in mammalian mRNAs. Nat Struct Mol Biol. 2009;16:144.

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  160. 160.

    Eberle AB, Stalder L, Mathys H, Orozco RZ, Mühlemann O. Posttranscriptional gene regulation by spatial rearrangement of the 3′ untranslated region. PLoS Biol. 2008;6:e92.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  161. 161.

    Halterman DA, Wise RP. Upstream open reading frames of the barley Mla13 powdery mildew resistance gene function co-operatively to down-regulate translation. Mol Plant Pathol. 2006;7:167–76.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  162. 162.

    Awata T, Inoue K, Kurihara S, Ohkubo T, Watanabe M, Inukai K, et al. A common polymorphism in the 5′-untranslated region of the VEGF gene is associated with diabetic retinopathy in type 2 diabetes. Diabetes. 2002;51:1635–9.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  163. 163.

    Rogers JT, Randall JD, Cahill CM, Eder PS, Huang X, Gunshin H, et al. An iron-responsive element type II in the 5′-untranslated region of the Alzheimer’s amyloid precursor protein transcript. J Biol Chem. 2002;277:45518–28.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  164. 164.

    Chin LJ, Ratner E, Leng S, Zhai R, Nallur S, Babar I, et al. A SNP in a let-7 microRNA complementary site in the KRAS 3′ untranslated region increases non-small cell lung cancer risk. Can Res. 2008;68:8535–40.

    CAS  Article  Google Scholar 

  165. 165.

    Halterman DA, Wei F, Wise RP. Powdery mildew-induced Mla mRNAs are alternatively spliced and contain multiple upstream open reading frames. Plant Physiol. 2003;131:558–67. https://doi.org/10.1104/pp.014407.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  166. 166.

    Conne B, Stutz A, Vassalli J-D. The 3′ untranslated region of messenger RNA: a molecular ‘hotspot’for pathology? Nat Med. 2000;6:637–41.

    CAS  PubMed  Article  PubMed Central  Google Scholar 

  167. 167.

    Hunt M, Banerjee S, Surana P, Liu M, Fuerst G, Mathioni S, et al. Small RNA discovery in the interaction between barley and the powdery mildew pathogen. BMC Genomics. 2019;20:610.

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  168. 168.

    Chapman AVE, Matthew H, Surana P, Velásquez-Zapata V, Xu W, Fuerst G, et al. Disruption of barley immunity to powdery mildew by an in-frame Lys-Leu deletion in the essential protein SGT1. Oxford Genetics. 2020.

  169. 169.

    Dai X, Xu Z, Liang Z, Tu X, Zhong S, Schnable JC. Non-homology-based prediction of gene functions. 2019;1–18.

  170. 170.

    Duvick DN. The contribution of breeding to yield advances in maize (Zea mays L.). Adv Agronomy. 2005;86:83–145.

  171. 171.

    Agrama HAS, Moussa ME. Mapping QTLs in breeding for drought tolerance in maize (Zea mays L.). Euphytica. 1996;91:89–97.

  172. 172.

    Maazou A-RS, Tu J, Qiu J, Liu Z. Breeding for drought tolerance in maize (Zea mays L.). Am J Plant Sci. 2016;7:1858.

  173. 173.

    Bhandary P, Seetharam AS, Arendsee ZW, Hur M, Wurtele ES. Raising orphans from a metadata morass: a researcher’s guide to re-use of public ’omics data. Plant Sci. 2018. https://doi.org/10.1016/j.plantsci.2017.10.014.

    Article  PubMed  PubMed Central  Google Scholar 

  174. 174.

    Arendsee ZW, Li L, Wurtele ES. Coming of age: orphan genes in plants. Trends Plant Sci. 2014;19:698–708. https://doi.org/10.1016/J.TPLANTS.2014.07.003.

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  175. 175.

    Arendsee Z, Li J, Singh U, Seetharam A, Dorman K, Wurtele ES. phylostratr: a framework for phylostratigraphy. Bioinformatics. 2019;35:3617–27.

    CAS  PubMed  Article  Google Scholar 

Download references

Acknowledgements

This research used resources provided by the SCINet project of the USDA Agricultural Research Service, ARS Project No. 0500-00093-001-00-D. The authors are also thankful to Dr. Karin Dorman (Professor, Department of Statistics, Iowa State University), for providing insightful feedbacks to compare annotations and for implementing changepoint detection. The authors thank Gregory Fuerst for taking care of submitting data to NCBI. Finally, the authors thank Dr. Eve Wurtele (Professor, Department of Genetics, Development and Cell Biology, Iowa State University) for permitting her student Priyanka Bhandari to collaborate on this work.

Funding

This research was supported by the US. Department of Agriculture, Agricultural Research Service, Project No. (5030-21000-068-00D) and (3625-21000-067-00D) through the Corn Insects and Crop Genetics Research Unit and Project No. (2030-21000-024-00D) through the Crop Improvement and Genetics Research Unit. Research supported in part by Oak Ridge Institute for Science and Education (ORISE) under US Department of Energy (DOE) contract number DE-SC0014664 to SB and National Science Foundation—Plant Genome Research Program Grant 13-39348 to RPW. PB received funding from National Science Foundation Grant (IOS 1546858, in part); Orphan Genes, “An Untapped Genetic Reservoir of Novel Traits”. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA, ARS, DOE, ORAU/ORISE or the National Science Foundation. USDA is an equal opportunity provider and employer.

Author information

Affiliations

Authors

Contributions

SB: Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software design, Validation, Visualization, Support personnel, Writing—Original Draft Preparation, Writing—Review and Editing. PB: Formal Analysis, Writing—Review and Editing. MGW: Conceptualization, Resources, Supervision, Writing—Review and Editing. TZS: Conceptualization, Resources, Supervision, Writing—Review and Editing. RPW: Conceptualization, Investigation, Resources, Supervision, Writing—Review and Editing. CMA: Conceptualization, Funding Acquisition, Investigation, Project Administration, Resources, Supervision, Writing—Review and Editing. All the authors have read and approved the final manuscript.

Corresponding author

Correspondence to Carson M. Andorf.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1.

Supplementary figures (S1–S9).

Additional file 2.

Input to finder.

Additional file 3.

Annotation edit distance of reference transcripts as reported by each gene annotation pipeline.

Additional file 4.

Performance of gene annotation pipelines on coding regions of transcripts.

Additional file 5.

Comparison of FINDER's performance with other gene annotation pipelines on a variety of different species.

Additional file 6.

Comparison of different transcriptome assembly softwares on a variety of species.

Additional file 7.

Improvement in reference gene annotation after adding untranslated regions verified with long-read from PacBio assemblies.

Additional file 8.

Description of RNA-Seq data used to execute FINDER, BRAKER2, MAKER2 and PASA.

Additional file 9.

Supplementary text document outlining methods and some results in more details.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Banerjee, S., Bhandary, P., Woodhouse, M. et al. FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences. BMC Bioinformatics 22, 205 (2021). https://doi.org/10.1186/s12859-021-04120-9

Download citation

Keywords

  • Genomics
  • Transcriptomics
  • Eukaryotic gene annotation
  • Gene prediction
  • Optimized RNA-Seq alignment
  • Changepoint detection