- Open Access
FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences
BMC Bioinformatics volume 22, Article number: 205 (2021)
Gene annotation in eukaryotes is a non-trivial task that requires meticulous analysis of accumulated transcript data. Challenges include transcriptionally active regions of the genome that contain overlapping genes, genes that produce numerous transcripts, transposable elements and numerous diverse sequence repeats. Currently available gene annotation software applications depend on pre-constructed full-length gene sequence assemblies which are not guaranteed to be error-free. The origins of these sequences are often uncertain, making it difficult to identify and rectify errors in them. This hinders the creation of an accurate and holistic representation of the transcriptomic landscape across multiple tissue types and experimental conditions. Therefore, to gauge the extent of diversity in gene structures, a comprehensive analysis of genome-wide expression data is imperative.
We present FINDER, a fully automated computational tool that optimizes the entire process of annotating genes and transcript structures. Unlike current state-of-the-art pipelines, FINDER automates the RNA-Seq pre-processing step by working directly with raw sequence reads and optimizes gene prediction from BRAKER2 by supplementing these reads with associated proteins. The FINDER pipeline (1) reports transcripts and recognizes genes that are expressed under specific conditions, (2) generates all possible alternatively spliced transcripts from expressed RNA-Seq data, (3) analyzes read coverage patterns to modify existing transcript models and create new ones, and (4) scores genes as high- or low-confidence based on the available evidence across multiple datasets. We demonstrate the ability of FINDER to automatically annotate a diverse pool of genomes from eight species.
FINDER takes a completely automated approach to annotate genes directly from raw expression data. It is capable of processing eukaryotic genomes of all sizes and requires no manual supervision—ideal for bench researchers with limited experience in handling computational tools.
Recent advances in sequencing technology enable the construction of chromosomal-level assemblies for even non-model organisms. As of December 2020, genomes of 16,108 eukaryotes, 295,784 prokaryotes, 41,936 viruses, 26,079 plasmids and 17,820 organelles are sequenced and available through GenBank , a considerable increase over the 1,500 sequences reported two decades ago (see Additional file 1: Fig. S1). Therefore, to annotate the ever-rising number of genome sequences, annotation software applications need to be fast, accurate, and designed to handle large amounts of expression data to facilitate discovery of novel genes across different conditions [2,3,4,5]. Extensive analysis of this available data is the key to achieving exhaustive gene discovery by analyzing samples from multiple tissues and conditions, obviating the need for additional sequencing.
Genome annotation is the process of identifying transcriptionally active regions of the genome and defining gene structures. Decoding the correct structures of genes is essential since several downstream applications rely on accurate annotations: detecting interactions between proteins [6,7,8,9,10,11,12,13,14], identifying post-translational modifications [15,16,17,18,19,20,21,22,23], mining effectors [24,25,26,27,28], and determining protein structure [29,30,31,32]. Although we have seen a significant improvement in genome sequencing technology, annotation methods continue to underperform [33, 34]. Obtaining accurate gene annotations is challenging, especially in recently sequenced non-model organisms. The presence of sequences exchanged through horizontal gene transfer in such genomes and the existence of fragmented assemblies make it difficult to predict gene structures . Multiple groups working on the same species have different and oftentimes conflicting annotations that are difficult to merge into a common consensus.
The early 2000s saw initial genome annotation attempts with the introduction of PASA , which was developed to map full-length transcripts and Expressed Sequence Tags (ESTs) in order to annotate genomes. In parallel, FGENESH [37, 38], GeneGenerator , mGene  and GeneSeqer  were introduced which predicted gene structures directly from genome sequence. Tools such as MAKER [42,43,44,45] and PASA  closely depend on pre-assembled full-length transcripts to generate annotations. ESTs and/or de novo assembled transcriptomes have been often provided as inputs to these tools to generate annotations [46,47,48,49,50,51,52]. Transcripts constructed via de novo [53,54,55,56,57] or genome-guided [58,59,60,61,62,63] approaches are sensitive to the nature of the assembler and its parameter settings. Such assemblers report sequences that are highly similar to one another, making the process of sifting the correct assemblies from artefacts difficult. This issue is moderately mitigated by BRAKER2 [64, 65], which uses read splice information instead of full-length assemblies to predict gene structures and has been shown to perform better than de novo approaches . BRAKER2 entails a round of unsupervised gene predictions using GeneMark-ET  generating ab-initio gene predictions followed by a second round of training by AUGUSTUS  using a subset of the gene models created by GeneMark-ET . All variations of MAKER (MAKER, MAKER2 and MAKER-P) use a combination of AUGUSTUS  and SNAP  to generate gene predictions. Unlike BRAKER2 or PASA, users need to run MAKER for multiple rounds to improve annotation. With no standard technique to optimize the number of rounds, users often undertake a trial-and-error approach to decide what data is supplied to MAKER in each execution round. These unguided choices can create different annotations based on the same data sets. Thus, current approaches report either incomplete genes and/or derive annotations that are missing alternatively spliced transcripts. In addition to MAKER, BRAKER, PASA there is another gene annotator -GeMoMa  that use protein coding genes from a reference genome and transfers those to the target genome. Since it does not optimally use information from expression data, it has not been compared with the other gene annotators discussed in this manuscript.
To overcome the drawbacks described above, we developed FINDER, a new, automated annotation pipeline that downloads RNA-Seq data from NCBI SRA , conducts genome-guided assembly of short reads, predicts gene structure, and annotates genes. FINDER annotates both untranslated and coding regions of genes, categorizes transcripts based on the tissue/conditions where they are expressed, and outputs a complete set of alternatively spliced transcripts. FINDER analyzes the spatial expression profile of each transcript to redefine its boundaries and/or even create newer transcripts and employs an optimized strategy to locate transcripts housing micro-exons. Finally, gene models predicted by BRAKER2 are incorporated into the annotation along with assemblies generated by PsiCLASS . We show that FINDER outperforms state-of-the-art annotation tools in constructing accurate gene structures, when executed with the same expression data.
The detailed workflow of FINDER is outlined in Fig. 1. The pipeline accepts metadata via a comma-separated values (csv) file (see Additional file 2: Table S1). Users can verify the input data using the `verifyInputsToFINDER` utility (Please check Sect. 1.5.1 of Additional file 9). Both single-end and paired-end data are accepted. The pipeline automatically downloads RNA-Seq data from NCBI SRA or the samples can be accessed locally. Multiple rounds of alignment are conducted using STAR [72, 73] with short reads, thus ensuring the capture of tissue-specific splice junctions and ultimately generates the most comprehensive set of alternatively spliced transcripts. FINDER uses PsiCLASS  to generate transcripts both at the tissue level and consolidates them to produce a consensus annotation. It employs change-point detection (CPD) using coverage data to polish intron/exon boundaries if needed. Polished transcripts are then supplied to GeneMarkS-T  to predict protein coding regions. In addition to constructing genes from expression data, FINDER uses BRAKER2  to predict genes de novo. Finally, gene models are assigned scores that reflect the confidence of prediction and evidence across different data sets. Throughout the pipeline run, intermediate temporary data is removed to optimize space usage. Proper logging of executions is implemented through ruffus .
Read alignments to the genome
Reads from each sample are aligned to the genome using STAR . FINDER accepts the location of the genomic STAR indices. If indices are not provided, then FINDER will generate them locally. FINDER implements multiple strategies to detect as many correct splice-junctions as possible. Several studies use a multi-step approach where splice junctions are detected in the first pass and then those junctions are used to guide the alignments in future passes [76, 77]. FINDER employs a similar strategy to align reads and obtain the most confident splice junctions in each tissue type and/or condition by conducting mapping in four passes (Please check Sect. 1.3 of Additional file 9 for more details).
Annotating transcripts with micro-exons
Certain genes in eukaryotes have micro-exons (i.e., exons with fewer than 50 nucleotides) [78,79,80,81] which impart important biological properties both in plants [82,83,84,85,86] and animals [87,88,89,90,91]. FINDER uses OLego  to map the reads which were reported unmapped by STAR, because OLego optimizes micro-exon sensitivity by checking intron signatures when no hits of seed sequences (~ 14 nt) are found. It is configured to align reads to exons of minimum length 2, with a minimum and maximum intron size of 20 and 10 K respectively.
Generating exon-exon transcript structure annotation with PsiCLASS
Alignments reported by STAR and OLego are combined and provided as input to PsiCLASS . Unlike traditional assemblers, PsiCLASS accepts alignments from multiple samples at the same time. It generates annotations for each sample and one consolidated gene annotation for all the samples. FINDER runs PsiCLASS with the—bamGroup option enabled which instructs PsiCLASS to preserve tissue/condition specific features. It is a fast meta-assembler generating 350 samples of output in less than three hours while running on 30 cores and consumes less than 50 GB of memory.
Polishing gene structures to optimize gene discovery
Gene structure annotations reported by PsiCLASS were polished to generate the best assemblies. Annotations generated by assemblers often have three kinds of errors that impact accuracy: (1) presence of redundant transcripts that are proper subsets of other transcripts, (2) multiple transcripts on the same strand merged into one, and (3) transcripts with ill-defined exon boundaries. Most assemblers ignore such cases to boost the speed of operation. Developing solutions to deal with these kinds of errors increases the number of correct structural annotations thereby improving downstream analysis.
FINDER uses different algorithmic and statistical approaches to deal with the above cases. To eliminate redundant transcripts, exon–intron structure of all transcripts is compared with each other to retain only unique transcripts. Even though eukaryotes possess large genomes, certain genes/transcripts are closely packed and are overlapping (Fig. 2). Reads originating from one of those genes often map to nearby overlapping genes making the task of distinctly recognizing the transcripts very challenging.
FINDER is configured to use changepoint detection (CPD) analysis to detect the descent in read coverage at the junction of two overlapping transcripts. Statistical CPD is a procedure to detect changes in the probability distribution of a stochastic process. Typically, CPD is widely used to detect changes in time series [93,94,95,96,97], but can be extended to other applications as well [98, 99]. We have found that even though CPD was developed under the assumption of normality, it can also be used where normality is violated.
In the first step in FINDER’s CPD, short read alignments to the genome are converted into number of read counts per nucleotide using bedtools . A custom python script is used to transfer the per nucleotide coverage data from the genome to the transcriptome reported by PsiCLASS. Each internal exon is considered as a potential site for the presence of changepoints if there exist premature stop codons in all the three frame translations. CPD only considers exons that have a high chance of housing a changepoint, thereby reducing duration of operation. The coverage pattern of each exon is probed to detect changepoints. The data has been modeled using an exponential distribution, and binary segmentation has been used to determines the changepoints in the exonic coverage using the ‘changepoints’ package . Read coverage of exons mimics a time series where each nucleotide position of an exon can be assumed to be a single unit of time. Coverage patterns of exons, suspected to be merged, contain a characteristic depression in the signal to split the gene models (Fig. 2a). Overlapping transcripts on opposite strands sometimes share a common exon (Fig. 2b). This negatively impacts precision since the boundaries of the predicted transcript exceed the boundaries of the transcript in the reference annotation. FINDER trims the transcript boundaries, using the changepoints, to better model the RNA-Seq coverage (Fig. 2b). These strategies improve the annotation by increasing the transcript F1 scores (Table 1).
De novo gene prediction from expression data and proteins from closely related species
Certain genes are expressed only under specific tissues and conditions . However, constructing an exhaustive set of genes expressed across all possible tissues and conditions is a daunting task due to the mammoth volume of potential expression data. Hence, approaches that can predict structures of unknown genes using information obtained from known genes are needed. Within the FINDER framework, we used BRAKER2  to predict the structure of protein coding genes. The pipeline is provided with alignment files generated by STAR and an optional, user-provided protein data file. If the previous execution fails, a second execution of BRAKER2 is launched without protein information. Genes predicted by BRAKER2 are compared to the genes obtained from expression data. To prevent too many false positives, predictions made by BRAKER2 are considered high confidence, only if those are supported by expression level or protein level evidence.
In addition to RNA-Seq data, FINDER also uses protein data (when provided), in two ways (1) to assess the veracity of the transcript models generated by BRAKER2, and (2) to align those proteins not recognized by BRAKER2 or PsiCLASS. Protein coding genes obtained from expression data and predicted by BRAKER2 are BLASTed  to the protein set provided by the user. Proteins not encountering any hits are aligned to the genome using exonerate  with a minimum threshold of 90% similarity. These alignments are augmented to the final set of gene predictions. Since these transcripts are obtained solely from proteins, they lack UTR sequences.
Prediction of coding regions
We leveraged GeneMarkS-T  to predict protein-coding regions of genes constructed from expression data. GTF files are first converted to FASTA files using the provided genome. Those FASTA files are supplied to GeneMarkS-T as inputs. GeneMarkS-T outputs coding sequence for the transcripts. CDS annotations are incorporated into the final GTF file by converting the transcriptomic coordinates to genomic coordinates.
Tissue/condition specific transcripts/gene models
Most eukaryotic genes have multiple isoforms which are derived from alternative transcripts. Expression of different transcripts can occur under different conditions in different tissues at different time points. FINDER compares assembled transcripts from each condition and prints out an association between each transcript and the provided tissue/condition (Additional file 9: Sect. 1.5).
Scoring gene models
FINDER groups genes into multiple categories based on supporting evidence. Genes that are expressed in RNA-Seq datasets, predicted by BRAKER2, and have protein evidence, are put into the high-confidence gene set. BRAKER2-predicted genes with no evidence of expression and/or proteins are treated as low confidence genes. FINDER expects a soft masked genome since it is a BRAKER2 requirement. Genes which are located in the repeat regions are marked as such and moved to the set of low-confidence genes.
Results and discussion
Choice of species for comparison
We tested the performance of FINDER primarily on three well-annotated plant organisms—Arabidopsis thaliana , Oryza sativa [107,108,109] and Zea mays [110, 111]. The genomes assemblies of these model organisms have been frequently updated and are almost complete with telomere-to-telomere sequences with fewer gaps and unknown nucleotides. In addition, their gene annotations have undergone regular improvement by mining the large number of RNA-Seq datasets available in the literature. Also, The Arabidopsis Information Resource (TAIR) provides a five-star rating system based on available evidence for each gene. Such a system offers a platform to test the quality of gene annotation software. For further evaluation, and to ensure that FINDER is able to annotate a wider range of genome types, we selected the following additional species to test: Caenorhabditis elegans , Drosophila melanogaster [113, 114], Homo sapiens [115, 116], Hordeum vulgare , and Triticum aestivum [117,118,119,120]). The genomes of these species range from small (C. elegans, D. melanogaster, A. thaliana), medium (O. sativa), to large (H. sapiens, Z. mays, H. vulgare, and T. aestivum). Finally, we evaluated FINDER on three different versions of Z. mays annotations—RefSeq , AGPv3 [111, 122] and AGPv4 [110, 123].
Metrics to assess quality of annotation
We used four metrics to compare the quality of annotations generated by each pipeline: (1) Annotation Edit Distance (AED) [42, 43, 124], (2) sensitivity, (3) specificity, and (4) F1 score. Although these metrics could be computed both at the nucleotide- and exon-level we chose to make comparisons at the transcript level since it encompasses bases, exons, and introns. An AED score of 0 indicates complete agreement of the predicted annotation with the reference, and a score of 1 denotes that the reference has not been identified in the annotation. A transcript is considered to be “recognized” only when all its intron definitions agree with at least one transcript from the predicted set. We used the Mikado “compare” utility to compare the predictions with the reference annotations . A highly sensitive annotation is one that can correctly recognize more reference transcripts. A set of annotations has high specificity when it reports minimal incorrect transcripts. For an annotation to be of good quality, both sensitivity and specificity should be high. A balanced metric is the F1 score which is the harmonic mean of sensitivity and specificity. While AED provides a good numeric assessment of how well the ground truth evidence is represented in an annotation, when individually used, it fails to capture the extent to which false positives are reported. Hence, F1 score complements AED since it incorporates both specificity and sensitivity. For evaluation purposes, we assume that the annotations achieved through community efforts are the ground truth and contain no errors.
FINDER generates more accurate gene models than BRAKER2, MAKER2 and PASA
FINDER leverages expression data to construct transcript models and employs statistical changepoint detection to enhance their structures (see “Implementation” section). Both MAKER2 and PASA were run with transcript sequences reported by PsiCLASS.
To assess FINDER’s performance, we compared the AED scores of transcript models generated by FINDER with those generated by other commonly used annotation methods. As shown in Fig. 3a, d, g, the violin plots for FINDER are broader at the base, indicating a greater number of transcripts with lower AED scores as compared to BRAKER2, MAKER, and PASA. We compared the FINDER AED scores with the AED scores reported by other pipelines using Wilcoxon’s signed rank test (More details in Additional file 9: Sect. 2.5). For all organisms (Fig. 3, Additional file 1: Figs. S2–S5 and Additional file 3: Table S2), the AED scores reported by FINDER were significantly lower (p_value < 0.01) than that of any other pipeline. Figure 3c, f, i, shows a stacked bar plot to represent the fraction of transcripts in each category of AED values. In all the cases, a higher percentage of transcripts reported by FINDER have lower AED scores (Additional file 1: Figs. S2–S5). This indicates that FINDER is capable of constructing gene structures that better comply with the reference annotations.
High-quality exhaustive annotations predict the fewest false positives thereby boosting the transcript F1 score. The transcript F1 scores of the gene models that were reported by FINDER for A. thaliana, O. sativa and Z. mays were higher than the models generated by BRAKER2, MAKER, and PASA (Fig. 3b, e, h). This same trend is observed for other tested organisms where FINDER was successful in detecting nucleotides, exons, introns, transcripts and genes (Table 1, Additional file 1: Figs. S2–S5 and Additional file 3: Table S2). MAKER2 and BRAKER2 registered a high specificity for most of the organisms because fewer transcripts were reported than FINDER. MAKER2 and BRAKER2 also had lower F1 scores, indicating less sensitivity than FINDER. Additionally, we compared the CDS regions of genes reported by FINDER with those of BRAKER2. For most of the organisms, FINDER generated transcript models with a higher F1 score (Additional file 4: Table S3). These results show that the better performance of FINDER is ensured not only due to the presence of UTRs but also due to enhanced CDS structure of gene models.
Finally, including BRAKER2 predictions and protein sequences to FINDER enhanced the gene model predictions. About 15% of the gene models reported by BRAKER2, those having high sequence similarity with the provided protein sequences were included in the final annotations (Table 2). As shown in Table 1 and Additional file 5: Table S4, including evidence at the protein level led to the identification of more genes.
Unlike BRAKER2, FINDER does not assume a homogeneous nucleotide composition of the genome . FINDER outperforms BRAKER2 while constructing gene models in complex organisms like H. sapiens, H. vulgare, and Z. mays since assemblers generating transcriptomes from alignments do not require a genome to possess homogeneous nucleotide composition.
FINDER in itself is restricted to annotate genes only in regions of the genome that are transcriptionally active. Recognizing that BRAKER2, being a gene predictor, can construct gene models in transcriptionally silent regions of the genome, FINDER is designed to incorporate the gene models predicted by BRAKER2 into the final annotations.
Distinct gene groups are accurately annotated with FINDER
Although eukaryotic genes differ from one another in terms of location, structure and the isoforms they encode, most annotation pipelines annotate and evaluate gene predictions with a global and uniform approach. The problem arises when these variances prompt each pipeline to perform differently on dissimilar groups of genes. To avoid this pitfall, we created groups of genes and transcripts based on various criteria (Table 3) and compared the performance of FINDER with BRAKER2, MAKER, and PASA for each of these sets.
On the set of UTR-containing transcripts, FINDER reported the best transcript F1 scores (Fig. 4, Additional file 1: Figs. S6, S7). Unlike BRAKER2, FINDER uses GeneMark S/T to predict CDS from the transcript sequences assembled by PsiCLASS and can hence annotate UTR regions. For most of the organisms, BRAKER2 and MAKER2 gene models register a low transcript F1 score in this category of genes. Next, we tested the performance of the annotation pipelines on transcripts that are closely located in the genome. On this set of transcripts, FINDER reported the best F1 transcript score for A. thaliana, O. sativa, and Z. mays (Fig. 4), and comparable scores for D. melanogaster (Additional file 1: Fig. S6), H. vulgare (Additional file 1: Fig. S8), and C. elegans (Additional file 1: Fig. S7) with BRAKER2. Most eukaryotic genes have multiple isoforms which differ from one another by their exon–intron definition. Splice sites and coverage information provides clues to construct such alternatively spliced transcripts. We selected genes with more than one transcript to check how well each annotation pipeline was able to detect transcript isoforms. For this case, FINDER was able to generate the best transcript structures with the highest transcript F1 score among all the pipelines gene annotation software applications (Fig. 4 and Additional file 1: Figs. S6–S9). Surprisingly, BRAKER2 fared poorly in this category despite training with all the detected splice sites from RNA-Seq data. This demonstrates that FINDER is capable of leveraging both intron splice sites and read coverages to report best transcript structures. For H. sapiens, PASA was able to generate the best transcript structures across all categories of transcripts. Adding transcripts from BRAKER2 and protein evidence improved the transcript F1 score for all the organisms, signifying the importance of incorporating de novo gene models and protein evidence.
BRAKER2 generated the best transcript annotation for the set of transcripts with a single exon (Fig. 4a, b and Additional file 1: Figs. S6–S9). Such transcripts, devoid of any introns, are difficult to construct from RNA-Seq alone. Also, the direction of the splice sites infers the direction of a transcript. Without any introns, such a single-exon transcript has to be probed for a CDS sequences' presence to infer directionality. BRAKER2 was configured to optimally predict only CDS regions of genes, hence, it performs well with the set of transcripts that have missing UTRs for organisms with small and moderate sized genomes (Fig. 4a, b and Additional file 1: Fig. S6–S9). The average number of transcripts per gene reported by BRAKER2 is lower than FINDER. While this boosts specificity, it compromises recall since BRAKER2 is not sensitive to detecting alternatively spliced transcripts. Hence, BRAKER2 accomplishes the best F1 score when tested on a set of single-transcript genes but performs poorly on a set of multi-transcript genes (Fig. 4a, b and Additional file 1: Figs. S6–S9).
Performance comparison on TAIR’s 5-star System
In order to assess the performance of the annotation pipelines on groups of genes constructed from varying levels of evidence, we used the TAIR10 5-star system. TAIR associates a quality score to each A. thaliana transcript based on the evidence used to construct the models, with five stars designating the best evidence and zero stars the least . The three categories with limited evidence (< 3 stars) have fewer than 3,000 transcripts each. BRAKER2′s performance, on the genes in these three categories, was slightly better than the rest of the annotation pipelines (Fig. 5). The other two categories (five star and four star) have 9,067 and 18,374 transcripts respectively. In both of these categories, FINDER was able to detect more transcripts than any other annotation pipeline. 51.5% and 86.4% of genes in the 5-star and 4-star category respectively were multi-exonic. In both these categories, FINDER correctly constructed more gene models compared to any other annotation pipeline (Fig. 5). FINDER reported 80% of the gene models belonging to the 4-star category—18% more than BRAKER2 (Fig. 5). Hence, it is evident from this analysis that FINDER can reconstruct the structures of most of the genes that are well-supported by underlying evidence.
Improving transcript annotations using changepoint analysis
The co-location of multiple overlapping genes on the genome strands makes it difficult to correctly annotate their structures (see “Polishing gene structures to optimize gene discovery” section). FINDER employs changepoint detection (CPD)  to split the merged transcripts reported by PsiCLASS (Fig. 2). To gauge the magnitude of improvement in transcript structures brought about by the application of CPD, we compared the accuracy of the predicted transcriptome before and after implementing CPD based on read coverage. As shown in Table 4 and Additional file 6: Table S5, implementing the CPD improved both specificity and sensitivity in organisms with small or medium-sized genomes. In A. thaliana, the transcript F1 scores increased from 40.78 to 45.95 (Table 4 and Additional file 6: Table S5) and in C. elegans it increased from 40 to 50. In large genomes, the improvement was not as significant, mainly because there are only a few genes that overlap with one another.
PsiCLASS meta-assembly works better than other approaches
We explored three popularly used software applications for merging transcriptome assemblies—StringTie-merge [77, 127,128,129,130,131,132,133], TACO [134,135,136,137,138,139] and Cuffmerge [140,141,142,143,144,145] to combine 116 A. thaliana assemblies constructed by StringTie , Scallop  and Strawberry  (Please check Sect. 3 of Additional file 9 for more details). The best assembly was reported by StringTie-merge and was hence used for all other organisms. We compared the accuracy of the consensus transcript models generated by StringTie-merge with the transcript models reported by PsiCLASS . As depicted in Table 4 and Additional file 6: Table S5, PsiCLASS generated the best transcript models for all organisms registering the highest transcript F1 score improving upon the StringTie models by up to 15%. Hence, FINDER uses only PsiCLASS to generate assemblies from short-read data.
Impact of missing untranslated region on annotation of transcripts
Gene transcription is triggered by adherence of a transcription factor in the promoter region of a gene. Promoters are typically located within 1,000 bp upstream of a gene’s transcription start site (TSS) [146,147,148]. Determining the TSS from sequencing data is best facilitated by RAMPAGE [149, 150] or CAGE-Seq , but this data is usually unavailable due to constraints imposed by cost and time. Nevertheless, a good estimate can be obtained from RNA-Seq data by assuming the start coordinates of the assembled genes as the TSS. Thus, researchers often localize their investigation to a section 500–1000 bp upstream of the assumed TSS [152, 153]. Without 5′ UTR annotation it is impossible to deduce a good approximation of the TSS. This leads to conducting promoter mining in a completely incorrect genome location. To assess the quality of 5′ UTR annotation, we plotted the difference of TSS between the reference genes and the genes reported by BRAKER2 and FINDER using a violin plot (Fig. 6). Further, we applied Wilcoxon’s rank-sum test and found that the TSS distances reported by FINDER were significantly less than that of BRAKER2 for A. thaliana and Z. mays. Interestingly, for O. sativa, BRAKER2 generated better gene models for more transcripts. Over 25% of reference gene models in O. sativa have no UTRs annotated which is higher compared to 15% UTR-less gene models in A. thaliana and Z. mays. This result illustrates that more FINDER transcripts have a TSS closer to the evidence as compared to the TSS of the transcripts reported by BRAKER2. This is an expected result since BRAKER2 was configured to annotate only CDS regions of transcripts. Table 5 highlights the number of transcripts that have better agreement with the reference TSS for FINDER and BRAKER2.
Enhancing ground truth annotations by extending untranslated regions
Official annotations of several model organisms, used as ground truth for this study, contain transcripts with missing UTR sequences. Even though UTRs do not code for proteins, they are relevant segments of a transcript involved in several important biological processes like mRNA translation [154,155,156], regulation of expression [157,158,159,160,161]] and a number of diseases [162,163,164,165,166]. In the A. thaliana TAIR10 annotations, there are 7,888 transcripts missing either UTR; 50% of these had a rating below 2 stars.
PacBio (Menlo Park, CA) offers long-read sequencing that contain both CDS and UTRs. Therefore, we used the PacBio annotations instead of the incomplete TAIR10 transcripts to assess FINDER’s performance on transcripts that were missing UTRs (Please refer to Sect. 2.6 in Additional file 9 for more details). Out of the 7,888 TAIR10 transcripts with missing UTRs, 113 transcripts were found both in the PacBio data and the 116 short-read RNA-Seq samples. We compared the FINDER annotations against these 113 transcripts. FINDER annotations were able to recall 91.55% of the nucleotides in 113 transcripts of TAIR10 and 97.86% of PacBio transcripts. The specificity of the FINDER annotations is markedly higher with PacBio transcripts (79.67%) compared to TAIR10 transcripts (72.14%). This demonstrates that FINDER enhances and improves upon the existing annotation.
The TRITEX H. vulgare annotation (Morex version r2) , released by the International Barley Sequencing Consortium (IBSC), is devoid of UTRs. We used FINDER to update and enrich the existing annotations by flanking the CDS region with UTRs on both sides. To verify the accuracy of the gene models reported by FINDER, we used PacBio full-length mRNA sequences derived from a time course of powdery mildew infected barley leaf tissue [167, 168]. A total of 7,352 gene models from IBSC, FINDER, and PacBio had a complete intron-chain match with each other. The gene structures for more than 93% (6,886 out of 7,352) of the FINDER models were improved when compared to PacBio full-length sequences (Additional file 7: Table S6). The highest F1 score achieved was 87.16. This shows that FINDER is capable of constructing accurate gene structures constituting both CDS and UTRs.
Evaluating performance with different annotations of Zea mays
Z. mays is an important model organism for crops and has been one of the most studied plants for genetics by researchers in several different fields [169,170,171,172]. Genes have been annotated in multiple ways using different kinds of data, resulting in substantial differences in gene structures . Here we compare three alternative annotation sets of Z. mays—RefSeq, AGPv3, and AGPv4 and the performance of FINDER surpassed all three approaches. The transcript F1 score for FINDER gene models compared against the NCBI gene models were 43.48, whereas the F1 scores for AGPv3 and AGPv4 were 26.69 and 22.51 respectively. We observed the same trend for other annotation pipelines and reported a higher transcript F1 score for NCBI than the AGP annotations (Table 1 and Additional file 3: Table S2). Hence, FINDER generated high-quality gene structures with high transcript F1 scores for different Z. mays annotations.
Evaluating FINDER on different clades reported by Phylostratr
Genes in each organism can be categorized by their evolutionary history [173, 174]. We used Phylostratr  to classify genes into evolutionary strata. Here we present our results on the three model organisms—A. thaliana, O. sativa, and Z. mays. For all three, FINDER was able to accurately detect more genes in highly populated strata (Fig. 7). The performance of FINDER and PASA was comparable in strata with few genes. It was surprising to note that BRAKER2 was unable to identify highly conserved genes (those from the “cellular organisms” strata) since those would be easier to predict than organism specific genes. This demonstrates that FINDER is capable of effectively constructing genes from different evolutionary backgrounds.
FINDER constructs gene models for polyploid genomes
Being a general-purpose genome annotator, in addition to diploid organisms, FINDER can annotate the genomes of polyploid organisms. We generated gene structures of Triticum aestivum, a hexaploid with 120,744 annotated genes and 146,597 transcripts . FINDER was able to detect 48,129 transcripts (39.9%). Out of the 130,582 transcripts predicted by FINDER, 48,104 (36.83%) matched perfectly with at least one reference annotation.
Identifying genes on chromosomes and deducing their structures from a plethora of evidence has been undertaken in multiple ways, with each method having advantages and disadvantages. Herein, we propose FINDER—an entirely automated, general-purpose pipeline to annotate genes in eukaryotic genomes. FINDER (1) implements an optimized mapping strategy that reduces the number of spurious mappings, (2) produces complete full-length transcripts comprising UTRs while identifying transcripts with micro-exons, (3) employs statistical CPD to modify gene boundaries and construct new genes, (4) reports more alternatively spliced transcripts as compared to other state-of-the-art annotation pipelines, and (5) assigns confidence classes to each transcript based on the evidence(s) that were used to construct those.
While FINDER’s performance has been superior to other gene annotation softwares, all the gene models reported by FINDER are predicted. Hence, a validation is necessary to ensure false positives are detected and removed. Also, future versions of FINDER will offer functionalities to leverage data from CAGE-Seq and Ribo-Seq to better annotate transcription start site and translation start sites respectively.
With a wide variety of available data for annotation, researchers often struggle to manage and optimize their usage. Several gene annotation software also offer users complicated configurations without providing substantial guidance. FINDER makes the job of gene annotation easy for bench scientists by automating the entire process from RNA-Seq data processing to gene prediction. Since FINDER does not assume the ploidy or the nucleotide composition of a genome, it could be applied to derive gene structures for a wide range of species, including non-model organisms. FINDER constructs gene models primarily from RNA-Seq data and is therefore capable of constructing tissue- and/or condition- specific isoforms which would have been impossible to obtain from ESTs only. FINDER supersedes the performance of existing software applications by utilizing read coverage information to fine-tune gene model boundaries. Instead of removing low-quality transcripts, FINDER flags them as low confidence—giving users the choice of using them as they seem fit. As a proof of concept, we provided evidence that using read coverage signal indeed enhances gene structures in a diverse set of organisms. Thus, we are confident that FINDER will pave the way for improved gene structure annotation in the future.
Availability and requirements
Project name: FINDER.
Project home page: https://github.com/sagnikbanerjee15/Finder.
Operating system(s): Linux, MacOS.
Programming language: Python, C, C++, Perl, Shell.
Other software requirements: All software requirements are listed in https://github.com/sagnikbanerjee15/Finder/blob/master/environment.yml.
Any restrictions to use by non-academics: MIT licensing restrictions apply.
Expressed sequence tags
Next generation sequencing
National Center for Biotechnology Information
Sequence read archive
Comma separated values
Annotation edit distance
Transcription start site
Central processing unit
Genome List-Genome-NCBI. https://www.ncbi.nlm.nih.gov/genome/browse/#!/overview/. Accessed 12 Jan 2021.
Morganti S, Tarantino P, Ferraro E, D’Amico P, Viale G, Trapani D, et al. Complexity of genome sequencing and reporting: next generation sequencing (NGS) technologies and implementation of precision medicine in real life. Crit Rev Oncol Hematol. 2019;133:171–82.
Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis ER. The next-generation sequencing revolution and its impact on genomics. Cell. 2013;155:27–38.
Phillips KA, Douglas MP. The global market for next-generation sequencing tests continues its torrid pace. J Precis Med. 2018;2018:4.
Kulski JK. Next-generation sequencing—an overview of the history, tools, and “Omic” applications. Next Generation Sequencing–Advances, Applications and Challenges. 2016;3–60.
Banerjee S, Mitra B, Chatterjee A, Santra A, Chatterjee B. Identification of relevant physico chemical properties of amino acids with respect to protein glycosylation prediction. In: Computing and Communication (IEMCON), 2015 International Conference and Workshop on. IEEE; 2015. p. 1–7.
Banerjee S, Basu S, Nasipuri M. Big Data Analytics and Its Prospects in Computational Proteomics. In: Information systems design and intelligent applications. Springer; 2015. p. 591–8.
Banerjee S, Velásquez-Zapata V, Fuerst G, Elmore JM, Wise RP, Elmore M. NGPINT: a next-generation protein–protein interaction software. Brief Bioinform. 2020;2020:1–14. https://doi.org/10.1093/bib/bbaa351.
Rao VS, Srinivas K, Sujini GN, Kumar GN. Protein–protein interaction detection: methods and analysis. Int J Proteom. 2014;2014:147648.
Patel S, Tripathi R, Kumari V, Varadwaj P. DeepInteract: deep neural network based protein–protein interaction prediction tool. Curr Bioinform. 2017;12:551–7.
Chen M, Ju CJ-T, Zhou G, Chen X, Zhang T, Chang K-W, et al. Multifaceted protein–protein interaction prediction based on siamese residual rcnn. Bioinformatics. 2019;35:305–14.
Yang S, Li H, He H, Zhou Y, Zhang Z. Critical assessment and performance improvement of plant–pathogen protein–protein interaction prediction methods. Brief Bioinform. 2019;20:274–87.
Li Y, Ilie L. SPRINT: ultrafast protein–protein interaction prediction of the entire human interactome. BMC Bioinform. 2017;18:485.
Velásquez-Zapata V, Elmore JM, Banerjee S, Dorman KS, Wise RP. Next-generation yeast-two-hybrid analysis with Y2H-SCORES identifies novel interactors of the MLA immune receptor. PLoS Comput Biol 2021.
Banerjee S, Ghosh D, Basu S, Nasipuri M. JUPred_MLP: Prediction of phosphorylation sites using a consensus of MLP classifiers. 2016.
Banerjee S, Ghosh D, Basu S, Nasipuri M. JUPred_SVM : Prediction of Phosphorylation Sites using a consensus of SVM classifiers. In: Proceedings of Fifth International Conference on Soft Computing for Problem Solving. Springer; 2016. p. 1–8.
Banerjee S, Nag S, Tapadar S, Ghosh S, Guha S, Bakshi S. Improving protein protein interaction prediction by choosing appropriate physiochemical properties of amino acids. In: Computing and Communication (IEMCON), 2015 International Conference and Workshop on. IEEE; 2015. p. 1–8.
Banerjee S, Basu S, Ghosh D, Nasipuri M. PhospredRF: Prediction of protein phosphorylation sites using a consensus of random forest classifiers. In: Computing and Communication (IEMCON), 2015 International Conference and Workshop on. IEEE; 2015. p. 1–7.
Luo F, Wang M, Liu Y, Zhao X-M, Li A. DeepPhos: prediction of protein phosphorylation sites with deep learning. Bioinformatics. 2019;35:2766–73. https://doi.org/10.1093/bioinformatics/bty1051.
Li F, Li C, Marquez-Lago TT, Leier A, Akutsu T, Purcell AW, et al. Quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome. Bioinformatics. 2018;34:4223–31.
Song J, Wang H, Wang J, Leier A, Marquez-Lago T, Yang B, et al. PhosphoPredict: a bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection. Sci Rep. 2017;7:1–19.
Chen H, Xue Y, Huang N, Yao X, Sun Z. MeMo: a web tool for prediction of protein methylation modifications. Nucl Acids Res. 2006;34 suppl_2:W249–53.
Eisenhaber B, Eisenhaber F. Prediction of posttranslational modification of proteins from their amino acid sequence. In: Data mining techniques for the life sciences. Springer; 2010. p. 365–84.
Elmore MG, Banerjee S, Pedley KF, Ruck A, Whitham SA. De novo transcriptome of Phakopsora pachyrhizi uncovers putative effector repertoire during infection. Physiol Mol Plant Pathol. 2020;110:101464.
Frantzeskakis L, Kracher B, Kusch S, Yoshikawa-Maekawa M, Bauer S, Pedersen C, et al. Signatures of host specialization and a recent transposable element burst in the dynamic one-speed genome of the fungal barley powdery mildew pathogen. BMC Genomics. 2018;19:381. https://doi.org/10.1186/s12864-018-4750-6.
Sperschneider J. Machine learning in plant–pathogen interactions: empowering biological predictions from field scale to genome scale. New Phytologist. 2019;nph.15771. https://doi.org/10.1111/nph.15771.
Sperschneider J, Dodds PN, Singh KB, Taylor JM. ApoplastP: prediction of effectors and plant proteins in the apoplast using machine learning. New Phytologist. 2017.
Sperschneider J, Gardiner DM, Dodds PN, Tini F, Covarelli L, Singh KB, et al. EffectorP: predicting fungal effector proteins from secretomes using machine learning. New Phytol. 2016;210:743–61. https://doi.org/10.1111/nph.13794.
Magnan CN, Baldi P. SSpro/ACCpro 5: Almost Perfect Prediction of Protein Secondary Structure and Relative Solvent Accessibility Using Profiles, Machine Learning, and Structural Similarity. Bioinformatics. 2014;:btu352.
McGuffin LJ, Bryson K, Jones DT. The PSIPRED protein structure prediction server. Bioinformatics (Oxford, England). 2000;16:404–5.
Laskowski RA, Watson JD, Thornton JM. Protein function prediction using local 3D templates. J Mol Biol. 2005;351:614–26. https://doi.org/10.1016/j.jmb.2005.05.067.
Banerjee S, Guha S, Dutta A, Dutta S. Improvement of protein disorder prediction by brainstorming consensus. In: Computing and Communication (IEMCON), 2015 International Conference and Workshop on. IEEE; 2015. p. 1–7.
Salzberg SL. Next-generation genome annotation: we still struggle to get it right. BioMed Central; 2019. https://doi.org/10.1186/s13059-019-1715-2.
del Angel VD, Hjerde E, Sterck L, Capella-Gutierrez S, Notredame C, Pettersson OV, et al. Ten steps to get started in genome assembly and annotation. F1000Research. 2018;7.
Richards S. Full disclosure: genome assembly is still hard. PLoS Biol. 2018;16:e2005894.
Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucl Acids Res. 2003;31:5654–66.
Salamov A, Solovyev V. Fgenesh multiple gene prediction program; 1998.
Solovyev V, Kosarev P, Seledsov I, Vorobyev D. Automatic annotation of eukaryotic genes, pseudogenes and promoters. Genome Biol. 2006;7:S10.
Kleffe J, Hermann K, Vahrson W, Wittig B, Brendel V. GeneGenerator—a flexible algorithm for gene prediction and its application to maize sequences. Bioinformatics (Oxford). 1998;14:232–43.
Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, et al. mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Res. 2009;19:2133–43.
Schlueter SD, Dong Q, Brendel V. GeneSeqer@ PlantGDB: gene structure prediction in plant genomes. Nucl Acids Res. 2003;31:3597–600.
Cantarel BL, Korf I, Robb SMCC, Parra G, Ross E, Moore B, et al. MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18:188–96. https://doi.org/10.1101/gr.6743907.
Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinform. 2011;12:491.
Campbell MS, Law M, Holt C, Stein JC, Moghe GD, Hufnagel DE, et al. MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol. 2014;164:513–24.
Campbell MS, Holt C, Moore B, Yandell M. Genome annotation and curation using MAKER and MAKER-P. Curr Protoc Bioinform. 2014;48:4–11. https://doi.org/10.1002/0471250953.bi0411s48.
Vonk FJ, Casewell NR, Henkel CV, Heimberg AM, Jansen HJ, McCleary RJR, et al. The king cobra genome reveals dynamic gene evolution and adaptation in the snake venom system. Proc Natl Acad Sci. 2013;110:20651–6.
Keane M, Semeiks J, Webb AE, Li YI, Quesada V, Craig T, et al. Insights into the evolution of longevity from the bowhead whale genome. Cell Rep. 2015;10:112–22.
Zhang J, Fu X-X, Li R-Q, Zhao X, Liu Y, Li M-H, et al. The hornwort genome and early land plant evolution. Nature plants. 2020;6:107–18.
Gray MW, Burger G, Derelle R, Klimeš V, Leger MM, Sarrasin M, et al. The draft nuclear genome sequence and predicted mitochondrial proteome of Andalucia godoyi, a protist with the most gene-rich and bacteria-like mitochondrial genome. BMC Biol. 2020;18:1–35.
Peng C, Ren J-L, Deng C, Jiang D, Wang J, Qu J, et al. The genome of Shaw’s sea snake (Hydrophis curtus) reveals secondary adaptation to its marine environment. Mol Biol Evol; 2020.
Weitemier K, Straub SCK, Fishbein M, Bailey CD, Cronn RC, Liston A. A draft genome and transcriptome of common milkweed (Asclepias syriaca) as resources for evolutionary, ecological, and molecular studies in milkweeds and Apocynaceae. PeerJ. 2019;7:e7649.
Zhang J, Zhang X, Tang H, Zhang Q, Hua X, Ma X, et al. Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L. Nat Genet. 2018;50:1565–73.
Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–9.
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat Biotechnol. 2011;29:644.
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–52.
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19:455–77.
Xie Y, Wu G, Tang J, Luo R, Patterson J, Liu S, et al. SOAPdenovo-Trans: de novo transcriptome assembly with short RNA-Seq reads. Bioinformatics. 2014;30:1660–6.
Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28:511.
Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33:290–5. https://doi.org/10.1038/nbt.3122.
Liu R, Dickerson J. Strawberry: fast and accurate genome-guided transcript reconstruction and quantification from RNA-Seq. PLoS Comput Biol. 2017;13:e1005851.
Shao M, Kingsford C. Accurate assembly of transcripts through phase-preserving graph decomposition. Nat Biotechnol. 2017;35:1167–9. https://doi.org/10.1038/nbt.4020.
Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 2019;20:1–13.
Song L, Sabunciyan S, Yang G, Florea L. A multi-sample approach increases the accuracy of transcript assembly. Nat Commun. 2019;10:5000. https://doi.org/10.1038/s41467-019-12990-0.
Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS: Table 1. Bioinformatics. 2016;32:767–9. https://doi.org/10.1093/bioinformatics/btv661.
Hoff KJ, Lomsadze A, Borodovsky M, Stanke M. Whole-genome annotation with BRAKER. In: Gene prediction. Springer; 2019. p. 65–95.
Steijger T, Abril JF, Engström PG, Kokocinski F, Akerman M, Alioto T, et al. Assessment of transcript reconstruction methods for RNA-seq. Nat Methods. 2013;10:1177–84.
Lomsadze A, Burns PD, Borodovsky M. Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucl Acids Res. 2014;42:e119–e119. https://doi.org/10.1093/nar/gku557.
Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008;24:637–44.
Korf I. Gene finding in novel genomes. BMC Bioinform. 2004;5:59.
Keilwagen J, Hartung F, Grau J. GeMoMa: Homology-based gene prediction utilizing intron position conservation and RNA-seq data. In: Methods in molecular biology. 2019.
Leinonen R, Sugawara H, Shumway M, Collaboration INSD. The sequence read archive. Nucl Acids Res. 2010;39 suppl_1:D19–21.
Dobin A, Gingeras TR, Spring C, Flores R, Sampson J, Knight R, et al. Mapping RNA-seq with STAR. Curr Protoc Bioinform. 2016;51:586–97.
Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21.
Tang S, Lomsadze A, Borodovsky M. Identification of protein coding regions in RNA transcripts. Nucleic Acids Res. 2015;43:e78. https://doi.org/10.1093/nar/gkv227.
Goodstadt L. Ruffus: a lightweight Python library for computational pipelines. Bioinformatics. 2010;26:2778–9.
Engström PG, Steijger T, Sipos B, Grant GR, Kahles A, Alioto T, et al. Systematic evaluation of spliced alignment programs for RNA-seq data. Nat Methods. 2013;10:1185–91.
Rapazote-Flores P, Bayer M, Milne L, Mayer C-D, Fuller J, Guo W, et al. BaRTv1.0: an improved barley reference transcript dataset to determine accurate changes in the barley transcriptome using RNA-seq. BMC Genomics. 2019;20:1–17.
Ustianenko D, Weyn-Vanhentenryck SM, Zhang C. Microexons: discovery, regulation, and function. Wiley Interdiscip Rev RNA. 2017;8:e1418.
Curry-Hyde A, Chen BJ, Mills JD, Janitz M. Microexons: novel regulators of the transcriptome. J Hum Transcript. 2018;2:1–6.
Wen F, Li F, Xia H, Lu X, Zhang X, Li Y. The impact of very short alternative splicing on protein structures and functions in the human genome. Trends Genet. 2004;20:232–6.
Sakharkar MK, Chow VTK, Kangueane P. Distributions of exons and introns in the human genome. silico biology. 2004;4:387–93.
Mano F, Aoyanagi T, Kozaki A. Atypical splicing accompanied by skipping conserved micro-exons produces unique WRINKLED1, an AP2 domain transcription factor in rice plants. Plants. 2019;8:207.
Song Q, Lv F, Tahir ul Qamar M, Xing F, Zhou R, Li H, et al. Identification and analysis of micro-exon genes in the rice genome. Int J Mol Sci. 2019;20:2685.
Bulman S, Ridgway HJ, Eady C, Conner AJ. Intron-rich gene structure in the intracellular plant parasite Plasmodiophora brassicae. Protist. 2007;158:423–33.
Wang X. Protein and proteome atlas for plants under stresses: new highlights and ways for integrated Omics in post-genomics era; 2019.
Guo L, Liu C-M. A single-nucleotide exon found in Arabidopsis. Sci Rep. 2015;5:18087.
Gonatopoulos-Pournatzis T, Wu M, Braunschweig U, Roth J, Han H, Best AJ, et al. Genome-wide CRISPR-Cas9 interrogation of splicing networks reveals a mechanism for recognition of autism-misregulated neuronal microexons. Mol Cell. 2018;72:510–24.
Consortium Gte. Human genomics. The human transcriptome across tissues and individuals. Science. 2015;348:660–5.
Irimia M, Weatheritt RJ, Ellis JD, Parikshak NN, Gonatopoulos-Pournatzis T, Babor M, et al. A highly conserved program of neuronal microexons is misregulated in autistic brains. Cell. 2014;159:1511–23.
Torres-Méndez A, Bonnal S, Marquez Y, Roth J, Iglesias M, Permanyer J, et al. A novel protein domain in an ancestral splicing factor drove the evolution of neural microexons. Nat Ecol Evol. 2019;3:691–701.
Parras A, Anta H, Santos-Galindo M, Swarup V, Elorza A, Nieto-González JL, et al. Autism-like phenotype and risk gene mRNA deadenylation by CPEB4 mis-splicing. Nature. 2018;560:441–6.
Wu J, Anczukow O, Krainer AR, Zhang MQ, Zhang C. OLego: fast and sensitive mapping of spliced mRNA-Seq reads using small seeds. Nucleic Acids Res. 2013;41:5149–63.
Kawahara Y, Sugiyama M. Change-point detection in time-series data by direct density-ratio estimation. In: Proceedings of the 2009 SIAM International Conference on Data Mining. SIAM; 2009. p. 389–400.
Lund R, Wang XL, Lu QQ, Reeves J, Gallagher C, Feng Y. Changepoint detection in periodic and autocorrelated time series. J Clim. 2007;20:5178–90.
Kawahara Y, Yairi T, Machida K. Change-point detection in time-series data based on subspace identification. In: Seventh IEEE international conference on data mining (ICDM 2007). IEEE; 2007. p. 559–64.
Takeuchi J, Yamanishi K. A unifying framework for detecting outliers and change points from time series. IEEE Trans Knowl Data Eng. 2006;18:482–92.
Aalvik Stranden S. A supervised sliding window approach for change point detection in multivariate time series; 2020.
Tartakovsky AG, Rozovskii BL, Blazek RB, Kim H. A novel approach to detection of intrusions in computer networks via adaptive sequential and batch-sequential change-point detection methods. IEEE Trans Signal Process. 2006;54:3372–82.
Klanderman MC, Newhart KB, Cath TY, Hering AS. Fault isolation for a complex decentralized waste water treatment facility. J R Stat Soc Ser C. 2020;69:931–51.
Quinlan AR. BEDTools: the Swiss-army tool for genome feature analysis. Curr Protoc Bioinformatics. 2014;47:11–2.
Killick R, Eckley I. changepoint: an R package for changepoint analysis. J Stat Softw. 2014;58:1–19.
Xiang S, Huang Z, Wang T, Han Z, Christina YY, Ni D, et al. Condition-specific gene co-expression network mining identifies key pathways and regulators in the brain tissue of Alzheimer’s disease patients. BMC Med Genomics. 2018;11:115.
Bruna T, Hoff K, Stanke M, Lomsadze A, Borodovsky M. BRAKER2: Automatic Eukaryotic Genome Annotation with GeneMark-EP+ and AUGUSTUS Supported by a Protein Database. bioRxiv. 2020.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinform. 2005;6:31.
Cheng C, Krishnakumar V, Chan AP, Thibaud-Nissen F, Schobel S, Town CD. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J. 2017;89:789–804.
Li J-Y, Wang J, Zeigler RS. The 3,000 rice genomes project: new opportunities and challenges for future rice research. Gigascience. 2014;3:2047–217.
Duitama J, Silva A, Sanabria Y, Cruz DF, Quintero C, Ballen C, et al. Whole genome sequencing of elite rice cultivars as a comprehensive information resource for marker assisted selection. PLoS ONE. 2015;10:e0124617.
Zhao K, Wright M, Kimball J, Eizenga G, McClung A, Kovach M, et al. Genomic diversity and introgression in O. sativa reveal the impact of domestication and breeding on the rice genome. PLoS ONE. 2010;5:e10780.
Wang B, Tseng E, Regulski M, Clark TA, Hon T, Jiao Y, et al. Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing. Nat Commun. 2016;7:11708. https://doi.org/10.1038/ncomms11708.
Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, et al. The B73 maize genome: complexity, diversity, and dynamics. Science. 2009;326:1112–5.
The_C_elegans_Sequencing_Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science. 1998;282:2012–8.
Ter-Hovhannisyan V, Lomsadze A, Chernoff YO, Borodovsky M. Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training. Genome Res. 2008;18:1979–90. https://doi.org/10.1101/gr.081612.108.
Drosophila_consortium. Evolution of genes and genomes on the Drosophila phylogeny. Nature. 2007;450:203.
International_Human_Genome_Sequencing_consortium. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.
Hood L, Rowen L. The human genome project: big science transforms biology and medicine. Genome Med. 2013;5:79.
Monat C, Padmarasu S, Lux T, Wicker T, Gundlach H, Himmelbach A, et al. TRITEX: chromosome-scale sequence assembly of Triticeae genomes with open-source tools. Genome Biol. 2019;20:284.
Appels R, Eversole K, Stein N, Feuillet C, Keller B, Rogers J, et al. Shifting the limits in wheat research and breeding using a fully annotated reference genome. Science. 2018;361.
Krasileva KV, Vasquez-Gross HA, Howell T, Bailey P, Paraiso F, Clissold L, et al. Uncovering hidden variation in polyploid wheat. Proc Natl Acad Sci. 2017;114:E913–21.
Clavijo BJ, Venturini L, Schudoma C, Accinelli GG, Kaithakottil G, Wright J, et al. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations. Genome Res. 2017;27:885–96.
Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, et al. RefSeq: an update on mammalian reference sequences. Nucl Acids Res. 2014;42:D756–63.
Tello-Ruiz MK, Naithani S, Stein JC, Gupta P, Campbell M, Olson A, et al. Gramene 2018: unifying comparative genomics and pathway resources for plant research. Nucl Acids Res. 2018;46:D1181–9.
Jiao Y, Peluso P, Shi J, Liang T, Stitzer MC, Wang B, et al. Improved maize reference genome with single-molecule technologies. Nature. 2017;546:524–7.
Eilbeck K, Moore B, Holt C, Yandell M. Quantitative measures for the management and comparison of annotated genomes. BMC Bioinform. 2009;10:67.
Venturini L, Caim S, Kaithakottil GG, Mapleson DL, Swarbreck D. Leveraging multiple transcriptome assembly methods for improved gene structure annotation. GigaScience. 2018;7. https://doi.org/10.1093/gigascience/giy093.
TAIR. Documentation for the TAIR gene model and exon confidence ranking system. 2009. http://plantta.jcvi.org/. Accessed 9 Oct 2020.
Sreenivasamurthy SK, Madugundu AK, Patil AH, Dey G, Mohanty AK, Kumar M, et al. Mosquito-borne diseases and Omics: tissue-restricted expression and alternative splicing revealed by transcriptome profiling of Anopheles stephensi. Omics J Integr Biol. 2017;21:488–97.
Azlan A, Obeidat SM, Yunus MA, Azzam G. Transcriptome profiles and novel lncRNA identification of Aedes aegypti cells in response to dengue virus serotype 1. BioRxiv. 2018;:422170.
Azlan A, Halim MA, Azzam G. Genome-wide identification and characterization of long intergenic noncoding RNAs in the regenerative flatworm Macrostomum lignano. Genomics. 2020;112:1273–81.
Qi S, Akter S, Li S. Identification of Novel lincRNA and Co-Expression Network Analysis Using RNA-Sequencing Data in Plants. In: Plant long non-coding RNAs. Springer; 2019. p. 207–21.
Beisel NS, Noble J, Barbazuk WB, Paul A-L, Ferl RJ. Spaceflight-induced alternative splicing during seedling development in Arabidopsis thaliana. NPJ Micrograv. 2019;5:1–5.
Wang C, Wallerman O, Arendt M-L, Sundstrom E, Karlsson A, Nordin J, et al. A new long-read dog assembly uncovers thousands of exons and functional elements missing in the previous reference. bioRxiv. 2020.
Liu S, Aagaard A, Bechsgaard J, Bilde T. DNA methylation patterns in the social spider. Stegodyphus dumicola Genes. 2019;10:137.
Wu S, Gao S, Wang S, Meng J, Wickham J, Luo S, et al. A reference genome of bursaphelenchus mucronatus provides new resources for revealing its displacement by pinewood nematode. Genes. 2020;11:570.
Wang P, Luo Y, Huang J, Gao S, Zhu G, Dang Z, et al. The genome evolution and domestication of tropical fruit mango. Genome Biol. 2020;21:1–17.
Cieślik M, Chinnaiyan AM. Cancer transcriptome profiling at the juncture of clinical translation. Nat Rev Genet. 2018;19:93.
Lorenzi L, Avila Cobos F, Decock A, Everaert C, Helsmoortel H, Lefever S, et al. Long noncoding RNA expression profiling in cancer: challenges and opportunities. Genes Chromosom Cancer. 2019;58:191–9.
Yang J, Moeinzadeh M-H, Kuhl H, Helmuth J, Xiao P, Haas S, et al. Haplotype-resolved sweet potato genome traces back its hexaploidization history. Nature plants. 2017;3:696–703.
Sun Z, Nair A, Chen X, Prodduturi N, Wang J, Kocher J-P. UClncR: ultrafast and comprehensive long non-coding RNA detection from RNA-seq. Sci Rep. 2017;7:1–10.
Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012;7:562–78.
Arrigoni A, Ranzani V, Rossetti G, Panzeri I, Abrignani S, Bonnal RJP, et al. Analysis RNA-seq and Noncoding RNA. In: Polycomb group proteins. Springer; 2016. p. 125–35.
Ghosh S, Chan C-KK. Analysis of RNA-Seq data using TopHat and Cufflinks. In: Plant Bioinformatics. Springer; 2016. p. 339–61.
Qi X, Xie S, Liu Y, Yi F, Yu J. Genome-wide annotation of genes and noncoding RNAs of foxtail millet in response to simulated drought stress by deep sequencing. Plant Mol Biol. 2013;83:459–73.
Marchant A, Mougel F, Mendonça V, Quartier M, Jacquin-Joly E, da Rosa JA, et al. Comparing de novo and reference-based transcriptome assembly strategies by applying them to the blood-sucking bug Rhodnius prolixus. Insect Biochem Mol Biol. 2016;69:25–33.
Li A, Zhang J, Zhou Z, Wang L, Liu Y, Liu Y. ALDB: a domestic-animal long noncoding RNA database. PLoS ONE. 2015;10:e0124003.
Cooper SJ, Trinklein ND, Anton ED, Nguyen L, Myers RM. Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res. 2006;16:1–10.
Brown RH, Gross SS, Brent MR. Begin at the beginning: predicting genes with 5′ UTRs. Genome Res. 2005;15:742–7.
Ohler U, Liao G, Niemann H, Rubin GM. Computational analysis of core promoters in the Drosophila genome. Genome Biol. 2002;3:research0087–1.
Batut P, Gingeras TR. RAMPAGE: promoter activity profiling by paired-end sequencing of 5′-complete cDNAs. Curr Protoc Mol Biol. 2013;104:25B-B11.
Adiconis X, Haber AL, Simmons SK, Levy Moonshine A, Ji Z, Busby MA, et al. Comprehensive comparative analysis of 5′-end RNA-sequencing methods. Nat Methods. 2018;15:505–11.
Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H, et al. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci. 2003;100:15776–81.
Holmqvist E, Wright PR, Li L, Bischler T, Barquist L, Reinhardt R, et al. Global RNA recognition patterns of post-transcriptional regulators Hfq and CsrA revealed by UV crosslinking in vivo. EMBO J. 2016;35:991–1011.
Hickman R, van Verk MC, van Dijken AJH, Mendes MP, Vroegop-Vos IA, Caarls L, et al. Architecture and dynamics of the jasmonic acid gene regulatory network. Plant Cell Online. 2017;:tpc-00958.
Jackson RJ, Standart N. Do the poly (A) tail and 3′ untranslated region control mRNA translation? Cell. 1990;62:15–24.
Meijer HA, Thomas AAM. Control of eukaryotic protein synthesis by upstream open reading frames in the 5′-untranslated region of an mRNA. Biochem J. 2002;367:1–11.
Miller GM, Madras BK. Polymorphisms in the 3′-untranslated region of human and monkey dopamine transporter genes affect reporter gene expression. Mol Psychiatry. 2002;7:44–55.
Wu S, Huang S, Ding J, Zhao Y, Liang L, Liu T, et al. Multiple microRNAs modulate p21Cip1/Waf1 expression by directly targeting its 3′ untranslated region. Oncogene. 2010;29:2302–8.
Dixon DA, Kaplan CD, McIntyre TM, Zimmerman GA, Prescott SM. Post-transcriptional control of cyclooxygenase-2 gene expression The role of the 3′-untranslated region. J Biol Chem. 2000;275:11750–7.
Gu S, Jin L, Zhang F, Sarnow P, Kay MA. Biological basis for restriction of microRNA targets to the 3′ untranslated region in mammalian mRNAs. Nat Struct Mol Biol. 2009;16:144.
Eberle AB, Stalder L, Mathys H, Orozco RZ, Mühlemann O. Posttranscriptional gene regulation by spatial rearrangement of the 3′ untranslated region. PLoS Biol. 2008;6:e92.
Halterman DA, Wise RP. Upstream open reading frames of the barley Mla13 powdery mildew resistance gene function co-operatively to down-regulate translation. Mol Plant Pathol. 2006;7:167–76.
Awata T, Inoue K, Kurihara S, Ohkubo T, Watanabe M, Inukai K, et al. A common polymorphism in the 5′-untranslated region of the VEGF gene is associated with diabetic retinopathy in type 2 diabetes. Diabetes. 2002;51:1635–9.
Rogers JT, Randall JD, Cahill CM, Eder PS, Huang X, Gunshin H, et al. An iron-responsive element type II in the 5′-untranslated region of the Alzheimer’s amyloid precursor protein transcript. J Biol Chem. 2002;277:45518–28.
Chin LJ, Ratner E, Leng S, Zhai R, Nallur S, Babar I, et al. A SNP in a let-7 microRNA complementary site in the KRAS 3′ untranslated region increases non-small cell lung cancer risk. Can Res. 2008;68:8535–40.
Halterman DA, Wei F, Wise RP. Powdery mildew-induced Mla mRNAs are alternatively spliced and contain multiple upstream open reading frames. Plant Physiol. 2003;131:558–67. https://doi.org/10.1104/pp.014407.
Conne B, Stutz A, Vassalli J-D. The 3′ untranslated region of messenger RNA: a molecular ‘hotspot’for pathology? Nat Med. 2000;6:637–41.
Hunt M, Banerjee S, Surana P, Liu M, Fuerst G, Mathioni S, et al. Small RNA discovery in the interaction between barley and the powdery mildew pathogen. BMC Genomics. 2019;20:610.
Chapman AVE, Matthew H, Surana P, Velásquez-Zapata V, Xu W, Fuerst G, et al. Disruption of barley immunity to powdery mildew by an in-frame Lys-Leu deletion in the essential protein SGT1. Oxford Genetics. 2020.
Dai X, Xu Z, Liang Z, Tu X, Zhong S, Schnable JC. Non-homology-based prediction of gene functions. 2019;1–18.
Duvick DN. The contribution of breeding to yield advances in maize (Zea mays L.). Adv Agronomy. 2005;86:83–145.
Agrama HAS, Moussa ME. Mapping QTLs in breeding for drought tolerance in maize (Zea mays L.). Euphytica. 1996;91:89–97.
Maazou A-RS, Tu J, Qiu J, Liu Z. Breeding for drought tolerance in maize (Zea mays L.). Am J Plant Sci. 2016;7:1858.
Bhandary P, Seetharam AS, Arendsee ZW, Hur M, Wurtele ES. Raising orphans from a metadata morass: a researcher’s guide to re-use of public ’omics data. Plant Sci. 2018. https://doi.org/10.1016/j.plantsci.2017.10.014.
Arendsee ZW, Li L, Wurtele ES. Coming of age: orphan genes in plants. Trends Plant Sci. 2014;19:698–708. https://doi.org/10.1016/J.TPLANTS.2014.07.003.
Arendsee Z, Li J, Singh U, Seetharam A, Dorman K, Wurtele ES. phylostratr: a framework for phylostratigraphy. Bioinformatics. 2019;35:3617–27.
This research used resources provided by the SCINet project of the USDA Agricultural Research Service, ARS Project No. 0500-00093-001-00-D. The authors are also thankful to Dr. Karin Dorman (Professor, Department of Statistics, Iowa State University), for providing insightful feedbacks to compare annotations and for implementing changepoint detection. The authors thank Gregory Fuerst for taking care of submitting data to NCBI. Finally, the authors thank Dr. Eve Wurtele (Professor, Department of Genetics, Development and Cell Biology, Iowa State University) for permitting her student Priyanka Bhandari to collaborate on this work.
This research was supported by the US. Department of Agriculture, Agricultural Research Service, Project No. (5030-21000-068-00D) and (3625-21000-067-00D) through the Corn Insects and Crop Genetics Research Unit and Project No. (2030-21000-024-00D) through the Crop Improvement and Genetics Research Unit. Research supported in part by Oak Ridge Institute for Science and Education (ORISE) under US Department of Energy (DOE) contract number DE-SC0014664 to SB and National Science Foundation—Plant Genome Research Program Grant 13-39348 to RPW. PB received funding from National Science Foundation Grant (IOS 1546858, in part); Orphan Genes, “An Untapped Genetic Reservoir of Novel Traits”. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA, ARS, DOE, ORAU/ORISE or the National Science Foundation. USDA is an equal opportunity provider and employer.
Ethics approval and consent to participate
Consent for publication
The authors declare no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary figures (S1–S9).
Input to finder.
Annotation edit distance of reference transcripts as reported by each gene annotation pipeline.
Performance of gene annotation pipelines on coding regions of transcripts.
Comparison of FINDER's performance with other gene annotation pipelines on a variety of different species.
Comparison of different transcriptome assembly softwares on a variety of species.
Improvement in reference gene annotation after adding untranslated regions verified with long-read from PacBio assemblies.
Description of RNA-Seq data used to execute FINDER, BRAKER2, MAKER2 and PASA.
Supplementary text document outlining methods and some results in more details.
About this article
Cite this article
Banerjee, S., Bhandary, P., Woodhouse, M. et al. FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences. BMC Bioinformatics 22, 205 (2021). https://doi.org/10.1186/s12859-021-04120-9
- Eukaryotic gene annotation
- Gene prediction
- Optimized RNA-Seq alignment
- Changepoint detection