A modified GC-specific MAKER gene annotation method reveals improved and novel gene predictions of high and low GC content in Oryza sativa

Background Accurate structural annotation depends on well-trained gene prediction programs. Training data for gene prediction programs are often chosen randomly from a subset of high-quality genes that ideally represent the variation found within a genome. One aspect of gene variation is GC content, which differs across species and is bimodal in grass genomes. When gene prediction programs are trained on a subset of grass genes with random GC content, they are effectively being trained on two classes of genes at once, and this can be expected to result in poor results when genes are predicted in new genome sequences. Results We find that gene prediction programs trained on grass genes with random GC content do not completely predict all grass genes with extreme GC content. We show that gene prediction programs that are trained with grass genes with high or low GC content can make both better and unique gene predictions compared to gene prediction programs that are trained on genes with random GC content. By separately training gene prediction programs with genes from multiple GC ranges and using the programs within the MAKER genome annotation pipeline, we were able to improve the annotation of the Oryza sativa genome compared to using the standard MAKER annotation protocol. Gene structure was improved in over 13% of genes, and 651 novel genes were predicted by the GC-specific MAKER protocol. Conclusions We present a new GC-specific MAKER annotation protocol to predict new and improved gene models and assess the biological significance of this method in Oryza sativa. We expect that this protocol will also be beneficial for gene prediction in any organism with bimodal or other unusual gene GC content. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1942-z) contains supplementary material, which is available to authorized users.


Background
Most widely used gene prediction programs depend on Hidden Markov Models (HMMs) to predict gene structure within genomic sequence [1][2][3]. Typically, genes are modeled within HMMs using a series of hidden states that represent generic gene structure. The hidden states are filled with transition probabilities based on k-mer sequences taken from the genes that are used to train the HMM. It is known that gene GC content can affect gene predictions. Korf found that accuracy of predicting genes in one species using a SNAP HMM that was trained for a second species was more correlated with the GC content of the two species' genomes than with the phylogenetic distance between the two species [1]. Additionally, in mammalian genomes, which contain so-called isochores, gene GC content is correlated with the GC content of the surrounding genome. The AU-GUSTUS gene prediction program has a feature that trains multiple HMMs that are each specialized for different narrow isochore-specific GC ranges in order to improve gene predictions [4][5][6].
We perceived that two factors might limit the accuracy of gene prediction in grass genomes. First, in many species including most plants, the GC content of genes has a relatively narrow and unimodal distribution, but in the grasses (Poaceae), the GC content of genes has a broad bimodal distribution ( Fig. 1a; [4,[7][8][9][10]). The bimodal distribution of GC-content in the grasses suggests that there exist two classes of genes (high GC and low GC) that the gene prediction programs are attempting to learn.
While gene prediction programs perform well with grasses [11], we hypothesized that the accuracy of grass gene predictions could be improved by accounting for the high and low GC gene classes. Furthermore, as with any supervised machine learning technique, we expect that it is difficult to predict grass genes at the tails of the natural GC distribution and that some grass genes may not be predicted at all using existing protocols. Second, grass gene GC content is not well correlated with the surrounding genomic regions ( Fig. 1b; [10,12,13]), and therefore, grass genomes do not contain isochores. We also predict that grass genome annotation will not benefit from analysis by the isochore-sensitive AUGUSTUS protocol [6]. Therefore, it is probable that gene annotation in grasses can be improved further.
MAKER is a commonly used structural annotation engine that has been used to annotate numerous plant genome assemblies [11,[14][15][16][17][18]. The MAKER gene annotation pipeline makes it very easy to train and then predict gene models from commonly used ab initio gene prediction programs, such as SNAP and AUGUSTUS [1,6,19]. We developed a new GC-specific MAKER protocol that makes use of genes with high and low GC content as training data in order to derive separate versions of the SNAP and AUGUSTUS HMMs that are tuned to accurately predict high and low GC genes. Using this new method, we improved regular MAKER gene predictions in Oryza sativa (rice) relative to available transcript and protein evidence. Furthermore, we identified novel genes with high and low GC content that had not been predicted by the standard MAKER protocol. Comparisons to the AUGUSTUS isochore-based prediction method as well as to the standard MAKER protocol showed that this GC-specific MAKER protocol shifts the overall GC content of predicted gene models both higher and lower than the standard MAKER protocol. This new GC-specific MAKER annotation method will be of interest to anyone working on structural annotation of genomes with bimodal GC content but will likely improve the annotation of any genome.

Results
Reannotation of the O. sativa genome with MAKER using HMMs trained on high and low GC content We thought that grass genes identified by gene prediction programs that are trained on genes with specific GC content could both find different genes and produce differing gene models at identical loci than prediction programs that are trained on genes with random GC content. We tested this hypothesis by reannotating the genome of O. sativa. In order to compare gene models within the O. sativa ssp. Nipponbare genome (v7 assembly; [20]) based on the GC content of different HMM training sets, three MAKER structural annotations were completed using a modified method. SNAP and AU-GUSTUS HMMS were trained either with training genes randomly picked without regard for GC content, with training genes with low GC content or with training genes with high GC content. The standard MAKER annotation using HMMs trained on randomly selected training genes for SNAP and AUGUSTUS predicted 29,133 gene models with transcript evidence and/or Pfam protein domains. The structural annotation based on high GC HMMs produced 26,063 evidence supported gene models, and the MAKER annotation based on low GC HMMs produced 26,559 evidence supported models ( Table 1). The average length of transcripts was very similar for the standard and low GC structural annotations ( Table 1). The average transcript length of the high a b Fig. 1 Bimodal distribution and coding region GC content in Oryza sativa. a Distribution of GC content of IRGSP v7 predicted coding regions. b GC content of IRGSP v7 predicted coding regions vs. genomic GC content 5Kb upstream and downstream of predicted coding regions GC predictions was considerably shorter, a trend that has been previously discussed in eukaryotic genomes [21].
The distribution of GC content of the gene predictions varied greatly (Fig. 2). The standard MAKER annotation has a bimodal distribution of gene GC content with a major peak at 49% and a minor peak at 68%. The GC distribution of the high GC annotation has a unimodal distribution with a major peak at 68%. The low GC annotation has a bimodal distribution with peaks at 47 and 67%. Notably, few low GC genes were predicted by the high GC HMMs, and a lower percentage of high GC genes were predicted by the low GC HMMs compared to the standard GC neutral MAKER annotation.
The SNAP and AUGUSTUS HMMs created for the standard, high and low GC MAKER structural annotations were also used together in a single MAKER run to produce a six HMMs annotation (Fig. 3). For this annotation, up to six ab initio predictions could be produced at a single locus, but when provided with multiple gene predictions at a single locus, MAKER chooses the single best gene model at that locus. The six HMMs annotation contained 29,942 evidence supported gene predictions ( Table 1). The GC distribution for the six HMMs gene set was bimodal with a major peak at 48% and second peak at 68% (Fig. 3). In comparison to the MSU Rice Genome Annotation Project (MSU-RGAP; Release 7) annotation [20], 2448 gene predictions were unique to the MAKER six HMMs annotation of O. sativa while 7004 gene models found in the MSU annotation were missing from the six HMMs annotation (Additional file 1). Using BUSCO [22] to assess the completeness of the six HMMs annotation, we found that the six HMM predictions contain a high number of complete BUSCO matches (86.2% complete; 3.9% fragmented; 9.9% missing), but that the MSU-RGAP does match more BUSCO sets (95.6% complete; 2.5% fragmented; 1.9% missing).
To assess the impact of high and low GC specific HMMs on the structural annotation of O. sativa, GC content and annotation edit distance (AED) scores were plotted for each set of predicted gene models and visualized as heatmaps (Fig. 4). AED scores are assigned by MAKER and can be used to assess gene prediction quality [23]. AED measures the concordance of a gene prediction relative to the transcript and protein evidence that supports it. AED scores range between 0 and 1, where 0 indicates perfect concordance between the gene prediction and evidence and 1 indicates that no transcript or protein supports the prediction. Genes predicted by HMMs trained on specific GC content caused a general shift in the GC distribution of predicted gene models for both the high and low GC annotations, in comparison to the standard MAKER annotation ( Fig. 4a  and b). In addition to this shift, standard MAKER gene predictions were improved by high or low GC HMMs as determined by a decrease in AED scores between overlapping gene predictions from the standard MAKER and high or low GC HMMs annotations (Fig. 4e, f ). The number of standard protocol gene models improved in the six HMMs annotation was 3740. The number or percent of genes with AED scores less than 0.5 (AED 0.5 ) can be used for genome wide assessment of annotation quality. The percentages of AED 0.5 genes were similar for all three annotations ( Table 1). The high percentage of well-supported gene predictions reflects the quality of transcriptome evidence provided during the structural annotation process.

Comparison of MAKER six HMMs method to alternative MAKER approaches
The results of the MAKER six HMMs structural annotation were compared to MAKER genome annotations where combinations of the SNAP and AUGUSTUS gene prediction programs were used with alternative parameters. As AUGUSTUS can be run so that it considers GC content of the genomic region (isochores) in which a gene prediction is made, we also trained AUGUSTUS in its isochore-sensitive mode and used it to make gene predictions within MAKER. Overall, the MAKER six HMMs annotation produced more genes than any other annotation strategy tested here (Table 1, Additional file 2: Table S1). MAKER run with only SNAP identified more evidence supported genes than either AUGUSTUS alone trained with randomly chosen training data or the isochore-specific AUGUSTUS protocol. Only a few hundred more genes were generated by the isochore-specific AUGUSTUS annotation than by the randomly trained AUGUSTUS HMM. Using randomly trained SNAP with either randomly trained or isochore-specific AUGUS-TUS produced similar numbers of gene predictions but more than when MAKER is run with any of these programs alone. The number of AED 0.5 gene predictions follows a similar trend to the total number of gene predictions made by any annotation protocol (Table 1; Additional file 2: Table S1; Additional file 3: Figure S1). However, as more genes are identified by a particular annotation method, the proportion of AED 0.5 genes decreases. The isochore-specific AUGUSTUS and the randomly trained AUGUSTUS and SNAP gene predictions did not vary in overall GC content (Additional file 3: Figure S2). For any machine learning protocol, different sets of training data can lead to slightly different prediction results. To ensure that the results that we observed when we trained SNAP and AUGUSTUS on high and low GC content training data sets were not random, we repeated the standard MAKER annotations three times using independently generated training data. The number of predicted gene models differed by less than 150 in the three randomly replicated standard MAKER annotations (Additional file 2: Table S2), and the AED cumulative frequency plots were nearly identical (Additional file 3: Figure S3).

Identification of novel high and low GC content genes
In addition to the improved high and low GC structural annotations created with the MAKER six HMMs annotation protocol, we discovered novel gene predictions specific to the annotations from the high and low GC HMMs. The low GC annotation contained 369 novel genes, while the high GC annotation contained 282 novel genes. Interestingly, the novel genes predicted by the low GC HMMs did not always have a low GC content, and some of the novel genes predicted by the high GC HMMs did not have high GC content (Figs. 2 and 4c, d). The locations of the novel high and low GC HMM predictions were distributed across all 12 O. sativa chromosomes (Table 2; Additional file 4). Of the novel high GC HMM predictions, 253 genes (90%) had some level of protein or transcript evidence for the prediction, while 324 (88%) novel low GC HMM predictions Fig. 3 Six HMMs MAKER structural annotation method. The center workflow depicts the standard method for training hidden markov models for use in MAKER, while the low GC (top) and high GC (bottom) training methods can be used after creating high and low GC HMM training data sets. After separately training HMMs with the low and high GC training data, all three SNAP HMMs and all three AUGUSTUS HMMs were specified in the maker_opts.ctl file (see the Methods section), and MAKER was run to create the six HMMs annotation, which incorporates gene predictions from the standard, high and low GC MAKER runs had protein or transcript support (Fig. 5). Overall, the AED scores increased as GC content increased for the novel high GC HMM predictions and as GC content decreased for the novel low GC HMM predictions ( Fig. 4c and d). The mean length of the novel high GC genes was 640 bp, while the novel low GC genes were on average 748 bp in length. The novel high and low GC gene predictions are shorter than the mean lengths of the original MAKER, six HMMs, high GC and low GC gene predictions (Table 1). Gene lengths were more widely distributed for predictions generated by the original, six HMMs, high and low GC methods while the distribution of gene lengths of the novel GC genes were more narrow and peaked at around 350 bp (Additional file 5). Plotting the effective codon number of novel high and low GC genes and the MSU-RGAP rice gene annotations against gene Fig. 4 Heatmap visualization of annotated edit distance (AED) and GC content of MAKER predicted gene models. a MAKER genes predicted using the high GC HMMs. b MAKER genes predicted using the low GC HMMs. c Novel genes predicted using the high GC HMMs. d Novel genes predicted using the low GC HMMs. e Gene predictions from the high GC HMMs that improved gene predictions made by the standard MAKER protocol. f Gene predictions from the low GC HMMs that improved gene predictions made by the standard MAKER protocol. g Gene predictions from the standard MAKER protocol. h Gene predictions from the MAKER six HMMs annotation GC content at synonymous sites (GC3s) shows a correlation between effective codon usage and GC3 percent (Additional file 6). The majority of novel high and low GC genes and MSU-RGAP rice genes fall below the solid line that represents expected effective codon usage under the null model where there is no selection on codon usage. This indicates that some selective pressure affects rice codon usage beyond compositional variation. However, the observed deviation from the null model is slight compared to species that exhibit extreme codon usage bias [24][25][26][27]. Additionally, codon usage variation is similar between the MSU-RGAP rice genes and the novel high and low GC genes (Additional file 6).
We also compared these novel six HMMs gene predictions to the MSU-RGAP Release 7 gene set and found 112 of the novel low GC HMM predictions and 167 of the novel high GC HMM predictions were present in that high-quality gene set [20]. Additional functional characterization of the novel high and low GC genes was performed using gene ontology enrichment, but the novel genes were not found to be enriched in any functional terms (data not shown).

Orthology of novel high and low GC genes to genes from other grass species
Using the total predictions generated through the MAKER six HMMs annotation, additional support was given to the novel predictions made by the high GC and low GC HMMs by first assessing sequence homology of the novel gene predictions to the NCBI non-redundant protein database [28]. Of the 651 novel predictions, 387 had a significant BLASTP hit (e-values less than 1e-10) to NCBI's non-redundant protein database. Second, the homology and orthology of these genes was evaluated relative to other MAKER six HMM predictions and Brachypodium distachyon, Sorghum bicolor and Zea mays using OrthoMCL [29][30][31][32] (Additional file 7). Of the novel high GC predictions, 51 genes were placed into orthogroups, with 19 as putative homologs only with other MAKER six HMMs predictions, and 32 were orthologous to genes from at least one of the other grass species. Interestingly, 23 novel high GC genes represented the only rice predictions in their orthogroups, and 11 novel high GC genes were single copy orthologs with the other grasses. Of the novel low GC predictions, 92 genes were placed into orthogroups, with 34 as putative homologs only to other MAKER six HMMs gene predictions, and 58 orthologous to the other grass species. Twelve novel low GC predictions were the only rice representatives in their orthogroups.
Translating ribosome affinity purification (TRAP) sequencing and RNA-seq provide additional support for novel high and low GC gene predictions In an effort to demonstrate additional support for the new GC specific gene models outside of the transcript data provided during the MAKER annotation process, TRAP-seq sequencing data isolated from callus, panicle and seedling tissues of an O. sativa line with a modified RPL18 transgene [33,34] were analyzed in the same  [33]. The calculated TEI of each of the novel genes predicted by the high and low GC HMMs that had TRAP-seq pseudoalignments indicates tissue specificity (Fig. 6). Additionally, RNAsequencing data from a variety of rice tissues, abiotic and biotic stresses were pseudoaligned to the MAKER six HMMs annotation [35,36]. RNA-seq reads were aligned to 262 (93%) of the 282 novel high GC HMM predictions and 329 (89%) of 369 of the novel low GC HMM predictions (Additional files 8 and 9). In combination, the TRAP-seq and RNA-seq data indicated that in addition to the transcript data already aligned to these predictions during annotation, a majority of these novel predictions are in fact being actively transcribed in various tissues from O. sativa with both tissue and treatment specificity.

Discussion
Ab initio gene prediction programs employ HMMs trained on gene sets that should be representative of the variation in gene nucleotide content. We hypothesized that in grass genomes, where genes have a wide variation in GC content and where that distribution is bimodal (Fig. 1a), gene prediction programs trained on random sets of training data would be overly generalized and that this could result in poorly predicted gene models with high or low GC contents. To address this, we developed a GC-specific MAKER gene annotation protocol that trains gene prediction programs SNAP and AUGUSTUS using training data with both high and low GC content. The resulting high-GC and low-GC SNAP and AUGUSTUS HMMs were used in addition to the regularly trained SNAP and AUGUSTUS HMMs to predict genes within MAKER (Fig. 3). We tested the six HMMs protocol by reannotating the O. sativa genome, and we identified 29,942 genes with transcript, protein or Pfam protein domain support. As expected, when MAKER predicted genes in the O. sativa genome using either the high-GC or low-GC SNAP and AUGUSTUS HMMs, the GC content of the resulting gene predictions were shifted higher or lower, respectively, compared to the GC content of genes predicted by the standard MAKER protocol (Fig. 2). Furthermore, the GC content distribution of genes predicted by the MAKER six HMMs protocol also showed a shift of the bimodal peaks to higher and lower GC values (Fig. 2). Importantly, most gene predictions made by the MAKER six HMMs annotation overlapped with loci predicted by the standard MAKER protocol, but in 3740 of these cases, the predictions made by the MAKER six HMMs protocol were improved over the standard MAKER predictions as shown by the better evidence support (i.e. lower AED scores) (Fig. 4e, f). This indicates that the high and low GC HMMs were often able to improve upon gene predictions made by the more generally trained gene prediction programs. In addition to improving the annotation of many genes, we also identified novel genes using this protocol. We found 651 genes that had been identified by high-GC or low-GC SNAP or AUGUSTUS HMMs but that had not been predicted using the standard MAKER pipeline. Of these newly identified genes, 372 were also not found in the most recent MSU-RGAP Release 7 structural annotation [20]. The 279 novel genes predicted by the high-GC or low-GC HMMs that were previously found in the MSU-RGAP Release 7 were likely predicted by MSU-RGAP due to the use of Fgenesh for gene identification, which may have its own biases related to GC content [20,37], or due to the use of different transcript and protein evidence (Additional file 1). Additionally, the MSU-RGAP annotation was improved by PASA, which improves de novo gene predictions with transcript alignment evidence, and therefore, PASA is likely not biased by GC content in the same way that HMM-based gene prediction programs can be affected [38]. Furthermore, 90 of the novel genes identified by the high-GC and low-GC HMMs were found to be orthologous to genes from other grass species or to other MAKER six HMMs gene predictions within O. sativa (Additional file 7). Additional support for the novel gene predictions comes from examining a TRAP sequencing data set that indicates that 67% of these new predictions are being actively transcribed in three different tissues from O. sativa [33] (Fig. 6). RNA-sequencing data from rice tissues and from abiotic and biotic stress experiments show high levels of tissue and treatment specific expression and lend further support to the validity of the novel high and low GC gene predictions (Additional files 8 and 9). Nonetheless, as with all computational gene prediction methods, the novel gene models identified by the GCspecific MAKER protocol should be further vetted through additional laboratory analysis.
There are 7004 genes in the MSU-RGAP Release 7 data set that were not predicted by the six HMMs annotation. Of these genes, 4635 are characterized as "expressed" meaning that they only have transcript support. An additional 1324 MSU-RGAP genes missing from the six HMMs annotation are described as "hypothetical", which indicates that they have no transcript or protein support, but they may contain a conserved protein domain (Additional file 1). The hypothetical MSU-RGAP genes are the genes with the weakest support from that annotation project. Some of the MSU-RGAP hypothetical genes may not pass the stringent evidence test that was applied to the MAKER six HMMs gene predictions which all had transcript or protein support or contained a Pfam domain. The MSU-RGAP genes that are missing from the six HMMs predictions are also rather short overall (Additional file 10). The mean length of the CDSes from these missed MSU-RGAP genes is 564 bp, and the mode is 243 bp, which is similar to the lengths of the novel six HMMs genes (Additional file 5). Another small portion of the missing genes from the six HMMs annotation are either chloroplast or mitochondria related. Additionally, the MAKER six HMMs annotation only used transcript evidence derived from StringTie assemblies of a small set of RNA-seq reads, but the MSU-RGAP annotation made use of EST (expressed sequence tags) and FL-cDNA (full length complimentary DNA) sequences that were not used to aid annotation in this report. We purposefully did not use an overly extensive collection of transcript evidence in this report as we had wanted to test our new protocol with transcript evidence that is similar to that which is typically available for new genome annotation projects. This limited transcript evidence set necessarily reduced the predictive power of our gene finder programs that can be heavily influenced by external evidence [11,14,18]. Finally, the MAKER six HMMs annotation was filtered to remove any predictions that had homology to known transposable elements (TE) and Pfam domains. The MSU-RGAP genes were also filtered to flag any genes with matches to a library of TE sequences, but these two methods were necessarily different and could have resulted in the removal of different subsets of TE-related gene predictions. Nonetheless, after discounting the "hypothetical" and "expressed" genes, there are 1045 highquality genes from the MSU-RGAP annotation that were not present in the six HMMs annotation. This set of missed genes may contain the BUSCO gene models that were found to be missing in the six HMMs annotation. A full list of all MSU-RGAP genes missing from the six HMMs annotation can be found in Additional file 10.
Interestingly, there may be additional unrecognized parameters that could be used to improve gene prediction besides our strategy of training gene prediction HMMs in a GCspecific fashion. In the six HMMs annotation, some low GC predictions were generated by the high GC HMMs, and some high GC predictions came from the low GC HMMs (Fig. 4a, b). While these could be cases of identical gene models being created by two or more HMMs at a particular locus with MAKER randomly retaining only one prediction as the final model for the locus, we also observed novel low GC predictions created by high GC HMMs as well as novel high GC predictions arising from low GC HMMs (Figs. 3  and 4c, d). This suggests that some unrecognized gene features besides simple GC content were present in the high and low GC HMMs that allowed the prediction of novel low and high GC genes, respectively.
It has been known that the GC content of genes used to train gene prediction HMMs can affect the accuracy of gene predictions [1,6]. The AUGUSTUS gene finder has an isochore-sensitive protocol that was developed in order to more accurately predict mammalian genes.
Despite the fact that isochores do not exist in plants ( Fig. 1; [12,13]), we used the isochore-sensitive AUGUSTUS protocol to predict genes in O. sativa, but we did not see a substantial difference in the number or quality of predicted gene models or a change in overall GC content distribution of those gene predictions (Additional file 3: Figures S1 and S2). This result was expected as gene GC content is not well correlated with the GC content of the surrounding genomic region, and therefore, partitioning the training data before training the gene prediction programs was found to be more effective at improving gene annotations in O. sativa.
Given the importance of accurate gene prediction to downstream genomics applications, the GC-specific MAKER protocol described here will be of use to those working on the structural annotation of any species with a bimodal distribution of GC content. MAKER is a powerful tool that enables research groups of any size to pursue structural annotation of sequenced genomes and, with the addition of this protocol, will aid in more accurate gene prediction.

Conclusions
In this paper we presented a new GC-specific MAKER annotation protocol that was used to successfully identify new evidence supported gene models in Oryza sativa with high and low GC content. This new method also improved 13% of gene models produced by the standard MAKER protocol. Comparisons of this method to the standard training protocols for the SNAP and AUGUSTUS ab initio gene prediction programs as well as the isochore-sensitive AUGUSTUS gene prediction method showed that by training gene prediction HMMs with data representing multiple ranges of GC content and allowing MAKER to pick the best ab initio gene prediction generated by multiple gene prediction HMMs, it is possible to create a final gene annotation set that includes large numbers of both improved and novel gene predictions. The novel gene predictions are supported by various forms of evidence including transcript and protein alignments and membership in ortholog groups with genes from other grass species. Additionally, TRAP-sequencing has shown that a majority of these new predictions are being actively transcribed in O. sativa. MAKER is a widely used structural annotation program that allows researchers to produce quality genome annotations. This new method will be an important addition to those interested in the prediction of genes in regions of extreme GC content in Poaceae genomes but will probably be generally applicable for species with narrow, unimodal gene GC distributions as well.

Methods
Processing, quality assessment and assembly of evidence Thirty-one paired end RNA-seq datasets for O. sativa grown from different stress environments and tissues were downloaded from the National Center for Biotechnology Information Sequence Read Archive (NCBI-SRA) (Additional file 2: Table S3) using SRAToolkit v. 2.3.4-2 [39]. Raw read quality was assessed with FastQC v. 0.10.1 and Illumina adapters were trimmed using Trimmomatic v. 0.32. Transcripts were assembled using StringTie v. 1.3.0, and these transcript assemblies were subsequently used as EST evidence for all MAKER runs. The SwissProt plant protein dataset was downloaded (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/uniprot_spro t_plants.dat.gz), and all O. sativa protein sequences were removed. The remaining protein sequences not from O. sativa were used as protein evidence during the MAKER annotation.

MAKER standard de novo structural annotation of O. sativa
The MAKER-P (r1128) genome annotation pipeline was used to annotate the Os-Nipponbare-Reference-IGRSP-1.0 v7 genome assembly. A custom repeat library was created for O. sativa using a method described previously [18]; (http://weatherby.genetics.utah.edu/MAKER/ wiki/index.php/Repeat_Library_Construction-Advanced), and the custom repeat library was used by RepeatMasker within the MAKER pipeline to mask repetitive elements. Transcript assemblies and protein sequences described above were used as evidence to aid gene predictions.
A complete description for running MAKER has been provided previously [19,40] and that protocol provides details about ancillary scripts and example command calls. An abbreviated description of the standard MAKER pipeline is given here, and details about the extended GC-specific MAKER pipeline are given below. As MAKER is run iteratively, repeat masking and evidence alignment was performed during an initial MAKER run, and the resulting GFF3 (general feature format) file containing masked regions and protein and transcript alignments was used during all subsequent MAKER runs. The initial MAKER run generates data that aids in the training of the gene predictions programs SNAP (version 2013-11-29) [1] and AUGUSTUS (version 2.6.1) [6] (Fig. 3). During the initial MAKER run, the parameter est2genome was used to cause MAKER to promote transcript alignments to gene models. High-quality transcript-derived gene models (AED < = 0.2) were used to train SNAP and AUGUS-TUS. Instructions for training SNAP can be found elsewhere [1,19,36]. We use a custom shell script, train_augustus.sh, which trains the AUGUSTUS HMM in only a few hours for most species.train_augustu s.sh <path to working directory for train-ing> <path to MAKER gff3 output from initial MAKER run> <species name for AUGUSTUS HMM directory> <path to single fasta file with all transcript assemblies> The train_augustus.sh shell script prepares training and testing data sets and makes use of the autoAug.pl training script from AUGUSTUS to create the appropriate HMM files. This training script is relatively fast, as it only requires the transcript evidence to be aligned to the genomic regions that contain training and testing gene models instead of aligning those sequences to the entire genome. The working directory is used for writing a number of intermediate files and directories during the AUGUSTUS training process. All transcript sequences that were used as evidence during the initial MAKER run must be placed into a single transcript fasta file and provided here as those sequences will be used during the AUGUSTUS HMM training. The species name provided for the HMM training will be used to name the directory that holds all of the files for the new HMM and is also used to specify the AUGUSTUS HMM in the maker_opts.ctl file. It is necessary to have write permissions in the /config/species directory within AUGUSTUS installation directory in order for this script to work as that is where the AUGUSTUS writes the species-specific HMM directory. On a shared compute system, it may be necessary to make a local installation of AUGUSTUS and to then point MAKER to that installation by updating the path in the maker_exe.ctl file. After training SNAP and AUGUSTUS HMMs, MAKER was then run one last time using only the SNAP and AUGUSTUS HMMs to predict genes. During the final MAKER run, the parameters keep_preds was set to 1.
To identify the high-quality gene set, the MAKER accessory scripts gff_merge and fasta_merge, which are included in the MAKER installation, were used to generate a GFF3 file with all gene predictions and evidence data and the transcript and protein fasta files for those predictions. Pfam domains were identified within the predicted proteins using hmmscan [41].hmmscan -do mE 1e-5 -E 1e-5 -tblout <MAKER max predictions hmmscan output file> <path to Pfam-A.hmm> <path to predicted protein fasta file> The annotation GFF3 file, the transcript and protein fasta files and the hmmscan results file were used to generate a quality MAKER standard gene set.genera-te_maker_standard_gene_list.pl -input_gf f <output of gff3_merge> -pfam_results <hm mscan output> -pfam_cutoff 1e-10 -output _file <path to MAKER standard gene list>-get_subset_of_fastas.pl -l <path to MAKER standard gene list> -f <fasta_merge output transcript/protein fasta> -o <path MAKER s tandard transcript/protein fasta>create_-maker_standard_gff.pl -input_gff <output of gff3_merge> -output_gff <path to MAKER standard GFF3> -maker_standard_gene_list <path to MAKER standard gene list> Despite our use of a custom repeat library that was used for masking repeat elements in the genome, some TErelated genes remain unmasked, and we performed additional analyses to identify and remove any TE-related predictions from our MAKER standard gene set. Predicted proteins were compared to a database of Gypsy transposable elements (3.1.b2) [42]. Predicted proteins were also aligned with blastp to a database of transposases [43,44] (http://weatherby.genetics.utah.edu/MAKER/wiki/index .php/Repeat_Library_Construction-Advanced). A GFF3 file of TE-related genes was derived from the MSU-RGAP gene annotation GFF3 file (http://rice.plantbiology.msu.edu/pub/data/Eukaryotic_Projects/o_sativa/an notation_dbs/pseudomolecules/version_7.0/all.dir/all.gff 3) and was compared to the MAKER standard GFF3 file using gffcompare [45].hmmscan -tblout <Gyps y HMM analysis output> -E 1e-5 -domE 1e-5 <path to gypsy_db_3.1b2.hmm> <path to mak er standard proteins fasta>blastp -db <Tp ases020812 database> -query <path to MAKE R standard protein fasta> -out <path to T pases blast output> -evalue 1e-10 -outfmt 6gffcompare -o <TE comparison output file > -r <MSU RGAP TE GFF3> <MAKER standard GFF3> The create_no_TE_genelist.py script use the data derived above, the Pfam hmmscan results file and a list of TE-related Pfam domains (TE_Pfam_domains.txt; available on Childs Lab GitHub repository) to create a list of MAKER standard genes with no TE-related predictions.create_no_TE_genelist.py -input_file_TE pfam <TE_Pfam_domains.txt> -input_file_-maxPfam <MAKER max predictions hmmscan output file> -input_file_geneList_toKeep <path to MAKER standard gene list> -input_-file_TEhmm <Gypsy HMM analysis output> -input_file_TEblast <path to Tpases blast output> -input_file_TErefmap <TE comparison output refmap file> -output_file <path to TE filtered MAKER standard gene list>-create_maker_standard_gff.pl -input_gff <MAKER standard GFF3> -output_gff <TE filtered MAKER standard GFF3> -maker_stan-dard_gene_list <path to TE filtered MAKER s tandard gene list>get_subset_of_fastas.pl -l <path to TE filtered MAKER standard gene list> -f <fasta_merge output transcript/ protein fasta> -o <TE filtered MAKER stand ard transcript/protein fasta> This high-quality gene set without TE-related genes was used for all analyses presented in the Results section. In addition to this standard MAKER annotation, two additional annotations were created using either the SNAP HMM alone or the AUGUSTUS HMM alone, and high-quality gene sets without TE-related genes were identified for each of these annotations, which were used for comparisons to the final GC-specific six HMMs annotation described below.
Training GC-specific HMMs with high-GC and low-GC gene sequences In order to train high and low GC-specific HMMs for SNAP and AUGUSTUS, it was necessary to use training data that consisted of gene models with CDS (coding DNA sequence) GC content within specific ranges. The transcript-based gene predictions from the initial MAKER run (when the est2genome parameter was used) served as the starting point for GC-specific HMM training (Fig. 3). After generating the GFF3 file describing the transcript-based gene models, the genome FASTA file was processed by the Perl script MAKER_GC_cutoff_determination.pl.MAKER_GC_cu-toff_determination.pl -fasta <full path to file with genome FASTA sequences> -gff <full path to MAKER created GFF3 of est2-genome> -name <BASE_NAME for output files> -peak <peak determination window, odd integer, default is 5> -smooth <smoothing window, odd integer, default is 7> The MAKER_GC_cutoff_determination.pl script helps to identify the GC values of the peaks in a bimodal grass gene GC content distribution. The script pulls out the CDS FASTA sequences for the transcript-based gene predictions from the GFF3 and calculates the GC content for each gene prediction. The script assigns the gene GC values to integer bins based on the -smooth parameter, which helps to smooth the calculation of the distribution by using a moving window average and writes the results to a file. This output file can be used in R to plot the distribution of gene GC content. A FASTA file of the CDS sequences and a GC content file (showing nucleotide composition and GC content of each prediction) are also produced. In addition, a text file is created with the high and low peak values of the bimodal gene GC distribution that were used here as set points in creating the high and low GC HMM training sets. These peaks are determined by taking each GC bin and looking at a set number of bins on either side (set by -peak). A peak is identified when the ((peak -1) / 2) bins on each side of a GC bin have lower calculated GC values than the middle GC bin. However, users may pick their own high-GC and low-GC cutoff values, and the gene GC content distribution graph may aid in picking those cutoff values. The MAKER_GC_training_set_create.py script relies on two files produced from the MAKER_GC_cutoff_determination.pl script: BASE_NA ME_gc_content.txt and BASE_NAME_cutoff.txt. The MAKER_GC_training_set_create.py script will create hig h-GC and low-GC GFF3 files that can be used for training SNAP and AUGUSTUS.MAKER_GC_training_se t_create.py -input_file_gff <path to MAKER GFF file> -input_file_GC_content <BASE_NA ME_gc_content.txt file> -input_file_GC_cutoff <BASE_NAME_cutoff.txt file> -out-put_file_low <path to the low GC GFF file> -output_file_high <path to the high GC GFF file> -genome_fasta <path to the genome fasta file> As detailed in Fig. 3, this script is run after est2genome transcript alignment to create new GC-specific HMM training sets. All subsequent steps should only use the high or low GC output files.

MAKER six HMMs annotation
After the creation of high and low GC SNAP and AU-GUSTUS HMMs, a final MAKER run is performed using the standard, high and low GC HMMs at the same time. When using multiple SNAP and AUGUSTUS HMMs for this six HMMs annotation, predictions from the different HMMs can be identified by providing the path to a specific HMM, a colon, and an HMM-specific identifier (see below). Providing a comma-separated list to the snaphmm and augustus_species parameters allows the designation of multiple HMMs. To create this new six HMMs structural annotation, the following parameters are set in the MAKER maker_opts.ctl file:#---Re-annotation Using MAKER Derived GFF3ma-ker_gff = path to MAKER alignment GFF3est_-pass=1protein_pass=1rm_pass=1#---Gene Pr edictionsnaphmm= path to standard SNAP HMM:orig_snap, path to high GC HMM:high_snap, path to low GC HMM:low_snapAUGUSTUS_-species= path to standard AUGUSTUS dire ctory:orig_aug, path to high GC AUGUSTUS directory:high_aug, path to low GC AUGUS-TUS directory:low_augkeep_preds=1 Once the six HMMs MAKER annotation is finished, a final high-quality MAKER gene set composed of gene models with transcript, protein or Pfam domain support was created using the same protocol that was used above for the standard annotation.

Creation of SNAP and AUGUSTUS HMMs trained with transcripts of randomized GC content
To assess the impact of GC specific HMM training on the structural annotation of O. sativa, three MAKER annotations were created using HMMs trained with transcripts with randomized GC content from the standard annotation. The following Perl scripts were used, which create the training GFF3 files based on a random seed instead of percentage GC content. The random_data-set_generate.pl script takes as input the MAKER standard transcript FASTA and outputs three transcript FASTAs to be used for downstream GFF3 creation and HMM training.random_dataset_gener ate.pl -transcript < name of transcript FASTA file > -random1 <name of output random file1> -random2 <name of output random file2> -random3 <name of output random file3> The seq_name.pl script was run for each of the three random output FASTA files, and generates a list of MAKER standard transcript names from each transcript FASTA.seq_name.pl-fastafile <path to an output file from random_dataset_generate.pl>-output <name of text file with ID names for each FASTA sequence> Finally, the random_gff3_create.pl script requires as inputs the MAKER standard GFF3 with the genome FASTA included and the gene IDs from each of the random FASTAs, and the script generates the final randomized GFF3s that were used for SNAP and AU-GUSTUS HMM training.random_gff3_create.pl-align_gff <path to MAKER GFF3 with FASTA included>-rand_1 <path to random 1 IDs>-rand_2 <path to random 2 IDs>-rand_3 <path to random 3 IDs> The outputs of these steps are three GFF3 files containing the coordinates of randomly selected gene predictions. Each of the GFF3 files created by ran-dom_gff3_create.pl was then used for SNAP and AU-GUSTUS training.

Isochore-specific AUGUSTUS training in O. sativa
To compare the MAKER GC specific HMM training protocol to the isochore-specific AUGUSTUS method, we trained AUGUSTUS in its isochore-sensitive mode as detailed below [6]. After isochore-specific AUGUS-TUS training, the resulting HMM was used in a MAKER run with all other parameters as had been used for the standard annotation to create a MAKER structural annotation based only on isochore-specific AUGUSTUS gene predictions. An additional annotation was also created with the isochore-specific AUGUSTUS HMM and the standard SNAP HMM for comparison to the six HMMs annotation method.
After one round of traditional AUGUSTUS training [6] which creates the augustus.gb.train and augustus.gb.test genbank formatted gene files, change the gc_range_min value to 0.32, gc_range_max value to 0.73 and the decom-p_num_steps value to 7 in the parameters.cfg file in the newly created AUGUSTUS species HMM directory. The following three commands will then complete the isochorespecific training of AUGUSTUS.[AUGUSTUS_installa tion_dir]/scripts/optimize_augustus.pl -s pecies=<species_name> augustus.gb.trainetraining -species=<species_name> augustus .gb.trainaugustus -species=<species_name> augustus.gb.test

Identification of novel high or low GC content gene predictions
The BEDtools v 2.23.0 [46] intersect command was used to compare two GFF3 files containing MAKER gene coordinates to identify novel gene models that were unique to the gene predictions created with high or low GC HMMs. Those predictions that were only created by high GC HMMs but not standard or low GC HMMs were considered novel high GC HMM predictions, while predictions created only by the low GC HMMs were deemed novel low GC HMM predictions.
Assessing genome assembly and annotation completeness BUSCO (Benchmarking Universal Single-Copy Orthologs) (v2.0) was used to assess completeness for the six HMMs protocol compared to the MSU-RGAP annotation. BUSCO was run in protein mode using the plant reference dataset (embryophyta_odb9) containing 1440 protein sequences and orthologous group annotations for major clades. Predicted proteins from the six HMMs annotation and representative proteins from the MSU-RGAP annotation were separately used as input to BUSCO.

Evaluation of codon usage compared to equal usage of non-synonymous codons
The effective number of codons (Nc) was determined and plotted against the GC content of synonymous sites (GC3s) for coding regions of the novel genes predicted by the six HMMs protocol and the MSU-RGAP annotation. CodonW was used to determine the effective number of codons and the GC content of the synonymous sites (http://codonw.sourceforge.net). The data was then plotted alongside a curve showing the relationship between the effective number of codons and GC3s under the hypothesis of no selection.
Translating ribosome affinity purification (TRAP) sequencing and RNA-seq analysis Paired end TRAP-seq and mRNA-seq reads were trimmed using Cutadapt v1.8.1 using default parameters. RNA quantification was conducted with the kallisto v 0.42.5 [47] pseudoalignment method using the six HMMs MAKER predicted transcripts with a bootstrap value of 100. Transcripts per million (TPM) were calculated for both the TRAP-seq and mRNA-seq reads for each tissue, and the translatome enrichment index (TEI) was calculated as the ratio of transcripts per million of TRAP-seq to mRNA-seq for each high-quality six HMMs MAKER transcript. TRAP-seq and mRNA-seq data are available in NCBI BioProject PRJNA298638. NCBI-SRA accession numbers for additional RNAsequencing data can be found in Additional file 9.  [48] using the following packages: ggplot2 [49], reshape2 [50], NMF [51] and ggridges (https://github.com/clauswilke/ggridges).