GPRED-GC: a Gene PREDiction model accounting for 5 ′- 3′ GC gradient

Background Gene is a key step in genome annotation. Ab initio gene prediction enables gene annotation of new genomes regardless of availability of homologous sequences. There exist a number of ab initio gene prediction tools and they have been widely used for gene annotation for various species. However, existing tools are not optimized for identifying genes with highly variable GC content. In addition, some genes in grass genomes exhibit a sharp 5 ′- 3′ decreasing GC content gradient, which is not carefully modeled by available gene prediction tools. Thus, there is still room to improve the sensitivity and accuracy for predicting genes with GC gradients. Results In this work, we designed and implemented a new hidden Markov model (HMM)-based ab initio gene prediction tool, which is optimized for finding genes with highly variable GC contents, such as the genes with negative GC gradients in grass genomes. We tested the tool on three datasets from Arabidopsis thaliana and Oryza sativa. The results showed that our tool can identify genes missed by existing tools due to the highly variable GC contents. Conclusions GPRED-GC can effectively predict genes with highly variable GC contents without manual intervention. It provides a useful complementary tool to existing ones such as Augustus for more sensitive gene discovery. The source code is freely available at https://sourceforge.net/projects/gpred-gc/.


Background
Identification and annotation of genes in genomic sequences is a key step for functional analysis of a genome. The goal of gene annotation is to identify the location and structure of protein-coding genes in genomic sequences. Computational gene prediction methods can be broadly divided into two main categories: ab initio methods and homology-based methods. Ab initio gene prediction tools can predict genes in the query sequence without relying on the availability of homologs. A majority of ab initio gene prediction tools rely on hidden Markov models (HMMs), which describe different gene structural elements such as UTRs, exons, introns, etc. Given a sequence, we can use HMMs to infer the most probable number of new genomes are being sequenced using nextgeneration sequencing platforms, closely-related species are not always available. Thus, ab initio gene prediction tools play a significant role to find novel genes in the sequences without a priori known homologs. Note that some tools incorporate both HMMs and homologous sequences for boosting gene prediction performance. If a tool can conduct gene prediction without homologous sequences, it is classified into the first category.

GC content-dependent gene prediction
As the base composition and the exon length distributions can differ significantly for genes with different GC contents, some gene prediction tools employ GC contentdependent training [1,2,9,22,23]. In animals, genome isochores are regions of the genome with different GC contents, and it has been shown that the GC content of animal genes closely matches the GC content of the isochore in which the gene is found [24]. The AUGUSTUS gene prediction program has a mode that creates independent HMMs based on the GC content of the genomic region that is being processed [9]. Both theoretical analysis and empirical results have shown that GC contentdependent training greatly improves the gene prediction accuracy and sensitivity.
In plants, isochores do not exist, and it has been shown that the GC content of plant genes is not correlated with the GC content of the genomic region in which the gene is found. Furthermore, in grasses such as Oryza sativa (rice), genes can be characterized as having either a high GC or low GC content whereas most non-grass species such as the model species Arabidopsis thaliana (thale cress) have genes with a narrow gene GC content distribution. Using a single HMM to predict these two classes of genes in O. sativa was shown to be less accurate than using a gene prediction protocol that was aware of the high and low GC genes in grasses. Bowman et al. [24] trained three HMM programs on low, medium and high GC genes. All HMMs were used to make gene predictions, but only the best prediction that was most congruent with available evidence was retained. This method improved gene predictions compared to a gene prediction protocol that was not GC aware.
While the method of Bowman et al. [24] is an improvement over other gene prediction programs, it is a heuristic that can be improved upon by an modification of the basic structure of the underlying gene prediction HMM. Furthermore, many grass genes exhibit a sharp 5 -3 decreasing GC content gradient [25], [26], which is not carefully modeled by existing gene prediction tools and Bowman's method. As a result, these tools have unsatisfactory sensitivity and accuracy for predicting genes with GC gradients. Figure 1 illustrates an example of a gene with descending slope of GC content in Oryza sativa data set.
To address these limitations, we propose a new gene prediction model with two advantages: 1) our model can predict genes with GC gradient with higher sensitivity and accuracy without manual intervention; 2) our unified model is optimized for genes of variant GC content and 5 -3 changing patterns. Fig. 1 A gene LOC_Os03g44820.1 with GC content gradient from Oryza sativa data set. X-axis represents each exon inside the gene. Y-axis represents the GC content

Methods
In this section, we describe GPRED-GC, a tool that predicts genes with 5 -3 GC gradient. The flowchart for training and annotating genes is shown in Fig. 2. The main novelty of our method is a modified hidden Markov model (HMM) that distinguishes exons of different GC content. The HMM has a similar topology to the one used in AUGUSTUS [22,23] and many other de novo gene prediction tools [1,4]. The major difference is that our model is designed to handle various GC contents and 5 -3 changing patterns inside coding regions.

The hidden Markov model of GPRED-GC
An HMM is a probabilistic sequence model with successful application for gene prediction. It models the key sequence features such as exons and introns in a gene and can be trained using annotated gene sets. Once the model is built, it can be applied to search for genes and annotate the gene structures using existing algorithms designed for HMMs, such as the Viterbi algorithm. Essentially, an optimal state path in an HMM that can maximize the likelihood or posterior probability of a query being produced by the model can be used to label each base in the query sequence.
As AUGUSTUS is a popular plant gene prediction tool, we use the generalized Hidden Markov Model from AUGUSTUS [22] as the base model. The essential difference is that instead of using one state to represent an exon, we have three states to model exons of high, medium, and low GC contents. Figure 3 illustrates the major difference for an exon state in a standard HMM and our HMM, which incorporates changes of GC contents across the genes.
Here, we make an assumption that the GC content change inside exons is relatively small. Although we can use a window-based model inside each exon to further refine the representation of GC gradient, it will significantly increase the model complexity. By only distinguishing exons of different GC contents, we have a better tradeoff between the model complexity and the model resolution.
For single exon genes, three states E H single , E M single , E L single are created. For the initial exon, three states E 0 init H , E 0 init M , E 0 init L are used to model exons of high, medium, and low GC content. Moreover, the initial exons of other phases, the internal exons of all phases, and the terminal exon all have three states for high, medium, and low GC content. Genes of variant GC changing patterns can be represented by the new exon states.
The added exon states allow the HMM to predict genes of various GC gradients with higher accuracy. For example, genes of negative GC gradient tend to be represented by a path starting with E H and ending with E L . Genes with high GC content and moderate gradient tend to be produced by a path mainly consisting of E H .

The states in the HMM
In total, the model of GPRED-GC has the following 79 states in set Q:  illustrates three internal exon states for one phase on the plus strand (corresponding to one reading frame). The internal exons of other phases, the initial exon, the terminal exon, and the single exon all have three states for high, medium, and low GC content. Genes of various GC contents and gradients can be represented as various paths through the exon states Figure 4 is a schematic representation of the HMM in our work. In the upper half of Fig. 4, the states represent protein-coding genes on the forward strand. The state IR stands for the intergenic region. In the lower half of Fig. 4, the states represent protein-coding genes on the reverse strand. Each state on the reverse strand begins with 'r' . They have the consistent biological meaning with the states on the forward strand. The superscript on the reverse strand represents the reading frame phase of an exon. Thus, there are nine states for a terminal exon and three states for an initial exon considering high, medium, and low GC contents.
Similar to existing HMMs for gene prediction, our HMM is also a general HMM, which supports length distribution and the Markov model emission of the exons. For each exon state on both forward strand and reverse strand, the exon length distribution is computed on the corresponding exons, respectively. Similarly, different inhomogeneous kth-order Markov models (by default k = 4) for each exon state are derived separately.

New transitions in our HMM
With new states representing exons of different GC contents, new transitions incident to these new states are added. In this section, we describe how we compute the transition probabilities for the new edges. In Fig. 4, the arrows represent the transitions between states in the state set Q with non-zero probabilities. The transitions from and to the intron states are the same as those of states described in the AUGUSTUS model [22]. For GPRED-GC, we consider two strategies for computing transition probabilities. First strategy, we use a very simple strategy by dividing the known transition probabilities concerning the exon states of AUGUSTUS equally for three exon states of high, medium, and low GC contents. This strategy can be used when we have very limited training data. Our hypothesis is that they start with equal probabilities. Figure 4 includes the transition probabilities of this strategy. The second strategy is a standard method based on maximum likelihood training. We compute the transition probabilities using the maximum likelihood estimation from the training data. In the following equation, a kl is the transition probability for k, l ∈ Q. A kl is the number of observed transitions from the state k to state l in training data. The maximum likelihood estimator is defined as To avoid zero probabilities due to sparse/insufficient training data, we add pseudocounts to the observed frequencies to reflect prior biases regarding the probability values. Given pseudocounts r kl , we define A kl as Usually, with the Laplace method, all r kl equal to 1. The performance comparison of the two strategies for computing transition probabilities will be shown in the Results and discussion Section.

Results and discussion
To evaluate the performance of GPRED-GC, we tested GPRED-GC on three sets of data from A. thaliana and O. sativa. The first data set on A. thaliana was downloaded from the server of Augustus [27]. The other two were from O. sativa, obtained from the MSU Rice Annotation Project and from Stanke et al. [28], respectively. A. thaliana is a dicotyledenous plant and not a grass species. The genes in A. thaliana do not have GC-gradients that are common to genes from grasses. We expect that our program should achieve similar performance to other The state diagram of GPRED-GC. The states beginning with r represents the reverse strand. E H single : a single exon of high GC content. E M single : a single exon of medium GC content. E L single : a single exon of low GC content. E init H : the initial coding exon of a multi-exon gene with high GC content. E init M : the initial exon of a multi-exon gene with medium GC content. E init L : the initial exon of a multi-exon gene with low GC content. DSS: a donor splice site. I short : an intron emitting at most d nucleotides. I fixed : a longer intron with the first d nucleotides. I geo : a longer intron emitting one nucleotide at a time after the first d nucleotides. ASS: an acceptor splice site with branch point. E H : an internal coding exon of a multi-exon gene with high GC content. E M : the internal exon of a multi-exon gene with medium GC content. E L : the internal exon of a multi-exon gene with low GC content. E H term : the last coding exon of a multi-exon gene with high GC content. E M term : the terminal exon of a multi-exon gene with medium GC content. E L term : the terminal exon of a multi-exon gene with low GC content. IR: intergenic region. Diamonds represent the states that emit fixed length strings. Ovals represent the states including explicit length distribution. The numbers at the arrows show the transition probabilities. The transition probabilities incident to new exon states are derived using equal divisions (strategy 1). The exponents 0, 1, and 2 represent the reading frame phase. For an exon state, this is the position of the last base of the exon in its codon. For the other states, the exponent are the preceding-exon phase. The small circles represent silent states ab initio gene prediction programs on gene prediction for A. thaliana. For the data sets from O. sativa, we expect to observe improved performance of gene prediction. As our HMM is modified from the HMM in Augustus, we compared our results to the output of regular AUGUSTUS ab initio gene finding program. In particular, we focus on examining the performance of our method on identifying genes with sharp change of GC content.

Evaluation metrics
We adopted the standard evaluation metrics [22] for gene prediction: sensitivity and specificity. The sensitivity and specificity are computed at three levels: the nucleotide level, the exon level, and the gene level. The sensitivity and the specificity are defined as TP (true positive) represents the number of correctly predicted features (coding nucleotides, exons, or genes). FN (false negative) represents the number of annotated features that are not correctly predicted by a gene annotation program. FP represents the number of predicted features that are not annotated. At each level, we report both the sensitivity and specificity. Sensitivity is the percentage of correctly predicted features in the set of all annotated features. Specificity is the percentage of correctly predicted features in the set of all predicted features. Specificity is also called positive predictive value (PPV) in other literature. At the exon level, a predicted exon will be correct if both splice sites are identical to their labeled positions. At the gene level, a predicted gene is considered correct if all exons are correctly identified, and no additional exons are identified in the gene. The predicted partial genes are evaluated similarly. The forward and reverse strands are considered as different sequences.

Gene prediction on A. thaliana
We trained our HMM for A. thaliana using the training set from Stanke's website [27]. The data set contained 249 genomic regions. There were two single exon genes and 247 multi-exon genes in the data set.

Training the HMM model
To train our HMM, we calculated the GC contents for all of the exons and classified them as high, medium, and low using specified cutoffs. For GPRED-GC, we have two cutoffs: lowT and highT. If an exon has GC content below lowT, it is classified as low GC content. If an exon has GC content above highT, it is labeled as high GC content. Otherwise, it is labeled as medium. The detailed classification is summarized in Procedure 1. For computing transition probabilities, we used two strategies. First, we equally divided the probabilities of AUGUSTUS for three states of high, medium, and low GC contents. Second, we used maximum likelihood estimation to calculate transition probabilities. We used 10-fold cross-validation for model training. It divided the training data set randomly into 10 subsets. The evaluation method is repeated 10 times. For each round, one of 10 subsets is designated as the test set and the other 9 subsets are put together for training. Then the average prediction accuracy of all 10 trials is calculated. The parameters maximizing the average prediction performance are kept as the default parameters.

Performance comparison between different gene prediction tools
We tested AUGUSTUS and GPRED-GC on the testing data set of A. thaliana [27], which has no overlap with the training data set. There were 74 genomic regions with 168 genes on the forward and reverse strand. Our program was modified from AUGUSTUS version 2.4 downloaded from [29]. Both original AUGUS-TUS and GPRED-GC were tested using default input parameters. Table 1 shows the comparison of the accuracy of AUGUSTUS and GPRED-GC with different thresholds of GC contents. In this experiment, the transition probabilities from intron to exons of different GC contents were equally divided into three portions. These experimental results show that GPRED-GC achieved slightly better sensitivity and specificity for gene level predictions. For base level and exon level predictions, GPRED-GC has higher specificity than AUGUSTUS. Overall, the performances of these two tools are comparable on this data set, which is expected for a non-grass genome that lack genes with widely varying GC contents or genes with negative GC gradients. In addition, the performance of GPRED-GC does not vary significantly with the change of the GC content cutoffs, mainly because the GC contents of the exons in this data set are clustered between 0.35 and 0.6.
We also implemented GPRED-GC by computing the transition probabilities from intron to different exon states using maximum likelihood estimation. Table 2 presents the accuracy comparison of AUGUSTUS and GPRED-GC with different cutoffs and trained transition probabilities. Using maximum likelihood estimation for computing the transition probabilities gave the better overall performance.

The uniquely predicted genes by GPRED-GC
As the major goal of GPRED-GC is to detect genes with highly variable GC contents, we evaluated this goal by examining the GC contents of uniquely identified genes by GPRED-GC. There were 14 uniquely identified genes by our tool and 149 shared genes. For all these genes, we computed their GC contents and the standard deviation (SD). In addition, we introduce another metric named "GC-distance", which is the largest difference of GC contents between all exons inside a gene. Thus, a gene with highly variable GC contents are more likely to have a big SD and also a large GC-distance.  The experimental results demonstrated that the average SD of the uniquely predicted genes was 0.046. However, the average SD of the common ones was 0.033 which is smaller than the uniquely predicted genes. Also, the average GC-distance for uniquely found genes was 0.111. The average GC-distance of the common genes was only 0.087. As an example, a uniquely identified gene is reported in Fig. 6.

Running time analysis
The theoretical time complexity is O(|Q|L) where Q is the set of the states in the HMM and L is the query length. The actual running time is in Tables 1 and 2. As the total number of states in GPRED-GC is less than twice of the states in AUGUSTUS, the running time of GPRED-GC is comparable to AUGUSTUS.

Gene prediction in O. sativa
We conducted two experiments using two different O. sativa data sets. The first O. sativa data set is part of the MSU Rice Genome Annotation Project [30], [31]. The second O. sativa data set was obtained from Stanke et al. [28]. Unlike A. thaliana, which has a set of gene predictions with high confidence, O. sativa does not have confidence descriptions assigned to gene predictions. Therefore, we choose two data sets on O. sativa to avoid the possible inaccurate annotations. These two data sets were constructed from different means and contain different sequences. The first data set is smaller than the second data set. Figure 7 presents the distribution of exons by their GC contents for 844 exons in the first O. sativa data set. Figure 8 shows the distribution of exons by their GC contents for 16,199 exons in the second O. sativa data set.

Gene identification in the first O. sativa data set
The training data set consisted of 150 genomic regions with 11 single-exon genes and 139 multi-exon genes on forward strand and reverse strand. Again, we used 10-fold cross validation strategy for model training. The   We compared GPRED-GC with AUGUSTUS in Table 3. In this experiment, we observed GC content change for exons inside each gene. For example, some genes tend to start with exons of high GC content and end with exons of low GC content. As the result, GPRED-GC achieved higher sensitivity than AUGUSTUS. GPRED-GC can improve both the sensitivity and the specificity at all levels using a cutoff of low GC content(0.40) and the cutoff of high GC content(0.60). With transition probabilities derived using maximum likelihood, the results are shown in Table 4.
Furthermore, we compared changes in GC contents of the uniquely identified genes by GPRED-GC and the common genes shared by AUGUSTUS and GPRED-GC. There were four uniquely predicted genes by GPRED-GC and 143 common genes. We compared the uniquely identified genes and common ones in terms of SD and GC-distance for each protein-coding gene. The results showed that the average SD of the uniquely predicted genes (0.098) was higher than that of common genes (0.075). Also, the average GC-distance of the uniquely found genes by GPRED-GC (0.241) was bigger than that of common genes (0.171). Figure 9 plots the GC contents of the genes that can be correctly predicted by GPRED-GC but miss-annotated by regular AUGUSTUS. All these genes have a negative GC gradient.

Running time on the first O. sativa data set
The running times of AUGUSTUS and GPRED-GC with different sets of thresholds are compared in Table 3 and  Table 4.

Finding genes in the second O. sativa data set
The second O. sativa data set was provided by Stanke et al. [28]. The detailed information about these genes can be found in the authors' paper. Here we provide a brief summary about the genes in this data set. First, Stanke et al. made a genbank file from the genome and  . 9 Genes of the first O. sativa data set predicted correctly by GPRED-GC but missed or incorrectly annotated by Augustus. Four genes are listed in the four subplots: (a), (b), (c), and (d). X-axis represents the exon index inside a gene. Y-axis represents GC content the gff file. Second, they constructed a set with both 5' and 3' UTRs annotated. Then, they selected genes with both UTRs and CDSes (from a visual inspection in JBrowse against panicle and leaf RNA-Seq STAR alignments). Fourth, they identified genes with errors only and removed the sequences with errors. Finally, the data set contains 1000 genes, which consist of 128 manually selected and 872 randomly chosen ones (from among filtered genes with both UTRs annotated). The HMM for O. sativa was trained on a selection of 800 genes with UTRs from phytozome [32]. For the training data set, there were 187 single exon genes and 613 multi-exon genes on the forward and reverse strands.
To assess the performance, we used the remaining 200 genes as the test set to avoid any overlap with the training set. The testing data set consisted of 40 single-exon genes and 160 multi-exon genes on the forward and reverse strands. Tables 5 and 6 compare the accuracy of AUGUSTUS and GPRED-GC on the test set. Table 5 shows the results of the prediction using the equal    Table 6 contains the prediction results of the HMM whose transition probabilities were trained using maximum likelihood. Both models shows improved accuracy compared to AUGUSTUS. By applying GPRED-GC of equal transition probabilities (strategy 1), the sensitivity of GPRED-GC at base level was enhanced from 0.859 to 0.942 for 0.31 lowT cutoff and 0.52 highT cutoff. At the exon level, the sensitivity of GPRED-GC was improved from 0.67 to 0.768.
The specificity of AUGUSTUS is slightly better than that of GPRED-GC for the same cutoffs at the base level. At the gene level, GPRED-GC had better sensitivity and specificity (0.4 and 0.211, respectively). By using the transition probabilities trained via maximum likelihood, we observed a bigger improvement in the performance (see Table 6).
We conducted additional analysis using the results of lowT=0.31 and highT=0.52. The analysis confirms that GPRED-GC can detect the genes containing a GC content gradient. There were 26 uniquely predicted genes by GPRED-GC and 163 common genes shared by AUGUSTUS and GPRED-GC. Uniquely predicted genes had higher SD (0.099) than common genes (0.080). Besides, the average GC-distance (i.e. the difference between the exon with the highest GC content and the exon with the lowest GC content) for each protein-coding gene of the uniquely found genes are larger than that of the common genes (0.211 and 0.171, respectively). GPRED-GC miss-annotated 11 genes, which are correctly predicted by AUGUSTUS. For all these genes, we correctly predicted the initial and terminal exons but some of the internal exons' starting and ending positions are not correctly computed. Overall, we correctly identified 26 more genes than AUGUSTUS while we missed 11 genes. There are 533 exons in all of these 37 genes. At the exon level for these genes, the sensitivities of GPRED-GC vs. AUGUSTUS are 0.838 and 0.627, respectively. The specificities of GPRED-GC VS AUGUSTUS are 0.639 and 0.543, respectively. Figure 10 illustrates the examples of genes correctly predicted by GPRED-GC but missed by AUGUSTUS. This is strong evidence showing that our tool can predict genes with changing GC contents.

How to determine the lowT and highT cutoffs?
Our experimental results have shown that the gene prediction performance is affected by the values of lowT and highT. In Table 6, setting lowT and highT to 0.31 and 0.52 achieved significantly better performance than 0.30 and 0.50. As such a small change can lead to a big difference, we investigated the reasons. Essentially, changing lowT and highT mainly changes the labels of exons of the training data. The numbers of exons classified as having high, medium, and low GC content may change. Meanwhile, the number of transitions involving these states may change too. Thus, we compared the corresponding parameters in the two HMMs for these two sets of cutoffs. While many parameters are identical, there are several differences as shown in Table 7.
For the cutoff set (0.30, 0.50), there are several edges that have zero or a very small number of training samples passing those edges. If the training case contains 0 samples for one edge, only the pseudocount will be used, leading to a very small transition probability. When the testing data Table 7 The comparison of the corresponding parameters in the two HMMs for these two sets of cutoffs has those cases, the overall generation probabilities tend to be small. In order to avoid any bias of training, our guidance is to choose the cutoffs so that the training data can cover all the edges/transitions.

Running time on the second O. sativa data set
The running times of AUGUSTUS and GPRED-GC with different sets of thresholds are compared in Tables 5 and 6. The two tools have comparable total running time.

Conclusion
In this work, we provided an implementation of a HMM that is optimized for predicting protein-coding regions that have various GC content and 5 -3 changing patterns.
Our experimental results showed that our program can identify genes that are missed by AUGUSTUS.
According to the previous studies, several directions can be improved. For gene prediction, some existing gene prediction tools can identify 5 UTR and 3 UTR regions. Currently, GPRED-GC is not able to identify UTR regions. We plan to extend GPRED-GC to accurately predict 5'UTR and 3'UTR regions. Existing gene prediction tools demonstrated that using extrinsic evidence derived from matches to an EST or protein database can improve the accuracy of gene prediction. We will further improve GPRED-GC accuracy by using hints from external sources if the data is available.