ZCURVE_V: a new self-training system for recognizing protein-coding genes in viral and phage genomes

Background It is necessary to use highly accurate, statistics-based systems for viral and phage genome annotation. The GeneMark systems for gene-finding in virus and phage genomes suffer from some basic drawbacks. This paper puts forward an alternative approach to viral and phage gene-finding that improves the quality of annotations, particularly for newly sequenced genomes.
Results The new system, ZCURVE_V, has been run on 979 viral and 212 phage genomes, and satisfactory results are obtained. For a fair comparison with GeneMark, the currently available software of similar function, a total of 30 viral genomes that have not been annotated by GeneMark were selected for testing. The average specificity of the two systems is well matched; however, the average sensitivity of ZCURVE_V for smaller viral genomes (< 100 kb), which constitute the majority of viral genomes sequenced so far, is higher than that of GeneMark. Additionally, for the genome of Amsacta moorei entomopoxvirus, which probably has the lowest genomic GC content among the sequenced organisms, ZCURVE_V is much more accurate than GeneMark, because the latter predicts hundreds of false-positive genes. ZCURVE_V is also used to analyze well-studied genomes, such as HIV-1, HBV and SARS-CoV, and its performance is generally better than that of GeneMark. Finally, ZCURVE_V may be downloaded and run locally, which makes it easier to use, whereas GeneMark is not downloadable. Based on the above comparison, it is suggested that ZCURVE_V may serve as a preferred gene-finding tool for newly sequenced viral and phage genomes. However, it is also shown that the joint application of both systems, ZCURVE_V and GeneMark, leads to better gene-finding results. The system ZCURVE_V is freely available at: http://tubic.tju.edu.cn/Zcurve_V/.
Conclusion ZCURVE_V may serve as a preferred gene-finding tool used for viral and phage genomes, especially for anonymous viral and phage genomes newly sequenced.


Background
Developments in DNA sequencing technology have resulted in a rapid expansion of genome data. Exploring the secrets of genomes and maximizing the scientific knowledge gained from them has become a challenging issue. The first step in analyzing a completely or partially sequenced genome is to identify all its genes. Accurate gene recognition is relevant to many biological applications, for example, DNA microarrays, knockout experiments and drug design. There exist some well-known computer systems for gene-finding in bacterial and archaeal genomes. These systems are based either on statistical analysis, such as GeneMarkS [1], Glimmer [2,3] and ZCURVE [4], or on similarity alignment, such as CRITICA [5] and ORPHEUS [6]. Generally, the statistics-based software above gives satisfactory predictions. In contrast, genome annotation of newly sequenced viruses and phages is frequently based on similarity search methods such as BLAST [7]. Although similarity search methods achieve high specificity, some species-specific genes are likely to be missed. Evidence shows that an open reading frame (ORF) that is longer than a given length and does not overlap, or only slightly overlaps, with any adjacent ORF is likely to be a gene. However, simply assigning all such ORFs as genes usually generates over-predictions. Therefore, it is necessary to use highly accurate, statistics-based systems for viral genome annotation. Unfortunately, there are currently very few satisfactory statistics-based viral gene-finding systems apart from the GeneMark gene-finding family [8,9]. Moreover, the GeneMark systems for gene-finding in virus and phage genomes suffer from some basic drawbacks. The aim of this paper is to put forward an alternative approach to viral and phage gene-finding that improves the quality of annotations, particularly for newly sequenced genomes.
The ZCURVE system for finding protein-coding genes in bacterial and archaeal genomes, developed by our group, has been used in 40 laboratories or institutes all over the world [4]. In a recent paper, ZCURVE and two other well-known bacterial gene-finding systems, Glimmer and CRITICA, were combined into a metatool named YACOP [10]. By adapting the algorithm of ZCURVE, a new system specific to coronavirus genomes, ZCURVE_CoV, was subsequently developed [11]. The results of ZCURVE_CoV are highly consistent with the GenBank annotations for coronavirus genomes, especially for SARS-CoV genomes [11]. However, the above software cannot simply be used to identify protein-coding genes in other viral or phage genomes. Here, a self-training system, ZCURVE_V, is presented to address the problem. Like ZCURVE [4] and ZCURVE_CoV [11], ZCURVE_V is based on the Z curve representation of DNA sequences [12]. Compared with the most widely used viral gene-finding system, the GeneMark family [8,9], the algorithm of ZCURVE_V is much simpler, because only 33 recognition variables are needed; ZCURVE_V is thus conceptually different from GeneMark. Compared with GeneMark, ZCURVE_V gives better predictions for smaller viral genomes (< 100 kb). In addition, the performance of ZCURVE_V is generally better than that of GeneMark for genomes with particular features, such as Amsacta moorei entomopoxvirus, which probably has the lowest genomic GC content among all the organisms sequenced so far. Moreover, it is also shown that joint application of ZCURVE_V and GeneMark leads to better gene-finding results for viral and phage genomes.

Results and Discussions
Indices to evaluate ZCURVE_V
The ZCURVE_V system has been run on 979 viral and 212 phage genome records. Default settings are adopted for all options unless indicated otherwise. Evaluation of ZCURVE_V is based on the comparison between the gene-finding results and the RefSeq annotations for each genome. It should be noted that RefSeq records are usually listed as provisional and have not themselves undergone extensive curation and literature cross-checking. However, some criteria are needed to test and compare the performance of the presented algorithm. Knowing that the RefSeq records are questionable, we chose the RefSeq data of the highest reliability. For example, gene annotations in HIV, HBV and coronavirus are well known in the literature; these three viruses are therefore selected as samples to test and compare the algorithm. Other RefSeq records are selected similarly. Because of the inaccuracy of the currently available RefSeq annotations, the comparison between GeneMark and ZCURVE_V based on them should be regarded as preliminary.
A more reliable future comparison should be based on experimentally verified data rather than RefSeq annotations. Two independent indices, sensitivity (S_n) and specificity (S_p), defined by formulas (1) and (2), are used to evaluate the performance of ZCURVE_V [13]:
$$S_n = \frac{TP}{TP + FN} \qquad (1)$$
$$S_p = \frac{TP}{TP + FP} \qquad (2)$$
where TP, FP and FN are the numbers of true positive, false positive and false negative predictions, respectively.

Comparisons with GeneMark (I): viral genomes with different chromosome lengths
The GeneMark gene-finding web server provides two alternative approaches to viral genome annotation: online prediction using a heuristic approach (or using the GeneMarkS program for viral genomes longer than 100 kb), and the VIOLIN database [1,8,9]. Generally speaking, the results obtained by the latter are more accurate than those obtained by the former. To evaluate ZCURVE_V strictly, the predicted results deposited in the GeneMark VIOLIN database are used whenever they are available. A total of 30 viral genomes not annotated by GeneMark, canarypox virus among them, were used for the comparison. The results are listed in Table 1, where the genomes are ordered by descending chromosome sequence length. For the 15 viral genomes in Table 1 with chromosome sequences longer than 100 kb, the average S_n of ZCURVE_V and GeneMark is 95.9% and 92.5%, respectively, and the average S_p is 92.8% and 94.0%, respectively. For the 15 viral genomes in Table 1 with chromosome sequences shorter than 100 kb, the average S_n of ZCURVE_V and GeneMark is 95.6% and 80.0%, respectively, whereas the average S_p is 93.2% and 91.1%, respectively. As can be seen, both the average S_n and S_p of ZCURVE_V for the small viral genomes are similar to those for the large viral genomes, whereas the average S_n of GeneMark for small viral genomes is much lower than that for large viral genomes. Note that genomes shorter than 100 kb constitute the major part of the viral and phage genomes sequenced so far: over 90% of the 979 viral and 212 phage genomes analyzed here are shorter than 100 kb. Averaged over all 30 genomes, S_n and S_p are 95.7% and 93.0% for ZCURVE_V, respectively, whereas S_n and S_p are 86.2% and 92.5% for GeneMark.
In summary, the S_p of the two systems is well matched, but the S_n of ZCURVE_V is much higher (by about 9.5 percentage points) than that of GeneMark. Although Glimmer [2,3] was designed for gene-finding in bacterial genomes, its gene-finding results (Glimmer 2.02) for all 30 genomes are also listed in Table 1 for comparison.
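The pooled figures quoted above can be checked with simple arithmetic. The sketch below assumes that the 30-genome average is the unweighted mean of the two 15-genome group averages (an assumption, since the per-genome values of Table 1 are not reproduced here); the dictionary layout and function name are illustrative only.

```python
# Group averages in percent, as quoted in the text:
# (genomes longer than 100 kb, genomes shorter than 100 kb), 15 genomes each.
GROUP_AVERAGES = {
    "ZCURVE_V": {"Sn": (95.9, 95.6), "Sp": (92.8, 93.2)},
    "GeneMark": {"Sn": (92.5, 80.0), "Sp": (94.0, 91.1)},
}


def pooled_mean(large: float, small: float) -> float:
    """Unweighted mean of two equally sized groups (assumption: 15 + 15 genomes)."""
    return (large + small) / 2


pooled = {
    tool: {index: pooled_mean(*values) for index, values in indices.items()}
    for tool, indices in GROUP_AVERAGES.items()
}

# Difference in sensitivity between the two systems, in percentage points.
sn_gap = pooled["ZCURVE_V"]["Sn"] - pooled["GeneMark"]["Sn"]

for tool, indices in pooled.items():
    print(f"{tool}: Sn = {indices['Sn']:.2f}%, Sp = {indices['Sp']:.2f}%")
print(f"Sn gap = {sn_gap:.2f} percentage points")
```

Under this assumption the pooled values are 95.75% and 93.0% for ZCURVE_V and 86.25% and 92.55% for GeneMark, which agree with the quoted figures to within rounding, and the sensitivity gap is 9.5 percentage points.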

Comparisons with GeneMark (II): viral genomes with particular genomic features
Among the viruses curated by NCBI staff, two satellite viruses have genomic sequences shorter than 1000 bp: the cereal yellow dwarf virus-RPV satellite RNA (CYDV-RPV satRNA, NC_003533) and panicum mosaic satellite virus (satPaMV, NC_003847). Satellite maize white line mosaic virus (SV-MWLMV, NC_003631) and strawberry latent ringspot virus satellite RNA (SLRSV, NC_003848) have sequences slightly longer than 1000 bp. As can be seen from Table 2, the gene-finding results of ZCURVE_V are more consistent with the RefSeq annotations than those of GeneMark for these four very small viral genomes.
The genome of Amsacta moorei entomopoxvirus (AmEPV, NC_002520) was sequenced in 2000 [14]. To our knowledge, its genomic GC content of 17.78% is the lowest among all the organisms completely sequenced so far. In the original annotation by the submitter of the GenBank entries, all ORFs larger than 180 bp are predicted as possible protein-coding genes [14]. Such an annotation method is very likely to generate over-annotation. The current RefSeq annotation curated by NCBI staff remains nearly the same as the original annotation, i.e., the genome contains 295 possible genes. After running ZCURVE_V, 245 of the 295 annotated genes are found, and 5 additional genes are predicted. Among the 50 (295-245) genes not predicted by ZCURVE_V, only one has a putative function and another two are similar to existing genes without functions in public databases, while the remaining 47 are annotated only as 'hypothetical proteins'. This result supports the notion that protein-coding genes are over-annotated in the Amsacta moorei entomopoxvirus genome. The GeneMark VIOLIN database correctly predicts 239 annotated genes, while the number of additionally predicted genes is as high as 323. Most of these additional genes predicted by the GeneMark VIOLIN database are evidently non-coding ORFs. The severe over-prediction of the GeneMark VIOLIN database for the Amsacta moorei entomopoxvirus genome is perhaps caused by its weak adaptability to genomes with particular features.
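The AmEPV counts above translate directly into the two evaluation indices. The following is a minimal sketch, assuming the usual gene-finding definitions S_n = TP/(TP + FN) and S_p = TP/(TP + FP) and treating the 295 RefSeq genes as the reference set; the function and variable names are illustrative and not part of ZCURVE_V.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of annotated genes that were predicted (assumed Sn definition)."""
    return tp / (tp + fn)


def specificity(tp: int, fp: int) -> float:
    """Fraction of predicted genes that are annotated (assumed Sp definition)."""
    return tp / (tp + fp)


ANNOTATED_GENES = 295  # RefSeq annotation of AmEPV, taken as the reference

# (annotated genes found, additionally predicted genes), as quoted in the text.
PREDICTIONS = {
    "ZCURVE_V": (245, 5),
    "GeneMark VIOLIN": (239, 323),
}

scores = {}
for system, (found, additional) in PREDICTIONS.items():
    missed = ANNOTATED_GENES - found  # false negatives
    scores[system] = (sensitivity(found, missed), specificity(found, additional))
    sn, sp = scores[system]
    print(f"{system}: Sn = {sn:.1%}, Sp = {sp:.1%}")
```

With these definitions the two systems have similar sensitivity on this genome (about 83% and 81%), while the specificity differs sharply (98% against roughly 43%), which is the over-prediction described above.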

Applying ZCURVE_V to HIV-1, HBV and SARS-CoV genomes
According to the report "AIDS Epidemic Update 2004" launched by WHO and UNAIDS, the total number of people living with the human immunodeficiency virus (HIV) increased in 2004 to reach its highest level ever: an estimated 39.4 million people are living with the virus [15]. The global AIDS epidemic killed 3.1 million people in the past year. The current GenBank annotation for HIV-1 (GenBank AC: AF033819) contains 9 protein-coding genes, of which 7 are single-exon genes without any intron; genes tat and rev each have one intron. With default settings, ZCURVE_V and the GeneMark VIOLIN database predict 7 and 6 genes for the genome, respectively. The predicted results are listed in Table 3. As can be seen, both ZCURVE_V and the GeneMark VIOLIN database predict the 5 annotated single-exon genes pol, gag, vpr, env and nef. The single-exon gene vif is correctly predicted by ZCURVE_V, whereas the GeneMark VIOLIN database misses it. The single-exon gene vpu is correctly predicted by the GeneMark VIOLIN database, whereas ZCURVE_V misses it. In addition, ZCURVE_V correctly predicts the 5' end of the intron-containing gene tat. After adjusting the default settings, i.e., using the 'Keep Overlapping Genes' option, the gene vpu and one additional gene located at positions 7602-7694 bp are predicted by ZCURVE_V.
Hepatitis B virus is another virus that severely threatens human health. Currently, the GenBank annotation contains 4 single-exon genes for HBV (GenBank AC: X04615). Among the 4 genes, gene P is composed of two fragments. With default settings, ZCURVE_V and the GeneMark VIOLIN database predict 3 and 2 genes for the genome, respectively. The predicted results are listed in Table 4. As can be seen, both ZCURVE_V and the GeneMark VIOLIN database predict gene P. Gene C is correctly predicted by ZCURVE_V, but the GeneMark VIOLIN database misses it. Gene X is correctly predicted by GeneMark, but ZCURVE_V misses it. In addition, ZCURVE_V predicts one additional gene that is embedded within gene P. After adjusting the default settings, i.e., using the 'Keep Overlapping Genes' option, genes S and X are also correctly predicted by ZCURVE_V.
SARS is a life-threatening disease that spread to many countries around the world in 2003 [16]. SARS is caused by a novel coronavirus, called SARS-coronavirus or SARS-CoV. SARS-CoVs belong to the coronaviruses, and their genomes are single-stranded [17]. Among the 14 protein-coding genes annotated in the SARS-CoV TOR2 genome (NC_004718), 12 are found by the ZCURVE_V system. The two genes it misses are completely or nearly completely embedded within other genes and are very unlikely to encode proteins [11], while the GeneMark VIOLIN annotation misses 4 of the 14 annotated genes [9].
In summary, the gene-finding performance of ZCURVE_V for the three well studied life-threatening viruses is generally better than that of GeneMark.

New genes missed by both RefSeq annotations and GenBank annotations
Gene-finding programs may be used to find new protein-coding genes that have been missed from the public databases. Using ZCURVE_V, we find some new genes that are missing from both the RefSeq and GenBank annotations and that have significant similarities with other genes deposited in the public databases, as in the cases of the genomes of bacteriophage VT2-Sa (NC_000902), ectocarpus siliculosus virus (NC_002687) and pseudomonas phage D3 (NC_002484). The detailed predicted results of

Relationship between functions of predicted genes and their VZ scores
Compared with GeneMark, a convenient feature of ZCURVE_V is that a coding potential score, VZ, is provided for every predicted gene. Predicted genes with higher VZ scores are more likely to encode proteins. The bacteriophage P4 genome (NC_001609) is studied here as an example. As shown in Table 5, none of the predicted genes with VZ scores lower than 0.30 has a putative function; in other words, all genes of known function have VZ scores higher than 0.30. Conversely, false positive predictions are likely to be associated with lower VZ scores. Therefore, ZCURVE_V may reduce experimental expenses when the functions of predicted genes are studied, because likely false positives can be excluded on the basis of their coding potential scores VZ.
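As an illustration of how the VZ scores might be used to prioritize predicted genes for experimental follow-up, the sketch below splits a list of predictions at a score cutoff and ranks the retained ones. The gene names and scores are hypothetical placeholders, not values from Table 5, and the 0.30 cutoff is simply the value observed for bacteriophage P4 above; it is not claimed to be a general threshold.

```python
from typing import List, NamedTuple, Tuple


class PredictedGene(NamedTuple):
    name: str   # hypothetical identifier
    vz: float   # coding potential score reported by the gene finder


def split_by_vz(
    genes: List[PredictedGene], cutoff: float = 0.30
) -> Tuple[List[PredictedGene], List[PredictedGene]]:
    """Return (high-confidence genes ranked by VZ, low-scoring genes)."""
    high = sorted((g for g in genes if g.vz >= cutoff), key=lambda g: g.vz, reverse=True)
    low = [g for g in genes if g.vz < cutoff]
    return high, low


# Hypothetical predictions for illustration only.
example = [
    PredictedGene("orf_a", 0.72),
    PredictedGene("orf_b", 0.12),
    PredictedGene("orf_c", 0.41),
    PredictedGene("orf_d", 0.29),
]

high_confidence, low_scoring = split_by_vz(example)
print("study first:", [g.name for g in high_confidence])
print("possible false positives:", [g.name for g in low_scoring])
```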

Preferred utilization of ZCURVE_V in the annotation of anonymous viral genomes
Both members of the GeneMark family for viral and phage gene-finding, the heuristic approach and the VIOLIN database, have some limitations. The heuristic approach [8] is a self-training method, and no human intervention is required while it runs. However, its performance is generally worse than that of the GeneMark VIOLIN database [9]. The GeneMark VIOLIN database provides only an up-to-date analysis of newly sequenced viral genomes and cannot be used to analyze anonymous viral genomes. Like the heuristic approach of the GeneMark family, ZCURVE_V is a self-training method and can analyze any anonymous viral or phage genome without human intervention. Because the executable version of ZCURVE_V may be downloaded and run locally, it is more convenient to use. The more specific options available when running ZCURVE_V strengthen its power. The prediction of ZCURVE_V is more accurate than that of GeneMark for viral or phage genomes shorter than 1000 bp. Therefore, it is suggested that ZCURVE_V may serve as a preferred gene-finding tool for viral and phage genomes, especially for newly sequenced anonymous viral and phage genomes. However, we should point out the limitations of ZCURVE_V in predicting genes for viruses that use alternative coding schemes, including RNA editing, splicing and polyprotein processing. Generally, like GeneMark, ZCURVE_V cannot deal with these special cases.

Joint applications of ZCURVE_V and GeneMark gene-finding family
Both GeneMark and ZCURVE_V are based on statistical characteristics of coding and non-coding sequences. However, the former is Markov-chain-based and mainly considers the local characteristics of the DNA sequence, whereas the latter is Z-curve-based and lays stress on global characteristics. Because of this difference in the underlying algorithms, the predictions of ZCURVE_V and GeneMark differ, although most of the predicted genes are identical. Higher accuracy may be obtained by combining them, that is, by accepting as a gene any ORF predicted by either the ZCURVE_V system or the GeneMark VIOLIN database. Clover yellow mosaic virus (CLYMV, NC_001753), lymphocystis disease virus 1 (LCDV-1, NC_001824), ..., transmissible gastroenteritis virus (TGEV, NC_002306) and yaba-like disease virus (YLDV, NC_002642) are chosen to demonstrate the effectiveness of the joint application of both systems. The results are listed in Table 6. As can be seen, the number of genes missed by the ZCURVE_V program decreases significantly, although the number of additional predicted genes increases. Developing integrated genome annotation platforms by jointly applying two or more systems based on different statistical analysis principles has become an active topic [10,19]. Similarly, the joint application of two or more viral gene-finding programs is both necessary and feasible. ZCURVE_V, GeneMark and other programs may all be combined to reach more accurate results. One referee of the manuscript points out that combining prediction programs based on statistical measures, such as ZCURVE_V, with the detection of functional motifs, sequence similarity, conservation of orthologs, presence of regulatory signals, etc., would be useful. Methods based on sequence similarity and conservation of orthologs may effectively reduce false positive predictions. In any case, no single program can be used in isolation to make accurate predictions of the gene complement of any viral genome.
Therefore, the use of multiple programs is always warranted. However, no concrete approach is provided here for combining these different kinds of information into a unified tool that reaches the maximum accuracy. This is a topic for further study and is not included in the present paper.
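The joint application described above amounts to taking the union of two prediction sets. The sketch below shows one way this could be done, assuming (as is common when gene finders are compared, but not stated in this paper) that two predictions refer to the same gene when they lie on the same strand and share the same 3' end; the `Gene` type, the coordinates and the function names are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Gene(NamedTuple):
    start: int   # leftmost coordinate, 1-based
    end: int     # rightmost coordinate, inclusive
    strand: str  # "+" or "-"


def three_prime_key(gene: Gene) -> Tuple[str, int]:
    """Identify a gene by strand and 3' end (assumed matching criterion)."""
    return (gene.strand, gene.end if gene.strand == "+" else gene.start)


def joint_predictions(first: Iterable[Gene], second: Iterable[Gene]) -> List[Gene]:
    """Union of two prediction sets; on a shared 3' end the first tool's call is kept."""
    merged = {}
    for gene in list(first) + list(second):
        merged.setdefault(three_prime_key(gene), gene)
    return sorted(merged.values())


# Hypothetical coordinates for illustration only.
tool_a = [Gene(100, 400, "+"), Gene(500, 800, "-")]
tool_b = [Gene(130, 400, "+"), Gene(900, 1200, "+")]

combined = joint_predictions(tool_a, tool_b)
print(combined)
```

A union of this kind can only lower the number of missed genes and can only raise the number of additional predictions, which is the trade-off reported for Table 6.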

Conclusion
A new self-training system, ZCURVE_V, for finding genes in viral and phage genomes has been proposed. The new system has been run on 979 viral and 212 phage genomes, and satisfactory results are obtained. For a fair comparison with GeneMark, the currently available software of similar function, a total of 30 viral genomes that have not been annotated by GeneMark were selected for testing. The average specificity of the two systems is well matched; however, the average sensitivity of ZCURVE_V for smaller viral genomes (< 100 kb), which constitute the majority of viral genomes sequenced so far, is higher than that of GeneMark. Additionally, for the genome of Amsacta moorei entomopoxvirus, which probably has the lowest genomic GC content among the sequenced organisms, ZCURVE_V is much more accurate than GeneMark, because the latter predicts hundreds of false-positive genes.
ZCURVE_V is also used to analyze some well-studied genomes, such as HIV-1, HBV and SARS-CoV, and its performance is generally better than that of GeneMark. Finally, GeneMark is not downloadable, whereas ZCURVE_V may be downloaded and run locally, which makes it easier to use. Based on the above merits, it is suggested that ZCURVE_V may serve as a preferred gene-finding tool for newly sequenced viral and phage genomes. However, it is also shown that joint application of both systems, ZCURVE_V and GeneMark, leads to better gene-finding results. The system ZCURVE_V is freely available at: http://tubic.tju.edu.cn/Zcurve_V/.

Methods
A total of 979 viral and 212 phage genome records were downloaded from GenBank release 141.0 [20]. Each record corresponds to a genome or a genomic segment. The corresponding RefSeq annotations for these genomes were downloaded before July 20, 2004 [21]. For all of the viral and phage genomes, the predicted results of the GeneMark VIOLIN database [22] were also downloaded before July 20, 2004.
The present gene-finding method consists of four steps.

(1) Extracting the seed ORF for the analyzed genome
In the present algorithm, only one seed ORF is required for a viral genome. This seed ORF is selected using a simple approach. It is found that the longest ORF in a genome is very likely to be a protein-coding gene. This ORF is called the 'Maximum ORF' in this paper. After carefully investigating over 100 viral genomes that have annotated genes, we found that the deduction that the 'Maximum ORF' is a gene holds. For two very small viral genomes, cereal yellow dwarf virus-RPV satellite RNA (NC_003533) and arabis mosaic virus small satellite RNA (NC_001546), there are no genes at all, indicating that the seed ORF so obtained is meaningless for these two genomes. If the 'Maximum ORF' is longer than 400 bp, it is directly regarded as a seed ORF (gene). However, if the 'Maximum ORF' is shorter than 400 bp, it is regarded as a seed ORF only if the base composition at the second codon position satisfies the inequality $G_2 < (A_2 + C_2 + T_2)/3 + 0.1$, where $A_2$, $C_2$, $G_2$ and $T_2$ are the occurrence frequencies of the bases at the second codon position of the ORF. This inequality approximately reflects the fact that bases at the second codon position lack guanine to some degree [23]. If a seed ORF is found, it is used as a training sample to calculate the related parameters. If no seed ORF is found, the analyzed viral genome is taken to contain no functional genes.
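The seed selection rule of step (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the ZCURVE_V implementation: an ORF is assumed to run from an ATG to the first in-frame stop codon (the paper does not specify the start codon set), both strands are scanned, the stop codon is excluded when the second-position frequencies are computed, and all function names are hypothetical.

```python
from typing import Iterator, Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def orfs_on_strand(seq: str) -> Iterator[str]:
    """Yield ORFs (ATG ... stop, inclusive) in the three frames of one strand."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                yield seq[start:i + 3]
                start = None


def maximum_orf(genome: str) -> Optional[str]:
    """Longest ORF over both strands, or None if the genome has no ORF."""
    genome = genome.upper()
    candidates = list(orfs_on_strand(genome))
    candidates += list(orfs_on_strand(reverse_complement(genome)))
    return max(candidates, key=len, default=None)


def passes_second_position_rule(orf: str) -> bool:
    """Check G2 < (A2 + C2 + T2)/3 + 0.1 on the second codon positions."""
    second = orf[:-3][1::3]  # assumption: the stop codon is not counted
    n = len(second)
    a2, c2, g2, t2 = (second.count(base) / n for base in "ACGT")
    return g2 < (a2 + c2 + t2) / 3 + 0.1


def select_seed_orf(genome: str, length_cutoff: int = 400) -> Optional[str]:
    """Return the seed ORF, or None when no acceptable seed exists."""
    orf = maximum_orf(genome)
    if orf is None:
        return None
    if len(orf) > length_cutoff or passes_second_position_rule(orf):
        return orf
    return None
```

For example, `select_seed_orf("CCATG" + "GCT" * 10 + "TAACC")` returns the 36 bp ORF because guanine is rare at its second codon positions, whereas a short ORF built from `"GGG"` codons is rejected.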

(2) Training the parameter used to describe the coding potential
The methodology adopted here is based on the Z curve [12], which is an alternative representation of DNA sequences.
Here the algorithm is presented briefly. The frequencies of bases A, C, G and T occurring in an ORF or a fragment of DNA sequence at positions 1, 4, 7, ...; 2, 5, 8, ...; and 3, 6, 9, ... are denoted by $a_1, c_1, g_1, t_1$; $a_2, c_2, g_2, t_2$; and $a_3, c_3, g_3, t_3$, respectively. They are the frequencies of bases at the 1st, 2nd and 3rd codon positions. Based on the Z curve [12], $a_i, c_i, g_i, t_i$ are mapped onto a point $P_i$ in a 3-dimensional space $V_i$, $i = 1, 2, 3$. The coordinates of $P_i$, denoted by $x_i, y_i, z_i$, are determined by the Z-transform of DNA sequence [12]:
$$\begin{cases} x_i = (a_i + g_i) - (c_i + t_i), \\ y_i = (a_i + c_i) - (g_i + t_i), \\ z_i = (a_i + t_i) - (g_i + c_i), \end{cases} \qquad x_i, y_i, z_i \in [-1, 1], \quad i = 1, 2, 3. \qquad (3)$$
The Z-transform of DNA sequence transforms the four frequencies of DNA bases into the coordinates of a point in a 3-dimensional space. In addition to the frequencies of codon-position-dependent single nucleotides, we need to consider the frequencies of phase-specific dinucleotides. Let the frequencies of the 16 dinucleotides AA, AC, ..., TT occurring at codon positions 1-2 and 2-3 of an ORF or a fragment of DNA sequence be denoted by $p_{12}(AA), p_{12}(AC), \ldots, p_{12}(TT)$ and $p_{23}(AA), p_{23}(AC), \ldots, p_{23}(TT)$, respectively. Using the Z-transform [12], we find
$$\begin{cases} x_X^k = [p_k(XA) + p_k(XG)] - [p_k(XC) + p_k(XT)], \\ y_X^k = [p_k(XA) + p_k(XC)] - [p_k(XG) + p_k(XT)], \\ z_X^k = [p_k(XA) + p_k(XT)] - [p_k(XG) + p_k(XC)], \end{cases} \qquad X = A, C, G, T, \quad k = 12, 23, \qquad (4)$$
where $x_X^k$, $y_X^k$ and $z_X^k$ are the coordinates. Together, the 9 coordinates of (3) and the 24 coordinates of (4) give 33 parameters (5) for the seed ORF, which corresponds to a point O in the 33-dimensional space. These 33 parameters will be used to differentiate coding from non-coding ORFs.
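The 33 variables can be computed for any ORF along the following lines. This is a sketch rather than the ZCURVE_V source code: it assumes the standard Z-transform, x = (a + g) - (c + t), y = (a + c) - (g + t), z = (a + t) - (g + c), applied to the three codon-position base frequencies (9 values) and, for each first base X, to the phase-specific dinucleotide frequencies at codon positions 1-2 and 2-3 (24 values). Dinucleotide frequencies are assumed to be normalized by the number of codons, and the function names are illustrative.

```python
from typing import List, Tuple

BASES = "ACGT"


def z_transform(a: float, c: float, g: float, t: float) -> Tuple[float, float, float]:
    """Map four frequencies onto the three Z-curve coordinates."""
    return ((a + g) - (c + t), (a + c) - (g + t), (a + t) - (g + c))


def zcurve_33(orf: str) -> List[float]:
    """Return the 33 Z-curve variables of an ORF read in frame from position 1."""
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    n = len(codons)
    if n == 0:
        raise ValueError("the sequence must contain at least one codon")

    features: List[float] = []

    # 9 variables: single-nucleotide frequencies at codon positions 1, 2, 3.
    for pos in range(3):
        freq = {b: sum(codon[pos] == b for codon in codons) / n for b in BASES}
        features.extend(z_transform(freq["A"], freq["C"], freq["G"], freq["T"]))

    # 24 variables: dinucleotides XY at codon positions 1-2 and 2-3,
    # transformed over the second base Y for each first base X.
    for pos in (0, 1):
        for x in BASES:
            p = {y: sum(codon[pos:pos + 2] == x + y for codon in codons) / n
                 for y in BASES}
            features.extend(z_transform(p["A"], p["C"], p["G"], p["T"]))

    return features


seed_point = zcurve_33("ATGGCT")
print(len(seed_point), seed_point[:6])
```

For the toy sequence `ATGGCT`, the first codon position holds A and G in equal proportion, so the first three coordinates are (1, 0, 0); the second position holds T and C, giving (-1, 0, 0).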

(3) Seeking all ORFs and predicting possible protein-coding genes
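All ORFs longer than a given value, for example 90 bp, are extracted as candidate genes. Each ORF is represented by a point in the 33-dimensional space, and the Euclidean distance from this point to the point O is calculated:
$$D = \sqrt{\sum_{i=1}^{33} \left(u_i - u_i^{O}\right)^2} \qquad (6)$$
where $u_i$ and $u_i^{O}$ denote the $i$-th of the 33 parameters of the ORF and of the point O, respectively.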
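The distance computation of step (3) can be illustrated as below. The sketch only covers the Euclidean distance between each candidate ORF and the seed point O and a ranking by that distance; how the distance is converted into the VZ score is not reproduced here. The vectors in the example are short hypothetical placeholders rather than real 33-dimensional Z-curve vectors, and the function names are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    if len(u) != len(v):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_by_distance(
    seed_point: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[Tuple[float, str]]:
    """Return (distance, name) pairs, closest to the seed point first."""
    return sorted(
        (euclidean_distance(vector, seed_point), name)
        for name, vector in candidates.items()
    )


# Hypothetical 3-dimensional stand-ins for the 33-dimensional vectors.
seed = [0.0, 0.0, 0.0]
orfs = {"orf_1": [3.0, 4.0, 0.0], "orf_2": [0.1, 0.0, 0.0]}

ranking = rank_by_distance(seed, orfs)
print(ranking)
```

ORFs whose points lie close to O resemble the seed ORF in composition and are therefore the stronger coding candidates.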

(4) Dealing with overlapping ORFs
Among the ORFs with a VZ score larger than 0, some are falsely predicted as genes because they overlap with coding ORFs. In the development of the ZCURVE system, a strategy was proposed to deal with overlapping ORFs [4]. Later, this strategy was adopted again in the ZCURVE_CoV system [11]. Here the same strategy is employed once more, with the related parameters adjusted because the definition of the coding potential score has changed. Briefly, if the VZ score of the longer of two overlapping ORFs minus a given value is still larger than that of the shorter one, the longer ORF is recognized as a gene and the shorter one as non-coding. Otherwise, both are kept as coding. For more detail, refer to [4].
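The pairwise rule for overlapping ORFs can be written down compactly. In the sketch below the `margin` parameter stands for the "given value" mentioned above; its default of 0.1 is a placeholder chosen for illustration, not the value used by ZCURVE_V, and the `ORF` type and the coordinates are hypothetical.

```python
from typing import List, NamedTuple


class ORF(NamedTuple):
    start: int   # 1-based, inclusive
    end: int     # inclusive
    vz: float    # coding potential score

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def resolve_overlap(a: ORF, b: ORF, margin: float = 0.1) -> List[ORF]:
    """Apply the longer-versus-shorter rule to two overlapping ORFs.

    If the longer ORF's score minus `margin` still exceeds the shorter
    ORF's score, only the longer ORF is kept; otherwise both are kept.
    """
    longer, shorter = (a, b) if a.length >= b.length else (b, a)
    if longer.vz - margin > shorter.vz:
        return [longer]
    return [longer, shorter]


# Hypothetical overlapping pairs for illustration only.
clear_case = resolve_overlap(ORF(1, 900, 0.80), ORF(700, 1000, 0.20))
close_case = resolve_overlap(ORF(1, 900, 0.50), ORF(700, 1000, 0.45))
print(len(clear_case), len(close_case))
```

In the first pair the longer ORF wins outright; in the second the scores are too close, so both ORFs are retained as coding.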
There are three main differences between the present viral gene-finding system ZCURVE_V and our previously reported bacterial gene-finding system ZCURVE. Firstly, the two systems generate seed ORFs differently: ZCURVE_V simply selects the 'Maximum ORF', whereas ZCURVE selects long, non-overlapping ORFs as seed ORFs. Secondly, no negative samples (non-coding sequences) are required in the training set of the ZCURVE_V algorithm. Thirdly, the Euclidean distance discriminant method is used here instead of the Fisher linear discriminant algorithm. Owing to these adaptations, the ZCURVE_V system is capable of recognizing protein-coding genes in any anonymous viral or phage genome, even in those shorter than 1000 bp.