ZCURVE_V: a new self-training system for recognizing protein-coding genes in viral and phage genomes
BMC Bioinformatics volume 7, Article number: 9 (2006)
It is necessary to use highly accurate, statistics-based systems for viral and phage genome annotation. The GeneMark systems for gene-finding in virus and phage genomes suffer from some basic drawbacks. This paper puts forward an alternative approach for viral and phage gene-finding to improve the quality of annotations, particularly for newly sequenced genomes.
The new system ZCURVE_V has been run on 979 viral and 212 phage genomes, and satisfactory results are obtained. For a fair comparison with GeneMark, the currently available software of similar function, a total of 30 viral genomes that have not been annotated by GeneMark are selected for testing. The average specificity of the two systems is well matched; however, the average sensitivity of ZCURVE_V for smaller viral genomes (< 100 kb), which constitute the majority of viral genomes sequenced so far, is higher than that of GeneMark. Additionally, for the genome of Amsacta moorei entomopoxvirus, probably with the lowest genomic GC content among the sequenced organisms, the accuracy of ZCURVE_V is much better than that of GeneMark, because the latter predicts hundreds of false-positive genes. ZCURVE_V is also used to analyze well-studied genomes, such as HIV-1, HBV and SARS-CoV, where its performance is generally better than that of GeneMark. Finally, ZCURVE_V may be downloaded and run locally, which facilitates its use, whereas GeneMark is not downloadable. Based on the above comparison, it is suggested that ZCURVE_V may serve as a preferred gene-finding tool for newly sequenced viral and phage genomes. However, it is also shown that the joint application of both systems, ZCURVE_V and GeneMark, leads to better gene-finding results. The system ZCURVE_V is freely available at: http://tubic.tju.edu.cn/Zcurve_V/.
ZCURVE_V may serve as a preferred gene-finding tool used for viral and phage genomes, especially for anonymous viral and phage genomes newly sequenced.
Developments in DNA sequencing technology have resulted in a rapid expansion of genome data. Exploring these genomes and maximizing the scientific knowledge gained from them has become a challenging issue. The first step in analyzing a completely or partially sequenced genome is to identify all its genes. Accurate gene recognition is relevant to many biological applications, for example, DNA microarrays, knockout experiments and drug design. There exist some well-known computer systems for gene-finding in bacterial and archaeal genomes. These systems are based either on statistical analysis, such as GeneMarkS, Glimmer [2, 3] and ZCURVE, or on similarity alignment, such as CRITICA and ORPHEUS. Generally, satisfactory predictions are obtained with the above statistics-based software. In contrast, annotation of newly sequenced viral and phage genomes is frequently based on similarity search methods such as BLAST. Although similarity search methods achieve high specificity, some species-specific genes are likely to be missed. Evidence shows that an open reading frame (ORF) that is longer than a given length and does not overlap, or only slightly overlaps, with any adjacent ORF is likely to be a gene. However, simply assigning all such ORFs as genes usually generates over-prediction. Therefore, it is necessary to use highly accurate, statistics-based systems for viral genome annotation. Unfortunately, there are currently very few satisfactory statistics-based viral gene-finding systems apart from the GeneMark gene-finding family [8, 9]. However, the GeneMark systems for gene-finding in virus and phage genomes suffer from some basic drawbacks. The aim of this paper is to put forward an alternative approach for viral and phage gene-finding to improve the quality of annotations, particularly for newly sequenced genomes.
The ZCURVE system for finding protein-coding genes in bacterial and archaeal genomes, developed by our group, has been used in 40 laboratories or institutes all over the world. In a recent paper, ZCURVE and two other well-known bacterial gene-finding systems, Glimmer and CRITICA, were combined into a metatool named YACOP. By adapting the ZCURVE algorithm, a new system specific to coronavirus genomes, ZCURVE_CoV, was subsequently developed. The results of ZCURVE_CoV are highly consistent with GenBank annotations for coronavirus genomes, especially for SARS-CoV genomes. However, the above software cannot simply be used to identify protein-coding genes in other viral or phage genomes. Here, a self-training system, ZCURVE_V, is presented to address the problem. Similar to ZCURVE and ZCURVE_CoV, the present ZCURVE_V system is also based on the Z curve representation of DNA sequences. Compared with the most widely used viral gene-finding system, the GeneMark family [8, 9], the algorithm of ZCURVE_V is much simpler, because only 33 recognition variables are needed. ZCURVE_V is therefore conceptually different from GeneMark. Compared with GeneMark, ZCURVE_V gives better predictions for smaller viral genomes (< 100 kb). In addition, the performance of ZCURVE_V is generally better than that of GeneMark for genomes with particular features, such as Amsacta moorei entomopoxvirus, probably with the lowest genomic GC content among all the organisms sequenced so far. Moreover, it is also shown that joint applications of ZCURVE_V and GeneMark lead to better gene-finding results for viral and phage genomes.
Results and Discussions
Indices to evaluate ZCURVE_V
The ZCURVE_V system has been run for 979 viral and 212 phage genome records, respectively. The default settings are adopted for all the options unless indicated otherwise. Evaluation of ZCURVE_V is based on the comparison between the gene-finding results and the RefSeq annotations for each genome. It should be noted that the RefSeq records are usually listed as provisional and have not themselves undergone extensive curation and literature cross-checking. However, to test and compare the performance of the presented algorithm we do need some criteria. Knowing that the RefSeq records are questionable, we chose to select those RefSeq data which possess the maximum reliability. For example, gene annotations in HIV, HBV and coronavirus are well known in the literature. Therefore, these three viruses are selected as samples to test and compare the algorithm. Other RefSeq records are selected similarly. Due to the inaccuracy of the RefSeq annotations currently available, the comparison between the performance of GeneMark and ZCURVE_V based on the RefSeq annotations should be deemed as preliminary. Future and more reliable comparison should be based on experimentally verified data, rather than RefSeq annotations. Two independent indices defined by formulas (1) and (2) are used to evaluate the performance of ZCURVE_V 
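$$ S_n = \frac{TP}{TP + FN}, \tag{1} $$
$$ S_p = \frac{TP}{TP + FP}, \tag{2} $$
where S_n is the sensitivity, S_p is the specificity, and TP, FP and FN are the numbers of true positive, false positive and false negative predictions, respectively.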
Comparisons with GeneMark (I): viral genomes with different chromosome lengths
The GeneMark gene-finding web server provides two alternative approaches for viral genome annotation, i.e., online prediction using a heuristic approach (or using the GeneMarkS program for viral genomes longer than 100 kb) and the VIOLIN database [1, 8, 9]. Generally speaking, the results obtained by the latter are more accurate than those obtained by the former. To strictly evaluate the performance of ZCURVE_V, the predicted results deposited in the GeneMark VIOLIN database are employed unless they are not available. A total of 30 viral genomes not annotated by GeneMark were used for the comparison, whose names are listed as follows: canarypox virus (abbreviation name: CNPV, RefSeq AC: NC_005309), fowlpox virus (FPV, NC_002188), tupaia herpesvirus (THV, NC_002794), african swine fever virus (ASFV, NC_001659), myxoma virus (MYXV, NC_001132), shope fibroma virus (SFV, NC_001266), yaba-like disease virus (YLDV, NC_002642), orf virus (ORFV, NC_005336), bovine papular stomatitis virus (BPSV, NC_005337), Autographa californica nucleopolyhedrovirus (AcNPV, NC_001623), Bombyx mori nucleopolyhedrovirus (BmNPV, NC_001962), Phthorimaea operculella granulovirus (PhoGV, NC_004062), Adoxophyes honmai nucleopolyhedrovirus (AdhoNPV, NC_004690), lymphocystis disease virus 1 (LCDV-1, NC_005902), Plutella xylostella granulovirus (PxGV, NC_002593), Adoxophyes orana granulovirus (AdorGV, NC_005038), Neodiprion lecontei nucleopolyhedrovirus (NeleNPV, NC_005906), fowl adenovirus D (FAdV-9, NC_000899), porcine adenovirus C (PAdV-5, NC_002702), avian infectious bronchitis virus (IBV, NC_001451), citrus tristeza virus (CTV, NC_001661), simian hemorrhagic fever virus (SHFV, NC_003092), beet yellows virus (BYV, NC_001598), fer-de-lance virus (FDLV, NC_005084), equine arteritis virus (EAV, NC_002532), semliki forest virus (SFV, NC_003215), bean common mosaic necrosis virus (BCMV, NC_004047), garlic latent virus (GLV, NC_003557), figwort mosaic virus (FMV, NC_003554), and southern cowpea mosaic virus (SCMV, NC_001625), respectively.
The predicted results for the 30 viral genomes are listed in Table 1, where the genomes are listed in descending order of chromosome sequence length. For the 15 viral genomes in Table 1 with chromosome sequences longer than 100 kb, the average Sn of ZCURVE_V and GeneMark is 95.9% and 92.5%, respectively, and the average Sp of ZCURVE_V and GeneMark is 92.8% and 94.0%, respectively. For the 15 viral genomes in Table 1 with chromosome sequences shorter than 100 kb, the average Sn of ZCURVE_V and GeneMark is 95.6% and 80.0%, whereas the average Sp of ZCURVE_V and GeneMark is 93.2% and 91.1%, respectively. As can be seen, both the average Sn and Sp of ZCURVE_V for the small viral genomes are similar to those for the large viral genomes, whereas the average Sn of GeneMark for small viral genomes is much lower than that for large viral genomes. Note that viral and phage genomes shorter than 100 kb constitute the major part of viral and phage genomes sequenced so far. Over 90% of the 979 viral and 212 phage genomes analyzed here are shorter than 100 kb. If the average is taken over all 30 genomes, Sn and Sp are 95.7% and 93.0% for ZCURVE_V, respectively, whereas Sn and Sp are 86.2% and 92.5% for GeneMark. In summary, the Sp of both systems is well matched, but the Sn of ZCURVE_V is much higher (by about 9.5 percentage points) than that of GeneMark. Although Glimmer [2, 3] was designed for gene-finding in bacterial genomes, the gene-finding results of Glimmer 2.02 for all 30 genomes are also listed in Table 1 for comparison.
Comparisons with GeneMark (II): viral genomes with particular genomic features
Among the viruses curated by NCBI staff, two satellite viruses have genomic sequences shorter than 1000 bp, which are the cereal yellow dwarf virus-RPV satellite RNA (CYDV-RPV satRNA, NC_003533) and panicum mosaic satellite virus (satPaMV, NC_003847). Satellite maize white line mosaic virus (SV-MWLMV, NC_003631) and strawberry latent ringspot virus satellite RNA (SLRSV, NC_003848) have the sequence length a little bit longer than 1000 bp. As can be seen from Table 2, the gene-finding results of ZCURVE_V are more consistent with the RefSeq annotations than those of GeneMark for the four very small viral genomes.
The genome of Amsacta moorei entomopoxvirus (AmEPV, NC_002520) was sequenced in 2000. To our knowledge, its genomic GC content of 17.78% is the lowest among all organisms completely sequenced so far. In the original annotation by the submitter of the GenBank entries, all ORFs larger than 180 bp are predicted as possible protein-coding genes. Such an annotation method is very likely to generate over-annotation. The current RefSeq annotation curated by NCBI staff remains nearly the same as the original annotation, i.e., the genome contains 295 possible genes. After running ZCURVE_V, 245 of the 295 annotated genes are found and the number of additionally predicted genes is 5. Among the 50 (295-245) genes not predicted by ZCURVE_V, only one gene has a putative function and another two are similar to existing genes without functions in public databases, while the remaining 47 are only annotated as 'hypothetical proteins'. The result supports the notion that protein-coding genes are over-annotated in the Amsacta moorei entomopoxvirus genome. The GeneMark VIOLIN database correctly predicts 239 annotated genes, while the number of additionally predicted genes is as high as 323. Most of these additional genes predicted by the GeneMark VIOLIN database are likely to be non-coding ORFs. The severe over-prediction of the GeneMark VIOLIN database for the Amsacta moorei entomopoxvirus genome is perhaps caused by its weak adaptability to genomes with particular features.
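The AmEPV counts above can be turned into sensitivity and specificity values. The following is a minimal sketch, assuming the usual gene-finding definitions Sn = TP/(TP + FN) and Sp = TP/(TP + FP) and treating the 295 RefSeq genes as the reference set; the function and variable names are illustrative and not part of ZCURVE_V.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sn = TP / (TP + FN): fraction of annotated genes that were predicted."""
    return tp / (tp + fn)


def specificity(tp: int, fp: int) -> float:
    """Sp = TP / (TP + FP): fraction of predicted genes that are annotated."""
    return tp / (tp + fp)


ANNOTATED = 295  # RefSeq genes in the AmEPV genome (from the text)

# (annotated genes found, additionally predicted genes), both from the text
counts = {"ZCURVE_V": (245, 5), "GeneMark VIOLIN": (239, 323)}

results = {}
for name, (tp, fp) in counts.items():
    fn = ANNOTATED - tp  # annotated genes that were missed
    results[name] = (sensitivity(tp, fn), specificity(tp, fp))
    sn, sp = results[name]
    print(f"{name}: Sn = {sn:.1%}, Sp = {sp:.1%}")
```

Under these definitions the two systems find a similar share of the annotated genes, while the 323 additional predictions of the GeneMark VIOLIN database pull its specificity far below that of ZCURVE_V.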
Applying ZCURVE_V to HIV-1, HBV and SARS-CoV genomes
According to the report "AIDS Epidemic Update 2004" launched by WHO and UNAIDS: the total number of people living with the human immunodeficiency virus (HIV) increased in 2004 to reach its highest level ever: an estimated 39.4 million people are living with the virus . The global AIDS epidemic killed 3.1 million people in the past year. In the current GenBank annotation for HIV-1 (GenBank AC: AF033819), 9 protein-coding genes are contained, in which 7 genes are single-exon genes without any intron. Genes tat and rev have one intron, respectively. When using default settings, ZCURVE_V and the GeneMark VIOLIN database predict 7 and 6 genes for the genome, respectively. The predicted results are listed in Table 3. As can be seen, both ZCURVE_V and the GeneMark VIOLIN database predict the 5 annotated single-exon genes pol, gag, vpr, env and nef. The single-exon gene vif is correctly predicted by ZCURVE_V, whereas the GeneMark VIOLIN database misses it. The single-exon gene vpu is correctly predicted by the GeneMark VIOLIN database, whereas ZCURVE_V misses it. In addition, ZCURVE_V correctly predicts the 5' end for the intron-contained gene tat. After adjusting the default settings, i.e., using the 'Keep Overlapping Genes' option, the gene vpu and one additional gene located at positions 7602–7694 bp are predicted by ZCURVE_V.
Hepatitis B virus is another virus that severely threatens human health. Currently, the GenBank annotation contains 4 single-exon genes for HBV (GenBank AC: X04615). Among the 4 genes, gene P is composed of two fragments. When using default settings, ZCURVE_V and the GeneMark VIOLIN database predict 3 and 2 genes for the genome, respectively. The predicted results are listed in Table 4. As can be seen, both ZCURVE_V and the GeneMark VIOLIN database predict gene P. Gene C is correctly predicted by ZCURVE_V, but the GeneMark VIOLIN database misses it. Gene X is correctly predicted by GeneMark, but ZCURVE_V misses it. In addition, ZCURVE_V predicts one additional gene that is embedded within gene P. After adjusting the default settings, i.e., using the 'Keep Overlapping Genes' option, genes S and X are also correctly predicted by ZCURVE_V.
SARS is a life-threatening disease that spread to many countries around the world in 2003. SARS is caused by a novel coronavirus, called SARS-coronavirus or SARS-CoV. SARS-CoVs are coronaviruses and their genomes are single-stranded. Among the 14 protein-coding genes annotated in the SARS-CoV TOR2 genome (NC_004718), 12 genes are found by the ZCURVE_V system. The two genes missed by it are completely or nearly completely embedded within other genes and are very unlikely to encode proteins, while the GeneMark VIOLIN annotation misses 4 of the 14 annotated genes.
In summary, the gene-finding performance of ZCURVE_V for the three well studied life-threatening viruses is generally better than that of GeneMark.
New genes missed by both RefSeq annotations and GenBank annotations
Gene-finding programs may be used to find new protein-coding genes that have been missed from the public databases. Using ZCURVE_V, we find some new genes missed from both the RefSeq and GenBank annotations that have significant similarities with other genes deposited in the public databases, as in the genomes of bacteriophage VT2-Sa (NC_000902), ectocarpus siliculosus virus (NC_002687) and pseudomonas phage D3 (NC_002484). The detailed predicted results of ZCURVE_V for the three genomes are listed in the Appendix. Consider first a new gene located at positions c4872–c5093 of the phage VT2-Sa genome, coding for a putative protein of 72 amino acids. Using a BLASTP search against the NR databases, a significant similarity (E-value for BLASTP = 6e-25, Identities = 100%) with a gene (RefSeq AC: NP_308832) has been found, implying that the predicted gene codes for a C4-type zinc finger protein in Stx1 and Stx2 converting bacteriophage genomes. It should be noted that the GeneMark VIOLIN database also misses this gene. Another noticeable new gene is located at positions 55,832–56,248 bp on the direct strand of the pseudomonas phage D3 genome. The amino acid sequence of the protein encoded by this gene is found to have a significant similarity (E-value for BLASTP = 8e-15, Identities = 58%) with the phage holin protein (RefSeq AC: NP_743718) found in the pseudomonas putida KT2440 genome. It also has a significant similarity (E-value = 4e-08, Identities = 52%) with the lysis protein (RefSeq AC: NP_892111) found in the genome of bacteriophage PY54. Because the two new genes have very significant similarities with function-known genes in public databases, they are likely to be functional genes missed in both the GenBank and RefSeq annotations. Following our suggestions to NCBI staff, they have now been included in the current RefSeq annotations (RefSeq AC: YP_089649 and YP_138545, respectively).
Relationship between functions of predicted genes and their VZ scores
Compared with GeneMark, a more convenient feature of ZCURVE_V is that the coding potential scores VZ are provided for all of the predicted genes. The predicted genes with higher VZ scores have higher possibility to encode proteins. Bacteriophage P4 genome (NC_001609) is studied here as an example. As is shown in Table 5, all the predicted genes with VZ scores lower than 0.30 have no putative functions, in other words, all the function-known genes have the VZ scores higher than 0.30. On the other hand, it is possible that false positive predictions are generally associated with lower VZ scores. Therefore, the use of ZCURVE_V may reduce experimental expenses when studying functions of predicted genes by excising false positive predicted genes, based on the associated coding potential scores VZ.
Preferred utilization of ZCURVE_V in the annotation of anonymous viral genomes
All the GeneMark family, the heuristic approach and the VIOLIN database for viral and phage gene-finding have some limitations. Heuristic approach  is a self-training method and no human intervention is required during the running process. However, the performance of heuristic approach is generally worse than that of the GeneMark VIOLIN database . The GeneMark VIOLIN database provides just an up-to-date analysis of newly sequenced viral genomes and is not able to be used to analyze anonymous viral genomes. Similar to the heuristic approach of GeneMark family, the ZCURVE_V is also a self-training method and enables analyzing any anonymous viral and phage genomes without any human intervention. Because the executable version of the program ZCURVE_V may be downloaded and run locally, it will be used more conveniently. More specific options when running ZCURVE_V strengthen its power. The prediction of ZCURVE_V is more accurate than that of GeneMark for viral or phage genomes shorter than 1000 bp. Therefore, it is suggested that ZCURVE_V may serve as a preferred gene-finding tool for viral and phage genomes, especially for anonymous viral and phage genomes newly sequenced. However, we should point out the limitations of ZCURVE_V when predicting genes for viruses that use alternative coding schemes. This includes RNA editing, splicing, polyprotein processing, etc. Generally, like GeneMark, ZCURVE_V cannot deal with the above special cases.
Joint applications of ZCURVE_V and GeneMark gene-finding family
Both GeneMark and ZCURVE_V are based on statistical characteristics of coding (non-coding) sequences. However, the former is Markov-chain-based and mainly considers the local characteristics of a DNA sequence, whereas the latter is Z-curve-based and lays stress on global characteristics. Owing to this difference between the underlying algorithms, the predictions of ZCURVE_V and GeneMark differ, although most of the predicted genes are identical. Higher accuracy may be obtained by combining them, such that genes predicted by either the ZCURVE_V system or the GeneMark VIOLIN database are finally predicted as genes. Clover yellow mosaic virus (CLYMV, NC_001753), lymphocystis disease virus 1 (LCDV-1, NC_001824), ....transmissible gastroenteritis virus (TGEV, NC_002306) and yaba-like disease virus (YLDV, NC_002642) are chosen to demonstrate the effectiveness of joint applications of both systems. The results are listed in Table 6. As can be seen, the number of genes missed by the ZCURVE_V program decreases significantly, although the number of additional predicted genes increases. Developing integrated genome annotation platforms through the joint application of two or more systems based on different statistical analysis principles has become an active research topic [10, 19]. Similarly, joint applications of two or more viral gene-finding programs are both necessary and feasible. ZCURVE_V, GeneMark and other programs may all be combined to reach more accurate results. One referee of the manuscript points out that combining prediction programs based on statistical measures, such as ZCURVE_V, with the detection of functional motifs, sequence similarity, conservation of orthologs, presence of regulatory signals, etc., would be useful. Methods based on sequence similarity and conservation of orthologs may effectively reduce false positive predictions. In any case, no single program can be used in isolation to make accurate predictions of the gene complement of any viral genome. Therefore, the use of multiple programs is always warranted. However, no concrete approach is provided here for combining these different kinds of information into a unified tool that reaches the maximum accuracy. This is a topic for further study and is not included in the present paper.
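The joint application described above amounts to taking the union of the two prediction sets. The following is a minimal sketch under stated assumptions: each gene is identified by its strand and stop coordinate (an assumed convention for deciding when two predictions are "identical"), and all coordinates below are hypothetical toy values, not results from Table 6.

```python
def missed_and_additional(predicted: set, annotated: set) -> tuple:
    """Return (number of annotated genes missed, number of additional predictions)."""
    return len(annotated - predicted), len(predicted - annotated)


# Hypothetical toy data: genes keyed by (strand, stop coordinate).
annotated = {("+", 300), ("+", 900), ("-", 1500), ("+", 2100)}
zcurve_v = {("+", 300), ("+", 900), ("+", 2600)}
genemark = {("+", 300), ("-", 1500), ("-", 3000)}

# Joint application: a gene predicted by either system is kept.
joint = zcurve_v | genemark

print("ZCURVE_V alone:", missed_and_additional(zcurve_v, annotated))
print("Joint:         ", missed_and_additional(joint, annotated))
```

In this toy case the union misses fewer annotated genes than ZCURVE_V alone but carries more additional predictions, which mirrors the trade-off reported for the joint application.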
A new self-training system, ZCURVE_V, for finding genes in viral and phage genomes has been proposed. The new system has been run on 979 viral and 212 phage genomes, and satisfactory results are obtained. For a fair comparison with GeneMark, the currently available software of similar function, a total of 30 viral genomes that have not been annotated by GeneMark are selected for testing. The average specificity of the two systems is well matched; however, the average sensitivity of ZCURVE_V for smaller viral genomes (< 100 kb), which constitute the majority of viral genomes sequenced so far, is higher than that of GeneMark. Additionally, for the genome of Amsacta moorei entomopoxvirus, probably with the lowest genomic GC content among the sequenced organisms, the accuracy of ZCURVE_V is much better than that of GeneMark, because the latter predicts hundreds of false-positive genes. ZCURVE_V is also used to analyze some well-studied genomes, such as HIV-1, HBV and SARS-CoV, where its performance is generally better than that of GeneMark. Finally, GeneMark is not downloadable, whereas ZCURVE_V may be downloaded and run locally, which facilitates its use. Based on the above merits, it is suggested that ZCURVE_V may serve as a preferred gene-finding tool for newly sequenced viral and phage genomes. However, it is also shown that joint applications of both systems, ZCURVE_V and GeneMark, lead to better gene-finding results. The system ZCURVE_V is freely available at: http://tubic.tju.edu.cn/Zcurve_V/.
A total of 979 viral and 212 phage genome records were downloaded from GenBank release 141.0. Each record corresponds to a genome or a genomic segment. The corresponding RefSeq annotations for these genomes were downloaded before July 20, 2004. For all of the viral and phage genomes, the predicted results of the GeneMark VIOLIN database were also downloaded before July 20, 2004.
The present gene-finding method consists of four steps:
(1) Extracting the seed ORF for the analyzed genome
In the present algorithm, only one seed ORF is required for a viral genome. This seed ORF is selected using a simple approach. It is found that the longest ORF in a genome is very likely to be a protein-coding gene. This ORF is called the 'Maximum ORF' in this paper. A careful investigation of over 100 viral genomes with annotated genes shows that the deduction that the 'Maximum ORF' is a gene holds. The two very small viral genomes, cereal yellow dwarf virus-RPV satellite RNA (NC_003533) and arabis mosaic virus small satellite RNA (NC_001546), contain no genes at all, so the seed ORF obtained in this way is meaningless for these two genomes. If the 'Maximum ORF' is longer than 400 bp, it is directly regarded as a seed ORF (gene). However, if the 'Maximum ORF' is shorter than 400 bp, it is regarded as a seed ORF only if the base composition at the second codon position satisfies the inequality G2 < (A2 + C2 + T2)/3 + 0.1, where A2, C2, G2 and T2 are the occurrence frequencies of bases at the second codon position of the ORF. This inequality approximately reflects the fact that bases at the second codon position lack guanine to some degree. If a seed ORF is found, it is used as a training sample to calculate the related parameters. Otherwise, if no seed ORF is found, the analyzed viral genome is taken to contain no functional genes.
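The seed selection rule above can be sketched as follows. This is an illustrative sketch, not the ZCURVE_V implementation: it assumes that an ORF runs from an ATG to the first in-frame stop codon of the standard genetic code, that both strands are scanned, and that the stop codon is excluded when computing second-position frequencies. All function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str):
    """Yield ORFs (ATG through the stop codon) in the three frames of one strand."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                yield seq[start:i + 3]
                start = None


def maximum_orf(genome: str):
    """Return the longest ORF over both strands, or None if there is none."""
    genome = genome.upper()
    candidates = list(orfs_in_strand(genome))
    candidates += list(orfs_in_strand(reverse_complement(genome)))
    return max(candidates, key=len, default=None)


def passes_g2_rule(orf: str) -> bool:
    """Check G2 < (A2 + C2 + T2)/3 + 0.1 at the second codon positions."""
    second = orf[:-3][1::3]  # drop the stop codon, take positions 2, 5, 8, ...
    n = len(second)
    a2, c2, g2, t2 = (second.count(b) / n for b in "ACGT")
    return g2 < (a2 + c2 + t2) / 3 + 0.1


def select_seed_orf(genome: str, direct_length: int = 400):
    """Return the seed ORF, or None if the genome yields no acceptable seed."""
    orf = maximum_orf(genome)
    if orf is None:
        return None
    if len(orf) > direct_length or passes_g2_rule(orf):
        return orf
    return None


print(select_seed_orf("CCATGGCTGCTGCTGCTGCTTAACC"))
```

The short example genome contains a single 21 bp ORF; because it is shorter than 400 bp, it is accepted only because its second codon positions carry no guanine.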
(2) Training the parameter used to describe the coding potential
The methodology adopted here is based on the Z curve, which is another representation of DNA sequence. The algorithm is presented briefly as follows. The frequencies of bases A, C, G and T occurring in an ORF or a fragment of DNA sequence at positions 1, 4, 7, ...; 2, 5, 8, ...; and 3, 6, 9, ... are denoted by a1, c1, g1, t1; a2, c2, g2, t2; and a3, c3, g3, t3, respectively. They are the frequencies of bases at the 1st, 2nd and 3rd codon positions. Based on the Z curve [12], $a_i$, $c_i$, $g_i$, $t_i$ are mapped onto a point $P_i$ in a 3-dimensional space $V_i$, i = 1, 2, 3. The coordinates of $P_i$, denoted by $x_i$, $y_i$, $z_i$, are determined by the Z-transform of DNA sequence:
$$ x_i = (a_i + g_i) - (c_i + t_i), \quad y_i = (a_i + c_i) - (g_i + t_i), \quad z_i = (a_i + t_i) - (g_i + c_i), \quad x_i, y_i, z_i \in [-1, 1], \; i = 1, 2, 3. \tag{3} $$
The Z-transform of DNA sequence transforms the four frequencies of DNA bases into the coordinates of a point in a 3-dimensional space. In addition to the frequencies of codon-position-dependent single nucleotides, we need to consider the frequencies of phase-specific dinucleotides. Let the frequencies of the 16 dinucleotides AA, AC, ..., and TT occurring at codon positions 1–2 and 2–3 of an ORF or a fragment of DNA sequence be denoted by $p_{12}(AA)$, $p_{12}(AC)$, ..., $p_{12}(TT)$; and $p_{23}(AA)$, $p_{23}(AC)$, ..., $p_{23}(TT)$, respectively. Using the Z-transform, we find
$$ \begin{aligned} x_X^k &= [p_k(XA) + p_k(XG)] - [p_k(XC) + p_k(XT)], \\ y_X^k &= [p_k(XA) + p_k(XC)] - [p_k(XG) + p_k(XT)], \\ z_X^k &= [p_k(XA) + p_k(XT)] - [p_k(XG) + p_k(XC)], \end{aligned} \tag{4} $$
where $x_X^k$, $y_X^k$ and $z_X^k$ are the coordinates, X = A, C, G, T and k = 12, 23. Let the 3-dimensional space $V_X^k$ be spanned by $x_X^k$, $y_X^k$ and $z_X^k$. The direct sum of the subspaces $V_1$, $V_2$, $V_3$, $V_A^{12}$, $V_C^{12}$, $V_G^{12}$, $V_T^{12}$, $V_A^{23}$, $V_C^{23}$, $V_G^{23}$ and $V_T^{23}$ is denoted by a 33-dimensional space V, i.e., $V = V_1 \oplus V_2 \oplus V_3 \oplus V_A^{12} \oplus \cdots \oplus V_T^{23}$, where the symbol ⊕ denotes the direct sum of two subspaces. The 33 components of the space V, i.e., $u_1, u_2, \ldots, u_{33}$, are defined as follows:
$$ \begin{aligned} u_1 &= x_1, & u_2 &= y_1, & u_3 &= z_1, \\ u_4 &= x_2, & u_5 &= y_2, & u_6 &= z_2, \\ u_7 &= x_3, & u_8 &= y_3, & u_9 &= z_3, \\ u_{10} &= x_A^{12}, & u_{11} &= y_A^{12}, & u_{12} &= z_A^{12}, \\ &\;\;\vdots \\ u_{31} &= x_T^{23}, & u_{32} &= y_T^{23}, & u_{33} &= z_T^{23}. \end{aligned} \tag{5} $$
Therefore, an ORF or a fragment of DNA sequence can be represented by a point or a vector in the 33-dimensional space V. Note that $u_i \in [-1, +1]$, i = 1, 2, ..., 33, so the space V is a 33-dimensional hypercube with side length 2. A total of 33 parameters, denoted by $u_1^0, u_2^0, \ldots, u_{33}^0$, are calculated according to equation (5) for the seed ORF, which corresponds to a point O in the 33-dimensional space. These 33 parameters are used to differentiate coding from non-coding ORFs.
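The 33 components can be computed directly from an ORF sequence. The following is a minimal sketch, assuming the standard Z-transform x = (a + g) - (c + t), y = (a + c) - (g + t), z = (a + t) - (g + c), and assuming an ordering of the components (single nucleotides at codon positions 1, 2, 3, then dinucleotides at positions 1–2 and 2–3 with first base A, C, G, T); the ordering does not affect Euclidean distances. The function names are illustrative.

```python
BASES = "ACGT"


def z_transform(freq: dict) -> list:
    """Map four frequencies keyed by A, C, G, T to Z-curve coordinates [x, y, z]."""
    a, c, g, t = (freq[b] for b in BASES)
    return [(a + g) - (c + t), (a + c) - (g + t), (a + t) - (g + c)]


def zcurve_features(orf: str) -> list:
    """Return the 33 Z-curve components u1..u33 of an in-frame sequence."""
    orf = orf.upper()
    n_codons = len(orf) // 3
    if n_codons == 0:
        raise ValueError("sequence must contain at least one codon")
    codons = [orf[3 * i:3 * i + 3] for i in range(n_codons)]

    features = []
    # 9 components: single-nucleotide frequencies at codon positions 1, 2, 3.
    for pos in range(3):
        column = [codon[pos] for codon in codons]
        freq = {b: column.count(b) / n_codons for b in BASES}
        features += z_transform(freq)

    # 24 components: dinucleotide frequencies at codon positions 1-2 and 2-3,
    # one Z-transform per first base X over the second base.
    for pos in (0, 1):
        pairs = [codon[pos:pos + 2] for codon in codons]
        for x in BASES:
            freq = {b: pairs.count(x + b) / n_codons for b in BASES}
            features += z_transform(freq)

    return features


u = zcurve_features("ATGGCTGCTGCTGCTGCT")
print(len(u), min(u), max(u))
```

Each component lies in [-1, +1], so every ORF maps to a point inside the 33-dimensional hypercube described above.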
(3) Seeking all ORFs and predicting possible protein-coding genes
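All the ORFs longer than a given value, for example 90 bp, are extracted as candidate genes. For each ORF, which is represented by a point in the 33-dimensional space, the Euclidean distance D of this point to the point O is obtained as
$$ D = \sqrt{\sum_{i=1}^{33} \left(u_i - u_i^0\right)^2}, \tag{6} $$
where $u_i$ are the 33 components of the ORF and $u_i^0$ are the corresponding components of the seed ORF (point O).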
A coding potential index VZ is defined as
where D0 is a constant called maximum Euclidean distance, whose default value is . All ORFs with VZ scores greater than 0 are regarded as possible protein-coding genes, whereas those with VZ scores less than 0 are regarded as non-coding.
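The scoring step can be sketched as follows. The exact expression for VZ is not reproduced in this copy of the text, so the form VZ = 1 - D/D0 used below is an assumption; it is chosen only because it is positive exactly when the distance D to the seed point O is smaller than the maximum Euclidean distance D0, as the text requires. No default for D0 is assumed, and the value 1.0 in the usage lines is purely illustrative.

```python
import math


def euclidean_distance(u: list, u0: list) -> float:
    """Distance between an ORF's feature vector u and the seed vector u0."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, u0)))


def vz_score(u: list, u0: list, d0: float) -> float:
    """ASSUMED form of the coding potential index: VZ = 1 - D/D0."""
    return 1.0 - euclidean_distance(u, u0) / d0


def is_coding(u: list, u0: list, d0: float) -> bool:
    """An ORF is regarded as a possible gene when its VZ score is positive."""
    return vz_score(u, u0, d0) > 0


seed = [0.0] * 33    # hypothetical seed parameters (point O)
near = [0.1] * 33    # hypothetical ORF close to the seed
far = [0.5] * 33     # hypothetical ORF far from the seed

print(is_coding(near, seed, d0=1.0), is_coding(far, seed, d0=1.0))
```

Because only a seed ORF is needed to define point O, this kind of distance test requires no negative (non-coding) training samples.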
(4) Dealing with overlapping ORFs
Among all the ORFs with a VZ score larger than 0, some are falsely predicted as genes because they overlap with coding ORFs. In the development of the ZCURVE system, a strategy was proposed to deal with overlapping ORFs. Later, this strategy was adopted again in the ZCURVE_CoV system. Here the same strategy is employed once more, with the related parameters adjusted because the definition of the coding potential score has changed. Briefly, if the VZ score of the longer of two overlapping ORFs minus a given value is still larger than that of the shorter one, the longer ORF is recognized as a gene and the shorter one as non-coding. Otherwise, both are kept as coding. For more detail, refer to the ZCURVE paper.
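The pairwise rule just described can be sketched as follows. This covers only the comparison stated in the text; the full strategy of the original ZCURVE system may include further conditions that are not described here. The `margin` parameter stands for the "given value", whose actual setting is not stated, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ORF:
    start: int   # 1-based start coordinate
    end: int     # 1-based end coordinate (inclusive)
    vz: float    # coding potential score

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def overlaps(a: ORF, b: ORF) -> bool:
    return a.start <= b.end and b.start <= a.end


def resolve_pair(a: ORF, b: ORF, margin: float) -> list:
    """Return the ORFs of a pair that are kept as coding."""
    if not overlaps(a, b):
        return [a, b]
    longer, shorter = (a, b) if a.length >= b.length else (b, a)
    # The longer ORF wins only if its score exceeds the shorter one's by the margin.
    if longer.vz - margin > shorter.vz:
        return [longer]
    return [longer, shorter]


long_orf = ORF(1, 900, 0.60)
weak_short = ORF(700, 1000, 0.20)
strong_short = ORF(700, 1000, 0.55)

print(len(resolve_pair(long_orf, weak_short, margin=0.1)))    # shorter ORF dropped
print(len(resolve_pair(long_orf, strong_short, margin=0.1)))  # both kept
```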
There are three main different features between the present viral gene-finding system ZCURVE_V and our previously reported bacterial gene-finding system ZCURVE. Firstly, two different methods are used to generate seed ORFs: one simply selecting the 'Maximum ORF' and another selecting those long and non-overlapping ORFs as seed ORFs. Secondly, no negative samples (non-coding sequences) are required in the training set of the algorithm for ZCURVE_V system. Thirdly, instead of Fisher linear discriminant algorithm, Euclidean distance discriminant method is used here. Due to the adaptation, the ZCURVE_V system is capable of recognizing protein-coding genes in any anonymous viral or phage genomes, even for those shorter than 1000 bp.
Availability and requirements
A web interface of the ZCURVE_V system has been constructed at http://tubic.tju.edu.cn/Zcurve_V/. When a user pastes a viral or phage genomic sequence into the input window of the homepage, the gene-finding results are returned immediately. When running ZCURVE_V, a total of 9 options are selectable: 'the minimum gene length', 'the maximum Euclidean distance D0', 'the minimum coding potential score VZ', 'belonging to mycoplasma or not', 'being single-stranded DNA/RNA or not', 'the type of start codons', 'keeping overlapping genes or not', 'providing personal seed ORF sequence or not' and 'relocating translation start sites for predicted genes or not'. Registered users may also download the executable version of ZCURVE_V and run it on their own computers under Windows (95/98/NT/Me/2000 or higher), Linux (Redhat 9.0 or higher) or SGI IRIX 6.5. The predicted results for the 979 viral and 212 phage genomes are provided through the database named DOVGZ (Database Of Viral Genes predicted by ZCURVE_V), which is available online.
Besemer J, Lomsadze A, Borodovsky M: GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 2001, 29: 2607–2618. 10.1093/nar/29.12.2607
Salzberg SL, Delcher AL, Kasif S, White O: Microbial gene identification using interpolated Markov models. Nucleic Acids Res 1998, 26: 544–548. 10.1093/nar/26.2.544
Delcher AL, Harmon D, Kasif S, White O, Salzberg SL: Improved microbial gene identification with GLIMMER. Nucleic Acids Res 1999, 27: 4636–4641. 10.1093/nar/27.23.4636
Guo FB, Ou HY, Zhang CT: ZCURVE: a new system for recognizing protein-coding genes in bacterial and archaeal genomes. Nucleic Acids Res 2003, 31: 1780–1789. 10.1093/nar/gkg254
Badger JH, Olsen GJ: CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol 1999, 16: 512–24.
Frishman D, Mironov A, Mewes HW, Gelfand M: Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res 1998, 26: 2941–2947. 10.1093/nar/26.12.2941
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
Besemer J, Borodovsky M: Heuristic approach to deriving models for gene finding. Nucleic Acids Res 1999, 27: 3911–20. 10.1093/nar/27.19.3911
Mills R, Rozanov M, Lomsadze A, Tatusova T, Borodovsky M: Improving gene annotation of complete viral genomes. Nucleic Acids Res 2003, 31: 7041–55. 10.1093/nar/gkg878
Tech M, Merkl R: YACOP: Enhanced gene prediction obtained by a combination of existing methods. In Silico Biol 2003, 3: 441–51.
Chen LL, Ou HY, Zhang R, Zhang CT: ZCURVE_CoV: a new system to recognize protein coding genes in coronavirus genomes, and its applications in analyzing SARS-CoV genomes. Biochem Biophys Res Commun 2003, 307: 382–8. 10.1016/S0006-291X(03)01192-6
Zhang CT, Zhang R: Analysis of distribution of bases in the coding sequences by a diagrammatic technique. Nucleic Acids Res 1991, 19: 6313–6317.
Burset M, Guigó R: Evaluation of gene structure prediction programs. Genomics 1996, 34: 353–357. 10.1006/geno.1996.0298
Bawden AL, Glassberg KJ, Diggans J, Shaw R, Farmerie W, Moyer RW: Complete genomic sequence of the Amsacta moorei entomopoxvirus: analysis and comparison with other poxviruses. Virology 2000, 274: 120–39. 10.1006/viro.2000.0449
Joint United Nations Programme on HIV/AIDS (UNAIDS) and the World Health Organization (WHO), AIDS Epidemic Update, December 2004
Ksiazek TG, Erdman D, Goldsmith CS, Zaki SR, Peret T, Emery S, Tong S, Urbani C, Comer JA, Lim W: A novel coronavirus associated with severe acute respiratory syndrome. N Engl J Med 2003, 348: 1953–66. 10.1056/NEJMoa030781
Marra MA, Jones SJ, Astell CR, Holt RA, Brooks-Wilson A, Butterfield YS, Khattra J, Asano JK, Barber SA, Chan SY, Cloutier A, Coughlin SM, Freeman D, Girn N, Griffith OL, Leach SR, Mayo M, McDonald H, Montgomery SB, Pandoh PK, Petrescu AS, Robertson AG, Schein JE, Siddiqui A, Smailus DE, Stott JM, Yang GS, Plummer F, Andonov A, Artsob H, Bastien N, Bernard K, Booth TF, Bowness D, Czub M, Drebot M, Fernando L, Flick R, Garbutt M, Gray M, Grolla A, Jones S, Feldmann H, Meyers A, Kabani A, Li Y, Normand S, Stroher U, Tipples GA, Tyler S, Vogrig R, Ward D, Watson B, Brunham RC, Krajden M, Petric M, Skowronski DM, Upton C, Roper RL: The Genome sequence of the SARS-associated coronavirus. Science 2003, 300: 1399–1404. 10.1126/science.1085953
McHardy AC, Goesmann A, Puhler A, Meyer F: Development of joint application strategies for two microbial gene finders. Bioinformatics 2004, 20: 1622–31. 10.1093/bioinformatics/bth137
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL: GenBank: update. Nucleic Acids Res 2004, 32: D23-D26. 10.1093/nar/gkh045
Pruitt KD, Tatusova T, Maglott DR: NCBI Reference Sequence project: update and current status. Nucleic Acids Res 2003, 31: 34–7. 10.1093/nar/gkg111
Trifonov EN: Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16 S rRNA nucleotide sequences. J Mol Biol 1987, 194: 643–52. 10.1016/0022-2836(87)90241-5
DOVGZ (Database Of Viral Genes predicted by ZCURVE_V) [http://tubic.tju.edu.cn/Zcurve_V/database/]
We thank Dr Ren Zhang for invaluable assistance. We also thank Drs Ju Wang and Ling-Ling Chen for useful discussions. Suggestions from Feng Gao, Yun-Tao Dou and Jian-Hui Zhang on the manuscript are gratefully acknowledged. The present study was supported in part by the National Natural Science Foundation of China (grant 90408028) and the Program of CSIRTU by the Ministry of Education of China.
CTZ guided the whole study and took part in writing the manuscript. FBG designed the algorithm and wrote the computer program. He also ran the program on about 1000 genomes and took part in writing the manuscript.