In silico identification of putative promoter motifs of White Spot Syndrome Virus

Background White Spot Syndrome Virus, a member of the virus family Nimaviridae, is a large dsDNA virus infecting shrimp and other crustacean species. Although limited information is available on the mode of transcription, previous data suggest that WSSV gene expression occurs in a coordinated and cascaded fashion. To search in silico for conserved promoter motifs (i) the abundance of all 4 through 8 nucleotide motifs in the upstream sequences of WSSV genes relative to the complete genome was determined, and (ii) a MEME search was performed in the upstream sequences of either early or late WSSV genes, as assigned by microarray analysis. Both methods were validated by alignments of empirically determined 5' ends of various WSSV mRNAs. Results The collective information shows that the upstream region of early WSSV genes, containing a TATA box and an initiator, is similar to Drosophila RNA polymerase II core promoter sequences, suggesting utilization of the cellular transcription machinery for generating early transcripts. The alignment of the 5' ends of known well-established late genes, including all major structural protein genes, identified a degenerate motif (ATNAC) which could be involved in WSSV late transcription. For these genes, only one contained a functional TATA box. However, almost half of the WSSV late genes, as previously assigned by microarray analysis, did contain a TATA box in their upstream region. Conclusion The data may suggest the presence of two separate classes of late WSSV genes, one exploiting the cellular RNA polymerase II system for mRNA synthesis and the other generating messengers by a new virus-induced transcription mechanism.


Background
White Spot Syndrome Virus (WSSV), type species of the virus family Nimaviridae (genus whispovirus), is a pathogen of major economic importance in cultured penaeid shrimp [1,2]. Histopathological studies on WSSV infected shrimp have shown that the virus mainly infects tissues of ectodermal and mesodermal origin, such as the stomach, gills, heart, gut, muscle tissue and hematopoietic tissue [3][4][5]. Infected cells within these tissues are characterized by the appearance of homogeneous hypertrophied nuclei and chromatin margination [1,5,6]. WSSV particles have been mainly detected in the nuclei of infected cells, indicating that transcription, replication and virion assembly probably occur in the nucleus [5][6][7][8]. It is not clear how the virions are released from the nucleus of an infected cell, but this most likely occurs by budding or by rupture of the nuclear envelope and/or the cell membrane.
The circular dsDNA genomes of three WSSV isolates, originating from Taiwan (WSSV-TW), China (WSSV-CN) and Thailand (WSSV-TH), have been completely sequenced [9][10][11]. The genome of WSSV-TH has a size of 292,967 bp encompassing 184 open reading frames (ORFs), which are almost equally distributed on both strands [10]. Thus far, no evidence has been obtained for the occurrence of spliced transcripts. Only a limited number of the WSSV ORFs could be assigned a function based on homology with known genes in public databases [10]. Concerning genes involved in replication and transcription of WSSV, four putative functional proteins have been annotated: a DNA helicase (ORF9), a DNA polymerase (ORF27), a cAMP-responsive element binding protein (ORF66) and a TATA box binding protein (ORF149). Furthermore, several genes involved in nucleotide metabolism, such as both subunits of a ribonucleotide reductase, a chimeric thymidine-thymidylate kinase, a thymidylate synthase, a dUTPase and an endonuclease, have been identified on the genome [10,11]. Although present in various other large dsDNA viruses, no RNA polymerase or other genes involved in transcription, e.g. a poly(A) polymerase or mRNA capping enzymes, have (yet) been identified on the WSSV genome. Around 50 major or minor virion protein genes have been identified on the genome [12][13][14][15].
Upon infection, expression of the WSSV genes can be divided in at least an early and a late phase [16], while also an immediate-early phase might be present [17]. The mechanism of the switch between (immediate-) early and late WSSV gene expression, as well as the promoters and regulatory sequences involved, is largely unknown. However, many eukaryotic large ds DNA viruses of >100 kb have a coordinated and cascaded fashion of gene expression [18][19][20][21]. Baculoviruses and herpesviruses (both replicating in the nucleus) as well as poxviruses and asfarviruses (both replicating in the cytoplasm) express their early genes before viral replication initiates, while late genes are expressed after the onset of viral DNA replication. Both viruses replicating in the nucleus utilize the host RNA polymerase II for early gene transcription [19,22]. However, for late gene transcription, herpesviruses continue to exploit the cellular RNA polymerase II system, while late transcription of baculoviruses occurs by a novel RNA polymerase that is at least partially encoded by the baculovirus genome [22][23][24]. The viruses replicating in the cytoplasm encode their own RNA polymerase which synthesizes early as well as late mRNAs. This RNA polymerase is encapsidated within the virus particle to enable the initiation of viral gene expression upon arrival in the cytoplasm [18,25].
Despite the differences in gene expression strategies, the above viruses have in common that specific nucleotide motifs involved in transcription initiation, expression kinetics and expression level have been identified in the upstream regions of individual genes. Well-known promoter elements used by many viruses are the TATA box and the initiator sequence, which is located at or near the site of transcription initiation (TIS). We hypothesize that conserved promoter motifs play an important role in transcription regulation of WSSV, and that they can be identified by in silico analysis of upstream regions of WSSV genes. As important promoter motifs are overrepresented in the 5' upstream regions of baculovirus genes [26], we studied the relative abundance of all 4 through 8 nucleotide motifs in the upstream regions of WSSV genes compared to the complete WSSV genomic sequence. This enumeration strategy was validated by testing the eukaryotic large ds DNA viruses mentioned above. To further identify regulatory elements, the nucleotide composition in the upstream regions of WSSV early and late genes, as assigned by microarrays [16], was studied using MEME [27]. MEME is an algorithm which searches for conserved motifs in a selected set of sequences, in this case the upstream regions of WSSV ORFs. Experimental support for the in silico results was obtained by alignments of 5' ends of known WSSV early as well as late transcripts. These alignments include TISs mapped by 5'RACE (Rapid Amplification of cDNA Ends) in previous studies, as well as two newly determined TISs of the major structural protein genes ORF112 and ORF160. Polyadenylation of WSSV early and late genes was studied by alignment of poly(A) sites. Using this approach, we were able to find further support for the presence of coregulated clusters of WSSV genes, as well as to predict putative WSSV promoter elements involved in gene expression of these clusters.

Promoter analysis using the enumeration method
In a search for putative WSSV regulatory promoter elements, we compared the abundance of all 4, 5, 6, 7 or 8 nucleotide motifs in the 100 and 200 nt upstream sequences of all WSSV genes relative to their presence in the complete WSSV genomic sequence. This method will be referred to as the enumeration method in the remaining part of the article. For validation, this enumeration method was applied on the genome sequences of the type species of more extensively studied large ds DNA viruses mentioned in the introduction: AcMNPV (Autographa californica Multinucleopolyhedrovirus; Baculovirus), Human Herpes Virus 1 (HHV1; Herpesvirus), Vaccinia virus (Poxvirus) and African Swine Fever Virus (ASFV; Asfarvirus).
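The enumeration method described above can be sketched in a few lines of Python. This is a minimal illustration and not the scripts used in the study: the function names `count_kmers` and `relative_enrichment`, the use of overlapping windows, the normalization by the number of windows and the single-strand counting are all assumptions, and the sequences are toy data rather than WSSV sequences.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count all overlapping k-mers in one sequence (given strand only)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def relative_enrichment(upstream_seqs, genome, k):
    """Ratio of k-mer frequency in upstream regions to its frequency in the genome.

    Frequencies are counts divided by the number of k-mer windows, so the
    ratio does not depend on the total length of either sequence set.
    K-mers absent from the genome are skipped (assumption of this sketch).
    """
    up = Counter()
    for s in upstream_seqs:
        up.update(count_kmers(s, k))
    gen = count_kmers(genome, k)
    up_total = sum(up.values())
    gen_total = sum(gen.values())
    return {
        kmer: (up[kmer] / up_total) / (gen[kmer] / gen_total)
        for kmer in up
        if gen[kmer] > 0
    }

# Hypothetical toy data: a repetitive "genome" with a single TATA-containing stretch
genome = "ACGT" * 50 + "TATAAACC"
upstream = ["GGTATAAACC", "CCTATAAAGG"]
scores = relative_enrichment(upstream, genome, 4)
# Motifs sorted from most to least enriched in the upstream regions
ranked = sorted(scores, key=scores.get, reverse=True)
```

A ratio above 1 means the motif is relatively more frequent upstream of the ORFs than in the genome as a whole; running the same function with `k` from 4 to 8 and with 100 or 200 nt upstream windows would mirror the comparisons reported in Table 1.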

AcMNPV, HHV1, Vaccinia virus and ASFV
Only the analysis of the 4-mers of these viruses is shown, as these will always be included in larger motifs (Table 1). Most AcMNPV early genes contain a functional consensus TATA box upstream of the TIS [19]. AcMNPV initiator motifs are composed of the conserved nucleotide sequences CAGT and (a/g/t)TAAG, for early and late genes, respectively [18,23,28]. Ayres et al. [26] showed that the sequence TAAG occurs less frequently in the whole AcMNPV genome than expected based on the AcMNPV nucleotide composition. The results of the 4-mer motif frequency analysis with the enumeration method indeed show that the TAAG motif frequency is 29% of the expected occurrence in the whole genome (Table 1). However, the analysis also shows that, of all possible 4-mer motifs, this motif has the highest relative enrichment in the 100 nt upstream regions of the AcMNPV ORFs (4.0 times). The baculovirus early promoter motif CAGT is also relatively more frequent in upstream regions (1.4 times), although less prominently than the TAAG motif. Parts of the TATA box as well as sequences of the well-known baculovirus early transcription activating motifs GATA and CACNG [19] occur relatively often in the upstream regions of the ORFs (Table 1). Compared to 100 nt, the enrichment of the functional motifs in the 200 nt upstream of the AcMNPV ORFs is less pronounced (Table 1), supporting the experimental observation that in baculoviruses important promoter elements are often located within 100 nt upstream of the translational start codon [26]. Analysis of 5-mer motifs of AcMNPV revealed that (a/g/t)TAAG was enriched in the upstream regions of the ORFs, but not CTAAG. Analysis of 6-mer motifs showed a relative enrichment of 3.0 times of the consensus TATA box sequence TATAAA in the 200 nt upstream regions.
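The comparison between observed and expected motif occurrence mentioned above can be illustrated as follows. The text does not state which background model was used to derive the expected TAAG frequency, so this sketch assumes the simplest one: independent bases drawn from the overall nucleotide composition (a zero-order model), counted on one strand. The function names are hypothetical.

```python
from collections import Counter

def expected_count(motif, seq):
    """Expected number of motif occurrences if bases were independent.

    Assumption: zero-order model based on the base composition of seq.
    """
    seq = seq.upper()
    comp = Counter(seq)
    n = len(seq)
    p = 1.0
    for base in motif.upper():
        p *= comp[base] / n          # probability of this base
    return p * (n - len(motif) + 1)  # number of possible start positions

def observed_count(motif, seq):
    """Number of (overlapping) occurrences of motif on the given strand."""
    seq, motif = seq.upper(), motif.upper()
    return sum(seq.startswith(motif, i) for i in range(len(seq) - len(motif) + 1))

def obs_exp_ratio(motif, seq):
    """Observed divided by expected count; values below 1 indicate depletion."""
    return observed_count(motif, seq) / expected_count(motif, seq)
```

Under this model, a ratio of about 0.29 for TAAG in a genome would correspond to the "29% of the expected occurrence" quoted in the text.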
For HHV1, the 4-mer nucleotide motifs of known promoter elements were identified by the enumeration method during analysis of the 200 nt upstream sequences, but not when analyzing 100 nt upstream sequences. This supports the view that, in contrast to baculoviruses, most regulatory elements are located more than 100 nt upstream of the HHV1 translational start codons [21]. Parts of the consensus TATA box, involved in HHV1 early and late transcription [21,29], occur relatively frequently in the 200 nt upstream of the HHV1 ORFs (Table 1). Also the sequence CATT, part of the CCATT boxes which are typically located upstream of the consensus TATA box of HHV1 early genes [21], shows a high relative enrichment of 2.1 (Table 1).
For both cytoplasmic viruses, Vaccinia virus and ASFV, the analysis shows that the late initiator sequences, TAAAT and TATA respectively [18,20,30], are highly enriched in the 100 nt as well as the 200 nt upstream sequences, although less prominently than the late TIS of baculoviruses (Table 1). Also parts of the sequence TAAA(a/t), essential for Vaccinia virus intermediate gene expression, are enriched (Table 1). Furthermore, the analysis shows a considerable enrichment of motifs consisting only of A and T residues. Long stretches of these nucleotides upstream of the transcribed region are typical for Vaccinia virus and ASFV early promoters, as well as for ASFV late promoters [18,20,30].

WSSV
The same enumeration method was used to analyze the upstream sequences of WSSV ORFs. The analysis of the 4- and 5-mer motifs is shown in Table 1. Sequences of the consensus TATA box appear relatively frequently compared to their presence in the complete WSSV genome (Table 1). The enrichment of these TATA box sequences is similar to what is observed for AcMNPV and HHV1 (Table 1), indicating a functional role for the TATA box in WSSV transcription regulation. Besides the TATA box sequences, the sequence AACC has the highest enrichment in the 100 nt upstream sequences of WSSV ORFs, although not as pronounced as the occurrences of the AcMNPV TAAG motif (Table 1). Previous experiments showed that the TISs of the late WSSV envelope protein genes vp28 and vp19 start within this exact AACC sequence [31], indicating this could be a putative promoter element for late transcription. Furthermore, some motifs consisting of G and C residues, such as the 4-mers CCGG and CCCC and the 5-mers CCGGG and CCCGG (Table 1), and G/C-rich sequences have a relatively high frequency in WSSV upstream regions. Compared to the 100 nt upstream of the ORFs, the results for the analysis of 200 nt are only slightly different and mostly less pronounced (Table 1). From the remaining analysis using 6, 7 or 8 nt motifs in the 100 or 200 nt upstream regions (data not shown), it is noteworthy that the enumeration method shows a relative enrichment of the 6-mer consensus TATA box sequence TATAAA of 4.7 times in the 100 nt upstream of the ORFs.
Previously, we showed that the WSSV genes clustered in an early and a late class based on expression profile in shrimp tissue [16]. Further analysis within the 100 nt upstream regions of either the WSSV early or late genes using the enumeration method showed that the sequence AACC has the highest relative enrichment of all possible 4-mer motifs for the late genes (2.4 times), while sequences of the TATA box were highly enriched in upstream regions of both gene classes (the sequence TATA showed a relative enrichment of 2.3 times in the 100 nt upstream regions of both gene classes; other data not shown).

MEME
The 100 or 200 nt sequences upstream of all WSSV genes were also studied by MEME (Table 2). As multiple classes of coregulated viral genes will be present within these sequences, MEME was set to identify conserved motifs regardless of whether they occurred in the upstream regions of all genes. MEME identified the TATA box as a consensus nucleotide motif in these WSSV upstream sequences (Table 2). Furthermore, this analysis showed that multiple upstream sequences contain stretches of T residues (Table 2). Analysis of the location and composition of these sequences revealed that they are mostly part of the polyadenylation signals [32] of the upstream ORFs, and therefore probably not functional as promoter elements of WSSV. The outcomes for the 100 and 200 nt upstream sequences are very similar, in line with the results of the enumeration method.
For individual analysis of the WSSV early or late kinetic cluster [16], the frequency of a specific motif per individual sequence was set at one, as most WSSV genes belonging to one cluster were considered to be coregulated. Analysis of the 100 and 200 nt upstream regions of either the early or the late class genes identified the consensus TATA box as a putative promoter element (Table 2). Previously, we showed that 37 of the 64 early genes (58%) and 28 of the 58 genes that clustered late (48%) contain a consensus TATA box [16]. Specific for the early class, MEME identified the consensus sequences CAACATCA and AGAAT, while for the late class it identified the consensus sequence AACC as well as an A-rich region (Table 2). On the other hand, as the early or late kinetic cluster [16] could also consist of subsets of coregulated WSSV genes, an additional MEME analysis was performed in which a motif only had to occur in at least half of the upstream sequences of either the early or the late genes. The outcome was very similar to the results presented in Table 2. Interestingly, the TATA box and the AACC motif were also identified by the enumeration method as being highly enriched in upstream regions of WSSV ORFs.

Alignments of TISs of WSSV genes
To validate both in silico methods described above, we compared the outcome with alignments of all known 5' ends of the WSSV early and late class genes. To facilitate comparisons with other viruses, in these alignments the function of the protein encoded by the gene is used to determine its class, either early or late. Early genes often encode enzymes which have functions involved in processes such as nucleotide metabolism, DNA replication, protein modification, viral transcription initiation and host response modulation. Structural virion protein genes often comprise a large part of viral late genes. For nearly all WSSV genes analyzed, this classification matched the results obtained by the microarray study [16].

Early genes
The WSSV genome encodes around 10 genes which, based on their (putative) function, are considered to be early [10]. For several of these genes the 5' end of their transcripts has been mapped. RT-PCRs and/or Northern blots of viral time courses confirmed that these genes were expressed in an early stage during infection (for references see Fig. 1). Furthermore ORF89, which is thought to be involved in latency, was empirically shown to be (immediate) early [33,34]. Fig. 1 shows an alignment of the experimentally determined transcription initiation sites (TISs) of WSSV early genes. The genes typically contain a consensus TATA box (sequence: TATA(a/t)A) [35]. The TIS is located 20 to 30 nucleotides downstream of the consensus TATA box, which is considered to be a functional distance [35][36][37]. This is between 20 and 85 nucleotides upstream of the translational start codon of the early gene products (Fig. 1). When the sequences are aligned by maximizing the identities around the transcriptional start site (Fig. 1), a clear consensus transcription initiation motif ((a/c)TCANT) overlaps with the transcriptional start sites. This resembles the RNA polymerase II core promoter motif identified in Drosophila, which often consists of a consensus TATA box and/or an initiator with the sequence (A)TCA(+1)(g/t)T(t/c) [35][36][37][38]. Similar to WSSV, the initiator of Drosophila is typically located 25-30 nt downstream of a TATA box [35][36][37]. Interestingly, the motif CTCAC, which is part of the identified WSSV consensus sequence (a/c)TCANT and which is the exact sequence of the TISs of the dutpase and rr1 (Fig. 1), was also shown to be enriched in upstream regions of WSSV ORFs (Table 1).
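The arrangement described above, a consensus TATA box followed 20 to 30 nt downstream by an initiator motif, can be searched for with a short script. This is only a sketch under stated assumptions: the motifs are written in IUPAC notation (TATA(a/t)A as `TATAWA`, (a/c)TCANT as `MTCANT`), the distance is measured from the end of the TATA box to the start of the initiator motif (the text does not specify the exact reference points), and the function names are hypothetical.

```python
import re

# IUPAC nucleotide codes needed for the motifs in the text (subset)
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "M": "[AC]", "K": "[GT]", "Y": "[CT]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate an IUPAC motif into a regular expression."""
    return "".join(IUPAC[b] for b in motif.upper())

def find_motif(seq, motif):
    """0-based start positions of all (overlapping) matches of motif in seq."""
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]

def tata_initiator_pairs(seq, tata="TATAWA", inr="MTCANT", min_gap=20, max_gap=30):
    """Pairs (tata_start, inr_start) with the initiator min_gap..max_gap nt
    downstream of the end of the TATA box (assumed distance definition)."""
    inr_starts = find_motif(seq, inr)
    pairs = []
    for t in find_motif(seq, tata):
        tata_end = t + len(tata)
        for i in inr_starts:
            if min_gap <= i - tata_end <= max_gap:
                pairs.append((t, i))
    return pairs

# Hypothetical upstream sequence: TATA box, 22 nt spacer, initiator-like motif
example = "GG" + "TATAAA" + "G" * 22 + "CTCACT" + "GG"
hits = tata_initiator_pairs(example)
```

Applied to the upstream regions of a set of genes, the fraction of sequences with at least one such pair gives a rough indication of how many may carry an RNA polymerase II-like core promoter; it does not replace experimental TIS mapping.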

Late genes
The protein pattern of WSSV particles on an SDS-PAGE gel shows around 8 major WSSV structural virion proteins [12][13][14][15]. For 6 of these proteins (VP664, VP28, VP26, VP24, VP19 and VP15) the 5' end of the corresponding mRNA has been mapped [31,39]. RT-PCRs and/or Northern blots of viral time courses confirmed that these genes were expressed in a late stage during infection [31,39]. We completed this analysis by mapping the TISs of the two other major structural protein genes, vp75 (ORF160) and vp73 (ORF112). Both vp75 and vp73 lack a consensus TATA box (Fig. 2a). Using 5'RACE, the TIS of vp75 was identified within the nucleotide sequence TG, 72 nt upstream of the translational start codon. For vp73, the TIS was located at nucleotide residues TC, 220 nt upstream of the translational start codon (Fig. 2a).
When the upstream sequences of all major structural protein genes are aligned by maximizing the identities around the transcriptional start sites (Fig. 3), the TISs are present within or very near the nucleotide sequence ATNAC. The transcripts start 20-25 nucleotides downstream of an A/T rich region, which has an average A/T content of 79% compared to 61% for the 200 nt upstream regions of the 8 genes. Vp15 and vp19 contain a consensus TATA box, of which only the TATA box of vp15 is at a functional distance from the TIS (Fig. 3) [31]. The distance from the TIS to the translational start codon differs between the various genes, ranging from 30 to 220 nt (Fig. 3). Interestingly, most of these features were predicted by our in silico analysis. The first three nucleotides of the AACC motif identified in the in silico analysis (Tables 1 and 2) are part of the consensus sequence ATNAC, and both contain the AC dinucleotide which is present for almost all genes in Fig. 3. Also the sequences ATAA and TAAC, parts of the ATNAC sequence, were identified as putative promoter elements (Table 1). Of all WSSV late genes, as assigned by microarray analysis [16], 40% (23 of the 58, both structural and non-structural protein genes) contain the sequence ATNAC in their 100 nt upstream region. The A-rich (and T-rich) sequences identified by MEME are in line with the observation that late genes often contain long stretches of A/T residues upstream of their TIS (Fig. 3).
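The two quantities used in this paragraph, the A/T content of a region and the share of upstream sequences containing the degenerate motif ATNAC, are simple to compute. The following is a minimal sketch with hypothetical function names and toy sequences; it assumes single-strand matching and treats N as any of A, C, G or T.

```python
import re

def at_content(seq):
    """Fraction of A and T residues in a sequence."""
    seq = seq.upper()
    return (seq.count("A") + seq.count("T")) / len(seq)

# Degenerate late motif from the text; N is expanded to any nucleotide
ATNAC = re.compile(r"AT[ACGT]AC")

def fraction_with_motif(upstream_seqs, pattern=ATNAC):
    """Share of sequences that contain the motif at least once."""
    hits = sum(1 for s in upstream_seqs if pattern.search(s.upper()))
    return hits / len(upstream_seqs)

# Toy sequences (not WSSV): two of the four contain ATNAC
toy = ["GGATGACGG", "CCCCCC", "ATTACG", "GGGG"]
share = fraction_with_motif(toy)
```

With the counts reported in the text, 23 of 58 late genes, the same calculation gives 23 / 58, which rounds to the stated 40%.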
In addition to the 8 major structural proteins, the protein profile of WSSV particles shows a range of about 40 minor virion proteins [12,13]. Most of these have not been studied in detail. However, the corresponding messengers are supposed to be late, although 13 of them clustered in the early class during microarray analysis [16]. Remarkably, 45% of the minor virion protein genes (18 of the 40) contain a consensus TATA box within 300 nt of the translational start codon. This is in line with the MEME analysis, which also suggested that the TATA box might be involved in late transcription.

Polyadenylation
For various WSSV genes, the site of polyadenylation has been mapped using 3'RACE. We extended this analysis by mapping the polyadenylation site of ORF30, the collagen-like ORF of WSSV [40]. Polyadenylation of ORF30 starts 32 nt after the translational stop codon, 16 nt after the first poly(A)-signal (sequence AATAAA; Fig. 2b) [41]. Fig. 4 shows an alignment of all known polyadenylation sites of WSSV. Polyadenylation typically starts within 11-19 nt after a consensus polyadenylation site. Typically, a T rich region (stretch of about twelve T residues) was identified 8 nt downstream of the poly(A)-site (Fig. 4). There seems to be no difference between the polyadenylation sites of early and late genes (Fig. 4). A total of 9 WSSV genes were found to be non-polyadenylated [13,42]. Except for vp12a (WSSV-TH ORF34), all these genes lack a consensus poly(A)-signal within -50 to 300 nt of their translational stop codon. Two (vp31 and vp13b encoded by WSSV-TH ORF163 and ORF155, respectively) do however contain the sequence ATTAAA within this region, which in vertebrates is often sufficient for polyadenylation [43], but apparently not in invertebrates or arthropods.
Figure 1. Alignment of 5' flanking sequences of WSSV early genes. The genes are named after WSSV-TH ORF numbers and the function of their protein product. The transcription initiation site of each gene is encircled. Sequences are aligned by their consensus TATA box, as well as by maximizing the identities around the transcriptional start site. Below, the consensus sequence of the alignment and the Drosophila RNA polymerase II core promoter are shown. Similar sequences of the consensus TIS motif and the initiator of the Drosophila RNA pol II core promoter are underlined. Abbreviations used and references: pk: protein kinase [52]; DNA-pol: DNA polymerase [47]; tds: Thymidylate Synthase [48]; dutp-ase: dUTPase [42]; lat-rel: latency related gene [33]; rr1 and rr2: the large and small subunit of ribonucleotide reductase, respectively [53]; endonuc: endonuclease [54].

Discussion
In this paper, we used a new enumeration strategy based on a model proposed by Brazma et al. [44] to identify putative WSSV promoter elements. A set of computer scripts was designed, which calculated the difference in nucleotide motif frequencies in the upstream sequences of all genes compared to the complete WSSV genomic sequence. The rationale behind this analysis is that promoter motifs are often thought to be transcription factor binding sites, which are functional upstream of genes. The results obtained with the well-studied large ds DNA viruses AcMNPV, HHV1, Vaccinia virus and ASFV (Table 1) show that our method is robust in identifying important promoter elements of completely sequenced viral genomes without a priori knowledge, as these are often enriched in upstream sequences of viral ORFs. Therefore, this new enumeration method can be useful in the analysis of newly sequenced genomes of large ds DNA viruses. For further analysis of the upstream regions of WSSV genes of the early and late cluster, as assigned by microarray analysis [16], MEME was used. Genes of either cluster might be coregulated by similar mechanisms, utilizing conserved nucleotide motifs. As MEME can identify motifs which have to occur in each individual sequence of a set of submitted sequences, or in a selected number of submitted sequences, it is highly complementary to the enumeration method. Another advantage of MEME is that it can identify degenerate motifs.
The enumeration method identified various nucleotide motifs (Table 1) that were also identified by MEME (Table 2) and by the alignments of experimentally determined 5' ends of WSSV mRNAs (Figs. 1 and 3). These include the consensus TATA box, as well as the nucleotide motif AACC. However, other nucleotide motifs that were not validated with the other methods, e.g. some motifs rich in C or G residues, were also (highly) enriched in WSSV upstream regions and might be involved in WSSV transcription. In accordance with the alignments shown in Figs. 1 and 3, where most putative promoter elements are located within 100 nt upstream of the ORFs, the nucleotide motifs identified with the enumeration method are more pronounced in the 100 nt upstream of the ORFs (Table 1) than in the 200 nt upstream. This suggests that, similar to AcMNPV, most WSSV promoter elements are located within 100 nt upstream of the translational start codons, which is a reflection of the tight packing of genes along the WSSV genome. It would be of interest to test the functionality of the sequences (a/c)TCANT and ATNAC, which were identified as the consensus TISs of the WSSV early and late class genes, respectively (Figs. 1 and 3), and other identified motifs (Table 1) in a reporter gene (e.g. luciferase) assay. For testing late promoters in this setup, a co-infection with WSSV should be considered to supply additional viral transcription factors required for late gene expression. In the absence of a suitable WSSV cell system, these reporter gene assays have been performed in the artificial Sf9 insect cell line [17,33,45], with all its limitations to the interpretation of the results. However, with the recent developments concerning differentiation and growth of crayfish hematopoietic stem cells in vitro [46], these experiments might be performed in crayfish cell cultures, providing a more convenient and homologous system.
The identification of (putative) promoter elements provides further insight into the transcription mechanisms used by WSSV. The presence of a consensus TATA box for most early genes as well as a conserved transcription initiation motif similar to the Drosophila initiator suggests that WSSV uses the host RNA polymerase II transcription machinery for generating early transcripts, as also proposed by Chen et al. [47] and Liu et al. [42]. Previous analysis of WSSV late genes could not reveal any readily apparent dominant nucleotide element used for WSSV late gene expression [31]. Using the newly available microarray clustering [16], we could now show that around half of the WSSV putative late genes contain a consensus TATA box. This suggests that WSSV might exploit the cellular RNA polymerase II system not only for early but also for (part of) its late mRNA synthesis, similar to some other ds DNA viruses like herpesviruses [22]. Only one of the 8 major structural virion protein genes, which are expressed in the late phase of viral infection and most likely are co-regulated to secure correct assembly of the virion, contains a consensus TATA box. Alignment of the 5' ends of the 8 major structural protein genes identified a novel consensus transcription initiation site, ATNAC, downstream of an A/T rich region. The in silico analysis further supports the observation that both components might be late promoter elements. This suggests a second pathway for WSSV late gene expression, similar to the late gene expression strategy identified for baculoviruses [23,24]. However, different from baculoviruses, viral genes required for this pathway, such as an RNA polymerase or late transcription factors, have not been identified on the WSSV genome [10,11]. These genes could however have diverged too far from known homologues to be found based on amino acid homology.
The alignments of the 3' ends of WSSV mRNAs suggest that there is no difference in polyadenylation between early and late mRNAs. The WSSV polyadenylation characteristics of both classes resemble regular polyadenylation in eukaryotic mRNAs, which is typically located 10 to 25 nt downstream of the sequence AATAAA [41,43]. Also oligo-T stretches are often present about 30 nt downstream of the poly(A)-signal of eukaryotic genes [32]. These data indicate that WSSV uses the regular cellular enzymes for polyadenylation of mRNAs. However, other undefined signal pathways of polyadenylation might also be used, as two WSSV genes (dUTPase and TdS) were found to be polyadenylated without a poly(A)-signal present [42,48].
Figure 3. Alignment of 5' flanking sequences of WSSV late genes. Name of structural protein genes as well as WSSV-TH ORF numbers are indicated. The transcription initiation site of each gene is encircled. For vp19 a minor transcription initiation site is also encircled. The TATA box for vp15 is underlined. The A/T rich region is boxed. Sequences are aligned by maximizing the identities around the transcriptional start site. References: vp28, vp26, vp24, vp19 and vp15 [31]; vp664 [39]; vp75 and vp73 (this study).

Conclusion
Using a combined approach of in silico analysis and experimentally determined data on WSSV transcriptomics, further support was found for the presence of different coregulated classes of WSSV genes. Comparisons with other large ds DNA viruses provided insight into the transcription mechanism of these classes and the putative promoter motifs involved. In order to determine the functionality of these motifs empirically, cell culture systems for shrimp will have to be further developed.

Virus infection
The virus isolate used in this study, known as WSSV-TH (acc. no. AF369029), originates from infected Penaeus monodon shrimp obtained in Thailand in 1996 and was treated as described before [15]. Crayfish (Orconectes limosus) were injected intramuscularly with purified WSSV using a 26-gauge needle to initiate infection. Three days post infection (d.p.i.), the crayfish were frozen in liquid nitrogen and stored at -80°C until further use.

5' and 3' Rapid Amplification of cDNA ends (5'/3' RACE)
Both 5' and 3' RACE were carried out using a commercial 5'/3' RACE kit (Roche) following the manufacturer's instructions. Total RNA was isolated from the frozen gill tissue of three infected crayfish O. limosus (harvested 3 d.p.i.) as described before [31]. In case of the 3' RACE of ORF30, first strand cDNA was synthesized using the oligo(dT) anchor primer. The resulting cDNA was amplified using one specific forward primer (ORF30-RACE-F1: CAGACCCGATTACAGTAGCAG; WSSV-TH location: 48983-49003) and the anchor primer. For the 5' RACE of ORF112 and ORF160, the RACE-R1 primers mentioned below were used for synthesis of the cDNA. This cDNA was purified using the High Pure PCR Product Purification Kit (Roche) and a homopolymeric 3' d(A)-tail was added to the cDNA in a mixture with a total volume of 20 μl, using terminal transferase and dATPs included in the kit. This mixture (5 μl) was used in a PCR, performed with an oligo(dT) anchor primer and a nested RACE-R2 primer (see below). The final products of the 5' and 3' RACE were cloned into the pGEM-T easy vector (Promega) and sequenced.

DNA sequencing and computer analysis
Plasmid clones carrying RACE products were sequenced at the company BaseClear (the Netherlands), using universal M13 forward and reverse primers. Sequence data were analyzed using the software package DNASTAR4.2. All sequence data were edited and aligned in GeneDoc, version 2.6.000 [49].
Figure 4. Alignment of 3' flanking sequences of WSSV genes. The stop codon, polyadenylation signal, start of polyadenylation and the T rich region are indicated. Sequences are aligned by stop codon and by polyadenylation signal. Abbreviations used and references: ie1: immediate-early 1 [17]; vp466 [12]; vp53a, vp11, vp136b [13]; collag: collagen-like ORF (this study). For abbreviations and references of other genes see Figs. 1 & 3.