Cloning, analysis and functional annotation of expressed sequence tags from the Earthworm Eisenia fetida

Background Eisenia fetida, commonly known as red wiggler or compost worm, belongs to the Lumbricidae family of the Annelida phylum. Little is known about its genome sequence although it has been extensively used as a test organism in terrestrial ecotoxicology. In order to understand its gene expression response to environmental contaminants, we cloned 4032 cDNAs or expressed sequence tags (ESTs) from two E. fetida libraries enriched with genes responsive to ten ordnance related compounds using suppressive subtractive hybridization-PCR. Results A total of 3144 good quality ESTs (GenBank dbEST accession number EH669363–EH672369 and EL515444–EL515580) were obtained from the raw clone sequences after cleaning. Clustering analysis yielded 2231 unique sequences including 448 contigs (from 1361 ESTs) and 1783 singletons. Comparative genomic analysis showed that 743 or 33% of the unique sequences shared high similarity with existing genes in the GenBank nr database. Provisional function annotation assigned 830 Gene Ontology terms to 517 unique sequences based on their homology with the annotated genomes of four model organisms Drosophila melanogaster, Mus musculus, Saccharomyces cerevisiae, and Caenorhabditis elegans. Seven percent of the unique sequences were further mapped to 99 Kyoto Encyclopedia of Genes and Genomes pathways based on their matching Enzyme Commission numbers. All the information is stored and retrievable at a highly performed, web-based and user-friendly relational database called EST model database or ESTMD version 2. Conclusion The ESTMD containing the sequence and annotation information of 4032 E. fetida ESTs is publicly accessible at .


Background
As key representatives of the soil fauna, earthworms are essential in maintaining soil fertility through their burrowing, ingestion and excretion activities [1]. There are over 8000 described species worldwide, existing everywhere but in Polar and arid climates [2]. They are increasingly recognized as indicators of agroecosystem health and ecotoxicological sentinel species because they are constantly exposed to contaminants in soil. The earthworm species (e.g., Eisenia fetida, Eisenia andrei, and Lumbricus terrestris) widely used in standardized acute and reproduction toxicity tests belong to the Lumbricidae family (phylum, Annelida; class, Clitellata; subclass Oligochaeta; order, Haplotaxida; superfamily, Lumbricoidea; family, Lumbricidae). E. fetida and E. andrei are two sibling species commonly found in North American composters and are sold commercially for fish bait. They have a life span of 4-5 years and are obligatorily amphimictic even though each worm has both male and female reproductive organs [3].
Like many other ecologically important species, genomics research in earthworms lags far behind other model species such as Mus musculus and Caenorhabditis elegans. In the absence of full genome sequences, expressed sequence tags (ESTs) allow rapid identification of expressed genes by sequence analysis and are an important resource for comparative and functional genomic studies. ESTs are often generated from either end of randomly selected cDNA clones and provide valuable transcriptional data for the annotation of genomic sequences. Because of recent advances in biotechnology, ESTs are produced daily in large quantities, with nearly 42 million entries in the current GenBank db EST database (release 030207). Nevertheless, it is still a challenging bioinformatics problem to analyze and annotate the often short, redundant and yet error prone EST sequences in an appropriate and efficient manner, especially when the genome sequence of the organism is unknown. Recent years have seen some EST projects undertaken with L. rubellus [4] and E. andrei [5], which have generated 19,934 and 1,108 ESTs, respectively (db EST release 030207). Before this study, there were only 96 nucleotide and 89 protein Entrez records found for E. fetida. In the present study, we cloned, sequenced and analyzed 4032 ESTs from E. fetida. We used suppression subtractive hybridization-PCR (SSH) to enrich cDNAs responsive to ten ordnance related compounds (ORCs). This work is part of a larger effort to identify candidate molecular biomarkers for rapid, mechanism-based gene expression assays to supplement current acute and reproductive toxicity tests. The specific objectives of this study were (1) to isolate and characterize cDNAs from E. fetida that can be used to monitor exposure to ORCs, and (2) to make the E. fetida EST information publicly accessible by integrating it to our web-based EST model organ-ism database (ESTMD) so that it can be shared with interested parties.

cDNA library and EST sequence analysis
We cloned a total of 4032 cDNAs from the two SSH libraries (see Methods for details). We transformed and picked 2208 clones from forward subtracted cDNA pools and 1824 from the reverse subtracted cDNA pools. After running on 96-well gel electrophoresis, 216 clones were found to be false positives with no inserts or had more than one insert. We sequenced the remaining 3816 clones producing 3144 good quality sequences with an average length of 310 bases. We batch-deposited them in the Gen-Bank db EST under accession numbers EH669363-EH672369 and EL515444-EL515580. Clone sequences that were too short (<50 bases) or of poor quality (<50 good quality bases, see methods for quality criteria) were excluded from further analysis. The observed failure rate (18%) is typical for high-throughput sequencing [6]. The deposited, cleaned sequences were further assembled into 2231 clusters (or unique sequences) on the basis of sequence similarity and quality. Nearly 80% or 1783 of the clusters produced were singletons, and 80% of the remaining 448 contigs (average length = 428 bases) were assembled from 2 or 3 clone sequences ( Figure 1). The highest number of sequences assembled into one contig was 30. The most represented putative genes in our libraries are Cd-metallothionein, cytochrome oxidase, chitotriosidase, actin, ATP synthase, Nahoda protein, lysozyme, SCBP (soluble calcium binding protein), ferritin, troponin T, lumbrokinase, and myohemerythrin (Table 1).

Comparative sequence analysis
We used the 2,231 unique sequences to search non-redundant protein databases using blastx [6][7][8]. A total of 743 sequences (33% of all unique sequences) matched known proteins with cut-off expect (E) values of 10 -5 or lower, among which 71 (3%) had E-values between 10 -100 and 10 -50 , 309 (14%) between 10 -50 and 10 -20 , and 363 (16%) between 10 -20 and 10 -5 (Table 2). A total of 880 unique sequences had less meaningful matches (E > 10 -5 ). The remaining 608 sequences (27%) had no matches. We also examined unique E. fetida sequences to determine similarity to the genes of four model organisms Drosophila melanogaster, Mus musculus, Saccharomyces cerevisiae, and Caenorhabditis elegans. A total of 830 blastx matches were found for 517 E. fetida unique sequences (23%) at the cutoff E-value of 10 -5 (Table 3). Some E. fetida ESTs matched genes conserved between the four organisms. More than 50% of the matches came from the mouse genome, whereas only 5 matches were found in the yeast genome. These results suggest that earthworms may be more evolutionarily distant from the yeast than from the other three organisms.

Functional classification
We adopted the Gene Ontology (GO) annotation of the aforesaid four model organisms to interpret the function of the E. fetida ESTs [6][7][8]. Each unique sequence of E. fetida was assigned the same gene functions of the best blastx hit genes (E ≤ 10 -5 ) in these model organisms' genome. The assigned GO terms for the unique sequences are categorized and outlined in Table 4 (biological process), Table 5 (molecular function), and Table 6 (cellular component). A complete listing of all the GO mappings is available in Additional file 1. The most represented molecular function is "binding" accounting for 51% of the total 517 unique sequences assigned with at least one GO term (Table 5), whereas those for biological processes are "cellular process" (39%) and "physiological process" (40%) ( Table 4). In terms of the final child GO categories, the most frequently assigned biological processes are "protein metabolism" (12.5%), "cellular macromolecule metabolism" (11.7%), and "cellular protein metabolism" (11%) under both cellular and physiological processes (Table 4), whereas those for molecular functions are "hydrolase activity" (11%) and "protein binding" (10%) ( Table 5). The largest subcategory in cellular components is "intracellular organelle" (23.6%) under both the intracellular part and the organelle ( Table 6).

Pathway assignment
We assigned the unique E. fetida sequences to a specific Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway based on their matching Enzyme Commission (EC) numbers. A total of 157 unique sequences (accounting for 7% of all unique sequences) including 28 contigs and 129 singletons matched enzymes with an EC number. Fiftyeight unique sequences are involved in two or more path-Distribution of 1361 good quality ESTs in 448 assembled contigs ways. The remaining 99 pathway-assigned sequences are mapped to only one pathway. Eighty-two unique sequences (52% of total) containing 14 contigs and 68 singletons were assigned to metabolism pathways (Table  7 and complete listing available in Additional file 2). Amino acid metabolism has the highest number of assigned pathways, followed by carbohydrate metabolism, energy metabolism, translation, and signal transduction. Genes putatively coded by a singleton EW1_F1plate05_B07 (enoyl coenzyme A hydratase) and Contig 251 (thioredoxin peroxidase) are most versatile, which are mapped to 10 and 8 pathways, respectively.

ESTMD (EST Model Database) web application
The ESTMD is a highly performed, web-accessible and user-friendly relational database [6]. It facilitates and enhances the retrieval and analysis of EST information by providing a number of comprehensive tools for mining raw, cleaned and clustered EST sequences, GO terms and KEGG pathway information as well as a variety of webbased services such as BLAST search, data submission and sequence download. The application is developed using advanced Java technology (Jsp and Servlets) and it supports portability, extensibility and data recovery. It can be accessed at http://mcbc.usm.edu/estmd/. The workflow process is as follows: Users input keywords or IDs from the web interface and then submit them as a query to the server. The server processes the query and retrieves date from the backend database through the database connection interface. The results are processed and sent to the users in proper formats.

Discussion
Using SSH-PCR we enriched earthworm cDNAs responsive to exposure of ten ORCs that represent three classes of chemicals, i.e., nitroaromatics (2,4-dinitrotoluene, 2,6dinitrotoluene, 2,4,6-trinitrotoluene (TNT), and trinitrobenzene), heterocyclic nitroamines (1,3,5-trinitroperhydro-1,3,5-triazine or RDX and 1,3,5,7-tetranitro-1,3,5,7-tetrazocane or HMX) and heavy metals (Cd, Cu, Zn and Pb) (Figures 3 and 4). Exposure times varied from 4-d to 28-d to capture gene expression changes at different time points. In consideration of the magnitude of effort required by this study, we selected a single dose for each compound. We also believe that differentially expressed transcripts captured on the time scale may represent to a certain degree those manifested on the dosage scale, and vise versa. We chose time over dosage mainly because we are more interested in early indication of later effects. We purposely mixed the RNA samples from different exposures for library construction because of the large variety of chemicals and exposure length, compared to the relatively small amount of resources available. The cloned cDNAs may not represent genes responding to one specific compound because each chemical, especially each class of chemicals, is likely to have specific mode of action involving different genes. Nevertheless, this library construction strategy served our downstream purpose of making cDNA microarrays with the isolated cDNA clones even though we cannot identify which cDNA or groups of cDNAs responded to which compound and at which exposure time point using the raw EST data. The combination of SSH-PCR and cDNA microarray analysis has been a widely used approach for identifying differentially expressed genes [9,10] and characterizing mechanisms of action of known and suspected toxicants [11,12], especially when there is no or little genomic information available for the test organism. Our microarray studies have generated data enabling us to further identify differentially expressed transcripts and to elucidate sublethal toxicological mechanisms in E. fetida exposed to TNT alone [13] or a mixture of TNT and RDX [14].
It is worth noting that the comparative sequence analysis (23%) and functional classification (7%) based on GO and KEGG analysis only found a small portion of the ESTs highly homologous (E ≤ 10 -5 ) with well-annotated genes. Nevertheless, the functions of these ESTs are widely dis-tributed representing 830 different GO terms and 99 different KEGG pathways. Notably, genes putatively involved in carbohydrate, energy and amino acid metabolism, cellular processes of endocrine, immune, nervous and sensory systems, signal transduction, DNA transcription, RNA translation and post-translation splicing are identified suggesting that the ten ORCs may have affected a wide range of important pathways.
From candidate biomarker gene point of view, we found repeatedly the existence of some toxicant-specific E. fetida mRNAs in our libraries (Table 1 and Additional file 3). For instance, the expression of metallothionein (MT) mRNA, the most abundant transcript in our cDNA libraries, is reportedly a sensitive and early genetic biomarker of metal exposure [15][16][17][18]. Demuynck et al. [15] demonstrated that a single exposure to 8 mg Cd/kg of dry soil for 1 day induced MT mRNA. Brulle et al. [16] observed changes in MT mRNA expression as early as 14 hr after exposure. There are also clear differences of MT gene expression between worms exposed to different Cd concentrations (8,80 or 800 mg Cd/kg of dry soil) [15]. Copper is an essential element for the activity of a number of physiologically important enzymes including cytochrome c oxidase (COX), Cu/Zn-superoxide dismutase (SOD), and dopamine-beta-hydroxylase (DBH). However, exposure to a toxic level of copper can not only induce MT for Cu sequestration [17] but also alter the expression of COX (Table 1), SOD and DBH genes (Additional file 3). Fur-

cDNA library construction
Two earthworm cDNA libraries were constructed using SSH-PCR [19]. Earthworms (E. fetida) were maintained in continuous culture from stocks obtained from Carolina Biological Supply (Burlington, NC). Worms were kept in moistened sphagnum peat (pH 6.5-7.5, moisture content 50%), and were fed ad libitum on a diet of Magic Worm Food (Carolina Biological Supply). Fully clitellate adults weighing 0.3-0.6 g (live weight) were selected for all experiments.
Exposed and unexposed earthworms were fixed in RNAlater (Ambion, Austin, TX) and stored at -80°C. Total RNA was extracted using RNeasy kits (Qiagen, Valencia, CA), and poly(A) mRNA was separated from total RNA using NucleoTrap mRNA purification kit (BD Biosciences, San Jose, CA). The integrity and concentration of both total and mRNA were checked on an Agilent 2100 Bioan-alyzer (Palo Alto, CA). The gel-like images generated by the Bioanalyzer show that both RNAs have only one bright band close to the 2 kb ladder band (Figures 5 &6), which is distinctive from the two bands seen with 18S and 26S RNA of mammalian RNA. A Clontech PCR-Select ™ cDNA subtraction kit (BD Biosciences) was then used to enrich for differentially expressed genes (Figure 7).

EST cloning and sequencing
After the secondary PCR amplification, both forward and reverse subtracted PCR products of the two libraries were cloned using pCR2.1 or pCR4.0 vectors and Mach1-T1 chemically competent cells (Invitrogen, Carlsbad, CA). Positive colonies were picked and grown overnight at 37°C in LB media containing 50 μg/mL ampicillin in a 96-deep well block format. Half of the clone culture (300 μl) was archived with 300 μl of 60% glycerol and stored at -80°C. Two μl of the remaining clone culture was amplified in a 100-μl PCR reaction. After amplification, 8 μl of the PCR reaction was checked on a 96-well electrophore-

EST processing
Many software programs are available that provide sequence cleansing and assembly. These include commercial software such as Sequencher (Gene Codes, Ann Arbor, Michigan, USA), and Aligner (CodonCode, Dedham, MA, USA), and open source software such as CAP and TIGR Assembler. With these software packages it is possible to quickly remove vector sequences from each EST clone and screen the ESTs for low-quality sequences. The high-quality and trimmed EST sequences then can be used to find overlap assembly of contiguous sequences. Sequence information was stored in ABI chromatograph trace files, and Phred was used to perform base-calling [20]. Phred read DNA trace data, called bases, assigned quality values to the bases, and wrote the base calls and quality values to output sequence files in either FASTA or SCF format. Quality values for the bases were later used by the sequence assembly program, Phrap [21], to increase the accuracy of assembled sequences. Phred uses simple Fourier methods [22] to examine the four base traces in the data set to predict a series of evenly spaced locations. It determines where the true peak location would be if there were no compressions, dropouts, or other factors shifting the peaks from their locations. Then Phred examines each trace to find the centers of the observed peaks and the areas of these peaks relative to their neighbors. A dynamic programming algorithm [23] is used to match the observed peaks detected in the second step with the predicted peak locations found in the first step. It uses a quality value lookup table to assign the corresponding quality value. The quality value is related to the base call error probability by the formula QV = -10 × log 10 (P e ) where P e is the probability that the base call is an error [20].
Typically, sequence chromatograms have low-quality regions at the beginning and the end of each sequence read [24]. One can automatically remove the low-quality ends when quality values are available. This process is called "end clipping" or "end trimming". There are two different end clipping methods [24], (1) maximizing regions with error rates below a given threshold, and (2) using separate criteria at the start and the end of the sequence. We chose the former method which was implemented in CodonCode Aligner [25] to remove low quality bases at both ends by setting quality score QV ≥ 20 (or P e ≤ 0.01). Flanking vector/adaptor sequences should also be trimmed off because they can lead to incorrect assemblies or alignment. We input a custom-made vector/adaptor file [24] into the Aligner [25] to trim vector/adaptor sequences. Furthermore, we used VecScreen [26] to detect and then manually removed any residual and partial vector contamination in our ESTs.
Phrap was used to assemble sequence fragments into a larger sequence by identifying overlaps between sample sequences [27]. Samples that can be joined together are put into "contigs". The following greedy algorithm is used in Phrap. First, it finds potential overlaps between samples by looking for shared 12-nucleotide "words" in the sequence. Then the pair of samples with highest number of shared words is found. If the alignment is good enough, it would be kept as a new contig, and the consensus sequence would be calculated; otherwise, the alignment would be rejected, and the two samples would be left sep- arated. Four criteria were used to determine whether to accept or reject an alignment: (1) minimum percent identity (the minimum percentage of identical bases in the aligned region) ≥70%; (2) minimum overlap length ≥25 bps, (3) minimum alignment score; which is similar to (2) but takes any mismatches into account, ≥20 bps; and (4) maximum gap size ≤15 bps. Overall, these criteria were relatively relaxed if compared to more stringent set-tings such as 90% for minimum percentage identity or minimum overlap length ≥35 bps. If one sample has an insertion/deletion that is larger than 15 bps, the alignment will typically stop there, and the rest of the sample will be considered unaligned. The alignment process would be then repeated. If a sample is in a contig, the consensus sequence is then used for the contig. If the two samples are already in the same contig, the next pair is The ESTMD database schema showing tables, fields, and data types Figure 2 The ESTMD database schema showing tables, fields, and data types.
Scheme of RNA sample pooling for subtractive suppression hybridization cDNA library construction: the first library Figure 3 Scheme of RNA sample pooling for subtractive suppression hybridization cDNA library construction: the first library.
Scheme of RNA sample pooling for subtractive suppression hybridization cDNA library construction: the second library Figure 4 Scheme of RNA sample pooling for subtractive suppression hybridization cDNA library construction: the second library.
retrieved and analyzed. It repeats and continues the pairwise joins until all possible joins have been tried, or until the maximum number of merge failures in a row has occurred.
After assembly, all contigs with more than three ESTs were assessed for missassemblies using the assembly viewer Consed [28]. Contigs flagged for possible missassemblies were manually edited using Consed tools to remove potential chimeric ESTs or other suspect ESTs. Chimerism occurs because of multiple insert cloning or mistracking of sequence gel lanes. After assembly with Phrap, contigs with more than three ESTs were examined again in Consed to eliminate additional missassemblies not resolved by Phrap. Any bps with a calculated quality value below 12 was changed to an N (unknown base) which was considered as a suspect ESTs.

EST comparative analysis and functional assignment
Comparative analysis was performed using blastx through NCBI with the unique sequences (including the consensus sequences of assembled contigs and the singletons). Blastx searches were conducted on our local BLAST server against the NCBI's non-redundant peptide sequence database. The returned search results (100 best hits) were transferred automatically into a relational database. We discarded hits with an E-value > 10 -5 and sorted out the remaining hits by organism name. To assign putative functions to the unique E. fetida sequences, we extracted the GO hierarchical terms of their homologous genes from the protein databases of the following four model organisms: Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae [29][30][31]. Meanwhile, we also mapped the unique sequences to metabolic pathways in accordance with the KEGG [32]. Enzyme commission (EC) numbers [33] were acquired for the unique sequences by blastx searching (E-value ≤ Subtracted and non-subtracted cDNAs electrophoresed on a 2% agarose/SybrGreen gel in 1× sodium borate buffer Figure 7 Subtracted and non-subtracted cDNAs electrophoresed on a 2% agarose/SybrGreen gel in 1× sodium borate buffer. Lane 1: forward subtracted earthworm (EW) cDNA; Lane 2: forward non-subtracted EW cDNA; Lane 3: reverse subtracted EW cDNA; Lane 4: reverse non-subtracted EW cDNA; Lane 5: subtracted human skeleton muscle (HSM) cDNA; Lane 6: non-subtracted HSM cDNA; Lane 7: control subtracted human skeleton muscle cDNA; Lane 8: 1 kb DNA ladder.
10 -5 ) the SWIR database, which is made up from three protein databases WormPep, SwissProt and Trembl. The EC numbers were then used to putatively map unique sequences to specific biochemical pathways [6,7]. All the matched GO and pathway information was automatically stored in our local relational database.

EST database implementation and web application
To facilitate efficient management and retrieval of the EST information obtained from this project, we upgraded our previous developed EST model database (ESTMD version 1) [6] and integrated the earthworm EST information into the new version of ESTMD. The current implementation of ESTMD (version 2) has many new features. The main changes include further normalization of tables from 50 tables to 17, altering main tables to be capable of storing multiple organism information, adding a new table (contigview) to store view information, using a 2D Java class for displaying contigs instead of a Perl script, and implementing the whole web application as a unified portable web module.
ESTMD is currently hosted on Suse Linux 10 and can be implemented in MySQL 4.0 or higher version. It has an integrated web-based application with a three-tier structure, i.e., client, sever and backend database (Figure 8). The web-based interface of the database was created using HTML and JavaScript to evaluate the validation of the input on the client side and to reduce the burden on the server side. Apache 2.2 is used as the HTTP web server, The architecture of ESTMD web application and database Figure 8 The architecture of ESTMD web application and database.
while Tomcat 5.5 is the Servlets container. Both of these programs were developed and maintained on Linux and WinNT, ensuring that the database is transplantable and platform-independent. The server-side programs are implemented by Java 2 Enterprise Edition (J2EE) technologies. Servlet and JSP (JavaServer Pages) are used to communicate between users and databases and to implement a query.