GigaTON: an extensive publicly searchable database providing a new reference transcriptome in the pacific oyster Crassostrea gigas

Background The Pacific oyster, Crassostrea gigas, is one of the most important aquaculture shellfish resources worldwide. Important efforts have been undertaken towards a better knowledge of its genome and transcriptome, which makes now C. gigas becoming a model organism among lophotrochozoans, the under-described sister clade of ecdysozoans within protostomes. These massive sequencing efforts offer the opportunity to assemble gene expression data and make such resource accessible and exploitable for the scientific community. Therefore, we undertook this assembly into an up-to-date publicly available transcriptome database: the GigaTON (Gigas TranscriptOme pipeliNe) database. Description We assembled 2204 million sequences obtained from 114 publicly available RNA-seq libraries that were realized using all embryo-larval development stages, adult organs, different environmental stressors including heavy metals, temperature, salinity and exposure to air, which were mostly performed as part of the Crassostrea gigas genome project. This data was analyzed in silico and resulted into 56621 newly assembled contigs that were deposited into a publicly available database, the GigaTON database. This database also provides powerful and user-friendly request tools to browse and retrieve information about annotation, expression level, UTRs, splice and polymorphism, and gene ontology associated to all the contigs into each, and between all libraries. Conclusions The GigaTON database provides a convenient, potent and versatile interface to browse, retrieve, confront and compare massive transcriptomic information in an extensive range of conditions, tissues and developmental stages in Crassostrea gigas. To our knowledge, the GigaTON database constitutes the most extensive transcriptomic database to date in marine invertebrates, thereby a new reference transcriptome in the oyster, a highly valuable resource to physiologists and evolutionary biologists.


Background
The Pacific Oyster, Crassostrea gigas, is a bivalve mollusk that belongs to lophotrochozoans, an under-described phylogenetic group among protostomes despite being the sister clade of ecdysozoans, which encompass insects and nematodes. C. gigas is an intertidal bivalve with a pelagic larval phase followed by a benthic adult life. Pacific oysters are farmed in intertidal or nearshore areas that undergo highly variable environmental conditions as well as an important anthropization, making C. gigas considered a sentinel species. Besides, C. gigas constitutes one of the most important shellfish aquaculture resources worldwide with an estimated market value reaching 1.3 billion dollars in 2013 (FAO). In this context, knowledge about the physiology of C. gigas is of fundamental, ecological and applied interests, leading to the generation of an important amount of transcriptomic data, as illustrated by the Gigasdatabase expressed-sequenced tags (EST) database [1]. These ESTs have enabled the development of microarrays used in physiological studies notably focusing on reproduction [2,3], resistance to summer mortality [4][5][6], hypoxia [7], thermal stress [8] or ploïdy [9]. Additionally, MPSS or RNA-Seq analyses have been performed to investigate molecular processes involved in larval growth heterosis [10], immune responses to pathogenic bacteria [11] or response to salinity stress [12]. In many studies, biological understanding of data was hampered by limited annotation of transcripts, notably due to incomplete assembly or lack of reference transcriptome resource.
Aside from transcriptome studies, the genome of the Pacific oyster has recently been sequenced and assembled, and is now available for the community to explore [13]. The present assembly of the oyster genome has been annotated and genes were predicted in silico by comparison with other species. Together with the genome characterization, extensive transcriptomic data have also been generated from a very broad range of developmental stages and physiological conditions giving rise to 114 RNA-seq libraries. These data were directly mapped to the genome and used to gain insights into developmental, shell formation and stress response issues in the oyster from studies based on differential expression of the predicted genes.
Up to date, these RNA-seq data were mostly used to validate the published version of the genome assembly. In addition, such a massive sequencing effort [13] offers the opportunity to gain more insights into the transcriptome of the Pacific oyster at different development stages and under different environmental conditions. Insights into these features are critical for our understanding of many aspects of the biology in marine bivalves and their evolution. However, there is to date, to the best of our awareness, no convenient publicly accessible resource (i) where all the transcriptomic data is assembled and gathered and (ii) that is able to answer the need for a convenient way to browse into such an important amount of data. Therefore, we further exploited the C. gigas transcriptomic data generated in parallel to the genome characterization by Zhang et al [13], by persample assembling, then meta-assembling of the obtained contigs, the 2204 million reads from the 114 RNA-seq libraries. We provide these results including sequence of complete open reading frames (ORFs) (thereby protein), presence of transcript variants, knowledge of 5' and 3'-untranslated regions, polymorphism and expression levels between physiological stages and conditions into an extensive, searchable, highly potent and user-friendly online database based on RNABrowse [14], the GigaTON database.

System design and implementation
Pipeline for in-silico mRNA expression levels, functional annotation, identification of assembly variants.

Data acquisition
The 114 RNA-seq libraries were generated as a part of the oyster genome project [13], except for two of them that were used in a study dealing with salinity stress response [15]. They comprise 99 single and 15 paired raw read files and have been downloaded in fastq format, from the NCBI SRA website (study number SRP014559 and SRP013236, respectively).

Sequence/data cleansing
The reads were first cleaned from remaining sequencing adapters using trim_galore (http:// www.bioinformatics.babraham.ac.uk/projects/ trim_galore/). Then the over-represented reads were filtered out using the normalize_by_kmer_coverage.pl script from the Trinity software package [16]. The next step aimed at discarding the invalid base calls by extracting the longest sub-sequence without unidentified nucleotides (Ns) from each read. If the length of the longest sub-sequence did not exceed half of the sequence length, the read and his mate, if present, were removed. This last step was performed using an in-house script (available upon request).

Contig assembly
The assembly was performed in two steps described hereafter. A per sample assembly was first performed, allowing us to build a contig set for each sample. Then, all sample contig sets were merged, then duplicates were removed and references with very low read counts were filtered out.
For each sample, nine Oases [17] assemblies using different k-mers (25,31,37,43,49, 55, 61, 65 and 69) were performed on cleaned reads. Each assembly produced a transcripts.fa contig file, which header includes an assembly locus. Only the longest and most covered (read depth) loci were kept using the OasesV0.2.04OutputToCsvDataBase.py script developed by a Brown University team (https:// code.google.com/p/oases-to-csv/). After that, all files were merged. Finally, anti-sense chimeras (accidentally produced by the assembly step) were split or removed with an in-house script (available upon request). Because similar collections of contigs were produced by different k-mers, a cd-hit-est [18] step was used to remove redundant contigs based on their sequence similarities (parameters -M 0 -d 0 -c 0.98 -T 8). Because different kmers sometimes constructed different transcript parts, TGICL [19] was used to assemble contigs having significant overlaps (parameters -c 4 -l 60 -p 96 -s 100000).
The contigs were then filtered on a minimum length of 200 base pairs. Input reads were mapped back to the contigs using bwa aln [20]. The resulting alignment files were used to correct the contig sequences from spurious insertions and deletions using an in-house script (available upon request) and to filter out those with very low coverage. The filter excluded contigs with less than two mapped reads per million. All sample contig fasta files were merged and each contig renamed by adding the sample name at the beginning of its name. The longest ORF was then searched with getorf from EMBOSS [21]. A cd-hit clustering was performed on the ORFs with sequence identity equal or greater than 0.9 in order to extract from each cluster, the contig with the longest ORF or, if the ORF size are the same, the longest contig. The remaining contigs were clustered by cd-hit-est with sequence identity equal or greater than 0.95. Input reads from all conditions were then mapped back to the contigs using bwa aln [20]. Contigs with very low coverage, less than two mapped reads per million, were filtered out. The final set contained 56,651 contigs, corresponding to a total of 93,954,163 base pairs. The median and mean length of the contigs were respectively 1180 and 1,659.35 base pairs. The longest contig contained 66,407 base pairs (Fig. 1).

Functional annotation
The contig set was annotated using blast versus several databases including uniprot-swissprot, refseq protein, refseq_rna, rfam, Ensembl oyster proteins and the contigs themselves.

Quantification
All sample read sets were aligned on the final contig set using bwa aln to quantify the expression. A quantification  Table 1 and Table 2, respectively. 7. RNABrowse environment setting, database features and request tools Differential analyses: Venn diagrams and Digital Differential Display (DDD). The GigaTON database can be requested for differential analysis of contig content (Venn diagrams) and expression level between libraries (DDD) using a Fisher's exact test. These requests are easily performed using a 'point-and-click' interface for sample library grouping, analysis parameters and results exploitation (browsing and downloading) (Fig. 2).
A detailed description of each library can be found in the download section of the GigaTON website (libraries description .xlsx). Contig search and browser: the BioMart portal. The information about the contigs can be searched, browsed and retrieved online using the Biomart portal (www.biomart.org) [6]. Briefly, the GigaTON contigs or subgroup of contigs can be filtered out according to various criteria and sub-criteria (name, length, annotation number in a wide range of repository databases, Gene Ontology, keyword, expression), and retrieved with their corresponding attributes selected by the user among an extensive list of 71, including: filter criteria, best annotation identifier, hit start/stop position, number of variants, DNA sequence, expression pattern in tissues, developmental stages and various physiological conditions (Fig. 3). The retrieved contigs can be browsed further for additional information (sequence, counterparts in other databases, expression between libraries) in convenient and user-friendly point-and-click interface tabs (Fig. 4).

Utility and discussion
Gain of knowledge regarding existing databases In the present work, we generated 56,621 high quality contigs. This assembly brings an important amount of additional information when compared to existing databases (+52.3 % in contig number compared to the EST library GigasDatabase v9 [1]). This number is also greater than the 28067 consensus genes that are predicted in the oyster genome [13]. This important increase in the transcript sequence information does not only lie in the number of contigs ( Fig. 5a) but also brings the 5' and 3' UTR information (Fig. 5b). Indeed, even though 454 sequencing was performed on cDNA from various oyster samples [13], none of the corresponding assembled full-length transcript sequence is, to our knowledge, publicly available to date. Therefore, we believe that the GigaTON database constitutes the new reference trancriptome in the Pacific oyster, a work of significance for the community. Some of its most relevant features are developed below. First, the novelty and gains of GigaTON are novel contigs. Among the 56621 contigs in GigaTON, 53,515 contigs (94,5 %) align to the Ensembl Oyter genome, with a vast majority displaying 95 % identity and 90 % coverage. A contig can be considered as 'novel' when its sequence is unknown or present in databases but not annotated, therefore including those which do not align with the genome, in addition to those that align with unannotated areas of the genome. The GigaTON (Fig. 5a).
An example of a novel contig in GigaTON is the contig 'CHOYP_contig_047748' that corresponds to the sequence encoding the precursor of a novel oyster neuropeptide family named "RxIamide". This sequence does not match any oyster Ensembl predicted gene, cDNA or protein, despite (i) ESTs present identity (HS206944, HS234774), and (ii) the gene products were characterized using mass spectrometry [23]. However, the considered sequence matches unannotated genomic sequences in the scaffold 40412 (43,230-43,117; 30,085-29,480). This situation is not unique and was also met for a number of neuropeptide-encoding genes (Buccalin, cerebrin, GGN-related peptide, LRNFVamide, opioid neuropeptide-like, tachykinin-like) [23]. Such examples illustrate the gain brought by GigaTON database, which, in addition to bringing new transcriptomic information, might help improving gene prediction.
Second, the GigaTON contigs provide insights into the 5' and 3' untranslated regions (UTRs) of the oyster transcribed sequences (Fig. 3 and Fig. 5b). Importantly, the 5' information is a significant asset towards the determination of a consensus promoter sequence in a bivalve   The parameters and results of the DDD analysis are indicated on the right. Ten contigs out of the 243 that were only expressed after metamorphosis are displayed as a 'click-and-browse' list (bottom panel). All results can also be downloaded as .tsv files for further analysis mollusk, which may help unlock functional genomics studies in lophotrochozoans. Besides, there is a widely admitted approximation that considers the 1 kb sequence upstream translation starts as being proximal promoter sequences, which is commonly used at present for the interpretation of NGS data [24][25][26]. However, the mean 5'-additional information we provide in the GigaTON database is around 600 pb (Fig. 5b), indicating a mean 600 pb length for 5'UTRs in the Pacific oyster. Therefore the aforementioned approximation presents a potential severe bias since up to 60 % of the sequence information might be mis-considered as promoter sequence instead of UTR, and thereby should be cautiously considered, as illustrated for some Hox genes orthologues [27]. This contradicts the general assumption that oysters, like other lophotrochozoans, have very short 5'-UTRs. However, because the length of the predicted proteins corresponding to the GigaTON contigs also displays a great increase (Fig. 5d), the additional 5' sequence information can also correspond to coding DNA, thereby minimizing the flaw in proximal promoter approximation previously stated. The same rationale can be proposed regarding the additional 3' sequence information provided within the GigaTON contigs as being 3'UTRs and/or coding regions. This point is important for the understanding of the mi/piRNA pathway in oysters, which mostly affects 3'-UTRs in vertebrates [28,29] but could also highlight alternative polyadenylation signals in mollusks. Therefore, additional work, including functional analyses dedicated to the determination of consensus promoter and/or polyadenylation signals, are required to decipher such potentially important issues for our understanding of gene expression regulation in the Pacific oyster, and should take great advantage of the present database. Second, the GigaTON pipeline led to the identification of a high number of assembly variants, since 30 % of the contigs display at least one variant (Fig. 5b). Some of these assembly variants highly likely correspond to transcript variants that could reflect alternative splicing when these variations are located within ORFs. Alternatively, when variations are located within UTRs, the presence of transcript variants could illustrate the usage of alternative promoters and/or polyadenylation signals.
Both can have important functional outcomes in terms of protein structure or gene expression level regulation. As an illustration, the contig CHOYP_41.3.3 (Fig. 4) exemplifies a putative transcript variant that might lack a region between two functional domains, thereby diminishing the length of the protein with probable physiological consequences.
Third, the great increase in length for a number of proteins as mentioned above, points out that the in silico protein prediction from the genome, which is mostly based on homology with other invertebrates, is significantly improved by the present assembly. This indicates a noticeable difference in protein sets between the Pacific oyster and other invertebrates. Such a discrepancy between invertebrates seems surprising, except if one considers the growing number of documented lophotrochozoan features, such as DNA methylation patterns [24,30], angiotensin-converting enzyme characteristics [31,32] or BMP cDNA sequences [33], that display a closer resemblance with vertebrates than with ecdysozoans. Besides, in a similar fashion than mapping long 454 reads helped assembling genomic scaffolds [13], the GigaTON database provides even longer contigs that we think could importantly help improve the current assembly of the oyster genome. These three features might be of functional significance for our understanding of the oyster physiology and support the importance of the GigaTON database, which encompasses a high diversity of transcriptomes, in the context of post-genomic studies in lophotrochozoans.

Easiness, potency and versatility of the tools
The GigaTON database provides, to our knowledge, the most complete and integrated transcriptomic resource to date in a marine mollusk. A remarkable strength of this database is the possibility to browse data quickly and easily to get a global picture of the user's question using Venn diagrams or DDD, which can easily be further examined in deep details. This is noticeably allowed by the use of the BioMart portal within GigaTON, which links extremely massive sequence and expression data in various physiological conditions with a user-friendly, (See figure on previous page.) Fig. 4 An overview of a contig search results. a: Caption from the 'General Info' tab. This screen enables the user to display and retrieve general information about the searched contig, including related keywords and functional annotation such as Gene Ontology. b: Caption from the 'Sequence View' tab. Statistics and DNA sequence of the contig are presented on this this screen. An embedded tool enables the user to search ORFs and get their translation. c: Caption from the 'JBrowse View' tab. This browser enables a quick overview of the contig and its corresponding counterparts in other databases. d: Caption from the 'Depth View' tab that contains expression data as the mean coverage and number of reads of the contig in all the libraries that can be chosen and displayed on a graph using a point-and-click interface. As an example, the contig CHOYP_41.3.3 is displayed in the Amu (in red) and Dgl (in blue) libraries. The structure of the corresponding protein has been added to the graph with the domain architecture. Discrepancies in mean depth coverage between library assembly variants may indicate splice variants at these positions (green gradient frame) easy to use, point-and-click interface. Our searchable online database enables a high versatility tool both in terms of request criteria and retrieved attributes that can be easily downloaded and further analyzed using dedicated software packages. In parallel, specific contig characteristics such as transcript variants, polymorphism and expression can be browsed online in the graphic interface. Furthermore, the included browser allows a convenient view of the target contig counterparts in numerous other databases at a single glance (Fig. 4c), without any particular The GigaTON contigs were translated using TransDecoder (http://transdecoder.github.io/) and resulted in 41445 proteins. They were aligned with the Ensembl protein from the oyster genome. The relative length of the 16787 alignments displaying at least 95 % identity between sets was plotted. The vertical smear on the right reflects longer proteins in the GigaTON database. For example, the protein with coordinates (X = 100, Y = 50) has its full Ensembl sequence covering 50 % of its GigaTON counterpart need for the user to be skilled in bioinformatics. Therefore, the GigaTON database is most likely to be extensively accessed by an increasing scientific community studying evolutionary, environmental or aquaculture related aspects in C. gigas. Together with the current genome assembly, it constitutes one of its strong assets as an emerging model species.

Conclusions
In this work, we assembled 114 RNA-seq libraries, mostly generated aside of the oyster genome project, into a publicly accessible database that includes userfriendly 'search, browse and retrieve' online tools, the GigaTON database. To date, the GigaTON database constitutes the most extensive transcriptomic resource in a marine invertebrate to our knowledge, and sets the new reference transcriptome in C. gigas, one of the most important shellfish aquaculture resources worldwide.

Availability and requirements
The GigaTON database is available at http://gigaton.sigenae.org. There are no restrictions for its use by nonacademics.