EchinoDB, an application for comparative transcriptomics of deeply-sampled clades of echinoderms
© Janies et al. 2016
Received: 5 July 2015
Accepted: 7 January 2016
Published: 22 January 2016
One of our goals for the echinoderm tree of life project (http://echinotol.org) is to identify orthologs suitable for phylogenetic analysis from next-generation transcriptome data. The current dataset is the largest assembled for echinoderm phylogeny and transcriptomics. We used RNA-Seq to profile adult tissues from 42 echinoderm specimens from 24 orders and 37 families. To sample members of clades that span key evolutionary divergences, we collected many of our exemplars from deep and polar seas.
Only a small fraction of the transcriptome data we produced is being used for phylogenetic reconstruction. Thus, to make a larger dataset available to researchers with a wide variety of interests, we built a web-based application, EchinoDB (http://echinodb.uncc.edu). EchinoDB is a repository of orthologous transcripts from echinoderms that is searchable via keywords and sequence similarity.
From transcripts we identified 749,397 clusters of orthologous loci. We have developed the information technology to manage and search the loci and their annotations with respect to the sea urchin (Strongylocentrotus purpuratus) genome. Several users have already taken advantage of these data for spin-off projects in developmental biology, gene family studies, and neuroscience. We hope others will search EchinoDB to discover datasets relevant to a variety of additional questions in comparative biology.
In many studies that use transcriptomics to reconstruct phylogenetic trees, most of the RNA-Seq data are filtered out and do not end up in a matrix for phylogenetic tree search. However, the data not used in phylogenetics can be valuable for other purposes such as developmental biology, gene family studies [2, 3], and neuroscience, as well as for new ideas that will come from the community. Thus we make much of our transcriptome data freely available via an application called EchinoDB (http://echinodb.uncc.edu). The data can be accessed via text or sequence similarity searches.
Echinoderms are an exclusively marine phylum of deuterostome animals that share a deep common ancestor with chordates. The body plans of extant echinoderms range from stalked, flower-like sea lilies, to ambulatory and stellate starfish and brittle stars, to soft-bodied sea cucumbers, to spiked, armored and globose sea urchins, to flat sand dollars. The benthic adult forms of these diverse animals share a water-vascular system in which a central coelomic ring extends to form five (and sometimes more) radial canals bearing tube feet. In contrast with the pentaradial form of benthic adults, most echinoderm larvae are bilaterally symmetric, and a drastic metamorphosis is required to form the adult body. The diversity of echinoderm life cycles and anatomy, together with their shared ancestry with chordates, makes echinoderms important models in a variety of comparative disciplines. In this project we provide a means for investigators across biology to find gene families relevant to their questions.
Construction and content
RNA from muscle tissue samples (adult tube feet, pinnules or body wall) from 42 echinoderm specimens (Additional file 1) was extracted using a Qiagen miRNeasy kit. An Agilent Bioanalyzer 2100 ver. 2.6 was used for quality control prior to library preparation. Samples were then submitted to the Duke Institute for Genome Science and Policy for library preparation with an Illumina TruSeq RNA kit, followed by RNA-Seq sequencing on an Illumina HiSeq 2000 platform (100 bp, paired end). Reads for each sample were filtered by quality score (cutoff threshold > Q20) with fastx_trimmer, and Illumina adapters were then removed with fastx_clipper; both tools are components of the fastx toolkit.
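The quality-filtering step can be illustrated with a minimal Python sketch. This is not the fastx toolkit's implementation: it assumes Phred+33 quality encoding, trims only from the 3' end, and treats Q20 itself as passing. The function names and the `min_len` parameter are hypothetical.

```python
PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def phred_scores(quality_line):
    """Decode a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_line]


def trim_3prime(sequence, quality_line, min_q=20):
    """Drop bases from the 3' end until one scores min_q or higher."""
    scores = phred_scores(quality_line)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality_line[:end]


def filter_reads(records, min_q=20, min_len=1):
    """Yield trimmed (name, sequence, quality) records that are still long enough."""
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            yield name, seq, qual
```

A read whose quality string ends in low-scoring characters such as `#` (Q2) is shortened, and a read that falls below `min_len` after trimming is dropped entirely.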
RNA-Seq produced a total of 2.3 billion raw reads. Following trimming and adapter removal, 2.1 billion reads remained, an overall reduction of approximately 11 %. The sample from Pisaster ochraceus had the most reads at 88,987,394. The Cheiraster sp. sample had the fewest reads at 30,190,658. The sample from Promachocrinus kerguelensis had the most reads removed, a decrease of nearly 19 %. At the other end of the spectrum, the sample from Gephyrocrinus messingi had the fewest reads removed, a reduction of 3.64 %. We observed no correlation between taxonomic level and read count. De novo assembly of contigs was then performed using Trinity on a high-memory compute cluster with 500 GB of RAM and 24 CPUs.
Contigs for each sample were conceptually translated into peptides using Transdecoder and the PFAM-B protein family database (minimum protein length = 100 amino acids). Each translated contig was compared to all other contigs to discover orthologous clusters using OrthoMCL, which uses BLASTP. To provide an initial annotation of the assembled contigs in each OrthoMCL cluster, 24,829 protein sequences for Strongylocentrotus purpuratus were downloaded from NCBI and included in the OrthoMCL clustering. Except for Strongylocentrotus purpuratus, most of these species have never been sequenced by any high-throughput technology. This provided an opportunity to compare our Strongylocentrotus purpuratus contigs derived from the transcriptome to the publicly available genome data for that species. We compared the Strongylocentrotus RefSeq dataset to our nucleotide contigs with BLASTN and found that 91.6 % of our contigs formed high-scoring pairs (E-value threshold 1e-10) with members of the RefSeq dataset.
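The proportion of contigs with a RefSeq match can be computed from BLAST output along the following lines. This is a sketch under stated assumptions: the article does not say which output format was used, so the code assumes the standard 12-column tabular format (`-outfmt 6`), in which column 1 is the query ID and column 11 is the E-value. The function name and the inclusive threshold are hypothetical.

```python
def fraction_with_hits(blast_rows, query_ids, max_evalue=1e-10):
    """Return the fraction of query_ids with at least one hit at or below max_evalue.

    blast_rows: iterable of tab-separated lines in BLAST tabular format,
    where column 1 is the query ID and column 11 is the E-value.
    """
    query_ids = set(query_ids)
    matched = set()
    for row in blast_rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.split("\t")
        query, evalue = fields[0], float(fields[10])
        if query in query_ids and evalue <= max_evalue:
            matched.add(query)
    # Contigs with no rows at all count as unmatched.
    return len(matched) / len(query_ids) if query_ids else 0.0
```

The full list of contig IDs is passed separately because contigs without any hit do not appear in the BLAST output and would otherwise be left out of the denominator.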
EchinoDB is written in the Go programming language with the Revel web framework and is served by the NGINX web server. NGINX allows for load balancing and transparent server redirections in the web application. The redirection allows a single domain name to serve both the EchinoDB keyword search functionality and a BLAST (sequence similarity) interface using SequenceServer. All of the relational data and clusters are stored in a PostgreSQL database, and all sequence files are stored and indexed by BLAST on the local file system.
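The idea of serving two applications from one domain can be sketched as prefix-based routing. The Python below is only an illustration of the dispatch logic, not the EchinoDB NGINX configuration: the path prefixes, ports, and function name are hypothetical.

```python
# Hypothetical route table: prefixes and backend addresses are illustrative only.
ROUTES = [
    ("/blast/", "http://127.0.0.1:4567"),  # sequence-similarity interface
    ("/", "http://127.0.0.1:9000"),        # keyword-search application
]


def pick_backend(path, routes=ROUTES):
    """Return the backend registered for the longest matching path prefix."""
    matches = [(prefix, backend) for prefix, backend in routes if path.startswith(prefix)]
    if not matches:
        raise LookupError(f"no backend for {path!r}")
    return max(matches, key=lambda item: len(item[0]))[1]
```

Choosing the longest matching prefix mirrors how NGINX selects among prefix locations, so a request under `/blast/` reaches the BLAST interface while every other path falls through to the keyword-search application.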
Utility and discussion
There is one well-annotated echinoderm genome, Strongylocentrotus purpuratus, in the public domain. As a result, the keyword search interface to our database is currently searchable by Strongylocentrotus purpuratus RefSeq ID, GI number, gene name, or other text in the annotation of this species. Strongylocentrotus purpuratus RefSeq proteins can participate in the formation of a cluster but are not required to form a cluster.
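A keyword lookup of this kind can be sketched as a single query over an annotation table. The example below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL. The table name, column names, and rows are hypothetical placeholders, not the EchinoDB schema or real records.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical annotation table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation ("
        "cluster_id TEXT, refseq_id TEXT, gi_number TEXT, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?, ?)",
        [
            # Placeholder rows, not real EchinoDB records.
            ("cluster_0001", "XP_0000001.1", "1000001", "example globin-like protein"),
            ("cluster_0002", "XP_0000002.1", "1000002", "example metalloproteinase inhibitor"),
        ],
    )
    return conn


def keyword_search(conn, term):
    """Return cluster IDs whose RefSeq ID, GI number, or description matches term."""
    pattern = f"%{term}%"  # note: % and _ inside term act as LIKE wildcards
    rows = conn.execute(
        "SELECT DISTINCT cluster_id FROM annotation "
        "WHERE refseq_id = ? OR gi_number = ? OR description LIKE ? "
        "ORDER BY cluster_id",
        (term, term, pattern),
    )
    return [row[0] for row in rows]
```

Identifiers are matched exactly and free text is matched as a substring. Because the lookup relies on Strongylocentrotus purpuratus annotation, a cluster that contains no RefSeq protein from that species would not be returned by such a query.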
This is the first large collection of data for transcriptomes sampled across the Phylum Echinodermata, including rare and deep-sea taxa. Given the ancient evolutionary history of the phylum, it is crucial to have a resource that can provide insight via well-designed taxonomic comparisons. In contrast, other efforts have focused on Strongylocentrotus and a handful of easy-to-collect echinoderms and outgroups [12, 13].
Several users have already taken advantage of the data in EchinoDB for spin-off projects across taxa and disciplines. Developmental biologists have used EchinoDB data to study variation in skeletogenic proteins among ophiuroid and echinoid echinoderms. Biologists interested in gene families have used EchinoDB data to discover echinoderm hemoglobins related to the vertebrate neuroglobin and cytoglobins. Another group has used EchinoDB data to uncover variants within the tissue inhibitors of metalloproteinases gene family. These genes are involved in the physiology of mutable collagenous tissue in echinoderms, especially holothuroids, which are known to have a wide range of body elasticity and can eviscerate. Neuroscientists have used EchinoDB data to study variation in echinoderm neuropeptide precursors, known as SALMFamides. This work has opened a new line of research on the role of SALMFamide variants in extra-oral feeding in asteroids. We hope that users will find our application easy to use and the echinoderm tree of life transcriptome data useful in a variety of endeavors.
Availability and requirements
Use of http://echinodb.uncc.edu, its data, and its source code (http://firstname.lastname@example.org/bioservices/echinodb.git) is unrestricted for academic and commercial researchers.
We thank Ben Grupe, Brian Livingstone, Jim Nestler and Nerida Wilson for collecting some of the specimens. We thank Greg Rouse and Harim Cha for collecting some specimens and curating vouchers at Scripps Institution of Oceanography. We thank John Williams for ongoing EchinoDB support.
EchinoDB is a product of the Echinoderm Tree of Life Project and is based upon work supported by the United States’ National Science Foundation (NSF) under Grant Numbers DEB 1036416, 1036358, 1036366, 1036368. Please cite EchinoDB’s URL, these NSF grant numbers, and this paper in publications resulting from these data.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Seaver R, Livingston B. Examination of the skeletal proteome of the brittle star Ophiocoma wendtii reveals overall conservation of proteins but variation in spicule matrix proteins. Proteome Sci. 2015;13:7. doi:10.1186/s12953-015-0064-7.
- Christensen A, Herman J, Elphick M, Kober K, Janies D, Linchangco G, et al. Phylogeny of Echinoderm Hemoglobins. PLoS One. 2015. doi:10.1371/journal.pone.0129668.
- Clouse R, Linchangco G, Kerr A, Reid R, Janies D. Phylotranscriptomic analysis uncovers a wealth of TIMP variants in echinoderms. Royal Society Open Science. 2015. doi:10.1098/rsos.150377.
- Jones C, Zandawala M, Semmens D, Anderson S, Hanson G, Janies D, et al. Identification of a neuropeptide precursor protein that gives rise to a “cocktail” of peptides that bind Cu(II) and generate metal-linked dimers. Biochim Biophys Acta Gen Subj. 2016;1860(1 Pt A):57–66. doi:10.1016/j.bbagen.2015.10.008.
- Hannon et al. Fastx toolkit. http://hannonlab.cshl.edu/fastx_toolkit. Accessed August 2012.
- Henschel R, Lieber M, Wu L, Nista P, Haas B, LeDuc R. Trinity RNA-Seq assembler performance optimization. In: XSEDE’12: Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond. 2012. Article No. 45. doi:10.1145/2335755.2335842.
- Haas B, Papanicolaou A. Transdecoder. http://transdecoder.github.io/. Accessed August 2012.
- Finn R, Bateman A, Clements J, Coggill P, Eberhardt R, Eddy R, et al. Pfam: the protein families database. Nucleic Acids Res. 2014;42(Database issue):D222–30. doi:10.1093/nar/gkt1223.
- Li L, Stoeckert C, Roos D. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Version 2.0.9. Genome Res. 2003;13(9):2178–89.
- NCBI. Protein RefSeqs for taxon 7668. http://www.ncbi.nlm.nih.gov. Accessed August 2012.
- Priyam A, Woodcroft BJ, Rai V, Wurm Y. SequenceServer: BLAST searching made easy. http://sequenceserver.com. Accessed January 2015.
- Tu Q, Cameron RA, Worley KC, Gibbs RA, Davidson EH. Gene structure in the sea urchin Strongylocentrotus purpuratus based on transcriptome analysis. Genome Res. 2012;22:2079–87. doi:10.1101/gr.139170.112.
- Tu Q, Cameron RA, Davidson EH. Quantitative developmental transcriptomes of the sea urchin Strongylocentrotus purpuratus. Dev Biol. 2014;385:160–7. doi:10.1016/j.ydbio.2013.11.019.