afterParty: turning raw transcriptomes into permanent resources
© Jones and Blaxter; licensee BioMed Central Ltd. 2013
Received: 7 December 2012
Accepted: 3 October 2013
Published: 7 October 2013
Next-generation DNA sequencing technologies have made it possible to generate transcriptome data for novel organisms quickly and cheaply, to the extent that the effort required to annotate and publish a new transcriptome is greater than the effort required to sequence it. Often, following publication, details of the annotation effort are only available in summary form, hindering subsequent exploitation of the data. To promote best-practice in annotation and to ensure that data remain accessible, we have written afterParty, a web application that allows users to assemble, annotate and publish novel transcriptomes using only a web browser.
afterParty is a robust web application that implements best-practice transcriptome assembly, annotation, browsing, searching, and visualization. Users can turn a collection of reads (from Roche 454 chemistry) or assembled contigs (from any sequencing chemistry, including Illumina Solexa RNA-Seq) into a searchable, browsable transcriptome resource and quickly make it publicly available. Contigs are functionally annotated based on similarity to known sequences and protein domains. Once assembled and annotated, transcriptomes derived from multiple species or libraries can be compared and searched. afterParty datasets can either be created using the existing afterParty server, or using local instances that can be built easily using a virtual machine. afterParty includes powerful visualization tools for transcriptome dataset exploration and uses a flexible annotation architecture which will allow additional types of annotation to be added in the future.
afterParty's main use case scenario is one in which a working biologist has generated a large volume of transcribed sequence data and wishes to turn it into a useful resource that has some durability. By reducing the effort, bioinformatics skills, and computational resources needed to annotate and publish a transcriptome, afterParty will facilitate the annotation and sharing of sequence data that would otherwise remain unavailable. A typical metazoan transcriptome containing several tens of thousands of contigs can be annotated in a few minutes of interactive time and a few days of computational time.
Recent advances in DNA sequencing technology have greatly reduced the cost, time and effort required to generate large volumes of sequence data. While new sequencing approaches have been used to great effect in well-studied species [2, 3], perhaps the biggest beneficiaries have been research programmes focussing on non-model organisms. For such organisms, which typically lack a reference genome sequence, transcriptome sequencing offers an efficient way to explore the regions of the genome likely to be of most interest to researchers [4, 5]. The production of a novel transcriptome typically involves several steps. mRNA is extracted from the organism of interest, purified, fragmented and reverse transcribed into cDNA. Several such cDNA collections may be made in order to capture transcripts that are only produced in specific tissue types, life stages, environmental conditions, etc. The cDNA molecules are then ligated to sequencing adapters and size-selected before having one or both ends sequenced. The result is a very large number of short reads that must undergo significant processing before they can be used to investigate the biology of the organism.
The details of the data-processing steps depend on the details of the experiment and the sequencing technology, but the steps themselves remain the same. Read sequences are cleaned to remove low-quality regions and sequencing adapters before being assembled to give a collection of contigs, or putative transcripts. To gain an insight into the functions of the genes represented by these transcripts, and to identify novel transcripts, the contigs are annotated using a variety of methods. Typically, researchers annotate contigs using a combination of similarity to known sequences and protein domains [7, 8] and machine-learning methods which identify features such as transmembrane domains and signal peptides [9, 10]. These annotations can be used to put the putative transcripts in biological context [11, 12].
The increase in the availability of transcribed sequence data places corresponding demands on the bioinformatic tools used to make sense of it. While tools have been developed to carry out the tasks of cleaning [13, 14], assembling [6, 15, 16] and annotating [8–10] transcribed sequence data, integration of these tools into a pipeline is generally on an ad-hoc basis and in a manner that is not user-friendly. As high-throughput sequencing becomes more pervasive, analysis tools that can be used by biological researchers who are not expert bioinformaticians to both create and investigate annotated transcriptomes will become essential. The increasing volume of sequence data also puts pressure on methods of data dissemination. Publications and raw sequence data resulting from transcriptome sequencing projects are generally made available and archived, but intermediate, detailed annotations are typically not.
Table: Comparison of existing software tools designed to work on whole transcriptome datasets. Surviving column headings: in active development; sequence data type. Surviving cell values: Sanger EST reads; BLAST, InterProScan, KEGG, prot4est; assembled contigs + read mapping data; public microarray data; Roche 454 raw reads or assembled contigs.
Several tools offer potential solutions for data exploration. BRIGEP is a suite of tools that includes a transcriptome browser to address the need for data visualization. However, BRIGEP is focussed on integration with proteomic data, requires significant technical ability to set up, and does not assist the user in creating annotation. Similarly, the TranscriptomeBrowser tool offers an interface to existing transcript data with a focus on molecular interactions. Genome browsers [22–24] are feature-rich, but they typically require considerable effort to set up, and the gene-centric requirements of transcriptome analysis and visualization do not fit well into their genome- and chromosome-centric paradigm [25, 26].
To address the need for an integrated, dependency-free, intuitive tool for transcriptome annotation and publication we have developed afterParty, a web application that runs entirely within a browser and functions both as an annotation tool and a transcriptome browsing and visualization tool. afterParty takes as its input either raw reads or assembled contigs, and uses existing best-practice tools and databases to annotate them, resulting in collections of annotated putative transcripts (“datasets”) along with metadata describing how the sequences were produced. afterParty also acts as a web interface to datasets, allowing non-bioinformatician users to browse contigs, search annotation, and define and visualize sets of contigs. Using afterParty, a biologist can turn a collection of next-generation sequencing reads into a durable, web-accessible transcriptome resource without the need for expert knowledge, software dependencies, or extensive computing power.
The afterParty web application functions as an interface to two sets of tools: one for creating datasets, and one for searching, browsing and visualizing them. To create a new dataset, the user uploads either a set of raw sequencing reads which afterParty assembles into contigs, or a collection of pre-assembled contigs (generated using any appropriate combination of sequencing technology and assembly software) and, optionally, coverage and quality data. Contigs are then annotated and the annotations indexed for rapid searching. To investigate an existing dataset, a user can browse individual contigs or search within datasets for contigs of interest. Searches can interrogate the annotations or contig properties (coverage, GC content, etc.) and can be performed across multiple assemblies in a dataset (e.g. for different species or different RNA libraries). afterParty is implemented as a web application written in Groovy using the Grails web framework, with the PostgreSQL relational database management system for data storage. It is offered as a publicly available server at http://afterparty.bio.ed.ac.uk, but can also be downloaded and run locally.
Putative transcripts are represented in afterParty by contigs, which are grouped into assemblies. A compound sample can have multiple assemblies. Using this mechanism it is possible to have multiple versions of an assembly for a single set of reads. A contig may be decorated with multiple pieces of information, each of which is represented by an annotation. Each individual input sequence that makes up a contig is represented as a read. Arbitrary collections of contigs are stored as contig sets. A contig can belong to any number of contig sets.
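The relationships described above can be sketched as a handful of Python dataclasses. This is an illustrative model only, not afterParty's actual Groovy/Grails domain classes; every class and field name here is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Annotation:
    source: str        # e.g. "BLASTX" or "InterProScan"
    description: str
    start: int         # region of the contig that the annotation covers
    stop: int


@dataclass
class Read:
    name: str
    sequence: str


@dataclass
class Contig:
    name: str
    sequence: str
    annotations: List[Annotation] = field(default_factory=list)
    reads: List[Read] = field(default_factory=list)


@dataclass
class Assembly:
    name: str
    contigs: List[Contig] = field(default_factory=list)


@dataclass
class CompoundSample:
    name: str
    assemblies: List[Assembly] = field(default_factory=list)


@dataclass
class ContigSet:
    name: str
    contig_names: Set[str] = field(default_factory=set)


# One compound sample holding two alternative assemblies of the same reads.
sample = CompoundSample("adult female")
v1 = Assembly("assembly v1", [Contig("contig_1", "ATGGCC")])
v2 = Assembly("assembly v2", [Contig("contig_1b", "ATGGCCTTA")])
sample.assemblies.extend([v1, v2])

# A contig can belong to any number of contig sets.
short = ContigSet("short contigs", {"contig_1"})
favourites = ContigSet("favourites", {"contig_1", "contig_1b"})
```

Storing contig names rather than contig objects in a contig set is one possible design; it keeps sets cheap to save and combine.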
In workflow A, the user uploads a collection of 454 pyrosequencing reads in FASTQ format. afterParty will carry out read assembly using the MIRA assembler, optionally trimming adapter sequences using ea-utils. It will then annotate each resulting contig by carrying out a sequence similarity search using BLASTX against the UniProt database of known protein sequences, and running InterProScan to identify known protein domains. Quality and coverage information for each base in each contig as reported by the assembler will be stored along with the contig sequence, annotation, and read mapping locations.
In workflow B, the user has already assembled their sequencing reads into contigs and has various choices for uploading them. They can upload a FASTA format file containing contigs, in which case no coverage, quality or read mapping data will be stored, or they can upload an ACE format file which contains coverage, quality and read mapping information. Once uploaded and stored, contigs are annotated as described in workflow A.
Transcriptome assembly from high-throughput data remains an active field of research. Thus workflow B allows users to apply methods best suited to their data type and organism(s) to generate an optimal contig set. In particular, this scenario is likely to be useful for Illumina RNA-Seq sequence data, as well as for complex or large 454 or Sanger transcriptomes that are unlikely to be assembled well by the default MIRA assembler. Hybrid approaches to transcriptome assembly, in which output from multiple assembly tools is merged, can also be used under this scenario.
In workflow C, the user has already assembled a collection of contigs and run the necessary annotation tools. Contigs are uploaded as described for workflow B, and annotation data are uploaded in either XML (for BLASTX) or GFF3 (for InterProScan) format. No assembly or annotation is carried out by afterParty; the data are merely stored and indexed. This scenario is likely to be useful for users who have access to parallel compute facilities that can carry out the annotation more rapidly than could be accomplished using afterParty. This workflow allows the use of any BLAST database for annotation, for instance a genome database for a closely related organism.
In all three workflows datasets remain private, and only visible to the logged-in owner, until explicitly made public.
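The difference between the three workflows comes down to which stages afterParty runs itself. The sketch below shows one way to express that choice in Python; the function name, argument names and step labels are hypothetical and are not taken from the afterParty code base.

```python
# Stages afterParty carries out in each workflow (labels are illustrative).
WORKFLOW_STEPS = {
    "A": ["assemble", "annotate", "index"],  # raw 454 reads uploaded
    "B": ["annotate", "index"],              # pre-assembled contigs uploaded
    "C": ["index"],                          # contigs and annotation uploaded
}


def choose_workflow(has_reads=False, has_contigs=False, has_annotation=False):
    """Return "A", "B" or "C" for a given combination of uploaded data."""
    if has_annotation and not has_contigs:
        raise ValueError("annotation files need the contigs they describe")
    if has_contigs:
        return "C" if has_annotation else "B"
    if has_reads:
        return "A"
    raise ValueError("upload either raw reads or assembled contigs")


workflow = choose_workflow(has_contigs=True)
print(workflow, WORKFLOW_STEPS[workflow])
```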
For the workflows where annotation is carried out inside afterParty (A and B above), annotation proceeds in two steps. First, BLASTX from the BLAST+ 2.2.25 package is used to search the UniProt protein reference database for sequences showing sequence similarity to the contig sequence. The ten most highly similar UniProt entries are stored as annotation, along with their E-value scores and the regions of the contig to which they show similarity. Second, the InterProScan 5 package is used to identify protein domains and regions of interest on the contig using the following applications: ProDom-2006.1, PfamA-26.0, TIGRFAM-12.0, SMART-6.2, Gene3d-3.3.0, Coils-2.2, Phobius-1.01. All InterProScan matches are stored along with their E-value scores (where applicable) and positions.
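As a rough illustration of the first step, the sketch below keeps the ten best hits per contig together with their E-values and contig coordinates. It assumes BLAST's 12-column tabular output (`-outfmt 6`) purely for brevity, whereas afterParty's upload route works with BLAST XML; ranking by E-value and then bit score is also an assumption about what "most highly similar" means. The function name and the example hit lines are invented.

```python
from collections import defaultdict

MAX_HITS = 10  # the ten most similar UniProt entries are kept per contig


def top_hits(tabular_lines, max_hits=MAX_HITS):
    """Group BLAST tabular lines by query contig and keep the best hits."""
    hits = defaultdict(list)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        hits[cols[0]].append({
            "subject": cols[1],
            "contig_start": int(cols[6]),   # qstart
            "contig_stop": int(cols[7]),    # qend
            "evalue": float(cols[10]),
            "bitscore": float(cols[11]),
        })
    best = {}
    for contig, contig_hits in hits.items():
        # Lowest E-value first; higher bit score breaks ties.
        contig_hits.sort(key=lambda h: (h["evalue"], -h["bitscore"]))
        best[contig] = contig_hits[:max_hits]
    return best


example = [
    "contig_1\tsp|P2\t60.0\t100\t40\t0\t10\t309\t1\t100\t1e-20\t90",
    "contig_1\tsp|P1\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-50\t200",
    "contig_2\tsp|P3\t90.0\t50\t5\t0\t4\t153\t1\t50\t3e-25\t110",
]
best = top_hits(example, max_hits=1)
```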
Once a dataset has been created, afterParty offers users a variety of ways to explore it. All annotations, whether generated by afterParty or uploaded by the user, are indexed using PostgreSQL's full-text indexing tools. These improve the quality of search results by removing common English words, dealing with suffixes, and allowing boolean search terms. Users can browse a table of the contigs belonging to a particular assembly, compound sample, or study. Alternatively, they can use any of afterParty's search tools to identify contigs of interest. There are three ways to search in afterParty. To search by annotation, users supply a search string (which can include the boolean operators AND, NOT and OR) and afterParty will identify the set of contigs that have matching annotation. To search by similarity, users supply an input DNA or protein sequence and afterParty uses BLASTN, TBLASTN or TBLASTX to carry out a sequence similarity search and identify contigs with significant similarity. To search by contig property (any combination of GC content, read coverage, quality and length), users select a region of a scatter plot encompassing the values they wish to include.
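Searching by contig property amounts to computing a few numbers per contig and keeping those that fall inside the selected region. The sketch below does this for length and GC content, treating the selected scatter-plot region as a simple rectangle; the function names, the rectangular selection and the example sequences are all assumptions for illustration.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the identifier, dropping any description text.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_content(sequence):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def select_region(contigs, min_length, max_length, min_gc, max_gc):
    """Return names of contigs whose length and GC content lie in the box."""
    return [name for name, seq in contigs
            if min_length <= len(seq) <= max_length
            and min_gc <= gc_content(seq) <= max_gc]


fasta = [">contig_1 example", "ATGCGC", "GC",
         ">contig_2", "ATATATATAT",
         ">contig_3", "GGGCCCAT"]
contigs = list(read_fasta(fasta))
selected = select_region(contigs, 8, 10, 0.5, 1.0)
```

Here `contig_2` is long enough but contains no G or C bases, so it falls outside the selected region.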
Search results can be saved as contig sets, so that they can be retrieved or shared with colleagues without having to re-run the search. Searches can also be restricted to contig sets, leading to a powerful and intuitive way to identify contigs of interest by iteratively combining different types of search. For example, a user can start with a set of contigs from a particular developmental stage, search inside that set for contigs with a particular protein domain, then search inside the resulting set for contigs longer than a minimum length.
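The iterative narrowing described in this example can be sketched as repeated filtering of a set of contig names. The contig records, field names and the `search_within` helper below are invented for illustration and do not reflect how afterParty stores its data.

```python
def search_within(contig_set, predicate, contigs):
    """Return the members of contig_set whose records satisfy predicate."""
    return {name for name in contig_set if predicate(contigs[name])}


# Toy records: developmental stage, protein domains found, contig length.
contigs = {
    "c1": {"stage": "L3", "domains": {"PF00001"}, "length": 1200},
    "c2": {"stage": "L3", "domains": {"PF00002"}, "length": 900},
    "c3": {"stage": "adult", "domains": {"PF00001"}, "length": 1500},
    "c4": {"stage": "L3", "domains": {"PF00001"}, "length": 300},
}

all_contigs = set(contigs)
# Step 1: contigs from one developmental stage.
stage_set = search_within(all_contigs, lambda c: c["stage"] == "L3", contigs)
# Step 2: within that set, contigs carrying a particular protein domain.
domain_set = search_within(stage_set, lambda c: "PF00001" in c["domains"], contigs)
# Step 3: within the result, contigs at least 500 bases long.
long_set = search_within(domain_set, lambda c: c["length"] >= 500, contigs)
```

Because each intermediate result is itself a set, it can be saved and reused as the starting point for a different follow-up search.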
Grouping of contigs into contig sets allows in depth exploration of properties within and between sets. afterParty automatically creates contig sets for entire assemblies, compound samples, and studies. Database owners and users can define additional contig sets based on particular properties of contig annotation, such as stage-specific expression, or the results of a sequence similarity search.
Table columns: dataset; sequencing technology; number of contigs; number of annotations; afterParty workflow (see Figure 2).
Transcriptome of the nematode Litomosoides sigmodontis from three life stages: Roche 454 FLX / Titanium
Transcriptome of the nematode Anguillicola crassus: Roche 454 FLX / Titanium
Transcriptome of the moth Plodia interpunctella: Illumina Solexa RNA-Seq
We assembled and annotated a collection of transcriptome sequence data from the filarial nematode Litomosoides sigmodontis using the workflow depicted in Figure 2A. L. sigmodontis is the subject of an ongoing transcriptome project, and the transcriptome data are typical of the type for which we expect afterParty to be useful. A total of 764,024 reads from five libraries were assembled and annotated using an installation of afterParty on an 8-core server. Assembly took ~48 hours and annotation took ~5 days. The resulting dataset has 76,340 contigs, of which 69,355 have at least one UniProt annotation and 24,491 have at least one protein domain annotation. The dataset can be explored on the afterParty web server. A subset of these raw L. sigmodontis data is available as a test dataset for new users.
We used a collection of already-assembled transcripts to create a transcriptome resource for the nematode Anguillicola crassus using the workflow depicted in Figure 2B. Sequencing reads for male, female, and L3 individuals were generated using Roche/454 FLX Titanium chemistry and assembled using a hybrid strategy. The assembled contigs were uploaded before being annotated using afterParty. The resulting dataset has 14,064 contigs, of which 12,625 had at least one UniProt annotation and 6,583 had at least one protein domain annotation. The dataset can be explored on the afterParty web server. A. crassus transcriptome assembly data were kindly provided by Emanuel Heitlinger (Berlin).
We used a collection of already-assembled transcripts along with existing annotation to create a transcriptome resource for the Indian Meal Moth, Plodia interpunctella, using the workflow depicted in Figure 2C. The assembly was built using Trinity from RNA-Seq data derived from four samples of fourth instar larvae, each consisting of 20 pooled individuals. Annotation was generated using BLAST and InterProScan on a Sun Grid Engine (SGE) compute cluster. The assembled contigs and annotation files were uploaded to afterParty to create a dataset with 116,191 contigs, of which 71,608 had at least one UniProt annotation and 16,373 had at least one protein domain annotation. The data can be explored on the afterParty web server. P. interpunctella transcriptome assembly data were kindly provided by Seanna McTaggart (University of Edinburgh).
Since afterParty acts as a wrapper around existing third-party tools for assembly and annotation, the overhead imposed versus running the tools manually is minimal. For a test dataset of 100,000 Roche 454 reads taken from the L. sigmodontis dataset, assembly using MIRA took 41 minutes using afterParty compared with 35 minutes when run manually. Annotating a subset of 100 contigs using a BLASTX search vs. UniProt took 407 seconds in afterParty compared with 394 seconds when run manually. Running InterProScan against the same set of 100 contigs took 270 minutes in afterParty compared with 254 minutes when run manually. Timing tests were carried out on a workstation with 4 Intel Xeon L5640 2.27 GHz CPUs.
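The relative overhead implied by these timings can be worked out directly. The short Python calculation below uses only the figures quoted above; the variable and function names are illustrative.

```python
# (time inside afterParty, time when run manually), in the units stated.
timings = {
    "MIRA assembly of 100,000 reads (minutes)": (41, 35),
    "BLASTX vs. UniProt for 100 contigs (seconds)": (407, 394),
    "InterProScan for 100 contigs (minutes)": (270, 254),
}


def overhead_percent(in_afterparty, manual):
    """Extra time taken inside afterParty, as a percentage of the manual run."""
    return 100.0 * (in_afterparty - manual) / manual


overheads = {task: round(overhead_percent(a, m), 1)
             for task, (a, m) in timings.items()}

for task, pct in overheads.items():
    print(f"{task}: {pct}% overhead")
```

On these figures the wrapper adds roughly 17% to assembly, 3% to the BLASTX search and 6% to the InterProScan run.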
We have designed afterParty to be locally deployable for researchers who wish to host datasets themselves, take advantage of local compute facilities, and maintain fine-grained access control. Local deployment of afterParty can be carried out in two ways. The source code is freely available (see Availability and Requirements) and can be installed (along with dependencies) on a standard web server. Alternatively, we have made available a virtual disk image including afterParty and all dependencies, which may be used to create a virtual machine running afterParty. afterParty has been tested using multiple datasets of between ~10,000 and ~250,000 contigs and found to run satisfactorily for dataset browsing and visualization on a 2-core web server with 4 GB RAM.
A single afterParty instance is capable of serving multiple datasets, so we anticipate that a single local installation will be sufficient to serve the needs of a group of researchers working on different projects. The afterParty interface has been designed to facilitate collaboration and sharing of information and is designed such that each study, compound sample, assembly, contig set and contig has a unique URL. Users can easily share a link to a given resource by embedding the URL in an email or web page.
An entire afterParty instance (potentially containing many datasets) can be archived either as a database dump or as a virtual disk image. Database dumps are more compact and hence easier to store. However, recent long-term archival solutions achieve storage costs on the order of $0.01 per gigabyte per month, making the storage of complete virtual machines a realistic option (we estimate the size of a complete VM image for a large afterParty instance to be less than 20 GB).
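As a back-of-the-envelope check on that claim, the calculation below combines the two figures quoted above. It assumes, purely for illustration, that the price stays constant over time and that the image is exactly at the 20 GB upper bound.

```python
COST_PER_GB_MONTH = 0.01  # US dollars, order-of-magnitude figure from the text
VM_IMAGE_GB = 20          # upper bound on image size quoted in the text


def storage_cost(size_gb, months, rate=COST_PER_GB_MONTH):
    """Total archival cost in dollars, assuming a constant monthly rate."""
    return size_gb * rate * months


monthly = storage_cost(VM_IMAGE_GB, 1)
ten_years = storage_cost(VM_IMAGE_GB, 120)
print(f"per month: ${monthly:.2f}; over ten years: ${ten_years:.2f}")
```

Under these assumptions a complete image costs about $0.20 per month to keep, or about $24 over ten years.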
We anticipate that the need for tools like afterParty will increase as next-generation sequencing technologies become ever more accessible. In particular, we see a role for afterParty in presenting transcriptome studies which encompass multiple related organisms, aggregating data across research projects.
Obvious extensions to afterParty are the inclusion of additional assembly options and of new types of annotation data. Although the MIRA assembly tool has been shown to produce suboptimal assemblies for some datasets, we chose it for use in afterParty because of its modest computational requirements, non-restrictive license, and ease of integration. We plan to integrate additional assembly tools and strategies into afterParty, which will allow the use of input data from other sequencing platforms. The modular design of afterParty's annotation framework ensures that new types can be easily added. We plan to add storage for expression data, such as microarray data and sequence-counting estimates of transcript abundance, open reading frames, matches to proteomics resources, and pathway annotations. We believe that the use of cross-species contig sets to store ortholog relationships will be particularly useful. We also plan to add export tools to afterParty that will aid users in preparing data for submission to annotation archives, such as the International Nucleotide Sequence Database Collaboration (INSDC) Third Party Annotation (TPA) databases.
The computational requirements of afterParty vary throughout the workflow in a distinctive way. The assembly stage can have high memory requirements, and the annotation stage can have high CPU requirements. Once a dataset has been assembled and annotated, however, the memory and CPU resources needed to serve it are modest. CPU-intensive operations such as searching annotations are very brief (in tests, our web server [2 CPU cores @ 2.50 GHz] was able to carry out a full-text search on a dataset with 1.2 million annotation items in under a second). This pattern of transient high demand (during assembly and annotation) and long-term low demand (during browsing and searching) makes afterParty a good candidate for cloud-based compute infrastructure. We are currently investigating the possibility of implementing a highly parallel cloud computing model for the afterParty annotation pipeline.
afterParty is an open-source tool for turning raw transcriptome sequencing reads and assembled contigs into searchable, browsable transcriptome resources with powerful visualization tools. In contrast to existing solutions, afterParty integrates all steps of the transcriptome annotation workflow and presents an intuitive user interface for non-expert users, while being flexible enough to accommodate assemblies and annotations produced by more experienced users. It implements best-practice assembly and annotation methods, and facilitates data sharing and visualization. It is our hope that, by easing the process of annotation, publication, and stable archiving, afterParty will facilitate the distribution and exploration of richly-decorated transcriptome data that would otherwise remain inaccessible.
Project name: afterParty
Project home page: https://github.com/mojones/afterParty2
Operating system: platform independent (developed on Ubuntu Linux 12.04)
Programming language: Groovy [http://groovy.codehaus.org/]
Grails 2.0.3 [http://grails.org/]
Spring security [http://grails.org/plugin/spring-security-core]
Spring security UI [http://grails.org/plugin/spring-security-ui]
PostgreSQL 9.1 [http://www.postgresql.org/]
NCBI blast+ 2.2.25 [ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/]
InterProScan 5 [http://code.google.com/p/interproscan/]
Mira 3.2.1 [http://sourceforge.net/projects/mira-assembler/]
License: GNU GPL
Any restrictions to use by non-academics: no
Because of the number of dependencies that afterParty relies on, we have made the software available in three different ways.
On the public server (http://afterparty.bio.ed.ac.uk), no special credentials are necessary to browse published datasets. We are happy to host new transcriptome datasets on this server; please contact the corresponding author (MJ) to obtain a user account. To get started, follow the various tutorials either on the wiki [https://github.com/mojones/afterParty2/wiki/afterParty], or as screencasts [http://www.youtube.com/user/theblaxterlab/videos].
The source code for afterParty is hosted at GitHub [https://github.com/mojones/afterParty2]. Pull requests are welcome. Bugs and feature requests can also be submitted at the above address. Follow the installation instructions here: https://github.com/mojones/AfterParty2/wiki/LocalInstall.
To assist researchers who would like to run a local installation of afterParty, we have prepared a virtual disk image, based on Ubuntu (server) 12.04, which can be run under a virtual machine hypervisor such as VirtualBox. The virtual disk image expands to around 80 GB and requires a 64-bit host. This is the easiest way to get afterParty running locally as all necessary dependencies and permissions are already set up. Follow the installation instructions here: https://github.com/mojones/AfterParty2/wiki/VMInstall.
afterParty is funded by the BBSRC Tools and Resources programme (grant number BB/I023585/1). Sequence data for model datasets were provided by the Enhancing Protective Immunity Against Filariasis programme of the EU (EU FP7 programme, EU Specific International Cooperation Action [SICA] reference 242131; L. sigmodontis), Emanuel Heitlinger (Institute for Biology, Berlin; A. crassus) and Seanna McTaggart (CIIE, The University of Edinburgh; P. interpunctella). We thank Stuart Taylor (Edinburgh Genomics, The University of Edinburgh) for hardware support, and colleagues in the Institute of Evolutionary Biology for comments on the software and the manuscript. afterParty is built using many open-source software tools; we would like to thank the contributors.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.