VTBuilder: a tool for the assembly of multi isoform transcriptomes
© Archer et al.; licensee BioMed Central Ltd. 2014
Received: 3 March 2014
Accepted: 19 November 2014
Published: 3 December 2014
Within many research areas, such as transcriptomics, the millions of short DNA fragments (reads) produced by current sequencing platforms need to be assembled into transcript sequences before they can be utilized. Despite recent advances in assembly software, creating such transcripts from read data harboring isoform variation remains challenging. This is because current approaches fail to identify all variants present or they create chimeric transcripts within which relationships between co-evolving sites and other evolutionary factors are disrupted. We present VTBuilder, a tool for constructing non-chimeric transcripts from read data that has been sequenced from sources containing isoform complexity.
We validated VTBuilder using reads simulated from 54 Sanger sequenced transcripts (SSTs) expressed in the venom gland of the saw scaled viper, Echis ocellatus. The SSTs were selected to represent genes from major co-expressed toxin groups known to harbor isoform variants. From the simulated reads, VTBuilder constructed 55 transcripts, 50 of which had a greater than 99% sequence similarity to 48 of the SSTs. In contrast, using the popular assembler tool Trinity (r2013-02-25), only 14 transcripts were constructed with a similar level of sequence identity to just 11 SSTs. Furthermore VTBuilder produced transcripts with a similar length distribution to the SSTs while those produced by Trinity were considerably shorter. To demonstrate that our approach can be scaled to real world data we assembled the venom gland transcriptome of the African puff adder Bitis arietans using paired-end reads sequenced on Illumina’s MiSeq platform. VTBuilder constructed 1481 transcripts from 5 million reads and, following annotation, all major toxin genes were recovered demonstrating reconstruction of complex underlying sequence and isoform diversity.
Unlike other approaches, VTBuilder strives to maintain the relationships between co-evolving sites within the constructed transcripts, and thus increases transcript utility for a wide range of research areas ranging from transcriptomics to phylogenetics and including the monitoring of drug resistant parasite populations. Additionally, improving the quality of transcripts assembled from read data will have an impact on future studies that query these data. VTBuilder has been implemented in java and is available, under the GPL GPU V0.3 license, from http:// http://www.lstmed.ac.uk/vtbuilder.
KeywordsTranscriptomics de novo Contigs Next generation sequencing Software Java Chimeras Haplotypes Non-chimeric Transcripts
With the advent of new sequencing technologies that have parallelized the way in which sequencing chemistry is performed ,, attempts have been made to gain new insight into previously unstudied transcriptomes at both an inter- and intra- species level ,. Prior to being utilized within transcriptomic studies however, the millions of short DNA fragments generated, termed reads, must to be assembled into longer contiguous sequences that are representative of the underlying transcripts present within the transcriptome. Despite recent advances in transcriptome assembly tools - and their application to a wide range of research areas including the characterization of diversity within viral populations, plants, mice and humans -, accurately reconstructing transcript diversity within complex multi-isoform transcriptomes has remained a significant challenge -. Assembling snake venom gland transcriptomes, a complex multi-isoform toxin cocktail arising from ancestral gene duplication events and divergent evolution -, exemplifies this challenge. Although linked by common ancestry, there is significant sequence variation within toxin families, such as the snake venom metalloproteinases (SVMP), C-type lectins, serine proteases (SP), phosoholipase A2s, bradykinin potentiating peptides and three-finger neurotoxins which often manifest in functionally distinct properties -. Sequence and functional diversity is complicated further at an inter- and intra-species level, as the expression of toxin isoforms is influenced by factors such as diet, habitat, sex, age and phylogeography -. Combined, this results in snake venom being complex - and it is this complexity that poses difficulties for current transcriptome assembly tools. Reconstructing accurate toxin transcripts is important because venom gland transcriptomes are a critical resource for the development of improved snakebite therapies ,.
Current assembly tools implement algorithms largely based around two different approaches ,. In the first, reads are aligned to positions within reference transcripts to which they are most similar. This is termed mapping. This results in scaffold-like alignments from which networks representing sequence variation are constructed. Paths across these networks are used to construct transcripts that represent the diversity present within the transcriptome. However, for many transcriptomes including snake venom gland transcriptomes, complete reference datasets rarely exist. When they do, being derived from low coverage Sanger/EST studies ,-, they may not have captured the full extent of variation within the transcriptome being studied. In this case, new transcripts cannot be discovered using a reference based approach as reads with insufficient similarity to sequences within the reference dataset will be discarded. Conversely, reads that are less divergent from transcripts within the reference dataset are more likely to map . Thus, the extent of divergence between venom gland transcriptomes even at an intra-species level ,,, will result in a biased loss of read data during mapping. This in turn, will result in a decreased accuracy in the estimation of transcript expression; even when mapping to a transcriptome from the nearest available species as a pseudo-reference dataset.
To resolve problems associated with the lack of a suitable reference transcriptome de novo based assembly can be applied. This usually involves the construction of de Bruijn networks that represent clusters of diversity, e.g. individual protein families within the data . On these networks nodes represent short sequence fragments, called k-mers, which are derived from reads, while edges represent shared identity between k-mers. These networks encompass all of the diversity present with the read data and traversals are used to construct transcripts. However, in the presence of isoform variation, maintaining non-chimeric paths across the subsequently complex networks becomes difficult ,. This is because a rise in diversity increases the number of nodes, which increases the combinatorials involved in path traversal. Distinguishing chimeric from non-chimeric paths is difficult as chimeras are in effect artificial recombinants generated between the true isoforms and, despite having superficial resemblance to true isoforms, relationships between co-evolving sites, functional motifs and other evolutionary factors are not maintained. This is due of the introduction of breakpoints within chimeras that are solely an artefact of the assembly process and not as a result of transcriptome evolution. Thus, resolving the true evolutionary relationship between transcripts becomes difficult. Long k-mers are often used to aid this task ,, but success is not guaranteed ,.
Producing non-chimeric transcripts is essential if the resolving power of next generation sequence (NGS) data is to be used to dissect the evolutionary dynamics within complex transcriptomes with no available reference. At the time of writing we are unaware of any freely, or otherwise, available software that makes this possible. We benchmark the accuracy of our software, against a current popular de novo assembler, Trinity ,, which implements a method to traverse multiple de Bruijn graphs. In our analysis we used read data simulated from 54 known venom gland Sanger sequenced transcripts (SSTs) representing isoforms of the most frequent and diverse viper venom gland protein families. Using VTBuilder, over 90% of SSTs were accurately reconstructed from the simulated reads into transcripts sharing 99% or greater sequence similarity with one of the known SSTs, compared with only 25% recovered using Trinity. Following this, we benchmarked accuracy and performance of VTBuilder by constructing transcripts from 2.5 million paired end Illumina MiSeq reads sequenced from the venom gland of the African puff adder, Bitis arietans. This is the first assembly of an NGS-derived snake venom gland transcriptome using a new tool to overcome the inclusion of chimeric transcripts that typically confound the interpretation of multi-isoform venom gland transcriptomes. The correct assembly of transcripts is an important step towards the realization of the full potential that NGS technology has to offer in resolving the biological complexity of highly variable transcriptomes.
The overall aim is to broadly capture transcript diversity by building a set of guide sequences from the read data and then to use these guides as a template to assist in the more accurate assembly of transcripts in a manner similar to reference based assembly ,. To achieve this, our software implements six steps schematically represented in Figure 2A.
(i.) Partitioning: This involves grouping reads into partitions (or clusters) that broadly reflect the protein diversity present within the data (i.e. at the protein family level not individual isoforms). Partitioning is done using an approach that is similar to seed based clustering . In the latter a read is randomly selected to seed (start) a partition and subsequent read inclusion to the partition is dependent on similarity to this read. We modified the approach to include reads derived from different positions on the same underlying transcript and thus share little sequence similarity. In VTBuilder a partition is initiated by randomly selecting a read from the input forward reads. This read is termed the partition seed. All remaining forward reads are searched and added to the partition, and removed from the input set, if they share a region of similarity (70% identity across 100 bases) to this seed. The 100 base window size allows for positional flexibility between the reads and the seed, i.e. reads are allowed to match to either side of the seed. Once complete, up to 12 reads are randomly selected from the reads just added, and a new search of the unpartitioned reads is initiated against these. Selecting reads added in the previous cycle reduces redundancy in the search. As cycles continue, the partition is expanded until no new reads can be added. At this point a new read is randomly selected from unallocated reads and used to seed a new partition and the process repeats. The partitioning step finishes when there are no unallocated reads left. Partitioning results in groups of unassembled reads that are directly related by transcript or indirectly related by protein family. Partitions containing 3 or more reads progress to step 2. Reads within smaller partitions are not used within step 2 but are reintroduced during step 3 (mapping) along with all other input data. Thus, no reads are permanently discarded from the pipeline at this point.
(ii.) De novo: Here guide sequences, which will be used as templates for subsequent mapping, are constructed from partitions. This is done using a greedy overlap method of assembly. Within each partition a read is randomly selected to initiate guide construction. We call this read the growing guide sequence (GGS). The remaining reads within the partition are searched against the GGS. If a read with high similarity to the GGS is identified (98% or greater similarity across a 100 nucleotide window) it is joined to the GGS, using the region of similarity as an anchor point. If this results in an extension to the GGS then the read is removed from the partition and the joined sequence replaces the current GGS. A new search against all the remaining reads in the partition is then initiated. This process iterates until a search against all remaining reads does not result in an extension to the GGS. If the partition still contains reads, a new guide is then initiated by selecting a random read and the process repeats. Thus, a single partition can result in more than one guide sequence. Once guide sequences have been created from all partitions a final de novo step is performed in order to join any partial guide sequences.
(iii.) Mapping: During mapping all input reads are aligned against the guide sequence that they are most similar, to at positions that minimize diversity. In VTBuilder, we used a mapping algorithm that we previously developed to map read data containing high amounts of variation. The algorithm uses short fragments (10 bases in length), termed k-mers, extracted from each guide sequence to form a library containing k-mer positional information for each guide. This library is then compared with k-mers derived from individual reads in order to find the most probable location for each read on the guide sequence to which it is most closely related. This standard k-mer indexing approach is described in detail in ,. This results in alignment-like structures, termed scaffolds, where reads are positionally correct to each other and to the guide sequence to which they are most related (Figure 2B, i).
The next two steps are designed to minimize chimeric transcripts by retaining as much diversity and positional information as possible within networks (step iv) whilst ensuring only the most robust (non-chimeric) paths are traversed to become transcripts (step v).
(iv.) Networking: Here we represent each scaffold-like alignment produced in (iii) as a graph-based structure by transforming isoform-specific differences (diversity) in alignments into nodes and edges, that are subsequently traversed and assembled into finished transcripts in step (v). Non-overlapping neighbouring windows of pre-defined size (see below) are first defined across the scaffold-like alignment (Figure 2B, i) after which reads spanning each window are extracted, truncated (Figure 2B, ii) and clustered using hamming distance (Figure 2B, iii). Clusters are represented as nodes on the network. Window size is calculated as one third of the minimum read length (user defined) thus ensuring that any three neighboring windows, and their subsequent nodes, have the potential to contain different regions of the same physical read. This information is used during network traversal in step (v). Edges are placed based on this physical linkage, where any two connected nodes physically share at least two reads, albeit different regions on these reads (Figure 2B, iv). Within the software, the minimum read length is limited to 120 bases to ensure there is sufficient sequence information within each window to cluster based on diversity. This step constructs multiple networks where nodes represent regions of diversity in reads mapped to the guide sequences.
(v.) Constructing Transcripts: A final list of assembled transcripts is outputted by traversing the networks created in (iv). Importantly for the reduction of chimeric transcripts, the physical linkage of reads between adjacent node triplets guides traversals i.e., with the exception of the first two nodes added to a path, a node will only be added if it contains read fragments that are physically linked to fragments present within the two previously added nodes constructed in (iv) from 3 neighboring windows. This ensures that each individual path is a traversal through nodes containing reads derived from a single isoform within the underlying data, and is the key step in limiting chimeric paths. Paths are initiated for each cluster of diversity present within the first window. For each path initiated, the addition of a second node is dependent on reads overlapping with the first. If paired end reads are available they are used to confirm paths. For each read on a path an attempt is made to map its pair. If less than 30% of the pairs map then the path is discarded.
(vi.) Expression: Calculation of relative transcript expression is achieved by remapping all the input reads to the finished transcripts. The expression level for a single transcript is taken as the number of reads mapping to that transcript normalized by the length of the transcript. These are outputted on the transcript titles as a percent relative to all other transcripts.
Results and discussion
Case study 1: simulated transcriptome assembly
The 54 known SSTs used to seed the simulation of reads as described in case study 1
No. of isoforms
1600 - 2000
1600 - 2300
700 - 1400
500 - 700
Cytochrome C Oxidase
Poly A Binding Protein
Protein Disulfide Isomerase
Next, we assessed the accuracy of transcript reconstruction by evaluating the level of sequence similarity between the 54 original SSTs and the transcripts constructed by both VTBuilder and Trinity. 50 of the 55 transcripts constructed by VTBuilder matched 48 of the SSTs with a similarity of 99% or greater. 53 transcripts matched 51 SSTs with a similarity of 95% or greater while 54 transcripts matched 53 of the SSTs with a similarity of 90% or greater. In comparison only 14 of the 112 transcripts constructed by Trinity matched 11 of the SSTs with a similarity of 99% or greater. 16 transcripts matched 13 SSTs with a similarity of 95% or greater. This remained unchanged at 90% similarity. Of the transcripts assembled from both software that matched the SSTs with a similarity of 90% or more, the length distributions of those produced by VTBuilder were more similar to the SSTs than those produced by Trinity (Figure 3B). These similarity and length distributions suggest that VTBuilder produces longer and more accurate transcripts than Trinity when run on the simulated reads and a thus a more comprehensive and accurate reconstruction of the original SSTs.
To further refine our understanding of the multi-isoform assembly process, we investigated whether reconstructed transcripts for both Trinity and VTBuilder displayed a one-to-one sequence similarity relationship with the original SSTs or whether chimeric assemblies producing many-to-many relationships existed. Within individual protein families containing multiple isoform variants, a many-to-many relationship would indicate a failure to distinguish between different isoforms. Reconstructed transcripts and original SSTs were used as nodes on a network where edges represent a sequence similarity of 90% or more. When the 54 transcripts (grey) constructed by VTBuilder were placed on a network along with the 53 SSTs that they matched (colors, see key), they largely displayed a one-to-one relationship (Figure 3C). Node size is proportional to sequence length further demonstrating that VTBuilder was capable of reconstructing transcripts of virtually identical composition and length as each original SSTs in comparison to Trinity where shorter, nearly exact matches of local similarity were more typical of the dataset (Figure 3D).
To investigate the effects of sequence error on VTBuilder performance we repeated our analysis using the same 54 SSTs but with a per site error rate introduced within each simulated dataset. At the per site mismatch error rates of around 0.2% typical of Illumina technology , VTBuilder constructed a total of 53 transcripts, 50 of which retained a greater than 90% similarity to the SST sequences with typically one-to-one relationships (Additional file 2: Figure S2). This level of accuracy in transcript reconstruction was maintained up to a high per site error rate of 1%, beyond which the total number of transcripts constructed increases as does the discrepancy between the number of VTBuilder transcripts sharing a 90% similarity with the SSTs. Introducing higher levels of artificial variation into the population, such as a 2% per site error rate, will result in at least 99.35% of the reads containing on average 5 errors across the 250 bases . This level of diversity is sufficient for VTBuilder to recognise transcripts as separate isoforms and leads to the sudden and expected rise in transcripts and drop in accordance with SSTs (Additional file 2: Figure S2).
Taken together, the results of our simulations indicate that VTBuilder can reconstruct transcripts that are highly similar both in length and sequence composition to the 54 input SSTs. The software can also accurately reconstruct transcripts when faced with a higher than expected degree of sequencing error.
Case study 2: assembly of a snake venom gland transcriptome from NGS data
The 101 unique toxin transcripts recovered by VTBuilder from reads sequenced from the venom gland of Bitis arietans (column 1) and the overall percentage of the toxin DNA that they make up within the transcriptome
% of toxin transcripts
# of unique transcripts
SVMP + DIS
We have demonstrated that transcripts constructed using VTBuilder accurately represent the variation present within venom gland transcriptomes. Unlike other approaches, our algorithm strives to maintain the relationships between factors such as to co-evolving sites and recombinant breakpoints within the underlying transcripts. VTBuilder has the potential to increase the usability of transcript sequences generated from read data across a wide range of research areas including; the detection of drug resistant variants within viruses and other disease causing parasites, where co-evolving sites confers resistance to particular classes of drugs -; the monitoring of disease progression, where variation across a range of sites can be indicative of progression and pathological outcome ,-; plant biology, where it has proven difficult to reconstruct full length transcripts representing complex transcript populations derived from genomes where polyploidy is present ,; and reconstructing accurate evolutionary relationships on phylogenetic trees, and in detecting recombinant breakpoints, where the usage of long non-chimeric transcripts is essential. We have made the source code for VTBuilder available from https://code.google.com/p/vt-builder/ where researchers from a wide range of backgrounds can access and develop it for their own requirements. Finally, we consider VTBuilder as an important progression towards the full utilization of the potential that NGS data offers. This is because highlighting the problem of chimeric sequence assembly, as well as having a proposed solution, will begin to reduce the number of such sequences being deposited within public data repositories which will have a positive impact on future studies querying such sources.
Availability and requirements
Executable jar file and user guide is available from: http://www.lstmed.ac.uk/vtbuilder
Googlecode home page (source code) : https://code.google.com/p/vt-builder/
Operating system(s) : Platform independent
Programming language : Java
Other requirements :
A Java runtime environment must be is installed. This is available from the Oracle website at: http://www.oracle.com/technetwork/java/javase/downloads/java-se-jre-7-download-432155.html. The tool is designed to run on a high spec desktop. We developed and tested it on a single processor Intel Xeon E2687W workstation equipped with 32GB of RAM, 16 cores and running Biolinux 7 . We have tested both real world and simulated data on Biolinux 7 (Ubuntu 12.04) running Open JDK IcedTea v1.13.4, where the real world data described in case study 2 took just over 4 h to assemble, and simulated datasets on Biolinux 8 (Ubuntu 14.04) running Open JDK IcedTea v2.5.1. We have also tested simulated datasets on on Mac OS × 10.7.5 running java 1.7.0_09.
License : GPL GPU V0.3.
Availability of supporting data
Simulated read data used in Case Study 1 along with the corresponding 54 seed sequences (Table 1) are available at: http://www.lstmed.ac.uk/vtbuilder. The Bitis arietans read data presented in Case Study 2 is available on request from the authors.
JA and GW were funded by the Leverhulme Trust, grant codes RPG-377 and RPG-2012-627 respectively. NC is a NERC Research Fellow at the Liverpool School of Tropical Medicine (LSTM). We would also like to thank Paul Rowley for expert herpetological assistance.
- Bennett S: Solexa Ltd. Pharmacogenomics. 2004, 5 (4): 433-438. 10.1517/14622422.214.171.1243.View ArticlePubMedGoogle Scholar
- Droege M, Hill B: The genome Sequencer FLX system–longer reads, more applications, straight forward bioinformatics and more complete data sets. J Biotechnol. 2008, 136 (1–2): 3-10. 10.1016/j.jbiotec.2008.03.021.View ArticlePubMedGoogle Scholar
- Durban J, Juarez P, Angulo Y, Lomonte B, Flores-Diaz M, Alape-Giron A, Sasa M, Sanz L, Gutierrez JM, Dopazo J, Conesa A, Calvete JJ: Profiling the venom gland transcriptomes of Costa Rican snakes by 454 pyrosequencing. BMC Genomics. 2011, 12: 259-10.1186/1471-2164-12-259.View ArticlePubMed CentralPubMedGoogle Scholar
- Rokyta DR, Wray KP, Margres MJ: The genesis of an exceptionally lethal venom in the timber rattlesnake (Crotalus horridus) revealed through comparative venom-gland transcriptomics. BMC Genomics. 2013, 14: 394-10.1186/1471-2164-14-394.View ArticlePubMed CentralPubMedGoogle Scholar
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011, 29 (7): 644-652. 10.1038/nbt.1883.View ArticlePubMed CentralPubMedGoogle Scholar
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer ML, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437 (7057): 376-380.PubMed CentralPubMedGoogle Scholar
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I: ABySS: a parallel assembler for short read sequence data. Genome Res. 2009, 19 (6): 1117-1123. 10.1101/gr.089532.108.View ArticlePubMed CentralPubMedGoogle Scholar
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.View ArticlePubMed CentralPubMedGoogle Scholar
- Zhang T, Luo Y, Chen Y, Li X, Yu J: BIGrat: a repeat resolver for pyrosequencing-based re-sequencing with Newbler. BMC Res Notes. 2012, 5: 567-10.1186/1756-0500-5-567.View ArticlePubMed CentralPubMedGoogle Scholar
- Katz Y, Wang ET, Airoldi EM, Burge CB: Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat Methods. 2010, 7 (12): 1009-1015. 10.1038/nmeth.1528.View ArticlePubMed CentralPubMedGoogle Scholar
- Li JJ, Jiang CR, Brown JB, Huang H, Bickel PJ: Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc Natl Acad Sci U S A. 2011, 108 (50): 19867-19872. 10.1073/pnas.1113972108.View ArticlePubMed CentralPubMedGoogle Scholar
- Li W, Jiang T: Transcriptome assembly and isoform expression level estimation from biased RNA-Seq reads. Bioinformatics. 2012, 28 (22): 2914-2921. 10.1093/bioinformatics/bts559.View ArticlePubMed CentralPubMedGoogle Scholar
- Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010, 28 (5): 511-515. 10.1038/nbt.1621.View ArticlePubMed CentralPubMedGoogle Scholar
- Kalra S, Puniya BL, Kulshreshtha D, Kumar S, Kaur J, Ramachandran S, Singh K: De novo transcriptome sequencing reveals important molecular networks and metabolic pathways of the plant chlorophytum borivilianum. PLoS One. 2013, 8 (12): e83336-10.1371/journal.pone.0083336.View ArticlePubMed CentralPubMedGoogle Scholar
- Eriksson N, Pachter L, Mitsuya Y, Rhee SY, Wang C, Gharizadeh B, Ronaghi M, Shafer RW, Beerenwinkel N: Viral population estimation using pyrosequencing. PLoS Comput Biol. 2008, 4 (4): e1000074-10.1371/journal.pcbi.1000074.View ArticlePubMed CentralPubMedGoogle Scholar
- Prabhakara S, Malhotra R, Acharya R, Poss M: Mutant-bin: unsupervised haplotype estimation of viral population diversity without reference genome. J Comput Biol. 2013, 20 (6): 453-463. 10.1089/cmb.2012.0174.View ArticlePubMedGoogle Scholar
- Charuvaka A, Rangwala H: Evaluation of short read metagenomic assembly. BMC Genomics. 2011, 12 (Suppl 2): S8-10.1186/1471-2164-12-S2-S8.View ArticlePubMed CentralPubMedGoogle Scholar
- Mundry M, Bornberg-Bauer E, Sammeth M, Feulner PG: Evaluating characteristics of de novo assembly software on 454 transcriptome data: a simulation approach. PLoS One. 2012, 7 (2): e31410-10.1371/journal.pone.0031410.View ArticlePubMed CentralPubMedGoogle Scholar
- Kumar S, Blaxter ML: Comparing de novo assemblers for 454 transcriptome data. BMC Genomics. 2010, 11: 571-10.1186/1471-2164-11-571.View ArticlePubMed CentralPubMedGoogle Scholar
- Casewell NR, Wüster W, Vonk FJ, Harrison RA, Fry BG: Complex cocktails: the evolutionary novelty of venoms. Trends Ecol Evol. 2013, 28 (4): 219-229. 10.1016/j.tree.2012.10.020.View ArticlePubMedGoogle Scholar
- Fry BG, Roelants K, Champagne DE, Scheib H, Tyndall JD, King GF, Nevalainen TJ, Norman JA, Lewis RJ, Norton RS, Renjifo C, de la Vega RC: The toxicogenomic multiverse: convergent recruitment of proteins into animal venoms. Annu Rev Genomics Hum Genet. 2009, 10: 483-511. 10.1146/annurev.genom.9.081307.164356.View ArticlePubMedGoogle Scholar
- Fry BG, Wuster W: Assembling an arsenal: origin and evolution of the snake venom proteome inferred from phylogenetic analysis of toxin sequences. Mol Biol Evol. 2004, 21 (5): 870-883. 10.1093/molbev/msh091.View ArticlePubMedGoogle Scholar
- Fry BG, Scheib H, van der Weerd L, Young B, McNaughtan J, Ramjan SF, Vidal N, Poelmann RE, Norman JA: Evolution of an arsenal: structural and functional diversification of the venom system in the advanced snakes (Caenophidia). Mol Cell Proteomics. 2008, 7 (2): 215-246. 10.1074/mcp.M700094-MCP200.View ArticlePubMedGoogle Scholar
- Brust A, Sunagar K, Undheim EA, Vetter I, Yang DC, Casewell NR, Jackson TN, Koludarov I, Alewood PF, Hodgson WC, Lewis RJ, King GF, Antunes A, Hendrikx I, Fry BG: Differential evolution and neofunctionalization of snake venom metalloprotease domains. Mol Cell Proteomics. 2013, 12 (3): 651-663. 10.1074/mcp.M112.023135.View ArticlePubMed CentralPubMedGoogle Scholar
- Casewell NR, Wagstaff SC, Harrison RA, Renjifo C, Wüster W: Domain loss facilitates accelerated evolution and neofunctionalization of duplicate snake venom metalloproteinase toxin genes. Mol Biol Evol. 2011, 28 (9): 2637-2649. 10.1093/molbev/msr091.View ArticlePubMedGoogle Scholar
- Lynch VJ: Inventing an arsenal: adaptive evolution and neofunctionalization of snake venom phospholipase A2 genes. BMC Evol Biol. 2007, 7: 2-10.1186/1471-2148-7-2.View ArticlePubMed CentralPubMedGoogle Scholar
- Markland FS, Swenson S: Snake venom metalloproteinases. Toxicon. 2013, 62: 3-18. 10.1016/j.toxicon.2012.09.004.View ArticlePubMedGoogle Scholar
- Sunagar K, Jackson TN, Undheim EA, Ali SA, Antunes A, Fry BG: Three-fingered RAVERs: rapid accumulation of variations in exposed residues of snake venom toxins. Toxins. 2013, 5 (11): 2172-2208. 10.3390/toxins5112172.View ArticlePubMed CentralPubMedGoogle Scholar
- Vaiyapuri S, Thiyagarajan N, Hutchinson EG, Gibbins JM: Sequence and phylogenetic analysis of viper venom serine proteases. Bioinformation. 2012, 8 (16): 763-772. 10.6026/97320630008563.View ArticlePubMed CentralPubMedGoogle Scholar
- Dias GS, Kitano ES, Pagotto AH, Sant’anna SS, Rocha MM, Zelanis A, Serrano SM: Individual variability in the venom proteome of juvenile Bothrops jararaca specimens. J Proteome Res. 2013, 12 (10): 4585-4598. 10.1021/pr4007393.View ArticlePubMedGoogle Scholar
- Barlow A, Pook CE, Harrison RA, Wüster W: Coevolution of diet and prey-specific venom activity supports the role of selection in snake venom evolution. Proceedings Biological Sciences/The Royal Society. 2009, 276 (1666): 2443-2449. 10.1098/rspb.2009.0048.View ArticlePubMed CentralPubMedGoogle Scholar
- Furtado MF, Travaglia-Cardoso SR, Rocha MM: Sexual dimorphism in venom of Bothrops jararaca(Serpentes: Viperidae). Toxicon. 2006, 48 (4): 401-410. 10.1016/j.toxicon.2006.06.005.View ArticlePubMedGoogle Scholar
- Saad E, Curtolo Barros L, Biscola N, Pimenta DC, Barraviera SR, Barraviera B, Seabra Ferreira R: Intraspecific variation of biological activities in venoms from wild and captive Bothrops jararaca. J Toxicol Environ Health A. 2012, 75 (16–17): 1081-1090. 10.1080/15287394.2012.697839.View ArticlePubMedGoogle Scholar
- Antunes TC, Yamashita KM, Barbaro KC, Saiki M, Santoro ML: Comparative analysis of newborn and adult Bothrops jararaca snake venoms. Toxicon. 2010, 56 (8): 1443-1458. 10.1016/j.toxicon.2010.08.011.View ArticlePubMedGoogle Scholar
- Chippaux JP, Williams V, White J: Snake venom variability: methods of study, results and interpretation. Toxicon. 1991, 29 (11): 1279-1303. 10.1016/0041-0101(91)90116-9.View ArticlePubMedGoogle Scholar
- Daltry JC, Wüster W, Thorpe RS: Diet and snake venom evolution. Nature. 1996, 379 (6565): 537-540. 10.1038/379537a0.View ArticlePubMedGoogle Scholar
- Sasa M: Diet and snake venom evolution: can local selection alone explain intraspecific venom variation?. Toxicon. 1999, 37 (2): 249-252. 10.1016/S0041-0101(98)00121-4. author reply 253–260View ArticlePubMedGoogle Scholar
- Casewell NR, Wagstaff SC, Wüster W, Cook DA, Bolton FM, King SI, Pla D, Sanz L, Calvete JJ, Harrison RA: Medically important differences in snake venom composition are dictated by distinct postgenomic mechanisms. Proc Natl Acad Sci U S A. 2014, 111 (25): 9205-9210. 10.1073/pnas.1405484111.View ArticlePubMed CentralPubMedGoogle Scholar
- Castro EN, Lomonte B, Del Carmen GM, Alagon A, Gutierrez JM: Intraspecies variation in the venom of the rattlesnake Crotalus simus from Mexico: Different expression of crotoxin results in highly variable toxicity in the venoms of three subspecies. J Proteomics. 2013, 87: 103-121. 10.1016/j.jprot.2013.05.024.View ArticlePubMedGoogle Scholar
- Moura-da-Silva AM, Furlan MS, Caporrino MC, Grego KF, Portes-Junior JA, Clissa PB, Valente RH, Magalhaes GS: Diversity of metalloproteinases in Bothrops neuwiedi snake venom transcripts: evidences for recombination between different classes of SVMPs. BMC Genet. 2011, 12: 94-10.1186/1471-2156-12-94.View ArticlePubMed CentralPubMedGoogle Scholar
- Lomonte B, Tsai WC, Urena-Diaz JM, Sanz L, Mora-Obando D, Sanchez EE, Fry BG, Gutierrez JM, Gibbs HL, Sovic MG, Calvete JJ: Venomics of New world pit vipers: genus-wide comparisons of venom proteomes across Agkistrodon. J Proteomics. 2013, 96C: 103-116.Google Scholar
- Casewell NR, Harrison RA, Wüster W, Wagstaff SC: Comparative venom gland transcriptome surveys of the saw-scaled vipers (Viperidae: Echis) reveal substantial intra-family gene diversity and novel venom transcripts. BMC Genomics. 2009, 10: 564-10.1186/1471-2164-10-564.View ArticlePubMed CentralPubMedGoogle Scholar
- Harrison RA, Cook DA, Renjifo C, Casewell NR, Currier RB, Wagstaff SC: Research strategies to improve snakebite treatment: challenges and progress. J Proteomics. 2011, 74 (9): 1768-1780. 10.1016/j.jprot.2011.06.019.View ArticlePubMedGoogle Scholar
- Wagstaff SC, Laing GD, Theakston RD, Papaspyridis C, Harrison RA: Bioinformatics and multiepitope DNA immunization to design rational snake antivenom. PLoS Med. 2006, 3 (6): e184-10.1371/journal.pmed.0030184.View ArticlePubMed CentralPubMedGoogle Scholar
- Martin JA, Wang Z: Next-generation transcriptome assembly. Nat Rev Genet. 2011, 12 (10): 671-682. 10.1038/nrg3068.View ArticlePubMedGoogle Scholar
- Miller JR, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics. 2010, 95 (6): 315-327. 10.1016/j.ygeno.2010.03.001.View ArticlePubMed CentralPubMedGoogle Scholar
- Francischetti IM, My-Pham V, Harrison J, Garfield MK, Ribeiro JM: Bitis gabonica (Gaboon viper) snake venom gland: toward a catalog for the full-length transcripts (cDNA) and proteins. Gene. 2004, 337: 55-69. 10.1016/j.gene.2004.03.024.View ArticlePubMed CentralPubMedGoogle Scholar
- Kashima S, Roberto PG, Soares AM, Astolfi-Filho S, Pereira JO, Giuliati S, Faria M, Xavier MA, Fontes MR, Giglio JR, Franca SC: Analysis of Bothrops jararacussu venomous gland transcriptome focusing on structural and functional aspects: I–gene expression profile of highly expressed phospholipases A2. Biochimie. 2004, 86 (3): 211-219. 10.1016/j.biochi.2004.02.002.View ArticlePubMedGoogle Scholar
- Wagstaff SC, Harrison RA: Venom gland EST analysis of the saw-scaled viper, Echis ocellatus, reveals novel alpha9beta1 integrin-binding motifs in venom metalloproteinases and a new group of putative toxins, renin-like aspartic proteases. Gene. 2006, 377: 21-32. 10.1016/j.gene.2006.03.008.View ArticlePubMedGoogle Scholar
- Archer J, Rambaut A, Taillon BE, Harrigan PR, Lewis M, Robertson DL: The evolutionary analysis of emerging low frequency HIV-1 CXCR4 using variants through time–an ultra-deep approach. PLoS Comput Biol. 2010, 6 (12): e1001022-10.1371/journal.pcbi.1001022.View ArticlePubMed CentralPubMedGoogle Scholar
- Yang Y, Smith SA: Optimizing de novo assembly of short-read RNA-seq data for phylogenomics. BMC Genomics. 2013, 14: 328-10.1186/1471-2164-14-328.View ArticlePubMed CentralPubMedGoogle Scholar
- Cahais V, Gayral P, Tsagkogeorga G, Melo-Ferreira J, Ballenghien M, Weinert L, Chiari Y, Belkhir K, Ranwez V, Galtier N: Reference-free transcriptome assembly in non-model animals from next-generation sequencing data. Mol Ecol Resour. 2012, 12 (5): 834-845. 10.1111/j.1755-0998.2012.03148.x.View ArticlePubMedGoogle Scholar
- Bao E, Jiang T, Kaloshian I, Girke T: SEED: efficient clustering of next-generation sequences. Bioinformatics. 2011, 27 (18): 2502-2509.PubMed CentralPubMedGoogle Scholar
- Archer J, Baillie G, Watson SJ, Kellam P, Rambaut A, Robertson DL: Analysis of high-depth sequence data for studying viral diversity: a comparison of next generation sequencing platforms using Segminator II. BMC Bioinformatics. 2012, 13: 47-10.1186/1471-2105-13-47.View ArticlePubMed CentralPubMedGoogle Scholar
- Casewell NR, Wagstaff SC, Harrison RA, Wüster W: Gene tree parsimony of multilocus snake venom protein families reveals species tree conflict as a result of multiple parallel gene loss. Mol Biol Evol. 2011, 28 (3): 1157-1172. 10.1093/molbev/msq302.View ArticlePubMedGoogle Scholar
- Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, Thompson JD, Gibson TJ, Higgins DG: Clustal W and Clustal X version 2.0. Bioinformatics. 2007, 23 (21): 2947-2948. 10.1093/bioinformatics/btm404.View ArticlePubMedGoogle Scholar
- Field D, Tiwari B, Booth T, Houten S, Swan D, Bertrand N, Thurston M: Open software for biologists: from famine to feast. Nat Biotechnol. 2006, 24 (7): 801-803. 10.1038/nbt0706-801.View ArticlePubMedGoogle Scholar
- Beerenwinkel N, Zagordi O: Ultra-deep sequencing for the analysis of viral populations. Curr Opin Virol. 2011, 1 (5): 413-418. 10.1016/j.coviro.2011.07.008.View ArticlePubMedGoogle Scholar
- Conesa A, Gotz S, Garcia-Gomez JM, Terol J, Talon M, Robles M: Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 2005, 21 (18): 3674-3676. 10.1093/bioinformatics/bti610.View ArticlePubMedGoogle Scholar
- Cardozo T, Kimura T, Philpott S, Weiser B, Burger H, Zolla-Pazner S: Structural basis for coreceptor selectivity by the HIV type 1 V3 loop. AIDS Res Hum Retroviruses. 2007, 23 (3): 415-426. 10.1089/aid.2006.0130.View ArticlePubMedGoogle Scholar
- Merker M, Kohl TA, Roetzer A, Truebe L, Richter E, Rusch-Gerdes S, Fattorini L, Oggioni MR, Cox H, Varaine F, Niemann S: Whole genome sequencing reveals complex evolution patterns of multidrug-resistant Mycobacterium tuberculosis Beijing strains in patients. PLoS One. 2013, 8 (12): e82551-10.1371/journal.pone.0082551.View ArticlePubMed CentralPubMedGoogle Scholar
- Thai H, Campo DS, Lara J, Dimitrova Z, Ramachandran S, Xia G, Ganova-Raeva L, Teo CG, Lok A, Khudyakov Y: Convergence and coevolution of hepatitis B virus drug resistance. Nat Commun. 2012, 3: 789-10.1038/ncomms1794.View ArticlePubMed CentralPubMedGoogle Scholar
- Ganeshan S, Dickover RE, Korber BT, Bryson YJ, Wolinsky SM: Human immunodeficiency virus type 1 genetic evolution in children with different rates of development of disease. J Virol. 1997, 71 (1): 663-677.PubMed CentralPubMedGoogle Scholar
- Miura M, Maekawa S, Takano S, Komatsu N, Tatsumi A, Asakawa Y, Shindo K, Amemiya F, Nakayama Y, Inoue T, Sakamoto M, Yamashita A, Moriishi K, Enomoto N: Deep-sequencing analysis of the association between the quasispecies nature of the hepatitis C virus core region and disease progression. J Virol. 2013, 87 (23): 12541-12551. 10.1128/JVI.00826-13.View ArticlePubMed CentralPubMedGoogle Scholar
- Shankarappa R, Margolick JB, Gange SJ, Rodrigo AG, Upchurch D, Farzadegan H, Gupta P, Rinaldo CR, Learn GH, He X, Huang XL, Mullins JI: Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J Virol. 1999, 73 (12): 10489-10502.PubMed CentralPubMedGoogle Scholar
- Toure O, Konate S, Sissoko S, Niangaly A, Barry A, Sall AH, Diarra E, Poudiougou B, Sepulveda N, Campino S, Rockett KA, Clark TG, Thera MA, Doumbo O: Candidate polymorphisms and severe malaria in a Malian population. PLoS One. 2012, 7 (9): e43987-10.1371/journal.pone.0043987.View ArticlePubMed CentralPubMedGoogle Scholar
- Wagner KW, Ye Y, Lin J, Vaporciyan AA, Roth JA, Wu X: Genetic variations in epigenetic genes are predictors of recurrence in stage I or II non-small cell lung cancer patients. Clin Cancer Res. 2012, 18 (2): 585-592. 10.1158/1078-0432.CCR-11-2087.View ArticlePubMed CentralPubMedGoogle Scholar
- Campbell PJ, Pleasance ED, Stephens PJ, Dicks E, Rance R, Goodhead I, Follows GA, Green AR, Futreal PA, Stratton MR: Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing. Proc Natl Acad Sci U S A. 2008, 105 (35): 13081-13086. 10.1073/pnas.0801523105.View ArticlePubMed CentralPubMedGoogle Scholar
- Weber AP, Weber KL, Carr K, Wilkerson C, Ohlrogge JB: Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiol. 2007, 144 (1): 32-42. 10.1104/pp.107.096677.View ArticlePubMed CentralPubMedGoogle Scholar
- Wei L, Xiao M, Hayward A, Fu D: Applications and challenges of next-generation sequencing in Brassica species. Planta. 2013, 238: 1005-1024. 10.1007/s00425-013-1961-6.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.