SeqAssist: a novel toolkit for preliminary analysis of next-generation sequencing data
© Peng et al.; licensee BioMed Central Ltd. 2014
Published: 21 October 2014
While next-generation sequencing (NGS) technologies are rapidly advancing, an area that lags behind is the development of efficient and user-friendly tools for preliminary analysis of massive NGS data. As an effort to fill this gap to keep up with the fast pace of technological advancement and to accelerate data-to-results turnaround, we developed a novel software package named SeqAssist ("Sequencing Assistant" or SA).
SeqAssist takes NGS-generated FASTQ files as the input, employs the BWA-MEM aligner for sequence alignment, and aims to provide a quick overview and basic statistics of NGS data. It consists of three separate workflows: (1) the SA_RunStats workflow generates basic statistics about an NGS dataset, including numbers of raw, cleaned, redundant and unique reads, redundancy rate, and a list of unique sequences with length and read count; (2) the SA_Run2Ref workflow estimates the breadth, depth and evenness of genome-wide coverage of the NGS dataset at a nucleotide resolution; and (3) the SA_Run2Run workflow compares two NGS datasets to determine the redundancy (overlapping rate) between the two NGS runs. Statistics produced by SeqAssist or derived from SeqAssist output files are designed to inform the user: whether, what percentage, how many times and how evenly a genomic locus (i.e., gene, scaffold, chromosome or genome) is covered by sequencing reads, how redundant the sequencing reads are in a single run or between two runs. These statistics can guide the user in evaluating the quality of a DNA library prepared for RNA-Seq or genome (re-)sequencing and in deciding the number of sequencing runs required for the library. We have tested SeqAssist using a synthetic dataset and demonstrated its main features using multiple NGS datasets generated from genome re-sequencing experiments.
SeqAssist is a useful and informative tool that can serve as a valuable "assistant" to a broad range of investigators who conduct genome re-sequencing, RNA-Seq, or de novo genome sequencing and assembly experiments.
High throughput next-generation sequencing (NGS) technologies are capable of generating massive amounts of data in the form of paired-end or single-end reads with either fixed or variable lengths. The size of data files is often in the magnitude of mega- or giga-bytes (up to 1000 giga base pairs or Gb in a single sequencing run) and is likely to increase further in the years to come. While sequencing costs have dropped precipitously and sequencing speed and efficiency have risen exponentially, development of computational tools for preliminary analysis of these gigantic datasets have lagged behind data generation. Hence, there is an increasing demand for efficient and user-friendly programs for preliminary sequencing data analysis.
Currently, there are four commercially predominant NGS platforms, including Illumina/Solexa, Roche/454, ABI/SOLiD and ABI/Ion Torrent [1, 2]. These massively parallel DNA sequencing technologies have been applied to transcriptome sequencing (RNA-Seq), de novo genome sequencing, and genome re-sequencing. RNA-Seq is a widely used approach to transcriptomic profiling [3, 4]. Two representative efforts in de novo genome sequencing are the Genome 10K project to obtain whole genome sequences for 10,000 vertebrate species [5–7] and the 5K Insect Genome Initiative (i5K) to sequence the genomes of 5,000 arthropod species [8, 9]. Genome re-sequencing is an experimental procedure that involves sequencing individual organisms whose genome is already known . As a new genomics approach, genome re-sequencing has been applied to a wide range of fundamental and applied biological research including genetics, evolution, biomedicine, human diseases and environmental health, with good examples being the 1000 Genomes Project  and the Cancer Genomes project .
Prior to in-depth analysis of NGS deep sequencing data, e.g., differential gene expression and alternative splicing analysis for RNA-Seq studies, structural variants identification for genome re-sequencing studies, and genome assembly for de novo genome sequencing studies, investigators are often concerned about the following issues: (1) basic statistics of a sequencing run such as total numbers of raw, cleaned, and unique reads as well as the degree of reads redundancy; (2) sequencing library quality, i.e., does the library truly represent the genome of the re-sequenced organism, and (3) the number of sequencing runs required, i.e., how many runs are necessary to get a full representation of the sequencing library or to suffice a de novo genome assembly. Although many tools such as Partek Genomics Suite, CLC Genomic Workbench, or noncommercial platforms like Galaxy and GenePattern are currently available and capable of indirectly addressing these issues, one would have to possess some basic bioinformatics training and script writing skills in order to manipulate and turn the generated results into useful and straightforward information that can be easily understood by an experimentalists. Motivated by filling this gap, the limitations of existing tools, and also driven by the demand for accelerating data-to-results turnaround, we have developed a novel toolkit named SeqAssist ("Sequencing Assistant", acronym: SA). SeqAssist specifically addresses the aforementioned three issues and provides investigators who conduct RNA-Seq, de novo genome sequencing or genome re-sequencing experiments with a quick overview and preliminary analysis of their NGS data.
SA_Run2Run compares two separate sequencing datasets generated for the same or different DNA libraries, computes the basic statistics for each individual dataset, and estimates the redundancy rate between the two datasets. SA_Run2Run informs the user of the redundancy level both within each individual run and between two sequencing runs. Like the SA_RunStats workflow (Figure 1a), each input run dataset in the SA_Run2Run workflow (Figure 1c) independently goes through the same preprocessing, self-alignment and removing redundant reads steps to generate two new datasets containing unique cleaned sequences. Then, the two new datasets are aligned against each other using BWA-MEM, generating a new SAM file. After combining all aligned reads, reads common to both datasets (i.e., overlapping reads) are identified, the counts of redundant reads (identical or inclusive) are calculated for both overlapping and non-overlapping reads. The output statistics from SA_Run2Run include total numbers of raw reads, cleaned reads, and unique reads (after removing identical reads and inclusive reads), and numbers of total and unique overlapping reads. The redundancy rates within each dataset and between the two datasets can be further derived from these statistics. Similar to the SA_RunStats output, a list of unique sequences along with their length and count number is provided for each run. However, different from the SA_RunStats output, the list generated by SA_Run2Run is broken into two files, one containing overlapping reads and the other non-overlapping reads. To compare two paired-end sequencing runs, one has to run this workflow twice: Run1_R1.fastq vs. Run2_R1.fastq and Run1_R2.fastq vs. Run2_R2.fastq. The SA_Run2Run workflow intends to guide the user in deciding whether to perform more runs on a sequencing library by looking at the percentage of reads in a new run covered by the reads in a previous run or the pooled reads of multiple previous runs.
Results and discussion
To test all SeqAssist workflows, a synthetic dataset was generated by (1) clipping 10 distinct fragments with a length of 150 bp at different loci of the Escherichia coli str. K-12 substr. MG1655l genome (NCBI Reference Sequence Accession No. NC_000913.3, available at http://www.ncbi.nlm.nih.gov/nuccore/556503834?report=fasta) to construct 10 artificial chromosomes, (2) clipping 10 sequences of 75-100 bp in length from each artificial chromosome, and (3) repeating each sequence 10 times. These steps resulted in a dataset of 1,000 reads and a reference genome consisting of 10 short artificial chromosomes, both of which were used to test the SA_RunStats and SA_Run2Ref workflows. The synthetic dataset was further split in two halves to create Run 1 and Run 2 that were used to test the SA_Run2Run workflow. SeqAssist output results for the synthetic dataset are in agreement with expected results (see files in the SeqAssist software package for details), validating the scripts coded for all three workflows.
Here we demonstrate the applications of SeqAssist to preliminary analysis of multiple experimental NGS datasets. As the SA_RunStats workflow is an integral part of the SA_Run2Run workflow, we focus on the SA_Run2Ref and SA_Run2Run workflows in the following experiments. All the experiments were performed on a Dell M710 Blade server equipped with 283 GB of DDR3 memory at 1,066 MHz speed, an Intel® Xeon® E5630 CPU Quad-core that runs at 2.53 GHz, and two separate hard drives of 1.3 TB and 2.9 TB. The operating system was Red Hat Enterprise Linux Server release 6.3 (Santiago) using the CentOS 64-bit distribution.
Experimental NGS datasets
We selected multiple NGS datasets and two organisms of contrasting genomic complexity to demonstrate the features of SeqAssist. These datasets represented both fixed and variable length reads generated on Illumina and 454 sequencing platforms, respectively. The chosen organisms were the bacteria Escherichia coli, a prokaryote with a simple and small circular genome of 4.6 Mb in length , and the water flea Daphnia pulex, an eukaryote with a recently published draft genome consisting of 5,191 scaffolds with a total length of ca. 200 Mb .
Two datasets obtained from an E. coli genome sequencing project were downloaded from http://data.clovr.org/. One dataset (named Ecoli_454_500K) contains 500,000 shotgun 454 titanium sequences (variable length reads in SFF format), and the other (named Ecoli_I4M_R1 and Ecoli_I4M_R2) contains 4,000,000 paired-end shotgun Illumina sequences (2 × 49-bp fixed length reads in FASTQ format). The Bio.SeqIO module in Biopython (http://biopython.org/wiki/Main_Page) was employed to convert the SFF format to the FASTQ format for the 454 dataset by the command $ python -c "from Bio import SeqIO; SeqIO.convert('in.sff', 'sff', 'out.fastq', 'fastq');". The D. pulex datasets were collected in-house by repeatedly sequencing two libraries, each of which was prepared from genomic DNA isolated from an individual animal. One animal came from a population named ECT (acronym for "Environmental Consulting & Testing", the vendor from which it was obtained, Superior, WI, USA) and the other from another population named TCO (acronym for "The Chosen One", kindly donated by Dr. Norman D. Yan, York University, Toronto, ON, Canada).
To demonstrate the scalability of SeqAssist, we have also chosen a human genome re-sequencing dataset of the CEU HapMap individual NA12878 at a 15-fold coverage with an insert size of 300 bp and 3.6% duplicate reads. The dataset (SRA ID: ERX012406) consists of 7 paired-end Illumina Genome Analyzer IIx runs and is downloadable from NCBI's SRA database at http://www.ncbi.nlm.nih.gov/sra/?term=ERX012406. The reference diploid human genome (hg18) consists of 23 pairs of chromosomes or 3,234 Mb in total.
The SA_Run2Ref workflow
Basic statistics produced by SA_Run2Ref for two sequencing run datasets.
Illumina MiSeq runs (read length = 2 × 151 bp)
ECT + ECT_rerun
Total number of raw paired-end reads
Total number of cleaned reads
Total number of reads mapped to reference genome
Mapped/Cleaned reads (%)
Total number of scaffolds in reference genome
Number of covered reference scaffolds
Covered/Total scaffolds (%)
Genome coverage breadth (%)
Genome coverage depth
standard deviation of scaffold coverage depth
average scaffold coverage depth
Genome coverage evenness
Run time (min)
Sequencing datasets and genome mapping of the Daphnia pulex TCO library.
Mapped/Cleaned reads (%)
Run time (min)
Added run (multiplex, read length)
LF1 (36 ×, 2 × 151)
LF2 (36 ×, 2 × 251)
LF3 (36 ×, 2 × 251)
LF4 (36 ×, 2 × 251)
LF5 (6 ×, 2 × 251)
Large + Small
SF1 (36 ×, 2 × 151)
Large + Small
SF2 (36 ×, 2 × 151)
Large + Small
SF3 (36 ×, 2 × 151)
Large + Small
SF4 (36 ×, 2 × 151)
Large + Small
SF5 (36 ×, 2 × 151)
The SA_Run2Run workflow
The output statistics and derived statistics from running five pairs of NGS datasets through the SA_Run2Run workflow.
Total number of raw reads in the run
Total number of cleaned reads in the run
Number of unique reads in the run (after removing identical redundant reads)
Number of unique reads in the run (after removing identical & inclusive redundant reads)
Total number of overlapping reads in the run
Number of unique overlapping reads in the run
Number of unique overlapping reads from both runs
File size after preprocessing
Number of redundant cleaned reads in the run
Redundancy rate within the run
Total number of non-overlapping reads in the run
Number of unique non-overlapping reads in the run
Number of redundant non-overlapping reads in the run
Redundancy of non-overlapping reads in the run
Number of redundant overlapping reads in the run
Redundancy of overlapping reads in the run
Total overlapping reads/total cleaned reads (each run)
Total overlapping reads/total cleaned reads (both runs)
Total runtime (min)
With regard to redundant reads, a distinction is drawn between identical and inclusive reads. If reads are of the same length, there is no difference between these two types of redundant reads. Consequently, the number of unique overlapping reads in Run1 is the same as those in Run2 and both runs (see output statistics for Ecoli_I4M_R1 and Ecoli_I4M_R2 in Table 3 for examples). Preprocessing of Illumina datasets may cause length differentiation between cleaned reads. For datasets of reads with variable length, the number of unique reads after removing identical reads is higher than that after removing both types of redundant reads (see output statistics for Ecoli_454_500K, ECT and ECT_rerun datasets in Table 3 and Figure 5). As a result, the number of unique overlapping reads in Run1 may differ from that in Run2 and in both runs (see output statistics for ECT and ECT_rerun datasets in Table 3 and Figure 5).
Analysis of the human genome re-sequencing dataset
Runtime of the three SeqAssist workflows recorded when analyzing two paired-end human genome re-sequencing run data files of a CEU HapMap individual NA12878.
Run data file
We have demonstrated the main features of SeqAssist using multiple genome re-sequencing datasets. Output statistics from SeqAssist can guide the user in evaluating the quality of a DNA library prepared for genome re-sequencing and in deciding whether there is a need to perform additional sequencing runs on the library. Based on the low coverage breadth (66%) and the high reads mapping rate (88%) (Table 1), it appears that the ECT gDNA library may not be a good representation of the entire genome. The same holds true for the TCO library if considering its maximal breadth (62%, Figure 3) and mapping rate (84%, Table 2). In terms of the number of multiplexed sequencing runs required for the TCO library, five runs of the large fraction seemed to be sufficient because they reached the maximal coverage breadth, with a total raw reads number of 4.2 million.
In summary, the SA_RunStats workflow generates basic statistics about an NGS dataset, including numbers of raw, cleaned, redundant and unique reads, redundancy rate, and a list of unique sequences with length and read count. The SA_Run2Ref workflow estimates the breadth, depth and evenness of genome-wide coverage of the NGS dataset at a nucleotide resolution. The SA_Run2Run workflow compares two NGS datasets to determine the redundancy (overlapping rate) between the two NGS runs. Statistics produced by SeqAssist or derived from SeqAssist output files are designed to inform the user: whether, what percentage, how many times and how evenly a genomic locus (i.e., gene, scaffold, chromosome or genome) is covered by sequencing reads, how redundant the sequencing reads are in a single run or between two runs.
SeqAssist is a useful and informative tool that can serve as a valuable "assistant" to a broad range of investigators who conduct genome re-sequencing, RNA-Seq, or de novo genome sequencing and assembly experiments. For RNA-Seq experiments, SeqAssist output files that contain unique sequences along with their mapped genomic loci and copy numbers may be readily transformed into gene expression data. An investigator who de novo assembles a genome from sequencing data may use SeqAssist to map the original reads to the assembled genome and obtain a ratio of mapped to cleaned reads. As an additional parameter to existing metrics , this ratio can be used to objectively compare the quality of different assemblies made from the same sequencing data. For further in-depth analyses of NGS data, one is advised to use other appropriate tools available from the bioinformatics community. For instance, one may choose to apply spliced aligners such as RUM  and SpliceSeq  to identify splice junctions for alternative splicing detection of RNA-Seq data, or employ a structural variant (SV) discovery software such as BreakDancer , Pindel  and PRISM  to call SV events and discern breakpoints from genome re-sequencing data.
We plan to improve visualization features of SeqAssist in the future versions. Specifically, the nucleotide-resolution mapping and coverage depth (copy number) information generated from SA_Run2Ref shall be transformed into interactive visual graphics to allow the user to visualize gene coverage or expression levels.
Availability and requirements
♦ Project name: SeqAssist
♦ Project home page:http://orca.st.usm.edu/cbbl/seqassist/
♦ Operating systems: Linux (Ubuntu)
♦ Programming language: Perl v.5.14.2 with the following packages: Parallel::ForkManager and Getopt::Long
♦ License: Free for commercial and academic uses.
♦ Any restrictions to use by non-academics: None.
This work was supported by the U.S. Army Environmental Quality and Installation Technology Basic Research Program and the National Science Foundation under award number EPS 0903787. Permission to publish this information was granted by the Chief of Engineers.
Publication charges for this article were funded by the U.S. Army Environmental Quality and Installation Technology Basic Research Program.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 11, 2014: Proceedings of the 11th Annual MCBIOS Conference. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S11.
- Mardis ER: Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet. 2008, 9: 387-402. 10.1146/annurev.genom.9.081307.164359.View ArticlePubMedGoogle Scholar
- Mardis ER: Next-generation sequencing platforms. Annu Rev Anal Chem (Palo Alto Calif ). 2013, 6: 287-303. 10.1146/annurev-anchem-062012-092628.View ArticleGoogle Scholar
- Martin JA, Wang Z: Next-generation transcriptome assembly. Nat Rev Genet. 2011, 12: 671-682. 10.1038/nrg3068.View ArticlePubMedGoogle Scholar
- Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009, 10: 57-63. 10.1038/nrg2484.PubMed CentralView ArticlePubMedGoogle Scholar
- Bernardi G, Wiley EO, Mansour H, Miller MR, Orti G, Haussler D: The fishes of Genome 10K. Mar Genomics. 2012, 7: 3-6.View ArticlePubMedGoogle Scholar
- Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species. J Hered. 2009, 100: 659-674. Genome 10K Community of ScientistsGoogle Scholar
- Wong PB, Wiley EO, Johnson WE, Ryder OA, O'Brien SJ, Haussler D: Tissue sampling methods and standards for vertebrate genomics. Gigascience. 2012, 1: 8-10.1186/2047-217X-1-8.PubMed CentralView ArticlePubMedGoogle Scholar
- The i5K Initiative: advancing arthropod genomics for knowledge, human health, agriculture, and the environment. J Hered. 2013, 104: 595-600. i5K ConsortiumGoogle Scholar
- Levine R: i5k: The 5,000 Insect Genome Project. Am Entomol. 2011, 72: 110-113.View ArticleGoogle Scholar
- Stratton M: Genome resequencing and genetic variation. Nat Biotechnol. 2008, 26: 65-66. 10.1038/nbt0108-65.View ArticlePubMedGoogle Scholar
- Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE: An integrated map of genetic variation from 1,092 human genomes. Nature. 2012, 491: 56-65. 10.1038/nature11632.View ArticlePubMedGoogle Scholar
- Stephens PJ, Tarpey PS, Davies H, Van LP, Greenman C, Wedge DC: The landscape of cancer genes and mutational processes in breast cancer. Nature. 2012, 486: 400-404.PubMed CentralPubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25: 1754-1760. 10.1093/bioinformatics/btp324.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010, 26: 589-595. 10.1093/bioinformatics/btp698.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N: The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009, 25: 2078-2079. 10.1093/bioinformatics/btp352.PubMed CentralView ArticlePubMedGoogle Scholar
- Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M: The complete genome sequence of Escherichia coli K-12. Science. 1997, 277: 1453-1462. 10.1126/science.277.5331.1453.View ArticlePubMedGoogle Scholar
- Colbourne JK, Pfrender ME, Gilbert D, Thomas WK, Tucker A, Oakley TH: The ecoresponsive genome of Daphnia pulex. Science. 2011, 331: 555-561. 10.1126/science.1197761.PubMed CentralView ArticlePubMedGoogle Scholar
- Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience. 2013, 2: 10-10.1186/2047-217X-2-10.PubMed CentralView ArticlePubMedGoogle Scholar
- Grant GR, Farkas MH, Pizarro AD, Lahens NF, Schug J, Brunk BP: Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM). Bioinformatics. 2011, 27: 2518-2528.PubMed CentralPubMedGoogle Scholar
- Ryan MC, Cleland J, Kim R, Wong WC, Weinstein JN: SpliceSeq: a resource for analysis and visualization of RNA-Seq data on alternative splicing and its functional impacts. Bioinformatics. 2012, 28: 2385-2387. 10.1093/bioinformatics/bts452.PubMed CentralView ArticlePubMedGoogle Scholar
- Chen K, Wallis JW, McLellan MD, Larson DE, Kalicki JM, Pohl CS: BreakDancer: an algorithm for high-resolution mapping of genomic structural variation. Nat Methods. 2009, 6: 677-681. 10.1038/nmeth.1363.PubMed CentralView ArticlePubMedGoogle Scholar
- Ye K, Schulz MH, Long Q, Apweiler R, Ning Z: Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. Bioinformatics. 2009, 25: 2865-2871. 10.1093/bioinformatics/btp394.PubMed CentralView ArticlePubMedGoogle Scholar
- Jiang Y, Wang Y, Brudno M: PRISM:pair-read informed split-read mapping for base-pair level detection of insertion, deletion and structural variants. Bioinformatics. 2012, 28: 2576-2583. 10.1093/bioinformatics/bts484.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.