SequelQC: Analyzing PacBio Sequel Raw Sequence Quality

Summary: PacBio sequencing is a valuable third-generation DNA sequencing method because of its very long read lengths, its ability to detect methylated bases, and its real-time sequencing methodology. Yet to date no tractable program exists for analyzing the quality of PacBio Sequel raw sequence data. Here we present SequelQC, a bash tool that quickly processes PacBio Sequel raw sequence data from multiple SMRTcells, producing multiple statistics and publication-quality plots that describe the quality of the data, including N50, read length and count statistics, polymerase-to-subread ratio (PSR), and ZMW occupancy ratio (ZOR).
Availability and implementation: SequelQC is implemented in bash, R, and Python and is freely available at https://github.com/ISUgenomics/SequelQC
Contact: davehuf@iastate.edu or arnstrm@iastate.edu
Supplementary information: Supplementary data are available at bioRxiv online.


Introduction
The third generation of sequencing is here and is making a tremendous impact on the sequencing field. These long-read sequencing platforms are under active development, pushing the boundaries of total output, read length, sequencing time, cost reduction, and read accuracy (Schadt et al., 2010; Goodwin et al., 2016). One example is the PacBio Sequel platform, the most widely used long-read sequencing platform, which uses Single Molecule Real Time (SMRT) sequencing technology (Rhoads and Au, 2015; Goodwin et al., 2016). Compared with second-generation approaches, PacBio sequencing can provide much longer reads, in much less time, with greatly reduced content bias, the ability to distinguish between methylated and unmethylated bases, and almost as much accuracy (Schadt et al., 2010; Ross et al., 2013; Rhoads and Au, 2015; Goodwin et al., 2016). Due to improvements in data formats and in the technology itself, earlier base quality programs written for PacBio RSII data are no longer valid for benchmarking PacBio Sequel data. Currently, the only program that provides quality assessment for Sequel raw sequence data is the instrumentation software itself, SMRT Link, a Linux-only, computationally intensive web tool to which the user must upload data files one at a time. Furthermore, SMRT Link can only be installed by root users, requires the installation of 23 external programs to run, and generates non-downloadable plots only after a web server has been set up (https://www.pacb.com/support/software-downloads/). The development of a fast, easy-to-install-and-use, third-party program to assess raw sequence quality is therefore crucial for the genomic community.
bioRxiv preprint doi: https://doi.org/10.1101/611814. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC 4.0 International license.
Here we present SequelQC, an efficient and user-friendly program that calculates multiple standardized statistics and creates publication-quality plots describing the quality of raw PacBio Sequel data.

Implementation and Usage
To create tables and plots summarizing the quality of PacBio Sequel data, SequelQC uses standard libraries and packages within bash, R, and Python. PacBio sequence files include both subreads files, which contain the reads of interest, and scraps files, which contain additional reads generated during the sequencing process. SequelQC can be run with or without scraps files. With scraps files SequelQC takes longer to run, but it also produces more plots, and more analysis within the standard plots, using additional information concerning continuous long reads (CLRs).
The main script is written in bash. It calls samtools (Li et al., 2009) and awk to convert BAM files to SAM format and extract only the needed information. Python is then used to make all necessary calculations, producing intermediate data files that are passed to R. Python 2 or 3 can be used, and the version is determined automatically. By default these intermediate data files are deleted at the end of the program's operation, but they are retained if the user selects the appropriate parameters.
During the calculations, reads are organized into up to four read groups: 1) subreads, 2) longest subreads, 3) CLRs, and 4) subedCLRs (CLRs containing subreads). When scraps are included, the default is to use all four read groups, but the user can request only two groups if preferred: 1) subreads and 2) subedCLRs. Alternatively, if scraps are not included, the two read groups are 1) subreads and 2) longest subreads. Final plots and tables are produced in R, including a table of summary statistics, which can be viewed easily in Microsoft Excel, as well as several publication-quality PDF plots. A subset of these plots is included as Supplemental Figures to this manuscript.
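The grouping step for the case without scraps files can be illustrated with a minimal Python sketch. It is not the SequelQC source code: the function names `parse_sam_lengths` and `build_read_groups` are hypothetical, and the sketch assumes that subread names follow the PacBio `movie/ZMW/qStart_qEnd` convention and that the input is SAM text such as that produced by `samtools view`.

```python
from collections import defaultdict


def parse_sam_lengths(sam_lines):
    """Yield (zmw_id, read_length) for every record in SAM-formatted text."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        qname, seq = fields[0], fields[9]  # SAM columns 1 (QNAME) and 10 (SEQ)
        # Assumption: read names look like movie/holeNumber/qStart_qEnd
        movie, hole = qname.split("/")[:2]
        yield movie + "/" + hole, len(seq)


def build_read_groups(sam_lines):
    """Return (subread_lengths, longest_subread_lengths) without scraps data."""
    per_zmw = defaultdict(list)
    for zmw, length in parse_sam_lengths(sam_lines):
        per_zmw[zmw].append(length)
    # Read group 1: every subread; read group 2: longest subread per ZMW
    subreads = [n for lengths in per_zmw.values() for n in lengths]
    longest = [max(lengths) for lengths in per_zmw.values()]
    return subreads, longest


if __name__ == "__main__":
    # Three made-up records: two subreads from ZMW 10, one from ZMW 22
    example = [
        "@HD\tVN:1.5\tSO:unknown",
        "m1/10/0_4\t4\t*\t0\t255\t*\t*\t0\t0\tACGT\t!!!!",
        "m1/10/50_58\t4\t*\t0\t255\t*\t*\t0\t0\tACGTACGT\t!!!!!!!!",
        "m1/22/0_6\t4\t*\t0\t255\t*\t*\t0\t0\tACGTAC\t!!!!!!",
    ]
    subreads, longest = build_read_groups(example)
    print(sorted(subreads))  # [4, 6, 8]
    print(sorted(longest))   # [6, 8]
```

In practice the lines could be read from standard input, for example from the output of `samtools view` on a subreads BAM file. Keeping only the read name and sequence length mirrors the idea of extracting only the needed information before the calculations.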
The summary statistics table includes information for all chosen read groups for each SMRTcell. Statistics include number of reads, total bases, mean and median read length, N50, L50, PSR, and ZOR. PSR is the polymerase-to-subread ratio and is calculated as follows: total bases from the longest subreads per CLR divided by the total bases from subreads. ZOR is the zero-mode waveguide (ZMW) occupancy ratio and is calculated as follows: the number of CLRs with subreads divided by the number of subreads.
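As an illustration of these statistics, the following Python sketch computes N50 and L50 under their usual definitions and PSR and ZOR exactly as defined above. The function names and the example numbers are hypothetical and are not taken from SequelQC itself.

```python
def n50_l50(lengths):
    """Return (N50, L50): the read length at which the cumulative bases of
    the longest reads first reach half of the total bases, and the number
    of reads needed to get there."""
    if not lengths:
        raise ValueError("no read lengths given")
    half = sum(lengths) / 2.0
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= half:
            return length, count


def psr(longest_subread_lengths, subread_lengths):
    """PSR as defined in the text: total bases from the longest subreads
    per CLR divided by the total bases from all subreads."""
    return float(sum(longest_subread_lengths)) / sum(subread_lengths)


def zor(num_clrs_with_subreads, num_subreads):
    """ZOR as defined in the text: the number of CLRs with subreads
    divided by the number of subreads."""
    return float(num_clrs_with_subreads) / num_subreads


if __name__ == "__main__":
    print(n50_l50([2, 3, 4, 5, 6]))           # (5, 2)
    print(round(psr([8, 6], [4, 8, 6]), 3))   # 0.778
    print(round(zor(2, 3), 3))                # 0.667
```

In the first example the total is 20 bases, so the half-way point of 10 bases is first reached by the two longest reads (6 + 5 = 11), giving an N50 of 5 and an L50 of 2.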
While the summary statistics table is always produced, the user can request more or fewer plots based on

Conclusion
SequelQC is an easy-to-install-and-use program that calculates key statistics and generates publication-quality plots for raw PacBio Sequel data. SequelQC provides all standard metrics for overall sequence quality, including N50, read length and count statistics, PSR, and ZOR. SequelQC can evaluate eight SMRTcells from the PacBio human genome dataset in about 30 min on our HPC system. Other than the proprietary PacBio SMRT Link program, which requires the user to set up a web server to install, no other program is currently available to compute these statistics. We therefore conclude that SequelQC is the only reasonable choice for most users of PacBio Sequel sequencing data.
