ARKS Algorithm
Given a draft genome assembly and a set of linked reads, ARKS produces a scaffold graph in which nodes represent contigs and directed edges represent inferred order of probably adjacent contig pairs within the genome. Briefly, ARKS infers the edges of the graph by determining the Chromium barcodes associated with the 5′-end (head) and 3′-end (tail) sequences of each contig, and connecting all contig head/tail pairs that share more than a minimum number of barcodes. Once ARKS has produced the scaffold graph, downstream scaffolding algorithms such as LINKS can be used to navigate the graph and translate high-confidence paths into scaffold sequences. The principal challenge of the ARKS algorithm is to efficiently identify all probable contig pairs within the draft genome assembly, along with their relative head/tail orientations.
First, for the sake of argument, we make a clear distinction between the terms “alignment” and “mapping”: while the former refers to inferring the relative coordinates of bases between a query and a target, the latter is about determining if the query sequence hits a particular target. The ARKS algorithm proceeds in three steps: (1) mapping Chromium linked barcodes to contigs using kmers, (2) scoring candidate contig pairs, and (3) generating the output scaffold graph with estimated distances. In the first step, we map the Chromium reads to the draft assembly sequences (hereinafter inclusively termed contigs referring to contigs and scaffolds for simplicity and clarity) in order to build a set of barcode-to-contig mappings (Fig. 1a), which are used in the second step of the algorithm. More precisely, we build a mapping from each barcode to a list of contig head/tail regions, hereafter referred to as contig ends. To reduce memory usage and noise in later steps of the algorithm, we mask the interior sequences of the contigs, and leave only a fixed-length sequence (parameter -e, default 30 kbp) at the head and tail of the contig sequence as targets for mapping. To map the reads, we first index the contig ends by shredding the sequences into kmers, and by loading them into a hash table that associates kmers and contig ends. Then, for each read, we find the best-matching contig end by shredding the read into kmers, hashing the kmers to identify candidate contig ends, and finally selecting the contig end with the largest fraction of kmers overlapping the read using the scoring function
$$ {score}_j\left( contig, read\right)=\frac{\left| kmers(contig)\cap kmers(read)\right|}{\left| kmers(read)\right|} $$
where the operator |·| refers to set cardinality. To reduce spurious mappings, we require a contig end to match a minimum fraction of the read kmers (parameter -j, 0.55 by default) before it is considered as a candidate.
To further increase the specificity of the read mappings, we also require both reads of a pair to map to the same target, and discard kmers having multiple contig end memberships. For each read pair that is successfully mapped, we store its barcode in a hash table that maps barcodes to contig ends. This hash table is the input for the second step of the algorithm.
In the second step, we iterate over the hash table we built, in order to identify and score candidate contig end pairs (Fig. 1b). Iterating over the keys (barcodes), we obtain a list of contig ends that match each barcode. Then, for each possible pairing of contig ends within the list, we increment a counter that tracks the number of shared barcodes for the pair. These counts are stored in a second hash table that maps pairs of contig ends to counts, and is used as the input for the third and final step of the algorithm.
In the third step, we generate the edges of the scaffold graph by iterating over the shared barcode counts for each candidate pair of contig ends (Fig. 1c). All pairs of contig ends with a count greater than a minimum threshold (parameter -l, 0 by default) are output as an edge of the output graph. It is possible that we will obtain non-zero barcode counts linking the same pair of contigs in multiple orientations, where the possible orientations are head-to-head, head-to-tail, tail-to-head, and tail-to-tail. This can occur, for example, if a long DNA molecule spans the full length of a contig, causing the same barcode to map to both ends. In the case of multiple candidate orientations, we perform a binomial test with the null hypothesis that the two orientations with the highest shared barcode counts are equally likely. If the test generates a p-value less than a user-defined threshold (parameter -r, 0.05 by default), we output the edge with the orientation corresponding to the highest shared barcode count; otherwise, we omit the edge from the graph.
In generating the scaffold graph, ARKS also uses the barcode data to estimate distances between neighbouring contigs. These constraints are potentially useful for downstream assembly and scaffolding algorithms, in order to identify correct paths. In general, pairs of contig ends that are more distant will share fewer barcodes; however, the parameters of this relationship are dependent on the sequencing coverage and the length distribution of the underlying long molecules. In order to build an empirical model of the barcodes-to-distance relationship, we first train our algorithm using the distances between head and tail regions within given contigs. For each input contig with length greater than two times the head/tail length (parameter –e defined above), we record both the distance between the head and tail regions and the Jaccard index for the number of shared barcodes
$$ J\left(x,y\right)=\frac{\left| barcodes(x)\cap barcodes(y)\right|}{\left| barcodes(x)\cup barcodes(y)\right|} $$
where x and y are adjacent, linked contig sequences. For a typical empirical relationship between the Jaccard index and the distance, see Additional file 1: Figure S1. Next, we estimate the distance between the neighbouring contigs by taking the median of intra-contig distance samples with the B closest Jaccard indices (parameter –B, 20 by default).
$$ D\left(x,y\right)= median\left\{D\left({x}_i,{y}_i\right)\Big\Vert \underset{i_1,{i}_2,\dots, {i}_B}{\mathrm{argmin}}\sum \left|\ J\left(x,y\right)-J\left({x}_i,{y}_i\right)\right|\right\} $$
Data sources
Human Chromium linked reads were obtained for two individuals’ cell lines, NA24143 and NA12878. The NA24143 reads were downloaded from the Genome in a Bottle (GIAB) FTP site, while the NA12878 reads were downloaded from the 10xG company website (Additional file 1: Table S1). The C. elegans Chromium reads were simulated from the Bristol N2 Strain reference using LRSim (v1.0) [16] at 50-fold coverage using parameters -x 17 -f 50 -t 20 -m 10, as recommended for genomes of this size range (~100Mbp) (Additional file 1: Table S2). The NA24143 Falcon assembly and its Hi-C/HiRise-scaffolded counterpart were obtained from the GIAB FTP site (Additional file 1: Table S1).
Data analysis
The raw Chromium reads from C. elegans as well as the NA12878 human individual reads were processed using Long Ranger (v2.1.3) [17] to extract the barcodes from the reads. The barcodes of the NA24143 reads were already extracted in the downloaded file. For ARCS, fragScaff, and Architect, the Long Ranger output reads required additional read formatting to append the barcode to the read header in the form “header_ < barcode>”. Following this formatting step, the reads were aligned to the corresponding draft genome using BWA mem (v0.7.12) [18], setting -t 8. ARKS does not require similar read formatting as it reads barcodes directly from the BX tag of the Long Ranger output.
For the C. elegans scaffolding runs, the raw simulated 10xG reads were assembled using Supernova (v1.2.0) [2] to produce the draft assembly. Both ARKS and ARCS (v1.0.0) [3] were run with parameters -c 8 -z 500 -m 8–10,000 -e 30,000 and with LINKS (v1.8.5) [15] parameters -l 5 and -a 0.3, 0.5. A sweep of values for -k was tested for ARKS, which was run with eight threads (−t 8).
While running fragScaff (v140324) we also used eight threads (−t 8), and varied the end node size (parameter -E), the mean number of links per bin (parameter -j), and the score multiplier (parameter -u), as these are suggested to be the most influential scaffolding parameters [4]. We used -E 30000 to be consistent with ARKS and ARCS, but also performed runs with -E 13508 because it is half of the N80 of the draft assembly, which is approximately the recommended end node size [4]. We used -j 1 -u 2 to test stringency and improve quality, -j 6.5 -u 2.5 because Adey et al. [4] note that these are the optimal scaffolding parameters for Drosophila melanogaster, which has a slightly larger genome than C. elegans (175 Mbp vs 100 Mbp), and -j 3 -u 2 as an intermediate parameter setting.
For the Architect (v0.1) scaffolding runs, the parameters --rc-abs-thr, --rc-rel-edge-thr, and --rc-rel-prun-thr control the thresholds for creating and pruning a read-cloud edge, respectively, and are suggested to be the most influential parameters [5]. For these C. elegans runs, we set --rc-abs-thr 5, and vary --rc-rel-edge-thr (0.2, 0.3) and --rc-rel-prun-thr (0.1, 0.2) to provide a comprehensive sweep.
Raw 10xG Chromium data from the NA12878 individual was assembled using Supernova (v1.2.0) [2]. The draft Supernova assembly was scaffolded using both ARKS and ARCS with parameters -c 5 -e 30,000 -z 3000 -r 0.05 and -m 50–6000. We kept the LINKS parameter -l constant at 5, but varied a = 0.3, 0.5, in decreasing stringency. We ran ARKS with a range of –k values (60 to 100, step 20), and with eight threads (−t 8). For the fragScaff runs, we set -E 30000 to correspond to -e in ARKS/ARCS and used eight threads (−t 8). We tested -j 1 -u 2 to ensure high stringency, -j 1.75 -u 2.5 because Adey et al. [4] found these values to be optimal, and -j 3 -u 2.5 because they were the optimal values for a baseline N50 length of 47 kbp. For the Architect runs, we ran all combinations of --rc-abs-thr (10, 5), --rc-rel-edge-thr (0.2, 0.3, 0.4), and --rc-rel-prun-thr (0.1, 0.2), to provide a comprehensive sweep of parameters.
The NA24143 Falcon and Falcon-HiRise assemblies were corrected using Tigmint [19], which uses Chromium molecule coverage to identify putative misassemblies (parameters w = 2000, n = 2). The NA24143 assemblies were then scaffolded using ARKS with parameters -c 5, −g 1, −j 0.5, −z 3000, −e 30,000, −r 0.05, −t 8, −m 50–1000, and LINKS parameter -l 5. We parameterized the values of both -k and -a in distinct scaffolding runs.
To compare the actual distances between merged scaffolds with the distances estimated by ARKS, the C. elegans and NA12878 Supernova assemblies were scaffolded with ARKS using the gap distance estimation option (−D) (parameters -c 8 -z 500 -m 8–10,000 -e 30,000 -k 60 -a 0.5 -j 0.5 and -c 5 -e 30,000 -z 3000 -m 50–6000 -k 60 -a 0.5 -j 0.5 for the two genomes, respectively). Actual distances between scaffolds were derived from alignments of the scaffolds to the appropriate reference genomes.
Misassemblies in all genome drafts were assessed using QUAST (v5.0.0, --scaffolds, --scaffold-gap-max-size 100,000) [20]. All benchmarking was completed using a DELL server with 128 Intel(R) Xeon(R) CPU E7–8867 v3, 2.50GHz with 2.6 TB RAM.