SLR: a scaffolding algorithm based on long reads and contig classification

Background: Scaffolding is an important step in genome assembly that orders and orients the contigs produced by assemblers. However, repetitive regions in contigs usually prevent scaffolding from producing accurate results. How to solve the problem of repetitive regions has received a great deal of attention. In the past few years, long reads sequenced by third-generation sequencing technologies (Pacific Biosciences and Oxford Nanopore) have been demonstrated to be useful for sequencing repetitive regions in genomes. Although some stand-alone scaffolding algorithms based on long reads have been presented, scaffolding still requires a new strategy to take full advantage of the characteristics of long reads.
Results: Here, we present a new scaffolding algorithm based on long reads and contig classification (SLR). Using the alignment information of long reads and contigs, SLR classifies the contigs into unique contigs and ambiguous contigs to address the problem of repetitive regions. Next, SLR uses only unique contigs to produce draft scaffolds. Then, SLR inserts the ambiguous contigs into the draft scaffolds and produces the final scaffolds. We compare SLR to three popular scaffolding tools by using long-read datasets sequenced with Pacific Biosciences and Oxford Nanopore technologies. The experimental results show that SLR can produce better results in terms of accuracy and completeness. The open-source code of SLR is available at https://github.com/luojunwei/SLR.
Conclusion: In this paper, we describe SLR, which is designed to scaffold contigs using long reads. We conclude that SLR can improve the completeness of genome assembly.


Background
With the increasing availability of third-generation sequencing technologies, which include Single-Molecule Real-Time (SMRT) technology from Pacific Biosciences and Nanopore-based technology from Oxford Nanopore, many biological applications have been greatly improved. Compared with second-generation sequencing technologies, third-generation sequencing technologies produce longer reads with a higher sequencing error rate [1]. In the field of de novo genome assembly, a large number of assembly tools based on third-generation sequencing technologies have been presented to resolve the most prominent problem: repetitive regions. However, producing a complete and accurate assembly is still a challenging task. Scaffolding is an important step in the pipeline of genome assembly, and aims to orient and order contigs [2,3]. Scaffolding generates scaffolds consisting of sequence fragments including oriented and ordered contigs. The gap between two adjacent contigs in a scaffold is filled with 'N' characters. Scaffolding can significantly increase the continuity of assembly results and benefit downstream analyses such as those of gene order, comparative or functional genomics and patterns of recombination [4].
*Correspondence: luojunwei@hpu.edu.cn. College of Computer Science and Technology, Henan Polytechnic University, 454000 Jiaozuo, China.
According to the kind of reads used for scaffolding, existing scaffolding tools generally fall into the following three categories: (i) Using paired reads for scaffolding. The insert size of paired reads can reach a few thousand bases, so this technique can partially resolve the problem of repetitive regions. Tools of this kind, such as OPERA [5], SSPACE [6], BESST [7], ScaffMatch [8], SCARPA [9], ScaffoldScaffolder [10], and BOSS [11], usually use greedy heuristic algorithms to generate scaffolds based on a scaffold graph, in which a vertex denotes a contig and an edge represents the existence of paired reads that can be separately aligned to the two corresponding contigs. However, because the length of reads from second-generation technologies is commonly only a few hundred bases, the reads can usually be aligned with two or more positions in the contigs. Moreover, the region between the paired reads is unknown, and there are sequencing errors in the reads. As a result, spurious edges are usually introduced into the scaffold graph, which complicates the scaffolding task. Obtaining more accurate and contiguous scaffolding results based on paired reads therefore remains difficult.
(ii) Using long reads for scaffolding. This kind of scaffolding tool usually aligns the long reads against contigs first and then finds contigs that can be aligned with the same long read. Then, these tools use the local alignment result to infer the global order and orientation of contigs. For instance, SSPACE-LongRead [12] first aligns whole long reads with contigs using the alignment tool BLASR [13]. Next, contig pairs and multi-contig linkage information are obtained and used to order and orient the contigs and generate scaffolds. LINKS [14] does not align the whole long reads to the contigs; it first extracts the k-mer pairs in an interval from long reads. Afterwards, these k-mer pairs are aligned to the contigs, and the alignment results are used to link the contigs. Finally, LINKS selects a neighbour of a contig as its correct neighbour based on the number of links. SMSC [15] first aligns the long reads to the contigs using either Nucmer [16] or BLASR and then constructs a breakpoint graph in which a vertex is a contig and an edge is added to indicate a long read bridging two vertices. It transforms the scaffolding problem to a maximum alternating path coverage problem in the breakpoint graph and resolves this problem using a 2-approximation algorithm. RAILS [17] scaffolds contigs with long reads using the scaffolding engine originally developed for SSAKE [18] and LINKS. Based on the sequencing coverage of each contig, npScarf [19] classifies contigs into unique contigs and repetitive contigs. npScarf first bridges the unique contigs and generates scaffolds based on a greedy strategy and then fills the gaps by repetitive contigs. However, most contig sets used for scaffolding do not include information on sequencing coverage, which limits the application of npScarf.
(iii) Using optical mapping data or Hi-C data for scaffolding. Optical mapping data can serve as a unique "fingerprint" or "barcode" for genome sequences. By comparing optical mapping data with a restriction enzyme map of the contigs, the order and orientation of contigs can be inferred. Supernova [20], Architect [21], ARCS [22] and fragScaff [23] attempt to find pairs of contigs based on linked reads. The problem with using optical mapping data is that a barcode used to locate contigs may have many different alignment positions, which usually causes contradictions between contigs. Hi-C data are commonly sequenced by paired-end sequencing. The paired reads come from the interacting fragments between genomic loci that are nearby in three-dimensional space but may be separated by many nucleotides in the linear genome. Scaffolding using Hi-C data is the most challenging method, as the genomic distance between a given Hi-C-based read pair is highly variable and may span a few kilobases to megabases without any direct indication of the true distance [1].
Although some scaffolding tools based on long reads have made great progress, two primary issues still require more attention. (i) Scaffold graph construction: In a scaffold graph, each vertex refers to a contig, and an edge is created between two vertices if the two contigs can be aligned with the same long read. Due to the repetitive regions in contigs and the high sequencing error rate of long reads, the scaffold graph usually becomes very complicated, which has negative effects on the later scaffolding steps. Hence, simplifying the scaffold graph is a significant goal for scaffolding. (ii) Edge weighting: In the scaffold graph, most current methods prefer to weight each edge by the number of long reads that can be aligned with two vertices simultaneously. However, the length of the alignment between a long read and a contig can reflect the confidence level of the alignment, which is usually ignored by existing methods.
When a long read links the two flanking regions of a repetitive region, the problem of the repetitive region can be resolved because the order and orientation of the two flanking regions can be obtained directly. Moreover, a repetitive region can usually be aligned with more than one long read, and its 5'-end (or 3'-end) neighbour regions in those reads are not the same. After aligning the long reads against the contigs, we can identify whether contigs are repetitive based on their alignment positions in the long reads. When constructing a scaffold graph, it is difficult to avoid spurious edges introduced by repetitive contigs and sequencing errors. We can identify spurious edges by detecting orientation and position contradictions in the scaffold graph [10,11]. Using only non-repetitive contigs to construct a scaffold graph not only reduces the complexity of the scaffold graph but also improves the accuracy of spurious edge detection.
In this paper, we present a scaffolding algorithm based on long reads and contig classification (SLR), which utilizes two new strategies to address the two issues above. For issue (i), SLR classifies the contigs into unique contigs and ambiguous contigs. SLR utilizes the unique contigs to construct a scaffold graph, which can decrease the complexity of the scaffold graph and simplify the following scaffolding steps. For issue (ii), SLR uses the alignment length to weight each edge in the scaffold graph. Moreover, SLR employs linear programming to detect and remove the contradictions in the scaffold graph, which guarantees that the scaffold graph includes only simple paths.
Based on these two new strategies, SLR determines the orientations and orders of the contigs. In experiments, SLR is compared with three popular scaffolding tools by scaffolding five long-read datasets with Pacific Biosciences and Oxford Nanopore technologies. The experimental results show that SLR produces better results in terms of accuracy and completion for most datasets.

Results
To evaluate the performance of SLR, we compared SLR with three popular scaffolding tools based on long reads, namely, SSPACE-LongRead (SSPACE-LR), LINKS and npScarf.

Datasets and metrics
Contig and long-read datasets for Escherichia coli (E. coli), Saccharomyces cerevisiae W303 (S. cerevisiae), and human chromosome X (Chr X) were used as input for all tools. For E. coli and S. cerevisiae, there are two long-read datasets, sequenced with Pacific Biosciences and Oxford Nanopore technologies, and two contig sets assembled by different assemblers. The long reads for Chr X are from Pacific Biosciences. The details of the long-read datasets are shown in Table 1. The contig sets, which were evaluated by QUAST [24], are shown in Table 2. These contig sets and long-read sets form the nine datasets shown in Table 3, which were used for scaffolding; each dataset includes one contig set and one long-read set. We named the nine datasets E. coli_1_SMRT, E. coli_2_SMRT, S. cerevisiae_1_SMRT, S. cerevisiae_2_SMRT, Chr X_1_SMRT, E. coli_1_ONT, E. coli_2_ONT, S. cerevisiae_1_ONT, and S. cerevisiae_2_ONT.
QUAST aligns the contigs (or scaffolds) to the reference genome and obtains some metrics. NG50 is the length of the longest contig (or scaffold) such that all the contigs (or scaffolds) of that length or longer cover at least half of the reference genome. N50 is the length of the longest contig (or scaffold) such that all the contigs (or scaffolds) of that length or longer cover at least half of the length of all contigs (or scaffolds). Misassemblies (Errors) is the number of positions (breakpoints) in the contigs or scaffolds in which errors (Translocation, Inversion, Relocation) occur. NGA50 is the NG50 of contigs or scaffolds after they have been broken at every breakpoint. Genome Fraction is the percentage of aligned bases in the reference genome. Usually, Misassemblies can represent the accuracy of the scaffolding result, and NGA50 and NA50 can reflect the completion and continuity of the scaffolding result. In the experiments below, we used QUAST to evaluate the scaffolding results for SSPACE-LR, LINKS, npScarf and SLR.
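The N50 and NG50 definitions above differ only in the total against which coverage is measured. The following is a minimal Python sketch of the two statistics; the function names are illustrative and are not part of QUAST.

```python
def nx50(lengths, total):
    """Return the largest length L such that sequences of length >= L
    together cover at least half of `total` (0 if they never do)."""
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0


def n50(lengths):
    # Coverage is measured against the total length of the sequences themselves.
    return nx50(lengths, sum(lengths))


def ng50(lengths, genome_size):
    # Coverage is measured against the reference genome size instead.
    return nx50(lengths, genome_size)
```

NGA50 would apply the same computation to the aligned blocks obtained after breaking the sequences at every misassembly breakpoint.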

Evaluations on nine datasets
The long-read sets of the first five datasets were obtained with SMRT technology, and those of the last four datasets were obtained with Nanopore technology. All the scaffolding tools were run on these nine datasets, and detailed evaluation results from QUAST are shown in Additional file 1: Tables S1 and S2. Because NGA50 and Misassemblies are two important metrics for evaluating scaffolding tools, we show NGA50 vs Misassemblies in Fig. 1. The best scaffolding result is found in the top-left corner of each panel. Except in Fig. 1(b) and Fig. 1(i), SLR is in the top-left corner throughout Fig. 1, which indicates that SLR has fewer Misassemblies and a higher NGA50. Although npScarf performs better in Fig. 1(b) and Fig. 1(i), the performance of SLR is close to it.

Running time and peak memory
Due to the high error rate in long reads, aligning long reads with contigs usually takes a long time. LINKS selects k-mer pairs from the long reads to link the contigs, which avoids long-read alignment. However, LINKS requires more memory to store the k-mer pairs. As shown in Table 4, LINKS consumes less time and more memory. SLR and npScarf have similar time consumption because both use BWA-MEM [25] to align long reads against contigs. In all experiments, npScarf allocates a large amount of memory regardless of the size of the dataset. When extracting alignment information from the BAM file, SLR keeps the alignments of one long read in memory and produces a local scaffold that is saved on the hard disk. After processing one long read, SLR processes the next long read, which reduces the memory requirement. Compared with the other tools, SSPACE-LR and SLR require less memory for scaffolding.
The contig sets for E. coli_1 and S. cerevisiae_1 are provided by [29], the contig sets for E. coli_2 and S. cerevisiae_2 are provided by [30], and the contig set for Chr X_1 is provided by [31].

Effectiveness of contig classification
To verify the effectiveness of the contig classification method presented in this paper, we removed the contig classification step from SLR and named the resulting algorithm SLR1. Then, we benchmarked SLR against SLR1 on all datasets. The scaffolding results for SLR and SLR1 are shown in Table 3. SLR performs better than SLR1 in terms of Misassemblies and NGA50, which indicates that the proposed contig classification method is effective. Next, we combined the contig classification method with other scaffolding tools. SLR classified each contig set into a unique contig set and an ambiguous contig set. We first ran SSPACE-LR and LINKS on the unique contig set, generating scaffolds. Then, we inserted the ambiguous contigs into the scaffolds. For this purpose, we must determine the order and orientation of the unique contigs in these scaffolds. BWA-MEM is used to align the unique contigs against these scaffolds.
An alignment is retained only if the unique contig is completely aligned within a scaffold. From the retained alignments, we obtain the order and orientation of the unique contigs in these scaffolds. The final scaffolding results are shown in Fig. 2. SSPACE-LR-CC represents the method based on SSPACE-LR combined with contig classification. LINKS-CC represents the method based on LINKS combined with contig classification. According to Fig. 2, SSPACE-LR-CC and LINKS-CC outperformed SSPACE-LR and LINKS in NGA50. This further supports the effectiveness of the contig classification method.
SSPACE-LR-CC outperformed SLR in NGA50 for E. coli_2_SMRT and Chr X_1_SMRT. For the remaining seven datasets, SLR performed better than SSPACE-LR-CC in NGA50. SLR performed better than LINKS-CC in NGA50 for all datasets. Meanwhile, SLR outperformed SSPACE-LR-CC and LINKS-CC in Misassemblies for all datasets.
The detailed evaluation results are provided in Additional file 1. Because npScarf builds a consensus sequence from contigs and long reads, it is difficult to identify the order of the unique contigs in its scaffolds; therefore, we did not use npScarf in this experiment. As noted above, each dataset includes one contig set and one long-read set and corresponds to one genome.

Evaluation using a repeat-aware evaluation framework
We also used a repeat-aware evaluation framework [26] to evaluate the performance of SSPACE-LR, LINKS, npScarf and SLR. For each original contig set, the framework aligns the contigs with the reference genome, splits contigs at misassembly events, and extracts repetitive sub-contigs from the original contigs. Then, it outputs a new contig set. The framework records the number of correct links, which is the number of correct contig joins. After a scaffolding tool runs on this new contig set and a long-read set, the framework computes the number of correctly predicted links. Therefore, we can compute precision, recall and F1-score for the scaffolding results. For the contig set of Chr X, the framework ran for more than one week without producing a new contig set. Hence, we processed only the remaining original contig sets. This yields eight new datasets for this experiment, which are named E. coli_1_SMRT_R, E. coli_2_SMRT_R, S. cerevisiae_1_SMRT_R, S. cerevisiae_2_SMRT_R, E. coli_1_ONT_R, E. coli_2_ONT_R, S. cerevisiae_1_ONT_R, and S. cerevisiae_2_ONT_R. The detailed evaluation results provided by the framework are shown in Additional file 1: Tables S9 and S10. In addition, for these new datasets, we also evaluated the scaffolding results with QUAST; the results are shown in Fig. 3. According to Fig. 3, SLR achieved the best NGA50 values for all the datasets. This experiment shows that SLR can identify repetitive contigs and mitigate the problem of repetitive regions.
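The precision, recall and F1-score mentioned above can be derived from the set of predicted links and the set of correct links. The sketch below is illustrative only: it assumes that a link is an unordered pair of contig names and ignores orientation and gap size, which the actual framework [26] may take into account.

```python
def link_scores(predicted, correct):
    """Return (precision, recall, F1) for predicted contig links.

    Both arguments are iterables of 2-tuples of contig names; a link is
    treated as an unordered pair (an assumption made for illustration).
    """
    predicted = {frozenset(p) for p in predicted}
    correct = {frozenset(c) for c in correct}
    true_positives = len(predicted & correct)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```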

Discussion
npScarf utilizes sequencing coverage to classify contigs. However, most contigs used for scaffolding do not include information about sequencing coverage, which limits the application of npScarf. SLR can classify contigs without any additional information about the contig set. SSPACE-LR uses a greedy heuristic strategy to determine the neighbour of a contig based on the number of long reads that can be aligned. LINKS uses a strategy similar to that of SSPACE-LR to determine the neighbours by counting the number of k-mer pairs between two contigs. These two tools have difficulty identifying the correct neighbours when encountering complex repetitive regions.

Conclusion
With the development of third-generation high-throughput sequencing technologies, scaffolding methods based on long reads have undergone substantial improvement. A scaffold graph is the basis for inferring the orders and orientations of contigs. However, the problems introduced by repetitive regions and sequencing errors pose challenges in the process of constructing scaffold graphs. In this paper, we presented a novel scaffolder, SLR, for determining the orientations and orders of contigs based on long reads and contig classification. SLR employs a new contig classification procedure to overcome the problems associated with repetitive regions in scaffolding. SLR first produces local scaffolds based on the alignment between long reads and contigs. A local scaffold corresponds to a long read and the contigs that can be aligned with it. SLR classifies contigs into unique and ambiguous contigs based on local scaffolds. A scaffold graph including only unique contigs is then constructed, which simplifies the scaffold graph and improves the accuracy of detecting and removing contradictions.

Fig. 2 Contig classification combined with SSPACE-LR and LINKS
Experiments were conducted that included long reads obtained with SMRT-based and Nanopore-based technologies. The experimental results illustrated that SLR is superior in terms of continuity and accuracy. For larger genomes, such as the complete human genome, however, SLR is difficult to scale due to its long run time.

Method
A contig set $C$ and a long-read set $LR$ are used as input data. The algorithm is composed of four steps: (i) producing local scaffolds; (ii) classifying contigs; (iii) constructing a scaffold graph; and (iv) generating scaffolds. In the first step, the alignment tool BWA-MEM is used to align $LR$ against $C$. For each long read and the set of contigs that can be aligned with it, SLR determines the orders and orientations of the contigs and forms a local scaffold. In the second step, SLR classifies the contigs into unique contigs and ambiguous contigs based on their positions in the local scaffolds. In the third step, SLR constructs a scaffold graph based on unique contigs and then detects and removes the contradictions in the scaffold graph. In the fourth step, SLR extracts the simple paths from the scaffold graph to yield a draft scaffold set. Next, SLR inserts the ambiguous contigs into the draft scaffolds. The details of each step are described below. Note that SLR uses only the long reads that are longer than $L_r$ and the contigs that are longer than $L_c$. $L_r$ and $L_c$ are two parameters that can be defined by users. In addition, if a contig is completely contained in other contigs, SLR ignores it in the following scaffolding steps.

Producing local scaffolds
SLR utilizes BWA-MEM to align LR against C, and the SAM file is converted to a BAM file by Bamtools [27]. Due to the high sequencing error rate in long reads, the alignment positions are usually different from the real positions. With the following method, SLR first revises the alignment positions and obtains reliable alignments.
For an alignment between the j-th long read $lr_j$ and the i-th contig $c_i$, we assume that the region $[sr_{ij}, er_{ij}]$ in $lr_j$ is aligned with the region $[sc_{ij}, ec_{ij}]$ in $c_i$. If $sr_{ij} < sc_{ij}$, then $sc_{ij} = sc_{ij} - sr_{ij}$ and $sr_{ij} = 0$; otherwise, $sr_{ij} = sr_{ij} - sc_{ij}$ and $sc_{ij} = 0$. The end positions $er_{ij}$ and $ec_{ij}$ are revised analogously; an example is shown in Fig. 4. After revision, the alignment is considered reliable if the following hold: (i) the mapping quality is higher than $s_m$ (a threshold with a default of 20); (ii) both $er_{ij} - sr_{ij}$ and $ec_{ij} - sc_{ij}$ are greater than $l_m$ (a threshold with a default of 100); (iii) for each of $sr_{ij}$, $er_{ij}$, $sc_{ij}$ and $ec_{ij}$, the difference between its original position and its revised position is smaller than $\alpha$ (a threshold with a default of 150). SLR retains only reliable alignments.
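The revision and reliability test can be sketched in Python as follows. This is an illustration under stated assumptions, not SLR's implementation: the `Alignment` class and function names are hypothetical, the end-side rule follows the single case given in the Fig. 4 caption (the contig ends first) and mirrors it for the other case, and the length test in condition (ii) is assumed to use the revised coordinates.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    sr: int    # start on the long read
    er: int    # end on the long read
    sc: int    # start on the contig
    ec: int    # end on the contig
    mapq: int  # mapping quality reported by the aligner


def revise(a, read_len, contig_len):
    """Extend an alignment towards the nearest sequence end on each side."""
    # Start side: shift both starts back by the smaller of the two offsets,
    # which is equivalent to the two-case rule in the text.
    shift = min(a.sr, a.sc)
    sr, sc = a.sr - shift, a.sc - shift
    # End side: the first branch follows the Fig. 4 caption; the second
    # branch is its mirror image and is an ASSUMPTION.
    read_tail = read_len - a.er
    contig_tail = contig_len - a.ec
    if contig_tail <= read_tail:
        er, ec = a.er + contig_tail, contig_len - 1
    else:
        er, ec = read_len - 1, a.ec + read_tail
    return Alignment(sr, er, sc, ec, a.mapq)


def is_reliable(orig, rev, s_m=20, l_m=100, alpha=150):
    """Apply conditions (i)-(iii) with the default thresholds from the text."""
    long_enough = (rev.er - rev.sr) > l_m and (rev.ec - rev.sc) > l_m
    small_shift = all(
        abs(getattr(orig, f) - getattr(rev, f)) < alpha
        for f in ("sr", "er", "sc", "ec")
    )
    return orig.mapq > s_m and long_enough and small_shift
```

For example, an alignment that starts 40 bases into a contig is extended by 40 bases and kept, whereas one that starts 400 bases into a contig would need a shift larger than the default $\alpha$ and is discarded.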
A local scaffold is composed of ordered and oriented contigs that can be aligned with the same long read. The i-th local scaffold $ls_i$ is represented by the vertex sequence $s_{i1}, s_{i2}, \ldots, s_{im}$, where $m$ is the number of contigs in the i-th local scaffold. $s_{ij}$ is represented by a four-tuple $(sc_{ij}, sco_{ij}, scg_{ij}, scl_{ij})$. $sc_{ij}$ refers to the j-th contig in $ls_i$. $sco_{ij}$ denotes the alignment orientation between the contig and the long read: $sco_{ij} = 1$ represents forward alignment, and $sco_{ij} = 0$ represents reverse alignment. $scg_{ij}$ denotes the gap distance between $sc_{ij}$ and $sc_{i(j+1)}$. In particular, the gap distance of the last vertex is zero. $scl_{ij}$ is the alignment length between $sc_{ij}$ and the long read. Note that if there are two or more contigs aligned with the same end of the long read, SLR keeps only the contig that has the greatest alignment length. An example is shown in Fig. 5.
The contig $sc_{ij}$ is in the middle position of $ls_i$ if $1 < j < m$. $sc_{ij}$ and $sc_{i(j+1)}$ are adjacent in $ls_i$. If $sco_{ij} = 1$, $sc_{i(j-1)}$ ($j > 1$) is the 5'-end neighbour contig of $sc_{ij}$, and $sc_{i(j+1)}$ ($j < m$) is the 3'-end neighbour contig of $sc_{ij}$. If $sco_{ij} = 0$, $sc_{i(j-1)}$ ($j > 1$) is the 3'-end neighbour contig of $sc_{ij}$, and $sc_{i(j+1)}$ ($j < m$) is the 5'-end neighbour contig of $sc_{ij}$.
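The four-tuple and the neighbour rule above map naturally onto a small data structure. The sketch below is illustrative; the `Vertex` type and the `neighbours` function are hypothetical names, and indices are 0-based rather than 1-based as in the text.

```python
from typing import NamedTuple


class Vertex(NamedTuple):
    contig: str    # sc_ij: contig name
    forward: bool  # sco_ij: True for 1 (forward), False for 0 (reverse)
    gap: int       # scg_ij: gap to the next vertex, 0 for the last vertex
    aln_len: int   # scl_ij: alignment length with the long read


def neighbours(ls, j):
    """Return (five_prime, three_prime) neighbour names of ls[j], or None."""
    prev_c = ls[j - 1].contig if j > 0 else None
    next_c = ls[j + 1].contig if j < len(ls) - 1 else None
    # A reverse-aligned contig sees its read-order neighbours swapped.
    return (prev_c, next_c) if ls[j].forward else (next_c, prev_c)
```

With a local scaffold `[A(forward), B(reverse), C(forward)]`, contig B has C as its 5'-end neighbour and A as its 3'-end neighbour.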
In this step, SLR finally obtains a local scaffold set $LS$. Due to the high sequencing error rate, a contig may not be aligned with the long read that connects its left and right neighbour contigs. To resolve this problem, SLR deletes some local scaffolds. For example, suppose that the local scaffold $ls_1$ is (A, C) and the local scaffold $ls_2$ is (B, C). If the sum of LEN(B) and the gap distance between B and C in $ls_2$ is smaller than the gap distance between A and C in $ls_1$, and there exists a local scaffold (A, B, C), SLR removes $ls_1$.

Classifying contigs
Repetitive regions are the critical problem in the process of scaffolding. When constructing a scaffold graph, the 5'-end (or 3'-end) of a repetitive contig can usually be linked with two or more other contigs, which complicates the scaffold graph. Because repetitive contigs commonly emerge in many different local scaffolds, they have two or more distinct 5'-end (or 3'-end) neighbour contigs. When a contig is not in the middle position of any local scaffold, no long read can span the contig to link its two neighbour contigs, and this contig is usually a long unique contig. Even if such a contig has multiple 5'-end or 3'-end neighbour contigs, SLR uses the contradiction removal step to identify its correct neighbour contigs. Hence, SLR can identify whether a contig is unique based on its positions in the local scaffolds.
Fig. 4 […] the region ending at position LEN($c_1$) − 1 of $c_1$ (region 2) is not aligned with $lr_1$. However, when $lr_1$ is truly aligned with $c_1$ and the alignment is reliable, region 4 should be aligned with the region $[sc_{11} - sr_{11}, sc_{11}]$ in $c_1$, and region 2 should be aligned with the region $[er_{11}, er_{11} + LEN(c_1) - ec_{11}]$. Because of the high sequencing error rate in long reads, the alignment tool usually does not provide accurate alignment start and end positions. SLR therefore sets $sc_{11} = sc_{11} - sr_{11}$, $sr_{11} = 0$, $ec_{11} = LEN(c_1) - 1$ and $er_{11} = er_{11} + LEN(c_1) - ec_{11}$. When the alignment is reliable, the region $[sc_{11}, ec_{11}]$ in $c_1$ is aligned with the region $[sr_{11}, er_{11}]$ in $lr_1$.
To reduce the negative effects of short repetitive contigs, SLR considers a contig whose length is shorter than $L_{ca}$ (a threshold that can be set by users) to be an ambiguous contig. These short contigs are temporarily ignored in the local scaffolds. Next, the contigs longer than $L_{ca}$ are classified using the following method.
SLR identifies a contig as ambiguous if the following hold: i) The contig is in the middle position of one or more local scaffolds and ii) the number of 5'-end (or 3'-end) neighbour contigs of the contig is greater than one.
After all ambiguous contigs have been identified, the remaining contigs are considered unique contigs. In this way, the contigs are classified into unique contigs and ambiguous contigs by SLR. An example of such contig classification is shown in Fig. 6.
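The classification rule can be sketched in Python as follows, using the configuration described in the Fig. 6 caption as the example. This is a simplified illustration, not SLR's code: local scaffolds are reduced to lists of `(contig, forward)` pairs, neighbours are compared by name only, and the contig lengths and `l_ca` value are made-up numbers.

```python
from collections import defaultdict


def classify(local_scaffolds, lengths, l_ca):
    """Split contigs into (unique, ambiguous) sets.

    local_scaffolds: list of lists of (contig_name, forward) pairs.
    lengths: dict mapping contig name to length.
    """
    short = {c for c, n in lengths.items() if n < l_ca}
    five, three = defaultdict(set), defaultdict(set)
    in_middle = set()
    for ls in local_scaffolds:
        ls = [v for v in ls if v[0] not in short]  # short contigs are ignored
        for j, (contig, forward) in enumerate(ls):
            prev_c = ls[j - 1][0] if j > 0 else None
            next_c = ls[j + 1][0] if j < len(ls) - 1 else None
            if prev_c is not None and next_c is not None:
                in_middle.add(contig)
            p5, p3 = (prev_c, next_c) if forward else (next_c, prev_c)
            if p5 is not None:
                five[contig].add(p5)
            if p3 is not None:
                three[contig].add(p3)
    ambiguous = set(short)
    for c in in_middle:
        # Condition (ii): more than one distinct neighbour at either end.
        if len(five[c]) > 1 or len(three[c]) > 1:
            ambiguous.add(c)
    return set(lengths) - ambiguous, ambiguous


# The six local scaffolds of Fig. 6, all alignments forward.
fig6 = [
    [("c1", True), ("c2", True)],
    [("c3", True), ("c4", True), ("c5", True)],
    [("c6", True), ("c4", True), ("c7", True)],
    [("c7", True), ("c8", True), ("c9", True)],
    [("c10", True), ("c11", True), ("c12", True)],
    [("c9", True), ("c11", True), ("c13", True), ("c2", True)],
]
lengths = {f"c{i}": 5000 for i in range(1, 14)}  # hypothetical lengths
unique, ambiguous = classify(fig6, lengths, l_ca=2000)
```

In this example c2 has two distinct 5'-end neighbours (c1 and c13) but never occupies a middle position, so it remains unique, whereas c4 and c11 satisfy both conditions.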

Constructing a scaffold graph
A scaffold graph $G$ is represented by a vertex set $V$ and an edge set $E$. A vertex $v_i$ corresponds to a contig $c_i$. An edge $e_{ij}$ is denoted by a five-tuple $(v_i, v_j, o_{ij}, g_{ij}, w_{ij})$. Two vertices $v_i$ and $v_j$ are connected by $e_{ij}$. $g_{ij}$ is the gap distance between $v_i$ and $v_j$. $o_{ij}$ is the relative orientation of $v_i$ and $v_j$. There are four types of relative orientation between $v_i$ and $v_j$: (i) the 3'-end of $v_i$ is connected to the 5'-end of $v_j$; (ii) the 5'-end of $v_i$ is connected to the 3'-end of $v_j$; (iii) the 5'-end of $v_i$ is connected to the 5'-end of $v_j$; and (iv) the 3'-end of $v_i$ is connected to the 3'-end of $v_j$. For types (i) and (ii), $v_i$ and $v_j$ are on the same strand. For the other two types, $v_i$ and $v_j$ are on opposite strands. $w_{ij}$ is the weight of the edge, which reflects its confidence.
Neglecting the ambiguous contigs and constructing scaffold graph G with only unique contigs will significantly simplify G and reduce the difficulties in inferring the orders and orientations of the unique contigs. Therefore, all unique contigs make up the vertex set V. Below, we describe how to create the edge set E. The superiority of constructing a scaffold graph using unique contigs is illustrated in Fig. 6.

Adding edges to the scaffold graph
First, SLR ignores the ambiguous contigs in all local scaffolds; therefore, some non-adjacent unique contigs may become adjacent in one local scaffold. Assume that the i-th local scaffold $ls_i$ $(s_{i1}, s_{i2}, \ldots, s_{im})$ in $LS$ includes two adjacent unique contigs $sc_{ip}$ and $sc_{is}$. If one or more ambiguous contigs exist between $sc_{ip}$ and $sc_{is}$, the gap distance between $sc_{ip}$ and $sc_{is}$ is re-calculated by formula (1); otherwise, it is equal to $scg_{ip}$. Here, $GD(sc_{ip}, sc_{is}, lr_i)$ represents the gap distance between $sc_{ip}$ and $sc_{is}$ in $ls_i$. Moreover, SLR can obtain a weight value, which is the minimum of $scl_{ip}$ and $scl_{is}$. The weight value can be used to evaluate the confidence level of the relation between $sc_{ip}$ and $sc_{is}$. As the weight value becomes larger, the order of the two unique contigs becomes more reliable.
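Because $scg_{ik}$ is the gap between consecutive vertices of $ls_i$, one form of the re-calculated gap distance that is consistent with these definitions sums the gaps from $sc_{ip}$ to $sc_{is}$ and the lengths of the ambiguous contigs skipped in between. The expression below is a sketch under that assumption and is not necessarily the exact expression used by SLR:

```latex
\begin{equation}
GD(sc_{ip}, sc_{is}, lr_i) \;=\; \sum_{k=p}^{s-1} scg_{ik} \;+\; \sum_{k=p+1}^{s-1} \mathrm{LEN}(sc_{ik}),
\qquad
\text{weight} \;=\; \min(scl_{ip},\, scl_{is}).
\end{equation}
```

When no ambiguous contig lies between the two unique contigs ($s = p + 1$), the second sum is empty and the expression reduces to $scg_{ip}$, in agreement with the text.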
Fig. 6 (a) There are six long reads: $lr_1$, $lr_2$, $lr_3$, $lr_4$, $lr_5$, and $lr_6$. The contigs $c_1$ and $c_2$ are aligned with $lr_1$. $c_3$, $c_4$ and $c_5$ are aligned with $lr_2$. $c_6$, $c_4$ and $c_7$ are aligned with $lr_3$. $c_7$, $c_8$ and $c_9$ are aligned with $lr_4$. $c_{10}$, $c_{11}$ and $c_{12}$ are aligned with $lr_5$. $c_9$, $c_{11}$, $c_{13}$ and $c_2$ are aligned with $lr_6$. We assume that all these alignments are forward and that all contigs are longer than $L_{ca}$. (b) Based on the alignment result described in (a), SLR obtains six local scaffolds: $ls_1$, $ls_2$, $ls_3$, $ls_4$, $ls_5$, and $ls_6$. (c) The scaffold graph $G_1$ is built using all contigs. We find that $G_1$ is complicated. (d) Based on the contig classification method described in the "Classifying contigs" section, the contigs can be divided into two categories. Because $c_4$ is located in the middle position of $ls_2$ and $ls_3$ and has two distinct 3'-end neighbour contigs and two distinct 5'-end neighbour contigs, it is identified as an ambiguous contig. $c_{11}$ is also an ambiguous contig. The remaining contigs are identified as unique contigs. The scaffold graph $G_2$ is built based on unique contigs and is thus less complicated than $G_1$.
We assume that $sc_{ip}$ is represented by $c_a$ and that $sc_{is}$ is represented by $c_b$. For $c_a$ and $c_b$, SLR selects all local scaffolds in which $c_a$ and $c_b$ are adjacent. Next, SLR determines the relative orientation of, the gap distance between, and the weight of $c_a$ and $c_b$ based on these local scaffolds. For two unique contigs, the relative order and orientation should be unique. If different values of $o_{ab}$ are obtained from the local scaffolds, SLR keeps only the local scaffold set $LS_{ab}$ for which the value of $o_{ab}$ is the same and the number of elements is the largest. The gap distance between $c_a$ and $c_b$ is calculated according to formula (2), in which $n$ is the number of elements in $LS_{ab}$ and $ls_i \in LS_{ab}$. In addition, we can obtain a weight value for each local scaffold in $LS_{ab}$.
The final weight of $c_a$ and $c_b$ (denoted $w_{ab}$) is the maximum of the weight values obtained from the local scaffolds in $LS_{ab}$. Then, SLR adds an edge $e_{ab}$ to $G$. After processing all pairs of unique contigs in $LS$, a draft scaffold graph $G$ is constructed by SLR for the subsequent steps.
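The aggregation of per-scaffold observations into a single edge can be sketched as below. Two points are assumptions made for illustration: formula (2) is taken to be the arithmetic mean of the gap distances over $LS_{ab}$ (suggested by the role of $n$, but not shown in this text), and orientations are encoded as the integers 1 to 4 for the four types defined earlier.

```python
from collections import Counter


def aggregate_edge(observations):
    """Combine observations of one contig pair into (o_ab, g_ab, w_ab).

    observations: list of (orientation, gap, weight) tuples, one per local
    scaffold in which the two unique contigs are adjacent.
    """
    # Keep the largest group of local scaffolds that agree on orientation.
    majority, _ = Counter(o for o, _, _ in observations).most_common(1)[0]
    kept = [(g, w) for o, g, w in observations if o == majority]  # LS_ab
    gap = sum(g for g, _ in kept) / len(kept)  # ASSUMED mean for formula (2)
    weight = max(w for _, w in kept)           # maximum weight, as in the text
    return majority, gap, weight
```

Note that a dissenting observation is discarded even when its own weight is high, because the orientation vote is taken before the weights are compared.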

Removing contradictions
Due to sequencing errors in long reads and complex repetitive regions, the scaffold graph G may still contain some spurious edges. Detecting and removing the spurious edges in G can be viewed as detecting and removing the orientation and position contradictions [10,11]. BOSS utilizes an iterative strategy to detect and remove contradictions. BOSS first constructs a sub-graph that includes only edges with a high weight. Next, it iteratively adds the remaining edges to the sub-graph from high to low weight. Each iteration includes a sub-graph, and BOSS builds two linear programming models [28] to solve orientation and position contradictions in the sub-graph. SLR utilizes a revised method based on BOSS to remove contradictions. The difference in SLR compared to BOSS is that SLR adds all edges to the sub-graph in the first iteration. Hence, SLR completes contradiction removal within one iteration, while BOSS requires several iterations. The methods of building the linear programming model of BOSS and SLR are the same, as described below.
First, SLR detects and deletes orientation contradictions. For the edge $e_{ij} \in G$, if the edge indicates that $v_i$ and $v_j$ have the same orientation ($o_i = o_j$), SLR constructs constraint Eq. (3). If it indicates opposite orientations ($o_i \neq o_j$), SLR constructs constraint Eq. (4). Here, $\eta_{ij} \in \{0, 1\}$ is a variable that represents whether $e_{ij}$ is spurious, and $o_i \in \{0, 1\}$ is a variable that denotes the orientation of $v_i$. The objective function is $\max \sum (w_{ij} \cdot \eta_{ij})$.
Second, SLR detects and deletes position contradictions. For the edge $e_{ij} \in G$, SLR constructs constraint Eq. (5), in which $p_i$ is a variable that represents the assigned position of $v_i$, and $\varphi_{ij}$ is a slack variable in the range $[0, 1]$ that reflects the consistency between $g_{ij}$ and $|p_j - p_i|$. The objective function is $\max \sum (w_{ij} \cdot \varphi_{ij})$. For an edge, if the gap distance computed from the assigned positions is far from the original one, the edge is deemed spurious, and SLR deletes it from $G$.
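To make the orientation step concrete, the sketch below solves the same optimisation by exhaustive search instead of linear programming: it tries every 0/1 orientation assignment and keeps the one that maximises the total weight of consistent edges. This is a stand-in for illustration only; it is exponential in the number of vertices and is not how SLR or BOSS solve the problem. The edge encoding `(u, v, same_strand, weight)` is hypothetical.

```python
from itertools import product


def best_orientations(vertices, edges):
    """Return (orientation dict, consistent edges) of maximum total weight.

    edges: list of (u, v, same_strand, weight); same_strand is True when the
    edge says u and v lie on the same strand.
    """
    best_score, best = -1.0, None
    for bits in product((0, 1), repeat=len(vertices)):
        o = dict(zip(vertices, bits))
        # An edge is consistent when the assignment matches its strand claim.
        kept = [e for e in edges if (o[e[0]] == o[e[1]]) == e[2]]
        score = sum(e[3] for e in kept)
        if score > best_score:
            best_score, best = score, (o, kept)
    return best
```

For three contigs with edges a-b (same strand, weight 10), b-c (same strand, weight 8) and a-c (opposite strands, weight 3), no assignment satisfies all three edges, and the lightest edge a-c is the one dropped as spurious.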
After eliminating the orientation and position contradictions, if there are two or more edges linking the same end of a vertex, SLR keeps only the edge with the highest weight and removes the others. Consequently, the scaffold graph G contains only simple paths.
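The final pruning rule can be sketched as follows. The text does not say whether competing ends are resolved simultaneously or greedily; this sketch assumes the literal reading, in which an edge survives only if it is the heaviest edge at both of the contig ends it touches. The edge encoding is hypothetical.

```python
def prune_edges(edges):
    """Keep an edge only if it is the heaviest edge at both of its ends.

    edges: list of (u, u_end, v, v_end, weight), where each *_end is "5" or "3".
    """
    best = {}
    for e in edges:
        u, u_end, v, v_end, w = e
        for end in ((u, u_end), (v, v_end)):
            if end not in best or w > best[end][4]:
                best[end] = e
    return [e for e in edges
            if best[(e[0], e[1])] is e and best[(e[2], e[3])] is e]
```

After this step every contig end carries at most one edge, so no vertex has more than two incident edges.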

Generating scaffolds
Each simple path in G refers to a scaffold, and SLR selects all simple paths and constructs a draft scaffold set. For any two adjacent vertices in the draft scaffold, SLR scans the local scaffold set LS again and finds local scaffolds that contain them. If ambiguous contigs exist between these vertices in a local scaffold, these ordered and oriented ambiguous contigs correspond to a path. If there are two or more different paths, SLR selects the one with the greatest number of local scaffolds that support it and then inserts it between the two vertices. Note that an ambiguous contig may occur two or more times in the scaffolds.
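Selecting the path of ambiguous contigs to insert between two adjacent unique contigs amounts to a vote among local scaffolds. In this minimal sketch, local scaffolds are simplified to lists of contig names in read order; orientation and reversed local scaffolds are ignored, which is an assumption made to keep the example short.

```python
from collections import Counter


def choose_insert_path(local_scaffolds, left, right):
    """Return the most supported tuple of contigs found between left and right."""
    votes = Counter()
    for ls in local_scaffolds:
        if left in ls and right in ls:
            i, j = ls.index(left), ls.index(right)
            if j - i > 1:  # at least one contig lies in between
                votes[tuple(ls[i + 1:j])] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Local scaffolds in which the two contigs are directly adjacent cast no vote, so the function returns `None` when no ambiguous contig was ever observed between them.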
Next, SLR selects local scaffolds that contain the first contig of a scaffold. SLR constructs a scaffold graph based on these local scaffolds. If a simple path starts from the first contig in the scaffold graph, it is merged with the head of the scaffold. In the same way, SLR extends the tail of the scaffold. Once the first t contigs of a scaffold are the same as the last t contigs of another scaffold (t is a threshold set by users), SLR will merge them together to form a new scaffold. In the same way, SLR will reverse a scaffold and detect whether it can be merged with other scaffolds. Finally, SLR outputs the scaffolds as the final result.
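The overlap-based merge can be sketched as below. Scaffolds are represented as lists of `(contig, forward)` pairs, which is a simplification chosen for illustration, and gap sizes are omitted.

```python
def merge_by_overlap(a, b, t):
    """Merge b onto the end of a if the last t contigs of a equal the first t of b."""
    if t > 0 and len(a) >= t and len(b) >= t and a[-t:] == b[:t]:
        return a + b[t:]
    return None


def reverse_scaffold(s):
    """Reverse the contig order and flip every orientation."""
    return [(name, not forward) for name, forward in reversed(s)]
```

A scaffold that does not merge as given can be retried after `reverse_scaffold`, mirroring the last step described above.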