Fig. 1 | BMC Bioinformatics

Fig. 1

From: RegCloser: a robust regression approach to closing genome gaps

Pipeline of RegCloser. a Align the input paired-end or mate-pair reads onto the draft genome, and collect the reads falling in the gap regions. b An illustrative example of insert-size guided pairwise alignment for overlap detection in a gap containing a triple tandem repeat ‘GAACCCT’. First, on the axis corresponding to the DNA sequence in the gap, the collected reads are placed at their prior positions, which are inferred from their mates' positions and the insert size. The prior positions of reads ④ and ⑤ are 11 and 16, respectively. Notably, two pseudo reads, ① and ⑨, generated from the contig ends flanking the gap are added. Then, a pair of reads is aligned only when their prior positions are close, taking into account the variation of insert sizes. If an alignment is statistically significant, it is marked by yellow parallel lines between bases as well as by double-headed arrows between reads. The tandem repeat causes a false overlap between reads ④ and ⑥, as indicated by the red double-headed arrow, resulting in an outlier in the subsequent regression model. The repeat also leads to two different significant alignments between reads ④ and ⑤, marked by the yellow solid and dotted lines, respectively. In this case, RegCloser selects the alignment more compatible with the prior positions (solid line), rather than the highest-scoring one (dotted line). c The linear regression model of genome assembly. The reads’ real positions on the gap axis are represented as parameters \({\beta }_{i}\ (1\le i\le n)\) to be estimated. Each detected overlap between reads \(i\) and \(j\) provides an observation on the difference between \({\beta }_{i}\) and \({\beta }_{j}\): \({y}^{(i,j)}={\beta }_{j}-{\beta }_{i}+{\varepsilon }^{(i,j)}\). Here \({\varepsilon }^{(i,j)}\) is the observational error, which is normally caused by sequencing errors on the DNA fragment between \({\beta }_{i}\) and \({\beta }_{j}\). False overlaps cause outliers, which have abnormally large \(|{\varepsilon }^{(i,j)}|\). All the observations in a gap are integrated into the matrix form \(\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\). d A two-step robust regression estimation of the model parameters. \({\rho }_{H}\) is the Huber loss function. \({I}_{o}\) is the index set of observations that have large residuals and are identified as potential outliers. e Generate a multiple sequence alignment of the collected reads by their estimated positions, and determine the gap sequence as the consensus between the two pseudo reads ① and ⑨. As a result, the triple tandem repeat is recovered by RegCloser.
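
The regression model in panels c and d can be illustrated with a small sketch. This is not RegCloser's code: the caption does not spell out the two steps, so the version below assumes one plausible reading, namely a Huber fit by iteratively reweighted least squares followed by an ordinary least-squares refit that leaves out the flagged set \({I}_{o}\). The function names, the parameters `delta` and `cutoff`, the choice to pin the first read at position 0, and the toy overlaps are all hypothetical.

```python
import numpy as np

def build_design(overlaps, n):
    """Each overlap (i, j, y) contributes one row encoding y = beta_j - beta_i.

    Assumption: read 0 is pinned at position 0 so the system is identifiable,
    which is why column 0 is dropped from the returned matrix.
    """
    X = np.zeros((len(overlaps), n))
    y = np.empty(len(overlaps))
    for row, (i, j, diff) in enumerate(overlaps):
        X[row, i] = -1.0
        X[row, j] = 1.0
        y[row] = diff
    return X[:, 1:], y

def huber_irls(X, y, delta=2.0, iters=50):
    """Minimise the Huber loss by iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(iters):
        r = np.abs(y - X @ beta)
        # Huber weights: 1 inside the quadratic zone, delta/|r| outside it.
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

def two_step_fit(overlaps, n, delta=2.0, cutoff=3.5):
    """Step 1: Huber fit. Step 2: refit without large-residual observations."""
    X, y = build_design(overlaps, n)
    beta = huber_irls(X, y, delta)
    resid = np.abs(y - X @ beta)
    outliers = np.flatnonzero(resid > cutoff)   # hypothetical stand-in for I_o
    keep = resid <= cutoff
    beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    return np.concatenate(([0.0], beta)), outliers

# Toy gap with five reads; the last overlap is off by one 7-base repeat unit.
toy_overlaps = [
    (0, 1, 5), (1, 2, 6), (2, 3, 5), (3, 4, 7),
    (0, 2, 11), (1, 3, 11), (2, 4, 12),
    (1, 3, 4),  # false overlap: 11 - 7
]
positions, flagged = two_step_fit(toy_overlaps, n=5)
print(positions, flagged)
```

In this toy setting the false overlap is shifted by a whole repeat unit, so a cutoff of half the unit length is enough to separate it from the consistent observations; real data would need a threshold tied to the observed residual spread.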
