Fig. 3 | BMC Bioinformatics

Fig. 3

From: RegCloser: a robust regression approach to closing genome gaps

Closing a gap containing a tandem repeat. a Comparison of the closing results from five methods on a gap containing a triple tandem repeat. The repeat unit is shaded in yellow, and the copy number is 2.9. The gap starts at the middle of the first copy and ends at the repeat end. The sequences flanking the gap are shaded in blue. Existing tools, including Phrap, GapCloser, GapFiller, and Sealer, get the copy number wrong; only RegCloser resolves the tandem repeat correctly. b An illustration of the two different alignments between two reads from the tandem repeat. The position ① is mapped by the green and red alignments to positions ② and ③, respectively, which differ by a shift of 69 bp, i.e., the size of the repeat unit. c Violin plots of the observational errors of overlaps detected by all-against-all pairwise alignment (left) versus insert-size guided pairwise alignment (right). The observational error of an overlap between reads \(i\) and \(j\) is the difference between \(y^{(i,j)}\) and \(\beta_{j}-\beta_{i}\), where \(\beta_{i}\) and \(\beta_{j}\) are the true read positions. The insert-size guided strategy eliminates the large errors around 138 bp and substantially reduces the moderate errors around 69 bp, thus generating a higher-quality dataset for the regression. d Variation of the residual distribution over the course of the two-step robust regression. In Step 1, as the IRLS algorithm for computing the robust M-estimate iterates, most residuals converge towards 0 bp, while a fraction of residuals remain between 0 and 69 bp. In Step 2, the read coordinates are estimated by OLS on the data excluding the subset of outliers identified in Step 1. The final residuals of all data are clustered into two peaks exactly at 0 and 69 bp, which means the outliers have been separated out. e Screenshots of the layouts generated from the OLS estimate and from the two-step robust estimate of the reads’ genomic positions, respectively. The former layout is messy, whereas the latter is well aligned.
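
The two-step procedure described in panel d can be illustrated with a small simulation. The following is a minimal sketch, not RegCloser's implementation: the Huber loss, its threshold `delta`, the outlier cutoff of half a repeat unit, the evenly spaced toy reads, and all function names are assumptions made for illustration. Each overlap contributes one observation \(y^{(i,j)} \approx \beta_{j}-\beta_{i}\), and every seventh overlap is deliberately shifted by 69 bp to mimic an alignment that is off by one repeat unit.

```python
import numpy as np

REPEAT_UNIT = 69.0  # bp; repeat-unit size quoted in the caption


def build_design(pairs, n_reads):
    """One row per overlap (i, j), encoding beta_j - beta_i.

    Read 0 is pinned at position 0 by dropping its column, which removes
    the translation ambiguity of the coordinates.
    """
    X = np.zeros((len(pairs), n_reads))
    for row, (i, j) in enumerate(pairs):
        X[row, i] -= 1.0
        X[row, j] += 1.0
    return X[:, 1:]


def ols(X, y):
    """Ordinary least squares estimate of the read coordinates."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def huber_irls(X, y, delta=5.0, n_iter=50):
    """Step 1 (assumed Huber loss): M-estimate by iteratively reweighted LS."""
    beta = ols(X, y)  # start from the plain OLS fit
    for _ in range(n_iter):
        abs_r = np.abs(y - X @ beta)
        # Huber weights: 1 inside [-delta, delta], delta/|r| outside.
        w = np.where(abs_r <= delta, 1.0, delta / np.maximum(abs_r, 1e-12))
        sw = np.sqrt(w)
        beta = ols(X * sw[:, None], y * sw)
    return beta


def two_step_estimate(X, y, delta=5.0, cutoff=REPEAT_UNIT / 2):
    """Step 2: OLS restricted to overlaps whose Step 1 residual is small.

    The cutoff (half a repeat unit) is an illustrative choice.
    """
    beta_m = huber_irls(X, y, delta)
    inlier = np.abs(y - X @ beta_m) <= cutoff
    return ols(X[inlier], y[inlier]), inlier


def simulate(n_reads=40, spacing=10.0, max_gap=10, noise_sd=0.5, seed=0):
    """Toy data: evenly spaced reads; every 7th overlap is off by 69 bp."""
    rng = np.random.default_rng(seed)
    beta_true = spacing * np.arange(n_reads)
    pairs = [(i, j) for i in range(n_reads)
             for j in range(i + 1, min(i + max_gap, n_reads - 1) + 1)]
    X = build_design(pairs, n_reads)
    shifted = np.arange(len(pairs)) % 7 == 3
    y = X @ beta_true[1:] + rng.normal(0.0, noise_sd, len(pairs))
    y[shifted] += REPEAT_UNIT
    return X, y, beta_true, shifted


if __name__ == "__main__":
    X, y, beta_true, shifted = simulate()
    beta_hat, inlier = two_step_estimate(X, y)
    print("max error, plain OLS:", np.max(np.abs(ols(X, y) - beta_true[1:])))
    print("max error, two-step: ", np.max(np.abs(beta_hat - beta_true[1:])))
    print("overlaps flagged as outliers:", int((~inlier).sum()))
```

In this toy setting the final residuals `y - X @ beta_hat` fall near 0 bp for the retained overlaps and near 69 bp for the excluded ones, which mirrors the two-peak pattern described for panel d. Pinning read 0 at the origin is one convenient way to fix the arbitrary offset of the coordinates; other constraints would serve equally well.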
