Fig. 4 | BMC Bioinformatics

Fig. 4

From: RegCloser: a robust regression approach to closing genome gaps

Generating an optimal layout of TGS long reads by the robust regression approach. a Illustration of the regression representation. On the linear axis corresponding to the real genome, the position of each TGS long read is represented by a parameter \({\beta }_{i}\) to be estimated. Each overlap, marked by a yellow double-headed arrow, provides an observation of the difference between two reads' positions. However, chimeric reads, as well as repeats from distant regions on either the same or the reverse strand of the genome, introduce false overlaps, marked by red crosses: ① indicates a false overlap caused by a chimeric read, ② one caused by a repeat on the same strand, and ③ one caused by a repeat on the reverse strand. All the overlap observations are integrated into the linear regression model \({\varvec{Y}}={\varvec{X}}{\varvec{\beta}}+{\varvec{\varepsilon}}\). The two-step robust regression procedure then gives a globally optimal estimate of the read positions, which yields a layout. At the same time, it detects the outliers, which correspond to the false overlaps. b Boxplot of the differences between the estimated and true positions of the reads in the layout generated by RegCloser for de novo assembly of the E. coli genome using a HiFi dataset
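
The regression representation in panel a can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RegCloser's implementation: the caption does not specify the two steps, so the sketch swaps in a Huber-weighted iteratively reweighted least squares fit followed by a least-squares refit on the overlaps not flagged as outliers. The function name `robust_layout`, the `(i, j, d)` overlap tuples, and the tuning constants `c` and `k` are all hypothetical.

```python
import numpy as np

def robust_layout(n_reads, overlaps, n_iter=20, c=1.345, k=3.0):
    """Estimate read positions beta from pairwise overlap offsets.

    overlaps: list of (i, j, d), one observation d = beta_j - beta_i + noise.
    Returns (beta, indices of overlaps flagged as false).
    """
    m = len(overlaps)
    X = np.zeros((m, n_reads))
    y = np.zeros(m)
    for row, (i, j, d) in enumerate(overlaps):
        X[row, i], X[row, j], y[row] = -1.0, 1.0, d

    # Positions are only defined up to a shift, so pin beta_0 = 0.
    Xr = X[:, 1:]

    def mad_scale(r):
        # Robust residual scale, floored to avoid division by ~0.
        return max(1.4826 * np.median(np.abs(r)), 1e-6)

    # Step 1 (assumed): Huber-weighted IRLS, starting from ordinary LS.
    w = np.ones(m)
    for _ in range(n_iter):
        sw = np.sqrt(w)
        sol = np.linalg.lstsq(Xr * sw[:, None], y * sw, rcond=None)[0]
        r = y - Xr @ sol
        u = np.abs(r) / mad_scale(r)
        w = c / np.maximum(u, c)  # 1 inside the Huber band, c/u outside

    # Step 2 (assumed): flag large residuals, refit on the remaining rows.
    outlier = np.abs(r) > k * mad_scale(r)
    sol = np.linalg.lstsq(Xr[~outlier], y[~outlier], rcond=None)[0]
    beta = np.concatenate(([0.0], sol))
    return beta, np.flatnonzero(outlier)

# Toy data: five reads spaced 100 bp apart, plus one false overlap (last row).
overlaps = [(0, 1, 100), (1, 2, 100), (2, 3, 100), (3, 4, 100),
            (0, 2, 200), (1, 3, 200), (2, 4, 200), (0, 4, 50)]
beta, false_idx = robust_layout(5, overlaps)
print(beta, false_idx)
```

Each row of \({\varvec{X}}\) holds a single \(-1\) and a single \(+1\), so the model only constrains position differences; the sketch removes that ambiguity by fixing the first read at zero. If discarding flagged overlaps disconnects the overlap graph, the least-squares refit falls back to a minimum-norm solution, and the relative placement of the disconnected parts is then not meaningful.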
