Fig. 2 | BMC Bioinformatics

Fig. 2

From: Pheniqs 2.0: accurate, high-performance Bayesian decoding and confidence estimation for combinatorial barcode indexing

Tokenizing a standard Illumina read. Example of tokenization syntax for a 150-nt paired-end dual-indexed sequencing run. (A) Input files containing read segments emitted by the sequencer are indexed as an array, where 0 = Read 1, 1 = Index 1, 2 = Index 2, 3 = Read 2. Barcode tokens are defined for each type of barcode included in the experimental design and may appear at any position and in any orientation in any read segment. (B) This example contains a sample barcode composed of two 8-nt elements (i5 and i7), a 12-nt inline cellular barcode (Cell), and a 12-nt inline molecular barcode (UMI). The biological sequences of interest (template) are located in Read 1 (31 nt just downstream of the Cell and UMI) and all of Read 2 (here, 75 nt). Each token comprises three colon-separated components, “segment:start:end”. As in Python array slicing syntax, the start coordinate (offset) is inclusive and the end coordinate is exclusive. The start coordinate defaults to 0 and the end coordinate defaults to the end of the segment. (C) Template read segments, observed and most likely inferred barcode sequences, quality scores, and error probabilities are emitted to designated SAM fields. The reported error for each barcode is one minus its estimated confidence (the posterior decoding probability); for a combinatorial barcode, the reported error is one minus the product of the confidence scores for each component.
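The token semantics described in the caption can be illustrated with a minimal Python sketch. This is not Pheniqs code: the function names `parse_token`, `apply_token`, and `combined_error` are hypothetical, and the token coordinates in the usage lines (Cell at 0:0:12, UMI at 0:12:24, template at 0:24:55) are assumed for illustration and are not read from the figure. The sketch covers only the slicing rule (inclusive start, exclusive end, defaults of 0 and segment end) and the combinatorial error formula; it does not handle orientation or any other token features.

```python
from math import prod


def parse_token(token):
    """Split a 'segment:start:end' token into (segment, start, end).

    An empty start or end becomes None, so that Python slicing applies
    its defaults (0 and the end of the segment, respectively).
    """
    segment, start, end = token.split(":")
    return (
        int(segment),
        int(start) if start else None,
        int(end) if end else None,
    )


def apply_token(token, segments):
    """Extract the subsequence a token selects from the segment array."""
    index, start, end = parse_token(token)
    # Start is inclusive and end is exclusive, as in Python slicing.
    return segments[index][start:end]


def combined_error(confidences):
    """Error of a combinatorial barcode: one minus the product of the
    confidence scores of its components."""
    return 1.0 - prod(confidences)


# Synthetic segments in the order 0 = Read 1, 1 = Index 1, 2 = Index 2, 3 = Read 2.
# The sequences are placeholders, not real data.
segments = [
    "A" * 12 + "C" * 12 + "G" * 31,  # Read 1: Cell, UMI, template
    "ACGTACGT",                      # Index 1
    "TGCATGCA",                      # Index 2
    "T" * 75,                        # Read 2: template
]

cell = apply_token("0:0:12", segments)        # assumed Cell coordinates
umi = apply_token("0:12:24", segments)        # assumed UMI coordinates
template_1 = apply_token("0:24:55", segments) # assumed 31-nt template
template_2 = apply_token("3::", segments)     # all of Read 2, using both defaults
```

Leaving a coordinate empty (as in `3::`) relies on the defaults stated in the caption, so the same token selects the whole segment regardless of read length.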
