Fig. 1 | BMC Bioinformatics

Fig. 1

From: Pheniqs 2.0: accurate, high-performance Bayesian decoding and confidence estimation for combinatorial barcode indexing

Pheniqs decoding pipeline architecture. (A) Pheniqs requires as input sequence data files in any standard format and (if not using default parameters) a JSON configuration file. Python API tools (not shown) assist with IO management and can automatically generate an initial configuration using metadata from an Illumina run directory. Pheniqs evaluates each configuration component to determine how the data should be processed and to ensure that all required directives are present and properly specified. Any validation failure triggers a clear, explicit error message. Prior distributions of expected barcodes either derive from initial sample proportions as given (e.g. in a sample sheet) or are estimated directly from the data during a preliminary PAMLD decoding run. Barcode tokens are extracted from read segments using transform directives and then passed to a decoder (PAMLD or MDD). New decoding algorithms may be implemented as derived classes. Decoded barcodes and quality scores are written to a specific SAM auxiliary field for each barcode type. Pheniqs can emit FASTQ files split by sample barcode, but SAM format is preferred since it preserves all associated metadata, and its binary (BAM) and compressed (CRAM) versions produce considerably smaller files. POSIX integration allows direct piping to automated workflows, and support for real-time translation of file formats enables teeing to multiple outputs (thus avoiding the need to write temporary files). A JSON-encoded run report that provides summary statistics for the analysis is also generated. (B) PAMLD noise and confidence filters. Reads with a lower conditional probability than random sequences fail the noise filter and are classified as noise without further consideration. Reads with a posterior probability that does not meet the confidence threshold fail the confidence filter; these reads are classified, but they are marked as qc fail so that the confidence threshold can be reconsidered at a later stage.
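
The two filters described for panel B can be illustrated with a minimal Python sketch. This is not the Pheniqs implementation: the function names, the uniform substitution error model (an erroneous base is equally likely to be any of the other three), the default noise prior and the default confidence threshold are all assumptions made for illustration only.

```python
def phred_to_error(quality):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-quality / 10)


def conditional_probability(observed, qualities, barcode):
    """P(observed | barcode), assuming independent per-base errors
    spread uniformly over the three alternative bases (an assumption)."""
    p = 1.0
    for base, quality, expected in zip(observed, qualities, barcode):
        error = phred_to_error(quality)
        p *= (1 - error) if base == expected else error / 3
    return p


def classify(observed, qualities, priors,
             noise_prior=0.01, confidence_threshold=0.95):
    """Apply a noise filter, then a confidence filter (hypothetical helper).

    priors maps each expected barcode to its prior; the priors are
    assumed to sum to 1 and are rescaled by (1 - noise_prior).
    """
    # Probability of observing this exact sequence if it were random.
    random_probability = 0.25 ** len(observed)

    conditionals = {
        barcode: conditional_probability(observed, qualities, barcode)
        for barcode in priors
    }
    weighted = {
        barcode: (1 - noise_prior) * priors[barcode] * conditionals[barcode]
        for barcode in priors
    }
    best = max(weighted, key=weighted.get)

    # Noise filter: the best candidate explains the read worse than chance.
    if conditionals[best] < random_probability:
        return {"barcode": None, "posterior": None, "status": "noise"}

    # Posterior of the best candidate, with noise as a competing hypothesis.
    evidence = sum(weighted.values()) + noise_prior * random_probability
    posterior = weighted[best] / evidence

    # Confidence filter: the read is still classified, but flagged.
    status = "pass" if posterior >= confidence_threshold else "qc_fail"
    return {"barcode": best, "posterior": posterior, "status": status}


if __name__ == "__main__":
    priors = {"ACGT": 0.5, "ACGA": 0.5}
    print(classify("ACGT", [40, 40, 40, 40], priors))  # confident call
    print(classify("ACGT", [30, 30, 30, 2], priors))   # ambiguous last base
    print(classify("GTTC", [40, 40, 40, 40], priors))  # resembles neither
```

In this sketch a read that fails the confidence filter keeps its most likely barcode and is only flagged, which mirrors the caption's point that the threshold can be revisited later without decoding again.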
