Table 1 Outline of the GBS-SNP-CROP workflow, featuring inputs and outputs of all seven steps (scripts)

From: GBS-SNP-CROP: a reference-optional pipeline for SNP discovery and plant germplasm characterization using variable length, paired-end genotyping-by-sequencing data

| Step | Input file(s) | Output file(s) | Time^a (hrs:mins) |
| --- | --- | --- | --- |
| Stage 1. Process the raw GBS data | | | |
| Step 1. Parse the raw reads | CASAVA-generated paired-end (R1, R2) files (.fastq.gz); Barcode-ID file (.txt) | Parsing summary information (.txt); Read length distribution summary (.txt); Parsed paired-end [PE] reads (.fastq); Parsed, unpaired R1 reads (.fastq) | 2:24 |
| Step 2. Trim based on quality | Parsed PE reads (.fastq) | High quality, parsed PE reads (.fastq); High quality, parsed singletons (.fastq) | 0:10 |
| Step 3. Demultiplex | One pair (R1, R2) of high quality files (.fastq) per library; Barcode-ID file (.txt) | One pair (R1, R2) of high quality files (.fastq) per genotype | 0:16 |
| Stage 2. Build the Mock Reference | | | |
| Step 4. Cluster reads and assemble the Mock Reference [MR] | Genotype-specific PE files (.fastq); Barcode-ID file (.txt) | Mock Reference [centroids] (.fasta); Mock Reference [genome] (.fasta) | 0:14^b |
| Stage 3. Map the processed reads and generate standardized alignment files | | | |
| Step 5. Align with BWA-mem and process with SAMtools | Genotype-specific high quality PE files (.fastq); Reference or MR [genome] (.fasta); Barcode-ID file (.txt) | Filtered reads (.bam); Sorted BAM files (.sorted.bam); Indexed BAM files (.sorted.bam.bai); Indexed reference or MR (.fasta.idx); One base call alignment summary file (.mpileup) per genotype | 3:36 |
| Step 6. Parse mpileup output and produce the SNP discovery master matrix | One base call alignment summary file (.mpileup) per genotype; Barcode-ID file (.txt) | One base call alignment summary count file (.txt) per genotype; SNP discovery master matrix (.txt) | 4:37 |
| Stage 4. Call SNPs and Genotypes | | | |
| Step 7. SNP genotyping across the population | SNP discovery master matrix (.txt) | SNP genotyping matrix for the population (.txt) | 0:04 |

a. The computation times presented here are specific to the particular dataset in this study.
b. The time to build the Mock Reference using only the single most read-abundant genotype (-MR01). Using the five most read-abundant genotypes or all 48 genotypes increases the required computation time for this step to 0:55 or 4:30, respectively (see Table 2).
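
The per-step times in Table 1 can be added up to estimate the end-to-end run time for this dataset. The following is a minimal sketch, not part of GBS-SNP-CROP itself: it assumes the seven steps run one after another with no overlap, and the helper names (`to_minutes`, `to_hhmm`, `pipeline_total`) and the dictionary labels other than `-MR01` are hypothetical. The Step 4 alternatives come from footnote b.

```python
# Per-step computation times (hrs:mins) copied from Table 1.
STEP_TIMES = {
    "Step 1": "2:24",
    "Step 2": "0:10",
    "Step 3": "0:16",
    "Step 4": "0:14",  # Mock Reference built from one genotype (-MR01)
    "Step 5": "3:36",
    "Step 6": "4:37",
    "Step 7": "0:04",
}

# Alternative Step 4 times from footnote b (labels are illustrative only).
STEP4_ALTERNATIVES = {
    "-MR01 (one genotype)": "0:14",
    "five genotypes": "0:55",
    "all 48 genotypes": "4:30",
}


def to_minutes(hhmm: str) -> int:
    """Convert an 'h:mm' string to a number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(total_minutes: int) -> str:
    """Convert a number of minutes back to an 'h:mm' string."""
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"


def pipeline_total(step4_time: str) -> str:
    """Sum all seven step times, substituting the given Step 4 time."""
    times = {**STEP_TIMES, "Step 4": step4_time}
    return to_hhmm(sum(to_minutes(t) for t in times.values()))


if __name__ == "__main__":
    for label, step4_time in STEP4_ALTERNATIVES.items():
        print(f"{label}: {pipeline_total(step4_time)}")
    # -MR01 (one genotype): 11:21
    # five genotypes: 12:02
    # all 48 genotypes: 15:37
```

Under these assumptions, Steps 5 and 6 account for most of the total, and the choice of Mock Reference strategy in Step 4 shifts it by up to about four hours. As footnote a notes, these figures are specific to the dataset in this study.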