
Table 1 Outline of the GBS-SNP-CROP workflow, featuring inputs and outputs of all seven steps (scripts)

From: GBS-SNP-CROP: a reference-optional pipeline for SNP discovery and plant germplasm characterization using variable length, paired-end genotyping-by-sequencing data

 

| Step | Input file(s) | Output file(s) | Time^a (hrs:mins) |
| --- | --- | --- | --- |
| Stage 1. Process the raw GBS data | | | |
| Step 1 Parse the raw reads | CASAVA generated paired-end (R1, R2) files (.fastq.gz); Barcode-ID file (.txt) | Parsing summary information (.txt); Read length distribution summary (.txt); Parsed paired-end [PE] reads (.fastq); Parsed, unpaired R1 reads (.fastq) | 2:24 |
| Step 2 Trim based on quality | Parsed PE reads (.fastq) | High quality, parsed PE reads (.fastq); High quality, parsed singletons (.fastq) | 0:10 |
| Step 3 Demultiplex | One pair (R1, R2) of high quality files (.fastq) per library; Barcode-ID file (.txt) | One pair (R1, R2) of high quality files (.fastq) per genotype | 0:16 |
| Stage 2. Build the Mock Reference | | | |
| Step 4 Cluster reads and assemble the Mock Reference [MR] | Genotype-specific PE files (.fastq); Barcode-ID file (.txt) | Mock Reference [centroids] (.fasta); Mock Reference [genome] (.fasta) | 0:14^b |
| Stage 3. Map the processed reads and generate standardized alignment files | | | |
| Step 5 Align with BWA-mem and process with SAMtools | Genotype-specific high quality PE files (.fastq); Reference or MR [genome] (.fasta); Barcode-ID file (.txt) | Filtered reads (.bam); Sorted BAM files (.sorted.bam); Indexed BAM files (.sorted.bam.bai); Indexed reference or MR (.fasta.idx); One base call alignment summary file (.mpileup) per genotype | 3:36 |
| Step 6 Parse mpileup output and produce the SNP discovery master matrix | One base call alignment summary file (.mpileup) per genotype; Barcode-ID file (.txt) | One base call alignment summary count file (.txt) per genotype; SNP discovery master matrix (.txt) | 4:37 |
| Stage 4. Call SNPs and Genotypes | | | |
| Step 7 SNP genotyping across the population | SNP discovery master matrix (.txt) | SNP genotyping matrix for the population (.txt) | 0:04 |

a. The computation times presented here are specific to the particular dataset in this study.
b. The time to build the Mock Reference using only the single most read-abundant genotype (-MR01). Using the five most read-abundant genotypes or all 48 genotypes increases the required computation time for this step to 0:55 and 4:30, respectively (see Table 2).
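
As a rough check on overall runtime, the per-step times above can be added up. The following is a minimal sketch, assuming the seven steps run strictly one after another with no overlap; the table reports only per-step times, so the totals are derived here rather than reported by the authors. The names `STEP_TIMES`, `STEP4_OPTIONS`, and `total_time` are illustrative and not part of GBS-SNP-CROP.

```python
# Per-step times (hrs:mins) copied from Table 1.
# Step 4 defaults to the single-genotype (-MR01) Mock Reference.
STEP_TIMES = {
    "Step 1": "2:24",
    "Step 2": "0:10",
    "Step 3": "0:16",
    "Step 4": "0:14",
    "Step 5": "3:36",
    "Step 6": "4:37",
    "Step 7": "0:04",
}

# Alternative Step 4 times taken from footnote b.
STEP4_OPTIONS = {
    "single most read-abundant genotype": "0:14",
    "five most read-abundant genotypes": "0:55",
    "all 48 genotypes": "4:30",
}


def to_minutes(hhmm: str) -> int:
    """Convert an 'h:mm' string to a whole number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(total_minutes: int) -> str:
    """Convert a number of minutes back to an 'h:mm' string."""
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}:{minutes:02d}"


def total_time(step4_time: str) -> str:
    """Sum all seven step times, substituting the given Step 4 time."""
    times = {**STEP_TIMES, "Step 4": step4_time}
    return to_hhmm(sum(to_minutes(t) for t in times.values()))


if __name__ == "__main__":
    for label, step4_time in STEP4_OPTIONS.items():
        print(f"Mock Reference from {label}: {total_time(step4_time)}")
```

Under the sequential-execution assumption, the totals come to 11:21, 12:02, and 15:37 (hrs:mins) for the three Mock Reference options. As footnote a notes, these figures apply only to the dataset used in this study.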