GBS-SNP-CROP: a reference-optional pipeline for SNP discovery and plant germplasm characterization using variable length, paired-end genotyping-by-sequencing data

Melo, Arthur T. O.; Bartaula, Radhika; Hale, Iago

doi:10.1186/s12859-016-0879-y

Table 1 Outline of the GBS-SNP-CROP workflow, featuring inputs and outputs of all seven steps (scripts)


Step	Input file(s)	Output file(s)	Time^a (hrs:mins)
Stage 1. Process the raw GBS data
Step 1 Parse the raw reads	- CASAVA-generated paired-end (R1, R2) files (.fastq.gz)	- Parsing summary information (.txt)	2:24
	- Barcode-ID file (.txt)	- Read length distribution summary (.txt)
		- Parsed paired-end [PE] reads (.fastq)
		- Parsed, unpaired R1 reads (.fastq)
Step 2 Trim based on quality	- Parsed PE reads (.fastq)	- High quality, parsed PE reads (.fastq)	0:10
		- High quality, parsed singletons (.fastq)
Step 3 Demultiplex	- One pair (R1, R2) of high quality files (.fastq) per library	- One pair (R1, R2) of high quality files (.fastq) per genotype	0:16
	- Barcode-ID file (.txt)
Stage 2. Build the Mock Reference
Step 4 Cluster reads and assemble the Mock Reference [MR]	- Genotype-specific PE files (.fastq)	- Mock Reference [centroids] (.fasta)	0:14^b
	- Barcode-ID file (.txt)	- Mock Reference [genome] (.fasta)
Stage 3. Map the processed reads and generate standardized alignment files
Step 5 Align with BWA-mem and process with SAMtools	- Genotype-specific high quality PE files (.fastq)	- Filtered reads (.bam)	3:36
	- Reference or MR [genome] (.fasta)	- Sorted BAM files (.sorted.bam)
	- Barcode-ID file (.txt)	- Indexed BAM files (.sorted.bam.bai)
		- Indexed reference or MR (.fasta.idx)
		- One base call alignment summary file (.mpileup) per genotype
Step 6 Parse mpileup output and produce the SNP discovery master matrix	- One base call alignment summary file (.mpileup) per genotype	- One base call alignment summary count file (.txt) per genotype	4:37
	- Barcode-ID file (.txt)	- SNP discovery master matrix (.txt)
Stage 4. Call SNPs and Genotypes
Step 7 SNP genotyping across the population	- SNP discovery master matrix (.txt)	- SNP genotyping matrix for the population (.txt)	0:04

^a The computation times presented here are specific to the particular dataset in this study.
^b Time required to build the Mock Reference using only the single most read-abundant genotype (-MR01). Using the five most read-abundant genotypes or all 48 genotypes increases the computation time for this step to 0:55 and 4:30, respectively (see Table 2).
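
The per-step times in Table 1 can be summed to estimate the end-to-end runtime of the workflow. The following is a minimal Python sketch of that arithmetic, not part of GBS-SNP-CROP itself; the names STEP_TIMES, STAGES, to_minutes, fmt, stage_totals and total_minutes are hypothetical and introduced only for illustration. The time values are copied from the table and from footnote b.

```python
# Step times (hrs:mins) copied from Table 1; Step 4 uses the -MR01 setting.
STEP_TIMES = {
    1: "2:24",  # Parse the raw reads
    2: "0:10",  # Trim based on quality
    3: "0:16",  # Demultiplex
    4: "0:14",  # Cluster reads and assemble the Mock Reference
    5: "3:36",  # Align with BWA-mem and process with SAMtools
    6: "4:37",  # Parse mpileup output, build the SNP discovery master matrix
    7: "0:04",  # SNP genotyping across the population
}

# Stage number -> steps belonging to it, as grouped in Table 1.
STAGES = {1: [1, 2, 3], 2: [4], 3: [5, 6], 4: [7]}


def to_minutes(hhmm):
    """Convert an 'h:mm' string to a number of minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def fmt(minutes):
    """Format a number of minutes as 'h:mm'."""
    return f"{minutes // 60}:{minutes % 60:02d}"


def stage_totals(step_times=STEP_TIMES):
    """Return the summed time per stage, in minutes."""
    return {
        stage: sum(to_minutes(step_times[step]) for step in steps)
        for stage, steps in STAGES.items()
    }


def total_minutes(step4_time="0:14"):
    """Total workflow time in minutes for a given Step 4 time (footnote b)."""
    times = dict(STEP_TIMES)
    times[4] = step4_time
    return sum(to_minutes(t) for t in times.values())


if __name__ == "__main__":
    for stage, minutes in stage_totals().items():
        print(f"Stage {stage}: {fmt(minutes)}")
    # Step 4 alternatives from footnote b: 1, 5 or all 48 genotypes.
    for label, t in [("-MR01", "0:14"), ("five genotypes", "0:55"), ("all 48", "4:30")]:
        print(f"Total ({label}): {fmt(total_minutes(t))}")
```

With the values above, the seven steps sum to 11:21 when the Mock Reference is built from a single genotype, and to 12:02 and 15:37 when the Step 4 times for five and for all 48 genotypes are substituted. As footnote a notes, these figures are specific to the dataset used in the study and should not be read as general benchmarks.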

ISSN: 1471-2105

BMC Bioinformatics
