PGen: large-scale genomic variations analysis workflow and browser in SoyKB
- Yang Liu†1, 2,
- Saad M. Khan†1, 2,
- Juexin Wang†2, 3,
- Mats Rynge4,
- Yuanxun Zhang3,
- Shuai Zeng2, 3,
- Shiyuan Chen2, 3,
- Joao V. Maldonado dos Santos5,
- Babu Valliyodan5, 6,
- Prasad P. Calyam3,
- Nirav Merchant7,
- Henry T. Nguyen5, 6,
- Dong Xu1, 2, 3 and
- Trupti Joshi†1, 2, 3, 8
© The Author(s). 2016
Published: 6 October 2016
With the advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of germplasm in crops for detecting genome-scale genetic variations and to apply the knowledge towards improvements in traits. To efficiently facilitate large-scale NGS resequencing data analysis of genomic variations, we have developed “PGen”, an integrated and optimized workflow using the Extreme Science and Engineering Discovery Environment (XSEDE) high-performance computing (HPC) virtual system, iPlant cloud data storage resources and Pegasus workflow management system (Pegasus-WMS). The workflow allows users to identify single nucleotide polymorphisms (SNPs) and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way.
We have developed both a Linux version in GitHub (https://github.com/pegasus-isi/PGen-GenomicVariations-Workflow) and a web-based implementation of the PGen workflow integrated within the Soybean Knowledge Base (SoyKB) (http://soykb.org/Pegasus/index.php). Using PGen, we identified 10,218,140 SNPs and 1,398,982 indels from analysis of 106 soybean lines sequenced at 15X coverage. From this analysis, we also identified 297,245 non-synonymous SNPs and 3330 copy number variation (CNV) regions. SNPs identified using PGen from additional soybean resequencing projects, bringing the total to 500+ soybean germplasm lines, have also been integrated. These SNPs are being utilized for trait improvement using genotype-to-phenotype prediction approaches developed in-house. To make NGS data easy to browse and access, we have also developed an NGS resequencing data browser (http://soykb.org/NGS_Resequence/NGS_index.php) within SoyKB that gives soybean researchers access to SNP and downstream analysis results.
The PGen workflow has been optimized for efficient analysis of soybean data through thorough testing and validation. This research serves as an example of best practices for developing genomics data analysis workflows by integrating remote HPC resources and efficient data management with ease of use for biological users. The PGen workflow can also be easily customized for analysis of data in other species.
In-depth informatics analysis of genotypic data can provide a better understanding of genotype-phenotype correlations, with applications that assist in trait improvement. To this end, many research institutions are generating large-scale sequencing datasets for crop germplasm [1, 2] to obtain a comprehensive overview of the sequence variation in these large collections. With the decreasing costs of NGS, many projects can easily generate single- and paired-end Illumina reads for hundreds to thousands of samples in a short time. These genomics datasets are large and require significant computing time for analysis. SNP and indel identification must be followed by other complex downstream analyses, including SNP annotation, copy number variation (CNV) analysis, genome-wide association studies (GWAS) and haplotype analysis. Most analyses need to be conducted on entire datasets and often need to combine multiple datasets. Few of the biological labs generating the data are equipped with the large data storage, computing resources or computing skills needed to handle such analyses in a time-sensitive fashion. These analyses take anywhere from a few days to several months, depending on the volume of NGS samples and datasets sequenced. In addition, many research institutions lack sufficient dedicated local resources for this type of analysis and usually need to work closely with informatics or computational biology collaborators to build such capacity and tap into emerging computational techniques. There is a significant need for fast, efficient and easy-to-use computational pipelines for biological researchers that use advanced techniques such as high-performance computing (HPC) and cloud storage, and that provide access to remote computing resources with the scalability to meet the demands of such research projects.
Soybean is an important economic crop and is no exception to the computational barriers just mentioned, which stem from a lack of access to advanced HPC and other NGS resources. Soybean is a great source of dietary protein and oil for human and animal consumption. The soybean community has invested a great deal of effort in both sequencing germplasm and creating phenotypic datasets, which has resulted in hundreds of resequencing datasets for both cultivated (G. max) and wild (G. soja) soybean genomes. Here we describe our recent informatics workflow and tool development, and its application to NGS datasets in soybean. To analyze these data, we developed “PGen,” a genomic variation analysis workflow using the Burrows-Wheeler Aligner (BWA) for alignment and the Genome Analysis Toolkit (GATK) for SNP and indel identification. This workflow can be run both on Linux systems, using the GitHub repository and a Pegasus environment, and online via submission through the SoyKB website [7, 8]. We have applied this workflow to resequencing data from multiple datasets, totaling 500+ soybean germplasm lines, for SNP and indel calling. All the soybean results are integrated and available for browsing via SoyKB’s new NGS resequencing data browser at http://soykb.org/NGS_Resequence/NGS_index.php. The PGen workflow can also be easily customized for other organisms and crops, and serves as a good template for reproducible bioinformatics analysis workflows with different types of NGS data.
Soybean germplasm NGS datasets
Table: Details of soybean NGS resequencing datasets generated
Columns: Dataset | Number of sequenced lines | Number of reads (millions)
Datasets: Valliyodan et al. 2016 | USB Phase I | USB Phase II | Maldonado et al. 2016
Genomic variations identification with PGen workflow
PGen workflow optimization using TACC computing resource
The PGen workflow consists of several individual tasks with diverse core and memory requirements, which were assigned after testing based on each tool’s support for multiple threads and its memory cost
Indexing of reference genome
Alignment to reference genome
Sorting SAM files
Removal of PCR duplicates
Add or replace read groups
Create realign target
Select SNPs and indels
Create genotype GVCF
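The tasks listed above can be read as a dependency graph in which each step consumes the output of an earlier one. The Python sketch below is illustrative only. The task names come from the list above, but the linear dependency order (including placing genotyping before variant selection), the `PIPELINE` dictionary and the `execution_order` helper are assumptions made for this example. It is not the Pegasus workflow definition used by PGen.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on.
# ASSUMPTION: a simple linear chain; the real workflow may branch
# per sample or per chromosome and may contain further tasks.
PIPELINE = {
    "index_reference": set(),
    "align_to_reference": {"index_reference"},
    "sort_sam": {"align_to_reference"},
    "remove_pcr_duplicates": {"sort_sam"},
    "add_or_replace_read_groups": {"remove_pcr_duplicates"},
    "create_realign_target": {"add_or_replace_read_groups"},
    "create_genotype_gvcf": {"create_realign_target"},
    "select_snps_and_indels": {"create_genotype_gvcf"},
}


def execution_order(pipeline):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(PIPELINE), start=1):
        print(step, task)
```

Expressing the steps as a graph rather than a fixed script is what lets a workflow manager decide which tasks may run concurrently and which must wait for their inputs.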
PGen workflow availability
Introduction: We provide an introduction page, which presents the structure and computing environment of the PGen workflow. A user manual and public data for testing are also provided (Fig. 3a).
Upload data: The upload data instructional page explains how users can upload raw data and a reference genome from their local machine to the SoyKB server and then on to the iPlant data store (iDS) using a FUSE mount. Successfully uploaded data are shown on the create workflow page when selecting inputs (Fig. 3b).
Create Workflow: The create workflow page connects SoyKB users to the SoyKB data folder on the iPlant data store and allows them to select raw-read FASTQ files and a reference genome FASTA file from there as inputs. A workflow is then created using the selected variant filtering criteria and computing resources, and a working directory for its output is created and shown on the workflow-monitoring page (Fig. 3c).
Monitor Workflow: Users can use the PGen workflow history and working directory lists shown on the workflow-monitoring page to check the status of workflows, which is displayed in pie charts, and to view log histories, which are saved to track error messages for any failed workflows. A statistical summary of the computing resources utilized by tasks is generated for all successful workflows (Fig. 3d). This functionality is enabled by linking the PGen workflow in SoyKB with the Narada Metrics system. Statistics and workflow-monitoring information are shared in real time via RESTful (representational state transfer) application programming interfaces (APIs) developed for this purpose. Narada Metrics is a software-defined measurement and performance monitoring framework consisting of a Central Intelligence System (CIS) and a number of Measurement Point Appliances (MPAs). MPAs run on remote distributed resources, such as TACC and the Information Sciences Institute (ISI), and are controlled by the CIS to execute workflows on these resources, monitor workflow status, collect performance data and send it back to the CIS. The CIS is a web service that provides user interfaces for scheduling workflows and viewing their status.
Workflow Results: Users can view and download BAM and VCF files of final results as outputs for further merging and conducting downstream analysis when they access the workflow result page (Fig. 3e).
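The workflow-monitoring page described above presents job states as pie charts. As a rough sketch of that kind of summary, the Python below tallies job states into percentage shares. The state names and the `summarize_states` helper are hypothetical and chosen for illustration. They do not reflect the Narada Metrics APIs or the status codes used by Pegasus.

```python
from collections import Counter


def summarize_states(job_states):
    """Return the percentage share of each job state, as a pie chart would show."""
    counts = Counter(job_states)
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing submitted yet, so there is nothing to chart
    return {state: round(100.0 * n / total, 1) for state, n in counts.items()}


if __name__ == "__main__":
    # HYPOTHETICAL states for the jobs of a single workflow run.
    states = ["done", "done", "done", "running", "failed"]
    print(summarize_states(states))
```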
SoyKB NGS resequencing data browser
Genomic variations for soybean germplasm lines
Table: Summary of results for NGS resequencing datasets analyzed with the PGen workflow
Columns: Dataset | # of sequenced lines | # of SNPs | # of Indels | # of Non-synonymous SNPs | # of CNVs
Datasets: USB Phase I | USB Phase II
Table: Comparison of running time of the PGen workflow for one sample using different computing resources
Cumulative job wall time: 8 h, 29 mins | 9 h, 11 mins | 3 h, 25 mins
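To compare the cumulative wall times listed above, it helps to convert them to a common unit. The following Python sketch parses strings in the format used above and expresses each time relative to the shortest one. The `wall_time_minutes` and `relative_to_fastest` helpers are names invented for this example, and the sketch does not assign the times to particular computing resources.

```python
import re

# Matches strings such as "8 h, 29 mins" or "3 h 25 min".
_WALL_TIME = re.compile(r"(\d+)\s*h,?\s*(\d+)\s*mins?")


def wall_time_minutes(text):
    """Convert a wall-time string such as '8 h, 29 mins' to total minutes."""
    match = _WALL_TIME.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized wall time: {text!r}")
    hours, minutes = (int(group) for group in match.groups())
    return hours * 60 + minutes


def relative_to_fastest(times):
    """Return each wall time divided by the shortest one (assumed non-zero)."""
    minutes = [wall_time_minutes(t) for t in times]
    fastest = min(minutes)
    return [round(m / fastest, 2) for m in minutes]


if __name__ == "__main__":
    reported = ["8 h, 29 mins", "9 h, 11 mins", "3 h, 25 mins"]
    for text, ratio in zip(reported, relative_to_fastest(reported)):
        print(text, wall_time_minutes(text), ratio)
```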
SoyKB NGS resequencing data browser
Introduction: The introduction tab provides details of different soybean datasets generated from multiple resources. We have analyzed more than 500 soybean lines using the PGen workflow (Fig. 4a).
Summary: The summary tab contains plant genotype information (PI name) of sequenced soybean lines as well as statistics related to raw datasets. It provides the total number of raw reads, mapping rates, SNPs and indels identified (Fig. 4b) and other details for every germplasm line.
FastQC: The FastQC tab provides users access to the data quality results for every line that was generated using FastQC (Fig. 4c). Reports are available for both browsing in a webpage as well as downloading as a zipped file.
SNP: The SNP tab provides access to the list of filtered SNPs from all analyzed soybean datasets. This tab allows users to search SNPs by selecting a chromosome and entering the start and end coordinates for the region of interest (Fig. 4d).
Indel: The Indel tab provides access to the list of filtered indels from all analyzed soybean datasets. Indels can also be searched by using a chromosome and coordinates for the region of interest, similar to the SNPs search.
SnpEff Annotation: The SnpEff tab provides users access to the SNP annotation results computed on the filtered SNP and indel results using the SnpEff tool. The annotation page displays variant regions on the chromosome, synonymous/non-synonymous effects, amino acid changes, and transcript gene names along with access to the SnpEff HTML summary page (Fig. 4e).
CNMOPS: The CNMOPS tab contains results of the CNV analysis generated using the cn.MOPS tool. This page displays the identified CNV regions (gain or loss) of each soybean line in both searchable tabular and PDF formats (Fig. 4f).
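The SNP and Indel tabs described above search by chromosome and coordinate range. A minimal sketch of such a region query over in-memory records follows. The `Variant` record, its fields, and the chromosome name and positions in the example are all made up for illustration. They do not describe SoyKB's database schema or any real soybean variants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str  # chromosome name
    pos: int    # 1-based position
    ref: str    # reference allele
    alt: str    # alternate allele


def search_region(variants, chrom, start, end):
    """Return variants on `chrom` with start <= pos <= end, sorted by position."""
    if start > end:
        raise ValueError("start must not exceed end")
    hits = [v for v in variants if v.chrom == chrom and start <= v.pos <= end]
    return sorted(hits, key=lambda v: v.pos)


if __name__ == "__main__":
    # HYPOTHETICAL records; not real soybean variants.
    records = [
        Variant("Chr01", 2500, "C", "T"),
        Variant("Chr01", 1500, "A", "G"),
        Variant("Chr02", 1800, "G", "GA"),
    ]
    for hit in search_region(records, "Chr01", 1000, 2000):
        print(hit)
```

A linear scan is enough for a sketch. A browser serving millions of variants would more plausibly rely on an index over chromosome and position, though the source does not say how SoyKB implements the search.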
Large-scale genomic studies in biological organisms face several challenges, including data storage, data transfer, computing time and availability of computing resources. Genomic variation studies on germplasm datasets in crops are no exception. Advances in high-performance computing and cloud storage technology can provide solutions for such challenges, but are generally out of reach for typical biological researchers. By developing the PGen genomic variation analysis workflow and making it available, we have provided an efficient and easy-to-use analytics solution that addresses biological users’ needs for large-scale resequencing data analysis using HPC and cloud resources. For biological researchers with less computing experience, the web-based implementation in SoyKB offers the same scalable resources and solutions in an easy-to-use manner. The SoyKB NGS resequencing browser platform and online PGen workflow system allow users to easily submit analyses and access results via webpages. The workflow utilizes HPC resources from XSEDE and cloud storage from iDS to conduct NGS resequencing analysis and can be customized to work with other organisms as well. The workflow system is very flexible, and additional local or remote computing resources can be easily incorporated. PGen can currently be run using three computing resources. The first is the Pegasus resources of the Information Sciences Institute (ISI). The second is the Stampede and Wrangler high-performance computing clusters of TACC. The third is the XSEDE gateway allocation, which has been set up for SoyKB users of the PGen online workflow. We are also building a fourth computing resource locally, utilizing HPC resources at the University of Missouri-Columbia, to provide PGen execution. More computing resources can be added as they become available to users in the future.
PGen, together with its source code, is freely available to academic users via GitHub. It outlines best practices for efficient utilization of distinct and unique cyberinfrastructure (CI) resources available through multiple providers, with an emphasis on creating extensible and scalable workflows that can be easily modified and deployed. A similar approach can be utilized for designing many other bioinformatics analysis pipelines using the Pegasus workflow management system (Pegasus-WMS).
We thank the National Science Foundation for supporting the Extreme Science and Engineering Discovery Environment (XSEDE) through grant number ACI-1053575.
This article has been published as part of BMC Bioinformatics Volume 17 Supplement 13, 2016: Proceedings of the 13th Annual MCBIOS conference. The full contents of the supplement are available online at http://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-13.
The sequencing work was supported by the Missouri Soybean Merchandising Council (Grant #368) and the United Soybean Board (Grant #1320-532-5615). Publication of this article was funded with internal research support.
Availability of data and materials
The PGen workflow is available on GitHub (https://github.com/pegasus-isi/PGen-GenomicVariations-Workflow). The web-based workflow submission is available through SoyKB (http://soykb.org/Pegasus/index.php). MSMC sequencing data are deposited in the NCBI short read archive under accession code SRP062245.
YL, SK and JW designed the workflow and worked closely with MR for the Pegasus workflow development. YZ, SZ and SC developed the web-based interface and PGen analytics capacity within SoyKB. JV, BV and HN provided data. PC provided cloud support and NM provided iPlant data store support. TJ and DX provided guidance for the study. TJ was involved in planning, drafting and supervision of the entire project. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
The research protocol was approved by the Ethical committee of the participating universities and all subjects have provided written informed consent.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Zhou Z, Jiang Y, Wang Z, Gou Z, Lyu J, Li W, Yu Y, Shu L, Zhao Y, Ma Y, Fang C, Shen Y, Liu T, Li C, Li Q, Wu M, Wang M, Wu Y, Dong Y, Wan W, Wang X, Ding Z, Gao Y, Xiang H, Zhu B, Lee SH, Wang W, Tian Z. Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean. Nat Biotechnol. 2015;33(4):408–14.
- Duitama J, Silva A, Sanabria Y, Cruz DF, Quintero C, Ballen C, Lorieux M, Scheer B, Farmer A, Torres E, Oard J, Tohme J. Whole genome sequencing of elite rice cultivars as a comprehensive information resource for marker assisted selection. PLoS One. 2015;10(4):e0124617.
- Valliyodan B, Qiu D, Patil G, Zeng P, Huang J, Dai L, Chen C, Li Y, Joshi T, Song L, Vuong TD, Musket TA, Xu D, Shannon JG, Shifeng C, Liu X, Nguyen HT. Landscape of genomic diversity and trait discovery in soybean. Sci Rep. 2016;6:23598.
- Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60.
- McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297–303.
- Deelman E, Singh G, Su MH, Blythe J, Gil Y, Kesselman C, Mehta G, Vahi K, Berriman GB, Good J, Laity A. Pegasus: a framework for mapping complex scientific workflows onto distributed systems. Sci Program. 2005;13(3):219–37.
- Joshi T, Fitzpatrick MR, Chen S, Liu Y, Zhang H, Endacott RZ, Gaudiello EC, Stacey G, Nguyen HT, Xu D. Soybean knowledge base (SoyKB): a web resource for integration of soybean translational genomics and molecular breeding. Nucleic Acids Res. 2013. 905.
- Joshi T, Patil K, Fitzpatrick MR, Franklin LD, Yao Q, Cook JR, Wang Z, Libault M, Brechenmacher L, Valliyodan B, Wu X, Cheng J, Stacey G, Nguyen HT, Xu D. Soybean knowledge base (SoyKB): a web resource for soybean translational genomics. BMC Genomics. 2012;13(1):1.
- Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, Hyten DL, Song Q, Thelen JJ, Cheng J, Xu D, Hellsten U, May GD, Yu Y, Sakurai T, Umezawa T, Bhattacharyya MK, Sandhu D, Valliyodan B, Lindquist E, Peto M, Grant D, Shu S, Goodstein D, Barry K, Futrell-Griggs M, Abernathy B, Du J, Tian Z, Zhu L, et al. Genome sequence of the palaeopolyploid soybean. Nature. 2010;463(7278):178–83.
- Goodstein DM, Shu S, Howson R, Neupane R, Hayes RD, Fazo J, Mitros T, Dirks W, Hellsten U, Putnam N, Rokhsar DS. Phytozome: a comparative platform for green plant genomics. Nucleic Acids Res. 2012;40(D1):1178–86.
- Andrews S. FastQC: a quality control tool for high throughput sequence data. Reference Source. 2010.
- Picard tools. [http://broadinstitute.github.io/picard/].
- Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D, Matasci N, Wang L, Hanlon M, Lenards A, Muir A, Merchant N, Lowry S, Mock S, Helmke M, Kubach A, Narro M, Hopkins N, Micklos D, Hilgert U, Gonzales M, Jordan C, Skidmore E, Dooley R, Cazes J, McLay R, et al. The iPlant Collaborative: cyberinfrastructure for plant biology. Frontiers in Plant Science. 2011;2:34.
- Cingolani P, Platts A, Wang LL, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly. 2012;6(2):80–92.
- Klambauer G, Schwarzbauer K, Mayr A, Clevert DA, Mitterecker A, Bodenhofer U, Hochreiter S. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate. Nucleic Acids Res. 2012;40(9):e69.
- Langewisch T, Zhang H, Vincent R, Joshi T, Xu D, Bilyeu K. Major soybean maturity gene haplotypes revealed by SNPViz analysis of 72 sequenced soybean genomes. PLoS One. 2014;9(4):e94150.
- Towns J, Cockerill T, Dahan M, Foster I, Gaither K, Grimshaw A, Hazlewood V, Lathrop S, Lifka D, Peterson GD, Roskies R, Scott J, Wilkins-Diehr N. XSEDE: accelerating scientific discovery. Computing in Science & Engineering. 2014;16(5):62–74.
- Texas Advanced Computing Center (TACC). [http://www.tacc.utexas.edu].
- Calyam P, Mishra A, Antequera RB, Chemodanov D, Berryman A, Zhu K, Abbott C, Skubic M. Synchronous big data analytics for personalized and remote physical therapy. Pervasive and Mobile Computing. 2015;28:3–20.
- Song Q, Hyten DL, Jia G, Quigley CV, Fickus EW, Nelson RL, Cregan PB. Development and evaluation of SoySNP50K, a high-density genotyping array for soybean. PLoS One. 2013;8(1):e54985.
- Wang J, Joshi T, Valliyodan B, Shi H, Liang Y, Nguyen HT, Zhang J, Xu D. A Bayesian model for detection of high-order interactions among genetic variants in genome-wide association studies. BMC Genomics. 2015;6(1):1.
- Maldonado Dos Santos JV, Valliyodan B, Joshi T, Khan SM, Liu Y, Wang J, Vuong TD, de Oliveira MF, Marcelino-Guimarães FC, Xu D, Nguyen HT. Evaluation of genetic variation among Brazilian soybean cultivars through genome resequencing. BMC Genomics. 2016;17(1):1.