Reference genotype matrix construction
For construction of reference genotype matrix, we utilized previously published soybean 222 validation germplasm consisting of 83 cultivars (Glycine max), 79 wild accessions (Glycine soja), and 66 breeding lines including F2 and RIL with some redundancy [7]. This dataset consists of soybean germplasm from Korea, China and Japan, and few western accessions and it would be advantageous for construction of reference genotype matrix to cover the allele space as large as possible because the North-eastern Asia counties are presumably considered as the origin of soybean domestication [16]. Furthermore, the dataset is originally designed to develop Axiom SoyaSNP array with known clusters such as wild/cultivating soybean, landraces/breeding population, and some redundant accessions for validating SoyaSNP array-based genotyping. These known clusters are also beneficial to validate our clustering result.
First of all, we filtered out the SNP positions with minor allele frequency (MAF). We selected a strong criteria, 0.2, to leave only common alleles that would provide strong signal for clustering and filtered out the SNP positions with MAF lower than the criteria. After filtering out the SNP positions, we determined 61,875 SNP positions of 222 accessions (Additional file 1). Diploid genotypes of filtered SNP positions were scored and transformed into 20 principal components (PC) with sum of explained variance ratio, 0.52, using sklearn package to build reference genotype matrix [17]. It resulted in 20 × 222 reference genotype matrix.
This light weighted genotype matrix could highly reduce computational burden of hosting server for hierarchical clustering trading off the information for clustering. However, the dendrogram based on clustering with the reference genotype matrix showed consistent phylogenetic relationship with known clusters such as wild/cultivated soybean, landrace/breeding populations, and redundantly assigned accessions (Fig. 1). Most importantly, the breeding populations from artificial crossing such as WI, HI, TS, and WH were well clustered each other. This result validate the PCA-transformed reference genotype matrix can cluster the soybean accessions precisely.
Identification of similar accessions to query VCF file
To test whether the PCA-transformed reference genotype matrix can retrieve similar accession to query VCF files, additional VCF files were prepared from the publicly accessible NGS reads with more than 10X coverage in the National Center for Biotechnology Information (NCBI) sequence read archive (SRA), and the National Agricultural Biotechnology Information Center (NABIC) in Korea (Additional file 2). Read alignments were implemented with software BWA [18] followed by samtools, bcftools [19], and sambamba [20] as following.
# read mapping
> bwa mem Gmax_275_v2.0.fa sample_1.fastq sample_2.fastq -o out.sam
> samtools fixmate -O bam out.sam out.bam
> sambamba sort -o out.sort.bam out.bam
# SNP calling
> bcftools mpileup -Ob -o out.bcf -f Gmax_275_v2.0.fa out.sort.bam # or bamfiles
> bcftools call -vmO z -o out.vcf.gz out.bcf
The single VCF files were converted into numeric scores by pre-trained PCA parameters and clustered with the reference genotype matrix. The pre-trained PCA parameter was calculated when we construct the reference genotype matrix. We will re-calculate the PCA parameter when we update the reference genotype matrix. We implemented hierarchical clustering with method, ‘complete’, by Scipy package [21]. Resulting clusters were visualized into dendrogram with python matplotlib package [22]. The resulting panel displays the hierarchical clustering dendrogram and genotype distance values within sample cluster group.
Glycine soja accession (SRR1533201, Alias1: IGDB-TZX-7991, Alias2: ZJ-Y282, Origin: Zhejiang, China) was mapped to wild soybean cluster consisting of the accessions from China [23]. The closest accession was PI464937B of which the origin is Jiangsu in China. Glycine max accession (SRR1533345, Alias1: IGDB-TZX-012, Alias2: PI548379, Origin: Heilongjiang, China) was mapped near soybean cultivar Clark. Its breeding pedigree is started from the cultivar Mandarin (PI548379) whose origin is also Heilongjiang, China. Another G. max accession (NN-1503-000001, NABIC) known as cultivar, “Baegun”, was mapped to soybean cultivar cluster and the closest cultivar was “Kwangkyo”, which is known as one of parents of “Baegun” [24]. These results suggest that our approach is working properly and can estimate parental lines if they are already in reference genotype matrix (Fig. 2).
Continuous improvement of reference genotype matrix by accumulating public resequencing data
With increasing number of soybean resequencing projects, it would be practical if we can integrate multiple sequencing deposits from public DBs into our reference genotype matrix. To test the applicability of pre-trained PCA parameters to additional Glycine max set that are not included in our reference genotype matrix (eg. Western soybean germplasm), we downloaded Canadian soybean accessions that are publicly available in NCBI SRA (Additional file 2) [25]. After SNP calling of the Canadian cultivars, we transformed the genotypes into numeric scores using pre-trained PCA parameters and integrate them with our reference genotype matrix. The resulting tree showed that the Canadian accessions are mapped near the “Mapple Presto” that has been maintained at Rural Development Administration in Korea with slight different name to “Maple Presto” in the Canadian accessions (Fig. 3). This shows that additional soybean accessions from other projects can be easily integrated into the reference genotype matrix for continuous update.
Web interface of soybean-VCF2Genomes
We built our web application, Soybean-VCF2Genomes, using Django (Version 1.5, https://djangoproject.com) and semantic UI (https://semantic-ui.com/) (Fig. 4). Web interface is simply designed for user to choose the reference genotype matrix that currently two matrix are available; default 222 sample matrix and default + Canadian sample matrix. User can upload their single sample VCF file via file upload form. The result panel will display the image of hierarchical clustering tree in high resolution with query position as well as the distance value table. Total analysis time would be ~ 5 min after completion of VCF file upload.
Utility and discussion
Soybean-VCF2Genomes identifies similar soybean accessions in reference genotype matrix for input VCF file with more than 10X genome coverage. The reference genotype matrix would be updated with the public NGS data of representative soybean germplasm. Currently ~ 1800 soybean accessions are available in NCBI SRA and more and more accessions are expected to be deposited in near future. However, the meta information of the deposited data is not always in ideal format and it is necessary to curate the labels (eg. name of cultivar, germplasm ID) of the accessions and select representative accessions according to the geographical origin and known pedigrees to improve our reference genotype matrix. The efforts for continuous update will eventually allow label identification of query sample rather than similar accession retrieval.
The identification relative phylogenetic position among the available soybean germplasm space would be beneficial to estimate rough phenotype of unknown samples based on the known phenotype information of the closest accession in reference genotype matrix. This would be also quick solution when researchers are trying to determine the rough origin of the wild or landrace accessions collected from local regions for many purposes such as genetic resource collection and import/export quarantine. Also, this will be useful for validation of commercial cultivars to prove its genomic background. Moreover, breeders can estimate relative distance of their crossbreeding combinations based on the reference genotype matrix to determine the plan of annual field trial. Our web application accepts directly VCF file which the sequencing service providers usually returns to customers. Hence, it is easy-to-use without any bioinformatic knowledge.
One limitation is that Soybean-VCF2Genomes currently supports only single sample VCF file due to the hosting burden and file transfer issues. As it contains only variant information, the compressed single sample VCF file size varies from 70Mbytes to 150Mbytes. However, the multi-sample VCF files (> 10 samples) would be too large to transfer as it reaches several hundreds or thousands Mbytes. Supporting multi-sample VCF file would be possible if we can host Soybean-VCF2Genomes at cloud server that support fast file transfers and flexible storage and memory.