For the detection and the visualization of recombination, we employed a three-step method. First, we aligned the raw sequence reads to the reference genome (S288C and/or SK1). Second, we performed variants calling to identify the polymorphic sites and to characterize the normalized allelic frequency of the individual markers in the parental and recombinant samples. Third, we plotted the allelic variation to provide a quick and intuitive way to map the recombination events.
Dataset
To test UGDR, we downloaded publicly available Illumina paired-end reads sequencing data of recombined and potentially recombined strains from the NCBI Sequence Read Archive (SRA) (Additional file 1: Table S1). The downloaded set includes: four-spore tetrads of haploid strains and two diploid mother and daughter RTG strains (RTG17M & RTG17D,) issued from a SK1/S288C hybrid background [11], and two clinical triploids strains (YJM1138 and YJM114) and two clinical tetraploids strains (YJM958, TJM959) [13]. It is important to note that the triploids and tetraploids are sequenced by Zhu et al. without sporulation or induced recombination. The S288C fasta sequence was downloaded from the Saccharomyces Genome Database and the novel SK1 fasta assemble was provided by Liti [5, 14, 15].
Genome alignment and variants calling (Galocal)
The key steps for studying large-scale dataset are reads mapping and identification of potential polymorphic sites (SNP calling). To manage the alignment of the reads and call the variants, we assembled the necessary open source tools in a custom bash script named Galocal (Fig. 1a). First, we mapped all the reads to the reference using the gapped Burrows-wheeler Aligner (BWA) v.0.7.15 [14]. Next, to achieve an accurate variant detection, we conducted a different post-alignment analysis that included elimination of unmapped reads using SAMtools v.1.3 [14] and removal of PCR duplicate using Picard tools v.2.4.1 (https://github.com/broadinstitute/picard) [16]. After that, we used the Bayesian algorithms: FreeBayes framework-1.0.2 [17] to call variants and the DepthOfCoverage tool from the GATK v3.5 framework [18] to determine the depth of coverage (DoC). Finally, the output files (variants files [.vcf] and depth of coverage files [.doc]) were further used to analyze the variations in allelic frequency reflecting the region of recombination.
Analysis of allele frequency variation and identification of recombination region (UGDR)
When describing the allele frequency variation in this study, we use the term “ratio variation”. To analyze ratio variation, we investigated and compared each chromosome’s copy number that supported the alleles in both the potentially recombined strains and reference strain. To start the exploration of the chromosome’s copy involved in the recombination, it must be noted that the allele (SNP) ratios are ploidy dependent (Additional file 1: Table S2). Indeed, in haploid strains, the heterozygous allele ratio is equal to 1. Meanwhile, in diploid strains, the heterozygous allele ratio is equal to 0.5 and the homozygous allele ratio is equal to 1. In triploid and tetraploid strains, the heterozygous allele ratios are 0.33, 0.67 and 0.25, 0.5, 0.75, respectively [13]. To calculate the allele ratios from the variants calling files (VCF), we involved two main constants: (1) the observed number of alternative alleles (AO), (2) and the total number of bases sequenced and aligned at a given reference base position (DP). Therefore, the ratio represents the fraction of an alternative allele’s number in the total number of bases at a given position (\(Ratio= AO/ DP\)). Further, the allele ratio is tightly dependent on the quality and counts of the reads at each position; so, the selected alleles for the study are those supported by enough coverage (minimum 20) and an average Phred quality score \(>200\) [19] and AO to \(>20\) for an 80\(\times\) sequencing coverage (cutoffs that the users can adopt to their data). The ratio was calculated, for each allele passing the filters, and summarized with other information such as position and type of the polymorphism in two distinct files called “parental alleles file”, for the data issued from the reference stain, and “recombined alleles file” for the potentially recombined strain (Fig. 1b).
Next, to detect recombination, we considered the allelic ratio as: (1) invariable if the allele ratio was equal in the parental (P) and potentially recombinant (R) strains (Additional file 1: Fig. S1B), and (2) variable if the P alleles ratio was significantly different and followed the predicted likelihood of ratio variation summarized in (Additional file 1: Fig. S1-A and Table S3). For example, during recombination of a triploid cell, the variants with a parental ratio of 0.33 may vary to 0.66 or 0 (Additional file 1: Fig. S1c-top). If the ratio changes from 0.33 to 0 or 1, with a loss of depth of coverage this indicates a loss of a chromosome (3n to 2n—Additional file 1: Fig. S1D-top). In contrast, If the ratio changes from 0.33 to 0.5 or 0.25 with a gain of depth of coverage this indicates a gain of a chromosome (3n to 4n—Additional file 1: Fig. S1D-top). The outcome of this step is stored in two files called “variants alleles file” for information on the alleles with variants and “invariant alleles file” for those with invariant ratios. To robustly call the recombined region, we only retained those delimited by two break points holding at least 4 adjacent alleles that have undergone a variation of their allelic ratio. Nonetheless, regions that contain additional alleles with an invariable ratio were considered when a pair of two variant alleles were separated by a maximum of two invariant ones.
Lastly, the UGDR pipeline creates the “Region of recombination” file that contains all the information about the recombined regions (start and end as well as the number of variable alleles composing the region) and plots the recombination map.
Calculate the normalized depth of coverage (NDoC)
To evaluate the recombination region detection analysis, several features like Depth of Coverage (DoC) and mapping quality are important. The analysis of DoC is essential to detect events like insertion, deletion and copy number variation (CNV) between and along the chromosomes during recombination. We used the program DepthOfCoverage of the GATK toolkit [18] to obtain the base-level coverage of the genome (Fig. 1c). After that, to correct the technical DoC variation between parent and recombined cells, we calculated a normalized depth of coverage by dividing the base coverage of the recombinant cell by the base coverage of the parent and averaged the DoC by a one Kb sliding window along the genome (A cutoff that the users can adapt to their data). At the end of this step, the normalized coverage is written to a “Samplename_depthofCove.txt” file and the Depth of coverage plot is generated.
Display of recombination and coverage profiles
A genome-wide recombination profile is generated at the end of the process in which the parent and the recombinant genomes are plotted in parallel, and the alleles are represented in different colors according to their allele frequency status. The invariable genotypes (same ratio as the parent) are represented by gray markers (vertical lines), LOHs: ratio = 0, ratio = 1 are represented by red and blue markers, respectively and the heterozygous recombination events s are represented by black markers (e.g. 0.33–0.66 the marker at this location is black). To help identify chromosomal copy changes, the normalized depth of coverage of each chromosome is represented in parallel to the allele frequency where orange circles denote potential chromosomal loss and green dots denote potential chromosomal gain.