Statistical identification of highly variable genomic regions in the human population
Here we present SVhound, a tool to predict potential regions where additional Structural Variation (SV, defined as genomic variation greater than 50 bp) can be expected if more genomes were sequenced.
In short, SVhound partitions a genome into non-overlapping windows. For each window, SVhound counts the number of different SV-alleles that occur in a sample of n genomes (see “Methods”). Based on the number of distinct SV-alleles, SVhound predicts regions that can potentially harbor new structural variants (clairvoyant SV, clSV) by estimating the probability of observing a new SV-allele (see “Methods”). Note these are not SV that are detected within the same sample set by deeper coverage or utilization of long reads, but SV that belong to not yet sequenced samples. Thus, clairvoyant SVs (clSVs) are defined as previously undetected SV of unknown genotype. SVhound assigns probabilities to each region to find a clSV. Thus, regions with a high probability will produce more SV if more samples are sequenced.
Figure 1A exemplifies this for three windows and a sample of n = 100 genomes. In windows w1, w2, w3, we detected k = 3, 5, 2 SV-alleles, leading to diversity parameter estimates \(\theta \left( {w_{1} } \right) = 0.430, \theta \left( {w_{2} } \right) = 0.948, \theta \left( {w_{3} } \right) = 0.204\) and the probabilities to find a clSV in the respective windows, if an additional genome or sequence from the respective window is sequenced, equal \(p_{new} \left( {w_{1} } \right) = 0.00430,p_{new} \left( {w_{2} } \right) = 0.009390, p_{new} \left( {w_{3} } \right) = 0.00205.\)
To investigate the power of SVhound to predict clSVs and to study the influences of the window-length and sample size, we randomly sub-sampled 50 (2.00%), 100 (4.00%), 500 (19.97%) and 1000 (39.34%) human genomes from the 2504 genomes of the 1KGP [20] for a variety of window lengths (5, 10, 50, 100, 200, 500 and 1000 kbp). For each of the 28 combinations of window lengths and sample sizes we compared the \(p_{new}\) estimates with the fraction \(f_{undetected}\) of SV-alleles that do not occur in the respective sub-sample, but are present in the full 1KGP data (see “Methods”).
Figures 1B displays the association between \(p_{new}\) and \(f_{undetected}\) for sub-samples of size n = 100 (Fig. 1B top panel) and n = 1000 genomes (Fig. 1B bottom panel) and window lengths of 10kbp and 100kbp, respectively. We observed that the window length had a bigger impact on the performance of SVhound by evaluating \(p_{new}\); for example the correlation coefficient (r) for 10kbp window is r = 0.3976 and r = 0.1698 for 100 and 1000 genomes respectively (Fig. 1B top panel), while for 100kbp window the performance of SVhound greatly improves with r = 0.8519 for 100 genomes and r = 0.9524 for 1000. We also noticed that the sample size only improved the correlation coefficient for window lengths of at least 50kbp. The scatterplots of the 28 window-sample size combinations are shown in Additional file 1: Fig. S1.
While the above analysis was based on one simulation, we performed 100 simulations for each of the 28 parameter combinations. Additional file 1: Fig. S2 and Additional file 1: Fig. S3 show the distribution of the correlation coefficients, the coefficients of determination (\(r^{2}\)) and the slopes for the 100 simulations and the observations exemplified in Fig. 1B are corroborated.
Additional file 6: Table S1.1 shows the average correlation coefficients for the 100 simulations for each of the 28 window-sample size combinations. If the window length is large and the sample size is large then we observe a high correlation between \(p_{new}\) and \(f_{undetected}\), with large window lengths we have more data to estimate the model parameter and thus the predictions improve. But not only the correlation is high for large windows, also the slope of the regression line approaches one with increasing sample size and window length (Additional file 6: Table S1.2). This indicates that \(p_{new}\) is indeed a good predictor of \(f_{undetected} .\)
We note that, with increasing window length \(p_{new}\) increases (see also Fig. 1C), which can be explained with the infinite allele assumption almost being met and thus the probability to find clSVs increases. Contrary, the increase in sample size has the opposite effect (Additional file 1: Fig. S4). With increasing window length the chances also increase to find SV-alleles that occur exactly once, high numbers of such singletons will increase the diversity parameter, \(\theta ,\) and subsequently \(p_{new}\) (see “Methods”). However, with larger window lengths the resolution and thus the genomic location of the predicted additional SV-alleles is reduced.
We further investigate what drives the increase in predictiveness with the increase of window length. We note that for small window lengths the average number of SV-alleles per window was low and thus affects the diversity-parameter estimation. Additional file 1: Fig. S1 shows that 100kbp was the smallest window length were an increase in sample size improves the correlation (\(r^{2}\)) and the slope approaches one. We computed the average number of SV-alleles for the 100kbp window using the whole 1KGP dataset and found that genome-wide we have on average 10 SV-alleles per window, which we use in all following analyses to estimate the appropriate window size to each dataset.
Identification of polymorphic candidate regions across 2504 human genomes from the 1000 genome project
We applied SVhound to the 2504 genomes of 1KGP SV calls to identify likely regions (loci) with clSV. SVhound automatically calibrated the window length to 100kbp. The human genome was then partitioned into 18,397 windows from which we analyze the top candidate loci, representing 1% of the windows with the highest probability of detecting clSVs (\(p_{new} \ge 0.34\%\)). Figure 2A shows the probabilities of detecting a clSV for each window. The red dots mark the top 1% (188) windows with the highest \(p_{new}\) for clSVs (here thereafter candidate windows). The most noticeable candidate window is located on chromosome 15 with \(p_{new}\) = 25.77% of detecting a clSV if one new sample is added. The remaining windows with \(p_{new} < 0.34\%\) are not considered in the analysis (in dark/light gray).
We are particularly interested where in the human genome the 188 candidate windows occur. To achieve this, we overlapped the candidate windows to several annotation databases. First, we investigate whether these candidate windows are identified only in intergenic regions or if these windows are actually preferentially located near genes. As windows are large enough, each window can overlap with more than one annotated element. We found 107 candidate windows that overlap with 204 protein coding genes (Additional file 6: Table S2), 148 candidate windows overlapping non coding genetic elements (Additional file 1: Fig. S5) and 24 windows in intergenic regions. Next, to understand the biological role of the 204 genes we performed an enrichment analysis with PANTHER [21], and found enrichment for biological processes related to: cellular detoxification of nitrogen compound, xenobiotic catabolic process, interferon-gamma-mediated signaling pathway, regulation of immune response and sensory perception of smell (Additional file 6: Table S2 and Additional file 6: Table S3). The noticeable candidate window that we observed on chromosome 15 contains two olfactory receptor proteins and four olfactory receptor pseudogenes (Fig. 2A).
Next, we investigate whether SVhound is suggesting regions containing repeats that are known to show many structural variants [9]. For this we analyzed the overlap of the candidate windows with annotated repeat elements from the RepeatMasker track [22], simple tandem repeat elements [23] and segmental duplications [24]. First, we found that the LINE and LTR repeat families were the most often observed in candidate windows, with the L1-LINE repeat [23] being the most abundant, followed by LTR-ERV1 (Additional file 6: Table S4.1 and Additional file 6: Table S4.2). Next, for the case of simple tandem repeat elements, we found all but one candidate window overlapped with at least one of these elements, which coincides with their abundance in the human genome. We observed that these ubiquitous elements were not present more abundantly within the candidate windows when compared to the rest of the genome (T-test of difference in means p-value = 0.314, Kolmogorov–Smirnov test p-value = 0.2378, Additional file 1: Fig. S6). Finally, for the segmental duplications [24] we found 101 candidate windows overlap with at least one segmental duplication, from which 88 overlapped with multiple ones (Additional file 6: Table S5).
Next, we wondered if SVhound actually only identifies repeats or indeed regions that will harbor undetected SV. Low complexity repeats, for example, are often the cause of falsely identified SV and thus maybe do not always harbor these clSV. To assess this we focused on non repetitive regions such as the high confidence regions defined by the Genome in a Bottle Consortium (GIAB, [16, 25]) representing reliable regions for structural variation detections using short reads (e.g. outside of segmental duplications, low mapping quality regions) and thus potential targets for experimental validation. We found that only 18 out of the 188 candidate windows (9.57%) did not overlap with the high confidence regions annotated by the GIAB Consortium [16] (Additional file 6: Table S6). Therefore, SVhound indeed reports loci with biological significance rather than enriching for artifacts or regions known to be variable in the genome (e.g. intergenic). Finally, we compared the results of SVhound to two different approaches of investigating SV in a population: (1) a classic approach of detecting SV hotspots in the genome and (2) a comparison to rare alleles (MAF < 1%, see “Methods”).
For the first case, we used the hotspot analysis of Lin and Gokcumen [26], which divided the genome in 100 kb windows and then we used the same coordinates to identify candidate windows with SVhound. We found that 83 windows were considered both a hotspot and a candidate window by SVhound (34.6%, Additional file 1: Fig. S7). Moreover, 157 (65.4%) of the candidate windows were not cataloged hotspots, thus showing that SVhound detects both hotspots and non-hotspots as candidates for further analysis. This result is not surprising, because SVhound computes the probability to find a new SV. This probability depends on the number of SVs in the window and the sample size (see “Methods”) in a non-linear way. For the second approach, we performed a comparison between rare observed SVs (low frequency SV, MAF < 1%) and the candidate windows proposed by SVhound. We found 22,386 SV that fall in the category of having “rare alleles”, from which only 967 of such “rare alleles” overlapped with a candidate window. These results clearly indicate a difference between the results one can expect from these two approaches when compared to SVhound.
Next, we applied SVhound to identify SV confined to particular human ancestries defined in the 1000 genomes project (African (AFR), Admixed American (AMR), European (EUR), East Asian (EAS) and South Asian (SAS)). We split the 2504 genomes into five so-called “super-populations” (661 AFR, 347 AMR, 503 EUR, 504 EAS, 489 SAS) and scanned for candidate windows by repeating the previous analysis for each ancestry. Additional file 1: Fig. S8 shows the candidate windows (top 1% with highest \(p_{new}\)) for each of the five studied populations. From the collection of all top 1% candidate windows (total number of distinct windows: 468) we investigated those present in a single population (ancestry-specific windows) and thus identified potential regions of high polymorphism specific to a particular population; and those that occurred in all populations (ubiquitous windows) and thus represent regions of high polymorphism in the all humankind (Fig. 2B, Additional file 6: Table S7).
We detected 45 (9.62%) ubiquitous windows, whereas 264 (56.41%) windows were ancestry-specific, which break down as follows: South Asian, 61; African, 60; European, 57; East Asian, 53; Admixed American, 33. Finally, the remaining 159 (33.97%) candidate windows occurred in two to four populations.
Next, we investigated the role of the genes in the ubiquitous and the ancestry-specific windows (Additional file 6: Table S8). For the genes in the ubiquitous windows, we found enrichment in biological processes also found in the 1KGP full data set (nitrobenzene metabolic process, cellular detoxification of nitrogen compound, xenobiotic catabolic process, interferon-gamma-mediated signaling pathway, antigen processing) (Additional file 6: Table S9.1). When analyzing the ancestry-specific windows, we only found gene enrichment in the South Asian population for 8 biological processes related to keratinization (tissue development, Additional file 6: Table S9.6).
Finally, we analyzed if repeat elements overlap with ubiquitous and ethnic specific candidate windows. Here, the L1 (LINE), ERV1 (LTR) and ERVL-MaLR (LTR) repeats were the most abundant among both ubiquitous and ancestry specific candidate windows (Additional file 6: Table S10.1). Next, when analyzing the repeat elements present in a single ethnic group, LTR Gypsy-like is an example that overlaps with the ancestry specific windows of the African population [27]. Similarly, an ERVL-like (LRT) repeat is restricted to ancestry specific windows for European population, the TcMar-Tc2 (DNA repeat) was found in ancestry specific windows for the Admixed American population and Satellite-telo in the South Asian population (Additional file 6: Table S10.2).
Identification of polymorphic candidate regions across 19,652 human genomes in the USA
To extend our work further, we applied SVhound to detect regions with undetected SVs in 19,652 genomes of US residents (CCDG data) that include 8969 European-American, 8099 Hispanic or Latino-American and 2584 African-American genomes [4]. SVhound automatically estimated the optimal window length to be 10kbp. We again considered as candidate windows those representing 1% with the highest probability of detecting a clSV (\(p_{new} \ge 0.081\%\)). Figure 3 shows the distribution of the probabilities to detect a clSV when splitting the genomes in 126,185 windows, highlighting in red the 1282 the candidate windows.
Next, we used a similar annotation strategy to the 1KGP over the 1282 candidate windows, overlapping them to several databases. We found 381 candidate windows that overlapped with 331 protein coding genes (Additional file 6: Table S11), 396 overlapping non coding genetic elements (Additional file 1: Fig. S9) and 599 windows in intergenic regions. Again, we performed an enrichment analysis with PANTHER using the 331 genes and found gene enrichment for 27 biological processes, all of them related to immune response, e.g. phagocytosis, homophilic cell adhesion via plasma membrane adhesion molecules, complement activation, B cell receptor signaling pathway, positive regulation of B cell activation among others (Additional file 6: Table S12).
Next, we analyzed the repeat elements that lay within the candidate windows (Additional file 6: Table S13.1 and Additional file 6: Table S13.2) and observed an overall increase in the number of repeats overlapping with candidate windows. The LINE and LTR families were found in 44.2% and 30.66% of the candidate windows, which represent a decrease of 20.67% for the LINE and 23.6% for the LTR when compared to the 1KGP data. In addition, the DNA repeats were found in 23.79%% of the candidate windows, while the rest of repeat elements are found in less than 3% of the windows.
Next, we analyzed the presence of simple tandem repeats within the candidate windows of the CCDG dataset. Here, we found significant differences in the average number and the distribution of simple tandem repeats across the 1282 candidate windows (T-test p-value < 2.2e−16, Kolmogorov–Smirnov test p-value = 1.453e−08, Additional file 1: Fig. S10), result that deviates again from our analysis of 1KGP data. We found that the candidate windows from the CCDG dataset overlapped with centromeric and pericentromeric regions, which tend to be abundant in highly repetitive sequences [28] and repeat elements and were likely inaccessible/filtered from the 1KGP dataset.
Finally, we noticed consecutive clusters of candidate windows (ten or more consecutive windows cataloged as candidates) along some genomic regions (Additional file 6: Table S14). We found such clusters of candidate windows in chromosomes 5 (two clusters of size 12), 7 (three cluster sizes 17, 16 and 28), 9 (cluster size 12), 11 (cluster sizes 12 and 13), 12 (cluster size 12), 14 (cluster sizes 11 and 10), 17 (cluster size 16), and 19 (cluster size 25). One cluster is located near the telomere (chromosome 5) and seven in pericentromeric regions (chromosomes 5, 7, 11, 12) which are well known for having a high density of simple repeats, satellite repeats, and repeat elements in general (LINE, LTR, etc.) and coincide with the instability of such regions in genome assemblies, which are known to be hard to resolve due to their repetitive nature.
Further, five clusters are within coding regions in chromosomes 9, 14, 17 and 19. Here, it is prominent the case of a 155kbp region in chromosome 9 that overlaps with a novel lncRNA (ENSG00000285784). Next, we found a 169kbp region on chromosome 14 that include eight olfactory receptor genes, a 123kbp region on chromosome 14 which include 4 immunoglobulin genes (IGHA2, IGHE, IGHG4, IGHG2) two miRNA (MIR8071-1, MIR8071-2) and a lncRNA (COPDA1), a 185kbp region in chromosome 17 which include the KAT8 regulatory NSL complex (KANSL1, also observed in the GWAS analysis) and a 301 kbp region in chromosome 19 where we found six pregnancy specific beta-1-glycoprotein and two lncRNA (PSG8-AS1, ENSG00000282943). Thus, many of these clusters of candidate genomic regions are already well known to be highly variable.
We then focused on segmental duplications overlapping candidate windows. Here, we observed a slight decrease in the number of candidate windows overlapping with a segmental duplication (41.5%) when compared to the 1KGP (53.7%) (Additional file 6: Table S15). We identified the candidate windows that overlapped with the GIAB high confidence regions that exclude regions where short reads cannot reliably identify SV. Overall, 69.4% (890) of candidate windows overlapped with these “high-confidence” regions and thus indicate that reliable SV calling can be achieved in such regions [16]. (Additional file 6: Table S16).
Finally, we compared the results of the two independent human datasets, (1KGP, CCDG) that we analyzed with SVhound to examine the similarities in the prediction. As each dataset was analyzed with distinct genome reference, we compared the shared genes that overlapped with candidate windows. Surprisingly, we found only 41 genes present in candidate windows of both the 1KGP and CCDG data sets, representing approx 8.3% of the 495 genes associated with at least one of the candidate windows from the 1KGP or CCDG data (Additional file 6: Table S17). This small intersection may be related to the fact that the CCDG dataset focuses on the US population while the 1KGP dataset comprises 26 different ethnicities [20], coupled with the difference in number of candidate windows (188 in the 1KGP dataset to 1282 in the CCDG dataset, see Additional file 6: Table S18.1).
Identification of SV and further polymorphic candidate regions across 150 Rhesus Macaques
Finally, we applied SVhound to 150 whole genome sequences from the rhesus macaque (Macaca mulatta), a widely used primate model of human disease that has not been well studied with respect to SV [18, 19]. For this we created a novel catalog of SV for rhesus macaques by comparing 150 genomes to the reference Mmul_8 (see “Methods”, [8]). We identified SVs among the genomes of these 150 rhesus macaques that came from several US research colonies (see “Methods” for details). The largest proportion of SVs were deletions (45.84%) followed by insertions (36.88%), inversions (11.45%) and tandem duplications (5.82%) (Additional file 6: Table S19.1 and Additional file 6: Table S19.2). This follows roughly the distribution expected from human SV datasets [9]. Interestingly, we found a high number of SVs on chromosome 19 (Additional file 6: Table S19.3). Chromosome 19 includes tandem repeats of olfactory receptors, KIR (killer cell immunoglobulin-like receptor) loci and other immunology genes and was previously shown to have a higher rate of both CNV and SNV polymorphism than other macaque chromosomes [18, 29]. Figure 4A shows the minor allele frequency (MAF) spectrum. The MAF spectrum for the genome wide SVs follows the same distributions as in other populations (e.g. human), with the majority of the 102,572 SVs (53.7%) exhibiting low frequency (MAF < 0.05). We observe 5946 SV having an MAF > 45%, which might be because the reference genome contains an array of low frequency SVs. Interestingly, we noticed a profound peak for Alu insertions (Fig. 4B) that highlights Alu activity in this species.
We applied SVhound to identify candidate regions that may contain undiscovered variation. First, we observed that the rhesus raw data contained a larger number of SVs when compared to the human dataset (Additional file 6: Table S18.1), even though the number of genomes was an order of magnitude smaller when compared to the 1KGP and two orders of magnitude smaller when compared to the CCDG. This time SVhound estimated a window length of 27kbp. Here, given the small sample size, the non candidate windows presented higher \(p_{new}\) discovery probabilities when compared to the two full human datasets and similar to those in the subsamples (e.g. 100 individuales of 1KPG in Fig. 1B, top panels).
We extracted the top 1% candidate windows from the 75,554 windows \((p_{new} \ge 22.3\%\), Fig. 4C). Then, we extracted 479 annotated rhesus genes that overlap with a candidate window and performed an enrichment analysis with PANTHER (unmapped ID not counted, Additional file 6: Table S20). We did not find any significant enrichment for biological processes (Additional file 6: Table S21) probably also because of the small sample size.
Utilization of SVhound for quality control (QC) of population studies
Given SVhounds ability to automatically adjust and determine regions of clSV, we next investigated if it can also be leveraged to QC population SV data sets. By utilizing the SV-density coupled with the number of different SV-alleles, \(k\), one can assess the quality of a given dataset. As an example, we compare a subset of 150 genomes from the 1KGP and the rhesus dataset (also 150 genomes). Even when both datasets have the same sample size, the window length selected for rhesus is 27kbp, while for the 1KGP dataset is 319kbp (Additional file 6: Table S18.2).
First, we noticed that the distribution of the \(p_{new}\) values is similar with an average \(p_{new} = 1.85\%\) for the 1KGP and \(p_{new} = 2\%\) for the rhesus dataset (median \(p_{new} = 0.9\%\), \(p_{new} = 0.73\%\), and max \(p_{new} = 94.7\%\) and \(p_{new} = 97.3\%\) respectively) which show consistency on the \(p_{new}\) values, regardless of the dataset, when the desired SV-density remains the same.
Next, we included in the analysis a total of 100 random samples of 150 individuals from the 1KGP and 100 random samples of 150 individuals from the CCDG. We observe that for each dataset, the selected window length lies in its own distribution (Fig. 5). These window lengths reflect two important aspects of the dataset: first, the overall number of SV in each particular dataset, with 1KGP having 66,626, the CCDG dataset 304,533 and the rhesus 493,188 (Additional file 6: Table S18.1). When randomly removing SVs from the CCDG dataset, we observe an increase in the window length (Additional file 2: Fig. S11). This is also observed in the 1KGP dataset. Second, given a fixed SV-density, the distribution of the number of different SV-alleles, \(k\), reflects a similar distribution despite the difference in window length. This distribution mimics the allele frequency spectrum, where most windows have few SV-alleles and only a small number of windows (the candidate ones) have a high number of SV-alleles (Additional file 3: Fig. S12, Additional file 4: Fig. S13, Additional file 5: Fig. S14). We can use this expected distribution of \(k\) to detect possible errors and biases in the data that can be caused by a defined population structure, an increase in the number of falsely called SV or possible contamination. Given these insights, SVhound indeed can be utilized also to QC SV population catalogs and will identify a deviation from the expected “L” shaped distribution of the number of different SV-alleles, \(k\). Finally, we can examine the probability of detecting new SV-alleles to identify saturation and focus the efforts in specific genomic loci or new species.