We assume that variable sites are diallelic and the major and minor allele are known such that genotypes are expected to follow a Binomial model. In low-coverage sequencing data, genotypes are unobserved and genotype likelihoods are therefore used instead to account for the uncertainty in sequencing process. We use the iterative procedure in PCAngsd [18] to estimate individual allele frequencies that can be seen as the underlying parameters in the Binomial sampling processes of the genotypes accounting for population structure. In the following, we will denote N as the number of individuals and M as the number of sites. We can then define the posterior genotype dosage as follows for individual i in site j
$$\begin{aligned} {\mathbb {E}}[G_{ij} \,|\, X_{ij}, {\hat{\pi }}_{ij}] = \sum ^2_{g=0} g\, P(G_{ij} = g \,|\, X_{ij}, {\hat{\pi }}_{ij}), \end{aligned}$$
(1)
for \(i=1, \dots , N\) and \(j=1, \dots , M\), where \(P(G_{ij} = g \,|\, X_{ij}, {\hat{\pi }}_{ij})\) is the posterior genotype probability of genotype g with X being the observed sequencing data, and \({\hat{\pi }}\) being the individual allele frequency. Details of deriving the posterior genotype from genotype likelihoods can be found in the Additional file 1 (Equation S1-S2). Missing data is imputed based on population structure based on the posterior genotype dosages. We standardize the dosage under the assumption of a Binomial model,
$$\begin{aligned} y_{ij} = \frac{ {\mathbb {E}}[G_{ij} \,|\, X_{ij}, {\hat{\pi }}_{ij}] - 2{\hat{f}}_j}{\sqrt{2{\hat{f}}_j (1 - {\hat{f}}_j)}}. \end{aligned}$$
(2)
Here \({\hat{f}}\) is the estimated allele frequency at site j based on all of the samples. We then perform truncated SVD [11] on the full standardized data matrix (\(N \times M\)) to extract the top K principal components (PCs) that capture population structure in the dataset
$$\begin{aligned} \hat{\mathbf {Y}} = \mathbf {U}_{[1:K]} \mathbf {S}_{[1:K]} \mathbf {V}_{[1:K]}^T, \end{aligned}$$
(3)
where \(\mathbf {U}_{[1:K]}\) represents the captured population structure of the individuals and \(\mathbf {V}_{[1:K]}\) represents the scaled variant weights, while \(\mathbf {S}_{[1:K]}\) is the diagonal matrix of singular values. This low-rank approximation along with the standardized matrix \(\mathbf {Y}\) are all we need to estimate the two test statistics for low-coverage sequencing data.
FastPCA statistic
The selection statistic derived in Galinsky et al. [7], hereafter referred to as FastPCA, tries to detect selection by looking for variants that significantly differentiate from genetic drift along an axis of genetic variation. They define the selection statistics for the k-th principal component to be the properly normalized variant weights, using the properties of an eigenvector, such that they are standard normal distributed. The selection statistics are then defined as follows in our setting for genotype likelihood data
$$\begin{aligned}d_{jk} = v_{jk} \sqrt{M}, \end{aligned}$$
(4)
$$\begin{aligned}&d_{jk} \sim {\mathcal {N}}(0,1), \end{aligned}$$
(5)
$$\begin{aligned}&d_{jk}^{2} \sim \chi ^2_{1}, \end{aligned}$$
(6)
for \(j=1, \ldots , M\) and \(k=1, \ldots , K\). \(v_{jk}\) is the variant weight for the kth component at site j. The squared statistic will then follow a \(\chi ^2\)-distribution with 1 degree of freedom. This statistic is implemented in the PCAngsd framework and referred to as PCAngsd-S1.
pcadapt statistic
The test statistic implemented in pcadapt [15] is based on a robust Mahalanobis distance of the standardized estimates in a multiple linear regression for each site. The regression model is defined as follows in our setting for genotype likelihood data
$$\begin{aligned} \mathbf {y}_j = \mathbf {U}_{[1:K]} \varvec{\beta }_j + \varvec{\epsilon }_j, \end{aligned}$$
(7)
for \(j=1, \dots , M\), with \(\varvec{\beta }_j\) being the regression coefficients, and \(\varvec{\epsilon }_j\), the residual vector for site j. The coefficients are easily derived using the normal equation and properties of the previously computed truncated SVD (Eq. 3), thus \(\varvec{\beta }_j = \mathbf {S}_{[1:K]} \mathbf {V}_{[j,1:K]}\). A z-score of the regression coefficients in site j are defined as
$$\begin{aligned} \mathbf {z}_j =\varvec{\beta }_j/\sqrt{\frac{(\mathbf {y}_j - \hat{\mathbf {y}}_{j})^T (\mathbf {y}_j - \hat{\mathbf {y}}_{j})}{N-K}}, \end{aligned}$$
(8)
with \(\hat{\mathbf {y}}_{j}\) being the vector of low-rank approximations in site j (Eq. 3). The test statistic is computed as a robust Mahalanobis distance of \(\mathbf {z}_j\), where the squared distance will be \(\chi ^{2}_K\) distributed as described in Luu et al. [15]. We use standardized expected genotypes \(y_{ij}\) (Eq. 2) for genotype likelihood data, hereafter referred to as PCAngsd-S2, instead of using known genotypes as pcadapt. Note, that we correct for inflation using the genomic inflation factor [5], inline with the recommendations [15], in all analysis based on the pcadapt or PCAngsd-S2 statistics. See QQ-plot in Additional file 1: Figure S2 and S3 for examples of the uncorrected PCAngsd-S2 test statistic.
1000 genomes project data
We used data from the 1000 Genomes Project (phase3) [4] to test the two selection statistics implemented in the PCAngsd framework. Specifically, we tested two sets of populations, one with East Asian ancestry with 400 unrelated individuals from four East Asian populations, Han Chinese in Beijing (CHB), Han Chinese South (CHS), Chinese Dai in Xishuanagbanna (CDX), Kinh in Ho Chi Minh City (KHV), and one with European ancestry with 404 unrelated individuals from four European populations, Utah residents with Northern and Western European Ancestry (CEU), British in England and Scotland (GBR), Iberian populations in Spain (IBS), Toscani in Italy (TSI).
For all the selected individuals, we have both high quality genotype (HQG) data and low-coverage sequencing data available from the 1000 Genomes Project (phase3) (see [4] for details on the HQG data). The low coverage data is whole genome sequencing data with a mean depth of coverage around 6X (Additional file 1: Figure S1).
Analyses on polymorphic sites from the high quality genotype data
To directly compare PCAngsd against pcadapt and FastPCA, where the latter two only takes called genotypes as input, we restricted the selection analyses to polymorphic sites with a minimum allele frequency of 5% in the HGQ data. In total 5.8 and 6 million polymorphic sites are retained in the Asian and European population sets, respectively.
Restricting to the polymorphic sites in HQG data, we calculated genotype likelihoods (GL) from the low-coverage data using ANGSD with minimum mapping quality of 20 and minimum base quality of 30 [9]. We used the GL data as input to PCAngsd to compute the two selection statistics (PCAngsd-S1, PCAngsd-S2) for the population sets. To verify the results obtained from the low-coverage data, we also analyzed the same individuals in the HQG data using PCAngsd, pcadapt (default settings), and FastPCA (fastmode:YES, following [7]).
We also tested the performance of pcadapt and FastPCA on low-coverage data by calling genotypes using bcftools [12] with minimum mapping quality of 30, minimum base quality of 20, and disabling BAQ (–no-BAQ) to resemble the filters used with ANGSD. We restricted the analyses to the polymorphic sites in the HQG described above. From the called genotypes from low-coverage sequencing data, we generated two datasets: One excluding all genotype calls with genotype quality \(<20\) (hereafter referred to as CG standard) and one including all called genotypes (hereafter referred to as CG*).
Analyses on low coverage data
We also used the low-coverage sequencing data to test the performance of PCAngsd without prior knowledge on polymorphic sites. We used ANGSD with the same mapping (30) and base quality (20) filters described above to calculate GL. Variable sites were identified using a likelihood ratio test (-SNP_pval \(10^{-6}\)) and a minimum allele frequency filter on 0.05 (-minmaf 0.05). To remove false positive variable sites, we next applied a callability filter that excludes genomic regions of low quality and complexity. The filter is based on the sequencing depth and mapping quality across samples, thus, no external information is required. In total, we identified 4.1 million and 4.6 million polymorphic sites in the Asian and European population sets, respectively.
Read length bias in low coverage sequencing data
The low coverage sequencing data from the 1000 Genomes Project consists of multiple difference sequencing sources with highly variable sequencing length. Variable sequencing length can introduce a bias in population genetics analyses, particularly for low structure analyses. In the PCA of the East Asian samples based on polymorphic sites identified from the low coverage sequencing data, PC2 correlates with the sequencing length of the samples (Additional file 1: Figure S7). To identify genomic sites where the GLs correlate with the sequencing length, we conducted a logistic regression analyses tailored for low coverage sequencing data using ANGSD (-doAsso 5, [8]), where the trait/phenotype is the samples stratified the into two groups based on their read length (\(<99\)bp and \(\ge 99\)bp) and their sample population of origin was used as covariates to ensure we did not identify genomic sites driven by population differentiation. We removed sites with a p-value \(< 10^{-3}\).