Data and quality control
Genotypes of 2504 individuals from the 1000 Genomes Project Phase III were downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502 [12]. Variants were removed if they failed a Hardy-Weinberg equilibrium exact test (p < 0.001) or had a missing genotype rate above 0.02. The Hardy-Weinberg equilibrium test assesses whether the observed ratio between homozygous and heterozygous genotypes differs significantly from the ratio expected under HWE assumptions. SNPs from the major histocompatibility complex (MHC) on chromosome 6 and from the chromosome 8 inversion region were excluded. The remaining SNPs were pruned twice using plink 1.9 [17, 18] with windows of 1000 variants, a step size of 10, a pair-wise squared correlation threshold of 0.02, and a minor allele frequency threshold of 0.05. Pruning addresses linkage disequilibrium, i.e. the non-random association of alleles at different loci: it reduces the number of SNPs, keeps SNPs in approximate linkage equilibrium, and thereby reduces data dimensionality. A training set was built by removing the following populations: Americans of African ancestry in Southwest USA (ASW); African Caribbeans in Barbados (ACB); Mexican ancestry from Los Angeles, USA (MXL); Gujarati Indian from Houston, Texas (GIH); Sri Lankan Tamil from the UK (STU); and Indian Telugu from the UK (ITU). These six populations were used as an external test set to predict the degree of relative admixture in individuals and populations. For the classification models, we also merged British in England and Scotland (GBR) and Utah Residents with Northern and Western European Ancestry (CEU) to obtain a single category for Northern and Western European ancestry.
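For concreteness, the filtering and pruning steps can be reproduced with plink 1.9 along the following lines. This is a sketch only: the input file name and the region file listing the MHC and chromosome 8 inversion coordinates are hypothetical placeholders, not files from the published pipeline.

```python
# Sketch of the QC and pruning pipeline with plink 1.9. The input file
# ("genotypes.vcf.gz") and the region file listing the MHC and chromosome 8
# inversion coordinates ("exclusion_regions.txt") are hypothetical.
import subprocess

def plink(*args):
    subprocess.run(["plink", *args], check=True)

# Variant-level QC: drop variants failing the HWE exact test (p < 0.001),
# with missing call rate > 0.02 or minor allele frequency <= 0.05, and
# exclude the MHC / chromosome 8 inversion regions.
plink("--vcf", "genotypes.vcf.gz",
      "--hwe", "0.001", "--geno", "0.02", "--maf", "0.05",
      "--exclude", "range", "exclusion_regions.txt",
      "--make-bed", "--out", "qc")

# LD pruning: 1000-variant windows, step size 10, r^2 threshold 0.02
# (the pipeline above applies this pruning twice).
plink("--bfile", "qc",
      "--indep-pairwise", "1000", "10", "0.02",
      "--out", "prune1")

# Keep only the pruned-in variants.
plink("--bfile", "qc",
      "--extract", "prune1.prune.in",
      "--make-bed", "--out", "qc_pruned")
```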
Additional dataset: Arabidopsis thaliana
We used an additional dataset of 1135 Arabidopsis thaliana genomes from the 1001 Genomes Project [14]; the genotypes and an imputed SNP matrix can be downloaded from 1001genomes.org. Arabidopsis thaliana was the first plant to have its genome sequenced and is a commonly used model organism. Variants with a missing genotype rate above 0.2 were removed (a permissive filter). SNPs were pruned using plink 1.9 [17, 18] with windows of 100 variants, a step size of 10, a pair-wise squared correlation threshold of 0.1, and a minor allele frequency threshold of 0.05. We merged the imputed SNP matrix with our filtered list of SNPs to obtain a filtered imputed SNP matrix.
Visualization of ancestry clusters using t-SNE and GTM
t-SNE [5] translates similarities between points into probabilities: Gaussian joint probabilities in the original input space and Student's t-distributions in the latent space. The Kullback-Leibler divergence between the data distributions in the input and latent spaces is minimized with gradient descent. t-SNE has several parameters to optimize: the learning rate for gradient descent, the perplexity of the distributions in the initial space, and the early exaggeration. In this paper, we used the scikit-learn v0.19.1 implementation of t-SNE [19], with default learning rate = 200, perplexity = 30, and early exaggeration = 12. The main disadvantage of t-SNE is its lack of a framework to project new points onto a pre-trained map, a feature available in PCA and GTM.
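A minimal embedding with these settings looks as follows; G stands for the genotype matrix (individuals × pruned SNPs), replaced here by a random toy matrix so the snippet is self-contained.

```python
# Minimal t-SNE embedding sketch with scikit-learn; G is a toy stand-in
# for the genotype matrix (n_individuals x n_pruned_snps).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
G = rng.randint(0, 3, size=(100, 500)).astype(float)  # toy genotype counts

tsne = TSNE(n_components=2,
            learning_rate=200.0,      # default in scikit-learn v0.19.1
            perplexity=30.0,          # default
            early_exaggeration=12.0,  # default
            random_state=0)
embedding = tsne.fit_transform(G)     # (n_individuals, 2) map coordinates
```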
The core principle of GTM [6] is to fit a manifold into the high-dimensional initial space. The points $\mathbf{y}_k$ on the manifold $Y$ in the initial space are the centers of normal probability distributions of $\mathbf{g}$, which here are individuals described by the genotype matrix $\mathbf{G}$:
$$\begin{array}{@{}rcl@{}} p(\mathbf{g}|\mathbf{x}_{k},\mathbf{W},\beta) = \left(\frac{\beta}{2\pi}\right)^{D/2} \exp{\left(-\frac{\beta}{2}\|\mathbf{y}_{k}-\mathbf{g}\|^{2}\right)} \end{array} $$
(1)
where $\beta$ is the common inverse variance of these distributions and $\mathbf{W}$ is the parameter matrix of the mapping function $\mathbf{y}(\mathbf{x};\mathbf{W})$, which maps nodes $\mathbf{x}_k$ in the latent space to $\mathbf{y}_k$: $\mathbf{y}(\mathbf{x}_k;\mathbf{W}) = \mathbf{W}\phi(\mathbf{x}_k)$, where $\phi(\mathbf{x}_k)$ is a set of radial basis functions. $\mathbf{W}$ and $\beta$ are optimized with an expectation-maximization (EM) algorithm maximizing the overall data likelihood. The responsibility, or posterior probability that the individual $\mathbf{g}_n$ in the original genotype space is generated from the kth node in the latent space, is computed using Bayes' theorem:
$$\begin{array}{@{}rcl@{}} R_{kn} = p(\mathbf{x}_{k} | \mathbf{g}_{n},\mathbf{W},\beta) = \frac{p(\mathbf{g}_{n}|\mathbf{x}_{k},\mathbf{W},\beta) p(\mathbf{x}_{k})}{{\sum\nolimits}_{k^{\prime}=1}^{K} p(\mathbf{g}_{n}|\mathbf{x}_{k^{\prime}},\mathbf{W},\beta)p(\mathbf{x}_{k^{\prime}})} \end{array} $$
(2)
These responsibilities are used to compute the mean position $\mathbf{x}(\mathbf{g}_n)$ of an individual on the map, by averaging over all nodes on the map:
$$\begin{array}{@{}rcl@{}} \mathbf{x}(\mathbf{g}_{n}) = \sum\limits_{k=1}^{K} \mathbf{x}_{k} R_{kn} \end{array} $$
(3)
We used the Python package ugtm v1.1.4 [20] for generative topographic mapping; the scripts used for ancestry classification are available online (https://github.com/hagax8/ancestry_viz). GTM has several hyperparameters to tune, which can have a strong impact on the shape of the map: the number of radial basis functions, a width factor for these functions, the map grid size, and a regularization parameter.
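To make Eqs. (1)-(3) concrete, here is a minimal NumPy sketch of the responsibility and mean-position computations. It assumes a GTM has already been fitted, so the node centers Y, the inverse variance beta, and the 16×16 latent grid below are random stand-ins rather than outputs of ugtm's EM training.

```python
# Sketch of Eqs. (1)-(3): responsibilities and mean map positions for a
# fitted GTM. Y, beta and the latent grid would normally come from EM
# training (e.g. with ugtm); random stand-ins keep the snippet runnable.
import numpy as np

rng = np.random.RandomState(0)
K, N, D = 16 * 16, 100, 500            # nodes, individuals, SNPs
Y = rng.randn(K, D)                    # y_k: node centers in genotype space
G = rng.randn(N, D)                    # g_n: individuals
beta = 1.0                             # common inverse variance

# Eq. (1): log p(g_n | x_k, W, beta), dropping the constant factor
# (beta/2pi)^(D/2), which cancels in Eq. (2).
sq_dist = ((Y ** 2).sum(1)[:, None]    # ||y_k - g_n||^2, shape (K, N)
           - 2.0 * Y @ G.T
           + (G ** 2).sum(1)[None, :])
log_p = -0.5 * beta * sq_dist

# Eq. (2): responsibilities R_kn with a uniform prior p(x_k) = 1/K;
# the per-individual max is subtracted for numerical stability.
log_p -= log_p.max(axis=0, keepdims=True)
R = np.exp(log_p)
R /= R.sum(axis=0, keepdims=True)      # columns sum to 1

# Eq. (3): mean position of each individual on a 16x16 latent grid.
ticks = np.linspace(-1.0, 1.0, 16)
X_latent = np.array([(u, v) for u in ticks for v in ticks])   # (K, 2)
mean_positions = R.T @ X_latent                               # (N, 2)
```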
Ancestry classification models
PCA does not provide a comprehensive framework to build a probabilistic classification model. However, a simple classification model associated with the 2-dimensional plot can be built using the k-NN approach in three steps: (1) a PCA plot is constructed from a training set, (2) a test set is projected onto the plot, and (3) each test individual is assigned the predominant ancestry amongst its k nearest neighbors in the training set. We did not construct k-NN models for t-SNE since it is not straightforward to project new points onto a t-SNE map. On the other hand, GTM provides a probabilistic framework which can be used to build classification models and generate class membership probabilities [10]. GTM responsibilities can be seen as feature vectors: they encode individuals depending on their position on the map, which is discretized into a finite number of nodes (positions). They can be used to estimate the probability of a specific ancestry $a$ given the position $\mathbf{x}_k$ on the map, using Bayes' theorem:
$$\begin{array}{@{}rcl@{}} P(a|\mathbf{x}_{k})=\frac{P(\mathbf{x}_{k}|a) \times P(a)}{{\sum\nolimits}_{a^{\prime}} P(\mathbf{x}_{k}|a^{\prime}) \times P(a^{\prime})} \end{array} $$
(4)
where $P(\mathbf{x}_k|a)$ is computed as follows:
$$\begin{array}{@{}rcl@{}} P(\mathbf{x}_{k}|a)=\frac{{\sum\nolimits}_{n \in a}{R_{kn}}}{N_{a}} \end{array} $$
(5)
where $R_{kn}$ is the responsibility of node $\mathbf{x}_k$ for an individual $n$ belonging to population $a$, which counts $N_a$ individuals. It is then possible to predict the ancestry profile $P(a|\mathbf{g}_i)$ of a new individual $i$ with associated responsibilities $\{R_{ki}\}$:
$$\begin{array}{@{}rcl@{}} P(a|\mathbf{g}_{i})=\sum\limits_{k} P(a|\mathbf{x}_{k}) \times R_{ki} \end{array} $$
(6)
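To illustrate how Eqs. (4)-(6) combine, the following sketch (continuing the NumPy conventions of the previous snippet) derives node-level ancestry probabilities from training responsibilities and labels, then scores new individuals. The arrays R, labels and R_new are assumed to come from the same fitted map; this is not the published code.

```python
# Sketch of Eqs. (4)-(6): ancestry probabilities from GTM responsibilities.
# R: (K, N) training responsibilities; labels: length-N array of ancestry
# codes; R_new: (K, M) responsibilities of test individuals projected onto
# the same fitted map. All three are assumed to be available.
import numpy as np

def ancestry_profiles(R, labels, R_new):
    classes = np.unique(labels)

    # Eq. (5): P(x_k | a) = sum of R_kn over the N_a members of a, / N_a.
    P_x_given_a = np.stack(
        [R[:, labels == a].mean(axis=1) for a in classes], axis=1)  # (K, A)

    # Eq. (4): Bayes' theorem with empirical class priors P(a) = N_a / N.
    priors = np.array([(labels == a).mean() for a in classes])
    joint = P_x_given_a * priors
    P_a_given_x = joint / (joint.sum(axis=1, keepdims=True) + 1e-300)

    # Eq. (6): ancestry profile P(a | g_i) of each test individual.
    profiles = R_new.T @ P_a_given_x                                # (M, A)
    return classes, P_a_given_x, profiles
```

The row-wise argmax of P_a_given_x gives the most probable ancestry per node, which is what the map colouring described next relies on.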
GTM nodes $\mathbf{x}_k$ can be represented as points coloured by the most probable ancestry $a_{max}$, i.e. the ancestry maximizing $P(a|\mathbf{x}_k)$. We compared the performance of these visual classification approaches (PCA and GTM) with linear support vector machine (SVM) classification, a classical machine learning algorithm. Linear SVM depends on a single hyperparameter, the penalty $C$; increasing $C$ increases the variance of the model and decreases its bias. In this application, classification performance is estimated by the average F1 score over all ancestry classes in a 5-fold cross-validation experiment (5-CV) repeated 10 times. The F1 score is the harmonic mean of precision and recall. For each of the 10 repetitions, labels are predicted for the 5 partitions of the data, which are concatenated to obtain predicted values for the entire dataset. From these, F1 scores are computed for each class $a$ and repetition $j$. The per-class performance measure is the average across the 10 repetitions:
$$\begin{array}{@{}rcl@{}} {F1 score}_{a} = \frac{\sum\limits_{j=1}^{10}{{F1 score}_{aj}}}{10} \end{array} $$
(7)
The overall model performance measure is a weighted average across per-class F1 scores, with weights equal to the number of individuals in the class:
$$\begin{array}{@{}rcl@{}} {F1 score}=\left (\sum\limits_{j=1}^{10}{\frac{\sum\limits_{a}{{F1 score}_{aj} \times N_{a}}}{N_{total}}} \right) \div 10 \end{array} $$
(8)
This procedure is performed for each hyperparameter combination and for each algorithm (PCA, GTM, SVM). The best model for each algorithm is defined as the one with the largest overall F1 score, and only its performance is reported in the Results section. For PCA, we vary k (the number of neighbours) from 1 to 10. For GTM, we set the map grid size (number of nodes) to 16×16, the number of RBFs to 4×4, the regularization parameter to 0.1, and the RBF width factor to 0.3. For linear SVM, the penalty parameter is set to $C = 2^{r}$, where r runs from -5 to 10.
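As an illustration of this protocol applied to the PCA + k-NN model, the following scikit-learn sketch runs the 10× repeated 5-fold cross-validation and averages the class-size-weighted F1 score of Eq. (8). The genotype matrix G and label vector y are assumed to be available from the preprocessing steps above.

```python
# Sketch of the evaluation protocol for the PCA + k-NN model: 5-fold CV
# repeated 10 times, scored with the class-size-weighted F1 of Eq. (8).
# G (individuals x SNPs) and y (ancestry labels) are assumed available.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import f1_score

def repeated_cv_f1(G, y, k, n_repeats=10):
    scores = []
    for rep in range(n_repeats):
        model = make_pipeline(PCA(n_components=2),
                              KNeighborsClassifier(n_neighbors=k))
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=rep)
        # Concatenated out-of-fold predictions for the whole dataset.
        y_pred = cross_val_predict(model, G, y, cv=cv)
        scores.append(f1_score(y, y_pred, average="weighted"))
    return np.mean(scores)

# Model selection: vary k from 1 to 10 and keep the best overall F1 score.
# best_k = max(range(1, 11), key=lambda k: repeated_cv_f1(G, y, k))
```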
Posterior probabilities of ancestry membership for whole populations
All our models are trained on only twenty 1000 Genomes Project populations; the remaining six populations are used as an external test set (see the foregoing section, Data and quality control). Posterior probabilities of ancestry membership are estimated for all individuals in these test populations (Eq. 6), based on the observed superpopulation distributions (Eq. 5). We also generate probabilities of belonging to a superpopulation for each population as a whole, by replacing the individual responsibilities $\{R_{ki}\}$ in Eq. 6 with an overall population responsibility $\{R_{kp}\}$:
$$\begin{array}{@{}rcl@{}} R_{kp}=\frac{{\sum\nolimits}_{i \in p}{R_{ki}}}{N_{p}} \end{array} $$
(9)
where the sum runs over the $N_p$ individuals $i$ of population $p$. These responsibilities $\{R_{kp}\}$ correspond to the averaged distribution of the population on the map, and can be used to compare populations and estimate their diversity.
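In the conventions of the earlier sketches, the population-level profile is simply the individual-level computation of Eq. (6) applied to the averaged responsibilities of Eq. (9):

```python
# Sketch of Eq. (9): population-level responsibilities and ancestry profile.
# R_test: (K, M) responsibilities of the M individuals of one test
# population; P_a_given_x: (K, n_classes) node-level probabilities from
# Eqs. (4)-(5), as returned by the ancestry_profiles sketch above.
import numpy as np

def population_profile(R_test, P_a_given_x):
    R_kp = R_test.mean(axis=1)        # Eq. (9): averaged map distribution
    return R_kp @ P_a_given_x         # Eq. (6) with R_kp in place of R_ki
```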