A flow chart of the two-step imputation process is depicted in Fig. 3, with the training process through the 5-fold cross validation in the left and the independent testing in the right. For ease of presentation, we use the Illumina 6 K gene chip to represent the lower density chip and the Illumina 50 K gene chip to represent the higher density one. The study samples are genotyped on the 6 K SNP set T, and the reference samples are genotyped on the 50 K SNP set T∪U. The goal is to impute the genotype values on U for the study samples. The top two lines in Fig. 2 plot T∪U and T, respectively, using their physical loci on the first half of chromosome 14 (BTA 14).
One-step imputation
We first present the training process. Let Beagle be our base imputation algorithm. Denote the set of study samples as \({\mathcal S}\) and the set of reference samples as \({\mathcal R}\). The genotype dataset is thus denoted as \(({\mathcal S} \cup {\mathcal R}, T \cup U)\). First, Beagle is used to impute the genotype of the study samples on the untyped SNPs of U, by simply running on the dataset \(({\mathcal S} \cup {\mathcal R}, T \cup U)\). The achieved accuracy in this one-step imputation is denoted as a
c
c
1. In our simulation experiments, the genotype of the study samples on the SNPs of U are masked; the imputed genotype is then compared against the true genotype for calculating the imputation accuracy.
Marker feature vector
Our goal of training is to select a relatively small number of SNPs from U, denoted as M, and append them to set T to create an intermediate pseudo array, such that the subsequent two-step imputation from T to T∪M and then to T∪U yields a higher imputation accuracy. In this step, we will evaluate the potential for each SNP of U in the add-one two-step imputation.
Let m
i
∈U be a candidate marker (i=1,2,…,|U|), and tentatively set the intermediate pseudo array to contain T and m
i
only, that is T∪{m
i
}. Applying the base imputation algorithm (Beagle in our case), we do the add-one two-step imputation from T to T∪{m
i
} and then from T∪{m
i
} to T∪U. At the end of the two-step process, we calculate the imputation accuracy for each marker m
j
of U across all study samples, denoted as a
ij
. Let a
ii
denote the one-step imputation accuracy from T to T∪{m
i
}. The vector v
i
=〈a
i1,a
i2,…,a
i|U|〉, is the feature vector for marker m
i
.
Marker clustering and target marker cluster
Intuitively, two markers of similar feature vectors have about the same performance in the two-step imputation, in terms of imputing the other untyped markers; and thus it is sufficient to include only one of them. Therefore, we apply the k-means clustering algorithm to cluster the markers of U represented by their feature vectors, where k is the pre-set number of clusters which can be empirically determined, for example, by the Davies-Bouldin index [33]. In our experiments, for genotype imputation from the Illumina 6 to 50 K using Beagle, we examined 5 to 100 clusters with an increment of 5 in the cross validation. Let C
1,C
2,…,C
k
denote the k clusters of SNPs of U by k-means.
For each cluster C
i
, if marker m
j
∈U is consistently imputed well with average imputation accuracy higher than or equal to the one-step imputation accuracy a
c
c
1, then m
j
is a target marker for the cluster C
i
. The set of all target markers for cluster C
i
form the target marker cluster T
C
i
for the cluster C
i
.
If the target marker cluster T
C
i
is empty, then no marker of C
i
would be selected to form the pseudo array; otherwise, define the contribution of a marker m
j
of C
i
as the add-one two-step imputation accuracy from T∪{m
j
} to T∪U, with the imputation accuracy calculated over only markers of T
C
i
. The top contribution marker of C
i
, denoted as \(\phantom {\dot {i}\!}m_{i^{*}}\), is selected into M, and it is for imputing the genotype of SNPs of T
C
i
only. We call this target marker cluster T
C
i
one piece of the imputation.
Experimental setup in the two-step piecemeal imputation
We have two sets of sequenced animals (Holstein and Simmental) to be used as references. There are also much larger numbers of animals that are genotyped by various Illumina and Affymetrix gene chips, which are the study samples. The project goal is to impute the study animals to their whole sequences. When running the base programs in our experiments, we followed the default settings of the population-based genotype imputation for both Beagle and FImpute. For Beagle, we set the -Xmx parameter to 4GB for memory management without the low memory option for better performance.
Training through cross validation
Recall that we have four levels of SNP density, 6 K, 50 K, HD and Sequence, where HD can be either the Illumina 777 K or the Affymetrix 660K. For each genotype imputation experiment from a lower density to a higher density, we want to select some untyped SNPs to form the pseudo array and partition the other untyped markers into pieces accordingly. We do this training via a 5-fold cross validation. The quality of these selected markers and their associated pieces is examined via independent testing.
We use an example genotype imputation experiment for the Simmental group from 6 K to 50 K to explain the procedure. First, we derive the 50 K genotype for all sequenced Simmental animals from their sequence data. Next, these animals are randomly partitioned into five folds of approximately equal size, with the consideration of their country of origins. Four folds of the animals are used as the references, \({\mathcal R}\), while the last fold is held as study samples, \({\mathcal S}\), for which the genotype of SNPs outside of 6 K are masked to mimic the untyped SNPs U. The five-fold cross validation scheme rotates each one of the five folds as the study fold.
On the dataset \(({\mathcal R} \cup {\mathcal S}, T \cup U)\), using Beagle we do the one-step imputation to obtain the imputation accuracy a
c
c
1; we also do the add-one two-step imputation to obtain the feature vector for each untyped marker of U. The average of the five feature vectors from the five-fold cross validation defines the final feature vector for each untyped marker. With these vectors, the k-means algorithm is run to cluster the untyped markers into C
1,C
2,…,C
k
(which form a partition of U). Subsequently, we determine the target marker cluster T
C
i
for each cluster C
i
(all these target marker clusters form another distinct partition of U), and select the markers \(\phantom {\dot {i}\!}M = \{m_{1^{*}}, m_{2^{*}}, \ldots, m_{k^{*}}\}\) for piecemeal imputation. For each study fold \({\mathcal S}\), the study samples are then piecemeal imputed to fill the genotype for the untyped markers U: when an untyped marker does not belong to any target marker cluster, it is imputed in the one-step, otherwise, it is imputed in multiple pieces through a majority vote. The imputed genotype values are compared against the ground truth, which were masked before the imputation, to calculate the piecemeal imputation accuracy as the percentage of correctly imputed genotype. The final piecemeal imputation accuracy from the five-fold cross validation is the average over all five folds, and is denoted as a
c
c
π
.
Independent testing
In these experiments, the sequenced animals from the Canadian Cattle Genome Project and the 1000 Bull Genomes Project are used as references. The study samples are the animals that are genotyped by various Illumina and Affymetrix gene chips.
We use the example genotype imputation experiment for the Simmentals from 6 to 50 K to explain the procedure. First, we derive the 50 K genotype for all sequenced Simmental animals from their sequence data; we also derive the 6 K genotype for the genotyped Simmental animals from their genotype data (they are genotyped at density 50 K or higher). Second, from the above 5-fold cross validation process, we have identified a set of markers \(\phantom {\dot {i}\!}M = \{m_{1^{*}}, m_{2^{*}}, \ldots, m_{k^{*}}\}\) and determined their associated target marker clusters {T
C
1,T
C
2,…, T
C
k
}, respectively. Using these selected markers and their defined pieces, our piecemeal imputation imputes the genotype for the study samples for the selected markers \(\phantom {\dot {i}\!}M = \{m_{1^{*}}, m_{2^{*}}, \ldots, m_{k^{*}}\}\) first, and later uses the imputed data to further impute the genotype for the study samples at the other untyped SNPs. At the end, the imputed genotype for the study samples at all the untyped SNPs, either imputed in the first step or imputed in the second step, is compared against the ground truth to calculate the independent testing piecemeal imputation accuracy
a
c
c
π
as the percentage of correctly imputed genotype.