Whole genome SNP genotype piecemeal imputation

Background Despite ongoing reductions in the cost of sequencing technologies, whole genome SNP genotype imputation is often used as an alternative for obtaining abundant SNP genotypes for genome wide association studies. Several existing genotype imputation methods can be efficient for this purpose, while achieving various levels of imputation accuracy. Recent empirical results have shown that the two-step imputation may improve accuracy by imputing the low density genotyped study animals to a medium density array first and then to the target density. We are interested in building a series of staircase arrays that lead the low density array to the high density array or even the whole genome, such that genotype imputation along these staircases can achieve the highest accuracy. Results For genotype imputation from a lower density to a higher density, we first show how to select untyped SNPs to construct a medium density array. Subsequently, we determine for each selected SNP those untyped SNPs to be imputed in the add-one two-step imputation, and lastly how the clusters of imputed genotype are pieced together as the final imputation result. We design extensive empirical experiments using several hundred sequenced and genotyped animals to demonstrate that our novel two-step piecemeal imputation always achieves an improvement compared to the one-step imputation by the state-of-the-art methods Beagle and FImpute. Using the two-step piecemeal imputation, we present some preliminary success on whole genome SNP genotype imputation for genotyped animals via a series of staircase arrays. Conclusions From a low SNP density to the whole genome, intermediate pseudo-arrays can be computationally constructed by selecting the most informative SNPs for untyped SNP genotype imputation. Such pseudo-array staircases are able to impute more accurately than the classic one-step imputation.


Background
Genome-wide association studies (GWAS) are processes of genetic fine-mapping that find whether common genetic variants are associated with a trait of interest [1]. These common genetic variants are expected to be abundant and well distributed across the whole genome. One of the most commonly used variant types is the single nucleotide polymorphism (SNP) -a site in the genome at which different individuals display different nucleotides. In this paper we consider biallelic SNPs in the cattle genome. The 1000 Bull Genomes Project (www.1000bullgenomes.com) has identified more than 28 million single nucleotide variations in cattle. The number that contains more than 648,000 SNPs spanning the entire bovine genome. In addition, we have certain lowdensity and medium-density genotype datasets obtained using the Illumina BovineLD BeadChip (Illumina 6 K) and the Illumina BovineSNP50 BeadChip (Illumina 50 K), respectively.
The project goal discussed in this paper is to use the sequenced animals as references to impute the whole genomes (that is, the 26,700,000 SNP genotype values) for the animals genotyped at various lower density levels, a scenario we designate as the whole genome SNP genotype imputation. For example, the Illumina 6 K chip provides genotypes for around 6000 SNPs across the 29 autosomes, and the animals genotyped with the Illumina 6 K chip will have the remaining 26,694,000 SNPs imputed up from the 6000 SNP genotype, based on the sequence-derived SNP genotypes of the reference sequenced animals. The subsequent GWAS may be done on either the 6000 (i.e. 0.2 %) genotyped SNPs, or can also be done on the imputed genotype including the other 26,694,000 (i.e. 100 %) SNPs.
Note that a typical GWAS genotype dataset that contains thousands of SNPs and hundreds to thousands of individuals is apt to contain missing genotype data. Besides discarding the concerned SNPs from GWAS or perhaps repeating genotyping experiments, computationally inferring the missing data has long been proposed as an alternative, at minimal labor and cost [2][3][4][5]. In other words, SNP genotype imputation is not a novel topic, but has received extensive research attention since the start of genotyping arrays -see reviews and surveys [6][7][8][9][10][11]. Nevertheless, in this paper, our imputation target is not the small percentage of missing genotype for SNPs on the gene chips, but the untyped SNPs. Such an objective is becoming increasingly relevant, thanks to ongoing efforts to make sequenced animals available as references.
More specifically in our whole genome SNP genotype imputation projects, the higher density chip is meant to be the whole genome sequence data, consisting of all 26.7 million SNPs; the lower density chip can be either of the Illumina 6 K, 50 K, 777 K, or the Affymetrix 660 K. Running an imputation method one time to directly impute the whole genome sequences is referred to as the onestep imputation, which is not the computational problem we try to address in this paper. Recently, several studies showed evidence that within bovine genomics, the two-step imputation is generally more accurate than the one-step imputation, where the lower density genotyped animals are first imputed to a medium density SNP set and then further imputed to the higher density [29][30][31]. For instance, Larmer et al. showed that for Beagle, FImpute, and Impute v2 [22], the two-step imputation from 6 to 50 K then to 777 K achieves higher accuracies than the one-step imputation from 6 K directly to 777 K [30].
In our preliminary studies, we have conducted the so called add-one two-step imputation experiments, in which the median density reference panel contains only one extra SNP than the low density SNP panel. While rotating this extra SNP from the pool of markers in the high density panel, we observed that a portion of them can individually boost the imputation accuracy in the add-one two-step experiment compared to the one-step direct imputation. Inspired by this observation, we are asking the natural question whether building up a staircase of pseudo arrays in between the lower density SNP set and the designated higher density SNP set would give the best imputation result. In this paper, we attempt to answer this question by first presenting a novel two-step piecemeal imputation framework, which essentially builds an intermediate pseudo array by mining the hidden relations between the lower and the higher density arrays. We remark that our pseudo array is not an actually manufactured gene chip, but an artificial one that is computationally derived from a learning procedure, which evaluates and selects some SNP markers based upon their add-one two-step imputation performance. Moreover, the pseudo-arrays are modeldependent, that is, different base imputation programs built upon different models could result in different selection of markers for our two-step piecemeal imputation. We show that by wrapping either Beagle or FImpute in our two-step piecemeal imputation framework, we are able to achieve higher genotype imputation accuracies. (That is, our method will be based on the one-step imputation, and is hunting for improvement upon the corresponding one-step imputation from the data.) Though we believe most effective imputation methods mentioned earlier can be adopted, the main reason we only go with Beagle and FImpute is their fast speed (i.e. efficiency). Based on the two-step piecemeal imputation, we demonstrate how staircase arrays can be built for whole genome SNP genotype imputation.
We briefly describe our two-step piecemeal imputation framework here (see Fig. 3 for a flow chart); more details are provided in the Methods section. By a study sample, we mean an animal that is genotyped at the lower density. The SNPs on this lower density chip form a set T. A reference panel consists of individuals genotyped at the higher density. All SNPs of T are assumed to be assayed in the higher density chip; the other SNPs on the higher density chip but not in T form the set U -they are the untyped SNPs in the study samples. For each marker m i ∈ U, an add-one two-step imputation from T to T ∪ {m i } to T ∪ U is conducted to evaluate its potential in imputing the other untyped markers. Next, markers that have similar potentials are clustered together, and correspondingly for each marker cluster, we determine which other markers can be imputed well, forming into a target marker cluster. We select one marker from every marker cluster, together with T, to form the pseudo intermediate array. The SNPs in a target marker cluster, called a tract, are imputed via the add-one two-step imputation where the selected marker is from the associated marker cluster. Piecing these tracts together gives the final imputation result.

Results
We first evaluate our two-step piecemeal imputation method empirically, and make comparisons against the usual one-step imputation; we also compare our multistep piecemeal imputation method against the usual two-/three-step imputations. In this empirical evaluation, we use Beagle (version 3.3.2) and FImpute (version 2.2) as the two base imputation programs respectively, which are fast and memory efficient enough to run them thousands of times on multiple large to huge datasets.

Sequenced animals
The Canadian Cattle Genome Project (www.genome canada.ca) has contributed more than 350 animals to the 1000 Bull Genomes Project. From these projects we derived two datasets: a Holstein sequence collection containing 114 animals, and a Simmental sequence collection containing 82 animals. They were used for the piecemeal imputation method training through a cross validation process (i.e. partitioned into a subset of study samples and another subset of reference samples). They also served as reference samples in all the independent testing experiments.

Genotyped animals
From the Canadian Cattle Genome Project, we obtained Affymetrix 660 K genotypes for 390 Simmental animals.
There are 23 of these 390 Simmental animals included in the set of sequenced animals. The genotyped animals that are not included in the set of sequenced animals are used as study samples in the independent testing experiments from a lower (than 660 K) density to impute their genotypes at the density 660 K. The 23 genotyped and sequenced animals are used as study samples in the independent testings from a lower density to impute their whole genome.

SNP sets
We used single chromosomes of small size (BTA 27) or medium size (BTA 14) in the development of the piecemeal imputation framework. BTA 27 was used for the Holstein data set and BTA 14 for the Simmental data set. The only challenge to deal with all 29 chromosomes is the need of a huge amount of disk storage, see Discussion.
The numbers of SNPs included in the Illumina 6 K, 50 K, 777 K and the Affymetrix 660 K are summarized in Table 1, where the second column contains their formal chip names that one can look up on the Illumina and Affymetrix websites.
On BTA 27, the 114 sequenced Holstein animals have genotype values for 529,674 SNPs. The Illumina 777 K chip contains 10,219 of them, among which 664 are included in the 50 K chip, and 119 of these 664 SNPs are included in the 6 K chip, as summarized in Table 2. On BTA 14, the 82 sequenced Simmental animals have genotype values for 933,833 SNPs. Table 2 shows that the Affymetrix 660 K chip contains 14,367 of these 933,833, among which 1618 are included in the 50 K chip, and a further 219 of these 1,618 SNPs are included in the 6 K chip.

5-fold cross validation
We use 5-fold cross validation to empirically examine our piecemeal imputation method, also to construct (a.k.a. train) the staircase pseudo arrays to impute the genotyped animals to their whole genome. The cross validation results also suggest the possible levels of improvement compared with the one-step imputation. Table 3 contains the cross validation results on the Simmental datasets of 82 animals, where the lower density is either 6 or 50 K and the higher density refers to  either 50 or 660 K (second column). The third and the fourth columns hold the one-step and piecemeal imputation accuracies (aac 1 and aac π , respectively), and the improvement of piecemeal over one-step is shown in the fifth column. For the null hypothesis that the usual onestep accuracies and the two-step piecemeal imputation accuracies have equal mean accuracies, we conducted statistical significance testing. Using Beagle as the base program, the p-values for the three 5-fold cross validation experiments are 0.0215, 0.0005 and 0.0004, respectively, indicating that the improvements by the two-step piecemeal imputation are statistically significant; using FImpute, the corresponding p-values are 0.64, 0.49 and 0.61, suggesting statistically insignificant improvements. From 6 to 50 K, 5 to 100 marker clusters, in increments of 5, were examined and the best piecemeal imputation results are included in the table, while in Fig. 1 all of these accuracies are plotted (blue dots). From 6 or 50 K to 660 K, 100 to 1,000 marker clusters, in increments of 100, were examined. Analogous results on the Holstein datasets of 114 animals are shown in Table 4. Both tables show a 1.5-3.0 % net accuracy improvement against Beagle, in all three cases (with statistical significance testing p-values 0.00369, 0.00003 and 0.00019, respectively) and a 0.5-1.0 % net accuracy improvement against FImpute (with p-values 0.54, 0.38 and 0.31, respectively).

Independent testing
Independent testing is to examine the quality of the selected markers and the defined pieces learned from the training step. The study samples used in the testing are not involved in the training step. The piecemeal imputation accuracies are again compared to the corresponding one-step imputation accuracies, respectively.
Columns 8-10 of Table 3 contain these independent testing results on the 367 genotyped Simmental animals where the lower density is either 6 or 50 K and the higher density refers to either 50 or 660 K (second column). The 8th and 9th columns hold the one-step and piecemeal imputation accuracies (aac 1 and aac π ), respectively. The improvement of the piecemeal over the one-step is shown in the tenth column. Note that for each imputation setting, the selected markers and the defined pieces are taken from the respective cross validation experiment. One exception is that there are 8 markers in the Affymetrix 660 K chip for which the two alleles (i.e. nucleotides) do not agree with the alternating alleles identified through genome sequencing; these 8 markers were excluded and one target marker cluster was discarded in the testing.
Analogous independent results on the 8 genotyped Holstein animals are shown in Table 4. Both tables show an accuracy improvement in all settings, though the improvement is about 40 % lower than the 5-fold cross validation.

Multi-step imputation: independent testing
With the selected markers and their associated target marker clusters from the training step, we experimented with the usual two-step imputation from 6 K to 50 K to 660 K on the 367 genotyped Simmental animals, and the four-step piecemeal imputation from 6 to 660 K. The four-step piecemeal imputation is a result of replacing each usual one-step imputation by a potentially promising two-step piecemeal imputation. The usual two-step imputation accuracy is denoted as acc 2 ; the four-step piecemeal imputation accuracy is still denoted as acc π . Similar experiments were done on the 8 Holstein animals on BTA 27 genotyped using the 777 K chip. Results are on the Simmental datasets for markers on chromosome 14. Columns 3-7 contain the 5-fold cross validation results on the 82 animals, with the selected markers and their associated target marker clusters. Independent testing results on the 367 animals are in columns 8-10, using the selected markers and their associated target marker clusters from the cross validation. 1 In the independent testing from 50 K to 660 K, 8 markers of the Affymetrix 660 K chip were filtered out due to their genotype disagreeing with the alternating alleles specified by sequencing, and consequently only 999 target marker clusters were used. The columns labelled with + show the improvements, in bold, of the piecemeal imputation over the one-step imputation For the 23 genotyped and sequenced Simmental animals, we experimented with the usual two-step imputation from 50 to 660 K to Sequence and the four-step piecemeal imputation from 50 K to Sequence, and the usual three-step imputation from 6 K to 50 K to 660 K to Sequence and the five-step piecemeal imputation from 6 K to Sequence. Here "Sequence" refers to all the 529,674 SNPs on BTA 14. The usual three-step imputation accuracy is denoted as acc 3 ; the five-step piecemeal imputation accuracy is denoted as acc π . Note that since we do not have a 660 K to Sequence training step to select markers (because first the usual one-step imputation is very good leaving little room for further improvement and second the training phase requires storage beyond our capacity), the last step in the five-step piecemeal imputation is a direct one-step imputation.
All these usual two/three-step imputation accuracies and the corresponding four/five-step piecemeal imputation accuracies are summarized in Table 5, where there is accuracy improvement in all settings. We note that these 23 animals were used in the training step, and thus the results reported here could be slightly biased.

Rationale behind the two-step piecemeal imputation
Several recent studies in cattle have shown that twostep imputation can be more accurate than the classic one-step imputation [29][30][31]. Also in our preliminary study, we observed that some markers in the add-one two-step imputation experiments are able to boost the overall accuracy. These results have led us to put efforts into finding a set of markers that would perform the best in the subsequent two-step imputation. However, such an optimal set of markers is not assayed in any existing chips, nor easy to obtain in reasonable computational time. Besides the selection scheme in our piecemeal imputation framework, we also tried several other approaches including sequential forward selection, which   Simmental genotyped animals are used in the two-step independent testing (6 K→50 K→HD), with results in columns 4, 6 and 7. The piecemeal imputation uses the selected markers and their associated target marker clusters from the training step. Additional 23 Simmental sequenced and genotyped animals are used in the two/three-step imputation to Sequence (50 K→660 K→Sequence, 6 K→50 K→660 K→Sequence). All one-step imputation accuracies are included in column 3. The last column labelled with + show the improvements, in bold, of the piecemeal imputation over the two-or three-step imputation did not result in any significant improvement. We thus proposed an alternative to partition the higher density SNP set into multiple pieces, which are learned through the add-one two-step imputation experiments. Each piece is then imputed by the corresponding add-one two-step imputation experiment. This procedure laid the foundation for our two-step piecemeal imputation strategy. Nevertheless, our partition scheme is not necessarily optimal, as we adopted the k-means only because it outperformed other clustering methods slightly. In addition, we also experimented with the linkage-disequilibrium (LD) blocks produced by Haploview [32] for finding closely linked markers, but again the increase in accuracy was insignificant and the results are often inferior to those of the our marker selection scheme (detailed results not shown).

Marker clusters and their effects
From our 5-fold cross validation results, it seems as though the number of marker clusters does not affect the final piecemeal imputation accuracy much. For example, for genotype imputation from 6 to 50 K on the Simmental dataset, the piecemeal imputation accuracies of all the 20 different clustering results are plotted in Fig. 1, where the dashed blue/red lines are the Beagle/FImpute one-step imputation accuracies, and the solid dots represent the two-step piecemeal imputation accuracies. Despite FImpute performing better than Beagle, the connected dots for both FImpute and Beagle do not vary much with different numbers of clusters. A simple guideline would be to have an average cluster size of 10-100. We also look into the content of a marker cluster. For example, when k = 15, the first five of the 15 marker clusters are plotted in Fig. 2, where the x-axis represents the physical locus. It is interesting to see that the markers of a cluster are not necessarily close to each other, though they have very similar imputation potentials. The LD between pairs of these markers, by Haploview, are insignificant.

Imputation result sensitivity to the selected markers
The imputed genotype for the study samples at a selected marker m i is used in the second step, of the two-step piecemeal imputation, to impute the other untyped markers of U − {m i }. Comparing the add-one two-step imputation result to the usual one-step imputation, we have seen subtle changes at many untyped markers of U for different selected markers. Indeed, some of them exhibit a gain in accuracy whereas some have a loss in accuracy and yet others are unaffected. This has led us to use the overall gain in accuracy to measure the imputation potential of a selected marker.
By setting up a feature vector for a candidate marker to keep a record of the accuracy gains and losses at each untyped marker, we observed from our preliminary twostep imputation experiments (results not shown) that the candidate markers fall into three categories when used for creation of the pseudo-array: 1) those that yield an accuracy gain over the usual one-step imputation; 2) those that yield a net zero gain; 3) those that yield an accuracy loss from the usual one-step imputation. Through clustering these feature vectors, the impact of selecting different markers from a cluster is expected to be reduced to the minimum, as evidenced by our preliminary experiments (we did not re-examine this issue in all the experiments reported here).

Target marker clusters
For all the markers of a marker cluster, those other untyped markers that can be similarly well imputed in the add-one two-step imputation are the target markers associated with the marker cluster. We have looked into the content of such a target marker cluster. Similar to a It is also interesting to observe that some target marker clusters are overlapping. Note that target clusters are formed after the marker clusters are determined, that is, in terms of the feature vectors, the marker clusters are formed using the whole vectors, but the target marker clusters are formed by using only the vector entries corresponding to the makers in a marker cluster. Therefore, such a phenomenon of an untyped marker being imputed with high accuracies by several selected markers can be explained.

Other clustering methods
The main reason for marker clustering is to avoid selecting redundant markers to form the intermediate pseudo array, here redundant means the similar potential in imputing the genotype for other SNPs. We had experimented with Haploview to construct the LD blocks for this purpose, which did not result in any conclusive accuracy increase (detailed results not shown). Other popular feature selection methods in machine learning, such as SFS and SBS, were also tested. Based on the feature vectors, we tried clustering methods other than k-means, with results not better than k-means. Thus we go with k-means in the final piecemeal imputation framework.
As discussed in the last paragraph, forming the marker clusters and the associated target marker clusters is more like a bi-clustering task, and it would be worthwhile to try some good bi-clustering algorithms. Coming back to the LD-based marker selection, though multiple experiments with different thresholds in Haploview did not give good results, we realize that such an approach avoids the add-one two-step imputation experiments in the training phase, and it can be substantially faster. This suggests the need for better LD block estimation/prediction by SNP genotype values.

Cattle genomic distance
In our current empirical experiments, we used the population-based option in our base programs. The underlying assumption for such an option is that individuals are unrelated. On the other hand, related animals can certainly bias towards the correct genotype. Therefore, if one would be able to define a degree of relatedness between two individuals based on their SNP genotype, then using only closely related sequenced animals to a study animal as references may potentially lead to more accurate genotype imputation.
Animals from different breeds are deemed more distantly related than the same breed animals. We therefore separated the datasets by breeds. In fact, earlier research suggests cattle whole genome SNP genotype imputation should be done breed by breed [30], which is also confirmed by our preliminary testing that inter-breed imputation has slightly lower performance (detailed results not shown).

Computational time
The running time of the two-step piecemeal imputation depends on the number of study animals, the number of reference animals, and the higher SNP density. Our experiments were mostly done using the high-performance computing facilities, and we were able to run them in parallel. The same as any other machine learning tasks, the most time-consuming stage in the two-step piece imputation is the training phase, when the addone two-step experiments are conducted to select the good potential markers. The independent testing, and the real imputation tasks, can be finished much quicker. It is important to point out that in all our experiments reported here, running time was never an issue as it was in hours per run, but the major challenge is the need for a huge disk storage (more than 84 TB) when we were performing whole-genome SNP genotype imputation; and again this was happening in the training phase. We used more than 3TB for storing all the intermediate data, such as the feature vectors, for the two datasets used in this paper. The imputed SNP genotype values are expected to be useful in the downstream data analysis, such as GWAS, and thus the increased computation burden in the piecemeal imputation framework becomes worthy.

Methods
A flow chart of the two-step imputation process is depicted in Fig. 3, with the training process through the 5-fold cross validation in the left and the independent testing in the right. For ease of presentation, we use the Illumina 6 K gene chip to represent the lower density chip and the Illumina 50 K gene chip to represent the higher density one. The study samples are genotyped on the 6 K SNP set T, and the reference samples are genotyped on the 50 K SNP set T ∪ U. The goal is to impute the genotype values on U for the study samples. The top two lines in Fig. 2 plot T ∪ U and T, respectively, using their physical loci on the first half of chromosome 14 (BTA 14).

One-step imputation
We first present the training process. Let Beagle be our base imputation algorithm. Denote the set of study samples as S and the set of reference samples as R. The genotype dataset is thus denoted as (S ∪ R, T ∪ U). First, Beagle is used to impute the genotype of the study samples on the untyped SNPs of U, by simply running on the dataset (S ∪ R, T ∪ U). The achieved accuracy in this one-step imputation is denoted as acc 1 . In our simulation experiments, the genotype of the study samples on the SNPs of U are masked; the imputed genotype is then compared against the true genotype for calculating the imputation accuracy.

Marker feature vector
Our goal of training is to select a relatively small number of SNPs from U, denoted as M, and append them to set T to create an intermediate pseudo array, such that the subsequent two-step imputation from T to T ∪M and then to T ∪ U yields a higher imputation accuracy. In this step, we will evaluate the potential for each SNP of U in the add-one two-step imputation.
Let m i ∈ U be a candidate marker (i = 1, 2, . . . , |U|), and tentatively set the intermediate pseudo array to contain T and m i only, that is T ∪ {m i }. Applying the base imputation algorithm (Beagle in our case), we do the add-one two-step imputation from T to T ∪ {m i } and then from T ∪ {m i } to T ∪ U. At the end of the twostep process, we calculate the imputation accuracy for each marker m j of U across all study samples, denoted as a ij . Let a ii denote the one-step imputation accuracy from T to T ∪ {m i }. The vector v i = a i1 , a i2 , . . . , a i|U| , is the feature vector for marker m i .

Marker clustering and target marker cluster
Intuitively, two markers of similar feature vectors have about the same performance in the two-step imputation, in terms of imputing the other untyped markers; and thus it is sufficient to include only one of them. Therefore, we apply the k-means clustering algorithm to cluster the markers of U represented by their feature vectors, where k is the pre-set number of clusters which can be empirically determined, for example, by the Davies-Bouldin index [33]. In our experiments, for genotype imputation from the Illumina 6 to 50 K using Beagle, we examined 5 to 100 clusters with an increment of 5 in the cross validation. Let C 1 , C 2 , . . . , C k denote the k clusters of SNPs of U by k-means.
For each cluster C i , if marker m j ∈ U is consistently imputed well with average imputation accuracy higher than or equal to the one-step imputation accuracy acc 1 , then m j is a target marker for the cluster C i . The set of all target markers for cluster C i form the target marker cluster TC i for the cluster C i .
If the target marker cluster TC i is empty, then no marker of C i would be selected to form the pseudo array; otherwise, define the contribution of a marker m j of C i as the add-one two-step imputation accuracy from T ∪ {m j } to T ∪ U, with the imputation accuracy calculated over only markers of TC i . The top contribution marker of C i , denoted as m i * , is selected into M, and it is for imputing the genotype of SNPs of TC i only. We call this target marker cluster TC i one piece of the imputation.

Experimental setup in the two-step piecemeal imputation
We have two sets of sequenced animals (Holstein and Simmental) to be used as references. There are also much larger numbers of animals that are genotyped by various Fig. 3 A flow chart of the two-step piecemeal imputation framework, including both the training phase through a 5-fold cross validation and the independent testing. In the chart, T is the set of markers in the lower density chip and T ∪ U is the set of markers in the higher density chip; m i is a marker of U; S is the set of study samples genotyped on T and R is the set of references genotyped on T ∪ U. The goal is to impute the genotype for markers of U for the study samples Illumina and Affymetrix gene chips, which are the study samples. The project goal is to impute the study animals to their whole sequences. When running the base programs in our experiments, we followed the default settings of the population-based genotype imputation for both Beagle and FImpute. For Beagle, we set the -Xmx parameter to 4GB for memory management without the low memory option for better performance.

Training through cross validation
Recall that we have four levels of SNP density, 6 K, 50 K, HD and Sequence, where HD can be either the Illumina 777 K or the Affymetrix 660K. For each genotype imputation experiment from a lower density to a higher density, we want to select some untyped SNPs to form the pseudo array and partition the other untyped markers into pieces accordingly. We do this training via a 5-fold cross validation. The quality of these selected markers and their associated pieces is examined via independent testing.
We use an example genotype imputation experiment for the Simmental group from 6 K to 50 K to explain the procedure. First, we derive the 50 K genotype for all sequenced Simmental animals from their sequence data. Next, these animals are randomly partitioned into five folds of approximately equal size, with the consideration of their country of origins. Four folds of the animals are used as the references, R, while the last fold is held as study samples, S, for which the genotype of SNPs outside of 6 K are masked to mimic the untyped SNPs U. The fivefold cross validation scheme rotates each one of the five folds as the study fold.
On the dataset (R ∪ S, T ∪ U), using Beagle we do the one-step imputation to obtain the imputation accuracy acc 1 ; we also do the add-one two-step imputation to obtain the feature vector for each untyped marker of U. The average of the five feature vectors from the five-fold cross validation defines the final feature vector for each untyped marker. With these vectors, the k-means algorithm is run to cluster the untyped markers into C 1 , C 2 , . . . , C k (which form a partition of U). Subsequently, we determine the target marker cluster TC i for each cluster C i (all these target marker clusters form another distinct partition of U), and select the markers M = {m 1 * , m 2 * , . . . , m k * } for piecemeal imputation. For each study fold S, the study samples are then piecemeal imputed to fill the genotype for the untyped markers U: when an untyped marker does not belong to any target marker cluster, it is imputed in the one-step, otherwise, it is imputed in multiple pieces through a majority vote. The imputed genotype values are compared against the ground truth, which were masked before the imputation, to calculate the piecemeal imputation accuracy as the percentage of correctly imputed genotype. The final piecemeal imputation accuracy from the five-fold cross