Data-driven encoding for quantitative genetic trait prediction

MOTIVATION
Given a set of biallelic molecular markers, such as SNPs, with genotype values on a collection of plant, animal or human samples, the goal of quantitative genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Quantitative genetic trait prediction is usually represented as linear regression models which require quantitative encodings for the genotypes: the three distinct genotype values, corresponding to one heterozygous and two homozygous alleles, are usually coded as integers, and manipulated algebraically in the model. Further, epistasis between multiple markers is modeled as multiplication between the markers: it is unclear that the regression model continues to be effective under this. In this work we investigate the effects of encodings to the quantitative genetic trait prediction problem.


RESULTS
We first showed that different encodings lead to different prediction accuracies, in many test cases. We then proposed a data-driven encoding strategy, where we encode the genotypes according to their distribution in the phenotypes and we allow each marker to have different encodings. We show in our experiments that this encoding strategy is able to improve the performance of the genetic trait prediction method and it is more helpful for the oligogenic traits, whose values rely on a relatively small set of markers. To the best of our knowledge, this is the first paper that discusses the effects of encodings to the genetic trait prediction problem.


Background
Whole genome prediction of complex phenotypic traits using high-density genotyping arrays has attracted a lot of attention, as it is relevant to the fields of plant and animal breeding and genetic epidemiology [1][2][3][4][5][6][7][8]. Given a set of biallelic molecular markers, such as SNPs, with genotype values encoded as {0, 1, 2} on a collection of plant, animal or human samples, the goal is to predict the quantitative trait values by simultaneously modeling all marker effects.
One of the earliest, though still very relevant, treatments of genomic selection was given in [1]. In the article, the authors present four approaches: least-squares estimation, BayesA, BayesB, and rrBLUP (Ridge-Regression BLUP), based on a linear model of the marker effects on the trait being studied. The latter three methods are still competitive with the state-of-art techniques, and have also spawned a number of interesting variants. Specifically, rrBLUP [1,9] has been used widely for trait prediction where it builds a linear model by fitting all the genotypes, and the coefficient computed for each marker can be considered as a measure of the importance of the marker. The name rrBLUP stands for "ridge-regression" BLUP, where BLUP stands for the standard "best linear unbiased prediction" approach used in the field. rrBLUP can be viewed as either ridge-regression with a specific shrinkage parameter, or a particular mixed model equation with certain variance components [10,11]. The rrBLUP method has the benefits of the underlying hypothesis of normal distribution of the trait value and the marker effects, which is well suited for highly polygenic traits; rrBLUP is computationally efficient and robust, which makes it one of the most commonly used models in whole genome prediction. Other popular predictive models are Elastic-Net, Lasso, Ridge Regression [12,13], Bayes A, Bayes B [1], Bayes C π [14], and Bayesian Lasso [15,16], as well as other machine learning methods.
The genetic trait prediction problem is defined as follows. Given n training samples, each with m ≫ n genotype values (we use "feature", "marker", "genotype", "SNP" interchangeably) and a trait value, and a set of n′ test samples each with the same set of genotype values but without trait value, the task is to train a predictive model from the training samples to predict the trait value, or phenotype of each test sample based on their genotype values. Let Y be the trait value of the training samples. The problem is usually represented as the following linear regression model: where X i is the i-th genotype value, m is the total number of genotypes and b i is the regression coefficient for the i-th genotype, e l is the error term. We call this model single marker model.
The above model assumes that only the single markers, or main effects, play a role for the prediction. However, it is known that the interactions of the genotypes may also contribute to the genetic traits under certain conditions, which is known as Epistasis [17]. The pairwise epistasis between two genotypes X i and X j is often modeled as the product of the two genotype values. Therefore, with the traditional representation, the linear regression model with pairwise epistasis effects is modified as the following: where X i X j is the product of the genotype values of the i-th and j-th genotype and it denotes the interaction of the two genotypes, a i,j represents the coefficient for the interaction. Thus in this epistasis model, the epistasis effects are considered as augmented genotypes besides the original genotype matrix X. We call this model epistasis model.
Genotypes for a marker can be either homozygous or heterozygous. For Genome Wide Association Study (GWAS), we only need to identify the association between a marker and the case/control trait. Therefore, we care more about whether genotypes are homozygous or heterozygous and the frequency of the alleles. They don't necessarily need to be quantitative. They are usually represented as a pair of alleles, for example "AA" and "TT" for homozygous genotypes, "AT" for heterozygous genotype.
On the other hand, for genetic trait prediction problem, in Equation 1 and 2, the genetic trait values Y are quantitative. Thus the genotypes X i need to be quantitative as well. Researchers generally assign three distinct encodings to the three possible genotype values . A few   common sets of encodings for genotypes are {0, 1, 2},  where 0 and 2 are for homozygous genotypes and 1 is  for heterozygous genotype, and {−1, 0, 1}, where -1 and  1 are for homozygous genotypes and 0 is for heterozygous genotype. As genotypes need to be encoded, different encodings may lead to different prediction accuracies, especially for the epistasis model. This is because the multiplications of different encodings are different. For example, the multiplication of two heterozygous genotypes for encoding {0, 1, 2} is 1 × 1 = 1, but for encoding {−1, 0, 1} is 0 × 0 = 0. It's not clear which encoding should we use nor which encoding will produce better results. Another unreasonable setting is different interactions of the genotypes may have the same value. For example, the multiplication of two genotypes 0, 1 and 0, 2 are both 0. But there is no biological interpretation why the two genotype interactions contribute "identically" to the trait.
Based on the above observations, we developed a novel data-driven encoding method where the encoding of the genotypes depends on the data itself. The basic idea is straitforward: For each genotype g of each marker i, we identify the set of trait values for the samples whose genotype is g at marker i. Then we take the average of this set of traits and replace the genotype with this average value. Thus all the genotypes can be determined by the data itself and the encoding allows each marker to be encoded differently. We call this encoding method pure data-driven encoding.
In the traditional encoding, heterozygous genotype is the average value of the two homozygous genotypes, thus from the encoding we could tell which one is heterozygous genotype, which ones are homozygous genotypes. One problem of the pure data-driven encoding is that we can not distinguish between the homozygous genotype and heterozygous genotype any more, as their encodings completely depend on the data. So we propose a second version of the encoding, where we compute the new encoding for the two homozygous genotypes the same as in the pure data-driven encoding method, but we compute the new encoding for the heterozygous genotype as the mean of the whole data set. We call this encoding method hybrid data-driven encoding. More details will be given in the methods section.

Related work
Lots of techniques have been applied to the genetic trait prediction problem defined in Equation 1. Consider the typical situation for linear regression, where we have the training set y ∈ R l , x ∈ R l×n , in a standard linear regression, we wish to find parameters b 0 , b such that the sum of square residuals, The lasso approach [12,13] uses an additional l 1 penalty which aims to achieve a sparse solution. This idea has even been extended to group lasso where variable are included or excluded in groups [18,19]. Alternatively Ridge regression (or Tikhonov regularization) [20] uses an l 2 penalty which is ideal for the case when many predictors have non-zero coefficients. Elastic-net uses both an l 1 and l 2 penalty with a trade off parameter between the two [21]. Consequently lasso and ridge regression can be seen as special cases of Elastic-net. See [22] and references therein. The Elastic-net problem can be stated as Thus when a = 1 corresponds to lasso and a = 0 corresponds to ridge.
Elastic-Net (with non-zero a) can be easily extended for genome wide associate studies by use of the nonzero b parameters selected when training the data. That is, the l 1 penalty achieves a sparse solution, and in turn signals which variables contribute most when training on the data. rrBLUP (Ridge regression BLUP) [1,9] is one of the most popular methods for genetic trait prediction. rrBLUP simply is ridge regression with a specific choice of l in (3). Specifically, Meuwissen et al. [23] assumes that the b coefficients are iid from a normal distribution such that b i~N (0, s b ). Then the choice of λ = σ 2 e /σ 2 β where σ 2 e is the residual error. In this case, the ridge regression penalized estimator is equivalent to best linear unbiased predictor (BLUP) [24]. Support vector machines (SVMs) are a tool in statistics and machine learning for the task of supervised learning [25][26][27][28][29] used for either classification or regression. Here we are interested in the latter case. Following [30], given a training set (x i , y i ), i = 1,...l, where x i ∈ R n , the goal of ε-SV regression is to find a function f(x) that is at most ε deviation from the training data y i over the training data x i , while remaining as flat as possible in the feature space. Training an SVM requires solving min w,b,ξ The data vectors x i are mapped to another space via the function j, and SVM attempts to fit the data in this higher dimensional space. Thus, the choice of j, referred to as the kernel, has a large impact. Four kernels are usually used:

Support vector regression involves solving Equation 4
given training data. The vector w, the choice of the kernel, and the choice of kernel parameters, used previously to solve Equation 4 gives a model capable of predicting future data.
The above work all aim to solve single marker genetic trait prediction. There are also lots of existing work on epistasis models for GWAS. As exhaustive search of all possible epistasis interactions is infeasible even for a small number of markers, greedy strategies [31][32][33][34][35][36] have been applied to detect epistasis effects where a subset of high-marginal effect markers, which are markers that contribute to the trait themselves, are first selected. Then the test is conducted either between all the markers in this subset or between the markers in this subset and the remaining markers. These strategies, however, miss all the possible epistasis between the low-marginal effect markers, which are shown to exist [17]. Xiang et al. [37] proposed an optimal algorithm to efficiently detect epistasis without conducting an extensive search. A data structure is created to effectively prune interactions that are potentially insignificant. These work focus on GWAS and they do not require a quantitative encoding. As a result, none of the existing work investigated the effects of encoding for genetic trait prediction, where quantitative encoding is a must. As multiplication is one of the most popular epistasis models, in this work, we consider only the multiplication model for epistasis.

Methods
Genotype usually has three values, one for the homozygous major allele, one for the homozygous miner allele and one for the heterozygous allele. In the traditional encoding {0, 1, 2}, 1 is the value for the heterozygous genotype, 0 and 2 are for the homozygous genotype, one on major allele, one on miner allele. All the markers are encoded the same way. In this work, we propose two data-driven encoding strategies. The first encoding strategy is called pure data-driven encoding. The new encoding for genotype of value 0 at marker i is computed as E (i, 0) = Ave(trait(i, 0)), where E(i, 0) is the new encoding for genotype of value 0 at maker i, trait(i, 0) is the set of traits for the samples whose genotypes are 0 at marker i, Ave() is the function to compute the average value. Thus we also have E(i, 1) = Ave(trait(i, 1)), E(i, 2) = Ave(trait (i, 2)).
The first encoding strategy is pure data-driven. It doesn't distinguish homozygous genotypes with heterozygous genotypes. Thus we developed the second encoding strategy hybrid data-driven encoding, where we still have E(i, 0) = Ave(trait(i, 0)) and E(i, 2) = Ave(trait(i, 2)) for the two homozygous genotypes 0, 2. However, for the heterozygous genotype 1, we have E(i, 1) = Ave(trait(i, {0, 1, 2})), where trait(i, {0, 1, 2}) indicates the trait values for all the samples. The intuition of this encoding strategy is from the traditional encoding that the encoded value of the heterozygous genotype should be the average of the values for homozygous genotypes. However, we would also like to take the trait values into consideration so that the encoding of the heterozygous genotype not only depends on the encoding of the homozygous genotypes, but also the corresponding trait values. As we will show later in the experiments, the pure data-driven encoding is in general worse than the traditional encoding, and the hybrid data-driven encoding is in general better than the traditional encoding.
The data-driven encoding strategies can be naturally extended to pairwise epistasis effects or even higher dimensional epistasis effects. As the hybrid data-driven encoding has better performance, the extension is based on the hybrid data-driven encoding method. As shown in Figure 1, for pairwise epistasis effects, given the traditional encoding {0, 1, 2}, we have 9 possible combinations for markers i and j, organized in the 3 × 3 grid matrix. Assuming 0 is the traditional encoding for homozygous genotype with major allele, 2 is the traditional encoding for homozygous genotype with miner allele, 1 is traditional encoding for heterozygous genotype, then the cell (0, 0) (from now on, for simplicity, we ignore the marker indices i, j for the cell) is the traditional encoding for a pair of homozygous genotypes, both with major allele, the cell (2, 2) is the traditional encoding for a pair of homozygous genotypes, both with miner allele, the cell (1, 2) is the traditional encoding for a pair of heterozygous genotype and homozygous genotype with miner allele. The meaning of the other cells can be inferred similarly.
The same data-driven encoding algorithm can be further extended to higher dimensional epistasis effects. In this work, we only focused on the application of the data-driven encoding algorithm on single marker and pairwise epistasis effects.
As we will show later in the experiments in Section, the data-driven encoding not only is able to improve the performance for the epistasis model, but also for the single marker model. Next we investigate the reason that the data-driven encoding is able to improve the performance of the prediction in general. For the traditional encoding, the same genotype of different markers are encoded with the same value. However, from GWAS, we know that different SNPs have different associations to the trait. Thus the markers contribute differently to the trait. Therefore, constraining the same genotype of different markers with the same value may not be appropriate to obtain the best contrast factors, or the coefficients of the markers for the regression. Our encoding method, on the contrary, allows the same genotype of different markers to have different values. And the higher the association between the marker and the trait is, the more close the new encoding values to the trait value. The regression based on the new encodings thus can more effectively identify the contributions of the markers, leading to better accuracy. This also explains why the data-driven encoding works better for olygogenic traits, which depend on relatively few number of markers. For these olygogenic traits, a few markers are significantly associated with them. The datadriven encodings of these markers are close to the value of the traits and they can be more easily identified in the regression. Thus the prediction can be improved. On the contrary, for polygenic traits, which depend on a large number of markers, all these markers have similar associations. Thus the new encoding values are similar to each other. So the data-driven encoding is not able to improve the prediction performance much. In our future work, we would like to propose a theoretical analysis for the performance of the data-driven encoding.

Results
As rrBLUP is one of the most commonly used methods for genetic trait prediction, in our experiments, we evaluate the prediction accuracy of rrBLUP for different encodings.

Effects of different encodings
We first illustrate that different encodings lead to different prediction performances for epistasis model defined in Formula 2, thus a data-driven encoding has the benefit that we do not need to worry about the selection of the encodings. We use the Maize data set [7], which consists of two maize diversity panels with 300 Flint and 300 Dent lines developed for the European CornFed program. The two panels, Flint and Dent, were genotyped using a 50 k SNP array, which after removing SNPs with high rate of missing markers and high average heterozygosity, yielded 29,094 and 30,027 SNPs respectively. Both of them contain 261 samples. In this experiment, we use only Dent data set. For a pair of SNPs, we consider the following four combinations of encodings (SNP 1  We test all pairwise epistasis effects under the four combinations of encodings and rank the epistasis effects Figure 1 The data-driven encoding for pairwise epistasis of markers i, j. He et al. BMC Bioinformatics 2015, 16(Suppl 1):S10 http://www.biomedcentral.com/qc/1471-2105/16/S1/S10 according to their mutual information to the trait. Then we measure the correlation of the ranks among these four combinations. Spearman's rank correlation is a popular method to compare two ranks. However, it requires that the two ranks have the same set of elements. While in our case, the ranks reported by different encodings overlap but may contain different SNP pairs. So we adopted an average accuracy correlation method [38]. The results for Dent trait Tass are shown in Table 1. We can see that the ranks from different encoding strategies have very low correlations, indicating that different encodings lead to completely different epistasis effects.
Next we compare the performance of rrBLUP for the two encodings {0, 1, 2} and {−1, 0, 1} and we apply rrBLUP on all the three traits of Dent under each encoding, respectively. We applied the epistasis model in Formula 2 and show the average results of 10-fold cross validation in Table 2. The performance is measured as the square of the person's correlation, or r 2 . The larger the r 2 , the better the regression is and the better the prediction is. As we can see, the two encodings lead to very different accuracies and {0, 1, 2} is a better encoding on the Dent data set.

Simulated data
We next compare the predictive performance of rrBLUP based on the traditional encoding {0, 1, 2} and the two versions of data-driven encodings we proposed. We simulate both polygenic traits and oligogenic traits. We randomly generate the genotype matrix X of size 100 × 500, namely 100 samples each with 500 genotypes. The trait is generated according to the formula Y = m i=1 β i X i + e l . The coefficient of the markers b i and the residual error e l are also randomly generated. We set the coefficients of the first s genotypes, namely b 1 , b 2 ,...,b s , as non-zero and of the remaining genotypes as 0. We vary the value of s as 5, 10, 20, 50, 100, with small s for oligogenic traits and large s for polygenic traits. Out of 100 samples, 90 are training samples and 10 are test sample. We randomly simulate 10 data sets and for each data set we conduct 10-fold cross validation. We compute the average prediction performance and show the results in Table 4.
We can see that when s is small, for example, s = 5, 10, 20, the traits are oligogenic and the performance of the hybrid encoding is better than that of the traditional encoding. However, when s is big, for example, s = 50, 100, 200, the traits are polygenic, the performance of the hybrid encoding is worse than that of the traditional encoding. We also observe that the pure data-driven encoding works poorly for all cases, indicating that it is important to also take into consideration the relationship of homozygous and heterozygous genotypes.

Real data
Next we apply the new encoding strategy to four different data sets. As the pure data-driven encoding shows poor performance, we evaluate only the hybrid data-driven encoding. We compare the performance of rrBLUP on both the traditional encoding and the hybrid data-driven encoding and show the average r 2 of 10-fold cross validation.
The first data set is the Maize data set [7] which was used to evaluate the effects of different encodings in Section. As we can see in Table 3, for all six traits, the data-driven encoding achieves better performance than the traditional encoding does.
The second data set is the Asian rice, Oryza sativa, data set [39]. This data set was based on 44,100 SNP variants from 413 accessions of O. sativa, taken from 82 countries containing 34 phenotypes. We selected two phenotypes, one is polygenic (Protein.content), one is oligogenic (Pericarp.color). The data sets have 36,901 markers and 413 samples. We again vary the number of selected features as 500, 1000. As shown in Table 3, for the oligogenic trait (Pericarp.color), the performance of the data-driven encoding is significantly better than that of the traditional encoding. On the contrary, for the polygenic trait (Protein.content), the two encodings achieve similar performance. This is consistent with our simulation, namely the data-driven encoding works better for oligogenic traits.
The third data set is Pig data set, which is a collection data on male and female pigs born since 2000 and was taken from [5] and consists of 3,534 animals from a single PIC nucleus pig line yielding 52,842 SNPs with five measured traits (phenotypes). Only traits 2 and 4 were selected for study here. As described in [5], genotypes  were sequenced from the Illumina PorcineSNP60 chip and full pedigree information is available, which we did not use in this study. In the original study, trait 2 was rescaled by a weighted mean of corrected progeny phenotypes. Whereas trait 4 was corrected for environmental factors such as year of birth and location. Genotypes were filtered for minor allele frequency less than 0.001 and with missing genotypes less than 10%. The original study used AlphaImpute to impute any missing data [14]. As we can see in Table 3, for both traits, the datadriven encoding achieves better performance. The fourth data set is QTLMAS data set, which was taken from the QTL-MAS Workshop, which was held on May 17-18, 2010 in Poznan Poland [1]. The data set consists of 3,226 individuals over five generations (F0-F4) with 20 founders, five male and 15 females. There were two phenotype traits, the first a quantitative trait and the second a binary trait. Only the first four generations (2,326 individuals) have phenotype records. The genome is approximately 500 million bp with five chromosomes, each 100 million bp. In total, each individual was genotypes for 10,031 biallelic SNPs. We can see in Table 3, data-driven encoding achieves better performance for Trait 1 and worse performance for Trait 2.
For genetic prediction, an improvement of 5% is considered as significant. As shown in Table 3, in general the data-driven encoding is able to improve the prediction performance and in many cases the improvement is significant. Thus even for single marker model, the datadriven encoding is superior to the traditional encoding.
We also applied the data-driven encoding strategy on the epistasis model shown in Formula 2. We computed all pairs of epistasis effects and selected the top 2,000 pairs, using their relevance, measured by mutual information, to the trait. We then include these top 2,000 pairs of epistasis effects in Formula 2. We conducted the experiments only on the Maize data set as it is the only data set that is small enough for an extensive search.
As we can see in Table 5, the performance of epistasis is better for most of the traits when using the data-driven encoding. For the two traits where the traditional encoding is better, the performance of the data-driven encoding is only slightly worse than that of the original encoding.

Conclusions
In this work, we showed that the genetic trait prediction problem heavily depends on the encoding of genotypes,  especially when epistasis effects are considered. We developed a data-driven encoding method which is simple but effective with the benefits that we don't need to choose between different encodings. Out experiments show that the data-driven encoding is able to improve the prediction accuracy for both single marker model and epistasis model, especially for olygogenic traits. To our knowledge, this is the first work that discusses the effects of encodings for genetic trait prediction problem. In our future work, we would like to theoretically analyze the effects of the data-driven encoding for the genetic trait prediction problem.