Does encoding matter? A novel view on the quantitative genetic trait prediction problem

Background Given a set of biallelic molecular markers, such as SNPs, with genotype values encoded numerically on a collection of plant, animal or human samples, the goal of genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Genetic trait prediction is usually represented as linear regression models which require quantitative encodings for the genotypes. There are lots of work on the prediction algorithms, but none of the existing work investigated the effects of the encodings on the genetic trait prediction problem. Methods In this work, we view the genetic trait prediction problem from a novel angle: a multiple regression on categorical data problem, which requires encoding the categorical data into numerical data. We further proposed two novel encoding methods and we show that they are able to generate numerical features with higher predictive power. Results and Discussion Our experiments show that our methods are superior to the other encoding methods for both single marker model and epistasis model. We showed that the quantitative genetic trait prediction problem heavily depends on the encoding of genotypes, for both single marker model and epistasis model. Conclusions We conducted a detailed analysis on the performance of the hybrid encodings. To our knowledge, this is the first work that discusses the effects of encodings for genetic trait prediction problem.


Background
Whole genome prediction of complex phenotypic traits using high-density genotyping arrays has attracted a lot of attention, as it is relevant to the fields of plant and animal breeding and genetic epidemiology [1][2][3][4][5][6][7][8]. Given a set of biallelic molecular markers, such as SNPs, with genotype values typically encoded as {0, 1, 2} on a collection of plant, animal or human samples, the goal is to predict the quantitative trait values by simultaneously modeling all marker effects.
More specifically, the genetic trait prediction problem is defined as follows. Given n training samples, each with m n genotype values (we use "feature", "marker", "genotype", "SNP" interchangeably) and a trait value, and *Correspondence: dhe@us.ibm.com IBM T.J. Watson Research, Yorktown Heights, NY, USA a set of n test samples each with the same set of genotype values but without trait value, the task is to train a predictive model from the training samples to predict the trait value, or phenotype of each test sample based on their genotype values. Let Y be the trait value of the training samples. The problem is usually represented as the following linear regression model: where X i is the i-th genotype value, m is the total number of genotypes and β i is the regression coefficient for the ith genotype, e l is the error term. We call this model single marker model.
The above model assumes that only the single markers, or main effects, play a role for the prediction. However, it is known that the interactions of the markers may also contribute to the genetic traits under certain conditions, which is known as Epistasis [15]. The pairwise epistasis between two markers i and j is often modeled as the product of the two genotype values. Therefore, with the traditional representation, the linear regression model with pairwise epistasis effects is modified as the following: where X i X j is the product of the genotype values of the i-th and j-th genotype and it denotes the interaction of the two genotypes, α i,j represents the coefficient for the interaction. Thus in this epistasis model, the epistasis effects are considered as augmented genotypes besides the original genotype matrix X. We call this model epistasis model. As multiplication is one of the most popular epistasis models, in this work, we consider only the multiplication model for epistasis. Genotypes for a marker can be either homozygous or heterozygous. For Genome Wide Association Study (GWAS), we only need to identify the association between a marker and the case/control trait. Therefore, we care more about whether genotypes are homozygous or heterozygous, the specific alleles and their frequencies. The genotypes don't necessarily need to be quantitative. They are usually represented as a pair of alleles, for example "AA" and "TT" for homozygous genotypes, "AT" for heterozygous genotype.
On the other hand, for genetic trait prediction problem, in Eq. 1, the genetic trait values Y are quantitative. Thus the genotypes X i needs to be quantitative as well. Researchers generally assign three distinct encodings to the three possible genotype values. A few common sets of encodings for genotypes are {0, 1, 2}, where 0 and 2 are for homozygous genotypes and 1 is for heterozygous genotype, and {−1, 0, 1}, where -1 and 1 are for homozygous genotypes and 0 is for heterozygous genotype.
There have been lots of work on predicting genetic trait values from genotype data, such as rrBLUP (Ridge regression BLUP) [1], Elastic-Net, Lasso, Ridge Regression [10,11], Bayes A, Bayes B [1], Bayes C π [12], and Bayesian Lasso [13,14], as well as other machine learning methods. However, all previous work consider genetic trait prediction problem as a regression problem on a numerical data set. We, on the contrary, look at the problem from a totally different angle: we consider the problem as a problem of multiple regression on categorical data, namely a regression on multiple categorical features. The genotype of each marker has three possible categories: homozygous with major allele, homozygous with minor allele and heterozygous. In order to conduct regression on categorical data, we need to first encode the categorical data. Many encoding methods have been proposed for categorical data, including dummy encoding, ordinal encoding, target-based encoding etc. The traditional coding of {0,1,2} is indeed the ordinal coding, which assumes that the categories follow certain order. In this problem setting, the three categories can be considered as following the order of the number of major or minor alleles.
In this work, we first review the existing encoding mechanisms and we show that only ordinal encoding and target-based encoding are appropriate for the genetic trait prediction problem. The ordinal encoding encodes the three categories into three unique numerical values, such as {0,1,2} or {-1,0,1}. The advantage of the ordinal encoding is that the order of the categories are maintained by the encoding. The target-based encoding encodes the three categories as follows: For each category g of each marker i, we identify the set of trait values for the samples whose category is g at marker i. Then we take the average of this set of traits and assign the genotype with this average value. The advantage of the target-based encoding is that each marker, or feature, can be encoded differently according to the data. We also observe that the ordinal encoding method always encodes different markers with the same set of numerical values. On the contrary, the target-based encoding methods can not maintain the order of the categories. To combine the advantages of both mechanisms and to address their drawbacks at the same time, we further developed two hybrid encoding methods. The hybrid methods conduct target-based encoding for the two homozygous categories first, then encode the heterozygous category either as the mean of the trait for all samples, or the mean of the two homozygous categories. Thus they allow the flexibility of the encodings, where different markers can be encoded differently. We showed that the encoded value for the heterozygous category is always bounded by the two values of the homozygous categories. Therefore the hybrid encoding methods maintain the order of the categories. We also extended these hybrid encoding methods to epistasis model. We showed that our hybrid encoding methods are superior to both ordinal and target-based encodings for both single marker model as well as epistasis model. Due to space limit, we did not include the experimental results for the epistasis model, which will be included in the extended version of the work.

Preliminaries
Given the traditional encoding of genotypes as {0, 1, 2}, lots of techniques have been applied to the genetic trait prediction problem defined in Eq. 1. Consider the typical situation for linear regression, where we have the training set y ∈ R l , x ∈ R l×n , in a standard linear regression, we wish to find parameters β 0 , β such that the sum of square 2 , is minimized. Many machine learning methods have been applied to the genetic trait prediction problem, such as Elastic-Net, Lasso, Ridge Regression [10,11], Bayes A, Bayes B [1], Bayes C π [12], and Bayesian Lasso [13,14]. As in this work, we applied rrBLUP and SVR, we mainly focused on reviewing these two techniques.
rrBLUP (Ridge regression BLUP) [1,9] is one of the most popular methods for genetic trait prediction. rrBLUP simply is ridge regression with a specific choice of λ in (3). Specifically, Meuwissen et al. [16] assumes that the β coefficients are iid from a normal distribution such that β i ∼ N(0, σ 2 β ). Then the choice of λ = σ 2 e /σ 2 β where σ 2 e is the residual error. In this case, the ridge regression penalized estimator is equivalent to best linear unbiased predictor (BLUP) [17]. Support vector machines (SVMs) are a tool in statistics and machine learning for the task of supervised learning [18][19][20][21][22] used for either classification or regression. Here we are interested in the latter case. Following [23], given a training set (x i , y i ), i = 1, . . . l, where x i ∈ R n , the goal of -SV regression is to find a function f (x) that is at most deviation from the training data y i over the training data x i , while remaining as flat as possible in the feature space. Training an SVM requires solving min w,b,ξ The data vectors x i are mapped to another space via the function φ, and SVM attempts to fit the data in this higher dimensional space. Thus, the choice of φ, referred to as the kernel, has a large impact. Four kernels are usually used: Support vector regression involves solving Eq. 4 given training data. The vector w, the choice of the kernel, and the choice of kernel parameters, used previously to solve Eq. 4 gives a model capable of predicting future data.

Encoding mechanisms and evaluation
For a linear regression problem shown in Eq. 1, different encodings would not change the regression result as the coefficients and error terms would compromise the difference of the encodings. Assuming the old encoding and new encoding for the i-the feature are X i and X i respectively, we could always have We can see that the regression for the new encoding and old encoding indeed differ only by the error term. As the error term e follows a normal distribution N(0, δ), For different encodings, the δ could be different, thus lead to different regression performance. Therefore encodings does matter to the regression task, as we will show in the next few sections.

Different encoding mechanisms
In this work, we view the quantitative genetic trait prediction as a multiple regression on categorical data. Multiple regression problem on categorical data requires encoding. Various encoding mechanisms have been proposed. The most common one is called dummy encoding. In general, a categorical variable with k levels will be transformed into k − 1 dichotomous variables each with two levels. For example, for a variable of three possible values, or levels, we could transform it into two dummy dichotomous variables A and B. For value one, we assign A = 1, B = 0. For value two, we assign A = 0, B = 1. For value three, we assign A = 0, B = 0. Thus for n markers, we need at least 2n dummy dichotomous variables. As the complexity for rrBLUP is O(m 2 ), where m is the number of markers, the complexity with dummy encoding becomes O(4n 2 ). Given in our problem setting n is usually tens of thousands, O(4n 2 ) in reality is significantly bigger than O(n 2 ). Another issue is that using dummy encoding, we are not able to obtain a single coefficient for a marker, which is generally considered as the importance of the markers for plant and animal breeding. A set of similar encoding mechanisms such as Forward Difference Coding, Backward Difference Coding, Helmert Coding, Reverse Helmert Coding, Deviation Coding, Orthogonal Polynomial Coding all have the same issues for a combination of categorical variables for similar reasons.
Another very popular encoding method that addresses the scalability issue is ordinal encoding. Ordinal encoding assumes that the categories follow certain order and then encodes the categories with numerical values such as 0, 1, 2. This is indeed the case for our problem setting where the three categories can be considered as following the order of the number of major or minor alleles. When a combination of categories are considered, each category is encoded independently. For genotype encoding, a traditional way is to encode the three categories the same way across all markers. A different encoding mechanism, Target-based encoding encodes each category as the mean of the target variable for that specific category. This encoding method allows each marker to be encoded differently. However, the order of the categories are not maintained. Thus for a combination of categorical variables, the order of the categories of each variable is relatively random.
In this work, in order to address the drawbacks of the ordinal encoding and the target-based encoding while maintaining both of their advantages, we develop two hybrid encoding methods. Assuming 0 stands for the homozygous genotype with major allele, 2 stands for the homozygous genotype with minor allele, 1 stands for the heterozygous genotype, the first hybrid method computes new encodings of genotypes at marker i as the follows: where E(i, 0) is the new encoding for genotype of value 0 at marker i, trait(i, 0) is the set of traits for the samples whose genotypes are 0 at marker i, Ave() is the function to compute the average value. We call this method Hybrid One.
The second hybrid method computes E(i, 0) and E(i, 2) the same as algorithm hybrid one does. However, instead of the average of the trait, E(i, 1) is computed as the average of E(i, 0) and E(i, 2). We call this method Hybrid Two.
We can see that for both hybrid one and hybrid two, E(i, 0) and E(i, 2) are computed the same as those from target-based encoding. However, target-based encoding computes E(i, 1) as Ave(trait(i, 1)) which then loses the order of the categories. For both hybrid one and hybrid two, it is guaranteed that E(i, 1) = Ave(trait(i, {0, 1, 2})) and E(i, 1) = E(i,0)+E(i,2) 2 are in between of E(i, 0) and E(i, 2). Thus the order of categories is maintained. The difference is that E(i, 1) = Ave(trait(i, {0, 1, 2})) is closer to E(i, 0) as 0 stands for the heterozygous with major allele, where most of the samples have this genotype, thus the average of the whole trait values is close to E(i, 0). On the contrary, E(i, 1) = E(i,0)+E(i,2) 2 requires its value as the mean of E(i, 0) and E(i, 2). From our experiments, the second strategy achieves slightly better performance.
We show an example in Fig. 1. The figure on the left shows the multiple regression problem. The figures on the right show the codings of different encoding methods. Notice we just simply give some sample regression and the regressions might not be perfect. As we can see, for ordinal encoding, X 1 is positively correlated to Y, X 2 is negatively correlated to Y. For the other encodings, both X 1 and X 2 are positively correlated to Y.
Notice for the ordinal encoding, we could check if the correlation between a feature and the trait is positive or not. If the correlation is negative, we could flip 0 and 2 to make the feature positively correlated to the trait. This strategy would work well for single marker model. However, it can not be extended to the epistasis model where we could have 9 values for each feature for the pairwise epistasis case and 3 k values for the k-way interactions.

Performance analysis of different encoding mechanisms
As we will show later in the experiments in Section "Results", the hybrid encoding methods are able to improve the performance not only for the epistasis model, but also for the single marker model. Next we investigate the reason that the hybrid encodings are in general superior to the other encoding methods.

Importance of encoding flexibility
The issue of the traditional encoding, or the ordinal encoding for multiple regression problem is that as all the categories are encoded the same way for different markers. Our encoding methods, as well as the target based encoding, encode each marker by assigning values similar to the trait values according to how the genotype categories are distributed for the marker. As we will show later in our experiments, this strategy is able to improve the Fig. 1 Examples of different encodings correlation of the encoded features to the trait value and thus tends to improve the regression performance.

Importance of maintaining the order of categories
The issue of target-based encoding is that the encoded values of the categories completely depend on the trait and therefore the order of the categories are not maintained. The encoded value for the heterozygous category for target-based encoding could be smaller or greater than the values for both homozygous categories. On the contrary, for the hybrid encoding, the value for the heterozygous category is always in between of the values of the two homozygous categories and therefore the order of the categories are maintained.
To evaluate the importance of such category order maintenance, we calculate the distance between samples for different encodings, where we consider each sample as a m dimensional vector and m is the number of features. The distance is computed as the follows.
We first compute the z-score of trait value for each sample as x−μ δ 2 , where x is the original trait value, μ is the mean of all the traits, δ is the standard deviation of all the traits. Then, we assign discretized values to samples according to their trait z-score using the following formula: Then we consider the distance of the samples with identical discretized trait value. The intuition is that when a set of samples have similar trait values and they are close to each other in the feature space, the residual error of the regression tend to be small. An extreme scenario is all the samples have the same trait value and also the same feature values, namely all these samples are identical points in the feature space, the regression will have residual error 0. On the contrary, if the samples are far from each other while they have the same trait value, the residual error tend to be large.
As for different encodings the scales of the encoded values are different, we normalize the encoded values and compute the z-score of the j-th marker for the i-th sam- , where z(i, j) is the z-score for sample i at marker j, x(i, j) is the encoded value of sample i at marker j, μ(j) is the mean of the encoded values for marker j for all samples, var(j) is the variance of the encoded values for marker j. Once we compute the z-score of each marker for each sample under each encoding, we measure the pairwise distance between every pair of samples i, j using Euclidean distance as dis(i, j) = (z(i, k) − z(j, k)) 2 , for 1 ≤ k ≤ d and d is the total number of markers. Then we compute the average distance. The smaller the average distance is, the closer the samples are. As we will show later, the target-based encoding has higher sample-wise distance than the ordinal encoding and hybrid encoding, which explains the observation that the hybrid encoding methods lead to better regression performance.
As we can see, both the encoding flexibility and the order of the categories are important for multiple regression on categorical data. Our hybrid encoding methods keep the encoding flexible among markers and in the meanwhile maintain the order of the categories. Therefore our methods achieve better performance than both the ordinal encoding and the target-based encoding do.

Extension to epistasis model
The hybrid encoding strategies can be naturally extended to pairwise epistasis effects or even higher dimensional epistasis effects. As shown in Fig. 2, for pairwise epistasis effects, given the traditional encoding {0, 1, 2}, we have 9 possible combinations for markers i and j, organized in the 3 × 3 grid matrix. Assuming 0 is the traditional encoding for homozygous genotype with major allele, 2 is the traditional encoding for homozygous genotype with minor allele, 1 is traditional encoding for heterozygous genotype, then the cell (0, 0) (from now on, for simplicity, we ignore the marker indices i, j for the cell) is the traditional encoding for a pair of homozygous genotypes, both with major allele, the cell (2, 2) is the traditional encoding for a pair of homozygous genotypes, both with minor allele, the cell (1, 2) is the traditional encoding for a pair of heterozygous genotype and homozygous genotype with minor allele. The meaning of the other cells can be inferred similarly.

Results
As rrBLUP is one of the most commonly used methods for genetic trait prediction, in our experiments, we evaluate the prediction accuracy for different encodings mainly using rrBLUP.
We apply the new encoding strategy to four different data sets, summarized in Table 1. We compare the performance of rrBLUP on both the traditional encoding and the two hybrid encodings and the target-based encoding. As r 2 , the square of the Pearson's correlation coefficient is the most common evaluation metric for genetic trait prediction problem, we show the average r 2 of 10-fold cross validation. Notice all the encodings are generated only from the training data and then applied to the test data accordingly.
The first data set is the Maize data set [7] which consists of two maize diversity panels with 300 Flint and 300 Dent lines developed for the European CornFed program. The two panels, Flint and Dent, were genotyped using a 50k SNP array, which after removing SNPs with high rate of missing markers and high average heterozygosity, yielded 29,094 and 30,027 SNPs respectively. Both of them contain 261 samples.
The second data set is the Asian rice, Oryza sativa, data set [24]. This data set was based on 44,100 SNP variants from 413 accessions of O. sativa, taken from Table 2 Performance of rrBLUP (average r 2 ) on the traits of four real data sets under the traditional encoding vs. the hybrid encodings vs. the target-based encoding. We also show the improvements of the hybrid encodings over the traditional encoding  The third data set is Pig data set, which is a collection data on male and female pigs born since 2000 and was taken from [5] and consists of 3,534 animals from a single PIC nucleus pig line yielding 52,842 SNPs with five measured traits (phenotypes). Only traits 2 and 4 were selected for study here. As described in [5], genotypes were sequenced from the Illumina PorcineSNP60 chip and full pedigree information is available, which we did not use in this study. In the original study, trait 2 was rescaled by a weighted mean of corrected progeny phenotypes. Whereas trait 4 was corrected for environmental factors such as year of birth and location. Genotypes were filtered for minor allele frequency less than 0.001 and with missing genotypes less than 10 %. The original study used AlphaImpute to impute any missing data [14].
The fourth data set is QTLMAS data set, which was taken from the QTL-MAS Workshop, which was held on May 17-18, 2010 in Poznan Poland [1]. The data set consists of 3,226 individuals over five generations (F0-F4) with 20 founders, five male and 15 females. There were two phenotype traits, the first a quantitative trait and the second a binary trait. Only the first four generations (2,326 individuals) have phenotype records. The genome is approximately 500 million bp with five chromosomes, each 100 million bp. In total, each individual was genotypes for 10,031 biallelic SNPs.
For genetic prediction, to our knowledge, there is no method can achieve consistently better performance than rrBLUP does with similar running time. Also compared with rrBLUP, even for cases where the performance can be improved, most of the other methods can not make an improvement over 5 %. Thus we consider an improvement of 5 % as significant. As shown in Table 2, in general the hybrid encodings are able to improve the prediction performance and in many cases the improvement is significant. The target-based encoding is slightly better than the traditional encoding, but worse than both hybrid encodings. Thus for single marker model, the hybrid encodings are superior to the traditional encoding and the targetbased encoding. The two hybrid encodings have similar performance.
We also conducted SVR (support vector regression) with sigmoid kernel on all the data sets with different encodings. We show only the results for the Maize data. The results are shown in Table 3. We can see that the two hybrid encoding methods achieve almost identical accuracies, both are higher than the accuracy from the target based encoding. The traditional encoding has the worst accuracy.
Next we compute the average correlation of the top-100 markers with the highest absolute correlation values (as the correlation could be either positive or negative) to the trait under different encoding methods. The results are shown in Table 4. We show only the results for the Maize data. We can see that for the ordinal encoding, the average correlations are smaller than those of the other encoding  methods, indicating that by allowing encoding flexibility, we could potentially improve the regression performance. The target-based encoding has the highest average correlation. However, due to its lack of category order maintenance, its performance is worse than those from the hybrid methods. The two hybrid methods have identical average correlations which are slightly lower than that of the target-based encoding.
In order to show the importance of category order maintenance, we show in Table 5 the average pairwise distance of the samples for each encoding method. Due to space limit, we show only the results for the Maize data. We can see that the target-based encoding has the biggest pairwise distance while the hybrid encoding methods have lower pairwise distance. The traditional encoding has lower pairwise distance, but due to its mixture of both positively-correlated and negatively-correlated features, its performance is worse than those of the hybrid encodings.
We also applied the hybrid encoding strategies on the epistasis model shown in Formula 2. Due to space limit, we did not include the experimental results for the epistasis model, which will be included in the extended version of the work. However, our experiments indicate that the hybrid encoding strategies improved the prediction performance on the epistasis model as well.

Conclusions
In this work, we showed that the quantitative genetic trait prediction problem heavily depends on the encoding of genotypes, for both single marker model and epistasis model. We developed two hybrid encoding methods which are simple but effective. Our experiments show that the hybrid encodings are able to improve the prediction accuracy for both single marker model and epistasis model. We also conducted a detailed analysis on the performance of the hybrid encodings. To our knowledge, this is the first work that discusses the effects of encodings for genetic trait prediction problem. In our future work, we would like to develop more effective encoding methods for both single marker and epistasis models. We would also like to investigate the effects of variation of allele frequency between train and test data and the effects of correlation of markers (linkage).

Declarations
This article has been published as part of BMC Bioinformatics Vol 17 Suppl 9 2016: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2015: genomics. The full contents of the supplement are available online at http://bmcbioinformatics.biomedcentral. com/articles/supplements/volume-17-supplement-9.

Funding for publication
Publication of this article was funded by IBM T.J. Watson Research.

Availability of data and material
The datasets are publicly available.