Machine learning approach for pooled DNA sample calibration

Background: Despite the ongoing reduction in genotyping costs, genomic studies involving large numbers of species with low economic value (such as Black Tiger prawns) remain cost prohibitive. In this scenario DNA pooling is an attractive option for reducing genotyping costs. However, genotyping of pooled samples comprising DNA from many individuals is challenging due to the presence of errors that exceed the allele frequency quantisation size and therefore cannot simply be corrected by clustering techniques. The solution to the calibration problem is a correction to the allele frequency that mitigates errors incurred in the measurement process. We highlight the limitations of existing calibration solutions, namely that they impose assumptions on the variation between allele frequencies 0, 0.5, and 1.0, and address a limited set of error types. We propose a novel machine learning method to address these limitations.

Results: The approach is tested on SNPs genotyped with the Sequenom iPLEX platform and compared to existing state of the art calibration methods. The new method is capable of reducing the mean square error in allele frequency to half that achievable with existing approaches. Furthermore, for the first time we demonstrate the importance of carefully considering the choice of training data when using calibration approaches built from pooled data.

Conclusion: This paper demonstrates that improvements in pooled allele frequency estimates result if the genotyping platform is characterised at allele frequencies other than the homozygous and heterozygous cases. Techniques capable of incorporating such information are described along with aspects of implementation.


Background
Recently the Illumina HiSeq X Ten [1] achieved a new low in per genome sequencing cost, continuing the ongoing reduction in cost per genome since 2001 [2]. These cost reductions now make it practical to genotype individuals in large association studies of humans. However, this is not the case for studies involving large populations of low economic value species, where contemporary genotyping technology is cost prohibitive. The cost benefits achieved in [1] have not been realised on platforms based on alternative technology, such as Sequenom, and therefore pooling is still required in this scenario. This is evidenced by the ongoing use of DNA pooling in studies on low economic value species, specifically to reduce cost [3,4]. DNA pooling has been shown to provide a cost benefit over individual genotyping [5] and allows a broader community access to genetic association studies.
Pooling techniques date back to 1943, when blood from soldiers was pooled for disease testing [6], and pooling of DNA was first proposed in 1985 [7]. The field advanced rapidly and in 2002 a broad review of the approach (applied to SNP data) was published [8]. DNA pooling combines DNA from multiple individuals into a single sample which can be genotyped once, as opposed to genotyping each individual; this reduces the cost of genotyping by a factor equal to the number of individuals in the pooled sample. In general pooling strategies are more complex and involve genotyping duplicate pools multiple times; the efficiency of pooling approaches is given in [8]. The general pooling approach changes the measurement from detecting whether or not a substance is present to measuring the concentration of the substance. In the case of DNA pooling, the 'substances' are the discrete SNP genotypes AA, AB, BB with corresponding A-allele frequencies 1, 1/2, 0, and the 'concentration' is equivalent to the real valued A-allele frequency within the range [0, 1].
The most significant drawback of the pooling approach is the error incurred in the process of measuring the pool's allele frequency. The impact of this error is illustrated in the context of a bi-allelic quantitative trait linkage study.
Given a population and a single trait of interest, two sub-populations (α and β) are identified exhibiting opposing extremes of the trait. From each sub-population a subset of individuals is selected, DNA acquired from each individual and combined into a single pooled sample representative of the respective subset. The two pooled samples are genotyped and their allele frequencies are compared. Both fixed and variable errors in the allele frequency measurement impact the power of such a study:

Z = (f_α − f_β) / √(V_α + V_β)   (1)

where Z is the study test statistic, f_α and f_β are the best estimates of the A-allele frequency of the two sub-populations, and V_α and V_β are the variances in f_α and f_β. If the genotyping hardware responses for a sample's allele A and allele B are H_A and H_B respectively, then typically the sample's allele frequency f is calculated from:

f = H_A / (H_A + H_B)   (2)

Three main factors contribute to the allele frequency variation: sampling error E_s (due to the limited pool size), sample construction error E_p (due to non-ideal pool construction resulting from unequal contributions of individuals to the pooled sample), and allele frequency measurement error E_m (due to chemistry and detection errors in the genotyping process). If the true sub-population allele frequency is p, then these errors result in a measured allele frequency f = p + E_s + E_p + E_m. The variance introduced in f by approximating the sub-population with N individuals is the expectation of the square error: E[E_s²] = p(1 − p)/2N. Similarly the unequal contributions to a pool from individual samples contribute a variance component E[E_p²] = τ p(1 − p)/2N [9], where τ is the standard deviation in the fractions of the pool contributed by the individuals. A thorough analysis of these errors under different sampling conditions is given in [10]. Both these variance contributions can be reduced by increasing the pool size. Measurement error, however, is independent of pool size.
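To make the error model concrete, the decomposition f = p + E_s + E_p + E_m can be sketched in a short simulation. The pool size, τ and measurement noise level below are illustrative assumptions, not values from this experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pool_frequency(p, N, tau=0.05, sigma_m=0.03):
    """Simulate one pooled A-allele frequency measurement f = p + E_s + E_p + E_m."""
    # E_s: sampling error -- the pool's 2N alleles are a finite draw
    # from a sub-population with true A-allele frequency p.
    sampled = rng.binomial(2 * N, p) / (2 * N)
    # E_p: construction error with variance tau*p*(1-p)/(2N), as in [9].
    e_p = rng.normal(0.0, np.sqrt(tau * p * (1 - p) / (2 * N)))
    # E_m: measurement error, independent of pool size.
    e_m = rng.normal(0.0, sigma_m)
    return sampled + e_p + e_m

# Increasing N shrinks E_s and E_p but leaves E_m untouched.
small = np.var([simulate_pool_frequency(0.3, 20) for _ in range(5000)])
large = np.var([simulate_pool_frequency(0.3, 500) for _ in range(5000)])
```

With N = 500 the residual variance is dominated by the measurement term sigma_m², which is the component calibration targets.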
Reducing measurement error requires averaging over multiple measurements, which reduces cost effectiveness of the pooling strategy. To resolve this issue, a range of calibration techniques have been proposed for E m reduction. Three example approaches are k-correction [11], linear interpolation [12] and the polynomial-based probe specific correction (PPC) method [13].
Despite the fact that these methods were developed for different platforms, they all contain a number of similarities which allow them to be applied to data generated by the Sequenom platform. All existing calibration techniques have a mapping which takes as input the raw allele frequency resulting from the platform's response to each of the two alleles present for a SNP; the Sequenom data is also available in this format. Furthermore the SNP specific corrections are based on the platform's allele responses to multiple individuals for the SNP being corrected, and Sequenom data can also be generated for multiple individuals to provide such a data set. To explain these techniques the following notation is adopted. Given a SNP requiring calibration, and a set of individuals homozygous AA for the SNP, define AA = (H_A(AA), H_B(AA)), where H_A(AA) and H_B(AA) are the average values of H_A and H_B over the set of AA homozygous individuals. Similarly AB and BB are the average values for the heterozygous AB and homozygous BB sets of individuals respectively. The measured allele frequencies corresponding to points AA, AB, and BB are f_AB, f_AA, and f_BB respectively: f_AA for AA, f_AB for AB, and f_BB for BB. The calibration techniques all map f_AA and f_BB onto A-allele frequencies 1 and 0 respectively, with calibration specific approaches between these values to map f_AB onto A-allele frequency 0.5; how they achieve this varies between the methods. k-correction was introduced to correct for error in the PCR process [11], specifically SNP dependent unequal amplification of alleles during PCR. The correction involves using AB to calculate the ratio k = H_A(AB)/H_B(AB); ideally k = 1 in the absence of differential amplification. The ratio k is used to correct the distorted post-PCR measured quantities, resulting in the following expression for the calibrated allele frequency f′:

f′ = H_A / (H_A + k·H_B)   (3)

The k-correction approach can be applied to the Sequenom data without modification.
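A minimal sketch of the k-correction, assuming the heterozygous cluster responses are available to estimate k (function and variable names are ours):

```python
import numpy as np

def k_correction(H_A, H_B, het_HA, het_HB):
    """k-correction [11]: estimate k from the heterozygous cluster centre
    and correct for differential amplification so that the AB centre
    maps to allele frequency 0.5."""
    k = np.mean(het_HA) / np.mean(het_HB)   # k = H_A(AB) / H_B(AB)
    return H_A / (H_A + k * H_B)
```

For example, if heterozygous individuals respond with H_A twice H_B (k = 2), a raw pair (H_A, H_B) = (2, 1) is corrected to exactly 0.5 rather than the raw 2/3.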
The piece-wise linear calibration approach of Illumina [12] involves four linear transformations of (H_A, H_B) corresponding to rotation, translation, shear and scale transformations. These transform AA and BB onto the H_A and H_B axes respectively, with approximately equal amplitude. Finally a piecewise linear function maps angles on the (H_A, H_B) plane at points AA, AB, and BB onto A-allele frequencies 1, 0.5, and 0 respectively, linearly interpolating angles between these points; the entire calibration process for pools therefore involves a combination of the four transformations, calculation of the angle atan(H_A/H_B), and piece-wise linear interpolation. Our implementation of the piece-wise linear approach is similar; however, to ensure consistency across the calibration methods we utilise the form given in Eq. 2, and the corrected allele frequency f′ is:

f′ = (1/2)·(f − f_BB)/(f_AB − f_BB) for f ≤ f_AB,  f′ = 1/2 + (1/2)·(f − f_AB)/(f_AA − f_AB) for f > f_AB   (4)

A minor change is that the ratio in Equation (2) is used in calculating allele frequency, as opposed to the normalised angle (2/π)·atan(H_A/H_B). Dividing by H_A + H_B as in Eq. 2 introduces a normalising factor, and enforcing the homozygous values to 0, 1 and the heterozygous cluster centre to 0.5 is equivalent to the rotation and shear transformations. The translation transformation is not implemented, as it requires estimating the intercept of the asymptotes of the AA and BB clusters; however, the majority of the approach is captured in the expression above.
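The piecewise linear mapping of raw frequency onto [0, 1] can be sketched as follows, assuming the cluster centres satisfy f_BB < f_AB < f_AA (values outside the homozygous centres are clamped by the interpolation, an implementation choice of this sketch):

```python
import numpy as np

def piecewise_linear_calibrate(f, f_BB, f_AB, f_AA):
    """Map the raw-frequency cluster centres f_BB, f_AB, f_AA onto
    0, 0.5 and 1 and interpolate linearly in between (a sketch of the
    Illumina-style approach, using the raw frequency f = H_A/(H_A+H_B)
    instead of the angle)."""
    # np.interp requires increasing knots, hence f_BB < f_AB < f_AA.
    return np.interp(f, [f_BB, f_AB, f_AA], [0.0, 0.5, 1.0])
```

For instance, with centres (0.05, 0.45, 0.95) a raw value of 0.3 lies 62.5% of the way from f_BB to f_AB and is mapped to 0.3125.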
The polynomial-based probe specific correction (PPC) approach [13] adds a probe pair index as an additional variable for the Affymetrix platform. Specifically each SNP contains 10 probe pairs, each calibrated by a second order polynomial mapping the three allele frequency values per probe (f_AA, f_AB, and f_BB) onto 1, 0.5, 0 and interpolating between these values. Finally the 10 calibrated probe values are averaged to estimate allele frequency. Whilst the 10-probe correction is not relevant for the Sequenom data, the second order polynomial mapping can be applied directly and is the Lagrange polynomial through the three cluster points:

f′ = (1/2)·(f − f_BB)(f − f_AA)/[(f_AB − f_BB)(f_AB − f_AA)] + (f − f_BB)(f − f_AB)/[(f_AA − f_BB)(f_AA − f_AB)]   (5)

Although the expressions for the various methods are all distinct, each can be decomposed into three corrections which remove distortion in the raw allele frequency response of the platform. All methods initially include a constant and linear correction to adjust the two allele frequencies corresponding to the two homozygous cases; the methods are identical at this point. Finally a method specific distortion is applied to correct the heterozygous case allele frequency. To highlight this we define an intermediate 'homozygous corrected' allele frequency f_1 of the form:

f_1 = d_0 + d_1·f   (6)

where f is the raw allele frequency and d_0 and d_1 are constants:

d_1 = 1/(f_AA − f_BB),  d_0 = −f_BB/(f_AA − f_BB)   (7)

To correct the heterozygous case all methods apply a distortion correction D to give the fully calibrated allele frequency:

f′ = f_1 + D(f_1)

where D satisfies the following conditions:

D(0) = 0,  D(1) = 0,  D(f_1,AB) = 1/2 − f_1,AB   (8)

and where D is specific to each method. For piece-wise linear correction:

D(f_1) = (1/2 − f_1,AB)·f_1/f_1,AB for f_1 ≤ f_1,AB,  D(f_1) = (1/2 − f_1,AB)·(1 − f_1)/(1 − f_1,AB) for f_1 > f_1,AB   (9)

For PPC:

D(f_1) = (1/2 − f_1,AB)·f_1(1 − f_1)/[f_1,AB(1 − f_1,AB)]   (10)

The expression for k-correction includes both H_A and H_B terms; however, these can be eliminated by solving (2) and (3) for H_B, equating, and cancelling H_A. Furthermore, after correcting the homozygous cases using Eq. 6, the allele frequency for the heterozygous case is f_1,AB = k/(1 + k). The distortion term corrected by k-correction then takes the form:

D(f_1) = f_1(1 − f_1)(1 − k)/(f_1 + k(1 − f_1)),  with k = f_1,AB/(1 − f_1,AB)   (11)

The expression in (11) is more complicated than those in (9) and (10).
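The homozygous correction followed by a PPC-style second-order distortion can be sketched as below. The distortion term is the unique quadratic satisfying D(0) = D(1) = 0 and D(f_1,AB) = 1/2 − f_1,AB, so the three cluster centres land exactly on 0, 0.5 and 1 (function names are ours):

```python
def homozygous_correct(f, f_BB, f_AA):
    """Linear map sending f_BB -> 0 and f_AA -> 1 (constant + linear correction)."""
    d1 = 1.0 / (f_AA - f_BB)
    d0 = -f_BB * d1
    return d0 + d1 * f

def ppc_distortion(f1, f1_AB):
    """Second-order polynomial D with D(0) = D(1) = 0 and
    D(f1_AB) = 1/2 - f1_AB, correcting the heterozygous case."""
    return (0.5 - f1_AB) * f1 * (1.0 - f1) / (f1_AB * (1.0 - f1_AB))

def ppc_calibrate(f, f_BB, f_AB, f_AA):
    """Full PPC-style calibration: homozygous correction plus distortion."""
    f1 = homozygous_correct(f, f_BB, f_AA)
    f1_AB = homozygous_correct(f_AB, f_BB, f_AA)
    return f1 + ppc_distortion(f1, f1_AB)
```

By construction the raw cluster centres f_BB, f_AB and f_AA calibrate to exactly 0, 0.5 and 1.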
If distortion in the reciprocal of allele frequency is used, (11) becomes a first order distortion correction of the form in (6); however, for consistency between the three methods we have expressed all of them in terms of allele frequency.
Examination of Eqs. (9), (10) and (11) shows that they satisfy the conditions in (8). Example plots of the polynomials and distortions D are given in Figure 1. The limitations of these approaches are that they all: 1. impose assumptions on the variation between allele frequencies 0, 0.5 and 1.0; and 2. address a limited set of error types.
To highlight limitation 1 we show that, when testing with allele frequencies between 0, 1/2 and 1, the performance of each interpolation method varies significantly between SNPs. We then outline a machine learning technique [14] that samples across the full allele frequency range. The technique can model non-linear distortions to correct the broad range of errors that occur in the chemistry and detection processes across different genotyping platforms, and therefore resolves the drawbacks of the existing approaches. Furthermore the technique substantially reduces the measurement error. After learning the calibration, the approach can be used to calibrate pooled samples measured on the same platform without further training. Finally we demonstrate the training requirements for machine learning approaches by training and testing on sets containing individuals, pools, and a combination of both.

Experimental data
Experimental data from Black Tiger prawns Penaeus monodon was acquired from the Sequenom iPLEX platform [15]. The raw data, typical of a commercial run, generated 61 SNP (H_A, H_B) measurement pairs on 1041 individuals. This data was then processed for quality control (Figure 2). A second experiment was conducted whereby all steps required in the genotyping process were conducted as rigorously as possible; however, due to the increased cost and time of the rigorous process, only a smaller set of 47 individuals was genotyped (randomly selected from the 1041 in the larger experiment). The calls from the more accurate experiment were used to rank the SNP accuracy of the larger experiment. In total 13 SNPs were identified as being inconsistent between the two experiments and were removed, leaving 48 SNPs. The 1041 individuals were ranked in terms of the number of available calls, and low quality samples (<80% calls) were removed from the data set, leaving 850 samples. We also removed 1279 measurement pairs with (H_A, H_B) values below a threshold (R<1), resulting in 39521 (H_A, H_B) pairs. Twenty-two pool samples, each containing a minimum of 18 and a maximum of 26 individuals, were created from the 1041 individuals; an individual was in at most one pool sample. Three pools resulted in no data and were removed, leaving 19 pooled samples, of which 11 (H_A, H_B) pairs were below the minimum signal threshold and removed, leaving 901 measurement pairs. Because the pools were constructed from individuals selected from the 1041 genotyped individuals, we can calculate the ground truth allele frequencies of these pools using the 39521 individual results. A small fraction (3.4%) of the pool and SNP combinations contained individuals which did not pass quality control, and these were therefore not included when calculating pool ground truth.
Finally, amongst the 850 individuals, 41 were genotyped twice; after data quality control a total of 1621 duplicate measurements of (H_A, H_B) were available. These duplicate samples were used to estimate the underlying variability in the measurement process.

Measurement error calculation
The measurement error can be decomposed into a fixed bias term E_B and a random term E_N such that E = E_B + E_N, with corresponding mean square error:

E[E²] = E_B² + E[E_N²]

The bias term is the expectation of the error E, which results in an erroneous offset in the allele frequency estimate that cannot be reduced by averaging multiple estimates; the variance in the measured allele frequency is E[E_N²]. Both errors are functions of allele frequency:

E(f*) = E_B(f*) + E_N(f*)

In particular E_N(0), E_N(1/2) and E_N(1) can be directly measured using individual samples with allele frequencies 0, 0.5 and 1 respectively. For example, Figure 3 shows the uncalibrated cluster of points corresponding to heterozygous individuals; E_N(1/2) corresponds to the angular spread of the cluster and E_B(1/2) to the rotation of the cluster centre from the ideal 45 degree angle. An optimal pooling strategy involves minimising the combination of E_S, E_P and E_N. Intuitively the strategy should balance the contributions from the different sources: significant reduction of any single error below that of the other errors has limited benefit. The optimal strategy depends on a combination of the expected allele frequencies and trait probabilities, and can be found based on the information loss in the process of pooling [16].
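The bias/variance decomposition of measurement error at a fixed allele frequency can be computed directly from repeated calibrated measurements; a minimal sketch (names are ours):

```python
import numpy as np

def decompose_error(f_hat, f_true):
    """Decompose the error E = f_hat - f_true of repeated measurements at a
    fixed true allele frequency into bias and random parts:
    MSE = E_B^2 + Var(E_N)."""
    e = np.asarray(f_hat) - f_true
    bias_sq = np.mean(e) ** 2   # E_B^2: squared mean error (offset)
    var = np.var(e)             # E_N^2: variance of the zero-mean part
    mse = np.mean(e ** 2)       # equals bias_sq + var
    return mse, bias_sq, var
```

For example, four measurements [0.52, 0.48, 0.55, 0.45] of a true frequency 0.4 give a bias² of 0.01 and a variance of 0.00145, summing to the MSE of 0.01145; averaging more measurements would shrink the variance term but not the bias.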
To test our methods we introduce three testing regimes: individuals, pools and combined. The regimes evaluate the performance of the calibration methods by testing with samples that are either all individuals, all pools, or a combination of individual and pool samples. The combined set is constructed such that the error incurred on the combined set contains equal contributions from pools and individuals. Although the presented methods are developed to be applied to data sets containing pools, we include a data set comprising individuals only. The intent is not to provide results indicative of application of the methods, but to demonstrate the performance of the methods at detector values typical of homozygous and heterozygous samples. The individuals regime highlights the contrast in performance with the existing methods developed on individual data, and additionally allows error to be decomposed into bias and variance components due to the presence of multiple calibrations at the same allele frequency.
The test data sets are named I_all, P_all and C_all. I_all contains all 850 individuals and P_all contains all pool data. C_all contains all the samples in I_all and P_all; however, to ensure pooled samples comprise an equal proportion of the data set, samples from P_all are replicated either 43 or 44 times into C_all. See Figure 4.
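The replication used to build the combined set can be sketched with a generic balancing helper (this illustrates the principle only and does not reproduce the exact replication counts used in the experiment):

```python
import numpy as np

def combine_balanced(individuals, pools):
    """Build a combined set in which pooled samples are replicated until
    they make up half the data, so errors on the combined set weight
    pools and individuals equally."""
    # Replicate the pools enough times to match the individual count,
    # then truncate to exactly that count.
    reps = int(np.ceil(len(individuals) / len(pools)))
    replicated = np.tile(pools, reps)[: len(individuals)]
    return np.concatenate([individuals, replicated])
```

With 850 individuals and 19 pools this yields a combined set of 1700 samples, half of which are (replicated) pool samples.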

Polynomial calibration
Three existing techniques were applied: linear interpolation, k-correction and 2nd order Lagrange interpolation. In addition we implemented three variations of Hermite interpolation to explore whether alternative interpolating functions could achieve better corrections on the Sequenom platform. These techniques are equivalent to the existing methods in mapping the homozygous cases as in (6), with a distortion specific term D satisfying the conditions in (8). Piecewise Hermite interpolation implements two Hermite polynomials over the sub-domains [0, f_AB] and [f_AB, 1] and enforces a derivative at f = 0, f = f_AB, and f = 1; we enforce zero derivatives in our implementation. An equal-domain version creates a symmetric function either side of f_AB, and finally a fixed-point variation of the equal-domain version enforces a unit valued derivative at f_AB. To highlight the differences in calibration polynomials, the functions are plotted in Figure 1 for the case of correcting an erroneous heterozygous allele frequency measurement f_AB = 0.3.
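The zero-derivative piecewise Hermite variant can be sketched with two cubic Hermite segments; the clamping outside the outer knots is an implementation choice of this sketch, not taken from the paper:

```python
import numpy as np

def hermite_calibrate(f, f_BB, f_AB, f_AA):
    """Piecewise Hermite calibration (zero-derivative variant): two cubic
    Hermite segments over [f_BB, f_AB] and [f_AB, f_AA], mapping the
    cluster centres onto 0, 0.5 and 1 with zero slope at each knot."""
    def segment(x, x0, x1, y0, y1):
        # Cubic Hermite on [x0, x1] with zero derivative at both ends:
        # h(t) = 3t^2 - 2t^3 satisfies h(0)=0, h(1)=1, h'(0)=h'(1)=0.
        t = np.clip((x - x0) / (x1 - x0), 0.0, 1.0)
        return y0 + (y1 - y0) * (3 * t**2 - 2 * t**3)
    f = np.asarray(f, dtype=float)
    low = segment(f, f_BB, f_AB, 0.0, 0.5)
    high = segment(f, f_AB, f_AA, 0.5, 1.0)
    return np.where(f <= f_AB, low, high)
```

The zero derivatives at the knots mean values near the cluster centres are compressed towards 0, 0.5 and 1, which is precisely the behaviour discussed later for individual versus pool testing.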
The MSE in allele frequency was calculated by calibrating under the three regimes described previously: individuals, pools and combined.

Machine learning approaches
The new approach outlined in this paper utilises machine learning techniques to learn functions that correct and estimate allele frequency. Three approaches were implemented: linear regression (LR), multilayer perceptron (MLP), and support vector regression (SVR). WEKA implementations of the LR and MLP algorithms were used [17] and libSVM [18] was used for the SVR. Each method learns a function that maps f into a calibrated allele frequency output f′. The training data set (which includes samples of f and known ground truth allele frequencies f*) is used to learn the mapping function. The methods find different solutions because they impose differing constraints on the solution and optimise different objectives. LR finds the linear representation that minimises the least squares error over the training set and requires no additional parameters. Both MLP and SVR learn non-linear mappings and require a number of parameters to define both the type of function representation and the optimisation approach.
The MLP [19] is implemented as a cascaded series of matrix-vector multiplications. The 'vector' input to the first matrix-vector product is the uncalibrated allele frequency value. A non-linear operation is applied to the output of each matrix-vector multiplication, the result is then multiplied by the next matrix in the series. The output of the final matrix-vector product is the calibrated allele frequency value.
Therefore the function representation is defined by parameters describing the number of matrix vector products (number of layers) and the length of the vectors (number of hidden nodes) resulting from the matrix vector products (notwithstanding the final output 'vector' length which is prescribed and of length one in this case). Furthermore a parameter specifying the type of non-linearity applied at each layer is required. Optimisation involves finding the values of the matrix elements (weights) that minimise an objective function. Typically a regularisation parameter is included to ensure the weights do not overfit the training data by finding an exact match, additional parameters specify the search method. Here we use a gradient based search with learning rate and momentum describing the update set. Specific values for parameters are shown in Table 1.
Support vector regression builds a function based on the training data itself. The function is represented as a sum of non-linear basis functions (called kernels) centred at each training sample. Parameters are required to describe the choice of kernel, the cost function and the optimisation approach. A common choice of basis function is a Gaussian with a specified standard deviation in the input domain. The SVR cost function incurs no cost for small errors; this allowable error can be explicitly provided, as in ε-SVR, or implicitly provided via a parameter ν in the ν-SVR, which finds a balance between regularisation and ν. Here the ν-SVR [20] is used with the parameters provided in Table 1. Where parameters are not explicitly stated, the default parameters provided by WEKA and libSVM were used. Whereas the polynomial calibration used cluster centres for determining the calibration polynomials, the machine learning approaches directly use the sample values as training data. The question arises as to what proportion of pooled data should be used versus individual data. To examine this we introduce three training regimes: individuals, combined and pools, which train the models using data from the respective data sources. The intent of the machine learning approaches is to provide samples away from the homozygous and heterozygous cases, to improve the calibration in these regions; however, we provide the individuals training set to allow comparison with the existing methods, which rely on samples from individuals only, and also to provide the ability to decompose errors in the resulting f′ into variance and bias components.
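The three learners can be sketched with scikit-learn standing in for WEKA and libSVM (an assumption for illustration only; the synthetic distortion below is likewise illustrative, and the hyperparameters do not reproduce Table 1):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import NuSVR

rng = np.random.default_rng(1)
# Synthetic training data: raw frequency f is the ground truth f* plus a
# smooth non-linear distortion and measurement noise (illustrative only).
f_star = rng.uniform(0, 1, 300)
f_raw = np.clip(f_star + 0.1 * np.sin(np.pi * f_star)
                + rng.normal(0, 0.02, 300), 0, 1)

X, y = f_raw.reshape(-1, 1), f_star
models = {
    "LR": LinearRegression(),
    "MLP": MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000, random_state=0),
    "SVR": NuSVR(nu=0.5, C=1.0, gamma="scale"),  # nu-SVR, Gaussian (RBF) kernel
}
# Each model learns the mapping f -> f* and is scored by MSE on the fit.
mse = {name: np.mean((m.fit(X, y).predict(X) - y) ** 2)
       for name, m in models.items()}
```

The non-linear learners can follow the sinusoidal distortion that LR can only approximate with a straight line, which mirrors the motivation for moving beyond linear corrections.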
An additional requirement on the data sets to ensure valid results for the machine learning approach is there is no intersection between the data used for training the models and the data used for testing the models. To achieve this the data sets are further refined. Specifically we use a cross-validation approach: the original data set containing all the data is partitioned into 10 blocks. One block is removed for testing and the remaining 9 blocks used for training. Consequently we create two pool data sets P train and P test which partition P all , and individual sets I train and I test which partition I all . Similar to the combined testing regime, we create data sets C train and C test , resampling from P train and P test to ensure equal representation by pooled samples in the combined data sets. The process for generating the data sets is shown in Figure 4.
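The 10-block partition with disjoint train and test sets can be sketched as:

```python
import numpy as np

def ten_fold_indices(n_samples, seed=0):
    """Partition sample indices into 10 disjoint blocks; each fold uses
    one block for testing and the remaining 9 for training, so the train
    and test sets never intersect."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    blocks = np.array_split(order, 10)
    for i in range(10):
        test = blocks[i]
        train = np.concatenate([blocks[j] for j in range(10) if j != i])
        yield train, test

folds = list(ten_fold_indices(850))
```

The same partitioning is applied independently to the pool and individual sets; the combined sets are then built by resampling from the respective train/test partitions so that no pooled sample appears on both sides of a fold.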
The data sets used in training the machine learning approach and used for calibration are dependent on the training and testing regimes and shown in Table 2.

Results and discussion
The pairs of duplicate measurements were used to calculate the underlying variation in the measurement process, which cannot be removed by calibration. The difference d between duplicated measurements is a random variable with twice the variance of the allele frequency measurement; given m duplicate pairs, the variance estimate is Σd²/2m. After data cleaning m = 1621 duplicate samples remained, and the measurement process was found to contribute a variance component of 1.91×10⁻³ to E[E_N²]. The results for the standard calibration techniques are shown in Table 3. Mean square errors are averaged over SNPs and samples by summing over all (H_A, H_B, f*) entries in the respective testing data sets. Because the calibration polynomials map cluster centres to exact allele frequencies, the bias error is small and the majority of the error is random (E ≈ E_N). For the case of no calibration the bias error is larger than the random error (E_B² = 5.56×10⁻³). Each polynomial's ability to calibrate is highly SNP dependent. The proportion of SNPs for which each approach achieved superior results (in comparison to the other approaches) is shown in Table 4. An optimal approach might select the best calibration function on a SNP by SNP basis; such an approach would attain the results shown in the bottom row of Table 3. Table 4 clearly shows that the optimal form of the calibration function differs across SNPs. For example, although the SNPs shown in Figure 3 exhibit similar cluster centres, the distributions of points in the clusters are significantly different. Existing calibration approaches operate only on the three cluster means and not the distributions, whereas the proposed machine learning approach operates on the clusters as well. Care should be taken in interpreting Table 4: for example, 'doing nothing' yields the most accurate results 35.4% of the time. Clearly for many SNPs the raw data is already accurate and calibration degrades it.
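The duplicate-based variance estimate Σd²/2m can be sketched and checked on synthetic duplicates (the noise level below is illustrative, not the platform's):

```python
import numpy as np

def duplicate_variance(f_first, f_second):
    """Estimate irreducible measurement variance from duplicate genotyping
    runs: Var(f) ~= sum(d^2) / (2m), since the difference d between two
    independent duplicates has twice the single-measurement variance
    (and any shared per-sample bias cancels in d)."""
    d = np.asarray(f_first) - np.asarray(f_second)
    return np.sum(d ** 2) / (2 * len(d))
```

On 2000 simulated duplicate pairs with per-measurement noise of standard deviation 0.04, the estimator recovers a variance close to 0.04² = 1.6×10⁻³.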
However, the introduced errors on these SNPs are more than compensated for by the error reduction achieved across the remaining SNPs. Results for the machine learning approaches are shown in Table 5. Numbers in parentheses after the testing set names correspond to the worst case standard deviation of the errors over cross-validation folds in the column corresponding to the test set. For the individuals testing regime the error was decomposed into bias and variance components, and the total mean square error E[E²] and variance E[E_N²] are provided.
The reason for generating testing and training sets including mixtures of individuals and pools is evident in the results. Examination of just one testing set can lead to erroneous conclusions on performance. For example, the piecewise Hermite polynomials achieved the best results in Table 3 for minimising variance on individuals. However, this is a result of the zero derivatives enforced at 0, 0.5 and 1, which tend to compress the results towards the correct allele frequencies; the disadvantage of this is seen in the larger errors incurred when testing with pools. A similar overfitting effect occurs for learning models trained on individuals, which results in flattening of the learned calibration function around the individual allele frequencies. The effect of changing the number of pools and individuals was also explored. The linear regression approach was applied to two scenarios: using individuals for training and testing, and using pools for training and testing. The results showed the existing methods' MSE was improved upon if at least 10 individuals and 8 pools were included in the respective training sets. Improvement stopped after 225 individuals were included; the available pools data set was not large enough to see performance stop improving.
In summary, whereas existing calibration approaches are trained using individual samples, machine learning approaches should not be trained this way: pooled samples are required. There is an advantage in including calibration pools when building calibration models; however, care must be taken to avoid learning near the pool allele frequency values only. The models that achieved the best results when tested on pools were those trained only on the calibration pools, and these were not accurate elsewhere over the full allele frequency range. This is highlighted by the larger errors committed by all methods when trained on pools and tested on individuals. A typical experiment will involve calibration pools (with known ground truth allele frequency) and phenotype pools to be corrected. The spread of the calibration pool allele frequencies is determined by the allele frequency of the population the pool is taken from, and by the size of the pool. However, for the phenotype specific pools being calibrated there is no guarantee that a SNP's allele frequency lies within this spread, particularly if there exists a relationship between the SNP and the phenotype. Therefore, ideally, a calibration function should be accurate over the full range of allele frequencies [0,1], or alternatively be applied only within the spread of allele frequencies on which the model was built. One alternative is to use a smaller number of individuals when constructing calibration pools, to increase the spread. Another solution is to include a mixture of pools and individuals in the training of the algorithm, as in the combined data set. The most accurate method applied to pools when trained on the combined set was the SVR, which achieved a MSE of 6.54×10⁻³ on pools with a variance of 2.55×10⁻³ on individuals.
This is 33% larger than the duplicate values' random error, so although some scope still exists for reducing variation, any future substantial improvement would require investigation of the causes of variation in the platform's response to duplicate samples. The best polynomial calibrator (2nd order Lagrange) achieved an MSE of 12.34×10⁻³ on pools only, with a much larger variance of 4×10⁻³ on individuals only; the machine learning approach therefore reduces MSE by almost a factor of 2. The best approach for minimising variance on the individuals-only set was the MLP, which achieved a random squared error component of 2.17×10⁻³, compared with a calibrated MSE of 3.75×10⁻³. This corresponds to an increase in the test statistic of 72% compared to standard calibration on the pools-only data set. A comparison of the best performing ML approach with the existing methods is given in Table 6.

Conclusion
This is the first study of a machine learning approach to the calibration of pooled SNP samples, and it has demonstrated the importance of training sample location on performance. The approach was tested on data generated by a Sequenom iPLEX SNP panel providing results for 61 SNPs on Black Tiger prawn individual and pooled samples. We showed that SNP to SNP variation between the allele frequencies is significant and that different calibration polynomials are suitable for different SNPs. We introduced a machine learning technique that models each SNP separately and includes data between the discrete allele frequencies of individuals by incorporating calibration pools into the model. The machine learning approach achieves significantly lower error, reducing MSE by a factor of 2, and improves the study test statistic by 72% as a consequence of the reduction in allele frequency variance.
An additional advantage of the machine learning technique is the ability to learn calibration functions on higher dimensional inputs. The use of additional input information can allow errors which previously created variance in f to become predictable in the additional dimension. In this situation variance-causing error is converted to a bias error which can be corrected by calibration, with a resulting reduction in variance. Here we had limited access to auxiliary data from the experiment, and using allele frequency alone has allowed comparison of the techniques with the same input data.