Overview of fitTetra
The R package fitTetra [2] is a program for genotype calling of SNP markers in tetraploid species using normal mixture models. It consists of three main functions: 1) CodomMarker: the core function to fit a normal mixture model to a vector of transformed signal ratios of a single bi-allelic marker over all the samples; 2) fitTetra: a function to fit different normal mixture models to a single bi-allelic marker and select the optimal one; 3) saveMarkerModels: a function to fit mixture models for a series of markers and save the results to files. This suite of functions allows for automatic genotype calling for a collection of SNP markers. The normal mixture model is formulated for the response y = arcsine-square root transformed fraction sb/(sa + sb) where sa and sb are the fluorescence signal strengths for alleles a and b. The normal mixture density of the i-th response yi is \( f\left({y}_i\right)={\sum}_{j=1}^5{\pi}_j{f}_j\left({y}_i\right) \) with fj(yi) the normal density with mean μj and common variance σ2. The probabilities πj, means μj and variance σ2 are estimated by maximum likelihood using the EM-algorithm. Starting values of the parameters can be provided by the user or are derived by the software through naïve clustering of the data. The genotype calling of observation yi is based on the set of 5 values \( {p}_{ij}={\pi}_j{f}_j\left({y}_i\right)/{\sum}_{j=1}^5{\pi}_j{f}_j\left({y}_i\right) \) (j = 1,..,5) that describe the probabilities that a sample belongs to each of the five distributions. If one of the pij is larger than a user-defined threshold value (e.g. 0.9) the genotype will be called to be j-1 (as the distribution for j = 1 corresponds to a dosage score of 0).
The CodomMarker function allows restrictions to be placed on the μj and πj parameters. Parameters πj may be free or restricted to follow Hardy-Weinberg equilibrium. Parameters μj may be restricted such that 1) background signals for a and b (for the two SNP alleles) are equal or unequal; 2) the relationship between signal ratio and allele dosage is linear or quadratic. These restrictions allow the means μj to be asymmetrically positioned within the range 0 - π/2. An example of a histogram produced by fitTetra for a SNP for tetraploid potato is shown in Fig. 1. The dark bars in the histogram represent ratios of diploid potato genotypes included in this study. The allele ratios for diploid homozygotes are equal to ratios for homozygotes (nulliplex and quadruplex) in tetraploids and the ratio for the diploid heterozygote is the same as the ratio for duplex genotypes in tetraploids. Therefore, the position of diploid distributions compared to tetraploid distribution may serve as an extra check for the quality of the calling.
Extension to multiple populations
In fitTetra 2.0 multiple populations may be specified, whereas fitTetra 1.0 treated all observations as stemming from one single population. When multiple populations are used, the model parameters for the means μj (i.e. the positions of the peaks) and variance σ2 (i.e. the width of the peaks on arcsine-square root scale) are shared among the different populations, while the mixing proportions πj may be different for each population.
In fitTetra 1.0 the CodomMarker function, performing calling of a single marker allowed for three types of restrictions on the mixing proportions πj: 1) freely estimated (ptype = “p.free”), i.e. the only restriction is that \( {\sum}_{j=1}^5{\pi}_j=1 \); 2) Hardy-Weinberg equilibrium (“p.HW”) where only the allele frequency is estimated and mixing proportions are a function of those based on Hardy-Weinberg equilibrium, which is often reasonable for association panels; 3) fixed proportions (“p.fixed”), i.e. the πj are kept fixed at their given values during the EM-algorithm (used when values are known from another source).
In the new version of the program we added a new type of restriction on the mixing proportions, p.type = “p.F1”, for an F1 population. Given the parental genotypes estimated in the previous iteration of the EM algorithm, the “p.F1” proportions are calculated as the expected tetrasomic segregation ratios of the F1 progeny of these two parents. These segregation ratios assume random bivalent pairing and no double reduction and are shown in Table 1. Double reduction is expected to occur in relatively low frequency even when there are occasionally multivalent pairings. The parental samples are specified as separate populations.
The EM-algorithm is used to find the maximum likelihood estimates of the parameters of the multi-population mixture model. In this algorithm E-steps (Expectation) and M-steps (Maximization) are repeatedly taken until the likelihood converges. In the E-step, given current estimates of the parameters, the probabilities for an observation to belong to each of the five mixture components are calculated. In the M-step, based on these calculated probabilities, new component means and variance are calculated over all observations. Then new mixing proportions are calculated per population depending on the type of restrictions on the mixing proportions, as discussed above.
In fitTetra 2.0, function “CodomMarker” (and, as a result the wrapper function “fitTetra” and the convenience function “saveMarkerModels”) produces the same type of output as fitTetra 1.0 with the exception of the output of probabilities of the five distributions, which are now population specific. Optionally, an array of histograms is plotted, showing per population (for association panels and for FS’s) the histogram of ratios with the fitted mixture model. Parental data are shown with separate coloured symbols in the histograms for the corresponding FS family (Fig. 1).
Support for parental genotype data
When extra parental information is available, it can be used to further facilitate the genotype calling. Custom SNP arrays, the most popular means of genotyping tetraploids, are often based on discovery studies where parents of a population are sequenced in order to design the array for the population itself [5, 11]. The user can supply this information and set the probability type to “p.fixed” for the parental populations. This will fix the mixing proportions for the parental populations through the run of the algorithm. The probabilities (segregation ratios) of the corresponding F1 progeny will be calculated accordingly. It is possible that parental genotypes are erroneous (e.g. if they were derived from RNA-Seq they might be affected by differential gene expression). To avoid losing markers just because of that, fitTetra 2.0 will always attempt fitting a model where “p.free” is used instead of “p.fixed” and model with better BIC score will be selected. If parental genotypes supplied by the user are correct, the run with the “p.fixed” should produce model at least as good as with “p.free”.
Moreover, if the parents were analyzed in the same run as the FS family their dosages are also used to assign starting means for one or two of the mixing components. When parents have different dosages, two starting means are known, and the remaining three are estimated using a simple linear regression model. This may not be completely accurate as the relationship between dosage and intensity is not necessarily linear, but in our experience it provides a better start than the naïve clustering of the samples.
When multiple populations are measured in the same run, the means of the five genotype/dosage distributions are shared between populations. Because fitTetra 2.0 has the ability to process multiple populations at once, using known dosages of parents benefits the genotype calling not only for their F1 offspring but also for all the other populations, even when they are not related to these parents.
However, the starting parental dosages may be incorrect and applying these data as starting values may result in incorrect calls, or in not fitting any model at all. An example of this is shown in our previous SNP discovery study [5] in which we designed the array with RNA-Seq performance. When parental dosages are estimated from RNA-Seq data, they might be affected by differential expression of alleles (e.g. if the true parental genotype is AABB and allele A is expressed three times as much as allele B, the SNP array will report genotype AABB but the RNA-Seq read counts will suggest AAAB). In order to account for that, the algorithm fits a series of models both without and with parental dosage data; the best model (lowest value for Bayesian Information Criterion) is selected. For brevity, we will refer to fitTetra 2.0 without the use of parental data as fitTetra2NP and to fitTetra 2.0 where parental genotypes were used to fix the mixing proportions and calculate starting means as fitTetra2P later on.