The mixture model
We fit mixture models to the response signals to classify the markers into one of five genotype classes, corresponding to the five possible allele dosages in tetraploids. This type of classification is often called modelbased clustering, because a statistical model is used for the responses. We describe the model here.
Let the pair s
_{
a
}and s
_{
b
}represent the measured a and b allele signal strengths for an individual. We analyze the fraction s
_{
a
}/(s
_{
a
}+ s
_{
b
}). As it is advantageous to have a homoscedastic response in the mixture model, and the calculated fraction shows variance heterogeneity with smaller variation for fractions closer to 0 and 1, we take the arcsinesquare root (asr) transformed fraction
to stabilize the variance.
For the transformed fraction
y a normal (or Gaussian) mixture model [
18] is fitted:
with f
_{
j
}the density of a normal distribution with mean μ
_{
j
}and common standard deviation σ. The mixing probabilities π
_{
j
}are the prior probabilities of a marker to have allele dosage j, with Σ_{
j
}π_{j}= 1. In the model described above, five components are specified for the five allele dosages (0,..,4), but in other situations less or more components may be needed. In case of five components, ten model parameters have to be estimated: five means μ
_{
j
}(for the mean responses of the five allele dosages), one standard deviation σ (measuring the common spread of individual responses with the same allele dosage), and four probabilities π
_{
j
}(measuring the fraction of individuals having the j
^{th} allele dosage), with the fifth probability following from the other four.
One of the principles in statistical modelling is parsimony: remove redundant model parameters to improve stability and interpretability of results. Here it may be beneficial to put constraints on two groups of parameters:
1) Constraints on π
_{
j
}according to HardyWeinberg equilibrium (HWE). If the allele dosages are in HWE, a single parameter p, representing the allele frequency in the population, suffices instead of four probabilities π
_{
j
}. The constraints are π
_{1} = p
^{4}, π
_{2} = 4p
^{3}(1p), π
_{3} = 6p
^{2}(1p)^{2}, π
_{4} = 4p(1p)^{3}, π
_{5} = (1p)^{4}.
2) Constraints on
μ
_{
j
}, by incorporating an assumed relationship between allele dosage and signal strength. We first assume that the signal strengths
s
_{
a
}and
s
_{
b
}depend linearly on the allele dosage: with
x the dosage of allele a, and 4
x the dosage of allele b, the model states for the mean signal strengths of
s
_{
a
}and
s
_{
b
}
where
a
_{0} and
b
_{0} are the background signal strength for alleles a and b. The fraction
contains a superfluous parameter, and can be simplified into model 1:
with c
_{1} = a
_{0} /a
_{1}, c
_{2} = b
_{0} /a
_{1}, and r = b
_{1} /a
_{1}. Hence, parameters c
_{1} and c
_{2} are proportional to the background signals, and r is the ratio of sensitivities of the a and b signal strengths to the allele dosages.
If the a and b background signal strengths are equal, a common parameter
c =
c
_{1} =
c
_{2} can be used to arrive at model 2:
The assumption of a linear relationship between signal strength and allele dosage may be too restrictive. Therefore, the model for the individual signal strengths is extended into
assuming equal curvature for both signals, rendering the third model
with d = a
_{2} /a
_{1}.
Model 3 may be simplified into model 4, by equating the background signal strengths:
Models (1)  (4) are formulated for the fraction of means of signal strengths. However, as the response is the asrtransformed variable y, the models need to be transformed as well. The transformed model (1) for the expectation of y is μ
_{
y
}= asr((c
_{1} + x)/(c
_{1} + x + c
_{2} + r (4  x))), and likewise for the other three models.
There are two minor complications with the models:

The models 14 are developed for the fraction of expected signal strengths E(s
_{
a
})/(E(s
_{
a
}) + E(s
_{
b
})), but we analyze the fraction s
_{
a
}/(s
_{
a
}+ s
_{
b
}), amounting to a model for the expected fraction E(s
_{
a
}/(s
_{
a
}+ s
_{
b
})). However, the expectation of a fraction and the fraction of expectations are approximately, but not exactly, equal.

Transformation bias. We analyze the asrtransformed ratio of intensities y = asr(s
_{
a
}/(s
_{
a
}+ s
_{
b
})), amounting to a model for the expectation E(y). This expectation is approximately, but not exactly, equal to asr(E(y)).
Summarizing, two approximations are employed: E(asr(s
_{
a
}/(s
_{
a
}+ s
_{
b
}))) ≈ asr(E(s
_{
a
}/(s
_{
a
}+ s
_{
b
}))) ≈ asr(E(s
_{
a
})/(E(s
_{
a
}) + E(s
_{
b
}))).
To compare different models, e.g. the unconstrained and HWEconstrained model, 2loglikelihoods (2LL) may be compared, with by definition a smaller 2LL for the unconstrained (larger) model. To balance model fit and increased model complexity, we use the Bayesian Information Criterion (BIC), which adds a penalty to the 2LL based on the number of parameters k in the model (and n the number of individuals): BIC = 2LL+k ln(n) [19].
The different mixture models are fitted to the transformed fractions using maximum likelihood (ML). The EMalgorithm is used to find the MLestimates [20]. The EMalgorithm needs starting values of the parameters. Next, E and Msteps are iterated. In the Estep, given the current parameter values, the posterior probabilities of an individual to have each of five allele dosages are calculated, followed by the Mstep, in which the mixture probabilities π
_{
j
}are estimated, and μ
_{
j
}and σ by weighted nonlinear least squares. The fitting is done using R [21]. For a more elaborate description of mixture models for marker genotyping and the EMalgorithm, see [17].
Model and marker selection
The selection of a suitable mixture model for a given marker is the result of a multistep process that has been developed empirically.
Before starting the model selection itself, unreliable observations should be removed. In the case of an Illumina GoldenGate assay we removed all observations with a total signal intensity less than 3200 (see Data sets for the rationale for this threshold).
In the first step, eight different mixture models are fitted. Each model consists of 5 component distributions. The means of the component distributions are constrained by the five possible allele ratios, using one parameter for the ratio of intrinsic signal strength for both alleles, and additionally one or two parameters for signal background, and no or one coefficient for a quadratic term in the signal response (Equations 14). This results in four possible models for the means of the component distributions. Each of these models is combined with two models for the mixing proportions: (a) the mixing proportions are not constrained, or (b) the mixing proportions are constrained to HardyWeinberg equilibrium (HWE) ratios. The HWE restriction often helps in identifying the peaks, even if the actual ratios depart slightly from the HWE. As the EM algorithm does not always find the global maximum from a given start configuration of parameter values, the EM algorithm for these eight models is started with two different configurations of means: one where the five original means are derived from a hierarchical clustering of the signal ratios, and one where they are set at equidistant positions on the transformed scale from 0.142 to 1.429 (corresponding to 0.02 and 0.98 on the original scale).
The BIC of the 16 results are compared and the result with the minimum BIC is selected. Using the selected model, for every sample the probabilities of belonging to each of the five distributions are calculated. Only if the maximum probability is above a certain threshold (by default 0.99) the corresponding genotype class is assigned to the sample. This threshold affects the reliability of the genotype scores; a high threshold (such as the default) results in a high reliability but in less called genotypes; and if the percentage of called genotypes drops below a specified level (see below) the SNP is not scored at all.
If the difference in response between the two allelic signals is large (parameter r is much smaller or larger than 1), a wide gap occurs between the nulliplex or quadruplex peak and the next peak, while the other four peaks are closely spaced. In such cases it may happen that the EM algorithm does not find the optimal fit but instead fits the simplex or triplex peak in the wide gap. In order to detect and correct such misfits, a second step tests whether the simplex or triplex peaks have a lower mixing proportion or a smaller number of samples assigned than peaks at both sides. If this is the case, the eight models are fitted again with a third starting configuration for the distribution means: if the triplex peak appears to be fitted in the gap the means of the duplex, simplex and nulliplex peaks are reassigned to the triplex, duplex and simplex peak; for the nulliplex peak a new mean halfway between the (new) simplex mean and 0.0 is assigned. A similar rearrangement is made if the simplex peak appears to be fitted in the gap. Using this new starting configuration of the means the EM algorithm for the eight models is run again.
For each of the fitted models a check is done if a lower peak occurs between higher peaks. Neither in a cross progeny nor in a population in HardyWeinberg equilibrium such a pattern is expected. Therefore, by default the algorithm includes a third step which rejects all fitted solutions where such a pattern occurs; however, this check can be disabled. If in all fitted models lower peaks occur between higher peaks or if this check is disabled, no solutions are rejected in this step.
After these initial steps, the fitted model with the lowest BIC among the nonrejected solutions is selected. Again for every sample the probabilities of belonging to each of the five distributions are calculated and genotypes are assigned using the same criterion as in step 1.
In the final step, markers can be rejected based on several additional criteria. If less than a minimum fraction (by default 60%) of the samples are assigned a genotype this indicates an unclear peak pattern. This parameter interacts with the parameter specifying the minimum probability level required for assigning a genotype as described above. Also a peak variance above a certain threshold (by default 0.1 on the transformed scale) causes the marker to be rejected; again this filters against unclear peak patterns. This parameter may be decreased when the general noise level of the wellperforming assays is low. A third criterion for marker rejection is when more than a maximum fraction (by default 85%) of the assigned samples are in the same peak. This parameter may be increased for data sets with more samples, as long as there are sufficient samples outside the main peak for reliable fitting of the remaining components of the mixture distribution.
It is recommended to try out some different values of the parameters based on the guidelines above and inspect the results for a subset of the markers, before selecting the values to apply to the full dataset.