 Research
 Open Access
 Published:
New statistical methods for estimation of recombination fractions in F_{2} population
BMC Bioinformatics volume 18, Article number: 404 (2017)
Abstract
Background
Dominant markers in an F_{2} population or a hybrid population have much less linkage information in repulsion phase than in coupling phase. Linkage analysis produces two separate complementary marker linkage maps that have little use in disease association analysis and breeding. There is a need to develop efficient statistical methods and computational algorithms to construct or merge a complete linkage dominant marker maps. The key for doing so is to efficiently estimate recombination fractions between dominant markers in repulsion phases.
Result
We proposed an expectation least square (ELS) algorithm and binomial analysis of threepoint gametes (BAT) for estimating gamete frequencies from F_{2} dominant and codominant marker data, respectively. The results obtained from simulated and real genotype datasets showed that the ELS algorithm was able to accurately estimate frequencies of gametes and outperformed the EM algorithm in estimating recombination fractions between dominant loci and recovering true linkage maps of 6 dominant loci in coupling and unknown linkage phases. Our BAT method also had smaller variances in estimation of twopoint recombination fractions than the EM algorithm.
Conclusion
ELS is a powerful method for accurate estimation of gamete frequencies in dominant threelocus system in an F_{2} population and BAT is a computationally efficient and fast method for estimating frequencies of threepoint codominant gametes.
Background
A great advance has been made in building genetic maps of various species due to the development of largescale molecular marker technologies [1,2,3,4,5,6,7] and statistical methods [4, 8,9,10,11,12,13,14,15,16,17,18]. However, mapping of numerous molecular markers has been complicated by linkage phases of dominance [14,15,16, 19]. In twopoint analysis, markers in repulsion phase provide quite less linkage information than in coupling phase [14, 15, 20, 21]. This is especially true for dominant markers in F_{2} population [14]. In practical mapping experiments, although the linkage phase for each dominant marker is random, a half of markers are derived from one of two coupling phases. The phase between couplings is repulsion [14, 15]. This situation results in two separate partner linkage maps for dominant markers: high linkage information content of markers in the coupling phase and low linkage information content of markers in the repulsion phase. Thus one has to build two complementary linkage maps [14, 15, 21, 22]. To date, there has not yet been an effective way to integrate both into a complete map. Mester et al. [15] attempted to use pairs of codominant and dominant (CD) markers to merge such two complementary maps because pairs of the CD markers in repulsion phase have much higher linkage information content than pairs of dominantonly markers in repulsion phase. However, this strategy demands that all dominant markers be paired with codominant markers, which is not a general case in mapping practice, otherwise, local and global disturbance will then violently affect the reliability of the integrated map.
The twopoint analysis implemented by the expectation maximization (EM) algorithm [11,12,13, 23,24,25] is a highly powerful approach to estimate recombination fractions between codominant loci and between dominant loci in coupling phase, but the EM algorithm has very low power in estimation of recombination fractions between dominant loci in repulsion phase. This is because it is difficult for the EM algorithm to distinguish genotypes in coupling phase from those in repulsion phase for dominant markers.
Therefore, the key of developing a powerful method for mapping dominant loci in an intersection population is to overcome the difficulty of distinguishing coupling phase from repulsion phase. Since twopoint analysis, as pointed out above, performs very poorly in the estimation of recombination fractions between dominant loci, threepoint analysis is alternatively taken into account. However, few threepoint EM algorithms can be applied to dominant markers because dominant markers are less informative for maximum likelihood estimation [26]. One effective way to carry out threepoint analysis is to dissect threepoint genotypes into various gamete components that are informative for distinction between coupling and repulsion phases, and then, to estimate their frequencies. With these estimated gamete frequencies, one can immediately estimate recombination fractions between dominant loci in couple and repulsion phases. A key to this strategy is to obtain estimate of gamete frequencies. On the basis of dissection of genotypes, Tan and Fu proposed a binomial analysis of threepoint (BAT) to estimate frequencies of dominant gametes [19]. However, this binomial approach is limited to the frequency of the threepoint recessive gamete abc. The accuracy of estimation is completely dependent on the observed frequency of its phenotype (aabbcc). We have developed a new method called “expectation least square” (ELS) to address this problem. ELS estimation, similarly to expectation maximum algorithm, is realized on the basis of Tan and Fu’s BAT method [19]. That is, the expectation of phenotype frequencies can be given by using Eqs. (19) in the BAT of Tan and Fu [19], and the difference between estimated and expected values of phenotype frequencies is given using least square. The expectation and least square steps are iterated so that the difference between estimated and expected values is less than tolerant value. In addition, we have also developed a fast binomial approach to estimate frequencies of codominant gametes.
Methods
Real data collection
Mouse genotype data: A RFLP dataset of 333 F_{2} mice was obtained from MAPMAKER/EXP (version 3.0b) [13].
Simulation
For dominant loci, we just took unknown phase into account in simulation and followed a point process model [27] and scheme of Tan and Fu [19] to perform simulations. In N F_{1} meioses, recombination events occurred at random between two adjacent loci. Here for the simplicity, we allowed for only independent crossovers during procedure of recombination occurrence between nonsister chromatides. We generated N F_{2} individuals with ratio = phenotype A: phenotype a = 3:1 at each dominant locus or A(homozygote): H(heterozygote): B(homozygote) = 1:2:1 at each codominant locus. We set three levels for sample size: N = 100, 200, and 300 F_{2} individuals and 100 iterations and used variance (equivalent to mean square error, MSE) that quantifies deviation of estimated recombination fraction between two adjacent loci from its true value to evaluate these estimators. Since the ELS and BAT estimators work in threepoint system, threepoint recombination fractions were incorporated to twopoint recombination fractions by using Tan and Fu [19] method. Simulation of codominant and dominant F_{2} populations and the ELS and BAT estimations of gamete frequencies in F_{2}population were implemented by our R functions (Additional file 1, source code).
Results
Estimation of the frequencies of threelocus gametes in an F_{2} population
Since our ELS method for accurate estimation of the frequencies of threelocus gametes in a population with random union of gametes is based on dissection of phenotypes, for convenience, we start by presenting the BAT method of Tan and Fu [19].
ELS estimation of frequencies of dominant marker gametes
Our study here is restricted to three biallelic dominant markers. We use A and a, B and b, C and c to represent two alleles at three loci where upper letters (A, B and C) stand for dominant alleles and lower letters (a, b and c) for recessive alleles. A tripleheterozygote individual via meiosis produces eight types of gametes at the three loci: ABC, ABc, Abc, AbC, aBC, abC, aBc and abc. Gametes ABC and abc are a pair of sister gametes on which two alleles at the all three loci are different and come from two different parents. Similarly, Abc and aBC, abC and ABc, AbC and aBc are also pairs of sister gametes. Two sister gametes theoretically have equal frequency in an F_{2}population because no mutation, no migration, no gene conversion and no selection occur in such a random mating population. From the expectation that sistergametes have equal frequencies, we have in an F_{2} population f(ABC) = f(abc) = q _{1}, f(ABC) = f(aBC) = q _{2}, f(ABc) = f(aBC) = q _{3}, f(AbC) = f(aBc) = q _{4}. These gamete frequencies are constrained by 2q _{1} + 2q _{2} + 2q _{3} + 2q _{4} = 1. The individuals in the population can be classified into four categories: category 0 in which all individuals possess 0 dominant locus, that is, all individuals have three recessive loci; categories 1, 2 and 3 in which all individuals have respectively only one, two and three homozygous or heterozygous dominant loci. To accurately estimate gamete frequencies, we dissect a phenotype into different zygote types (genotypes) in each category using sister gametes. In category 1, for example, aabbC_ has only locus c with one or two dominant alleles. Therefore it can be dissected into three zygote types:
Phenotypes aaB_cc and A_bbcc are dissected in a similar fashion. Category 2 also has three phenotypes and each of them can be dissected into four zygote types that are comprised of five pairs of sister gametes. For instance, phenotype type A_B_cc can be dissected into
Category 3 has only one phenotype. The phenotype is comprised of 8 zygote types (genotypes) and therefore it is not useful for estimate of gamete frequencies. We use Q _{1}, Q _{2}, Q _{3}, Q _{4}, Q _{5}, Q _{6}, and Q _{7} to respectively represent the frequency expectations of phenotypes aabbcc, aabbC_, aaB_cc, A_bbcc, A_B_cc, A_bbC_, and aaB_C_ in a population. The frequency of phenotype aabbcc is
The other 6 phenotypes have their frequencies:
Using Q = 2 (q _{2} q _{3} + q _{2} q _{4} + q _{3} q _{4}), Eq. (4) is simplified as
Estimates of q _{1} , … , q _{4} can be obtained from the above sets of equations by replacing Q _{ k } with their observed frequencies where k = 1, 2,…,7 for 7 phenotypes. Theoretically, eqs. (1) and (3) are sufficient to make solutions for the frequencies of four types of gametes. However, Eq. (5) can be used to further minimize noise in the observed frequencies. That is, Q _{2}, Q _{3},and Q _{4} can be alternatively estimated as
where Q = Q _{5} + Q _{6} + Q _{7} + Q _{1} − 0.25 [19]. It implicates that Q _{2}, Q _{3}, and Q _{4} can also be estimated from the estimated frequencies of Q _{1}, Q _{5}, Q _{6}, and Q _{7}. Thus, we can combine the two sets of estimates of Q _{2}, Q _{3}, and Q _{4} into one set:
where a _{ k } and b _{ k } are weights of \( {\widehat{Q}}_k \) and \( {\widehat{Q}}_k^{\#} \), respectively, where k = 2, 3, and 4. \( {\widehat{Q}}_k \) and \( {\widehat{Q}}_k^{\#} \) are respectively estimates of Q _{ k } and \( {Q}_k^{\#} \). In general case, a _{ k } = b _{ k } (see Additional file 3: Appendix B). An alternative method for weighting is \( {a}_k={\widehat{Q}}_k/\left({\widehat{Q}}_k+{\widehat{Q}}_k^{\#}\right) \) and b _{ k } = 1 − a _{ k }. When the sample is small, it is likely that \( {\widehat{Q}}_k^{\#}\le \) 0 or \( {\widehat{Q}}_k= \) 0. In such a case, one can set a _{ k } = 1 and b _{ k } = 0 for \( {\widehat{Q}}_k^{\#}\le \) 0, or a _{ k } = 0 and b _{ k } = 1 for \( {\widehat{Q}}_k^{\#}> \) 0 and \( {\widehat{Q}}_k \) = 0. Since \( {Q}_2={q}_3^2+2{q}_1{q}_3+{q}_1^2{q}_1^2={\left({q}_3+{q}_1\right)}^2{q}_1^2 \), q _{3} can be given by
Similarly,
Q _{1}, Q _{2}, Q _{3}, and Q _{4} are respectively estimated by \( {\widehat{Q}}_1 \), \( {\widehat{Q}}_2^{\ast } \), \( {\widehat{Q}}_3^{\ast } \), \( {\widehat{Q}}_4^{\ast } \), therefore q _{3}, q _{2}, q _{4}, and q _{1} are respectively estimated by
In Eq. (9), accurate estimation of q _{1} is a key contribution to accurate estimations of q _{2}, q _{3}, and q _{4}. Equations (3) and (4) show that Q _{2} ~ Q _{7} can also provide information of solution to q _{1}. But it is impossible to directly obtain a solution for q _{1} from Q _{2} ~ Q _{7}. To estimate q _{1} from Q _{1} ~ Q _{7}, we here proposed a seeking method, named “expectation least square” (ELS) method.
Similar to the EM method [11, 25, 28, 29], the ELS method also consists of two steps. The first step is the expectation step, denoted by Estep, and the second step is the leastsquare step, denoted by LSstep. q _{1} is initialized to be \( {\widehat{q}}_1^0=\sqrt{{\widehat{Q}}_1} \). We use \( {\widehat{q}}_1^0 \) to estimate q _{2}, q _{3}, and q _{4} and get \( {\widehat{q}}_2^0 \), \( {\widehat{q}}_3^0 \), and \( {\widehat{q}}_4^0 \) from Eqs. (9). Then, we calculate the expected values of Q _{2} ~ Q _{7} from Eqs. (3) ~ (4) with \( {\widehat{q}}_2^0 \), \( {\widehat{q}}_3^0 \), and \( {\widehat{q}}_4^0 \) . At iteration j, we realize Estep and LSstep to get \( {\widehat{q}}_2^j \), \( {\widehat{q}}_3^j \), and \( {\widehat{q}}_4^j \):
Estep:
Calculate the expected values \( E\left({Q}_2^j\right) \) ~ \( E\left({Q}_7^j\right) \) of Q _{2} ~ Q _{7} by replacing \( {\widehat{q}}_1^j \), \( {\widehat{q}}_2^j \), \( {\widehat{q}}_3^j \), and \( {\widehat{q}}_4^j \) into Eqs. (3) ~ (4) where \( {\widehat{q}}_2^j \), \( {\widehat{q}}_3^j \), and \( {\widehat{q}}_4^j \) are obtained by
where
where i = 2 , …, 4 and \( {Q}_i^{\#j}={\widehat{Q}}_{i+3}E\left({Q}^{j1}\right) \) where \( E\left({Q}^{j1}\right)=2\left({\widehat{q}}_2^{j1}{\widehat{q}}_3^{j1}+{\widehat{q}}_2^{j1}{\widehat{q}}_4^{j1}+{\widehat{q}}_3^{j1}{\widehat{q}}_4^{j1}\right) \).
LSstep:
Calculate square value using
Note that \( {\widehat{q}}_1^j \) is a value we want to seek for, therefore, Eq. (10) does not contain \( {\left({\widehat{Q}}_1{EQ}_1^j\right)}^2 \). As it is very difficult to directly get solutions for these four qvalues from the derivative approach, we use an iteration approach to minimize square value:
Use \( {\widehat{q}}_1^j={\widehat{q}}_1^{j1}\pm \varDelta \) to calculate \( {\widehat{q}}_2^j \), \( {\widehat{q}}_3^j \), and \( {\widehat{q}}_4^j \) where j is the jth iteration, j = 1 , …, and Δ is specified with a very small value. Here our algorithm to realize LSstep is
If \( {S}_j^2>{S}_{j1}^2 \), then
if \( {\widehat{q}}_1^j>{\widehat{q}}_1^{j1} \), then \( {\hat{q}}_1^j={\hat{q}}_1^{j1}\varDelta \),
otherwise, \( {\widehat{q}}_1^j={\widehat{q}}_1^{j1}+\varDelta \)
else if \( {S}_j^2<{S}_{j1}^2 \), then
if \( {\widehat{q}}_1^j>{\widehat{q}}_1^{j1} \), then \( {\widehat{q}}_1^j={\widehat{q}}_1^{j1}+\varDelta \),
otherwise, \( {\widehat{q}}_1^j={\widehat{q}}_1^{j1}\varDelta \).
Note that there are not \( {S}_j^2={S}_{j1}^2 \) and \( {\widehat{q}}_1^j={\widehat{q}}_1^{j1} \) in this algorithm. The iteration will stop at \( {S}_j^2\le t \) where t is a given tolerant value. Once the final estimate (\( {\widehat{q}}_1^f \)) of q _{1} is found at a given tolerant value where j = f, the final estimates of q _{2}, q _{3}, and q _{4} are obtained. Then we let \( {\widehat{q}}_1={\widehat{q}}_1^f \), \( {\widehat{q}}_2={\widehat{q}}_2^f \), \( {\widehat{q}}_3={\widehat{q}}_3^f \), and \( {\widehat{q}}_4={\widehat{q}}_4^f \) .
BAT for estimation of the frequencies of codominant marker gametes in F_{2} population
To avoid confusing notations in codominant loci with those in dominant loci, we let 0 and 1 code for homozygote from two parents, respectively, and 2 code for heterozygote at a locus. Since homozygote and heterozygote at three loci can be recognized, most of zygotes are informative for estimation of the frequencies of four pairs of sister gametes. We still assume that the sistergametes have equal frequencies, that is, q _{1} = f(111) = f(000), q _{2} = f(100) = f(011), q _{3} = f(110) = f(001), q _{4} = f(101) =f(010) in F_{2} population. Here these complementary zygote type pairs are listed as follows:
\( \mathrm{Zygote}\ \mathrm{gamete}\ \mathrm{frequency}\ \mathrm{expected}\ \mathrm{Zygote}\ \mathrm{gamete}\ \mathrm{frequency}\ \mathrm{expected} \)
Let P _{1}, P _{2}, P _{3} and P _{4} represent the frequencies of complementary homozygote types (111/000), (100/011), (110/001), and (101/010) in each of which all three loci are homozygous; let P _{12}, P _{13}, P _{14}, P _{23}, P _{24}, and P _{34} be the frequencies of complementary twolocus homozygote types (200/211), (002/112), (121/020), (021/120), (102/012), and (201/210) in each of which only one locus are heterozygous and let P _{1234}, P _{1324}, P _{1423} be the frequencies of complementary onelocus homozygote types (122/022), (221/220) and (212/202) in each of which two loci are heterozygous. Then, \( {P}_1=2{q}_1^2 \), \( {P}_2=2{q}_2^2 \), \( {P}_3=2{q}_3^2 \), \( {P}_4=2{q}_4^2 \), P _{12} = 4q _{1} q _{2}, P _{13} = 4q _{1} q _{3}, P _{14} = 4q _{1} q _{4}, P _{23} = 4q _{2} q _{3}, P _{24} = 4q _{2} q _{4}, P _{34} = 4q _{3} q _{4}, P _{1234} = 4q _{1} q _{2} + 4q _{3} q _{4}, P _{1324} = 4q _{1} q _{3} + 4q _{2} q _{4}, P _{1423} = 4q _{1} q _{4} + 4q _{2} q _{3}. From the zygote type pair list above, we find that the frequencies of these 12 pairs of zygote types can constitute two sets of 6 binomial equations:
We use arithmetic mean to get frequencies of these zygote types in F_{2} population:
where \( {a}_{ij}={\widehat{Q}}_{ij}^1/\left({\widehat{Q}}_{ij}^1+{\widehat{Q}}_{ij}^2\right) \) and b _{ ij } = 1 − a _{ ij }. \( \left({a}_{ij}{Q}_{ij}^1+{b}_{ij}{Q}_{ij}^2\right)={a}_{ij}{\left({q}_i+{q}_j\right)}^2 \) +b _{ ij }(q _{ i } + q _{ j })^{2}= (a _{ ij } + b _{ ij }) (q _{ i } + q _{ j })^{2} = (q _{ i } + q _{ j })^{2} where i and j are gamete types i and j (i = 1, 2, 3 and j = 2, 3, 4 and i ≠ j). Thus, the frequencies of four types of nonsister gametes in a codominant threelocus system in an F_{2}population are easily and fast estimated by
where \( {\widehat{Q}}_{ij} \) and \( {\widehat{P}}_k \) are respective estimates of Q _{ ij } and P _{ k } in F_{2} population where k = 1,…,4 denote gamete types 1, …, 4.
A modified BAT method (BAT II) for estimating the frequencies of eight gamete types without assumption that the sister gametes have equal frequencies in any generation population is given in Additional file 2, Appendix A.
Estimation of recombination fractions
Since these four qs are estimated separately, sum of them does not always satisfy a constraint of \( {\widehat{q}}_1+{\widehat{q}}_2+{\widehat{q}}_3+{\widehat{q}}_4=0.5 \). For this reason, we normalize our estimates as
For three linked loci, the frequencies of the four gamete pairs can be used to find the double crossover types by distinguishing coupling phase from repulsion phase between loci. For example, for an order abc of the three loci a, b and c, p _{4} is determined to be the frequency of double crossover types if its value is the smallest and/or p _{1} is the largest, which are produced at three coupling loci or p _{1} is found to be the frequency of double crossover types if its value is the smallest and/or p _{4} is the largest, which are formed at loci a and c in coupling phase and locus b in repulsion phase. In a similar way, we can also define p _{3} or p _{2} as the frequency of double crossover types.
If p _{4} is frequency of double crossover types, then the recombination fractions between loci a and b, between loci b and c, and between loci a and c can be estimated by
For the linkage orders acb and bac, the recombination fractions between loci are also estimated in a similar way.
In the repulsion phase, the linkage abc order of three loci determines p _{1} to be the frequency of double crossover types, so estimates of recombination fractions between loci a and b, between loci b and c, and between loci a and c are
For the linkage orders bac and acb, the recombination fractions between three loci in the repulsion phase can be estimated in this way.
r _{ ab } , r _{ bc } , and r _{ ac } are simple notations of three recombination fractions in a triple. However, when n markers on a chromosome or a fragment are genotyped, it is difficult to use these notations of three recombination frequencies to denote recombination fractions in n(n − 1)(n − 2)/6 triples. To notate recombination fractions in multiple triples, we let r _{ ab } = r _{ abc } where c is referred to as a reference marker for recombination fraction between markers a and b, r _{ ac } = r _{ acb } where b as reference marker for that between loci a and c, and r _{ bc } = r _{ bca } where a as reference locus for that between markers b and c, in a threelocus system consisting of markers a, b, and c [19]. In more general fashion, we denote i for the first marker, j for the second maker, and k for the last marker. Thus, the rest n − 2 markers are combined with loci i and j into n − 2 threepoints, therefore, there are n − 2 estimates of the recombination fraction between markers i and j. Hence estimate of recombination fraction between loci i and j is given by Tan and Fu’s method [19]:
Practical examples
Here we used RFLP (restriction fragment length polymorphisms) data of 333 F_{2} mice from MAPMAKER/EXP (version 3.0b), LANDER et al. [13] to illustrate performances of our ELS and BAT methods to estimate recombination fractions between dominant and codominant loci. RFLP markers are codominant markers. In genotype data of 333 F2 mice, “A” stands for homozygote A (two alleles from parent A), “H” for heterozygote H (an allele from parent A and the other from parent B), and “B” for homozygote B (two alleles from parent B). We arbitrarily selected 6 codominant markers from the original genotype data. To evaluate our ELS algorithm, we converted the codominant genotype data into dominant genotype data by changing B to H. For convenience, we used arabic digits (1, 2,…,6) to label these six markers: marker 1, marker 2, …, marker 6. Sometime we also used locus 1, locus 2, …, locus 6 to mark these six marker loci. The frequencies of 20 nonsister gametes were estimated by respectively performing ELS on the dominant data and BAT on the codominant genotype data, normalized by using Eq. (16) and the results are summarized in Tables 1 and 2. For the ELS estimation, three nonsister gametes containing loci 4 and 6 (146, 246 and 346) fitted well the ratio of 1:1:1:1 (Chisquare test pvalue >0.084, Table 1), indicating that loci 4 and 6 are unlinked to loci 1, 2 and 3. In addition, the frequencies of gametes 256, 356, and 345 also fitted the ratio of 1:1:1:1 with pvalue ≥ 0.063 (Chisquare test, Table 1), but gametes 156, 245 and 145 had the ratios significantly deviating against 1:1:1:1 (Chisquare test pvalue <0.0212, Table 1), we could infer that locus 5 was linked to loci 1 but independent of locus 3 and unascertained at locus 2. Thus, we definitely excluded loci 4 and 6 in the linkage. By using eqs. (17) – (19), the recombination fractions in four triples (123), (125), (135), and (235) were calculated by following the five given steps: the first step is to determine the linkage order of three loci in triple. For example, in triple (123), p _{1} = f(abc) = 0.208668 is the largest value while p _{2} = f(Abc) = 0.086162 is the smallest one, that is to say, gamete Abc is double crossover type and abc is parental type, so their order is 2(b)1(a)3(c). Step2 is to determine linkage phase: since gamete bac is parental type and bAc is double crossover type, gamete BAC or bac is couple phase. At step 3, we abstracted frequencies of gametes 123, 125, 135, 235 (Table 3) from Table 1. At step 4, recombination fractions between loci in a triple were estimated as
Similarly, we also estimated the recombination fractions in triples (125), (135), and (235) (Table 3). Finally, the threepoint estimates of the recombination fractions were incorporated into twopoint estimates by applying Eq. (19) to the data in Table 4:
Table 2 displays frequencies of codominant gametes estimated by our BAT method. It is clear to see that frequencies of gametes 145, 246, 345, 346, and 456 fitted well ratio of 1:1:1:1 with pvalue ≥ 0.0559 (Chisquare test), however, the frequencies of gametes 156, 256 and 356 did not fit the ratio of 1:1:1:1 with pvalue < 0.0121 (Chisquare test, Table 2), inferring that loci 4 and 6 are unlinked to loci 1, 2 and 3 but locus 5 could not be inferred to linked to them. Again, in codominant genotype data, locus 5 was still unascertained. Following the steps above, we obtained estimates of recombination fractions between these four loci (Table 5). Both ELS estimates of recombination fractions between dominant loci and BAT estimates between codominant loci show that locus 5 could not be tightly linked to any one of loci 1, 2 and 3. Loci 1, 2 and 3 could be determined to have linkage order of 213. Simulation data also showed that the codominant estimator had higher precision than the dominant estimator (see Simulation data section), suggesting that codominant markers indeed contain higher linkage information than dominant ones.
Simulation data
We performed simulation study to compare the two estimators of recombination fractions. We followed the simulation scheme of Tan and Fu [19]. Briefly, we set two linkage maps comprised of 6 dominant loci and 6 codominant loci, respectively. Five possible map distances 10, 15, 20, 25, and 30 cM (1 cM = 1%) were randomly assigned to the five adjacent intervals on these two linkage models with equal probability (see Methods for detail). The point process model [27] was used to generate F_{2} population. We did not consider recombination interference and linkage disequilibrium. Recombination fractions between adjacent loci in an unknown linkage phase (or say random phase) were estimated by the twopoint EM [14, 23] and ELS estimators in 100 repeated samples of 100, 200, and 300 individuals drawn from the simulated F_{2} population. These two estimators were rated by the variance that quantifies deviation of estimated recombination fraction between two adjacent loci from its true value and is equivalent to mean squared error (MSE). For dominant markers, simulation shows that the ELS algorithm had much smaller variances in estimation of true recombination fractions between adjacent loci in samples of 100, 200 and 300 F_{2} individuals than twopoint EM algorithm (Fig. 1). In Table 6, one can find that ELS had slightly higher probability of recovering true linkage maps of 6 loci than EM [14, 23] and BAT in the case of coupling phase and samples of 100 and 200 F_{2} individuals. When sample size reached 300 individuals, both ELS and EM recovered true coupling linkage maps with 100% probability and BAT also had 97.9% recovery rate. However, in unknown phase, ELS recovered true linkage maps of 6 loci with 23.4% probability in sample of 100 F_{2} individuals and reached 85% recovery rate in sample of 300 F_{2} individuals. By contrast, EM had very low recovery rate (23.4%) even when sample size was 300. Therefore, ELS performed much better than twopoint EM algorithm in all given scenarios. An inexact comparison can be done between ELS and threepoint EM algorithm of Lu et al. [30], Table 4 in Lu et al. showed that their threepoint EM algorithms had 98.5% probability of finding the correct linkage map of three dominant markers in coupling phase from a sample of fullsib 100 individuals (corresponding to 100 F_{2} individuals), our ELS had 96.7% probability of recovering true linkage map of 6 dominant markers in coupling phase in 100 F_{2} individuals (Table 6). The probability to find a given linkage map will remarkably decrease as number of markers increases. So we can predict that the threepoint EM algorithm would not have over 96.7% of the probability to find a given linkage map of 6 dominant markers. For the repulsion phase (or trans × trans), Lu et al.’s threepoint EM algorithm had 99.5% probability of finding a correct linkage map of three markers in 100 fullsib individuals, which is higher than 98.6% in coupling phase. In theory, any EM algorithm should have much lower probability to find a given linkage order in repulsion phase than in coupling phase because the repulsion phase has much less linkage information content than the coupling phase [14, 26]. So, this result may be required to be confirmed in more simulations. Since Lu et al. did not implement simulation of random phase case and the repulsion phase is not random phase, the comparison cannot be made between the threepoint EM and ELS algorithms in the random phase. For codominant markers, the BAT method performed with smaller variances than the twopoint EM algorithm in the most cases. The results provided strong evidence for the conclusion that a method or algorithm based on threepoint gametes can mitigate effect of low linkage information of repulsion phase on estimation of recombination fractions. Compared to the simulated results in Table 3 in [19], one can find that the ELS algorithm is better than the Tan and Fu’s BAT method. Table 3 in [19] showed that in case of unknown phase, the BAT method outperformed twopoint EM.
Discussion
Accurate estimation of recombination fractions is a key for mapping multiple markers. Therefore, powerful method for estimating recombination fractions is required. For dominant loci, the EM and ML methods have been verified to have low power to estimate frequencies of recombination between loci in repulsion phase [14, 19]. This is because the EM method cannot distinguish dominant homozygous genotypes from dominant heterozygous genotypes.
Compared to the EM algorithm, the ELS algorithm based on Tan and Fu’s method [19] has small bias for estimating recombination fractions between dominant loci on a chromosome in a larger F_{2} population due to the following reasons: (a) gamete analysis can effectively distinguish marker linkage phases; (b) accurately estimate q _{1}, and (c) average of estimates of recombination fraction between two loci over all reference loci [Eq. (19)] effectively balances sampling error. Estimation of q _{1} is restriction of the Tan and Fu’s method. We here proposed iteration expectationleast square algorithm (ELS) to seek for accurate q _{1} estimation. This new algorithm is similar to expectation maximum algorithm and its statistical properties will be given by more simulation comparisons in elsewhere. In addition, importance for high efficiency of recombination fraction estimation is \( {\widehat{Q}}_k^{\ast } \). ELS had much higher recovery rate by using \( {\widehat{Q}}_k^{\ast } \) than by using \( {\widehat{Q}}_k \) in both coupling and unknown phase (Table 6). Correlation analysis also indicated that \( {\widehat{Q}}_k^{\ast } \) indeed has the linkage behavior similar to \( {\widehat{Q}}_k \) (Additional file 3, Appendix B). Furthermore, we found that \( {\widehat{Q}}_k^{\ast } \) obtained from a data set of 100 simulated samples of 100 F_{2} individuals has remarkably smaller variance than \( {\widehat{Q}}_k \) (data not shown). To fully confirm that \( {\widehat{Q}}_k^{\ast } \) is the optimal choice in our ELS method, \( {\widehat{Q}}_k^{\ast } \) was taken into account where \( {\widehat{Q}}_k^{\ast } \) = \( \left({\widehat{Q}}_k+{\widehat{Q}}_k^o\right)/2 \) if \( {\widehat{Q}}_k^o \) >0, otherwise, \( {\widehat{Q}}_k^{\ast } \) = \( {\widehat{Q}}_k \). The simulated result showed that ~31% of linkage maps recovered true order of 6 dominant loci in samples of 100 F_{2} individuals, which is apparently lower than that by using \( {\widehat{Q}}_k^{\ast }=\frac{1}{2}\left({\widehat{Q}}_k+{\widehat{Q}}_k^{\mathrm{o}}\right) \). For this reason, we chose \( {\widehat{Q}}_k^{\ast }=\frac{1}{2}\left({\widehat{Q}}_k+{\widehat{Q}}_k^{\mathrm{o}}\right) \) in our ELS algorithm. Besides the ELS algorithm, average of recombination fraction between two loci over all reference loci greatly reduces noise of recombination fractions.
BATII given in Additional file 2, Appendix A, can be used to estimate frequencies of 8 codominant gamete types in any nature population because it does not require the assumption that the sister gametes have equal frequencies in a population. However, its estimation accuracy is not higher than the first BAT method in F_{2} population because sistergametes really have equal frequencies and twolocus heterozygote types are not useful in the BATII. In a natural population, for example, human population, the frequencies of these gametes are not purely derived from recombination events but may be due to selection, genetic drift, migration and mutation. If, however, sister gametes are found to be equal in statistics, then these frequencies can still be used to inference recombination fractions between loci and recombination inference.
Conclusions
Accurate estimation of recombination fractions between loci is given by methodologies developed for accurate estimation of gamete frequencies in a population. Analyses of simulated and real dominant and codominant data show that the ELS method proposed here is a powerful algorithm for accurate estimation of frequencies of gametes with unknown phase in dominant threelocus system in F_{2} population and BAT is a computationally efficient and powerful method for estimating frequencies of nonsister threepoint codominant gametes.
Abbreviations
 BAT:

Binomial analysis of threepoint gametes
 ELS:

Expectation least square
 EM:

Expectation maximization
 RFLP:

Restriction fragment length polymorphism
References
Bowers JE, Bachlava E, Brunick RL, Rieseberg LH, Knapp SJ, Burke JM. Development of a 10,000 locus genetic map of the sunflower genome based on multiple crosses. G3 (Bethesda). 2012;2(7):721–9.
Davey JW, Hohenlohe PA, Etter PD, Boone JQ, Catchen JM, Blaxter ML. Genomewide genetic marker discovery and genotyping using nextgeneration sequencing. Nat Rev Genet. 2011;12(7):499–510.
Elshire RJ, Glaubitz JC, Sun Q, Poland JA, Kawamoto K, Buckler ES, Mitchell SE. A robust, simple genotypingbysequencing (GBS) approach for high diversity species. PLoS One. 2011;6(5):e19379.
Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR. MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010;34(8):816–34.
Moriguchi Y, UjinoIhara T, Uchiyama K, Futamura N, Saito M, Ueno S, Matsumoto A, Tani N, Taira H, Shinohara K, et al. The construction of a highdensity linkage map for identifying SNP markers that are tightly linked to a nuclearrecessive major gene for male sterility in Cryptomeria Japonica D. Don. BMC Genomics. 2012;13:95.
Sonah H, Bastien M, Iquira E, Tardivel A, Legare G, Boyle B, Normandeau E, Laroche J, Larose S, Jean M, et al. An improved genotyping by sequencing (GBS) approach offering increased versatility and efficiency of SNP discovery and genotyping. PLoS One. 2013;8(1):e54603.
Verma S, Gupta S, Bandhiwal N, Kumar T, Bharadwaj C, Bhatia S. Highdensity linkage map construction and mapping of seed trait QTLs in chickpea (Cicer Arietinum L.) using genotypingbysequencing (GBS). Sci Rep. 2015;5:17512.
DonisKeller H, Green P, Helms C, Cartinhour S, Weiffenbach B, Stephens K, Keith TP, Bowden DW, Smith DR, Lander ES, et al. A genetic linkage map of the human genome. Cell. 1987;51(2):319–37.
Ellis THN. Neighbor mapping as method for ordering genetic markers. Genet Res. 1997;69:35–43.
Lander ES, Botstein D. Homozygosity mapping: a way to map human recessive traits with the DNA of inbred children. Science. 1987;236(4808):1567–70.
Lander ES, Green P. Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci U S A. 1987;84(8):2363–7.
Lander ES, Green P. Counting algorithms for linkage: correction to Morton and Collins. Ann Hum Genet. 1991;55(Pt 1):33–8.
Lander ES, Green P, Abrahamson J, Barlow A, Daly MJ, Lincoln SE, Newberg LA. MAPMAKER: an interactive computer package for constructing primary genetic linkage maps of experimental and natural populations. Genomics. 1987;1(2):174–81.
Liu BH. Statistical genomics: linkage, mapping, and QTL analysis. Florida: CRC Press; 1998.
Mester DI, Ronin YI, Hu Y, Peng J, Nevo E, Korol AB. Efficient multipoint mapping: making use of dominant repulsionphase markers. Theor Appl Genet. 2003;107(6):1102–12.
Ronin Y, Mester D, Minkov D, Belotserkovski R, Jackson BN, Schnable PS, Aluru S, Korol A. Twophase analysis in consensus genetic mapping. G3 (Bethesda). 2012;2(5):537–49.
Tan YD, Fu YX. A novel method for estimating linkage maps. Genetics. 2006;173(4):2383–90.
Zhang L, Li H, Wang J. Linkage analysis and map construction in genetic populations of clonal F1 and double cross. G3 (Bethesda). 2015;5(3):427–39.
Tan YD, Fu YX. A new strategy for estimating recombination fractions between dominant markers from an F2 population. Genetics. 2007;175(2):923–31.
Allard RW. Formulas and tables to facilitate the calculation of recombination values in heredity. California: University of California; 1956.
Knapp SJ, Holloway JL, Bridges WC, Liu BH. Mapping dominant markers using F_{2} matings. Theor Appl Genet. 1995;91(1):74–81.
Peng J, Korol AB, Fahima T, Roder MS, Ronin YI, Li YC, Nevo E. Molecular genetic maps in wild emmer wheat, Triticum Dicoccoides: genomewide coverage, massive negative interference, and putative quasilinkage. Genome Res. 2000;10(10):1509–31.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B Methodol. 1977;39(1):1–38.
Morton NE, Collins A. Counting algorithms for linkage. Ann Hum Genet. 1990;54(Pt 2):103–6.
Ott G. Analysis of human genetic linkage. Baltimore/London: John Hopkins University Press; 1991.
Esch E. Estimation of gametic frequencies from F2 populations using the EM algorithm and its application in the analysis of crossover interference in rice. Theor Appl Genet. 2005;111(1):100–9.
Foss E, Lande R, Stahl FW, Steinberg CM. Chiasma interference as a function of genetic distance. Genetics. 1993;133(3):681–91.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. JR Statist Soc. 1977;39B:1–38.
Niu T, Ding AA, Kreutz R, Lindpaintner K. An expectationmaximizationlikelihoodratio test for handling missing data: application in experimental crosses. Genetics. 2005;169(2):1021–31.
Lu Q, Cui Y, Wu R. A multilocus likelihood approach to joint modeling of linkage, parental diplotype and gene order in a fullsib family. BMC Genet. 2004;5:20.
Acknowledgements
Not applicable.
Funding
This research and article’s publication was supported in part by grant R01 CA183878. The funding agency played no role in the design or conclusion of the study.
Availability of data and materials
Since all simulation data were temporary and dynamic data and discarded when the programs were ended, we did not have simulation data to be deposited in a repository. R source code for ELS and BAT as well as the program for generating simulated data are available in Additional file 3.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 18 Supplement 11, 2017: Selected articles from the International Conference on Intelligent Biology and Medicine (ICIBM) 2016: bioinformatics. The full contents of the supplement are available online at <https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume18supplement11>.
Author information
Authors and Affiliations
Contributions
Study design: TYD, XHZ, QM. Method development and data analysis: TYD, QM. Manuscript writing: TYD, XHZ, QM. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
Source code: three R functions: BAT.R, ELS.R, simulatF2.R. (ZIP 4 kb)
Additional file 2:
Appendix A. Binomial analysis of threepoint method (BATII) is described in detail. BATII is used to estimate frequencies of sister gametes at codominant loci in natural populations. (DOCX 184 kb)
Additional file 3:
Appendix B. A proof of a proposition that equal weights of two datasets combined into a dataset have maximum linkage information and minimum error for linkage analysis is given. (DOCX 41 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Tan, YD., Zhang, X.H.F. & Mo, Q. New statistical methods for estimation of recombination fractions in F_{2} population. BMC Bioinformatics 18 (Suppl 11), 404 (2017). https://doi.org/10.1186/s1285901718048
Published:
DOI: https://doi.org/10.1186/s1285901718048
Keywords
 Dominant marker
 Codominant marker
 Gamete frequency
 EM algorithm
 ELS algorithm