New statistical methods for estimation of recombination fractions in F2 population

Background Dominant markers in an F2 population or a hybrid population have much less linkage information in repulsion phase than in coupling phase. Linkage analysis produces two separate complementary marker linkage maps that have little use in disease association analysis and breeding. There is a need to develop efficient statistical methods and computational algorithms to construct or merge a complete linkage dominant marker maps. The key for doing so is to efficiently estimate recombination fractions between dominant markers in repulsion phases. Result We proposed an expectation least square (ELS) algorithm and binomial analysis of three-point gametes (BAT) for estimating gamete frequencies from F2 dominant and codominant marker data, respectively. The results obtained from simulated and real genotype datasets showed that the ELS algorithm was able to accurately estimate frequencies of gametes and outperformed the EM algorithm in estimating recombination fractions between dominant loci and recovering true linkage maps of 6 dominant loci in coupling and unknown linkage phases. Our BAT method also had smaller variances in estimation of two-point recombination fractions than the EM algorithm. Conclusion ELS is a powerful method for accurate estimation of gamete frequencies in dominant three-locus system in an F2 population and BAT is a computationally efficient and fast method for estimating frequencies of three-point codominant gametes. Electronic supplementary material The online version of this article (10.1186/s12859-017-1804-8) contains supplementary material, which is available to authorized users.


Background
A great advance has been made in building genetic maps of various species due to the development of large-scale molecular marker technologies [1][2][3][4][5][6][7] and statistical methods [4,[8][9][10][11][12][13][14][15][16][17][18]. However, mapping of numerous molecular markers has been complicated by linkage phases of dominance [14][15][16]19]. In two-point analysis, markers in repulsion phase provide quite less linkage information than in coupling phase [14,15,20,21]. This is especially true for dominant markers in F 2 population [14]. In practical mapping experiments, although the linkage phase for each dominant marker is random, a half of markers are derived from one of two coupling phases. The phase between couplings is repulsion [14,15]. This situation results in two separate partner linkage maps for dominant markers: high linkage information content of markers in the coupling phase and low linkage information content of markers in the repulsion phase. Thus one has to build two complementary linkage maps [14,15,21,22]. To date, there has not yet been an effective way to integrate both into a complete map. Mester et al. [15] attempted to use pairs of codominant and dominant (CD) markers to merge such two complementary maps because pairs of the CD markers in repulsion phase have much higher linkage information content than pairs of dominant-only markers in repulsion phase. However, this strategy demands that all dominant markers be paired with codominant markers, which is not a general case in mapping practice, otherwise, local and global disturbance will then violently affect the reliability of the integrated map.
The two-point analysis implemented by the expectation maximization (EM) algorithm [11][12][13][23][24][25] is a highly powerful approach to estimate recombination fractions between codominant loci and between dominant loci in coupling phase, but the EM algorithm has very low power in estimation of recombination fractions between dominant loci in repulsion phase. This is because it is difficult for the EM algorithm to distinguish genotypes in coupling phase from those in repulsion phase for dominant markers.
Therefore, the key of developing a powerful method for mapping dominant loci in an intersection population is to overcome the difficulty of distinguishing coupling phase from repulsion phase. Since two-point analysis, as pointed out above, performs very poorly in the estimation of recombination fractions between dominant loci, three-point analysis is alternatively taken into account. However, few threepoint EM algorithms can be applied to dominant markers because dominant markers are less informative for maximum likelihood estimation [26]. One effective way to carry out three-point analysis is to dissect three-point genotypes into various gamete components that are informative for distinction between coupling and repulsion phases, and then, to estimate their frequencies. With these estimated gamete frequencies, one can immediately estimate recombination fractions between dominant loci in couple and repulsion phases. A key to this strategy is to obtain estimate of gamete frequencies. On the basis of dissection of genotypes, Tan and Fu proposed a binomial analysis of threepoint (BAT) to estimate frequencies of dominant gametes [19]. However, this binomial approach is limited to the frequency of the three-point recessive gamete abc. The accuracy of estimation is completely dependent on the observed frequency of its phenotype (aabbcc). We have developed a new method called "expectation least square" (ELS) to address this problem. ELS estimation, similarly to expectation maximum algorithm, is realized on the basis of Tan and Fu's BAT method [19]. That is, the expectation of phenotype frequencies can be given by using Eqs. (1)(2)(3)(4)(5)(6)(7)(8)(9) in the BAT of Tan and Fu [19], and the difference between estimated and expected values of phenotype frequencies is given using least square. The expectation and least square steps are iterated so that the difference between estimated and expected values is less than tolerant value. In addition, we have also developed a fast binomial approach to estimate frequencies of codominant gametes.

Real data collection
Mouse genotype data: A RFLP dataset of 333 F 2 mice was obtained from MAPMAKER/EXP (version 3.0b) [13].

Simulation
For dominant loci, we just took unknown phase into account in simulation and followed a point process model [27] and scheme of Tan and Fu [19] to perform simulations. In N F 1 meioses, recombination events occurred at random between two adjacent loci. Here for the simplicity, we allowed for only independent crossovers during procedure of recombination occurrence between nonsister chromatides. We generated N F 2 individuals with ratio = phenotype A: phenotype a = 3:1 at each dominant locus or A(homozygote): H(heterozygote): B(homozygote) = 1:2:1 at each codominant locus. We set three levels for sample size: N = 100, 200, and 300 F 2 individuals and 100 iterations and used variance (equivalent to mean square error, MSE) that quantifies deviation of estimated recombination fraction between two adjacent loci from its true value to evaluate these estimators. Since the ELS and BAT estimators work in three-point system, three-point recombination fractions were incorporated to two-point recombination fractions by using Tan and Fu [19] method. Simulation of codominant and dominant F 2 populations and the ELS and BAT estimations of gamete frequencies in F 2 population were implemented by our R functions (Additional file 1, source code).

Estimation of the frequencies of three-locus gametes in an F 2 population
Since our ELS method for accurate estimation of the frequencies of three-locus gametes in a population with random union of gametes is based on dissection of phenotypes, for convenience, we start by presenting the BAT method of Tan and Fu [19].

ELS estimation of frequencies of dominant marker gametes
Our study here is restricted to three biallelic dominant markers. We use A and a, B and b, C and c to represent two alleles at three loci where upper letters (A, B and C) stand for dominant alleles and lower letters (a, b and c) for recessive alleles. A triple-heterozygote individual via meiosis produces eight types of gametes at the three loci: ABC, ABc, Abc, AbC, aBC, abC, aBc and abc. Gametes ABC and abc are a pair of sister gametes on which two alleles at the all three loci are different and come from two different parents. Similarly, Abc and aBC, abC and ABc, AbC and aBc are also pairs of sister gametes. Two sister gametes theoretically have equal frequency in an F 2 population because no mutation, no migration, no gene conversion and no selection occur in such a random mating population. From the expectation that sister-gametes have equal frequencies, we have in an F 2 population f(ABC) = f(abc) = q 1 , f(ABC) = f(aBC) = q 2 , f(ABc) = f(aBC) = q 3 , f(AbC) = f(aBc) = q 4 . These gamete frequencies are constrained by 2q 1 + 2q 2 + 2q 3 + 2q 4 = 1. The individuals in the population can be classified into four categories: category 0 in which all individuals possess 0 dominant locus, that is, all individuals have three recessive loci; categories 1, 2 and 3 in which all individuals have respectively only one, two and three homozygous or heterozygous dominant loci. To accurately estimate gamete frequencies, we dissect a phenotype into different zygote types (genotypes) in each category using sister gametes. In category 1, for example, aabbC_ has only locus c with one or two dominant alleles. Therefore it can be dissected into three zygote types: Phenotypes aaB_cc and A_bbcc are dissected in a similar fashion. Category 2 also has three phenotypes and each of them can be dissected into four zygote types that are comprised of five pairs of sister gametes. For instance, phenotype type A_B_cc can be dissected into Category 3 has only one phenotype. The phenotype is comprised of 8 zygote types (genotypes) and therefore it is not useful for estimate of gamete frequencies. We use Q 1 , Q 2 , Q 3 , Q 4 , Q 5 , Q 6 , and Q 7 to respectively represent the frequency expectations of phenotypes aabbcc, aabbC_, aaB_cc, A_bbcc, A_B_cc, A_bbC_, and aaB_C_ in a population. The frequency of phenotype aabbcc is The other 6 phenotypes have their frequencies: Using Q = 2 (q 2 q 3 + q 2 q 4 + q 3 q 4 ), Eq. (4) is simplified as Estimates of q 1 , … , q 4 can be obtained from the above sets of equations by replacing Q k with their observed frequencies where k = 1, 2,…,7 for 7 phenotypes. Theoretically, eqs. (1) and (3) are sufficient to make solutions for the frequencies of four types of gametes. However, Eq. (5) can be used to further minimize noise in the observed frequencies. That is, Q 2 , Q 3 ,and Q 4 can be alternatively estimated aŝ where Q = Q 5 + Q 6 + Q 7 + Q 1 − 0.25 [19]. It implicates that Q 2 , Q 3 , and Q 4 can also be estimated from the estimated frequencies of Q 1 , Q 5 , Q 6 , and Q 7 . Thus, we can combine the two sets of estimates of Q 2 , Q 3 , and Q 4 into one set: where a k and b k are weights ofQ k andQ k # , respectively, where k = 2, 3, and 4.Q k andQ k # are respectively estimates of Q k and Q k # . In general case, a k = b k (see Additional file 3: Appendix B). An alternative method for weighting is a k ¼Q k =Q k þQ k # À Á and b k = 1 − a k . When the sample is small, it is likely thatQ k # ≤ 0 orQ k ¼ 0.
In such a case, one can set a k = 1 and b k = 0 forQ k , q 3 can be given by Similarly, Q 1 , Q 2 , Q 3 , and Q 4 are respectively estimated byQ 1 , Q 2 Ã ,Q 3 Ã ,Q 4 Ã , therefore q 3 , q 2 , q 4 , and q 1 are respectively estimated bŷ In Eq. (9), accurate estimation of q 1 is a key contribution to accurate estimations of q 2 , q 3 , and q 4 . Equations (3) and (4) show that Q 2~Q7 can also provide information of solution to q 1 . But it is impossible to directly obtain a solution for q 1 from Q 2~Q7 . To estimate q 1 from Q 1~Q7 , we here proposed a seeking method, named "expectation least square" (ELS) method.

LS-step:
Calculate square value using Note thatq j 1 is a value we want to seek for, therefore, Eq. (10) does not containQ 1 −EQ As it is very difficult to directly get solutions for these four q-values from the derivative approach, we use an iteration approach to minimize square value: where j is the jth iteration, j = 1 , …, and Δ is specified with a very small value. Here our algorithm to realize LS-step is

BAT for estimation of the frequencies of codominant marker gametes in F 2 population
To avoid confusing notations in codominant loci with those in dominant loci, we let 0 and 1 code for homozygote from two parents, respectively, and 2 code for heterozygote at a locus. Since homozygote and heterozygote at three loci can be recognized, most of zygotes are informative for estimation of the frequencies of four pairs of sister gametes. We still assume that the sister-gametes have equal frequencies, that is, & ; Let P 1 , P 2 , P 3 and P 4 represent the frequencies of complementary homozygote types (111/000), (100/011), (110/001), and (101/010) in each of which all three loci are homozygous; let P 12 , P 13 , P 14 , P 23 , P 24 , and P 34 be the frequencies of complementary two-locus homozygote types (200/211), (002/112), (121/020), (021/120), (102/012), and (201/210) in each of which only one locus are heterozygous and let P 1234 , P 1324 , P 1423 be the frequencies of complementary one-locus homozygote types (122/022), (221/220) and (212/202) in each of which two loci are heterozygous. Then, P 1 ¼ 2q 2 1 , P 2 ¼ 2 q 2 2 , P 3 ¼ 2q 2 3 , P 4 ¼ 2q 2 4 , P 12 = 4q 1 q 2 , P 13 = 4q 1 q 3 , P 14 = 4q 1 q 4 , P 23 = 4q 2 q 3 , P 24 = 4q 2 q 4 , P 34 = 4q 3 q 4 , P 1234 = 4q 1 q 2 + 4q 3 q 4 , P 1324 = 4q 1 q 3 + 4q 2 q 4 , P 1423 = 4q 1 q 4 + 4q 2 q 3 . From the zygote type pair list above, we find that the frequencies of these 12 pairs of zygote types can constitute two sets of 6 binomial equations: We use arithmetic mean to get frequencies of these zygote types in F 2 population: where 2 where i and j are gamete types i and j (i = 1, 2, 3 and j = 2, 3, 4 and i ≠ j). Thus, the frequencies of four types of non-sister gametes in a codominant threelocus system in an F2 population are easily and fast estimated bŷ whereQ ij andP k are respective estimates of Q ij and P k in F 2 population where k = 1,…,4 denote gamete types 1, …, 4. A modified BAT method (BAT II) for estimating the frequencies of eight gamete types without assumption that the sister gametes have equal frequencies in any generation population is given in Additional file 2, Appendix A.

Estimation of recombination fractions
Since these four qs are estimated separately, sum of them does not always satisfy a constraint ofq 1 þq 2 þq 3 þq 4 ¼ 0:5. For this reason, we normalize our estimates as For three linked loci, the frequencies of the four gamete pairs can be used to find the double crossover types by distinguishing coupling phase from repulsion phase between loci. For example, for an order a-b-c of the three loci a, b and c, p 4 is determined to be the frequency of double crossover types if its value is the smallest and/or p 1 is the largest, which are produced at three coupling loci or p 1 is found to be the frequency of double crossover types if its value is the smallest and/or p 4 is the largest, which are formed at loci a and c in coupling phase and locus b in repulsion phase. In a similar way, we can also define p 3 or p 2 as the frequency of double crossover types.
If p 4 is frequency of double crossover types, then the recombination fractions between loci a and b, between loci b and c, and between loci a and c can be estimated by For the linkage orders a-c-b and b-a-c, the recombination fractions between loci are also estimated in a similar way.
In the repulsion phase, the linkage a-b-c order of three loci determines p 1 to be the frequency of double crossover types, so estimates of recombination fractions between loci a and b, between loci b and c, and between loci a and c are For the linkage orders b-a-c and a-c-b, the recombination fractions between three loci in the repulsion phase can be estimated in this way. r ab , r bc , and r ac are simple notations of three recombination fractions in a triple. However, when n markers on a chromosome or a fragment are genotyped, it is difficult to use these notations of three recombination frequencies to denote recombination fractions in n(n − 1)(n − 2)/6 triples. To notate recombination fractions in multiple triples, we let r ab = r abc where c is referred to as a reference marker for recombination fraction between markers a and b, r ac = r acb where b as reference marker for that between loci a and c, and r bc = r bca where a as reference locus for that between markers b and c, in a three-locus system consisting of markers a, b, and c [19]. In more general fashion, we denote i for the first marker, j for the second maker, and k for the last marker. Thus, the rest n − 2 markers are combined with loci i and j into n − 2 three-points, therefore, there are n − 2 estimates of the recombination fraction between markers i and j. Hence estimate of recombination fraction between loci i and j is given by Tan and Fu's method [19]: Practical examples Here we used RFLP (restriction fragment length polymorphisms) data of 333 F 2 mice from MAPMAKER/EXP (version 3.0b), LANDER et al. [13] to illustrate performances of our ELS and BAT methods to estimate recombination fractions between dominant and codominant loci. RFLP markers are codominant markers. In genotype data of 333 F2 mice, "A" stands for homozygote A (two alleles from parent A), "H" for heterozygote H (an allele from parent A and the other from parent B), and "B" for homozygote B (two alleles from parent B). We arbitrarily selected 6 codominant markers from the original genotype data. To evaluate our ELS algorithm, we converted the codominant genotype data into dominant genotype data by changing B to H. For convenience, we used arabic digits (1, 2,…,6) to label these six markers: marker 1, marker 2, …, marker 6. Sometime we also used locus 1, locus 2, …, locus 6 to mark these six marker loci. The frequencies of 20 non-sister gametes were estimated by respectively performing ELS on the dominant data and BAT on the codominant genotype data, normalized by using Eq. (16) and the results are summarized in Tables 1 and 2 Similarly, we also estimated the recombination fractions in triples (125), (135), and (235) ( Table 3). Finally, the three-point estimates of the recombination fractions were incorporated into two-point estimates by applying Eq. (19) to the data in Table 4:   Table 2 displays frequencies of codominant gametes estimated by our BAT method. It is clear to see that frequencies of gametes 145, 246, 345, 346, and 456 fitted well ratio of 1:1:1:1 with p-value ≥ 0.0559 (Chi-square test), however, the frequencies of gametes 156, 256 and 356 did not fit the ratio of 1:1:1:1 with p-value < 0.0121 (Chi-square test, Table 2), inferring that loci 4 and 6 are unlinked to loci 1, 2 and 3 but locus 5 could not be inferred to linked to them. Again, in codominant genotype data, locus 5 was still unascertained. Following the steps above, we obtained estimates of recombination fractions between these four loci (Table 5). Both ELS estimates of recombination fractions between dominant loci and BAT estimates between codominant loci show that locus 5 could not be tightly linked to any one of loci 1, 2 and 3. Loci 1, 2 and 3 could be determined to have linkage order of 2-1-3. Simulation data also showed that the codominant estimator had higher precision than the dominant estimator (see Simulation data section), suggesting that codominant markers indeed contain higher linkage information than dominant ones.

Simulation data
We performed simulation study to compare the two estimators of recombination fractions. We followed the simulation scheme of Tan and Fu [19]. Briefly, we set two linkage maps comprised of 6 dominant loci and 6 codominant loci, respectively. Five possible map distances 10, 15, 20, 25, and 30 cM (1 cM = 1%) were randomly assigned to the five adjacent intervals on these two linkage models with equal probability (see Methods for detail). The point process model [27] was used to generate F 2 population. We did not consider recombination interference and linkage disequilibrium. Recombination fractions between adjacent loci in an unknown linkage phase (or say random phase) were estimated by the two-point EM [14,23] and ELS estimators in 100 repeated samples of 100, 200, and 300 individuals drawn from the simulated F 2 population. These two estimators were rated by the variance that quantifies deviation of estimated recombination fraction between two adjacent loci from its true value and is equivalent to mean squared error (MSE). For dominant markers, simulation shows that the ELS algorithm had much smaller variances in estimation of true recombination fractions between adjacent loci in samples of 100, 200 and 300 F 2 individuals than two-point EM algorithm (Fig. 1). In Table 6, one can find that ELS had slightly higher probability of recovering true linkage maps of 6 loci than EM [14,23]   .5% probability of finding a correct linkage map of three markers in 100 fullsib individuals, which is higher than 98.6% in coupling phase. In theory, any EM algorithm should have much lower probability to find a given linkage order in repulsion phase than in coupling phase because the repulsion phase has much less linkage information content than the coupling phase [14,26]. So, this result may be required to be confirmed in more simulations. Since Lu et al. did not implement simulation of random phase case and the repulsion phase is not random phase, the comparison cannot be made between the three-point EM and ELS algorithms in the random phase. For codominant markers, the BAT method performed with smaller variances than the twopoint EM algorithm in the most cases. The results provided strong evidence for the conclusion that a method or algorithm based on three-point gametes can mitigate effect of low linkage information of repulsion phase on estimation of recombination fractions. Compared to the simulated results in Table 3 in [19], one can find that the ELS algorithm is better than the Tan and Fu's BAT method. Table 3 in [19] showed that in case of unknown phase, the BAT method outperformed two-point EM.

Discussion
Accurate estimation of recombination fractions is a key for mapping multiple markers. Therefore, powerful method for estimating recombination fractions is required. For dominant loci, the EM and ML methods have been verified to have low power to estimate frequencies of recombination between loci in repulsion phase [14,19]. This is because the EM method cannot distinguish dominant homozygous genotypes from dominant heterozygous genotypes. Compared to the EM algorithm, the ELS algorithm based on Tan and Fu's method [19] has small bias for estimating recombination fractions between dominant loci on a chromosome in a larger F 2 population due to the following reasons: (a) gamete analysis can effectively distinguish marker linkage phases; (b) accurately estimate q 1 , and (c) average of estimates of recombination fraction between two loci over all reference loci [Eq. (19)] effectively balances sampling error. Estimation of q 1 is restriction of the Tan and Fu's method. We here proposed iteration expectation-least square algorithm (ELS) to seek for accurate q 1 estimation. This new algorithm is similar to expectation maximum algorithm and its statistical properties will be given by more simulation comparisons in elsewhere. In addition, importance for high efficiency of recombination fraction estimation isQ k Ã .
ELS had much higher recovery rate by usingQ k Ã than by usingQ k in both coupling and unknown phase ( Table  6). Correlation analysis also indicated thatQ k Ã indeed has the linkage behavior similar toQ k (Additional file 3, Appendix B). Furthermore, we found thatQ k Ã obtained from a data set of 100 simulated samples of 100 F 2 individuals has remarkably smaller variance thanQ k (data not shown). To fully confirm thatQ k Ã is the optimal choice in our ELS method,Q k Ã was taken into account wherê The simulated result showed that~31% of linkage maps recovered true order of 6 dominant loci in samples of 100 F 2 individuals, which is apparently lower than that by usingQ k Ã ¼ 1 2Q k þQ k o À Á . For this reason, we chosê Q k Ã ¼ 1 2Q k þQ k o À Á in our ELS algorithm. Besides the ELS algorithm, average of recombination fraction between two loci over all reference loci greatly reduces noise of recombination fractions. BATII given in Additional file 2, Appendix A, can be used to estimate frequencies of 8 codominant gamete types in any nature population because it does not require the assumption that the sister gametes have equal frequencies in a population. However, its estimation accuracy is not higher than the first BAT method in F 2 population because sister-gametes really have equal frequencies and two-locus heterozygote types are not useful in the BATII. In a natural population, for example, human population, the frequencies of these gametes are not purely derived from recombination events but may be due to selection, genetic drift, migration and mutation. If, however, sister gametes are found to be equal in statistics, then these frequencies can still be used to inference recombination fractions between loci and recombination inference.