MetaQTL: a package of new computational methods for the metaanalysis of QTL mapping experiments
 JeanBaptiste Veyrieras^{1}Email author,
 Bruno Goffinet^{2} and
 Alain Charcosset^{1}
DOI: 10.1186/14712105849
© Veyrieras et al; licensee BioMed Central Ltd. 2007
Received: 14 November 2006
Accepted: 08 February 2007
Published: 08 February 2007
Abstract
Background
Integration of multiple results from Quantitative Trait Loci (QTL) studies is a key point to understand the genetic determinism of complex traits. Up to now many efforts have been made by public database developers to facilitate the storage, compilation and visualization of multiple QTL mapping experiment results. However, studying the congruency between these results still remains a complex task. Presently, the few computational and statistical frameworks to do so are mainly based on empirical methods (e.g. consensus genetic maps are generally built by iterative projection).
Results
In this article, we present a new computational and statistical package, called MetaQTL, for carrying out wholegenome metaanalysis of QTL mapping experiments. Contrary to existing methods, MetaQTL offers a complete statistical process to establish a consensus model for both the marker and the QTL positions on the whole genome. First, MetaQTL implements a new statistical approach to merge multiple distinct genetic maps into a single consensus map which is optimal in terms of weighted least squares and can be used to investigate recombination rate heterogeneity between studies. Secondly, assuming that QTL can be projected on the consensus map, MetaQTL offers a new clustering approach based on a Gaussian mixture model to decide how many QTL underly the distribution of the observed QTL.
Conclusion
We demonstrate using simulations that the usual model choice criteria from mixture model literature perform relatively well in this context. As expected, simulations also show that this new clustering algorithm leads to a reduction in the length of the confidence interval of QTL location provided that across studies there are enough observed QTL for each underlying true QTL location. The usefulness of our approach is illustrated on published QTL detection results of flowering time in maize. Finally, MetaQTL is freely available at http://bioinformatics.org/mqtl.
Background
In the last two decades, the advent of molecular markers and their use in linkage mapping experiments has tremendously increased the potential of quantitative genetics. Linkage mapping experiments now provide an efficient tool to identify regions of the genome where polymorphism affects the variation of quantitative traits, called Quantitative Trait Loci (QTL). Although a large number of advanced statistical methods have been developed to improve the localization of QTL, the limited number of recombination events available in routinely used pedigree designs for QTL mapping lead essentially to an approximate mapping of the QTL (see for instance [1]). This is mainly due to both a few mating generations and a restricted number of sampled individuals (generally a few hundreds). From results of QTL experiments gathered over a wide range of plant species, [2] have shown that confidence intervals around most likely QTL positions are, on average, approximately 10 cM, which usually includes several hundreds of genes. More recent advents in the area of molecular biology have allowed researchers to carry out positional cloning of QTL (see for instance the review of [3]) but this approach still remains extremely expensive both in terms of time and resources. Also several authors [2, 4] have pointed out that QTL detection is statistically biased both in the true number of QTL, which is underestimated since only QTL with large effects are detected, and in the QTL effects which are over estimated as only significant effects are reported (a phenomenon has commonly referred to as the Beavis effect [5]). Even though QTL mapping experiments must be considered with an awareness of these limitations, they have become commonplace and have greatly improved our knowledge about the genetic component of complex traits.
Since the first publication of a QTL localization using molecular data [6], more and more species and traits have been studied and many of these results has been made available via public databases. One of the main purposes of these databases was to help researchers to compare results from different QTL studies, to study the congruency of QTL locations in order to address the following question: "do QTL identified for a given trait in a population correspond to those detected in other populations ?". In theory one would expect that the variation of a quantitative trait within a species is explained by a finite number of genes. Thus QTL congruency investigation will be a relevant approach to improve knowledge on trait genetics and several publications have pointed out its usefulness [7–12]. Nevertheless, the combination of results from linkage studies can be tedious since, even if several studies focus on the same trait within the same species, family structures, sample sizes, marker maps, or QTL detection methods may differ between studies. Some methods have been recently developed to tackle the issues raised by between QTL studies heterogeneity. Integration of genetic maps and QTL locations by iterative projections on a reference map is now widely used to position both markers and QTL on a single and homogeneous consensus map (see for instance [13]). However this process yields a consensus marker map for which both the statistical properties and biological "reality" can't be clearly assessed, even if a robust ordered marker map was used as reference. [14] proposed an original approach using graph theory to integrate various types of maps (genetic, physical or sequencebased) but it mainly dealt with dissection of marker order inconsistencies between maps. From up to now it seems that there is no efficient methodological framework to build reliable consensus marker maps on which markers and candidate genes from different mapping experiments can be both ordered and positioned (except by merging raw mapping data from multiple populations as proposed by [15] and [16]).
In order to study QTL congruency, [17] proposed an original approach based on a metaanalysis strategy. Metaanalysis, which is mainly used in medical, social, and behavioral sciences, aims to pool results across independent studies in order to combine them in a single result or estimate. The relevance of metaanalysis investigations in genetics and evolution has been discussed and pointed out by several authors in the last decade (see for instance [18–21]). More recently [22] developed another metaanalysis based approach to overcome the betweenstudy heterogeneity and to refine both QTL location and the magnitude of the genetic effects. Yet both the method of [17] and [22] are limited to a small number of underlying QTL positions (from one to four for the former and only one for the later) which is a serious limitation for a whole genome study of QTL congruency. Even if the average number of QTL per experiment is around four in plants [2, 12], one would expect that more than four genes can be involved in the trait variation on a single chromosome.
To remove these impediments we have developed a new 2stage metaanalysis procedure in order to integrate multiple independent QTL mapping experiments. Our aim was to create a global framework to evaluate the homogeneity of both genetic marker and QTL mapping results from literature and public databases. The first part of our metaanalysis procedure consists in building a consensus genetic marker map that takes into account the statistical properties of genetic distance estimates using a Weighted Least Squares (WLS) strategy. Secondly, once the consensus marker map has been built, the QTL locations can be projected on to the map. We also propose a new clustering algorithm based on a Gaussian mixture model in order to identify the number of underlying QTL which best explain the observed distribution of QTL positions in the mapping experiments. As it has been emphasized by [17], the crucial point at this step is to find an unbiased criterion to select the correct number of QTL. In the context of Gaussian mixture, a large variety of model choice criteria have been reported in the literature. We explore, by means of simulations, the properties of some of these criteria for our particular mixture model. These new methods have been implemented into a Java package called MetaQTL. Finally, as an example, we applied our new approach to QTL detection results gathered for flowering time in maize.
Results
Metaanalysis of genetic maps
Input genetic map information
Consider a set of n genetic mapping experiments concerning the same linkage group. These different experiments may involve different kinds of population pedigree. We consider that for each experiment i = 1, ..., n only the estimated distances between ordered markers along the linkage group are available. We use c_{ i }, N_{ i }, M_{ i }to denote the population cross design, the population size and the number or markers on the i^{th} genetic map, respectively. Let's suppose that two markers m_{ j }and m_{ k }have been positioned on the i^{th} map, $\widehat{r}$_{i,jk}stands for the estimated recombination rate between markers m_{ j }and m_{ k }and $\widehat{d}$_{i,jk}= f[$\widehat{r}$_{i,jk}] the corresponding estimated distance, where f is the mapping function which is assumed to be the same in the n mapping experiments (without loss of generality). Applying the classical asymptotic Gaussian distribution of the maximumlikelihood estimation of the parameter, we assume that the $\widehat{r}$_{i,jk}are normally distributed around the true recombination rate r_{i,jk}between markers m_{ j }and m_{ k }with a variance var($\widehat{r}$_{i,jk}) = ${\eta}_{i,jk}^{2}$. This variance ${\eta}_{i,jk}^{2}$ depends on the cross design c_{ i }, the value of r_{i,jk}, the sample size N_{ i }and the amount of information supplied by the marker pair m_{ j }and m_{ k }sampled population (see Additional File 1 for expression of η).
Since mapping functions are generally one to one functions, the functional invariance property of the maximumlikelihood estimate can be applied. Thus $\widehat{d}$_{i,jk}is also normally distributed around the true distance denoted d_{i,jk}= f[r_{i,jk}]. To obtain the variance of $\widehat{d}$_{i,jk}, denoted ${\gamma}_{i,jk}^{2}$, we use the first term of the Taylor expansion of the inverse of the mapping function leading to the approximation:
Now suppose the n experiments are consistent with the following assumptions:

Assumption 1 : they come from independent population samples. This implies that cov($\widehat{r}$_{i,jk}, $\widehat{r}$_{i',jk}) = 0 and cov($\widehat{d}$_{i,jk}, $\widehat{d}$_{i',jk}) = 0 for any pair of markers m_{ j }and m_{ k }which have been mapped in population i and i', i ≠ i' and (i, i') ∈ [1..n]^{2} .

Assumption 2 : there is no interference, i.e in each mapping experiment the recombination events occur independently in each marker interval. This is surely an idealization, but presently, most of the statistical models used to build genetic marker maps are based on this assumption. Thus for a given mapping experiment i, both the ordered marker interval recombination rate and distance estimates are independent, i.e cov($\widehat{r}$_{i,j(j+1)}, $\widehat{r}$_{i,(j+1)(j+2)}) = 0 and cov($\widehat{d}$_{i,j(j+1)}, $\widehat{d}$_{i(j+1)(j+2)}) = 0 for i ∈ [1,..., n] and j ∈ [1,..., M_{ i } 2].

Assumption 3 : the "true" marker order and recombination rate are supposed to be the same in the different populations, i.e r_{i,jk}= r_{i',jk}if markers m_{ j }and m_{ k }have been mapped in population i and i', i ≠ i' and (i, i') ∈ [1..n]^{2}.

Assumption 4 : all the genetic maps are connected. Mathematically, this means that if we consider maps as vertices and common markers as edges, then the corresponding graph is supposed to be connected.
Metaanalysis model
We define $\widehat{D}$ = ($\widehat{d}$_{i,jk}) and Γ = diag(${\gamma}_{i,jk}^{2}$) the vector of ordered marker interval distance estimates and the diagonal terms of the variance covariance matrix of $\widehat{D}$. We assume that a total of M distinct markers have been mapped in the n populations. The aim of the metaanalysis is to combine all the available information on marker order and positions in order to build a consensus linkage group on which the M markers are positioned. To do so we introduce Y = (y_{1},..., y_{ M }) the vector of the "true" positions of these M markers on the consensus linkage group, where the y_{ i }'s can be either positive or negative depending on an arbitrary zeroreference on the chromosome (hereafter we suppose y_{1} = 0). If the n mapping experiments are consistent with the previous assumptions and assuming that the distances on the linkage group are additive we propose to estimate Y by solving the following linear system:
where (j, k) ∈ [1,...,M]^{2}, i ∈ [1,..., n], $\widehat{d}$_{i,jk}is the distance estimate of the interval between marker m_{ j }and m_{ k }consecutive on the i^{th} experiment, y_{ k } y_{ j }is the true distance between these markers, ε_{i,jk}~ $\mathcal{N}$(0, ${\gamma}_{i,jk}^{2}$) is the error term. If assumption 4 holds we are ensured that this system has at least one solution. Applying a classical weighted least squares (WLS) strategy, the optimal solution is the one which minimizes the target function,
Let's introduce the design matrix A such that χ = ^{T}($\widehat{D}$  AY)Γ^{1}($\widehat{D}$  AY). Then the value of Y which minimizes χ is given by:
which is also a maximumlikelihood estimation of Y with variancecovariance matrix given by (^{T}A Γ^{1} A)^{1}. Thus $\widehat{Y}$ gives both the marker positions and the marker order along the consensus linkage group. The goodnessoffit of the model can be evaluated by the means of a chisquare test as χ ~ ${\chi}_{qM+1}^{2}$ where q is the length of the vector $\widehat{D}$, i.e the number of marker intervals over the n experiments. As an illustration, let's consider the following idealized scenario : suppose that the n gathered genetic maps share the same markers, i.e M_{ i }= M for i = 1,..., n. In this simple case the computation of $\widehat{Y}$ is straightforward:
and χ is the sum of M  1 terms each distributed as a chisquare with n  1 degree of freedoms. This is equivalent to for each marker intervals testing if the distances are homogeneous between populations using a classical test of equal means. This trivial example illustrates how our WLS approach can be used to test for the homogeneity of the recombination rate between several mapping experiments. This can be viewed as an alternative to the Mtest devised by [23] when raw data are not available.
Metaanalysis of QTL
Input QTL map information
Suppose that for a given trait a QTL detection has been carried out in the n mapping experiments. The minimal information supplied by the i^{th} QTL experiment consists of a set of estimated positions of the QTL, denoted $\widehat{x}$_{ ij }, and the corresponding proportion of variance explained by each QTL, the rsquares values λ_{ ij }. Here j ∈ [1,..., q_{ i }] and q_{ i }is the number of QTL detected in the i^{th} mapping experiment on the current linkage group (generally q_{ i }= 1 or possibly 2). The confidence intervals (CI) of the $\widehat{x}$_{ ij }'s, denoted v_{ ij }, can also be reported. The construction of the CI may have been performed by different approaches:
When the CI is not available it is possible to obtain an approximation of the CI by applying the empirical formula proposed by [29]. By means of intensive simulations they showed that for either a backcross or a F_{2} population the expected CI, at 95% level, can be expressed as CI(95) ≈ 530/(N λ) where N is the population size and λ the proportion of variance explained by the QTL. More recently [30] have derived simple analytical equations which are in good agreement with the formula of [29].
Whatever the method used to estimate the uncertainty on the QTL locations, we assume that the $\widehat{x}$_{ ij }'s are normally distributed around the true position x_{ ij }of the j^{th} QTL: $\widehat{x}$_{ ij }~ $\mathcal{N}$(x_{ ij }, ${\sigma}_{ij}^{2}$) where ${\sigma}_{ij}^{2}$ is the variance of the estimated position which can be deduced from the confidence interval v_{ ij }. For a CI of β% (β depends on the method used to compute the CI), the standard deviation σ_{ ij }can be estimated as σ_{ ij }= v_{ ij }/(2u_{ β }) where u_{ β }is the doublesided βpercentile of a centered normalized gaussian. This Gaussian approximation based on the classical asymptotic theory has been suggested by [17], even though this is not perfectly correct for QTL with small effects [31].
Furthermore the n QTL mapping experiments are assumed to be consistent with the following assumptions:

Assumption 1 : they are independent. This can be considered as correct when the individuals measured in the different populations have been generated independently. Independence between experiments i and i' means independence between $\widehat{x}$_{ ij }and $\widehat{x}$_{ i'j }.

Assumption 2 : for a given trait there is a finite number of underlying QTL which cosegregate in the mapping experiments: this means that the populations share the same trait determinism with potentially different allelic configurations at the QTL. In other word there is a finite number of true QTL positions on the linkage groups, i.e {x_{ ij }(i, j) ∈ [1,..., n] × [1,..., q_{ i }]} can potentially contain redundancy.
In addition to the two previous assumptions we also assume that the detected QTL locations are independent within experiments. This is not really true when the QTL detection does not properly take into account linked QTL. But with the advent of composite interval mapping strategy [32, 33] multipleQTL model can now be fitted by adding properly chosen cofactors which limit the impact of linkage between QTL on the position estimates. Therefore we assume that $\widehat{x}$_{ ij }and $\widehat{x}$_{ ij' }are independent for all j ≠ j'.
Preprocessing
The first step is to apply our WLS strategy to the n mapping experiments in order to build a consensus linkage group. Then the QTL locations are projected on the consensus linkage group using a simple scaling rule between the original QTL flanking marker interval and the corresponding one on the consensus chromosome. For a given QTL location the new confidence interval (if available) on the consensus linkage group is computed by taking into account the average scaling between the original and the consensus chromosome. This is done by computing the sum over the common marker intervals of the ratio of the interval lengths weighted by the probability that the QTL position lies in this interval. There are two possible strategies to approximate this probability. The first one relies on a rough approximation using a Gaussian distribution around the most likely position $\widehat{x}$_{ ij }of the j^{th} QTL, namely Pr(QTL j in m) = $\frac{{\displaystyle {\int}_{{u}_{m}}^{{u}_{m+1}}\phi [(u{\widehat{x}}_{ij})/{\sigma}_{ij}]du}}{{\displaystyle {\int}_{0}^{L}\phi [(u{\widehat{x}}_{ij})/{\sigma}_{ij}]du}}$ where φ[u]is the density function of a centered normalized Gaussian distribution, m is the index of the marker interval, u_{ m }and u_{m+1}are the absolute positions of the flanking markers on the original map of total length L. If the LOD score profile is available, a more accurate strategy can be applied by substituting φ for the density function which best fits the profile.
The metaanalysis model
The purpose of the QTL metaanalysis is to evaluate, for a given trait, the degree of congruency of the QTL detected in the n mapping experiments. By assuming that there is a finite number of true QTL locations, [17] proposed a clustering based approach to both classify the observed QTL and estimate the positions of the underlying QTL. Their method proceeds by testing all the possible QTL combinations and then choosing the one which maximizes a penalized loglikelihood. Although interesting, this method suffers from a categorical repartition of the QTL in the clusters, which is a limit case of Gaussian mixture models. We propose to adopt a similar clustering strategy but with a more standard Gaussian mixture model which allows QTL to be probabilistically distributed into clusters.
In order to lighten the notation we denote by q the total number of observed QTL locations and we ignore the mapping experiment subscripts so that $\widehat{X}$ = ($\widehat{x}$_{1},..., $\widehat{x}$_{ q }) and Σ = (σ_{1}, ...,σ_{ q }). Then, let's suppose there are K ≥ 1 true QTL located at ${X}^{[K]}=({x}_{1}^{[K]},\mathrm{...},{x}_{K}^{[K]})$ which segregate in at least one of the n QTL mapping experiments. Since the QTL position estimates $\widehat{X}$ are normally distributed around their true positions, the problem of finding the K underlying true positions can be viewed as a Gaussian mixture problem where the variances of each observation are known. Thus the loglikelihood of the observations can be written as follows:
where Θ^{[K]}= (X^{[K]}, Π^{[K]}) denotes the parameters of the model, ${\prod}^{[K]}=({\pi}_{1}^{[K]},\mathrm{...},{\pi}_{K}^{[K]})$ are the mixing proportions, which sum to one, and φ[x] is the density function of a centered normalized Gaussian distribution. We assume without loss of generality that ${x}_{1}^{[K]}<{x}_{2}^{[K]}<\mathrm{...}<{x}_{K}^{[K]}$ and that ${\pi}_{j}^{[K]}$ ≠ 0, for j ∈ [1,..., K]. In other word the distribution of the observed QTL locations is shaped by a mixture density where the components ${x}_{j}^{[K]}$ are the positions of the true QTL on the linkage group and the mixing proportions π_{ j }represent the proportion of QTL related to the j^{th} true QTL which have been detected in the n mapping experiments.
Maximizing 1 can be achieved via a standard EM algorithm [34] by using the following parameter updates (Mstep):
where ${t}_{ij}^{[K]}$ is the conditional probability that $\widehat{x}$_{ i }belongs to the j^{th} metaQTL. This conditional probability ${t}_{ij}^{[K]}$ is obtained by applying a simple Bayes' rule evaluated at the current parameter estimates (Estep):
The EMalgorithm is run until reaching convergence: this yields the maximumlikelihood estimate denoted $\tilde{\Theta}$^{[K]}= ($\tilde{X}$^{[K]}, $\tilde{\prod}$^{[K]}). Finally, once $\tilde{\Theta}$^{[K]}has been obtained the variancecovariance matrix of the parameter estimates, conditionally to the current model, can be computed by applying the Supplemental EM (SEM) strategy proposed by [35].
The problem is that we do not know K, i.e the number of true QTL positions. Since the mixture model of K components is nested into the model with K + 1 components, the likelihood ratio test (LRT) should be suitable. However, as discussed by many authors (see for instance [36, 37]) the LRT statistic does not follow the usual χ^{2} distribution due to testing a null hypothesis on the boundary of the parameter space (i.e the regularity conditions on the loglikelihood do not hold). Another strategy is to use the KullbackLeibler information in order to derive the information criterion which is widely used to select a statistical model. In particular, the KullbackLeibler information can be viewed as a measure of goodnessoffit of a statistical model. Here for a given value of K, minimizing the KullbackLeibler information is equivalent to maximizing the negentropy KL,
where g($\widehat{X}$, Σ) is the the true underlying density function. Thus, from the point of view of the negentropy maximization principle, the goodness of the model can be evaluated by the expected loglikelihood. Note that the negentropy maximization principle naturally leads to the maximization of the loglikelihood. However, the maximized loglikelihood is a naive estimate of the expected loglikelihood: since the same data set $\widehat{X}$ is used for both the estimation of the parameter and the estimation of the expected loglikelihood, L($\widehat{X}$, Σ; Θ^{[K]}) is a biased estimator of the expected loglikelihood. Its bias is defined by,
and the use of L($\widehat{X}$, Σ; $\tilde{\Theta}$^{[K]})  B is justified as an estimate of KL. There are different strategies to estimate this bias and several information based criteria have been reported in the mixture model literature in order to tackle the issue raised by choosing the number of components (see Additional File 2). In the next Simulation section, we propose to evaluate the ability of some of these information based criteria to determine the optimal number of QTL.
Simulation study
For the sake of concision, in this section we only present simulations for the QTL metaanalysis (simulations for the metaanalysis of genetic maps are described in Additional File 3). We assume that the complexity which shapes the distribution of the observed QTL along the chromosome can be represented by our mixture model. In order to explore mixture configurations which are realistic we have assumed that the QTL effects have a Lshaped distribution (i.e most of the detected QTL in mapping experiments have a small effect and only a few show a strong effect, or in other words, most of the detected QTL have large confidence intervals). Consequently this implies that Σ^{1}, the inverse of the QTL standard deviations, has also a Lshaped distribution (i.e the smaller the effect of the QTL the larger the confidence interval of the estimated QTL position). Then for a given value of the number of true QTL, K, we randomly generated configurations as follows:
 1.
Draw Σ from a inverse gamma distribution (this simply mimics a Lshaped distribution).
 2.
Generate the mixing proportions by choosing them over the discrete uniform [0.1, 0.9] distribution subject to constraint $\sum _{k=1}^{K}{\pi}_{k}}=1$.
 3.
Draw from a multinomial, with frequencies equal to the mixing proportions, the origins of the q observed QTL Z = (z_{1},...,z_{ q }) where z_{ ij }= 1 if the i^{th} observed QTL belongs to the j^{th} true QTL, 0 otherwise.
 4.
Generate the true QTL positions, X = (x_{1},..., x_{ k }), subject to constraint x_{ k }+ τ_{ min }<x_{k+1}<x_{ k }+ τ_{ max }where τ_{ min }and τ_{ max }are defined so that the distance between x_{ k }and x_{k+1}lies between δ_{ min }and δ_{ max }. The distance δ is defined as the mahalanobis distance between x_{ k }and x_{k+1}: $\delta =\sqrt{\frac{{({x}_{k}{x}_{k+1})}^{2}}{{a}_{k}^{2}+{a}_{k+1}^{2}}}$ where ${a}_{k}=({\displaystyle \sum _{i=1}^{q}{z}_{ik}{\sigma}_{i}})/({\displaystyle \sum _{i=1}^{q}{z}_{ik}})$ is the average standard deviation for the k^{th} true QTL. This measures the separation between consecutive true QTL relatively to the precision of the experiments: δ ≤ 2 corresponds to tightly or moderately separated QTL, while δ ≥ 3 corresponds to well separated QTL.
 1.
Draw a sample $\widehat{X}$ of size q.
 2.
Run the EMalgorithm to obtain $\tilde{\Theta}$^{[K]}for K = 1,..., q.
 3.
Choose the best model according to each criterion.
Since the goal of QTL metaanalysis is to obtain a better predictive inference of the true QTL locations we have compared the two alternative strategies:

Strategy 1 : choose the model with as many true QTL as the number of observed QTL. It is the naive model, $\overline{x}$_{ i }(1) = $\widehat{x}$_{ i }

Strategy 2 : choose the best model K according to the model choice criterion, ${\overline{x}}_{i}(2)={\displaystyle {\sum}_{j=1}^{K}{t}_{ij}^{[K]}}{\tilde{x}}_{j}^{[K]}$
For each strategy s = 1, 2, the measure of performance used was the mean squared error of prediction defined as follows:
Absolute values of these MSEP are not of interest here because our goal is comparison of strategies; hence, we consider the ratios MSEP(2)/MSEP(1) for 5 different information based criteria:

AIC = 2L($\widehat{X}$, Σ; $\tilde{\Theta}$^{[K]}) + 2ν

AIC_{ c }= 2L($\widehat{X}$, Σ; $\tilde{\Theta}$^{[K]}) + 2ν + $\frac{2\nu (\nu +1)}{q\nu 1}$

AIC3 = 2L($\widehat{X}$, Σ; $\tilde{\Theta}$^{[K]}) + 3ν

BIC = 2L($\widehat{X}$, Σ; Θ) + ν log(q)

EIC ≈ AIC  K + 1, which was obtained by means of simulations (data not shown).
where ν = 2K  1 is the number of free parameters of the model and q the number of observed QTL along the chromosome (see Additional file 2 for theoretical details on each above criterion).
Model choice criteria for the metaanalysis of flowering time QTL on maize chromosome 8
K  Model Choice Criterion  

AIC  AIC_{ c }  AIC3  BIC  
1  1096.32  1088.94  1088.32  1084.11 
2  497.22  490.52  491.22  488.06 
3  139.15  133.79  135.15  133.05 
4  34.73  31.53  32.73  31.67 
5  0.00  0.00  0.00  0.00 
6  4.00  8.50  6.00  7.05 
7  8.00  18.70  12.00  14.11 
8  12.00  31.17  18.00  21.16 
9  16.00  46.75  24.00  28.21 
10  19.54  66.33  29.54  34.81 
34  51.42  43.92  76.42  89.58 
Implementation
Parameter estimates of the best metaanalysis model for flowering time on maize chromosome 8
QTL  Position $\tilde{X}$  Weight $\tilde{\prod}$  Mahalanobis distance to next QTL  95% CI 

1  14.6  0.06  6.21  11.7 
2  75.4  0.35  1.26  6.2 
3  89.5  0.44  2.64  3.8 
4  114.5  0.07  5.20  11.1 
5  165.2  0.09    13.9 

Experiment file: stores descriptions on mapping experiments (name, population type and size, reference, ...).

Genetic map directory : contains one file per input marker map. Each file contains the corresponding genetic marker map.

QTL map directory : contains one file per QTL mapping experiments. For each QTL mapping experiment the file provides the properties of each detected QTL (trait, position, confidence interval, rsquare, ...).

Trait ontology file: describes how the traits are related together using a simple hierarchical relationship scheme. This information can then be used to group the QTL according to the ontology in subsequent analyses.
Once the database created, MetaQTL first checks the input data files and then summarizes their content into a set of XML files. All the programs of MetaQTL use these XML files as inputs. Utilities are provided to convert them in various plain text file formats if required (for more details on using MetaQTL see the user manual in Additional File 5).
Application
Recently, [12] made a bibliographical review of QTL studies relative to 4 traits related to flowering time in maize: days to pollen shed (DPS), silking date (SD), plant height (HT) and leaf number (LN). From the 22 QTL studies they reported, we excluded 6 experiments for which QTL detection was based on ANOVA with a low density of markers and 2 other for which it was not possible to get exact information on either the genetic linkage map or the QTL locations. In addition to these 15 mapping experiments we considered 3 other recent experiments (details of these 18 QTL studies are given in Additional File 6). We focus here on chromosome 8 and we present results by using for each step of the metaanalysis the corresponding program name of MetaQTL.
Result of InfoMap
Among the 153 distinct markers which have been positioned over the 18 mapping experiments on the chromosome 8, only 53 markers are observed in at least two different mapping experiments. We restricted the metaanalysis to these 53 markers. Only one order inconsistency was detected between [38] and [39] concerning markers umc89a and umc12a. As in [38]umc12a is very close to umc89a (less than 2 cM) we have decided to ignore this marker in this mapping experiment. Over the 18 mapping experiments the mean interval distance was about 18.9 cM with an average of 8.7 markers per mapping experiment and it existed at least one common marker path which connected all the mapping experiments together (insuring that the WLS can be applied).
Result of ConsMap
Result of QTLProj
From the 18 QTL studies we projected 34 QTL on the consensus chromosome 8. Among these 34 QTL, 16 (47%) are related to SD, 10 (29%) to DPS and 8 (24%) to HT. The distribution of the rsquare values clearly shows a Lshape: 75% of the QTL have rsquare values lower than 12%. For 17 QTL a CI was reported (build from a 1LOD support) from which we computed the standard deviations assuming that a 1LOD support corresponds in fact to a 90% CI. For the other QTL we derived the standard deviations from the formula proposed by [29]. Then models from K = 1 to K = 10 QTL were considered and their parameters estimated by applying our EMalgorithm.
Result of QTLClust
Secondly, two closely linked QTL (2 and 3) contribute to 75% of the reported QTL. This is strongly consistent with the knowledge of this region where a major QTL, vgt1, is tightly linked to another QTL, vgt2 [43, 44]. It is worth noting that the confidence interval of the QTL corresponding to vgt1 (around 3.8 cM) encompasses a marker interval of approximately 2 cM (at the left of the marker umc89a) in which this QTL has been finely mapped by [44] using NIL lines (result not included in our analysis). This congruency lends further credence to the metaanalysis approach.
Discussion and conclusion
Nowadays more and more studies concerning QTL detection are available via public databases and the number of articles dealing with the comparison and/or integration of these results increases [12, 45–47]. We believe that our metaanalysis procedure can contribute to facilitate the elaboration of such syntheses by providing a simple statistical framework to establish consensus models for both linkage maps and QTL locations.
First, the WLS strategy we proposed is a step forward to integrate several genetic marker maps. Contrary to iterative projection procedures, this approach provides a wellestablished statistical machinery (WLS) to assess the goodnessoffit of the consensus model. It can also be used to test the homogeneity of the distance estimates among different mapping experiments. This can be usefull to investigate the possible variation of recombination rate among genotypes (as reported by [41]). As pointed out in the application, this method can suffer from the lack of knowledge about the effective precision on the marker interval distances in each individual mapping experiment due to possible missing data and/or the type of scoring of individual markers (codominance vs dominance). This could be improved by asking researchers to supply the variance estimates of the marker interval distances when they submit their results to a public database. These variance estimates could be used to improve the weight factors in the WLS model. Also, as sometimes robust framework maps are available in the literature or via public databases, the program ConsMap in MetaQTL offers the possibility to fix a genetic map as a reference (i.e for which the distances between ordered markers are assumed to be the "actual" distances). In this case, only the positions of the markers which are not reported on the reference have to be estimated.
Secondly, for the QTL metaanalysis itself, the Gaussian mixture model used to fit the distribution of the observed QTL locations on the chromosome provides a wellstudied statistical inference technique. In this modelbased clustering, each "true" QTL is mathematically represented by the Gaussian distribution of its detected positions, which leads to a probabilistic classification of the observed QTL. Contrary to [17] who developed a specific model choice criterion, our simulation results show that AIC gives relatively good performances in our QTL metaanalysis framework. This difference, with regard to the conclusions of [17], may be explained by their discrete formulation of the problem (recall that, instead of using a usual Gaussian mixture likelihood to evaluate the probability of the data, they assumed that the observations could be categorically assigned to the mixture components). Parameter estimates obtained by this approach were not really the maximumlikelihood estimates of the underlying mixture model. This may have added a bias in the evaluation of the AIC criterion, which could explain the bad performances of AIC in their simulations.
Thus, our mixturemodelling approach makes it possible to go beyond the limits encountered by [17]: the Akaike like criterion they proposed was limited to models from 1 to 4 QTL. As a consequence, [47] who used the method of [17], was obliged to break chromosomes on distinct segments to carry out the metaanalysis. This subjective division of the chromosome can now be avoided thanks to our method. Simulations have shown that the ratio between the number of observed QTL and the number of "true" QTL is one of the main limiting factor. The number of "true" QTL which can be assessed by the metaanalysis must be reasonable compared to the number of observed QTL (at least between 5 or 10 observed QTL per actual location). Note that this also depends on the distance between true QTL. But since there are more and more QTL locations reported for a given trait and since the real number of distinct QTL locations which can be detected with usual experimental designs is limited (only QTL with relatively large effects can be found), we assume that in many cases the ratio between observed and "true" QTL locations will steadily increase and should generally be reasonable. It is worth noting that, provided that the number of observed QTL is appropriate, the metaanalysis is able to separate "true" QTL locations even if they are closely linked (as illustrated in the application with vgt1 and vgt2, and the consistency of the vgt1 estimated position with fine mapping result of [44], result not included in the metaanalysis).
The ultimate step toward a more accurate identification of QTL relies on finding the underlying genes. Up to now, the majority of QTL isolated in plants have been cloned via positional cloning (see for instance [48]). However positional cloning of QTL is quite expensive both in terms of time and resources due to the necessity to screen recombinant individuals within large population (typically several hundreds) and to characterize these individuals with a very dense set of molecular markers. As an alternative and thanks to the advent of structural and functional genomics, QTL can also be resolved through association mapping of candidate genes. Candidate genes identification is based on a assumption that the polymorphism of the gene is associated with the variation of the trait of interest. Both function and mapping information have to be crossed to establish this assumption. The function of the gene may have been determined in the species of interest, based for instance on mutant analysis. More often, function is hypothesized based on sequence homology with genes the function of which has been established in model species, including possible positional cloning of QTL. Gene mapping information may have been obtained in the species of interest, but may have been also inferred from synteny based projections, as illustrated by [12] for rice to maize. Relevancy of the colocalization between QTL and candidate genes crucially depends on the confidence interval of the QTL positions. For this purpose the reduction of the confidence interval of the QTL is an important goal [2]. The ability of our method to reduce the QTL confidence interval by taking advantage of pooling QTL results could contribute in an increased resolution in selecting candidate genes. It is worth noting that candidate genes are generally mapped on a framework map used as reference for the species of interest (e.g. in maize [49]), while the QTL detections are carried out using specific populations (generally obtained by crossing parents contrasted for the trait(s) of interest).
Therefore, the selection of candidate genes which colocalize with QTL depends also on the process used to merge these different maps. Up to now, no statistical method had been proposed to combine candidate genes and QTL mapped in independent experiments, so we think that our WLS strategy should increase the precision of the integration of candidate gene mapping information.
Finally once candidate genes have been selected and their different haplotypes defined, association studies can be carried out. The identification of a statistically significant association between haplotype variation at a candidate gene and the target trait gives further credence on the role of this gene in the trait variation. Since the last 5 years more and more association studies have been reported in plants [50]. It would be interesting to integrate these new results into a global metaanalysis framework. Further developments are needed to combine onto a synthetic model the different scale of mapping: from linkage mapping (QTL) to fine mapping (association studies).
Availability and requirements

Project name: MetaQTL

Project home page: http://bioinformatics.org/mqtl

Operating system: Platform independent

Programming language: Java

Other requirements: Java 5.0

License: GNU GPL
Declarations
Acknowledgements
This work was supported by INRA grant and the authors would like to thank F. Chardon, from INRA, for his help on the gathering process of QTL results relative to flowering time in maize. We would like also to thank C. Lepoittevin for her help in testing and debugging MetaQTL. We also thank Graham Coop for his help to improve the style of the manuscript.
Authors’ Affiliations
References
 Kearsey M, Pooni HS: The genetical analysis of quantitative traits. London: Chapman and Hall; 1996.View ArticleGoogle Scholar
 Kearsey M, Farquhar A: QTL analysis in plants; where are we now? Heredity 1998, 80: 137–142. 10.1046/j.13652540.1998.00500.xView ArticlePubMedGoogle Scholar
 Paran I, Zamir D: Quantitative traits in plants: beyond the QTL. Trends Genet 2003, 19(6):303–306. 10.1016/S01689525(03)001173View ArticlePubMedGoogle Scholar
 Xu S: Theoretical basis of the Beavis effect. Genetics 2003, 165(4):2259–68.PubMed CentralPubMedGoogle Scholar
 Beavis W: The power and deceit of QTL experiments: lessons from comparative QTL studies. In Proceedings of the FortyNinth Annual Corn and Sorghum Industry Research Conference. Washington, DC: American Seed Trade Association; 1994:250–266.Google Scholar
 Paterson AH, Lander ES, Hewitt JD, Peterson S, Lincoln SE, Tanksley SD: Resolution of quantitative traits into Mendelian factors by using a complete linkage map of restriction fragment length polymorphisms. Nature 1988, 335: 521–529. 10.1038/335721a0View ArticleGoogle Scholar
 Lin YR, Schertz KF, Paterson AH: Comparative analysis of QTLS affecting plant height and maturity across the poaceae, in reference to an interspecific sorghum population. Genetics 1995, 141: 391–411.PubMed CentralPubMedGoogle Scholar
 Paterson AH, Lin YR, Schertz KF, Doebley JF, Pinson SRM, Liu SC, Stansel JW, Irvine JE: Convergent domestication of cereal crops by independent mutations at corresponding genetic loci. Science 1995, 269: 1714–1718. 10.1126/science.269.5231.1714View ArticlePubMedGoogle Scholar
 Keightley P, Knott S: Testing the correspondence between map positions of quantitative trait loci. Genet Res Camb 1999, 74: 323–328.View ArticleGoogle Scholar
 Mihaljevic R, Utz HF, Melchinger AE: Congruency of quantitative trait loci detected for agronomic traits in testcrosses of five populations of european maize. Crop Sci 2004, 44: 114–124.View ArticleGoogle Scholar
 Khatkar M, Thomson P, Tammen I, Raadsma H: Quantitative trait loci mapping in dairy cattle: review and metaanalysis. Genet Sel Evol 2004, 36: 163–190. 10.1051/gse:2003057PubMed CentralView ArticlePubMedGoogle Scholar
 Chardon F, Virlon B, Moreau L, Falque M, Joets J, Decousset L, Murigneux A, Charcosset A: Genetic architecture of flowering time in maize as inferred from quantitative trait loci metaanalysis and synteny conservation with the rice genome. Genetics 2004, 168: 2169–2185. 10.1534/genetics.104.032375PubMed CentralView ArticlePubMedGoogle Scholar
 Arcade A, Labourdette A, Falque M, Mangin B, Chardon F, Charcosset A, Joets J: BioMercator: integrating genetic maps and QTL towards discovery of candidate genes. Bioinformatics 2004, 20(14):2324–2326. 10.1093/bioinformatics/bth230View ArticlePubMedGoogle Scholar
 Yap I, Schneider D, Kleinberg J, Matthews D, Cartinhour S, McCouch S: A graphtheoretic approach to comparing and integrating genetic, physical and sequencebased maps. Genetics 2003, 165: 2235–2247.PubMed CentralPubMedGoogle Scholar
 Stam P: Construction of integrated genetic linkage maps by means of a new coputer package: JoinMap. Plant J 1993, 3: 739–744.View ArticleGoogle Scholar
 Schiex T: Carthagene: constructing and joining maximum likelihood maps. ISMB 1997, 5: 258–267.PubMedGoogle Scholar
 Goffinet B, Gerber S: Quantitative trait loci: a metaanalysis. Genetics 2000, 155: 463–473.PubMed CentralPubMedGoogle Scholar
 Allison DB, Heo M: Metaanalysis of linkage data under worstcase conditions: a demonstration using the human OB region. Genetics 1998, 148: 859–865.PubMed CentralPubMedGoogle Scholar
 Van Zandt P, Mopper S: A metaanalysis of adaptive deme formation in phytophagous insect populations. Am Nat 1998, 152: 595–604. 10.1086/286192View ArticleGoogle Scholar
 Vollestad LA, Hindar K, Moller AP: A metaanalysis of fluctuating asymmetry in relation to heterozygosity. Heredity 1999, 83: 206–218. 10.1046/j.13652540.1999.00555.xView ArticlePubMedGoogle Scholar
 Lohmueller KE, Pearce CL, Pike M, Lander ES, Hirschhorn JN: Metaanalysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat Genet 2003, 33: 177–182. 10.1038/ng1071View ArticlePubMedGoogle Scholar
 Etzel C, Guerra R: Metaanalysis of geneticlinkage of quantitive trait loci. Am J Hum Genet 2003, 71: 56–65. 10.1086/341126View ArticleGoogle Scholar
 Morton NE: The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am J Hum Genet 1956, 8(2):80–96.PubMed CentralPubMedGoogle Scholar
 Conneally P, Edwards J, Kidd K, Lalouel J, Morton N: Reports of the committee methods of linkage analysis and reporting. Cytogent Cell Genet 1985, 40: 356–359.View ArticleGoogle Scholar
 Lander ES, Botstein D: Mapping Mendelien factors underlying quantitative traits using RFLP linkage maps. Genetics 1989, 121: 185–199.PubMed CentralPubMedGoogle Scholar
 Dupuis J, Siegmund D: Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 1999, 151: 373–386.PubMed CentralPubMedGoogle Scholar
 Darvasi A, Weinreb A, Minke V, Weller J, Soller M: Detecting markerQTL linkage and estimating QTL gene effect and map location using a saturated map. Genetics 1993, 134: 943–951.PubMed CentralPubMedGoogle Scholar
 Kao C, Zeng Z, Teasdale R: Multiple Interval Mapping for Quantitative Trait Loci. Genetics 1999, 152: 1203–1216.PubMed CentralPubMedGoogle Scholar
 Darvasi A, Soller M: A simple method to calculate resolving power and confidence interval of QTL map location. Behav Genet 1997, 27(2):125–132. 10.1023/A:1025685324830View ArticlePubMedGoogle Scholar
 Visscher PM, Goddard ME: Prediction of the confidence interval of quantitative trait loci. Behavior Genetics 2004, 34(4):477–482. 10.1023/B:BEGE.0000023652.93162.e8View ArticlePubMedGoogle Scholar
 Mangin B, Goffinet B, Rebaï A: Constructing confidence intervals for QTL location. Genetics 1994, 138: 1301–1308.PubMed CentralPubMedGoogle Scholar
 Jansen RC: Interval mapping of multiple quantitative trait loci. Genetics 1993, 135: 205–211.PubMed CentralPubMedGoogle Scholar
 Zeng ZB: Precision mapping of quantitative trait loci. Genetics 1994, 136(4):1457–1468.PubMed CentralPubMedGoogle Scholar
 Dempster A, Laird N, Rubin D: Maximum Likelihood from incomplete data via the EM algorithm (with discussion). J Roy Stat Soc 1977, B 39: 1–38.Google Scholar
 Meng X, Rubin DB: Using EM to obtain asymptotic variancecovariance matrices: the SEM algorithm. J Am Stat Assoc 1991, 86: 899–909. 10.2307/2290503View ArticleGoogle Scholar
 Aitkin M, Rubin D: Estimation and Hypothesis Testing in Finite Mixture Models. Journal of the Royal Statistical Society 1985, 47: 67–75.Google Scholar
 Titterington D, Smith A, Markov U: Statistical Analysis of Finite Mixture Distributions. New York: John Wiley and Sons; 1985.Google Scholar
 Poupard B, Moreau L, Charcosset A: Analyse de 1'épistatsie entre QTL pour 3 caractères agronomiques chez le maïs. Tech rep, INRA 2001.Google Scholar
 Mechin V, Argillier O, Hebert Y, Guingo E, Moreau L, Charcosset A, Barriere Y: Genetic analysis and QTL mapping of cell wall degistibility and lignification in silage maize. Crop Sci 2001, 41: 690–697.View ArticleGoogle Scholar
 Stadler LJ: The Variability of Crossing Over in Maize. Genetics 1925, 11: 1–37.Google Scholar
 Williams CG, Goodman MM, Stuber CW: Comparative recombination distances among Zea mays L. inbreds, wide crosses and interspecific hybrids. Genetics 1995, 141(4):1573–1581.PubMed CentralPubMedGoogle Scholar
 Ji Y, Stelly DM, De Donato M, Goodman MM, Williams CG: A candidate recombination modifier gene for Zea mays L. Genetics 1999, 151(2):821–830.PubMed CentralPubMedGoogle Scholar
 Vladutu C, McLaughlin J, Phillips RL: Fine mapping and characterization of linked quantitative trait loci involved in the transition of the maize apical meristem from vegetative to generative structures. Genetics 1999, 153(2):993–1007.PubMed CentralPubMedGoogle Scholar
 Salvi S, Tuberosa R, Chiapparino E, Maccaferri M, Veillet S, van Beuningen L, Isaac P, Edwards K, Phillips RL: Toward positional cloning of Vgt1, a QTL controlling the transition from the vegetative to the reproductive phase in maize. Plant Mol Biol 2002, 48(5–6):601–13. 10.1023/A:1014838024509View ArticlePubMedGoogle Scholar
 Khavkin E, Coe EH: Mapped genomic locations for developmental functions and QTLs reflect concerned groups in maize (Zea mays L.). Theor Appl Genet 1997, 95: 343–352. 10.1007/s001220050569View ArticleGoogle Scholar
 Khavkin E, Coe EH: The major quantitative trait loci for plant stature, development and yield are general manifestations of developmental gene clusters. Maize Newslett 1998, 72: 60–66.Google Scholar
 Chardon F, Hourcade D, Combes V, Charcosset A: Mapping of a spontaneous mutation for early flowering time in maize highlights contrasting allelic series at twolinked QTL on chromosome 8. Theor Appl Genet 2005, 112: 1–11. 10.1007/s001220050050zView ArticlePubMedGoogle Scholar
 Salvi S, Tuberosa R: To clone or not to clone plant QTLs: present and future challenges. Trends Plant Sci 2005, 10(6):297–304. 10.1016/j.tplants.2005.04.008View ArticlePubMedGoogle Scholar
 Falque M, Decousset L, Dervins D, Jacob AM, Joets J, Martinant JP, Raffoux X, Ribiere N, Ridel C, Samson D, Charcosset A, Murigneux A: Linkage mapping of 1454 new maize candidate gene Loci. Genetics 2005, 170(4):1957–1966. 10.1534/genetics.104.040204PubMed CentralView ArticlePubMedGoogle Scholar
 Gupta PK, Rustgi S, Kulwal PL: Linkage disequilibrium and association studies in higher plants: present status and future prospects. Plant Mol Biol 2005, 57(4):461–485. 10.1007/s111030050257zView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.