InterferenceAnalyzer: Tools for the analysis and simulation of multi-locus genetic data
- Lalitha Viswanath^{1} and
- Elizabeth A Housworth^{2}
DOI: 10.1186/1471-2105-6-297
© Viswanath and Housworth; licensee BioMed Central Ltd. 2005
Received: 19 August 2005
Accepted: 12 December 2005
Published: 12 December 2005
Abstract
Background
Good statistical models for analyzing and simulating multilocus recombination data exist but are not accessible to many biologists because their use requires reasonably sophisticated mathematical and computational implementation. While some labs have direct access to statisticians or programmers competent to carry out such analyses, many labs do not. We have created a platform independent application with an easy-to-use graphical user interface that will carry out such analyses including the simulations needed to bootstrap confidence intervals for the parameters of interest. This software should make multi-locus techniques accessible to labs that previously relied on less powerful and potentially statistically confounded single interval or double interval techniques.
Results
We introduce InterferenceAnalyzer, an implementation with a user-friendly graphical interface incorporating previously developed algorithms for the analysis and simulation of multilocus recombination data. We demonstrate the use and features of the program with an example of multilocus tetrad data from the mustard plant, Arabidopsis thaliana, and the yeast, Saccharomyces cerevisiae.
Conclusion
InterferenceAnalyzer provides easy access to the powerful and appropriate statistical tools for the multi-locus analysis of genetic data.
Background
The placement of crossovers along the tetrad, however, often does show crossover interference; that is, a crossover discourages another one from occurring nearby. Crossover interference has been observed in many organisms including fruit flies [4–7], yeast [2, 8, 9], bread mold [4, 10], mice [11], humans [12, 13], and green plants such as Arabidopsis [14, 15]. The only successful statistical model for crossover interference is the counting or Chi-Square model whose mathematical formulation dates back to Payne in 1956 [16] and which was given an elegant formulation in a text of Bailey as the segmental calculus in 1961 [17]. If the crossovers were distributed at random, the spacing between them would be exponential, which is equivalent to the scaled Chi-Square distribution $\frac{1}{2}\chi_{2}^{2}$. If the spacing between crossovers is the sum of two exponential random variables, then the distribution is the scaled Chi-Square distribution $\frac{1}{4}\chi_{4}^{2}$. In general, if the spacing between crossovers is the sum of m + 1 exponential random variables, then the distribution is $\frac{1}{2(m+1)}\chi_{2(m+1)}^{2}$. The model gained biological credibility when Foss et al. [4] proposed that the double strand breaks that initiate recombination events were distributed at random but only every (m + 1)^{st} one was resolved as a crossover, the intervening ones being resolved as noncrossovers (i.e., simple gene conversions unaccompanied by crossing over). The number of noncrossovers between pairs of crossovers, m, is known as the interference parameter. The counting model has been shown repeatedly to fit genetic data well, both statistically and graphically, and provides a substantially better fit than other statistical models of interference [4, 6, 7, 11, 12].
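The effect of the interference parameter on crossover spacing can be illustrated with a small simulation. The sketch below is not taken from InterferenceAnalyzer; it assumes spacings are rescaled to have mean 1 (the scaling implied by the factor 1/(2(m+1)) above), and the function name `crossover_spacings` is hypothetical. It draws each spacing as the sum of m + 1 exponential variables and prints the sample mean and variance, which should be close to 1 and 1/(m + 1): larger m gives more regular spacing.

```python
import random
import statistics


def crossover_spacings(m, n, rng):
    """Draw n inter-crossover spacings, each the sum of m + 1 exponentials.

    Each exponential has mean 1 / (m + 1), so every spacing has mean 1,
    which corresponds to the scaled chi-square (1 / (2(m+1))) * chi2_{2(m+1)}.
    """
    rate = m + 1  # rate of the underlying randomly placed events
    return [sum(rng.expovariate(rate) for _ in range(m + 1)) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the run is repeatable
for m in (0, 1, 4):
    spacings = crossover_spacings(m, 20000, rng)
    # The mean stays near 1 while the variance shrinks roughly like 1 / (m + 1).
    print(m, round(statistics.mean(spacings), 2), round(statistics.variance(spacings), 2))
```

With m = 0 the spacings are plain exponentials (no interference); increasing m keeps the average spacing fixed but makes very short and very long gaps less likely.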
The mathematical details of the use of the segmental calculus for analyzing tetrad data under the counting model, and of the extension of the counting model to include an independent subset of crossovers not subject to interference (which provides a better fit to data from Arabidopsis, humans, and yeast), can be found in [7, 14]. The basic idea is to use matrices to keep track of the number of noncrossovers to the left (rows) and to the right (columns) of the first and last crossovers in the interval, respectively. The dimensions of the matrices required in the analysis are (m + 1) × (m + 1) where m is the interference parameter. The estimate of m is chosen to maximize the likelihood function of the data.
Calculating the likelihood function involves summing over all possible patterns for the relative positions of the crossovers among the noncrossovers, which is accomplished by the multiplication of matrices that are determined for each interval and each tetrad pattern (parental ditype, tetratype, and nonparental ditype). In the case of the extended model, we also have to sum over all the possibilities for the number of crossovers that are non-interfering. The estimates of m (the interference parameter) and p (the proportion of crossovers not subject to interference) are chosen jointly to maximize the likelihood function of the data. Since the interference parameters in some organisms can be quite large, the numerical optimization for either model can be quite time-consuming. We save some computational time by using the formula of Perkins [10] for estimating the intermarker distances rather than using maximum likelihood to estimate these values. For most practical applications, the maximum likelihood estimates and the Perkins estimates will be close.
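The intermarker distance estimate mentioned above can be sketched in a few lines. The source does not reproduce the formula, so the version below is an assumption: it uses the expression usually attributed to Perkins, d = (T + 6·NPD) / (2·(PD + T + NPD)), with d in Morgans, and the function name `perkins_distance` and the counts are hypothetical.

```python
def perkins_distance(pd, t, npd):
    """Map distance in Morgans from tetrad counts: (T + 6 NPD) / (2 * total).

    pd, t, npd are the numbers of parental ditype, tetratype, and
    nonparental ditype tetrads observed for one marker interval.
    """
    total = pd + t + npd
    if total == 0:
        raise ValueError("no tetrads scored for this interval")
    return (t + 6 * npd) / (2 * total)


# Hypothetical counts for one interval: 80 PD, 18 T, 2 NPD.
print(perkins_distance(80, 18, 2))  # 0.15
```

Because this is a closed-form expression in the three tetrad counts, it avoids adding one more parameter per interval to the numerical optimization over m and p.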
Implementation
InterferenceAnalyzer is written in Java. The original source code and executable .jar files are available for Windows, Linux, and MacOS. The application is also available as a Windows executable. The source code, executables, sample data sets, and sample results are available at [18].
Results
We demonstrate how to use the software to analyze a specific dataset, use simulations to give confidence intervals for parameter estimates and assess the significance of the fit of the extended counting model, and discuss the relative speed of our software compared to comparable SAS code.
Raw data analysis
Sample Data. The first 6 lines of two possible formats for the sample data file including the parental marker values and the scoring of the first tetrad. In the second data set, the first tetrad could not be properly scored at the second marker. Any tetrad with any mis-scored marker or any gene conversion at a marker is discarded from the analysis.
Sample Data | | | | | Sample Data | | | | |
---|---|---|---|---|---|---|---|---|---|
Parent 1: | - | - | + | - | Parent 1: | ade5,7 | URA3 | KAN | lys5 |
Parent 2: | + | + | - | + | Parent 2: | ADE5,7 | ura3 | kan | LYS5 |
First Tetrad | + | - | + | - | First Tetrad | ADE5,7 | xxx | kan | LYS5 |
 | - | - | - | + | | ade5,7 | xxx | KAN | lys5 |
 | - | + | + | - | | ade5,7 | xxx | KAN | lys5 |
 | + | + | - | + | | ADE5,7 | xxx | kan | LYS5 |
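To make the scoring in the sample data concrete, the following sketch classifies each adjacent marker interval of one tetrad as parental ditype (PD), tetratype (T), or nonparental ditype (NPD), and discards the tetrad if any marker is unusable. It is an illustration only: the data are held in Python lists rather than read from the software's file format, and the function names `classify_interval` and `classify_tetrad` are hypothetical.

```python
def classify_interval(parent1, parent2, spores, i, j):
    """Classify markers i and j of one tetrad as 'PD', 'T', or 'NPD'.

    Returns None when the tetrad should be discarded: an allele that matches
    neither parent (mis-scored) or a marker not segregating 2:2 (conversion).
    """
    for k in (i, j):
        alleles = [s[k] for s in spores]
        if any(a not in (parent1[k], parent2[k]) for a in alleles):
            return None
        if alleles.count(parent1[k]) != 2:
            return None
    parental = {(parent1[i], parent1[j]), (parent2[i], parent2[j])}
    recombinant = sum((s[i], s[j]) not in parental for s in spores)
    return {0: "PD", 2: "T", 4: "NPD"}.get(recombinant)


def classify_tetrad(parent1, parent2, spores):
    """Classify every adjacent interval; None if any marker is unusable."""
    classes = [classify_interval(parent1, parent2, spores, k, k + 1)
               for k in range(len(parent1) - 1)]
    return None if None in classes else classes


# First sample data set from the table above.
p1 = ["-", "-", "+", "-"]
p2 = ["+", "+", "-", "+"]
tetrad = [list("+-+-"), list("---+"), list("-++-"), list("++-+")]
print(classify_tetrad(p1, p2, tetrad))  # ['T', 'T', 'PD']

# Second sample data set: the second marker is scored 'xxx', so it is discarded.
q1 = ["ade5,7", "URA3", "KAN", "lys5"]
q2 = ["ADE5,7", "ura3", "kan", "LYS5"]
bad = [["ADE5,7", "xxx", "kan", "LYS5"], ["ade5,7", "xxx", "KAN", "lys5"],
       ["ade5,7", "xxx", "KAN", "lys5"], ["ADE5,7", "xxx", "kan", "LYS5"]]
print(classify_tetrad(q1, q2, bad))  # None
```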
The file containing the data is uploaded to the software using the "Load File" button on the "Analyze Raw Data" tab of the software. The user may decide to analyze the data only under the original counting model, only under the extended model which allows for a portion of the crossovers to be free from interference, or under both models by checking the appropriate buttons. After the user clicks on the "Analyze" button, a progress bar displays. The progress bar allows the user to know that the program is running but is not a good measure of the time remaining because it is linked to the current value of the interference parameter, m, which is allowed to run from 0 to 20. It takes much longer to calculate the likelihood for larger values of m than for smaller ones but the program terminates as soon as the peak of the likelihood function has been reached and so often does not reach the larger values allowed for m. There is no linear measure available for the time remaining to complete the calculation of the maximum likelihood estimator.
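The search over m described above can be sketched as a scan that stops once the likelihood starts to fall. This is a minimal illustration, not the program's code: the exact stopping rule is an assumption (here, stop at the first m whose negative log-likelihood exceeds the best value so far, which presumes a single peak), and `estimate_m` and `toy_nll` are hypothetical names, with `toy_nll` standing in for the real matrix-based likelihood.

```python
def estimate_m(neg_log_likelihood, max_m=20):
    """Scan m = 0, 1, ..., max_m and stop once the likelihood starts to drop.

    neg_log_likelihood is any callable returning the negative log-likelihood
    of the data for a given integer m (smaller is better).
    """
    best_m, best_nll = 0, neg_log_likelihood(0)
    for m in range(1, max_m + 1):
        nll = neg_log_likelihood(m)
        if nll > best_nll:  # past the peak: this larger m fits worse
            break
        best_m, best_nll = m, nll
    return best_m, best_nll


evaluated = []  # records which values of m were actually tried


def toy_nll(m):
    """Stand-in likelihood with its minimum at m = 3 (illustration only)."""
    evaluated.append(m)
    return (m - 3) ** 2 + 100.0


print(estimate_m(toy_nll), evaluated)  # (3, 100.0) [0, 1, 2, 3, 4]
```

The toy run shows why a progress bar tied to m is a poor clock: only five of the 21 allowed values are evaluated, and in the real computation each successive value costs more than the last.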
The results are then displayed, along with buttons for exporting the results and the intermarker distances to files for later use. Exporting the intermarker distances is highly recommended, because the Simulations panel needs them to give confidence intervals for the parameters and to assess the significance of the extended model over the original counting model.
Use of simulations
The analysis of the original tetrad data discussed in the previous section gives point estimates for the interference parameter, m, in the counting model and for the interference parameter, m, and the proportion of crossovers that are free of interference, p, in the extended model. Interval estimates can come from using the asymptotic normality of the maximum likelihood estimators and Fisher's Information function or via simulations. Simulations are more appropriate with small datasets and with large estimates of m because the distribution of the maximum likelihood estimators then tends not to be close to normal. Also, while we can form the standard likelihood ratio test statistic for determining how much better the extended model fits the data than the original counting model, the null hypothesis that the original counting model is an adequate model for the data is on the boundary of the parameter space (p = 0). Thus, the distribution of the usual likelihood ratio test statistic need not be approximately Chi-Square and simulations are needed to accurately assess the significance of the hypothesis that the extended model fits the data better.
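One generic way to turn simulations into a significance level is a Monte Carlo p-value for the likelihood ratio statistic. The sketch below is an assumption about how exported results might be post-processed, not a description of the software: the names `lrt_statistic` and `simulated_p_value` are hypothetical, the numbers are invented for illustration, and the "+1" in numerator and denominator is a common convention rather than something stated in the source.

```python
def lrt_statistic(nll_counting, nll_extended):
    """Likelihood ratio statistic from two negative log-likelihoods."""
    return 2.0 * (nll_counting - nll_extended)


def simulated_p_value(observed, simulated):
    """Share of null-simulated statistics at least as large as the observed one.

    simulated holds statistics from data sets generated under the original
    counting model (p = 0) and analyzed under both models.
    """
    exceed = sum(s >= observed for s in simulated)
    return (exceed + 1) / (len(simulated) + 1)


# Invented values, for illustration only.
observed = lrt_statistic(nll_counting=233.0, nll_extended=231.5)  # 3.0
null_stats = [0.0, 0.0, 1.0, 4.0]
print(simulated_p_value(observed, null_stats))  # 0.4
```

Because the reference distribution comes from simulated data rather than from a Chi-Square table, the boundary problem at p = 0 does not affect the comparison.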
Confidence intervals for parameter estimates
For the Arabidopsis data, the extended model estimate of the interference parameter was m = 14 and the estimate of the proportion of crossovers free from interference was p = 0.20. To place confidence intervals around these estimates, we use these parameter estimates from the original data analysis and the estimates of the intermarker genetic distances obtained from the original data using Perkins's formula to simulate new data sets. In each simulated data set, we re-estimate the model parameters m and p. The variation we see in these estimates reflects our uncertainty in the original parameter estimates. If, for each parameter, we place the simulated values in order and extract the 2.5th and 97.5th percentiles, we obtain the 95% percentile bootstrap confidence interval.
To use the Simulations panel for this purpose, we would load the file containing our intermarker distances, enter the number of tetrads in our original data set (57) in the "Sample Size" textbox, choose m = 14 and enter 0.20 in the textbox for p, and uncheck the box for analyzing the data using the original counting model since we are not interested in those results.
Sample Simulation Results. An example of the simulation results
Simulation Results | | | | | |
---|---|---|---|---|---|
Results | counting_... | counting_... | extended_model | extended_model | extended_model |
Simulation | m | negative... | m | p | negative_log_likelihood |
1 | | | 14 | 0.337 | 231.22 |
2 | | | 20 | 0.324 | 232.92 |
3 | | | 5 | 0.212 | 230.34 |
4 | | | 19 | 0.157 | 197.78 |
5 | | | 20 | 0.204 | 215.51 |
Since data sets from yeast avoid these problems, we include a set of yeast data generated in the Stahl Lab and analyzed in [2]. The sample size is large (1783 tetrads) and the interference parameter estimates are relatively small. The estimates of the parameters of the extended model for the original data set were m = 3 and p = 0.088. Two hundred simulations of data sets of 1783 tetrads using these parameter estimates for m and p took approximately eight hours on a Macintosh 1.5 GHz PowerPC G4 laptop computer with 1 GB DDR SDRAM. After exporting the results, opening them in a spreadsheet program, and sorting the data by the interference parameter estimate under the extended model, pulling off the 5^{th} and 195^{th} values gives a 95% percentile bootstrap confidence interval for m of [3, 4]. Similarly, sorting the data by the proportion of non-interfering crossovers, p, and pulling off the 5^{th} and 195^{th} values gives a 95% percentile bootstrap confidence interval for p of (0.058, 0.135).
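The sort-and-pick step described above is easy to script instead of doing it in a spreadsheet. The sketch below assumes the exported estimates for one parameter have already been read into a list; the function name `percentile_interval` is hypothetical. For 200 simulations and a 95% level it selects the 5th and 195th ordered values, as in the text.

```python
def percentile_interval(estimates, level=0.95):
    """Percentile bootstrap interval from simulated parameter estimates.

    Sorts the estimates and returns the order statistics at the lower and
    upper tail ranks (the 5th and 195th values when there are 200 estimates).
    """
    ordered = sorted(estimates)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lo_rank = max(1, round(tail * n))          # 1-based rank of lower bound
    hi_rank = min(n, round((1.0 - tail) * n))  # 1-based rank of upper bound
    return ordered[lo_rank - 1], ordered[hi_rank - 1]


# Placeholder estimates 1..200 so the selected ranks are easy to read off.
print(percentile_interval(list(range(1, 201))))  # (5, 195)
```

The same function can be applied separately to the column of m estimates and to the column of p estimates.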
Assessment of the significance of the fit of the extended model
Performance
The intermarker distances for the Arabidopsis data used in our worked example are [0.149, 0.228, 0.132, 0.061, 0.167, 0.219, 0.175]. For 200 simulations with m = 3 and p = 0 analyzed under both the original counting model and its extension to include a proportion of noninterfering crossovers, InterferenceAnalyzer took approximately 1 hour on a Dell Latitude C840 (1.60 GHz Intel Pentium 4 processor, 1 GB RAM), whereas the equivalent code in SAS took approximately 16 hours. Thus, the Java program seems to be approximately 16 times faster than similar code in SAS on the same computer.
Discussion
The development of InterferenceAnalyzer should make the powerful multilocus techniques for assessing interference accessible to geneticists. Planned future development includes the analysis of spore data where only one product of meiosis is observed, analysis when the positions of the crossovers along a tetrad or spore are known (using the algorithms in [12, 13]), and the ability to simulate data under the mechanical stress model for crossover interference [19]. While the mechanical model does a good job of approximating the observed interference patterns in real data, it is not a statistical model and its best fitting parameter values cannot feasibly be obtained from data. Thus our software will not be able to fit the mechanical model to data but will only allow the simulation of such data.
Conclusion
We recognize the need for easy-to-use software in order to make sophisticated and powerful multilocus statistical techniques readily available to all geneticists. InterferenceAnalyzer is our attempt at this goal.
Availability and requirements
Project name: InterferenceAnalyzer
Project home page: http://mypage.iu.edu/~ehouswor/InterferenceAnalyzer/
Operating system(s): Platform independent
Programming language: Java
Other requirements: Java 1.4.1 or higher
License: Open source
Any restrictions to use by non-academics: None
Declarations
Acknowledgements
This work was funded by a grant from the National Science Foundation, DMS 0306243 to EAH. We also thank Franklin Stahl for his continued encouragement for this work and two anonymous reviewers for their comments and advice.
Authors’ Affiliations
References
1. Zhao H, McPeek MS, Speed TP: Statistical analysis of chromatid interference. Genetics 1995, 139: 1057–1065.
2. Malkova A, Swanson J, German M, McCusker J, Housworth EA, Stahl FW, Haber JE: Gene conversion and crossing over along the 405-kb left arm of Saccharomyces cerevisiae chromosome VII. Genetics 2004, 168: 49–63. 10.1534/genetics.104.027961
3. Mather K: Reduction and equational separation of the chromosomes in bivalents and multivalents. J Genet 1935, 30: 53–78.
4. Foss E, Lande R, Stahl FW, Steinberg CM: Chiasma interference as a function of genetic distance. Genetics 1993, 133: 681–691.
5. Goldgar DE, Fain PR: Models of multilocus recombination: nonrandomness in chiasma number and crossover positions. Am J Hum Genet 1988, 43: 38–45.
6. McPeek MS, Speed TP: Modeling interference in genetic recombination. Genetics 1995, 139: 1031–1044.
7. Zhao H, Speed TP, McPeek MS: Statistical analysis of crossover interference using the chi-square model. Genetics 1995, 139: 1045–1056.
8. Mortimer RK, Fogel S: Genetical interference and gene conversion. In Mechanisms in recombination. Edited by: Grell RF. Plenum Press, New York; 1974:263–275.
9. King JS, Mortimer RK: A mathematical model of interference for use in constructing linkage maps from tetrad data. Genetics 1991, 129: 597–602.
10. Perkins DD: Crossing-over and interference in a multiply-marked chromosome arm of Neurospora. Genetics 1962, 47: 1253–1274.
11. Broman KW, Churchill GA, Paigen K: Crossover interference in the mouse. Genetics 2002, 160: 1123–1131.
12. Broman KW, Weber JL: Characterization of human crossover interference. Am J Hum Genet 2000, 66: 1911–1926. 10.1086/302923
13. Housworth EA, Stahl FW: Crossover interference in humans. Am J Hum Genet 2003, 73: 188–197. 10.1086/376610
14. Copenhaver GP, Housworth EA, Stahl FW: Crossover interference in Arabidopsis. Genetics 2002, 160: 1631–1639.
15. Lam ST, Horn SR, Radford SJ, Housworth EA, Stahl FW, Copenhaver GP: Crossover interference on nucleolus organizing region-bearing chromosomes in Arabidopsis. Genetics 2005, 170: 807–812. 10.1534/genetics.104.040055
16. Payne LC: The theory of genetical recombination: a general formulation for a certain class of intercept length distributions appropriate to the discussion of multiple linkage. Proc Roy Soc B 1956, 144: 528–544.
17. Bailey NTJ: Introduction to the Mathematical Theory of Genetic Linkage. Oxford University Press, London; 1961.
18. InterferenceAnalyzer web page [http://mypage.iu.edu/~ehouswor/InterferenceAnalyzer/]
19. Kleckner N, Zickler D, Jones GH, Dekker J, Padmore R, Henle J, Hutchinson J: A mechanical basis for chromosome function. PNAS 2004, 101: 12592–12597. 10.1073/pnas.0402724101
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.