Parallel and serial computing tools for testing single-locus and epistatic SNP effects of quantitative traits in genome-wide association studies
© Ma et al; licensee BioMed Central Ltd. 2008
Received: 27 March 2008
Accepted: 21 July 2008
Published: 21 July 2008
Genome-wide association studies (GWAS) using single nucleotide polymorphism (SNP) markers provide opportunities to detect epistatic SNPs associated with quantitative traits and to detect the exact mode of an epistasis effect. Computational difficulty is the main bottleneck for epistasis testing in large scale GWAS.
The EPISNPmpi and EPISNP computer programs were developed for testing single-locus and epistatic SNP effects on quantitative traits in GWAS, including tests of three single-locus effects for each SNP (SNP genotypic effect, additive and dominance effects) and five epistasis effects for each pair of SNPs (two-locus interaction, additive × additive, additive × dominance, dominance × additive, and dominance × dominance) based on the extended Kempthorne model. EPISNPmpi is the parallel computing program for epistasis testing in large scale GWAS and achieved excellent scalability for large scale analysis and portability for various parallel computing platforms. EPISNP is the serial computing program based on the EPISNPmpi code for epistasis testing in small scale GWAS using commonly available operating systems and computer hardware. Three serial computing utility programs were developed for graphical viewing of test results and epistasis networks, and for estimating CPU time and disk space requirements.
The EPISNPmpi parallel computing program provides an effective computing tool for epistasis testing in large scale GWAS, and the epiSNP serial computing programs are convenient tools for epistasis analysis in small scale GWAS using commonly available computer hardware.
Table 1. Estimated single-processor computing time (T) on the SGI Altix XE 1300 Linux cluster system with 2.66 GHz Intel Clovertown processor, and the total number of tests (M), for two-locus and three-locus analysis.
Number of SNPs (N) | Quantity | Two-locus analysis | Three-locus analysis
500,000 | Computing time (T) | T ≈ 1.2 years | T ≈ 200,000 years
500,000 | Number of tests (M) | M = 1.25 × 10^11 | M = 2.08 × 10^16
1,000,000 | Computing time (T) | T ≈ 5 years | T ≈ 1.5 million years
1,000,000 | Number of tests (M) | M = 5.0 × 10^11 | M = 1.67 × 10^17
where L_x = contrast to estimate the genetic effect, s^2 = (y - Xĝ)'(y - Xĝ)/(n - k) = estimated residual variance, ĝ = the least squares estimates of the SNP genotypic effects, and s_i = a function of marginal and conditional allelic and genotypic frequencies for estimating genetic effect i, which is either an additive, dominance, or epistasis effect, and where n = number of observations and k = rank of X. For testing epistasis effects involving the X chromosome in mammals (or the Z chromosome in birds), only females (or males in birds) can be included in the analysis. For epistasis analysis involving SNPs in pseudoautosomal regions, the analysis is the same as for autosomal SNPs. These epistasis testing methods were implemented in a parallel computing program intended for large scale GWAS and in a serial computing program intended for small scale GWAS that can be analyzed on commonly available computer hardware.
Example of distributing N SNPs to m(m+1)/2 processor cores (P_i, i = 1, ..., m(m+1)/2) for the case where N/m is an integer, where m = N/n = number of subsets of SNPs with each subset having n SNPs (m and n are assumed integers). Rows and columns of the table are the m SNP subsets, from SNP_1 ... SNP_n through SNP_(n(m-1)+1) ... SNP_N.
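The distribution scheme can be sketched as follows: the N SNPs are split into m subsets of n SNPs, and each unordered pair of subsets (including a subset paired with itself) is handed to one core, which gives m(m+1)/2 cores. The core numbering and the function names below are assumptions made for illustration and may differ from the actual EPISNPmpi assignment.

```python
from itertools import combinations_with_replacement


def assign_blocks(n_snps, m):
    """Map a core number to the pair of SNP index ranges it would test.

    Assumes N/m is an integer; the core numbering is illustrative only.
    """
    if n_snps % m != 0:
        raise ValueError("this sketch assumes N/m is an integer")
    n = n_snps // m
    subsets = [range(s * n, (s + 1) * n) for s in range(m)]
    blocks = combinations_with_replacement(range(m), 2)
    return {core: (subsets[i], subsets[j])
            for core, (i, j) in enumerate(blocks, start=1)}


def pairs_on_core(rows, cols):
    """Count SNP pairs on one core: within-subset pairs on the diagonal,
    all cross-subset pairs otherwise."""
    n = len(rows)
    return n * (n - 1) // 2 if rows == cols else n * len(cols)


plan = assign_blocks(n_snps=12, m=3)
total = sum(pairs_on_core(rows, cols) for rows, cols in plan.values())
print(len(plan), total)  # prints: 6 66
```

With N = 12 and m = 3 the six cores together cover all 12 × 11 / 2 = 66 SNP pairs exactly once. Cores holding a subset paired with itself test n(n-1)/2 pairs while the others test n^2 pairs, so the load is slightly uneven in this simple layout.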
A parallel computing program named EPISNPmpi and a serial computing program named EPISNP were developed for genome-wide pairwise epistasis testing. Three serial computing utility programs were developed to estimate computing time, to produce a graphical chromosome view of significant single-locus results, and to produce a graphical display of the epistasis network.
The EPISNPmpi and EPISNP programs
The EPISNPmpi and EPISNP programs provide two sets of SNP tests: single-locus analysis and pairwise analysis. The single-locus analysis tests three effects of each SNP: the SNP genotypic effect (M), the additive effect (A), and the dominance effect (D). The pairwise analysis tests five effects of each pair of SNPs: the two-locus interaction effect (I-effect), A × A, A × D, D × A, and D × D. Three input files in text format are required: the phenotype file, the SNP genotype file, and the parameter file. The phenotype file contains observations of the quantitative trait(s), family ID, individual ID, individual gender, and non-genetic fixed effects such as smoking status and age of each individual. The SNP genotype file contains family ID, individual ID, individual gender, and SNP genotypes, with one file for each chromosome. The parameter file, named parameter.dat, provides user-specified controls that make the EPISNPmpi and EPISNP programs generally applicable. These controls include the number of quantitative traits to be analyzed, the number of chromosomes, the code for the sex chromosome, the formats for SNP genotypes and missing values, and the number of fixed non-genetic factors to be included in the statistical model, where a fixed non-genetic factor can be an indicator variable or a continuous variable (covariable). Both the EPISNPmpi and EPISNP programs are applicable to populations with Hardy-Weinberg disequilibrium (HWD) and linkage disequilibrium (LD).
where k = number of processor cores, t_k = computing time using k processor cores, and t_1 = computing time using one processor core. In Figure 1, the computing times were normalized to the computing time on 15 processor cores because the minimal number of cores used was 15. Results in Figure 1 showed that the observed computing time and the predicted computing time assuming ideal speedup and scalability matched very well, indicating that the EPISNPmpi coding achieved excellent speedup and scalability. Based on the observed run times of 0.20 and 19.3 hours for 50,000 and 500,000 SNPs respectively using 528 cores of the Calhoun system, the estimated computing time for pairwise epistasis tests is approximately an increasing quadratic function of the number of SNPs. Let N = the number of SNPs and N_0 = a smaller number of SNPs with a known computing time (t_0) for running EPISNPmpi such that N = N_0 x. Then, the computing time required for analyzing N SNPs (t_N) is approximately
t_N = t_0 x^2
The run time of 19.3 hours for 500,000 SNPs using 528 cores showed that pairwise epistasis testing for GWAS with about 500,000 SNPs could be completed in one day using about 25% of the 2048 cores of the Calhoun system. Based on this computing time and equations (1–2), the predicted time for pairwise epistasis testing among 1,000,000 SNPs using all 2048 cores of the Calhoun system is about 20 hours. This prediction indicates that EPISNPmpi is capable of completing pairwise epistasis analysis in one day for any large scale GWAS currently in existence, noting that the numbers of SNPs used in current large scale GWAS are in the range of 500,000 to 940,000, as represented by NIH's GAIN projects. Sample size, or the number of individuals, also affects the computing time, but the increase in computing time due to increased sample size is minor. The EPISNPmpi code is highly portable to various computing platforms and has been ported to all supercomputer systems at the Minnesota Supercomputer Institute and to several popular serial computing platforms.
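The 20-hour prediction can be reproduced with a few lines of arithmetic. The sketch below assumes that run time grows with the square of the number of SNPs and shrinks in inverse proportion to the number of cores (ideal speedup); the function `predict_hours` and its parameter names are hypothetical, and only the reference run times come from the text.

```python
def predict_hours(n_snps, cores, ref_snps=500_000, ref_hours=19.3, ref_cores=528):
    """Scale a reference run quadratically in SNPs and inversely in cores.

    Assumes ideal speedup; defaults are the 500,000-SNP run on 528 cores.
    """
    x = n_snps / ref_snps
    return ref_hours * x ** 2 * ref_cores / cores


# 1,000,000 SNPs on all 2048 cores, scaled from the 500,000-SNP run.
print(f"{predict_hours(1_000_000, 2048):.1f} h")  # prints: 19.9 h

# Consistency check: scale the 50,000-SNP run (0.20 h) up to 500,000 SNPs.
print(f"{predict_hours(500_000, 528, ref_snps=50_000, ref_hours=0.20):.1f} h")  # prints: 20.0 h
```

The second call gives 20.0 hours against the observed 19.3 hours, which suggests the quadratic approximation is reasonable for these two runs.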
The EPISNP program is designed for epistasis analysis in small-scale GWAS on commonly available computer hardware. For example, an analysis of 5700 SNP markers took about 18 hours to complete on a PC with a single 3.8 GHz Pentium 4 processor.
The EPISNPmpi and EPISNP programs each produce two output files of the most significant results of single-locus tests and two output files of the most significant results of pairwise epistasis tests. The output file for significant epistasis results currently displays the names and chromosome locations of the two SNPs in each SNP pair with a significant I-effect (interaction between the two loci), A × A, A × D, D × A, or D × D effect, the significance level (p-value), and ordered estimates of individual effects that are useful for identifying the best and worst gene combinations affecting a phenotype. The second output file of single-locus tests is used as the input file of the EPISNPPLOT program, and the second output file of pairwise epistasis tests is used as the input file of the EPINET program.
Three serial computing utility programs
Commodity cluster-based processing of EPISNPmpi
EPISNPmpi has been developed and tested on many modern high-performance computers and supercomputer systems. The price-to-performance ratio of the computing system can be an important consideration in practice. To utilize commonly available computer hardware for high performance computing, EPISNPmpi has been implemented to run on a commodity cluster or on an inexpensive network of workstations using the MPICH message passing libraries. MPICH is a portable implementation of MPI, a standard for message passing in distributed-memory applications, and is freely available at http://www.mcs.anl.gov/mpi/mpich1/download.html.
Computational difficulty is the main bottleneck of epistasis testing in large scale GWAS. The computing tools we have developed help address this difficulty, and the computing speed can be further improved if a more powerful computer system is used. However, serious computational challenges still exist in at least three areas: 1) the increased number of SNPs used in GWAS, 2) integration of GWAS and a gene expression study, and 3) joint epistasis testing for three or more SNPs at a time. The human genome has about 10 million SNPs. Although an exhaustive analysis of all human SNPs is not yet a reality, the number of SNPs used in GWAS is clearly increasing rapidly. Since the computing time for epistasis testing increases approximately as a quadratic function of the number of SNPs, computing difficulty will rapidly increase as the number of SNPs increases. Integration of large scale GWAS and a gene expression study using the same individuals poses another serious computational challenge. In this case, the computing time required is multiplied by the number of genes, because the gene expression intensity of each gene is treated as one phenotype. Joint epistasis testing for three or more SNPs could be the ultimate computing challenge. As shown in Table 1, adding just one SNP to the pairwise epistasis test for 1,000,000 SNPs, that is, moving from two-locus to three-locus tests, could require about one-third of a million times as much computing time.
A tempting solution would be to test epistasis effects only for the subset of SNPs with significant single-locus effects. However, requiring significant main effects before epistasis testing could miss many or even all significant epistasis effects when stringent p-values are used to declare significance. For example, the significant epistasis effects with p < 10^-7 for 5700 SNPs covering all 23 human chromosomes reported in Ma et al. did not involve any SNPs with significant single-locus effects at p < 10^-4. Therefore, requiring significant single-locus effects at p < 10^-4 would have missed all ten significant epistasis effects at p < 10^-7 among the 5700 SNPs.
The EPISNPmpi and EPISNP programs provide capabilities for testing all possible pairwise epistasis effects. However, the use of these programs should be considered only one step in GWAS analysis. Considerable work may still be required to interpret the test results.
The EPISNPmpi parallel computing program provides a computing tool capable of completing pairwise epistasis tests in large scale GWAS in a timely manner using a supercomputer system. The serial computing programs can be useful and convenient tools for epistasis analysis in small scale GWAS using commonly available computer hardware. EPISNPmpi is a portable program which not only exploits the capability of supercomputers but also runs on inexpensive loosely coupled cluster systems.
Availability and requirements
Project name: Parallel and serial computing for genome-wide SNP analysis
Project homepage: http://animalgene.umn.edu/
1. EPISNPmpi: http://animalgene.umn.edu/episnpmpi/index.html
Currently supported processor types, MPI libraries, compilers and corresponding binaries
2. epiSNP: http://animalgene.umn.edu/episnp/index.html
Currently supported operating systems, processor types, and compilers used to generate binaries
In the above binaries, epiSNP_2.0_Windows.zip contains all four programs (EPISNP, CPUHD, EPISNPPLOT, EPINET), while each of the other .gz files contains EPISNP and CPUHD only.
Other requirements: None.
Any restrictions to use by non-academics: None.
- GWAS: genome-wide association study
- SNP: single nucleotide polymorphism
- I-effect: two-locus interaction effect
- A × A: additive × additive epistasis effect
- A × D: additive × dominance epistasis effect
- D × A: dominance × additive epistasis effect
- D × D: dominance × dominance epistasis effect
2.6 GHz IBM BladeCenter Linux cluster at the Minnesota Supercomputer Institute
the SGI Altix XE 1300 Linux cluster system with 2.66 GHz Intel Clovertown processor at the Minnesota Supercomputer Institute.
This research is partially supported by the Minnesota Supercomputer Institute (LM, HBR), Digital Technology Center of the University of Minnesota (DD), National Research Initiative Grant no. 2008-35205-18846 from the USDA Cooperative State Research, Education, and Extension Service (LM), Cargill, Inc. (JRG), and the Agricultural Experiment Station of the University of Minnesota (YD). Supercomputer computing time was provided by the Minnesota Supercomputer Institute.
- Balding DJ: A tutorial on statistical methods for population association studies. Nat Rev Genet 2006, 7: 781–791. 10.1038/nrg1916
- Carlborg O, Haley CS: Epistasis: too often neglected in complex trait studies? Nat Rev Genet 2004, 5: 618–625. 10.1038/nrg1407
- Li W, Reich J: A complete enumeration and classification of two-locus disease models. Hum Hered 2000, 50: 334–349. 10.1159/000022939
- Moore JH: The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum Hered 2003, 56: 73–82. 10.1159/000073735
- Purcell S, Sham PC: Epistasis in quantitative trait locus linkage analysis: interaction or main effect? Behav Genet 2004, 34: 143–152. 10.1023/B:BEGE.0000013728.96408.f9
- Nishihara E, Tsaih SW, Tsukahara C, Langley S, Sheehan S, DiPetrillo K, Kunita S, Yagami K, Churchill GA, Paigen B, Sugiyama F: Quantitative trait loci associated with blood pressure of metabolic syndrome in the progeny of NZO/HILtJ × C3H/HeJ intercrosses. Mammalian Genome 2007, 18: 573–583. 10.1007/s00335-007-9033-5
- Sambandan S, Yamamoto A, Fanara JJ, Mackay TFC, Anholt RRH: Dynamic genetic interactions determine odor-guided behavior in Drosophila melanogaster. Genetics 2006, 174: 1349–1363. 10.1534/genetics.106.060574
- Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, Thakurta DG, Sieberts SK, Monks S, Reitman M, Zhang C, Lum PY, Leonardson A, Thieringer R, Metzger JM, Yang L, Castle J, Zhu H, Kash SF, Drake TA, Sachs A, Lusis AJ: An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet 2005, 37: 710–717. 10.1038/ng1589
- Fisher RA: The correlation between relatives on the supposition of Mendelian inheritance. Trans Roy Soc Edinburgh 1918, 52: 399–433.
- Cockerham CC: An extension of the concept of partitioning hereditary variance for analysis of covariances among relatives when epistasis is present. Genetics 1954, 39: 859–882.
- Kempthorne O: The correlation between relatives in a random mating population. Proc R Soc Lond B Biol Sci 1954, 143: 102–113.
- Mao Y, London NR, Ma L, Dvorkin D, Da Y: Detection of SNP epistasis effects of quantitative traits using an extended Kempthorne model. Physiol Genomics 2007, 28(1): 46–52.
- Wolfinger RD, Gibson G, Wolfinger ED, Bennett L, Hamadeh H, Bushel P, Afshari C, Paules RS: Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol 2001, 8: 625–637. 10.1089/106652701753307520
- Eager DL, Zahorjan J, Lazowska ED: Speedup versus efficiency in parallel systems. IEEE Trans on Computers 1989, C-38: 408–423. 10.1109/12.21127
- Alabdulkareem M, Lakshmivarahan S, Dhall SK: Scalability analysis of large codes using factorial designs. Parallel Computing 2001, 27: 1145–1171. 10.1016/S0167-8191(01)00068-0
- Genetic Association Information Network (GAIN) [http://www.fnih.org/GAIN2/platforms.shtml]
- Ma L, Runesha HB, Da Y: EPISNPmpi: A supercomputer parallel computing program for epistasis testing in genome-wide association studies, user manual version 2.0. Department of Animal Science and Supercomputer Institute, University of Minnesota; [http://animalgene.umn.edu/episnpmpi/index.html]
- Ma L, Dvorkin D, Garbe JR, Runesha HB, Da Y: epiSNP: A computer package of serial computing programs for epistasis testing in genome-wide association studies, user manual version 2.0. Department of Animal Science and Supercomputer Institute, University of Minnesota; [http://animalgene.umn.edu/episnp/index.html]
- Jansen RC, Nap JP: Genetical genomics: the added value from segregation. Trends Genet 2001, 17: 388–391. 10.1016/S0168-9525(01)02310-1
- Ma L, Dvorkin D, Garbe JR, Da Y: Genome-wide analysis of single-locus and epistasis SNP effects on anti-cyclic citrullinated peptide as a measure of rheumatoid arthritis. BMC Proceedings 2007, 1(Suppl 1): S127.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.