Application of the Linux cluster for exhaustive window haplotype analysis using the FBAT and Unphased programs
© Mishima et al; licensee BioMed Central Ltd. 2008
Published: 28 May 2008
Genetic association studies have been used to map disease-causing genes. A newly introduced statistical method, called exhaustive haplotype association study, analyzes genetic information consisting of different numbers and combinations of DNA sequence variations along a chromosome. Such studies involve a large number of statistical calculations and consequently require high computing power. It is possible to develop parallel algorithms and codes to perform the calculations on a high performance computing (HPC) system. However, most commonly used statistical packages for genetic studies are non-parallel. Alternatively, one may use grid computing technology and its packages to run non-parallel genetic statistical packages on a centralized HPC system or on distributed computing systems. In this paper, we report the use of a queuing scheduler built on the Grid Engine and run on a Rocks Linux cluster for our genetic statistical studies.
Analysis of both consecutive and combinational window haplotypes was conducted with the FBAT (Laird et al., 2000) and Unphased (Dudbridge, 2003) programs. The dataset consisted of 26 loci from 277 extended families (1484 persons). Using the Rocks Linux cluster with 22 compute-nodes, FBAT jobs ran about 14.4–15.9 times faster, and Unphased jobs ran 1.1–18.6 times faster, than the accumulated computation time.
Execution of exhaustive haplotype analysis using non-parallel software packages on a Linux-based system is an effective and efficient approach in terms of cost and performance.
Unfortunately, most of the commonly used software packages for statistical genetics are not written in parallel code, and rewriting them for parallel analysis requires reliability testing of the new code. To circumvent these obstacles and still achieve higher performance, we used a queuing system to submit sequential jobs so that they run in parallel on a Linux cluster.
Results and discussion
Computation performance of each method. The haplotype window types analyzed were consecutive window haplotypes (ConsWH) and combinational window haplotypes (CombWH). Fold acceleration is defined as the accumulated time of all processes divided by the actual elapsed time. Acceleration linearity is defined as fold acceleration divided by the number of compute-nodes used. All analyses used 22 compute-nodes, except the Unphased CombWH analysis, which used five.
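The two metrics can be illustrated with a minimal Python sketch. It takes fold acceleration as accumulated per-process time over wall-clock elapsed time, the orientation that gives values above 1 when the cluster is faster and that matches the reported 14.4–15.9-fold speedups. The function names and the timings are hypothetical and are not taken from the study.

```python
def fold_acceleration(accumulated_seconds, elapsed_seconds):
    """Summed run time of all processes divided by wall-clock elapsed time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return accumulated_seconds / elapsed_seconds


def acceleration_linearity(fold, nodes):
    """Fold acceleration per compute-node; 1.0 would be perfectly linear scaling."""
    if nodes <= 0:
        raise ValueError("node count must be positive")
    return fold / nodes


# Hypothetical timings: 30 hours of accumulated process time
# finished in 2 hours of wall-clock time on 22 compute-nodes.
accumulated = 30 * 3600.0
elapsed = 2 * 3600.0

fold = fold_acceleration(accumulated, elapsed)
linearity = acceleration_linearity(fold, 22)
print(f"fold acceleration: {fold:.1f}, linearity: {linearity:.2f}")
```

A linearity well below 1 indicates that scheduling overhead or a few long-running processes left some nodes idle.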
Because we used programs written in non-parallel code, we adopted a process-based parallelization approach. This coarse-grained parallelization is problematic when a single process takes a very long time to finish. The performance of the Unphased ConsWH analysis with the -uncertain option demonstrates this limitation. Possible solutions are to use more powerful compute-nodes or to parallelize the code. Conversely, even though some analyses required very little time (e.g., CombWH analysis for small window sizes), the cumulative time needed for Grid Engine to detect process termination and for the Linux kernel to fork a new process was not negligible when large numbers of processes were involved. Therefore, Unphased-CombWH-uncertain analyses with the smaller window sizes of 1, 2, or 3 were bundled into single processes. The Unphased-CombWH analysis of short windows, comprising all possible combinations of 1–5 loci out of 26 total, involves > 80,000 haplotypes and results in a large number of output files. Combining the analyses of all combinations for each window size made data management much more efficient, resulting in only five files, albeit with lower acceleration. Analysis with larger window sizes may require file compression, file archiving, or database management software to optimize acceleration. Although our HPC cluster system consists of retired PCs and regular network appliances, the system was sufficient to meet our substantial statistical genetics demands. This may be explained by the minimal memory required by either program and the low network traffic between nodes.
The small-scale cluster developed in this study effectively accelerated statistical genetic analysis, saving years of computation time. Today, the need for intensive computational power is increasing at the individual and small-group level. Here we show that, at minimal cost, off-the-shelf hardware, open source software, and existing non-parallel statistical packages can be configured to bring HPC into the realm of small groups.
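The scale of the CombWH analysis can be checked by counting the locus combinations directly. The Python sketch below is an illustration only: the function names are hypothetical, loci are simply numbered 1 to 26, and it is not the authors' job-generation code.

```python
from itertools import combinations
from math import comb

N_LOCI = 26  # number of loci in the dataset


def consecutive_windows(n_loci, size):
    """Yield every run of `size` adjacent loci (1-based indices)."""
    for start in range(1, n_loci - size + 2):
        yield tuple(range(start, start + size))


def combinational_windows(n_loci, size):
    """Iterate over every choice of `size` loci, adjacent or not."""
    return combinations(range(1, n_loci + 1), size)


def count_combinational(n_loci, max_size):
    """Total number of combinational windows of size 1..max_size."""
    return sum(comb(n_loci, k) for k in range(1, max_size + 1))


for size in range(1, 6):
    print(size, comb(N_LOCI, size))
print("total:", count_combinational(N_LOCI, 5))
```

For 26 loci and window sizes 1–5 the total is 83,681, consistent with the "> 80,000" figure above. Most of these (65,780) are windows of size 5.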
The Linux HPC cluster was built using the Rocks Cluster Distribution (http://www.rocksclusters.org/) version 4.3 with the SGE roll, which supports the Grid Engine job queuing system (http://gridengine.sunsource.net/). The cluster consisted of one PC as the frontend-node and 22 PCs as compute-nodes. All the computers had Intel Pentium 4 (1.7 GHz) CPUs. The frontend-node and the compute-nodes were connected to each other over a 100BASE-T network through a switching hub.
Dataset and statistical genetics software
Whole blood samples were collected from 277 extended families (546 nuclear families, 1484 persons). DNA was then extracted from these samples, and single nucleotide polymorphisms (SNPs) at 26 genetic loci were genotyped for each individual. The results yielded a 172-kilobyte pedigree data file for the subsequent analyses. For the FBAT program, the original Linux executable was installed on the cluster. The hbat -e interactive command of FBAT was used for ConsWH and CombWH analysis; the -e option was used to account for the bias introduced by studying multiple members of the same family. For the Unphased program, the version 3.0.10 source code was recompiled with the GNU C compiler using the options -march=pentium4 -O2 for CPU-specific optimization. Unphased was run with the -uncertain and -certain options from the command line instead of through its graphical user interface written in Java. The former option includes ambiguous genetic data to increase sensitivity, whereas the latter option evaluates only known genetic data, resulting in much quicker analysis.
Array job submission and execution
Optimization of analysis parameters
The window sizes for CombWH analysis were limited to 1–5 loci because the results for larger window sizes were divergent. The maximum number limit of array jobs (max_aj_jobs) was changed from the default of 75000 to zero with the qconf -mconf command. To increase the efficiency of job distribution, the flush time of Grid Engine (reporting_params/flush_time) was also decreased with the qconf -mconf command, from the default of 15 sec to 5 sec.
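One way an array task could pick its own window is sketched below in Python. This is an assumption for illustration, not the authors' submission script, which is not shown in this text. It relies only on the SGE_TASK_ID environment variable that Grid Engine sets for each array task; the function name and the ordering of windows are hypothetical. With one task per window, sizes 1–5 over 26 loci would need 83,681 tasks, more than the default limit of 75000.

```python
import os
from itertools import combinations, islice


def window_for_task(task_id, n_loci=26, max_size=5):
    """Map a 1-based array task index to one combination of loci.

    Windows are ordered by size (1..max_size) and, within a size,
    in the lexicographic order produced by itertools.combinations.
    """
    if task_id < 1:
        raise ValueError("task index starts at 1")
    windows = (
        w
        for size in range(1, max_size + 1)
        for w in combinations(range(1, n_loci + 1), size)
    )
    window = next(islice(windows, task_id - 1, None), None)
    if window is None:
        raise ValueError("task index exceeds the number of windows")
    return window


if __name__ == "__main__":
    # Outside an array job SGE_TASK_ID is absent or non-numeric; fall back to 1.
    raw = os.environ.get("SGE_TASK_ID", "1")
    task_id = int(raw) if raw.isdigit() else 1
    print(task_id, window_for_task(task_id))
```

Walking the generator for every task is wasteful for high indices, so a real script might precompute the list once and write it to a file that each task reads by line number.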
ACL is funded by NIH R01-DE014667. JN is funded by NIH EB006412-01 and NSF 0727007. We thank Mauricio Arcos-Burgos for coordination of sample collection; Lina M. Moreno and Tamara D. Busch for managing DNA samples, genotyping, and data organization; and Jamie P. L'Heureux for research coordination and database administration. Thanks also go to Boyd Knosp of Research Services at Information Technology Services for his support, engagement, and contribution of computers.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 6, 2008: Symposium of Computations in Bioinformatics and Bioscience (SCBB07). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S6.
- Laird NM, Lange C: Family-based designs in the age of large-scale gene-association studies. Nat Rev Genet 2006,7(5):385–394. 10.1038/nrg1839
- Laird NM, Horvath S, Xu X: Implementing a unified approach to family-based tests of association. Genet Epidemiol 2000,19(Suppl 1):S36–S42. [http://www.biostat.harvard.edu/~fbat/] 10.1002/1098-2272(2000)19:1+<::AID-GEPI6>3.0.CO;2-M
- Dudbridge F: Pedigree disequilibrium tests for multilocus haplotypes. Genet Epidemiol 2003,25(2):115–121. [http://www.mrc-bsu.cam.ac.uk/personal/frank/software/unphased/] 10.1002/gepi.10252
- Lin S, Chakravarti A, Cutler DJ: Exhaustive allelic transmission disequilibrium tests as a new approach to genome-wide association studies. Nat Genet 2004,36(11):1181–1188. 10.1038/ng1457
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.