ParallABEL: an R library for generalized parallelization of genome-wide association studies
© Sangket et al. 2010
Received: 21 October 2009
Accepted: 29 April 2010
Published: 29 April 2010
Genome-Wide Association (GWA) analysis is a powerful method for identifying loci associated with complex traits and drug response. Parts of GWA analyses, especially those involving thousands of individuals and consuming hours to months of computing time, will benefit from parallel computation. However, acquiring the necessary programming skills to correctly partition and distribute data, control and monitor tasks on clustered computers, and merge output files is arduous.
Most components of GWA analysis can be divided into four groups based on the types of input data and statistical outputs. The first group contains statistics computed for a particular Single Nucleotide Polymorphism (SNP) or trait, such as SNP characterization statistics or association test statistics; the input data of this group consists of SNPs/traits. The second group concerns statistics characterizing an individual in a study, for example the summary statistics of genotype quality for each sample; the input data of this group consists of individuals. The third group consists of pair-wise statistics derived from analyses between each pair of individuals in the study, for example genome-wide identity-by-state or genomic kinship analyses; the input data of this group consists of pairs of individuals. The final group concerns pair-wise statistics derived for pairs of SNPs, such as linkage disequilibrium characterisation; the input data of this group consists of pairs of SNPs. We developed the ParallABEL library, which utilizes the Rmpi library, to parallelize these four types of computations. The ParallABEL library is not only aimed at GenABEL, but may also be employed to parallelize various GWA packages in R. The data set from the North American Rheumatoid Arthritis Consortium (NARAC), which includes 2,062 individuals genotyped at 545,080 SNPs, was used to measure ParallABEL performance. Almost perfect speed-up was achieved for many types of analyses. For example, the computing time for the identity-by-state matrix was reduced almost linearly, from approximately eight hours to one hour, when ParallABEL employed eight processors.
Executing genome-wide association analysis using the ParallABEL library on a computer cluster is an effective way to boost performance and simplify the parallelization of GWA studies. ParallABEL is a user-friendly parallelization of GenABEL.
GWA analysis is a well-established and powerful method for identifying loci associated with variations of complex genetic traits such as common diseases. Hundreds of new genes have been implicated in human health and disease during the last few years in various GWA studies. In a typical study, hundreds of thousands, or millions, of single-nucleotide polymorphisms (SNPs) are typed in thousands of individuals in order to detect genetic risk factors.
GenABEL is a specialized library package for GWA analysis implemented in R, an open source statistics programming language and environment [4, 5]. GenABEL enables GWA analysis to be done on a regular desktop computer due to its efficient data storage and memory management. Nevertheless, analysis of very large data sets is computationally challenging and may take hours or weeks to complete. Examples include the utilization of sophisticated adjustments for population stratification and relationship structure, the estimation of linkage disequilibrium, the calculation of genome-wide identity-by-state, haplotypic tests, and permutation analyses.
To increase the computational throughput, a user can partition their data into sets, and perform the analysis of the sets across a network of computers; a concept known as parallel and/or distributed computing. However, performing such analysis requires high levels of computer expertise. The user needs sufficient programming skills to partition and distribute data, control and monitor tasks across the computers, and merge output files. Occasionally, a data set may fail to be processed, e.g. if the user did not partition the data into small enough subsets to be processed on a particular machine. Also, the outputs from the computers may be scattered and their order hard to follow.
Several attempts have been made to parallelize genetic association analyses. Grid Engine, a widely used job scheduler, can schedule parallel tasks involving genetic association analysis programs such as FBAT and UNPHASED. This approach, first proposed by Mishima et al., is based on non-parallel code combined through process-based parallelization. The downside is that the user still needs to monitor when each task is finished and when the outputs from all the tasks can be merged. Moreover, each process may take a very long time to finish, and load balance can be problematic. A granularity problem (a high computation to communication ratio) may occur, but more powerful compute-nodes or code parallelization are possible solutions. The R/parallel package has been used to automate parallel loop execution, but the application must run on a single computer with multi-core processors, and the package does not currently support cluster computing; adding cluster support would remove this limit on its computing capacity. Misawa and Kamatani developed the ParaHaplo package for haplotype-based whole-genome association studies using parallel computing; it is aimed at correcting for multiple comparisons at multiple SNP loci in linkage disequilibrium. However, GWA studies involve other statistical analysis requirements as well, such as statistics for a particular SNP or trait, association tests, statistics characterizing an individual in the study, and pair-wise statistics between individuals. Furthermore, Ma et al. developed EPISNPmpi, a parallel system for epistasis testing in large-scale GWA analysis.
Rmpi is an R library which provides various functions to parallelize tasks in R using MPI (the Message-Passing Interface). Rmpi provides functions to manage the analysis flow in a parallel environment, and can employ multi-core CPUs distributed across many computers, not only the multi-core CPUs of a single computer. However, it is difficult, if not impossible, for a non-programmer to write a parallel Rmpi program. Therefore, SPRINT was developed to provide parallel implementations of R functions. Although SPRINT is easy to use, it does not specifically support GWA studies.
In this article, we present the development of our ParallABEL library, a new R library for parallelization of GWA studies based on Rmpi. ParallABEL aims to speed up the computation of GWA studies for various statistical analysis requirements and also to simplify analysis parallelization. With ParallABEL, users do not need expert programming skills for partitioning and distributing data, controlling and monitoring tasks, or merging output files.
GWA analyses grouping and the corresponding GenABEL functions:

Group 1, statistics for a particular SNP or trait:
- summary: provides a summary of observed genotypes, allelic frequency, genotypic distribution, P-value of the exact test for HWE, and chromosome
- qtscore: fast score test for association between a trait and genetic polymorphism
- mlreg: linear and logistic regression and Cox models for genome-wide SNP data
- mmscore: score test for association between a trait and genetic polymorphism, in samples of related individuals

Group 2, statistics characterizing an individual:
- perid.summary: produces call rate and heterozygosity per person
- hom: computes average homozygosity (inbreeding) for a set of people, across multiple markers; can be used for quality control (e.g. contamination checks)

Group 3, pair-wise statistics between individuals:
- ibs: given a set of SNPs, computes a matrix of average IBS for a group of people

Group 4, pair-wise statistics between SNPs:
- dprfast: given a set of SNPs, computes a matrix of D'
- rhofast: given a set of SNPs, computes a matrix of rho
- r2fast: given a set of SNPs, computes a matrix of r2
We have developed the ParallABEL library to parallelize the serial functions of these groups using the Rmpi library. The four implementations are named Type1_parall_by_SNPs for the first group, Type2_parall_by_individuals for the second, Type3_parall_by_pairs_of_individuals for the third, and Type4_parall_by_pairs_of_SNPs for the fourth.
An advantage of ParallABEL is its simplicity of use: it hides the otherwise tedious scripting for file management and task monitoring. These functions not only partition input data with automatic load balancing, but also gather the output from each processor automatically. Load balancing is critical because an unbalanced work load results in higher loads on particular processors, which undermines overall performance.
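The load-balanced partitioning idea can be sketched as follows (an illustrative sketch in Python, not ParallABEL's actual code): n items are split into p nearly equal subsets, so no processor receives more than one item more than any other.

```python
# Minimal sketch of balanced data partitioning: split n items
# (e.g. SNP indices) into p subsets whose sizes differ by at most one.
def partition(n_items, n_procs):
    base, extra = divmod(n_items, n_procs)
    subsets, start = [], 0
    for rank in range(n_procs):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one extra item
        subsets.append(range(start, start + size))
        start += size
    return subsets

chunks = partition(545080, 8)  # the NARAC SNP count over eight processors
```

With 545,080 SNPs and eight processors each subset holds exactly 68,135 SNPs, so no processor idles while another finishes a larger share.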
The input data for Type2_parall_by_individuals consists of individuals, and is partitioned in the same way as for Type1_parall_by_SNPs.
The SNP-pair input of Type4_parall_by_pairs_of_SNPs is partitioned in a similar way to the individual-pair input of Type3_parall_by_pairs_of_individuals.
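Pair-wise workloads are harder to balance than per-item ones: row i of an n-by-n pair matrix contributes n-1-i pairs, so splitting rows evenly by count leaves the processors with very unequal numbers of pairs. One way to balance such a triangular workload is a greedy assignment, sketched here in Python purely for illustration (this is not necessarily the partitioning scheme ParallABEL itself uses):

```python
import heapq

# Greedy balancing of a triangular pair-wise workload: each row goes
# to the currently least-loaded processor, so the numbers of pairs
# per processor stay nearly equal even though row sizes differ.
def partition_pairs(n, n_procs):
    heap = [(0, rank, []) for rank in range(n_procs)]  # (pairs assigned, rank, rows)
    heapq.heapify(heap)
    for i in range(n - 1):                 # row i holds pairs (i, i+1) .. (i, n-1)
        load, rank, rows = heapq.heappop(heap)   # least-loaded processor so far
        rows.append(i)
        heapq.heappush(heap, (load + (n - 1 - i), rank, rows))
    return sorted(heap)                    # (load, rank, rows) per processor

parts = partition_pairs(2062, 8)  # the 2,062 NARAC individuals over eight processors
```

The greedy rule guarantees that the per-processor pair counts differ by less than the largest row, which is negligible next to the roughly 265,000 pairs each processor receives.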
This sequential workflow may take a very long time for some demanding statistical analyses. Our parallel workflow for producing statistical data in GWA studies is shown in Figure 4B, and can save computing time. The genotype and phenotype data are passed for distribution to the Sun Grid Engine, a job scheduler, which queues jobs and assigns them to processors in a cluster. LAM/MPI (Local Area Multicomputer/Message Passing Interface) has various functions which can be called by Rmpi to parallelize R, and ParallABEL parallelizes GenABEL through this Rmpi library. The statistical output of this workflow has been validated by comparing it with the output of the non-parallel approach. ParallABEL runs not only on Linux clusters, such as the Rocks Cluster Distribution, but also on any operating system that supports R and LAM/MPI or Open MPI, such as the Unix and Solaris operating systems. It can also run on computer clusters lacking the Sun Grid Engine by executing jobs immediately; however, administrators will normally not allow users to run a parallel program without going through the Sun Grid Engine queue.
To parallelize GWA studies, ParallABEL running on the frontend-node partitions input data into smaller subsets so that tasks can be fairly distributed among the processors. It sends tasks to idle processors on compute-nodes. When the computation on a compute-node has finished, the frontend-node sends another task to the idle processor, a cycle that continues until all the tasks are completed; this is known as the 'task pull' method. When all the tasks are finished, the frontend-node automatically merges all the outputs.
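The task-pull cycle can be simulated serially to show why it balances load automatically (an illustrative Python sketch, not ParallABEL code): whenever a worker becomes idle it pulls the next task from the frontend's queue, so workers that receive cheap tasks simply pull more of them.

```python
from collections import deque

# Serial simulation of the 'task pull' method: a frontend queue of
# tasks is drained by whichever worker is idle first; busy_until
# tracks each worker's simulated finish time.
def task_pull(task_costs, n_workers):
    tasks = deque(range(len(task_costs)))
    busy_until = [0.0] * n_workers
    assigned = [[] for _ in range(n_workers)]
    while tasks:
        w = min(range(n_workers), key=busy_until.__getitem__)  # next idle worker
        t = tasks.popleft()
        busy_until[w] += task_costs[t]
        assigned[w].append(t)
    return max(busy_until), assigned   # makespan, tasks per worker

makespan, assigned = task_pull([5, 1, 1, 1, 1, 1], 2)  # one slow task, five fast ones
```

Here the second worker absorbs all five cheap tasks while the first handles the expensive one, finishing in 5 time units; a static half-and-half split would have taken 7.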
Our computer cluster, Hanuman, runs Rocks Cluster Distribution version 4.3, which includes the Sun Grid Engine version 4.3. The cluster consists of five IBM xSeries 336 servers, comprising a frontend-node and four compute-nodes. All servers have two single-core Intel Xeon (2.8 GHz) processors and 4 GB RAM. The frontend-node and all the compute-nodes are connected through an Ethernet switch, and the user can connect via the Internet. The cluster provides LAM/MPI version 7.1.2, R version 2.8.1, Rmpi version 0.5-6, and GenABEL version 1.4-2, which are utilized as components by our ParallABEL library.
The North American Rheumatoid Arthritis Consortium (NARAC) data is part of a dataset employed to observe associations between disease and variants in the major-histocompatibility-complex locus. The NARAC genotype data contains 545,080 SNPs from 2,062 individuals. The data was used to measure the performance of ParallABEL, employing 868 individuals as cases and 1,194 individuals as controls.
ParallABEL reduced the computing time for Type3_parall_by_pairs_of_individuals, especially with eight processors: execution on eight processors was approximately seven times faster than on one. On a single processor, the complete analysis took 8.1 hours, but only 1.1 hours with eight processors. The computing time for Type1_parall_by_SNPs scales similarly to that for Type3_parall_by_pairs_of_individuals.
The computing time for the sequential version of Type2_parall_by_individuals can be very short (e.g. 20 seconds), while the parallel version took longer (5.3 minutes with two processors) due to the overhead of data partitioning, data distribution, and data merging. Data distribution can be time consuming because the data must be saved on the frontend-node before the compute-nodes can load it, and the frontend-node must also spend time communicating with the compute-nodes. In addition, GenABEL is tailored to quickly retrieve subsets of SNPs, as this is a typical GWA scan procedure, but is much less efficient at retrieving subsets of individuals, which is less typical. Thus the overhead of partitioning the data into subsets of individuals outweighed the gain achieved by parallel processing. These results highlight a place where GenABEL data storage and processing is ineffective, and we are currently working on better algorithms for by-individual analyses.
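This trade-off can be captured with a back-of-envelope model: parallel time is roughly a fixed overhead (partition + distribute + merge) plus compute time divided by the processor count, so parallelism only pays off when the compute time saved exceeds the overhead. The 20-second and 5.3-minute figures come from the measurements above; lumping the overhead into a single 300-second term is our simplifying assumption for illustration.

```python
# Simple overhead model of parallel execution time:
# total = fixed overhead + compute / processors.
def parallel_time(t_compute, n_procs, t_overhead):
    return t_overhead + t_compute / n_procs

short_job = parallel_time(20, 2, 300)         # a 20 s Type2 job: overhead dominates
long_job = parallel_time(8.1 * 3600, 8, 300)  # the 8.1 h IBS job still gains hugely
```

Under these assumptions the short job grows from 20 seconds to over five minutes, while the long job shrinks from 8.1 hours to just over an hour, matching the pattern observed in the experiments.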
[Table: data partitioning for each chromosome, listing the number of SNPs and the number of subsets per chromosome; chromosomes 19, 20, 21, 22, X, and Y are combined in a single row.]
Type4_parall_by_pairs_of_SNPs took only 1.4 days to execute on eight processors, indicating that the time saved by ParallABEL scales linearly with the number of nodes. This suggests that with more SNPs, even more computing time will be saved by ParallABEL.
We have presented the ParallABEL library, which employs parallel computing to reduce the computing time of data-intensive tasks. ParallABEL can run on clustered computers that support LAM/MPI and R. With clustered computers, processors or even personal computers can easily be added as new compute-nodes. ParallABEL runs on both distributed and shared memory architectures, as it was developed with MPI. On a distributed memory architecture, MPI usually uses a computer network for task communication; on a shared memory architecture it does not, so a distributed memory architecture may exhibit more overhead (for example, eight single-core processors versus a single eight-core processor). In our experiments, Type1_parall_by_SNPs took only 6 minutes to execute on a shared memory architecture but 14 minutes on a distributed memory architecture. The shared memory case was tested on a server with two quad-core Intel Xeon(R) (2.8 GHz) processors and 8 GB RAM, running CentOS version 5.4 with Open MPI version 1.4.1.
ParallABEL allows the user to specify the number of processors employed for data execution. We expect computational performance to increase linearly with the number of processors when using Type1_parall_by_SNPs, Type3_parall_by_pairs_of_individuals, and Type4_parall_by_pairs_of_SNPs. In addition, ParallABEL is faster than GenABEL on one processor. Computing times for Type3_parall_by_pairs_of_individuals and Type4_parall_by_pairs_of_SNPs are longer than those for Type1_parall_by_SNPs because their inputs consist of pairs of individuals and pairs of SNPs respectively, which are much larger than the SNP input of Type1_parall_by_SNPs: if the number of SNPs is n, then the number of input items for Type1_parall_by_SNPs is n but the number for Type4_parall_by_pairs_of_SNPs is n*n. ParallABEL can therefore save much more computing time with Type3_parall_by_pairs_of_individuals and Type4_parall_by_pairs_of_SNPs than with Type1_parall_by_SNPs, and as the amount of input data increases, the time saved by ParallABEL also increases. ParallABEL not only reduces the computing time but is also as easy to use as the more conventional GenABEL.
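The n versus n*n scaling can be made concrete with a small calculation (work is counted in abstract per-item units here, an assumption made only to show relative growth): the absolute time saved by moving from one to p processors is proportional to the total work, so pair-wise analyses gain a factor of n more than per-SNP analyses from the same processor count.

```python
# Absolute time saved by parallelizing w work units over p processors:
# serial time w minus parallel time w/p.
def time_saved(work_items, n_procs):
    return work_items - work_items / n_procs

n = 545080                                # NARAC SNP count
per_snp_saving = time_saved(n, 8)         # Type1: n input items
pairwise_saving = time_saved(n * n, 8)    # Type4: n*n input items
```

Whatever the per-item cost, the pair-wise saving exceeds the per-SNP saving by exactly a factor of n, which is why Type3 and Type4 analyses benefit most from adding processors.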
ParallABEL cannot reduce the computing time when the data size is too small, as shown when employing the hom function of Type2_parall_by_individuals, because the computation itself is too short; in that case, the overheads of data partitioning and output merging overwhelm any gain in computational performance.
Operating system(s): Platform independent
Programming language: R
Other requirements: LAM/MPI or Open MPI, Rmpi, GenABEL
License: GPL for non-profit organizations
Any restrictions to use by non-academics: license needed
This research was supported by a grant from the program for Strategic Scholarships for Frontier Research Network for the Joint Ph.D. Program Thai Doctoral degree from the Office of the Higher Education Commission, Thailand; the Thailand Center of Excellence for Life Sciences (TCELS); and Prince of Songkla University, Thailand. The work of YSA was supported by grants from the Netherlands Scientific Foundation (NWO), the Russian Foundation for Basic Research (RFBR), Netherlands Genomics Initiative (NGI) and Centre for Medical Systems Biology (CMSB). We are grateful to Prof. Dr. Amornrat Phongdara and Assoc. Prof. Dr. Wilaiwan Chotigeat for establishing the PSU research group in Bioinformatics. The NARAC data was supported by the GAW grant (R01 GM031575) and the NIH grant that supports a collection of RA data (AR44422). We would like to thank Dr. Jean W. MacCluer and Vanessa Olmo for the permission to use the data, and Dr. Andrew Davison for polishing the written English of the manuscript. We also thank the Thai National Grid Center and Prince of Songkla University Grid Center for supporting the computer clusters used in this research.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.