GWAS on your notebook: fast semi-parallel linear and logistic regression for genome-wide association studies
© Sikorska et al.; licensee BioMed Central Ltd. 2013
Received: 21 December 2012
Accepted: 24 April 2013
Published: 28 May 2013
Genome-wide association studies have become very popular for identifying genetic contributions to phenotypes. Millions of SNPs are tested for association with diseases and traits using linear or logistic regression models. This conceptually simple strategy encounters the following computational issues: a large number of tests and very large genotype files (many gigabytes) which cannot be loaded into memory directly. One solution applied on a grand scale is cluster computing, involving large-scale resources. We show how to speed up the computations using matrix operations in pure R code.
We improve speed: a computation time of 6 hours is reduced to 10-15 minutes. Our approach can handle an essentially unlimited number of covariates efficiently, using projections. Data files in GWAS are vast and reading them into computer memory becomes an important issue. However, much improvement can be made if the data are structured beforehand in a way that allows easy access to blocks of SNPs. We propose several solutions based on the R packages ff and ncdf.
We adapted the semi-parallel computations for logistic regression. We show that in a typical GWAS setting, where SNP effects are very small, we do not lose any precision and our computations are a few hundred times faster than standard procedures.
We provide very fast algorithms for GWAS written in pure R code. We also show how to rearrange SNP data for fast access.
For the benefit of readers who are not familiar with genome-wide association studies we provide a brief introduction to this area.
There are many ways to investigate the influence of genes on (human) traits. One of them, genome-wide association studies (GWAS), exploits the fact that strings of DNA contain many small variations, called single-nucleotide polymorphisms (SNPs), which may influence the level of traits or the risk of having a disease. Modern micro-array technology makes it possible to measure the genotypes of a million SNPs in one go, at a reasonable price, using only one drop of blood. In large epidemiological studies this has been done for large to very large groups of individuals, for which (many) phenotypes have been measured too. SNPs that are found to be influential may point to relevant genes. This approach has been applied on a grand scale. The number of results published on GWAS is rapidly increasing: the GWAS catalogue includes over 1400 papers on newly discovered important SNPs.
Typically, the number of genotyped SNPs is around half a million. However, it is possible to impute the most probable genotypes for real or hypothetical SNPs using spatial correlation on the genome. This way, the number of SNPs analyzed in a GWAS can grow to 2.5 or even 30 million.
The statistical model used in GWAS is rather basic: univariate linear or logistic regression of phenotype on genotype, for each SNP in turn, correcting for covariates like age, height and gender. Large sample sizes are required to detect very small effects at the very strict “GWA-significance level” of 5×10⁻⁸, the common 0.05 divided by one million (inspired by the Bonferroni correction for that many tests). The goal is to find SNPs whose p-value survives this conservative multiple testing correction.
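As a baseline, this per-SNP analysis can be sketched as a loop over lm fits. This is our own minimal illustration on simulated data (variable names are illustrative), not the code of any particular GWAS package:

```r
set.seed(1)
n <- 500; m <- 50                            # individuals, SNPs
y   <- rnorm(n)                              # phenotype
age <- rnorm(n); sex <- rbinom(n, 1, 0.5)    # covariates
S   <- matrix(runif(n * m, 0, 2), n, m)      # SNP doses in [0, 2]
pvals <- numeric(m)
for (j in 1:m) {
  fit <- lm(y ~ S[, j] + age + sex)          # one regression per SNP
  pvals[j] <- summary(fit)$coefficients[2, 4]  # p-value of the SNP term
}
```

The loop itself is the bottleneck: lm is called once per SNP, which is exactly what the semi-parallel approach avoids.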
Dedicated software is available to support these analyses; popular examples are GenABEL, PLINK, Mach2qtl [5, 6] and ProbABEL. Computation times are long. An example from the literature is a GWAS with a continuous trait for 6000 individuals and 2.5 mln SNPs, which on “a regular computer” takes around 6 hours. This time will increase dramatically with larger sample sizes and/or more SNPs to test. Additionally, logistic regression is more computationally demanding than linear regression. Based on the available published materials it is actually quite difficult to assess computation times: information about available memory, the number of processors/cores used, and the size of the model (the number of covariates) is usually not provided.
GWAS may be computationally demanding, but the problem is “embarrassingly parallel”, meaning that it can be distributed over as many processors as desired, by simply splitting the work into groups of SNPs. This brute force approach with computing clusters is now being applied broadly, with GRIMP as an example from our institution.
We show that huge speed gains can be achieved by simple rearrangements of linear model computations, exploiting fast matrix operations. We call this the “semi-parallel” approach, to set it apart from parallel computation on multiple processors. A similar idea can be found in , in the framework of expression quantitative trait loci (eQTL) analysis. That paper focuses on computing R2 statistics, to get a first insight into the data. We are more ambitious: we want to reproduce very closely the results of “traditional” GWAS software. Present-day GWAS practice is focused on very low p-values, regardless of the amount of variance that the SNPs actually explain. Thus, we apply large matrix operations to compute estimates, standard errors and p-values for GWAS with a continuous outcome.
There is a second challenge: reading the data quickly enough from a disk into computer memory. A key issue is to rearrange them in such a way that arbitrary blocks of SNPs (containing all individuals) can be accessed very quickly. We show how to pre-process data for this goal.
The bottom line is that a GWAS for one million SNPs and 10k individuals can be done on an average notebook computer within 15 minutes. This is the time needed for the pure computations; accounting for the time needed to load the data, the total analysis time increases to about 25 minutes.
Semi-parallelization of GWAS with a binary outcome is more difficult. Parameters in logistic regression are estimated by maximum likelihood, which, unlike the least squares approach, is an iterative procedure. However, we were able to find an approximate way to provide odds ratios for the SNP effect using semi-parallel computations.
The paper is written in a tutorial-like manner: we gradually extend the complexity of the problem, showing step by step how to speed up computations using simple tricks in R. The goal is not to present a package (there is none), but to introduce a new way of thinking about large-scale GWAS computation and to provide code that anyone can easily integrate into existing systems.
Data, real and simulated
A GWAS is based on very large numbers of SNPs, for many thousands of individuals, leading to very large data files. Observed genotypes generally are coded as the number of reference alleles: 0, 1 or 2. Very efficient storage is possible, using only 2 bits per SNP (per individual). The program PLINK uses this approach to store genotypes in its BED file format. The package snpStats mimics it for storing SNPs in computer memory. This is quite attractive: 100k SNPs (we use k as shorthand for thousand) for 10k individuals can be stored in a quarter of a gigabyte.
In large data sets one may expect some values of the traits and/or genotypes to be missing. Typically genotypes with a call rate (percentage of measured genotypes in the sample) below 95% will be removed from the analysis.
The recent GWAS practice is to use genotype imputation. The commonly used MACH [5, 6] program does two things: it imputes missing SNPs within genotyped markers and predicts untyped markers. The result of imputation is the expected dose, a non-integer number between 0 and 2. In this case more room is needed to store the values, and files with imputation results are much larger. Those that MACH produces are ASCII files, using six positions per number (three decimals). A more compact representation is possible: there is no need for such precision, and by multiplying the dose by 100 we can store an integer between 0 and 200 in one (unsigned) byte. This is still four times larger than for raw genotypes.
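The one-byte-per-dose idea can be sketched with base R connections. This is a toy illustration of the storage trick, not the paper's exact code; the file name is a temporary one:

```r
dose <- c(0.123, 1.987, 0.5)                 # imputed doses in [0, 2]
byte <- as.integer(round(100 * dose))        # integers 0..200 fit in one unsigned byte
tmp  <- tempfile()
con  <- file(tmp, "wb")
writeBin(byte, con, size = 1)                # write 1 byte per genotype
close(con)
con  <- file(tmp, "rb")
back <- readBin(con, integer(), n = 3, size = 1, signed = FALSE) / 100
close(con)
back                                         # doses recovered to two decimals
```

Reading with signed = FALSE is essential, since values above 127 would otherwise come back negative.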
In our experience, SNP data are stored in such a way that all SNPs for one individual form one record. We call this structure “row per person”. It is useful for random reading of (blocks of) individuals, but selection of certain (blocks of) SNPs is time consuming. Essentially one has to read the complete records for all persons and keep only the required selection of SNPs. This has to be repeated for every block of SNPs one considers.
It is much more attractive to have each record represent one SNP, as measured for all individuals (“row per SNP”). Achieving this, given a “row per person” organization, is an important part of the enterprise. It would not be an issue if all data fitted into fast random-access memory, but this is usually not the case, as we are talking about 10 to 100 Gb.
There is no need for real data when discussing computation times. Instead we simulate genotypes as random numbers from a uniform distribution between 0 and 2. Phenotypes and covariates are simulated as independent variables drawn from a standard normal distribution. In our simulations we set the sample size to a typical GWAS scenario, namely 10k individuals. The number of simulated SNPs is 1000, limited by the available RAM.
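The simulation setup just described can be written in a few lines (sizes as stated above; variable names are ours):

```r
set.seed(123)
n <- 10000                                # individuals
m <- 1000                                 # SNPs per block, limited by RAM
S <- matrix(runif(n * m, 0, 2), n, m)     # genotype doses, uniform on [0, 2]
y <- rnorm(n)                             # phenotype, standard normal
X <- matrix(rnorm(n * 5), n, 5)           # five independent covariates
```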
In this section we present semi-parallel computation, using the R programming language as the vehicle for implementation (R version 2.15). We report computation speeds, as achieved on a single PC running Windows XP on an Intel E8400 (3.00 GHz) with 3.2 GB of RAM. We report user times provided by the R function proc.time. User time is defined as the CPU time charged for the execution of user instructions of the calling process.
A simple benchmark for comparison with other computers is the time needed for the singular value decomposition (SVD) of a 1000 × 1000 random matrix. On our computer this takes about 5 seconds in R.
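The benchmark is a one-liner; the 5-second figure applies to the machine described above and will differ elsewhere:

```r
set.seed(1)
A <- matrix(rnorm(1000 * 1000), 1000, 1000)
system.time(svd(A))["elapsed"]   # about 5 seconds on the PC described above
```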
Our goal is to report computation times in a standardized way, such that they can easily be recalculated for different numbers of individuals and/or SNPs. Computation time for GWAS is linear in the sample size and in the number of investigated SNPs. We express speed in “sips”, standing for “SNP-individuals per second”: the product of the number of individuals and the number of SNPs, divided by the time needed for a computation. Conversely, dividing the product of the number of individuals and the number of SNPs by the speed gives the number of seconds needed for a job. One should keep in mind that, due to the imprecision of proc.time and its variability from run to run, the calculated times and speeds are only approximate; they indicate the order of magnitude of the gains. Because of the size of the numbers involved, we will exclusively use Msips, meaning one million sips.
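The conversion is simple enough to capture in a helper function (our own convenience wrapper, not part of any package):

```r
# speed in sips = (individuals * SNPs) / seconds; Msips = sips / 1e6
msips <- function(n_ind, n_snp, seconds) n_ind * n_snp / seconds / 1e6

msips(10000, 2.5e6, 300)   # a 5-minute run on 10k individuals and 2.5 mln SNPs: ~83 Msips
```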
Regression without additional covariates
This means that around 5 minutes are needed to analyze a GWAS with 2.5 mln SNPs and 10k individuals. However, this is an unrealistic scenario, without covariates, and we have not calculated the p-values yet. We will now discuss the needed extensions.
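For this simplest case, all slopes can be obtained in one matrix operation. A minimal sketch of the semi-parallel idea on simulated data (variable names are ours; centering the phenotype and each SNP column accounts for the intercept):

```r
set.seed(1)
n <- 1000; m <- 200                      # individuals, SNPs
y <- rnorm(n)                            # phenotype
S <- matrix(runif(n * m, 0, 2), n, m)    # SNP doses
y0 <- y - mean(y)                        # center the phenotype
S0 <- sweep(S, 2, colMeans(S))           # center every SNP column
b  <- as.vector(crossprod(S0, y0)) / colSums(S0^2)   # all m slopes at once
```

The result is algebraically identical to calling lm once per SNP; only the arithmetic is rearranged into matrix products.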
Regression with covariates
Speeds are 45, 25 and 13 Msips for 2, 10 and 30 covariates respectively, about 70 times faster than using lm.
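The covariates are handled by projection: residualize the phenotype and all SNP columns on the covariate matrix once, then apply the simple no-covariate formula to the residuals. A sketch (the QR-based projection is our implementation choice; the result follows from the Frisch-Waugh-Lovell theorem):

```r
set.seed(2)
n <- 1000; m <- 100; k <- 5
y <- rnorm(n)
S <- matrix(runif(n * m, 0, 2), n, m)
X <- cbind(1, matrix(rnorm(n * k), n, k))  # intercept plus k covariates
Q  <- qr.Q(qr(X))                          # orthonormal basis of the covariate space
ry <- y - Q %*% crossprod(Q, y)            # phenotype residuals
rS <- S - Q %*% crossprod(Q, S)            # SNP residuals, all columns at once
b  <- as.vector(crossprod(rS, ry)) / colSums(rS^2)   # all SNP effects
```

The projection is computed once, regardless of the number of SNPs, which is why many covariates cost so little.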
Standard errors and p-values
Given the large sample size, the calculation of the p-values assumes that the test statistic has a normal distribution. We use the lower tail of the normal distribution: 2 * pnorm(-abs(b / err)). It is not advisable to use the textbook definition 2 * (1 - pnorm(abs(b / err))), because for very small p-values the subtraction from 1 suffers from severe rounding errors.
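Standard errors follow from the same residualized quantities, again without any loop over SNPs. A self-contained sketch under the same simulated setup as before (the degrees of freedom are n minus the k + 2 fitted parameters):

```r
set.seed(3)
n <- 1000; m <- 100; k <- 5
y <- rnorm(n)
S <- matrix(runif(n * m, 0, 2), n, m)
X <- cbind(1, matrix(rnorm(n * k), n, k))
Q  <- qr.Q(qr(X))
ry <- as.vector(y - Q %*% crossprod(Q, y))
rS <- S - Q %*% crossprod(Q, S)
ss <- colSums(rS^2)
b  <- as.vector(crossprod(rS, ry)) / ss
rss  <- sum(ry^2) - b^2 * ss               # residual sum of squares, per SNP
err  <- sqrt(rss / (n - k - 2) / ss)       # standard errors
pval <- 2 * pnorm(-abs(b / err))           # lower tail: no rounding problems
```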
Speed in Msips for the linear model (estimates, standard errors and p-values) with k covariates for the functions lm, lsfit and semi-parallel (SP)
We tested our code on another PC, with an Intel Xeon X5550 (2.67 GHz), 24 GB of RAM and the 64-bit version of R. This machine was around 1.4 times faster than our PC, but the ratios of the speeds remained similar: the semi-parallel approach is 60-80 times faster than looping over lm.
The semi-parallel algorithm does not allow missing values: a single NA in either the phenotype vector or the SNP matrix will produce NA in the vector of estimates.
Incomplete phenotypes are easy to handle: we can exclude those individuals from the whole analysis. Missing genotypes are more problematic. In general, missing data can be handled by weighted least squares estimation, with weights 0 and 1 for missing and available observations. However, the weights then vary between SNPs, and the semi-parallel approach for the model with covariates can no longer be applied.
The model is logistic regression of the binary outcome on the SNP dose (and covariates), logit(p) = β0 + β1 s, with p representing the probability of “success”.
The speed is 0.2 Msips, which is four times slower than fitting a regression model to a continuous outcome.
where z_w and s_w are the weighted means z_w = Σ_i w_i z_i / Σ_i w_i and s_w = Σ_i w_i s_i / Σ_i w_i, respectively.
The speed is 55 Msips, which is 275 times faster than using glm.
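A minimal sketch of the one-step idea: fit the null model (covariates only) once with glm, then compute all SNP effects from its working residuals and IRLS weights. This is our own simplified version: the SNPs are centered only at their weighted means (i.e. the intercept is projected out in the weighted metric), whereas the full algorithm also projects on the covariates, so for covariates correlated with the SNPs the approximation is cruder than in the paper:

```r
set.seed(4)
n <- 2000; m <- 50
x <- rnorm(n)                               # one covariate
y <- rbinom(n, 1, plogis(-0.5 + 0.3 * x))   # binary outcome; SNP effects ~ 0
S <- matrix(runif(n * m, 0, 2), n, m)       # SNP doses
null <- glm(y ~ x, family = binomial)       # fitted once, not once per SNP
p <- fitted(null)
w <- p * (1 - p)                            # IRLS weights of the null fit
sw  <- as.vector(crossprod(S, w)) / sum(w)  # weighted SNP means s_w
S0  <- sweep(S, 2, sw)                      # center at the weighted means
den <- as.vector(crossprod(S0^2, w))        # sum_i w_i (s_i - s_w)^2, per SNP
b   <- as.vector(crossprod(S0, y - p)) / den  # one-step estimates
err <- 1 / sqrt(den)                          # approximate standard errors
```

Because typical SNP effects are tiny, one Newton step from the null fit is practically indistinguishable from the full maximum likelihood estimate.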
Speed in Msips for the logistic model (estimates, standard errors and p-values) with k covariates for the functions glm and semi-parallel (SP)
Organization of the SNP data
Our semi-parallel algorithms substantially reduce computation times. However, before we can apply them we need to load the data into computer memory. Loading is always an issue, of course, but it is not critical as long as the computations themselves are slow.
We assume that we have limited memory available, say 2 to 4 Gb. With 64 bit operating systems, 64 bit R and expensive hardware, it is possible to build a system that can have all data in memory. We do not expect the reader to be that lucky. Instead we assume that we will read in blocks of SNPs of reasonable size.
Data loading entails not only CPU but also I/O times. That is why in this section we only focus on the elapsed time provided by proc.time. This is the clock time measured from the start of the operation until its completion.
We propose different solutions depending on the type of genotypes we are dealing with (observed or imputed). We show how to deal efficiently with PLINK data formats. For imputed dosage (MACH) files, we discuss the difficulties of loading them into the structure necessary for the fast computation algorithms. We describe two R packages, ff and ncdf, which we found most useful for tackling this problem.
Observed genotypes in PLINK format
Once “bad” SNPs have been removed we can apply a fast computation algorithm. We analyzed a model with correction for 25 covariates. The association scan for our example data file was finished within one minute.
Imputed genotypes in MACH format
MACH files are larger than PLINK files and may contain hundreds of thousands of SNPs written “row per person” in text files. On a computer without a large amount of RAM we will not be able to read the whole data file into R; we have to work with blocks of SNPs. The “row per person” structure is very inefficient if we want to read only a group of SNPs (say 1000) for all individuals. Its transpose, the “row per SNP” structure, would make it possible for the function scan to create a matrix with a block of SNPs for all persons. But even then, reading 1000 SNPs for 10000 individuals takes around 13 seconds. For a genome with 2.5 mln SNPs we would need around 9 hours just to bring the data into R.
There are other, faster ways to deal with large data files in R. One possibility is to work with binary files: saving and reading them is easily done with writeBin and readBin. However, these functions work on vectors, which is not an optimal solution for us: saving all genotypes individual by individual would not allow easy access to blocks of SNPs later on.
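The “row per SNP” idea can nevertheless be sketched with plain binary files, by writing one fixed-size record per SNP and seeking to the block we want. This is our own toy layout, not the paper's exact code:

```r
set.seed(5)
n <- 500; m <- 200                              # individuals, SNPs
S <- matrix(as.integer(round(100 * runif(n * m, 0, 2))), n, m)  # byte-coded doses
tmp <- tempfile()
con <- file(tmp, "wb")
for (j in 1:m) writeBin(S[, j], con, size = 1)  # one n-byte record per SNP
close(con)
# read an arbitrary block of SNPs (here SNPs 51..60) for all individuals
con <- file(tmp, "rb")
seek(con, where = 50 * n)                       # each SNP occupies exactly n bytes
block <- matrix(readBin(con, integer(), n = 10 * n, size = 1, signed = FALSE), n, 10)
close(con)
```

Because every record has the same length, the byte offset of any SNP block is known in advance, which is what makes random access fast; the ff and ncdf solutions below build the same idea into more convenient packages.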
Reading goes very fast and is done within 30 seconds.
The elapsed times for reading are similar to those of ncdf.
GWAS computations can be made easy. We have shown that they can be rearranged as large matrix operations, performing 60-80 times faster for linear regression and up to 300 times faster for logistic regression. The algorithms can be written in pure R and do not exceed 20 lines of code.
Fast computations demand fast access to the data and this is actually a harder problem. Not all SNP data fit in memory at the same time. They have to be read in as blocks containing all individuals and selections of SNPs. In practice data are not organized in this way, but as records that contain all SNPs for each individual. We have shown two ways to rearrange data, in a preliminary step, to make fast access possible. Our first solution uses the standardized netCDF file format. It has the advantage that the files can be exchanged easily between computers, operating systems and programming languages. Our second solution uses memory mapped files, as implemented in the package ff. It is the fastest solution and it is easy to use, but it is less portable than netCDF.
We believe that we have presented here an attractive solution to computations for relatively large GWAS, on modest hardware, using pure R code. Our algorithms are still “embarrassingly parallel”: it is trivial to divide the task over multiple machines, each working on a different block of SNPs. However, when using the package SNOW to exploit multiple processors in one PC, we discovered that it took so much time to load the data into the separate processes that it was not worth the effort.
Using the many processors on modern graphics cards looks like an attractive road to explore. We feel that we are still in a transition phase, in which easily accessible libraries for R are not yet available. At the time of writing, most available packages are tied to Nvidia GPUs and need special installation procedures. We have not yet explored this approach.
A more complicated case of weighting is encountered when one corrects for correlation between individuals. Because the relationship matrix has as many rows and columns as there are individuals, this poses a real challenge. Several solutions have been proposed in the literature. More research is needed to determine whether they can be combined with our semi-parallel approach.
GWAS for static phenotypes is only one important case. Much more challenging are longitudinal data, in which multiple measurements per individual are available. In general the number of measurements varies between persons, as do the times of observation. One has to use linear mixed models, which entail heavy computational loads. A typical mixed model with 10k observations takes about 1 second to fit, implying a speed of 0.01 Msips, more than 100 times slower than a linear model. The need for fast computations for longitudinal data, and a few approximate procedures, have been described in the literature. We are working on algorithms involving large matrix operations for massive fitting of linear mixed models. We have had some successes, but much remains to be done. We will report on this subject in due time.
Availability and requirements
Project name: GWASP
Project home page: https://bitbucket.org/ksikorska/gwasp
Operating systems: Platform independent
Programming language: R
Other requirements: R 2.15 or higher + ncdf and ff packages
Licence: GPL licence
The authors would like to thank anonymous reviewers for their helpful and constructive comments.
- Pearson T, Manolio T: How to interpret a genome-wide association study. JAMA. 2008, 299 (11): 1335-1344. 10.1001/jama.299.11.1335.
- Hindorff L, MacArthur J, Morales J, Junkins H, Hall P, Klemm A, Manolio T: A catalog of published genome-wide association studies. [http://www.genome.gov/gwastudies]
- Aulchenko Y, Ripke S, Isaacs A, Van Duijn C: GenABEL: an R library for genome-wide association analysis. Bioinformatics. 2007, 23 (10): 1294-1296. 10.1093/bioinformatics/btm108.
- Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, Bender D, Maller J, Sklar P, De Bakker P, Daly M: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81 (3): 559-575. 10.1086/519795.
- Li Y, Willer C, Ding J, Scheet P, Abecasis G: MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol. 2010, 34 (8): 816-834. 10.1002/gepi.20533.
- Li Y, Willer C, Sanna S, Abecasis G: Genotype imputation. Annu Rev Genomics Hum Genet. 2009, 10: 387. 10.1146/annurev.genom.9.081307.164242.
- Aulchenko Y, Struchalin M, Van Duijn C: ProbABEL package for genome-wide association analysis of imputed data. BMC Bioinformatics. 2010, 11: 134. 10.1186/1471-2105-11-134.
- Estrada K, Abuseiris A, Grosveld F, Uitterlinden A, Knoch T, Rivadeneira F: GRIMP: a web- and grid-based tool for high-speed analysis of large-scale genome-wide association using imputed data. Bioinformatics. 2009, 25 (20): 2750-2752. 10.1093/bioinformatics/btp497.
- Shabalin A: Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics. 2012, 28 (10): 1353-1358. 10.1093/bioinformatics/bts163.
- Agresti A: Categorical Data Analysis. 2002, Wiley-Interscience.
- Feero WG, Guttmacher AE, Manolio TA: Genomewide association studies and assessment of the risk of disease. N Engl J Med. 2010, 363 (2): 166-176. 10.1056/NEJMra0905980.
- Adler D, Gläser C, Nenadic O, Zucchini W, Oehlschlägel J: ff: memory-efficient storage of large data on disk and fast access functions. 2012, [http://CRAN.R-project.org/package=ff] [R package version 2.2-7]
- Pierce D: ncdf: Interface to Unidata netCDF data files. 2011, [http://CRAN.R-project.org/package=ncdf] [R package version 1.6.6]
- Clayton D: snpStats: SnpMatrix and XSnpMatrix classes and methods. 2012, [http://www-gene.cimr.cam.ac.uk/clayton] [R package version 1.6.0]
- Lippert C, Listgarten J, Liu Y, Kadie C, Davidson R, Heckerman D: FaST linear mixed models for genome-wide association studies. Nat Methods. 2011, 8 (10): 833-835. 10.1038/nmeth.1681.
- Sikorska K, Rivadeneira F, Groenen PFJ, Hofman A, Uitterlinden AG, Eilers PHC, Lesaffre E: Fast linear mixed model computations for genome-wide association studies with longitudinal data. Stat Med. 2013, 32 (1): 165-180. 10.1002/sim.5517.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.