Heterogeneous computing architecture for fast detection of SNP-SNP interactions
© Sluga et al.; licensee BioMed Central Ltd. 2014
Received: 14 November 2013
Accepted: 19 June 2014
Published: 25 June 2014
The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within modern Graphics Processing Units (GPUs) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, the MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested.
We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Using them resulted in execution times an order of magnitude shorter than those of the single-threaded CPU implementation. The GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi 5110P coprocessor, but also requires considerably more programming effort.
General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. The new MIC architecture, on the other hand, lags in performance but reduces the programming effort and compensates with a more general architecture suitable for a wider range of problems.
We are witnessing a dramatic shift in the design of personal computer systems, where speedups are achieved by porting the parallel traits of supercomputers into the world of personal computing. Modern computers are heterogeneous platforms with many different types of computational units, including central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), coprocessors and custom acceleration logic. Today’s CPUs contain from two to twelve cores, each capable of executing multiple instructions per clock cycle. Assisting the CPU, graphics processing units usually render 3D graphics, but can also provide a general-purpose computing platform. Current GPUs are designed as massively parallel processors offering substantially more computing power than CPUs. GPUs are the most powerful computational hardware available at an affordable price [1, 2]. The availability of general-purpose GPUs with computing abilities in commodity laptop and desktop computers has generated wide interest, including applications in bioinformatics [3–9].
The newest addition to commodity parallel processing hardware is the Intel Xeon Phi family of coprocessors, designed for computationally intensive applications. Xeon Phi implements Intel’s Many Integrated Core (MIC) architecture and offers a theoretical performance similar to that of modern GPUs, but promises easier porting of existing software to the new architecture. Tianhe-2, currently the world’s fastest supercomputer, has 48 000 Xeon Phi coprocessors.
Many computational problems in bioinformatics require substantial computational resources. Problems that can be computed with a high degree of parallel and independent processing are most suited for heterogeneous massively parallel hardware. Our aim was to investigate how these modern architectures cope with problems that are typical for bioinformatics, such as the problem of SNP-SNP interaction detection. As a proof-of-concept, we focused on a parallel implementation of the computational core of the web application SNPsyn that exploits heterogeneous processing resources: multi-core CPUs, GPUs, and the new MIC coprocessors.
SNPsyn computes the information gain exhaustively across all SNP pairs to avoid missing any pair where SNPs on their own provide no information about the phenotype under study. Because the number of pairs is quadratic in the number of SNPs, the exhaustive search quickly becomes computationally intractable for commodity computer systems. The information-theoretic detection of SNP-SNP interactions has a high degree of data parallelism and requires much more processing power than memory storage. This makes it a strong candidate for processing on modern massively parallel architectures.
Below we describe the SNP-SNP interaction scoring approach we use in SNPsyn and discuss its implementation on CPU, CUDA and MIC architectures. Our particular concern is to evaluate Intel’s new MIC architecture and compare its advantages against the currently prevailing CUDA architecture.
SNP-SNP interaction scoring
Computation of marginal probabilities q(x), q(y), q(p) and joint probability distributions q(x,p), q(y,p), q(x,y), q(x,y,p) requires a single scan through case and control samples. The number of joint probability distributions q(x,y) and q(x,y,p) that need to be determined grows quadratically with the number of SNPs. This ensures enough computational load to compensate for the memory transfer costs and makes the scoring well suited to implementation on parallel hardware.
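The scoring formulas themselves are not reproduced in this text, so the following is only a minimal sketch under stated assumptions. Information gain is taken to be the mutual information I(X;P) between a SNP and the phenotype, and synergy is taken to be I(X,Y;P) - I(X;P) - I(Y;P), in the spirit of Anastassiou. Genotypes coded as 0, 1, 2 and a binary phenotype coded as 0/1 are also assumptions, and the names PairScore, CountTable, mutual_information and score_pair are hypothetical rather than part of SNPsyn. The sketch illustrates how a single scan through the samples fills every count table needed for one pair.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type; the names are illustrative, not SNPsyn's API.
struct PairScore {
    double gain_x;   // I(X;P)
    double gain_y;   // I(Y;P)
    double gain_xy;  // I(X,Y;P)
    double synergy;  // I(X,Y;P) - I(X;P) - I(Y;P)  (assumed definition)
};

// counts[state][phenotype]: number of samples with that state and phenotype.
using CountTable = std::vector<std::array<double, 2>>;

// Mutual information (in bits) between the row variable and the phenotype.
inline double mutual_information(const CountTable& counts) {
    double p_tot[2] = {0.0, 0.0};
    for (const auto& row : counts) {
        p_tot[0] += row[0];
        p_tot[1] += row[1];
    }
    const double n = p_tot[0] + p_tot[1];
    if (n == 0.0) return 0.0;
    double mi = 0.0;
    for (const auto& row : counts) {
        const double s_tot = row[0] + row[1];
        for (int p = 0; p < 2; ++p) {
            if (row[p] > 0.0) {
                // q(s,p) * log2( q(s,p) / (q(s) q(p)) ), written with counts.
                mi += (row[p] / n) * std::log2(row[p] * n / (s_tot * p_tot[p]));
            }
        }
    }
    return mi;
}

// One scan over the samples fills the tables for X, Y and the pair (X,Y).
inline PairScore score_pair(const std::vector<std::uint8_t>& x,
                            const std::vector<std::uint8_t>& y,
                            const std::vector<std::uint8_t>& pheno) {
    CountTable cx(3), cy(3), cxy(9);  // value-initialised to zero
    for (std::size_t i = 0; i < pheno.size(); ++i) {
        cx[x[i]][pheno[i]] += 1.0;
        cy[y[i]][pheno[i]] += 1.0;
        cxy[3 * x[i] + y[i]][pheno[i]] += 1.0;  // branch-free joint index
    }
    PairScore s{};
    s.gain_x = mutual_information(cx);
    s.gain_y = mutual_information(cy);
    s.gain_xy = mutual_information(cxy);
    s.synergy = s.gain_xy - s.gain_x - s.gain_y;
    return s;
}
```

For an XOR-like pattern such as x = {0,0,1,1}, y = {0,1,0,1}, pheno = {0,1,1,0}, each SNP alone carries no information while the pair carries one bit, so the synergy under this definition is 1. This is the kind of pair an exhaustive search is meant not to miss.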
Permutation analysis is used to evaluate the significance of results on true data. The data are randomly shuffled thirty times. Each time, information gain and synergy are calculated for all pairs to obtain the null distribution, which is then used to determine the significance of results on true data. Details on permutation analysis are described in Curk et al.
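The permutation scheme described above can be sketched as follows. This is an illustration under assumptions, not the SNPsyn implementation: it assumes that the shuffled quantity is the vector of phenotype labels, that the scoring routine is supplied by the caller, and that significance is read off as a standard empirical p-value. The names Labels, ScoreFn, null_distribution and empirical_p_value are hypothetical, as is the fixed seed.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <vector>

using Labels = std::vector<std::uint8_t>;
// Caller-supplied routine returning one score per tested SNP pair.
using ScoreFn = std::function<std::vector<double>(const Labels&)>;

// Shuffle the phenotype labels `rounds` times and pool the resulting scores.
inline std::vector<double> null_distribution(Labels pheno,
                                             const ScoreFn& score_all_pairs,
                                             int rounds = 30,
                                             unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::vector<double> null_scores;
    for (int r = 0; r < rounds; ++r) {
        std::shuffle(pheno.begin(), pheno.end(), rng);
        const std::vector<double> scores = score_all_pairs(pheno);
        null_scores.insert(null_scores.end(), scores.begin(), scores.end());
    }
    return null_scores;
}

// Empirical p-value with the usual +1 correction (an assumed estimator).
inline double empirical_p_value(double observed,
                                const std::vector<double>& null_scores) {
    const auto at_least =
        std::count_if(null_scores.begin(), null_scores.end(),
                      [observed](double s) { return s >= observed; });
    return (1.0 + static_cast<double>(at_least)) /
           (1.0 + static_cast<double>(null_scores.size()));
}
```

Every permutation repeats the full pairwise scan, so thirty permutations multiply the already quadratic workload by roughly thirty. This is one more reason the scoring core benefits from parallel hardware.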
Parallel implementations of interaction scoring
Calculations are performed in parallel for as many pairs of SNPs as allowed by the hardware. We took special care to efficiently use the GPU and Xeon Phi hardware. We minimized memory transfers between the main CPU and the coprocessors to avoid bottlenecks and vectorized the code wherever possible. We optimized the number of threads running on the GPU to maximize throughput. To cope with the memory limitation of the GPU, SNPsyn includes optional heuristics to quickly estimate the importance of SNPs and reduce the data set prior to analysis. In the following sections we present the implementation details regarding both architectures.
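One common way to hand independent SNP pairs to many workers is to number the pairs linearly and let each worker convert its index back to a pair (i, j). The sketch below shows that mapping together with a simple split of the index space into contiguous chunks. It is an assumption about how such a distribution could look, not a description of the SNPsyn kernels, and pair_count, row_offset, pair_from_index, Range and chunk_bounds are hypothetical names.

```cpp
#include <cmath>
#include <cstdint>
#include <utility>

// Number of unordered pairs among n SNPs.
inline std::uint64_t pair_count(std::uint64_t n) { return n * (n - 1) / 2; }

// Linear index of the first pair (i, i+1) in row i of the upper triangle.
inline std::uint64_t row_offset(std::uint64_t i, std::uint64_t n) {
    return i * (2 * n - i - 1) / 2;
}

// Map a linear index k in [0, pair_count(n)) to the pair (i, j), i < j.
inline std::pair<std::uint64_t, std::uint64_t> pair_from_index(std::uint64_t k,
                                                               std::uint64_t n) {
    // Floating-point estimate of the row, then exact integer correction.
    const double b = 2.0 * static_cast<double>(n) - 1.0;
    double disc = b * b - 8.0 * static_cast<double>(k);
    if (disc < 0.0) disc = 0.0;
    std::uint64_t i = static_cast<std::uint64_t>((b - std::sqrt(disc)) / 2.0);
    while (i > 0 && row_offset(i, n) > k) --i;
    while (row_offset(i + 1, n) <= k) ++i;
    return {i, i + 1 + (k - row_offset(i, n))};
}

// Half-open range [begin, end) of pair indices assigned to one worker.
struct Range {
    std::uint64_t begin;
    std::uint64_t end;
};

// Assumes total * n_workers fits in 64 bits, which holds for GWAS-sized inputs.
inline Range chunk_bounds(std::uint64_t worker, std::uint64_t n_workers,
                          std::uint64_t total) {
    return {total * worker / n_workers, total * (worker + 1) / n_workers};
}
```

For 660 000 SNPs, pair_count gives 217 799 670 000 pairs, which shows why the index type has to be 64 bits wide. A worker would loop over its Range and call pair_from_index once per index, or once per chunk followed by incremental stepping.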
GPU and CUDA
GPUs gain their computational power from the numerous processing cores packed into one chip. For example, the modern Nvidia Tesla K20 GPU has 13 streaming multiprocessors, each containing 192 computational units called CUDA cores. These cores lack sophisticated control units and are thus likely to work best when executing the same instruction on many data elements in parallel with no divergent program paths in the algorithm. A programmer sees the GPU as a parallel coprocessor and can use it to speed up computationally intensive parts of the algorithm. Of course, there must be enough data parallelism in the code to make it worthwhile.
Different tools are available for programming GPUs. Nvidia offers the CUDA toolkit for programming its own products. It includes a proprietary compiler and a set of libraries that extend the C++ syntax with parallel programming constructs. Another popular option is the OpenCL framework. It supports hardware from different vendors but usually lags slightly in terms of performance when compared to specialized development kits such as CUDA.
Xeon Phi and MIC
Intel designed the Xeon Phi family of coprocessors around the new MIC architecture to compete with GPUs specialized in general-purpose computing. The design follows a different approach from that of GPUs. A coprocessor consists of many simple but fully functional processor cores derived from the Intel Pentium architecture. Intel improved the original design by adding a 512-bit wide vector unit and Hyper-Threading Technology. This enables Xeon Phi to achieve theoretical performance similar to that of modern GPUs. The model 5110P, which we used in this study, includes sixty cores interconnected with a bidirectional ring bus. Each core is capable of running four threads in parallel. The cores fetch data from the 8 GB of on-board RAM and communicate with the host CPU through the PCIe bus. In contrast to GPUs, each core on a Xeon Phi can efficiently execute the code even if threads do not follow the same program path. This makes it suitable for a wider range of problems, including multiplications of sparse matrices, and operations on trees and graphs.
Table: Comparison of parallel computer architecture platforms with key aspects from the viewpoint of software development
Columns: x86/x64 single CPU | Intel Xeon Phi
Rows: Required programming skills | Lines of code* | Architecture specific optimizations
Cell text: CUDA Toolkit or OpenCL framework | Intel compiler suite | Windows, Linux, Mac OSX | Linux (RedHat and SuSE), Windows | Recommended optimizations using | Extensive documentation, many | Bugs in drivers, documentation needs
We benchmarked SNPsyn on a workstation with two six-core Intel Xeon E5-2620 2.00 GHz CPUs capable of running up to twenty-four threads in parallel, 64 GB of RAM, two Nvidia Tesla K20 general-purpose computing cards with 5 GB of RAM each and one Intel Xeon Phi 5110P coprocessor with 8 GB of RAM. The operating system was CentOS 6.4.
The single-thread CPU configuration takes more than 30 days to analyze the data on 660 000 SNPs and 1 000 subjects. Running twelve threads in parallel, one on each of the CPU cores, speeds up the computation by a factor of 10 and reduces the execution time to approximately 3 days. Increasing the number of threads to twenty-four reduces the analysis time to around 2 days, with the speedup peaking at 12.8 compared to the single-thread configuration. A memory bottleneck is the main reason for this poor speedup, which is far below the theoretical value of 24. Interestingly, similar speedups are achieved on all (smaller) data sets, meaning that there is enough data parallelism to keep the CPU busy.
The Nvidia K20 considerably reduces execution times: the analysis of the largest data set takes only around 17 hours, a speedup of 42 in comparison to a single CPU thread. Sharing the work between both GPU cards doubles the speedup and reduces the execution time to 8 hours. Increasing the number of subjects leads to a noticeable decrease in speedup, as more data is transferred between the main memory and the GPU. On the other hand, increasing the number of SNPs introduces more data parallelism into the computations, which is reflected in an improved speedup.
Xeon Phi is positioned between the K20 and the CPU-only implementation. It achieves a speedup of nearly 20 on the largest data set, so the analysis runs for a day and a half, which is double the time needed on a K20. The speedup behaves similarly for Xeon Phi as for K20: it increases with the number of SNPs and decreases with the number of subjects. This confirms that the drop is caused by transferring larger amounts of data without introducing additional parallelism.
Using only CPUs to analyze the data is infeasible except for small data sets, since the computations can take days to complete even on multiple cores. Xeon Phi provides a considerable performance boost, with a maximum speedup of nearly 20 and ample on-board memory to store the data. Nvidia K20 clearly outperforms every other configuration in terms of speed and is the best choice when one wants to cut execution times as much as possible. This comes at the price of cumbersome programming and less on-board memory, which limits the size of the data.
Table: Technical specification of hardware platforms
Columns: Intel Xeon E5-2620 | Nvidia Tesla K20 | Intel Xeon Phi 5110P
Rows: Number of transistors | Peak power consumption | Single precision floating point performance
Cell text: 64 GB can be expanded
We investigated how modern heterogeneous architectures cope with a selected computational problem typical for bioinformatics. The proof-of-concept implementation of SNPsyn on heterogeneous systems greatly reduces the (wall-clock) time needed for analysis of large GWAS data sets. GPUs proved to be a mature platform that offers a large amount of computing power to address inherently parallel problems, but is demanding for the programmer. A user who is only interested in using SNPsyn to analyze their data will profit the most by having multiple GPUs in their system. The new MIC architecture greatly eases programming but lags behind GPUs in performance. Its ease of programming combined with good performance has a lot to offer to developers who do not want to spend too much time optimizing their algorithms. Moreover, MIC is a general platform capable of tackling a wider range of more complex problems. This makes it a promising platform for more complex analyses of SNP-SNP interactions, such as adjustment for covariates.
Availability and requirements
Project name: SNPsyn
Project home page: http://snpsyn.biolab.si
Operating systems: Linux, Windows, Mac OS
Programming language: C++
Other requirements: CUDA 2.0 or higher, Intel Composer XE 2013 or newer, make
License: GNU GPLv3
Restrictions to use by non-academics: none
BZ and TC were supported by the Slovenian Research Agency (ARRS, P2-0209). UL and DS were supported by the Slovenian Research Agency (ARRS, P2-0241).
- Owens JD, Houston M, Luebke D, Green S, Stone JE, Phillips JC: GPU computing. Proceedings of the IEEE. 2008, New York, USA: IEEE, 879-899.
- Nickolls J, Dally WJ: The GPU computing era. IEEE Micro. 2010, 30 (2): 56-69.
- Greene CS, Sinnott-Armstrong NA, Himmelstein DS, Park PJ, Moore JH, Harris BT: Multifactor dimensionality reduction for graphics processing units enables genome-wide testing of epistasis in sporadic ALS. Bioinformatics. 2010, 26 (5): 694-695. 10.1093/bioinformatics/btq009.
- Liu Y, Schmidt B, Maskell D: CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Res Notes. 2010, 3 (1): 93-104. 10.1186/1756-0500-3-93.
- Zhou Y, Liepe J, Sheng X, Stumpf MPH: GPU accelerated biochemical network simulation. Bioinformatics. 2011, 27 (6): 874-876. 10.1093/bioinformatics/btr015.
- Ueki M, Tamiya G: Ultrahigh-dimensional variable selection method for whole-genome gene-gene interaction analysis. BMC Bioinformatics. 2012, 13 (1): 72. 10.1186/1471-2105-13-72.
- Yung LS, Yang C, Wan X, Yu W: GBOOST: a GPU-based tool for detecting gene–gene interactions in genome–wide case control studies. Bioinformatics. 2011, 27 (9): 1309-1310. 10.1093/bioinformatics/btr114.
- Kam-Thong T, Czamara D, Tsuda K, Borgwardt K, Lewis CM, Erhardt-Lehmann A, Hemmer B, Rieckmann P, Daake M, Weber F, Wolf C, Ziegler A, Pütz B, Holsboer F, Schölkopf B, Müller-Myhsok B: EPIBLASTER-fast exhaustive two-locus epistasis detection strategy using graphical processing units. Eur J Hum Genet. 2011, 19 (4): 465-471. 10.1038/ejhg.2010.196.
- Kam-Thong T, Azencott C-A, Cayton L, Pütz B, Altmann A, Karbalai N, Sämann PG, Schölkopf B, Müller-Myhsok B, Borgwardt KM: GLIDE: GPU-based linear regression for detection of epistasis. Hum Hered. 2012, 73 (4): 220-236. 10.1159/000341885.
- Chrysos G, Engineer SP: Intel® Xeon Phi coprocessor (codename Knights Corner). Proceedings of the 24th Hot Chips Symposium, HC. 2012, Stanford, USA: Stanford University.
- Courtland R: Intel strikes back [news]. Spectrum, IEEE. 2013, 50 (8): 14.
- Payne JL, Sinnott-Armstrong NA, Moore JH: Exploiting graphics processing units for computational biology and bioinformatics. Interdiscip Sci Comput Life Sci. 2010, 2 (3): 213-220. 10.1007/s12539-010-0002-4.
- Curk T, Rot G, Zupan B: SNPsyn: detection and exploration of SNP-SNP interactions. Nucleic Acids Res. 2011, 39 (2): 444-449.
- Anastassiou D: Computational analysis of the synergy among multiple interacting genes. Mol Syst Biol. 2007, 3 (83): 1-8.
- Cohen J, Garland M: Solving computational problems with GPU computing. Comput Sci Eng. 2009, 11 (5): 58-63.
- Stone JE, Gohara D, Shi G: OpenCL: a parallel programming standard for heterogeneous computing systems. Comput Sci Eng. 2010, 12 (3): 66.
- Lindholm E, Nickolls J, Oberman S, Montrym J: NVIDIA Tesla: a unified graphics and computing architecture. IEEE Micro. 2008, 28 (2): 39-55.
- Saule E, Kaya K, Çatalyürek Ümit V: Performance evaluation of sparse matrix multiplication kernels on Intel Xeon Phi. Parallel Processing and Applied Mathematics. 2014, Berlin, Germany, 559-570.
- Liu X, Smelyanskiy M, Chow E, Dubey P: Efficient sparse matrix-vector multiplication on x86-based many-core processors. Proceedings of the 27th International ACM Conference on International Conference on Supercomputing. ICS ’13. 2013, New York: ACM, 273-282.
- Gao T, Lu Y, Zhang B, Suo G: Using the Intel Many Integrated Core to accelerate graph traversal. Int J High Perform Comput Appl. 2014, doi:10.1177/1094342014524240.
- Cramer T, Schmidl D, Klemm M, an Mey D: OpenMP programming on Intel® Xeon Phi™ coprocessors: an early performance comparison. Proceedings of the Many-core Applications Research Community (MARC) Symp. at RWTH Aachen University. 2012, Aachen, Germany: RWTH Aachen University, 38-44.
- Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007, 447 (7145): 661-678.
- Zhu Z, Tong X, Zhu Z, Liang M, Cui W, Su K, Li MD, Zhu J: Development of GMDR-GPU for gene-gene interaction analysis and its application to WTCCC GWAS data for type 2 diabetes. PLoS ONE. 2013, 8 (4): 61943. 10.1371/journal.pone.0061943.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.