Fast network centrality analysis using GPUs
Zhiao Shi^{1, 2} and Bing Zhang^{3}
DOI: 10.1186/1471-2105-12-149
© Shi and Zhang; licensee BioMed Central Ltd. 2011
Received: 16 November 2010
Accepted: 12 May 2011
Published: 12 May 2011
Abstract
Background
With the exploding volume of data generated by continuously evolving high-throughput technologies, biological network analysis problems are growing larger in scale and demanding ever more computational power. General Purpose computation on Graphics Processing Units (GPGPU) provides a cost-effective technology for the study of large-scale biological networks. Designing algorithms that maximize data parallelism is the key to leveraging the power of GPUs.
Results
We proposed an efficient data parallel formulation of the All-Pairs Shortest Path (APSP) problem, which is the key component of shortest path-based centrality computation. A betweenness centrality algorithm built upon this formulation was developed and benchmarked against the most recent GPU-based algorithm. A speedup of 11-19% was observed in various simulated scale-free networks. We further designed three algorithms based on this core component to compute closeness centrality, eccentricity centrality and stress centrality. To make all these algorithms available to the research community, we developed a software package, gpufan (GPU-based Fast Analysis of Networks), for CUDA-enabled GPUs. A speedup of 10-50× compared with CPU implementations was observed for simulated scale-free networks and real-world biological networks.
Conclusions
gpufan provides a significant performance improvement for centrality computation in large-scale networks. Source code is available under the GNU General Public License (GPL) at http://bioinfo.vanderbilt.edu/gpufan/.
Background
Cellular systems can be modeled as networks, in which nodes are biological molecules (e.g. proteins, genes, metabolites, microRNAs) and edges are functional relationships among the molecules (e.g. protein interactions, genetic interactions, transcriptional regulations, protein modifications, metabolic reactions). In systems biology, network analysis has become an important approach for gaining insights into the massive amount of data generated by high-throughput technologies.
Shortest path-based centrality metrics (Table 1). Here σ_st denotes the number of shortest paths between nodes s and t, σ_st(u) the number of those paths that pass through u, and d(u, v) the shortest path distance between u and v.

Centrality  Equation  Description
Betweenness (BC)  BC(u) = Σ_{s≠u≠t} σ_st(u)/σ_st  fraction of shortest paths between all other nodes that run through node u
Closeness (CC)  CC(u) = (n − 1)/Σ_{v≠u} d(u, v)  reciprocal of average shortest path distance
Eccentricity (EC)  EC(u) = 1/max_{v≠u} d(u, v)  reciprocal of maximum shortest path distance
Stress (SC)  SC(u) = Σ_{s≠u≠t} σ_st(u)  total number of shortest paths between all other nodes that run through u
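As a concrete illustration of the closeness and eccentricity definitions above, here is a minimal serial C++ sketch (plain CPU code for a connected, unweighted graph; the function names are illustrative, not identifiers from the gpufan package):

```cpp
#include <algorithm>
#include <climits>
#include <queue>
#include <vector>

// Single-source BFS distances in an unweighted graph given as an
// adjacency list; unreachable nodes keep INT_MAX.
std::vector<int> bfs_dist(const std::vector<std::vector<int>>& adj, int u) {
    std::vector<int> d(adj.size(), INT_MAX);
    std::queue<int> q;
    d[u] = 0;
    q.push(u);
    while (!q.empty()) {
        int v = q.front(); q.pop();
        for (int w : adj[v])
            if (d[w] == INT_MAX) { d[w] = d[v] + 1; q.push(w); }
    }
    return d;
}

// Closeness: (n - 1) divided by the sum of distances to all other
// nodes, i.e. the reciprocal of the average shortest path distance.
double closeness(const std::vector<std::vector<int>>& adj, int u) {
    std::vector<int> d = bfs_dist(adj, u);
    long long sum = 0;
    for (size_t v = 0; v < d.size(); ++v)
        if ((int)v != u) sum += d[v];
    return (double)(d.size() - 1) / sum;
}

// Eccentricity centrality: reciprocal of the maximum shortest path
// distance from u to any other node.
double eccentricity_centrality(const std::vector<std::vector<int>>& adj, int u) {
    std::vector<int> d = bfs_dist(adj, u);
    return 1.0 / *std::max_element(d.begin(), d.end());
}
```

For example, on the path graph 0-1-2-3, node 1 has distances {1, 0, 1, 2}, giving CC = 3/4 and EC = 1/2.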
Owing to its massive parallel processing capability, General Purpose computation on Graphics Processing Units (GPGPU) provides a more efficient and cost-effective alternative to conventional Central Processing Unit (CPU)-based solutions for many computationally intensive scientific applications [6]. A GPU device typically contains hundreds of processing elements or cores. These cores are grouped into a number of Streaming Multiprocessors (SM). Each core can execute a sequential thread, and the cores perform in SIMT (Single Instruction Multiple Thread) fashion, where all cores in the same group execute the same instruction at the same time. NVIDIA's CUDA (Compute Unified Device Architecture) platform [7] is the most widely adopted programming model for GPU computing. In bioinformatics, GPU-based applications have already been implemented for microarray gene expression data analysis, sequence alignment and simulation of biological systems [8–11].
Parallel algorithms for centrality computation have been developed on various multicore architectures [12–14]. However, as pointed out by Tu et al. [15], challenges such as dynamic non-contiguous memory access, unstructured parallelism, and low arithmetic density pose serious obstacles to efficient execution on such architectures. Recently, several attempts at implementing graph algorithms, including breadth first search (BFS) and shortest path, on the CUDA platform have been reported [16–18]. Two early studies process different nodes of the same level in a network in parallel [16, 17]. Specifically, for the BFS implementation, each node is mapped to a thread. The algorithms progress in levels. Each node being processed at the current level updates the costs of all its neighbors if the existing costs are higher. The algorithms stop when all the nodes are visited. This approach works well for densely connected networks. However, for scale-free biological networks [19], in which some nodes have many more neighbors than others, these approaches can be slower than CPU-only implementations due to load imbalance across thread blocks [18]. A recent study by Jia et al. exploits the parallelism among each node's neighbors to reduce load imbalance across thread blocks and achieves better performance in All-Pairs Shortest Path (APSP) calculation and shortest path-based centrality analysis [18]. However, that APSP algorithm can only use one thread block per SM due to excessive memory duplication, which is an inefficient way of executing thread blocks and may result in low resource utilization [20].
In this paper, we developed a new APSP algorithm that avoids data structure duplication and thus allows scheduling units from different thread blocks to fill the long latency of expensive memory operations. We showed that our algorithm outperformed Jia's algorithm for betweenness centrality computation. Based on the improved APSP algorithm, we developed a software package gpufan (GPU-based Fast Analysis of Networks) for computing four widely used shortest path-based centrality metrics on CUDA-enabled GPUs. Using simulated scale-free networks and real-world biological networks, we demonstrated significant performance improvement for centrality computation using gpufan as compared to CPU implementations.
Implementation
Given a network G = (V, E) with |V| = n and |E| = m, we implemented algorithms for computing the four shortest path-based centrality metrics described in Table 1 on the CUDA platform. There are currently two approaches to computing shortest paths on GPUs. The first processes different nodes of the same level in parallel [17]. The second exploits parallelism at the finest, neighborhood level [18]. Since biological networks typically exhibit a scale-free property [19], the first approach can cause serious load imbalance and thus poor performance. We therefore adopted the second approach in our implementation. Specifically, a network is represented with two arrays; a pair of corresponding elements, one from each array, is an edge in the network. For undirected networks, an edge is represented by two pairs of elements, one for each direction. All four centrality metrics are based on the APSP computation. The APSP algorithm performs a BFS starting from each node. During the BFS, each edge is assigned to a thread. If one end of an edge is updating its distance value, the thread checks the other node and updates its distance value if it has not been visited yet. Each edge (thread) proceeds independently of the others, exploiting the finest level of parallelism to achieve load balance. After all shortest paths are found, each centrality metric is computed with additional GPU kernel function(s) as described in [18]. For betweenness centrality, the implementation is based on a fast serial algorithm [21].
Results and Discussion
We tested both GPU and CPU implementations on a Linux server. The server contains 2 Intel Xeon L5630 processors at 2.13 GHz, each with 4 processing cores, and an NVIDIA Tesla C2050 GPU card (448 CUDA cores, 3 GB device memory). The CPU implementation was single-threaded and coded in C++. The kernel functions in the GPU version were implemented with the CUDA C extension.
Running times on GPU vs. CPU for centrality computations in a randomly generated scale-free network (n = 30,000, β = 50)

Centrality  CPU time (sec)  GPU time (sec)  Speedup
Betweenness (BC)  17777.0  365.5  48.64
Closeness (CC)  3914.7  92.6  42.29
Eccentricity (EC)  3954.1  91.4  43.24
Stress (SC)  16950.1  338.2  50.12
Finally, we tested gpufan on a human protein-protein interaction (PPI) network and a breast cancer gene co-expression network [23]. The human PPI network has 11,660 nodes and 94,146 edges, while the co-expression network has 7,819 nodes and 195,928 edges. Although these two networks have relatively low edge density, we still obtained a speedup of around 10×, as shown in Figures 3(c) and 3(d).
For the computation of betweenness centrality, a two-dimensional array p of size n × n is used to keep predecessor information, where p(i, j) = 1 indicates that a shortest path passes from node i to node j. This prevents our implementation from processing graphs with a very large number of nodes because of the limited global memory size on the GPU. Since this array is likely to be sparse, a sparse matrix representation could reduce memory usage. As future work, we will investigate the use of sparse matrices and their effect on overall performance.
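The sparse-predecessor idea can be illustrated with a serial sketch of Brandes' algorithm [21] that stores, for each node, only the list of its shortest-path predecessors (i.e. the non-zero entries of p) rather than a dense n × n matrix. This is CPU illustration code under that assumption, not the package's GPU kernel:

```cpp
#include <queue>
#include <stack>
#include <vector>

// Betweenness centrality via Brandes' algorithm, keeping predecessors
// in per-node lists (the sparse analogue of the dense matrix p).
std::vector<double> betweenness(const std::vector<std::vector<int>>& adj) {
    int n = adj.size();
    std::vector<double> bc(n, 0.0);
    for (int s = 0; s < n; ++s) {
        std::vector<std::vector<int>> pred(n);   // sparse predecessor lists
        std::vector<long long> sigma(n, 0);      // shortest-path counts
        std::vector<int> d(n, -1);
        std::stack<int> order;                   // nodes by non-decreasing distance
        std::queue<int> q;
        sigma[s] = 1; d[s] = 0; q.push(s);
        while (!q.empty()) {                     // BFS from source s
            int v = q.front(); q.pop();
            order.push(v);
            for (int w : adj[v]) {
                if (d[w] < 0) { d[w] = d[v] + 1; q.push(w); }
                if (d[w] == d[v] + 1) { sigma[w] += sigma[v]; pred[w].push_back(v); }
            }
        }
        std::vector<double> delta(n, 0.0);       // dependency accumulation
        while (!order.empty()) {                 // back-propagate in reverse BFS order
            int w = order.top(); order.pop();
            for (int v : pred[w])
                delta[v] += (double)sigma[v] / sigma[w] * (1.0 + delta[w]);
            if (w != s) bc[w] += delta[w];
        }
    }
    return bc;
}
```

Predecessor storage is then proportional to the number of shortest-path edges rather than n². On the path graph 0-1-2, the scores are {0, 2, 0}; for undirected graphs each unordered pair is counted twice, so halve the values if single counting is desired.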
Conclusions
We developed a software package for computing several shortest path-based centrality metrics on GPUs using the CUDA framework. The algorithms deliver significant speedup for both simulated scale-free networks and real-world biological networks.
Availability and requirements
Project name: gpufan (GPU-based Fast Analysis of Networks)
Project home page: http://bioinfo.vanderbilt.edu/gpufan/
Operating system: Unix/Linux
Programming language: CUDA, C/C++
Other requirements: CUDA Toolkit 3.0 or higher, GPU card with compute capability 2.0 or higher
License: GPL v3
Abbreviations
GPGPU: General Purpose computation on Graphics Processing Units
CUDA: Compute Unified Device Architecture
GPU: Graphics Processing Unit
SIMT: Single Instruction Multiple Thread
SM: Streaming Multiprocessor
APSP: All-Pairs Shortest Path
BFS: Breadth First Search
Declarations
Acknowledgements
This work was supported by the National Institutes of Health (NIH)/National Institute of General Medical Sciences (NIGMS) through grant R01GM088822. This work was conducted in part using the resources of the Advanced Computing Center for Research and Education at Vanderbilt University, Nashville, TN.
References
1. del Rio G, Koschützki D, Coello G: How to identify essential genes from molecular networks? BMC Systems Biology 2009, 3:102. doi:10.1186/1752-0509-3-102
2. Csárdi G, Nepusz T: The igraph software package for complex network research. InterJournal Complex Systems 2006, 1695.
3. Hagberg AA, Schult DA, Swart PJ: Exploring network structure, dynamics, and function using NetworkX. Proceedings of the 7th Python in Science Conference (SciPy2008), Pasadena, CA, USA 2008, 11-15.
4. Gregor D, Lumsdaine A: The Parallel BGL: A generic library for distributed graph computations. Parallel Object-Oriented Scientific Computing (POOSC) 2005.
5. Cong G, Bader D: Techniques for designing efficient parallel graph algorithms for SMPs and multicore processors. Parallel and Distributed Processing and Applications 2007, 137-147.
6. Kirk D, Hwu W: Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann; 2010.
7. NVIDIA: Compute Unified Device Architecture Programming Guide. Santa Clara, CA: NVIDIA; 2010.
8. Buckner J, Wilson J, Seligman M, Athey B, Watson S, Meng F: The gputools package enables GPU computing in R. Bioinformatics 2010, 26:134. doi:10.1093/bioinformatics/btp608
9. Manavski S, Valle G: CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics 2008, 9(Suppl 2):S10. doi:10.1186/1471-2105-9-S2-S10
10. Payne J, Sinnott-Armstrong N, Moore J: Exploiting graphics processing units for computational biology and bioinformatics. Interdisciplinary Sciences: Computational Life Sciences 2010, 2(3):213-220.
11. Dematté L, Prandi D: GPU computing for systems biology. Briefings in Bioinformatics 2010, 11(3):323. doi:10.1093/bib/bbq006
12. Bader D, Madduri K: Parallel algorithms for evaluating centrality indices in real-world networks. International Conference on Parallel Processing (ICPP 2006), IEEE 2006, 539-550.
13. Madduri K, Ediger D, Jiang K, Bader D, Chavarria-Miranda D: A faster parallel algorithm and efficient multithreaded implementations for evaluating betweenness centrality on massive datasets. IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009), IEEE 2009, 1-8.
14. Tan G, Tu D, Sun N: A parallel algorithm for computing betweenness centrality. International Conference on Parallel Processing (ICPP 2009), IEEE 2009, 340-347.
15. Tu D, Tan G: Characterizing betweenness centrality algorithm on multicore architectures. IEEE International Symposium on Parallel and Distributed Processing with Applications 2009, IEEE 2009, 182-189.
16. Harish P, Narayanan P: Accelerating large graph algorithms on the GPU using CUDA. High Performance Computing - HiPC 2007, 197-208.
17. Sriram A, Gautham K, Kothapalli K, Narayan P, Govindarajulu R: Evaluating centrality metrics in real-world networks on GPU. High Performance Computing - HiPC 2009 Student Research Symposium 2009.
18. Jia Y: Large graph simplification, clustering and visualization. PhD thesis. University of Illinois at Urbana-Champaign, Urbana, Illinois; 2010.
19. Barabási A, Oltvai Z: Network biology: understanding the cell's functional organization. Nature Reviews Genetics 2004, 5(2):101-113. doi:10.1038/nrg1272
20. Ryoo S, Rodrigues C, Stone S, Stratton J, Ueng S, Baghsorkhi S, Hwu W: Program optimization carving for GPU computing. Journal of Parallel and Distributed Computing 2008, 68(10):1389-1401. doi:10.1016/j.jpdc.2008.05.011
21. Brandes U: A faster algorithm for betweenness centrality. Journal of Mathematical Sociology 2001, 25(2):163-177. doi:10.1080/0022250X.2001.9990249
22. Barabási A, Albert R: Emergence of scaling in random networks. Science 1999, 286(5439):509. doi:10.1126/science.286.5439.509
23. Shi Z, Derow C, Zhang B: Co-expression module analysis reveals biological processes, genomic gain, and regulatory mechanisms associated with breast cancer progression. BMC Systems Biology 2010, 4:74. doi:10.1186/1752-0509-4-74
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.