cuRnet: an R package for graph traversing on GPU

Background: R has become the de-facto reference analysis environment in Bioinformatics. Plenty of tools are available as packages that extend the R functionality, and many of them target the analysis of biological networks. Several algorithms for graphs, which are the most adopted mathematical representation of networks, are well-known examples of applications that require high-performance computing, and for which classic sequential implementations are becoming inadequate. In this context, parallel approaches targeting GPU architectures are becoming pervasive to deal with execution-time constraints. Although R packages for parallel execution on GPUs are already available, none of them provides graph algorithms.

Results: This work presents cuRnet, an R package that provides a parallel GPU implementation of the breadth-first search (BFS), single-source shortest paths (SSSP), and strongly connected components (SCC) algorithms. The package allows offloading compute-intensive applications to GPU devices for massively parallel computation, speeding up runtimes by up to one order of magnitude with respect to standard sequential computation on CPU. We have tested cuRnet on a benchmark of large protein interaction networks and for the interpretation of high-throughput omics data through network analysis.

Conclusions: cuRnet is an R package to speed up graph traversal and analysis through parallel computation on GPUs. We show the efficiency of cuRnet applied both to biological network analysis, which requires basic graph algorithms, and to complex existing procedures built upon such algorithms.

Electronic supplementary material: The online version of this article (10.1186/s12859-018-2310-3) contains supplementary material, which is available to authorized users.


Algorithm 1 Parallel breadth-first search algorithm

 1: for all vertices u ∈ V(G) do
 2:   d(u) ← ∞
 3: d(s) ← 0, F1 ← {s}, F2 ← ∅
 4: level ← 1
 5: while F1 ≠ ∅ do
 6:   parallel for vertices u ∈ F1 do
 7:     u ← dequeue(F1)
 8:     parallel for vertices v ∈ adj[u] do
 9:       if d(v) = ∞ then
10:         d(v) ← level
11:         enqueue(F2, v)
12:       end
13:     end
14:   level ← level + 1
15:   swap(F1, F2)
16: end

We selected the BFS for GPUs proposed in [1] (BFS-4K), since it is currently the most efficient BFS implementation at the state of the art. Figure 1 shows the results of a comparison we performed between BFS-4K, Gunrock [2], and B40C [3]. The network dataset was assembled by selecting different graphs from the University of Florida Sparse Matrix Collection [4], the 10th DIMACS Challenge [5], and the SNAP dataset [6]. It is important to note that this work targets many-core architectures (GPUs) since, for graph traversal, they reach higher speedups w.r.t. multi-core architectures at low cost [1], [2], [3]. Indeed, cuRnet can be used on any HW/SW architecture equipped with a standard and inexpensive graphics processing unit (e.g., an NVIDIA GPU). Our future work consists of extending the acceleration concept to other more complex, high-end architectures (e.g., multi-node GPUs or multi-node many-cores like those adopted in [7]). The parallel SSSP algorithm implemented in cuRnet is based on the Bellman-Ford approach [8]. It can be derived from Algorithm 1 by replacing lines 8-12 with a Bellman-Ford relaxation of the outgoing edges (see the sketch below). Finally, the parallel SCC algorithm implemented in cuRnet is described in Algorithm 2. It implements a multi-step approach that applies different GPU-accelerated algorithms for SCC decomposition [9].
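To make the structure of Algorithm 1 concrete, the following is a minimal sequential R sketch of the level-synchronous, frontier-based traversal; the GPU implementation parallelizes the two inner loops across threads. The adjacency-list representation and the commented Bellman-Ford relaxation variant are illustrative assumptions, not cuRnet code.

```r
# Level-synchronous BFS mirroring Algorithm 1 (sequential sketch).
# 'adj' is a list mapping each vertex id to an integer vector of neighbours.
bfs_levels <- function(adj, s) {
  n <- length(adj)
  d <- rep(Inf, n)          # line 2: d(u) = Inf for all u
  d[s] <- 0                 # line 3: d(s) = 0
  F1 <- c(s)                # current frontier
  level <- 1                # line 4
  while (length(F1) > 0) {  # line 5: while F1 is not empty
    F2 <- integer(0)        # next frontier
    for (u in F1) {                 # line 6 (parallel for on GPU)
      for (v in adj[[u]]) {         # line 8 (parallel for on GPU)
        if (is.infinite(d[v])) {    # line 9: neighbour not yet visited
          d[v] <- level             # line 10
          F2 <- c(F2, v)            # line 11: enqueue into F2
        }
      }
    }
    level <- level + 1      # line 14
    F1 <- F2                # line 15: swap(F1, F2)
  }
  d
}

# For SSSP (Bellman-Ford), lines 8-12 become an edge-relaxation step,
# conceptually:
#   for (v in adj[[u]])
#     if (d[u] + w(u, v) < d[v]) { d[v] <- d[u] + w(u, v); enqueue(F2, v) }

# Example on a small directed graph: 1 -> 2, 1 -> 3, 2 -> 4, 3 -> 4
adj <- list(c(2L, 3L), c(4L), c(4L), integer(0))
bfs_levels(adj, 1L)  # 0 1 1 2
```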

Data
We used the STRING dataset [10], which mainly contains Protein-Protein Interaction (PPI) networks of several organisms, varying from microbes to eukaryotes. Using the R package STRINGdb, we retrieved the undirected networks related to Homo sapiens, Danio rerio, and Zea mays (Figures 2, 3 and 4 show their properties). These three organisms belong to the set of core species of STRING, which guarantees the reliability of their networks. The human PPI is the smallest one, with 19k vertices and 11M edges, and the maize network is the largest one, with 33k vertices and 34M edges. The networks show differences in the degree distribution, as well as in the distribution of the STRING score assigned to their edges. STRING assigns to each interaction a functional score, combining multiple sources of evidence, ranging from 0 to 1000. Thus, in addition to the complete PPIs, supplementary networks were extracted by applying thresholds to edge scores. A threshold of 900 was applied to discard edges with lower scores, producing a sparse network of highly functional connections [10]. An intermediate threshold of 200 was applied to remove weakly supported predicted interactions.
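As an illustration, a PPI network can be retrieved and thresholded with the STRINGdb package roughly as follows. This is a minimal sketch: the STRING version string and the empty input directory are assumptions, and the score threshold matches the high-confidence cut-off described above.

```r
library(STRINGdb)   # Bioconductor package used in the text
library(igraph)

# Human PPI (NCBI taxonomy id 9606); score_threshold = 900 keeps only
# high-confidence functional interactions, as described above.
string_db <- STRINGdb$new(version = "10", species = 9606,
                          score_threshold = 900, input_directory = "")

g <- string_db$get_graph()   # returns the PPI network as an igraph object
vcount(g); ecount(g)         # inspect network size
```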
The STRING package provides a real-case example reporting differential expression values for the treatment of A549 lung cancer cells with Resveratrol, a natural phytoestrogen found in red wine and a variety of plants that has been shown to have protective effects against the disease. The example refers to the GEO (Gene Expression Omnibus) study GSE9008. We used such data to label the above-described PPI networks according to the p-values (see Figure 5). Figure 6 reports the properties of a dataset of directed unlabelled homology networks built on the complete set of 115 archaea species from STRINGdb. The homology information between proteins is measured by BLAST sequence alignments. For each protein, STRING reports the best BLAST hits w.r.t. a given species. The number of edges (best hits) increases with the number of vertices (proteins), and about 75% of the edges represent bidirectional hits. The final network, composed of the proteins of all the 114 species, has 229k vertices and more than 9M edges. Degree distributions show a prevalence of low values, which indicates the presence of strain-specific genes. The outgoing degree distribution is bounded by the number of species (114), while much higher connectivity is observed for incoming degrees.
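The labelling step can be reproduced, as a sketch, with the example data shipped with the STRINGdb package; the final mapping of p-values onto igraph vertex attributes is an illustrative assumption, not the paper's exact procedure.

```r
library(STRINGdb)
library(igraph)

string_db <- STRINGdb$new(version = "10", species = 9606,
                          score_threshold = 0, input_directory = "")
g <- string_db$get_graph()

# Differential expression data from the GSE9008 Resveratrol study,
# distributed with the STRINGdb package.
data(diff_exp_example1)
mapped <- string_db$map(diff_exp_example1, "gene",
                        removeUnmappedRows = TRUE)

# Label each network vertex with the p-value of its protein.
V(g)$pvalue <- mapped$pvalue[match(V(g)$name, mapped$STRING_id)]
```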

Performance
Algorithm 2 Parallel strongly connected components decomposition. The algorithm iteratively trims trivial SCCs (Trim), computes forward and backward reachable sets (Fwd-Reach, Bwd-Reach) from pivot vertices, colours each vertex of the resulting SCC (F ∩ B) with its pivot, and recursively partitions the remaining subsets (F \ B, B \ F, S \ (F ∪ B)).

Regarding BFS (Figure 16), the device with the Maxwell architecture outperforms the Tesla device; however, the less recent device also shows good speed-ups, up to 10x, w.r.t. igraph. Tests concerning the SCC search show similar results, with some exceptions in which the Tesla architecture outperforms the Maxwell one due to the very small workloads (small homology networks), as shown in Figure 18. Tests over PPI networks (see Figure 17), regarding the computation of shortest-path distances, show that the architectural difference between the two devices may result in different slopes of the running times, which produce differences in the speedup curves. Finally, Figure 19 reports the PCSF subnetworks obtained by running PCSF accelerated with the cuRnet SSSP function to analyse diffuse large B-cell lymphoma (DLBCL). Based on gene expression profiling studies, DLBCL can be divided into two subgroups, the germinal center B-cell (GCB) and the activated B-cell like (ABC), with different clinical outcomes and responses to therapies [11, 12].
In particular, Figure 19 shows subnetworks for GCB patients. It highlights the activation of the PI3K/Akt/mTOR signalling pathway (cluster in red) and the overexpression of germinal center markers such as BCL6, LMO2, MME (CD10) and MYBL1, confirming the findings reported in [13, 14].
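For reference, the following sketch shows how the traversal primitives discussed above might be called from R alongside igraph. The cuRnet function names and signatures used in the comments (cuRnet_graph, cuRnet_bfs, cuRnet_sssp, cuRnet_scc) are assumptions inferred from the package description, not verified API; only the igraph calls are standard.

```r
library(igraph)
# library(cuRnet)  # assumed package name; calls below are assumptions

# A small weighted directed graph expressed as a data frame of edges.
edges <- data.frame(from = c("a", "a", "b", "c"),
                    to   = c("b", "c", "d", "d"),
                    w    = c(1.0, 4.0, 2.0, 1.0))

# CPU baseline with igraph: single-source shortest-path distances.
g_cpu <- graph_from_data_frame(edges)
distances(g_cpu, v = "a", weights = E(g_cpu)$w, mode = "out")

# Hypothetical GPU counterpart with cuRnet (names are assumptions):
# g_gpu <- cuRnet_graph(edges)      # copy the graph to the GPU device
# cuRnet_bfs(g_gpu, from = "a")     # parallel BFS (Algorithm 1)
# cuRnet_sssp(g_gpu, from = "a")    # parallel Bellman-Ford SSSP
# cuRnet_scc(g_gpu)                 # parallel SCC (Algorithm 2)
```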