Figure 1 shows an overview of the full *cuRnet* stack, through which R data is passed to the GPU environment for parallel computation. The input network is represented, in R, as a standard data frame, where each edge between two vertices is stored together with its weight. An R-C++ wrapper, built on the *Rcpp* library, automatically translates the network from the standard R representation into a C++ data structure and links the algorithm invocation from the R environment to the C++ one.

In the C++ environment, the network is first represented as a coordinate list (COO), an intermediate step required to generate the compressed sparse row (CSR) data structure used for the GPU computation. CSR is a well-known storage format that represents graphs compactly and enables high performance during graph traversal on the GPU.
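For intuition, the COO-to-CSR conversion can be sketched directly in R. This is a host-side illustration only: in *cuRnet* the conversion is performed in the C++/CUDA layer, and the variable names below are purely illustrative.

```r
# Host-side illustration of the COO -> CSR conversion (not cuRnet code).
# COO: two parallel vectors of edge sources and destinations (1-based ids).
src <- c(1, 1, 2, 3, 3)
dst <- c(2, 3, 3, 1, 2)
n   <- 3                               # number of vertices

o       <- order(src)                  # sort edges by source vertex
col_idx <- dst[o]                      # CSR column indices
row_ptr <- c(0, cumsum(tabulate(src, nbins = n)))  # CSR row offsets

# The neighbours of vertex v are col_idx[(row_ptr[v] + 1):row_ptr[v + 1]]
```

The `row_ptr` array stores, for each vertex, the offset of its adjacency list inside `col_idx`, which is what makes CSR both compact and cache- and coalescing-friendly for GPU traversal.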

The C++ interface handles the interaction with the GPU device. It generates the host (CPU) representation of the graph from the rows of the data frame, initializes the GPU kernel, manages the host (CPU)-device (GPU) data exchange, and, finally, runs the kernel for the parallel computation. The result is retrieved from the device and passed back to R through the Rcpp/C++ layers.

In what follows we briefly describe the parallel graph traversal algorithms implemented in *cuRnet*. Given a graph *G*(*V*,*E*), with a set *V* of vertices, a set *E* of edges, and a weight function \(w: E \to \mathbb {R} \), *cuRnet* takes *G* as a data frame X with three columns listing, for each edge, its two endpoints and its weight. The data frame can be built from an igraph object or from a textual file (.csv). The following lines invoke the loading of the *cuRnet* package and the construction of the graph data structure:
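A minimal sketch of this step is shown below; the constructor name `cuRnet_graph` follows the online package manual and may differ across versions, and the column names are illustrative.

```r
# Load cuRnet and build the GPU-side graph data structure.
# cuRnet_graph() is the constructor name from the online manual and may
# differ across package versions; the column names are illustrative.
library(cuRnet)

x <- data.frame(from  = c("a", "a", "b"),   # edge sources
                to    = c("b", "c", "c"),   # edge destinations
                score = c(1.0, 2.5, 0.3))   # edge weights

g <- cuRnet_graph(x)
```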

We refer the reader to (https://bitbucket.org/curnet/curnet) for a complete manual of the *cuRnet* usage.

### Parallel implementation of breadth-first search for GPUs

The parallel graph traversal through BFS [10], which is listed and analyzed in Additional file 1: Section 1, Algorithm 1 and Figure S1, respectively, explores the reachable vertices, level by level, starting from a source *s*. *cuRnet* implements the concept of frontier [17] to achieve work efficiency. A frontier holds all and only the vertices visited at a given level. The algorithm checks every neighbour of a frontier vertex to see whether it has already been visited. If not, the neighbour is added to a new frontier. *cuRnet* implements the frontier propagation step through two data structures, *F*_{1} and *F*_{2}. *F*_{1} represents the current frontier, which is read by the parallel threads to start the propagation step. *F*_{2} is written by the threads to generate the frontier for the next BFS step. At each step, *F*_{2} is filtered and swapped into *F*_{1} for the next iteration. When a thread visits an already visited neighbour, that neighbour is eliminated from the frontier. When several threads visit the same neighbour in the same propagation step, they generate duplicate vertices in the frontier. *cuRnet* implements efficient duplicate detection and correction strategies based on hash tables, advanced strategies for coalesced memory accesses, and warp shuffle instructions. Moreover, it implements different strategies to deal with the potential workload imbalance and thread divergence caused by the non-homogeneous structure of real biological networks. These include prefix-sum procedures to efficiently handle frontiers, dynamic virtual warps, dynamic parallelism, multiple CUDA kernels, and techniques for coalesced memory accesses.
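The frontier-swapping scheme described above can be sketched sequentially in plain R; in *cuRnet* the inner loop over the frontier is executed by parallel GPU threads, with hash-table-based duplicate filtering instead of the `unique()` call used here.

```r
# Sequential sketch of the frontier-based BFS described above
# (illustration only; the cuRnet version runs on the GPU in parallel).
# adj: list mapping each vertex id to the vector of its neighbours.
bfs_depths <- function(adj, s) {
  depth    <- rep(NA_integer_, length(adj))
  depth[s] <- 0L
  F1    <- c(s)                       # F1: current frontier
  level <- 0L
  while (length(F1) > 0) {
    F2 <- integer(0)                  # F2: frontier for the next step
    for (u in F1)
      for (v in adj[[u]])
        if (is.na(depth[v])) {        # neighbour not yet visited
          depth[v] <- level + 1L
          F2 <- c(F2, v)
        }
    F1    <- unique(F2)               # filter duplicates, swap F2 into F1
    level <- level + 1L
  }
  depth
}
```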

The BFS result is an *s*×|*V*| matrix, where *s* is the number of source vertices from which the BFS is run. Each entry of the matrix is the depth of the BFS from a source to a graph vertex. The matrix is retrieved from the GPU device to R through the Rcpp/C++ layers. BFS is run by invoking the following *cuRnet* function in the R environment:
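A sketch of the invocation follows; the function name `cuRnet_bfs` and its argument names are assumptions consistent with the package naming, not confirmed by the text.

```r
# Hypothetical invocation: depths is an s x |V| matrix of BFS levels,
# one row per source vertex. Function and argument names are assumed.
sources <- c("a", "b")
depths  <- cuRnet_bfs(g, from = sources)
```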

### Parallel implementation of single-source shortest path for GPUs

The *cuRnet* CUDA implementation of the SSSP algorithm is based on the Bellman-Ford approach [11]. The parallel algorithm is reported in Additional file 1: Section 1. *cuRnet* SSSP visits the graph and computes the shortest-path distance *d* from the source *s* to every vertex of *V*. Also in this case, *cuRnet* exploits the concept of frontier to deal with the most expensive step of the algorithm (i.e., the relax procedure). At each iteration *i*, the algorithm extracts, in parallel, the vertices from one frontier and inserts the active neighbours into the second frontier for the next iteration step. Each iteration concludes by swapping the contents of the second frontier (which becomes the current frontier of the next iteration) into the first one. The frontiers restrict the work to active vertices, i.e., all and only the vertices whose tentative distance has been modified and that, thus, must be considered for the relax procedure at the next iteration.
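The frontier-based relax step can be sketched sequentially in R as follows; in *cuRnet* all outgoing edges of the frontier are relaxed by parallel GPU threads. The sketch assumes a graph without negative cycles, under which the frontier eventually empties and the loop terminates.

```r
# Sequential sketch of frontier-based Bellman-Ford (illustration only).
# edges: data frame with columns from, to, w (1-based vertex ids);
# n: number of vertices; s: source vertex. Assumes no negative cycles.
sssp_bf <- function(edges, n, s) {
  dist    <- rep(Inf, n)
  dist[s] <- 0
  F1 <- c(s)                                 # current frontier
  while (length(F1) > 0) {
    F2 <- integer(0)                         # frontier for next iteration
    for (u in F1) {
      out <- edges[edges$from == u, ]        # outgoing edges of u
      for (k in seq_len(nrow(out))) {
        v <- out$to[k]
        if (dist[u] + out$w[k] < dist[v]) {  # relax procedure
          dist[v] <- dist[u] + out$w[k]
          F2 <- c(F2, v)                     # v becomes active
        }
      }
    }
    F1 <- unique(F2)                         # filter and swap frontiers
  }
  dist
}
```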

The result consists of two numeric matrices (i.e., *distances* and *predecessors*), which are retrieved from the GPU device to R through the Rcpp/C++ layer. They are obtained by invoking the *cuRnet* functions CURNET_SSSP and CURNET_SSSP_DISTS, which return the matrix of shortest paths (as lists of predecessor vertices) and the corresponding source-destination distances, respectively:
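In R the two calls look as follows; the function names are those given above, while the argument names are illustrative.

```r
# Invocation sketch; argument names are illustrative.
sources <- c("a", "b")

# Matrix of shortest paths, returned as lists of predecessor vertices
preds <- CURNET_SSSP(g, from = sources)

# Matrix of the corresponding source-destination distances
dists <- CURNET_SSSP_DISTS(g, from = sources)
```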

### Parallel implementation of strongly-connected components for GPUs

*cuRnet* implements a multi-step approach that applies different GPU-accelerated algorithms for SCC decomposition [12]. The algorithm is reported in Additional file 1: Section 1. The approach consists of three phases. In the first phase, it iterates a trimming procedure to identify and delete the vertices of *G* that form trivial SCCs (i.e., vertices with no active successors or predecessors). In the second phase, it iterates a forward-backward algorithm to identify the main components. The first step is the choice of a pivot for each set, where heuristics can be applied to maximize vertex coverage within a single iteration. The forward and backward closures are then computed from this vertex, and up to four subgraphs are generated. The first one is the component to which the pivot belongs, computed as the intersection of the forward and backward closures. The other three sets are SCC-closed subgraphs that can be processed in parallel at the next iteration. They correspond, respectively, to the vertices of the current set not visited by either closure, to the vertices in the forward closure but not in the backward one, and to the vertices in the backward closure but not in the forward one. In the third phase, the approach runs a *coloring* algorithm to decompose the rest of the graph. A unique color is first assigned to each vertex. The maximum color is then propagated to the non-eliminated successor vertices until no more updates are possible. Pivots are chosen as the vertices whose color is unchanged. By running the backward closure from these vertices on the corresponding set, *cuRnet* detects the components labelled with that color.

The *cuRnet* SCC computation results in a vector associating each vertex with its strongly connected component ID. It is retrieved from the GPU device to R through the Rcpp/C++ layer and obtained by invoking the following *cuRnet* function:
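A sketch of the call is given below; the function name `cuRnet_scc` is an assumption consistent with the package naming and should be checked against the online manual.

```r
# Hypothetical invocation: returns one strongly connected component ID
# per vertex of the graph. The function name is assumed.
scc_ids <- cuRnet_scc(g)
```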