RNA-editing events detected by REDItools are reported in a machine-readable tabular format, shown for example in Fig. 1, in which each row is an editing site characterised by the chromosome name (column Region) and the genomic coordinate (column Position). Each editing position also carries additional attributes: the reference nucleotide (column Reference), the list of observed substitutions (column AllSubs), the RNA-editing frequency, i.e., the editing level per site (column Frequency), the number and base count of reads supporting the position (columns Coverage-q25 and BaseCount[A,C,G,T], where q25 refers to the minimum quality score of a base), the strand (column Strand, where 1 indicates the plus strand, 0 the minus strand and 2 an undefined strand) and the mean quality of supporting nucleotides (column MeanQ).
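For instance, such a table can be loaded and filtered with a few lines of Python; this is a minimal sketch in which the file name is hypothetical and we assume a tab-separated layout with Frequency expressed as a fraction:

```python
import pandas as pd

# Hypothetical file name; we assume the REDItools output is tab-separated,
# with the column names listed above and Frequency expressed as a fraction.
table = pd.read_csv("reditools_output.txt", sep="\t")

# Keep candidate A-to-G substitutions (the signature of A-to-I editing)
# supported by at least 10% of the reads covering the site.
candidates = table[table["AllSubs"].str.contains("AG")
                   & (table["Frequency"] >= 0.1)]
print(candidates[["Region", "Position", "Frequency"]].head())
```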
The identification of RNA-editing events in a classical RNA-seq experiment (comprising at least 50-60 million reads) with the first implementation of REDItools [13] can take up to 300 hours on a standard machine with a single core and 2 GB of RAM. Consequently, REDItools are not suitable for analysing large datasets from public repositories (e.g., GTEx or TCGA). To overcome this issue, we have completely redesigned the main algorithm and adapted it to HPC infrastructures in order to benefit from multiple parallel cores.
Our novel software, called HPC-REDItools, introduces at least three major novelties over the previous implementation:
1. Data loading optimization: the code has been fully rewritten to optimise data loading and improve computational speed even when launched in serial mode;
2. Dynamic Interval Analysis: a novel algorithm, called Dynamic Interval Analysis (DIA), addresses the problem of high-density genomic regions and improves workload balance across processes;
3. Parallelization: the current code is HPC-aware and supports parallel analysis on thousands of HPC nodes.
Below we describe these algorithmic novelties in detail, illustrating the speed improvements over the original REDItools implementation.
HPC-aware implementation
The original REDItools software is written in Python (version 2.7) and, for its biological purpose (RNA-editing detection), employs the pysam module, a wrapper of the widespread SAMtools package [16] for reading and manipulating large raw BAM files containing multiple alignments of transcriptome reads onto a reference genome (composed of several sequences corresponding to chromosomes). REDItools can speed up the identification of editing events using multi-threading, in which each thread (available core) independently analyses a complete chromosome or part of it. Such pseudo-parallelism is strongly limited because REDItools run on single machines and, thus, the number of useful processes and analyses is restricted to the number of available cores (typically in the range from 2 to 30 on modern CPUs). In addition, REDItools do not take into account the density of reads per chromosome, so some processes may be more computationally stressed than others.
HPC-REDItools, instead, have been designed to be highly parallel in order to run on HPC infrastructures, which are becoming more and more accessible to researchers worldwide, and to take advantage of multiple computing nodes. HPC-REDItools are again written in Python (to increase portability and for continuity with the previous version) and make use of the mpi4py library [17] (version 2.0.0), the binding of the Message Passing Interface (MPI) standard library for the Python programming language. This library enables the power of multi-node computing and gives access to point-to-point and collective communication primitives (e.g., send/receive, scatter/gather and so on) directly from native Python code.
The general architecture of HPC-REDItools follows a simple master/slave template. As shown in Fig. 2, the MPI program (yellow rectangle) takes as input the BAM file (see Fig. 2a). A master process M splits the whole input genome into a set of genomic intervals D and then dispatches intervals to the n parallel slave processes by sending a COMPUTE message to each free slave, until all processes are kept busy. Each slave process S (Fig. 2d) performs the analysis on the assigned interval and produces an intermediate temporary file with candidate RNA-editing events (see Fig. 2e). To promote process recycling, whenever a slave process S completes its analysis, it notifies the master process M by sending a DONE message; the master in turn assigns a new interval to S (if any). When all intervals have been analysed, the master process M sends a FINISH message to each slave process S, notifying the end of the computation so that they can gracefully exit. A final procedure (Fig. 2f) collects all intermediate results (temporary files) and creates a single, unified output file with all potential RNA-editing events (Fig. 2g). HPC-REDItools also accept personalised intervals for advanced and fine-grained control over the computation. However, we have implemented the DIA algorithm (see Algorithm 1 in the “Dynamic interval analysis” section), which exploits the point-wise coverage of a given BAM file to return an optimal set of intervals and guarantee a more balanced workload across processes.
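To make the exchange concrete, the following is a minimal mpi4py sketch of the master/slave protocol described above, not the actual HPC-REDItools source: make_intervals and analyse_interval are hypothetical placeholders for the interval generator (e.g., DIA) and the per-interval analysis, and the tag values are arbitrary.

```python
from mpi4py import MPI

COMPUTE, DONE, FINISH = 0, 1, 2          # message tags (illustrative values)

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:                            # master process M
    intervals = make_intervals()         # hypothetical: e.g., the DIA output D
    status = MPI.Status()
    busy = 0
    # Seed each slave with an initial interval until all are kept busy.
    for slave in range(1, comm.Get_size()):
        if intervals:
            comm.send(intervals.pop(0), dest=slave, tag=COMPUTE)
            busy += 1
        else:
            comm.send(None, dest=slave, tag=FINISH)
    # Recycle slaves as they report DONE; send FINISH when no work remains.
    while busy > 0:
        comm.recv(source=MPI.ANY_SOURCE, tag=DONE, status=status)
        slave = status.Get_source()
        if intervals:
            comm.send(intervals.pop(0), dest=slave, tag=COMPUTE)
        else:
            comm.send(None, dest=slave, tag=FINISH)
            busy -= 1
else:                                    # slave process S
    status = MPI.Status()
    while True:
        interval = comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
        if status.Get_tag() == FINISH:
            break
        analyse_interval(interval)       # hypothetical: writes a temporary file
        comm.send(None, dest=0, tag=DONE)
```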
Data loading optimization
A relevant novelty of HPC-REDItools is the optimization of data loading, which represented a bottleneck of the previous REDItools version. This optimization improves the overall speed per genomic interval and has a positive impact on parallelism. As explained above, the original REDItools implementation relies on the well-known Python module Pysam [16] (version 0.13) to extract and manipulate aligned reads in SAM/BAM format. To look for RNA-editing events, REDItools inspect the entire genome iteratively, position by position, invoking the mpileup function of Pysam to calculate statistics about the transcriptome reads supporting each position. During the traversal, the mpileup function loads each scanned read as many times as its length, causing l disk accesses per read, where l is the read length (see Fig. 3, left).
To overcome this limitation, individual genomic positions can simply be explored by traversing the aligned reads sequentially, with no need for the mpileup function. Indeed, HPC-REDItools access the disk only once for each read: as a new genomic position is encountered, they dynamically load from disk all the novel mapped reads in the input BAM file which start at that position, caching the information as needed for future reuse (see Fig. 3, right). Note that as soon as a read no longer intersects the position currently under analysis, it is discarded and the corresponding memory freed, which also contributes to a minimal memory footprint. As a result, the number of disk accesses is decreased by a factor equal to l and the overall analysis time is dramatically reduced.
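A minimal sketch of this single-pass caching strategy is shown below, assuming a coordinate-sorted, indexed BAM file and reducing the per-position statistics to a simple read count; positions_with_coverage is an illustrative helper, not the actual HPC-REDItools code.

```python
import pysam

def positions_with_coverage(bam_path, contig, start, end):
    """Yield (position, coverage) pairs, loading each read from disk once."""
    bam = pysam.AlignmentFile(bam_path, "rb")
    reads = bam.fetch(contig, start, end)    # one disk pass over the interval
    cache = []                               # reads overlapping the current position
    next_read = next(reads, None)
    for pos in range(start, end):
        # Load each read exactly once, when its start position is reached.
        while next_read is not None and next_read.reference_start <= pos:
            cache.append(next_read)
            next_read = next(reads, None)
        # Discard reads that no longer intersect the current position,
        # freeing the corresponding memory.
        cache = [r for r in cache if r.reference_end > pos]
        if cache:
            yield pos, len(cache)            # per-position statistics go here
    bam.close()
```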
Dynamic interval analysis
The parallelism of the original REDItools version consists of splitting the genome into n equal-sized intervals, where n is the number of processes specified at program invocation. However, genomic intervals of RNA-seq experiments are characterised by extremely uneven coverage (density of mapped reads) and, when treated equally, result in an unbalanced workload across processes with sub-optimal performance (cf. Fig. 4a and b). In particular, splitting highly-variable coverage data into equal-sized intervals causes computational bottlenecks, with fast processes for low-coverage intervals and very slow processes for high-density intervals. In the worst case, a single process might monopolise the entire time slice allocated to the whole job because of an interval with an extremely high number of supporting reads. Since many biological datasets produced by high-throughput sequencing technologies are characterised by highly-variable coverage, ad-hoc solutions are needed.
HPC-REDItools introduce a novel strategy called Dynamic Interval Analysis (DIA). This strategy aims to find an optimal interval division which guarantees a balanced processing time across processes while keeping the number of intervals comparable to the number of processes. Formally, the goal is to find a set D of n intervals (n fixed) such that the processing time T(I) of each interval I is approximately the same, that is:
$$D = \left\{ I_{k} \mid T\left(I_{1}\right) \approx T\left(I_{2}\right) \approx \dots \approx T\left(I_{n}\right) \right\} $$
This is equivalent to saying that \(T(I_{i}) \approx c\) for every interval \(I_{i}\), meaning that the time to process a given interval is constant and equal to \(T(G)/n\), where T(G) is the total time required to analyse the whole genome G in a serial fashion. Since the function T is not known, it is necessary to find the best estimate \(\tilde{T}\) of T which is able to predict the execution time over an interval. First, we define the time needed to analyse an interval I as the sum of the single contributions of each position i in the interval, thus reducing the problem to estimating the processing time of single positions:
$$\tilde{T}(I) = \sum\limits_{i \in I}{t(i)} $$
where t(i) is a function that estimates the processing time of the i-th position. For simplicity, we assume that the processing time for a given position depends only on the number of reads supporting that position (i.e., its coverage).

As shown in Algorithm 1, we first initialise the set D of dynamic intervals and then calculate the processing-time estimate GC for the whole genome (lines 3-6). We then calculate AC as the average coverage time of the ideal interval (line 7); this is the ideal constant processing time per interval. Since intervals cannot overlap and are sets of contiguous positions, the algorithm starts at the first position of the genome and proceeds by expanding the current partial interval until a stop condition is reached. There are three possible stop conditions (a minimal sketch of the procedure follows the list below):
- C1. max-weight: the weight of the interval under analysis exceeds the constant AC (to preserve workload balance);
- C2. max-width: the width of the interval exceeds the desired threshold max (to encourage process recycling);
- C3. single-chromosome-span: the end of a chromosome has been reached (since an interval cannot span across chromosomes).
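In the sketch below, weight(name, pos) stands for the per-position time estimate t(i) (e.g., a cubic function of coverage, see the next paragraph); the variable names GC and AC and the stop-condition handling mirror the description above rather than reproducing the published Algorithm 1.

```python
def dynamic_intervals(chromosomes, weight, n_intervals, max_width):
    """chromosomes: list of (name, length) pairs;
    weight(name, pos): estimated processing time t(i) of a position."""
    # Processing-time estimate for the whole genome (GC, lines 3-6).
    GC = sum(weight(name, pos)
             for name, length in chromosomes
             for pos in range(length))
    AC = GC / n_intervals             # ideal per-interval time (line 7)
    intervals = []
    for name, length in chromosomes:  # C3: an interval never spans chromosomes
        start, acc = 0, 0.0
        for pos in range(length):
            acc += weight(name, pos)
            # C1: max-weight exceeded, or C2: max-width exceeded.
            if acc >= AC or pos - start + 1 >= max_width:
                intervals.append((name, start, pos + 1))
                start, acc = pos + 1, 0.0
        if start < length:            # close the chromosome's last interval
            intervals.append((name, start, length))
    return intervals
```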
Finding the time estimate. To find the best estimate for t, we selected a random sample of 1,000 equal-sized intervals uniformly distributed across all human chromosomes, in order to also take into account short sequences (such as the mitochondrial genome, barely 16,571 base pairs in length). We analysed each interval with HPC-REDItools in serial mode (i.e., on a single core) and calculated its average coverage. Figure 5 shows a log-log plot in which each point (x,y) represents an interval with average coverage \(e^{x}\) that took \(e^{y}\) seconds to complete. The plot reveals two main clouds: the first representing intervals whose log average coverage is between 4 and 12, and the second comprising higher-coverage intervals. This plot is meaningful in that it gives a pragmatic indication of the correlation between the coverage of an interval and the processing time required to analyse it. Since the analytical continuous function t is not known a priori, we tested several candidates for t, including constant, linear and polynomial functions. The plot suggests that the average time to elaborate a given interval correlates in a cubic manner with its mean coverage; the light-blue line in the figure represents the function that best fits the given discrete point distribution.
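As an illustration, a cubic estimate of this kind could be fitted by least squares as follows; fit_cubic_time_estimate is a hypothetical helper and the exact functional form used by HPC-REDItools is not reproduced here.

```python
import numpy as np

def fit_cubic_time_estimate(coverage, seconds):
    """Least-squares fit of t(c) = a3*c^3 + a2*c^2 + a1*c + a0, where
    `coverage` and `seconds` hold the per-interval mean coverage and the
    measured serial processing time of the benchmark intervals."""
    a = np.polyfit(coverage, seconds, deg=3)
    return lambda c: np.polyval(a, c)  # estimated seconds at mean coverage c
```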