Performance evaluation of pblat
We evaluated the performance of pblat with different numbers of threads and compared the results with those of the original blat. All analyses were performed on a Linux machine with 4 Intel Xeon E7-4830 CPUs @ 2.13 GHz (32 cores in total, 2 threads per core) and 128 GB of memory. The test data consisted of the nucleotide sequences of all human protein-coding transcripts between 1 kb and 5 kb in length from GENCODE release 26 [9]: 42,286 transcript sequences with a mean length of 2440 bp and a median length of 2226 bp. We aligned these sequences to the GRCh38.p10 human reference genome with blat and with pblat using 2, 4, 8, 16, 24, 32, 40, 48, 56 and 64 threads.
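The benchmark runs can be sketched as below. The FASTA and PSL file names are placeholders, not the exact paths used in the evaluation; `-threads` is the thread-count option that pblat adds on top of blat's interface.

```shell
# Hedged sketch of the benchmark commands; file names are placeholders.

# Single-threaded blat baseline:
blat GRCh38.p10.fa transcripts.fa blat_out.psl

# pblat with 16 worker threads (pblat adds the -threads option to blat):
pblat -threads=16 GRCh38.p10.fa transcripts.fa pblat_out.psl
```

The output PSL files from the two commands can then be compared directly, since pblat is designed to produce the same alignments as blat.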
The run time of pblat decreased as the number of threads increased (Fig. 1a). Blat took more than 7 h to map all the test transcripts to the reference genome, whereas pblat with 16 threads required only approximately 39 min, a 10.97× speedup over blat. From 2 to 32 threads, the speedup of pblat increased with the number of threads (Fig. 1b). With 2 to 8 threads the speedup grew almost linearly with the thread count, but the rate of acceleration declined with more threads, and the run time did not decrease further beyond 32 threads. The largest speedup in this test was 18.61, obtained with 48 threads. Because of the channel and bandwidth limits of memory access, the acceleration was not proportional to the number of threads used. The memory usage of pblat was almost the same as that of blat for every thread count tested, and the results generated by pblat were identical to those of blat. Based on these results, we suggest using half of the total CPU cores to obtain the maximum acceleration, or setting the number of threads to the number of memory channels to obtain high acceleration with economical CPU usage.
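The diminishing returns can be made concrete by computing parallel efficiency (speedup divided by thread count) from the reported figures:

```shell
# Parallel efficiency = speedup / threads, using the speedups reported above.
awk 'BEGIN {
  printf "16 threads: efficiency %.2f\n", 10.97 / 16
  printf "48 threads: efficiency %.2f\n", 18.61 / 48
}'
```

This gives an efficiency of 0.69 at 16 threads but only 0.39 at 48 threads, quantifying how memory-bandwidth contention erodes per-thread gains.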
To evaluate the performance of pblat on real long-read sequencing data, in which sequencing errors lead to abundant mismatches and gaps, we used single-cell full-length transcript RNA-seq data from mouse, sequenced on the long-read single-molecule Oxford Nanopore MinION sequencer (ONT) [10]. The dataset was downloaded from the NCBI Sequence Read Archive (SRA) under accession number SRR5286960. For a single murine B1a cell, ONT generated 104,990 reads with a maximum length of 6296 bp, a mean length of 1868 bp and a median length of 1787 bp. Fast spliced-read aligners such as HISAT2 [11] and STAR [12] are not compatible with such long reads with high error rates [10]. In the original study [10], blat was employed to align these nanopore long reads to the mouse genome and helped identify novel transcript isoforms. We replicated the alignment step using blat and pblat with 8 and 16 threads. Blat took 935 min to align all the reads to the GRCm38 genome (primary assembly), whereas pblat took 149 min with 8 threads and 86 min with 16 threads. The speedups of pblat with 8 and 16 threads relative to blat were thus 6.28 and 10.87, respectively, consistent with the previous analysis. These results show that pblat can substantially accelerate the alignment of long reads generated by the Oxford Nanopore and PacBio SMRT (Single-Molecule Real-Time) sequencing platforms.
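The reported speedups follow directly from the wall times, in minutes, given above:

```shell
# Speedup = blat wall time / pblat wall time (all times in minutes).
awk 'BEGIN {
  printf "8 threads:  speedup %.2f\n", 935 / 149
  printf "16 threads: speedup %.2f\n", 935 / 86
}'
```

This reproduces the speedups of 6.28 and 10.87 quoted in the text.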
Performance evaluation of pblat-cluster
The performance of pblat-cluster was evaluated on a high-performance computing cluster with 15 nodes. Each node had 24 CPU cores @ 2.30 GHz (1 thread per core) and 128 GB of memory. The nodes were connected by an InfiniBand network, shared a GPFS file system and ran Linux with OpenMPI 3.0.0. The test data were the same as those used in the evaluation of pblat. We ran pblat-cluster with 1, 3, 6, 9, 12 and 15 computing nodes, with 12 threads per node. The run time decreased markedly as the number of computing nodes increased (Fig. 2). Blat took 6.4 h with one thread on one node to align all the test sequences to the reference genome, and pblat-cluster with one node (12 threads) took 44 min. With 15 nodes, pblat-cluster reduced the run time to 6.8 min, a 6.47× speedup over pblat with 12 threads on one node and a 51.18× speedup over blat. The speedup would increase further with more computing nodes, but, as with pblat, the acceleration is not proportional to the number of nodes and the maximum speedup has a ceiling. The run time of pblat-cluster is determined by the slowest node: when one node receives an extremely long sequence that takes much longer to align than the others, the total run time is dominated by that sequence no matter how many nodes are used.
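A cluster run can be sketched as below. The host file and FASTA names are placeholders, the exact `mpirun` flags depend on the local OpenMPI setup, and `-threads` is assumed to set the worker threads per MPI process, as in pblat.

```shell
# Hedged sketch: one pblat-cluster MPI process per node via OpenMPI,
# 12 worker threads each; file names and host list are placeholders.
mpirun -np 15 --map-by node --hostfile hosts.txt \
  pblat-cluster -threads=12 GRCh38.p10.fa transcripts.fa out.psl
```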
We then compared pblat with 12 threads against pblat-cluster with 12 nodes (1 thread per node). pblat with 12 threads took 44 min to align all the test transcripts to the reference genome, whereas pblat-cluster with 12 nodes took 39.8 min. As expected, pblat-cluster is faster than pblat with the same number of threads: in pblat-cluster the threads run on different computing nodes, each holding its own copy of the genome and index in local memory, so they do not compete for memory access.