Ryūtō: network-flow based transcriptome reconstruction

Background The rapid increase in high-throughput sequencing of RNA (RNA-seq) has led to tremendous improvements in the detection and reconstruction of both expressed coding and non-coding RNA transcripts. Yet the complete and accurate annotation of complex transcriptional output, not only of the human genome, has remained elusive. One of the critical bottlenecks in this endeavor is the computational reconstruction of transcript structures, which is hampered by high noise levels, technological limits, and other biases in the raw data.

Results We introduce several new and improved algorithms in a novel workflow for transcript assembly and quantification. We propose an extension of the common splice graph framework that combines aspects of overlap and bin graphs and makes it possible to efficiently use both multi-splice and paired-end information to the fullest extent. Phasing information of reads is used to further resolve loci. The decomposition of read coverage patterns is modeled as a minimum-cost flow problem to account for the unavoidable non-uniformities of RNA-seq data.

Conclusion The performance of Ryūtō compares favorably with state-of-the-art methods on both simulated and real-life datasets. Ryūtō calls 1–4% more true transcripts while calling 5–35% fewer false predictions than the next best competitor.

Electronic supplementary material The online version of this article (10.1186/s12859-019-2786-5) contains supplementary material, which is available to authorized users.

Figure S8: Information loss of the original definition (right) compared to the alternative definition (left). We have exon bins of unrealistic theoretical quality: AXWZ, XWZB, XWZC, DXVZ, EXVZ, and XVZF. The bins form two clusters in the overlap graph that stay intact in the alternative definition. As we normally require bins to end at nodes and nodes are unique per exon, both clusters are contracted into one graph component. We argue that this case is rare.

Figure S9: Information gain of the original definition (right) compared to the alternative definition (left). Exon X is smaller than the read size. However, only bins AC, BC, CXD, and XE are present, with the expected bin CXE missing. Condition (iv) forces bin XE to stay disconnected under the alternative definition. In the original formulation, (ii) forces the presence of a node for X, and XE is joined because nodes are unique. This is a common problem, especially in low-abundance regions.
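The clustering described in the two captions can be reproduced with a small sketch. It assumes, for illustration only, that each bin is written as a string with one character per exon and that two bins overlap when a proper suffix of one equals a proper prefix of the other; the function names `overlaps` and `clusters` are hypothetical and not taken from Ryūtō.

```python
from itertools import permutations


def overlaps(a, b):
    """True if a proper suffix of bin a equals a proper prefix of bin b.

    Assumption: bins are strings with one character per exon.
    """
    return any(a[-k:] == b[:k] for k in range(1, min(len(a), len(b))))


def clusters(bins):
    """Connected components of the (undirected) overlap relation."""
    adj = {b: set() for b in bins}
    for a, b in permutations(bins, 2):
        if overlaps(a, b):
            adj[a].add(b)
            adj[b].add(a)
    seen, comps = set(), []
    for start in bins:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        comps.append(comp)
    return comps


# Bin sets as listed in the captions of Figures S8 and S9.
fig_s8 = clusters(["AXWZ", "XWZB", "XWZC", "DXVZ", "EXVZ", "XVZF"])
fig_s9 = clusters(["AC", "BC", "CXD", "XE"])
```

Under these assumptions the Figure S8 bins fall into two clusters, and in the Figure S9 bin set XE has no overlap partner, which matches the disconnected bin described in the caption.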
drop the constraint that $l(v_1) = x_1$ and $l(v_k) = x_j$ from (ii). We can use a modified version of the original algorithm for this purpose, and leave details as an exercise for the reader. The integrity of multi-splice bins enforced in this way has a positive impact in some instances (Suppl. Fig. S8), but behaves poorly on incomplete bin sets (Suppl. Fig. S9). Overall, on realistic data, the quality of the description decreases (see Suppl. Table S10).
In order to show the correctness and maximality according to (iii) and (iv) of our original algorithm, we list elemental operations on the overlap graph together with the bin graph structures they induce. We can then depict the algorithm as a series of elemental operations, each contracting a group of nodes in the overlap graph into a single node, until only a single node per component remains. Given the operations, a simple proof by induction can be formulated: as the bin graph is acyclic, the inverses of the listed operations can be used to build up any overlap graph. Local maximality also holds globally, as the restricting factors remain the same.

In order to minimize visual elements, we do not label edges with the nodes of a bin as before, but rather use placeholder variables that represent ≥ 1 splices each. Different names strictly indicate incompatible splices, while identical names represent the same splice signal. We disregard bins that are unique subsets of another bin. Therefore, no inclusion of middle nodes according to (ii) is required, and maximality forces no changes.

3.1a/b As a prefix overlaps with at least two nodes, unique mapping of bins (ii) is violated unless paths are joined at the largest overlapping suffix for each incoming node. According to (ii), every bin needs to end at a node, inducing nodes at every prefix corresponding to a suffix of an incoming node.
3.2a/b As a suffix overlaps with at least two nodes, unique mapping of bins (ii) is violated unless paths are joined at the largest overlapping prefix for each outgoing node. According to (ii), every bin needs to start at a node, inducing nodes at every suffix corresponding to a prefix of an outgoing node.
3.3 As a bin is contained in two edges, it cannot be uniquely mapped, violating (ii). Therefore, both need to be joined.
3.4a A bin contained in the overlapping region of two nodes actually belongs to a single path.
3.4b A bin contained in a bin that is itself contained does not induce any violations of (i) or (ii).
3.5 Transitive overlaps can be ignored, as pairwise treatment induces the same data.

Please note that instances of 3.4a, 3.4b, and 3.5 need to be resolved before all other operations. The exception is 2.2, which can only be applied last per locus, as all more specific operations need to go first. Induced bin graphs need to be preserved and updated at each step. Nodes cannot be lost and edges only get split, as they were mandated by (i) or (ii).
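Ignoring transitive overlaps (operation 3.5) amounts to dropping every edge that is implied by a longer path. The following is a minimal sketch of generic transitive reduction on an acyclic graph, given as an illustration rather than as the authors' implementation; the function name and the edge-list representation are assumptions.

```python
def transitive_reduction(edges):
    """Drop every edge (u, v) for which a longer path u -> ... -> v exists.

    Assumption: `edges` is a set of (source, target) pairs of an acyclic graph.
    """
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())

    def reachable_without_direct_edge(src, target):
        # Explore from all successors of src except target itself. In an
        # acyclic graph such a walk can never come back through src -> target.
        seen = set()
        stack = [w for w in succ[src] if w != target]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(succ[node])
        return seen

    return {
        (u, v) for u, v in edges
        if v not in reachable_without_direct_edge(u, v)
    }


# Hypothetical overlap graph: a -> c is implied by a -> b -> c.
reduced = transitive_reduction({("a", "b"), ("b", "c"), ("a", "c")})
```

The quadratic search per edge keeps the sketch short; a production implementation would more likely process nodes in topological order.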
Operations 3.4a, 3.4b, and 3.5 correspond to the removal of transitive edges in our algorithm. Operation 2.2 is applied to nodes with in-degree 1 or out-degree 1. Operations 3.1 and 3.2 correspond to part (a) of the algorithm, and 3.3 to part (b). We give an example decomposition as follows:

Algorithm 2: Generate the bin graph from the overlap graph. We omit details of range handling.

Require: overlap graph $G$, $B_n$ overlapping nodes, $B_c$ contained nodes

Table S2: Running times, CPU use, and memory use of benchmarked tools. We tested runtimes for ENSEMBL Realistic (ER) on a machine with an Intel(R) Xeon(R) CPU E7540 @ 2.00GHz and sufficient RAM. Ryūtō, Cufflinks, and StringTie were each assigned 8 cores. Transcomb and Scallop do not offer this option and were run single-threaded. Ryūtō took only slightly longer than StringTie; its higher computational needs are offset by more effective parallelization. The higher memory and time use of Ryūtō is explained by its internal infrastructure, which is already designed for the later addition of trans-splice and circularization events. Scallop can take advantage of its lower requirements for data structures.

Table S3: Spearman's rank correlation coefficients for Cufflinks $\rho_C$, StringTie $\rho_S$, Transcomb $\rho_T$, and Ryūtō $\rho_R$ for individual chromosomes of the simulated datasets ENSEMBL Perfect (EP) and ENSEMBL Realistic (ER). Ranks of predicted FPKM are correlated with true abundance ranks. Only the ranks of correctly predicted isoforms were considered. Cufflinks exhibits the highest accuracy, but also called the fewest transcripts, which gives it a slight advantage for this measure. StringTie and Ryūtō perform similarly well, with Ryūtō consistently ahead. The best results are highlighted in bold, the second best in italics.

Table S8: Total true predicted transcripts (TP), total false predicted transcripts (FP), recall, precision, and F1 score for each tool, broken down by the true abundance of transcripts on the simulated dataset ENSEMBL Realistic. Tools were provided with a partly falsified annotation to guide assembly. Trust levels for Ryūtō are given.

Table S9: Total true predicted transcripts (TP), total false predicted transcripts (FP), recall, precision, and F1 score for Ryūtō and StringTie. Paired-end reads of the simulated dataset ENSEMBL Realistic were aligned using STAR. Assembled de novo super-reads of the same data were aligned with STAR or HISAT. Results are given for the de novo alignment alone and for a merged dataset of paired-end and de novo alignments, run at standard settings or with higher filters (f). Results are broken down according to the abundance of the true transcripts.

Table S10: Total true predicted transcripts (TP), total false predicted transcripts (FP), recall, precision, and F1 score for the standard definition (norm.) and the alternative definition (alt.) of Ryūtō, broken down by the true abundance of transcripts on the simulated dataset ENSEMBL Realistic.
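The rank correlation reported in Table S3 compares ranks of predicted FPKM with ranks of true abundance. A minimal sketch of Spearman's coefficient is shown here, computed as the Pearson correlation of average ranks so that ties are handled; the function names and the example values are hypothetical and not taken from the benchmark.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical predicted FPKM and true abundances for three isoforms.
rho = spearman([1.5, 20.0, 3.2], [10.0, 300.0, 40.0])
```

Only the ordering of the values matters, so the differing scales of FPKM and true abundance do not affect the result.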

Table S11: Total true predicted transcripts (TP), total false predicted transcripts (FP), recall, precision, and F1 score for each tool, broken down by the true abundance of transcripts on realistic data, looking only at multi-exon transcripts.

Table S12: Total true predicted transcripts (TP), total false predicted transcripts (FP), recall, precision, and F1 score for each tool, broken down by the true abundance of transcripts on realistic data, looking at all transcripts including single-exon transcripts.
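The measures named in these captions follow from TP and FP once the number of true transcripts is known. The sketch below uses the standard definitions and assumes that recall is taken relative to the number of true transcripts in the dataset; the function name `scores` and the counts are hypothetical.

```python
def scores(tp, fp, n_true):
    """Precision, recall, and F1 from true/false predictions.

    Assumption: n_true is the number of true transcripts, so that
    false negatives equal n_true - tp.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / n_true if n_true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, not taken from the supplementary tables.
precision, recall, f1 = scores(tp=50, fp=50, n_true=200)
```

F1 is the harmonic mean of precision and recall, so a tool cannot reach a high F1 by calling very few transcripts with high precision alone.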