ASGAL (Alternative Splicing Graph ALigner) is a tool for performing a mapping of RNA-Seq data in a sample against the splicing graph of a gene with the main goal of detecting novel alternative splicing events supported by the reads of the sample with respect to the annotation of the gene. More precisely, ASGAL takes as input the annotation of a gene together with the related reference sequence, and a set of RNA-Seq reads, to output (i) the spliced alignments of each read in the sample and (ii) the alternative splicing events supported by the sample which are novel with respect to the annotation. We point out that ASGAL uses the input reference sequence only for building the splicing graph as well as for refining the alignments computed against it, with the specific goal of improving the precision in the AS event type detection. Each identified event is described by its type, i.e. exon skipping, intron retention, alternative acceptor splice site, alternative donor splice site, its genomic location, and a measure of its quantification, i.e. the number of alignments that support the identified event.
This section is organized as follows. We first introduce the basic definitions and notions that we will use in the section spliced graph-alignment, and finally we describe the steps of our method. For the sake of clarity, we will describe our method considering as input the splicing graph of a single gene: it can be easily generalized to manage more than a gene at a time. However, the current version of ASGAL tool cannot manage more than a limited set of genes. At the end of this section, we will propose a possible procedure an user can adopt to use our tool in a genome-wide analysis.
Definitions
From a computational point of view, a genome is a sequence of characters, i.e. a string, drawn from an alphabet of size 4 (A, C, G, and T). A gene is a locus of the genome, that is, a gene is a substring of the genome. Exons and introns of a gene locus will be uniquely identified by their starting and ending positions on the genome. A transcript T of gene G is a sequence 〈[a1,b1],[a2,b2],…,[an,bn]〉 of exons on the genome, where ai and bi are respectively the start and the end positions of the i-th exon of the transcript. Observe that a1 and bn are the starting and ending positions of transcript T on the genome, and each [bi+1,ai+1−1] is an intron represented as a pair of positions on the genome. In the following, we denote by \(\mathcal {E}_{G}\) the set of all the exons of the transcripts of gene G, that is \(\mathcal {E}_{G}=\cup _{T \in \mathcal {T}}\mathcal {E}(T)\), where \(\mathcal {E}(T)\) is the set of exons of transcript T and \(\mathcal {T}\) is the set of transcripts of G, called the annotation of G. Given two exons ei=[ ai,bi] and ej=[ aj,bj] of \(\mathcal {E}_{G}\), we say that eiprecedes ej if bi<aj and we denote this by ei≺ej. Moreover, we say that ei and ej are consecutive if there exists a transcript \(T \in \mathcal T\) and an index k such that ek=ei and ek+1=ej, and ei, ej in \(\mathcal {E}(T)\).
The splicing graph of a gene G is the directed acyclic graph \(\mathcal {S}_{G} = (\mathcal {E}_{G},E)\), i.e. the vertex set is the set of the exons of G, and the edge set E is the set of pairs (vi,vj) such that vi and vj are consecutive in at least one transcript. For each vertex v, we denote by seq(v), the genomic sequence of the exon associated to v. Finally, we say that \(\mathcal {S}_{G}^{\star }\) is the graph obtained by adding to \(\mathcal {S}_{G}\) all the edges (vi,vj)∉E such that vi≺vj. We call these edges novel edges. Note that the novel edges represent putative novel junctions between two existing exons (that are not consecutive in any transcript of G). Figure 1 shows an example of the definitions of gene, exon, annotation, and splicing graph.
In the following, we will use the notion of Maximal Exact Match (MEM) to perform the spliced graph-alignment of a RNA-Seq read to \(\mathcal {S}_{G}\). Given two strings R and Z, a MEM is a triple m=(iZ,iR,ℓ) representing the common substring of length ℓ between the two strings that starts at position iZ in Z, at position iR in R, and that cannot be extended in either direction without introducing a mismatch. Computing the MEMs between a string R and a splicing graph \(\mathcal {S}_{G}\) can be done by concatenating the labels of all the vertices and placing the special symbol ϕ before each label and after the last one, obtaining a string \(Z = \phi {\texttt {seq}}(v_{1}) \phi {\texttt {seq}}(v_{2}) \phi \ldots \phi {\texttt {seq}}(v_{|\mathcal {E}_{G}|}) \phi \) that we call the linearization of the splicing graph (see Fig. 1 for an example). It is immediate to see that, given a vertex v of \(\mathcal {S}_{G}\), the label seq(v) is a particular substring of the linearization Z. For the sake of clarity, let us denote this substring, which is the one related to seq(v), as Z[ iv,jv]. Then, by employing the algorithm by Ohlebusch et al. [26], all the MEMs longer than a constant L between R and Z, thus between R and \(\mathcal {S}_{G}\), can be computed in linear time with respect to the length of the reads and the number of MEMs. Thanks to the special character ϕ which occurs in Z and not in R, each MEM occurs inside a single vertex label and cannot span two different labels. In the following, given a read R and the linearization Z of \(\mathcal {S}_{G}\), we say that a MEM m=(iZ,iR,l) belongs to vertex v if iv≤iZ≤jv where [ iv,jv] is the interval on Z related to the vertex label seq(v) (that is, seq(v)=Z[ iv,jv]). We say that a MEM m=(iZ,iR,l) precedes another MEM \(m^{\prime }=\left (i^{\prime }_{Z}, i^{\prime }_{R}, l^{\prime }\right)\) in R if \(i_{R} < i^{\prime }_{R}\) and \(i_{R} + l < i^{\prime }_{R} + l^{\prime }\), and we denote this by m≺Rm′. Similarly, when m precedes m′ in Z, we denote it by m≺Zm′, if the previous properties hold on Z and the two MEMs belong to the same vertex label seq(v). When m precedes m′ in R (in Z, respectively), we say that \(lgap_{R} = i^{\prime }_{R} - (i_{R} + l)\) (\({lgap}_{Z} = i^{\prime }_{Z} - (i_{Z} + l)\), respectively) is the length of the gap between the two MEMs. If lgapR or lgapZ (or both) are positive, we refer to the gap strings as sgapR and sgapZ, while when they are negative, we say that m and m′overlap either in R or Z (or both). Given a MEM m belonging to the vertex labeled seq(v), we denote as PREFZ(m) and SUFFZ(m) the prefix and the suffix of seq(v) upstream and downstream from the start and the end of m, respectively. Figure 2 summarizes the definitions of precedence between MEMs, gap, overlap, PREFZ, and SUFFZ.
Spliced graph-alignment
We are now able to define the fundamental concepts that will be used in our method. In particular, we first define a general notion of gap graph-alignment and then we introduce specific constraints on the use of gaps to formalize a splice-aware graph-alignment that is fundamental for the detection of alternative splicing events in ASGAL.
A gap graph-alignment of R to graph \(\mathcal {S}_{G}\) is a pair (A,π) where π=〈v1,…,vk〉 is a path of the graph \(\mathcal {S}_{G}^{\star }\) and
$$A = \left\langle (p_{1}, r_{1}), \left(p_{1}^{\prime}, r_{1}^{\prime}\right), \ldots, \left(p^{\prime}_{n-1}, r^{\prime}_{n-1}\right),(p_{n}, r_{n}) \right\rangle $$
is a sequence of pairs of strings, with n≥k, such that seq(v1)=x·p1 and seq(vk)=pn·y, for x,y possibly empty strings and \(P= p_{1} \cdot p_{1}^{\prime } \cdot p_{2} \cdot p_{2}^{\prime } \cdot p_{3} \cdots p_{n-1}^{\prime } \cdot p_{n} \) is the string labeling the path π and \(R = r_{1} \cdot r_{1}^{\prime } \cdot r_{2} \cdots r_{n-1}^{\prime } \cdot r_{n} \).
The pair (pi,ri), called a factor of the alignment A, consists of a non-empty substring ri of R and a non-empty substring pi of the label of a vertex in π. On the other hand, the pair \(\left (p_{i}^{\prime }, r_{i}^{\prime }\right)\) is called a gap-factor of the alignment A if at least one of \(p_{i}^{\prime }\) and \(r_{i}^{\prime }\) is an empty substring ε. Moreover, either \(p_{i}^{\prime }\) is empty or \(|p_{i}^{\prime }| > \alpha \), and either \(r_{i}^{\prime }\) is empty or \(|r_{i}^{\prime }| > \alpha \), for a fixed value α which represents the maximum alignment indel size allowed. When an insertion (or a deletion) is smaller than α, we consider it an alignment indel and we incorporate it into a factor; otherwise, we consider it as a clue of the possible presence of an AS event and we represent it as a gap-factor. We note that an “alignment indel” is a small insertion or deletion which occurs in the alignment, due to a sequencing error in the input data or a genomic insertion/deletion. Intuitively, in a gap graph-alignment, factors correspond to portions of exons covered (possibly with errors) by portions of the read, while gap-factors correspond to introns, which can be already annotated or novel, and which can be used to infer the possible presence of AS events. We note that to allow the detection of alternative splice site events known as NAGNAG resulting in a difference of 3bps, if an alignment indel occurs at the beginning or at the end of an exon, we consider it during the detection of the events, even though it is not modeled as a gap-factor since in these cases the insertion may be smaller than α.
We associate to each factor (pi,ri) the cost δ(pi,ri), and to each gap-factor \(\left (p_{i}^{\prime },r_{i}^{\prime }\right)\) the cost \(\delta \left (p_{i}^{\prime },r_{i}^{\prime }\right)\), by using a function δ(·,·) with positive values. Then the cost of the alignment (A,π) is given by the expression:
$$ {\mathtt{cost}}(A, \pi) = \sum_{i=1}^{n} \delta(p_{i}, r_{i}) + \sum_{i=1}^{n-1} \delta\left(p_{i}^{\prime},r_{i}^{\prime}\right). $$
Moreover, we define the error of a gap graph-alignment as the sum of the edit distance of each factor (but not of gap-factors). Formally, the error of the alignment (A,π) is:
$$ {\mathtt{Err}}(A, \pi) = \sum_{i=1}^{n} d(p_{i}, r_{i}), $$
where d(·,·) is the edit distance between two strings.
To define a splice-aware alignment, that we call spliced graph-alignment, we need to classify each gap-factor and to assign it a cost. Our primary goal is to compute a gap graph-alignment of the read to the splicing graph that possibly reconciles to the gene annotation; if this is not possible, then we want to minimize the number of novel events. For this reason we distinguish three types of gap-factors: annotated, novel, and uninformative. Intuitively, an annotated gap-factor models an annotated intron, a novel gap-factor represents a novel intron, while an uninformative gap-factor does not represent any intron.
Formally, we classify a gap-factor \(\left (p_{i}^{\prime }, r_{i}^{\prime }\right)\) as annotated if and only if \(p^{\prime }_{i} = r^{\prime }_{i} = \epsilon \) and the two strings pi, pi+1 are on two different vertices that are linked by an edge in \(\mathcal {S}_{G}\). We classify a gap-factor \(\left (p_{i}^{\prime }, r_{i}^{\prime }\right)\) as novel in the following cases:
-
1
\(r_{i}^{\prime } = \epsilon \) and \(p_{i}^{\prime }= \epsilon \) occurs between the strings pi and pi+1 which belong to two distinct vertices linked by an edge in \(\mathcal {S}_{G}^{\star }\) and not in \(\mathcal {S}_{G}\) (i.e. this gap-factor represents an exon skipping event — Fig. 3a).
-
2
\(r_{i}^{\prime } = \epsilon \) and \(p_{i}^{\prime }\neq \epsilon \) occurs between the strings pi and pi+1 which belong to the same vertex of \(\mathcal {S}_{G}^{\star }\) (i.e. this gap-factor represents an intron retention event — Fig. 3b). Actually, we note here that this type of gap-factor may represent also a genomic deletion: currently, our program does not distinguish between intron retentions and genomic deletions that are entirely contained in an exon, therefore we might overpredict intron retentions.
-
3
\(r_{i}^{\prime } = \epsilon \) and \(p_{i}^{\prime }\neq \epsilon \) occurs between the strings pi and pi+1 which belong to two distinct vertices linked by an edge in \(\mathcal {S}_{G}^{\star }\) (i.e. this gap-factor represents an alternative splice site event — Fig. 3c).
-
4
\(r_{i}^{\prime } \ne \epsilon \) and \(p_{i}^{\prime }= \epsilon \) occurs between the strings pi and pi+1 which belong to two distinct vertices linked by an edge in \(\mathcal {S}_{G}^{\star }\) (i.e. this gap-factor represents an alternative splice site extending an exon or a new exon event — Fig. 3d-e).
Note that Case 1 allows to detect a novel intron whose splice sites are both annotated (see Fig. 3a). Case 2 supports a genomic deletion or an intron retention (see Fig. 3b), and in case of intron retention, ASGAL finds the two novel splice sites inside the annotated exon. Case 3 gives an evidence of a novel alternative splice event shortening an annotated exon (see Fig. 3c) and ASGAL finds the novel splice site supported by this case. Finally, in Case 4, ASGAL is able to detect a novel alternative splice site (extending an annotated exon) or a novel exon (see Fig. 3d), but only in the first case (alternative splice site) ASGAL is able to find the novel splice site induced by the gap-factor.
For ease of presentation, Fig. 3 shows only “classic” AS event types and not their combination as those modeled with the notion of Local Splicing Variations (LSV) [13]. We note here that our formalization takes into account combinations of AS event types as those given by an exon skipping combined with an alternative splice site (see definition of gap-factor in cases 3 and 4). However, the actual version of the tool is designed only to detect the AS event types shown in Fig. 3. For completeness, in Fig. 4 we show the same AS event types (shown in Fig. 3) with respect to the annotated case, i.e. when the gap-factor is annotated and it represents an already known AS event.
Finally, we classify a gap-factor \(\left (p^{\prime }_{i}, r^{\prime }_{i}\right)\) as uninformative in the two remaining cases, which are (i) \(r_{i}^{\prime } = \epsilon \) and \(p_{i}^{\prime }= \epsilon \) occurs between strings pi and pi+1 which belong to the same vertex, and (ii) \(r_{i}^{\prime } \neq \epsilon \) and \(p_{i}^{\prime }= \epsilon \) occurs between strings pi and pi+1 which belong to the same vertex. We notice that in the former case, factors (pi,ri) and (pi+1,ri+1) can be joined into a unique factor.
Let \(\mathcal {G_{F}}\) be the set of novel gap-factors of a gap graph-alignment A. Then a spliced graph-alignment (A,π) of R to \(\mathcal {S}_{G}\) is a gap graph-alignment in which uninformative gap-factors are not allowed, whose cost is defined as the number of novel gap-factors, and whose error is at most β, for a given constant β which models any type of error that can occur in an alignment (sequencing errors, indels, etc). In other words, in a spliced graph-alignment (A,π), we cannot have uninformative gap-factors, and the δ function assigns a cost 1 to each novel gap-factor and a cost 0 to all other factors and annotated gap-factors: thus \({\texttt {cost}}(A,\pi) = |\mathcal {G_{F}}|\) and Err(A,π)≤β. We focus on a bi-criteria version of the computational problem of computing the optimal spliced graph-alignment (A,π) of R to a graph \(\mathcal {S}_{G}\), where first we minimize the cost, then we minimize the error. The intuition is that we want a spliced graph-alignment of a read that is consistent with the fewest novel splicing events that are not in the annotation. Moreover, among all such alignments we look for the alignment that has the smallest edit distance (which is likely due to sequencing errors and polymorphisms) in the non-empty regions that are aligned (i.e. the factors). Figure 5 shows an example of spliced graph-alignment of error value 2, and cost 2 — since it has two novel gap-factors.
In this paper we propose an algorithm that, given a read R, a splicing graph \(\mathcal {S}_{G}\), and three constants, which are L (the minimum length of a MEM), α (the maximum alignment indel size), and β (the maximum number of allowed errors), computes an optimal spliced graph-alignment — that is, among all spliced graph-alignments with minimum cost, the alignment with minimum error. The next section details how ASGAL computes the optimal spliced graph-alignments of a RNA-Seq sample to the splicing graph \(\mathcal {S}_{G}\), and how it exploits novel gap-factors to detect AS events.
ASGAL approach
We now describe the algorithm employed by ASGAL to compute the optimal spliced graph-alignments of a sample of RNA-Seq reads to the splicing graph of a gene, to be used in order to provide the alternative splicing events supported by the sample and a measure of their quantification (i.e. the number of reads supporting the event).
The ASGAL tool implements a pipeline consisting of the following steps: (1) construction of the splicing graph of the gene, (2) computation of the spliced graph-alignments of the RNA-Seq reads, (3) remapping of the alignments from the splicing graph to the genome, and (4) detection of the novel alternative splicing events. Figure 6 depicts the ASGAL pipeline.
In the first step, ASGAL builds the splicing graph \(\mathcal S_{G}\) of the input gene using the reference genome and the gene annotation, and adds the novel edges to obtain the graph \(\mathcal S^{\star }_{G}\) which will be used in the next steps.
The second step of ASGAL computes the spliced graph-alignments of each read R in the input RNA-Seq sample by combining MEMs into factors and gap-factors. For this purpose, we extend the approximate pattern matching algorithm of Beretta et al. [19] to obtain the spliced graph-alignments of the reads, which will be used in the following steps to detect novel alternative splicing events. As described before, we use the approach proposed by Ohlebusch et al. in [26] to compute, for each input read R, the set of MEMs between Z, the linearization of the splicing graph \(\mathcal {S}_{G}\), and R with minimum length L, a user-defined parameter (we note that the approach of [26] allows to specify the minimum length of MEMs). We recall that the string Z is obtained by concatenating the strings seq(v) and ϕ for each vertex v of the splicing graph (recall that ϕ is the special character used to separate the vertex labels in the linearization Z of the splicing graph). We point out that the concatenation order does not affect the resulting alignment and that the splicing graph linearization is performed only once before aligning the input reads to the splicing graph.
Once the set M of MEMs between R and Z is computed, we build a weighted graph GM=(M, EM) based on the parameter α, representing the maximum alignment indel size allowed, and the two precedence relations between MEMs, ≺R and ≺Z, respectively. Then we use such graph to extract the spliced graph-alignment. Intuitively, each node of this graph represents a perfect match between a portion of the input read and a portion of an annotated exon whereas each edge models the alignment error, the gap-factor of the spliced graph-alignment, or both. More precisely, there exists an edge from m to m′, with m,m′∈M, if and only if m≺Rm′ and one of the following six conditions (depicted in Fig. 7) holds:
-
1
m and m′ are inside the same vertex label of Z, m≺Zm′, and either (i) lgapR>0 and lgapZ>0, or (ii) lgapR=0 and 0<lgapZ≤α. The weight of the edge (m,m′) is set to the edit distance between sgapR and sgapZ (Fig. 7a).
-
2
m and m′ are inside the same vertex label of Z, m≺Zm′, lgapR≤0, and lgapZ≤0. The weight of the edge (m,m′) is set to |lgapR−lgapZ| (Fig. 7b).
-
3
m and m′ are inside the same vertex label of Z, m≺Zm′, lgapR≤0 and lgapZ>α. The weight of the edge (m,m′) is set to 0 (Fig. 7c).
-
4
m and m′ are on two different vertex labels seq(v1) and seq(v2), with v1≺v2, and lgapR≤0. The weight of the edge (m,m′) is set to 0 (Fig. 7d).
-
5
m and m′ are on two different vertex labels seq(v1) and seq(v2), with v1≺v2, lgapR>0, and SUFFZ(m)=PREFZ(m′)=ε. The weight of the edge (m,m′) is set to 0 if lgapR>α, and to lgapR otherwise (Fig. 7e).
-
6
m and m′ are on two different vertex labels seq(v1) and seq(v2), with v1≺v2, lgapR>0, at least one between SUFFZ(m) and PREFZ(m′) is not ε. The weight of the edge (m,m′) is set to the edit distance between sgapR and the concatenation of SUFFZ(m) and PREFZ(m′) (Fig. 7f).
Note that the aforementioned conditions do not cover all of the possible situations that can occur between two MEMs, but they represent those that are relevant for computing the spliced graph-alignments of the considered read. Intuitively, m and m′ contribute to the same factor (pi,ri) in cases 1 and 2 and the non-zero weight of the edge (m,m′) concurs to the spliced graph-alignment error. In cases 3-5, the edge (m,m′) models the presence of a novel gap-factor. More precisely, m contributes to the end of a factor (pi,ri) and m′ contributes to the start of the consecutive factor (pi+1,ri+1) and the novel gap-factor in between models an intron retention or a genomic deletion on an annotated exon (case 3), an alternative splice site shortening an annotated exon (case 4), and an alternative splice site extending an annotated exon or a new exon (case 5). Finally in case 6, m contributes to the end of a factor (pi,ri) and m′ contributes to the start of the consecutive factor (pi+1,ri+1) whereas the gap-factor in between can identify either a novel exon skipping event or an already annotated intron. In both these cases, the non-zero weight of the edge contributes to the spliced graph-alignment error.
The spliced graph-alignment of the read R is computed by a visit of the graph GM. More precisely, each path πM of this graph represents a spliced graph-alignment and the weight of the path is the number of differences between the pair of strings in R and Z covered by πM. For this reason, for read R, we select the lightest path in GM, with weight less than β (the given error threshold) which also contains the minimum number of novel gap-factors, i.e. we select an optimal spliced graph-alignment.
The third step of ASGAL computes the spliced alignments of each input read with respect to the reference genome starting from the spliced graph-alignments computed in the previous step. Exploiting the annotation of the gene, we convert the coordinates of factors and gap-factors in the spliced graph-alignment to positions on the reference genome. In fact, observe that factors map to coding regions of the genome whereas gap-factors identify the skipped regions of the reference, i.e. the introns induced by the alignment, modeling the possible presence of AS events (see Fig. 3 for details). We note here that converting the coordinates of factors and gap-factors to positions on the reference genome is pretty trivial except when factors pi and pi+1 are on two different vertices and only \(p^{\prime }_{i}\) is ε (case d-e of Fig. 3). In this case, the portion \(r^{\prime }_{i}\) must be aligned to the intron between the two exons whose labels contains pi and pi+1 as a suffix and prefix, respectively. If \(r^{\prime }_{i}\) aligns to a prefix or a suffix of this intron (taking into account possible errors within the total error bound α), then the left or right coordinate of the examined intron is modified according to the length of \(r^{\prime }_{i}\) (Fig. 3d). In the other case (Fig. 3e), the portion \(r^{\prime }_{i}\) is not aligned to the intron and it is represented as an insertion in the alignment. Moreover, the third step of our approach performs a further refinement of the splice sites of the introns in the obtained spliced alignment since it searches for the splice sites (in a maximum range of 3 bases with respect to the detected ones) determining the best intron pattern (firstly GT-AG, secondly GC-AG if GT-AG has not been found).
In the fourth step, ASGAL uses the set \(\mathcal {I}\) of introns supported by the spliced alignments computed in the previous step, i.e. the set of introns associated to each gap-factor, to detect the novel alternative splicing events supported by the given RNA-Seq sample with respect to the given annotation. Let \(\mathcal {I}_{n}\) be the subset of \(\mathcal {I}\) composed of the introns which are not present in the annotation, that is, the novel introns. For each novel intron \(\left [p_{s}, p_{e}\right ] \in \mathcal {I}_{n}\) which is supported by at least ω alignments, ASGAL identifies one of the following events, which can be considered one of the relevant events supported by the input sample:
-
exon skipping, if there exists an annotated transcript containing two non-consecutive exons [ ai,bi] and [ aj,bj], such that bi=ps−1 and aj=pe+1.
-
intron retention, if there exists an annotated transcript containing an exon [ ai,bi] such that (i) ai<ps<pe<bi, (ii) there exists an intron in \(\mathcal {I}\) ending at ai−1 or ai is the start of the transcript and (iii) there exists another intron in \(\mathcal {I}\) starting at bi+1 or bi is the end of the transcript.
-
alternative acceptor site, if there exists an annotated transcript containing two consecutive exons [ ai,ps−1] and [ aj,bj] such that pe<bj, and there exists an intron in \(\mathcal {I}\) starting at bj+1 or bj is the end of the transcript.
-
alternative donor site, if there exists an annotated transcript containing two consecutive exons [ ai,bi] and [ pe+1,bj] such that ps>ai, and there exists an intron in \(\mathcal {I}\) ending at ai−1 or ai is the start of the transcript.
We note here that these definitions are accurately designed to minimize the chances of mistaking a complex AS event as those modeled with the notion of LSV for an AS event. For example, if we remove conditions (ii) and (iii) from the definition of intron retention, we could confuse the situation shown in Fig. 8 with an intron retention event.
Genome-wide analysis
ASGAL is specifically designed to perform AS prediction based on a splice-aware alignment of an experiment of RNA-Seq reads against a splicing graph of a specific gene. The current version of ASGAL is time efficient when a limited set of genes are analyzed, while for genome-wide analysis we have implemented a pre-processing step that aims to speed up the process of filtering reads that map to genes under investigation. Given a set of genes and a RNA-Seq sample, this filtering procedure consists of three main steps: (i) the quasi-mapping algorithm of Salmon is first used to quantify the transcripts of the genes and to quickly assign each read to the transcripts, (ii) a smaller set of RNA-Seq samples, one for each gene, is then produced by analyzing the output of Salmon, and finally (iii) if the input sample contains paired-end reads, then the mate of the mapped reads that were not mapped by Salmon are then added to these smaller samples. Once we split the input RNA-Seq experiment in smaller samples, it is possible to use ASGAL on the different genes without having to align the entire sample of reads against each of them. We note that we decided to use Salmon as pre-processing step since it is very fast and it allows to split the input sample faster than any other spliced aligner. Anyway, in aligning reads to a reference transcriptome, some reads which cover unannotated exons are not aligned, excluding them from any downstream analysis: for this reason, future works will focus on improving this step.