ASGAL (Alternative Splicing Graph ALigner) is a tool for performing a mapping of RNASeq data in a sample against the splicing graph of a gene with the main goal of detecting novel alternative splicing events supported by the reads of the sample with respect to the annotation of the gene. More precisely, ASGAL takes as input the annotation of a gene together with the related reference sequence, and a set of RNASeq reads, to output (i) the spliced alignments of each read in the sample and (ii) the alternative splicing events supported by the sample which are novel with respect to the annotation. We point out that ASGAL uses the input reference sequence only for building the splicing graph as well as for refining the alignments computed against it, with the specific goal of improving the precision in the AS event type detection. Each identified event is described by its type, i.e. exon skipping, intron retention, alternative acceptor splice site, alternative donor splice site, its genomic location, and a measure of its quantification, i.e. the number of alignments that support the identified event.
This section is organized as follows. We first introduce the basic definitions and notions that we will use in the section spliced graphalignment, and finally we describe the steps of our method. For the sake of clarity, we will describe our method considering as input the splicing graph of a single gene: it can be easily generalized to manage more than a gene at a time. However, the current version of ASGAL tool cannot manage more than a limited set of genes. At the end of this section, we will propose a possible procedure an user can adopt to use our tool in a genomewide analysis.
Definitions
From a computational point of view, a genome is a sequence of characters, i.e. a string, drawn from an alphabet of size 4 (A, C, G, and T). A gene is a locus of the genome, that is, a gene is a substring of the genome. Exons and introns of a gene locus will be uniquely identified by their starting and ending positions on the genome. A transcript T of gene G is a sequence 〈[a_{1},b_{1}],[a_{2},b_{2}],…,[a_{n},b_{n}]〉 of exons on the genome, where a_{i} and b_{i} are respectively the start and the end positions of the ith exon of the transcript. Observe that a_{1} and b_{n} are the starting and ending positions of transcript T on the genome, and each [b_{i}+1,a_{i+1}−1] is an intron represented as a pair of positions on the genome. In the following, we denote by \(\mathcal {E}_{G}\) the set of all the exons of the transcripts of gene G, that is \(\mathcal {E}_{G}=\cup _{T \in \mathcal {T}}\mathcal {E}(T)\), where \(\mathcal {E}(T)\) is the set of exons of transcript T and \(\mathcal {T}\) is the set of transcripts of G, called the annotation of G. Given two exons e_{i}=[ a_{i},b_{i}] and e_{j}=[ a_{j},b_{j}] of \(\mathcal {E}_{G}\), we say that e_{i}precedes e_{j} if b_{i}<a_{j} and we denote this by e_{i}≺e_{j}. Moreover, we say that e_{i} and e_{j} are consecutive if there exists a transcript \(T \in \mathcal T\) and an index k such that e_{k}=e_{i} and e_{k+1}=e_{j}, and e_{i}, e_{j} in \(\mathcal {E}(T)\).
The splicing graph of a gene G is the directed acyclic graph \(\mathcal {S}_{G} = (\mathcal {E}_{G},E)\), i.e. the vertex set is the set of the exons of G, and the edge set E is the set of pairs (v_{i},v_{j}) such that v_{i} and v_{j} are consecutive in at least one transcript. For each vertex v, we denote by seq(v), the genomic sequence of the exon associated to v. Finally, we say that \(\mathcal {S}_{G}^{\star }\) is the graph obtained by adding to \(\mathcal {S}_{G}\) all the edges (v_{i},v_{j})∉E such that v_{i}≺v_{j}. We call these edges novel edges. Note that the novel edges represent putative novel junctions between two existing exons (that are not consecutive in any transcript of G). Figure 1 shows an example of the definitions of gene, exon, annotation, and splicing graph.
In the following, we will use the notion of Maximal Exact Match (MEM) to perform the spliced graphalignment of a RNASeq read to \(\mathcal {S}_{G}\). Given two strings R and Z, a MEM is a triple m=(i_{Z},i_{R},ℓ) representing the common substring of length ℓ between the two strings that starts at position i_{Z} in Z, at position i_{R} in R, and that cannot be extended in either direction without introducing a mismatch. Computing the MEMs between a string R and a splicing graph \(\mathcal {S}_{G}\) can be done by concatenating the labels of all the vertices and placing the special symbol ϕ before each label and after the last one, obtaining a string \(Z = \phi {\texttt {seq}}(v_{1}) \phi {\texttt {seq}}(v_{2}) \phi \ldots \phi {\texttt {seq}}(v_{\mathcal {E}_{G}}) \phi \) that we call the linearization of the splicing graph (see Fig. 1 for an example). It is immediate to see that, given a vertex v of \(\mathcal {S}_{G}\), the label seq(v) is a particular substring of the linearization Z. For the sake of clarity, let us denote this substring, which is the one related to seq(v), as Z[ i_{v},j_{v}]. Then, by employing the algorithm by Ohlebusch et al. [26], all the MEMs longer than a constant L between R and Z, thus between R and \(\mathcal {S}_{G}\), can be computed in linear time with respect to the length of the reads and the number of MEMs. Thanks to the special character ϕ which occurs in Z and not in R, each MEM occurs inside a single vertex label and cannot span two different labels. In the following, given a read R and the linearization Z of \(\mathcal {S}_{G}\), we say that a MEM m=(i_{Z},i_{R},l) belongs to vertex v if i_{v}≤i_{Z}≤j_{v} where [ i_{v},j_{v}] is the interval on Z related to the vertex label seq(v) (that is, seq(v)=Z[ i_{v},j_{v}]). We say that a MEM m=(i_{Z},i_{R},l) precedes another MEM \(m^{\prime }=\left (i^{\prime }_{Z}, i^{\prime }_{R}, l^{\prime }\right)\) in R if \(i_{R} < i^{\prime }_{R}\) and \(i_{R} + l < i^{\prime }_{R} + l^{\prime }\), and we denote this by m≺_{R}m^{′}. Similarly, when m precedes m^{′} in Z, we denote it by m≺_{Z}m^{′}, if the previous properties hold on Z and the two MEMs belong to the same vertex label seq(v). When m precedes m^{′} in R (in Z, respectively), we say that \(lgap_{R} = i^{\prime }_{R}  (i_{R} + l)\) (\({lgap}_{Z} = i^{\prime }_{Z}  (i_{Z} + l)\), respectively) is the length of the gap between the two MEMs. If lgap_{R} or lgap_{Z} (or both) are positive, we refer to the gap strings as sgap_{R} and sgap_{Z}, while when they are negative, we say that m and m^{′}overlap either in R or Z (or both). Given a MEM m belonging to the vertex labeled seq(v), we denote as PREF_{Z}(m) and SUFF_{Z}(m) the prefix and the suffix of seq(v) upstream and downstream from the start and the end of m, respectively. Figure 2 summarizes the definitions of precedence between MEMs, gap, overlap, PREF_{Z}, and SUFF_{Z}.
Spliced graphalignment
We are now able to define the fundamental concepts that will be used in our method. In particular, we first define a general notion of gap graphalignment and then we introduce specific constraints on the use of gaps to formalize a spliceaware graphalignment that is fundamental for the detection of alternative splicing events in ASGAL.
A gap graphalignment of R to graph \(\mathcal {S}_{G}\) is a pair (A,π) where π=〈v_{1},…,v_{k}〉 is a path of the graph \(\mathcal {S}_{G}^{\star }\) and
$$A = \left\langle (p_{1}, r_{1}), \left(p_{1}^{\prime}, r_{1}^{\prime}\right), \ldots, \left(p^{\prime}_{n1}, r^{\prime}_{n1}\right),(p_{n}, r_{n}) \right\rangle $$
is a sequence of pairs of strings, with n≥k, such that seq(v_{1})=x·p_{1} and seq(v_{k})=p_{n}·y, for x,y possibly empty strings and \(P= p_{1} \cdot p_{1}^{\prime } \cdot p_{2} \cdot p_{2}^{\prime } \cdot p_{3} \cdots p_{n1}^{\prime } \cdot p_{n} \) is the string labeling the path π and \(R = r_{1} \cdot r_{1}^{\prime } \cdot r_{2} \cdots r_{n1}^{\prime } \cdot r_{n} \).
The pair (p_{i},r_{i}), called a factor of the alignment A, consists of a nonempty substring r_{i} of R and a nonempty substring p_{i} of the label of a vertex in π. On the other hand, the pair \(\left (p_{i}^{\prime }, r_{i}^{\prime }\right)\) is called a gapfactor of the alignment A if at least one of \(p_{i}^{\prime }\) and \(r_{i}^{\prime }\) is an empty substring ε. Moreover, either \(p_{i}^{\prime }\) is empty or \(p_{i}^{\prime } > \alpha \), and either \(r_{i}^{\prime }\) is empty or \(r_{i}^{\prime } > \alpha \), for a fixed value α which represents the maximum alignment indel size allowed. When an insertion (or a deletion) is smaller than α, we consider it an alignment indel and we incorporate it into a factor; otherwise, we consider it as a clue of the possible presence of an AS event and we represent it as a gapfactor. We note that an “alignment indel” is a small insertion or deletion which occurs in the alignment, due to a sequencing error in the input data or a genomic insertion/deletion. Intuitively, in a gap graphalignment, factors correspond to portions of exons covered (possibly with errors) by portions of the read, while gapfactors correspond to introns, which can be already annotated or novel, and which can be used to infer the possible presence of AS events. We note that to allow the detection of alternative splice site events known as NAGNAG resulting in a difference of 3bps, if an alignment indel occurs at the beginning or at the end of an exon, we consider it during the detection of the events, even though it is not modeled as a gapfactor since in these cases the insertion may be smaller than α.
We associate to each factor (p_{i},r_{i}) the cost δ(p_{i},r_{i}), and to each gapfactor \(\left (p_{i}^{\prime },r_{i}^{\prime }\right)\) the cost \(\delta \left (p_{i}^{\prime },r_{i}^{\prime }\right)\), by using a function δ(·,·) with positive values. Then the cost of the alignment (A,π) is given by the expression:
$$ {\mathtt{cost}}(A, \pi) = \sum_{i=1}^{n} \delta(p_{i}, r_{i}) + \sum_{i=1}^{n1} \delta\left(p_{i}^{\prime},r_{i}^{\prime}\right). $$
Moreover, we define the error of a gap graphalignment as the sum of the edit distance of each factor (but not of gapfactors). Formally, the error of the alignment (A,π) is:
$$ {\mathtt{Err}}(A, \pi) = \sum_{i=1}^{n} d(p_{i}, r_{i}), $$
where d(·,·) is the edit distance between two strings.
To define a spliceaware alignment, that we call spliced graphalignment, we need to classify each gapfactor and to assign it a cost. Our primary goal is to compute a gap graphalignment of the read to the splicing graph that possibly reconciles to the gene annotation; if this is not possible, then we want to minimize the number of novel events. For this reason we distinguish three types of gapfactors: annotated, novel, and uninformative. Intuitively, an annotated gapfactor models an annotated intron, a novel gapfactor represents a novel intron, while an uninformative gapfactor does not represent any intron.
Formally, we classify a gapfactor \(\left (p_{i}^{\prime }, r_{i}^{\prime }\right)\) as annotated if and only if \(p^{\prime }_{i} = r^{\prime }_{i} = \epsilon \) and the two strings p_{i}, p_{i+1} are on two different vertices that are linked by an edge in \(\mathcal {S}_{G}\). We classify a gapfactor \(\left (p_{i}^{\prime }, r_{i}^{\prime }\right)\) as novel in the following cases:

1
\(r_{i}^{\prime } = \epsilon \) and \(p_{i}^{\prime }= \epsilon \) occurs between the strings p_{i} and p_{i+1} which belong to two distinct vertices linked by an edge in \(\mathcal {S}_{G}^{\star }\) and not in \(\mathcal {S}_{G}\) (i.e. this gapfactor represents an exon skipping event — Fig. 3a).

2
\(r_{i}^{\prime } = \epsilon \) and \(p_{i}^{\prime }\neq \epsilon \) occurs between the strings p_{i} and p_{i+1} which belong to the same vertex of \(\mathcal {S}_{G}^{\star }\) (i.e. this gapfactor represents an intron retention event — Fig. 3b). Actually, we note here that this type of gapfactor may represent also a genomic deletion: currently, our program does not distinguish between intron retentions and genomic deletions that are entirely contained in an exon, therefore we might overpredict intron retentions.

3
\(r_{i}^{\prime } = \epsilon \) and \(p_{i}^{\prime }\neq \epsilon \) occurs between the strings p_{i} and p_{i+1} which belong to two distinct vertices linked by an edge in \(\mathcal {S}_{G}^{\star }\) (i.e. this gapfactor represents an alternative splice site event — Fig. 3c).

4
\(r_{i}^{\prime } \ne \epsilon \) and \(p_{i}^{\prime }= \epsilon \) occurs between the strings p_{i} and p_{i+1} which belong to two distinct vertices linked by an edge in \(\mathcal {S}_{G}^{\star }\) (i.e. this gapfactor represents an alternative splice site extending an exon or a new exon event — Fig. 3de).
Note that Case 1 allows to detect a novel intron whose splice sites are both annotated (see Fig. 3a). Case 2 supports a genomic deletion or an intron retention (see Fig. 3b), and in case of intron retention, ASGAL finds the two novel splice sites inside the annotated exon. Case 3 gives an evidence of a novel alternative splice event shortening an annotated exon (see Fig. 3c) and ASGAL finds the novel splice site supported by this case. Finally, in Case 4, ASGAL is able to detect a novel alternative splice site (extending an annotated exon) or a novel exon (see Fig. 3d), but only in the first case (alternative splice site) ASGAL is able to find the novel splice site induced by the gapfactor.
For ease of presentation, Fig. 3 shows only “classic” AS event types and not their combination as those modeled with the notion of Local Splicing Variations (LSV) [13]. We note here that our formalization takes into account combinations of AS event types as those given by an exon skipping combined with an alternative splice site (see definition of gapfactor in cases 3 and 4). However, the actual version of the tool is designed only to detect the AS event types shown in Fig. 3. For completeness, in Fig. 4 we show the same AS event types (shown in Fig. 3) with respect to the annotated case, i.e. when the gapfactor is annotated and it represents an already known AS event.
Finally, we classify a gapfactor \(\left (p^{\prime }_{i}, r^{\prime }_{i}\right)\) as uninformative in the two remaining cases, which are (i) \(r_{i}^{\prime } = \epsilon \) and \(p_{i}^{\prime }= \epsilon \) occurs between strings p_{i} and p_{i+1} which belong to the same vertex, and (ii) \(r_{i}^{\prime } \neq \epsilon \) and \(p_{i}^{\prime }= \epsilon \) occurs between strings p_{i} and p_{i+1} which belong to the same vertex. We notice that in the former case, factors (p_{i},r_{i}) and (p_{i+1},r_{i+1}) can be joined into a unique factor.
Let \(\mathcal {G_{F}}\) be the set of novel gapfactors of a gap graphalignment A. Then a spliced graphalignment (A,π) of R to \(\mathcal {S}_{G}\) is a gap graphalignment in which uninformative gapfactors are not allowed, whose cost is defined as the number of novel gapfactors, and whose error is at most β, for a given constant β which models any type of error that can occur in an alignment (sequencing errors, indels, etc). In other words, in a spliced graphalignment (A,π), we cannot have uninformative gapfactors, and the δ function assigns a cost 1 to each novel gapfactor and a cost 0 to all other factors and annotated gapfactors: thus \({\texttt {cost}}(A,\pi) = \mathcal {G_{F}}\) and Err(A,π)≤β. We focus on a bicriteria version of the computational problem of computing the optimal spliced graphalignment (A,π) of R to a graph \(\mathcal {S}_{G}\), where first we minimize the cost, then we minimize the error. The intuition is that we want a spliced graphalignment of a read that is consistent with the fewest novel splicing events that are not in the annotation. Moreover, among all such alignments we look for the alignment that has the smallest edit distance (which is likely due to sequencing errors and polymorphisms) in the nonempty regions that are aligned (i.e. the factors). Figure 5 shows an example of spliced graphalignment of error value 2, and cost 2 — since it has two novel gapfactors.
In this paper we propose an algorithm that, given a read R, a splicing graph \(\mathcal {S}_{G}\), and three constants, which are L (the minimum length of a MEM), α (the maximum alignment indel size), and β (the maximum number of allowed errors), computes an optimal spliced graphalignment — that is, among all spliced graphalignments with minimum cost, the alignment with minimum error. The next section details how ASGAL computes the optimal spliced graphalignments of a RNASeq sample to the splicing graph \(\mathcal {S}_{G}\), and how it exploits novel gapfactors to detect AS events.
ASGAL approach
We now describe the algorithm employed by ASGAL to compute the optimal spliced graphalignments of a sample of RNASeq reads to the splicing graph of a gene, to be used in order to provide the alternative splicing events supported by the sample and a measure of their quantification (i.e. the number of reads supporting the event).
The ASGAL tool implements a pipeline consisting of the following steps: (1) construction of the splicing graph of the gene, (2) computation of the spliced graphalignments of the RNASeq reads, (3) remapping of the alignments from the splicing graph to the genome, and (4) detection of the novel alternative splicing events. Figure 6 depicts the ASGAL pipeline.
In the first step, ASGAL builds the splicing graph \(\mathcal S_{G}\) of the input gene using the reference genome and the gene annotation, and adds the novel edges to obtain the graph \(\mathcal S^{\star }_{G}\) which will be used in the next steps.
The second step of ASGAL computes the spliced graphalignments of each read R in the input RNASeq sample by combining MEMs into factors and gapfactors. For this purpose, we extend the approximate pattern matching algorithm of Beretta et al. [19] to obtain the spliced graphalignments of the reads, which will be used in the following steps to detect novel alternative splicing events. As described before, we use the approach proposed by Ohlebusch et al. in [26] to compute, for each input read R, the set of MEMs between Z, the linearization of the splicing graph \(\mathcal {S}_{G}\), and R with minimum length L, a userdefined parameter (we note that the approach of [26] allows to specify the minimum length of MEMs). We recall that the string Z is obtained by concatenating the strings seq(v) and ϕ for each vertex v of the splicing graph (recall that ϕ is the special character used to separate the vertex labels in the linearization Z of the splicing graph). We point out that the concatenation order does not affect the resulting alignment and that the splicing graph linearization is performed only once before aligning the input reads to the splicing graph.
Once the set M of MEMs between R and Z is computed, we build a weighted graph G_{M}=(M, E_{M}) based on the parameter α, representing the maximum alignment indel size allowed, and the two precedence relations between MEMs, ≺_{R} and ≺_{Z}, respectively. Then we use such graph to extract the spliced graphalignment. Intuitively, each node of this graph represents a perfect match between a portion of the input read and a portion of an annotated exon whereas each edge models the alignment error, the gapfactor of the spliced graphalignment, or both. More precisely, there exists an edge from m to m^{′}, with m,m^{′}∈M, if and only if m≺_{R}m^{′} and one of the following six conditions (depicted in Fig. 7) holds:

1
m and m^{′} are inside the same vertex label of Z, m≺_{Z}m^{′}, and either (i) lgap_{R}>0 and lgap_{Z}>0, or (ii) lgap_{R}=0 and 0<lgap_{Z}≤α. The weight of the edge (m,m^{′}) is set to the edit distance between sgap_{R} and sgap_{Z} (Fig. 7a).

2
m and m^{′} are inside the same vertex label of Z, m≺_{Z}m^{′}, lgap_{R}≤0, and lgap_{Z}≤0. The weight of the edge (m,m^{′}) is set to lgap_{R}−lgap_{Z} (Fig. 7b).

3
m and m^{′} are inside the same vertex label of Z, m≺_{Z}m^{′}, lgap_{R}≤0 and lgap_{Z}>α. The weight of the edge (m,m^{′}) is set to 0 (Fig. 7c).

4
m and m^{′} are on two different vertex labels seq(v_{1}) and seq(v_{2}), with v_{1}≺v_{2}, and lgap_{R}≤0. The weight of the edge (m,m^{′}) is set to 0 (Fig. 7d).

5
m and m^{′} are on two different vertex labels seq(v_{1}) and seq(v_{2}), with v_{1}≺v_{2}, lgap_{R}>0, and SUFF_{Z}(m)=PREF_{Z}(m^{′})=ε. The weight of the edge (m,m^{′}) is set to 0 if lgap_{R}>α, and to lgap_{R} otherwise (Fig. 7e).

6
m and m^{′} are on two different vertex labels seq(v_{1}) and seq(v_{2}), with v_{1}≺v_{2}, lgap_{R}>0, at least one between SUFF_{Z}(m) and PREF_{Z}(m^{′}) is not ε. The weight of the edge (m,m^{′}) is set to the edit distance between sgap_{R} and the concatenation of SUFF_{Z}(m) and PREF_{Z}(m^{′}) (Fig. 7f).
Note that the aforementioned conditions do not cover all of the possible situations that can occur between two MEMs, but they represent those that are relevant for computing the spliced graphalignments of the considered read. Intuitively, m and m^{′} contribute to the same factor (p_{i},r_{i}) in cases 1 and 2 and the nonzero weight of the edge (m,m^{′}) concurs to the spliced graphalignment error. In cases 35, the edge (m,m^{′}) models the presence of a novel gapfactor. More precisely, m contributes to the end of a factor (p_{i},r_{i}) and m^{′} contributes to the start of the consecutive factor (p_{i+1},r_{i+1}) and the novel gapfactor in between models an intron retention or a genomic deletion on an annotated exon (case 3), an alternative splice site shortening an annotated exon (case 4), and an alternative splice site extending an annotated exon or a new exon (case 5). Finally in case 6, m contributes to the end of a factor (p_{i},r_{i}) and m^{′} contributes to the start of the consecutive factor (p_{i+1},r_{i+1}) whereas the gapfactor in between can identify either a novel exon skipping event or an already annotated intron. In both these cases, the nonzero weight of the edge contributes to the spliced graphalignment error.
The spliced graphalignment of the read R is computed by a visit of the graph G_{M}. More precisely, each path π_{M} of this graph represents a spliced graphalignment and the weight of the path is the number of differences between the pair of strings in R and Z covered by π_{M}. For this reason, for read R, we select the lightest path in G_{M}, with weight less than β (the given error threshold) which also contains the minimum number of novel gapfactors, i.e. we select an optimal spliced graphalignment.
The third step of ASGAL computes the spliced alignments of each input read with respect to the reference genome starting from the spliced graphalignments computed in the previous step. Exploiting the annotation of the gene, we convert the coordinates of factors and gapfactors in the spliced graphalignment to positions on the reference genome. In fact, observe that factors map to coding regions of the genome whereas gapfactors identify the skipped regions of the reference, i.e. the introns induced by the alignment, modeling the possible presence of AS events (see Fig. 3 for details). We note here that converting the coordinates of factors and gapfactors to positions on the reference genome is pretty trivial except when factors p_{i} and p_{i+1} are on two different vertices and only \(p^{\prime }_{i}\) is ε (case de of Fig. 3). In this case, the portion \(r^{\prime }_{i}\) must be aligned to the intron between the two exons whose labels contains p_{i} and p_{i+1} as a suffix and prefix, respectively. If \(r^{\prime }_{i}\) aligns to a prefix or a suffix of this intron (taking into account possible errors within the total error bound α), then the left or right coordinate of the examined intron is modified according to the length of \(r^{\prime }_{i}\) (Fig. 3d). In the other case (Fig. 3e), the portion \(r^{\prime }_{i}\) is not aligned to the intron and it is represented as an insertion in the alignment. Moreover, the third step of our approach performs a further refinement of the splice sites of the introns in the obtained spliced alignment since it searches for the splice sites (in a maximum range of 3 bases with respect to the detected ones) determining the best intron pattern (firstly GTAG, secondly GCAG if GTAG has not been found).
In the fourth step, ASGAL uses the set \(\mathcal {I}\) of introns supported by the spliced alignments computed in the previous step, i.e. the set of introns associated to each gapfactor, to detect the novel alternative splicing events supported by the given RNASeq sample with respect to the given annotation. Let \(\mathcal {I}_{n}\) be the subset of \(\mathcal {I}\) composed of the introns which are not present in the annotation, that is, the novel introns. For each novel intron \(\left [p_{s}, p_{e}\right ] \in \mathcal {I}_{n}\) which is supported by at least ω alignments, ASGAL identifies one of the following events, which can be considered one of the relevant events supported by the input sample:

exon skipping, if there exists an annotated transcript containing two nonconsecutive exons [ a_{i},b_{i}] and [ a_{j},b_{j}], such that b_{i}=p_{s}−1 and a_{j}=p_{e}+1.

intron retention, if there exists an annotated transcript containing an exon [ a_{i},b_{i}] such that (i) a_{i}<p_{s}<p_{e}<b_{i}, (ii) there exists an intron in \(\mathcal {I}\) ending at a_{i}−1 or a_{i} is the start of the transcript and (iii) there exists another intron in \(\mathcal {I}\) starting at b_{i}+1 or b_{i} is the end of the transcript.

alternative acceptor site, if there exists an annotated transcript containing two consecutive exons [ a_{i},p_{s}−1] and [ a_{j},b_{j}] such that p_{e}<b_{j}, and there exists an intron in \(\mathcal {I}\) starting at b_{j}+1 or b_{j} is the end of the transcript.

alternative donor site, if there exists an annotated transcript containing two consecutive exons [ a_{i},b_{i}] and [ p_{e}+1,b_{j}] such that p_{s}>a_{i}, and there exists an intron in \(\mathcal {I}\) ending at a_{i}−1 or a_{i} is the start of the transcript.
We note here that these definitions are accurately designed to minimize the chances of mistaking a complex AS event as those modeled with the notion of LSV for an AS event. For example, if we remove conditions (ii) and (iii) from the definition of intron retention, we could confuse the situation shown in Fig. 8 with an intron retention event.
Genomewide analysis
ASGAL is specifically designed to perform AS prediction based on a spliceaware alignment of an experiment of RNASeq reads against a splicing graph of a specific gene. The current version of ASGAL is time efficient when a limited set of genes are analyzed, while for genomewide analysis we have implemented a preprocessing step that aims to speed up the process of filtering reads that map to genes under investigation. Given a set of genes and a RNASeq sample, this filtering procedure consists of three main steps: (i) the quasimapping algorithm of Salmon is first used to quantify the transcripts of the genes and to quickly assign each read to the transcripts, (ii) a smaller set of RNASeq samples, one for each gene, is then produced by analyzing the output of Salmon, and finally (iii) if the input sample contains pairedend reads, then the mate of the mapped reads that were not mapped by Salmon are then added to these smaller samples. Once we split the input RNASeq experiment in smaller samples, it is possible to use ASGAL on the different genes without having to align the entire sample of reads against each of them. We note that we decided to use Salmon as preprocessing step since it is very fast and it allows to split the input sample faster than any other spliced aligner. Anyway, in aligning reads to a reference transcriptome, some reads which cover unannotated exons are not aligned, excluding them from any downstream analysis: for this reason, future works will focus on improving this step.