In this section, we first give some formal definitions that will be useful for the remaining of the section. In the second subsection, the definitions of three versions of the spliced alignment problem are given under a unified framework allowing to compare theoretically the optimization criteria of the different versions of the problem. All existing cDNA-to-genome spliced alignment methods correspond either to the first or the second version of the problem that account for sequence similarity and splice signals in the input genomic sequence. The last version of the problems introduced in this paper additionally takes into account the splicing structure of the input sequences. In the third subsection, we describe our SFA-align algorithm for the new version of the spliced alignment problem. In the fourth subsection, we describe SFA-ortholog algorithm for the computation of CDS orthology groups, using spliced alignments.
Preliminary definitions: gene, CDS, spliced alignment
Given a set S, |S| denotes the size of S, and given a sequence T, length(T) denotes the length of T. Note that in the nomenclature used in this section, we call CDS the exon structure of a coding sequence and CDS sequence the nucleotide sequence of the coding sequence.
Gene and CDS: A gene sequence is a DNA sequence on the alphabet of nucleotides Σ={A,C,G,T}. Given a gene sequence g, an exon of g is represented as a pair of integers (a,b) such that a≤b and a and b are the start and end locations of the exon on the gene sequence. The sequence of the exon (a,b) is then denoted by g[ a,b]. A CDS of the gene sequence g is represented as a chain c={(a1,b1),…,(aj,bj)} of exons of g such that for any two successive exons (ai,bi) and (ai+1,bi+1), bi<ai+1. The ith exon of the CDS c is then denoted by c[i]. The set of introns induced by the CDS c is denoted by Intron(c)={(b1,a2),(b2,a3)…,(bj−1,aj)}. The set of known CDS of a gene g is denoted by \(\mathcal {C}(g)\) and the set of all exons of all konwn CDS of g is denoted by \(\mathcal {E}(g)=\bigcup _{c \in \mathcal {C}(g)}{c}\) (see Fig. 9 for an illustration).
The sequence of a CDS c of g is denoted by g[ c]. g[ c] is the concatenation of the sequences of the exons composing c. An exon of a CDS sequence g[c] is represented as a pair of integers (k,l) such that k≤l, and k and l are the start and end locations of the exon on the CDS sequence. In this case, the sequence of the exon (k,l) of g[ c] is denoted by g[ c][ k,l]. The set of exons composing a CDS sequence g[ c] is denoted by \(\mathcal {E}(g[\!c])\) (see Fig. 9 for an illustration).
Spliced alignment: A spliced alignment is an alignment of a CDS sequence against a gene sequence that allows to identify conserved exons sequences. Formally, a spliced alignment of a CDS sequence g[ c] against a gene sequence h is represented as a chain A={(k1,l1,a1,b1),…,(kj,lj,aj,bj)} of quadruplets called blocks such that for any block (k,l,a,b) of A, k≤l and k and l are the start and end locations of a segment on the CDS sequence, and a≤b and a and b are the start and end locations of a segment on the gene sequence or a=b=0; The ith block of a spliced alignment A is denoted by A[ i], and:
-
1
k1=1, lj=length(g[ c]) and for any two successive blocks A[ i]=(ki,li,ai,bi) and A[i+1]=(ki+1,li+1,ai+1,bi+1), we have li=ki+1−1.
-
2
for any two blocks A[ i1] and A[ i2] with i1<i2, we have \(b_{i_{1}} < a_{i_{2}}\) if \(a_{i_{1}} \neq 0\) and \(a_{i_{2}} \neq 0\).
A block (k,l,a,b) represents an alignment of a segment g[ c][k,l] of the CDS sequence with a segment h[ a,b] of the gene sequence. If a=b=0, then the gene segment h[ a,b] is empty and the block (k,l,a,b) represents a deletion of the CDS segment g[ c][ k,l] in the spliced alignment. We call such a block a deleted block. Otherwise, the block (k,l,a,b) represents a conservation between a putative CDS exon sequence g[ c][ k,l] and a putative gene exon sequence h[ a,b], and we call such a block a conserved block (see Fig. 10 for an illustration).
Condition 1. of the spliced alignment definition implies that the set of conserved and deleted blocks of the spliced alignment covers the entire CDS sequence. Condition 2. implies that the blocks are ordered in the alignment following an increasing order of their location on the CDS sequence, and this order is also preserved on the gene sequence.
For instance, in Fig. 9, the spliced alignment of g[ c1] against g is A={(1,4,9,12),(5,16,18,29),(17,24,49,56)} (see Fig. 10 for more general examples of spliced alignments with conserved and deleted blocks).
A spliced alignment A induces a set of putative gene intron segments. These intron segments are the gene segments that lie between two successive blocks of the spliced alignment that are conserved blocks. Formally, the set of introns induced by a spliced alignment A={(k1,l1,a1,b1),…,(kj,lj,aj,bj)} is denoted by Intron(A)= { (bi,ai+1)such that (ki,li,ai,bi)and (ki+1,li+1,ai+1,bi+1)are conserved blocks}. Two successive conserved blocks of the spliced alignment A also induce a junction between two successive segments in the CDS sequence that are separated by an intron segment in the gene sequence. The set of putative exon junctions induced the spliced alignment A is denoted by Junction(A)= { lisuch that (ki,li,ai,bi)and (ki+1,li+1,ai+1,bi+1)are conserved blocks}. Note that if all blocks composing A are conserved, then the number of introns induced by A is |Intron(A)|=|Junction(A)|=|A|−1.
A new constrained version of the spliced alignment problem
In this subsection, we first re-call two existing versions of the spliced alignment problem and we introduce a third more constrained version that takes additionally account of the splicing structure of input sequences. We discuss the motivations and limits of the different versions as we give their definitions.
Given a block (k,l,a,b) of a spliced alignment of a CDS sequence g[ c] against a gene sequence h, let sim(g[ c][ k,l],h[ a,b]) denote the score of an optimal global alignment between the CDS segment g[ c][ k,l] and the gene segment h[ a,b] in a given scoring scheme. Note that if a=b=0, then h[ a,b] is an empty segment. The following is a reformulation of the less constrained version of the spliced alignment problem introduced in [1]. This formulation gives a unified framework to formally define and compare all the versions of the spliced alignment problem formulated thereafter.
SPLICED ALIGNMENT PROBLEM I (SAP_I):
Input: A CDS sequence g[ c] from a gene sequence g ; a gene sequence h.
Output: A spliced alignment A of g[ c] against h that maximizes
$$\sum_{(k,l,a,b)\in A}{\mathtt{sim}(g[\!c][\!k,l],h[\!a,b])} $$
The SAP_I problem accounts only for the sequence similarity between the segments composing the blocks of the spliced alignment. In practice, in more than 99% of real cases of splicing, the spliced intron sequences start with a dinucleotide sequence “GT” and ends with a dinucleotide sequence “AG” corresponding to canonical splice sites [22, 23]. They exists also other non-canonical splice site pairs such as GC-AG, AT-AC that occur less frequently. Thus, in order to improve the accuracy of spliced alignments, a more constrained version of the problem allows to account for the extremities of intron segments induced by a spliced alignment. Given an intron (b,a) induced by two successive conserved blocks of a spliced alignment of a CDS sequence g[ c] against a gene sequence h, let splicesignals(h[ b,a]) denote a score of the putative intron segment h[ b,a] accounting for the presence or absence of known splice signals at the extremities of h[ b,a]. A putative intron segment with two canonical splice signals at its extremities has a higher score than a segment with only one which has a higher score than a segment without any canonical splice signal at its extremities. A more constrained version of the spliced alignment problem studied in [9, 12] for instance is the following.
SPLICED ALIGNMENT PROBLEM II (SAP_II):
Input: A CDS sequence g[ c] from a gene sequence g ; a gene sequence h. Output: A spliced alignment A of g[ c] against h that maximizes
$$\sum_{(k,l,a,b)\in A}{\mathtt{sim}(g[\!c][\!k,l],h[\!a,b])} $$
$$+ \sum_{(b,a)\in \mathtt{Intron}(A)}{\mathtt{splicesignals}(h[\!b,a])} $$
The SAP_I and SAP_II problems do not account for the splicing structure of the input sequences. In order to further improve the accuracy of spliced alignments, we define a more constrained version of the problem that takes into account the exon structure of the CDS sequence and the exon-intron structure of the gene sequence.
Given a putative intron (b,a)∈Intron(A) induced by two successive conserved blocks of a spliced alignment A of a CDS sequence g[ c] against a gene sequence h, let \(\mathtt {splicesites}_{\mathcal {E}(h)}(b,a)\) denote a score of the putative intron segment h[ b,a] accounting for the correspondance of its extremities with known splice sites in the gene sequence h. A putative intron segment whose extremities both correspond to known splice sites in the gene sequence receives a higher score than a segment with only one extremity corresponding to a known splice site which has a higher score than a segment without any correspondance to known splice sites at its extremities.
Similarly, for a putative CDS exon junction l∈Junction(A) induced by two successive conserved blocks of the spliced alignment A, let \(\mathtt {exonjunction}_{\mathcal {E}(g[c])}(l)\) denote a score of the putative CDS exon junction l accounting for its correspondance with a real exon junction in the CDS sequence g[ c]. If l corresponds to a real exon junction in \(\mathcal {E}(g[\!c]\!)\) it receives a higher score than if it does not correspond to a junction in \(\mathcal {E}(g[\!c]\!)\). The more constrained version of the problem introduced here is defined as follows.
SPLICED ALIGNMENT PROBLEM III (SAP_III):
Input: A CDS sequence g[ c] from a gene sequence g ; the set of exons \(\mathcal {E}(g[\!c]\!)\) of g[ c] ; a gene sequence h ; the set of exons \(\mathcal {E}(h)\) of h.
Output: A spliced alignment A of g[ c] against h that maximizes
$$\sum_{(k,l,a,b)\in A}{\mathtt{sim}(g[\!c][\!k,l],h[\!a,b])} $$
$$+\sum_{(b,a)\in \mathtt{Intron}(A)}{\mathtt{splicesignals}(h[\!b,a])}$$
$$+\sum_{(b,a)\in \mathtt{Intron}(A)}{\mathtt{splicesites}_{\mathcal{E}(h)}(b,a)}$$
$$+\sum_{l\in \mathtt{Junction}(A)}{\mathtt{exonjunction}_{\mathcal{E}(g[c])}(l)} $$
Examples of algorithms developed for the SAP_I problem which accounts only for the sequence similarity at the nucleotide or at the amino acid level are BLAT [10], DDS/GAP2 [24], SOAPsplice [8] and HSA [7]. Several algorithms have also been developed for the SAP_II problem, for instance Geneseqer [25], Sipdey [17], Splign [9], SIM4 [21], MGAlign [18], GMAP [12] and Spa [11]. These algorithms account for splice signals in addition to sequence similarity. A comparison of a subset of these tools has demonstrated the superiority of splice signal-based methods compared to only sequence similarity-based methods [12]. Moreover, it was shown in [9] that, among splice signal-based methods, the best performing current spliced alignment method for cDNA-to-gene alignment was Splign. Thus, taking account of more information about the input sequences improves the accuracy of spliced alignments. We then expect that accounting for information on the exon structure of CDS sequences and the exon-intron structure of gene sequences will further improve spliced alignments.
The SFA-align algorithm for the SAP_III problem
In this section, we describe a heuristic algorithm called SFA-align for the SAP_III problem. The algorithm follows a general approach followed by most heuristic methods developed for the SAP_I and SAP_II problems. The algorithm is decomposed into three steps that each accounts for the splicing structure of the input sequences. Note that in the case where the splicing structures of the input sequences are not provided, SFA-align includes a preliminary step in which the splicing structure of input sequences are inferred by computing a spliced alignment of each CDS against its own gene having a Percent Sequence Identity (PID) of 100. An overview of the main steps of SFA-align is depicted in Fig. 1. It starts with a fast local alignment to compute highly conserved local segments used as anchors in the alignment. Next, the anchor alignment are extended in order to maximize the exon coverage. Finally, between the anchors, a global alignment of the remaining segments can be applied to complete the alignment.
Step 1. Local alignments using tblastx. This step is achieved using Translated Blast (tblastx) [26] in order to obtain a preliminary set of local alignments between the input CDS sequence and gene sequence. Tblastx is used in order to account for the translation of the sequences into amino acid sequences. This allows to detect conserved exon segments translated into amino acid sequences even in the presence of translational frameshifts or nucleotide silent mutations.
Given the set of hits obtained using tblastx with a given threshold E-value, the following procedure is applied to obtain the final set of local alignments. i) Each hit is assigned to the exon of the CDS that the hit covers the more, and the hit is minimally trimmed at its extremities to cover only this CDS exon. Thus, the boundaries of a trimmed hit never exceed the boundaries of the CDS exon that it is assigned to ; ii) All hits within the same CDS exon are compared and only a subset of pairwise compatible hits with the lowest E-values is kept ; iii) Finally, the hits within different exons are compared and only compatible hits with the lowest E-values are kept ; iv) For each exon, the remaining compatible hits are gathered into a set of non-overlapping local alignments for the exon. A more detailed description of this procedure is given in Algorithm 2 in Additional file 1. Figure 11 provides an illustration of the result of the procedure on an example. In this example, Step i of the local alignment procedure associates hits 1 and 2 to exon E1, hits 3 and 4 to exon E2, hit 5 to exon E3, and hit 6 to exon E4. The extremities of the hit 5 are also trimmed so that the hit covers only exon E3. Step ii removes hit 4 from the list of hits associated to exon E2 because it is incompatible with hit 3 that has a lower E-value. Next, Step iii removes hit 6 from the list of hits associated to exon E4 because it is incompatible with hit 5 associated to exon E3 with a lower E-value. Finally, Step iv ends up with three local alignments covering exons E1, E2 and E3 illustrated by hits 1, 3, 5 in Fig. 12.
Step 2. Gapped extension of anchors. In this step, each local alignment obtained from Step 1 is extended in both directions in order to increase as much as possible the coverage of the CDS exon to which it is associated. The extension procedure allows an extended portion of an alignment to start with a succession of gaps whose number must be a multiple of 3 and shall not exceed a given number α of gaps. For example, a local alignment (“AAUCGGA”,“AAUCGGA”) that partially covers a CDS exon ~AAUCGGAUGGGUG~ could be extended on the right until the extremity of the exon following three possible configurations. It can be extended as (a) (“AAUCGGAUGGGUG”,“AAUCGGAUGGGUG”) without any gap at the start of the extension, or (b) (“AAUCGGAUGGGUG”,“AAUCGGA---GUG”) starting with 3 gaps in the gene sequence, or (c) (“AAUCGGA---UGGGUG”, “AAUCGGACCCUGGGUG”) starting with 3 gaps in the CDS.
Given a local alignment of a CDS segment and a gene segment represented by a block (k,l,a,b) such that k and l are the start and end positions of the segment in the CDS, and a and b are the start and end positions of the segment in the gene sequence, the gapped extension procedure applied on (k,l,a,b) is as follows. Let \((k^{\prime },l^{\prime })\) be the exon of the CDS to which the local alignment (k,l,a,b) is associated. Note that \(k^{\prime }\leq k \leq l \leq l^{\prime }\). i) If k′<k, then the alignment (k,l,a,b) can be extended on the left. The procedure tries all possible configurations of extension, (a) without any gap at the start of the extension, (b) starting with gaps in the gene sequence or (c) starting with gaps in the CDS, such that the extension does not overlap any other local alignment. For configurations (b) and (c), all possible numbers of gaps 3∗i with i ranging from 1 and α/3 are evaluated, the extension configuration having the highest identity score is returned. If this identity score is above a given identity threshold β, then the alignment is extended on the left, otherwise no extension is applied. i) If l<l′, then the alignment (k,l,a,b) can also be extended on the right. A procedure similar to the previous one is applied in order to try all possible configurations of extension on the right. A more detailed description of the procedure is given in Algorithm 3 in Additional file 1. Figure 12 also provides an illustration of the configurations of left extension that are explored by the procedure on an example.
Step 3. Global spliced alignment algorithm. For all CDS exons that were left completely unaligned by the previous steps, a dynamic programming algorithm for global spliced alignment is applied. Restricting the global alignment to remaining unaligned exons of the CDS allows to accelerate the global algorithm step by dividing the dynamic programming space. The following main recurrence formula is used.
$$\begin{array}{@{}rcl@{}} S(i,j)\,=\, \!max \!\left\lbrace\!\! \begin{array}{l} S(i-1,j-1) + score(i,j)\\ S(i-1,j)+ score(i,-);\\ S(i,j-1)+ score(-,j);\\ \max_{(j^{\prime}, j) \in \mathcal{E}(h)}\{S(i,j^{\prime}) \\ ~~~~~~~~~~~~~~~ + \mathtt{exonjunction}(i)\\ ~~~~~~~~~~~~~~~ + \mathtt{splicesignals}(h[j^{\prime},j])\\ ~~~~~~~~~~~~~~~ + \mathtt{splicesites}(j^{\prime},j) \};\\ \max_{l \in [I_{min},I_{max}]}\{S(i,j-l) \\ ~~~~~~~~~~~~~~~ + \mathtt{exonjunction}(i)\\ ~~~~~~~~~~~~~~~ + \mathtt{splicesignals}(h[j-l,j])\\ ~~~~~~~~~~~~~~~ + \mathtt{splicesites}(j-l,j) \}; \end{array} \right. \end{array} $$
(1)
In Formula (1), when computing the global spliced alignment between a CDS segment g[ c][ k,l] and a gene segment h[ a,b], S(i,j) is the maximum score of a global spliced alignment of the segment g[ c][ k,k+i−1] and the segment h[ a,a+i−1]. For instance, if g[ c][ k,l] has length m and h[ a,b] has length n, then S(m,n) is the maximum score of a global spliced alignment of g[ c][ k,l] and the segment h[ a,b].
In the formula, the three first cases contribute to the computation of sequence alignment scores within blocks of the spliced alignment, given score(i,j), score(i,−), score(−,j) that denote the scores of substitution, insertion and deletion of nucleotides. The two last cases contribute to evaluating the structure alignment score according to the correspondence of induced introns with known splicing signals and the splicing structure of input sequences. Formula (1) is an extension of the main recurrence formula used in the global alignment step of Splign [9]. The extension consists in accounting for exonjunction(i) and splicesites(j′,j) that are the scores for CDS exon junctions and genomic introns introduced for the definition of the SAP_III problem. The last case of the formula also contributes to the extension and allows to further account for the splicing structure of the unspliced genomic sequence.
Finally, any new alignment block computed in this step is kept if and only if its identity score is above a given identity threshold β, otherwise it is not added to spliced alignment.
The SFA-ortholog algorithm for the identification of CDS orthology groups
In this section, we give a definition of CDS orthology groups computed by the SFA-ortholog algorithm based on pairwise spliced alignments. We first start with a definition of orthologous CDS based on spliced alignment.
In [15], an extension of the concept of gene orthology to spliced transcript orthology was introduced. They defined orthologous transcripts as two structurally similar transcripts from two orthologous genes. The orthology relationship between two transcripts relies on the structural similarity between the transcripts. This structural similarity is evaluated using the CDS associated to the transcripts and their spliced alignments against genes.
CDS orthology: Let c1 and c2 be two CDS from two homologous genes g and h respectively. Let A1 be a spliced alignment of the CDS sequence g[c1] against the gene sequence h. The CDS c1 and c2 are orthologs if:
(1) | c1 |=| c2 | ;
(2) Intron(A1)=Intron(c2) ;
(3) for any i,1≤i≤ | c1 |, [length(c1[i])−length(c2[i])] % 3=0.
In other terms, c1 are c2 as orthologs if (1) they have the same number of exons | c1 |=| c2 | ; (2) the spliced alignment of c1 against h induces the same introns for c1 and c2, Intron(A1)=Intron(c2) ; (3) the lengths of each pair of corresponding exons in c1 and c2 are congruent modulo 3. Conditions (1) and (2) ensure that the two CDS have the same splicing structure. Condition (3) ensures that the two CDS are translated in the same codon phase in each pair of corresponding exons in order to generate similar protein sequences.
Note that this definition only requires that one of the spliced alignments of g[ c1] against h or h[ c2] against g supports the orthology relation. An alternative more stringent definition of CDS orthology consists in requiring the reciprocity, i.e. that both spliced alignment support the orthology relation.
CDS orthology groups: Given a set of CDS \(\mathcal {C}\) from a set of homologous genes \(\mathcal {G}\), the transitivity of the CDS orthology relation is assumed and used to identify distant orthologs in \(\mathcal {C}\) that cannot be directlty identified by means of the CDS structural similarity. Such orthologs could be missed because of partial spliced alignments due to low sequence similarity.
The CDS orthology relation on \(\mathcal {C}\) is then extended into an equivalence relation such that for any three CDS c1,c2,c3 in \(\mathcal {C}\), if c1 and c2 are orthologs and c2 and c3 are orthologs, then c1 and c3 are also orthologs. The CDS orthology groups are defined as the equivalence classes of the resulting equivalence relation.