Biological sequences are composed of an ordered sequence of residues, which can be nucleotides (DNA/RNA sequences) or amino acids (protein sequences) [15]. These sequences are treated as strings composed of elements of the alphabets Σ={A,T,G,C}, Σ={A,U,G,C} and Σ={A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}, respectively. Protein and RNA sequences are rather small and their sizes range from hundreds to tens of thousands of residues (amino acids and nucleotide bases, respectively). On the other hand, DNA sequences can be very long, often composed of Millions of Base Pairs (MBP).
Similarity score and alignment
To compare two sequences, we need to find the best alignment between them, that is, how characters match when you overlap them [16]. In an alignment, spaces (gaps) can be inserted in arbitrary locations along the sequences so that they end up with the same size.
In order to measure the quality of a DNA sequence alignment a score is calculated, considering three cases: (a) matches (ma), if the characters of both sequences at the same column are identical; (b) mismatches (mi), if the characters in the same column are distinct and (c) gaps (gap), if one of the characters in the same column is a space. The score is the sum of all values assigned to the columns and a high score points to high similarity sequences. Figure 1a and b illustrate global and local alignments between two DNA sequences (S0=ACTTGTCCG and S1=ATGTCAG).
In Fig. 1, a single value is assigned for matches and mismatches (+1 and −1 in the example) regardless of the parts involved. This works well with nucleotides (DNA or RNA sequences) but not for proteins. During evolution, some combinations are more likely to occur than others, so higher scores are assigned to these combinations [17]. Therefore, the alignment of protein sequences employ scoring matrices, such as PAM (Percent Accepted Mutations) and BLOSUM (Blocks Substitution Matrix), which associate values for matches/mismatches that correspond to the likelihood of a particular combination [15]. PAM matrices have scores that are calculated by analyzing the frequencies in which a given amino acid is substituted by another amino acid during evolution. BLOSUM scoring matrices are created by evaluating evolutionary rates of a region of inside a protein (block) rather than considering the entire protein.
Exact algorithms: obtaining the optimal alignment
Algorithm NW for global alignment
The algorithm proposed by Needleman and Wunsh (NW) [12] is an exact method based on dynamic programming to obtain the optimal global alignment between two sequences in quadratic time and space. In fact, the algorithm originally proposed by NW had cubic time complexity [16], but later on, its complexity was reduced to quadratic, which is the version we describe in this paper. The algorithm is divided in two phases: create the DP matrix and obtain the optimal global alignment.
Phase 1 - calculate the DP matrix
In the first phase, the input sequences are S0 and S1, with |S0|=m and |S1|=n, where |Si| is the length of sequence Si. For sequences S0 and S1, there are m+1 and n+1 possible prefixes, respectively, including the empty sequence. In order to represent the n-th character of a sequence Si, the notation is Si[n]. Fianlly, we use Si[1..n] to characterize a prefix with n characters, from the beginning of the sequence.
In the initialization step, the first row and column are set to −Gx, where x is the size of the non-empty subsequence and G is the gap penalty. This represents the cost of aligning a non-empty subsequence with an empty one. In other words, the first row and column of the DP matrix are initialized with values H0,j=−Gj and Hi,0=−Gi. Additionally, H0,0=0. The remaining elements of H are calculated with the recurrence relation Eq. 1. The optimal score is the value contained in cell Hm,n.
$$ H_{i,j}=\max \left\{\begin{array}{l} H_{i-1,j-1}+p(i,j) \\ H_{i,j-1}-G \\ H_{i-1,j}-G \\ \end{array}\right. $$
(1)
In Eq. 1, if DNA or RNA sequences are compared, p(i,j) is usually the match value (ma) if S0[i]=S1[j] or the mismatch penalty (−mi), otherwise. If protein sequences are compared, p(i,j) is given by a scoring matrix (e.g. PAM or BLOSUM).
Figure 2 presents the DP matrix between sequences S0 = ATTGTCAGGAGG and S1 = ACTTGTCCGAGA. The arrows indicate the cell from where the value was obtained, according to Eq. 1. Cells with multiple arrows indicate that the same maximum value was produced by more than one cell (Hi−1,j−1, Hi,j−1, Hi−1,j).
Phase 2 - obtain the optimal global alignment
Phase 2 starts from the position that contains the optimal score (Hm,n) and follows the arrows until cell H0,0 is reached. A left arrow in Hi,j (Fig. 2) produces the alignment of S0[i] with a gap in S1. An up arrow aligns S0[j] with a gap in S1. Finally, a diagonal arrow indicates that S0[i] is aligned with S1[j].
Note that many optimal global alignments may exist, since many arrows may be present in the same cell Hi,j. Usually, the implementations restrict to one the optimal alignment, giving preference to a given type of arrow (diagonal, up, left).
Algorithm SW for local alignment
Local alignment must be employed when the goal is to obtain the similarity between regions inside the sequences. Smith and Waterman (SW) [13] proposed an exact algorithm for local alignment 1985 and, since then, this algorithm is widely used. Like NW (“Algorithm NW for global alignment” section), SW is also based on dynamic programming with quadratic time and space complexity. However, there are three basic differences between the algorithms NW and SW, concerning the calculation of the DP matrix (“Phase 1 - calculate the DP matrix” section) and the alignment retrieval.
In the initialization step, all elements of the first row and column are set to zero in SW. This is done because gaps should not receive any penalty at the beginning of the alignment.
The second difference is the recurrence relation itself since since negative values are not allowed in SW, as shown in Eq. 2.
$$ H_{i,j}=\max \left\{\begin{array}{l} H_{i-1,j-1}+p(i,j) \\ H_{i,j-1}-G \\ H_{i-1,j}-G \\ 0 \\ \end{array}\right. $$
(2)
In the second phase (obtain the optimal local alignment), the algorithm starts from the cell which has the highest value in the DP matrix, following the arrows until a zero-valued cell is reached.
Figure 3 presents the similarity matrix to obtain local alignments between sequences S0= TATAGGTAGCTA and S1= GAGCTATGAGGT. Note that, in this example, even though there are no multiple arrows leaving a single cell, two optimal alignments can be obtained, both of them with score 5. Most of the implementations of SW will only retrieve one of those optimal alignments.
Affine-gap
Algorithms NW (“Algorithm NW for global alignment” section) and SW (“Algorithm SW for local alignment” section) use the linear gap model, where each gap is assigned the same penalty. To produce more biologically relevant results, Gotoh [18] proposed an algorithm that implements the affine-gap model for the global alignment case. This model takes into account the observation that, in nature, gaps tend to occur together [17]. In this case, a higher penalty is assigned to open a gap (Gopen) than to extending it (Gextend). The cost of a sequence of gaps of length x in the affine gap model is γ(x)=Gopen+(x−1)∗Gextend.
The Gotoh algorithm calculates three DP matrices: H, E and F, where H keeps track of matches/mismatches and E and F keep track of gaps in each sequence. Time and space complexities are quadratic. Matrix H is calculated with Eq. 3 and matrices E and F use Eqs. 4 and 5, respectively [18]. The optimal score is the value of cell Hm+1,n+1. To do the traceback, the arrows are followed in matrix H when a match/mismatch occurs or in matrices E and F, to track multiple gaps [18].
$$ H_{i,j}=\max \left\{\begin{array}{l} 0 \\ E_{i,j} \\ F_{i,j} \\ H_{i-1,j-1} - p(i,j) \\ \end{array}\right. $$
(3)
$$ E_{i,j}=\max \left\{\begin{array}{l} E_{i,j-1}-G_{ext} \\ H_{i,j-1}-G_{first} \\ \end{array}\right. $$
(4)
$$ F_{i,j}=\max \left\{\begin{array}{l} F_{i-1,j}-G_{ext} \\ H_{i-1,j}-G_{first} \\ \end{array}\right. $$
(5)
Linear space
The quadratic space complexity of NW, SW and Gotoh imposes severe restrictions when long sequences are compared. In such cases, linear space algorithms must be used. Hirschberg [19] proposed one of the first linear space algorithms for exact pairwise global sequence comparison [19]. The algorithm employs a recursive divide and conquer strategy that works as follows. First, the DP matrix is computed in linear space from the beginning to the middle row (i∗), storing only the last row calculated. Second, the DP matrix is calculated from the end to the middle row over the reverses of the sequences. The algorithm then uses these two middle rows and obtains the position where the addition of both columns j is maximal. This position is called midpoint and it corresponds to an element that belongs to the optimal alignment [19]. The midpoint divides the matrix into two smaller submatrix, which are processed recursively, until trivial solutions are found.
Myers and Miller [20] (MM) adapted Hirschberg’s algorithm to the affine gap model (“Affine-gap” section), using two additional vectors to treat situations where a sequence of gaps occurs. Let \(i* = \frac {m}{2}\) be the middle row of the DP matrices, CC(j) be the minimum cost of a conversion of S0[1..i∗] to S1[1..j], DD(j) be the minimum cost of a conversion of S0[1..i∗] to S1[1..j] that ends with a gap, RR(n−j) be the minimum cost of a conversion of S0[i∗..m] to S1[j..n] and SS(n−j) be the minimum cost of a conversion of S0[i∗..m] to S1[j..n] that begins with a gap.
To find the midpoint of the alignment, the algorithm realizes a matching procedure against a) vectors CC with RR and b) vectors DD with SS. The midpoint is the coordinate (i∗,j∗), where j∗ is the position that satisfies the maximum value in Eq. 6.
$$ {max}_{j \in [0..n]} \left\{ max \left\{ \begin{array}{l} CC(j)+RR(n-j)\\ DD(j)+SS(n-j) - G_{open}\\ \end{array} \right. \right. $$
(6)
As in Hirshbergś, after the midpoint is found, the matrix is recursively split into smaller submatrix, until trivial solutions are found.
Heuristic algorithms
Usually, a given protein sequence is compared against thousands or even millions of sequences that compose genomic databases. Also, two long DNA sequences with more than a million base pairs are often compared.
In these scenarios, the use of exact algorithms such as NW and SW is often prohibitive in terms of execution time. For this reason, faster heuristic methods for local alignment were proposed which do not guarantee that the optimal result will be produced. Usually, these heuristic methods are evaluated using the concepts of sensitivity and sensibility. Sensitivity is the ability to recognize as many significant alignments as possible, including distantly related sequences. Selectivity is the ability to narrow the search in order to discard false positives [17]. Typically, there is a tradeoff between sensitivity and sensibility.
The FASTA (FAST-All) algorithm [21] was proposed in 1988 and it computes local alignments of DNA or protein sequences. It is based on FastP [22], which is a heuristic algorithm to compare a protein sequence to a genomic database composed of several protein sequences.
BLAST (Basic Local Alignment Search Tool) [23] was proposed in 1990 and it is based on FASTA. Nowadays, it is the most widely used heuristic tool for local sequence alignment. Like FASTA, the BLAST algorithm assumes that significant alignments have words of length w in common and it is divided into three well-defined phases: seeding, extension and evaluation.
The original BLAST algorithm searched for local alignments without considering gaps. In 1996 and 1997, two improved gapped versions of the original BLAST, NCBI-BLAST2 [24] and WU-BLAST2 [25], were proposed.