A general overview of the GramAlign algorithm is depicted in Figure 1. The set of sequences to be aligned, S, are regarded as input to the algorithm with S = {s1,...,s
N
}, where s
i
is the ithsequence and i ∈ {1,...,N}.
Distance Estimation
The first step in the procedure involves the formation of an estimate of the distance between each sequence s
m
and all other sequences s
n
∀ n ≠ m. The distance used in GramAlign is based on the natural grammar inherent to all information-containing sequences. Unfortunately, the complete grammar for biological sequences is unknown, and so cannot be used when comparing sequences. However, we do know that biological sequences have structures which correspond to functions. This in turn implies that biological sequences which correspond to proteins with similar functions will have similarities in their structure. Therefore, we use a grammar based on Lempel-Ziv (LZ) compression [18, 19] used in [20] for phylogeny reconstruction. This measure uses the fact that sequences with similar biological properties share commonalities in their sequence structure. It is also known that biological sequences contain repeats, especially in the regulatory regions [21]. When comparing sequences with functional similarity, non-uniform distribution of repeats among the sequences poses a problem to assess sequence similarity. As shown below, the proposed distance naturally handles such cases, which are difficult to be accounted for by alignment or sequence edit based measures.
An overview of the grammar-based distance calculation is shown in Figure 2 where a dictionary of grammar rules for each sequence is calculated. Initially, the dictionary = ∅ is empty, a fragment f1 = s
m
(1) is set to the first residue of the corresponding sequence, and only the first element s
m
(1) is visible to the algorithm. At the k th iteration of the procedure, the k th residue is appended to the k - 1 fragment and the visible sequence is checked. If fk∉ s
m
(1,...,k - 1) then fkis considered a new rule, and so added to the dictionary , and the fragment is reset, fk= ∅. However, if fk∈ s
m
(1,...,k - 1), then the current dictionary contains enough rules to produce the current fragment, i.e., . In either case, the iteration completes by appending the k th residue to the visible sequence. This procedure continues until the visible sequence is equal to the entire sequence, at which time the size of the dictionary is recorded along the diagonal of the grammar elements matrix, E
m, m
= |G
m
|. As will be shown, calculating the distance between sequences requires only the number of entries in the dictionary.
In the next step shown in Figure 2, each sequence is compared with all other sequences. In particular, consider the process of comparing sequences m and n. Initially, the dictionary = G
m
is set to that of sequence m, a fragment f1 = s
n
(1) is set to the first residue of the n th sequence, and the visible sequence is all of s
m
. The algorithm operates as described previously, resulting in a new dictionary size E
m, n
= |G
m,n
|. When complete, more grammatically-similar sequences will have a new dictionary size with fewer entries as compared to sequences that are less grammatically-similar. Therefore, the size of the new dictionary Em,nwill be close to the size of the original dictionary E
m,m
.
In the final step, the distance between the sequences is estimated using the dictionary sizes. Five different distance measures were suggested in [20]. This work used the distance measure
(1)
where m, n ∈ {1,...,N} are indices of two sequences being compared. This particular metric accounts for differences in sequence lengths, and normalizes accordingly. Thus, the final distance matrix D is composed of grammar-based distance entries given by (1). Smaller entries in D indicate a stronger similarity, at least in terms of the LZ-based grammar estimate. Intuitively, sequences with a similar grammar should be pairwise aligned with each other in order for progressive combining into an MSA.
To further improve the execution time, D is only partially calculated as follows. An initial sequence is selected and compared with all other sequences. The resulting distances are split evenly into two groups based on d, one containing the smallest distances, and the other containing the largest distances. The process is repeated recursively on each group until the number of sequences in a group is two. The benefit is that only N log(N) distances need to be calculated. The validity of only calculating these sets of distances stems from the transitivity of the LZ grammars being inferred. That is, if the grammar-based distances d
i, j
and d
j, k
are small, it is likely that d
i, k
is also small. By recursively dividing groups of extreme distances, only those distances which would likely be used in the spanning-tree creation process will actually be calculated.
Sequence Alphabet
The distance between sequences m and n as determined by (1) is based on how many additional rules need to be added to each grammar in order to generate both s
m
and s
n
. Because the real grammars are unknown, G
m
and G
n
are approximated by scanning the only observations available (i.e., s
m
and s
n
). The grammar approximation improves as the length of the observed sequences increases. And so, the distance calculations are a function of sequence lengths, becoming more accurate as the sequences increase in length. In practice, this calculation works well for DNA sequences, even of shorter lengths, because the approximated grammar of a DNA sequence can only contain rules involving words composed of combinations of elements from the alphabet {'A','C','G','T'}. This small alphabet allows for a rapid generation of a reasonable grammar since there are a relatively small number of permutations of letters. From a grammar perspective, amino acid sequences are generally much more difficult to process correctly using (1). The reason being the alphabet contains 23 letters, where each element is not equally different from all other elements. Due to the relatively large alphabet size, much longer sequences are necessary to generate a reasonable grammar approximation. Thus, the accuracy of distances calculated for sets of short amino acid sequences are diminished. Additionally, consider the substitution scores of 'L' and 'M' as taken from the GONNET250 and BLOSUM62 substitution matrices in Figure 3. Notice in (a) and (c), that 'L' receives a relatively high positive value when aligned with any of {'I','L','M','V'}. Similarly, in (b) and (d), 'M' receives a relatively high positive value when aligned with any of the same set. Additionally, both 'L' and 'M' generally receive high negative values when compared to letters other than {'I','L','M','V'}. When taking this type of scoring into account, the elements 'L' and 'M' could be considered the same letter in a grammatical sense.
Thus, GramAlign offers the option to use a "Merged Amino Acid Alphabet" when calculating the distance matrix. The merged alphabet contains 11 elements corresponding to the 23 amino acid letters grouped into the sets {'A','S','T','X'}, {'B','D','N'}, {'C'}, {'E','K','Q','R','Z'}, {'F'}, {'G'}, {'H'}, {'I','L','M','V'}, {'P'}, {'W'}, and {'Y'}. These groupings were determined by considering all 23 rows of the BLOSUM45, BLOSUM62, BLOSUM80 and GONNET250 substitution matrices, and only grouping elements that had a strong similarity across the entire row in all four matrices. The merged alphabet has the benefit of containing fewer elements allowing for more accurate distance estimates based upon shorter observed sequences. Also, the resultant merged-alphabet substitution matrices are more consistent in that a merged-letter score is high only when compared to itself. In practice, the average alignment scores increased when aligning the same data sets using the merged alphabet within the distance calculation, as compared to using the actual alphabet (results not shown). In either case, once the distances have been calculated, a tree based on these distances is used to determine which sequences should be pairwise aligned.
Tree Construction
The next step in the algorithm consists of constructing a minimal spanning tree T based on the distance matrix D. In particular, consider a completely connected graph of N vertices and edges, where the weight of an edge between vertices i and j is given by the (i, j)th element of the distance matrix, D
i, j
. This work uses Prim's Algorithm [22] to determine a minimal spanning tree T which may be used as a guide in determining the order for progressively aligning the set of sequences S.
Align Sequences
The minimal spanning tree T along with the set of sequences S, are processed by the "Align Sequences" block in Figure 1. This block is presented in more detail in Figure 4. The first two sequences from S to be aligned are given by T as the root sequence of T and the nearest sequence in terms of the LZ grammar distance. At the conclusion of the pairwise alignment process, the resulting alignment is stored in an ensemble of sequences.
In the following we describe the pairwise alignment procedure, the scoring system and the method for progressive alignment.
Dynamic Programming
At the core of most progressive MSA algorithms is some method for performing pairwise alignments between two sequences. This work uses a version of the [23] dynamic programming algorithm with affine gap scores as discussed in [2] to generate each pairwise alignment.
Scoring System
A significant ambiguity regarding the dynamic programming procedure is the scoring function used when comparing two elements, or when comparing an element with a gap.
Typically, the pairwise scoring function c() is simply a matrix of values, where each column and row represent one element in the alphabet. In this way, each cell of the matrix corresponds to some measure representing the likelihood that two sequence elements should be aligned with each other. The most well-known amino acid scoring matrices are the Percent Accepted Mutation (PAM) [24], BLOck SUbstitution Matrix (BLOSUM) [25] and GONNET [26]. GramAlign defaults to the GONNET250 substitution matrix for the scoring function c(), as other progressive alignment algorithms generally use it as the default choice (e.g., [14] and [16]).
Determining the best gap-open and gap-extension penalties is a challenging problem, made more difficult by introducing two different penalties to account for the beginning and ending tail gaps of alignments. The default gap penalties used by GramAlign have been adjusted to perform well based on the alignment sets presented in the results section.
Progressive Alignment
The ensemble is implemented as a doubly-linked list, where each node of the list represents a single column of the alignment. Each node of the ensemble contains an array of letters corresponding to the respective column alignment, a tally of gaps in the column, a weighted combination of substitution scores, and two gap penalties. Once the initial ensemble A(0,1) is constructed between the first two entries in T, the remaining sequences need to be added to the ensemble in the order defined by T. This is accomplished by checking T for the next sequence not already in the ensemble, call it sequence s
j
where j corresponds to the order in which the sequence was added to T; that is, j is the priority of the sequence. To progressively add s
j
to the alignment, a pairwise alignment between the ensemble A(0,...,j-1)and s
j
is created via the afore mentioned dynamic programming algorithm. While the algorithm used is a pairwise alignment algorithm the distance calculated at each step of the pairwise alignment is an average of the distances between the particular position being aligned in the new sequence and the corresponding amino acides or bases in the ensemble at that node. The new pairwise alignment is merged into the ongoing ensemble based on the trace-back. The process continues until all sequences have been added to the ensemble of sequences. When sequence s
j
is added to the current ensemble A(0,...,j-1), each node is updated to reflect the new column element.
Gap Adjustment
Once all N sequences have been progressively aligned, the final post-processing block in Figure 1, "Adjust MSA Gaps", is used to cluster gaps together. The adjustment is further detailed in Figure 5, where the ensemble A is scanned so a histogram H of gaps-per-column is generated.
The histogram H is scanned using an equidistant, user-adjustable sliding window about each column. For each column, when the number of gaps is greater than a user-adjustable threshold percentage of gaps-per-column, the following steps are taken. For each row in the column under consideration:
-
1.
If the current row has a gap, move to the next row;
-
2.
Otherwise, scan the current row of the neighboring columns within the window, beginning with the nearest columns and work outward;
-
3.
If a neighboring column has a gap in the current row and the neighboring column has fewer total gaps than the center column, shift the gap from the neighboring column into the column under consideration.
As an illustration, consider a portion of the ensemble
where x
m, n
represents any element other than a gap in column n of sequence m, and -
m, n
represents a gap in column n of sequence m. And so, the gap histogram for this section of ensemble is H = {...,1, 2, 3, 4, 0,...}. Assuming the gap threshold is 0.4, then only columns with more than two gaps will be considered for adjustment. In the example, H is scanned until column i is identified as having three gaps. Following the procedure, each row in column i is checked until a non-gap entry is found. In the example, the first non-gap entry x4, iis in row four. Assuming the gap window is 2, elements in the fourth row of the neighboring columns are checked for gap entries. In particular, column (i + 1) is checked first, with a gap entry -4, i+1. However, no shift occurs because a quick check of H shows that column (i + 1) has more gaps than column i. Continuing the scan, columns (i - 1) and (i + 2) are checked before another gap is found in column (i - 2). In this case, H indicates column (i - 2) has fewer gaps compared to column i, and so a blind shift of entries between (i - 2) and i occurs, resulting in the ensemble
where original indices are kept to depict which entries are shifted into which locations.
The result is a blind movement of sparse gaps into dense regions of gaps. Numeric simulations have shown this post-processing stage does not affect alignment scoring based upon the method used in the Results and Discussion section (results not shown). And so, the user-defined parameters are set to a threshold of 1.0 and a window of 0 columns by default thereby disabling the gap adjustment block. Should it be known there are conserved regions of gaps, the user may decide to enable this process to encourage gap grouping.
Algorithm Complexity
The algorithm complexity of GramAlign may be broken into five pieces, beginning with the generation of each sequence grammar dictionary, G
i
for i ∈ {1,...,N}, where N is the number of sequences. Suppose the average sequence length is L, then each G
i
results in complexity (L), so all dictionaries are generated with complexity (LN). Next, the distance matrix D is formed by recursively extending a grammar by all other sequences within it's neighborhood, each of which results in complexity (L), then splitting the neighborhood into two halves, resulting in a complexity (LN log(N)). The spanning tree T is constructed by searching over D with a complexity of (N2). The tree is used as a map in determining the order in which to perform N - 1 pairwise alignments, each requiring a complexity of (L2 + L). Thus, the progressive alignment process takes (L2N). The alignment ensemble is scanned and has gaps shifted in (LN) time. Thus, the entire time complexity for GramAlign is (LN + LN log(N) + N2 + L2N + LN), which simplifies to (N2 + L2N).