Methods that determine genetic variants from NGS data by and large rely on computational methods that align short reads to reference genomes and detect differences between them. The task of aligning short reads to genomes consists of two separate steps: (1) mapping reads to correct chromosomal locations and (2) aligning reads correctly to those chromosomal locations. A read can be correctly mapped and incorrectly aligned. Misalignment at a correct chromosomal location can affect the determination of insertion-deletion variants (INDEL). An INDEL is represented in the form x
1|x
2|⋯|x
k
, which means that at that location the string x
1 appears in the reference genome, and any of x
1,x
2⋯x
k
can appear in another genome at that location.
To see how a read can be correctly mapped and incorrectly aligned, consider an example, in which the read TCAGG is correctly mapped to the genome at location p, and that the substring starting at this location of length 8 is TCACACAG. Depending on the model of alignment, there are two or three different optimal alignments:
The first alignment results in 2 INDEL calls: TCA |T at location p and ACA |A at location p+4. The second alignment results in an INDEL call ACACA |A at location p+2. And the third alignment results in an INDEL call TCACA |T at location p.
In an alignment model where gap extensions and openings are equally penalized, these three alignments are all optimal because the gaps in each alignment equate to a deletion of 4 bases. In a model such as the affine gap model, in which a gap opening is penalized more than a gap extension, however, there are only two optimal alignments (the second and third) because the first alignment would be penalized more than the other two. So, even in the more sophisticated affine gap model, there can be multiple optimal alignments, resulting in different INDEL calls. And if an aligner picks one of these based on some algorithmic bias, this bias will end up in a biased calling of INDEL.
The goal of this work is to examine known INDEL locations and determine if those locations permit multiple optimal alignments. Further, for INDEL locations that permit multiple optimal alignments, we aim to examine the possibility that they were constructed in a biased manner based on biased alignments of many popular short-read aligners.
Pairwise alignment
The mechanism by which aligners can create a biased alignment can be seen more easily by an examination of the basic pairwise alignment algorithm [14]. Although different alignment methods have different ways to speed up the mapping of reads to genomes, e.g. using an FM index or a hash table, the alignment itself is essentially the same formulation of optimal pairwise alignment, based on dynamic programming.
In a simple alignment model with no penalty for gap opening, an optimal alignment between x=x
1⋯x
n
and y=y
1⋯y
m
is found by constructing a matrix M, in which M[i,j] is the score of an optimal alignment between x
1⋯x
i
and y
1⋯y
j
, for 1≤i≤n and 1≤j≤m. With M[i,0]=i and M[0,j]=j, the matrix M is constructed based on the following relation:
$$ M[i,j] = \max\left\{\begin{array}{cc} M[i-1,j-1] +~ match~(x_{i},y_{j})\\ M[i-1,j] +~ \epsilon \\ M[i,j-1] +~ \epsilon \\ \end{array}\right. $$
(1)
where m
a
t
c
h(x
i
,y
j
) is the cost of substituting x
i
for y
j
and ε is the cost of deleting x
i
or inserting y
j
.
In the affine gap model, finding an optimal alignment between x and y depends on the computation of three matrices M,X, and Y. Here, M[i,j] is the score of an optimal alignment between x
1⋯x
i
and y
1⋯y
j
, where x
i
is aligned with y
j
. X[i,j] is the score of an optimal alignment in which x
i
aligns with a gap. And, Y[i,j] is the score of an optimal alignment in which y
j
aligns with a gap. The computation of the three matrices can be done based on the following relations:
$$ M[i,j] = \max\left\{\begin{array}{ll} M[i-1,j-1] + ~match~(x_{i},y_{j})\\ X[i,j] \\ Y[i,j] \\ \end{array}\right. $$
(2)
$$ X[i,j] = \max\left\{\begin{array}{ll} M[i-1,j] +~ (\epsilon + \rho) \\ X[i-1,j] +~ \epsilon \\ \end{array}\right. $$
(3)
$$ Y[i,j] = \max\left\{\begin{array}{ll} M[i,j-1] +~ (\epsilon + \rho) \\ Y[i,j-1] +~ \epsilon \\ \end{array}\right. $$
(4)
where ε is the cost of inserting or deleting a base, and ρ is the cost of inserting or deleting the first base (i.e. the penalty for gap opening).
Constructing all optimal alignments
In Eqs. 1–4, when there exist more than one ways to achieve a maximal value, the choice adopted by an alignment algorithm will be arbitrary. Further, each arbitrary choice of maximal value of each step will lead to a specific optimal alignment. Thus, given the existence of more than one maximal cases to choose from in Eqs. 1–4, there will necessarily be multiple optimal alignments, which all have the same alignment scores despite being slightly different from one another.
To construct all optimal alignments under the non-affine gap model after the matrix M is filled, one starts from the entry with the highest cost and retraces all steps at which optimal decisions (as specified in Eq. 1) are made. The following procedure constructs all optimal alignments in the non-affine model, after the matrix M is computed: 1: Find (i,j) such that M[i,j] is maximum. 2: return
T
r
a
c
e(M,i,j)
As described in Algorithm 1, the call T
r
a
c
e(i,j) returns all optimal alignments ending at x
i
and y
j
. Trace is done by identifying whether each of the three conditions in Eq. 1 is optimal. If the condition is optimal, Trace is called recursive to obtain all optimal alignments starting at that entry. By induction, the three recursive calls return all possible optimal alignments just before x
i
and y
j
. Then, the algorithm correctly returns the union of all of the optimal alignments ending at x
i
and y
j
.
To construct all optimal alignments under the affine gap model, the process is similar. After the matrices M,X, and Y are filled, one starts from the entry of M with maximum value and retraces all the steps at which optimal decisions (as specified in Eqs. 2, 3, 4) are made. The only technical difference is that we need to specify the appropriate matrix (either M, X, or Y) in each recursive call.