### Quantifying similarity

When alignment algorithms make different homology predictions for a set of sequences, the columns of the resulting MSAs will contain different residues. The AlignStat R package contains functions for calculating all MSA comparison statistics and creating plots quantifying differences in a manner that is equivalent for nucleotide or amino acid sequences.

Each MSA of *n* sequences is treated as a matrix of characters (residues plus a gap character) with the same number of rows. The two matrices are therefore defined as *P* (of dimensions *n* x *p*) and *Q* (of dimensions *n* x *q*), where each row represents an aligned sequence. Residues can occur multiple times in a sequence and so are numbered by occurrence such that each character in a row has a unique designation (Additional file 1: Figure S4). This ensures that alignment columns that contain a non-homologous occurrence of a residue are correctly distinguished. For the matrices *P* and *Q*, each column vector pair *p*
_{
i
}, *q*
_{
j
} is compared to calculate the similarity measure *S*
_{
ij
} defined in Eq. 1 (where *p*
_{
i
} is the *i*th column of *P*, and *q*
_{
j
} is the *j*th column of *Q*).

$$ {S}_{ij} = \frac{1}{n}{\displaystyle \sum_{x=1}^n}\ \varepsilon \left({P}_{xi},{Q}_{xj}\right) $$

(1)

Where *S*
_{
ij
} is the similarity score for each column pair between *P* and *Q*, the equivalency function ε is defined in Eq. 2.

$$ \varepsilon \left(a,b\right)\left\{\begin{array}{c}\hfill 1\kern0.5em if\kern0.5em a\kern0.5em =\kern0.5em b\kern0.5em \wedge \kern0.5em a\kern0.5em \ne \kern0.5em "-"\hfill \\ {}\hfill 0\kern0.5em otherwise\hfill \end{array}\right. $$

(2)

The similarity matrix *S* can be visualised using the *plot_similarity_heatmap* function of the AlignStat R package. Evaluating *S* is the most computationally expensive calculation in the AlignStat scoring method and has been implemented in C++ for maximum efficiency.

#### Detailed match scoring for comparable MSA columns

For each column in *P* we find its “match” in *Q* by finding the index j at which *S*
_{
ij
} is maximized. The match between columns, *P*
_{
i
} and *Q*
_{
j
} is then categorised leading to the dissimilarity matrix, *D* (of dimensions *n* x *p* x *5*) based on the functions defined in Eq. 3 and Eq. 4. This matrix categorises five types of outcome when the reference and comparison alignments are compared. It is called the dissimilarity matrix because four of the five alternatives correspond to various types of mismatch.

$$ {D}_{xik} = {\varepsilon}_k\left({P}_{xi},{Q}_{xj}\right) $$

(3)

Where *ε*
_{
k
}
*(a,b)* is the kth equivalency function as defined in Eq. 4.

$$ \begin{array}{l}{\varepsilon}_1\left(a,b\right)\kern0.5em \left\{\begin{array}{c}\hfill 1\kern0.5em if\ a=b\kern0.5em \wedge \kern0.75em a\kern0.5em \ne "-"\hfill \\ {}\hfill 0\ otherwise\hfill \end{array}\right.\\ {}{\varepsilon}_2\left(a,b\right)\kern0.5em \left\{\begin{array}{c}\hfill 1\kern0.5em if\ a=b\kern0.5em \wedge \kern0.5em a="-"\hfill \\ {}\hfill 0\ otherwise\hfill \end{array}\right.\\ {}{\varepsilon}_3\left(a,b\right)\kern0.5em \left\{\begin{array}{c}\hfill 1\ if\ a\ne b \wedge b="-"\hfill \\ {}\hfill 0\ otherwise\hfill \end{array}\right.\\ {}{\varepsilon}_4\left(a,b\right)\kern0.5em \left\{\begin{array}{c}\hfill 1\kern0.5em if\ a\ne b\kern0.5em \wedge \kern0.5em a="-"\hfill \\ {}\hfill 0\ otherwise\hfill \end{array}\right.\\ {}{\varepsilon}_5\left(a,b\right)\kern0.5em \left\{\begin{array}{c}\hfill 1\kern0.5em if\ a\ne b\kern0.5em \wedge \kern0.5em a\kern0.5em \ne "-" \wedge \kern0.5em b\kern0.5em \ne "-"\hfill \\ {}\hfill 0\ otherwise\hfill \end{array}\right.\end{array} $$

(4)

Where the five *ε*
_{
k
}
*(a,b)* are equivalency functions (see supplementary information for formal definitions) with the following meanings. The first equivalency (*ε*
_{
1
}) is a ‘match’, in which the two characters are identical *and* not gaps. The second equivalency (*ε*
_{
2
}) is a ‘conserved gap’, when the both characters are gaps. A ‘merge’ is when *P* contains a gap, but *Q* contains any other character (*ε*
_{
3
}). Similarly, a ‘split’ is when *Q* contains a gap, but *P* contains any other character (*ε*
_{
4
}). Finally, a ‘shift’ is when two characters are not identical *and* neither are gaps (*ε*
_{
5
}). The *D* matrix is visualised using the *plot_dissimilarity_matrix* function of the *AlignStat* R package.

#### Summary statistics

The column averages of *D* are used to describe the sources of dissimilarity between the reference and comparison alignments at each alignment position and each equivalency, k. This leads to the results matrix *R* (of dimensions *5* x *p*) defined by Eq. 5.

$$ {R}_{ki}=\frac{1}{n}{\displaystyle \sum_{x=1}^n}{D}_{xi\mathrm{k}} $$

(5)

Where *R* is the results matrix, each row of which is used to summarise a source of dissimilarity from the *D* matrix.

The match row of the *R* matrix (*R*
_{1i
}) is visualised using the *plot_similarity_summary* function of the *AlignStat* R package. The merge, split and shift rows of the *R* matrix (*R*
_{3i
}, *R*
_{4i
} and *R*
_{5i
}) are referred to collectively as dissimilarities in AlignStat. They are visualised using the *plot_dissimilarity_summary* function.

A single, overall similarity score describes the weighted average similarity of the two MSAs, as defined in Eq. 6. The treatment of gaps in MSAs is complex [9, 10]. In this case, the most instructive measure is to exclude conserved gaps, to prevent results being skewed by the “similarity” of conserved gaps in low occupancy columns. Therefore, the overall score is the sum of the match characters as a proportion of characters that are not conserved gaps. A more stringent column score can also be calculated as the proportion of all columns that have a perfectly identical between the MSAs. A full worked example of the mathematical implementation is available in Additional file 1.

$$ score=\frac{\frac{1}{p}{\displaystyle {\sum}_{i=1}^p}{R}_{1i}}{1-\frac{1}{p}{\displaystyle {\sum}_{i=1}^p}{R}_{2i}} $$

(6)

Released versions of the R package are available through the comprehensive R archive network (CRAN) and active development versions are available on github (GitHub.com/TS404/AlignStat). In order to allow AlignStat to scale to large MSAs and provide an acceptable run time the core calculation of equivalency functions and scoring statistics was implemented in C++ using the Rcpp framework [11]. A simple web interface to the AlignStat R package is implemented by the Shiny framework and is available at AlignStat.Science.LaTrobe.edu.au. The source code for the user interface is available at Github.com/iracooke/AlignStatShiny.