hoDCA is implemented in the julia language (0.6.2) [16], and depends directly on a) the ArgParse [17] module for command-line arguments and b) on the GaussDCA [18] module for performing preprocessing operations on the MSA and the implicit dependencies for those packages. A typical command-line call is julia hoDCA.jl Example.fasta Example.csv –No_A_Map=1 –Path_Map=A_Map.csv –MaxGap=0.9 –theta=0.2 –Pseudocount=4.0 –No_Threads=2 –Ign_Last=0 with input Example.fasta and output Example.csv. The latter consists of lists of all two-body contact scores Jij separated by at least one residue along the backbone. The meaning of the remaining (optional) parameters will become clear in the following.
General notes. For inference of parameters {hi,Jij,Vijk}, we use the mean-field approximation as described in [15] with a reduced alphabet for three-body couplings. This is accomplished by a mapping
$$ \mu: \{l|l\leq q \} \to \left\{\alpha | \alpha \leq q_{\text{red}}\right\}, $$
(1)
with q being the full alphabet of the MSA and qred≤q. On the one hand, this accounts for the so called curse of dimensionality [19], occuring if the size of the MSA is too small to observe all possible q3 combinations for each Vijk. On the other hand, this significantly reduces memory usage and allows for a faster computation of contact prediction scores. The mapping μ can be specified by Path_Map, which is a csv file with every row representing a mapping. No_A_Map tells which row to choose. As the bottleneck is still the calculation of three-body couplings, it can be performed using parallel threads by specifying the No_Threads flag.
In traditional DCA, the last amino acid q usually represents the gap character and is not taken into account for score computation within the l2-norm. In hoDCA, each two-body coupling state l≤q contains contributions from {n≤q|μ(n)=μ(l)} due to the reduced alphabet. We therefore take gap contributions into account by default, which can be changed by the Ign_Last flag.
MSA preprocessing. The MSA is read in by the GaussDCA module, ignoring sequences with a higher amount of gaps than MaxGap, and subsequently converted into an array of integers. However, in contrast to GaussDCA, we check for the actual number of amino acids types contained in the MSA given. We, then, reduce the alphabet from q=21 to the number of present characters (amino acid types). Afterwards, the reweighting for every sequence \(\vec {\sigma }^{(b)}\) is obtained by the GaussDCA module via \(w_{b}=1/|\{a \in \{1,...,B \}: \text {difference}\left (\vec {\sigma }^{(a)},\vec {\sigma }^{(b)}\right) \leq \texttt {theta} \}|\), where the difference is computed by the percentage hamming distance [6]. The aim of reweighting is to reduce potential phylogenetic bias.
Frequency computation. Empirical frequency counts for the full alphabet are computed according to [6]
$$\begin{array}{*{20}l} \begin{aligned} & f_{i}(l) = \frac{1}{\lambda_{c}+B_{eff}} \left(\frac{\lambda_{c}}{q}+ \sum\limits_{b=1}^{B}w_{b} \cdot \delta\left(\sigma_{i}^{(b)},l\right) \right) \\ & f_{ij}(l,m) = \frac{1}{\lambda_{c}+B_{eff}} \left(\frac{\lambda_{c}} {q^{2}} \right.\\ &~~~~~~~~~~~~~~~~~~~~ \left.+\sum\limits_{b=1}^{B}w_{b} \cdot \delta\left(\sigma_{i}^{(b)},l\right) \delta\left(\sigma_{j}^{(b)},l\right) \right), \\ \end{aligned} \end{array} $$
(2)
with δ being the Kronecker delta, B the number of sequences in the MSA, \(B_{eff}={\sum \nolimits }_{b=1}^{B}w_{b}\) and λc=Pseudocount·Beff. The Pseudocount parameter shifts empirical data towards a uniform distribution. This is necessary to ensure invertibility of the empirical covariance matrix in the mean-field approach.
Frequency counts for the reduced alphabet are computed via
$$\begin{array}{*{20}l} \begin{aligned} & f_{i}^{\text{red}}(\alpha) = \sum\limits_{\{ l | \mu(l)=\alpha \}} f_{i}(l) \\ & f_{ij}^{\text{red}}(\alpha,\beta) = \sum\limits_{\substack{\{ l | \mu(l)=\alpha \} \\ \{ m | \mu(m)=\beta \}}} f_{ij}(l,m) \\ & f_{ijk}^{\text{red}}(\alpha,\beta,\gamma) = \sum\limits_{\substack{\{ l | \mu(l)=\alpha \} \\ \{ m | \mu(m)=\beta \} \\ \{ n | \mu(n)=\gamma \}} } f_{ijk}(l,m,n). \\ \end{aligned} \end{array} $$
(3)
The computation of three-point frequencies takes some time and will be executed on No_Threads threads. For this, we parallelized their calculation over the sequence size N, meaning that the i-th process computes \(f_{ijk}^{\text {red}}\) for all k≥j≥i and fixed i. Besides the parallelization scheme, three-point frequencies are preprocessed in the same manner as one- and two-point frequencies.
Contact prediction scores. Contact prediction scores follow directly from two-body couplings. Two-body couplings are obtained within the mean-field approximation by
$$\begin{array}{*{20}l} \begin{aligned} J_{ij}(l,m) \approx & -g_{ij}(l,m) \\ &+ \sum\limits_{\substack{k=1, \\ k \neq i,j}}^{N} \sum\limits_{n=1}^{q-1} g_{ijk}^{\text{red}}(\mu(l),\mu(m),\mu(n)) \cdot f_{k}(n),\\ \end{aligned} \end{array} $$
(4)
where gij(l,m) is the inverse of the empirical two-point covariance matrix eij(l,m)=fij(l,m)−fi(l)fj(m). \(g_{ijk}^{\text {red}}(\alpha,\beta,\gamma)\) is given by a relation to the three-point covariance matrix over the reduced alphabet
$$ \begin{aligned} e_{ijk}^{\text{red}}(\alpha,\beta,\gamma) = & f_{ijk}^{\text{red}}(\alpha,\beta,\gamma)+2f_{i}^{\text{red}}(\alpha)f_{j}^{\text{red}}(\beta)f_{k}^{\text{red}}(\gamma) \\ &-f_{ij}^{\text{red}}(\alpha,\beta)f_{k}^{\text{red}}(\gamma) \\ &-f_{ik}^{\text{red}}(\alpha,\gamma)f_{j}^{\text{red}}(\beta) \\ &-f_{jk}^{\text{red}}(\beta,\gamma)f_{i}^{\text{red}}(\alpha) \\ \end{aligned} $$
(5)
via
$$ {{} \begin{aligned} g_{ijk}^{\text{red}}(\alpha,\beta,\gamma) = & \,-\,\sum\limits_{a_{1},b_{1},c_{1}=1}^{N} \sum\limits_{a_{2},b_{2},c_{2}\,=\,1}^{q_{\text{red}}-1} \left(e_{a_{1},b_{1},c_{1}}^{\text{red}}\left(a_{2},b_{2},c_{2}\right)\right. \\ &~~~~~~~~~~ \!\cdot \left.g_{i,a_{1}}^{\text{red}}\left(\alpha,a_{2}\right) \!\cdot\! g_{j,b_{1}}^{\text{red}}\left(\beta,b_{2}\right) \!\cdot\! g_{k,c_{1}}^{\text{red}}\left(\gamma,c_{2}\right) \right) \end{aligned}} \!\!, $$
(6)
where gij(α,β) is the inverse of the two-point covariance matrix over the reduced alphabet (see [15] for more details). For the calculation of scores, {Jij} are transformed into so called zero-sum gauge, satisfying \({\sum \nolimits }_{l}^{q} \hat {J}_{ij}(l,.)=\sum _{m}^{q} \hat {J}_{ij}(.,m)=0\), where "." stands for an arbitrary state via
$$ \begin{aligned} \hat{J}_{ij}(l,m)= & J_{ij}(l,m)+\frac{1}{q}\sum\limits_{r=1}^{q}\left[{\vphantom{\sum\limits_{s=1}^{q}}} -J_{ij}(r,m)-J_{ij}(l,r) \right.\\ &~~~~~+\left.\frac{1}{q}\sum\limits_{s=1}^{q}J_{ij}(r,s) \right] \\ &+\frac{1}{q_{\text{red}}}\sum\limits_{\substack{ k=1 \\ k \neq i,j}}^{N} \sum\limits_{\eta=1}^{q_{\text{red}}} \left[{\vphantom{\sum\limits_{s=1}^{q}}}V_{ijk}^{\text{red}}(\mu(l),\mu(m),\eta) \right.\\ & ~~~~~+\frac{1}{q}\sum\limits_{r=1}^{q} \left[{\vphantom{\sum\limits_{s=1}^{q}}} -V_{ijk}^{\text{red}}(\mu(r),m,\eta) \right.\\ & ~~~~~ -V_{ijk}^{\text{red}}(\mu(l),\mu(r),\eta) \\ &~~~~~ \left.\left.+\frac{1}{q}\sum\limits_{s=1}^{q}V_{ijk}^{\text{red}}(\mu(r),\mu(s),\eta) \right] \right] \\ \end{aligned} $$
(7)
The purpose of the gauge transformation is to shift local bias from two-body couplings into local fields [8, 20]. Above calculations are the most time consuming parts and run on No_Threads threads. The final scores result from average product correction (APC) of l2 norm [21] via
$$ S_{ij}=\left\|\hat{\textbf{J}}_{ij} \right\|_{2} -\frac{\left\|\hat{\textbf{J}}_{:j} \right\|_{2}\left\|\hat{\textbf{J}}_{i:} \right\|_{2}}{\left\|\hat{\textbf{J}}_{::} \right\|_{2}} $$
(8)
and \(\left \| \hat {\textbf {J}}_{ij} \right \|_{2} = \sqrt {{\sum \nolimits }_{l,m=1}^{q} \hat {J}_{ij}(l,m)^{2}}\).