Estimation of duplication history under a stochastic model for tandem repeats

Background Tandem repeat sequences are common in the genomes of many organisms and are known to cause important phenomena such as gene silencing and rapid morphological changes. Due to the presence of multiple copies of the same pattern in tandem repeats and their high variability, they contain a wealth of information about the mutations that have led to their formation. The ability to extract this information can enhance our understanding of evolutionary mechanisms. Results We present a stochastic model for the formation of tandem repeats via tandem duplication and substitution mutations. Based on the analysis of this model, we develop a method for estimating the relative mutation rates of duplications and substitutions, as well as the total number of mutations, in the history of a tandem repeat sequence. We validate our estimation method via Monte Carlo simulation and show that it outperforms the state-of-the-art algorithm for discovering the duplication history. We also apply our method to tandem repeat sequences in the human genome, where it demonstrates the different behaviors of micro- and mini-satellites and can be used to compare mutation rates across chromosomes. It is observed that chromosomes that exhibit the highest mutation activity in tandem repeat regions are the same as those thought to have the highest overall mutation rates. However, unlike previous works that rely on comparing human and chimpanzee genomes to measure mutation rates, the proposed method allows us to find chromosomes with the highest mutation activity based on a single genome, in essence by comparing (approximate) copies of the pattern in tandem repeats. Conclusion The prevalence of tandem repeats in most organisms and the efficiency of the proposed method enable studying various aspects of the formation of tandem repeats and the surrounding sequences in a wide range of settings. Availability The implementation of the estimation method is available at http://ips.lab.virginia.edu/smtr. Electronic supplementary material The online version of this article (10.1186/s12859-019-2603-1) contains supplementary material, which is available to authorized users.


SM.1 Proof of Eq. (1)
The following set of conditions must be satisfied for our analysis. Among them, we assume (A1). It is easy to see that (A2)-(A3) are true, and (A4)-(A5) will be evident after we find (A1) There exists K ∈ N such that q i = 0 for i > K.
Furthermore, for the second term of the right side of Eq. (S.1), we have Note that M n is a bounded martingale difference sequence.

SM.2 Proof of Eq. (3)
Let s = s (n) = s 1 · · · s |s| be the sequence at time n. At time n + 1, either a substitution or a duplication has happened. In the former case, suppose the symbol at position i is changed to another symbol of the alphabet, and in the latter case, suppose that the substring s i+1 · · · s i+ is duplicated in a tandem manner; after duplication the sequence becomes s 1 · · · s i s i+1 · · · s i+ s i+1 · · · s i+ s i+ +1 · · · s |s| .
Fix the value of i. For = 0, i.e., the case of a substitution, where s i denote the new (mutated) symbol.
Now we consider the case of > 0, which corresponds to tandem duplications. For The conditions resulting in the first summation j ≤ i and j + r > i + and those resulting in the second summation are i < j ≤ i + or i + < j + r ≤ i + 2 . For > r, R n , we first find the following expected values, where r > 0 and where i is randomly and uniformly distributed among the L n positions: Using Eq. (S.4), Eq. (S.5), and Eq. (S.6), we can find δ and thus h . It can also be verified that (A4)-(A5) hold. Eq. (3) then follows.

SM.3 Proof of Eq. (7)
To determine the stability of Eq. (5) we use the Gershgorin circle theorem. We note that The circles centered at A rr and with radius j A rj − A rr = 2q 0 3 + r (1 − q 0 − q 2r ) in the complex plane either do not intersect the right half of the plane or they intersect it only at 0. Hence, by the Gershgorin circle theorem, the eigenvalues of A are either 0 or have negative real parts. Let (λ j ) m−1 j=0 denote the eigenvalues of A, with λ 0 = 0 and λ j = a j + ıb j for j > 0, where a j < 0 and ı where c rj (t) are polynomials in t of degree at most m. Since ρ r (t) is bounded between 0 and 1, it is evident that c r0 (t) is in fact a constant. Let this constant be denoted by ρ r ∞ . We thus have

SM.4 Proof of Lemma 2
Proof. Let B = (B rj ) be an m × m matrix, with rows and columns indexed by 0, . . . , m − 1, defined by Since q 0 = 0, we have Null (B) = Null (A). We further recall that q i ∈ R, q i ∈ [0, 1], and i q i = 1. Additionally, assume i 1 < i 2 < · · · < i k are the only indices for which q i j > 0. Finally, we assume m is large enough to enable us to see all the nonzero q i 's in the matrix, or more formally, we require m ≥ i k .
We are interested in finding the null-space of B. Instead of doing this directly, we consider the matrix The goal now is to find the right eigenspace of A for the eigenvalue 1.
First we prove S(d) ⊆ Null(B). We do this by showing that for all v i ∈ S(d), v i is in the right eigenspace of A corresponding to the eigenvalue 1, i.e., A v i = v i . This is immediate when we note that when i ≡ ±a (mod d), in the ith row of A (numbering of rows and columns starts from 0), coordinates j ≡ ±a (mod d) contain all the elements q id , and in particular, q i 1 , q i 2 , . . . , q i k .
It is obvious that the vectors in S(d) are linearly independent. To complete the proof we need to show that the geometric multiplicity of the eigenvalue 1 of A is |S(d)| = d/2 + 1.
The matrix A is stochastic, and therefore, its spectral radius is ρ(A ) = 1. Let G A be the (weighted) directed graph whose adjacency matrix is given by A . By Perron-Frobenius theory, it is well known that the eigenvalues of A are the union (in the multiset sense) of the eigenvalues of the irreducible components of G A . Additionally, the geometric multiplicity of ρ(A ) (also called the Perron-Frobenius (PF) eigenvalue of A ) is 1 for each irreducible component.
Combining the above, and remembering the PF eigenvalue of an irreducible graph is a weighted average of the out-weight of its vertices, we obtain that the geometric multiplicity of ρ(A ) = 1 is exactly the number of irreducible sink components of G A . Thus, as a final step in the proof, we show that the number of irreducible sink components of G A is exactly d/2 + 1.
Let us denote the vertices of the graph G A by w 0 , w 1 , . . . , w m−1 . From each w , > 0, we have k out-going edges corresponding to i 1 , i 2 , . . . , i k . The edge corresponding to i j is directed from w to w −i j when ≥ i j , and otherwise to w i j − . When describing a path we shall refer to this edge as "taking i j from w ". Finally, vertex w 0 has a single out-going edge which is also a self-loop.
By construction, all vertices w for ≥ i k have incoming edges from vertices w with > . Thus, they are certainly not part of an irreducible sink component. We therefore concentrate on vertices w 0 , w 1 , . . . , w i k −1 only. We now look at the irreducible components over these vertices.
For each i j , starting from w , 0 < < i k , we can take a path representing the orbit of −i j modulo i k . If ≥ i j , then we can move from w to w −i j . If < i j we can also do this but we require two steps: from w to w i j − by taking an i j step, and then from w i j − to w i k −(i j − ) by taking an i k step. Indeed Since we can do this for every i j every node w is connected to every node w with ≡ (mod d). Also, by taking the i k edge from each node, we can see that from every node w we can reach every node w with ≡ ± (mod d).
The only exception to the above are nodes w with ≡ 0 (mod d) since they get "stuck" at w 0 , which is an irreducible sink component on its own. We therefore reach the conclusion that there are exactly d/2 + 1 irreducible sink components which are of the form

SM.5 Proof of Lemma 3
Proof. Consider a matrix A obtained from A by replacing the first all-zero row with the row vector (1, 0, . . . , 0). By a simple application of the Gershgorin circle theorem, for all r > 0, and therefore all the eigenvalues of A are non-zero, i.e., rank(A ) = m. Thus, we have rank(A) = m − 1, and therefore dim Null(A) = 1.
We now show Av = 0, which along with dim Null(A) = 1, implies that Null (A) = Span(v). Let (Av) r denote the rth element of Av for r = 0, 1, . . . , m − 1. Since the 0th row of A is all zero, we have (Av) 0 = 0. Based on Eq. (6), for (Av) r = 0 to hold when r > 0, we This holds for r ≡ 0 (mod d) if we let v j = 1 4 for all j ≡ 0 (mod d). Finally, for r ≡ 0 (mod d), r > 0, we can choose v d , v 2d , . . . such that the above equality holds as these are not restricted in the statement of the lemma.

SM.6 Estimation for copy number ≤ 3
Since our method relies on asymptotic approximation, for short sequences, specifically those with copy number ≤ 3, we provide an alternative estimation algorithm. In such sequences there is ≤ 2 duplication events of length equal to d and 0 or more substitutions. The number of duplications can be found easily from the length of the sequence. Let a i be the number of distinct symbols appearing at the ith position (relative to the start of the pattern) of different copies minus 1. For example, for ACTGCTACT, we have a 1 = 1, since two symbols, A and G, appear in the first position of different copies, and a 2 = a 3 = 0. The a i can be used to infer the number of substitutions. A substitution will contribute to a i only if it occurs after the first duplication event. To account for hidden substitutions, we estimate the number of substitutions as ( i a i ) (r+1) r , where r is the number of duplication events. So we have estimates both for the number of substitutions and the number of duplications. Note that in this simple analysis, we have assumed that each substitution results in a new symbol, which is a reasonable assumption for a small number of mutations.
All other values of q are set to 0. We then perform n mutation steps, each a substitution with probability q 0 or a tandem duplication of length id with probability q id , for i ∈ {1, 2, 3}.
If in a tandem duplication step, the length chosen for duplication is larger than the length of the sequence, the whole sequence is duplicated. Note that since the length of the sequence grows, such an event may only happen a few times at the beginning of the process.