XSTREAM: A practical algorithm for identification and architecture modeling of tandem repeats in protein sequences

Newman, Aaron M; Cooper, James B

doi:10.1186/1471-2105-8-382

Software
Open access
Published: 11 October 2007


BMC Bioinformatics volume 8, Article number: 382 (2007)


Abstract

Background

Biological sequence repeats arranged in tandem patterns are widespread in DNA and proteins. While many software tools have been designed to detect DNA tandem repeats (TRs), useful algorithms for identifying protein TRs with varied levels of degeneracy are still needed.

Results

To address limitations of current repeat identification methods, and to provide an efficient and flexible algorithm for the detection and analysis of TRs in protein sequences, we designed and implemented a new computational method called XSTREAM. Running time tests confirm the practicality of XSTREAM for analyses of multi-genome datasets. Each of the key capabilities of XSTREAM (e.g., merging, nesting, long-period detection, and TR architecture modeling) is demonstrated using anecdotal examples, and the utility of XSTREAM for identifying TR proteins was validated using data from a recently published paper.

Conclusion

We show that XSTREAM is a practical and valuable tool for TR detection in protein and nucleotide sequences at the multi-genome scale, and an effective tool for modeling TR domains with diverse architectures and varied levels of degeneracy. Because of these useful features, XSTREAM has significant potential for the discovery of naturally-evolved modular proteins with applications for engineering novel biostructural and biomimetic materials, and identifying new vaccine and diagnostic targets.

Background

Repeated sequences, often organized as extended tandem arrays, abound in biology, and computational approaches have been critical for the identification and analysis of such sequence elements from genomic data. Tandem Repeats (TRs) are formally defined as two identical copies of finite non-empty words with no intervening characters [1]. Since biological sequences evolve naturally by mutation, both by base substitutions and insertions/deletions (indels), a biological TR is defined as two or more sufficiently similar biological words lacking intervening characters, where sufficiency is arbitrarily defined. The work described in this paper focuses exclusively on non-evolutionary TRs (for evolutionary TR detection, see [2]), each of which has three important properties: a consensus sequence, which is a word representing the TR pattern; a period, which is the number of characters in the consensus sequence; and a copy number, which is the number of words in the entire TR domain.
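The three properties can be illustrated on an exact (non-degenerate) TR. The following is a minimal sketch, not part of XSTREAM: the function name `describe_exact_tr` and the example peptide are hypothetical, and copy number is computed as domain length divided by period, which assumes that partial copies at the domain edge count fractionally.

```python
def describe_exact_tr(domain: str, period: int) -> dict:
    """Summarize an exact TR domain for a given period (illustrative helper)."""
    consensus = domain[:period]
    # In an exact TR, every character equals the character one period upstream.
    is_exact = all(domain[k] == domain[k - period] for k in range(period, len(domain)))
    return {
        "consensus": consensus,
        "period": period,
        "copy_number": len(domain) / period,  # assumption: fractional copies allowed
        "is_exact": is_exact,
    }


# Hypothetical domain: three full copies of "PGA" plus a partial fourth copy.
info = describe_exact_tr("PGAPGAPGAPG", 3)
print(info)
```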

Bioinformatics studies of TRs have primarily focused on DNA. DNA TRs are traditionally classified on the basis of increasing period into microsatellites, minisatellites, and large-scale duplications. In some human TR loci, copy number changes are associated with triplet-repeat expansion diseases that include Huntington's disease and Fragile X Syndrome [3]. Because genomic TR loci are often highly polymorphic, even expanding and contracting from generation to generation, DNA TRs have forensic and biomedical applications, and may play important roles in genome evolution [4, 5].

Nucleotide repeats occurring in protein coding genes can result in protein sequences containing repetitive elements. Though less studied than DNA repeats, peptide repeats are likewise known to be widespread in nature [6–8]. Peptide TRs impart a modular architecture to proteins and are found in important structural proteins such as animal collagens and keratins, insect and spider silks, plant cell wall extensins, and the proteins that form adhesive plaques and byssal threads of bivalve mussels [9–13]. TR domains are also found in other modular proteins, including prion proteins, ice nucleation and antifreeze proteins, FG-rich proteins in nuclear pore complexes, surface antigens of microbial pathogens and parasites, histones, and zinc-finger transcription factors [14–20]. Peptide TRs may provide an evolutionary shortcut for the modular construction of new proteins through recombination and copy number adjustment [6, 7, 21, 22]. To understand both the evolutionary diversity and functional significance of protein TRs, facile methods for the a priori identification and analysis of TRs from protein sequence databases will be critical.

Numerous bioinformatics tools have been developed for de novo repeat detection in DNA and protein sequences. One class of tools utilizes sequence self-alignment (SSA) [23–26]. Importantly, SSA approaches allow for the substitutions and indels in repeat sequences that often arise in biology. Because protein repeat detection tools that use SSA (RADAR, TRUST, Pellegrini et al. method) detect all repeated sequences, not only TRs, these algorithms may incorrectly characterize TR domains as non-TRs. With Ω(n²) time complexity (where n = length of input sequence), SSA algorithms are less than ideal for long protein sequences and repeat detection in large multi-genome datasets. An alternative strategy implemented for a priori peptide repeat detection is based on a sliding window (SW) approach [22, 26–28]. In general, SW algorithms are simple to implement, but do not readily accommodate indels and are thus likely to miss many degenerate TRs. The Ω(n³) time complexity of SW algorithms used to detect repeats of all periods also renders this strategy inappropriate for analysis of long sequences.

An efficient heuristic employed for detecting DNA TRs in whole genome data relies on seed extension (SE) [29, 30]. Seed extension algorithms have Ω(n) time complexity for repeat detection, and depending on implementation, can approximate O(n) time complexity, making them fast enough for analyses of large sequence databases. Furthermore, since SE allows for both indels and substitutions, this method is very appropriate for repeat finding applications in naturally evolving biological sequences.

To complement and improve upon current software tools for peptide repeat detection, we implemented a SE algorithm to explicitly locate exact and degenerate (with substitutions and indels) TRs of all periods in protein sequences. This new tool, called XSTREAM for Variable ('X') Sequence Tandem Repeats Extraction and Architecture Modeling, was designed to efficiently mine large genomic datasets for TRs of any period, to effectively characterize degenerate TR domains, and to produce concise TR output. Important features of XSTREAM include novel heuristics that achieve 1) practical running time without period limitations, 2) effective reduction of TR output redundancy, 3) merging of discontinuous degenerate TR domains, 4) identification of nested TR architectures, and 5) TR domain clustering. Though developed specifically for analyzing TR protein sequences, XSTREAM works equally well to extract TR patterns in DNA sequences, or for that matter, TRs in any ASCII string of characters. The practical utility of XSTREAM is demonstrated through testing and validation using publicly available genome sequence data.

Implementation

The XSTREAM program implements a SE approach that includes heuristics to efficiently and effectively detect exact and degenerate TRs of any period from large input sequence datasets. The program utilizes two important strategies in addition to SE to achieve practical running times without period limitations: a user-modifiable sequence alignment method called Gap-Restricted Dynamic Programming (GRDP), and a new long-period TR filter (both described in the Appendix). In addition, XSTREAM applies several strategies, including the use of irreducible repeats, to effectively combat the redundancy in TR detection inherent in biological TR sequences. Other novel features incorporated into XSTREAM include merging of degenerate TR domains and modeling of nested TR architectures. XSTREAM provides non-redundant output of TRs meeting a suite of user-defined criteria for attributes such as minimum and maximum period, minimum copy number, minimum domain length, minimum % input sequence coverage, and maximum character mismatch.

Algorithm

The primary functionalities of XSTREAM, as shown in Figure 1, can be divided into five high level stages: Pre-Processing, TR Detection, TR Characterization, Post-Processing, and Output. For a technical description of the algorithm, presented within the same organizational context, refer to the Appendix section.

Pre-Processing

For processing by XSTREAM, input sequences must be in FASTA format. Valid sequences are sent to the seed detection module. XSTREAM searches the input sequence for short exact substring repeats, or seeds, of two or three sizes, depending on the input length (see [29] for an excellent example of the use of seeds, or k-tuple probes in TR detection). Seed pairs are used to provide starting points and potential periods for TR detection. The use of seeds allows XSTREAM to rapidly identify putative TRs. For every adjacent pair of matching seeds, XSTREAM records both the sequence distance between them and the sequence index of the leftmost seed. Each distance is a potential TR period.
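The seed pairing step can be sketched as follows. This is a minimal illustration under stated assumptions rather than XSTREAM's own code: the function name `find_seed_pairs` is hypothetical, a single seed size `k` is used (XSTREAM uses two or three sizes chosen by input length), and "adjacent" is taken to mean consecutive occurrences of the same seed.

```python
def find_seed_pairs(seq: str, k: int = 3):
    """Return (left_index, distance) for each adjacent pair of identical k-mers.

    The distance between the two seeds is a candidate TR period.
    """
    last_seen = {}  # seed -> index of its most recent occurrence
    pairs = []
    for idx in range(len(seq) - k + 1):
        seed = seq[idx:idx + k]
        if seed in last_seen:
            left = last_seen[seed]
            pairs.append((left, idx - left))
        last_seen[seed] = idx
    return pairs


# Hypothetical input: "ABCD" repeated, so every seed pair suggests period 4.
print(find_seed_pairs("ABCDABCDAB", k=3))
```

Because each k-mer is looked up in a hash map, this pass is linear in the sequence length for a fixed seed size.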

TR Detection

Following seed detection, XSTREAM attempts to extend each seed pair. Two sequence iterators move downstream from each seed in a parallel manner, returning characters for comparison. Running totals of character match and mismatch are kept. We define i as the amount of character matching required between two tandemly arranged words in order for them to be designated a TR. For example, if i is set to 0.8, then at least 80% of the aligned characters between two words at a given period must be identical. Seed extension always stops when, for any seed pair, the iterator for the leftmost seed collides with the rightmost seed. If at any point during the procedure the character mismatch count divided by the current potential period equals or exceeds 1 - i, seed extension is aborted, thereby reducing running time. Similarly, seed extension is prematurely terminated if the match count becomes sufficiently high. To include indels during seed extension, we use a novel heuristic, which is presented in the Appendix section.
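The gap-free case of seed extension can be sketched as follows. This is a simplified illustration, not XSTREAM's implementation: the function name `extend_seed_pair` and its signature are hypothetical, the indel heuristic from the Appendix is omitted, and the early exit on a sufficiently high match count is left out. The abort test follows the rule stated above (mismatch count divided by period reaching 1 - i).

```python
def extend_seed_pair(seq: str, left: int, period: int, i: float = 0.8) -> bool:
    """Compare the word starting at `left` with the word one period downstream.

    Returns True when extension completes without the mismatch fraction
    reaching 1 - i. Gap-free sketch only.
    """
    if left + 2 * period > len(seq):
        return False  # not enough sequence for two full copies
    mismatches = 0
    # The left iterator stops when it reaches the start of the right word.
    for offset in range(period):
        if seq[left + offset] != seq[left + period + offset]:
            mismatches += 1
            # Abort as soon as the running mismatch fraction reaches 1 - i.
            if mismatches / period >= 1 - i:
                return False
    return True


# Hypothetical words of period 10 with one and three substitutions.
one_mismatch = "ACDEFGHIKL" + "ACDEFGHIKM"
three_mismatches = "ACDEFGHIKL" + "ACDEFGHWWW"
print(extend_seed_pair(one_mismatch, 0, 10))      # 10% mismatch
print(extend_seed_pair(three_mismatches, 0, 10))  # 30% mismatch
```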

Each candidate TR resulting from successful seed extension is subjected to further expansion using the same basic mechanism as seed extension. XSTREAM examines sequence space both downstream and upstream of the current candidate domain using increments equal to the TR period. Potential repeat copies are evaluated by comparing new sequence space with the reference repeat, which is the leftmost repeat resulting from the initial seed extension. If indels are allowed and if domain expansion using seed extension fails to agree with i, we invoke a second strategy. The second approach, termed GRDP (see Appendix), can more accurately perform a subsequence pairwise comparison at the expense of slightly increased running time. A novel feature of our implementation is the user's ability to limit the maximum width of the dynamic programming (DP) matrix (parameter g), resulting in Θ(n) time and space complexities for global pairwise alignments.
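To show why restricting the DP matrix width gives linear cost, the following sketch computes a global edit distance over a diagonal band. It is a generic banded alignment and not GRDP itself, whose details are given in the Appendix. The function name `banded_edit_distance`, the unit costs, and the reading of g as the maximum allowed offset from the main diagonal are all assumptions made for illustration.

```python
def banded_edit_distance(a: str, b: str, g: int):
    """Global edit distance restricted to DP cells with |row - col| <= g.

    Returns None when no alignment fits in the band (length difference > g).
    Only about (2g + 1) cells are filled per row, so cost is linear in len(a)
    for a fixed g.
    """
    if abs(len(a) - len(b)) > g:
        return None
    INF = float("inf")
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = {c: c for c in range(0, min(len(b), g) + 1)}
    for r in range(1, len(a) + 1):
        cur = {}
        for c in range(max(0, r - g), min(len(b), r + g) + 1):
            if c == 0:
                cur[c] = r
                continue
            substitute = prev.get(c - 1, INF) + (a[r - 1] != b[c - 1])
            gap_in_b = prev.get(c, INF) + 1
            gap_in_a = cur.get(c - 1, INF) + 1
            cur[c] = min(substitute, gap_in_b, gap_in_a)
        prev = cur
    return prev[len(b)]


# Hypothetical repeat copies: the second copy has lost one residue.
print(banded_edit_distance("PGAPGA", "PGAGA", g=1))
```

With g = 0 the band collapses to the main diagonal and the result is a plain mismatch count for equal-length words.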

Following domain expansion, we instantiate a procedure called maximality. Employing a user-adjustable scoring scheme, maximality finds the longest stretch of characters both downstream and upstream that can legitimately be added to each candidate TR. This procedure is invoked because TRs in nature do not always occur in integer copy numbers and XSTREAM's TR domain expansion method is limited to integer copies.

Finally, XSTREAM masks input sequence space corresponding to each maximally extended candidate TR. Sequence masking prevents further seed extensions in sequence regions that constitute TR domains, thus functioning to prevent output redundancy as well as reduce running time. For details of sequence masking, refer to Redundancy Elimination I as well as Two-stage TR detection in the Appendix.

TR Characterization

To further refine each candidate TR, XSTREAM segments every TR domain into its component copies. Parsing can be accomplished by a trivial subdivision of the TR domain using the current period, an optimal subdivision using wrap-around dynamic programming (WDP, [31]), or a heuristic subdivision using GRDP. For details about implementation and when each method is invoked, refer to the Appendix section.

Following TR parsing, each TR undergoes a multiple alignment of its copies. A procedure identical in concept to STAR Alignment is used when indels are allowed. Because practical running time is emphasized in our implementation, pairwise sequence comparisons during STAR Alignment may be computed in a non-optimal manner using GRDP.

Following multiple alignment of each TR, a consensus sequence is computed. Each consensus is democratically derived using the majority rule. In addition, XSTREAM computes an error associated with the consensus – the lower the error, the stronger the agreement between the consensus and its represented domain. We define I as the minimum allowable matching between the consensus and the aligned TR for the TR to be reported to the user. For example, if I equals 0.8, then the consensus error cannot exceed 0.2 or 20% disagreement.
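Majority-rule consensus building and the gap-free consensus error can be sketched as follows. This is an illustration under stated assumptions: the function name `consensus_and_error` and the example copies are hypothetical, copies are assumed to be already aligned without gaps, and ties are resolved by whichever residue appears first in the column (the source does not say how XSTREAM breaks ties).

```python
from collections import Counter


def consensus_and_error(copies):
    """Majority-rule consensus of equal-length copies plus its mismatch fraction.

    Error = mismatches to the consensus / total characters in the aligned domain.
    """
    period = len(copies[0])
    if any(len(copy) != period for copy in copies):
        raise ValueError("gap-free sketch: all copies must have the same length")
    consensus = []
    mismatches = 0
    for column in zip(*copies):
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue)
        mismatches += len(column) - count
    total = period * len(copies)
    return "".join(consensus), mismatches / total


# Hypothetical aligned copies with two substitutions among 16 characters.
cons, err = consensus_and_error(["PGAK", "PGAK", "PGTK", "PAAK"])
print(cons, err)
# With I = 0.8 the domain is reportable only if err <= 1 - 0.8.
print(err <= 1 - 0.8)
```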

Next, XSTREAM inspects the edges of each aligned TR domain (with TR copy number greater than 2) for accordance with the consensus. If either edge mismatches with the consensus, that edge is truncated. Since all TRs must have at least 2 copies, edge trimming is not performed on TR domains with TR copy number = 2.

Occasionally, because of matching considerations, TR domains are identified with periods that are reducible. Therefore, the last step of TR Characterization functions to reduce overestimated TR periods (see Redundancy Elimination II in the Appendix).
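For an exact consensus, period reduction amounts to finding the shortest word whose repetition reproduces the consensus. The sketch below shows only this exact case. The function name `irreducible_period` is hypothetical, and the procedure XSTREAM actually applies (Redundancy Elimination II in the Appendix) may tolerate degeneracy that this version does not.

```python
def irreducible_period(consensus: str) -> int:
    """Smallest p dividing len(consensus) such that the consensus is its
    first p characters repeated; returns len(consensus) if irreducible."""
    n = len(consensus)
    for p in range(1, n + 1):
        if n % p == 0 and consensus[:p] * (n // p) == consensus:
            return p
    return n


# A period-6 consensus that is really two copies of a period-3 word.
print(irreducible_period("PGAPGA"))
# A single substitution makes the period-6 consensus irreducible.
print(irreducible_period("PGAPGT"))
```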

Post-Processing

XSTREAM attempts to merge sufficiently similar TRs that either overlap in the input sequence or are in close enough proximity to one another. To compute sufficient similarity, XSTREAM invokes the concept of cyclical permutations, which enables effective consensus sequence comparison (see Merging and Consensus Comparison in the Appendix). As a result, XSTREAM can identify TR domains with large regions of indels and/or substitutions that, without merging, would be reported as separate TRs. This procedure is thus important for detecting rapidly evolving TR sequences.
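The reason cyclical permutations matter is that the same TR can be reported with its consensus starting at any phase, so "PGA" and "GAP" describe the same repeat. A minimal sketch of phase-independent consensus comparison follows. The function names are hypothetical, only equal-length consensus sequences are compared, and gaps are not handled; the method XSTREAM uses is described in the Appendix.

```python
def is_cyclic_permutation(a: str, b: str) -> bool:
    """True when b is an exact rotation of a."""
    return len(a) == len(b) and b in a + a


def best_rotation_identity(a: str, b: str) -> float:
    """Highest fraction of identical positions between a and any rotation of b."""
    if len(a) != len(b) or not a:
        return 0.0
    n = len(a)
    best = 0
    for shift in range(n):
        rotated = b[shift:] + b[:shift]
        matches = sum(x == y for x, y in zip(a, rotated))
        best = max(best, matches)
    return best / n


print(is_cyclic_permutation("PGA", "GAP"))
# One substitution plus a phase shift of two positions.
print(best_rotation_identity("PGAK", "AKPT"))
```

Trying every rotation costs quadratic time in the period, which is acceptable for short consensus sequences but would need a faster scheme for very long periods.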

Following merging, XSTREAM invokes a series of finalizing functions called finishing touches, which serve to fine-tune the characterization of each TR domain as well as remove TRs that are insufficiently fit for output. TR characterization refinement involves rerunning maximality, redoing multiple alignment, rerunning reducibility, and looking for nested TRs (see Appendix). After additional characterization, finishing touches removes TRs with unacceptable amounts of overlap (see Redundancy Elimination III in the Appendix). Finally, remaining TRs are tested for agreement with user-defined filtration criteria.

All TRs that satisfy the output criteria are sent to the consensus comparison (CC) module. CC clusters TRs on the basis of consensus similarity. By ordering TRs by consensus sequence homology in the output, XSTREAM reduces output redundancy while facilitating the identification of TR families from the input dataset. Related TRs may reflect structural or functional homology of their corresponding protein sequences. The current implementation of CC only compares TRs of equal period.

Output

XSTREAM automatically generates HTML files in a format similar to the output from Tandem Repeats Finder (TRF) [29]. HTML output 1 contains a TR summary table and list of TR information, including sequence positions, period, and copy number. The range of sequence positions for each TR is hyperlinked to HTML output 2, which displays TR multiple alignments and consensus sequences. In the case of a multiple sequence input, XSTREAM generates HTML output 3, which reports a list of all input sequences containing reported TRs. An additional output option is a colored TR schematic, in PNG or HTML format, that represents the modular architectures of TR-containing sequences. The main user-definable output parameters of XSTREAM are presented in Table 1. A list of all user-defined parameters can be found on the XSTREAM webserver [32].

Table 1 User-defined parameters


Results

XSTREAM was coded using Java Standard Edition 5.0. To evaluate our implementation, we demonstrated and validated key features of XSTREAM using a variety of input datasets. First, a run time analysis shows the practicality of XSTREAM for TR detection in whole genomic sequence data. Second, multiple sequence alignments, merging, and nesting are demonstrated using anecdotal output examples. Third, the ability of XSTREAM to detect protein TR domains is validated using published results from five protozoan parasite genomes. Finally, we present schematic diagrams illustrating the utility of XSTREAM for graphically depicting modular architectures of TR proteins. In all cases, default parameter values were used unless stated otherwise (see Table 1). All tests and data collection were carried out using a Windows XP PC with a 64-bit AMD Athlon dual core 1.8 GHz processor and 2 GB RAM.

A principal attribute of XSTREAM is practical running time for large sequence datasets. To measure how running time varies with differing input sequence lengths and parameter values, we used XSTREAM to analyze DNA sequences. We chose DNA over protein sequences simply because DNA sequences cover a substantially larger range of sequence lengths than proteins, thus enabling a more accurate assessment of running time. XSTREAM was run on DNA sequences ranging from 0.23 Mbp to 202 Mbp, either with gaps (g = 3) or without gaps (g = 0). For these analyses, sequences were examined in two sets. Shorter sequences, < 10 Mbp, were processed with minimum TR domain length minD = 20 and minimum period MinP = 1, and no period restrictions. For longer sequences, we used minD = 50 and MinP = 10, and due to memory limitations, maximum period was set to 100 kbp. In addition, for periods 10 – 999 we used a divide-and-conquer approach (see Appendix) with fragment length = 1 Mbp. As shown in Table 2, running time increased approximately linearly with increasing sequence length for all DNA sequences with or without gaps (R² > 0.99). Next, the effect of increasing dataset size on running time was examined by analyzing four Swiss-Prot datasets ranging in size from 40,292 to 230,150 non-redundant protein sequences, and setting minD = 10 and MinP = 1. As expected, since XSTREAM processes each protein sequence individually, running time scaled linearly (R² > 0.998), as indicated in Table 2. A running time of less than 7.5 min for the detection of degenerate TRs (using g = 3) from the Swiss-Prot 50.5 dataset clearly demonstrates the practicality of XSTREAM for multi-genome data mining.

Table 2 Running Time Analysis


In addition to efficient TR detection, other important capabilities of XSTREAM are demonstrated with the data shown in Figures 2, 3, 4 and Table 3. A multiple alignment of a degenerate TR domain found in the C. elegans hypothetical protein CE22309 is presented in Figure 2. Shown above the alignment are the standard numerical properties reported by XSTREAM for each TR domain: sequence position, period, copy number, and consensus error. Each alignment is additionally described by a consensus sequence (below the dashed double line) and a consensus error string (below the consensus).

Table 3 Extreme examples of DNA TRs detected by XSTREAM


The TR example shown in Figure 2 also highlights the utility of the merging feature of XSTREAM when applied to overlapping domains with different periods. Without merging, this TR domain would be reported as several distinct TR fragments. The merging of two non-overlapping TR domains from an A. thaliana hypothetical protein (gi 9293925) is illustrated in Figure 3. This example illustrates the utility of incorporating a highly degenerate intervening sequence to define a larger TR domain that, without merging, would have been divided into two discontinuous regions (x's denote non-matching characters). As in proteins, DNA TRs may also contain extensive degeneracy. The high copy number TR domains shown in Table 3 represent additional successful applications of XSTREAM's merging feature. Taken together, the merging of (non)overlapping TR regions allows XSTREAM to successfully model the architectures of TR domains that have accumulated extensive substitution and/or indel mutations, or that have arisen through convergent evolutionary mechanisms.

In addition to extensive degeneracy, TRs may have very long periods and nested architectures. XSTREAM implements a novel long-period filtering procedure (see Appendix) to find TRs with periods ≥1000. The utility of this method is demonstrated by some of the DNA examples in Table 2 and by the long-period A. thaliana DNA repeats in Table 3. XSTREAM also incorporates a strategy to find and describe nested TR architectures, represented by the regular expression [x,n], with n denoting the number of tandem copies of substring x. An example of TR nesting that shows two levels of nesting is presented in Figure 4. Included in the figure is a block diagram illustrating the hierarchical patterning that epitomizes nested TRs. Taken together, these merging, long-period filtration, and nesting features make XSTREAM a useful tool for detection and architecture modeling of TR domains in both nucleotide and protein sequences.

To validate the utility of XSTREAM for detecting TR-containing proteins, we analyzed the proteomes of five parasite genomes, and compared our output to the TR proteins identified in these same genomes by TRF [18]. Protein sequence datasets for these parasites were downloaded [33] and processed using minP = 1, minD = 90 and minimum copy number minC = 2 or 3. These parameter values were chosen to emulate the TR criteria used in [18] to find TR domains in gene sequences of at least ~250 bp. Setting minD = 90 amino acids for XSTREAM corresponds to a slightly more stringent 270 bp minimum. Table 4 summarizes the TRs found by XSTREAM, using minC = 3 or minC = 2, and by TRF [18]. Using minC = 3, XSTREAM identified more TR-containing proteins in all parasites except T. annulata. In L. infantum, the causative agent of Leishmaniasis and the focus of the Goto et al. studies [17, 18], XSTREAM found seven TR proteins that they did not identify, while three of the TR proteins found by TRF were not detected by XSTREAM. Upon closer examination of the three "missed" proteins, each was found to have a TR domain with copy number less than 3, which would not be reported by XSTREAM using minC = 3. When XSTREAM was rerun with minC = 2, all 64 of the previously identified L. infantum TR proteins [18] were found, along with 14 additional TR-containing proteins that are schematically diagrammed in Figure 5 to illustrate the significant diversity of TR domain architectures within these 14 proteins.

Table 4 Number of TR proteins detected in protozoan parasite genomes by XSTREAM and TRF


Since TR domains can constitute variable fractions of the parent protein sequence (Figure 5), XSTREAM incorporates the simple concept of TR Content, defined as the ratio of the TR domain length to the input sequence length, as an additional metric for comparing modular proteins. Use of this metric allows XSTREAM to filter output using any arbitrary level of TR content, a feature that is illustrated using the protein sequence dataset from A. thaliana (TAIR6_pep_20060907). The Arabidopsis proteome was analyzed using parameter values MinP = 1 and TR Content ≥ 0.7. The relatively small number of proteins with ≥70% TR content resulting from this analysis are schematically depicted in Figure 6. This output clearly reveals the modular architectures of two large, well-described A. thaliana protein families (polyubiquitins with period = 76, and proline-rich extensin-like proteins with period = 25) along with that of additional TR proteins.
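The TR Content metric and its use as a filter can be sketched as follows. The function name `tr_content`, the record layout, and the example coordinates are hypothetical and do not come from the A. thaliana dataset; domain bounds are assumed to be inclusive 0-based indices, matching the S[i, j] notation in the Appendix.

```python
def tr_content(domain_start: int, domain_end: int, sequence_length: int) -> float:
    """Ratio of TR domain length to input sequence length (inclusive bounds)."""
    return (domain_end - domain_start + 1) / sequence_length


# Hypothetical records: (protein id, domain start, domain end, sequence length).
records = [
    ("protein_a", 0, 379, 400),   # 380 of 400 residues in the TR domain
    ("protein_b", 50, 149, 500),  # 100 of 500 residues in the TR domain
]
kept = [name for name, start, end, length in records
        if tr_content(start, end, length) >= 0.7]
print(kept)
```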

Discussion

The use of a priori computational methods to search genome databases for repetitive elements has revealed an abundance of both DNA and peptide repeats in nature, many of which occur in tandemly repeated patterns [6, 8, 26, 27]. The detection and analysis of repeated peptide sequences has received considerable attention in recent years, including the recent publication of a large protein repeats database [26]. Despite the potential importance of such repetitive sequences, the available repeat detection software suffers from both time complexity and output redundancy problems. To address these issues, and to facilitate the detection and modeling of TR structures in general, we developed a new software tool called XSTREAM.

The utility of XSTREAM for efficient and effective detection of degenerate tandem repeats in large input sequence datasets was demonstrated by testing and validation. Practical performance was confirmed by showing that XSTREAM running time can scale linearly with both increasing sequence lengths (up to 202.5 Mbp of DNA sequence) and increasing dataset sizes (up to 230,150 protein sequences). XSTREAM invokes no period limitations and can thus detect TRs with very long periods, as illustrated by the ~45 kbp tandem duplication identified in chromosome I of A. thaliana (Table 3). With the implemented merging heuristic, XSTREAM can also identify TR domains with intermittent regions of high degeneracy, such as the TR from C. elegans chromosome III with period 94 and copy number >400 (Table 3), and the proline/glycine-rich protein from C. elegans shown in Figure 2. In addition, by searching for nested TR structures, XSTREAM detects TRs within TRs (Figure 4), a useful feature for gaining insights into the evolution of complex TR architectures.

Output redundancy is a problem inherent in repeat detection that has often been ignored. For example, using a SW approach, Katti et al. [27] searched Swiss-Prot 38 for TRs with periods between 1 and 20, and compiled the TRIPS database of TRs and their corresponding protein sequence identifiers (http://www.ncl-india.org/trips). In many cases, TRs with different periods were reported that occupy the same protein sequence space. The output of another repeat finding tool [26] also demonstrates the importance of redundancy removal. The ProtRepeatsDB tool (http://bioinfo.icgeb.res.in/repeats) was designed for comparing repeated peptides from many organisms. Though Kalita et al. were aware of redundancy problems, the strategy they implemented falls short of providing concise repeat output in numerous cases. For example, ProtRepeatsDB reported 1312 and 568 distinct perfect peptide repeats in the UBQ3 and UBQ12 polyubiquitin sequences from A. thaliana, respectively. Unexpectedly, the canonical period 76 TRs known to characterize polyubiquitins were absent. Such highly redundant outputs illustrate the importance of the redundancy removal tactics incorporated into XSTREAM. By invoking several strategies (see Redundancy Elimination in Appendix), including the use of irreducible TR periods [24], XSTREAM produces non-redundant TR output. Analysis of the A. thaliana proteome by XSTREAM, for example, reports the UBQ3 and UBQ12 sequences only once, each with an irreducible, period 76 TR covering virtually the entire protein sequence.

The recent analysis of five protozoan parasite genomes using TRF [18] provided a reasonable reference for testing XSTREAM on genome-scale datasets. Using minD = 90 to mimic the TR domain criterion used by Goto et al., XSTREAM detected significantly more TR proteins from all parasite genomes, including all 64 of the previously identified L. infantum TR proteins [18]. Further analysis of these 64 TR protein domains revealed that the TR domains identified by both algorithms were comparable in size (data not shown).

Conclusion

By testing XSTREAM on a variety of sequence data, we demonstrated the utility of this new genome data-mining tool for identifying TRs with diverse periods and domain sizes, varied levels of degeneracy, and complex architectures. These capabilities should facilitate potentially significant applications. For example, TRs present in parasitic pathogens are known to elicit important immunological responses that may provide antigenic protection (e.g., [19]). New computational approaches for detecting TR proteins might thus be useful for identifying novel protein antigens for diagnostics and vaccine development [17, 18]. In addition, since TR domains are characteristic of modular structural proteins, use of XSTREAM may lead to the in silico discovery of phylogenetically diverse proteins with novel biomaterials and biomimetic applications.

Availability and requirements

Project Name: XSTREAM

Project home page and availability: http://jimcooperlab.mcdb.ucsb.edu/xstream

Operating system(s): Platform independent

Programming language: Java

Any restrictions to use by non-academics: yes, contact author JBC for details

Appendix

Preliminary Notations

S = input sequence, which takes values from alphabet {A,C,G,T} for nucleotide sequences and alphabet {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y} for proteins

· |S| = length of S

· S[j] = the character at index j in S with j ≥ 0

· S[i, j] = the subsequence in S from index i to index j inclusively

Xi = TR domain i

· |Xi| = length of entire TR domain Xi

· Xi[j] = repeat copy j in Xi with j ≥ 0

· |Xi[j]| = length of copy j

· |Xi[]| = size of array Xi[]

· XiS = lowest index of Xi; starting position in S

· XiE = highest index of Xi; ending position in S

· XiSE = index range [XiS, XiE]

· Ei = copy number (exponent) of Xi

· Ci = consensus sequence of Xi

· Pi = period of Xi = period of Ci

· CEi = consensus error of Xi =

Without gaps: # of mismatching characters to consensus/total # of characters in aligned Xi
With gaps: see Consensus Building

· Ii = indel error of Xi = # of gaps in aligned Xi/total # of characters in aligned Xi

· Ri = referential repeat copy of Xi: used during TR domain expansion and maximality

{X} = {X₀, X₁, ..., Xₙ} = set of all identified TR domains
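As a small worked instance of the indel error Ii defined above, the sketch below counts gap characters in an aligned domain. The function name `indel_error`, the use of "-" as the gap symbol, and the choice to count gap positions in the denominator are assumptions; the source does not state whether "total # of characters in aligned Xi" includes gaps.

```python
def indel_error(aligned_copies, gap: str = "-") -> float:
    """Number of gaps in the aligned domain / total aligned characters.

    Assumption: the denominator counts every alignment position, gaps included.
    """
    total = sum(len(row) for row in aligned_copies)
    gaps = sum(row.count(gap) for row in aligned_copies)
    return gaps / total


# Hypothetical aligned domain: the second copy is missing one residue.
print(indel_error(["PGAK", "PG-K", "PGAK"]))
```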