Volume 12 Supplement 9
Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics
The evolution of the tape measure protein: units, duplications and losses
 Mahdi Belcaid^{1},
 Anne Bergeron^{2} and
 Guylaine Poisson^{1}
DOI: 10.1186/1471-2105-12-S9-S10
© Belcaid et al; licensee BioMed Central Ltd. 2011
Published: 5 October 2011
Abstract
Background
A large family of viruses that infect bacteria, called phages, is characterized by long tails used to inject DNA into their victims' cells. The tape measure protein got its name because the length of the corresponding gene is proportional to the length of the phage's tail, a fact shown by actually copying or splicing out parts of DNA in exemplar species. A natural question is whether there exist units for these tape measures, and whether different tape measures have different units and lengths. Such units would allow us to retrace the evolution of tape measure proteins using their duplication/loss history. The vast number of sequenced phage genomes allows us to attack this problem with a comparative genomics approach.
Results
Here we describe a subset of phages whose tape measure proteins contain variable numbers of an 11-amino-acid sequence repeat, aligned using sequence similarity, structural properties, and simple arithmetic. This subset provides a unique opportunity for the combinatorial study of phage evolution, without the added uncertainties of multiple alignments, which are trivial in this case, or of protein functions, which are well established. We give a heuristic that reconstructs the duplication history of these sequences, using divergent strains to discriminate between mutations that occurred before and after speciation, or lineage divergence. The heuristic is based on an efficient algorithm that gives an exhaustive enumeration of all possible parsimonious reconstructions of the duplication/speciation history of a single nucleotide. Finally, we present a method that allows us, when possible, to discriminate between duplication and loss events.
Conclusions
Establishing the evolutionary history of viruses is difficult, in part due to extensive recombinations and gene transfers, and high mutation rates that often erase detectable similarity between homologous genes. In this paper, we introduce new tools to address this problem.
Background
In 1984, Katsura and Hendrix [1] showed that when a specific gene of the phage λ was shortened, the resulting viruses’ tails were proportionally shorter. The corresponding tape measure protein has since been identified in a large number of phages and prophages. These proteins often have a variable number of tandem repeats with highly conserved tryptophan (W) and phenylalanine (F) amino acids at fixed positions that are used as anchors by small auxiliary proteins to stretch the tape and scaffold the actual tail construction (see, for example, [2]). The regular spacing between these anchors, or period, seems to be a key structural property of the tape measure protein and acts as a marking on the tape.
Phages are believed to be, by far, the most abundant form of life on the planet [3], a fact reflected by the large number of phage and prophage genomes currently available. This wealth of data allowed us to literally shop for tape measures that had specific properties in terms of length, period, composition and level of similarity.
Reconstructing duplication histories has been an intensively studied combinatorial problem in the last ten years or so (see [4] and [5] for reviews), following an initial, more biology-oriented work by Walter Fitch in 1977 [6]. Recent advances in duplication history reconstruction extend the previous models by allowing more operations such as inversions [7] or segmental duplications [8]. All these approaches suppose a fixed boundaries model, meaning that duplication events may only occur at a fixed set of breakpoints, an assumption that does not apply very well to virus duplications (see Additional File 2). However, the basics of the theory of reconstructing unrestricted duplications were developed by Benson and Dong [9] in 1999, and constitute the starting point of the present study. The idea of their heuristic is to evaluate the number of mutations for each putative duplication event, and to contract the segment with the minimum, or near minimum, number of mutations.
We tested most available algorithms and heuristics, with or without the fixed boundaries hypothesis, using the amino acid sequences and the corresponding DNA sequences of the two viruses in Figure 1. Unfortunately, despite the striking similarity of the sequences, the two versions of the duplication history of their presumed common ancestor were always different. Since their divergence, each virus seems to have added embellishments to the original story, by way of mutations, that eventually blur the common origin of their duplication history.
Here we develop a method that uses, in parallel, the information from two or more sequences to detect the most recent duplication event of their ancestor. It is based on an algorithm that computes the expected number of mutations that occurred before any speciation event.
Results and discussion
The units of the tape measures
The structural analysis of Siponen et al. [2] suggested that the amino acid phenylalanine (F) could be an alternate marker, and that a pattern with mixed period (11-11-18) could also be present. This information was used to construct two search patterns in ProSite format:
Pattern 1: [FW]-x(10)-[FW]-x(10)-[FW]-x(10)-[FW]-x(10)-[FW]-x(10)-[FW]-x(10)-[FW]
Pattern 2: [FW]-x(10)-[FW]-x(10)-[FW]-x(17)-[FW]-x(10)-[FW]-x(10)-[FW]-x(17)-[FW]
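As an illustration only, the two patterns can be rebuilt from their periods and tested against a protein sequence with a regular expression. The sketch below is not the seedtop search used in this study: the function names `build_pattern`, `prosite_to_regex` and `has_pattern` are hypothetical, only the small subset of the ProSite syntax appearing in the two patterns is handled, and the converter accepts the patterns with or without hyphen separators.

```python
import re

def build_pattern(periods, anchor="[FW]"):
    """Build a ProSite-style pattern with one anchor at the end of each period."""
    parts = [anchor]
    for period in periods:
        parts.append(f"x({period - 1})")  # period - 1 free residues between anchors
        parts.append(anchor)
    return "-".join(parts)

def prosite_to_regex(pattern):
    """Translate the ProSite subset used here ([..], x, x(n), x(m,n)) to a regex."""
    regex = pattern.replace("-", "")                          # separators carry no meaning
    regex = re.sub(r"x\((\d+(?:,\d+)?)\)", r".{\1}", regex)  # x(10) -> .{10}
    return regex.replace("x", ".")                            # bare x -> any residue

PATTERN_1 = build_pattern([11] * 6)           # constant period 11
PATTERN_2 = build_pattern([11, 11, 18] * 2)   # mixed period 11-11-18

def has_pattern(protein, pattern):
    """True if the protein sequence contains at least one occurrence."""
    return re.search(prosite_to_regex(pattern), protein) is not None

# Toy sequence (not real data): an anchor every 11 residues.
toy = ("W" + "A" * 10) * 6 + "F"
print(has_pattern(toy, PATTERN_1), has_pattern(toy, PATTERN_2))
```

Building the patterns from a list of periods makes it easy to try other spacings, which is one way to search for units in proteins that match neither pattern.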
Using the BLAST package algorithm seedtop (ftp://ftp.ncbi.nlm.nih.gov/blast) we found that Pattern 1 has occurrences in 191 of the 5608 protein records in GenBank that have an explicit reference to “tape measure phage protein”, and Pattern 2 has occurrences in 102 of the 5608 proteins (as of April 22, 2011). Of these, 16 sequences have occurrences of both patterns, yielding a total of 191 + 102 − 16 = 277 sequences that contain at least one occurrence of either pattern, or nearly 5% of the 5608 sequences. Note that these results poorly reflect the real number of tape measure proteins with such periods, since many proteins highly similar to known tape measure proteins are annotated with various descriptors that range from “minor tail protein” to “hypothetical protein”.
This was an encouraging first result that yielded examples of tandem repeats that are discussed in the next paragraphs. However, further investigations, both computational and biological, are needed to discover, if they exist, the repeated units of the remaining annotated tape measure proteins. Current automated tandem repeat finders rely on internal similarity to identify repeated units, and many tape measure proteins fail to show such similarity. Biological evidence of conserved structures, such as the work described in [2], provides key observations that allow alignments to be constructed from these structures, but it relies on protein crystallography experiments and is thus not widely available.
Reconstruction of the duplication history
Assuming that the two sequences are indeed orthologous, the origin of disagreements between the curves lies in mutations that occurred after speciation. Consequently, if the data of both sequences are to be used to reconstruct the duplication history, it is necessary to develop a scoring technique, detailed in the Methods section, that can discriminate between “recent” mutations and “ancient” mutations.
Contrary to the classic distance that counts the number of positions in which two sequences are different, the new distance is based on the simultaneous comparison of four sequences. An example of computation taken at position p = 210 is shown in Figure 3(b). In this example, only three columns have a positive score: the first and third columns contain the motifs actc and tgtc and get a score of 0.8, reflecting the expected number of mutations that preceded the speciation event; the second column contains the motif tata and gets a score of 1. The combined normalized distance is thus (2.6)/33.
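The computation can be sketched as follows, assuming that a column is read in the order (x_{1}, y_{1}, x_{2}, y_{2}), that is, the two paralogous positions of the first sequence followed by those of the second, which is the order consistent with the motifs quoted above. The per-column scores are the values derived for single nucleotides in the Methods section; the function names are hypothetical, and the toy segments are not the data of Figure 3.

```python
from fractions import Fraction

def pre_speciation_score(x1, y1, x2, y2):
    """Expected number of mutations before speciation for one column."""
    distinct = len({x1, y1, x2, y2})
    if distinct == 4:                       # actg class
        return Fraction(2, 3)
    if distinct == 3:                       # exactly one pair of equal nucleotides
        # acat class only if the equal pair is orthologous (x1 = x2 or y1 = y2)
        return Fraction(4, 5) if (x1 == x2 or y1 == y2) else Fraction(0)
    if distinct == 2 and x1 == x2 and y1 == y2:
        return Fraction(1)                  # tata class
    return Fraction(0)                      # aaaa, aaat, atta and caat classes

def combined_distance(seg_x1, seg_y1, seg_x2, seg_y2):
    """Sum of the column scores divided by the segment length."""
    if not (len(seg_x1) == len(seg_y1) == len(seg_x2) == len(seg_y2)):
        raise ValueError("the four segments must have the same length")
    columns = list(zip(seg_x1, seg_y1, seg_x2, seg_y2))
    return sum(pre_speciation_score(*col) for col in columns) / len(columns)

# Three toy columns spelling the motifs actc, tata and tgtc.
print(combined_distance("att", "cag", "ttt", "cac"))
```

With 30 additional identical columns, the three scoring columns of this toy input give 2.6/33, the value of the example.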
This combined normalized distance applied to all possible positions yields the bottom curve of Figure 4. The new curve smooths out the differences between the first two curves, and narrows the search for the position of the most recent duplication: it reaches its minimum value at each position of the interval [97..102]. This approach can be used recursively in order to reconstruct the recent duplication history of these sequences (data not shown), but going further might stretch the use of a heuristic on limited input too far. However, pinpointing the possible positions of the most recent duplication can be useful in establishing the phylogenetic relationships between tape measure proteins, as we show in the next section.
Duplication or loss?
 1. The two predictions agree. Then the most recent event is either a duplication, or a loss of a recently duplicated segment.
 2. The two predictions disagree. Then the most recent event is a loss.
This result illustrates the difficulty of deciding between duplication and loss. Indeed, as simulations show (see the Methods section), the shape of the curve that determines the position of the most recent event is also an indication of its nature.
Methods
Heuristic for duplication reconstruction
The duplication reconstruction heuristic proposed in [9] compares every possible pair of consecutive segments of length np, where n is an integer greater than 0 and p is the period of the repeat. Each comparison results in a score that is divided by np, and the duplication with the lowest score is a candidate for contraction. A contraction merges the two consecutive segments by using the Fitch procedure: let S and T be the sets of nucleotides at position j in each segment; if S ∩ T ≠ ∅, then the new position j is filled by S ∩ T, otherwise it is filled by S ∪ T.
In the original paper of Benson and Dong, the sequences were scored by the number of unions performed in the comparison, which is proportional to the number of mutations that separate the two segments. In this paper, we want to apply the same heuristic, but with a different scoring technique that uses two or more DNA sequences whose common ancestor underwent the duplication events. To do this, we must be able to evaluate the number of mutations that occurred before the speciation event(s).
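A minimal sketch of one step of this heuristic is given below, with the sequence represented as a list of nucleotide sets and the score taken to be the number of unions, as in the scoring of Benson and Dong. The function names are hypothetical, ties are broken by keeping the first candidate found (an arbitrary choice), and the "near minimum" selection is not modeled.

```python
def to_sets(dna):
    """Represent a DNA string as a list of singleton nucleotide sets."""
    return [{ch} for ch in dna]

def fitch_contract(left, right):
    """Merge two equal-length segments; return the merged segment and the union count."""
    merged, unions = [], 0
    for s, t in zip(left, right):
        common = s & t
        if common:
            merged.append(common)      # S ∩ T is nonempty: keep the intersection
        else:
            merged.append(s | t)       # otherwise take the union, one mutation
            unions += 1
    return merged, unions

def best_contraction(seq, period):
    """Return (score, start, size) of the lowest-scoring pair of consecutive segments."""
    best = None
    for n in range(1, len(seq) // (2 * period) + 1):
        size = n * period
        for start in range(len(seq) - 2 * size + 1):
            left = seq[start:start + size]
            right = seq[start + size:start + 2 * size]
            _, unions = fitch_contract(left, right)
            score = unions / size      # normalized by the segment length np
            if best is None or score < best[0]:
                best = (score, start, size)
    return best

def contract(seq, start, size):
    """Replace the two consecutive segments at `start` by their Fitch merge."""
    merged, _ = fitch_contract(seq[start:start + size], seq[start + size:start + 2 * size])
    return seq[:start] + merged + seq[start + 2 * size:]

print(best_contraction(to_sets("acgacg"), 3))
```

Repeating `best_contraction` followed by `contract` until a single unit remains yields one candidate duplication history, read from the most recent event to the oldest.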
Orthologous and paralogous nucleotides
The self-alignments of Figure 1 are gapless, and this property also holds for the alignment of the underlying DNA sequences. This allows us to apply the classical terminology of paralogs and orthologs to single nucleotide positions.
The length of the resulting sequence will also be a multiple of p, and any two nucleotides in the resulting sequence whose positions differ by a multiple of p were created by a duplication event, thus can be called paralogs. In our model, two tape measure proteins that have a good parallel alignment, such as the one in Figure 1, are presumed to share the duplication history of their common ancestor. Under this hypothesis, all duplications occurred before the speciation event, and nucleotides that are in the same respective position in each sequence can be called orthologs.
Problem 1 Suppose that a duplication event created paralogous nucleotides x and y, and that a subsequent speciation event created orthologous viruses 1 and 2, yielding the two pairs of orthologs x_{1} and x_{2}, and y_{1} and y_{2}. What is the expected number of mutations that occurred before the speciation event?
F-trees and the Fitch algorithm
Definition 1 An F-tree is a duplication-speciation tree with 4 ordered leaves labeled by sets A, B, C, D that are subsets of the set of nucleotides {a, c, g, t}.

The left node is the parent of the leaves labeled by sets A and C. It is labeled by L = A ∩ C, if this set is nonempty, otherwise by L = A ∪ C.

The right node is the parent of the leaves labeled by sets B and D. It is labeled by R = B ∩ D, if this set is nonempty, otherwise by R = B ∪ D.

The ancestor node is labeled by X = L ∩ R, if this set is nonempty, otherwise by X = L ∪ R.
The number N of mutations of an F-tree is the number of set unions necessary to construct the sets L, R and X. A labeling of an F-tree is a labeling of its leaves and nodes by nucleotides, such that each leaf is labeled by a nucleotide that belongs to its set label, and such that the number of mutations (that is, edges with different labels at their extremities) is equal to N. Note that it is not mandatory that nucleotides that label internal nodes belong to the corresponding set label, L, R or X.
The procedure outlined in Definition 1 was originally proposed by W. Fitch [13] as a way to compute the minimum number of mutations for a given tree; it was later proven correct by D. Sankoff [14]. The sets computed for a parent node by this rule are called Fitch sets.
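The construction of Definition 1 can be written down directly. The sketch below uses hypothetical function names and Python sets for the labels; it returns the Fitch sets L, R and X together with the number N of unions.

```python
def fitch_join(s, t):
    """Fitch set of a parent and the number of unions (0 or 1) it required."""
    common = s & t
    return (common, 0) if common else (s | t, 1)

def ftree_fitch(A, B, C, D):
    """Fitch sets L, R, X and mutation count N of an F-tree with leaves A, B, C, D."""
    L, n_left = fitch_join(A, C)    # left node: parent of leaves A and C
    R, n_right = fitch_join(B, D)   # right node: parent of leaves B and D
    X, n_root = fitch_join(L, R)    # ancestor node
    return L, R, X, n_left + n_right + n_root

print(ftree_fitch({"a"}, {"c"}, {"a"}, {"t"}))
```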
A mutation occurs before the speciation event in a given labeling if it occurs between the root and one of its children, otherwise it occurs after the speciation event. If an F-tree has several labelings, we denote by N_{b} the average number of mutations that occur before speciation among all possible labelings, and by N_{a} the average number of mutations that occur after speciation. Clearly, N_{b} + N_{a} = N.
We first compute, as an example, the values of N and N_{b} in the case where all sets A, B, C and D are singletons. The general proof is presented in the next section. There are 4^{4} = 256 different motifs of 4 nucleotides. With respect to our problem, they can be partitioned into 7 classes with the following representatives: aaaa, aaat, atta, caat, tata, acat, and actg. Of these, the first four cases yield N_{b} = 0.
The actg-motif, on the other hand, requires a minimum of three mutations. Figure 9(b) shows 3 of the 12 possible labelings and, on average, 2/3 mutations occur before the speciation event, and 7/3 after. Note that the third labeling is not obtainable by the Fitch traceback algorithm since the label of the right child of the root is not contained in the union of the labels of its children.
The acat motif is the most complex and is shown in Figure 9(c). It has one pair of equal orthologous nucleotides, and requires a minimum of two mutations. Three labelings have nucleotide a as an ancestor, one has nucleotide c and one has nucleotide t. On average, 4/5 mutations occur before the speciation event, and 6/5 after.
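These counts can be checked by exhaustive enumeration. The sketch below is a brute-force verification, not the efficient algorithm of this paper: it tries every assignment of nucleotides to the three internal nodes, keeps the assignments that reach the minimum number of mutations, and averages the number of mutations on the two edges leaving the root. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

NUCLEOTIDES = "acgt"

def optimal_labelings(A, B, C, D):
    """Return N and the list of (mutations before speciation, labeling) pairs."""
    scored = []
    for a, b, c, d in product(A, B, C, D):               # leaf labels taken from the sets
        for l, r, x in product(NUCLEOTIDES, repeat=3):   # left, right and ancestor nodes
            before = (x != l) + (x != r)                 # edges leaving the root
            after = (l != a) + (l != c) + (r != b) + (r != d)
            scored.append((before + after, before, (a, b, c, d, l, r, x)))
    n_min = min(total for total, _, _ in scored)
    return n_min, [(before, lab) for total, before, lab in scored if total == n_min]

def average_before(A, B, C, D):
    """Return (N, number of labelings, N_b)."""
    n_min, optimal = optimal_labelings(A, B, C, D)
    n_b = Fraction(sum(before for before, _ in optimal), len(optimal))
    return n_min, len(optimal), n_b

for motif in ("tata", "acat", "actg"):
    print(motif, average_before(*motif))
```

Because leaves are passed as iterables of nucleotides, the same functions accept leaves labeled by sets, for example `average_before("at", "at", "t", "a")`.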
In the next sections, we will show how these observations can be generalized to trees labeled by sets.
Computing the average number of mutations preceding speciation
When the leaves of an F-tree are labeled by sets containing more than one element, the possible labelings can include more than one motif. For example, if:
A = {a, t}, B = {a, t}, C = {t}, D = {a}
then the Fitch procedure yields N = 1 mutation. Four possible labelings achieve this minimum: two with tata labeling the leaves, one with aata, and one with ttta. The motif atta is excluded since it requires N = 2 mutations.
The next three lemmas give the average number of mutations that occur before speciation in an F-tree with leaves labeled by the sets A, B, C and D, with N > 0 minimum number of mutations. In the formulas, n_{m} denotes the number of motifs of the class of m that can label the leaves with N mutations.
Lemma 1 When N = 1, the average number of mutations that occur before speciation is given by N_{b} = 2n_{tata}/(2n_{tata} + n_{aaat}). Proof. There is only one mutation when exactly one of the sets A ∩ C, B ∩ D or L ∩ R is empty. If L ∩ R is not empty, then either A ∩ C = ∅ or B ∩ D = ∅. If A ∩ C is empty, then the single mutation occurs in the left subtree, and there is at least one motif with three equal nucleotides, implying n_{aaat} > 0, n_{tata} = 0, and N_{b} = 0, thus the result holds. The case where B ∩ D is empty is similar.
If L ∩ R is empty, then any motif with two different nucleotides may be present, but only motifs with orthologous equal nucleotides (the tata-motif), or motifs with three equal nucleotides, yield N = 1. The tata-motif has two different labelings, as seen in Figure 9(a), both of which assign the mutation before the speciation event. The aaat-motif has only one labeling, and the mutation occurs after the speciation event. Thus N_{b} = 2n_{tata}/(2n_{tata} + n_{aaat}).
Lemma 2 When N = 2, the average number of mutations that occur before speciation is given by N_{b} = 4n_{acat}/(5n_{acat} + 2n_{atta} + n_{caat}). Proof. We first consider the case L ∩ R ≠ ∅. Then both A ∩ C = ∅ and B ∩ D = ∅, implying n_{acat} = 0. Since (A ∪ C) ∩ (B ∪ D) ≠ ∅, at least one of the sets A ∩ B, A ∩ D, C ∩ B or C ∩ D is not empty. In this case, n_{caat} may be 0 when both A ∩ B and C ∩ D are nonempty, or both A ∩ D and C ∩ B are nonempty, but in these cases n_{atta} > 0, thus we have (2n_{atta} + n_{caat}) > 0. The atta-motif has two possible labelings, both of which assign the two mutations after speciation, and the caat-motif has only one labeling, also with the two mutations after speciation, thus N_{b} = 0 and the formula holds.
When L ∩ R = ∅, then one of A ∩ C or B ∩ D is not empty and n_{acat} > 0. As seen in Figure 9(c), there are five possible labelings of acat-motifs, four of which have a mutation preceding speciation. However, atta-motifs and caat-motifs may also be present, for example with:
A = {a, t}, B = {c, g}, C = {a, g}, D = {t}.
Thus, N_{b} = 4n_{acat}/(5n_{acat} + 2n_{atta} + n_{caat}).
Lemma 3 When N = 3, the average number of mutations that occur before speciation is given by N_{b} = 2/3. Proof. In order to have N = 3, all three sets A ∩ C, B ∩ D and L ∩ R = (A ∪ C) ∩ (B ∪ D) must be empty, thus the four sets are singletons, and, by the case study of Figure 9(b), N_{b} = 2/3.
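The three formulas can be combined into a single routine that counts the motifs of each class among the possible leaf labels. The sketch below is one possible reading, with hypothetical function names: a motif is written in the leaf order A, B, C, D, the count n_{m} only includes motifs that reach the minimum N, and the case N = 0, which the lemmas do not cover, is given the value 0 since no mutation occurs.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def motif_class(a, b, c, d):
    """Class representative and mutation count of a motif of four nucleotides."""
    distinct = len({a, b, c, d})
    if distinct == 1:
        return "aaaa", 0
    if distinct == 4:
        return "actg", 3
    if distinct == 3:                       # exactly one pair of equal nucleotides
        return ("acat", 2) if (a == c or b == d) else ("caat", 2)
    if max(Counter((a, b, c, d)).values()) == 3:
        return "aaat", 1                    # three equal nucleotides
    return ("tata", 1) if a == c else ("atta", 2)   # two pairs of equal nucleotides

def average_before(A, B, C, D):
    """Return (N, N_b) for an F-tree with leaves labeled by the sets A, B, C, D."""
    classes = [motif_class(*motif) for motif in product(A, B, C, D)]
    n_min = min(k for _, k in classes)
    n = Counter(name for name, k in classes if k == n_min)
    if n_min == 0:
        return 0, Fraction(0)               # assumption: no mutation at all
    if n_min == 1:
        return 1, Fraction(2 * n["tata"], 2 * n["tata"] + n["aaat"])
    if n_min == 2:
        return 2, Fraction(4 * n["acat"], 5 * n["acat"] + 2 * n["atta"] + n["caat"])
    return 3, Fraction(2, 3)

print(average_before({"a", "t"}, {"c", "g"}, {"a", "g"}, {"t"}))
```

For the sets A = {a, t}, B = {c, g}, C = {a, g}, D = {t}, the product contains two acat-motifs, one atta-motif and four caat-motifs, plus one actg-motif that is discarded because it needs three mutations.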
Detecting duplication and loss events
In this section, we discuss the problem of detecting a duplication or loss event when comparing two tandem repeat sequences. We first discuss this problem in the fixed boundary context. Formally, we are given two sequences:
b = b_{1}…b_{j−1}b_{j+1}…b_{n}
c = c_{1}…c_{j−1}c_{j}c_{j+1}…c_{n}
each of them composed of segments of the same length, and both sharing a common ancestor a that contained all segments, except possibly the segment at position j. The relation between the sequences is either a duplication creating segment c_{j} in the lineage of sequence c, or a loss of segment b_{j} in the lineage of sequence b. The Hamming distance between two segments is denoted by H(s, t) and measures the number of positions with different nucleotides in segments s and t. Under these hypotheses, the problem is the following:
Problem 2 Given sequences b and c, what is the position j of the duplication or loss event that minimizes the distance between the sequences?
Define c^{(i)} = c_{1}…c_{i−1}c_{i+1}…c_{n} as the sequence c with the segment at position i removed. Then we have:
Proposition 1 If H(b_{i}, c_{i}) ≤ H(b_{i}, c_{ℓ}), for ℓ ≠ i, then the function H(b, c^{(i)}) attains a minimum when i = j.
The same reasoning holds when i >j.
The hypothesis that H(b_{i}, c_{i}) ≤ H(b_{i}, c_{ℓ}) reflects the fact that the duplication event(s) that created the segments at positions i and ℓ preceded the speciation event that created sequences b and c. In real data, the hypothesis might not hold for all values of i and ℓ, but it should hold on average.
Without the assumption of repeats with fixed boundaries, it is still possible to use Proposition 1 to obtain an estimate of the position of a duplication or loss event by testing all possible sets of boundaries. This is equivalent to computing, for each position i of the nucleotide sequence c, H(b, c_{[i, i+d)}), where d is the difference in length of the two sequences, and c_{[i, i+d)} is the sequence c with all nucleotides between positions i and i + d − 1 removed.
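This scan is straightforward to sketch. In the example below, the function names are hypothetical, positions are 0-based (the text does not fix an indexing convention), and the two short sequences are toy data rather than tape measure genes.

```python
def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(s, t))

def deletion_profile(b, c):
    """H(b, c with positions i .. i+d-1 removed) for every start position i."""
    d = len(c) - len(b)
    if d <= 0:
        raise ValueError("c must be longer than b")
    return [hamming(b, c[:i] + c[i + d:]) for i in range(len(b) + 1)]

def candidate_positions(b, c):
    """All positions i at which the profile reaches its minimum."""
    profile = deletion_profile(b, c)
    lowest = min(profile)
    return [i for i, value in enumerate(profile) if value == lowest]

b = "acgtacgt"
c = "acgtggacgt"      # b with "gg" inserted after position 3
print(deletion_profile(b, c), candidate_positions(b, c))
```

Returning every minimizing position, rather than a single one, matches the situation where the curve reaches its minimum over an interval of positions.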
Conclusions
In this paper, we developed a variety of tools to study the evolution of tape measure proteins. We relied on existing software to identify repeated units and markers, and we have already identified hundreds of sequences that have a clear repetitive structure. However, many tape measure proteins do not have readily identifiable repeat sequences, or markers, and new methods must be developed to classify them.
In order to study the duplication histories of this first set of sequences, we developed new theoretical tools that use in parallel the information provided by slightly divergent sequences. For the time being, these analyses are restricted to pairs of sequences for two main reasons: (1) the algorithm assumes an established rooted phylogeny of the studied sequences, and, given the high rate of recombinations between phages [15, 16], this is not a trivial task; (2) the computational complexity of extending the algorithm to more than two species is unknown, but suspected to be hard.
Declarations
Acknowledgements
MB and GP are supported by NIH Grant P20RR018727 from the National Center for Research Resources. GP is supported by NIH grants U54RR026136 and P20RR016467 from the National Center for Research Resources. The paper’s contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH. AB is funded by NSERC (Canada).
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 9, 2011: Proceedings of the Ninth Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S9.
Authors’ Affiliations
References
 1. Katsura I, Hendrix RW: Length determination in bacteriophage lambda tails. Cell 1984, 39: 691–698. 10.1016/0092-8674(84)90476-8
 2. Siponen M, Sciara G, Villion M, Spinelli S, Lichière J, Cambillau C, Moineau S, Campanacci V: Crystal structure of ORF12 from Lactococcus lactis phage p2 identifies a tape measure protein chaperone. J. Bacteriol 2009, 191: 728–734. 10.1128/JB.01363-08
 3. Brussow H, Hendrix RW: Phage genomics: small is beautiful. Cell 2002, 108: 13–16. 10.1016/S0092-8674(01)00637-7
 4. Gascuel O, Bertrand D, Elemento O: Reconstructing the duplication history of tandemly repeated sequences. In Mathematics of Evolution and Phylogeny. Edited by: Gascuel O. Oxford Univ. Press; 2005:205–235.
 5. Rivals E: A Survey on Algorithmic Aspects of Tandem Repeats Evolution. International J. of Foundations of Computer Science 2004, 15(2):225–257. Special Issue “Combinatorics on Words with Applications”. 10.1142/S012905410400239X
 6. Fitch WM: Phylogenies constrained by the crossover process as illustrated by human hemoglobins and a thirteen-cycle, eleven-amino-acid repeat in human apolipoprotein A-I. Genetics 1977, 86: 623–644.
 7. Lajoie M, Bertrand D, El-Mabrouk N, Gascuel O: Duplication and inversion history of a tandemly repeated genes family. J. Comput. Biol 2007, 14: 462–478. 10.1089/cmb.2007.A007
 8. Zhang Y, Song G, Vinar T, Green ED, Siepel A, Miller W: Evolutionary history reconstruction for Mammalian complex gene clusters. J. Comput. Biol 2009, 16: 1051–1070. 10.1089/cmb.2009.0040
 9. Benson G, Dong L: Reconstructing the Duplication History of a Tandem Repeat. Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. AAAI Press; 1999, 44–53. [http://portal.acm.org/citation.cfm?id=645634.660817]
 10. Benson G: Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 1999, 27: 573–580. 10.1093/nar/27.2.573
 11. Kurtz S, Choudhuri J, Ohlebusch E, Schleiermacher C, Stoye J, Giegerich R: REPuter: The Manifold Applications of Repeat Analysis on a Genomic Scale. Nucleic Acids Res 2001, 29: 4633–4642. 10.1093/nar/29.22.4633
 12. Bailey TL, Williams N, Misleh C, Li WW: MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res 2006, 34: W369–373. 10.1093/nar/gkl198
 13. Fitch WM: Toward defining the course of evolution: Minimum change for a specified tree topology. Systematic Zoology 1971, 20: 406–416. 10.2307/2412116
 14. Sankoff D: Minimal Mutation Trees of Sequences. SIAM Journal on Applied Mathematics 1975, 28: 35–42. 10.1137/0128004
 15. Hatfull GF: Bacteriophage genomics. Curr. Opin. Microbiol 2008, 11: 447–453. 10.1016/j.mib.2008.09.004
 16. Belcaid M, Bergeron A, Poisson G: Mosaic graphs and comparative genomics in phage communities. J. Comput. Biol 2010, 17: 1315–1326. 10.1089/cmb.2010.0108
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.