SplitTester : software to identify domains responsible for functional divergence in protein family
© Gao et al. 2005
Received: 14 January 2005
Accepted: 01 June 2005
Published: 01 June 2005
Skip to main content
© Gao et al. 2005
Received: 14 January 2005
Accepted: 01 June 2005
Published: 01 June 2005
Many protein families have undergone functional divergence after gene duplications such that current subgroups of the family carry out overlapping but distinct biological roles. For the protein families with known functional subtypes (a functional split), we developed the software, SplitTester, to identify potential regions that are responsible for the observed distinct functional subtypes within the same protein family.
Our software, SplitTester, takes a multiple protein sequences alignment as input, generated from protein members of two subgroups with known functional divergence. SplitTester was designed to construct the neighbor joining tree (a split cluster) from variable-sized sliding windows across the alignment in a process called split-clustering. SplitTester identifies the regions, whose split cluster is consistent with the functional split, but may be inconsistent with the phylogeny of the protein family. We hypothesize that at least some number of these identified regions, which are not following a random mutation process, are responsible for the observed functional split. To test our method, we used reverse transcriptase from a group of Pseudoviridae retrotransposons: to identify residues specific for diverged primer recognition. Candidate regions were then mapped onto the three dimensional structures of reverse transcriptase. The locations of these amino acids within the enzyme are consistent with their biological roles.
SplitTester aims to identify specific domain sequences responsible for functional divergence of subgroups within a protein family. From the analysis of retroelements reverse transcriptase family, we successfully identified the regions splitting this family according to the primer specificity, implying their functions in the specific primer selection.
Eukaryotic genomes have many genes that fall within well-defined gene or super-gene families . Both orthologuous and paraloguous genes within the same gene family may vary in functions at levels from subtle changes in regulation or catalytic efficiency to substantial evolution of new function. While functional divergence within a protein family is usually determined by changes in a few amino acid residues or domains. Identification of these has traditionally required considerable experimental effort. Developing computational tools for predicting these crucial residues or regions has become important in the field of current functional genomics. Many methods have been proposed, such as ancestral sequence inference , positive selection , and site-specific rate shifts [4, 5]. The new software SplitTester reported here is focused on a special type of functional divergence that functionally associated amino acids do not have the same evolutionary relationship as the protein family. The software is designed to identify domains responsible for functional divergence by iteratively comparing split cluster to the functional classification. Identified inconsistence between the functional divergence and the phylogenetic relationship may provide valuable information for gene function prediction.
For illustration, we applied SplitTester to the reverse transcriptase family from a group of retrotransposons, Pseudoviridae. There are two subgroups of reverse transcriptase, according to the primer utilizations at the initial step of reverse transcription process. One subgroup binds full length tRNA molecule and another one binds tRNA fragment respectively as primer to initiate cDNA synthesis (reviewed in  also see [7–9]). Such difference in primer specificity can not be reflected from the inferred phylogenetic relationship of this protein family, that is, they are not monophyletic because of the parallel evolution for functional-related changes during the expansion of this protein family . Thus, one may design a tree-based (clustering) algorithm that can define domains relevant to diverged function in a protein family with known functional subtypes. The software SplitTester we developed is to look for the local sequence alignments that display the clustering topology in agreement to a known functional split, using the evolutionary relationship of the gene family as the reference, which can be reconstructed by the conventional methods.
The algorithm implemented in the software SplitTester begins with a multiple sequence alignment of a protein family with known functional diversity (functional subgroups), defined here as a 'functional split'. Usually, a functional split is based on a few but unknown diagnostic amino acid residues or regions that are expected to be in accordance with the functional split. If the functional subgroups are not consistent with the phylogenetic tree of the gene family, we may, in retrospect, identify the sequence region that may include amino acid residues crucial for the sought-after function, if the clustering analysis of this region shows the expected functional grouping. In the following we call this idea the split-clustering for simplicity.
The distance method (e.g. NJ, UPGMA) implemented in most phylogenetic software normally uses the greedy algorithm for efficiency and simplicity, but only tracks a single locally best tree. Consequently, ignoring the alternative solutions can be misleading for phylogeny inference particularly when the sequence length is short [15, 16]. To improve the reliability of distance method, there are some efforts to track multiple partial solutions as it progresses . Since split-clustering usually deals with the short sequence length, we modified the conventional algorithm by tracking all possible topologies equally best fitting the data. That is, at each step, the split-clustering searches for the minimal distance pair, creating a list of pairs with equally minimal distance, and then performs the neighbor joining on each pair in order following a depth-first search. Only the resulted unique topology trees are saved for subsequent analyses. In the following we call them NJ-equivalent tree topologies if it is inferred by the NJ algorithm.
If the split-clustering analysis for a given window yields at least one (NJ-equivalent) tree whose topology matches the predefined functional split, we consider the window containing a potential functional signal. In the case of multiple NJ-equivalent trees, as explained above, the strength of the functional signal, or the degree of confidence, is measured by the percentage of NJ-equivalent trees that support the functional split. We used different colors to indicate the signal strength of the window: red means that all (100%) these trees derived from this window split the genes according to the specified function; yellow, green and blue indicate that 75%, 50% and 25% of the trees match the functional split, respectively. A color gradient is used to represent degrees of confidence between the above intervals.
In the output plot, all the lines representing the candidate windows can be selected to display the sequences within this window and the corresponding phylogenetic trees in two separate panels. The panels are updated in real time with the progress of each window tested. While in theory our algorithm could be quite time consuming to run because of the potentially large number of individual NJ operations (time increases quadratically with the length of the alignment), in practice the calculation is very reasonable, because the sequence length is typically small. Our dataset (eight sequences each of 540 amino acid residues) required approximately six minutes for analysis of all possible windows using a Pentium 3 ~ 930 Mhz processor with 512MB RAM and the Microsoft Windows 2000 operating system. In practice, most domains will be identified when window sizes are smaller than 150 amino acids. Therefore, 2–3 minutes of running time is expected to be sufficient for most computers.
SplitTester is available as precompiled binary in a distribution package for Microsoft Windows from the following URL: http://www.public.iastate.edu/~voytas/SplitTester, and http://xgu.zool.iastate.edu/. Both a zip file of the installation files and a self-extracting installer are available. Documentation and example files are included in the distribution packages.
We applied SplitTester to understand functional diversity among retroelement proteins for primer utilization by retroelement reverse transcriptases. Reverse transcriptases of retroviruses (Retroviridae), Metaviruses (Metaviridae) as well as retrotransposons in the genus Pseudovirus (member of Pseudoviridae family) use the 3' acceptor stem of the host tRNA as a primer for DNA synthesis. This region pairs with the retroelement RNA template to start DNA synthesis from 3'-OH of the tRNA. Retrotransposons of the genus Hemiviruses (member of Pseudoviridae family) use a half-tRNA primer, and cDNA synthesis initiates from 3'-OH of nucleotide 40, which resides within the anticodon stem-loop. It is likely that the primer template complexes for the two groups of elements have different structural conformations or properties, and that reverse transcriptases from the different groups have evolved the ability to recognize these differences. The split-clustering implemented by SplitTester could explore candidate domains of reverse transcriptase related to primer selection. These identified candidates provide testable hypotheses that could be verified or falsified by the follow-up molecular genetic experimentation.
Two signal regions were identified by SplitTester when the mutation distance matrix was used to compute the cost of amino acid substitution and when gaps were considered as potentially informative characters. In the aligned amino acid sequence, the maximum windows showing functional split in each non-overlapping region cover positions 144–239 (region 1) and 269–314 (region 2) (Fig. 3A). Using the hydrophobicity matrix, SplitTester also identified these regions, as well as an extra region at the N-terminus (amino acids 1–29, region 3). Each of signal regions above is derived from the multiple overlapping signal windows with gradually increased window length. We chose 50 aa window (166–215) in region 1 and 32 aa window (280–311) in region 2. Both windows are close to the median window length with demonstrating high bootstrapping values (Fig. 4C). In general, one may use the median sized window for representing, because small windows are statistically unstable while a too broad window will demolish the functional signal of the window.
We located the two windows identified by both matrices within the HIV reverse transcriptase sequence, based on the published protein sequence alignments of the reverse transcriptase family by Xiong and Eickbush , and mapped onto the crystal structure 1RTD  (Fig. 3B). HIV reverse transcriptase p66 has a 'right hand' structure. The 50 aa window (166–215) in region 1 from the alignment correspond to the "palm" of the protein (residues 167–210 in HIV reverse transcriptase) that encompasses the polymerase active site wherein nucleotides are added to the 3' end of the primer. The 32 aa window (280–311) in region 2 corresponds to one α-helix in the 'thumb' region of HIV reverse transcriptase (residues 267–297), which directly contact the primer/template complex, as determined by cross-linking experiments [21, 22].
The region of β-sheets between the two identified domains (240–268 in the retroelement reverse transcriptase alignment) is called the primer grip and cross-links to the 3'-OH group of the HIV primer, tRNAlys [23, 24]. We would speculate that primer grip is related to the primer binding. However, this region is not identified as determinant of primer specificity, indicating that the ability of the primer grip to interact with the 3'-OH of the primer will be a common feature of both full-and half-tRNA priming retroelements. The two regions identified by SplitTester surrounding the primer grip may play a role in distinguishing primer conformation or length. The results of SplitTester, therefore, can be well explained from the reverse transcriptase crystal structure and supported by experimental data. To further validate our findings, we randomly partitioned the gene members into two groups and reran the program. The candidate functional regions were not identified using any of the partitions (data not shown). We therefore conclude that the identified residues may separate the functional subtypes specifically.
The strength of functional signal vs. the evolutionary background on the multiple windows of different length (from 8–200) in the signal region 1 can be well illustrated by the bootstrapping values (Fig. 4C). The right splitting of the two functional groups in this region (29-90aa), as predicted by SplitTester, showing a (local) maxima at the window length of 50 residues. Interestingly, when the window size is greater than 90, the topology by the split-clustering is a random mixture between functional split and evolutionary background; the bootstrapping value is for the deepest two subtrees. Rather, with the increase of the window length toward the window of full length sequence, the topology converges to the evolutionary tree as shown in Fig. 4A. Therefore, the region identified by SplitTester is intuitively valid even though although the bootstrapping (re-sampling) value is not very high. We also observed the very similar pattern for region 2; the highest bootstrap value in window with size of 32 aa around the median length within the range of 8 to 46.
Because the SplitTester employs a heuristic algorithm to decipher weak functional signal from the high level of evolutionary background, its power would be limited by several factors. If there are only a few amino acid residues related to a given functional split, or these residues are located in a broad sequence region, the statistical power for detection is low, as we expected. One improvement for SplitTester is to divide the proteins into several small regions, for instance, based on the protein structure, or motifs. Then one can implement some algorithms to integrate weak signals from these non-overlapping small regions, which otherwise are indistinguishable from the background.
There are several other factors that may affect the performance of SplitTester. The first one is the choice of substitution matrices of proteins. We have provided several options for the matrices. The user can select one of them based on their purpose. In general, BLUSOM matrices reflect the overall evolution history and hydrophobicity matrix can reflect more on chemistry properties of amino acids. Second, the inference power of this method might increase with the difference between the functional splitting and phylogenetic tree because the signals are more likely being distinguished from the background. Thus, increasing the sample size, for instance, sequencing more genes from different species is certainly helpful. Nevertheless, the example of reverse transcriptases we presented here has indicated that one may obtain valuable functional information even the difference between functional splitting and phylogenetic tree is weak as indicated by the bootstrap value.
There have been many methods developed to analyze sequences involved in functional divergence between protein subtypes (e.g., [25–26 4]). Most of them are focused on identifying specific residues contributing to the functional differences of subtypes along the phylogeny. Hence, the prediction accuracy may rely on the quality of multi-alignments, the accuracy of phylogenetic inference, or the sufficient number of sequences. Since the SplitTester focus on a particular sequence region that supports the subtype functional split, the predicted results seem not to be strongly affected if a few positions are misaligned. We validated this claim by deleting the position 166 in the multiple sequence alignment, which contain conserved F in Pseudovirus subfamily and A, V, E, T in each gene of Hemevirus subfamily respectively. This position was selected because it can be easily identified manually as one of three "seeds" of the growing signal region (Fig. 3A). Fortunately, SplitTester still identified the almost identical region 1 as before only in overlapped shorter windows (145–221) and (167–238) (data not shown). Similar results were obtained when we deleted each of other two "seeds" at position 184 (conserved K only in Pseudovirus subfamily) and 207 (conserved A only in Hemivirus subfamily).
SplitTester can explore regions potentially responsible for functional divergence of proteins. The best scope of this software is to study the dataset with different functional clustering and phylogenetic tree. As shown by the case of reverse transcriptase, function-related signals may emerge when the functional split is inconsistent with the phylogenetic relationship of the protein family. Even if the functional clustering is consistent with its phylogeny, SplitTester may also provide some useful information for amino acid residues important for functional divergence, e.g., the conserved Myb gene family. In spite that the phylogeny is the same as the known functional split of Myb genes, SplitTester still successfully identified 11 candidate residues that differentiate the two-and three-repeat Myb proteins, the major functional split of Myb gene family , because the sequence window including these residues shows a stronger functional signal. Indeed, these identified candidate residues are well supported by their locations on the NMR structure of the mouse c-Myb DNA binding domain 1MSF . Moreover, these residues are called by type-II functional divergence by , which can also be predicted by the "Evolution Trace" method . See additional files 1 and 2 for the detail. Finally, we mention that, after combining SplitTester with other complementary methods, such as the method of , "Evolution Trace" method , "Phylogenetic Inference" by Sjolander , Diverge , we can develop a powerful analysis pipeline for predicting functional divergence from sequence domains to amino acid residues.
In summary, we developed SplitTester – a tool for exploring the functional domains in protein family. SplitTester focus on a specific type of functional divergence: the functional split is different from the evolutionary relationship. Using the split-clustering algorithm, SplitTester scans all the possible local regions of protein sequence alignment to identify the domain that provide the same clustering topology as the functional split. In the analysis of retroelements reverse transcriptase family, we identified the regions splitting this family according to the primer specificity, implying function in the primer selection. The functional role can be well explained after we map the identified domain onto the structure of the reverse transcriptase protein.
project name: SplitTester
Project home page:http://www.public.iastate.edu/~voytas/SplitTester/
Operating system(s): windows 2000 and XP
Programming language: C
We thank Dr. Les Miller for helpful discussions. This work was supported by NIH grants GM61657 to D.F.V, and GM62118 to X.Gu.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.