- Research article
- Open Access
Enhancing the prediction of protein pairings between interacting families using orthology information
© Izarzugaza et al; licensee BioMed Central Ltd. 2008
Received: 24 October 2007
Accepted: 23 January 2008
Published: 23 January 2008
It has repeatedly been shown that interacting protein families tend to have similar phylogenetic trees. These similarities can be used to predicting the mapping between two families of interacting proteins (i.e. which proteins from one family interact with which members of the other). The correct mapping will be that which maximizes the similarity between the trees. The two families may eventually comprise orthologs and paralogs, if members of the two families are present in more than one organism. This fact can be exploited to restrict the possible mappings, simply by impeding links between proteins of different organisms. We present here an algorithm to predict the mapping between families of interacting proteins which is able to incorporate information regarding orthologues, or any other assignment of proteins to "classes" that may restrict possible mappings.
For the first time in methods for predicting mappings, we have tested this new approach on a large number of interacting protein domains in order to statistically assess its performance. The method accurately predicts around 80% in the most favourable cases. We also analysed in detail the results of the method for a well defined case of interacting families, the sensor and kinase components of the Ntr-type two-component system, for which up to 98% of the pairings predicted by the method were correct.
Based on the well established relationship between tree similarity and interactions we developed a method for predicting the mapping between two interacting families using genomic information alone. The program is available through a web interface.
The biological function of many proteins can only be understood in the context of their relationships with others. For this reason the biological knowledge represented in the "interactome" (the set of protein interactions for a given proteome) cannot be derived from the properties of the isolated proteins. In recent years, efforts have attempted to decipher networks of protein interactions using experimental and computational approaches .
One of computational techniques for predicting and studying protein interactions widely accepted by the community is that based on the similarity of phylogenetic trees. This arose from the initial qualitative observation of topological similarities between the phylogenetic trees of a number of interacting protein families [2–4]. Later these observations were systematized and quantified, and the use of tree-similarity based methods to separate interacting and non-interacting families in large sets of proteins was proposed [5, 6]. This original mirrortree approach was subsequently improved in different ways, for example by introducing a correction for the background similarity created by the underlying speciation events [7, 8]. The hypothesis that explains the relationship observed between tree similarity and interaction states that physically or functionally interacting proteins are subject to similar evolutionary pressure and forced to adapt to each other, both factors resulting in similar evolutionary histories, which are in turn reflected in similar trees. Although the co-adaptation hypothesis is still a matter of debate [9, 10], the relationship between tree similarity and protein interaction has repeatedly been demonstrated by different authors in different test sets.
Mirrortree and related approaches indirectly measure the similarity of two phylogenetic trees by comparing the distance matrices obtained from the trees or from the multiple sequence alignments. This has been demonstrated to be a practical shortcut that avoids the complications associated with the direct comparison of trees, a problem yet to be fully resolved. To compare the distance matrices, a "mapping" or correspondence between the leaves of both trees has to be established in order to confront the appropriate elements of the matrices. Hence, having a common underlying rationale, this mirrortree approach has two main angles. In the first, the mapping between the sequences of the two families is known (for example implicit in the orthologous relationships when there is only one sequence in each species). In this case, the similarity of the trees according to this mapping is evaluated in order to assess the possible interaction between these two families of orthologues [5–7]. In the second, the interaction between the two families is known and the expected high similarity between their trees is used as the criteria to select the mapping (the pairings between the members of both families). This approach, introduced by Ramani & Marcotte  and followed by others [12–15], explores different sets of relationships between two families of interacting proteins (e.g. a family of ligands and their corresponding receptors) on the basis that the correct mapping will be that which maximizes the similarity between the trees. This situation is particularly common in eukaryotic organisms where there are many cases of large families of interacting paralogues for which only one or a few pairs of interacting proteins have been experimentally determined, and where the goal is to decipher the entire set of interactions between the two families.
The way in which most of these methods explore different mappings is by swapping pairs of columns in one of the distance matrices. It is easy to see that swapping two columns (and the corresponding rows) is equivalent to interchanging the mappings of two proteins (linking B to all proteins previously linked to A and vice versa). The exhaustive exploration of all possible mappings will require n! calculations for a set of n elements, which soon becomes unfeasible for large protein families. For this reason, current methods use heuristic approaches (i.e. Monte Carlo) to perform a "guided" non-exhaustive exploration of the solution space. Due to their intrinsic heuristic nature, these methods do not ensure the global best solution will be found, and they are easily trapped within local sub-optimal solutions. To partially overcome this problem, these methods are run several times in the search for a consensus between the different analyses, commencing each execution from a different point in the solution space. Some of the methods also make use of the fact that the distance matrices have an underlying tree structure behind them that imposes some restrictions on the possible mappings, thereby reducing the search space .
Another intuitive way of decreasing the number of mappings to be explored is by avoiding pairings between proteins of incompatible "classes", for example between proteins of different species in families that contain both orthologues and paralogues. We have developed a new Monte Carlo based method to predict the mapping between two interacting protein families. This method overcomes some of the limitations of previous approaches by taking advantage of the information available on the species from which the trees are derived. Basically, the method does not allow pairings involving proteins from different species, or "incompatible classes" in general. This is different to applying the previous equivalent approaches separately to each of the species (or classes), since even if the mapping is restricted to intra-organism pairings the information of the whole tree is used to assess these mappings.
We evaluated the performance of this method on a large set of interacting protein domains and a well described case of co-evolving interacting families: the sensor kinases and response regulators of the Ntr-type two component system. This analysis proved to be more accurate than previous approaches. Additionally, the spectrum of variables related to the number of proteins within the families, the number of species in the alignments, the complexity of the trees, etc., could for the first time be explored to set the boundaries of the expected accuracy and its dependence on the characteristics of the data.
Results and Discussion
We first present the results for the large dataset of interacting domains. This allows to obtain performance figures based on a statistically large number of examples. Then, we discuss in detail the results for the Ntr-type two-component system.
Large-scale evaluation based on interacting protein domains
The relationship between accuracy (percentage of correct pairings) and family size (number of proteins in the largest family of the pair) shows that smaller families tend to produce better results (Figure 1a). This makes sense since the smaller the family the smaller the number of possible mappings (size of the solution space) and hence, it is easier for the algorithm to find a good mapping. For most of the proteins (most populated classes, between 45 and 100 members in the largest family) the accuracy values are around 50–80%.
Figure 1b shows the dependence of the accuracy on the size of the search space (NxM, N and M being the sizes of the two families). As expected the accuracy decreases with the size of the space of solutions, since it is more difficult for the algorithm to find the right solution. Nevertheless, this plot also shows that most pairs correspond to relatively small search spaces (widest bars) and hence they produce reasonable accuracy values, around 60%.
In terms of the relationship between the accuracy and the average number of paralogues (sequences per organism), the method performs better when there are fewer paralogues for each organism (Figure 1c). As the number of paralogues decreases, the size of the solution space also decreased drastically, since only pairs between proteins from the same organism were admitted.
The relationship between the accuracy and the number of pairings between the two families, that is the percentage of proteins that were involved in an interaction (represented as the percentage with respect to the number of proteins in the smallest family) was assessed (Figure 1d). It must be remembered that for the test set to remain as close to the real situation as possible, we did not require all the proteins to be paired. Indeed, the test set contains pairs of domains with many members in the corresponding families for which only a few were involved in interactions. It can be seen that it is easier for the method to succeed when more proteins are involved in interactions, while the results are worse when a smaller fraction of proteins were involved in interactions.
The dependence of the accuracy on the information content of the trees is shown in Figure 1e. It is clear that this method works better as the trees contain more information. Trees with low information content have a characteristic "star-like" shape, meaning that all the distances between leaves are very similar. This similarity of the distances makes ambiguous the matching with other trees, since there are many equivalent possibilities producing the same score. So, in general trees with high information content are associated to better results. However, it can be seen that trees with very low information content (low entropy) also produce good results. This is an artefact due to the fact that the entropy is not totally independent on the number of sequences. Trees with few sequences tend to produce low entropies and good results since the size of the search space is smaller (Figure 1a, discussed above). If we disregard this artefact it was clear that more informative trees produced better results.
Trying to better evaluate the contribution of "tagging" to the improvement of the results, we represent the relationship between the accuracy, the average number of sequences per tag (organism) and the number of organisms. If tagging is improving the results, the highest accuracies should be in the region of large number of organisms and low sequences/organism. This seems to be the case (Additional file 1, Figure S3).
Ntr-type two-components system
For comparative purposes, we run this method using RMSD as scoring function and obtained an accuracy of 86% pairs correctly predicted. Comparing with the corresponding figures for Marcotte's published data (57%, which also uses RMSD as distance measure), and this method with Pearson/Student-t (93%), it looks like the improvement comes from both, the usage of Pearson correlation, and the "tagging", and that this last factor produces a significant increase in the accuracy. To further asses the individual effect of the "tagging" in the accuracy, we ran the pairs of trees for each organism separately. In this case we used those organisms with at least three proteins in both families (22 organisms). With this procedure, we could retrieve 21 correctly assigned K0 pairings out of 77 (27,3%), an accuracy which is quite low compared to the results using tagging (see above). This shows how predicting a mapping using tag restrictions is better than running each one of the tags (organisms) separately and joining the results afterwards. The reason could be related to the inclusion of more information for discriminating mappings (inter-organism distances) while not increasing the size of the search space (due to the tag restrictions). Nevertheless, all these results should be taken carefully since they are based on a single example and a deeper analysis would be required to better dissociate the contribution of these factors.
In this work, we present a method for the prediction of the mapping between two families of interacting proteins based on the well-established relationship between similarity of phylogenetic trees and protein interactions. The method is able to incorporate information on the belonging of proteins to classes in order to restrict the protein-protein pairings according to these classes. The most general class we can define is the organism that the proteins belong to, in this way avoiding "heterologous" interactions between proteins from different organisms. Many other classes can also be defined, such as the cellular compartment, co-expression, etc. We show that these restrictions improve the performance of the method. It is important to note that applying this method to an example with N classes (i.e. N organisms) is different from applying other similar approaches N times to N pairs of trees, each one containing only paralogues for a given organisms. This fact, illustrated for the NtrC example, is possibly related to the inclusion of more information for distinguishing mappings (inter organism distances) while keeping the search space relatively small due to the restrictions associated to the classes.
Moreover, for the first time we apply here a tree-similarity-based method to predict pairings in a large set of interacting domains, which allowed us to obtain statistically significant figures regarding performance, and to asses their dependence on the characteristics of the input data. Previous studies have only tested the methodology employed on a very limited number of examples. Although this dataset is "unrealistic" in the sense that it is composed of protein domains instead of complete proteins, we have tried to keep the level difficulty as realistic as possible. For example, unpaired domains were left in the alignments that obviously makes it harder for the method to find the right solution. It is important to note that in our experience, this is the only large-scale test set possible, since there is little systematic information available on the interaction between members of proteins families that can be used to construct a large dataset of full-length proteins. Indeed, this lack of information strengthens the need for methods like that presented in this work. We complemented this large-scale set with a more "real-life" example that involves a well-described case of interacting families. In this case the method produces substantially better results than previous similar ones.
There are large families of interacting proteins for which only one or a limited number of pairs of interacting proteins have been experimentally determined (i.e. Ras/Ras effectors, chemokines/chemokine receptors). In many of these cases, which are highly significant in eukaryotic organisms, the differential interaction of the members of the two families is crucial to explain their biological roles. It is expected that with the continuous stream of genomic sequences, we will see more and more of these examples. Thus, it is in this context that the automatic method presented here can provide clues about specific interactions, complementing other computational and experimental techniques in the search for the molecular basis of functional specificity.
Description of the Algorithm
TAG-TSEMA uses a Monte Carlo algorithm to perform a directed non-exhaustive exploration of the solution space (possible mappings). Proteins can be "tagged" according to their membership to classes, and pairings (interactions) are permitted between proteins belonging to the same class while inter-class pairings are forbidden. In this work, we label each protein according to the organism it belongs to, restricting the possible pairings to those involving proteins from the same organism. It is important to note that this restriction of the search space through the creation of classes with external information permits many other possibilities, such as tissue-specific expression (disregarding pairings between proteins from different tissues), sub-cellular localization, temporal co-expression, etc. The method can also deal with an additional class, called undef, for proteins that were not associated to any given class.
The computational complexity is reduced from n! (where n is the number of members in the family) to Σn i ! Where n i is the number of proteins within a given class, being the sum-up done along the whole set of classes (organisms). Given that the number of elements in the distance matrix is the main constraint for the feasibility of the calculation, by reducing the size of the search space the approach proposed here can deal with larger protein families.
By including paralogue/orthologue relationships, we are reinforcing the protein interaction signal between given protein pairs for each organism. This is because we are adding more parallel links between sets of interacting orthologues, while avoiding the increase in the complexity of the problem due to artificial inter-species coupling.
Generate the phylogenetic trees for both families starting from their multiple sequence alignments. Different tree-construction methods have been used in this work for the purpose of comparison.
Build the corresponding distance matrices, calculating the distances by summing the lengths of the branches separating each pair of nodes in the trees.
Randomly shuffle one of the matrices to establish a set of initial (tag-restricted) random mappings (1000 by default). Each of these random mappings represents a different starting point in the solution space where the Monte Carlo searches can start.
For each one of these initial random mappings.
4.1. Calculate the starting correlation between the two distance matrices according to the mapping.
4.2. Randomly select a class (organism) and a protein within this class in the second matrix to be swapped with another feasible protein (undef or of the same class). The rows and columns corresponding to these two proteins are swapped.
4.3. Re-calculate the correlation between the distance matrices according to the new mapping obtained after the swap. If the correlation is better the change is accepted and if not, the previous mapping is recovered (that before row swapping).
The overall best solution obtained for all the initial mappings is recorded, together with the best solutions of each of the mappings.
The "cooling temperature" parameter of the Monte Carlo algorithm is set to 0.95.
The algorithm implemented in C includes the possibility of confronting two families with a different number of proteins, which is the situation for most real interacting families, due to promiscuity, pseudogenes, etc. In the current implementation of the algorithm it is also possible to use three alternative scoring functions to assess the correlation between the distance matrices and hence, the proposed mappings: root mean square deviation (RMSD), Pearson's linear correlation, and its corresponding Student's t. However, in order to compare families of very different sizes, all the results in the forthcoming analysis are based on the Student's t test. We used the value of the Student's t itself, instead of the associated p-values.
Where the sum runs for all the distance bins (N = 15 in this work), and p i is the fraction of distances within the i th bin.
We tested the method with a well-studied case of two interacting protein families, allowing us to compare it with other equivalent methods which used the same example. Additionally, we also assessed the performance of the method on a large dataset of interacting protein domains which, despite being a little far from a real-world application, has many of the characteristics of full-length proteins. This large dataset allows us to obtain statistical information regarding the performance of the method. It would have been desirable to test the method with a large dataset of whole-length proteins that contained a reasonable amount of information on the interactions between their members. However, such a dataset does not currently exist and, in fact, the lack of information on interactions between family members that would be necessary to build such a dataset highlights the importance of methods like that presented here.
Large scale dataset of interacting protein domains
We tested the new method on a large set of 604 protein domains (forming 488 interacting pairs) for which the right mapping is simply given by the presence of the two domains in the same protein chain. Due to its large size, this set allows a statistical assessment of the performance of the method and its dependence on the characteristics of the data.
We started by retrieving pairs of domains from Pfam  that could be linked to different regions of the same Yeast protein (i.e.: different regions of the same protein are asociated to the two domains). For example, if a Yeast protein A matches two Pfam domains (X and Y), in its N-terminal and C-terminal regions respectively, we would take X-Y as a pair of interacting Pfam domains. For this pair of interacting Pfam domains, the "real mapping" (set of links between the actual sequences within these Pfam domains) is given by the co-presence of these sequences in the same protein chain. I.e. An homolog of protein A (A') associated to the same domains X and Y, would define a link (between the sequences of its N-terminal and C-terminal parts). Protein A itself would also define another link. All the links between X and Y constitute the "real mapping", which is compared with the mapping predicted by the method for this particular pair X-Y.
In order to be sure about the uniqueness and quality of the sequences, we used only eukaryotic proteins that have an entry in SwissProt (hence removing those coming from TREMBL). Then, we filtered out those low quality sequences annotated as "fragment", "hypothetic" or "putative" proteins. Finally, we removed those pairs of domains with less than fifteen sequences mapped between them. We also removed pairs of domains in which any of the two members involved less than four organisms with more than 3 sequences each, and the ones with less than 50% of the proteins paired (with respect to the family with the lowest number of sequences). This process yielded 488 interacting pairs of domains, comprising 604 individual domains.
It is important to note that this set represents an especially tough test for the method, because not all the sequences have to be paired for a given pair of domains. This is similar to an extreme real case in which we have information about the interaction between two proteins, but no clue about the extension of their families. Therefore, our set is full of cases with few domain-domain interactions that we would expect to produce poor predictions using our approach (poor co-evolutionary signals with a lot of noise). We did not impose this or any other restrictions in order for the cases to be as close to the real situation as possible.
For each domain within this test set, we align its sequences with MUSCLE . In order to test the influence of different estimators of protein divergence, we generated three sets of phylogenetic trees using the Neighbour Joining algorithm with three different evolutionary distances: simple percentage of divergence without including gaps (as implemented in ClustalW : (100-%id)/100, where %id is the percentage of identical residues), Kimura's correction, and Scoredist (both as implemented in Belvu, E. Sonnhammer, unpublished). Given the large-scale nature of this test we were limited to the Neighbour Joining algorithm for the generation of the trees. However, we would expect that more accurate trees will improve the results of our method.
Although the program can be run on any computer, its application to this large set of examples requires a lot of computer power. Hence, that test set was run on MareNostrum (MN), one of the largest computers in the world dedicated to science. For this specific task, approximately 730 hours of MN computer time were used to run TAG-TSEMA for these 488 pairs of domains, distributed over 32 processors, with an average of 10 minutes for each pair. The execution of a single pair comprises 500 complete runs of the entire Monte Carlo search consisting of a million steps each.
For each one of the 488 pairs of domains in the test set, the accuracy (percentage of correct pairings respect to the real mapping-see above-) of the top-scoring solution was calculated.
Ntr-type two-components system
We also tested the method with a more realistic case of two interacting families of full-length proteins for which enough information on the pairings between their members is available. The interactions between the Sensor kinases and the corresponding Response Regulator proteins in the Ntr-type two-component system has been previously studied in by Ramani and Marcotte , and we followed a similar strategy to construct this test set. We retrieved the 14 documented interactions between individual response regulator and kinases from KEGG . According to the KEGG orthology assignments, both the 14 sensor kinases and the 14 response regulators are classified into 6 sets of KEGG's functional orthologues (KOs). Hence, we established the corresponding KO-KO interactions based on the original 14 protein-protein interactions (Additional file 1, Table S1), a process that results in 7 KO-KO interactions. To build the entire families we retrieved all the sequences present in KEGG for these 12 KOs (6 from each tree), and generated the MSAs and the trees by using the approach previously described for the protein domains dataset. By following this procedure we obtained a tree comprised of 213 proteins for the sensor kinase family and another with 286 proteins for the response regulator family. We can evaluate the mapping between these two trees predicted by TAG_TSEMA either based on the original 14 documented interactions, or through the KO-KO interactions extrapolated from them.
Availability and Requirements
The program TAG-TSEMA can be used through the following web server: http://tagtsema.bioinfo.cnio.es. This server combines the method described here with an interactive visualization and post-processing of the results as described in . In this server, the assignment of the proteins to classes (organisms, subcellular localization, etc) is indicated in the input files with the "_" symbol, using a Swissprot-like nomenclature (i.e. "prot1_class1", "prot2_class2", etc).
We are especially grateful to Antonio Rausell for interesting discussions and advice, especially about the statistical analysis. This work was in part funded by the projects BIO2006-15318 and PIE 200620I240 from the Spanish Ministry for Education and Science, and the European Union Projects LSHG-CT-2004-503567 (GENEFUN), LSHG-CT-2003-503265 (BIOSAPIENS), LSHG-CT-2004-512092 (EMBRACE) and LSHG-CT-2004-503568 (COMBIO). Computer support was provided by the Barcelona Supercomputer Centre (BSC) through the project BCV-2006-4-0010.
- Uetz P, Finley RL Jr: From protein networks to biological systems. FEBS Lett 2005, 579: 1821–1827. 10.1016/j.febslet.2005.02.001View ArticlePubMedGoogle Scholar
- Fryxell KJ: The coevolution of gene family trees. Trends Genet 1996, 12: 364–369. 10.1016/S0168-9525(96)80020-5View ArticlePubMedGoogle Scholar
- van Kesteren RE, Tensen CP, Smit AB, van Minnen J, Kolakowski LF, Meyerhof W, Richter D, van Heerikhuizen H, Vreugdenhil E, Geraerts WP: Co-evolution of ligand-receptor pairs in the vasopressin/oxytocin superfamily of bioactive peptides. J Biol Chem 1996, 271: 3619–3626. 10.1074/jbc.271.7.3619View ArticlePubMedGoogle Scholar
- Pages S, Belaich A, Belaich JP, Morag E, Lamed R, Shoham Y, Bayer EA: Species-specificity of the cohesin-dockerin interaction between Clostridium thermocellum and Clostridium cellulolyticum: prediction of specificity determinants of the dockerin domain. Proteins 1997, 29: 517–527. 10.1002/(SICI)1097-0134(199712)29:4<517::AID-PROT11>3.0.CO;2-PView ArticlePubMedGoogle Scholar
- Goh C-S, Bogan AA, Joachimiak M, Walther D, Cohen FE: Co-evolution of Proteins with their Interaction Partners. J Mol Biol 2000, 299: 283–293. 10.1006/jmbi.2000.3732View ArticlePubMedGoogle Scholar
- Pazos F, Valencia A: Similarity of phylogenetic trees as indicator of protein-protein interaction. Protein Eng 2001, 14: 609–614. 10.1093/protein/14.9.609View ArticlePubMedGoogle Scholar
- Pazos F, Ranea JAG, Juan D, Sternberg MJE: Assessing Protein Co-evolution in the Context of the Tree of Life Assists in the Prediction of the Interactome. J Mol Biol 2005, 352: 1002–1015. 10.1016/j.jmb.2005.07.005View ArticlePubMedGoogle Scholar
- Sato T, Yamanishi Y, Kanehisa M, Toh H: The inference of protein-protein interactions by co-evolutionary analysis is improved by excluding the information about the phylogenetic relationships. Bioinformatics 2005, 21: 3482–3489. 10.1093/bioinformatics/bti564View ArticlePubMedGoogle Scholar
- Hakes L, Lovell S, Oliver SG, Robertson DL: Specificity in protein interactions and its relationship with sequence diversity and coevolution. Proc Natl Acad Sci USA 2007, 104: 7999–8004. 10.1073/pnas.0609962104PubMed CentralView ArticlePubMedGoogle Scholar
- Mintseris J, Weng Z: Structure, function, and evolution of transient and obligate protein-protein interactions. Proc Natl Acad Sci USA 2005, 102: 10930–10935. 10.1073/pnas.0502667102PubMed CentralView ArticlePubMedGoogle Scholar
- Ramani AK, Marcotte EM: Exploiting the co-evolution of interacting proteins to discover interaction specificity. J Mol Biol 2003, 327: 273–284. 10.1016/S0022-2836(03)00114-1View ArticlePubMedGoogle Scholar
- Tillier ER, Biro L, Li G, Tillo D: Codep: maximizing co-evolutionary interdependencies to discover interacting proteins. Proteins 2006, 63: 822–831. 10.1002/prot.20948View ArticlePubMedGoogle Scholar
- Gertz J, Elfond G, Shustrova A, Weisinger M, Pellegrini M, Cokus S, Rothschild B: Inferring protein interactions from phylogenetic distance matrices. Bioinformatics 2003, 19: 2039–2045. 10.1093/bioinformatics/btg278View ArticlePubMedGoogle Scholar
- Izarzugaza JM, Juan D, Pons C, Ranea JA, Valencia A, Pazos F: TSEMA: interactive prediction of protein pairings between interacting families. Nucleic Acids Res 2006, 34: W315–319. 10.1093/nar/gkl112PubMed CentralView ArticlePubMedGoogle Scholar
- Jothi R, Kann MG, Przytycka TM: Predicting protein-protein interaction by searching evolutionary tree automorphism space. Bioinformatics 2005, 21: i241-i250. 10.1093/bioinformatics/bti1009PubMed CentralView ArticlePubMedGoogle Scholar
- Tukey J: Exploratory Data Analysis. Reading, MA: Addison-Wesley; 1977.Google Scholar
- Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, Yeats C, Eddy SR: The Pfam protein families database. Nucleic Acids Res 2004, 32: D138–141. 10.1093/nar/gkh121PubMed CentralView ArticlePubMedGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32: 1792–1797. 10.1093/nar/gkh340PubMed CentralView ArticlePubMedGoogle Scholar
- Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 2003, 31: 3497–3500. 10.1093/nar/gkg500PubMed CentralView ArticlePubMedGoogle Scholar
- Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resource for deciphering the genome. Nucleic Acids Res 2004, 32: D277–280. 10.1093/nar/gkh063PubMed CentralView ArticlePubMedGoogle Scholar
- Tamura K, Dudley J, Nei M, Kumar S: MEGA4: Molecular evolutionary genetics analysis (MEGA) software version 4.0. Mol Biol Evol 2007, 24: 1596–1599. 10.1093/molbev/msm092View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.