OrthoSelect: a protocol for selecting orthologous groups in phylogenomics
© Schreiber et al; licensee BioMed Central Ltd. 2009
Received: 22 October 2008
Accepted: 16 July 2009
Published: 16 July 2009
Phylogenetic studies using expressed sequence tags (EST) are becoming a standard approach to answer evolutionary questions. Such studies are usually based on large sets of newly generated, unannotated, and error-prone EST sequences from different species. A first crucial step in EST-based phylogeny reconstruction is to identify groups of orthologous sequences. From these data sets, appropriate target genes are selected, and redundant sequences are eliminated to obtain suitable sequence sets as input data for tree-reconstruction software. Generating such data sets manually can be very time consuming. Thus, software tools are needed that carry out these steps automatically.
We developed a flexible and user-friendly software pipeline, running on desktop machines or computer clusters, that constructs data sets for phylogenomic analyses. It automatically searches assembled EST sequences against databases of orthologous groups (OG), assigns ESTs to these predefined OGs, translates the sequences into proteins, eliminates redundant sequences assigned to the same OG, creates multiple sequence alignments of identified orthologous sequences and offers the possibility to further process this alignment in a last step by excluding potentially homoplastic sites and selecting sufficiently conserved parts. Our software pipeline can be used as it is, but it can also be adapted by integrating additional external programs. This makes the pipeline useful for non-bioinformaticians as well as to bioinformatic experts. The software pipeline is especially designed for ESTs, but it can also handle protein sequences.
OrthoSelect is a tool that produces orthologous gene alignments from assembled ESTs. Our tests show that OrthoSelect detects orthologs in EST libraries with high accuracy. In the absence of a gold standard for orthology prediction, we compared predictions by OrthoSelect to a manually created and published phylogenomic data set. Our tool was not only able to rebuild the data set with a specificity of 98%, but it also detected four percent more orthologous sequences. Furthermore, the results OrthoSelect produces are in absolute agreement with the results of other programs, but our tool offers a significant speedup and additional functionality, e.g. handling of ESTs, computing sequence alignments, and refining them. To our knowledge, there is currently no fully automated and freely available tool for this purpose. Thus, OrthoSelect is a valuable tool for researchers in the field of phylogenomics who deal with large quantities of EST sequences. OrthoSelect is written in Perl and runs on Linux/Mac OS X. The tool can be downloaded at http://gobics.de/fabian/orthoselect.php
DNA and protein sequences provide a wealth of information which is routinely used in phylogenetic studies. Traditionally, single genes or small groups of genes have been used to infer the phylogeny of a group of species under study. It has been shown, however, that molecular phylogenies based on single genes often lead to apparently conflicting tree hypotheses. The combination of a large number of genes and species in genome-scale approaches for the reconstruction of phylogenies can be useful to overcome these difficulties. This approach has been termed phylogenomics.
Since complete genome sequences are available only for a limited number of species, many phylogenomic studies rely on EST sequences. EST sequences are short (~200–800 bases), unedited, randomly selected single-pass reads from cDNA libraries that sample the diversity of genes expressed by an organism or tissue at a particular time under particular conditions. The relatively low cost and rapid generation of EST sequences can deliver insights into transcribed genes from a large number of taxa. Moreover, EST sequences contain a wealth of phylogenetic information. Several recent phylogenomic studies used EST sequences to generate large data matrices, e.g. [4–7]. Such studies start with the generation of EST libraries for a set of species. Overlapping EST sequences from single coding regions are then assembled into contigs, and orthologous genes are identified as a basis for phylogenetic reconstruction. Homologous sequences are called orthologs if they were separated by a speciation event, as opposed to paralogous sequences, which were separated by a duplication event within the same species. If the last speciation event predates the gene duplication event, homologous sequences are called inparalogs. Orthologs are usually functionally conserved, whereas paralogs tend to have different functions and are less useful in phylogenetic studies, because true genealogical relationships among taxa can be reconstructed from them only with great difficulty. A typical protocol for detecting orthologs in phylogenomic studies should include (1) a similarity search using tools like BLAST, (2) a strategy to select a subset of hits returned by this search, (3) a criterion to identify sequences as potential orthologs, and (4) a strategy for eliminating potential paralogs in case several sequences from the same species have been assigned as potential orthologs to the same orthologous group.
Orthology assignment is a crucial prerequisite for phylogeny reconstruction as faulty assumptions about orthology – e.g. the inclusion of paralogs – can lead to an incorrect tree hypothesis . Errors can result from similarity searches against non-specialized databases, e.g. NCBI's nr database, or from best-hit selection strategies such as best reciprocal hit  or best triangular hit that may lead to false positive orthology predictions. The similarity between a query and a database sequence stemming from a similarity search – expressed for example as a bit-score or expectation value (E-Value) – is usually taken as a criterion to predict an orthologous relationship. Since the results of these methods depend on the choice of a database and on the strategy to select sequences from similarity search hits, a more reliable protocol for ortholog predictions is needed.
Several databases and computational methods for predicting orthologs are available. Multi-species ortholog databases have been developed based on different sources of orthologous information. They include information about orthologous relationships between sequences. The OrthoMCL-DB database  and the KOG database  have been constructed from whole genome comparisons, HomoloGene  on the basis of synteny. HOVERGEN  and TreeFam  were constructed using the orthologous information from phylogenetic trees. Two of these databases, OrthoMCL-DB and KOG, explicitly define orthologous groups (OG) which can be used as a source for orthology assignment of unknown sequences using similarity searches.
Most computational methods to identify orthologs are based either on a phylogenetic analysis or on all-against-all BLAST searches. The former approach is computationally expensive and usually requires manual intervention. All-against-all approaches use every sequence from the input data set as a query for BLAST searches against sequences from the respective other species. This generates OGs based on some similarity measure, e.g. using all best reciprocal hits. These OGs can be processed further to merge, delete, or separate overlapping groups using a clustering algorithm, as implemented in e.g. OrthoMCL or Inparanoid. Zhou and Landweber developed BLASTO, a different computational method for orthology prediction that includes information from an orthologous database. Other important aspects in data set construction for phylogenetic analysis on a large scale are (1) correct identification of open reading frames in ESTs and their translation, (2) careful selection of target genes to maximize the phylogenetic information, (3) elimination of redundant sequences, and (4) a refinement step to select conserved blocks and remove homoplasy from multiple sequence alignments.
Nowadays, data sets in phylogenomic studies can easily contain dozens of taxa and hundreds of genes. The construction of data sets of that size for phylogenomic studies is time-consuming and can hardly be achieved manually. To the best of our knowledge, no software pipeline is currently available that performs the above steps automatically. Herein, we present a software pipeline, called OrthoSelect, to process clustered EST sequences automatically for phylogenomic studies. Our goal is to give both non-bioinformaticians and bioinformatic experts a useful framework to carry out analyses on a phylogenomic scale. It integrates publicly available bioinformatic tools and manages data processing and storage. Although the software pipeline is designed to automate the construction of data sets for phylogenomic studies, the user can evaluate intermediate results at any time during the analysis. OrthoSelect produces automatically calculated and post-processed alignments that can be used as input for common phylogenetic reconstruction software. In a large-scale study, we applied OrthoSelect to a data set from metazoan species consisting of > 950,000 ESTs belonging to 71 taxa (unpublished data). In order to assess the quality of OrthoSelect predictions in relation to results obtained from other methods, we compared OrthoSelect to the manually created and published phylogenomic data set by Dunn et al. Since our tool offers increased functionality compared to other tools for orthology prediction (e.g. OrthoMCL), our tests focus on the assignment of orthology only, and do not cover the correct translation of ESTs, gene selection, alignment computation, and alignment post-processing.
Using the orthologous groups (OG) defined by KOG or OrthoMCL-DB as a basis, orthologous ESTs are detected by a similarity search of ESTs against the ortholog database and assigning them to the OGs using our reimplementation of BLASTO. The ESTs are then translated and stored. Redundant sequences within each OG are eliminated and an alignment of the remaining sequences is computed. In a last step, we use sophisticated post-processing methods to filter out non-informative or misleading information from the alignment (see Figure 1). The entire analysis is guided by a configuration file containing the main parameters and options for each external program.
The first step of the software pipeline comprises the detection of potential orthologs in EST libraries (see Figure 1, Point 1). This is a critical step, because false ortholog assignments can lead to serious errors in the resulting phylogenetic tree. Orthologs are detected by searching an ortholog database – either KOG or OrthoMCL-DB – with a query EST using blastx and subsequently the resulting hits are clustered according to an algorithm similar to that used in BLASTO. A standard BLAST search returns a list of hits ordered by their significance. By contrast, BLASTO calculates similarity values between the query sequence and entire groups of orthologs (OGs).
Here, E_i is the E-value of the BLAST alignment of f_i with the query sequence s, and |g'| is the number of species in g'. Finally, every EST sequence s is assigned to those orthologous groups g with a similarity score S_g,s above a given threshold. We allow multiple assignments of a single EST, because ESTs can represent domains rather than full genes, and they should be assigned to all OGs containing that domain (e.g. the OGs KOG0100, KOG0101, and KOG0102 of KOG all contain the same Pfam domain HSP70). All ESTs assigned to the same OG are now potential orthologs. Redundant sequences will be removed later (see section Eliminating Redundancies).
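The thresholded, multi-group assignment described above can be sketched as follows. The exact scoring formula is not reproduced in this text, so the score used here (the sum of -log10(E_i) over the hits in a group, divided by the number of species among those hits) is only an assumption made for illustration; the function names, the tuple layout of `hits`, and the E-value floor are likewise hypothetical.

```python
import math
from collections import defaultdict


def score_groups(hits, min_evalue=1e-180):
    """Score each orthologous group (OG) hit by one query EST.

    hits: iterable of (og_id, species, evalue) tuples, one per BLAST hit.
    ASSUMPTION: score = sum(-log10(E_i)) / number of species among the hits
    of that OG. This stands in for the formula that is missing from the text.
    """
    log_sums = defaultdict(float)
    species_seen = defaultdict(set)
    for og_id, species, evalue in hits:
        # Clamp E-values of 0.0 so that the logarithm stays defined.
        log_sums[og_id] += -math.log10(max(evalue, min_evalue))
        species_seen[og_id].add(species)
    return {og: log_sums[og] / len(species_seen[og]) for og in log_sums}


def assign_est(hits, threshold):
    """Return every OG whose score exceeds the threshold (multiple allowed)."""
    scores = score_groups(hits)
    return sorted(og for og, score in scores.items() if score > threshold)


if __name__ == "__main__":
    example_hits = [
        ("KOG0100", "human", 1e-50),
        ("KOG0100", "fly", 1e-30),
        ("KOG0101", "human", 1e-40),
        ("KOG9999", "yeast", 1e-2),
    ]
    print(assign_est(example_hits, threshold=10.0))
```

Returning a list rather than a single best group mirrors the decision to allow one EST to be assigned to several OGs that share a domain.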
In the next step, potential coding regions in assembled EST sequences are detected and translated into proteins. By their nature, EST sequences often contain sequencing errors and may cover genes only partially. These errors result in, e.g., reading frame shifts that make translation non-trivial. Several algorithms have been developed to overcome this problem. DIANA-EST uses a combination of Artificial Neural Networks, while ESTScan uses Hidden Markov Models. In contrast, DECODER implements rule-based methods, and GeneWise uses a known protein as a template. In addition, combinations of these methods have been proposed to identify coding regions and to translate EST sequences correctly, e.g. prot4EST. We use a comparative approach based on different well-established translation programs. Each EST is translated (using ESTScan, GeneWise, and a standard six-frame translation using BioPerl) and aligned to the best hit from the previous BLAST search using bl2seq. The translated sequence with the lowest E-value is then chosen as the correctly translated sequence. This way, the probability of getting correctly translated ESTs is increased. Our goal was to fully automate the installation of all external programs. We did not include prot4EST, since it requires additional programs, one of which is not freely available for download and therefore cannot be installed automatically.
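The selection among competing translations can be illustrated with a minimal sketch. It assumes that each translation method has already been run and that the E-value of its alignment to the best BLAST hit is known; the `Translation` record and the function name are hypothetical and not part of OrthoSelect.

```python
from typing import NamedTuple, Optional, Sequence


class Translation(NamedTuple):
    method: str               # e.g. "ESTScan", "GeneWise", "six-frame"
    protein: str              # translated amino acid sequence
    evalue: Optional[float]   # E-value against the best BLAST hit, None if no alignment


def pick_translation(candidates: Sequence[Translation]) -> Optional[Translation]:
    """Return the candidate with the lowest E-value, or None if none aligned."""
    scored = [c for c in candidates if c.evalue is not None and c.protein]
    if not scored:
        return None
    # min() keeps the first candidate on ties, so input order acts as a tie-breaker.
    return min(scored, key=lambda c: c.evalue)


if __name__ == "__main__":
    best = pick_translation([
        Translation("ESTScan", "MKTAYIAK", 1e-20),
        Translation("GeneWise", "MKTAYIAKQR", 1e-35),
        Translation("six-frame", "MKTA", 1e-5),
    ])
    print(best.method if best else "no translation")
```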
Taxon/Gene Sampling Strategy
Genes can be selected in one of two ways:
1. The user selects a subset of individual species under study. In this case, those OGs are selected that contain at least one EST from each of the user-selected species.
2. The user defines groups of species (e.g. groups that are thought to be monophyletic). Our tool then selects those OGs that contain at least one EST sequence for each of the specified groups.
The idea of these two methods is to select the maximal biclique of a graph whose nodes are the OGs and either the taxa (option 1) or the monophyla (option 2). The selection of genes according to these two methods focuses on maximising the phylogenetic signal in the data set (see Figure 1, Point 2).
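Both selection options reduce to the same coverage test, which the following sketch illustrates: an OG is kept if every required group of species is represented by at least one EST. For option 1, each group contains a single species. This is a plain filter under assumed data structures, not an implementation of a general maximal-biclique search, and all names are hypothetical.

```python
def select_ogs(og_members, required_groups):
    """Keep OGs in which every required group of species is represented.

    og_members: dict mapping OG id -> set of species with at least one EST.
    required_groups: list of sets of species; for option 1 (individual
    species) each set holds exactly one species.
    """
    return sorted(
        og
        for og, species in og_members.items()
        if all(species & group for group in required_groups)
    )


if __name__ == "__main__":
    members = {
        "OG1": {"sponge_a", "cnidarian_a", "fly"},
        "OG2": {"sponge_b", "fly"},
        "OG3": {"sponge_a", "sponge_b", "cnidarian_b", "fly"},
    }
    # Option 1: every listed species must be present.
    print(select_ogs(members, [{"sponge_a"}, {"fly"}]))
    # Option 2: at least one species from each group must be present.
    print(select_ogs(members, [{"sponge_a", "sponge_b"}, {"cnidarian_a", "cnidarian_b"}]))
```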
Multiple divergent copies of the same gene and different levels of stringency during EST assembly can lead to a situation where OGs contain more than one sequence for each species. Depending on the size of the study, OGs can contain hundreds of sequences, which makes manual elimination of redundant sequences impossible. It is also known that some of the orthologous groups in KOG contain not only orthologous genes but also paralogs. In these cases, a fast and reliable method is needed to select the correct sequence for each species. We work with the assumption that a gene from one organism is often more similar to an orthologous gene from another organism than to paralogs from that organism. This seems plausible based on both the definition of orthology and the fact that orthologs typically retain the same function. A scenario where a gene from one organism is more similar to a paralog than to its ortholog from another organism would require a considerable difference in the rate of paralog evolution. Since this is the exception rather than the rule, and since OrthoSelect aims at the production of gene alignments containing only one sequence per species, we do not consider such cases.
For species S, we then select the sequence s for which this number is maximal (see Figure 3).
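A selection of this kind can be sketched as below. The quantity that is maximised is not defined in this text, so the sketch assumes, purely for illustration, that it is the number of sequences from other species to which a candidate is the closest among all candidates of species S. The distance function and all names are hypothetical.

```python
def mismatch_fraction(a, b):
    """Toy distance: fraction of differing positions, length difference included."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    mismatches = sum(1 for x, y in zip(a, b) if x != y) + abs(len(a) - len(b))
    return mismatches / longest


def pick_representative(candidates, others, distance=mismatch_fraction):
    """Pick one sequence of species S from an OG.

    candidates: sequences of species S assigned to the OG.
    others: sequences of all other species in the same OG.
    ASSUMPTION: the winner is the candidate that is the nearest candidate
    for the largest number of sequences in `others`.
    """
    if not candidates:
        return None
    counts = [0] * len(candidates)
    for other in others:
        nearest = min(range(len(candidates)),
                      key=lambda i: distance(candidates[i], other))
        counts[nearest] += 1
    # max() returns the first index on ties, so input order breaks ties.
    return candidates[max(range(len(candidates)), key=lambda i: counts[i])]


if __name__ == "__main__":
    print(pick_representative(["MKTAY", "MRTGW"], ["MKTAF", "MKSAY", "MRTGY"]))
```

In practice the distance would come from pairwise alignments rather than from a position-wise comparison of unaligned strings.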
Multiple Sequence Alignment
By default, the previously selected sequences are aligned using either MUSCLE or T-Coffee [40, 41]. Other standard methods for multiple alignment can be used as well, e.g. ProbCons, MAFFT [43, 44], DIALIGN [45, 46] or DIALIGN-TX [47, 48].
Once multiple alignments have been calculated for selected groups of orthologous EST sequences, these alignments can be further processed to exclude columns that are not suitable for phylogenetic analysis. Since not all parts of a gene evolve at the same rate, alignments typically contain highly conserved as well as less conserved sites. Alignment columns that are too conserved do not contain any phylogenetic signal. The same holds true for parts of the sequences that are too divergent to be correctly aligned. Another problem that confounds phylogenetic reconstruction is the presence of homoplasy caused by back- or parallel-mutation. Several programs have been developed to tackle these problems by automatically selecting sufficiently conserved blocks from alignments, for example Gblocks and Aliscore, or by eliminating potentially homoplastic sites, e.g. Noisy. Gblocks, Aliscore, and Noisy are incorporated in our software pipeline to allow a broad spectrum of alignment post-processing, thereby increasing the accuracy of the subsequent phylogenetic analysis (see Figure 1, Point 4). Furthermore, alignments processed by Gblocks can be filtered further by discarding sequences that are too short (e.g. sequences with > 50% missing characters).
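The final filter on sequence completeness can be sketched in a few lines. The sketch assumes an alignment held as a dictionary of equal-length strings and treats "-" and "?" as missing characters; both the data layout and the choice of missing-data symbols are assumptions made for illustration.

```python
def drop_gappy_sequences(alignment, max_missing=0.5, missing_chars="-?"):
    """Discard aligned sequences whose fraction of missing characters
    exceeds max_missing (sequences at exactly the limit are kept).

    alignment: dict mapping taxon name -> aligned sequence string.
    """
    kept = {}
    for name, seq in alignment.items():
        if not seq:
            continue  # an empty row carries no information
        missing = sum(seq.count(ch) for ch in missing_chars)
        if missing / len(seq) <= max_missing:
            kept[name] = seq
    return kept


if __name__ == "__main__":
    aln = {"taxon_a": "MKTAYI--", "taxon_b": "M------?", "taxon_c": "MKTAYIAK"}
    print(sorted(drop_gappy_sequences(aln)))
```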
Results and Discussion
OrthoSelect is the first fully automated and freely available tool that covers the whole process from selecting orthologs in EST libraries to producing orthologous gene alignments that can be used to build phylogenies. In the absence of a gold standard for benchmarking orthology prediction, and in order to evaluate the performance of our program, we designed the following tests. First, OrthoSelect was compared to the best-hit selection strategy using a set of sequences from JGI with KOG annotations. Second, we evaluated the performance compared to the KOG database by re-annotating (re-assigning) ortholog database sequences. In the third and most powerful test, we compared OrthoSelect to a manually created and published phylogenomic data set. In this context, we also compared our tool with OrthoMCL.
OrthoSelect vs. Best-hit selection strategy
Results from orthology assignment: OrthoSelect vs. Best-hit selection strategy.
OrthoSelect vs. KOG
In the absence of a reference data set for orthology prediction, and because our tool mainly automates a process rather than providing a completely new method for orthology prediction, we compared OrthoSelect to the KOG database by re-annotating (re-assigning) ortholog database sequences. We proceeded as follows: 5000 sequences were randomly chosen and masked out from the ortholog database. The remaining sequences were converted into a BLAST-searchable database. We then ran OrthoSelect using each of the 5000 sequences as a query sequence against the masked database. Assuming that the original ortholog group assignment in the ortholog database represents the correct orthology relation, we calculated in how many cases our orthology assignment matched the original assignment. We could assign the query sequences to the correct ortholog group in 92% of the cases.
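The design of this re-annotation test can be sketched as a hold-out evaluation. In the sketch, `predict` is a hypothetical callable that stands in for a run of the pipeline against the masked database; it is not an OrthoSelect interface, and the remaining names and the fixed random seed are likewise assumptions.

```python
import random


def reannotation_accuracy(og_of, predict, n_masked=5000, seed=0):
    """Mask a random subset of database sequences, re-assign them, and
    report the fraction that lands in its original orthologous group.

    og_of: dict mapping sequence id -> original OG id.
    predict: callable(sequence_id, reference) -> OG id or None, where
             reference is the database with all masked sequences removed.
    """
    rng = random.Random(seed)
    ids = sorted(og_of)
    masked = set(rng.sample(ids, min(n_masked, len(ids))))
    if not masked:
        return 0.0
    # The reference database must not contain any masked sequence.
    reference = {sid: og for sid, og in og_of.items() if sid not in masked}
    correct = sum(1 for sid in masked if predict(sid, reference) == og_of[sid])
    return correct / len(masked)


if __name__ == "__main__":
    toy_db = {f"og{i % 3}_seq{i}": f"og{i % 3}" for i in range(30)}
    # Toy predictor that reads the group from the identifier prefix.
    print(reannotation_accuracy(toy_db, lambda sid, ref: sid.split("_")[0], n_masked=10))
```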
OrthoSelect vs. manually created data set by Dunn et al
The goal of our tool is to automate the process of constructing data sets that can be used for subsequent phylogenetic analyses. To test our tool in this respect, we selected Dunn et al.'s data set (hereafter referred to as the reference data set) published in Nature.
This data set consists of newly sequenced ESTs as well as publicly available ESTs and protein sequences, and has been generated using all-vs.-all BLAST searches, protein translations using prot4EST, grouping of the sequences into orthologous groups using TribeMCL, as well as manual curation and tree reconciliation (see Dunn et al. for more details).
The reference data set as well as the single EST and protein sequences were either downloaded from publicly available sources or provided by Casey Dunn. The initial data set consisted of 150 genes and 77 taxa. In order to guarantee comparable results, we mapped each sequence from each gene to the KOG database using the best BLAST-hit. Only genes where all sequences could be mapped to the same KOG were further considered. This led to a considerable decrease in the number of genes. Since some taxa were not available for download, we ended up with 70 out of the 77 taxa Dunn et al. initially used.
For prediction of orthologous sequences, we denote a true positive as a correctly predicted ortholog, a false positive as an incorrectly predicted ortholog, and a false negative as an overlooked sequence. To be more precise, we use the following measures of performance:
Taxon is present in both alignments: If the percentage identity of both sequences is above a threshold (≥ 95%), the sequences are regarded as being equal and counted as a true positive. Otherwise, both sequences are aligned to a hidden Markov model (HMM) built from the alignment of the corresponding orthologous group (OG) using hmmsearch from the HMMER package. If the OrthoSelect sequence is closer to the HMM, it is counted as a true positive; otherwise it is counted as a false positive.
Taxon is present in the reference alignment, but not in the OrthoSelect alignment: It will be counted as a false negative.
Taxon is present in the OrthoSelect alignment, but not in the reference data set: The sequence is aligned to the HMM of that OG. If it shows significant similarity, it will be counted as a true positive, and otherwise as a false positive.
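The three cases above amount to a small decision procedure, sketched below. The callables `identity`, `hmm_score`, and `hmm_significant` are hypothetical placeholders for a pairwise identity computation and for parsing hmmsearch results; the sketch also assumes that "closer to the HMM" means a higher HMM score.

```python
def classify_taxon(ref_seq, os_seq, identity, hmm_score, hmm_significant,
                   identity_threshold=0.95):
    """Classify one taxon of one gene as 'TP', 'FP', 'FN', or None.

    ref_seq / os_seq: sequence in the reference / OrthoSelect alignment,
    or None if the taxon is absent from that alignment.
    """
    if ref_seq is not None and os_seq is not None:
        if identity(ref_seq, os_seq) >= identity_threshold:
            return "TP"  # sequences are regarded as equal
        # ASSUMPTION: a higher score means closer to the HMM of the OG.
        return "TP" if hmm_score(os_seq) > hmm_score(ref_seq) else "FP"
    if ref_seq is not None:
        return "FN"  # present in the reference only: overlooked sequence
    if os_seq is not None:
        return "TP" if hmm_significant(os_seq) else "FP"
    return None  # taxon absent from both alignments


if __name__ == "__main__":
    same = lambda a, b: 1.0 if a == b else 0.0
    print(classify_taxon("MKTA", "MKTA", same, len, lambda s: True))
    print(classify_taxon("MKTA", None, same, len, lambda s: True))
```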
Furthermore, we use the following formula to measure the specificity of our results:
Results from orthology assignment: OrthoSelect vs. reference data set. Measures reported: cases where OrthoSelect found better sequences; number of additional sequences found (total, good, bad); number of sequences missed; ratio of additional to missed sequences.
OrthoSelect vs. OrthoMCL
For each of the 60 previously compared gene clusters (see previous section), we checked whether OrthoMCL assigns sequences from the 6 taxa to the same OrthoMCL cluster or not. All sequences belonging to the same alignment were clustered together by OrthoMCL. This suggests that the clustering algorithms of both methods produce similar results.
Besides the additional functionality of OrthoSelect as compared to OrthoMCL and its usability for EST sequences, it is also much faster. It took OrthoMCL 24 hours to analyse the data set of 55,646 sequences. In contrast, our tool analysed the 1,000,000 sequences Dunn et al. used in about 6 hours.
OrthoSelect is a tool for finding ortholog groups in EST databases. It can be used either by installing it locally or via the OrthoSelect web server. It automatically searches assembled EST sequences against databases of ortholog groups (OG), assigns ESTs to these predefined OGs, translates the sequences into proteins, eliminates redundant sequences assigned to the same OG, creates multiple sequence alignments of identified ortholog sequences, and offers the possibility to further process these alignments in a last step. OrthoSelect performs better than the best-hit selection strategy and shows reliable results in re-annotating database member sequences of OrthoMCL-DB and KOG. Most importantly, we showed that our tool produces high-quality data sets comparable to Dunn et al.'s data set, but with more selected sequences and therefore less missing data in the alignments. Furthermore, the results our tool produces are in absolute agreement with the results of OrthoMCL, but OrthoSelect offers additional functionality, e.g. handling of EST sequences, computing sequence alignments, and refining them. Our method also showed a significant speedup in comparison to OrthoMCL. Correct orthology assignment is an important prerequisite for the construction of reliable data sets, and OrthoSelect is capable of producing them. This makes OrthoSelect a valuable tool for researchers who deal with large EST libraries and focus on constructing data sets for phylogenetic reconstruction. The tool can be downloaded at http://gobics.de/fabian/orthoselect.php, or the web server can be accessed without local installation at http://orthoselect.gobics.de/.
Availability and requirements
Project name: OrthoSelect
Project home page: http://www.gobics.de/fabian/orthoselect.php
Operating system: Mac OS X, Linux
Programming language: Perl
Other requirements: BioPerl, BLAST, ESTScan, GeneWise, ClustalW, MUSCLE or T-Coffee, HMMER, Gblocks, Aliscore or Noisy
License: GNU GPL
We thank Katharina Hoff for critically reading the manuscript, the anonymous reviewers for their valuable comments and Casey Dunn for kindly providing the EST sequences for our evaluation. This work was financially supported by the German Research Foundation (DFG, Project Wo896/6-1,2) within DFG Priority Programme SPP 1174 "Deep Metazoan Phylogeny", and by the German Federal Ministry of Research and Education (BMBF) project "MediGRID" (BMBF 01AK803G).
- Delsuc F, Brinkmann H, Philippe H: Phylogenomics and the reconstruction of the tree of life. Nature Reviews Genetics 2005, 6(5):361–375. 10.1038/nrg1603View ArticlePubMedGoogle Scholar
- Gee H: Evolution: ending incongruence. Nature 2003, 425: 798–804. 10.1038/425782aView ArticleGoogle Scholar
- Eisen JA: Phylogenomics: Improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 1998, 8(3):163–167.View ArticlePubMedGoogle Scholar
- Bourlat SJ, Juliusdottir T, Lowe CJ, Freeman R, Aronowicz J, Kirschner M, Lander ES, Thorndyke M, Nakano H, Kohn AB: Deuterostome phylogeny reveals monophyletic chordates and the new phylum Xenoturbellida. Nature 2006, 444(7115):85–88. 10.1038/nature05241View ArticlePubMedGoogle Scholar
- Delsuc F, Brinkmann H, Chourrout D, Philippe H: Tunicates and not cephalochordates are the closest living relatives of vertebrates. Nature 2006, 439(7079):965–968. 10.1038/nature04336View ArticlePubMedGoogle Scholar
- Dunn CW, Hejnol A, Matus DQ, Pang K, Browne WE, Smith SA, Seaver E, Rouse GW, Obst M, Edgecombe GD: Broad phylogenomic sampling improves resolution of the animal tree of life. Nature 2008, 452(7188):745–749. 10.1038/nature06614View ArticlePubMedGoogle Scholar
- Philippe H, Derelleand R, Lopez P, Pick K, Borchiellini C, Boury-Esnault N, Vacelet J, Renard E, Houliston E, Queinnec E, Silva CD, Wincker P, Guyader HL, Leys S, Jackson DJ, Schreiber F, Erpenbeck D, Morgenstern B, Wörheide G, Manuel M: Phylogenomics Revives Traditional Views on Deep Animal Relationships. Current Biology 2009, 19(8):706–712. 10.1016/j.cub.2009.02.052View ArticlePubMedGoogle Scholar
- Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool 1970, 19(2):99–113. 10.2307/2412448View ArticlePubMedGoogle Scholar
- Sonnhammer E, Koonin E: Orthology, paralogy and proposed classification for paralog subtypes. Trends Genetics 2002, 18: 619–620. 10.1016/S0168-9525(02)02793-2View ArticleGoogle Scholar
- Koonin EV: ORTHOLOGS, PARALOGS, AND EVOLUTIONARY GENOMICS. Annual Review of Genetics 2005, 39: 309–338. 10.1146/annurev.genet.39.073003.114725View ArticlePubMedGoogle Scholar
- Altschul S, Madden T, Schaffer A, Zhang J, Zhang Z, Miller W, Lipman D: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Zmasek C, Eddy S: RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics 2002, 3: 14. 10.1186/1471-2105-3-14PubMed CentralView ArticlePubMedGoogle Scholar
- Mushegian AR, Garey JR, Martin J, Liu LX: Large-scale taxonomic profiling of eukaryotic model organisms: a comparison of orthologous proteins encoded by the human, fly, nematode, and yeast genomes. Genome Res 1998, 8(6):590–598.PubMedGoogle Scholar
- Chen F, Mackey AJ, Stoeckert J, Christian J, Roos DS: OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucl Acids Res 2006, (34 Database):D363–368. 10.1093/nar/gkj123Google Scholar
- Tatusov R, Fedorova N, Jackson J, Jacobs A, Kiryutin B, Koonin E, Krylov D, Mazumder R, Mekhedov S, Nikolskaya A, Rao BS, Smirnov S, Sverdlov A, Vasudevan S, Wolf Y, Yin J, Natale D: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4: 41. 10.1186/1471-2105-4-41PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. Journal of Computational Biology 2000, 7(1–2):203–214. 10.1089/10665270050081478View ArticlePubMedGoogle Scholar
- Duret L, Mouchiroud D, Gouy M: HOVERGEN: a database of homologous vertebrate genes. Nucl Acids Res 1994, 22(12):2360–2365. 10.1093/nar/22.12.2360PubMed CentralView ArticlePubMedGoogle Scholar
- Ruan J, Li H, Chen Z, Coghlan A, Coin LJM, Guo Y, Heriche JK, Hu Y, Kristiansen K, Li R, Liu T, Moses A, Qin J, Vang S, Vilella AJ, Ureta-Vidal A, Bolund L, Wang J, Durbin R: TreeFam: 2008 Update. Nucl Acids Res 2008, 36(S1):D735–740.PubMed CentralPubMedGoogle Scholar
- Dolinski K, Botstein D: Orthology and functional conservation in eukaryotes. Annual Review of Genetics 2007, 41: 465–507. 10.1146/annurev.genet.40.110405.090439View ArticlePubMedGoogle Scholar
- Li L, Stoeckert J, Christian J, Roos DS: OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res 2003, 13(9):2178–2189. 10.1101/gr.1224503PubMed CentralView ArticlePubMedGoogle Scholar
- O'Brien KP, Remm M, Sonnhammer ELL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucl Acids Res 2005, (33 Database):D476–480.Google Scholar
- Zhou Y, Landweber LF: BLASTO: a tool for searching orthologous groups. Nucl Acids Res 2007, (35 Web Server):W678–682. 10.1093/nar/gkm278Google Scholar
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, Lehvaslaiho H, Matsalla C, Mungall CJ, Osborne BI, Pocock MR, Schattner P, Senger M, Stein LD, Stupka E, Wilkinson MD, Birney E: The Bioperl Toolkit: Perl Modules for the Life Sciences. Genome Res 2002, 12(10):1611–1618. 10.1101/gr.361602PubMed CentralView ArticlePubMedGoogle Scholar
- Gentzsch T: Sun grid engine: Towards creating a compute power grid. IEEE Computer Society Press; 2001.Google Scholar
- Lottaz C, Iseli C, Jongeneel CV, Bucher P: Modeling sequencing errors by combining Hidden Markov models. Bioinformatics 2003, 19(Suppl 2):ii103–112.View ArticlePubMedGoogle Scholar
- Birney E, Clamp M, Durbin R: GeneWise and Genomewise. Genome Res 2004, 14(5):988–995. 10.1101/gr.1865504PubMed CentralView ArticlePubMedGoogle Scholar
- Castresana J: Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 2000, 17(4):540–552.View ArticlePubMedGoogle Scholar
- Dress A, Flamm C, Fritzsch G, Grunewald S, Kruspe M, Prohaska S, Stadler P: Noisy: Identification of problematic columns in multiple sequence alignments. Algorithms for Molecular Biology 2008, 3: 7. 10.1186/1748-7188-3-7PubMed CentralView ArticlePubMedGoogle Scholar
- Misof B, Misof K: A Monte Carlo Approach Successfully Identifies Randomness in Multiple Sequence Alignments: A More Objective Means of Data Exclusion. Syst Biol 2009, 58: syp006. 10.1093/sysbio/syp006View ArticleGoogle Scholar
- Dessimoz C, Boeckmann B, Roth ACJ, Gonnet GH: Detecting non-orthology in the COGs database and other approaches grouping orthologs using genome-specific best hits. Nucl Acids Res 2006, 34(11):3309–3316. 10.1093/nar/gkl433PubMed CentralView ArticlePubMedGoogle Scholar
- Wasmuth J, Blaxter M: prot4EST: Translating Expressed Sequence Tags from neglected genomes. BMC Bioinformatics 2004, 5: 187. 10.1186/1471-2105-5-187PubMed CentralView ArticlePubMedGoogle Scholar
- Hatzigeorgiou AG, Fiziev P, Reczko M: DIANA-EST: a statistical analysis. Bioinformatics 2001, 17(10):913–919. 10.1093/bioinformatics/17.10.913View ArticlePubMedGoogle Scholar
- Fukunishi Y, Hayashizaki Y: Amino acid translation program for full-length cDNA sequences with frameshift errors. Physiol Genomics 2001, 5(2):81–7.PubMedGoogle Scholar
- Tatusova T, Madden T: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiology Letters 1999, 174(2):247–250. 10.1111/j.1574-6968.1999.tb13575.xView ArticlePubMedGoogle Scholar
- Wiens J: Missing data and the design of phylogenetic analyses. Journal of Biomedical Informatics 2006, 39: 34–42. 10.1016/j.jbi.2005.04.001View ArticlePubMedGoogle Scholar
- Yan C, Burleigh JG, Eulenstein O: Identifying optimal incomplete phylogenetic data sets from sequence databases. Molecular Phylogenetics and Evolution 2005, 35(3):528–535. 10.1016/j.ympev.2005.02.008
- Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs. Nucl Acids Res 2003, 31(13):3497–3500. 10.1093/nar/gkg500
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl Acids Res 2004, 32(5):1792–1797. 10.1093/nar/gkh340
- Edgar R: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5: 113. 10.1186/1471-2105-5-113
- Notredame C, Higgins DG, Heringa J: T-coffee: a novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology 2000, 302: 205–217. 10.1006/jmbi.2000.4042
- Poirot O, O'Toole E, Notredame C: Tcoffee@igs: a web server for computing, evaluating and combining multiple sequence alignments. Nucl Acids Res 2003, 31(13):3503–3506. 10.1093/nar/gkg522
- Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Research 2005, 15(2):330–340. 10.1101/gr.2821705
- Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucl Acids Res 2002, 30(14):3059–3066. 10.1093/nar/gkf436
- Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucl Acids Res 2005, 33(2):511–518. 10.1093/nar/gki198
- Schmollinger M, Nieselt K, Kaufmann M, Morgenstern B: DIALIGN P: fast pair-wise and multiple sequence alignment using parallel processors. BMC Bioinformatics 2004, 5: 128. 10.1186/1471-2105-5-128
- Morgenstern B, Prohaska SJ, Pöhler D, Stadler PF: Multiple sequence alignment with user-defined anchor points. Algorithms for Molecular Biology 2006, 1: 6. 10.1186/1748-7188-1-6
- Subramanian AR, Weyer-Menkhoff J, Kaufmann M, Morgenstern B: DIALIGN-T: An improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics 2005, 6: 66. 10.1186/1471-2105-6-66
- Subramanian A, Kaufmann M, Morgenstern B: DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms for Molecular Biology 2008, 3: 6. 10.1186/1748-7188-3-6
- Eddy SR: A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation. PLoS Comput Biol 2008, 4(5).
- Durbin R, Eddy S, Krogh A, Mitchison G: Biological sequence analysis. Cambridge, UK: Cambridge University Press; 2006.
- Department of Energy Joint Genome Institute [http://genome.cshlp.org/cgi/content/abstract/12/10/1611]
- Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucl Acids Res 2002, 30(7):1575–1584. 10.1093/nar/30.7.1575
- Schreiber F, Wörheide G, Morgenstern B: OrthoSelect: a web server for selecting orthologous gene alignments from EST sequences. Nucl Acids Res 2009, 37(Web Server issue):W185–188. 10.1093/nar/gkp434
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.