A phylogenomic gene cluster resource: the Phylogenetically Inferred Groups (PhIGs) database
© Dehal and Boore; licensee BioMed Central Ltd. 2006
Received: 20 April 2005
Accepted: 11 April 2006
Published: 11 April 2006
We present here the PhIGs database, a phylogenomic resource for sequenced genomes. Although many methods exist for clustering gene families, very few attempt to create truly orthologous clusters sharing descent from a single ancestral gene across a range of evolutionary depths. Although these non-phylogenetic gene family clusters have been used broadly for gene annotation, errors are known to be introduced by the artifactual association of slowly evolving paralogs and lack of annotation for those more rapidly evolving. A full phylogenetic framework is necessary for accurate inference of function and for many studies that address pattern and mechanism of the evolution of the genome. The automated generation of evolutionary gene clusters, creation of gene trees, determination of orthology and paralogy relationships, and the correlation of this information with gene annotations, expression information, and genomic context is an important resource to the scientific community.
The PhIGs database currently contains 23 completely sequenced genomes of fungi and metazoans, containing 409,653 genes that have been grouped into 42,645 gene clusters. Each gene cluster is built such that the gene sequence distances are consistent with the known organismal relationships, thereby maximizing the likelihood that the clusters represent truly orthologous genes. The PhIGs website contains tools that allow the study of genes within their phylogenetic framework through keyword searches on annotations, such as GO and InterPro assignments, and sequence similarity searches by BLAST and HMM. In addition to displaying the evolutionary relationships of the genes in each cluster, the website also allows users to view the relative physical positions of homologous genes in specified sets of genomes.
Accurate analyses of genes and genomes can only be done within their full phylogenetic context. The PhIGs database and corresponding website (http://phigs.org) address this problem for the scientific community. Our goal is to expand the content as more genomes are sequenced and to use this framework to incorporate more analyses.
The continually increasing number of whole genome sequencing projects has underscored the need for a high-throughput methodology to sort genes into orthologous sets to facilitate genome analysis. With a more robust understanding of the evolutionary history for each gene in the genome, not only can we more accurately transfer annotation across organisms, but we can also address larger biological questions regarding the evolution of genomes and species as well as the functional and biochemical processes encoded within each genome. Currently, most gene annotations rely on homologs identified by pair-wise sequence similarity to transfer the presumed function. This approach has been shown to have many drawbacks [1] which lead to annotation errors. Incorrect assignments are generally due to gene duplication events [2] giving rise to paralogs that can then acquire a new function or sub-functionalize [3, 4], accelerated rates of amino acid substitution [5], and domain shuffling [6]. Simple pair-wise comparisons cannot uncover these events.
Several approaches have been proposed to address these problems. However, most of these retain the problems associated with simply clustering genes based on sequence similarity and fail to incorporate the known evolutionary relationships of species [7–9]. Alternatively, those approaches that attempt to use some aspect of the evolutionary relationships of the species to inform the clustering process fail to then create a phylogenetic tree to uncover the relationships of the genes within the clusters [10–12].
The method we present here considers a priori the known evolutionary relationships among the considered organisms as a guide to constructing gene clusters, then analyzes each cluster for the evolutionary relationships among the contained genes in order to reconstruct the evolutionary history of each gene family using standard analytical methods of molecular evolution. This provides a tool for the scientific community for gaining a more complete understanding of such things as evolutionary patterns of gene duplication and loss, variation in rates of amino acid substitution, and alterations in gene structure. PhIGs is the first truly comprehensive whole genome analysis phylogenetic tool allowing for accurate assessment of gene family and genome structure evolution.
Construction and content
In this work, we develop a computational framework for the identification of sets of genes which have all descended from a single ancestral gene in the common ancestor of the lineages being examined. The collection of these gene sets is then followed by the construction of a phylogenetic tree for each set to determine the relationships among the gene cluster members.
Taxa currently included in PhIGs
Data sources listed: Broad release 2; Pompep version 19; Broad release 1.3; Broad release 1.1; Broad release 2.4; Broad release 3; Stanford release 19; Saccharomyces Genome Database
All against all BLASTP and global alignment
An all-against-all BLASTP search is performed on the entire protein dataset derived from each genome. Because BLAST reports only local alignments, a global alignment is created with ClustalW [16] for each protein pair returned by BLAST. A protein distance is then calculated using the JTT matrix and the protdist program from PHYLIP [17]; this protein distance is hereafter referred to as the distance between the genes themselves. These pair-wise protein distances and gap-free alignment lengths are then used as input for the clustering process. All alignments are stored in the PhIGs database.
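As a rough illustration of the two quantities passed to the clustering step, the following Python sketch computes the gap-free alignment length for one aligned protein pair. It does not reimplement protdist or the JTT model: an uncorrected proportion of differing sites (p-distance) stands in for the protein distance, and the function names and toy sequences are assumptions made for this example only.

```python
from typing import List, Tuple

GAP = "-"


def gap_free_columns(aligned: List[str]) -> List[Tuple[str, ...]]:
    """Return the alignment columns that contain no gap character."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [col for col in zip(*aligned) if GAP not in col]


def pairwise_summary(a: str, b: str) -> Tuple[int, float]:
    """Gap-free alignment length and uncorrected p-distance for one pair.

    The p-distance is only a stand-in here; the pipeline described in the
    text uses a JTT-based distance from PHYLIP's protdist instead.
    """
    cols = gap_free_columns([a, b])
    if not cols:
        return 0, float("nan")
    mismatches = sum(1 for x, y in cols if x != y)
    return len(cols), mismatches / len(cols)


# Toy globally aligned pair (hypothetical sequences).
length, dist = pairwise_summary("MKT-LLVA", "MRTALL-A")
print(length, round(dist, 3))  # 6 0.167
```

Two of the eight columns contain a gap and are dropped, leaving six comparable positions with one mismatch.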
Because this clustering approach is dependent on seeds, the order in which the seeds are processed will affect the clustering results. To ensure that each gene is placed in its optimal cluster, a greedy approach is used by sorting the list of seed alignments by the BLASTP score and processing the seeds by using the highest scoring seed first. In so doing, any subsequent cluster that attempts to incorporate a gene which has already been clustered can be eliminated. It is important to note that the BLASTP score is only used to sort the seeds and clustering is based on the protein distance and alignment length. The pseudocode describing this method is available online as additional file 1: Cluster Pseudocode.
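Without reproducing the pseudocode in additional file 1, the seed-ordering idea can be sketched in Python as below. This is one possible reading of the paragraph, not the authors' implementation: the `Seed` type, the `build_cluster` callback, and the toy data are hypothetical, and the sketch assumes that a later cluster is discarded whole when it overlaps an earlier one.

```python
from typing import Callable, Iterable, List, NamedTuple, Set


class Seed(NamedTuple):
    gene_a: str          # seed gene from one clade
    gene_b: str          # seed gene from the other clade
    blast_score: float   # used only to order the seeds


def greedy_cluster(
    seeds: Iterable[Seed],
    build_cluster: Callable[[Seed], Set[str]],
) -> List[Set[str]]:
    """Process seeds from highest to lowest BLASTP score.

    A candidate cluster is kept only if none of its genes was already
    claimed by a cluster built from a higher-scoring seed.
    """
    assigned: Set[str] = set()
    clusters: List[Set[str]] = []
    for seed in sorted(seeds, key=lambda s: s.blast_score, reverse=True):
        members = build_cluster(seed)
        if members & assigned:
            continue  # overlaps an earlier cluster, so it is eliminated
        clusters.append(members)
        assigned |= members
    return clusters


# Toy run: the "cluster" for a seed is just its two genes (hypothetical).
toy_seeds = [
    Seed("f1", "m1", 900.0),
    Seed("f2", "m1", 500.0),  # reuses m1, so it is dropped
    Seed("f3", "m3", 700.0),
]
result = greedy_cluster(toy_seeds, lambda s: {s.gene_a, s.gene_b})
print(result)
```

In the toy run the seed scoring 500 is processed last and discarded because `m1` already belongs to the cluster built from the seed scoring 900.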
By using an iterative approach, working through the entire evolutionary tree of the organisms beginning at the base, we ensure that the earliest-diverging gene families create the most comprehensive clusters, with later-established families properly assigned to the lineages in which they arose. Genes with a highly accelerated amino acid substitution rate, such that they are more distantly related to their sister genes than those sister genes are to a gene from the outgroup, are always excluded, since this situation cannot be differentiated from ancestral paralogy.
MSA and phylogenetic tree creation
A multiple sequence alignment (MSA) is created for each cluster using the ClustalW [16] program, which provides the input for phylogenetic tree reconstruction. Alignments are trimmed to remove columns that contain gap characters, and the cluster is eliminated if the resulting alignment contains fewer than 100 aligned amino acid positions. Phylogenetic trees are created using the quartet puzzling maximum likelihood method implemented in the TREE-PUZZLE [18] program using the JTT model of amino acid substitution and a gamma distribution of rates over eight rate categories, with 10,000 puzzling steps to assess reliability. Quartet puzzling is chosen here as a compromise between speed and reliability; however, the multiple sequence alignment is available for re-analysis with other tree reconstruction methods. The resulting gene tree is then reconciled with the known relationships of the organisms to determine, relative to lineage splitting, when each duplication or loss occurred, and so to determine an initial estimate of the orthology and paralogy relationships among the genes. The reconciliation process uses the most straightforward interpretation of the tree; no alterations are made to minimize the number of duplications or gene loss events. Genes are considered orthologs if they are separated only by speciation nodes consistent with the known phylogenetic tree and considered paralogs if there is a node representing a duplication event in their shared ancestry.
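The orthology rule in the last sentence can be illustrated with a small Python sketch. It assumes that reconciliation has already labeled each internal node of the gene tree as a speciation or a duplication, and it takes the most recent common ancestor of two genes as the node that separates them, which is a common convention but an assumption here; the `Node` class, function names, and toy tree are hypothetical and are not taken from the PhIGs code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    name: str
    event: Optional[str] = None  # "speciation" or "duplication"; None for a gene (leaf)
    children: List["Node"] = field(default_factory=list)


def leaf_names(node: Node) -> Set[str]:
    """Names of all genes (leaves) below a node."""
    if not node.children:
        return {node.name}
    names: Set[str] = set()
    for child in node.children:
        names |= leaf_names(child)
    return names


def last_common_ancestor(root: Node, a: str, b: str) -> Node:
    """Deepest node whose subtree contains both genes."""
    if a == b or not {a, b} <= leaf_names(root):
        raise ValueError("need two distinct genes that are present in the tree")
    node = root
    while True:
        for child in node.children:
            if {a, b} <= leaf_names(child):
                node = child  # both genes lie below this child; descend
                break
        else:
            return node  # no single child holds both genes


def relationship(root: Node, a: str, b: str) -> str:
    """Classify a gene pair by the event at their last common ancestor."""
    event = last_common_ancestor(root, a, b).event
    return "paralogs" if event == "duplication" else "orthologs"


# Toy gene tree: one duplication in the animal lineage before mouse and human split.
tree = Node("root", "speciation", [
    Node("fungus_gene"),
    Node("animal_dup", "duplication", [
        Node("copy1", "speciation", [Node("mouse_1"), Node("human_1")]),
        Node("copy2", "speciation", [Node("mouse_2"), Node("human_2")]),
    ]),
])

print(relationship(tree, "mouse_1", "human_1"))      # orthologs
print(relationship(tree, "mouse_1", "human_2"))      # paralogs
print(relationship(tree, "fungus_gene", "mouse_1"))  # orthologs
```

Recomputing leaf sets at every step keeps the sketch short; a production version would cache them or precompute node depths.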
The MSAs are also used to create Hidden Markov Models (HMMs) to later facilitate searching the clusters and to provide a resource for placing genes from genomes too sparsely sampled to be included in this comprehensive analysis, such as those from many EST sequencing projects.
An instructive example of this process is the Succinyl-Coenzyme A ligase beta subunit family. In this example, considering the fungi and metazoans first as clade A and clade B, respectively, the seed alignment used is the match between a gene from M. grisea and one from mouse (Sucla2). The protein distance and gap-free alignment length of this seed alignment pair are now taken to represent the maximum distance and minimum alignment length for recruiting new genes to the cluster. Any fungal or metazoan gene with a shorter distance and larger alignment length is added. In this case the fungal gene recruits a single gene from each of the remaining fungal genomes and the mouse gene recruits two genes in each case from most of the remaining metazoan genomes and three genes from each of human, chimp, and mouse. All of these genes now included in the cluster have matches to each other that are as good as or better than the initial seed alignment and do not have better matches to any other cluster. The phylogenetic tree created for this cluster ultimately shows that this gene family had a duplication at the base of the metazoan lineage, another duplication at the base of the primate lineage, and an independent duplication in the mouse lineage.
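The recruitment thresholds in this example can be sketched in Python as follows. This is a simplified reading under stated assumptions: each seed gene recruits only from its own clade, the comparisons are strict ("shorter" and "larger"), and the further checks mentioned above (all members matching each other, no better match to another cluster) are omitted. The gene names, the `Pair` type, and the numbers are hypothetical.

```python
from typing import Dict, FrozenSet, NamedTuple, Set


class Pair(NamedTuple):
    distance: float    # pair-wise protein distance
    gap_free_len: int  # gap-free alignment length


Alignments = Dict[FrozenSet[str], Pair]


def recruit(
    seed_a: str,
    seed_b: str,
    clade_of: Dict[str, str],
    alignments: Alignments,
) -> Set[str]:
    """Recruit genes that beat the seed pair on both distance and length."""
    seed = alignments[frozenset((seed_a, seed_b))]
    members = {seed_a, seed_b}
    for anchor in (seed_a, seed_b):
        for gene, clade in clade_of.items():
            if gene in members or clade != clade_of[anchor]:
                continue  # already in the cluster, or from the other clade
            pair = alignments.get(frozenset((anchor, gene)))
            if pair is None:
                continue  # no alignment was reported for this pair
            if pair.distance < seed.distance and pair.gap_free_len > seed.gap_free_len:
                members.add(gene)
    return members


# Toy data: clade A stands for fungi, clade B for metazoans.
clade_of = {
    "fungus_seed": "A", "fungus_2": "A", "fungus_3": "A",
    "animal_seed": "B", "animal_2": "B",
}
alignments: Alignments = {
    frozenset(("fungus_seed", "animal_seed")): Pair(0.9, 300),  # the seed pair
    frozenset(("fungus_seed", "fungus_2")): Pair(0.4, 350),     # recruited
    frozenset(("fungus_seed", "fungus_3")): Pair(1.2, 400),     # too distant
    frozenset(("animal_seed", "animal_2")): Pair(0.5, 250),     # alignment too short
}
cluster = recruit("fungus_seed", "animal_seed", clade_of, alignments)
print(sorted(cluster))  # ['animal_seed', 'fungus_2', 'fungus_seed']
```

Keying the alignment table on a `frozenset` makes the lookup independent of the order in which a pair is named.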
Comparisons of differences and similarities in annotations, such as definition line (defline) gene descriptions, InterPro families, and Gene Ontology assignments, can be made with respect to the tree. The user can make a determination of whether the gene annotations are consistent with the tree topology and whether annotations should be transferred to unannotated genes. Additionally, the genomic location and intron and exon structure of each gene is also provided, enabling analysis of such issues as whether the paralogous genes are physically clustered within a genome, indicating tandem or segmental duplication, or whether the gene family is widely dispersed. Alterations in gene intron and exon structure (and sizes) relative to other members of the cluster may be the result of biological forces acting on the genome or may simply be indicative of poor gene modeling.
The MSA for each cluster is also made available in the Cluster View. An alignment graphic, with the intron and exon structure superimposed, is shown on the page and a detailed alignment view is provided through a Jalview [19] Java applet. By examining the MSA, the user can determine whether poorly aligning or missing regions of a gene contain a protein domain, which may indicate the gain or loss of some function. Of course, when dealing with gene models of unknown quality, the genomic sequence should be examined for the possibility of annotation error before concluding that an exon or domain loss occurred.
All annotations related to each gene are viewable on its Gene View web page. This includes the annotations presented on the Cluster View page as well as a summary of domains found with the InterProScan [20] program (not available for all genomes) and a summary of all pair-wise alignments, including the calculated protein distance. This pair-wise alignment information can be useful to determine whether any genes may have been left out of the cluster for failing to meet the distance and alignment length cutoffs. In some cases, the excluded gene appears to be a gene model that is erroneously fragmented or merged with another, and so PhIGs provides a powerful tool for detecting these potential errors.
Searches of the database can be done by sequence similarity or by text matches to annotation fields. Text searches can be done on gene names, deflines, or InterPro annotations. Because these are associated with individual genes, the search function can either return a list of genes from a selected set of taxa that contain the search term or return a set of clusters which contain genes matching the search term. Because all clustering is done at the protein level, sequence similarity searches can only be performed against protein datasets. An individual sequence can be aligned against the proteins contained in the database using the BLAST program. Matches to the sequence can then be used as an entry into the cluster to which they belong. Alternatively, a similarity search can be performed directly against the Hidden Markov Models (HMMs) generated from the MSA of the clusters using the HMMER [21] program. Once a match has been made, the user can easily download either the raw FASTA file of the cluster or the MSA file to create a tree incorporating the new sequence.
Discussion and conclusion
The rapidly increasing number of sequenced genomes allows us to study genes and genomes within an evolutionary context. Not only does this assist in the transfer of annotations between genes, but it also allows us to uncover how the forces of evolution have shaped each genome. The PhIGs database project seeks to facilitate comparative genomic, phylogenomic, and functional genomic studies by providing a comprehensive resource for the determination of the evolutionary history for all genes from the fully sequenced genome projects. The two main properties that differentiate the PhIGs database from other clustering methods are the use of the known evolutionary relationships of the species to create gene clusters representing the descendants of a single ancestral gene and the creation of a complete phylogenetic gene tree of the cluster members using widely accepted analytic methods of molecular evolution. By combining this phylogenetic information with functional annotation, gene structure, genomic position and other datasets, the PhIGs database will prove to be a valuable resource for all fields of biology currently using genomic data.
The scientific applications of the PhIGs database are broad, extending beyond practical genome annotation and analysis. For instance, obvious applications are the use of orthologous gene clusters for: (1) organismal phylogenetic reconstruction; (2) the study of genome evolution by gene duplication; (3) gene structure evolution through the gain and loss of exons, introns, and domains; (4) the identification of gene family expansions and losses; and (5) genome evolution. The PhIGs analyses have already been used to compare specifically the whole genomes of a tunicate, fish, mouse, and human, demonstrating that the relative positions in the human genome of paralogs generated by duplications at the base of vertebrates provide clear evidence in favor of the contentious hypothesis of two rounds of whole genome duplication having occurred at the base of the vertebrates, and perhaps providing the raw material for vertebrate complexity [22]. Further applications can be developed to meet other analytical needs of the scientific community.
Future development includes improvements to the underlying clustering method, incorporation of more annotation data, creation of more analysis tools, and more rapid updates as new genomes become available. The functionality of the PhIGs database is currently accessible through the web interface and through downloadable data files of orthology relationships. Our goal is to convert this into an open source project to help maintain and expand this as a resource for the scientific community.
We thank S. Rash, W. Huang, and A. Porter for technical assistance. Funding for salary support was in part from the National Science Foundation awards MCB-0242131 and EF-0328516. This work was performed under the auspices of the U.S. Department of Energy, Office of Biological and Environmental Research, by the University of California, Lawrence Berkeley National Laboratory, under contract No. DE-AC03-76SF00098.
1. Eisen JA: Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 1998, 8: 163–167.
2. Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool 1970, 19: 99–113.
3. Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J: Preservation of duplicate genes by complementary, degenerative mutations. Genetics 1999, 151: 1531–1545.
4. Lynch M, Force A: The probability of duplicate gene preservation by subfunctionalization. Genetics 2000, 154: 459–473.
5. Gaucher EA, Gu X, Miyamoto MM, Benner SA: Predicting functional divergence in protein evolution by site-specific rate shifts. Trends Biochem Sci 2002, 27: 315–321. doi:10.1016/S0968-0004(02)02094-7
6. Doolittle RF: The multiplicity of domains in proteins. Annu Rev Biochem 1995, 64: 287–314. doi:10.1146/annurev.bi.64.070195.001443
7. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4: 41. doi:10.1186/1471-2105-4-41
8. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30: 1575–1584. doi:10.1093/nar/30.7.1575
9. Liu J, Rost B: Domains, motifs and clusters in the protein universe. Curr Opin Chem Biol 2003, 7: 5–11. doi:10.1016/S1367-5931(02)00003-0
10. O'Brien KP, Remm M, Sonnhammer EL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 2005, 33 Database Issue: D476–80.
11. Li L, Stoeckert CJJ, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 2003, 13: 2178–2189. doi:10.1101/gr.1224503
12. Lee Y, Sultana R, Pertea G, Cho J, Karamycheva S, Tsai J, Parvizi B, Cheung F, Antonescu V, White J, Holt I, Liang F, Quackenbush J: Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res 2002, 12: 493–502. doi:10.1101/gr.212002
13. Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H, Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder NJ, Oinn TM, Pagni M, Servant F, Sigrist CJ, Zdobnov EM: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res 2001, 29: 37–40. doi:10.1093/nar/29.1.37
14. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25: 25–29. doi:10.1038/75556
15. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. doi:10.1093/nar/25.17.3389
16. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680.
17. Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. Department of Genome Sciences, University of Washington, Seattle, Distributed by the author; 2004.
18. Schmidt HA, Strimmer K, Vingron M, von Haeseler A: TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 2002, 18: 502–504. doi:10.1093/bioinformatics/18.3.502
19. Clamp M, Cuff J, Searle SM, Barton GJ: The Jalview Java alignment editor. Bioinformatics 2004, 20: 426–427. doi:10.1093/bioinformatics/btg430
20. Zdobnov EM, Apweiler R: InterProScan--an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001, 17: 847–848. doi:10.1093/bioinformatics/17.9.847
21. Eddy S: http://hmmer.wustl.edu.
22. Dehal P, Boore JL: Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. PLoS Biology 2005, 3: e314. doi:10.1371/journal.pbio.0030314
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.