A phylogenomic gene cluster resource: the Phylogenetically Inferred Groups (PhIGs) database
BMC Bioinformatics volume 7, Article number: 201 (2006)
We present here the PhIGs database, a phylogenomic resource for sequenced genomes. Although many methods exist for clustering gene families, very few attempt to create truly orthologous clusters sharing descent from a single ancestral gene across a range of evolutionary depths. Although these non-phylogenetic gene family clusters have been used broadly for gene annotation, errors are known to be introduced by the artifactual association of slowly evolving paralogs and lack of annotation for those more rapidly evolving. A full phylogenetic framework is necessary for accurate inference of function and for many studies that address pattern and mechanism of the evolution of the genome. The automated generation of evolutionary gene clusters, creation of gene trees, determination of orthology and paralogy relationships, and the correlation of this information with gene annotations, expression information, and genomic context is an important resource to the scientific community.
The PhIGs database currently contains 23 completely sequenced genomes of fungi and metazoans, containing 409,653 genes that have been grouped into 42,645 gene clusters. Each gene cluster is built such that the gene sequence distances are consistent with the known organismal relationships, thereby maximizing the likelihood that the clusters represent truly orthologous genes. The PhIGs website contains tools that allow the study of genes within their phylogenetic framework through keyword searches on annotations, such as GO and InterPro assignments, and sequence similarity searches by BLAST and HMM. In addition to displaying the evolutionary relationships of the genes in each cluster, the website also allows users to view the relative physical positions of homologous genes in specified sets of genomes.
Accurate analyses of genes and genomes can only be done within their full phylogenetic context. The PhIGs database and corresponding website http://phigs.org address this problem for the scientific community. Our goal is to expand the content as more genomes are sequenced and use this framework to incorporate more analyses.
The continually increasing number of whole genome sequencing projects has underscored the need for a high-throughput methodology to sort genes into orthologous sets to facilitate genome analysis. With a more robust understanding of the evolutionary history for each gene in the genome, not only can we more accurately transfer annotation across organisms, but we can also address larger biological questions regarding the evolution of genomes and species as well as the functional and biochemical processes encoded within each genome. Currently, most gene annotations rely on homologs identified by pair-wise sequence similarity to transfer the presumed function. This approach has been shown to have many drawbacks, which lead to annotation errors. Incorrect assignments are generally due to gene duplication events giving rise to paralogs that can then acquire a new function or sub-functionalize [3, 4], accelerated rates of amino acid substitution, and domain shuffling. Simple pair-wise comparisons cannot uncover these events.
Several approaches have been proposed to address these problems. However, most of these retain the problems associated with simply clustering genes based on sequence similarity and fail to incorporate the known evolutionary relationships of species [7–9]. Alternatively, those approaches that attempt to use some aspect of the evolutionary relationships of the species to inform the clustering process fail to then create a phylogenetic tree to uncover the relationships of the genes within the clusters [10–12].
The method we present here considers a priori the known evolutionary relationships among the considered organisms as a guide to constructing gene clusters, then analyzes each cluster for the evolutionary relationships among the contained genes in order to reconstruct the evolutionary history of each gene family using standard analytical methods of molecular evolution. This provides a tool for the scientific community for gaining a more complete understanding of such things as evolutionary patterns of gene duplication and loss, variation in rates of amino acid substitution, and alterations in gene structure. PhIGs is the first truly comprehensive whole genome analysis phylogenetic tool allowing for accurate assessment of gene family and genome structure evolution.
Construction and content
In this work, we develop a computational framework for the identification of sets of genes which have all descended from a single ancestral gene in the common ancestor of the lineages being examined. Once these sets have been collected, a phylogenetic tree is constructed for each set to determine the relationships of the gene cluster members.
A relational database is used to store the genome annotations for each taxon. All sequence data as well as individual gene annotations, including InterPro and Gene Ontology assignments, intron, exon and UTR structural information, and genomic positional information are retrieved whenever available. In addition, results of analyses such as sequence alignments, intermediate data, and trees are stored in the database. Table 1 lists the genomes included in the current data set, which will be updated as more genome sequences become available.
The overall process involves five stages (Figure 1) explained in more detail below: (1) an all-against-all BLASTP search of the complete proteomes; (2) global alignment and distance calculation of the gene pairs identified by BLAST; (3) iterative, hierarchical clustering; (4) multiple sequence alignment (MSA) creation and editing; and (5) gene tree reconstruction.
All against all BLASTP and global alignment
An all-against-all BLASTP search is performed on the entire protein dataset derived from each genome. Because BLAST reports only local alignments, a global alignment is created with ClustalW for each protein pair returned by BLAST. A protein distance is then calculated using the JTT matrix and the protdist program from PHYLIP; this protein distance is hereafter referred to as the distance between the genes themselves. These pair-wise protein distances and gap-free alignment lengths are then used as input for the clustering process. All alignments are stored in the PhIGs database.
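As a rough illustration of the two per-pair quantities passed to the clustering step, the sketch below computes the gap-free alignment length of a pairwise global alignment together with a simple distance. The function names are hypothetical, and the distance shown is the plain fraction of mismatched residues (a p-distance), used here only as a stand-in: it is not the JTT-based distance that protdist calculates.

```python
def gap_free_columns(aligned_a, aligned_b, gap="-"):
    """Return the residue pairs from columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return [(x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap]


def pair_summary(aligned_a, aligned_b):
    """Return (gap_free_length, p_distance) for one pairwise global alignment.

    The p-distance is an illustrative stand-in for the protdist/JTT distance;
    it is None when the two sequences share no gap-free column.
    """
    columns = gap_free_columns(aligned_a, aligned_b)
    length = len(columns)
    if length == 0:
        return 0, None
    mismatches = sum(1 for x, y in columns if x != y)
    return length, mismatches / length


# Toy alignment: four gap-free columns, one of which is a mismatch (A vs G).
length, distance = pair_summary("MK-LVA", "MKTL-G")
print(length, distance)
```

In a real pipeline the distance would come from protdist itself; only the gap-free length calculation is expected to carry over unchanged.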
Gene clustering is performed at each node of the tree, using the known evolutionary relationships of the organisms and all pair-wise protein distances as input. The objective of the clustering process is to create gene clusters at each node of the evolutionary tree such that the genes of the descending taxa are more closely related to each other than they are to the genes from the outgroup taxa. We employ a hierarchical approach, starting at the base of the best known evolutionary tree of the organisms, and proceeding up the tree iteratively. For each bifurcating node, taxa are temporarily grouped such that those on one descending branch are labeled as clade A and those on the other as clade B. The remaining taxa, having branched earlier, are considered to be the outgroup (Figure 2). Clusters of genes are then constructed such that the included genes meet the following criteria: (1) genes from organisms within clade A are more similar to each other than they are to genes from organisms within clade B; and (2) genes from clade A and clade B are more closely related to each other than they are to any gene in the outgroup. Effectively, this can be achieved by first finding the top scoring alignment for each gene within any member of its sister clade, then recruiting all additional genes that have greater similarity to either one of these genes using single linkage clustering, with the inclusion criteria set to the distance and alignment length of the seed alignment. As illustrated in Figure 2, the initial seed alignment of a pair of genes, one from each of clade A and clade B, defines an area, shown in blue, that represents the minimum match quality. As more genes are added to the cluster, this area grows until no more genes can be added.
Because this clustering approach is dependent on seeds, the order in which the seeds are processed will affect the clustering results. To ensure that each gene is placed in its optimal cluster, a greedy approach is used by sorting the list of seed alignments by the BLASTP score and processing the seeds by using the highest scoring seed first. In so doing, any subsequent cluster that attempts to incorporate a gene which has already been clustered can be eliminated. It is important to note that the BLASTP score is only used to sort the seeds and clustering is based on the protein distance and alignment length. The pseudocode describing this method is available online as additional file 1: Cluster Pseudocode.
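The seed-ordered, single-linkage recruitment described above can be sketched roughly as follows. This is an illustrative reading of the text, not the pseudocode from additional file 1: the function name, the data layout, and the gene identifiers are hypothetical, the outgroup test of criterion (2) is omitted, and a seed that touches an already clustered gene is simply skipped, which is one possible interpretation of how conflicting clusters are eliminated.

```python
from collections import defaultdict


def cluster_from_seeds(seeds, pair_stats, ingroup):
    """Greedy, seed-based single-linkage clustering (illustrative sketch).

    seeds      -- iterable of (blastp_score, gene_a, gene_b), one gene per clade
    pair_stats -- dict: frozenset({g1, g2}) -> (distance, gap_free_length)
    ingroup    -- set of gene ids belonging to clade A or clade B
    """
    # Index every scored pair by each of its two genes for fast lookup.
    neighbours = defaultdict(list)
    for pair, (dist, length) in pair_stats.items():
        if len(pair) != 2:
            continue  # ignore self-pairs
        g1, g2 = tuple(pair)
        neighbours[g1].append((g2, dist, length))
        neighbours[g2].append((g1, dist, length))

    assigned = set()
    clusters = []
    # The BLASTP score is used only to order the seeds, best first.
    for _score, gene_a, gene_b in sorted(seeds, key=lambda s: s[0], reverse=True):
        if gene_a in assigned or gene_b in assigned:
            continue  # assumption: a seed touching a clustered gene is dropped
        # The seed alignment sets the maximum distance and minimum length.
        max_dist, min_len = pair_stats[frozenset((gene_a, gene_b))]
        cluster = {gene_a, gene_b}
        frontier = [gene_a, gene_b]
        while frontier:  # single linkage: a match to any member is enough
            member = frontier.pop()
            for other, dist, length in neighbours[member]:
                if (other in ingroup and other not in cluster
                        and other not in assigned
                        and dist <= max_dist and length >= min_len):
                    cluster.add(other)
                    frontier.append(other)
        assigned |= cluster
        clusters.append(cluster)
    return clusters


# Hypothetical toy data: two fungal genes (f*), two metazoan genes (m*),
# plus a gene that is too distant (x) and one whose alignment is too short (y).
pair_stats = {
    frozenset(("f1", "m1")): (0.5, 300),
    frozenset(("f1", "f2")): (0.2, 310),
    frozenset(("m1", "m2")): (0.3, 305),
    frozenset(("m2", "x")): (0.9, 300),
    frozenset(("f2", "y")): (0.4, 150),
}
seeds = [(900, "f1", "m1"), (700, "f2", "m2")]
ingroup = {"f1", "f2", "m1", "m2", "x", "y"}
print(cluster_from_seeds(seeds, pair_stats, ingroup))
```

Here the second seed is skipped because both of its genes were already recruited by the higher scoring seed, which is the effect the greedy ordering is meant to have.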
By using an iterative approach, working through the entire evolutionary tree of the organisms beginning at the base, we ensure that the earliest diverging gene families create the most comprehensive clusters, with later established families properly assigned to the lineages in which they arose. Genes with a highly accelerated amino acid substitution rate, such that they are more distantly related to their sister genes than those sister genes are to a gene from the outgroup, are always excluded, since this cannot be differentiated from ancestral paralogy.
MSA and phylogenetic tree creation
A multiple sequence alignment (MSA) is created for each cluster using the ClustalW program, which provides the input for phylogenetic tree reconstruction. Alignments are trimmed to remove columns that contain gap characters and the cluster is eliminated if the resulting alignment contains fewer than 100 aligned amino acid positions. Phylogenetic trees are created using the quartet puzzling maximum likelihood method implemented in the TREE-PUZZLE program using the JTT model of amino acid substitution and a gamma distribution of rates over eight rate categories with 10,000 puzzling steps to assess reliability. Quartet puzzling is chosen here as a compromise between speed and reliability; however, the multiple sequence alignment is available for re-analysis with other tree reconstruction methods. The resulting gene tree is then reconciled with the known relationships of the organisms to determine, relative to lineage splitting, when each duplication or loss occurred, and so to determine an initial estimate of the orthology and paralogy relationships among the genes. The reconciliation process uses the most straightforward interpretation of the tree; no alterations are made to minimize the number of duplications or gene loss events. Genes are considered orthologs if they are separated only by speciation nodes consistent with the known phylogenetic tree and considered paralogs if there is a node representing a duplication event in their shared ancestry.
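The trimming and length filter described above can be sketched in a few lines. The function names and the dictionary representation of an alignment are assumptions made for illustration; only the rule itself (drop every column containing a gap, then discard the cluster if fewer than 100 positions remain) comes from the text.

```python
MIN_POSITIONS = 100  # threshold stated in the text


def trim_gap_columns(alignment, gap="-"):
    """Remove every column that contains a gap in at least one sequence.

    alignment -- dict mapping sequence name -> aligned sequence (equal lengths)
    """
    sequences = list(alignment.values())
    if not sequences:
        return {}
    width = len(sequences[0])
    if any(len(seq) != width for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    keep = [i for i in range(width) if all(seq[i] != gap for seq in sequences)]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


def passes_length_filter(trimmed, min_positions=MIN_POSITIONS):
    """True if the trimmed alignment retains at least min_positions columns."""
    if not trimmed:
        return False
    return len(next(iter(trimmed.values()))) >= min_positions


# Toy alignment: columns 2 and 4 contain gaps and are removed.
toy = {"geneA": "MK-LV", "geneB": "MKTL-", "geneC": "MKTLV"}
trimmed = trim_gap_columns(toy)
print(trimmed, passes_length_filter(trimmed))
```

The toy alignment is far shorter than 100 positions, so a cluster like this one would be eliminated before tree reconstruction.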
The MSAs are also used to create Hidden Markov Models (HMMs) to later facilitate searching the clusters and to provide a resource for placing genes from genomes too sparsely sampled to be included in this comprehensive analysis, such as those from many EST sequencing projects.
An instructive example of this process is for the Succinyl-Coenzyme A ligase beta subunit family. In this example, considering the fungi and metazoans first as clade A and clade B, respectively, the seed alignment used is the match between a gene from M. grisea and one from mouse (Sucla2). The protein distance measure and gap-free alignment length of this seed alignment pair are now taken to represent the maximum distance and minimum alignment length for recruiting new genes to the cluster. Any fungal or metazoan gene with a shorter distance and larger alignment length is added. In this case the fungal gene recruits a single gene from each of the remaining fungal genomes and the mouse gene recruits two genes in each case from most of the remaining metazoan genomes and three genes from each of human, chimp, and mouse. All of these genes now included in the cluster have matches to each other that are as good as or better than the initial seed alignment and do not have better matches to any other cluster. The phylogenetic tree created for this cluster ultimately shows that this gene family had a duplication at the base of the metazoan lineage, another duplication at the base of the primate lineage, and an independent duplication in the mouse lineage.
The PhIGs database allows users to view genes within the evolutionary context of other sequenced genomes. Because each cluster is constructed to represent the extant descendants of a single ancestral gene, the gene trees provided allow the user to see where gene duplication events have occurred and the rates of amino acid sequence change along the individual branches of the tree (Figure 3). By reconciling the gene tree with the species tree, orthology and paralogy relationships can be determined.
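One common way to perform such a reconciliation is LCA mapping: each gene tree node is mapped to the smallest species tree clade that contains all of its species, and a node is called a duplication when it maps to the same clade as one of its children. The sketch below uses that standard technique under stated assumptions; it is not necessarily the procedure implemented in PhIGs. Trees are represented as nested tuples, gene pairs are classified by the event at their most recent common node in the gene tree, and all gene names are hypothetical.

```python
def collect_clades(tree, out):
    """Append the leaf set of every node in a nested-tuple tree to out."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(collect_clades(child, out) for child in tree))
    out.append(leaves)
    return leaves


def lca_clade(species, clades):
    """Smallest species-tree clade containing all the given species."""
    return min((c for c in clades if species <= c), key=len)


def label_nodes(gene_tree, gene_species, clades, events):
    """Return (genes, mapped_clade); record an event for each internal node."""
    if isinstance(gene_tree, str):
        genes = frozenset([gene_tree])
        return genes, lca_clade(frozenset([gene_species[gene_tree]]), clades)
    children = [label_nodes(c, gene_species, clades, events) for c in gene_tree]
    genes = frozenset().union(*(g for g, _ in children))
    mapped = lca_clade(frozenset(gene_species[g] for g in genes), clades)
    # Duplication if this node maps to the same clade as one of its children.
    is_dup = any(child_map == mapped for _, child_map in children)
    events.append((genes, "duplication" if is_dup else "speciation"))
    return genes, mapped


def reconcile(gene_tree, gene_species, species_tree):
    """Return a list of (gene_set, event) for all internal gene tree nodes."""
    clades, events = [], []
    collect_clades(species_tree, clades)
    label_nodes(gene_tree, gene_species, clades, events)
    return events


def relationship(gene_x, gene_y, events):
    """Classify a gene pair by the event at their most recent common node."""
    _, event = min(((g, e) for g, e in events if gene_x in g and gene_y in g),
                   key=lambda item: len(item[0]))
    return "orthologs" if event == "speciation" else "paralogs"


# Hypothetical family with one duplication before the human-mouse split.
species_tree = (("human", "mouse"), "fly")
gene_tree = ((("hA", "mA"), ("hB", "mB")), "f1")
gene_species = {"hA": "human", "mA": "mouse", "hB": "human",
                "mB": "mouse", "f1": "fly"}
events = reconcile(gene_tree, gene_species, species_tree)
print(relationship("hA", "mA", events))  # orthologs
print(relationship("hA", "mB", events))  # paralogs
print(relationship("hA", "f1", events))  # orthologs
```

In this toy family the duplication predates the human-mouse split, so hA and mB are paralogs even though they come from different species, while both copies remain orthologous to the single fly gene.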
Comparisons of differences and similarities in annotations, such as definition line (defline) gene descriptions, InterPro families, and Gene Ontology assignments, can be made with respect to the tree. The user can make a determination of whether the gene annotations are consistent with the tree topology and whether annotations should be transferred to unannotated genes. Additionally, the genomic location and the intron and exon structure of each gene are provided, enabling analysis of such issues as whether the paralogous genes are physically clustered within a genome, indicating tandem or segmental duplication, or whether the gene family is widely dispersed. Alterations in gene intron and exon structure (and sizes) relative to other members of the cluster may be the result of biological forces acting on the genome or may simply be indicative of poor gene modeling.
The MSA for each cluster is also made available in the Cluster View. An alignment graphic, with the intron and exon structure superimposed, is shown on the page and a detailed alignment view is provided through a Jalview Java applet. By examining the MSA, the user can determine whether poorly aligning or missing regions of a gene contain a protein domain, which may indicate the gain or loss of some function. Of course, when dealing with gene models of unknown quality, the genomic sequence should be examined for the possibility of annotation error before concluding that an exon or domain loss occurred.
All annotations related to each gene are viewable on its Gene View web page. This includes the annotations presented on the Cluster View page as well as a summary of domains found with the InterProScan program (not available for all genomes) and a summary of all pair-wise alignments, including the calculated protein distance. This pair-wise alignment information can be useful to determine whether any genes may have been left out of the cluster for failing to meet the distance and alignment length cutoffs. In some cases, this appears to be a gene model that is erroneously fragmented or merged with another, and so PhIGs provides a powerful tool for detecting these potential errors.
Searches of the database can be done by sequence similarity or by text matches to annotation fields. Text searches can be done on gene names, deflines, or InterPro annotations. Because these are associated with individual genes, the search function can be used to either return a list of genes from a selected set of taxa that contain the search term or it can return a set of clusters which contain genes matching the search term. Because all clustering is done at the protein level, sequence similarity searches can only be performed against protein datasets. An individual sequence can be aligned against the proteins contained in the database using the BLAST program. Matches to the sequence can then be used as an entry into the cluster to which they belong. Alternatively, a similarity search can be performed directly against the Hidden Markov Models (HMMs) generated from the MSA of the clusters using the HMMER program. Once a match has been made, the user can easily download either the raw fasta file of the cluster or the MSA file to create a tree incorporating the new sequence.
These analyses produce sets of true, one-to-one orthologs, and this presentation incorporates a view of their relative physical positions across multiple genomes. As opposed to other methods that rely on sequence similarity to create comparative genome alignments, this avoids confusion that arises from paralogy. Synteny maps are generated by selecting a genomic span from a single reference genome and one or more query genomes to align (Figure 4). All identified orthologous genes between the selected genome and each of the query genomes are shown.
Discussion and conclusion
The rapidly increasing number of sequenced genomes allows us to study genes and genomes within an evolutionary context. Not only does this assist in the transfer of annotations between genes, but it also allows us to uncover how the forces of evolution have shaped each genome. The PhIGs database project seeks to facilitate comparative genomic, phylogenomic, and functional genomic studies by providing a comprehensive resource for the determination of the evolutionary history for all genes from the fully sequenced genome projects. The two main properties that differentiate the PhIGs database from other clustering methods are the use of the known evolutionary relationships of the species to create gene clusters representing the descendants of a single ancestral gene and the creation of a complete phylogenetic gene tree of the cluster members using widely accepted analytic methods of molecular evolution. By combining this phylogenetic information with functional annotation, gene structure, genomic position and other datasets, the PhIGs database will prove to be a valuable resource for all fields of biology currently using genomic data.
The scientific applications of the PhIGs database are broad, extending beyond practical genome annotation and analysis. For instance, obvious applications are the use of orthologous gene clusters for: (1) organismal phylogenetic reconstruction; (2) the study of genome evolution by gene duplication; (3) gene structure evolution through the gain and loss of exons, introns, and domains; (4) the identification of gene family expansions and losses; and (5) genome evolution. The PhIGs analyses have already been used to compare specifically the whole genomes of a tunicate, fish, mouse, and human, demonstrating that the relative positions in the human genome of paralogs generated by duplications at the base of vertebrates provide clear evidence in favor of the contentious hypothesis of two rounds of whole genome duplication having occurred at the base of the vertebrates, and perhaps providing the raw material for vertebrate complexity. Further applications can be developed to meet other analytical needs of the scientific community.
Future development includes improvements to the underlying clustering method, incorporation of more annotation data, creation of more analysis tools and more rapid updates of newly available genomes. The functionality of the PhIGs database is currently accessible through the web interface and through downloadable data files of orthology relationships. Our goal is to convert this into an open source project to help maintain and expand this as a resource for the scientific community.
Eisen JA: Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 1998, 8: 163–167.
Fitch WM: Distinguishing homologous from analogous proteins. Syst Zool 1970, 19: 99–113.
Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J: Preservation of duplicate genes by complementary, degenerative mutations. Genetics 1999, 151: 1531–1545.
Lynch M, Force A: The probability of duplicate gene preservation by subfunctionalization. Genetics 2000, 154: 459–473.
Gaucher EA, Gu X, Miyamoto MM, Benner SA: Predicting functional divergence in protein evolution by site-specific rate shifts. Trends Biochem Sci 2002, 27: 315–321. 10.1016/S0968-0004(02)02094-7
Doolittle RF: The multiplicity of domains in proteins. Annu Rev Biochem 1995, 64: 287–314. 10.1146/annurev.bi.64.070195.001443
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4: 41. 10.1186/1471-2105-4-41
Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30: 1575–1584. 10.1093/nar/30.7.1575
Liu J, Rost B: Domains, motifs and clusters in the protein universe. Curr Opin Chem Biol 2003, 7: 5–11. 10.1016/S1367-5931(02)00003-0
O'Brien KP, Remm M, Sonnhammer EL: Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 2005, 33 Database Issue: D476–80.
Li L, Stoeckert CJJ, Roos DS: OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 2003, 13: 2178–2189. 10.1101/gr.1224503
Lee Y, Sultana R, Pertea G, Cho J, Karamycheva S, Tsai J, Parvizi B, Cheung F, Antonescu V, White J, Holt I, Liang F, Quackenbush J: Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res 2002, 12: 493–502. 10.1101/gr.212002
Apweiler R, Attwood TK, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning MD, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H, Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder NJ, Oinn TM, Pagni M, Servant F, Sigrist CJ, Zdobnov EM: The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res 2001, 29: 37–40. 10.1093/nar/29.1.37
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25: 25–29. 10.1038/75556
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680.
Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. Department of Genome Sciences, University of Washington, Seattle, Distributed by the author; 2004.
Schmidt HA, Strimmer K, Vingron M, von Haeseler A: TREE-PUZZLE: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 2002, 18: 502–504. 10.1093/bioinformatics/18.3.502
Clamp M, Cuff J, Searle SM, Barton GJ: The Jalview Java alignment editor. Bioinformatics 2004, 20: 426–427. 10.1093/bioinformatics/btg430
Zdobnov EM, Apweiler R: InterProScan--an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001, 17: 847–848. 10.1093/bioinformatics/17.9.847
Eddy S: http://hmmer.wustl.edu.
Dehal P, Boore JL: Two Rounds of Whole Genome Duplication in the Ancestral Vertebrate. PLoS Biology 2005, 3: e314. 10.1371/journal.pbio.0030314
We thank S. Rash, W. Huang, and A. Porter for technical assistance. Funding for salary support was in part from the National Science Foundation awards MCB-0242131 and EF-0328516. This work was performed under the auspices of the U.S. Department of Energy, Office of Biological and Environmental Research, by the University of California, Lawrence Berkeley National Laboratory, under contract No. DE-AC03-76SF00098.
Both authors conceived and designed the project. PSD did all of the programming and implementation. Both authors wrote and approved the final manuscript.