Stratification of co-evolving genomic groups using ranked phylogenetic profiles
© Freilich et al; licensee BioMed Central Ltd. 2009
Received: 18 May 2009
Accepted: 27 October 2009
Published: 27 October 2009
Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact to genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes. The evolutionary patterns of phylogenetic distribution of genes or proteins, represented by phylogenetic profiles, provide an alternative approach for the detection of taxonomic origins, but typically suffer from low accuracy. Herein, we present rank-BLAST, a novel approach for the assignment of protein sequences into genomic groups of the same taxonomic origin, based on the ranking order of phylogenetic profiles of target genes or proteins across the reference database.
The rank-BLAST approach is validated by computing the phylogenetic profiles of all sequences for five distinct microbial species of varying degrees of phylogenetic proximity, against a reference database of 243 fully sequenced genomes. The approach - a combination of sequence searches, statistical estimation and clustering - analyses the degree of sequence divergence between sets of protein sequences and allows the classification of protein sequences according to the species of origin with high accuracy, allowing taxonomic classification of 64% of the proteins studied. In most cases, a main cluster is detected, representing the corresponding species. Secondary, functionally distinct and species-specific clusters exhibit different patterns of phylogenetic distribution, thus flagging gene groups of interest. Detailed analyses of such cases are provided as examples.
Our results indicate that the rank-BLAST approach can capture the taxonomic origins of sequence collections in an accurate and efficient manner. The approach can be useful both for the analysis of genome evolution and the detection of species groups in metagenomics samples.
The notion of genes as components of a genome has been recently challenged with the advent of metagenomics, where the phylogenetic origin of entire gene pools is not necessarily known . Particular environments and biological symbioses establish crucial constraints on the nature of genes that can be associated with the species communities under investigation . Such constraints can be used to infer the number and types of species that contribute to these gene pools. In computational terms, the goal is to detect unambiguous and unique intra-genomic signals that can be used as signatures for the association of gene collections assigned to species groups.
Several approaches have been proposed to detect genomic signatures on the basis of nucleotide composition [3–5]. These approaches enable, to a varying degree of accuracy, the species classification of genes according to their compositional signatures, and their association with phylogenetic or environmental factors [6–8]. More recently, these methods have been applied to environmental sequencing samples, in order to detect the origins of these sequence fragments [9, 10].
Whereas the aforementioned methods focus on the detection of structural constraints of genes, another family of methods appears to be ideally suitable for this endeavor, using the detection of gene distribution patterns across taxa . These evolutionary signatures can be captured by phylogenetic profiles, the binary representation of the taxonomic distribution of genes (presence or absence) across genome . Phylogenetic profiles thus represent the evolutionary history of genes and genomes, constrained by functional properties in particular environments, and can be used to understand both the structural and evolutionary properties of a genome at the gene level. In the case of metagenomics, desirable properties of such an approach include the detection of genes with common co-inheritance patterns, histories, and thus origins. Given that the major factors shaping gene content are gene loss, gene genesis and horizontal gene transfer,  it would also be desirable to trace unique intra-genomic signals with respect to those factors.
In this work, we present the rank-BLAST classification procedure, estimate the strength and accuracy of the classification using a synthetic dataset, and discuss its biological relevance. First, we describe the choice of the optimal parameters for detecting an intra-genomic signal while allowing inter-genomic separation. Second, we describe the clusters formed and classify them with respect to taxonomically-expected inheritance signal or an anomalous inheritance signal. Third, we identify and characterize a group of proteins carrying an anomalous inheritance signal, in order to better understand the biological significance of the ranked phylogenetic profiles.
Clustering proteins into genomic-groups according to their rank-BLAST profile
The first stage in tracing the intra-genomic signal is to assess how similar are the rank-BLAST profiles of proteins with varying degrees of taxonomical relatedness. Two main considerations guided the choice of the species tested. First, species have to represent several phylogenetic proximities. Second, comparing a large number of protein-pair combinations is an exhaustive computational process and it is thus effective towards the study of species with a small number of proteins. The low sequence and function redundancy which is characteristically observed in small-size proteomes [16–20] is an additional advantage of these species in terms of improving the efficiency of the practice. The use of larger genomes, comprising many duplicated proteins, is expected to yield redundant profiles, unnecessarily increasing the complexity of the computational process.
Taxonomic classification and number of proteins of the fully sequenced species analyzed.
Taxonomic relatedness to M. genitalium
Evaluation of the divergence distance from M. genitalium‡
Number of proteins
Bacteria; Firmicutes; Mollicutes; Mycoplasmatales; Mycoplasmataceae; Mycoplasma
Bacteria; Firmicutes; Mollicutes; Mycoplasmatales; Mycoplasmataceae; Ureaplasma
Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae; Streptococcus
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Buchnera
Archaea; Nanoarchaeota; Nanoarchaeum
From distinct super kingdoms
One way of maximizing the information obtained from the rank-BLAST profile is, instead of using the direct correlation measured between two proteins, to compare the pairwise similarities in the correlation profiles of proteins. The rationale behind this approach is that in cases where a small number of common observations (e.g., proteins 3 and 7 in Figure 1) results in a high correlation which does not necessarily reflect the full set of observations of each protein in the pair, the differences in their correlation profile will mask out the inter-genomic signal. In cases where we miss an intra-genomic signal due to lack, or a small number, of common observations (e.g., proteins 3 and 4 in Figure 1), the similarity in their correlation profile will reveal the intra-genomic signal. To evaluate whether two correlation profiles exhibit a significant similarity, we calculated the hypergeometric probability  (see Methods). The correlation matrix is thus transformed into a probability matrix (Figure 1).
The distributions of the hypergeometric probability values found in intra- and inter-genomic pair combinations are compared in Figure 2B. For P values smaller than 1 × 10-75, the ratio between the fraction of intra-genomic pairs and inter-genomic pairs ranges between 9 (in Ureaplasma parvum) to 200 (in Streptococcus pyogenes) - much higher than the ratio observed between the intra- and inter-values of the pure correlation (Figure 2B).
The efficiency of the rank-BLAST clustering procedure for the re-construction of genomic groups.
No. proteins classified into a cluster‡
No. proteins classified into a corresponding genomic group†
% of the proteins in a species (out of the complete proteome) classified into a genomic group
No. of genomic groups dominated by the species†
The efficiency of the clustering procedure in re-constructing genomic groups varies between the different species (Table 2). The highest fraction of proteins classified into a large cluster is observed in Mycoplasma genitalium, where 92% of its protein members are classified into two genomic groups. The lowest fraction is observed in Nanoarchaeum equitans, where only 15% of the proteins are classified into 3 genomic clusters. It is evident that the underlying dataset of both target and database proteins plays a significant role in providing the necessary contrast for the detection of species-specific genomic groups. Using different parameters for ranking, cut-offs for the P value of the hypergeometric distribution, and inflation values for the clustering procedure has resulted in a similar pattern of clusters, as discussed above (not shown).
Overall, using the rank-BLAST procedure (as illustrated in Figure 1) we have succeeded in correctly clustering 64% of the proteins in the analysis into consistent, species-specific genomic groups (Table 2). In comparison, only 2180 out of the 3891 proteins (56%) had recognized a BLAST best-hit partner within the same genus indicating that the rank-BLAST procedure is more sensitive than a simple BLAST search (Methods). The protein NEQ108 (a tRNA methyltransferase) from the archaeal species Nanoarchaeum equitans provides an example for a protein with a distant best-hit partner (a tRNA methyltransferase from the bacterial species Aquifex aeolicus) which is classified by the rank-BLAST procedure into a corresponding genomic group (cluster 12 - a cluster clearly dominated by Nanoarchaeum equitans' proteins Figure 3): although its best-hit partner is a bacterial protein, the following hits of NEQ108 are proteins from the archaeal superkingdom and hence the rank-BLAST approach, taking into account all hits rather than only the best-hit partner, classifies this protein together with other Nanoarchaeum equitans' proteins. A few examples exist for proteins with a best-hit partner from a closely-related species which are classified into a non-corresponding genomic group. The Mycoplasma genitalium protein MG014 (Putative ABC transporter ATP-binding protein) for example recognizes a best-hit partner from Mycoplasma pneumoniae - another member of the Mycoplasma genus - yet it is classified into a genomic group dominated by Streptococcus pyogenes proteins (cluster 1). Notably, although its best-hit partner is another Mycoplasma protein, its following hits are to proteins from Clostridia species (including Caldicellulosiruptor saccharolyticus, Anaerocellum thermophilum, Clostridium tetani). However, overall, the genomic groups are highly specific with 97% of the proteins classified into corresponding genomic groups (Methods). The correspondence between the order of species in the rank-BLAST profile and their taxonomic proximity to the relevant species is further discussed in the following paragraphs.
The main genomic groups carry a signal which reflects the vertical evolution of the corresponding species
The primary goal of the rank-BLAST classification procedure is to capture a signal reflecting the inheritance pattern of the target genes. Therefore, genomic clusters were characterized in order to examine whether such a signal is carried by the clusters as a whole. Since we retrieved more than a single genomic cluster for each species, we first aimed to characterize the main genomic group of each species, i.e., a cluster where the co-inheritance pattern of its member proteins matches the known phylogeny of the organism.
This phylogenetic signal together with the classification of the conserved proteins into different genomic groups indicate that rank-BLAST profiles are indeed suitable for taxonomic stratification of genes, unlike other sequence-similarity based classification approaches such as phylogenetic profiles that are optimally designed to capture a functional signal [12, 24, 25]. A functional analysis of the main clusters further supports the strength of the approach in delineating a taxonomic, rather than functional, signal. The classification of proteins from the main clusters into functional categories is shown in Additional File 2, where many of the proteins are classified into highly-conserved categories such as transcription and ribosomal structure. The remarkable similarity observed between the patterns of functional distribution of proteins from Mycoplasma genitalium (cluster 2) and proteins from Ureaplasma parvum (cluster 6) - both members of the Mycoplasmataceae family - indicate that the rank-BLAST approach succeeds in correctly grouping proteins according to taxonomic origin and not according to functional similarity, even for orthologous proteins from closely related related species.
Secondary genomic groups carry a signal for non-vertical evolution
For each species, we obtained more than a single genomic group (i.e. clusters dominated by proteins of the species). Whereas the inheritance pattern of the proteins in the main clusters corresponds to the species phylogeny, in secondary clusters we expect to detect different patterns, possibly patterns that correspond to events of non-vertical evolution of genes. For the purpose of identifying such patterns, we searched for clusters where we can define a ranking order which is common to the protein members of the cluster and is different from the ranking-order in the main cluster.
Intra- and inter-cluster correlation in the four clusters dominated by S. pyogenes proteins.
Mean correlation between all protein-pair combinations†
Mean correlation with the cluster-profile††
Mean correlation with the reduced cluster-profile††
Individual examples for proteins from cluster 19 include a group of neighbouring protein-encoding genes that are involved in the uptake and utilization of maltosaccharide, a crucial process for the successful infection of S. pyogenes [26, 27]. Members of this group exhibit not only increased sequence similarity within Bacillales species, but also similar genomic organization. The proteins are listed in Figure 5B according to the genomic order of their encoding genes. Whereas in other Streptococcus species malX, malC and malD are adjacent proteins and members of the malXCD operon , in S. pyogenes amyA and amyB are located between malC and malX. Interestingly, similar organization of orthologous proteins is also observed in Bacillales species such as Geobacillus kaustophilus (rather than in Lactobacillales). Another member of cluster 19 (Spy1297), a putative transcription regulator of proteins involved in maltose/maltodextrin uptake , is located upstream of these genes. Overall, the six proteins from cluster 19 together with additional proteins from cluster 1 form a genomic sequence of 12 proteins involved in maltose/maltodextrin uptake, metabolism and regulation. The rank-BLAST profile of these proteins is shown in Figure 5B. The proteins from cluster 1 demonstrate the expected phylogenetic order where they exhibit higher similarity to other Lactobacillales proteins, and present the same genomic organization reported in Streptococcus species . The proteins from cluster 19 exhibit higher similarity to Bacillales, as well as a similar organization on the genome and correlated expression . This similarity to Bacillales species (higher than the similarity to other Streptococci) suggests a common event of horizontal gene transfer (HGT) of the maltose related genes in cluster 19 , and a recent modification in the maltodextrin metabolism in S. pyogenes. Considering the role of maltodextrin acquisition in infecting host tissues [26, 27], this modification is likely to have an adaptive advantage. The classification of all proteins from cluster 19 as well as other secondary clusters into functional categories is shown at Additional File 4.
The rank-BLAST profile carries a genomic signature beyond the best-hit
Distribution of BLAST best-hit and sequence-signature based methods for prediction of HGT events.
Phylogenetic distribution of BLAST best-hit
No. of proteins†
Streptococcus Species 1
Bacilli species 2
Firmicutes Species 3
Bacteria Species 4
Main genomic group
Secondary genomic groups
Not in genomic groups
Here, we describe a novel approach for the assignment of protein sequences into genomic groups, using a simulation on a synthetic dataset composed of an assortment of proteins lacking taxonomic classification, originally retrieved from five species belonging to different taxonomic classes, orders and domains. The necessity of developing strategies towards taxonomic classification is relatively recent and arose as a result of the growing number of metagenomic projects. The annotation of sequences from metagenomic samples can be viewed as a two-dimensional task, i.e., whereas the annotation of species-specific genomes primarily concerns functional assignments, the annotation of metagenomic samples needs to provide taxonomic assignments as well.
Early annotation approaches before the sequencing of complete genomes used the best-hit of a sequence similarity search to infer function. The accumulation of a growing number of fully sequenced genomes has led to a shift of focus from gene-centric approaches to genomic-centric approaches. One such prominent genomic-centric approach is the phylogenetic profiles method, which analyses the distribution pattern of genes across genomes in order to retrieve a functional signal. Analogously, whereas basic approaches for taxonomic prediction rely on the identity of the best-hit , the rank-BLAST approach described here aims to maximize the information from a BLAST search in order to delineate a strong and consistent, species-specific phylogenetic signal. Our analysis indicates that the use of the rank-BLAST approach indeed allows an accurate taxonomic classification in cases where BLAST search by itself is not equally sensitive: whereas the rank-BLAST procedure allows the taxonomic classification at the species level of 64% of the proteins analyzed in this study, only 56% of the proteins had recognized a BLAST best-hit partner within the same genus. In S. pyogenes for example, 27% of the proteins classified into a genomic group have a best-hit partner outside the Streptococcus genus and even outside the class, the phylum, and super-kingdom (Table 4), indicating that the rank-BLAST procedure can outperform the basic BLAST search in assigning species into genomic groups.
Due to the complexity of the data analyzed, the retrieval of a taxonomic or functional signal from the results of a sequence similarity search is not trivial. Since first suggested in 1999 , the use of phylogenetic profiles had been extensively modified and optimized [24, 25, 33]. Similarly, the retrieval of phylogenetic signal is not a straightforward exercise and it requires the consideration of a large number of parameters, where several approaches can possibly be accounted for in each step of the procedure . As part of the development of the rank-BLAST classification procedure, we have tested many alternatives in order to optimize the procedure, and examined their implications on the efficiency of clustering the data into genomic groups. Several approaches were tested for the following key steps: the construction of the vector describing the ranking-order of species; scoring the level of similarity between vectors; clustering the data; and the choice of query and database species. In order to emphasize the variety of options that need to be explored in each step, in the following paragraph we provide a brief description of some of the possibilities tested along these key steps. The construction of the vector best describing the ranking order of species was performed both according to bit-score  and E-value estimates for the BLAST search. Several possibilities for binning the data in different resolutions of the estimates were tested. Sequence similarity detection was based exclusively on BLAST. For the comparison of vectors we used several methods of correlation comparison, including Kendall's tau correlation, Spearman's correlation and Pearson correlation, and examined the intra-/inter-genomic stratification under the use of both the correlation estimates and P values. Other than correlation-based methods, we have also estimated the similarity of the ranking-order using decision-Trees and Bi-clustering approaches (not shown). For clustering of the data, we tested the results using k-means and a range of Machine Learning algorithms. The analysis was performed for five different species which represent varying taxonomic relationships between themselves, and with respect to the database species. It is likely that the differences observed in the clustering efficiency of the different genomes are related to the differences in phylogenetic resolution of the database. Although it is beyond the scope of the current work, studying the effect of the database (and reference genomes) is a future avenue of research.
As part of accuracy considerations, the sensitivity (allowing the classification of as many proteins as possible) and specificity (receiving as coherent genomic groups as possible; i.e., clearly relating to a single species) of the classification procedure were tested under different parameters. Since it is now widely accepted that a genome represents a dynamic collection of genes, whose composition may change across lineages, the clustering of all proteins of a species into a single genomic group was not a main consideration, assuming that different genes in the genome have followed different evolutionary paths. From the variety of approaches taken, the optimal procedure in terms of precision and recall yielded 20 genomic clusters, which almost uniquely represent a single species, and encompasses the large majority of the proteins in the analysis (Figure 3, as described in the text). The procedure enables the full separation between species within the same family (the two Mycoplasmataceae species Mycoplasma genitalium and Ureaplasma parvum). The different genomes were reconstructed with different efficiencies, where reconstruction of the original genomes is up to 92% (Table 2). As expected, for all species we have retrieved more than a single group, where the variety of groups detected for each genome enable us to carefully view the signal detected by the rank-BLAST procedure. For most of the test genomes, we have identified a main cluster whose phylogeny corresponds to species phylogeny, therefore providing a reassurance that the rank-BLAST procedure enables us to capture an intra-genomic signal retrieved from a group of genes with a common, conserved inheritance pattern. Genes classified to secondary groups do not necessarily follow this pattern and have possibly taken a different evolutionary path, which might be common to all members of the group (see above for an example).
In conclusion, our results indicate that the rank-BLAST classification procedure is indeed suitable for capturing an intra-genomic signal, which to a large extent allows the reconstruction of meaningful, species-specific genomic groups. In some cases the procedure also allows the stratification of the genes into groups reflecting their individual histories. The example of the mutual Bacillales origin of the genes involved in the maltodextrin metabolism demonstrates that common origins can be associated with common functions. We therefore propose that this novel approach provides the basis for revealing the association between groups of genes, contributing an interesting, fresh perspective of the complex genotype-phenotype relationship. The rank-BLAST procedure can therefore not only delineate a phylogenetic-signal, but can also detect genes that exhibit specific patterns of evolutionary change that diverge from the whole-genome profile.
In the study of metagenomic data, the rank-BLAST procedure has a role in both obtaining taxonomic assignments and characterizing the conservation level of different functional groups. Firstly, the procedure can be used for the reconstruction of species-specific genomic groups. The ability of the procedure to capture a taxonomic signal which is beyond the BLAST best-hit is of special importance for the appropriate phylogenetic-localization of genes which belong to distantly related species with respect to the collection of species with fully sequenced-genomes. Secondly, as different environments exhibit a typical functional fingerprint , it will be interesting to examine whether 'signal-less' proteins form a functional group characteristic of a given environment. We are currently applying the rank-BLAST procedure for the study of metagenomic data.
Estimating divergence distance
The protein sequences and annotations of the five fully-sequences species analyzed (Table 1) were retrieved from the DOE Joint Genome Institute website IMG http://img.jgi.doe.gov/. The evaluated divergence distances between species were derived according to the scores of the CLUSTALW pairwise alignments  between the 16S rRNA from each species and the 16S rRNA from M. genitalium. CLUSTALW (version 1.83) was run from the EBI website http://www.ebi.ac.uk/Tools/clustalw/ using the default parameters . Each alignment score was normalized from the score of the alignment of the 16S rRNA of M. genitalium against itself. For example, the self alignment of the 16S rRNA from M. genitalium yielded a score of 100, and the alignment of the 16S rRNA from M. genitalium against the 16S rRNA from U. parvum yielded a score of 83. The divergence distance of M. genitalium from itself was considered to be 0 (100-100) and the divergence distance between M. genitalium and U. parvum was 17 (100-83). In case of multiple copies of 16S RNA, the closest sequence in a species is considered. Divergence distance scores were also compared to scores obtained when using the greengenes website http://greengenes.lbl.gov/: Aligned 16S RNA sequences from the five species studied were retrieved from the website and then pairwise compared via the website using default parameters. Alignments are done using NAST, distance matrix is calculated using the DNAML option of DNADIST (PHYLIP package) . The same divergence order is obtained by both methods (Table 1).
Constructing a rank-BLAST profile
All proteins in the selected synthetic dataset were subject to a BLAST search (default BLAST parameters) against a collection of fully sequenced species. The collection of the complete genomes was retrieved from the COGENT database  representing the entire protein sequence complements from 243 species (including 197 Bacteria, 22 Archaea and 24 Eukaryota species). For each database protein (i.e., a protein coming from one of the five species in the analysis, as listed in Table 1) we used the scores of the BLAST search against the query proteins to construct a vector of the corresponding query species, each species in the vector is represented by the highest BLAST score of a homologue. Species within each vector were ranked according to the order of their corresponding scores. Species for which the E-value of their recorded score (best match) is of the same order (the same exponent) were assigned to an identical ranking score. This was done in order to provide a buffer which reduces the effect of subtle changes (for example small differences in the ranking order of closely related strains) on the overall analysis.
Calculating the Kendall tau rank correlation coefficient
The Kendall tau rank correlation coefficient is used to measure the degree of correspondence between two rankings . The correlation coefficient was calculated using the R statistical platform.
Computing the hypergeometric probability
In order to calculate the hypergeometric probability, the correlation matrix was first transformed into a binary matrix, where values of tau correlation coefficient > 0.7 were converted to 1 (positive events) and lower values were converted to 0. This cut-off was selected as a trade-off between achieving maximum intra-inter genomic separation and keeping as many "positive" correlations as possible (Figure 2).
Generating genomic clusters
The P values for the hypergeometric probability are stored in memory and then written into five binary files, each of which represents a different threshold, so that they need not be recomputed for subsequent analysis. Each file describes a network which consists of proteins (nodes) connected by P values above the set threshold (edges). The five thresholds for the P values are: 1 × 10-25; 1 × 10-50; 1 × 10-75; 1 × 10-100; 1 × 10-150. We use the MCL (M arkov Cl ustering) algorithm to cluster this network according to connectivity and local structure . The inflation value parameter of the MCL algorithm is used to control the granularity of these clusters. The data was tested under inflation values ranging from 1 to 3. The use of different parameters for the P value threshold and for the inflation values yielded stable clusters (not shown). Optimal results (reported here) were obtained using P value threshold of 1 × 10-75 with inflation value 1.8.
To examine the effect of genome size on the efficiency of the classification procedure we have studied the correlation between the number of proteins encoding genes in the species used in the analysis (Table 1) and the efficiency of the classification procedure (Table 2) - no such correlation is observed (Spearman's rho correlation = 0, P value = 1).
Comparing the sensitivity of the taxonomic classification using the rank-BLAST approach to the sensitivity using BLAST best-hit partner
The rank-BLAST classification approach allows the classification of 2590 proteins (out of 3891) into genomic groups. From these, 2505 proteins are classified into a corresponding genomic group; that is protein A is classified into a corresponding genomic group if most proteins in the cluster are coming from the same species as protein A. The distribution of proteins within the genomic groups is listed at Figure 3. Sensitivity is calculated as the fraction of correct classifications in the entire data set of 3891 proteins (64%). The overall specificity of the genomic groups is calculated as the fraction of correct classifications out of all proteins classified into any genomic group (2590 proteins, as listed in Figure 3).
The pre-processed BLAST results of the 3891 proteins versus the database of complete genomes (as described at the Constructing a rank-BLAST profile sub-section) were used for estimating the sensitivity of the taxonomic classifications according to BLAST best-hit partners. Recognition of a BLAST best-hit partner within the same genus (excluding hits versus proteins of the corresponding species) is considered as a correct taxonomic classification. Sensitivity is calculated as the fraction of correct classifications in the entire data set of 3891 proteins (56%).
All software used for the analysis will be provided upon request from the authors.
We thank Fotis Psomopoulos (Intelligent Systems & Software Engineering Lab, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki) for examining additional ways for clustering the data. We also thank Tamir Tuller and Eytan Ruppin for useful discussions. SF has been supported by an EMBO Postdoctoral Fellowship and is now a fellow of the Edmond J. Safra Program at Tel-Aviv University. The CGU at CERTH is supported by the Networks of Excellence BioSapiens (contract number LSHG-CT-2003-503265) and ENFIN (contract number LSHG-CT-2005-518254), both funded by the EC.
- Hugenholtz P, Tyson GW: Microbiology: metagenomics. Nature 2008, 455: 481–483. 10.1038/455481aView ArticlePubMedGoogle Scholar
- Cases I, de Lorenzo V, Ouzounis CA: Transcription regulation and environmental adaptation in bacteria. Trends Microbiol 2003, 11: 248–253. 10.1016/S0966-842X(03)00103-3View ArticlePubMedGoogle Scholar
- Karlin S: Global dinucleotide signatures and analysis of genomic heterogeneity. Curr Opin Microbiol 1998, 1: 598–610. 10.1016/S1369-5274(98)80095-7View ArticlePubMedGoogle Scholar
- Karlin S, Burge C: Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 1995, 11: 283–290. 10.1016/S0168-9525(00)89076-9View ArticlePubMedGoogle Scholar
- Karlin S, Ladunga I, Blaisdell BE: Heterogeneity of genomes: measures and values. Proc Natl Acad Sci USA 1994, 91: 12837–12841. 10.1073/pnas.91.26.12837PubMed CentralView ArticlePubMedGoogle Scholar
- Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, et al.: Informatics for unveiling hidden genome signatures. Genome Res 2003, 13: 693–702. 10.1101/gr.634603PubMed CentralView ArticlePubMedGoogle Scholar
- Karlin S, Campbell AM, Mrazek J: Comparative DNA analysis across diverse genomes. Annu Rev Genet 1998, 32: 185–225. 10.1146/annurev.genet.32.1.185View ArticlePubMedGoogle Scholar
- Kreil DP, Ouzounis CA: Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res 2001, 29: 1608–1615. 10.1093/nar/29.7.1608PubMed CentralView ArticlePubMedGoogle Scholar
- McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I: Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 2007, 4: 63–72. 10.1038/nmeth976View ArticlePubMedGoogle Scholar
- Teeling H, Meyerdierks A, Bauer M, Amann R, Glockner FO: Application of tetranucleotide frequencies for the assignment of genomic fragments. Environ Microbiol 2004, 6: 938–947. 10.1111/j.1462-2920.2004.00624.xView ArticlePubMedGoogle Scholar
- Ouzounis C, Kyrpides N: The emergence of major cellular processes in evolution. FEBS Lett 1996, 390: 119–123. 10.1016/0014-5793(96)00631-XView ArticlePubMedGoogle Scholar
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96: 4285–4288. 10.1073/pnas.96.8.4285PubMed CentralView ArticlePubMedGoogle Scholar
- Kunin V, Ouzounis CA: The balance of driving forces during genome evolution in prokaryotes. Genome Res 2003, 13: 1589–1594. 10.1101/gr.1092603PubMed CentralView ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, et al.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Ouzounis CA: Ancestral state reconstructions for genomes. Curr Opin Genet Dev 2005, 15: 595–600. 10.1016/j.gde.2005.09.011View ArticlePubMedGoogle Scholar
- Enright AJ, Kunin V, Ouzounis CA: Protein families and TRIBES in genome sequence space. Nucleic Acids Res 2003, 31: 4632–4638. 10.1093/nar/gkg495PubMed CentralView ArticlePubMedGoogle Scholar
- Freilich S, Spriggs RV, George RA, Al-Lazikani B, Swindells M, et al.: The complement of enzymatic sets in different species. J Mol Biol 2005, 349: 745–763. 10.1016/j.jmb.2005.04.027View ArticlePubMedGoogle Scholar
- Krakauer DC, Plotkin JB: Redundancy, antiredundancy, and the robustness of genomes. Proc Natl Acad Sci USA 2002, 99: 1405–1409. 10.1073/pnas.032668599PubMed CentralView ArticlePubMedGoogle Scholar
- Muller A, MacCallum RM, Sternberg MJ: Structural characterization of the human proteome. Genome Res 2002, 12: 1625–1641. 10.1101/gr.221202PubMed CentralView ArticlePubMedGoogle Scholar
- Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278: 631–637. 10.1126/science.278.5338.631View ArticlePubMedGoogle Scholar
- Fraser CM, Gocayne JD, White O, Adams MD, Clayton RA, et al.: The minimal gene complement of Mycoplasma genitalium. Science 1995, 270: 397–403. 10.1126/science.270.5235.397View ArticlePubMedGoogle Scholar
- Wu J, Kasif S, DeLisi C: Identification of functional links between genes using phylogenetic profiles. Bioinformatics 2003, 19: 1524–1530. 10.1093/bioinformatics/btg187View ArticlePubMedGoogle Scholar
- Case RJ, Boucher Y, Dahllof I, Holmstrom C, Doolittle WF, et al.: Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Appl Environ Microbiol 2007, 73: 278–288. 10.1128/AEM.01177-06PubMed CentralView ArticlePubMedGoogle Scholar
- Cokus S, Mizutani S, Pellegrini M: An improved method for identifying functionally linked proteins using phylogenetic profiles. BMC Bioinformatics 2007, 8(Suppl 4):S7. 10.1186/1471-2105-8-S4-S7PubMed CentralView ArticlePubMedGoogle Scholar
- Karimpour-Fard A, Hunter L, Gill RT: Investigation of factors affecting prediction of protein-protein interaction networks by phylogenetic profiling. BMC Genomics 2007, 8: 393. 10.1186/1471-2164-8-393PubMed CentralView ArticlePubMedGoogle Scholar
- Shelburne SA, Sumby P, Sitkiewicz I, Okorafor N, Granville C, et al.: Maltodextrin utilization plays a key role in the ability of group A Streptococcus to colonize the oropharynx. Infect Immun 2006, 74: 4605–4614. 10.1128/IAI.00477-06PubMed CentralView ArticlePubMedGoogle Scholar
- Graham MR, Virtaneva K, Porcella SF, Gardner DJ, Long RD, et al.: Analysis of the transcriptome of group A Streptococcus in mouse soft tissue infection. Am J Pathol 2006, 169: 927–942. 10.2353/ajpath.2006.060112PubMed CentralView ArticlePubMedGoogle Scholar
- Nieto C, Espinosa M, Puyet A: The maltose/maltodextrin regulon of Streptococcus pneumoniae. Differential promoter regulation by the transcriptional repressor MalR. J Biol Chem 1997, 272: 30860–30865. 10.1074/jbc.272.49.30860View ArticlePubMedGoogle Scholar
- Reidl J, Romisch K, Ehrmann M, Boos W: MalI, a novel protein involved in regulation of the maltose system of Escherichia coli, is highly homologous to the repressor proteins GalR, CytR, and LacI. J Bacteriol 1989, 171: 4888–4899.PubMed CentralPubMedGoogle Scholar
- Koonin EV, Makarova KS, Aravind L: Horizontal gene transfer in prokaryotes: quantification and classification. Annu Rev Microbiol 2001, 55: 709–742. 10.1146/annurev.micro.55.1.709View ArticlePubMedGoogle Scholar
- Ragan MA, Harlow TJ, Beiko RG: Do different surrogate methods detect lateral genetic transfer events of different relative ages? Trends Microbiol 2006, 14: 4–8. 10.1016/j.tim.2005.11.004View ArticlePubMedGoogle Scholar
- Koski LB, Golding GB: The closest BLAST hit is often not the nearest neighbor. J Mol Evol 2001, 52: 540–542.View ArticlePubMedGoogle Scholar
- Chang GS, Hong YJ, Ko KD, Bhardwaj G, Holmes EC, et al.: Phylogenetic profiles reveal evolutionary relationships within the "twilight zone" of sequence similarity. Proceedings of the National Academy of Sciences of the United States of America 2008, 105: 13474–13479. 10.1073/pnas.0803860105PubMed CentralView ArticlePubMedGoogle Scholar
- Snitkin ES, Gustafson AM, Mellor J, Wu J, DeLisi C: Comparative assessment of performance and genome dependence among phylogenetic profiling methods. BMC Bioinformatics 2006, 7: 420. 10.1186/1471-2105-7-420PubMed CentralView ArticlePubMedGoogle Scholar
- Podell S, Gaasterland T: DarkHorse: a method for genome-wide prediction of horizontal gene transfer. Genome Biol 2007, 8: R16. 10.1186/gb-2007-8-2-r16PubMed CentralView ArticlePubMedGoogle Scholar
- Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, et al.: Comparative metagenomics of microbial communities. Science 2005, 308: 554–557. 10.1126/science.1107851View ArticlePubMedGoogle Scholar
- Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, et al.: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 2003, 31: 3497–3500. 10.1093/nar/gkg500PubMed CentralView ArticlePubMedGoogle Scholar
- DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, et al.: Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol 2006, 72: 5069–5072. 10.1128/AEM.03006-05PubMed CentralView ArticlePubMedGoogle Scholar
- Goldovsky L, Janssen P, Ahren D, Audit B, Cases I, et al.: CoGenT++: an extensive and extensible data environment for computational genomics. Bioinformatics 2005, 21: 3806–3810. 10.1093/bioinformatics/bti579View ArticlePubMedGoogle Scholar
- The Kendall rank correlation coefficient[http://www.utdallas.edu/~herve/Abdi-KendallCorrelation2007-pretty.pdf]
- Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30: 1575–1584. 10.1093/nar/30.7.1575PubMed CentralView ArticlePubMedGoogle Scholar
- Goldovsky L, Cases I, Enright AJ, Ouzounis CA: BioLayout(Java): versatile network visualisation of structural and functional relationships. Appl Bioinformatics 2005, 4: 71–74. 10.2165/00822942-200504010-00009View ArticlePubMedGoogle Scholar
- Tsirigos A, Rigoutsos I: A new computational method for the detection of horizontal gene transfer events. Nucleic Acids Res 2005, 33: 922–933. 10.1093/nar/gki187PubMed CentralView ArticlePubMedGoogle Scholar
- Garcia-Vallve S, Guzman E, Montero MA, Romeu A: HGT-DB: a database of putative horizontally transferred genes in prokaryotic complete genomes. Nucleic Acids Res 2003, 31: 187–189. 10.1093/nar/gkg004PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.