Identification of discriminative characteristics for clusters from biologic data with InforBIO software
© Tanaka et al; licensee BioMed Central Ltd. 2007
Received: 22 February 2007
Accepted: 02 August 2007
Published: 02 August 2007
There are a number of different methods for generation of trees and algorithms for phylogenetic analysis in the study of bacterial taxonomy. Genotypic information, such as SSU rRNA gene sequences, now plays a more prominent role in microbial systematics than does phenotypic information. However, the integration of genotypic and phenotypic information for polyphasic studies is necessary for the classification and identification of microbes. Thus, we devised an algorithm that objectively identifies discriminative characteristics for focused clusters on generated trees from a dataset composed of coded data, such as phenotypic information. Moreover, this algorithm has been integrated into the polyphasic analysis software, InforBIO.
We developed a differential-character-finding algorithm based on information measures and used this algorithm to identify the characteristic that best discriminates operational taxonomic unit clusters. For all characteristics in a dataset, the algorithm estimates commonality in focused clusters and diversity among clusters by scoring based on Shannon's and relative entropies. All the characteristics selected for scoring are equally weighted. Thresholds for the scores are defined to identify discriminative characteristics for clusters efficiently from a database. The unique feature of the algorithm, which is implemented in the InforBIO software, is that it can identify the phenotypic characteristics that discriminate and are associated with the clusters of a phylogenetic tree. We successfully applied this algorithm to the study of phylogenetic clusters of Pseudomonas species.
The algorithm in the InforBIO software is a novel and useful approach for microbial polyphasic studies. The algorithm can also be applied to diverse cluster analyses. The InforBIO software is available from the download site http://wdcm.nig.ac.jp/inforbio/. This software is free for personal but not commercial use.
It is common practice for biologists to apply cluster analysis to both genotypic and phenotypic data of operational taxonomic units (OTUs). Genotypic data, primarily small subunit ribosomal RNA (rRNA) gene sequences and DNA-DNA similarities based on the hybridization technique, play a more prominent role than do phenotypic data in current microbial systematics [1, 2]. Phylogenetic analysis based on single gene sequences is useful for understanding relations between species but is rarely used to define species. DNA-DNA similarities based on the hybridization technique are recommended for species definition, and the cut-off is defined . However, it is actually and technically difficult to conclude when a similarity is close to the defined cut-off. In contrast, phenotypic data are still important for efficient identification and recognition of biologic features and can be obtained with kits such as API (bioMerieux, Inc., Lyon, France) and BiOLOG (Biolog, Inc., Hayward, CA, USA). In 2002, the ad hoc committee of the International Committee for the Systematics of Prokaryotes  made the following recommendations with respect to the value of phenotypes in a species description: "(1) Species should be identifiable by readily available methods (phenotypic and genotypic). Efforts should be made to establish standardized methods of reporting phenotypic and genomic data; (2) Minimal characteristics should be provided...; and (3) Phenotype, including chemotaxonomic markers, will remain important diagnostic properties in a species description." These recommendations can be fully addressed only through polyphasic studies based on the integration of genotypic and phenotypic data [4, 5]. An information system for polyphasic studies is needed for the classification and identification of microbes.
A number of algorithms and programs for clustering and generating trees have been developed for numerical and phylogenetic analyses in bacterial taxonomy [6, 7]. It is often difficult to identify phenotypic characteristics that can discriminate clusters defined on the bases of gene sequences. At present, if it is difficult to identify discriminative characteristics for a species from available phenotypic data, the species is studied on the basis of the genomovars concept as was reported for Pseudomonas stutzeri strains , although cryptic discriminative characteristics may be found. It can be difficult to identify discriminative characteristics in the case of a large and diverse phenotypic dataset.
We developed the InforBIO software for the study of microbial diversity , and the software is freely available from the download site . The user can seamlessly repeat a workflow from data management, data analysis, and evaluation of analytical results. The software includes functions for data handling, including design of databases, storage and retrieval of data, numerical analysis, phylogenetic analysis, and discriminative and probabilistic identification. All the features of the InforBIO software are applicable to any biologic object from molecules to organisms, if the data are coded in the same way as microbial data.
In the present study, we devised a differential-character-finding algorithm that objectively identifies discriminative characteristics for focused clusters from a dataset composed of coded data, and this algorithm was integrated into the InforBIO software.
The differential-character-finding algorithm consists of two types of mathematical measures based on Shannon's entropy  and relative entropy . We named these measures "common score" and "differential score", respectively. With our differential-character-finding algorithm, the following events are executed in the InforBIO software: 1) construction of a database of biologic data, including coded and sequence data; 2) construction of phylogenetic trees (or numeric dendrograms) from datasets in the database; 3) selection of target clusters for differential-character-finding analysis on the phylogenetic tree (or numeric dendrograms); 4) calculation of common and differential scores for each characteristic on target clusters; and 5) identification of the most discriminative characteristic(s) for each target cluster with reference to thresholds for the two scores depending on the range of scores for the value.
A biologic database and analysis programs are integrated into the InforBIO software, and the system architecture was described previously . An ID number is automatically assigned to an OTU in a biologic database. Each phenotypic characteristic (e.g., assimilation of glucose) of an OTU is described in a biologic database by a testable variable associated with the characteristic. In general, most tests take the character data type as the test's result, such as "+++", "+", and "-". Therefore, we consider the complete event system for each item variable to calculate the common and differential scores. We can then calculate the occurrence probabilities of every value for the item variable on the complete event systems. In the InforBIO software, up to 12 values (e.g., +) can be assigned for a test item (e.g., glucose assimilation). The InforBIO software can also manage multiple gene and protein sequence data of OTUs on a database.
where p(m) denotes the frequency (0 ≤ p(m) ≤ 1) of the occurrence probability of a value 'm' for an item. The average common score of an item among defined clusters is represented as C in this study.
where a is an acceptable frequency of the occurrence probability (0.5 <a < 1) of a value for an item within a cluster. (1-a)/(n-1) indicates that the total of frequencies of the occurrence probability of other values for the item are divided by the number of other values. Hence, C thr is the maximal common score in the case of a defined a. The C for discriminative items should satisfy the condition of C ≤ C thr .
where y indicates the number of defined clusters. The denominator indicates the number of combinations of all clusters, and the numerator indicates a total score (TS) of the differential scores.
The significant lower threshold for D, D thr , is determined by the substitution of TS in formula (4) with (y-1)D cluster , where D cluster is calculated with an acceptable frequency of the occurrence probability a (0.5 <a < 1) on the basis of formula (3). The four frequencies are then substituted as p(m) = a, p(n) = 1-a, q(m) = 0.000001, and q(n) = 0.999999, respectively. Hence, D thr is calculated in the same manner as the D score of an item assigned two values (n = 2). The D for discriminative items should satisfy the condition of D thr ≤ D.
Consequently, the common score is a useful measure of the commonality of characteristics of OTUs in a cluster, whereas the differential score is a measure of the differences between clusters. Their thresholds are effective for the rejection of unsuitable items for the discrimination of target clusters from a dataset.
Results and discussion
A differential-character-finding algorithm was added to the InforBIO software and was tested with data for Pseudomonas strains to identify discriminative characteristics for Pseudomonas species with reference to phylogenetic clusters based on their 16S rRNA gene sequences.
Identification of discriminative phenotypic characteristics for Pseudomonas species
Construction of the biologic database of Pseudomonas species
Construction of a phylogenetic tree based on sequences from Pseudomonas species
Identification of discriminative characteristics for Pseudomonas species
As shown in figure 4, undiscriminated species were 1B, 1C, and 1G in the first step (H1) and were located in two clusters (2B and 2E) on the phylogenetic tree in the second step (H2). Discriminative characteristics for the two clusters were analyzed under the conditions described above. The 2E cluster is composed of P. flavescens and P. straminea and could be discriminated from six other clusters by a single characteristic, non-assimilation of L-isoleucine. Moreover, the two species could be discriminated from each other by 7 characteristics because their values for these items were opposite each other. Thus, at least two characteristics are needed for discrimination of these species from other Pseudomonas species. In contrast, no characteristics were identified that discriminated members of the 2B cluster composed of three species, including P. fulva and P. parafulva. Therefore, an upper cluster by the third step (H3), comprising five species from the addition of two species to the lower 2B cluster (3A by H3 in figure 4), was examined in the same manner and was discriminated by a single characteristic, non-reduction of nitrate. Additionally, best discriminative characteristics for P. parafulva (15 items) within the 3A cluster were identified, whereas those for P. fulva were not. To identify the discriminative characteristics for P. fulva, characteristics to discriminate the 2B cluster were identified within the 3A cluster, and the discriminative characteristic from the 2A and 2C clusters was non-assimilation of L-tartrate, whose C and D values were 0.1668 and 7.0464, respectively. Captures of the result screens of the InforBIO software are shown in figures 3B–3C. Finally, three best discriminative characteristics for P. fulva within the 2B cluster were identified. Therefore, at least three characteristics, non-reduction of nitrate, non-assimilation of L-tartrate, and one of the three discriminative characteristics for P. fulva within the 2B cluster, are needed to discriminate the species. Moreover, discriminative characteristics for upper clusters are effective as additional data for the discrimination of species, whose discriminative characteristics are tenuous. For instance, non-reduction of nitrate, a discriminative characteristic for the 3A cluster, is additional data for the discrimination of P. oryzihabitans, which is able to assimilate L-rhamnose. Consequently, the algorithm and the InforBIO software were effective for identification of characteristics that allowed discrimination of 11 Pseudomonas species with reference to discriminative characteristics for phylogenetic clusters. In addition, such an approach and results may be helpful to find the specific properties of species, which is important for phenotypic studies . In the InforBIO software, the C and D thresholds cannot be set flexibly and have simply been set to 0.5 and 2, respectively. Thus, a table of detailed data of discriminative characteristics is provided by the InforBIO software as shown in figure 3C.
We developed a differential-character-finding algorithm for the identification of the best characteristic to discriminate focused clusters. The algorithm can be used to analyze any type of cluster because it evaluates both intra-cluster and inter-cluster entropy. The common and differential scores are sensitive to taxon sampling. Thus, their thresholds are calculated with the number of defined clusters (y) and of values for an item (n) in addition to an acceptable frequency of the occurrence probability (a) of a value for an item. For continuous data, ranges can be set and converted into discrete data that can be analyzed by the algorithm in the InforBIO software. The set of ranges should include all possible values, but ranges should not overlap. These discrete ranges can then be regarded as values for the algorithm. There are algorithms available to identify sets of diagnostic keys  that can reduce a large dataset into compact, homogeneous data clusters. There are also tools for deterministic and probabilistic identification . The unique feature of our differential-character-finding algorithm in the InforBIO software is that the system can identify the phenotypic characteristics that discriminate and are associated with the clusters of a phylogenetic tree. In current study, the phylogeny of protein-coding gene and protein sequences is analyzed in addition to 16S rRNA gene sequence phylogeny [26, 27], and discriminative characteristics for clusters on a phylogenetic tree based on such sequences might be interesting for polyphasic analysis.
We demonstrated the algorithm in the InforBIO software with an actual dataset of Pseudomonas species. In the recent taxonomic studies of the genus Pseudomonas, specific characteristics of newly suggested species have been decided on the basis of the result from kits with many test items in addition to their phylogeny [26–28]. Also, the importance of species-discriminative phenotypic characteristics has become evident recently in taxonomic studies of species of other genera [29–33]. In this study, we successfully identified a set of phenotypic characteristics that were useful as diagnostic keys for Pseudomonas species. Discriminative characteristics for phylogenetic clusters as shown in the demonstration might be useful information for the finding of novel features for species. Therefore, the differential-character-finding algorithm and the InforBIO software are effective for identification of the characteristics that discriminate clusters from biologic data.
Availability and requirements
Project name: InforBIO project;
Project homepage: http://wdcm.nig.ac.jp/inforbio/index_e.html;
Operating systems: Windows 2000/XP, Macintosh OSX, Linux, UNIX;
Other requirements: CPU ≥ 800 MHz, Memory ≥ 256 MB, HD ≥ 50 MB, Screen resolution ≥ 800 × 600 pixels;
Programming language: Java (j2sdk1.4.2_05);
License: GNU GPL;
Any restrictions to use by non-academics: none.
The authors would like to express their sincere thanks to K. Koorikawa of Hitachi Software Engineering Co., Ltd., for programming of InforBIO.
- Wayne LG, Brenner DJ, Colwell RR, Grimont PAD, Kandler O, Krichevsky MI, Moore LH, Moore WEC, Murray RGE, Stackebrandt E, Starr MP, Trüper HG: International Committee on Systematic Bacteriology. Report of the ad hoc committee on reconciliation of approaches to bacterial systematics. Int J Syst Bacteriol 1987, 37: 463–464.View ArticleGoogle Scholar
- Stackebrandt E, Goebel BM: Taxonomic note: a place for DNA-DNA reassociation and 16S rRNA sequence analysis in the present species definition in bacteriology. Int J Syst Bacteriol 1994, 44: 846–849.View ArticleGoogle Scholar
- Stackebrandt E, Frederiksen W, Garrity GM, Grimont ADP, Kampfer P, Maiden MCJ, Nesme X, Rossello-Mola R, Swings J, Truper HG, Vauterin L, Ward AC, Whitman WB: Report of the ad hoc committee for the re-evaluation of the species definition in bacteriology. Int J Syst Evol Microbiol 2002, 52: 1043–1047.PubMedGoogle Scholar
- Vandamme P, Pot B, Gillis M, de Vos P, Kersters K, Swings J: Polyphasic taxonomy, a consensus approach to bacterial systematics. Microbiol Rev 1996, 60: 407–438.PubMed CentralPubMedGoogle Scholar
- Vandamme P: Polyphasic Taxonomy in Practice: the Burkholderia cepacia Polyphasic Challenge. WFCC Newsletter 2002, 34: 17–24.Google Scholar
- Ledwing W, Klenk H: Overview: A phylogenetic backbone and taxonomic framework for procaryotic systematics. In Bergey's Manual of Systematic Bacteriology. Volume 2. 2nd edition. Edited by: Brenner DJ, Staley JTK, Garrity GM. New York: Springer Inc; 2005:49–65.Google Scholar
- Sneath PH: A Numerical Taxonomy. In Bergey's Manual of Systematic Bacteriology. Volume 2. 2nd edition. Edited by: Brenner DJ, Staley JTK, Garrity GM. New York: Springer Inc; 2005:39–42.Google Scholar
- Rosselló-Mora RA, Lalucat J, García-Valdés E: Comparative biochemical and genetic analysis of naphthalene degradation among Pseudomonas stutzeri strains. Appl Environ Microbiol 1994, 60: 966–972.PubMed CentralPubMedGoogle Scholar
- Sugawara H, Tanaka N, Miyazaki S: An e-Workbench for the study of microbial diversity: the system design and basic functions. Microbiol Cult Coll 2003, 19: 59–67.Google Scholar
- InforBIO download site[http://wdcm.nig.ac.jp/inforbio/]
- Shannon CE: A mathematical theory of communication. Bull System Tech J 1948, 27: 379–423. and 623–656 and 623–656View ArticleGoogle Scholar
- Kullback S: Information Theory and Statistics. New York: Wiley; 1959.Google Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680.PubMed CentralView ArticlePubMedGoogle Scholar
- Felsenstein J: PHYLIP – Phylogeny Inference Package (Version 3.2). Cladistics 1989, 5: 164–166.Google Scholar
- Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002, 30: 3059–3066.PubMed CentralView ArticlePubMedGoogle Scholar
- Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32: 1792–1797.PubMed CentralView ArticlePubMedGoogle Scholar
- Uchino M, Kosako M, Uchimura T, Komagata K: Emendation of Pseudomonas straminea Iizuka and Komagata 1963. Int J Syst Evol Microbiol 2000, 50: 1513–1519.View ArticlePubMedGoogle Scholar
- Uchino M, Shida O, Uchimura T, Komagata K: Recharacterization of Pseudomonas fulva Iizuka and Komagata and proposals of Pseudomonas parafulva sp nov and Pseudomonas cremoricolorata sp nov. J Gen Appl Microbiol 1963, 47(5):247–261.View ArticleGoogle Scholar
- Anzai Y, Kim H, Park J, Wakabayashi H, Oyaizu H: Phylogenetic affiliation of the pseudomonads based on 16S rRNA sequence. Int J Syst Evol Microbiol 2000, 50: 1563–1589.View ArticlePubMedGoogle Scholar
- Palleroni NJ: Prokaryote taxonomy of the 20th century and the impact of studied on the genus Pseudomonas : a personal view. Microbiol 2003, 149: 1–7.View ArticleGoogle Scholar
- Dunn G, Everitt BS: The construction of diagnostic keys. In An introduction to mathematical taxonomy. Edited by: Dunn G, Everitt BS. Cambridge: Cambridge University Press; 1982:110–121.Google Scholar
- Priest F, Austin B: Identification and diagnosis. In Modern Bacterial Taxonomy. 2nd edition. Edited by: Priest F, Austin B. London: Chapman & Hall; 1993:142–153.Google Scholar
- Case RJ, Boucher Y, Dahllöf I, Holmström C, Doolittle WF, Kjelleberg S: Use of 16S rRNA and rpoB genes as molecular markers for microbial ecology studies. Appl Environ Microbiol 2007, 73: 278–288.PubMed CentralView ArticlePubMedGoogle Scholar
- Richert K, Brambilla E, Stackebrandt E: The phylogenetic significance of peptidoglycan types: Molecular analysis of the genera Microbacterium and Aureobacterium based upon sequence comparison of gyrB , rpoB , recA and ppk and 16SrRNA genes. Syst Appl Microbiol 2007, 30: 102–108.View ArticlePubMedGoogle Scholar
- Gardan L, Shafik H, Belouin S, Broch R, Grimont F, Grimont AP: DNA relatedness among the pathovars of Pseudomonas syringae and description of Pseudomonas tremae sp nov and Pseudomonas cannabina sp nov (ex Sutic and Dowson 1959). Int J Syst Evol Microbiol 1999, 49(Pt 2):469–478.Google Scholar
- Kwon WS, Kim SJ, Park CI, Yoon HS, Park HD, Lim KC, Go JS: Pseudomonas koreensis sp nov, Pseudomonas umsongensis sp nov and Pseudomonas jinjuensis sp nov, novel species from farm soils in Korea. Int J Syst Evol Microbiol 2003, 53: 21–27.View ArticlePubMedGoogle Scholar
- Reddy SG, Matsumoto IG, Schumann P, Stackebrandt E, Shivaji S: Psychrophilic pseudomonads from Antarctica: Pseudomonas antarctica sp nov, Pseudomonas meridiana sp nov and Pseudomonas proteolytica sp nov. Int J Syst Evol Microbiol 2004, 54: 713–719.View ArticlePubMedGoogle Scholar
- Langley R, Kenna TD, Vandamme P, Ure R, Govan RJ: Lysogeny and bacteriophage host range within the Burkholderia cepacia complex. J Med Microbiol 2003, 52: 483–490.View ArticlePubMedGoogle Scholar
- Schinsky FM, Morey ER, Steigerwalt GA, Douglas PM, Wilson WR, Floyd MM, Butler RW, Daneshvar IM, Brown-Elliott AB, Wallace JR Jr, McNeil MM, Brenner JD, Brown MJ: Taxonomic variation in the Mycobacterium fortuitum third biovariant complex: description of Mycobacterium boenickei sp nov, Mycobacterium houstonense sp nov, Mycobacterium neworleansense sp nov and Mycobacterium brisbanense sp nov and recognition of Mycobacterium porcinum from human clinical isolates. Int J Syst Evol Microbiol 2004, 54: 1653–1667.View ArticlePubMedGoogle Scholar
- Yu SY, Yang Q, Xu WX, Kong SH, Xu YG, Zhong YB: Typing and characterization of carbapenem-resistant Acinetobacter calcoaceticus - baumannii complex in a Chinese hospital. J Med Microbiol 2004, 53: 653–656.View ArticlePubMedGoogle Scholar
- Christensen H, Kuhnert P, Bisgaard M, Mutters R, Dziva F, Olsen JE: Emended description of porcine [ Pasteurella ] aerogenes , [ Pasteurella ] mairii and [ Actinobacillus ] rossii . Int J Syst Evol Microbiol 2005, 55: 209–223.View ArticlePubMedGoogle Scholar
- Namwong S, Tanasupawat S, Smitinont T, Visessanguan W, Kudo T, Itoh T: Isolation of Lentibacillus salicampi strains and Lentibacillus juripiscarius sp nov from fish sauce in Thailand. Int J Syst Evol Microbiol 2005, 55: 315–320.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.