Identification of signature and primers specific to genus Pseudomonas using mismatched patterns of 16S rDNA sequences
© Purohit et al; licensee BioMed Central Ltd. 2003
Received: 5 November 2002
Accepted: 22 May 2003
Published: 22 May 2003
Pseudomonas, a soil bacterium, has been observed as a dominant genus that survives in diverse habitats under a wide range of hostile conditions. We assumed that the species level variation in 16S rDNA sequences of a bacterial genus is mainly due to substitutions rather than insertion or deletion of bases. Keeping this in view, the aim was to identify a region of the 16S rDNA sequence, to focus within it on substitution prone stretches indicating species level variation, and to derive patterns from these stretches that are specific to the genus.
Repeating elements that are highly conserved across different species of Pseudomonas were considered as guiding markers to locate a region within the 16S gene. Four repeating patterns showing more than 80% consistency across fifty different species of Pseudomonas were identified. The sub-sequences between the repeating patterns yielded a continuous region of 495 bases. After alignment, Shannon's entropy measure was applied to these sub-sequences to obtain a consensus pattern. A stretch of 24 base positions in this region, showing maximum variation across the sampled sequences, was examined for possible genus specific patterns. Nine patterns in this stretch showed nearly 70% specificity to the target genus. These patterns were further used to obtain a signature that is highly specific to Pseudomonas. The signature region was used to design PCR primers, which yielded a PCR product of 150 bp whose specificity was validated through a sample experiment.
The developed approach was successfully applied to genus Pseudomonas. It could be tried in other bacterial genera to obtain respective signature patterns and thereby PCR primers, for their rapid tracking in the environmental samples.
Pseudomonas, a soil bacterium, is often associated with various microbial communities surviving in different environmental conditions because of its diverse catabolic potential. Hence, it is often considered an indicator organism for studying the development and dynamics of microbial communities. We are working on the treatment of wastewater containing toxic chemicals, with emphasis on identifying bacterial isolates [1–6]. We have sequenced the 16S rRNA gene for two strains: Pseudomonas SF1 (utilizes 4-nitrophenol) and PH1 (utilizes 3-aminophenol), bearing accession numbers AF135269 and AF065166 respectively in GenBank. Over a period of time, it has been observed that our bacterial collection, which is based on physiological and degradative properties, is mostly dominated by genus Pseudomonas as identified through bacteriological analysis. In such a case, a rapid and specific detection protocol for Pseudomonas using its genetic information would be a valuable tool in ecological and diagnostic studies. This was the motivation behind the exercise.
There are programs available for designing polymerase chain reaction (PCR) primer pairs as a means of rapid detection [7, 8]. These programs select primer pairs based on user-defined parameters, such as length, secondary structure of primers, G/C content of primers and amplicons, stability of primers etc. However, they do not provide any information regarding the specificity of the oligonucleotides / patterns, and when the list of candidate primer pairs is long, it becomes arduous to select the best primers. Multiple alignment tools are often used to identify specific oligonucleotides [9–12]. When applied to related sequences, these tools yield a consensus pattern of conserved and variable sites across the sequences. The basic purpose of multiple alignment is to compare the similarity of sequences and to identify homologous genes from the database. Because these tools compare entire sequences, they are sometimes not well suited for identifying short specific oligonucleotide sequences. Keeping this in view, a program, HYBsimulator, was developed to design target specific oligonucleotides through computer hybridization simulation (CHS). The program generates a ProbeSet of candidate oligonucleotides from the target gene sequence. Parameters such as salt and oligonucleotide concentrations and melting temperatures are specified to obtain the final ProbeSet meeting these specifications. The CHS is then executed for the ProbeSet against the GenBank database of interest, and finally the best probes are selected based on G/C content, hairpin formation energy and the highest specificity to the database. Recently, another program, PRIMROSE, was developed to pick specific 16S rRNA probes and PCR primers as ecological tools in the identification and enumeration of bacteria. The program generates oligonucleotides from aligned as well as unaligned sequences.
With aligned sequences, the algorithm identifies the consensus sequence and then generates oligonucleotides of specific length from the consensus sequence. A search string is created from each oligonucleotide and database sequences are searched for a match with the generated strings. A string with the number of specific hits exceeding the minimum threshold is considered as important. The program also has an alternative algorithm to deal with unaligned sequences to finally yield the target specific oligonucleotides.
This paper describes an approach, using 16S rDNA sequence data, to select genus specific probes / patterns that could subsequently be used as primers for tracking a bacterium in environmental niches. The approach is based on the hypothesis that when 16S rDNA sequences representing different species of a particular genus are aligned, much of the variability across the sequences is due to substitution of bases and hardly any to insertion or deletion of bases. The first question is whether such substitution prone / mismatched region(s) possess patterns that are genus specific. The second is whether there is an alternative way to identify such regions without taking recourse to multiple alignment of complete sequences. One possible way could be to locate some markers that remain preserved across the selected sequences and focus on the region(s) flanked by them; one such marker could be the repeating patterns of nucleotides. We found that in a set of closely related sequences there are some repeating patterns which occur consistently across the set and, importantly, the separating distance between the repeats is also preserved throughout. Accordingly, the search for mismatched or substitution prone base positions was restricted to the sub-sequence enclosed between the highly consistent repeats. The patterns generated from the mismatched region could be tested for their specificity against the standard database. By this, we ensure first the variability of patterns across the species of a selected genus, followed by their specificity to the target genus. Thus, the main difference between this approach and the one followed in HYBsimulator or PRIMROSE is that, in this case, the probe / pattern set generation is restricted to only the mismatched region.
The paper details a case study for genus Pseudomonas using 16S rDNA sequence data. The most consistent repeating patterns across 16S sequences representing different species of this genus were obtained using a program, Repeat Tuple Search. A region of fixed length flanked by the repeat markers was obtained. A consensus pattern was obtained for this region using Shannon's entropy measure. The most variable stretch of this region was identified, which provided Pseudomonas specific patterns and subsequently a signature pattern. A PCR primer pair was designed considering the conserved and the variable stretch and was validated through a preliminary experiment.
Consistent repeating patterns in Pseudomonas 16S rRNA gene
Target specific patterns and PCR primers
Species level variation within 150 bp PCR product
Amongst several approaches to determine target specific patterns, we have recently proposed one based on the dinucleotide compositions that discriminate a group of selected bacteria from each other. The selected dinucleotides could be used to generate patterns whose specificity could be tested through search analysis against the standard databases.
However, in this case, repeating patterns that are conserved across different sequences of Pseudomonas have been used to locate a mismatched region and eventually to select a genus specific pattern / signature and primers. The identified four repeats exhibit a pattern of repeating elements, which is quite dominant across the Pseudomonas 16S sequences, as shown in Figure 2. In other studies, we have identified repeating elements in 16S sequences of fifty different bacterial groups http://www.ims.nus.edu.sg/Programs/genome, but the occurrence of these four repeats together was not observed in any group other than Pseudomonas. This suggests either that in other genera these patterns have undergone substitutions at one or more corresponding base positions and therefore no longer constitute repeats, or that these four conserved repeating patterns in Pseudomonas are the result of evolution. Some recent studies have suggested the evolutionary significance of repeat elements, following the discovery of similar repeat units among vertebrate species [16, 17]. Likewise, in Pseudomonas also, the identified consistent repeats might have some relevance, which could be explored.
For large size DNA fragment analysis, the approach based on tandem repeats provides a reference through which a sequence could be characterized. Using tandem repeats as reference, PCR primers could be designed to differentiate heterologous or even homologous DNA fragments. In this study, a similar strategy was adopted to design PCR primers, but using simple repeating patterns as reference. The most consistent repeating patterns were used as markers to define a sub-sequence of fixed length in selected sequences. As regards the difference between this approach and the one proposed in PRIMROSE, in the latter the program generates a large number of possible patterns using the consensus pattern for a whole sequence and then determines the target specific patterns based on the statistics of number of hits. Similarly, in HYBsimulator also, the program generates a probe set of specified length across the length of the input sequence. Contrary to this, in the present approach, the probe set is restricted to only the substitution prone or mismatched region of the alignment. The determination of pattern specificity is similar to that of PRIMROSE: it is based on the statistics of target and non-target hits, and does not consider any thermodynamic properties of probes as in the case of HYBsimulator. Moreover, the approach depicts part of the species level diversity, finally yielding a signature pattern for the target genus, which is an interesting observation of this study. The signatures, if ascertained for their specificity, could be used for typing the bacteria in different habitats. Here, the signature was used to design a tracking protocol for genus Pseudomonas, which was also validated on a preliminary scale through experiment. However, a rigorous validation work plan would be required with more DNA templates. The exercise should also include a significant number of unknown isolates from environmental niches.
The DNA derived from these isolates could be tested for the 150 bp product, with further support from ARDRA of 16S rDNA using a band-sharing index. Alternatively, the DNA templates yielding the 150 bp product could be randomly picked to derive complete 16S rDNA sequences to ascertain the specificity of the approach to genus Pseudomonas.
We assumed that the species level differences could be captured by nucleotide patterns in the mismatched or substitution prone sites in sequences belonging to a particular genus. This has been verified through a sample experiment by analysing a small region of 16S rDNA bounded by repeating patterns. There could be some other position specific markers such as non-duplicating patterns; however, the constant distance criterion between such patterns across the sequences needs to be ascertained, as in the case of repeating patterns. Although the approach was implemented successfully for Pseudomonas, it may have limitations when applied to other genera. For instance, in some cases, the sub-sequences flanked by consistent repeats might show little variability, or the variability might be uniform throughout their length. In that case, identifying a stretch from which to select patterns would be difficult. Further, it might also happen that the patterns belonging to the most variable stretch do not produce sufficient hits to be considered target specific. We would like to state here that the strength of our assumption and the approach depends much on its applicability in other bacterial genera, and we recommend that it be investigated further. We are trying to use this approach to derive signatures and primers for different genera of importance in bioremediation. If it works for other genera and provides genus specific signatures, then it could be one of the possible means of designing microarrays for studying population dynamics in bioremediation protocols or similar applications.
The procedure for selecting target specific patterns has been summarized through flow chart as shown in Figure 1 and is as follows:
Consistent repeating patterns in related sequences
A repeat pattern is a sub-string of nucleotides in a sequence S, which occurs more than once in S. The origin, evolution and distribution of repetitive elements in genome databases have been the subject of intense study, both experimentally and computationally. Some algorithms to find repeats in a sequence can be found in [19–21]. Amongst these, the recently developed REPuter has been found to be efficient and provides exhaustive repeats, even in large genome sequences. The algorithm uses a compact implementation of suffix trees to locate exact repeats in linear space and time.
Although the above programs provide a list of repeats in a sequence, they do not have an option to automatically report repeating patterns that are conserved across the input set of sequences while simultaneously considering the spacing criterion. In fact, processing data on repeats from different input sequences could be a small extension to many of these programs. We have developed a program, Repeat Tuple Search, which has this additional feature to determine the consistent repeating patterns across the set of sequences. The program can be downloaded from http://www.ebi.ac.uk/~lijnzaad/RepeatTupleSearch and used to handle small gene sequences up to 2 kb in size. It accepts sequences one-by-one in a simple text format and stores data on repeating patterns for each input sequence. The program has two basic components. The first determines the repeating patterns of length more than six (default setting) along with the spacing between the repeats in each input sequence. The search for repeating patterns is exhaustive and does not ask for any input conditions from the user. The second component processes the collective data to get the number of sequences in which the different repeating patterns appear, considering the constant spacing criterion. The patterns with a high frequency of occurrence are considered the most consistent repeating patterns.
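The two components described above can be illustrated with a minimal Python sketch. This is not the Repeat Tuple Search program itself: the function names, the use of a single fixed pattern length `k` (seven, as the shortest length "more than six") and the `min_fraction` cut-off are assumptions made only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def repeats_with_spacing(seq, k=7):
    """Return the set of (pattern, spacing) pairs for k-mers occurring more than once.

    Spacing is the distance between the start positions of two occurrences.
    Simplification (assumption): only one fixed length k is examined.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    found = set()
    for pattern, starts in positions.items():
        for a, b in combinations(starts, 2):
            found.add((pattern, b - a))
    return found


def consistent_repeats(sequences, k=7, min_fraction=0.8):
    """Count in how many sequences each (pattern, spacing) pair appears.

    Only pairs present in at least min_fraction of the sequences are kept
    (hypothetical cut-off, chosen here for illustration).
    """
    counts = defaultdict(int)
    for seq in sequences:
        for key in repeats_with_spacing(seq, k):
            counts[key] += 1
    cutoff = min_fraction * len(sequences)
    return {key: n for key, n in counts.items() if n >= cutoff}
```

Because the key is the pair (pattern, spacing), a repeat is counted as consistent only when both the pattern and the distance between its copies are preserved, which mirrors the constant spacing criterion.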
Entropy-based conservation analysis
Once the consistent repeating patterns are determined, the sub-sequences enclosed between them could be studied for the degree of conservation across the set of input sequences. The sub-sequences enclosed between a particular repeat from different sequences could be aligned using the CLUSTAL W program to arrive at the positional conservation. Shannon's entropy could be used as a measure of positional conservation. It gives the average uncertainty of an outcome at different positions using the expression
$$S(m_i) = -\sum_{a} p_{ia} \log_2 p_{ia} \qquad (1)$$
where $m_i$ stands for the $i$th position and $p_{ia}$ is the probability of residue $a$ in the $i$th position with reference to the 5'-end tuple, and could be estimated using the maximum likelihood estimate
$$p_{ia} = \frac{c_{ia}}{\sum_{a'} c_{ia'}} \qquad (2)$$
where $c_{ia}$ is the total number of $a$'s in the $i$th position, while the denominator in (2) gives the total number of bases at that position. The logarithm in expression (1) has base 2, hence the unit of entropy is a bit. The information content could be used, which gives the reduction in uncertainty after some 'knowledge' has been received. In other words, the information content at a particular position is the difference between the entropy before and after the knowledge, and is given by
$$IC(m_i) = S_{before} - S_{after}$$
Without any bias, the assumption could be that the bases in a particular column are random, i.e. $p_a = 0.25$, giving $S_{before} = 2$ bits. On observing the occurrence of bases in the columns of the alignment, $S_{after}$ could be obtained, and thus the difference gives the information content of the column. Such information content could be obtained at different positions of the aligned sub-sequences, thereby yielding conserved and variable base positions.
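As an illustration of the entropy and information content calculation, the following is a minimal Python sketch. It assumes the sub-sequences are already aligned to equal length and that characters other than A, C, G and T (gaps, ambiguity codes) are simply ignored; both the function names and that handling of gaps are assumptions, not part of the original method description.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy in bits of one alignment column.

    Probabilities are maximum likelihood estimates: count / total bases.
    Assumption: characters other than A, C, G, T are ignored.
    """
    bases = [b for b in column if b in "ACGT"]
    total = len(bases)
    counts = Counter(bases)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_content(aligned, s_before=2.0):
    """Per-column information content, S_before - S_after.

    s_before = 2 bits corresponds to four equiprobable bases (p = 0.25).
    """
    return [s_before - column_entropy(col) for col in zip(*aligned)]
```

A fully conserved column has zero entropy and therefore an information content of 2 bits, whereas a column in which all four bases are equally frequent has an information content of 0 bits. The most variable stretch corresponds to a run of columns with low information content.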
The most variable stretch of nucleotide positions could be identified, and patterns of some fixed length 'l' spanning the variable stretch could be selected from the sampled sequences. The specificity of each selected pattern could be tested against the rRNA database. If the specific hits produced by a pattern exceed some predetermined threshold, the pattern is selected as target specific; otherwise, the search for that pattern is aborted and the next pattern is picked. Once the search for all the patterns is over, the selected target specific patterns could be used to generate a signature pattern, which would be specific to the target locus. Also, each selected pattern could be tested for its suitability as a primer using any of the primer selection programs, to facilitate rapid detection of the target bacterium in environmental samples.
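The selection step can be sketched in Python as follows. This is only an illustration under stated assumptions: exact substring matching stands in for the search against the rRNA database, the database is represented as a hypothetical list of (genus, sequence) pairs, and the thresholds `min_hits` and `min_specificity` are placeholder values rather than those used in the study.

```python
def window_patterns(sequences, start, end, length):
    """Collect distinct patterns of the given length lying within [start, end)."""
    patterns = set()
    for seq in sequences:
        for i in range(start, end - length + 1):
            patterns.add(seq[i:i + length])
    return patterns


def select_specific(patterns, database, target, min_hits=5, min_specificity=0.7):
    """Keep patterns whose hits fall mostly in the target genus.

    database: list of (genus, sequence) pairs (hypothetical stand-in for
    a real rRNA database search). Specificity is target hits / all hits.
    """
    selected = {}
    for pattern in sorted(patterns):
        hits = [genus for genus, seq in database if pattern in seq]
        target_hits = sum(1 for genus in hits if genus == target)
        if target_hits < min_hits:
            continue  # too few specific hits: move on to the next pattern
        specificity = target_hits / len(hits)
        if specificity >= min_specificity:
            selected[pattern] = specificity
    return selected
```

Patterns that pass this screen would then be candidates for building the signature and for checking with a primer selection program.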
The authors have equally contributed to the manuscript.
HJ Purohit: Developed the concept, retrieved the sequences, participated in sequence alignment using Lasergene software, developed the dendrogram, and designed the PCR primers, followed by validation through molecular experiment.
Kapley: Experimental validation with Slot-Blot hybridization using different Pseudomonas strains.
- Atuanya EI, Purohit HJ, Chakrabarti T: Anaerobic and aerobic biodegradation of chlorophenols using UASB and ASG bioreactors. World Journal of Microbiology & Biotechnology 2000, 16: 95–98. 10.1023/A:1008957229070
- Chhatre SA, Purohit HJ, Shanker R, Chakrabarti T, Khanna P: Bacterial consortia for crude oil spill remediation. Water Science and Technology 1996, 34: 187–193. 10.1016/S0273-1223(96)00713-5
- Kapley A, Purohit HJ, Chhatre S, Shanker R, Chakrabarti T, Khanna P: Osmotolerance and hydrocarbon degradation by genetically engineered bacterial consortium. Bio-Resource Technology 1999, 67: 241–245. 10.1016/S0960-8524(98)00121-7
- Kapley A, Lampel K, Purohit HJ: Thermocycling steps and optimization of multiplex PCR. Biotechnology Letters 2000, 22: 1913–1918. 10.1023/A:1026748202071
- Kapley A, Purohit HJ: Tracking of phenol degrading genotype. Environmental Science and Pollution Research 2001, 8: 89–90.
- Kutty R, Purohit HJ, Khanna P: Isolation and Characterization of a Pseudomonas sp. strain PH1 utilizing meta-aminophenol. Canadian Journal of Microbiology 2000, 46: 211–217. 10.1139/cjm-46-3-211
- Lowe T, Sherfkin J, Yang SQ, Dieffenbach CW: A computer program for selection of oligonucleotide primers for polymerase chain reaction. Nucleic Acids Research 1990, 18: 1757–1761.
- Mitsuhashi M, Cooper A, Ogura M, Shinagawa T, Yano K, Hosokawa T: Oligonucleotide probe design – a new approach. Nature 1994, 367: 759–761. 10.1038/367759a0
- Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution 1987, 25: 351–360.
- Lipman DJ, Altschul SF, Kececioglu JD: A tool for multiple sequence alignment. Proceedings of National Academy of Science 1989, 86: 4412–4415.
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 1994, 22: 4673–4680.
- Thompson JD, Plewniak F, Poch O: A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Research 1999, 27: 2682–2690. 10.1093/nar/27.13.2682
- Hyndman D, Cooper A, Pruzinsky S, Coad D, Mitsuhashi M: Software to determine optimal oligonucleotide sequences based on hybridization simulation data. BioTechniques 1996, 20: 1090–1096.
- Ashelford KE, Weightman AJ, Fry JC: PRIMROSE: a computer program for generating and estimating the phylogenetic range of 16S rRNA oligonucleotide probes and primers in conjunction with the RDP-II database. Nucleic Acids Research 2002, 30(15): 3481–3489. 10.1093/nar/gkf450
- Raje DV, Purohit HJ, Singh RN: Distinguishing features of 16S rDNA gene for five dominating bacterial genus observed in bioremediation. Journal of Computational Biology 2002, 9(6): 819–829. 10.1089/10665270260518290
- Fumigalli L, Taberlet P, Favre L, Hauser J: Origin and evolution of homologous repeated sequences in mitochondrial DNA control region of shrews. Molecular Biology and Evolution 1996, 13: 31–46.
- Wilkinson GS, Mayer F, Kerth G, Petri B: Evolution of repeated sequence arrays in evening bat D-loop mtDNA. Genetics 1997, 128: 607–617.
- Liu BH: Statistical Genomics: Linkage, Mapping and QTL analysis. CRC Press LLC, Florida 1998.
- Rigoutsos I, Floratos A: Combinatorial Pattern Discovery in Biological Sequences: the TEIRESIAS Algorithm. Bioinformatics 1998, 14: 55–67. 10.1093/bioinformatics/14.1.55
- Kurtz S, Schleiermacher C: REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics 1999, 15: 426–427. 10.1093/bioinformatics/15.5.426
- Kurtz S, Choudhari J, Schleiermacher C, Stoye J, Giegerich R: REPuter: the manifold applications of repeat analysis on a genome scale. Nucleic Acids Research 2001, 29: 4633–4642. 10.1093/nar/29.22.4633
- Durbin R, Eddy S, Krogh A, Mitchison G: Biological Sequence Analysis: Probabilistic models for proteins and nucleic acids. Cambridge University Press 1998.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.