Mathematical design of prokaryotic clone-based microarrays
© Pieterse et al; licensee BioMed Central Ltd. 2005
Received: 19 May 2005
Accepted: 28 September 2005
Published: 28 September 2005
Clone-based microarrays, on which each spot represents a random genomic fragment, are a good alternative to open reading frame-based microarrays, especially for microorganisms for which the complete genome sequence is not available. Since the generation of a genomic DNA library is a random process, it is beforehand uncertain which genes are represented. Nevertheless, the genome coverage of such an array, which depends on different variables like the insert size and the number of clones in the library, can be predicted by mathematical approaches. When applying the classical formulas that determine the probability that a certain sequence is represented in a DNA library at the nucleotide level, massive amounts of clones would be necessary to obtain a proper coverage of the genome.
This paper describes the development of two complementary equations for determining the genome coverage at the gene level. The first equation predicts the fraction of genes that is represented on the array in a detectable way and cover at least a set part (the minimal insert coverage) of the genomic fragment by which these genes are represented. The higher this minimal insert coverage, the larger the chance that changes in expression of a specific gene can be detected and attributed to that gene. The second equation predicts the fraction of genes that is represented in spots on the array that only represent genes from a single transcription unit, which information can be interpreted in a quantitative way.
Validation of these equations shows that they form reliable tools supporting optimal design of prokaryotic clone-based microarrays.
In the past decade, whole transcriptome comparison by microarray hybridizations has proven to be an effective tool for studying genome wide gene responses. The general approaches for the development of microarrays are based on the completely annotated genome sequence of an organism. Usually each spot on the array represents one open reading frame (ORF). Whereas this approach has clear advantages for strains for which the complete annotated genome sequence is available, it is not applicable to strains for which this is not the case.
A method that allows for the rapid construction of microarrays for which the completely annotated genome sequence is not required is by the construction of a clone-based array. In this approach, a chromosomal DNA library is constructed from the strain of interest. From this library the genomic fragments, the inserts, are amplified from the clones by PCR with generic primers and spotted on the array-slide [1, 2].
The major differences between ORF-based and clone-based arrays with respect to the data interpretation are that in case of clone-based arrays the differential signals can only be linked to a specific gene after the DNA fragment from the spot of interest on the array has been sequenced, and that it is beforehand uncertain whether a gene is represented on the array. Moreover, whereas ORF-based microarrays exclusively generate gene specific data, a differential signal within a spot on a clone-based array can originate from multiple genes on the insert that are not necessarily linked at the transcriptional level.
The extent of these limitations can be quantified by estimating the genome coverage by the spots present on the array. The standard formulas for estimating the genome coverage of a DNA library, the Clark-Carbon equation  and the Lander-Waterman equation , determine this coverage at the nucleotide level. In other words, they consider the genome as a set of nucleotides, which is useful when the library is to be used for genome sequencing. However, these formulas will overestimate the required number of clones for hybridization purposes. The reason for this is that for hybridization purposes small overlapping fragments that allow for specific binding of the labeled cDNA suffice. Akopyants et al.  developed an equation for the estimation of the fraction of genes that are at least partially represented. This formula is directly derived from classical probability calculations and contains the organism specific variables genome size and average gene size. Due to the fact that Akopyants et al. determine the genome coverage at the gene level, and consider a gene represented if a fragment is present that is large enough to hybridize to and large enough to identify the gene, the required number of clones to obtain a certain coverage is reduced.
A general drawback of these three formulas is that they give no insight into the fraction of genes for which specific data can be generated in a transcriptomics experiment. The data from a spot are considered specific if the expression ratios from the quantified signal from that spot can directly be related to the gene(s) represented by the spot. This is not the case if DNA from multiple (neighboring) transcription units is present in one spot, since it would be uncertain which gene is responsible for which part of the total signal from that spot.
In this paper, two formulas were developed that enable for mathematical predictions of genome coverage by a prokaryotic clone based-array at the gene level as a function of genome size, number of clones, insert size, and either the minimal part of the insert that is covered by the gene or the minimal overlap of the gene and the insert: the minimal insert coverage (MIC) equation, and the gene specific information (GSI) equation.
Overview of prokaryotes from several genera with their genes/transcription unit-ratio. Microorganisms that were used for model development (M) or validation (V) of the MIC- and the GSI-equation are depicted in the list.
Model (M) or validation (V) strain
Proteobacteria Gammaproteobacteria Enterobacteriales
Escherichia coli K-12 MG1655
Escherichia coli O157:H7 EDL933
Escherichia coli CFT073
Salmonella typhi CT19
Salmonella typhimurium LT2
Yersinia pestis CO92
Shigella flexneri 2a str. 2457T
Buchnera aphidicola Sg
Haemophilus influenzae Rd
Pasteurella multocida PM70
Proteobacteria Gammaproteobacteria Xanthomonadales
Xylella fastidiosa 9a5c
Xanthomonas campestris ATCC33913
Proteobacteria Gammaproteobacteria Vibrionales
Vibrio cholerae El Tor N16961
Vibrio parahaemolyticus RIMD2210633
Vibrio vulnificus CMCP6
Proteobacteria Gammaproteobacteria Pseudomonadales
Pseudomonas aeruginosa PA01
Pseudomonas putida KT2440
Proteobacteria Gammaproteobacteria Legionellales
Coxiella burnetii RSA 493
Neisseria meningitidis Z2491
Ralstonia solanacearum GMI1000
Helicobacter pylori 26695
Campylobacter jejuni NCTC11168
Rickettsia prowazekii Nadrid E
Sinorhizobium meliloti 1021
Agrobacterium tumefaciens C58
Brucella suis 1330
Bacillus subtilis 168
Oceanobacillus iheyensis HTE831
Stapylococcus aureus MW2
Listeria monocytogenes EGD-e
Listeria innocua Clip11262
Clostridium acetobutylicum ATCC824
Clostridium tetani E88
Thermoanaerobacter tengcongensis MB4T
Lactococcus lactis IL1403
Streptococcus agalactiae 2603
Streptococcus pneumoniae R6
Lactobacillus plantarum WCFS1
Enterococcus faecalis V583
Mycoplasma pneumoniae M129
Mycoplasma genitalium G37
Mycoplasma penetrans HF-2
Ureaplasma urealyticum (serovar 3)
Mycobacterium tuberculosis H37Rv
Corynebacterium glutamicum ATCC 13032
Streptomyces coelicolor A3(2)
Tropheryma whipplei Twist
Bifidobacterium longum NCC2705
Fusobacterium nucleatum ATCC25586
Chlamydia trachomatis (serovar D)
Chlamydophila pneumoniae AR39
Borrelia burgdorferi B31
Treponema pallidum Nichols
Leptospira interrogans 56601
Bacteroides thetaiotaomicron VPI-5482
Thermosynechococcus elongatus BP-1
Nostoc sp. PCC 7120
Green sulfur bacteria
Chlorobium tepidum TLS
Deinococcus radiodurans R1
Aquifex aeolicus VF5
Thermotoga maritima MSB8
Methanococcus jannaschii DSM2661
Pyrococcus furiosus DSM3638
Archaeoglobus fulgidus DSM4304
Thermoplasma acidophilum DSM1728
Methanosarcina acetivorans C2A
Methanosarcina mazei Goe1
Aeropyrum pernix K1
Sulfolobus solfotaricus P2
Pyrobaculum aerophilum IM2
Description of the developed equations
Minimal Insert Coverage (MIC)-equation
Since the generation of inserts for a genomic DNA library is a random process, a large part of the represented genes may be co-represented with other genes by one spot on the microarray. This complicates data interpretation since it introduces an uncertainty on which gene or genes are responsible for differential signals from these spots. The impact of differential expression of a specific gene on the observed difference of the signal from a spot will be larger when a larger part of the genome fragment in that spot is covered by that gene. Moreover, the larger the part of the insert that is covered by a specific gene, the larger the chance that differential signals for the spot can be attributed to that gene, and the higher the chance that differential expression levels from that gene result in a statistically significant differential signal on the array.
The MIC-equation anticipates to this effect by predicting the number of genes that are (at least partially) present on an insert and cover at least a predefined part of the insert (DIC). This predefined part is defined as a percentage of the total insert. E.g. if the insert size is 1000 base pairs and the predefined minimal insert coverage (DIC) is set at 50%, then at least 500 bp of that gene should be present on an insert to be considered as represented by the array. Genes smaller than the size of the predefined part of the insert, are considered as not represented on the array.
Gene Specific Information (GSI)-equation
Fifteen prokaryotes from various genera were selected as model species (Table 1). Genome data from these microorganisms were used for the generation of species-specific values for the expected fraction of represented genes as a function of the genome size (GS), number of clones (N), insert size (IS), and either DIC or O mf . Coordinates from all annotated genes from these organisms were obtained from GenBank, and were used to determine the gene sizes. In addition, information was obtained on the start and stop coordinates from the transcription unit to which the gene belongs, and the position of the gene in this transcription unit. It was assumed that transcription units start at the first base pair of the first gene and finish at the last base pair of the last gene. This information was generated by the combination of intergenic region based transcription unit predictions, generated by Moreno-Hagelsieb and Collado-Vides , with gene coordinates from GenBank.
The genome size (GS) could be included as a fictitious variable in the datasets, since not the species-specific genome size, but the species-specific gene size distribution and genome organization in genes and transcription units were relevant.
It was assumed that each possible genome fragment of the size of the insert size (IS) has an equal chance of being represented. To achieve this, fragments should be generated by physical fragmentation, and not by the use of endonucleases.
Dataset preparation for the fitting procedure for the MIC-equation
Overview of the variables that were used for the model datasets on which the MIC- and the GSI-equation are based. Multiple combinations of the mentioned values were applied.
500; 1500; 2500; 3500; 4500; 5500; 6500; 7500; 8500; 9500
100; 300; 500; 700; 900; 1100; 1300; 1500; 2100; 2700; 3000
0.5; 1.5; 2.5; 3.5; 4.5; 5.5; 6.5; 7.5; 8.5; 9.5
50; 100; 150; 200; 250; 300; 350; 400; 450
10; 20; 30; 40; 50; 60; 70; 80; 90
The following formulas were developed for the calculation of the probability value per gene:
Gene <O mv ⇒ p = 0 (3)
Dataset preparation for the fittimg procedure for the GSI-equation
For each organism, the fraction of genes for which specific information could be generated was determined for 114 different combinations of the number of clones (N), fictitious genome size (GS), the insert size (IS), and the minimal required overlap (O mf ) in the ranges depicted in Table 2. The represented fraction was determined by taking the average of the probability values per gene. Formulas were developed that describe different situations with respect to the localization and organization of the gene of interest on the insert (eq 4 – 15)
Formulas that were developed to determine the probability value for genes that are transcribed into a single gene transcript:
Gene <IS ⇒ p = 0 (5)
Formulas that were developed to determine the probability value for genes that are at the beginning of a transcription unit:
BP e ≤ IS - O mf ⇒ O e = IS - BP e (6)
BP e > IS - O mf ⇒ O e = O mf (7)
BP e + Gene <IS ⇒ p = 0 (9)
Formulas that were developed to determine the probability value for genes that are flanked at both sides by other genes that belong to the same transcription unit:
BP b ≤ IS - O mf ⇒ O b = IS - BP b (10)
BP b > IS - O mf ⇒ O b = O mf (11)
BP e ≤ IS - O mf ⇒ O e = IS - BP e (12)
BP e > IS - O mf ⇒ O e = O mf (13)
BP b + BP e + Gene > IS ⇒ p = 0 (15)
Models and fits
The datasets with the expected fractions of represented genes for the various combinations of parameters as presented in the previous section functioned as template for the fitting of the predictive equation for MIC and GSI.
From equation 2, which was used to determine the probability value per gene, it became apparent that organism-dependent gene size distribution influenced the expected number of represented genes on a clone based array. These organism dependent differences were neglected for the preparation of the MIC equation, which proofed to be justified when validating the MIC-equation (see validation section).
A polynome was developed as MIC model. In the polynome all variables were present in first and second order and in cross terms between two variables. Because of a high expected correlation between IS and DIC (based on equation 2), this relation was extended with a second order term composed of IS and DIC, resulting in:
p MIC = a + b1·DIC + b2·DIC2 + c1·N + c2·N2 + d1·GS + d2·GS2 + e1·IS + e2·IS2 + f·DIC·IS + g·DIC·N + h·DIC·GS + i·IS·N + j·IS·GS + k·GS·N + l·(IS·DIC)2 (16)
The model datasets for the 15 model species were used together in the regression procedure to estimate the parameters in the MIC model. Linear regression using a standard least squares algorithm (fminsearch) provided by Matlab (The MathWorks) was applied to search the parameters that minimize the sum of squares (SSQ) defined as:
SSQ = ∑(pMIC,exp- pMIC,mod)2 (17)
Values for the parameters in the MIC- and the GSI-equation.
From the model datasets for the GSI equation it appeared that an organism dependent variable had a strong influence on the calculated number of represented genes (results not shown). Analysis revealed a positive correlation between the number of represented genes and the species-dependent average number of genes per transcription unit, R. R was determined by dividing the total number of genes (GenBank) by the total number of predicted transcription units  (Table 1).
Starting-point for the GSI model was a second order polynome for all variables, extended with the cross terms between two variables. A set of parameters was estimated for each individual model species (results not shown). Parts which appeared to contribute less than 1% to p GSI were not included, which resulted in the following relation:
p GSI = a + b2·O mf 2 + c1·N + c2·N2 + d1·GS + d2·GS2 + e1·IS + e2·IS2 + f·O mf ·IS + h·O mf ·GS + i·IS·N + j·IS·GS + k·GS·N (18)
For each prokaryote a set of parameters was obtained by minimizing the SSQ, equivalent to equation 17. The average absolute deviation of the GSI equation from the model datasets was 0.0258.
In order to obtain one generic equation for the organism specific relations for p GSI , the species specific values for the parameters (a - k) in equation 18 were related to the species related variable R by a linear relation:
parameter(a - k) = q + r·R (19)
in which R is species specific (Table 1). Since no dependency of a with R could be established, a was set at the average of all individual a values: 0.544. With this value the polynome was fitted again, and the final relations between the other parameters and R were determined (Table 3).
Overview of the variables and the values used for these variables that were used for the datasets that were used for the validation of the MIC- and the GSI-equation. All possible combinations of the mentioned values were tested.
1000; 4000; 7000; 10000
100; 500; 1000; 1500; 2000
1; 3; 5; 7
25; 50; 75
Reliability of the MIC- and the GSI-equation, depicted as the fraction of predictions that differ less than 0.01, 0.05 or 0.10 from the real values, for the validation sets defined in Table 4.
Abs (Δ predicted vs. real)
Fraction for MIC-equation
Fraction for GSI-equation
Deviations between the predicted fractions by the MIC-equation and the true values as they were determined for the validation species are mainly to be attributed to species-specific gene size distribution. In order to obtain one generic equation, and based on the accuracy of the equation in its current form (Table 5), it was decided not to include a species-specific variable.
Prediction of the optimum value for the insert size (IS)
Whereas an increase in N will always have a positive contribution to the fraction of represented genes, and an increase in GS, O mf , and MIC a negative contribution, there may be an optimum IS that depends on the values of the other variables. This optimum can be estimated by differentiation of equation 16 and 18 to IS (dp/dIS).
For the determination of the optimal value for IS for the MIC-approach this results in the following equation:
For the determination of the optimal value for IS for the GSI-approach the equation is as follows:
If the indicated values for ISopt are outside the range of 0 to 2000 bp (the range that was applied for validation of the models) no optimum can be identified within the boundaries of the model. In these cases small values of IS will give the best results.
Influence of the average number of genes per transcription unit (R) on the predicted values
From the input variables for the MIC and GSI formulas, N, IS, DIC and O mf are user-defined, while GS and R have to be estimated for the specific organism. Whereas current techniques allow for rapid and accurate estimations of GS [7–9], the organism specific value for R is difficult to determine for species from which little sequence information is available.
R was determined for 73 prokaryotes from multiple genera, as previously described in the "models and fits" section (Table 1). For 61 of the 73 strains in this list, R was within the narrow range from 1.5 – 2.0. Moreover these data indicate that accurate estimations of R can be made, based on the genus of the organism, with an exception for the mollicutes, the hyperthermophilic bacteria and the euryarchaeota.
Effect of false estimations of R on the fraction of predictions that differ less than 0.01, 0.05 or 0.10 from the real values, for the validation set defined in Table 4.
Applied value for R
Abs (Δ predicted vs. real)
R - 0.1
R + 0.1
R - 0.2
R + 0.2
R - 0.3
R + 0.3
Plots like those depicted in figures 3a–c and 4 can be used to determine the preferred combination of the number of spots on the array and the insert size. If for instance the number of spots would be limited to 6000, an insert size of approximately 800 bp would be optimal with respect to the fraction of genes that are represented with a minimal insert coverage of 25% (Fig. 3a). From equation 20 this optimum appears to be 803 bp. With this combination of array parameters the predicted fraction of genes that cover at least 25% of the insert (which equals 803 × 0.25 = 201 bp) is 0.75 (eq. 16). Meanwhile the predicted fraction of genes for which gene specific information can be generated is 0.49 (eq. 18). If the specificity of the data is considered to be more important than the amount of represented genes, it is preferable to have an optimum value for pMIC for higher values of DIC (e.g. Fig. 3c) and a high value for pGSI (Fig. 4). These requirements are best fulfilled by combinations with low values for the insert size.
A Microsoft Excel fill in-spreadsheet that allows for calculations of pGSI, pMIC, and the optimal values for the insert size, is available as additional file with this paper [see Additional file 1].
Classical approaches for the construction of DNA libraries form a suitable base for the construction of clone-based microarrays. However, as the construction of these libraries is a random process, it is beforehand uncertain whether a gene or transcription unit will be uniquely represented on a separate insert on the array. Genome coverage by a DNA library is usually determined by calculating the expectation that each single nucleotide from that gene is present [3, 4]. These formulas will overestimate the number of clones required when the library is to be used for the construction of a microarray, since for this purpose partial representation of a gene is sufficient for hybridization.
To our knowledge, Akopyants et al. were the first to estimate genome coverage at the gene level . They predicted the fraction of represented genes using equation 22:
An important variable in this formula is the average transcript size. However, use of this variable is not legitimate for this type of probability calculations since the average probability per gene (the required information) is not per se equal to the probability per average gene. When we validated the Akopyants formula on the same dataset that was applied for the validation of the MIC-equation, it appeared that 49% of the predictions deviated more than 0.1 from the real value (calculated as the average chance per gene), with a strong tendency to overestimation. The Akopyants formula therefore appears unreliable for calculating optimal library sizes
None of the previous formulas give insight in the fraction of genes for which gene specific information can be generated, while this is one of the most important features when one is interested in studying differential gene expression. The MIC-and GSI-equations that were developed in this study allow for good estimations of both the genome coverage at the gene level, and the fraction of genes for which gene specific transcription information can be generated.
Whereas the MIC-equation is rather straight-forward with respect to the input variables and interpretation, application of the GSI-equation requires the estimation of the average number of genes per transcription unit for an organism. Although a false estimation of this variable could lead to a wrong prediction of the represented fraction, Tables 1 and 6 indicate that this risk is limited.
The GSI-equation is partially based on operon predictions. For the development of the model and validation datasets we used log-likelihood based transcription unit predictions for adjacent pair of genes to be in the same operon . This log-likelihood based prediction method is only applicable to organisms for which at least large parts of the genome have been sequenced, and will therefore not be useful when sequence data from array spots for which differential expression was identified, have to be interpreted. Nevertheless, good predictions can be made on whether or not genes that are co-represented in a single spot on the array belong to the same transcription unit. Strong indications can already be obtained from the physical organization of the DNA fragment of interest, like gene orientation and intergenic distance [6, 11]. Other indications are the co-occurrence of genes with a joint function, and the conserved organization of homologous genes in other prokaryotes [11, 12].
The MIC- and GSI-equations that were developed in this study were based on genomes from 15 prokaryotes from different genera, and validated on the genomes of 10 other prokaryotes. These validations show that these equations form reliable tools for optimal design of prokaryotic clone-based microarrays within the ranges that were tested (Table 4), and that they are applicable to a broad range of prokaryotes. Therefore, these equations form a good basis for the design of microarrays for prokaryotes from which the genome sequence is not available.
List of abbreviations
- BP b :
number of base pairs within the operon in front of the specific gene [bp]
- BP e :
number of base pairs within the operon behind the specific gene [bp]
- DIC :
predefined minimal insert coverage, i.e. the minimal required representation of the gene on the insert [%]
- Gene :
gene size [bp]
- GS :
genome size [Mbp]
- IS :
insert size [bp]
- IS opt-MIC :
Optimal value of IS for the MIC equation [-]
- IS opt-GSI :
Optimal value of IS for the GSI equation [-]
- N :
number of clone [-]
- O b :
overlap of the fragment and the beginning of the gene [bp]
- O e :
overlap of fragment and the end of the gene [bp]
- O mf :
minimal required overlap of the fragment and the gene (fixed) [bp]
- O mv :
minimal required overlap of the fragment and the gene (variable) [bp]
- p :
gene specific probability value [-]
- p GSI :
predicted fraction of specifically represented genes [-]
- p MIC :
predicted fraction of represented genes represented with a minimal insert coverage [-]
- R :
average number of genes per transcription unit [-]
- SSQ :
Residual Sum of Squares [-]
- a-l :
parameters in MIC or GSI model [-]
We would like to thank Rolf Boesten, Martien Caspers, Nicole van Luijk and Karin Overkamp for their critical remarks and useful suggestions.
- Cho JC, Tiedje JM: Bacterial species determination from DNA-DNA hybridization by using genome fragments and DNA microarrays. Appl Environ Microbiol 2001, 67: 3677–3682. 10.1128/AEM.67.8.3677-3682.2001PubMed CentralView ArticlePubMedGoogle Scholar
- Askenazi M, Driggers EM, Holtzman DA, Norman TC, Iverson S, Zimmer DP, Boers ME, Blomquist PR, Martinez EJ, Monreal AW, Feibelman TP, Mayorga ME, Maxon ME, Sykes K, Tobin JV, Cordero E, Salama SR, Trueheart J, Royer JC, Madden KT: Integrating transcriptional and metabolite profiles to direct the engineering of lovastatin-producing fungal strains. Nat Biotechnol 2003, 21: 150–156. 10.1038/nbt781View ArticlePubMedGoogle Scholar
- Clark L, Carbon J: A colony bank containing synthetic Col E1 hybrids representative of the entire E. coli genome. Cell 1976, 9: 91–99. 10.1016/0092-8674(76)90055-6View ArticleGoogle Scholar
- Lander ES, Waterman MS: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 1988, 2: 231–239. 10.1016/0888-7543(88)90007-9View ArticlePubMedGoogle Scholar
- Akopyants NS, Clifton SW, Martin J, Pape D, Wylie T, Li L, Kissinger JC, Roos DS, Beverley SM: A survey of the Leishmania major Friedlin strain V1 genome by shotgun sequencing: a resource for DNA microarrays and expression profiling. Mol Biochem Parasitol 2001, 113: 337–340. 10.1016/S0166-6851(01)00227-4View ArticlePubMedGoogle Scholar
- Moreno-Hagelsieb G, Collado-Vides J: A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics 2002, 18(Suppl 1):S329–336.View ArticlePubMedGoogle Scholar
- Chu G, Vollrath D, Davis RW: Separation of large DNA molecules by contour-clamped homogeneous electric fields. Science 1986, 234: 1582–1585.View ArticlePubMedGoogle Scholar
- Sun LV, Foster JM, Tzertzinis G, Ono M, Bandi C, Slatko BE, O'Neill SL: Determination of Wolbachia genome size by pulsed-field gel electrophoresis. J Bacteriol 2001, 183: 2219–2225. 10.1128/JB.183.7.2219-2225.2001PubMed CentralView ArticlePubMedGoogle Scholar
- Wilhelm J, Pingoud A, Hahn M: Real-time PCR-based method for the estimation of genome sizes. Nucl Acids Res 2003, 31(10):e56. 10.1093/nar/gng056PubMed CentralView ArticlePubMedGoogle Scholar
- Salgado H, Moreno-Hagelsieb G, Smith TF, Collado-Vides J: Operons in Escherichi coli : Genomic analyses and predictions. Proc Nat Acad Sci 2000, 97: 6652–6657. 10.1073/pnas.110147297PubMed CentralView ArticlePubMedGoogle Scholar
- Westover BP, Buhler JD, Sonnenburg JL, Gordon JI: Operon prediction without a training set. Bioinformatics 2005, 21: 880–888. 10.1093/bioinformatics/bti123View ArticlePubMedGoogle Scholar
- Ermolaeva MD, White O, Salzberg SL: Prediction of operons in microbial genomes. Nucleic Acids Research 2001, 29: 1216–1221. 10.1093/nar/29.5.1216PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.