Mathematical design of prokaryotic clonebased microarrays
 Bart Pieterse^{1, 2, 3}Email author,
 Elisabeth J Quirijns^{4, 5},
 Frank HJ Schuren^{2} and
 Mariët J van der Werf^{1, 2}
DOI: 10.1186/147121056238
© Pieterse et al; licensee BioMed Central Ltd. 2005
Received: 19 May 2005
Accepted: 28 September 2005
Published: 28 September 2005
Abstract
Background
Clonebased microarrays, on which each spot represents a random genomic fragment, are a good alternative to open reading framebased microarrays, especially for microorganisms for which the complete genome sequence is not available. Since the generation of a genomic DNA library is a random process, it is beforehand uncertain which genes are represented. Nevertheless, the genome coverage of such an array, which depends on different variables like the insert size and the number of clones in the library, can be predicted by mathematical approaches. When applying the classical formulas that determine the probability that a certain sequence is represented in a DNA library at the nucleotide level, massive amounts of clones would be necessary to obtain a proper coverage of the genome.
Results
This paper describes the development of two complementary equations for determining the genome coverage at the gene level. The first equation predicts the fraction of genes that is represented on the array in a detectable way and cover at least a set part (the minimal insert coverage) of the genomic fragment by which these genes are represented. The higher this minimal insert coverage, the larger the chance that changes in expression of a specific gene can be detected and attributed to that gene. The second equation predicts the fraction of genes that is represented in spots on the array that only represent genes from a single transcription unit, which information can be interpreted in a quantitative way.
Conclusion
Validation of these equations shows that they form reliable tools supporting optimal design of prokaryotic clonebased microarrays.
Background
In the past decade, whole transcriptome comparison by microarray hybridizations has proven to be an effective tool for studying genome wide gene responses. The general approaches for the development of microarrays are based on the completely annotated genome sequence of an organism. Usually each spot on the array represents one open reading frame (ORF). Whereas this approach has clear advantages for strains for which the complete annotated genome sequence is available, it is not applicable to strains for which this is not the case.
A method that allows for the rapid construction of microarrays for which the completely annotated genome sequence is not required is by the construction of a clonebased array. In this approach, a chromosomal DNA library is constructed from the strain of interest. From this library the genomic fragments, the inserts, are amplified from the clones by PCR with generic primers and spotted on the arrayslide [1, 2].
The major differences between ORFbased and clonebased arrays with respect to the data interpretation are that in case of clonebased arrays the differential signals can only be linked to a specific gene after the DNA fragment from the spot of interest on the array has been sequenced, and that it is beforehand uncertain whether a gene is represented on the array. Moreover, whereas ORFbased microarrays exclusively generate gene specific data, a differential signal within a spot on a clonebased array can originate from multiple genes on the insert that are not necessarily linked at the transcriptional level.
The extent of these limitations can be quantified by estimating the genome coverage by the spots present on the array. The standard formulas for estimating the genome coverage of a DNA library, the ClarkCarbon equation [3] and the LanderWaterman equation [4], determine this coverage at the nucleotide level. In other words, they consider the genome as a set of nucleotides, which is useful when the library is to be used for genome sequencing. However, these formulas will overestimate the required number of clones for hybridization purposes. The reason for this is that for hybridization purposes small overlapping fragments that allow for specific binding of the labeled cDNA suffice. Akopyants et al. [5] developed an equation for the estimation of the fraction of genes that are at least partially represented. This formula is directly derived from classical probability calculations and contains the organism specific variables genome size and average gene size. Due to the fact that Akopyants et al. determine the genome coverage at the gene level, and consider a gene represented if a fragment is present that is large enough to hybridize to and large enough to identify the gene, the required number of clones to obtain a certain coverage is reduced.
A general drawback of these three formulas is that they give no insight into the fraction of genes for which specific data can be generated in a transcriptomics experiment. The data from a spot are considered specific if the expression ratios from the quantified signal from that spot can directly be related to the gene(s) represented by the spot. This is not the case if DNA from multiple (neighboring) transcription units is present in one spot, since it would be uncertain which gene is responsible for which part of the total signal from that spot.
In this paper, two formulas were developed that enable for mathematical predictions of genome coverage by a prokaryotic clone basedarray at the gene level as a function of genome size, number of clones, insert size, and either the minimal part of the insert that is covered by the gene or the minimal overlap of the gene and the insert: the minimal insert coverage (MIC) equation, and the gene specific information (GSI) equation.
Overview of prokaryotes from several genera with their genes/transcription unitratio. Microorganisms that were used for model development (M) or validation (V) of the MIC and the GSIequation are depicted in the list.
Genus  Organism  genes/TU (R)  Model (M) or validation (V) strain 

Proteobacteria Gammaproteobacteria Enterobacteriales  Escherichia coli K12 MG1655  1.6  
Escherichia coli O157:H7 EDL933  1.6  
Escherichia coli CFT073  1.6  M  
Salmonella typhi CT19  1.4  
Salmonella typhimurium LT2  1.6  
Yersinia pestis CO92  1.4  
Shigella flexneri 2a str. 2457T  1.5  
Buchnera aphidicola Sg  1.5  V  
Wigglesworthia glossinidia  1.5  
ProteobacteriaGammaproteobacteria Pasteurellales  Haemophilus influenzae Rd  1.7  
Pasteurella multocida PM70  1.7  V  
Proteobacteria Gammaproteobacteria Xanthomonadales  Xylella fastidiosa 9a5c  1.5  
Xanthomonas campestris ATCC33913  1.5  V  
Proteobacteria Gammaproteobacteria Vibrionales  Vibrio cholerae El Tor N16961  1.8  M 
Vibrio parahaemolyticus RIMD2210633  1.5  
Vibrio vulnificus CMCP6  1.5  
Proteobacteria Gammaproteobacteria Pseudomonadales  Pseudomonas aeruginosa PA01  1.6  M 
Pseudomonas putida KT2440  1.6  
Proteobacteria Gammaproteobacteria Legionellales  Coxiella burnetii RSA 493  1.6  
Proteobacteria Betaproteobacteria  Neisseria meningitidis Z2491  1.6  M 
Ralstonia solanacearum GMI1000  1.6  
Proteobacteria Epsilonproteobacteria  Helicobacter pylori 26695  2.3  M 
Campylobacter jejuni NCTC11168  2.7  M  
Proteobacteria Alphaproteobacteria  Rickettsia prowazekii Nadrid E  1.4  V 
Sinorhizobium meliloti 1021  1.5  
Agrobacterium tumefaciens C58  1.5  
Brucella suis 1330  1.5  
Caulobacter crescentus  1.5  
Firmicutes Bacillales  Bacillus subtilis 168  1.6  M 
Oceanobacillus iheyensis HTE831  1.6  
Stapylococcus aureus MW2  1.6  
Listeria monocytogenes EGDe  1.8  M  
Listeria innocua Clip11262  1.8  
Firmicutes Clostridia  Clostridium acetobutylicum ATCC824  1.6  
Clostridium tetani E88  1.6  
Thermoanaerobacter tengcongensis MB4T  2.0  
Firmicutes Lactobacillales  Lactococcus lactis IL1403  1.5  M 
Streptococcus agalactiae 2603  1.8  
Streptococcus pneumoniae R6  1.8  
Lactobacillus plantarum WCFS1  1.6  M  
Enterococcus faecalis V583  1.8  
Firmicutes Mollicutes  Mycoplasma pneumoniae M129  2.1  M 
Mycoplasma genitalium G37  3.1  V  
Mycoplasma penetrans HF2  1.6  
Ureaplasma urealyticum (serovar 3)  2.1  
Actinobacteria  Mycobacterium tuberculosis H37Rv  1.7  M 
Corynebacterium glutamicum ATCC 13032  1.5  
Streptomyces coelicolor A3(2)  1.4  
Tropheryma whipplei Twist  1.9  
Bifidobacterium longum NCC2705  1.3  V  
Fusobacteria  Fusobacterium nucleatum ATCC25586  2.0  V 
Chlamydia  Chlamydia trachomatis (serovar D)  1.6  
Chlamydophila pneumoniae AR39  1.6  
Spirochete  Borrelia burgdorferi B31  1.8  
Treponema pallidum Nichols  1.9  
Leptospira interrogans 56601  1.5  
Bacteroid  Bacteroides thetaiotaomicron VPI5482  1.8  M 
Cyanobacteria  Thermosynechococcus elongatus BP1  1.6  
Nostoc sp. PCC 7120  1.2  
Green sulfur bacteria  Chlorobium tepidum TLS  1.6  
Deinococcus  Deinococcus radiodurans R1  1.5  V 
Hyperthermophilic bacteria  Aquifex aeolicus VF5  2.1  
Thermotoga maritima MSB8  3.0  V  
Archae Euryarchaeota  Methanococcus jannaschii DSM2661  1.8  M 
Pyrococcus furiosus DSM3638  2.0  M  
Archaeoglobus fulgidus DSM4304  2.1  
Thermoplasma acidophilum DSM1728  1.5  
Methanosarcina acetivorans C2A  1.3  V  
Methanosarcina mazei Goe1  1.3  
Pyrococcus abyssi  2.1  
Archae Crenarchaeota  Aeropyrum pernix K1  2.0  
Sulfolobus solfotaricus P2  1.6  
Pyrobaculum aerophilum IM2  1.7 
Description of the developed equations
Minimal Insert Coverage (MIC)equation
Since the generation of inserts for a genomic DNA library is a random process, a large part of the represented genes may be corepresented with other genes by one spot on the microarray. This complicates data interpretation since it introduces an uncertainty on which gene or genes are responsible for differential signals from these spots. The impact of differential expression of a specific gene on the observed difference of the signal from a spot will be larger when a larger part of the genome fragment in that spot is covered by that gene. Moreover, the larger the part of the insert that is covered by a specific gene, the larger the chance that differential signals for the spot can be attributed to that gene, and the higher the chance that differential expression levels from that gene result in a statistically significant differential signal on the array.
The MICequation anticipates to this effect by predicting the number of genes that are (at least partially) present on an insert and cover at least a predefined part of the insert (DIC). This predefined part is defined as a percentage of the total insert. E.g. if the insert size is 1000 base pairs and the predefined minimal insert coverage (DIC) is set at 50%, then at least 500 bp of that gene should be present on an insert to be considered as represented by the array. Genes smaller than the size of the predefined part of the insert, are considered as not represented on the array.
Gene Specific Information (GSI)equation
Dataset preparation
Fifteen prokaryotes from various genera were selected as model species (Table 1). Genome data from these microorganisms were used for the generation of speciesspecific values for the expected fraction of represented genes as a function of the genome size (GS), number of clones (N), insert size (IS), and either DIC or O_{ mf }. Coordinates from all annotated genes from these organisms were obtained from GenBank, and were used to determine the gene sizes. In addition, information was obtained on the start and stop coordinates from the transcription unit to which the gene belongs, and the position of the gene in this transcription unit. It was assumed that transcription units start at the first base pair of the first gene and finish at the last base pair of the last gene. This information was generated by the combination of intergenic region based transcription unit predictions, generated by MorenoHagelsieb and ColladoVides [6], with gene coordinates from GenBank.
The genome size (GS) could be included as a fictitious variable in the datasets, since not the speciesspecific genome size, but the speciesspecific gene size distribution and genome organization in genes and transcription units were relevant.
It was assumed that each possible genome fragment of the size of the insert size (IS) has an equal chance of being represented. To achieve this, fragments should be generated by physical fragmentation, and not by the use of endonucleases.
Dataset preparation for the fitting procedure for the MICequation
Overview of the variables that were used for the model datasets on which the MIC and the GSIequation are based. Multiple combinations of the mentioned values were applied.
Variable  Values 

N  500; 1500; 2500; 3500; 4500; 5500; 6500; 7500; 8500; 9500 
IS  100; 300; 500; 700; 900; 1100; 1300; 1500; 2100; 2700; 3000 
GS  0.5; 1.5; 2.5; 3.5; 4.5; 5.5; 6.5; 7.5; 8.5; 9.5 
O _{ mf }  50; 100; 150; 200; 250; 300; 350; 400; 450 
DIC  10; 20; 30; 40; 50; 60; 70; 80; 90 
The following formulas were developed for the calculation of the probability value per gene:
${O}_{mv}=\frac{IS\cdot DIC}{100}(1)$
$Gene>{O}_{mv}\Rightarrow p=1{\left(1\frac{Gene+1+IS2\cdot {O}_{mv}}{GS}\right)}^{N}(2)$
Gene <O_{ mv }⇒ p = 0 (3)
Dataset preparation for the fittimg procedure for the GSIequation
For each organism, the fraction of genes for which specific information could be generated was determined for 114 different combinations of the number of clones (N), fictitious genome size (GS), the insert size (IS), and the minimal required overlap (O_{ mf }) in the ranges depicted in Table 2. The represented fraction was determined by taking the average of the probability values per gene. Formulas were developed that describe different situations with respect to the localization and organization of the gene of interest on the insert (eq 4 – 15)
Formulas that were developed to determine the probability value for genes that are transcribed into a single gene transcript:
$Gene\ge IS\Rightarrow p=1{\left(1\frac{Gene+1IS}{GS}\right)}^{N}(4)$
Gene <IS ⇒ p = 0 (5)
Formulas that were developed to determine the probability value for genes that are at the beginning of a transcription unit:
BP_{ e }≤ IS  O_{ mf }⇒ O_{ e }= IS  BP_{ e } (6)
BP_{ e }> IS  O_{ mf }⇒ O_{ e }= O_{ mf } (7)
$B{P}_{e}+Gene>IS\Rightarrow p=1{\left(1\frac{Gene+1{O}_{e}}{GS}\right)}^{N}(8)$
BP_{ e }+ Gene <IS ⇒ p = 0 (9)
Formulas that were developed to determine the probability value for genes that are flanked at both sides by other genes that belong to the same transcription unit:
BP_{ b }≤ IS  O_{ mf }⇒ O_{ b }= IS  BP_{ b } (10)
BP_{ b }> IS  O_{ mf }⇒ O_{ b }= O_{ mf } (11)
BP_{ e }≤ IS  O_{ mf }⇒ O_{ e }= IS  BP_{ e } (12)
BP_{ e }> IS  O_{ mf }⇒ O_{ e }= O_{ mf } (13)
$B{P}_{b}+B{P}_{e}+Gene>IS\Rightarrow p=1{\left(1\frac{Gene+1+IS{O}_{b}{O}_{e}}{GS}\right)}^{N}(14)$
BP_{ b }+ BP_{ e }+ Gene > IS ⇒ p = 0 (15)
Models and fits
The datasets with the expected fractions of represented genes for the various combinations of parameters as presented in the previous section functioned as template for the fitting of the predictive equation for MIC and GSI.
MIC equation
From equation 2, which was used to determine the probability value per gene, it became apparent that organismdependent gene size distribution influenced the expected number of represented genes on a clone based array. These organism dependent differences were neglected for the preparation of the MIC equation, which proofed to be justified when validating the MICequation (see validation section).
A polynome was developed as MIC model. In the polynome all variables were present in first and second order and in cross terms between two variables. Because of a high expected correlation between IS and DIC (based on equation 2), this relation was extended with a second order term composed of IS and DIC, resulting in:
p_{ MIC }= a + b_{1}·DIC + b_{2}·DIC^{2} + c_{1}·N + c_{2}·N^{2} + d_{1}·GS + d_{2}·GS^{2} + e_{1}·IS + e_{2}·IS^{2} + f·DIC·IS + g·DIC·N + h·DIC·GS + i·IS·N + j·IS·GS + k·GS·N + l·(IS·DIC)^{2} (16)
The model datasets for the 15 model species were used together in the regression procedure to estimate the parameters in the MIC model. Linear regression using a standard least squares algorithm (fminsearch) provided by Matlab (The MathWorks) was applied to search the parameters that minimize the sum of squares (SSQ) defined as:
SSQ = ∑(p_{MIC,exp} p_{MIC,mod})^{2} (17)
Values for the parameters in the MIC and the GSIequation.
parameter  MIC equation  GSI equation  GSI equation 

q  r  
a  4.85E01  0.544  0 
b _{ 1 }  2.54E03  *  * 
b _{ 2 }  1.51E05  4.26E08  3.05E07 
c _{ 1 }  1.27E04  6.13E05  1.46E05 
c _{ 2 }  5.22E09  0  1.96E09 
d _{ 1 }  1.22E01  7.84E02  1.06E02 
d _{ 2 }  3.42E03  3.31E03  3.23E04 
e _{ 1 }  3.95E04  5.36E04  2.08E04 
e _{ 2 }  9.57E08  9.73E08  4.62E08 
f  9.85E06  1.69E08  3.42E08 
g  4.61E07  *  * 
h  3.25E04  2.55E06  4.12E06 
I  1.69E08  2.22E08  5.47E09 
j  2.01E05  2.42E05  6.04E06 
k  2.26E06  1.76E06  1.30E06 
l  2.60E11  *  * 
GSI equation
From the model datasets for the GSI equation it appeared that an organism dependent variable had a strong influence on the calculated number of represented genes (results not shown). Analysis revealed a positive correlation between the number of represented genes and the speciesdependent average number of genes per transcription unit, R. R was determined by dividing the total number of genes (GenBank) by the total number of predicted transcription units [6] (Table 1).
Startingpoint for the GSI model was a second order polynome for all variables, extended with the cross terms between two variables. A set of parameters was estimated for each individual model species (results not shown). Parts which appeared to contribute less than 1% to p_{ GSI }were not included, which resulted in the following relation:
p_{ GSI }= a + b_{2}·O_{ mf }^{2} + c_{1}·N + c_{2}·N^{2} + d_{1}·GS + d_{2}·GS^{2} + e_{1}·IS + e_{2}·IS^{2} + f·O_{ mf }·IS + h·O_{ mf }·GS + i·IS·N + j·IS·GS + k·GS·N (18)
For each prokaryote a set of parameters was obtained by minimizing the SSQ, equivalent to equation 17. The average absolute deviation of the GSI equation from the model datasets was 0.0258.
In order to obtain one generic equation for the organism specific relations for p_{ GSI }, the species specific values for the parameters (a  k) in equation 18 were related to the species related variable R by a linear relation:
parameter(a  k) = q + r·R (19)
in which R is species specific (Table 1). Since no dependency of a with R could be established, a was set at the average of all individual a values: 0.544. With this value the polynome was fitted again, and the final relations between the other parameters and R were determined (Table 3).
Validations
Overview of the variables and the values used for these variables that were used for the datasets that were used for the validation of the MIC and the GSIequation. All possible combinations of the mentioned values were tested.
Variable  Values 

N  1000; 4000; 7000; 10000 
IS  100; 500; 1000; 1500; 2000 
GS  1; 3; 5; 7 
DIC  25; 50; 75 
O _{ mf }  100; 200 
Reliability of the MIC and the GSIequation, depicted as the fraction of predictions that differ less than 0.01, 0.05 or 0.10 from the real values, for the validation sets defined in Table 4.
Abs (Δ predicted vs. real)  Fraction for MICequation  Fraction for GSIequation 

< 0.01  0.19  0.24 
< 0.05  0.58  0.73 
< 0.10  0.87  0.95 
Deviations between the predicted fractions by the MICequation and the true values as they were determined for the validation species are mainly to be attributed to speciesspecific gene size distribution. In order to obtain one generic equation, and based on the accuracy of the equation in its current form (Table 5), it was decided not to include a speciesspecific variable.
Prediction of the optimum value for the insert size (IS)
Whereas an increase in N will always have a positive contribution to the fraction of represented genes, and an increase in GS, O_{ mf }, and MIC a negative contribution, there may be an optimum IS that depends on the values of the other variables. This optimum can be estimated by differentiation of equation 16 and 18 to IS (dp/dIS).
For the determination of the optimal value for IS for the MICapproach this results in the following equation:
$\begin{array}{l}\frac{d{p}_{MIC}}{dIS}={e}_{1}+2{e}_{2}\cdot IS+f\cdot DIC+i\cdot N+j\cdot GS+2\cdot l\cdot DI{C}^{2}\cdot IS=0\Rightarrow \hfill \\ I{S}_{MICopt}=\frac{{e}_{1}f\cdot DICi\cdot Nj\cdot GS}{2{e}_{2}+2l{(DIC)}^{2}}\hfill \end{array}(20)$
For the determination of the optimal value for IS for the GSIapproach the equation is as follows:
$\begin{array}{l}\frac{d{p}_{GSI}}{dIS}={e}_{1}+2{e}_{2}\cdot IS+f\cdot O+i\cdot N+j\cdot GS=0\Rightarrow \hfill \\ I{S}_{GSIopt}=\frac{{e}_{1}f\cdot Oi\cdot Nj\cdot GS}{2{e}_{2}}\hfill \end{array}(21)$
If the indicated values for IS_{opt} are outside the range of 0 to 2000 bp (the range that was applied for validation of the models) no optimum can be identified within the boundaries of the model. In these cases small values of IS will give the best results.
Influence of the average number of genes per transcription unit (R) on the predicted values
From the input variables for the MIC and GSI formulas, N, IS, DIC and O_{ mf }are userdefined, while GS and R have to be estimated for the specific organism. Whereas current techniques allow for rapid and accurate estimations of GS [7–9], the organism specific value for R is difficult to determine for species from which little sequence information is available.
R was determined for 73 prokaryotes from multiple genera, as previously described in the "models and fits" section (Table 1). For 61 of the 73 strains in this list, R was within the narrow range from 1.5 – 2.0. Moreover these data indicate that accurate estimations of R can be made, based on the genus of the organism, with an exception for the mollicutes, the hyperthermophilic bacteria and the euryarchaeota.
Effect of false estimations of R on the fraction of predictions that differ less than 0.01, 0.05 or 0.10 from the real values, for the validation set defined in Table 4.
Applied value for R  Abs (Δ predicted vs. real)  Fraction 

R  < 0.01  0.24 
< 0.05  0.73  
< 0.10  0.95  
R  0.1  < 0.01  0.17 
< 0.05  0.65  
< 0.10  0.95  
R + 0.1  < 0.01  0.24 
< 0.05  0.75  
< 0.10  0.94  
R  0.2  < 0.01  0.12 
< 0.05  0.55  
< 0.10  0.90  
R + 0.2  < 0.01  0.23 
< 0.05  0.69  
< 0.10  0.91  
R  0.3  < 0.01  0.10 
< 0.05  0.38  
< 0.10  0.81  
R + 0.3  < 0.01  0.21 
< 0.05  0.59  
< 0.10  0.88 
Application
Plots like those depicted in figures 3a–c and 4 can be used to determine the preferred combination of the number of spots on the array and the insert size. If for instance the number of spots would be limited to 6000, an insert size of approximately 800 bp would be optimal with respect to the fraction of genes that are represented with a minimal insert coverage of 25% (Fig. 3a). From equation 20 this optimum appears to be 803 bp. With this combination of array parameters the predicted fraction of genes that cover at least 25% of the insert (which equals 803 × 0.25 = 201 bp) is 0.75 (eq. 16). Meanwhile the predicted fraction of genes for which gene specific information can be generated is 0.49 (eq. 18). If the specificity of the data is considered to be more important than the amount of represented genes, it is preferable to have an optimum value for p_{MIC} for higher values of DIC (e.g. Fig. 3c) and a high value for p_{GSI} (Fig. 4). These requirements are best fulfilled by combinations with low values for the insert size.
A Microsoft Excel fill inspreadsheet that allows for calculations of pGSI, pMIC, and the optimal values for the insert size, is available as additional file with this paper [see Additional file 1].
Discussion
Classical approaches for the construction of DNA libraries form a suitable base for the construction of clonebased microarrays. However, as the construction of these libraries is a random process, it is beforehand uncertain whether a gene or transcription unit will be uniquely represented on a separate insert on the array. Genome coverage by a DNA library is usually determined by calculating the expectation that each single nucleotide from that gene is present [3, 4]. These formulas will overestimate the number of clones required when the library is to be used for the construction of a microarray, since for this purpose partial representation of a gene is sufficient for hybridization.
To our knowledge, Akopyants et al. were the first to estimate genome coverage at the gene level [5]. They predicted the fraction of represented genes using equation 22:
${p}_{Akopyants}=1{\left(1\left(\frac{\text{average transcript size}+\text{insert size}2\times \text{required overlap}}{\text{genome size}}\right)\right)}^{\text{number of clones}}(22)$
An important variable in this formula is the average transcript size. However, use of this variable is not legitimate for this type of probability calculations since the average probability per gene (the required information) is not per se equal to the probability per average gene. When we validated the Akopyants formula on the same dataset that was applied for the validation of the MICequation, it appeared that 49% of the predictions deviated more than 0.1 from the real value (calculated as the average chance per gene), with a strong tendency to overestimation. The Akopyants formula therefore appears unreliable for calculating optimal library sizes
None of the previous formulas give insight in the fraction of genes for which gene specific information can be generated, while this is one of the most important features when one is interested in studying differential gene expression. The MICand GSIequations that were developed in this study allow for good estimations of both the genome coverage at the gene level, and the fraction of genes for which gene specific transcription information can be generated.
Whereas the MICequation is rather straightforward with respect to the input variables and interpretation, application of the GSIequation requires the estimation of the average number of genes per transcription unit for an organism. Although a false estimation of this variable could lead to a wrong prediction of the represented fraction, Tables 1 and 6 indicate that this risk is limited.
The GSIequation is partially based on operon predictions. For the development of the model and validation datasets we used loglikelihood based transcription unit predictions for adjacent pair of genes to be in the same operon [10]. This loglikelihood based prediction method is only applicable to organisms for which at least large parts of the genome have been sequenced, and will therefore not be useful when sequence data from array spots for which differential expression was identified, have to be interpreted. Nevertheless, good predictions can be made on whether or not genes that are corepresented in a single spot on the array belong to the same transcription unit. Strong indications can already be obtained from the physical organization of the DNA fragment of interest, like gene orientation and intergenic distance [6, 11]. Other indications are the cooccurrence of genes with a joint function, and the conserved organization of homologous genes in other prokaryotes [11, 12].
Conclusion
The MIC and GSIequations that were developed in this study were based on genomes from 15 prokaryotes from different genera, and validated on the genomes of 10 other prokaryotes. These validations show that these equations form reliable tools for optimal design of prokaryotic clonebased microarrays within the ranges that were tested (Table 4), and that they are applicable to a broad range of prokaryotes. Therefore, these equations form a good basis for the design of microarrays for prokaryotes from which the genome sequence is not available.
List of abbreviations
 BP _{ b } :

number of base pairs within the operon in front of the specific gene [bp]
 BP _{ e } :

number of base pairs within the operon behind the specific gene [bp]
 DIC :

predefined minimal insert coverage, i.e. the minimal required representation of the gene on the insert [%]
 Gene :

gene size [bp]
 GS :

genome size [Mbp]
 IS :

insert size [bp]
 IS _{optMIC} :

Optimal value of IS for the MIC equation []
 IS _{optGSI} :

Optimal value of IS for the GSI equation []
 N :

number of clone []
 O _{ b } :

overlap of the fragment and the beginning of the gene [bp]
 O _{ e } :

overlap of fragment and the end of the gene [bp]
 O _{ mf } :

minimal required overlap of the fragment and the gene (fixed) [bp]
 O _{ mv } :

minimal required overlap of the fragment and the gene (variable) [bp]
 p :

gene specific probability value []
 p _{ GSI } :

predicted fraction of specifically represented genes []
 p _{ MIC } :

predicted fraction of represented genes represented with a minimal insert coverage []
 R :

average number of genes per transcription unit []
 SSQ :

Residual Sum of Squares []
 al :

parameters in MIC or GSI model []
Declarations
Acknowledgements
We would like to thank Rolf Boesten, Martien Caspers, Nicole van Luijk and Karin Overkamp for their critical remarks and useful suggestions.
Authors’ Affiliations
References
 Cho JC, Tiedje JM: Bacterial species determination from DNADNA hybridization by using genome fragments and DNA microarrays. Appl Environ Microbiol 2001, 67: 3677–3682. 10.1128/AEM.67.8.36773682.2001PubMed CentralView ArticlePubMedGoogle Scholar
 Askenazi M, Driggers EM, Holtzman DA, Norman TC, Iverson S, Zimmer DP, Boers ME, Blomquist PR, Martinez EJ, Monreal AW, Feibelman TP, Mayorga ME, Maxon ME, Sykes K, Tobin JV, Cordero E, Salama SR, Trueheart J, Royer JC, Madden KT: Integrating transcriptional and metabolite profiles to direct the engineering of lovastatinproducing fungal strains. Nat Biotechnol 2003, 21: 150–156. 10.1038/nbt781View ArticlePubMedGoogle Scholar
 Clark L, Carbon J: A colony bank containing synthetic Col E1 hybrids representative of the entire E. coli genome. Cell 1976, 9: 91–99. 10.1016/00928674(76)900556View ArticleGoogle Scholar
 Lander ES, Waterman MS: Genomic mapping by fingerprinting random clones: a mathematical analysis. Genomics 1988, 2: 231–239. 10.1016/08887543(88)900079View ArticlePubMedGoogle Scholar
 Akopyants NS, Clifton SW, Martin J, Pape D, Wylie T, Li L, Kissinger JC, Roos DS, Beverley SM: A survey of the Leishmania major Friedlin strain V1 genome by shotgun sequencing: a resource for DNA microarrays and expression profiling. Mol Biochem Parasitol 2001, 113: 337–340. 10.1016/S01666851(01)002274View ArticlePubMedGoogle Scholar
 MorenoHagelsieb G, ColladoVides J: A powerful nonhomology method for the prediction of operons in prokaryotes. Bioinformatics 2002, 18(Suppl 1):S329–336.View ArticlePubMedGoogle Scholar
 Chu G, Vollrath D, Davis RW: Separation of large DNA molecules by contourclamped homogeneous electric fields. Science 1986, 234: 1582–1585.View ArticlePubMedGoogle Scholar
 Sun LV, Foster JM, Tzertzinis G, Ono M, Bandi C, Slatko BE, O'Neill SL: Determination of Wolbachia genome size by pulsedfield gel electrophoresis. J Bacteriol 2001, 183: 2219–2225. 10.1128/JB.183.7.22192225.2001PubMed CentralView ArticlePubMedGoogle Scholar
 Wilhelm J, Pingoud A, Hahn M: Realtime PCRbased method for the estimation of genome sizes. Nucl Acids Res 2003, 31(10):e56. 10.1093/nar/gng056PubMed CentralView ArticlePubMedGoogle Scholar
 Salgado H, MorenoHagelsieb G, Smith TF, ColladoVides J: Operons in Escherichi coli : Genomic analyses and predictions. Proc Nat Acad Sci 2000, 97: 6652–6657. 10.1073/pnas.110147297PubMed CentralView ArticlePubMedGoogle Scholar
 Westover BP, Buhler JD, Sonnenburg JL, Gordon JI: Operon prediction without a training set. Bioinformatics 2005, 21: 880–888. 10.1093/bioinformatics/bti123View ArticlePubMedGoogle Scholar
 Ermolaeva MD, White O, Salzberg SL: Prediction of operons in microbial genomes. Nucleic Acids Research 2001, 29: 1216–1221. 10.1093/nar/29.5.1216PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.