An efficient algorithm for the stochastic simulation of the hybridization of DNA to microarrays

Background Although oligonucleotide microarray technology is ubiquitous in genomic research, reproducibility and standardization of expression measurements still concern many researchers. Cross-hybridization between microarray probes and non-target ssDNA has been implicated as a primary factor in sensitivity and selectivity loss. Since hybridization is a chemical process, it may be modeled at a population-level using a combination of material balance equations and thermodynamics. However, the hybridization reaction network may be exceptionally large for commercial arrays, which often possess at least one reporter per transcript. Quantification of the kinetics and equilibrium of exceptionally large chemical systems of this type is numerically infeasible with customary approaches. Results In this paper, we present a robust and computationally efficient algorithm for the simulation of hybridization processes underlying microarray assays. Our method may be utilized to identify the extent to which nucleic acid targets (e.g. cDNA) will cross-hybridize with probes, and by extension, characterize probe robustnessusing the information specified by MAGE-TAB. Using this algorithm, we characterize cross-hybridization in a modified commercial microarray assay. Conclusions By integrating stochastic simulation with thermodynamic prediction tools for DNA hybridization, one may robustly and rapidly characterize of the selectivity of a proposed microarray design at the probe and "system" levels. Our code is available at http://www.laurenzi.net.


Background
Presently, there are several high throughput methods of quantifying changes in gene expression including oligonucleotide microarrays, quantitative realtime PCR (qPCR) and "next generation sequencing" (e.g. [1]). Of these, high density oligonucleotide microarrays are arguably the most important tools for genomic investigation. Although next generation sequencing is a promising alternative to microarrays for genome-scale expression profiling, and exhibit more sensitivity in the low-expression limit [2,3], microarray technology is substantially less expensive and the resulting data sets require much less information processing. Moreover, microarrays have substantially higher throughput than qPCR.
Microarrays consist of DNA probes (reporters) which are orderly arranged on a glass slide. Probes may be attached to (or synthesized from) the slide surface via (1) maskdependent [4][5][6][7][8][9] or maskless photolithographic DNA synthesis technology [10,11], or (2) robotic printing of PCR products or synthetic oligomers [12]. The first two of these methods yield oligonucleotide arrays (e.g. Affymetrix GeneChips), which are more reliable than PCR productbased arrays; they comprise the majority of commercial arrays by market share. Thus, we restrict our considerations to microarrays of these types.
Each probe is designed to hybridize with a specific cellderived and fluorolabeled DNA species such as cDNA or cRNA [6,13]. If these targets originate from mRNA, then the fluorescence associated with each microarray feature is assumed to be proportional to the amount of each transcript [14][15][16]. However, in recent years, this assumption has been called into question. Several studies have shown that different microarray assays yield different results when used to quantify differences in expression (e.g. [17]). Moreover, significant differences have also been reported among the results of microarray and qPCR assays [18][19][20][21]. Although the MicroArray Quality Control consortium confirmed a "high level of interplatform concordance in terms of genes identified as differentially expressed" for several commercial microarrays targeting the human transcriptome in 2006 [22], the reliability of many other array designs and experimental protocols have not been characterized systematically.
Theoretical analyses of smaller systems have suggested that problems with probe reliability are thermodynamic in origin [23,24]. Sequence and GC content heuristics are commonly employed in the design of probes [25,26]. However, these heuristics do not preclude the possibility of finite lengths of complementary sequence between probe candidates and non-target ssDNA. Consequently, all probe-ssDNA interactions will exhibit favorable energetics of interaction to some extent, and there will always be a finite amount of cross-hybridization between probes and non-target cDNA. However, experimental identification and quantification of the relative amounts of correctly-hybridized and cross-hybridized probes is impractical, since the only measurement of an array reader is the fluorescence associated with each microarray feature.

Chemical Dynamics of DNA Hybridization
However, the relative amounts of these species may be quantified using population-balances combined with an appropriate thermodynamic model of hybridization. Consider a microarray with N P oligonucleotide probes that target many if not all of the N T solution-phase ssD-NAs (e.g. cDNA) The hybridization network may be written as where ᐍ ∈ (0, N T ] and m ∈ (0, NP]. Note that N P and N T needn't be equal, since an array needn't measure the expression levels of all transcripts (N p < N T ). Most arrays have multiple reporters per target [27]. The deterministic time evolution of the process is completely specified by the following chemical population balance equations In Eq. ical solution of Eq. 2 until t ~∞ yields the equilibrium populations of desired hybrids (between probes and their targets) as well as cross-hybrids. Unfortunately, this deterministic approach exhibits certain practical pitfalls. Since the reaction rate constants and cDNA populations vary over many orders of magnitude, Eqs. 2 are "stiff". The size of the hybridization reaction network compounds the problem; typical genomic assays are designed to measure the expression levels of thousands of transcripts. For example, baker's yeast (S. cerevisiae) possesses approximately 6,700 genes (N T = 6700) [28] and humans possess approximately 25,000 [29]. Since most microarrays feature one or more distinct reporters (N P ) for each target, the size of the hybridization network (2N P N T reactions) will be enormous. For these genomes, Eq. 2 represents millions to billions of stiff ordinary differential equations.

Stochastic Simulation
Alternately, one may utilize the stochastic approach to chemical kinetics, which underlies the aforementioned rate equations [30,31] is added to an array with N P surface-bound probes at populations = N (m = 1 ... N P ), where again, the superscript "0" denotes the initial state. Upon addition of the solution to the slide, unhybridized probes and targets will randomly hybridize in accordance with Eq. 1. Assuming perfect mixing and isothermal hybridization -both ostensibly achieved with most assays -the state of this system may be defined in terms of the populations of unhybridized probes , unhybridized target DNA , hybrids , and the hybridization volume, V. The probability of a transition from one state to another is defined by the stoichiometry of Eq. 1. Since ssDNA molecules directly interact to form a dsDNA hybrid, is the probability that probe m will hybridize with cDNA ᐍ within the imminent time interval δt, and is the probability that any of the hybrids composed of probe m and cDNA ᐍ will dehybridize within the imminent time interval δt. We recognize and as the rates of the forward and reverse reactions of Eq. 1, respectively. Eqs. 3 and 4 are microphysically valid [32][33][34][35] and are in fact the basis of the validity of the aforementioned rate equations (Eq. 2). We refer interested readers to Gillespie's paper on this subject [31].
Eqs. 3 and 4 are the bases of the stochastic simulation algorithms (SSAs) [34,36]. SSAs simulate the time evolution of a chemical process by repetitively (1) selecting a reaction μ among a set M of potential reactions, (2) selecting the time τ until that reaction occurs, (3) updating the state of the system to reflect the occurrence of the selected reaction, and (4) updating the time. Exact SSAs differ only in how the first two steps are implemented. Each methodology features its own memory and speed enhancements and restrictions.

Direct Method
The Direct Method samples two exact density functions to obtain the quiescence time and imminent reaction event.
The selection rule for the quiescence time is where a 0 is the sum of all reaction rates and r 1 ~U (0, 1) (i.e. r 1 is a uniform random number by contrast to the MC selection of just one quiescence time in the Direct Method. One may show that the smallest of these times (τ μ ) is the time at which the next reaction (μ) occurs, and that this selection rule is an exact MC selection from the reaction probability density function, like Eqs. 5 and 6 [37].
Another noteworthy difference between the Direct Method and the Next Reaction Method is the employment of a "dependency graph" by the latter. This data ) r e a c t i o n objects, which is prohibitively large for genome-sized microarray simulations; Even current supercomputers with terabytes of memory cannot meet these requirements.

Algorithm
In simulations of microarray hybridization, the extreme computational burden of the Direct Method and the memory burden of the Next Reaction Method may be alleviated by judicious storage and summation of the terms in Eq. 6. Our algorithm employs a data structure called a Hybridization Table (HT) that stores partial sums of the terms in Eq. 6. For this reason, we call our approach the "method of partial sums" (Algorithm 1).
Calculate the auxiliary variables (α j , j = 1, 2 ... NP (Eq. 7), ϕ j , j = 1, 2 ... N P (Eq. 8), α and ϕ (Eqs. 9, 10) repeat Calculate the total reaction rate α 0 (Eq. 11) and quiescence interval τ (Eq. 5) if the next reaction is a hybridization (Eq. 12) then Select the hybridizing probe m and cDNA ᐍ (Eq. 14, 17) Hybridization Table  The   the table also contains the rates at which probes of type j will hybridize with any target, and the rates at which all hybrids composed of probes j will dissociate, Two additional quantities are stored in the table: the total rate of hybridization and the total rate of dehybridization The total rate of reaction may then be calculated from Eqs. 7, 8, 9, and 10 effectively subdivide the running sum in Eq. 6 into independently manageable groups of information. This in turn reduces the number of operations required for event selection reaction rate updates.

Reaction event selection
The selection of the imminent hybridization or dehybridization event is accomplished by summing the terms of Eq. 6 such that the total number of operations in Eq. 6 (including the calculation of a 0 ) is minimized. We begin by ordering the reactions according to column and row in the HT ( Fig.  1) such that . Eq. 6 is then implemented by summing the reaction rates corresponding to each entry in the HT, by column and then by row: first, by hybridization events, and then by dehybridization events.
One begins by identifying whether or not the imminent event will be a hybridization. Since the imminent event is defined by the reaction whose rate causes the sum of α ν to exceed the quantity r 2 α 0 , and all hybridization reactions precede the dehybridization reactions in the HT, it follows that If Eq. 12 results in the selection of a hybridization event, α >r 2 a 0 , and one needn't consider the dehybridization rates in the selection of the event to come.
The next step is the selection of the probe (m) that will hybridize in the imminent event. Using the notation of the Direct Method and the order of summation, the index of the event to come (μ) must be less than or equal to N P N T , where is the last of the hybridization rates in the table (Fig. 1). Eq. 12 becomes In essence, the event is selected by summing the rates corresponding to each column of the "hybridization" section of the HT, column by column, followed by row, until the quantity r 2 a 0 is exceeded. Noting that the quantity is equal to α j (Eq. 7), Eq. 13 may be simplified to which defines the probe that will hybridize in the imminent event.
The equation defining the selection of the target may be derived similarly First, we express the sums on the right and left sides of Eq. 13 as and Simplifying these expressions, we obtain (10) If the event to come is a dehybridization, then α <r 2 a 0 .
Hence, the sum of all hybridization rates may be subtracted from each term in Eq. 6, leaving where {a ν } are the rates of the dehybridization reactions and we have explicitly noted the fact that all N P N T hybridization rates are contained within α. Expressing this in terms of the hybrid indices, we obtain Again, the selection of the the event may be simplified using the auxiliary variables in the HT. If the probe m is released in the imminent event, the value of m may be defined by sequential addition of the values of {Φ j } until the quantity (r 2 a 0 -α) is exceeded Note that this also defines the identity of the probe that will dissociate from the target. The cDNA to be released in the dehybridization event (ᐍ) may be calculated by subtraction of Eq. 20 from Eq. 19: The selection rules for hybridization (Eqs. 14, 17) and dehybridization (Eqs. 20, 21) substantially reduce the number of operations required by Eq. 6. Whereas the Direct Method may require 2N P N T operations to select a reaction, our selection rules require at most (N P + N T ) operations. For a genome-sized microarray simulations, with N P ~N T ~ 10 4 , our reaction selection approach will be several orders of magnitude faster than those of other algorithms.

System state accounting
Upon selection of the imminent event, the populations of the species involved ( , , and for the probe, target, and hybrid, respectively) must be updated in accordance with the stoichiometry of the reaction. The structure of the HT facilitates this procedure: the row and column of the selected event designate the identities of both the reactants and products as well as the stoichiometric changes in population.
However, other quantities must be updated, most notably the partial sums of reaction rates that facilitate event selection. The first quantity to be updated is the hybridization rate of the probe m where I = -1 for hybridization and +1 for dehybridization.
Since each partial sum α j is a function of the population of the ᐍth cDNA (Eq. 7), these quantities must also be updated: Subsequently, a must be recalculated using Eq. 9. Collectively, 2N P operations are required for the update of the hybridization section of the HT.
After a hybridization or dehybridization event, only one of the dehybridization rates must be updated. Thus, only one partial sum requires modification: Moreover, inasmuch as Φ m is the only affected partial sum, Φ may also be updated in one operation, Thus, only two operations are required to update the dehybridization section of the HT.
In summary, after selection of a reaction, the HT may be updated in O(N P ) operations, a substantial improvement over the Direct Method. This is largely a result of the fact that reaction rates (a ν ) are not stored in the HT, precluding the need to update N P rates of events featuring probe m and N T rates featuring target ᐍ. The same argument applies to the dehybrization reaction rates {Φ j }. Ultimately, the method of partial sums is efficient so long as the partial sums can be updated easily.

Determination of equilibrium
Microarray analysis is predicated upon the assumption that the probes and solution-phase cDNA have equilibrated prior to scanning the slide and measuring the fluorescence associated with each feature. We follow this experimental convention in silico.
The hybridization process is at equilibrium when the rates of change of all chemical populations in Eq. 2 are zero, implying As straightforward as this criterion appears, it is difficult to employ in practice. Eq. 26 represents millions of comparisons for genome-scale microarrays, requiring O(N P N T ) operations per time step. This many operations would also be required if one determined the steady states of all molecular species, which would additionally require storage of the current state as well as previous states. Clearly, this is memory-prohibitive. Furthermore, Eq. 26 is exact only on average [30,32]. Hence, it will never be exactly satisfied at any point in time within a single simulation.
To circumvent these issues, we propose an alternate approach that employs the average total rates of hybridization (α) and dehybridization (Φ). Considering the definitions of these quantities (Eqs. 7, 8, 9, and 10) summation of Eq. 26 over all ᐍ and m yields the result that α = Φ at equilibrium. This criterion may be established using Student's t test for two populations with unknown means and standard deviations [38]. Strictly speaking, this is necessary but insufficient for the specification of thermodynamic equilibrium. However, it is remarkably effective as a heuristic. We implement it as follows: In standard hybridizations, imperfect mixing causes the transport of cDNA or cRNA to be diffusion limited [39]. This in turn presents an obstacle to hybridization, as the time required for a (large) target to diffuse to it's probe is large. This, in turn affects the kinetics. The most common solution to this problem is to increase the concentration of target cDNA or cRNA, which results in an abundance of Determination of equilibrium within simulations of hybridiza-tion these molecules at equilibrium. As we have discussed, our simulations feature perfect mixing. Thus, we may use substantially less solution-phase cDNA. As a result, free cDNA is sparse at equilibrium, which introduces fluctuation into a. By contrast, F exhibits little fluctuation because hybrid populations are fairly large compared to the change in population accompanying a (random) reaction.
Effect of reaction rate constants upon the steady state As discussed, the values of the kinetic rate constants should not affect the equilibrium state of our simulations provided that their ratios are constant (Eq. 35). This assumption may be formulated as a testable hypothesis as follows: If the values of and affect the equilibrium state of simulation, then they will affect the fractional occupancy of each probe m ∈ [1, N P ] defined by In this expression is the total population of probes m hybridized with any type of cDNA at equilibrium.
We test these hypotheses at the system level by performing simulations using rate constants That is, we perturb the rate constants generated via Eqs. 37 and 35 by multiplying their results by a uniform random number rᐍ m . Since a unique random number is generated for each probe m and cDNA ᐍ, one may quantify the effect of kinetic perturbations on the equilibrium state of a simulation via the hypothesis If kinetic rate constants significantly affect any of the fractional occupancies at the equilibrium state (t → ∞), then this hypothesis will fail.
Simulations of the hybridization of all 6256 modified Agilent probes with 6718 full-length yeast cDNA molecules were conducted for two types of perturbations. In the first study, r ~U (0, 1), where U (a, b) is a uniform random number on the interval (a, b). This allowed us to evaluate the effect of the widest variation of the rate con-stants. In the second study, r ~U (0.001, 1). Rate constants for all 42,027,808 hybridization events were independently perturbed. The results of our hypothesis tests are illustrated in Fig. 3.
In both cases, our results clearly indicate that the equilibrium state is unaffected by the values of the rate constants, as expected. Interestingly, the time required to reach equilibrium correlates with the heterogeneity of the dehybridization rate constants: deviations of the results for the two cases at 100 hours of CPU time (not hybridization time) indicate that the hybridization of many probes with solution-phase cDNA was incomplete in cases where their rate constants were perturbed by factors less than 10 -3 .
Analyses of the timeseries of the overall rate of reaction ( Fig. 2) revealed that the progression to equilibrium is considerably slower if r ~U (0, 1) than if r ~U (0.001, 1). This conforms with the experimental observations of Dai and coworkers [40], which demonstrated differences between the kinetics of specific and nonspecific hybridizations.

Comparison of stochastic simulation algorithms
Results of all SSAs applied to the process described by Eq. 1 should yield statistically indistinguishable results since they share a common stochastic process. In his seminal work [34], Gillespie showed that the "First Reaction Method" and "Direct Method" were equivalent. Subsequently, Gibson and Bruck demonstrated that their "Next Reaction Method" is equivalent to Gillespie's algorithms.
Since the Method of Partial Sums shares the mathematical underpinnings of the Direct Method, its results should be indistinguishable from those generated by the Next Reaction Method or Gillespie's algorithms, as well as the law of mass action.
We initially tested this hypothesis by conducting simulations of the hybridization of ten full length cDNA species from yeast to ten probes for those species. Five simulations were conducted via both the Method of Partial Sums and the Next Reaction Method. Additionally, we solved the corresponding population balance equations (Eqs. 2) for this illustrative hybridization process. Our results (Fig.  4) clearly show that all methods yield equivalent results.
Average populations for all 100 hybrid species could not be distinguished by simulation or calculation method via the T-test (p > 0.05).
Interestingly, several common differential equation solvers could not integrate Eqs. 2 for this ten by ten system from t = 0 until steady state, including the Matlab pack-ages ode23 and ode45. The two hundred equations necessary to model this small array required the use of the ode15s, which is designed for stiff sets of ODEs. By contrast, the stochastic simulation algorithms were unimpeded in their numerical progress. Although the stochastic simulation algorithms give indistinguishable results for the population states both in time and at equilibrium, their computational performances are significantly different. The differences in the computational speeds of the SSAs were evaluated by conducting simulations of the hybridization of full length yeast cDNAs to microarrays featuring 32, 64, 128, 256, and 512 probes targeting those cDNAs (N P = N T for all cases). Additionally, a simulation with 6256 probes and 6718 targets was conducted with the Method of Partial Sums. The same initial populations of targets and the same set of probes (Methods) were used in both the Next Reaction and MPS simulations. Estimation of the time at which equilibrium is attained was determined as discussed, and then used in simulations conducted with the Next Reaction Method. Our results are illustrated in Fig. 5, and the contrast is stark. The Method of Partial Sums outperforms the Next Reaction Method for simulations of microarrays of all sizes. It requires one hour to complete a simulation for an array large enough to be used for genomic characterization. By contrast, the Next Reaction Method is not capable of performing such simulations due to its memory requirements. If N P ~N T ~ 6000, the pointers required by the Next Reaction Method will consume approximately three terabytes of 64 bit computer memory by themselves; the HT requires no such pointers. Next Reaction simulations with as few as 512 probes required 12 hrs of CPU time to reach equilibrium, and for a computer with extraordinary memory, we forecast that simulations using the Next Reaction Method would require a month to simulate the hybridization to the Agilent yeast array.

Comparison of SSAs and the Law of Mass Action
In addition, the scaling of the CPU time with respect to the number of probes differs among the two stochastic simulation algorithms. Both algorithms feature powerlaw scaling, however, the Method of Partial Sums scales as , whereas the Next Reaction Method scales as (mean ± SE). These differences arise due to the differences in the data structures underlying the two meth-ods, which in turn affect the number of calculations required per time step.
Our algorithm permits two concurrent simulations of an 84 million-reaction system (e.g. the Agilent yeast array with cDNA from the yeast transcriptome) to reach equilibrium within two hours using 2.

Illustrative Example: Characterization of Cross Hybridization
The in silico characterization of cross hybridization is only as sensitive as the number of probe molecules per feature. Agilent arrays have 2.0 × 10 8 probe molecules per feature, facilitating the hybridization and measurement of as many target cDNAs and giving, in principle, as much resolution. However, the time required for stochastic simulations is proportional to the number of molecules therein [34,36,41]. Balancing these resolution and population considerations, we selected a hybridization volume of 0.275 nL corresponding to initial probe populations of 1000 molecules/feature and concomitant scaling of the feature diameter to maintain the surface concentration. We also employed a total concentration of 100 ng per 60 μL hybridization volume. At this concentration, only a few if any probes become saturated. This concentration is a tenfold dilution of that recommended by Agilent's protocol, however, it is is within the range of commercial oligonucleotide microarray protocols [42]. Moreover, the Agilent array effectively has a probe population four times higher than the one we used, as it has four redundant features per probe (the 4 × 44 design, Methods).
In Fig. 6 we illustrate a small subset of the hybrid populations predicted from our simulations. The average populations of hybrids were calculated from five replicate simulations, each of which required 1.1 hours of CPU time as discussed (Fig. 5) Computational performance of the Method of Partial Sums and Next Reaction Method Figure 5 Computational performance of the Method of Partial Sums and Next Reaction Method. CPU times for of simulations of hybridization until equilibrium are illustrated. The in silico time t required to establish equilibrium is determined by our method and utilized as an end point in "Next Reaction" simulations. Our algorithm outperforms the Next Reaction Method in both absolute terms as well as on a perprobe basis (frame).
hybridizes with cDNA for YOR393W (ERR1); in this case, the two ORFs are identical. The third hybrid, YMR323W (ERR3), shares all but twelve bases of the other two. Other cross hybridizing cDNA species have less overlap in sequence, but share the probe sequences completely or in part. All cross-hybrids listed in Fig. 6 share at least 90% of the target sequence. A probe's proclivity to cross hybridize in the presence of thousands of potentially competitive cDNAs may be expressed in terms of its selectivity, S, defined as the fraction of fluorescence intensity associated with a microarray feature that originates from the corresponding target. Mathematically, where is the population of hybrids composed of probe m and its target ( (m)), and is calculated via Eq. 28. For example, the selectivity of the aforementioned Agilent probe A_06_P7042 is 13.8%. In Fig. 7, we present a summary of the selectivities for all probes in the Agilent set. As these results show, the vast majority of the probes for this commercial microarray are selective (e.g A_06_P6109 in Fig. 6), and do not exhibit cross hybridi-zation when they bind yeast cDNA at the concentrations specified by the expression state. For other microarrays and probe sets designed by various publicly available software packages, the cross hybridization may be more extreme. Selectivity distribution for the Agilent probe set Figure 7 Selectivity distribution for the Agilent probe set. The black line illustrated the selectivities (Eq. eq:Selectivity) of all Agilent probes when the array is hybridized to yeast cDNAs at the initial concentrations described in Methods. The gray lines are results for four different initial initial conditions. The red line delimits 90% selectivity. The distribution is independent of the initial concentrations (x-axes are different for all five sets of results). About 100 probes will not be selective in any given microarray experiment.
Since the populations of all hybrids depend upon the populations of potentially cross-hybridizing cDNA molecules, the value of S for each probe depends upon the expression state of a cell. Interestingly, the distribution of S among the probes is independent of the expression state.
Simulations with different initial conditions corresponding to four additional MC-generated expression states yielded selectivity distributions that were indistinguishable from the aforementioned distribution (Fig. 7). This result suggests a method of characterizing the overall reliability of a proposed probe set. Since the distribution of selectivities is invariant with respect to the initial target populations (provided the total cDNA concentration does not force saturation of probes), a statistic that characterizes that distribution should be robust. We propose that the average selectivity can serve as such a metric. The use of such metrics should be used with care, however, inasmuch as they lend themselves to "ecological fallacy".

Implementation
Our software is implemented in C++ and executed on Apple G5 processors running MacOS X (Tiger). UNA-FOLD 3.5 is available for download at the DINAMelt server http://dinamelt.bioinfo.rpi.edu/ and compiles on a variety of operating systems.
Additionally, we have implemented the Next Reaction Method as described by Gibson and Bruck [37] in C++ for the purposes of comparing its performance with that of our algorithm.

Discussion
In recent years, advances in microarray manufacturing have opened the doors to custom microarray design. Individual researchers can now upload their own microarray probe sequences to one of many sites (e.g. Agilent's earray website) and have a custom array manufactured. In response, many probe design algorithms have been arisen to fill the needs of researchers intent on performing global investigations of gene expression [25,26,[43][44][45][46][47]. However, none of the probe sets generated by these algorithms have been evaluated in terms of their selectivity or proclivity to give a linear relationship between feature fluorescence and target concentration. Studies of this type carried out by the MicroArray Quality Control Consortium [22] were costly, involving dozens of participants. Such laborious quality control is not feasible for each and every probe set designed by a novel probe tool. However, robust population-based simulations of the hybridization process may be employed to evaluate candidate probe sets, given robust estimates of the thermodynamic free energies of hybridization.
In the tests of our simulation algorithm, we have utilized equilibrium constants that were calculated via the NN model of Allawi and SantaLucia. This model predicts freeenergies to within three significant digits of experimentally measured values for solution-phase hybridization [48] and has been validated for oligonucleotide arrays in which the oligonucleotides are connected to the surface by linkers, making them more "solution like" [49][50][51][52].
However, many studies have demonstrated that NN models do not accurately predict the thermodynamic properties of hybridization between solution-phase and surfacebound oligonucleotides (e.g. [53,54]).
Electrostatic interference in conventional microarrayswhich do not feature 3D linkers of the type employed by Weckx and coworkers [52], is a major reason for this discrepancy. Solution-phase Na + may shield the phosphate groups of both hybridized and single-stranded probes and targets [55]. Cations will also shield the negativelycharged glass surface to which the probes are bound by organizing the formation of a layer of counter ions. The surface density the oligonucleotide probes also plays a role, partly via steric interference and partly via changes in the charge distribution at the glass-water interface [50,[55][56][57][58]. Ultimately, surface electrostatic effects cause a location-dependent effect of mismatches [54,59].
Therefore, extra care must be taken when applying our algorithm to characterize cross hybridization in real microarray assays. If equilibrium constants are calculated using NN models (e.g. UNAFOLD [60,61], Pairfold [62,63] or BINDIGO [44]), corrections for the effect of the Debye layer should be introduced at a bare minimum. Theoretical predictions of the effect of the resulting "Debye layer" upon the melting temperature T m [64,65] have been confirmed via experimental measurements [66]. This correction may be applied to NN models for DNA-cRNA duplexes as well (e.g. the semi-empirical model of Wu and coworkers [67]), if simulations of cRNA-DNA assays are to be conducted.
We have not explicitly considered additional effects of single-strand secondary structure (SS) formation among fulllength cDNA or probes Nor, for that matter, have we considered the possibility of hybridization between cDNA molecules in solution The hybridization between probes, or folding thereof, is customarily not a substantial problem in oligonucleotide microarrays. Probe design software packages such as Oli- cDNA cDNA * cDNA cDNA cDNA cDNA goArray 2.0 routinely screen probes for secondary structure, and inter-probe hybrids are precluded due to spatial separation. The remaining interactions among cDNAs (folding and inter-hybridization) serve only to sequester single-stranded cDNA from the probes [68]. Inasmuch as this can only further degrade the sensitivity of probe-target interactions, we have restricted our algorithm to consider only probe-cDNA hybridization dynamics. Hence, the method we have presented for probe and array characterization (Figs. 6 and 7) gives the performance of a microarray under a "best case" scenario, even as it fully accounts for the interactions between oligonucleotide probes and full length cDNA targets at the system-scale.
For the end user we note that simulation-based characterization of putative microarray probe sets requires no more information than that contained in MAGE-TAB (Microarray Gene Expression Tabular) or MAGE-ML (Microarray Gene Expression Mark-up Language) formatted array information [69,70], which are required for publication of microarray data to ensure MIAME compliance. The Array Design Format (ADF) component of MAGE-TAB contains all of the probe sequence information and its target(s), whereas the Investigation Description Format (IDF) contains the experimental protocols, including the hybridization temperature and salt conditions. Additionally, the ADF contains information regarding the relationship between reporter sequences and features: there may be multiple features with the same reporter sequence, or alternately, signals from several reporters may be combined to produce a signal for a single gene (e.g. Affymetrix Gene Chips). As the total cDNA concentration is ostensibly included in the IDF, and the individual cDNA concentrations are randomly generated, the system-scale selectivity of any array can be computed via simulation from a proposed experimental protocol.
In this work, we have not explicitly considered the prediction of the time evolution of the hybridization process, focusing instead on the equilibrium state. Simulations employing our algorithm are most accurate when they utilize experimentally-determined rate constants [54,57,71]. However, care must be taken to ensure simulations are conducted for the same surface densities and ionic strengths employed in the experiments employed to estimate the rate constants, for the reasons previously discussed. Given accurate estimates of the rate constants and , the time required for microarray hybridizations may be estimated via the method illustrated in Fig. 3, as the time variable would represent the actual hybridization time.

Conclusions
In this work, we have developed an algorithm for the stochastic simulation of exceptionally large and complex probe-cDNA hybridization reaction networks that underlie microarray assays.
Using the method of partial sums in conjunction with the data structure we denote the "hybridization As a result of these innovations, our algorithm permits system-level simulation of the complete reaction network composed of all potential probe-target hybridizations (for any genome or array) without the need for high-performance computing. Furthermore, such simulations are now possible within a reasonable amount of time. Thus, given robust thermodynamic predictions of the free energies of DNA hybridization, one may obtain a conservative estimate of the reliability of a candidate probe set in silico.

Microarray specifications and simulation protocols
We investigated the hybridization of cDNA originating from S. cerevisiae to a reproduction of the Agilent Yeast Oligo Microarray Kit (V2, see Additional File 3). 6,256 of the probes on this array target yeast ORFs, each targeting one gene, and the remainder consist of randomly located oligonucleotide probes, control features, and empty spots. The Agilent array features multiple identical reporters, each of which is randomly distributed over the array surface to abrogate spatial artifacts. Insofar as our simulations treat the hybridization volume as homogeneous, we have not included redundant probes in our simulations. We have also removed control features that do not target yeast ORFs.
Agilent arrays of the type considered here possess of 65 μm wide features with surface probe densities of is 6.0 × 10 12 probe molecules per cm 2 [72]. Therefore each (redundant) feature consists of 2.0 × 10 8 probe molecules. The hybridization protocol for Agilent microarrays suggests addition of 60 μL of cDNA mixture to the microarray. Thus, a "total probe concentration" of 3.3 × 10 6 probe molecules/ μL was used for each feature in all simulations. The Agilent protocol also suggests that the hybridization mixture contains 1.65 μg of linearly amplified and labeled cRNA and possesses a NaCl concentration of 750 mMspecifications that are typical among experimental protocols [42,73,74]. Agilent also recommends hybridization at 65°C.
We have conformed to the Agilent protocol with three exceptions. First, full-length cDNA were used in lieu of cRNA in light of the observations of Eklund and coworkers, who demonstrated that replacement of cRNA targets with cDNA reduces microarray cross hybridization [13].
Second, although 1-100 μg of non-amplified RNA is typically required in experimental protocols for microarrays [42,74], Nagino and coworkers have shown that it may be reduced to 10 -100 ng by improving mixing [42]. This would also reduce steric effects and electrostatic effects induced by dense packing of charged oligonucleotides on the glass surface (cf [55,56]). Finally, we assumed that the probes are separated from the glass surface by linker molecules of lengths greater than the Debye length. In so doing, we may utilize the NN model of Allawi and Santalucia [75] to calculate the free energies of hybridization between probes and targets. We then investigated the effect of total DNA concentration upon cross hybridization and signal response by hybridizing 50, 100 and 200 ng of cDNA per 60 μL aliquot (1.8, 3.6 and 7.2 nM).

cDNA populations
Oligonucleotide probes were hybridized to cDNAs for each of the N T = 6718 protein-encoding ORFs in the November 10, 2006 version of the yeast genome. Differential expression studies akin to those employed by the MAQC [22] were performed using a "gold standard" expression state defined by the amounts of each cDNA, , ᐍ = 1 ... 6718, which, by convention, we assume to be proportional to the amounts of the corresponding mRNA. The amounts of each transcript were selected via Monte Carlo from a log-normal distribution that was fit to the yeast expression dataset of Cho and coworkers (Additional File 4) [76]. The fits of all seventeen expression datasets (two cell cycles) revealed that the genomic expression levels of yeast are log-normally distributed with a coefficient of variation of 0.21, independent of the expression state; the mean of the distribution corresponds to the total solution-phase DNA concentration.

Kinetics and equilibrium constants
The methodology thus presented for the simulation of the time evolution of the hybridization of cDNA to oligonucleotide microarrays is valid if and only if the rate constants employed in silico are valid. At equilibrium, which corresponds to the condition where microarray slides are scanned, the constraints on the population balance equations (Eq. 2) and stochastic simulation are less stringent. In this case, the populations of all ssDNA and dsDNA species predicted by stochastic simulation will be accurate if and only if the equilibrium constants Kᐍ m are accurate, where This is a direct consequence of the thermodynamic principle of microscopic reversibility [77][78][79]. Therefore, our simulation procedure will accurately predict the extent of cross hybridization if and only if the equilibrium constants employed in silico are valid.
To this end, we have employed UNAFOLD [60,61] -a standard bioinformatics tool. Although UNAFOLD is largely used for the prediction of secondary structures of DNA and RNA (excluding pseudoknots), it is also capable of calculating free energies of hybridization (ΔGᐍ m ) via the semi-empirical hybridization model of Allawi and Santalucia [75]. This "Nearest Neighbor" (NN) model is exceptionally accurate for two reasons. First, its mathematical form accounts for nearest neighbor interactions within hybrids (e.g. thermodynamic contributions due to basepair stacking). Second, its parameters are robustly calculated from an abundance of well-controlled experimental measurements on mismatched oligomers. Given the sequences of a pair of DNA molecules, the NN model will reproduce experimentally-measured ΔGᐍ m for (a) solution-phase hybrids and (b) oligomers that are separated from the slide surface by distances greater than the Debye length (cf. [52]).
Using the NN model via UNAFOLD, we calculated ΔGᐍ m between all probes and cDNAs between all probes and cDNAs under experimental temperatures and salt concentrations. Equilibrium constants were then evaluated using the standard formula where Kᐍ m has units of M -1 . Equilibrium constants are functions of temperature and salt concentrations alone, and are independent of cDNA concentration.
Although the equilibrium state of hybridization is fully determined by the equilibrium constants Kᐍ m , SSAs require reaction rate constants. As we have discussed, microscopic reversibility precludes the possibility that the values of the rate constants will affect the populations of ssDNA and dsDNA species at equilibrium. Therefore, one of the rate constants may be estimated provided that the other is calculated using 35 and the NN-model prediction of the equilibrium constant.
In our simulations, we have chosen to estimate the "off rates", . We do so using Delisi's equation [80], where is the diffusion-limited off-rate. In this expression, Dᐍ is the diffusion constant of cDNA ᐍ, and sᐍ m is the sum of radii of gyration of the oligonucleotide probe m and cDNA ᐍ, K 0 accounts for the energetics associated with the transition state, and λ* and V* are associated with surface potentials. As the off-rate is an estimate, we treat the second term of Eq. 37 as a constant and assign where N Av is the Avogadro's number [81]. In so doing, we preclude all "on-rates" { } from exceeding the Smoluchowki rate of diffusion-limited association k + = 2πDᐍsᐍ m N Av The on-rates are calculated from Eq. 35, with the estimates of the off-rate and the free-energies calculated with San-taLucia's NN model. As the kinetic rate constant estimation procedure preserves the equilibrium constants, the simulation predictions of the dsDNA and ssDNA populations will be accurate at equilibrium. The effect of kinetic parameters upon the time required to reach the equilibrium states of hybridization is discussed in the Results section.

Authors' contributions
IL and EA developed the algorithm. EA implemented the algorithm and carried out the simulations and analyses. IL and EA wrote the manuscript.