Hybridization thermodynamics of NimbleGen Microarrays
© Mueckstein et al. 2010
Received: 16 July 2009
Accepted: 19 January 2010
Published: 19 January 2010
Skip to main content
© Mueckstein et al. 2010
Received: 16 July 2009
Accepted: 19 January 2010
Published: 19 January 2010
While microarrays are the predominant method for gene expression profiling, probe signal variation is still an area of active research. Probe signal is sequence dependent and affected by probe-target binding strength and the competing formation of probe-probe dimers and secondary structures in probes and targets.
We demonstrate the benefits of an improved model for microarray hybridization and assess the relative contributions of the probe-target binding strength and the different competing structures. Remarkably, specific and unspecific hybridization were apparently driven by different energetic contributions: For unspecific hybridization, the melting temperature T m was the best predictor of signal variation. For specific hybridization, however, the effective interaction energy that fully considered competing structures was twice as powerful a predictor of probe signal variation. We show that this was largely due to the effects of secondary structures in the probe and target molecules. The predictive power of the strength of these intramolecular structures was already comparable to that of the melting temperature or the free energy of the probe-target duplex.
This analysis illustrates the importance of considering both the effects of probe-target binding strength and the different competing structures. For specific hybridization, the secondary structures of probe and target molecules turn out to be at least as important as the probe-target binding strength for an understanding of the observed microarray signal intensities. Besides their relevance for the design of new arrays, our results demonstrate the value of improving thermodynamic models for the read-out and interpretation of microarray signals.
Microarrays have become the predominant method for studying gene expression on a genomic scale. It has been recognised, however, that probes interrogating different regions of the same mRNA target show considerable variation in signal intensities [1, 2], and that the observed intensity variation is highly sequence-dependent [3–5]. This is expected because different probes vary in their tendency of forming intraand intermolecular structures that compete with the hybridization of the probe-target duplex, resulting in different hybridization efficiencies [6, 7]. Comparative studies have indicated that accurate thermodynamic models based on the physico-chemical parameters underlying probe-target interactions are particularly good predictors of actual probe binding behaviour and thus microarray signal intensity . This is not only important in the design of new arrays, where specific and uniform probes need to be selected , but also for the readout of data from established platforms, where a better comparability of signals from different genes improves quantitative modelling and the sensitive detection of subtle higher-dimensional patterns. When the effective hybridization temperature is not known, the probe-target melting temperature T m is often calculated to predict the expected thermodynamic stability of the hybridized complex [10, 11]. The melting temperature is still one of the most popular measures in the evaluation of microarray probes. It gives the temperature at which half of all probes form a duplex with their target while the other half are unbound, assuming a simple two state transition, thus providing information about the probe binding behaviour at the melting temperature. It makes, however, no statement about the binding affinity at the actual hybridization temperature. Probe-target dimers with the same T m can actually behave quite differently at typical reaction temperatures, which are usually considerably lower than the T m [12–14]. Increasingly, general thermodynamic models of probe-target hybridization have become established in the prediction of microarray probe behaviour. These models are either based on published thermodynamic rules for nearest-neighbour base-pairing and experimentally determined parameters, or they determine model parameters from fits of the observed probe intensities [1–4]. The free energy of probe-target hybridization can so be calculated for the effective hybridization temperature.
None of the above approaches yet considers that nucleic acid sequences can form stable intramolecular structures that compete with the formation of the probe-target duplex: Only freely accessible, i. e., monomeric and unfolded probe and target molecules can interact to form probe-target dimers [13, 15–17]. More elaborate models therefore include the effects of the secondary structures of the probe [8, 18–20] and the target [9, 21, 22] on the overall binding efficiency. Another factor that reduces the efficiency of probe-target hybridization is the formation of probe-probe dimers .
Wei et al.  recently examined to what degree several probe properties affect microarray signal intensities. In this work we extend this analysis by additionally determining the influence of target secondary structure, the free energy of probe-target hybridization, and the strength of probe-probe dimerization on the overall efficiency of probe-target binding on microarrays. We used a partition function approach to capture the full dynamic potential of the different inter- and intramolecular interactions [24, 25]. Although the stability of the probe-target duplex alone is a good first indicator of hybridization behaviour, we will show that the signal variation observed in the examined data sets can be better explained by the effective interaction energy. The effective interaction energy, which is the free energy of probe-target hybridization reduced by the free energies of probe and target secondary structures and the probe-probe duplex formation, can predict probe dependent signal intensity variation twice as well as the melting temperature T m . Considering a complementary tiling array study , we can moreover show that the higher predictive power of the effective interaction energy is independent of the typical probe length and the type of target nucleic acid (cDNA or RNA) in an experiment.
We studied the sequence-dependent intensity variations for two different tiling array experiments. The first one features sets of probes targeting different regions of the same transcripts (Dataset II of Wei et al. ). It comprises nine tiling arrays with a resolution of 22 nt, each containing about 385,000 probes interrogating the expression of 32,424 regions throughout the genome. Probe length ranged from 45 to 75 bases. Chips had been manufactured with 5 nt thymidine linkers and had been hybridized to cDNAs from undifferentiated human Embryonic Stem Cells (hESCs) by NimbleGen Systems . Raw expression data had been extracted using NimbleScan software v2.1. After qspline normalization, a non-linear method for controlling signal-dependent sources of variability , data were median centred using control set probe intensities. For comparability to the original study, our analysis is based on the same preprocessed data. Probe targets were identified by WU-BLAST (W.Gish, pers. comm.) run against UCSC 'Known Genes' annotation  (as obtained 2008-08-29). To efficiently identify perfect matches, WU-BLAST parameters were set as follows: seed alignment word size W = probe length, match score M = 1, mismatch score N = -1, gap penalty Q = 3 and gap extension penalty R = 1. As expected for a tiling array, many probes had no cDNA target (about 90%), and about half the matches were on the reverse strand. For simplicity and to avoid confounding effects, we have focussed on probes complementing the sense strand and without matches to multiple genes. For simplicity, we only included probes that showed no cross-hybridization to any mRNA in the UCSC 'Known Genes' annotation  in our analysis.
The second experiment used 25-mer oligonucleotides for perfectly matching 1 nt tiling probes of ribosomal RNA (rRNA) sequences from nine nematodes . The rRNA targets were generated by in vitro transcription. Each rRNA target was separately hybridized to the specific compartment on the 12-well NimbleGen array, i. e., no interference between targets was possible. We analyse the fluorescence intensities of the probes perfectly matching the single hybridized target, giving purely specific signals with no cross-hybridization. In the hybridization reactions, an rRNA concentration of 375 ng was used for each target .
where [Na +] = 0.6 M is the sodium ion concentration in molar units, A is a helix initiation factor equal to -10.8 cal/(KM), C t the molecular concentration of the oligonucleotide strands in molar units, F the correction term for Formamide, namely 0.63°C per 1% Formamide. The universal gas constant R = 1.987 cal/(KM). Wei et al.  estimated the oligonucleotide concentration to C t = 6.1 * 10-17 M and used a Formamide concentration of 35% for hybridization (Ron Stewart, pers. comm., 2009). Since we were interested in the sensitivity of our study to T m calculation variations, we also computed the T m for different settings of molecular concentration C t , and with or without Formamide and sodium correction terms.
To facilitate a comparison of results with the original study , the minimal free energy (mfe) of the secondary structure of the probe molecule was computed by hybrid-ss-min [31, 32], with the nucleic acid type (option -NA) set to 'DNA'.
Here ΔG h , ΔG p , ΔG t , and ΔG pp are always ≤ 0. For unstable structures, their respective contributions were zero. The final effective interaction energy ΔG was stable (< 0) for all considered probe-target pairs.
The probe binding site area in the target may form structural motifs with bases some distance from the probe binding site. To calculate the free energy ΔG t of unfolding structures in the target we therefore consider a target fragment including the probe binding site plus parts of the flanking sequence. Koehler et al.  showed that 90% of base pairs are formed between nucleotides less than 85 bases apart in the primary sequence. They suggest that over 90% of the predicted structures of the full length target can be found by using a target fragment consisting of the probe binding site flanked by 170 bases on either side. We can thus safely use a target fragment including flanking regions of 200 bases on either side of the probe binding region for the calculation of the free energy ΔG t that is required to unfold the probe binding site in the target.
Wei et al.  used the GUIDE algorithm  to rank the thermodynamic parameters in order of their importance for predicting signal intensity. GUIDE constructs a non-linear model fit by finding an optimal partitioning of the data together with piecewise least-square regression models on each data subset, forming a so-called regression-tree. Variables for the split-condition of a tree-node are selected by unbiased detection of pairwise interactions and curvature, where the split points are found by exhaustive search. Over-partitioning/over-fitting of the data is avoided by limiting the size of the final regression tree by cost-complexity pruning in cross-validation, selecting the smallest tree with a mean prediction error within one standard error of the minimal prediction error achieved. The algorithm also provides importance scores, which reflect the contribution a predictor variable makes to the non-linear model. For each split-node and variable in the regression-tree, it is computed from the chi-square statistic of the interaction/curvature detection step for the underlying piecewise-constant regression model, and then weighted by the square-root of the data subset size at the split-node. The overall importance score of a predictor is obtained by summing over all split-nodes of the tree (Wei-Yin Loh, pers. comm., 2009). Software and further bibliography are available from http://www.stat.wisc.edu/~loh/guide.html. For direct comparability to earlier work, we also employ GUIDE in this study. Bootstrapping was used to study the robustness of relative importance and obtain error estimates for GUIDE results: For the whole dataset of about 3 million probes, 100 random subsets of 200,000 probes each were sampled with replacement from the original data. For the smaller subsets examined (cf. Results), 100 random subsets were generated, each containing 90% of all probes in the respective dataset. For the GUIDE scores separately obtained from each of the 100 random subsets, means and standard deviations were computed, which were finally scaled by the mean for the most important parameter to give the relative importance scores. The analysis was repeated both on the raw fluorescence intensities and on the log-intensities (Additional File 1).
We used tiling microarray data from Wei et al.  to study the influence of inter- and intramolecular structures on the efficiency of probe-target duplex formation. For this purpose, tiling arrays have the advantage that probes have been selected only by the tile start position but no probe selection to optimize hybridization efficiency or uniformity has been applied . One therefore has a set of probes with a highly varying potential for intra- and intermolecular structure formation, which results in different binding efficiencies and thus microarray hybridization signals for the same transcript target, allowing a systematic investigation of probe specific effects.
Our analysis followed the methodology of Wei et al.  yet newly introduces target-side analysis. In addition, we consider an accurate representation of the free energy of probe-target dimerization and introduce an extended model for microarray hybridization that includes probe dimerization. Our approach assesses the influence of inter- and intramolecular structures that compete with probe-target duplex formation in the hybridization process. We subsequently analyse the relative contributions of the competing structures and the probe-target duplex binding strength on the measured signal intensities.
The probes in this dataset are tiling probes designed to interrogate the entire human genome. In a hybridization to cDNAs, the majority (about 90%) of these tiling probes do not have a transcript target. These probes reflect non-specific hybridization and they dominate the dataset. The obtained importance ranking actually did not change when only considering probes with non-specific hybridization (Additional File 2: Fig. A.2). We therefore observe that the melting temperature T m is the best predictor of non-specific hybridization signal. (For these probes with no known transcript target, more advanced thermodynamic models like the effective interaction energy cannot be calculated, as they require knowledge of the target sequence.)
Consequently, it is interesting to focus on the subset of probes that target expressed genes for a complementary examination of predominantly specific hybridization signals. For this purpose, probes were selected that matched the plus strand of single genes from the UCSC 'Known Genes' annotation . For the identification of clearly expressed genes, we employed a conservative strategy of keeping only targets having at least one probe with a signal higher than the mean intensity of probes targeting transcripts. (Results, however, were robust under different threshold choices for target expression; see Additional File 3, Fig. A.3.)
Considering probes for known targets now also allows an introduction of target-side modelling. For simplicity, we further focus on probes with no cross-hybridization potential (cf. Additional File 4). In our extended model we relate the observed binding efficiency to an effective interaction energy ΔG. This is obtained by subtracting the free energies of inter- and intramolecular interactions that interfere with the formation of probe-target dimers from the free energy gained by probe-target dimer hybridization ΔG h :
where ΔG p is the free energy of the probe secondary structure, ΔG t the free energy of the probe binding site secondary structure in the target, and ΔG pp the free energy of probe-probe dimerization within the same probe feature. For most probes and targets considered, stable alternative structures were observed (Additional File 5, Table A.1), highlighting the importance of considering the effects of this competition for the binding of the probe.
Characteristic differences between the two studied datasets
Variation of thermodynamic properties for probes of similar effective interaction energies or binding strengths.
Variation of properties for probes with similar free energies
Variation range within ± 1 unit of the median
fixed Δ G
fixed Δ G h
fixed T m
44 ... 46
26 ... 62
22 ... 66
19 ... 66
0.5 ... 18
0.1 ... 16
0.3 ... 16
0.1 ... 22
2.2 ... 23
2.3 .... 26
1.7 ... 22
0.6 ... 32
36 ... 64
40 ... 60
46 .... 48
33 ... 66
54 ... 84
65 ... 67
49 ... 89
45 ... 89
In summary, the consideration of multiple inter- and intramolecular structures that compete with the formation of a probe-target duplex improves the prediction of probe signal intensity significantly, reflecting particularly the importance of probe and target accessibility for microarray hybridization.
Microarray probes are attached to the chip surface with one end, the other end protrudes into solution. This causes different reaction conditions for the two ends of a probe. In high-density microarrays the surface-attached end is less accessible than the end in solution due to steric restrictions caused by the substrate  and crowding effects due to neighbouring probes . Therefore the end of the probe that protrudes into solution plays a larger role in hybridization than the surface-tethered end [23, 42].
The influence of the terminal bases at the solution end can be limited by the synthesis yield. NimbleGen's mask-less array synthesis technology has an average stepwise yield between 96% and 98% . The probes in this study had a length range of 45 - 75 nt. Even for a coupling efficiency of 98%, less than 30% of 60-mer probes reach full length . Nevertheless, we find that the effective interaction energies of probes shortened at the solution end were significantly less predictive than the values for full-length probes, even if only 5 nt were removed (Fig. 5). These results are in agreement with the observations of Wei et al.  that protruding ends contribute more to signal intensity than tethered ends, confirming the dominance of steric effects over limitations from synthesis yield for this platform.
Although DNA microarrays have become the predominant method for gene expression profiling, the quantitative understanding of the measurement process still constitutes an active field of research. It is recognised that probe signal variation is highly probe-sequence dependent [3–5]. A recent study  has thoroughly examined the effect of different probe properties on probe signal intensities. We have extended this work by introducing target side modelling, an accurate representation of the free energy of probe-target dimerization and an improved model for microarray hybridization that includes probe dimerization. We separately considered specific and unspecific hybridization modes.
Reproducing results of Wei et al. , we obtained the melting temperature T m as the best predictor of probe signal intensity for the employed tiling array (Fig. 1). Most tiling probes hit non-exonic regions, with only 10% of the probes targeting mature mRNAs. Most probes therefore showed non-specific hybridization, and these dominated the dataset. Indeed, working with all array probes or restricting analysis to the non-specific probes gave the same results (Additional File 2: Fig. A.2). We thus observe that T m is the best predictor of non-specific probe signals.
In contrast, focussing on clearly expressed genes, we could show that the effective interaction energy ΔG was the best predictor for specific probe signal intensity (Fig. 3). Here, we computed ΔG by also considering inter- and intramolecular structures that interfere with probe-target binding, cf. Eq. (1). This improved the prediction of signal intensity considerably, with ΔG performing about twice as well as T m . Probe and target secondary structure gave a similar performance for signal intensity prediction as the free energy of the probe-target duplex ΔG h or the melting temperature T m (Figs 3 and 4). The large impact of probe secondary structure can be understood by considering how secondary structure affects probe-target binding efficiency: Hybridization between two nucleic acids starts with the nucleation of a few perfectly matched bases, followed by a comparatively fast zipping reaction . Nucleation can actually be the rate limiting step for hybridization . Probe secondary structure reduces the number of available nucleation sites because bound bases cannot take part in a nucleation reaction . Moreover, probes with secondary structure fold back on themselves, with the solution ends of the probes brought into closer proximity to the microarray surface. Steric effects close to the array surface are another considerable factor determining probe accessibility. All these effects reduce hybridization efficiency and thus probe signal and may explain the importance of probe secondary structure for probe signal intensity prediction. The large variation of ΔG t and its strong influence on probe signal, on the other hand, can be understood by considering that bases outside the binding site can also contribute to structures interfering with probe binding, which results in an overall larger number of potentially stable relevant secondary structures in the target. Finally, while the incorporation of probe and target secondary structures into an effective interaction energy, Eq. (1), made the biggest contributions to better model performance, the consideration of probe-probe interactions also improved prediction power by a further 10%.
Besides their relevance for the design of new arrays, our results have demonstrated the value of improving thermodynamic models for the read-out and interpretation of microarray signals. Necessary next steps in the development of improved models include both the incorporation of intermolecular target-target interactions as well as of 'cross-hybridization' effects to unintentionally matching non-target molecules in a complex sample background. Target-target interactions on one hand are particularly challenging because they involve multimers of a probe, its bound target, and another sequence that binds to the target. This other sequence can thus contribute to the measured signal intensity . More accurate models of interactions with target sequences will also have to consider the respective target fragmentation and labelling steps of microarray protocols. Models of cross-hybridization on the other hand need to address two tasks: identifying potential non-targets unintentionally matching the probe, and modelling their influence on the hybridization signal. Established probe design tools already filter out non-specific probes through cross-hybridization prediction after efficient sequence-similarity based detection screens [46, 47]; similar to the filtering employed in this study. Latest advances now promise sufficiently fast and more sensitive detection tools based on thermodynamic models [48, 49]. While recent developments have shown how the competitive formation of probe and target secondary structures as well as probe-probe and target-target dimers affect the probability of finding the desired probe-target duplex , these calculations still require several hours of computation time per probe-target pair, precluding their large-scale application. In approximation, the free energy contributions of competing structures can be obtained separately and effective interaction energies can be calculated , as in Eq. (1). A similar approach could also be taken to quantitatively consider cross-hybridization effects, as implemented in state-of-the-art probe design tools . For this, special validation experiments that allow a separation of specific signals and cross-hybridization effects will be valuable. The results of our current study suggest that the establishment and validation of more sophisticated models can in the near future provide further improvements to our understanding and ability to quantitatively predict microarray hybridization signals in a complex sample background.
We have introduced an improved thermodynamic model for probe-specific signal intensity on microarrays. The hybridization efficiency between a probe and its targets is determined by the balance of the binding strength of the probe-target duplex on one hand and the competing formation of probe-probe dimers and secondary structures in either probes or targets on the other hand. Consequently, the effective interaction energy between a probe and its target is modelled as the free energy gained by probe-target duplex formation reduced by the free energies needed to open alternative structures competing with probe-target binding. For specific hybridization, the effective interaction energy is a twice as powerful predictor of signal intensity variation as the melting temperature T m . We furthermore analysed the effects of the alternative competing structures in relation to probe-target binding strength, which highlighted the strong influence of intramolecular structures on specific hybridization signals. In summary, the improved model introduced here considerably enhances our ability to predict and understand sequence-specific variation of microarray signal intensities.
The authors gratefully acknowledge helpful discussions with Peter Sykacek and Ron Stewart. This work was supported by the City of Vienna ['Jubiläumsfonds der Stadt Wien für die Österreichische Akademie der Wissenschaften' to UM and DPK]; the Vienna Science and Technology Fund (WWTF), Baxter AG, Austrian Research Centres Seibersdorf, and the Austrian Centre of Biopharmaceutical Technology ['WWTF Vienna Science Chair of Bioinformatics' to GGL, AP, and DPK]; and the Austrian Research Promotion Agency ['GenAU BIN-III' to IH].
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.