Normalization and centering of array-based heterologous genome hybridization based on divergent control probes
© Darby et al; licensee BioMed Central Ltd. 2011
Received: 16 September 2010
Accepted: 21 May 2011
Published: 21 May 2011
Hybridization of heterologous (non-specific) nucleic acids onto arrays designed for model organisms has been proposed as a viable genomic resource for estimating sequence variation and gene expression in non-model organisms. However, conventional methods of normalization that assume equivalent distributions (such as quantile normalization) are inappropriate when applied to non-specific (heterologous) hybridization. We propose an algorithm for normalizing and centering intensity data from heterologous hybridization that makes no prior assumptions of distribution, reduces the false appearance of homology, and provides a way for researchers to confirm whether heterologous hybridization is suitable.
Data are normalized by adjusting for Gibbs free energy binding, and centered by adjusting for the median of a common set of control probes assumed to be equivalently dissimilar for all species. This procedure was compared to existing approaches and found to be as successful as Loess normalization at detecting sequence variations (deletions) and even more successful than quantile normalization at reducing the accumulation of false positive probe matches between two related nematode species, Caenorhabditis elegans and C. briggsae. Despite the improvements, we still found that probe fluorescence intensity was too poorly correlated with sequence similarity to result in reliable detection of matching probe sequence.
Cross-species hybridizations can be a way to adapt genome-enabled tools for closely related non-model organisms, but data must be appropriately normalized and centered in a way that accommodates hybridization of nucleic acids with diverged sequence. For short, 25-mer probes, hybridization intensity alone may be insufficiently correlated with sequence similarity to allow reliable inference of homology at the probe level.
Many organisms that are important components of most ecosystems are understudied at the genetic level because they lack useful genome-enabled resources. Hybridization of nucleic acids from non-model organisms onto DNA microarrays designed for closely related model organisms has been used as a potential alternative to building genomic resources for each species of interest. A variety of platforms and objectives exist in contemporary applications of heterologous ("cross-species") hybridization, but the recurring challenge for each platform is to measure the effect of sequence dissimilarity on hybridization between the probes being used and the nucleic acids of the species being hybridized. For example, Gilad et al. tested hybridization efficiency of microarrays spotted with amplicons from four primate species (including human) and showed that increasing sequence divergence resulted in reduced hybridization efficiency. Similarly, an array of expressed sequence tags (ESTs) from African cichlid fish (Astatotilapia burtoni) was used to test the validity of gene expression analysis on a variety of related teleost fish. The number of spots (probe features) that were able to demonstrate differential gene expression decreased with increasing phylogenetic distance. The microarrays were subsequently used to assess gene expression from swordtail (Xiphophorus nigrensis), which was estimated to be at the far edge of what was considered phylogenetically close enough to be reliable for cross-species hybridization on the cichlid arrays. Similar arrays developed from zebrafish (Danio) ESTs have been used with coral reef fish (Pomacentrus) cDNA. In situ synthesized oligonucleotide arrays are an alternative to spotted cDNA microarrays and are commonly used when the species of interest is closely related to a model organism for which a commercially designed chip is already available.
For expression studies, it is common to screen probes for sequence conservation by first hybridizing heterologous gDNA, and then assessing gene expression by hybridizing experimental cDNA and analyzing only the accepted probes. This strategy has been applied to examine gene expression of various genera of Brassicaceae on an array containing Arabidopsis thaliana probes [5–7], expression of banana genes on a rice array, expression of horse genes on an array containing human probes, and expression of goat genes using a bovine array.
The preparation of heterologous hybridization data for analysis is problematic because probe binding is a result of multiple factors, including binding free-energy, self-folding, dimerization, and, importantly, sequence similarity or divergence. Traditional approaches to analyzing heterologous hybridization data largely follow the techniques of array-based comparative genome hybridization (aCGH) [12–14], which is the hybridization of gDNA to con-specific arrays for the detection of chromosomal or copy-number variations. These techniques can include local regression normalization and quantile normalization. However, the conventional normalization procedures designed for aCGH have the potential to result in the false appearance of homology if the probe signals from cross-species hybridizations violate the underlying assumptions of uniform statistical distributions due to sequence divergence. Several methods have been proposed to 'screen' probes and reduce the potential for false positives: 1) accept only probes of a certain hybridization fluorescence threshold or overall intensity [5, 16], 2) match probes from a reference genome to that of the target genome and only analyze probes of a certain sequence similarity, or 3) normalize the entire dataset using a suite of known conserved genes [18, 19]. However, the significant challenge with normalizing intensity data based on conserved genes is that genes evolve at different rates in different lineages. For many non-model organisms, so little genomic sequence data is known that identifying sets of genes with conserved sequences among a group of species is unreliable, if not impossible. We propose a normalization and centering approach that relies on universally diverged (non-conserved) probes and does not make any prior assumptions about the distribution of probe signal intensities.
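To make the distributional assumption concrete, the following is a minimal sketch of quantile normalization in plain Python. The intensity values are invented for illustration and are not data from this study, and the tie-breaking rule is a simplification. After normalization every array holds exactly the same set of values, so a heterologous hybridization with uniformly low signal is forced onto the same distribution as the con-specific one.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length lists of probe intensities.

    Each value is replaced by the mean of the values holding the same rank
    across all arrays (ties are broken by probe order, a simplification).
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    # Mean of the k-th smallest value across arrays, for every rank k.
    sorted_cols = [sorted(a) for a in arrays]
    rank_means = [sum(col[k] for col in sorted_cols) / len(arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical log2 intensities: a con-specific array and a heterologous
# array whose signal is uniformly lower because of sequence divergence.
conspecific = [12.0, 10.0, 11.0, 9.0]
heterologous = [6.0, 7.0, 5.0, 4.0]
norm_con, norm_het = quantile_normalize([conspecific, heterologous])
```

In this toy case the two normalized arrays contain the same four values, so the uniformly lower signal of the heterologous array is no longer visible, which is the behavior that can create a false appearance of homology.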
gDNA and Hybridization conditions
Strains used for hybridization included Caenorhabditis elegans (N2), C. elegans (CB4856), C. briggsae (AF16), and five species isolated from Konza Prairie, Riley County, Kansas (US): Oscheius tipulae (KS585) [Genbank:HQ130502], Oscheius sp. FVV-2 (KS555) [Genbank:HQ130503], Mesorhabditis sp. (KS587) [Genbank:HQ130505], Acrobeloides sp. (KS586) [Genbank:HQ130506], Chiloplacus sp. (KS584) [Genbank:HQ130507]. Genomic DNA was isolated from each species by phenol-chloroform extraction, labelled and hybridized onto the GeneChip® C. elegans Tiling 1.0R Array according to manufacturer's specification using two chips per species representing biological replicates. Arrays were imaged on GeneChip® Scanner 3000-7G and data extracted with GeneChip® Operating Software (GCOS) and analyzed using Tiling Analysis Software (TAS). Raw and processed data have been submitted to NCBI Gene Expression Omnibus [GEO:GSE23667].
The median intensity (median(c,s)) of all control probes c from species s was used as a phase shift to center all control probes around zero.
To characterize the relationship between probe intensity and the percent similarity, we make use of a dataset of candidate genes with potential ecologically relevant roles in nematode survival. We selected 49 of the candidate genes of interest that had only one putative ortholog and confirmed that this suite of genes came from all six chromosomes (I: 3, II: 3, III: 6, IV: 8, V: 18, X: 11) with a group GC content (min: 40.1%, mean: 47.4%, max: 65.8%) that was representative of all probes in exon regions (43% ± 9.2 SD). We then aligned each probe from the C. elegans chip to its respective position in the C. briggsae homolog and computed the number of identical nucleotides.
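The centering step described above can be sketched as follows. The text states only that intensities are adjusted for Gibbs free energy and then shifted by the control-probe median; the straight-line form of the free-energy adjustment, the choice to fit it on control probes only, the function names, and all numbers below are assumptions made for illustration.

```python
from statistics import mean, median


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y ~ x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def normalize_and_center(log_intensity, delta_g, is_control):
    """Return free-energy-adjusted, control-centered values for one species.

    ASSUMPTION: the free-energy adjustment is modelled as the residual from
    a straight line fitted to the control probes only.
    """
    ctrl_x = [g for g, c in zip(delta_g, is_control) if c]
    ctrl_y = [v for v, c in zip(log_intensity, is_control) if c]
    slope, intercept = fit_line(ctrl_x, ctrl_y)
    residuals = [v - (intercept + slope * g) for v, g in zip(log_intensity, delta_g)]
    # Centering: subtract the median of the control-probe residuals so that
    # the control probes of this species sit around zero.
    shift = median(r for r, c in zip(residuals, is_control) if c)
    return [r - shift for r in residuals]


# Hypothetical probes: three controls followed by two exon probes.
log_intensity = [8.0, 9.0, 10.0, 12.5, 9.5]
delta_g = [-20.0, -25.0, -30.0, -25.0, -20.0]  # kcal/mol, invented values
is_control = [True, True, True, False, False]
centered = normalize_and_center(log_intensity, delta_g, is_control)
```

Because the shift is computed separately for each species from probes assumed to be equally dissimilar to all of them, values above zero can be read as signal in excess of non-specific binding for that species.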
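As an illustration of the identity count, the sketch below compares a probe with the equally long stretch of target sequence it aligns to. The function name and both sequences are hypothetical, and an ungapped, position-by-position comparison is assumed.

```python
def percent_identity(probe, target):
    """Percent of positions at which an aligned probe and target agree."""
    if len(probe) != len(target):
        raise ValueError("probe and target must be aligned to equal length")
    matches = sum(p == t for p, t in zip(probe.upper(), target.upper()))
    return 100.0 * matches / len(probe)


# Hypothetical 25-mer probe and an aligned target with two mismatches.
probe = "ATGGCGTACGTTAGCCTAGGATCCA"
target = "ATAGCGTACGTTAGCCTAGTATCCA"
identity = percent_identity(probe, target)  # 23 of 25 positions match
```

For a 25-mer, each mismatch lowers identity by four percentage points, so the possible values are coarse (100%, 96%, 92%, and so on).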
Results and Discussion
Conventional data transformation
Alternate normalization and centering
Test of sensitivity with con-specific hybridization
Test of specificity with cross-species hybridization
Cross-species hybridization has been proposed as a way to adapt genome-enabled tools developed for model organisms to closely related non-model relatives. However, we (present work) and others [18, 19] have shown that the data must be appropriately normalized and centered to control for sequence divergence. Ultimately, we found that probe intensity alone was a poor predictor of sequence similarity and can result in false inferences of homology. Our findings largely support the recent results of Machado and Renn, who also found that the ability to detect genes decreased below 90% sequence identity among three species of Drosophila. The major difference in our approach is that Machado and Renn normalize based on the 100 or 1000 most conserved genes (assumed to be equivalently similar for all species of interest), while we propose normalizing and centering based on control, non-target probes (assumed to be equivalently dissimilar for all species tested). Both approaches appear to be valid for their respective purposes, but our approach might be more applicable in the absence of enough genomic sequence data to identify an a priori set of conserved genes. The lack of universally dissimilar probes on the spotted chip of Machado and Renn prevents us from applying our technique to their data, and the lack of genomic sequence data among our species prevents us from applying their technique to our data. However, we can nonetheless predict that the microarrays printed with PCR products ~500 bp long are likely to be more sensitive and specific to their targets than the 25-mer probes used in the Affymetrix platform presented here. Single mismatches may have a more adverse effect on the binding of short, 25-bp probes than on long, ~500 bp probes. Hybridization of gDNA onto microarrays is currently the standard technique to validate probes on gene chips for expression analysis in cross-species applications.
One commonly used procedure hybridizes heterologous gDNA from a non-model organism onto an existing 25-mer GeneChip® designed for a model organism and masks all probe sets except those with at least one probe feature whose hybridization intensity is above a predefined threshold intensity. Our analysis suggests that, either with or without probe-level normalization and centering, a large number of non-specific control probes can still have a relatively high hybridization intensity compared to specific probes (Figure 2A). Furthermore, even if the threshold intensity were set relative to target genome hybridization, we show that a significant fraction of probe features at all threshold intensities could likely be false positives. Thus, we fear that a cross-species hybridization algorithm to mask chips for gene expression may still permit a large number of false positive probe sets into the analysis. It is for this reason that studies utilizing cross-species hybridization for microarray gene expression profiles must be especially diligent with replication and validation. For example, Pavlidis et al. found that a minimum of five biological replicates generated stable gene expression profiles. Unfortunately, recent studies using cross-species hybridization on microarrays with short probes either include no replication or insufficiently validate their microarray results with qPCR [8, 9]. We suggest that cross-species microarray hybridizations introduce a degree of uncertainty beyond what is typical for con-specific hybridizations, and thus require more robust quality control measures than would normally be adopted for con-specific hybridization.
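The masking rule described above, and the concern about control probes passing it, can be illustrated with a short sketch. The function names, probe-set names, intensities, and threshold are all hypothetical and are not taken from any published masking tool.

```python
def retained_probe_sets(probe_sets, threshold):
    """Keep probe sets with at least one probe whose gDNA intensity exceeds threshold."""
    return {
        name: values
        for name, values in probe_sets.items()
        if any(v > threshold for v in values)
    }


def control_fraction_above(control_intensities, threshold):
    """Fraction of non-specific control probes that would also pass the threshold."""
    passing = sum(v > threshold for v in control_intensities)
    return passing / len(control_intensities)


# Hypothetical gDNA hybridization intensities per probe set.
probe_sets = {
    "gene_a": [50, 420, 90],
    "gene_b": [60, 80, 70],
    "gene_c": [510, 300, 95],
}
controls = [30, 450, 80, 410, 60]  # hypothetical non-specific control probes
threshold = 400

kept = retained_probe_sets(probe_sets, threshold)
false_pass_rate = control_fraction_above(controls, threshold)
```

In this invented example two of three probe sets are retained, yet 40% of the control probes also exceed the threshold. Reporting the control pass rate alongside the retained probe sets is one way to gauge how many accepted features may be false positives.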
Genomic DNA controls are essential to ensure the most reliable interpretation of heterologous hybridization applications, such as gene expression profiles. Our strategy for normalization and centering of cross-species array data is meant to be used to identify reliable probe intensity values that could be utilized in downstream applications, such as finding regions of sequence similarity or for gene expression analysis. Our method is not necessarily meant to be used as a normalization procedure per se, although we could imagine that such an approach could be developed based on the analyses presented here. One such approach would be first to build universal control probe sets into the microarray of interest using random oligonucleotides or sequences derived from universally diverged taxa such as prokaryotes for eukaryotic arrays or vice-versa. Secondly, hybridize both homologous genomic DNA (from the species used to design the array) and heterologous (from the species of interest) genomic DNA onto the arrays being used (either dual-labelled mixtures onto the same chip or single-labelled pools onto separate chips) to compare probe intensity using the "control" based normalization and centering approach presented here. Finally, test the mean signal of a gene's exon probes against "zero" (with an appropriate correction for multiple comparisons). Only those genes whose complement of exon probes are statistically greater than zero can be considered "conserved" enough for use. Based upon our analyses, the number of these "conserved" genes decreases rapidly with phylogenetic distance and suggests that for distantly related taxa non-array based approaches might be more appropriate and cost effective.
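The final step, testing a gene's exon probes against zero with a correction for multiple comparisons, could take a form like the sketch below. The text does not name a specific test or correction, so the exact one-sided sign-flip test of the mean and the Bonferroni correction used here are assumptions chosen to keep the example dependency-free. The gene names and values are invented.

```python
from itertools import product


def sign_flip_p_value(values):
    """One-sided exact sign-flip p-value for mean(values) > 0.

    Under the null hypothesis the values are symmetric about zero, so every
    pattern of sign flips is equally likely. All 2**n patterns are
    enumerated, which is practical only for a few probes per gene.
    """
    observed = sum(values)
    magnitudes = [abs(v) for v in values]
    n_patterns = 0
    n_extreme = 0
    for signs in product((1, -1), repeat=len(values)):
        n_patterns += 1
        # Small tolerance guards against floating-point round-off.
        if sum(s * m for s, m in zip(signs, magnitudes)) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_patterns


def conserved_genes(gene_probes, alpha=0.05):
    """Genes whose exon probes exceed zero after Bonferroni correction."""
    cutoff = alpha / len(gene_probes)
    return [g for g, vals in gene_probes.items() if sign_flip_p_value(vals) <= cutoff]


# Hypothetical normalized, control-centered exon probe values for two genes.
gene_probes = {
    "gene_x": [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0],
    "gene_y": [0.5, -0.4, 0.3, -0.6, 0.2, -0.1, 0.1, -0.2],
}
accepted = conserved_genes(gene_probes)
```

With eight probes per gene the smallest attainable p-value is 1/256, so under Bonferroni correction a design of this kind cannot declare any gene conserved once the number of genes tested grows beyond about a dozen. A parametric test or a less conservative correction may be preferable at genome scale.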
We thank Mandar Deshpande for assistance with hybridization protocol and imaging and the Caenorhabditis Genetics Center for providing the Caenorhabditis strains. This work was supported by NSF grant EF-0723862 to MAH. Thanks also to Ted Morgan, two anonymous reviewers, and the members of the Ecological Genomics Institute at Kansas State University for discussions and helpful comments on the manuscript, and the Gene Expression Facility at Kansas State University for use of the facility.
- Gilad Y, Rifkin SA, Bertone P, Gerstein M, White KP: Multi-species microarrays reveal the effect of sequence divergence on gene expression profiles. Genome Research 2005, 15(5):674–680. 10.1101/gr.3335705
- Renn S, Aubin-Horth N, Hofmann H: Biologically meaningful expression profiling across species using heterologous hybridization to a cDNA microarray. BMC Genomics 2004, 5(1):42. 10.1186/1471-2164-5-42
- Cummings ME, Larkins-Ford J, Reilly CRL, Wong RY, Ramsey M, Hofmann HA: Sexual and social stimuli elicit rapid and contrasting genomic responses. Proceedings of the Royal Society B: Biological Sciences 2008, 275(1633):393–402. 10.1098/rspb.2007.1454
- Kassahn KS, Caley MJ, Ward AC, Connolly AR, Stone G, Crozier RH: Heterologous microarray experiments used to identify the early gene response to heat stress in a coral reef fish. Molecular Ecology 2007, 16(8):1749–1763. 10.1111/j.1365-294X.2006.03178.x
- Hammond J, Broadley M, Craigon D, Higgins J, Emmerson Z, Townsend H, White P, May S: Using genomic DNA-based probe-selection to improve the sensitivity of high-density oligonucleotide arrays when applied to heterologous species. Plant Methods 2005, 1(1):10. 10.1186/1746-4811-1-10
- Hammond JP, Bowen HC, White PJ, Mills V, Pyke KA, Baker AJM, Whiting SN, May ST, Broadley MR: A comparison of the Thlaspi caerulescens and Thlaspi arvense shoot transcriptomes. New Phytologist 2006, 170(2):239–260. 10.1111/j.1469-8137.2006.01662.x
- Morinaga SI, Nagano AJ, Miyazaki S, Kubo M, Demura T, Fukuda H, Sakai S, Hasebe M: Ecogenomics of cleistogamous and chasmogamous flowering: genome-wide gene expression patterns from cross-species microarray analysis in Cardamine kokaiensis (Brassicaceae). Journal of Ecology 2008, 96(5):1086–1097. 10.1111/j.1365-2745.2008.01392.x
- Davey M, Graham N, Vanholme B, Swennen R, May S, Keulemans J: Heterologous oligonucleotide microarrays for transcriptomics in a non-model species; a proof-of-concept study of drought stress in Musa. BMC Genomics 2009, 10(1):436. 10.1186/1471-2164-10-436
- Graham NS, Clutterbuck AL, James N, Lea RG, Mobasheri A, Broadley MR, May ST: Equine transcriptome quantification using human GeneChip arrays can be improved using genomic DNA hybridisation and probe selection. The Veterinary Journal 2010, 186(3):323–327. 10.1016/j.tvjl.2009.08.030
- Faucon F, Rebours E, Bevilacqua C, Helbling JC, Aubert J, Makhzami S, Dhorne-Pollet S, Robin S, Martin P: Terminal differentiation of goat mammary tissue during pregnancy requires the expression of genes involved in immune functions. Physiol Genomics 2009, 40(1):61–82. 10.1152/physiolgenomics.00032.2009
- Pozhitkov A, Noble PA, Domazet-Loso T, Nolte AW, Sonnenberg R, Staehler P, Beier M, Tautz D: Tests of rRNA hybridization to microarrays suggest that hybridization characteristics of oligonucleotide probes for species discrimination cannot be predicted. Nucleic Acids Research 2006, 34(9).
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19(2):185–193. 10.1093/bioinformatics/19.2.185
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucl Acids Res 2003, 31(4):e15. 10.1093/nar/gng015
- Hsu L, Self SG, Grove D, Randolph T, Wang K, Delrow JJ, Loo L, Porter P: Denoising array-based comparative genomic hybridization data using wavelets. Biostat 2005, 6(2):211–226. 10.1093/biostatistics/kxi004
- Bar-Or C, Czosnek H, Koltai H: Cross-species microarray hybridizations: a developing tool for studying species diversity. Trends in Genetics 2007, 23(4):200–207. 10.1016/j.tig.2007.02.003
- Degletagne C, Keime C, Rey B, de Dinechin M, Forcheron F, Chuchana P, Jouventin P, Gautier C, Duchamp C: Transcriptome analysis in non-model species: a new method for the analysis of heterologous hybridization on microarrays. BMC Genomics 2010, 11(1):344. 10.1186/1471-2164-11-344
- Bar-Or C, Bar-Eyal M, Gal T, Kapulnik Y, Czosnek H, Koltai H: Derivation of species-specific hybridization-like knowledge out of cross-species hybridization results. BMC Genomics 2006, 7(1):110. 10.1186/1471-2164-7-110
- Machado H, Renn S: A critical assessment of cross-species detection of gene duplicates using comparative genomic hybridization. BMC Genomics 2010, 11(1):304. 10.1186/1471-2164-11-304
- Renn S, Machado H, Jones A, Soneji K, Kulathinal R, Hofmann H: Using comparative genomic hybridization to survey genomic sequence divergence across species: a proof-of-concept from Drosophila. BMC Genomics 2010, 11(1):271. 10.1186/1471-2164-11-271
- Jones KL, Todd TC, Wall-Beam JL, Coolon JD, Blair JM, Herman MA: Molecular Approach for Assessing Responses of Microbial-Feeding Nematodes to Burning and Chronic Nitrogen Enrichment in a Native Grassland. Molecular Ecology 2006, 15(9):2601–2609. 10.1111/j.1365-294X.2006.02971.x
- Todd TC, Powers TO, Mullin PG: Sentinel nematodes of land-use change and restoration in tallgrass prairie. Journal of Nematology 2006, 38(1):20–27.
- Freckman DW: Bacterivorous Nematodes and Organic-Matter Decomposition. Agriculture Ecosystems & Environment 1988, 24(1–3):195–217. 10.1016/0167-8809(88)90066-7
- Ferris H, Bongers T: Nematode Indicators of Organic Enrichment. Journal of Nematology 2006, 38(1):3–12.
- Stein LD, Bao Z, Blasiar D, Blumenthal T, Brent MR, Chen N, Chinwalla A, Clarke L, Clee C, Coghlan A, et al.: The Genome Sequence of Caenorhabditis briggsae: A Platform for Comparative Genomics. PLoS Biol 2003, 1(2):e45.
- SantaLucia J: A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proceedings of the National Academy of Sciences of the United States of America 1998, 95(4):1460–1465. 10.1073/pnas.95.4.1460
- Coolon JD, Jones KL, Todd TC, Carr BC, Herman MA: Caenorhabditis elegans Genomic Response to Soil Bacteria Predicts Environment-Specific Genetic Effects on Life History Traits. PLoS Genetics 2009, 5(6):e1000503. 10.1371/journal.pgen.1000503
- Maydan JS, Flibotte S, Edgley ML, Lau J, Selzer RR, Richmond TA, Pofahl NJ, Thomas JH, Moerman DG: Efficient high-resolution deletion discovery in Caenorhabditis elegans by array Comparative Genomic Hybridization. Genome Research 2007, 17(3):337–347. 10.1101/gr.5690307
- Pavlidis P, Li Q, Noble WS: The effect of replication on gene expression microarray experiments. Bioinformatics 2003, 19(13):1620–1627. 10.1093/bioinformatics/btg227
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.