Identification and utilization of inter-species conserved (ISC) probesets on Affymetrix human GeneChip® platforms for the optimization of the assessment of expression patterns in non human primate (NHP) samples
© Wang et al; licensee BioMed Central Ltd. 2004
Received: 15 July 2004
Accepted: 26 October 2004
Published: 26 October 2004
While researchers have utilized versions of the Affymetrix human GeneChip® for the assessment of expression patterns in non human primate (NHP) samples, there has been no comprehensive sequence analysis study undertaken to demonstrate that the probe sequences designed to detect human transcripts are reliably hybridizing with their orthologs in NHP. By aligning probe sequences with expressed sequence tags (ESTs) in NHP, inter-species conserved (ISC) probesets, which have two or more probes complementary to ESTs in NHP, were identified on human GeneChip® platforms. The utility of human GeneChips® for the assessment of NHP expression patterns can be effectively evaluated by analyzing the hybridization behaviour of ISC probesets. Appropriate normalization methods were identified that further improve the reliability of human GeneChips® for interspecies (human vs NHP) comparisons.
ISC probesets in each of the seven Affymetrix GeneChip® platforms (U133Plus2.0, U133A, U133B, U95Av2, U95B, Focus and HuGeneFL) were identified for both monkey and chimpanzee. Expression data were generated from peripheral blood mononuclear cells (PBMCs) of 12 human and 8 monkey (Indian origin Rhesus macaque) samples using the Focus GeneChip®. Analysis of both qualitative detection calls and quantitative signal intensities showed that intra-species reproducibility (human vs. human or monkey vs. monkey) was much higher than interspecies reproducibility (human vs. monkey). ISC probesets exhibited higher interspecies reproducibility than the full set of expressed probesets. Importantly, appropriate normalization methods could be leveraged to greatly improve interspecies correlations. The correlation coefficients between human (average of 12 samples) and monkey (average of 8 Rhesus macaque samples) were 0.725, 0.821 and 0.893 for the MAS5.0 (Microarray Suite version 5.0), dChip and RMA (Robust Multi-chip Average) normalization methods, respectively.
It is feasible to use Affymetrix human GeneChip® platforms to assess the expression profiles of NHP for intra-species studies. Caution must be taken in interspecies studies, since unsuitable probesets will produce spurious differentially regulated genes between human and NHP. The RMA normalization method and ISC probesets are recommended for interspecies studies.
Microarray studies on non human primates (NHP) have been used to address viral pathogenesis [1, 2], neurological disorders [3], development [4] and phylogenetic studies [5–7]. Due to the lack of species-specific microarray platforms for NHP, researchers have used GeneChip® platforms built using human sequence information. An underlying assumption in such studies is that transcripts of humans and NHP are highly conserved, and that probe sequences designed to detect human genes will detect their orthologs in NHP samples. It is estimated that chimpanzees (Pan troglodytes) and humans share 98.77% DNA sequence similarity [8]. While this statistic is widely quoted and believed, Britten [9] reported the divergence between humans and chimpanzees to be about 5%. Anzai and colleagues [10] compared the chimpanzee MHC region (1,750,601 bp) with the human HLA region (1,870,955 bp), and concluded that the similarity drops to 86.7% if insertions and deletions are taken into account. All these analyses are based on genomic DNA sequences; for microarray studies on the transcriptome, however, the similarity of RNA transcripts is the primary concern. A single gene does not necessarily generate a single transcript. Splicing variants are very common in humans [11, 12], and humans and NHPs may use different splicing strategies in some genes. Therefore, it is necessary to re-assess the reliability of human GeneChips® for NHP expression analysis.
Few published studies employing human GeneChip® platforms for NHP expression profiling have robustly addressed the quantitative aspects of cross-platform performance. Vahey and colleagues [1] used the HuGeneFL GeneChip® and demonstrated that there was no significant difference in the dynamic range of the raw fluorescence distribution for equivalent amounts of human cRNA and macaque cRNA hybridized to the chip. Chismar and colleagues [13] used the U95Av2 GeneChip® platform and compared the expression patterns of humans with those of the rhesus macaque. They concluded that the percentage of 'present' calls observed in the transcriptome of macaque brain is lower than that of human brain, and that this is especially true for genes with lower signal intensity. Caceres and colleagues [5] used HG-U95Av2 arrays to identify genes upregulated in the human cortex compared with those of NHPs. Since sequence divergence could lead to an underestimation of expression levels in NHPs, they excluded 4572 probes that exhibited different hybridization behaviour between the two sets of samples in order to reduce false positives. However, that analysis is based solely on probe signal intensities. A more robust way to assess the utility of human GeneChip® platforms for the study of expression profiles in NHP is to employ a sequence analysis approach.
In this study, we address the power of human GeneChip® platforms to assess expression patterns in NHP samples by: a) identifying ISC probesets based on sequence analysis; b) assessing intra (within NHP species)- and interspecies (between NHP and human samples) reproducibility of GeneChip® data; and c) applying appropriate normalization methods to improve interspecies reproducibility.
Results and discussion
Identification of ISC probesets
Table 1: The number of ISC probesets in various human GeneChip® platforms. Columns: human GeneChip® platform; probes per probeset; total number of probesets (genes*); number of ISC probesets (genes).
It is not uncommon, especially in the U133Plus2.0 platform, for multiple probesets to target the same gene. For example, in the U133A and the U133Plus2.0 GeneChips®, there are three probesets (217028_at, 211919_s_at and 209201_x_at) that target the gene CXCR4 at different positions in its transcript. To address this redundancy, we converted the number of probesets into the number of unique UniGene clusters, based on the GeneChip® annotation files provided on the Affymetrix website [18]. While a UniGene cluster does not necessarily correspond to a unique gene, this is a reasonable way to assess probeset redundancy. As shown in Table 1, the Focus GeneChip® and the U133Plus2.0 GeneChip® have the lowest and highest frequency of redundant probesets for a given gene, respectively.
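The conversion from probesets to unique UniGene clusters described above can be illustrated with a minimal Python sketch. It is not the authors' script: the function name is an assumption, the cluster labels are hypothetical placeholders rather than real UniGene IDs, and the fourth probeset is invented so that the example has more than one cluster. Only the three CXCR4 probeset IDs come from the text.

```python
from collections import defaultdict


def group_by_cluster(probeset_to_cluster):
    """Group probeset IDs by their annotated cluster ID.

    Probesets with an empty or missing cluster annotation are skipped.
    """
    clusters = defaultdict(list)
    for probeset_id, cluster_id in probeset_to_cluster.items():
        if cluster_id:
            clusters[cluster_id].append(probeset_id)
    return dict(clusters)


# Probeset IDs for CXCR4 are from the text; cluster labels are hypothetical.
annotation = {
    "217028_at": "CLUSTER_CXCR4",
    "211919_s_at": "CLUSTER_CXCR4",
    "209201_x_at": "CLUSTER_CXCR4",
    "hypothetical_probeset_at": "CLUSTER_OTHER",
}

clusters = group_by_cluster(annotation)
n_probesets = sum(len(ids) for ids in clusters.values())
n_clusters = len(clusters)
# Mean number of probesets per cluster: a simple redundancy measure.
redundancy = n_probesets / n_clusters
print(n_probesets, n_clusters, redundancy)
```

A ratio close to 1 would indicate little redundancy, while larger values indicate that many clusters are targeted by several probesets.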
Intra- and interspecies reproducibility of detection calls
Intra- and interspecies reproducibility of signal intensities
The effect of normalization methods on interspecies reproducibility
This paper presents a comprehensive analysis of probe sequences and GeneChip® expression data as applied to the derivation of meaningful expression profile data from NHP. The utility of the human Affymetrix GeneChip® for the assessment of expression profiles in NHP depends on the experimental design and on the approach to data normalization and analysis. Our observations suggest that: 1) it is feasible to use the human GeneChip® to evaluate expression profiles of NHP samples for intra-species comparisons; 2) the use of ISC probesets and RMA normalization is recommended for interspecies studies; and 3) as the number of available NHP ESTs increases, additional ISC probesets (and perfect probesets) will be identified in the near future.
Sequence data source
Affymetrix GeneChip® probe sequences were downloaded from the Affymetrix website [18]. The ESTs (Expressed Sequence Tags) of monkey (Macaca mulatta) and chimpanzee (Pan troglodytes) were downloaded from the NCBI website [19].
Identification of ISC probesets
The stand-alone BLAST program was downloaded from the NCBI website [19]. A Perl script was written to run BLAST searches automatically between GeneChip® probe sequences and monkey/chimpanzee EST sequences. A probe sequence is always 25 nucleotides long, while the number of probes in a probeset varies from 11 to 20 depending on the GeneChip® platform (see Table 1). A certain degree of mismatch between a probe sequence and an EST is allowed. If a probe has at least 23 nucleotides complementary to at least one EST sequence, it is designated a complementary probe. A probeset with at least two complementary probes is defined as an ISC probeset. If all probes of a probeset are complementary probes, the probeset is called a 'perfect' probeset. The rationale for the definition of ISC probesets is as follows: 1) since each probe is a 25-mer oligo, the probability of random matching of one probe is $4^{-25}$; thus, the probability of random matching of two probes drops to $4^{-50}$, an exponential reduction; 2) in an RT-PCR experiment, the primer length is comparable to our probe length, and two primers (one forward and one reverse) usually generate a unique sequence in a whole genome; 3) each probe on the Affymetrix GeneChip® is designed to hybridize with a unique transcript in the whole transcriptome; and 4) since the EST sequences available for NHP are still very limited, most of them do not cover a whole transcript, so false negatives would be generated if all the probes in a probeset were required to be complementary to known ESTs. In order to convert probeset IDs to UniGene IDs and map them onto chromosomes, probeset annotation files were downloaded from the Affymetrix website [18].
No animals or human samples were used for the purpose of this analysis. The Affymetrix datasets used in this analysis are from other approved ongoing projects in our laboratory. The procedure used to process these samples was previously published [1].
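The ISC probeset definition given earlier in this section (a complementary probe has at least 23 matching nucleotides; an ISC probeset has at least two complementary probes; a 'perfect' probeset has only complementary probes) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' Perl script: the function name and the input format (one best match length per probe, already extracted from the BLAST results) are hypothetical, and the example match lengths are made up.

```python
MIN_MATCHED_BASES = 23        # threshold for a "complementary" probe (from the text)
MIN_COMPLEMENTARY_PROBES = 2  # threshold for an ISC probeset (from the text)


def classify_probeset(best_match_lengths):
    """Return (is_isc, is_perfect) for one probeset.

    best_match_lengths holds, for each probe in the probeset, the largest
    number of probe nucleotides matched by any EST (0 if there was no hit).
    """
    if not best_match_lengths:
        raise ValueError("a probeset needs at least one probe")
    n_complementary = sum(
        1 for n in best_match_lengths if n >= MIN_MATCHED_BASES
    )
    is_isc = n_complementary >= MIN_COMPLEMENTARY_PROBES
    is_perfect = n_complementary == len(best_match_lengths)
    return is_isc, is_perfect


# Made-up match lengths for three 11-probe probesets.
examples = {
    "two_hits": [25, 23] + [0] * 9,   # two complementary probes
    "one_hit": [25] + [22] * 10,      # 22 matches is below the threshold
    "all_hits": [25] * 11,            # every probe is complementary
}
for name, lengths in examples.items():
    print(name, classify_probeset(lengths))
```

For scale, the random-match probabilities quoted in the rationale are $4^{-25} = 2^{-50} \approx 8.9 \times 10^{-16}$ for one probe and $4^{-50} = 2^{-100} \approx 7.9 \times 10^{-31}$ for two.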
Briefly, peripheral blood from healthy human and NHP (Indian origin Rhesus macaque) donors was collected, and peripheral blood mononuclear cells (PBMCs) were separated by Histopaque-Ficoll (Sigma) gradient centrifugation. RNA preparation, hybridization, staining and scanning of the GeneChip® were carried out as described by Vahey et al. [1]. Animal and human samples were handled identically throughout the process. All 20 samples (12 human and 8 rhesus macaque) were hybridized to Affymetrix's HG-Focus GeneChip®.
Signal values and detection calls (present or absent) for all samples were determined using MAS5.0 (Affymetrix Inc., Santa Clara, California). Signal values were scaled to the default target signal intensity of 500. A matrix of detection calls (present, absent and marginal) and a matrix of signal intensities for all samples across all probesets were constructed. A probeset must be called 'present' in 50% or more of the samples of a species to be considered 'expressed'. In this study, an expressed probeset in human is a probeset that has 6 or more present calls among the 12 human samples. Similarly, an expressed probeset in monkey has 4 or more present calls among the 8 monkey samples. The signal intensities output from MAS5.0 were log2 transformed.
Model-based normalization was performed using dChip version 1.3 [17]. The output signal intensities were log2 transformed.
RMA (Robust Multichip Average) normalization [14–16] was carried out using the BioConductor package Affy_1.2.30 [20]. The rma() function in the package was used at its default settings, that is, 'RMA' background correction, 'quantile normalization', 'PM only model' and 'median polish summarization'. By default, the output signal intensities are already log2 transformed.
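The 'expressed' criterion and the log2 transformation applied to MAS5.0 signals can be sketched in Python as below. This is a minimal illustration, not the authors' code: the function names are assumptions, the single-letter call codes "P", "A" and "M" stand for present, absent and marginal, and the example calls and signal values are made up.

```python
import math


def is_expressed(calls, min_fraction=0.5):
    """True if at least min_fraction of the detection calls are 'P' (present).

    Marginal ('M') and absent ('A') calls do not count as present.
    """
    if not calls:
        raise ValueError("at least one detection call is required")
    n_present = sum(1 for call in calls if call == "P")
    return n_present >= min_fraction * len(calls)


def log2_signals(signals):
    """Log2-transform a list of positive signal intensities."""
    return [math.log2(value) for value in signals]


# Made-up detection calls for one probeset.
human_calls = ["P"] * 6 + ["A"] * 5 + ["M"]   # 6 of 12 present
monkey_calls = ["P"] * 3 + ["A"] * 5          # 3 of 8 present

print(is_expressed(human_calls))    # meets the 6-of-12 threshold
print(is_expressed(monkey_calls))   # falls short of the 4-of-8 threshold
print(log2_signals([250.0, 500.0, 1000.0]))
```

Because the threshold is applied per species, a probeset can be expressed in human but not in monkey, which is one source of the interspecies differences in detection calls.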
Intra- and interspecies correlation coefficients of signal intensities were calculated with the built-in function 'cor' in the statistical package R version 1.9.0 [21]. The correlation coefficient matrix was visualized with the function 'image'. The function 'heat.colors' was used to create a heat spectrum (red to white), with the color scale set between 0.5 (red) and 1.0 (white).
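The correlation analysis described above was done in R; the following Python sketch only mirrors the arithmetic, assuming the Pearson coefficient (the default method of R's 'cor'). The function names, sample labels and log2 signal values are all made up for illustration, and no plotting is attempted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with >= 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(samples):
    """Pairwise Pearson correlations for a dict of sample name -> profile."""
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}


def mean_profile(profiles):
    """Average several profiles probeset by probeset."""
    return [sum(values) / len(values) for values in zip(*profiles)]


# Made-up log2 signal intensities for four probesets.
samples = {
    "human_1": [8.0, 10.0, 6.5, 12.0],
    "human_2": [8.2, 9.8, 6.7, 11.9],
    "monkey_1": [7.0, 10.5, 8.0, 11.0],
    "monkey_2": [7.3, 10.1, 7.8, 11.2],
}
matrix = correlation_matrix(samples)
human_mean = mean_profile([samples["human_1"], samples["human_2"]])
monkey_mean = mean_profile([samples["monkey_1"], samples["monkey_2"]])
print(matrix["human_1"]["human_2"], matrix["human_1"]["monkey_1"])
print(pearson(human_mean, monkey_mean))
```

The last line corresponds to the species-level comparison reported in the paper, where each species is first averaged across its samples and the two averaged profiles are then correlated.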
NHP: non human primate
EST: expressed sequence tag
RMA: robust multi-chip average
MAS5.0: Microarray Suite version 5.0
The authors thank Dr. Deborah L. Birx, Director of the Military HIV-1 Research Program, for support of this effort and Drs. Nelson Michael and Christian Ockenhouse for helpful discussions. This work was supported in part by Cooperative Agreement no. W81XWH-04-2-0005 between the U.S. Army Medical Research and Materiel Command and the Henry M. Jackson Foundation for the Advancement of Military Medicine.
The opinions or assertions contained herein are the private views of the authors, and are not to be construed as official, or as reflecting the views of the Department of the Army or the Department of Defence.
- Vahey M, Nau M, Taubman M, Yalley-Ogunro J, Silvera P, Lewis M: Pattern of gene expression in peripheral blood mononuclear cells of Rhesus Macaques infected with SIVmac251 and exhibiting differential rates of disease progression. AIDS Res and Hum Retroviruses 2003, 19(5):369–387. 10.1089/088922203765551728
- Bigger CB, Brasky KM, Lanford RE: DNA microarray analysis of chimpanzee liver during acute resolving hepatitis C virus infection. J Virol 2001, 7059–7066. 10.1128/JVI.75.15.7059-7066.2001
- Marvanova M, Menager J, Bezard E, Bontrop RE, Pradier L, Wong G: Microarray analysis of nonhuman primates: validation of experimental models in neurological disorders. FASEB 2003, 929–931.
- Lachance PED, Chaudhuri A: Microarray analysis of development plasticity in monkey primary visual cortex. J Neurochem 2004, 88: 1455–1469.
- Caceres M, Lachuer J, Zapala MA, Redmond JC, Kudo L, Geschwind DH, Lockhart DJ, Preuss TM, Barlow C: Elevated gene expression levels distinguish human from non-human primate brains. Proc Natl Acad Sci USA 2003, 100(22):13030–13035. 10.1073/pnas.2135499100
- Enard W, Khaitovich P, Klose J, Zollner S, Heissig F, Giavalisco P, Nieselt-Struwe K, Muchmore E, Varki A, Ravid R, Doxiadis GM, Bontrop RE, Paabo S: Intra- and interspecific variation in primate gene expression patterns. Science 2002, 296: 340–343. 10.1126/science.1068996
- Uddin M, Wildman DE, Liu G, Xu W, Johnson RM, Hof PR, Kapatos G, Grossman LI, Goodman M: Sister grouping of chimpanzees and humans as revealed by genome-wide phylogenetic analysis of brain gene expression profiles. Proc Natl Acad Sci USA 2004, 101: 2957–2962. 10.1073/pnas.0308725100
- Fujiyama A, Watanabe H, Toyoda A, Taylor TD, Itoh T, Tsai S, Park H, Yaspo M, Lehrach H, Chen Z, Fu G, Saitou N, Osoegawa K, Jong PJ, Suto Y, Hattori M, Sakaki Y: Construction and Analysis of a human-chimpanzee comparative clone map. Science 2002, 295: 131–134. 10.1126/science.1065199
- Britten RJ: Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. Proc Natl Acad Sci USA 2002, 99(21):13633–13635. 10.1073/pnas.172510699
- Anzai T, Shiina T, Kimura N, Yanagiya K, Kohara S, Shigenari A, Yamagata T, Kulski JK, Naruse TK, Fujimori Y, Fukuzumi Y, Yamazaki M, Tashiro H, Iwamoto C, Umehara Y, Imanishi T, Meyer A, Ikeo K, Gojobori T, Bahram S, Inoko H: Comparative sequencing of human and chimpanzee MHC class I regions unveils insertions/deletions as the major path to genomic divergence. Proc Natl Acad Sci USA 2003, 100(13):7708–7713. 10.1073/pnas.1230533100
- Brett D, Hanke J, Lehmann G, Haase S, Delbruck S, Krueger S, Reich J, Bork P: EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett 2000, 474: 83–86. 10.1016/S0014-5793(00)01581-7
- Modrek B, Resch A, Grasso C, Lee C: Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res 2001, 29: 2850–2859. 10.1093/nar/29.13.2850
- Chismar JD, Mondala T, Fox HS, Roberts E, Langford D, Masliah E, Salomon DR, Head SR: Analysis of result variability from high-density oligonucleotide arrays comparing same-species and cross-species hybridizations. BioTechniques 2002, 33: 516–524.
- Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31(4):e15. 10.1093/nar/gng015
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics 2003, 4(2):249–264. 10.1093/biostatistics/4.2.249
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics 2003, 19(2):185–193. 10.1093/bioinformatics/19.2.185
- Li C, Wong WH: Model-based analysis of oligonucleotides arrays: Expression index computation and outlier detection. Proc Natl Acad Sci USA 2001, 98: 31–36. 10.1073/pnas.011404098
- Affymetrix. [https://www.affymetrix.com/index.affx]
- NCBI. [http://www.ncbi.nlm.nih.gov]
- BioConductor. [http://www.bioconductor.org]
- R. [http://www.r-project.org]
This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.