A verification protocol for the probe sequences of Affymetrix genome arrays reveals high probe accuracy for studies in mouse, human and rat

Alberts, Rudi; Terpstra, Peter; Hardonk, Menno; Bystrykh, Leonid V; de Haan, Gerald; Breitling, Rainer; Nap, Jan-Peter; Jansen, Ritsert C

doi:10.1186/1471-2105-8-132

Research article
Open access
Published: 20 April 2007

BMC Bioinformatics volume 8, Article number: 132 (2007)

Abstract

Background

The Affymetrix GeneChip technology uses multiple probes per gene to measure its expression level. Individual probe signals can vary widely, which hampers proper interpretation. This variation can be caused by probes that do not properly match their target gene or that match multiple genes. To determine the accuracy of Affymetrix arrays, we developed an extensive verification protocol, which for mouse arrays incorporates the NCBI RefSeq, NCBI UniGene Unique, NIA Mouse Gene Index, and UCSC mouse genome databases.

Results

Applying this protocol to Affymetrix Mouse Genome arrays (the earlier U74Av2 and the newer 430 2.0 array), the number of sequence-verified probes with perfect matches was no less than 85% and 95%, respectively; and for 74% and 85% of the probe sets all probes were sequence verified. The latter percentages increased to 80% and 94% after discarding one or two unverifiable probes per probe set, and even further to 84% and 97% when, in addition, allowing for one or two mismatches between probe and target gene. Similar results were obtained for other mouse arrays, as well as for human and rat arrays. Based on these data, refined chip definition files for all arrays are provided online. Researchers can choose the version appropriate for their study to (re)analyze expression data.

Conclusion

The accuracy of Affymetrix probe sequences is higher than previously reported, particularly on newer arrays. Yet, refined probe set definitions have clear effects on the detection of differentially expressed genes. We demonstrate that the interpretation of the results of Affymetrix arrays is improved when the new chip definition files are used.

Background

Microarrays are widely used to study genome-wide gene expression levels. A frequently used type of microarray is the Affymetrix GeneChip [1]. This technology uses multiple probes per gene (probe set) to measure the amount of mRNA present (target). For reasons of specificity, probes are chosen to be complementary to a unique part of the target sequence. Although all probes from a single probe set should measure the same amount of mRNA, the hybridization signals of individual probes for a given mRNA molecule may vary widely. This is believed to be caused by variations in molecular characteristics of the probe sequence, such as GC content and secondary structure, and corrections have been proposed to calculate true expression levels averaged over probe signals [2, 3]. However, another reason for the variation in signal between probes could be misdesigned probes, that either do not match the target RNA or can hybridize with other, non-target, RNA molecules. For correct interpretation of the results of Affymetrix GeneChip hybridizations, it is important to know which probes may cause variation in hybridization and for what reason. For example, in our large scale genetical genomics applications [4–6], individual probe hybridizations are used to map regulatory regions in a genome. In such applications, it is important to be able to rule out potential false positive results due to misdesigned probes.

An earlier analysis of the probe sequences of the Affymetrix mouse genome U74Av2 array [7] against the RefSeq database showed that for only 51% of the probe sets on the array all probes could be 'entirely verified', that is, corresponded without any mismatch to a RefSeq mRNA sequence. A recent analysis at the individual probe level verified 73% of the individual probe sequences of the MG-U74Av2 array against mRNA sequences from Entrez [8]. Affymetrix supplies regular updates of probe set verifications using new releases of the RefSeq, GenBank and Ensembl databases [9, 10]. In the July 2006 release, 70% of the probe sets of the MG-U74Av2 GeneChip are 'entirely verified'. These surprisingly low verification percentages suggest that a major part of the hybridization results of such an array should be regarded with caution. Little information is available on the possibility of hybridization of individual mouse probes with non-target RNA molecules [8]. Here we present an extensive and generalized protocol for the verification of probe sequences on Affymetrix arrays.

The protocol uses four databases: NCBI RefSeq, NCBI UniGene Unique, NIA Mouse Gene Index, and UCSC mouse genome. By incorporating these databases in the verification protocol, the number of sequence-verified probes of the Affymetrix mouse arrays increases considerably. The same protocol applied to other mouse arrays, or a similar protocol (based on RefSeq, UniGene Unique and UCSC genome) for human and rat arrays, yielded similar results. Refined chip definition files (CDF files), which include only verified probes, are provided online.

We conclude that, with the corrections proposed previously [2, 3], the accuracy and reliability of the Affymetrix arrays are considerably higher than reported until now. Our new data on probe verification and cross-hybridization are important for assessing unexpected behaviour of any given individual probe in a given experiment and will contribute to the more accurate assessment of expression data using Affymetrix arrays.

Results

Quality of sequence databases

The verification protocol for mouse arrays makes use of three messenger databases (NCBI RefSeq, NCBI UniGene Unique, NIA Mouse Gene Index), and one genome database (UCSC mouse genome). We first assessed their quality. Assuming that the genome is the most accurate sequence with an error rate of less than 1 in 10,000 bases [11, 12], we compared the sequence of 1000 randomly selected genes, all occurring in each of the three messenger databases, to the genome (see Methods). Table 1 shows that there were no major quality differences between the messenger databases, except that the NIA Mouse Gene Index showed a lower mismatch frequency. The sequence differences that are observed may be due to sequencing errors, but also to genetic polymorphisms between mouse strains. This means that each database contains reliable information and can be used to verify probe sequences.

Table 1 Comparison of sequence databases

The verification protocol

In the protocol for mouse arrays, we use the BLAST program to verify all probe sequences against the three messenger databases (see Methods). Using the terminology of Mecham et al. [7], for each probe set we determine per database whether it is

'entirely verified', meaning that all probes were identical to a messenger sequence;
'partially verified', meaning that only a subset of probes was identical to a messenger sequence;
'entirely unverified', meaning that none of the probes was identical to a messenger sequence.

Only probe sets that could not be classified as 'entirely verified' against one of the three messenger databases were verified against the genome (see Methods). Each probe set is assigned a verification score which is the best score over all databases, where 'entirely verified' is better than 'partially verified', and 'partially verified' is better than 'entirely unverified'. For the final verification score the order of the databases does not matter, since each probe set is assigned the best possible score.
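The scoring rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the names `Score`, `classify` and `final_score` are hypothetical, and the per-probe match results are assumed to be available as lists of booleans (one entry per probe, `True` when the probe is identical to a sequence in the database).

```python
from enum import IntEnum


class Score(IntEnum):
    """Verification classes, ordered so that a larger value is a better score."""
    ENTIRELY_UNVERIFIED = 0
    PARTIALLY_VERIFIED = 1
    ENTIRELY_VERIFIED = 2


def classify(probe_matches):
    """Classify one probe set against one database.

    probe_matches: list of bool, one per probe (hypothetical input format).
    """
    n_matching = sum(probe_matches)
    if n_matching == 0:
        return Score.ENTIRELY_UNVERIFIED
    if n_matching == len(probe_matches):
        return Score.ENTIRELY_VERIFIED
    return Score.PARTIALLY_VERIFIED


def final_score(matches_per_database):
    """Best score over all databases; the database order is irrelevant."""
    return max(classify(m) for m in matches_per_database.values())


example = {
    "RefSeq": [True, False, True],
    "UniGene Unique": [False, False, False],
    "UCSC genome": [True, True, True],
}
print(final_score(example).name)
```

Because the final score is a maximum, checking the genome last, and only for probe sets not yet 'entirely verified', changes the running time but not the outcome.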

We included all mentioned databases in the protocol to obtain the greatest coverage. Since the genome sequence database is much larger than the messenger databases and therefore the verification against the genome takes much longer, we have put the verification against the genome in the last position. This improves computational efficiency. The verification is not hampered by the lower accuracy of the messenger databases compared to the genome, since only 0.60% (0.34%) of the probe sets of the MG-U74Av2 (430 2.0) array were 'entirely verified' against one of the messenger databases but 'entirely unverified' against the genome. We examined some of the probe sets that were 'entirely unverified' against the genome in more detail. These seem to represent contaminated non-mouse sequences, or the tiny fraction of genes that are still missing from the assembled genomes. Because there are no major quality differences between the messenger databases, their order is in principle arbitrary. However, we have put RefSeq in the first position in the protocol since it contains the most intensively curated transcript sequence information and probe sets that are 'entirely verified' against this database exit the protocol with RefSeq gene identifiers (supplementary material).

Verification of the U74 and 430 arrays

We here report the results of the application of the verification protocol to two mouse GeneChips, MG-U74Av2 and 430 2.0, to which we will refer as 'U74' and '430', respectively. The percentages of 'entirely verified' probe sets are reported in a cumulative way, i.e. they grow with every database added; see Figure 1.

Megablasting all probe sequences of the U74 array against the mouse NCBI RefSeq database 'entirely verified' only 53% of all probe sets; this confirms the 51% reported earlier [7] with an older version of the RefSeq database. From the 430 array, only 46% of all probe sets could be 'entirely verified' (Figure 1). Next, by including the UniGene Unique database we 'entirely verified' 59% and 56% of all probe sets in U74 and 430, respectively. Then, by including the NIA Mouse Gene Index, the percentages grow to 69% and 74%, respectively. Finally, the remaining probe sets that were not yet 'entirely verified' were verified against the UCSC mouse genome database. In this way, we 'entirely verified' 74% and 85% of all probe sets in U74 and 430, respectively. More detailed numbers of the contribution of each of the databases to the final verification are given in additional file 1: 'Verification scores for the Affymetrix U74 array' and additional file 2: 'Verification scores for the Affymetrix 430 array'.

Most 'partially verified' probe sets contain at most two bad probes

In this verification protocol, the class 'partially verified' is heterogeneous in nature. For a proper interpretation of the hybridization signals of a given probe set, it may be required to know how many and which probes of a particular set are not giving a perfect match with the mouse genome data available. In Figure 2 we have plotted the number of perfectly matching probes for those probe sets that were categorized in the 'partially verified' class. This shows that mainly one or two probes per probe set give a less than perfect match. Especially in case of the U74 array, where 16 probes per gene are present, the hybridization results of such non-perfect probes could be disregarded and the remaining probe set can be considered 'entirely verified'. In the supplementary material, the precise identification of these probes can be retrieved. By repeating the protocol and allowing one or two non-perfect probes per probe set, 80% and 94% of the probe sets of U74 and 430 were 'entirely verified', respectively.
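The relaxed rule in this paragraph, where up to two non-perfect probes are discarded and the remaining probes define the probe set, might look as follows. This is a sketch under assumptions: the function name, the boolean input format and the returned tuple are illustrative choices and are not taken from the supplementary material.

```python
def verified_after_dropping(probe_matches, max_dropped=2):
    """Decide whether a probe set counts as 'entirely verified' once at most
    `max_dropped` non-perfect probes are discarded.

    probe_matches: list of bool, one per probe (hypothetical input format).
    Returns (is_verified, kept_indices), where kept_indices are the positions
    of the perfectly matching probes that remain in the refined probe set.
    """
    kept = [i for i, ok in enumerate(probe_matches) if ok]
    n_bad = len(probe_matches) - len(kept)
    is_verified = n_bad <= max_dropped and len(kept) > 0
    return is_verified, kept


# A 16-probe set (as on the U74 array) with two non-perfect probes.
ok, kept = verified_after_dropping([True] * 14 + [False] * 2)
print(ok, len(kept))
```

The kept indices are what a refined chip definition would retain for that probe set.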

Allowing mismatches

Laboratory experience has shown that often the hybridization conditions do not allow distinction between a perfect match and a mismatch probe [13]. In this context, it could be argued that the requirement for a perfect match in probe sequence verification is not necessary, especially when only PM signals are used for estimating the expression levels, as is the case for most modern probe summarization methods (RMA, GCRMA). Moreover, messenger databases contain sequencing errors. For these two reasons, we have repeated the verification protocol as established above while allowing either one or two mismatches per probe sequence; 26% and 47% of the unverified probes had one or two mismatches between probe and target for U74 and 430, respectively. Figure 1 shows that the percentage of 'entirely verified' probe sets increases considerably, up to 77% for U74 and 91% for 430 in case of one mismatch and up to 79% for U74 and 93% for 430 for two mismatches. If we restrict ourselves to probe sets labeled by Affymetrix with "_at" then 85% of the probe sets are 'entirely verified' for U74 and 92% for 430 in case of one mismatch, and 87% for U74 and 93% for 430 for two mismatches. If we allow for two mismatches and also drop one or two unverifiable probes then 84% and 97% of all probe sets of U74 and 430 were 'entirely verified'. The hybridization conditions of the individual laboratory will have to decide which validation scheme is most appropriate and which probes or probe sets have to be scrutinized with more care.
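To make the notion of 'allowing one or two mismatches' concrete, the sketch below counts mismatches between a probe and the best ungapped position in a target sequence. It is a naive scan for illustration only and is not Megablast; it assumes that probe and target are given in the same orientation and the same letter case, and the function names are hypothetical.

```python
def min_mismatches(probe, target):
    """Smallest number of mismatching bases over all ungapped placements of
    `probe` along `target`; None if the target is shorter than the probe."""
    k = len(probe)
    if len(target) < k:
        return None
    best = k
    for start in range(len(target) - k + 1):
        window = target[start:start + k]
        mismatches = sum(a != b for a, b in zip(probe, window))
        best = min(best, mismatches)
    return best


def is_verified(probe, target, allowed_mismatches=0):
    """True if the probe matches the target with at most the allowed mismatches."""
    m = min_mismatches(probe, target)
    return m is not None and m <= allowed_mismatches


print(is_verified("ACGTACGT", "TTACGAACGTTT", allowed_mismatches=0))
print(is_verified("ACGTACGT", "TTACGAACGTTT", allowed_mismatches=1))
```

With `allowed_mismatches=0` this reduces to the perfect-match criterion used in the first pass of the protocol.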

Cross-hybridization

Another issue of quality control is the specificity of probe sequences. A probe set may be 'entirely verified' with a given gene, yet an individual probe from such a set may be identical, or more similar than desired, to the sequence of another gene. This may cause cross-hybridization of different mRNAs and give rise to a probe that yields a hybridization signal that differs markedly from the other probe sequences. For the U74 array, 17% of the probes in 'entirely verified' probe sets had more than one Megablast hit against the RefSeq, UniGene Unique and/or NIA databases; 23% of the verified probe sets had at least one such probe with multiple Megablast hits. For the 430 array the percentages are 15% and 18%, respectively. The numbers of cross-hybridizing probes per verified probe set are given in Figure 3. Note that the genome has not been used to assess cross-hybridization, since probe selection regions were used and individual probes were not compared with the genome. In the majority of probe sets with cross-hybridizing probes, all probes are cross-hybridizing. This indicates different splicing variants or duplicated genes that have different identifiers but cannot be distinguished by these probe sets. Again, the individual laboratory will have to decide which probes or probe sets have to be scrutinized with more care.
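The two percentages reported per array (probes with multiple hits, and probe sets containing at least one such probe) can be computed as in the following sketch. The data structures are assumptions made for illustration: `probe_sets` maps a probe set identifier to its probe identifiers, and `probe_hits` maps a probe identifier to the set of distinct database sequences it hit. Treating 'more than one hit' as 'more than one distinct target' is also an assumption.

```python
def cross_hybridization_summary(probe_sets, probe_hits):
    """Return (fraction of probes with multiple hits,
               fraction of probe sets with at least one such probe).

    Assumes a non-empty collection of non-empty probe sets.
    """
    total_probes = 0
    flagged_probes = 0
    flagged_sets = 0
    for probes in probe_sets.values():
        n_multi = sum(1 for p in probes if len(probe_hits.get(p, ())) > 1)
        total_probes += len(probes)
        flagged_probes += n_multi
        if n_multi > 0:
            flagged_sets += 1
    return flagged_probes / total_probes, flagged_sets / len(probe_sets)


# Toy input with hypothetical identifiers.
probe_sets = {"set1": ["p1", "p2"], "set2": ["p3", "p4"]}
probe_hits = {"p1": {"g1"}, "p2": {"g1", "g2"}, "p3": {"g3"}, "p4": {"g3"}}
print(cross_hybridization_summary(probe_sets, probe_hits))
```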

Verification of all available human, mouse and rat arrays confirms high probe accuracy

We applied the protocol to all other Affymetrix mouse arrays and we developed a similar protocol consisting of the three databases RefSeq, UniGene Unique and UCSC genome for the analysis of all human and rat Affymetrix arrays. Table 2 shows the results; the most striking observation is that, except for the human X3P array, the newer arrays show high accuracy of probe sequences. New chip definition files for these 30 arrays can be downloaded from [14]. One can choose CDF files with or without cross-hybridizing probes and allowing for 0, 1 or 2 mismatches between sequence and probe.

Table 2 Percentage of verified probe sets for all Affymetrix human, mouse and rat arrays analyzed

The impact of updated probe set definitions on expression data

Microarrays are often used to find genes that are differentially expressed. To assess the impact of the updated probe set definitions on the assessment of differential gene expression, we reanalyzed an example dataset, the Clinical Prostate Cancer Behavior dataset (see Methods), consisting of 52 prostate tumor RNA samples and 50 non-diseased RNA samples hybridized to the human HG-U95Av2 array. Using RankProducts (Methods), we calculated lists of differentially expressed genes, both using the original Affymetrix CDF file and the new CDF file. 943 upregulated probe sets were detected with both CDF files, 32 probe sets were detected only with the new CDF file and 41 probe sets were detected only with the original CDF file (at a significance level of p < 0.05, Bonferroni adjusted; similar numbers were found for the downregulated genes).

This result only shows that there are differences between the two CDF files. To check if refining the probe set definition indeed improves the results, we performed additional testing. For this purpose we focused on those genes that are most strongly affected, i.e. those genes whose rank in the list created with the original CDF file and the rank in the list created with the new CDF file are most different. Such genes will appear as differentially expressed in one list but not the other. If random probes are different between the two analyses, this difference can be both ways, with equal probability. However, we predict that improved probe sets will result in better detection of differential expression, as non-verifiable probes probably do not show differential expression and hence weaken the differential expression of the whole probe set. In that case, the genes that differ between the two lists should more often be detected as differentially expressed with the new CDF file compared to the original CDF file, than the other way around. Focusing on the genes with the highest differences in ranks, a significant proportion (p < 1E-10; Wilcoxon signed rank test) had a higher (better) rank in the list created with the new CDF file compared to the list created with the original CDF file (Table 3, columns a, b), confirming our prediction. Of the 250 probe sets that showed the highest improvement in rank, 100% had been redefined. This indicates that a significant number of genes are picked up as differentially expressed with the new CDF file, while they remain undetected using the original CDF file.

Table 3 Comparison of lists of differentially expressed genes created with original and new CDF files.

To verify that this observed improvement of results is consistent in other datasets and platforms, we repeated this evaluation procedure for a dataset of 34 smoker vs. 23 non-smoker samples from intra-pulmonary airway epithelial cells hybridized to HG-U133A arrays and a dataset of 4 male vs. 4 female BWF1 lupus-prone mice spleen samples hybridized to MG-U74Av2 arrays. We saw the same clear improvement, with high statistical significance (Table 3). As outlined above, we expect that random changes in the probe set definition would lead to equal numbers of genes being affected in either direction. We calculated the difference between the observed number of genes having a higher rank with the new CDF and the expected number (n/2), for different values of n, the number of most strongly affected genes considered. We used the maximum excess as an estimate of the number of probe sets that are significantly improved by refining the CDF files. Depending on the array, these numbers range between 321 and 658. Although these numbers are small compared to the total number of genes present on the array, they comprise a large fraction of the genes that are typically found to be differentially expressed in a microarray experiment.
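The 'maximum excess' calculation can be illustrated with a short sketch. It assumes that each gene has a rank in both lists (a smaller rank number meaning stronger evidence of differential expression), that genes are ordered by decreasing absolute rank difference, and that n runs over the top 1, 2, ... genes of that ordering. These details and the function names are assumptions for illustration and are not taken from the original analysis scripts.

```python
def improved_flags(rank_original, rank_new):
    """Order genes by decreasing |rank difference| and flag, for each gene,
    whether its rank is better (numerically smaller) with the new CDF."""
    genes = sorted(
        rank_original,
        key=lambda g: abs(rank_original[g] - rank_new[g]),
        reverse=True,
    )
    return [rank_new[g] < rank_original[g] for g in genes]


def maximum_excess(flags):
    """Maximum over n of (observed improved genes among the top n) - n/2."""
    best = 0.0
    observed = 0
    for n, improved in enumerate(flags, start=1):
        observed += improved
        best = max(best, observed - n / 2)
    return best


# Toy ranks for three hypothetical genes.
rank_original = {"a": 10, "b": 5, "c": 1}
rank_new = {"a": 2, "b": 4, "c": 7}
flags = improved_flags(rank_original, rank_new)
print(flags, maximum_excess(flags))
```

Under the null expectation that changes go either way with equal probability, the excess stays near zero for every n, which is why its maximum serves as a rough estimate of the number of genuinely improved probe sets.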

Discussion

In different studies [7, 8, 15, 16] Affymetrix probe sequences were verified against mRNA databases. In all of these studies, only one mRNA database was considered. Gautier et al. [15] and Zhang et al. [16] verified human Affymetrix arrays against mRNA sequences from Entrez and RefSeq. Elo et al. [8] investigated the reproducibility of the probe signals for different generations of Affymetrix arrays. They compared the correlations of probe signals for original Affymetrix probe sets and verified probe sets, which they defined as the subset of probes of the original probe sets that only match with the target transcript for which the probes were originally designed by Affymetrix. They found that probe verification improved the correlations between generations of Affymetrix arrays and also that probe verification improved the consistency of the measurements within an array. Mecham et al. [7] showed that probe verification results in increased precision in technical replicates, increased accuracy across complementary microarray platforms, increased accuracy translating data from oligonucleotide arrays to cDNA microarrays, and increased diagnostic power of microarray technology.

A problem with the RefSeq and the UniGene Unique databases is that 3' UTRs are often truncated by the way the sequences are assembled [17, 18], while Affymetrix selects the probes from the 600 bases most proximal to the 3' end of each transcript [19]. We overcame this problem by incorporating the genome in the verification protocol, where all 3' UTRs are available.

The Fantom 3 project (Functional annotation of the mouse, [20]) provides an extensive characterization of the mouse transcriptome. We also tested the verification protocol with the Fantom 3 transcripts included. Since this did not increase our verification scores (data not shown), we did not include this database in our protocol.

The mRNA and genome databases currently available are mainly based on the C57BL/6 mouse strain. Also, the probes on the Affymetrix arrays are mainly based on the C57BL/6 mouse strain. When samples from C57BL/6 mice are hybridized to the arrays, their transcripts are expected to perfectly match the probes. However, mice from genetically different strains or from recombinant inbred pedigrees, as in our genetical genomics applications [4, 6], may carry allelic SNPs compared to the C57BL/6 genome. Probes carrying allelic SNPs may hamper data interpretation as putative differential mRNA expression can be confounded with differential hybridization [4]. When sequences of other mouse strains become available, the verification protocol developed here should be repeated for these newly sequenced strains to identify and, if so desired, eliminate probes carrying allelic SNPs.

The use of refined probe set definitions that exclude unverified probes will improve the interpretation of expression data, as non-hybridizing and mis-hybridizing probes add only noise to the data. Our evaluation of expression data from the public domain shows that this is indeed the case.

Conclusion

By combining various verifications as described above, we show that 74% of the U74 probe sets and 85% of the 430 probe sets can be considered 'entirely verified' when based on perfect matches. When two mismatches are allowed, the percentages increase to 79% for U74 and 93% for 430. When considering individual probes, 85% and 95% of the probes were verified for U74 and 430 respectively, and even 89% and 97% when allowing two mismatches. Our extensive analyses of probe sequence data show that, when various databases including the genome sequence are taken into account, the arrays are much more accurate than shown previously. Existing data can be reanalyzed with our verified probe sets (using the online CDF files). We show that such a refined probe set definition has clear effects on the detection of differentially expressed genes and demonstrate for various experiments that the results are systematically improved by discarding unverified probes.

Methods

Affymetrix GeneChips

Probe set annotations and probe selection regions (PSR) for all human, mouse and rat arrays, were obtained from Affymetrix [9, 10].

The U74Av2 array is based on the mouse UniGene database, release 74. It contains 196,670 oligomers of length 25, divided into 12,422 probe sets, most of which contain 16 oligomers. Probe sets of the newer 430 2.0 array were selected from sequences derived from dbEST (NCBI, June 2002), GenBank (NCBI, Release 129, April 2002), and RefSeq (NCBI, June 2002) [21]. It contains 495,374 oligomers of length 25, divided into 45,037 probe sets, generally consisting of 11 oligomers.

Sequence databases

RefSeq is a curated non-redundant collection of naturally occurring DNA, RNA and protein sequences. It is based on the sequences and annotations supplied to GenBank by the original researchers [17]. For mouse we used 55,810 messenger sequences from RefSeq.

UniGene is a processed and curated collection of millions of ESTs (Expressed Sequence Tags), which are relatively inaccurate (around 2% error). To assign ESTs to genes, the ESTs are clustered and the cluster consensus sequences stored in UniGene Unique [18]. The mouse UniGene Unique release contains 43,104 sequences.

NIA Mouse Gene Index (developed by the National Institute on Aging) is currently the most comprehensive collection of alternative transcription/splicing sequences. Patterns of alternative transcription/splicing are obtained by aligning a complete and nonredundant transcriptome assembly from expressed sequences (obtained from RefSeq, GenBank, dbEST, Ensembl and NIA) to the mouse genome [22]. The NIA Mouse Gene Index contains 186,405 sequences.

The UCSC mouse genome (maintained by University of California Santa Cruz) reports about 90% of the genome in finished form (error rate of less than 1 in 10,000 bases). We used build mm7 (corresponding to NCBI build 35.1; August, 2005).

For the mouse protocol we used two NCBI [23] databases: RefSeq mRNAs (NCBI, Feb. 3, 2006) and UniGene Unique (NCBI, build 151, Oct. 20, 2005). In addition, we used all mouse mRNA sequences from the National Institute on Aging (NIA Mouse Gene Index 5, June 2005, [22]) and the UCSC mouse genome (mm7, Aug. 2005, [11]). For the human protocol we used RefSeq mRNAs (NCBI, Feb. 16, 2006), UniGene Unique (NCBI, build 188, Dec. 30, 2005) and UCSC human genome (hg17, May 2004). For the rat protocol we used RefSeq mRNAs (NCBI, Feb. 16, 2006), UniGene Unique (NCBI, build 149, Jan. 25, 2006) and UCSC rat genome (rn3, June 2003).

Assessment of the quality of the sequence databases

To assess the quality of the sequence databases, we took the UCSC genome sequence as a reference, and compared the sequences of 1000 randomly selected genes, all occurring in each of the three messenger databases, to the genome sequence. Since the genome contains introns and the messenger databases do not, we extracted the exon sequences from the genome by using the exon coordinates of RefSeq genes and attached them to each other. Then for each of the 1000 genes we compared the three messenger sequences to the reconstructed genome sequence and counted the amounts of mismatches and gaps (Table 1).
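The exon-joining step described above can be sketched as follows. This is a minimal illustration under assumptions: exon coordinates are taken to be 0-based, half-open (start, end) pairs on the forward strand of a chromosome sequence held in memory, and the function names are hypothetical. Counting gaps would additionally require an alignment step, which is not shown.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def spliced_sequence(chrom_seq, exons, strand="+"):
    """Concatenate exon sequences in genomic order to reconstruct a
    messenger-like sequence; reverse-complement it for minus-strand genes.

    exons: iterable of (start, end) pairs, 0-based and half-open (assumed).
    """
    joined = "".join(chrom_seq[start:end] for start, end in sorted(exons))
    return reverse_complement(joined) if strand == "-" else joined


def count_mismatches(seq_a, seq_b):
    """Position-wise mismatches between two sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length; align them first")
    return sum(a != b for a, b in zip(seq_a, seq_b))


chrom = "AAACCCGGGTTT"
print(spliced_sequence(chrom, [(0, 3), (6, 9)]))
print(spliced_sequence(chrom, [(0, 3), (6, 9)], strand="-"))
```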

Sequence alignment algorithms

Individual probes were analyzed against the messenger databases with Megablast (version 2.2.6 with a word size of 12, [24]) for 'short nearly exact matches'. Hits in databases were distinguished on the basis of none, one or two mismatches with the probe sequence.

Since analysis of all single probe sequences against the mouse genome gives too many non-exon hits (data not shown), we used the probe selection region (PSR) of each probe set as input for BLAT ([11], standalone BLAT version 32x1, standard settings). PSR is defined as the unique part of the messenger sequence from which Affymetrix selected the probes [19]. We masked all nucleotides not represented in probe sequences. Within the obtained BLAT hits of the masked PSRs, we re-identified the position of each probe to count the number of mismatches per probe.
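The masking step can be pictured with the sketch below, which keeps only those PSR nucleotides that are covered by at least one probe. How the masking was actually implemented is not stated in the text, so the mask character 'N', the 0-based probe start positions and the function name are all assumptions made for illustration.

```python
def mask_psr(psr, probe_starts, probe_length=25, mask_char="N"):
    """Replace every PSR nucleotide that is not covered by any probe.

    psr: probe selection region sequence.
    probe_starts: 0-based, non-negative start positions of the probes
        within the PSR (assumed input format).
    """
    covered = [False] * len(psr)
    for start in probe_starts:
        for i in range(start, min(start + probe_length, len(psr))):
            covered[i] = True
    return "".join(base if keep else mask_char for base, keep in zip(psr, covered))


# Toy example with a short "probe" length so the effect is visible.
print(mask_psr("ACGTACGTAC", probe_starts=[2], probe_length=3))
```

Because positions are preserved by the masking, the location of each probe inside a BLAT hit of the masked PSR can be recovered from its start position.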

Computing infrastructure

All analyses were performed on a Linux cluster consisting of 200 nodes, each with dual 2 GHz Opteron processors and 1 GB memory. The average computation time per array was 4 hours on one node.

Datasets and methods for determining the impact of updated probe set definitions on expression data

The Clinical Prostate Cancer Behavior dataset was downloaded from [25]. The smoker vs. non-smoker dataset was downloaded from the Gene Expression Omnibus (GEO) and has accession number GSE994. The male vs. female BWF1 lupus-prone mice dataset was also downloaded from GEO (accession number GSE2336). In all cases we used RMA [26] to generate probe set-level data. Using RankProducts [27] we calculated ranked lists of differentially expressed genes using Affymetrix' original CDF file and our refined CDF file, while separating up- and downregulated genes.

References

Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 1996, 14: 1675–1680. 10.1038/nbt1296-1675
Zhang L, Miles MF, Aldape KD: A model of molecular interactions on short oligonucleotide microarrays. Nat Biotechnol 2003, 21: 818–821. 10.1038/nbt836
Wu Z, Irizarry RA: Preprocessing of oligonucleotide array data. Nat Biotechnol 2004, 22: 656–658. 10.1038/nbt0604-656b
Alberts R, Terpstra P, Bystrykh LV, de Haan G, Jansen RC: A statistical multiprobe model for analyzing cis and trans genes in genetical genomics experiments with short-oligonucleotide arrays. Genetics 2005, 171: 1437–1439. 10.1534/genetics.105.045930
Jansen RC, Nap JP: Genetical genomics: the added value from segregation. Trends Genet 2001, 17: 388–391. 10.1016/S0168-9525(01)02310-1
Bystrykh L, Weersing E, Dontje B, Sutton S, Pletcher MT, Wiltshire T, Su AI, Vellenga E, Wang J, Manly KF, Lu L, Chesler EJ, Alberts R, Jansen RC, Williams RW, Cooke MP, de Haan G: Uncovering regulatory pathways that affect hematopoietic stem cell function using 'genetical genomics'. Nat Genet 2005, 37: 225–232. 10.1038/ng1497
Mecham BH, Wetmore DZ, Szallasi Z, Sadovsky Y, Kohane I, Mariani TJ: Increased measurement accuracy for sequence-verified microarray probes. Physiol Genomics 2004, 18: 308–315. 10.1152/physiolgenomics.00066.2004
Elo LL, Lahti L, Skottman H, Kylaniemi M, Lahesmaa R, Aittokallio T: Integrating probe-level expression changes across generations of Affymetrix arrays. Nucleic Acids Res 2005, 33: e193. 10.1093/nar/gni193
NetAffx Analysis Center: Affymetrix. 2006. [http://www.affymetrix.com/analysis/index.affx]
Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, Kulp D, Siani-Rose MA: NetAffx: Affymetrix probesets and annotations. Nucleic Acids Res 2003, 31: 82–86. 10.1093/nar/gkg121
UCSC Genome Browser. 2006. [http://genome.ucsc.edu]
Genome Glossary. 2006. [http://www.ncbi.nlm.nih.gov/genome/guide/glossary.htm]
Naef F, Magnasco MO: Solving the riddle of the bright mismatches: labeling and effective binding in oligonucleotide arrays. Phys Rev E Stat Nonlin Soft Matter Phys 2003, 68: 011906.
GBiC supplementary data. 2006. [http://gbic.biol.rug.nl/supplementary/2006/probeverification]
Gautier L, Moller M, Friis-Hansen L, Knudsen S: Alternative mapping of probes to genes for Affymetrix chips. BMC Bioinformatics 2004, 5: 111. 10.1186/1471-2105-5-111
Zhang J, Finney RP, Clifford RJ, Derr LK, Buetow KH: Detecting false expression signals in high-density oligonucleotide arrays by an in silico approach. Genomics 2005, 85: 297–308. 10.1016/j.ygeno.2004.11.004
Pruitt K, Tatusova T, Ostell J: The Reference Sequence (RefSeq) Project. In The NCBI Handbook. Edited by: McEntyre J and Ostell J. Bethesda (MD), National Library of Medicine; 2002:18–1-18–20.
Pontius JU, Wagner L, Schuler GD: UniGene: A Unified View of the Transcriptome. In The NCBI Handbook. Edited by: McEntyre J and Ostell J. Bethesda (MD), National Library of Medicine; 2002:21–1-21–12.
Affymetrix Technical Note. 2006. [http://www.affymetrix.com/support/technical/technotes/mouse430_technote.pdf]
Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C, Kodzius R, Shimokawa K, Bajic VB, Brenner SE, Batalov S, Forrest AR, Zavolan M, Davis MJ, Wilming LG, Aidinis V, Allen JE, Ambesi-Impiombato A, Apweiler R, Aturaliya RN, Bailey TL, Bansal M, Baxter L, Beisel KW, Bersano T, Bono H, Chalk AM, Chiu KP, Choudhary V, Christoffels A, Clutterbuck DR, Crowe ML, Dalla E, Dalrymple BP, de BB, Della GG, di BD, Down T, Engstrom P, Fagiolini M, Faulkner G, Fletcher CF, Fukushima T, Furuno M, Futaki S, Gariboldi M, Georgii-Hemming P, Gingeras TR, Gojobori T, Green RE, Gustincich S, Harbers M, Hayashi Y, Hensch TK, Hirokawa N, Hill D, Huminiecki L, Iacono M, Ikeo K, Iwama A, Ishikawa T, Jakt M, Kanapin A, Katoh M, Kawasawa Y, Kelso J, Kitamura H, Kitano H, Kollias G, Krishnan SP, Kruger A, Kummerfeld SK, Kurochkin IV, Lareau LF, Lazarevic D, Lipovich L, Liu J, Liuni S, McWilliam S, Madan BM, Madera M, Marchionni L, Matsuda H, Matsuzawa S, Miki H, Mignone F, Miyake S, Morris K, Mottagui-Tabar S, Mulder N, Nakano N, Nakauchi H, Ng P, Nilsson R, Nishiguchi S, Nishikawa S, Nori F, Ohara O, Okazaki Y, Orlando V, Pang KC, Pavan WJ, Pavesi G, Pesole G, Petrovsky N, Piazza S, Reed J, Reid JF, Ring BZ, Ringwald M, Rost B, Ruan Y, Salzberg SL, Sandelin A, Schneider C, Schonbach C, Sekiguchi K, Semple CA, Seno S, Sessa L, Sheng Y, Shibata Y, Shimada H, Shimada K, Silva D, Sinclair B, Sperling S, Stupka E, Sugiura K, Sultana R, Takenaka Y, Taki K, Tammoja K, Tan SL, Tang S, Taylor MS, Tegner J, Teichmann SA, Ueda HR, van NE, Verardo R, Wei CL, Yagi K, Yamanishi H, Zabarovsky E, Zhu S, Zimmer A, Hide W, Bult C, Grimmond SM, Teasdale RD, Liu ET, Brusic V, Quackenbush J, Wahlestedt C, Mattick JS, Hume DA, Kai C, Sasaki D, Tomaru Y, Fukuda S, Kanamori-Katayama M, Suzuki M, Aoki J, Arakawa T, Iida J, Imamura K, Itoh M, Kato T, Kawaji H, Kawagashira N, Kawashima T, Kojima M, Kondo S, Konno H, Nakano K, Ninomiya N, Nishio T, Okada M, Plessy C, Shibata K, Shiraki T, Suzuki S, Tagami M, Waki K, Watahiki A, Okamura-Oho Y, Suzuki H, Kawai J, Hayashizaki Y: The transcriptional landscape of the mammalian genome. Science 2005, 309: 1559–1563. 10.1126/science.1112014
Affymetrix Data Sheet [http://www.affymetrix.com/support/technical/datasheets/mogarrays_datasheet.pdf]
Sharov AA, Dudekula DB, Ko MS: Genome-wide assembly and analysis of alternative transcripts in mouse. Genome Res 2005, 15: 748–754. 10.1101/gr.3269805
NCBI HomePage 2006. [http://www.ncbi.nlm.nih.gov]
Zhang Z, Schwartz S, Wagner L, Miller W: A greedy algorithm for aligning DNA sequences. J Comput Biol 2000, 7: 203–214. 10.1089/10665270050081478
Cancer Program Data Sets 2006. [http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi]
Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31: e15. 10.1093/nar/gng015
Breitling R, Armengaud P, Amtmann A, Herzyk P: Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett 2004, 573: 83–92. 10.1016/j.febslet.2004.07.055

Acknowledgements

RA was supported by Biomolecular Informatics grant 050-50-203 from the Netherlands Organisation for Scientific Research (NWO).

LVB was supported by the Dutch Cancer Society and by the US National Institutes of Health.

Author information

Authors and Affiliations

Groningen Bioinformatics Centre, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, 9751 NN, Haren, The Netherlands
Rudi Alberts, Rainer Breitling, Jan-Peter Nap & Ritsert C Jansen
Groningen Bioinformatics Centre, University Medical Centre Groningen, University of Groningen, 9713 AV, Groningen, The Netherlands
Peter Terpstra, Menno Hardonk & Ritsert C Jansen
Department of Cell Biology, section Stem Cell Biology, University Medical Centre Groningen, University of Groningen, 9713 AV, Groningen, The Netherlands
Leonid V Bystrykh & Gerald de Haan
Bioinformatics Expertise Center, Institute for Life Science & Technology, Hanze University Groningen, 9747 AS, Groningen, The Netherlands
Jan-Peter Nap

Authors

Rudi Alberts
Peter Terpstra
Menno Hardonk
Leonid V Bystrykh
Gerald de Haan
Rainer Breitling
Jan-Peter Nap
Ritsert C Jansen

Corresponding author

Correspondence to Rudi Alberts.

Additional information

Authors' contributions

RA was responsible for designing and implementing the protocol, interpreting the data, and writing and finalizing the paper. PT helped design the protocol, supervised the implementation and contributed to the interpretation of the data and the writing of the paper. MH participated in the implementation of the protocol. LVB helped with the biological interpretation of the results. GdH participated in writing the paper. RB participated in determining the impact of updated chip definition files on expression data and in the finalization of the manuscript. JPN drafted the manuscript, helped with the interpretation of the data and contributed to the finalization of the manuscript. RCJ conceived the study of individual Affymetrix probes, coordinated the study and the writing, and was responsible for the funding. All authors read and approved the final manuscript.

Electronic supplementary material

12859_2006_1504_MOESM1_ESM.xls

Additional File 1: Verification scores for the Affymetrix U74 array. The file contains the numbers and percentages of probe sets per occurring combination of verification scores. 'e' = 'entirely verified', 'p' = 'partially verified' and 'u' = 'entirely unverified'. (XLS 19 KB)

12859_2006_1504_MOESM2_ESM.xls

Additional File 2: Verification scores for the Affymetrix 430 array. The file contains the numbers and percentages of probe sets per occurring combination of verification scores. 'e' = 'entirely verified', 'p' = 'partially verified' and 'u' = 'entirely unverified'. (XLS 19 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Alberts, R., Terpstra, P., Hardonk, M. et al. A verification protocol for the probe sequences of Affymetrix genome arrays reveals high probe accuracy for studies in mouse, human and rat. BMC Bioinformatics 8, 132 (2007). https://doi.org/10.1186/1471-2105-8-132

Received: 12 September 2006
Accepted: 20 April 2007
Published: 20 April 2007
DOI: https://doi.org/10.1186/1471-2105-8-132
