A novel set of custom CDFs and the corresponding Bioconductor libraries for Affymetrix human 3' expression arrays has been developed based on GeneAnnot and GeneCards information. GeneCards is a widely used database that integrates gene-centered information from major databases, which may be mutually inconsistent when considered individually. GeneAnnot-based CDFs are provided with libraries compliant with Bioconductor standards, including probe libraries, which are required for sequence-level analysis such as GCRMA pre-processing, and annotation libraries, which take advantage of the rich annotations reported in GeneCards, thus facilitating their implementation by end users.
GeneAnnot custom CDFs address the problem of reliably reconstructing expression signals by including in a single custom-probeset only those probes that match transcripts associated with a single gene. In recent years, different groups have proposed a variety of methods to redefine probesets for Affymetrix 3' arrays, referring to several databases and proposing different strategies to solve the technical issues of probeset composition and matching [6, 7]. Each approach has advantages and disadvantages, although the definition of custom-probesets has little effect on general performance and on the results of sample clustering and classification [7, 16]. Thus, the decision on which custom CDF is most appropriate largely depends on the goals of the experimenter: if the major purpose of the study is to analyze samples based on their expression patterns, any of the probeset definitions leads to similar results. When the focus shifts to the identification of specific genes, the most appropriate CDF has to be carefully selected, considering issues related to the database and to the strategy used to group probe pairs into custom-probesets. Specifically, if the biological interest lies in detecting differentially expressed genes, using custom CDFs that refer to a gene-centered database and combine all probes per gene into a single probeset may be the best choice. On the other hand, if the focus is on distinguishing the expression of individual transcript variants, then relying on a transcript-centered database and subdividing probesets into small groups of probes (e.g., 4–5 probes) covering individual exons may be a more appropriate approach. GeneAnnot and GA_CDFs have been proposed to improve the reliability of results from gene-centered analysis of microarray experiments.
In this regard, they aim to eliminate the presence of more than one probeset per gene, a frequent occurrence in the standard Affymetrix probeset definitions that often leads to discordant expression signals when the analysis focuses on detecting differentially expressed genes. Consequently, GA_CDFs may not be the CDFs of choice when 3' expression arrays are used to detect transcript variants, alternative splicing and differential exon expression.
The choice of CDF directly determines how much of the probe-level information available on the chip is used for signal reconstruction. Expression data generated using the Bio_CDF are based on all probes (100%) contained in the chip, while data obtained with the Entrez8_CDF rely on 68.7% of them. Similarly, AV_CDF accounts for 77.3% of all probes when considering probesets with at least 4 probes [7]. In the latter case, indeed, 48% of probesets have fewer than 4 probes and therefore may not support reliable statistical summarization, as assessed by Lu and coworkers [7]. Consequently, undesired probesets with fewer than 4 probes should be filtered out before pre-processing procedures such as RMA, and this filtering may not be trivial for end users. GA11_CDF has been constructed using more than 77.5% of the probe-level information, and all custom-probesets include at least 11 probes, i.e. the minimum number of probes in the standard Affymetrix probesets. As discussed in [6] and [7], a probeset composed of at least four probe pairs should satisfy the minimum requirements of most probe-level analysis algorithms and, thus, any choice of the number of probes composing a custom-probeset is largely arbitrary. We required each custom-probeset to be represented by at least 11 probes because, in our GeneAnnot-based probeset definition, 90% of GeneCardsIDs are interrogated by custom-probesets composed of at least 11 probe pairs (supplementary information). Nevertheless, the functions to create the custom-probesets are generally applicable, and the minimum number of probe pairs making up a custom-probeset is a tunable parameter. A comparison between the lists of differentially expressed genes obtained using GA11_CDF and a GeneAnnot custom CDF with probesets composed of at least 6 probe pairs (GA6_CDF, accounting for 95% of GeneCardsIDs) indicates that the impact of this parameter is minimal (Table 1).
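The effect of the minimum-probe threshold on probe utilization can be illustrated with a minimal sketch. It is not the code used to build the GA_CDFs: the function names, the `min_probes` argument and the toy probe-to-gene mapping are hypothetical and chosen only to show the principle (keep probes that match a single gene, group them per gene, discard groups below the threshold, and report the fraction of probes retained).

```python
from collections import defaultdict


def build_custom_probesets(probe_to_genes, min_probes=11):
    """Group probes per gene, keeping only probes that match exactly one gene.

    probe_to_genes: dict mapping a probe identifier to the set of genes whose
    transcripts it matches (hypothetical input format).
    min_probes: tunable minimum number of probes per custom-probeset.
    """
    by_gene = defaultdict(list)
    for probe, genes in probe_to_genes.items():
        if len(genes) == 1:  # discard probes matching more than one gene
            (gene,) = genes
            by_gene[gene].append(probe)
    # Drop custom-probesets that fall below the threshold.
    return {g: sorted(p) for g, p in by_gene.items() if len(p) >= min_probes}


def fraction_of_probes_used(probesets, total_probes):
    """Fraction of all probes on the chip that end up in some custom-probeset."""
    used = sum(len(probes) for probes in probesets.values())
    return used / total_probes


# Toy chip (assumed data): 12 probes unique to GENE_A, 7 unique to GENE_B,
# and one probe matching both genes.
toy_probe_to_genes = {f"a{i}": {"GENE_A"} for i in range(12)}
toy_probe_to_genes.update({f"b{i}": {"GENE_B"} for i in range(7)})
toy_probe_to_genes["x0"] = {"GENE_A", "GENE_B"}

ga11 = build_custom_probesets(toy_probe_to_genes, min_probes=11)
ga6 = build_custom_probesets(toy_probe_to_genes, min_probes=6)

total = len(toy_probe_to_genes)
print(sorted(ga11), fraction_of_probes_used(ga11, total))
print(sorted(ga6), fraction_of_probes_used(ga6, total))
```

In this toy case, lowering the threshold from 11 to 6 recovers GENE_B and raises the fraction of probes used, whereas the probe matching both genes is excluded under either setting.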
The various CDFs result in different numbers of genes whose transcript levels are measured by a probeset or by groups of probesets. In particular, the Bio_CDF contains probesets associated with 13,389 EntrezGenes, while the Entrez8_CDF accounts for 11,999 EntrezGenes. GA6_CDF and GA11_CDF custom-probesets are annotated to 12,074 and 11,408 EntrezGenes, respectively, and thus represent a number of annotated genes comparable to that of other gene-centered custom CDFs.
In addition, the deviation from a one-to-one probeset/gene match varies among the different definition files, due both to the existence of multiple probesets per gene (or multiple probesets per transcript, as in AV_CDF) [8] and to the presence of the same probe in multiple custom-probesets, which adds ambiguity to the evaluation of signals (e.g. in the RefSeq8_CDF, probes with indexes 182067, 182068, 204881, and 204883 are present in 39 different custom-probesets). In contrast, GeneAnnot-based custom-probesets include only probes matching transcripts linked to a single gene. As such, they preserve a one-to-one correspondence between genes and custom-probesets. Furthermore, each probe is assigned to a unique custom-probeset, thus avoiding the additional noise due to the use of a probe in multiple probesets.
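Whether a given probeset definition assigns each probe to a unique probeset can be checked by counting, for every probe, the number of probesets that contain it. The following is a minimal sketch under stated assumptions: the function names and the toy probeset definitions are hypothetical, and the probe indexes are illustrative rather than taken from any of the CDFs discussed here.

```python
from collections import Counter


def probe_multiplicity(probesets):
    """Count how many probesets each probe index appears in.

    probesets: dict mapping a probeset identifier to a list of probe indexes
    (hypothetical input format).
    """
    counts = Counter()
    for probes in probesets.values():
        counts.update(set(probes))  # count each probe once per probeset
    return counts


def shared_probes(probesets):
    """Return only the probes present in more than one probeset."""
    return {p: n for p, n in probe_multiplicity(probesets).items() if n > 1}


# Toy transcript-level definition: probe 3 is reused by three probesets.
transcript_level = {"tx1": [1, 2, 3], "tx2": [3, 4], "tx3": [3, 5]}
# Toy gene-level definition: every probe belongs to exactly one probeset.
gene_level = {"g1": [1, 2], "g2": [4, 5]}

print(shared_probes(transcript_level))
print(shared_probes(gene_level))
```

An empty result corresponds to the situation described for the GeneAnnot-based custom-probesets, in which no probe contributes to more than one expression signal.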
Finally, the reported data show that, when applied to the analysis of a standard experimental design, GA_CDFs perform similarly to the other custom CDFs, with the additional advantage that GeneAnnot-based CDFs are provided with complete annotation libraries compliant with Bioconductor standards, thus allowing an easier implementation by end users.