Transcript-level annotation of Affymetrix probesets improves the interpretation of gene expression data
- Hui Yu†1,
- Feng Wang†2, 1,
- Kang Tu4, 1,
- Lu Xie1,
- Yuan-Yuan Li1, 3 and
- Yi-Xue Li1, 4
© Yu et al; licensee BioMed Central Ltd. 2007
Received: 21 November 2006
Accepted: 11 June 2007
Published: 11 June 2007
The wide use of Affymetrix microarrays across a broadening range of biological research fields has made probeset annotation an important issue. Standard Affymetrix probeset annotation is at the gene level, i.e. a probeset is linked to a gene, and probeset intensity is interpreted as gene expression. The growing knowledge that one gene may have multiple transcript variants makes it necessary to refine this gene-level annotation to the transcript level.
Through performing rigorous alignments of the Affymetrix probe sequences against a comprehensive pool of currently available transcript sequences, and further linking the probesets to the International Protein Index, we generated transcript-level or protein-level annotation tables for two popular Affymetrix expression arrays, Mouse Genome 430A 2.0 Array and Human Genome U133A Array. Application of our new annotations in re-examining existing expression data sets shows increased expression consistency among synonymous probesets and strengthened expression correlation between interacting proteins.
By refining the standard Affymetrix annotation of microarray probesets from the gene level to the transcript level and protein level, one can achieve a more reliable interpretation of experimental data, which may lead to the discovery of more profound regulatory mechanisms.
Microarray technology was invented to rapidly profile the quantities of mRNA transcripts in a particular cellular context [1–3]. Its application has become universal in biomedical research. Although it is mRNA that is actually detected by microarray experiments, and it is mRNA that has the direct relationship with protein, the methodology and algorithms for data analysis are commonly gene based. As evident in the probe annotation file provided by Affymetrix, gene-level annotation is widely accepted even though it fails to discriminate multiple mRNAs transcribed from the same gene. As a result, the analysis results are usually summarized at the gene level, such as differentially expressed genes [4–6] or gene-sets [7, 8]. Even in recent works that integrate protein-protein interaction data and microarray data [9–11], the analysis unit is reduced to the gene instead of the mRNA. This practice could be attributed to the fact that most functional knowledge is at the gene level instead of the transcript level, and the functional differences between mRNA variants transcribed from the same gene are seldom clear. Although gene-level analysis ignores the differences among mRNA variants, this strategy is still biologically meaningful considering that the diversity of genes is much higher than that of transcripts encoded by the same gene.
It has been well established that alternative splicing increases mRNA diversity, and over 60% of human genes are involved in this mechanism. In addition, other RNA processing events, such as RNA editing, also account for the increased diversity at the mRNA level. Since these events enable one gene to encode multiple proteins which might be functionally heterogeneous, we feel it necessary to separate transcript-level synonymous probesets from gene-level synonymous ones. The probesets that hybridize to more than one transcript variant of the same gene are referred to as gene-level synonymous probesets, while the ones that hybridize to a single variant are named transcript-level synonymous probesets. It has been noticed that transcript-level synonymous probesets tend to have similar expression profiles, while gene-level synonymous probesets may have distinct expression profiles [14, 15]. In fact, individual reports demonstrated that the expression of transcript variants could be totally different [16, 17]. These phenomena indicate that the gene-level strategy of microarray data analysis is imprecise enough that one may overlook the expressional inconsistencies among gene-level synonymous probesets.
As a matter of fact, Affymetrix suffixes its probeset IDs according to each probeset's specificity. For example, probesets that recognize unique transcript variants are suffixed with _at, and probesets that recognize multiple alternative transcripts from a single gene are suffixed with _a_at or _s_at, and so on. This suffix system gives a hint about the varied specificities of the probesets, and could be considered an attempt to address users' concerns about the gene-level data analysis strategy. However, the correctness of the suffix system has been in doubt [14, 18]. Therefore it is not reliable to perform transcript-level analysis on the basis of this imperfect suffix system. As the standard annotation files and most of the analysis algorithms are gene-oriented, analysts often average out the expression heterogeneity of the same gene when dealing with probeset-level data.
In this paper, we linked the probesets of two widely used Affymetrix arrays with the International Protein Indexes (IPIs) [20, 21] through proper association and rigorous alignment procedures, and demonstrated the statistically significant advantage of interpreting microarray data at the transcript-level or protein-level. Our results can be viewed as a more precise annotation of Affymetrix array's probesets, with which one may achieve a more reliable interpretation of their experimental data. Moreover, the application of this new annotation substantially increased the expression correlation between interacting proteins.
Transcript-level or protein-level annotation of Affymetrix probesets
Table 1. Summary of probesets, genes and proteins covered by probesets (percentages given as MOE430A_2 / HG-U133A).
- Probesets associated with protein-coding mRNAs (Investigated probesets)
- Probesets passing BLAST (Annotated probesets); probeset retaining percentage: 83.5% of the non-control and 89.6% of the investigated / 68.8% of the non-control and 94.3% of the investigated
- Genes covered by non-control probesets
- Genes covered by investigated probesets
- Genes covered by annotated probesets; gene retaining percentage: 95.2% / 85.4% of the non-control; gene coverage: 45.1% / 43.2% of all protein-coding genes*
- Proteins covered by annotated probesets; protein coverage: 42.0% / 40.8% of all proteins*
Through the rigorous association and alignment, we obtained precise annotations for 18,894 and 15,288 probesets in Affymetrix arrays MOE430A_2 and HG-U133A respectively (see Additional file 1). These annotations discriminate alternative mRNA variants transcribed from the same gene, and are thus at the transcript level as opposed to the standard gene-level annotation files provided by Affymetrix. It is worth noting that since the transcript data we used were quite redundant, a conceptual transcript variant may be represented by multiple redundant transcript accessions in the transcript database. In our transcript-level annotation file, each conceptual transcript is identified with one IPI ID, as we only investigated the probesets associated with protein-coding transcripts.
Statistics on the non-control, investigated and annotated probesets, together with the number of involved genes and proteins, are shown in Table 1. It is evident that the proportion of genes covered by our annotated probesets to those covered by all non-control probesets ('gene retaining percentage' in Table 1), 95.2% for MOE430A_2 and 85.4% for HG-U133A, is higher than the corresponding probeset retaining percentage, 83.5% and 68.8%, indicating that the gene coverage has only been slightly reduced by our filtering procedures. This observation supports our primary goal, which is to refine gene-level probeset annotations to the transcript level rather than simply to remove poor-quality gene-level annotations.
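The probeset retaining percentages quoted above can be re-derived from counts stated elsewhere in this paper: the total and control probeset counts and the investigated counts given in Methods, and the annotated counts given in the preceding paragraph. The following is a minimal sketch of that arithmetic; the dictionary layout and the function name are ours and purely illustrative.

```python
# Counts taken from the text: total probesets on the array, control probesets,
# investigated probesets (selected for BLAST) and annotated probesets (passing BLAST).
ARRAYS = {
    "MOE430A_2": {"total": 22690, "control": 64, "investigated": 21097, "annotated": 18894},
    "HG-U133A": {"total": 22283, "control": 68, "investigated": 16213, "annotated": 15288},
}


def retaining_percentages(counts):
    """Return (% of non-control, % of investigated) probesets that were annotated."""
    non_control = counts["total"] - counts["control"]
    of_non_control = 100.0 * counts["annotated"] / non_control
    of_investigated = 100.0 * counts["annotated"] / counts["investigated"]
    return round(of_non_control, 1), round(of_investigated, 1)


for name, counts in ARRAYS.items():
    print(name, retaining_percentages(counts))
```

The complements of the first value of each pair (16.5% and 31.2%) are the shares of non-control probesets left without a validated transcript target.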
Table 2. One-to-one and one-to-many probesets in our annotation tables.
- MOE430A_2: 11327 (53.7% of investigated)
- HG-U133A: 7014 (43.3% of investigated)
- Total (Investigated probesets)
Comparison of the Affymetrix suffix categorization and our classification of probesets (entries given as % of Total).
These subgroups of the investigated probesets with different Affymetrix suffixes indicate the imperfection of the Affymetrix suffix system, thus affirming the necessity of our transcript-level or protein-level probeset annotations. In fact, several other research groups have addressed the misleading nature of the Affymetrix suffix system and the imperfection of its standard annotation file, including some re-annotation works for array HG-U133 [14, 18, 19, 23]. We will discuss these related works in detail in the next section.
Verification of the protein-level annotation and comparison with related annotations
Since the probesets were linked to transcripts and proteins through rigorous association and alignment procedures, the expression profiles of transcript-level synonymous probesets were supposed to be more consistent than those of gene-level synonymous probesets [14, 15]. This was taken as the basis for the evaluation of our annotations.
These results support the argument that it is more reliable to interpret microarray data at the transcript level than at the gene level. The inferiority of the Affy-protein level annotation to our annotation could be attributed to the technical details in their alignment and association procedures. First, they used the representative mRNA sequence ('consensus' or 'exemplar' sequence of each probeset), instead of the probe sequences themselves, as the query sequence in the alignment. Second, they aligned against the GenBank non-redundant protein database, rather than a comprehensive pool of mRNA sequences. In a microarray experiment, hybridization takes place between the probe sequences immobilized on the array and the cDNA sequences from the sample, so one can deduce that the alignment between the representative mRNA sequences and the protein sequences cannot precisely simulate the hybridization between probes and mRNAs. Finally, NetAffx filtered the blast results according to an E-value cutoff, which indicates the likelihood of observing the alignment by chance. Although frequently adopted for sequence homology analysis in closely related species, the E-value is not sensitive enough to grade the many well-aligned targets from the same species. In our practice, we took the probe sequences as the query and the mRNA sequences as the alignment targets, and adopted the matching nucleotide proportion as the filtering criterion (see Methods).
A similar comparison was conducted for the HG-U133A array, involving another transcript-level annotation by Harbig et al. Across all 28 datasets, the annotation by Harbig et al. showed an advantage over the Affy-protein-level annotation, while our transcript-level annotation performed best (p < 0.05 for 18 datasets under Student's t-test, see Figure 1B and Additional file 2). As our work was done two years later than Harbig et al.'s, the updated mRNA sequences in the probe-mRNA alignment are one of the factors contributing to the increased performance. The other contributing factor is the different approach we used to identify the mapping of probes to mRNA targets. Harbig et al. performed a two-phase blast: first, they blasted probeset target sequences against the mRNA sequence pool, and then blasted probe sequences against the retrieved mRNA sequences. Efficient as it was, this two-phase-blast strategy reduced the alignment precision as compared to our direct probe-against-mRNA blast strategy. Moreover, Harbig et al. accepted the mRNA with the highest average probe matches as the target of a probeset, even if the highest value could be suboptimal.
As shown in Figure 2B, three probesets in the GDS1076 dataset, 1419114_at, 1419115_at and 1419116_at, were mapped to the mouse gene Alg14 (GeneID: 66789), with the former two corresponding to IPI00132168 and the latter one corresponding to IPI00405947. Both proteins were indicated in IPI as homologs of yeast asparagine-linked glycosylation 14 without any further information. We notice two interesting phenomena in this case. First, the two probesets correlating to the same protein do not show similar expression profiles. Instead, they behave like probesets correlating to different variants, with a PCC value of 0.4080 (P = 0.0415). This issue might be due to some factors causing a shift in microarray hybridization efficiency. Secondly, according to our calculation, the expressions of these two variants are negatively correlated, with PCC values of -0.7740 (P = 5.04e-5) for 1419116_at and 1419115_at, and -0.4402 (P = 0.0296) for 1419116_at and 1419114_at. So far there are no reports on the expression regulation of these two transcript variants of gene Alg14, and no functional reports on the corresponding protein isoforms. This negative correlation suggests that these two proteins might perform different roles and thus should be distinctly annotated.
Application of our new annotations to the evaluation of the expression correlation between interacting proteins
In recent years, considerable efforts have been devoted to identifying and characterizing protein-protein interaction (PPI). Besides investigations on the molecular events involved in PPI, functional annotation of an unclassified protein according to its interacting partners is also an important topic. Since it is too bold to infer protein functions according to the "majority rule" that utilizes only the PPI network structure [29, 30], many studies integrate other data sources into the functional characterization of PPI, among which gene expression data is the favorite [9, 31, 32]. All these works assumed that interacting protein pairs were characterized by higher expression correlation than random ones. However, previous investigations indicated that the relationship between expression correlation and PPI was weak on a genomic scale [33–35], although a recent work strengthened the association by integrating cross-species conservation information.
We noticed that in these genome-scale studies PPI information was always first converted to gene pairs, after which the Pearson correlations of the probeset pairs corresponding to the gene pairs were evaluated. That is, the analysis targets were expanded from real interacting protein pairs to all possible cross-gene protein pairs for which interaction may not always exist. As illustrated in Figure 3, suppose we have gene a (abbreviated to Ga) and gene b (Gb), with Ga encoding protein a1 (Pa1) and protein a2 (Pa2), Gb encoding protein b1 (Pb1) and protein b2 (Pb2). Among these protein variants, only proteins Pa2 and Pb1 interact with each other, while the other three possible cross-gene interactions, including Pa1-Pb1, Pa1-Pb2 and Pa2-Pb2, do not really happen. The four probesets, Pst_a1, Pst_a2, Pst_b1, and Pst_b2, recognize transcript variants Ta1, Ta2, Tb1 and Tb2 respectively, producing proteins Pa1, Pa2, Pb1 and Pb2. In the conventional genome-scale studies mentioned above, besides the probeset pair (Pst_a2, Pst_b1) corresponding to the real interacting protein pair, the other three cross-gene pairs, (Pst_a1, Pst_b1), (Pst_a1, Pst_b2) and (Pst_a2, Pst_b2), were also included, which would blunt the expression correlation between the real interacting entities according to our preceding observations. We propose that this might partly explain the weak coherency between PPI and expression correlation.
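The expansion from one real interacting pair to all cross-gene pairs can be made concrete with the toy example of Figure 3. The sketch below is illustrative only: the two mapping dictionaries and the function names are hypothetical and simply encode the Ga/Gb example described above.

```python
from itertools import product

# Hypothetical toy mappings mirroring Figure 3: each gene encodes two proteins,
# and each protein is recognized by exactly one probeset.
GENE_TO_PROTEINS = {"Ga": ["Pa1", "Pa2"], "Gb": ["Pb1", "Pb2"]}
PROTEIN_TO_PROBESET = {
    "Pa1": "Pst_a1", "Pa2": "Pst_a2",
    "Pb1": "Pst_b1", "Pb2": "Pst_b2",
}


def gene_level_pairs(gene_a, gene_b):
    """All cross-gene probeset pairs, as evaluated in gene-based analyses."""
    probesets_a = [PROTEIN_TO_PROBESET[p] for p in GENE_TO_PROTEINS[gene_a]]
    probesets_b = [PROTEIN_TO_PROBESET[p] for p in GENE_TO_PROTEINS[gene_b]]
    return list(product(probesets_a, probesets_b))


def protein_level_pairs(protein_a, protein_b):
    """Only the probeset pair of the two proteins that actually interact."""
    return [(PROTEIN_TO_PROBESET[protein_a], PROTEIN_TO_PROBESET[protein_b])]


ggi = gene_level_pairs("Ga", "Gb")        # four cross-gene pairs
ppi = protein_level_pairs("Pa2", "Pb1")   # the single real interacting pair
spurious = [pair for pair in ggi if pair not in ppi]
print(len(ggi), ppi, spurious)
```

In this toy case three of the four gene-level pairs do not correspond to any real interaction, which is the dilution effect proposed above.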
In Figure 4A, we notice that the PPI PCC calculation strengthens the negative correlation as well as the positive correlation. That is to say, PPI pairs seem to be either positively correlated or negatively correlated, but not exclusively 'co-expressed' as previous publications reported. This phenomenon is more evident when we examine the PCC values for each coupled PPI pair and GGI pair. Figure 4B shows a scatter plot of the PPI PCC values versus the corresponding GGI PCC values. It is evident that most points fall into the 1st and the 3rd quadrants, indicating that each pair of PPI PCC value and GGI PCC value tends to have the same sign. The scatter plot suggests a linear relationship between the PPI PCCs and GGI PCCs, and indeed we get a linear regression formula, y = 0.5612x + 0.0046, at high confidence (p < 2e-16). Since the estimated coefficient, 0.5612, is far less than 1, we may conclude that the absolute PCC values of PPI pairs are often larger than those of the corresponding GGI pairs. So the PPI PCC calculation preserves the original positive or negative correlation tendency revealed by the conventional GGI PCC calculation, and strengthens it with larger absolute correlation values. Such correlation tendencies between interacting proteins, especially the negative ones, would very likely be submerged under the background correlations of random pairs if the non-interacting protein pairs are included in the analysis.
Number of significantly correlated PPIs or GGIs found from each dataset.
Number of datasets on which a PPI or GGI demonstrate significant correlation.
The same experiments were also implemented with 274 PPI pairs extracted from the IntAct database, and similar conclusions were obtained. More details can be found in Additional file 6 and Additional file 7.
In this work, we re-annotated the probesets of two widely used Affymetrix arrays, MOE430A_2 and HG-U133A, via proper association and rigorous alignment procedures in a transcript perspective, and demonstrated the necessity and advantage of exploring microarray data at the transcript or protein level, instead of the conventional gene level.
Although Affymetrix utilized the most complete information available at the time of array design, tremendous progress in genome sequencing and annotation in recent years renders existing probeset designs and target identifications suboptimal. In recent years, there have been continuous reports on systematic false expression signals of Affymetrix probesets, spurious expression correlation caused by cross hybridization, and expressional inconsistency among different microarray platforms or even different generations of one platform [40–43]. A few research groups performed probe-against-mRNA blast similar to ours [22, 42, 44], but mostly they centered around UniGene and therefore improved the accuracy of annotation only at the gene level. A major trend among these efforts was to redefine probesets so that probes matching the same molecular target were placed into custom probesets, as proposed by [19, 23, 39, 42], but as the authors of one such work pointed out, 'these transcript-targeted probesets are not transcript-specific, as probesets targeting transcripts from the same gene may share many or even all probes'. Thus the probe re-organization strategy may be used to make distinctions at the level of genes, but not at the level of transcripts or splice variants. Besides, this strategy takes the probe-level intensity file (the CEL file) as a prerequisite; however, only around half of the expression datasets deposited in public databases like GEO were found with CEL files.
In order to make distinctions precisely at the transcript level, we preserved the classical Affymetrix probesets, but distinguished them among their alternatively spliced transcript targets according to the consistent alignments of probes against up-to-date mRNA sequences. Our annotation table clearly divides the Affymetrix probesets into three groups with increasing transcript-level specificity (reliability): one-to-null probesets that do not recognize any transcript, one-to-many probesets that hybridize to multiple alternative transcript variants of the intended gene, and one-to-one probesets that hybridize to unique alternative transcript variants of the intended gene. We discriminate the intended alternative transcript variants of Affymetrix probesets based on NetAffx's gene-level annotation for the first time. Given that existing solutions are accompanied by imperfections and no consensus has been reached on a dominant strategy, our alternative solution to the problematic standard annotation points out a new way to improve the interpretation and exploitation of Affymetrix microarray data.
Although the transcript collections were not identical and the reannotation strategies differed to some extent, we made some discoveries similar to previous reports. For example, Harbig et al. found that a number of probesets did not detect any transcript and attributed this phenomenon to the elimination of the target sequence in the process of sequence update. In our study, altogether 16.5% and 31.2% of non-control probesets in the MOE430A_2 and HG-U133A arrays were not found with any transcript targets in the pool of GenBank, RefSeq and Ensembl. Using a newer and larger collection of transcript sequences, we even obtained a percentage of 'multiple-targeting' probesets quite similar to that estimated in a foregoing study, specifically 54.6% for MOE430A_2 and 54.1% for HG-U133A (see Table 2). The significant mutual agreement among the related studies justifies the necessity of setting up an improved annotation mechanism for the Affymetrix probes in the face of the continual growth of genomic and transcriptomic knowledge, ideally at the transcript or protein level.
Over the past few years, the analysis of alternative splicing has emerged as an important new field in bioinformatics, and several recent large-scale studies have shown that alternative splicing can be analyzed in a high-throughput manner using DNA-microarray methods [46, 47]. Most of these studies used arrays particularly manufactured for analyzing alternative splicing, such as genomic tiling arrays and exon-exon junction arrays. Constructed without any a priori knowledge of the possible exon content of a genomic sequence, the genomic tiling array [48, 49] is in principle capable of detecting novel alternative splicing events of diverse types, but it is in doubt whether its data will be readily interpretable, as successful experiences remain insufficient. On the other hand, although designed particularly to address the alternative splicing issue, exon and exon-exon junction arrays were reported to be plagued by problematic probe specificity and unsatisfactory hybridization efficiency because of the necessity of probe coverage across the full length of the gene (including the 5' end). Many questions about the reproducibility of the amplification protocol, the quantitative accuracy, and the data analysis need to be addressed as a prerequisite to reliable quantitative analysis using these splicing arrays. Given the current imperfection of splicing array techniques and the inconvenience of deciphering their generated data, it is economical to do large-scale investigations of alternative transcription events with standard gene expression arrays, provided that the targets recognized by the probes can be rigorously defined at the transcript level. Hu et al. proposed a primitive analysis method to explore alternative splicing with Affymetrix 3' gene expression arrays, though they regretted that only alternative splicing biased toward the 3' end of the gene can be detected in their way.
In the present paper, we conducted a large-scale alignment of the probe sequences in traditional gene expression arrays against the currently most comprehensive collection of transcript sequences, highlighting the probesets mapping to unique alternative transcripts unambiguously. For each of the two Affymetrix expression arrays tested in this study, we found over 40% of all probesets could be mapped to transcripts in a one-to-one manner, so our work strongly validates that it is feasible to analyze alternative splicing using traditional gene expression arrays. While the foregoing work contributed by Hu et al. remains a qualitative analysis method aiming at detecting novel alternative splicing events, our work gives explicitly the relationship between the probesets and the currently known alternative transcript variants, which can be immediately exploited to facilitate quantitative analysis of alternative variants. As our mapping relationships are defined for the standard probesets of the traditional gene expression arrays, they can be conveniently exploited in the same way as the standard NetAffx annotation information, without any ad hoc influence on the widely applied experiment protocols or the routine data processing algorithms. In the demonstrative implementations of the novel annotation tables, we actually observed several examples of negatively correlated alternative variants (see Figure 2B for one of them), which will shed light on further studies of expression regulation of alternative transcript variants.
To sum up, we re-annotated two popular Affymetrix gene expression arrays, MOE430A_2 and HG-U133A, from a transcript-level perspective, aiming at identifying the probesets' detection targets precisely at the transcript level. Although previous works addressed similar issues [14, 15, 18, 19, 22, 23], we are the first to rigorously link existing Affymetrix probesets to their specific transcript targets and their corresponding proteins. Armed with this new annotation, we re-examined a number of previous studies, 30 datasets for MOE430A_2 and 28 datasets for HG-U133A from GEO, and revealed increased expression consistency among synonymous probesets and closer expression correlation among interacting proteins. This transcript-level annotation of Affymetrix probesets allows for a more reliable gene expression data analysis and a more accessible protein-level correlation study.
Sequences and related information of Affymetrix probesets
The Affymetrix 3' eukaryotic gene expression analysis arrays MOE430A_2 and HG-U133A were selected for this study. Probe sequence files and corresponding annotation files, 'Mouse430A_2_annot_csv.zip' (annotated on 2005-12-19) and 'HG-U133A_annot_csv.zip' (annotated on 2006-04-11), were downloaded from the Affymetrix website. Also downloaded there were the NetAffx probeset-protein mapping files for MOE430A_2 (file 'Mouse430A_2_blast_csv', updated on 2005-12-18) and HG-U133A ('HG-U133A.na21.blast.csv.zip', updated 2006-04-11), which were the blast results of the representative mRNA sequence of probes against protein sequence databases.
Sources of mRNA transcripts
GenBank: mRNA sequences from the CoreNucleotide division of the NCBI Nucleotide database were obtained via the Entrez Nucleotide search on April 10th, 2006. For mouse, this dataset comprises 1,582,211 sequences, with 1,521,234 from DDBJ, 6,506 from EMBL and 54,471 from GenBank. For human, there are 201,206 sequences in total, with 41,128 from DDBJ, 62,701 from EMBL and 97,377 from GenBank. File 'gene2accession' (updated on 2006-03-28), downloaded from Entrez Gene, provides the mapping relationship between the CoreNucleotide sequences, Entrez Gene IDs, and protein sequence accessions.
RefSeq: 55,832 mouse mRNA sequences and 40,530 human mRNA sequences were obtained from the RefSeq database. Mapping relationships between RefSeq mRNA accessions, RefSeq protein accessions, and Entrez GeneIDs were extracted from related flat files 'mouse.rna.gbff.gz' and 'human.rna.gbff.gz', which were downloaded from RefSeq in April 2006.
Ensembl transcripts: 37,854 mouse transcript sequences were obtained from the Ensembl database (release 38). Mapping tables between Ensembl Gene ID, Ensembl Transcript ID, and Ensembl Peptide ID were obtained from Ensembl martview.
IPI entries and their mappings to external sequence accession numbers
IPI entries and their mappings to external protein accession numbers were acquired from the International Protein Index (IPI) database (release 3.17). Also obtained there were the mapping relations between IPI numbers and transcript IDs (GenBank, RefSeq, and Ensembl). The counterpart file for human was downloaded there too (release 3.16).
Expression datasets from Gene Expression Omnibus
Microarray datasets were downloaded from the Gene Expression Omnibus (GEO) on April 15, 2006. Array MOE430A_2, indexed as GPL339, was associated with 2,276 samples in GEO, ranking second among all registered Affymetrix mouse arrays. All 31 GDS datasets profiled with MOE430A_2 were used in the analyses except for GDS1057, which contains only two samples. Array HG-U133A, indexed as GPL96, was associated with 8,698 samples in GEO, ranking first among all registered Affymetrix human arrays. For our analysis, we downloaded the 31 GDS datasets with the largest sample sizes, and used 28 of them in our analyses, excluding GDS534, GDS1329, and GDS1324 as they are in a data format inconsistent with the others. Details about the datasets used can be found in our Additional file 8.
PPI datasets from IntAct and HPRD
Two well-known databases, IntAct and HPRD, provide the PPI information for this study. We downloaded 68,035 human PPIs from HPRD (updated 2006-06-01) and 12,301 from IntAct (updated 2006-05-12), respectively.
An alternative annotation of the Affymetrix U133 Plus 2.0 array
A recently proposed transcript-level annotation of the Affymetrix U133 Plus 2.0 array was obtained from the H. Lee Moffitt Cancer Center and Research Institute, and was used for comparison with our transcript-level annotation of the HG-U133A array.
Generation of a new transcript-level annotation table for Affymetrix array
Out of the total 22,690 and 22,283 probesets in arrays MOE430A_2 and HG-U133A, respectively, 64 and 68 control probesets were first removed. The remaining probesets were associated with genes according to the probeset-gene mapping information provided in Affymetrix's standard annotation file. The probeset-transcript mapping relationships were obtained based on the gene-mRNA mapping tables from GenBank, RefSeq and Ensembl. In the process, we only included probesets that were identified with one Entrez Gene ID or one Ensembl gene ID. We ignored the probesets that were associated with multiple entities or no entity in the two gene-centric databases, since their gene-level specificity appears doubtful in the standard annotation file. This filtered out 3.2% and 5.2% of non-control probesets in MOE430A_2 and HG-U133A, respectively. For the rest of the probesets, we linked the candidate transcript targets to their corresponding protein entries in the IPI database. IPI is currently the least redundant yet most complete protein database for featured species, with one protein sequence matching each transcript variant. Probesets whose transcript targets do not have any protein counterparts were also excluded from the following blast validation, in order to focus our attention on the transcripts with well-characterized functions at the protein level. The remaining probesets, 21,097 for MOE430A_2 and 16,213 for HG-U133A, were selected for the BLAST procedure.
We then filtered the candidate probeset-mRNA mapping relationships by aligning the probe sequences in these probesets against their corresponding transcripts. Probes were blasted against their candidate mRNA targets using the bl2seq program, and a probe-to-transcript match was accepted if no more than one mismatch was found. At the level of probesets, a probeset-to-transcript match was accepted only if more than 90% of all probes within a probeset (that is, at least 10 probes for the typical 11-probe probeset) were mapped to the transcript in the same orientation.
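The two acceptance rules above (at most one mismatch per probe, and more than 90% of probes matching in the same orientation) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes the alignment step has already been run and reduced to a hypothetical per-probe summary of mismatch count and strand, which is not the bl2seq output format.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical per-probe alignment summary: (number of mismatches, strand),
# or None when the probe did not align to the candidate transcript at all.
ProbeHit = Optional[Tuple[int, str]]


def probe_matches(hit: ProbeHit, max_mismatches: int = 1) -> bool:
    """A probe-to-transcript match is accepted with no more than one mismatch."""
    return hit is not None and hit[0] <= max_mismatches


def probeset_matches(hits: List[ProbeHit], min_fraction: float = 0.9) -> bool:
    """Accept the probeset if more than 90% of its probes match in one orientation."""
    if not hits:
        return False
    # Count accepted probes per strand and keep the best-supported orientation.
    strands = Counter(hit[1] for hit in hits if probe_matches(hit))
    if not strands:
        return False
    return max(strands.values()) / len(hits) > min_fraction


# Toy probeset of 11 probes: 10 perfect matches on the '+' strand, one unaligned.
example = [(0, "+")] * 10 + [None]
print(probeset_matches(example))
```

With the typical 11 probes per probeset, 10 matching probes pass the threshold (10/11 is about 0.91), while 9 matching probes do not (9/11 is about 0.82).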
The probeset-transcript-protein links related to the above probesets passing BLAST filter were retrieved. After reducing the redundancy information of multiple transcripts corresponding to the same IPI, we finally obtained rigorous probeset annotation files for Affymetrix arrays MOE430A_2 and HG-U133A. There are two types of probesets in the new annotation file: one-to-one probesets, where one probeset maps to only one IPI ID; and one-to-many probesets, where one probeset maps to two or more IPI IDs. Only the one-to-one probesets were used in the subsequent analyses.
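The redundancy reduction and the one-to-one versus one-to-many labelling can be illustrated with a short sketch. It assumes the links are available as (probeset, transcript accession, IPI ID) triples; the function name and all identifiers in the usage example are hypothetical placeholders.

```python
from collections import defaultdict


def classify_probesets(links):
    """Label each probeset by the number of distinct IPI IDs it maps to.

    `links` is an iterable of (probeset, transcript, ipi) triples. Several
    redundant transcript accessions may point to the same IPI ID, so only
    the set of distinct IPI IDs per probeset is counted.
    """
    ipi_ids = defaultdict(set)
    for probeset, _transcript, ipi in links:
        ipi_ids[probeset].add(ipi)
    return {
        probeset: "one-to-one" if len(ids) == 1 else "one-to-many"
        for probeset, ids in ipi_ids.items()
    }


# Placeholder identifiers: probeset_A reaches one IPI through two redundant
# transcripts, while probeset_B reaches two different IPIs.
toy_links = [
    ("probeset_A", "transcript_1", "IPI_X"),
    ("probeset_A", "transcript_2", "IPI_X"),
    ("probeset_B", "transcript_3", "IPI_Y"),
    ("probeset_B", "transcript_4", "IPI_Z"),
]
labels = classify_probesets(toy_links)
print(labels)
```

Probesets that never appear in the links (no transcript passed the alignment filter) are simply absent from the result.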
Evaluating expression consistencies within synonymous probesets
and 28 expression datasets were selected from GEO for MOE430A_2 and HG-U133A, respectively, and the original intensity data within each GEO dataset (GDS) were log2-transformed and normalized to a constant median across all samples. For a synonymous group, if the expression values of all probesets in all samples were no larger than the constant median value, the probesets in this group were regarded as not even moderately expressed, and their expression profiles as not informative enough. Therefore, we kept only the synonymous groups with at least one expression value above the constant median value, similar to the filtering procedure used by Tian et al. For the remaining synonymous groups, Pearson correlation coefficients (PCCs) were calculated for the expression profiles of each probeset pair. The minimum of these PCCs was taken as a measure of the expression coherence of the group. We chose the minimum because a gene-level synonymous group gives rise to both within-protein PCCs (which are theoretically higher) and across-protein PCCs (which are theoretically lower), and the former are identical to the PCCs of the corresponding protein-level synonymous group. In such a setting, the maximum would not differ between the two levels, and the average was not as sensitive as the minimum in differentiating the two groups, so we adopted the minimum aggregation.
Finally, the mean of the expression coherences of all synonymous groups over a dataset was calculated. In this way we obtained an evaluation of the expression consistency within synonymous probesets for each microarray dataset, and could compare the expression consistencies at the three levels across different microarray datasets.
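The coherence measure described above can be sketched as follows. The sketch assumes that the expression values are already log2-transformed and median-normalized, that each synonymous group is given as a list of expression profiles, and that no profile is constant; all function names are illustrative.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("PCC is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def group_coherence(profiles):
    """Minimum PCC over all probeset pairs of one synonymous group."""
    return min(pearson(a, b) for a, b in combinations(profiles, 2))


def dataset_consistency(groups, median_value):
    """Mean coherence over the informative synonymous groups of a dataset."""
    # Keep groups with at least two probesets and at least one
    # expression value above the constant median.
    kept = [g for g in groups
            if len(g) >= 2 and any(v > median_value for p in g for v in p)]
    scores = [group_coherence(g) for g in kept]
    return sum(scores) / len(scores) if scores else float("nan")
```

Taking the minimum means that a single poorly correlated probeset pair lowers the score of the whole group, which is what makes the measure sensitive to across-protein pairs in gene-level groups.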
Investigating expression correlations between interacting protein pairs
Given protein-protein interaction (PPI) data from HPRD or IntAct, we first transformed the binary relations between protein accessions into IPI-IPI pairs, and also derived the corresponding gene-gene pairs. For each PPI, we assembled the PPI probeset pairs and the GGI probeset pairs as illustrated in Figure 3, where PPI pairs are the probeset pairs associated with the interacting IPI IDs and GGI pairs are those associated with the corresponding Gene IDs. The PCCs of all PPI pairs and of all GGI pairs were calculated and averaged into a PPI PCC and a GGI PCC, respectively. These PCCs of interacting pairs were further used to estimate the accompanying false discovery rates with the SPLOSH FDR estimation method.
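The difference between the PPI PCC and the GGI PCC of one interaction can be made concrete with a small sketch. The dictionary-based lookup tables and all identifiers are hypothetical, and the SPLOSH FDR estimation is not reproduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_pcc(probesets_a, probesets_b, expr):
    """Average PCC over all probeset pairs between two probeset lists."""
    pccs = [pearson(expr[a], expr[b])
            for a in probesets_a for b in probesets_b]
    return sum(pccs) / len(pccs)


def ppi_and_ggi_pcc(ipi_pair, ipi_to_gene, ipi_to_probesets,
                    gene_to_probesets, expr):
    """Return (PPI PCC, GGI PCC) for one interacting IPI-IPI pair."""
    ipi_a, ipi_b = ipi_pair
    # PPI pairs: probesets annotated to the two interacting proteins.
    ppi = mean_pcc(ipi_to_probesets[ipi_a], ipi_to_probesets[ipi_b], expr)
    # GGI pairs: all probesets annotated to the two corresponding genes.
    ggi = mean_pcc(gene_to_probesets[ipi_to_gene[ipi_a]],
                   gene_to_probesets[ipi_to_gene[ipi_b]], expr)
    return ppi, ggi
```

When a gene carries a second probeset that targets a different transcript variant, that probeset enters the GGI average but not the PPI average, so the two values can diverge.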
The distributions of the PPI PCCs and the GGI PCCs were plotted in the same figure to show the contrast (Figure 4A). In addition, a background distribution of the PCCs of random probeset pairs was overlaid on the same figure. We set the number of random groups equal to the number of PPI or GGI pairs, but repeated the calculation of the random PCC distribution 20 times and averaged over the 20 separate random distributions in order to reduce random fluctuation. Within each run, we randomly compiled 2 × n (n = 1037 or 274, for HPRD or IntAct, respectively) pairs of probesets, where every two probeset pairs formed a group. The two PCCs of each group were first averaged into a group-level PCC, and the distribution was calculated over the n group-level PCCs. The group-level averaging was devised to mimic the corresponding averaging step in the PPI PCC and GGI PCC calculations.
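The random background can be sketched as below. The histogram binning over [-1, 1], the number of bins and the fixed seed are assumptions of this sketch, since the text does not specify how the distribution was tabulated; the function names are illustrative.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def random_group_pccs(expr, n_groups, rng):
    """Draw 2 * n_groups random probeset pairs and average them pairwise."""
    ids = list(expr)
    group_pccs = []
    for _ in range(n_groups):
        pccs = []
        for _ in range(2):                # two probeset pairs per group
            a, b = rng.sample(ids, 2)     # two distinct probesets
            pccs.append(pearson(expr[a], expr[b]))
        group_pccs.append(sum(pccs) / 2)  # group-level PCC
    return group_pccs


def histogram(values, n_bins):
    """Counts of values in n_bins equal-width bins spanning [-1, 1]."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v + 1) / 2 * n_bins)
        counts[min(max(idx, 0), n_bins - 1)] += 1
    return counts


def background_distribution(expr, n_groups, n_runs=20, n_bins=20, seed=0):
    """Average the random PCC histogram over n_runs independent runs."""
    rng = random.Random(seed)
    total = [0] * n_bins
    for _ in range(n_runs):
        run = histogram(random_group_pccs(expr, n_groups, rng), n_bins)
        total = [t + c for t, c in zip(total, run)]
    return [t / n_runs for t in total]
```

Here n_groups plays the role of n in the text (1037 for HPRD, 274 for IntAct), and averaging the 20 histograms bin by bin smooths the background curve without changing its total count.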
BLAST: Basic Local Alignment Search Tool
GEO: Gene Expression Omnibus
HG-U133A: Human Genome U133A Array
HPRD: Human Protein Reference Database
IPI: International Protein Index
MOE430A_2: Mouse Genome 430A 2.0 Array
PCC: Pearson correlation coefficient
We would like to thank Prof. James Scott for the insightful discussion when conceiving the project and Dr. Wei-Zhong He for editorial help. We also wish to thank the reviewers for their helpful criticisms and suggestions for improving the manuscript. This work was supported in part by grants from the Shanghai Morning-Star Program (Type A) (06QA14037), the Shanghai Fundamental Research Program (05DJ14009), the Pujiang Promising Scientist Program (Class A, 06PJ14073), the Shanghai Human Brain Disease Proteome Research (04DZ14005), and the National "973" Basic Research Program (2006CB0D1203, 2006CB0D1205).
- Ramsay G: DNA chips: state-of-the art. Nature Biotechnology 1998, 16(1):40–44. 10.1038/nbt0198-40
- Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, Brown EL: Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 1996, 14(13):1675–1680. 10.1038/nbt1296-1675
- Stoughton RB: Applications of DNA microarrays in biology. Annu Rev Biochem 2005, 74: 53–82. 10.1146/annurev.biochem.74.082803.133212
- Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW: On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 2001, 8(1):37–52. 10.1089/106652701300099074
- Le K, Mitsouras K, Roy M, Wang Q, Xu Q, Nelson SF, Lee C: Detecting tissue-specific regulation of alternative splicing as a qualitative change in microarray data. Nucleic Acids Res 2004, 32(22):e180. 10.1093/nar/gnh173
- Yang YH, Xiao Y, Segal MR: Identifying differentially expressed genes from microarray experiments via statistic synthesis. Bioinformatics 2005, 21(7):1084–1093. 10.1093/bioinformatics/bti108
- Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 2005, 102(43):15545–15550. 10.1073/pnas.0506580102
- Kim SY, Volsky DJ: PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics 2005, 6: 144. 10.1186/1471-2105-6-144
- Tu K, Yu H, Li YX: Combining gene expression profiles and protein-protein interaction data to infer gene functions. J Biotechnol 2006, 124(3):475–485. 10.1016/j.jbiotec.2006.01.024
- Bhardwaj N, Lu H: Correlation between gene expression profiles and protein-protein interactions within and across genomes. Bioinformatics 2005, 21(11):2730–2738. 10.1093/bioinformatics/bti398
- Ideker T, Ozier O, Schwikowski B, Siegel AF: Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 2002, 18 Suppl 1: S233–40.
- Ladd AN, Cooper TA: Finding signals that regulate alternative splicing in the post-genomic era. Genome Biol 2002, 3(11):reviews0008. 10.1186/gb-2002-3-11-reviews0008
- Laurencikiene J, Kallman AM, Fong N, Bentley DL, Ohman M: RNA editing and alternative splicing: the importance of co-transcriptional coordination. EMBO Rep 2006, 7(3):303–307.
- Harbig J, Sprinkle R, Enkemann SA: A sequence-based identification of the genes detected by probesets on the Affymetrix U133 plus 2.0 array. Nucleic Acids Res 2005, 33(3):e31. 10.1093/nar/gni027
- Leong HS, Yates T, Wilson C, Miller CJ: ADAPT: a database of affymetrix probesets and transcripts. Bioinformatics 2005, 21(10):2552–2553. 10.1093/bioinformatics/bti359
- Buck K, Vanek M, Groner B, Ball RK: Multiple forms of prolactin receptor messenger ribonucleic acid are specifically expressed and regulated in murine tissues and the mammary cell line HC11. Endocrinology 1992, 130(3):1108–1114. 10.1210/en.130.3.1108
- Lim SJ, Jung HH, Cho YA: Postnatal development of myosin heavy chain isoforms in rat extraocular muscles. Mol Vis 2006, 12: 243–250.
- Okoniewski MJ, Miller CJ: Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations. BMC Bioinformatics 2006, 7: 276. 10.1186/1471-2105-7-276
- Dai M, Wang P, Boyd AD, Kostov G, Athey B, Jones EG, Bunney WE, Myers RM, Speed TP, Akil H, Watson SJ, Meng F: Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res 2005, 33(20):e175. 10.1093/nar/gni179
- Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, Apweiler R: The International Protein Index: an integrated database for proteomics experiments. Proteomics 2004, 4(7):1985–1988. 10.1002/pmic.200300721
- Chalifa-Caspi V, Yanai I, Ophir R, Rosen N, Shmoish M, Benjamin-Rodrig H, Shklar M, Stein TI, Shmueli O, Safran M, Lancet D: GeneAnnot: comprehensive two-way linking between oligonucleotide array probesets and GeneCards genes. Bioinformatics 2004, 20(9):1457–1458. 10.1093/bioinformatics/bth081
- Gautier L, Moller M, Friis-Hansen L, Knudsen S: Alternative mapping of probes to genes for Affymetrix chips. BMC Bioinformatics 2004, 5: 111. 10.1186/1471-2105-5-111
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 2002, 30(1):207–210. 10.1093/nar/30.1.207
- Liu G, Loraine AE, Shigeta R, Cline M, Cheng J, Valmeekam V, Sun S, Kulp D, Siani-Rose MA: NetAffx: Affymetrix probesets and annotations. Nucleic Acids Res 2003, 31(1):82–86. 10.1093/nar/gkg121
- Chinese SMEC: Molecular evolution of the SARS coronavirus during the course of the SARS epidemic in China. Science 2004, 303(5664):1666–1669. 10.1126/science.1092002
- Schwikowski B, Uetz P, Fields S: A network of protein-protein interactions in yeast. Nat Biotechnol 2000, 18(12):1257–1261. 10.1038/82360
- Vazquez A, Flammini A, Maritan A, Vespignani A: Global protein function prediction from protein-protein interaction networks. Nat Biotechnol 2003, 21(6):697–700. 10.1038/nbt825
- Chen Y, Xu D: Global protein function annotation through mining genome-scale data in yeast Saccharomyces cerevisiae. Nucleic Acids Res 2004, 32(21):6414–6424. 10.1093/nar/gkh978
- Letovsky S, Kasif S: Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics 2003, 19 Suppl 1: i197–204. 10.1093/bioinformatics/btg1026
- Tornow S, Mewes HW: Functional modules by relating protein interaction networks and gene expression. Nucleic Acids Res 2003, 31(21):6283–6289. 10.1093/nar/gkg838
- Garber ME, Troyanskaya OG, Schluens K, Petersen S, Thaesler Z, Pacyna-Gengelbach M, van de Rijn M, Rosen GD, Perou CM, Whyte RI, Altman RB, Brown PO, Botstein D, Petersen I: Diversity of gene expression in adenocarcinoma of the lung. PNAS 2001, 98(24):13784–13789. 10.1073/pnas.241500798
- Grigoriev A: A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res 2001, 29(17):3513–3519. 10.1093/nar/29.17.3513
- Jansen R, Greenbaum D, Gerstein M: Relating whole-genome expression data with protein-protein interactions. Genome Res 2002, 12(1):37–46. 10.1101/gr.205602
- Peri S, Navarro JD, Amanchy R, Kristiansen TZ, Jonnalagadda CK, Surendranath V, Niranjan V, Muthusamy B, Gandhi TK, Gronborg M, Ibarrola N, Deshpande N, Shanker K, Shivashankar HN, Rashmi BP, Ramya MA, Zhao Z, Chandrika KN, Padma N, Harsha HC, Yatish AJ, Kavitha MP, Menezes M, Choudhury DR, Suresh S, Ghosh N, Saravana R, Chandran S, Krishna S, Joy M, Anand SK, Madavan V, Joseph A, Wong GW, Schiemann WP, Constantinescu SN, Huang L, Khosravi-Far R, Steen H, Tewari M, Ghaffari S, Blobe GC, Dang CV, Garcia JG, Pevsner J, Jensen ON, Roepstorff P, Deshpande KS, Chinnaiyan AM, Hamosh A, Chakravarti A, Pandey A: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Res 2003, 13(10):2363–2371. 10.1101/gr.1680803
- Pounds S, Cheng C: Improving false discovery rate estimation. Bioinformatics 2004, 20(11):1737–1745. 10.1093/bioinformatics/bth160
- Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R: IntAct: an open source molecular interaction database. Nucleic Acids Res 2004, 32(Database issue):D452–5. 10.1093/nar/gkh052
- Zhang J, Finney RP, Clifford RJ, Derr LK, Buetow KH: Detecting false expression signals in high-density oligonucleotide arrays by an in silico approach. Genomics 2005, 85(3):297–308. 10.1016/j.ygeno.2004.11.004
- Kuo WP, Jenssen TK, Butte AJ, Ohno-Machado L, Kohane IS: Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 2002, 18(3):405–412. 10.1093/bioinformatics/18.3.405
- Elo LL, Lahti L, Skottman H, Kylaniemi M, Lahesmaa R, Aittokallio T: Integrating probe-level expression changes across generations of Affymetrix arrays. Nucleic Acids Res 2005, 33(22):e193. 10.1093/nar/gni193
- Hwang KB, Kong SW, Greenberg SA, Park PJ: Combining gene expression data from different generations of oligonucleotide arrays. BMC Bioinformatics 2004, 5: 159. 10.1186/1471-2105-5-159
- Kothapalli R, Yoder SJ, Mane S, Loughran TPJ: Microarray results: how accurate are they? BMC Bioinformatics 2002, 3(1):22. 10.1186/1471-2105-3-22
- Mecham BH, Klus GT, Strovel J, Augustus M, Byrne D, Bozso P, Wetmore DZ, Mariani TJ, Kohane IS, Szallasi Z: Sequence-matched probes produce increased cross-platform consistency and more reproducible biological results in microarray-based gene expression measurements. Nucleic Acids Res 2004, 32(9):e74. 10.1093/nar/gnh071
- Wheeler DL, Church DM, Federhen S, Lash AE, Madden TL, Pontius JU, Schuler GD, Schriml LM, Sequeira E, Tatusova TA, Wagner L: Database resources of the National Center for Biotechnology. Nucleic Acids Res 2003, 31(1):28–33. 10.1093/nar/gkg033
- Lee JS, Chu IS, Mikaelyan A, Calvisi DF, Heo J, Reddy JK, Thorgeirsson SS: Application of comparative functional genomics to identify best-fit mouse models to study human cancer. Nat Genet 2004, 36(12):1306–1311. 10.1038/ng1481
- Lee C, Wang Q: Bioinformatics analysis of alternative splicing. Brief Bioinform 2005, 6(1):23–33. 10.1093/bib/6.1.23
- Shoemaker DD, Schadt EE, Armour CD, He YD, Garrett-Engele P, McDonagh PD, Loerch PM, Leonardson A, Lum PY, Cavet G, Wu LF, Altschuler SJ, Edwards S, King J, Tsang JS, Schimmack G, Schelter JM, Koch J, Ziman M, Marton MJ, Li B, Cundiff P, Ward T, Castle J, Krolewski M, Meyer MR, Mao M, Burchard J, Kidd MJ, Dai H, Phillips JW, Linsley PS, Stoughton R, Scherer S, Boguski MS: Experimental annotation of the human genome using microarray technology. Nature 2001, 409(6822):922–927. 10.1038/35057141
- Kampa D, Cheng J, Kapranov P, Yamanaka M, Brubaker S, Cawley S, Drenkow J, Piccolboni A, Bekiranov S, Helt G, Tammana H, Gingeras TR: Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res 2004, 14(3):331–342. 10.1101/gr.2094104
- Castle J, Garrett-Engele P, Armour CD, Duenwald SJ, Loerch PM, Meyer MR, Schadt EE, Stoughton R, Parrish ML, Shoemaker DD, Johnson JM: Optimization of oligonucleotide arrays and RNA amplification protocols for analysis of transcript structure and alternative splicing. Genome Biol 2003, 4(10):R66. 10.1186/gb-2003-4-10-r66
- Hu GK, Madore SJ, Moldover B, Jatkoe T, Balaban D, Thomas J, Wang Y: Predicting splice variant from DNA chip expression data. Genome Res 2001, 11(7):1237–1245. 10.1101/gr.165501
- Entrez Gene[ftp://ftp.ncbi.nih.gov/gene/]
- Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, Clarke L, Coates G, Cuff J, Curwen V, Cutts T, Down T, Eyras E, Fernandez-Suarez XM, Gane P, Gibbins B, Gilbert J, Hammond M, Hotz HR, Iyer V, Jekosch K, Kahari A, Kasprzyk A, Keefe D, Keenan S, Lehvaslaiho H, McVicker G, Melsopp C, Meidl P, Mongin E, Pettett R, Potter S, Proctor G, Rae M, Searle S, Slater G, Smedley D, Smith J, Spooner W, Stabenau A, Stalker J, Storey R, Ureta-Vidal A, Woodwark KC, Cameron G, Durbin R, Cox A, Hubbard T, Clamp M: An overview of Ensembl. Genome Res 2004, 14(5):925–928. 10.1101/gr.1860604
- Ensembl MartView[http://www.ensembl.org/Multi/martview]
- Sequence based identification and annotation of Affymetrix probesets[http://mriweb.moffitt.usf.edu/mpv/share/MPV_U133PLUS_Export.zip]
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
- Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. Proc Natl Acad Sci U S A 2005, 102(38):13544–13549. 10.1073/pnas.0506577102
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.