Regulation of protein abundance is a central determinant of cellular phenotype. The ability to conduct and interpret studies of proteome-wide alterations in protein abundance therefore holds tremendous promise for biological understanding. Proteomics based on tandem mass spectrometry (MS/MS) enables direct detection of peptide fragments for identification and quantitation of proteins in a proteome-wide manner. However, it has some major handicaps, especially detection biases and low dynamic range (though techniques requiring labeling can have great dynamic range). Hybridization-based expression microarrays are a well-established high-throughput technology for global measurement of mRNA transcript abundances. However, although mRNA expression precedes protein translation, the correlation between transcript level and abundance of the corresponding protein product is often poor.
Thus neither transcriptomic nor proteomic studies are perfect. However, when performed on the same samples they are complementary [4, 5]. A relevant analogy comes from statistics. Central to many statistical procedures (such as empirical Bayes estimation) is the established principle that combining two data sources with different sources of bias and variance frequently produces greater precision than either alone [6, 7]. Genomic and proteomic data sets have different sources of bias and variance, so combining them may lead to a more precise view of differential protein abundance. Consider one application, biomarker discovery. Improving the selection of candidates to validate is a worthy goal, since biomarker validation is generally elaborate and costly. If both transcriptomic and proteomic platforms agree on a strong differential expression between the groups of patients to be distinguished, the candidate becomes more attractive. If they do not, caution is called for.
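The statistical principle invoked above can be illustrated with a simplified worked case. The sketch assumes two independent, unbiased estimates x_1 and x_2 of the same quantity, with variances sigma_1^2 and sigma_2^2. These symbols and the assumptions of independence and unbiasedness are introduced here for illustration only, and they are stronger than what holds for real transcriptomic and proteomic measurements. Under them, the inverse-variance weighted combination has a variance no larger than that of either estimate alone:

```latex
\hat{x} = w\,x_1 + (1 - w)\,x_2,
\qquad
w = \frac{\sigma_2^2}{\sigma_1^2 + \sigma_2^2},
\qquad
\operatorname{Var}(\hat{x}) = \frac{\sigma_1^2\,\sigma_2^2}{\sigma_1^2 + \sigma_2^2}
\;\le\; \min\!\left(\sigma_1^2,\ \sigma_2^2\right).
```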
The potential contributors to poor correlations are numerous. Post-transcriptional events such as alternative splicing and microRNA regulation complicate the link between the abundance of a specific mRNA and production of its protein product. Thus microarray transcript signals may not faithfully reflect the pool of transcripts available for translation. On the other hand, proteins which degrade quickly will be underrepresented compared to those with greater half-lives, so variation in protein degradation can also reduce the correlation between transcriptomics and proteomics. In summary, decoupled expression at the mRNA and protein levels might relate to post-transcription and post-translation events; explanations might be forthcoming from studies of microRNA-mediated regulation and protein degradation [4, 9].
But the decoupling might not be biological; it might stem from errors in the data integration. The supposed identity of either the gene coding for the probeset's target transcript or the detected protein may be incorrect. The quality of a study integrating proteomic and genomic data rests heavily on reliable mapping between the identifiers of the two high-throughput platforms. Discrepancies between bioinformatics identifier mapping resources are abundant. Draghici has demonstrated a variety of serious ID mapping anomalies, in which results depend strikingly on which bioinformatics mapping resource is chosen. Identifier assignments for genes, mRNA species, and proteins are managed, and their annotations curated, by different bioinformatics centers. The annotation systems of the microarray probesets depend on the chip manufacturer to provide mappings to transcript identifiers, so the provenance or motivation behind a link between an array probeset and an mRNA species may be unclear. In addition, in mass spectrometry the incidence of misidentification of protein accessions is not negligible, for a host of reasons, including variations in sequence search algorithms and tuning parameters. Therefore, for consumers of these resources, determining accurately how a protein connects with an mRNA species and thence to a microarray probeset can be labor-intensive and error-prone.
The possibility of misidentifications is troubling in biomarker discovery projects. When a candidate marker appears promising, misidentification of the protein or transcript will lead to wasted effort in initial marker validation and/or subsequent clinical prediction studies. With two integrated discovery platforms, on the other hand, if a candidate appears promising in one of the platforms but ID mapping generates a poor correlation with the other platform, then the integration is useful by casting suspicion on the reported identity of the biomarker.
Despite the appeal of integrated genomic and proteomic analysis, it is rarely done, and more rarely still is the identifier mapping methodology described. But there are pioneering examples of integrated studies. Chen studied gene expression using the Plus 2.0 Affymetrix chip together with proteomic analysis using Isotope-Coded Affinity Tags (ICAT) methodology. The focus was high-risk neuroblastoma, with a small number of clinical stage 4 samples with MYCN amplified and stage 1 samples with MYCN not amplified. It was unclear how identifier mapping was performed. Another study, by Shankavaram et al., used a protein lysate array and the Affymetrix U133 Plus 2.0 chip to identify biomarkers in the NCI-60 cancer cell panel data. MatchMiner was used to obtain annotation matches. Chen et al. 2002 used integrated analysis on 76 lung adenocarcinoma and nine non-neoplastic lung tissues. The report mentions that an elevation in protein did not always correlate with an elevated mRNA expression level. The identifier mapping resource is not mentioned for this study.
The motivation for our study of this issue was a transcriptome-proteome integrated study comparing early endometrial cancer with normal endometrial tissue from cancer-free subjects. Preliminary efforts quickly revealed major anomalies in some of the identifier matches. This result motivated a deeper investigation into the fidelity of identifier mapping, to achieve acceptable reliability of the linkages between the two data sets. The starting point is the proteomic study, which generates UniProt accessions (ACCs). With these ACCs we queried three prominent bioinformatics identifier mapping resources to obtain corresponding Affymetrix probeset identifiers. Many mapping strategies can be constructed by chaining combinations of bioinformatics resources, but these three are the only ones we are aware of that provide a direct mapping query suitable for this specific purpose. Inconsistencies encountered here are likely emblematic of all mapping strategies. Here we report the extent of agreements and disagreements among the resources' returned results.
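The kind of agreement summary described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in this study: the function name `pairwise_agreement`, the input layout (one dictionary per resource, mapping each ACC to the set of probeset identifiers it returned), and the use of the Jaccard index as the agreement measure are all assumptions made for the example, and the identifiers in the usage lines are placeholders rather than real accessions or probesets.

```python
from itertools import combinations


def pairwise_agreement(mappings):
    """Compare (ACC, probeset) pairs returned by each pair of resources.

    mappings: dict of resource name -> dict of ACC -> set of probeset IDs.
    Returns a dict keyed by (resource_a, resource_b) with overlap counts
    and the Jaccard index of the two sets of returned pairs.
    """
    # Flatten each resource's results into a set of (ACC, probeset) pairs.
    pair_sets = {
        name: {(acc, probe) for acc, probes in result.items() for probe in probes}
        for name, result in mappings.items()
    }
    summary = {}
    for a, b in combinations(sorted(pair_sets), 2):
        shared = pair_sets[a] & pair_sets[b]
        union = pair_sets[a] | pair_sets[b]
        summary[(a, b)] = {
            "shared": len(shared),
            "only_first": len(pair_sets[a] - pair_sets[b]),
            "only_second": len(pair_sets[b] - pair_sets[a]),
            # Two empty retrieval sets are treated as full agreement.
            "jaccard": len(shared) / len(union) if union else 1.0,
        }
    return summary


# Placeholder identifiers, for illustration only.
toy = {
    "resource_A": {"ACC_1": {"probe_1", "probe_2"}, "ACC_2": {"probe_3"}},
    "resource_B": {"ACC_1": {"probe_2"}, "ACC_2": {"probe_3", "probe_4"}},
}
report = pairwise_agreement(toy)
```

Comparing whole (ACC, probeset) pairs rather than probeset lists alone keeps a probeset that is assigned to different proteins by different resources from being counted as an agreement.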
We also used the two datasets to generate correlations between message expression and protein expression. Such correlations have been studied before. Yu et al. provide a discussion of the kinetics of the expected relation between mRNA expression and protein expression, together with supplemental data suggesting that correlation values may vary according to GO-defined functional grouping, including some groups with negative correlations. Nie et al. studied mRNA-protein correlations in Desulfovibrio vulgaris. They found that mRNA abundance explains 20-28% of protein abundance variation, and functional category explains 10-15% of variation in correlation. We decomposed the correlations into a mixture with a zero-centered component and a positive component. The mixture distribution model sheds light on the degree to which positive correlations exceed negative ones, allowing estimation of the distribution of correlations among correctly mapped protein-probeset pairs, without needing to know at this point which specific pairs are correctly mapped and which are not. The only assumption is that the distribution of correlations among the mismatched pairs is symmetric around zero. We proceeded on the presumption that better identifier matching should generate a higher rate of matched pairs with strong correlations between the transcript signals and corresponding protein spectral counts. This presumption is flawed as a "gold standard" because of issues discussed above: pre- and post-translation events which decouple expression at the mRNA and protein levels [4, 9]. Nevertheless, large observed correlations are likely to be more prevalent when the ID mapping is done correctly, and less prevalent when it is done incorrectly. We demonstrate that the ensemble of correlations is useful for evaluating mapping services even though any individual correlation is not.
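The symmetry assumption above already permits a rough estimate of how large the positive component is, before any distribution is fitted. The Python sketch below is a simplified stand-in and not the mixture model fitted in this study. It adds one further assumption that is not made in the text, namely that correctly mapped pairs never yield negative correlations, so every negative correlation belongs to the zero-centered component. Under that assumption the zero-centered component contributes, in expectation, as many positive as negative values, and its share is about twice the fraction of negative correlations. The function name `estimate_positive_fraction` is hypothetical.

```python
import numpy as np


def estimate_positive_fraction(correlations):
    """Estimate the share of pairs in the positive mixture component.

    Assumes (for illustration) that the zero-centered component is
    symmetric around zero and that the positive component has no mass
    below zero, so the zero-centered share is about 2 * P(r < 0).
    """
    r = np.asarray(correlations, dtype=float)
    if r.size == 0:
        raise ValueError("at least one correlation is required")
    n_negative = int(np.sum(r < 0))
    # Mirror the negative side onto the positive side; cap at 1.
    null_fraction = min(1.0, 2.0 * n_negative / r.size)
    return 1.0 - null_fraction


# Toy values, for illustration only: 2 of 8 correlations are negative,
# so the zero-centered share is estimated as 4/8.
toy_r = [-0.2, -0.1, 0.1, 0.2, 0.5, 0.6, 0.7, 0.8]
positive_share = estimate_positive_fraction(toy_r)
```

If some correctly mapped pairs do have negative correlations, this estimate errs on the low side, because those pairs are counted toward the zero-centered component.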
This paper first presents summaries of the retrieval sets from each individual resource, then presents comparisons between pairs of resources, and finally evaluates the mappings based on the assay correlations. The overall objective is to develop and demonstrate methods that bring a needed critical but constructive eye to integrative studies.