1D and 2D annotation enrichment: a statistical method integrating quantitative proteomics with complementary high-throughput data
© Cox and Mann; licensee BioMed Central Ltd. 2012
Published: 5 November 2012
Quantitative proteomics now provides abundance ratios for thousands of proteins upon perturbations. These need to be functionally interpreted and correlated to other types of quantitative genome-wide data such as the corresponding transcriptome changes. We describe a new method, 2D annotation enrichment, which compares quantitative data from any two 'omics' types in the context of categorical annotation of the proteins or genes. Suitable genome-wide categories are membership of proteins in biochemical pathways, their annotation with gene ontology terms, sub-cellular localization, the presence of protein domains or the membership in protein complexes. 2D annotation enrichment detects annotation terms whose members show consistent behavior in one or both of the data dimensions. This consistent behavior can be a correlation between the two data types, such as simultaneous up- or down-regulation in both data dimensions, or a lack thereof, such as regulation in one dimension but no change in the other. For the statistical formulation of the test we introduce a two-dimensional generalization of the nonparametric two-sample test. The false discovery rate is stringently controlled by correcting for multiple hypothesis testing. We also describe one-dimensional annotation enrichment, which can be applied to single omics data. The 1D and 2D annotation enrichment algorithms are freely available as part of the Perseus software.
Mass spectrometry-based proteomics can now deliver highly accurate data on hundreds of thousands of peptide features in a single biological project. Accurate comparative protein quantification is feasible for thousands of proteins with methods based on stable isotopic labeling but also with label-free approaches. Coverage of these methods has already reached complete proteome scale for the yeast model organism and other proteomes of similar or lower complexity. With further improvements in the underlying technologies, comprehensive quantification in human cells also seems likely to be achieved soon. Therefore one can now carry out quantitative expression proteomics on a scale similar to that achieved for nucleic acids with microarrays or, more recently, with deep sequencing approaches.
The ability to perform side-by-side large-scale quantitative profiling of the proteome, transcriptome or genome raises the question of which classes of gene products show concordant and which show discordant behavior between the different levels of gene expression. For instance, one general question in proteomics is to what extent expression changes correlate between the transcriptome and the proteome. In the hypothetical case of pure transcriptional regulation the correlation between these two levels would be near one, limited only by the technical imperfections of the respective quantitative profiling technologies. Indeed, while early investigations found low or no correlation between proteome and transcriptome [6, 7], much higher levels of global correlation have been reported recently [8, 9].
While transcriptional regulation is generally a dominant aspect of the entire expression cascade, there are many known examples of posttranscriptional regulation, such as micro-RNA controlled inhibition of transcripts and directed protein degradation. Therefore it would be desirable to have a method that highlights those proteins or protein classes for which a differential behavior is observed at different levels of the gene expression program. A straightforward manual approach is to make a scatter plot of protein changes versus transcript changes and look for single 'off-diagonal data points' or whole 'off-diagonal data point clouds'. This can be very instructive and is indeed recommended as a first step of data analysis. However, this quickly becomes very complex, and in any case it is desirable to have formal statistical criteria for assigning 'interestingness' to protein classes in this scatter plot. As an example, one may want to have a p-value based approach for the question whether the points in the scatter plot that correspond to proteins in the proteasome core complex show a significantly different distribution from all the other proteins. Furthermore, if that is the case we would like to have some sort of score for the proteasome that describes the direction in the scatter plot in which the proteasome components tend to deviate from the overall distribution of all proteins. This directional score will then indicate whether the statistically significant behavior of the proteasomal proteins is the same between transcriptome and proteome or whether it is 'off-diagonal', i.e. differing between the transcriptome and the proteome. Finally, in case we repeat the p-value calculation over many complexes or other annotation terms we need to take precautions for multiple hypothesis testing. All the issues described above are addressed by the 2D annotation enrichment method introduced in this manuscript.
In the following sections we describe all the details of the method, illustrate its principles with relevant examples and give details on how readers can apply it to their own data.
The protein intensity data used in the explanation of the 1D annotation distribution in the Results section is taken from a label-free proteome study of mouse dendritic cells that reached a depth of 5,780 proteins. In that study, cell sub-populations were obtained by fluorescence-activated cell sorting (FACS), and proteins were separated by 1D SDS-PAGE and digested with trypsin.
The yeast data used in the sub-section on 2D annotation enrichment was obtained from de Godoy et al., where haploid and diploid cells were differentially labeled with Lys8/Lys0 SILAC. The microarray data for the comparison is from Galitski et al. The data was filtered as described in. For the proteome versus DNA copy number example, the proteomic as well as the comparative genomic hybridization (array CGH) data is from Geiger et al.
In all cases peptides were analyzed on a nanoflow HPLC system connected to a hybrid LTQ-Orbitrap or Orbitrap-Velos mass spectrometer (Thermo Fisher Scientific). Human and mouse data were searched against International Protein Index (IPI) protein sequences, while the yeast data was searched against the protein translations in the Saccharomyces Genome Database (SGD).
We start the description of the data analysis workflow at the point where protein abundances or protein expression ratios have already been calculated. While all examples show proteomics data obtained with the MaxQuant software[16, 17] in combination with the Andromeda search engine, this is not a prerequisite for the application of the 2D annotation enrichment method. In fact, the quantitative proteomics data could have been produced by any other software or search engine. Similarly, data from other levels of expression are not limited to any particular technology or software platform. Ideally the measurements from the different technologies are done on the same samples, but this is not strictly necessary.
When reporting quantitative data for proteins from shotgun proteomics data, care has to be taken in the counting of independent protein identifications. The measured peptides may in some cases not be unique and map to several proteins, which is called the protein inference problem. Only proteins that are distinguishable based on the measured set of quantified peptides should be counted as individual protein identifications. All redundancy on the protein level should be removed, for example by collapsing them to protein groups. This is a particularly important point when performing annotation enrichment analysis. For instance, assume that there are ten isoforms of a particular protein reported as independent protein identifications even though the quantified peptides for all of them are the same. If one kept all ten isoforms as separate identified proteins, their quantitative profiles would correlate perfectly and their annotation terms would be highly similar. This would obviously lead to irresolvable difficulties in statistical tests for contingency between the quantitative expression data and the annotation terms and would likely produce false positives. Hereafter we always refer to suitable groups of proteins in the above sense when we use the word protein. MaxQuant automatically performs this grouping and is furthermore integrated with subsequent bioinformatic analysis in the freely available Perseus software, which implements all algorithms described below.
When using MaxQuant, if several quantitative experiments are combined or replicates were made, these will all be projected onto the same protein grouping over all 'quantitative columns'. Therefore, the proteome data will already be in a convenient matrix form, even for very complex experimental designs. This will not be the case when one wants to compare the proteome data with transcriptome data, for instance. Several probe sets of an Affymetrix chip measure the same gene and there may be several genes belonging to the same protein group. For the matching we take a protein-centric view. For each protein in the protein group we determine all probe sets that are annotated in the chip annotation file with a UniProt identifier. It is not trivial to decide which UniProt identifiers to use for a group of proteins that are indistinguishable by the measured peptides. A protein group consists of proteins from the list of protein sequences submitted to the search engine that cannot be quantified independently based on the set of identified peptides. In particular, if two proteins have identical sets of identified peptides they will be grouped together. Also, if the set of identified peptides of one protein is completely contained in the set of identified peptides of another protein, these two proteins will be combined in a protein group as well. Proteins within a protein group are sorted by the number of identified peptides in descending order. For the remaining ambiguities we use the razor peptide or parsimony concept, which means that the peptide is assigned according to Occam's razor principle to the protein group that most plausibly explains its existence, namely the one that already has the most peptide identifications assigned to it.
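The grouping rules above (identical peptide sets, or one protein's peptide set contained in another's) can be sketched in a few lines of Python. This is our own simplified illustration with a hypothetical dict-of-sets input, not MaxQuant's actual implementation, and it omits the razor-peptide assignment for remaining ambiguities:

```python
def group_proteins(peptide_sets):
    """Collapse proteins whose identified-peptide sets are equal to, or
    contained in, another protein's set (simplified sketch of the rule
    described in the text; input: dict protein -> set of peptides)."""
    # Sort by peptide count, descending, so potential group leaders come first.
    proteins = sorted(peptide_sets, key=lambda p: len(peptide_sets[p]), reverse=True)
    groups = []  # each group is a list: [leader, other members...]
    for prot in proteins:
        peps = peptide_sets[prot]
        for group in groups:
            if peps <= peptide_sets[group[0]]:  # subset of (or equal to) leader
                group.append(prot)
                break
        else:
            groups.append([prot])
    return groups

# Hypothetical example: a shorter isoform is absorbed into the leader's group.
groups = group_proteins({
    "A": {"p1", "p2", "p3"},
    "A_short": {"p1", "p2"},   # contained in A's peptides -> same group
    "B": {"p4", "p5"},
})
print(groups)
```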
The number of probe sets matched in this way to every protein group can vary now from zero to several. If none is matching then no comparison can be made for this protein. If one is matching, then the quantitative information for this probe set will be used. If several probe sets match the point-wise median of their quantitative profiles is taken. Expression data from other microarray types can be matched in a similar way as long as the vendor provides UniProt or other protein identifiers for the hybridization probes. Deep RNA sequencing data is also easily matched, for instance in the form of RPKM values produced by a suitable software. DNA copy numbers from comparative genomic hybridization are associated with the closest protein coding gene for each hybridization probe. All copy numbers that have the same nearest protein coding gene are combined by taking their median value.
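A minimal pandas sketch of this matching step, assuming a hypothetical table with one row per probe set and a column giving the matched protein group (the column names and values are ours, not from any vendor annotation file). Grouping and taking the point-wise median covers the zero/one/several-probe-sets cases in one operation:

```python
import pandas as pd

# Hypothetical probe-set expression profiles, each matched to a protein
# group via a UniProt identifier (illustrative data, not from the paper).
probes = pd.DataFrame({
    "uniprot": ["P1", "P1", "P1", "P2"],  # protein group each probe set maps to
    "sample_a": [1.2, 1.4, 5.0, 0.3],
    "sample_b": [2.1, 2.0, 1.9, 0.5],
})

# One matching probe set -> used directly; several -> point-wise median of
# their profiles. Proteins with no matching probe set simply do not appear.
matched = probes.groupby("uniprot").median()
print(matched)
```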
Note that for the quantitative analysis of expression data (irrespective of which kind) it is usually advisable to take the logarithm before proceeding with further steps. This is true for ratios as well as for abundance data. Taking logarithms is advisable before averaging expression profiles even when the median is used, because for an even number of values the median is itself an average of the two central numbers. The need for taking logarithms becomes immediately apparent in the case of ratios. One would expect that the average of a two-fold up-regulation and a two-fold down-regulation should be no regulation. This is however not the case if the ratios themselves are averaged: (2 + 1/2)/2 = 1.25 ≠ 1. If logarithms (e.g. to the base two) are averaged, the desired result is obtained: (log(2) + log(1/2))/2 = 0. The base of the logarithm does not matter in principle since it can be absorbed into an overall factor multiplied onto all the data. However, it is customary to use base two for ratio data and base two or ten for abundance data.
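The ratio-averaging argument can be checked directly:

```python
import math

# Averaging raw ratios biases the result: a two-fold up-regulation and a
# two-fold down-regulation should average out to 'no change'.
ratios = [2.0, 0.5]
raw_mean = sum(ratios) / len(ratios)                         # 1.25, not 1
log_mean = sum(math.log2(r) for r in ratios) / len(ratios)   # 0 in log space
print(raw_mean, 2 ** log_mean)                               # back-transform: 1.0
```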
We base the annotation of proteins on UniProt identifiers. Also here, as was the case for the matching of proteins to other 'omics' data, one has to take care of the selection of UniProt identifiers for a particular protein group. Since each of the proteins in a protein group potentially can have different annotation terms it is a non-trivial question which terms should be reported for the protein group as a whole and then be used in the enrichment analysis. Selecting only the annotation of the protein with the most peptides (which is anyway not necessarily a unique choice) might not always be the best decision since only a shorter or 'activated' version of the protein might carry the complete annotation. On the other hand, including the annotation of all protein group members may pick up false positives because there may be proteins in the group only due to a single peptide and not because of a genuine relatedness of the protein. We found that a good compromise is to include the annotation of all proteins in the protein group that have at least half the number of peptides of the leading protein which has the maximal number of peptides in the group. This choice usually avoids the two pitfalls mentioned above.
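The compromise rule above can be sketched as follows; the function name and the dict input format are our own illustration of the rule, not Perseus code:

```python
def annotation_members(peptide_counts):
    """Select the protein-group members whose annotation terms are used:
    all proteins with at least half as many identified peptides as the
    leading protein (sketch of the rule described in the text;
    input: dict protein -> peptide count)."""
    threshold = max(peptide_counts.values()) / 2
    return [p for p, c in peptide_counts.items() if c >= threshold]

# Hypothetical group: the leader has 12 peptides, so the cutoff is 6.
members = annotation_members({"ISOFORM_A": 12, "ISOFORM_B": 7, "ISOFORM_C": 1})
print(members)
```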
One major source of annotation is the gene ontology (GO), which carries information on biological processes, molecular functions and sub-cellular localization. We use the Kyoto Encyclopedia of Genes and Genomes (KEGG) database as a source for pathway membership information and we infer the protein domain content from the Pfam database. For human data the Corum database is a well curated repository of protein complexes. Expression in tissues (as yes/no information) can also be included as annotation. Annotation relating to the transcripts or genes can also be included. For instance, miRNA binding sites in the three-prime region of the transcript can be treated as categorical annotation. Even knockout phenotypes, if they exist for the particular organism, can be treated as an annotation. Indeed any kind of information that can be summarized in terms ascribable to proteins can serve as a source of annotation, whose enrichment in the process under study can be tested.
The 2D annotation enrichment algorithm works equally well for 1D data, such as any quantitative proteomics experiment. We first describe the principle of the 1D distribution analysis, which also serves as a preparation for the 2D algorithm. The input is a single column of numerical values, assigning one numerical value to every protein. These values are typically protein ratios or absolute protein abundances. They could also be derived quantities, like average fold-changes between replicate groups, or p-values or test statistics resulting from a test for significant changes in protein expression. If the column has missing values then the respective proteins will be ignored in further analysis.
To be independent of the shape of the distribution we apply a non-parametric test which in particular does not assume a normal distribution of the numerical values. These properties single out the two-sided (two-sample) Wilcoxon-Mann-Whitney test as the method of choice, which we apply to all protein categorizations in a given set of terms.
The test statistic is U = R_1 - n_1(n_1 + 1)/2, where n_1 is the size of group 1 and R_1 is the sum of ranks in group 1.
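As a sketch, the relation between the rank sum and the Mann-Whitney U statistic can be checked on toy data against SciPy's implementation (the toy values and variable names are ours):

```python
from scipy.stats import rankdata, mannwhitneyu

# Toy data: values for proteins in a category (group 1) versus all the rest.
group1 = [5.1, 6.3, 7.2]
rest = [1.0, 2.2, 3.1, 4.0]

ranks = rankdata(group1 + rest)       # rank all values jointly
n1 = len(group1)
r1 = ranks[:n1].sum()                 # R_1: rank sum of group 1
u1 = r1 - n1 * (n1 + 1) / 2           # U from the rank sum, as in the text

# Cross-check against SciPy's two-sided Wilcoxon-Mann-Whitney test.
u_scipy, p = mannwhitneyu(group1, rest, alternative="two-sided")
print(u1, u_scipy, p)
```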
Technically, the Mann-Whitney test assumes independence of the values, which is a good approximation in our case, in particular since every peptide is used in only one protein group for quantification. If non-unique peptides were used in several protein groups, the independence assumption would not hold.
The positional score is s = 2(R1 - R2)/n, where R1 and R2 are the average ranks within the group under consideration and its complement (all remaining proteins in the experiment), respectively, and n is the total number of data points. It is a number between -1 and 1. A value near 1 indicates that the protein category is strongly concentrated at the high end of the numerical distribution, while a value near -1 means that the values are all at the low end of the distribution. For significant terms it is not possible that s reaches zero exactly, but especially for larger categories that show a slight but consistent trend it is possible to have small absolute values of s. A moderately positive value of s for a category with many members, for instance, indicates that there is a significant collective shift towards larger values for this category which however is small in absolute terms and possibly not noticeable when looking at individual proteins. Note that the method's calculations are entirely based on information within the measured proteome. Often, enrichment calculations in proteomics against the whole genome are problematic. By construction these problems are completely circumvented here.
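A minimal sketch of this positional score, computed as twice the difference of the average ranks of the group and its complement, divided by the total number of data points (the function name is ours, not Perseus code):

```python
from scipy.stats import rankdata

def enrichment_score(group_values, rest_values):
    """Positional score in [-1, 1]: near 1 if the category sits at the high
    end of the distribution, near -1 at the low end (sketch of the score
    described in the text)."""
    n1 = len(group_values)
    n = n1 + len(rest_values)
    ranks = rankdata(list(group_values) + list(rest_values))
    r1 = ranks[:n1].mean()   # average rank within the category
    r2 = ranks[n1:].mean()   # average rank within the complement
    return 2.0 * (r1 - r2) / n

# A category sitting entirely at the high end of the distribution gives s = 1.
s = enrichment_score([8.0, 9.0, 10.0], [1.0, 2.0, 3.0, 4.0])
print(s)
```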
In the ribosome example of Figure 1, the p-value is 2 × 10^-37 and the s-value is 0.85, indicating that ribosomal proteins are strongly enriched among the most abundant proteins.
When applied to ratios of protein abundances the method described here is similar to the quantile-based enrichment calculations introduced by Pan et al. There the distribution of protein ratios was subdivided into bins and then all categories were tested for being enriched in these bins. In contrast, the 1D annotation enrichment developed here has the advantage that it is not necessary to define a somewhat arbitrary positioning of bin boundaries beforehand. Instead the distribution of values is scanned for interesting sub-categories in an unbiased way without using thresholds.
Also for the two-dimensional case, we want to avoid the normality assumption and therefore wish to use a non-parametric testing strategy. What is needed for the generalization to two numerical dimensions is a replacement of the Wilcoxon-Mann-Whitney test that works with two-dimensional input data. All the remaining strategy can then be taken over from the one-dimensional case. The concept of rank sums that is used in the definition of the test statistic for the Wilcoxon-Mann-Whitney test at first appears to be tied to the one-dimensional case, since only in one dimension is it possible to define an order of the data points in a meaningful way. For points in a two-dimensional plane, in contrast, a natural order relationship does not exist. The situation is different for parametric tests, like the Student t-test or analysis of variance (ANOVA), where the generalization to the multivariate case is straightforward and known as multivariate analysis of variance (MANOVA) (see e.g. reference), which is a statistical test procedure for comparing multivariate population means of several groups. To solve our problem of non-parametric tests in higher dimensions, we make use of the circumstance that the Wilcoxon-Mann-Whitney test is equivalent to the simple Student t-test performed on the ranks. This is the case because the t-test statistic calculated from the ranked data is a monotone function of the rank sum expression that is used for the Wilcoxon-Mann-Whitney test statistic. Combining the fact that the non-parametric test is equivalent to the parametric test on ranks and that ANOVA has a straightforward generalization to higher dimensions, we propose to use the MANOVA test on the ranked multivariate data, where the data is replaced by ranks in each dimension separately.
The inputs to the MANOVA are the ranked values for the x and y dimensions, separated into groups 1 and 2.
We define the resulting MANOVA test result as the 2D annotation enrichment p-value. The FDR of this approach can be controlled with the Benjamini-Hochberg method in the same way as was done for the one-dimensional case.
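For two groups, the one-way MANOVA reduces to Hotelling's T^2 test, so the 2D test on separately ranked dimensions can be sketched without a full MANOVA package. This is a simplified illustration under that equivalence; the function, toy data and variable names are ours, not the Perseus implementation:

```python
import numpy as np
from scipy.stats import rankdata, f as f_dist

def rank_manova_2d(xy_group, xy_rest):
    """Two-group MANOVA (via Hotelling's T^2) on data replaced by ranks in
    each dimension separately; returns the p-value (sketch of the 2D test
    described in the text)."""
    data = np.vstack([xy_group, xy_rest]).astype(float)
    n1, n2 = len(xy_group), len(xy_rest)
    # Rank each dimension over all proteins jointly.
    ranks = np.column_stack([rankdata(data[:, j]) for j in range(data.shape[1])])
    g1, g2 = ranks[:n1], ranks[n1:]
    diff = g1.mean(axis=0) - g2.mean(axis=0)
    # Pooled within-group covariance of the ranks.
    s_pooled = ((n1 - 1) * np.cov(g1, rowvar=False) +
                (n2 - 1) * np.cov(g2, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(s_pooled, diff)
    # Hotelling's T^2 transforms to an F statistic with (p, n1+n2-p-1) df.
    p_dim = ranks.shape[1]
    f_stat = (n1 + n2 - p_dim - 1) / (p_dim * (n1 + n2 - 2)) * t2
    return f_dist.sf(f_stat, p_dim, n1 + n2 - p_dim - 1)

# Toy category up-regulated in both dimensions versus the remaining proteins.
p_value = rank_manova_2d(
    [(8, 7.5), (9, 9), (7, 8), (10, 6)],
    [(1, 2), (2, 1), (3, 3.5), (2.5, 4), (4, 2.5), (3.5, 1.5)],
)
print(p_value)
```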
While replicates within one technology are usually best done as 'biological' as possible to ensure that the findings are robust and reproducible, for cross technology comparisons it is more desirable to have the equivalent of a 'technical' replicate. For instance, the cell populations from which the transcriptome and the proteome are measured should be as similar as possible, ideally aliquots from the same sample so that one is sure that one samples the same cellular state on different levels of expression. If desired, the whole measurement including the proteome and the transcriptome can be repeated as 'biological' replicates. In the majority of cases, however, the available data has not been recorded in this optimal way. Of course the data analysis described here can still be applied to this situation as well.
Often enrichment analysis in proteomics is performed by calculating a p-value corresponding to a test of whether a certain annotation term is enriched in a certain set of proteins relative to all genes in the genome. The results of such calculations have to be taken with caution, especially in cases where the proteome coverage is far from saturation or completeness, since apart from the effect under investigation they are biased by which proteins are measurable at all by the employed mass spectrometric technology. This may lead to seemingly significant p-values for large protein categories only because the measurable and abundant proteins tend to have more annotation. Another bias can come from which proteins are expressed at all in the proteome under study. We completely avoid this problem by always basing the enrichment calculations on the protein population that has been observed in the measurement.
Another issue of interest is the potential application of corrections when multiple related terms are used for statistical comparisons. For example, terms in GO are mutually dependent. In principle, correction methods accounting for such dependencies can be applied to 1D and 2D annotation enrichment as well, and we might do so in the future. Note, however, that by not taking the hierarchy and relatedness of terms into account, the significant findings reported after multiple hypothesis correction are on the conservative side, since the number of effectively independent tests is lower than the total number of terms that is used in the multiple testing correction. Therefore there is no danger of over-reporting. On the contrary, at fixed FDR one might miss a few significant terms that one would have obtained with a method taking the relatedness into account.
Many other tools for enrichment analysis already exist. In Huang et al. the authors categorize existing tools into three classes: singular enrichment analysis (SEA), gene set enrichment analysis (GSEA) and modular enrichment analysis (MEA). In this classification our 1D annotation enrichment belongs to the GSEA class, because it is a 'no-cutoff' method. This means that it is not necessary to define a set of regulated proteins beforehand, thereby reducing arbitrary factors in such a protein selection step. In contrast, the 2D annotation enrichment method is inherently novel, since it is the first enrichment method dealing with two 'omics' dimensions simultaneously.
We thank all the other members of the Proteomics and Signal Transduction group for help and discussions. This work was partially supported by PROSPECTS, a 7th Framework grant by the European Directorate (grant agreement HEALTH-F4-2008-201648/PROSPECTS) and by the Max Planck Society for the advancement of Science.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 16, 2012: Statistical mass spectrometry-based proteomics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/13/S16.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.