PCA2GO: a new multivariate statistics based method to identify highly expressed GOTerms
 Marc Bruckskotten^{1}Email author,
 Mario Looso^{1},
 Franz Cemiĉ^{2},
 Anne Konzer^{1},
 Jürgen Hemberger^{2},
 Marcus Krüger^{1} and
 Thomas Braun^{1}
https://doi.org/10.1186/1471210511336
© Bruckskotten et al; licensee BioMed Central Ltd. 2010
Received: 2 October 2009
Accepted: 21 June 2010
Published: 21 June 2010
Abstract
Background
Several tools have been developed to explore and search Gene Ontology (GO) databases allowing efficient GO enrichment analysis and GO tree visualization. Nevertheless, identification of highly specific GOterms in complex data sets is relatively complicated and the display of GO term assignments and GO enrichment analysis by simple tables or pie charts is not optimal. Valuable information such as the hierarchical position of a single GO term within the GO tree (topological ordering), or enrichment within a complex set of biological experiments is not displayed. Pie charts based on GO tree levels are, themselves, onedimensional graphs, which cannot properly or efficiently represent the hierarchical specificity for the biological system being studied.
Results
Here we present a new method, which we name PCA2GO, capable of GO analysis using complex multidimensional experimental settings. We employed principal component analysis (PCA) and developed a new score, which takes into account the relative frequency of certain GO terms and their specificity (hierarchical position) within the GO graph. We evaluated the correlation between our representation score R and a standard measure of enrichment, namely pvalues to convey the versatility of our approach to other methods and point out differences between our method and commonly used enrichment analyses. Although p values and the R score formally measure different quantities they should be correlated, because relative frequencies of GO terms occurrences within a dataset are an indirect measure of protein numbers related to this term. Therefore they are also related to enrichment. We showed that our score enables us to identify more specific GOterms i.e. those positioned further down the GOgraph than other common tools used for this purpose. PCA2GO allows visualization and detection of multidimensional dependencies both within the acyclic graph (GO tree) and the experimental settings. Our method is intended for the analysis of several experimental sets, not for one set, like standard enrichment tools. To demonstrate the usefulness of our approach we performed a PCA2GO analysis of a fractionated cardiomyocyte protein dataset, which was identified by enhanced liquid chromatographymass spectrometry (GeLCMS). The analysis enabled us to detect distinct groups of proteins, which accurately reflect properties of biochemical cell fractions.
Conclusions
We conclude that PCA2GO is an alternative efficient GO analysis tool with unique features for detection and visualization of multidimensional dependencies within the dataset under study. PCA2GO reveals strongly correlated GO terms within the experimental setting (in this case different fractions) by PCA group formation and improves detection of more specific GO terms within experiment dependent GO term groups than standard p value calculations.
Keywords
Background
The advent of highthroughput techniques, which allow rapid acquisition of vast data sets, has created the need for fast and reliable functional annotation of molecules. A typical DNA microarray [1] or high throughput LC massspectrometry experiment [2] often leads to the identification of changes in the expression of hundreds if not thousands of molecules. The identification of pathways or biological processes that are affected in a given experimental setup requires functional annotation of multiple gene products, which can be achieved by association with a set of annotation terms. The Gene Ontology project is a collaborative effort to provide a controlled vocabulary to describe gene product attributes in different organisms [3]. The use of Gene Ontology Annotations (GOA) in highthroughput contexts has become a widespread practice to gain insight into the potential biological meaning of profiling experiments. GO provides structured, controlled vocabularies and classifications such as biological process, molecular function and cellular component. The relationship between GO terms can be described by a directed acyclic graph (DAG). Relations within GO terms, like is_a and part_of or regulates are represented by direct edges. The truepath rule postulates that an annotated gene is also associated with less specific parents of that term. Explicitly, the pathway from a child term all the way up to its toplevel parent(s) must always be true [3]. Publicly available software packages like DAVID [4], GOstat [5], FatiGO [6], GOrilla [7], BINGO [8], blast2GO [9] and others use various approaches to visualize, filter and search the GO database. Webbased applications allow users to download their findings in graphic formats or as text tables. One of the most common GO applications is to test datasets for gene enrichment. Enrichment of single GO terms might reveal potential functional characteristics of a given dataset. Typically, enrichment analyses are based on hyper geometric or binomial models. Most software tools use similar algorithms although the accepted input data might differ. Some tools need a target as well as a background set of genes as input, while others use only a default background set. A more general discussion of the advantages of different approaches can be found in [10]. Here, we investigated the usability of our new GObased method applied to modern highthroughput techniques, like mass spectrometry using the SILAC approach. Naturally, our method has to be adapted to the question addressed by the experiment. In this case the purity of certain presorted cell fraction derived from LCMS needed to be verified. We showed that our methods enabled us to check whether it is suitable to verify the purity of a certain cell fraction using the GOA platform. Furthermore, we were able to show, that our R score combined with principal component analysis (PCA) is capable to detect strongly represented branches of the GO DAG (i.e. gene products included within these branches) comparable to the output of standard p values enrichment tools.
Results
To detect strongly represented GO branches more specifically than with commonly used methods, we developed a specific representation score R that combined with PCA takes into account relative frequencies of gene product occurrences within the data set and topological ordering of the GODAG. To reveal specific dependencies within the dataset, we performed a PCA on that new score R. In order to demonstrate the performance of our method, we applied it to an experimental dataset derived from fractionated cardiomyocyte proteins, which was obtained by enhanced liquid chromatographymass spectrometry (GeLCMS). Our analysis enabled us to accurately detect distinct groups of proteins that are localized in specific cell fractions.
Data processing and R score assignment
In a first step we associated each protein to cytosolic, membrane, nuclear or nonspecific fraction as described in the methods section below. We identified 179 proteins, which were associated with the cytosolic, 328 with the membrane, and 45 with the nucleusfraction of cardiomyocytes. An additional group of 1284 cardiomyocyte proteins was detected but placed in the "nonspecific" fraction, since proteins from this group were found in significant amounts in more than one fraction. This nonspecific fraction represented an experimental internal standard as detailed in the methods section. GO terms were individually assigned to our protein lists, based on the IPI database from EBI. Since we were interested in the quality and efficiency of our fractionization, we used only GO terms that belong to the classification "cellular compartment". As a next step, we evaluated a Score of Representation (R) for each GO term as detailed later in the text. These R scores were then used for PCA to unravel multivariate dependencies inherent in our data. Our R score mapped the GO terms on the interval [0,1]. Furthermore the R score provided information about the order of the GO terms. Within these groups of GO terms, the R score of a more specific term was lower than that of a less specific one, even if the relative frequency of the gene products represent by the GO term was almost equivalent. This was due to our definition of R. When the GO terms were sorted by decreasing R score values, more specific terms were found at the end of the list.
PCA results
Correlation of principal components
Fraction  F1  F2  F3  F4 

Cytosolic  34,82%  34,57%  30,53%  0,06% 
Membrane  35,12%  60,66%  3,93%  0,28% 
Nucleus  29,21%  3,83%  63,11%  3,83% 
Nonspecific  0,83%  0,92%  2,41%  95,82% 
Groups of accumulated points in the PCA biplot have similar R scores with respect to certain PC variables. The specificity level, defined here as the number of predecessors of the GOterm, is therefore related to the position (PC coordinate) of the group within the biplot in relation to that principal component.
 (1)
The specificity level of a certain GOterm is coded in the R score.
 (2)
The scoreplot allows identification of GOterms, which are characteristic for a PC. These represent PCcorrelated classifications of experimental data, in our case cytosolic, membrane and nucleus cell fractions.
 (3)
The measure of specificity of terms within the dataset can be detected by PCA on the Rscore.
Interpretation of PCA plots obtained for different cellular fractions
Number of GOterms resolved by PCA and identified within certain fractions
Fractions  PC1 vs PC2  PC2 vs PC3 

Cytosolic  12  12 
Membrane  28  28 
Nucleus  3  3 
Nonspecific  144  221 
CytosolicNonspecific  32  32 
MembraneNonspecific  43  81 
NucleusNonspecific  15  15 
Nucleus and Cytosolic  11  4 
Membrane and Nucleus  2  
Cytosol and Membrane  2 
Comparison of PCA2GO with existing tools for identification of over representation
Overview of the GOresolving capabilities of PCA2GO and other tools
Identified to be enriched by  GOTerms  Connection with terms detected by PCA2GO 

BINGO  intracellular, intracellular part  chaperonincontaining Tcomplex, microtubule organizing center part, pericentriolar material, UBC13MMS2 complex phosphoribosylaminoimidazole carboxylase complex 
BINGO, GOrilla  macromolecular complex, protein complex  UBC13MMS2 complex, posphoribosylaminoimidazole carboxylase complex, angiogeninPRI complex, charperonincontaining Tcomplex 
BINGO, GOrilla  cytoplasm, cytoplasmic part  phosphoribosylaminoimidazole carboxylase complex, chaperonincontaining Tcomplex, microtubule organizing center part, pericentriolar material 
BINGO, GOrilla  cytosol, cytosolic part  charperonincontaining Tcomplex 
BINGO, GOrilla  Extracellular matrix part, basement membrane  laminin10 complex 
Discussion
The R score in combination with PCA allows comprehensive identification of correlations between different datasets. Therefore, a comparison of different datasets regardless of inherent complexity is possible. Virtually no limitations exist for the dataset and the experimental variable, which one might choose to measure representation in a GO term. Our score measures the specificity of GO terms in an experimental dataset. PCA2GO does not extend the list of over represented GO terms obtained by standard pvalue based enrichment analyses, but detects the most specific terms, which are unique for a group. Proteins related to these specific terms can be directly identified. Identification of specific terms will also facilitate detection of overrepresentation at a higher tree level. To our knowledge no other GO analysis tool generates this information. Our method is especially suited for complex experimental settings due to the multivariate character of PCA. All samples are processed simultaneously and ordered, whilst taking into account the explained variance of the Principal Components. Thereby, relevant groups of GO terms are identified. In this case, specific groups of GO terms for each cell fraction, as well as mixed groups are identified. Clearly, such information assists biological interpretation of data and allows direct comparison of proteins related to these groups of GO terms, which makes a comparison between different experimental settings possible. Other GO analysis methods do not allow these direct comparisons since each sample has to be analyzed separately yielding lists of GO terms, which are only comparable to one another if the backgrounddata (zero reference) for the samples is identical. In our method the score R already includes an uniform background (the total GOA). Therefore, the principal component analysis needs no specific background data.
Conclusions
PCA2GO is a sensitive and powerful tool to detect over representation of specific GO terms in a complex DAG, which cause GO enrichment at a higher level. The identification of these terms is due to the multivariate property of PCA. The score R sorts the GO terms by statistical weight associated to the root ontology, to which the GO term belongs. The Score R combines the relative frequency of assigned GOterm I related to protein x_{ i } with a normalisation against all GO annotated proteins (see methods section). In combination with PCA an analysis of GO terms can be performed and lists of over represented GO terms are given, emphasizing the most specific terms, which are unique for a group of proteins, thereby facilitating biological interpretation of experimental data. PCA2GO works complementary to known tools like GOrilla or Bingo and delivers additional information about what causes GOenrichment e.g. biological processes or cellular component, etc. This cannot be done by pvalue analysis alone. The advantage of PCA2GO is the detection of highly specific, preenriched GO terms, which are not significantly represented in their GO level, but may lead to an over representation of this GO branch on a higher level. These terms cannot be detected by standard statistical based tools. Other preenriched terms will trigger enrichment in GO graph on a higher level, where hundreds of proteins appear as a kind of noise.

PCA2GO is very sensitive to detect GO terms using a small number of annotated proteins.

GO terms detected by PCA2GO are usually connected indirectly to less specific terms found to be enriched by other statistical based methods.

Small numbered terms containing few proteins facilitate backtracking from GO terms to proteins and identification of important proteins, which might not be discovered at a lower GOtree level.

PCA2GO has a high resolution power and detects experiment specific GO terms.
Our method can be seen as a complementary tool to existing GO enrichment analysis tools featuring the detection of experiment specific GO terms. In conjunction with GOrilla or BINGO, our tool provides an enhanced resolution to detect specific GO terms. It is predestinated for use in high throughput experimental settings and derived high throughput data and hence represents a significant improvement compared to previous approaches.
Methods
Generation of dataset
Proteins for enhanced liquid chromatographymass spectrometry (GeLC) analysis were derived from mouse cardiomyocytes. Cardiomyocytes were isolated by collagenase perfusion of intact hearts and subjected to subcellular fractionating resulting in a cytosolic, membrane and nuclear fraction (ProteoExtract^{®} Subcellular Proteome Extraction Kit, Calbiochem). Protein extracts were processed and mass spectrometry was performed as described [2]. Briefly, all LCMS/MS measurements were performed with an LTQOrbitrap Hybrid (Thermo Fisher Scientific) combined with an Agilent 1200 nanoflow HPLC system. The mass spectrometer was operated in the datadependent mode to automatically measure full MS scans and MS/MS spectra. Peptides were identified by searching against the International Protein Index sequence database (mouse IPI, version 3.24) using the Mascot search algorithm http://www.matrixscience.com. Mass spectra were analyzed by the MaxQuant software package [11, 12], which performs peak lists and false positive rate determination [13]. To analyze the distribution of proteins in each fraction relative to the total number of proteins all detected peptides per fraction were combined. To obtain reliable results for each subcellular fraction, proteins were only sorted into specific fractions when they showed 30% enrichment in the nucleus to cytosolic or membrane to nucleus ratio and 50% enrichment for the membrane to cytosolic ratio. Remaining proteins, which displayed no clear enrichment in one specific cell fraction, were grouped as nonspecific proteins. GO terms were allocated to proteins in each fraction to analyze the distribution of allocated gene products per GOterm. For GO annotation the standard files for mouse as provided by Gene Ontology were used http://www.geneontology.org/GO.current.annotations.shtml.
Definition of R score and enrichment of GOterms
With N_{ exp } being the total number of detected proteins in the experiment and M_{ GOA } representing the total number of proteins annotated by the Gene Ontology Annotation (GOA). For this purpose the source files were used to generate the whole GO structure to provide the right values for y_{ i } and M_{ GOA }. The number of N_{ exp } is important to decide, whether scaling is required or not. The exponent 2 expands the distribution of GO annotated proteins. Both variables x_{ i } and y_{ i } represent the total number of proteins. Specifically x_{ i } is the number of assigned proteins in the experimental dataset to a specific term i, taking all proteins assigned further down the DAG into account. y_{ i } represents the number of all possible proteins, which are assigned to the ith GO term, also counting successor protein GO annotations. Multiple assigned proteins are counted only once. The quotient is the relative frequency of assigned proteins x_{ i } in a term with the total number of N_{exp} of proteins in the dataset. The second factor in the definition of r, defined as , serves as a kind of normalization against all GO annotated proteins and is the reciprocal value of the relative frequency of the ith GO term referred to the total number of all GO annotated proteins. The weighting of the two factors by taking the square root of the second and quadrature of the first factor is done because the annotated proteins representing the experiment constitute only a small subset of the complete GO annotation.
PrePCA transforming and Post PCA Group extraction
The localization of specific groups and the extraction of GOterms after PCA was enhanced by global scaling: First, the score R was transformed by log_{2} to allow an almost linear distribution of R values for all fraction datasets prior to PCA analysis. Terms, which were not represented in a certain dataset were assigned a default finite not a numbervalue after logtransformation before performing the PCA. These values can be used for scaling, resulting in an enhancement of the measure of correlation within the dataset. Setting all not a numbervalues to a certain finite value avoids the logzeroproblem. The not a number value was varied iteratively from a starting value of 10 until a clear separation of groups was achieved in the biplot.
Introduction to PCA
The idea is to map the investigated complex system from a multidimensional space to a reduced space spanned by a few principal components (PCs) thereby revealing the principal and most important features, which underlie the data set. Consider a GO set with p GOterms and let x = (x_{1} x_{2} ... x_{ p })' be a p × 1 vector, where x_{ i } is a variable for GO representation values of the ith GOterm and t denotes transpose of a vector. Let Σ be the covariance matrix of x with dimension p × p. The eigenvectors and eigenvalues of Σ are defined as vectors α_{ l } and scalars λ_{ i } such that Σα_{ i } = λ_{ i }α_{ i }, i = 1,..., p. The first PC score (PC1) is a scalar defined as the linear function α^{ t } = α_{11}x_{1}+α_{12}x_{2}+⋯+α_{1p}x_{ p } of elements of x having the maximum variance among all linear functions of x(Jolliffe, 2002). Without loss of generality, assuming λ_{1}≥ λ_{2}≥ ⋯ ≥ λp, then it can be shown, that the vector of coefficients α_{1} for the first PC score is the eigenvector corresponding to largest eigenvalue of Σ and variance Σα^{ t } = λ_{ p }. The set of coefficients {α_{11} ,..., α_{1p}} are called the loadings of the first PC. An estimated value for the coefficients{α_{ i, j } = 1 ,..., p} (eigenvectors) of the PC scores on a set of GOterms can be computed using singular value decomposition (SVD) (Jolliffe, 2002). Briefly, let X be a N × p matrix with columns corresponding to standardized GO representation values (with mean 0 and variance 1) of a group of GOterms. The kth PC score is z = Xα^{ k }, where α_{ k }, is unit length eigenvector of the covariance matrix S= X^{ t }X/(N1) corresponding to the kth largest eigenvalue λ_{ k } and var Σ = λ_{ k } Furthermore let r = rank(X).
where U =_u_{1}, u_{2},..., u_{ r } _ is an N × r matrix, where u_{ k } = l^{1/2k}Xα k is scaled kth PC score. These are linear combinations of GO values corresponding to columns of matrix X. is an r × r diagonal matrix where l_{ k } is kth eigenvalue of XtX, A= {α_{1}, α_{2} ,..., α_{ r }}is a p × r matrix where α_{ k } is eigenvector of covariance matrix S, which are also coefficients defining PC scores. Note that since the kth eigenvalue of the covariance matrix S is λ_{ k } = l_{ k }/(N1), we have var(u_{ k } ) = 1/(N1). Therefore, SVD provides not only the coefficients and Single Decompositions (eigenvalues) for the PCs through A and L, but also the PC scores of each observation by matrix UL. For simple models, it can be shown that the PCs provide an optimal approximation to the original variables (Jolliffe, 2002). Graphs plotting the ith Principal Component versus the jth component are usually called scoreplots, which result in biplots if the Loadings are included. Biplots reveal the correlation of PC scores with the loadingvariables. Correlated scores form groups of PC scores, which are usually clearly discernible in both types of plots.
Declarations
Acknowledgements
This work was supported by the MaxPlanckSociety http://www.mpg.de/, the KerckhoffFoundation and the Excellence Clusters "Cardiopulmonary System" (in support of T.B.). The authors declare that they have no conflicting commercial interests related to this work.
Authors’ Affiliations
References
 Gensch N, Borchardt T, Schneider A, Riethmacher D, Braun T: Different autonomous myogenic cell populations revealed by ablation of Myf5expressing cells during mouse embryogenesis. Development 2008, 135(9):1597–1604. 10.1242/dev.019331View ArticlePubMedGoogle Scholar
 Kruger M, Moser M, Ussar S, Thievessen I, Luber CA, Forner F, Schmidt S, Zanivan S, Fassler R, Mann M: SILAC mouse for quantitative proteomics uncovers kindlin3 as an essential factor for red blood cell function. Cell 2008, 134(2):353–364. 10.1016/j.cell.2008.05.033View ArticlePubMedGoogle Scholar
 Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25–29. 10.1038/75556View ArticlePubMedPubMed CentralGoogle Scholar
 Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol 2003, 4(5):P3. 10.1186/gb200345p3View ArticlePubMedGoogle Scholar
 Beissbarth T, Speed TP: GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 2004, 20(9):1464–1465. 10.1093/bioinformatics/bth088View ArticlePubMedGoogle Scholar
 AlShahrour F, Minguez P, Tarraga J, Medina I, Alloza E, Montaner D, Dopazo J: FatiGO +: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res 2007, 35(Web Server):W91–96. 10.1093/nar/gkm260View ArticlePubMedPubMed CentralGoogle Scholar
 Eden E, Navon R, Steinfeld I, Lipson D, Yakhini Z: GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 2009, 10(1):48. 10.1186/147121051048View ArticlePubMedPubMed CentralGoogle Scholar
 Maere S, Heymans K, Kuiper M: BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 2005, 21(16):3448–3449. 10.1093/bioinformatics/bti551View ArticlePubMedGoogle Scholar
 Conesa A, Gotz S: Blast2GO: A Comprehensive Suite for Functional Analysis in Plant Genomics. Int J Plant Genomics 2008, 619832.Google Scholar
 Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005, 21(18):3587–3595. 10.1093/bioinformatics/bti565View ArticlePubMedPubMed CentralGoogle Scholar
 Cox J, Mann M: Is proteomics the new genomics? Cell 2007, 130(3):395–398. 10.1016/j.cell.2007.07.032View ArticlePubMedGoogle Scholar
 Graumann J, Hubner NC, Kim JB, Ko K, Moser M, Kumar C, Cox J, Scholer H, Mann M: Stable isotope labeling by amino acids in cell culture (SILAC) and proteome quantitation of mouse embryonic stem cells to a depth of 5,111 proteins. Mol Cell Proteomics 2008, 7(4):672–683.View ArticlePubMedGoogle Scholar
 Kersey PJ, Duarte J, Williams A, Karavidopoulou Y, Birney E, Apweiler R: The International Protein Index: an integrated database for proteomics experiments. Proteomics 2004, 4(7):1985–1988. 10.1002/pmic.200300721View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.