Quantitative behavior of protein complexes in human cells

Translational and post-translational control mechanisms in the cell result in widely observable differences between measured gene transcription and protein abundances. Among cellular entities, protein complexes are some of the most tightly controlled, through selective degradation of their individual proteins. They furthermore act as control hubs that regulate highly important processes in the cell and exhibit a high functional diversity due to their ability to change their composition and structure. To better understand and predict these functional states, extensive characterization of complex composition, behavior, and abundance is necessary. Mass spectrometry provides an unbiased approach to directly determine protein abundances across cell populations and thus to profile a comprehensive abundance map of proteins. We investigated the behavior of protein subunits in known complexes by comparing their abundance profiles across up to 140 cell types available in ProteomicsDB. After thorough assessment of different randomization methods and statistical scoring algorithms, we developed a computational tool to quantify the significance of concurrent profiles within a complex, thereby providing insights into the conservation of complex composition across human cell types. We identified intrinsic structures in complex behavior that allow us to determine which proteins orchestrate complex function. This analysis can be extended to investigate common profiles within arbitrary protein groups. With the CoExpresso web service, we offer a potent scoring scheme to assess proteins for their co-regulation and thereby offer insight into their potential for forming functional groups such as protein complexes. CoExpresso can be accessed through http://computproteomics/Apps/CoExpresso. Source code and R scripts for database generation are available at https://bitbucket.org/veitveit/coexpresso.
Author summary

Many proteins form multi-functional assemblies called protein complexes instead of working as single units. These complexes control most processes in the cell, which makes the full characterization of their behavior essential for understanding cellular control mechanisms. Detailed knowledge about complex behavior will help identify biomarkers that indicate aberrant cell states and drug targets that can correct them. We investigated abundance changes of protein complex components over more than 100 different human cell types. Using statistical scoring models, we estimated the evidence for co-regulation of the proteins and revealed which proteins form subunits with impact on complex function and composition. Through the interactive web service CoExpresso, any combination of proteins can be tested for co-regulation in human cells.

modifications. Ribosomes are one example of functional diversity: they are known to contribute differentially to the translation of distinct subpopulations of mRNAs [1]. There is a pressing need to investigate the capabilities of complexes for regulatory control of cellular processes. To achieve this, a detailed map of protein complex composition, abundance, and behavior in different cell types and tissues is required. Such a map will considerably improve the characterization and prediction of functional states.

Various experimental methods exist to identify protein complexes and to determine and quantify which protein subunits they are composed of. Determination of protein interaction partners within a complex provides valuable knowledge about complex and protein function and thus their potential behavior [2]. The most prominent experimental methods to determine protein-protein interactions are based on the yeast-2-hybrid protocol or on affinity purification coupled with mass spectrometry [3,4]. These methods, however, either suffer from large false identification rates or depend on purification steps that often lead to a strong bias in the results. More details about protein structure can be obtained by chemical cross-linking or hydrogen-deuterium exchange mass spectrometry [5]. Despite the power of these methods, they cannot yet be applied to entire proteomes. For an accurate, large-scale and general characterization, protein complex behavior should be studied across large numbers of samples without bias towards, e.g., particular subgroups of proteins, and should additionally rely on highly confident identification of the proteins.

There is an increasing amount of evidence supporting the hypothesis that the majority of protein complexes are tightly controlled in the cell. Post-transcriptional regulation occurs predominantly for protein complex members, leading to strong co-regulation of complex subunits. This has been shown by systematic investigation of protein and gene expression levels in human cancer [6,7], in a study comparing 11 cell types and 4 temporal states [8], based on the co-occurrence of protein pairs across human experiments in the PRIDE database [9], and more generally in a selection of proteomics data sets [10]. In summary, these studies showed that only a fraction of complex composition and abundance is regulated at the transcriptional level, and that other mechanisms such as protein degradation therefore contribute to protein complex stoichiometry.

This highlights the power of directly measuring protein abundance profiles by common proteomics approaches such as bottom-up mass spectrometry to thoroughly study protein complexes and their variants across cell types and states.

In contrast to most proteomics data repositories, where only raw data and identification results are available, ProteomicsDB [11,12] is a large compendium of quantitative protein abundances and is therefore highly useful for investigating general patterns of protein changes across more than 100 different human cell lines.

July 5, 2018

Here, we apply three scoring models to the ProteomicsDB data to assess the significance of subunit co-regulation in protein complexes. We compare and benchmark different randomization and scoring approaches on known complexes and reveal particular substructures of complex behavior for a few selected use cases. The scoring and extensive visualization are implemented in the web service CoExpresso, which allows investigating co-regulatory patterns in any group of human proteins.

We applied different scoring systems to evaluate whether proteins in human complexes exhibit similar regulatory behavior when compared over multiple cell types. Although a large set of protein abundances was available, coverage of the proteins over the 140 cell types was often sparse (S1 Fig),

The FAMS approach, however, performed differently, reaching a higher number of significant complexes for the protein-centered randomization.

At the protein level (Fig 2b), lower numbers of proteins with significant abundance profiles were expected, and were observed, when using randomization methods that maintain protein and cell type properties. Here, PCOM yields a higher number of proteins than FAMS and MCOM at low false discovery rates.

Recovery of proteins and complexes with significant abundance profiles does not, however, ensure robustness of the methods to noise. For example, a protein complex may contain subunits that do not follow the general trend of the abundance profiles. This could be due to wrong assignment of a protein to a complex, or to a subunit behaving differently because it is heavily regulated by, e.g., post-translational modifications or by forming transients that regulate complex function.

Randomization of the entire ProteomicsDB data led to lower robustness for all methods. On the other hand, protein-centered (PCS) and protein-cell type centered (PTCS) randomization gave nearly identical performance. Hence, the following analysis focuses on PCS randomization, although it is computationally the most expensive, as it yields higher counts of significant proteins. In addition, the MCOM and FAMS models had lower false positive rates, at least in the lower range.

Performance of scoring models, measured by robustness to 50%, 75% and 100% artificially added random proteins. Proteins were categorized into complex subunits and random proteins. True positive and false positive rates (TPR and FPR) are given by the fractions of true positives and false positives at a given FDR threshold. The MCOM and FAMS models lead to better performance. Only a slight difference between PCS and PTCS randomizations can be observed.

interphase [14]. We observed different abundance levels of NCAPH in several cell types, leading to a lower weight in the factor analysis (S3 Fig).

Use case B: The 28S mitochondrial ribosomal subunit (Fig 4), which is essential for ATP production, is the complex with the lowest FDR in all models and most proteins.

Use case C: The NUMAC complex (nucleosomal methylation activator complex, Fig 4) denotes a case with slightly lower significance. All scoring models suggest high significance, with an FDR below 0.5%. The 10 proteins were found in 33 cell types, with ACTB showing distinct behavior and drastically higher abundance than the other proteins.

A data source for tightly co-regulated proteins

Given the strong co-regulation in annotated protein complexes, we asked whether our randomly sampled protein groups with highly significant co-regulation could reveal novel, not yet well characterized complex compositions in human cells. However, random protein groups with the highest scores did not provide evidence for these proteins to be arranged as complexes as evidence scores for their interaction in

repositories (e.g. PRIDE nearly reaching 10,000 projects to date [15]). Quantified protein abundances are, however, still rarely available, partly because the comparison of protein abundance across experiments and projects remains a major bottleneck in the proteomics field. ProteomicsDB provides a large catalogue of protein abundances in human cell types, which we used to thoroughly investigate protein complex behavior.

Despite the large number of characterized cell types, data coverage is rather low: more than 20% of the proteins were detected in only 2-5 cell types. Such low coverage hindered straightforward application of, e.g., simple correlation, and we therefore compared a variety of scoring models and randomizations that reproduce the inherent data structure.

power, we predict that most protein complexes will be found to be translationally and post-translationally regulated.

We furthermore tested whether the large database of randomized protein groups could be used to identify novel protein assemblies that represent highly interacting functional modules such as complexes. We did not find enrichment for known

ProteomicsDB [11,12] and CORUM [16], respectively. We used three randomization approaches that resemble the data structure within all protein abundance profiles in different ways. Scores for the co-regulation of proteins in a complex were calculated by applying three different models for the comparison of protein profiles. The scores were stored in a database. For each protein in each complex, the significance of its co-regulation was calculated and assessed on the basis of the scores. A web service was implemented that allows interrogating the score database to test arbitrary protein groups for the significance of their co-regulation. Fig 1 provides an overview of the workflow and the web interface.

having quantitative values for all proteins $p = 1, \dots, n_p$, were considered, resulting in an $n_t \times n_p$ matrix $E_C(t, p)$.

With this procedure, we obtained 1,000-20,000 unique and random protein groups for each relevant combination, giving a total of more than 20,000,000 randomized groups.
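As a minimal sketch of how a matrix such as $E_C(t, p)$ can be assembled from sparse abundance data, the following Python example keeps only the cell types in which every protein of a tested group was quantified. It assumes the abundances are held in a NumPy array with NaN marking missing values; the function name `complete_case_matrix` and the toy numbers are hypothetical and are not taken from the CoExpresso source code.

```python
import numpy as np

def complete_case_matrix(abundance, group):
    """Return (E_C, kept) for the proteins in `group`.

    abundance: 2-D array, rows = cell types, columns = proteins,
               NaN = protein not quantified in that cell type (assumed layout).
    group:     column indices of the proteins in the tested group.
    E_C has one row per retained cell type (n_t) and one column per protein (n_p).
    """
    sub = abundance[:, group]                # all cell types x n_p proteins
    complete = ~np.isnan(sub).any(axis=1)    # cell types with a value for every protein
    return sub[complete, :], np.flatnonzero(complete)

# Toy table (hypothetical values): 5 cell types x 4 proteins
abundance = np.array([
    [5.1, 4.9, np.nan, 6.0],
    [5.5, 5.2, 3.1, 6.2],
    [np.nan, 5.0, 3.0, 6.1],
    [6.0, 5.8, 3.9, np.nan],
    [4.8, 4.5, 2.7, 5.9],
])

E_C, kept = complete_case_matrix(abundance, [0, 1, 2])
print(E_C.shape, kept)
```

With sparse coverage, the number of retained cell types $n_t$ can shrink quickly as the group grows, which is one reason a plain correlation over all cell types is not directly applicable.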

Protein- and tissue-centered sampling (PTCS): All proteins simultaneously found in the same cell types as the tested protein group were randomized to create 10,000 samples. That is, $n_p$ proteins are sampled independently at random from all proteins that appear in the same cell types as the tested protein group, and their observed values in those cell types are considered.

$\mathrm{cor}\left(E_C(t, p), E_C(t, q)\right)$

Factor analysis model (FAMS): The model is based on factor analysis developed for microarray analysis [18] and recently modified to improve protein inference in bottom-up mass spectrometry data [19]. The following parameters were used: weight $w = 0.1$, $\mu = 0.1$, 1,000 maximal iterations and a minimal noise of 0.0001. The feature weights $W$ were used to score each protein of a group: $S_{\mathrm{FARMS}}(p) = W(p)$.

For each model, randomization method, and a given combination of $n_t$ and $n_p$, the aforementioned scores were calculated for randomized protein groups and stored in a database. These scores, more than 100,000,000 in total, were then used to calculate the probabilities to reject the null hypothesis (of observing the score for a set of $n_p$ proteins over $n_t$ tissues) for both a single protein $p$ and a group of proteins:

For p-values from multiple protein groups, correction for multiple testing was carried out via the Benjamini-Hochberg procedure.
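The following Python sketch illustrates how scores of this kind could be turned into significance values. It is not the CoExpresso implementation. The mean pairwise Pearson correlation is an assumed stand-in for a correlation-based score, the empirical p-value with a pseudo-count is one common convention assumed here, and all function names and toy numbers are hypothetical. Only the Benjamini-Hochberg adjustment follows the standard procedure named in the text.

```python
import numpy as np

def mean_correlation_scores(E):
    """Per-protein score: mean Pearson correlation with the other proteins.

    E: n_t x n_p matrix (cell types x proteins) without missing values.
    Assumed stand-in for a correlation-based scoring model.
    """
    corr = np.corrcoef(E, rowvar=False)           # n_p x n_p correlation matrix
    n_p = corr.shape[0]
    return (corr.sum(axis=1) - 1.0) / (n_p - 1)   # drop the diagonal (self-correlation)

def empirical_p_value(score, null_scores):
    """Fraction of randomized scores at least as high as the observed one.

    The +1 pseudo-count (an assumption) avoids p-values of exactly zero.
    """
    null_scores = np.asarray(null_scores, dtype=float)
    return (1 + np.sum(null_scores >= score)) / (1 + null_scores.size)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity, starting from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# Toy group: proteins 1 and 2 co-vary, protein 3 runs opposite to both
E = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 2.0],
              [3.0, 6.0, 1.0]])
scores = mean_correlation_scores(E)

# Observed score compared against scores of (hypothetical) randomized groups
p_obs = empirical_p_value(0.9, [0.1, 0.5, 0.95, 0.2])

# Multiple-testing correction over several groups
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(scores, p_obs, adjusted)
```

In this toy group, the third protein receives a clearly lower score than the other two, mirroring the idea that a subunit with deviating behavior is flagged at the protein level even when the group as a whole is scored.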