CrMP-Sol database: classification, bioinformatic analyses and comparison of cancer-related membrane proteins and their water-soluble variant designs

Ma, Lina; Zhang, Sitao; Liang, Qi; Huang, Wenting; Wang, Hui; Pan, Emily; Xu, Ping; Zhang, Shuguang; Tao, Fei; Tang, Jin; Qing, Rui

doi:10.1186/s12859-023-05477-9

Research
Open access
Published: 25 September 2023

Lina Ma¹^na1,
Sitao Zhang¹^na1,
Qi Liang²,
Wenting Huang¹,
Hui Wang¹,
Emily Pan³,
Ping Xu¹,
Shuguang Zhang⁴,
Fei Tao¹,
Jin Tang² &
…
Rui Qing¹

BMC Bioinformatics volume 24, Article number: 360 (2023)

Abstract

Membrane proteins are critical mediators of tumor progression and present enormous therapeutic potential. Although gene profiling can identify their cancer-specific signatures, systematic correlations between protein functions and tumor-related mechanisms are still unclear. We present here the CrMP-Sol database (https://bio-gateway.aigene.org.cn/g/CrMP), which aims to bridge the gap between the two. Machine learning was used to extract key functional descriptions for protein visualization in 3D space, where spatial distributions provide function-based predictive connections between proteins and cancer types. CrMP-Sol also presents QTY-enabled water-soluble designs to facilitate native membrane protein studies despite natural hydrophobicity. Five examples with varying numbers of transmembrane helices in different categories were used to demonstrate the feasibility. Native and redesigned proteins exhibited highly similar characteristics, predicted structures and binding pockets, and slightly different docking poses against known ligands, although task-specific designs are still required for proteins more susceptible to internal hydrogen bond formation. The database can accelerate therapeutic developments and biotechnological applications of cancer-related membrane proteins.

Background

Membrane proteins are minuscule molecular machines embedded in the phospholipid bilayer of cells that encompass essential enzymatic, signaling and molecular transporting functions in living organisms. They make up ~ 30% of genes in higher eukaryotes and account for ~ 60% of therapeutic targets for modern drugs [1]. Unsurprisingly, membrane proteins are involved in the most common forms of cancer and are considered hallmarks of tumor cells. They participate in all stages of tumor progression, from initiation, invasion, growth and cellular proliferation to metastasis, by mediating: (1) cell communication and signal transduction through interactions with ligands and downstream messengers [2,3,4,5]; (2) intracellular/extracellular ion homeostasis, metabolic pathways and chemoresistance [6, 7]; and (3) cell survival, proliferation and apoptosis [8]. Tumors can utilize membrane protein-regulated mechanisms to employ both the immune system and nervous system in favor of cancer progression [9,10,11,12]. Thus, great efforts are devoted to elucidating the mechanistic pathways that tumors exploit in specific malignancies for immunotherapy developments [13,14,15].

The pathological involvement of membrane proteins is demonstrated by monitoring protein overexpression, whereas cancer-specific signatures are revealed by gene profiling [16, 17]. Correlation of their abundance with the clinical outcome of patients provides valuable insights into disease progression and prognosis [18, 19]. The research also helps to develop therapeutic strategies such as targeted drugs like monoclonal antibodies, nanocarrier drug delivery, and fluorescent tumor imaging in surgery. However, although gene patterns can reveal the significance of respective proteins in each pathology, functional studies at the molecular level are required to illuminate mechanistic processes [4].

The binding of membrane proteins with endogenous ligands and subsequent signaling are essential to explaining their functions in cancer-related biological processes [20, 21]. Mainstream ligand identification methods include radio-ligand binding, calcium flux, GTPγ binding, and cAMP modulation, by exposing transcribed cells to synthetic compound libraries and observing cell activation profiles [22]. These indirect efforts are limited by the system complexity and knowledge of downstream pathways [23]. Alternative computational strategies use homologous mapping across species [24,25,26] or virtual screening [27] to predict interactions in different types of membrane proteins [2]. However, subsequent experimental verifications are required.

The major obstacle to structure determination, ligand identification and mechanism studies of membrane proteins is their hydrophobicity and tendency to aggregate in aqueous solutions [28, 29]. Common stabilization methods such as detergent screening or nanodiscs require arduous individual efforts, and are difficult to push beyond research purposes [30]. The advent of AlphaFold2, a computational tool for protein structure prediction, partially resolved this issue [31, 32]. The deep-learning architecture uses co-evolution information and homologous crystal structures in the Protein Data Bank (PDB) to conduct accurate simulations. The program and its predicted structures for nearly all catalogued proteins with sequence information known to science are publicly available [33, 34].

Another experimental approach to circumvent such issues is a rational design tool that we previously devised, named the QTY code [35]. Water-soluble and functionally equivalent variants of native membrane proteins can be easily designed through pairwise amino acid substitutions [35, 36]. Specifically, in the transmembrane (TM) region, the hydrophobic residue Leucine (L) is substituted by the hydrophilic Glutamine (Q), Valine (V) and Isoleucine (I) are both substituted by Threonine (T), and Phenylalanine (F) is substituted by Tyrosine (Y). The methodology was demonstrated first on chemokine receptors [35], and later used to elucidate the structural basis of their ligand recognition and regulatory role in vivo [35, 36]. Additional bioinformatic studies were conducted which applied this protocol to different classes of membrane proteins [32, 37, 38]. It is proposed that these detergent-free membrane proteins can be adopted to conduct screening in solution for ligand identification from a biophysicochemical aspect.
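The substitution rule can be sketched in a few lines of Python. This is only an illustration of the L/I/V/F to Q/T/T/Y mapping restricted to TM segments; the function name `apply_qty`, the toy sequence and the TM range are hypothetical and are not taken from the database or from the authors' design server.

```python
# Minimal sketch of the QTY substitution rule (illustrative, not the authors' server code).
# The toy sequence and TM range below are placeholders, not real database entries.

QTY_MAP = {"L": "Q", "I": "T", "V": "T", "F": "Y"}


def apply_qty(sequence, tm_ranges):
    """Return a copy of `sequence` with L/I/V/F replaced inside TM ranges only.

    tm_ranges: list of (start, end) pairs, 0-based, end exclusive.
    Residues outside the listed ranges are left untouched.
    """
    residues = list(sequence)
    for start, end in tm_ranges:
        for i in range(start, end):
            residues[i] = QTY_MAP.get(residues[i], residues[i])
    return "".join(residues)


toy_sequence = "MKDLLIVFAGLSE"   # hypothetical 13-residue fragment
toy_tm = [(3, 11)]               # hypothetical TM segment
print(apply_qty(toy_sequence, toy_tm))  # MKDQQTTYAGQSE
```

Residues outside the supplied ranges are never modified, which mirrors the restriction of the substitutions to the TM region.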

To date, despite extensive efforts to establish a membrane protein-mediated network of human cancers [2, 4, 39], there is not yet a database that provides essential reference information for cancer-related research with respect to the understanding of protein functions and molecular mechanisms. A systematic correlation between membrane proteins and tumor pathogenesis is still lacking beyond the cancer-specific signatures revealed by gene profiling. Here we present CrMP-Sol (Cancer-related Membrane Protein and Solubilization database), which is dedicated to connecting molecular characteristics and biological functions of membrane proteins to their participation in cancer pathology, while presenting water-soluble designs to facilitate native membrane protein research.

The database contains 1309 entries related to 17 types of cancers, which were classified into 7 categories and plotted in 3D space using machine learning algorithms based on extraction of key functional descriptions. The spatial distribution can be used to predict non-obvious relations between adjacent proteins and specific pathogenesis through common mechanisms beyond genetic-level analysis. The QTY code was applied to all 1309 proteins in the database to produce water-soluble designs that facilitate native membrane protein studies in spite of natural hydrophobicity. Five exemplary proteins from different categories and with varying numbers of TM helices were used for feasibility demonstration. The QTY variants exhibited highly similar characteristics and superimposed well structurally with native proteins, in addition to showing enhanced hydrophilicity and stability. Beyond the scope of prior works, we performed comparative analysis on molecular dockings of native and QTY variant proteins against native ligands that might be involved in different pathogeneses. The docking showed slightly altered poses and closely matched binding energies. Channel-forming proteins exhibited the best agreement in geometry and hydrogen bonding sites. For binding pairs with significant changes in conformations and binding energies, molecular dynamics (MD) simulations revealed that decreased hydrophobic interactions account for the differences.

Our database provides essential information to connect and predict correlations between membrane protein functions and cancer types. The unraveling of hidden relations encoded within biomolecular processes and mechanistic pathways in specific malignancies can shed light on new research directions not apparent from gene-level analysis. The water-soluble designs are also presented in our database as an experimentally feasible solution to facilitate subsequent research, by offering physical simulators of native membrane proteins. Verification and regulation of these potentially indispensable biological processes can not only provide new scientific insights into the initiation and progression of diseases, but also benefit corresponding therapeutic developments and other biotechnological applications.

Results

CrMP-Sol database

Information on cancer-related membrane proteins at the genetic level is based on a previous transcriptome study, which is available on The Human Protein Atlas (HPA, https://www.proteinatlas.org/) [40,41,42]. Out of 20,090 entries in the database, 11,279 of the proteins are associated with cell membranes [43], of which 1309 proteins are clinically relevant to 17 types of cancers, including: colorectal cancer, endometrial cancer, melanoma, renal cancer, liver cancer, testis cancer, pancreatic cancer, glioma, thyroid cancer, prostate cancer, cervical cancer, lung cancer, urothelial cancer, breast cancer, head and neck cancer, stomach cancer, and ovarian cancer [41]. We classified these entries into 7 categories based on descriptions of their functions, which included 327 receptors, 161 transporters, 44 carriers, 124 channels, 201 enzymes, 109 contact proteins, and 344 others lacking apparent functional classifications. Other information about gene and protein expression and distribution in organs, cell lines, immune cells and blood is also available in the database [43].

Besides pathogenesis data, critical genetic and molecular information regarding the protein functions is also presented in CrMP-Sol, drawing on NCBI (National Center for Biotechnology Information), UniProt and PDB. Genetic information consists of gene name, location, a summary of the gene encoding the protein, and open-source links. Molecular information includes name, primary sequence, subcellular locations, crystal and AlphaFold2 predicted structures, and descriptions of experimentally verified or proposed protein functions. The tissue and pathogenesis specificity are also presented.

As a core feature of our database, we designed water-soluble variants of all 1309 membrane proteins by QTY code [44]. Specifically, the primary sequences of these QTY variants, AlphaFold2 predicted structures, and superimpositions with native proteins are presented. It is proposed that these easy-to-synthesize, cost-efficient, more hydrophilic structural and functional equivalents of naturally hydrophobic proteins can accelerate molecular and mechanistic study of the latter to facilitate the development of cancer treatments. These novel water-soluble variants of membrane proteins may also themselves be adopted in therapeutic applications [45].

Classification and visualization of protein-cancer types

To intuitively establish correlations between protein functions and cancer specificities, we encoded data entries with functional descriptions and visualized them in 3D space. The TF-IDF (term frequency-inverse document frequency) algorithm was adopted to extract keywords based on their relative frequency of appearance in each description compared to the whole database, to distinguish minor functional differences between proteins [46]. Words not directly related to protein functions, such as PubMed IDs, were manually removed. As the most important hyperparameter for TF-IDF, the number of max features (MF) is adjustable in the interface, with cut-offs between 50 and 250 words and a step size of 50. This step allows users to choose either the most important or more inclusive descriptions of protein functions for tailored classifications, without making the data matrix inefficiently large.
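As a rough illustration of how a max-features cut-off shapes the encoded matrix, the sketch below builds a TF-IDF matrix with the Python standard library only. It uses the textbook weighting (term frequency times log(N/df)) and keeps the MF most frequent terms; the exact weighting, tokenization and library used for the database are not stated in this section, so the function `tfidf_matrix` and the toy descriptions are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tfidf_matrix(descriptions, max_features):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF weight of term j in text i.

    Textbook form: tf = count / document length, idf = log(N / document frequency).
    Only the `max_features` most frequent corpus terms are kept as columns.
    """
    tokenized = [re.findall(r"[a-z]+", text.lower()) for text in descriptions]
    corpus_counts = Counter(token for tokens in tokenized for token in tokens)
    # Rank by corpus frequency, breaking ties alphabetically for reproducibility.
    ranked = sorted(corpus_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocabulary = [term for term, _ in ranked[:max_features]]
    n_docs = len(tokenized)
    doc_freq = {t: sum(1 for tokens in tokenized if t in tokens) for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty descriptions
        rows.append([(counts[t] / total) * math.log(n_docs / doc_freq[t])
                     for t in vocabulary])
    return vocabulary, rows


# Hypothetical functional descriptions, not database text.
toy_descriptions = [
    "receptor binds ligand",
    "ion channel transports ion",
    "receptor kinase signaling",
]
vocab, matrix = tfidf_matrix(toy_descriptions, max_features=4)
print(vocab)           # ['ion', 'receptor', 'binds', 'channel']
print(len(matrix), len(matrix[0]))  # 3 4
```

Raising `max_features` adds rarer terms as extra columns, which is the trade-off between concise and inclusive descriptions mentioned above.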

A 1309 × MF matrix was then established to represent the protein × function information. The UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) algorithm was adopted to reduce the dimension of the encoded data while preserving its global structure, so that it could be visualized in a 3D coordinate system (Fig. 1A). In this low-dimensional space, protein classifications are denoted by different colors, while halos around a single datapoint represent cancer types. The distant purple cluster at the top-left corner represents entries currently without functional descriptions. The interactive graph is the front page of our database, where users can select a single datapoint to access the detailed information page. The interface also allows the selection and highlighting of each protein category, or of proteins associated with one or several types of cancers (Fig. 1B–D). The feature provides information on the membrane proteins or critical mechanistic processes adopted by different pathologies in each category.

Beyond the apparent information that the same types of proteins exhibit relative clustering in 3D space, we hypothesize that the graph also reveals functional connections encoded by dimension reduction. It is likely that adjacently positioned proteins have a higher chance of participating in functionally relevant pathways contributing to the same pathology, whether or not they exhibit concurrent profiling in the gene analysis. For instance, when “receptor” and “glioma” were selected, we found that datapoint EPHA7 (Ephrin type-A receptor-7) was not overexpressed at the gene level, but was in close proximity to several receptors all associated with the cancer (Fig. 1E). Literature review indeed suggested its relation to malignant glioma despite genetic analysis labeling it as irrelevant [47]. Similarly, LPCAT1 is adjacent to five enzymes related to liver cancer. Its expression was found to enhance the phosphatidylcholine level in hepatocellular carcinoma tissues, which promoted cellular proliferation, migration, and invasion [48]. On the other hand, CERS3 (Ceramide synthase 3) resides in a wide pocket of 9 proteins related to glioma, liver cancer, or both (Fig. 1G). Despite its normal transcription level in either pathology, a recent study found the enzyme to affect invasion and metastasis of hepatocellular carcinoma via the SMAD6 gene [49], whereas it also regulates AKT/ERK1/2 signaling critical for angiogenesis of glioblastoma [50]. Furthermore, as shown in Fig. 1H, there are three other liver cancer-related transporters adjacent to SLC34A2 (Solute carrier family 34 member), while the knockdown of the latter was also found to inhibit hepatocellular carcinoma cell proliferation and invasion [51]. The overall reliability of such predictions will need more extensive evaluation based on data mining and preferably dedicated experimental validation. Yet the few examples presented here already show the prospect of integrating functional information beyond genetic-level analysis into the clusters of proteins with correlation to pathologies.
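The neighborhood reasoning used in these examples can be mimicked with a simple nearest-neighbor tally. The sketch below is a hypothetical illustration, not a feature of the database: it takes 3D coordinates and known cancer labels, and counts the labels carried by the k proteins closest to a query point. All names, coordinates and labels are invented placeholders.

```python
import math
from collections import Counter


def neighbor_cancer_votes(coords, labels, query, k=3):
    """Count cancer labels among the k nearest neighbors of `query`.

    coords: dict mapping protein name -> (x, y, z) in the embedded space.
    labels: dict mapping protein name -> set of associated cancer types.
    """
    distances = sorted(
        (math.dist(coords[query], point), name)
        for name, point in coords.items()
        if name != query
    )
    votes = Counter()
    for _, name in distances[:k]:
        votes.update(labels.get(name, ()))
    return votes


# Placeholder embedding and labels (not values from the database).
toy_coords = {
    "query_protein": (0.0, 0.0, 0.0),
    "A": (0.1, 0.0, 0.0),
    "B": (0.0, 0.2, 0.0),
    "C": (0.1, 0.1, 0.1),
    "D": (5.0, 5.0, 5.0),
}
toy_labels = {
    "A": {"glioma"},
    "B": {"glioma"},
    "C": {"glioma", "liver cancer"},
    "D": {"breast cancer"},
}
print(neighbor_cancer_votes(toy_coords, toy_labels, "query_protein", k=3))
```

A high count for one cancer type among close neighbors only flags a candidate association; as noted above, it still requires literature or experimental confirmation.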

QTY design and property comparisons

The design of water-soluble variants likely provides mechanistic insights into native membrane proteins and accelerates therapeutic developments, as has been demonstrated before [36, 52]. Thus, we conducted QTY design on all 1309 cancer-related membrane proteins in the database. The L, I, V, F residues in the TM region of native proteins were replaced by Q, T and Y accordingly in the designs (with T replacing both I and V). The process was conducted using a previously established automated online PSS server [44].

Since we cannot present all designed sequences in one paper, five proteins of different categories with varying numbers of TM helices were selected as exemplary demonstrations, including MGAT3 (Monoacylglycerol O-Acyltransferase 3), GPR35 (G protein-coupled receptor 35), GPR37 (G protein-coupled receptor 37), SLC10A1 (Solute carrier family 10 member 1), and NPC1L1 (Hepatic Niemann-pick C1-like 1). MGAT3 is a 3TM enzyme commonly expressed in the gastrointestinal tract that catalyzes the synthesis of 1,2-diacylglycerol from 2-monoacylglycerol and has a role in dietary fat absorption [53]. It is relevant to colorectal cancer, liver cancer and stomach cancer. Both GPR35 and GPR37 belong to the G-protein coupled receptor family with 7TM helices. They regulate osteogenesis via the Wnt/GSK3β/β-catenin pathway [54], or bind prosaptide to enhance ERK signaling and inhibit cAMP levels [55]. GPR35 is related to colorectal cancer, pancreatic cancer and stomach cancer, while GPR37 is related to glioma, melanoma and liver cancer. SLC10A1 is an 8TM solute carrier co-transporter primarily localized in hepatocytes, and plays a key role in bile acid extraction and biliary excretion from portal blood [56]. The protein hosts hepatitis B virus infection and is associated with liver cancer [57]. NPC1L1 is a large 13TM polytopic sterol transporter localized at the apical membrane of enterocytes and the canalicular membrane of hepatocytes [58]. It serves as a critical mediator for cellular cholesterol uptake and is involved in liver cancer, pancreatic cancer and stomach cancer [59].

Sequence alignments of QTY designed water-soluble proteins and their native counterparts are shown in Fig. 2. Individual optimizations were not conducted for this mass-design process. QTY substitutions were applied to all corresponding residues only in the TM region, but not those in extracellular domains and intracellular domains.

The protein characteristics were calculated and compared in Table 1. Despite significant QTY substitutions on LIVF residues in TM regions (~ 48–54%), the isoelectric point (pI) and molecular weight (MW) of QTY proteins are quite similar to those of native proteins. This is because, although Q, T and Y can induce the formation of intra-, inter- and solvent-exposed hydrogen bonds, they do not carry additional charges. The substitutions enhance the protein solubility while retaining its overall integrity without introducing additional disruptive electrostatic interactions. The alteration of hydrophobicity in the helical region of membrane proteins without changes in steric and electrostatic interactions is the essence of the QTY code. The slight MW increase is due to the introduction of hydroxyl groups in the respective residues.
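The reasoning about charge and mass can be checked with a small calculation. The sketch below counts ionizable residues (D, E, K, R, H) in a native and a QTY sequence and sums the mass change of each substitution using standard average residue masses. The sequences are hypothetical placeholders and the helper `compare_variants` is not part of the published workflow; note that individual substitutions shift the mass in different directions (I to T lowers it, L to Q and F to Y raise it), so the net change depends on composition.

```python
# Average residue masses (Da) for the seven residues touched by the QTY code.
RESIDUE_MASS = {
    "L": 113.159, "I": 113.159, "V": 99.133, "F": 147.177,
    "Q": 128.131, "T": 101.105, "Y": 163.176,
}
CHARGED = set("DEKRH")  # side chains that can carry a charge


def compare_variants(native, variant):
    """Summarize substitutions between two equal-length sequences.

    Only L/I/V/F -> Q/T/Y changes are expected; other changes raise KeyError.
    """
    if len(native) != len(variant):
        raise ValueError("sequences must have equal length")
    changed = [(a, b) for a, b in zip(native, variant) if a != b]
    mass_shift = sum(RESIDUE_MASS[b] - RESIDUE_MASS[a] for a, b in changed)
    return {
        "substitutions": len(changed),
        "mass_shift_da": round(mass_shift, 3),
        "charged_native": sum(r in CHARGED for r in native),
        "charged_variant": sum(r in CHARGED for r in variant),
    }


# Hypothetical fragment and its QTY counterpart (placeholders).
print(compare_variants("MKDLLIVFAGLSE", "MKDQQTTYAGQSE"))
```

Because none of Q, T or Y is ionizable in this simple count, the number of charged residues is identical in both sequences, which is consistent with the near-unchanged pI described above.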

Table 1 Characteristics of native membrane proteins and their water-soluble QTY variants

Superimpositions of AlphaFold2 predicted structures of native and QTY cancer-related membrane proteins

The structural similarity between QTY designed MGAT3, GPR35, GPR37, SLC10A1, NPC1L1, and their native counterparts was demonstrated by comparing AlphaFold2 predicted structures. The predicted structures were validated with the ProSA web tool and reported as z-score values [60]. Lower z-scores correspond to higher model validity, and the predicted structures of native and QTY variants generally exhibited closely matched z-score values (Additional file 1: Table S1). As shown in Fig. 3, predicted structures for native and QTY proteins superimposed very well. Both side views and top views of the superimpositions are shown. Despite > 48% changes in TM sequences, the RMSDs (root mean square deviations) between each native protein and its QTY variant are < 1.5 Å, suggesting very high conformational similarities. Specifically, RMSDs for MGAT3 versus MGAT3^QTY, GPR35 versus GPR35^QTY, GPR37 versus GPR37^QTY, SLC10A1 versus SLC10A1^QTY, and NPC1L1 versus NPC1L1^QTY are 0.157 Å, 1.478 Å, 1.216 Å, 1.233 Å, and 0.656 Å, respectively. The corresponding TM region RMSDs for the same five pairs are 0.309 Å, 1.044 Å, 0.899 Å, 0.544 Å, and 0.603 Å, respectively. The generally lower TM region RMSDs were attributed to the exclusion of intrinsically flexible loop domains, which contribute more to the overall RMSDs; this further demonstrated the applicability of the QTY methodology to TM helices without structural alterations [61, 62].
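To make the difference between whole-chain and TM-region RMSD concrete, the sketch below computes RMSD over paired coordinates, optionally restricted to a subset of residue indices. It assumes the two structures have already been superimposed (no alignment step is performed) and uses invented coordinates; the function `rmsd` and the index list are illustrative assumptions rather than the tools used for Fig. 3.

```python
import math


def rmsd(coords_a, coords_b, indices=None):
    """Root mean square deviation between paired 3D points.

    Assumes the structures are already superimposed. If `indices` is given,
    only those positions (e.g. TM residues) enter the calculation.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be paired")
    indices = list(range(len(coords_a)) if indices is None else indices)
    squared = sum(math.dist(coords_a[i], coords_b[i]) ** 2 for i in indices)
    return math.sqrt(squared / len(indices))


# Placeholder C-alpha traces: identical except for one displaced "loop" residue.
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
variant = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 2.0)]
tm_indices = [0, 1, 2]  # hypothetical TM positions

print(rmsd(native, variant))              # 1.0
print(rmsd(native, variant, tm_indices))  # 0.0
```

In this toy case a single displaced residue outside the selected indices accounts for the entire deviation, illustrating how flexible loops can dominate a whole-chain RMSD.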

Although we cannot show superimpositions of all 1309 membrane proteins in this article, the RMSDs between native and QTY variants, along with MW and secondary structure changes, are summarized and plotted in Fig. 3F. Most redesigned proteins exhibit RMSD values < 10 Å, with the densest distribution below 5 Å. The outliers are relatively darker in color, suggesting their higher MWs and more complex structures. Moreover, only a few designs fall outside the ± 45° sectors in the graph, while most datapoints reside close to the horizontal line. This suggests that most native and QTY variant proteins share similar secondary structures.

Hydrophobicity analysis of native and QTY cancer-related membrane proteins

To computationally evaluate the solubilization efficacy of cancer-related membrane proteins, we conducted bioinformatic simulations on surface hydrophobic patches of both native and QTY variant proteins. Due to the proteins being naturally embedded in the phospholipid bilayer, native proteins were surrounded by nonpolar residues at the exterior of TM helices, which represents the majority of water-repelling surfaces as colored yellow in Fig. 4A–E (top). After the QTY code was applied, the hydrophobic patches (bottom) have notably decreased compared to their native counterparts, indicating an enhanced capability for water molecule interactions in the QTY variants.

A distribution map containing hydrophobicity information for all 1309 membrane proteins is shown in Fig. 4F. R_H corresponds to the ratio of α-helical content in the protein, while H_Ƴ represents the hydrophobicity calculated using ProPAS. As expected, more significant decreases in hydrophobicity are observed for proteins with higher TM helical contents, which were the targets of the QTY design with amino acid substitutions. On the other hand, by comparing the color distribution of circles (native proteins) and diamonds (QTY proteins), slight increases in T_m (melting temperature) were predicted for solubilized proteins using a sequence-based method, indicating relatively higher protein stability [63]. Though accurate T_m values will require experimental determination, the predicted trend agrees with previous experimental findings [36]. Since water-solubility and structural stability are interconnected characteristics, it is possible that by designing more soluble proteins, we also provide a plausible method for their stabilization, which has both theoretical and practical significance [64].
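A sequence-level view of the hydrophobicity shift can be obtained with the Kyte-Doolittle hydropathy scale, averaged over the sequence (the GRAVY index). This is a stand-in chosen for illustration: the section does not specify the scale that ProPAS applies, and the two sequences below are hypothetical placeholders.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value over all residues."""
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


native_fragment = "MKDLLIVFAGLSE"   # hypothetical native fragment
qty_fragment = "MKDQQTTYAGQSE"      # same fragment after L/I/V/F -> Q/T/T/Y

print(round(gravy(native_fragment), 3))  # 1.115
print(round(gravy(qty_fragment), 3))     # -1.662
```

Each of the four substitutions replaces a strongly positive hydropathy value with a negative one, so the more L, I, V and F residues a helix contains, the larger the drop, in line with the trend described for proteins with higher TM helical content.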

Molecular docking of native and water-soluble cancer-related membrane proteins

Preliminary functional comparison of native and water-soluble variants of cancer-related membrane proteins was conducted by docking their known ligands into predicted binding sites. The examination of computed binding geometries contributed to the understanding of molecular interactions from both conformational and compositional aspects [65]. We continued using the five exemplary proteins as in previous tasks. Both small molecule ligands and protein binders were checked. Specifically, we conducted molecular dockings for the following binding pairs: MGAT3 versus DAG (diacylglycerol), 2-MAG (2-monoacylglycerol) and oleoyl-CoA; GPR35 versus cGMP, kynurenic acid, lysophosphatidic acid, pamoic acid and Zaprinast; GPR37 versus neuroprotectin D1, Osteocalcin and Saposin C; SLC10A1 versus bile acid, estrone sulfate, GCDC (glyco-chenodeoxycholic acid) and taurocholate; NPC1L1 versus cholesterol. Amongst the listed ligands, Osteocalcin and Saposin C are protein binders, whilst all others are small molecule ligands.

The binding pockets were predicted by PrankWeb for both native and QTY variant proteins [66]. Rational considerations were used to select a model from the top 3 predictions. For MGAT3, GPR35 and SLC10A1, the highest scoring pockets were selected for subsequent docking. For GPR37, however, pockets 1 and 2 of the native protein and pocket 1 of the QTY protein were predicted at the C-terminus; thus pocket 3 of the native protein and pocket 2 of the QTY protein, both residing at the N-terminus, were used for docking. NPC1L1 mediates cholesterol uptake by transporting it across the membrane, which involves the interaction of cholesterol with TM channels. While the 4 highest scoring pockets all resided in the extracellular region far from the phospholipid membrane and were most likely relevant to interaction with cholesterol, we intentionally selected pocket 3 for both native and QTY variants near the N-terminal entrance of the TM channel to elucidate the impact of the QTY design on the cross-lipid transportation. As shown in Fig. 5, predicted binding pockets generally agreed well between native and QTY variant proteins, providing a basis for similar binding interactions.

Dockings between protein models and respective ligands were performed using AutoDock Vina [67]. Simulations for each protein–ligand pair were repeated at least three times to generate a reliable docking conformation and statistically meaningful binding energies. As shown in Fig. 6A–E, despite significant amino acid changes in TM regions, the binding between proteins and their respective ligands on the QTY variants generally occurred at locations closely matching those on the native protein. However, slight docking conformation differences were observed due to the inevitable changes to local environments, with some hydrogen bonds altered at new sites. These alterations can be attributed to interference from increased numbers of polar residues, which previously did not exist in the TM helices. Extensive internal hydrogen bond networks in QTY proteins may also lead to significant changes in ligand binding poses, as shown in MGAT3:2-MAG, MGAT3:oleoyl-CoA, and GPR35:pamoic acid. The orientations of the ligands were inverted, as previously outward-facing hydrophilic segments of the molecules were drawn in by the polar core of QTY proteins, leaving hydrophobic segments to face the solvent uncompensated. Such changes might not only impose additional energy penalties in docking, but also possibly negate the function associated with the binding events, such as the catalytic function in MGAT3. On the other hand, the channel-forming proteins, namely SLC10A1 and NPC1L1, exhibited higher agreement in both the ligand docking poses and the interaction sites between the native and QTY variants, with the best-performing pair being NPC1L1:cholesterol. Almost identical poses and identical hydrogen bond formations were observed. It was deduced that the presence of high aspect ratio TM channels was likely to guide the binding and orientation of respective ligands. The transporting function was also most likely retained despite significant changes in amino acid sequences.

Table 2 summarizes the calculated binding energy (kcal/mol) for each protein–ligand pair obtained from AutoDock Vina. In general, QTY variant proteins showed slightly decreased binding energies as compared to their native counterparts, but the values were still close. The trends agreed well with our previous experimental results that QTY proteins generally exhibited very slightly lower binding affinities compared to native proteins [35, 36, 45]. They were also supported by docking pose observations, where both native and QTY variants bound to respective ligands in similar manners, despite the more complex internal hydrogen bond networks of the latter being slightly unfavorable towards intermolecular interactions. Amongst all pairs, GPR35:pamoic acid exhibited the largest binding energy discrepancy of 2.0 kcal/mol. An alternative route was taken to evaluate this binding pair, where AlphaFold-Multimer was employed to predict the GPR35/Gα complex structure and establish a model for subsequent docking (Additional file 1: Fig. S1) [68]. Almost identical docking positions and orientations were observed for the complex model (Additional file 1: Fig. S2) and those presented in Fig. 6B. Additional MD simulations on this binding pair will be presented in a later section. However, it should be noted that most of our docking computations did not consider the states of membrane proteins, complexation with downstream biomolecules such as G-proteins, and potential small molecule induced conformational changes. This might cause the simulated structures and calculated binding energies to deviate slightly from the actual binding states of ligands, which should be determined in subsequent crystallographic studies.
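Since each pair was docked repeatedly, a comparison like the one in Table 2 reduces to summarizing the repeats and taking the difference of the means. The sketch below shows that bookkeeping with the standard library; the energy values are arbitrary placeholders (not taken from Table 2), and the helper names are assumptions. In the docking score convention used here, more negative values indicate stronger predicted binding.

```python
from statistics import mean, stdev


def summarize_runs(energies):
    """Return (mean, sample standard deviation) of repeated docking energies."""
    return mean(energies), stdev(energies)


def energy_gap(native_runs, variant_runs):
    """Mean variant energy minus mean native energy, in kcal/mol.

    A positive gap means the variant is predicted to bind more weakly.
    """
    return mean(variant_runs) - mean(native_runs)


# Placeholder energies from three hypothetical repeats per protein (kcal/mol).
native_runs = [-7.0, -7.2, -7.1]
variant_runs = [-6.5, -6.7, -6.6]

print(summarize_runs(native_runs))
print(summarize_runs(variant_runs))
print(round(energy_gap(native_runs, variant_runs), 2))  # 0.5
```

Reporting the spread alongside the mean helps judge whether a gap between native and QTY variants exceeds run-to-run variation.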

Table 2 Binding energies for ligands versus native membrane proteins and their water-soluble QTY variants

Besides small molecule ligands, protein binders also play critical roles in the function of membrane proteins [61, 62]. We used the ZDOCK software to inspect the interactions of GPR37 versus Osteocalcin and Saposin C. The TM and intracellular regions were blocked for binding based on rational considerations. As shown in Fig. 7, the docking poses for each binder are quite similar in the native proteins and the QTY variants. Additional hydrogen bonds were observed at the head of TM helices due to the increased availability of polar sites. Hydrophilic interactions between binders and extracellular loops of GPR37 may form or disappear depending on conformational changes induced by either the design or the docking. However, one noteworthy consideration is that the pLDDT values of loop regions in AlphaFold2 predicted structures are generally low, suggesting their intrinsically disordered and flexible nature with higher energy states [69]. Thus, it is plausible that these regions may deform to accommodate stronger interactions during the binding events. We then recomputed the complexes of GPR37 with Saposin C and Osteocalcin using AlphaFold-Multimer, removed the respective binding partners, and redocked them to the extracellular regions of the receptor using ZDOCK. The models of native and QTY GPR37 against Saposin C still exhibited aberrant N-terminal loops with slightly different docking poses and hydrogen bond interactions (Additional file 1: Fig. S3A). Yet the models of native and QTY GPR37 against Osteocalcin showed closely matching docking poses and hydrogen bond interactions (Additional file 1: Fig. S3B). In general, similar molecular dockings between native and QTY proteins were observed in these simulations.

Molecular docking analysis of GPR35 versus pamoic acid

The docking poses of GPR35 versus pamoic acid in native and QTY variant proteins were notably different, and were associated with the largest binding energy change amongst all computed pairs. To further explain this phenomenon, we carried out MD simulations on both complexes using GROMACS and the CHARMM36 force field [70, 71]. The simulations were conducted for 50 ns to allow the full stabilization of both binding partners in the complexes (Additional file 1: Fig. S4).

The MM/GBSA approximation was employed to calculate the binding free energies for the stabilized complex structures [72]. As shown in Fig. 8A, the major energy terms that differed were ΔE_ele and ΔE_vdw, representing the electrostatic interaction energy and the non-bonded van der Waals interaction energy, respectively. The decreased contributions from both terms in the QTY protein may be attributed to the inverted docking poses and the more complex hydrogen bond network at the interface. These two factors combined led to a decreased binding energy between the protein and the ligand [36].
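For reference, the usual MM/GBSA decomposition in which these two terms appear can be written as follows. This is the standard textbook form, given here as an aid to reading Fig. 8A; the solvation symbols ΔG_GB and ΔG_SA and the comparison quantity ΔΔG_bind are notation assumed for illustration, and whether an entropy term was included in the reported values is not stated in this section.

```latex
\begin{align}
\Delta G_{\mathrm{bind}} &\approx \Delta E_{\mathrm{MM}} + \Delta G_{\mathrm{solv}} \;(- T\Delta S,\ \text{often omitted}) \\
\Delta E_{\mathrm{MM}}   &= \Delta E_{\mathrm{vdw}} + \Delta E_{\mathrm{ele}} \\
\Delta G_{\mathrm{solv}} &= \Delta G_{\mathrm{GB}} + \Delta G_{\mathrm{SA}} \\
\Delta\Delta G_{\mathrm{bind}} &= \Delta G_{\mathrm{bind}}^{\mathrm{QTY}} - \Delta G_{\mathrm{bind}}^{\mathrm{native}}
\end{align}
```

Under this decomposition, less favorable (less negative) ΔE_vdw and ΔE_ele in the QTY complex translate directly into a less favorable ΔG_bind unless they are offset by the solvation terms.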

The hypothesis was supported by the per-residue energy contribution graph shown in Fig. 8B. Despite a few stronger interaction sites (Tyr259, Leu80, Gln77), fewer residues contributed moderately in the QTY variant than in the native protein, which cumulatively led to a weaker interaction. The energy contributions from residues in the binding pockets (Fig. 8C) agreed with this interpretation: the decreases in hydrophobic residue contributions (Leu13, Phe163, Leu233, Leu237, Leu258) likely resulted from the outward-facing nonpolar region of the ligand in the QTY complex. Colored boxes denote energy contributions from sites subjected to QTY substitutions. Figure 8D summarizes the top unmodified interaction sites from the native and QTY proteins. The altered binding pose significantly changed the interaction sites in the complexes, and the exclusion of the hydrophobic side of the ligand from the interior of the TM helices, caused by the additional internal hydrogen bond network, likely played a critical role in this process. The observation for the GPR35:pamoic acid binding pair suggests that, although most QTY variants exhibit high structural similarity with their native counterparts, the sequence change can still have a notable impact on their interactions with certain binding partners and should be taken into consideration for task-specific designs.

Discussion

Transmembrane proteins are the input/output machinery of living organisms and perform an extensive variety of functions crucial to biological and pathological processes, including mechanistic pathways essential for the progression of various types of cancers [73]. They bear great importance in understanding tumor pathogeneses with implications for cancer treatments and patient prognosis [9,10,11,12, 74]. Many types of membrane proteins also contain well-defined binding pockets that may be directly adopted as targets for therapeutics and modern medicine [1, 75, 76].

Yet to date, the systematic correlation between membrane protein types and diseases has been established only at the genetic level, where gene profiling techniques were used to reveal overexpressed species in certain cancers [16, 17]. Understanding of molecular mechanisms and functional roles in association with specific pathogenesis is still lacking [4], primarily due to the inherent hydrophobicity of these proteins, the difficulty of expressing them in native conformations, and their instability ex vivo [23, 77]. The deep-learning based AlphaFold2 partially resolved the issue by providing highly accurate structure predictions for these hard-to-work-with protein species. Yet computed structures still need to be experimentally verified with subsequent mechanistic studies at the molecular level [31].

By establishing a dedicated cancer-related membrane protein database, our work advances current research in two aspects. Firstly, the machine-learning based correlation between protein functions, classifications and cancer types encodes essential molecular information contributing to the key mechanistic pathways in tumor progression. By reducing the high-dimensional matrix of all critical functional descriptions to 3 dimensions, the spatial distribution of datapoints may be used to predict previously inapparent relations between adjacent proteins involved in mechanistically connected pathways for specific pathogenesis, relations that are not directly revealed by genetic level analysis.

Secondly, to circumvent the difficulties in membrane protein study induced by hydrophobicity, we have used a rational design tool called the QTY code, which regulates protein solubility through pairwise amino acid substitutions [35]. The methodology was experimentally demonstrated on 12 types of membrane receptors including 7TM GPCRs [32, 35, 36, 45], with more types computationally designed and reported [32, 37, 38]. It was also adopted to identify essential structural domains for ligand binding and proteins’ regulatory roles in vivo [35, 36]. The water-soluble variants can greatly benefit the molecular understanding of native proteins by providing physical simulators of the latter, due to their structural and functional similarities. However, no crystal structure or ligand docking studies have been conducted to date, both of which would further demonstrate the QTY code’s applicability to facilitate membrane protein research.

We partially solved the problem by conducting ligand docking and molecular simulations in the current work. With 5 selected membrane proteins that differ in TM helices, classifications, and functions, we compared the bindings between native and QTY variants against known ligands. While all 5 examples exhibited high similarities in protein characteristics, AlphaFold2 predicted structures, PrankWeb predicted binding pockets, and slightly varied docking poses, some complexes showed notable changes in both ligand orientation and binding energies, including 2/3 complexes with MGAT3 and 1 complex with GPR35. By MD simulation, we found that the reduced hydrophobic interactions between ligand and QTY protein are accountable for the differences. It appeared that TM enzymes were most susceptible to such changes which might negate their catalytic functions. Receptors were slightly affected, while the structure and functions of channel-forming proteins were best retained with the QTY design. Our observations further suggested the applicability of QTY code on different classes of proteins where task-specific designs need to be taken into consideration for species more susceptible to the formation of internal hydrogen bond networks.

The work presented in this manuscript provides a bioinformatic guideline to determine whether or not a specific QTY design of a membrane protein should be adopted for experimental studies or applications. Superimpositions between the native and QTY variant proteins, as well as the corresponding RMSD values, are the primary factors to be considered. Designs with RMSD ≤ 2 Å are generally considered conformationally similar to their native counterparts and suitable for subsequent uses. Higher H_Ƴ change with nearly vertical lines (little R_H change, Fig. 4F) indicates superior design efficiency in enhancing protein solubility without changing its secondary structure; in combination, these are positive selection factors. The PrankWeb-predicted binding pocket is another factor to consider, but it does not necessarily determine whether a design should be pursued. The docking pose evaluations are typically conducted by end-users to evaluate the feasibility of ligand-specific applications.

However, there are still a few limitations in the current study that can be addressed to further improve our database and designs. Firstly, the extraction of keywords was processed with the classic TF-IDF algorithm, which was effective in completing the task but fell short in context analysis and lacked biological specificity. We plan to evaluate new language models on this task with extensive training on biology texts to optimize the representation of protein functions. In addition, to further validate the predicted function-based protein-cancer relations in our database, a large language model-based algorithm can be built to conduct literature-wide search and validation. Secondly, the QTY designs in our database were conducted using the “simple design module” on the PSS server, which featured high efficiency but lacked customization for each protein. In combination with the above-mentioned large language model, we plan to further optimize the QTY design process for individual membrane proteins so that each design best retains its function in specific pathogenesis. MD simulations and resolving the crystal structures of QTY variant proteins beyond AlphaFold2 models will also further benefit both the understanding of these designs and their uses as physical simulators of the native proteins.

In summary, our database provides well-documented molecular information on membrane proteins and their expression in cancers. It pushes beyond the genetic level analysis to reveal undiscovered connections between proteins’ molecular functions and pathogenesis by machine-learning enabled predictions. QTY-code enabled water-soluble designs of membrane proteins are presented as an additional solution for the lack of information on membrane proteins. The variants can be experimentally adopted to facilitate ligand identification from a biophysicochemical aspect and mechanistic pathway studies of critical native proteins. They may also potentially serve as novel targets for immunotherapy in cancer treatments. The discovery, verification and modulation of novel cancer-related molecular mechanisms can not only benefit the scientific understanding of initiation and progression of specific malignancies, but also add tools that can help to conquer these diseases.

Methods

CrMP-Sol (Cancer-related Membrane Protein and Solubilization database)

The database is accessible at the Metagene platform of Zhejianglab (https://bio-gateway.aigene.org.cn/g/CrMP). The website requires registration but is free to use.

Data acquisition and protein classification

Functional descriptions of each protein were obtained from Uniprot (https://www.uniprot.org/) and associated with the corresponding entries. The classification of proteins was based on their names, keywords, and functional descriptions on the corresponding Uniprot pages. Protein entries lacking meaningful keywords and functional descriptions were assigned to the “other” category.

Keyword extraction of protein functions

TF-IDF was used for keyword extraction. We first performed data cleaning, using regular expressions to specify search strings in the protein function descriptions. PubMed IDs and punctuation marks were removed to reduce meaningless text during encoding. We then used the CountVectorizer function to extract text features from the proteins' functional descriptions. Common English stopwords such as articles and conjunctions were also removed during this process. The number of feature words can be adjusted by changing the 'max_features' parameter of this function. Subsequently, we used the TfidfTransformer function to encode the descriptions into a [1309 × max_features] matrix.
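The cleaning and encoding steps can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the pipeline used for the database: the `PubMed:` tag pattern, the short stopword list, the two example descriptions, and the helper names `clean` and `tfidf` are all hypothetical. The smoothed IDF and L2 row normalisation follow the default formula documented for scikit-learn's TfidfTransformer.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real run would use a fuller English list.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "in", "to", "for", "as", "is", "by", "with"}

def clean(text):
    """Drop PubMed tags and punctuation, lower-case, remove stopwords."""
    text = re.sub(r"PubMed:\d+", " ", text)      # assumed citation tag format
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # strip punctuation marks
    return [w for w in text.lower().split() if w not in STOPWORDS]

def tfidf(docs, max_features=None):
    """Return (vocabulary, matrix) with smoothed IDF and L2-normalised rows."""
    tokens = [clean(d) for d in docs]
    totals = Counter(w for doc in tokens for w in doc)
    # Keep only the most frequent terms, as a max_features cap would.
    vocab = sorted(w for w, _ in totals.most_common(max_features))
    n = len(docs)
    df = {w: sum(w in doc for doc in tokens) for w in vocab}
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    matrix = []
    for doc in tokens:
        counts = Counter(doc)
        row = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        matrix.append([v / norm for v in row])
    return vocab, matrix

# Hypothetical functional descriptions, for illustration only.
docs = [
    "Receptor for bile acids (PubMed:12345678). Mediates signaling.",
    "Transporter of bile acids and sodium.",
]
vocab, matrix = tfidf(docs, max_features=5)
```

Each row of `matrix` corresponds to one protein description, so applying the same procedure to all 1309 entries would give a matrix with 1309 rows and one column per retained feature word.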

Dimension reduction and visualization

The UMAP algorithm was used to perform dimension reduction on the encoding matrix above. The parameters were set as follows: n_neighbors = 10, n_components = 3, min_dist = 0.5, metric = 'correlation', random_state = 16. A [1309 × 3] matrix was obtained as the final output. Protein classifications and related cancer types were added as labels to this matrix. The interactive visualization in the 3D coordinate system was achieved using three.js (https://threejs.org/).
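The reduction step might be set up as in the sketch below, which assumes the third-party umap-learn package. Only the five parameter values come from the text; the function names `reduce_to_3d` and `attach_labels` and the record layout are hypothetical.

```python
# Parameter values as reported in the text; everything else is illustrative.
UMAP_PARAMS = {
    "n_neighbors": 10,
    "n_components": 3,
    "min_dist": 0.5,
    "metric": "correlation",
    "random_state": 16,
}

def reduce_to_3d(tfidf_matrix):
    """Project an (n_proteins x n_features) matrix to 3 coordinates per protein."""
    import umap  # provided by the umap-learn package, imported only when needed
    reducer = umap.UMAP(**UMAP_PARAMS)
    return reducer.fit_transform(tfidf_matrix)

def attach_labels(coords, classes, cancers):
    """Pair each 3D point with its protein class and related cancer types."""
    return [
        {"xyz": tuple(float(c) for c in xyz), "class": cls, "cancers": list(can)}
        for xyz, cls, can in zip(coords, classes, cancers)
    ]
```

The labelled records returned by `attach_labels` are one possible input format for a 3D scatter plot; the format actually passed to three.js is not described in the text.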

QTY code design

QTY code design on all 1309 membrane proteins was conducted using a server we have previously established (https://pss.sjtu.edu.cn/) [44]. FASTA sequences of each entry in the dataset were obtained from Uniprot using custom Python code. The sequences were then converted into their soluble versions following the principles of the QTY method: all hydrophobic L, I and V, and F residues in the denoted TM domains were substituted pairwise by Q, T, and Y, respectively. The start and end positions of each TM helix were extracted from the topological domain section of the Uniprot database. Automated design was then conducted using the “simple design module” on the server.
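The substitution rule itself is simple enough to illustrate in a few lines. The sketch below is not the server's implementation, which may apply additional rules. The function name `qty_design`, the toy sequence, and the TM boundaries are hypothetical, and the regions are assumed to be 1-based and inclusive, in the style of Uniprot annotations.

```python
# Pairwise substitutions of the QTY code: L -> Q, I/V -> T, F -> Y.
QTY_MAP = {"L": "Q", "I": "T", "V": "T", "F": "Y"}

def qty_design(sequence, tm_regions):
    """Apply QTY substitutions inside the given TM regions only.

    tm_regions: list of (start, end) pairs, 1-based and inclusive.
    """
    residues = list(sequence)
    for start, end in tm_regions:
        for i in range(start - 1, min(end, len(residues))):
            residues[i] = QTY_MAP.get(residues[i], residues[i])
    return "".join(residues)

# Hypothetical toy sequence with one TM helix annotated at residues 5-12.
native = "MKLSLIVFALGAFKL"
variant = qty_design(native, [(5, 12)])
print(variant)  # MKLSQTTYAQGAFKL
```

Residues outside the annotated helix (for example L3 and F13 in the toy sequence) are left unchanged, which is why accurate TM boundaries matter for the design.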

Sequence alignment and property calculation

The native protein sequences of cancer-related membrane proteins and their QTY variants were aligned using the same methods as described previously [32, 38]. The ExPASy website (https://web.expasy.org/protparam/) was used to calculate the MW and pI values of the proteins.

Structure prediction and superimposition

AlphaFold2 was used to predict structures for all cancer related membrane proteins in QTY forms, the service of which is freely provided by Zhejiang Gene Computation Platform (https://cloud.aigene.org.cn/). The predicted structures for native proteins were directly obtained from Uniprot as provided by the European Bioinformatics Institute (https://alphafold.ebi.ac.uk). Structure files for 5 selected proteins were then downloaded and superimposed using PyMOL with RMSD calculated. A Python script was programmed to calculate the RMSD values in batch with PyMOL 2.4.1.
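For reference, the RMSD reported after superimposition is the root of the mean squared distance between paired atoms. The sketch below computes it for coordinates that are assumed to be already superimposed and paired one-to-one. It is not the PyMOL batch script mentioned above, and the coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD in Å between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Hypothetical Cα coordinates of two already superimposed structures.
native_ca = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
qty_ca = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
value = rmsd(native_ca, qty_ca)
print(round(value, 3))  # 1.0
```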

The secondary structure of proteins was predicted using DSSP software [78], and the percentage change in helical content was normalized to a polar coordinate system on a 180° scale. Proteins with pI > 7 were placed above the horizontal line and those with pI < 7 were placed below it. Datapoints were color-coded by protein MW and placed according to the respective RMSD between the two protein variants.

Hydrophobicity prediction

The surface hydrophobic patch was visualized using a script developed by Hagemans et al. for highlighting with the YRB scheme [79]. The standalone software ProPAS was used for the prediction of the protein features including pI, MW, and hydrophobicity [80]. The T_m value was calculated using T_m Predictor localized software with the default T_m reference matrix [63].

Ligand docking comparison

The PrankWeb server (https://prankweb.cz/) was used to predict the binding pockets of native and QTY versions of 5 exemplary proteins based on their AlphaFold2 predicted structure models. Predictions were ranked based on their scores and selected from the top 3 candidates for docking analysis on a rational basis.

The structures of small-molecule ligands were downloaded from the PubChem website (https://pubchem.ncbi.nlm.nih.gov/) and converted into .pdb files using OpenBabel. GCDC was extracted from the complex structure in PDB entry 7ZYI. After preprocessing of the ligand and protein (adding polar hydrogen atoms and torsions), the docking runs were performed with AutoDock Vina using the PrankWeb-predicted pocket center and box dimensions between 15 and 25 Å.
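A docking box of this kind is usually passed to AutoDock Vina through a plain-text configuration file. The helper below is a hedged sketch of how such a file could be generated: the function name `vina_config`, the file names, and the pocket center are hypothetical, and only the 15 to 25 Å range for the box edges comes from the text.

```python
def vina_config(receptor, ligand, center, size):
    """Build the text of an AutoDock Vina configuration file."""
    if not all(15.0 <= edge <= 25.0 for edge in size):
        raise ValueError("box edges are expected to lie between 15 and 25 Å")
    cx, cy, cz = center
    sx, sy, sz = size
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"center_x = {cx}",
        f"center_y = {cy}",
        f"center_z = {cz}",
        f"size_x = {sx}",
        f"size_y = {sy}",
        f"size_z = {sz}",
    ]
    return "\n".join(lines) + "\n"

# Hypothetical file names and pocket center, for illustration only.
config = vina_config("native.pdbqt", "ligand.pdbqt", (1.5, -2.0, 10.25), (20.0, 20.0, 20.0))
print(config)
```

Generating one such file for the native protein and one for the QTY variant, each with the pocket center predicted for that structure, keeps the two docking runs directly comparable.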

Docking was performed at least 3 times for each protein–ligand pair. The top-ranking conformations that appeared 3 times were selected for presentation. The results were then visualized with PyMOL. Native proteins are colored green, and QTY proteins are colored cyan. The ligands are shown in yellow, and the hydrogen bonds are shown in magenta. Residues having polar contacts with ligands are shown as sticks, with labels displayed. Polar hydrogen atoms were added to all proteins.

Docking between GPR37 and protein binders was performed with ZDOCK 3.0.2 on Linux. The structure of Saposin C was obtained from the PDB (PDB ID: 2GTG), while those of Parkin (Uniprot ID: O60260) and Osteocalcin (Uniprot ID: P02818) were obtained by AlphaFold2 prediction. The large N-terminal region (residues 1–255) with very low pLDDT (< 50) was removed before docking. The intracellular loops and C-terminus of the proteins were blocked from docking simulations. The docking runs were conducted at a 6° rotational sampling density for maximal precision. The top 100 complexes with the highest scores were selected out of 54,000 generated poses. Docking complexes within the top 3 were inspected and selected for presentation. The docking results were visualized with PyMOL. Native proteins are colored green, and QTY proteins are colored cyan. The ligands are shown in yellow, and the hydrogen bonds are shown in magenta. Residues having polar contacts with ligands are shown as sticks, with labels displayed. Polar hydrogen atoms were added to all proteins.

MD simulation

MD simulations of the native and QTY variant GPR35 versus pamoic acid complexes were performed using GROMACS v2022.3 with the CHARMM36 force field. The protein topology files were generated by GROMACS, and the ligand topology files were generated with the CGenFF website (https://cgenff.umaryland.edu/). The complexes were immersed in a periodic orthorhombic water box (TIP3P), and an appropriate number of Cl⁻ ions was added to neutralize the systems. The steepest descent (SD) algorithm was used for energy minimization. The system was equilibrated in two steps: a 100 ps NVT process at 310 K, and a 100 ps NPT process at 1 bar, with position restraints (1000 kJ/mol) on the heavy atoms of the protein and ligand. Subsequently, a 50 ns MD run was performed at 300 K with the trajectory saved every 50 ps. After the protein backbone stabilized, the binding free energies were calculated using MMGBSA with the following equation:

$$\Delta {\text{G}}_{{{\text{bind}}}} = {\text{G}}_{{{\text{complex}}}} - {\text{G}}_{{{\text{ligand}}}} - {\text{G}}_{{{\text{receptor}}}} = \Delta {\text{H}} - {\text{T}}\Delta {\text{S}}$$

where

$$\begin{aligned} & \Delta {\text{H}} = \Delta {\text{E}}_{{{\text{MM}}}} + \Delta {\text{G}}_{{{\text{polar}}}} + \Delta {\text{G}}_{{{\text{nonpolar}}}} \\ & \Delta {\text{E}}_{{{\text{MM}}}} = \Delta {\text{E}}_{{{\text{bond}}}} + \Delta {\text{E}}_{{{\text{angle}}}} + \Delta {\text{E}}_{{{\text{dihedral}}}} + \Delta {\text{E}}_{{{\text{ele}}}} + \Delta {\text{E}}_{{{\text{vdW}}}} \\ & \Delta {\text{G}}_{{{\text{polar}}}} = \Delta {\text{G}}_{{{\text{GB}}}} \\ & \Delta {\text{G}}_{{{\text{nonpolar}}}} = \Delta {\text{G}}_{{{\text{SA}}}} \\ \end{aligned}$$

Here ∆E_MM is the gas-phase molecular mechanics energy; ∆E_ele is the electrostatic interaction energy; ∆E_vdW is the non-bonded van der Waals interaction energy; ∆G_polar is the polar solvation free energy; and ∆G_nonpolar is the nonpolar solvation free energy. ∆G_polar and ∆G_nonpolar were calculated with the Generalized Born and surface area (GBSA) model.
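The bookkeeping of these terms can be made concrete with a short sketch. The class below simply sums the components as in the equations above. The class and method names are hypothetical, the numbers are invented for illustration and are not results from this study, and any consistent energy unit may be used.

```python
from dataclasses import dataclass

@dataclass
class EnergyTerms:
    """Energy differences (complex - receptor - ligand) in one consistent unit."""
    bond: float
    angle: float
    dihedral: float
    ele: float   # electrostatic interaction energy
    vdw: float   # van der Waals interaction energy
    gb: float    # polar solvation term (Generalized Born)
    sa: float    # nonpolar solvation term (surface area)

    def e_mm(self):
        """Gas-phase molecular mechanics energy, ΔE_MM."""
        return self.bond + self.angle + self.dihedral + self.ele + self.vdw

    def delta_h(self):
        """ΔH = ΔE_MM + ΔG_polar + ΔG_nonpolar."""
        return self.e_mm() + self.gb + self.sa

    def delta_g_bind(self, t_delta_s=0.0):
        """ΔG_bind = ΔH - TΔS; the entropy term defaults to zero."""
        return self.delta_h() - t_delta_s

# Hypothetical numbers for illustration only.
terms = EnergyTerms(bond=0.0, angle=0.0, dihedral=0.0,
                    ele=-20.0, vdw=-30.0, gb=25.0, sa=-4.0)
print(terms.delta_h())  # -29.0
```

Comparing the `ele` and `vdw` fields between two such records mirrors the term-by-term comparison of native and QTY complexes in Fig. 8A.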

The 15–50 ns trajectory of native GPR35:pamoic acid and 10–50 ns trajectory of QTY GPR35:pamoic acid were extracted per 1 ns to generate frames for binding energy calculations. All residues were calculated to provide a ranking of respective contributions. For calculation in binding pockets, residues in overlapping sites of native and QTY variant proteins within 6 Å were presented for comparison. Mutated residues were marked with boxes in different colors.
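As a small worked example of the frame selection, the sketch below converts the stated time windows into indices of saved frames. It assumes that frame 0 corresponds to t = 0, that frames are saved every 50 ps as stated in the MD protocol, and that both window endpoints are included; the function name `frame_indices` is hypothetical.

```python
def frame_indices(start_ns, end_ns, step_ns=1.0, save_interval_ps=50.0):
    """Indices of saved frames sampled every step_ns from start_ns to end_ns inclusive."""
    stride = round(step_ns * 1000 / save_interval_ps)
    first = round(start_ns * 1000 / save_interval_ps)
    last = round(end_ns * 1000 / save_interval_ps)
    return list(range(first, last + 1, stride))

native_frames = frame_indices(15, 50)  # native GPR35:pamoic acid window
qty_frames = frame_indices(10, 50)     # QTY GPR35:pamoic acid window
print(len(native_frames), len(qty_frames))  # 36 41
```

Under these assumptions every 20th saved frame is used; if the endpoints were excluded, the counts would differ by one.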

Availability of data and materials

All data supporting this study and its findings are available within the article, in associated files, and accessible at Metagene platform of Zhejianglab (https://bio-gateway.aigene.org.cn/g/CrMP).

References

Santos R, Ursu O, Gaulton A, Bento AP, Donadi RS, Bologa CG, Karlsson A, Al-Lazikani B, Hersey A, Oprea TI, et al. A comprehensive map of molecular drug targets. Nat Rev Drug Discov. 2017;16:19–34.
Lin CY, Lee CH, Chuang YH, Lee JY, Chiu YY, Wu Lee YH, Jong YJ, Hwang JK, Huang SH, Chen LC, et al. Membrane protein-regulated networks across human cancers. Nat Commun. 2019;10:3131.
Roslan A, Sulaiman N, Mohd Ghani KA, Nurdin A. Cancer-associated membrane protein as targeted therapy for bladder cancer. Pharmaceutics. 2022;14:2218.
Kampen KR. Membrane proteins: the key players of a cancer cell. J Membr Biol. 2011;242:69–74.
Almen MS, Nordstrom KJ, Fredriksson R, Schioth HB. Mapping the human membrane proteome: a majority of the human membrane proteins can be classified according to function and evolutionary origin. BMC Biol. 2009;7:50.
Almasi S, ElHiani Y. Exploring the therapeutic potential of membrane transport proteins: focus on cancer and chemoresistance. Cancers (Basel). 2020;12:1624.
Themistocleous SC, Yiallouris A, Tsioutis C, Zaravinos A, Johnson EO, Patrikios I. Clinical significance of P-class pumps in cancer. Oncol Lett. 2021;22:658.
Lim PS, Sutton CR, Rao S. Protein kinase C in the immune system: from signalling to chromatin regulation. Immunology. 2015;146:508–22.
March B, Faulkner S, Jobling P, Steigler A, Blatt A, Denham J, Hondermarck H. Tumour innervation and neurosignalling in prostate cancer. Nat Rev Urol. 2020;17:119–30.
Ziani L, Chouaib S, Thiery J. Alteration of the antitumor immune response by cancer-associated fibroblasts. Front Immunol. 2018;9:414.
Cervantes-Villagrana RD, Albores-Garcia D, Cervantes-Villagrana AR, Garcia-Acevez SJ. Tumor-induced neurogenesis and immune evasion as targets of innovative anti-cancer therapies. Signal Transduct Target Ther. 2020;5:99.
Venkataramani V, Tanev DI, Strahle C, Studier-Fischer A, Fankhauser L, Kessler T, Körber C, Kardorff M, Ratliff M, Xie R, et al. Glutamatergic synaptic input to glioma cells drives brain tumour progression. Nature. 2019;573:532–8.
Song X, Li R, Liu G, Huang L, Li P, Feng W, Gao Q, Xing X. Nuclear membrane protein SUN5 is highly expressed and promotes proliferation and migration in colorectal cancer by regulating the ERK pathway. Cancers (Basel). 2022;14:5368.
Li Y, Wang J, Gao C, Hu Q, Mao X. Integral membrane protein 2A enhances sensitivity to chemotherapy via notch signaling pathway in cervical cancer. Bioengineered. 2021;12:10183–93.
Kahm YJ, Kim RK, Jung U, Kim IG. Epithelial membrane protein 3 regulates lung cancer stem cells via the TGF-beta signaling pathway. Int J Oncol. 2021;59:1–9.
Liu R, Wang X, Chen GY, Dalerba P, Gurney A, Hoey T, Sherlock G, Lewicki J, Shedden K, Clarke MF. The prognostic role of a gene signature from tumorigenic breast-cancer cells. N Engl J Med. 2007;356:217–26.
Choromanska A, Chwilkowska A, Kulbacka J, Baczynska D, Rembialkowska N, Szewczyk A, Michel O, Gajewska-Naryniecka A, Przystupski D, Saczko J. Modifications of plasma membrane organization in cancer cells for targeted therapy. Molecules. 2021;26:1850.
Das PM, Thor AD, Edgerton SM, Barry SK, Chen DF, Jones FE. Reactivation of epigenetically silenced HER4/ERBB4 results in apoptosis of breast tumor cells. Oncogene. 2010;29:5214–9.
Gentles AJ, Plevritis SK, Majeti R, Alizadeh AA. Association of a leukemic stem cell gene expression signature with clinical outcomes in acute myeloid leukemia. JAMA. 2010;304:2706–15.
Nogueira PAS, Moura-Assis A, Razolli DS, Bombassaro B, Zanesco AM, Gaspar JM, Donato Junior J, Velloso LA. The orphan receptor GPR68 is expressed in the hypothalamus and is involved in the regulation of feeding. Neurosci Lett. 2022;781:136660.
Dao M, Stoveken HM, Cao Y, Martemyanov KA. The role of orphan receptor GPR139 in neuropsychiatric behavior. Neuropsychopharmacology. 2022;47:902–13.
Civelli O, Reinscheid RK, Zhang Y, Wang Z, Fredriksson R, Schioth HB. G protein-coupled receptor deorphanizations. Annu Rev Pharmacol Toxicol. 2013;53:127–46.
Tang XL, Wang Y, Li DL, Luo J, Liu MY. Orphan G protein-coupled receptors (GPCRs): biological functions and potential drug targets. Acta Pharmacol Sin. 2012;33:363–71.
Lo YS, Huang SH, Luo YC, Lin CY, Yang JM. Reconstructing genome-wide protein-protein interaction networks using multiple strategies with homologous mapping. PLoS ONE. 2015;10:e0116347.
Kotlyar M, Pastrello C, Pivetta F, Lo Sardo A, Cumbaa C, Li H, Naranian T, Niu Y, Ding Z, Vafaee F, et al. In silico prediction of physical protein interactions and characterization of interactome orphans. Nat Methods. 2015;12:79–84.
Szklarczyk D, Morris JH, Cook H, Kuhn M, Wyder S, Simonovic M, Santos A, Doncheva NT, Roth A, Bork P, et al. The STRING database in 2017: quality-controlled protein-protein association networks, made broadly accessible. Nucleic Acids Res. 2017;45:D362–D368.
Forli S, Huey R, Pique ME, Sanner MF, Goodsell DS, Olson AJ. Computational protein-ligand docking and virtual drug screening with the AutoDock suite. Nat Protoc. 2016;11:905–19.
Rawlings AE. Membrane proteins: always an insoluble problem? Biochem Soc Trans. 2016;44:790–5.
Loll PJ. Membrane protein structural biology: the high throughput challenge. J Struct Biol. 2003;142:144–53.
Tate CG. Practical considerations of membrane protein instability during purification and crystallisation. Heterologous Expr Membr Proteins Methods Protoc. 2010;601:187–203.
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9.
Skuhersky MA, Tao F, Qing R, Smorodina E, Jin D, Zhang S. Comparing native crystal structures and AlphaFold2 predicted water-soluble g protein-coupled receptor QTY variants. Life (Basel). 2021;11:1285.
Tunyasuvunakool K, Adler J, Wu Z, Green T, Zielinski M, Žídek A, Bridgland A, Cowie A, Meyer C, Laydon A, et al. Highly accurate protein structure prediction for the human proteome. Nature. 2021;596:590–6.
Callaway E. “The entire protein universe”: AI predicts shape of nearly every known protein. Nature. 2022;608:15–6.
Zhang S, Tao F, Qing R, Tang H, Skuhersky M, Corin K, Tegler L, Wassie A, Wassie B, Kwon Y, et al. QTY code enables design of detergent-free chemokine receptors that retain ligand-binding activities. Proc Natl Acad Sci USA. 2018;115:E8652–9.
Qing R, Han Q, Skuhersky M, Chung H, Badr M, Schubert T, Zhang S. QTY code designed thermostable and water-soluble chimeric chemokine receptors with tunable ligand affinity. Proc Natl Acad Sci USA. 2019;116:25668–76.
Smorodina E, Tao F, Qing R, Jin D, Yang S, Zhang S. Comparing 2 crystal structures and 12 AlphaFold2-predicted human membrane glucose transporters and their water-soluble glutamine, threonine and tyrosine variants. QRB Discov. 2022;3:e5.
Smorodina E, Diankin I, Tao F, Qing R, Yang S, Zhang S. Structural informatic study of determined and AlphaFold2 predicted molecular structures of 13 human solute carrier transporters and their water-soluble QTY variants. Sci Rep. 2022;12:20103.
Arakaki AKS, Pan WA, Trejo J. GPCRs in cancer: protease-activated receptors, endocytic adaptors and signaling. Int J Mol Sci. 2018;19:1886.
Digre A, Lindskog C. The Human Protein Atlas-Spatial localization of the human proteome in health and disease. Protein Sci. 2021;30:218–33.
Uhlen M, Zhang C, Lee S, Sjostedt E, Fagerberg L, Bidkhori G, Benfeitas R, Arif M, Liu Z, Edfors F, et al. A pathology atlas of the human cancer transcriptome. Science. 2017;357:eaan2507.
Ponten F, Jirstrom K, Uhlen M. The human protein atlas—a tool for pathology. J Pathol. 2008;216:387–93.
Thul PJ, Lindskog C. The human protein atlas: a spatial map of the human proteome. Protein Sci. 2018;27:233–44.
Tao F, Tang H, Zhang S, Li M, Xu P. Enabling QTY server for designing water-soluble alpha-helical transmembrane proteins. MBio. 2022;13:e0360421.
Hao S, Jin D, Zhang S, Qing R. QTY code-designed water-soluble fc-fusion cytokine receptors bind to their respective ligands. QRB Discov. 2020;1:e4.
Christian H, Agus MP, Suhartono D. Single document automatic text summarization using term frequency-inverse document frequency (TF-IDF). ComTech Comput Math Eng Appl. 2016;7:285–94.
Nakada M, Hayashi Y, Hamada J. Role of Eph/ephrin tyrosine kinase in malignant glioma. Neuro Oncol. 2011;13:1163–70.
Morita Y, Sakaguchi T, Ikegami K, Goto-Inoue N, Hayasaka T, Hang VT, Tanaka H, Harada T, Shibasaki Y, Suzuki A. Lysophosphatidylcholine acyltransferase 1 altered phospholipid composition and regulated hepatoma progression. J Hepatol. 2013;59:292–9.
Cai J, Liu Y, Li Q, Wen Z, Li Y, Chen X. Ceramide synthase 3 affects invasion and metastasis of hepatocellular carcinoma via the SMAD6 gene. Zhong Nan Da Xue Xue Bao Yi Xue Ban. 2022;47:588–99.
Wang X, Qiu Z, Dong W, Yang Z, Wang J, Xu H, Sun T, Huang Z, Jin J. S1PR1 induces metabolic reprogramming of ceramide in vascular endothelial cells, affecting hepatocellular carcinoma angiogenesis and progression. Cell Death Dis. 2022;13:768.
Li Y, Chen X, Lu H. Knockdown of SLC34A2 inhibits hepatocellular carcinoma cell proliferation and invasion. Oncol Res. 2016;24:511–9.
Qing R, Tao F, Chatterjee P, Yang G, Han Q, Chung H, Ni J, Suter BP, Kubicek J, Maertens B, et al. Non-full-length water-soluble CXCR4(QTY) and CCR5(QTY) chemokine receptors: implication for overlooked truncated but functional membrane receptors. iScience. 2020;23:101670.
Brandt C, McFie PJ, Stone SJ. Biochemical characterization of human acyl coenzyme A: 2-monoacylglycerol acyltransferase-3 (MGAT3). Biochem Biophys Res Commun. 2016;475:264–70.
Zhang Y, Shi T, He Y. GPR35 regulates osteogenesis via the Wnt/GSK3beta/beta-catenin signaling pathway. Biochem Biophys Res Commun. 2021;556:171–8.
Zheng W, Zhou J, Luan Y, Yang J, Ge Y, Wang M, Wu B, Wu Z, Chen X, Li F, et al. Spatiotemporal control of GPR37 signaling and its behavioral effects by optogenetics. Front Mol Neurosci. 2018;11:95.
Dawson PA, Lan T, Rao A. Bile acid transporters. J Lipid Res. 2009;50:2340–57.
Nyarko E, Obirikorang C, Owiredu W, Adu EA, Acheampong E, Aidoo F, Ofori E, Addy BS, Asare-Anane H. NTCP gene polymorphisms and hepatitis B virus infection status in a Ghanaian population. Virol J. 2020;17:91.
Jia L, Betters JL, Yu L. Niemann-pick C1-like 1 (NPC1L1) protein in intestinal and hepatic cholesterol transport. Annu Rev Physiol. 2011;73:239–59.
Nihei W, Nagafuku M, Hayamizu H, Odagiri Y, Tamura Y, Kikuchi Y, Veillon L, Kanoh H, Inamori KI, Arai K, et al. NPC1L1-dependent intestinal cholesterol absorption requires ganglioside GM3 in membrane microdomains. J Lipid Res. 2018;59:2181–7.
Wiederstein M, Sippl MJ. ProSA-web: interactive web service for the recognition of errors in three-dimensional structures of proteins. Nucleic Acids Res. 2007;35:W407–10.
Tamamis P, Floudas CA. Elucidating a key component of cancer metastasis: CXCL12 (SDF-1alpha) binding to CXCR4. J Chem Inf Model. 2014;54:1174–88.
Tamamis P, Floudas CA. Elucidating a key anti-HIV-1 and cancer-associated axis: the structure of CCL5 (Rantes) in complex with CCR5. Sci Rep. 2014;4:5447.
Ku T, Lu P, Chan C, Wang T, Lai S, Lyu P, Hsiao N. Predicting melting temperature directly from protein sequences. Comput Biol Chem. 2009;33:445–50.
Qing R, Hao S, Smorodina E, Jin D, Zalevsky A, Zhang S. Protein design: from the aspect of water solubility and stability. Chem Rev. 2022;122:14085–179.
Seeliger D, de Groot BL. Ligand docking and binding site analysis with PyMOL and Autodock/Vina. J Comput Aided Mol Des. 2010;24:417–22.
Jendele L, Krivak R, Skoda P, Novotny M, Hoksza D. PrankWeb: a web server for ligand binding site prediction and visualization. Nucleic Acids Res. 2019;47:W345–9.
Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem. 2010;31:455–61.
Mackenzie AE, Quon T, Lin L-C, Hauser AS, Jenkins L, Inoue A, Tobin AB, Gloriam DE, Hudson BD, Milligan G. Receptor selectivity between the G proteins Gα12 and Gα13 is defined by a single leucine-to-isoleucine variation. FASEB J. 2019;33:5005.
Roney JP, Ovchinnikov S. State-of-the-art estimation of protein model accuracy using AlphaFold. BioRxiv. 2022.
Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, Lindahl E. GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX. 2015;1:19–25.
Huang J, MacKerell AD Jr. CHARMM36 all-atom additive protein force field: validation based on comparison to NMR data. J Comput Chem. 2013;34:2135–45.
Valdes-Tresanco MS, Valdes-Tresanco ME, Valiente PA, Moreno E. gmx_MMPBSA: a new tool to perform end-state free energy calculations with GROMACS. J Chem Theory Comput. 2021;17:6281–91.
Kwon OS, Song HS, Park TH, Jang J. Conducting nanomaterial sensor using natural receptors. Chem Rev. 2019;119:36–93.
Zeng Q, Michael IP, Zhang P, Saghafinia S, Knott G, Jiao W, McCabe BD, Galván JA, Robinson HPC, Zlobec I, et al. Synaptic proximity enables NMDAR signalling to promote brain metastasis. Nature. 2019;573:526–31.
Gong J, Chen Y, Pu F, Sun P, He F, Zhang L, Li Y, Ma Z, Wang H. Understanding membrane protein drug targets in computational perspective. Curr Drug Targets. 2019;20:551–64.
Usman S, Khawer M, Rafique S, Naz Z, Saleem K. The current status of anti-GPCR drugs against different cancers. J Pharm Anal. 2020;10:517–21.
Cao S, Peterson SM, Muller S, Reichelt M, McRoberts Amador C, Martinez-Martin N. A membrane protein display platform for receptor interactome discovery. Proc Natl Acad Sci USA. 2021;118:e2025451118.
Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolym Orig Res Biomol. 1983;22:2577–637.
Hagemans D, van Belzen IA, Moran Luengo T, Rudiger SG. A script to highlight hydrophobicity and charge on protein surfaces. Front Mol Biosci. 2015;2:56.
Wu S, Zhu Y. ProPAS: standalone software to analyze protein properties. Bioinformation. 2012;8:167.

Acknowledgements

Not applicable.

Funding

This work was supported by the Metagene platform of Zhejiang Lab (BH0800009).

Author information

Lina Ma and Sitao Zhang contributed equally to this work.

Authors and Affiliations

State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
Lina Ma, Sitao Zhang, Wenting Huang, Hui Wang, Ping Xu, Fei Tao & Rui Qing
Zhejiang Lab, Research Center for Intelligent Computing Platforms, Hangzhou, 311121, Zhejiang, China
Qi Liang & Jin Tang
The Lawrenceville School, 2500 Main Street, Lawrenceville, NJ, 08648, USA
Emily Pan
Media Lab, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA, 02139, USA
Shuguang Zhang

Contributions

RQ, FT and JT designed the research. LM, SZ, QL and WH conducted the experiments. LM, SZ, QL, WH, FT and RQ analyzed the data. PX, SZ, FT, JT and RQ oversaw the research. LM, SZ, WH, HW, EP, SZ, FT and RQ wrote the paper.

Corresponding authors

Correspondence to Fei Tao, Jin Tang or Rui Qing.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Accessibility

The database is accessible on the Metagene platform of Zhejiang Lab (https://bio-gateway.aigene.org.cn/g/CrMP). The website requires registration but is free to use.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. Supplementary Materials.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Ma, L., Zhang, S., Liang, Q. et al. CrMP-Sol database: classification, bioinformatic analyses and comparison of cancer-related membrane proteins and their water-soluble variant designs. BMC Bioinformatics 24, 360 (2023). https://doi.org/10.1186/s12859-023-05477-9

Received: 25 January 2023
Accepted: 12 September 2023
Published: 25 September 2023
DOI: https://doi.org/10.1186/s12859-023-05477-9
