Fast and accurate genome-wide predictions and structural modeling of protein–protein interactions using Galaxy

Guerler, Aysam; Baker, Dannon; van den Beek, Marius; Gruening, Bjoern; Bouvier, Dave; Coraor, Nate; Shank, Stephen D.; Zehr, Jordan D.; Schatz, Michael C.; Nekrutenko, Anton

doi:10.1186/s12859-023-05389-8

Software
Open access
Published: 23 June 2023

BMC Bioinformatics volume 24, Article number: 263 (2023)

Abstract

Background

Protein–protein interactions play a crucial role in almost all cellular processes. Identifying interacting proteins reveals insight into living organisms and yields novel drug targets for disease treatment. Here, we present a publicly available, automated pipeline to predict genome-wide protein–protein interactions and produce high-quality multimeric structural models.

Results

Application of our method to the Human and Yeast genomes yields protein–protein interaction networks similar in quality to those of common experimental methods. We identified and modeled Human proteins likely to interact with the papain-like protease of SARS-CoV2’s non-structural protein 3. We also produced models of SARS-CoV2’s spike protein (S) interacting with myelin-oligodendrocyte glycoprotein receptor and dipeptidyl peptidase-4.

Conclusions

The presented method is capable of confidently identifying interactions while providing high-quality multimeric structural models for experimental validation. The interactome modeling pipeline is available at usegalaxy.org and usegalaxy.eu.

Background

Obtaining a complete map of interacting proteins is crucial to decipher the inner workings of living organisms. Among many other roles, proteins act in dynamic collaboration to fulfill biological functions by catalyzing chemical processes. Commonly, interactions are elucidated through a variety of experimental methods [33, 47], which are capable of evaluating an ever-larger number of putative protein pairs. Unfortunately, the overlap between these methods is often limited, which indicates either a high false positive rate or a low coverage: often 40–90% of the detected interactions do not overlap between different methods [30, 39]. Also, high-throughput methods do not provide structural insights into the formed protein–protein complex. More reliable methods such as crystallography and NMR spectroscopy do yield structural information but are labor intensive and as such only applicable to a limited number of proteins. In a recent study we demonstrated that the gap between low- and high-throughput methods can be bridged by identifying distantly related protein–protein homologues with similar protein–protein interfaces [10]. Application of the SPRING method [11] to Escherichia coli competitively identified protein–protein interactions while producing accurate multimeric protein structure models, of which 39 have by now been confirmed in high-resolution experiments. Other studies applied our method to the minimal synthetic genome syn3.0 [44] and the mouse genome [20]. In the present study we describe how we implemented our pipeline on Galaxy [1], a web-based computational workbench used by many scientists across the world to analyze large data sets. This allows scientists to reproduce, share and embed the resulting interactome networks within their own analysis pipelines. Given a set of query sequences and a list of known protein structures, the pipeline employs SPRING with HHsearch [35] and TM-align [46] to detect and structurally model protein–protein interactions.
We validate the pipeline’s performance by comparing the resulting Human and Yeast protein networks with experimental findings. Similar to the results for Escherichia coli, the method competitively resolves Human and Yeast protein–protein interaction networks. As novel targets, we identified Human proteins likely to bind the papain-like protease of SARS-CoV2’s non-structural protein 3 (Nsp3). We also obtained models for SARS-CoV2’s spike protein (S) in complex with myelin-oligodendrocyte glycoprotein (MOG) and dipeptidyl peptidase-4 (DPP4). Some of the detected interactions have already been experimentally confirmed in recent literature; others provide novel insights into the pathology of SARS-CoV2. Notably, the interaction with DPP4 has been suggested to cause a higher mortality rate of diabetics contracting SARS-CoV2 [37], while the MOG receptor is associated with the MOG antibody disease, which relapses in SARS-CoV2 patients [41].

Results

Performance validation with human and yeast interactomes

The pipeline’s performance is validated on 20,610 raw protein-coding gene sequences from the Human Reference Genome (UP000005640) of the UniProt database. This process evaluates ~212 million possible pairs to identify the set of interacting protein–protein pairs. Each interaction is ranked by its Z_com score, and the Matthews correlation coefficient (MCC) is determined with regard to a negative data set of non-interacting protein pairs produced by the SPRING MCC tool and positive data sets derived from each experimental method. The negative data set has been sampled to contain proteins from different subcellular regions. Figure 1 displays the cross-validation performance results in comparison to ten experimental methods available in the BioGRID database. Note that we applied SPRING to the raw protein-coding sequences without separating the individual proteins using the CDS record provided by GenBank [4]. In total, the 20,610 protein-coding genes encode about 75,776 individual proteins.

We repeated the same experiment using the Yeast genome (UP000002311) to identify protein–protein interaction networks. In total, 6045 protein-coding genes were passed through the pipeline, evaluating ~18 million possible protein–protein interactions using the public Galaxy instance at https://usegalaxy.org. The results are shown in the right panel of Fig. 1.
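The pair counts quoted for both genomes are consistent with counting every unordered pair of distinct query sequences. The following minimal sketch checks that arithmetic; it assumes that self-pairs are not counted, and the function name `possible_pairs` is introduced here for illustration only.

```python
from math import comb


def possible_pairs(n_sequences: int) -> int:
    """Number of unordered pairs of distinct sequences, n * (n - 1) / 2."""
    return comb(n_sequences, 2)


# Sequence counts taken from the text above.
human_pairs = possible_pairs(20610)
yeast_pairs = possible_pairs(6045)

print(f"Human: {human_pairs:,} pairs")  # roughly 212 million
print(f"Yeast: {yeast_pairs:,} pairs")  # roughly 18 million
```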

A more detailed analysis regarding the prediction performance of the presented pipeline versus experimental methods has been recently published for the Escherichia coli genome [10]. The pipeline predicted several protein complex structures which were later experimentally verified by crystallography.

For all three genomes, our pipeline was able to implicitly identify individual protein sequences and achieve an overall performance which is comparable to, if not better than, that of existing experimental methods.

SARS-CoV2 protease (Nsp3) and ubiquitin

We next applied the Galaxy pipeline to the genome of SARS-CoV2, which causes a novel severe acute respiratory syndrome whose outbreak has been declared a pandemic [27]. The SARS-CoV2 genome comprises ~30 thousand nucleotides and contains 13–15 open reading frames, including 11 protein-coding genes. Our pipeline identified Human substrates for the papain-like protease of SARS-CoV2, which is part of the non-structural protein 3 (Nsp3) (see Fig. 2).

Table 1 lists the fifteen highest-ranking substrates with matching multimeric templates and model quality attributes, i.e. SM-score, TM-score, E_contact and Z_com. The two highest-ranking interactions were identified for ISG15 (SM-score = 1.11) and ANKUB (SM-score = 1.10). ISG15 has recently been experimentally confirmed as a substrate [32] and ANKUB was suggested in a computational cleavage enrichment study [29].

Table 1 SARS-CoV2 papain-like protease substrates


SARS-CoV2 spike protein (S) and myelin-oligodendrocyte glycoprotein

We also identified Human proteins interacting with SARS-CoV2’s spike protein (S). The top-ranking interaction was found for angiotensin-converting enzyme (ACE2/ACE), which is widely known to be the primary receptor for SARS-CoV2 [48].

The second-highest-ranking model was detected for the interaction with the myelin-oligodendrocyte glycoprotein (MOG, see Fig. 3). MOG is a protein located on the surface of myelin sheaths in the central nervous system [14]. Our pipeline modeled the monomeric structure of MOG with the highest-ranking homologue in the PDB70 database, which is PDB entry 4PFE [9] at a Z-score of 102.2. We compared the resulting monomeric model with the model provided by Mesleh et al. [26]. Both models resolve MOG as a beta-barrel and share significant similarity at a TM-score of 0.70. Additionally, several suitable multimeric template frameworks were identified. The corresponding PDB entries are 7C8V [21], 6XC2 [43], 6XC4 [43], 7BZ5 [42] and 7C01 [31]. All of these structures, except 7C8V, were crystallized with a potent neutralizing antibody of SARS-CoV2. Table 2 shows the identified template frameworks and the resulting model scores. The results indicate two distinct putative binding modes which may occur in tandem (see Fig. 3C).

Table 2 Multimeric frameworks identified for myelin-oligodendrocyte glycoprotein


The MOG receptor is associated with MOG antibody disease (MOGAD), a neuro-inflammatory condition that may cause inflammation of the optic nerve, the spinal cord and brain. Recent research has shown that SARS-CoV2 does trigger a relapse of MOGAD [41].

SARS-CoV2 Spike protein (S) and dipeptidyl peptidase-4

High-scoring models were also generated for dipeptidyl peptidase-4 (DPP4, see Fig. 4), confirming the computational modeling results presented by [19, 22]. DPP4 is a cell surface glycoprotein receptor involved in T-cell activation and assumed to play a role in cell adhesion, migration and tube formation [8].

Additionally, inhibiting DPP4 prevents glucagon release while increasing insulin secretion to decrease blood glucose levels [24]. DPP4 is known to interact with MERS-CoV [40]. The highest scoring template frameworks for DPP4 were PDB entry 4KR0 [23] with a Z_com score of 216.30 and PDB entry 4L72 [40] with 213.4. The resulting dimeric models are very similar to each other. The multimeric template matched the individual models with a TM-score of 0.62, a mean contact energy of -6.7 and SM-score of 0.69.

Several sites are known to significantly contribute to the interaction between DPP4 and MERS-CoV’s receptor-binding domain (RBD). These are DPP4 residues K267, R336, R317, and Q344 ([22, 34], see Fig. 4) along with polymorphic sites as outlined in Table 3. Our method illustrates that SARS-CoV2’s S protein interacts with sites on DPP4 shared by MERS-CoV in addition to novel interaction sites (see Table 3). Additionally, half of the fourteen critical binding sites [18] have been identified as polymorphic in Humans. Taken together, binding propensities between SARS-CoV2’s S protein and DPP4 might vary between populations.

Table 3 shows sites on DPP4 that are critical in binding MERS-CoV’s receptor-binding domain (RBD) and sites predicted to interact with SARS-CoV2’s RBD


Discussion

Accurate identification of protein–protein interactions is essential to decipher cellular processes and detect novel drug targets. In the present work we implemented a Galaxy pipeline using the SPRING method which detects and structurally models protein–protein interactions by identifying distantly related protein complex structures with similar protein–protein interfaces.

The presented pipeline yields insights into the biochemical activity of SARS-CoV2 by identifying distant homologues with similar binding interfaces to Human proteins. For the papain-like protease of the non-structural protein 3 (Nsp3), we detected several ubiquitin-like substrates, some of which have been experimentally confirmed. The method produced a top-ranking model for SARS-CoV2’s spike protein (S) and dipeptidyl peptidase-4 (DPP4) in agreement with existing literature. Our method also produced novel complex models between the S protein and myelin-oligodendrocyte glycoprotein (MOG), for which two top-ranking binding modes were obtained. Experimental exploration will be needed to determine what role these novel binding sites might play in pathogenicity, immune evasion, and adaptation. The prediction confidence relies on the accuracy of the homology match between templates, the structural fit and a knowledge-based contact potential, providing likely binding modes and interaction partners for further investigation. Only additional experimental validation can determine which, if any, of the predicted binding modes occur in nature.

A limitation of our method is that it may produce high-confidence models between proteins which are localized in different subcellular regions. Existing literature has shown that such cross-interactions occur in a significant number of cases. In the present work we avoid filtering predicted protein interaction pairs by their corresponding subcellular locations since this would bias the obtained Matthews correlation coefficients. Identifying an accurate set of truly non-interacting protein pairs is critical and particularly challenging for the evaluation of protein–protein interactions. Randomly sampling protein pairs across a genome may lead to the inclusion of interacting protein pairs. A more accurate method is to sample non-interacting sets by pairing proteins from different subcellular regions, as presented here. Yet another common suggestion is to exclude homologous protein pairs from the non-interacting set altogether in order to avoid the inclusion of interacting pairs. This, however, is not an option due to the nature of the presented method, which relies on homology detection to predict protein–protein interactions.

Another limitation is that homology modeling does rely on experimental templates. All of the fifteen most confident models derived for Human proteins interacting with Nsp3’s protease rely on four crystallographic complex structures.

Conclusions

This pipeline demonstrates the ability to detect interactome networks for a range of organisms. The increasing number of resolved co-crystal structures in the PDB will continually improve the model quality and coverage over time [6]. Since the pipeline includes all data preparation steps, no manual adjustment is required once new data has been published to the PDB. Galaxy enables users to employ the pipeline within their own methodologies and add or modify steps as required using Galaxy’s web-based workflow editor. Users are now able to reproduce and share the resulting interactome networks. The present contribution expands the repertoire of Galaxy tools to structural modeling methodologies, making them available to a large number of users. Recent advances in protein structure prediction and modeling [3, 13, 16] complement existing sequence analysis tools and provide novel targets for drug discovery and elucidating biochemical processes through structural insights.

Methods

Protein–protein interaction analysis pipeline

We present a Galaxy pipeline to predict and structurally model protein–protein interactions on a genomic scale. The pipeline takes the following inputs:

(1) An individual file or a pair of files containing multiple FASTA entries of protein coding sequences. The pipeline will attempt to identify protein–protein interactions within the set of query sequences.
(2) Text file containing the list of all Protein Data Bank (PDB) [5] entry identifiers to be employed as a multimeric template library. This step can be skipped if the library has already been constructed.
(3) PDB70 threading library files as provided by the developers of HHsearch. These files are used to perform single-chain threading and can be obtained from http://wwwuser.gwdg.de/~compbiol/data/hhsuite/databases/hhsuite_dbs/.

The following outputs are generated:

(1) Tabular file containing all identified interactions with their corresponding templates and Z_com scores.
(2) Tabular file containing the details of the produced multimeric structural models and the corresponding model properties, i.e. SM-score, TM-score, a knowledge-based contact energy term E_contact and the fraction of inter-chain clashes.
(3) Collection containing dimeric structural models for each interaction, including the structures of the identified templates from the PDB.
(4) Bar chart displaying the prediction accuracy in comparison to experimental results derived from the BioGRID database.

This workflow is available at: https://usegalaxy.org/u/guerler/w/interaction-prediction.

Data preparation

The presented pipeline utilizes all protein–protein interfaces available in the PDB as a template library for interface homology detection (Fig. 5). The data preparation starts by using the DBkit tool in Galaxy to download all PDB entries and store them as an ffindex/ffdata database pair. As of November 29th, 2020 this amounted to 170,860 files. Then the SPRING Cross tool is applied, which scans each PDB entry for protein–protein interfaces and stores the corresponding interacting PDB chain identifiers in a 2-column lookup table as a pairwise index of all interactions. In more detail, the SPRING Cross method proceeds by using the PDB REMARK 350 entries to build all bio units available in a given PDB entry. Then all C-alpha atom distances between two separate chains within the same bio unit are determined. If more than five distances below 10 Å are detected for a pair of PDB chains, the corresponding PDB chain identifiers are deemed interacting and added as a new row to the resulting 2-column lookup table. This yields a complete set of 988,784 interacting PDB chain identifier pairs contained in the PDB, which we use as a multimeric template library.
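The interface criterion described above (more than five C-alpha distances below 10 Å between two chains of the same bio unit) can be sketched as follows. This is an illustrative sketch and not the SPRING Cross source code: the function name `chains_interact`, the plain coordinate tuples and the toy chains are assumptions made for the example, and building bio units from REMARK 350 records is left out.

```python
from itertools import product
from math import dist

DISTANCE_CUTOFF = 10.0  # angstroms, as stated in the text
MIN_CONTACTS = 5        # "more than five" distances must fall below the cutoff


def chains_interact(ca_a, ca_b, cutoff=DISTANCE_CUTOFF, min_contacts=MIN_CONTACTS):
    """Return True if more than `min_contacts` C-alpha pairs lie closer than `cutoff`.

    ca_a and ca_b are sequences of (x, y, z) C-alpha coordinates,
    one sequence per chain.
    """
    contacts = 0
    for atom_a, atom_b in product(ca_a, ca_b):
        if dist(atom_a, atom_b) < cutoff:
            contacts += 1
            if contacts > min_contacts:
                return True  # stop early once the criterion is met
    return False


# Toy chains (hypothetical coordinates): B runs parallel to A at 6 angstroms,
# C is placed 50 angstroms away.
chain_a = [(i * 3.8, 0.0, 0.0) for i in range(4)]
chain_b = [(i * 3.8, 6.0, 0.0) for i in range(4)]
chain_c = [(i * 3.8, 50.0, 0.0) for i in range(4)]

print(chains_interact(chain_a, chain_b))  # True
print(chains_interact(chain_a, chain_c))  # False
```

The brute-force double loop is quadratic in chain length; a spatial index would be the usual choice at PDB scale, but the criterion itself stays the same.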

In a subsequent step the SPRING Map tool is applied, which uses PSI-BLAST [2] to detect close homologues in the PDB70 database for each PDB chain identifier listed in the columns of the lookup table. The identifiers of matching PDB70 entries are added in two additional columns to the lookup table. This allows us to apply HHsearch on a non-redundant subset of the PDB, containing entries with less than 70% sequence identity to each other. Although possible, expanding the monomeric threading database by including every PDB chain would significantly increase the database preparation time without improving the overall prediction performance. We used the PDB70 database issued on November 18, 2020, containing 58,900 entries. If a PSI-BLAST E-value threshold of zero is used, 257,698 interaction frameworks which exactly match the sequences in the PDB70 are detected. With an E-value threshold of 0.001, the resulting 4-column lookup table contained 900,772 interaction frameworks suitable for the monomers available in the PDB70 database.

Interaction prediction

The pipeline’s interaction prediction logic uses SPRING with HHsearch and TM-align, and was designed to exploit the redundancy of available protein–protein interfaces in order to predict and model novel protein tertiary structures.

Initially each query sequence Q is threaded by HHsearch against the PDB70 monomeric template library to identify a set of putative templates (T_i, i = 1,2,…) each associated with a Z-score (Z_i). The Z-score is defined as the number of standard deviations by which the raw alignment score differs from its mean. A higher Z-score indicates a higher significance and usually corresponds to a better alignment.
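The Z-score definition above can be written out as a short calculation. The sketch below assumes that the mean and standard deviation are taken over a background set of raw alignment scores (for example, the scores of one query against all templates); the text does not specify that set, and the function name `z_score` and the numbers are hypothetical.

```python
from statistics import mean, pstdev


def z_score(raw_score, background_scores):
    """Number of standard deviations by which raw_score exceeds the background mean."""
    mu = mean(background_scores)
    sigma = pstdev(background_scores)  # population standard deviation
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (raw_score - mu) / sigma


# Hypothetical raw alignment scores for one query against several templates.
background = [10.0, 12.0, 14.0, 16.0, 18.0]
z = z_score(30.0, background)
print(round(z, 2))
```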

Considering all possible target sequence pairs, the SPRING Min-Z tool uses the previously described 4-column lookup table to select interaction frameworks which are shared by the monomeric templates of a query pair. The Z-score of the framework is defined as Z_com which is the smaller of the two monomeric Z-scores. A more detailed description of this algorithm is provided in [10, 11].
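As a minimal sketch of the Min-Z idea, the snippet below looks for interaction frameworks whose two sides are both among the monomeric templates of a query pair and scores each framework by the smaller of the two Z-scores. It is not the SPRING Min-Z implementation: the lookup table is reduced to pairs of template identifiers, keeping the highest-scoring framework is an assumption, and all names and values are hypothetical.

```python
def best_framework(hits_a, hits_b, frameworks):
    """Return (z_com, (template_a, template_b)) for the best shared framework, or None.

    hits_a, hits_b: dict mapping template identifier -> monomeric Z-score
                    for query A and query B respectively.
    frameworks:     iterable of (template_x, template_y) pairs known to interact.
    """
    best = None
    for t_x, t_y in frameworks:
        # A framework can match the query pair in either orientation.
        for t_a, t_b in ((t_x, t_y), (t_y, t_x)):
            if t_a in hits_a and t_b in hits_b:
                z_com = min(hits_a[t_a], hits_b[t_b])  # smaller monomeric Z-score
                if best is None or z_com > best[0]:
                    best = (z_com, (t_a, t_b))
    return best


# Hypothetical threading results and lookup table.
hits_query_a = {"T1": 40.0, "T2": 12.0}
hits_query_b = {"T3": 30.0, "T4": 55.0}
lookup = [("T1", "T3"), ("T4", "T2"), ("T5", "T6")]

result = best_framework(hits_query_a, hits_query_b, lookup)
print(result)  # (30.0, ('T1', 'T3'))
```

Taking the minimum means a framework is only as trustworthy as its weaker monomeric match, so one strong hit cannot compensate for a poor one.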

Interaction validation

The accuracy of predicted protein–protein interactions is evaluated using the SPRING MCC tool. This tool compares the set of interactions from SPRING with interactions obtained from experimental methods contained in the Biological General Repository for Interaction Datasets (BioGRID) [28]. BioGRID is an open access database that contains protein interactions curated from primary biomedical literature for all major model organism species and Humans. The SPRING MCC tool accesses columns 24 and 27 of the BioGRID Tab 3.0 format, which contain the UniProt [36] accession identifiers of interacting protein pairs. The method only operates on interactions identified for sequences which are available in the UniProt database.
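Reading the two accession columns from a tab-delimited BioGRID export could look like the sketch below. Several points are assumptions rather than facts from the text: the column numbers are treated as 1-based, header lines are assumed to start with "#", a lone "-" is assumed to mark a missing accession, and the function `read_pairs` and the toy rows are invented for illustration.

```python
import csv
import io

COL_A, COL_B = 24, 27  # column numbers from the text, assumed to be 1-based


def read_pairs(handle):
    """Collect unordered accession pairs from a tab-delimited interaction table."""
    pairs = set()
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#") or len(row) < COL_B:
            continue  # skip blank lines, header lines and short rows
        acc_a, acc_b = row[COL_A - 1], row[COL_B - 1]
        if acc_a == "-" or acc_b == "-":
            continue  # assumed placeholder for a missing accession
        pairs.add(tuple(sorted((acc_a, acc_b))))
    return pairs


def toy_row(acc_a, acc_b):
    """Build a 27-column toy row with accessions in columns 24 and 27."""
    row = ["x"] * 27
    row[COL_A - 1], row[COL_B - 1] = acc_a, acc_b
    return "\t".join(row)


toy_table = "\n".join([
    toy_row("P12345", "Q67890"),
    toy_row("Q67890", "P12345"),  # same pair in reverse order
    toy_row("-", "P12345"),       # missing accession, skipped
])
pairs = read_pairs(io.StringIO(toy_table))
print(pairs)  # {('P12345', 'Q67890')}
```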

Initially, the SPRING MCC tool produces a “negative” data set of non-interacting protein pairs by randomly sampling protein–protein interaction pairs from the set of query protein sequences. If a UniProt localization file is provided, the non-interacting pairs can be determined by sampling protein sequences from different subcellular regions. This approach can reduce the false-negative rate of the resulting negative data set.
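The localization-aware sampling described above can be sketched as rejection sampling: draw random pairs and keep only those whose annotated subcellular locations differ. This is an illustration under simplifying assumptions (one location label per protein, a fixed random seed, hypothetical protein names) and not the SPRING MCC tool itself.

```python
import random


def sample_negatives(locations, n_pairs, seed=0, max_tries=100_000):
    """Sample up to n_pairs unordered protein pairs with different locations.

    locations: dict mapping protein identifier -> subcellular location label.
    """
    rng = random.Random(seed)
    proteins = sorted(locations)
    negatives = set()
    tries = 0
    while len(negatives) < n_pairs and tries < max_tries:
        tries += 1
        prot_a, prot_b = rng.sample(proteins, 2)  # two distinct proteins
        if locations[prot_a] != locations[prot_b]:
            negatives.add(tuple(sorted((prot_a, prot_b))))
    return negatives


# Hypothetical localization annotations.
toy_locations = {
    "P1": "nucleus",
    "P2": "nucleus",
    "P3": "membrane",
    "P4": "mitochondrion",
}
negative_set = sample_negatives(toy_locations, n_pairs=3)
print(sorted(negative_set))
```

The `max_tries` guard keeps the loop finite when fewer than `n_pairs` valid pairs exist.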

Subsequently, the protein–protein interaction sets identified by each experimental method are considered to be truly interacting, constituting the “positive” data sets for the cross-validation process. Each method is compared to all other methods using the positive data sets and a negative data set of equal size. The resulting Matthews correlation coefficients (MCC) are plotted using the Matplotlib library [12]. An example of such a plot is shown in Fig. 1, displaying the results for the Human and Yeast interactome validation. The legend lists the experimental methods used to determine the corresponding positive data set.
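For reference, the Matthews correlation coefficient can be computed from the four confusion-matrix counts with the standard formula, as in the sketch below. The counts are made up for illustration (a positive and a negative set of 100 pairs each); they are not results from this study, and returning 0.0 for an empty margin is a common convention rather than something stated in the text.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when any margin of the confusion matrix is empty
    return (tp * tn - fp * fn) / denominator


# Hypothetical example: 80 of 100 positives recovered, 10 of 100 negatives predicted.
example_mcc = mcc(tp=80, tn=90, fp=10, fn=20)
print(round(example_mcc, 3))
```

Using a negative set of the same size as the positive set, as described above, keeps the two classes balanced, so the coefficient is not dominated by the far larger number of non-interacting pairs in a real genome.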

Structural modeling

If a pair of proteins, Chain A and Chain B, is deemed to potentially interact (e.g. Z_com > 25), the complex structure is constructed by structurally aligning the top-ranked monomer templates of Chain A and Chain B to all putative interacting frameworks using the SPRING Model tool, which utilizes TM-align. The structural alignment is built on the subset of interface residues. The resulting models are evaluated by the recently established SPRING model score [38]:

$$\text{SM-score} = \text{TM-score} - w_{0}\, E_{\text{contact}}$$

where TM-score is the smaller TM-score returned by TM-align when aligning the top-ranked monomer models of Chain A and Chain B to the interaction framework; E_contact is a residue-specific, atomic contact potential derived from 3897 non-redundant structure interfaces from the PDB using the formula of RW [45]. The weight parameter w₀ is set to 0.01, a value obtained from a training set of protein complexes to maximize the modeling accuracy of the interface structures. The final model is evaluated for clashes and removed if more than 10% of the resulting C-alpha atom contacts share a distance of less than 5 Å between the interacting pair of protein structures.
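As a worked example of the score and the clash filter, the sketch below plugs in the values reported for the DPP4 model in the Results (TM-score of 0.62 and mean contact energy of -6.7), which reproduces the reported SM-score of 0.69. The function names are introduced here for illustration, and treating a model without any contacts as passing the filter is an assumption.

```python
W0 = 0.01             # weight parameter w0 from the text
CLASH_LIMIT = 0.10    # models with more than 10% clashing contacts are removed


def sm_score(tm_score, e_contact, w0=W0):
    """SPRING model score: TM-score minus the weighted contact energy."""
    return tm_score - w0 * e_contact


def passes_clash_filter(n_clashing, n_contacts, limit=CLASH_LIMIT):
    """Keep a model unless more than `limit` of its C-alpha contacts are clashes.

    n_clashing: contacts with a distance below 5 angstroms.
    n_contacts: all C-alpha contacts between the two chains.
    """
    if n_contacts == 0:
        return True  # assumption: nothing to clash with
    return n_clashing / n_contacts <= limit


# Values reported for the DPP4 model in the Results section.
dpp4_score = sm_score(tm_score=0.62, e_contact=-6.7)
print(round(dpp4_score, 2))  # 0.69

print(passes_clash_filter(5, 100))   # True, 5% of contacts clash
print(passes_clash_filter(11, 100))  # False, 11% of contacts clash
```

Because favourable contact energies are negative, subtracting the weighted term raises the score of models with well-packed interfaces.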

Availability of data and materials

We generated Galaxy workflows for (1) intra-genome and (2) inter-genome protein–protein interaction predictions. The source code including data preparation and interaction prediction logic is written in Python 3 and publicly available at https://github.com/guerler/springsuite under the GNU License. All methods can be executed locally or using the Galaxy web interface on usegalaxy.org and usegalaxy.eu. The data required to run the presented workflows is available at: https://usegalaxy.org/u/guerler/h/pdb70-latest and the workflow is available at: https://usegalaxy.org/u/guerler/w/interaction-prediction. All resulting models generated for the analysis are available at: https://usegalaxy.org/u/guerler/h/sars2-model.

Abbreviations

BioGRID: Biological General Repository for Interaction Datasets
DPP4: Dipeptidyl peptidase-4
MCC: Matthews correlation coefficient
MOG: Myelin-oligodendrocyte glycoprotein receptor
MOGAD: MOG antibody disease
Nsp3: Non-structural protein 3
PDB: Protein Data Bank

References

Afgan E, et al. The galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018. https://doi.org/10.1093/nar/gky379.
Article PubMed PubMed Central Google Scholar
Altschul S. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402. https://doi.org/10.1093/nar/25.17.3389.
Article CAS PubMed PubMed Central Google Scholar
Baek M, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373(6557):871–6. https://doi.org/10.1126/science.abj8754.
Article CAS PubMed PubMed Central Google Scholar
Benson DA, et al. GenBank. Nucleic Acids Res. 2014. https://doi.org/10.1093/nar/gku1216.
Article PubMed PubMed Central Google Scholar
Berman HM. The protein data bank. Nucleic Acids Res. 2000;28(1):235–42. https://doi.org/10.1093/nar/28.1.235.
Article CAS PubMed PubMed Central Google Scholar
Chandonia J-M, Brenner SE. The impact of structural genomics: expectations and outcomes. Science. 2006;311(5759):347–51. https://doi.org/10.1126/science.1121018.
Article CAS PubMed Google Scholar
Daczkowski CM, et al. Structurally guided removal of DeISGylase biochemical activity from papain-like protease originating from middle east respiratory syndrome coronavirus. J Virol. 2017. https://doi.org/10.1128/jvi.01067-17.
Article PubMed PubMed Central Google Scholar
Durinx C, et al. Molecular characterization of dipeptidyl peptidase activity in serum. Eur J Biochem. 2000;267(17):5608–13. https://doi.org/10.1046/j.1432-1327.2000.01634.x.
Article CAS PubMed Google Scholar
Eshaghi M, et al. Rational structure-based design of bright GFP-based complexes with tunable dimerization. Angew Chem. 2015;127(47):14158–62. https://doi.org/10.1002/ange.201506686.
Article Google Scholar
Gong W, et al. Integrating multimeric threading with high-throughput experiments for structural interactome of Escherichia Coli. J Mol Biol. 2020. https://doi.org/10.1101/2020.10.17.343962.
Article PubMed PubMed Central Google Scholar
Guerler A, et al. Mapping monomeric threading to protein–protein structure prediction. J Chem Inf Model. 2013;53(3):717–25. https://doi.org/10.1021/ci300579r.
Article CAS PubMed PubMed Central Google Scholar
Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9(3):90–5. https://doi.org/10.1109/mcse.2007.55.
Article Google Scholar
Jumper J, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:583–9.
Article CAS PubMed PubMed Central Google Scholar
Kezuka T, Ishikawa H. Diagnosis and treatment of anti-myelin oligodendrocyte glycoprotein antibody positive optic neuritis. Jpn J Ophthalmol. 2018;62(2):101–8. https://doi.org/10.1007/s10384-018-0561-1.
Article CAS PubMed Google Scholar
Kong LY, Yan LM, Zhang Y, Rao ZH. Crystal structure of IBV papain-like protease PLpro C101S mutant in complex with ubiquitin. 2015. to be published.
Kryshtafovych A, et al. Critical assessment of methods of protein structure prediction (CASP)—round XIII. Proteins Struct Funct Bioinform. 2019;87(12):1011–20. https://doi.org/10.1002/prot.25823.
Article CAS Google Scholar
Lei J, Hilgenfeld R. Structural and Mutational Analysis of the Interaction between the Middle-East Respiratory Syndrome Coronavirus (MERS-CoV) Papain-like Protease and Human Ubiquitin. Virol Sin. 2016;31(4):288–99. https://doi.org/10.1007/s12250-016-3742-4.
Article CAS PubMed PubMed Central Google Scholar
Letko M, et al. Adaptive evolution of MERS-CoV to species variation in DPP4. Cell Rep. 2018;24(7):1730–7. https://doi.org/10.1016/j.celrep.2018.07.045.
Article CAS PubMed PubMed Central Google Scholar
Li D, et al. A potent synthetic nanobody targets RBD and protects mice from SARS-CoV-2 infection. 2020. https://doi.org/10.21203/rs.3.rs-75540/v1.
Li H-D, et al. A network of splice isoforms for the mouse. Sci Rep. 2016. https://doi.org/10.1038/srep24507.
Article PubMed PubMed Central Google Scholar
Li T, Yao H, Cai H, et al. Structure of sybody SR4 in complex with the SARS-CoV-2 s receptor binding domain (RBD). Nat Commun. 2021;12:4635–4635. https://doi.org/10.1038/s41467-021-24905-z.
Article CAS PubMed PubMed Central Google Scholar
Li Y, et al. The MERS-CoV receptor DPP4 as a candidate binding target of the SARS-CoV-2 spike. IScience. 2020;23(8):101400. https://doi.org/10.1016/j.isci.2020.101400.
Article CAS PubMed PubMed Central Google Scholar
Lu G, et al. Molecular basis of binding between novel human coronavirus MERS-CoV and its receptor CD26. Nature. 2013;500(7461):227–31. https://doi.org/10.1038/nature12328.
Article CAS PubMed PubMed Central Google Scholar
Mcintosh CH, et al. Dipeptidyl peptidase IV inhibitors: how do they work as new antidiabetic agents? Regul Peptides. 2005;128(2):159–65. https://doi.org/10.1016/j.regpep.2004.06.001.
Article CAS Google Scholar
Mesecar AD, Chen Y. X-Ray structure of MHV PLP2 (Cys1716Ser) catalytic mutant in complex with free ubiquitin. 2018. https://doi.org/10.2210/pdb5wfi/pdb.
Mesleh MF, et al. Marmoset fine B cell and T cell epitope specificities mapped onto a homology model of the extracellular domain of human myelin oligodendrocyte glycoprotein. Neurobiol Dis. 2002;9(2):160–72. https://doi.org/10.1006/nbdi.2001.0474.
Article CAS PubMed Google Scholar
Naqvi AAT, et al. Insights into SARS-CoV-2 genome, structure, evolution, pathogenesis and therapies: structural genomics approach. Biochim Biophys Acta BBA Mol Basis Dis. 2020;1866(10):165878. https://doi.org/10.1016/j.bbadis.2020.165878.
Article CAS Google Scholar
Oughtred R, et al. The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci. 2020;30(1):187–200. https://doi.org/10.1002/pro.3978.
Article CAS PubMed PubMed Central Google Scholar
Prescott L. SARS-CoV-2 3CLpro whole human proteome cleavage prediction and enrichment/depletion analysis. 2020. https://doi.org/10.1101/2020.08.24.265645.
Rao VS, et al. Protein–protein interaction detection: methods and analysis. Int J Proteomics. 2014;2014:1–12. https://doi.org/10.1155/2014/147648.
Article CAS Google Scholar
Shi R, et al. Molecular basis for a potent human neutralizing antibody targeting SARS-CoV-2 RBD. 2020. https://doi.org/10.2210/pdb7c01/pdb.
Shin D, et al. Papain-like protease regulates SARS-CoV-2 viral spread and innate immunity. Nature. 2020;587(7835):657–62. https://doi.org/10.1038/s41586-020-2601-5.
Article CAS PubMed PubMed Central Google Scholar
Shoemaker BA, Panchenko AR. Deciphering protein–protein interactions. Part I. Experimental techniques and databases. PLoS Computat Biol. 2007;3(3):e43. https://doi.org/10.1371/journal.pcbi.0030042.
Article CAS Google Scholar
Song W, et al. Identification of residues on human receptor DPP4 critical for MERS-CoV binding and entry. Virology. 2014;471–473:49–53. https://doi.org/10.1016/j.virol.2014.10.006.
Article CAS PubMed Google Scholar
Steinegger M, et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinform. 2019. https://doi.org/10.1101/560029.
Article Google Scholar
Uniprot Consortium. The “UniProt: the Universal Protein Knowledgebase”. Nucleic Acids Res 2018;46(5):2699–2699. doi:https://doi.org/10.1093/nar/gky092.
Valencia I, et al. DPP4 and ACE2 in diabetes and COVID-19: therapeutic targets for cardiovascular complications? Frontn Pharmacol. 2020;11:598531. https://doi.org/10.3389/fphar.2020.01161.
Article CAS Google Scholar
Vangaveti S, et al. Integrating Ab initio and template-based algorithms for protein–protein complex structure prediction. Bioinformatics. 2019;36(3):751–7. https://doi.org/10.1093/bioinformatics/btz623.
Article CAS PubMed Central Google Scholar
Von Mering C. STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Res. 2004. https://doi.org/10.1093/nar/gki005.
Article PubMed PubMed Central Google Scholar
Wang N, et al. Structure of MERS-CoV spike receptor-binding domain complexed with human receptor DPP4. Cell Res. 2013;23(8):986–93. https://doi.org/10.1038/cr.2013.92.
Woodhall M, et al. Case report: myelin oligodendrocyte glycoprotein antibody-associated relapse with COVID-19. Front Neurol. 2020;11:598531. https://doi.org/10.3389/fneur.2020.598531.
Wu Y, et al. A non-competing pair of human neutralizing antibodies block COVID-19 virus binding to its receptor ACE2. Science. 2020. https://doi.org/10.1101/2020.05.01.20077743.
Yuan M, et al. Structural basis of a shared antibody response to SARS-CoV-2. Science. 2020;369(6507):1119–23. https://doi.org/10.1126/science.abd2321.
Zhang C, et al. Functions of essential genes and a scale-free protein interaction network revealed by structure-based function and interaction prediction for a minimal genome. J Proteome Res. 2021;20(2):1178–89. https://doi.org/10.1021/acs.jproteome.0c00359.
Zhang J, Zhang Y. A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction. PLoS ONE. 2010;5(10):e15386. https://doi.org/10.1371/journal.pone.0015386.
Zhang Y. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 2005;33(7):2302–9. https://doi.org/10.1093/nar/gki524.
Zhou M, et al. Cheminform abstract: current experimental methods for characterizing protein–protein interactions. ChemInform. 2016. https://doi.org/10.1002/chin.201624266.
Zhou P, et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature. 2020;579(7798):270–3. https://doi.org/10.1038/s41586-020-2012-7.
Klemm T, Ebert G, Calleja DJ, Allison CC, Richardson LW, Bernardini JP, Lu BG, Kuchel NW, Grohmann C, Shibata Y, Gan ZY, Cooney JP, Doerflinger M, Au AE, Blackmore TR, van der Heden van Noort GJ, Geurink PP, Ovaa H, Newman J, Riboldi-Tunnicliffe A, Czabotar PE, Mitchell JP, Feltham R, Lechtenberg BC, Lowes KN, Dewson G, Pellegrini M, Lessene G, Komander D. Mechanism and inhibition of the papain-like protease, PLpro, of SARS-CoV-2. EMBO J. 2020;39(18):e106275. https://doi.org/10.15252/embj.2020106275.

Acknowledgements

The authors are grateful to the broader Galaxy community for their support and software development efforts. Thanks to Jessica Hamby for proofreading the manuscript.

Funding

This work is funded by NIH grants U41 HG006620 and R01 AI134384 as well as by NSF ABI grant 1661497 to A.N. and M.S. Usegalaxy.eu is supported by the German Federal Ministry of Education and Research grants 031L0101C and de.NBI-epi to B.G. Galaxy integration is supported by NIH grant R01 AI134384 to A.N. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author information

Authors and Affiliations

Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
Aysam Guerler, Dannon Baker & Michael C. Schatz
Department of Biochemistry and Molecular Biology, Penn State University, University Park, PA, USA
Marius van den Beek, Dave Bouvier, Nate Coraor & Anton Nekrutenko
Department of Bioinformatics, Freiburg University, Freiburg, Germany
Bjoern Gruening
Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
Stephen D. Shank & Jordan D. Zehr

Authors

Aysam Guerler
Dannon Baker
Marius van den Beek
Bjoern Gruening
Dave Bouvier
Nate Coraor
Stephen D. Shank
Jordan D. Zehr
Michael C. Schatz
Anton Nekrutenko

Contributions

AG, MCS and AN conceived and designed the experiments. DB and MB configured the required infrastructure to perform the experiments. BG and DB automated pipeline testing and deployment. NC provided public instance support. SDS and JDZ analyzed biologically relevant findings. AG wrote the manuscript. All authors approved the manuscript for submission.

Corresponding author

Correspondence to Aysam Guerler.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

A.G., D.B., N.C. and A.N. are founders of and hold equity in GalaxyWorks, LLC. The results of the study discussed in this publication could affect the value of GalaxyWorks, LLC.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Guerler, A., Baker, D., van den Beek, M. et al. Fast and accurate genome-wide predictions and structural modeling of protein–protein interactions using Galaxy. BMC Bioinformatics 24, 263 (2023). https://doi.org/10.1186/s12859-023-05389-8

Received: 20 June 2022
Accepted: 15 June 2023
Published: 23 June 2023
DOI: https://doi.org/10.1186/s12859-023-05389-8
