Detection and sequence/structure mapping of biophysical constraints to protein variation in saturated mutational libraries and protein sequence alignments with a dedicated server
© The Author(s). 2016
Received: 1 March 2016
Accepted: 7 June 2016
Published: 17 June 2016
The Erratum to this article has been published in BMC Bioinformatics 2016 17:439
Protein variability can now be studied by measuring high-resolution tolerance-to-substitution maps and fitness landscapes in saturated mutational libraries. But these rich and expensive datasets are typically interpreted coarsely, restricting detailed analyses to positions of extremely high or low variability or dubbed important beforehand based on existing knowledge about active sites, interaction surfaces, (de)stabilizing mutations, etc.
Our new webserver PsychoProt (freely available without registration at http://psychoprot.epfl.ch or at http://lucianoabriata.altervista.org/psychoprot/index.html) helps to detect, quantify, and sequence/structure map the biophysical and biochemical traits that shape amino acid preferences throughout a protein as determined by deep-sequencing of saturated mutational libraries or from large alignments of naturally occurring variants.
We exemplify how PsychoProt helps to (i) unveil protein structure-function relationships from experiments and from alignments that are consistent with structures according to coevolution analysis, (ii) recall global information about structural and functional features and identify hitherto unknown constraints to variation in alignments, and (iii) point at different sources of variation among related experimental datasets or between experimental and alignment-based data. Remarkably, metabolic costs of the amino acids pose strong constraints to variability at protein surfaces in nature but not in the laboratory. This and other differences call for caution when extrapolating results from in vitro experiments to natural scenarios in, for example, studies of protein evolution.
We show through examples how PsychoProt can be a useful tool for the broad communities of structural biology and molecular evolution, particularly for studies about protein modeling, evolution and design.
Deep-sequencing technologies are revolutionizing structural and evolutionary biology by (i) providing experimental maps of tolerance to substitutions and fitness landscapes for proteins [1–20] and (ii) by boosting genomic projects thus increasing the coverage of protein families to an extent that even allows for structural modeling through analysis of residue coevolution [21–27].
In deep-sequencing-based measurements of tolerance to substitutions and fitness landscapes, saturated mutational libraries of a protein or segment of interest are selected for a trait and the retained variants are all sequenced at once to quantify the frequency of each amino acid at each position after selection or, in most cases, the change in frequency before and after selection. The datasets resulting from such complex and costly experiments are usually analyzed in terms of the specific questions the selection experiment was designed to answer, looking coarsely at the extent of conservation and variability throughout the subject protein and focusing structural and functional interpretations mainly on residues of exceptionally high or low tolerance to substitutions and on specific residues known to be critical for stability, catalytic activity, molecular binding, etc. However, these datasets are inherently rich in finer information, intrinsically related to how protein sequences can drift, evolve and be engineered. On the other hand, the fact that sequence coverage in protein families is good enough to derive 3D models of proteins and protein complexes from residue coevolution analysis suggests that other kinds of structural/functional information might be available from large structure-consistent alignments.
Here we present PsychoProt, a web-based tool to systematically extract quantitative biochemical/biophysical information about site-specific amino acid variability from deep-sequencing experiments of saturated mutational libraries or from protein sequence alignments. Specifically, PsychoProt unveils explanations for the amino acid preferences observed throughout a subject protein or segment of interest by finding out which amino acid descriptors shape the probability of observing each amino acid at each site, and maps these constraints on the protein’s sequence and structure. The descriptors, separated in sets that can be independently tested, include physicochemical properties of the amino acids, metrics for their metabolic costs and discrete numbers describing the atomic composition of their side chains.
We begin by describing PsychoProt’s main inputs, outputs and internal algorithms, and we then illustrate how to interpret PsychoProt’s results through examples where we have analyzed amino acid variation in deep-sequencing datasets of saturated mutational libraries and in alignments of naturally occurring proteins consistent with the proteins’ folds according to coevolution analyses. Five remarkable highlights arise from these examples. First, we uncover several sites under mechanical constraints at the core of a protein that undergoes refolding for its main function. Second, we note different physicochemical constraints acting on two variants of a nucleic acid-binding protein despite their high sequence identity. Third, we uncover extensive constraints on flexibility around the active site of a family of hydrolases with large substrate promiscuity. Fourth, we report different constraints acting on a protein upon in vitro experimentation in the laboratory compared to the same protein in a structure-consistent alignment of natural variants, particularly with a large number of surface sites that favor metabolically inexpensive amino acids in the natural variants but not in the experimental dataset. Fifth, we report for protein alignments pervasive constraints for hydrophobicity and hydrophilicity, followed by metabolic descriptors and then by descriptors related to volumes, steric hindrance and flexibility, in patterns that might be exploited to identify structural and functional features of proteins from alignments only.
PsychoProt works entirely online through a graphical interface available for free without registration at http://psychoprot.epfl.ch (or at http://lucianoabriata.altervista.org/psychoprot/index.html). Its input is in principle any dataset describing amino acid variation across a protein sequence or segment of interest, coming either from deep-sequencing of saturated mutational libraries or simply from large sequence alignments. The input must be formatted as a text table containing the preferences for each of the 20 amino acids (the amino acid “distributions”) at all positions (“sites”, or residues) of interest. A metric for the total tolerance to substitution k* for each site can optionally be provided too. The input table can be computed in any spreadsheet program and copied into Psychoprot’s input box. See Methods and the examples provided in the website for further details.
Algorithm and outputs
List of amino acid descriptors available in PsychoProt. Only sets 1 and 2 are recommended for normal use
Composite physicochemical descriptors
Log(solubility) x Flexibility
Steric hindrance x Flexibility
P(helix) + P(sheet)
log(solubility) x Hydrophobicity
Hydrophobicity x Flexibility
Volume/(P(helix) + P(sheet))
In vivo decay time
Cost for synthesis
Volume x Isoelectric Point
P(helix) x Flexibility
Number of Oxygen atoms in side chain
Number of Nitrogen atoms in side chain
Number of Sulfur atoms in side chain
Number of H-bond donors and acceptors in side chain
As a final remark before presenting actual results, we point out that treating each site independently in the fits could possibly offer too many degrees of freedom for finding correlations with amino acid descriptors. This is why we opted to keep the number of descriptors low, and we suggest the use of no more than sets 1 and 2 whose descriptors are low-correlated to each other as well as the use of the Bonferroni correction. Naturally, as with any automated protocol for data analysis, knowledgeable interpretation of the results in the frame of general protein biophysics and considering aspects specific to the subject protein is important.
Results and discussion
We next exemplify specific applications of PsychoProt to real-world datasets, through analyses of amino acid variability in deep-sequence datasets about mutational tolerance in human influenza hemagglutinin , in two close variants of human influenza ribonucleoprotein  and in a TEM β-lactamase , the latter compared to results on a structure-consistent alignment of naturally occurring lactamases; and on structure-consistent alignments of a soluble zinc-dependent β-lactamase and of a tetrameric transmembrane aquaporin.
Tolerance to substitutions in a viral surface protein that undergoes functional refolding upon interaction with a receptor
Motivated by the ability of the human influenza virus to escape immunity through mutations of its HA protein, Thyagarajan et al. used deep sequencing to evaluate the functional effects of all amino acid substitutions in this protein . They found that HA’s surface residues are more tolerant to substitution than buried residues, as reported for other systems [4, 5]; and that among surface residues there is lower tolerance to substitutions at receptor-binding sites than at antigenic sites, the later related to the virus’ ability to escape the immune response through mutation. Thyagarajan et al. noticed that receptor-binding sites are less solvent exposed than antigenic sites, which could explain their lower tolerance to substitutions, but through a formal test they showed that receptor-binding sites are less tolerant of substitutions than other buried sites.
The initial look at receptor and antigenic sites in the work by Thyagarajan et al. was driven by their low site entropies. These site entropies take values in the range from 1.58 to 4.16 for this dataset, which are not straightforward to interpret. If one provides PsychoProt only the amino acid preferences as inputs, it estimates the total tolerance to substitutions (k*) based on equation 2 returning a metric defined in the range from 1 (only one amino acid tolerated) to 20 (all amino acids equally tolerated). The recomputed tolerances (Additional file 1: Table S1) confirm that antigenic sites are indeed highly tolerant of substitutions (with only one site having low tolerance out of 28) and show in a straightforward way that receptor-binding sites are very intolerant of substitutions, with two sites having k* = 10.7 and 16.2 and the rest all ranging between 1 and 2.2.
Summary of explained and unexplained sites in the four experimental datasets of protein variation, having run PsychoProt with only physicochemical descriptors
Source of data
Thyagarajan et al. eLife 
Doud et al. Mol Biol Evol 
Doud et al. Mol Biol Evol 
From Deng et al. 
Sites explained by some descriptor
Unexplained of very low tolerance (k* < 4)
Unexplained of very high tolerance (k* > 16)
Unexplained of not extreme tolerance to substitutions
Similarities and differences between two variants of a viral nucleic acid-binding nucleoprotein
With the goal of investigating whether amino acid preferences are similar in close protein homologs, Doud et al. interrogated through deep sequencing two strains of the human influenza ribonucleoprotein (RNP variants 1934 and 1968) separated by three decades of evolution and 30 residue substitutions . They found that 14, or possibly up to a maximum of 72, out of the 497 sites analyzed exhibit significantly different amino acid preferences between the two RNP variants, suggesting that site-specific amino acid preferences are quite conserved during short periods of evolution.
Using PsychoProt one can further interrogate to what extent the underlying physicochemical, structural and functional constraints remain the same, or vary, between the protein variants of the two strains. Applying PsychoProt with standard settings and focusing only on physicochemical properties from Set 1, 198 and 197 sites were fitted to at least one descriptor for variants 1934 and 1968, respectively, i.e. ~40 % of their sites (Table 2). 139 of these sites were fitted to at least one descriptor in both variants, accounting for 28 % of the protein length with a similar distribution throughout the sequence as shown in Additional file 1: Figure S4. This means there are 59 sites that were fitted to at least one descriptor in variant 1934 but to none in variant 1968, and 58 sites that were fitted to at least one descriptor in variant 1968 but to none in variant 1934. While some of these differences might reflect true differences in the underlying constraints, they can also be due to noise in the input data large enough to blur trends in one dataset but not in the other, especially given the large size of this protein. Similarly, segments of contiguous sites that were not fitted to any descriptor in any dataset (Additional file 1: Figure S4) likely arise from regions with lower number of measurements and hence higher noise, together with extremely high or low tolerance to substitutions (~20-25 %, as shown in Table 2), intricate dependencies of the amino acid preferences on multiple physicochemical descriptors, and sources of variation arising from effects at transcription, RNA stabilization and degradation, translation, protein degradation, and other levels .
Another interesting observation is that isoelectric point is the most frequent best descriptor in both datasets. Most of the fits against isoelectric point show indeed positive trends, consistent with positively charged regions required in the RNA binding groove (Fig. 3b). Two smaller surface patches with several sites that follow negative trends against isoelectric point, i.e. which favor the presence of negatively charged residues, might be related to the interactions of RNP with other proteins or could serve to compensate the positive charges and avoid an excessively high isoelectric point.
Natural vs. in vitro variability in TEM lactamases
TEM lactamases are globular, soluble enzymes that confer bacterial resistance towards β-lactams and stand as workhorse systems to study protein evolution in the laboratory through deep sequencing and by analyzing their extensive natural variability [1–4, 43–46]. We used PsychoProt to analyze an alignment of natural β-lactamases whose information content is consistent with the structure of the TEM-1 lactamase as shown by the EVFold server  (Additional file 1: Figure S6), and compared these results to those on high-resolution deep-sequence data obtained after in vitro selection of a saturated library of TEM-1 mutants against ampicillin .
In both datasets, most dependencies on hydrophobicity are negative (~80 %) and map to the protein surface indicating that hydrophilic residues are favored, while positive dependencies on hydrophobicity map to buried locations, as expected for soluble proteins. The negative dependencies on hydrophobicity in the alignment-based data are well spread on the surface but quite excluded from a 10 Å sphere around the active site (Fig. 4c), which has an overall low tolerance to substitutions. All sites shaped by metabolic descriptors follow negative trends, such that metabolically inexpensive and low-turnover amino acids are preferred. They mostly map to the protein surface, but contrary to sites shaped by negative trends on hydrophobicity, they appear less excluded from the active site (Fig. 4d). Also, they concentrate in patches that could possibly correspond to regions of relatively low importance to the protein’s stability and activity.
In summary, the in vitro data seems to capture the constraints on hydrophobicity that act on the natural variants, but it does not capture the effects of amino acid biochemical costs, possibly because the experiments are carried out in rich media with controlled conditions and no limitations on carbon and nitrogen sources. These and other minor differences highlighted by PsychoProt indicate important alterations in the constraints that underlie variability in the two datasets. This observation calls for caution when extrapolating conclusions from experimental tolerance-to-substitution maps and fitness landscapes to the natural scenario, adding to a series of potential problems recently discussed . Specifically from this example, sequence constraints related to improving protein solubility and minimizing the use of metabolically expensive amino acids seem to be much more important in nature than in vitro, consistent with the previous finding that amino acid metabolism conflicts with protein variation .
Two further examples on the physicochemical factors underlying variation in structure-consistent alignments
Summary of results from the three analyses of structure-consistent protein alignments, using PsychoProt with standard parameters and both descriptor sets 1 (physicochemical properties) and 2 (metabolic descriptors)
Soluble/Membrane and oligomerization state
Source of alignment
From EVFold server
From EVFold server
From Ovchinnikov et al. eLife 
Sites explained by some descriptor
Two metabolic descriptors
Unexplained of very low tolerance (k* < 4)
Unexplained of very high tolerance (k* > 16)
Unexplained of not extreme tolerance to substitutions
After having investigated an alignment of TEM lactamases above, we investigate in this section a globular, soluble, zinc-dependent β-lactamase, BcII, widely used as a model system to understand structure, dynamics, catalysis and substrate profiles in metallo-β-lactamases  (Fig. 5a). On BcII’s sequence, the EVFold server  returns a large alignment whose couplings recapitulate fairly well the protein’s contact map (Additional file 1: Figure S7). PsychoProt analysis of this alignment scanning physicochemical and metabolic descriptors (Sets 1 and 2) returns fits for 69.6 % of the residues. As observed for TEM-1 datasets, hydrophobicity dominates the trends, again with a large network of surface-exposed sites that favor hydrophilic amino acids and a cluster of buried residues that favor hydrophobic amino acids. Also for BcII, PsychoProt reveals several surface residues whose preferences are such that they favor metabolically inexpensive amino acids, and like for TEM-1, they seem to gather and reach into the active site slightly more than negative trends on hydrophobicity. We further observe several sites constrained by trends that translate to favoring flexibility, namely either positive trends on flexibility itself or negative trends on volume and secondary structure propensities (Fig. 5a, bottom). All but two of these residues map to the half of the protein that contains the active site, much like in the pattern of slow-timescale dynamics observed through NMR experiments in BcII variants evolved for an extended substrate profile  (inset in Fig. 5a, residues given in Additional file 1: Table S3). This is noteworthy because the family of protein sequences in the alignment includes a large number of zinc-based hydrolases of the metallo-β-lactamase fold with varied substrates. These bioinformatic and experimental findings suggest that at least this family of enzymes exploits roughly half of the fold as a stability reservoir allowing the other half, which contains the active site, to be more fluxional, reminiscent of the idea of fold polarity in enzymes . Also interestingly, the two only residues requiring flexibility far from the active site (Ser49 and Asn137, both also identified as dynamic by NMR in the extended substrate mutant) are far from each other in BcII’s sequence but contact each other in the structure (Fig. 5a).
Finally, we looked at aquaporin-0, a homotetrameric transmembrane protein whose monomers passively transport water across membranes with a number of physiological roles (Fig. 5b). Coevolution analysis on a large alignment of natural sequences with GREMLIN  and modeling with Rosetta resulted in a structure essentially identical to the crystallographic one in a recent benchmark by the Baker group . Analysis of that alignment in PsychoProt returns fits for 61.1 % of the sites. Hydrophobicity accounts for half of the detected trends, but as expected for integral membrane proteins and opposed to what we observed for the soluble proteins, most of these trends are positive (i.e. favor hydrophobic amino acids) and correspond to the exposed residues that match the hydrophobic portion of the membrane. Negative trends on hydrophobicity gather mostly on the cytoplasmic side of the protein, while metabolic trends (again all negative, favoring inexpensive amino acids) map mostly to the cell exterior. Several sites constrained by small volumes and high flexibility map to the lumen of each monomer, through which water flows.
Interesting conclusions emerge from joint analysis of the three PsychoProt runs on structure-consistent alignments (Table 3). In the three cases, the amino acid preferences for 60-70 % of the sites can be explained by physicochemical and metabolic descriptors of the amino acids. Of the ~30-40 % of sites not fitted to any descriptor, ~25 % have very high or low tolerance to substitutions (k* > 16 or k* < 4, respectively), the latter pointing at active sites and interaction surfaces. But the remaining 75 % of sites that were not fitted (~30 % of the total number of residues in the proteins) display intermediate tolerance to substitutions. The lack of fits for these sequence-based examples could be due to multiple reasons. From the protein side, it can be due to very intricate dependencies of the amino acid preferences on physicochemical descriptors or from epistatic effects between sites. Also coevolution effects can impact, such that many pairs of residues vary under the strict constraint of maintaining interactions but since interactions can arise from diverse physicochemical properties (salt bridges, hydrophobic packing, hydrogen bonds, etc.) then no single property is especially preferred. As a more extreme consequence of coevolution effects, the “evolutionary Stokes shift”, i.e. the mechanism by which amino acid preferences at certain sites re-adapt to accepted changes at other sites , is also expected to blur correlations with amino acid descriptors. (Note that effects related to coevolution apply only to alignment-based data but essentially not to mutational library data because the latter deal largely with single substitutions). Additionally, the lack of fits both for alignment and mutational data can also have contributions from effects at the levels of transcription, RNA stabilization and degradation, translation, protein degradation, and other effects not directly relevant to the folded protein . Notably, the best biophysical models for rates of evolution among sites within proteins (a metric related to, but different than, amino acid preferences) can also explain ~60 % of the observed variance .
Relationship to studies of protein modeling, design and evolution
The kinds of results arising from these studies are strongly related to principles of protein modeling and design, and to those of protein evolution, as highlighted in a recent collection of works at the interface of protein science and molecular evolution . Concerning modeling and design, for example, we learned in the previous section that constraints on hydrophobicity related to the compactness of soluble globular proteins and their surface polarity, or to membrane insertion in transmembrane proteins, are dominant and might therefore be used as additional information for alignment-based modeling of proteins by helping define buried and exposed regions. Constraints from volumes, flexibility and steric hindrance seem more related to functional aspects and could hence be used to approximate the location of active sites and interacting surfaces in protein structures and models, together with metrics for conservation gradients .
Regarding the field of protein evolution, careful consideration of the dominant structural, functional and biochemical constraints that shape protein sequences is increasingly recognized as crucial for obtaining models of protein evolution better based on biological, physical and chemical principles [52, 55, 56]. While coarse effects related to protein stability and solvent accessibility have been well demonstrated and are increasingly integrated into evolutionary models [57–59], PsychoProt analyses help to find out finer constraints to the identity of the amino acid required/tolerated at each site. We note though that being designed with a structural goal in mind, PsychoProt does not deal with the underlying evolutionary processes and phylogenetics, which is the specialty of programs like TreeSAAP [60, 61]. Also, PsychoProt does not account for intragenic epistasis, effects of coevolution and the concomitant evolutionary shifts, and constraints that do not act on folded proteins themselves but rather on their translation, folding mechanisms, propensity to degradation, transcription and turnover of their mRNAs, etc.  Hopefully, joint consideration of all these elements on top of the site-specific effects detected by PsychoProt within a detailed phylogenetic framework will lead in the future to our ultimate understanding of molecular evolution ab initio in terms of physical and chemical principles, opening in turn the door to full control during protein modeling and design.
Core algorithm in PsychoProt
For each protein residue with available data, the selected set(s) of amino acid descriptors are screened for correlations by fitting each of them to the observed distribution of amino acids. Fits are retained depending on the p-value associated to the probability that they were observed by chance, considering the Bonferroni correction for multiple testing according to the number of descriptors being screened. One picked descriptor is replaced later on by another that also satisfies the p-value cutoff only if the improvement is significant according to F-statistics about the decrease of the RMSE between experimental and back-predicted values. By default, linear relationships are fitted, which would lead to monotonically increasing or decreasing patterns. However, the user can choose to search for quadratic dependencies as well, which could help to identify optimal or worst values for a descriptor, and in practice results also in curved yet near-monotonically increasing or decreasing relationships.
The list of descriptors currently available in PsychoProt was retrieved from previous works [36, 62, 63] selected such that they have low correlations to each other. When multiple sets of descriptors are selected for screening, PsychoProt scans them in the order from Set 1 to 4, to ensure that combinations of descriptors (Set 3) and discrete descriptors (Set 4) are only picked if they truly give a statistically better fit than the simple descriptors.
PsychoProt’s Inputs and Outputs
The input to PsychoProt is a matrix in text format made up of as many rows as residues for which data is available, and at least 22 tab-delimited columns: reference amino acid in one-letter code, residue number, and preferences for all 20 amino acids. If available, an additional column can include a parameter that measures the total tolerance to substitutions (k*) at each site. If that is not provided, PsychoProt will estimate it from the amino acid preferences using equation 2. In another optional column the user can include any other data as needed. Sample input datasets are given in the website for a deep-sequencing based-study of TEM-1 lactamase and for a structure-consistent alignment of this protein.
Deep-sequence data must be externally converted to a text-format matrix by the user, for example with a spreadsheet program (tables can be directly pasted to PsychoProt or saved to a text file and then loaded). For the analysis of variability in protein alignments, an auxiliary web tool computes a ΔΔG i,j matrix from the alignment in one step (including the tolerance to substitutions for each residue) and sends it straight to the main analysis tool.
As a main result, displayed under “Core Results (Fits)”, PsychoProt delivers two lists; one with all the fits that satisfy the r, RMSE and p-value limits, and another containing only the best fits for each residue (Fig. 1 of the main paper and Additional file 1: Figure S1). Clicking on items of these lists shows plots of experimental preferences against the selected amino acid descriptor, and the relevant fits (Fig. 1). In these plots, the negative of the amino acid preferences is plotted, such that more favored residues get higher values when the ΔΔG metric from equation 1 is used.
The ribbon of buttons right below the lists and plot offers an easy way to export data in text format (“Save output…”) and toggles between plots. One such plot interactively maps which descriptors were selected at different positions of the sequence (“Map on protein sequence”) with the possibility to show positive, negative or both trends, compare results with the total tolerance to substitutions and with additional data provided in the entry matrix. Another plot summarizes how frequently was each descriptor picked (“Summary by Descriptor”) either singly or grouped, etc. Last, if a PDB structure file is available for the protein or a closely related protein in the Protein Databank, PsychoProt features an interactive tool (under “Map on 3D structure”) to map the fits on the 3D structure of the protein, augmented with all the features of JSmol  and with an easy way to extract selected residues for external visualization. This tool is intended for on-the-fly mapping of results; however, the user might find it easier to do structural analysis. To facilitate this, a small text box gives prebuilt selections of amino acid residues in formats that the programs VMD and PyMOL understand (to the left of the JSmol widget, under the list of shown residues).
Data sources for the examples
The example about natural variability in TEM lactamases corresponds exactly to that set by the button “TEM-1 structure-consistent alignment” in the webpage for the main program. This dataset comes from an alignment of proteins retrieved by EVFold’s Jackhammer module using the sequence of the mature portion of the TEM-1 β-lactamase and default parameters for the search . According to EVFold, this alignment contains a large amount of residue coevolution information that is highly consistent with the known structure (Additional file 1: Figure S6), and by design, it has been purged of redundant sequences. The example about in vitro variability in TEM lactamases corresponds exactly to that set by the button “TEM-1 deep-sequencing”, and is the experimental dataset by Deng et al.  The examples about deep-sequence data for human influenza hemagglutinin and nucleoprotein were produced by the Bloom lab [7, 19]. The antigenic and receptor-binding residues of the hemagglutinin protein were defined as by Thyagarajan et al. . The sequence alignments and coevolution-predicted contact maps for TEM-1 lactamase and BcII metallo-β-lactamase were obtained with the EVCouplings server (http://evfold.org/). The sequence alignment for aquaporin-0 corresponds to the article cited in the main text and was downloaded from the Gremlin website (http://gremlin.bakerlab.org/). Tools used to compare evolutionary couplings and structures are available at http://lucianoabriata.altervista.org/evocoupdisplay/gremlin.html and http://lucianoabriata.altervista.org/evocoupdisplay/evcouplings.htm.
BcII, Bacillus cereus metallo-βlactamase II; HA, hemagglutinin; RNP, ribonucleoprotein
We thank Jacques Rougemont (Bioinformatics and Biostatistics Core Facility, EPFL), Victor Panaretos (Chair of Mathematical Statistics, EPFL) and María Eugenia Zaballa (School of Life Sciences, EPFL) for discussions, and the latter also for proofreading the manuscript. LAA thanks EMBO and the Marie Curie actions for a Long-Term postdoctoral fellowship during which this work was started.
MDP is funded by the Swiss National Science Foundation. LAA was a recipient of an EMBO Long-Term Fellowship when this work started.
Availability of data and material
Additional data supporting the findings is available in the Supplementary Material. Raw data is available as examples in the PsychoProt website, in the supporting information of other articles and in the Gremlin website, each as indicated throughout the article and in “Data sources for the examples” by the end of the Methods section.
LAA and MDP conceived and directed the study. LAA and CB implemented the website. LAA analyzed all data and drafted the manuscript. The three authors read and approved the final manuscript.
MDP is a Professor and group leader of the Laboratory for Biomolecular Modeling (http://lbm.epfl.ch/) at École Polytechnique Fédérale de Lausanne and a group leader at the Swiss Institute of Bioinformatics, in Lausanne, Switzerland. CB is a former member of the Laboratory for Biomolecular Modeling. LAA (http://lucianoabriata.altervista.org/) is a Scientist at the Laboratory for Biomolecular Modeling.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Firnberg E, Labonte JW, Gray JJ, Ostermeier M. A comprehensive, high-resolution map of a gene’s fitness landscape. Mol Biol Evol. 2014;31(6):1581–92.View ArticlePubMedPubMed CentralGoogle Scholar
- Stiffler MA, Hekstra DR, Ranganathan R. Evolvability as a Function of Purifying Selection in TEM-1 β-Lactamase. Cell. 2015;160(5):882–92.View ArticlePubMedGoogle Scholar
- Jacquier H, Birgy A, Le Nagard H, Mechulam Y, Schmitt E, Glodt J, et al. Capturing the mutational landscape of the beta-lactamase TEM-1. Proc Natl Acad Sci U S A. 2013;110(32):13067–72.View ArticlePubMedPubMed CentralGoogle Scholar
- Deng Z, Huang W, Bakkalbasi E, Brown NG, Adamski CJ, Rice K, et al. Deep sequencing of systematic combinatorial libraries reveals β-lactamase sequence constraints at high resolution. J Mol Biol. 2012;424(3–4):150–67.View ArticlePubMedPubMed CentralGoogle Scholar
- Roscoe BP, Thayer KM, Zeldovich KB, Fushman D, Bolon DNA. Analyses of the effects of all ubiquitin point mutants on yeast growth rate. J Mol Biol. 2013;425(8):1363–77.View ArticlePubMedPubMed CentralGoogle Scholar
- Podgornaia AI, Laub MT. Protein evolution. Pervasive degeneracy and epistasis in a protein-protein interface. Science. 2015;347(6222):673–7.View ArticlePubMedGoogle Scholar
- Thyagarajan B, Bloom JD. The inherent mutational tolerance and antigenic evolvability of influenza hemagglutinin. eLife. 2014;3.Google Scholar
- Hietpas R, Roscoe B, Jiang L, Bolon DNA. Fitness analyses of all possible point mutations for regions of genes in yeast. Nat Protoc. 2012;7(7):1382–96.View ArticlePubMedPubMed CentralGoogle Scholar
- Hecht M, Bromberg Y, Rost B. News from the protein mutability landscape. J Mol Biol. 2013;425(21):3937–48.View ArticlePubMedGoogle Scholar
- Olson CA, Wu NC, Sun R. A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain. Curr Biol CB. 2014;24(22):2643–51.View ArticlePubMedGoogle Scholar
- Fowler DM, Stephany JJ, Fields S. Measuring the activity of protein variants on a large scale using deep mutational scanning. Nat Protoc. 2014;9(9):2267–84.View ArticlePubMedPubMed CentralGoogle Scholar
- Araya CL, Fowler DM. Deep mutational scanning: assessing protein function on a massive scale. Trends Biotechnol. 2011;29(9):435–42.View ArticlePubMedPubMed CentralGoogle Scholar
- Fowler DM, Fields S. Deep mutational scanning: a new style of protein science. Nat Methods. 2014;11(8):801–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Fowler DM, Araya CL, Fleishman SJ, Kellogg EH, Stephany JJ, Baker D, et al. High-resolution mapping of protein sequence-function relationships. Nat Methods. 2010;7(9):741–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Melnikov A, Rogov P, Wang L, Gnirke A, Mikkelsen TS. Comprehensive mutational scanning of a kinase in vivo reveals substrate-dependent fitness landscapes. Nucleic Acids Res. 2014;42(14), e112.View ArticlePubMedPubMed CentralGoogle Scholar
- Melamed D, Young DL, Gamble CE, Miller CR, Fields S. Deep mutational scanning of an RRM domain of the Saccharomyces cerevisiae poly(A)-binding protein. RNA N Y N. 2013;19(11):1537–51.View ArticleGoogle Scholar
- Qi H, Olson CA, Wu NC, Ke R, Loverdo C, Chu V, et al. A quantitative high-resolution genetic profile rapidly identifies sequence determinants of hepatitis C viral fitness and drug sensitivity. PLoS Pathog. 2014;10(4), e1004064.View ArticlePubMedPubMed CentralGoogle Scholar
- Al-Mawsawi LQ, Wu NC, Olson CA, Shi VC, Qi H, Zheng X, et al. High-throughput profiling of point mutations across the HIV-1 genome. Retrovirology. 2014;11(1):124.View ArticlePubMedPubMed CentralGoogle Scholar
- Doud MB, Ashenberg O, Bloom JD. Site-Specific Amino Acid Preferences Are Mostly Conserved in Two Closely Related Protein Homologs. Mol Biol Evol. 2015;32(11):2944–60.View ArticlePubMedPubMed CentralGoogle Scholar
- Romero PA, Tran TM, Abate AR. Dissecting enzyme function with microfluidic-based deep mutational scanning. Proc Natl Acad Sci U S A. 2015;112(23):7159–64.View ArticlePubMedPubMed CentralGoogle Scholar
- Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, et al. Protein 3D structure computed from evolutionary sequence variation. PloS One. 2011;6(12), e28766.View ArticlePubMedPubMed CentralGoogle Scholar
- Morcos F, Pagnani A, Lunt B, Bertolino A, Marks DS, Sander C, et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci U S A. 2011;108(49):E1293–1301.View ArticlePubMedPubMed CentralGoogle Scholar
- Kamisetty H, Ovchinnikov S, Baker D. Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci U S A. 2013;110(39):15674–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Ovchinnikov S, Kamisetty H, Baker D. Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. eLife. 2014;3, e02030.View ArticlePubMedPubMed CentralGoogle Scholar
- Ovchinnikov S, Kinch L, Park H, Liao Y, Pei J, Kim DE, et al. Large-scale determination of previously unsolved protein structures using evolutionary information. eLife. 2015;4.Google Scholar
- Hopf TA, Colwell LJ, Sheridan R, Rost B, Sander C, Marks DS. Three-dimensional structures of membrane proteins from genomic sequencing. Cell. 2012;149(7):1607–21.View ArticlePubMedPubMed CentralGoogle Scholar
- Hopf TA, Schärfe CPI, Rodrigues JPGLM, Green AG, Kohlbacher O, Sander C, et al. Sequence co-evolution gives 3D contacts and structures of protein complexes. eLife. 2014;3, e03430.View ArticlePubMed CentralGoogle Scholar
- Lockless SW, Ranganathan R. Evolutionarily conserved pathways of energetic connectivity in protein families. Science. 1999;286(5438):295–9.View ArticlePubMedGoogle Scholar
- Magliery TJ, Regan L. Sequence variation in ligand binding sites in proteins. BMC Bioinforma. 2005;6:240.View ArticleGoogle Scholar
- Magliery TJ, Regan L. Beyond consensus: statistical free energies reveal hidden interactions in the design of a TPR motif. J Mol Biol. 2004;343(3):731–45.View ArticlePubMedGoogle Scholar
- Tamuri AU, dos Reis M, Goldstein RA. Estimating the distribution of selection coefficients from phylogenetic data using sitewise mutation-selection models. Genetics. 2012;190(3):1101–15.View ArticlePubMedPubMed CentralGoogle Scholar
- Halpern AL, Bruno WJ. Evolutionary distances for protein-coding sequences: modeling site-specific residue frequencies. Mol Biol Evol. 1998;15(7):910–7.View ArticlePubMedGoogle Scholar
- Rodrigue N. On the statistical interpretation of site-specific variables in phylogeny-based substitution models. Genetics. 2013;193(2):557–64.View ArticlePubMedPubMed CentralGoogle Scholar
- Shaffer JP. Multiple Hypothesis Testing. Annu Rev Psychol. 1995;46(1):561–84.View ArticleGoogle Scholar
- Abriata LA, Palzkill T, Dal Peraro M. How structural and physicochemical determinants shape sequence constraints in a functional enzyme. PloS One. 2015;10(2),e0118684.Google Scholar
- Krick T, Verstraete N, Alonso LG, Shub DA, Ferreiro DU, Shub M, et al. Amino Acid metabolism conflicts with protein diversity. Mol Biol Evol. 2014;31(11):2905–12.View ArticlePubMedPubMed CentralGoogle Scholar
- Sriwilaijaroen N, Suzuki Y. Molecular basis of the structure and function of H1 hemagglutinin of influenza virus. Proc Jpn Acad Ser B Phys Biol Sci. 2012;88(6):226–49.View ArticlePubMedPubMed CentralGoogle Scholar
- Hamilton BS, Whittaker GR, Daniel S. Influenza virus-mediated membrane fusion: determinants of hemagglutinin fusogenic activity and experimental approaches for assessing virus fusion. Viruses. 2012;4(7):1144–68.View ArticlePubMedPubMed CentralGoogle Scholar
- Plemper RK. Cell entry of enveloped viruses. Curr Opin Virol. 2011;1(2):92–100.View ArticlePubMedPubMed CentralGoogle Scholar
- Carr CM, Kim PS. A spring-loaded mechanism for the conformational change of influenza hemagglutinin. Cell. 1993;73(4):823–32.View ArticlePubMedGoogle Scholar
- Xu R, Wilson IA. Structural characterization of an early fusion intermediate of influenza virus hemagglutinin. J Virol. 2011;85(10):5172–82.View ArticlePubMedPubMed CentralGoogle Scholar
- Weatheritt RJ, Babu MM. Evolution. The hidden codes that shape protein evolution. Science. 2013;342(6164):1325–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Abriata LA, Salverda MLM, Tomatis PE. Sequence-function-stability relationships in proteins from datasets of functionally annotated variants: the case of TEM β-lactamases. FEBS Lett. 2012;586(19):3330–5.View ArticlePubMedGoogle Scholar
- Thai QK, Bös F, Pleiss J. The Lactamase Engineering Database: a critical survey of TEM sequences in public databases. BMC Genomics. 2009;10:390.View ArticlePubMedPubMed CentralGoogle Scholar
- Figliuzzi M, Jacquier H, Schug A, Tenaillon O, Weigt M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Mol Biol Evol. 2015;6.Google Scholar
- Bratulic S, Gerber F, Wagner A. Mistranslation drives the evolution of robustness in TEM-1 β-lactamase. Proc Natl Acad Sci U S A. 2015;112(41):12758–63.View ArticlePubMedPubMed CentralGoogle Scholar
- Boucher JI, Bolon DNA, Tawfik DS. Quantifying and understanding the fitness effects of protein mutations: Laboratory versus nature. Prot Sci. 2016;24.Google Scholar
- Meini M-R, Llarrull LI, Vila AJ. Evolution of Metallo-β-lactamases: Trends Revealed by Natural Diversity and in vitro Evolution. Antibiotics. 2014;3(3):285–316.View ArticlePubMedPubMed CentralGoogle Scholar
- González MM, Abriata LA, Tomatis PE, Vila AJ. Optimization of Conformational Dynamics in an Epistatic Evolutionary Trajectory. Mol Biol Evol. 2016 Mar 15 pii:msw052. [Epub ahead of print].Google Scholar
- Dellus-Gur E, Toth-Petroczy A, Elias M, Tawfik DS. What makes a protein fold amenable to functional innovation? Fold polarity and stability trade-offs. J Mol Biol. 2013;425(14):2609–21.View ArticlePubMedGoogle Scholar
- Pollock DD, Thiltgen G, Goldstein RA. Amino acid coevolution induces an evolutionary Stokes shift. Proc Natl Acad Sci U S A. 2012;109(21):E1352–1359.View ArticlePubMedPubMed CentralGoogle Scholar
- Echave J, Spielman SJ, Wilke CO. Causes of evolutionary rate variation among protein sites. Nat Rev Genet. 2016;17(2):109–21.View ArticlePubMedPubMed CentralGoogle Scholar
- Bolon DNA, Baker D, Tawfik DS. Editorial. Protein Sci Publ Protein Soc. 2016 May 23 doi:10.1002/pro.2949. [Epub ahead of print]
- Jack BR, Meyer AG, Echave J, Wilke CO. Functional Sites Induce Long-Range Evolutionary Constraints in Enzymes. PLoS Biol. 2016;14(5), e1002452.View ArticlePubMedPubMed CentralGoogle Scholar
- Liberles DA, Teichmann SA, Bahar I, Bastolla U, Bloom J, Bornberg-Bauer E, et al. The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci. 2012;21(6):769–85.View ArticlePubMedPubMed CentralGoogle Scholar
- Sikosek T, Chan HS. Biophysics of protein evolution and evolutionary protein biophysics. J R Soc Interface R Soc. 2014;11(100):20140419.View ArticleGoogle Scholar
- Bloom JD. An experimentally informed evolutionary model improves phylogenetic fit to divergent lactamase homologs. Mol Biol Evol. 2014;24.Google Scholar
- Meyer AG, Wilke CO. Integrating sequence variation and protein structure to identify sites under selection. Mol Biol Evol. 2013;30(1):36–44.View ArticlePubMedGoogle Scholar
- Echave J, Jackson EL, Wilke CO. Relationship between protein thermodynamic constraints and variation of evolutionary rates among sites. Phys Biol. 2015;12(2):25002.View ArticleGoogle Scholar
- McClellan DA, Ellison DD. Assessing and improving the accuracy of detecting protein adaptation with the TreeSAAP analytical software. Int J Bioinforma Res Appl. 2010;6(2):120–33.View ArticleGoogle Scholar
- Woolley S, Johnson J, Smith MJ, Crandall KA, McClellan DA. TreeSAAP: selection on amino acid properties using phylogenetic trees. Bioinformatics. 2003;19(5):671–2.View ArticlePubMedGoogle Scholar
- Meiler J, Müller M, Zeidler A, Schmäschke F. Generation and evaluation of dimension-reduced amino acid parameter representations by artificial neural networks. J Mol Model. 2001;7(9):360–9.View ArticleGoogle Scholar
- Huang F, Nau WM. A conformational flexibility scale for amino acids in peptides. Angew Chem Int Ed Engl. 2003;42(20):2269–72.View ArticlePubMedGoogle Scholar
- Hanson RM, Prilusky J, Renjian Z, Nakane T, Sussman JL. JSmol and the Next-Generation Web-Based Representation of 3D Molecular Structure as Applied to Proteopedia. Isr J Chem. 2013;53(3–4):207–16.View ArticleGoogle Scholar
- Fonzé E, Charlier P, To’th Y, Vermeire M, Raquet X, Dubus A, et al. TEM1 beta-lactamase structure solved by molecular replacement and refined structure of the S235A mutant. Acta Crystallogr D Biol Crystallogr. 1995;51(Pt 5):682–94.View ArticlePubMedGoogle Scholar
- DeLano WL. The PyMOL Molecular Graphics System. San Carlos, CA: DeLano Scientific; 2002.Google Scholar
- Ye Q, Krug RM, Tao YJ. The mechanism by which influenza A virus nucleoprotein forms oligomers and binds RNA. Nature. 2006;444(7122):1078–82.View ArticlePubMedGoogle Scholar
- Fabiane SM, Sohi MK, Wan T, Payne DJ, Bateson JH, Mitchell T, et al. Crystal structure of the zinc-dependent beta-lactamase from Bacillus cereus at 1.9 A resolution: binuclear active site with features of a mononuclear enzyme. Biochemistry. 1998;37(36):12404–11.View ArticlePubMedGoogle Scholar
- Gonen T, Sliz P, Kistler J, Cheng Y, Walz T. Aquaporin-0 membrane junctions reveal the structure of a closed water pore. Nature. 2004;429(6988):193–7.View ArticlePubMedGoogle Scholar