Sibe: a computation tool to apply protein sequence statistics to predict folding and design in silico

Background Evolutionary information contained in the amino acid sequences of proteins specifies the biological function and fold, but exactly what information contained in the protein sequence drives both of these processes? Considerable progress has been made to answer this fundamental question, but it remains challenging to explore the potential space of cooperative interactions between amino acids. Statistical analysis plays a significant role in studying such interactions and its use has expanded in recent years to studies ranging from coevolution-guided rational protein design to protein folding in silico. Results Here we describe a computational tool named Sibe for use in studies of protein sequence, folding, and design using evolutionary coupling between amino acids as a driving factor. In this study, Sibe is used to identify positionally conserved couplings between pairwise amino acids and aid rational protein design. In this process, pairwise couplings are filtered according to the relative entropy computed from the positional conservations and grouped into several ’blocks’, which could contribute to driving protein folding and design. A human β2-adrenergic receptor (β2AR) was used to demonstrate that those ’blocks’ contribute the rational design for specifying functional residues. Sibe also provides folding modules based on both the positionally conserved couplings and well-established statistical potentials for simulating protein folding in silico and predicting tertiary structure. Our results show that statistically inferences of basic evolutionary principles, such as conservations and coupled-mutations, can be used to rapidly design a diverse set of proteins and study protein folding. Conclusions The developed software Sibe provides a computational tool for systematical analysis from protein primary to its tertiary structure using the evolutionary couplings as a driving factor. Sibe, written in C++, accounts for compatibility with the ’big data’ era in biological science, and it primarily focuses on protein sequence analysis, but it is also applicable to extend to other modeling and predictions of experimental measurements Electronic supplementary material The online version of this article (10.1186/s12859-019-2984-1) contains supplementary material, which is available to authorized users.


Analysis on protein sequences
First of all, we search a query sequence against the UniRef90 database [1] by HMMER [2] and then prepare its sequence alignment. The aligned sequence can be directly used to infer evolutionary information, since the HMMER tool provides the best ContTest score [3] and comparable score in the QuanTest [4]. The following command lines are used to collect the aligned sequences, jackhmmer \ -N 5 \ --notextw \ -A out.sto \ --tblout out_tbl.out \ --domtblout out_domtbl.out \ -E 0.01 \ --popen 0.25 \ --pextend 0.4 \ query_fasta \ uniref90.fasta We prepared a wild type sequence of the human β 2 AR protein as the example for protein sequence analysis. By running jackhmmer, we got a raw sequence alignment and then trimmed them by a command line as follows, The command line is to trim the aligned sequences according to the first sequence (query sequence) in the input sequence alignment. At the begin of sequence analysis, we use command sequence stats in Sibe to conduct a naive examination of the sequence space described by the alignment. Sibe will also present a Shell script for plotting figures of the output. By running the script, the computed a matrix of similarity between pairs of sequences can be plotted as illustrated in Fig. S5. The matrix representation gives the fraction of amino acids that are common between the pairwise sequences. As another computational results, the bar representation of positional conservation is shown in Fig. 3.

Validations of the calculations
In computational protein design, we first demonstrate that the estimated sequence potential from MSA can distinguish the mesophilic and thermophilic proteins. Likewise, we prepared a MSA for each family and then inferred the energy-like potential from the MSA by running command line as follows, sibe sequence_coupling \ -msa=sequence_alignment_trimed.aln Based the obtained potential, we calculated the energies of the sequences using the same estimated potential if the sequences are in the same family. The command line is to compute the sequence energy as follows, sibe sequence_energy \ -msa=the_sequences_require_energy_calculation.aln \ -mat=sequence_potential.mat The calculations are shown in Fig. 6. We demonstrated that the computational results have high correlation (Pearson correlation coefficient ∼ −0.74) with the experimental data [5], and the estimated potential and the model were validated to be efficient in distinguishing the mesophilic from thermophilic proteins, 24 of all protein families as listed in Table S1, others can be find in ref. [6]. Humicola lanuginosa (thermo/50 • C) 3tgl Rhizomucor miehei (thermo/45 • C The effects of point mutations were computed from the potential by running following command line. The matrix representation of the computational results is shown in Fig. 7.

Protein design protocol
Using the algorithm described by Desmet et al. [7], the DEE procedure for each trajectory starts with a given wild type sequence. The sequence converges to a local energy minimum by gradually decreasing the temperature. As illustrated in Fig. S2, starting from a given wild type sequence (WT sequence), we launched a DEE algorithm to optimize the mutant sequence base on the energy-like potential (sequence potential) inferred from the MSA. According to the Metropolis criterion, the design protocol will accept or reject a new mutant that may occur in the wild type sequence, and the change is accepted with probability, where r p = 1, and ∆E is the energy difference between the new and old designed sequences. By running the following command line, we can get a trajectory (the designed sequence .trajct) of mutant sequences starting from the wild type sequence.
sibe sequence_design \ -fastx=WT_sequence \ -mat=sequence_potential.mat \ -dseq=the_designed_sequence.trajct \ -iterations=100000 From the design protocol, we got five hundreds trajectories of the designed sequences, and the sequence with lowest energy in each trajectory was collected for visualization as shown in Fig. S4.

Protein folding and structure prediction
In Sibe, MCMC algorithm is used for protein conformation sampling from individual (φ, ψ) distributions (Ramachandran maps), while all other angles and bond lengths are fixed at their ideal values. A single round involves 500 individual MCMC folding simulations that are run using specialized (φ, ψ) backbone sampling procedures and the energy functions (as described in ref. [8]) in a protein representation containing the backbone and C β atoms. Within the simulation, MCMC provides a general solution to protein folding and structure prediction prevalent in scientific research. As described in ref. [8], Sibe utilizes the same moving sets and energy functions. Additionally, Sibe employed predicted angles (φ, ψ) to increase the sampling probabilities in the Ramachandran map distribution, which also efficiently enhance the ability of the MCMC method during the simulation. The passing of constraints of torsion angles (φ, ψ) and residue-contacts from one round to another is repeated until convergence as illustrated in Fig. S4 Figure S4: Flowchart of a single round in the iterative protein folding.

Constraints-guided sampling
In Sibe, we used the Phsior [9] to predict torsion angles (φ, ψ), which were applied to shift the Ramachandran map distribution and increase the probabilities p of (φ, ψ) of the ith residue located in circle In each round, a consensus secondary structure is obtained at every amino acid position, which is used to alter the Ramachandran map distribution. Combining with the predicted torsion angles (φ, ψ), we can make changes in the probabilities in each map of amino acids. This strategy leads to efficient sampling in the MCMC algorithm and accelerate the folding of a protein structure.
The residue-contacts (estimated by evolutionary couplings [10,11]) are used as constraints in iterative simulations as illustrated in Fig. 10. The top L/3, L/2, L, 3L/2 (L is the length of a protein sequence) predicted contacts are used for 500 simulations in initial stage. Every 125 simulations utilize the same initial constraints of residue-contacts, e.g. top L/3 residue-contacts are used in the first 125 simulations while top L/2 residue-contacts are employed by another 125 simulations. This hierarchical application of constraints from residue-contacts is to help the Markov chain sample efficiently in conformation space.

Metropolis-Hastings sampling
In Sibe, we use a Markov chain to sample from Ramachandran maps distribution (π) (derived from high resolution PDB structures [8]). Accordingly, it is necessary to develop a transition for the Markov chain, aiming to match the chains stationary distribution with the individual (φ, ψ) distributions. As an effective strategy to sample angular space (φ, ψ), a random-walk Markov chain uses the current state of a chain of conformations to propose a new state. Within Sibe, a Gaussian function centered on the current state g(s|s (t−1) ) ∼ N (s (t−1) , 1) is defined as the proposal function. This allows the algorithms to exploit the conformation space of the posterior-if a new conformation is similar to the last draw, then it is likely to be accepted. The proposal distribution g(s|s (t−1) ) is dependent only on the previous state in the Markov chain.
Starting from some random initial state s (0) ∼ π (0) , the protocol first draws a potential state s from a proposal distribution g(s|s (t−1) ). Accordingly, in Sibe the Metropolis-Hastings algorithm is launched as the simplest form of random-walk MCMC protocol, which accepts or rejects a transition from protein conformation s (t−1) to s with probability α. The detailed criteria for accepting or rejecting a proposed state are defined as follows: 1. If R(s) ≥ R(s (t−1) ), the proposed state s will be set as the next state in the Markov chain.
2. If R(s) < R(s (t−1) ), then the proposed state may still be accepted, but only randomly, and with a probability R(s) R(s (t−1) ) . The acceptance probability for the proposed state is calculated as follows, where R is the density of the individual φ, ψ distribution of an amino acid, and g is the density of the proposal function for a transition from s (t−1) to s.
The proposed state s is accepted if a random uniform number u is less than or equal to α.
Otherwise, it will be rejected, and the current state will be set as the next state in the Markov chain.

Comparisons
We have compared our predictions to those of EVfold [10] on the three proteins. As shown in Table  S2. All the results are computed from EVfold web-server, and the comparisons between EVfold and Sibe are reported in Table S2 in the supplementary material. As shown in Fig. S5, the results were computed by the EVfold web-server at https://evcouplings.org. All the predictions are based on the default configurations on the server and the visualizations of the tertiary structures are presented by PyMol [12] for comparison to our results as shown in Table  S2. As a de novo method, folding module in Sibe has better performance than that of EVfold, since Sibe provides folding details in the predictions and iteratively bias the folding.

Running time
As shown in Fig. S6, the running time of different tasks (MSA, statistical potential, folding) are presented. For all the proteins used in the study, the jackhmmer tool took similar time to search the UniRef90 database for obtaining the MSA of each protein. The running time of both estimated potential and folding is highly dependent on the protein length and the number of sequences in the MSAs. For example, it took about 13 hours to infer the statistical potential for the human β 2 AR protein due to its longer sequences and large number of sequences in its MSA as illustrated in Fig. S6(a). On distributed computational clusters, we conducted folding simulations on eighteen proteins [13] and computed the running time as shown in Fig. S6(b). As illustrated in Fig. 1, it took much longer time to folding a protein into its tertiary structure comparing to that of calculating the potential.  Figure S6: Running time of obtaining MSA, estimating sequence potential, and folding simulation.