BioPhysConnectoR: Connecting Sequence Information and Biophysical Models
© Hoffgaard et al; licensee BioMed Central Ltd. 2010
Received: 2 December 2009
Accepted: 22 April 2010
Published: 22 April 2010
One of the most challenging aspects of biomolecular systems is understanding coevolution within and among molecules.
A complete theoretical picture of the selective advantage of (co-)mutations, and thus their functional annotation, is still lacking. Using sequence-based, information-theoretically inspired methods, we can identify coevolving residues in proteins without understanding the underlying biophysical properties that give rise to such coevolutionary dynamics. Detailed (atomistic) simulations are prohibitively expensive. At the same time, reduced molecular models are an efficient way to determine the reduced dynamics around the native state. Combining sequence-based approaches with such reduced models is therefore a promising way to annotate evolutionary sequence changes.
With the R package BioPhysConnectoR we provide a framework to connect the information-theoretical domain of biomolecular sequences to biophysical properties of the encoded molecules, derived from reduced molecular models. To this end we have integrated several fragmented ideas into a single package ready to be used in connection with additional statistical routines in R. Additionally, the package leverages the power of modern multi-core architectures to reduce turn-around times in evolutionary and biomolecular design studies. Our package is a first step toward the above-mentioned annotation of coevolution by reduced dynamics around the native state of proteins.
BioPhysConnectoR is implemented as an R package and distributed under the GPL 2 license. It allows for efficient and perfectly parallelized functional annotation of coevolution found at the sequence level.
One of the biggest challenges in the post-genome era is to understand how proteins evolve, fold, and structurally encode their function. Understanding the underlying coupling of protein sequence evolution and bio-mechanics is a first step toward developing new drugs and annotating molecular evolution in physical space. Exploring the accessible sequence space of a protein provides insights into its evolutionary history and phylogenetic relations. Mutual information (MI), an information-theoretical approach, is widely used to detect coevolution [2–9] at the sequence level within a protein or among several molecules. Such statistical methods allow high-throughput investigations, but they do not reveal the biophysical and biochemical implications of protein sequence changes.
In general, a sequence change becomes fixed in molecular evolution if it has proven useful in the physical realm through beneficial biophysical properties and functions. Interactions between proteins as well as functional aspects of monomers are largely conserved throughout evolution, which implies coevolution among residues. Such coevolution helps to maintain crucial interactions between the coevolving residues. To explore the physical realm, molecular dynamics (MD) simulations and related methods are routinely employed. Their applicability is restricted to just a few mutants due to the severe computational demands of MD. To overcome this drawback, a number of coarse-grained models have been developed in recent years [10–12]. In contrast to MD simulations, these models allow high-throughput screening of natural and unnatural mutations.
Hamacher developed a protocol to integrate both the information from sequence-driven methods and the mechanical aspects derived from biophysical interaction theories, eventually bridging the gap between statistical bioinformatics and molecular dynamics/biophysics. Connecting both points of view proved to be essential for the construction of molecular interaction networks and helps to understand thermodynamical properties and evolutionary changes. The purpose of BioPhysConnectoR is to provide evolutionary biologists and other bioinformatics researchers with these protocols and to allow for future development of new protocols that integrate information space and physical space in a holistic picture of molecular evolution.
An alignment given in FASTA format can be read, and information-theoretical measures such as MI and entropy can be computed. A null model can be computed to estimate the statistical relevance of the derived MI values.
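The alignment-reading and entropy step can be sketched in a few lines of Python. This is an illustrative sketch only, not the package's R interface: the function names `parse_fasta`, `column_entropy`, and `joint_entropy` are hypothetical, entropies are assumed to be measured in bits (log base 2), and every symbol in a column, including gaps, is counted as it appears.

```python
import math
from collections import Counter


def parse_fasta(text):
    """Return the sequences (in file order) from FASTA-formatted text."""
    sequences = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:
        sequences.append("".join(current))
    return sequences


def column_entropy(sequences, column):
    """Shannon entropy (bits) of one alignment column (assumed log base 2)."""
    counts = Counter(seq[column] for seq in sequences)
    n = len(sequences)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def joint_entropy(sequences, col_i, col_j):
    """Joint Shannon entropy (bits) of two alignment columns."""
    counts = Counter((seq[col_i], seq[col_j]) for seq in sequences)
    n = len(sequences)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    alignment = parse_fasta(">s1\nACD\n>s2\nACE\n>s3\nAGD\n>s4\nAGE\n")
    for k in range(3):
        print(k, column_entropy(alignment, k))
```

In this toy alignment the first column is fully conserved and therefore has zero entropy, while the other two columns each carry one bit.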
A PDB file can be read, and the Hessian as well as the covariance matrix can be computed for a coarse-grained anisotropic network model (ANM) [10, 11], yielding reduced dynamical properties of the molecule. The ANM relies on a harmonic approximation of the full, atomistic potential. The actual computation is performed by a singular value decomposition (SVD). Additionally, B-factors can be extracted from the covariance matrix.
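As a rough illustration of the structure-reading step, the following Python sketch extracts Cα coordinates from the fixed-column ATOM records of a PDB file. It is not the package's R routine: the function name `read_ca_atoms` is hypothetical, only the first model is read, and alternate locations other than blank or "A" are skipped, which is a simplifying assumption.

```python
def read_ca_atoms(pdb_text, chain=None):
    """Return (residue name, residue number, (x, y, z)) for C-alpha atoms.

    If chain is None, the chain of the first C-alpha atom found is used.
    """
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # assumption: only the first model is of interest
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        if line[12:16].strip() != "CA":  # atom name, columns 13-16
            continue
        if line[16] not in (" ", "A"):  # alternate location, column 17
            continue
        chain_id = line[21]  # chain identifier, column 22
        if chain is None:
            chain = chain_id
        if chain_id != chain:
            continue
        coords = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms.append((line[17:20].strip(), int(line[22:26]), coords))
    return atoms
```

Restricting the parser to ATOM records avoids picking up calcium ions, which share the atom name "CA" but are stored as HETATM records.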
In silico experiments can be performed by changing the underlying protein sequence or "breaking" amino acid contacts for the computation of biophysical properties. For given alignments, the outcome can be combined with the respective MI or joint entropy values.
The self-consistent pair contact probability (SCPCP) method is included as an additional method to derive B-factors and further biophysical properties from a coarse-grained approach.
Some additional matrix routines are implemented.
where x and y are realizations of the random variables X_i and Y_j, drawn from a symbol set and taken from a multiple sequence alignment as columns i and j, resulting in an MI matrix (MI_ij). For proteins we are concerned with the symbol set of the 20 standard amino acids, which can be expanded to include the gap character and an extra character for non-standard amino acids. The probabilities p_i(x), p_j(y), and p_ij(x, y) are obtained as the relative frequencies of amino acids within the columns of a multiple sequence alignment.
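To make the frequency-based estimate concrete, here is a minimal Python sketch of an MI matrix over all column pairs. It assumes the standard definition of MI, the sum over symbol pairs of p_ij(x, y) log[p_ij(x, y) / (p_i(x) p_j(y))], with base-2 logarithms; the log base and the function names `mutual_information` and `mi_matrix` are assumptions, not taken from the package.

```python
import math
from collections import Counter


def mutual_information(col_x, col_y):
    """MI (bits) of two equally long columns, using relative frequencies."""
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p_xy * log2(p_xy / (p_x * p_y)), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


def mi_matrix(sequences):
    """Symmetric matrix MI_ij over all column pairs of an alignment."""
    length = len(sequences[0])
    columns = [[seq[k] for seq in sequences] for k in range(length)]
    matrix = [[0.0] * length for _ in range(length)]
    for i in range(length):
        for j in range(i, length):
            value = mutual_information(columns[i], columns[j])
            matrix[i][j] = matrix[j][i] = value
    return matrix


if __name__ == "__main__":
    for row in mi_matrix(["ACD", "ACD", "AGE", "AGE"]):
        print(row)
```

In the toy alignment, columns 1 and 2 vary together and share one bit of information, whereas the conserved column 0 has zero MI with every other column. The diagonal entry MI_ii equals the entropy of column i.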
Reduced molecular models [10, 11] use only a coarse-grained representation of amino acids: each amino acid is represented by a bead placed at the position of its Cα atom.
where s_{i,i+1} is the distance between the Cα atoms at adjacent positions (i.e. covalently attached pairs) at a time point in a test conformation, which is compared with the distance between the same atoms in the native structure. C contains all pairs of residue positions i and j with non-covalent contacts that are within a given cutoff. The amino acid-specific statistical contact potential matrices of Miyazawa and Jernigan (MJ) and Keskin et al. (KE) were used for the non-covalent spring constants κ_ij to provide sequence specificity. Using MJ and KE, the ANM was shown to improve the correspondence to experimental results [11, 12]. Other weighting schemes for amino acid contacts can be provided by the user as arguments to the respective function in BioPhysConnectoR.
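The contact set and the resulting harmonic network can be sketched as follows. This is a generic elastic-network construction under stated assumptions, not the package's implementation: the chain is assumed to be contiguous, the covalent spring constant defaults to a placeholder value of 1.0, and `k_contact` is a hypothetical callback standing in for an MJ- or KE-style lookup of κ_ij.

```python
import math


def contact_pairs(coords, cutoff=13.0):
    """Pairs (i, j), i < j, of non-adjacent residues within the cutoff."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):  # i, i+1 are covalent neighbours
            if math.dist(coords[i], coords[j]) <= cutoff:
                pairs.append((i, j))
    return pairs


def anm_hessian(coords, cutoff=13.0, k_covalent=1.0, k_contact=None):
    """3N x 3N Hessian of a harmonic spring network around the structure.

    k_contact(i, j) returns the spring constant of a non-covalent contact;
    by default every contact gets 1.0 (placeholder, not a fitted value).
    """
    if k_contact is None:
        k_contact = lambda i, j: 1.0
    n = len(coords)
    hessian = [[0.0] * (3 * n) for _ in range(3 * n)]
    springs = [((i, i + 1), k_covalent) for i in range(n - 1)]
    springs += [((i, j), k_contact(i, j)) for i, j in contact_pairs(coords, cutoff)]
    for (i, j), k in springs:
        delta = [coords[j][a] - coords[i][a] for a in range(3)]
        dist_sq = sum(d * d for d in delta)
        for a in range(3):
            for b in range(3):
                block = k * delta[a] * delta[b] / dist_sq
                hessian[3 * i + a][3 * j + b] -= block  # off-diagonal blocks
                hessian[3 * j + a][3 * i + b] -= block
                hessian[3 * i + a][3 * i + b] += block  # diagonal blocks
                hessian[3 * j + a][3 * j + b] += block
    return hessian
```

"Breaking" a contact in this picture amounts to leaving one pair out of the spring list (or setting its spring constant to zero) and rebuilding the matrix.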
for α, β = x, y, z. The eigenvalues of the Hessian are denoted by λ_k, together with their respective eigenvectors; i and j are the indices of the residues.
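Given the eigenvalues and eigenvectors of the Hessian, a covariance matrix can be assembled from the non-zero modes as a pseudo-inverse, and B-factors follow from its diagonal. The sketch below assumes the common convention B_i = (8π²/3) times the mean square fluctuation of residue i, and it omits any temperature prefactor. The function names and the zero-mode tolerance are assumptions for illustration, and the eigenpairs are expected to come from an external eigensolver.

```python
import math


def covariance_from_modes(eigenvalues, eigenvectors, tol=1e-8):
    """Pseudo-inverse: sum over non-zero modes of u_k u_k^T / lambda_k."""
    size = len(eigenvectors[0])
    cov = [[0.0] * size for _ in range(size)]
    for lam, vec in zip(eigenvalues, eigenvectors):
        if abs(lam) <= tol:
            continue  # zero modes (rigid-body motions) are skipped
        for r in range(size):
            for c in range(size):
                cov[r][c] += vec[r] * vec[c] / lam
    return cov


def b_factors(cov):
    """B-factor per residue from the xx, yy, zz diagonal entries."""
    factor = 8.0 * math.pi ** 2 / 3.0  # assumed convention, no kT prefactor
    n_res = len(cov) // 3
    return [factor * sum(cov[3 * i + a][3 * i + a] for a in range(3))
            for i in range(n_res)]
```

For a real structure the Hessian has six zero eigenvalues corresponding to rigid-body translations and rotations, which is why a plain matrix inverse cannot be used.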
Such elastic network models were extended to include thermodynamics, including phase transitions that indicate folding/unfolding events. The extension we implemented is the SCPCP approximation first proposed by Micheletti et al. and later used by Hamacher et al. to investigate binding free energies of ribosomal subunits. The SCPCP can produce non-harmonic effects beyond the properties one would usually expect in simple models. In particular, it can show finite-size equivalents of "phase transitions", e.g. protein unfolding.
Results and Discussion
The alignment is read and MI values are computed. We then pick those residue pairs with the highest MI values that are non-covalently in contact within the cutoff of 13 Å. The PDB file is read and the Cα atoms of the first chain are selected. We compute the covariance matrix M_wt for this system. Afterwards we "break" the contact for each previously selected amino acid pair (a, b), one at a time, and compute a respective new covariance matrix M_mut,(a,b). The corresponding change in the mechanical behavior can be annotated by the Frobenius norm f (see eq. 5) between these two matrices.
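Two pieces of this protocol are easy to sketch in Python: selecting the contact pairs with the highest MI, and scoring a perturbation by the Frobenius norm. The sketch assumes that f is the Frobenius norm of the difference between the two covariance matrices; the function names are hypothetical.

```python
import math


def frobenius_norm(matrix_a, matrix_b):
    """Frobenius norm of the difference of two equally sized matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(matrix_a, matrix_b)
                         for a, b in zip(row_a, row_b)))


def top_contact_pairs(mi, contacts, n_top):
    """The n_top contact pairs (i, j) with the highest MI_ij values."""
    ranked = sorted(contacts, key=lambda pair: mi[pair[0]][pair[1]], reverse=True)
    return ranked[:n_top]


if __name__ == "__main__":
    mi = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.5], [0.9, 0.5, 0.0]]
    print(top_contact_pairs(mi, [(0, 1), (0, 2), (1, 2)], 2))
    print(frobenius_norm([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 4.0]]))
```

A larger norm indicates that removing the contact changes the predicted fluctuations more strongly, which is the quantity used to annotate the coevolving pair.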
Future Trends & Intended Use
R is a widely used and powerful environment for interactive analysis of statistical data in bioinformatics, offering many additional software packages (e.g. from the Bioconductor software project). We implemented the BioPhysConnectoR package in R to make the routines and underlying concepts accessible to a wide community, allowing fast and parallelized network-based analysis of protein structures. Work is in progress to develop more efficient algorithms to compute covariance matrices for mutated systems and for biomolecular design in the elastic network framework.
In the BioPhysConnectoR package we provide routines to compare an original protein system to subsequently altered ones with mutated amino acid sequences or "broken" non-covalent contacts. Using sequence alignments, we are able to score sequence changes and coevolution by the bio-mechanical ramifications of these changes. We can then use the biophysical modeling to annotate signals of coevolution in the sequence data. We include several options to alter the protocol: I) the parametrization of bonds and contacts can be changed, including the well-known MJ and KE weighting schemes [22, 23]; II) individual interactions in the structure can be altered; III) details on how to analyze mechanical changes can be modified by computing Frobenius norms (FNs) just for subsets of residues; IV) dynamical and thermodynamical properties can be computed. Changes in the molecular mechanics for different scenarios (including mutations) can then be computed, e.g. by the FN of the respective covariance matrices. The evolutionary connection of residues (indicated by high MI values) can be annotated by biophysical properties of the encoded molecule. In addition, a reduced thermodynamical model is included to correlate the variability of protein sequences with thermodynamical implications. The package can furthermore be combined with state-of-the-art optimization schemes to design molecules [29, 30].
Availability and requirements
Project name: BioPhysConnectoR
Project home page: http://bioserver.bio.tu-darmstadt.de/software/BioPhysConnectoR and CRAN at http://cran.r-project.org/
Operating system: cross-platform
Programming language: R and C/C++
Requirements: The R packages snow and matrixcalc are automatically installed from the CRAN repository.
License: GPL 2 license
Any restrictions to use by non-academics: none
KH was supported by the Fonds der chemischen Industrie through a grant for junior faculty. The authors are grateful to anonymous referees for their suggestions.
1. Lengauer T: Bioinformatics: From the Pre-genomic to the Post-genomic Era. ERCIM News 2000, 43: 6–7.
2. Korber BTM, Farber RM, Wolpert DH, Lapedes AS: Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: An information theoretic analysis. PNAS 1993, 90: 7176–7180. doi:10.1073/pnas.90.15.7176
3. Pollock DD, Taylor WR, Goldman N: Coevolving protein residues: maximum likelihood identification and relationship to structure. J Mol Biol 1999, 287: 187–198. doi:10.1006/jmbi.1998.2601
4. Hild KE, Erdogmus D, Principe J: Blind source separation using Renyi's mutual information. Signal Process Lett 2001, 8: 174–176. doi:10.1109/97.923043
5. Pham DT: Blind separation of instantaneous mixture of sources via the Gaussian mutual information criterion. Signal Process 2001, 81: 855–870. doi:10.1016/S0165-1684(00)00260-7
6. Boba P, Weil P, Hoffgaard F, Hamacher K: Co-evolution in HIV enzymes. BIOINFORMATICS2010 2010, 39–47.
7. Ramani AK, Marcotte EM: Exploiting the Co-evolution of Interacting Proteins to Discover Interaction Specifity. J Mol Biol 2003, 327: 273–284. doi:10.1016/S0022-2836(03)00114-1
8. Almeida LB: Linear and nonlinear ICA based on mutual information. Method Signal Process 2004, 84: 231–245. doi:10.1016/j.sigpro.2003.10.008
9. Gloor GB, Martin LC, Wahl LM, Dunn SD: Mutual Information in Protein Multiple Sequence Alignments Reveals Two Classes of Coevolving Positions. Biochemistry 2005, 44: 7156–7165. doi:10.1021/bi050293e
10. Atilgan A, Durrell S, Jernigan R, Demirel M, Keskin O, Bahar I: Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys J 2001, 80: 505–515. doi:10.1016/S0006-3495(01)76033-X
11. Hamacher K, McCammon JA: Computing the Amino Acid Specificity of Fluctuations in Biomolecular Systems. J Chem Theory Comput 2006, 2(3):873–878. doi:10.1021/ct050247s
12. Hamacher K, Trylska J, McCammon JA: Dependency Map of Proteins in the Small Ribosomal Subunit. PLoS Computational Biology 2006, 2: e10. doi:10.1371/journal.pcbi.0020010
13. Hamacher K: Relating Sequence Evolution of HIV1-Protease to Its Underlying Molecular Mechanics. Gene 2008, 422: 30–36. doi:10.1016/j.gene.2008.06.007
14. Hamacher K: Temperature dependence of fluctuations in HIV1-protease. Eur Biophys J 2009.
15. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2008. ISBN 3-900051-07-0 [http://www.R-project.org]
16. Grant BJ, Rodrigues APC, ElSawy KM, McCammon JA, Caves LSD: Bio3d: an R package for the comparative analysis of protein structures. Bioinformatics (Oxford, England) 2006, 22(21):2695–2696. [PMID: 16940322] [http://www.ncbi.nlm.nih.gov/pubmed/16940322] doi:10.1093/bioinformatics/btl461
17. Novomestky F: matrixcalc. 2008. [http://cran.r-project.org/]
18. Tierney L, Rossini AJ, Li N: Snow: A Parallel Computing Framework for the R System. Int J of Parallel Computing 2009, 37: 78–90. doi:10.1007/s10766-008-0077-2
19. Weil P, Hoffgaard F, Hamacher K: Estimating Sufficient Statistics in Co-Evolutionary Analysis by Mutual Information. Comp Biol Chem 2009, 33: 440–444. doi:10.1016/j.compbiolchem.2009.10.003
20. Micheletti C, Banavar JR, Maritan A: Conformations of Proteins in Equilibrium. Physical Review Letters 2001, 87(8):088102–1. doi:10.1103/PhysRevLett.87.088102
21. MacKay DJC: Information Theory, Inference, and Learning Algorithms. Cambridge University Press; 2003.
22. Miyazawa S, Jernigan RL: Residue-Residue Potentials with a Favorable Contact Pair Term and an Unfavorable High Packing Density Term, for Simulation and Threading. J Mol Biol 1996, 256: 623–644. doi:10.1006/jmbi.1996.0114
23. Keskin O, Bahar I, Badretdinov A, Ptitsyn O, Jernigan R: Empirical solvent-mediated potentials hold for both intra-molecular and inter-molecular inter-residue interactions. Prot Sci 1998, 7: 2578–2586. doi:10.1002/pro.5560071211
24. Moore EH: Bull Am Math Soc. 1920, 26: 394–395.
25. Penrose R: A generalized inverse for matrices. Proc Cambr Philos Soc 1955, 51: 406–413. doi:10.1017/S0305004100030401
26. Chen L, Perlina A, Lee CJ: Positive Selection Detection in 40,000 Human Immunodeficiency Virus (HIV) Type 1 Sequences Automatically Identifies Drug Resistance and Positive Fitness Mutations in HIV Protease and Reverse Transcriptase. J Virol 2004, 78(7):3722–3732. doi:10.1128/JVI.78.7.3722-3732.2004
27. Amdahl G: Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. AFIPS Conference Proceedings 1967, 30: 483–485.
28. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 2004, 5: R80. [http://genomebiology.com/2004/5/10/R80] doi:10.1186/gb-2004-5-10-r80
29. Hamacher K: Information Theoretical Measures to Analyze Trajectories in Rational Molecular Design. J Comp Chem 2007, 28(16):2576–2580. doi:10.1002/jcc.20759
30. Hamacher K: Adaptive Extremal Optimization by Detrended Fluctuation Analysis. J Comp Phys 2007, 227(2):1500–1509. doi:10.1016/j.jcp.2007.09.013
31. Urbanek S: multicore. 2009. [http://www.rforge.net/multicore]
32. Humphrey W, Dalke A, Schulten K: VMD - Visual Molecular Dynamics. Journal of Molecular Graphics 1996, 14: 33–38. doi:10.1016/0263-7855(96)00018-5
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.