SeqX: a tool to detect, analyze and visualize residue co-locations in protein and nucleic acid structures
© Biro and Fördös; licensee BioMed Central Ltd. 2005
Received: 11 February 2005
Accepted: 12 July 2005
Published: 12 July 2005
The interacting residues of protein and nucleic acid sequences are close to each other: they are co-located. Structure databases such as the Protein Data Bank (PDB) and the Nucleic Acid Database (NDB) contain all the information about these co-locations; however, extracting it from these complex records is not an easy task. We developed a Java tool, called SeqX, for this purpose.
The SeqX tool is useful to detect, analyze and visualize residue co-locations in protein and nucleic acid structures. The user
a. selects a structure from PDB;
b. chooses an atom that is commonly present in every residue of the nucleic acid and/or protein structure(s);
c. defines a distance from these atoms (3–15 Å).
The SeqX tool detects every residue that is located within the defined distance from the defined "backbone" atom(s), provides a DotPlot-like visualization (Residue Contact Map), and calculates the frequency of every possible residue pair (Residue Contact Table) in the observed structure. It is possible to exclude +/- 1 to 10 neighbor residues in the same polymeric chain from detection, which greatly improves the specificity of detection (up to 60% when tested on dsDNA). Results obtained on protein structures showed highly significant correlations with results obtained from the literature (p < 0.0001, n = 210, four different subsets). The co-location frequency of physico-chemically compatible amino acids is significantly higher than that calculated and expected for random protein sequences (p < 0.0001, n = 80).
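The detection step described above can be sketched as follows. This is a minimal illustration and not the SeqX source code: the function name `contact_map`, its parameters, and the assumption that residues are listed in sequence order without gaps are all hypothetical.

```python
import math

def contact_map(coords, chain_ids, radius=6.0, exclude=1):
    """Return a symmetric 0/1 matrix of residue co-locations.

    coords    -- one (x, y, z) tuple per residue: the chosen "backbone" atom
    chain_ids -- chain label of each residue (same length as coords)
    radius    -- detection distance in angstroms (the text allows 3-15)
    exclude   -- skip pairs at most this many positions apart in one chain
    Assumption: residues are listed in sequence order, so the index
    difference equals the sequence separation. The diagonal stays 0.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same_chain = chain_ids[i] == chain_ids[j]
            if same_chain and (j - i) <= exclude:
                continue  # sequence neighbor: excluded from detection
            if math.dist(coords[i], coords[j]) <= radius:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Four residues of one chain placed on a line, 3.8 angstroms apart.
line = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
example = contact_map(line, "AAAA", radius=8.0, exclude=1)
```

With `exclude=1` only the i, i+2 pairs remain in this toy chain; raising `exclude` to 2 removes them as well, which illustrates how neighbor exclusion suppresses contacts that merely reflect chain connectivity.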
The tool is simple and easy to use, and it provides quick and reliable visualization and analysis of residue co-locations in protein and nucleic acid structures.
Specific protein-protein and protein-nucleic acid interactions are the focus of many biochemical studies. The exact nature of these interactions is not known. Some scientists argue that macromolecular interactions are determined by long sequence domains involving many residues (amino acids and nucleotides), while others have found that some degree of specificity already exists at the level of single residues, i.e. some residue pairs are preferentially co-located on interacting interfaces. The existence of preferred residue pairs within, as well as between, macromolecular structures is supported by numerous statistical analyses of protein-RNA, regulatory protein-DNA, restriction enzyme-DNA cut site, and protein-protein [4–8] structures and interfaces. Although many studies have performed statistical analyses of residue co-location, we were not able to find a publicly available tool for this purpose. We found only a reference to a commercially available tool, the QUANTA modeling software [9, 10].
The main features of the protein secondary structure are indicated by background colors (blue/green: beta sheet, yellow: alpha helix, gray: turn), provided they are annotated in the PDB source files (this is not always the case).
It is possible to zoom in on the center of the Map and to move it in any direction (using the mouse). The primary structure (the sequence) is shown along the coordinates. If the sequence is too long, it is necessary to zoom in on the Map to make the sequence readable. The protein sequence is indicated with the 20 one-letter codes (capital letters), while the nucleic acid sequence is indicated with the letters a, t/u, g, c. Clicking on any co-location highlights the corresponding 2 letters in the sequences (green letter coloring).
A simple statistical analysis is performed and the number of every possible residue combination is listed in a Residue Contact Table. It is possible to save the results of the analysis in JPG and XLS (or similar) files. It is also possible to save the Residue Contact Map itself in binary form and in XLS format for future statistical processing ("Save binary" saves the map as 0 and 1 numbers).
The Residue Contact Table contains all possible residue combinations (20 × 20 amino acid to amino acid, 4 × 4 nucleic acid to nucleic acid and 20 × 4 amino acid to nucleic acid combinations) and lists the frequency of these co-locations in the observed structure. Some of the listed co-locations are specific (true) while others are nonspecific (false) co-locations.
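As an illustration of how such a table can be derived from a contact map, the sketch below counts each detected co-location once under an alphabetically sorted residue pair. The name `contact_table` and the decision to merge a-b with b-a are assumptions made for this example; the text does not state how SeqX treats pair order.

```python
from collections import Counter

def contact_table(sequence, cmap):
    """Count co-located residue pairs, e.g. {('A', 'L'): 3, ...}.

    sequence -- one-letter residue codes, in the same order as the map rows
    cmap     -- symmetric 0/1 contact matrix (list of lists)
    Each contact is counted once under a sorted key, so ('A', 'L') and
    ('L', 'A') fall into the same entry (an assumption of this sketch).
    """
    table = Counter()
    n = len(sequence)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: no double counting
            if cmap[i][j]:
                table[tuple(sorted((sequence[i], sequence[j])))] += 1
    return table

# Toy map with two contacts: residues 0-2 and 1-3.
toy_map = [
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]
toy_table = contact_table("ALKA", toy_map)
```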
It is possible to estimate the specificity of the results only in the case of nucleic acids, where the Watson-Crick base pairs are known to be specific co-locations. The Residue Contact Table provides data for the 16 (4 × 4) different types of nucleic acid base co-locations; however, it is known that only adenine-thymine (a-t, t-a) and guanine-cytosine (g-c, c-g) co-locations indicate true (T) and specific base pairs, while the 8 other pairs are false (F).
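One plausible way to turn this true/false classification into a number is sketched below. The definition used here (Watson-Crick co-locations as a percentage of all detected base-base co-locations) is an assumption, as the text does not spell out the formula; the function name and the inclusion of a-u pairs for RNA are likewise hypothetical.

```python
# Watson-Crick pairs, order-independent; a-u is included for RNA (assumption).
WATSON_CRICK = {frozenset("at"), frozenset("au"), frozenset("gc")}

def base_pair_specificity(table):
    """Percentage of detected base-base co-locations that are Watson-Crick.

    table -- mapping from a (base, base) tuple to its co-location count
    """
    true_hits = sum(
        count for pair, count in table.items()
        if frozenset(pair) in WATSON_CRICK
    )
    total = sum(table.values())
    return 100.0 * true_hits / total if total else 0.0

# Made-up counts, for illustration only.
demo_counts = {("a", "t"): 6, ("c", "g"): 3, ("a", "g"): 1}
demo_specificity = base_pair_specificity(demo_counts)
```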
We found that a detection radius of 5–9 Å and exclusion of +/-8 neighbors give the best results for analyzing alpha helical protein structures.
A cautious and preliminary estimate is nevertheless possible for the specificity of detected residue co-locations in protein structures. It is known from physico-chemical studies that some amino acids attract each other while others repel each other. The known physico-chemical laws suggest that pair formation (co-location) is probably preferred between amino acids having similar hydrophobicity or opposite charge, while pair formation between amino acids with different hydrophobicity or similar charge is strongly disfavored.
To test this assumption we generated a pool of artificial random protein sequences by translating randomized nucleic acid sequences. The nucleic acids contained equal amounts of each nucleotide base (4 × 25%) and, in this way, the average frequency of amino acids in the translated artificial proteins became very similar to the amino acid frequency of the entire human proteome.
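The construction of such random proteins can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the standard genetic code is used, stop codons are simply skipped, and the function names are hypothetical.

```python
import random

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def random_nucleic_acid(length, rng):
    """Shuffled sequence with exactly 25% of each base.

    length must be a multiple of 4; rng is a random.Random instance.
    """
    pool = list(BASES * (length // 4))
    rng.shuffle(pool)
    return "".join(pool)

def translate(dna):
    """Translate in frame 1; stop codons are skipped (an assumption)."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons if CODON_TABLE[c] != "*")

random_dna = random_nucleic_acid(1200, random.Random(0))
random_protein = translate(random_dna)
```

Because the number of codons differs between amino acids (for example six for leucine and one for tryptophan), equal base frequencies produce unequal amino acid frequencies in the translated sequences.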
The residue co-locations within and between these sequences are determined by statistical laws if we assume that the spatial mobility of the residues in these proteins is free and independent of each other. The calculated probability of any residue co-location ($P_{ab}$) will be $P_{ab} = n_a n_b / T^2$, with $T = n_a + n_b + \dots + n_{20}$, where $n$ is the number of a given amino acid. The calculated relative frequency of a given co-locating pair ($C_{ab}$) is proportional to $P_{ab}$ and may be calculated by the formula $C_{ab} = 100 \cdot P_{ab} / (P_{ab} + \dots + P_{xy})$, where x and y indicate any of the 20 possible amino acids and the number of xy pairs is 400.
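The two formulas can be evaluated directly from amino acid counts, as in the short sketch below. The function name and the two-letter toy alphabet are assumptions chosen to keep the example small; with real data the counts would cover all 20 amino acids.

```python
def expected_colocations(counts):
    """Return ({pair: P_ab}, {pair: C_ab in percent}) for all ordered pairs.

    counts -- mapping from residue letter to its number of occurrences (n_a)
    """
    total = sum(counts.values())  # T
    p = {
        (a, b): counts[a] * counts[b] / total ** 2  # P_ab = n_a * n_b / T^2
        for a in counts
        for b in counts
    }
    p_sum = sum(p.values())  # P_ab + ... + P_xy over every ordered pair
    c = {pair: 100.0 * value / p_sum for pair, value in p.items()}
    return p, c

# Toy composition: three alanines and one leucine.
p_toy, c_toy = expected_colocations({"A": 3, "L": 1})
```

Since the $P_{ab}$ values summed over all ordered pairs equal $(n_a + \dots + n_{20})^2 / T^2 = 1$, the normalization leaves $C_{ab} = 100 \cdot P_{ab}$ whenever every pair is included in the sum.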
This example indicates that the number of false positive (un-favored) co-locations is about 20%, and that the specificity of the SeqX method for proteins might be as much as ~80%. However, this is a very crude estimate, because the number of true co-locations is not known with certainty.
Understanding the nature of the specificity of macromolecular interactions is a major challenge in bioinformatics. We were successful in providing evidence to support the view that some degree of specificity already exists at the residue level. Therefore we decided to continue our studies of frequency analyses of residue co-locations in nucleoprotein structures. The SeqX tool is specifically designed for this purpose. The 2D Residue Contact Map is a simple and easy to understand display of nucleic acid and protein structures. Some very sophisticated analytical tools also incorporate this feature, such as MOLTALK, STING Millennium, STRIDE, MolSurfer and MOLPROBITY. The major advantage of the 2D map is its simplicity. The effective use of 3D tools and learning "3D thinking" usually require lengthy training, which general bioinformaticians often cannot afford. We have further developed the concept of the Residue Contact Map and added many new features that are not present in existing tools. These features are:
1., The option to choose different backbone atoms (in addition to the conventional Calpha and C1' atoms);
2., The option to exclude neighbor residues and thereby improve the specificity of the method;
3., The direct connection to a Residue Contact Table, which automatically provides a basic statistical analysis of the residue co-locations.
It is expected that statistical analyses of residue co-locations in protein and nucleic acid sequences will provide further insight into the rules of macromolecular interactions. The ultimate goal of these types of studies is to find short "complementary" or "compatible" sequences/motifs even for specific nucleic acid – protein and protein – protein interactions, something similar to the well known Watson-Crick rules of specific nucleic acid – nucleic acid contacts.
It is well known that in studies of protein interactions, protein engineering and drug design the most important interactions are those between side chains. However, the current SeqX program is a general purpose tool (for nucleic acids as well as for proteins) for statistical analysis and visualization of entire-residue co-locations, and it does not pay particular attention to side chains or to the pattern of side chain interactions. This does not limit the usefulness of the tool for its original purpose: any significant residue co-locations (i.e. those that differ from random) are necessarily caused by the side chains ('R' in amino acids, 'bases' in nucleic acids), because these are the variable elements of the structures. However, a future implementation might focus on the analysis of side chain to side chain co-locations and examine whether that improves the specificity of the tool.
SeqX is a simple, easy to use, specialized tool for visualization and statistical analysis of protein and nucleic acid residue co-locations. It was developed mainly and specifically to study known and novel specific residue interactions.
The general support of Z. Benyo and B. Benyo is greatly appreciated. Grants were provided by the Homulus Foundation (Stockholm, Sweden).
- Chen Y, Kortemme T, Robertson T, Baker D, Varani G: A new hydrogen-bonding potential for the design of protein-RNA interactions predicts specific contacts and discriminates decoys. Nucleic Acids Research 2004, 32: 5147–5162. 10.1093/nar/gkh785
- Mandel-Gutfreund Y, Schueler O, Margalit H: Comprehensive analysis of hydrogen bonds in regulatory protein DNA-complexes: In search of common principles. J Mol Biol 1995, 253: 370–382. 10.1006/jmbi.1995.0559
- Biro JC, Biro JMK: Frequent occurrence of recognition Site-like sequences in the restriction endonucleases. BMC Bioinformatics 2004, 5: 30. 10.1186/1471-2105-5-30
- Nair D, Fischer D, Jernigan R, Wolfson HJ, Nussinov R: Amino acid pair interchanges at spatially conserved locations. J Mol Biol 1996, 256: 924–938. 10.1006/jmbi.1996.0138
- Kumarevel TS, Gromiha MM, Ponnuswamy MN: Distribution of amino acid residues and residue-residue contacts in molecular chaperones. Prep Biochem & Biotechnol 2001, 31: 163–183. 10.1081/PB-100103382
- Glaser F, Steinberg DM, Vakser IA, Ben-Tal N: Residue frequencies and pairing preferences at protein-protein interfaces. Proteins 2001, 43: 89–102. 10.1002/1097-0134(20010501)43:2<89::AID-PROT1021>3.0.CO;2-H
- Azarya-Sprinzak E, Naor D, Wolfson HJ, Nussinov R: Interchanges of spatially neighboring residues in structurally conserved environments. Protein Engineering 1997, 10: 1109–1122. 10.1093/protein/10.10.1109
- Eilers M, Patel AP, Liu W, Smith O: Comparison of Helix Interactions in membrane and soluble alpha-bundle proteins. Biophysical Journal 2002, 82: 2720–2736.
- Accelrys, San Diego, CA, Modeling/Simulation Products, Quanta; 2005.
- Singer MS, Vriend G, Bywater RP: Prediction of protein residue-derived likelihood matrix. Protein Engineering 2002, 15: 721–725. 10.1093/protein/15.9.721
- Diemand AV, Scheib H: iMolTalk: an interactive, internet-based protein structure analysis server. Nucleic Acids Res 2004, 32: W512–6. 10.1093/nar/gkh124
- Neshich G, Togawa RC, Mancini AL, Kuser PR, Yamagishi ME, Pappas G, Torres WV, Fonseca e Campos T, Ferreira LL, Luna FM, Oliveira AG, Miura RT, Inoue MK, Horita LG, de Souza DF, Dominiquini F, Alvaro A, Lima CS, Ogawa FO, Gomes GB, Palandrani JF, dos Santos GF, de Freitas EM, Mattiuz AR, Costa IC, de Almeida CL, Souza S, Baudet C, Higa RH: STING Millennium: A web-based suite of programs for comprehensive and simultaneous analysis of protein structure and sequence. Nucleic Acids Res 2003, 31: 3386–92. 10.1093/nar/gkg578
- Heinig M, Frishman D: STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res 2004, 32: W500–2.
- Gabdoulline RR, Wade RC, Walther D: MolSurfer: A macromolecular interface navigator. Nucleic Acids Res 2003, 31: 3349–51. 10.1093/nar/gkg588
- Davis IW, Murray LW, Richardson JS, Richardson DC: MOLPROBITY: structure validation and all-atom contact analysis for nucleic acids and their complexes. Nucleic Acids Res 2004, 32: W615–9.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.