- Software
- Open access
SeqX: a tool to detect, analyze and visualize residue co-locations in protein and nucleic acid structures
BMC Bioinformatics volume 6, Article number: 170 (2005)
Abstract
Background
Interacting residues of protein and nucleic acid sequences are close to each other, i.e. they are co-located. Structure databases (such as the Protein Data Bank, PDB, and the Nucleic Acid Database, NDB) contain all the information about these co-locations; however, it is not easy to extract it from such complex data. We developed a Java tool called SeqX for this purpose.
Results
The SeqX tool detects, analyzes and visualizes residue co-locations in protein and nucleic acid structures. The user
a. selects a structure from PDB;
b. chooses an atom that is present in every residue of the nucleic acid and/or protein structure(s);
c. defines a distance from these atoms (3–15 Å).
The SeqX tool detects every residue located within the defined distance from the defined "backbone" atom(s), provides a DotPlot-like visualization (Residue Contact Map), and calculates the frequency of every possible residue pair (Residue Contact Table) in the observed structure. It is possible to exclude +/- 1 to 10 neighbor residues in the same polymeric chain from detection, which greatly improves the specificity of detection (up to 60% when tested on dsDNA). Results obtained on protein structures showed highly significant correlations with results from the literature (p < 0.0001, n = 210, four different subsets). The co-location frequency of physico-chemically compatible amino acids is significantly higher than that calculated and expected for random protein sequences (p < 0.0001, n = 80).
Conclusion
The tool is simple and easy to use, and it provides quick and reliable visualization and analysis of residue co-locations in protein and nucleic acid structures.
Availability and requirements
SeqX is available from http://janbiro.com/Downloads.html [see Additional file 1]. It requires the Java J2SE Runtime Environment 5.0 (available from http://www.sun.com), at least a 1 GHz processor and a minimum of 256 MB RAM. The source code is available from the authors.
Background
Specific protein-protein and protein-nucleic acid interactions are the focus of many biochemical studies. The exact nature of these interactions is not known. Some scientists argue that macromolecular interactions are determined by long sequence domains involving many residues (amino acids and nucleotides), while others have found that some degree of specificity already exists at the single-residue level, i.e. some residue pairs are preferentially co-located on interacting interfaces. The existence of preferred residue pairs within, as well as between, macromolecular structures is supported by numerous statistical analyses of protein-RNA [1], regulatory protein-DNA [2], restriction enzyme-DNA cut site [3] and protein-protein [4–8] structures and interfaces. Although many statistical analyses of residue co-location have been performed, we were not able to find a publicly available tool for this purpose. We found only a reference to a commercially available tool, the QUANTA modeling software [9, 10].
Implementation
Any structure file (.pdb) may be selected for analysis from the main window (Figure 1). The tool automatically provides the title of the selected PDB file, a list of the sequences present in the file and a list of every atom common to the residues of the respective sequences. These possible backbone atoms are N, CA (Calpha), C and O in proteins; and P, O1P, O2P, O5* (O5'), C5* (C5'), C4* (C4'), O4* (O4'), C3* (C3'), O3* (O3'), C2* (C2'), C1* (C1'), C5, C6, N1, C2, N3 and C4 in nucleic acids. It is possible to exclude one or more sequences from the analysis by selecting the "no-one" option in the Common Atoms list.
The user is asked to define a spherical space around the selected core atoms by choosing a minimum and a maximum detection radius (between 0 and 15 Ångströms). Co-locations between neighboring residues of the same sequence are usually not of interest, so up- and downstream neighbors in the same sequence can be excluded (Ex +/-: 0–10). The program ignores terminal residues if they are annotated as HETATM, i.e. as non-standard residues.
The SeqX program makes a list of atoms (and the corresponding residues) that are located within the defined radius around the pre-selected common atoms and are not excluded as neighbor residues. This list is accessible as a Residue Table that contains the Residue Contact Map elements. The atomic distances are calculated with the Pythagorean theorem. The results of these analyses are visualized in a Residue Contact Map and summarized in a statistical table. The Residue Contact Map is a dot-plot-like graph in which every residue of every sequence in the PDB structure is compared with every other residue, and residue co-locations are indicated by squares. The color of a square indicates the type of molecular contact (blue: nucleic acid – nucleic acid, red: protein – nucleic acid, black: protein – protein).
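The detection step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SeqX source code: the names `CoreAtom` and `find_contacts` are hypothetical, the radius limits are assumed to be inclusive, and the distance is assumed to be measured between the selected common atoms of two residues, which the text does not fully specify.

```python
import math
from typing import List, NamedTuple, Tuple


class CoreAtom(NamedTuple):
    """One selected common ("backbone") atom, e.g. CA, of one residue."""
    chain: str    # chain identifier
    index: int    # residue position within its chain
    residue: str  # residue name, e.g. "ALA"
    x: float
    y: float
    z: float


def distance(a: CoreAtom, b: CoreAtom) -> float:
    """Euclidean (Pythagorean) distance between two atoms, in Å."""
    return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2)


def find_contacts(atoms: List[CoreAtom], r_min: float, r_max: float,
                  exclude: int) -> List[Tuple[CoreAtom, CoreAtom, float]]:
    """Return residue pairs whose core atoms lie within [r_min, r_max].

    Pairs in the same chain that are at most `exclude` positions apart
    are skipped (neighbor exclusion); exclude=0 keeps all neighbors.
    """
    contacts = []
    for i, a in enumerate(atoms):
        for b in atoms[i + 1:]:
            same_chain = a.chain == b.chain
            if same_chain and abs(a.index - b.index) <= exclude:
                continue  # excluded as sequence neighbors
            d = distance(a, b)
            if r_min <= d <= r_max:  # assumption: inclusive limits
                contacts.append((a, b, d))
    return contacts


# Made-up coordinates for three residues of one chain.
atoms = [
    CoreAtom("A", 1, "ALA", 0.0, 0.0, 0.0),
    CoreAtom("A", 2, "GLY", 3.8, 0.0, 0.0),
    CoreAtom("A", 10, "LEU", 0.0, 6.0, 0.0),
]
for a, b, d in find_contacts(atoms, r_min=0.0, r_max=7.0, exclude=1):
    print(a.residue, a.index, b.residue, b.index, round(d, 2))
```

With one neighbor excluded, the adjacent ALA 1 / GLY 2 pair is skipped and only the ALA 1 / LEU 10 pair at 6 Å is reported. With `exclude=0` the adjacent pair would be reported as well.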
The main features of the protein secondary structure are indicated by background colors (blue/green: beta sheet, yellow: alpha helix, gray: turn) when they are annotated in the PDB source file, which is not always the case.
It is possible to zoom in on the center of the Map and to move it in any direction (using the mouse). The primary structure (the sequence) is shown along the coordinates. If the sequence is too long, it is necessary to zoom in on the Map to make the sequence readable. Protein sequences are indicated with the 20 one-letter codes (capital letters), while nucleic acid sequences are indicated with the letters a, t/u, g and c. Clicking on any co-location highlights the corresponding two letters in the sequences (green letter coloring).
A simple statistical analysis is performed and the number of occurrences of every possible residue combination is listed in a Residue Contact Table. The results of the analysis can be saved in JPG and XLS (or similar) files. The Residue Contact Map itself can also be saved in binary form in XLS format for later statistical processing ("Save binary" saves the map as 0 and 1 values).
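As a rough illustration of these two outputs, the sketch below builds a pair-count table and a 0/1 contact matrix from a list of detected contacts. The function names and the input format (index pairs into a one-letter sequence) are hypothetical, and counting pairs irrespective of order is an assumption; SeqX may count and store them differently.

```python
from collections import Counter
from typing import Dict, List, Tuple


def contact_table(sequence: str,
                  contacts: List[Tuple[int, int]]) -> Dict[Tuple[str, str], int]:
    """Count how often each residue pair is co-located.

    `contacts` holds (i, j) positions in `sequence`. Pairs are counted
    irrespective of order (assumption), so D-A is counted as A-D.
    """
    table: Counter = Counter()
    for i, j in contacts:
        a, b = sorted((sequence[i], sequence[j]))
        table[(a, b)] += 1
    return dict(table)


def binary_map(length: int, contacts: List[Tuple[int, int]]) -> List[List[int]]:
    """Return a symmetric 0/1 matrix with 1 at every co-located position."""
    grid = [[0] * length for _ in range(length)]
    for i, j in contacts:
        grid[i][j] = 1
        grid[j][i] = 1
    return grid


sequence = "ACDA"                       # made-up four-residue sequence
contacts = [(0, 2), (2, 3), (0, 1)]     # made-up contact positions
print(contact_table(sequence, contacts))
for row in binary_map(len(sequence), contacts):
    print(" ".join(str(v) for v in row))
```

The 0/1 matrix can be written to a text or spreadsheet file row by row, which corresponds to the kind of data the "Save binary" option is described as producing.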
Results
The Residue Contact Map provides a 2D dot-plot-like graph of residue co-locations in protein, nucleic acid or nucleoprotein complexes (Figure 2). This plot is as simple and easy to understand as any other dot-plot. The main diagonal line corresponds to residue co-locations within the same polymeric chain (neighbors) and can be eliminated by neighbor exclusion (Figure 3).
The Residue Contact Table contains all possible residue combinations (20 × 20 amino acid to amino acid, 4 × 4 nucleic acid to nucleic acid and 20 × 4 amino acid to nucleic acid combinations) and lists the frequency of these co-locations in the observed structure. Some of the listed co-locations are specific (true) while others are non-specific (false).
The specificity of the results can be estimated only for nucleic acids, where the Watson-Crick base pairs are known to be specific co-locations. The Residue Contact Table provides data for the 16 (4 × 4) different types of nucleic acid base co-locations; however, only the adenine-thymine (a-t, t-a) and guanine-cytosine (g-c, c-g) co-locations indicate true (T), specific base pairs, while the 8 other pairs (counted irrespective of order) are false (F).
The estimated specificity of the SeqX tool on dsDNA is up to 60% (T/F ~ 1.4) (Figure 4). The specificity is greatly improved by proper distance selection and by exclusion of residue neighbors (Figure 5). The reasons for these observations are easy to explain (Figures 6, 7 and 8).
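The T/F bookkeeping can be illustrated with a short Python sketch. It assumes that specificity is defined as T/(T+F), which is consistent with the quoted values (T/F ~ 1.4 corresponds to 1.4/2.4, or about 58%) but is not stated explicitly in the text. The function name and the counts are made up for illustration.

```python
from typing import Dict, Tuple

# Watson-Crick pairs, stored as unordered sets so a-t equals t-a.
WATSON_CRICK = {frozenset("at"), frozenset("gc")}


def dna_specificity(pair_counts: Dict[Tuple[str, str], int]) -> Tuple[int, int, float]:
    """Return (T, F, specificity) for a table of base-pair co-location counts.

    T counts a-t and g-c co-locations; F counts all other co-locations.
    Specificity is taken as T / (T + F), which is an assumption.
    """
    total = sum(pair_counts.values())
    if total == 0:
        raise ValueError("no co-locations in the table")
    true = sum(n for (x, y), n in pair_counts.items()
               if frozenset((x, y)) in WATSON_CRICK)
    false = total - true
    return true, false, true / total


# Made-up counts chosen so that T/F = 1.4.
counts = {("a", "t"): 7, ("g", "c"): 7, ("a", "a"): 4, ("g", "t"): 6}
t, f, spec = dna_specificity(counts)
print(t, f, round(t / f, 2), round(100 * spec, 1))
```

Repeating this calculation for different detection radii and neighbor-exclusion settings is one way to look for the parameter combination that maximizes specificity on a dsDNA structure.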
It is more difficult to find optimal SeqX parameters for studying residue co-locations within and between proteins. In contrast to DNA, it is not known which (if any) amino acid pairs represent specific residue co-locations. Furthermore, some protein structures are very compact. In alpha helical proteins, for example, many amino acid neighbors may interfere with the specificity of the detection (Figure 9), and more than one neighbor must be excluded to improve it.
We found that a detection radius of 5–9 Å combined with exclusion of +/-8 neighbors gives the best results for analyzing alpha helical protein structures.
A true specificity estimate is not possible for protein sequences (not even in receptor-ligand structures), because amino acids are not known to be complementary to each other. Therefore the frequency of amino acid co-locations found by SeqX (preferentially in alpha helical proteins) was compared with residue co-location frequencies from the literature [5, 6, 8]. Our results showed a highly significant correlation with the data from the literature (p < 0.0001, n = 210) (Figure 10).
A cautious, preliminary estimate is nevertheless possible for the specificity of detected residue co-locations in protein structures. It is known from physico-chemical studies that some amino acids attract while others repel each other. The known physico-chemical laws suggest that pair formation (co-location) is probably preferred between amino acids with similar hydrophobicity or opposite charge, while pair formation between amino acids with different hydrophobicity or similar charge is strongly disfavored.
To test this assumption we generated a pool of artificial random protein sequences by translating randomized nucleic acid sequences. The nucleic acids contained equal amounts of each nucleotide base (4 × 25%) and, in this way, the average frequency of amino acids in the translated artificial proteins became very similar to the amino acid frequency of the entire human proteome.
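A procedure of this kind might look like the following Python sketch. It uses the standard genetic code. Shuffling a sequence with exactly equal base counts and dropping stop codons during translation are assumptions made for illustration; the text does not say how these details were handled.

```python
import random
from collections import Counter

BASES = "tcag"
# Standard genetic code, codons ordered by base 1, 2, 3 over t, c, a, g.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def random_nucleic_acid(copies_per_base: int, rng: random.Random) -> str:
    """Return a shuffled sequence with exactly 25% of each base."""
    bases = list(BASES * copies_per_base)
    rng.shuffle(bases)
    return "".join(bases)


def translate(seq: str) -> str:
    """Translate codon by codon in frame 0; stop codons are skipped (assumption)."""
    protein = []
    for pos in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[pos:pos + 3]]
        if amino_acid != "*":
            protein.append(amino_acid)
    return "".join(protein)


rng = random.Random(0)            # fixed seed so the run is repeatable
dna = random_nucleic_acid(300, rng)
protein = translate(dna)
print(len(dna), len(protein))
print(Counter(protein).most_common(5))
```

Because the genetic code assigns different numbers of codons to different amino acids, uniform base composition yields a non-uniform amino acid composition, which is what makes the comparison with a real proteome possible.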
The residue co-locations within and between these sequences are determined by statistical laws if we assume that the spatial mobility of the residues in these proteins is free and that the residues are independent of each other. The calculated probability of a residue co-location ($P_{ab}$) is $P_{ab} = n_a n_b / T^2$ with $T = n_a + n_b + \dots + n_{20}$, where $n$ is the number of a given amino acid. The calculated relative frequency of a given co-locating pair ($C_{ab}$) is proportional to $P_{ab}$ and may be calculated as $C_{ab} = 100 \cdot P_{ab} / (P_{ab} + \dots + P_{xy})$, where x and y indicate any of the 20 possible amino acids and the number of xy pairs is 400.
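The calculation can be made concrete with a short Python sketch. The function name is hypothetical, and only residues that actually occur in the input are tabulated, so a full 20-letter pool yields the 400 ordered pairs mentioned above.

```python
from collections import Counter
from itertools import product
from typing import Dict, Iterable, Tuple


def expected_pair_frequencies(sequences: Iterable[str]) -> Dict[Tuple[str, str], float]:
    """Relative co-location frequency (in %) expected for random contacts.

    P_ab = n_a * n_b / T^2, where n is the count of each residue and T the
    total residue count; C_ab = 100 * P_ab / (sum of P over all pairs).
    """
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    p = {(a, b): counts[a] * counts[b] / total ** 2
         for a, b in product(sorted(counts), repeat=2)}
    norm = sum(p.values())
    return {pair: 100 * value / norm for pair, value in p.items()}


# Toy pool with two residue types: three A and one B.
freqs = expected_pair_frequencies(["AAAB"])
for pair, c in freqs.items():
    print(pair, c)
```

When every ordered pair is included, the probabilities sum to 1, so the relative frequency is simply 100 times the pair probability. The explicit normalization matters only if some pairs are left out of the sum.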
The relative frequency of physico-chemically favored co-locations is significantly higher (and the relative frequency of unfavored co-locations is significantly lower) in real protein structures (determined by SeqX) than calculated for random interactions (p < 0.001, n = 80 and n = 10, respectively) (Figure 11).
This example indicates that the number of false positive (unfavored) co-locations is about 20% and that the specificity of the SeqX method for proteins might be as high as ~80%. However, this is a very crude estimate, because the number of true co-locations is not known with certainty.
Discussion
Understanding the nature of the specificity of macromolecular interactions is a major challenge in bioinformatics. We have previously provided evidence supporting the view that some degree of specificity already exists at the residue level [3]. We therefore decided to continue our studies with frequency analyses of residue co-locations in nucleoprotein structures. The SeqX tool is specifically designed for this purpose. The 2D Residue Contact Map is a simple and easy-to-understand display of nucleic acid and protein structures. Some very sophisticated analytical tools also incorporate this feature, such as MOLTALK [11], STING Millennium [12], STRIDE [13], MolSurfer [14] and MOLPROBITY [15]. The major advantage of the 2D map is its simplicity. Using 3D tools effectively and learning "3D thinking" usually requires lengthy training, which is often not affordable for general bioinformaticians. We have further developed the concept of the Residue Contact Map and added new features that are not present in existing tools. These features are:
1. The option to choose different backbone atoms (in addition to the conventional Calpha and C1' atoms);
2. The option to exclude neighbor residues and thereby improve the specificity of the method;
3. The direct connection to a Residue Contact Table, which automatically provides a basic statistical analysis of the residue co-locations.
It is expected that statistical analyses of residue co-locations in protein and nucleic acid sequences will provide further insight into the rules of macromolecular interactions. The ultimate goal of these types of studies is to find short "complementary" or "compatible" sequences/motifs for specific nucleic acid – protein and protein – protein interactions, similar to the well-known Watson-Crick rules of specific nucleic acid – nucleic acid contacts.
It is well known that the interactions between side chains are the most important in studies of protein interactions, protein engineering and drug design. However, the current SeqX program is a general-purpose tool (for nucleic acids as well as proteins) for statistical analysis and visualization of whole-residue co-locations, and it does not pay particular attention to side chains or to the pattern of side chain interactions. This does not limit the usefulness of the tool for its original purpose: any significant residue co-locations (i.e. those that differ from random) are necessarily caused by the side chains ('R' in amino acids, 'bases' in nucleic acids), because these are the variable elements of the structures. A future implementation might nevertheless focus on the analysis of side chain to side chain co-locations and examine whether this improves the specificity of the tool.
Conclusion
SeqX is a simple, easy-to-use, specialized tool for visualization and statistical analysis of protein and nucleic acid residue co-locations. It was developed specifically to study known and novel specific residue interactions.
References
Chen Y, Kortemme T, Robertson T, Baker D, Varani G: A new hydrogen-bonding potential for the design of protein-RNA interactions predicts specific contacts and discriminates decoys. Nucleic Acids Res 2004, 32: 5147–5162. 10.1093/nar/gkh785
Mandel-Gutfreund Y, Schueler O, Margalit H: Comprehensive analysis of hydrogen bonds in regulatory protein DNA-complexes: In search of common principles. J Mol Biol 1995, 253: 370–382. 10.1006/jmbi.1995.0559
Biro JC, Biro JMK: Frequent occurrence of recognition Site-like sequences in the restriction endonucleases. BMC Bioinformatics 2004, 5: 30. 10.1186/1471-2105-5-30
Nair D, Fischer D, Jernigan R, Wolfson HJ, Nussinov R: Amino acid pair interchanges at spatially conserved locations. J Mol Biol 1996, 256: 924–938. 10.1006/jmbi.1996.0138
Kumarevel TS, Gromiha MM, Ponnuswamy MN: Distribution of amino acid residues and residue-residue contacts in molecular chaperones. Prep Biochem & Biotechnol 2001, 31: 163–183. 10.1081/PB-100103382
Glaser F, Steinberg DM, Vakser IA, Ben-Tal N: Residue frequencies and pairing preferences at protein-protein interfaces. Proteins: 2001, 43: 89–102. 10.1002/1097-0134(20010501)43:2<89::AID-PROT1021>3.0.CO;2-H
Azarya-Sprinzak E, Naor D, Wolfson HJ, Nussinov R: Interchanges of spatially neighboring residues in structurally conserved environments. Protein Engineering 1997, 10: 1109–1122. 10.1093/protein/10.10.1109
Eilers M, Patel AP, Liu W, Smith O: Comparison of Helix Interactions in membrane and soluble alpha-bundle proteins. Biophysical Journal 2002, 82: 2720–2736.
Accelrys, San Diego, CA, Modeling/Simulation Products, Quanta; 2005.
Singer MS, Vriend G, Bywater RP: Prediction of protein residue-derived likelihood matrix. Protein Engineering 2002, 15: 721–725. 10.1093/protein/15.9.721
Diemand AV, Scheib H: iMolTalk: an interactive, internet-based protein structure analysis server. Nucleic Acids Res 2004, 32: W512–6. 10.1093/nar/gkh124
Neshich G, Togawa RC, Mancini AL, Kuser PR, Yamagishi ME, Pappas G, Torres WV, Fonseca e Campos T, Ferreira LL, Luna FM, Oliveira AG, Miura RT, Inoue MK, Horita LG, de Souza DF, Dominiquini F, Alvaro A, Lima CS, Ogawa FO, Gomes GB, Palandrani JF, dos Santos GF, de Freitas EM, Mattiuz AR, Costa IC, de Almeida CL, Souza S, Baudet C, Higa RH: STING Millennium: A web-based suite of programs for comprehensive and simultaneous analysis of protein structure and sequence. Nucleic Acids Res 2003, 31: 3386–92. 10.1093/nar/gkg578
Heinig M, Frishman D: STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res 2004, 32: W500–2.
Gabdoulline RR, Wade RC, Walther D: MolSurfer: A macromolecular interface navigator. Nucleic Acids Res 2003, 31: 3349–51. 10.1093/nar/gkg588
Davis IW, Murray LW, Richardson JS, Richardson DC: MOLPROBITY: structure validation and all-atom contact analysis for nucleic acids and their complexes. Nucleic Acids Res 2004, 32: W615–9.
Acknowledgements
The general support of Z. Benyo and B. Benyo is greatly appreciated. Grants were provided by the Homulus Foundation (Stockholm, Sweden).
Authors' contributions
JCB designed and tested the tool, and wrote this article. GF implemented the software. GF won the first prize of the First Hungarian George Gamow Competition and Fellowship in 2004 for his contribution.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Biro, J.C., Fördös, G. SeqX: a tool to detect, analyze and visualize residue co-locations in protein and nucleic acid structures. BMC Bioinformatics 6, 170 (2005). https://doi.org/10.1186/1471-2105-6-170
DOI: https://doi.org/10.1186/1471-2105-6-170