Predicting disease-associated substitution of a single amino acid by analyzing residue interactions
© Li et al; licensee BioMed Central Ltd. 2011
Received: 10 June 2010
Accepted: 12 January 2011
Published: 12 January 2011
The rapid accumulation of data on non-synonymous single nucleotide polymorphisms (nsSNPs, also called SAPs) should allow us to further our understanding of the underlying disease-associated mechanisms. Here, we use complex networks to study the role of an amino acid in both local and global structures and determine the extent to which disease-associated and polymorphic SAPs differ in terms of their interactions with other residues.
We found that SAPs can be well characterized by network topological features. Mutations are probably disease-associated when they occur at a site with a high centrality value and/or high degree value in a protein structure network. We also discovered that study of the neighboring residues around a mutation site can help to determine whether the mutation is disease-related or not. We compiled a dataset from the Swiss-Prot variant pages and constructed a model to predict disease-associated SAPs based on the random forest algorithm. The values of total accuracy and MCC were 83.0% and 0.64, respectively, as determined by 5-fold cross-validation. With an independent dataset, our model achieved a total accuracy of 80.8% and MCC of 0.59, respectively.
The satisfactory performance suggests that network topological features can be used as quantification measures to determine the importance of a site on a protein, and this approach can complement existing methods for prediction of disease-associated SAPs. Moreover, the use of this method in SAP studies would help to determine the underlying linkage between SAPs and diseases through extensive investigation of mutual interactions between residues.
Genetic variation is a major driving force in the evolution of organisms. In individuals, specific genetic mutations such as single nucleotide polymorphisms (SNPs) can be deleterious and cause disease. The human genome project has yielded massive amounts of data on human SNPs, and this information can be used to further investigate human diseases. It is estimated that the human genome contains 10 million SNP sites. As a major repository of human SNPs, the NCBI dbSNP database contains ~25 million human entries in the release of build 130. The annotation of SNPs is attracting a great deal of attention. Non-synonymous SNPs (nsSNPs), also referred to as single amino acid polymorphisms (SAPs), are SNPs that cause amino acid substitutions, and these are believed to be directly related to diseases. Thus far, only a small proportion of SAPs have been associated with disease. To date, ~20,000 non-synonymous SNPs are available with explicit annotation in the Swiss-Prot database [3, 4]. Therefore, it is desirable to develop effective methods for identifying disease-related amino acid substitutions.
Several computational models have been developed for this purpose. Evolutionary information is commonly considered to be the most important feature for such a prediction task. Based on sequence homology, one of the earliest predictors, SIFT, was developed by Ng and Henikoff [6, 7]. The PANTHER database was designed based on family Hidden Markov Models (HMMs) to determine the likelihood that a substitution affects protein function. PolyPhen [9–11] showed that the selection pressure against deleterious SNPs depended on the molecular function of the proteins. Sequence and structural attributes were also incorporated in many studies. Satisfactory results were obtained by Ferrer-Costa et al. using mutation matrices, amino acid properties, and sequence potentials. By using attributes derived from other tools, an automated computational pipeline was constructed to annotate disease-associated nsSNPs. Many other models have been developed based on this combination strategy [14–21]. Saunders and Baker evaluated the contributions of several structural features and evolutionary information in predicting deleterious mutations. Wang and Moult undertook a detailed investigation of SNPs in which they studied the effects of the mutations on molecular function. Recently, Mort et al., Li et al., and Carter et al. functionally profiled human amino acid substitutions. They found a significant difference between deleterious and polymorphic variants in terms of both structural and functional disruption. Yue et al. [27–29] performed comprehensive studies on the impact of single amino acid substitutions on protein structure and stability. In these studies, stability change was also regarded as an important factor that contributed to dysfunction. Detailed studies were carried out by Reumers et al. and Bromberg et al., in which the extent of the functional effect of a mutation was correlated to its effect on protein stability.
Wang et al. and Yue et al. showed that the functional impacts of a mutation are closely related to its protein structural context. Recently, Alexander et al. showed how the fold and function of a protein are altered by mutations. They observed a conformational switch between two different folds triggered by a single amino acid substitution, which directly demonstrated the dependence of protein structure and function on amino acid interactions. Therefore, the challenge, especially when annotations on the functional role of a residue are lacking, is how to incorporate such useful features for detecting disease-associated mutations. To address this, a complex network was employed in our study to depict protein structure.
Owing to their potential for systematic analysis, complex networks have been widely used in proteomics. This method can also be used to represent a protein structure as a network (which we call a protein structure network, PSN) in which the vertices are the residues and the edges are their interactions. This provides novel insight into protein folding mechanisms, stability, and function. Greene et al. and Bagler et al. described the small-world and even scale-free properties of such networks, which were independent of the protein structural class. Vendruscolo et al. and Dokholyan et al. determined that a limited set of hub vertices with large connectivity plays a key role in protein folding [35–37]. In another study, hubs were defined as residues with more than four links, and these brought together different secondary structure elements that contributed to both protein folding and stability. All these studies suggest that the PSN facilitates the systematic analysis of residue interactions both locally and globally. The PSN also has the advantage of capturing the role of a residue in protein structure and function.
Using this information, Cheng et al. developed a solely structure-based approach named Bongo to predict disease-associated SAPs and obtained a satisfactory positive predictive value. Their study emphasized that the functional essentiality of a site is closely correlated to its role in maintaining protein structure. Their study showed that PSN should be capable of detecting polymorphic mutations. However, their method performed poorly in detecting disease-associated mutations, which was believed to be due to the inability of Bongo to identify functional roles of the residue. In this study, we demonstrated that PSN can also perform well in predicting disease-associated mutations.
We carried out a comprehensive analysis of the network properties of mutations by using a dataset compiled from Swiss-Prot. We tried to determine how disease-associated variants differ from polymorphic variants in terms of network topological features. Four well-established network topological features (degree, clustering coefficient, betweenness, and closeness) were calculated based on protein structure networks and used to predict disease-associated SAPs. The neighborhood of the mutation was also investigated. These features offer a quantitative description of residue interactions. We compared their performance with that of conservation features. Finally, a model was constructed to predict disease-associated SAPs by combining network topological features, conservation features, and the properties of neighboring residues around a mutation (environmental features), as well as several features reported in previous studies. The satisfactory performance suggested that studying residue interactions can help to distinguish disease-associated SAPs from polymorphic SAPs.
Analysis of topological features for disease-associated and polymorphic SAPs
Four well-established network topological features (degree, clustering coefficient, betweenness, and closeness) were used to characterize disease-associated SAPs. First, an analysis was carried out to determine the extent to which disease-associated and polymorphic SAPs differ in terms of these topological features.
Comparing network topological features with conservation features
Conservation features are considered to be the most important features for predicting disease-associated SAPs. Therefore, we compared the performance of topological features and environmental features with that of widely used conservation features in predicting disease-associated SAPs. Three feature sets were constructed: a conservation feature set (f-set 1), a topological feature set (f-set 2) and a neighboring environmental feature set (f-set 3). Here, f-set 1 comprises seven elements: position-specific scores and observed percentages for the wild-type and variant residues, changes in these two measures upon mutation, and the conservation score. f-set 2 consists of the four topological features mentioned above, which are derived from the wild-type protein structure. Finally, f-set 3 consists of the topological features and conservation scores of the five most conserved neighboring residues around the SAP under study.
Table 1. Performance for each feature set by 5-fold cross-validation.
Performance of our model for prediction of disease-associated SAPs
Table 2. Performance of different methods based on an independent dataset.
Four well-established methods, SIFT, PolyPhen-2, Bongo and SAPRED, were used for benchmarking. With our independent dataset, SIFT yielded a sensitivity of 79.5% and specificity of 71.3%, while PolyPhen-2 yielded a sensitivity of 74.1% and specificity of 78.1% (Table 2). Bongo achieved a low sensitivity of 21.6% and a specificity of 84.7%, which were similar to the results reported by Cheng et al. Using the dataset compiled by Ye et al., our method achieved a sensitivity of 90.5% and specificity of 66.5% (ntree = 300 and mtry = 3). The corresponding ACC and MCC were 82.3% and 0.60, respectively. SAPRED achieved a higher sensitivity of 93.8% and a lower specificity of 61.3%, yielding an ACC of 82.6% and MCC of 0.60. It should be noted that in this study, network features were introduced to depict an SAP instead of conventional structural features such as nearby functional sites, secondary structure, and hydrogen bonds. The satisfactory performance suggests that network features also capture the information provided by structural features and that this method can complement existing methods for predicting disease-associated SAPs.
From a biological viewpoint, mutual restraint of residues is crucial for the correct functioning of a proper structure. Network topological features were adopted in the present study to describe both local and global residue interactions: degree and clustering coefficient were used for the former, and closeness and betweenness were used for the latter. This can be understood from the fundamental aspects of protein structure. A particular local structure is usually maintained by the cooperation of several residues. In this case, residues with more neighbors would naturally be more crucial in residue interactions and would exert a greater influence on the local structure. In this sense, the fragility upon residue substitution may be related to the density of the local structure. High betweenness is expected for key residues that act as a bridge in protein structure, such as those that bring together two different secondary structures. It was reported that closeness could indicate the functional role of a residue, so it is not surprising that high closeness values were observed for disease-associated SAPs. It is therefore reasonable to use these features to depict the structural/functional role of a residue.
Moreover, it was observed that topologically important neighbors are likely to be more conserved. It is reasonable to expect that an SAP close to structurally or functionally key residues would be more likely to be associated with disease. This is why several studies have designed features to indicate the distance of an SAP to the functional site [9, 10, 21]. In this sense, these network-based environmental features can depict the environment in which an SAP lies.
We also compared network features with widely used conservation features. In contrast to conservation features, network features characterize the SAP in such a manner that its interactions with other residues in local/global protein structure are revealed. The performance of these features further proves their ability to distinguish disease-associated SAPs from polymorphic SAPs from the viewpoint of the roles of the focused residues in proteins. Moreover, the performance of the environmental feature set demonstrates that a dysfunctional mutation is closely correlated to the environment it lies in.
We compared our method with several well established approaches. The satisfactory performance of our method suggests that network features indicate the importance of a position in the context of the entire protein. It is therefore reasonable to believe that studying SAPs by analyzing residue interactions in a protein is both feasible and promising.
Residues are in contact with each other, but their positions and conformations are restricted to ensure the maintenance of proper structure and function. Here, we represented a protein structure as a network, which allowed us to study the correlation between residues. Our results suggest that network topological features can appropriately reflect the role of a disease-associated SAP in both local and global structures by exploiting its correlation with other residues in a protein. The good performance obtained with the environmental feature set proves the feasibility of our method in detecting a disease-associated SAP by investigating the properties of its neighboring residues.
Several types of interactions are involved in a protein structure, including hydrophobic, hydrogen bond, van der Waals and electrostatic interactions. These may play specific roles in maintaining protein structure or function. It is still a challenge to represent such interactions in a protein structure network, although the PSN has shown its advantage in revealing correlations between residues. It is anticipated that a PSN with more refined residue interactions should reflect the structural/functional role of a residue in a protein more accurately. We will conduct further analysis in our future studies.
We compiled an SAP dataset from the Swiss-Prot variant page [3, 4]. To construct protein structure networks, only variants that mapped to 3D structures were considered. Here, we extracted the wild-type protein structures from the ModSNP database on the ExPASy website. We then removed problematic structures with incorrect residue substitutions or erroneous position records. The final dataset contained 6527 SAPs from 1094 proteins, including 3953 disease-associated and 2574 polymorphic SAPs, among which 127 proteins contained both disease-associated and polymorphic SAPs. An independent dataset was randomly selected, which consisted of 218 proteins with 696 disease-associated and 456 polymorphic SAPs (see Additional file 1). It was used as a benchmark for evaluating our model as well as for comparing it with other published methods. The remaining 876 proteins with 3257 disease-associated and 2118 polymorphic SAPs were used to perform 5-fold cross-validation (see Additional file 2).
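The hold-out step described above can be sketched as follows. This is only an illustration under the assumption that the random selection was made at the protein level (so that no protein contributes SAPs to both sets); the function name `split_by_protein`, the record layout and the toy identifiers are hypothetical and are not taken from the NetSAP distribution.

```python
import random


def split_by_protein(saps, n_holdout, seed=0):
    """Hold out every SAP that belongs to n_holdout randomly chosen proteins."""
    proteins = sorted({sap["protein"] for sap in saps})
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    holdout = set(rng.sample(proteins, n_holdout))
    independent = [sap for sap in saps if sap["protein"] in holdout]
    training = [sap for sap in saps if sap["protein"] not in holdout]
    return training, independent


# Toy records with hypothetical identifiers (not entries from the real dataset).
toy_saps = [
    {"protein": "P1", "position": 10, "label": "disease"},
    {"protein": "P1", "position": 42, "label": "polymorphism"},
    {"protein": "P2", "position": 7, "label": "disease"},
    {"protein": "P3", "position": 99, "label": "polymorphism"},
    {"protein": "P4", "position": 5, "label": "disease"},
]
training, independent = split_by_protein(toy_saps, n_holdout=1)
```

Splitting by protein rather than by individual SAP keeps variants of the same protein together, which avoids sharing structural context between the training and benchmark sets.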
The random forest package in R was employed for model training. The prediction models are provided in the additional files (see Additional file 3). Random forest is an ensemble classifier based on decision trees [42, 43], which has been commonly used for classification and regression tasks. Two parameters, ntree and mtry, are crucial in this algorithm: ntree is the number of trees to grow, and mtry is the number of variables randomly sampled as candidates when splitting a node of a tree. In this study, they were optimized using a grid search approach, in which the optimal ntree and mtry were determined based on 5-fold cross-validation. The random forest package also offers a module for feature evaluation in which three measures are provided: selection frequency, Gini importance and permutation importance. In this study, we used the permutation importance to distinguish informative features from uninformative features. The estimation procedure was repeated 100 times, and the averaged values were used for this measurement. Here, sensitivity, specificity, total accuracy (ACC), and Matthews correlation coefficient (MCC) were adopted for model evaluation. The formula for each measure is as follows:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP is the number of correctly predicted positive samples, TN is the number of correctly predicted negative samples, FP is the number of negative samples incorrectly predicted as positive, and FN is the number of positive samples incorrectly predicted as negative.
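As a worked illustration of these four measures, the following minimal sketch computes them from the confusion-matrix counts using their standard definitions. The function name and the example counts are hypothetical and chosen only for illustration; they are not results from this study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is reported as 0 when any marginal count is zero (a common convention).
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=80, tn=70, fp=30, fn=20)
```

Because MCC uses all four counts, it remains informative when the two classes are unbalanced, which is why it is reported alongside ACC.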
Protein structure network
In a protein, residue interactions arise from covalent and/or non-covalent bonds between atoms. For convenience, contacts are defined as follows: each residue is represented by the center of its side chain atom positions, except for glycine, for which the Cα atom is treated as the center. A contact is identified when the distance between the centers of two residues is within 6.5 Å. Several topological features were derived from such networks.
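The contact definition above can be sketched in a few lines. This is a minimal illustration, assuming that the "center" is the unweighted centroid of the side chain atom coordinates and that "within 6.5 Å" includes the boundary; the function names and the toy coordinates are hypothetical, and parsing of structure files is left out.

```python
import math
from itertools import combinations

CUTOFF = 6.5  # contact distance threshold in angstroms, as stated in the text


def residue_center(residue_name, side_chain_coords, ca_coord):
    """Centroid of the side chain atoms; glycine falls back to its C-alpha."""
    if residue_name == "GLY" or not side_chain_coords:
        return tuple(ca_coord)
    n = len(side_chain_coords)
    return tuple(sum(atom[k] for atom in side_chain_coords) / n for k in range(3))


def build_contact_network(centers, cutoff=CUTOFF):
    """Map residue id -> set of residue ids whose centers lie within cutoff."""
    adjacency = {res: set() for res in centers}
    for a, b in combinations(centers, 2):
        if math.dist(centers[a], centers[b]) <= cutoff:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Toy centers on a line (hypothetical coordinates, in angstroms).
toy_centers = {1: (0.0, 0.0, 0.0), 2: (5.0, 0.0, 0.0),
               3: (10.0, 0.0, 0.0), 4: (30.0, 0.0, 0.0)}
toy_network = build_contact_network(toy_centers)
gly_center = residue_center("GLY", [], (1.0, 2.0, 3.0))
```

The all-pairs loop is quadratic in the number of residues, which is adequate for single proteins; a spatial grid or k-d tree would be the usual replacement for very large structures.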
Four network topological features were examined in this study.
Degree. The degree of vertex i is defined as
$$\delta(i) = \sum_{j \in N} a_{i,j}$$
where $a_{i,j}$ is the number of contacts between vertices i and j, and N is the set of all vertices. Within a protein structure, the $\delta(i)$ of a residue refers to its direct connectivity to other residues and is a non-negative integer value.
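Clustering coefficient. The clustering coefficient of vertex i is defined as
$$\mathrm{CC}(i) = \frac{2 e_i}{\delta_i (\delta_i - 1)}$$
where $e_i$ is the number of edges that exist between the neighbors of vertex i, $\delta_i$ is the degree of vertex i, and $\delta_i (\delta_i - 1)/2$ is the maximum possible number of edges between the neighbors of vertex i.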
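Closeness. The closeness of vertex i is defined as
$$\mathrm{Clo}(i) = \frac{N - 1}{\sum_{j \neq i} d_{i,j}}$$
where N is the total number of vertices and $d_{i,j}$ is the length of the shortest path between vertices i and j. The closeness score indicates the status of a residue in the entire protein structure.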
Betweenness. The betweenness of vertex i is defined as
$$B(i) = \sum_{j \neq i \neq k} \frac{n_{j,k}(i)}{n_{j,k}}$$
where $n_{j,k}$ is the number of all geodesics (shortest paths) linking vertices j and k, and $n_{j,k}(i)$ is the number of those shortest paths that pass through vertex i. Betweenness is sensitive to protein length; to avoid this bias, the feature was scaled by the protein length. This parameter was reported to perform well in identifying hot spots in protein interactions. For more detailed descriptions of these parameters, please refer to Newman and Watts [45, 46].
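The four topological features can be computed on an unweighted contact network as sketched below. This is an illustrative sketch rather than the NetSAP implementation: it assumes the standard definitions (closeness as (N - 1) divided by the summed shortest-path lengths, betweenness via Brandes' algorithm with each unordered residue pair counted once), and it assumes that "scaled by the protein length" means division by the number of residues. All function names and the toy graph are hypothetical.

```python
from collections import deque


def degree(adj, i):
    """Number of residues in direct contact with residue i."""
    return len(adj[i])


def clustering_coefficient(adj, i):
    """Fraction of possible edges among the neighbours of i that exist."""
    nbrs = list(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention: report 0 when the ratio is undefined
    links = sum(1 for a in range(k) for b in range(a + 1, k)
                if nbrs[b] in adj[nbrs[a]])
    return 2.0 * links / (k * (k - 1))


def closeness(adj, i):
    """(N - 1) divided by the summed shortest-path lengths from i."""
    dist = {i: 0}
    queue = deque([i])
    while queue:  # breadth-first search on the unweighted graph
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())  # unreachable residues are ignored
    return (len(adj) - 1) / total if total else 0.0


def betweenness(adj):
    """Brandes' algorithm; each unordered pair (j, k) is counted once."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = {v: 0 for v in adj}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):  # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: b / 2.0 for v, b in score.items()}  # undirected: halve


# Toy network: a triangle A-B-C with a pendant residue D attached to C.
toy_adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
raw_betweenness = betweenness(toy_adj)
features = {
    r: {
        "degree": degree(toy_adj, r),
        "clustering": clustering_coefficient(toy_adj, r),
        "closeness": closeness(toy_adj, r),
        # Assumption: scaling by protein length = dividing by residue count.
        "betweenness": raw_betweenness[r] / len(toy_adj),
    }
    for r in toy_adj
}
```

In the toy network, residue C is the only bridge to D, so it is the only residue with non-zero betweenness, and it also has the highest degree and closeness.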
Other sequence and structural features employed in this study include sequence conservation, point accepted mutation (PAM) 250, solvent accessible area, aggregation propensities and HLA family. Solvent accessibility was reported to be an important feature in SAP prediction. We derived the solvent accessible area of each residue by using DSSP.
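The conservation score at position i was calculated as the entropy of the amino acid distribution at that position:
$$S(i) = -\sum_{j} p_{i,j} \log p_{i,j}$$
where $p_{i,j}$ is the frequency of amino acid j at position i. A lower value indicates lower entropy (a more conserved position) and vice versa.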
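An entropy-based conservation score of this kind can be sketched as follows. The logarithm base (2 here) and the normalization of observed percentages to frequencies are assumptions made for illustration, since the text does not state them; the function name is hypothetical.

```python
import math


def positional_entropy(observed, base=2):
    """Shannon entropy of the amino acid distribution at one position.

    `observed` holds non-negative counts or percentages for each amino acid;
    they are normalized to frequencies before the entropy is computed.
    """
    total = float(sum(observed))
    if total == 0:
        return 0.0
    freqs = [value / total for value in observed]
    # Terms with zero frequency contribute nothing (p * log p -> 0).
    return 0.0 - sum(p * math.log(p, base) for p in freqs if p > 0)


fully_conserved = positional_entropy([100] + [0] * 19)
two_residues = positional_entropy([50, 50] + [0] * 18)
uniform = positional_entropy([5] * 20)
```

With base 2, the score ranges from 0 for a fully conserved position to log2(20), about 4.32, for a position where all 20 amino acids are equally frequent.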
In our method, the probability of the substitution according to PAM250 was used to score a mutation. For a more detailed description, please refer to Dayhoff et al. In previous studies, aggregation propensity was thought to be a significant factor in disease susceptibility. Therefore, this feature was adopted here. Aggregation propensities for wild-type and variant amino acids were taken from TANGO. The change in aggregation propensity of a fragment upon a single variant was also taken into account. Moreover, following the approach described by Ye et al., a feature was employed to indicate whether the protein in which an SAP is located belongs to the HLA family.
Based on the structure network, neighboring residues were extracted as those in direct contact with an SAP site, i.e., those whose distance to the residue of interest was no more than 6.5 Å. We investigated the properties of neighboring residues in terms of network topological features and conservation scores. Our analysis indicated that the five most conserved neighboring residues can appropriately reflect the environment around a mutation. For SAP sites with fewer than five neighbors, zeros were added. Thus, the environmental feature of an SAP site could be encoded by a 25-dimensional vector.
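The encoding of the environmental feature can be sketched as below. The sketch assumes that the 25 dimensions are the four topological features plus the conservation score for each of five neighbors, that "most conserved" means the lowest conservation (entropy) score, and that neighbors are concatenated in order of conservation; the function name, key names and example values are hypothetical.

```python
FEATURE_KEYS = ("degree", "clustering", "closeness", "betweenness", "conservation")


def environment_vector(neighbors, n_top=5):
    """Concatenate the features of the n_top most conserved neighbors.

    `neighbors` is a list of dicts keyed by FEATURE_KEYS. A lower
    conservation score is treated as more conserved (assumption).
    """
    ranked = sorted(neighbors, key=lambda res: res["conservation"])[:n_top]
    vector = []
    for res in ranked:
        vector.extend(float(res[key]) for key in FEATURE_KEYS)
    # Pad with zeros when the site has fewer than n_top neighbors.
    vector.extend([0.0] * (n_top * len(FEATURE_KEYS) - len(vector)))
    return vector


# Two hypothetical neighbors of an SAP site (values are illustrative only).
example_neighbors = [
    {"degree": 8, "clustering": 0.4, "closeness": 0.30,
     "betweenness": 0.02, "conservation": 1.5},
    {"degree": 10, "clustering": 0.5, "closeness": 0.35,
     "betweenness": 0.05, "conservation": 0.2},
]
env = environment_vector(example_neighbors)
```

Zero padding keeps the vector length fixed at 25, so SAP sites with different numbers of neighbors can be fed to the same classifier.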
Availability and requirements
Project name: NetSAP;
Project home page: http://cic.scu.edu.cn/bioinformatics/NetSAP.zip;
Operating system(s): Linux and Microsoft Windows;
Programming language: R language (Version 2.7.2);
Abbreviations
SNPs: single nucleotide polymorphisms
nsSNPs: non-synonymous single nucleotide polymorphisms
PSN: protein structure network.
We would like to thank ZQ Ye for providing their dataset. We gratefully acknowledge the anonymous reviewers for their constructive comments on this article. This study was supported by the Natural Science Foundation of China (20972103).
1. Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu FL, Yang HM, Ch'ang LY, Huang W, Liu B, Shen Y, et al.: The international HapMap project. Nature 2003, 426(6968):789–796. 10.1038/nature02168
2. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K: dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001, 29(1):308–311. 10.1093/nar/29.1.308
3. Yip YL, Scheib H, Diemand AV, Gattiker A, Famiglietti LM, Gasteiger E, Bairoch A: The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat 2004, 23(5):464–470. 10.1002/humu.20021
4. Yip YL, Famiglietti M, Gos A, Duek PD, David FPA, Gateau A, Bairoch A: Annotating single amino acid polymorphisms in the UniProt/Swiss-Prot knowledgebase. Hum Mutat 2008, 29(3):361–366. 10.1002/humu.20671
5. Lee W, Yue P, Zhang ZM: Analytical methods for inferring functional effects of single base pair substitutions in human cancers. Hum Genet 2009, 126(4):481–498. 10.1007/s00439-009-0677-y
6. Ng PC, Henikoff S: Predicting deleterious amino acid substitutions. Genome Res 2001, 11(5):863–874. 10.1101/gr.176601
7. Ng PC, Henikoff S: Accounting for human polymorphisms predicted to affect protein function. Genome Res 2002, 12(3):436–446. 10.1101/gr.212802
8. Thomas PD, Campbell MJ, Kejariwal A, Mi HY, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A: PANTHER: A library of protein families and subfamilies indexed by function. Genome Res 2003, 13(9):2129–2141. 10.1101/gr.772403
9. Ramensky V, Bork P, Sunyaev S: Human non-synonymous SNPs: server and survey. Nucleic Acids Res 2002, 30(17):3894–3900. 10.1093/nar/gkf493
10. Sunyaev S, Ramensky V, Koch I, Lathe W, Kondrashov AS, Bork P: Prediction of deleterious human alleles. Hum Mol Genet 2001, 10(6):591–597. 10.1093/hmg/10.6.591
11. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov AS, Sunyaev SR: A method and server for predicting damaging missense mutations. Nat Methods 2010, 7(4):248–249. 10.1038/nmeth0410-248
12. Ferrer-Costa C, Orozco M, de la Cruz X: Sequence-based prediction of pathological mutations. Proteins 2004, 57(4):811–819. 10.1002/prot.20252
13. Karchin R, Diekhans M, Kelly L, Thomas DJ, Pieper U, Eswar N, Haussler D, Sali A: LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics 2005, 21(12):2814–2820. 10.1093/bioinformatics/bti442
14. Bao L, Zhou M, Cui Y: nsSNPAnalyzer: identifying disease-associated nonsynonymous single nucleotide polymorphisms. Nucleic Acids Res 2005, 33: W480-W482. 10.1093/nar/gki372
15. Bao L, Cui Y: Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics 2005, 21(10):2185–2190. 10.1093/bioinformatics/bti365
16. Dobson RJ, Munroe PB, Caulfield MJ, Saqi MAS: Predicting deleterious nsSNPs: an analysis of sequence and structural attributes. BMC Bioinformatics 2006, 7:217. 10.1186/1471-2105-7-217
17. Bromberg Y, Rost B: SNAP: predict effect of non-synonymous polymorphisms on function. Nucleic Acids Res 2007, 35(11):3823–3835. 10.1093/nar/gkm238
18. Ferrer-Costa C, Gelpi JL, Zamakola L, Parraga I, de la Cruz X, Orozco M: PMUT: a web-based tool for the annotation of pathological mutations on proteins. Bioinformatics 2005, 21(14):3176–3178. 10.1093/bioinformatics/bti486
19. Kaminker JS, Zhang Y, Watanabe C, Zhang ZM: CanPredict: a computational tool for predicting cancer-associated missense mutations. Nucleic Acids Res 2007, 35: W595-W598. 10.1093/nar/gkm405
20. Kaminker JS, Zhang Y, Waugh A, Haverty PM, Peters B, Sebisanovic D, Stinson J, Forrest WF, Bazan JF, Seshagiri S, et al.: Distinguishing cancer-associated missense mutations from common polymorphisms. Cancer Res 2007, 67(2):465–473. 10.1158/0008-5472.CAN-06-1736
21. Ye ZQ, Zhao SQ, Gao G, Liu XQ, Langlois RE, Lu H, Wei LP: Finding new structural and sequence attributes to predict possible disease association of single amino acid polymorphism (SAP). Bioinformatics 2007, 23(12):1444–1450. 10.1093/bioinformatics/btm119
22. Saunders CT, Baker D: Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J Mol Biol 2002, 322(4):891–901. 10.1016/S0022-2836(02)00813-6
23. Wang Z, Moult J: SNPs, protein structure, and disease. Hum Mutat 2001, 17(4):263–270. 10.1002/humu.22
24. Mort M, Evani US, Krishnan VG, Kamati KK, Baenziger PH, Bagchi A, Peters BJ, Sathyesh R, Li BA, Sun YN, et al.: In silico functional profiling of human disease-associated and polymorphic amino acid substitutions. Hum Mutat 2010, 31(3):335–346. 10.1002/humu.21192
25. Li B, Krishnan VG, Mort ME, Xin FX, Kamati KK, Cooper DN, Mooney SD, Radivojac P: Automated inference of molecular mechanisms of disease from amino acid substitutions. Bioinformatics 2009, 25(21):2744–2750. 10.1093/bioinformatics/btp528
26. Carter H, Chen SN, Isik L, Tyekucheva S, Velculescu VE, Kinzler KW, Vogelstein B, Karchin R: Cancer-specific high-throughput annotation of somatic mutations: computational prediction of driver missense mutations. Cancer Res 2009, 69(16):6660–6667. 10.1158/0008-5472.CAN-09-1133
27. Yue P, Li ZL, Moult J: Loss of protein structure stability as a major causative factor in monogenic disease. J Mol Biol 2005, 353(2):459–473. 10.1016/j.jmb.2005.08.020
28. Yue P, Melamud E, Moult J: SNPs3D: Candidate gene and SNP selection for association studies. BMC Bioinformatics 2006, 7:166. 10.1186/1471-2105-7-166
29. Yue P, Moult J: Identification and analysis of deleterious human SNPs. J Mol Biol 2006, 356(5):1263–1274. 10.1016/j.jmb.2005.12.025
30. Reumers J, Schymkowitz J, Rousseau F: Using structural bioinformatics to investigate the impact of non synonymous SNPs and disease mutations: scope and limitations. BMC Bioinformatics 2009, 10(Suppl 8):S9. 10.1186/1471-2105-10-S8-S9
31. Bromberg Y, Rost B: Correlating protein function and stability through the analysis of single amino acid substitutions. BMC Bioinformatics 2009, 10(Suppl 8):S8. 10.1186/1471-2105-10-S8-S8
32. Alexander PA, He YA, Chen YH, Orban J, Bryan PN: A minimal sequence code for switching protein structure and function. Proc Natl Acad Sci USA 2009, 106(50):21149–21154. 10.1073/pnas.0906408106
33. Greene LH, Higman VA: Uncovering network systems within protein structures. J Mol Biol 2003, 334: 781–791. 10.1016/j.jmb.2003.08.061
34. Bagler G, Sinha S: Network properties of protein structures. Physica A 2005, 346(1–2):27–33. 10.1016/j.physa.2004.08.046
35. Vendruscolo M, Dokholyan NV, Paci E, Karplus M: Small-world view of the amino acids that play a key role in protein folding. Phys Rev E 2002, 65(6):061910. 10.1103/PhysRevE.65.061910
36. Vendruscolo M, Paci E, Dobson CM, Karplus M: Three key residues form a critical contact network in a protein folding transition state. Nature 2001, 409(6820):641–645. 10.1038/35054591
37. Dokholyan NV, Li L, Ding F, Shakhnovich EI: Topological determinants of protein folding. Proc Natl Acad Sci USA 2002, 99(13):8637–8641. 10.1073/pnas.122076099
38. Brinda KV, Vishveshwara S: A network representation of protein structures: Implications for protein stability. Biophys J 2005, 89(6):4159–4170. 10.1529/biophysj.105.064485
39. Cheng TMK, Lu YE, Vendruscolo M, Lio P, Blundell TL: Prediction by graph theoretic measures of structural effects in proteins arising from non-synonymous single nucleotide polymorphisms. PLoS Comput Biol 2008, 4(7):e1000135. 10.1371/journal.pcbi.1000135
40. Amitai G, Shemesh A, Sitbon E, Shklar M, Netanely D, Venger I, Pietrokovski S: Network analysis of protein structures identifies functional residues. J Mol Biol 2004, 344(4):1135–1146. 10.1016/j.jmb.2004.10.055
41. del Sol A, O'Meara P: Small-world network approach to identify key residues in protein-protein interaction. Proteins 2005, 58(3):672–682. 10.1002/prot.20348
42. Breiman L: Random forests. Mach Learn 2001, 45: 5–32. 10.1023/A:1010933404324
43. Liaw A, Wiener M: Classification and regression by randomForest. R News 2002, 2: 18–22.
44. Miyazawa S, Jernigan RL: Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol 1996, 256(3):623–644. 10.1006/jmbi.1996.0114
45. Newman MEJ: The structure and function of complex networks. SIAM Rev 2003, 45: 167–256. 10.1137/S003614450342480
46. Watts DJ, Strogatz SH: Collective dynamics of 'small-world' networks. Nature 1998, 393(6684):440–442. 10.1038/30918
47. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 1983, 22(12):2577–2637. 10.1002/bip.360221211
48. Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
49. Kosiol C, Goldman N: Different versions of the Dayhoff rate matrix. Mol Biol Evol 2005, 22(2):193–199. 10.1093/molbev/msi005
50. Fernandez-Escamilla AM, Rousseau F, Schymkowitz J, Serrano L: Prediction of sequence-dependent and mutational effects on the aggregation of peptides and proteins. Nat Biotechnol 2004, 22(10):1302–1306. 10.1038/nbt1012
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.