Predicting the phenotypic effects of non-synonymous single nucleotide polymorphisms based on support vector machines
© Tian et al. 2007
Received: 26 April 2007
Accepted: 16 November 2007
Published: 16 November 2007
Human genetic variations primarily result from single nucleotide polymorphisms (SNPs), which occur approximately every 1000 bases in the overall human population. The non-synonymous SNPs (nsSNPs) that lead to amino acid changes in the protein product may account for nearly half of the known genetic variations linked to inherited human diseases. One of the key problems of medical genetics today is to identify nsSNPs that underlie disease-related phenotypes in humans. Computational tools that can identify such nsSNPs would therefore enhance our understanding of genetic diseases and help predict them.
We propose a method, named Parepro (Predicting the amino acid replacement probability), to identify nsSNPs having either deleterious or neutral effects on the resulting protein function. Two independent datasets, HumVar and NewHumVar, taken from the PhD-SNP server, were used to train the model and to test the robustness of Parepro, respectively. Using a 20-fold cross validation test on the HumVar dataset, Parepro achieved a Matthews correlation coefficient (MCC) of 50% and an overall accuracy (Q2) of 76%, both of which were higher than those achieved by other methods such as PolyPhen, SIFT, and HybridMeth. Further analysis on an additional dataset (NewHumVar) using Parepro yielded similar results.
The performance of Parepro indicates that it is a powerful tool for predicting the effect of nsSNPs on protein function and would be useful for large-scale analysis of genomic nsSNP data.
Almost 90% of human genetic variations result from single nucleotide polymorphisms (SNPs). Among these, non-synonymous SNPs (nsSNPs), which result in amino acid changes, are an important source of individual variation and can result in inherited diseases and drug sensitivity [2–4]. Therefore, identifying the nsSNPs that affect protein function and relate to disease will be a challenge in the coming years [3, 5–8].
A variety of methods have been developed to identify whether an nsSNP is detrimental to protein function in vitro. Most of these methods utilize evolutionary data [3, 8–17], protein structure information [2, 18, 19], or both [2, 7, 20–22]. Ng and Henikoff [8, 16, 23] developed the software SIFT (Sorting Intolerant from Tolerant) to predict the effect of nsSNPs on protein function; SIFT is based on sequence conservation and scores from position-specific scoring matrices. Some studies [24–26] have used phylogenetics to identify functionally critical residues within a protein. The MAPP (Multivariate Analysis of Protein Polymorphism) software exploits the physicochemical variation between wild-type amino acid residues and newly introduced residues to identify nsSNPs that impair protein function. The Align-GVGD method uses both genetic biochemical variation and genetic distance between the wild-type residue and the newly introduced residue to predict the effects of an nsSNP. Some methods [2, 20–22] take advantage of three-dimensional structural information to analyze the impact of amino acid changes on protein function. Wang and Moult found that the vast majority of nsSNPs that are related to diseases affect protein stability rather than function. Specific factors that determine the stability of a protein were then used to predict the effects of nsSNPs. Chen et al. used the solvent accessibility of residues to predict deleterious mutations.
Support vector machines (SVMs) have gained popularity over other machine learning methods for interpreting biological data [28–35] because of their ability to handle noise and large datasets or input spaces effectively [36, 37]. Accordingly, several SVM-based methods [2, 7, 10, 21] have been designed to predict the effect of nsSNPs. Capriotti et al. developed a method that depends only on the evolutionary information around the nsSNP. Yue and Moult also proposed a method that uses the conservation and type of residues observed at a base change position within a protein family. Karchin et al. and Bao et al. introduced two methods based on structural and evolutionary information. The structural information mainly concerns areas in the protein that are buried, as well as the fraction polar secondary structure, solvent accessibility, z-score and buried charge. The evolutionary information mainly comprises the Hidden Markov model PHC score, the Hidden Markov model relative entropy, the SIFT score and the biochemical difference between the wild-type residue and the newly introduced residue.
Here, we propose an SVM-based method for predicting the effects of nsSNPs. This method, named Parepro (Predicting the amino acid replacement probability), uses evolutionary information surrounding an nsSNP. In addition, properties from the AAindex [39, 40] and from evolutionary information are combined to determine the dissimilarity between the wild-type and newly introduced residues. Parepro predicted the effects of nsSNPs with higher accuracy than other methods and did not depend on structural information. In this study, two independent datasets, HumVar and NewHumVar, taken from the PhD-SNP server, were used to train the model and to test the robustness of Parepro, respectively.
The prediction performance of the Parepro attribute sets when applied alone or in combination
Range of the number of homologous sequences (table columns: range of the number of homologous sequences; number of proteins within the range (%); number of mutations within the range (%))
We compared Parepro with three other predictors: HybridMeth, PolyPhen and SIFT [8, 16, 23]. HybridMeth uses the profile and sequence information surrounding a mutation. PolyPhen is based on a decision tree and takes into account several pieces of information derived from structural parameters, functional annotations, and evolutionary information. SIFT [8, 16, 23] mainly uses information from homologous sequences.
Comparison of performance between Parepro and other methods using the HumVar dataset
Comparison of performance parameters of Parepro with other methods using the NewHumVar dataset
Predicting phenotypes resulting from nsSNPs is an important aspect of post-genome biology. The present study helps advance the analysis of genetic variation and may therefore lead to a better understanding of the resulting phenotypic variations among individuals with an aim toward drug design and development [2, 7, 20, 25]. Two tests using different datasets indicated that Parepro outperformed several widely used methods.
Unlike other machine learning-based methods [10, 12, 20–22, 43, 44], Parepro was constructed from three attribute sets, RD, MI, and IE, all of which incorporate evolutionary information. In general, if the RD between the newly introduced amino acid and the residue in the mutation position has a high value, the substitution is considered to have a high probability of being deleterious [16, 18, 25]. The attribute sets MI and IE, in turn, were used to characterize the conditions at the mutation position and around the mutation position, respectively. For example, when residues surrounding a mutation were found to be conserved, the region was likely related to either function or structure [10, 27], and thus the mutation was likely to be deleterious. This information reinforced the characterization provided by RD. Moreover, the results indicated that these three attribute sets complemented one another to yield a higher overall accuracy (Q2) and Matthews correlation coefficient (MCC).
The attribute vector of Parepro did not contain structural features. Thus, it is possible that some of the information directly derived from the protein structure was ignored by Parepro. However, the lack of structural information was likely overcome by the inclusion of 50 discrete amino acid properties in the RD attribute set, thereby enhancing the efficacy of the sequence-based Parepro program.
We present an SVM-based prediction method, Parepro, which predicts the effect of nsSNPs on protein function. Comprehensive comparisons of the prediction performance on two datasets showed that Parepro, which utilizes information from the amino acids surrounding the mutation position and from the residue difference between the newly introduced amino acid and the average residue in the mutation position, outperformed several other widely used prediction methods. Moreover, Parepro was able to predict all mutations within two distinct test sets. Therefore, we anticipate that Parepro will be a useful tool for large-scale analysis of nsSNPs in genomic databases.
The prediction procedure of Parepro (Figure 1) begins by calculating the position-specific amino acid probabilities (PSAP) of a target protein that contains a corresponding nsSNP. Next, three attribute sets are constructed using the PSAP and the properties of amino acids from AAindex [39, 40]; these sets describe the residue differences (RD), the mutation position information (MI), and the information on the environment around the mutation position (IE). Finally, a combined vector consisting of 94 attributes is used to predict the effects of the nsSNPs. The attribute sets RD, MI and IE comprise 50, 23, and 21 attributes, respectively.
We used two datasets, HumVar and NewHumVar, taken from the PhD-SNP server. The HumVar dataset consisted of 21,185 different SNPs (12,944 disease-related variants and 8,241 neutral polymorphisms) obtained from 3,587 protein sequences in the Swiss-Prot database (Release 48). The NewHumVar dataset comprised SNPs obtained from the Swiss-Prot database (Release 50) after eliminating any variants also present in the HumVar dataset. After this filtering, NewHumVar consisted of 935 single amino acid mutations (149 disease-related variants and 786 neutral mutations) from 469 different proteins.
(1) PSI-BLAST with parameter -e 0.001 was run for three iterations against the Swiss-Prot database (Release 50.0) to collect sequences similar to the target protein that contained the corresponding nsSNP. The identified sequences were aligned by ClustalX [50, 51] with default parameters. The position-based sequence weight method was used to derive the weight $w_i$ of the $i$th sequence in the alignment. If no homologous sequence was selected, the weight $w_i$ of the target sequence was set to 1.0.
where $N$ is the total number of aligned sequences, $w_i$ is the weight of the $i$th sequence, the values $m = 1, 2, \dots, 20$ represent the 20 amino acids, and $m = 21$ represents a gap. If the symbol of the $i$th sequence at the column is amino acid $a_m$ ($m = 1, 2, \dots, 20$) or a gap ($m = 21$), the value of $\delta_{im}$ is 1.0; otherwise it is 0.
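The definitions above are consistent with a weighted symbol frequency of the form $c_m = \sum_{i=1}^{N} w_i \delta_{im} / \sum_{i=1}^{N} w_i$; the normalization by the total weight is an assumption, since only the symbol definitions are given here. The following minimal Python sketch illustrates one way to compute position-based sequence weights and the resulting 21-symbol frequencies for an alignment column. The function names are hypothetical, gaps are counted as an ordinary symbol when weighting (also an assumption), and the alignment is assumed to contain only the 20 standard amino acids and `-`.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SYMBOLS = AMINO_ACIDS + "-"  # the 21st symbol is the gap


def position_based_weights(alignment):
    """Position-based sequence weights, normalized to sum to 1.

    Each column contributes 1 / (k * n) to a sequence, where k is the number
    of distinct symbols in the column and n is the number of sequences that
    share this sequence's symbol in that column.
    """
    weights = [0.0] * len(alignment)
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        k = len(counts)
        for i, symbol in enumerate(column):
            weights[i] += 1.0 / (k * counts[symbol])
    total = sum(weights)
    return [w / total for w in weights]


def weighted_frequencies(alignment, weights, col):
    """Weighted frequency c_m of each of the 21 symbols in one column."""
    total = sum(weights)
    c = {symbol: 0.0 for symbol in SYMBOLS}
    for seq, w in zip(alignment, weights):
        c[seq[col]] += w / total  # assumes seq[col] is one of SYMBOLS
    return c


alignment = ["AC-", "AC-", "AD-"]
w = position_based_weights(alignment)
c = weighted_frequencies(alignment, w, col=1)
```

With a single sequence in the alignment, the normalized weight is 1.0, which matches the convention stated for targets without homologous sequences.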
(3) A new vector u, which incorporated the gap information into the 20 amino acids, was constructed as follows:
$u_m = c_m + c_{21} \times h_m \quad (m = 1, 2, \dots, 20)$ (2)
where $h_m$ is the frequency of occurrence of amino acid $m$.
where $q_j$ is the mixture coefficient of each component, $B$ is the Beta function, $\alpha_j = (\alpha_{j1}, \dots, \alpha_{j20})$ is the parameter vector for each component $j$ of the Dirichlet mixture, and $l$ is the number of components. The vector $n$ was calculated by the equation $n_m = u_m \times N$ ($m = 1, 2, \dots, 20$), where $N$ is the total number of homologous sequences and $u_m$ is calculated from equation (2).
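The symbols defined above match the standard posterior-mean estimator for a Dirichlet mixture prior, in which each component $j$ receives a posterior weight proportional to $q_j B(\alpha_j + n) / B(\alpha_j)$ and the estimate for amino acid $m$ is the weighted average of $(n_m + \alpha_{jm}) / (|n| + |\alpha_j|)$. The sketch below assumes that equation (3) takes this standard form; the function names and the toy single-component mixture are hypothetical and are not the parameters used by Parepro.

```python
import math


def log_beta(vec):
    """Logarithm of the multivariate Beta function B(vec)."""
    return sum(math.lgamma(v) for v in vec) - math.lgamma(sum(vec))


def psap(u, n_sequences, mixture):
    """Posterior-mean amino acid probabilities under a Dirichlet mixture.

    u           -- 20 gap-adjusted frequencies u_m (equation 2)
    n_sequences -- N, the number of homologous sequences
    mixture     -- list of (q_j, alpha_j) pairs, each alpha_j with 20 entries
    """
    n = [u_m * n_sequences for u_m in u]  # n_m = u_m * N
    # log of q_j * B(alpha_j + n) / B(alpha_j) for every component
    log_terms = [
        math.log(q) + log_beta([a + c for a, c in zip(alpha, n)]) - log_beta(alpha)
        for q, alpha in mixture
    ]
    peak = max(log_terms)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(t - peak) for t in log_terms]
    z = sum(unnormalized)
    posteriors = [x / z for x in unnormalized]

    n_total = sum(n)
    p = [0.0] * len(u)
    for post, (_, alpha) in zip(posteriors, mixture):
        denom = n_total + sum(alpha)
        for m in range(len(u)):
            p[m] += post * (n[m] + alpha[m]) / denom
    return p


# Toy example: one uniform component, a column fully conserved in 10 sequences.
u = [1.0] + [0.0] * 19
p = psap(u, 10, [(1.0, [1.0] * 20)])
```

Working in log space avoids overflow in the Beta function ratios when the number of homologous sequences is large.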
The Parepro vector comprised three attribute sets, which were used to describe the RD, the MI, and the IE.
The first attribute set, RD, was designed to depict the property differences between the newly introduced amino acid and the average residue in the mutation position. It was composed of 50 elements and was constructed as follows:
where $\mu_k$ and $\sigma_k^2$ are the mean and variance of the property $k$, respectively.
where $p_m$ is the PSAP at a mutated position calculated from equation (3).
where the two normalization terms are the mean and variance of $d_k$, respectively.
(4) A new vector $r$ was then constructed using the 544 elements from Additional file 1. The software Weka 3.4 was used to simplify the vector $r$, with CfsSubsetEval selected as the evaluator. The redundant and low-contribution elements in vector $r$ were removed. After these modifications, 50 elements remained and were included in the RD attribute set.
where 20 is the number of amino acids, and $p_m$ is the PSAP value at the mutation position calculated from equation (3).
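The entropy at a position summarizes how conserved that position is: it is low when one amino acid dominates the PSAP values and high when all 20 are equally likely. As a rough illustration, the sketch below computes a Shannon entropy over the 20 PSAP values. The logarithm base is not recoverable from the text, so it is exposed as a parameter; the function name and the default base are assumptions.

```python
import math


def position_entropy(p, base=2.0):
    """Shannon entropy of the 20 PSAP values at one position.

    Terms with zero probability contribute nothing to the sum.
    """
    return -sum(x * math.log(x, base) for x in p if x > 0.0)


uniform = [1.0 / 20] * 20          # no conservation at all
conserved = [1.0] + [0.0] * 19     # a single amino acid is always observed

e_uniform = position_entropy(uniform, base=20.0)
e_conserved = position_entropy(conserved, base=20.0)
```

With base 20 the entropy ranges from 0 for a fully conserved position to 1 for a uniform distribution over the 20 amino acids.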
where $i$ is the mutation position, $f$ is the number of residues located to the left or right of the mutation position, and $a$ indexes the elements of IE from 1 to 21. If the value of $a$ is between 1 and 20, $y_{(i+m)a}$ is $p_a$ at position $i + m$, calculated from equation (3). If the value of $a$ is 21, $y_{(i+m)a}$ is the entropy $E_{i+m}$ calculated from equation (8). Furthermore, if the window extends beyond the N-terminus ($i + m < 1$) or beyond the C-terminus ($i + m > l$), then $y_{(i+m)a}$ is $y_{1a}$ or $y_{la}$, respectively, where $l$ is the number of residues in the protein.
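The boundary rule above amounts to clamping window positions to the range 1 to $l$, so that positions beyond either terminus reuse the values of the first or last residue. The sketch below shows this clamping for a window of $f$ residues on each side of the mutation. How the window values are aggregated is not recoverable from the text; averaging over the $2f + 1$ positions, including the mutation position itself, is an assumption made only for illustration, and the function name is hypothetical.

```python
def environment_attributes(profile, i, f):
    """Aggregate 21 per-position values over a window around position i.

    profile -- list over positions 1..l; each row holds 21 values y
               (20 PSAP values followed by the position entropy)
    i       -- 1-based mutation position
    f       -- number of residues considered on each side
    """
    l = len(profile)
    width = 2 * f + 1
    ie = [0.0] * 21
    for m in range(-f, f + 1):
        pos = min(max(i + m, 1), l)  # clamp to the first or last residue
        row = profile[pos - 1]
        for a in range(21):
            ie[a] += row[a] / width  # assumed aggregation: window mean
    return ie


# Toy profile of three positions whose rows hold the constants 1, 2 and 3.
profile = [[1.0] * 21, [2.0] * 21, [3.0] * 21]
left_edge = environment_attributes(profile, i=1, f=1)
right_edge = environment_attributes(profile, i=3, f=1)
```

At the left edge the window covers positions 1, 1 and 2, and at the right edge it covers positions 2, 3 and 3, which shows the effect of the clamping.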
The SVM is a classifier that seeks an optimal hyperplane to separate two classes of samples. It uses a kernel function to map the original data to a higher-dimensional feature space, in which it locates the optimal separating hyperplane. For the SVM implementation, we used LIBSVM with a radial basis function (RBF) kernel, $K(x_i, x_j) = \exp(-G \lVert x_i - x_j \rVert^2)$. The parameter was selected with the LIBSVM parameter selection tool.
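To make the kernel concrete, the following short sketch evaluates the RBF kernel for two attribute vectors in plain Python. It only illustrates the formula and is not the LIBSVM implementation; the function name is hypothetical, and the argument `gamma` plays the role of $G$.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """RBF kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([0.2, 0.8], [0.2, 0.8], gamma=0.5)
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart; a larger $G$ makes the decay faster and the decision boundary more local.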
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. Because the numbers of positive and negative samples in the dataset differed markedly, and because MCC combines both the sensitivity and the specificity of the predictor, MCC should be regarded as the main score among the six scores in the evaluation [20, 21, 41, 42].
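For reference, the two scores named throughout the text can be computed from these four counts as follows. The sketch uses the standard definitions of overall accuracy and the Matthews correlation coefficient; the function names and the example counts are hypothetical, and returning 0 when the MCC denominator vanishes is a common convention rather than something stated in the text.

```python
import math


def q2(tp, tn, fp, fn):
    """Overall accuracy: fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the confusion counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denominator


accuracy = q2(tp=40, tn=30, fp=20, fn=10)
correlation = mcc(tp=40, tn=30, fp=20, fn=10)
```

Unlike Q2, MCC stays near 0 for a predictor that always outputs the majority class, which is why it is the more informative score on an unbalanced dataset.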
Project name: Parepro
Project home page: http://www.mobioinfor.cn/parepro
Operating systems: Windows
Programming language: Perl
License: GNU General Public License. This license allows the source code to be redistributed and/or modified under the terms of the GNU General Public License as published by the Free Software Foundation. The source code for the application is available at no charge.
Any restrictions to use by non-academics: None
The authors thank Dr. R. Casadio for providing the datasets HumVar and NewHumVar. This work was supported by the National Natural Science Foundation of China (Grant no. 30470031).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.