The prediction procedure of Parepro (Figure 1) begins by calculating the position-specific amino acid probabilities (PSAP) of a target protein that contains a corresponding nsSNP. Next, three attribute sets were constructed using PSAP and the properties of amino acids from AAindex [39, 40]; these three sets were then used to describe residue differences (RD) and mutation position information (MI) and to yield information on the environment around the mutation positions (IE). Finally, a complex vector that consisted of 94 attributes was used to predict the effects of the nsSNPs. The attribute sets RD, MI and IE comprised 50, 23, 21 attributes, respectively.

### The mutation datasets

We used two datasets, HumVar and NewHumVar, taken from the PhD-SNP server [10]. The dataset HumVar consisted of 21,185 different SNPs (12,944 were disease-related, and 8,241 were neutral polymorphisms) obtained from 3,587 protein sequences in the Swiss-Prot database (Release 48). The NewHumVar dataset was comprised of SNPs obtained from the Swiss-Prot database (Release 50) after eliminating any variants also present in the HumVar dataset. Therefore, the dataset NewHumVar consisted of 935 single amino acid mutations (149 were disease-related variants, and 786 were neutral mutations) from 469 different proteins.

### Computing position-specific amino acid probabilities (PSAPs)

A Dirichlet mixture method [45–47] was adopted to estimate the PSAPs, which was then used to construct the vector of Parepro and was calculated as follows:

(1) PSI-BLAST [48] with parameter -e 0.001 was run for three iterations to collect sequences similar to the target protein that contained the corresponding nsSNP from the Swiss-Prot database (Release 50.0) [49]. The identified sequences were aligned by ClustalX [50, 51] with default parameters. The position-based sequence weight method [52] was used to derive the weight *w*
_{
i
}of the *i*th sequence in the alignment. If no homologous sequence was selected, the weight *w*
_{
i
}of the target sequence was designated as 1.0.

(2) An alignment column was summarized by its weighted composition into a vector

c. The element of vector

c was calculated as follows:

where *N* is the total number of aligned sequences, *w*
_{
i
}is the weight of the *i* th sequence, the value of *m* from 1 to 20 represents any one of 20 amino acids, and a value of 21 represents a gap. If the symbol type of the *i* th sequence at the column is an amino acid *a*
_{
m
}(*m* = 1, 2⋯20) or gap (*m* = 21), the value of *δ*
_{
im
}is 1.0; otherwise it is 0.

(3) A new vector u, which incorporated the gap information into the 20 amino acids, was constructed as follows:

*u*_{
m
}= *c*_{
m
}+ *c*_{21} × *h*_{
m
}(*m* = 1, 2⋯20) (2)

where the vector *h* is the frequency of occurrence of any one of the 20 amino acids [53].

(4) The Dirichlet mixture method [

45–

47] was adopted to estimate the PSAPs. The posterior probability of amino acid

*m* at a position,

, was calculated from a 20-component Dirichlet mixture[

47]:

where *q*
_{
j
}is the mixture coefficient of each component, *B* is the Beta function,
= (*α*
_{j1}...*α*
_{j20}) is the parameter for each component *j* of the Dirichlet mixture, and *l* is the number of components. The vector n was calculated by the equation, *n*
_{
m
}= *u*
_{
m
}× *N*(*m* = 1, 2⋯20), where *N* is the total number of homologous sequences and *u*
_{
m
}is calculated from equation (2).

### Inputs and Encoding Schemes of Parepro

The Parepro vector was comprised of three attribute sets, which were used to describe the RD, the MI, and the IE.

The first attribute set, RD, was designed to depict the property differences between the newly introduced amino acid and the average residue in the mutation position, which was composed of 50 elements and was constructed as follows:

(1) The 544 amino acid properties were downloaded from AAindex [39, 40], as shown in Additional file 1. Then the value of each property *t*
_{
km
}(*k* = 1,⋯,544, *m* = 1, 2⋯20) was standardized as follows:

where *μ*
_{
k
}and
are the mean and variance of the property *k*, respectively, and were calculated as follows:
and
.

(2) The position-dependent properties

*d*
_{
k
}were given by

where *p*
_{
m
}is the PSAP at a mutated position calculated from equation (3).

(3) With respect to property

*k*, the distance

*r*
_{
k
}between the weighted property

*d*
_{
kn
}of a newly introduced amino acid

*n* and the mean of

*d*
_{
k
}was

where
and
are the mean and variance of *d*
_{
k
}, respectively, and were calculated as follows:
,
.

(4) A new vector *r* was then constructed using the 544 elements from Additional file 1. The software weka3.4 [54] was used to simplify the vector *r*, in which the evaluator CfsSubsetEvalwas selected. The redundant and low-contribution elements in vector *r* were removed. After these modifications, 50 elements remained and were included in the RD attribute set.

The second attribute set, MI, was used to define the status of a mutation position and consisted of 23 values. The first 20 elements were the PSAP values of the 20 amino acids in the mutation position calculated from equation (3). The 21

^{st} and 22

^{nd} elements were the PSAP values of the wild-type residue and the newly introduced residue, respectively. The 23

^{rd} value was the entropy (

*E*) [

55,

56] of amino acids in the mutation position and was calculated as follows:

where 20 is the number of amino acids, and *p*
_{
m
}is the PSAP value at the mutation position calculated from equation (3)

The third attribute set, IE, encoded the information surrounding the mutation position and consisted of 21 elements. The first 20 elements represented the PSAP values of the 20 amino acids and were calculated from equation

(3), and the last element represented entropy and was calculated from equation

(8). Residues in the immediate vicinity of the mutation carried more significance with respect to the mutation. Therefore, a significance coefficient was assigned to each residue in proximity to the mutation. The element of IE was then calculated as follows:

where *i* is the mutation position, *f* is the number of residues located to the left or right of the mutation position, and *a* represents one element of IE from 1 to 21. If the value of *a* is between 1 and 20, *y*
_{(i+m)a
}is *p*
_{
a
}in the position of *i* + *m* calculated from equation (3). However, if the value of *a* is 21, *y*
_{(i+m)a
}is the entropy *E*
_{i+m}calculated from equation (8). Furthermore, if the mutation is located at the N-terminal position (*i* + *m* > *l*) or at the C-terminal position, then *y*
_{(i+m)a
}is *y*
_{
la
}or *y*
_{
la
}, respectively, where *l* is the number of residues in the protein.

### Support vector machine

The SVM is a classifier seeking an optimal hyperplane to separate two classes of samples. SVM uses kernel functions to map original data to a feature space of higher dimensions and locates an optimal separating hyperplane. For SVM implementation, we used LIBSVM [57] with a Radial Basis Function (RBF kernel function) *K*(*x*
_{
i
}, *x*
_{
j
}) = exp(-*G*||*x*
_{
i
}- *x*
_{
j
}||^{2}). The parameter was selected with the LIBSVM parameter selection tool.

### Scoring the performance

The proteins in the dataset were randomly divided into 20 subsets. For each individual test, the mutations in one of the 20 sub-datasets were used as the test set and the others in the 19 subsets were combined to form a training set. The procedure was repeated 20 times so that each sample was used exactly once for testing and training. We defined disease-associated nsSNPs as positive and neutral nsSNPs as negative. In this work, we adopted sensitivity, specificity, overall accuracy(Q2) and Matthews correlation coefficient(MCC) to score the performance of the corresponding method:

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. Because there was an obvious disparity in the number of positive samples and negative samples in the dataset, MCC combined both the sensitivity and the specificity of the predictor and should be selected as the main score among the six scores in the evaluation [20, 21, 41, 42].