Searching for interpretable rules for disease mutations: a simulated annealing bump hunting strategy

Background Understanding how amino acid substitutions affect protein functions is critical for the study of proteins and their implications in diseases. Although methods have been developed for predicting potential effects of amino acid substitutions using sequence, three-dimensional structural, and evolutionary properties of proteins, the applications are limited by the complication of the features and the availability of protein structural information. Another limitation is that the prediction results are hard to be interpreted with physicochemical principles and biological knowledge. Results To overcome these limitations, we proposed a novel feature set using physicochemical properties of amino acids, evolutionary profiles of proteins, and protein sequence information. We applied the support vector machine and the random forest with the feature set to experimental amino acid substitutions occurring in the E. coli lac repressor and the bacteriophage T4 lysozyme, as well as to annotated amino acid substitutions occurring in a wide range of human proteins. The results showed that the proposed feature set was superior to the existing ones. To explore physicochemical principles behind amino acid substitutions, we designed a simulated annealing bump hunting strategy to automatically extract interpretable rules for amino acid substitutions. We applied the strategy to annotated human amino acid substitutions and successfully extracted several rules which were either consistent with current biological knowledge or providing new insights for the understanding of amino acid substitutions. When applied to unclassified data, these rules could cover a large portion of samples, and most of the covered samples showed good agreement with predictions made by either the support vector machine or the random forest. Conclusion The prediction methods using the proposed feature set can achieve larger AUC (the area under the ROC curve), smaller BER (the balanced error rate), and larger MCC (the Matthews' correlation coefficient) than those using the published feature sets, suggesting that our feature set is superior to the existing ones. The rules extracted by the simulated annealing bump hunting strategy have comparable coverage and accuracy but much better interpretability as those extracted by the patient rule induction method (PRIM), revealing that the strategy is more effective in inducing interpretable rules.


Background
Variants in single bases of DNA sequences yield single nucleotide polymorphisms (SNPs), among which nonsynonymous single nucleotide polymorphisms (nsSNPs) occurring in protein coding regions lead to amino acid substitutions in protein products, potentially affect protein functions, and are closely related to human inherited diseases. Hence, predicting potential effects of non-synonymous single nucleotide polymorphisms and their resulting amino acid substitutions on protein functions is of central importance in modern pathological and pharmaceutical studies [1]. Recently, increasing amounts of amino acid substitutions occurring in human proteins have been detected and collected in various databases such as the Swiss-Prot database [2], the Human Gene Mutation Database (HGMD) [3], and the Online Mendelian Inheritance in Man + (OMIM) database [4]. Stand alone data sets such as the unbiased laboratory mutagenesis data derived from experiments on the E. coli lac repressor [5,6] and the bacteriophage T4 lysozyme [7] are also available. With these data sources, the prediction is typically based on a set of features derived from the sequence and structural properties, as well as the phylogenetic information of the proteins containing the substitutions. For instance, Chasman and Adams derived sequence and structure-based features from a structural model and the phylogenetic information [8]. Sunyaev et al. analyzed amino acid substitutions on the basis of protein threedimensional structure and multiple alignments of homologous sequences [9,10]. Ferrer-Costa et al. characterized disease-associated substitutions in terms of substitution matrix, secondary structure, accessibility, free energies of transfer from water to octanol, etc. [11,12]. Saunders and Baker created mutation models by means of a variety of structural and evolutionary features [13]. Krishnan and Westhead used the physicochemical classes of residues, sequence conservation score, secondary structure, solvent accessibility, and buried charge, etc. [14]. Ng and Henikoff utilized the sequence conservation and the BLO-SUM amino acid substitution matrices [15]. With a set of features ready, the prediction is conventionally performed by making use of either the standard machine learning methods such as the decision tree [16], the support vector machines [17,18], the random forest [19], the statistical and classification models [8,9,13], or certain specifically designed methods such as the SIFT (Sorting Intolerant From Tolerant amino acid substitutions) [15].
No matter what kind of method is used, the quality of the features plays an important role in predicting the potential effects of given amino acid substitutions. To construct these features, amino acid substitutions were mapped to protein 3D structures [8][9][10]13]; evolutionary properties were measured from statistical models [8,9]; secondary structure and accessibility were computed from various prediction programs [11,13]; database annotations were also included [12]. However, the availability of protein or homologous proteins' structures limits the scope of the applications of these methods. In addition, most of these prediction methods are complicated and the prediction results are difficult to interpret, because a large number of complicated features are used and many of them rely heavily on other computational models. Although in some methods simple features were used with some specifically designed statistical models [15], the prediction accuracy is not as high as those methods using combined multiple features [1,[8][9][10][11][12][13][14].
A good feature set should contain as few features as possible, while each feature should have clear physicochemical meaning and is easy to be interpreted in biological terms. To achieve these objectives, we derived a novel feature set (including a continuous form and a discrete form) based on three physicochemical properties (molecular weight, pI value, and hydrophobicity scale) of amino acids, three relative frequencies of occurrences of amino acids in the secondary structures (helices, strands, and turns) of proteins with known secondary structural information, and the evolutionary profile of the protein containing the substitution. We compared the quality of the proposed feature set with other more complicated ones by applying the decision tree [16], the support vector machine [17,18], and the random forest [19] to the experimental amino acid substitution data of the E. coli lac repressor [5,6] and the bacteriophage T4 lysozyme [7], as well as to a wide range of human amino acid substitutions collected in the Swiss-Prot database. The results showed that our simple yet interpretable feature set was superior to other published ones [15,14,20] in terms of the area under the receiver operating characteristic (ROC) curve (AUC), the balanced error rate (BER), and the Matthews' correlation coefficient (MCC).
Although existing machine learning methods could make predictions, they acted like "black boxes" in that they were not capable of capturing physicochemical principles behind the predictions. In many circumstances, however, these hidden principles were of more importance since they could reveal how amino acid substitutions affected protein functions and why some amino acid substitutions would result in diseases. In order to explore these principles and associate amino acid substitutions with biological knowledge, we would use rule induction methods to automatically search rules for amino acid substitutions. These rules should be (1) interpretable, consisting of a small set of simple features; (2) high-quality, with very few exceptions; and (3) general, capable of explaining a significant number of substitutions.
In this paper, we considered rules as sub-regions (boxes) in the feature space composed of amino acid substitutions. More specifically, the boxes were defined in terms of the feature intervals. A previous method for finding boxes in the feature space was the patient rule induction method (PRIM) [21], which searched for optimal boxes using a steepest-ascent approach and was intuitively referred to as a "bump hunting" method. When applied to our problem, the PRIM had drawbacks in that the imbalance between the numbers of data samples in different categories was not considered, and some redundant features in the boxes should be manually removed and the quality of the resulting boxes significantly decreased. To overcome these shortcomings, we incorporated a new criterion called the discrimination power to take the imbalance between the numbers of data samples in different categories into consideration, and developed a novel simulated annealing bump hunting strategy which made use of the simulated annealing method to automatically discard redundant features while extracting high quality rules. We validated this strategy using heterogenous experimental amino acid substitutions occurring in both the E. coli lac repressor [5,6] and the bacteriophage T4 lysozyme [7], and showed that our approach could extract rules with comparable converge and accuracy but much better interpretability as those extracted by the original PRIM method. We then applied our strategy to annotated human amino acid substitutions collected in the Swiss-Prot database and successfully identified several rules which could be interpreted using physicochemical terms and were consistent with the current biological knowledge. We further applied the induced intolerant rules to unclassified human amino acid substitution data, and the results showed that these rules could cover a large portion of data samples and most of the covered samples showed good agreement with predictions made by either the support vector machine or the random forest. Beyond the highly confident predictions, these rules more importantly revealed the physicochemical principles behind the covered amino acid substitutions and explained why these substitutions would result in diseases.

Data sources
A large number of amino acid substitutions occurring in human proteins have been collected in the Swiss-Prot protein database [2]. In version 50.0 (released on May-30-2006), the Swiss-Prot database contained 25,994 amino acid substitution entries in 4,324 human proteins, with each substitution being annotated as "Disease", "Polymorphism", or "Unclassified". For a clear and concise presentation, we would refer to amino acid substitutions with the annotation "Disease" as intolerant ones and those with the annotation "Polymorphism" as tolerant ones. In this paper, we studied human proteins having at least 20 homologous proteins in the Pfam database [22] (version 20.0, released in May-2006), and focused on the substitutions occurring in known protein domains. In total, we collected 9, 610 intolerant substitutions, 4, 556 tolerant substitutions, and 1,487 unclassified ones in 2, 579 human proteins.
In order to validate the proposed feature set and the simulated annealing bump hunting strategy, two sets of experimental amino acid substitution data for the E. coli lac repressor [5,6] and the bacteriophage T4 lysozyme [7] were used. In these data sets, the effects of amino acid substitutions on the function of the corresponding protein were rated and classified to four categories. In the case of the lac repressor, the four categories were "+" (no effect), "+-" (slight effect), "-+" (larger effect), and "-" (complete absence). In the case of the T4 lysozyme, the four categories were "++" (no effect), "+" (slight effect), "+/-" (larger effect), and "-" (complete absence). Following the definition used by Chasman and Adams [8], as well as by Krishnan and Westhead [14], substitutions falling into the "no effect" category were treated as tolerant ones, and substitutions in the other categories were regarded as intolerant ones. In total, for the E. coli lac repressor, we collected 1,187 intolerant substitutions and 1,760 tolerant ones. For the T4 lysozyme, we collected 494 intolerant substitutions and 1,048 tolerant ones.

Prediction of the experimental amino acid substitutions
We first show that the proposed feature set can outperform other published feature sets in the prediction of potential effects of experimental amino acid substitutions occurring in the E. coli lac repressor [5,6] and the T4 lysozyme [7]. We performed 10-fold cross-validation experiments using both the support vector machine and the random forest with the proposed feature set on the substitution samples, calculated the area under the ROC curve (AUC), the minimum balanced error rate (BER), and the maximum Matthews' correlation coefficient (MCC), and compared them with other published results. Detailed descriptions regarding the prediction methods and the definition of the criteria are presented in the method section.
The cross-validation results using our feature set (both the continuous form and the discrete form) are shown in Table 1. First, we can see from the table that for the experimental substitutions occurring in homogenous proteins, our feature set works well with both the support vector machine and the random forest. When working with the random forest, the discrete form of our feature set can produce an AUC of 0.944, a BER of 0.125, and a MCC of 0.741 for the experimental substitutions occurring in the E. coli lac repressor, suggesting that about 88% of the substitutions can be predicted accurately. When working with the support vector machine, the results are slightly worse, but the continuous form of our feature set can still predict about 85% substitutions accurately. For experimental substitutions occurring in the T4 lysozyme, we obtain similar results. Second, the results show that our feature set can also work well for experimental substitutions occurring in heterogenous proteins. When applied to the mixed samples occurring in both the E. coli lac repressor and the T4 lysozyme, the random forest with the discrete form of our feature set can produce an AUC of 0.927, a BER of 0.148, and a MCC of 0.693, suggesting that about 85% of the substitutions can be predicted accurately. Thirdly, we notice that in our studies, the random forest works slightly better than the support vector machine with our feature set in terms of the AUC, the BER, and the MCC.
We compared the cross-validation results using our feature set with those obtained by the SIFT (Ng and Henikoff [15]) and another published feature set (Krishnan and Westhead [14]). As a sequence homology-based method, the SIFT can achieve BERs of 33% and 34% for experimental amino acid substitutions occurring in the E. coli lac repressor and the T4 lysozyme, respectively. By comparison, the continuous form of our feature set can achieve corresponding BERs of 14% and 17% when working with the random forest (15% and 18% when working with the support vector machine), respectively. These results suggest that our feature set can outperform the SIFT in the prediction of potential effects of experimental amino acid substitutions occurring in homogenous proteins. The published feature set by Krishnan and Westhead [14] uses 16 features, including 13 sequence based ones (the residue identities of the original and mutated residue, the physicochemical classes of these residues (hydrophobic, polar, charged, glycine), sequence conservation score at the mutated position, molecular mass shift on mutation, and hydrophobicity difference), and 3 structure based ones (secondary structure, solvent accessibility, and buried charge). When working with the support vector machine, this feature set can achieve BERs of 27%, 29%, and 28% for experimental amino acid substitutions occurring in the E. coli lac repressor, the T4 lysozyme, and the mixture of them, respectively, while the continuous form of our feature set can achieve corresponding BERs of 15%, 18%, and 19%, respectively. When working with the decision tree, the published feature set can achieve BERs of 16%, 20%, and 21% for experimental amino acid substitutions occurring in the E. coli lac repressor, the T4 lysozyme, and the mixture of them, respectively, while the continuous form of our feature set can achieve corresponding BERs of 16%, 18%, and 19%, respectively. These results suggest that our feature set can work as good as or outperform the published feature set [14] in the prediction of potential effects of experimental amino acid substitutions occurring in both homogenous and heterogenous proteins.

Prediction of the disease related amino acid substitutions
We performed 10-fold cross-validation experiments using both the support vector machine and the random forest with the proposed feature set on amino acid substitutions occurring in highly heterogenous human proteins and collected in the Swiss-Prot database, and compared the results with other published results (Bao and Cui [20]).
The published method [20] used a complicated feature set. For every substitution pair, they directly used two three-dimensional structural information predicted by the ENVIRONMENT program [23], one secondary structural information predicted by the STRIDE program [24], and one statistical score calculated by the SIFT program [15]. Their feature set also included another feature derived from the prediction results of these programs, and the wild-type amino acid identity. Altogether, they used six features. Five of them were three-dimensional structural or statistical ones, and needed to be calculated using other programs. Due to the limited availability of three-dimensional structural information, only a small fraction of available substitutions (3, 686 intolerant ones in 323 proteins and 532 tolerant ones in 305 proteins) in the Swiss-Prot database could be considered in their method. In contrast, our proposed feature set used only sequence information and evolutionary profiles, and did not depend on any other prediction programs. Consequently, we could predict more substitutions (9, 610 intolerant ones and 4, 556 tolerant ones) in a wider range of (2, 579) human proteins.
For comprehensive measures, Figure 1 shows the ROC curves for the support vector machine (AUC = 0.817) and the random forest (AUC = 0.831) using the proposed continuous form of our feature set. When compared with the SIFT and the method used by Bao and Cui (both presented in [20]), we can see clearly that both methods using our feature set produce better ROC curves (see Figure 1 and Fig.1 in [20]), indicating that the proposed feature set is superior to both the SIFT and the feature set presented in [20] in terms of comprehensive prediction power (the area under the ROC curve). For the discrete form, the AUC is 0.806 for the support vector machine and 0.817 for the random forest, and the ROC curves (not shown) are similar to those using the continuous form.
More specifically, we compared the two criteria for a certain single decision threshold, as shown in Table 2. When working with the support vector machine, the continuous form of our feature set leads the SIFT (results presented in [20]) by about 4% (26% vs. 30%) in BER and about 0.15 (0.46 vs. 0.31) in MCC. When working with the random forest, the continuous form of our feature set leads the SIFT by about 5% (25% vs. 30%) in BER and about 0.18 (0.49 vs. 0.31) in MCC. Similar results are obtained when comparing the discrete form of our feature set with the SIFT. These results suggest that our feature set can outperform the SIFT in the prediction of amino acid substitutions occurring in human proteins. When comparing our results with those obtained using the feature set proposed in [20], we can see from the table that for both prediction methods using the proposed feature set, the BERs are much smaller while the MCCs are much larger than the corresponding method using the feature set presented in [20], indicating that our feature set are much better than the published one.

Correlation and relative importance of the proposed features
For better understanding of the relationship between the proposed features, we calculated the pairwise Pearson's correlation coefficients between the proposed features (continuous form) based on the amino acid substitutions occurring in human proteins and presented the (upper triangle) correlation matrix in Figure 2. We divided the features to 7 groups according to their definitions in the method section, and named these groups at the top of the matrix. First, we can see from the matrix that the two evolutionary conservation scores (features 43 and 44) have very weak correlations with other 42 features. Second, for the original amino acid group (features 1 to 6), the window-sized group (features 13 to 18), and the columnweighted group (features 19 to 24), features derived from the same (physicochemical or relative frequency) properties (e.g., 1-13-19, 2-14-20, etc.) show medium positive correlations, as illustrated in region 1, 2, and 3 in the matrix. Third, the relative change features (features 25 to 42) show strong positive correlations with the substitution features derived from the same properties (e.g., 25-7, 26-8, etc.) and strong negative correlations with the orig- The ROC curves for predicting amino acid substitutions occurring in human proteins using the support vector machine and the random forest with the continuous form of the proposed feature set Figure 1 The ROC curves for predicting amino acid substitutions occurring in human proteins using the support vector machine and the random forest with the continuous form of the proposed feature set. For those using the discrete form, the curves (not shown) are similar.  inal, the window-sized, or the column-weighted features (e.g., 25-1, 37-19, etc.), as illustrated in region 4, 5, and 6, respectively. Finally, as shown in region 7 in the matrix, the relative change features derived from the same properties show strong positive correlations (e.g., 25-31-37, 26-32-38, etc.). We also calculated the correlation matrix for the discrete form of the proposed features based on the amino acid substitutions occurring in human proteins and observed similar results (data not shown). These observations, though can be intuitively explained from the definitions and calculation schemes of the features (see the method section for details), provide us informative understanding and quantitative measurement of the relationship between the proposed features and can be used as evidences in the future feature selection procedure.
We then evaluated the relative importance of the proposed features using the scheme included in the random forest and presented the results in Figure 3. In the random forest, the raw importance of a feature is calculated by randomly permuting the values of the feature in the Out-Of-Bag (OOB) cases, calculating the difference of classification errors between the original and the permuted cases, and averaging this difference over all the trees in the forest [19]. To make the measurement of importance more understandable, a normalization procedure is further applied to the raw importance of each feature by dividing the raw importance with the maximum raw importance over all the features (assuming it to be a positive number). Consequently, the relative importance of every proposed feature (a real number which is less than or equal to 1.0) is obtained. For the continuous form of the proposed features ( Figure 3A), we can see that the two evolutionary conservation scores (features 43 and 44) The Pearson correlation coefficient matrix (upper triangle) of the proposed features (continuous form)   4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 On the one hand, all of the 10 features except for X 27 in this order are calculated with the evolutionary conservation scores (see the method section for details), revealing the significant importance of the evolutionary information in the prediction of poten-tial effects of amino acid substitutions. On the other hand, the frequent appearances of the features derived from the molecular weight (X 19 , X 25 , and X 37 ), the hydrophobicity scale (X 21 , X 27 , and X 39 ), and the relative frequency in turns (X 24 and X 42 ) in this order suggest the importance of these properties in the identification of human disease related amino acid substitutions. For the discrete form of the proposed features ( Figure 3B), the results show that the relative importance of the window-sized, the columnweighted, and the relative change features are not as good as their continuous forms, suggesting that the discretization procedure causes information loss for individual features. When looking at the order of the top 10 most informative features (X 44 > X 43 > X 19 > X 1 > X 3 > X 5 > X 21 > X 6 > X 24 > X 39 ), we confirm the importance of the evolu- The relative importance of the proposed features  tionary information (X 43 and X 44 ), the molecular weight (X 1 and X 19 ), the hydrophobicity scale (X 3 , X 21 , and X 39 ), and the relative frequency in turns (X 6 and X 24 ).

Validation of the simulated annealing bump hunting strategy
A merit property of the discrete form of our feature set is that every feature has strong physico-chemical meaning, which enables us to induce interpretable rules to explain the biological principles behind amino acid substitutions. We first validated the proposed simulated annealing bump hunting strategy using the heterogenous experimental amino acid substitution data. We randomly divided the mixed substitution samples occurring in the E. coli lac repressor and the T4 lysozyme into a training set (containing 2/3 of the data) and a test set (containing the rest 1/3 of the data), applied the simulated annealing bump hunting strategy to the training set, and evaluated the resulting rules on the test set. As an example, Table 3 lists ten rules (five intolerant ones and five tolerant ones, respectively) extracted by the simulated annealing bump hunting strategy. From the table, we can see that the extracted rules have comparable coverage (box-size), accuracy (box-mean), and discrimination power for the training and test set, suggesting that our strategy is capable of extracting general rules. For example, for intolerant rules, the simulated annealing bump hunting strategy extracted a 1-feature rule with a coverage of 0.033 and an accuracy of 0.933 from the training set, while the same rule have a coverage of 0.021 and an accuracy of 0.938 when evaluated using the test set.
We also made a comparison between the simulated annealing bump hunting strategy and the original patient rule induction method (PRIM), which was implemented in the SuperGEM software [21]. Some candidate rules (7 intolerant ones and 7 tolerant ones) are shown in Figure  4. From the figure, we can see that the rules extracted by the simulated annealing bump hunting strategy have comparable coverage and accuracy but much better interpretability (less number of features) as the rules extracted by the original PRIM method. For example, for intolerant rules ( Figure 4A), the simulated annealing bump hunting strategy extracted a 5-feature rule with a coverage of 0.042 and an accuracy of 0.98, while the original PRIM method extracted a 19 feature rule with comparable coverage and accuracy. Similarly, for tolerant rules ( Figure 4B), the simulated annealing bump hunting strategy extracted a 4-feature rule with a coverage of 0.06 and an accuracy of 0.98, while the original PRIM method extracted a 17 feature rule with comparable coverage and accuracy.

Amino acid substitution rules
We applied the simulated annealing bump hunting strategy with the discrete form of our feature set to the human amino acid substitution data and extracted several rules which were consistent with current biological knowledge. As examples, Figure 5 presents a group of three intolerant Rule 1 in Figure 5 says that for a substitution pair, if the substituted amino acid never appears in the column (corresponding to the substitution position) of the Pfam multiple sequence alignment, the substitution is very likely to be intolerant. This rule uses a single feature (X 44 ) and covers 29% (4,137 out of 14,166) data samples with an accuracy of 88% and a discrimination power of 3.5.
Rule 2 in Figure 5 says that for a substitution pair, if the substituted amino acid rarely (e.g., with a very low frequency ≤ 0.1%) appears in the corresponding column of the Pfam multiple sequence alignment, the substitution is very likely to be intolerant. This rule uses a single feature (X 44 ), covering 5.9% (836) data samples with an accuracy of 92% and a discrimination power of 5.7. Figure 5 says that for a substitution pair, if the original amino acid very abundantly (with a very high frequency ≥ 0.95) appears in the corresponding column of the Pfam multiple sequence alignment, the substitution is very likely to be intolerant. This rule uses a single feature (X 43 ), covering 14% (2, 014) data samples with an accuracy of 95% and a discrimination power of 9.6.

Rule 3 in
These rules can be understood as follows. In the Pfam multiple sequence alignments, homologous proteins are aligned according to their functional units (protein domains). Hence, amino acids appearing in a certain column of an alignment would be those that are crucial in maintaining the protein function. On the contrary, amino acids rarely appearing in a certain column of an alignment would very likely be irrelevant to the protein function. Therefore, in Rule 1 and Rule 2, when an amino acid is substituted by another amino acid which never or rarely appears in the multiple sequence alignment, the function of the protein could hardly be maintained. In Rule 3, the very abundant appearance of the original amino acid in the multiple sequence alignment indicates that the amino acid is crucial in keeping the protein function. Therefore, when the amino acid is substituted, the protein would very likely be malfunction.
The second group of three rules uses physicochemical features and provides us specific understanding regarding the intolerant substitutions for individual amino acids, as illustrated in Figure 6. Figure 6 says that if a Cysteine is substituted, no matter what kind of amino acids it is substituted to, the substitution is very likely to be intolerant. This rule uses a single feature (X 1 ), covering 4.2% (596) data samples with an accuracy of 91% and a discrimination power of 4.7. The Cysteine is the only amino acid capable of forming disulfide bonds, and the disulfide bridges between Cysteines within a polypeptide support the protein's sec- Figure 5 Three general intolerant rules. In the figure, β, γ, and ρ for each rule are the coverage, the accuracy, and the discrimination power of the corresponding rule, respectively.  ondary structure. Therefore, when a Cysteines is substituted, the structure would be destroyed, and the protein would lose its function.

Three general intolerant rules
Rule 5 in Figure 6 says that when a Glycine is substituted, if the evolutionary profile medium or highly prefers turns (x 24 = 6,7), the substitution is very likely to be intolerant. This rule uses 2 features (X 1 and X 24 ), covering 6.6% (940) data samples with an accuracy of 95% and a discrimination power of 9.0. This rule can be understood from two aspects. First, the Glycine is the smallest amino acid. Therefore, when a Glycine is substituted by any other (bigger) amino acids, there might not be enough space to hold that amino acid, and thus the secondary structure of the polypeptide would be destroyed. As a result, the protein would lose its function. Second, the Glycine is one of the amino acids most prefer turns (only second to Proline). Hence, when the turn structure is important to the protein function (evolutionary profile medium or highly prefers turns) and a Glycine is substituted, the function of the protein would very likely change.
Rule 6 in Figure 6 says that when an amino acid is substituted by an Arginine, if the evolutionary profile is acidic (x 20 = 1,2,3), and the substituted amino acid (the Arginine) has a medium to strongly negative change in hydrophobicity versus the evolutionary profile (x 39 = 1,2), the substitution is very likely to be intolerant. This rule uses 3 features (X 9 , X 20 , and X 39 ), covering 5.7% (805) data samples with an accuracy of 93% and a discrimination power of 6.1. This rule can be understood from the following aspects. First, the Arginine is the most alkalic (with the highest pI value) and most hydrophilic amino acid. Second, an acidic evolutionary profile indicates that amino acids with small pI values are crucial to the protein's function. Thirdly, the substituted amino acid (the Arginine) having medium to strongly negative change in hydrophobicity scale suggests that hydrophilic amino acids are the majority in the homologous proteins (in other words, hydrophilic amino acids are crucial to the protein's function). Therefore, when an Arginine replaces the original amino acid, the above second and third conditions are violated, and thus the function of the protein would be destroyed.
The third group of three rules uses both the conservation scores and the physicochemical features, and provides us specific understanding regarding the tolerant substitutions, as illustrated in Figure 7. Figure 7 says that if the substituted amino acid appears in the multiple sequence alignment with a rather high frequency (0.600 <X 44 ≤ 1.000), the substitution is very likely to be tolerant. This rule uses a single feature (X 44 ), covering 1.2% (171) data samples with an accuracy of 81% and a discrimination power of 9.2. This rule can be thought of as the opposite of the previous Rule 1 and Rule 2. Amino acids appearing in a certain column of a Pfam multiple sequence alignment would be those that are crucial in maintaining the protein function. Therefore, when an amino acid is substituted by another amino acid which appears in the multiple sequence alignment with a rather high frequency, the function of the protein could possibly be maintained, and the substitution is likely to be tolerant. Figure 7 says that if the original amino acid appears in a hydrophilic local environment (1 ≤ X 15 ≤ 4) and the substituted amino acid appears in the multiple sequence alignment with a relatively high frequency (0.200 <X 44 ≤ 0.500), the substitution is very likely to be tolerant. This rule uses 2 features (X 15 and X 44 ), covering 4.3% (610) data samples with an accuracy of 74% and a discrimination power of 6.0. The understanding of this rule is similar to the previous Rule 7. Amino acids appearing in a certain column of a Pfam multiple sequence alignment would be those that relate to the maintenance of the protein function. Hence, when an amino acid is substituted by another amino acid which appears in the multiple sequence alignment with a relatively high frequency, the function of the protein could possibly be maintained, and the substitution is likely to be tolerant. Figure 7 says that when one of the amino acids in the set of {W, V, G, L, I} is substituted by one of the amino acids the set of {V, I} (11 ≤ X 2 ≤ 15 and 2 ≤ X 12 ≤ 3), if the substituted amino acid has a neutral change in molecular weight against the original one (X 25 = 4), the substitution is likely to be tolerant. This rule uses 3 features (X 2 , X 12 , and X 25 ), covering 3.5% (493) data samples with an accuracy of 77% and a discrimination power of 6.9. The principle behind this rule is that when amino acids are substituted by other amino acids having similar physicochemical properties, the structure of the protein is likely to be maintained, and thus the function of the protein is likely to be kept.

Prediction of the unclassified amino acid substitutions
We further applied the support vector machine and the random forest with the discrete form of our feature set to predict potential effects of unclassified amino acid substitutions in human proteins. We first used the 10-fold crossvalidation experiments to determine the decision threshold for each method so that the balanced error rate (BER) could be minimized in the experiments, and then applied each method with the corresponding decision threshold on the unclassified data to make predictions. The results are shown in Figure 8. We also applied the six intolerant rules induced by the simulated annealing bump hunting strategy in the previous section to the unclassified data. In total, the six intolerant rules covered 527 (412 + 79 + 11 + 25) data samples, and 412 out of them were also predicted as intolerant by both the support vector machine and the random forest. Besides, 90 samples covered by these rules were also predicted as intolerant by one of the prediction methods. Only 25 samples were not predicted as intolerant by either method. These statistics suggested that the induced interpretable rules were general (covering a significant proportion of data samples), and were of very high quality (with very few exceptions). More importantly, beyond the highly confident predictions, these rules also revealed the physicochemical principles behind the covered amino acid substitutions and explained why these substitutions would be intolerant.

Discussion and conclusions
Most contemporary studies aiming at predicting potential effects of amino acid substitutions made use of complicated and not widely available properties of amino acids and proteins. To overcome these limitations, we proposed a feature set based on three physicochemical properties of amino acids, three relative frequencies of amino acids in the secondary structures of proteins with known secondary structure information, and two evolutionary conservation scores. We applied three machine learning methods (the decision tree, the support vector machine, and the random forest) with our feature set to experimental amino acid substitutions occurring in the E. coli lac repressor and the bacteriophage T4 lysozyme, and showed that the methods using our feature set could achieve preferred prediction results in terms of the area under the ROC curve, the balanced error rate, and the Matthews' correlation coefficient. We further applied the support vector machine and the random forest with our feature set to a large number of amino acid substitutions occurring in highly heterogenous human proteins, and showed that our feature set could be applied to a much wider range of human proteins and the prediction methods using our feature set were superior to those using the existing more complicated feature sets.
Although existing methods could produce reasonable predictions, they were not capable of capturing physicochemical principles behind the predictions. In many situations, however, these hidden principles were of more importance because they could uncover how amino acid substitutions affect protein functions and why some substitutions would result in diseases. In order to explore these principles, we used a novel designed rule induction method called the simulated annealing bump hunting strategy to automatically extract interpretable rules for amino acid substitutions. The induced rules were either consistent with current biological knowledge or providing new insights for the understanding of the physicochemical principles behind amino acid substitutions.
One limitation of our feature set is that we currently use the Pfam multiple sequence alignment to extract evolutionary information for the query protein sequence. As a result, we are limited to deal with amino acid substitutions occurring in known protein domains. This limitation can be overcome by using some other multiple sequence alignment method such as the PSI-BLAST and ClustalW instead of the Pfam. Another limitation of our feature set is that the number of features is large, and some of them are highly correlated. Although good results have been achieved, integrating feature selection mechanisms in prediction methods could further improve the prediction performance. This demand is especially urgent when using the support vector machine as the prediction method. A third limitation is regarding how to perform fair and comprehensive comparisons between feature sets and prediction methods proposed in different literatures, especially when the training and test samples are of different sizes and from different data sources. Although this direction is not the focus in this paper, it would be of great importance and necessity in developing a general benchmark system using unified statistical criteria in our future work.
As for the simulated annealing bump hunting strategy, there exist two free parameters (λ and β 0 ). Although free parameters incorporate more flexibility into the method, they make the computational burden heavier (in order to tune these parameters). How to design an automated mechanism to guide the determination of these free parameters remains an ongoing study. Also, although the nine presented rules could be well explained, there exist some other rules which are not easy to be interpreted by current biological knowledge, especially when the rules contain many features. How to simplify our feature set to make the rules more interpretable forms another research focus.
Despite the limitations, we showed that our results were reasonably good. When using our feature set with the support vector machine and the random forest, we obtained better ROC curves and smaller (balanced) prediction error rates in the cross-validation experiments. When applied to unclassified data, the six induced intolerant rules could cover a large portion of data samples, and most of the covered substitutions were also predicted as intolerant by either the support vector machine or the random forest. More importantly, beyond the highly confident predictions, these rules could also reveal the physicochemical principles behind the covered samples and explain why these substitutions would cause diseases.

The proposed feature set
We propose a set of 44 features which are general enough for most known proteins and are easy to be obtained by simple calculations. Our feature set has a continuous form, in which all the 44 features have continuous values, and a discrete form, in which 42 features have ordered categorical values and the other 2 have continuous values. The features are derived based on 3 physicochemical properties (molecular weight, pI value, and hydrophobicity scale) of amino acids, 3 relative frequencies for the occurrences of amino acids in the secondary structures (helices, strands, and turns) of proteins with known secondary structural information, and two evolutionary conservation scores. The unit of molecular weight is Dalton. The pI (Isoelectric Point) is the pH value at which a molecule carries no net electrical charge. The hydrophobicity scale of Kyte and Doolitle is derived from the physicochemical properties of amino acid side chains [25]. The three relative frequencies are calculated by counting the occurrences of amino acids in the corresponding secondary structure of proteins with known secondary structural information. All these six properties can either be obtained from the literature [25,26], or be calculated using only the sequential information of proteins [27].
The continuous form For a given amino acid substitution pair (Org → Sub) in a certain query protein, the above 6 properties are calculated for the original (Org) and the substituted (Sub) amino acid, as well as in a window-sized situation which includes the neighbors of the original amino acids in the query protein sequence, and in a column-weighted circumstance in which the query protein sequence is aligned with its homologous proteins. The calculations of the properties for the original and the substituted amino acids are straightforward. The window-sized properties (with window size W) are calculated as the average of the corresponding properties for the original amino acid and its W -1 neighbors in the query protein sequence. According to the known relationship between sequences and secondary structures of proteins (i.e., α helices are defined by repeated hydrogen bonds with a period of 4 amino acids, and have 3.6 amino acids per turn [26]), in this paper, we set the window size W = 9 so that the sequence information of the amino acids at the substitution positions and the α helices next to the substitution residues can be included. The column-weighted properties are calculated as follows. For the query protein, its homologous proteins are extracted from the Pfam database [22]. Supposing that the substitution occurs at a position corresponds to a certain column of the alignment, the column-weighted properties are then calculated as the weighted average of the corresponding properties for all the 20 kinds of amino acids, where the weight of a certain kind of amino acid is the frequency of its occurrence in the corresponding column of the alignment.
In addition, for each substitution pair, three combinations of the above four situations are considered. First, each of the 6 properties of the original amino acid is subtracted from the corresponding property of the substituted amino acid, forming 6 features measuring the relative change of the substituted amino acid versus the original amino acid. Second, each of the 6 window-sized properties is subtracted from the corresponding property of the substituted amino acid, forming 6 features measuring the relative change of the substituted amino acid versus the local environment of the substitution position in the query protein. Thirdly, each of the 6 column-weighted properties is subtracted from the corresponding property of the substituted amino acid, forming 6 features measuring the relative change of the substituted amino acid versus the evolutionary profile of the substitution position in the homologous proteins.
Besides the above physicochemical and relative frequency features, our feature set also include two evolutionary conservation scores for the original and the substituted amino acids. The conservation scores are defined as the frequencies of occurrences of the amino acids (original or substituted) in the corresponding column of the Pfam multiple sequence alignment. Substitution-Original Substitution-Window Substitution-Column Each of the 6 amino acid properties is calculated in 7 situations, forming X 1 ~ X 42 , while the conservation scores for the original and the substituted amino acids become X 43 and X 44 , respectively.
An illustration of ordered categorical values for hydrophobicity scales for the original or substituted amino acids Figure 9 An illustration of ordered categorical values for hydrophobicity scales for the original or substituted amino acids.  More Hydrophobic More Hydrophilic

R K N D E Q H P Y W S T G A M C F L V I
With the above properties being calculated, we propose the continuous form of the feature set, including 42 physicochemical or relative frequency properties (each of the 6 amino acid properties being calculated in 7 different situations) and 2 conservation scores (for the original and the substituted amino acids). As a summary, Table 4 shows this feature set, with features labeled by X i for i = 1,..., 44.

The discrete form
In order to make the features interpretable in physicochemical terms, we further discretize the physicochemical and relative frequency properties (X 1 ~ X 42 in Table 4). For each of the properties corresponding to the original or the substituted amino acids (X 1 ~ X 12 ), we first order the possible values (corresponding to the 20 amino acids) from the smallest to the largest, and then use the ranks as the categorical values for the property. By doing this, each categorical value corresponds to one or more amino acids, while the categorical values for a certain property have intrinsic order and clear physicochemical meaning. For example, Figure 9 illustrates the ordered categorical values for the hydrophobicity scale (X 3 or X 9 ). 20 amino acids are sorted according to their hydrophobicity scale, from the most hydrophilic (R) to the most hydrophobic (I). The ranks of the sorting results are then used as the categorical values for the amino acids. Since amino acids N, D, E, and Q have identical hydrophobicity in the Kyte and Doolitle scale, they are assigned the same categorical value (3). The physicochemical meaning of the ordered categorical value is straight forward: the smaller the value, the more hydrophilic the amino acid, and vice versa.
For each of the other properties (X 13 ~ X 42 ), we first group all the possible values to 7 bins with each bin having equal interval, and then use the indices of the bins as the categorical values of the property. By doing this, each categorical value corresponds to several substitution pairs, while the categorical values for a certain property have intrinsic order and clear physicochemical meaning. As a demonstration, Figure 10 illustrates how the windowsized hydrophobicity scale is discretized. First, all the window-sized hydrophobicity scale values (X 15 ) in the data set are collected, and the minimum value (-4.0) and maximum value (+4.0) are determined. And then, 6 threshold values ({-2.86, -1.71, -0.57, 0.57, 1.71, 2.86}) are calculated so that the interval ([-4.0, +4.0]) can be cut into 7 equal bins. Finally, the indices of the bins are used as the categorical values. The physicochemical meaning of the ordered categorical value is straight forward: the smallest value corresponds to strongly hydrophilic local environment, while the largest value corresponds to strongly hydrophobic local environment.
With each feature having meaningful ordered categorical values, we propose the discrete form of our feature set, containing 42 physicochemical or relative frequency properties (each having ordered categorical values) and the 2 conservation scores. We would use the same method as shown in Table 4 to label these features, with X 1 ~ X 42 having ordered categorical values.

Prediction methods and evaluation criteria
When comparing the proposed feature set with other published ones using experimental substitution data, we use the decision tree, the support vector machine, and the random forest to predict the potential effects of amino acid substitutions. Recent studies regarding the random forest [19] have shown that prediction results can be significantly improved by growing a set of decision trees and letting them to vote. Hence, we adopt the support vector machine (SVM) [18] and the random forest (RF) [19] with the proposed feature set to predict the potential effects of human disease related substitutions. For the support vector machine, two crucial parameters are commonly referred to as C and g. We use a grid search, as included in the libsvm software package [18] to determine these parameters. For the random forest, two important parameters are in general referred to as mtry (the number of randomly selected features at each node of the internal decision trees) and jbt (the number of decision trees in the forest). We use jbt = 1000, and try different mtry (from 1 to 10) to select the one which can give us the best prediction performance.
The performance of each prediction method is evaluated using 10-fold cross-validation experiments, and the results of 10 independent experiments are averaged to get a fair evaluation. We use three criteria to evaluate the performance of a prediction method. The first criterion is the area under the receiver operating characteristic (ROC) An illustration of ordered categorical values for the window-sized hydrophobicity Figure 10 An illustration of ordered categorical values for the windowsized hydrophobicity. curve (AUC), which provides us comprehensive understanding for the prediction power of a given method. The other two criteria are the balanced error rate (BER) and the Matthews' correlation coefficient (MCC) [28]. They take the imbalance of intolerant samples and tolerant samples into consideration and provide us more detailed understanding for the prediction power under certain decision threshold.
Given the 10-fold cross-validation results and a certain decision threshold, we can calculate the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) under the threshold. The balanced error rate (BER) and the Matthews' correlation coefficient (MCC) [28] under the decision threshold are then defined as and In general, the small the BER and the large the MCC, the better the prediction method.

Rule induction for amino acid substitutions
With the meaningful features available, we can use rule induction methods to automatically extract interpretable rules for amino acid substitutions. A rule has a productive format IF (condition) THEN (prediction).
The condition part should include only a small number of features so that the rule can be easily interpreted, while the prediction part gives an assertion about the potential effects (tolerant or intolerant) of amino acid substitutions which satisfy the condition.
For a given amino acid substitution data sample, let x = (x 1 ,..., x D ) T be the vector of all the features, where D = 44 is the total number of features. Let y be the indicator of the substitution type. In the case that we target to extract rules for intolerant substitutions, y = 1 if a substitution is intolerant and 0, otherwise. In the case that we aim at extracting rules for tolerant substitutions, y has the opposite meaning. Each substitution can be thought of as an observation of the output (y) produced by a certain unknown function, given the inputs (x), and observations with similar outputs and similar inputs (or a subset of the inputs) define a rule. The similarity of the inputs can be specifically described by a "box" (sub-region) in the feature space, and defined by a set of feature intervals. The coverage of a rule can be represented by the size of the corresponding box (box-size), and the quality of a rule can be described by the average value of the output y for data samples inside the box (box-mean). Let N be the total number of data samples. The rule induction process is then mathematically formulated as: Given repeated observations composed of the substitutions (the outputs y k ), along with simultaneous values of the features (the inputs x k ), search in the feature space optimal boxes such that the box-means are as large as possible while the box-sizes are not very small.
This problem can be addressed using the patient rule induction method (PRIM) [21], which is also referred to as a "bump hunting" method. Each rule is described using a "box" in the feature space, defined as The PRIM then intends to search in the box space a box which has maximum box-mean , with the constraint > β 0 (β 0 is a predefined threshold). This is treated by a "top-down peeling" algorithm and a "bottom-up pasting" algorithm. The top-down peeling algorithm starts from the whole search space (the initial box) and repeatedly tries to maximize the box-mean by removing some bad data points (y = 0) from the box. Since each peeling is performed without knowledge of later peels, it is possible that the final box can be refined by readjusting some of its boundaries. Hence, the bottom-up pasting algorithm repeatedly tries to put some good data points (y = 1) back by growing the box. Smaller boxes often results in larger box-mean, the PRIM thus seeks for a reasonable tradeoff between the box size and the box mean. The tradeoff is typically done manually by looking at a box-size -boxmean trajectory plot. The final box represents the extracted rule. Considering that some redundant features may exist, a tradeoff between the complexity and goodness of the rule can be further considered by trying to remove some features from the rule. This is done after the final box is obtained by looking at how box-mean changes while removing some boundaries from the box.

Simulated annealing bump hunting strategy
The PRIM can be thought of as a steepest-ascent searching method in the box space. The final box is a (local) optimum without guarantee to be the global optimum. Also, the top-down peeling removes the data points permanently in iterations. Although some of the good data points can be put back by the bottom-up pasting, the repair to the box seems to be very limited. Moreover, it is doubtable that the process of removing some features from the rule after the final box is obtained could be an effective way to generate an optimum rule. These considerations motivate us to explore an automated feature selection methodology which can discard redundant features while extracting rules. The basic idea is to use the simulated annealing strategy instead of the steepest-ascent searching, while incorporating the automated feature selection process in the strategy.
The presence of a feature in a rule can be described as the presence of a boundary in a box and represented by an indicator ξ d , where ξ d = 1 if the d-th feature is included in the box and 0, otherwise. The indicator δ k with ξ d included then becomes . The formulas for the box size and box-mean remain unchanged. Considering that the proportion of data samples from different categories may be very different, we further introduce a normalized quantity to measure the discrimination power of a box , where is the ratio of the positive data samples against the negative data samples. We would take the possible imbalance between the data samples into consideration and maximize the discrimination power for rule induction. Nevertheless, α is a constant with fixed number of data samples, maximizing the discrimination power is therefore equivalent to maximizing the box mean , and vice versa.
For boxes with comparable box-sizes and box-means, we prefer boxes have fewer features. This is achieved by rewarding boxes with less features using the quantity of where λ is a hyper-parameter. λ = 0 means that we do not take the number of features into consideration, while positive λ values give preference to less number of features. In this paper, we in general set λ