- Open Access
SCMHBP: prediction and analysis of heme binding proteins using propensity scores of dipeptides
BMC Bioinformatics volume 15, Article number: S4 (2014)
Heme binding proteins (HBPs) are metalloproteins that contain a heme ligand (an iron-porphyrin complex) as the prosthetic group. Several computational methods have been proposed to predict heme binding residues and thereby to understand the interactions between heme and its host proteins. However, few in silico methods for identifying HBPs have been proposed.
This work proposes a scoring card method (SCM)-based approach, named SCMHBP, for predicting and analyzing HBPs from sequences. First, a balanced dataset of 747 HBPs (selected using the Gene Ontology term GO:0020037) and 747 non-HBPs (selected from 91,414 putative non-HBPs) with a sequence identity of 25% was established. Next, a set of scores that quantify the propensity of amino acids and dipeptides to be in HBPs was estimated using SCM to maximize the predictive accuracy of SCMHBP. Finally, the informative physicochemical properties of the 20 amino acids were identified from the estimated propensity scores and used to characterize HBPs. The training accuracy of SCMHBP is 85.90%, and its mean test accuracy on three independent test datasets is 71.57%. SCMHBP performs well in comparison with methods such as the support vector machine (SVM), decision tree J48, and Bayes classifiers. The putative non-HBPs with high sequence propensity scores are potential HBPs, which can be further validated experimentally. The propensity scores of individual amino acids and dipeptides are examined to elucidate the interactions between heme and its host proteins. The following characteristics of HBPs are derived from the propensity scores: 1) aromatic side chains are important to the effectiveness of specific HBP functions; 2) a hydrophobic environment is important in the interaction between heme and binding sites; and 3) the whole HBP has low flexibility whereas the heme binding residues are relatively flexible.
SCMHBP yields knowledge that improves our understanding of HBPs rather than merely improving the accuracy of HBP prediction.
Heme binding proteins (HBPs) are metalloproteins that contain a heme ligand (an iron-porphyrin complex) as a prosthetic group. HBPs exist in various forms. The most common hemes are the b- and c-types. The b-type heme binds to proteins non-covalently, whereas the vinyl groups of the c-type heme form covalent bonds with two specific cysteine residues of the Cys-Xaa-Xaa-Cys-His motif [1, 2]. Other important hemes include a-, o-, and d-type hemes, which are found in bacteria and eukaryotes. The way in which heme binds to an HBP depends on the heme type. Understanding the structure-function relationships of heme-iron complex formation or dissociation is useful for rational HBP engineering.
HBPs perform a variety of biological functions, such as electron transfer, diatomic gas transportation/storage, chemical catalysis, transcriptional regulation, ion channel chemosensing, circadian clock control, and microRNA processing. The functional diversity of HBPs gives them various special applications, such as in biopolymer films that are used in research into electrocatalysis [9, 10] and as components of new biocatalysts. The heme enzyme chlorite dismutase has been successfully utilized in studying O2-utilizing enzymes. Studies of HBPs and the heme biosynthesis pathway are crucial to the development of efficient cancer phototherapies. Identifying novel HBPs increases their range of applications.
Identifying a novel HBP requires complex experiments and analyses, such as analysis of its crystal structure and sequence, and biochemical assays. Merkley et al. combined histidine affinity chromatography with a proteomics database to improve the identification of c-type heme peptides in liquid chromatography-tandem mass spectrometry experiments. Babusiak et al. combined native chromatography with native electrophoresis to examine HBP complexes in murine erythroleukemia cells. Designing artificial HBPs is a good approach to exploring novel applications. Accordingly, understanding the physicochemical properties (PCPs) of HBPs is also crucial. With respect to PCP analysis, previous investigations have examined the isoelectric point, acidity, chemical sensing, oxidation, reduction, and ligand binding catalysis of HBPs. The PCPs and the transportation of electrons have been studied in flavocytochromes.
The functional diversity of HBPs suggests that the sequences and structures of HBPs are highly variable, which limits the use of homology-based search methods for their discovery. Most in silico investigations focus on heme binding sites and heme binding mechanisms [19–21], rather than the identification of HBPs. Schneider et al. examined b-type heme proteins with various folding topologies and elucidated their ability to bind chemically identical heme ligands. They found that the residues from these different topologies cluster at particular interaction "hot spots" and define some common structural heme-binding motifs. Smith et al. comprehensively analyzed a dataset of non-homologous HBPs; their results revealed some typical characteristics of the heme groups and their binding sites in proteins with various functionalities. Li et al. elucidated the differences between the apo and holo structures of HBPs and found that HBPs generally undergo small conformational changes following heme binding.
In addition to these investigations of binding sites, several support vector machine (SVM)-based methods predict heme binding residues to elucidate the mechanism of heme-protein interactions. HemeBIND is an SVM-based ensemble predictor that uses two feature sets: position-specific scoring matrices (PSSM) and structural information (comprising solvent accessibility, depth, and a protrusion index). HemeNET uses topological properties of binding residues from the 3D interaction networks to improve its predictive performance. Xiong et al. and TargetS proposed sequence-based prediction methods that take into account the low availability of the 3D structures of HBPs.
Since the aforementioned experimental approaches are time-consuming and labor-intensive, and few effective in silico methods for identifying HBPs are available, this work proposes a novel method (SCMHBP) for predicting and analyzing HBPs from primary sequences. SCMHBP uses a newly developed scoring card method (SCM), proposed by Huang et al. [26–28], for estimating the propensity scores of amino acids and dipeptides to be in HBPs. The propensity scores of dipeptides are calculated from the difference between the dipeptide compositions of HBPs and non-HBPs and are further optimized by applying an intelligent genetic algorithm. The propensity scores of amino acids can then be utilized to discover PCPs that are highly related to HBPs by exploring 531 PCPs in the AAindex database. The advantages of SCMHBP are its accurate predictions, simple methodology, and high interpretability.
Two presently available datasets of HBPs are HemeBind145 and TargetS233. However, no dataset of non-HBPs is available. Therefore, a new dataset is first established from the SwissProt database, consisting of 747 HBPs (selected using the Gene Ontology term GO:0020037) and 747 non-HBPs (selected from 91,414 putative non-HBPs) with a sequence identity of 25%. This dataset is divided into training and test datasets. The TargetS233 dataset is enlarged by retrieving HBPs from the new version of BioLiP, which is a ligand-protein binding database. The recollected dataset of 311 HBPs is named TargetS311. The training accuracy of SCMHBP is 85.90%, and its mean test accuracy on three independent test datasets is 71.57%. SCMHBP compares well with typical methods based on the SVM, decision tree J48, and Bayes classifiers, which are implemented using WEKA.
To characterize HBPs, the propensity scores of individual amino acids and dipeptides are investigated to elucidate the interactions between heme and its host proteins. Additionally, based on the correlation coefficient (R value) between the propensity scores of amino acids and the PCPs in AAindex, three informative PCPs are identified: 1) SNEP660103 with R = 0.604, described as "principal component III"; 2) TAKK010101 with R = 0.576, described as "side-chain contribution to protein stability"; and 3) KARP850101 with R = -0.555, described as "flexibility parameter for no rigid neighbors". Further analysis of the PCPs reveals that 1) high aromaticity affects the functionality of HBPs, suggesting that conserved residues with aromatic side chains importantly affect the performance of specific HBP functions; 2) a hydrophobic environment plays an important role in the interaction between heme and binding sites; and 3) the HBP has a low overall flexibility whereas the heme binding residues are relatively flexible. The high side-chain contribution to the stability of HBPs arises from the ability of the aromatic and non-polar residues to form non-covalent bonds.
Materials and methods
This work proposes the SCMHBP method, which is based on a scoring card method (SCM), to estimate the propensity scores of 400 dipeptides and 20 amino acids for predicting and analyzing HBPs from protein sequences. Figure 1 presents the flowchart of the design of the system to predict and analyze HBPs, which involves datasets, methods, and the analysis of propensity scores.
Two datasets of HBPs without non-HBPs, HemeBind145 and TargetS311, are available. To design predictors of HBPs, a non-redundant dataset of 747 HBPs that are selected using the Gene Ontology term GO:0020037 and 747 non-HBPs that are selected from 91,414 putative non-HBPs (filtering 395 sequences with a non-amino acid character from 92,309 putative non-HBPs) is established. The dataset is divided into two sets, HBPGO-TRN1000 and HBPGO-TST494, for training and testing, respectively. Table 1 summarizes the training dataset HBPGO-TRN1000 used for designing HBP predictors and the three test datasets used for evaluating these predictors.
HBPGO-TRN1000 and HBPGO-TST494
The datasets are established by collecting sequences from the SwissProt database (version 2013_09) using the GO term GO:0020037 from the GO database to define HBPs. If a sequence is annotated with this GO term, then it is regarded as an HBP; otherwise, it is a putative non-HBP. The sequence identity of any pair of sequences is reduced to 25% using USEARCH. Hence, 747 HBPs and 91,414 putative non-HBPs are obtained. The 747 HBPs are divided into two sets: 500 HBPs for training and 247 HBPs for testing.
Owing to the large number of putative non-HBPs, the identification of a small set of non-HBPs for designing predictors is an issue. Ten candidate datasets of 500 non-HBPs were prepared (to establish a balanced dataset) by random selection from the 91,414 non-HBPs. Table 2 presents the prediction performances of the score cards that were generated from these ten randomly selected negative datasets. The mean accuracy of these ten score cards is 85.22%. The dataset among the ten that yields the highest training accuracy is used as the final dataset of non-HBPs in HBPGO-TRN1000. This best training accuracy is 86.60%, with a sensitivity of 0.86 and a specificity of 0.87.
In HBPGO-TST494, the 247 non-HBP sequences are randomly selected from the 90,914 non-HBPs that are not in the negative part of HBPGO-TRN1000. We note that these sequences are not involved in the training process.
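The performance measures used to compare the ten candidate negative datasets can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy dictionary of accuracies are assumptions introduced here. On a balanced dataset the accuracy is the mean of sensitivity and specificity, so the reported sensitivity of 0.86 and specificity of 0.87 imply an accuracy of about 86.5%, which agrees with the reported 86.60% up to rounding.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = HBP, 0 = non-HBP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def accuracy_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Overall accuracy implied by sensitivity and specificity for given class sizes."""
    return (sensitivity * n_pos + specificity * n_neg) / (n_pos + n_neg)


def select_best_negative_set(training_accuracy_by_set):
    """Return the name of the candidate negative set with the highest training accuracy."""
    return max(training_accuracy_by_set, key=training_accuracy_by_set.get)


# Balanced training set of 500 HBPs and 500 non-HBPs, with the reported rates.
implied_accuracy = accuracy_from_rates(0.86, 0.87, 500, 500)

# Hypothetical accuracies for two candidate sets (illustrative values only).
best = select_best_negative_set({"candidate_1": 0.852, "candidate_2": 0.866})
```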
Two datasets have been used to predict HBP binding sites: 1) one dataset comprising 72 non-redundant HBPs that were collected from the Het-PDB Navi. database (version of May 2010), and 2) another dataset comprising 75 non-redundant HBPs that was presented by Fufezan et al. In these datasets, no pair of chains has a sequence identity of more than 30%. The two datasets are combined, and two sequences with a non-amino acid character are removed, to generate the new dataset, HemeBind145, of size 145.
In the TargetS method , 233 HBPs that were collected from the ligand-protein binding database BioLiP  with a sequence identity of 40% were used. BioLiP is a semi-manually curated database for high-quality, biologically relevant ligand-protein binding interactions. In this work, 311 HBPs with a sequence identity of 40% were collected from the new version of BioLiP as the independent test dataset, referred to as TargetS311.
Typical classification methods
To the best of our knowledge, few effective methods or tools for predicting HBPs from sequences have been proposed. To develop an accurate predictor, some typical classification methods, such as those based on the SVM, J48, and Bayes classifiers with a single type of sequence feature, have been implemented for performance comparison. The SVM is widely recognized to be an accurate classifier for predicting proteins with a specific function. Generally, the predictive performance of an SVM with effective features is regarded as the gold standard for evaluating predictors. Radial basis SVM classifiers are implemented using the LIBSVM package. The SVM parameters are determined by using a grid search method to maximize the ten-fold cross-validation (10-CV) accuracy on the training dataset. Some commonly used features, such as amino acid composition (AAC), dipeptide composition (DPC), normalized PSSM (PSSM400), and 531 PCPs in the AAindex, are evaluated in the design of predictors.
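As an illustration of the two simplest feature types named above, the following sketch computes AAC (20 values) and DPC (400 values) for a single sequence. The function names and the choice to skip residues outside the 20 standard amino acids are assumptions made here for the example; the paper does not specify its implementation.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def amino_acid_composition(sequence):
    """Return the fractions of the 20 standard amino acids, in AMINO_ACIDS order."""
    residues = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(residues)
    return [counts[a] / len(residues) for a in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Return the fractions of the 400 dipeptides, in DIPEPTIDES order.

    Overlapping pairs of adjacent residues are counted; a pair is skipped
    if either residue is not one of the 20 standard amino acids.
    """
    seq = sequence.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    counts = Counter(valid)
    return [counts[d] / len(valid) if valid else 0.0 for d in DIPEPTIDES]


aac = amino_acid_composition("AACC")   # A and C each make up half of the residues
dpc = dipeptide_composition("ACAC")    # pairs: AC, CA, AC
```

Each vector sums to 1 for a non-empty sequence, so proteins of different lengths are placed on a common scale before they are passed to a classifier.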
The J48 and Bayes classifiers are implemented using WEKA . J48 refers to the decision tree classifier that is generated by the C4.5 algorithm that was developed by Quinlan . The Naïve Bayes classifier is a statistical classifier that can predict the probability of class membership under the assumption of the mutual independence of features .
Scoring card method
The scoring card method (SCM) is a recently published classification method for predicting proteins with a particular function and providing insight into the characteristics of proteins from primary sequences. Huang et al. developed this method [26–28]. Unlike the complex classification mechanism of the SVM, which is not easily understood by biologists, the use of SCM to estimate the propensities of amino acids and dipeptides for the function of interest is a simple and easily interpretable method of prediction and analysis. In terms of prediction accuracy, the SCM is slightly worse than, or comparable to, the SVM when used with dipeptide features [26–28]. The advantages of SCM are threefold. First, the classification mechanism of the SCM adopts a weighted sum of composition and propensity scores of dipeptides to score the queried protein. In contrast to the hyperplane of the SVM, a threshold value is used in the SCM to classify proteins, and this value is easily understood and manipulated by biologists. Secondly, the propensity scores of dipeptides and amino acids can be utilized to identify the PCPs that will provide information about a global property of general proteins in a further analysis of the proteins' characteristics. Thirdly, the SCM is a general-purpose method for identifying protein sequences with a particular function. The proposed SCMHBP method is based on the SCM that is applied to the training dataset HBPGO-TRN1000. For completeness, the SCM used here and the SCMHBP algorithm are described below.
The main procedure in the design of SCM-based predictors can be conducted without modification, and consists of the following steps: 1) preparing both positive and negative sequences in a training dataset as inputs (500 HBPs and 500 non-HBPs in HBPGO-TRN1000 in this study); 2) generating an initial scoring card with 400 propensity scores of dipeptides using a simple statistical method; 3) obtaining propensity scores of 20 amino acids from those of 400 dipeptides; 4) optimizing the initial scoring card using a global optimization method; and 5) generating a binary SCM classifier with a threshold value as an output of the procedure. More details of the SCM method and its applications can be found elsewhere [26–28]. The algorithm of SCMHBP is as follows.
Step 1: Prepare a training dataset HBPGO-TRN1000 that comprises 500 HBPs and 500 non-HBPs.
Step 2: Generate an initial scoring card (SCMInit) that consists of 400 propensity scores of dipeptides, obtained by subtracting the dipeptide contents in non-HBPs from those in HBPs. Then, the scores of all dipeptides are normalized into the range [0, 1000].
Step 3: Calculate the propensity score of each amino acid X by averaging the 40 propensity scores of dipeptides that contain X.
Step 4: Optimize the scoring card (Scard) of dipeptides using an intelligent genetic algorithm, IGA. The fitness function of IGA is to maximize both the prediction accuracy in terms of the area under the ROC curve (AUC) and the Pearson's correlation coefficient (R value) between the initial and optimized propensity scores of 20 amino acids. To prevent overfitting, the fitness function is calculated by performing a 10-CV assessment, and is as follows (W1 = 0.9 and W2 = 0.1 in this study).
Fit(Scard) = W1 × AUC + W2 × R
Step 5: Classify a query sequence P based on the scoring function S(P) and determine a threshold value that yields the highest training accuracy.
S(P) = Σ (i = 1 to 400) w_i × S_i
The variables w_i and S_i are the content and propensity score of the i-th dipeptide, respectively. P is classified as an HBP when S(P) exceeds the threshold value; otherwise, P is a non-HBP.
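Steps 2, 3, and 5 can be sketched in a few lines of Python. This is a hedged illustration rather than the SCMHBP implementation: the function names are introduced here, the dipeptide content of a class is assumed to be the mean of the per-sequence contents, the 40 dipeptides containing X are assumed to be the 20 of the form X? plus the 20 of the form ?X (so XX is counted twice), and the IGA optimization of Step 4 is omitted.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
DIPEPTIDE_SET = set(DIPEPTIDES)


def dipeptide_content(sequence):
    """Fraction of each of the 400 dipeptides among the valid pairs of one sequence."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    counts = Counter(p for p in pairs if p in DIPEPTIDE_SET)
    total = sum(counts.values())
    return {d: (counts[d] / total if total else 0.0) for d in DIPEPTIDES}


def mean_content(sequences):
    """Mean dipeptide content over a set of sequences (assumed pooling rule)."""
    contents = [dipeptide_content(s) for s in sequences]
    return {d: sum(c[d] for c in contents) / len(contents) for d in DIPEPTIDES}


def initial_scoring_card(positives, negatives):
    """Step 2: content in HBPs minus content in non-HBPs, min-max scaled to [0, 1000]."""
    pos, neg = mean_content(positives), mean_content(negatives)
    diff = {d: pos[d] - neg[d] for d in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all differences are equal
    return {d: 1000.0 * (diff[d] - lo) / span for d in DIPEPTIDES}


def amino_acid_propensity(card):
    """Step 3: average the 40 dipeptide scores of the forms X? and ?X for each X."""
    scores = {}
    for x in AMINO_ACIDS:
        values = [card[x + y] for y in AMINO_ACIDS] + [card[y + x] for y in AMINO_ACIDS]
        scores[x] = sum(values) / len(values)
    return scores


def score(sequence, card):
    """Step 5: S(P), the sum over dipeptides of content times propensity score."""
    content = dipeptide_content(sequence)
    return sum(content[d] * card[d] for d in DIPEPTIDES)


def best_threshold(positives, negatives, card):
    """Step 5: threshold with the highest training accuracy (HBP when S(P) > threshold)."""
    pos_scores = [score(s, card) for s in positives]
    neg_scores = [score(s, card) for s in negatives]
    best_t, best_acc = None, -1.0
    for t in sorted(set(pos_scores + neg_scores)):
        correct = sum(s > t for s in pos_scores) + sum(s <= t for s in neg_scores)
        acc = correct / (len(pos_scores) + len(neg_scores))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


# Toy sequences, invented for illustration only.
toy_hbps = ["CPCPCH", "CHCPCP"]
toy_non_hbps = ["SNSNST", "STSNSN"]
card = initial_scoring_card(toy_hbps, toy_non_hbps)
aa_scores = amino_acid_propensity(card)
threshold, train_acc = best_threshold(toy_hbps, toy_non_hbps, card)
```

Because the classifier is a single weighted sum compared against a threshold, every dipeptide's contribution to a prediction can be read directly from the card, which is the interpretability argument made for the SCM.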
Identifying informative physicochemical properties
The physicochemical properties of amino acids are widely recognized as effective features for predicting and analyzing the functions of proteins from primary sequences [26–28]. The AAindex database consists of 544 indices that were extracted from the published literature and represent various physicochemical and biological properties of amino acids. Each physicochemical property of amino acids is specified by a set of 20 numerical values. Any property whose set of 20 values contains the value 'NA' is removed. Finally, 531 properties are utilized in the following analysis. The propensity scores of amino acids, estimated by SCMHBP, facilitate the discovery of informative PCPs, providing insight into the characteristics of HBPs. The method of identification of informative PCPs, based on the SCM (SCM-PCPs), is composed of two main steps.
The first step is to compute the R values between the propensity scores of the amino acids and all 531 PCPs in AAindex. A large absolute value of R suggests that the physicochemical or biological property is highly correlated with the propensity to be an HBP. A property with an absolute value of R > 0.5 is preferred as a candidate for subsequent analysis.
The second step is to identify the informative PCPs from all candidates, based on domain knowledge of the function of the investigated proteins (HBPs in this work). Generally, a PCP in AAindex that was obtained under a particular condition, or for a specific family of proteins, that does not apply to the investigated proteins is not suited to the analysis.
The compositions of the amino acids in HBPs and non-HBPs effectively represent a property of HBPs. Accordingly, the composition property can also be used in SCM-PCPs in the same way as the propensity scores of amino acids. The identified PCPs offer a clue to the characteristics of HBPs.
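The first step of SCM-PCPs amounts to a correlation screen, which can be sketched as follows. The function names, the dictionary layout, and all numerical values below are hypothetical and are used only to show the mechanics: the four-residue toy properties are not AAindex entries, and a real run would use all 20 amino acids and the 531 retained indices.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def screen_properties(propensity, properties, cutoff=0.5):
    """Return (name, R) pairs with |R| > cutoff, sorted by decreasing |R|.

    propensity: dict mapping amino acid -> propensity score.
    properties: dict mapping property name -> dict of amino acid -> value,
                where None stands for an 'NA' entry; such properties are skipped.
    """
    residues = sorted(propensity)
    hits = []
    for name, values in properties.items():
        if any(values.get(a) is None for a in residues):
            continue
        r = pearson_r([propensity[a] for a in residues],
                      [values[a] for a in residues])
        if abs(r) > cutoff:
            hits.append((name, r))
    return sorted(hits, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical scores and properties for four residues (illustration only).
toy_propensity = {"F": 700.0, "H": 600.0, "S": 300.0, "N": 250.0}
toy_properties = {
    "aromatic_like": {"F": 1.0, "H": 0.8, "S": 0.1, "N": 0.0},
    "flexibility_like": {"F": 0.1, "H": 0.3, "S": 0.9, "N": 1.0},
    "unrelated": {"F": 1.0, "H": 0.0, "S": 1.0, "N": 0.0},
    "has_missing_value": {"F": 1.0, "H": None, "S": 0.2, "N": 0.1},
}
candidates = dict(screen_properties(toy_propensity, toy_properties))
```

The screen only proposes candidates; the second step, which selects informative PCPs using domain knowledge, remains a manual judgment.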
Results and discussion
Propensity scores of amino acids and dipeptides
Figure 2 presents the propensities of 400 dipeptides to be in HBPs, obtained by SCMHBP using the training dataset HBPGO-TRN1000 consisting of 500 HBPs and 500 non-HBPs. Table 3 presents the propensity scores of 20 amino acids that were derived from the propensity scores of 400 dipeptides. The table also presents the amino acid compositions of HBPs and non-HBPs with 199,772 and 200,188 residues, respectively. The correlation coefficient R between the propensity scores of the amino acids and the difference between the amino acid compositions of HBPs and non-HBPs is 0.92. This high correlation coefficient reveals that the propensity scores of amino acids can be used effectively to distinguish HBPs from non-HBPs. The value of R = -0.05 between the propensity scores of the amino acids and the composition of HBPs suggests that the propensity scores reflect the difference between the compositions of HBPs and non-HBPs, rather than the composition of HBPs only.
The c-type heme vinyl group forms covalent bonds with two particular cysteine residues of the Cys-Xaa-Xaa-Cys-His motif. The heme c proteins with histidine as an axial ligand have the classic CXXCH heme c binding motif. The heme b proteins have cysteine as an axial ligand. The propensity score of CH (Cys-His) is as high as 959 (Figure 2), which is consistent with the fact that CH is a dipeptide motif of HBPs. Li et al. analyzed the structures of 125 heme-binding protein chains and found that the dipeptide CP (Cys-Pro) heme regulatory motifs have an important structural role in protein-heme interactions when Cys functions as an axial ligand to the heme iron. Ogawa et al. indicated that four CP motifs are important to both the Bach1-heme interaction and the heme-mediated inhibition of DNA binding. Zenke-Kawasaki et al. found that CP motifs have a critical role in the heme-Bach1 interaction, which regulates the expression of the heme oxygenase-1 gene. The propensity score of the dipeptide CP is as high as 971, suggesting that the dipeptide motifs of HBPs have high propensity scores. Restated, the dipeptides with high scores have the potential to be dipeptide motifs. To elucidate the interaction between the CP motif and heme, 2PBJ is chosen from the HBP dataset. Figure 3(a) presents the secondary structure of 2PBJ, in which a CP motif is perpendicular to the heme plane. The CP motif is located close to the heme and interacts with it.
The five amino acids with the highest propensity scores are Phe, His, Gly, Trp, and Pro, whereas the residues with the lowest scores are Ser, Asn, Glu, Cys, and Thr. Smith et al. found that His predominates among the amino acids as an axial ligand to heme iron. The relative frequency of His in the HBPs decreases to the background level when the axial ligands are removed. Smith et al. noted the important role of Trp, Phe, and Tyr in protein-heme interactions. Liu and Hu measured the relative importance of the amino acids in heme binding interfaces and suggested that His, Phe, and Trp are overrepresented in binding pockets. Notably, the aromatic and non-polar residue Phe has the highest score and can form non-covalent bonds.
Performance of HBP predictors
The training dataset HBPGO-TRN1000 was used to design SCMHBP and the compared methods, which are the SVM, J48, and Bayes classifiers. For each classifier, four types of features were evaluated: 1) amino acid composition (AAC), 2) dipeptide composition (DPC), 3) normalized PSSM (PSSM400), and 4) 531 PCPs in AAindex. Table 4 compares the prediction accuracies of SCMHBP and the compared methods.
The training accuracy of SCMInit and the mean training accuracy of SCMHBP are 71.30% and 85.90%, respectively. The optimization of the scoring card by using an intelligent genetic algorithm (IGA) therefore increases the training accuracy by 14.60 percentage points. The SCMHBP method uses the 10-CV assessment to prevent overtraining of the model. This optimization method seeks to maximize the area under the ROC curve (AUC) to generate balanced sensitivity and specificity. The mean test accuracies of SCMHBP that are obtained using the DPC features on the three test datasets HBPGO-TST494, HemeBind145, and TargetS311 are 72.79%, 67.24%, and 74.69%, respectively. The mean test performance of SCMHBP (71.57%) is close to that of SVM-DPC (72.20%) using the same DPC features. The SCMHBP method is also comparable with the SVM-PSSM400 (75.98%) and SVM-AAindex (76.21%) methods. The score card with the best training performance, 86.60%, has a test accuracy of 74.22%, and is adopted in further PCP analysis.
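The AUC that the optimization targets can be computed directly from the scores S(P) of positive and negative sequences. The sketch below uses the rank-based definition of AUC and combines it with the R value in a weighted sum; the weighted-sum form with W1 = 0.9 and W2 = 0.1 is an assumption based on the description of the fitness function, the function names are introduced here, and the 10-CV evaluation and the IGA search itself are omitted.

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties contribute 0.5. This pairwise form is quadratic in the number of
    sequences, which is acceptable for a sketch.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def fitness(auc_value, r_value, w1=0.9, w2=0.1):
    """Assumed weighted-sum fitness of a scoring card: w1 * AUC + w2 * R."""
    return w1 * auc_value + w2 * r_value


# Toy scores (illustration only): every positive outscores every negative.
perfect = auc([850.0, 900.0], [100.0, 150.0])
# Identical score distributions give a chance-level AUC.
chance = auc([1.0, 2.0], [1.0, 2.0])
```

Because AUC depends only on the ranking of scores and not on a particular threshold, maximizing it does not favor one class over the other, which is consistent with the stated aim of balanced sensitivity and specificity.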
The methods that are presented in Table 4 are compared with SCMHBP. Whereas SCMHBP adopts only a simple weighted-sum classifier and the dipeptide features, the other methods use commonly used classifiers with only one type of easily interpretable feature. The combination of complementary features and an ensemble mechanism is widely recognized to improve prediction performance. The ensemble SCMCRYS method for predicting protein crystallization uses propensity scores of p-collocated amino acid pairs (p = 0 for a dipeptide) to improve on the performance of a single SCM classifier that uses only dipeptide features. Similarly, an ensemble SVM classifier with multiple feature types is expected to exhibit improved performance.
Consider the methods in Table 4. The SVM- and J48-based methods have high training accuracies that exceed 90%. However, the mean test accuracies of the SVM- and J48-based methods are less than 80% and 70%, respectively. The J48-based decision tree methods suffer from an obvious overtraining problem. The Bayes-based methods perform poorly. In summary, the SVM-based methods, and especially the SVM-PSSM400 method with a mean test accuracy of 75.98% using the normalized PSSM features, outperform the J48- and Bayes-based methods.
Identification of putative HBPs
The use of a scoring function S(P) for scoring a query sequence facilitates the identification of putative HBPs among the putative non-HBPs, in which some HBPs may not yet have been discovered. The unbalanced dataset reflects the natural occurrence of HBPs and non-HBPs. The test accuracy, positive predictive value, and negative predictive value of the SCMHBP method are 74.20%, 0.70%, and 99.88%, respectively, on the unbalanced test dataset consisting of 247 HBPs and 90,914 putative non-HBPs (HBPs account for 0.27%). The positive predictive value is not intrinsic to the test; it depends on the proportion of HBPs in the dataset and is therefore strongly influenced by the large number of putative non-HBPs. To identify potential HBPs from the putative non-HBPs, each sequence P of the 90,914 putative non-HBPs is scored using S(P). The top-20 sequences are listed in Table 5, including the name, UniProt ID, and annotated function.
The mean, maximum, and minimum scores of sequences in the training dataset HBPGO-TRN1000 are 554.70, 613.90, and 500.70, respectively, when the threshold value is 539.32. All 20 sequences in Table 5 have scores that exceed 648 and so have high potential to be HBPs. Dermotoxin-J2 has the highest score of 715.58, and Protein PCOTH, ranked 20th, has a score of 648.51. Interestingly, eight of the putative HBPs in Table 5 have unclear protein functions. The high scores of the putative HBPs suggest that further experimental confirmation is warranted.
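The dependence of the positive predictive value on class proportions can be made concrete with a short calculation. In the sketch below, the sensitivity of 0.67 and specificity of 0.742 are assumed values chosen for illustration; they are not reported in this section, although they give predictive values of the same order as those quoted above. The function name is also introduced here.

```python
def predictive_values(sensitivity, specificity, n_pos, n_neg):
    """Expected accuracy, PPV, and NPV for given rates and class sizes."""
    tp = sensitivity * n_pos          # HBPs predicted as HBPs
    fn = n_pos - tp                   # HBPs predicted as non-HBPs
    tn = specificity * n_neg          # non-HBPs predicted as non-HBPs
    fp = n_neg - tn                   # non-HBPs predicted as HBPs
    return {
        "accuracy": (tp + tn) / (n_pos + n_neg),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Proportion of HBPs in the unbalanced test dataset described above.
prevalence = 247 / (247 + 90914)

# Same assumed classifier, evaluated on unbalanced and balanced class sizes.
unbalanced = predictive_values(0.67, 0.742, 247, 90914)
balanced = predictive_values(0.67, 0.742, 247, 247)
```

With identical sensitivity and specificity, the positive predictive value falls from above 70% on the balanced set to below 1% on the unbalanced set, simply because false positives drawn from about 90,000 non-HBPs outnumber the true positives drawn from 247 HBPs.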
Propensity analysis using informative PCPs
Table 6 presents the three physicochemical properties (PCPs) selected by SCM-PCPs. The Pearson correlation coefficients (R value) between the PCPs in AAindex and the propensity scores of amino acids help to identify informative PCPs that are useful in further analysis. The three PCPs and their R values are SNEP660103 (R = 0.604), TAKK010101 (R = 0.576) and KARP850101 (R=-0.555). The three PCPs of HBPs are analyzed and discussed below.
A. Aromaticity of the HBP side chains and its contribution
The SNEP660103 property, described as ''Relations between chemical structure and biological activity in peptides'' , is ranked 1 by R value (=0.604). SNEP660103 is a composite feature which interprets a set of the PCPs of the 20 natural amino acids as four vectors, extracted via principal component analysis from the ϕ coefficients. The propensity scores of HBPs have the highest R value with the vector III, which represents aromatic properties . The large R value reveals that the aromatic residues have a higher propensity than the non-aromatic residues to be HBPs.
The aromatic residues, whose side chains contain aromatic rings and π electrons, perform unique functions in proteins, such as the hydrophobic interaction, the aromatic-aromatic (π-π) interaction [20, 48], and the carrying of mobile electrons in reactions that involve electron transfer [49–51]. Smith et al. assessed the chemical composition of the heme binding sites and suggested that the aromatic residues that support π-π interactions tend to adopt an off-set, parallel, staggered, or edge-to-face orientation relative to the heme group. This suggestion reveals the importance of the aromatic residues in maintaining the orientation of heme in HBPs. In investigations of HBP protein activity, the transition metal redox centers in the structures of HBPs are involved in biological electron transfer as mobile electron carriers that shuttle electrons between reductase and oxidase complexes [49–51]. Heme contains an Fe site that can switch between the reduced (FeII) and oxidized (FeIII) states. The molecules that include aromatic π-systems have relatively low redox potentials and a greater probability of undergoing electron transfer reactions. These aromatic residues are thought to increase the redox potentials of heme because the conserved aromatic side chains are close to it. A sequence study showed that the His residue that is located proximally and distally to the heme is highly conserved.
The four aromatic residues with the highest SNEP660103 scores are Trp, Phe, Tyr, and His. From Table 6, the aromatic residues Phe, His, Trp, and Tyr, with scores of 705.3, 615.6, 603.1, and 541.0, are ranked 1, 2, 4, and 12, respectively. As presented in Figure 3(c), the two His residues, His39 and His63, are located close to the heme group. These conserved histidines are thought to be involved in binding the aromatic donor, forming the catalytic intermediate, and transferring electrons. The composition of the four aromatic residues (Phe, His, Trp, and Tyr) in HBPs is 12.18%, which is larger than that (10.58%) in non-HBPs (Table 3). This difference supports the result obtained by SCMHBP with SCM-PCPs and is consistent with previous HBP studies.
B. The hydrophobic character of side chains for stabilizing HBPs
The property TAKK010101 with R = 0.576 is described as "Side-chain contribution to protein stability". Hydrophobic interaction influences the stability and various functions of proteins. To estimate the hydrophobic effects, several hydrophobic metrics and various methods of measuring them have been developed. However, a more precise scale that is convenient to use can be obtained by using the free energies that are measured from the difference between the native and denatured states to quantify the hydrophobic contribution to proteins. This metric quantifies the energetic contribution of the hydrophobic character of each amino acid side chain to the stability of the protein, adjusted according to the conformational entropy of the side chain. The high correlation reveals that the hydrophobic effects have a more important role in the stability of HBPs than in that of non-HBPs.
Numerous HBPs have a globular structure and hydrophobic bonds have a major role in organizing their self-assembly . HBPs that bind to a heme group become permanently fully folded and stable only upon association with the hydrophobic heme group and the formation of additional hydrophobic bonds . The structures of the holo-HBPs appear to be stabilized by several hydrophobic contacts . These findings are in agreement with the obtained high propensity scores for hydrophobic amino acids Phe, Trp, His and Gly. Notably, a previous investigation found comparably high changes in stability (ΔΔG) for Phe (23.0 kJ/mol), Trp (24.2 kJ/mol) and His (11.9 kJ/mol) .
To estimate the contribution of the hydrophobic character of amino acids, the hydrophobic contribution of the side chain is calculated using the training dataset HBPGO-TRN1000 according to the property TAKK010101. The mean hydrophobic contribution of the side chains of HBPs is 10.20 kJ/mol, which significantly exceeds that (9.98 kJ/mol) of non-HBPs, according to the Mann-Whitney test with p < 0.05. This result reveals that HBPs favor hydrophobic residues more than non-HBPs do.
Along with hydrophobic contacts, multiple other interactions mediate the binding of a heme group. These include coordination bond(s) to the iron ion, van der Waals contacts between the planar porphyrin ring and protein side chains, and electrostatic interactions between the heme propionate substituents and positively charged protein residues. Figures 3(b) and 3(d) present the surface hydrophobicity of 2PBJ and 1CYO; the binding sites of these HBPs (around the heme group) are very hydrophobic. In the presented scoring card, hydrophobic residues have high scores: Phe, Trp, Pro and Met score 705.3, 603.1, 601.3 and 599.6, respectively. These results suggest that hydrophobic residues have important roles in HBPs and highlight the importance of the hydrophobic interaction between heme and its binding sites.
C. Inflexibility of HBPs with flexible heme binding residues
The property KARP850101, ranked first by SCM-PCPs with R = -0.555 (a strong inverse correlation), is described as "Flexibility parameter for no rigid neighbors". Because HBPs have various functions and work in various environments with, for example, different pH values [57, 58], their stability is particularly important. Numerous investigations of HBPs have focused on the binding sites and indicate that these sites are flexible [59, 60]. However, those investigations did not consider whole protein structures or compare them with those of non-HBPs. Hydrophobic interactions influence the rigidity of protein structures. Accordingly, the hydrophobic residues are postulated not only to compose the hydrophobic binding sites for heme, but also to maintain the stability of HBPs.
The inverse correlation suggests that more rigid amino acids have higher scores. The rigidity of a protein affects its structural stability. Because proteins lose their rigidity during unfolding, less rigid proteins are expected to reach the unfolded state more rapidly. Our results suggest that HBPs are less flexible than non-HBPs, which may contribute to their stability.
Numerous investigations have found that the binding sites of HBPs are flexible [59, 60], but none compared the rigidity of whole HBPs with that of non-HBPs. To make this comparison, the B-factor, an atomic displacement parameter that quantifies the fluctuation of each atom in a protein, is used as a measure of flexibility. The Cα B-factors are extracted from the TargetS311 dataset of HBPs and from a dataset of 311 non-HBPs, which provides background values. Table 7 presents the amino acid scores and the mean B-factors of the Cα atoms and side chains.
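Extracting Cα B-factors of this kind from PDB-format files can be sketched as below. This is a minimal sketch that assumes standard fixed-column ATOM records (atom name in columns 13-16, residue name in columns 18-20, temperature factor in columns 61-66); the function names and the toy records are hypothetical, and the sketch does not claim to reproduce the pipeline used for Table 7.

```python
from collections import defaultdict

def make_atom_line(serial, atom, res, bfactor):
    """Build a toy fixed-column ATOM record (chain A, residue 1, zero coordinates)."""
    return "ATOM  %5d  %-3s %3s A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, atom, res, 1, 0.0, 0.0, 0.0, 1.0, bfactor)

def ca_bfactors_by_residue(pdb_lines):
    """Collect C-alpha B-factors from ATOM records, grouped by residue name."""
    grouped = defaultdict(list)
    for line in pdb_lines:
        # Restricting to ATOM records avoids confusing C-alpha with calcium (HETATM).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            grouped[line[17:20]].append(float(line[60:66]))
    return grouped

def mean_bfactors(grouped):
    """Average the collected B-factors for each residue type."""
    return {res: sum(vals) / len(vals) for res, vals in grouped.items()}

# Toy records; real input would be the lines of a PDB file.
records = [
    make_atom_line(1, "N", "MET", 10.0),
    make_atom_line(2, "CA", "MET", 12.0),
    make_atom_line(3, "CA", "GLY", 20.0),
    make_atom_line(4, "CA", "MET", 14.0),
]
print(mean_bfactors(ca_bfactors_by_residue(records)))
```

Averaging per residue type over all structures in a dataset would give one mean B-factor per amino acid, which is the quantity that can then be set against the amino acid propensity scores.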
From Table 7, the R values between the mean B-factors of residues and the amino acid scores are -0.45 for HBPs and -0.26 for non-HBPs, with p-values of 0.02 and 0.18, respectively. The results indicate that, in HBPs, residues with higher propensity scores tend to have lower B-factors, whereas the correlation for non-HBPs is not significant at p < 0.05. Although the rigidity of HBPs has been studied previously, the results herein suggest that whole HBPs are less flexible than non-HBPs. Even though HBPs exhibit allosteric mechanisms and are thought to have flexible binding sites, a statistical study indicates that the allosteric mechanism seen in various HBPs, such as hemoglobin and myoglobin, involves a conformational change in side-chain rotamers without backbone changes.
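A correlation of this kind can be computed as sketched below, assuming that R denotes the Pearson correlation coefficient between the amino acid scores and the mean B-factors. The function names are hypothetical, the four scores are the Phe, Trp, Pro and Met values quoted in the text, and the B-factors are made-up placeholders rather than Table 7 values. The t statistic with n - 2 degrees of freedom is included only to show how a p-value would be derived; the sketch does not attempt to reproduce the reported p-values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

scores = [705.3, 603.1, 601.3, 599.6]      # Phe, Trp, Pro, Met (from the text)
toy_bfactors = [18.0, 21.5, 20.0, 22.5]    # PLACEHOLDER mean B-factors
r = pearson_r(scores, toy_bfactors)
print(f"R = {r:.3f}, t = {t_statistic(r, len(scores)):.3f}")
```

In the full analysis both vectors would have 20 entries, one per amino acid, so the test would have 18 degrees of freedom.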
To illustrate these observations, the HBP (3BL2) with the highest score (608) in the TargetS311 dataset and two non-HBPs (1CW0 and 1E5H) are considered. The non-HBP 1CW0 has the lowest score (357), and the randomly selected 1E5H has a score of 540. Figure 4 plots the B-factor distributions of the HBP (3BL2) and the non-HBPs (1CW0 and 1E5H). The blue residues of the HBP indicate that the B-factor of the whole HBP is low. The green and yellow residues of the non-HBPs indicate that these two proteins have higher B-factors than the HBP.
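For orientation, a sequence score of the kind quoted above can be sketched as a composition-weighted sum of dipeptide propensity scores, which is how the scoring card method is usually formulated. This is an assumption about the scoring step, not a reproduction of SCMHBP: the dipeptide scores, the example sequence and the function names are hypothetical placeholders.

```python
from collections import Counter

def dipeptide_composition(seq):
    """Return the fraction of each overlapping dipeptide in the sequence."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    if not pairs:
        return {}
    counts = Counter(pairs)
    return {dp: c / len(pairs) for dp, c in counts.items()}

def scm_score(seq, dipeptide_scores, default=0.0):
    """Weighted sum of dipeptide scores, using composition fractions as weights."""
    comp = dipeptide_composition(seq)
    return sum(frac * dipeptide_scores.get(dp, default) for dp, frac in comp.items())

# PLACEHOLDER scores on a 0-1000 scale; a real scoring card has 400 entries.
toy_scores = {"AC": 300.0, "CP": 900.0, "PA": 600.0}
print(scm_score("ACPA", toy_scores))
```

Under this formulation a sequence would be classified as an HBP when its score exceeds a threshold chosen on the training data.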
Possible causes of the lower flexibility of HBPs are as follows. 1) HBPs typically function in various environments; for example, they may transport oxygen to organs or transfer energy in mitochondria. According to a previous circulation study, PCO2 affects the pH of the blood. Reducing PCO2 increases arterial pH, which changes the oxygen binding ability of hemoglobin [57, 58]. Hemoglobin must therefore tolerate the different pH values of these environments. 2) HBPs are involved in some important signaling processes, such as apoptosis, which involve a change in conformation. In cytochrome c552, domain swapping that arises from protein flexibility moves a domain away from its original position in the protein. The domain-swapped proteins can further bind to each other to form dimers or oligomers. If domain swapping occurs in cytochrome c552, the resulting protein oligomers will trigger apoptosis.
This work proposes a scoring card method (SCM) based method, SCMHBP, for predicting and analyzing HBPs from sequences. SCMHBP performs well in terms of mean test accuracy when applied to three independent datasets. Additionally, the propensity scores of dipeptides and amino acids support the identification of informative physicochemical properties that provide insight into heme binding proteins. The CP motif has a high propensity score, suggesting that this motif is important for HBPs. SCM-PCPs is used to identify the physicochemical properties of HBPs, and three PCPs, SNEP660103, TAKK010101 and KARP850101, are identified. In summary, the aromaticity and hydrophobicity of the side chains of HBP residues appear to be important factors in determining the functional performance and stability of HBPs. Additionally, HBPs appear to have a more rigid structure than non-HBPs. The datasets used and the source code for SCMHBP are available at http://iclab.life.nctu.edu.tw/SCMHBP/.
Bowman SEJ, Bren KL: The chemistry and biochemistry of heme c: functional bases for covalent attachment. Nat Prod Rep. 2008, 25 (6): 1118-1130. 10.1039/b717196j.
Fufezan C, Zhang J, Gunner MR: Ligand preference and orientation in b- and c-type heme-binding proteins. Proteins. 2008, 73 (3): 690-704. 10.1002/prot.22097.
Gray HB, Winkler JR: Electron transfer in proteins. Annu Rev Biochem. 1996, 65: 537-561. 10.1146/annurev.bi.65.070196.002541.
Terwilliger NB: Functional adaptations of oxygen-transport proteins. J Exp Biol. 1998, 201 (8): 1085-1098.
Guengerich FP, Macdonald TL: Chemical Mechanisms of Catalysis by Cytochromes-P-450 - a Unified View. Accounts Chem Res. 1984, 17 (1): 9-16. 10.1021/ar00097a002.
Zenke-Kawasaki Y, Dohi Y, Katoh Y, Ikura T, Ikura M, Asahara T, Tokunaga F, Iwai K, Igarashi K: Heme induces ubiquitination and degradation of the transcription factor Bach1. Molecular and Cellular Biology. 2007, 27 (19): 6962-6971. 10.1128/MCB.02415-06.
Kaasik K, Lee CC: Reciprocal regulation of haem biosynthesis and the circadian clock in mammals. Nature. 2004, 430 (6998): 467-471. 10.1038/nature02724.
Faller M, Matsunaga M, Yin S, Loo JA, Guo F: Heme is involved in microRNA processing. Nat Struct Mol Biol. 2007, 14 (1): 23-29. 10.1038/nsmb1182.
Huang H, Hu NF, Zeng YH, Zhou G: Electrochemistry and electrocatalysis with heme proteins in chitosan biopolymer films. Anal Biochem. 2002, 308 (1): 141-151. 10.1016/S0003-2697(02)00242-7.
Zhou YL, Hu NF, Zeng YH, Rusling JF: Heme protein-clay films: Direct electrochemistry and electrochemical catalysis. Langmuir. 2002, 18 (1): 211-219. 10.1021/la010834a.
Nagaoka H: Application of a Heme-Binding Protein Eluted from Encapsulated Biomaterials to the Catalysis of Enantioselective Oxidation. Acs Catal. 2014, 4 (2): 553-565. 10.1021/cs400768x.
Dassama LMK, Yosca TH, Conner DA, Lee MH, Blanc B, Streit BR, Green MT, DuBois JL, Krebs C, Bollinger JM: O-2-Evolving Chlorite Dismutase as a Tool for Studying O-2-Utilizing Enzymes. Biochemistry-Us. 2012, 51 (8): 1607-1616. 10.1021/bi201906x.
Frankenberg N, Moser J, Jahn D: Bacterial heme biosynthesis and its biotechnological application. Appl Microbiol Biot. 2003, 63 (2): 115-127. 10.1007/s00253-003-1432-2.
Ebihara A, Okamoto A, Kousumi Y, Yamamoto H, Masui R, Ueyama N, Yokoyama S, Kuramitsu S: Structure-based functional identification of a novel heme-binding protein from Thermus thermophilus HB8. J Struct Funct Genomics. 2005, 6 (1): 21-32. 10.1007/s10969-005-1103-x.
Merkley ED, Anderson BJ, Park J, Belchik SM, Shi L, Monroe ME, Smith RD, Lipton MS: Detection and Identification of Heme c-Modified Peptides by Histidine Affinity Chromatography, High-Performance Liquid Chromatography-Mass Spectrometry, and Database Searching. Journal of Proteome Research. 2012, 11 (12): 6147-6158.
Babusiak M, Man P, Sutak R, Petrak J, Vyoral D: Identification of heme binding protein complexes in murine erythroleukemic cells: Study by a novel two-dimensional native separation - liquid chromatography and electrophoresis. Proteomics. 2005, 5 (2): 340-350. 10.1002/pmic.200400935.
Bartsch RG, Kamen MD: Isolation and properties of two soluble heme proteins in extracts of the photoanaerobe Chromatium. J Biol Chem. 1960, 235: 825-831.
Lan EH, Dave BC, Fukuto JM, Dunn B, Zink JI, Valentine JS: Synthesis of sol-gel encapsulated heme proteins with chemical sensing properties. J Mater Chem. 1999, 9 (1): 45-53. 10.1039/a805541f.
Schneider S, Marles-Wright J, Sharp KH, Paoli M: Diversity and conservation of interactions for binding heme in b-type heme proteins. Natural Product Reports. 2007, 24 (3): 621-630. 10.1039/b604186h.
Smith LJ, Kahraman A, Thornton JM: Heme proteins--diversity in structural characteristics, function, and folding. Proteins-Structure Function and Bioinformatics. 2010, 78 (10): 2349-2368. 10.1002/prot.22747.
Li T, Bonkovsky HL, Guo JT: Structural analysis of heme proteins: implications for design and prediction. Bmc Structural Biology. 2011, 11:
Liu R, Hu JJ: HemeBIND: a novel method for heme binding residue prediction by combining structural and sequence information. Bmc Bioinformatics. 2011, 12:
Liu R, Hu JJ: Computational Prediction of Heme-Binding Residues by Exploiting Residue Interaction Network. Plos One. 2011, 6 (10):
Xiong Y, Liu J, Zhang W, Zeng T: Prediction of heme binding residues from protein sequences with integrative sequence profiles. Proteome Science. 2012, 10:
Yu DJ, Hu J, Yang J, Shen HB, Tang JH, Yang JY: Designing Template-Free Predictor for Targeting Protein-Ligand Binding Sites with Classifier Ensemble and Spatial Clustering. Ieee-Acm Transactions on Computational Biology and Bioinformatics. 2013, 10 (4): 994-1008.
Charoenkwan P, Shoombuatong W, Lee HC, Chaijaruwanich J, Huang HL, Ho SY: SCMCRYS: Predicting Protein Crystallization Using an Ensemble Scoring Card Method with Estimating Propensity Scores of P-Collocated Amino Acid Pairs. Plos One. 2013, 8 (9):
Huang HL, Charoenkwan P, Kao TF, Lee HC, Chang FL, Huang WL, Ho SJ, Shu LS, Chen WL, Ho SY: Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. Bmc Bioinformatics. 2012, 13:
Huang H-L: Propensity Scores for Prediction and Characterization of Bioluminescent Proteins from Sequences. PLoS ONE. 2014, 9 (5): e97158. 10.1371/journal.pone.0097158.
Ho SY, Shu LS, Chen JH: Intelligent evolutionary algorithms for large parameter optimization problems. Ieee T Evolut Comput. 2004, 8 (6): 522-541. 10.1109/TEVC.2004.835176.
Kawashima S, Kanehisa M: AAindex: amino acid index database. Nucleic Acids Res. 2000, 28 (1): 374. 10.1093/nar/28.1.374.
Yang J, Roy A, Zhang Y: BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2013, 41 (Database issue): D1096-1103.
Frank E, Hall M, Trigg L, Holmes G, Witten IH: Data mining in bioinformatics using Weka. Bioinformatics. 2004, 20 (15): 2479-2481. 10.1093/bioinformatics/bth261.
Sneath PH: Relations between chemical structure and biological activity in peptides. J Theor Biol. 1966, 12 (2): 157-195. 10.1016/0022-5193(66)90112-3.
Takano K, Yutani K: A new scale for side-chain contribution to protein stability based on the empirical stability analysis of mutant proteins. Protein Eng. 2001, 14 (8): 525-528. 10.1093/protein/14.8.525.
Karplus PA, Schulz GE: Prediction of Chain Flexibility in Proteins - a Tool for the Selection of Peptide Antigens. Naturwissenschaften. 1985, 72 (4): 212-213. 10.1007/BF01195768.
Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2013, 41 (Database issue): D43-47.
Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004, 32: D258-D261. 10.1093/nar/gkh036.
Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010, 26 (19): 2460-2461. 10.1093/bioinformatics/btq461.
Yamaguchi A, Iida K, Matsui N, Tomoda S, Yura K, Go M: Het-PDB Navi.: a database for protein-small molecule interactions. J Biochem. 2004, 135 (1): 79-84. 10.1093/jb/mvh009.
Chang C, Lin C: LIBSVM : a library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011, 2: 21-27.
Huang HL, Lin IC, Liou YF, Tsai CT, Hsu KT, Huang WL, Ho SJ, Ho SY: Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties. Bmc Bioinformatics. 2011, 12 (Suppl 1): S47. 10.1186/1471-2105-12-S1-S47.
Salzberg SL: C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Machine Learning. 1994, 16 (3): 235-240.
Han J, Kamber M: Data Mining: Concepts and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems). 2006, Elsevier, second
Bradley AP: The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recogn. 1997, 30 (7): 1145-1159. 10.1016/S0031-3203(96)00142-2.
Ogawa K, Sun J, Taketani S, Nakajima O, Nishitani C, Sassa S, Hayashi N, Yamamoto M, Shibahara S, Fujita H: Heme mediates derepression of Maf recognition element through direct binding to transcription repressor Bach1. EMBO J. 2001, 20 (11): 2835-2843. 10.1093/emboj/20.11.2835.
Zenke-Kawasaki Y, Dohi Y, Katoh Y, Ikura T, Ikura M, Asahara T, Tokunaga F, Iwai K, Igarashi K: Heme induces ubiquitination and degradation of the transcription factor Bach1. Mol Cell Biol. 2007, 27 (19): 6962-6971. 10.1128/MCB.02415-06.
Okusawa T, Fujita M, Nakamura J, Into T, Yasuda M, Yoshimura A, Hara Y, Hasebe A, Golenbock DT, Morita M: Relationship between structures and biological activities of mycoplasmal diacylated lipopeptides and their recognition by toll-like receptors 2 and 6. Infect Immun. 2004, 72 (3): 1657-1665. 10.1128/IAI.72.3.1657-1665.2004.
Shimizu T: Diverse role of conserved aromatic amino acids in the electron transfer of cytochrome P450 catalytic functions: site-directed mutagenesis studies. Recent Research Developments in Pure and Applied Chemistry. 1997, 1: 196-175.
Volkov AN, van Nuland NA: Electron transfer interactome of cytochrome C. PLoS Comput Biol. 2012, 8 (12): e1002807. 10.1371/journal.pcbi.1002807.
Takayama Y, Harada E, Kobayashi R, Ozawa K, Akutsu H: Roles of noncoordinated aromatic residues in redox regulation of cytochrome c(3) from Desulfovibrio vulgaris Miyazaki F. Biochemistry-Us. 2004, 43 (34): 10859-10866. 10.1021/bi049551i.
Davis AC, Cornelison MJ, Meyers KT, Rajapakshe A, Berry RE, Tollin G, Enemark JH: Effects of mutating aromatic surface residues of the heme domain of human sulfite oxidase on its heme midpoint potential, intramolecular electron transfer, and steady-state kinetics. Dalton T. 2013, 42 (9): 3043-3049. 10.1039/c2dt31508d.
Gribaldo S, Casane D, Lopez P, Philippe H: Functional divergence prediction from evolutionary analysis: a case study of vertebrate hemoglobin. Mol Biol Evol. 2003, 20 (11): 1754-1759. 10.1093/molbev/msg171.
Doig AJ, Sternberg MJ: Side-chain conformational entropy in protein folding. Protein Sci. 1995, 4 (11): 2247-2251. 10.1002/pro.5560041101.
Kauzmann W: Some factors in the interpretation of protein denaturation. Adv Protein Chem. 1959, 14: 1-63.
Hargrove MS, Barrick D, Olson JS: The association rate constant for heme binding to globin is independent of protein structure. Biochemistry-Us. 1996, 35 (35): 11293-11299. 10.1021/bi960371l.
Mukhopadhyay K, Lecomte JT: A relationship between heme binding and protein stability in cytochrome b5. Biochemistry-Us. 2004, 43 (38): 12227-12236. 10.1021/bi0488956.
Di Russo NV, Estrin DA, Marti MA, Roitberg AE: pH-Dependent conformational changes in proteins and their effect on experimental pK(a)s: the case of Nitrophorin 4. PLoS Comput Biol. 2012, 8 (11): e1002761. 10.1371/journal.pcbi.1002761.
Di Luccio E, Ishida Y, Leal WS, Wilson DK: Crystallographic Observation of pH-Induced Conformational Changes in the Amyelois transitella Pheromone-Binding Protein AtraPBP1. Plos One. 2013, 8 (2):
Ohkura K, Kawaguchi Y, Watanabe Y, Masubuchi Y, Shinohara Y, Hori H: Flexible Structure of Cytochrome P450: Promiscuity of Ligand Binding in the CYP3A4 Heme Pocket. Anticancer Res. 2009, 29 (3): 935-942.
Villareal VA, Pilpa RM, Robson SA, Fadeev EA, Clubb RT: The IsdC Protein from Staphylococcus aureus Uses a Flexible Binding Pocket to Capture Heme. J Biol Chem. 2008, 283 (46): 31591-31600. 10.1074/jbc.M801126200.
Rader AJ, Hespenheide BM, Kuhn LA, Thorpe MF: Protein unfolding: Rigidity lost. P Natl Acad Sci USA. 2002, 99 (6): 3540-3545. 10.1073/pnas.062492699.
Parthasarathy S, Murthy MRN: Analysis of temperature factor distribution in high-resolution protein structures. Protein Science. 1997, 6 (12): 2561-2567.
Itoh K, Sasai M: Statistical mechanics of protein allostery: roles of backbone and side-chain structural fluctuations. J Chem Phys. 2011, 134 (12): 125102. 10.1063/1.3565025.
Grundler W, Weil MH, Rackow EC: Arteriovenous Carbon-Dioxide and Ph Gradients during Cardiac-Arrest. Circulation. 1986, 74 (5): 1071-1074. 10.1161/01.CIR.74.5.1071.
Hayashi Y, Nagao S, Osuka H, Komori H, Higuchi Y, Hirota S: Domain Swapping of the Heme and N-Terminal alpha-Helix in Hydrogenobacter thermophilus Cytochrome c(552) Dimer. Biochemistry-Us. 2012, 51 (43): 8608-8616. 10.1021/bi3011303.
This work was funded by the National Science Council of Taiwan under the contract number NSC-103-2221-E-009-117-, and by the "Center for Bioinformatics Research of Aiming for the Top University Program" of the National Chiao Tung University and Ministry of Education, Taiwan, R.O.C., for the project 103W962. This work was also supported in part by the UST-UCSD International Center of Excellence in Advanced Bioengineering, sponsored by the Taiwan National Science Council I-RiCE Program under Grant Number NSC-102-2911-I-009-101. Publication charges for this work are funded by the project 103W962.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 16, 2014: Thirteenth International Conference on Bioinformatics (InCoB2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S16.
The authors declare that they have no competing interests.
YFL and PC conceived the idea of this work, carried out the system design, and participated in manuscript preparation. YSS and TV conducted the analysis of physicochemical properties and participated in manuscript preparation. PC and SCL implemented the programs. SCL and YHC established the website. HCL participated in the experimental analysis and discussion. HLH and SYH participated in the system design, supervised the whole project and coordination, and helped to write the manuscript. All authors have read and approved the final manuscript.
Yi-Fan Liou and Phasit Charoenkwan contributed equally to this work.