- Open Access
SCMHBP: prediction and analysis of heme binding proteins using propensity scores of dipeptides
- Yi-Fan Liou†1,
- Phasit Charoenkwan†1,
- Yerukala Sathipati Srinivasulu1,
- Tamara Vasylenko1,
- Shih-Chung Lai1,
- Hua-Chin Lee1, 2,
- Yi-Hsiung Chen1,
- Hui-Ling Huang1, 2 and
- Shinn-Ying Ho1, 2
© Liou et al.; licensee BioMed Central Ltd. 2014
- Published: 8 December 2014
Heme binding proteins (HBPs) are metalloproteins that contain a heme ligand (an iron-porphyrin complex) as the prosthetic group. Several computational methods have been proposed to predict heme binding residues and thereby to understand the interactions between heme and its host proteins. However, few in silico methods for identifying HBPs have been proposed.
This work proposes a method based on the scoring card method (SCM), named SCMHBP, for predicting and analyzing HBPs from sequences. A balanced dataset of 747 HBPs (selected using the Gene Ontology term GO:0020037) and 747 non-HBPs (selected from 91,414 putative non-HBPs) with a sequence identity of 25% was first established. A set of scores that quantify the propensity of amino acids and dipeptides to be HBPs was then estimated using SCM to maximize the predictive accuracy of SCMHBP. Finally, the informative physicochemical properties of the 20 amino acids were identified from the estimated propensity scores and used to categorize HBPs. The training and mean test accuracies of SCMHBP applied to three independent test datasets are 85.90% and 71.57%, respectively. SCMHBP performs well in comparison with methods such as support vector machine (SVM), decision tree J48, and Bayes classifiers. The putative non-HBPs with high sequence propensity scores are potential HBPs, which can be further validated experimentally. The propensity scores of individual amino acids and dipeptides are examined to elucidate the interactions between heme and its host proteins. The following characteristics of HBPs are derived from the propensity scores: 1) aromatic side chains are important to the effectiveness of specific HBP functions; 2) a hydrophobic environment is important in the interaction between heme and binding sites; and 3) the whole HBP has low flexibility whereas the heme binding residues are relatively flexible.
SCMHBP yields knowledge that improves our understanding of HBPs rather than merely improving the accuracy of HBP prediction.
- Support Vector Machine
- Propensity Score
- Heme Binding
- Dipeptide Composition
Heme binding proteins (HBPs) are metalloproteins that contain a heme ligand (an iron-porphyrin complex) as a prosthetic group. HBPs exist in various forms. The most common hemes are the b- and c-types. The b-type heme binds to proteins non-covalently, whereas the vinyl groups of the c-type heme form covalent bonds with two specific cysteine residues of the Cys-Xaa-Xaa-Cys-His motif [1, 2]. Other important hemes include a-, o-, and d-type hemes, which are found in bacteria and eukaryotes. How a heme binds to an HBP depends on its type. Understanding the structure-function relationships that govern the formation and dissociation of the heme iron complex is useful for rational HBP engineering.
HBPs perform a variety of biological functions, such as electron transfer, diatomic gas transportation/storage, chemical catalysis, transcriptional regulation, ion channel chemosensing, circadian clock control, and microRNA processing. This functional diversity gives HBPs various specialized applications, such as in biopolymer films that are used in electrocatalysis research [9, 10] and as components of new biocatalysts. The heme enzyme chlorite dismutase has been successfully utilized in studying O2-utilizing enzymes. Studies of HBPs and the heme biosynthesis pathway are crucial to the development of efficient cancer phototherapies. Identifying novel HBPs increases their range of applications.
Identifying a novel HBP requires some complex experiments and analysis, such as analysis of its crystal structure and sequence, and biochemical assays . Merkley et al.  combined histidine affinity chromatography with a proteomics database to improve the identification of c-type heme peptides in liquid chromatography-tandem mass spectrometry experiments. Babusiak et al.  combined native chromatography with native electrophoresis to examine HBP complexes in murine erythroleukemia cells. Designing artificial HBPs is a good approach to exploring novel applications. Accordingly, understanding the physicochemical properties (PCPs) of HBPs is also crucial. With respect to PCP analysis, previous investigations have examined the isoelectric point, acidity, chemical sensing, oxidation, reduction, and ligand binding catalysis  of HBPs. The PCPs and the transportation of electrons have been observed from flavocytochromes .
The functional diversity of HBPs suggests that the sequences and structures of HBPs are highly variable, which limits the use of homology-based search methods for their discovery. Most in silico investigations focus on heme binding sites and heme binding mechanisms [19–21], rather than the identification of HBPs. Schneider et al. examined b-type heme proteins with various folding topologies and elucidated their ability to bind chemically identical heme ligands. They found that the residues from these different topologies cluster at particular interaction "hot spots" and define some common structural heme-binding motifs. Smith et al. comprehensively analyzed a dataset of non-homologous HBPs; their results revealed some typical characteristics of the heme groups and their binding sites in proteins with various functionalities. Li et al. elucidated the differences between the apo and holo structures of HBPs and found that HBPs generally undergo small conformational changes following heme binding.
In addition to investigations of binding sites, several support vector machine (SVM) based methods predict heme binding residues to elucidate the mechanism of heme-protein interactions. HemeBIND is an SVM-based ensemble predictor that uses two feature sets: position specific scoring matrices (PSSM) and structural information (comprising solvent accessibility, depth, and a protrusion index). HemeNET uses topological properties of binding residues from the 3D interaction networks to improve its predictive performance. Xiong et al. and TargetS proposed sequence-based prediction methods that took into account the low availability of the 3D structures of HBPs.
Since the aforementioned experimental approaches are time-consuming and labor-intensive, and few effective in silico methods for identifying HBPs are available, this work proposes a novel method (SCMHBP) for predicting and analyzing HBPs from primary sequences. SCMHBP uses the scoring card method (SCM) proposed by Huang et al. [26–28] to estimate the propensity scores of amino acids and dipeptides to be HBPs. The propensity scores of dipeptides are calculated from the difference between the dipeptide compositions of HBPs and non-HBPs and are further optimized by applying an intelligent genetic algorithm. The propensity scores of amino acids can then be utilized to discover PCPs that are highly related to HBPs by exploring 531 PCPs in the AAindex database. The advantages of SCMHBP are its accurate predictions, simple methodology, and high interpretability.
Two presently available datasets of HBPs are HemeBind145 and TargetS233. However, no dataset of non-HBPs is available. Therefore, a new dataset is first established from the SwissProt database, consisting of 747 HBPs (selected using the Gene Ontology term GO:0020037) and 747 non-HBPs (selected from 91,414 putative non-HBPs) with a sequence identity of 25%. This dataset is divided into training and test datasets. The TargetS233 dataset is enlarged by retrieving HBPs from the new version of BioLiP, which is a ligand-protein binding database. The resulting dataset of 311 HBPs is named TargetS311. The training and mean test accuracies of SCMHBP when applied to three independent test datasets are 85.90% and 71.57%, respectively. SCMHBP performs well in comparison with typical methods such as SVM, decision tree J48, and Bayes classifier-based methods, which are implemented using WEKA.
To characterize HBPs, the propensity scores of individual amino acids and dipeptides are investigated to elucidate the interactions between heme and its host proteins. Additionally, based on the correlation coefficient (R value) between the propensity scores of amino acids and the PCPs in AAindex, three informative PCPs are identified: 1) SNEP660103 with R = 0.604, described as "principal component III"; 2) TAKK010101 with R = 0.576, described as "side-chain contribution to protein stability"; and 3) KARP850101 with R = -0.555, described as "flexibility parameter for no rigid neighbors". Further analysis of these PCPs reveals that 1) high aromaticity affects the functionality of HBPs, suggesting that conserved residues with aromatic side chains strongly affect the performance of specific HBP functions; 2) a hydrophobic environment plays an important role in the interaction between heme and binding sites; and 3) the HBP has a low overall flexibility whereas the heme binding residues are relatively flexible. The high side-chain contribution to the stability of HBPs arises from the ability of the aromatic and non-polar residues to form non-covalent bonds.
HBPGO-TRN1000 and HBPGO-TST494
The datasets are established by collecting sequences from the SwissProt database (version 2013_09) and using the GO term GO:0020037 from the GO database to define HBPs. If a sequence is annotated with this GO term, then it is regarded as an HBP; otherwise, it is a putative non-HBP. The sequence identity of any pair of sequences is reduced to 25% using USEARCH. Hence, 747 HBPs and 91,414 putative non-HBPs are obtained. The 747 HBPs are divided into two sets: 500 HBPs for training and 247 HBPs for testing.
The performance of 10 randomly selected negative datasets
In HBPGO-TST494, the 247 non-HBP sequences are randomly selected from the 90,914 non-HBPs that are not in the negative part of HBPGO-TRN1000. Note that these sequences are not involved in the training process.
Two datasets have been used to predict heme binding sites: 1) one dataset comprising 72 non-redundant HBPs that were collected from the Het-PDB Navi. database (version of May 2010) and 2) another dataset comprising 75 non-redundant HBPs that was presented by Fufezan et al. In these datasets, no pair of chains has a sequence identity of more than 30%. The two datasets are combined, and two sequences that contain a non-amino acid character are removed, to generate the new dataset, HemeBind145, of size 145.
In the TargetS method , 233 HBPs that were collected from the ligand-protein binding database BioLiP  with a sequence identity of 40% were used. BioLiP is a semi-manually curated database for high-quality, biologically relevant ligand-protein binding interactions. In this work, 311 HBPs with a sequence identity of 40% were collected from the new version of BioLiP as the independent test dataset, referred to as TargetS311.
Typical classification methods
To the best of our knowledge, few effective methods or tools for predicting HBPs from sequences have been proposed. To develop an accurate predictor, some typical classification methods, such as the SVM, J48, and Bayes classifiers, each with a single type of sequence feature, have been implemented for performance comparison. The SVM is widely recognized to be an accurate classifier for predicting proteins with a specific function. Generally, the predictive performance of an SVM with effective features is regarded as the gold standard for evaluating predictors. Radial basis SVM classifiers are implemented using the LIBSVM package. The SVM parameters are determined by using a grid search to maximize the ten-fold cross-validation (10-CV) accuracy on the training dataset. Some commonly used features such as amino acid composition (AAC), dipeptide composition (DPC), normalized PSSM (PSSM400), and the 531 PCPs in AAindex are evaluated in the design of predictors.
The J48 and Bayes classifiers are implemented using WEKA . J48 refers to the decision tree classifier that is generated by the C4.5 algorithm that was developed by Quinlan . The Naïve Bayes classifier is a statistical classifier that can predict the probability of class membership under the assumption of the mutual independence of features .
Scoring card method
The scoring card method (SCM), developed by Huang et al. [26–28], is a recently published classification method for predicting proteins with a particular function and providing insight into the characteristics of proteins from primary sequences. Unlike the complex classification mechanism of the SVM, which is not easily understood by biologists, the use of SCM to estimate the propensities of amino acids and dipeptides for the function of interest is a simple and easily interpretable method of prediction and analysis. In terms of prediction accuracy, the SCM is slightly worse than, or comparable to, the SVM when used with dipeptide features [26–28]. The advantages of SCM are threefold. First, the classification mechanism of the SCM adopts a weighted sum of the composition and propensity scores of dipeptides to score the queried protein. In contrast to the hyperplane of the SVM, the SCM uses a threshold value to classify proteins, and this value is easily understood and manipulated by biologists. Second, the propensity scores of dipeptides and amino acids can be utilized to identify the PCPs that will provide information about a global property of general proteins in a further analysis of the proteins' characteristics. Third, the SCM is a general-purpose method for identifying protein sequences with a particular function. The proposed SCMHBP method is based on the SCM applied to the training dataset HBPGO-TRN1000. For completeness, the SCM as used here and the SCMHBP algorithm are described below.
The main procedure for designing SCM-based predictors can be used without modification and consists of the following steps: 1) preparing both positive and negative sequences in a training dataset as inputs (500 HBPs and 500 non-HBPs in HBPGO-TRN1000 in this study); 2) generating an initial scoring card with 400 propensity scores of dipeptides using a simple statistical method; 3) obtaining propensity scores of 20 amino acids from those of 400 dipeptides; 4) optimizing the initial scoring card using a global optimization method; and 5) generating a binary SCM classifier with a threshold value as an output of the procedure. More details of the SCM method and its applications can be found elsewhere [26–28]. The algorithm of SCMHBP is as follows.
Step 1: Prepare a training dataset HBPGO-TRN1000 that comprises 500 HBPs and 500 non-HBPs.
Step 2: Generate an initial scoring card (SCMInit) that consists of 400 propensity scores of dipeptides, obtained by subtracting the dipeptide contents in non-HBPs from those in HBPs. Then, the scores of all dipeptides are normalized into the range [0, 1000].
Step 3: Calculate the propensity score of each amino acid X by averaging the 40 propensity scores of dipeptides that contain X.
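Steps 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the dipeptide content of a class is assumed to be the mean of per-sequence dipeptide fractions, the normalization is assumed to be min-max scaling, and the "40 dipeptides that contain X" are assumed to be the 20 dipeptides starting with X plus the 20 ending with X (so XX is counted twice). The optimization of the scoring card is not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 dipeptides


def dipeptide_composition(seq):
    """Fraction of each of the 400 dipeptides in one sequence."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs that contain non-standard residues
            counts[pair] += 1
            total += 1
    return {dp: (c / total if total else 0.0) for dp, c in counts.items()}


def mean_composition(seqs):
    """Assumption: class-level content = mean of per-sequence fractions."""
    comps = [dipeptide_composition(s) for s in seqs]
    return {dp: sum(c[dp] for c in comps) / len(comps) for dp in DIPEPTIDES}


def initial_scoring_card(positives, negatives):
    """Step 2: positive-minus-negative content, scaled into [0, 1000]."""
    pos = mean_composition(positives)
    neg = mean_composition(negatives)
    diff = {dp: pos[dp] - neg[dp] for dp in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    span = hi - lo
    # Assumption: min-max normalization; a flat card maps to the midpoint.
    return {dp: (1000.0 * (d - lo) / span if span else 500.0)
            for dp, d in diff.items()}


def amino_acid_scores(card):
    """Step 3: average the 40 dipeptide scores that involve each residue."""
    scores = {}
    for aa in AMINO_ACIDS:
        vals = [card[aa + x] for x in AMINO_ACIDS]   # 20 dipeptides "aX"
        vals += [card[x + aa] for x in AMINO_ACIDS]  # 20 dipeptides "Xa"
        scores[aa] = sum(vals) / len(vals)
    return scores
```

With the real training data, `initial_scoring_card` would be called on the 500 HBP and 500 non-HBP sequences of HBPGO-TRN1000, and `amino_acid_scores` on the resulting card.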
Identifying informative physicochemical properties
The physicochemical properties of amino acids are widely recognized as effective features for predicting and analyzing the functions of proteins from primary sequences [26–28]. The AAindex database consists of 544 indices that were extracted from the published literature and represent various physicochemical and biological properties of amino acids. Each physicochemical property of amino acids is specified by a set of 20 numerical values. Properties with the value 'NA' in their set of amino acid values are removed, leaving 531 properties for the following analysis. The propensity scores of amino acids, estimated by SCMHBP, facilitate the discovery of informative PCPs, providing insight into the characteristics of HBPs. The method of identification of informative PCPs, based on the SCM (SCM-PCPs), is composed of two main steps.
The first step is to compute the R values between the propensity scores of the amino acids and all 531 PCPs in AAindex. A large absolute value of R suggests that the physicochemical or biological property is highly correlated with the propensity to be an HBP. Properties with |R| > 0.5 are preferred as candidates for subsequent analysis.
The second step is to identify the informative PCPs from all candidates, based on domain knowledge of the function of the investigated proteins (HBPs in this work). Generally, PCPs in AAindex that were obtained under a particular condition or for a specific family of proteins are not suited to the analysis when that condition is not satisfied.
The compositions of the amino acids in HBPs and non-HBPs effectively represent a property of HBPs. Accordingly, the composition property can be used in SCM-PCPs in the same way as the propensity scores of amino acids. The identified PCPs offer a clue to the characteristics of HBPs.
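The first step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the R value is assumed to be the Pearson correlation coefficient, the function names are hypothetical, and `properties` stands in for a table of AAindex entries that the reader would have to load separately.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant vector carries no correlation information
    return sxy / math.sqrt(sxx * syy)


def candidate_properties(aa_scores, properties, cutoff=0.5):
    """Return (name, R) pairs with |R| > cutoff, strongest first.

    aa_scores:  dict residue -> propensity score
    properties: dict property name -> dict residue -> value (hypothetical input)
    """
    order = sorted(aa_scores)
    xs = [aa_scores[a] for a in order]
    hits = []
    for name, values in properties.items():
        r = pearson_r(xs, [values[a] for a in order])
        if abs(r) > cutoff:
            hits.append((name, r))
    return sorted(hits, key=lambda t: abs(t[1]), reverse=True)
```

The second step, selecting informative PCPs from the candidates by domain knowledge, is a manual judgment and is not automated here.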
Propensity scores of amino acids and dipeptides
Table 3. The propensity scores of amino acids to be a heme binding protein (HBP) and amino acid composition (%). Columns: heme protein score (rank); composition of HBP; composition of non-HBP.
The five amino acids with the highest propensity scores are Phe, His, Gly, Trp, and Pro, whereas the residues with the lowest scores are Ser, Asn, Glu, Cys, and Thr. Smith et al. found that His predominates among the other amino acids as an axial ligand to heme iron. The relative frequency of His in the HBPs decreases to the background level when the axial ligands are removed. Smith et al. noted the important role of Trp, Phe, and Tyr in protein-heme interactions. Liu and Hu measured the relative importance of the amino acids in heme binding interfaces and suggested that His, Phe, and Trp are overrepresented in binding pockets. Notably, the aromatic and non-polar residue Phe has the highest score and can form non-covalent bonds.
Performance of HBP predictors
Table 4. Comparison of prediction accuracies (%) between SCMHBP and other methods. SCMHBP row: training 85.90 ± 0.42; HBPGO-TST494 72.79 ± 1.21; HemeBind145 67.24 ± 2.89; TargetS311 74.69 ± 2.48; mean test 71.57 ± 7.83.
The training accuracy of SCMInit and the mean training accuracy of SCMHBP are 71.30% and 85.90%, respectively. Optimizing the scoring card with an intelligent genetic algorithm (IGA) thus increases the training accuracy by 14.60 percentage points. The SCMHBP method uses the 10-CV assessment to prevent overtraining of the model. This optimization method seeks to maximize the area under the ROC curve (AUC) to generate balanced sensitivity and specificity. The mean test accuracies of SCMHBP that are obtained using the DPC features on the three test datasets HBPGO-TST494, HemeBind145, and TargetS311 are 72.79%, 67.24%, and 74.69%, respectively. The mean test performance of SCMHBP (71.57%) is close to that of SVM-DPC (72.20%) using the same DPC features. The SCMHBP method is also comparable with the SVM-PSSM400 (75.98%) and SVM-AAindex (76.21%) methods. The scoring card with the best training performance, 86.60%, has a test accuracy of 74.22% and is adopted in further PCP analysis.
The methods that are presented in Table 4 are compared with SCMHBP. SCMHBP adopts only a simple weighted-sum classifier and the dipeptide features, and the other methods likewise use commonly used classifiers with only one type of easily interpretable feature. The combination of complementary features and an ensemble mechanism is widely recognized to improve prediction performance. For example, the ensemble SCMCRYS method for predicting protein crystallization uses propensity scores of p-collocated amino acid pairs (p=0 for a dipeptide) to improve on the performance of a single SCM classifier that uses only dipeptide features. Similarly, an ensemble SVM classifier with multiple feature types is expected to exhibit improved performance.
Consider the methods in Table 4. The SVM and J48-based methods have high training accuracies that exceed 90%. However, the mean test accuracies of the SVM and J48-based methods are less than 80% and 70%, respectively. The J48-based decision tree methods suffer from an obvious overtraining problem. The Bayes-based methods perform poorly. In summary, the SVM-based methods, and especially the SVM-PSSM400 method with a mean test accuracy of 75.98% using the normalized PSSM features, outperform the J48 and Bayes-based methods.
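The figures quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
# Test accuracies (%) of SCMHBP on the three independent test datasets.
test_acc = {"HBPGO-TST494": 72.79, "HemeBind145": 67.24, "TargetS311": 74.69}

# Mean test accuracy over the three datasets.
mean_test = sum(test_acc.values()) / len(test_acc)

# Gain in training accuracy from the initial card (71.30%) to the optimized one (85.90%).
gain = 85.90 - 71.30

print(round(mean_test, 2))  # 71.57
print(round(gain, 2))       # 14.6
```

The gain is a difference of 14.60 percentage points between the two training accuracies, not a resulting accuracy of 14.60%.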
Identification of putative HBPs
The top-20 putative HBPs according to the HBP sequences.
This protein inhibits the growth of a variety of fungal species
Dolichyl-diphosphooligosaccharide--protein glycosyltransferase subunit 4C
May be involved in N-glycosylation through its association with N-oligosaccharyl transferase
Proline, histidine and glycine-rich protein 1
Eggshell protein 2A
Photosystem II reaction center protein Ycf12
A core subunit of photosystem II (PSII)
Acts as a neurotoxin
Core component of nucleosome
Uncharacterized protein DDB_G0295473
May function as a multidomain RNA-binding protein
Uncharacterized protein YML007C-A, mitochondrial
Sperm protamine P1
Protamines substitute for histones in the chromatin of sperm during the haploid phase of spermatogenesis.
S antigens are soluble heat-stable proteins present in the sera of some infected individuals.
Uncharacterized 8.4 kDa protein
Glycine-rich cell wall structural protein
Responsible for plasticity of the cell wall
Has antibacterial activity against the Gram-negative bacterium E. coli and the Gram-positive bacteria L. monocytogenes and S. aureus
Putative uncharacterized protein YKL156C-A
OriE replication initiation protein
Putative uncharacterized protein YEL032C-A
May be involved in growth and survival of prostate cancer cells through the TAF-Ibeta pathway
The mean, maximum and minimum scores of sequences in the training dataset HBPGO-TRN1000 are 554.70, 613.90, and 500.70, respectively, when the threshold value is 539.32. All 20 sequences in Table 5 have scores that exceed 648 and so have high potential to be HBPs. Dermotoxin-J2 has the highest score of 715.58, and Protein PCOTH, ranked 20th, has a score of 648.51. Interestingly, eight of the putative HBPs in Table 5 have unclear protein functions. The high scores of the putative HBPs suggest that they merit further experimental confirmation.
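The sequence scores discussed above follow from the weighted-sum rule of the SCM: each dipeptide's propensity score is weighted by its composition in the sequence, and the total is compared with the threshold. A minimal Python sketch follows. The function names and the toy card values are hypothetical, and the decision rule "score greater than threshold" is an assumption; only the threshold 539.32 is taken from the text.

```python
def sequence_score(seq, card):
    """Composition-weighted sum of dipeptide propensity scores.

    Averaging the card score over all dipeptide positions equals
    sum_i(composition_i * score_i) over the dipeptides in the card.
    """
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    scored = [card[p] for p in pairs if p in card]  # skip unknown pairs
    if not scored:
        raise ValueError("sequence contains no dipeptide from the scoring card")
    return sum(scored) / len(scored)


def is_putative_hbp(seq, card, threshold=539.32):
    """Assumed decision rule: predict HBP when the score exceeds the threshold."""
    return sequence_score(seq, card) > threshold


# Toy scoring card with made-up values, for illustration only.
toy_card = {"HF": 900.0, "FH": 800.0, "SS": 100.0}
print(sequence_score("HFH", toy_card))   # 850.0
print(is_putative_hbp("SSS", toy_card))  # False
```

Ranking putative non-HBPs by `sequence_score` under the optimized card would produce a list of the kind reported in Table 5.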
Propensity analysis using informative PCPs
Table 6. The three physicochemical properties selected by SCM-PCPs. Columns: heme protein score (rank); SNEP660103 score (rank); TAKK010101 score (rank); KARP850101 score (rank).
A. Aromaticity of the HBP side chains and its contribution
The SNEP660103 property, described as ''Relations between chemical structure and biological activity in peptides'' , is ranked 1 by R value (=0.604). SNEP660103 is a composite feature which interprets a set of the PCPs of the 20 natural amino acids as four vectors, extracted via principal component analysis from the ϕ coefficients. The propensity scores of HBPs have the highest R value with the vector III, which represents aromatic properties . The large R value reveals that the aromatic residues have a higher propensity than the non-aromatic residues to be HBPs.
The aromatic residues, whose side chains contain aromatic rings and π electrons, perform unique functions in proteins such as hydrophobic interaction, aromatic-aromatic (π-π) interaction [20, 48], and the carrying of mobile electrons in reactions that involve electron transfer [49–51]. Smith et al. assessed the chemical composition of heme binding sites and suggested that the aromatic residues that support π-π interactions tend to adopt an off-set, parallel, staggered, or edge-to-face orientation relative to the heme group. This suggestion reveals the importance of the aromatic residues in maintaining the orientation of heme in HBPs. In investigations of HBP activity, the transition metal redox centers in the structures of HBPs are involved in biological electron transfer as mobile electron carriers that shuttle electrons between reductase and oxidase complexes [49–51]. Heme contains an Fe site that can switch between the reduced (FeII) and oxidized (FeIII) states. The molecules that include aromatic π-systems have relatively low redox potentials and a greater probability of undergoing electron transfer reactions. These aromatic residues are thought to increase the redox potentials of heme because the conserved aromatic side chains are close to it. A sequence study showed that the His residue that is located proximally and distally to the heme is highly conserved.
The four aromatic residues with the highest SNEP660103 scores are Trp, Phe, Tyr and His. From Table 6, the aromatic residues Phe, His, Trp, and Tyr, with scores of 705.3, 615.6, 603.1, and 541.0, are ranked 1, 2, 4, and 12, respectively. As presented in Figure 3(c), the two His residues, His39 and His63, are located close to the heme group. These conserved histidines are thought to be involved in binding the aromatic donor, forming the catalytic intermediate, and transferring electrons. The composition of the four aromatic residues (Phe, His, Trp, and Tyr) in HBPs is 12.18%, which is larger than that in non-HBPs (10.58%) (Table 3). This supports the result of SCM-PCPs and is consistent with previous HBP studies.
B. The hydrophobic character of side chains in stabilizing HBPs
The property TAKK010101 with R = 0.576 is described as "Side-chain contribution to protein stability". Hydrophobic interaction influences the stability and various functions of proteins. To estimate the hydrophobic effects, several hydrophobic metrics and various methods of measuring them have been developed. However, a more precise scale that is convenient to use can be obtained by using the free energies that are measured from the difference between the native and denatured states to quantify the hydrophobic contribution to proteins. This metric quantifies the energetic contribution of the hydrophobic character of each amino acid side chain to the stability of the protein, adjusted according to the conformational entropy of the side chain. The high correlation reveals that the hydrophobic effects have a more important role in the stability of HBPs than of non-HBPs.
Numerous HBPs have a globular structure and hydrophobic bonds have a major role in organizing their self-assembly . HBPs that bind to a heme group become permanently fully folded and stable only upon association with the hydrophobic heme group and the formation of additional hydrophobic bonds . The structures of the holo-HBPs appear to be stabilized by several hydrophobic contacts . These findings are in agreement with the obtained high propensity scores for hydrophobic amino acids Phe, Trp, His and Gly. Notably, a previous investigation found comparably high changes in stability (ΔΔG) for Phe (23.0 kJ/mol), Trp (24.2 kJ/mol) and His (11.9 kJ/mol) .
To estimate the contribution of the hydrophobic character of amino acids, the hydrophobic contribution of the side-chain is calculated using the training dataset HBPGO-TRN1000 according to the property TAKK010101. The mean hydrophobic contribution of the side-chain of HBPs is 10.20 kJ/mol, which significantly exceeds that (9.98 kJ/mol) of non-HBPs, according to the Mann-Whitney test with p<0.05. This result reveals that the HBPs prefer hydrophobic residues more than non-HBPs.
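The comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the per-sequence contribution is assumed to be the mean property value over the residues of the sequence, the function names are hypothetical, and the Mann-Whitney p-value uses a plain normal approximation without tie or continuity correction. The property table holds only the three ΔΔG values quoted in the text as placeholders; the full TAKK010101 table is not reproduced.

```python
import math

# Placeholder table: the three values (kJ/mol) quoted in the text, nothing more.
TOY_PROPERTY = {"F": 23.0, "W": 24.2, "H": 11.9}


def mean_property(seq, table):
    """Mean property value over the residues of seq that appear in the table."""
    values = [table[aa] for aa in seq if aa in table]
    if not values:
        raise ValueError("no residue of the sequence is in the property table")
    return sum(values) / len(values)


def mann_whitney_u(a, b):
    """U statistic for sample a versus b and a two-sided approximate p-value."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5  # ties contribute half a count
    n, m = len(a), len(b)
    mu = n * m / 2.0
    sigma = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal approximation
    return u, p
```

In practice, `mean_property` would be evaluated for every HBP and non-HBP sequence of HBPGO-TRN1000 with the complete property table, and the two resulting samples would be passed to `mann_whitney_u` or to an established statistics library.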
Along with the hydrophobic contacts, multiple other interactions mediate the binding of a heme group. These interactions include the formation of coordination bond(s) to an iron ion, van der Waals' contacts between the planar porphyrin ring and protein side chains, and electrostatic interactions that involve the heme propionate substituents and positively charged protein residues . Figures 3(b) and 3(d) present the surface hydrophobicity of 2PBJ and 1CYO. The binding sites of the HBPs (around the heme group) are very hydrophobic. In the presented scoring card, the hydrophobic residues have high scores; these include Phe, Trp, Pro, and Met with scores of 705.3, 603.1, 601.3 and 599.6, respectively. These results demonstrate that the hydrophobic residues have important roles in HBPs. This finding reveals the importance of the hydrophobic interaction between heme and binding sites.
C. Inflexibility of HBPs with flexible heme binding residues
The property KARP850101 with R = -0.555, indicating a large inverse correlation, ranked 1 using SCM-PCPs is described as ''Flexibility parameter for no rigid neighbors'' . Since the HBPs have various functions and will work across various environments with, for example, various pH values [57, 58], the stability of the HBPs is more important. Numerous investigations of HBPs have focused on HBP binding sites and indicate that the binding sites of HBPs are flexible [59, 60]. However, those investigations [59, 60] did not consider whole protein structures or compare them with those of other non-HBP proteins. Hydrophobic interactions influence the rigidity of protein structures . Accordingly, these hydrophobic residues are postulated not only to compose the hydrophobic binding sites for hemes, but also to maintain the stability of HBPs.
The inverse correlation suggests that the more rigid amino acids have higher scores. The rigidity of a protein affects its structural stability. The fact that proteins lose their rigidity during unfolding demonstrates that proteins with less rigidity tend to reach the unfolded state more rapidly. Our results suggest that HBPs are less flexible than non-HBPs, which helps to stabilize them.
Table 7. The amino acid scores and the average B-factors of the Cα and side chains. B-factor entries in source order: 29.41 ± 22.90; 28.15 ± 18.41; 28.38 ± 22.10; 30.67 ± 19.56; 32.00 ± 24.25; 30.60 ± 20.17; 29.48 ± 22.98; 25.18 ± 16.77; 30.60 ± 23.23; 32.66 ± 20.87; 29.96 ± 22.34; 30.20 ± 19.06; 32.00 ± 23.58; 34.04 ± 20.96; 31.11 ± 23.27; 29.04 ± 18.19; 32.47 ± 22.93; 35.76 ± 21.87; 31.83 ± 24.25; 33.24 ± 20.55; 29.14 ± 21.42; 30.85 ± 20.20; 29.73 ± 22.51; 26.99 ± 17.72; 30.19 ± 22.40; 28.02 ± 17.71; 30.53 ± 22.29; 29.01 ± 17.89; 32.20 ± 24.31; 33.27 ± 20.80; 31.22 ± 23.47; 30.08 ± 19.36; 29.10 ± 23.30; 26.61 ± 16.31; 32.62 ± 23.35; 35.51 ± 20.95; 31.88 ± 23.42; 32.66 ± 20.79; 32.34 ± 23.93; 32.13 ± 20.33.
From Table 7 the R values between the mean B-factors of residues and the amino acid scores for HBPs and non-HBPs are -0.45 and -0.26 with p-values of 0.02 and 0.18, respectively. The results reveal that HBPs preferentially yield residues with lower B-factors. Although study of the rigidity of HBPs has been published, the results herein suggest that HBPs are less flexible than non-HBPs. Even though HBPs exhibit an allosteric mechanism and these are thought to have flexible binding sites, a statistical study indicates that the allosteric mechanism that is seen in various HBPs, such as the hemoglobin and myoglobin, undergoes a conformational change owing to the side-chain rotamer without backbone changes .
Possible causes of the lower flexibility of the HBPs are as follows. 1) HBPs typically function in various environments; for example, they may transport oxygen to organs or transfer energy in mitochondria. According to a previous circulation study, for example, PCO2 affects the pH of the blood . Reducing PCO2 increases arterial pH, changing the oxygen binding ability of hemoglobin [57, 58]. The hemoglobin must tolerate various pHs of various environments. 2) The HBP is involved in some important signals such as apoptosis , which triggers a change in conformation. Domain swapping owing to the protein flexibility of the protein causes domain escape from the original position of the proteins on cytochrome c 552 . These domain swapping proteins can further bind to each other to form dimers or oligomers. In cytochrome c 552 , if domain swapping occurs, then the protein oligomers will trigger apoptosis.
This work proposed SCMHBP, a method based on the scoring card method (SCM), for predicting and analyzing HBPs from sequences. SCMHBP performs well in terms of mean test accuracy when applied to three independent datasets. Additionally, the propensity scores of dipeptides and amino acids support the identification of informative physicochemical properties that provide insight into heme binding proteins. The CP motif has a high propensity score, suggesting that this motif is important for HBPs. SCM-PCPs were used to identify the physicochemical properties (PCPs) of HBPs, and three PCPs, SNEP660103, TAKK010101 and KARP850101, were identified. In summary, the aromaticity and hydrophobicity of the side chains of HBP residues appear to be important factors in determining the functional performance and stability of HBPs. Additionally, HBPs appear to have a more rigid structure than non-HBPs. The datasets used and the source code for SCMHBP are available at http://iclab.life.nctu.edu.tw/SCMHBP/.
This work was funded by the National Science Council of Taiwan under contract number NSC-103-2221-E-009-117-, and by the "Center for Bioinformatics Research of Aiming for the Top University Program" of the National Chiao Tung University and Ministry of Education, Taiwan, R.O.C., for the project 103W962. This work was also supported in part by the UST-UCSD International Center of Excellence in Advanced Bioengineering, sponsored by the Taiwan National Science Council I-RiCE Program under grant number NSC-102-2911-I-009-101. Publication charges for this work were funded by the project 103W962.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 16, 2014: Thirteenth International Conference on Bioinformatics (InCoB2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S16.
- Bowman SEJ, Bren KL: The chemistry and biochemistry of heme c: functional bases for covalent attachment. Nat Prod Rep. 2008, 25 (6): 1118-1130. 10.1039/b717196j.
- Fufezan C, Zhang J, Gunner MR: Ligand preference and orientation in b- and c-type heme-binding proteins. Proteins. 2008, 73 (3): 690-704. 10.1002/prot.22097.
- Gray HB, Winkler JR: Electron transfer in proteins. Annu Rev Biochem. 1996, 65: 537-561. 10.1146/annurev.bi.65.070196.002541.
- Terwilliger NB: Functional adaptations of oxygen-transport proteins. J Exp Biol. 1998, 201 (8): 1085-1098.
- Guengerich FP, Macdonald TL: Chemical Mechanisms of Catalysis by Cytochromes-P-450 - a Unified View. Accounts Chem Res. 1984, 17 (1): 9-16. 10.1021/ar00097a002.
- Zenke-Kawasaki Y, Dohi Y, Katoh Y, Ikura T, Ikura M, Asahara T, Tokunaga F, Iwai K, Igarashi K: Heme induces ubiquitination and degradation of the transcription factor Bach1. Molecular and Cellular Biology. 2007, 27 (19): 6962-6971. 10.1128/MCB.02415-06.
- Kaasik K, Lee CC: Reciprocal regulation of haem biosynthesis and the circadian clock in mammals. Nature. 2004, 430 (6998): 467-471. 10.1038/nature02724.
- Faller M, Matsunaga M, Yin S, Loo JA, Guo F: Heme is involved in microRNA processing. Nat Struct Mol Biol. 2007, 14 (1): 23-29. 10.1038/nsmb1182.
- Huang H, Hu NF, Zeng YH, Zhou G: Electrochemistry and electrocatalysis with heme proteins in chitosan biopolymer films. Anal Biochem. 2002, 308 (1): 141-151. 10.1016/S0003-2697(02)00242-7.
- Zhou YL, Hu NF, Zeng YH, Rusling JF: Heme protein-clay films: Direct electrochemistry and electrochemical catalysis. Langmuir. 2002, 18 (1): 211-219. 10.1021/la010834a.
- Nagaoka H: Application of a Heme-Binding Protein Eluted from Encapsulated Biomaterials to the Catalysis of Enantioselective Oxidation. ACS Catal. 2014, 4 (2): 553-565. 10.1021/cs400768x.
- Dassama LMK, Yosca TH, Conner DA, Lee MH, Blanc B, Streit BR, Green MT, DuBois JL, Krebs C, Bollinger JM: O-2-Evolving Chlorite Dismutase as a Tool for Studying O-2-Utilizing Enzymes. Biochemistry-Us. 2012, 51 (8): 1607-1616. 10.1021/bi201906x.
- Frankenberg N, Moser J, Jahn D: Bacterial heme biosynthesis and its biotechnological application. Appl Microbiol Biot. 2003, 63 (2): 115-127. 10.1007/s00253-003-1432-2.
- Ebihara A, Okamoto A, Kousumi Y, Yamamoto H, Masui R, Ueyama N, Yokoyama S, Kuramitsu S: Structure-based functional identification of a novel heme-binding protein from Thermus thermophilus HB8. J Struct Funct Genomics. 2005, 6 (1): 21-32. 10.1007/s10969-005-1103-x.
- Merkley ED, Anderson BJ, Park J, Belchik SM, Shi L, Monroe ME, Smith RD, Lipton MS: Detection and Identification of Heme c-Modified Peptides by Histidine Affinity Chromatography, High-Performance Liquid Chromatography-Mass Spectrometry, and Database Searching. Journal of Proteome Research. 2012, 11 (12): 6147-6158.
- Babusiak M, Man P, Sutak R, Petrak J, Vyoral D: Identification of heme binding protein complexes in murine erythroleukemic cells: Study by a novel two-dimensional native separation - liquid chromatography and electrophoresis. Proteomics. 2005, 5 (2): 340-350. 10.1002/pmic.200400935.
- Bartsch RG, Kamen MD: Isolation and properties of two soluble heme proteins in extracts of the photoanaerobe Chromatium. J Biol Chem. 1960, 235: 825-831.
- Lan EH, Dave BC, Fukuto JM, Dunn B, Zink JI, Valentine JS: Synthesis of sol-gel encapsulated heme proteins with chemical sensing properties. J Mater Chem. 1999, 9 (1): 45-53. 10.1039/a805541f.
- Schneider S, Marles-Wright J, Sharp KH, Paoli M: Diversity and conservation of interactions for binding heme in b-type heme proteins. Natural Product Reports. 2007, 24 (3): 621-630. 10.1039/b604186h.
- Smith LJ, Kahraman A, Thornton JM: Heme proteins--diversity in structural characteristics, function, and folding. Proteins-Structure Function and Bioinformatics. 2010, 78 (10): 2349-2368. 10.1002/prot.22747.
- Li T, Bonkovsky HL, Guo JT: Structural analysis of heme proteins: implications for design and prediction. BMC Structural Biology. 2011, 11:
- Liu R, Hu JJ: HemeBIND: a novel method for heme binding residue prediction by combining structural and sequence information. BMC Bioinformatics. 2011, 12:
- Liu R, Hu JJ: Computational Prediction of Heme-Binding Residues by Exploiting Residue Interaction Network. PLoS ONE. 2011, 6 (10):
- Xiong Y, Liu J, Zhang W, Zeng T: Prediction of heme binding residues from protein sequences with integrative sequence profiles. Proteome Science. 2012, 10:
- Yu DJ, Hu J, Yang J, Shen HB, Tang JH, Yang JY: Designing Template-Free Predictor for Targeting Protein-Ligand Binding Sites with Classifier Ensemble and Spatial Clustering. IEEE-ACM Transactions on Computational Biology and Bioinformatics. 2013, 10 (4): 994-1008.
- Charoenkwan P, Shoombuatong W, Lee HC, Chaijaruwanich J, Huang HL, Ho SY: SCMCRYS: Predicting Protein Crystallization Using an Ensemble Scoring Card Method with Estimating Propensity Scores of P-Collocated Amino Acid Pairs. PLoS ONE. 2013, 8 (9):
- Huang HL, Charoenkwan P, Kao TF, Lee HC, Chang FL, Huang WL, Ho SJ, Shu LS, Chen WL, Ho SY: Prediction and analysis of protein solubility using a novel scoring card method with dipeptide composition. BMC Bioinformatics. 2012, 13:
- Huang H-L: Propensity Scores for Prediction and Characterization of Bioluminescent Proteins from Sequences. PLoS ONE. 2014, 9 (5): e97158. 10.1371/journal.pone.0097158.
- Ho SY, Shu LS, Chen JH: Intelligent evolutionary algorithms for large parameter optimization problems. IEEE T Evolut Comput. 2004, 8 (6): 522-541. 10.1109/TEVC.2004.835176.
- Kawashima S, Kanehisa M: AAindex: amino acid index database. Nucleic Acids Res. 2000, 28 (1): 374. 10.1093/nar/28.1.374.
- Yang J, Roy A, Zhang Y: BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2013, 41 (Database issue): D1096-1103.
- Frank E, Hall M, Trigg L, Holmes G, Witten IH: Data mining in bioinformatics using Weka. Bioinformatics. 2004, 20 (15): 2479-2481. 10.1093/bioinformatics/bth261.
- Sneath PH: Relations between chemical structure and biological activity in peptides. J Theor Biol. 1966, 12 (2): 157-195. 10.1016/0022-5193(66)90112-3.
- Takano K, Yutani K: A new scale for side-chain contribution to protein stability based on the empirical stability analysis of mutant proteins. Protein Eng. 2001, 14 (8): 525-528. 10.1093/protein/14.8.525.
- Karplus PA, Schulz GE: Prediction of Chain Flexibility in Proteins - a Tool for the Selection of Peptide Antigens. Naturwissenschaften. 1985, 72 (4): 212-213. 10.1007/BF01195768.
- Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2013, 41 (Database issue): D43-47.
- Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, Lewis S, Marshall B, Mungall C: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 2004, 32: D258-D261. 10.1093/nar/gkh036.
- Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010, 26 (19): 2460-2461. 10.1093/bioinformatics/btq461.
- Yamaguchi A, Iida K, Matsui N, Tomoda S, Yura K, Go M: Het-PDB Navi.: a database for protein-small molecule interactions. J Biochem. 2004, 135 (1): 79-84. 10.1093/jb/mvh009.
- Chang C, Lin C: LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology. 2011, 2: 21-27.
- Huang HL, Lin IC, Liou YF, Tsai CT, Hsu KT, Huang WL, Ho SJ, Ho SY: Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties. BMC Bioinformatics. 2011, 12 (Suppl 1): S47. 10.1186/1471-2105-12-S1-S47.
- Salzberg SL: C4.5: Programs for Machine Learning by J. Ross Quinlan. Morgan Kaufmann Publishers, Inc., 1993. Machine Learning. 1994, 16 (3): 235-240.
- Han J, Kamber M: Data Mining: Concepts and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems). 2006, Elsevier, second
- Bradley AP: The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recogn. 1997, 30 (7): 1145-1159. 10.1016/S0031-3203(96)00142-2.
- Ogawa K, Sun J, Taketani S, Nakajima O, Nishitani C, Sassa S, Hayashi N, Yamamoto M, Shibahara S, Fujita H: Heme mediates derepression of Maf recognition element through direct binding to transcription repressor Bach1. EMBO J. 2001, 20 (11): 2835-2843. 10.1093/emboj/20.11.2835.
- Zenke-Kawasaki Y, Dohi Y, Katoh Y, Ikura T, Ikura M, Asahara T, Tokunaga F, Iwai K, Igarashi K: Heme induces ubiquitination and degradation of the transcription factor Bach1. Mol Cell Biol. 2007, 27 (19): 6962-6971. 10.1128/MCB.02415-06.
- Okusawa T, Fujita M, Nakamura J, Into T, Yasuda M, Yoshimura A, Hara Y, Hasebe A, Golenbock DT, Morita M: Relationship between structures and biological activities of mycoplasmal diacylated lipopeptides and their recognition by toll-like receptors 2 and 6. Infect Immun. 2004, 72 (3): 1657-1665. 10.1128/IAI.72.3.1657-1665.2004.
- Shimizu T: Diverse role of conserved aromatic amino acids in the electron transfer of cytochrome P450 catalytic functions: site-directed mutagenesis studies. Recent Research Developments in Pure and Applied Chemistry. 1997, 1: 196-175.
- Volkov AN, van Nuland NA: Electron transfer interactome of cytochrome C. PLoS Comput Biol. 2012, 8 (12): e1002807. 10.1371/journal.pcbi.1002807.
- Takayama Y, Harada E, Kobayashi R, Ozawa K, Akutsu H: Roles of noncoordinated aromatic residues in redox regulation of cytochrome c(3) from Desulfovibrio vulgaris Miyazaki F. Biochemistry-Us. 2004, 43 (34): 10859-10866. 10.1021/bi049551i.
- Davis AC, Cornelison MJ, Meyers KT, Rajapakshe A, Berry RE, Tollin G, Enemark JH: Effects of mutating aromatic surface residues of the heme domain of human sulfite oxidase on its heme midpoint potential, intramolecular electron transfer, and steady-state kinetics. Dalton T. 2013, 42 (9): 3043-3049. 10.1039/c2dt31508d.
- Gribaldo S, Casane D, Lopez P, Philippe H: Functional divergence prediction from evolutionary analysis: a case study of vertebrate hemoglobin. Mol Biol Evol. 2003, 20 (11): 1754-1759. 10.1093/molbev/msg171.
- Doig AJ, Sternberg MJ: Side-chain conformational entropy in protein folding. Protein Sci. 1995, 4 (11): 2247-2251. 10.1002/pro.5560041101.
- Kauzmann W: Some factors in the interpretation of protein denaturation. Adv Protein Chem. 1959, 14: 1-63.
- Hargrove MS, Barrick D, Olson JS: The association rate constant for heme binding to globin is independent of protein structure. Biochemistry-Us. 1996, 35 (35): 11293-11299. 10.1021/bi960371l.
- Mukhopadhyay K, Lecomte JT: A relationship between heme binding and protein stability in cytochrome b5. Biochemistry-Us. 2004, 43 (38): 12227-12236. 10.1021/bi0488956.
- Di Russo NV, Estrin DA, Marti MA, Roitberg AE: pH-Dependent conformational changes in proteins and their effect on experimental pK(a)s: the case of Nitrophorin 4. PLoS Comput Biol. 2012, 8 (11): e1002761. 10.1371/journal.pcbi.1002761.
- Di Luccio E, Ishida Y, Leal WS, Wilson DK: Crystallographic Observation of pH-Induced Conformational Changes in the Amyelois transitella Pheromone-Binding Protein AtraPBP1. PLoS ONE. 2013, 8 (2):
- Ohkura K, Kawaguchi Y, Watanabe Y, Masubuchi Y, Shinohara Y, Hori H: Flexible Structure of Cytochrome P450: Promiscuity of Ligand Binding in the CYP3A4 Heme Pocket. Anticancer Res. 2009, 29 (3): 935-942.
- Villareal VA, Pilpa RM, Robson SA, Fadeev EA, Clubb RT: The IsdC Protein from Staphylococcus aureus Uses a Flexible Binding Pocket to Capture Heme. J Biol Chem. 2008, 283 (46): 31591-31600. 10.1074/jbc.M801126200.
- Rader AJ, Hespenheide BM, Kuhn LA, Thorpe MF: Protein unfolding: Rigidity lost. P Natl Acad Sci USA. 2002, 99 (6): 3540-3545. 10.1073/pnas.062492699.
- Parthasarathy S, Murthy MRN: Analysis of temperature factor distribution in high-resolution protein structures. Protein Science. 1997, 6 (12): 2561-2567.
- Itoh K, Sasai M: Statistical mechanics of protein allostery: roles of backbone and side-chain structural fluctuations. J Chem Phys. 2011, 134 (12): 125102. 10.1063/1.3565025.
- Grundler W, Weil MH, Rackow EC: Arteriovenous Carbon-Dioxide and Ph Gradients during Cardiac-Arrest. Circulation. 1986, 74 (5): 1071-1074. 10.1161/01.CIR.74.5.1071.
- Hayashi Y, Nagao S, Osuka H, Komori H, Higuchi Y, Hirota S: Domain Swapping of the Heme and N-Terminal alpha-Helix in Hydrogenobacter thermophilus Cytochrome c(552) Dimer. Biochemistry-Us. 2012, 51 (43): 8608-8616. 10.1021/bi3011303.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.