A strategy to select suitable physicochemical attributes of amino acids for protein fold recognition

Sharma, Alok; Paliwal, Kuldip K; Dehzangi, Abdollah; Lyons, James; Imoto, Seiya; Miyano, Satoru

doi:10.1186/1471-2105-14-233

Research article
Open access
Published: 24 July 2013

BMC Bioinformatics volume 14, Article number: 233 (2013)

Abstract

Background

Assigning a protein to one of its folds is a transitional step in discovering its three-dimensional structure, which is a challenging task in biomolecular (biological) science. Current research focuses on: 1) the development of classifiers, and 2) the development of feature extraction techniques based on syntactic and/or physicochemical properties.

Results

Beyond these two main categories of research, we show that the selection of physicochemical attributes of the amino acids is an important step in protein fold recognition that has not been explored adequately. We present a multi-dimensional successive feature selection (MD-SFS) approach to select attributes systematically. The proposed method is applied to protein sequence data, and an improvement of around 24% in fold recognition is observed when attributes are selected appropriately.

Conclusion

The MD-SFS has been applied successfully in selecting physicochemical attributes of the amino acids. The selected attributes show improved protein fold recognition performance.

Background

Discovering the three-dimensional structure of a protein from its amino acid sequence by computational means is a challenging task and an open research problem in biological science and bioinformatics. Deciphering protein structure elucidates protein function. This has a profound impact on understanding the heterogeneity of proteins, protein-protein interactions and protein-peptide interactions, which in turn helps in drug design. A usual way to predict the structure of a protein is to first acquire proteins with known structures (e.g. by crystallography techniques) and then develop recognition techniques from their sequences. The developed techniques can then be used to classify unknown protein sequences into one of the classes or folds. The length of a protein sequence (i.e., the number of amino acids in it) usually differs from one protein to another. However, two proteins with different lengths and low sequence similarity can belong to the same fold. The identification of protein folds from protein sequences would bring us one step closer to the recognition of protein structures. A wide range of techniques have been developed over the past two decades to recognize protein folds. Despite numerous contributions and significant enhancements [1, 2], the protein fold recognition problem is yet to be completely solved.

The focus in protein fold recognition can be broadly classified into two categories: 1) the development of classifiers to improve fold recognition, and 2) the development of feature extraction techniques using the alphabetical sequence (syntactical-based) and/or the physicochemical properties of the amino acids (attribute-based or physicochemical-based). For the former case, several classifiers have been developed or used, including linear discriminant analysis [3], Bayesian classifiers [4], Bayesian decision rule [5], K-Nearest Neighbor [6, 7], Hidden Markov Model [8, 9], Artificial Neural Network [10, 11] and ensemble classifiers [1, 12-14]. For the latter case, several feature extraction techniques have been developed, including composition, transition and distribution [15], occurrence [16], pairwise frequencies [17], pseudo-amino acid composition [18], bigrams [19], autocorrelation [6, 20, 21] and deriving features by considering more physicochemical properties [22].

Dubchak et al. [15] proposed syntactical and physicochemical-based features for protein fold recognition. They derived the physicochemical-based features from the following five attributes of amino acids: hydrophobicity (H), predicted secondary structure based on normalized frequency of α-helix (X), polarity (P), polarizability (Z) and van der Waals volume (V). The features proposed by Dubchak et al. [15] have been widely used in the field of protein fold recognition [4, 12, 22-28]. Apart from these 5 attributes, features have also been extracted by incorporating other attributes of the amino acids, such as solvent accessibility [29], flexibility [30], bulkiness [31], first and second order entropy [32], and the size of the side chain of the amino acids [22]. For protein fold recognition, attributes have usually been picked for feature extraction in an arbitrary way. In contrast, Taguchi and Gromiha [16] argued that features derived from attributes of amino acids can be ignored because they carry insufficient information, and that only syntactical-based features should be considered. This shows that the amino acid attributes have not been properly explored. This led us to pose a question: ‘which of the attributes of the amino acids are to be selected for the protein fold recognition problem?’ The answer would open a third category of research, apart from 1) the development of classifiers, and 2) the development of feature extraction techniques based on the syntactic and/or physicochemical properties.

In this study, we develop a methodology for selecting the attributes of the amino acids for protein fold recognition in a systematic manner. To do this, a successive feature selection (SFS) technique based on an exhaustive greedy search algorithm can be applied [33, 34]. The SFS technique can find important features within a group of features. However, several features can be extracted from a single attribute (e.g. composition, transition and distribution from the hydrophobicity of amino acids) and there can be many attributes, so the features belonging to an attribute must be selected together as a multi-dimensional group. Therefore, we develop a scheme to identify important attributes by investigating the multi-dimensional features corresponding to each attribute. For brevity, we call the proposed technique multi-dimensional SFS (MD-SFS).

We show two schemes of MD-SFS: backward elimination and forward selection. In the backward elimination scheme, the search for the best subset of attributes starts by retaining all the given attributes. At each iteration, the attribute whose removal causes the minimum loss of information is discarded from the subset. This elimination is repeated until all the attributes are ranked. This scheme is useful for finding attributes of low individual importance that could still perform well if selected in an appropriate subset. In the forward selection scheme, the best attribute is selected first, and each subsequent attribute is included in the subset such that it improves the performance (e.g., in terms of classification) of the subset. This scheme, however, could be biased towards the highest ranking attribute.

Experiments are carried out using the Ding and Dubchak (DD) dataset [25], Taguchi’s (TG) dataset (Taguchi and Gromiha, [16]) and the extended Ding and Dubchak (EDD) dataset [2]. The selection of physicochemical attributes by the MD-SFS technique improves protein fold recognition by around 18 to 24% on all the datasets when 10-fold cross-validation is applied. The MD-SFS technique is described in the next section and its usefulness is demonstrated in the subsequent sections.

Multi-dimensional successive feature selection

The MD-SFS scheme is illustrated in Figures 1 and 2: Figure 1 shows the backward-elimination procedure and Figure 2 shows the forward-selection procedure. The purpose of MD-SFS is to select the best attributes for protein fold recognition. In the figures, four attributes (T_a = 4) are depicted. A feature extraction technique is used to extract d-dimensional features from each attribute. Attributes are represented as A_j (where j = 1, 2, …, T_a) and the extracted features of A_j are represented as f₁^j, f₂^j, …, f_d^j. In the figures, there are 4 levels in total, including the beginning state. The number of attributes at each level is denoted by NA. The classification accuracy of a subset of attributes, obtained using k-fold cross-validation, is denoted by H( · ) (Figure 2). The highest average classification accuracy using k-fold cross-validation at each level is denoted by α_l, where l = 0, 1, …, T_a − 1. The output is the list of ranked attributes.
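To make the notation concrete, the following minimal Python sketch assumes that the features of all T_a attributes are stored side by side, with attribute A_j occupying d consecutive columns of a feature table. This layout and the helper name `columns_for` are illustrative assumptions and are not taken from the original method description.

```python
from typing import Iterable, List


def columns_for(subset: Iterable[int], d: int) -> List[int]:
    """Return the feature-column indices for the attributes in `subset`.

    Assumption (hypothetical layout): attribute A_j, with j counted from 1,
    owns the d consecutive columns (j - 1) * d, ..., j * d - 1.
    """
    return [(j - 1) * d + i for j in sorted(subset) for i in range(d)]


# Toy setting: T_a = 4 attributes, d = 3 features per attribute.
cols = columns_for({2, 4}, d=3)  # columns holding the features of {A2, A4}
print(cols)
```

Under this layout, a subset with t_s attributes always maps to t_s d columns, which is the quantity that drives the cost of each classifier evaluation.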

MD-SFS: backward elimination

For the backward-elimination case of MD-SFS (Figure 1), the group of features belonging to one attribute is dropped at each successive level. This gives subsets of attributes containing features. The number of features in a subset at level l is (T_a − l)d. A classifier is used to compute the average classification accuracy of each subset using a k-fold cross-validation procedure. The subset of attributes with the highest average classification accuracy progresses to the next level. The size of the subset is reduced by d features at each level. The process terminates when all the attributes are ranked. In Figure 1, at level 1, the highest average classification accuracy (α₁) is obtained by attribute subset {A₁, A₂, A₄}. It is also possible that more than one subset has the same average classification accuracy. In that case, all the subsets with the highest average classification accuracy progress to the next level. In Figure 1, subset {A₁, A₂, A₄} progresses to level 2, and at this level the subset with the highest average classification accuracy (α₂) is {A₂, A₄}. At level 3, the subset with the highest average classification accuracy (α₃) is {A₂}. In Figure 1, the ranked attributes are {A₂, A₄, A₁, A₃}, where A₂ is the top-ranked attribute and A₃ is the bottom-ranked, or least important, attribute. Furthermore, attributes can be selected according to two criteria. For instance, if we want to select the best 3 attributes for the design, then we can take {A₂, A₄, A₁} from the ranked attributes. However, a better way is to find the argument of the maximum of α_l, i.e., $r = \arg\max_{l = 0, \ldots, T_{a} - 1} \alpha_{l}$. For instance, if r = 2, then subset {A₂, A₄} at level 2 exhibits the maximum accuracy among the selected subsets at all levels. Therefore, the attributes of subset {A₂, A₄} can be selected for the design. We refer to the former criterion of selection as brute-n (where n is the number of attributes to be selected) and to the latter criterion as the maximum accuracy (MA) based criterion.
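The backward-elimination procedure can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the `score` callable stands in for the k-fold cross-validated accuracy H( · ) of a classifier, the toy weights are invented so that the walk-through reproduces the ranking of the Figure 1 example, and ties are broken by keeping only the first best subset rather than carrying all tied subsets forward.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Subset = FrozenSet[str]


def mdsfs_backward(
    attributes: Sequence[str], score: Callable[[Subset], float]
) -> Tuple[List[str], List[float], Subset]:
    """Rank attributes by backward elimination.

    Returns (ranked attributes, alpha_l per level, MA-selected subset).
    """
    current = frozenset(attributes)
    subsets = [current]          # selected subset at each level
    alphas = [score(current)]    # alpha_0: accuracy with all attributes
    dropped: List[str] = []
    while len(current) > 1:
        # Drop each attribute (its whole block of d features) in turn.
        scored = [(score(current - {a}), a) for a in sorted(current)]
        best_score, removed = max(scored, key=lambda t: t[0])
        current = current - {removed}
        dropped.append(removed)
        subsets.append(current)
        alphas.append(best_score)
    # The survivor ranks first; the earliest dropped attribute ranks last.
    ranked = sorted(current) + dropped[::-1]
    r = max(range(len(alphas)), key=lambda l: alphas[l])  # MA criterion
    return ranked, alphas, subsets[r]


# Hypothetical toy scores (not real accuracies), chosen for illustration.
WEIGHTS = {"A1": -2, "A2": 40, "A3": -5, "A4": 30}


def toy_score(subset: Subset) -> float:
    return sum(WEIGHTS[a] for a in subset)


ranked, alphas, ma_subset = mdsfs_backward(["A1", "A2", "A3", "A4"], toy_score)
brute_3 = ranked[:3]  # brute-n criterion with n = 3
print(ranked, alphas, sorted(ma_subset), brute_3)
```

With these toy scores, the brute-3 criterion returns the three top-ranked attributes, whereas the MA criterion returns the level-2 subset because it has the largest α_l.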

The MD-SFS backward elimination procedure requires approximately between ${}^{T_{a} + 1}C_{2}$ and $2^{T_{a}} - 1$ search combinations, where T_a is the total number of attributes and ${}^{m}C_{n}$ denotes the number of n-combinations of m elements. If t_s denotes the number of attributes in a subset s, then this subset has t_s d features. Therefore, the computational complexity of a classifier performing classification with subset s depends on t_s d features.
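The two limits quoted above can be checked with a short count. The derivation below is a sketch that assumes the full attribute set is evaluated once at level 0 and that, without ties, a single subset progresses from each level, so that level l evaluates T_a − l + 1 candidate subsets. With ties, several subsets progress, and in the extreme case every non-empty subset of attributes is evaluated.

$$
\underbrace{1}_{\text{level } 0} + \sum_{l=1}^{T_a - 1} (T_a - l + 1)
= 1 + \sum_{m=2}^{T_a} m
= \frac{T_a (T_a + 1)}{2}
= {}^{T_a + 1}C_{2},
\qquad
\sum_{n=1}^{T_a} {}^{T_a}C_{n} = 2^{T_a} - 1 .
$$

For the four attributes depicted in Figure 1 (T_a = 4), this gives between 10 and 15 evaluations. For T_a = 30, it gives between 465 and 2^30 − 1 = 1,073,741,823 evaluations.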

MD-SFS: forward selection

For the forward-selection case of MD-SFS (Figure 2), one attribute at a time, with its corresponding d-dimensional features, is used to compute the average classification accuracy with the k-fold cross-validation procedure. The attribute with the highest average classification accuracy is stored; i.e., $r_{1} = \arg\max_{j = 1, \ldots, T_{a}} H(A_{j})$. The selected attribute, with its features, goes to the next level. At the next level, the attribute that exhibits the highest average classification accuracy in combination with the attribute selected at the previous level $(A_{r_{1}})$ is retained. This process continues until all the attributes are ranked. The number of features used in computing the classification accuracy at level l is (l + 1)d. Further, the same two criteria (brute-n and MA-based) can be applied to obtain attributes from the ranked set, as discussed for the MD-SFS backward elimination approach.
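A corresponding Python sketch of the forward-selection procedure is given below. As before, it is an illustration under assumptions rather than the authors' code: `score` is a stand-in for H( · ), the toy weights are invented, and ties are resolved by taking the first best candidate. The function also counts how many subsets it evaluates.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def mdsfs_forward(
    attributes: Sequence[str], score: Callable[[FrozenSet[str]], float]
) -> Tuple[List[str], List[float], List[str], int]:
    """Rank attributes by forward selection.

    Returns (ranked attributes, best accuracy per level,
    MA-selected attributes, number of subsets evaluated).
    """
    selected: List[str] = []
    remaining = list(attributes)
    alphas: List[float] = []
    n_evals = 0
    while remaining:
        # Try adding each remaining attribute to the attributes kept so far.
        scored = []
        for a in remaining:
            scored.append((score(frozenset(selected + [a])), a))
            n_evals += 1
        best_score, best_attr = max(scored, key=lambda t: t[0])
        selected.append(best_attr)
        remaining.remove(best_attr)
        alphas.append(best_score)
    r = max(range(len(alphas)), key=lambda l: alphas[l])  # MA criterion
    return selected, alphas, selected[: r + 1], n_evals


# Hypothetical toy scores (not real accuracies), chosen for illustration.
WEIGHTS = {"A1": -2, "A2": 40, "A3": -5, "A4": 30}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(WEIGHTS[a] for a in subset)


fwd_ranked, fwd_alphas, fwd_ma, fwd_evals = mdsfs_forward(
    ["A1", "A2", "A3", "A4"], toy_score
)
print(fwd_ranked, fwd_alphas, fwd_ma, fwd_evals)
```

For T_a = 4 the sketch evaluates 4 + 3 + 2 + 1 = 10 subsets, which is T_a(T_a + 1)/2.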

The MD-SFS forward selection requires around T_a(T_a + 1)/2 search combinations, where T_a is the total number of attributes. A subset s with t_s attributes has t_s d features. The computational complexity of a classifier used to compute the classification accuracy depends on this number of features, t_s d.

Methods

Dataset

In this study, three protein sequence datasets have been used: 1) DD-dataset [25], 2) TG-dataset (Taguchi and Gromiha, [16]) and 3) EDD-dataset [2]. The DD-dataset that we have used consists of 311 protein sequences in the training set, where any two proteins have no more than 35% sequence identity for aligned subsequences longer than 80 residues. The test set consists of 383 protein sequences with sequence identity of less than 40%. Both sets belong to 27 SCOP folds, which represent all major structural classes: α, β, α/β, and α + β [25]. The training set and test set have been merged into a single dataset in order to perform k-fold cross-validation.
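As an illustration of how the merged DD-dataset could be partitioned for k-fold cross-validation, the sketch below splits shuffled sample indices into k folds using only the Python standard library. The helper name `k_fold_indices`, the fixed seed and the plain (non-stratified) split are assumptions made for this example; the source does not specify how the folds were formed.

```python
import random
from typing import List, Tuple


def k_fold_indices(
    n_samples: int, k: int, seed: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs over shuffled indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        splits.append((train, held_out))
    return splits


n_merged = 311 + 383  # merged DD training and test sets
splits = k_fold_indices(n_merged, k=10)
fold_sizes = sorted(len(test) for _, test in splits)
print(n_merged, fold_sizes)
```

Each sample is held out exactly once, so averaging the k per-fold accuracies gives one value of H( · ) for a given attribute subset.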

The TG-dataset consists of 1612 protein sequences belonging to 30 different folding types of globular proteins. The fold names and the number of protein sequences in each of the 30 folds are described in Taguchi and Gromiha [16]. The protein sequences of the TG-dataset were first transformed into their corresponding PSSM (position-specific scoring matrix) [35] sequences using PSI-BLAST (http://blast.ncbi.nlm.nih.gov/), with the E-value cut-off set to E = 0.001.

The EDD-dataset consists of 3418 proteins with less than 40% sequence similarity, belonging to the 27 folds that were originally used in the DD-dataset. We extracted the EDD-dataset from SCOP 1.75 in a similar manner to Dong et al. [2] in order to study our proposed method using a larger number of samples.

Physicochemical attributes

In this study, 30 physicochemical attributes^a have been utilized, including the 5 popular attributes used by Dubchak et al. [15]. The attributes and their corresponding symbols are listed in Table 1. The values of these 30 attributes for the amino acid residues are given in Table 2.

Table 1 Physicochemical attributes used in the study
