 Research
 Open Access
 Published:
Prediction of protein structural classes by different feature expressions based on 2D wavelet denoising and fusion
BMC Bioinformatics volume 20, Article number: 701 (2019)
Abstract
Background
Protein structural class predicting is a heavily researched subject in bioinformatics that plays a vital role in protein functional analysis, protein folding recognition, rational drug design and other related fields. However, when traditional feature expression methods are adopted, the features usually contain considerable redundant information, which leads to a very low recognition rate of protein structural classes.
Results
We constructed a prediction model based on wavelet denoising using different feature expression methods. A new fusion idea, first fuse and then denoise, is proposed in this article. Two types of pseudo amino acid compositions are utilized to distill feature vectors. Then, a twodimensional (2D) wavelet denoising algorithm is used to remove the redundant information from two extracted feature vectors. The two feature vectors based on parallel 2D wavelet denoising are fused, which is known as PWDFUPseAAC. The related source codes are available at https://github.com/XiaohengWang12/Wangxiaoheng/tree/master.
Conclusions
Experimental verification of three lowsimilarity datasets suggests that the proposed model achieves notably good results as regarding the prediction of protein structural classes.
Background
Protein structural class prediction is a heavily researched subject in bioinformatics and performs a vital role in many related fields and applications, such as protein functional analysis, protein folding recognition, protein binding, rational drug design and so on [1,2,3,4,5,6,7,8,9,10,11]. However, in the light of newly discovered proteins, it will take time and money to determine the structure of proteins by traditional experimental methods, so many computational methods have been proposed to predict protein structural classes. Because the sequence of amino acids determines the specific spatial structure of protein, the method of predicting structural classes by sequence is a concise and effective way, which can help guide the direction of biological experiment, save the cost of biological experiment and provide useful information for a heuristic approach [9,10,11,12]. In particular, when the feature information of proteins is extracted, they often contain considerable redundant information, resulting in an unsatisfactory recognition rate for structural classes of protein.
To solve the problems of redundant information and low recognition rates, many computational methods have been proposed to predict protein structural classes during the past 30 years. One such method is the feature extraction method based on the information in amino acid sequences. Initially, amino acid composition [12, 13] (AAC) was used to extract the feature information. This method calculated the proportion of twenty amino acid residues in the sequence and expressed the feature information of the sequence by numerical vectors. Pseudo amino acid composition [14,15,16,17,18,19] (PseACC) was also used to extract its feature information. This method considered not only the composition of amino acid residues but also their hydrophobicity and other physical and chemical properties. In addition, peptide composition [20, 21] was adopted to extract its feature information. Compared with the previous two methods, this method considered the sequence factor between amino acid residues. These methods have achieved good prediction results on high similarity datasets but poor results on low similarity datasets. The prediction accuracy of these methods can reach more than 90% on high similarity datasets but only approximately 50% on low similarity datasets. Some improved feature extraction methods have been proposed. Lukasz et al. proposed the SCPRED method [22], which constructed feature vectors based on predictive secondary structure. Zhang proposed a TPM matrix to represent the feature on the predictive secondary structure [23], and Dai et al. [24] proposed a statistical feature method on the predictive secondary structure feature, which takes the secondary structure feature as part of the feature vector. In Ding [25], a multidimensional representation vector is constructed to predict protein secondary structural classes. Some methods for fusing multiple features such as feature selection [26] are also proposed. Chen et al. proposed the fusion of multiple features [27], which combined the derived structure information of sequences with the physicochemical properties [28]. Nanni et al. proposed a new feature fusion method based on the features of the primary sequence and the features of the secondary structure based on prediction [29]. Wang et al. [30] fused the improved simplified PSSM with secondary structure features. In addition, some other classical feature extraction methods have been proposed, such as Dehzangi et al., who used piecewise distribution and piecewise autocovariance ideas [31]. It is noted that it is hard for the above feature fusion algorithms to reduce the redundancy of feature information, which thus makes prediction accuracy hard to improve. Based on this properity, Liu et al. used a recursive feature selection algorithm to select the optimal feature vector [32].
The second is the classification algorithm. As far as the four common cases of structural classes, allα, allβ, α/β and α + β are concerned, how to distinguish them accurately is essential an efficient multiclassification problems. Multiple classification and various machine learning algorithms have been applied to protein classification prediction, such as neural networks, fuzzy clustering, Naive Bayes, support vector machines (SVM), Knearest neighbors (KNN) and the correlation coefficients methods [12, 33,34,35,36,37,38,39,40]. However, because the dataset used in protein structure prediction is usually small sample data, and the neural network classification algorithm requires a large amount of data, its performance cannot be fully developed. The fuzzy clustering algorithm also faces the same problem because the sample size is too small to cluster well, resulting in poor prediction results. For Naive Bayesian classification, the premise is that there is no correlation between the features and attributes, and it is sensitive to the form of data input. These factors affect the performance of classification prediction to a certain extent. Support Vector Machine can also play a role in classification performance when there are few data samples, but the process of searching parameters is highly timeconsuming. The Knearest neighbor algorithm is simple in theory, easy to implement, simple and efficient. This algorithm is also suitable for classification of small sample data. Later, some improved classification algorithms have been proposed. For example, Chen et al. proposed a method of fusing multiple support vector machines [41]. This method divides the extracted feature vectors into three parts, each part is input into a corresponding classifier, and then synthesizes the classification results of the three parts, voting to determine the category of the samples to be tested. The improved method is to fuse the same classifier. After that step, the fusions of different types of classifiers have been proposed, such as Dehzangi and other classifiers [42]. The classifiers are AdaBoost, M1, LogitBoost, SVM, MLP and Naive Bayes. However, the problem that redundant information in the feature vector affects the generalization ability of the model has not been solved by these methods.
In this article, to deal with this problem, the newly developed model for predicting structural classes of proteins is put forward based on different feature expression methods, known as PWDFUPseAAC. In order to prove the superiority of the proposed method, the extracted feature vectors are based on the primary sequence information of proteins. First, the features of the primary sequence of proteins are distilled by the traditional feature expression methods, type 1 pseudo amino acid composition (PseAAC) [43] and type 2 pseudo amino acid composition [44]. Since type 1 PseAAC is popularly used in many researches, here we explain a little about type 2 PseAAC. In Chou [44], type 2 PseAAC is also called ‘amphiphilic pseudo amino acid composition’, whose form is like AAC except much more information about the distribution of the hydrophobic and hydrophilic amino acids of a protein. Second, twodimensional multiscale wavelet denoising is used to process the feature vectors extracted by two feature expression methods, removing the redundant information from them. In the field of mathematics, a new direction of rapid and groundbreaking development is wavelet analysis, which has been increasingly widely utilized in the field of bioinformatics, particularly for protein structural prediction and functional analysis. This analysis has the characteristics of local transformation in the time domain and frequency domain and may efficaciously extract information from signals and perform multiscale fine analysis of functions or signals through scaling and translation operations. Wavelet denoising [45] is one of the significant branches of wavelet analysis, which can efficaciously eliminate redundant information of the extracted feature vectors, making the information more stable and efficacious, and improving the accuracy of prediction. Due to the complexity of the protein structure, it can be reasonably to employ twodimensional (2D) wavelet denoising rather than onedimensional (1D) wavelet denoising. To illustrate the validity of 2D wavelet denoising, it is compared with the 1D wavelet denoising in the following experimental parts. Third, the new feature vectors are obtained by fusing the two different feature vectors after denoising. Finally, the optimal feature vectors are treated as input data of the KNN to predict structural classes of proteins. To estimate the performance of our presented model, we adopt the jackknife test as a validation method to carry out relevant experimental analysis on the three lowsimilarity datasets. The final experimental outcomes indicate that our model has higher overall prediction accuracies than other methods.
Methods
Datasets
To compare with current methods fairly and objectively, three lowsimilarity benchmark datasets, the 25PDB [46], the 1189PDB [47] and the 640PDB [48], are selected as our experimental datasets, which are structural protein sequences with internal similarities of less than 25, 40 and 25%, respectively. The datasets have four categories, the details of which are shown in Table 1.
Feature extraction
In this article, the traditional feature expression methods, two types of pseudo amino acid compositions, are applied to convert the primary sequences of protein into numerical feature vectors. As known to all, pseudo amino acid composition is an improved expression on the basis of amino acid composition, not only considering the frequency of amino acid residues in the sequence but also considering the physicochemical properties of amino acid residues. There are two types of pseudo amino acid composition: parallel correlation type and sequence correlation type. For convenience, the pseudo amino acid composition of the parallel correlation type is called type 1 pseudo amino acid composition, and that of the sequence correlation type is called type 2 pseudo amino acid composition.
 (1)
Type 1 pseudo amino acid composition
Type 1 pseudo amino acid composition was proposed by Chou in 2001 [43]. This composition considers not only the hydrophilicity and hydrophobicity of amino acid residues, but also the quality of side chain groups of amino acid residues. Type 1 pseudo amino acid composition is used to extract the features of structural protein sequences.
Thus, a protein sequence can be transformed into 20+ λ dimensional numerical vectors, that is, P_{PseAAC _ type1} = [p_{1}, p_{2}, ......, p_{20 + λ}]^{T}, where p_{u} can be calculated from eq. (1):
where f_{i} is the frequency of 20 amino acid residues in protein sequence P; w is the weight factor, which is generally set to 0.05; λ is the hierarchical factor, which is less than the total length of the sequence N; θ_{j} is the sequence correlation coefficient of the jth layer, which can be calculated from eq. (2):
In addition:
Among them, H_{1}(R_{i}), H_{2}(R_{i}) and H_{3}(R_{i}) represent the hydrophobicity, hydrophilicity and the quality of side chain groups of amino acid residues, respectively.
 (2)
Type 2 pseudo amino acid composition
Type 2 pseudo amino acid composition was proposed by Chou in 2005 [44] because it considers the hydrophilicity and hydrophobicity of amino acid residues, also known as amphipathic pseudo amino acid composition. In this article, type 2 pseudo amino acid composition is also used to extract the features of structural protein sequences.
Thus, a protein sequence can be transformed into 20+ 2r dimensional numerical vectors, with P_{PseAAC _ type1} = [p_{1}, p_{2}, ......, p_{20 + 2r}]^{T}, where p_{u} can be calculated from equation (4):
where r is the hierarchical factor, which is less than the total length of the sequence N; τ_{j} is the sequence correlation coefficient of the jth layer, which can be calculated from eq. (5):
In addition:
where H^{1}(R_{i}) refer to the hydrophobicity of amino acid residues, and H^{2}(R_{i}) refer to the hydrophilicity of amino acid residues.
Twodimensional wavelet denoising
The process of wavelet denoising includes the following three parts: wavelet transform, processing of wavelet coefficients and wavelet inverse transform [49]. There are three commonly used methods of wavelet denoising: wavelet threshold denoising, modulus maximum denoising and spatial correlation denoising. To suppress the noise in the high frequency section and remove redundant information, the wavelet threshold denoising method is adopted. In other words, the wavelet denoising method used refers to the wavelet threshold denoising method in this paper.
This method’s decomposition and reconstruction can be expressed as follows:
where f^{0} represents the original signal; \( {f}_L^i \) represents the ith layer low frequency component obtained by wavelet decomposition; \( {f}_H^i \) represents the ith layer high frequency component obtained by wavelet decomposition; It contains three highfrequency components, in which \( {f}_{HH}^i \) refers to the horizontal component, \( {f}_{HV}^i \) refers to the vertical component and \( {f}_{HD}^i \) refers to the diagonal component.
Then, the above can be expressed as:
where ⊕ represents the direct orthogonal sum.
In addition, formula (8) can also be expressed as (9):
The flow chart of 2D wavelet denoising is shown in Fig. 1.
In Fig. 1, the input is the original 2D data and the output is the new obtained 2D data, the intermediate procedures of the 2D wavelet denoising is mainly as follows, which is summarized and deduced from references [48,49,50,51,52,53]:
1) Set the wavelet basis function x, decomposition scale n and threshold value t.
2) Through the wavelet transform, 2D data are decomposed into four components, one of which is a low frequency component, and the other three of which are high frequency components: a horizontal component, a vertical component and a diagonal component.
3) The low frequency component obtained from step 2 can be further decomposed into a new low frequency component and three new high frequency components: horizontal component, vertical component and diagonal component. Repeat this process until the decomposition scale n is reached.
4) A threshold value is applied to quantize high frequency coefficients obtained by each decomposition.
5) The lastly decomposed and quantized highfrequency component is reconstructed by wavelet transform with the only lowfrequency component to form a new lowfrequency component. The process is repeated n times upward until the new 2D data are synthesized.
The algorithm’s pseudocode is shown in Table 2.
Clearly, the key of the wavelet denoising method is undoubtedly to select the value of threshold and threshold function, which has the greatest impact on the effect of wavelet denoising. There are generally three ways to select the value of threshold: default threshold, given threshold and forced threshold. In this article, the default threshold determination model is selected to calculate the value of the threshold because it is convenient and concise. Furthermore, there are two common threshold functions: a soft threshold function and a hard threshold function. We choose a soft threshold function for quantifying because it makes reconstructed signals considerably smoother than the hard one.
Construction of prediction model
In this article, a new method, called PWDFUPseAAC, is proposed to predict the structural classes of protein sequences. First, the feature information of protein sequences is extracted by the traditional feature expression method, type 1 pseudo amino acid composition and type 2 pseudo amino acid composition. Each protein sequence is converted to 20+ λ dimensional numerical vectors by type 1 pseudo amino acid composition, and each protein sequence is converted to 20+ 2r dimensional numerical vectors by type 2 pseudo amino acid composition. Second twodimensional wavelet denoising is used to denoise the two feature vectors separately. Then, the two feature vectors after denoising are fused, which refers to splicing the first and last vectors of the two parts to form 40+ λ + 2r dimensional feature vectors. Moreover, the optimal 40+ λ + 2r dimensional feature vectors are fed into the KNN classifier for predicting. The jackknife test is used to test the performance of the model on the 25PDB, the 1189PDB and the 640PDB. According to the predicting accuracy, the parameters of the model are adjusted continuously to optimize the performance of the model. Finally, four measures are used to evaluate the performance of the predicting model. The advantages of choosing the classifier KNN are its efficiency and simplicity. Although KNN’s classifying effect is not as good as that of support vector machine (SVM), KNN requires considerably less running time than SVM, as the latter requires considerably effort to determine the optimal parameters. Therefore, considering the classifiers comprehensively, we choose KNN instead of SVM. The flow chart of the model is shown in Fig. 2.
In Fig. 2, new method of PWDFUPseAAC is as follows. The feature information of protein sequences is extracted by type 1 pseudo amino acid composition and type 2 pseudo amino acid composition, respectively. Then, 2D wavelet denoising is used to denoise the two feature vectors, respectively. Next, the two feature vectors after denoising are fused to form a 40+ λ + 2r dimensional vector, which is entered to the KNN classifier for predicting.
Performance evaluation
Four validation methods are commonly applied to estimate the performance of the prediction model: the selfconsistency test, independent dataset test, kfold crossvalidation and jackknife test [53,54,55,56,57]. Because of the objectivity and strictness of the jackknife test, in this experiment, we make use of it to examine the performance of our prediction model. The sensitivity (Sens), specificity (Spec), overall accuracy (OA) and Matthews correlation coefficient (MCC) are applied to assess the performance of our method. These measures are expressed in the following formula:
where TP denotes the number of true positives, FP denotes the number of false positives, TN denotes the number of true negatives, and FN denotes the number of false negatives.
Results and discussion
Choice of λ and r parameters
In this article, two types of pseudo amino acid compositions are used to extract feature vectors, and different parameters of λ and r will lead to inconsistency of the feature information contained in the extracted feature vectors, thereby affecting the final prediction results. Therefore, it is necessary to choose the optimal value of λ and r, and the range of λ and r are 1 to 9, therefore, this section chooses the optimal parameter of λ or r between 1 and 9. In this paper, using the 25PDB as the research object, the validity of these feature vectors extracted from two different types of pseudo amino acids is discussed respectively. The wavelet basis function of twodimensional wavelet denoising is db4, the wavelet decomposition scale is 3, and the K value of the KNN classifier is set to 3. The experimental results of the overall prediction accuracy of protein structural classes and the prediction accuracy of each class are shown in Table 3 and Table 4.
From Tables 3 and 4, it can be concluded that different λ_{1} and λ_{2} values do have an impact on the prediction results. When λ and r are 2, the overall prediction accuracy is the highest, 87.98 and 76.99% respectively. Therefore, the optimum λ and r for both types of pseudo amino acid compositions is 2.
Choice of the wavelet function and decomposition scale
The traditional feature expression method, type 1 pseudo amino acid composition and type 2 pseudo amino acid composition, are adopted in this article, which still contains considerable redundant information. To obtain more efficacious information, twodimensional wavelet denoising is used to process the feature vectors extracted by two feature expression methods separately, removing the redundant information from them.
However, the choice of wavelet function and decomposition scale will determine the denoising effect of the models and then further affect the final overall prediction accuracy. To further obtain efficacious information on structural proteins, we chose different wavelet functions and different decomposition scales to examine the effect on the prediction models, including db2, db4, db6, sym2, sym4, sym6, coif1, coif3, bior2.2 and bior2.4, and the decomposition scale from 2 to 5. We discussed the optimal denoising parameters of the feature vectors extracted by type 1 PseAAC and type 2 PseAAC.
The 25PDB is selected as the sample for finding the optimal parameters. Table 5 and Table 6 show that the two related factors of the wavelet function and decomposition scale do affect the effect of denoising, thereby affecting the overall prediction accuracy of the method. When the decomposition scale is 5 and the db6 wavelet function is adopted, the effect of wavelet denoising is optimal in Table 5; when the decomposition scale is 5 and the sym4 wavelet function is adopted, the effect of wavelet denoising is optimal in Table 6. Hence, to obtain good prediction results, we choose 5 as the decomposition scale and db4 wavelet as the wavelet function to denoise feature vectors extracted by type 1 pseudo amino acid composition; we choose 5 as the decomposition scale and sym4 wavelet as the wavelet function to denoise feature vectors extracted by type 2 pseudo amino acid composition. In addition, Table 5 and Table 6 show that when the decomposition scale is 2, regardless of the type of wavelet basis function chosen, the overall prediction accuracy is lower than other scales. With the increase of the decomposition scale, the overall prediction accuracy has an upward trend. To describe this trend more intuitively, we drew line charts of the overall prediction accuracy under different wavelet basis functions and decomposition scales, as shown in Figs. 3 and 4.
As shown in Figs. 3 and 4, with the increase of decomposition scale, the overall prediction accuracy obtained by experiments is improved under different conditions of wavelet basis functions. When the decomposition scales are 4 and 5, the overall prediction accuracy obtained by the experiment is notably close, which indicates that with the increase of the scale, the overall prediction accuracy will tend to be stable, will not continue to increase, or even may decline. Moreover, it can be seen from the Figs. 3 and 4 that although the choice of decomposition scale and wavelet basis function will affect the overall prediction accuracy of the experiment, the influence of the decomposition scale is greater than that of the wavelet basis function.
Comparison with 1D wavelet denoising
To verify the superiority of the twodimensional (2D) wavelet denoising method, we compare it with the onedimensional (1D) wavelet denoising method. The 1A1W structural protein sequence in the 25PDB was selected as the experimental sample to compare the denoising effect. The decomposition scale is 5, and the sym4 wavelet is chosen as the wavelet basis function. The K value in the classifier KNN is still 3. We use the 24dimensional numerical feature vectors extracted from the 1A1W protein sequence through the type 2 pseudo amino acid composition as the original signal. to intuitively show the comparison of the two denoising effects, we choose the form of graph to show. The comparison results of onedimensional wavelet denoising and twodimensional wavelet denoising are shown in Fig. 5.
As seen from Fig. 5, the original signal is notably messy, because it contains considerable redundant information, therefore, it seems to fluctuate. After 1D wavelet denoising, although the signal has changed, the effect of denoising is not strong. After 2D wavelet denoising, the signal is clearly different from the original signal, becoming smoother and more stable, indicating that the effect of denoising is notably good. This finding is observed in our study. We use variance to accurately describe the difference within the signal. The variance of the original signal is 30.526. After onedimensional wavelet denoising, the variance of the signal is 14.274. After twodimensional wavelet denoising, the variance of the signal becomes 6.189. In summary, the denoising effect of the 2D wavelet is better than that of the 1D wavelet.
To sum up, twodimensional wavelet denoising is better than onedimensional wavelet denoising, and this 2D wavelet denoising method can be used not only in structural classes but also in other types of protein classification models.
Selection of the K value in the Knearest neighbor classifier
K nearest neighbor classifier, which is based on the similarity of sample points to select the first K sample points for voting classification. However, this K value is often unknown, and choosing different K values will produce different prediction results. Therefore, to obtain better prediction results, it is necessary to select the optimal K value. In this section, the optimal K value is selected from 1 to 9. Under different K values, the prediction accuracy of each class and the overall prediction accuracy of the protein structure class sequence are shown in Table 7. Under different K values, the prediction accuracy of each class and the overall prediction accuracy of the protein structure class sequence are shown in Table 7.
As shown in Table 7, different K values have a certain impact on the prediction results. In model 1, with the increase of K values, the overall prediction accuracy decreases. When K is 1, the overall prediction accuracy is the highest, 97.91%, while when K is 9, the overall prediction accuracy is the lowest, 91.33%. To visualize the overall prediction accuracy under different K conditions, we use a line chart to describe it, as shown in Fig. 6. From the Fig. 6, it is clear that different K values will affect the prediction results of the experiment, and with the increase of K values, the overall prediction accuracy has a downward trend.
Comparison of different strategies
In this paper, a feature fusion model based on parallel twodimensional wavelet denoising is proposed. To better demonstrate the improvement of the prediction accuracy of the models, this section compares with other strategies.
Compare various strategies on the 25PDB. In the table, strategy 1 refers to the use of type 1 pseudo amino acid composition only; strategy 2 refers to the use of type 2 pseudo amino acid composition only; strategy 3 refers to the combination of type 1 pseudo amino acid composition and twodimensional wavelet denoising; strategy 4 refers to the combination of type 2 pseudo amino acid composition with twodimensional wavelet denoising; and strategy 5 refers to the first combination of features extracted from type 1 and type 2 pseudo amino acid composition. The feature vector fusion is then combined with twodimensional wavelet denoising; strategy 6 refers to the model proposed in this paper. Among these strategies, the parameters λ and r in the two types of pseudo amino acid composition are both 2. In the classifier, the K value in KNN ranges from 1 to 9, and the parameters in twodimensional wavelet denoising are also the best denoising wavelet basis function and decomposition scale. The experimental results are shown in Table 8 and Fig. 7.
From Table 8 and Fig. 7, it can be seen that the overall prediction accuracy of model 1 proposed in this paper reaches the highest level, 98.09%, and it can be seen from the table that the idea of parallel twodimensional wavelet denoising proposed in this chapter is effective. Compared with strategy 5, first fusing feature vectors and then denoising, the overall prediction accuracy is improved by 1.08%, while the application of twodimensional wavelet denoising improves the prediction accuracy by 1.08%. The measurement results have a great impact. Strategy 1 and Strategy 2 do not use twodimensional wavelet denoising, and their prediction accuracy is far from that of other strategies. In conclusion, the fusion idea proposed in this model is highly effective.
The influence of different classifiers on prediction results
Three classifiers: Naive Bayes, KNN and SVM are used to explore the effects of different classifiers on the prediction results. The parameters of two types of pseudo amino acid composition are 2. The denoising parameters of twodimensional wavelet denoising for the extracted feature vectors of type 1 pseudo amino acid composition: the wavelet basis function is db4 wavelet, the decomposition scale is 5, and the denoising parameters of twodimensional wavelet denoising for the extracted feature vectors of type 2 pseudo amino acid composition: the wavelet basis function is sym4, and the decomposition scale is 5. The K value of KNN is the best 1. For SVM, the radial basis function is used as the kernel function, and the grid search strategy is used for the selection of C and G parameters. The search ranges of both are 2^{− 10} to 2^{10}. The jackknife method was used to test the influence of three classifiers on the prediction results on the 25PDB. The experimental results are shown in Table 9 and Fig. 8.
As shown in Table 9 and Fig. 8, when the KNN is used as the classifier, the overall prediction accuracy is the highest, 98.09%. The prediction accuracy of each category is the highest, and only the prediction accuracy of the α + β class is the highest in parallel with other categories. When Naive Bayes is used as the classifier, the overall prediction accuracy is 82.90%, which is considerably less than the KNN. This finding shows that the Naive Bayes is not as effective as the KNN in this experimental condition. When SVM is used as the classifier, the overall prediction accuracy is 97.85%. The possible reason for this finding is that the range of the parameter search is not appropriate, which causes the performance of SVM not to be as good as that of KNN. Moreover, SVM takes considerably more time to find parameters than KNN; therefore, considering the classifiers comprehensively, the classifier of this model chooses KNN.
Prediction performance of our method
The performance of a method determines whether it can be applied by everyone. Therefore, as our study is no exception, the traditional performance evaluation methods are utilized to verify the performance of our methods. In model 1, based on two types of pseudo amino acid composition methods and parallel 2D wavelet denoising, a machine learning prediction model with the fusion of two features is proposed, which is called PWDFUPseAAC. First, the feature information of protein sequences is extracted by type 1 pseudo amino acid composition and type 2 pseudo amino acid composition; in other words, the primary protein sequences are converted into 20 + λ dimensional and 20 + 2r dimensional numerical vectors respectively. Second, the 2D wavelet denoising method is used to denoise the two feature vectors separately and remove their redundancy. Then, the two feature vectors after denoising are fused, which refers to splicing the first and last vectors of the two parts to form 40 + λ + 2r dimensional feature vectors. Finally, the optimal feature vectors are input into the KNN classifier for prediction, and the results are verified by jackknife. The optimal parameters of the prediction model can be obtained from the previous experimental analysis. The values of λ and r in both types of PseAAC are 2. The db4 wavelet is used as the wavelet function, and 5 is chosen as the decomposition scale to denoise the feature vectors extracted by type 1 PseAAC; Sym4 is chosen as the wavelet function and 5 is chosen as the decomposition scale to denoise the feature vectors extracted by type 2 PseAAC. The K value in the classifier is set to 1. The performance of the model is evaluated on the 25PDB, the 1189PDB and the 640PDB. The experimental results are shown in Table 10.
The results of four standard performance measures are shown in Table 10. From the results that emerged in Table 10, we note that we acquire 98.09, 97.25 and 96.09% overall accuracy on the 25PDB, the 1189PDB and the 640PDB, respectively. The overall accuracy obtained on three datasets was highly satisfactory. Moreover, the Matthews correlation coefficient (MCC) of α + β class proteins are lower than other classes for the three datasets. Hence, there are many challenges to identifying protein sequences of the α + β class with high very efficacy.
Comparison with existing methods
To objectively compare our method with previously reported methods, we carried out experiments under the same conditions as the previous methods. Among these methods, the MEDP [58] method is based on evolutionary information, and a new feature expression method is proposed. The SCPRED [22] method is based on predictive secondary structure to construct new feature vectors. The PKSPPSC [59] method is based on predictive secondary structure to construct feature vectors, but it uses chaotic game representation and information entropy to construct feature vectors. The method reported by Zhang et al. [23] is based on predictive secondary structure information, based on this information, the TPM matrix feature representation is proposed. The PSSSPSSM [25] method combines predicted secondary structure features with the PSSM matrix. The PSSSPsePSSM [60] method combines predicted secondary structure features with improved PSSM matrix, and proposes a new fusion feature expression. The WDPseAAC [53] method extracts feature vectors based on SVM, using a single feature expression method and then denoises them with wavelet denoising. Our method is to denoise the extracted feature vectors and then fuse them.
The experimental results are summarized in Table 11 and Figs. 9, 10, 11. From the experimental results in Table 11 and Fig. 9, the overall prediction accuracy of 98.1% is gained on the 25PDB, which is the highest and 5.0 to 23.3% higher than those of other methods. Furthermore, from the experimental results in Table 11 and Fig. 10, the overall prediction accuracy of 97.3% is also obtained on the 1189PDB, which is the highest and 6.5 to 21.5% higher than those of other methods. Moreover, from the experimental results in Table 11 and Fig. 11, the prediction results are also satisfactory for the 640PDB. The prediction accuracy of the four classes is the highest, and the overall prediction accuracy is the highest, 95.0%. At the same time, there are other significant changes that deserve our attention. For example, the overall prediction accuracy of our method can achieve such good results on three datasets because we have greatly enhanced the prediction rates of α/β class proteins and α + β class proteins, while the prediction rates of other methods for α/β class proteins and α + β class proteins are notably low. One of the reasons that the overall prediction accuracy of protein structural classes has been relatively low is that it is difficult to predict α/β and α + β proteins.
In summary, through the analysis of the above experimental results, we can conclude that our models can efficaciously forecast the structural classes of protein sequences, even on the lowsimilarity datasets. The reason why our method is better than others is that although the traditional method is used to extract feature vectors, the feature extraction method that we adopt may not be as good as others. However, after feature extraction, we use twodimensional wavelet denoising to denoise the redundant information in the feature vector, which makes it more recognizable. In addition, other researchers also use the method of wavelet denoising, but this paper proposes a new fusion strategy based on wavelet denoising.
Conclusions
A new method, PWDFUPseAAC, is proposed to forecast the structural classes of protein sequences. The method ameliorates the shortcomings of traditional feature expression methods, which contain considerable redundant information that cannot result in inefficiency. Therefore, in this paper, a new idea of fusion has been proposed, in which a parallel 2D wavelet denoising algorithm is adopted to process the extracted feature vectors before fusing them. Through related experiments, we not only verify the effect of the wavelet denoising algorithm on the models but also compare the overall accuracy of our models with those of other methods. Ultimately, we can conclude that our method is good for predicting the structural classes of protein sequences and is expected to be applied in other fields related to bioinformatics [61,62,63,64,65,66,67,68,69,70,71,72,73,74]. The related source codes and datesets are available at https://github.com/XiaohengWang12/Wangxiaoheng/tree/master.
Availability of data and materials
The related source codes and datasets are available at https://github.com/XiaohengWang12/Wangxiaoheng/tree/master.
Abbreviations
 1D:

One dimentonal
 2D:

Two dimentonal
 ACC:

Amino acid composition
 KNN:

Knearest neighbors
 MCC:

Matthews correlation coefficient
 OA:

Overall accuracy
 PseACC:

Pseudo amino acid composition
 Sens:

Sensitivity
 Spec:

Specificity
 SVM:

Support vector machines
References
 1.
Chou KC. Structural bioinformatics and its impact to biomedical science [J]. Curr Med Chem. 2004;11:2105–34.
 2.
Chou KC. Progress in protein structural class prediction and its impact to bioinformatics and proteomics [J]. Curr Protein Pept Sci. 2005;6:423–36.
 3.
Peng C, Zou L, Huang DS. Discovery of relationships between long noncoding RNAs and genes in human diseases based on tensor completion [J]. IEEE Access. 2018;6:59152–62.
 4.
Yi HC, You ZH, Huang DS, et al. A deep learning framework for robust and accurate prediction of ncRNAprotein interactions using evolutionary information [J]. Mol Ther Nucleic Acids. 2018;11:337–44.
 5.
Bao W, Jiang Z, Huang DS. Novel human microbedisease association prediction using network consistency projection [J]. BMC Bioinformatics. 2017;18:543.
 6.
Guo WL, Huang DS. An efficient method to transcription factor binding sites imputation via simultaneous completion of multiple matrices with positional consistency [J]. Mol BioSyst. 2017;13:1827–37.
 7.
Chuai G, Ma H, Yan J, et al. DeepCRISPR: optimized CRISPR guide RNA design by deep learning [J]. Genome Biol. 2018;19:80.
 8.
Yuan L, Zhu L, Guo WL, et al. Nonconvex penalty based lowrank representation and sparse regression for eQTL mapping [J]. IEEE/ACM Trans Comput Biol Bioinformatics. 2017;14:1154–64.
 9.
Hu H, Zhang L, Ai H, et al. HLPIensemble: prediction of human lncRNAprotein interactions based on ensemble strategy [J]. RNA Biol. 2018;15:797–806.
 10.
Zhao Q, Yu H, Ming Z, et al. The bipartite network projectionrecommended algorithm for predicting long noncoding RNAprotein interactions [J]. Mol Ther Nucleic Acids. 2018;13:464–71.
 11.
Zhao Q, Zhang Y, Hu H, et al. IRWNRLPI: integrating random walk and neighborhood regularized logistic matrix factorization for lncRNAprotein interaction prediction [J]. Front Genet. 2018;9:239.
 12.
Chou KC, Zhang CT. A correlationcoefficient method to predicting proteinstructural classes from amino acid compositions [J]. Eur J Biochem. 1992;207:429–33.
 13.
Zhang CT, Chou KC, Maggiora GM. Predicting protein structural classes from amino acid composition: application of fuzzy clustering [J]. Protein Eng. 1995;8:425–35.
 14.
Zhang TL, Ding YS. Using pseudo amino acid composition and binarytree support vector machines to predict protein structural classes [J]. Amino Acids. 2007;33:623–9.
 15.
Chen C, Tian YX, Zou XY, et al. Using pseudoamino acid composition and support vector machine to predict protein structural class [J]. J Theor Biol. 2006;243:444–8.
 16.
Ding YS, Zhang TL, Chou KC. Prediction of protein structure classes with Pseudo amino acid composition and fuzzy support vector machine network [J]. Protein Pept Lett. 2007;14:811–5.
 17.
Zhang TL, Ding YS, Chou KC. Prediction protein structural classes with pseudoamino acid composition: approximate entropy and hydrophobicity pattern [J]. J Theor Biol. 2008;250:186–93.
 18.
Xiao X, Wang P, Chou KC. Predicting protein structural classes with pseudo amino acid composition: an approach using geometric moments of cellular automaton image [J]. J Theor Biol. 2008;254:691–6.
 19.
Li ZC, Zhou XB, Dai Z, et al. Prediction of protein structural classes by Chou’s pseudo amino acid composition: approached using continuous wavelet transform and principal component analysis [J]. Amino Acids. 2009;37:415–25.
 20.
Luo R, Feng Z, Liu J. Prediction of protein structural class by amino acid and polypeptide composition.[J]. Eur J Biochem. 2002;269:4219–25.
 21.
Costantini S, Facchiano AM. Prediction of the protein structural class by specific peptide frequencies [J]. Biochimie. 2009;91:226–9.
 22.
Kurgan L, Cios K, Chen K. SCPRED: accurate prediction of protein structural class for sequences of twilightzone similarity with predicting sequences [J]. Bmc Bioinformatics. 2008;9:1–15.
 23.
Zhang S, Ding S, Wang T. Highaccuracy prediction of protein structural class for lowsimilarity sequences based on predicted secondary structure [J]. Biochimie 2011;93:0–714.
 24.
Dai Q, Li Y, Liu X, et al. Comparison study on statistical features of predicted secondary structures for protein structural class prediction: from content to position [J]. BMC Bioinformatics. 2013;14:152.
 25.
Ding S, Li Y, Shi Z, et al. A protein structural classes prediction method based on predicted secondary structure and PSIBLAST profile [J]. Biochimie. 2014;97:60–5.
 26.
Ding H, Lin H, Chen W, et al. Prediction of protein structural classes based on feature selection technique [J]. Interdiscip Sci. 2014;6:235–40.
 27.
Chen C, Chen LX, Zou XY, et al. Predicting protein structural class based on multifeatures fusion [J]. J Theor Biol. 2008;253:388–92.
 28.
Kumar AV, Ali RFM, Yu C, et al. Application of data mining tools for classification of protein structural class from residue based averaged NMR chemical shifts [J]. Biochim Biophys Acta. 1854;2015:1545–52.
 29.
Nanni L, Brahnam S, Lumini A. Prediction of protein structure classes by incorporating different protein descriptors into general Chou’s pseudo amino acid composition [J]. J Theor Biol. 2014;360:109–16.
 30.
Wang J, Wang C, Cao J, et al. Prediction of protein structural classes for lowsimilarity sequences using reduced PSSM and positionbased secondary structural [J]. Gene. 2015;554:241–8.
 31.
Dehzangi A. Proposing a highly accurate protein structural class predictor using segmentationbased features [J]. BMC Genomics. 2014;15:1–13.
 32.
Liu T, Qin Y, Wang Y, et al. Prediction of protein structural class based on gappeddipeptides and a recursive feature selection approach [J]. Int J Mol Sci. 2015;17:15–24.
 33.
Cai YD, Zhou GP. Prediction of protein structural classes by neural network [J]. Biochimie. 2000;82:783–5.
 34.
Shen HB, Yang J, Liu XJ, et al. Using supervised fuzzy clustering to predict protein structural classes [J]. Biochem Biophys Res Commun. 2005;334:577–81.
 35.
Chinnasamy A, Sung WK, Mittal A. Protein structure and fold prediction using treeaugmented naive Bayesian classifier [J]. J Bioinforma Comput Biol. 2005;3:387–98.
 36.
Zheng X, Li C, Wang J. An informationtheoretic approach to the prediction of protein structural class [J]. J Comput Chem. 2010;31:1201–6.
 37.
Cai YD, Liu XJ, Xu XB, et al. Prediction of protein structural classes by support vector machines [J]. Comput Chem. 2002;26:293–6.
 38.
Sun XD, Huang RB. Prediction of protein structural classes using support vector machines [J]. Amino Acids (Vienna). 2006;30:469–75.
 39.
Cai YD, Feng KY, Lu WC, et al. Using LogitBoost classifier to predict protein structural classes [J]. J Theor Biol. 2006;238:172–6.
 40.
Qiao S, Yan B, Li J. Ensemble learning for protein multiplex subcellular localization prediction based on weighted KNN with different features [J]. Appl Intell. 2018;48:1813–24.
 41.
Chen C, Zhou X, Tian Y, et al. Predicting protein structural class with pseudoamino acid composition and support vector machine fusion network [J]. Anal Biochem. 2006;357:116–21.
 42.
Dehzangi A, Paliwal K, Sharma A, et al. A combination of feature extraction methods with an Ensemble of Different Classifiers for protein structural class prediction problem [J]. IEEE/ACM Trans Comput Biol Bioinform. 2013;10:564–75.
 43.
Chou KC. Prediction of protein cellular attributes using pseudo amino acid composition [J]. Proteins. 2001;44:246–55.
 44.
Chou KC. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes [J]. Bioinformatics. 2005;21:10–9.
 45.
Yu B, Li S, Qiu WY, et al. Accurate prediction of subcellular location of apoptosis proteins combining Chou’s PseAAC and PsePSSM based on wavelet denoising [J]. Oncotarget. 2017;8:107640–65.
 46.
Kurgan LA, Homaeian L. Prediction of structural classes for protein sequences and domains—impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy [J]. Pattern Recogn. 2006;39:2323–43.
 47.
Wang ZX, Yuan Z. How good is prediction of protein structural class by the componentcoupled method?[J]. Proteinsstruct Funct Bioinformatics. 2015;38:165–75.
 48.
Chen K, Kurgan LA, Ruan J. Prediction of protein structural class using novel evolutionary collocationbased sequence representation [J]. J Comput Chem. 2008;29:1596–604.
 49.
Qiu WY, Li S, Cui XM, et al. Predicting protein submitochondrial locations by incorporating the pseudoposition specific scoring matrix into the general Chou’s pseudoamino acid composition [J]. J Theor Biol. 2018;450:86–103.
 50.
Luisier F, Blu T, Unser M. A new SURE approach to image Denoising: Interscale orthonormal wavelet Thresholding [J]. IEEE Trans Image Process. 2007;16:593–606.
 51.
Chang SG, Yu B, Vetterli M. Adaptive wavelet thresholding for image denoising and compression [J]. IEEE Trans Image Process. 2000;9:1532–46.
 52.
Selesnick IW, Li KY. Video denoising using 2D and 3D dualtree complex wavelet transforms [C]. Wavelets: Applications in Signal and Image Processing X. Int Soc Opt Photonics. 2003.
 53.
Yu B, Lou L, Li S, et al. Prediction of protein structural class for lowsimilarity sequences using Chou's pseudo amino acid composition and wavelet denoising [J]. J Mol Graph Model. 2017;76:260–73.
 54.
Huang DS, Zheng CH. Independent component analysisbased penalized discriminant method for tumor classification using gene expression data [J]. Bioinformatics. 2006;22:1855–62.
 55.
Deng SP, Cao S, Huang DS, et al. Identifying stages of kidney renal cell carcinoma by combining gene expression and DNA methylation data [J]. IEEE/ACM Trans Comput Biol Bioinform. 2017;14:1147–53.
 56.
Qiu JD, Luo SH, Huang JH, et al. Using support vector machines for prediction of protein structural classes based on discrete wavelet transform [J]. J Comput Chem. 2009;30:1344–50.
 57.
Zhang S, Liang Y, Yuan X. Improving the prediction accuracy of protein structural class: approached with alternating word frequency and normalized Lempel–Ziv complexity [J]. J Theor Biol. 2014;341:71–7.
 58.
Zhang L, Zhao X, Kong L. Predict protein structural class for lowsimilarity sequences by evolutionary difference information into the general form of chou’s pseudo amino acid composition [J]. J Theor Biol. 2014;355:105–10.
 59.
Yang JY, Peng ZL, Chen X. Prediction of protein structural classes for lowhomology sequences based on predicted secondary structure [J]. BMC Bioinformatics. 2010;11:S9.
 60.
Zhang SL. Accurate prediction of protein structural classes by incorporating PSSS and PSSM into Chou’s general PseAAC [J]. Chemom Intell Lab Syst. 2015;142:28–35.
 61.
Wu X, Wang F, Li Y, et al. Evaluation of latent membrane protein 1 and microRNA155 for the prognostic prediction of diffuse large B cell lymphoma.[J]. Oncol Lett. 2018;15:9725–34.
 62.
Wang S, Yue Y, Lin X. Protein subnuclear localization based on a new effective representation and intelligent kernel linear discriminant analysis by dichotomous greedy genetic algorithm [J]. PLoS One. 2018;13:e0195636.
 63.
Xiao X, Wang P, Lin WZ, et al. iAMP2L: a twolevel multilabel classifier for identifying antimicrobial peptides and their functional types [J]. Anal Biochem. 2013;436:168–77.
 64.
He X, Han K, Hu J, et al. TargetFreeze: identifying antifreeze proteins via a combination of weights using sequence evolutionary information and Pseudo amino acid composition [J]. J Membr Biol. 2015;248:1005–14.
 65.
Deng SP, Zhu L, Huang DS. Predicting hub genes associated with cervical Cancer through gene coexpression networks [J]. IEEE/ACM Trans Comput Biol Bioinform. 2016;13:27–35.
 66.
Deng SP, Zhu L, Huang DS. Mining the bladder cancerassociated genes by an integrated strategy for the construction and analysis of differential coexpression networks [J]. BMC Genomics. 2015;16(3 Supplement):S4.
 67.
Huang DS, Yu HJ. Normalized feature vectors: a novel alignmentfree sequence comparison method based on the numbers of adjacent amino acids [J]. IEEE/ACM Trans Comput Biol Bioinform. 2013;10:457–67.
 68.
Guo W, Zhu L, Deng S, et al. Understanding tissuespecificity with human tissuespecific regulatory networks [J]. SCIENCE CHINA Inf Sci. 2016;59:070105.
 69.
Hu H, Zhu C, Ai H, et al. LPIETSLP: lncRNA–protein interaction prediction using eigenvalue transformationbased semisupervised link prediction [J]. Mol BioSyst. 2017;13:1781–7.
 70.
Zhao Q, Liang D, Hu H, et al. RWLPAP: random walk for lncRNAprotein associations prediction [J]. Protein Pept Lett. 2018;25:830–7.
 71.
Shen Z, Bao WZ, et al. Recurrent neural network for predicting transcription factor binding sites [J]. Sci Rep. 2018;8:15270.
 72.
Shen Z, Zhang YH, Han K, et al. miRNAdisease association prediction with collaborative matrix factorization [J]. Complexity. 2017;2017:1–9.
 73.
Yuan L, Yuan CA, Huang DS. FAACOSE: a fast adaptive ant colony optimization algorithm for detecting SNP epistasis [J]. Complexity. 2017;2017:1–10.
 74.
Zhang H, Zhu L, Huang DS. DiscMLA: an efficient discriminative motif learning algorithm over highthroughput datasets [J]. IEEE/ACM Trans Comput Biol Bioinform. 2018;15:1810–20.
Acknowledgements
The authors would like to thank the reviewers and editors for their patient guidance and valuable suggestions.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 25, 2019: Proceedings of the 2018 International Conference on Intelligent Computing (ICIC 2018) and Intelligent Computing and Biomedical Informatics (ICBI) 2018 conference: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume20supplement25.
Funding
Publication costs are funded by grants from National Natural Science Foundation of China (11661081), Natural Science Foundation of Yunnan Province (2017FA032) and Training Plan for Young and Middleaged Academic Leaders of Yunnan Province (2018HB031).
Author information
Affiliations
Contributions
Wang SF designed the research and Wang XH designed the experiments. Wang XH performed most of the numerical experiments. Wang SF and Wang XH analyzed the experimental results and wrote this paper. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Wang, S., Wang, X. Prediction of protein structural classes by different feature expressions based on 2D wavelet denoising and fusion. BMC Bioinformatics 20, 701 (2019). https://doi.org/10.1186/s1285901932765
Published:
Keywords
 Prediction of protein structural classes
 Different feature expressions
 Parallel 2D wavelet denoising
 Fusion