Comparison study on statistical features of predicted secondary structures for protein structural class prediction: From content to position

Background Many content-based statistical features of secondary structural elements (CBF-PSSEs) have been proposed and have achieved promising results in protein structural class prediction, but until now the position distribution of the successive occurrences of an element in predicted secondary structure sequences has not been used. It is therefore necessary to extract appropriate position-based features of the secondary structural elements for the prediction task. Results We proposed position-based features of predicted secondary structural elements (PBF-PSSEs) and assessed their intrinsic ability relative to the available CBF-PSSEs. This offers a systematic and quantitative experimental assessment of these statistical features and naturally complements the available comparison of the CBF-PSSEs. We also analyzed the performance of the CBF-PSSEs combined with the PBF-PSSE and further constructed a new combined feature set, PBF11CBF-PSSE. Based on these experiments, new guidelines for the use of PBF-PSSEs and CBF-PSSEs were obtained. Conclusions PBF-PSSEs and CBF-PSSEs both have a clear impact on protein structural class prediction. When combined with the PBF-PSSE, most of the CBF-PSSEs show a marked improvement in prediction accuracy, so the PBF-PSSEs and the CBF-PSSEs should be used together to make significant and complementary contributions to protein structural class prediction. In addition, the performance of the proposed PBF-PSSE is extremely sensitive to the choice of parameter k. In summary, our quantitative analysis verifies that exploring the position information of predicted secondary structural elements is a promising way to improve protein structural class prediction.


Background
The functions of proteins are commonly believed to be determined by their unique 3-dimensional structures, which are in turn determined by the exact spatial position of each atom [1]. In 1976, Levitt and Chothia studied the polypeptide chain topologies in a dataset of 31 globular proteins and proposed the concept of protein structural classes [2]. Proteins can first be classified into several structural folding classes, based on the type, amount, and spatial arrangement of their amino acid residues into potential secondary structure elements. SCOP (Structural Classification of Proteins) [3,4] and CATH (Class, Architecture, Topology and Homologous superfamily) [5,6] are two excellent protein structure databases that provide hierarchical structural classifications of proteins. The former database relies on a manual process to classify the structures, while the latter applies a combination of automated and manual procedures. There are 110,800 protein domains with known structural classes in the SCOP database, and about 90% of them belong to the four major classes: all-α, all-β, α+β and α/β [3,4]. The two former classes include structures dominated by α-helices and β-strands, respectively. The two latter classes correspond to structures that include both helices and strands: in the α+β class these secondary structures are segregated, whereas in the α/β class they are interspersed.
The structural class has become one of the most important features for characterizing the overall folding type of a protein and has played an important role in protein function analysis, prediction of protein folding rates, prediction of DNA-binding sites, protein fold recognition, reduction of the conformation search space, and implementation of a heuristic approach to find tertiary structure [7][8][9][10][11][12]. Due to the exponential growth of the number of known protein sequences, the time and cost of experimental screening methods for finding the 3-dimensional structure would become even more unbearable. Fast computational methods that predict at least some important characteristics of protein structures would help to speed up protein annotation and reduce its cost. Therefore, computational methods are actively pursued to overcome the limitations of experimental screening methods.
Due to the importance of protein structural class prediction, significant efforts have been devoted to this problem during the past 30 years, aiming to find a prediction model that automatically determines the structural class based on the protein sequences and predicted secondary structures [9,13-15]. Previous studies have shown that the protein structural class is strongly correlated with the amino acid (AA) sequence, and that the protein structural class can be predicted based on sequence-based features (SEFs) that are directly computed from AA sequences, such as the frequency of each AA in given proteins. These simple features are typically efficient, but they ignore the sequential order of AAs and the relationships among distant AAs. To overcome these problems, high order SEFs have been proposed, such as composition of short polypeptides [16,17], pseudo AA composition [18], collocation of AA, function domain composition [19], and position specific scoring matrix profiles computed by the position specific iterative basic local alignment search tool (PSI-BLAST) [20]. However, these methods appear to be less effective on low-homology datasets whose average pair-wise sequence identities are less than 40%. For instance, the reported overall accuracy for the widely used dataset 25PDB, whose sequence homology is about 25%, was only about 60% [21,22].
In order to improve the prediction accuracy for low-similarity proteins, several new features of predicted secondary structures have been proposed [23][24][25][26][27]. For convenience, we refer to them as structure-based features (STFs). They exploit the fact that proteins with low sequence similarity but in the same structural class are likely to have high similarity in their corresponding secondary structure elements. Taking this fact into account, Kurgan et al. computed the content of predicted secondary structural elements (content_SE), normalized count of segments (NCount), length of the longest segment (MaxSeg), normalized length of the longest segment (NMaxSeg), average length of the segment (AvgSeg), and normalized average length of the segment (NAvgSeg) based on the predicted secondary structures in protein structural class prediction [23]. Zheng and Kurgan counted the 3PATTERN of the predicted secondary structures to improve the prediction of β-turns [24]. In MODAS, the predicted secondary structure information is employed together with evolutionary profiles to perform the prediction [25]. In 2010, Liu and Jia found that α-helices and β-strands alternate more frequently in α/β proteins than in α+β proteins, and counted their alternating frequency as well as the content of parallel β-sheets and anti-parallel β-sheets [26]. Zhang et al. computed the transition probability matrix (TPM) of the reduced predicted secondary structural sequences and added it to protein structural class prediction [27]. With the help of these STFs, the prediction accuracy has been improved significantly, to between 80% and 85% on several low-similarity benchmark datasets.
Despite the success of these STFs, they still focus mostly on the content of predicted secondary structure elements, and therefore sometimes miss the useful position-based information of elements in predicted secondary structures. The main goal of our research is to explore a potential way to capture the position information of predicted secondary structures and improve the prediction accuracy for such low-similarity datasets. In particular, we focus our investigation on the performance of the position-based features of the predicted secondary structure elements (PBF-PSSE) by comparing or combining them with the content-based features of the predicted secondary structure elements (CBF-PSSE) in protein structural class prediction. The major content of this paper includes the following:
1. We presented a scheme to describe the position of the predicted secondary structure elements and analyzed their distribution in the all-α, all-β, α+β and α/β classes.
2. In order to numerically characterize the position information of secondary structures, we regarded the distance between two successive occurrences of an element as a variable and calculated its coefficient of variation. This approach appears to be sensitive to the order of the structure elements because it is based on all the distances between two successive occurrences of the elements.
3. We implemented a multi-class support vector machine (SVM) to predict protein structural class using PBF-PSSE, CBF-PSSE and both on four different benchmark datasets.
Through a comprehensive comparison, we wanted to address the following questions with the aid of well-known statistical indexes: (A) how well the PBF-PSSE performs compared with the available CBF-PSSEs; (B) whether the CBF-PSSEs achieve a marked improvement in prediction accuracy when combined with the PBF-PSSE; (C) how well the proposed combined feature set, PBF11CBF-PSSE, performs in comparison with the available competing methods; (D) whether the PBF-PSSE's ability depends on the maximal interval distance k.

Datasets
In order to facilitate comparison with previous studies, we selected four widely used low-homology benchmark datasets in which any pair of sequences shares twilight-zone similarity [22][23][24][25][26][27]. This means that any test sequence shares twilight-zone identity with any sequence in the training set used to generate the proposed classification model. The first dataset, referred to as 25PDB, was selected using the 25% PDBSELECT list [28], which includes proteins from PDB that were scanned with high resolution and that have low identity, on average about 25%. The dataset was originally published in [22] and was used to benchmark two structural class prediction methods [29,30]. It contains 1673 proteins and domains. The second dataset, referred to as 1189, was downloaded from the RCSB Protein Data Bank using the PDB IDs listed in [22]. It contains 1092 proteins with 40% sequence identity. The third dataset, referred to as 640, was first studied in Chen et al. (2008) [20]. It contains 640 proteins with 25% sequence identity, and their classification labels were retrieved from the SCOP database [4]. The final dataset, named FC699, includes 858 sequences that share low (40%) identity with each other. More details are presented in Table 1.

Protein secondary structure prediction
Every amino acid in a protein sequence can be assigned to one of the three secondary structural elements H (helix), E (strand), and C (coil). This problem is known as protein secondary structure prediction, and many computational approaches have been developed in the past decades to predict the 3-state secondary structure from protein sequences. In this study, PSIPRED [31] was chosen to predict protein secondary structure because it outperforms other competing prediction methods [32,33]. For example, to obtain the predicted secondary structure of protein 1PET, whose amino acid sequence is DSITYRVRKGDSLSSIAKRHGVNIKDVMRWNSDTANLQPGDKLTLFVK, one can submit the sequence to PSIPRED and obtain a predicted secondary structure such as CCEEEEECCCCCHHHHHHHHCCCCCCCCCCCCCCEEEEEEC. The available structure-based predictions take the predicted secondary structure sequence as input, but they are not tied to any specific tool for secondary structure prediction. Any improvement in secondary structure prediction would generally lead to a more accurate structure-based protein structural class prediction method [34][35][36].
Content-based features of predicted secondary structure elements (CBF-PSSE)
Prediction methods that use the protein SEFs achieve promising results in protein structural class prediction, but their accuracy is limited. Some studies indicate that the contents and spatial arrangements of secondary structural elements are also significant factors that influence the intricate functions and structures of proteins [23][24][25][26][27], so various CBF-PSSEs have been proposed, such as the content of the predicted secondary structure elements or segments. Since this paper focuses on a comparison study of statistical features of predicted secondary structures, we first reviewed the available CBF-PSSEs with better performance in protein structural class prediction.
1. Predicted secondary structure elements' content (content_SE)
Predicted secondary structure elements' content, denoted by content_SE, is one of the most widely used CBF-PSSEs [23,25-27]. It can be calculated by taking a sliding window and scanning through the predicted secondary structure sequence:
$$\mathrm{content}_{SE} = \frac{\mathrm{Count}_{SE}}{N}$$
where Count_SE is the total number of occurrences of the predicted secondary structure element SE, SE ∈ {C, H, E}, and N is the length of the predicted secondary structure sequence. H, E and C denote α-helix, β-strand and coil, respectively.
2. First and second order composition moment vector (CMV) [23,25-27]
The CMV is another important CBF-PSSE. It is computed from PO_SEj, the jth position of the predicted secondary structure element SE; N, the length of the predicted secondary structure sequence; and k, the order of the composition moment vector.
3. Length of the longest segment (MaxSeg_SE) [23,25-27]
There are many different arrangements of α-helices and β-strands among the four main classes. In order to distinguish these arrangements, the longest segment, the average length of the segments and their normalized forms have been proposed. The length of the longest segment is
$$\mathrm{MaxSeg}_{SE} = \mathrm{MaxLen}(\mathrm{SEG}_{SE})$$
where MaxLen is the maximal function of segment length, and SEG_SE denotes the segments composed of structure element SE.
4. Normalized length of the longest segment (NMaxSeg_SE)
$$\mathrm{NMaxSeg}_{SE} = \frac{\mathrm{MaxSeg}_{SE}}{N}$$
where N is the length of the predicted secondary structure sequence.
5. Average length of the segment (AvgSeg_SE) [23,25-27]
$$\mathrm{AvgSeg}_{SE} = \frac{\sum \mathrm{Len}(\mathrm{SEG}_{SE})}{\mathrm{Content}_{\mathrm{SEG}_{SE}}}$$
where Len is the function of segment length, and Content_SEG_SE denotes the total number of appearances of SEG_SE.
6. Normalized average length of the segment (NAvgSeg_SE) [23,25-27]
$$\mathrm{NAvgSeg}_{SE} = \frac{\mathrm{AvgSeg}_{SE}}{N}$$
where N is the length of the predicted secondary structure sequence.
7. 3PATTERN
Zheng and Kurgan proposed the 3PATTERN method and enhanced the prediction accuracy of β-turns to over 80% based on the predicted secondary structure sequences [24]. 3PATTERN_{m,k} denotes a specific configuration of the secondary structure for the central and the two adjacent residues, where m is the pattern type. For m = 1 and k = C, the secondary structure prediction would be CCC, and for m = 2, 3, and 4 the prediction would be CCx, xCC, and xCx, respectively, where x ∈ {E, H}. These patterns encode whether the central (predicted) residue is located inside a secondary structure segment or at the interface between two segments.
8. Alternating frequency of α-helices and β-strands and proportion of parallel β-sheets and anti-parallel β-sheets (APPA)
In 2010, Liu and Jia found that the α-helices and the β-strands alternate more frequently in α/β proteins than in α+β proteins, so they counted the alternating frequency as well as the content of the parallel β-sheets and the anti-parallel β-sheets [26]. The normalized alternating frequency of the α-helices and the β-strands (Altn/N) is defined as follows:
$$\mathrm{Altn}/N = \frac{\mathrm{Content}_{\alpha\text{-}\beta}}{\mathrm{SeqLen}}$$
where Content_α-β is the total number of alternations of α-helices and β-strands, and SeqLen is the length of the predicted secondary structure sequence.
9. The transition probability matrix of the reduced segment sequence (TPM)
In 2010, Zhang et al. ignored coil segments and transformed a secondary structure sequence into a segment sequence that is composed only of helix segments and strand segments [27]. They defined the transition probability matrix (TPM) of the reduced segment sequence over the state space {α, β}, where a_i represents the ith element of the state space and Content_{a_i a_j} is the total number of times that a_i is followed by a_j in the segment sequence; the entries of the TPM are computed from these counts.
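The content and segment-length features reviewed above can be illustrated with a minimal Python sketch. It assumes the features are computed directly from a 3-state string over H, E and C, and the function names `segments` and `cbf_features` are hypothetical, introduced only for this illustration; the CMV, 3PATTERN, APPA and TPM features are not covered.

```python
from itertools import groupby


def segments(ss):
    """Split a secondary structure string into (element, run_length) pairs."""
    return [(element, len(list(run))) for element, run in groupby(ss)]


def cbf_features(ss, elements="HEC"):
    """Content and segment-length features for each structure element.

    Assumes `ss` is a non-empty string over the alphabet H, E, C.
    """
    if not ss:
        raise ValueError("secondary structure sequence must not be empty")
    n = len(ss)
    runs = segments(ss)
    features = {}
    for se in elements:
        lengths = [length for element, length in runs if element == se]
        max_seg = max(lengths, default=0)
        avg_seg = sum(lengths) / len(lengths) if lengths else 0.0
        features[se] = {
            "content": ss.count(se) / n,   # Count_SE / N
            "MaxSeg": max_seg,             # length of the longest segment
            "NMaxSeg": max_seg / n,        # MaxSeg normalized by N
            "AvgSeg": avg_seg,             # mean segment length
            "NAvgSeg": avg_seg / n,        # AvgSeg normalized by N
        }
    return features


example = cbf_features("CCEEEEECCCCCHHHHHHH")
```

For this 19-residue example the coil element forms two segments of lengths 2 and 5, so `example["C"]["AvgSeg"]` is 3.5 and `example["H"]["MaxSeg"]` is 7.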

Representation of the secondary structure elements' position
The above CBF-PSSEs focus mainly on the content of predicted secondary structure elements, and therefore they ignore the useful position distribution of elements in predicted secondary structures. For example, given a predicted secondary structure sequence CCEEEEECCCCCHHHHHHH, if we move its last seven elements HHHHHHH to the third position of the structure sequence, we get another secondary structure sequence, CCHHHHHHHEEEEECCCCC, in which the elements' positions differ but the elements' content does not change. So when assigning the protein structural classes, the secondary structure elements' position should be considered as another deciding factor. Instead of counting the occurrences of distinct helix, strand and coil segments, this paper analyzed the distribution of the successive occurrences of a predicted secondary structure element. To find all occurrences of an element δ in the predicted secondary structure sequence s, the random indicator φ_i(δ) is defined as follows:
$$\varphi_i(\delta) = \begin{cases} 1, & s_i = \delta \\ 0, & \text{otherwise} \end{cases}$$
where s_i is the ith element of s. With the help of the random indicator, we transformed a predicted secondary structure sequence into three position sequences. After removing zeros from the position sequences, we obtained three numerical sequences denoted as Po(δ). Taking the sequence s = CCHHHHEEEEECCCCCHHH as an example, its numerical sequences Po(C), Po(H) and Po(E) are:
Po(C) = (1, 2, 12, 13, 14, 15, 16)
Po(H) = (3, 4, 5, 6, 17, 18, 19)
Po(E) = (7, 8, 9, 10, 11)
From the numerical sequence Po(δ), it is easy to deduce whether two successive occurrences of the element δ belong to the same helix (strand or coil) or not. If the interval distance between two successive occurrences of the element δ, referred to as Dis(δ), is equal to 1, they belong to the same helix (strand or coil); otherwise they belong to different helices (strands or coils). Based on the numerical sequence Po(δ), we computed the interval distances between two successive occurrences of the element δ and obtained a novel numerical characteristic sequence denoted by N(δ).
For the above position sequences, the numerical characteristic sequences N(δ) are:
N(C) = (1, 10, 1, 1, 1, 1)
N(H) = (1, 1, 1, 11, 1, 1)
N(E) = (1, 1, 1, 1)
These numerical sequences N(δ) not only indicate the structure elements' content, but also reflect the distribution of the interval distances between their consecutive occurrences.
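The position sequence Po(δ) and the interval sequence N(δ) described above can be sketched in a few lines of Python. The helper names `positions` and `intervals` are hypothetical, and positions are assumed to be 1-based.

```python
def positions(ss, element):
    """Po(element): 1-based positions at which `element` occurs in `ss`."""
    return [i for i, s in enumerate(ss, start=1) if s == element]


def intervals(ss, element):
    """N(element): distances between successive occurrences of `element`."""
    po = positions(ss, element)
    return [later - earlier for earlier, later in zip(po, po[1:])]


s = "CCHHHHEEEEECCCCCHHH"
po_h = positions(s, "H")
n_h = intervals(s, "H")

# Two sequences with identical element content but different element order.
a = "CCEEEEECCCCCHHHHHHH"
b = "CCHHHHHHHEEEEECCCCC"
same_content = sorted(a) == sorted(b)
different_positions = intervals(a, "C") != intervals(b, "C")
```

In the rearrangement example the element counts of `a` and `b` are identical, while the coil intervals differ because the gap between the two coil segments changes from 6 to 13.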
Position-based feature of predicted secondary structure elements (PBF-PSSE)
Given a structure element δ, we can transform a predicted secondary structure sequence into a numerical characteristic sequence N(δ) that provides a new profile of the correlation structure of the given structure sequence. Here, we chose the 25PDB dataset, which includes 443 all-α, 443 all-β, 346 α/β, and 441 α+β proteins. Using the random indicator φ(H), we obtained 1673 numerical characteristic sequences N(H) and calculated the count of the interval distance Dis(H) for the all-α, all-β, α/β and α+β classes, which is shown in Figure 1. More than 80% of the values of Dis(H) are equal to 1 in all four classes, and each of the remaining values is rare. Figure 2 shows the distribution of Dis(H) > 1 more clearly because Dis(H) = 1 has been omitted. Taking a closer look at Figure 2, we found that the count of Dis(H) > 1 in the all-α class is larger than in the other classes, which is consistent with the fact that the all-α class is dominated by α-helices. Also, the distribution of Dis(H) > 1 is more concentrated in the α/β class and the α+β class than in the all-β class.
Since Dis(δ) varies with different predicted secondary structure sequences, it can be regarded as a discrete random variable. Given the random variable Dis(δ) and a positive integer n, P(Dis(δ)=n) is the probability that Dis(δ) takes the value n. The collection of pairs (Dis(δ)=n, P(Dis(δ)=n)), for all positive integers n, is the probability distribution of Dis(δ), which is listed in Table 2.
Based on the above distribution, we calculated two numerical characteristics, the semi-mean Semi-E^(k)(δ) and the semi-variance Semi-D^(k)(δ), defined by:
$$\text{Semi-}E^{(k)}(\delta) = \sum_{n=1}^{k} n\,P(\mathrm{Dis}(\delta)=n), \qquad \text{Semi-}D^{(k)}(\delta) = \sum_{n=1}^{k} \bigl(n - \text{Semi-}E^{(k)}(\delta)\bigr)^{2}\,P(\mathrm{Dis}(\delta)=n)$$
Here, Semi-E^(k)(δ) and Semi-D^(k)(δ) are not the mean and variance because we only summed the first k values rather than all the values. The PBF-PSSE C^(k)(δ) is then defined as the ratio of the standard Semi-D^(k)(δ) to Semi-E^(k)(δ). C^(k)(δ) is the reciprocal of the coefficient of variation, which shows the extent of variability in relation to the mean of the population. For convenience of comparison, we denoted C^(k)(δ) based on all the values as C^(F)(δ).
In probability theory and statistics, the coefficient of variation is a normalized measure of the dispersion of a probability distribution. It is also known as unitized risk or the variation coefficient, and it is common in applied probability fields such as renewal theory, queuing theory, and reliability theory. The coefficient of variation is useful because the standard deviation of data must always be understood in the context of the mean of the data. Moreover, the value of the coefficient of variation is independent of the unit in which the measurement has been taken, so it is a dimensionless number. For comparison between datasets with different units or widely different means, one should use the coefficient of variation instead of the standard deviation. Here, C^(k)(δ) is used to describe the position distribution of predicted secondary structure elements.
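As a rough illustration of the semi-mean, semi-variance and their ratio, the following Python sketch estimates P(Dis(δ)=n) from a list of observed interval distances and truncates the sums at k. Several points are assumptions rather than statements from the text: "the first k values" is read as the interval distances n ≤ k, the semi-variance is centred on the semi-mean, and, because the text describes C^(k)(δ) both as a ratio of the two quantities and as a reciprocal of the coefficient of variation, the sketch returns the ratio in both directions. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def interval_distribution(distances):
    """Empirical probability P(Dis = n) for each observed distance n."""
    counts = Counter(distances)
    total = len(distances)
    return {n: c / total for n, c in counts.items()}


def semi_moments(dist, k=None):
    """Semi-mean and semi-variance over distances n <= k (all n if k is None)."""
    kept = {n: p for n, p in dist.items() if k is None or n <= k}
    semi_e = sum(n * p for n, p in kept.items())
    semi_d = sum((n - semi_e) ** 2 * p for n, p in kept.items())
    return semi_e, semi_d


def variation_ratios(dist, k=None):
    """Return (sqrt(Semi-D) / Semi-E, Semi-E / sqrt(Semi-D)).

    Assumption: which direction defines C^(k) is not settled here, so both
    are returned; the second value is None when Semi-D is zero.
    """
    semi_e, semi_d = semi_moments(dist, k)
    if semi_e == 0:
        raise ValueError("semi-mean is zero; no distances within the cutoff")
    sd = sqrt(semi_d)
    return sd / semi_e, (semi_e / sd if sd > 0 else None)


dist_h = interval_distribution([1, 1, 1, 11, 1, 1])
full_mean, full_var = semi_moments(dist_h)        # all distances
cut_mean, cut_var = semi_moments(dist_h, k=5)     # only distances <= 5
```

With the example distances, five of six values equal 1, so truncating at k = 5 drops the single distance of 11 and the semi-mean falls from 8/3 to 5/6, which shows how strongly the feature can depend on k.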

Prediction assessment
In this paper, we adopted Vapnik's support vector machine to predict the protein structural class [37]. The support vector machine is a type of learning machine based on statistical learning theory. Since there are four structural classes, we chose the multi-class prediction method for protein structural class prediction. Given a test protein of unknown category, the SVM first maps the input vectors into a feature space (possibly of higher dimension). Within that space, it then finds an optimized linear division to solve the two-class or multi-class problem [38]. Finally, a prediction label is assigned to the test sample according to this division. A more detailed description of the SVM can be found in Vapnik's book [37].
Among the three kinds of cross-validation methods (single-test-set analysis, sub-sampling and jackknife analysis), the jackknife test is regarded as the most effective one [39]. Here, we used it to evaluate the performance of the proposed method. We also considered standard performance measures over structural classes, namely the accuracy for class C_j and the overall accuracy, defined as the fraction of the proteins in class C_j, or of all the proteins tested, that are classified correctly.
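$$\mathrm{Accuracy}(C_j) = \frac{TP_j}{|C_j|}, \qquad \text{Overall accuracy} = \frac{\sum_j TP_j}{\sum_j |C_j|}$$
where TP_j is the number of true positives for class C_j, and |C_j| is the number of proteins in structural class C_j (all-α, all-β, α/β and α+β classes).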
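The jackknife test and the two accuracy measures can be sketched as follows. This is only an illustration under stated assumptions: the classifier is passed in as a callable, and the 1-nearest-neighbour rule used in the example is a stand-in chosen to keep the sketch self-contained, not the SVM used in this study. All names are hypothetical.

```python
from collections import Counter


def jackknife_predictions(samples, labels, train_and_predict):
    """Leave each sample out in turn, train on the rest, predict the left-out one."""
    samples, labels = list(samples), list(labels)
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(train_and_predict(train_x, train_y, samples[i]))
    return predictions


def class_accuracies(true_labels, predicted_labels):
    """Per-class accuracy TP_j / |C_j| and overall accuracy sum(TP_j) / sum(|C_j|)."""
    class_sizes = Counter(true_labels)
    true_positives = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    per_class = {c: true_positives[c] / size for c, size in class_sizes.items()}
    overall = sum(true_positives.values()) / len(true_labels)
    return per_class, overall


def nearest_neighbour(train_x, train_y, x):
    """Toy stand-in classifier on 1-D features (assumption, not the paper's SVM)."""
    best = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[best]


features = [0.0, 0.1, 1.0, 1.1]
classes = ["a", "a", "b", "b"]
predicted = jackknife_predictions(features, classes, nearest_neighbour)
per_class, overall = class_accuracies(classes, predicted)
```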

Selection of parameters C and gamma
We selected the Gaussian as the kernel function for the SVM because of its superiority in solving nonlinear problems compared with other kernel functions [40]. The parameters were selected to obtain the highest possible overall prediction accuracy. A simple grid search over C and gamma values, based on 10-fold cross-validation for each dataset, was used, where C and gamma were allowed to take values only between 2^-5 and 2^5.
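A grid search of this kind can be sketched in plain Python. The sketch assumes that the candidate values are the integer powers of two from 2^-5 to 2^5, which the text does not state, and it takes the scoring function as a caller-supplied callable; in practice that callable would run the 10-fold cross-validation of the SVM, which is not reproduced here. The names `power_of_two_grid`, `grid_search` and `toy_score` are hypothetical.

```python
from itertools import product
from math import log2


def power_of_two_grid(low_exp=-5, high_exp=5):
    """Candidate values 2**low_exp ... 2**high_exp (integer exponents assumed)."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1)]


def grid_search(score, c_values, gamma_values):
    """Return (best_score, C, gamma) over all (C, gamma) pairs.

    `score(C, gamma)` is supplied by the caller, for example the mean
    cross-validated accuracy of a classifier trained with those parameters.
    """
    best = None
    for c, gamma in product(c_values, gamma_values):
        s = score(c, gamma)
        if best is None or s > best[0]:
            best = (s, c, gamma)
    return best


def toy_score(c, gamma):
    """Artificial score peaking at C = 2 and gamma = 0.25 (illustration only)."""
    return -((log2(c) - 1) ** 2) - (log2(gamma) + 2) ** 2


grid = power_of_two_grid()
best_score, best_c, best_gamma = grid_search(toy_score, grid, grid)
```

Ties are resolved in favour of the first pair encountered, which is a design choice of this sketch.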

Results and discussion
This section includes a discussion of the selected features, the experimental results, and a comparison of the PBF-PSSE, the CBF-PSSEs, and the proposed combined feature set on four benchmark datasets. In the first step, we used PSIPRED to predict the secondary structures of the proteins. Then, the proposed representation was employed to represent each predicted secondary structure as three numerical sequences, from which we calculated the PBF-PSSE, a 3-feature set. Finally, the PBF-PSSE, the CBF-PSSEs and the proposed combined feature set were each fed into the support vector machine to predict the protein structural class. We reported the overall accuracy and the accuracy for each structural class.

Prediction accuracy of PBF-PSSE C^(F)(δ) for four benchmark datasets
Four widely used datasets with low sequence identity were used in this study: 25PDB, which comprises 1673 proteins of about 25% sequence identity; 640, which includes 640 proteins of about 25% sequence identity; FC699, with 858 proteins of about 40% sequence identity; and 1189, which contains 1092 proteins of about 40% sequence identity. The results obtained by the PBF-PSSE C^(F)(δ) are shown in Table 3. The overall accuracies obtained by the PBF-PSSE C^(F)(δ) are 75.25%, 79.8%, 85.7% and 78.4% for the 25PDB, 640, FC699 and 1189 datasets, respectively.
Among the four structural classes, α+β is the hardest to predict. Its average accuracy is typically about 5-10% lower than that of the other three structural classes [22]. With the PBF-PSSE C^(F)(δ), however, the average accuracy for the α+β class is 81.76%, which is 0.63-20.21% higher than for the other three structural classes. These results indicate that the PBF-PSSE C^(F)(δ) is well suited to characterizing the distribution of helices and strands.

Comparison between PBF-PSSE C^(F)(δ) and CBF-PSSEs
PBF-PSSE C^(F)(δ) targets the structure elements' position distribution among the all-α, all-β, α/β and α+β classes. For a better understanding of the PBF-PSSE C^(F)(δ), we compared it with nine available CBF-PSSEs on the same datasets. We used the accuracy of each class and the overall accuracy as evaluation measures, which are summarized in Table 3.
In the 25PDB experiment, PBF-PSSE C^(F)(δ) performs better than all CBF-PSSEs, with an overall accuracy of 75.25%. Among the CBF-PSSEs, content_SE is significantly better than all the others, and the next best CBF-PSSE is NMaxSeg_SE. In the 640 experiment, the PBF-PSSE C^(F)(δ) achieves the highest overall prediction accuracy among the PBF-PSSE and all the CBF-PSSEs. Among the CBF-PSSEs, content_SE is again the best, and the next best one is NMaxSeg_SE. In the FC699 experiment, two CBF-PSSEs, content_SE and AvgSeg_SE, outperform the PBF-PSSE C^(F)(δ). As for the dataset 1189, the PBF-PSSE C^(F)(δ) is better than all the CBF-PSSEs, with an overall accuracy of 78.39%. The next best one is content_SE, and the other features lag behind. As for the α+β class, the accuracies of the PBF-PSSE C^(F)(δ) for datasets 25PDB, 640, FC699 and 1189 are 78.00%, 78.95%, 80.49% and 68.46%, which are 7.02%, 14.62%, 29.27% and 12.44% higher than those of the best-performing CBF-PSSEs, respectively.
From the above experiments, we can see that both the PBF-PSSE C^(F)(δ) and the CBF-PSSEs make their own positive contributions to the predictions. The PBF-PSSE C^(F)(δ) performs better than the CBF-PSSEs in three of the four experiments, especially for α+β class prediction. Among the CBF-PSSEs, content_SE achieves the best performance.

Performance of the CBF-PSSE combined with the PBF-PSSE C^(F)(δ)
PBF-PSSE and CBF-PSSEs are the two most important kinds of feature sets of predicted secondary structures for protein structural class prediction. When the features are used individually, the resulting overall prediction accuracy for all four datasets is well above 25%. This indicates that these predictions are unlikely to be random, since random assignment of four protein classes generally leads to an accuracy of about 25%. In other words, every feature subset makes its own positive contribution to the predictions.
The difference between the PBF-PSSE and the CBF-PSSEs is that position information is considered in the former, while content information is explored in the latter. For a better understanding of the PBF-PSSE C^(F)(δ), we combined it with the CBF-PSSEs to form new combined feature sets. Through these experiments, we wanted to address how well the CBF-PSSEs perform when combined with the PBF-PSSE C^(F)(δ). Table 4 lists the prediction accuracy obtained with the CBF-PSSEs combined with the PBF-PSSE C^(F)(δ). From Table 3, we note that the PBF-PSSE C^(F)(δ) provides an overall prediction accuracy that is only comparable to that of the CBF-PSSE content_SE, and it even gives a lower accuracy (85.66% vs. 88.46%) for the dataset FC699. But when combined with the CBF-PSSE content_SE, the prediction accuracy of the PBF-PSSE C^(F)(δ) is improved by about 9.0%. Specifically, there are accuracy improvements of 29.94%, 5.94%, 9.21%, and 6.04% for the datasets 25PDB, 640, FC699 and 1189, respectively. Table 4 shows that the prediction abilities of all the CBF-PSSEs are improved by combining them with PBF-PSSE C^(F)(δ), except for MaxSeg_SE and 3PATTERN. The combined feature sets are about 4.43%-48.28% more accurate than the prediction methods based solely on the CBF-PSSEs.
For comparison purposes, the CBF-PSSEs combined with the CBF-PSSE content_SE were also tested. We chose the CBF-PSSE content_SE because it is one of the most efficient CBF-PSSEs and is often combined with predicted secondary structures or protein sequences [23][24][25][26][27]. The comparison of the CBF-PSSEs combined with the PBF-PSSE C^(F)(δ) and with the CBF-PSSE content_SE is presented in Figure 3, and more details can be found in Additional file 1: Table S1.
As would be expected, the prediction accuracy of the different combined feature sets shows two clear trends: (i) when combined with the PBF-PSSE C^(F)(δ) or the CBF-PSSE content_SE, the prediction abilities of all the CBF-PSSEs are improved except for MaxSeg_SE and 3PATTERN; (ii) high prediction accuracy can be achieved by a CBF-PSSE combined with the PBF-PSSE C^(F)(δ). These experiments further demonstrate that the PBF-PSSE C^(F)(δ) plays an important role in the recognition of protein structural classes and can be used to improve the prediction accuracy. The PBF-PSSE and the CBF-PSSEs should be used together to make significant and complementary contributions to protein structural class prediction.

Comparison of the proposed PBF11CBF-PSSE with the competing predictions based on the predicted secondary structures
The above experiments show that the PBF-PSSE and the CBF-PSSE make significant and complementary contributions to protein structural class prediction, so this paper proposed a new combined feature set, denoted by PBF11CBF-PSSE, that consists of the PBF-PSSE C^(F)(δ) and the widely used 11-dimension CBF-PSSE set. Table 5 presents the accuracy of the proposed PBF11CBF-PSSE. To evaluate the efficiency of the PBF11CBF-PSSE, we compared it with the competing prediction methods on the same datasets. Since PBF11CBF-PSSE was constructed based on the information of the predicted secondary structure, the evaluated prediction methods should be based on predicted secondary structures [42]. Table 5 lists the accuracy of each class and the overall accuracy of all the evaluated prediction methods. For the 25PDB dataset, the proposed PBF11CBF-PSSE outperforms all other methods. Only two methods provide an overall accuracy over 84%: PBF11CBF-PSSE and the method proposed by Ding et al. [42]. The overall accuracy of PBF11CBF-PSSE is 86.25%, which is 1.91% higher than that of Ding's method [42]. The results for the 640, FC699 and 1189 datasets, shown in Table 5, are consistent with the results on the 25PDB dataset: the overall accuracies of PBF11CBF-PSSE are 2.97%, 5.39% and 1.51% higher than those of the existing best-performing method, respectively. We attribute the higher overall accuracy to the PBF-PSSE C^(F)(δ) involved in the PBF11CBF-PSSE.
Table 5. The accuracy of each class and overall accuracy of the proposed PBF11CBF-PSSE for four datasets, and comparison with the competing prediction methods based on predicted protein secondary structures.
In addition, we further compared the results of the proposed PBF11CBF-PSSE with two popular methods, MODAS [12] and SCPRED [23], in which the predicted secondary structure information was combined with evolutionary profiles or protein sequences to predict the protein structural classes. The overall accuracies yielded by MODAS for datasets 25PDB and 1189 are 81.4% and 83.5%, which are 4.85% and 1.21% lower than those of the proposed PBF11CBF-PSSE. As for the SCPRED method, its overall accuracies for datasets 25PDB and FC699 are 79.7% and 87.5%, which are 6.55% and 7.49% lower than those of the proposed PBF11CBF-PSSE. These results also suggest that the position information from the predicted secondary structures is a promising way to improve protein structural class prediction, because it is better suited to representing the structure elements' order information, certain local interactions, and the spatial arrangements of the α-helices and the β-strands.

Influence of parameter k in the PBF-PSSE C F (δ)
The PBF-PSSE C (k) (δ) is the reciprocal of the coefficient of variation, a measure of variability relative to the mean of the population. It describes the position distribution of predicted secondary structure elements and contributes to protein structural class prediction. Note, however, that C (k) (δ) depends heavily on the parameter k, the given interval distance.
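As a rough illustration of this quantity, the following Python sketch computes the interval distances between successive occurrences of an element and then the ratio of their mean to their standard deviation. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, the population standard deviation is used, and the role of k is approximated by keeping only distances not exceeding k, which may differ from the exact semi-mean and semi-variance definitions.

```python
from statistics import fmean, pstdev


def interval_distances(ss, element):
    """Distances between successive positions of `element` in the string `ss`."""
    positions = [i for i, s in enumerate(ss) if s == element]
    return [b - a for a, b in zip(positions, positions[1:])]


def reciprocal_cv(distances, k=None):
    """Mean divided by (population) standard deviation of the distances.

    ASSUMPTION: when k is given, only distances <= k are kept, as a stand-in
    for the semi-mean / semi-variance; k=None uses every distance.
    Returns 0.0 when fewer than two distances remain or the spread is zero
    (a convention chosen for this sketch only).
    """
    kept = [d for d in distances if k is None or d <= k]
    if len(kept) < 2:
        return 0.0
    sd = pstdev(kept)
    return fmean(kept) / sd if sd > 0 else 0.0


# Toy predicted secondary structure sequence (C = coil, E = strand, H = helix).
ss = "CCHHHHCCEEEECCHHHC"
d_h = interval_distances(ss, "H")   # [1, 1, 1, 9, 1, 1]
print(d_h, reciprocal_cv(d_h), reciprocal_cv(d_h, k=5))
```

In the toy sequence the single long gap between the two helices carries all of the spread, so dropping it with k=5 changes the value completely. This illustrates why the choice of k matters.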
Figures 1 and 2 show that more than 80% of the interval distances Dis(δ) are equal to 1, and that the proportions of the remaining distances are very small. To show this more clearly, Figure 4 plots the cumulative content of the interval distances Dis(δ) for the 25PDB, 640, FC699 and 1189 datasets. More details can be found in Additional file 2: Table S2. As expected, short interval distances (Dis(δ) < 5) dominate in all four datasets: the cumulative content of Dis(δ) < 5 is well above 0.85 for each of the structure elements C, E and H. The cumulative content of Dis(δ) increases from k=5 to k=30 for all four datasets. At Dis(δ) = 30, the cumulative contents of all Dis(δ) have risen to 0.96, most notably for Dis(C) and Dis(H). That is to say, almost all the Dis(δ) are less than 30.
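The cumulative content discussed here can be sketched as the fraction of interval distances that do not exceed a threshold. The snippet below is an illustrative assumption rather than the procedure used to produce Figure 4: the function name is hypothetical, the toy distances are invented for demonstration, and the comparison is non-strict (<=), whereas the text writes Dis(δ) < 5.

```python
def cumulative_content(distances, thresholds):
    """Map each threshold k to the fraction of distances that are <= k."""
    if not distances:
        return {k: 0.0 for k in thresholds}
    n = len(distances)
    return {k: sum(d <= k for d in distances) / n for k in thresholds}


# Toy data only: mostly distance 1, with a few longer gaps.
toy_distances = [1] * 17 + [2, 4, 30]
print(cumulative_content(toy_distances, (1, 5, 30)))
```

Because the thresholds are nested, the returned fractions can only increase with k.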

Conclusions
Prediction of structural classes for low-homology datasets not only reveals the overall folding type of a given protein sequence, but also helps in finding proteins that form similar folds in spite of low sequence similarity. Therefore, high-quality prediction would benefit in-silico prediction of the tertiary structure of proteins that have low sequence identity with the sequences used for prediction.
Numerous efficient methods have been proposed to predict protein structural classes for low-homology sequences, but challenges remain. In this paper, we aimed to develop a new method that improves prediction accuracy by capturing the position information of predicted secondary structures. To do so, we first proposed a representation of the structure element positions and analyzed the distance distribution of successive occurrences of an element, from which the semi-mean Semi-E (k) and semi-variance Semi-D (k) are calculated. Then, the reciprocal of the coefficient of variation was employed to construct the PBF-PSSE.
The main goal of our research is to investigate the importance of the PBF-PSSE and compare its performance with the CBF-PSSEs. The first contribution comes from the comparison with nine available CBF-PSSEs: we found that the PBF-PSSE is as important as the CBF-PSSEs, and that content SE are the most efficient CBF-PSSEs. The second contribution comes from the evaluation of the CBF-PSSEs combined with the PBF-PSSE: the prediction abilities of the CBF-PSSEs improve when they are combined with the PBF-PSSE C F (δ), except for MaxSeg SE and 3PATTERN. These results demonstrate that the PBF-PSSE and the CBF-PSSE have to work together to make significant and complementary contributions to protein structural class prediction. The third contribution comes from the performance of the proposed combined feature set PBF11CBF-PSSE and its comparison with competing prediction methods. Its overall accuracies on the 25PDB, 640, FC699 and 1189 datasets are 86.25%, 86.41%, 94.99% and 84.71%, which are 1.91, 2.97, 5.39 and 1.51 percentage points higher than the existing best-performing method. The improvement can be attributed to the introduction of the PBF-PSSE, which describes the collocation of helix and strand segments in the predicted secondary structures. The final contribution comes from the analysis of the influence of parameter k: C (k) (δ) performs differently for different values of k, while C 5 (δ) and C F (δ) perform almost identically. We can therefore calculate C 5 (δ) instead of C F (δ), which simplifies the calculation.
Overall, our comparison study highlights the need to extract as much position information from the predicted secondary structures as possible. This understanding can guide the development of more powerful methods for protein structural class prediction.