Skip to main content

Protein structure search and local structure characterization



Structural similarities among proteins can provide valuable insight into their functional mechanisms and relationships. As the number of available three-dimensional (3D) protein structures increases, a greater variety of studies can be conducted with increasing efficiency, among which is the design of protein structural alphabets. Structural alphabets allow us to characterize local structures of proteins and describe the global folding structure of a protein using a one-dimensional (1D) sequence. Thus, 1D sequences can be used to identify structural similarities among proteins using standard sequence alignment tools such as BLAST or FASTA.


We used self-organizing maps in combination with a minimum spanning tree algorithm to determine the optimum size of a structural alphabet and applied the k-means algorithm to group protein fragnts into clusters. The centroids of these clusters defined the structural alphabet. We also developed a flexible matrix training system to build a substitution matrix (TRISUM-169) for our alphabet. Based on FASTA and using TRISUM-169 as the substitution matrix, we developed the SA-FAST alignment tool. We compared the performance of SA-FAST with that of various search tools in database-scale search tasks and found that SA-FAST was highly competitive in all tests conducted. Further, we evaluated the performance of our structural alphabet in recognizing specific structural domains of EGF and EGF-like proteins. Our method successfully recovered more EGF sub-domains using our structural alphabet than when using other structural alphabets. SA-FAST can be found at


The goal of this project was two-fold. First, we wanted to introduce a modular design pipeline to those who have been working with structural alphabets. Secondly, we wanted to open the door to researchers who have done substantial work in biological sequences but have yet to enter the field of protein structure research. Our experiments showed that by transforming the structural representations from 3D to 1D, several 1D-based tools can be applied to structural analysis, including similarity searches and structural motif finding.


Genome sequencing projects continue to produce amino acid sequences; however, understanding the biological roles played by these putative proteins requires knowledge of their structure and function [1]. Despite that empirical structure determination methods have provided structural information for some proteins, computational methods are still required for the large number of proteins whose structures are difficult to determine experimentally. And while the primary sequence should contain the folding guide for a given protein, our ability to predict the three-dimensional (3D) structure from the primary sequence alone remains limited. Some ab initio methods do not require such information, but the application of these methods is often limited to small proteins [2, 3].

Structure alignment research has led to the discovery of homologues of novel protein structures. And, although many structure alignment tools have been developed, such as CE [4], DALI [5], VAST [6], MAMMOTH [7], FATCAT [8], and Vorolign [9], we wanted to provide a different perspective on protein structure analysis. Previous studies of protein structures have shown the importance of repetitive secondary structures, particularly α-helices and β-sheets, in overall structure determination. Together with variable coils, these structures constitute a basic three-letter structural alphabet that has been used in the development of early-generation secondary structure prediction algorithms (such as GOR [10]) as well as more recent-generation algorithms. These newer algorithms have been applied to neural networks, homology sequences, and discriminative models [1114], and their accuracy in predicting secondary structure approaches 80%. However, despite this predictive accuracy, the three-letter alphabet does not contain the information necessary to approximate more refined 3D reconstructions.

The recent rapid increase in the number of available protein structures has allowed more precise and thorough studies of protein structures. Several authors have developed more complex structural alphabets that incorporate information about the heterogeneity of backbone protein structures by using subsets of small protein fragments that are observed frequently in different protein structure databases [1517]. The alphabet size varies from several letters to about 100 letters [18]. For example, Unger et al. [19] and Schuchhardt et al. [20] used k-means methods and self-organizing maps (SOMs), respectively, to identify the most common folds, but the number of clusters generated was too large to have substantial predictive value. By applying autoassociative neural networks, Fetrow et al. defined six clusters that represent super-secondary structures which subsume the classic secondary structures [21]. Bystroff and Baker produced similar short folds of different lengths and grouped them into 13 clusters that they used to predict 3D structure [22]. Camproux et al. developed a hidden Markov model (HMM) approach that accounted for the Markovian dependence to learn the geometry of the structural alphabet letters and the local rules for the assembly process [23]. Fixing the alphabet size to 23 letters, Yang & Tung applied a nearest-neighbor algorithm on a (κ, α)-map of structural segments to identify the 23 groups of segments used in their alphabet [24]. More details about these local structures can be found in a recent review [25].

In this study, we developed a flexible pipeline for protein structural alphabet design based on a combinatorial, multi-strategy approach. Instead of applying cross-validation [22] or Markovian processes [15] to refine the clusters directly, we used SOMs and Bayesian Information Criterion (BIC) to determine the optimum size of structural alphabet. We then applied the k-means algorithm [26] to group protein fragments into clusters, forming the bases of our structural alphabet. Moreover, unlike most other works that built substitution matrices for alphabets based on known blocks of aligned proteins, we used a matrix training framework that generated matrices automatically without depending on known alignments. An expressive structural alphabet allows us to quantify the similarities among proteins encoded in the appropriate letters. It also enables the primary representation of 3D structures using standard 1D amino acid sequence alignment methods. To demonstrate the feasibility of our new method, we verified the application of the alphabet produced by our pipeline and the trained substitution matrix to a widely used 1D alignment tool, FASTA [27]. We conducted several experiments using the same datasets used in other recently published works and evaluated the performance of our tool in database-scale searches. In addition to investigating whether our alphabet and matrix worked well with 1D alignment tools in database searches, we evaluated the ability of our structural alphabet to characterize local structural features.


Structural alphabet

By combining SOMs, minimum spanning trees, and k-means clustering, we developed a multi-strategy approach to designing a protein structural alphabet. To derive an appropriate substitution matrix for the new alphabet, we developed a matrix training framework that would automatically refine an initial matrix repeatedly until it converged. Unlike some previous works that presumed the size of the alphabet [23], our method determined the alphabet size autonomously and statistically. Various experiments were conducted to evaluate our methodology.

The SOM is an unsupervised inductive learner and can be viewed as topology preserving mapping from input space onto the 2D grid of map units [28]. The number of map units in SOMs defines an inductive bias [29], as does the number of hidden units for the feedforward artificial neural networks, and it affects the clustering results. By systematically varying the number of SOM map units and applying BIC, we identified the most frequent number of clusters that maximized the BIC and used this number to define the size of the alphabet. We tested SOMs ranging in size from 10 × 10 to 200 × 200, ultimately defining the size of our alphabet at 18 letters. The relationship between number of clusters found and number of SOM map units used is summarized in Table 1.

Table 1 Relationship between the number of clusters found and the number of SOM map units used

To verify whether fragments were assigned to the same cluster by the various SOMs, we analyzed those SOMs (with varying numbers of map units) that produced 18 clusters, including SOMs sized 80 × 80, 90 × 90, 190 × 190 units, etc. We calculated the overlap level between any two of the SOMs, defined as percentage of fragments that belonged to the same cluster. The average overlap between all pairs of SOMs for each of the 18 clusters was over 90%, indicating that these clusters were very consistent (Table 2). Table 3 and 4 display the within-cluster Euclidean distance, defined as the average distance of each segment to the center, and the center-to-center Euclidean distance for the 18 protein fragment clusters found by our method and by SOM alone, respectively. The average Phi/Psi angles (i.e. the Phi/Psi angles of the centroid) for the 18 clusters are presented in Table 5. As indicated in Table 3 and 4, the within-cluster Euclidean distances for our clusters were smaller than those of the SOM clusters, which suggested that our 18 clusters were more coherent. On the other hand, the center-to-center distances for our clusters were larger than those of the SOM clusters, indicating that our clusters were better separated from each other. The 3D conformation of the representative segment for each alphabet letter is illustrated in Figure 1 and the superimposition of protein segments is shown in Figure 2. To verify that these representative segments could be the building blocks for protein structures, we analyzed the frequency of their occurrence in four major structural classes according to the Structural Classification of Proteins (SCOP): all-alpha, all-beta, alpha/beta, and alpha+beta [30]. The frequency of each category of segments is presented in Table 6. The alpha helix segments represented by alphabet letters T, P, and R occurred more often in the all-alpha class than did the other segments. Similarly, more beta sheet segments, such as N, E, and A, were found in the all-beta class. In both the alpha/beta and alpha+beta classes, most of the segments were found to be either alpha helices or beta sheets.

Table 2 The average overlap between all pairs of SOMs that produced 18 clusters of fragments
Table 3 Summary of the within-cluster Euclidean distance and the center-to-center Euclidean distance for 18 protein fragment clusters found by our alphabet design pipeline
Table 4 Summary of the within-cluster Euclidean distance and the center-to-center Euclidean distance for 18 protein fragment clusters found by the SOM alone
Table 5 The average Phi/Psi angles (i.e. the Phi/Psi angles of the centroid) for the 18 clusters found by our alphabet design pipeline
Table 6 Frequency of occurrence of the protein segments defined by our alphabet in four main SCOP classes
Figure 1

The 3D conformation of the representative segment for each alphabet letter.

Figure 2

Superimposition of protein segments in the 18 clusters.

TRISUM – Substitution matrix

Most approaches to constructing substitution matrices require the alignment of known proteins [24, 31, 32]. Because alignments are not always available and their validity can be dubious, we used a self-training strategy to build the substitution matrix for our new structural alphabet. This training framework had a flexible and modular design, and unlike most other approaches, it did not rely on the pre-alignment of protein sequences or structures. Different training data or alignment tools can be incorporated into this framework to generate appropriate matrices under various circumstances. In this study, we used the non-redundant proteins contained in SCOP1.69 with sequence similarity of less than 40% for training, excluding those proteins in SCOP-894 and the 50 test proteins (see details below) to ensure that the training data and the testing data did not overlap. We defined the positive hit rate of a query as the ratio of the number of positive hits to the size of the family the query belonged to. As we iterated each training protein (as a query), we refined the matrix until we could no longer increase the average positive hit rate of all the proteins. We tried different learning rates ranging from 0.25 to 1.00. The final average positive hit rates under different learning rates were similar, ranging between 0.9112 and 0.9153. An example of the learning curve of matrix training is presented in Figure 3. We selected the converged matrix with the maximum positive hit rate with the learning rate set to 0.50. We named this matrix TRISUM-169 (TR ained I teratively for SU bstitution M atrix-SCOP1.69), as shown in Figure 4.

Figure 3

Example learning curve of matrix training. The average positive hit rate converged at 0.9153 with the learning rate set to 0.5.

Figure 4

The substitution matrix TRISUM-169.

Comparison with other tools

Several protein structure search tools based on 1D alignment algorithms have been developed, including SA-Search [33], YAKUSA [34], and 3D-BLAST [24]. Yang and Tung tested 3D-BLAST on the SCOP database scan task [24]. They prepared a protein query dataset named SCOP-894 from SCOP 1.67 and 1.69; this dataset contains 894 proteins with <95% sequence similarity. We tested SA-FAST on the same dataset in order to allow direct comparison (Table 7). The results indicated that SA-FAST outperformed 3D-BLAST and PSI-BLAST in the test of the SCOP-894 query dataset.

Table 7 SA-FAST versus 3D-BLAST and PSI-BLAST in SCOP structural function assignment accuracy for the SCOP-894 protein dataset

We also used the same 50 proteins selected from SCOP95-1.69 that were used by Yang & Tung to compare SA-FAST with 3D-BLAST, PSI-BLAST, YAKUSA, MAMMOTH, and CE, in search time, predictive accuracy, and precision. Other search tools exist, such as PBE [35], SA-Search [33], and Vorolign [9], but because they either could not be tested on the SCOP database directly or the versions of their databases provided were too old (e.g. ASTRAL in PBE derived from SCOP-1.65, Vorolign server only scans SCOP40-1.69), these tools were not used in the comparisons. The results showed that SA-FAST outperformed the other two BLAST-based search tools (i.e. 3D-BLAST and PSI-BLAST) and another structure search tool that describes structures as 1D sequences (YAKUSA) in both predictive accuracy and precision (Table 8). Additionally, SA-FAST was comparably accurate and precise as the structural alignment tools MAMMOTH and CE. Regarding search time (using one Intel Pentium 2.8 GHz processor and 512 Mbytes of memory), Table 8 clearly indicates that SA-FAST was far more efficient than were the structural alignment tools MAMMOTH and CE.

Table 8 Comparison between SA-FAST, 3D-BLAST, PSI-BLAST, YAKUSA, MAMMOTH, and CE on 50 proteins selected from SCOP95-1.69

To further evaluate the predictive validity of our alphabet, we examined pairwise alignment of difficult cases based on the number of residues aligned and the superposition root mean square deviation (RMSD). To avoid alignment process bias and to maintain consistency in our analysis of various structural alphabets, we applied the same FASTA-based alignment algorithm [27] in the alignment tests. We tested the alphabets and substitution matrices used in PBE-align, 3D-BLAST, and SA-FAST on ten difficult cases of previously studied pairwise alignments and compared the results with those produced using VAST, DALI, CE, and FATCAT [8, 36]. Based on the alignments obtained using different alphabets and matrices, we used VMD [37] to calculate the superposition RMSD for PBE-align, 3D-BLAST, and SA-FAST. Table 9 shows that our alphabet had the lowest average RMSD per aligned residue among the three structural alphabets in the ten difficult alignment tests. Figure 5 shows four superimposition examples based on our structural alphabet.

Table 9 Results of ten difficult cases of pairwise alignment
Figure 5

Superimposition examples based on alignments identified by SA-FAST. (a) 1fxiA & 1ubq_ (b) 2azaA & 1paz_ (c) 1cewI & 1molA (d) 1cid_ & 2rhe.

Local structure conservation in putative active sites can reflect biological meaning and these types of structural patterns can be used to predict protein function [19], e.g., the binding sites for metal-binding proteins [38]. Conserved local structural features can be identified in various ways and described using different representations. Because of the aforementioned advantages to 1D representation, we wanted to evaluate the feasibility of describing structural domains/sub-domains using our structural alphabet. Because there is no motif finding tool specifically designed for protein structural alphabets, we applied the motif finding programs available to evaluate the feasibility of using structural alphabets to characterize local structure features. Currently, we use the motif finding program, MEME [39] to identify common structural motifs in protein families. We tested our method on a well-known protein family, the epidermal growth factor (EGF)/EGF-like family. Based on the information published in literature or recorded in databases, we could verify whether the protein domains/sub-domains in EGF/EGF-like proteins could be described accurately using structural alphabets. EGF domains comprise extracellular protein modules described by 30–40 amino acids primarily stabilized by three disulfide bonds. Homologies and functional data suggest that these domains share some common functional features. If we number the cysteine residues as Cys1 to Cys6, where Cys1 is the closest to the N-terminus, the regularity of cysteine spacing defines three regions: A, B, and C. Based on the conservation in sequence and length of these regions, the homologies have been classified into three different categories [40]. We first described the 227 proteins in the EGF-type module family of SCOP 1.69 using our alphabet and the alphabets of Yang & Tung's [24] and de Brevern et al. [16, 35]. We then used MEME to identify the common motifs corresponding to the A, B, and C sub-domains. According to InterPro [41], 24 of these proteins were exclusively of EGF Type-1, 74 were of EGF-like Type-2, and 117 belonged to EGF-like Type-3 only. We classified the remaining 12 proteins as Others. Sub-domain A was typically composed of five to six residues in Types 1 and 2, sub-domain B usually contained 10–11 residues in Type-1 but was consistently three residues shorter than in Type-2. Sub-domain C was conserved in length and contained four or five specific residues in Type-1 and Type-2 [40]. The sub-domains in EGF-like Type-3 were less conserved. A found motif was considered to correspond to a sub-domain if more than one-half of the residues in the sub-domain were included in the motif. If any single motif correctly corresponded to a sub-domain, we claimed that this sub-domain was recovered successfully (that is, a hit). The results of the motifs found are summarized in Table 10 and 11. They show that MEME was able to identify more EGF sub-domains using our structural alphabet than using the alphabets of Yang & Tung or de Brevern et al. One example of each EGF group is shown in Figure 6, including the structures with highlighted sub-domains. Using our alphabet, MEME identified meaningful motifs that covered all three sub-domains in the EGF examples (Figure 6); however, using Yang & Tung's or de Brevern et al.'s alphabets, the motifs found covered only one or two sub-domains.

Table 10 Comparison between our structural alphabet (used in SA-FAST) and those of Yang & Tung (used in 3D-BLAST) and de Brevern et al. (converted by PBE-T, a facility associated with PBE-align) for describing motifs found by MEME within the EGF family
Table 11 Statistical analysis of EGF(EGF-like) proteins whose sub-domains were detected by MEME
Figure 6

Examples of structural motifs corresponding to EGF sub-domains. We colored the sub-domains A, B, and C in blue, green, and red, respectively. The motifs that corresponded to EGF sub-domains, using our structural alphabet and those of Yang & Tung and de Brevern et al., were also highlighted in blue, green, and red. The overlapping region between motifs was colored purple. In the sequence view, the first three sequences are EGF protein represented by our structural alphabet, the alphabet of Yang & Tung, and the alphabet of de Brevern et al., respectively. The fourth is the amino acid sequence with the cysteines highlighted in orange. The sub-domains are marked at the bottom.


This study aimed to: (1) introduce a systematic and modular pipeline for protein structural alphabet design, and (2) analyze the potential of our new alphabet to characterize local protein properties. There are two features that distinguish our method from the others. First, we took a multi-strategy approach to structural alphabet design. The alphabet size was automatically and statistically determined based on BIC and was visualized using a unified distance matrix (U-matrix). We did not pre-specify the alphabet size [24] or use an ad hoc procedure, such as iterative shrinking, to find the optimal size [15]. And, unlike other methods that use specialized databases, e.g. Pair Database [24] and PDB-SELECT [32, 42], the protein structure data used to build the alphabet were obtained from the non-redundant PDB (nrPDB) database and were not pre-processed for any particular purpose, ensuring the generality of our alphabet. Second, we proposed a novel automatic matrix training framework to construct an appropriate substitution matrix for the alphabet. This training strategy did not need any information about known alignments, e.g. PALI [43], that most previous strategies have required. Using different training data and update rules, the self-training methodology can be applied to various alphabets. For example, instead of protein classifications, we could consider RMSD in the update rules to tune the matrix. In Table 12, we summarize the properties of the structural alphabets and design methods evaluated in this study.

Table 12 Summary of properties of structural alphabets and alphabet designs

We demonstrated that our pipeline could produce a biologically meaningful structural alphabet. We compared SA-FAST, a search tool based on FASTA combined with our alphabet and substitution matrix, with other search tools. The results showed that SA-FAST was very competitive in its predictive accuracy and alignment efficiency for database-scale searches. In addition, we compared our alphabet with others in difficult cases of pairwise alignment. The number of residues aligned and the RMSD superpositions indicated that our structural alphabet was not only comparable to other alphabets but also performed competitively with structural alignment tools.

We found several advantages to using a 1D structural alphabet. First, 1D representations of protein structures are easier to compare and more economical to store. Second, previously designed and widely used 1D sequence alignment tools can be applied directly to protein structure and sequence analysis. Third, 1D-based approaches can serve as pre-processors to filter out irrelevant proteins prior to the application of more computationally intensive structural analysis tools.


These results are encouraging and we can extend this work in several directions. Firstly, we can use more complete datasets for substitution matrix training to increase the sensitivity and selectivity of future database searches. Secondly, we can combine other alignment tools, in addition to FASTA, with our substitution matrix and evaluate the performance of these different combinations. Thirdly, to increase the performance of MEME in structural motif detection, we could modify MEME or develop a new motif-finding tool specifically for our structural alphabet. MEME was originally designed to find motifs in amino acid and nucleic acid sequences. Currently, we use MEME to detect protein motifs and we have demonstrated that it can recover some of the structural sub-domains described by our structural alphabet. Finally, several structural alphabets have been developed based on different protein structural characteristics. It would be worthwhile to conduct a thorough comparative study and evaluate the feasibility of combining different alphabets. The combination of complementary structural alphabets would increase their overall applicability and characterize 3D protein structures more completely.


The use of frequent local structural motifs embedded in a polypeptide backbone has recently been shown to improve protein structure prediction [1, 22]. The success of this strategy has paved the way for further studies of structural alphabets and has enabled the application of standard 1D sequence alignment methods to 3D protein structural searches. In this study, we combined several computational methods into a new approach to the design of a protein structural alphabet. We then developed an automatic matrix training framework that could generate appropriate substitution matrices for new alphabets when applied in standard 1D sequence alignment methods, such as FASTA [27].

Structural alphabet design

We used proteins from the nrPDB [44] in our study with the aim of building a structural alphabet suitable for all proteins. The same approach could easily be applied to other databanks as well. We transformed each protein backbone into a series of dihedral angles (ϕ and ψ, neglecting ω) [15, 22]. Following de Brevern et al. [15], our analysis was limited to fragments of five residues because this number of residues is sufficient for describing a short α helix and a minimal β structure. Fixing the window size at five residues, we applied a sliding-window approach to each protein sequence in nrPDB and gathered 20,953,584 fragment vectors. Each protein fragment, associated with α-carbons Cα(i-2), Cα(i-1), Cα(i), Cα(i+1), and Cα(i+2), was represented by a vector of eight dihedral angles [ψi-2, ϕi-1, ψi-1, ϕ i , ψ i , ϕi+1, ψi+1, ϕi+2] Unlike previous works that directly applied SOMs to obtain clusters of backbone fragments as the basis of the structural alphabet [28], in our approach the SOM was only part of the process that determined the number of letters required for the alphabet. We did not build our alphabet directly from the clusters found by SOM.

The U-matrix is one of the most widely used methods for visualizing the clustering results of the SOM. The U-matrix shows the distances between neighboring reference vectors and can be visualized efficiently using the greyscale [45]. We conducted a post-process on the U-matrix using a minimum spanning tree algorithm. Based on the grey levels in the U-matrix, all of the map units were linked in the minimum spanning tree. Given a threshold θ determined by BIC, we partitioned the entire tree into several disconnected subtrees by removing the links between map units with grey levels below θ.

Let S = {s i | i = 1...M} be the set of map units we wished to cluster. Each map unit s i is associated with a collection of input data points, X i = { x j i | j = 1... n i } MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemiwaG1aaWbaaSqabeaacqWGPbqAaaGccqGH9aqpcqGG7bWEcqWG4baEdaqhaaWcbaGaemOAaOgabaGaemyAaKgaaOGaeiiFaWNaemOAaOMaeyypa0JaeGymaeJaeiOla4IaeiOla4IaeiOla4IaemOBa42aaSbaaSqaaiabdMgaPbqabaGccqGG9bqFaaa@4183@ , mapped to the map unit s i . Let C k = {c i | i = 1...k} be the clustering of map units S with k clusters. We modeled each cluster c i as a multivariate Gaussian distribution N(μ i , Σ i ), where μ i and Σ i were estimated as the sample mean and the sample covariance from Xi, respectively. The number of parameters for each cluster was thus d + 1 2 d ( d + 1 ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemizaqMaey4kaSscfa4aaSqaaeaacqaIXaqmaeaacqaIYaGmaaGccqWGKbazcqGGOaakcqWGKbazcqGHRaWkcqaIXaqmcqGGPaqkaaa@36B9@ , where d = 8 in our case. We defined BIC(C k ) as:

B I C ( C k ) = i = 1 k { 1 2 n i log | i | } λ 1 2 ( d + 1 2 d ( d + 1 ) ) k log N MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemOqaiKaemysaKKaem4qamKaeiikaGIaem4qam0aaSbaaSqaaiabdUgaRbqabaGccqGGPaqkcqGH9aqpdaaeWaqaaiabcUha7jabgkHiTKqbaoaaleaabaGaeGymaedabaGaeGOmaidaaOGaemOBa42aaSbaaSqaaiabdMgaPbqabaGccyGGSbaBcqGGVbWBcqGGNbWzdaabdaqaaiabggHiLpaaBaaaleaacqWGPbqAaeqaaaGccaGLhWUaayjcSdaaleaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGRbWAa0GaeyyeIuoakiabc2ha9jabgkHiTiabeU7aSLqbaoaaleaabaGaeGymaedabaGaeGOmaidaaOGaeiikaGIaemizaqMaey4kaSscfa4aaSqaaeaacqaIXaqmaeaacqaIYaGmaaGccqWGKbazcqGGOaakcqWGKbazcqGHRaWkcqaIXaqmcqGGPaqkcqGGPaqkcqWGRbWAcyGGSbaBcqGGVbWBcqGGNbWzcqWGobGtaaa@67B9@

where N = i = 1 k n i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaemOta4Kaeyypa0ZaaabmaeaacqWGUbGBdaWgaaWcbaGaemyAaKgabeaaaeaacqWGPbqAcqGH9aqpcqaIXaqmaeaacqWGRbWAa0GaeyyeIuoaaaa@3793@ and λ, the penalty weight, was set to 1.

We chose the threshold θ that maximized BIC(C k ). For example, for an SOM with 200 × 200 map units, the threshold θ that maximized BIC(C k ) was 21. The number of subtrees becomes the structural alphabet size. Because the SOM can be viewed as a topology preserving mapping from input space onto the 2D grid of map units, the number of map units can affect the clustering result. We systematically varied the number of units and repeated the above process. We selected the most frequent number of clusters as the alphabet size. After a series of systematic tests, we found that 18 was the most frequent number of clusters; therefore, 18 letters became size of our structural alphabet.

Rather than adopt the two-level approach that first trains the SOM then performs clustering on the trained SOM after determining the alphabet size [28], we applied the k-means algorithm to the input data vectors directly to obtain the clusters. The SOM established a local order among the set of reference vectors such that the closeness between two reference vectors in the Rdspace was dependent on how close the corresponding map units were in the 2D array. Nevertheless, an inductive bias of this kind might not be appropriate for structural alphabets since the local order does not always faithfully characterize the relationship between structural building blocks and can sometimes be misleading. For example, forcing the topology to preserve mapping from the input space of α-helix and β-strand to a 2D grid of units could be harmful to clustering. Therefore, we used the SOM only to visualize the alphabet size and relied on the k-means algorithm to extract the local features directly from the input data that actually reflected the characteristics of the clusters. The centroid of each cluster forms the prototypical representation of each alphabet letter. We performed k-means clustering 50 times, starting with different random seeds, all using k = 18. We computed the Euclidean distances from each fragment in each cluster to its centroid as the intra-cluster distance; we also calculated the centroid-to-centroid distance. We kept the clustering result that had the minimum ratio of the average intra-cluster distance to the centroid-to-centroid distance. Given this result as the basis for the structural alphabet, we could transform a protein into a series of alphabet letters by matching each of its fragments against our alphabet prototypes.

Automatic substitution matrix training

The substitution matrix used to align proteins represented by structural alphabets affects the alignment accuracy. The matrix is a crucial factor in the success of applying a 1D sequence alignment tool to search for similar 3D structures. The simplest matrix that can be used is the identity matrix. Some authors have applied an HMM approach to define the matrix [33], while others have adopted approaches similar to the development of BLOSUM matrices [24, 31, 45]. The identity matrix ignores possible acceptable alphabet letter substitutions, significantly limiting its applicability. The BLOSUM-like approach requires alignments of homologous proteins before calculating the log-odd ratios as the entries in the matrix; however, reliably aligned protein blocks might not always be available for log-odd ratio estimation. To avoid these drawbacks, we trained the substitution matrix without using the known blocks of protein alignments. This matrix training procedure can be applied regardless of how the alphabet is derived.

There are three components in the matrix training framework: an alignment tool with a substitution matrix, training data, and a matrix trainer. We used FASTA as the alignment tool and the non-redundant proteins in SCOP1.69 with sequence similarity less than 40%, excluding the families with less than five proteins and those proteins used for validation, as the training dataset. Note that the training dataset was only 9.62% of the entire SCOP1.69. The test data we used in the later experiments (see Results section) did not overlap with our training examples. We started by using the identity matrix as the initial substitution matrix where the score for a match was 1, and for a mismatch, 0. Each protein in the training dataset was iterated as a query for FASTA to search the rest of the dataset for similar proteins. If a protein returned by FASTA belonged to the same family as the query, we considered the case a positive hit; otherwise it was a negative hit. Those proteins not returned by FASTA but in the same family as the query were considered misses. We gathered the alignments of all positive hits and misses and computed the log-odd ratios to build the positive matrix based on the alignments. Similarly, we constructed the negative matrix using the alignments of negative hits, The matrix trainer updated the current substitution matrix S(t)to S(t+1)as follows:S(t+1)= S(t)+ M M = [W p ·(P - S(t)) - W n ·(N - S(t))]·τ W p = (|positive_hits| + |misses|)/|taining_data|W n = |negative_hits|/|training_data|

where P and N are the positive and the negative matrix, respectively, τ is the learning rate (similar to the learning rate in neural networks), and W p and W n are the weights. The weights were defined as the proportion of the total number of positive hits and misses to the training data size and the ratio of the number of negative hits to the training data size, respectively. We repeated the update process to train the substitution matrix until there were no changes in the matrix, that is, the number of both the positive and the negative hits remained constant. This converged matrix was the final substitution matrix that we combined with FASTA to become a new alignment tool named SA-FAST. SA-FAST was used to demonstrate the applicability of our new alphabet and matrix. The training framework appears in Figure 7.

Figure 7

System architecture of the matrix training framework.


  1. 1.

    Baker D, Sali A: Protein Structure Prediction and Structural Genomics. Science 2001, 294: 93–96. 10.1126/science.1065659

    CAS  Article  PubMed  Google Scholar 

  2. 2.

    Orengo CA, Bray JE, Hubbard T, LoConte L, Sillitoe I: Analysis and assessment of ab initio three-dimensional prediction, secondary structure, and contacts prediction. Protein 1999, 3: 149–170. Publisher Full Text 10.1002/(SICI)1097-0134(1999)37:3+%3C149::AID-PROT20%3E3.0.CO;2-H

    Article  Google Scholar 

  3. 3.

    Liu Y, Beveridge DL: Exploratory studies of ab initio protein structure prediction: Multiple copy simulated annealing, AMBER energy functions, and a generalized born/solvent accessibility solvation model. Proteins 2002, 46: 128–146. 10.1002/prot.10020

    CAS  Article  PubMed  Google Scholar 

  4. 4.

    Shindyalo IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998, 11: 739–747. 10.1093/protein/11.9.739

    Article  Google Scholar 

  5. 5.

    Holm L, Sander C: Protein structure comparison by alignment distance matrices. J Mol Biol 1993, 233: 123–138. 10.1006/jmbi.1993.1489

    CAS  Article  PubMed  Google Scholar 

  6. 6.

    Madej T, Gibrat JF, Bryant SH: Threading a database of protein cores. Proteins: Structure, Function and Bioinformatics 1995, 23: 356–369. 10.1002/prot.340230309

    CAS  Article  Google Scholar 

  7. 7.

    Ortiz AR, Strauss CE, Olmea O: MAMMOTH (Matching molecular models obtained from Theory): an automated method for model comparison. Protein Science 2002, 11: 2606–2622. 10.1110/ps.0215902

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  8. 8.

    Ye Y, Godzik A: Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 2003, 19 Suppl 2: ii246–255.

    PubMed  Google Scholar 

  9. 9.

    Birzele F, Gewehr JE, Csaba G, Zimmer R: Vorolign-fast structural alignment using Voronoi contacts. Bioinformatics 2007, 23: e205–211. 10.1093/bioinformatics/btl294

    CAS  Article  PubMed  Google Scholar 

  10. 10.

    Garnier J, Osguthorpe D, Bobson B: Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular protein. J Molecular Biology 1978, 120: 97–120. 10.1016/0022-2836(78)90297-8

    CAS  Article  Google Scholar 

  11. 11.

    Petersen TN, Lundegaard C, Nielsen M, Bohr H, Bohr J, Brunak S, Gippert GP, Lund O: Prediction of protein secondary structure at 80% accuracy. Proteins 2000, 41: 17–20. 10.1002/1097-0134(20001001)41:1<17::AID-PROT40>3.0.CO;2-F

    CAS  Article  PubMed  Google Scholar 

  12. 12.

    Rost B: Review: Protein secondary structure prediction continues to rise. J Structural Biology 2001, 134: 204–218. 10.1006/jsbi.2001.4336

    CAS  Article  Google Scholar 

  13. 13.

    Pollastri G, Martin AJM, Mooney C, Vullo A: Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information. BMC Bioinformatics 2007, 8: 201–212. 10.1186/1471-2105-8-201

    PubMed Central  Article  PubMed  Google Scholar 

  14. 14.

    Sander O, Sommer I, Lengauer T: Local protein structure prediction using discriminative models. BMC Bioinformatics 2006, 7: 14. 10.1186/1471-2105-7-14

    PubMed Central  Article  PubMed  Google Scholar 

  15. 15.

    de Brevern AG, Etchebest C, Hazout S: Bayesian probabilistic approach for predicting backbone structures in terms of protein blocks. Proteins 2000, 41: 271–287. 10.1002/1097-0134(20001115)41:3<271::AID-PROT10>3.0.CO;2-Z

    CAS  Article  PubMed  Google Scholar 

  16. 16.

    de Brevern AG, Valadie H, Hazout SA, Etchebest C: Extension of a local backbone description using a structural alphabet: A new approach to the sequence-structure relationship. Protein Science 2002, 11: 2871–2886. 10.1110/ps.0220502

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  17. 17.

    de Brevern AG: New assessment of a structural alphabet. In Silico Biol 2005, 5(3):283–9.

    PubMed Central  CAS  PubMed  Google Scholar 

  18. 18.

    Benros C, de Brevern AG, Etchebest C, Hazout S: Assessing a novel approach for predicting local 3D protein structures from sequence. Proteins 2006, 62(4):865–880. 10.1002/prot.20815

    CAS  Article  PubMed  Google Scholar 

  19. 19.

    Unger R, Harel D, Wherland S, Sussman JL: A 3D building blocks approach to analyzing and predicting structure of proteins. Proteins 1989, 5: 355–373. 10.1002/prot.340050410

    CAS  Article  PubMed  Google Scholar 

  20. 20.

    Schuchhardt J, Schneider G, Reichelt J, Schomburg D, Wrede P: Local structural motifs of protein backbones are classified by self-organizing neural networks. Protein Engineering 1996, 9: 833–842. 10.1093/protein/9.10.833

    CAS  Article  PubMed  Google Scholar 

  21. 21.

    Fetrow JS, Palumbo MJ, Berg G: Patterns, structures, and amino acid frequencies in structural building blocks, a protein secondary structure classification scheme. Proteins 1997, 27: 249–271. 10.1002/(SICI)1097-0134(199702)27:2<249::AID-PROT11>3.0.CO;2-M

    CAS  Article  PubMed  Google Scholar 

  22. 22.

    Bystroff C, Baker D: Prediction of local structure in proteins using a library of sequence-structure motif. J Molecular Biology 1998, 281: 565–577. 10.1006/jmbi.1998.1943

    CAS  Article  Google Scholar 

  23. 23.

    Camproux AC, Tuffery P, Chevrolat JP, Boisvieux JF, Hazout S: Hidden Markov model approach for identifying the modular framework of the protein backbone. Protein Eng 1999, 12: 1063–1073. 10.1093/protein/12.12.1063

    CAS  Article  PubMed  Google Scholar 

  24. 24.

    Yang JM, Tung CH: Protein structure databases search and evolutionary classification. Nucleic Acids Research 2006, 34: 3646–3659. 10.1093/nar/gkl395

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  25. 25.

    Offmann B, Tyagi M, de Brevern AG: Local Protein Structures. Current Bioinformatics 2007, 2: 165–202. 10.2174/157489307781662105

    CAS  Article  Google Scholar 

  26. 26.

    Hartigan JA, Wong MA: A k-means clustering algorithm. Applied Statistics 1975, 28: 100–108. 10.2307/2346830

    Article  Google Scholar 

  27. 27.

    Pearson WR: Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 2000, 132: 185–219.

    CAS  PubMed  Google Scholar 

  28. 28.

    Vesanto J, Alhoniemi E: Cluster of the self-organizing map. IEEE trans Neural Networks 2000, 11: 586–600. 10.1109/72.846731

    CAS  Article  PubMed  Google Scholar 

  29. 29.

    Mitchell TM: Machine Learning. McGraw-Hill; 1997.

    Google Scholar 

  30. 30.

    Andreeva A, Howorth D, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Research 2004, 32: D226–229. 10.1093/nar/gkh039

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  31. 31.

    Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. PNAS 1992, 89: 10915–10919. 10.1073/pnas.89.22.10915

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  32. 32.

    Zheng WM, Liu X: A protein structural alphabet and its substitution matrix CLESUM. LNCS 2005, 3680: 59–67.

    CAS  Google Scholar 

  33. 33.

    Guyon F, Camproux AC, Hochez J, Tuffery P: SA-Search: a web tool for protein structure mining based a structural alphabet. Nucleic Acids Res 2004, 32: W545-W548. 10.1093/nar/gkh467

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  34. 34.

    Carpentier M, Brouillet S, Pothier J: YAKUSA: a fast structural database scanning method. Proteins: Structure, Function and Genetics 2005, 61: 137–151. 10.1002/prot.20517

    CAS  Article  Google Scholar 

  35. 35.

    Tyagi M, Sharma P, Swamy CS, Cadet F, Srinivasan N, de Brevern AG, Offmann B: Protein Block Expert (PBE): a web-based protein structure analysis server using structural alphabet. Nucleic Acids Research 2006, 34: W119–123. 10.1093/nar/gkl199

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  36. 36.

    Tyagi M, de Brevern AG, Srinivasan N, Offmann B: Protein structure mining using structural alphabet. Proteins: structure, function and bioinformatics 2007, in press.

    Google Scholar 

  37. 37.

    Humphrey W, Dalke A, Schulten K: VMD – Visual Molecular Dynamics. Journal of Molecular Graphics 1996, 14: 33–38. 10.1016/0263-7855(96)00018-5

    CAS  Article  PubMed  Google Scholar 

  38. 38.

    Dudev M, Lim C: Discovering structural motifs using a structural alphabet: Applications to magnesium-binding sites. BMC Bioinformtics 2007, 8: 106. 10.1186/1471-2105-8-106

    Article  Google Scholar 

  39. 39.

    Bailey TL, Elkan C: Unsupervised learning of multiple motifs in biopolymers using EM. Machine Learning 1995, 21: 51–80.

    Google Scholar 

  40. 40.

    Appella E, Weber IT, Blasi F: Structure and function of epidermal growth factor-like regions in proteins. FEBS Letters 1988, 231: 1–4. 10.1016/0014-5793(88)80690-2

    CAS  Article  PubMed  Google Scholar 

  41. 41.

    Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Buillard V, Cerutti L, Copley R, Courcelle E, Das U, Daugherty L, Dibley M, Finn R, Fleischmann W, Gough J, Haft D, Hulo N, Hunter S, Kahn D, Kanapin A, Kejariwal A, Labarga A, Langendijk-Genevaux PS, Lonsdale D, Lopez R, Letunic I, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Nikolskaya AN, Orchard S, Orengo C, Petryszak R, Selengut JD, Sigrist CJA, Thomas PD, Valentin F, Wilson D, Wu CH, Yeats C: New developments in the InterPro database. Nucleic Acids Research 2007, 35: D224–228. 10.1093/nar/gkl841

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  42. 42.

    Hobohm U, Sander C: Enlarged representatives set of protein structures. Protein Science 1994, 3: 522–524.

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  43. 43.

    Gowri VS, Pandit SB, Karthik PS, Srinivasan N, Balaji S: Integration of related sequences with protein three-dimensional structural families in an updated Version of PALI database. Nucleic Acids Research 2003, 31: 486–488. 10.1093/nar/gkg063

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  44. 44.

    Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Research 2000, 28: 235–242. 10.1093/nar/28.1.235

    PubMed Central  CAS  Article  PubMed  Google Scholar 

  45. 45.

    Tyagi M, Gowri VS, Srinivasan N, de Brevern AG, Offmann B: A substitution matrix for structural alphabet based on structural alignment of homologous proteins and its applications. Proteins: Structure, Function and Bioinformatics 2006, 65: 32–39. 10.1002/prot.21087

    CAS  Article  Google Scholar 

Download references


Thanks go to Jinn-Moon Yang and Chi-Hua Tung for their assistance in using 3D-BLAST. This work was supported in part by National Science Council, Taiwan (NSC 96-2221-E-009-042; 96-2627-B-009-003) and Institute of Statistics, Academia Sinica, Taiwan.

Author information



Corresponding author

Correspondence to Yuh-Jyh Hu.

Additional information

Authors' contributions

S–YK implemented the structural alphabet design pipeline and conducted the experiments. Y–JH designed the BIC procedure, the matrix training framework and experiments, and supervised this study. Both authors read and approved the final manuscript.

Authors’ original submitted files for images

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and Permissions

About this article

Cite this article

Ku, S., Hu, Y. Protein structure search and local structure characterization. BMC Bioinformatics 9, 349 (2008).

Download citation


  • Bayesian Information Criterion
  • Root Mean Square Deviation
  • Alignment Tool
  • Substitution Matrix
  • Alphabet Size