 Methodology article
 Open Access
An optimized TOPS+ comparison method for enhanced TOPS models
 Mallika Veeramalai^{1},
 David Gilbert^{2} and
 Gabriel Valiente^{3}
https://doi.org/10.1186/1471-2105-11-138
© Veeramalai et al; licensee BioMed Central Ltd. 2010
 Received: 11 September 2009
 Accepted: 17 March 2010
 Published: 17 March 2010
Abstract
Background
Although methods based on highly abstract descriptions of protein structures, such as VAST and TOPS, can perform very fast protein structure comparison, the results can lack a high degree of biological significance. Previously we have discussed the basic mechanisms of our novel method for structure comparison based on our TOPS+ model (Topological descriptions of Protein Structures Enhanced with Ligand Information). In this paper we show how these results can be significantly improved using parameter optimization; we call the resulting optimized method the advanced TOPS+ comparison method, or advTOPS+.
Results
We have developed a TOPS+ string model as an improvement to the TOPS [1–3] graph model by considering loops as secondary structure elements (SSEs) in addition to helices and strands, representing ligands as first-class objects, and describing interactions between SSEs, and between SSEs and ligands, by incoming and outgoing arcs, annotating SSEs with the interaction direction and type. Benchmarking results of an all-against-all pairwise comparison using a large dataset of 2,620 non-redundant structures from the PDB40 dataset [4] demonstrate the biological significance, in terms of SCOP classification at the superfamily level, of our TOPS+ comparison method.
Conclusions
Our advanced TOPS+ comparison shows better performance on the PDB40 dataset [4] than our basic TOPS+ method, giving 90% accuracy for the SCOP alpha+beta class, an increase of 6 percentage points over the TOPS and basic TOPS+ methods. It also outperforms the TOPS, basic TOPS+ and SSAP comparison methods on the Chew-Kedem dataset [5], achieving 98% accuracy.
Software Availability
The TOPS+ comparison server is available at http://balabio.dcs.gla.ac.uk/mallika/WebTOPS/.
Keywords
 Protein Data Bank
 Receiver Operating Characteristic
 Edit Distance
 Comparison Score
 Dissimilarity Score
Background
The structural genomics consortium [6] aims to populate protein fold space using high-throughput experimental technologies. As a result, the number of known structures in the Protein Data Bank (PDB) [7] is increasing rapidly every year; it currently holds 59,790 structures (August 25, 2009). This highlights the need for fast and reliable protein structure comparison methods. Various methods use detailed 3D structures for comparison: SSAP [8, 9] uses a double dynamic programming method that takes into account several different features of protein structure, including phi/psi angles, accessibility and inter-residue vectors, to align two protein structures. Other approaches include STAMP [10], DALI [11] and the Combinatorial Extension method [12]. On the other hand, abstract-level structural comparison methods are based on topological/vector models of secondary structure elements (SSEs) and their relationships. VAST is a vector-based protein structure comparison method [13, 14]. GRATH [15] is a graph-based algorithm that compares the axial vectors of alpha helices and beta strands of two proteins, together with the distances, angles and chirality between these vectors. It is based on a method by Grindley et al. [16]. Earlier work by Koch et al. [17] uses a graph method to find maximal common SSEs in a pair of proteins. TOPS is a graph-based method applied to the topological representation of protein structures [3]. Although these methods perform very fast protein structure comparison, in most cases the results are considerably harder to interpret biologically because of the abstract nature of the protein model. Moreover, the functional annotation problem is made much more complex by the fact that the number of protein folds is limited while their range of functions is very diverse. For example, the current version of the SCOP database classifies the (single) TIM barrel protein fold into 33 distinct functional superfamilies.
This motivated our research to develop a novel topological model for protein structures, enhanced with structural and biochemical features such as ligand interaction information and the amino acid sequence length of the secondary structures, in order to permit better, more biologically significant comparison methods. Previously, we have discussed the basic mechanisms of our novel TOPS+ comparison method for these topological models. We compute the edit distance between two proteins based on TOPS+ strings elements using a dynamic programming approach. We have benchmarked our method with an all-against-all pairwise comparison using a large dataset of 2,620 non-redundant structures from PDB40, and the results were validated using the standard SCOP superfamily classification numbers. We have also compared our method against other methods and showed that it is faster than SSAP, FATCAT, DALI and TOPS and that its performance is comparable to that of TOPS [18]. Recently we developed the TOPS++FATCAT system, which exploits the TOPS+ strings comparison method to speed up the FATCAT protein structural alignment program for fast flexible structural alignment while preserving the accuracy of the original FATCAT method [19]. These promising results have facilitated the introduction of further constraints on ligand-arc matching.
In this paper, we show how the above results can be significantly improved using parameter optimization at two stages of the TOPS+ method: (i) in the generation of the dynamic programming table and (ii) in the computation of the comparison score using a compression measure. The dynamic programming algorithm includes weight tables for matching TOPS+ strings elements. The match scores take into account not only the SSEType and orientation but also the total in/out/ligand arcs together with their arc types, namely right and left chiralities and parallel and antiparallel hydrogen bonds. This research work involved (a) generating the TOPS descriptions enhanced with in/out/ligand arc information for a large set of proteins; (b) designing the weight tables; (c) optimizing the weights in the table; and (d) designing a pairwise comparison metric based on a compression measure and optimizing different parameters to take into account the variability of both the topological and the ligand interaction features. The optimization of our advanced TOPS+ comparison method was carried out on the PDB40 representative dataset. Furthermore, we assess the biological significance of our method against existing protein structure comparison methods based on cluster analysis and validation using an F-measure calculation [20, 21] on the Chew-Kedem dataset [4, 5].
Results and Discussion
Analysis of results for the PDB40 dataset
Table 1. ROC curve and F-measure analysis of structural homology for the PDB40 dataset. Each entry is given as AUC/F-measure.
No.  SCOP Class   TOPS       TOPS+      advTOPS+
1    All alpha    0.76/0.79  0.83/0.85  0.82/0.88
2    All beta     0.89/0.85  0.85/0.83  0.87/0.86
3    Alpha/beta   0.82/0.75  0.75/0.70  0.77/0.70
4    Alpha+beta   0.84/0.75  0.84/0.74  0.90/0.81
In the alpha+beta class our advTOPS+ method has 90% accuracy, which is superior to both TOPS and our basic TOPS+ method, each of which has only 84% accuracy (see Table 1). Because these proteins are composed of segregated alpha and beta regions, structure-dependent ligand interactions and additional chiral and hydrogen-bond arcs are also present, so our parameter optimization can handle all arcs more efficiently.
On the other hand, the alpha/beta class of proteins contains mixed alpha and beta secondary structures; more importantly, although the protein domains from this class have ligand interactions, these may not be structure-dependent ligand interactions. For most of the protein superfamilies in this class the ligands tend to bind clefts or binding pockets that have appropriate physicochemical properties and the correct conformational geometry of the amino acids. Furthermore, it is important to note that in our TOPS+ and advTOPS+ comparison methods we have considered only the total number of ligand arcs rather than the actual ligand property match, and thus we have false positives in some SCOP classes. In the case of all-beta class proteins our advTOPS+ method has performance comparable to TOPS, with 87% accuracy (see Table 1); proteins in this class contain a significant number of hydrogen-bond and chiral arcs, and thus parameter optimization is performed more efficiently. From the F-measure statistical evaluation analysis (we used the same cutoff value of 0.35 for all three methods) we found that the advTOPS+ method does better than TOPS and TOPS+ in every class except the alpha/beta class of proteins (see Table 1).
The overall results show that our advTOPS+ method exhibits substantial improvement compared to basic TOPS+. It has better performance than TOPS for all-alpha and alpha+beta proteins. On the other SCOP classes the performance is comparable with TOPS. Since our method considers only the total number of ligand arcs rather than the actual ligand property, it produces false positives to some extent. Our advTOPS+ method can efficiently recognize structure-dependent ligand interactions in the case of DNA-binding proteins and metal-binding proteins.
Analysis of results for the Chew-Kedem dataset
Table 2. Biological significance of protein domain clusters for the Chew-Kedem dataset.
Method    F-measure
SSAP      0.966
TOPS      0.955
TOPS+     0.931
advTOPS+  0.985
Table 3. advTOPS+ comparison scores for the Chew-Kedem dataset.
Protein Fold  Domain   SSE Ln  LCS PAT Ln  advTOPS+ Score  LCS SSE Pattern
Alpha-beta    d1aa9__  23      20          0.49            uEUhuEuhuEuhuEUhuEhu
Alpha-beta    d1gnp__  25      20          0.51            uEUhuEhuEuhuuEUhuEhu
Alpha-beta    d6q21a_  21      16          0.59            uEUhuEuhuEUhuEhu
Alpha-beta    d1qraa_  25      20          0.51            ueUHueHueuHuueUHueHu
Alpha-beta    d5p21__  25      20          0.51            ueUHueHueuHuueUHueHu
TIM-barrel    d6xia__  60      40          0.30            uhuhuHuHueuHueuHuhuehhuHuHueuuuHuhuhuuhu
TIM-barrel    d2mnr_1  37      29          0.37            uuuuHueuHueuHueuHueuHuuhuuuhu
TIM-barrel    d1chra1  41      31          0.35            uuuHuHueuHueuHueuHuHueuHuhuuuhu
TIM-barrel    d4enl_1  55      35          0.37            uhuhuHuueuuuHuhueuHuHuuhuuHuuHuHuhu
Conclusions
In this paper we have reported the generation of TOPS+ and TOPS+ strings models for large datasets and have presented an improved TOPS+ comparison method using parameter optimization both for the computation of the dynamic programming table and for the computation of the comparison score using a compression metric. Through our evaluation analysis we have shown that our advanced TOPS+ comparison method gives a substantial improvement on all the SCOP classes compared to our basic TOPS+ method. Our advanced TOPS+ method performs better than TOPS on alpha+beta and all-alpha and is comparable on all-beta and alpha/beta. On the Chew-Kedem dataset our advanced TOPS+ comparison outperforms all the other methods.
This demonstrates that our TOPS+ and TOPS+ strings models can find more biologically significant results, and it has led to interesting new directions for incorporating ligand-pattern discovery in TOPS+ comparison [24]. Our method is faster than TOPS and SSAP because it has time complexity O(n^{2}), where n represents the number of SSEs in the protein domains. This research opens the way to further improvements of our TOPS+ models and advanced TOPS+ comparison method through the addition of features such as amino-acid sequences, biochemical properties of the protein-ligand interaction at the atomic level, and arc scores (at both the topological level and the ligand level) for each SSE. Moreover, we can improve the comparison process with additional statistical scoring values for each TOPS+ strings element match at both the micro level (atomic details of protein-ligand interaction information) and the macro level (abstract level).
Furthermore, our novel TOPS+ models, TOPS+ strings and comparison approaches could be applicable to different problem areas such as RNA secondary structure comparison and prediction. Most drug-discovery processes start with in-silico chemical compound screening, which is computationally expensive. Our TOPS+ comparison approach could be applied as an initial step to prune the search space and to group proteins into the same folds interacting with similar or different ligands, and different folds interacting with similar or different ligands.
Methods
TOPS+ and TOPS+ Strings Models
The TOPS model [1, 2, 25, 26] represents protein structures at the fold level by a graph in which the nodes stand for SSEs, namely (up or down) alpha-helices and beta-strands, and the (non-directed) edges represent right- or left-handed chirality and parallel or antiparallel hydrogen-bond relationships. In addition, there is a total ordering over the nodes, corresponding to the backbone of the protein. Our TOPS+ model enhances the original TOPS graph model with structural and biochemical features such as ligand interaction information and the amino acid sequence length of the secondary structures. We have added extra nodes for loops (represented as first-class SSE objects) and for ligands, while maintaining the existing nodes for alpha-helices and beta-strands.
Further, we have designed a string model based on our TOPS+ graph model in which the long-range and short-range interactions between the SSEs are converted into incoming and outgoing arcs for each SSE, which maintain the direction and arc type properties. All relevant SSE nodes are enhanced with SSE-ligand interaction information, which includes loop-ligand interaction information. We abstract away from the ligands themselves to give a linear model called a TOPS+ string, which preserves the essential biochemical information whilst permitting more efficient and non-heuristic algorithms for comparison.
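As a rough illustration of the linear model just described, one element of a TOPS+ string could be held in a small record such as the one below. This is only a sketch: the class name `TopsPlusElement`, the field names, the reading of E/e, H/h and U/u as strand, helix and loop symbols, the reading of arc types R, L, P and A as right chirality, left chirality, parallel and antiparallel hydrogen bonds, and the assumption that the arc totals are the sums of the four arc types are all illustrative and are not taken from the TOPS+ software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopsPlusElement:
    """One element of a TOPS+ string (hypothetical layout, for illustration only)."""
    sse: str          # SSE symbol, assumed: E/e strand, H/h helix, U/u loop
    length: int       # amino acid sequence length of the SSE
    in_r: int = 0     # incoming arcs of type R (assumed: right-handed chirality)
    in_l: int = 0     # incoming arcs of type L (assumed: left-handed chirality)
    in_p: int = 0     # incoming arcs of type P (assumed: parallel hydrogen bond)
    in_a: int = 0     # incoming arcs of type A (assumed: antiparallel hydrogen bond)
    out_r: int = 0    # outgoing arcs of type R
    out_l: int = 0    # outgoing arcs of type L
    out_p: int = 0    # outgoing arcs of type P
    out_a: int = 0    # outgoing arcs of type A
    ligand: int = 0   # total number of ligand arcs

    @property
    def total_in(self) -> int:
        # Assumption: the total is the sum over the four arc types.
        return self.in_r + self.in_l + self.in_p + self.in_a

    @property
    def total_out(self) -> int:
        # Same assumption as for total_in.
        return self.out_r + self.out_l + self.out_p + self.out_a


# A TOPS+ string is then an ordered list of elements along the backbone.
# The values below are made up and do not describe a real protein domain.
example = [
    TopsPlusElement("u", 4),
    TopsPlusElement("E", 6, out_p=1, ligand=1),
    TopsPlusElement("U", 3),
    TopsPlusElement("h", 10, in_r=1),
]
pattern = "".join(element.sse for element in example)
print(pattern)
```

Keeping the per-type counts as the stored fields and deriving the totals keeps the record internally consistent; if totals can differ from the sum of the typed arcs, they would need to be stored as separate fields instead.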
Advanced TOPS+ Comparison Method based on Dynamic Programming Algorithm
Our TOPS+ comparison method computes a comparison score between two proteins based on edit distance using a dynamic programming approach. The Levenshtein distance or edit distance [28, 29] gives a measure (the cost) of the minimum number of elementary edit operations (insertions, deletions and substitutions of characters) necessary to transform one sequence into the other. In this research we have improved our existing method using parameter optimization in the dynamic programming table computation and also in the computation of the comparison score. The method involves the following steps:
 1. Recursive definition of the optimal dissimilarity score for match and mismatch between TOPS+ strings elements (this process is based on the advanced_SSEArc+Match function, which incorporates the parameter optimization process using the parameter tuning table).
 2. Construction of the Edit Distance (ED) matrix (dynamic programming table).
 3. Traceback on the ED matrix (dynamic programming table).
 4. Obtain the LCS (Longest Common Subsequence), which is equivalent to the largest common structural core.
 5. Computation of the comparison score based on the compression measure, which is optimized with penalty weights for arc information (for both topological arcs and ligand arcs).
In our optimized TOPS+ comparison method, the computation of the edit distance matrix M is an important process, in which the advanced_SSEArc+Match function plays a key role by assigning a dissimilarity score to each match or mismatch between TOPS+ strings elements of the target, t_{i} ∈ T, and the source, s_{j} ∈ S. This function handles the parameter optimization while the edit distance matrix is constructed by dynamic programming. It takes the basic parameter list P_{b} supplied with the input, constructs the parameter tuning table PT with 12 weights (w_{1} to w_{12}), and combines these weights with the absolute arc differences (D_{1} to D_{12}) between t_{i} and s_{j} to compute the final normalized dissimilarity score for their match or mismatch. In each step the advanced_SSEArc+Match function performs the following processes in order to obtain the dissimilarity scores between each pair of TOPS+ strings elements of T and S and to construct the dynamic programming table:

Construct the parameter tuning table PT based on the basic parameter list P_{b}; this process is performed only once.

Compute the absolute differences for the arc features, such as in/out/ligand arcs, between t_{i} and s_{j} of T and S respectively.

Compute the optimized dissimilarity score for the match of t_{i} and s_{j} using equations (1) and (2) below.

Construct the dynamic programming table.
Algorithm 1 (Edit distance between TOPS+ strings) A function call ComputeEditDistance(T, S) computes the edit distance matrix M, the backtrace pointer matrix P, the edit distance value ed, and the longest common subsequence lcs of two TOPS+ strings T and S.
function ComputeEditDistance(T = t_{1}, ..., t_{n}, S = s_{1}, ..., s_{m})
    M[0, 0] ← 0
    for i ← 1, ..., n do
        M[i, 0] ← i
    for j ← 1, ..., m do
        M[0, j] ← j
    for i ← 1, ..., n do
        for j ← 1, ..., m do
            A ← SSEArc+Match(t_{i}, s_{j})
            M[i, j] ← min{M[i - 1, j] + 1, M[i, j - 1] + 1, M[i - 1, j - 1] + A}
            if M[i, j] = M[i - 1, j - 1] + A then
                P[i, j] ← 'm' ▷ match or mismatch of t_{i} to s_{j}
            else if M[i, j] = M[i, j - 1] + 1 then
                P[i, j] ← 'i' ▷ insertion of s_{j} into T
            else
                P[i, j] ← 'd' ▷ deletion of t_{i} from T
    ed ← M[n, m]
    lcs ← BuildLCS(M, P, T, S)
    return ⟨M, P, ed, lcs⟩
function BuildLCS(M, P, T, S)
    lcs ← ∅ ▷ empty sequence
    k ← 0
    i ← n ▷ length of T
    j ← m ▷ length of S
    while i > 0 and j > 0 do ▷ P is only defined for i, j ≥ 1
        if P[i, j] = 'm' then
            lcs ← lcs ∪ t_{i} ▷ match or mismatch of t_{i} to s_{j}
            k ← k + 1
            i ← i - 1
            j ← j - 1
        else if P[i, j] = 'd' then
            i ← i - 1 ▷ deletion of t_{i} from T
        else
            j ← j - 1 ▷ insertion of s_{j} into T
    return lcs
function SSEArc+Match(t_{i}, s_{j})
    mS ← 0
    Parse(t_{i}, t_{sk}, t_{I}, t_{O}, t_{L}, t_{IR}, t_{IL}, t_{IP}, t_{IA}, t_{OR}, t_{OL}, t_{OP}, t_{OA})
    Parse(s_{j}, s_{sk}, s_{I}, s_{O}, s_{L}, s_{IR}, s_{IL}, s_{IP}, s_{IA}, s_{OR}, s_{OL}, s_{OP}, s_{OA})
    if MatchSSEArc+features(t_{i}, s_{j}) then
        mS ← mS + 1
    return mS
procedure Parse(t_{i}, t_{sk}, t_{I}, t_{O}, t_{L}, t_{IR}, t_{IL}, t_{IP}, t_{IA}, t_{OR}, t_{OL}, t_{OP}, t_{OA})
    t_{sk} ← secondary structure length of t_{i}
    t_{I}, t_{O}, t_{L} ← total number of incoming, outgoing, ligand arcs of t_{i}
    t_{IR}, t_{IL}, t_{IP}, t_{IA} ← total number of incoming arcs of type R, L, P, A of t_{i}
    t_{OR}, t_{OL}, t_{OP}, t_{OA} ← total number of outgoing arcs of type R, L, P, A of t_{i}
The time complexity is O(n^{2}), where n is the length of the string of SSEs. The current version of our TOPS+ method performs global alignment [30]. However, local alignment [31] could be applied to find local structural similarities or patterns, such as similar SSE-ligand interactions at the local level across different folds.
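The dynamic programming scheme and traceback of Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the TOPS+ implementation: the function name `compute_edit_distance` is hypothetical, a plain unit cost (0 for identical symbols, 1 otherwise) stands in for the element dissimilarity score, plain SSE symbols stand in for full TOPS+ strings elements, and only exactly matching elements are collected into the common subsequence.

```python
def unit_cost(a, b):
    """Stand-in dissimilarity (assumption): 0 for identical symbols, 1 otherwise."""
    return 0 if a == b else 1


def compute_edit_distance(t, s, cost=unit_cost):
    """Return (edit distance, common subsequence) of sequences t and s."""
    n, m = len(t), len(s)
    # M holds the edit distances, P the backtrace pointers.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    P = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i
    for j in range(1, m + 1):
        M[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = M[i - 1][j - 1] + cost(t[i - 1], s[j - 1])
            insert = M[i][j - 1] + 1
            delete = M[i - 1][j] + 1
            best = min(substitute, insert, delete)
            M[i][j] = best
            if best == substitute:
                P[i][j] = "m"   # match or mismatch
            elif best == insert:
                P[i][j] = "i"   # insertion of s[j-1]
            else:
                P[i][j] = "d"   # deletion of t[i-1]

    # Traceback from the bottom-right corner; stop at the table border,
    # where no further matches are possible.
    common = []
    i, j = n, m
    while i > 0 and j > 0:
        if P[i][j] == "m":
            if t[i - 1] == s[j - 1]:   # keep exact matches only (assumption)
                common.append(t[i - 1])
            i -= 1
            j -= 1
        elif P[i][j] == "d":
            i -= 1
        else:
            j -= 1
    common.reverse()   # the traceback visits elements from last to first
    return M[n][m], common


ed, common = compute_edit_distance("uEUh", "uEh")
print(ed, "".join(common))
```

Both loops over the table touch each of the (n + 1)(m + 1) cells once, which is where the quadratic running time comes from. Replacing `unit_cost` with a weighted element score changes the values in the table but not this structure.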
Optimizing the Computation of the Dynamic Programming Table
We performed parameter tuning/optimization in order to obtain the optimal approximate match between two protein structures. In general, at the superfamily level, only core structures are conserved throughout evolution across the members of protein families. Studies have shown that the number of SSE insertions and deletions is variable for different sequence families or organisms [32]. This implies that variable numbers of 'indels' are applicable to the ArcTypes and SSETypes across protein families from various organisms within a superfamily. Thus, it is important to develop a cost matrix with an additional penalty scoring function for such an approximate matching process. In the following sections we discuss the development of the parameter tuning table and the computation of the absolute differences between ArcTypes and SSETypes. Subsequently, we explain the main parameter optimization process involved in the computation of the dynamic programming table, which exploits the computation of a normalized dissimilarity score for each TOPS+ strings element match.
Development of Parameter Tuning Table
Table 4. Identity (ISM) and dissimilarity (DSM) scoring matrices for TOPS+ diagrams.
       ISM                     DSM
SSE    E  e  H  h  U  u        E  e  H  h  U  u
E      0  1  1  1  1  1        0  1  2  2  2  2
e      1  0  1  1  1  1        1  0  2  2  2  2
H      1  1  0  1  1  1        2  2  0  1  2  2
h      1  1  1  0  1  1        2  2  1  0  2  2
U      1  1  1  1  0  1        2  2  2  2  0  1
u      1  1  1  1  1  0        2  2  2  2  1  0
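The two scoring matrices follow a simple pattern that can be written as rules rather than stored tables. The sketch below is one way to express them; the function names `ism` and `dsm` are hypothetical, and the rules were read off the tabulated values: the identity matrix scores 0 for identical symbols and 1 otherwise, while the dissimilarity matrix scores 0 for identical symbols, 1 for the same letter in a different case, and 2 for different letters.

```python
SYMBOLS = "EeHhUu"


def ism(a, b):
    """Identity scoring: 0 for identical SSE symbols, 1 otherwise."""
    return 0 if a == b else 1


def dsm(a, b):
    """Dissimilarity scoring: 0 identical, 1 same letter in a different case, 2 otherwise."""
    if a == b:
        return 0
    if a.lower() == b.lower():
        return 1
    return 2


# Rebuild both matrices row by row, in the symbol order E, e, H, h, U, u.
ism_rows = [[ism(a, b) for b in SYMBOLS] for a in SYMBOLS]
dsm_rows = [[dsm(a, b) for b in SYMBOLS] for a in SYMBOLS]
for symbol, row in zip(SYMBOLS, dsm_rows):
    print(symbol, row)
```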

w_{3} = w_{8} = r (for incoming and outgoing arc type_R)

w_{4} = w_{9} = s (for incoming and outgoing arc type_L)

w_{5} = w_{10} = p (for incoming and outgoing arc type_P)

w_{6} = w_{11} = q (for incoming and outgoing arc type_A)

w_{2} = w_{7} = r + s + p + q (for total incoming and outgoing arcs)

w_{12} = t (for total ligand arcs)
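The weight assignments listed above can be collected into a small lookup table. The following is a minimal sketch only: the function name `build_parameter_tuning_table` and the numeric values are hypothetical, and w_{1} is left out because its definition is not given in the list above.

```python
def build_parameter_tuning_table(r, s, p, q, t):
    """Map weight index k (2..12) to w_k, following the assignments listed above.

    r, s, p, q are the weights for arc types R, L, P, A; t is the ligand arc weight.
    w_1 is deliberately omitted because it is not defined in the list.
    """
    arc_total = r + s + p + q
    return {
        2: arc_total,  # total incoming arcs
        3: r,          # incoming arcs type_R
        4: s,          # incoming arcs type_L
        5: p,          # incoming arcs type_P
        6: q,          # incoming arcs type_A
        7: arc_total,  # total outgoing arcs
        8: r,          # outgoing arcs type_R
        9: s,          # outgoing arcs type_L
        10: p,         # outgoing arcs type_P
        11: q,         # outgoing arcs type_A
        12: t,         # total ligand arcs
    }


# Made-up parameter values, chosen only to show the structure of the table.
pt = build_parameter_tuning_table(r=1.0, s=1.0, p=2.0, q=2.0, t=3.0)
print(pt)
```

Because the table depends only on the basic parameter list, it can be built once per comparison run and reused for every pair of elements.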
Computation of Absolute Differences
Table 5. Normalized similarity score between secondary structure elements.
Absolute difference           Equation                       Description
total incoming arcs           D_{2} = |t_{I} - s_{I}|        t_{I} and s_{I} are the total number of incoming arcs of the TOPS+ strings elements t_{i} ∈ T and s_{j} ∈ S respectively
total incoming arcs type_R    D_{3} = |t_{IR} - s_{IR}|      t_{IR} and s_{IR} indicate the total number of incoming arcs type_R for t_{i} ∈ T and s_{j} ∈ S respectively
total incoming arcs type_L    D_{4} = |t_{IL} - s_{IL}|      t_{IL} and s_{IL} indicate the total number of incoming arcs type_L for t_{i} ∈ T and s_{j} ∈ S respectively
total incoming arcs type_P    D_{5} = |t_{IP} - s_{IP}|      t_{IP} and s_{IP} indicate the total number of incoming arcs type_P for t_{i} ∈ T and s_{j} ∈ S respectively
total incoming arcs type_A    D_{6} = |t_{IA} - s_{IA}|      t_{IA} and s_{IA} indicate the total number of incoming arcs type_A for t_{i} ∈ T and s_{j} ∈ S respectively
total outgoing arcs           D_{7} = |t_{O} - s_{O}|        t_{O} and s_{O} are the total number of outgoing arcs for the TOPS+ strings elements t_{i} ∈ T and s_{j} ∈ S respectively
total outgoing arcs type_R    D_{8} = |t_{OR} - s_{OR}|      t_{OR} and s_{OR} indicate the total number of outgoing arcs type_R for t_{i} ∈ T and s_{j} ∈ S respectively
total outgoing arcs type_L    D_{9} = |t_{OL} - s_{OL}|      t_{OL} and s_{OL} indicate the total number of outgoing arcs type_L for t_{i} ∈ T and s_{j} ∈ S respectively
total outgoing arcs type_P    D_{10} = |t_{OP} - s_{OP}|     t_{OP} and s_{OP} indicate the total number of outgoing arcs type_P for t_{i} ∈ T and s_{j} ∈ S respectively
total outgoing arcs type_A    D_{11} = |t_{OA} - s_{OA}|     t_{OA} and s_{OA} indicate the total number of outgoing arcs type_A for t_{i} ∈ T and s_{j} ∈ S respectively
total ligand arcs             D_{12} = |t_{L} - s_{L}|       t_{L} and s_{L} are the total number of ligand arcs for the TOPS+ strings elements t_{i} ∈ T and s_{j} ∈ S respectively
For example, the absolute difference in total incoming arcs is D_{2} = |t_{I} - s_{I}|, where t_{I} and s_{I} are the total number of incoming arcs of the TOPS+ strings elements t_{i} ∈ T and s_{j} ∈ S respectively. The absolute arc differences between incoming arcs, outgoing arcs and their arc types are calculated based on Table 5.
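A short sketch of how the eleven absolute differences D_{2} to D_{12} could be computed for one pair of elements is given below. The dictionary layout, the key names and the function name `absolute_differences` are hypothetical, and the arc counts are made up; how the differences are then combined with the weights into a normalized score is not shown here.

```python
# Feature key for each difference index k, mirroring the subscripts used in the text
# (I/O/L totals, and IR, IL, IP, IA, OR, OL, OP, OA for the typed arcs).
FEATURES = {
    2: "I", 3: "IR", 4: "IL", 5: "IP", 6: "IA",
    7: "O", 8: "OR", 9: "OL", 10: "OP", 11: "OA",
    12: "L",
}


def absolute_differences(t, s):
    """Return {k: D_k} with D_k = |t[feature] - s[feature]| for k = 2..12."""
    return {k: abs(t[name] - s[name]) for k, name in FEATURES.items()}


# Two made-up elements, each given as a mapping from feature key to arc count.
t_i = {"I": 2, "IR": 1, "IL": 0, "IP": 1, "IA": 0,
       "O": 1, "OR": 0, "OL": 0, "OP": 0, "OA": 1, "L": 2}
s_j = {"I": 1, "IR": 0, "IL": 0, "IP": 1, "IA": 0,
       "O": 3, "OR": 1, "OL": 0, "OP": 1, "OA": 1, "L": 0}

D = absolute_differences(t_i, s_j)
print(D)
```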
Computation of Normalized Dissimilarity Score
Computation of the Optimized Comparison Score (metric)
We have computed 17 different combinations of compression values based on ED and LCS, with or without different levels of SSEArc+ feature information. Supplementary Table 1 (see the supplementary material page at http://balabio.dcs.gla.ac.uk/mallika/WebTOPS/optTOPSplusresults.html) lists all 17 normalized compression scores calculated from ED and LCS by our advanced TOPS+ comparison, in the output order of the results, together with their descriptions.
We have performed training and analysis of our advanced TOPS+ comparison method with the parameter tuning table. Our method incorporates parameter optimization at two levels, both in the computation of the dynamic programming table and in the computation of the normalized compression measure. We have tested our method with 1,134 unique basic parameter lists on the training dataset of 7,000 random protein domain pairs from the PDB40 dataset, which contains both ligand-bound and ligand-free proteins. We validated our results with the SCOP superfamily classification numbers and obtained the ROC and AUC values corresponding to each basic parameter list. The experimental testing and evaluation analysis involved the following steps:

Perform advanced TOPS+ comparison based on the advanced_SSEArc+Match function for all basic parameters in list P_{ b }.

Compute ROC (Receiver Operating Characteristic) curve analysis for all 7,000 results, and for each parameter list.

Calculate the AUC values corresponding to the 17 different nC scores.
Datasets
PDB40 subset
Table 6. Structural homology of protein domains for the PDB40 dataset.
SCOP Class   Hom  %    Non-Hom  %    Total  %
All alpha    129  69   58       31   187    10
All beta     219  68   102      32   321    18
Alpha/beta   452  48   487      52   939    52
Alpha+beta   167  46   193      54   360    20
Total        967  54   840      46   1807   100
Chew-Kedem dataset
We have considered the Chew and Kedem dataset [4, 5] to assess the biological significance of our advanced TOPS+ comparison method. This dataset contains 36 medium-size representative proteins of five different families: globins (17 entries), alpha-beta (6 entries), TIM-barrels (4 entries), all-alpha (2 entries), and all-beta (7 entries) proteins. We compared our method against the SSAP structure alignment program [9, 34, 35] and TOPS [3, 26] and validated our results based on computation of the F-measure [20, 21].
Evaluation Analysis
ROC and AUC Analysis
For the PDB40 dataset we have performed evaluation analysis as given below:

Obtain the pairwise comparison score from the protein comparison method for a given dataset.

Assign each protein domain pair as homolog (TP, true positive) or non-homolog (FP, false positive) based on the SCOP superfamily numbers of the two domains (see below), and rank the pairs according to the comparison score.

Perform the Receiver Operating Characteristics (ROC) curve analysis for equation (2). For all the ROC curves we have computed the AUC (Area Under the ROC Curve) values.
Homolog vs non-homolog assignment
We have considered the assignment of homologous or non-homologous information of a protein domain pair, based on the standard SCOP classification numbers at superfamily level, as an indication of structural homology. If both protein domains belong to the same superfamily then they are homologous; otherwise they are non-homologous.
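The labelling rule and the AUC computation can be sketched as follows. This is an illustrative sketch under assumptions, not the evaluation code used in this work: the function names are hypothetical, the superfamily identifiers and scores are made up, the AUC is computed by the rank-based (Mann-Whitney) formulation rather than by integrating a plotted curve, and the flag `lower_is_better` is an assumption about the direction of the comparison score that should be set to match the score actually used.

```python
def is_homologous(superfamily_a, superfamily_b):
    """A pair is homologous if both domains carry the same SCOP superfamily identifier."""
    return superfamily_a == superfamily_b


def auc(scores, labels, lower_is_better=True):
    """Area under the ROC curve via pairwise ranking.

    Counts the fraction of (homolog, non-homolog) pairs in which the homolog
    has the better score; ties count as one half.
    """
    positives = [x for x, is_pos in zip(scores, labels) if is_pos]
    negatives = [x for x, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one homolog and one non-homolog pair")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p == n:
                wins += 0.5
            elif (p < n) == lower_is_better:
                wins += 1.0
    return wins / (len(positives) * len(negatives))


# Made-up domain pairs: (superfamily of domain 1, superfamily of domain 2, score).
pairs = [
    ("a.1.1", "a.1.1", 0.10),
    ("c.1.8", "c.1.8", 0.80),
    ("a.1.1", "c.1.8", 0.20),
    ("c.1.8", "b.1.1", 0.90),
]
labels = [is_homologous(a, b) for a, b, _ in pairs]
scores = [score for _, _, score in pairs]
print(auc(scores, labels))
```

The pairwise formulation is quadratic in the number of pairs, which is acceptable for a sketch; a sort-based computation gives the same value more cheaply on large datasets.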
F-measure validation analysis for the Chew-Kedem dataset
We have obtained all-against-all comparison scores from all the comparison methods and, based on these scores, performed pairwise hierarchical clustering for each method using the OC program [36]. We have evaluated the biological significance of the clusters obtained from the different protein structure comparison methods based on F-measure calculations [20, 21].
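For readers unfamiliar with cluster validation by F-measure, a common formulation is sketched below. It is an assumption that this variant corresponds to the one used here: for each known class the best-matching cluster is found by the harmonic mean of precision and recall, and the per-class values are averaged with weights proportional to class size. The function name `clustering_f_measure` and the toy labels are hypothetical.

```python
from collections import Counter


def clustering_f_measure(classes, clusters):
    """Class-size-weighted best-match F-measure of a clustering against known classes."""
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(classes, clusters))   # items shared by a class and a cluster
    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = joint.get((c, k), 0)
            if n_ck == 0:
                continue
            precision = n_ck / n_k
            recall = n_ck / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total


# Toy example: four domains from two folds, all placed in a single cluster.
folds = ["globin", "globin", "tim", "tim"]
merged = [0, 0, 0, 0]
print(clustering_f_measure(folds, merged))
```

A clustering that reproduces the known classes exactly scores 1, and merging or splitting classes lowers the value.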
Run Time Analysis
We performed all the analyses in a Red Hat 7.2 Linux environment with an Intel Pentium IV 1.6 GHz processor. The methods SSAP, TOPS, TOPS+ and advTOPS+ took 9139 s, 75 s, 21 s and 1805 s (s = seconds) respectively to complete 630 pairwise comparisons.
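The reported totals can be turned into approximate per-comparison times with a few lines of arithmetic. The sketch assumes that the 630 comparisons are the distinct pairs among the 36 Chew-Kedem domains, which matches the count but is an inference rather than a statement from the text above.

```python
from math import comb

n_domains = 36                    # size of the Chew-Kedem dataset
n_pairs = comb(n_domains, 2)      # distinct unordered pairs of domains

# Total run times in seconds, as reported above.
runtimes = {"SSAP": 9139, "TOPS": 75, "TOPS+": 21, "advTOPS+": 1805}
per_pair = {method: total / n_pairs for method, total in runtimes.items()}

for method, seconds in per_pair.items():
    print(f"{method}: {seconds:.3f} s per comparison")
```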
Declarations
Acknowledgements
We would like to thank the TOPS Project for TOPS data resources and Juris Viksna for advice on our method. MV has been supported by a PhD studentship from the University of Glasgow.
References
 1. Gilbert D, Westhead DR, Viksna J, Thornton J: A Computer System to Perform Structure Comparison using TOPS Representations of Protein Structure. J Comput Chem 2001, 26: 23–30. 10.1016/S00978485(01)000961
 2. Gilbert D, Westhead DR, Viksna J: Techniques for Comparison, Pattern Matching and Pattern Discovery: From Sequences to Protein Topology. In Artificial Intelligence and Heuristic Methods in Bioinformatics, Volume 183 of NATO Science Series: Computer & Systems Sciences. Edited by: Frasconi P, Shamir R. IOS Press; 2003:128–147.
 3. Viksna J, Gilbert D: Pattern Matching and Pattern Discovery Algorithms for Protein Topologies. In Algorithms in BioInformatics, Volume 2149 of Lecture Notes in Comput. Sci. Springer-Verlag; 2001:98–111.
 4. Krasnogor N, Pelta DA: Measuring the Similarity of Protein Structures by Means of the Universal Similarity Metric. Bioinformatics 2004, 20(7):1015–1021. 10.1093/bioinformatics/bth031
 5. Chew LP, Kedem K: Finding the Consensus Shape for a Protein Family. Algorithmica 2003, 38: 115–129. 10.1007/s0045300310452
 6. Goldsmith-Fischman S, Honig B: Structural Genomics: Computational Methods for Structure Analysis. Protein Sci 2003, 12(9):1813–1821. 10.1110/ps.0242903
 7. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235
 8. Orengo CA, Taylor WR: A Rapid Method for Protein Structure Alignment. J Theor Biol 1990, 147(4):517–551. 10.1016/S00225193(05)802632
 9. Taylor WR, Orengo CA: Protein Structure Alignment. J Mol Biol 1989, 208: 1–22. 10.1016/00222836(89)900843
 10. Russell RB, Barton GJ: Multiple Protein Sequence Alignment from Tertiary Structure Comparison: Assignment of Global and Residue Confidence Levels. Proteins 1992, 14(2):309–323. 10.1002/prot.340140216
 11. Holm L, Sander C: Protein Structure Comparison by Alignment of Distance Matrices. J Mol Biol 1993, 233: 123–138. 10.1006/jmbi.1993.1489
 12. Shindyalov IN, Bourne PE: Protein Structure Alignment by Incremental Combinatorial Extension (CE) of the Optimal Path. Protein Engineering 1998, 11(9):739–747. 10.1093/protein/11.9.739
 13. Madej T, Mossing MC: Hamiltonians for Protein Tertiary Structure Prediction Based on Three-dimensional Environment Principles. J Mol Biol 1993, 233(3):480–487. 10.1006/jmbi.1993.1525
 14. Madej T, Gibrat JF, Bryant SH: Threading a Database of Protein Cores. Proteins 1995, 23(3):356–369. 10.1002/prot.340230309
 15. Harrison A, Pearl F, Sillitoe I, Slidel T, Mott R, Thornton J, Orengo C: Recognizing the Fold of a Protein Structure. Bioinformatics 2003, 19(14):1748–1759. 10.1093/bioinformatics/btg240
 16. Grindley HM, Artymiuk PJ, Rice DW, Willett P: Identification of Tertiary Structure Resemblance in Proteins Using a Maximal Common Subgraph Isomorphism Algorithm. J Mol Biol 1993, 229(3):707–721. 10.1006/jmbi.1993.1074
 17. Koch I, Lengauer T, Wanke E: An Algorithm for Finding Maximal Common Subtopologies in a Set of Protein Structures. J Comput Biol 1996, 3(2):289–306. 10.1089/cmb.1996.3.289
 18. Veeramalai M, Gilbert D: A Novel Method for Comparing Topological Models of Protein Structures Enhanced with Ligand Information. Bioinformatics 2008, 24(23):2698–2705. 10.1093/bioinformatics/btn518
 19. Veeramalai M, Ye Y, Godzik A: TOPS++FATCAT: fast flexible structural alignment using constraints derived from TOPS+ Strings Model. BMC Bioinformatics 2008, 9: 358. 10.1186/147121059358
 20. Handl J, Knowles J, Kell DB: Computational Cluster Validation in Post-Genomic Data Analysis. Bioinformatics 2005, 21(15):3201–3212. 10.1093/bioinformatics/bti517
 21. van Rijsbergen CJ: Information Retrieval. 2nd edition. London: Butterworths; 1979.
 22. Krishna SS, Grishin NV: Structural Drift: A Possible Path to Protein Fold Change. Bioinformatics 2005, 21(8):1308–1310. 10.1093/bioinformatics/bti227
 23. Harrison A, Pearl F, Mott R, Thornton J, Orengo C: Quantifying the Similarities within Fold Space. J Mol Biol 2002, 323(5):909–926. 10.1016/S00222836(02)009920
 24. Veeramalai M: A Novel Method for Comparing Topological Models of Protein Structures Enhanced with Ligand Information. PhD thesis. University of Glasgow; 2005.
 25. Michalopoulos I, Torrance GM, Gilbert D, Westhead DR: TOPS: An Enhanced Database of Protein Structural Topology. Nucleic Acids Res 2003, 32(D):251–254.
 26. Torrance GM, Gilbert D, Michalopoulos I, Westhead DR: Protein Structure Topological Comparison, Discovery and Matching Service. Bioinformatics 2005, 21(10):2537–2538. 10.1093/bioinformatics/bti331
 27. Westhead D, Slidel T, Flores T, Thornton J: Protein structural topology: automated analysis and diagrammatic representation. Protein Science 1999, 8: 897–904.
 28. Levenshtein VI: Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics Doklady 1966, 10(8):707–710.
 29. Valiente G: Combinatorial Pattern Matching Algorithms in Computational Biology using Perl and R. Taylor & Francis/CRC Press; 2009.
 30. Needleman SB, Wunsch CD: A General Method applicable to the Search for Similarities in the Amino Acid Sequence of two Proteins. J Mol Biol 1970, 48(3):443–453. 10.1016/00222836(70)900574
 31. Smith TF, Waterman MS: Identification of Common Molecular Subsequences. J Mol Biol 1981, 147: 195–197. 10.1016/00222836(81)900875
 32. Mizuguchi K, Blundell TL: Analysis of conservation and substitutions of secondary structure elements within protein superfamilies. Bioinformatics 2000, 16(12):1111–1119. 10.1093/bioinformatics/16.12.1111
 33. Brazma A, Jonassen I, Vilo J, Ukkonen E: Pattern Discovery in Biosequences. In Proc. 4th Int. Coll. Grammatical Inference, Volume 1433 of Lecture Notes in Comput. Sci. Springer-Verlag; 1998:257–270.
 34. Orengo CA, Brown NP, Taylor WR: Fast Structure Alignment for Protein Databank Searching. Proteins 1992, 14(2):139–167. 10.1002/prot.340140203
 35. Orengo CA, Taylor WR: SSAP: Sequential Structure Alignment Program for Protein Structure Comparison. Methods Enzymol 1996, 266: 617–635.
 36. Barton GJ: OC - A Cluster Analysis Program. 2002. [http://www.compbio.dundee.ac.uk/downloads/oc/]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.