Tableau-based protein substructure search using quadratic programming
 Alex Stivala^{1},
 Anthony Wirth^{1} and
 Peter J Stuckey^{1, 2}
https://doi.org/10.1186/1471-2105-10-153
© Stivala et al; licensee BioMed Central Ltd. 2009
Received: 07 April 2009
Accepted: 19 May 2009
Published: 19 May 2009
Abstract
Background
Searching for proteins that contain similar substructures is an important task in structural biology. The exact solution of most formulations of this problem, including a recently published method based on tableaux, is too slow for practical use in scanning a large database.
Results
We developed an improved method for detecting substructural similarities in proteins using tableaux. Tableaux are compared efficiently by solving the quadratic program (QP) corresponding to the quadratic integer program (QIP) formulation of the extraction of maximally-similar tableaux. We compare the accuracy of the method in classifying protein folds with some existing techniques.
Conclusion
We find that including constraints based on the separation of secondary structure elements increases the accuracy of protein structure search using maximally-similar subtableau extraction, to a level where it has comparable or superior accuracy to existing techniques. We demonstrate that our implementation is able to search a structural database in a matter of hours on a standard PC.
Background
Finding structures in a database which contain a substructure that is similar to a query structure or structural motif is an important technique in analyzing protein structure, function, and evolution. There are many existing methods for finding structurally similar proteins which take diverse approaches, such as: structural alignment at the level of residues or backbone atoms [1, 2] or (as an initial step) secondary structure elements [3–7], purely topological matching [8, 9], and probabilistic approaches [10–12]. Detailed structural alignment, however, although capable of great accuracy, is often slow [2], and therefore impractical for searching entire databases of the size of SCOP [13, 14] or the PDB [15].
The TOPS-based method [8, 9] provides structural motif searches, but by operating purely on topology it "ignores other important spatial properties" [16, p. 1331]. Non-alignment approaches, such as PRIDE [10], can be extremely fast, but are not as accurate as alignment-based approaches [17], and provide only a matching score, not an alignment or a coarse-grained or seed alignment for further refinement.
Two recent approaches, ProSMoS [16] and TableauSearch [18], use spatial interactions between secondary structure elements (SSEs) to find common structures. ProSMoS constructs a "metamatrix" of SSEs and the interactions between them, and finds all possible submatrices in a database metamatrix that match the query metamatrix. TableauSearch constructs tableaux [19, 20], which represent relative orientations of SSEs, and finds substructural matches by extracting maximally-similar subtableaux between the query tableau and a database tableau. In the exact (rigorous) technique, this problem is expressed as a quadratic integer program (QIP) or integer linear program (ILP) [18] and solved exactly using ILOG CPLEX [21]. Both ProSMoS and the exact tableau search formulation allow substructures to be found within structures. They also allow nonlinear matchings, that is, sets of correspondences between SSEs in which the sequential order of corresponding SSEs is not preserved. Such nonlinear matchings have recently been shown to be significantly more widespread than had previously been thought [22], and are therefore of considerable interest.
The two most similar methods to tableau matching are perhaps LOCK [5], and its newer version LOCK 2 [6], and ProSMoS. LOCK and LOCK 2 also match SSE vectors between structures, but use a more complex set of seven scoring functions, both orientation dependent and orientation independent, and use iterative dynamic programming requiring parameters for each of the scoring functions [5]. In contrast, the tableau matching formulation is simpler and more elegant, although to obtain higher accuracy we extend it with a distance difference constraint that requires a parameter.
ProSMoS, although it is similar to tableau matching in its use of SSE orientations, takes quite a different approach from most existing structural search methods in that, rather than taking a structure (or substructure) definition as a query, the query metamatrix is constructed manually (or at least modified manually from one generated by the supplied scripts) by the user. This is clearly useful for finding user-specified motifs in a database of structures, but creates challenges in assessing the performance of the method since the results are so dependent on the user-specified query. ProSMoS, in contrast to our method, returns a list of hits to the query structure, rather than a matching score for each structure in the database. This is often simpler for the user, but it does have the disadvantage that finding more (or fewer) hits requires editing the query metamatrix, which can be quite difficult to calibrate. Returning a score for each database structure means that adjusting the sensitivity or specificity required is simply a matter of varying the cutoff score for a match to be considered significant.
An advantage of the maximally-similar subtableaux formulation is that it allows the discovery of similar substructures within two structures, without requiring that the two structures are themselves similar as a whole, or that one of the structures must match as a whole some substructure within the other structure. We may choose to use one structure as a "query" motif, usually a small well-defined structural folding pattern, and find structures that contain this entire motif as a substructure, but it is also possible to find common substructures in two unrelated folds.
However, the rigorous tableau searching method is too slow for a full database search, and so Konagurthu et al. [18] introduce TableauSearch. This method approximates the exact solution using an alignment-like approach [23], with two phases of dynamic programming. TableauSearch is extremely fast, but loses the rigorous theoretical justification and is not as accurate as the exact method. It is also inherently sequential, thereby losing the ability to find nonlinear structural matchings, and, at least partly, loses the ability to find substructural (local) rather than full structure (global) matches. Local matching may be possible, however, by removing end gap penalties [18].
Here we present a method, based on the exact tableaux matching formulation [18] and recent work in alignment of molecular networks [24], that allows searches for occurrences of a query structure as substructures of structures in a database such as SCOP in practical time, allows nonlinear matchings, and is able to provide a set of correspondences between SSEs.
Results and discussion
We evaluate the accuracy of our QP tableau matching algorithm as a method for determining the fold of a structure, using SCOP as the truth. The tradeoff between sensitivity and specificity for such a classification task can be shown as a Receiver Operating Characteristic (ROC) curve [25].
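Table 1. AUC and time for some widespread folds, without and with distance information.

| Fold | SCOP sid | # SSEs | AUC (without distance) | Time (without distance) | AUC (with distance) | Time (with distance) |
| --- | --- | --- | --- | --- | --- | --- |
| β-grasp | d1ubia_ | 8 | 0.80 | 0 h 47 m | 0.92 | 0 h 32 m |
| Key-barrel | d1tttb1 | 9 | 0.80 | 0 h 50 m | 0.97 | 0 h 50 m |
| Immunoglobulin | d1ae6h1 | 13 | 0.89 | 1 h 47 m | 0.95 | 1 h 53 m |
| Plait (ferredoxin) | d1bhne_ | 15 | 0.61 | 1 h 53 m | 0.85 | 2 h 18 m |
| GFP-like | d1h6rb_ | 17 | 1.00 | 3 h 06 m | 0.99 | 3 h 06 m |
| Jelly-roll | d2phlb1 | 19 | 0.87 | 4 h 24 m | 0.93 | 5 h 13 m |
| TIM-barrel | d1tima_ | 21 | 0.99 | 4 h 31 m | 1.00 | 4 h 30 m |
| NAD-binding fold | d1f6dc_ | 30 | 0.98 | 14 h 45 m | 0.99 | 16 h 33 m |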
We find that the best-performing variation of our method, using the discrete tableau encoding rather than numeric Ω matrices, and incorporating distance information, has an AUC of 0.95 averaged over the eight queries in Table 1.
We can see in Table 1 that the ferredoxin fold query (d1bhne_) performs significantly worse than the others. We examined the results from this query and found that a large number of false negatives occur (many members of this fold are not given a high score by our method). Examining some of these false negatives in detail, we find that it is often due to DSSP [26], which we use to define SSEs, not defining some of the SSEs required to match the query structure (in an extreme case, d2atcb1, DSSP defines only a single helix and nothing else). Although we have the capability of using STRIDE [27] rather than DSSP, the results are often similar (as in the d2atcb1 example). This is a shortcoming of any method that depends on SSE definitions, although ProSMoS solves it to some degree by using PALSSE [28], a secondary structure assignment method that assigns many more residues to SSEs precisely in order to avoid this problem [16]. False negatives can also occur independently of the SSE definition algorithm, if a structure lacks some SSEs that SCOP does not consider essential to the fold (but which are included in our query structure), and/or has SSEs sufficiently different in orientation that tableau matching does not assign them a high score. An example of this is d1q8ba_, which does not have all the helices in the query structure, and some which it does include have rather different orientations from those in the query, but it is nevertheless classified as a member of the ferredoxin-like fold.
Table 1 shows that a search in a database of 15273 structures takes approximately one hour for a small (10 or fewer SSEs) query structure on a single CPU of a standard PC, and under four hours for query structures with fewer than 18 SSEs, but can take more than 16 hours for a query structure with 30 SSEs. Since 75% of domains in the database have fewer than 20 SSEs, and the most frequent number of SSEs is 10, most queries for a structure drawn from a set of structures with the same distribution of tableau size as the database will complete within 4 hours. We note that the peak at 10 differs from the results of [20], who find the peak is at 6, as we have used DSSP to define SSEs and have included π and 3_{10}-helices, while Kamat and Lesk [20] used the assignments of helices and strands from PDB files.
Table 2. AUC for the 200 query set.

| Method | Normalization | AUC |
| --- | --- | --- |
| SHEBA | None | 0.941 |
| QP tableau search | norm2 | 0.925 |
| QP tableau search | norm1 | 0.904 |
| QP tableau search | norm3 | 0.904 |
| VAST | None | 0.890 |
| TableauSearch | norm2 | 0.871 |
| TOPS | None | 0.871 |
| TableauSearch | norm1 | 0.869 |
| QP tableau search | None | 0.854 |
| TableauSearch | None | 0.846 |
| TableauSearch | norm3 | 0.832 |
In terms of elapsed time (for a single processor core), TableauSearch is by far the fastest method. On our system, it has a total elapsed time for the 200 query set of only 1 hour 25 minutes, compared to 28 hours for VAST, 52 hours for SHEBA, and 741 hours for our method. Large scale comparisons with the exact solution of the QIP or ILP with CPLEX are not practical, as a single comparison takes at least several seconds and can take up to several days, and in some cases exhausts the virtual memory of our machine.
Comparison with MAX-CMO heuristic
Maximum Contact Map Overlap (MAX-CMO) is a formulation of the problem of finding the similarity of two protein structures. MAX-CMO uses the contact map representation of proteins, in which a protein with n residues is represented as a square symmetric matrix C_{n × n} where C_{ij} = 1 when the distance between residues i and j is less than some threshold, and C_{ij} = 0 otherwise. Typically this distance is defined as the C_{α} distance, and the threshold is, for example, 7 Å. The MAX-CMO problem is then to find a (noncrossing) alignment of residues that maximizes the overlap between two contact maps. The value (or score) of the alignment is the number of contacts in one protein whose residues are aligned with residues that are also in contact in the other protein [31].
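The contact map and overlap score described above can be illustrated with a minimal Python sketch. The function names, the `min_separation` parameter, and the toy coordinates are illustrative assumptions, not part of any published MAX-CMO implementation. The sketch only scores a given alignment; it does not search for the maximizing alignment, and it does not check that the alignment is noncrossing.

```python
import math

def contact_map(coords, threshold=7.0, min_separation=1):
    """Return the set of residue index pairs (i, j), i < j, that are in contact.

    coords is a list of (x, y, z) positions, e.g. C-alpha atoms (assumed input).
    Two residues are in contact when their distance is below the threshold.
    """
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts.add((i, j))
    return contacts

def overlap(contacts_a, contacts_b, alignment):
    """Count contacts in A whose aligned residues are also in contact in B.

    alignment maps residue indices of A to residue indices of B.
    """
    count = 0
    for i, j in contacts_a:
        if i in alignment and j in alignment:
            p, q = sorted((alignment[i], alignment[j]))
            if (p, q) in contacts_b:
                count += 1
    return count

# Toy example: residues on a straight line, 3.8 Angstroms apart.
chain_a = [(3.8 * i, 0.0, 0.0) for i in range(4)]
chain_b = [(3.8 * i, 0.0, 0.0) for i in range(3)]
map_a = contact_map(chain_a)
map_b = contact_map(chain_b)
score = overlap(map_a, map_b, {0: 0, 1: 1, 2: 2})
```

In this toy case only consecutive residues are within 7 Å, so the overlap counts the two contacts of A whose endpoints are both aligned.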
MAX-CMO is an NP-hard problem, and methods for solving it exactly, by such techniques as integer programming with Lagrangian relaxation [31, 32] or branch-and-bound [33], can be impractically slow.
Therefore, heuristic approaches are useful, and recently a variable neighborhood search (VNS) algorithm for approximating MAX-CMO has been published, with an analysis of its effectiveness in ranking protein similarity [34].
Here we compare the performance of the QP formulation of maximally-similar subtableaux extraction with the VNS heuristic for MAX-CMO of [34].
Table 3. Area under the ROC curve (AUC) for the Fischer data set at fold level.

| Method | Normalization | AUC | Standard error | 95% CI lower | 95% CI upper |
| --- | --- | --- | --- | --- | --- |
| MSVNS3 | None | 0.788 | 0.017 | 0.754 | 0.821 |
| MSVNS3 | norm1 | 0.791 | 0.017 | 0.758 | 0.824 |
| MSVNS3 | norm2 | 0.809 | 0.016 | 0.777 | 0.842 |
| MSVNS3 | norm3 | 0.781 | 0.017 | 0.747 | 0.815 |
| QP tableau search | None | 0.837 | 0.016 | 0.807 | 0.868 |
| QP tableau search | norm1 | 0.882 | 0.014 | 0.855 | 0.909 |
| QP tableau search | norm2 | 0.887 | 0.014 | 0.861 | 0.914 |
| QP tableau search | norm3 | 0.860 | 0.015 | 0.831 | 0.889 |
Table 4. Area under the ROC curve (AUC) for the Fischer data set at class level.

| Method | Normalization | AUC | Standard error | 95% CI lower | 95% CI upper |
| --- | --- | --- | --- | --- | --- |
| MSVNS3 | None | 0.666 | 0.009 | 0.647 | 0.684 |
| MSVNS3 | norm1 | 0.604 | 0.010 | 0.586 | 0.623 |
| MSVNS3 | norm2 | 0.696 | 0.009 | 0.678 | 0.714 |
| MSVNS3 | norm3 | 0.628 | 0.010 | 0.610 | 0.647 |
| QP tableau search | None | 0.789 | 0.008 | 0.773 | 0.805 |
| QP tableau search | norm1 | 0.833 | 0.008 | 0.819 | 0.848 |
| QP tableau search | norm2 | 0.851 | 0.007 | 0.837 | 0.865 |
| QP tableau search | norm3 | 0.824 | 0.008 | 0.809 | 0.839 |
Table 5. Area under the ROC curve (AUC) for the Nh3D data set at architecture level.

| Method | Normalization | AUC | Standard error | 95% CI lower | 95% CI upper |
| --- | --- | --- | --- | --- | --- |
| MSVNS3 | None | 0.537 | 0.005 | 0.528 | 0.547 |
| MSVNS3 | norm1 | 0.617 | 0.005 | 0.607 | 0.627 |
| MSVNS3 | norm2 | 0.583 | 0.005 | 0.573 | 0.593 |
| MSVNS3 | norm3 | 0.598 | 0.005 | 0.588 | 0.608 |
| QP tableau search | None | 0.578 | 0.005 | 0.568 | 0.588 |
| QP tableau search | norm1 | 0.617 | 0.005 | 0.607 | 0.626 |
| QP tableau search | norm2 | 0.608 | 0.005 | 0.598 | 0.618 |
| QP tableau search | norm3 | 0.599 | 0.005 | 0.589 | 0.608 |
Table 6. Area under the ROC curve (AUC) for the Nh3D data set at class level.

| Method | Normalization | AUC | Standard error | 95% CI lower | 95% CI upper |
| --- | --- | --- | --- | --- | --- |
| MSVNS3 | None | 0.590 | 0.003 | 0.585 | 0.595 |
| MSVNS3 | norm1 | 0.559 | 0.003 | 0.554 | 0.564 |
| MSVNS3 | norm2 | 0.543 | 0.003 | 0.538 | 0.548 |
| MSVNS3 | norm3 | 0.551 | 0.003 | 0.546 | 0.557 |
| QP tableau search | None | 0.708 | 0.002 | 0.703 | 0.712 |
| QP tableau search | norm1 | 0.740 | 0.002 | 0.735 | 0.744 |
| QP tableau search | norm2 | 0.726 | 0.002 | 0.722 | 0.731 |
| QP tableau search | norm3 | 0.700 | 0.002 | 0.695 | 0.704 |
We should perhaps discount any superiority in the performance of the tableau search method at the class level, as this level of classification in the CATH hierarchy indicates only the percentage of α-helices and β-strands in the domain [37]. Since tableaux are based on SSEs (defined by DSSP) we could trivially obtain good classification performance at this level just from the DSSP classification, while MAX-CMO uses only residue contact information, and so must score protein similarity at this high level without having SSEs defined by an existing method.
Ignoring the class level comparisons therefore, we find that QP tableau search has significantly superior accuracy compared to MSVNS3 on the Fischer data set, and similar accuracy to MSVNS3 on the Nh3D data set.
For the Fischer data set, MSVNS3 took 8 hours while the sparse matrix (UMFPACK [38–41]) implementation of QP tableau search took 2 hours on a PC with an Intel Core 2 Duo processor and 2 GB of memory running 32-bit Linux. For the Nh3D data set, MSVNS3 took 62 hours while QP tableau search took 8 hours.
Examples
Substructure search
Evaluated as a substructure (motif) query, the β-grasp query (d1ubia_) using the discrete tableau encoding has an AUC of 0.94. Since the data set used as the gold standard in this case is that defined by ProSMoS [16], that method by definition has an AUC of 1.00 on this query.
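Table 7. Comparison of methods for substructure searching.

| Fold | SCOP sid | P | S | T | Q | R | P/R | S/R | T/R | Q/R |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| β-grasp | d1ubia_ | 33 | 9 | 42 | 14 | 15 | 15/15 | 9/15 | 13/15 | 10/15 |
| Key-barrel | d1tttb1 | 10 | 3 | 0 | 17 | 5 | 5/5 | 1/5 | 0/5 | 4/5 |
| Immunoglobulin | d1ae6h1 | 27 | 1 | 4 | 1 | 11 | 9/11 | 1/11 | 4/11 | 1/11 |
| Plait (ferredoxin) | d1bhne_ | 20 | 1 | 61 | 14 | 28 | 7/28 | 1/28 | 24/28 | 6/28 |
| GFP-like | d1h6rb_ | 1 | 1 | 58 | 21 | 1 | 1/1 | 1/1 | 1/1 | 1/1 |
| Jelly-roll | d2phlb1 | 1 | 1 | 19 | 15 | 12 | 1/12 | 1/12 | 10/12 | 5/12 |
| TIM-barrel | d1tima_ | 16 | 16 | 40 | 33 | 32 | 16/32 | 16/32 | 28/32 | 32/32 |
| NAD-binding fold | d1f6dc_ | 1 | 1 | 42 | 19 | 8 | 1/8 | 1/8 | 7/8 | 5/8 |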
Therefore the ProSMoS results reflect not only the performance of ProSMoS, but also our construction of the relevant query matrices.
We note that our results here differ significantly from those in Table 2 of [16]: our method of constructing this table is similar, but not identical, to that of [16]; we have used slightly different queries (with the exception of the β-grasp query, where we used the metamatrix described in [16]); and different versions of the software and a different database have been used. Consistent with [16], SSM finds the fewest matches. In our results, however, ProSMoS does not always return the greatest number of matches: sometimes TOPS does, since we are using a version of TOPS that computes scores for all matches, rather than the precomputed "classic" structure patterns.
TOPS also tends to have more false positives than ProSMoS or our method, that is, superfamilies found by TOPS that are not considered by the SCOP descriptions to contain the fold in question. This is consistent with TOPS being a purely topological method, which does not take account of other structural properties. Sometimes this also results in TOPS finding true positives which the other methods do not. For example, when using the ferredoxin query, only TOPS finds the monooxygenase (hydroxylase) regulatory protein superfamily, d.137.1, which SCOP describes as having "some topological similarity to the ferredoxin-like fold". SCOP also notes in the family description for d.137.1.1 that "the solution structure determinations disagree in the relative orientations of two motifs", so topological similarity without taking into account more detailed structural similarity (specifically, SSE orientation, as used by our method and ProSMoS) is a more appropriate method to find matches to this structure. This is reflected in the relatively better performance of TOPS, and the previously discussed poor performance of our method, on this query.
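Table 8. Comparison of the unique hits from each method for substructure searching.

| Fold | SCOP sid | Pu | Su | Tu | Qu | R | Pu/R | Su/R | Tu/R | Qu/R |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| β-grasp | d1ubia_ | 17 | 0 | 27 | 4 | 1 | 1/1 | 0/1 | 0/1 | 0/1 |
| Key-barrel | d1tttb1 | 4 | 0 | 0 | 10 | 1 | 1/1 | 0/1 | 0/1 | 0/1 |
| Immunoglobulin | d1ae6h1 | 25 | 0 | 2 | 0 | 9 | 7/9 | 0/9 | 2/9 | 0/9 |
| Plait (ferredoxin) | d1bhne_ | 8 | 0 | 48 | 7 | 21 | 1/21 | 0/21 | 17/21 | 3/21 |
| GFP-like | d1h6rb_ | 0 | 0 | 56 | 19 | 0 | 0/0 | 0/0 | 0/0 | 0/0 |
| Jelly-roll | d2phlb1 | 0 | 0 | 15 | 11 | 9 | 0/9 | 0/9 | 7/9 | 2/9 |
| TIM-barrel | d1tima_ | 0 | 0 | 11 | 3 | 3 | 0/3 | 0/3 | 0/3 | 3/3 |
| NAD-binding fold | d1f6dc_ | 0 | 0 | 33 | 10 | 4 | 0/4 | 0/4 | 3/4 | 1/4 |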
An interesting example is the ferredoxin fold, where the performance of our method as a structural search method is relatively poor. However, as a substructure search technique, some true positives are found only by our method. Only QP tableau search finds the peptide methionine sulfoxide reductase superfamily d.58.28, the CcmK-like superfamily d.58.56, and the release factor superfamily e.38.1. The first two are members of the ferredoxin-like fold but d.58.28 is described by SCOP as having the common fold "elaborated with additional secondary structures". The release factor superfamily (e.38.1) is described by SCOP as having 4 domains, one of which is a ferredoxin-like fold.
It is important to note several caveats in interpreting Table 7 and Table 8. First, as already discussed, ProSMoS queries were manually constructed, which is not the case for the other methods. Second, ProSMoS and SSM return a set of hits for a query, whereas the other methods return a matching score between the query and every database structure. Hence, in order to construct the tables, a cutoff score needs to be chosen (see Methods). The values in the tables are therefore very sensitive to the method used to choose the cutoff score: we could find arbitrarily many superfamilies simply by decreasing the value at which a score is considered a hit. Third, as discussed in [16], the lack of explicit mention of a structure in the SCOP description does not necessarily mean the structural motif is absent.
Nonlinear matchings
Table 9. AUC for nonlinear matchings averaged over five permutations of each of the fold query tableaux.

| Fold | SCOP sid | Average AUC |
| --- | --- | --- |
| β-grasp | d1ubia_ | 0.84 |
| Key-barrel | d1tttb1 | 0.90 |
| Immunoglobulin | d1ae6h1 | 0.92 |
| Plait (ferredoxin) | d1bhne_ | 0.65 |
| GFP-like | d1h6rb_ | 1.00 |
| Jelly-roll | d2phlb1 | 0.90 |
| TIM-barrel | d1tima_ | 1.00 |
| NAD-binding fold | d1f6dc_ | 0.99 |
Conclusion
We have introduced an improved method of searching for protein structures with similar folds using tableaux, incorporating constraints on the distances between SSEs to improve accuracy. This method is capable of finding either matches of an entire structure to the query, or matches where the query is a substructure of a larger structure. It is capable of finding nonlinear matchings, where structurally equivalent parts do not have the same relative positions in the sequences of the two proteins. It also provides a set of corresponding SSEs, useful for manual validation of the result or as a seed for a more detailed structural alignment.
In assessing their VNS heuristic for MAX-CMO, Pelta et al. [34] ask whether it is necessary to solve MAX-CMO exactly in order to perform structure classification, and conclude that it is not: the heuristic solution is sufficient. We have shown that, consistent with previous work using the tableau representation of protein folds [18, 20], the much more coarse-grained (and hence smaller and faster to solve) tableau representation is sufficient to accurately represent protein folds and perform structure classification. Specifically, we have shown that the efficient approximation of maximally-similar subtableaux extraction by relaxed quadratic programming is able to consistently classify folds at least as accurately as the VNS heuristic for MAX-CMO. In addition, our implementation is able to do so in less time than the MSVNS3 implementation described by [34].
We have demonstrated that the accuracy of our technique assessed as a protein fold recognition method compares favorably with some existing methods, and that it is fast enough to scan protein structure databases in a practical time, unlike the exact solution using CPLEX. It is, however, not as fast as some existing methods such as SHEBA and VAST, and the TableauSearch dynamic programming approximation introduced by [18] is faster still. These methods, however, cannot be used to find substructures or nonlinear matchings.
We have also demonstrated the use of our technique as a method for searching for substructures in protein structures, and compared it with some existing techniques, including ProSMoS. Complications in objectively assessing the performance of these methods make definite conclusions in this area difficult: we can perhaps say at most that each method has different enough properties that they are all capable of finding unique hits that others miss. A structural biologist searching for matches to a motif or substructure, then, would do well to employ several of these methods rather than relying on just one. As noted by Li et al. [24], further theoretical work to find tight sufficient conditions for the QP to have an integer solution is required, although empirically an integer solution is almost always found.
Methods
We built a database of tableaux, which is a file containing the tableau representation for each structure in the database. By precomputing the tableaux in this way, only the query structure needs to have its tableau built when searching for occurrences of that structure. The search procedure is then to compute a matching score between the query tableau and each tableau in the database. Sorting the results by score allows the desired balance of sensitivity and specificity to be found by choosing a threshold score above which a match is considered a "hit" of the query to the database structure.
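The search procedure just described (score the query against every precomputed tableau, sort, then apply a threshold) can be sketched as follows. This is only an outline under stated assumptions: `search`, `score_fn`, and the toy "tableaux" (plain strings) are hypothetical stand-ins, and the toy scoring function is not the tableau matching score.

```python
def search(query, database, score_fn, threshold):
    """Score the query against every database entry and rank the results.

    database maps a structure identifier to its precomputed tableau.
    score_fn(query, tableau) is any matching score (higher is better).
    Returns the full ranked list and the identifiers scoring above threshold.
    """
    scored = [(score_fn(query, tableau), name) for name, tableau in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    hits = [name for score, name in scored if score > threshold]
    return scored, hits

# Toy stand-ins: "tableaux" are strings, the score is the number of shared symbols.
toy_database = {"d1": "ab", "d2": "abcd", "d3": "xy"}
toy_score = lambda q, t: len(set(q) & set(t))
ranked, hits = search("abc", toy_database, toy_score, threshold=1)
```

Raising or lowering `threshold` trades sensitivity against specificity without recomputing any scores, which is the property exploited when building ROC curves.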
Tableaux
An orientation matrix is a square symmetric matrix which describes the relative orientation of secondary structures in a protein; a tableau is a concise encoding of this matrix where the angles have been discretized using a double-quadrant encoding [19]. Tableaux have been found to accurately differentiate folds [20] and form the basis of the structural searching algorithm of [18].
The orientation matrix Ω, for a protein with n SSEs, is an n × n symmetric matrix. Each element ω_{ij} of Ω, −π ≤ ω_{ij} ≤ π, 1 ≤ i, j ≤ n, is the relative angle between the axes of SSEs i and j. Computing Ω therefore consists of three steps: defining the SSEs, fitting axes to the SSEs, and computing the interaxial angle between each pair of SSE axes.
The tableau is derived from the orientation matrix by a double-quadrant encoding scheme, in which the range of angles is divided into quadrants in two ways which differ in orientation by π/4, in order to prevent a small variation in angle resulting in two completely different encodings. The first quadrant encoding is labelled P, O, L, R for parallel, antiparallel, crossing-left, and crossing-right, respectively, and the second arbitrarily E, D, S, T [19].
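A double-quadrant encoding of this kind can be sketched in a few lines of Python. The quadrant boundaries, which endpoints are inclusive, and the sign convention (which side counts as crossing-left versus crossing-right, and where E, D, S, T fall) are assumptions made for illustration; they were chosen so that neighbouring codes such as OS/OT and OT/RT differ in a single character, and should be checked against [19] before use.

```python
import math

def tableau_code(omega):
    """Encode an interaxial angle in [-pi, pi] as a two-character code.

    First character: partition centred on 0, +/-pi/2 and pi (P, R, L, O).
    Second character: partition rotated by pi/4 relative to the first (E, D, T, S).
    All boundaries here are illustrative assumptions.
    """
    if not -math.pi <= omega <= math.pi:
        raise ValueError("angle must lie in [-pi, pi]")
    quarter = math.pi / 4
    half = math.pi / 2

    if -quarter < omega <= quarter:
        first = "P"   # parallel
    elif quarter < omega <= 3 * quarter:
        first = "R"   # crossing-right (assumed sign)
    elif -3 * quarter < omega <= -quarter:
        first = "L"   # crossing-left (assumed sign)
    else:
        first = "O"   # antiparallel

    if 0 < omega <= half:
        second = "D"
    elif half < omega <= math.pi:
        second = "T"
    elif -half < omega <= 0:
        second = "E"
    else:
        second = "S"
    return first + second

# A small change in angle across a boundary changes only one character.
codes = [tableau_code(a) for a in (0.78, 0.79, 2.0, 3.0, -3.0)]
```

Because the two partitions never share a boundary, any small perturbation of the angle changes at most one of the two characters, which is the purpose of the scheme.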
Quadratic integer programming formulation of extraction of maximally-similar subtableaux
The extraction of maximally-similar tableaux by quadratic integer programming (QIP) was described by Konagurthu et al. [18]. We use the same formulation:
Let Ω_{A} = (ω^{A}_{ij}), 1 ≤ i, j ≤ N_{A}, be the orientation matrix for protein/structure A with N_{A} SSEs and Ω_{B} = (ω^{B}_{ij}), 1 ≤ i, j ≤ N_{B}, the orientation matrix for protein B with N_{B} SSEs. Similarly let T_{A} and T_{B} be the corresponding tableaux.
Define Boolean variables x_{ij}, 1 ≤ i ≤ N_{A}, 1 ≤ j ≤ N_{B}, where x_{ij} = 1 indicates that the i-th SSE in structure A is matched with the j-th SSE in structure B.
The discrete tableau scoring function ζ distinguishes the case where the two tableau codes are identical from the case where they differ in only one quadrant, for example OS and OT, or OT and RT.
Then the QIP is:
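Written out from the definitions and constraint descriptions in this section, the program has the following shape. This is a reconstruction: the notation T^{A}_{ik} for a tableau entry, the summation ranges, and the placement of the equation numbers are assumptions, although the numbers are chosen to agree with the references to the objective (4) and constraints (5) and (6) in the text.

```latex
\begin{align}
\text{maximize}\quad
  & \sum_{i=1}^{N_A} \sum_{j=1}^{N_B} \sum_{k=1}^{N_A} \sum_{l=1}^{N_B}
    \zeta\!\left(T^{A}_{ik},\, T^{B}_{jl}\right) x_{ij}\, x_{kl} \tag{4} \\
\text{subject to}\quad
  & \sum_{j=1}^{N_B} x_{ij} \le 1, \qquad 1 \le i \le N_A, \tag{5} \\
  & \sum_{i=1}^{N_A} x_{ij} \le 1, \qquad 1 \le j \le N_B, \tag{6} \\
  & x_{ij} \in \{0, 1\}, \qquad 1 \le i \le N_A,\; 1 \le j \le N_B. \notag
\end{align}
```

The objective rewards every pair of matches (i to j, k to l) whose tableau entries agree, so it has no linear term in x.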
Constraints (5) and (6) ensure each SSE in one tableau is matched with at most one SSE in the other. We introduce a further condition that two SSEs of different types (for example an α-helix and a β-strand) should not be matched, by assigning a low score to such a matching, for which we use the SSE type information encoded on the diagonal of the tableau or Ω matrix.
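For very small tableaux the integer program can be solved by exhaustive enumeration, which makes the roles of the objective and the "at most one match" constraints concrete. The sketch below is illustrative only: the score values (2, 1, -2), the type-mismatch penalty (-10), and the data layout (a separate list of SSE types rather than a tableau diagonal) are hypothetical choices, not the values used in the method.

```python
def zeta(code_a, code_b, same=2, close=1, different=-2):
    """Discrete score: identical codes, codes differing in one character, or neither.
    The numeric values are hypothetical placeholders."""
    if code_a == code_b:
        return same
    if sum(a != b for a, b in zip(code_a, code_b)) == 1:
        return close
    return different

def best_matching(types_a, tab_a, types_b, tab_b, mismatch=-10):
    """Enumerate every matching in which each SSE is used at most once."""
    n_a, n_b = len(types_a), len(types_b)
    best = (float("-inf"), {})

    def objective(match):
        total = 0
        for i, j in match.items():
            if types_a[i] != types_b[j]:
                total += mismatch          # low score for helix-to-strand matches
            for k, l in match.items():
                if i != k:
                    total += zeta(tab_a[i][k], tab_b[j][l])
        return total

    def extend(i, match):
        nonlocal best
        if i == n_a:
            score = objective(match)
            if score > best[0]:
                best = (score, dict(match))
            return
        extend(i + 1, match)               # leave SSE i of A unmatched
        for j in range(n_b):
            if j not in match.values():    # each SSE of B used at most once
                match[i] = j
                extend(i + 1, match)
                del match[i]

    extend(0, {})
    return best

# Query: two antiparallel strands. Target: a helix followed by two antiparallel strands.
types_a, tab_a = ["E", "E"], [[None, "OT"], ["OT", None]]
types_b = ["H", "E", "E"]
tab_b = [[None, "PE", "LE"], ["PE", None, "OT"], ["LE", "OT", None]]
score, matching = best_matching(types_a, tab_a, types_b, tab_b)
```

The enumeration grows factorially with the number of SSEs, which is one way to see why the exact problem is impractical for database scans and why a relaxation is attractive.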
Without a further condition requiring matched SSEs to keep the same sequential order, nonlinear matchings are found.
Here D^{A} and D^{B} are the SSE midpoint distance matrices for structures A and B, indexed by 1 ≤ i, k ≤ N_{A} and 1 ≤ j, l ≤ N_{B} respectively. These are square symmetric matrices, of the same dimensions as the orientation matrices and tableaux, where each entry is the distance (in Ångströms) between the centroids of the C_{α} atoms used in computing the respective SSEs' axes. We use the value τ = 4.0 Å for the distance difference threshold. This value was found empirically to give good results after testing various values between 2.0 Å and 8.0 Å on a subset of the queries in Table 1. As with tableaux, these distance matrices are precomputed and stored as a triangle with SSE information on the main diagonal. As before, we do not implement this constraint directly, but instead penalize the objective function when it is violated.
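One way to picture "penalizing the objective when the constraint is violated" is to replace the orientation score of a pair of matches by a penalty value whenever the two SSE midpoint distances differ by more than τ. The sketch below is an assumption about the mechanism, not a description of the actual implementation: the function name and the default `penalty` value are hypothetical, and only τ = 4.0 Å comes from the text.

```python
def penalized_score(base_score, dist_a, dist_b, tau=4.0, penalty=0.0):
    """Return base_score unless the distance difference exceeds tau.

    dist_a: distance between SSEs i and k in structure A (Angstroms).
    dist_b: distance between SSEs j and l in structure B (Angstroms).
    penalty: hypothetical value substituted when the threshold is exceeded.
    """
    if abs(dist_a - dist_b) > tau:
        return penalty
    return base_score

# Two SSE pairs with similar separations keep their orientation score ...
kept = penalized_score(2, 10.0, 12.5)
# ... while pairs whose separations differ by more than tau are penalized.
dropped = penalized_score(2, 10.0, 15.0)
```

Folding the threshold into the score keeps the constraint set unchanged, so the structure of the program (and of its constraint matrix) is the same with or without distance information.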
Relaxed quadratic programming formulation and solution by interior point method
The QIP just described is NP-hard. Even though the instances are quite small, direct solving with CPLEX is too slow for practical use in searching a structure database [18]. A solution to this problem is provided by the work of Li et al. [24], whose formulation of biological network alignment is strikingly similar to the QIP for extracting maximally-similar tableaux. They show that the constraints (5)–(6) are totally unimodular, allowing the QIP to be relaxed to a quadratic program (QP) by removing the integrality constraints on the Boolean variables x_{ij}, and that the QP will have an integer solution under certain conditions. This allows this (nonconvex) QP to be solved with an efficient interior ellipsoidal trust region method [47–49].
In the standard form of the QP, Q is the symmetric n × n objective matrix, A is the m × n constraint left-hand-side matrix, b is the m × 1 constraint right-hand-side vector, c is the n × 1 objective vector, and x is the n × 1 solution vector.
In expressing the maximally-similar subtableaux QIP (4)–(6) in standard form, the vector c is zero as there is no linear term in the QIP objective function (4). The coefficient matrix A contains only 1s and 0s, since the constraints (5) and (6) are all of the form ∑ x_{ij} ≤ 1 (summed over one of the two indices); hence A is totally unimodular as shown by Li et al. [24]. The objective matrix Q contains the values of the scoring function ζ; these values are simply negated to transform the maximization problem (4) to the minimization problem (10).
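The mapping from the matching problem to the matrices Q, A, b and c can be made concrete with a short sketch. The flattening of the pair (i, j) into a single variable index, the function names, and the toy scoring function are assumptions for illustration; the sketch builds dense matrices and does not include a solver or any slack variables a particular solver might need.

```python
def build_qp(score, n_a, n_b):
    """Build Q, A, b, c for the relaxed matching problem.

    score(i, j, k, l) is the score for matching SSE i to j together with k to l.
    Variable x_ij is stored at flat position i * n_b + j (an assumed layout).
    """
    n = n_a * n_b

    def idx(i, j):
        return i * n_b + j

    # Negated scores turn the maximization into a minimization.
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n_a):
        for j in range(n_b):
            for k in range(n_a):
                for l in range(n_b):
                    Q[idx(i, j)][idx(k, l)] = -score(i, j, k, l)

    # One row per SSE of A, then one row per SSE of B; entries are only 0 or 1.
    A = [[0] * n for _ in range(n_a + n_b)]
    for i in range(n_a):
        for j in range(n_b):
            A[i][idx(i, j)] = 1          # SSE i of A matched at most once
            A[n_a + j][idx(i, j)] = 1    # SSE j of B matched at most once

    b = [1] * (n_a + n_b)                # right-hand side of "at most one"
    c = [0.0] * n                        # no linear term in the objective
    return Q, A, b, c

# Toy score: reward any pair of matches that uses distinct SSEs on both sides.
toy = lambda i, j, k, l: 1.0 if (i != k and j != l) else 0.0
Q, A, b, c = build_qp(toy, 2, 3)
```

Each variable appears in exactly two constraint rows, one for its SSE in A and one for its SSE in B, which is the bipartite structure underlying the total unimodularity argument.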
The choice of 0 and 1 as the penalties in conjunction with the discrete tableau scoring function ζ (3) was found empirically to give good results.
In common with Li et al. [24], we find that although the sufficient conditions described in the Supplementary Materials of [24] are not always met, an integer solution is nevertheless almost always obtained.
Evaluation
We computed tableaux for all 15273 domains in the 95% sequence identity nonredundant subset of the ASTRAL SCOP 1.73 database [14, 29]. Unless otherwise stated, all queries discussed in the results, other than those for comparison with MAX-CMO using the Fischer or Nh3D data sets, were against this database of tableaux.
The larger scale query set is a set of 200 queries chosen from the ASTRAL SCOP 1.73 95% sequence identity nonredundant data set. The queries were chosen so that each class (α, β, α/β, α + β) is represented in the query set in the same ratio as it is in the database. The list of queries is available with the source code and other data as described in the Availability section.
The Fischer data set, described in Table 2 of [35], consists of 68 proteins. Several PDB identifiers in this table have since been obsoleted, and we replaced these with their new versions according to the RCSB PDB website [50, 15]. As was done by [34], we performed an all-against-all comparison in this data set, including redundant comparisons, resulting in 4624 comparisons.
The Nh3D v3.0 data set [36] consists of 806 structures, each representing a different CATH [37] topology. We performed the same 58838 comparisons as [34] by comparing each of the 73 structures listed in the Supplementary Material of [34] against every structure in the Nh3D v3.0 data set.
In the normalization functions norm1, norm2 and norm3, score is the overlap value or tableau matching score for MSVNS3 or tableau search, respectively, and size is the number of contacts or number of SSEs for MSVNS3 or tableau search, respectively.
We evaluated the accuracy of structural search by counting a hit (a score above the threshold) as correct (a true positive) if the structure is in the same SCOP fold as the query structure, and incorrect (a false positive) otherwise. By using SCOP as the gold standard in this way, large scale automatic evaluation on a large number of different queries is possible.
For the Fischer data set, we evaluated at both the fold and class level. At the fold level, a true positive is counted when the score is above the current cutoff and the two structures are in the same fold according to Table 2 of [35]; similarly for the class level. For the Nh3D data set, we evaluated at both the architecture and class levels in CATH. At the architecture level, a true positive is counted when the score is above the current cutoff and the two structures have the same CATH architecture identifier and the same CATH class identifier. At the class level, they need only have the same class identifier.
Evaluation of the accuracy of substructure queries is more challenging, since we require as our gold standard a database of structures that contain a motif as a substructure. By using d1ubia_, an exemplar of the β-grasp fold, as the query, we used the data from Table 1 of [16] as the gold standard. A hit is considered a true positive if it is in the same SCOP superfamily as the exemplars listed in Table 1 of [16] for the β-grasp core and gregarious fold [51] categories, or if it is one of the structures considered by [16] to contain the β-grasp motif by structural drift [52].
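For each score threshold, the true positive rate is TPR = TP/(TP + FN), where TP is the number of true positives and FN is the number of false negatives, and the false positive rate is FPR = FP/(FP + TN), where FP is the number of false positives and TN is the number of true negatives. We then construct a ROC curve by plotting the TPR against the FPR for all values of the score threshold. The area under the ROC curve (AUC) is an overall measure of the quality of a classification method; a perfect classifier has AUC = 1.0, and a random classifier has AUC = 0.5. We approximate the AUC by the trapezium integration rule.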
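As an illustration of the ROC construction and the trapezium rule, the sketch below computes the (FPR, TPR) points from a list of scores and 0/1 gold-standard labels and integrates under them. It uses the standard definitions TPR = TP/(TP + FN) and FPR = FP/(FP + TN); the function names are hypothetical, and this is not the ROCR code used to produce the published curves.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per distinct score threshold.

    labels are 1 for gold-standard positives and 0 for negatives;
    both classes must be present.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Lowering the threshold admits hits in order of decreasing score.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezium(points):
    """Integrate TPR over FPR with the trapezium rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

Handling ties as a single step gives a diagonal segment on the curve, so a set of entirely tied scores yields AUC = 0.5, as expected for an uninformative classifier.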
When multiple queries are assessed in one ROC curve, as in the Fischer and Nh3D data sets, and the 200 query set in the 95% sequence identity nonredundant subset of the ASTRAL SCOP 1.73 database, all the scores are combined (after normalization), with each labelled as either a positive or a negative according to the appropriate gold standard. The ROC curves were then plotted with the ROCR package [53] in R [54], and the AUC and its standard error, when reported, were calculated by the Hanley-McNeil method [55].
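For reference, the closed-form standard error approximation usually attributed to Hanley and McNeil [55] can be sketched as below. Whether the reported values were obtained with exactly this expression is an assumption on our part; the function name is hypothetical.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUC (Hanley-McNeil closed form).

    auc   : area under the ROC curve, in [0, 1]
    n_pos : number of gold-standard positives
    n_neg : number of gold-standard negatives
    """
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)
```

The standard error shrinks as the numbers of positives and negatives grow, which is why pooling all queries into one ROC curve gives tighter estimates than per-query curves.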
For comparisons with other methods, SHEBA version 3.1.1, VAST downloaded from [56], ProSMoS downloaded from [57], and the TOPS matching software downloaded from [58] were used. TableauSearch was supplied by Dr Arun Konagurthu (personal communication). The authors' implementation of the VNS heuristic for MAX-CMO [34], MSVNS4MaxCMO, was downloaded from [59]. We used MSVNS3, the best performing version of the heuristic according to [34], for all tests.
For MAX-CMO, we generated contact maps for each structure with a threshold of 7.0 Å and sequence separation of 2 residues using a modified version of PConPy [60]. For QP tableau search, we generated tableaux and distance matrices for each structure with our own implementation of the tableau creation algorithm, including π and 3_{10} helices and using DSSP to define secondary structure elements. We built the TOPS database for the ASTRAL SCOP 1.73 95% sequence identity nonredundant subset using TOPS downloaded from [61] (July 2007) with default parameters (DSSP is used to define SSEs).
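The contact-map definition can be illustrated with a small sketch. Several details here are assumptions rather than a description of the modified PConPy: one representative atom (for example Cα) per residue, a contact when the distance is at most the threshold, and "sequence separation of 2" read as |i − j| ≥ 2. The function name is hypothetical.

```python
import math


def contact_map(coords, threshold=7.0, min_separation=2):
    """List residue pairs (i, j), i < j, that are in contact.

    coords         : one (x, y, z) tuple per residue, in sequence order
    threshold      : maximum distance in Angstroms for a contact
    min_separation : minimum difference in sequence index
    """
    contacts = []
    n = len(coords)
    for i in range(n):
        # Skip pairs that are too close in sequence.
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= threshold:
                contacts.append((i, j))
    return contacts


# Four residues on a straight line, 3.0 Angstroms apart.
example_contacts = contact_map([(0.0, 0.0, 0.0), (3.0, 0.0, 0.0),
                                (6.0, 0.0, 0.0), (9.0, 0.0, 0.0)])
```

The number of contacts returned for a structure is the "size" used when normalizing MSVNS3 overlap values.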
For the comparison with SSM, the SSM webserver [62] running SSM v2.36 and searching the SCOP 1.73 database was used, with default parameters. The search was restricted to the 95% sequence identity nonredundant subset by uploading the relevant ASTRAL SCOP identifier list as the list of SCOP 1.73 codes for the target.
For the comparison with ProSMoS, we found that the query meta-matrices produced by the scripts included with ProSMoS applied to the query structures resulted in no hits, even when extensively edited to make them less specific. Therefore, we manually constructed the query meta-matrices based on the following information:

DSSP SSE assignments

automatically generated topology cartoons

3D structure as shown by PyMOL [63]

SCOP description of the fold

the list file generated by the ProSMoS matrix generation scripts for the query structure.
For each matching we compute a Z-score, Z = (s − μ)/σ, where s is the score assigned to the matching, μ is the arithmetic mean of the scores for all database structures for that query, and σ is their standard deviation. For both our method and TOPS, we define a significant hit as a matching with Z ≥ 3.0, a value found empirically to give a reasonable number of hits without excluding too many true positives among our set of example queries.
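The significance filter described above can be sketched as follows, using the usual Z-score Z = (s − μ)/σ computed over all database scores for one query. The choice of the population (rather than sample) standard deviation is an assumption, and the function names are hypothetical.

```python
import statistics


def z_scores(scores):
    """Z-score of each database score for a single query."""
    mu = statistics.mean(scores)
    # Population standard deviation; an assumption, see lead-in.
    sigma = statistics.pstdev(scores)
    if sigma == 0:
        raise ValueError("all scores are identical; Z-score is undefined")
    return [(s - mu) / sigma for s in scores]


def significant_hits(scores, z_cutoff=3.0):
    """Indices of database structures whose Z-score reaches the cutoff."""
    return [i for i, z in enumerate(z_scores(scores)) if z >= z_cutoff]


# Nineteen background scores and one outlier.
example_hits = significant_hits([1.0] * 19 + [10.0])
```

Because μ and σ are computed per query, the same cutoff can be applied to queries whose raw scores are on different scales.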
Implementation
We implemented Python scripts for creating tableaux, building the tableaux database, evaluating results against SCOP, and converting search output for visualization with PyMOL. Our implementation of the tableau creation algorithm optionally allows a specified list of SSEs in the structure to be represented in the tableau, rather than all SSEs in the structure, in order to generate tableaux for substructure queries. We used the BioPython library [64] and the Bio.PDB file parsing and structure class [65] to parse PDB files and the Bio.SCOP interface [66] to read SCOP and ASTRAL data. We reimplemented the QP solving algorithm [47, 48], originally implemented in MATLAB [67], in Fortran 77 with the BLAS [68] and LAPACK [69] libraries for dense matrices, and the UMFPACK 5.2 [38–41] library for sparse matrices. The tableau searching program itself was written in Fortran 77.
Availability
Source code, data sets, and executable binaries are available from http://www.cs.mu.oz.au/~astivala/qpprotein/.
Declarations
Acknowledgements
Discussions with Dr Arun Konagurthu have greatly assisted our work. Dr Konagurthu also supplied us with the source code for TableauCreator and TableauSearch. AS is supported by an Australian Postgraduate Award.
References
1. Holm L, Sander C: Mapping the Protein Universe. Science 1996, 273: 595–602.
2. Konagurthu AS, Whisstock JC, Stuckey PJ, Lesk AM: MUSTANG: A Multiple Structural Alignment Algorithm. Proteins 2006, 64: 559–574.
3. Madej T, Gibrat JF, Bryant SH: Threading a Database of Protein Cores. Proteins 1995, 23: 356–369.
4. Gibrat JF, Madej T, Bryant SH: Surprising similarities in structure comparison. Curr Opin Struct Biol 1996, 6(3):377–385.
5. Singh AP, Brutlag DL: Hierarchical Protein Structure Superposition using both Secondary Structure and Atomic Representations. Proc Int Conf Intell Syst Mol Biol 1997, 5: 284–293.
6. Shapiro J, Brutlag D: FoldMiner: Structural motif discovery using an improved superposition algorithm. Protein Science 2004, 13: 278–294.
7. Krissinel E, Henrick K: Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr 2004, D60: 2256–2268.
8. Gilbert D, Westhead D, Nagano N, Thornton J: Motif-based searching in TOPS protein topology databases. Bioinformatics 1999, 15(4):317–326.
9. Torrance GM, Gilbert DR, Michalopoulos I, Westhead DW: Protein structure topological comparison, discovery and matching service. Bioinformatics 2005, 21(10):2537–2538.
10. Carugo O, Pongor S: Protein Fold Similarity Estimated by a Probabilistic Approach Based on C^{α}-C^{α} Distance Comparison. J Mol Biol 2002, 315: 887–898.
11. Gáspári Z, Vlahovicek K, Pongor S: Efficient recognition of folds in protein 3D structures by the improved PRIDE algorithm. Bioinformatics 2005, 21(15):3322–3323.
12. Kirillova S, Carugo O: Progress in the PRIDE technique for rapidly comparing protein three-dimensional structures. BMC Res Notes 2008, 1: 44.
13. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures. J Mol Biol 1995, 247: 536–540.
14. Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 2008, (36 Database):D419–D425.
15. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242.
16. Shi S, Zhong Y, Majumdar I, Krishna SS, Grishin NV: Searching for three-dimensional secondary structural patterns in proteins with ProSMoS. Bioinformatics 2007, 23(11):1331–1338.
17. Sierk ML, Pearson WR: Sensitivity and selectivity in protein structure comparison. Protein Sci 2004, 13: 773–785.
18. Konagurthu AS, Stuckey PJ, Lesk AM: Structural Search and Retrieval using a Tableau Representation of Protein Folding Patterns. Bioinformatics 2008, 24(5):645–651.
19. Lesk AM: Systematic representation of folding patterns. J Mol Graphics 1995, 13: 159–164.
20. Kamat AP, Lesk AM: Contact Patterns Between Helices and Strands of Sheet Define Protein Folding Patterns. Proteins 2007, 66: 869–876.
21. ILOG CPLEX [http://www.ilog.com/products/cplex]
22. Abyzov A, Ilyin VA: A comprehensive analysis of non-sequential alignments between all protein structures. BMC Struct Biol 2007, 7: 78.
23. Needleman SB, Wunsch CD: A General Method Applicable to the Search for Similarities in the Amino Acid Sequences of Two Proteins. J Mol Biol 1970, 48: 443–453.
24. Li Z, Zhang S, Wang Y, Zhang XS, Chen L: Alignment of molecular networks by integer quadratic programming. Bioinformatics 2007, 23(13):1631–1639.
25. Sam V, Tai CH, Garnier J, Gibrat JF, Lee B, Munson PJ: ROC and confusion analysis of structure comparison methods identify the main causes of divergence from manual protein classification. BMC Bioinformatics 2006, 7: 206.
26. Kabsch W, Sander C: Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers 1983, 22: 2577–2637.
27. Frishman D, Argos P: Knowledge-Based Protein Secondary Structure Assignment. Proteins 1995, 23: 566–579.
28. Majumdar I, Krishna SS, Grishin NV: PALSSE: A program to delineate linear secondary structural elements from protein structures. BMC Bioinformatics 2005, 6: 202.
29. Chandonia JM, Hon G, Walker NS, Conte LL, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Res 2004, (32 Database):D189–D192.
30. Jung J, Lee B: Protein structure alignment using environmental profiles. Protein Eng 2000, 13(8):535–543.
31. Caprara A, Carr R, Istrail S, Lancia G, Walenz B: 1001 Optimal PDB Structure Alignments: Integer Programming Methods for Finding the Maximum Contact Map Overlap. J Comput Biol 2004, 11: 27–52.
32. Caprara A, Lancia G: Structural Alignment of Large-Size Proteins via Lagrangian Relaxation. In Proceedings of the Sixth Annual International Conference on Computational Molecular Biology (RECOMB '02). ACM Press; 2002:100–108.
33. Xie W, Sahinidis NV: A Branch-and-Reduce Algorithm for the Contact Map Overlap Problem. In Proceedings of the Tenth Annual International Conference on Computational Molecular Biology (RECOMB '06), Volume 3909 of Lecture Notes in Bioinformatics. Edited by: Apostolico A, Guerra C, Istrail S, Pevzner P, Waterman M. Venice, Italy: Springer; 2006:516–529.
34. Pelta DA, González JR, Vega MM: A simple and fast heuristic for protein structure comparison. BMC Bioinformatics 2008, 9: 161.
35. Fischer D, Elofsson A, Rice D, Eisenberg D: Assessing the performance of fold recognition methods by means of a comprehensive benchmark. Pac Symp Biocomput 1996, 300–318.
36. Thiruv B, Quon G, Saldanha SA, Steipe B: Nh3D: A reference dataset of non-homologous protein structures. BMC Struct Biol 2005, 5: 12.
37. Pearl F, Todd A, Sillitoe I, Dibley M, Redfern O, Lewis T, Bennett C, Marsden R, Grant A, Lee D, Akpor A, Maibaum M, Harrison A, Dallman T, Reeves G, Diboun I, Addou S, Lise S, Johnston C, Sillero A, Thornton J, Orengo C: The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucleic Acids Res 2005, (33 Database):D247–D251.
38. Davis TA, Duff IS: An Unsymmetric-Pattern Multifrontal Method for Sparse LU Factorization. SIAM J Matrix Anal Appl 1997, 18: 140–158.
39. Davis TA, Duff IS: A Combined Unifrontal/Multifrontal Method for Unsymmetric Sparse Matrices. ACM Trans Math Software 1999, 25: 1–20.
40. Davis TA: Algorithm 832: UMFPACK V4.3 – An Unsymmetric-Pattern Multifrontal Method. ACM Trans Math Software 2004, 30(2):196–199.
41. Davis TA: A Column Pre-Ordering Strategy for the Unsymmetric-Pattern Multifrontal Method. ACM Trans Math Software 2004, 30(2):165–195.
42. Elliott PR, Pei XY, Dafforn TR, Lomas DA: Topography of a 2.0 Å structure of α_{1}-antitrypsin reveals targets for rational drug design to prevent conformational disease. Protein Science 2000, 9: 1274–1281.
43. Koo BK, Jung J, Jung H, Nam HW, Kim YS, Yee A, Lee W: Solution structure of the hypothetical novel-fold protein TA0956 from Thermoplasma acidophilum. Proteins 2007, 69(2):444–447.
44. Guerler A, Knapp EW: Novel protein folds and their nonsequential structural analogs. Protein Science 2008, 17: 1374–1382.
45. Kolbeck B, May P, Schmidt-Goenner T, Steinke T, Knapp EW: Connectivity independent protein-structure alignment: a hierarchical approach. BMC Bioinformatics 2006, 7: 510.
46. GANGSTA+ [http://gangsta.chemie.fu-berlin.de]
47. Ye Y, Tse E: An extension of Karmarkar's projective algorithm for convex quadratic programming. Math Program 1989, 44: 157–179.
48. Ye Y: On affine scaling algorithms for nonconvex quadratic programming. Math Program 1992, 56: 285–300.
49. Ye Y: Interior Point Algorithms: Theory and Analysis. In Wiley-Interscience Series in Discrete Mathematics and Optimization. New York: Wiley; 1997.
50. The RCSB Protein Data Bank [http://www.pdb.org]
51. Harrison A, Pearl F, Mott R, Thornton J, Orengo C: Quantifying the Similarities within Fold Space. J Mol Biol 2002, 323: 909–926.
52. Krishna SS, Grishin NV: Structural drift: a possible path to protein fold change. Bioinformatics 2005, 21(8):1308–1310.
53. Sing T, Sander O, Beerenwinkel N, Lengauer T: ROCR: visualizing classifier performance in R. Bioinformatics 2005, 21(20):3940–3941.
54. R [http://www.r-project.org]
55. Hanley JA, McNeil BJ: The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology 1982, 143: 29–36.
56. VAST [http://migale.jouy.inra.fr/outils/mig/vast]
57. ProSMoS [ftp://iole.swmed.edu/pub/ProSMoS]
58. TOPS Services at Glasgow University [http://balabio.dcs.gla.ac.uk/tops/software.html]
59. MSVNS4MaxCMO [http://modo.ugr.es/jrgonzalez/msvns4maxcmo]
60. Ho HK, Kuiper MJ, Kotagiri R: PConPy – a Python module for generating 2D protein maps. Bioinformatics 2008, 24(24):2934–2935.
61. Topology of Protein Structures [http://www.tops.leeds.ac.uk]
62. SSM [http://www.ebi.ac.uk/msd-srv/ssm/]
63. PyMOL [http://www.pymol.org]
64. BioPython [http://www.biopython.org]
65. Hamelryck T, Manderick B: PDB file parser and structure class implemented in Python. Bioinformatics 2003, 19(17):2308–2310.
66. Casbon JA, Crooks GE, Saqi MAS: A high level interface to SCOP and ASTRAL implemented in Python. BMC Bioinformatics 2006, 7: 10.
67. Matlab Programs for Optimization [http://www.stanford.edu/~yyye/matlab.html]
68. Dongarra JJ, Du Croz J, Hammarling S, Hanson RJ: An extended set of FORTRAN basic linear algebra subprograms. ACM Trans Math Software 1988, 14: 1–17.
69. Anderson E, Bai Z, Bischof C, Demmel J, Dongarra J, Du Croz J, Greenbaum A, Hammarling S, McKenney A, Ostrouchov S, Sorensen D: LAPACK Users' Guide. Philadelphia: Society for Industrial and Applied Mathematics; 1992.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.