A fast indexing approach for protein structure comparison

Background Protein structure comparison is a fundamental task in structural biology. While the number of known protein structures has grown rapidly over the last decade, searching a large database of protein structures is still relatively slow using existing methods. There is a need for new techniques which can rapidly compare protein structures whilst maintaining high matching accuracy. Results We have developed IR Tableau, a fast protein comparison algorithm which leverages the tableau representation to compare protein tertiary structures. IR Tableau compares tableaux using information retrieval style feature indexing techniques. Experimental analysis on the ASTRAL SCOP protein structural domain database demonstrates that IR Tableau achieves a speedup of two orders of magnitude over the search times of existing methods, while producing search results of comparable accuracy. Conclusion We show that it is possible to obtain very significant speedups for the protein structure comparison problem by employing an information retrieval style approach for indexing proteins. The comparison accuracy achieved is also strong, opening the way for large-scale processing of very large protein structure databases.

alignments (for example, DALI [3], SSAP [4] and MUSTANG [5]) compare protein structures at the level of residues (sometimes even atoms), and hence detect structural similarities (and differences) with high sensitivity and accuracy. However, the long running times of these methods are prohibitive for exhaustive searches across an entire database. PRIDE [6,7] has been proposed for fast recognition of folds, with reasonable accuracy, using the Cα-Cα distance profiles of a fixed range of residues. SARST [8][9][10] utilizes sequence alignment methods to compare Ramachandran codes of different proteins, and is fast enough to perform database searches. YAKUSA [11] and SHEBA [12] also compare protein structures using their one-dimensional characterizations, based either on protein backbone internal angles or on their environmental profiles. Although these methods are significantly faster than their structural alignment-based counterparts, the lack of global geometric information makes them less accurate. Several methods have also been proposed which compare proteins at the coarser level of secondary structures [13][14][15][16][17][18][19][20][21][22]. ProSMoS [20] and TableauSearch [15] both try to match the orientation between secondary structure elements (SSEs). Rather than using only angles, OPAAS [13,14,23] uses a probability-based method to align the angle-distance map of SSEs. In the main, these programs look for similarities in the geometry of interactions between the secondary structural elements in the proteins being compared.
Lesk [24] proposed the tableau as a concise representation of protein folding patterns. The tableau encodes the geometry of interactions between pairs of secondary structural elements that are in contact [24,25]. Konagurthu et al. [15] proposed three methods to identify structural similarities using a generalized tableau description of protein folding patterns. Their first method allows the identification of identical and near-identical folding patterns in constant time. The second method facilitates a rigorous comparison of two tableaux to identify maximally similar substructures using computationally expensive quadratic and linear integer programming techniques. (We note that Stivala et al. [16] recently gave a faster solution to the quadratic programming formulation of the tableau comparison problem proposed by Konagurthu et al. However, their method remains infeasible for searching entire databases.) The third method (TableauSearch) was proposed as a fast heuristic to detect similarities using a two-step dynamic programming method.
Most of the existing protein comparison techniques share a major limitation: they are computationally expensive, requiring hours or even days to search a large protein structure database. This has motivated us to develop a new and rapid protein comparison algorithm, IR Tableau, based on feature indexing techniques from Information Retrieval (IR). Our method transforms the robust tableau representation of a protein fold into a vector of features, allowing the application of several well-known similarity measures to compare these feature vectors efficiently. IR Tableau achieves excellent search efficiency (it can search the ASTRAL protein structure database, containing 83,731 domains, for a query protein in less than a second), while providing accuracy comparable to the existing methods.

Methods
This section describes our IR Tableau method, which utilizes IR techniques to index and search protein structures efficiently.

Tableau representation of protein structure
Briefly, a tableau encodes the geometry of pairs of secondary structural elements (SSEs), that is, helices and strands of sheet [24]. The relative orientation of a pair of SSEs in a protein is defined by the angle between their axes. Each angle between a pair of SSEs (in the range -180° to 180°) is encoded with a double-character scheme [15] (see Figure 1), giving 8 possible two-character codes. For example, Table 1 shows the tableau of a ubiquitin-like protein, 1UBI (chain A). We use the generalized tableau introduced in [15].

Information retrieval (IR) approach
A typical IR system aims to retrieve documents that are relevant to keywords (terms) in a user query. Each document is represented as a vector of weights, where each weight denotes the importance of a given term. Terms are usually the words used in a document, and each weight may correspond to the frequency of occurrence of a term in the document. The collection of all term weights for a document effectively describes the contents of that document. This is known as the 'bag-of-words' model. Different documents can then be compared by comparing their weight vectors.
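As an illustration of the double-character encoding, the following Python sketch maps an inter-axial angle to one of the eight codes. The octant boundaries used here (45° slices running OT, RT, RD, PD, PE, LE, LS, OS from -180° to 180°) and the treatment of boundary angles are assumptions made for illustration; the authoritative scheme is the one given in Figure 1 and [15]. The function name `encode_angle` is hypothetical.

```python
# Assumed layout: eight 45-degree slices covering [-180, 180), listed from
# -180 upwards. These boundaries are an illustrative assumption, not taken
# from the encoding figure itself.
OCTANT_CODES = ["OT", "RT", "RD", "PD", "PE", "LE", "LS", "OS"]


def encode_angle(omega_deg: float) -> str:
    """Map an inter-axial angle in degrees to a two-character tableau code."""
    # Wrap the angle into [-180, 180) so that 180 and -180 coincide.
    omega = (omega_deg + 180.0) % 360.0 - 180.0
    # Each slice is 45 degrees wide; clamp guards against rounding at the edge.
    index = min(int((omega + 180.0) // 45.0), 7)
    return OCTANT_CODES[index]


if __name__ == "__main__":
    for angle in (10.0, -100.0, 170.0, 60.0):
        print(angle, encode_angle(angle))
```

Under these assumed boundaries, a near-parallel pair (10°) maps to PE and a near-antiparallel pair (170°) maps to OS.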

Figure 1
Tableau orientation encoding scheme.
If two documents have similar weights for each term, then they are likely to be related. Since only vector comparison is used, similarity matching of vectors can be performed extremely fast.
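The bag-of-words idea can be sketched in a few lines of Python. The vocabulary, the example documents and the helper name `term_vector` are hypothetical, and raw term frequency is used as the weight, which is only one of several possible weighting schemes.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Return the term-frequency vector of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that do not occur in the document.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["protein", "structure", "index"]
doc_a = term_vector("protein structure index protein", vocabulary)
doc_b = term_vector("index structure", vocabulary)

if __name__ == "__main__":
    print(doc_a, doc_b)
```

Once every document is reduced to such a vector, comparing two documents needs no access to the original text, which is why matching is fast.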
In our protein context, we analogously translate each protein tableau into a vector of weights, where each weight describes the importance of some feature of the protein. Protein structure comparison is then performed by similarity matching of protein vectors. We next describe our technique for creating the vector for a protein structure.

Protein feature construction method
We generate a vector of features for a given protein based on its tableau representation. In effect, we translate a two-dimensional (2D) tableau into a one-dimensional (1D) vector. Each cell in a tableau describes the angle between a pair of SSEs in the protein. For example, in Table 1, the OT in <row 1, column 2> is the orientation between SSE β1 and SSE β2. To turn this tableau into a 1D vector, we summarise the distribution of angle frequencies for each possible pair of SSE types.
Each feature of our vector describes a pair of SSE types in one of eight possible orientations: PE, PD, RD, RT, OT, OS, LS, and LE. There are four possible pairs of SSE types: αα, αβ, βα and ββ. Hence each protein can be described by 4 × 8 = 32 features. The value of each feature is the frequency with which that configuration occurs in the protein.
Again, in Table 1, there are two ββ OT entries in the tableau of 1UBI, which appear at <row 1, column 2> and <row 4, column 6> in the matrix. Therefore the value for the feature ββ OT is 2. The full feature vector for this protein is given in Table 2, where each number is the frequency of one combination of SSE types and angle. We apply this feature vector transformation to every tableau in a structure database. Given a protein structural query, searching can then be performed rapidly in the new 1D feature space.
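A minimal sketch of this transformation is given below, assuming the tableau is available as a mapping from SSE index pairs to orientation codes. The ASCII letters `a` and `b` stand for helix and strand types, the data layout and the function name `tableau_to_vector` are hypothetical, and the example uses only the three entries of 1UBI quoted in the text rather than the full Table 1.

```python
ORIENTATIONS = ["PE", "PD", "RD", "RT", "OT", "OS", "LS", "LE"]
SSE_PAIRS = ["aa", "ab", "ba", "bb"]  # 'a' = helix, 'b' = strand
# Fixed feature order: 4 SSE type pairs x 8 orientations = 32 features.
FEATURES = [pair + " " + code for pair in SSE_PAIRS for code in ORIENTATIONS]


def tableau_to_vector(sse_types, tableau):
    """Count how often each (SSE type pair, orientation) occurs.

    sse_types: string such as "bba", one letter per SSE in chain order.
    tableau:   dict {(i, j): code} with 0-based indices and i < j.
    """
    counts = dict.fromkeys(FEATURES, 0)
    for (i, j), code in tableau.items():
        counts[sse_types[i] + sse_types[j] + " " + code] += 1
    return [counts[name] for name in FEATURES]


# Fragment covering the first three SSEs of 1UBI as quoted in the text.
fragment_types = "bba"
fragment_tableau = {(0, 1): "OT", (1, 2): "RT", (0, 2): "LE"}
fragment_vector = tableau_to_vector(fragment_types, fragment_tableau)

if __name__ == "__main__":
    for name, value in zip(FEATURES, fragment_vector):
        if value:
            print(name, value)
```

Because the feature order is fixed, vectors built for different proteins are directly comparable position by position.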

Similarity function
Choice of an appropriate similarity function is important for accurate comparison. In IR Tableau, there are a number of possible similarity functions which can be applied for comparing the protein vectors.
Cosine similarity [26] computes the cosine of the angle between two vectors in an N-dimensional space. A higher score implies a smaller angle between the two vectors. A value of 1 means that the two vectors have the same direction.
The Jaccard index [27] is another popular similarity function, defined as the size of the intersection divided by the size of the union of two sets:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where A and B are sets. The Tanimoto coefficient [26] is a generalization of the Jaccard index.
Euclidean distance is another means of measuring similarity of proteins. Unlike the similarity functions described above, the value for Euclidean distance is not normalized to be between 0 and 1.
Unless stated otherwise, our results in the rest of the paper assume the use of the cosine similarity function.
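The similarity functions discussed above can be sketched in Python as follows. These are the textbook definitions, written for plain lists of numbers; the function names are hypothetical, and the Tanimoto form shown (dot product divided by the sum of squared norms minus the dot product) is the common vector generalization of the Jaccard index, assumed here because the text does not spell it out.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is zero."""
    norm = math.sqrt(dot(u, u) * dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def tanimoto_coefficient(u, v):
    """Vector generalization of the Jaccard index (assumed standard form)."""
    denominator = dot(u, u) + dot(v, v) - dot(u, v)
    return dot(u, v) / denominator if denominator else 0.0


def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def euclidean_distance(u, v):
    """Unnormalized distance; smaller values mean more similar vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


if __name__ == "__main__":
    u, v = [1, 2, 0], [2, 4, 0]
    print(cosine_similarity(u, v), euclidean_distance(u, v))
```

Note that cosine similarity ignores vector length, so a protein and a hypothetical protein with every count doubled score 1, whereas their Euclidean distance is non-zero.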

Variation of featuring process
In addition to the method we have described for generating 32 features for each protein (hereafter referred to as the base method), we have also explored the value of associating additional features with each protein vector. In general, there is a trade-off between adding extra features which help discriminate between classes of proteins and adding so many features that they overwhelm accurate similarity calculation.

Alternative combinations of SSEs
In our base method, ordering information was only used for pairs of SSE types. This description loses some information about the position of each SSE. By preserving positional information about SSEs instead, we can hope to build a more accurate profile in each protein vector. Such relationships can be incorporated as in the following example. Protein 1UBI, whose tableau is shown in Table 1, has 6 SSEs: SSE 1 compared with SSE 2 is ββ OT, SSE 2 compared with SSE 3 is βα RT, and SSE 1 compared with SSE 3 is βα LE. Combining these, we get the triplet of SSE 1, SSE 2 and SSE 3, which is ββα OT RT LE. In general, we can record statistics for all triplets of the form SSE_m, SSE_{m+1} and SSE_{m+2}. (Note that the idea may also be generalized to non-consecutive triplets, such as SSE_m, SSE_{m+1} and SSE_{m+3}.) In this triplet approach, there are 2^3 = 8 SSE type combinations and 8^3 = 512 angle combinations, giving a space of 8 × 512 = 4096 possible features. This idea can be extended further to quadruplets, quintuplets and so on of SSEs to generate a larger feature space. However, the size of the protein vector grows exponentially with the number of SSEs in each combination. Another possibility is to disregard the ordering information between SSEs, which may be useful for non-linear matching of sub-structures. In this case, there are only three possible pairs of SSE types rather than four: αα, ββ and αβ. A final possibility is to consider only consecutive SSEs when generating features from a tableau [25]. This can be done by using only the ±1 off-diagonal entries.

Approximate ordering: partitioning the SSE chain
Using the exact order of SSEs as the basis for forming a protein vector can cause the vector to become very large for complex combinations. To handle this, we have investigated a strategy which uses approximate positions for each SSE, rather than exact positions. Suppose we have a protein with N SSEs. The sequence of SSEs can be partitioned into two halves along the chain. All the SSEs in the first half are given a position marker P1, and SSEs in the second half are marked P2. Then, when comparing each pair of SSEs, the position markers provide additional position information. In protein 1UBI, the first SSE compared to the last SSE will be βP1 βP2 OT. The number of features generated by this strategy is 4 SSE type pairs × 8 angles × 2^2 positions = 128 features. If a protein SSE sequence is partitioned into n parts, the number of features is 4 SSE type pairs × 8 angles × n^2 positions.
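The consecutive-triplet variant can be sketched as below, under the same assumed data layout as a pairwise tableau (a mapping from 0-based index pairs to orientation codes, with `a` and `b` standing for helix and strand). The function name `triplet_features` is hypothetical; a sparse counter is used because most of the 4096 possible features are zero for any one protein.

```python
from collections import Counter

# Size of the triplet feature space: 2^3 type combinations x 8^3 angle combinations.
TRIPLET_SPACE_SIZE = (2 ** 3) * (8 ** 3)


def triplet_features(sse_types, tableau):
    """Count features for consecutive triplets (m, m+1, m+2) of SSEs.

    Each feature joins the three SSE types with the three pairwise codes
    in the order (m, m+1), (m+1, m+2), (m, m+2).
    """
    counts = Counter()
    for m in range(len(sse_types) - 2):
        types = sse_types[m:m + 3]
        codes = (tableau[(m, m + 1)], tableau[(m + 1, m + 2)], tableau[(m, m + 2)])
        counts[types + " " + " ".join(codes)] += 1
    return counts


# The three entries of 1UBI quoted in the text.
example = triplet_features("bba", {(0, 1): "OT", (1, 2): "RT", (0, 2): "LE"})

if __name__ == "__main__":
    print(TRIPLET_SPACE_SIZE, dict(example))
```

This sketch assumes every pair within a triplet has a code in the tableau; a pair with no recorded orientation would need a placeholder code.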
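A sketch of the partition-marker features follows. How an SSE on the boundary is assigned when N is not divisible by the number of parts is not stated in the text, so the integer-division rule used in `partition_index` is an assumption, as are the function names and the SSE type string (only the first and last SSE types and the OT entry come from the text; the type given for SSE 5 is a placeholder).

```python
from collections import Counter


def partition_index(position, n_sses, n_parts):
    """Map a 0-based SSE position to a 1-based partition marker (assumed rule)."""
    return min(position * n_parts // n_sses, n_parts - 1) + 1


def partitioned_features(sse_types, tableau, n_parts=2):
    """Count (type, partition, type, partition, orientation) features."""
    n_sses = len(sse_types)
    counts = Counter()
    for (i, j), code in tableau.items():
        left = "%sP%d" % (sse_types[i], partition_index(i, n_sses, n_parts))
        right = "%sP%d" % (sse_types[j], partition_index(j, n_sses, n_parts))
        counts[left + " " + right + " " + code] += 1
    return counts


def feature_space_size(n_parts):
    """4 SSE type pairs x 8 angles x n^2 position pairs."""
    return 4 * 8 * n_parts ** 2


# First SSE versus last SSE of a 6-SSE protein, as in the 1UBI example.
example = partitioned_features("bbabab", {(0, 5): "OT"})

if __name__ == "__main__":
    print(dict(example), feature_space_size(2))
```

With two partitions the feature space has 128 entries, and it grows quadratically in the number of partitions rather than exponentially as in the exact-order combinations.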

Datasets
For our experimental evaluation, we use:
1. The entire ASTRAL 1.73 [28] protein domain database. All 97,169 protein domains in this data set are processed through the tableau generator program of Konagurthu et al. [24]. The program successfully generated 83,731 tableaux of protein domains covering 1077 different SCOP folds. Using these tableaux, our index database is generated in a single preprocessing step.
2. The ASTRAL 1.73 95% sequence-identity non-redundant data set. We generate our index database from the tableau data set published by Stivala et al. [16], containing 15,169 entries.
We also use the query data set of Stivala et al. [16], containing 200 randomly chosen protein domains. Each run using a query returns a list containing all proteins in the respective index database along with the associated scores.

Evaluation methodology
All experiments were conducted on an Intel Core 2 Duo 2.4 GHz processor running Ubuntu 9.10 Linux. IR Tableau was implemented in Java. The SCOP [29] fold classification is used as the gold standard when assessing the accuracy of each search. We use the Receiver Operating Characteristic (ROC) curve, the Area Under the ROC Curve (AUC), the Precision-Recall curve and the Mean Average Precision (MAP) to gauge accuracy.
Given a query protein $P_q$ which belongs to the SCOP fold $F_q$, let us consider the top k proteins returned by the search as hits and the remainder as misses. For the ith returned protein $P_{r_i}$ belonging to the SCOP fold $F_{r_i}$, if $F_{r_i} = F_q$ and $i \le k$ then $P_{r_i}$ is a true positive (TP). On the other hand, if $F_{r_i} \ne F_q$ and $i \le k$ then $P_{r_i}$ is a false positive (FP). If $F_{r_i} \ne F_q$ and $i > k$ then $P_{r_i}$ is treated as a true negative (TN). Otherwise, $P_{r_i}$ is a false negative (FN). Using the above statistics, we can then compute the true positive rate (TPR or recall), false positive rate (FPR) and positive predictive value (PPV or precision) using the following formulae:
$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}, \qquad \mathrm{PPV} = \frac{TP}{TP + FP}$$
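The counts and rates defined above can be computed from a ranked result list as in the following sketch. The input is assumed to be the list of SCOP fold labels of the returned proteins in rank order; the function names and the toy fold labels are hypothetical. The `average_precision` helper uses the usual IR definition (mean of the precision values at the ranks where a relevant protein occurs), which is an assumption since the text does not define it.

```python
def rates_at_k(ranked_folds, query_fold, k):
    """Return TPR, FPR and PPV when the top k results are treated as hits."""
    hits, misses = ranked_folds[:k], ranked_folds[k:]
    tp = sum(fold == query_fold for fold in hits)
    fp = len(hits) - tp
    fn = sum(fold == query_fold for fold in misses)
    tn = len(misses) - fn

    def ratio(numerator, denominator):
        # Guard against empty classes, e.g. k = 0 or no relevant proteins.
        return numerator / denominator if denominator else 0.0

    return {
        "TPR": ratio(tp, tp + fn),
        "FPR": ratio(fp, fp + tn),
        "PPV": ratio(tp, tp + fp),
    }


def average_precision(ranked_folds, query_fold):
    """Mean of the precision values at each rank holding a relevant protein."""
    precisions = []
    found = 0
    for rank, fold in enumerate(ranked_folds, start=1):
        if fold == query_fold:
            found += 1
            precisions.append(found / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy ranking with hypothetical fold labels.
toy_ranking = ["x", "y", "x", "z", "x", "y"]

if __name__ == "__main__":
    print(rates_at_k(toy_ranking, "x", 3))
    print(average_precision(toy_ranking, "x"))
```

Sweeping k from 1 to the database size and plotting FPR against TPR, or TPR against PPV, yields the ROC and Precision-Recall curves respectively; averaging `average_precision` over all queries gives a MAP value under the stated assumption.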

Results and discussion
In this section we first compare our IR Tableau against several popular methods for protein structure comparison.
ProSMoS, OPAAS, SARST and some other web-server based programs are not tested, as results are not comparable. Later, we assess the sensitivity and accuracy of IR Tableau using different types of features defined in this work.

ROC curve and precision-recall curve performance
Table 4 shows the AUC for the 200-query set. Surprisingly, IR Tableau achieves the second highest AUC value, 0.948. This suggests that the protein feature vectors capture important structural information from the tableau. ROC curves for IR Tableau, TableauSearch and Yakusa are shown in Figure 2.
Yakusa has the highest TPR when the FPR is less than 0.35. After this point, IR Tableau becomes slightly better than the other two. TableauSearch is always worse than Yakusa, but better than IR Tableau when the TPR is less than 0.3. So in terms of ROC performance, IR Tableau is as good as Yakusa, but over three hundred times faster.
The Precision-Recall (PR) curves for IR Tableau, TableauSearch and Yakusa are shown in Figure 3. We note that the performance of both TableauSearch and Yakusa is better than the performance of IR Tableau (their curves are both closer to the upper right corner). The Precision-Recall curve thus exposes differences between the algorithms that were not apparent in ROC space.
The difference between the behaviour in ROC-space compared to PR-space can be explained based on the imbalance between the classes formed from the top-k results when k is small. In this circumstance, a small number of positive and a large number of negative results are returned. Therefore a difference in the absolute number of FPs only results in a small change in FPR (as seen in the ROC curves). On the other hand, the same difference in FP results in a large change of precision (as seen in PR curves). In other words, for small k, Yakusa and TableauSearch have an advantage in accuracy over IR Tableau, but as k becomes larger, all three are very similar.
This suggests that IR Tableau may be very useful to use as a hybrid technique in conjunction with one of these more computationally expensive algorithms. Under this strategy, one would first search the protein database using IR Tableau to return a relatively large set of matches and then pass these results to a second algorithm for deeper, more computationally demanding analysis and reranking of matches.
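The two-stage strategy can be sketched as follows. This is an illustrative outline rather than a tested pipeline: the index layout (a mapping from protein identifier to feature vector), the function names, the shortlist size and the `rerank_score` callback standing in for the slower second algorithm are all hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    norm = math.sqrt(sum(x * x for x in u) * sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def hybrid_search(query_vector, index, rerank_score, shortlist_size=100):
    """Filter with a cheap vector score, then rerank the shortlist.

    index:        dict {protein_id: feature_vector}
    rerank_score: callable taking a protein_id and returning the score of
                  the slower, more accurate comparison against the query.
    """
    # Stage 1: keep only the best-scoring candidates by cosine similarity.
    shortlist = heapq.nlargest(
        shortlist_size,
        index.items(),
        key=lambda item: cosine(query_vector, item[1]),
    )
    # Stage 2: run the expensive comparison on the shortlist only.
    candidate_ids = [protein_id for protein_id, _ in shortlist]
    return sorted(candidate_ids, key=rerank_score, reverse=True)


# Toy index and a stand-in for the expensive scorer (hypothetical values).
toy_index = {"p1": [1.0, 0.0], "p2": [0.9, 0.1], "p3": [0.0, 1.0]}
toy_slow_scores = {"p1": 0.2, "p2": 0.9, "p3": 1.0}
toy_result = hybrid_search([1.0, 0.0], toy_index, toy_slow_scores.get, shortlist_size=2)

if __name__ == "__main__":
    print(toy_result)
```

The shortlist size controls the trade-off: a larger shortlist reduces the chance that the first stage discards a true match, at the cost of more calls to the expensive second stage.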
We also conducted experiments on searching for commonly occurring protein folds. For SCOP domain   This is also better than the version of QP tableau with added SSE distance information, which is 0.95 [16]. The ROC and Precision-Recall curves for this protein search are shown in Figures 4 and 5 respectively.
IR Tableau achieves a superior TPR across almost all the regions in the ROC curve. For the precision-recall curve, the performance of IR Tableau is comparatively not as good for low k (low recall), but becomes comparatively better for higher k (higher recall). The mean average precision is 0.775 for IR Tableau, 0.777 for TableauSearch and 0.703 for Yakusa.
In some cases, search results with higher scores are more important than ones with lower scores. Superposition of these returned protein structures is then a very good demonstration of the quality of the top ranked proteins. For protein d1ubia_, the graph in Figure 6 shows the superposition of the top 20 proteins returned by IR Tableau.

Figure 6
Superposition graph. Superposition of the top 20 results using d1ubia as a query protein. MUSTANG [5] was used for structure alignment. This figure is generated using PyMol [30].

Testing different IR Tableau feature choices
Results of IR Tableau when partitioning proteins into N parts are shown in Table 6. As the number of partitions increases, the AUC gradually decreases. When N is between 5 and 7, the system achieves the highest MAP values. The SSE length distribution in Figure 7 shows that most proteins have 6 to 8 SSEs.
It is therefore natural that partitioning the SSE chain into 5 to 7 parts works well, since it means that we are effectively trying to match the position of each SSE exactly.
The behaviour for different similarity functions is shown in If two proteins greatly differ in size, we can expect unnormalized similarity measures will reduce the distance greatly.

Conclusion
We have introduced IR Tableau, a new algorithm for protein structure comparison. A key advantage is that it is highly scalable, being faster than existing methods by over a factor of 100. This speedup factor also increases for longer proteins. Moreover, it achieves good quality search results, obtaining AUC scores comparable to those of existing algorithms and slightly lower MAP scores. Highly efficient search algorithms will be very important for protein structure databases of the future, which may contain millions of proteins. We believe that our IR Tableau approach is very promising for such a scenario. In particular, it may be used as part of a hybrid filter approach. The user can run IR Tableau for a high-throughput scan of the database for approximate matches. The search results would then be passed to a second algorithm for deeper (and more computationally expensive) comparative analysis. Conducting experiments along these lines is an interesting avenue for future work.