Efficient protein alignment algorithm for protein search

Background Proteins show a great variety of 3D conformations, which can be used to infer their evolutionary relationship and to classify them into more general groups; therefore protein structure alignment algorithms are very helpful for protein biologists. However, an accurate alignment algorithm itself may be insufficient for effective discovering of structural relationships among tens of thousands of proteins. Due to the exponentially increasing amount of protein structural data, a fast and accurate structure alignment tool is necessary to access protein classification and protein similarity search; however, the complexity of current alignment algorithms are usually too high to make a fully alignment-based classification and search practical. Results We have developed an efficient protein pairwise alignment algorithm and applied it to our protein search tool, which aligns a query protein structure in the pairwise manner with all protein structures in the Protein Data Bank (PDB) to output similar protein structures. The algorithm can align hundreds of pairs of protein structures in one second. Given a protein structure, the tool efficiently discovers similar structures from tens of thousands of structures stored in the PDB always in 2 minutes in a single machine and 20 seconds in our cluster of 6 machines. The algorithm has been fully implemented and is accessible online at our webserver, which is supported by a cluster of computers. Conclusion Our algorithm can work out hundreds of pairs of protein alignments in one second. Therefore, it is very suitable for protein search. Our experimental results show that it is more accurate than other well known protein search systems in finding proteins which are structurally similar at SCOP family and superfamily levels, and its speed is also competitive with those systems. In terms of the pairwise alignment performance, it is as good as some well known alignment algorithms.


Background
Proteins show a great variety of 3D conformations, which are necessary to support their diverse functional roles. Protein sequences and structures have close relationship with their biological functions, while protein structures reveal more evolutionary information than protein sequences do, since the structure of a protein changes more slowly in the evolution than its sequence does. Also, researchers frequently find that proteins with low sequential similarity are structurally homogenous. Therefore it is particularly important to discover the structural similarity/dissimilarity among different proteins. The research of protein 3D structure similarity provides fundamental and very helpful tools for many biological research topics, such as predicting the functions of newly discovered proteins from the functions of known similar protein structures, identifying protein families with common evolutionary origins, and understanding the variations among different classes of proteins.
Protein structures can be determined via experimental techniques such as X-ray crystallography, Nuclear Magnetic Resonance (NMR) spectroscopy, and even cryoelectron microscopy. Due to these techniques, the number of proteins discovered by biologists has increased dramatically over the last 30 years. The rapid growth of the PDB (see Figure 2a of [1] for an illustration of the PDB growth rate from 1970's to the year 2005) necessitates the development of efficient and accurate protein structure comparison and search algorithms and automatic software tools.
Protein structural similarity can be used to infer evolutionary relationship between proteins and to classify protein structures into more general groups; therefore a good protein structure alignment algorithm is very helpful for protein biologists. However, a good alignment algorithm itself may be insufficient for effective discovering of structural relationships among tens of thousands of proteins.
Protein structure query (e.g. [10,12,13,[24][25][26][27][28][29][30][31][32][33][34][35][36]) aims to find similar structures in a protein dataset according to a given query structure. Due to the large size of protein data repositories like the PDB, protein structure query requires a very fast structure alignment tool; however, the complexity of current alignment algorithms are usually too high to make a fully alignment-based search practical. For some proteins, it may take hours to days for protein structure search engines like CE [12] and DALI [37] to return a search result; a fast and accurate protein structure query tool which enables real-time structure searching in a large dataset is still in need. To improve the search speed, in recent years many methods have been designed to reduce the query time. Baker and Dauter (2004) developed SSM [8] which uses Secondary Structure Match for the pairwise structure comparison. In addition, various linear encoding methods have been applied to protein search systems. For instance, 3D-BLAST [33] developed by Yang and Tung (2006), can improve the comparison speed thousands of times as the speed of CE and DALI. Similar methods include ProtDex2 [35], Sarst [31], and TopScan [36]. These methods improve the time performance greatly. However, when being compared with pairwise alignment methods, they have weakness in accuracy. In [34] we have developed a protein structure query algorithm and tool to find similar protein structures in the PDB for any given structure. With a combination of geometric filter and 3D structure alignment, given a query protein, the algorithm can find proteins whose structures are overall similar with the query structure in the PDB in a few minutes. The geometric filter can exclude dissimilar proteins efficiently, and reduce a lot the number of times of pairwise alignments. On the negative side, it misses some significant proteins whose structures are partially similar with the given protein.
To further improve the speed and accuracy of our protein structure query tool, in this paper we propose a very fast protein structure alignment algorithm, which is suitable to do pairwise 3D alignment with all protein structure representatives in the PDB. It can find similar proteins in the PDB (of more than 130,000 protein chains) in a short time, and avoid missing similar structures. In our experiments, some exciting results have been observed when comparing our query tool with other well known protein search engines. The experimental results show that our tool is more accurate than other systems such as CE [12], Dali [37], and SSM [8] in finding proteins that are structurally similar to the query protein, and its speed is also competitive with them.

Results and discussion
In this section, we show the experimental results for an implementation of our algorithm and its comparisons with other well known similar systems which are accessible online. The quality evaluation of protein alignment algorithm is based on its alignment length and RMSD value. The quality evaluation of protein search tool is according to its precision, recall, and query speed.

Evaluation of the alignment results
There is no general standard for analyzing and comparing the results of different alignment algorithms, because each method uses different alignment measures. Besides the two basic measures: alignment length and RMSD value, some methods also calculate a native score for their alignment results. For instances, CE and Dali use different kinds of Z-scores as their native score. SSM has the where N mat is the number of matched pairs of C a atoms, N 1 is the number of C a atoms in the first protein, and N 2 is the number of C a atoms in the second protein. The Q-score considers both alignment length and RMSD value when measuring the alignment results. Subbiah proposed a geometric match measure, in [38]. As shown by its definition, SAS k also considers both alignment length and RMSD value. Lower SAS k means better alignment result and k is the degree to which the score favors. A smaller k can be used when longer alignment length is preferred and a bigger k is for smaller RMSD value. Here we have collected 224 alignment cases and used Q-score and SAS k to test the performance of our algorithm. The test cases were originally proposed by various papers for various testing purposes. A list file available on our website shows all the 224 cases. They include No. 1 -No. 20 (see Table III in [12]), No. 21 -No. 88 (see Table I in [24]), No. 89 (see Tables I and II in [12]), No. 90 -No. 92 (supplement to Table III in [12]), No. 93 (see Figure 5 in [12]), No. 94 -No. 101 (see Table IV in [12]), No. 102 -No. 111 (see Table V in [12]), No. 112 -No. 120 (supplement to Table V in [12]), No. 121 -No. 124 (see Table VII in [12]), No. 125 -No. 143 (see Table 1 in [11]), No. 144 -No. 183 (see Table 1 in [17]) and No. 184 -No. 224 (see Table 2 in [17]). We compare our alignment results with Dali, CE and SSM. In each test case, different alignment algorithms have different results. CE and Dali always get more aligned pairs than those of our algorithm and SSM, but their accuracy is relatively lower (having larger RMSD value). So using merely aligned pairs or RMSD value as the criterion to measure the performance of alignment algorithm makes no sense. Therefore, we calculate Q-score and SAS k for the alignment results of all the methods and compare the alignment results in terms of Q-score Difference and Average SAS k . The Q-score Difference is calculated by (Q score_Ours -Q score_X ) where X is Dali, CE or SSM. And we use k = 1, 2 and 3 for the Average SAS k to analyze the quality of all the four methods. It should be mentioned that our method can output sequential alignments and non-sequential alignments, thus we compare both of them with other methods. We call the sequential method SPSA and the non-sequential method NPSA for short. Table 1 shows some statistical data based on the experimental results. Compared with Dali, CE and SSM, our algorithm has smaller average RMSD, and its average alignment length is longer than that of SSM, but shorter than that of CE and Dali. Its average SAS k is always smaller than other three algorithms, no matter k = 1, 2 or 3. A lower SAS k score indicates a better alignment. To further test our algorithm, we compare it with others by Q-score. Figure 1 reflects the Q-score Difference between our algorithm and others respectively. A black area below X-axis indicates that the Q-score of our algorithm is lower than that of the compared method. Since in each graph the upper part of the whole black area is always larger than or equal to the lower part, it is For a database search tool, the recall rate and precision are two commonly used parameters for assessing its query quality. Precision is defined as n/N and recall rate is n/T, where n is the number of true proteins (from the same family of the query protein) in the result list, N is the total number of proteins in the result list, and T is the total number of proteins in the same family of the query protein in the database. Therefore, the precision is between 0 and 1, and the quality of a ranked output list is directly based on it. The recall rate is also between 0 and 1, and the missing problem of a search engine is in relation to it. In 2004, Aung and Tan [35] collected 34,055 proteins form the ASTRAL SCOP 1.59 to form a large target database, from which 108 proteins were selected as the query proteins. These query proteins are from four main classes (All-a, All-b, a/b and a + b) of ASTRAL SCOP 1.59 and their average family size is around 80. We use these different categories of proteins to do the queries on our search engine, use Q-score as the criterion to rank the output proteins, and compare the result with that of CE [12], MAMMOTH [11], 3D-BLAST [33], PSI-BLAST [29], ProtDex2 [35], and TopScan [36]. Results for all the methods except ours are taken from [33]. Since our sequential method is much faster than its non-sequential counterpart, we just use the sequential method in our experiments. In our search system we do support both sequential and nonsequential methods.

Discussion on the alignment results
According to the 108 query results, our method based on pairwise alignment algorithm shows better performance  Figure 1 shows the Q-score Difference between our algorithm and CE [12], Dali [37], and SSM [8], respectively.
BMC Bioinformatics 2010, 11(Suppl 1):S34 http://www.biomedcentral.com/1471-2105/11/S1/S34 than others. CE and MAMMOTH are the second and third accurate methods. 3D-Blast, which is based on a linear encoding method, ranks fourth and its precision is about 5% lower than ours on average. PSI-BLAST is a classical sequence search algorithm, and its precision outperforms both TopScan and ProtDex2. However, when being compared with alignment algorithms such as CE, MAM-MOTH and ours, its precision is much lower. Figure 2 (left) shows the Receiver Operating Characteristic (ROC) curves for all the methods. When the recall rate is 100%, the average precision of our method is 63%, which is the highest. In addition, all the pairwise alignment methods including CE, MAMMOTH and ours have higher precision than that of other methods, and with the increasing of recall, this trend becomes more obvious.

Performance for searching weak similarities
In the SCOP database, proteins in a same species or domain are the most similar proteins, and then are the proteins belonging to a same family. Experts also classify weakly similar proteins into a same superfamily. Therefore, to precisely assess the efficiency of searching methods challenged by searching weakly similar proteins, we use the entire ASTRAL SCOP 1.73 as the target database, and select 129 query proteins belonging to four major classes. The average superfamily size of these 129 query proteins is around 300. We use these proteins to do queries on our search engine and also on 3D-BLAST [33], PSI-BLAST [29], and SSM [8]. We are aware that ProtDex2 [35], Sarst [31], and TopScan [36] are also famous protein search systems, however they have not updated their database for a long time. Figure 2 (right) shows the experimental results of the four methods for searching similar proteins at superfamily level, according to which our method is the most accurate one. 3D-BLAST, the second most accurate method in this experiment, has its precisions about 8.1% lower than ours on average. We claim that pairwise alignment algorithms have more advantages in finding remote homologous proteins than linear encoding method or sequential search method does. Nevertheless, according to the experimental result, our method occasionally has problem detecting related proteins in the same superfamily. About 17% query results of our search engine have serious missing problems (precision lower than 50% when recall rate is 100%), while 3D-BLAST is 35% and other methods have more serious missing problems than 3D-BLAST.

Evaluation of reliability
Our search engine provides an alignment length and an RMSD value for every retrieved structure. In order to assess the reliability of our search engine, we calculate the Q-score value of each structure by its alignment length and RMSD value; and gather statistics on precision and recall rate for various Q-score values at both superfamily and family levels. According to the data in Table 2, when Q-score is higher than 0.4, the

Figure 2
Precision and recall curves. Figure 2 shows the accuracy performance of multiple protein search methods. The left shows the precisions and recall rates of 108 queries by multiple methods at SCOP family level, and the right shows those of 129 queries by the same methods at SCOP superfamily level.

Evaluation of query speed
As shown in Table 3, on average, our method requirs about 112.20 seconds to search the database for each query protein in a single machine. Although it is much slower than 3D-BLAST and PSI-BLAST, when being compared with pairwise alignment algorithms CE, and MAMMOTH, ours has great advantage in the time performance. In addition, we have used our web server and the web servers of 3D-BLAST, PSI-BLAST and SSM to do the 129 queries in the entire PDB database. The PSI-BLAST, SSM, and 3D-BLAST servers have average query time of 16, 27, and 44 seconds respectively. Benefit from the distributed computing system and the offline classification, our web tool can scan the entire PDB database in 17 seconds on average, which is shorter than those of 3D-BLAST and SSM. Moreover, our results contain an alignment length and an RMSD value for every output protein. Although PSI-BLAST and 3D-BLAST do not have these data, they are the most important measures for comparing protein structural similarities. The 3D-BLAST server is the slowest one with an average query time of 44 seconds. In our knowledge, other search engines such as CE [12] and Dali [37] which are based on one-against-all pairwise alignment algorithms need hours to days to complete the queries.

Conclusion
We have developed a fast protein alignment algorithm and an efficient protein structural similarity search engine by a combination of the structure alignment algorithm and a structure classification method. Our experiments show that it is more accurate than other well known systems in finding proteins that are structurally similar. The fast speed of our alignment algorithm results from a simple and efficient method for finding a rigid body transformation.

Brief description of algorithms
We give a brief description of our algorithms in this section. We have developed two protein alignment algorithms, both of them have three main stages. The first two stages are shared by them, but they have a different third stage. The third stage of the first alignment algorithm is based on DP (dynamic programming), and it preserves the order of C a atoms in protein backbones. The second algorithm, whose third stage is based on MM (maximal matching), does not preserve the order, however it brings larger alignments than the first, while its speed is slightly slower. Our main technical contribution is a fast method used in stage two for finding a rigid body transformation to superimpose two protein structures. A process flow is shown in Figure 3.
Brief-Algorithm S 1 and S 2 are two 3D protein C a backbone structures; L is a set of local alignments (a local alignment is a match of two substructures of consecutive C a atoms from two backbones); δ is the maximum allowed distance between two matched C a atoms; F G () is a function to calculate rigid body transformation; T () is a function to translate and rotate a structure; find local alignments (stage 1); for each local alignment l in L T = F G (l) (beginning of stage 2);  Table 2: Statistics on the reliability of scores. Precision is defined as n/N and recall rate is defined as n/T, where n is the number of true proteins of Q-scores higher than the limit value in the result list. A true protein means it is from the same family or superfamily of the query protein. N is the total number of retrieved proteins whose Q-scores are higher than the corresponding value, and T is the total number of proteins in the family or superfamily of the input protein for every pair of points (p u , q v ), where p u is in S 1 and

First stage
In the first stage, our algorithm searches for a set of local alignments, each consisting of a series of consecutive C a atom pairs in the backbones of two proteins P and Q, which are represented by their C a atoms in backbones P = p 1 p 2 ... p n 1 and Q = q 1 q 2 ... q n 2 . We use (i, j, l) to represent a local alignment, which indicates that a gapless segment p i p i+1 ... p i+l-1 of the first protein backbone P starting at C a atom i matches a gapless segment q j q j+1 ... q j+l-1 of second protein backbone Q starting at C a atom j, and both segments have l atoms. We compute the distance matrices of the two backbones to match their local regions. If all the corresponding distances in the two distance matrices have small difference, then a local alignment is found. Our algorithm to search all the local alignments runs in time O(d 1 m 1 m 2 ), where d 1 is a constant number, m 1 is the number of C a atoms in P and m 2 is the number of C a atoms in Q.

Second stage
In the second stage, each local alignment is used to find an initial rigid body transformation. There are many algorithms for finding a rigid body transformation to superimpose a set of pairs of 3D points [43]. In this paper a least square estimation method [44] is applied in our algorithm. Given a set H = {(a 1 , b 1 ), (a 2 , b 2 ),..., (a m , b m )} of point pairs in the 3D Euclidean space, F G (H) is a rigid body transformation T derived by the method in [44] to minimize the RMSD: Let H 0 be the set of all pairs (p i+k , q j+k ) (k = 0, 1,..., l -1) in a local alignment (i, j, l). The rigid body transformation T 0 = F G (H 0 ) is derived. After obtaining T 0 , we use it to superimpose the two structures and collect all the pairs (p u , q v ), where p u is from P and q v is from Q and the distance between p u and T(q v ) is bounded by a threshold, and put them into H 1 . A new rigid body transformation T 1 , which is F G (H 1 ), is derived based on the new set H 1 . By repetitively calculating a new transformation and adding new point pairs into H1, we can improve the transformation until no more pairs can be added.
A lot of matched pairs are used during the process of getting a transformation. It is necessary to mention that our algorithm selects all the matched pairs without considering conflicts, where two pairs share a same C a atom. Obviously, this kind of conflicts is ndisallowed in a global alignment. However, when calculating a transformation, there is no need to consider that, and sometimes it is hard to choose the best one among the conflicting pairs. Our method of searching a rigid body transformation with the presence of conflicting pairs not only makes the second stage simple and fast, but also improves its accuracy.

Third stage
In the third stage, we output an alignment that is a set of aligned pairs, where each C a is allowed to appear in at most one pair. Two different methods, dynamic programming and maximal matching, are applied to bring the sequential alignment and non-sequential alignment, respectively.
The dynamic programming method can find an optimal solution by following the order of two backbone sequences. And the minimal length of a local alignment is set to 4. The maximal matching method returns a nonsequential alignment. Before applying the classical maximal matching algorithm to a graph of local alignments, we first delete edges in the graph so that each edge in the bipartite graph corresponds to a local alignment of length at least 4. The minimal length is used to exclude some isolated pairs, which are not biologically meaningful.
As we know, most existing protein alignment algorithms repeat finding the aligned regions between two backbones and recalculate the rigid body transformation when looking for a maximal alignment. Our algorithm does not involve the C a atoms alignment when determining the rigid body transformation. This is why our alignment algorithm is remarkably faster than other algorithms. It can work out hundreds of pairs of protein alignments in one second. Therefore, it is very suitable for protein search. The sequential and non-sequential methods show an interesting tradeoff between speed and the number of aligned pairs.

Methods of speeding up the alignment and search
In order to speed up the computation of our basic algorithm, we propose some strategies that improve the time performance.
Improve the time performance of the alignment algorithm Finding a good rigid body transformation between two protein structures is often time consuming. This is why most alignment algorithms are relatively slow. In order to develop an efficient protein alignment algorithm for protein search, we reduce computational time for finding the rigid body transformation while maintaining sufficient alignment quality.
First of all, the size of local alignments directly determines the time performance. A large number of local alignments will result in slow global alignments. Before finding the local alignments, we first filter out all the short a helices of length less than 8, because this kind of local structures are very common in proteins and may prevent the program from finding significant local alignments. Moreover, the length of a local alignment should be reasonably large to make sense, so we only consider local alignments with length greater than a threshold. In our experiments these