MICAN : a protein structure alignment algorithm that can handle Multiple-chains, Inverse alignments, Cα only models, Alternative alignments, and Non-sequential alignments

Background Protein pairs that have the same secondary structure packing arrangement but have different topologies have attracted much attention in terms of both evolution and physical chemistry of protein structures. Further investigation of such protein relationships would give us a hint as to how proteins can change their fold in the course of evolution, as well as a insight into physico-chemical properties of secondary structure packing. For this purpose, highly accurate sequence order independent structure comparison methods are needed. Results We have developed a novel protein structure alignment algorithm, MICAN (a structure alignment algorithm that can handle Multiple-chain complexes, Inverse direction of secondary structures, Cα only models, Alternative alignments, and Non-sequential alignments). The algorithm was designed so as to identify the best structural alignment between protein pairs by disregarding the connectivity between secondary structure elements (SSE). One of the key feature of the algorithm is utilizing the multiple vector representation for each SSE, which enables us to correctly treat bent or twisted nature of long SSE. We compared MICAN with other 9 publicly available structure alignment programs, using both reference-dependent and reference-independent evaluation methods on a variety of benchmark test sets which include both sequential and non-sequential alignments. We show that MICAN outperforms the other existing methods for reproducing reference alignments of non-sequential test sets. Further, although MICAN does not specialize in sequential structure alignment, it showed the top level performance on the sequential test sets. We also show that MICAN program is the fastest non-sequential structure alignment program among all the programs we examined here. Conclusions MICAN is the fastest and the most accurate program among non-sequential alignment programs we examined here. These results suggest that MICAN is a highly effective tool for automatically detecting non-trivial structural relationships of proteins, such as circular permutations and segment-swapping, many of which have been identified manually by human experts so far. The source code of MICAN is freely download-able at http://www.tbp.cse.nagoya-u.ac.jp/MICAN.

http://www.biomedcentral.com/1471-2105/14 /24 Background Protein structure comparison techniques are of great importance for inferring distant evolutionary relationships that cannot be suggested from sequence similarity alone [1,2], and finding recurrent structural motifs in proteins [3]. Because of its importance, many structural alignment algorithms have been developed [4][5][6]. Although there are a variety of structure alignment algorithm, most of them follow a simple sequential rule, that is, two proteins are aligned only in sequential order. Recently, however, a number of interesting examples that show non-sequential structure similarity have been reported, where non-sequential structural similarity is the structural similarity in which structurally equivalent regions are aligned in different order in the sequence of the compared proteins. The most populated class of nonsequential structural similarity is circular permutation [7][8][9]. Although not as many as that of circularly permutation, the number of examples that show non-sequential similarity beyond circular permutation proved to be also significant [10]. Further, Ilyin et al. found that nonsequential structural alignments between proteins are not limited to proteins of any particular fold, and they are systematic and widespread across the protein universe, suggesting generality and importance of non-sequential structural relationship [10].
Non-sequential structural similarities are intriguing in terms of evolution. Currently, some examples of protein pairs with evidence for evolutionary relationships that fold into topologically different but share the same secondary structure packing arrangement have been known. One of the most interesting examples of such protein pairs is that of KH domains of hnRNP K (PDB entry 1KHM) and ribosomal protein S3 (PDB entry 1J5E) [11]. They show statistically significant sequence similarity (38% sequence identity) implying homology. Their structures, however, are quite different in topology while converging to the same architecture. Another example is a pair of Carboxylic esterase (PDB entry 1YAS) and Hydroxynitrile lyase (PDB entry 1QLW) [10]. Both belong to the same superfamily (alpha/beta-Hydrolases superfamily) defined in SCOP database [12], suggesting that they share a common evolutionary ancestor. However, structure alignment revealed that the sequence order of the long structurally equivalent region is swapped [10]. These examples suggest that some proteins have changed their folds by segment shuffling or rearrangement events while conserving the same core packing arrangements. Further investigation of such protein pairs would give us a hint as to how proteins can change their fold in the course of evolution.
Although some non-sequential structure alignment methods have been developed [13][14][15][16][17][18][19][20][21][22], there is still much room for improvement in non-sequential structure alignment algorithm. Indeed, we have tested all of the publicly available non-sequential structure alignment programs, and found out some problems of currently existing methods which is specific to non-sequential structure alignment. One of the problems is that some non-sequential structure alignment algorithms which utilize a single vector representation of each SSE tend to fail in aligning residues which belong to long SSEs. Such a simplified protein representation is crucial especially for non-sequential alignment, because search space for non-sequential alignment is much larger than that for sequential alignments. However, such an approach is not suitable for describing bent or twisted nature of long SSEs, which makes it difficult to correctly align long SSE regions. Also, it may be difficult to correctly align a short SSE in a long one by the single vector representation, even if the long SSE is not bent or twisted. On the other hand, another kind of the non-sequential structure alignment method which employ a C α representation of protein structures has another kind of problem. For instance, alignments generated by SAMO [18], which tries to maximize the number of matching C α atoms and minimize their root mean square distance, frequently show over-fragmentation; there is no long continuous alignment region, although the number of matching residue is large and RMSD of aligned residues is small. An example of such alignment is shown in Figure 1 for arylamine N-acetyltransferase and epididymal retinoic acid-binding protein. These two structures are very similar, as is suggested by the reference alignment which is taken from MALISAM database [23], a manually curated database of structurally analogous pairs. For the same protein pair, SAMO generates a totally different alignment. The reference alignment has RMSD of 4.4Å over 77 residues, while the alignment by SAMO gives 3.4Å over 91 residues, suggesting alignment by SAMO is better than reference just in terms of these values. However, it is questionable whether such a highly scattered and fragmented alignment is evolutionary or physically meaningful. In this paper, we introduce a novel non-sequential structure alignment method called "MICAN" that overcomes above-mentioned problems. MICAN is designed so as to find the rotational matrix that maximize a structural similarity score of SSE region, regardless of chain connectivity. Since MICAN utilizes an SSE as a structure unit in structure comparison, it is free from over-fragmentation problem, or noisy alignments which mostly consist of relatively short segments or single residues. The search scheme of MICAN is based on the geometric hashing paradigm. It was originally developed for model-based object recognition problems in the area of computer vision, and has been widely applied to protein structure comparison studies [13,[24][25][26][27]. Geometric hashing has the property that it does not use the sequential order of the points to be compared. Therefore, it is highly suitable for structure comparison in cases where the sequential order should be ignored. One of the key feature of the algorithm is utilizing the multiple vector representation for each SSE. This feature enables us to correctly treat bent or twisted long SSEs. Another feature is a building way of reference frame in the geometric hashing technique, in which one reference frame is defined per point to be compared, not defined per pair or per triplet of points. By virtue of this technique, we can reduce the number of reference frame systems to be compared to O(n), while that of the cases in which one reference frame is defined per triplet is O(n 3 ), where n is the number of points to be compared. As a result, MICAN algorithm is very fast; it is the fastest non-sequential structure alignment program among all the programs we examined here.
In order to assess the performance of the algorithm, we compare MICAN with other publicly available non-sequential alignment programs: DEDAL, SNAP, GANGSTA+, SCALI, MASS, and SAMO, as well as sequential ones: DaliLite, CE, and TM-align. We evaluate those programs using both reference-dependent and reference-independent evaluation methods on both sequential and non-sequential benchmark test sets. We show that MICAN outperforms the other existing methods for reproducing reference alignments of nonsequential test sets. Furthermore, although MICAN does not specialize in sequential structure alignment, it shows the top level performance on the sequential test sets for both reference-dependent/independent measure. The source code of MICAN is freely download-able at http://www.tbp.cse.nagoya-u.ac.jp/MICAN.

Evaluation
In order to assess the performance of alignment methods as objectively as possible, we use both reference-dependent and reference-independent evaluation methods.
Reference-dependent evaluation is based on precomputed gold standard of reference alignments. Reference alignments are often carefully inspected, or manually curated to ensure good alignment quality. They are considered to be, for the most part, biologically or physically meaningful, although they may not be necessarily perfect. The reference-dependent alignment accuracy is calculated as the number of correctly aligned residue pairs in a test alignment divided by the total number of aligned residue pairs in a reference alignment. Because of its biological or physical relevance, as well as simplicity of the evaluation, reference-dependent evaluation has been widely used [28][29][30][31][32]. However, some aspects of the method have been criticized; it relies heavily on the correctness of the reference alignment and it is unclear how the quality of reference alignment influence the evaluation of alignment programs [6,31,33].
On the other hand, reference-independent method does not require reference alignments. It evaluates purely geometric measure, such as number of aligned residue pairs, their RMSD, number of gap, or their combinations. Thus, reference-independent method is free from problems associated with reference-dependent evaluation. Comparing these geometric measure is particularly useful for identifying characteristic feature of several alignment programs. However, these similarity measures may not necessarily reflect the biologically or physicallymeaningful similarity [34]. This situation is particularly serious, especially for non-sequential structure alignment. For instance, as shown in Figure 1, an alignment by SAMO shows larger number of aligned residue pairs and lower RMSD than those of the reference alignment. However, its alignment seems to be not physically relevant because there are too many gaps and too many short segments that are likely to be spurious matches in the alignment.
As mentioned above, both reference-dependent and reference-independent method have their own advantages and weaknesses. The use of both two evaluation methods would provide more objective view on the results of the benchmark test, highlighting different aspects of the algorithm.

Datasets
What kind of datasets should we use for the evaluation of non-sequential structural alignment programs? http://www.biomedcentral.com/1471-2105/14 /24 Obviously, many protein pairs of which the 'correct' alignments is known to be non-sequential are needed. We refer to such test sets as 'non-sequential test sets' . In addition to that, 'sequential test sets' , a collection of protein pairs of which the 'correct' alignment is sequential, are also needed. That is because ideally, if the correct alignment is sequential, non-sequential alignment programs should generate a sequential alignment. Sequential test sets enable us to assess such an ability of non-sequential programs. Thus, we have used both sequential and nonsequential test sets for the evaluation of different structural alignment programs.
We must construct non-sequential test sets for evaluation of non-sequential alignment, because of the absence of gold standard of non-sequential structural alignment databases. Here, we create artificial non-sequential test sets "MALIDUP-ns" and "MALISAM-ns" based on the sequential test sets, rather than collecting naturally occurring non-sequential alignments, because the number of currently known naturally occurring protein pairs that show non-sequential structural similarity is limited. The detailed description of how to create artificial nonsequential test sets is given in the Methods section.

Comparison to other alignment programs
We compared the alignments of MICAN with those of other structure alignment programs. They are three sequential alignment programs: DaliLite [36], TM-align [37], and CE [38], and six non-sequential alignment programs: DEDAL [33], SNAP [22], GANGSTA+ [21], MASS [13], SCALI [16], and SAMO [18]. We selected DaliLite, TM-align and CE as representative of sequential alignment programs, because they have been known to be one of the best structure alignment programs [4,6,29]. The choice of above-mentioned non-sequential programs was based on the availability of stand-alone programs. All of the programs were downloaded and executed locally on a Linux platform. If the source code of the program was available, we compiled it with GNU Compiler 4.5.5 on the computer. To our knowledge, this is currently the largest and most comprehensive comparison of non-sequential structural alignment programs.

Results of sequential alignment test
In order to assess alignment accuracy of several method on sequential test sets, we compared the alignments generated by the ten methods with all of the reference alignment of MALIDUP-sq and MALISAM-sq, and computed percentage of agreement, Q-score. Here, Q-score is defined as the number of correctly aligned residue pairs in a test alignment divided by the total number of aligned residue pairs in a reference alignment. The corresponding box-and-whisker plots are shown in Figure 2, and mean Q-score are listed in Table 1. We also present the scatter plots of Q-score for all pairs of the compared methods in Figure 3. For MALIDUP-sq test set, the largest mean Qscore was achieved by MICAN with 85.8% (median 94.9%) and the second largest mean was 85.3% (median 94.3%) by DaliLite. On the other hand, for MALISAM-sq test The results of reference-dependent assessment with sequential test sets. The box-and-whisker plot of the distribution of the Q-score obtained for three sequential (white boxes) and seven non-sequential methods (gray boxes). The left (right) plot represents the result for the test with MALIDUP-sq (MALISAM-sq) set.   should be noted that although MICAN does not specialize in sequential structure alignment, the performance of MICAN is comparable with that of DaliLite, which is known to be one of the best sequential alignment program.

Results of non-sequential alignment test
To examine the ability of reproducing non-sequential reference alignments, we performed the benchmark test on the non-sequential test sets, MALIDUP-ns and MALISAM-ns test set. As far as we tested, CE program did not work for protein pairs in the non-sequential test sets. That is because artificially generated protein structures included in the test sets have chain break. Accordingly, data of CE algorithm is not shown for the test sets. The results of the non-sequential benchmark tests are shown in Figure 4, Figure 5, and Table 1.
As we expected, the performance of sequential alignment programs, DaliLite and TM-align, were significantly decreased on non-sequential test sets, compared with their results on sequential test sets. For example, the mean Q-score obtained by DaliLite has dropped from 85.3% (MALIDUP-sq test) to 37.9% (MALIDUPns test). This result is a natural consequence, because these sequential alignment programs were not designed so as to align protein structures in non-sequential manner.
On the other hand, the performance of non-sequential alignment programs are robust against changes of test sets from sequential to non-sequential. Among them, MICAN shows outstanding performance on the test sets. It achieves the highest median and mean Q-score on both non-sequential test sets (See Figure 4, Figure 5 and Table 1). Especially, the result on MALISAM-ns test, which is the harder one, is significant; the mean Q-score obtained by MICAN with 67.6% is as many as 11% higher than that by DEDAL, which shows the second highest mean Q-score (See Table 1). In addition, the difference between MICAN and DEDAL is statistically significant: p-value is 1.7 × 10 −4 (See also Figure 4). Taken together with the results on sequential and non-sequential test set, MICAN can be considered as the outstanding program for generating alignments consistent with reference alignments regardless of whether the reference alignment is sequential or not, at least for the test sets we used here. An example of one of the most difficult alignment taken from MALISAM-ns test set is shown in Figure 6. One of the pair is the structure of Formiminotransferase domain of formiminotransferase-cyclodeaminase (shown in Figure 6a left). The other is the permuted structure of FAD-binding domain (Figure 6a right), which is generated by randomly permuting sequential order of SSEs of FAD-binding domain.
The alignment plot of the reference alignment and those of nine alignment methods are also shown in Figure 6b and c. The structure alignment of the reference consists of five fragments, and each fragment contains a single SSE and loops attached to the SSE. The orders of SSEs in the polypeptide chains are completely different, which makes alignments quite complex, as shown in Figure 6b. Five non-sequential methods (SNAP, GANGSTA+, SCALI, MASS, and SAMO) returned zero agreement with the reference alignment, suggesting difficulty of reproducing the reference alignment of this pair. DaliLite, TM-align, and DEDAL returned partial agreement with the reference. All these three method correctly aligned α 2 to α 2, but only DaliLite did α 1 to α 1. All the method except MICAN failed to correctly align any β strands pairs. On the other hand, only MICAN correctly align all of the secondary structure pairs, completely reproducing the reference alignment (Q-score = 100%). Although MICAN is the best aligner in terms of reference-dependent evaluation, there are still some failures. For example, as depicted in Figure 3 and 5, MICAN generates some alignments with Q-score = 0%.
Manual investigation revealed that majority of such failed alignments had a common feature; They had 1-4 residue shift of an alignment with respect to the reference alignments. Shown in Figure 7  Reference alignment d1qd1b1 permuted d1n62c2 one possible improvement is to develop a scoring function that is more consistent with human expert knowledge.

Benchmark test with reference-independent evaluation
To identify characteristic feature of several alignment programs, we calculated several standard geometric measures, the number of aligned residue pairs (N ali ), the corresponding root mean square deviation (RMSD) of the C α atoms, and the number of gap opening (N gap ), for all the alignments generated by all the methods, for all the pairs provided in MALIDUP-sq, MALISAM-sq, MALIDUP-ns and MALISAM-ns test sets. Here, N ali is defined as the number of aligned residue pairs divided by the size of the smaller protein of the protein pair. Their average values for MALIDUP-sq, MALISAM-sq, MALIDUP-ns and MALISAM-ns test sets are listed in Table 2. To visualize the data, average N ali and RMSD obtained by each method is plotted on (N ali , RMSD) plane in Figure 8. On this plane, better performance corresponds to larger values along the horizontal axis with smaller values along the vertical axis. Shown in Figure 8a are results for MALIDUP-sq and MALISAM-sq test sets. Each method locates in various positions of the plane, which reflect the feature of objective function of each method. Among them, DaliLite, TM-align, MICAN, and SAMO show similar feature of the alignments; they are located on a similar position. These four algorithms are seated on the upper right part of the graph, suggesting that they prefer larger number of aligned residue pairs with larger RMSD (about 3Å). Although their average RMSD values are relatively large compared with the other programs, these values can be considered as biologically relevant, because they are comparable to typical RMSDs for homologous proteins [39]. In contrast, alignments by MASS shows shorter alignment with the smallest RMSD of all the algorithm. The other programs lie between the two extreme cases. Figure 8b shows results on MALIDUP-ns and MALISAM-ns test sets. Compared with Figure 8a, as we expected, N ali of sequential algorithms significantly decreased. On the other hand, the positions of all of the non-sequential algorithms on the plain are not affected by the change of test sets from sequential to non-sequential test sets, compared with those of sequential algorithms. Especially, MICAN, DEDAL, SNAP and MASS are shown to be highly robust against shuffling of sequence order. Their change in the N ali and RMSD are less than 2%.
According to Figure 8a and b, MICAN and SAMO show similar behavior; They prefer large number of aligned residue pairs with larger RMSD, regardless of whether the reference alignment is sequential or not. However, they showed totally different behavior in the referencedependent assessment, as shown in Figure 2 and Figure 4. Where does the discrepancy come from? One possible explanation is that the large number of aligned residue pairs by SAMO is mainly attributed to contribution of short segments that are likely to be spurious matches.
In fact, Table 2 shows that the number of gap opening is different between the two algorithms. The number of gap opening by SAMO is about 9 times larger than that by MICAN, suggesting that alignments by SAMO largely consist of relatively short segments or single residue pairs. As pointed out by Abyzov and Ilyin [10] and Teichert et al. [29], these short aligned segments are considered as 'noise' , which may likely arise by chance, rather than 'signal' of structural similarity, because sequence similarities based on structure alignments increase by removing short fragments from the alignments [29]. These noise could be one of the major factor of overestimation of the structural similarities [40]. Thus, in order to eliminate contribution of such noise from N ali , we calculated the number of aligned residue pairs, N ali , which does not count residues pairs in fragments shorter than three consecutive residues pairs.
Average value of N ali obtained by each method for each test set are listed in Table 2. Underlined data listed in Table 2 indicate significant difference (more than 20%) between N ali and N ali . The largest change is observed for SAMO, which shows 46% decrease between N ali and N ali on MALISAM-ns test set by omitting the short fragments, suggesting the large number of aligned residue pairs by SAMO is enhanced by gathering short (one or two residues) fragments. On the other hand, N ali and http://www.biomedcentral.com/1471-2105/14/24  N ali by MICAN are completely the same for all the test sets, implying that its alignments is mostly composed of relevant signals of structural similarity.
To visually understand the influence of eliminating short fragments on all the algorithms, we plotted average N ali and RMSD for each method on (N ali , RMSD) plane in Figure 8c and d for sequential and non-sequential test sets, respectively. Comparing Figure 8a with c, as well as Figure 8b with d, we can see that eliminating short continuous segments bring about drastic changes for some alignment programs. As pointed out above, SAMO is the most sensitive to eliminating short fragments, followed by MASS and TM-align. The others are located at the same position on the plane, suggesting that their alignments are free from spurious matches. Among then, alignments by MICAN shows the largest number of aligned residue pairs for all the test sets, implying that the characteristic feature of MICAN is that it prefers large number of aligned residues with reasonable RMSD and that its alignment is free from spurious matches.
So far, N ali and RMSD have been evaluated based on the residues located everywhere in the structure. However, as pointed out by Gueler and Knapp, comparing the number of aligned residues or RMSDs should be careful, because the residue assignment strategies can have a significant influence of the results [41]. To elucidate influence of the residue assignment strategies, we recalculated number of aligned residue pairs and corresponding RMSDs in two different strategies. The first strategy is based on the residues only within secondary structure elements(i.e. αhelix and β -strand). The second one is based on residues belonging to the same SSE type. We found that the results from both strategies were qualitatively the same as the result we have shown here; alignments by MICAN show the largest number of aligned residue pairs over all the test sets for both strategies (data now shown). This result implies the robustness of the results against the change of residue assignment strategies.
We have shown that the characteristic features of alignments by MICAN are (i) large number of aligned residue pairs with reasonable RMSD and (ii) its alignment is free from one or two length short segments. For further characterization of the algorithm, we investigate what length fragments are frequently used in the alignment of MICAN, and to compare them with those of the other algorithms. Shown in Figure 9 Figure 9 The distributions of the length of consecutive aligned fragments for MALIDUP-sq set. The algorithms are classified into four groups based on the location of the peak of their distribution, The first group (a) has a peak at 9 residue length, the second group (b) a peak at 7, the third group (c) a peak around 3 or 4, and the fourth group (d) a peak around 1 or 2.
of lengths of continuously aligned fragments in the alignments produced by each method for MALIDUP-sq set. According to the location of the peaks of their distribution, the algorithms can be classified into four groups: The first group has a peak at 9 residue length (shown in Figure 9a), the second group at 7 (Figure 9b), the third group around 3 or 4 (Figure 9c), and the fourth group around 1 or 2 (Figure 9d). MICAN belongs to the first group; its peak location of the distribution is larger than any other groups. This result indicates that the large number of aligned residue pairs by MICAN is mostly attributed to relatively longer consecutive fragments (typically longer than 9 residue length). This is contrast to the distribution of the fourth group; The number of aligned residue pairs is enhanced by contribution of a large number of 1 or 2 length short fragments.

Influence of multiple vector representation of protein structures on the alignment accuracy.
As described in the Methods section, MICAN utilizes a multiple vector representation for each SSE. Here, we discuss how it works on improving performance, by comparing the performance of MICAN with that of MASS, which utilizes single vector representation of SSEs. As pointed out in the introduction, it may be difficult to correctly align a short SSE in a long one by the single vector representation. In order to assess whether the multiple vector representation improve alignment accuracy over single vector representation methods for such cases, we examined the dependence of alignment accuracy on the average difference in residue number between aligned SSEs in the reference alignment. Shown in Figure 10a is the relationship between the average difference of aligned SSE length in the reference alignment and Q-scores of MICAN and MASS alignments for all the target pairs of both MALIDUP and MALISAM-sq test set. There is statistically significant relationship between Q-score and the average difference of aligned SSE length for MASS. These two values are correlated with a correlation coefficient r = −0.34. The p-value associated with this correlation is extremely small (p = 8.0 × 10 −12 ). This result indicates that the larger the average difference of aligned SSE length, the more MASS fail to align. On the other hand, there is no correlation between the two values of alignments by MICAN (r = −0.09, p = 0.10 ), implying that the performance of MICAN is robust against the difference of aligned SSE length. These results suggest that multiple vector representation of protein structures improve alignment accuracy for protein pairs that have large difference of SSE length.

Computational costs
To assess the computation speed of the algorithms we examined here, we measured computational time per alignment. Average CPU times required to calculate one structure alignment for each test sets with the ten algorithms are shown in Table 3. The fastest algorithm of all the programs is TM-align for all the test sets.
Although, compared with TM-align, MICAN is about 1.2 ∼ 2.3 times slower than TM-align, it is the fastest algorithm of all non-sequential programs. Since average CPU times required for one structure alignment by MICAN is very fast (0.097 ∼ 0.134s ), it is suitable for largescale database search. It should be noted that although MICAN showed the similar behavior to DaliLite in both reference-dependent/independent evaluation for sequential test sets, computational speed of MICAN is much faster (approximately 8 times faster) than that of DaliLite, which is known to be one of the best structure alignment program.

Homologous protein pairs that have non-sequential structural similarity
Recently, some homologous protein pairs that show nonsequential structure similarity have been reported [10,11]. However, it is unclear how abundant such pairs are. In order to address the issue, we compared homologous protein structures by MICAN. We considered all the pairs of SCOP family representatives within the same superfamily (12,603 pairs). For these 12,603 pairs, pairwise structural alignments were generated using MICAN. Here, only the alignments with significantly high structural similarity, TM-score ≥ 0.5 [42], were collected, resulting in total of 8,335 structurally similar protein pairs. Among them, the overall proportion of protein pairs that show nonsequential alignments was 39% (3,284 pairs). This result suggests that a large number of proteins have changed their folds by some complex events while conserving the same core packing arrangements. One of the interesting example of such a homologous protein pair is shown in Figure 11 for Ubiquitin Cterminal hydrolase (SCOP ID d1xd3a ) and Phytochelatin synthase (SCOP ID d2bu3a1). These two proteins belong to the same superfamily (Cysteine Proteinase superfamily) in the SCOP database [12], suggesting that they share a common evolutionary ancestor. Although their ligands are different, the molecular functions are suggested to be similar; both of them have hydrolase enzymatic activity and contain the same catalytic residues of cysteine, histidine, aspartate and glutamine [43,44]. TM-score of the alignment by MICAN is 0.517 (RMSD of 3.8Å over 70% residues), suggesting significant structural similarity. Figure 11b shows superimposed structures of aligned region around the catalytic-site. All the side-chain atoms of the catalytic residues are well superimposed, suggesting the correctness of the alignment. The corresponding alignment plot (Figure 11c) shows a non-sequential alignment; the large portion of structurally equivalent parts are not in the same order in protein sequences. This permutation can not be explained by well-established mechanism of protein evolution such as "Duplication and Fusion". It is interesting to investigate how such a fold change occurred, but left for future studies.

Conclusions
We have developed a new algorithm MICAN to compare protein structures regardless of the topology. One of the key of the algorithm is a novel representation of SSEs with multiple comparing elements of short segments. It enables us to properly describe bent or twisted nature of long secondary structure element. In addition, this representation method also helps speeding up a geometric hashing procedure. As a results, MICAN is the fastest program among non-sequential alignment programs we examined here.
We have presented a comparative analysis of structural alignments using both reference-dependent and reference-independent evaluation methods on both sequential and non-sequential test sets. Referencedependent evaluation shows MICAN outperforms the other existing methods for reproducing reference alignments of non-sequential test sets. Further, although MICAN does not specialize in sequential structure alignment, its performance on sequential test sets is comparable to DaliLite, which is known to be one of the most accurate sequential alignment program. Referenceindependent evaluation demonstrates that the features of alignment by MICAN are large number of aligned residue pairs with reasonable RMSD and that its alignment is free from short segments that are likely to be spurious matches. These results suggest that MICAN is highly effective tool for automatically detecting homologous protein structures bearing topological irregularities, such as circular permutations and segment-swapping, many of which have been identified manually by human experts so far.

The general strategy of the algorithm
Assume we have two protein structures, a query and a model, composed of N and M residues, respectively. The goal is to find the rotational matrix that maximizes structural equivalences between the two structures, regardless of chain connectivity. The search scheme of MICAN employs a hierarchical alignment algorithm, in which the SSEs are first aligned, and then a residue level alignment is iteratively performed.
The SSE level alignment is further divided into four steps: In the 1st step, secondary structure types are assigned for each residue of the two structures. In the 2nd part, we define "the Comparing Elements of the Short Segment (CESS)", by which we describe the structures to be compared in the SSE level alignment. The CESS is a point which is defined per consecutive 3 residues for strands, and 6 for helices. Accordingly, the number of CESSs included in each SSE is n − 2 for strands, n − 5 for helices, where n is the length of the SSE. Each CESS has the information of its SSE type, the representative location of the SSE segment, the direction of the short segment of SSE from the N-terminus to the C-terminus, and the perpendicular vector to the SSE direction. Thus, each CESS has two vectors, which enables us to define a reference frame on each CESS. In the 3rd step, one reference frame is defined per CESS by exploiting the two vector each CESS possesses. In the 4th step, we use geometric hashing technique to pick up candidates of best rotational matrix that maximize non-sequential structural similarity by searching for the best matches of CESSs. Some candidates of best rotational matrix are listed and stored for the next stage.
The residue level alignment is divided into two steps: In the 1st step, we generate a residue level alignment based on the superimposition obtained by SSE level alignment. Starting from the initial superimposition, we generate one-to-one mapping of C α atoms in step wise manner, in which we align residue pairs from pairs superimposed with smaller RMSD to larger. In the 2nd step, the refinement of alignment is performed iteratively until the similarity score converges. Finally, the best scoring alignment is selected based on TM-score like function. The details of each protocol is described in the following subsections.

The SSE level alignment The 1st step: SSE assignment
The starting points of MICAN algorithm is secondary structure assignment for all the residues of both the query and the model. For a given residue, a secondary structure type (α -helix, β -strand, or coil) is assigned by the method of Zhang and Skolnick [37], instead of DSSP [45] or STRIDE [46], because of its speed and applicability to C α only models. It was reported that Q 3 accuracy of the method was 85% with respect to the DSSP assignment. In the SSE level alignment, we use only residues that are assigned as α -helix or β -strand, and ignore the rest of the residues. We also ignore helices shorter than 6 residues and strands shorter than 3.

The 2nd step: defining the comparing elements of the short segment
In the second step, for both the query and the model, we define the Comparing Elements of the Short Segment (CESS), by which we describe the structures to be compared in the SSE level alignment. A CESS is defined per a Short Segment of SSE (SSSSE), where, SSSSE is defined as a segments of consecutive 3 residues for strands, and 6 for helices. Accordingly, the number of CESSs, as well as SSSSEs, included in each SSE is n − 2 for strands, n − 5 for helices, where n is the length of the SSE (see Figure 12a and b). Each CESS has following four kinds of information on the SSSSE. The first one is secondary structure type (i.e. α or β ) which was assigned in the first step. We denote secondary structure type of i-th CESS as s i . The second one is the coordinate of the representative point of the SSSSE, which we call "the coordinate of the CESS". We denote the coordinate of the i-th CESS as r CESS i . When a CESS represents the i-th SSSSE which starts from the kth residue (see Figure 12a and b), r CESS i is defined as the middle point of both ends of the SSSSE, i.e. where

for helices)
(r k + r k+1 )/2 ( f o r s t r a n d s )  Here, r k indicates coordinates of C α atom of k-th residue. The expressions of r init i and r end i are taken from Ref. [14]. The third kind of information is the direction of the SSSSE from N-to the C-terminus. The directional which was also used in Ref. [14]. The fourth information is the perpendicular vector to v para i . The reason why we introduce this vector is that by introducing the vector we can define the reference frame per CESS. This technique enables us to reduce complexity of Geometric Hashing algorithm to O(n 2 ), while that of the naive geometric hashing, in which reference frame is defined per triplet of points, is O(n 4 ). The perpendicular vector of i-th SSSSE, v In the same way, we calculate above mentioned quantities for every SSSSE. As a result, whole structures of both the query and the model are described by a set of CESSs. An example of schematic representation of a protein structure by a set of CESSs is shown in Figure 12c.

The 3rd step: defining the reference frame
In this step, we define reference frames for each CESS of both the query and the model. Reference frames are coordinate systems, which are used in geometric hashing technique. As described in the 2nd step, each CESS has two vectors, which are orthogonal to each other. By the use of two vectors, a coordinate reference frame for i-th CESS is defined as follows. The origin of the reference frame is placed on the coordinate of i-th CESS. The x-axis is defined by the vector v para i and y-axis is v perp i . The zaxis is defined to be perpendicular to the x-y plane and its direction is determined by the right-handed rule.

The 4th step: identifying candidates of best rotational matrix by geometric hashing
In order to find candidates of best rotational matrix, which maximizes coincidences of CESSs of the two structures, we use geometric hashing technique. Geometric hashing algorithm is composed of two interrelated phases: preprocessing and recognition.
In the preprocessing phase, we set up hashing table for the model structure. The preprocessing procedure we use here is essentially the same as the standard one, except for a set of descriptions stored in the hash table. For each reference frame, we define 3D grid. A resolution of the grid is set to 3.2Å in the algorithm. The cells of the grid make up a hash table, which we denote as H(a, b, c), where (a, b, c) is an index of 3D cube. First, i-th CESS in the model is chosen as a basis of the reference frame. Second, the position of j-th CESS in the reference frame i, which we call r CESS j (i), is calculated according to the formula, r CESS  cube (a, b, c), following four descriptions of j-th SSSSE are stored in the hash table  H(a, b, c) : these are (i) an identification of its SSE type (s j ), (ii) an identification of the basis (in this case, the CESS that defines the reference frame i), (iii) the transformed vector v j para (i), which is the vector v j para calculated in the http://www.biomedcentral.com/1471-2105/14/24 reference frame i, and (iv) the transformed vector v j perp (i). This procedure is repeated for all the CESS of all reference frames.
In the recognition phase, we compare the model and the query structure. The goal of this step is to find candidates of the rotational matrix that maximizes coincidences of CESSs of the two structures. The recognition phase of MICAN is also essentially the same as standard geometric hashing algorithm, except for a voting score. Let k-th CESS of the query structure be the selected point which defines the reference frame. For each of the other CESS of the query, the coordinates in the frame system defined by the k-th CESS are computed, as for the model. These coordinates are used as indexes of 3D cube. If r CESS l (k) of the query is found in the cube (a, b, c), and information of j-th CESS of the frame system i of the model is stored in the hash table H(a, b, c), voting score S(j(i), l(k)) is added for the pair of reference frames i and k. The function of the voting score S(j(i), l(k)) is defined as where Here, θ j(i),l(k) and φ j(i),l(k) are the angle between v para j (i) and v para l (k), and that between v perp j (i) and v perp l (k), respectively. This procedure is repeated for all the other CESS of all the reference frames of the query structure. As a result, all the reference frame pairs between the model and the query are ranked according to their summation of the voting score. Top scoring 50 reference frame pairs are stored for the next step. For each such pairs, we calculate a rotational matrix by superimposing one reference frame onto the other. These rotational matrices are used to set up an initial superimposition of the two structures in the residue level alignment.

The residue level alignment The 1st step: one-to-one assignment of C α atoms
In this step, starting from the initial superimposition obtained by SSE level alignment, we generate one-to-one mapping of C α atoms in a step wise manner, in which we align residue pairs from pairs superimposed with smaller RMSD to larger. In order to identify which residue pairs are superimposed with given cut off distance d R , we construct an N × M similarity matrix M i,j (d R ) based on a given superimposition of the query and the model structure, where N and M denote the number of residues in the query and the model. The matrix M ij (d R ) is defined as where d ij is the distance between the C α atom of the ith residue of the query and that of the j-th residue of the model, d 0 is a scale to normalize the match difference, d R is cut off distance of R-th step, σ i represent the three-states of secondary structure (helix, strand or coil) of residue i, and δ σ i σ j is Kronecker's delta. The scale d 0 is defined as where L is the length of the query structure. This formula was originally introduced by Zhang & Skolnick, to eliminate the inherent protein size dependence of the score function [47]. The factor (δ s i s j + w)/(1 + w) was introduced so that aligned residues should belong to the same SSE type. Here, w is a weight factor and is set to 1.0 in this paper. The cut off values d R in each step R are set as follows: We consider alignments composed only of continuous segments with certain length, in order to eliminate short segments that are likely to be spurious matches. Generally, a continuous segment of length l is described as a set of residue pairs; {(α, β), (α+1, β+1), · · · , (α+l−1, β+l−1)}, where (α, β) indicates that residue α of the query is paired with the residue β of the model. Here, given the cut off value d R , we consider only continuous segments that fulfill the condition; M α+m,β+m (d R ) > 0 for ∀m ∈ (0, 1, · · · , l − 1). (10) We refer to such a segment as A ζ (d R ), where ζ is the index of the segment. We introduce the similarity score The similarity score is used to rank the segments, as well as to exclude short segments.
First, we align residue pairs that are superimposable with the smallest cut off value, d 1 . We calculate M i,j (d 1 ) based on the initial superimposition obtained by SSE level alignment, and generate alignments based on the matrix. The goal of this step is to find a set of non-overlapping continuous segments {A 1 (d 1 ), A 2 (d 1 ), · · · , A n (d 1 )} that maximizes where n is the number of segments in the alignment.
Since the exact solution of this problem is hard to obtain, our approach is to use greedy-like algorithms to find an approximate solution. First, the algorithm chooses the continuous segment that has the highest score of all the continuous segments on the matrix. We refer to such a segment as is recorded as the first member of the set of best segments, where the cutoff parameter S min is set to 2.2 in this paper. Then, in order to choose the next segment, we modify the matrix M i,j (d 1 ) such that matrix elements interfering with A 1 (d 1 ) are set to zero. In other words, if the selected seg- , · · · , (α + l − 1, β + l − 1)}, matrix elements whose row number is i (α ≤ i ≤ α + l − 1) or whose column number is j (β ≤ j ≤ β + l − 1) are set to 0, except for A 1 (d 1 ) itself. We also set matrix elements of the segment A 1 (d 1 ) to zero so that we won't select A 1 (d 1 ) as a highest score segment in the following steps. Next, we choose the continuous segment that has the highest score of all the continuous segments on the modified matrix. We refer to such a segment as is recorded as the second member of the set of best segments. Further, we modify the matrix such that matrix elements interfering with A 2 (d 1 ), as well as those of A 2 (d 1 ), are set to zero. We repeat this procedure until the highest score of all remaining segments on the modified matrix is smaller than S min . As a result, we obtain a set of non-overlapping segments that approximately maximize the total score S tot (d 1 ) for given cut off distance d 1 . We describe such a set as is the number of the best segments for the given cut off value d 1 .
Second, we extend the alignment with increasing the cut off value d R from d 2 to d 3 in a step wise manner. For each R-th step, as in the case of d 1 , we calculate the similarity matrix M i,j (d R ) according to the equation (9). We modify the matrix M i,j (d R ) such that matrix elements interfering with any of the A i (d R−1 ) are set to zero, where A i (d R−1 ) is a segment involved in the best alignment obtained in (R − 1) -th step. This operation assures that the matrix elements of the best segments identified in (R − 1) -th step are included in the best segments of R-th step. Then, the algorithm chooses the segment that has the highest score of all the segments on the modified matrix. It is recorded as the first member of the best segments in R-th step. We denote such a segment as A 1 (d R ). Next, we further modify the matrix such that matrix elements interfering with A 1 (d R ), as well as those of A 1 (d R ), are set to zero. We repeat these procedures until the highest score of all remaining segments on the modified matrix is smaller than S min , for given cut off d R . The resultant best set of non-overlapping segments for d R is described as is the number of the best segments for the given cut off value d R .
By repeating the same procedure from d 2 to d 3 , we finally obtain the set of non-overlapping segments for d 3 , , · · · , A n (d 3 )}, which is used as the initial alignment in the iterative refinement step.

The 2nd step: iterative refinement
The final step of the algorithm is refinement of the alignment. It is performed iteratively as follows. First, we superimpose the structures by minimizing the RMSD of C α atoms based on the aligned residues identified in the previous step. Second, based on the superposition, we obtain a new alignment by the method as was described in the first step of the residue level alignment. These procedures are repeated until the score no longer improves. The score we use here is modified TM-score, which is defined as Here, N is the length of the query, and i,j indicates that summation is taken over all of the aligned residue pairs. Finally, the alignment with the highest mTM-score is returned. The alignments usually converge in moderate time, although there is a discrepancy in the algorithm; In superimposing structures, we minimize RMSD rather than mTM-score, which is the objective function in the algorithm. As far as we have tested, the score typically converges by 5-6 iterations, suggesting that the discrepancy is not critical.

Selecting the best alignment
Since the score function used in the SSE level alignment is different from that in the residue level, the top scoring superposition obtained in the SSE level alignment may not always lead to the best scoring alignment in the residue level. We explored the relationship between the two scoring function and found that although they did not have perfect linear relationship, there was a strong correlation between the two. Further, we confirmed that if we started from each of the top scoring 50 superimposition of the SSE level alignment, we reached the highest mTM-score in the residue level alignment in most cases. Thus, in order to obtain as high scoring an alignment of the residue level as possible, MICAN performs the residue level alignment starting from each of the top scoring 50 superposition of the SSE level alignment. An alignment with the highest mTM-score among those 50 alignments is returned as the best alignment. MICAN program can output sub-optimal alignments, if the user specifies the option of output of alternative alignments. http://www.biomedcentral.com/1471-2105/14/24 Average Q-score R 0 Figure 13 The cutoff distance dependence of the mean Q-score for the SISY set. The mean Q-score is calculated for several R 0 values on the SISY set. All the other parameters are set to the same values as those we used in this paper.

Handling reverse alignments
It has been reported that there are many interesting examples of non-sequential structure alignments that involve reverse alignments, in which SSEs structurally match each other but the polypeptide chains go in opposite directions [10,21]. MICAN can deal with forward and reverse alignments simultaneously, if needed. In order to allow reverse alignments, the algorithm should be changed as follows: • The function in equation (7) is changed as | cos (θ j(i),l(k) )| − cos (π/3) 1 − cos (π/3) (|θ| ≤ π/3).
This algorithm is implemented in MICAN program. Users can select either the forward only or forward/reverse mixed mode. All the data shown in this paper were obtained in forward only mode.

Parameter tuning
There are a number of parameters that need to be optimized in MICAN algorithm, such as w and d R in equation (9). We determined these parameters by maximizing agreement with the reference alignments of SISY-pairwise dataset [30] through try-and-error procedure in a nonsystematic way. The SISY-pairwise was derived from SISYPHUS database [48], a manually curated multiple structure alignment database which contains structural alignments with nontrivial relationships including circular permutation or segment swap, thus the dataset is suitable for training of non-sequential alignment. Through training, we found that some parameters have strong influence on the performance. One of such parameters is a distance cutoff value d R used in the residue level alignment. Figure 14 The schematic representation of how to create a non-sequential test sets. (1) One protein structure to be permuted is randomly chosen from each target pair. Original residue numbers are indicated inside circles. The three SSEs (strands) are represented by arrows. (2) The original chain is splitted into several segments at loop positions. In this example, the chain has been divided into three segments A, B and C. (3) The segment order is randomly shuffled (e.g. the order A-B-C was shuffled into different order B-C-A). Residue numbers are reassigned according to the shuffled segment order. (4) The generated structure by the permutation technique. Its residue numbers are also modified according to the segment order. The reference alignment are modified according to the reassigned residue numbers of the permuted protein. A', B' and C' represent segments of the other protein structure which has not been chosen for sequence shuffling. Segments A', B' and C' correspond structurally to segments A, B and C respectively. http://www.biomedcentral.com/1471-2105/14 /24 In this work, we set d R in each step R as d 1 = R 0 , d 2 = 1.5 × R 0 and d 3 = 2.5 × R 0 , where R 0 is the cutoff parameter. Thus, the single parameter R 0 determines the behavior of d R . Figure 13 shows the relationship between the mean Q-score and R 0 for the SISY test set. We clearly see that the optimal value is located near R 0 = 3 A, and that the mean Q-score sharply drops as R 0 value changes from the optimal point. Another is the voxel size h used in geometric hashing algorithm. We observed that the voxel size dependence of the mean Q-score is qualitatively similar to the case of R 0 . Preliminary numerical tests on a variety of d R and h values led to the choice of R 0 = 3.2Å and h = 3.2Å. The other parameters seems to be not critical for the performance. After the training, MICAN finally achieved high agreement (86.2% on average) with the reference alignments. This agreement is better than that of DaliLite (77.4% on average), which is known to be one of the best structure alignment program. We believe that the parameter set we determined is nearly optimal, although it has not been optimized systematically.