Multiple structure alignment and consensus identification for proteins
 Ivaylo Ilinkin^{1},
 Jieping Ye^{2} and
 Ravi Janardan^{3}
DOI: 10.1186/1471-2105-11-71
© Ilinkin et al; licensee BioMed Central Ltd. 2010
Received: 24 August 2009
Accepted: 2 February 2010
Published: 2 February 2010
Abstract
Background
An algorithm is presented to compute a multiple structure alignment for a set of proteins and to generate a consensus (pseudo) protein which captures common substructures present in the given proteins. The algorithm represents each protein as a sequence of triples of coordinates of the alpha-carbon atoms along the backbone. It then iteratively computes a sequence of transformation matrices (i.e., translations and rotations) to align the proteins in space and generate the consensus. The algorithm is a heuristic in that it computes an approximation to the optimal alignment that minimizes the sum of the pairwise distances between the consensus and the transformed proteins.
Results
Experimental results show that the algorithm converges quite rapidly and generates consensus structures that are visually similar to the input proteins. A comparison with other coordinate-based alignment algorithms (MAMMOTH and MATT) shows that the proposed algorithm is competitive in terms of speed and the sizes of the conserved regions discovered in an extensive benchmark dataset derived from the HOMSTRAD and SABmark databases.
The algorithm has been implemented in C++ and can be downloaded from the project's web page. Alternatively, the algorithm can be used via a web server which makes it possible to align protein structures by uploading files from local disk or by downloading protein data from the RCSB Protein Data Bank.
Conclusions
An algorithm is presented to compute a multiple structure alignment for a set of proteins, together with their consensus structure. Experimental results show its effectiveness in terms of the quality of the alignment and computational cost.
Background
This paper presents an algorithm to compute a multiple structure alignment for a set of proteins and to generate a consensus structure. The algorithm is called MAPSCI, which stands for Multiple Alignment of Protein Structures and Consensus Identification. MAPSCI addresses the problem of global structure alignment, which has also been considered by CE-MC [1], MAMMOTH [2], and MATT [3]. Specifically, MAPSCI computes an approximation to the multiple structure alignment that minimizes the so-called Sum-of-Consensus distance (SC-distance), i.e., the sum of the pairwise distances between the consensus structure and each protein in the set (see the Methods section for the precise definition of the SC-distance). Our experiments show that MAPSCI converges quite rapidly and produces alignments that compare favorably with the alignments produced by MAMMOTH and MATT. The consensus structures generated by MAPSCI are visually quite similar to the input proteins. Although the consensus structures are not real proteins, they could be used, for instance, as templates to perform fast searches through protein structure databases, such as the Protein Data Bank [4], to identify structurally similar proteins.
MAPSCI has a similar structure to the algorithm of Ye and Janardan [5]. However, MAPSCI works directly on the coordinates of the C_{α} atoms and produces true alignments; by contrast, the algorithm in [5] requires that the backbone vectors be translated to the origin, so information about the relative positions of the C_{α} atoms in ℝ^{3} is lost and, as a result, that algorithm does not generate true alignments. The Methods section presents the mathematical and algorithmic framework of MAPSCI and provides the complete details where the two algorithms differ significantly; where they overlap, the reader is referred to [5].
Implementation
MAPSCI represents the input proteins and the consensus as sequences of triples of coordinates of the alpha-carbon (or C_{α}) atoms along the backbone. It then computes a correspondence between the coordinate triples of the C_{α} atoms in the different protein structures by choosing one of the proteins as the initial consensus and applying an algorithm that is analogous to the center-star method for multiple sequence alignment [6]. Next, MAPSCI derives a set of translation and rotation matrices that are optimal for the computed correspondence and uses these to align the structures in space via rigid motions and obtain the new consensus. The process is repeated until the change in SC-distance is less than a prescribed threshold. This iterative process is well-defined, as it is shown in the Methods section that the SC-distance is non-increasing from one iteration to the next. The computation of the optimal translations and rotations and the new consensus is itself an iterative process that both uses the current consensus and simultaneously generates a new one.
Algorithm MAPSCI: Multiple Alignment of Protein Structures and Consensus Identification
1. Choose the initial consensus structure from the input proteins. i ← 0. SC^{0} ← ∞.
2. Do
3. if i = 0 then compute the pairwise structure alignment between the consensus and every P_{j}.
4. else use standard dynamic programming to align the consensus with every P_{j}.
5. i ← i + 1.
6. Compute a correspondence from the above alignments (either pairwise or dynamic programming) using the center-star-like method.
7. Compute the optimal translation matrix T_{j} and the optimal rotation matrix R_{j} iteratively (Theorems 2 and 3). Transform P_{j} by T_{j} and R_{j} for every j to obtain the multiple structure alignment ℳ^{i}. SC^{i} ← SC(ℳ^{i}).
8. Post-process ℳ^{i} by removing all columns consisting of only gaps.
9. Compute the new consensus structure from ℳ^{i} by Theorem 1.
10. Until the change in SC-distance between SC^{i−1} and SC^{i} is less than η. // η is a user-specified threshold (currently set at 0.0001)
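Line 4 of the algorithm refers to standard dynamic programming without spelling out the recurrence. The sketch below shows one plausible form, assuming a Needleman-Wunsch-style global alignment in which matching two coordinate triples costs their squared Euclidean distance and aligning a triple with a gap costs ρ². The function names `sq_dist` and `dp_align`, and the choice of ρ² (rather than ρ) as the gap cost, are assumptions made for illustration and are not taken from the MAPSCI source code.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two coordinate triples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def dp_align(consensus, protein, rho):
    """Globally align two coordinate sequences under the assumed cost model.

    Returns (cost, pairs), where pairs is a list of (i, j) index pairs into
    consensus and protein, with None standing for a gap.
    """
    n, m = len(consensus), len(protein)
    gap = rho ** 2  # assumed cost of aligning a triple with a gap
    # cost[i][j] = minimum cost of aligning consensus[:i] with protein[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sq_dist(consensus[i - 1], protein[j - 1]),
                cost[i - 1][j] + gap,   # consensus entry against a gap
                cost[i][j - 1] + gap,   # protein entry against a gap
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == (
            cost[i - 1][j - 1] + sq_dist(consensus[i - 1], protein[j - 1])
        ):
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif i > 0 and (j == 0 or cost[i][j] == cost[i - 1][j] + gap):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs
```

The table has (n + 1)(m + 1) entries, so one such alignment takes time proportional to the product of the two sequence lengths.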
Results
Web Server
Remote access to the server
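import urllib.request

# The rcsb parameter lists the structures to align: 1sfp and chains A and B of 1spp.
url = "http://www.geomcomp.umn.edu/mapsci/align.cgi?wsget=pdb&rcsb=1sfp+1spp:A+1spp:B"

# Download the server's response and save it as a zip archive.
with urllib.request.urlopen(url) as server, open("alignment.zip", "wb") as output:
    output.write(server.read())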
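For requests with a different set of structures, the query string can be assembled programmatically. The helper below is a small sketch that assumes the URL format of the example above generalizes, i.e., that structures are joined with `+` and a chain is selected with `id:chain`; the name `build_mapsci_url` is hypothetical and the server's actual parameter rules are not documented here.

```python
BASE_URL = "http://www.geomcomp.umn.edu/mapsci/align.cgi"


def build_mapsci_url(structures):
    """Build a request URL from PDB identifiers, each optionally 'id:chain'.

    Assumes the same query layout as the published example URL.
    """
    if len(structures) < 2:
        raise ValueError("an alignment needs at least two structures")
    return BASE_URL + "?wsget=pdb&rcsb=" + "+".join(structures)
```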
Comparison
As discussed earlier, there are many algorithms for multiple structure alignment. In general, it is difficult to make comparisons among them, since they operate under different sets of assumptions and problem formulations. We compare MAPSCI to two recent algorithms, MAMMOTH [2] and MATT [3], which also work with coordinate triples but employ a different objective function. Our experiments show that MAPSCI is competitive in terms of the sizes of the so-called conserved regions and runs significantly faster than the other two algorithms, hence can potentially scale to much larger datasets.
Benchmark datasets performance
Algorithm | HOMSTRAD: Average Core (%) | HOMSTRAD: Average Core RMSD | SABmark: Average Core (%) | SABmark: Average Core RMSD
MAPSCI | 70.99 | 0.83 (n = 232) | 48.89 | 1.00 (n = 385)
MAMMOTH | 66.74 | 0.83 (n = 231) | 44.55 | 0.99 (n = 394)
MATT | 63.79 | 0.85 (n = 229) | 47.88 | 0.99 (n = 420)
In general, it is difficult to compare two algorithms based on these two metrics (larger cores tend to have larger RMSD). However, on the HOMSTRAD dataset MAPSCI outperformed MAMMOTH in 45% of the test cases and MATT in 59% of the test cases by computing alignments with both larger cores and smaller core RMSD. (MAMMOTH and MATT were better than MAPSCI on both metrics combined in 6% and 5% of the test cases, respectively.) MAPSCI computed cores for all 232 test cases, while MAMMOTH failed to compute a core for one family (bowman), and MATT failed to compute a core for three families (asp, lipocalin, and tln).
On the SABmark dataset MAPSCI computed larger cores with better RMSD in 39% of the test cases when compared with MAMMOTH and in 37% of the test cases when compared with MATT. (MAMMOTH and MATT were better than MAPSCI on the two metrics combined in 15% and 26% of the test cases, respectively.) MATT was the most robust of the three algorithms and failed to compute a core in only five test cases; MAPSCI failed on 40 families and MAMMOTH failed on 31 families.
Methods
In this section, we provide the mathematical and algorithmic framework underlying MAPSCI. As mentioned earlier, MAPSCI shares common elements with the algorithm in [5], and we therefore follow the same general outline. However, we present the full details only where there are significant differences and refer the reader to [5] where the two algorithms overlap.
Multiple Structure Alignment: Problem Formulation
Let 𝒫 = {P_{1}, P_{2}, ⋯, P_{K}} be the given set of K proteins and let l_{i} be the number of C_{α} atoms along the backbone of protein P_{i}. We represent P_{i} as a sequence of coordinate triples p_{i}^{j} ∈ ℝ^{3}, 1 ≤ j ≤ l_{i}, where p_{i}^{j} holds the coordinates of the jth C_{α} atom of P_{i} along the backbone. (As is customary [14, 15], we consider only the backbone, not the amino acid residues themselves.) Let P_{0} = (p_{0}^{1}, ⋯, p_{0}^{l_{0}}) denote the consensus structure, of length l_{0}.
A correspondence of the K proteins in 𝒫 and the consensus structure P_{0} can be represented as a matrix H = (h_{i}^{j})_{0 ≤ i ≤ K, 1 ≤ j ≤ L}, for some L ≥ max_{0 ≤ i ≤ K}{l_{i}}, where h_{i}^{j} is either a coordinate triple belonging to the ith protein or a gap. Distances between coordinate triples are based on the squared distance between them in ℝ^{3}. The distance between a coordinate triple and a gap is called a gap penalty, and is denoted by ρ.
The results reported in this paper use 16.0 for the value of the gap penalty.
Let H_{i} denote the ith row of H, viewed as an L × 3 matrix of coordinate triples, and let G_{i} = (H_{i} − T_{i})R_{i} = (H_{i} − e × t_{i})R_{i}, for i > 0, where R_{i} ∈ ℝ^{3 × 3} is some rotation matrix, T_{i} = e × t_{i} is the translation matrix, e ∈ ℝ^{L × 1} is a vector with 1 in each entry, and t_{i} ∈ ℝ^{1 × 3} is a translation vector. (The transformation of a gap remains a gap.) Note that P_{0} remains unchanged, i.e., G_{0} = H_{0}.
Intuitively, the SC-distance measures how well the consensus structure represents the given set of K proteins. A similar distance function is used in [17], where each protein is represented as a set of vectors in ℝ^{4}.
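For orientation, the SC-distance can be written out from the definitions given in this section. The display below is a reconstruction from the surrounding text rather than a verbatim copy of Equation (1): it assumes that G_i^j denotes the jth row of G_i, that a triple aligned with a gap contributes ρ² (as in the proof of Theorem 1), and that two aligned gaps contribute zero (as stated in the convergence argument).

```latex
\[
  SC(\mathcal{M}) \;=\; \sum_{i=1}^{K} \sum_{j=1}^{L} d\bigl(G_0^{\,j},\, G_i^{\,j}\bigr),
  \qquad
  d(x, y) \;=\;
  \begin{cases}
    \lVert x - y \rVert^{2} & \text{if } x \text{ and } y \text{ are both coordinate triples},\\[2pt]
    \rho^{2} & \text{if exactly one of } x,\, y \text{ is a gap},\\[2pt]
    0 & \text{if both } x \text{ and } y \text{ are gaps}.
  \end{cases}
\]
```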
We can now define the multiple structure alignment problem as follows:
Multiple Structure Alignment Problem
Given a set {P_{1}, P_{2}, ⋯, P_{K}} of protein structures, compute a transformation (i.e., rotation and translation) for each protein, and generate a consensus structure P_{0}, such that the resulting multiple structure alignment has minimum SC-distance as defined in Equation (1).
In the next section, we present a heuristic for this problem. Our algorithm approximates the global minimum of the SC-distance by iterative refinement of an initial multiple structure alignment and converges to a local minimum.
Step I: Choice of the initial consensus structure
We consider four choices for initial consensus structure: (i) median protein, i.e. the protein of median length; (ii) center protein, i.e. the protein that minimizes the sum of the pairwise distances to all the other proteins; (iii) the minmax protein, i.e. the protein with the smallest maximum pairwise distance; and (iv) maxcore protein, i.e. the protein that generates the largest initial core. (The first three choices for initial consensus are considered in [5].)
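The first three choices depend only on the protein lengths and on a matrix of pairwise distances, so they are easy to state in code. The sketch below assumes that a symmetric matrix `dist` of pairwise structural distances has already been computed by some pairwise method; the function names are hypothetical, and the max-core choice is omitted because it depends on the definition of the core, which is not given in this section.

```python
def median_protein(lengths):
    """Index of the protein of median length (lower median when K is even)."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return order[(len(order) - 1) // 2]


def center_protein(dist):
    """Index of the protein minimizing the sum of distances to all others."""
    return min(range(len(dist)), key=lambda i: sum(dist[i]))


def minmax_protein(dist):
    """Index of the protein with the smallest maximum pairwise distance."""
    return min(range(len(dist)), key=lambda i: max(dist[i]))


# Illustrative input: four proteins with made-up lengths and distances.
lengths = [120, 95, 110, 130]
dist = [
    [0, 3, 3, 7],
    [3, 0, 6, 6],
    [3, 6, 0, 6],
    [7, 6, 6, 0],
]
choices = (median_protein(lengths), center_protein(dist), minmax_protein(dist))
```

Only the median choice avoids computing all pairwise distances, which is consistent with the lower running time quoted for it in the complexity analysis.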
Step II: Compute an initial correspondence
After we determine the consensus structure P_{0} in Step I, the K − 1 pairwise structure alignments between P_{0} and each P_{i} ≠ P_{0}, for i = 1, ⋯, K, are computed using the algorithm in [7]. (Other pairwise structure alignment algorithms could be used instead.) The K − 1 pairwise structure alignments are combined in Line 6 of the algorithm (Table 1) using the center-star-like method described in [5].
Step III: Compute optimal rotation and translation matrices and consensus structure
Direct minimization of S over the consensus structure and the T_{j}'s and R_{j}'s seems difficult. Instead, we propose an iterative procedure for minimizing S. Within each iteration, the minimization of S is carried out in two interleaved stages: (1) computation of the optimal consensus structure for given R_{j}'s and T_{j}'s, and (2) computation of the optimal R_{j}'s and T_{j}'s for a given consensus structure.
Computation of the optimal consensus structure
First, we show how to compute the consensus structure, given the rotation and translation matrices R_{ j }'s and T_{ j }'s, as stated in the following theorem:
Theorem 1. Assume that the correspondence is represented as a matrix H = (h_{i}^{j}), with the given transformations already applied, and that J = (J_{1}, ⋯, J_{L})^{T} is the optimal consensus structure. For each column j, let I_{n} be the set of indices of proteins with a non-gap in the jth column and I_{g} be the set of indices of proteins with a gap in the jth column. Then J_{j}, in the jth position of the optimal consensus structure, equals either the coordinate triple x_{j} = (1/|I_{n}|) Σ_{i ∈ I_{n}} h_{i}^{j}, or a gap.
Proof. For each j, we consider two distinct cases for J_{j}: either it is a coordinate triple, x, or a gap. If J_{j} is a gap, then the sum of the distances between J_{j} and each protein P_{i} along the jth column is |I_{n}|ρ^{2}, where ρ is the gap penalty. If J_{j} is a coordinate triple, x, then the sum of the distances between J_{j} and each protein P_{i} along the jth column is Σ_{i ∈ I_{n}} ||x − h_{i}^{j}||^{2} + |I_{g}|ρ^{2}, which is minimized for x = x_{j}. Therefore, if Σ_{i ∈ I_{n}} ||x_{j} − h_{i}^{j}||^{2} + |I_{g}|ρ^{2} < |I_{n}|ρ^{2}, then the optimal choice for J_{j} is the coordinate triple x_{j}; otherwise, the optimal choice for J_{j} is a gap.
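The column-wise rule of Theorem 1 can be sketched as follows. The code assumes the reading used in the proof: the candidate coordinate triple is the centroid of the non-gap entries in the column, each protein with a gap then contributes ρ², and the alternative of placing a gap in the consensus costs ρ² for each protein with a non-gap. The function name `consensus_column` and the use of `None` for a gap are illustrative choices.

```python
def consensus_column(column, rho):
    """Return the consensus entry for one alignment column.

    column: one entry per protein, either an (x, y, z) tuple or None for a gap.
    Returns the centroid of the non-gap entries, or None if a gap is cheaper.
    """
    points = [p for p in column if p is not None]
    n_gaps = len(column) - len(points)
    if not points:
        return None  # a column of gaps can only yield a gap
    # Centroid of the non-gap entries minimizes the sum of squared distances.
    centroid = tuple(sum(c) / len(points) for c in zip(*points))
    spread = sum(
        sum((a - b) ** 2 for a, b in zip(p, centroid)) for p in points
    )
    cost_triple = spread + n_gaps * rho ** 2   # consensus holds the centroid
    cost_gap = len(points) * rho ** 2          # consensus holds a gap
    return centroid if cost_triple < cost_gap else None
```

For example, with ρ = 4 a column holding two nearby triples and one gap keeps the centroid, whereas a column holding one triple and two gaps becomes a gap in the consensus.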
Computation of the optimal translation matrix
In this section, we show how to compute the optimal translation matrix T_{i}, for each i, for a given consensus structure. From Eq. (2), it is clear that the optimal T_{i} and T_{j}, for i ≠ j, are independent of each other. Hence, in the following, we focus on the computation of T_{i}, for a specific i. The translation matrix T_{i} can be decomposed as T_{i} = e × t_{i}, where t_{i} ∈ ℝ^{1 × 3} is the translation vector.
As mentioned earlier, the transformation of a gap remains a gap. Hence the computation of the translation and rotation matrices is independent of the mismatches (i.e., where at least one of the two elements being compared is a gap). We can thus simplify the computation by removing all mismatches in the alignment between the consensus structure and the i th protein P_{ i }.
Let A ∈ ℝ^{n × 3} and B ∈ ℝ^{n × 3} consist of the coordinate triples from the consensus structure and the ith protein, respectively, after removing the mismatches. (Here n is the number of matches between the consensus structure and the ith protein, i.e., comparisons of two non-gaps.) Without loss of generality, assume e^{T}A = [0, 0, 0], i.e., the coordinate triples in the consensus protein are centered at the origin. The optimal translation vector is the one that matches the centroids of the coordinate triple vectors from A and B, as stated in the following theorem:
Theorem 2. Let A and B be defined as above. Assume that e^{T}A = [0, 0, 0]. Then for any rotation matrix R_{i}, the optimal translation vector t_{i} for minimizing ||A − (B − e × t_{i})R_{i}||^{2} is given by t_{i} = (1/n) e^{T}B.
More details can be found in [18].
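One way to see why the centroid of B is the optimal translation is to set the gradient of the objective with respect to t_i to zero. The short derivation below is a sketch under the assumptions that the norm is the Frobenius norm, that e here has n entries (one per matched pair), and that R_i is orthogonal; the full treatment is in [18].

```latex
\[
\begin{aligned}
S_i(t_i) &= \bigl\lVert A - (B - e\,t_i)\,R_i \bigr\rVert_F^{2}, \\[4pt]
\frac{\partial S_i}{\partial t_i}
  &= 2\,e^{T}\bigl(A - B R_i + e\,t_i R_i\bigr) R_i^{T}
   = 2\bigl(e^{T}A\,R_i^{T} - e^{T}B + n\,t_i\bigr) = 0, \\[4pt]
t_i &= \tfrac{1}{n}\bigl(e^{T}B - e^{T}A\,R_i^{T}\bigr)
     = \tfrac{1}{n}\,e^{T}B
     \qquad \text{since } e^{T}A = [0, 0, 0].
\end{aligned}
\]
```

The second line uses e^{T}e = n and R_i R_i^{T} = I, which is why the result does not depend on the rotation.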
Computation of the optimal rotation matrix
With the optimal translation applied, so that B is also centered at the origin, it remains to minimize S_{i} = ||A − BR_{i}||^{2} = trace(A^{T}A) + trace(B^{T}B) − 2 trace(A^{T}BR_{i}) over rotation matrices R_{i}. Hence the minimum of S_{i} is obtained when trace(A^{T}BR_{i}) is maximized.
Let the Singular Value Decomposition (SVD) [16] of B^{T}A be UΣV^{T}, where U and V are orthogonal and Σ is diagonal.
Theorem 3. The optimal rotation matrix R_{i} that minimizes S_{i} = ||A − BR_{i}||^{2} is given by R_{i} = UWV^{T}, where W = diag(1, 1, 1) if det(UV^{T}) = 1, and W = diag(1, 1, −1) if det(UV^{T}) = −1.
More details can be found in [18].
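Theorems 2 and 3 together describe a standard least-squares rigid superposition. The sketch below assumes NumPy, points stored as rows, a consensus matrix A that is already centered, and an SVD taken of BᵀA, which is the form for which R = UWVᵀ maximizes trace(AᵀBR) when rotations act on the right. The function name `optimal_superposition` is hypothetical and this is not the MAPSCI implementation.

```python
import numpy as np


def optimal_superposition(A, B):
    """Return (t, R) minimizing ||A - (B - e t) R||_F^2 for a centered A.

    A, B: (n, 3) arrays of matched coordinate triples, one point per row.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    t = B.mean(axis=0)                    # Theorem 2: centroid of B
    Bc = B - t                            # B translated to the origin
    U, _, Vt = np.linalg.svd(Bc.T @ A)    # B^T A = U Sigma V^T
    W = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:         # would be a reflection; flip last axis
        W[2, 2] = -1.0
    R = U @ W @ Vt                        # Theorem 3: proper rotation
    return t, R


# Usage: recover a known rotation about the z-axis and a known shift.
A_demo = np.array([[1, 0, 0], [0, 2, 0], [0, 0, 3], [-1, -2, -3]], dtype=float)
theta = np.pi / 6
Q_demo = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
shift_demo = np.array([5.0, -2.0, 1.0])
B_demo = A_demo @ Q_demo.T + shift_demo
t_demo, R_demo = optimal_superposition(A_demo, B_demo)
```

The determinant check matters for nearly planar or mirrored point sets, where the unconstrained optimum would otherwise be an improper rotation.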
Convergence of the algorithm
In this section, we show that MAPSCI converges, by showing that the SC-distance is non-increasing from one iteration to the next.
Line 4 in MAPSCI does not increase the distance between the consensus structure and each of the K proteins, since the dynamic programming produces an alignment with minimum cost. By the property of the center-star-like method, Line 6 leaves unchanged the distance between the consensus structure and each of the K proteins. By Theorems 2 and 3, the transformations computed in Line 7 do not increase the distance between the consensus structure and the jth protein, for each j. It is clear that Line 8 does not change the pairwise distance, since the cost for aligning two gaps is zero. Finally, by Theorem 1, Line 9 does not increase the sum of the pairwise distances from the consensus structure to the other proteins. Hence, the SC-distance is non-increasing; since it is also bounded below by zero, the algorithm converges.
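The monotone behaviour argued above can be observed on a simplified instance. The sketch below assumes NumPy and a fixed, gap-free correspondence (all proteins have the same length and are matched position by position), so that Lines 4, 6 and 8 have no effect and only the superposition of Line 7 and the averaging of Line 9 alternate. The stopping rule uses the absolute change in SC-distance, which is an assumption; all function names are hypothetical.

```python
import numpy as np


def superpose(A, B):
    """Rigidly move B (n x 3) onto A (n x 3) and return the moved copy of B."""
    Ac = A - A.mean(axis=0)
    Bc = B - B.mean(axis=0)
    U, _, Vt = np.linalg.svd(Bc.T @ Ac)
    W = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:   # keep a proper rotation
        W[2, 2] = -1.0
    return Bc @ (U @ W @ Vt) + A.mean(axis=0)


def sc_distance(consensus, proteins):
    """Sum of squared distances between the consensus and every protein."""
    return sum(float(np.sum((consensus - P) ** 2)) for P in proteins)


def iterate_consensus(proteins, eta=1e-4, max_iter=100):
    """Alternate superposition and averaging; return consensus and SC history."""
    proteins = [np.asarray(P, dtype=float) for P in proteins]
    consensus = proteins[0].copy()          # initial consensus: first protein
    history, prev = [], np.inf
    for _ in range(max_iter):
        # Line 7: optimal rigid motion of every protein onto the consensus.
        proteins = [superpose(consensus, P) for P in proteins]
        sc = sc_distance(consensus, proteins)
        history.append(sc)
        # Line 9 (gap-free case): the new consensus is the column-wise mean.
        consensus = np.mean(proteins, axis=0)
        if prev - sc < eta:                 # assumed absolute-change criterion
            break
        prev = sc
    return consensus, proteins, history
```

In this setting each recorded SC value is no larger than the previous one (up to floating-point rounding), because both the superposition and the averaging step can only keep or lower the objective.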
Complexity analysis
Let n be the maximum length of the K proteins. Then the overall running time of the algorithm is O(K^{2}n^{2}). (If we choose the initial consensus structure as the protein of median length, the running time is O(Kn^{2} + K^{2}n).) The run time analysis is similar to that of the algorithm in [5].
Conclusions
We have presented an algorithm, called MAPSCI, to compute a multiple structure alignment for a set of proteins, together with their consensus structure. The algorithm represents the input proteins and the consensus as sequences of coordinate triples and computes an approximation to the optimal multiple structure alignment that minimizes the sum of the pairwise distances between the consensus and each input protein. Experimental results on benchmark datasets derived from the HOMSTRAD and SABmark databases show that the algorithm compares favorably with existing algorithms for multiple structure alignment (MAMMOTH and MATT).
Availability and requirements

Project name: MAPSCI

Project home page: http://www.geomcomp.umn.edu/mapsci

Operating system(s): Platform independent

Programming language: C++

License: Free BSD
Declarations
Acknowledgements
Research of JY and RJ was sponsored, in part, by the Army High Performance Computing Research Center under the auspices of the Department of the Army, Army Research Laboratory cooperative agreement number DAAD19-01-2-0014, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred.
Adam Isom contributed to the implementation of the web-based tool.
The authors thank the MAMMOTH and MATT teams for sharing their source code.
References
1. Guda C, Scheeff ED, Bourne PE, Shindyalov IN: A new algorithm for the alignment of multiple protein structures using Monte Carlo optimization. Proceedings of the Pacific Symposium on Biocomputing: 3–7 January 2001; Hawaii 2001, 275–286.
2. Lupyan D, Leo-Macias A, Ortiz AR: A new progressive-iterative algorithm for multiple structure alignment. Bioinformatics 2005, 21: 3255–3263. 10.1093/bioinformatics/bti527
3. Menke M, Berger B, Cowen L: Matt: Local Flexibility Aids Protein Multiple Structure Alignment. PLoS Computational Biology 2008, 4: 0088–0099. 10.1371/journal.pcbi.0040010
4. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Research 2000, 28: 235–242. 10.1093/nar/28.1.235
5. Ye J, Janardan R: Approximate multiple protein structure alignment using the Sum-of-Pairs distance. Journal of Computational Biology 2004, 11(5):986–1000. 10.1089/cmb.2004.11.986
6. Gusfield D: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge; 1997.
7. Ye J, Janardan R, Liu S: Pairwise protein structure alignment based on an orientation-independent backbone representation. Journal of Bioinformatics and Computational Biology 2004, 2(4):699–717. 10.1142/S021972000400082X
8. Waterhouse AM, Procter JB, Martin DMA, Clamp M, Barton GJ: Jalview Version 2 - a multiple sequence alignment editor and analysis workbench. Bioinformatics 2009, 25(9):1189–1191. 10.1093/bioinformatics/btp033
9. Chemis3D: Molecular Viewer Applet [http://chemis.free.fr/mol3d/]
10. Mizuguchi K, Deane CM, Blundell TL, Overington JP: HOMSTRAD: a database of protein structure alignments for homologous families. Protein Science 1998, 7: 2469–2471. 10.1002/pro.5560071126
11. Van Walle I, Lasters I, Wyns L: SABmark - a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 2005, 21: 1267–1268. 10.1093/bioinformatics/bth493
12. Madhusudhan MS, Webb BM, Marti-Renom MA, Eswar N, Sali A: Alignment of multiple protein structures based on sequence and structure features. Protein Engineering, Design & Selection 2009, 22(9):569–574. 10.1093/protein/gzp040
13. Venclovas C, Zemla A, Fidelis K, Moult J: Comparison of performance in successive CASP experiments. Proteins 2001, 45(S5):163–170. 10.1002/prot.10053
14. Holm L, Sander C: Protein Structure Comparison by Alignment of Distance Matrices. Journal of Molecular Biology 1993, 233: 123–138. 10.1006/jmbi.1993.1489
15. Singh AP, Brutlag DL: Hierarchical protein structure superposition using both secondary structure and atomic representation. Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology: 21–26 June, 1997; Halkidiki 1997, 284–293.
16. Golub GH, Van Loan CF: Matrix Computations. Johns Hopkins University Press, Baltimore; 1996.
17. Chew LP, Kedem K: Finding the consensus shape of a protein family. Proceedings of the Eighteenth Annual ACM Symposium on Computational Geometry: 5–7 June 2002; Barcelona 2002, 64–73.
18. Umeyama S: Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 1991, 13(4):376–380. 10.1109/34.88573
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.