MDAT- Aligning multiple domain arrangements
© Kemena et al.; licensee BioMed Central. 2015
Received: 1 October 2014
Accepted: 3 December 2014
Published: 28 January 2015
Proteins are composed of domains, protein segments that fold independently from the rest of the protein and have a specific function. During evolution the arrangement of domains can change: domains are gained, lost or their order is rearranged. To facilitate the analysis of these changes we propose the use of multiple domain alignments.
We developed an alignment program, called MDAT, which aligns multiple domain arrangements. MDAT extends earlier programs which perform pairwise alignments of domain arrangements. MDAT uses a domain similarity matrix to score domain pairs and aligns the domain arrangements using a consistency supported progressive alignment method.
MDAT will be useful for analysing changes in domain arrangements within and between protein families and will thus provide valuable insights into the evolution of proteins and their domains. MDAT is coded in C++, and the source code is freely available for download at http://www.bornberglab.org/pages/mdat.
Proteins are composed of domains, i.e. amino acid segments which have a specific function and/or structure, fold independently from the rest of the protein and are evolutionarily well conserved [1-3]. Domains are units of evolution: they influence the function of a protein and can be selected for as a whole [1,4]. The number of known domains is relatively small: currently around 15,000 domains are listed in the Pfam database. About 65-70% of the known proteins contain at least one known domain. However, the number of known arrangements, i.e. the combinations of domains in a protein, is much higher and is increasing steadily and rapidly as more genomes are sequenced. These arrangements evolve over time as domains are lost, gained and reordered, mostly by gene fusion and terminal domain losses. Typically, rearrangements occur at a rate of tens to hundreds over a span of one million years. Accordingly, rearrangements are more frequent than loss and gain of whole genes, but substantially rarer than changes at the level of amino acids.
Several studies have shown the importance of changes in domain arrangements during evolution. New arrangements can be produced by shuffling of existing domains. These new arrangements played, for example, an important role during the evolution of vertebrates, where they are involved in vertebrate-specific structures such as cartilage. In addition, it has been proposed that the usage of domains may facilitate convergent evolution. For example, it has been shown that netrin and secreted frizzled-related proteins have several independent evolutionary origins. Furthermore, it was proposed that a repository of reusable domains allows for faster adaptation in plants, since a high number of new domains and arrangements in plants are involved in stress- and adaptation-related functions. Changes in domain arrangements are less likely to occur than changes at the amino acid level and are therefore suitable traits for the reconstruction of phylogenies. Accordingly, domain occurrence has been used to calculate large-scale phylogenetic trees. Besides these large-scale approaches, it can be useful to investigate the domain arrangements of a single protein family. It has been shown, for example, that the domain arrangements in virulence genes in Plasmodium falciparum are probably the result of a trade-off between optimizing within-host fitness and minimizing between-host immune selection pressure. Also, the evolution of Cry toxins is strongly affected by reordering the arrangement of their constituent domains, and these rearrangements are important for the virulence of several bacteria.
The best currently available methods to study domain arrangements are classical multiple sequence alignment (MSA) methods, for example T-Coffee or Clustal Omega. However, these alignment methods usually do not explicitly take domain arrangements into account and therefore do not incorporate any restriction concerning their alignment. Exceptions are Dialign-Pfam and Cobalt, which use domain information to restrict the sequence alignments. Still, none of the existing methods produces a real multiple domain alignment (MDA). An MDA aligns multiple domain arrangements to find the best arrangement of domains using an objective function, similar to a traditional MSA, which arranges amino acids or nucleotides.
There are several advantages in using MDAs instead of MSAs. Due to the much shorter arrangement length compared to the primary sequence, an MDA can be calculated faster and with lower memory requirements, which is especially an advantage with large datasets. Another advantage is that a domain arrangement is more conserved than the underlying amino acid sequence. It is therefore possible to produce meaningful MDAs when the amino acid sequences are already too divergent to be compared. Furthermore, it is easier to visually examine the resulting alignments, due to the smaller number of characters.
Since similarities and differences between domain arrangements can provide insights into functional similarity and changes between proteins (see above), we present an algorithm which computes an MDA and facilitates the analysis of domain arrangements of different proteins. In this paper, we present MDAT (Multiple Domain Alignment Tool), a program that takes multiple domain arrangements and aligns them. It uses a domain similarity matrix reflecting the similarity between all pairs of domains in the Pfam database. An MDA is calculated using a combination of the RADS algorithm and the MSA consistency approach described in T-Coffee. The main goal of RADS is to compare and evaluate domain arrangements and to weight differences between domain arrangements. In addition, the resulting MDA can then serve as a backbone structure for the construction of an MSA.
Results and discussion
Domain similarity matrix
As expected, most domain pairs have a low probability of being a true positive match. It is interesting to note that a high number of domain pairs coming from the same clan have a very low probability of being a true match and that at least some domain pairs from different clans or without clan assignment have a high probability of being a true match.
Results of running three different methods on the BAliBASE3 benchmark
The use of domains as anchor points can strongly reduce memory usage and running time. Table 1 shows the results of running the MAFFT, Clustal Omega and MDAT algorithms on BAliBASE3. MDAT is three times faster than MAFFT and about nine times faster when calculating only the MDA. The increased speed comes at the cost of accuracy. A combination of different reasons can explain the loss in accuracy. Domain annotations are not perfect, and wrongly annotated domains or discrepancies in the boundaries may influence the resulting alignment. Furthermore, an error in the MDA can have a large influence on the resulting MSA as whole regions can no longer be aligned, a problem that all anchor-based methods have in common. Additionally, we use a simple implementation of the Gotoh algorithm to perform the sequence alignment; more complex techniques, such as the HMMs used for example in Clustal Omega, might provide better results.
We show that using MDAs themselves has its merits. MDAs can be used to visualize the similarity between domain arrangements in a simple way. Just like any alignment program, MDAT is not able to handle inversions. However, due to the low number of domains in a protein, inversions can easily be detected in a graphical view, which is not possible at the amino acid level. Furthermore, we demonstrate that an MDA is a good starting point for a multiple sequence alignment. It is particularly useful as guidance for the MSA, because it strongly increases the speed with which a multiple sequence alignment is calculated. Currently, the resulting MSAs from MDAT are not as accurate as traditional sequence alignments. However, due to the short calculation time, we are able to handle larger data sets. For many analyses, such as genome projects, the detection of domains is an essential part of the standard annotation procedure. Therefore, domain annotation is often readily available.
Scoring domain matches
Contrary to amino acids, no scoring matrix currently exists to handle domain matches. Therefore, we calculated a domain similarity matrix (DSM) for Pfam-A domains that stores a similarity value for each domain pair. The entries of the DSM are calculated using the HHsearch program. Every HMM of a domain in Pfam is aligned with every other HMM in Pfam, resulting in 14831² alignment pairs. As recommended, we used the probability of a true positive match as the similarity score and not the e-value. This value corresponds to the probability that the two compared models represent homologous sequences or that the sequence alignment supports a good structural alignment. Contrary to the standard BLOSUM and PAM matrices, the DSM contains only values between 0 and 100. The vast majority of entries in the DSM are values below 1, corresponding to domain pairs without similarity. Accordingly, these values do not need to be stored and can be removed from the matrix without loss of information, thus reducing the actual size of the DSM.
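The sparse storage described above can be illustrated with a minimal Python sketch. It is not the MDAT implementation (which is written in C++). The class name `SparseDSM`, the cutoff parameter `min_score` and the rule that a domain scores 100 against itself are assumptions made for this example only.

```python
from typing import Dict, Tuple


class SparseDSM:
    """Sparse, symmetric store for domain-pair similarity scores (0 to 100)."""

    def __init__(self, min_score: float = 1.0) -> None:
        self.min_score = min_score  # entries below this cutoff are not stored
        self._scores: Dict[Tuple[str, str], float] = {}

    @staticmethod
    def _key(a: str, b: str) -> Tuple[str, str]:
        # Order the pair so that (a, b) and (b, a) share one entry.
        return (a, b) if a <= b else (b, a)

    def add(self, a: str, b: str, probability: float) -> None:
        """Store a score unless it falls below the cutoff."""
        if probability >= self.min_score:
            self._scores[self._key(a, b)] = probability

    def score(self, a: str, b: str) -> float:
        """Similarity of two domains; pairs that were not stored score 0."""
        if a == b:
            return 100.0  # assumption: a domain matches itself with the top score
        return self._scores.get(self._key(a, b), 0.0)

    def __len__(self) -> int:
        # Number of domain pairs that are actually stored.
        return len(self._scores)
```

With placeholder domain names, adding the pair ("DomA", "DomB") with a score of 87.5 and the pair ("DomA", "DomC") with a score of 0.3 stores only the first entry; looking up the second pair then returns 0.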
Domain collapsing: Given a set of domain arrangements, the first step of the MDA construction is to collapse identical domain arrangements into a single one. A set of identical domain arrangements is from here on represented by a single arrangement. The length of a domain in this representative arrangement is defined as the average length of the domains it represents. A change in tandem domain repeats is the most frequently occurring domain rearrangement event. Tandem repeats are very similar and their correct alignment at the domain level is difficult to achieve. We therefore collapse successive repeats of the same domain into a single one, as previously proposed. This facilitates the alignment process, which can easily be confused by the high number of near-identical domain matches introduced by repeat copies.
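As a rough illustration of the collapsing step, the following Python sketch collapses tandem repeats and then groups proteins that share the same arrangement. The function names and the input format (a mapping from a protein identifier to its list of domain names) are hypothetical. The sketch applies repeat collapsing before grouping and does not track the averaged domain lengths; both are simplifications chosen for brevity.

```python
from collections import defaultdict
from itertools import groupby
from typing import Dict, List, Sequence, Tuple


def collapse_repeats(arrangement: Sequence[str]) -> Tuple[str, ...]:
    """Collapse successive copies of the same domain into a single one."""
    return tuple(domain for domain, _ in groupby(arrangement))


def collapse_identical(
    arrangements: Dict[str, Sequence[str]]
) -> Dict[Tuple[str, ...], List[str]]:
    """Map each distinct (repeat-collapsed) arrangement to the proteins it represents."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for protein_id, arrangement in arrangements.items():
        groups[collapse_repeats(arrangement)].append(protein_id)
    return dict(groups)
```

For example, `collapse_repeats(["A", "B", "B", "B", "C", "B"])` yields `("A", "B", "C", "B")`: only successive copies are merged, so the final, non-adjacent `B` is kept.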
Library construction: In the next step the RADS algorithm, a dynamic programming algorithm, is used to produce alignments between all pairs of arrangements. RADS has been extended to use the DSM to score a match of two domains instead of a fixed value. The matches identified in this alignment are then stored in a library.
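The library construction can be sketched in Python as follows. The sketch does not reproduce RADS; it substitutes a plain global dynamic programming alignment with a linear gap penalty, and the names `align_pair` and `build_library`, the gap value and the library layout are assumptions for illustration. The `score` argument stands for any function that returns a domain-pair similarity, such as a DSM lookup.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple

Position = Tuple[str, int]  # (arrangement name, domain index)


def align_pair(x: Sequence[str], y: Sequence[str],
               score: Callable[[str, str], float],
               gap: float = -10.0) -> List[Tuple[int, int, float]]:
    """Globally align two arrangements; return matched index pairs with scores."""
    n, m = len(x), len(y)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    # Trace back from the bottom-right corner, preferring diagonal moves.
    matches = []
    i, j = n, m
    while i > 0 and j > 0:
        s = score(x[i - 1], y[j - 1])
        if dp[i][j] == dp[i - 1][j - 1] + s:
            if s > 0:  # keep only pairs with some similarity
                matches.append((i - 1, j - 1, s))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    matches.reverse()
    return matches


def build_library(arrangements: Dict[str, Sequence[str]],
                  score: Callable[[str, str], float],
                  gap: float = -10.0) -> Dict[Tuple[Position, Position], float]:
    """Align all pairs of arrangements and collect their domain matches."""
    library = {}
    for (a, x), (b, y) in combinations(arrangements.items(), 2):
        for i, j, s in align_pair(x, y, score, gap):
            library[((a, i), (b, j))] = s
    return library
```

Each library key pairs two positions, one from each arrangement, so later steps can look up which domains were matched and how strongly.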
Library extension: The matches from the library are rescored according to the algorithm described in T-Coffee. The reweighting aims to increase the score of a match that is supported by matches in a third arrangement: if domain α in arrangement X and domain β in arrangement Y are matched, the score of the match α−β is increased if there is an arrangement Z with a domain γ that matches both α and β.
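The rescoring idea can be sketched in Python along the lines of the T-Coffee triplet extension, in which each supporting third position contributes the smaller of its two match weights. The function name `extend_library` and the library layout (a dictionary keyed by pairs of `(arrangement, index)` positions) are assumptions for this example. Unlike the full T-Coffee extension, this sketch only rescores matches that are already in the library and does not create new ones.

```python
from collections import defaultdict
from typing import Dict, Tuple

Position = Tuple[str, int]  # (arrangement name, domain index)
Library = Dict[Tuple[Position, Position], float]


def extend_library(library: Library) -> Library:
    """Raise the weight of each match by the support it gets from third arrangements."""
    # neighbours[p][q] is the weight of the match between positions p and q.
    neighbours: Dict[Position, Dict[Position, float]] = defaultdict(dict)
    for (p, q), weight in library.items():
        neighbours[p][q] = weight
        neighbours[q][p] = weight

    extended = {}
    for (p, q), weight in library.items():
        total = weight
        for r, weight_pr in list(neighbours[p].items()):
            if r[0] in (p[0], q[0]):
                continue  # r must lie in a third arrangement
            weight_rq = neighbours[r].get(q)
            if weight_rq is not None:
                # The weaker of the two links limits the support.
                total += min(weight_pr, weight_rq)
        extended[(p, q)] = total
    return extended
```

For instance, if α−β has weight 80, α−γ has weight 60 and γ−β has weight 90, the extended weight of α−β becomes 80 + min(60, 90) = 140, while a match without any third-arrangement support keeps its original weight.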
Alignment calculation: Using these scores, a standard progressive alignment, as first described by Higgins and Sharp, is performed.
Refinement: The last step of the algorithm is a simple refinement step. Blocks of domains are shifted to other columns if this increases the number of identical domains in a column.
First alignment step: All sequences that are represented by the same domain arrangement are aligned first. The sequences are split at domain boundaries and each pair of segments is aligned separately, thus allowing easy parallelization of this step. Domain segments are aligned using a banded alignment approach as previously described in the Pecan genome aligner.
Second alignment step: Following a guide tree constructed from the domain architecture similarity, the alignment profiles computed in the first step are progressively aligned with each other. At each node in the tree, two profiles are aligned that are based on different domain architectures. Similar to the first step, sequences can be split at domain boundaries. However, this is only possible at domains that are aligned with each other. The sequence segments between two aligned domains cannot be simply aligned in a global fashion, because there may be non-aligned domains which should not be aligned on the sequence level either (see Figure 4). In this case, the corresponding area in the dynamic programming matrix is declared forbidden. The alignment algorithm does not pass through these areas that are forbidden by the MDA and thus avoids violating the order of the domains as defined by the columns of the MDA.
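The effect of forbidden areas can be illustrated with a small constrained dynamic programming sketch in Python. It is a simplification under stated assumptions, not the MDAT code: it uses a linear gap penalty instead of the Gotoh algorithm, aligns two plain sequences rather than profiles, and implements a forbidden area by disallowing any residue pairing inside it (gap moves remain possible, so the path can still get past the area). The function name, the scoring values and the rectangle format `(a_start, a_end, b_start, b_end)` with half-open residue ranges are all hypothetical.

```python
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")
Rect = Tuple[int, int, int, int]  # (a_start, a_end, b_start, b_end), half-open


def align_with_forbidden(a: str, b: str, forbidden: Sequence[Rect] = (),
                         match: int = 1, mismatch: int = -1,
                         gap: int = -1) -> Tuple[float, List[Tuple[int, int]]]:
    """Global alignment in which residues inside a forbidden area are never paired."""

    def pairing_forbidden(i: int, j: int) -> bool:
        return any(a0 <= i < a1 and b0 <= j < b1 for a0, a1, b0, b1 in forbidden)

    n, m = len(a), len(b)
    score = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            candidates = []
            if i > 0 and j > 0 and not pairing_forbidden(i - 1, j - 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                candidates.append((score[i - 1][j - 1] + s, "diag"))
            if i > 0:
                candidates.append((score[i - 1][j] + gap, "up"))
            if j > 0:
                candidates.append((score[i][j - 1] + gap, "left"))
            if candidates:
                score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])

    # Trace back and collect the residue pairs that were aligned.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs
```

Aligning `"GGKKK"` with itself without constraints pairs all five residues. Declaring the area `(2, 5, 2, 5)` forbidden leaves only the two leading `G` residues paired, and the `KKK` segments are placed opposite gaps.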
Currently no reference benchmark exists for the evaluation of MDAs. Therefore, we use the BAliBASE3 benchmark that was originally developed for MSAs. To be able to evaluate an MDA, we annotated the sequences included in BAliBASE3 with Pfam domains using pfam_scan in combination with the HMMER3 program. Since BAliBASE3 is a set of reference sequence alignments, it is possible that more than one domain is aligned to another one, conflicting with the alignment definition that a domain is aligned only to a single other domain. We define two domains as being aligned to each other if at least three quarters of both domains are aligned with each other. For this benchmark the repeat collapsing has been turned off to allow a comparison of the MDA with the pairs extracted from BAliBASE3.
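One possible reading of the three-quarters criterion is sketched below in Python: each domain is mapped to the alignment columns its residues occupy, and two domains count as aligned if the shared columns cover at least 75% of each domain. This interpretation, the function names and the input format (a gapped alignment row plus half-open, 0-based residue ranges) are assumptions for illustration.

```python
from typing import Iterable, List


def residue_columns(aligned_row: str, start: int, end: int, gap: str = "-") -> List[int]:
    """Alignment columns holding residues start..end-1 (ungapped, 0-based) of one row."""
    columns = []
    residue = 0
    for column, char in enumerate(aligned_row):
        if char == gap:
            continue
        if start <= residue < end:
            columns.append(column)
        residue += 1
    return columns


def domains_aligned(cols_a: Iterable[int], cols_b: Iterable[int],
                    min_fraction: float = 0.75) -> bool:
    """True if the shared columns cover at least min_fraction of both domains."""
    set_a, set_b = set(cols_a), set(cols_b)
    if not set_a or not set_b:
        return False
    shared = len(set_a & set_b)
    return shared >= min_fraction * len(set_a) and shared >= min_fraction * len(set_b)
```

A domain spanning columns 0 to 99 and one spanning columns 20 to 109 share 80 columns, which is at least three quarters of both, so they count as aligned; shifting the second domain to columns 50 to 149 leaves only 50 shared columns and the pair is rejected.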
To check the performance of the MDA2MSA algorithm, MDAT was run on the BAliBASE3 benchmark and compared to two other alignment methods, MAFFT (v6.940b) and Clustal Omega (v1.2).
We would like to thank Elias Dohmen for helpful suggestions to improve the manuscript. CK was funded by DFG grant BO2544/4-1. We acknowledge support by Deutsche Forschungsgemeinschaft and Open Access Publication Fund of University of Muenster. EBB ORCID ID is http://orcid.org/0000-0002-1826-3576.
- Moore AD, Björklund AK, Ekman D, Bornberg-Bauer E, Elofsson A. Arrangements in the modular evolution of proteins. Trends Biochem Sci. 2008; 33(9):444–51.
- Marsh JA, Teichmann SA. How do proteins gain new domains? Genome Biol. 2010; 11(7):126.
- Forslund K, Sonnhammer ELL. Evolution of protein domain architectures. Methods Mol Biol. 2012; 856:187–216.
- Bornberg-Bauer E, Albà M M. Dynamics and adaptive benefits of modular protein evolution. Curr Opin Struct Biol. 2013; 23(3):459–66.
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate JG, Boursnell C, et al. The Pfam protein families database. Nucleic Acids Res. 2012; 40(Database-Issue):290–301.
- Ekman D, Björklund AK, Frey-Skött J, Elofsson A. Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J Mol Biol. 2005; 348(1):231–43.
- Levitt M. Nature of the protein universe. Proc Natl Acad Sci U S A. 2009; 106(27):11079–84.
- Kersting AR, Bornberg-Bauer E, Moore AD, Grath S. Dynamics and adaptive benefits of protein domain emergence and arrangements during plant genome evolution. Genome Biol Evol. 2012; 4(3):316–29.
- Kawashima T, Kawashima S, Tanaka C, Murai M, Yoneda M, Putnam NH, et al. Domain shuffling and the evolution of vertebrates. Genome Res. 2009; 19(8):1393–403.
- Leclère L, Rentzsch F. Repeated evolution of identical domain architecture in metazoan netrin domain-containing proteins. Genome Biol Evol. 2012; 4(9):883–99.
- Fang H, Oates ME, Pethica RB, Greenwood JM, Sardar AJ, Rackham OJL, et al. A daily-updated tree of (sequenced) life as a reference for genome research. Sci Rep. 2013; 3:2015.
- Buckee CO, Recker M. Evolution of the multi-domain structures of virulence genes in the human malaria parasite, Plasmodium falciparum. PLoS Comput Biol. 2012; 8(4):1002451.
- de Maagd RA, Bravo A, Berry C, Crickmore N, Schnepf HE. Structure, diversity, and evolution of protein toxins from spore-forming entomopathogenic bacteria. Annu Rev Genet. 2003; 37:409–33.
- Notredame C, Higgins DG, Heringa J. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000; 302(2):205–17.
- Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011; 7:539.
- Ait LA, Yamak Z, Morgenstern B. DIALIGN at GOBICS–multiple sequence alignment using various sources of external information. Nucleic Acids Res. 2013; 41:W3-7.
- Papadopoulos JS, Agarwala R. COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics. 2007; 23(9):1073–79.
- Terrapon N, Weiner J, Grath S, Moore AD, Bornberg-Bauer E. Rapid similarity search of proteins using alignments of domain arrangements. Bioinformatics. 2014; 30(2):274–81.
- Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, et al. Pfam: clans, web tools and services. Nucleic Acids Res. 2006; 34:247–51.
- Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz H-R, et al. The Pfam protein families database. Nucleic Acids Res. 2008; 36(Database issue):281–8.
- Thompson JD, Koehl P, Ripp R, Poch O. BAliBASE 3.0: latest Developments of the Multiple Sequence Alignment Benchmark. Proteins. 2005; 61(1):127–36.
- Gotoh O. Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol. 1996; 264(4):823–38.
- Söding J. Protein homology detection by HMM-HMM comparison. Bioinformatics. 2005; 21(7):951–60.
- Henikoff S, Henikoff JG. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 1992; 89(22):10915–9.
- Dayhoff MOSRM, Orcutt BC. A model of evolutionary change in proteins. Atlas Protein Sequence Struct. 1978; 5:345–52.
- Geer LY, Domrachev M, Lipman DJ, Bryant SH. CDART: protein homology by domain architecture. Genome Res. 2002; 12(10):1619–23.
- Higgins DG, Sharp PM. CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene. 1988; 73(1):237–44.
- Paten B, Herrero J, Beal K, Birney E. Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment. Bioinformatics. 2008; 25(3):259–91.
- Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011; 7(10):1002195.
- Katoh K, Kuma K, Toh H, Miyata T. MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 2005; 33(2):511–8.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.