MATLIGN: a motif clustering, comparison and matching tool
© Kankainen and Löytynoja; licensee BioMed Central Ltd. 2007
Received: 23 January 2007
Accepted: 08 June 2007
Published: 08 June 2007
Sequence motifs representing transcription factor binding sites (TFBS) are commonly encoded as position frequency matrices (PFM) or degenerate consensus sequences (CS). These formats are used to represent the characterised TFBS profiles stored in transcription factor databases, as well as to represent the potential motifs predicted using computational methods. To fill the gap between the known and predicted motifs, methods are needed for the post-processing of prediction results, i.e. for matching, comparison and clustering of pre-selected motifs. The computational identification of over-represented motifs in sets of DNA sequences is, in particular, a task where post-processing can dramatically simplify the analysis. Efficient post-processing, for example, reduces the redundancy of the motifs predicted and enables them to be annotated.
To facilitate the post-processing of motifs in both PFM and CS formats, we have developed a tool called Matlign. The tool aligns motifs and evaluates their similarity using a combination of scoring functions, and visualises the results using hierarchical clustering. By limiting the number of distinct gaps created (though not their length), the alignment algorithm also correctly aligns motifs with an internal spacer. The method selects the best non-redundant motif set, with repetitive motifs merged together, by cutting the hierarchical tree using silhouette values. Our analyses show that Matlign can reliably discover the most similar analogue from a collection of characterised regulatory elements, so the method is also useful for annotating motif predictions by PFM library searches.
Matlign is a user-friendly tool for post-processing large collections of DNA sequence motifs. Starting from a large number of potential regulatory motifs, Matlign provides a researcher with a non-redundant set of motifs, which can then be further associated with known regulatory elements. A web server is available at http://ekhidna.biocenter.helsinki.fi/poxo/matlign.
Transcription factor mediated gene regulation is one of the main cellular mechanisms to control gene expression. The regulation is mostly performed by transcription factors binding to short, degenerate sequence motifs that recur frequently in the genome [1, 2]. The binding specificities of the factors are commonly summarised as position frequency matrices (PFM) or consensus sequences (CS): PFMs list the number of occurrences of each nucleotide (columns) at each position of the aligned binding sites (rows), whereas CSs represent a motif using degenerate symbols, each of which assigns an equal frequency to every base it encodes.
Computational methods are often used to predict gene regulatory elements from a set of promoter sequences of similarly behaving genes, e.g. a set of co-expressed genes. Since the regulatory elements targeted by a given transcription factor are expected to resemble each other, over-represented DNA elements are taken as an indication of a common regulatory element and are searched for [3–5]. The actual motif discovery is performed using either probabilistic or deterministic optimisation, or pattern enumeration; both approaches, although for different reasons, report repetitive motifs. In the first case, the search algorithms may stochastically terminate at different solutions and, because of this ambiguity, repeating the analysis is recommended, which yields multiple sets of similar motifs. A pattern enumeration technique that evaluates all possible patterns is guaranteed to find the most over-represented ones, but it also reports repetitive motifs because numerous overlapping forms of the same motif are discovered. Alternatively, because regulatory elements are evolutionarily constrained across species, gene regulation can be inferred by searching for conserved DNA segments. Some recently developed motif discovery methods incorporate evolutionary information into the search for enriched motifs; these methods, however, also report redundant sets of motifs as they typically use probabilistic or deterministic optimisation.
Motif prediction tools typically output sets of PFMs or CSs. Of these, CSs could be analysed using conventional sequence alignment methods, such as those in the EMBOSS package, but these tools are not designed for analyses of hundreds of motifs and, hence, are inconvenient to use. Methods specifically designed to align and compare sequence motifs do exist, e.g. the pattern assembly and comparison tools in RSA-tools, YSRA, TREG, MatCompare, CompareAce and PROCSE [4, 8–14]. However, each of them lacks some desired features, such as alignment of motifs with variable-length spacers, analyses of sets consisting of both PFMs and CSs, creation of new alignments or discovery of optimal and non-redundant motif sets.
In summary, Matlign is a practical post-processing tool for the comparison and clustering of short DNA motifs. We believe that Matlign is useful for tasks that involve measuring the similarity of motifs, such as identification of consensus motifs in multiple result sets, identification of the best matching hit from a collection of known regulatory elements, and grouping together similar and redundant motifs. By reducing the undesired redundancy of raw data, Matlign saves the user from laborious and time-consuming analyses and facilitates the interpretation of motif prediction results.
We have implemented a dynamic programming algorithm for the alignment of motifs containing at most one internal gap event. Following the method of Gotoh, the match state is separated from the two gap states to allow for a more realistic gap cost function. However, similar to Sankoff, a return from a gap state to the earlier match state is not permitted and a move has to be taken to the succeeding match state. The procedure ensures that the chosen path has at most the specified number of distinct gap events but does not limit the total length of gaps; terminal gaps are not penalised. See Additional file 1 for a more detailed description of the dynamic programming algorithm.
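The idea of bounding the number of gap events rather than their length can be illustrated with a minimal Python sketch. It is not the algorithm of Additional file 1: the function name `align_score`, the layered tables `M`, `X`, `Y` indexed by the number of opened gap events, and the treatment of terminal overhangs are assumptions made for illustration. Any column-scoring function can be passed as `col_score`.

```python
NEG_INF = float("-inf")


def align_score(a, b, col_score, gap_open=-10.0, gap_ext=-1.0, max_gaps=1):
    """Best overlap-alignment score of motifs a and b with at most
    `max_gaps` internal gap events (hypothetical sketch).

    M[k][i][j]: a[i-1] is matched to b[j-1] and k gap events were opened.
    X[k][i][j]: a[i-1] faces a gap; Y[k][i][j]: b[j-1] faces a gap.
    """
    n, m = len(a), len(b)

    def table():
        return [[[NEG_INF] * (m + 1) for _ in range(n + 1)]
                for _ in range(max_gaps + 1)]

    M, X, Y = table(), table(), table()
    best = NEG_INF
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = col_score(a[i - 1], b[j - 1])
            for k in range(max_gaps + 1):
                prev = max(M[k][i - 1][j - 1],
                           X[k][i - 1][j - 1],
                           Y[k][i - 1][j - 1])
                if k == 0 and (i == 1 or j == 1):
                    prev = max(prev, 0.0)  # leading terminal gap is free
                M[k][i][j] = prev + s
                if k > 0:
                    # A gap can only be opened from a match state, and
                    # opening it consumes one of the allowed gap events.
                    X[k][i][j] = max(M[k - 1][i - 1][j] + gap_open,
                                     X[k][i - 1][j] + gap_ext)
                    Y[k][i][j] = max(M[k - 1][i][j - 1] + gap_open,
                                     Y[k][i][j - 1] + gap_ext)
                if i == n or j == m:
                    # Trailing terminal gap is free: stop at any match
                    # that reaches the end of either motif.
                    best = max(best, M[k][i][j])
    return best
```

With a toy scoring function (5 for identical symbols, -4 otherwise), aligning `ACGTACGT` against `ACGTNNNACGT` with `max_gaps=1` bridges the three-column spacer with a single gap event, whereas `max_gaps=0` can only overlap one half of the motif.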
Matlign automatically converts CSs into PFMs and treats all input motifs in a similar manner. For CSs, the program supports the 15-letter IUPAC code and decodes the degenerate symbols to nucleotide frequencies by sharing the probability equally among the relevant bases. This conversion can be adjusted by correcting the nucleotide frequencies according to a user-defined AT/GC-ratio and/or by adding pseudocounts to matrices. Given the matrix representation of all motifs, the score of matching two motif sites is computed for two vectors of nucleotide frequencies. The scoring function for the motif matching can be chosen from the five implemented functions: Kendall's tau rank correlation coefficient (I), Spearman's rank correlation coefficient (II), Pearson correlation coefficient (III), normalised Euclidean distance (IV) and evolutionary substitution score (V), or any combination of these. Most of the functions are described in detail by Pietrokovski, who, for example, noted that the Spearman's and Pearson correlations are the most suitable functions for proteins. Since CSs can be based on low frequency counts and their degenerate symbols produce somewhat artificial nucleotide frequencies, the more robust rank correlations are recommended for them. Matlign combines the different scores, or alternatively their Z-scores, by calculating their product and using a signum function that returns the most frequently occurring sign among the scores selected. When Z-scores are used, Matlign first estimates the population mean and standard deviation by performing an all-against-all matching of IUPAC symbols, i.e. each of the 15 symbols is matched against each other using the chosen distance function, and calculating the mean and standard deviation of these scores. The Z-scores are then derived by subtracting the population mean from an individual score and dividing the difference by the population standard deviation.
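The conversion and score combination described above can be sketched in Python as follows. This is an illustrative sketch only, not Matlign's source code: the helper names are hypothetical, only the Pearson correlation is implemented, and three details are assumptions (a zero score when a column has no variance, a positive sign on ties, and the inclusion of self-comparisons in the all-against-all background).

```python
import math
import statistics

BASES = "ACGT"
# The 15-letter IUPAC nucleotide code and the bases each symbol encodes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_pfm(consensus):
    """Share the probability equally among the bases of each symbol."""
    return [[1.0 / len(IUPAC[sym]) if base in IUPAC[sym] else 0.0
             for base in BASES]
            for sym in consensus.upper()]


def pearson(x, y):
    """Pearson correlation of two frequency vectors (0.0 if undefined)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    if vx == 0.0 or vy == 0.0:  # e.g. the uniform column of symbol N
        return 0.0              # assumption: treat as uncorrelated
    return cov / math.sqrt(vx * vy)


def combine(scores):
    """Product of the scores, signed by the most frequent sign."""
    positives = sum(s > 0 for s in scores)
    negatives = sum(s < 0 for s in scores)
    sign = 1.0 if positives >= negatives else -1.0  # assumption: ties -> +
    return sign * abs(math.prod(scores))


def z_score_background(score_fn):
    """Mean and standard deviation over all pairs of IUPAC symbols."""
    columns = [consensus_to_pfm(sym)[0] for sym in IUPAC]
    values = [score_fn(x, y) for x in columns for y in columns]
    return statistics.fmean(values), statistics.pstdev(values)


def z_score(score, mean, sd):
    return (score - mean) / sd
```

For instance, `consensus_to_pfm("R")` yields a single column with frequency 0.5 for A and G, and `combine([0.5, -0.5, 0.5])` returns a positive value because two of the three scores are positive.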
Agglomerative hierarchical clustering is a commonly used method to group a collection of elements into subsets or clusters. It classifies the elements by recursively joining the two most similar ones, and creates a tree representing the nested grouping events (see the review of Jain et al.). We start by computing an all-against-all similarity matrix of alignment scores using the dynamic programming algorithm. Then, the two most similar elements, i.e. the motif pair with the highest alignment score, are recursively joined until a single motif remains. In a joining event, a new motif is created as the alignment of the two motifs, and the distance matrix scores are updated by calculating the averaged distance between the motifs in the newly created cluster and all other motifs.
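The joining loop can be sketched as plain average-linkage clustering over a precomputed similarity matrix. The sketch below is a simplification under stated assumptions: the function name `average_linkage` is hypothetical, and the step in which Matlign builds a new motif from the alignment of the two joined motifs is omitted, so only the averaged scores are updated.

```python
def average_linkage(sim):
    """Agglomerative clustering on a symmetric similarity matrix.

    Returns the joining events in order, each as
    (members_of_first, members_of_second, averaged_similarity).
    """
    clusters = {i: [i] for i in range(len(sim))}
    next_id = len(sim)
    merges = []

    def averaged(p, q):
        # Mean pairwise similarity between the members of two clusters.
        pairs = [(x, y) for x in clusters[p] for y in clusters[q]]
        return sum(sim[x][y] for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        ids = sorted(clusters)
        # Pick the most similar pair of current clusters.
        score, p, q = max((averaged(p, q), p, q)
                          for pos, p in enumerate(ids)
                          for q in ids[pos + 1:])
        merges.append((clusters[p], clusters[q], score))
        clusters[next_id] = clusters.pop(p) + clusters.pop(q)
        next_id += 1
    return merges
```

The returned list of joining events corresponds to the levels of the hierarchical tree, from the first pair joined to the final merge into a single cluster.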
To decide where to cut the hierarchical tree, a silhouette value s(i) is computed for each element i: the average distance from the element to all other elements within the same cluster, a(i), is compared with the average distance from the element to the elements of the closest other cluster, b(i). The resulting value ranges between -1 (poor classification) and 1 (good classification), and the clustering yielding the highest average s(i) is chosen.
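In Rousseeuw's standard definition, the comparison is s(i) = (b(i) - a(i)) / max(a(i), b(i)). A minimal Python sketch of the averaged value for one candidate cut is given below; the function name `mean_silhouette` is hypothetical, and assigning s(i) = 0 to elements in single-member clusters follows Rousseeuw's convention rather than anything stated about Matlign.

```python
def mean_silhouette(dist, labels):
    """Average silhouette value of a clustering.

    dist:   symmetric matrix of pairwise distances
    labels: cluster label of each element
    """
    members = {}
    for i, label in enumerate(labels):
        members.setdefault(label, []).append(i)
    if len(members) < 2:
        raise ValueError("silhouettes need at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in members[label] if j != i]
        if not own:
            continue  # single-member cluster: s(i) = 0 by convention
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(sum(dist[i][j] for j in other) / len(other)
                for other_label, other in members.items()
                if other_label != label)
        if max(a, b) > 0.0:
            total += (b - a) / max(a, b)
    return total / len(labels)
```

Evaluating this function for the clustering at every level of the tree and keeping the level with the highest value corresponds to the selection rule described above.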
The false discovery rate (FDR) is the expected proportion of true null hypotheses rejected out of the total number of null hypotheses rejected. In Matlign, the FDR for each alignment is calculated using a permutation technique. The alignment positions (rows) between the PFMs are first randomised, after which the nucleotide counts (columns) within each new randomised PFM are separately permuted. Based on the desired number of these permutations, Matlign calculates the FDR for each alignment as the average number of permuted alignments that have a score as good as or better than the real alignment's score, divided by the number of alignments in the real data that have a score as good as or better than that score. As the process is repeated at each level of the hierarchy, the calculation of the FDR can be a time-consuming step.
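The ratio described above can be sketched in Python as follows. Both helper names are hypothetical, and two simplifications are assumptions: `permute_pfm` shuffles the rows of a single PFM rather than exchanging rows between PFMs, and the resulting estimate is capped at 1.

```python
import random


def permute_pfm(pfm, rng):
    """Shuffle the positions (rows), then the counts within each row."""
    rows = [list(row) for row in pfm]
    rng.shuffle(rows)
    for row in rows:
        rng.shuffle(row)
    return rows


def permutation_fdr(real_scores, permuted_runs):
    """FDR estimate for every real alignment score.

    real_scores:   alignment scores observed in the real data
    permuted_runs: one list of alignment scores per permutation round
    """
    fdrs = []
    for score in real_scores:
        # Average number of permuted alignments scoring at least as well.
        false_hits = sum(sum(p >= score for p in run)
                         for run in permuted_runs) / len(permuted_runs)
        # Real alignments scoring at least as well (always >= 1).
        real_hits = sum(r >= score for r in real_scores)
        fdrs.append(min(1.0, false_hits / real_hits))  # cap: assumption
    return fdrs
```

For example, with real scores 10, 8 and 3 and two permutation rounds scoring (4, 2, 1) and (9, 3, 0), the estimates are 0, 0.25 and 0.5 respectively.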
Results and discussion
Matlign is a tool to group and compare sequence motifs. To demonstrate the method's functionality, we describe a set of realistic examples of its usage. The first example focuses on the annotation of motifs, the three following examples show how to use Matlign to reduce the redundancy of motif prediction results, and the last example shows how to create consensus predictions. The data for the examples presented here can be found and re-analysed using the Matlign server.
Overall performance of different methods. Area under the ROC-curve (AUC) with different noise-disturbed data sets.
Examples 2–4 demonstrate how to use Matlign for the post-processing of motifs obtained with different motif prediction tools. The test data, a set of promoter sequences of co-regulated genes from S. cerevisiae, were obtained from SCPD by choosing the genes regulated by the PDR3 transcription factor. The first result set (Example 2) is from the probabilistic tool MotifSampler, whereas Examples 3 and 4 show the results of the pattern enumeration tools POCO and oligo-analysis [4, 5]. In each example, Matlign is used to discover similar motifs from the redundant set produced by the corresponding motif prediction tool, to cluster these motifs together, and to return a non-redundant motif set to the user. In all analyses, Matlign was run using its default parameters (see Abbreviations for details).
In Example 2, at the highest silhouette value the original 100 predictions are grouped into 17 clusters that vary in size from a single motif to a cluster of 72 almost identical motifs. Using the same procedure, the 50 pattern predictions of POCO and oligo-analysis are reduced to 29 and 33 clusters, respectively. The number of clusters retained varies depending on the prediction method and indicates that certain tools can indeed remove a portion of the undesired redundancy. However, when the prediction results were post-processed using Matlign, these results were compressed to nearly one half of their original size, simplifying the analysis of the remaining motifs and the interpretation of the results.
Evaluation of best motifs. Best motifs predicted by Matlign and the individual motif prediction tools. Rank and Hits indicate the rank of the motif in the original analysis and the number of motif instances with raw score higher than or equal to 6, respectively. The significance of the motifs was determined using Clover.
Matlign is a user-friendly web-tool to cluster and compare DNA sequence motifs. We have demonstrated that Matlign outperforms other available tools in finding remote analogues and is the preferable choice for the annotation and verification of potential binding site targets using collections of known motifs. By efficiently reducing the undesired redundancy of input motifs, Matlign speeds up the refinement of large collections of motif predictions and facilitates the interpretation of the results.
Availability and requirements
Project name: Matlign
Project home page: http://ekhidna.biocenter.helsinki.fi/poxo/matlign
Operating system: Unix
Programming language: C++/Perl
Other requirements: The following C++ libraries: stdio, stdlib, string, vector, cmath, iostream, fstream, utility, cassert and config. For the web-server version: Zope, Gnuplot
Restrictions to use by non-academics: None
Abbreviations
transcription factor binding site
position frequency matrix
pregnane X receptor
retinoic acid receptor alpha
false discovery rate
receiver operating characteristic
area under the receiver operating characteristic-curve
- Default parameters:
match = 5, transversion = -4, transition = -4, gap open = -10, gap extension = -1, maximal gap = undefined, spacers = true, Z-score = true, pseudocounts = 0, AT-frequency = 0.5, and Spearman's rank correlation coefficient, Pearson correlation coefficient and evolutionary substitution score (Viterbi-score)
This work was enabled by a Marie Curie grant to MK. He thanks Nick Goldman for the opportunity to visit the EBI.
1. Wray G, Hahn M, Abouheif E, Balhoff J, Pizer M, Rockman M, Romano L: The evolution of transcriptional regulation in eukaryotes. Mol Biol Evol 2003, 20: 1377–1419. 10.1093/molbev/msg140
2. D'haeseleer P: What are DNA sequence motifs? Nat Biotechnol 2006, 24: 423–425. 10.1038/nbt0406-423
3. Thijs G, Lescot M, Marchal K, Rombauts S, Moor BD, Rouze P, Moreau Y: A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 2001, 17: 1113–1122. 10.1093/bioinformatics/17.12.1113
4. van Helden J: Regulatory sequence analysis tools. Nucleic Acids Res 2003, 31: 3593–3596. 10.1093/nar/gkg567
5. Kankainen M, Holm L: POCO: discovery of regulatory patterns from promoters of oppositely expressed gene sets. Nucleic Acids Res 2005, 33: W427–431. 10.1093/nar/gki467
6. Prakash A, Tompa M: Discovery of regulatory elements in vertebrates through comparative genomics. Nat Biotechnol 2005, 23: 1249–1256. 10.1038/nbt1140
7. Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16: 276–277. 10.1016/S0168-9525(00)02024-2
8. Sandelin A, Hoglund A, Lenhard B, Wasserman WW: Integrated analysis of yeast regulatory sequences for biologically linked clusters of genes. Funct Integr Genomics 2003, 3: 125–134. 10.1007/s10142-003-0086-6
9. Roepcke S, Grossmann S, Rahmann S, Vingron M: T-Reg Comparator: an analysis tool for the comparison of position weight matrices. Nucleic Acids Res 2005, 33: W438–441. 10.1093/nar/gki590
10. Schones D, Sumazin P, Zhang M: Similarity of position frequency matrices for transcription factor binding sites. Bioinformatics 2005, 21: 307–313. 10.1093/bioinformatics/bth480
11. Smith AD, Sumazin P, Xuan Z, Zhang MQ: DNA motifs in human and mouse proximal promoters predict tissue-specific expression. Proc Natl Acad Sci USA 2006, 103: 6275–6280. 10.1073/pnas.0508169103
12. Hughes JD, Estep PW, Tavazoie S, Church GM: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol 2000, 296: 1205–1214. 10.1006/jmbi.2000.3519
13. Pietrokovski S: Searching databases of conserved sequence regions by aligning protein multiple-alignments. Nucleic Acids Res 1996, 24: 3836–3845. 10.1093/nar/24.19.3836
14. van Nimwegen E, Zavolan M, Rajewsky N, Siggia ED: Probabilistic clustering of sequences: inferring new bacterial regulons by comparative genomics. Proc Natl Acad Sci USA 2002, 99: 7323–7328. 10.1073/pnas.112690399
15. Goodwin B, Moore LB, Stoltz CM, McKee DD, Kliewer SA: Regulation of the human CYP2B6 gene by the nuclear pregnane X receptor. Mol Pharmacol 2001, 60: 427–431.
16. Xie W, Yeuh MF, Radominska-Pandya A, Saini SP, Negishi Y, Bottroff BS, Cabrera GY, Tukey RH, Evans RM: Control of steroid, heme, and carcinogen metabolism by nuclear pregnane X receptor and constitutive androstane receptor. Proc Natl Acad Sci USA 2003, 100: 4150–4155. 10.1073/pnas.0438010100
17. Wingender E, Dietze P, Karas H, Knuppel R: TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res 1996, 24: 238–241. 10.1093/nar/24.1.238
18. Gotoh O: An improved algorithm for matching biological sequences. J Mol Biol 1982, 162: 705–708. 10.1016/0022-2836(82)90398-9
19. Sankoff D: Matching sequences under deletion-insertion constraints. Proc Natl Acad Sci USA 1972, 69: 4–6. 10.1073/pnas.69.1.4
20. Jain A, Murty M, Flynn P: Data clustering: a review. ACM Comput Surv 1999, 31: 264–323. 10.1145/331499.331504
21. Rousseeuw PJ: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 1987, 20: 53–65. 10.1016/0377-0427(87)90125-7
22. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc B 1995, 57: 289–300. [http://www.jstor.org/view/00359246/di993246/99p0222p/0]
23. Matlign server [http://ekhidna.biocenter.helsinki.fi/poxo/matlign]
24. Sandelin A, Alkema W, Engstrom P, Wasserman W, Lenhard B: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 2004, 32: D91–94. 10.1093/nar/gkh012
25. Zhu J, Zhang M: SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics 1999, 15: 607–611. 10.1093/bioinformatics/15.7.607
26. Frith MC, Fu Y, Yu L, Chen JF, Hansen U, Weng Z: Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res 2004, 32: 1372–1381. 10.1093/nar/gkh299
27. Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: A sequence logo generator. Genome Research 2004, 14: 1188–1190. 10.1101/gr.849004
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.