Alignment-free clustering of transcription factor binding motifs using a genetic-k-medoids approach
- Pilib Ó Broin^{1, 2},
- Terry J Smith^{1} and
- Aaron AJ Golden^{1, 3}Email author
https://doi.org/10.1186/s12859-015-0450-2
© Ó Broin et al.; licensee BioMed Central. 2015
Received: 4 August 2014
Accepted: 2 January 2015
Published: 28 January 2015
Abstract
Background
Familial binding profiles (FBPs) represent the average binding specificity for a group of structurally related DNA-binding proteins. The construction of such profiles allows the classification of novel motifs based on similarity to known families, can help to reduce redundancy in motif databases and de novo prediction algorithms, and can provide valuable insights into the evolution of binding sites. Many current approaches to automated motif clustering rely on progressive tree-based techniques, and can suffer from so-called frozen sub-alignments, where motifs which are clustered early on in the process remain ‘locked’ in place despite the potential for better placement at a later stage. In order to avoid this scenario, we have developed a genetic-k-medoids approach which allows motifs to move freely between clusters at any point in the clustering process.
Results
We demonstrate the performance of our algorithm, GMACS, on multiple benchmark motif datasets, comparing results obtained with current leading approaches. The first dataset includes 355 position weight matrices from the TRANSFAC database and indicates that the k-mer frequency vector approach used in GMACS outperforms other motif comparison techniques. We then cluster a set of 79 motifs from the JASPAR database previously used in several motif clustering studies and demonstrate that GMACS can produce a higher number of structurally homogeneous clusters than other methods without the need for a large number of singletons. Finally, we show the robustness of our algorithm to noise on multiple synthetic datasets consisting of known motifs convolved with varying degrees of noise.
Conclusions
Our proposed algorithm is generally applicable to any DNA or protein motifs, can produce highly stable and biologically meaningful clusters, and, by avoiding the problem of frozen sub-alignments, can provide improved results when compared with existing techniques on benchmark datasets.
Keywords
Background
Transcription factors (TFs) are an important group of DNA-binding proteins whose interaction with their cognate sequence-specific binding sites results in the regulation of gene expression. These transcription factor binding sites (TFBSs) are short, degenerate sequences, usually in the order of 6-32bp in length [1], and are most commonly represented in the form of a position specific scoring matrix (PSSM), or position weight matrix (PWM). A PSSM is a 4 x ℓ matrix created from the alignment of known binding sites, where ℓ is the motif length, and each matrix entry, f _{ bi }, represents the probability of observing nucleotide b in position i of the motif [2].
While [3] described the manual creation of FBPs, there have since been numerous studies which have examined various metrics for motif comparison as well as methods for their automated clustering.
Determination of motif similarity can be broadly classified into two approaches: alignment-based methods, which implement a column-by-column scoring approach for each alignment provided by sliding motifs against one another in both forward and reverse directions, and alignment-free methods, which compare motifs directly based on shared features. Column scoring for alignment-based techniques can be based on metrics such as sum squared distance (SSD), Pearson’s correlation coefficient (PCC), and average Kullback-Leibler (AKL) distance, many of which have previously been examined in detail [9,10]. A method for alignment-free motif comparison was provided by the MoSta tool [11], which uses the asymptotic covariance of overlapping word sets between two motifs on a random sequence to determine similarity, with more similar motifs showing a higher degree of overlap. Another approach is taken by the authors in [12], who first convert each PSSM to a k-mer frequency vector (KFV), a 4^{ k }-dimensional vector comprising the frequency of each possible k-mer in a given motif, and then determine similarity using metrics such as PCC, Euclidean distance, or cosine distance. They demonstrate that the KFV approach outperforms alignment-based techniques in motif retrieval experiments designed to identify optimal motif similarity measures.
Currently, one of the most popular tools for motif clustering is the STAMP platform [10,13]. It offers a choice of column comparison metrics and performs pairwise gapped or ungapped local [14], or global [15] alignment, with progressive multiple alignment being performed using a UPGMA [16] guide tree. A known problem associated this type of agglomerative approach is that it can suffer from so-called frozen subalignments [17], where a datum seemingly well-clustered early on in the tree building process is later found to better match another cluster. STAMP therefore also provides an option for iterative refinement, although this can take much longer given that each motif from the initially constructed tree must be removed and realigned to the remaining motifs. MoSta, on the other hand provides its own non-tree-based clustering approach, which includes a threshold designed to ensure that the FBP resulting from successive merges maintains a high degree of similarity to each of its contributing members. This threshold helps to maintain structural homogeneity (defined as the proportion of cluster members which share the same structural class), an important concept in creating biologically meaningful FBPs.
Here, we are interested in an approach which would i) make use of the demonstrated success of the alignment-free KFV approach, and ii) allow motifs to move freely between clusters at any stage in the clustering process, thereby reducing the likelihood of convergence to a local rather than a global minimum, and obviating the need for post-clustering iterative refinement. Combining KFV calculation with a partitional clustering technique such as k-means [18] would provide such an approach, k-means however, is known to be highly sensitive to the effects of outliers. We therefore explore the use of the k-medoids algorithm [19] which, rather than calculating a group mean, uses the cluster member with the smallest total pairwise distance to all other cluster members as the group representative instead. This not only has the advantage of being resistant to outlier effects, but also provides the additional benefit of not having to repeatedly calculate a multiple alignment for each cluster as would a k-means approach. The k-medoids algorithm does however have two major associated problems of its own: i) like k-means, it can be sensitive to initial conditions, converging on different solutions depending on the randomly chosen starting medoids, ii) it performs a local search only, providing solutions exclusively for the value of k provided; ideally our approach should automatically determine the optimal number of clusters for any dataset provided. To address these two issues, we propose the use of a genetic algorithm (GA).
Implementation
In this section we provide details on the implementation of our genetic-k-medoids approach, GMACS. We describe the algorithm in terms of three stages. The first of these is the initialization stage where we construct a distance matrix for the motifs in our dataset based on k-mer frequencies and create an initial random population of candidate clustering solutions. The second stage details the process of assigning a ‘fitness’ score to each solution in the population through the use of one iteration of the k-medoids algorithm and the silhouette metric. The third step details the evolutionary process, the way in which the ‘fitter’ solutions are selected and recombined during each ‘generation’, or iteration of the genetic algorithm.
Initialization
and motifs with a d _{ cos } close to zero are regarded as highly similar. In contrast to an agglomerative approach, these pairwise distances remain unchanged and do not need to be re-calculated as the clustering progresses and motifs are merged.
Fitness calculation
Each candidate solution in the GA population encodes K, the number of clusters in the solution, and a vector m, of length K, which indicates the medoid for each of these clusters. During initialization, the value of K for each solution is randomly chosen in the range {2…n−1}, where n is the number of motifs in the input dataset. This range is used as the K=1 solution provides no real benefit since all motifs are clustered together regardless of similarity, while conversely, the K=n solution places each motif in its own singleton cluster and is of little use for reducing redundancy. Once the number of clusters for each solution has been established, starting medoids are also randomly chosen. When all of the candidate solutions are initialized, we proceed to calculate the fitness for each member of the population.
In this metric, a(i) is the average dissimilarity of motif i to all other motifs in its own cluster, and b(i), is the average dissimilarity of motif i to all motifs in its nearest neighbouring cluster; s(i) is therefore an indication of whether or not an individual motif is well-placed in the clustering, or if it would be clustered more appropriately elsewhere. By calculating a silhouette score each motif in the dataset, we generate an overall measure of cluster quality.
Evolutionary process
Once fitness values have been assigned to each solution, they are ranked in preparation for the evolutionary process. GMACS implements a linear ranking system incorporating a selective pressure parameter which can be used to adjust the strength of the selective bias towards fitter individuals. Linear ranking is commonly used as opposed to direct fitness values in order to avoid situations where a small number of disproportionately successful solutions leads to the premature convergence of the population. We follow an incremental or steady-state GA (SSGA) replace-worst strategy, such that, in each generation, the bottom 5% of the parent population will be replaced by newly-created offspring. This represents a more gradual progression towards fitter solutions in contrast to a more aggressive generational strategy where the entire population is replaced at each iteration and relatively fit solutions may be lost over time due to the stochastic nature of the algorithm.
We use roulette wheel, or fitness proportionate selection, to choose the two parent solutions when generating offspring for replacement. In this form of selection, each individual in the population is assigned a ‘slice’ of an imaginary roulette wheel which is proportionate to its fitness within the context of the current population. The wheel is ‘spun’, and solutions or individuals which have larger slices of the wheel will have a greater probability of being selected for recombination. Weaker solutions will still have a small probability of selection, and this is in keeping with the fundamental theory of genetic algorithms, namely that part of a less fit solution’s genotype may still be beneficial at a later stage when combined with genes from another solution.
Results
Motif comparison
Retrieval accuracy
GMACS | STAMP | MoSta | SMLR | |
---|---|---|---|---|
bZIP (93) | 0.92 | 0.94 | 0.90 | 0.92 |
C2H2 (74) | 0.82 | 0.76 | 0.76 | 0.77 |
C4 (52) | 0.98 | 0.98 | 0.98 | 0.91 |
Homeobox (50) | 0.88 | 0.82 | 0.82 | 0.85 |
Forkhead (49) | 0.92 | 0.90 | 0.92 | 0.83 |
bHLH (37) | 0.89 | 0.81 | 0.92 | 0.88 |
Total (355) | 0.90 | 0.87 | 0.88 | 0.86 |
Motif clustering
We demonstrate the clustering performance of our algorithm on 79 motifs from the JASPAR database. This dataset comprises the 71 motifs used by [3] in their initial manual creation of FBPs, plus a further eight zinc-finger proteins, four from the DOF family, and four from the GATA family. We compare our results to those reported by the authors of STAMP [10] and MoSta [11] who both use the same dataset to benchmark their approaches. While we are primarily interested in avoiding the issue of frozen subalignments faced by tree-based approaches, we include MoSta in this comparison as it allows us to demonstrate the benefit of combining the KFV metric with a genetic-k-medoids approach against not only tree-based techniques, but also another alignment-free, non-tree based technique. We report the results both in terms of number of clusters defined, and the structural homogeneity of the created clusters. The ability to create structurally homogeneous clusters is an important aspect in the generation of FBPs, and as we show in the next section, GMACS performs very well in this regard, successfully identifying even distinct subtypes within several structural families. For STAMP, PCC was used with ungapped local alignment and a UPGMA guide tree (default settings). As previously described, the authors of MoSta provide their own clustering approach whereby they select and merge motifs based on their word covariance similarity metric. For this set of experiments GMACS was configured with a population of 100 which was evolved over 300 generations. The default mutation rate of 0.05 and information content threshold of 0.3 were used – this trim threshold was consistent with the same parameter setting in the STAMP algorithm.
Cluster summary
GMACS | STAMP | MoSta | |
---|---|---|---|
Homogeneous | 13 | 9 | 11 |
Heterogeneous | 5 | 7 | 3 |
Singletons | 0 | 2 | 12 |
Total | 18 | 18 | 26 |
All three of the algorithms separate the four bZIP CREB subgroup motifs into a homogeneous group as well as clustering seven of the eight nuclear receptor motifs together. The remaining androgen receptor motif is classed as one of two singletons by STAMP, while GMACS clusters this motif with the two homeobox motifs, EN-1 and PAX4. It is possible that the length and complexity of the AR and PAX4 motifs play a role in this particular grouping, affecting the number of shared k-mers between the two.
Our final set of motifs includes members from the homeobox, HMG and forkhead families. Firstly, all three approaches cluster five of the seven homeobox motifs into one homogeneous cluster. STAMP clusters PBX with the four HMG motifs, SOX17, SOX19, SOX5 and SRY, and creates a single combined HMG/homeobox/forkhead cluster comprising both these motifs and six motifs from the forkhead family. STAMP also creates a HMG/forkhead group containing HMG-1 and FOXC1, and a HMG/homeobox group containing HMG-IY and PAX4. MoSta clusters the four HMG motifs with the six forkhead motifs as in the case of STAMP, but does not include the PBX homeobox motif. It also creates a HMG/homeobox group but in this case containing HMG-1 and EN-1. GMACS, like STAMP, clusters the PBX homeobox motif with the four HMG motifs, but as a separate cluster from the six forkhead motifs, which are instead clustered with another set of HMG motifs: HMG-IY and HMG-1.
FBP construction and stability
Robustness to noise
Discussion
GMACS was primarily developed in order to circumvent the problem of frozen subalignments associated with tree-based FBP techniques and the results provided in the motif comparison section indicate that a genetic-k-medoids approach may be useful not only when compare to tree-based techniques but also to other alignment-free non-tree-based techniques as well. We must however recognize some weaknesses arising from our approach. A common concern with genetic algorithms is that they can be computationally costly, with the most compute-intensive task usually being the evaluation of the fitness function. The k-medoids algorithm is also computationally intensive given that the swap stage typically progresses until no further exchanges can be made to decrease the overall cost of the cluster configuration. While the complexity of the standard algorithm is O(k(n−k)^{2}) (where k is the number of medoids and n is the number of objects to be clustered), we have shown that a single round of local search using the k-medoids as part of the fitness evaluation function is enough to greatly reduce the number of generations necessary for the GA to converge on good solutions. The silhouette component of the fitness function however requires an all-to all comparison adding a O(n ^{2}) term to GMACS’ overall complexity. This indicates that while increases in other parameters such as number of generations and population size will invariably increase the runtime, the rate-limiting step will inevitably be the number of motifs in the input dataset. For the relatively small test dataset of 79 JASPAR motifs however, setting a population size of 100 and evolving for 300 generations, the average time to completion on a 2.0 GHz Intel Core i7 2630QM processor calculated over 100 runs was 3.99 seconds. This is in comparison to 6.87 seconds for STAMP and 111.19 seconds for MoSta (Additional file 2: Figure S2). Genetic algorithms are however amenable to parallelization, and future work on decreasing runtime will focus on this aspect of development.
While GMACS is designed to seek a global rather than a local minimum, like all GAs, it is a stochastic algorithm and there are therefore no guarantees that it will in fact achieve this goal. In order to test its convergence properties, we ran the test dataset of 79 JASPAR motifs 10,000 times to ascertain the number of times which the algorithm would converge on any particular solution. The 18-cluster solution (K=18) which we have described above accounts for ∼90.5% of the solutions returned and has a fitness value of 0.469. The second most common solution, accounting for a further 5% of the returned cluster configurations, is a 19-cluster solution with a fitness value of 0.465. In this solution, the FOXC1 motif is classified as a singleton, resulting in the NFIL3 and HLF cEBP motifs becoming a homogeneous cluster. The third most common solution occurs in ∼1.2% of the runs and is another 19-cluster solution, also involving a FOXC1 singleton. This time however, the forkhead group becomes homogeneous and the two HMG motifs from the previous HMG / forkhead group are re-clustered elsewhere. The fitness for this third solution is in fact slightly higher (0.471) than that of the K=18 solution, raising several important points. Firstly, it illustrates the fact that the GA will quickly converge on good but not necessarily optimal solutions. This result also points to the difficulty, not only for GAs but also for most algorithms operating in complex problem domains, of appropriate parameter selection. The trim threshold, mutation rate, population size, and number of generations, for example, will all play a role in the type of solutions returned. Finally, the higher fitness of the K=19 solution provides an indication that a modified or alternative fitness function could help to move the GA towards these less commonly explored regions of the solution space. There are many cluster quality metrics which might be used as part of the fitness function as well as many possible implementations of the genetic operators – future work will focus on exploring these possible combinations in an attempt to further optimize the clustering process.
Our results also confirm the K-mer Frequency Vector as a suitable metric for general motif similarity. However, while this approach achieves the best average retrieval accuracy overall, it seems clear that each of the metrics examined are more sensitive to certain structural classes than others. An ensemble approach or weighted combination of metrics may therefore provide a more optimal way to ascertain a motif’s structural class.
Conclusion
While being cognizant of the issues raised in the previous section, we posit that our algorithm is a useful complementary technique to current standard approaches. The most common clustering solution provided by GMACS for the benchmark clustering dataset is both consistent and biologically meaningful, comprising a larger number of structurally homogeneous clusters than either STAMP or MoSta, without requiring a large number of singleton clusters to achieve this. As well as being applicable to other motif clustering problems, our algorithm is easily reconfigurable for other classes of general clustering problems, making it particularly attractive to researchers wishing to test the robustness of results returned by conventional tree-based approaches.
Availability and requirements
Project name: GMACS Project home page: http://goldenlab.einstein.yu.edu/projects/gmacs Operating system(s): Platform independent Programming language: COther requirements: noneLicense: GMACS is freely available for download and use under the GNU GPL
Declarations
Acknowledgements
The authors acknowledge the support of the Science Foundation Ireland (Grant Number 05/RFP/CMS0001) and wish to thank Dr. Shaun Mahony and Dr. Utz Pape for making their source code and benchmark datasets available.
Authors’ Affiliations
References
- Fogel GB, Weekes DG, Varga G, Dow ER, Craven AM, Harlow HB, et al. (2005) A statistical analysis of the TRANSFAC database. Biosystems. 2005; 81:137–54.View ArticlePubMedGoogle Scholar
- Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000; 16:16–23.View ArticlePubMedGoogle Scholar
- Sandelin A, Wasserman WW. Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics. J Mol Biol. 2004; 338:207–15.View ArticlePubMedGoogle Scholar
- Sandelin A, Alkema W, Engström P, Wasserman WW, Lenhard B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 2004; 32(Database issue):D91–4.View ArticlePubMedPubMed CentralGoogle Scholar
- Mahony S, Golden A, Smith TJ, Benos PV. Improved detection of DNA motifs using a self-organized clustering of familial binding profiles. Bioinformatics. 2005; 21(Suppl 1):i283–91.View ArticlePubMedGoogle Scholar
- Xing EP, Karp RM. MotifPrototyper: A Bayesian profile model for motif families. Proc Natl Acad Sci USA. 2004; 101:10523–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Narlikar L, Hartemink AJ. Sequence features of DNA binding sites reveal structural class of associated transcription factor. Bioinformatics. 2006; 22:157–63.View ArticlePubMedGoogle Scholar
- Kielbasa SM, Gonze D, Herzel H. Measuring similarities between transcription factor binding sites. BMC Bioinformatics. 2005; 6:237.View ArticlePubMedPubMed CentralGoogle Scholar
- Schones DE, Sumazin P, Zhang MQ. Similarity of position frequency matrices for transcription factor binding sites. Bioinformatics. 2004; 21:307–13.View ArticlePubMedGoogle Scholar
- Mahony S, Auron PE, Benos PV. DNA familial binding profiles made easy: comparison of various motif alignment and clustering strategies. PLoS Comput Biol. 2007; 3:e61.View ArticlePubMedPubMed CentralGoogle Scholar
- Pape UJ, Rahmann S, Vingron M. Natural similarity measures between position frequency matrices with an application to clustering. Bioinformatics. 2008; 24:350–357.View ArticlePubMedGoogle Scholar
- Xu M, Su Z. A novel alignment-free method for comparing transcription factor binding site motifs. PLoS ONE. 2010; 5:e8797.View ArticlePubMedPubMed CentralGoogle Scholar
- Mahony S, Benos PV. STAMP: a web tool for exploring DNA-binding motif similarities. Nucleic Acids Res. 2007; 35(Web Server Issue):W253–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981; 147:195–7.View ArticlePubMedGoogle Scholar
- Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970; 48(3):443–53.View ArticlePubMedGoogle Scholar
- Sokal RR, Michener CD. A statistical method for evaluating systematic relationships. Univ Kans Sci Bull. 1958; 28:1409–38.Google Scholar
- Barton GJ, Sternberg MJ. A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J Mol Biol. 1987; 198:327–37.View ArticlePubMedGoogle Scholar
- Lloyd SP. Least squares quantization in PCM. IEEE T Inform Theory. 1982; 28:129–37.View ArticleGoogle Scholar
- Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis. New York: Wiley; 1990.View ArticleGoogle Scholar
- Fraser AS. Simulation of genetic systems by automatic digital computers I. Introduction. Aust J Biol Sci. 1957; 10:484–91.View ArticleGoogle Scholar
- Holland JH. Adaptation in natural and artificial Systems. Ann Arbor: University of Michigan Press; 1975.Google Scholar
- Goldberg DE. Genetic algorithms in search, optimisation and machine learning. New York: Addison-Wesley; 1989.Google Scholar
- Notredame C, Higgins DG. SAGA: sequence alignment by genetic algorithm. Nucleic Acis Res. 1996; 24:1515–24.View ArticleGoogle Scholar
- Notredame C, O’Brien EA, Higgins DG. RAGA: RNA sequence alignment by genetic algorithm. Nucleic Acids Res. 1997; 25:4570–80.View ArticlePubMedPubMed CentralGoogle Scholar
- Wei Z, Jensen ST. GAME: detecting cis-regulatory elements using a genetic algorithm. Bioinformatics. 2006; 22:1577–84.View ArticlePubMedGoogle Scholar
- Liu FFM, Tsai JJP, Chen RM, Chen SN, Shih SH. FMGA: finding motifs by genetic algorithm. In: Fourth IEEE symposium on Bioinformatics and Bioengineering (BIBE2004). IEEE2004. p. 459–66.Google Scholar
- Bohlin J, Skjerve E, Ussery DW. Investigations of oligonucleotide usage variance within and between prokaryotes. PloS Comput Biol. 2008; 4(4):e1000057.View ArticlePubMedPubMed CentralGoogle Scholar
- Tibshirani R, Walther G, Hastie T. Estimating the number of data clusters via the Gap statistic. J Roy Stat Soc B. 2001; 63:411–23.View ArticleGoogle Scholar
- Calinski T, Harabasz J. A dendrite method for cluster analysis. Commun Stat. 1974; 3:1–27.View ArticleGoogle Scholar
- Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Comput Appl Math. 1987; 20:53–65.View ArticleGoogle Scholar
- Matys V, Fricke E, Geffers R, Gößling E, Haubrock M, Hehl R, et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 2003; 31:374–8.View ArticlePubMedPubMed CentralGoogle Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.