 Research
 Open Access
 Published:
SOMEA: selforganizing map based extraction algorithm for DNA motif identification with heterogeneous model
BMC Bioinformatics volume 12, Article number: S16 (2011)
Abstract
Background
Discrimination of transcription factor binding sites (TFBS) from background sequences plays a key role in computational motif discovery. Current clustering based algorithms employ homogeneous model for problem solving, which assumes that motifs and background signals can be equivalently characterized. This assumption has some limitations because both sequence signals have distinct properties.
Results
This paper aims to develop a SelfOrganizing Map (SOM) based clustering algorithm for extracting binding sites in DNA sequences. Our framework is based on a novel intranode soft competitive procedure to achieve maximum discrimination of motifs from background signals in datasets. The intranode competition is based on an adaptive weighting technique on two different signal models to better represent these two classes of signals. Using several real and artificial datasets, we compared our proposed method with several motif discovery tools. Compared to SOMBRERO, a stateoftheart SOM based motif discovery tool, it is found that our algorithm can achieve significant improvements in the average precision rates (i.e., about 27%) on the real datasets without compromising its sensitivity. Our method also performed favourably comparing against other motif discovery tools.
Conclusions
Motif discovery with model based clustering framework should consider the use of heterogeneous model to represent the two classes of signals in DNA sequences. Such heterogeneous model can achieve better signal discrimination compared to the homogeneous model.
Background
Identification of transcription factor binding sites (TFBS) is fundamental of understanding gene regulations. Binding sites or motif instances are typically 10 ~ 15bp in length and degenerated in some positions. They are often buried in a large amount of nonfunctional background sequences, which causes low signaltonoise ratio. Hence, using computational approaches to discriminate motif signals from background signals has not always brought satisfactory results. Development of advanced tools is necessary for more accurate motif predictions.
An essence of computational approaches for motif discovery is to search for motifs that are overrepresented in the input sequences compared to the background sequences. Motif overrepresentation can be explained by the existence of segments that have been evolutionarily preserved due to their functional significance to gene regulation. Hence, appearances of motif instances are rather similar to each other despite having variability in some of their positions [1]. Two issues that are closely related to motif discovery problem are: (i) how to construct a model to represent the motifs and, (ii) how to define a suitable search strategy to find putative motifs from the solution space. Positionspecificscoringmatrix (PSSM) [2] and its variations are the most widely used motif model. This model defines the maximumlikelihood estimation on the probability of nucleotide occurrences in every position of a motif. The motif search strategies can be local or global. Local search algorithms begin with an initial guess of a motif model and iteratively refine this model in the search space to maximize a certain criterion. Two examples of such algorithm are MEME [3] (expectationmaximization) and ALIGNACE (gippssampling) [4]). The local search approaches find out one motif at a time. Global search algorithms such as clustering based algorithms (e.g., SOMBRERO [5] and MISCLUSTER [6]) and genetic algorithms based algorithms (e.g., GAME [7] and iGAPK [8]) perform simultaneous searches for multiple candidate motifs by exploring the whole solution space.
In this paper, we aim to develop a SOM [9] based Extraction Algorithm (SOMEA) to discover overrepresented motifs in DNA datasets. We seek to use SOM to project kmers (i.e. a subsequence with length k of DNA sequences) onto a 2dimensional (2D) lattice of nodes. Through this projection, input patterns (i.e., kmers) with closely related features are projected onto the same or adjacent nodes on the map. Hence, the complex similarity relationships of the highdimensional input sequence space become apparent on the map. Analysis of selected nodes, therefore, can reveal potential patterns (i.e., motifs) in the dataset.
Previous studies have applied a standard (e.g. [5, 10, 11]) and hierarchical (e.g. [12]) SOM to discover motifs in protein or DNA sequences. Those studies have made a common assumption that the motif and the background signals can be analogously modeled by using a homogeneous node model. This assumption is weak because the two classes of signals have some distinct statistical properties [13]. Hence, homogeneous model of these two signal classes may cause unfaithful map representation and produce clusters with many false positives. The traditional homogeneous modeling of two signal classes implies that, both signals are clusterable under a single type of model. However, mutational events are more rapid in background regions compared to binding regions, causing most of the nucleotide bases in background regions to be random. Thus, they have relatively lower clusterability compared against binding site regions [14]. Therefore, nodes’ vectorial or string [9] based prototypes given by SOM in traditional tools, can represent motifs reasonably well, but do not well suit for background sequences since two different classes of signals are tried to be expressed through a homogeneous modeling. Hence, an alternative modeling approach, preferably a heterogeneous modeling approach, that takes these two signal properties in consideration is necessary.
In the development of SOMEA, we have proposed a hybrid node model to address some of the limitations of current SOM approaches. This hybrid node model is constituted by PSSM [2] and markov chain (MC) [15] model. These two model components perform softcompetition through an adaptive weighting scheme within a node to represent the mixture of signals in it. We hypothesized that, the fitness of each model’s components (i.e., PSSM and MC) with respect to the sequences in a node, is a fuzzy indication of its signal class composition. Heuristic learning rules are proposed in this paper to adjust the model parameters during learning stage. We have evaluated our proposed SOMEA algorithm against several motif discovery tools using real and artificial DNA datasets. Results have shown that, our approach performs significantly better than a stateoftheart clustering algorithm for motif discovery, named SOMBRERO [5].
Results and discussion
We now present an experimental evaluation of our SOMEA approach. We have used eight real datasets to compare the performances of our approach against SOMBRERO, MEME, ALIGNACE and WEEDER [16] in terms of sensitivity and specificity. Then, to evaluate SOMEA’s ability in multiple motif discovery, we used five artificial datasets.
For performance quantification, we employed three measures i.e., precision(P), recall(R) and Fmeasure(F) [17]. They can be computed as: P = TP/(TP + FP), R = TP/Y, F = 2/(1/P + 1/R), where TP, FP, and Y are the numbers of true positives, false positives, and true binding sites in the dataset, respectively. We have considered a predicted site as a true positive if it is overlapped with a true binding site location by at least x nucleotides, where x is selected according to the length of the true motif consensus.
Performance on real datasets
The eight test datasets used in this experiments are composed of seven datasets used in [7] and a dataset collected from the Promoter Database of S. cerevisiae (SCPD) [18]. Each sequence contains at least one true binding site. These datasets consist of motifs from Escherichia coli(CRP), homo sapiens(ERE, MEF2, SRF, CREB, E2F, MYOD) and S. cerevisiae(GCN4).
SOMEA was run with map sizes that were arbitrarily selected between 10 × 10 to 20 × 20 depending on the size of the dataset. In each case, SOMEA was trained for 100 epochs with a motif length value in [l – 3, l + 3], where l is the known motif consensus length. The top 10 highest ranked motifs according to their MAP score [19] were saved for evaluation purpose. A 3rd order markov chain model [15] was used to compute MAP score. The learning rate parameter was fixed at 0.005 in all the experiments. Whereas, the neighborhood function parameter value, σ was set at 3.0.
For WEEDER, we used the online tool [20] with the following options: sites might appear more than once, both strands, and normal or complete scan. The interesting motifs and their instances that scored at least 90 were used in the evaluation. SOMBRERO was run with the default map sizes and random initialization method. The standalone tool was downloaded from [21]. We evaluated all the “bestmotifs” returned by the tools. MEME was run with the “any number” model option and minimum and maximum length value as discussed above. AlignACE was run online [22] with default arguments in most cases.
Table 1 shows recall (R), precision(P) and Fmeasure (F) rates for a ten run average for each program on the eight real DNA datasets. Comparison shows that, in terms of recall rates, SOMEA performs better than or equally to other tools in four(4) of the eight(8) datasets. Compared to SOMBRERO, SOMEA performs better in terms of recall rates in six(6) of the datasets. Also, SOMEA has higher precision rates in six(6) of the datasets and has better Fmeasure values in seven of the test datasets(except ERE). Notably, for the MEF2 dataset, SOMEA obtained a much higher precision rate (0.99 vs 0.22) in comparison with SOMBRERO. The performances on all datasets show that, SOMEA achieves significant improvements in the average precision rate (26.9%) and recall rate (13.8%) in comparison with SOMBRERO. This clearly shows that, SOMEA with heterogeneous node model can represent the kmers distribution in DNA sequences better than the algorithms with homogeneous model.
It can be noticed that, SOMEA performance is comparable or better than ALIGNACE, MEME and WEEDER. For example, in terms of Fmeasure rates, SOMEA produces the best results for five of the eight datasets due to its higher precision rates (note that, both SOMEA and ALIGNACE achieve the same Fmeasure value for the CRP dataset). SOMEA’s average Fmeasure value for all datasets (i.e. 0.72) is found better than MEME (0.65), ALIGNACE(0.69) and SOMBRERO(0.55) and equally good as WEEDER(0.72).
It should be noted that, the comparison results between programs cannot be completely fair as every program has its own strengths and weaknesses. For example, some programs might perform rather well for strong motifs; whereas some are designed to discover motifs with certain characteristics (e.g. gapped motifs). The nature of the datasets can be an influential factor to the success of each program. Therefore, the results reported here should serve only as reference.
Performance on artificial datasets with multiple planted motifs
Practically, we can often find multiple motifs in upstream region of a set of coregulated genes. These motifs often work as cisregulatory module to regulate gene expressions. Motif discovery programs should be able to return all of these potential motifs. Local search algorithms, such as MEME, perform a search for single motif at one time; whereas SOMEA and SOMBRERO search for all motifs simultaneously. It is interesting to compare these two strategies.
We have prepared five artificial DNA datasets generated from Annotated regulatory Binding Sites (ABS, v1.0) database [23]. Every DNA dataset has twenty(20) sequences (each with 500bp in length) with three planted real motifs. We run MEME, WEEDER, SOMEA and SOMBRERO five times on each dataset. We asked SOMEA and MEME to return the top 20 motifs for the evaluation purposes. Again, we evaluate all best motifs returned by WEEDER and SOMBRERO.
Table 2 shows the results of comparison between the four algorithms. Overall, SOMEA has the best recall rates in seven(7) of fifteen(15) of the motifs. However, such higher recall rates come at the price of having lower precision rates compared to MEME and WEEDER. Compared to SOMBRERO, SOMEA performs significantly better in most of the datasets in all performance measures. For example, in terms of recall rates, SOMEA is higher in ten(10) of the motifs; whereas, in terms of Fmeasure values, SOMEA has better results in twelve(12) of the motifs. Hence, it can be observed that, our SOMEA has better signal discrimination ability than SOMBRERO. MEME performs better than SOMEA in terms of average Fmeasure values in four of five of the datasets. Nonetheless, SOMEA has higher average recall rates in all of the datasets. WEEDER performs poorly in most of the test datasets most likely due to the inability of its scoring function to rank the true motifs highly when planted in the artificial sequences (see WEEDER manual [20]). In summary, both global and local search techniques perform equally well and each strategy has its own strengths and weaknesses. Coupling them could be a feasible approach to enhance motif discovery result.
It should be noted, there are some biases in the comparisons for two reasons. Firstly, both SOMEA and SOMBRERO are rather sensitive to the motif length parameter. As the motif consensuses in a dataset have different lengths, a single run with a fixed length value might not be suited for all motifs. On the contrary, MEME is able to find a length value that suits better each motif. Consequently, in some of SOMEA/SOMBRERO runs, some motifs might appear to be performed better in the experiments. Secondly, the lower precision rates for SOMEA and SOMBRERO could be explained by the fact that the optimal map sizes are not known. Improper map sizes can, to some extend, affect the results for multiple motif datasets.
Robustness analysis
We have conducted some analysis on the robustness of SOMEA with respect to different map sizes. We have computed recall, precision and Fmeasure analysis on SOMEA using the eight real datasets for map sizes 10 × 10, 15 × 15, and 20 × 20. Each dataset is run for five times and their average recall, precision, and Fmeasure is computed.
Table 3 shows the Fmeasure of eight datasets with different map sizes. It can be seen that, different map sizes affect the performance on the datasets. From the comparisons, it can be noted that the performance of SOMEA and SOMBRERO shows a similar trend. Their Fmeasure rates reach maximum for most datasets when the map size 15 × 15 is used. The map size 10 × 10 is too small to represent the kmers distribution in the original space for all the datasets. For a smaller map size, naturally the average number of kmers in each cluster increases, hence, the precision rates become lower. In contrast, for a larger map size (i.e., 20 × 20) the precision rates naturally become higher. However, the recall rates can be lower as the true binding site kmers may suffer from sparse distribution among several nodes in the map.
The computational time of SOMEA is mainly imposed by three operations: a) finding winner node for each kmer; b) updating winner and its neighboring nodes models; and c) updating node model at the end of an epoch. The time complexity of SOMEA with map size R × C is O((L × R × C) + (P × L) + (R × C)), where L is the total length of DNA sequences and P is the size of neighborhood during kmers assignment. Here, the O(L × R × C) term is due to the computation of finding winning node for every kmer; (P × L) operations are needed for the computation of updating the temporary model variables during the kmers assignment stage; and (R × C) operations for updating the node models at the end of an epoch. Selforganizing map based algorithm is known to suffer from heavy computational time due to the global search to simultaneously discover all clusters. We have recorded the execution time of SOMEA for the eight real datasets and found that it has the highest average computational time of 1364s as compared to WEEDER (825s), SOMBRERO (326s), MEME (126.7s) and ALIGNACE (101s). The slower computational time of SOMEA compared to SOMBRERO is due to the fact that we have to update the ∆M_{ pssm } and ∆M_{ mc } parameters (see Methods) for the winner and its neighborhood during kmers assignment (i.e. see Eqs (9) and (10)). In SOMBRERO, the update of node models only occurs at the end of an epoch. Also, some heuristic optimizations are included in it to reduce computational time. It can be observed that, current version of SOMEA requires slightly larger computational time, however, its better sensitivity and specificity performances can offer a good tradeoff.
Conclusions
Motif discovery in DNA datasets is a challenging problem domain because of our lack of understanding of the nature of the data, and the mechanisms to which proteins recognize and interact with its binding sites are still perplexing to biologist. Hence, predicting binding sites by using computational algorithms is still far from satisfaction.
In this paper, we have proposed a SOM based Extraction Algorithm (SOMEA) for simultaneous identification of multiplemotifs in DNA dataset. We have made two main contributions in this work. Firstly, it is shown that, the use of node model that considers the distinct properties of the motif and background signals is helpful in mining DNA motifs. We have proposed a hybrid model that is composed of PSSM and MC model to better represent these two classes of signals. Secondly, it has been highlighted that, clustering based DNA motif mining requires some customizations in the clustering system design, as standard clustering frameworks may not be sufficient. In addition to these, we have proposed heuristic learning rules to update the node model’s parameters during learning.
Many computational motif discovery algorithms have been proposed in the past decade. Like most of these algorithms, SOMEA shares some common challenges that require further investigation. The first is the scalability of the system for large scale dataset such as ChIP sequences. The scalability is the ability of a tool to maintain its prediction performances and efficiency while the size of the datasets increases. To the best of our knowledge, most motif discovery algorithms are not designed to handle large scale datasets. In a recent study [24], using ChIP datasets as benchmark, it is shown that local search techniques such as MEME does not scale well with the increase in dataset sizes. This finding is consistent to an early study by [25]. Currently, SOMEA is not proposed to handle large scale dataset either. However, it can potentially be used to reduce the sequence search space by precluster sequences into lowerdimensional topological space. Then we can perform the motif searches in this lowerdimensional space instead of the original sequence space. It would be interesting to further investigate the feasibility of this search space reduction strategy to enable system scalability.
The second critical issue is the system’s robustness, which relates to the ability of the pattern recognition system to maintain its performance with the changes of parameters and noise in the inputs [26]. Currently, the critical parameters for SOMEA are the map size and the motif length. From our experiences with SOMEA, we found that setting improper map sizes have caused poorer performance. If the map sizes are two small, the precision(recall) rates might be poor(better); whereas if the map sizes are two large, opposite results are expected. Choosing a proper motif length value is important to reveal the true motif patterns. Setting improper length values caused motif discovery algorithms to return only partial motif consensus patterns. We can overcome this shortcoming by running the system with different length values. Through some analysis on the produced results from different runs, we will be able to reveal the true motif consensuses.
In conclusion, clustering biological sequences for motif discovery should consider the use of heterogeneous model to efficiently represent both motif and background signals. We have shown that, our proposed SOMEA using a heterogeneous model, can perform better in terms of sensitivity and specificity than the tools that use homogeneous model.
System and methods
Overview
The main idea of our SOMEA algorithm is to use a hybrid node model, where a model is heterogeneously composed of PSSM and MC. We assume each node on the map is a fuzzy composition between a motif signal and background noise. Since we do not have prior knowledge on the type of each node, we use a soft competitive weighting scheme for the two components (i.e., PSSM and MC) of each node model. We refer it as intranode competition. Our framework design is inspired by the fact that, the two sequence classes (i.e. motif and background noise) in the DNA dataset have distinctive properties. Subsequently, it is necessary to represent them using appropriate signal’s models.
SOMEA starts with converting the input DNA sequences (both strands) into a set of kmers using k length window shifting through the sequence. Then, the size of the map is defined (user input) and nodes’ model parameters are randomly initialized. Then, the following two learning steps are repeated for each input kmer in the dataset:

1.
Internodes competition: to find the best matching unit (BMU) of current input kmer K_{ j }.

2.
Models updating: update model parameters of the BMU including its topological neighborhood.
The two steps above define a recursive regression process [9], where the optimal models parameters are estimated by iteratively applying the kmers to the system. After some training epochs, similar kmers from supposing motif or background class are projected onto the same or adjacent nodes on the 2D grid map. The kmers projected in the vicinity on the map, generally forming clusters. This implies the similarity of their respective features. Once the nodes’ models have been stabilized, we can identify candidate motifs using a motif model evaluation metric.
Basic concepts and problem formulation
We first give some notations used in this paper, and then describe the SOMEA algorithm. Denoted by D = {S_{1}, S_{2},…, S_{ N }}, a DNA dataset with N sequences. Let a kmer K_{ i } = (b_{1}b_{2}…b_{ k }) be a continuous subsequence of length k in a sequence, and i = 1,…, Z, with Z is the total number of kmers in that sequence. For a sequence with length L, there are L – k + 1 number of kmers can be produced using k length window shifting process.
We can represent a kmer K as a 4 × k matrix [27]. Let the matrix representation be e(K) = [a_{ ij }]_{4×k}, where (a_{1}_{ j }, a_{2}_{ j }, a_{3}_{ j }, a_{4}_{ j }) = (A, C, G, T) and j = 1,…, k. The matrix has a column j representing certain nucleotide i at that position j in the kmer.
A 2D SOM map is a lattice of R × C nodes, where R, C is the number of rows and columns respectively. Each node V_{ ij }, i = 1,…, R and j = 1,…, C, has a subset of kmers assigned to it. For convenience, we use the notation V_{ l } to represent a node, where 1 ≤ l ≤ (R × C). The coordinate of a node V_{ l } in the lattice is expressed as z_{ l } = (i, j). Then, each node V_{ l } has a parameterized model Θ_{ l } associated with it.
Let us formulate the clustering based motif discovery task. Clustering on the kmers dataset aims to partition the dataset into a set of nonoverlapping clusters {C_{1}, C_{2},…, C_{ U }}, where each cluster C_{ i } holds a subset of kmers. In our study, each node in the SOM 2Dlattice represents a cluster (i.e. U = R × C). After forming the clusters, each cluster C_{ i }’s potential is evaluated as true motif using motif model evaluation metric and rank the clusters based on their obtained scores. In SOMEA, we used Maximum A Posteriori score (MAP score) as the model evaluation metric. Then, top H highest ranked clusters are selected as putative motifs, and kmers from those clusters indicate the motif locations in the sequences.
PSSM based motif model M_{ pssm }
We use the PositionSpecificScoringMatrix (PSSM) [2] to model the motif signals. The PSSM based motif model, let it denoted by M_{ pssm }, is a matrix, i.e., M_{ pssm } = [f(b_{ i }, i)]_{4×k}, where b_{ i } ∈ {A, C, G, T} and i = 1,…, k. Here, each entry f(b_{ i }, i) represents the probability of nucleotide b_{ i } in position i and . In our SOMEA, the M_{ pssm } for a node V_{ l } can be calculated from the kmers in a node using the maximum likelihood principle, with a pseudocount value added as under sample correction to the probabilistic model. We follow the Bayesian estimation method for this purpose [28]. The PSSM entries are computed as follows:
f(b_{ i }, i) = (c(b_{ i }, i) + g(b_{ i })) /(N + 1), (1)
where N is the number of kmers, c(b_{ i }, i) is the frequency of nucleotide b_{ i } at position i of a set of kmers in a node, g(b_{ i }) = [n(b_{ i }) + 0.25]/(N × k + 1) and .
Markov chain based background model M_{ mc }
In our approach, the background signal is modeled by using the markov chain (MC) model [15]. The MC is a commonly used background signal model to distinguish overrepresented motifs from background signals (e.g. in [16, 19]). The stochastic and temporal nature of this model can effectively model the complex relationship of the background bases. The MC model assumes that, the probability of occurrence of a nucleotide b_{ i } at position i in a DNA sequence is dependent only on the occurrences of m previous nucleotides. This relationship can be expressed by the conditional probability p(b_{ i }b_{ i }_{–}_{ m }…b_{ i }_{–1}), where b_{ i }_{–}_{ m }…b_{ i }_{–1} are bases that precede base b_{ i }, and m is the markov order. In our approach, the first order MC (i.e. m = 1) is used because higher order model usually requires more input data to avoid overfitting. The maximum likelihood estimation of the conditional probability p(b_{ i }b_{ i }_{–}_{ m }…b_{ i }_{–1}) is given by [15] as:
where c′(x) is the number of times subsequence x found in a set of kmers in a node.
Let us denote π (a, a′) to represent the conditional probability p(a′a) of the first order MC, where a, a′ ∈ {A, C, G, T}. Then the MC transition matrix gives the background model M_{ mc } to be used in SOMEA, i.e., M_{ mc } = [π(a, a′)]_{4×4} , where .
Similarity score
A similarity metric is needed for kmers assignment to the nodes during the learning. The score of a kmer K_{ j } = (b_{1}b_{2} …b_{ k }) in respect with the PSSM based model assigned to node V_{ l }, can be computed as,
Here, k is the length of kmer, and f(b_{ i }, i) represents the probability of nucleotide b_{ i } in position i. Then, the score of a kmer K_{ j } to the MC [15] based model M_{ mc } of node V_{ l } is computed as:
Here, p(b_{1}) is the independent and identically distribution (i.i.d) probability of nucleotide b_{1} in current node, which is estimated from the kmers of node V_{ l }.
Hybrid model
In practice, we are unable to certainly deduce if a SOMEA’s node is a motif or background at any stage of the learning process. Also, before the system converged, the members of a node are likely to be composed of mixed signals. Therefore, neither PSSM or MC based models (i.e.M_{ pssm } and M_{ mc }) alone would satisfactorily model such composition. However, we can weigh the fitness of MC and PSSM models with respect to the kmers in a node. In other words, when a set of kmers fit with a certain model, (i.e., either motif model given by M_{ pssm } or background model given by M_{ mc }), it is more likely that those kmers represent that class. Note that both signal models, can represent signal features from opposite class to some extent.
In this work, we aimed to combine the expression abilities of both of the models (i.e., i.e.M_{ pssm } and M_{ mc }) in an unified mechanism to improve the distinguishing ability of the system, since each node given by SOMEA (or any clustering based approach) contains a fuzzy mixture of motif signals and background signals.
In implementation, we adopted a simple linear weighting scheme to combine these two models for a node V_{ l } as follows:
Equation (5) gives a linear combination of the two models to produce a heterogeneous model for a node V_{ l }. Here, λ is a scaling factor, for simplicity default value of λ is set as 0.5. If a kmer K_{ j } gets a higher score by this heterogeneous model based scoring Θ_{ l }(K_{ j }), that indicates K_{ j } has a better fit to the combined model of node V_{ l }.
Motif ranking
Once the SOMEA is stabilized after training, we have to perform an evaluation on the nodes in order to identify the most prominent candidate motifs. The candidate motifs can be identified using either motif evaluation metric or statistical significance value. These metrics usually require the use of background sequences model for computation.
In this work, we adopt the Maximum A Posteriori score (MAP score) [19] for motif ranking. The MAP score measures the conservation property of a motif with respect to the species background sequences [19]. Since, rare motifs in the background can achieve a higher MAP score, this measure can be used to distinguish a true motif from false ones based on their scores ranking. The background sequences can be modeled by using the markov chain model generated from the intergenic sequences of a species under study. This model can be used to assign a probability of a K, namely , under the background model given by . The MAP score of a node V_{ l } can be calculated as follows:
where N_{ l } is the number of kmers in node V_{ l } and refer to background probability of a kmer K in respect with background model .can be written as,
Here, m is the Markov chain order, k is the length of kmers, p(b_{1}, b_{2},…, b_{m}) is the probability of subsequence b_{1},b_{2}, …, b_{ m }, and p(b_{ i }b_{ i }_{–}_{ m }, b_{ i }_{–}_{ m }_{+1},…, b_{ i }_{–1}) is the conditional probability of the subsequence b_{ i } under b_{ i–m }, b_{ i–m }_{+1,…,}b_{ i }_{–1} occurrence constraints. For instance, using the 3rd order model, the probability of the sequence ATGCG can be calculated as: . This background probability is usually precomputed on the sequences of interest. In Eq (6), E(V_{ l }) is the Shannon’s entropy, that can be written as,
Here, f(b_{i}, i) is the probability of nucleotide base b_{ i } ∈ {A, C, G, T} to occur in ith position of the PSSM.
Algorithm
In this Section, we describe our SOMEA learning algorithm, which includes the similarity metric used for kmer assignments, model parameters adaptation, and the finding of BMU for a kmer. According to [29], any arbitrary set of items, for which a similarity or distance measure between its elements is definable, can be mapped onto the SOM grid in an orderly fashion. Hence, the standard SOM learning algorithm is applicable for our purposes with some modifications.
Adaptation process
We opted for the more speed efficient batch training scheme to update the nodes’ model parameters. This method delays the update of the model parameters at the end of an epoch. Heuristic rules are proposed to update each node’s PSSM and MC model parameters. We associate each node with three computing components including: two matrices ∆M_{ pssm }, ∆M_{ mc } and a counter r. Let be BMU of an input kmer K = (b_{1},b_{2},…,b_{ k }). Denoted ∆M_{ pssm } = [∆f(b_{ i }, i)]_{4×}_{ k } for b_{ i } ∈ {A, C, G, T} and i = 1,… ,k. Similarly, let ∆M_{ mc } = [∆π(a, a′)]_{4×4} for a, a′ ∈ {A, C, G, T}. We initialize all entries in both matrices ∆M_{ pssm } and ∆M_{ mc } as 0. Also let r = 0. Once a winning node for a kmer K is found, the matrices of a node are updated as follows.
where is an entry of the binary matrix e(K), count(a, a′) is the frequency of dinucleotide (aa’) in kmer K and h is a neighborhood function. The neighborhood function h is defined as
where σ is the variance whose value is fixed throughout the learning stage. We also update r = r + 1. Upon completion of an epoch, all nodes’ model parameters will be updated as follows:
where η is the learning rate and f(b_{ i }, i) and π (a, a′) is defined in Eq (1) and Eq (2) respectively. Note that, in the computation of Eq (12) and Eq (13), we first compute f(b_{ i }, i) and π (a, a′) using the current set of kmers assigned to a node.
It is also necessary to update the weighting parameters α and β. Assuming a set of N_{ l }kmers {K_{1} …, K_{ Nl }} is assigned to a node V_{ l } at the end of an epoch, the weighting parameters update equations are
and
β_{ new } = 1 – α_{ new }. (15)
Training
Assuming a set of kmers X is available. The highlevel training algorithm for SOMEA is as follows.

1.
Inputs. kmer length k, number of top motifs to return in the results H, markov chain background model, and DNA sequences.

2.
Architecture setup. The SOMEA lattice size (U = R × C)is arbitrarily chosen. The default size is 10 × 10. Each node’s model, Θ_{ i }, is initialized with random values.

3.
Training.
Let the BMU index for a kmer K is q(K).
for epoch=1 to max_epoch do
for each K ∈ X do

Compute Θ_{ i } (K),∀i = 1, …,U.

Find the BMU of K as

Assign kmer K to node q(K).

Update ∆M_{ pssm }, ∆M_{ mc }, r of node q(K) and its neighboring nodes.
end for
Update model parameters of all nodes using Eqs (12) and (13).
end for

4.
Finalizing.

(a)
Compute the MAP score F(V_{ i }),∀i = 1,…,U.

(b)
Rank V_{ i } according to their MAP score values.

(c)
Save the top H ranked V_{ i } as result.
Author details
Intelligent Search and Discovery Laboratory, Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Victoria 3086, Australia.
References
 1.
Moses A, Chiang D, Kellis M, Lander E, Eisen M: Position specific variation in the rate of evolution in transcription factor binding sites. BMC Evolutionary Biology 2003, 3: 19.
 2.
Stormo GD: DNA binding sites: representation and discovery. Bioinformatics 2000, 16: 16–23.
 3.
Bailey TL, Elkan C: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning 1995, 21: 51–80.
 4.
Roth FP, Hughes JD, Estep PW, Church GM: Finding DNA regulatory motifs within unaligned noncoding sequences clustered by wholegenome mRNA quantitation. Nat Biotechnol 1998, 16(10):939–945.
 5.
Mahony S, Hendrix D, Golden A, Smith TJ, Rokhsar DS: Transcription factor binding site identification using the selforganizing map. Bioinformatics 2005, 21(9):1807–1814.
 6.
Wang D, Lee NK: Computational discovery of motifs using hierarchical clustering techniques. In ICDM ’08: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining. Washington, DC USA: IEEE Computer Society; 2008:1073–1078.
 7.
Wei Z, Jensen ST: GAME: detecting cisregulatory elements using a genetic algorithm. Bioinformatics 2006, 22(13):1577–1584.
 8.
Wang D, Li X: iGAPK:Improved GAPK algorithm for regulatory DNA motif discovery. In Neural Information Processing. Models and Applications, 17th International Conference (ICONIP2008), Volume 6444 of Lecture Notes in Computer Science. Edited by: K. W. Wong, B. S. U. Mendis, A. Bouzerdoum. Springer; 2010: 217–225.
 9.
Kohonen T: Selforganizing maps. 3rd edition. Springer series in information sciences, 30, Springer; 2001.
 10.
Ferrán EA, Ferrara P: Clustering proteins into families using artificial neural networks. Comput. Appl. Biosci 1992, 8: 39–44.
 11.
Giuliano F, Arrigo P, Scalia F, Cardo PP, Damiani G: Potentially functional regions of nucleic acids recognized by a Kohonen’s selforganizing map. Comput. Appl. Biosci 1993, 9(6):687–693.
 12.
Liu D, Xiong X, DasGupta B, Zhang H: Motif discoveries in unaligned molecular sequences using selforganizing neural networks. IEEE Transactions on Neural Networks 2006, 17(4):919–928.
 13.
Gunewardena S, Zhang Z: A hybrid model for robust detection of transcription factor binding sites. Bioinformatics 2008, 24(4):484–491.
 14.
van Nimwegen E, Zavolan M, Rajewsky N, Siggia ED: Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics. PNAS 2002, 99(11):7323–7328.
 15.
Robin S, Rodolphe F, Schbath S: DNA, Words and Models. New York: Cambridge University Press; 2005.
 16.
Pavesi G, Mauri G, Pesole G: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 2001, 17(suppl 1):S207–214.
 17.
Fawcett T: An introduction to ROC analysis. Pattern Recognition Letters 2006, 27(8):861–874.
 18.
SCPD database[http://cgsigma.cshl.org/jian/]
 19.
Liu XS, Brutlag DL, Liu JS: An algorithm for finding proteinDNA binding sites with applications to chromatinimmunoprecipitation microarray experiments. Nat Biotechnol 2002, 20(8):835–839.
 20.
Weeder[http://159.149.109.9/modtools/]
 21.
SOMBRERO[http://bioinf.nuigalway.ie/sombrero]
 22.
 23.
Blanco E, Farre D, Alba MM, Messeguer X, Guigo R: ABS: a database of annotated regulatory binding sites from orthologous promoters. Nucleic Acids Res 2006, 34(Database issue):D63D67.
 24.
Li L: GADEM: A genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery. Journal of Computational Biology 2009, 16(2):317–329.
 25.
Hu J, Li B, Kihara D: Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res 2005, 33(15):4899–4913.
 26.
Wang D, Li X: GAPK: genetic algorithms with prior knowledge for motif discovery in DNA sequences. In CEC’09: Proceedings of the Eleventh conference on Congress on Evolutionary Computation. Piscataway, NJ, USA: IEEE Press; 2009:277–284.
 27.
Wang D, Lee NK: MISCORE: Mismatchbased matrix similarity scores for DNA motif detection. In Advances in NeuroInformation Processing, 15th International Conference (ICONIP2008), Volume 5506 of Lecture Notes in Computer Science. Edited by: Köppen M, Kasabov NK, Coghill GG. Springer; 2009:478–485.
 28.
Osada R, Zaslavsky E, Singh M: Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics 2004, 20(18):3516–3525.
 29.
Kohonen T, Somervuo P: How to make large selforganizing maps for nonvectorial data. Neural Networks 2002, 15(8–9):945–952.
Acknowledgements
We would be grateful to the group members, Sarwar Tapan, Li Xi, Paul Conilione and Hai Thanh Do, for their comments on the technical aspects and some useful discussions at group meetings. The authors express their sincere appreciation to Sarwar Tapan, who helped in improving the linguistic quality of this paper. NK express his thanks to the Universiti Malaysia Sarawak, which sponsors his PhD study at La Trobe University.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/12?issue=S1.
Author information
Affiliations
Corresponding author
Additional information
Authors' contributions
DH conceived the idea of this study and gave the direction on the framework and experimental studies. Under DH supervision, NK implemented the algorithm and carried out the experimental studies. NK drafted the manuscript and amended by DH. Both authors agreed on the final manuscript.
Competing interests
The authors declare they have no competing interests.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Lee, N.K., Wang, D. SOMEA: selforganizing map based extraction algorithm for DNA motif identification with heterogeneous model. BMC Bioinformatics 12, S16 (2011). https://doi.org/10.1186/1471210512S1S16
Published:
Keywords
 Recall Rate
 Markov Chain Model
 Motif Discovery
 Precision Rate
 Node Versus