Overview
The main idea of our SOMEA algorithm is to use a hybrid node model, where a model is heterogeneously composed of PSSM and MC. We assume each node on the map is a fuzzy composition between a motif signal and background noise. Since we do not have prior knowledge on the type of each node, we use a soft competitive weighting scheme for the two components (i.e., PSSM and MC) of each node model. We refer it as intranode competition. Our framework design is inspired by the fact that, the two sequence classes (i.e. motif and background noise) in the DNA dataset have distinctive properties. Subsequently, it is necessary to represent them using appropriate signal’s models.
SOMEA starts with converting the input DNA sequences (both strands) into a set of kmers using k length window shifting through the sequence. Then, the size of the map is defined (user input) and nodes’ model parameters are randomly initialized. Then, the following two learning steps are repeated for each input kmer in the dataset:

1.
Internodes competition: to find the best matching unit (BMU) of current input kmer K_{
j
}.

2.
Models updating: update model parameters of the BMU including its topological neighborhood.
The two steps above define a recursive regression process [9], where the optimal models parameters are estimated by iteratively applying the kmers to the system. After some training epochs, similar kmers from supposing motif or background class are projected onto the same or adjacent nodes on the 2D grid map. The kmers projected in the vicinity on the map, generally forming clusters. This implies the similarity of their respective features. Once the nodes’ models have been stabilized, we can identify candidate motifs using a motif model evaluation metric.
Basic concepts and problem formulation
We first give some notations used in this paper, and then describe the SOMEA algorithm. Denoted by D = {S_{1}, S_{2},…, S_{
N
}}, a DNA dataset with N sequences. Let a kmer K_{
i
} = (b_{1}b_{2}…b_{
k
}) be a continuous subsequence of length k in a sequence, and i = 1,…, Z, with Z is the total number of kmers in that sequence. For a sequence with length L, there are L – k + 1 number of kmers can be produced using k length window shifting process.
We can represent a kmer K as a 4 × k matrix [27]. Let the matrix representation be e(K) = [a_{
ij
}]_{4×k}, where (a_{1}_{
j
}, a_{2}_{
j
}, a_{3}_{
j
}, a_{4}_{
j
}) = (A, C, G, T) and j = 1,…, k. The matrix has a column j representing certain nucleotide i at that position j in the kmer.
A 2D SOM map is a lattice of R × C nodes, where R, C is the number of rows and columns respectively. Each node V_{
ij
}, i = 1,…, R and j = 1,…, C, has a subset of kmers assigned to it. For convenience, we use the notation V_{
l
} to represent a node, where 1 ≤ l ≤ (R × C). The coordinate of a node V_{
l
} in the lattice is expressed as z_{
l
} = (i, j). Then, each node V_{
l
} has a parameterized model Θ_{
l
} associated with it.
Let us formulate the clustering based motif discovery task. Clustering on the kmers dataset aims to partition the dataset into a set of nonoverlapping clusters {C_{1}, C_{2},…, C_{
U
}}, where each cluster C_{
i
} holds a subset of kmers. In our study, each node in the SOM 2Dlattice represents a cluster (i.e. U = R × C). After forming the clusters, each cluster C_{
i
}’s potential is evaluated as true motif using motif model evaluation metric and rank the clusters based on their obtained scores. In SOMEA, we used Maximum A Posteriori score (MAP score) as the model evaluation metric. Then, top H highest ranked clusters are selected as putative motifs, and kmers from those clusters indicate the motif locations in the sequences.
PSSM based motif model M_{
pssm
}
We use the PositionSpecificScoringMatrix (PSSM) [2] to model the motif signals. The PSSM based motif model, let it denoted by M_{
pssm
}, is a matrix, i.e., M_{
pssm
} = [f(b_{
i
}, i)]_{4×k}, where b_{
i
} ∈ {A, C, G, T} and i = 1,…, k. Here, each entry f(b_{
i
}, i) represents the probability of nucleotide b_{
i
} in position i and . In our SOMEA, the M_{
pssm
} for a node V_{
l
} can be calculated from the kmers in a node using the maximum likelihood principle, with a pseudocount value added as under sample correction to the probabilistic model. We follow the Bayesian estimation method for this purpose [28]. The PSSM entries are computed as follows:
f(b_{
i
}, i) = (c(b_{
i
}, i) + g(b_{
i
})) /(N + 1), (1)
where N is the number of kmers, c(b_{
i
}, i) is the frequency of nucleotide b_{
i
} at position i of a set of kmers in a node, g(b_{
i
}) = [n(b_{
i
}) + 0.25]/(N × k + 1) and .
Markov chain based background model M_{
mc
}
In our approach, the background signal is modeled by using the markov chain (MC) model [15]. The MC is a commonly used background signal model to distinguish overrepresented motifs from background signals (e.g. in [16, 19]). The stochastic and temporal nature of this model can effectively model the complex relationship of the background bases. The MC model assumes that, the probability of occurrence of a nucleotide b_{
i
} at position i in a DNA sequence is dependent only on the occurrences of m previous nucleotides. This relationship can be expressed by the conditional probability p(b_{
i
}b_{
i
}_{–}_{
m
}…b_{
i
}_{–1}), where b_{
i
}_{–}_{
m
}…b_{
i
}_{–1} are bases that precede base b_{
i
}, and m is the markov order. In our approach, the first order MC (i.e. m = 1) is used because higher order model usually requires more input data to avoid overfitting. The maximum likelihood estimation of the conditional probability p(b_{
i
}b_{
i
}_{–}_{
m
}…b_{
i
}_{–1}) is given by [15] as:
where c′(x) is the number of times subsequence x found in a set of kmers in a node.
Let us denote π (a, a′) to represent the conditional probability p(a′a) of the first order MC, where a, a′ ∈ {A, C, G, T}. Then the MC transition matrix gives the background model M_{
mc
} to be used in SOMEA, i.e., M_{
mc
} = [π(a, a′)]_{4×4} , where .
Similarity score
A similarity metric is needed for kmers assignment to the nodes during the learning. The score of a kmer K_{
j
} = (b_{1}b_{2} …b_{
k
}) in respect with the PSSM based model assigned to node V_{
l
}, can be computed as,
Here, k is the length of kmer, and f(b_{
i
}, i) represents the probability of nucleotide b_{
i
} in position i. Then, the score of a kmer K_{
j
} to the MC [15] based model M_{
mc
} of node V_{
l
} is computed as:
Here, p(b_{1}) is the independent and identically distribution (i.i.d) probability of nucleotide b_{1} in current node, which is estimated from the kmers of node V_{
l
}.
Hybrid model
In practice, we are unable to certainly deduce if a SOMEA’s node is a motif or background at any stage of the learning process. Also, before the system converged, the members of a node are likely to be composed of mixed signals. Therefore, neither PSSM or MC based models (i.e.M_{
pssm
} and M_{
mc
}) alone would satisfactorily model such composition. However, we can weigh the fitness of MC and PSSM models with respect to the kmers in a node. In other words, when a set of kmers fit with a certain model, (i.e., either motif model given by M_{
pssm
} or background model given by M_{
mc
}), it is more likely that those kmers represent that class. Note that both signal models, can represent signal features from opposite class to some extent.
In this work, we aimed to combine the expression abilities of both of the models (i.e., i.e.M_{
pssm
} and M_{
mc
}) in an unified mechanism to improve the distinguishing ability of the system, since each node given by SOMEA (or any clustering based approach) contains a fuzzy mixture of motif signals and background signals.
In implementation, we adopted a simple linear weighting scheme to combine these two models for a node V_{
l
} as follows:
Equation (5) gives a linear combination of the two models to produce a heterogeneous model for a node V_{
l
}. Here, λ is a scaling factor, for simplicity default value of λ is set as 0.5. If a kmer K_{
j
} gets a higher score by this heterogeneous model based scoring Θ_{
l
}(K_{
j
}), that indicates K_{
j
} has a better fit to the combined model of node V_{
l
}.
Motif ranking
Once the SOMEA is stabilized after training, we have to perform an evaluation on the nodes in order to identify the most prominent candidate motifs. The candidate motifs can be identified using either motif evaluation metric or statistical significance value. These metrics usually require the use of background sequences model for computation.
In this work, we adopt the Maximum A Posteriori score (MAP score) [19] for motif ranking. The MAP score measures the conservation property of a motif with respect to the species background sequences [19]. Since, rare motifs in the background can achieve a higher MAP score, this measure can be used to distinguish a true motif from false ones based on their scores ranking. The background sequences can be modeled by using the markov chain model generated from the intergenic sequences of a species under study. This model can be used to assign a probability of a K, namely , under the background model given by . The MAP score of a node V_{
l
} can be calculated as follows:
where N_{
l
} is the number of kmers in node V_{
l
} and refer to background probability of a kmer K in respect with background model .can be written as,
Here, m is the Markov chain order, k is the length of kmers, p(b_{1}, b_{2},…, b_{m}) is the probability of subsequence b_{1},b_{2}, …, b_{
m
}, and p(b_{
i
}b_{
i
}_{–}_{
m
}, b_{
i
}_{–}_{
m
}_{+1},…, b_{
i
}_{–1}) is the conditional probability of the subsequence b_{
i
} under b_{
i–m
}, b_{
i–m
}_{+1,…,}b_{
i
}_{–1} occurrence constraints. For instance, using the 3rd order model, the probability of the sequence ATGCG can be calculated as: . This background probability is usually precomputed on the sequences of interest. In Eq (6), E(V_{
l
}) is the Shannon’s entropy, that can be written as,
Here, f(b_{i}, i) is the probability of nucleotide base b_{
i
} ∈ {A, C, G, T} to occur in ith position of the PSSM.