Overview
The main idea of our SOMEA algorithm is to use a hybrid node model, where a model is heterogeneously composed of PSSM and MC. We assume each node on the map is a fuzzy composition between a motif signal and background noise. Since we do not have prior knowledge on the type of each node, we use a soft competitive weighting scheme for the two components (i.e., PSSM and MC) of each node model. We refer it as intra-node competition. Our framework design is inspired by the fact that, the two sequence classes (i.e. motif and background noise) in the DNA dataset have distinctive properties. Subsequently, it is necessary to represent them using appropriate signal’s models.
SOMEA starts with converting the input DNA sequences (both strands) into a set of k-mers using k length window shifting through the sequence. Then, the size of the map is defined (user input) and nodes’ model parameters are randomly initialized. Then, the following two learning steps are repeated for each input k-mer in the dataset:
-
1.
Inter-nodes competition: to find the best matching unit (BMU) of current input k-mer K
j
.
-
2.
Models updating: update model parameters of the BMU including its topological neighborhood.
The two steps above define a recursive regression process [9], where the optimal models parameters are estimated by iteratively applying the k-mers to the system. After some training epochs, similar k-mers from supposing motif or background class are projected onto the same or adjacent nodes on the 2D grid map. The k-mers projected in the vicinity on the map, generally forming clusters. This implies the similarity of their respective features. Once the nodes’ models have been stabilized, we can identify candidate motifs using a motif model evaluation metric.
Basic concepts and problem formulation
We first give some notations used in this paper, and then describe the SOMEA algorithm. Denoted by D = {S1, S2,…, S
N
}, a DNA dataset with N sequences. Let a k-mer K
i
= (b1b2…b
k
) be a continuous subsequence of length k in a sequence, and i = 1,…, Z, with Z is the total number of k-mers in that sequence. For a sequence with length L, there are L – k + 1 number of k-mers can be produced using k length window shifting process.
We can represent a k-mer K as a 4 × k matrix [27]. Let the matrix representation be e(K) = [a
ij
]4×k, where (a1
j
, a2
j
, a3
j
, a4
j
) = (A, C, G, T) and j = 1,…, k. The matrix has a column j representing certain nucleotide i at that position j in the k-mer.
A 2D SOM map is a lattice of R × C nodes, where R, C is the number of rows and columns respectively. Each node V
ij
, i = 1,…, R and j = 1,…, C, has a subset of k-mers assigned to it. For convenience, we use the notation V
l
to represent a node, where 1 ≤ l ≤ (R × C). The coordinate of a node V
l
in the lattice is expressed as z
l
= (i, j). Then, each node V
l
has a parameterized model Θ
l
associated with it.
Let us formulate the clustering based motif discovery task. Clustering on the k-mers dataset aims to partition the dataset into a set of non-overlapping clusters {C1, C2,…, C
U
}, where each cluster C
i
holds a subset of k-mers. In our study, each node in the SOM 2D-lattice represents a cluster (i.e. U = R × C). After forming the clusters, each cluster C
i
’s potential is evaluated as true motif using motif model evaluation metric and rank the clusters based on their obtained scores. In SOMEA, we used Maximum A Posteriori score (MAP score) as the model evaluation metric. Then, top H highest ranked clusters are selected as putative motifs, and k-mers from those clusters indicate the motif locations in the sequences.
PSSM based motif model M
pssm
We use the Position-Specific-Scoring-Matrix (PSSM) [2] to model the motif signals. The PSSM based motif model, let it denoted by M
pssm
, is a matrix, i.e., M
pssm
= [f(b
i
, i)]4×k, where b
i
∈ {A, C, G, T} and i = 1,…, k. Here, each entry f(b
i
, i) represents the probability of nucleotide b
i
in position i and
. In our SOMEA, the M
pssm
for a node V
l
can be calculated from the k-mers in a node using the maximum likelihood principle, with a pseudo-count value added as under sample correction to the probabilistic model. We follow the Bayesian estimation method for this purpose [28]. The PSSM entries are computed as follows:
f(b
i
, i) = (c(b
i
, i) + g(b
i
)) /(N + 1), (1)
where N is the number of k-mers, c(b
i
, i) is the frequency of nucleotide b
i
at position i of a set of k-mers in a node, g(b
i
) = [n(b
i
) + 0.25]/(N × k + 1) and
.
Markov chain based background model M
mc
In our approach, the background signal is modeled by using the markov chain (MC) model [15]. The MC is a commonly used background signal model to distinguish over-represented motifs from background signals (e.g. in [16, 19]). The stochastic and temporal nature of this model can effectively model the complex relationship of the background bases. The MC model assumes that, the probability of occurrence of a nucleotide b
i
at position i in a DNA sequence is dependent only on the occurrences of m previous nucleotides. This relationship can be expressed by the conditional probability p(b
i
|b
i
–
m
…b
i
–1), where b
i
–
m
…b
i
–1 are bases that precede base b
i
, and m is the markov order. In our approach, the first order MC (i.e. m = 1) is used because higher order model usually requires more input data to avoid over-fitting. The maximum likelihood estimation of the conditional probability p(b
i
|b
i
–
m
…b
i
–1) is given by [15] as:
where c′(x) is the number of times sub-sequence x found in a set of k-mers in a node.
Let us denote π (a, a′) to represent the conditional probability p(a′|a) of the first order MC, where a, a′ ∈ {A, C, G, T}. Then the MC transition matrix gives the background model M
mc
to be used in SOMEA, i.e., M
mc
= [π(a, a′)]4×4 , where
.
Similarity score
A similarity metric is needed for k-mers assignment to the nodes during the learning. The score of a k-mer K
j
= (b1b2 …b
k
) in respect with the PSSM based model
assigned to node V
l
, can be computed as,
Here, k is the length of k-mer, and f(b
i
, i) represents the probability of nucleotide b
i
in position i. Then, the score of a k-mer K
j
to the MC [15] based model M
mc
of node V
l
is computed as:
Here, p(b1) is the independent and identically distribution (i.i.d) probability of nucleotide b1 in current node, which is estimated from the k-mers of node V
l
.
Hybrid model
In practice, we are unable to certainly deduce if a SOMEA’s node is a motif or background at any stage of the learning process. Also, before the system converged, the members of a node are likely to be composed of mixed signals. Therefore, neither PSSM or MC based models (i.e.M
pssm
and M
mc
) alone would satisfactorily model such composition. However, we can weigh the fitness of MC and PSSM models with respect to the k-mers in a node. In other words, when a set of k-mers fit with a certain model, (i.e., either motif model given by M
pssm
or background model given by M
mc
), it is more likely that those k-mers represent that class. Note that both signal models, can represent signal features from opposite class to some extent.
In this work, we aimed to combine the expression abilities of both of the models (i.e., i.e.M
pssm
and M
mc
) in an unified mechanism to improve the distinguishing ability of the system, since each node given by SOMEA (or any clustering based approach) contains a fuzzy mixture of motif signals and background signals.
In implementation, we adopted a simple linear weighting scheme to combine these two models for a node V
l
as follows:
Equation (5) gives a linear combination of the two models to produce a heterogeneous model for a node V
l
. Here, λ is a scaling factor, for simplicity default value of λ is set as 0.5. If a k-mer K
j
gets a higher score by this heterogeneous model based scoring Θ
l
(K
j
), that indicates K
j
has a better fit to the combined model of node V
l
.
Motif ranking
Once the SOMEA is stabilized after training, we have to perform an evaluation on the nodes in order to identify the most prominent candidate motifs. The candidate motifs can be identified using either motif evaluation metric or statistical significance value. These metrics usually require the use of background sequences model for computation.
In this work, we adopt the Maximum A Posteriori score (MAP score) [19] for motif ranking. The MAP score measures the conservation property of a motif with respect to the species background sequences [19]. Since, rare motifs in the background can achieve a higher MAP score, this measure can be used to distinguish a true motif from false ones based on their scores ranking. The background sequences can be modeled by using the markov chain model generated from the intergenic sequences of a species under study. This model can be used to assign a probability of a K, namely
, under the background model given by
. The MAP score of a node V
l
can be calculated as follows:
where N
l
is the number of k-mers in node V
l
and
refer to background probability of a k-mer K in respect with background model
.
can be written as,
Here, m is the Markov chain order, k is the length of k-mers, p(b1, b2,…, bm) is the probability of subsequence b1,b2, …, b
m
, and p(b
i
|b
i
–
m
, b
i
–
m
+1,…, b
i
–1) is the conditional probability of the subsequence b
i
under b
i–m
, b
i–m
+1,…,b
i
–1 occurrence constraints. For instance, using the 3rd order model, the probability of the sequence ATGCG can be calculated as:
. This background probability is usually pre-computed on the sequences of interest. In Eq (6), E(V
l
) is the Shannon’s entropy, that can be written as,
Here, f(bi, i) is the probability of nucleotide base b
i
∈ {A, C, G, T} to occur in i-th position of the PSSM.