Optimized mixed Markov models for motif identification

Background Identifying functional elements, such as transcriptional factor binding sites, is a fundamental step in reconstructing gene regulatory networks and remains a challenging issue, largely due to limited availability of training samples. Results We introduce a novel and flexible model, the Optimized Mixture Markov model (OMiMa), and related methods to allow adjustment of model complexity for different motifs. In comparison with other leading methods, OMiMa can incorporate more than the NNSplice's pairwise dependencies; OMiMa avoids model over-fitting better than the Permuted Variable Length Markov Model (PVLMM); and OMiMa requires smaller training samples than the Maximum Entropy Model (MEM). Testing on both simulated and actual data (regulatory cis-elements and splice sites), we found OMiMa's performance superior to the other leading methods in terms of prediction accuracy, required size of training data or computational time. Our OMiMa system, to our knowledge, is the only motif finding tool that incorporates automatic selection of the best model. OMiMa is freely available at [1]. Conclusion Our optimized mixture of Markov models represents an alternative to the existing methods for modeling dependent structures within a biological motif. Our model is conceptually simple and effective, and can improve prediction accuracy and/or computational speed over other leading methods.


Supplementary Materials
Calculation of Pr(x|M L k ) and Pr(x|M C k ) In real applications, it is easier to calculate Pr(x|M L k ) or Pr(x|M C k ) in terms of probabilities of oligomers. In the following, we derived generalized formulas to calculate a motif probability given a Markov model. These generalized formulas also give us a clearer view of the dierences in the probabilities of a motif sequence given dierent Markov models.

Estimation of model parameters
For the k th order Markov model, the length of all possible oligomers ranges from 1 to k + 1. We can estimate the probabilities of all these oligomers, as model parameters, by tting the model with training data. Let Y = Y 1 , . . . , Y v be a random vector associated with the oligomers consisting of bases from some certain positions of a motif, then Y follows a multinomial distribution: where N is the total number of motif sequences, v is the number of all possible y i , y i is the number of oligomer i and v i=1 y i = N , and θ i is the probability of oligomer i occurred and v i=1 θ i = 1.
According to the Bayes' theorem, the probability of θ given Y can be expressed as Suppose that the prior distribution of θ is a Dirichlet distribution as given by equation (12), then we can estimate the posterior mean of θ by maximum likelihood (equations 13 and 14).
where α is the parameter of Dirichlet distribution. In the context of Markov motif models, α is equivalent to pseudo counts of oligomers. For all our main results, we set pseudo counts = max(4, 0.01N ), where N is the number of true motif sites in training dataset. The prior was chosen based on a rule of thumb without being optimized for performance.

Detection of protein domains
OMiMa is a versatile motif prediction tool for biological sequences, including DNA, RNA and protein sequences. In this evaluation, we tested OMiMa's capability for motif identication in protein sequences. The functional motifs in protein sequences are also known as function domains that are usually well-conserved across dierent members in the same protein family. The data for this evaluation are the alignment seeds of Pfam protein domains (ftp://ftp.sanger.ac.uk/ pub/databases/Pfam/Pfam-A.seed.gz). We used a domain named A Propetide (accession # PF07966), of which the length is 29 bases. In total, there are 86 true domain sites, and for each true domain site, 50 false domain sites are simulated by randomly shuing the amino acids of the true site. For the 0-1 mixture model, OMiMa optimized position arrangement for the rst order chain is given by 5-3-15-7-10-12-26-9-28-13-1-24-21-16-20-0-2- 14-11-6-25-17-27-8-22-18-23-4-19 Although no position is independent of all others for this domain motif (the length of zero order chain is 0), the best model chosen by OMiMa (AIC criterion) is the zero order Markov model. We compared the following three Markov models: zero order, un-optimized rst order, and DNJ optimized rst order models by 10-fold cross validation. The log-likelihood ratio was used to score a domain site, and the rst order Markov chains were set to linear. Results (not given here) show that both the zero order, and DNJ optimized Markov models have perfect prediction (both Sn and Sp are 1), while the un-optimized rst order Markov model does not.

Computational eciency
OMiMa is an ecient computational tool designed for nding motifs in large genome sequences. Using Reese's dataset of human donor sites, we compared the running times used by OMiMa, PVLMM, and NNSplice. Since PVLMM and OMiMa run on dierent computer operating systems, the test was performed on a SUSE Linux/Window XP dual-boot computer with Intel 3.0 GHZ CPU and 512 M main memory. NNSplice is a web-based program at http://www.fruitfly.org/seq_ tools/splice.html, so we calculated its computational time as the interval from submitting data to returning the result from the NNSplice web-server using Microsoft IE browser on the same dualboot computer. Results (Table 1) showed that OMiMa is much more ecient than either NNSplice or PVLMM.