The structure of NestedMICA
NestedMICA is a probabilistic motif inference method based on a generative sequence model. The model has three sets of parameters: first, a background model which represents all the non-motif parts of the input sequences; second, a set of position-weight matrices which represent the motifs themselves; and finally, a binary matrix (the occupancy matrix) whose elements specify whether a given motif should be considered when modeling a given input sequence. The background model is built in advance and held constant during motif inference, while the motifs and occupancy matrix are updated to fit the supplied data. NestedMICA uses the Nested Sampling strategy [15] to update both of these sets of parameters.
The implementation of NestedMICA's nminfer program can be split into two major parts: code that calculates the likelihood of some sequences under the generative model, and code which implements the Nested Sampling process. The Nested Sampling code makes few assumptions about the internal structure of the model (and could potentially be used to perform inference of quite different models), so we consider these two components separately.
The NestedMICA sequence model
Both interesting motif regions and the remaining, uninteresting parts of sequences can be represented with Hidden Markov Models (HMMs). This kind of model can be referred to as a sequence mixture model (SMM), since it contains states representing motifs alongside states representing the background. A common example of an SMM is the zero-or-one occurrences per sequence (ZOOPS) model, which is the default strategy in most motif finders based on expectation maximisation [21] or Gibbs sampling [22]; a typical example is the MEME [13] motif discovery program.
NestedMICA relaxes the constraints of this model slightly by allowing a given motif to appear multiple times in the same input sequence. To calculate the likelihood of a given sequence, NestedMICA first consults the appropriate row of the occupancy matrix to determine a (possibly empty) subset, M, of the complete motif set which applies to this sequence. In the case where M is empty, the likelihood of the sequence is simply its likelihood under the background model (see below). When M is non-empty, NestedMICA sums over all possible configurations of motif occurrences along the sequence, filling in any gaps using the background model. This is performed using a dynamic programming recursion which gives the likelihood, L_{n}, of all paths up to a given point, n, in the input sequence as:

L_{n} = (1 - t) B_{n} L_{n-1} + t \sum_{m=1}^{M} w_{m}(n - l_{m} + 1, n) L_{n - l_{m}}

where M is the number of motifs selected by the occupancy matrix, l_{m} is the length of weight matrix m, B_{n} is the probability that the sequence symbol at position n was emitted by the background model, w_{m}(i, j) is the probability that the sequence from i to j was emitted by the weight matrix m, and t is a transition probability specifying the estimated density of motifs in the sequence. We initialise L_{0} = 1 and then apply the above formula recursively along the length of the input sequence until the final position is reached, giving a likelihood for the complete sequence.
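As a concrete illustration, the recursion can be sketched in Python (an illustrative re-implementation, not NestedMICA's own code; the function and parameter names are our own):

```python
def sequence_likelihood(seq, motifs, background, t):
    """Dynamic-programming likelihood of one sequence under a
    motifs-plus-background mixture (illustrative sketch).

    seq        -- string of residues
    motifs     -- list of weight matrices; each matrix is a list of
                  columns mapping a residue to its probability
    background -- background[n] is the background emission probability
                  for the symbol at position n (precomputed)
    t          -- transition probability (expected motif density)
    """
    L = [0.0] * (len(seq) + 1)
    L[0] = 1.0  # empty prefix
    for n in range(1, len(seq) + 1):
        # path in which the background emits the symbol at position n
        total = (1.0 - t) * background[n - 1] * L[n - 1]
        # paths in which some motif ends exactly at position n
        for wm in motifs:
            l_m = len(wm)
            if n >= l_m:
                p = 1.0
                for i, col in enumerate(wm):
                    p *= col[seq[n - l_m + i]]
                total += t * p * L[n - l_m]
        L[n] = total
    return L[-1]
```

With an empty motif set this reduces to the likelihood under the background model alone, matching the case where M is empty.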
In principle, any background model could be used with this formulation. In practice, we choose to use a mosaic background [14] which admits the possibility of several different classes of background sequence, each of which is modeled using a low-order Markov chain (i.e. within a given class, the probability of observing a particular symbol at position n depends on the symbols observed at a fixed number of previous positions). The mosaic model is implemented as a fully connected HMM (transitions are allowed between any pair of classes).
To calculate B_{n}, NestedMICA first applies the standard posterior decoding algorithm [23] to find P_{hn}, the posterior probability that the symbol at position n in the input sequence was generated by state h of the background model H. We then calculate B_{n} as:

B_{n} = \sum_{h \in H} P_{hn} h(S_{n})
(i.e. summing over any remaining uncertainty about which background class is used at n). Note that when the Markov chain order, o, is greater than zero, the probability of observing a given symbol, h(S_{n}), depends on the o previous symbols in the sequence. This means that it is not possible to exactly calculate B_{n} where n ≤ o. We choose to ignore the first o symbols in the input sequence (except for background calculation purposes) in order to avoid any edge effects.
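For a zero-order background, this calculation amounts to a posterior-weighted average of the class emission probabilities. A minimal sketch (names are our own):

```python
def background_emission(posteriors, emissions, symbol, n):
    """B_n for a zero-order mosaic background: weight each class's
    emission probability for the observed symbol by the posterior
    probability (from posterior decoding) that class h generated
    position n, then sum over classes.

    posteriors -- posteriors[h][n] = P_{hn}
    emissions  -- emissions[h] maps a symbol to its probability
                  under background class h
    """
    return sum(posteriors[h][n] * emissions[h][symbol]
               for h in range(len(emissions)))
```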
Inference by Nested Sampling
Inferring optimal parameters for probabilistic models is a difficult task, particularly when the number of model parameters becomes large. NestedMICA performs inference using Nested Sampling [15], a robust Bayesian sampling method for model selection and parameter optimisation. Nested Sampling is a Monte Carlo inference strategy which can find globally good solutions to high-dimensional problems. Classical Monte Carlo methods work by moving a single state (i.e. set of parameters) around the problem's parameter space, accepting or rejecting proposed moves depending on whether they increase or decrease the likelihood of the observed data. Nested Sampling, in contrast, is always applied to an ensemble of e different states, where the value of e is typically a few hundred. The process starts with an ensemble of states sampled uniformly from the prior.
The states are then sorted in order of likelihood, and the least likely state is removed from the ensemble. To maintain the ensemble size, a new state is sampled, subject to the constraint that the new state must have a likelihood greater than that of the state it is replacing. Repeating this process many times means that a nested sampler progressively moves towards a small subset of the state space which contains high-likelihood states. This is somewhat analogous to simulated annealing methods, where a temperature parameter is reduced to bring the model progressively closer to the posterior distribution, but Nested Sampling avoids the need to explicitly cool the model: progress towards high-likelihood states occurs automatically.
For each step of Nested Sampling, a certain fraction of state space is removed from further consideration (since it contains states with likelihoods lower than the threshold). Over many steps, the fraction of prior mass that is removed from consideration at step t will tend towards:

w_{t} = exp(-(t - 1)/e) - exp(-t/e)

where e is the ensemble size. Since all the states which have been removed from consideration will have a likelihood of approximately L_{t}, the likelihood of the state which was removed at step t, the Bayesian evidence for the model, Z, can be estimated as:

Z ≈ \sum_{t} L_{t} w_{t}
Clearly, it is possible to progressively accumulate an estimate of Z during the Nested Sampling process. The final estimate of Z can be used for model comparison purposes (for example, finding optimal parameters for the NestedMICA sequence model). NestedMICA also uses Z_{t}, the online Z estimate up to step t, to decide when to terminate the Nested Sampling process. Specifically, we terminate when:

L_{t} exp(-t/e) << Z_{t}

i.e. the likely increase of Z in future iterations is small compared to the current value. Formally, this may lead to premature termination if L increases dramatically late in the training process, but in practice we find that this simple criterion is effective for motif discovery.
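The whole loop — discard the worst state, add its weighted likelihood to the evidence, replace it under the likelihood constraint, and stop once little evidence remains — can be sketched generically as follows (our own minimal Nested Sampling sketch, not NestedMICA's implementation; the rejection-style explore step is only practical for cheap priors):

```python
import math
import random

def nested_sampling(log_like, sample_prior, explore,
                    ensemble_size=50, eps=1e-3, max_steps=100000):
    """Minimal Nested Sampling loop (illustrative sketch).

    log_like     -- log-likelihood of a state
    sample_prior -- draws a fresh state from the prior
    explore      -- explore(seed, floor) returns a new prior sample
                    whose log-likelihood exceeds floor
    """
    ensemble = [sample_prior() for _ in range(ensemble_size)]
    Z, X_prev = 0.0, 1.0  # evidence estimate; remaining prior mass
    for t in range(1, max_steps + 1):
        worst = min(ensemble, key=log_like)
        L_t = math.exp(log_like(worst))
        X_t = math.exp(-t / ensemble_size)   # mass left after step t
        Z += L_t * (X_prev - X_t)            # shell weight w_t
        X_prev = X_t
        # replace the discarded state, keeping the ensemble size fixed
        seed = random.choice([s for s in ensemble if s is not worst])
        ensemble[ensemble.index(worst)] = explore(seed, log_like(worst))
        # stop once the remaining region cannot add much to Z
        if L_t * X_t < eps * Z:
            break
    return Z
```

For a one-dimensional toy problem with likelihood L(x) = 2x on a uniform prior over [0, 1], the true evidence is Z = 1, and this sketch recovers it to within sampling error.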
Implementation of NestedMICA
The NestedMICA nminfer program is based around a fairly general implementation of the Nested Sampling strategy, which can be applied to any probabilistic model. This code takes three inputs: a data set (i.e. a set of sequences), some code to calculate the likelihood of the dataset given a model state (i.e. an implementation of the likelihood function given above), plus a set of "sampling" operations which perturb a state and can be used to move around state space.
Each state consists of two sets of parameters: a set of motif weight matrices, and an occupancy matrix specifying which input sequences each motif appears in. Most of NestedMICA's sampling moves are applied to one randomly selected weight matrix (WM):

- making a small perturbation to one column of a weight matrix, by slightly increasing or decreasing one of the weights, then renormalising so the column still sums to 1.
- replacing a WM column with a new one, sampled from the prior.
- removing a column from one end of a WM while adding another one at the other end.
- adjusting the motif length, by adding or removing a column at either end.
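The first move in the list above might look like the following sketch (our own illustration; the column layout and parameter names are assumptions, though the clipping mirrors the minClip idea):

```python
import random

def perturb_column(wm, delta=0.05, floor=1e-3):
    """Nudge one weight in one randomly chosen column of a weight
    matrix, clip it at a minimum value (cf. the minClip parameter),
    then renormalise the column so it still sums to 1. Returns a new
    matrix so the sampler can reject the move and keep the original."""
    new = [list(col) for col in wm]
    c = random.randrange(len(new))
    r = random.randrange(len(new[c]))
    new[c][r] = max(floor, new[c][r] + random.uniform(-delta, delta))
    total = sum(new[c])
    new[c] = [w / total for w in new[c]]
    return new
```

Returning a fresh matrix rather than mutating in place fits the sampler's accept/reject pattern: a rejected move simply discards the copy.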
In addition, it is necessary to resample the occupancy matrix. In principle, a straightforward and valid sampling move would be to simply flip the state of one randomly selected element in the occupancy matrix. In practice, NestedMICA tests multiple occupancy matrix moves at the same time, since this improves performance when running on multiprocessor systems.
Finally, it is necessary to place a prior over the state space. NestedMICA uses a simple non-informative prior for the weight matrix motif models: a uniform prior over weight-matrix space with the constraint that extremely low weights are forbidden. The lower limit is specified by the minClip parameter and is typically 10^{-7} for amino acid input, and of the order of 10^{-3} for DNA input. We also place a non-informative prior on the occupancy matrix, although if there is some prior knowledge about the frequency of the target motif in the dataset, this can be specified using the expectedUsageFraction option.
Adding protein support to NestedMICA
We made several changes to NestedMICA in order to support protein motif discovery. Firstly, we added support for loading and analysing protein sequences (enabled with the alphabet PROTEIN switch). The inference strategy remains identical to that previously described [14]. However, the dimensionality of the protein motif discovery problem is much higher than in nucleic acids: a DNA motif model has three free parameters per position, while a protein motif has 19. To compensate for this difference, we found that the Nested Sampling process required a rather larger ensemble of models than for DNA. Having found an optimal ensemble size through a systematic parameter sweep, we made this the default ensemble size when running the program in protein mode.
Another important difference between the protein-capable version and the previous version of NestedMICA is the way the amino acid probability distributions for each background mosaic class are initialised. Starting with flat probability distributions in all the mosaic classes of a given background, as in the DNA case, was not ideal for protein sequences: we observed a minimal learning rate from these equal initial states. Instead, a partly random, partly input-based initialisation was preferred: the distributions were initialised to directly reflect the amino acid composition of the actual input data, then perturbed randomly by a small margin so that training could learn and converge faster.
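This initialisation could be sketched as follows (an illustrative assumption about the jitter scheme, not the exact NestedMICA code):

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def init_background_classes(sequences, n_classes, jitter=0.1):
    """Initialise each mosaic class from the observed amino acid
    frequencies of the input, then perturb each probability by a
    small random factor so the classes can diverge during training."""
    counts = {a: 1.0 for a in AMINO_ACIDS}  # pseudocounts avoid zeros
    for seq in sequences:
        for a in seq:
            if a in counts:
                counts[a] += 1.0
    total = sum(counts.values())
    base = {a: c / total for a, c in counts.items()}
    classes = []
    for _ in range(n_classes):
        noisy = {a: p * (1.0 + random.uniform(-jitter, jitter))
                 for a, p in base.items()}
        z = sum(noisy.values())
        classes.append({a: p / z for a, p in noisy.items()})
    return classes
```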
Since the initial publication of NestedMICA [14], an important extra feature has been added: automatic optimisation of a motif's length within a user-specified motif length range. NestedMICA treats the motif length as another free parameter of the motif model, and optimises it using the same Nested Sampling strategy as all the other parameters. Another change in the new version is that, if no background model is provided by the user, NestedMICA uses a basic, zero-order background model which is trained on the fly from the user-supplied input sequences.
Further information regarding the parameters used in motif finding can be found in the user manual at the NestedMICA web site [24].
Background model training
Probabilistic motif finding tools usually employ background models to represent sequence regions where, ideally, no motif of interest exists. In most cases, however, these programs use a homogeneous background model, assuming that all non-motif portions of the sequence can be represented using a single amino acid frequency distribution. In reality, protein sequences are generally composed of different functional domains which can be chemically biased towards certain compositional forms. In addition, protein sequences are very likely to carry different sequence signals responsible for various molecule-recognition and binding related tasks. NestedMICA uses non-homogeneous ("mosaic") background models which subdivide the background sequences into several classes. Each class is modelled as a Markov chain. The order of the chain (i.e. the number of previous symbols on which the probability distribution for the next observed symbol is conditioned) can be set to an arbitrary value, but for protein sequence analysis we recommend only using zeroth- or first-order background models, since higher-order models will have an extremely high parameter count and will be hard, if not impossible, to parametrise effectively.
A built-in background likelihood estimation procedure in NestedMICA (called "nmevaluatebg") allows an optimal background model architecture to be found for a given set of sequences. A NestedMICA background model can be a Markov chain of any order and can consist of an arbitrary number of mosaic classes. As a representative sequence set for background model parameter optimisation, we used the pTarget protein subcellular localisation dataset [25] (Figure 1), mainly because it includes different types of proteins from different subcellular localisations, reducing the chance that a few strong domain or localisation signals dominate background model training and evaluation. Furthermore, we reduced the sequence identity of the set from 95% down to a maximum of 40% using the CD-HIT [26] clustering software, leaving a total of 7437 eukaryotic proteins with an average sequence length of 522. For evaluation purposes, 6000 of these were used to train several background models with different parameters, while the remaining sequences were used to test how well a given background model represented them. As Figure 1 shows, using order-1 probabilities, where the compositional probability of a certain residue depends on a single adjacent residue, performs better than a zero-order model. Moreover, the likelihood of the test sequences increased monotonically with the number of mosaic classes. Training a multi-class, higher-order background requires sufficient sequence data in order to prevent overfitting of the background: for example, a first-order, 6-class model corresponds to having a total of 2400 different amino acid distributions.
Program output and sequence logos
NestedMICA reports discovered motifs as PWMs which can be viewed as sequence logos using an accompanying motif-viewer tool. In a single NestedMICA protein motif logo, each column has a maximum information content of 4.32 bits (log_{2} 20), and amino acid letters are coloured according to their general physical and chemical properties, as depicted in Figure 2.
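The per-column information content follows the standard definition: log2 of the alphabet size minus the column's Shannon entropy, which peaks at log2(20) ≈ 4.32 bits for a fully conserved position in a 20-letter alphabet. A short sketch:

```python
import math

def column_information(probs, alphabet_size=20):
    """Information content (in bits) of one PWM column:
    log2(alphabet size) minus the column's Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(alphabet_size) - entropy
```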
As opposed to the majority of motif finders, NestedMICA does not report significance measures such as E-values or entropy scores, as these values can be quite unreliable. All such scores are calculated on the assumption that the motif finder has picked up a real motif, which obviously cannot always be true. A recent publication [27] discusses in detail why using such scores can lead to undesirable results.
Testing NestedMICA's performance
In order to get a better understanding of NestedMICA's protein motif finding capabilities and limits, a number of motif spiking experiments were performed using synthetic and biological motifs, similar to the approach previously used by [14]. In a motif spiking test, a number of short amino acid sequences are generated according to the weight matrix distribution probabilities of a given motif. These motifresembling short peptides are then inserted at random positions into a set of sequences. The program under test is then applied to the set of sequences to predict a set of motifs. Finally, the predicted candidate motif set is compared with the original test set to assess the performance of the program in recovering the spiked motifs. MEME PWMs were converted into NestedMICA sequence logos for easier comparison.
To evaluate how similar a reported motif is to the original one, we used Cartesian motif-motif distances. The Cartesian motif distance metric is the sum, over motif positions, of the Cartesian distances between the corresponding pairs of 20 amino acid probabilities from the two motifs. For a motif to be considered recovered with reasonable precision, we used an empirically set threshold on the maximum allowed Cartesian motif distance, normalised by the original motif length. Motifs showing an average deviation of more than 0.3 Cartesian motif distance per position were considered false discoveries.
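Reading the per-position "Cartesian distance" as the Euclidean distance between the two columns' probability vectors, the metric and the recovery criterion can be sketched as follows (our interpretation; names are illustrative):

```python
import math

def cartesian_motif_distance(m1, m2):
    """Sum, over aligned columns, of the Euclidean distance between
    the two columns' amino acid probability vectors. Motifs are lists
    of equal-length probability tuples."""
    return sum(math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))
               for c1, c2 in zip(m1, m2))

def recovered(test_motif, reported_motif, per_position_threshold=0.3):
    """Recovery criterion: the average Cartesian distance per motif
    position must not exceed the threshold."""
    d = cartesian_motif_distance(test_motif, reported_motif)
    return d / len(test_motif) <= per_position_threshold
```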
For each reported motif, in addition to reporting its Cartesian motif distance to the original test motif, we calculated its sensitivity (SN) and specificity (SP) values:

SN = TP / (TP + FN)

SP = TP / (TP + FP)

Matthews Correlation Coefficient (MCC) [18] values were also calculated, to show a PWM's scanning power as in [28]:

MCC = (TP × TN - FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP, FP, FN and TN stand for true positives, false positives, false negatives and true negatives, respectively.
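These quantities are straightforward to compute from the confusion matrix. A sketch (note that "specificity" is taken here as TP/(TP+FP), the reading consistent with the equal-dataset example discussed below; this is an interpretation on our part):

```python
import math

def scan_scores(tp, fp, fn, tn):
    """Sensitivity, specificity and Matthews correlation coefficient
    for a motif-scanning confusion matrix. Specificity is computed as
    TP/(TP+FP), i.e. the positive predictive value."""
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, mcc
```

A scanner that flags everything in two equally sized datasets gets SN = 1 and SP = 0.5, but an MCC of 0, illustrating why MCC is the more informative summary.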
One advantage of using MCC in a PWM evaluation is that for random motif predictions MCC tends to be around zero, while perfect scanning performance gives a maximum value of 1. On the other hand, depending on the choice of score threshold, one can obtain a sensitivity of 1 even for an irrelevant or weak motif, while the corresponding specificity value could be as low as 0.5 if the number of sequences in the two datasets is equal. In such cases, MCC will tend to be very low, reflecting the random prediction.
To calculate these measures of motif scanning performance, we first spiked every sequence in the test dataset with a particular motif, then scanned for a reported motif in both the spiked and original datasets to see how many motif instances would be correctly or falsely predicted in each. For each individual test case, we picked the threshold score that maximised the corresponding MCC value, after systematically trying a range of score thresholds and computing sensitivity, specificity and MCC at each. We calculated these values not only for motifs reported by the programs we assessed, but also for the original test motifs, because the scanning performance of a recovered motif should be considered relative to that of the original motif. A more objective and absolute metric of motif recovery is the Cartesian motif distance described above. For example, a test motif which contains only a small number of strongly conserved residues cannot be expected to have a good scanning performance in identifying all spiked motifs, because the motif tolerates too much sequence variation. Judging the performance of a motif discovery tool based only on such sensitivity/specificity measures is therefore inadequate: a motif tool should find a weak motif in a set of spiked data if the original motif is a weak one, too, yet the sensitivity/specificity of such a less conserved motif would be relatively low, and would not reflect or reward the program's ability to discover a difficult motif. We therefore report the MCC of the original test motifs primarily as a measure of how difficult a motif is for a motif discovery program to recover, and we report Cartesian motif distances to indicate how well the program performs that task.
For instance, even an MCC value of 0.6 would still be good for a motif found by a program, if the corresponding real test motif did not have a much better MCC.
To generate test motifs for the programs' assessment, we used conserved blocks from ClustalW multiple alignments of sufficiently large numbers of SwissProt [29] proteins which all feature an arbitrarily chosen Prosite [3] or PFAM domain entry. Segments from these domains' alignments were converted into PWMs to obtain 3 sets of 7 test motifs of varying lengths between 3 and 9. The 21 test motifs used in the evaluations are available for download at the NestedMICA home page.
As a dataset on which to carry out the spiking tests, we used 438 whole-length cytoplasmic protein sequences obtained from the redundancy-reduced non-plants version of the TargetP [19] subcellular localisation dataset. With an average sequence length of 582, this dataset does not include any homologous proteins, after a filtering process by the algorithm mentioned in [19]. Both NestedMICA and MEME were run with the default options. Note that NestedMICA's default parameters differ from those used in DNA motif finding. Both NestedMICA and MEME require a target motif length interval; no matter what the actual spiked motif's length was, for all of our spiking tests this was set to between 3 and 15.
The background model used in the spiking tests was trained on the same cytoplasmic sequence dataset. A background likelihood analysis similar to the one performed on the other set (Figure 1) suggested that there would be no significant gain in likelihood from using a model with more than 4 mosaic classes for this particular small dataset. Therefore, a first-order background model containing 4 mosaic classes was used in the tests.
Finally, to assess the programs' performance in subcellular localisation motif recovery, which was evaluated using sequences of different lengths, we used the ER dataset of MultiLoc [30], a multi-class protein subcellular localisation predictor. This dataset contains 198 homology-reduced eukaryotic ER proteins.