### Datasets

The following datasets were used for experiments in this paper:

- **(D1)** The primary and secondary structures of 323 human miRNA precursors, obtained from miRBase.
- **(D2)** The primary structures of 646 "pseudo-hairpin" sequences [9], i.e., sequences from human genic regions that can fold into a hairpin structure similar to a pre-miRNA but are expected to contain no miRNA precursors.
- **(D3)** The primary structures of 1,918 non-human miRNA precursors from 40 different species, taken from the datasets used by Ng and Mishra [9].
- **(D4)** The non-coding RNA set from Ensembl (Homo_sapiens.NCBI36.54.ncrna.fa).
- **(D5)** Small RNA sequences obtained from normal human leukocytes.

### Cross-validation and Hold-out Tests

Subsets of datasets D1 and D2 (200 and 400 sequences, respectively) were used as training data to construct the final classification tree; this training set was identical to that used by Ng and Mishra [9]. The remaining sequences from these two datasets, along with datasets D3 and D4, were used as test data on which predictions were made. To exclude the influence of same-family members on the test results, all human miRNAs in a test set that had a family member in the training set were removed; additionally, for dataset D3, only one member of each family was kept in the test set. Thus, from the 123 remaining human precursors, the 41 pre-miRNAs belonging to families also represented in the training set were purged, leaving a test set of 82 human pre-miRNAs and 246 pseudo-hairpins. Similarly, dataset D3 (1,918 non-human miRNA sequences) was reduced to 512 sequences after removing family similarities. Known miRNAs were removed from the non-coding RNA set D4, and the remaining 6,978 sequences were used as a test set, since many other ncRNAs also form miRNA-like secondary structures. For details of the cross-validation, see Additional file 1.
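The family-based filtering described above can be sketched as follows. This is an illustrative implementation, not the authors' code; the data structures (sequence identifiers paired with family identifiers) are assumptions.

```python
# Sketch of family-based test-set filtering (hypothetical data
# structures; the paper does not specify an implementation).

def filter_test_set(test_seqs, train_families, keep_one_per_family=False):
    """Remove test sequences whose miRNA family appears in the training set.

    test_seqs: list of (sequence_id, family_id) pairs.
    train_families: set of family_ids present in the training data.
    keep_one_per_family: if True, also keep at most one test sequence
    per remaining family (as done for dataset D3).
    """
    kept, seen = [], set()
    for seq_id, family in test_seqs:
        if family in train_families:
            continue  # a same-family member was trained on: exclude
        if keep_one_per_family and family in seen:
            continue  # keep only one representative per family
        seen.add(family)
        kept.append((seq_id, family))
    return kept
```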

### Representing miRNA precursors

Regular HMMs cannot generate the language of miRNA precursors: ignoring the loop, this language is that of palindromes, with distant interactions between nucleotides, and at least a context-free grammar is needed to represent it. However, context-sensitive HMMs (CSHMMs), recently proposed [14], are capable of representing such sequences. CSHMMs extend HMMs by introducing a memory, in the form of a stack or a queue, between certain states in the model. The original formulation has a pairwise-emission state, which puts a copy of every symbol it emits into the associated memory, and a single corresponding context-sensitive state, which reads a symbol from the memory and, based on it, decides what to emit and where to transit. To represent miRNA structures, we extend this idea slightly: the CSHMM structure we propose has two context-sensitive states linked to the same pairwise-emission state through a stack. This is because separate states are needed to generate the stem and the symmetric bulges, yet both states need information about what was emitted earlier (the stem state, so that it may emit the complementary nucleotides; the symmetric-bulge state, so that it may ensure the symmetry of the bulge). The structure of the CSHMM we propose is shown in Fig. 1, where states labeled P are pairwise-emission states, those labeled C are context-sensitive states, and those labeled S are regular HMM states.
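The role of the stack can be illustrated with a minimal generative sketch (not the authors' implementation): a pairwise-emission state pushes each nucleotide it emits, and a context-sensitive stem state later pops those symbols to emit the complementary bases, producing the distant base-pair dependencies of a hairpin.

```python
# Minimal sketch of stack-mediated hairpin generation.
# Pairwise-emission state P pushes a copy of each emitted symbol;
# context-sensitive state C pops and emits the pairing base.

COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def generate_hairpin(stem_half, loop):
    """Return stem_half + loop + reverse-complement(stem_half)."""
    stack, out = [], []
    for nt in stem_half:      # pairwise-emission state P
        out.append(nt)
        stack.append(nt)      # a copy of each emission goes on the stack
    out.extend(loop)          # regular HMM states emit the loop
    while stack:              # context-sensitive stem state C
        out.append(COMPLEMENT[stack.pop()])
    return "".join(out)
```

Because the stack is read in reverse order of writing, the second stem half automatically mirrors the first, which is exactly the palindromic dependency a regular HMM cannot capture.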

### Identifying miRNA precursors

#### Parameter estimation

A complete CSHMM consists not just of the structure, but also of probabilities for the symbols emitted and the probabilities of transition from one state to another (usually called emission and transition probabilities). Given data of known stem-loop structures, these probabilities can be estimated by keeping count of the different transition and emission events for all the states. With these counts, estimates of the emission and transition probabilities can be obtained using the following formulae [15]:

$$
P_e(\sigma \mid q) = \frac{c_e(q, \sigma)}{\sum_{\sigma' \in \Sigma} c_e(q, \sigma')}, \qquad
P_t(q' \mid q) = \frac{c_t(q, q')}{\sum_{q'' \in Q} c_t(q, q'')}
$$

Here, *P*_{e} is the probability of emitting symbol *σ* in state *q*, and *P*_{t} is the probability of transiting from state *q* to *q'*. *Q* is the set of all states in the model; Σ is the output alphabet, consisting in this case of A, C, G and U; *c*_{t} and *c*_{e} are the transition and emission counts obtained from the labeled data.
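The count-based estimation can be sketched as follows. This is a minimal illustration assuming labeled state/emission paths; the function and variable names are not from the paper, and pseudocounts/smoothing are omitted.

```python
from collections import Counter

# Maximum-likelihood estimates of emission and transition probabilities
# from labeled paths: each path is a list of (state, emitted_symbol) pairs.

def estimate_probs(labeled_paths):
    c_e, c_t = Counter(), Counter()
    for path in labeled_paths:
        for (q, sigma) in path:
            c_e[(q, sigma)] += 1                    # emission counts
        for (q, _), (q2, _) in zip(path, path[1:]):
            c_t[(q, q2)] += 1                        # transition counts
    # normalize each count by the per-state totals
    e_tot, t_tot = Counter(), Counter()
    for (q, _), n in c_e.items():
        e_tot[q] += n
    for (q, _), n in c_t.items():
        t_tot[q] += n
    P_e = {(q, s): n / e_tot[q] for (q, s), n in c_e.items()}
    P_t = {(q, q2): n / t_tot[q] for (q, q2), n in c_t.items()}
    return P_e, P_t
```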

For the two context-sensitive states, the symbol at the top of the stack also has to be taken into account. Accordingly, we modify the formulae above as follows (here *α* represents a letter from the alphabet, i.e., A, C, G or U):
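A natural form of this modification (a reconstruction, since the original formulae are not reproduced here) conditions each count and probability on the stack-top symbol *α*:

$$
P_e(\sigma \mid q, \alpha) = \frac{c_e(q, \alpha, \sigma)}{\sum_{\sigma' \in \Sigma} c_e(q, \alpha, \sigma')}, \qquad
P_t(q' \mid q, \alpha) = \frac{c_t(q, \alpha, q')}{\sum_{q'' \in Q} c_t(q, \alpha, q'')}
$$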

#### Discrimination

Given a complete CSHMM (structure and probabilities) and any input sequence, an optimal alignment algorithm for computing the most likely sequence of states under the CSHMM is known [16]. We cannot, however, use this algorithm directly to discriminate between miRNA precursors and other kinds of RNA sequences. For each sequence, the algorithm simply gives us two things: the most likely state sequence (and hence, secondary structure) and the likelihood of that state sequence. Nevertheless, if the parameters have been estimated using miRNA precursors, we can expect relatively high likelihoods for such sequences. In addition, we would also expect a much closer match between the true secondary structure of miRNA sequences and the structure predicted by the alignment algorithm.

In this paper, we investigate a very simple discriminatory function that uses the results from the alignment algorithm. For our model, discrimination is a function only of the likelihood returned by the alignment algorithm. The form of the discriminatory function is thus just a single-node classification tree [17], which corresponds to a threshold on the likelihood score. The value of this threshold is estimated from sequences of miRNA precursors and non-precursors. Each sequence is provided to the alignment algorithm, which uses the CSHMM from Stage 1 to return a likelihood value. A classification tree is then constructed to discriminate between the two sets of sequences, using just one feature: the likelihood value.
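Selecting a single likelihood threshold can be sketched as follows. This is an illustrative stand-in for the classification-tree construction, not the authors' code: it simply picks the cutoff that maximizes training accuracy on the two sets of likelihood scores.

```python
# Sketch of a one-node discriminator on a single feature (the
# likelihood score): sequences scoring >= threshold are called
# miRNA precursors.

def best_threshold(pos_scores, neg_scores):
    """Return the cutoff maximizing training accuracy."""
    candidates = sorted(set(pos_scores) | set(neg_scores))
    n = len(pos_scores) + len(neg_scores)
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum(s >= t for s in pos_scores) + sum(s < t for s in neg_scores)
        acc = correct / n
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def classify(score, threshold):
    """True if the sequence is predicted to be a miRNA precursor."""
    return score >= threshold
```

A full classification-tree learner would choose the split by an impurity criterion such as the Gini index, but with one feature and one node the effect is the same: a single cutoff on the likelihood score.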