Noncoding RNA gene detection using comparative sequence analysis

Background Noncoding RNA genes produce transcripts that exert their function without ever producing proteins. Noncoding RNA gene sequences do not have strong statistical signals, unlike protein coding genes. A reliable general purpose computational genefinder for noncoding RNA genes has been elusive. Results We describe a comparative sequence analysis algorithm for detecting novel structural RNA genes. The key idea is to test the pattern of substitutions observed in a pairwise alignment of two homologous sequences. A conserved coding region tends to show a pattern of synonymous substitutions, whereas a conserved structural RNA tends to show a pattern of compensatory mutations consistent with some base-paired secondary structure. We formalize this intuition using three probabilistic "pair-grammars": a pair stochastic context free grammar modeling alignments constrained by structural RNA evolution, a pair hidden Markov model modeling alignments constrained by coding sequence evolution, and a pair hidden Markov model modeling a null hypothesis of position-independent evolution. Given an input pairwise sequence alignment (e.g. from a BLASTN comparison of two related genomes) we classify the alignment into the coding, RNA, or null class according to the posterior probability of each class. Conclusions We have implemented this approach as a program, QRNA, which we consider to be a prototype structural noncoding RNA genefinder. Tests suggest that this approach detects noncoding RNA genes with a fair degree of reliability.


The IID Model algorithm
The IID model is used to report the probability scores of the OTH, COD, and RNA models in a log-odds version. The IID model allows us to emit two sequences X, Y of varying length independently from each other. The IID model is the aggregation of two single HMMs as described in Figure 4. The probability of sequences X and Y being emitted by the IID model in Figure 4 is where the single nucleotide emission probabilities are the ones defined in (??) and (??).
Because the sequences are emitted independently, P (X, Y | IID) can be factorized.
Introduce the following quantities, Use the initialization conditions, Perform a similar construction for F Y α , with α = η L , η R , η J . In this notation, we have for the IID model, The IID is identical in design to the flanking models (F L , F R , and F J ) that we use to generate local alignments in the OTH, COD, and RNA models.

OTH Model Forward algorithm
Using the vectorial notation introduced previously in Section 2.3.3 of the paper, the pair-HMM OTH model Forward algorithm can be cast into the form, with  ≡ ı − (1, 1); with  ≡ ı − (0, 1); The initialization conditions are in addition to To obtain the Viterbi algorithm, one just has to replace all sums in (7) with a maximization operation.
The probability that sequences X and Y are related according to the OTH model by any local alignment XY is for a sequence X of length n and a sequence Y of length m.
The log-odds ratios of the OTH model respect to the IID model (1) are given by This general algorithm has a cost O(nm) in storage and a cost O(n 2 m 2 ) in time, for two sequences of length n and m.
The implemented version of the algorithm usually holds an input alignment fixed, instead of allowing the model to optimally align the input sequences. This simplified version can be obtained from the previous description simply by eliminating the vectorial character of the recursions in (7), by imposing the constraints i = i and j = j . In A further simplification of the OTH model that renders its cost linear both in time and memory requires trivializing the pass by the F J meta-state, In this way, no unaligned nucleotide is emitted outside the two flanking ends of the alignment.

Generalization of the OTH model Forward algorithm
In the previous section we have introduced the Forward algorithm for the OTH model assuming that we score both sequences from beginning to end. We need to generalize the algorithm so we can score sequences from a given start position (i 0 , i 0 ) to an end We will need this generalization in the algorithms of the COD and RNA models.
The generalization of the forward algorithm in (7) when we score between position The F l , F R , and F J functions are the ones defined in (6). Formally the recursions are identical to the ones in (7). However the initialization conditions are a little different in addition to The probability that sub-sequences X : (i 0 , j 0 ) and Y : (i 0 , j 0 ) are related according to the OTH model by any local alignment is for a sequence X of length n and a sequence Y of length m.
To obtain the Viterbi algorithm, one just has to replace all sums in (13) with a maximization operation.
For a fixed start (i 0 , i 0 ) and stop positions (j 0 , j 0 ), the complexity of this generalized algorithm is the same as that described before for the OTH model, simply substituting n → j 0 − i 0 + 1 and m → j 0 − i 0 + 1.

COD Model Forward algorithm
Let us define the following quantities: Where the probabilities are calculated as in equation (16)  The Forward algorithm for this COD pair-HMM is, To obtain the Viterbi algorithm, one just has to substitute all sums in (17) and (18) with a maximization operation.
We use hexamer-dependent statistics for the COD model. An approximation analogous to the one introduced for the emission of two base pairs in the RNA model (equation (6) of paper)) can be used to incorporate dicodon probabilities for the COD model.
If we assume that codons A and B (in sequences X and Y respectively) are aligned, and preceded by the aligned codons A, B, we can write where P hex X ( AA) and P hex Y ( BB) are the 64 × 64 dicodon frequencies for the relevant two species under comparison, which we assume are time independent.
The probability that sequences X and Y are related according to the COD model by any local alignment is for a sequence X of length n and a sequence Y of length m. The log-odds ratios of the COD model respect to the IID model given by (1) are

RNA Model Forward/Inside algorithm
Let us define the following quantities, where the probabilities are calculated as in equation (16)  The algorithm for the RNA model is, To obtain the "best-alignment" algorithm, one just has to substitute all sums in (22) and (23) by a maximization operation.
The probability that sequences X and Y are related according to the RNA model by any local alignment is for a sequence X of length n and a sequence Y of length m. The log-odds ratios of the RNA-alignment model respect to the IID model given by (1) are The RNA model is an SCFG. The SCFG-like part of the algorithm hides in the calculation of probabilities W ( , d). That is, the probability of having a RNA structure between positions ı =  − d and . The recursions involved in the calculation of the Inside algorithm for W ( , d) are, and similarly for primed d's. Here m and n are the respective lengths of the two sequences to be aligned. The vector e can take three different values: (1, 1), (1, 0) and (0, 1) that correspond to moving one position in at least one of the two sequences. The star product defined as e * s ≡ (e x s x , e y s y ) produces a "gap"-a zero instead of a nucleotide-when there is no movement for one of the sequences.
The symbol V again denotes the state we are in after emitting one pair in each sequence.
The recursion for state V is, and similarly for primed d's.
The recursions for the states that take care of length distributions for hairpin loops (IS1) and stems, bulges and internal loops (IS2) are, Here the functions G ISx (for x = 1, 2) are given by the general expression, The number of additive terms in this function reflects the number of possible alignments, and it verifies the condition k s=1 e s = d + 1. The initialization conditions are s x (−1,j) ="gap", and s y (j,−1) ="gap".
For the transition probabilities t trans state , we have trans∈state t trans state = 1, ∀ state.
For the particular grammar at hand, the previous conditions translate to These transition probabilities are calculated using a training set of RNA motifs that includes rRNAs and tRNAs [26,27]. P IS1 ( s ı ) and P IS2 ( s ı ) are the mutation probabilities in hairpin loops and internal loops respectively [which we set to be equal to the OTH model probabilities defined in equaiton (4) of the paper].
Notice that the RNA model can be aligned by either Forward or Viterbi with respect to the HMM part of the model; however, we always use the Inside algorithm to evaluate the SCFG part of the pair grammar. This is necessary to assure that the RNA score is comparable to the COD and OTH scores (by removing the conditioning on one particular structure out of the combinatorially enormous number of possible structures), and also to avoid undesirable effects associated with choosing a best path when we have a grammar with ambiguity (R. Giegerich, personal communication). This is the algorithm used when we allow the model to generate its own alignments.
This general algorithm has a cost O(n 2 m 2 ) in storage and a cost O(n 3 m 3 ) in time, for two sequences of length n and m. For the scoring system in which a predetermined alignment is scored locally, since we are not adding additional gaps in the pairwise alignment, the algorithm loses its vectorial character (j = j , d = d ), the vector e takes always the value (1, 1), and the memory and time complexity of the algorithm reduces to O(n 2 ) storage and O(n 3 ) time.