The function of RNA molecules is known to depend on their three-dimensional structure, which is stabilized by a secondary structure scaffold of base-pairing. The secondary structure is defined by hydrogen bonds between nucleotides, which form across the structure for thermodynamic stability and molecular function. Despite its importance, the accurate prediction of RNA secondary structure remains an unsolved challenge in computational biology.
With the advent of next-generation sequencing technologies and new methods in transcriptomics, a rapidly growing amount of biological RNA data is available in public databases such as Rfam [1] and RNA STRAND [2]. This makes it possible to acquire a large number of RNA alignments to be used in comparative RNA secondary structure predictions. This is especially significant in the case of long RNAs such as RNA viral genomes and long genomic introns, many of which are known to have functional, conserved secondary structures.
Several methods have been established to predict RNA secondary structures from nucleotide sequences. In this paper, we focus entirely on non-pseudoknotted secondary structure prediction. Thermodynamic optimisation based on minimising free-energy functions has been used to great effect in algorithms such as mfold [3], UNAFold [4] and RNAfold [5]. In a different approach, stochastic context-free grammars (SCFGs) have also been successfully used to model RNA secondary structure. The Pfold program [6, 7], for example, combines molecular evolution with a lightweight SCFG model (known as a phylogrammar) to predict the consensus secondary structure of RNA alignments, and has been shown to be highly accurate for structural alignments [8]. PPfold is a recent multithreaded reimplementation of Pfold [9].
Common to these methods is that they produce a probability distribution over all possible nested secondary structures for the input sequences, but usually only a single, optimal secondary structure is reported to the user. A particularly interesting question is how the underlying distribution changes as a function of input data. Due to the large space of possible secondary structures, however, it is difficult to report useful quantities to describe this. Information entropy is one such measure.
Entropy computations in the context of RNA secondary structure prediction have been considered previously from a thermodynamic perspective, calculating the thermodynamic entropy over both secondary structure space [10, 11] and tertiary structure space [12]. Positional thermodynamic entropy [5] as well as thermodynamic entropy changes in response to base-pair mutations [13] have also been computed. Additionally, SCFG-based methods have recently been utilised for calculating both the information entropy of individual base-pairs in a single-sequence context [14], and the thermodynamic entropy changes in response to base-pair mutations. However, no form of entropy has been computed previously in the case of comparative RNA secondary structure prediction.
The information entropy is a measure of the "spread" of the probability distribution, and has well-defined lower and upper bounds. The minimum entropy of 0 occurs when there is only one outcome with probability 1. For n possible outcomes, the maximum entropy is equal to log_{2}(n) and occurs for the uniform distribution. For a probability distribution, an entropy of k bits indicates that the expected value of the information content of observing a single outcome is k bits. In the context of secondary structure prediction, a low entropy therefore indicates that few secondary structures dominate the probability space, whereas a high entropy indicates a more even probability distribution over possible secondary structures. Thus, information entropy is a useful single quantity to characterize the underlying probability distribution of secondary structures.
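As a small numerical illustration of these bounds, the following sketch computes the entropy in bits of a discrete distribution. The function name `shannon_entropy` and the three example distributions are illustrative assumptions and do not come from the method described in this paper.

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution (probabilities summing to 1)."""
    # Outcomes with zero probability contribute nothing (0 * log 0 is taken as 0).
    return sum(-p * math.log2(p) for p in probs if p > 0)

certain = [1.0, 0.0, 0.0, 0.0]   # one outcome with probability 1
uniform = [0.25] * 4             # n = 4 equally likely outcomes
skewed = [0.7, 0.1, 0.1, 0.1]    # one dominant outcome

print(shannon_entropy(certain))  # lower bound: 0 bits
print(shannon_entropy(uniform))  # upper bound: log2(4) bits
print(shannon_entropy(skewed))   # strictly between the two bounds
```

A distribution dominated by one outcome, such as `skewed`, has an entropy well below the uniform bound, which mirrors the case where few secondary structures dominate the probability space.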
In the case of RNA secondary structure prediction based on a semantically unambiguous SCFG, the information entropy of the probability distribution over RNA secondary structures can be computed as the derivational entropy of the SCFG that generates the distribution. We restrict ourselves to semantically unambiguous SCFGs, in order to maintain a one-to-one correspondence between SCFG derivations and secondary structures. Thus, throughout this paper we use "information entropy" and "derivational entropy" interchangeably.
Notation
Consider RNA alignments of k sequences (k ≥ 1), with the i'th column denoted c_{i} ∈ Σ^{k} = {"A", "C", "G", "U", "-"}^{k} \ {"-"}^{k}. A stochastic context-free phylogrammar (phylo-SCFG) on such alignments is a tuple G = ((Σ^{k}, N, S, R), P), where:

Σ^{k} forms the (finite) set of terminal symbols

N is a finite set of nonterminal symbols, such that N ∩ Σ^{k} = ∅

S is the start symbol, S ∈ N

R is a finite set of production rules, each rule of the form A → α, A ∈ N and α ∈ (Σ^{k} ∪ N)*

P is a function from R to real numbers in the interval [0,1]
In the case of a phylogrammar, P can be interpreted as Bayesian probabilities equal to the product of prior probabilities that only depend on the type of rule being used, and a likelihood factor that is typically derived from a phylogenetic model and is a function of the alignment columns. We will return to this more formally later. Furthermore, we assume the grammar is proper, that is, ∀A ∈ N: ∑_{π = (A → α)} P(π) = 1.
Let d be a complete (leftmost) derivation of the grammar. Informally, a complete derivation is a sequence of production rules such that, starting from the start symbol and sequentially replacing each nonterminal using a production rule for that nonterminal, a string of terminal symbols is obtained. The probability of d is the product of the probabilities of all rules occurring in d:

p(d) = ∏_{(A → α) ∈ R} P(A → α)^{f_{d}(A → α)}

where f_{d}(A → α) is the number of times rule A → α occurs in derivation d.
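The product over rule counts can be sketched in a few lines. The toy grammar below (rules S → SS and S → a with made-up probabilities), the dictionary `RULE_PROB` and both function names are hypothetical and serve only to illustrate the definition. The sketch also checks that the rule probabilities of each nonterminal sum to 1, i.e. that the toy grammar is proper.

```python
import math
from collections import Counter

# Hypothetical rule probabilities P(A -> alpha) for a toy grammar.
RULE_PROB = {
    ("S", ("S", "S")): 0.3,
    ("S", ("a",)): 0.7,
}

def is_proper(rule_prob, tol=1e-9):
    """Check that the rule probabilities of every left-hand side sum to 1."""
    totals = Counter()
    for (lhs, _), p in rule_prob.items():
        totals[lhs] += p
    return all(abs(t - 1.0) <= tol for t in totals.values())

def derivation_probability(derivation, rule_prob):
    """p(d) as the product of P(rule) ** f_d(rule) over the rules used in d."""
    counts = Counter(derivation)  # f_d(A -> alpha) for each rule used in d
    return math.prod(rule_prob[rule] ** n for rule, n in counts.items())

# Leftmost derivation S => SS => aS => aa
d = [("S", ("S", "S")), ("S", ("a",)), ("S", ("a",))]
print(is_proper(RULE_PROB))
print(derivation_probability(d, RULE_PROB))
```

Here S → SS is used once and S → a twice, so the result equals 0.3 · 0.7^{2}.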
The grammar is consistent if ∑_{d} p(d) = 1, where the sum is over all possible derivations of the grammar. In the case of phylogrammars, consistency implies that the total probability of the grammar emitting all alignments of k sequences (of all lengths) is 1.
The expected frequency (count) of a rule A → α in all derivations of the grammar is

E[f(A → α)] = ∑_{d} p(d) f_{d}(A → α)

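The expected frequency of each rule can be computed in practice using a dynamic programming algorithm known as the inside-outside algorithm, as described in [15]. Following the approach of [16], we factorise a complete derivation d at each occurrence of rule A → α into an "innermost" subderivation, which rewrites α into a terminal string s ∈ (Σ^{k})*, and two "outermost" subderivations, which generate the context surrounding A and involve sentential forms β, γ ∈ (Σ^{k} ∪ N)* and terminal strings t, u ∈ (Σ^{k})*. Then the expected frequency of rule A → α is the product

O(A) · P(A → α) · I(α)

where the inside probability I(α) is the total probability of all innermost subderivations of α, and the outside probability O(A) is the total probability of all pairs of outermost subderivations around A.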
The I(α) and O(A) variables can also be computed for a particular string C_{l} of length l, using the inside-outside algorithm in O(l^{3}) time.
The derivational entropy of an SCFG is the information entropy of the probability distribution of all derivations under the SCFG (cf. equation 1):

H(G) = −∑_{d} p(d) log_{2} p(d)

where the sum is over all possible derivations of the grammar. This quantity can be computed efficiently using expected rule frequencies [16]:

H(G) = −∑_{(A → α) ∈ R} E[f(A → α)] log_{2} P(A → α)

where E[f(A → α)] is the expected frequency of rule A → α.
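To make the rule-frequency formulation concrete, the sketch below computes the derivational entropy of a small grammar over all of its derivations. It is a minimal illustration under stated assumptions: the toy grammar and all names (`RULES`, `START`, `expected_rule_frequencies`, `derivational_entropy`) are hypothetical, and the expected frequencies are obtained by a simple fixed-point iteration over expected nonterminal counts rather than by the inside-outside algorithm. The iteration converges only for consistent grammars such as this one.

```python
import math

# Hypothetical proper, consistent toy SCFG: (lhs, rhs) -> probability.
RULES = {
    ("S", ("S", "S")): 0.3,
    ("S", ("a",)): 0.7,
}
START = "S"

def expected_rule_frequencies(rules, start, iterations=200):
    """Expected number of uses of each rule over all derivations."""
    nonterminals = {lhs for lhs, _ in rules}
    # usage[A] = expected number of times nonterminal A is rewritten.
    usage = {a: 0.0 for a in nonterminals}
    for _ in range(iterations):
        new = {a: (1.0 if a == start else 0.0) for a in nonterminals}
        for (lhs, rhs), p in rules.items():
            for symbol in rhs:
                if symbol in nonterminals:
                    # Each use of the rule creates one more occurrence of `symbol`.
                    new[symbol] += usage[lhs] * p
        usage = new
    return {rule: usage[rule[0]] * p for rule, p in rules.items()}

def derivational_entropy(rules, start):
    """Entropy in bits: minus the sum of E[f(rule)] * log2 P(rule)."""
    freq = expected_rule_frequencies(rules, start)
    return -sum(freq[rule] * math.log2(p) for rule, p in rules.items() if p > 0)

print(derivational_entropy(RULES, START))
```

For this grammar the expected number of rewrites of S satisfies n = 1 + 2 · 0.3 · n, so n = 2.5, and the entropy is 2.5 times the entropy of the rule choice at S.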
We assume that the phylo-SCFG describes RNA secondary structure, and while it may be syntactically ambiguous, it is semantically unambiguous. In practice, this means that there may be a number of possible derivations for a particular alignment, but there is a one-to-one correspondence between derivations and consensus secondary structures for the RNA alignment. Furthermore, we express RNA structure SCFGs in double emission normal form [17], allowing only rules of the following types:

Type 1: A → c
Type 2: A → c B c'
Type 3: A → B C

for A, B, C ∈ N and c, c' ∈ Σ^{k}. Apart from generating empty strings, all SCFGs modelling nested RNA secondary structures can be written in double emission normal form, so the methods presented here can be adapted to RNA secondary structure grammars of all types. Additionally, they can also be adapted for specific mildly context-sensitive RNA grammars that generate specific types of pseudoknots.
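The three rule types lead directly to a cubic-time inside recursion. The sketch below computes the inside probability of a single sequence (k = 1) under a toy grammar in double emission normal form. The grammar, its probabilities and the names `TYPE1`, `TYPE2`, `TYPE3` and `inside_probability` are hypothetical. The toy grammar is not semantically unambiguous and is meant only to show the recursion, not to model RNA structure.

```python
from collections import defaultdict

# Hypothetical single-sequence grammar in double emission normal form.
TYPE1 = {("S", "a"): 0.25, ("S", "u"): 0.25}   # A -> c
TYPE2 = {("S", "a", "S", "u"): 0.2}            # A -> c B c'
TYPE3 = {("S", "S", "S"): 0.3}                 # A -> B C

def inside_probability(seq, start="S"):
    """Total probability that `start` derives `seq`, summed over derivations."""
    n = len(seq)
    # inside[(A, i, j)] = probability that A derives seq[i..j] (inclusive).
    inside = defaultdict(float)
    for span in range(1, n + 1):          # shorter spans are filled first
        for i in range(n - span + 1):
            j = i + span - 1
            if span == 1:                 # Type 1: emit one unpaired symbol
                for (a, c), p in TYPE1.items():
                    if c == seq[i]:
                        inside[(a, i, j)] += p
            if span >= 3:                 # Type 2: emit a pair around B
                for (a, c, b, c2), p in TYPE2.items():
                    if c == seq[i] and c2 == seq[j]:
                        inside[(a, i, j)] += p * inside[(b, i + 1, j - 1)]
            if span >= 2:                 # Type 3: bifurcate at split point k
                for (a, b, c), p in TYPE3.items():
                    for k in range(i, j):
                        inside[(a, i, j)] += (
                            p * inside[(b, i, k)] * inside[(c, k + 1, j)]
                        )
    return inside[(start, 0, n - 1)]

print(inside_probability("aau"))
```

The two nested loops over i and j combined with the split point k give the O(l^{3}) dependence on sequence length, up to a factor for the number of rules.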
Type 1 rules correspond to the production of a single column of the alignment, and their probability can be expressed as

P(A → c) = P_{G}(A → c) · P_{T}(c | unpaired)

where P_{G}(A → c) only depends on A, and P_{T}(c | unpaired) is the likelihood of observing column c under the phylogenetic model, assuming that it is unpaired in the consensus structure (denoted by the condition "unpaired").
Type 2 rules correspond to the production of two base-paired columns of the alignment, and their probability can be expressed as

P(A → cBc') = P_{G}(A → cBc') · P_{T}(c, c' | paired)

where P_{G}(A → cBc') only depends on A and B, and P_{T}(c, c' | paired) is the likelihood of observing column pair c, c' under the phylogenetic model, assuming that they are paired with each other in the consensus structure (denoted by the condition "paired").
Type 3 rules express bifurcation and correspond to dividing the alignment into two parts. As these rules do not depend on alignment columns, we have:

P(A → BC) = P_{G}(A → BC)

where P_{G}(A → BC) only depends on A, B and C.
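It is now clear that the probability of any particular derivation under a phylo-SCFG of this structure can be expressed as a product of two probabilities: a probability p_{G} that only depends on the types of rules used, and a probability p_{T} that only depends on the emitted alignment columns: p(d) = p_{G}(d) p_{T}(d), with

p_{G}(d) = ∏_{r_{a}} P_{G}(r_{a})^{f_{d}(r_{a})} · ∏_{r_{b}} P_{G}(r_{b})^{f_{d}(r_{b})} · ∏_{r_{c}} P_{G}(r_{c})^{f_{d}(r_{c})}

p_{T}(d) = ∏_{r_{a} = (A → c)} P_{T}(c | unpaired)^{f_{d}(r_{a})} · ∏_{r_{b} = (A → cBc')} P_{T}(c, c' | paired)^{f_{d}(r_{b})}

for r_{a} ∈ R of Type 1, r_{b} ∈ R of Type 2, r_{c} ∈ R of Type 3, where P_{T}(c | unpaired) and P_{T}(c, c' | paired) are the phylogenetic likelihoods of an unpaired column and of a paired column pair, respectively.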
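The factorisation can be checked numerically on a single derivation. In the sketch below, each rule use is represented by a pair of made-up numbers: the grammar-dependent probability of the rule and the phylogenetic likelihood of the columns it emits. The list `derivation` and all values in it are hypothetical. Bifurcation rules emit no columns, so their likelihood factor is set to 1.

```python
import math

# Hypothetical rule uses in one derivation:
# (grammar probability of the rule, phylogenetic likelihood of its columns).
derivation = [
    (0.3, 1.0),    # Type 3: bifurcation, emits no columns
    (0.2, 0.05),   # Type 2: a pair of base-paired columns
    (0.25, 0.4),   # Type 1: one unpaired column
    (0.25, 0.1),   # Type 1: one unpaired column
]

p_G = math.prod(g for g, _ in derivation)      # depends on rule types only
p_T = math.prod(t for _, t in derivation)      # depends on columns only
p_d = math.prod(g * t for g, t in derivation)  # product of full rule probabilities

print(p_d, p_G * p_T)
```

Because multiplication is commutative, grouping the grammar factors and the likelihood factors separately gives the same derivation probability, up to floating-point rounding.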
Given a single RNA alignment, there are typically a large number of possible derivations, each corresponding to a possible secondary structure for the alignment. In the rest of this work, we restrict ourselves to this space of derivations, which we characterize by its derivational entropy as described below.