RNA structures and their representations
RNA structures and pseudoknots
RNA molecules fold into complex three-dimensional structures in which nitrogenous bases are connected via hydrogen bonds. The secondary structure of an ncRNA is defined by its interacting base pairs. Some RNA molecules fold into pseudoknot structures by pairing bases in loop regions with bases outside the stem-loop (Figure 1a).
In this work, two types of ncRNA secondary structure representations are used. The first is the arc-based representation, where nucleotides and hydrogen bonds are represented by vertices and arcs, respectively (Figure 1b). For pseudoknot-free secondary structures, all arcs are either nested or in parallel; crossing arcs indicate pseudoknots. The second is based on dot-bracket notation, where '.' represents unpaired bases and matching parentheses '(' and ')' indicate base-pairing nucleotides. Following the annotation of Rfam [3], we use an extended dot-bracket notation to represent pseudoknot structures: the base-pairing nucleotides forming pseudoknots are represented by upper-lower case character pairs, such as A..a or B..b, as shown in Figure 1c.
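The two representations are equivalent, and the crossing-arc criterion gives a direct test for pseudoknots. As a minimal sketch (the function names are ours, not from any published tool), the extended dot-bracket notation can be parsed into arcs and checked for crossings:

```python
def parse_pairs(db):
    """Parse extended dot-bracket notation into a list of (i, j) base pairs.

    '(' / ')' form nested pairs; matching upper/lower-case letters
    (e.g. 'A'..'a') mark pseudoknot pairs, following the Rfam convention.
    """
    stacks = {}   # one stack per bracket/letter family
    pairs = []
    for i, c in enumerate(db):
        if c == '(' or c.isupper():
            stacks.setdefault(c, []).append(i)
        elif c == ')':
            pairs.append((stacks['('].pop(), i))
        elif c.islower():
            pairs.append((stacks[c.upper()].pop(), i))
    return sorted(pairs)

def has_pseudoknot(pairs):
    """A structure is pseudoknotted iff two arcs (i,j), (k,l) cross: i < k < j < l."""
    return any(i < k < j < l for i, j in pairs for k, l in pairs)

# nested stem-loop: no crossing arcs
print(has_pseudoknot(parse_pairs("((..((...))..))")))   # False
# loop bases paired outside the stem (A..a): crossing arcs
print(has_pseudoknot(parse_pairs("((..AA..))..aa")))    # True
```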
Abstract shapes
Abstract shapes were formally introduced by Giegerich et al. [9]. The folding space of a given RNA sequence is partitioned into different classes of structures, by means of abstracting from structural details. These classes are called abstract shapes, or shapes for short.
An RNA secondary structure can be mapped to an abstract shape with different levels of abstraction [12]. In the abstract shape, details about the lengths of the loop and stacking regions are removed (see Figure 1 for examples of stacking and loop regions). Stacking regions are represented by pairs of brackets and unpaired regions are represented by underscores.
Pseudoknots are represented by pairs of upper-lower case characters. Figure 2 presents examples of the level 1, 3, and 5 abstract shapes of a pseudoknot-free structure and a pseudoknot. Level 5 is the strongest abstraction and ignores all bulges, internal loops, and single-stranded regions. Level 3 adds the helix interruptions caused by bulges or internal loops. Level 1 abstracts only from loop and stack lengths while retaining all single-stranded regions.
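For pseudoknot-free structures, the level-5 abstraction can be sketched by parsing the dot-bracket string into a tree of base pairs and collapsing every maximal helix (including helices interrupted by bulges or internal loops) into a single bracket pair. This is a simplified illustration of the definition above, not the reference implementation:

```python
def shape_level5(db):
    """Map a pseudoknot-free dot-bracket string to its level-5 abstract shape:
    single-stranded regions, bulges and internal loops are dropped, and every
    maximal helix becomes one '[]' pair."""
    # Parse into a forest: each node is a base pair holding its nested children.
    root = []
    stack = [root]
    for c in db:
        if c == '(':
            node = []
            stack[-1].append(node)
            stack.append(node)
        elif c == ')':
            stack.pop()
        # '.' (unpaired base) is ignored at level 5

    def render(node):
        # A chain of pairs enclosing exactly one pair (stacked pairs, bulges,
        # internal loops) collapses into a single helix.
        while len(node) == 1:
            node = node[0]
        return '[' + ''.join(render(ch) for ch in node) + ']'

    return ''.join(render(ch) for ch in root)

print(shape_level5("((..((...))..))"))    # internal loop collapses -> "[]"
print(shape_level5("((...))..((...))"))   # two helices -> "[][]"
```

Multiloops are preserved: `shape_level5("(()())")` yields `"[[][]]"`, matching the rule that only helix interruptions and unpaired regions are abstracted away at level 5.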
Shape prediction
In this section we describe KnotShape, a comparative shape prediction tool for homologous ncRNA sequences that allows pseudoknots. The task of shape prediction is to select the best representative shape for the given homologous sequences. To identify the best shape, various features, such as shape probability [22], the sum of the energies of all structures in a shape [12], and the rank sum score [12], are evaluated to rank shapes. It has not been systematically assessed whether combinations of multiple features can lead to better shape prediction. In this work, we incorporate a Support Vector Machine (SVM) [27] and feature selection techniques to determine important features for shape identification. In addition, we adopt a machine learning-based scoring function to evaluate the quality of candidate shapes.
The method contains two components: consensus shape prediction (KnotShape) and structure prediction using the predicted shape as input (KnotStructure). We first describe KnotShape, focusing on the feature construction and selection strategy, and then describe how the consensus structure is derived from the consensus shape.
Notation
Figure 3 illustrates the mapping between sequences, structures, and shapes. The input is a set of homologous ncRNAs and the output is the predicted consensus shape. Notations used in this paper correspond to this mapping.

The N homologous ncRNAs constitute the input sequence space. X_{i} represents the i-th sequence.

Each sequence can be folded into different secondary structures. Let S^{i} represent the set of folded structures of the i-th sequence X_{i}. The set of structures predicted from all N input sequences is the union of the S^{i}: S = S^{1} ∪ S^{2} ∪ . . . ∪ S^{N}.

{S}_{j}^{i} is the j-th structure in the folding space of X_{i}. Its free energy is denoted by \mathrm{\Delta}G\left({S}_{j}^{i}\right). For a sequence X_{i}, the minimum free energy MFE(X_{i}) is the lowest free energy among the energies of all predicted structures of X_{i}, i.e. \text{MFE}({X}_{i})={\mathrm{min}}_{1\le j\le \left|{S}^{i}\right|}\mathrm{\Delta}G({S}_{j}^{i}).

All structures in S can be classified into a set of abstract shapes. For a shape P, we record its associated sequences and structures. P.LX denotes the set of associated sequences, each of which can fold into a structure with shape P. P.LS denotes all structures with shape P.

\widehat{P} is the predicted shape of the given homologous sequences X_{1}, X_{2}, .., X_{N}.
In order to explore the large folding space of multiple homologous sequences, we use a de novo folding tool to output the optimal and suboptimal structures within a given energy cutoff. This heuristic does not explore the complete folding space, but given the observation that the correct structure is usually close to the optimal structure, it works well in practice [28].
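The bookkeeping behind this mapping (each shape P recording its P.LX and P.LS) can be sketched as follows. The folding output and the `shape_of` function below are toy stand-ins for a real suboptimal folding tool and the abstract-shape mapping:

```python
from collections import defaultdict

# Hypothetical folding output: for each sequence X_i, (structure, free energy)
# pairs from a suboptimal folding tool, replaced here by hand-made toy data.
folding_space = {
    "X1": [("((...))", -8.1), ("((....))", -6.0)],
    "X2": [("((..))", -7.3), (".((...))", -5.5)],
}
shape_of = lambda s: "[]"   # toy: every structure maps to shape "[]"

LX = defaultdict(set)    # P.LX: sequences with at least one structure of shape P
LS = defaultdict(list)   # P.LS: all structures with shape P

for seq_id, structs in folding_space.items():
    for struct, energy in structs:
        shape = shape_of(struct)
        LX[shape].add(seq_id)
        LS[shape].append((seq_id, struct, energy))

print(len(LX["[]"]), len(LS["[]"]))   # 2 4
```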
Feature construction and selection
Intuitively, the correct shape tends to possess the following properties. It should have high shape probability, meaning that a large number of structures can be classified into this shape. When multiple homologous sequences are given as input, the correct shape should be well represented by all or a majority of the input sequences. The ranking of the structure with the correct shape in the folding space of each sequence should also be high. In addition, some structures with the correct shape should have low thermodynamic energies. For the energy-related properties, various measurements can be used; for example, instead of the sum of the energies of all structures within a shape, one can use the smallest energy. Furthermore, more complicated properties, such as the sequence similarity of all sequences associated with a shape P and the structural similarity of the structures associated with P, might also contribute to shape prediction. These similarities can be quantified using different methods, such as k-mer profiles, multiple sequence alignment scores, and the variation of base pairs.
It is not trivial to decide whether a single property is enough to choose the correct shape, and if not, which combination of these properties leads to the best shape prediction performance. In order to systematically choose a set of features (i.e. properties) for shape prediction, we use the F-score [29] to measure how well a feature discriminates between the class labels. Given the feature vectors x_{k}, k = 1, .., m, the F-score of the i-th feature is defined as:
F\left(i\right)\equiv \frac{{\left({\overline{x}}_{i}^{\left(+\right)}-{\overline{x}}_{i}\right)}^{2}+{\left({\overline{x}}_{i}^{\left(-\right)}-{\overline{x}}_{i}\right)}^{2}}{\frac{1}{{n}_{+}-1}{\sum}_{k=1}^{{n}_{+}}{\left({x}_{k,i}^{\left(+\right)}-{\overline{x}}_{i}^{\left(+\right)}\right)}^{2}+\frac{1}{{n}_{-}-1}{\sum}_{k=1}^{{n}_{-}}{\left({x}_{k,i}^{\left(-\right)}-{\overline{x}}_{i}^{\left(-\right)}\right)}^{2}}
where n_{+} and n_{-} are the numbers of positive and negative instances, respectively. {\overline{x}}_{i}, {\overline{x}}_{i}^{\left(+\right)}, and {\overline{x}}_{i}^{\left(-\right)} are the average values of the i-th feature over the whole, positively labeled, and negatively labeled data. {x}_{k,i}^{\left(+\right)} and {x}_{k,i}^{\left(-\right)} are the values of the i-th feature of the k-th positive and negative instances, respectively.
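The definition translates directly into code. A minimal sketch for a single feature, given its values on the positive and negative instances:

```python
def f_score(pos, neg):
    """F-score of one feature, given its values on the positive and
    negative instances, following the definition above."""
    n_pos, n_neg = len(pos), len(neg)
    mean = sum(pos + neg) / (n_pos + n_neg)          # overall mean
    mean_pos = sum(pos) / n_pos                      # mean on positives
    mean_neg = sum(neg) / n_neg                      # mean on negatives
    numer = (mean_pos - mean) ** 2 + (mean_neg - mean) ** 2
    denom = (sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
             + sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1))
    return numer / denom

# A feature that separates the classes well scores higher than one that doesn't.
print(f_score([5.0, 5.5, 6.0], [1.0, 1.5, 2.0])
      > f_score([5.0, 1.5, 6.0], [1.0, 5.5, 2.0]))   # True
```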
The F-score reflects the discriminative power of a feature: the higher the F-score, the more discriminative the feature. A known disadvantage of the F-score is that it does not capture mutual information between features, as it considers each feature separately. However, the F-score is simple and quite effective in practice.
Feature selection searches for an optimal subset of features [30]. Different feature selection methods exist; in this work, we adopt sequential forward search (SFS) [31] because of its simplicity and effectiveness. Starting with an empty set, we iteratively select one feature at a time and add it to the current feature set. Candidate features are considered in descending order of the discriminative power determined by their F-scores, and the feature added at each step is the one that yields the highest accuracy.
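The greedy loop can be sketched as follows. The `accuracy_of` callback stands in for training and cross-validating the classifier on a candidate feature set; the toy function and feature names below are ours, for illustration only:

```python
def sfs(features, accuracy_of, max_k=None):
    """Sequential forward search: greedily grow a feature set, at each step
    adding the candidate that maximizes accuracy; stop when no candidate
    improves it.  `features` are candidate names (e.g. ordered by F-score);
    `accuracy_of(feature_set)` evaluates a classifier on that set."""
    selected, best_acc = [], float('-inf')
    remaining = list(features)
    while remaining:
        # try each remaining feature on top of the current set
        acc, f = max((accuracy_of(selected + [f]), f) for f in remaining)
        if acc <= best_acc:       # no candidate improves accuracy: stop
            break
        selected.append(f)
        remaining.remove(f)
        best_acc = acc
        if max_k and len(selected) == max_k:
            break
    return selected

# Toy accuracy function in which F1 and F3 together are informative.
toy = lambda fs: 0.6 * ('F1' in fs) + 0.3 * ('F3' in fs) - 0.01 * len(fs)
print(sfs(['F1', 'F2', 'F3', 'F4'], toy))   # ['F1', 'F3']
```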
Based on the properties that might be relevant to consensus shape prediction, we construct 17 features (listed on our website) and compute the F-score for each of them. The accuracy is evaluated using a linear SVM. The standard grid search approach is used to find optimal SVM parameters, and the performance of a feature set is evaluated using 5-fold cross-validation; prediction accuracy is the average over all cross-validation sets. The feature set that achieves the highest accuracy includes the following four features.

F1: the contribution of sequences. We capture the contribution of sequences using the number of sequences mapped to the shape. This feature reveals how the shape is shared among the homologous sequences. F1 = \left|P.LX\right|.

F2: the contribution of structures. This feature represents the abundance of structures mapped to the shape. F2 = \left|P.LS\right|.

F3: the average free energy. The energy model is commonly used to determine the stability of predicted structures. The idea behind this feature is that a stable shape is expected to be derived from a group of stable structures. F3=\frac{{\sum}_{S\in P.LS}\mathrm{\Delta}G\left(S\right)}{\left|P.LS\right|}.

F4: the average minimum free energy. This feature differs from F3 in that it considers only the minimum free energy among all predicted structures of each sequence. F4=\frac{{\sum}_{X\in P.LX}\text{MFE}\left(X\right)}{\left|P.LX\right|}.
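The four features can be computed directly from the per-shape records. A sketch, with hypothetical toy numbers; `mfe_of` supplies MFE(X) over each sequence's whole folding space, as in the definition of F4:

```python
def shape_features(pls, mfe_of):
    """F1-F4 for one candidate shape.  `pls` lists (sequence id, Delta-G) for
    every structure with this shape (P.LS); `mfe_of[seq]` is MFE(X) over the
    sequence's complete folding space."""
    seqs = {seq for seq, _ in pls}                 # P.LX
    f1 = len(seqs)                                 # F1 = |P.LX|
    f2 = len(pls)                                  # F2 = |P.LS|
    f3 = sum(e for _, e in pls) / f2               # F3: mean Delta-G in shape
    f4 = sum(mfe_of[s] for s in seqs) / f1         # F4: mean per-sequence MFE
    return f1, f2, f3, f4

# Hypothetical toy data: three sequences, four structures in this shape.
P_LS = [("X1", -8.1), ("X1", -6.0), ("X2", -7.3), ("X3", -5.5)]
MFE = {"X1": -8.1, "X2": -7.9, "X3": -6.2}
print(shape_features(P_LS, MFE))   # f1=3, f2=4, then the two energy averages
```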
Shape ranking using a simple scoring function
Once the features are determined, they are used together with a trained linear SVM for shape labeling. Multiple shapes might be labeled as "true". In order to rank these candidate shapes for the final shape selection, we evaluate each candidate shape using a score sc that is proportional to the signed distance from the candidate shape to the classification hyperplane [32]. Specifically, sc = w · x + b, where · denotes the dot product, w is the weight vector, and x is the instance vector. w is obtained from the optimization function of the linear SVM. The larger w_{j} is, the more important the j-th feature is; this interpretation of w holds only for a linear SVM model.
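The ranking step can be sketched as below. The weight vector and bias are placeholders, not trained values, and the feature vectors are hypothetical:

```python
# Score a candidate shape by sc = w . x + b, its signed distance (up to a
# constant factor) from the separating hyperplane of a trained linear SVM.
w = [0.8, 0.1, -0.3, -0.5]          # one placeholder weight per feature F1..F4
b = -0.2

def sc(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

shape_a = [3, 4, -6.7, -7.4]        # hypothetical feature vectors
shape_b = [1, 1, -3.2, -3.2]

# rank candidate shapes by descending score; the top shape is selected
ranked = sorted([("A", shape_a), ("B", shape_b)], key=lambda p: -sc(p[1]))
print([name for name, _ in ranked])   # ['A', 'B']
```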
Time complexity of shape prediction
For N input sequences, there are |S| predicted structures, which can be grouped into P' shapes. Using de novo folding tools to output near-optimal structures within a given energy range (e.g. 5%), we found that N : |S| : P' ≈ 1 : 10 : 1.375. Mapping structures to shapes takes O(|S|L), where L is the sequence length. As sorting shapes by their features takes O(P' log P'), and P' ≤ 2N and |S| ≤ 11N, the shape prediction procedure has time complexity O(NL + N log N).
Consensus structure prediction given a shape
Once we determine the shape, we will predict the structure in the shape class for the given homologous ncRNAs. Structures corresponding to the same shape can differ significantly in the details of the loop and stacking regions. A strategy is needed to choose the correct structure inside the shape class for each input sequence. The simplest strategy is to output the MFE structure for the chosen shape, which has been used in previous work [12]. However, the MFE structure in a shape may not be the native structure. In particular, the accuracy of the MFE prediction for ncRNAs containing pseudoknots is low.
In this section we describe KnotStructure, a comparative structure prediction method for homologous sequences given the shape. The rationale behind comparative structure prediction is that secondary structures and sequences are conserved during evolution. Thus, finding the structures that maximize both the sequence and the secondary structure similarity among homologous ncRNAs provides the basis for comparative structure prediction. Various methods for evaluating structural and sequence similarity exist. The major challenge is to efficiently select the \left|\widehat{P}.LX\right| representative structures that achieve the highest structural and sequence similarity.
As we have already derived the consensus shape \widehat{P} using KnotShape, only structures with shape \widehat{P} are allowed. In addition, for each sequence {X}_{i}\in \widehat{P}.LX, only one structure with shape \widehat{P} can be selected. The total number of combinations of structures for measuring the similarity is thus {\prod}_{i=1}^{\left|\widehat{P}.LX\right|}\left|\widehat{P}.LS\cap {S}^{i}\right|, where \widehat{P}.LS\cap {S}^{i} contains the structures with shape \widehat{P} for sequence X_{i}. Although efficient heuristics exist to measure the similarity among multiple structures and sequences, the sheer number of combinations poses a great computational challenge.
Procedure 1 Representative structures selection
Input: \widehat{P}, \widehat{P}.LX, \widehat{P}.LS
Output: The representative structures
1. Initialization
for every two structures {S}_{i}^{x} and {S}_{j}^{y} do
//only evaluate similarity of structures from different sequences
if x ≠ y then
evaluate the similarity of {S}_{i}^{x} and {S}_{j}^{y}
else
{S}_{i}^{x} and {S}_{j}^{y} have similarity −∞ //never merged
end if
end for
2. Select the set of representative structures using hierarchical clustering
//Each structure is a cluster by itself
repeat
Combine a pair of clusters with the highest similarity
For any structure {S}_{i}^{x} added to the cluster, remove all other structures {S}_{j}^{x} for j ≠ i
Reevaluate the similarity between clusters
until the cluster has size \left|\widehat{P}.LX\right|
In order to efficiently select representative structures, we use a method similar to the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), an agglomerative hierarchical clustering technique [33]. Each object (i.e. secondary structure) starts in its own cluster, and the closest pair of clusters is merged into a single cluster as one moves up the hierarchy. The distance between clusters is measured using the arithmetic mean defined in UPGMA. Compared to the standard clustering procedure, we place constraints on the objects that can be selected into the same cluster: only structures that have shape \widehat{P} and come from different ncRNAs can be combined in the same cluster. The detailed clustering process is described in Procedure 1.
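Procedure 1 can be sketched as follows. This is a simplified illustration, assuming each sequence contributes at least one structure with shape \widehat{P}; `sim(a, b)` stands in for the grammar string alignment score described below:

```python
def select_representatives(structs, sim):
    """Constrained agglomerative clustering in the spirit of Procedure 1.
    `structs` is a list of (seq_id, struct_id) with the predicted shape;
    `sim(a, b)` scores two structures.  Structures of the same sequence are
    never merged; average linkage (UPGMA-style) is used between clusters.
    Returns one representative structure per sequence."""
    clusters = [[s] for s in structs]
    n_seqs = len({seq for seq, _ in structs})

    def link(c1, c2):
        if {s for s, _ in c1} & {s for s, _ in c2}:
            return float('-inf')        # same-sequence structures: never merged
        # arithmetic-mean similarity between all cross-cluster pairs
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while True:
        # pick the pair of clusters with the highest linkage similarity
        best, i, j = max((link(c1, c2), i, j)
                         for i, c1 in enumerate(clusters)
                         for j, c2 in enumerate(clusters) if i < j)
        merged = clusters[i] + clusters[j]
        seqs = {s for s, _ in merged}
        # keep the merged cluster; drop all other structures of its sequences
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j) and not ({s for s, _ in c} & seqs)]
        clusters.append(merged)
        if len(merged) == n_seqs:
            return merged

# Toy similarity: structures with closer ids are more similar (hypothetical).
structs = [("X1", 0), ("X1", 5), ("X2", 1), ("X3", 2)]
sim = lambda a, b: -abs(a[1] - b[1])
reps = select_representatives(structs, sim)
print(sorted(reps))   # one structure per sequence
```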
During clustering, the structural and sequence similarity is evaluated using a grammar string-based approach [34, 35]. Grammar strings encode both secondary structure and sequence information for an ncRNA sequence, and grammar string alignment scores can accurately quantify the structural and sequence similarity of two ncRNAs. In addition, grammar strings can encode pseudoknot structures [34, 35]. For a sequence X_{i} and a structure {S}_{j}^{i} in its folding space, X_{i} and {S}_{j}^{i} are encoded in a grammar string g{s}_{j}^{i}. We measure the similarity between any two grammar strings using the grammar string-based alignment score normalized by the alignment length. The similarity between groups of grammar strings is measured by the arithmetic mean, as in UPGMA.
Figure 4 sketches the representative structure selection based on the clustering procedure. Let g{s}_{j}^{i} be the grammar string converted from X_{i} and {S}_{j}^{i}, with X_{i} ∈ P.LX. Once g{s}_{j}^{i} is selected, all other grammar strings derived from the folding space of X_{i} are removed from further processing.
A progressive multiple sequence alignment is then performed on the set of representative structures using the clustering path as a guide tree. We derive the consensus secondary structure from the alignment. The consensus structure can be mapped back to each aligned sequence to obtain the predicted structure of an individual sequence.
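The final mapping step can be sketched as follows, under the assumption that the consensus structure is given as one symbol per alignment column: columns where a sequence has a gap are dropped, and a base pair that loses its partner becomes unpaired. The function is ours, for illustration only:

```python
def project(consensus_db, aligned_seq, gap='-'):
    """Map a consensus dot-bracket structure (one symbol per alignment column)
    onto one aligned sequence.  Pseudoknot letter pairs would be handled
    analogously; only '(' / ')' pairs are shown here."""
    # pair up alignment columns
    stack, partner = [], {}
    for col, c in enumerate(consensus_db):
        if c == '(':
            stack.append(col)
        elif c == ')':
            partner[col] = stack[-1]
            partner[stack.pop()] = col

    keep = [col for col, ch in enumerate(aligned_seq) if ch != gap]
    kept = set(keep)
    out = []
    for col in keep:
        c = consensus_db[col]
        if c in '()' and partner[col] not in kept:
            c = '.'                  # partner column deleted: base is unpaired
        out.append(c)
    return ''.join(out)

print(project("((...))", "GG-AACC"))   # "((..))"
```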
Running time of structure prediction
Converting a sequence and an associated secondary structure into a grammar string (GS) takes O(L^{2}), where L is the length of the sequence. Let m be the number of structures in \widehat{P}.LS; it takes O(L^{2}m) to encode all structures with shape \widehat{P}. In the first step of hierarchical clustering, we measure the similarity between GSs of different ncRNAs by conducting an all-against-all comparison. A pairwise GS alignment takes O(l^{2}), where l is the length of the GS and l ≤ L. Using the default energy cutoff (5%) for suboptimal structure generation, we observed that m ≤ 11N. Thus, the all-against-all similarity measurement has time complexity O(L^{2}N^{2}). The guide tree generated by the clustering procedure contains at most N representative structures, so the total running time for clustering is O(L^{2}N^{3}), which is the leading term in the time complexity of the consensus structure prediction algorithm.