Volume 14 Supplement 9
Selected articles from the 8th International Symposium on Bioinformatics Research and Applications (ISBRA'12)
Classification and assessment tools for structural motif discovery algorithms
 Ghada Badr^{1, 2}Email author,
 Isra AlTuraiki^{1}Email author and
 Hassan Mathkour^{1}
DOI: 10.1186/1471210514S9S4
© Badr et al.; licensee BioMed Central Ltd. 2013
Published: 28 June 2013
Abstract
Background
Motif discovery is the problem of finding recurring patterns in biological data. Patterns can be sequential, mainly when discovered in DNA sequences. They can also be structural (e.g. when discovering RNA motifs). Finding common structural patterns helps to gain a better understanding of the mechanism of action (e.g. posttranscriptional regulation). Unlike DNA motifs, which are sequentially conserved, RNA motifs exhibit conservation in structure, which may be common even if the sequences are different. Over the past few years, hundreds of algorithms have been developed to solve the sequential motif discovery problem, while less work has been done for the structural case.
Methods
In this paper, we survey, classify, and compare different algorithms that solve the structural motif discovery problem, where the underlying sequences may be different. We highlight their strengths and weaknesses. We start by proposing a benchmark dataset and a measurement tool that can be used to evaluate different motif discovery approaches. Then, we proceed by proposing our experimental setup. Finally, results are obtained using the proposed benchmark to compare available tools. To the best of our knowledge, this is the first attempt to compare tools solely designed for structural motif discovery.
Results
Results show that the accuracy of discovered motifs is relatively low. The results also suggest a complementary behavior among tools where some tools perform well on simple structures, while other tools are better for complex structures.
Conclusions
We have classified and evaluated the performance of available structural motif discovery tools. In addition, we have proposed a benchmark dataset with tools that can be used to evaluate newly developed tools.
Introduction
Finding recurring patterns, motifs, in biological data gives an indication of important functional or structural roles. Motifs can be either sequential or structural. Motifs are represented as sequences when they represent repeated patterns in biological sequences. Motifs are structural when they represent patterns of conserved base pairs (e.g. RNA secondary structures) [1, 2]. Knowing structural motifs in RNA leads to a better understanding of the mechanisms of action. Unlike DNA motifs, which are sequentially conserved, RNA motifs may share a common structure even in the case of low sequence similarity. Many algorithms have been devised to solve the structural motif discovery problem. However, due to the lack of a gold standard benchmark, little work has been done to evaluate their performance. A survey of different structural RNA motif discovery algorithms can be found in [3]. The structural motif discovery problem should not be confused with the two close problems: RNA structure prediction and RNA consensus structure prediction [4]. In the former, it is required to predict the secondary structure of a single RNA sequence. In general, the predicted structure minimizes the total free energy. While in the later, it is required to find a list of base pairs that can simultaneously be formed in a set of related RNA sequences. In this case, it is generally assumed that sequences are evolutionary related and share a similar overall fold. Evolutionary conservation information is utilized to improve the accuracy of structure prediction process. In this paper, we focus on structural motif discovery, where the goal is to discover repeated patterns in RNA secondary structures. The paper is organized as follows. In the following section, the motif discovery problem is formulated. Next, the basics of RNA secondary structures are presented including: common representations and free energy models that are used to evaluate the stability of an RNA structure. Then, we survey and classify motif discovery algorithms and discuss their advantages and limitations. We proceed by proposing a benchmark and a tool for evaluating the performance of motif discovery algorithms. Finally, our experimental setup and comparative results are presented and discussed.
Motif discovery
The structural motif discovery problem can be formulated as follows: Given a set of RNA sequences S, it is required to find common structural motifs that are responsible for the function or regulation of the RNA sequences. In general the motif discovery problem is a challenging problem: The motif may not be present in all input sequences. In real situations, unrelated sequences may be included by mistake. Then, only a subset of the given sequences can share a common motif. In other situations the input sequences may be related, but there is no common single motif for all of them. This happens when sequences are functionally related but belong to different structural classes. For N sequences, there are 2^{ N }subsets of sequences that need to be examined to find a motif. Hundreds of algorithms and tools have been published to tackle motif discovery in the sequential case. However, less work has been done for structural motif discovery. The structural case is more difficult than the sequential case because motifs may share a common structure even if their underlying sequences are different.
Preliminaries
RNA secondary structures
RNA is found as a single strand where individual bases can bond with each other forming base pairs[5]. Bonding makes RNA fold into a structure called secondary structure. There are different rules for base paring [5]. A WastonCrick base pair is formed when A bonds with U through a double hydrogen bond or when G bonds with C through a triple hydrogen bond. A wobble base pair is formed when G bonds with U by a single hydrogen bond. There are other pairing rules such as GA and UC pairs, but they are relatively rare [5]. RNA secondary structures can be interacting or noninteracting. In an interacting structure, base pairs may be formed from bases that belong to different strands (intermolecular). In a noninteracting RNA structure, all base pairs are formed by bases in the same strand (intramolecular).
Definition: Secondary structure
Given an RNA sequence R = {r_{1}r_{2}...r_{ n } } of length n, the secondary structure S of R is a set of base pairs (r_{ i }, r_{ j } ), where 1 ≤ i <j ≤ n, that satisfies the following two criteria [5]:
(1) Each base is paired at most once.
(2) If (r_{ i }, r_{ j } ), (r_{ k }, r_{ l } ) ∈ S, then i <k <j ⇔ i <l <j. This is called the nested criterion.
The criteria above may be rarely violated in RNA secondary structures. If the first one is not satisfied, then a base triple may happen. If the second is not satisfied, then a pseudoknot may exist [5].
RNA representations
There are different ways to represent RNA secondary structures. Some popular RNA representations are:
Dotbracket notation
Planar graph
RNA secondary structure can be represented as a planar graph[6], also known as bond representation. The graph is a simple approximation of the RNA secondary structure in two dimensions. As shown in Figure 3, the planar graph may include the following different types of loops in the secondary structure:

♦ Hairpin loop: a loop that contains exactly one base pair.

♦ Staked pair: a set of consecutive base pairs.

♦ Internal loop: a loop with two base pairs and at least one unpaired base on each side of the loop.

♦ Bulge: Like an internal loop, a bulge has two base pairs but with only one side of the loop having an unpaired bases.

♦ Multiloop: any loop with three or more base pairs.

♦ External base: any unpaired base not contained in a loop. A set of consecutive external bases form an external element.
RNA expression
RNA secondary structures can be represented as an expression of the following six types of terms [1]:

♦ H5 and H3: to represent the beginning and end of an intramolecular stem.

♦ I5 and I3: to represent the beginning and end of an intermolecular stem for interacting molecules.

♦ SS: to represented a single stranded region (unpaired).

♦ BR: to indicate a move between RNA subpatterns in case of interacting patterns.
Componentbased representation
RNA secondary structures can be represented by components [1]. In this representation, a pattern can be defined by three parts: (1) its length, (2) the intramolecular (INTRAM) component (3) the intermolecular (INTERM) component if interacting. An interacting pattern consists of more than one subpatterns. In general P = {p_{1}, p_{2}, ..., p_{ m } }, each p_{ j } = (len_{ j } , {INTERM_{1}, INTERM_{2}, ..., INTERM_{ r } }, {INTRAM_{1}, INTRAM_{2}, ..., INTRAM_{ q } }) for 1 <j ≤ m. If the pattern is not interacting, when is m = 1, it will only have INTRAM component.
Covariance model
Connectivity table
RNA secondary structure can be represented as a table called connectivity table (CT). The table is formatted as follows:

♦ The first line shows the number of bases in the sequence and the name of the structure.

♦ Each following line provides information about one base pairs as follows:

base index.

base.

next index.

previous index.

index of the base to which it is paired. Zero for no pairing.

Natural numbering.

Arc representation
Free energy models
The stability of RNA structures is determined by their free energy. Stable structures have the lowest free energy values. In this section, we discuss wellknown models that are used to calculate the free energy.
Base pair energy model
Although this model is simple, it is known to be inaccurate [12]. The base pair maximization model does not yield biologically relevant structures. This is because: it ignores stacking base pairs, it dose not consider loop sizes, and it has no special scoring of multiloops.
Stacked pair energies
Where S_{i+1,j−1}= 1 if the pair (i + 1, j − 1) ∈ S, S_{i+1,j−1}= 0 otherwise. The total number of consecutive base pairs is called the stacking size. Single base pairs are not considered as stacks, so the staking size is at least 2.
Loop energy model
The loop energy model, also called the nearest neighbor model[15], considers the free energy of the different types of loops, including: free energy of externally interacting loops. Under this model, the free energy of a secondary structure is the sum of free energies of all of its component loops. This model appears to be more accurate especially for RNA molecules with length ≥ 150 [12]. The contribution of each loop type is determined as follows:

♦ Hairpin loop: the energy contribution of a hairpin loop is determined by two elements. First, the number of unpaired bases forming the loop. Second, the contribution of the terminal mismatch, which are the two bases adjacent to the closing base pair.

♦ Staked pair: the energy contribution of a stacked pair is determined by the type and order of base pairs.

♦ Bulge: the energy contribution of a bulge loop is determined by the length of the unpaired bases forming the bulge and the two closing base pairs.

♦ Internal loop: the energy contribution of an internal loop is determined by the length of the unpaired bases forming the loop and the four unpaired bases adjacent to the opening and closing base pairs.

♦ Multiloop: the energy contribution of a multiloop is a function of many factors, including: number of helices and the optimal configuration of free ends and terminal mismatches.
Classification of RNA motif discovery algorithms
There are many approaches to solve the structural motif discovery problem. In this section, we classify the different approaches that solve this problem. Based on how the search space is explored, we classify the approaches into two main classes: enumerative approaches and heuristics approaches. In enumerative approaches, the search space is exhaustively explored in order to discover overrepresented motifs. Algorithms in this class can be further divided into: dynamic programmingbased approaches (DP), data structuresbased approaches (DS), and graphbased approaches (GB). On the other hand, in heuristic approaches, only promising regions of the search are explored. Examples of algorithms in this class include: Expectation maximization and evolutionary algorithms. In addition, there are other heuristics specifically designed to tackle the motif discovery problem.
Enumerative approaches
• Dynamic programming approaches
Dynamic programming is a wellknown algorithm design technique. It is based on the observation that within optimal solutions there exist optimal solutions to subproblems. Most motif discovery algorithms in this class rely on extending Sankoff's algorithm [16] for simultaneous folding and aligning RNA sequences. The algorithm combines the recurrences of sequence alignment and folding algorithms. Given two sequences, it finds two structures and an alignment such that the energy and alignment score are optimal. The two structures need to show the same kind of branching. This should also be reflected in the alignment where branches are aligned to each other.
FOLDALIGN [17–19] is a DPbased motif discovery tool for two sequences. The first implementation of FOLDALIGN [17] starts with pairwise alignments using a simple extension to Sankoff's algorithm. Instead of minimizing the total cost of the alignment and the energy of the structure, the proposed extension maximizes the alignment similarity and the number of base pairs that are formed in the two aligned sequences. The time complexity is reduced by not allowing branching structures. After the initial pairwise alignments are computed, they are aligned with all individual sequences. The algorithm proceeds by following a greedy (progressive) approach by aligning triplet alignments to every individual sequence and so on. In order to obtain an alignment of size r, only alignments between a single sequence and an r − 1 sequences are considered. After each round only the best 30 alignments are retained to generate r + 1 alignments. For each round r, the few best alignments are taken and their score versus length is calculated. Scores are plotted as a function of r. The algorithm then stops at the round for which the rate of change decreases. Finally, the best five scored alignments from r and r − 1 sequences are considered. In subsequent implementations of FOLDALIGN [18, 19], the basic algorithm was improved to deal with branched structures and to use a lightweight energy model.
SLASH [20] is an approach for the discovery of hairpin motifs in unaligned sequences based on two approaches: FOLDALIGN [21] and COVE [7]. FOLDALIGN is used to find local alignments in RNA sequences. Then COVE's model is trained on that alignment.
• Data structuresbased approaches
Motif discovery algorithms can use data structures to accelerate the access and retrieval of words. Mauri and Pavesi [22] presented an algorithm using affix trees data structures for the discovery of hairpins, bulges and internal loops in RNA. An affix tree is a data structure that stores information about a given string and its reverse. The proposed algorithm has two inputs. First, a set of RNA sequences (denoted as S). Second, the minimum number of sequences in which a motif can appear (denoted as q). In addition, it has two optional inputs: the maximum value of loop size and the maximum number of unpaired bases that can form internal loops and bulges. The basic idea is to start by building an affix tree. From the affix tree all substrings of length l are identified. Substrings appearing in at least q sequences are then progressively expanded by adding one base pair or an unpaired base. The proposed algorithm terminates when motifs can not be expanded. Motifs are then evaluated to determine their free energy. A motif is reported only if its energy satisfies a given threshold.
Seed [23] uses another data structures, namely suffix arrays, for the discovery of structural motifs. In the algorithm one of the input sequences is considered as a seed. The data structure is then used to store the seed sequence and its reverse. The algorithm proceeds by listing all stems in the seed sequence. This is done by looking at complementary regions. Stems that appear in a number of sequences with frequency above a predefined threshold are kept for further processing. Then, base pairs in the listed motifs are replaced with the actual base pairs occurring in the seed sequence. Finally, multistem motifs are constructed by combining two motifs. Depending on the positions of the two motifs, they can be either nested in each other or put next to one another (adjacent relationship). The combination can result in different structures such as bulges, internal loops and multiloops. In fact, the motifs capture both sequence and secondary structure information. They are ranked based on free energy.
• Graphbased approaches
comRNA [24] is a graphbased approach that can find pseudoknots. It starts by finding all possible stable stems of length L in each sequence using a branch and bound approach. The stability of a stem is evaluated based on stacking energy. Bulges and internal loops are not allowed at this stage. Next, the similarity between each stem pair is computed. Then, an n − partite undirected weighted connectivity graph is created. Nodes in the graph represent stems, while edge represent similarity values between them. The graph is partitioned to n parts, where each part consists of stems belonging in the same sequences. The final step is to find all conserved stems that are common to multiple sequences. In graph theory, this corresponds to finding a set of maximum cliques. A clique is a complete subgraph where every node is connected to every other node. After that, a graph technique similar to topological sort is applied to find the best assemblies of stems. Finally, the highest ranking motifs are reported.
RNAmine [25] uses graph mining algorithm to discover motifs shared by a subset of different RNA sequences. An RNA sequence and its secondary structure is represented as a directed labeled graph, called stem graph. In a stem graph, a node denotes a stem and an edge denotes the relative positions of two stems. Initially, candidate stems are derived from McCaskill's algorithm [26] using Vienna package [27]. Stems shorter than a given threshold are discarded. The algorithm exhaustively enumerates stem patterns using a branch and bound algorithm. A pattern needs to satisfy three constraints: (1) A pattern that exist in at least m stem graphs, (2) A pattern should form a clique, and (3) It must not be a general pattern.
Heuristic approaches
• Evolutionary algorithms
Evolutionary computation algorithms are based on biological evolution concepts such as natural selection, survival of the fittest, and reproduction to search for an optimal solution. The main components of evolutionary algorithms are: solution representation, fitness function, population initialization, selection, and reproduction operators. RNAGA [28] uses genetic algorithms (GA) to find common secondary structures in a set of homologues RNA sequences.
In RNAGA, GA is applied at different levels. A GA is applied to each sequence to get a set of stable structures. The fitness of a secondary structure is based on its free energy. Then, for each sequence, the resulting set of stable structures are evaluated to determine their conservation among different sequences. Using the conservation score as a measure of fitness, a GA is applied again to the set of stable structures. Finally, the GA terminates when maximum number of generations is reached.
GPRM [29] uses genetic programming (GP) to find structural motifs in RNA. The algorithm requires two sets of inputs: a set of coregulated RNAs, called the positive set, and a set of randomly generated sequences, called the negative set. The initial population of motifs is generated based on a userdefined parameters: number of segment in a motif and the range of segment length. A segment is either WastonCrick base pairs or unpaired bases. Individuals are evaluated based on Fscore, which is a function of sensitivity and positive predictive values. The sets of positive and negative sequences are used in calculating the score.
GeRNAMo [30] also uses a GP to find RNA motifs. Unlike GPRM, where the initial motifs are randomly generated, GeRNAMo starts with a set of suboptimal secondary structures. First, a preprocessing step takes place where input sequences are broken into subsequeces ranging between a userdefined minimum and maximum lengths. Then for each subsequence, RNAsubopt [31] is used to find the suboptimal secondary structure based on thermodynamics evaluation. The output of RNAsubopt is then used to build an array that stores all the information of the suboptimal secondary structures. GP is then applied with a fitness function that favors motifs that are common to the majority of input sequences.
• Expectation maximization
Expectation maximization (EM) is a statistical method for parameter estimation in the case of incomplete data. CMfinder [32] is an RNA motif discovery algorithm that is based on expectation maximization (EM) and covariance model RNA representation. The algorithm starts by finding strong motif candidates. Those are the ones with more stable structure. For each input sequence, the minimum free energy of all subsequences is computed. Top ranking candidates are then chosen and compared at base level. The algorithm proceeds by constructing an initial alignment from similar candidates. The initial alignment is used as a seed for the EM algorithm. Then, EM takes place twice: First, to refine the initial alignment. Second, to scan each sequence looking for hits, where top hits are treated as candidates. Finally, motifs are merged to produce the final result.
RNApromo [33] is another EMbased method for identifying RNA motifs. It starts with a set of RNA sequences and a set of suggested secondary structures. The motif prediction algorithm is composed of two main parts. In the first part, several heuristics are employed to identify good motif candidates. In the second part, candidates are refined using an EM algorithm.
• Other heuristics
Structural motif discovery algorithms.
Tool  Class  Website 

FOLDALIGN [17]  EN  [40] 
Based on Sankoff's algorithm. It maximizes alignment similarity and number of base pairs formed in 2 aligned sequences.  
SLASH [20]  EN  NA 
Uses FOLDALIGN to find local alignments in RNA sequences. Then COVE [41], to build a SCFG model from the local alignments.  
Mauri & Pavesi [22]  EN  NA 
Uses Affix trees for the discovery of hairpins, bulges and internal loops in RNA. Substrings of certain length appearing in at least q sequences are found and expanded.  
Seed [23]  EN  [42] 
Uses suffix arrays to induce motifs from the seed sequence. Data structures are used to store the seed sequence, its reverse, and the input sequences.  
comRNA [24]  EN  [43] 
Uses an n − partite undirected weighted connectivity graph to represent stems and their similarity. The problem of finding motifs is mapped to finding a set of maximum cliques. A graph technique similar to topological sort is applied to find the best assemblies of stems.  
RNAmine [25]  EN  [44] 
Uses a graph mining algorithm to find conserved stems.  
RNAGA [28]  HU  NA 
Genetic algorithm is applied at different levels. First it is applied on each sequence to get a set of stable structures. Then it is applied again to the set of stable structures.  
GPRM [29]  HU  NA 
Uses genetic programming. It requires two sets of inputs: a positive set and a negative set. Individuals are evaluated based on Fscore and using the two input sets.  
GeRNAMo [30]  HU  NA 
GeRNAMo applies genetic programming on the output of RNAsubopt.  
CMfinder [32]  HU  [45] 
based on expectation maximization (EM) to simultaneously align and fold sequences using covariance model of RNA motifs.  
RNAProfile [34]  HU  [46] 
Uses a heuristic to extract a set of candidate regions from each sequence. The second step involves grouping regions to find similar motifs.  
RNAPromo [33]  HU  [47] 
The motif prediction algorithm initially looks for structural elements which are common to the input RNAs, and then employs an expectation maximization algorithm to refine the resulting probabilistic model. 
Advantages and limitations
Advantages and limitations of the surveyed motif discovery approaches.
Approach  Structures  Pseudoknots  Scalablity 

Dynamic Programming  stems  no  no 
Data StructuresBased  multi loops  no  yes 
GraphBased  complex  yes  no 
Evolutionary Algorithms  complex  yes  no 
Expectation maximization  complex  yes  no 
Proposed benchmarks
Motivated by the lack of a 'gold standard' benchmark, we propose a benchmark that can be used to assess the performance of structural motif discovery tools. The benchmark is specifically designed to highlight the different challenges of the motif discovery problem. The complexity of the RNA secondary structures depends on the number loops and the type of loops. Changing these two parameters can vary the complexity of the structure. Thus, based on the complexity of the allowed RNA secondary structures, the benchmark can be divided into the following datasets, ranging from low complexity to high complexity structures:

♦ Ref.1 contains three RNA families: Iron response element I (IRE), Histone 3' UTR stemloop (Histone3), and Selenocysteine insertion sequence 1 (SECIS I). These families have simple secondary structures that are composed of a few number of loops and loop types (1 to 2).

♦ Ref.2 contains three RNA families: FMN riboswitch, glmS ribozyme, and Lysine riboswitch. These families have more complex secondary structures that are composed of a higher number of loops (5 to 7) and loop types (2 to 3).

♦ Ref.3 contains three RNA families: Enteroviral 3' UTR element (Entero_OriR), Metazoan signal recognition particle RNA (Metazoa_SRP), and Bacterial RNase P class B (RNaseP_bact_b). These families have complex secondary structures that are composed of a large number of loops (817) and loop types (45).
Proposed measurement tool
To evaluate the performance of the available motif discovery tools, we developed a tool [39] to measure their performance in terms of sensitivity (Sn), specificity (Sp), and positive predictive value (PPV). The tool computes statistics for the discovered structures at the base pair level relative to the corresponding structures in the Rfam seed alignments. A discovered base pair (i, j) is considered true positive (TP), if it is identical to the known pair or it is shifted by one nucleotide on one side only. In other words, if (i, j), (i + 1, j), (i − 1, j), (i, j + 1), or (i, j − 1) is discovered [4].
Experimental setup
To the best of our knowledge, this is the first comparison between tools designed for the purpose of structural motif discovery. For our experiment, we focus on the tools that discover structural motifs in more than two input sequences, where the input sequences are different, but share a common RNA structure. Among the surveyed tools, we were able to obtain six: CMfinder [32], RNAProfile [34], RNAmine [25], comRNA [24], RNAPromo [33], and Seed [23]. Our first observation is that the use of structural discovery tools often requires knowledge of the Unix operating system. Except for RNAmine, all the tools were available as a source code that needs to be unpacked and compiled. Difficulties arise when the tool depends on some old libraries. In addition, some tools require the determination of many parameters, especially Seed.
We were able to successfully compile and test the tools using a 64bit Linux based operating system with 32 GB of RAM. All the tools were required to output up to ten motifs. The motifs were ranked according to the score used for each tool and the best scoring motif was compared to the known structure. The results were summarized by averaging statistics over secondary structures. Running the tools with default parameters was not always a good choice, so we had to tune parameters. The parameters used for each tool and each benchmark dataset are as follows:

♦ CMfinder: for Ref.1, all parameters were set to the default values. For Ref.2, the number of stemloops was set to two, the minimum and maximum lengths of a motif were set to: 100 and 200, respectively. The same parameters were used for Ref.3, except for the maximum length of a motif was set to 400 for the families RNaseP_bact_b and Metazoa_SRP. The predictions of Ref.2 and Ref.3 were post processed using a tool provided by CMfinder [32] where a heuristic is used to combine multiple motifs.

♦ RNAProfile: for families Entero_OriR, FMN riboswitch, Lysine riboswitch, and glmS ribozyme the number of hairpins was set to 4 and the minimum and maximum length of a motif was set to 100 and 200, respectively. For the families RNaseP_bact_b and Metazoa_SRP, the the number of hairpins was set to 5 and the minimum and maximum length of a motif was set to 300 and 400 respectively. The motifs were scored based on the average alignment score for each structure.

♦ RNAmine: for Ref.1, the minimum size of stem candidate was set to 1. For the two families in Ref.3: RNaseP_bact_b and Metazoa_SRP, the maximum motif size was set to 600 and 400, respectively.

♦ comRNA: for all the datasets, the tool was run with the same parameter values. The minimum percentage of sequences in which a common structure should occur was set to 0.7. The maximum time allowed for structure finding with a certain stem similarity score threshold was set to 20 minutes. The maximum pattern search time allowed for RNAMOT was set to 20 seconds.

♦ RNApromo: for Ref.1, all parameters were set to the default values. In Ref.2 and family Entero_OriR of Ref.3 the minimum and maximum lengths of a motif were set to 100 and 250 respectively. For the remaining two Ref.3 families, the maximum length was set to 400.

♦ Seed: Setting parameters for Seed was difficult. Without proper parameter values, Seed either fills the entire hard disk with motifs or fails to discover motifs. For Ref.1, stem minimum length was set to 3 with maximum distance between the start and end of a stem set to 30 for family IRE. For the other two families in Ref.1, the stem minimum length was set to 6 and a maximum number of stems was equal to 1. For Ref.2, the fraction of input sequences having a motif (minsupport) was set to 0.7. The stem minimum length and the maximum number of stems in a motif were set to 3 and 4, respectively. For Ref.3, the maximum separation was set to 10, minimum support to 0.8, stem minimum length to 3, and maximum number of stems to 5. For all the datasets, no mismatches were allowed. The motifs were scored based on the average minimum free energy for each structure.
Comparative results
In this section, we show and discuss the results obtained when testing the available tools using our proposed benchmark. Results are obtained for sensitivity (Sn), positive predictive value (PPV) and specificity (Sp). All tools are able to discover motifs in all benchmark datasets, except for comRNA, which is able to discover motifs in two families only, RNaseP_bact_b and glmS ribozyme. Part of this could be due to using the defaults for the minimum similarity score between two stems. In addition, the threshold on the percentage of sequences in which a common structure should appear is higher than the threshold used in [24, 32].
For the more complex structures, Ref.2, CMfinder appears to have the best accuracy in terms of Sn and PPV, while the accuracy of RNAPromo and Seed in terms of Sn is considerably low.
As for the most complex structures, Ref.3, CMfinder has the best accuracy values in terms of Sn. The performance of RNAmine and RNAprofile is comparable to CMfinder. In terms of PPV, RNAmine shows the best performance. As for Sp, Seed outperforms all the other tools for the complex structures, Ref.2 and Ref.3.
CMfinder is the only tool that performs best for all the complexity levels of RNA structures. This can be attributed to the combination of rich initialization step and EM for refining profiles. The initial candidates in CMfinder are stable structures derived using Vienna package and are not limited to stem loops.
Seed performs well on simple structures. However, the performance decreases by 80%, in terms of Sn, on datasets with complex structure. This indicates that the simple enumeration of stem loops works well for discovering simple structure. The later combination of stem loops does not seem to yield real complex structures. However, the authors showed that the accuracy can be improved when the discovered motifs are used as constraints for MFOLD.
As for RNAProfile, in terms of Sn, it appears to perform better on complex structures. However, the accuracy decreases by 65% on benchmark datasets with simple structure. RNAProfile follows a progressive alignment approach. The degradation in accuracy could be due to the fact the dataset of complex structures have more sequence identity than the dataset of simple structures. The figures suggest a complementary behavior among tools.
Results of statistical significance test with α = %5.
Measure  Fcalc  pvalue  Fcrit 

Sn  4.68294  0.0034  2.60597 
PPV  4.08559  0.0072  2.60597 
Sp  6.15544  0.00024  2.44343 
Average running time (linux real time converted to minutes).
CmFinder  RNAProfile  RNAmine  ComRNA  RNAPromo  Seed  

Ref1 1  1.35  0.61  14.90  31.83  2.88  1.28 
Ref2  34.87  351.99  7.14  41.28  18.12  3.81 
Ref3  62.19  582.78  23.27  80.79  37.97  7.44 
Conclusion
In this paper, we surveyed and classified different algorithms that solve the structural motif discovery problem. We explored their strengths and weaknesses. Motivated by the lack of a gold standard to benchmark structural motif discovery tools, we proposed a benchmark datasets based on the complexity of the RNA structures. The benchmark can be used to evaluate different structural motif discovery approaches. In addition, we presented our measurement tool that can be used by other developers to compute the accuracy measures for their own approaches. We used the proposed benchmark dataset to evaluate six structural motif discovery algorithms. The obtained results show that the accuracy levels are generally low. CMfinder performs well regardless of the complexity of RNA structures. Some tools are more suitable to discover simple structure, such as Seed, while others are better for the discovery of complex structures, such as RNAmine and RNAProfile. This suggests a complementary behaviour among tools. Combining the results from different tools or from different runs of the same tool, may improve the accuracy of structure motif discovery. This has shown to be useful in motif discovery in the sequential case.
The performance evaluation of the tools were based on statistical measures. For each tool, the discovered motifs were compared to known motifs. The problem was when a new motif was (computationally) discovered; determining its biological significance remained a big challenge. Different tools relied on different objective functions to score motifs; this includes: thermodynamic stability, alignment score, and probabilistic measures. However, existing objective functions were unable to captures the hidden features of biologically relevant motifs.
Declarations
Acknowledgements
This research has been supported by the National Plan for Sciences and Technology, King Saud University, Riyadh, Saudi Arabia (Project No. 12BIO260502). We would like to express our thanks and appreciation for Dr. Marcel Turcotte, associate professor, University of Ottawa, Canada, for his valuable consultation and comments. Special thanks for the BioInformatics research Group (BioInG) http://bioinformaticksurg.blogspot.ca at King Saud University for creating a rich environment for conducting research and for providing valuable discussions and events.
Declarations
Publication of this article was funded by the authors.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 9, 2013: Selected articles from the 8th International Symposium on Bioinformatics Research and Applications (ISBRA'12). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S9.
Authors’ Affiliations
References
 Badr G, Turcotte M: Componentbased matching for multiple interacting RNA sequences. Proceedings of the 7th international conference on Bioinformatics research and applications. 2011, ISBRA'11, Berlin, Heidelberg: SpringerVerlag, 7386.View Article
 Carvalho AM, Freitas AT, Oliveira AL, Sagot M: An Efficient Algorithm for the Identification of Structured Motifs in DNA Promoter Sequences. IEEE/ACM Trans Comput Biol Bioinformatics. 2006, 3 (2): 126140. 10.1109/TCBB.2006.16.View Article
 George A, Tenenbaum S: Informatic Resources for Identifying and Annotating Structural RNA Motifs. Molecular Biotechnology. 2009, 41 (2): 180193. 10.1007/s120330089114z.PubMed CentralView ArticlePubMed
 Mathews DH, Turner DH: Prediction of RNA secondary structure by free energy minimization. Current opinion in structural biology. 2006, 16 (3): 270278. 10.1016/j.sbi.2006.05.010.View ArticlePubMed
 Sung W: RNA Secondary Structure Prediction. The practical bioinformatician. 2004, 167192. World ScientificView Article
 Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Monatshefte fur Chemie Chemical Monthly. 1994, 125 (2): 167188. 10.1007/BF00818163.View Article
 Eddy SR, Durbin R: RNA sequence analysis using covariance models. Nucleic Acids Research. 1994, 22 (11): 20792088. 10.1093/nar/22.11.2079.PubMed CentralView ArticlePubMed
 Schmitt WR, Waterman MS: Linear trees and RNA secondary structure. Discrete Appl Math. 1994, 51 (3): 317323. 10.1016/0166218X(92)00038N.View Article
 Nussinov R, Jacobson AB: Fast algorithm for predicting the secondary structure of singlestranded RNA. Proceedings of the National Academy of Sciences of the United States of America. 1980, 77 (11): 63096313. 10.1073/pnas.77.11.6309.PubMed CentralView ArticlePubMed
 Waterman M, To D, Kinney JR: Secondary Structure of SingleStranded Nucleic Acids. Studies on Foundations and Combinatorics, Advances in Mathematics Supplementary Studies. 1978, 1: 167212.
 Burghardt B, Hartmann AK: Dependence of RNA secondary structure on the energy model. Physical Review E, Statistical, Nonlinear, and Soft Matter Physics. 2005, 71 (2 Pt 1): 021913View ArticlePubMed
 Alkan C, Karakoç E, Nadeau JH, Sahinalp SC, Zhang K: RNARNA interaction prediction and antisense RNA target search. Journal of Computational Biology: A Journal of Computational Molecular Cell Biology. 2006, 13 (2): 267282. 10.1089/cmb.2006.13.267.View Article
 Mathews DH, Sabina J, Zuker M, Turner DH: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology. 1999, 288 (5): 911940. 10.1006/jmbi.1999.2700.View ArticlePubMed
 Rivas E, Eddy SR: A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of Molecular Biology. 1999, 285 (5): 20532068. 10.1006/jmbi.1998.2436.View ArticlePubMed
 Turner DH, Mathews DH: NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic acids research. 2010, 38 (Database): D280282. 10.1093/nar/gkp892.PubMed CentralView ArticlePubMed
 Sankoff D: Simultaneous Solution of the RNA Folding, Alignment and Protosequence Problems. SIAM Journal on Applied Mathematics. 1985, 45 (5): 81010.1137/0145048.View Article
 Gorodkin J, Heyer LJ, Stormo GD: Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Research. 1997, 25 (18): 37243732. 10.1093/nar/25.18.3724.PubMed CentralView ArticlePubMed
 Havgaard JH, Lyngsø RB, Stormo GD, Gorodkin J: Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics. 2005, 21 (9): 18151824. 10.1093/bioinformatics/bti279.View ArticlePubMed
 Havgaard JH, Torarinsson E, Gorodkin J: Fast Pairwise Structural RNA Alignments by Pruning of the Dynamical Programming Matrix. PLoS Comput Biol. 2007, 3 (10): e19310.1371/journal.pcbi.0030193.PubMed CentralView Article
 Gorodkin J, Stricklin SL, Stormo GD: Discovering common stemloop motifs in unaligned RNA sequences. Nucleic Acids Research. 2001, 29 (10): 21352144. 10.1093/nar/29.10.2135.PubMed CentralView ArticlePubMed
 Havgaard JH, Lyngs RB, Gorodkin J: The FOLDALIGN web server for pairwise structural RNA alignment and mutual motif search. Nucleic Acids Research. 2005, 33 (Web Server): W650653. 10.1093/nar/gki473.PubMed CentralView ArticlePubMed
 Mauri G, Pavesi G: Algorithms for pattern matching and discovery in RNA secondary structure. Theoretical Computer Science. 2005, 335: 2951. 10.1016/j.tcs.2004.12.015.View Article
 Anwar M, Nguyen T, Turcotte M: Identification of consensus RNA secondary structures using suffix arrays. BMC Bioinformatics. 2006, 7: 24410.1186/147121057244.PubMed CentralView ArticlePubMed
 Ji Y, Xu X, Stormo GD: A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences. Bioinformatics (Oxford, England). 2004, 20 (10): 15911602. 10.1093/bioinformatics/bth131.View Article
 Hamada M, Tsuda K, Kudo T, Kin T, Asai K: Mining frequent stem patterns from unaligned RNA sequences. Bioinformatics. 2006, 22 (20): 24802487. 10.1093/bioinformatics/btl431.View ArticlePubMed
 McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29 (67): 11051119. 10.1002/bip.360290621.View ArticlePubMed
 Hofacker IL, Fontana W, Stadler PF, Bonhoeffer SL, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatsh Chem. 1994, 125: 167188. 10.1007/BF00818163.View Article
 Chen J, Le S, Maizel JV: Prediction of common secondary structures of RNAs: a genetic algorithm approach. Nucleic Acids Research. 2000, 28 (4): 991999. 10.1093/nar/28.4.991.PubMed CentralView ArticlePubMed
 Hu YJ: Prediction of consensus structural motifs in a family of coregulated RNA sequences. Nucleic Acids Research. 2002, 30 (17): 38863893. 10.1093/nar/gkf485.PubMed CentralView ArticlePubMed
 Michal S, Ivry T, Cohen O, Sipper M, Barash D: Finding a common motif of RNA sequences using genetic programming: the GeRNAMo system. IEEE/ACM Transactions on Computational Biology and Bioinformatics / IEEE, ACM. 2007, 4: 596610.View ArticlePubMed
 Wuchty S, Fontana W, Hofacker IL, Schuster P: Complete suboptimal folding of RNA and the stability of secondary structures. Biopolymers. 1999, 49 (2): 145165. 10.1002/(SICI)10970282(199902)49:2<145::AIDBIP4>3.0.CO;2G.View ArticlePubMed
 Yao Z, Weinberg Z, Ruzzo WL: CMfindera covariance model based RNA motif finding algorithm. Bioinformatics. 2006, 22 (4): 445452. 10.1093/bioinformatics/btk008.View ArticlePubMed
 Rabani M, Kertesz M, Segal E: Computational prediction of RNA structural motifs involved in posttranscriptional regulatory processes. Proceedings of the National Academy of Sciences. 2008
 Pavesi G, Mauri G, Stefani M, Pesole G: RNAProfile: an algorithm for finding conserved secondary structure motifs in unaligned RNA sequences. Nucleic Acids Research. 2004, 32 (10): 32583269. 10.1093/nar/gkh650.PubMed CentralView ArticlePubMed
 Karp RM: Reducibility Among Combinatorial Problems. Complexity of Computer Computations. Edited by: Miller RE, Thatcher JW. 1972, Plenum Press, 85103.View Article
 GriffithsJones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A: Rfam: annotating noncoding RNAs in complete genomes. Nucleic Acids Research. 2005, 33 (Database): D121D124.PubMed CentralPubMed
 Rfam database. [http://rfam.sanger.ac.uk/]
 European Nucleotide Archive. [http://www.ebi.ac.uk/ena/]
 Bioinformatics Research Group at King Saud University. [http://bioinformaticksurg.blogspot.ca]
 FOLDALIGN. [http://foldalign.ku.dk/software/index.html]
 Eddy Lab: Software. [http://selab.janelia.org/software.html]
 Seed. [http://bio.eecs.uottawa.ca/software/seed/]
 comRNA. [http://stormo.wustl.edu/comRNA/]
 RNAmine. [http://software.ncrna.org/]
 CMfinder. [http://bio.cs.washington.edu/yzizhen/CMfinder/]
 RNAProfile. [http://159.149.160.51/modtools/downloads/rnaprofile.html]
 RNAPromo. [http://genie.weizmann.ac.il/pubs/rnamotifs08/index.html]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.