An example consensus diagram
The crcB motif [19] was used to provide an example of a consensus diagram drawn using R2R (Figure 1A). The consensus is a representation of conserved sequence and secondary-structure features, the degree of conservation of nucleotides and a summary of covarying positions that retain base-pair complementarity. The output of R2R (Figure 1B) was customized by using additional commands (Figure 1C), and assembled using Adobe Illustrator into a finished diagram. Generic symbols and graphics used in finished diagrams are provided (Additional files 3 and 4). A complete example of R2R input and output is also given for a contrived RNA class with two representatives (Figure 2).
Multistem junctions
Nucleotides within multistem junctions and internal loops are typically positioned along a circle (e.g., as in Figure 1). Like most RNA-drawing programs, R2R supports manual layout of such loops, as well as a circular layout in which stems are oriented in whatever directions fit the circle. R2R also supports the drawing of loops that approximately follow a circle, subject to constraints on the directions of their stems (Figure 3). These constraints are specified by the user, and can be used to avoid overlapping nucleotides elsewhere in the diagram, to orient all stems in horizontal or vertical directions, or otherwise to promote symmetry in stem directions. Stems within the multistem junction can also be constrained to align horizontally, vertically or along an arbitrary axis. The resulting layout problem is expressed as a non-linear program (see Implementation), and solved by CFSQP [20]. This feature makes it faster to find an approximately circular layout than manual trial and error.
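The unconstrained circular layout mentioned above can be illustrated with a minimal sketch. It is not R2R's implementation, and it does not reproduce the constrained non-linear program solved by CFSQP. It assumes that a loop is described by a list of chord lengths going around the circle (backbone steps between consecutive nucleotides, and the width of each base pair that closes a stem), and it finds the radius at which those chords close into a polygon. The names `circle_radius` and `loop_positions` are hypothetical.

```python
import math

def circle_radius(chords, tol=1e-9):
    """Radius of the circle on which the given chords close into a polygon."""
    def total_angle(r):
        # A chord of length c on a circle of radius r subtends 2*asin(c / (2r)).
        return sum(2.0 * math.asin(c / (2.0 * r)) for c in chords)

    full_turn = 2.0 * math.pi
    lo = max(chords) / 2.0  # smallest radius at which the longest chord fits
    if total_angle(lo) < full_turn:
        raise ValueError("chords cannot enclose the circle centre")
    hi = lo
    while total_angle(hi) > full_turn:  # total_angle decreases as r grows
        hi *= 2.0
    while hi - lo > tol:  # bisection on the radius
        mid = (lo + hi) / 2.0
        if total_angle(mid) > full_turn:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def loop_positions(chords):
    """Return (radius, points), one point at the start of each chord."""
    r = circle_radius(chords)
    angle = 0.0
    points = []
    for c in chords:
        points.append((r * math.cos(angle), r * math.sin(angle)))
        angle += 2.0 * math.asin(c / (2.0 * r))
    return r, points

# Hypothetical three-stem junction: backbone steps of length 1, base pairs of width 2.
radius, points = loop_positions([1, 1, 2, 1, 1, 1, 2, 1, 2])
```

In this sketch the stem directions simply fall out of the circle, which corresponds to the case where stems are oriented in whatever directions fit. Constraining those directions is what turns the layout into an optimization problem.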
Pseudoknot drawing styles
R2R supports two styles for showing pseudoknots. In the "in-line" style, pseudoknot pairings are drawn directly (Figure 4A). The pairing relationships are often clearest in the in-line style, but this layout is not possible for many RNA secondary structures without making other compromises. By contrast, the "callout" style (Figure 4B) connects distant base-paired regions with a line marked "pseudoknot". The pseudoknot pairings can then be shown explicitly in a small callout drawing. The callout allows annotation of covariation data, and helps the reader to see precisely which nucleotides form base pairs.
Modular structures
Many RNA motifs exhibit modular sub-structures that are present in only some motif representatives. For example, in many RNA motifs, certain hairpins are absent in some representatives, and some terminal loops frequently adopt one or more well-defined sequences (e.g., either GNRA or UNCG [21]). To show a modular structure, the R2R user uses regular expressions or Boolean logic to define which motif representatives exhibit the modular structure (Additional file 2). The occurrence frequency of the modular structure is automatically calculated by R2R (Figures 1 and 5).
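As a rough illustration of how a regular expression can select the representatives that carry a modular feature, the sketch below counts terminal loops that match GNRA or UNCG. It uses Python's `re` module rather than R2R's own syntax (which is described in Additional file 2), and it assumes unweighted counting, which may differ from how R2R computes the occurrence frequency. The function name and the example loop sequences are hypothetical.

```python
import re

# IUPAC-style patterns written out as explicit character classes.
GNRA = "G[ACGU][AG]A"
UNCG = "U[ACGU]CG"

def occurrence_frequency(loop_sequences, pattern):
    """Fraction of representatives whose whole loop sequence matches the pattern."""
    regex = re.compile(pattern)
    hits = sum(1 for seq in loop_sequences if regex.fullmatch(seq))
    return hits / len(loop_sequences)

# Hypothetical terminal-loop sequences, one per motif representative.
loops = ["GAAA", "GCGA", "UUCG", "CUUG"]
gnra_fraction = occurrence_frequency(loops, GNRA)
uncg_fraction = occurrence_frequency(loops, UNCG)
```

With these four made-up loops, two match GNRA and one matches UNCG, so the fractions are 0.5 and 0.25.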
Drawing of individual RNA molecules
Although the primary goal during the design of R2R was to produce software to assist in drawing consensus diagrams, R2R can also be used to draw the sequences and structures of individual representatives of a noncoding RNA class. For example, Figure 6 shows alternate structures possible in crcB RNAs from Acidothermus cellulolyticus and Roseburia intestinalis that suggest a model for gene regulation. We also previously used R2R to display structural probing data obtained by in-line probing experiments on a SAM-IV riboswitch [17].
Design principles for RNA secondary structural diagrams
R2R facilitates the application of the following design principles for RNA secondary structure diagrams. Although little research has investigated practical benefits of different RNA drawing styles [10], the principles integrated into R2R are similar to broadly followed guidelines for RNA depictions [10, 22] and some are related to ideas accepted in the field of graph layout [10, 23].
The principles are as follows. First, nucleotides or other symbols should not overlap. Second, nucleotides within bulges and loops should ideally be drawn along circles. Such a layout leads to symmetry [23] in the looped nucleotides, which share a common property. The circular layout also avoids arbitrarily drawing attention to individual nucleotides that might otherwise be located on a corner. Third, stems should ideally run horizontally or vertically, to emphasize the common structural role of stems. Fourth, the distance between consecutive nucleotide positions along the RNA backbone should be constant throughout the diagram. This principle avoids inelegant bunching of nucleotides, or extra space between nucleotides that draws unwarranted attention or requires additional clarification for the user to follow the RNA backbone. Finally, the diagram should be compact, which is both aesthetic and space-saving. Some of these principles often conflict, and the inference of an optimal solution may require some manual intervention.
Consensus diagrams merit annotation to highlight the extent of nucleotide conservation and to feature evidence supporting the proposed structure. This information, which is included in Figure 1, is automatically calculated by R2R (see below). Other annotations useful in consensus diagrams are the depiction of variable-length, poorly conserved regions as well as modular structures. R2R supports such annotations (e.g., see Figure 1), based on the user's explicit judgment regarding the RNA motif data.
Automatically calculated consensus annotation
Some annotations specific to consensus diagrams are automatically computed by R2R, using approaches described previously [16]. R2R graphically depicts the extent of conservation at nucleotide positions within an RNA. To reduce bias caused by highly similar sequences, sequences are weighted by the GSC algorithm [24] as implemented by the Infernal software package [25]. If the weighted frequency of a single nucleotide at a position exceeds 75%, R2R draws that nucleotide with a specific color (e.g., Figure 1) to indicate whether its frequency exceeds 75%, 90% or 97%, although these thresholds can be adjusted. Otherwise, if the position is occupied by a purine, or by a pyrimidine, with a weighted frequency above 75%, R2R indicates whether this frequency exceeds the same thresholds. The conservation of purine or pyrimidine identity is often associated with structural constraints. If a position does not meet the preceding criteria, R2R reports whether any nucleotide is present at the position with a weighted frequency greater than 50%, 75%, 90% or 97%; otherwise, it does not show the position.
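The decision order described above can be sketched for one alignment column as follows. This is an illustrative reading of the rules, not R2R's code: the sequence weights are taken as given (the GSC weighting is not reproduced), gap characters are assumed to count toward the total weight, and the function names and return labels ("R", "Y", "n") are hypothetical.

```python
PURINES = ("A", "G")
PYRIMIDINES = ("C", "U")

def highest_threshold(freq, thresholds):
    """Largest threshold strictly exceeded by freq, or None if none is."""
    for t in sorted(thresholds, reverse=True):
        if freq > t:
            return t
    return None

def annotate_column(column, weights, levels=(0.75, 0.90, 0.97), presence_floor=0.50):
    """Return (symbol, threshold) for one column, or None if it is not shown."""
    total = sum(weights)
    weight_of = {nuc: 0.0 for nuc in "ACGU"}
    for symbol, w in zip(column, weights):
        if symbol in weight_of:  # anything else is treated as a gap
            weight_of[symbol] += w
    freq = {nuc: w / total for nuc, w in weight_of.items()}

    # 1. A single conserved nucleotide.
    best = max(freq, key=freq.get)
    level = highest_threshold(freq[best], levels)
    if level is not None:
        return best, level

    # 2. Conserved purine (R) or pyrimidine (Y) identity.
    for label, members in (("R", PURINES), ("Y", PYRIMIDINES)):
        level = highest_threshold(sum(freq[m] for m in members), levels)
        if level is not None:
            return label, level

    # 3. Any nucleotide present at the position.
    level = highest_threshold(sum(freq.values()), (presence_floor,) + tuple(levels))
    if level is not None:
        return "n", level
    return None

# Hypothetical column: five sequences with equal weights, one of them gapped.
annotation = annotate_column("ACGU-", [1, 1, 1, 1, 1])
```

Each rule is tried only if the previous one fails, so a position reported as a conserved purine is one where no single nucleotide exceeded the lowest threshold.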
R2R does not indicate other patterns of conservation. For example, the nucleotide immediately 5' to the hammerhead ribozyme cleavage site must be A, C or U [26], but this will not be indicated automatically by R2R. We concluded that routinely annotating such conservation patterns would unduly complicate diagrams, and that users can add such distinctions where desired. We also considered using entropy [27] as a measure of conservation. Although entropy measures conservation in a more general manner, we found it difficult to develop an intuition for how specific levels of entropy relate to likely biochemical constraints.
R2R marks each predicted base pair to indicate covariation (e.g., Figure 1). If two RNAs can form a Watson-Crick or G-U base pair at equivalent locations, and the base pair identities differ at both positions (e.g., A-U in one sequence and C-G in another), R2R classifies the base pair as covarying. If they vary at only one position (e.g., A-U in one sequence and G-U in another), the base pair is considered to carry a compatible mutation. Base pairs whose nucleotides are invariant have no mutational evidence for or against such base-pair predictions, and are marked accordingly. Each of these classifications is indicated unobtrusively by shading the base pairs with specific colors. Positions that contain non-canonical base pairs with a frequency exceeding 10% are not shaded.
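The classification of a single predicted base pair can be sketched as below. This follows the rules as stated in the text and is not R2R's implementation. It assumes that sequences with a gap at either position are ignored, that the non-canonical frequency is unweighted, and that covariation takes precedence over a compatible mutation when both are observed. The function name and the returned labels are hypothetical.

```python
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def classify_base_pair(left_column, right_column, max_noncanonical=0.10):
    """Classify one predicted base pair from its two alignment columns."""
    pairs = [(a, b) for a, b in zip(left_column, right_column)
             if a in "ACGU" and b in "ACGU"]  # skip sequences with gaps
    if not pairs:
        return "no data"

    noncanonical = sum(1 for p in pairs if p not in CANONICAL)
    if noncanonical / len(pairs) > max_noncanonical:
        return "not shaded"

    observed = sorted({p for p in pairs if p in CANONICAL})
    covarying = compatible = False
    for i, (a1, b1) in enumerate(observed):
        for a2, b2 in observed[i + 1:]:
            if a1 != a2 and b1 != b2:
                covarying = True   # both positions differ, e.g. A-U vs C-G
            else:
                compatible = True  # only one position differs, e.g. A-U vs G-U
    if covarying:
        return "covarying"
    if compatible:
        return "compatible mutation"
    return "invariant"

# The examples from the text: A-U with C-G, and A-U with G-U.
first = classify_base_pair("AC", "UG")
second = classify_base_pair("AG", "UU")
```

Here `first` is "covarying" and `second` is "compatible mutation", matching the two examples given in the paragraph above.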
This automated R2R annotation does not reflect the extent or confidence of covariation. While such information can be useful, we believe that thorough evaluation of covariation evidence ultimately requires analysis of the full sequence alignment. For example, misleading covariation can result from an incorrect alignment of sequences, or from alignments of sequences that do not function as structured RNAs. Unfortunately, there is no accepted method to assign confidence that entirely eliminates the need to analyze the full alignment.
User effort required with R2R
Since R2R's overriding goal is to facilitate highly aesthetic diagrams, it requires the user to give explicit instructions to customize the RNA layout (e.g., Figure 1C), and to edit R2R's raw output in a general-purpose drawing program (e.g., compare raw Figure 1B with finished Figure 1A). In our experience, this manual effort is usually modest. The ~800-nucleotide GOLLD RNA [18] structure took us roughly 16 hours to draw using R2R, mainly owing to the challenge of finding a layout that fits within a page. However, most RNAs are hairpin structures that do not require any customization, and we drew these easily in minutes. RNAs with complex structural features (e.g., pseudoknots or multistem junctions where the default layout is unsatisfactory) or annotations (e.g., modular structures or nucleotide positions with special significance) were still usually completed within 30-60 minutes.
Limitations
Despite the capabilities offered by R2R, we see some areas for improvement. First, a graphical user interface would make R2R accessible to more researchers, and could make some tasks faster for all users. Second, numerous additional layout features could be added to enrich diagrams, particularly for RNAs with unusual biochemical features. Third, further automation of layout selection would speed the use of R2R. Fourth, R2R is not designed to draw schematic diagrams that display extensive tertiary interactions, or projected diagrams whose layout better reflects the positions of nucleotides or substructures in atomic-resolution models (e.g., the newer secondary structure format for group I introns [28]).