 Methodology article
 Open Access
RNACompress: Grammar-based compression and informational complexity measurement of RNA secondary structure
BMC Bioinformatics volume 9, Article number: 176 (2008)
Abstract
Background
With the rapid emergence of RNA databases and newly identified noncoding RNAs, an efficient compression algorithm for RNA sequence and structural information is needed for the storage and analysis of such data. Although several algorithms for compressing DNA sequences have been proposed, none of them are suitable for the compression of RNA sequences with their secondary structures simultaneously. This kind of compression not only facilitates the maintenance of RNA data, but also supplies a novel way to measure the informational complexity of RNA structural data, raising the possibility of studying the relationship between the functional activities of RNA structures and their complexities, as well as various structural properties of RNA based on compression.
Results
RNACompress employs an efficient grammar-based model to compress RNA sequences and their secondary structures. The algorithm has two main goals: (1) to provide a robust and effective way to compress RNA structural data; and (2) to design a suitable model that represents RNA secondary structure and derives the informational complexity of the structural data from compression. Our extensive tests have shown that RNACompress achieves a universally better compression ratio compared with other sequence-specific or common text-specific compression algorithms, such as GenCompress, WinRAR and gzip. Moreover, a test of the activities of distinct GTP-binding RNAs (aptamers) compared with their structural complexity shows that our defined informational complexity can be used to describe how complexity varies with activity. These results lead to an objective means of comparing the functional properties of heteropolymers from the information perspective.
Conclusion
A universal algorithm for the compression of RNA secondary structure as well as the evaluation of its informational complexity is discussed in this paper. We have developed RNACompress, as a useful tool for academic users. Extensive tests have shown that RNACompress is a universally efficient algorithm for the compression of RNA sequences with their secondary structures. RNACompress also serves as a good measurement of the informational complexity of RNA secondary structure, which can be used to study the functional activities of RNA molecules.
Background
Ribonucleic acid (RNA) is an important class of molecules that performs a wide range of biological and chemical functions. Traditionally, most RNA molecules were regarded as being involved in the process of translation, including transfer RNA (tRNA) and ribosomal RNA (rRNA). Since the late 1990s, it has been widely acknowledged that other types of functional RNA molecules exist, such as non-protein-coding RNAs. These RNAs are found in organisms ranging from bacteria to mammals and affect a wide variety of processes including plasmid replication, phage development, bacterial virulence, chromosome structure, DNA transcription and RNA modification [1–5]. RNA has recently become the center of much attention because of its functions as well as catalytic properties, leading to a substantially increased interest in identifying new RNAs and obtaining their structural information [6–8]. Furthermore, RNA databases such as NONCODE [9], Rfam [10], RNaseP [11] and RNAdb [12] have grown two- to three-fold annually.
To facilitate the maintenance and analysis of such RNA data, an efficient compression algorithm for RNA sequences is needed. Algorithms for compressing DNA sequences include GenCompress [13], DNACompress [14], Biocompress [15] and Cfact [16]. However, these algorithms are only suitable for compressing the primary sequences of DNA. For RNA, we are more interested in designing a novel compression algorithm that compresses the primary sequence together with its secondary structure information. RNA secondary structure is similar to an alignment of nucleic acid sequences, except that the sequence folds back on itself and "complementary bases" pair (commonly A-U, G-C, G-U) rather than identical or similar bases [17]. The functions of RNA are closely related to its structural characteristics, and as such obtaining RNA secondary structure information (either experimentally or computationally) has been an important and interesting problem for several decades [17].
From a strictly mathematical point of view, compression implies understanding and comprehension [18]. Biological sequence compression is a useful tool to recover information from biological sequences. Better compression often implies better understanding. Compressing RNA sequence with secondary structure means that we can capture the essences of RNA sequence information and its structural information simultaneously. From an application point of view, we can derive the informational complexity of RNA structural data based on compression, which can be used to study the structural features and other various properties of RNAs.
In our study, we have developed an efficient grammar-based algorithm to compress an RNA sequence together with its secondary structure. The software RNACompress, developed for the Windows and Linux platforms, is freely accessible at our website. We have also defined the informational complexity of RNA structural data based on compression coupled with the theory of Kolmogorov complexity [18]. This kind of informational complexity will be used to study the relationship between binding activities and structural complexity of RNA aptamers.
To the best of our knowledge, this is the first published study of the compression of biological sequences together with structural information. Additionally, we apply the results to study functional activities of RNAs. The key idea of our compression algorithm is to use dot-bracket notation [17] to represent the secondary structure of RNA and to define specific context-free grammars (CFG) that model RNA secondary structure together with its primary sequence during compression and decompression. Furthermore, several parsing and coding approaches are incorporated to facilitate the whole procedure, including (1) using an LL(1) parser to obtain the leftmost derivation of the defined grammars for the RNA primary sequence and its secondary structure, and (2) using Huffman coding to encode the symbol stream of the leftmost derivation to achieve the most economical compression result. Extensive tests have shown that our algorithm is fast, robust and effective, and obtains a universally better compression ratio than the common text-based compression tools or primary-sequence-specific compression tools in the compression of RNA sequence with its structure. These results show that our program is a useful tool for RNA data maintenance and analysis.
Results
Algorithm
Generally speaking, grammar-based compression starts by inferring a context-free grammar that represents the string. The resulting grammar is encoded as a symbol stream, which is then converted into binary. Each step affects the final size of the compressed file. In our algorithm, each step is designed specifically for the goal of compressing RNA sequence and structure. The main schema of our grammar-based compression and decompression is shown in Figure 1. Two specific grammars G_{1} and G_{2} are defined in our study: G_{1} is the key grammar that represents the RNA primary sequence together with its secondary structure, while G_{2} only models the dot-bracket sequence of the RNA secondary structure and serves as a complement to G_{1} that guides its derivation order (as shown later). We start by parsing the dot-bracket sequence with G_{2} using a leftmost derivation [19] to get a grammar tree T_{2}. At the same time, each derivation step is mapped to construct another grammar tree T_{1} based on G_{1}. Finally, the symbol stream of this leftmost derivation of T_{1} is encoded using Huffman coding [20], so that the probability of each unpaired base and base pair occurring in the whole secondary structure is taken into account to obtain the most economical coding result. For decompression, the reverse procedure is performed. It should be noted that once the grammar tree T_{1} or T_{2} is regenerated during decompression, the corresponding primary sequence and secondary structure in dot-bracket notation can be regained using the postorder traversal [19] of the leaves of these grammar trees, respectively. More details are presented in the following.
A. Context-free grammars of RNA sequence and structure
In our study, the RNA primary sequence is represented in FASTA format, beginning with a single-line description or comment, followed by lines containing sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. For each sequence, the corresponding secondary structure is represented in dot-bracket notation. Dot-bracket notation is the dominant RNA secondary structure format. It uses dots to represent unpaired bases and brackets to represent base pairs in RNA stems. Many useful tools use this format as input (and output) and hence it has become an unofficial standard [21, 22]. For our compression, dot-bracket notation proved to be an efficient way to represent RNA secondary structure and is suitable for our grammar parser. A simple example of our input file for compression is shown in Figure 2.
We have defined two concise context-free grammars G_{1} and G_{2} to model the RNA primary sequence and its secondary structure information. A CFG is very similar to a finite automaton [23], and has been proved to be an efficient model to study RNA secondary structure. It contains the following elements:
(1). Terminals – a symbol that represents a constant value
(2). Nonterminals – a symbol that has the capability of being further defined in terms of terminals and/or nonterminals, usually denoted by a capital letter.
(3). Production rules – rules by which nonterminals can be replaced.
In our study, two grammars are defined as:
G_{1}:
S: LS | e
L: aSu | uSa | cSg | gSc | uSg | gSu | a | u | c | g
G_{2}:
S: LS | e
L: (S) | •
For both grammars, S and L are nonterminals, e is the empty string, and the symbols a, u, c, g, (, ) and • are terminals representing the 4 different bases, left bracket, right bracket and dot, respectively.
Here G_{1} is a combined grammar that analyzes the RNA primary sequence and secondary structure simultaneously. It can model the Watson-Crick base pairs A-U and G-C and the wobble base pair G-U in RNA secondary structure. G_{2} is aimed at modeling the dot-bracket sequence of the RNA secondary structure. With these two grammars, two kinds of grammar trees can be generated for the RNA primary sequence and secondary structure, respectively, as shown in Figure 3. It should be noted that G_{1} is ambiguous, meaning that the same primary sequence can be generated by more than one grammar tree, while G_{2} is unambiguous, meaning that one dot-bracket string corresponds to only one grammar tree [19]. Thus we use G_{2} to guide G_{1} during grammar parsing to identify the grammar tree of G_{1}.
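The ambiguity of G_{1} can be made concrete by counting grammar trees. The following Python sketch is illustrative only and is not part of RNACompress; the function name count_g1_trees and the pair table ALLOWED_PAIRS are assumptions introduced here, built directly from the production rules of G_{1} listed above. It uses plain recursion and is intended for short sequences.

```python
from functools import lru_cache

# Base pairs allowed by G1: Watson-Crick (A-U, G-C) and wobble (G-U).
ALLOWED_PAIRS = {("a", "u"), ("u", "a"), ("c", "g"),
                 ("g", "c"), ("g", "u"), ("u", "g")}


def count_g1_trees(sequence):
    """Count the distinct G1 grammar trees that generate `sequence`."""
    seq = sequence.lower()

    @lru_cache(maxsize=None)
    def count_s(i, j):
        # Number of ways the nonterminal S can derive seq[i:j].
        if i == j:
            return 1  # S -> e
        # S -> L S with L -> seq[i] (an unpaired base).
        total = count_s(i + 1, j)
        # S -> L S with L -> seq[i] S seq[k] (a base pair closed at k).
        for k in range(i + 1, j):
            if (seq[i], seq[k]) in ALLOWED_PAIRS:
                total += count_s(i + 1, k) * count_s(k + 1, j)
        return total

    return count_s(0, len(seq))
```

For the two-base sequence "gc" the count is 2: the bases are either both unpaired or form one G-C pair. Each of these trees corresponds to a different dot-bracket string, which is why the unambiguous G_{2} can select the intended tree.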
B. Compression algorithm
Based on the two grammars we have defined, we are able to perform the compression as shown in Figure 1. In the following we also take the RNA sequence in Figure 3 as an example to demonstrate the whole compression procedure. First we discuss several computational approaches used in our work.
LL(1) parser
We start by parsing the dot-bracket sequence of the RNA secondary structure using G_{2}, and an LL(1) parser is used to obtain the leftmost derivation of the input sequence. An LL parser is a top-down parser for a subset of the context-free grammars [24]. It parses the input from left to right, and constructs a leftmost derivation of the sentence. Practically, there are two common ways to describe how a given string can be derived from the start symbol of a given grammar. The simplest way is to list the consecutive strings of symbols, beginning with the start symbol and ending with the string, and the rules that have been applied. If we introduce a strategy such as "always replace the leftmost nonterminal first", then for context-free grammars the list of applied grammar rules is by itself sufficient. This is defined as the leftmost derivation of a string [19].
The LL(1) parser uses one token of lookahead when parsing a sentence. The parser consists of:
(1) an input buffer, holding a string from the grammar;
(2) a stack on which to store the terminals and nonterminals from the grammar yet to be parsed;
(3) a parsing table which tells it what (if any) grammar rule to apply given the symbols on top of its stack and the next input token.
In our study, the parser applies the rule in the parsing table (Table 1), which we have defined for grammar inference of RNA secondary structure, by matching the topmost symbol on the stack (row) with the current symbol in the input stream (column). When the parser starts, the stack already contains two symbols: [S, #], where '#' is a special terminal that indicates the bottom of the stack and the end of the input stream, and 'S' is the start symbol of the grammar. The parser attempts to rewrite the contents of this stack to match what it sees on the input stream. One of three types of step is taken, depending on whether the top of the stack is a nonterminal, a terminal or the special symbol #:
(1) If the top of the stack is a nonterminal symbol, the nonterminal symbol and the symbol on the input stream are looked up in the parsing table to determine which rule of the grammar to use. The nonterminal is replaced on the stack by the right-hand side of that rule, and the number of the rule is written to the output stream. If the parsing table indicates that there is no such rule, the parser reports an error and stops.
(2) If the top of the stack is a terminal symbol, it is compared to the symbol on the input stream. If they are equal, they are both removed. If they are not equal, the parser reports an error and stops.
(3) If the top of the stack is # and the input stream also shows #, the parser reports that it has successfully parsed the input; otherwise it reports an error. In both cases the parser stops.
These steps are repeated until the parser stops, at which point it has either completely parsed the input and written a leftmost derivation to the output stream, or reported an error.
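The three step types above can be sketched as a small table-driven parser for G_{2}. This Python sketch is illustrative and is not the authors' C/C++ implementation. The rule numbering in RULES and the entries of TABLE are assumptions derived here from the productions of G_{2}, since Table 1 is not reproduced in this text, and the ASCII dot "." stands in for the terminal •.

```python
# Hypothetical rule numbering for G2 (not taken from the paper's Table 1).
RULES = {
    1: ["L", "S"],        # S -> L S
    2: [],                # S -> e (empty string)
    3: ["(", "S", ")"],   # L -> ( S )
    4: ["."],             # L -> .
}

# Parsing table derived from G2: (stack top, lookahead) -> rule number.
TABLE = {
    ("S", "("): 1, ("S", "."): 1,
    ("S", ")"): 2, ("S", "#"): 2,
    ("L", "("): 3, ("L", "."): 4,
}


def ll1_leftmost_derivation(dot_bracket):
    """Return the rule numbers of the leftmost derivation of `dot_bracket`."""
    if any(ch not in "()." for ch in dot_bracket):
        raise ValueError("input must contain only '(', ')' and '.'")
    tokens = list(dot_bracket) + ["#"]   # '#' marks the end of the input
    stack = ["#", "S"]                   # the top of the stack is the list end
    pos = 0
    output = []
    while True:
        top, look = stack[-1], tokens[pos]
        if top == "#":
            # Step type (3): accept only if the input is exhausted as well.
            if look == "#":
                return output
            raise ValueError(f"unexpected {look!r} at position {pos}")
        if top in ("S", "L"):
            # Step type (1): expand the nonterminal using the parsing table.
            rule = TABLE.get((top, look))
            if rule is None:
                raise ValueError(f"no rule for ({top}, {look!r}) at {pos}")
            output.append(rule)
            stack.pop()
            stack.extend(reversed(RULES[rule]))
        elif top == look:
            # Step type (2): matching terminals are removed from both sides.
            stack.pop()
            pos += 1
        else:
            raise ValueError(f"expected {top!r}, found {look!r} at {pos}")
```

For the structure "(.)" this sketch returns the rule sequence 1, 3, 1, 4, 2, 2 under the numbering assumed above.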
Map leftmost derivation of G_{2} to G_{1}
As mentioned above, G_{2} is used to guide the leftmost derivation of G_{1}, since G_{1} is ambiguous. The mapping of the leftmost derivation of G_{2} to G_{1} is straightforward: '()' is mapped to the corresponding base pair of the RNA secondary structure and '•' is mapped to the corresponding unpaired base. After this mapping, a leftmost derivation of G_{1} is obtained, and Huffman coding is performed on the symbol stream of this leftmost derivation to encode it into a bit stream, as discussed below.
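The mapping can be sketched as follows. This Python sketch is an illustration under stated assumptions, not the authors' implementation: the function name map_g2_to_g1 and the rule numbers (1: S → LS, 2: S → e, 3: L → (S), 4: L → •) are hypothetical, and the input is assumed to be a balanced dot-bracket string with a derivation that has already been accepted by an LL(1) parser for G_{2}. It relies on the fact that, in a leftmost derivation, the k-th L rule always produces the k-th character of the structure that is not a closing bracket.

```python
def map_g2_to_g1(derivation, sequence, structure):
    """Map a leftmost derivation of G2 to the corresponding G1 rules."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")

    # Pair table: position of each '(' -> position of its matching ')'.
    partner = {}
    open_positions = []
    for i, ch in enumerate(structure):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")":
            partner[open_positions.pop()] = i

    # Each L rule consumes the next position that is not a closing bracket.
    starts = iter(i for i, ch in enumerate(structure) if ch != ")")

    g1_rules = []
    for rule in derivation:
        if rule == 1:
            g1_rules.append("S -> LS")
        elif rule == 2:
            g1_rules.append("S -> e")
        elif rule == 3:
            # '()' becomes the base pair found at the two bracket positions.
            i = next(starts)
            g1_rules.append(f"L -> {sequence[i]}S{sequence[partner[i]]}")
        elif rule == 4:
            # A dot becomes the unpaired base at that position.
            i = next(starts)
            g1_rules.append(f"L -> {sequence[i]}")
        else:
            raise ValueError(f"unknown rule number {rule}")
    return g1_rules
```

For the sequence "gac" with structure "(.)" and derivation 1, 3, 1, 4, 2, 2, the mapped G_{1} rules are S → LS, L → gSc, S → LS, L → a, S → e, S → e.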
Huffman coding
Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol, where the table has been derived in a particular way from the estimated probability of occurrence of each possible value of the source symbol [20]. Huffman coding yields the most efficient code of this type: no other mapping of individual source symbols to unique strings of bits produces a smaller average output size when the actual symbol frequencies agree with those used to create the code.
We use a variable-length code table to encode the symbol stream of the leftmost derivation of G_{1}, based on the probability associated with each production rule. G_{1} can thus be viewed as a stochastic context-free grammar (SCFG) [25]. We have derived the rule probabilities from a complete statistical analysis of the frequency distribution of base pairs and unpaired bases in different RNA secondary structures using the RNA structural database RNABase [26]. Nearly 1200 RNA sequences covering diverse RNAs, including tRNA, rRNA and noncoding RNA, were examined, and the final statistical probabilities are listed in Table 2.
It should be noted that for different types of RNA, or RNA in different species, the frequency distributions of base pairs and unpaired bases differ, and thus the production probabilities of the rules differ. However, since we aim at designing a universal compression algorithm for all types of RNA, we make use of these general probabilities here. For more specific RNA types, more specific probabilities can be used.
The Huffman tree based on our statistical probabilities is shown in Figure 4. Finally, variable-length bit codes are generated to encode the different production rules.
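To illustrate how variable-length codes follow from rule probabilities, the Python sketch below builds a Huffman code with a priority queue. It is a generic textbook construction, not the authors' code. The values in EXAMPLE_PROBABILITIES are hypothetical placeholders chosen for illustration; they are not the probabilities of Table 2, and the resulting codewords are not those of Figure 4.

```python
import heapq
from itertools import count

# Hypothetical probabilities for the L rules of G1 (NOT the values of Table 2).
EXAMPLE_PROBABILITIES = {
    "L -> a": 0.20, "L -> u": 0.15, "L -> g": 0.13, "L -> c": 0.10,
    "L -> gSc": 0.12, "L -> cSg": 0.12, "L -> aSu": 0.07, "L -> uSa": 0.07,
    "L -> gSu": 0.02, "L -> uSg": 0.02,
}


def huffman_code(probabilities):
    """Return a dict mapping each symbol to its Huffman codeword."""
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(p, next(tie), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # A single symbol still needs one bit to be written out.
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing their codewords.
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in codes0.items()}
        merged.update({sym: "1" + code for sym, code in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]
```

Calling huffman_code(EXAMPLE_PROBABILITIES) gives frequent rules shorter codewords than rare ones, and no codeword is a prefix of another, so the bit stream can be decoded without separators.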
Example
We take the RNA sequence in Figure 3 as an example to demonstrate the whole compression procedure, as shown in Table 3. The input is an RNA primary sequence with its secondary structure in dot-bracket notation. Each step of the LL(1) parser and the corresponding operation is also listed.
The final bit stream of this leftmost derivation is 0 01111 0 1110 0 101 0 00 0 010 0 1111 1 1 1 0 010 1, for a total of 35 bits.
Definition of compression ratio
In our work, the compression ratio of the compression algorithm can be computed in two ways:
R_{1} = uncompressed_file_bytesize/compressed_file_bytesize, or R_{2} = (n × (H_{1} + H_{2}))/o, where n is the number of the bases in input RNA sequence. H_{1} and H_{2} are the information entropy of the RNA primary sequence and secondary structure, respectively. o is the number of bits in compressed file.
While R_{1} is a straightforward way to compute the compression ratio based on the byte sizes of the input file and the compressed file, R_{2} is more specific and is based on entropy theory. In the definition of R_{2}, the uncompressed input file is divided into two parts: the RNA primary sequence and the RNA secondary structure in dot-bracket notation. From an information entropy perspective [27], since there are four bases (A, U, G, C) in the primary sequence and three characters "(", ")" and "." in the secondary structure, the average information entropy of the primary sequence (H_{1}) and of the secondary structure (H_{2}) can be computed as:
H_{1} = -\sum_{i=1}^{4} P_{i} \log_{2} P_{i}, \quad H_{2} = -\sum_{i=1}^{3} P'_{i} \log_{2} P'_{i}
where P_{i} and P'_{i} are the occurrence probabilities of each base and of each character in dot-bracket notation, respectively. If we consider an RNA sequence of infinite length, then P_{i} = 1/4 and P'_{i} = 1/3, assuming independent and uniformly distributed occurrences of the 4 bases and the 3 characters, and thus H_{1} = 2 and H_{2} ≈ 1.585. This means that 2 bits per base are enough to encode the RNA primary sequence and 1.585 bits per character are enough to encode the RNA secondary structure in dot-bracket notation. Note that in our implementation, the occurrence probabilities of the 4 bases and 3 characters are computed for the particular RNA.
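The entropy terms and R_{2} can be computed directly from an input record. The Python sketch below is illustrative only; the function names shannon_entropy and compression_ratio_r2 are assumptions introduced here, and the occurrence probabilities are estimated from the character counts of the given strings, as the text describes for the implementation.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Average information entropy of `text` in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def compression_ratio_r2(sequence, structure, compressed_bits):
    """R2 = n * (H1 + H2) / o for one sequence and its dot-bracket string."""
    n = len(sequence)                # number of bases
    h1 = shannon_entropy(sequence)   # entropy of the primary sequence
    h2 = shannon_entropy(structure)  # entropy of the dot-bracket string
    return n * (h1 + h2) / compressed_bits
```

With equally frequent symbols the sketch reproduces the limits given above: shannon_entropy("acgu") is 2 bits and shannon_entropy("(.)") is log_{2} 3 ≈ 1.585 bits.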
C. Informational complexity
The definition of the informational complexity of RNA structural data is based on the concept of Kolmogorov complexity. The Kolmogorov complexity K(x) of an object x is defined as the length of the shortest program P for a universal Turing machine U that outputs x [18]. Intuitively, K(x) represents the minimal amount of information required to generate x by an algorithm.
It is well known that there is a relationship between Kolmogorov complexity of sequences and Shannon information theory [28]: the expected Kolmogorov complexity of a sequence x is asymptotically close to the entropy of the information source emitting x. However, Kolmogorov complexity is noncomputable in the Turing sense [18] and in practical applications it is approximated by the length of the compressed sequence calculated by a compression algorithm [18].
In summary, the informational complexity of a given RNA sequence with its secondary structure is approximated by the length of the compressed bit string produced by RNACompress. This definition is straightforward, yet it has rigorous theoretical support. The experiments below show that our informational complexity can reveal the relationship between structural complexity and functional activity of RNA aptamers, which could be useful in predicting the functional utility of novel heteropolymers.
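The idea of approximating Kolmogorov complexity by compressed length can be illustrated with any lossless compressor. The Python sketch below uses the general-purpose zlib library purely as a stand-in; it is not RNACompress, the bit counts it returns are not comparable to those reported in this paper, and the function name approx_complexity_bits is an assumption introduced here.

```python
import zlib


def approx_complexity_bits(sequence, structure):
    """Upper-bound estimate of informational complexity, in bits.

    zlib is only a stand-in compressor for illustration; RNACompress would
    use its grammar-based coder in place of zlib.compress.
    """
    data = (sequence + "\n" + structure).encode("ascii")
    return 8 * len(zlib.compress(data, 9))
```

A highly regular record, such as a run of one base with no pairing, yields far fewer bits than an irregular record of the same length, which is the behavior expected of a complexity measure.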
Experimental testing
Our experiments are performed in two parts: first the compression ability of RNACompress is tested, and secondly the results are applied to reveal the relationship between binding activities and structural complexity of RNA aptamers.
A. Compression ability
We have tested the compression ability of RNACompress on 7 benchmark files that are freely accessible at our website. These 7 data files were generated from different databases or curated from the literature, and cover diverse types of RNA molecules with their secondary structures (computationally predicted or experimentally validated), including rRNA, tRNA and small noncoding RNAs. Note that compressing a single RNA sequence makes no sense from a statistical perspective, since an RNA sequence is generally much shorter than DNA sequences or whole genomes. In our test, each input file therefore contains a set of RNAs. The level of sequence identity also differs from file to file. The intention of using these diverse data files is twofold: to test the behavior of our algorithm when compressing files with different sequence identities, and to demonstrate that our algorithm is universally efficient for different types of RNA. Detailed descriptions of these 7 test files are listed in Table 4.
We have compared the running time and compression ratios (R_{1} and R_{2}) of RNACompress with those of three other algorithms: GenCompress, WinRAR and gzip. GenCompress (of which DNACompress is the newest version) has been reported to be the top algorithm among the sequence-specific compression algorithms, such as Biocompress and Cfact. The other two, WinRAR (a commercial tool) and gzip (based on Lempel-Ziv coding, LZ77), are classical text compression tools widely used on Windows and Linux, respectively. Detailed comparisons between RNACompress and these three algorithms are listed in Table 5.
It can be seen that RNACompress achieves the best compression ratio, at a speed comparable to that of the other algorithms, on all test files except two: rRNA.txt and miRNA.txt. For rRNA.txt, the sequence identities are nearly 90%. GenCompress and the two general-purpose compression tools capture the repeated patterns in this file efficiently and thus achieve better results. The same reason holds for miRNA.txt. Furthermore, microRNAs are generally short RNA molecules of about 21–23 nucleotides in length, so they are less compressible than longer sequences. Although GenCompress is efficient at searching for approximate matches and reverse complements, its running time was found to be impractically long when the input file is large.
Essentially, our compression algorithm is based on grammar inference and Huffman coding, and currently does not consider the repeat patterns of the input file. This is why RNACompress fails to achieve the best compression ratio when the sequence identities within a set of RNAs are high. Our algorithm is, however, very robust across different types of RNA and is influenced little by the arrangement of the input file. For the three other algorithms, if we rearrange the same set of RNAs in a different order and artificially space out two highly identical sequences, their compression ratios decrease dramatically. In addition, there are other algorithms based on mechanisms other than searching for repeat patterns. One of these is PPM [29], which uses a specialized form of compression based on Markov modeling. Unfortunately, these algorithms are generally computationally expensive in exchange for their higher compression.
B. Aptamer activity and complexity
To further demonstrate the applications of RNACompress, we present a comparison of the structural complexities and activities of RNA aptamers, which was initially conducted by Carothers et al. [30]. In their study, a remarkable correspondence was found between the affinities of eleven GTP-binding RNAs and the intricacy of their secondary structures, i.e., aptamers with higher-affinity binding to a target molecule are likely to have more structural informational complexity. However, an efficient calculation of informational complexity was missing in their study. The authors pointed out the difficulty and ambiguity of determining the amount of information in the stems of RNA secondary structures and presented three complicated methods to compute it. In our study, we have applied our defined informational complexity to measure the whole structural complexity of RNA aptamers, which makes the calculation more straightforward. Moreover, we have also calculated the Spearman rank correlation coefficient (r_{s}) between the aptamer informational complexity and the binding activities, as done by Carothers et al. Our results are consistent with their study, which indicates that the informational complexity defined here is reasonable when studying the relationship between functional activities and structural complexity of RNA molecules (Table 6). More detailed information on the eleven GTP-binding RNAs is listed in Additional file 1.
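For readers who wish to repeat this kind of analysis, the Spearman rank correlation coefficient can be computed as the Pearson correlation of the ranks. The Python sketch below is a generic, dependency-free illustration rather than the procedure used by the authors; the function names average_ranks and spearman_rs are assumptions introduced here, and tied values receive the mean of their ranks.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

Here x would hold the informational complexity of each aptamer in bits and y the corresponding binding activity; a value of r_{s} close to 1 indicates that complexity increases monotonically with activity.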
Discussion
Generally speaking, if we treat both RNA sequences and the representation of their secondary structures as text, any text-specific compression algorithm can be used to compress them. However, such compression has no biological meaning and obscures the original RNA structural information, although it may achieve higher compression ratios. From a biological perspective, RNACompress is more attractive than these tools because it is not only an efficient algorithm for compressing RNAs, but also a suitable model for representing RNA data. These compression and representation abilities are based on our grammar inference, which is inherently suited to capturing the structural essence of RNA.
In addition, several interesting issues remain in our study that need to be discussed or investigated in the future.
(1) Currently we focus on modeling the two dominant types of base pairs in RNA secondary structure: Watson-Crick pairs and wobble pairs. Other minor variations of base pairing also exist in nucleic acids, such as the Hoogsteen base pair (A-T) [31]. One remaining challenge is how to incorporate the modeling of these minor base pairs while maintaining the compression ratios.
(2) One promising way to improve the compression ability of RNACompress is to consider the repeat patterns of RNA motifs in RNA secondary structure. This is different from the repeat patterns identified in primary sequences, as used in GenCompress and similar tools. It would also help to approximate the Kolmogorov complexity and evaluate the informational complexity more accurately. RNA motifs are basic building blocks used repeatedly, and in various combinations, to form different RNA types and define their unique structural and functional properties. Many algorithms for RNA motif identification have been proposed [6, 32, 33]. However, these efforts have been only moderately successful in defining simple RNA structures. A powerful algorithm to capture complex structural domains or various noncanonical pairings in RNA motifs is still needed.
(3) Another application of compressing RNA secondary structure is as an alignment-free tool for RNA secondary structure comparison. A universal (dis)similarity measure (USM) can be defined to measure the pairwise distance of RNA secondary structures based on the compression, as we will demonstrate elsewhere (Qi Liu et al., RNA secondary structure comparison based on compression: a methodological study, manuscript in preparation).
Conclusion
In this article we have introduced a universal algorithm for the compression of RNA secondary structure as well as the evaluation of its informational complexity. We have developed RNACompress as a useful tool for academic users. Extensive tests have shown that RNACompress is a universally efficient algorithm for the compression of RNA sequences with their secondary structures. RNACompress also serves as a good measurement of the informational complexity of RNA secondary structure, which can be used to study the functional activities of RNA molecules. Furthermore, we expect future studies to show that our compression algorithm can facilitate the comparison of RNA secondary structures and the study of noncoding RNA structures, providing a new way to investigate RNA properties based on compression.
Availability and Requirements
Project name: RNACompress: Grammar-based compression and informational complexity measurement of RNA secondary structure
Project home page: http://www.wigs.zju.edu.cn/education/students/liuqi/RNACompress.html
Operating systems: Windows 2000/XP and Linux
Programming language: C/C++
References
 1.
Avner P, Heard E: Xchromosome inactivation: counting, choice and initiation. Nat Rev Genet 2001, 2(1):59–67. 10.1038/35047580
 2.
Frank DN, Pace NR: RIBONUCLEASE P: Unity and Diversity in a tRNA Processing Ribozyme. Annual Review of Biochemistry 1998, 67(1):153–180. 10.1146/annurev.biochem.67.1.153
 3.
Kiss T: Small nucleolar RNAguided posttranscriptional modification of cellular RNAs. EMBO J 2001, 20(14):3617–3622. 10.1093/emboj/20.14.3617
 4.
Lankenau S, Corces VG, Lankenau DH: The Drosophila micropia retrotransposon encodes a testisspecific antisense RNA complementary to reverse transcriptase. Molecular and Cellular Biology 1994, 14(3):1764–1775.
 5.
Lowe TM, Eddy SR: A Computational Screen for Methylation Guide snoRNAs in Yeast. Science 1999, 283(5405):1168–1171. 10.1126/science.283.5405.1168
 6.
Batey RT, Rambo RP, Doudna JA: Tertiary motifs in RNA structure and folding. Angew Chem Int Ed 1999, 38: 2326–2343.
 7.
Nykanen A, Haley B, Zamore PD: ATP Requirements and Small Interfering RNA Structure in the RNA Interference Pathway. Cell 2001, 107(3):309–321. 10.1016/S00928674(01)005475
 8.
Zuker M: Computer prediction of RNA structure. Methods Enzymol 1989, 180: 262–288.
 9.
Liu C, Bai B, Skogerb G, Cai L, Deng W, Zhang Y, Bu D, Zhao Y, Chen R: NONCODE: an integrated knowledge database of noncoding RNAs. Nucleic Acids Research 2005, 33(Database Issue):D112D115. 10.1093/nar/gki041
 10.
GriffithsJones S, Bateman A, Marshall M, Khanna A, Eddy SR: Rfam: an RNA family database. Nucleic Acids Research 2003, 31(1):439–441. 10.1093/nar/gkg006
 11.
Brown JW, Journals O: The ribonuclease P database. Nucleic Acids Research 2005, 26(1):351–352. 10.1093/nar/26.1.351
 12.
Pang KC, Stephen S, Engstrom PG, TajulArifin K, Chen W, Wahlestedt C, Lenhard B, Hayashizaki Y, Mattick JS: RNAdba comprehensive mammalian noncoding RNA database. Nucleic Acids Research 2005, 33(Database Issue):D125. 10.1093/nar/gki089
 13.
Chen X, Kwong S, Li M: A compression algorithm for DNA sequences and its applications in genome comparison. Proceedings of RECOMB 2000., 107:
 14.
Chen X, Li M, Ma B, Tromp J: DNACompress: fast and effective DNA sequence compression. Bioinformatics 2002, 18(12):1696–1698. 10.1093/bioinformatics/18.12.1696
 15.
Grumbach S, Tahi F, Inria LC: Compression of DNA sequences. Data Compression Conference, 1993 DCC'93 1993, 340–350.
 16.
Rivals E, Delahaye JP, Dauchet M, Delgrange O: A guaranteed compression scheme for repetitive DNA sequences. Data Compression Conference, 1996 DCC'96 Proceedings 1996.
 17.
Higgs PG: RNA secondary structure: physical and computational aspects. Quarterly Reviews of Biophysics 2001, 33(03):199–253. 10.1017/S0033583500003620
 18.
Li M, Badger JH, Chen X, Kwong S, Kearney P, Zhang H: An informationbased sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 2001, 17(2):149–154. 10.1093/bioinformatics/17.2.149
 19.
Unger SH: A global parser for context-free phrase structure grammars. Communications of the ACM 1968, 11(4):240–247. 10.1145/362991.363001
 20.
Knuth DE: Dynamic Huffman coding. Journal of Algorithms 1985, 6(2):163–180. 10.1016/0196-6774(85)90036-7
 21.
Steffen P, Voss B, Rehmsmeier M, Reeder J, Giegerich R: RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics 2006, 22(4):500. 10.1093/bioinformatics/btk010
 22.
Voss B, Giegerich R, Rehmsmeier M: Complete probabilistic analysis of RNA shapes. BMC Biol 2006, 4:5.
 23.
Hashiguchi K: Limitedness Theorem on Finite Automata With Distance Functions. Journal of Computer and System Sciences 1982, 24(2):233–244. 10.1016/0022-0000(82)90051-4
 24.
Grune D, Jacobs CJH: A programmer-friendly LL(1) parser generator. Software: Practice and Experience 1988, 18(1):29–38. 10.1002/spe.4380180105
 25.
Knudsen B, Hein J: RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics 1999, 15: 446–454. 10.1093/bioinformatics/15.6.446
 26.
Murthy VL, Rose GD: RNABase: an annotated database of RNA structures. Nucleic Acids Research 2003, 31(1):502–504. 10.1093/nar/gkg012
 27.
Campbell J: Grammatical Man: Information, Entropy, Language, and Life. Simon and Schuster; 1982.
 28.
Cover TM, Thomas JA: Elements of Information Theory. Wiley; 1990.
 29.
Moffat A: Implementing the PPM data compression scheme. IEEE Transactions on Communications 1990, 38(11):1917–1921. 10.1109/26.61469
 30.
Carothers JM, Oestreich SC, Davis JH, Szostak JW: Informational Complexity and Functional Activity of RNA Structures. J Am Chem Soc 2004, 126(16):5130–5137.
 31.
Zagryadskaya EI, Doyon FR, Steinberg SV: Importance of the reverse Hoogsteen base pair 54–58 for tRNA function. Nucleic Acids Research 2003, 31(14):3946–3953. 10.1093/nar/gkg448
 32.
Bergig O, Barash D, Kedem K: RNA Motif Search Using the Structure to String (STR2) Method. Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB'04) 2004, 660–661.
 33.
Yao Z, Weinberg Z, Ruzzo WL: CMfinder - a covariance model based RNA motif finding algorithm. Bioinformatics 2006, 22(4):445. 10.1093/bioinformatics/btk008
 34.
Szymanski M, Barciszewska MZ, Erdmann VA, Barciszewski J: 5S Ribosomal RNA Database. Nucleic Acids Research 2002, 30(1):176–178. 10.1093/nar/30.1.176
 35.
Griffiths-Jones S, Grocock RJ, van Dongen S, Bateman A, Enright AJ: miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Research 2006, 34(Database Issue):D140–D144. 10.1093/nar/gkj112
 36.
Torarinsson E, Havgaard JH, Gorodkin J: Multiple structural alignment and clustering of RNA sequences. Bioinformatics 2007, 23(8):926. 10.1093/bioinformatics/btm049
 37.
Engstrom PG, Suzuki H, Ninomiya N, Akalin A, Sessa L, Lavorgna G, Brozzi A, Luzi L, Tan SL, Yang L: Complex loci in human and mouse genomes. PLoS Genet 2006, 2(4):e47. 10.1371/journal.pgen.0020047
 38.
Lestrade L, Weber MJ: snoRNA-LBME-db, a comprehensive database of human H/ACA and C/D box snoRNAs. Nucleic Acids Research 2006, 34(Database Issue):D158–D162. 10.1093/nar/gkj002
 39.
Do CB, Woods DA, Batzoglou S: CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics 2006, 22(14):e90. 10.1093/bioinformatics/btl246
Acknowledgements
The authors would like to thank Dr. Thao Tran and Dr. Ying Xu of the Computational Systems Biology Laboratory, University of Georgia, USA, for their suggestions.
Additional information
Authors' contributions
QL designed the overall computational algorithm and drafted the manuscript. YY was responsible for the software implementation. YZ and JB were responsible for data collection and selection. CC and XY conceived the study and participated in the design and coordination of the analyses. All authors read and approved the final manuscript.
Qi Liu and Yu Yang contributed equally to this work.
Keywords
 Compression Algorithm
 Kolmogorov Complexity
 Huffman Code
 Grammar Tree
 Good Compression Ratio