Grammar-based compression approach to extraction of common rules among multiple trees of glycans and RNAs

Background Many tree structures are found in nature and organisms. Such trees are believed to be constructed on the basis of certain rules. We have previously developed grammar-based compression methods for ordered and unordered single trees, based on bisection-type tree grammars. Here, these methods find construction rules for one single tree. On the other hand, specified construction rules can be utilized to generate multiple similar trees. Results Therefore, in this paper, we develop novel methods to discover common rules for the construction of multiple distinct trees, by improving and extending the previous methods using integer programming. We apply our proposed methods to several sets of glycans and RNA secondary structures, which play important roles in cellular systems, and can be regarded as tree structures. The results suggest that our method can be successfully applied to determining the minimum grammar and several common rules among glycans and RNAs. Conclusions We propose integer programming-based methods MinSEOTGMul and MinSEUTGMul for the determination of the minimum grammars constructing multiple ordered and unordered trees, respectively. The proposed methods can provide clues for the determination of hierarchical structures contained in tree-structured biological data, beyond the extraction of frequent patterns. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0558-4) contains supplementary material, which is available to authorized users.


Background
Many tree structures are found in nature and organisms. One such tree structure is a glycan, in which each monosaccharide is regarded as a vertex, except for cyclic oligosaccharides and so on. Since glycans contain various complicated structures, they are believed to be constructed by various mechanisms that recognize a monosaccharide, binding it with another. For instance, a galactosyltransferase is known to catalyze the biosynthesis with a galactose [1]. Glycans are also known to play several important roles in a cellular system, such as molecular recognition, cell adhesion, and antigen-antibody interactions. Therefore, many studies have been conducted to understand the structures and functions of glycans; in addition, several methods have been developed for the discovery of glycan motifs or significant subtrees, as glycan structures are conserved in evolutionary processes [2][3][4], and to measure the similarities between glycans [5,6].
RNA secondary structures can be also regarded as tree structures; these structures depend on the nucleic acid sequence. RNAs, which are large biological molecules, also perform important functions in living cells, such as the catalysis of biological reactions and expression of genes. Milo et al. displayed a pseudoknot-free RNA secondary structure as an ordered rooted tree, wherein each base pair, unpaired base interval, hairpin loop, internal loop, multi-loop, and external loop corresponds to a vertex, and developed a cubic time algorithm for the homeomorphic subtree alignment problem [7]. They applied it to pairwise alignments of RNA secondary structures, and found several structural similarities, which were not detected by other existing algorithms. Chen and Zhang developed an efficient algorithm for tree edit distance, and used the same to compare several RNA secondary structures [8]. These methods tried to measure the similarities between the tree structures, and to determine frequent subtrees.
In this paper, we focus on finding construction rules for multiple biomolecular tree structures. For example, it was reported that the glycosyltransferases such as ALG1 and ALG2 are involved in the linkages of Glc 3 Man 9 GlcNAc 2 oligosaccharide precursor as shown in Figure 1 [9], where Glc, Man, and GlcNAc stand for glucose, mannose, and N-acetylglucosamine, respectively. ALG1 connects a GlcNAc with a Man, and ALG2 connects a Man with the Man connected by ALG1. According to local structures, different glycosyltransferases catalyze those biosyntheses in order to construct the same structure of the oligosaccharide. Since it is difficult to find existences of such genes from one tree structure, we try to find them from Figure 1 Glycosyltransferases involved in the linkages of the Glc 3 Man 9 GlcNAc 2 oligosaccharide precursor [9]. GlcNAc denotes N-acetylglucosamine. multiple tree structures that the same enzyme constructs a specified local structure.
In the field of computer science, grammar-based compression is used to determine the rules of construction of various types of data. The identification of the smallest grammar for input data would provide a clue towards understanding the construction rules. The determination of the smallest Context-Free Grammar (CFG) constructing a given string is known to be NP-hard [10]. Polynomial-time approximation methods have been developed to determine the smallest CFG for sequence data such as DNA, RNA nucleic acid sequences, and protein amino acid sequences [10][11][12][13]. Although these methods were extended to the tree structured data including XML [14][15][16], the minimum grammar has not always been provided. Hence, in our previous study, we developed grammar-based compression methods for single trees that always output the minimum grammar. We used bisectiontype tree grammars proposed in [17], Simple Elementary Ordered Tree Grammar (SEOTG) for ordered trees and Simple Elementary Unordered Tree Grammar (SEUTG) for unordered trees, in which at most two nonterminal and terminal symbols appear on the right-hand side of each production rule. Since it seemed to be difficult to directly formulate the problems of finding the minimum SEOTG and SEUTG using Integer Programming (IP), we instead formulated the problems of finding SEOTG and SEUTG with a given size [18].
We previously developed methods for compressing single trees. As considered in the example provided in Figure 1, specified construction rules can be applied to generate several similar trees. Therefore, in this study, we attempt to discover common construction rules among multiple distinct trees; in addition, by improving and extending the previous methods, we propose the novel IP formulations MinSEOTGMul and MinSEUTGMul, for the direct determination of the minimum grammars of SEOTG and SEUTG. In our previous study, the problems associated with the determination of the minimum SEOTG and SEUTG were not directly formulated by IP; instead, the problems associated with determining the SEOTG and SEUTG utilizing the given sizes were formulated. Therefore, the previous IP was executed at least twice with different parameters in order to confirm the minimum size of the grammars that construct a given single tree. The methods proposed in this paper can be applied to the direct determination of the minimum grammar in one attempt. As for multiple input trees, our previous method can be trivially extended for the determination of the minimum SEOTG and SEUTG for N trees, which adds a special root vertex, connects each root of N trees to the special root by a distinct special edge, and applies the previous IP formulation to this generated single tree. This approach, however, increases the number of variables in the IP formulation, and may enlarge the execution time.
We apply our methods to several sets of glycans and RNA secondary structures. Consequently, we successfully determined the minimum grammar and several common construction rules using this method.

Methods
In this section, we briefly review the CFG, SEOTG, and SEUTG [17], and explain the proposed IP formulations, MinSEOTGMul and MinSEUTGMul, for multiple trees. Integer Programming (IP) is a method used to optimize a linear objective function subject to linear inequality constraints, with the variables being restricted to integers. We use these tools to solve the proposed integer program as our problem of finding the minimum SEOTG and SEUTG is NP-hard, with efficient solvers being developed. The benefit to use IP is that we can obtain exact solutions for combinatorial optimization problems.

Context-free grammar (CFG)
A context-free grammar (CFG) deals with strings, and is represented by 4-tuple (T, V , S, P), where T is a set of terminal symbols, V is a set of nonterminal symbols, S is a start symbol in V , and P is a set of production rules, wherein a nonterminal symbol on the left-hand side is replaced with a string on the right-hand side, which consists of symbols from V and T [19]. The final product generated by a CFG does not include any nonterminal symbol. The size of a grammar is defined as the total number of symbols appearing on the right-hand side of the production rules. For example, in a case where T = {a, b}, V = {S}, and P = {S → aSb, S → ab}, the start symbol S is repeatedly replaced with the rule S → aSb, all nonterminal symbols are replaced with terminal symbols, and the strings 'ab' , 'aabb' , 'aaabbb' , etc. are generated, by this grammar. The size of the grammar is 5. In a case where T = {a, b}, V = {S, X, Y }, and P = {S → aXb, X → aYb, Y → ab}, only 'aaabbb' is generated from S. The size of this grammar is 8. X and Y represent 'aabb' and 'ab' , respectively. The rest of this paper deals with grammars that generate a constant number of trees. Each nonterminal symbol represents a specified tree.

Simple elementary ordered/unordered tree grammar (SEOTG/SEUTG)
Simple elementary ordered tree grammar (SEOTG) is defined for the rooted ordered trees T(V , E), where V is a set of vertices, and E is a set of labeled edges. If T represents a glycan, a vertex corresponds to a monosaccharide, an edge corresponds to a bond between the monosaccharides, and the enzyme involved in the biosynthesis can be represented by the label of the edge, as shown in Figure 1. As well as CFGs, a grammar of SEOTG consists of 4-tuple ( , , S, ), where is a set of terminal symbols, is a set of nonterminal symbols, S is a start symbol in , and is a set of production rules, that are classified into Horizontal Bisection (RHB), Vertical Bisection (RVB), and Name Change (RNC), as in Figure 2. It should be noted that production rules of SEOTG and SEUTG are different from construction rules of biomolecular trees. (RHB) includes three rules that an edge of nonterminal symbol A is replaced with a tree whose root is both roots of nonterminal symbols B and C. A is bisected at the root into B and C. We introduce a tag to represent the vertex connected with another tree. The first rule in (RHB) does not contain a tag, and the other rules contain a tag, respectively. (RVB) includes two rules that an edge of nonterminal symbol A is  replaced with a tree in which the root is the root of nonterminal symbol B, and the root of nonterminal symbol C is attached to the tag of B. A is bisected at an internal vertex of A into B and C. (RNC) includes two rules that an edge of nonterminal symbol A is replaced with an edge of terminal symbol a. In addition, any nonterminal symbol does not appear in expansion of the symbol itself. Then, each nonterminal symbol corresponds to a subtree of a given tree T. Figure 3 shows an example of SEOTG with ( , , S, ).
is the set of six production rules, R 1 , · · · , R 6 , where R 1 is a vertical bisection rule, R 2 and R 3 are horizontal bisection rules, and R 4 , R 5 , R 6 are name change rules. Figure 4 illustrates the derivation of a tree from the start symbol S by the SEOTG. The first replacement is done by R 1 , and U 1 is replaced with the right-hand side of R 2 . Then, the lower endpoint of U 1 connects with the root of U 2 , and one of leaves of the replaced tree of U 1 connects with the root of U 2 . A tag indicates the vertex connected with another vertex. Hence, the lower endpoint labeled with a tag in A 1 connects with the root of U 2 . By applying R 3 , · · · , R 6 , the right-most tree is generated. The trees surrounded by dotted curves are derived from nonterminal symbols U 1 and U 2 , respectively. It is considered in the example of Figure 1 that a terminal symbol corresponds to a biosynthesis, and a nonterminal symbol corresponds to a sequence of biosyntheses.
Simple elementary unordered tree grammar (SEUTG) is defined for rooted unordered trees in a manner similar to the SEOTG. In SEUTG, the second and third production rules with a tag in the (RVB), described in Figure 2, are equivalent to each other, as the trees are unordered.

Extension to multiple trees
We extend the SEOTG and SEUTG to multiple trees. N is the number of given trees, and T α indicates the α-th edge labeled rooted tree. The start symbol S is replaced with the set of the nonterminal symbols S α . Each tree T α is generated from S α using one grammar. Figure 5 shows an example of the input multiple trees T 1 , T 2 , and T 3 . One of the minimum grammars generating these trees is shown in Figure 6. The size of the grammar is the total number of symbols present on the right-hand side in the production rules, i.e., 11. We minimize the number of distinct nonterminal symbols instead of the size of the grammar, as there exists the same number of production rules as the number of distinct nonterminal symbols. Figure 7 illustrates the derivation of T 2 from the start symbol S 2 using the grammar. T 2 is the same as T in Figure 4, and we observe the modification of the derivation process by providing other similar trees.
The Euler string es(T) is used to determine if the labeled rooted trees T 1 and T 2 are isomorphic to each other,  where es(T) for a tree T is defined by the sequence of edge labels l and its oppositel, along the depth-first search traversal of T [17]. For example, for T 3 in Figure 5, es(T 3 ) is determined to be aāaābb. For a tagged tree, the tagged edge, labeled a, is transformed into aτ a, using a special symbol, τ , which represents the tag. It is noted that for two edge labeled rooted trees T 1 and T 2 , U is assigned as the set of all Euler strings for all connected subtrees of N trees. By improving the previous formulation, we propose the following IP formulation, MinSEOTGMul, for the direct determination of the minimum SEOTG that constructs N ordered trees T α . Figure 6 Minimum SEOTG with ( , , {S 1 , S 2 , S 3 }, ) that generates T 1 , T 2 , and T 3 , as shown in Figure 5. S 1 , S 2 , S 3 are start symbols in this grammar, = {S 1 , S 2 , S 3 , U 2 , A 1 , A 2 , B}, and is the set of seven production rules, R 1 , · · · , R 7 .

Figure 7
Derivation of the tree T 2 shown in Figure 5 by the grammar, using the set of production rules shown in Figure 6. The trees surrounded by dotted curves are derived from nonterminal symbols S 1 and U 2 , respectively.
Here, lch(α, i), rch(α, i), and ch(α, i) denote the leftmost child of the vertex v i in T α , the rightmost child of v i in T α , and the set of child vertices of v i in T α , respectively. T α,i,t,h,k denotes the subtree rooted at vertex v i , with the child vertices v j (h ≤ j ≤ k) and v t labeled with a tag in T α , which does not have a tag when t = (Figure 8). I(T) denotes the set of internal vertices, except for the root and leaves of tree T. anc(α, j) denotes the set of ancestor vertices of v j , where j / ∈ anc(j) and anc( ) = ∅. In MinSEOTGMul, the variable p u is equated to 1 if a nonterminal symbol that corresponds to the subtree represented by Euler string u appears in the grammar; otherwise, the variable is equated to 0. MinSEOTGMul minimizes the sum of p u , i.e., the number of distinct nonterminal symbols appeared in the output grammar as a result. The variable x α,i,t,h,k takes on a value of 1, if the subtree T α,i,t,h,k is constructed by the grammar; otherwise, the value of this variable remains 0. In Eqs. (1) and (2), x α,i, ,j,j and x α,i,j,j,j correspond to an edge in T α , and each edge in the N trees is always constructed according to a production rule of (RNC). x α,1, ,lch(α,1),rch(α,1) corresponds to the entire α-th tree T α , where the root of each tree is numbered as 1. Eq. (3) represents that MinSEOTG-Mul requires that all N trees T α are constructed using the grammar.
The variable y α,i,j,h,l,k takes on a value of 1 if T α,i,j,h,k is constructed from T α,i,j,h,l and T α,i,j,l+1,k using an (RHB) production rule; otherwise, the value is maintained at 0 ( Figure 9). The variable z α,i,j,h,k,t is denoted as 1 if T α,i,j,h,k is constructed from T α,i,t,h,k and T α,t,j,lch(α,t),rch(α,t) using an (RVB) production rule; otherwise, the value is retained as 0 ( Figure 10). Eqs. (4) and (7) indicate that the subtree T α,i,j,h,k is constructed by at least one established production rule of (RHB) and (RVB) in the grammar. Eqs. (5), (6), (8), (9), and (10) indicate that a production rule of (RHB) and (RVB) becomes a candidate rule in the grammar when both of the two source subtrees are constructed.
The variable s u is defined by Eq. (12), and takes on a real value of 0 ≤ s u ≤ 1. If at least one subtree T α,i,j,h,k whose Euler string is u, i.e., es(T α,i,j,h,k ) = u, is constructed, then s u > 0. Based on Eq. (11), p u takes on a value of 1. It means that one nonterminal symbol corresponding to the subtree appears in the grammar. Conversely, when any subtree whose Euler string is u is not constructed, then s u = 0, p u takes on a value of 0, and a nonterminal symbol is not generated. In our previous study, unnecessary nonterminal symbols could be generated, and made it difficult to find the minimum number of nonterminal symbols.
For unordered trees, we propose the following IP formulation, MinSEUTGMul, to determine the minimum Figure 9 Horizontal bisection of T α,i,j,h,k in T α . The nonterminal symbol corresponding to T α,i,j,h,k is generated if the nonterminal symbols corresponding to subtrees T α,i,j,h,l and T α,i,j,l+1,k are generated.

Figure 10
Vertical bisection of T α,i,j,h,k in T α . The nonterminal symbol corresponding to T α,i,j,h,k is generated if the nonterminal symbols corresponding to subtrees T α,i,t,h,k and T α,t,j,lch(α,t),rch(α,t) are generated.
SEUTG for constructing N unordered trees T α , in a manner similar to that used for ordered trees.
Minimize u∈U p u Subject to x α,1, ,ch(α,1) = 1 for all α, for all α, i, C ⊆ ch(α, i), j ∈ I(T α,i, ,C ), t ∈ anc(α, j), s u ≤ p u < 1 + s u for all u ∈ U, The horizontal bisection rules of (RHB) split the set C of child vertices of the vertex v i into a subset C , and the remaining vertices C − C . T α,i,j,C indicates that the subtree rooted at vertex v i has a set C of the child vertices. The variables x α,i,j,C , y α,i,j,C ,C−C , and z α,i,j,C,t are used in a manner similar to x α,i,j,h,k , y α,i,j,h,l,k , and z α,i,j,h,k,t in MinSEOTGMul, respectively.
It should be noted that the IP formulations MinSEOT-GMul and MinSEUTGMul can output multiple grammars with the minimum number of nonterminal symbols. Figure 11 displays such an example, where the grammars G 1 and G 2 generate the tree T, and the number of nonterminal symbols of G 1 (G 2 ) is 3. The first production rule R 1 of G 1 is different from the R 1 of G 2 . By providing more such trees, the number of the minimum grammars can be reduced to almost one.

Tree representation of glycans and RNAs
The proposed methods MinSEOTGMul and MinSEUT-GMul were evaluated by preparing two types of biological data, glycans and RNA secondary structures, which were dealt with as unordered and ordered trees, respectively. For this analysis, we utilized 16 glycans, G02677, G03655, G03661, G03664, G03678, G03687, G04186, G04458, G04695, G04802, G05058, G05226, G05256, G05988, G07243, and G09054, from the KEGG Glycan database [20]. As glycans are regarded as vertex labeled rooted trees, wherein each vertex is a monosaccharide, the glycans were transformed into edge labeled rooted trees, wherein each edge is labeled with a label of its lower vertex.
In addition, 24 RNA secondary structures belonging to distinct RNA families were taken from the Rfam database [21], as shown in Additional file 1: Table S1 on our supplementary web site; these were transformed into rooted ordered trees. For this, one sequence was selected from multiple sequence alignments of each RNA family, as our method requires edge labels, i.e., bases. RNA secondary structures consist of base pairs with hydrogen bonds, and group binding, such as bulges and hairpin loops (as seen in Figure 12 (a)). There are several representations of trees for RNA secondary structures. An RNA secondary structure can be represented as an ordered rooted tree, by labeling the vertices with unpaired loops and the edges with paired bases [22]; this structure can be represented as an ordered rooted tree by labeling the vertices with hairpin loops, internal loops, bulges, and paired bases [7,23]. Chen and Zhang represented an RNA secondary structure using a paired base and a leaf, corresponding to an internal vertex and an unpaired base, respectively [8]. In our implementation, the representation by [8] was modified by eliminated vertices other than those corresponding to paired bases; in Figure 11 Minimum SEOTGs G 1 and G 2 that generate tree T. The number of nonterminal symbols of G 1 (G 2 ) is 3. The first production rule addition, the vertex labeled tree was transformed into an edge labeled tree in a manner similar to the glycans. Figure 12 illustrates the transformation, wherein a paired base was transformed into an edge, labeled with its base pair. It is noted that the edges in this representation are ordered by following the 5'-3' direction of the RNA sequence.
The CPLEX Optimization Studio (version 12.5) was used to solve integer programming using a linux operating system. The source code to transform given multiple trees into the proposed IP formulations is available at our supplementary web site on http://sunflower.kuicr.kyotou.ac.jp/tyoyo/treecomp/.

Minimum grammars for glycans and RNAs
We applied the proposed method MinSEUTGMul for the glycans and their several combinations. Table 1 shows the minimum number of nonterminal symbols of SEUTG for glycan unordered trees. In all cases, the minimum number of nonterminal symbols for multiple trees of glycans was lower than the sum of the minimum numbers for its single trees. For example, the minimum number of  '#vertices' denotes the total number of vertices in glycan trees. 'sum' denotes the sum of the minimum numbers for single glycans in a combination, and is omitted for single glycans. 'min' denotes the minimum number for all glycans in a combination. 'time' denotes the execution time in seconds. 'memory' denotes the memory usage in mega bytes.
nonterminal symbols for G02677, G03661, and G03664 was 31, which was lower than the sum of the minimum numbers 13+17+16 = 46. This suggests that our method successfully determined several common rules among the combination of glycans, and that the compression of several glycans together is better than that of each individual glycan. The execution time and the memory usage by the IP solver, for multiple trees with over 100 vertices in our experiments, were observed to be less than 10 minutes and 4G bytes, respectively (except in the case of G02677, G03661, G04458, and G07243). These compression sizes can be used to estimate the similarities between glycan structures. If the compression size of two glycans is smaller than the sum of compression sizes of the individual glycans, these glycan structures are considered to be similar. Figure 13 shows the subtrees corresponding to the nonterminal symbols contained within the minimum SEUTG for the unordered rooted trees of the glycans G02677, G03661, and G03664; the subtrees corresponding to the same nonterminal symbol are filled with the same color, and a portion of the nonterminal symbol is shown. The nonterminal symbol colored blue appeared in all three glycans, while those colored green and red appeared in two glycans. The nonterminal symbol colored brown consisted of the nonterminal symbols colored red and blue. This implies that the hierarchical structures contained within the glycans beyond the frequent patterns can be extracted using the developed methods. The proposed method MinSEOTGMul was applied to the RNA secondary structures and their several combinations. Table 2 shows the minimum number of nonterminal symbols of SEOTG for RNA ordered trees. In all cases, the minimum number of nonterminal symbols for multiple trees of RNA was lower than the sum of minimum numbers for its single trees, similar to the glycans. The execution time and the memory usage by the IP solver for multiple trees in our experiments were between two seconds and six hours, and between 260 Mbytes and 37 Gbytes, respectively. Figure 14 shows the subtrees corresponding to nonterminal symbols in the minimum SEOTG for ordered rooted trees of the RNA secondary structures RF00002 and RF00008. Figure 15 shows the original RNA secondary structures of RF00002 and RF00008, and the base pairs corresponding to nonterminal symbols in the minimum SEOTG. We also observed the hierarchical structure of the nonterminal symbols, colored blue and brown.
We examined the alternative approach that transforms multiple trees into a single tree and applies our previous methods. For the set of glycans G02677, G03661, and G03664, the alternative method output the existence of the SEUTG grammar with size 39 in 8.83 seconds. However, we could not obtain the result for size 38 within 24 hours, and could not determine the minimum number of nonterminal symbols. For the set of RNAs RF00002, and RF00004, the method output the existence of the SEOTG grammar with size 42 in 5.01 seconds. However, we could not obtain the result for size 41. We can see that the proposed methods are more efficient than the previous methods.

Conclusions
We proposed novel integer programming-based methods MinSEOTGMul and MinSEUTGMul to determine the minimum simple elementary ordered and unordered tree grammars (SEOTG and SEUTG) for multiple ordered and unordered trees, respectively. These could be directly applied to the determination of the minimum grammar, unlike our previously proposed methods. We applied MinSEUTGMul to several unordered trees transformed from glycans, and their combinations; MinSEOTGMul  was applied to several ordered trees transformed from RNA secondary structures, and their combinations. In all cases, the minimum number of nonterminal symbols in the grammars used in the construction of multiple trees was lower than the sum of minimum numbers in the grammars used to construct the single trees. This suggests that the proposed methods were successful in determining several common rules for glycans and RNA. In addition, several results of the minimum grammars for multiple trees of glycans and RNA reveal that our methods can provide clues towards extracting the hierarchical structures contained within tree-structured biological data, beyond the frequent patterns.
In our experiments, the execution time and the memory usage for a set of trees required six hours and 37GBytes, respectively. To obtain the minimum SEOTG and SEUTG for more trees including more complicated trees, we need to further improve the efficiency.
In this study, we utilized the minimum grammar for extraction of common construction rules among multiple distinct trees. However, the proposed methods can be used for data compression. Furthermore, the execution times of some operations can be decreased to multiple trees by applying the operations to the previously obtained minimum grammar. In the future, we would like to apply our methods to more glycans, RNA, and other tree-structured biological data.

Figure 14
Results of subtrees corresponding to nonterminal symbols in the minimum SEOTG for ordered rooted trees of RNAs from RF00002 and RF00008. Subtrees corresponding to the same nonterminal symbol are filled in the same color.

Figure 15
Original RNA secondary structures of RF00002 and RF00008, and base pairs corresponding to nonterminal symbols in the minimum SEOTG. Base pairs corresponding to the same nonterminal symbol are filled in the same color as Figure 14.