Grammar-based compression approach to extraction of common rules among multiple trees of glycans and RNAs

Zhao, Yang; Hayashida, Morihiro; Cao, Yue; Hwang, Jaewook; Akutsu, Tatsuya

doi:10.1186/s12859-015-0558-4

Research Article
Open access
Published: 24 April 2015


BMC Bioinformatics volume 16, Article number: 128 (2015)


Abstract

Background

Many tree structures are found in nature and organisms. Such trees are believed to be constructed on the basis of certain rules. We have previously developed grammar-based compression methods for ordered and unordered single trees, based on bisection-type tree grammars. Here, these methods find construction rules for one single tree. On the other hand, specified construction rules can be utilized to generate multiple similar trees.

Results

Therefore, in this paper, we develop novel methods to discover common rules for the construction of multiple distinct trees, by improving and extending the previous methods using integer programming. We apply our proposed methods to several sets of glycans and RNA secondary structures, which play important roles in cellular systems, and can be regarded as tree structures. The results suggest that our method can be successfully applied to determining the minimum grammar and several common rules among glycans and RNAs.

Conclusions

We propose integer programming-based methods MinSEOTGMul and MinSEUTGMul for the determination of the minimum grammars constructing multiple ordered and unordered trees, respectively. The proposed methods can provide clues for the determination of hierarchical structures contained in tree-structured biological data, beyond the extraction of frequent patterns.

Background

Many tree structures are found in nature and organisms. One such tree structure is a glycan, in which each monosaccharide is regarded as a vertex, except for cyclic oligosaccharides and so on. Since glycans contain various complicated structures, they are believed to be constructed by various mechanisms that recognize a monosaccharide, binding it with another. For instance, a galactosyltransferase is known to catalyze the biosynthesis with a galactose [1]. Glycans are also known to play several important roles in a cellular system, such as molecular recognition, cell adhesion, and antigen-antibody interactions. Therefore, many studies have been conducted to understand the structures and functions of glycans; in addition, several methods have been developed for the discovery of glycan motifs or significant subtrees, as glycan structures are conserved in evolutionary processes [2-4], and to measure the similarities between glycans [5,6].

RNA secondary structures can also be regarded as tree structures; these structures depend on the nucleic acid sequence. RNAs, which are large biological molecules, also perform important functions in living cells, such as the catalysis of biological reactions and the expression of genes. Milo et al. displayed a pseudoknot-free RNA secondary structure as an ordered rooted tree, wherein each base pair, unpaired base interval, hairpin loop, internal loop, multi-loop, and external loop corresponds to a vertex, and developed a cubic time algorithm for the homeomorphic subtree alignment problem [7]. They applied it to pairwise alignments of RNA secondary structures and found several structural similarities that were not detected by other existing algorithms. Chen and Zhang developed an efficient algorithm for tree edit distance and used it to compare several RNA secondary structures [8]. These methods aim to measure the similarities between tree structures and to determine frequent subtrees.

In this paper, we focus on finding construction rules for multiple biomolecular tree structures. For example, it was reported that glycosyltransferases such as ALG1 and ALG2 are involved in forming the linkages of the Glc₃Man₉GlcNAc₂ oligosaccharide precursor shown in Figure 1 [9], where Glc, Man, and GlcNAc stand for glucose, mannose, and N-acetylglucosamine, respectively. ALG1 connects a GlcNAc with a Man, and ALG2 connects a Man with the Man connected by ALG1. Depending on the local structure, different glycosyltransferases catalyze these biosyntheses to construct the same oligosaccharide structure. Since it is difficult to infer the existence of such genes from a single tree structure, we try to find them from multiple tree structures in which the same enzyme constructs a specified local structure.

In the field of computer science, grammar-based compression is used to determine the rules of construction of various types of data. The identification of the smallest grammar for input data would provide a clue towards understanding the construction rules. The determination of the smallest Context-Free Grammar (CFG) constructing a given string is known to be NP-hard [10]. Polynomial-time approximation methods have been developed to determine the smallest CFG for sequence data such as DNA, RNA nucleic acid sequences, and protein amino acid sequences [10-13]. Although these methods were extended to the tree structured data including XML [14-16], the minimum grammar has not always been provided. Hence, in our previous study, we developed grammar-based compression methods for single trees that always output the minimum grammar. We used bisection-type tree grammars proposed in [17], Simple Elementary Ordered Tree Grammar (SEOTG) for ordered trees and Simple Elementary Unordered Tree Grammar (SEUTG) for unordered trees, in which at most two nonterminal and terminal symbols appear on the right-hand side of each production rule. Since it seemed to be difficult to directly formulate the problems of finding the minimum SEOTG and SEUTG using Integer Programming (IP), we instead formulated the problems of finding SEOTG and SEUTG with a given size [18].

We previously developed methods for compressing single trees. As the example in Figure 1 suggests, specified construction rules can be applied to generate several similar trees. Therefore, in this study, we attempt to discover common construction rules among multiple distinct trees; in addition, by improving and extending the previous methods, we propose the novel IP formulations MinSEOTGMul and MinSEUTGMul for the direct determination of the minimum SEOTG and SEUTG. In our previous study, the problems of determining the minimum SEOTG and SEUTG were not directly formulated by IP; instead, the problems of determining an SEOTG and an SEUTG of a given size were formulated. Therefore, the previous IP had to be executed at least twice with different size parameters in order to confirm the minimum size of the grammars that construct a given single tree. The methods proposed in this paper determine the minimum grammar directly in one attempt. As for multiple input trees, our previous method can be trivially extended to determine the minimum SEOTG and SEUTG for N trees: add a special root vertex, connect the root of each of the N trees to the special root by a distinct special edge, and apply the previous IP formulation to the resulting single tree. This approach, however, increases the number of variables in the IP formulation and may increase the execution time.
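
The trivial extension to N trees described above can be sketched as follows. This is only an illustration under assumptions not taken from the paper: trees are encoded as nested lists of (edge label, subtree) pairs, and the special edge labels "#1", "#2", ... are hypothetical placeholders that merely need to be distinct from every label in the input.

```python
from typing import List, Tuple

# An edge-labeled rooted tree is a list of (edge_label, subtree) pairs,
# one pair per child of the root; a leaf is the empty list.
Tree = List[Tuple[str, "Tree"]]


def merge_under_special_root(trees: List[Tree]) -> Tree:
    """Attach the root of every input tree to one new special root."""
    # Hypothetical special labels "#1", "#2", ...: one distinct edge per tree.
    return [(f"#{alpha}", tree) for alpha, tree in enumerate(trees, start=1)]


def count_edges(tree: Tree) -> int:
    """Number of edges in an edge-labeled tree."""
    return sum(1 + count_edges(child) for _, child in tree)


t1: Tree = [("a", []), ("b", [])]   # root with two leaf children
t2: Tree = [("a", [("b", [])])]     # path of two edges
merged = merge_under_special_root([t1, t2])
print(count_edges(t1) + count_edges(t2), count_edges(merged))
```

The merged tree has N more edges than the inputs combined, which is one way to see why the single-tree formulation has more variables to handle.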

We apply our methods to several sets of glycans and RNA secondary structures. Consequently, we successfully determined the minimum grammar and several common construction rules using this method.

Methods

In this section, we briefly review the CFG, SEOTG, and SEUTG [17], and explain the proposed IP formulations, MinSEOTGMul and MinSEUTGMul, for multiple trees. Integer Programming (IP) is a method used to optimize a linear objective function subject to linear inequality constraints, with the variables restricted to integers. We use IP solvers because our problem of finding the minimum SEOTG and SEUTG is NP-hard and efficient solvers have been developed. The benefit of using IP is that we can obtain exact solutions for combinatorial optimization problems.

Context-free grammar (CFG)

A context-free grammar (CFG) deals with strings and is represented by a 4-tuple (T,V,S,P), where T is a set of terminal symbols, V is a set of nonterminal symbols, S is a start symbol in V, and P is a set of production rules, wherein a nonterminal symbol on the left-hand side is replaced with a string on the right-hand side, which consists of symbols from V and T [19]. The final product generated by a CFG does not include any nonterminal symbol. The size of a grammar is defined as the total number of symbols appearing on the right-hand side of the production rules. For example, when T={a,b}, V={S}, and P={S→aSb, S→ab}, the start symbol S is repeatedly replaced using the rule S→aSb and finally using S→ab, so that all nonterminal symbols are replaced with terminal symbols, and the strings 'ab', 'aabb', 'aaabbb', etc. are generated by this grammar. The size of the grammar is 5 (three symbols in aSb plus two in ab). When T={a,b}, V={S,X,Y}, and P={S→aXb, X→aYb, Y→ab}, only 'aaabbb' is generated from S. The size of this grammar is 8. X and Y represent 'aabb' and 'ab', respectively. The rest of this paper deals with grammars that generate a constant number of trees. Each nonterminal symbol represents a specified tree.
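
The two example grammars can be checked with a few lines of code. The sketch below is illustrative only; the list-of-rules encoding and the function names `grammar_size` and `expand` are assumptions made for this example, and `expand` is valid only for grammars in which each nonterminal has exactly one rule.

```python
from typing import List, Tuple

Rule = Tuple[str, List[str]]  # (left-hand side, right-hand side symbols)


def grammar_size(rules: List[Rule]) -> int:
    """Total number of symbols on the right-hand sides of all rules."""
    return sum(len(rhs) for _, rhs in rules)


def expand(symbol: str, rules: List[Rule]) -> str:
    """Expand a symbol to its terminal string (one rule per nonterminal)."""
    table = dict(rules)
    if symbol not in table:  # terminal symbol
        return symbol
    return "".join(expand(s, rules) for s in table[symbol])


recursive = [("S", ["a", "S", "b"]), ("S", ["a", "b"])]
straight = [("S", ["a", "X", "b"]), ("X", ["a", "Y", "b"]), ("Y", ["a", "b"])]

print(grammar_size(recursive))   # 5
print(grammar_size(straight))    # 8
print(expand("S", straight))     # aaabbb
print(expand("X", straight))     # aabb
```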

Simple elementary ordered/unordered tree grammar (SEOTG/SEUTG)

Simple elementary ordered tree grammar (SEOTG) is defined for rooted ordered trees T(V,E), where V is a set of vertices and E is a set of labeled edges. If T represents a glycan, a vertex corresponds to a monosaccharide, an edge corresponds to a bond between monosaccharides, and the enzyme involved in the biosynthesis can be represented by the label of the edge, as shown in Figure 1. As with CFGs, an SEOTG consists of a 4-tuple (Σ,Γ,S,Δ), where Σ is a set of terminal symbols, Γ is a set of nonterminal symbols, S is a start symbol in Γ, and Δ is a set of production rules, which are classified into Horizontal Bisection (RHB), Vertical Bisection (RVB), and Name Change (RNC), as in Figure 2. It should be noted that production rules of SEOTG and SEUTG are different from construction rules of biomolecular trees. (RHB) includes three rules in which an edge of nonterminal symbol A is replaced with a tree whose root is the common root of nonterminal symbols B and C; that is, A is bisected at the root into B and C. We introduce a tag to mark the vertex to which another tree is connected. The first rule in (RHB) does not contain a tag, whereas each of the other two rules contains a tag. (RVB) includes two rules in which an edge of nonterminal symbol A is replaced with a tree whose root is the root of nonterminal symbol B, with the root of nonterminal symbol C attached to the tag of B; that is, A is bisected at an internal vertex of A into B and C. (RNC) includes two rules in which an edge of nonterminal symbol A is replaced with an edge of terminal symbol a. In addition, no nonterminal symbol appears in its own expansion. Each nonterminal symbol therefore corresponds to a subtree of a given tree T. Figure 3 shows an example of SEOTG with (Σ,Γ,S,Δ). Δ is the set of six production rules, R₁,⋯,R₆, where R₁ is a vertical bisection rule, R₂ and R₃ are horizontal bisection rules, and R₄, R₅, R₆ are name change rules.
Figure 4 illustrates the derivation of a tree from the start symbol S by the SEOTG. The first replacement is done by R₁, and U₁ is replaced with the right-hand side of R₂. Then, the lower endpoint of U₁ connects with the root of U₂, and one of the leaves of the replaced tree of U₁ connects with the root of U₂. A tag indicates the vertex connected with another vertex. Hence, the lower endpoint labeled with a tag in A₁ connects with the root of U₂. By applying R₃,⋯,R₆, the right-most tree is generated. The trees surrounded by dotted curves are derived from nonterminal symbols U₁ and U₂, respectively. In the example of Figure 1, a terminal symbol can be regarded as corresponding to a biosynthesis, and a nonterminal symbol as corresponding to a sequence of biosyntheses.

Simple elementary unordered tree grammar (SEUTG) is defined for rooted unordered trees in a manner similar to the SEOTG. In SEUTG, the second and third production rules of (RHB), i.e., the two rules with a tag described in Figure 2, are equivalent to each other, as the trees are unordered.

Extension to multiple trees

We extend the SEOTG and SEUTG to multiple trees. N is the number of given trees, and T _α indicates the α-th edge labeled rooted tree. The start symbol S is replaced with the set of the nonterminal symbols S _α. Each tree T _α is generated from S _α using one grammar. Figure 5 shows an example of the input multiple trees T ₁, T ₂, and T ₃. One of the minimum grammars generating these trees is shown in Figure 6. The size of the grammar is the total number of symbols present on the right-hand side in the production rules, i.e., 11. We minimize the number of distinct nonterminal symbols instead of the size of the grammar, as there exists the same number of production rules as the number of distinct nonterminal symbols. Figure 7 illustrates the derivation of T ₂ from the start symbol S ₂ using the grammar. T ₂ is the same as T in Figure 4, and we observe the modification of the derivation process by providing other similar trees.

The Euler string es(T) is used to determine whether two labeled rooted trees T₁ and T₂ are isomorphic to each other, where es(T) for a tree T is defined as the sequence of edge labels l and their opposites $\bar {l}$ along the depth-first search traversal of T [17]. For example, for T₃ in Figure 5, es(T₃) is $a\bar {a}a\bar {a}b\bar {b}$. For a tagged tree, the tagged edge, labeled a, is transformed into $a\tau \overline {a}$, using a special symbol τ that represents the tag. Note that, for two edge labeled rooted trees T₁ and T₂, T₁ is isomorphic to T₂ if and only if es(T₁)=es(T₂).
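
A minimal sketch of the Euler string for ordered edge-labeled trees is given below. The encoding is an assumption made for illustration: a tree is a list of (label, subtree) pairs, the opposite label $\bar{l}$ is written as "-l", and a tagged edge carries the hypothetical sentinel string "tau" in place of a subtree.

```python
TAG = "tau"  # stands in for the special tag symbol


def euler_string(tree):
    """Labels and opposite labels along a depth-first traversal."""
    out = []
    for label, child in tree:
        out.append(label)               # going down the edge
        if isinstance(child, str):      # tagged edge: a tau -a
            out.append(TAG)
        else:
            out.extend(euler_string(child))
        out.append("-" + label)         # coming back up the edge
    return tuple(out)


# Three leaf children of the root with edge labels a, a, b,
# which yields the Euler string a -a a -a b -b.
t3 = [("a", []), ("a", []), ("b", [])]
print(euler_string(t3))

# For ordered trees, a different child order gives a different string.
reordered = [("b", []), ("a", []), ("a", [])]
print(euler_string(t3) == euler_string(reordered))

# A tagged edge labeled a becomes a tau -a.
print(euler_string([("a", TAG)]))
```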

U is assigned as the set of all Euler strings for all connected subtrees of N trees. By improving the previous formulation, we propose the following IP formulation, MinSEOTGMul, for the direct determination of the minimum SEOTG that constructs N ordered trees T _α.

$$\begin{array}{*{20}l} &\text{Minimize} \sum\limits_{u\in U}p_{u}\\ &\text{Subject to}\\ &x_{\alpha,i,\epsilon,j,j} = 1 \quad \text{for all } \alpha, i, j\in ch(\alpha,i) (|ch(\alpha,j)|=0), \end{array} $$

((1))

$$\begin{array}{*{20}l} &x_{\alpha,i,j,j,j}\, = 1 \quad \text{for all } \alpha, i, j\in ch(\alpha,i) (|ch(\alpha,j)|>0), \end{array} $$

((2))

$$\begin{array}{*{20}l} &x_{\alpha,1,\epsilon,lch(\alpha,1),rch(\alpha,1)}=1 \text{ for all } \alpha, \end{array} $$

((3))

$$\begin{array}{*{20}l} &x_{\alpha,i,\epsilon,h,k}\leq \sum_{l=h}^{k-1} y_{\alpha,i,\epsilon,h,l,k}+\sum_{t\in I(T_{\alpha,i,\epsilon,h,k})}z_{\alpha,i,\epsilon,h,k,t}\\ &\hspace*{54pt}\text{for all } \alpha, i, h \leq k \in ch(\alpha,i), \end{array} $$

((4))

$$\begin{array}{*{20}l} &y_{\alpha,i,\epsilon,h,l,k}\leq \frac{1}{2}(x_{\alpha,i,\epsilon,h,l}+x_{\alpha,i,\epsilon,l+1,k}) \\ &\hspace*{54pt}\text{for all } \alpha, i, h \leq l<k \in ch(\alpha,i), \end{array} $$

((5))

$$\begin{array}{*{20}l} &z_{\alpha,i,\epsilon,h,k,t}\leq \frac{1}{2}(x_{\alpha,i,t,h,k}+x_{\alpha,t,\epsilon,lch(\alpha,t),rch(\alpha,t)})\\ &\hspace*{54pt}\text{for all } \alpha, i, h \!\leq\! k \in ch(\alpha,i), t\!\in\! I(T_{\alpha,i,\epsilon,h,k}), \end{array} $$

((6))

$$\begin{array}{*{20}l} &x_{\alpha,i,j,h,k}\leq \sum_{l=h}^{k-1} y_{\alpha,i,j,h,l,k}+\sum_{t\in anc(\alpha, j)}z_{\alpha,i,j,h,k,t}\\ &\hspace*{54pt}\text{for all } \alpha, i, h\!\leq\! k\in ch(\alpha,i), j\in I(T_{\alpha,i,\epsilon,h,k}), \end{array} $$

((7))

$$\begin{array}{*{20}l} &y_{\alpha,i,j,h,l,k}\leq \frac{1}{2}(x_{\alpha,i,\epsilon,h,l}+x_{\alpha,i,j,l+1,k}) \\ &\hspace*{54pt}\text{for all } \alpha, i, h\leq l<k\in ch(\alpha,i),\\&\hspace*{6.9pc} j\in I(T_{\alpha,i,\epsilon,l+1,k}), \end{array} $$

((8))

$$\begin{array}{*{20}l} &y_{\alpha,i,j,h,l,k}\leq \frac{1}{2}(x_{\alpha,i,j,h,l}+x_{\alpha,i,\epsilon,l+1,k}) \\ &\hspace*{54pt} \text{for all } \alpha, i, h\leq l<k\in ch(\alpha,i),\\&\hspace*{6.8pc} j\in I(T_{\alpha,i,\epsilon,h,l}), \end{array} $$

((9))

$$\begin{array}{*{20}l} &z_{\alpha,i,j,h,k,t}\leq \frac{1}{2}(x_{\alpha,i,t,h,k}+x_{\alpha,t,j,lch(\alpha,t),rch(\alpha,t)}) \\ &\hspace*{54pt} \text{for all } \alpha, i, h\!\leq\! k\in ch(\alpha,i), j\in I(T_{\alpha,i,\epsilon,h,k}), \\&\hspace*{6.8pc} t\in anc(\alpha, j), \end{array} $$

((10))

$$\begin{array}{*{20}l} &s_{u} \leq p_{u}<1+s_{u}\quad \text{for all } u \in U, \end{array} $$

((11))

$$\begin{array}{*{20}l} &s_{u}=\frac{1}{|\left\{(\alpha,i,j,h,k)|es(T_{\alpha,i,j,h,k}) = u\right\}|}\\&\qquad\sum_{\left\{(\alpha,i,j,h,k)|es(T_{\alpha,i,j,h,k}) = u\right\}} x_{\alpha,i,j,h,k}. \end{array} $$

((12))

Here, lch(α,i), rch(α,i), and ch(α,i) denote the leftmost child of the vertex v_i in T_α, the rightmost child of v_i in T_α, and the set of child vertices of v_i in T_α, respectively. T_{α,i,t,h,k} denotes the subtree of T_α rooted at vertex v_i, with the child vertices v_j (h≤j≤k) and with v_t labeled with a tag; it does not have a tag when t=ε (Figure 8). I(T) denotes the set of internal vertices of tree T, i.e., all vertices except the root and the leaves. anc(α,j) denotes the set of ancestor vertices of v_j, where j∉anc(α,j) and anc(α,ε)=∅.
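
The index notation can be made concrete for a single small tree. The sketch below drops the tree index α and stores the tree as a dictionary of ordered child lists; this encoding, the example tree, and the use of None for ε are assumptions made for illustration only.

```python
# Vertex 1 is the root; children are listed from left to right.
children = {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []}
parent = {c: p for p, cs in children.items() for c in cs}


def ch(i):
    """Child vertices of v_i."""
    return children[i]


def lch(i):
    """Leftmost child of v_i."""
    return children[i][0]


def rch(i):
    """Rightmost child of v_i."""
    return children[i][-1]


def anc(j):
    """Ancestors of v_j, excluding v_j itself; empty for epsilon (None)."""
    result = set()
    while j in parent:
        j = parent[j]
        result.add(j)
    return result


print(ch(1), lch(1), rch(1))   # [2, 3] 2 3
print(anc(4))                  # {1, 2}
print(anc(None))               # set()
```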

In MinSEOTGMul, the variable p_u takes on a value of 1 if a nonterminal symbol that corresponds to the subtree represented by Euler string u appears in the grammar; otherwise, it takes on a value of 0. MinSEOTGMul minimizes the sum of p_u, i.e., the number of distinct nonterminal symbols appearing in the output grammar. The variable x_{α,i,t,h,k} takes on a value of 1 if the subtree T_{α,i,t,h,k} is constructed by the grammar; otherwise, its value is 0. In Eqs. (1) and (2), x_{α,i,ε,j,j} and x_{α,i,j,j,j} correspond to an edge in T_α, and each edge in the N trees is always constructed according to a production rule of (RNC). x_{α,1,ε,lch(α,1),rch(α,1)} corresponds to the entire α-th tree T_α, where the root of each tree is numbered 1. Eq. (3) requires that all N trees T_α are constructed using the grammar.

The variable y_{α,i,j,h,l,k} takes on a value of 1 if T_{α,i,j,h,k} is constructed from T_{α,i,j,h,l} and T_{α,i,j,l+1,k} using an (RHB) production rule; otherwise, its value is 0 (Figure 9). The variable z_{α,i,j,h,k,t} takes on a value of 1 if T_{α,i,j,h,k} is constructed from T_{α,i,t,h,k} and T_{α,t,j,lch(α,t),rch(α,t)} using an (RVB) production rule; otherwise, its value is 0 (Figure 10). Eqs. (4) and (7) indicate that the subtree T_{α,i,j,h,k} is constructed by at least one production rule of (RHB) or (RVB) in the grammar. Eqs. (5), (6), (8), (9), and (10) indicate that a production rule of (RHB) or (RVB) becomes a candidate rule in the grammar only when both of its two source subtrees are constructed.
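
Constraints of the form y ≤ ½(x₁+x₂), as in Eqs. (5), (6), (8), (9), and (10), act as a logical AND on binary variables. The short enumeration below illustrates this; the function name `max_y` is an assumption made for the example and is not part of the formulation.

```python
from itertools import product


def max_y(x1: int, x2: int) -> int:
    """Largest binary y that satisfies y <= (x1 + x2) / 2."""
    return max(y for y in (0, 1) if y <= (x1 + x2) / 2)


# y can be 1 only when both source subtrees are constructed.
for x1, x2 in product((0, 1), repeat=2):
    print(x1, x2, max_y(x1, x2))
```

Only the row with x₁=x₂=1 allows y=1, which is why a bisection rule can be selected only when both of its source subtrees are available.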

The variable s_u is defined by Eq. (12) and takes on a real value with 0≤s_u≤1. If at least one subtree T_{α,i,j,h,k} whose Euler string is u, i.e., es(T_{α,i,j,h,k})=u, is constructed, then s_u>0 and, by Eq. (11), p_u takes on a value of 1. This means that one nonterminal symbol corresponding to the subtree appears in the grammar. Conversely, when no subtree whose Euler string is u is constructed, s_u=0, p_u takes on a value of 0, and no nonterminal symbol is generated. In our previous study, unnecessary nonterminal symbols could be generated, which made it difficult to find the minimum number of nonterminal symbols.
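
The effect of Eqs. (11) and (12) can be verified by enumeration. In the sketch below, `xs` is a hypothetical list of the binary x values of all subtrees sharing one Euler string u; exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction
from typing import List


def feasible_p(xs: List[int]) -> List[int]:
    """Binary values of p_u allowed by s_u <= p_u < 1 + s_u."""
    s_u = Fraction(sum(xs), len(xs))  # Eq. (12): average of the x values
    return [p for p in (0, 1) if s_u <= p < 1 + s_u]


print(feasible_p([0, 0, 0]))  # [0]: no subtree constructed, no nonterminal
print(feasible_p([0, 1, 0]))  # [1]: one occurrence is enough
print(feasible_p([1, 1]))     # [1]
```

In every case exactly one value of p_u is feasible, so p_u is forced to act as an indicator of whether any subtree with Euler string u is constructed.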

For unordered trees, we propose the following IP formulation, MinSEUTGMul, to determine the minimum SEUTG for constructing N unordered trees T _α, in a manner similar to that used for ordered trees.

$${} {\small{\begin{aligned} &\text{Minimize}\sum_{u\in U}p_{u}\\ &\text{Subject to}\\ &x_{\alpha,i,\epsilon,\{j\}} = 1 \quad \text{ for all } \alpha, i, j\in ch(\alpha,i) (|ch(\alpha,j)|=0),\\ &x_{\alpha,i,j,\{j\}} \,= 1 \quad \text{ for all } \alpha, i, j\in ch(\alpha,i) (|ch(\alpha,j)|>0),\\ &x_{\alpha,1,\epsilon,ch(\alpha,1)}=1 \quad \text{ for all } \alpha,\\ &x_{\alpha,i,\epsilon,{\cal C}}\leq \sum_{{\cal C}'(\neq\emptyset)\subset{\cal C}} y_{\alpha,i,\epsilon,{\cal C}',{\cal C}-{\cal C}'}+\sum_{t\in I(T_{\alpha,i,\epsilon,{\cal C}})}z_{\alpha,i,\epsilon,{\cal C},t}\\ &\hspace*{50pt} \text{for all } \alpha, i, {\cal C}\subseteq ch(\alpha,i), \\ &y_{\alpha,i,\epsilon,{\cal C}',{\cal C}-{\cal C}'}\leq \frac{1}{2}(x_{\alpha,i,\epsilon,{\cal C}'}+x_{\alpha,i,\epsilon,{\cal C}-{\cal C}'}) \\ &\hspace*{50pt}\text{for all } \alpha, i, {\cal C}\subseteq ch(\alpha,i), {\cal C}'(\neq\emptyset)\subset{\cal C},\\ &z_{\alpha,i,\epsilon,{\cal C},t}\leq \frac12(x_{\alpha,i,t,{\cal C}}+x_{\alpha,t,\epsilon,ch(\alpha,t)}) \\ &\hspace*{50pt}\text{for all } \alpha, i, {\cal C}\subseteq ch(\alpha,i), t\in I(T_{\alpha,i,\epsilon,{\cal C}}),\\ &x_{\alpha,i,j,{\cal C}} \leq \sum_{{\cal C}'(\neq\emptyset)\subset{\cal C}} y_{\alpha,i,j,{\cal C}',{\cal C}-{\cal C}'}+\sum_{t\in anc(\alpha, j)}z_{\alpha,i,j,{\cal C},t}\\ &\hspace*{50pt}\text{for all } \alpha, i, {\cal C}\subseteq ch(\alpha,i), j\in I(T_{\alpha,i,\epsilon,{\cal C}}), \\ &y_{\alpha,i,j,{\cal C}',{\cal C}-{\cal C}'}\!\leq\! \frac12(x_{\alpha,i,\epsilon,{\cal C}'}\,+\,x_{\alpha,i,j,{\cal C}-{\cal C}'}) \\ &\hspace*{50pt}\text{for all } \alpha, i, {\cal C}\!\subseteq\! ch(\alpha,i), j\!\in\! I(T_{\alpha,i,\epsilon,{\cal C}}), {\cal C}'(\neq\!\emptyset)\!\subset\!{\cal C},\\ &z_{\alpha,i,j,{\cal C},t}\leq \frac12(x_{\alpha,i,t,{\cal C}}+x_{\alpha,t,j,ch(\alpha,t)}) \\ &\hspace*{50pt}\text{for all } \alpha, i, {\cal C}\!\subseteq\! ch(\alpha,i), j\!\in\! I(T_{\alpha,i,\epsilon,{\cal C}}), t\!\in\! 
anc(\alpha, j),\\ &s_{u}\leq p_{u}<1+s_{u} \quad \text{ for all } u \in U, \\ &s_{u}= {\frac 1 {|\{(\alpha,i,j,{\cal C})|es(T_{\alpha,i,j,{\cal C}}) = u\}|}} \sum_{\{(\alpha,i,j,{\cal C})|es(T_{\alpha,i,j,{\cal C}}) = u\}} x_{\alpha,i,j,{\cal C}}. \end{aligned}}} $$

The horizontal bisection rules of (RHB) split the set ${\cal C}$ of child vertices of the vertex v_i into a subset ${\cal C}^{\prime }$ and the remaining vertices ${\cal C}-{\cal C}^{\prime }$. $T_{\alpha,i,j,{\cal C}}$ denotes the subtree rooted at vertex v_i with the set ${\cal C}$ of child vertices. The variables $x_{\alpha,i,j,{\cal C}}$, $y_{\alpha,i,j,{\cal C}^{\prime }, {\cal C}-{\cal C}^{\prime }}$, and $z_{\alpha,i,j,{\cal C},t}$ are used in a manner similar to x_{α,i,j,h,k}, y_{α,i,j,h,l,k}, and z_{α,i,j,h,k,t} in MinSEOTGMul, respectively.
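
For unordered trees, a horizontal bisection ranges over all non-empty proper subsets of the child set rather than over contiguous intervals. The following sketch enumerates these splits; the function name `bisections` and the integer child identifiers are assumptions made for illustration.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Iterator, Tuple


def bisections(child_set: Iterable[int]) -> Iterator[Tuple[FrozenSet[int], FrozenSet[int]]]:
    """Yield every (C', C - C') with C' a non-empty proper subset of C."""
    items = sorted(child_set)
    whole = frozenset(items)
    for size in range(1, len(items)):
        for part in combinations(items, size):
            c_prime = frozenset(part)
            yield c_prime, whole - c_prime


splits = list(bisections({2, 3, 4}))
print(len(splits))  # 2**3 - 2 = 6
for c_prime, rest in splits:
    print(sorted(c_prime), sorted(rest))
```

The number of splits grows as 2^|C| − 2, so vertices with many children contribute many more variables in MinSEUTGMul than in the ordered case, where only k − h interval splits exist.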

It should be noted that the IP formulations MinSEOTGMul and MinSEUTGMul can output multiple grammars with the minimum number of nonterminal symbols. Figure 11 displays such an example, where the grammars G ₁ and G ₂ generate the tree T, and the number of nonterminal symbols of G ₁ (G ₂) is 3. The first production rule R ₁ of G ₁ is different from the R ₁ of G ₂. By providing more such trees, the number of the minimum grammars can be reduced to almost one.

Results and discussion

Tree representation of glycans and RNAs

The proposed methods MinSEOTGMul and MinSEUTGMul were evaluated by preparing two types of biological data, glycans and RNA secondary structures, which were dealt with as unordered and ordered trees, respectively. For this analysis, we utilized 16 glycans, G02677, G03655, G03661, G03664, G03678, G03687, G04186, G04458, G04695, G04802, G05058, G05226, G05256, G05988, G07243, and G09054, from the KEGG Glycan database [20]. As glycans are regarded as vertex labeled rooted trees, wherein each vertex is a monosaccharide, the glycans were transformed into edge labeled rooted trees, wherein each edge is labeled with a label of its lower vertex.
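
The conversion from a vertex-labeled tree to an edge-labeled tree can be sketched as follows. The tuple encoding is an assumption made for this example, and the text does not state how the root vertex label is handled; this sketch simply does not carry it onto any edge. The small input is a made-up fragment, not one of the KEGG entries.

```python
def to_edge_labeled(vertex_tree):
    """Label each edge with the label of its lower (child) vertex.

    Input:  (vertex_label, [child_subtrees])
    Output: list of (edge_label, subtree) pairs for the children of the root.
    """
    _, child_subtrees = vertex_tree
    return [(child[0], to_edge_labeled(child)) for child in child_subtrees]


# Hypothetical fragment: GlcNAc - GlcNAc - Man with two Man branches.
fragment = ("GlcNAc", [("GlcNAc", [("Man", [("Man", []), ("Man", [])])])])
print(to_edge_labeled(fragment))
```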

In addition, 24 RNA secondary structures belonging to distinct RNA families were taken from the Rfam database [21], as shown in Additional file 1: Table S1 on our supplementary web site; these were transformed into rooted ordered trees. For this, one sequence was selected from the multiple sequence alignment of each RNA family, as our method requires edge labels, i.e., bases. RNA secondary structures consist of base pairs formed by hydrogen bonds and of unpaired regions such as bulges and hairpin loops (as seen in Figure 12 (a)). There are several tree representations of RNA secondary structures. An RNA secondary structure can be represented as an ordered rooted tree by labeling the vertices with unpaired loops and the edges with paired bases [22]; alternatively, it can be represented as an ordered rooted tree by labeling the vertices with hairpin loops, internal loops, bulges, and paired bases [7,23]. Chen and Zhang represented an RNA secondary structure using internal vertices and leaves, which correspond to paired bases and unpaired bases, respectively [8]. In our implementation, the representation of [8] was modified by eliminating all vertices other than those corresponding to paired bases; in addition, the vertex labeled tree was transformed into an edge labeled tree in a manner similar to the glycans. Figure 12 illustrates the transformation, wherein a paired base was transformed into an edge labeled with its base pair. Note that the edges in this representation are ordered following the 5'-3' direction of the RNA sequence.
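
A minimal sketch of this transformation is shown below. It assumes, purely for illustration, that the pseudoknot-free structure is given in dot-bracket notation, which is not stated in the text; unpaired bases are skipped, each base pair becomes an edge labeled with its two bases, and sibling edges follow the 5'-3' order. The toy sequence is made up.

```python
def rna_to_edge_tree(seq: str, structure: str):
    """Build an ordered edge-labeled tree from a dot-bracket structure."""
    root = []
    # Each stack entry: (children collected so far, index of the opening base).
    stack = [(root, None)]
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(([], pos))
        elif symbol == ")":
            subtree, open_pos = stack.pop()
            label = seq[open_pos] + seq[pos]      # base pair as edge label
            stack[-1][0].append((label, subtree))
        # "." marks an unpaired base, which is not represented in the tree
    return root


seq = "GCAAAGCAAAGC"
structure = "((...)(...))"
print(rna_to_edge_tree(seq, structure))
```

For this toy input, the outer G-C pair becomes an edge below the root, and the two enclosed C-G pairs become its two ordered child edges.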

The CPLEX Optimization Studio (version 12.5) was used to solve the integer programs on a Linux operating system. The source code to transform given multiple trees into the proposed IP formulations is available on our supplementary web site at http://sunflower.kuicr.kyoto-u.ac.jp/tyoyo/treecomp/.

Minimum grammars for glycans and RNAs

We applied the proposed method MinSEUTGMul for the glycans and their several combinations. Table 1 shows the minimum number of nonterminal symbols of SEUTG for glycan unordered trees. In all cases, the minimum number of nonterminal symbols for multiple trees of glycans was lower than the sum of the minimum numbers for its single trees. For example, the minimum number of nonterminal symbols for G02677, G03661, and G03664 was 31, which was lower than the sum of the minimum numbers 13+17+16=46. This suggests that our method successfully determined several common rules among the combination of glycans, and that the compression of several glycans together is better than that of each individual glycan. The execution time and the memory usage by the IP solver, for multiple trees with over 100 vertices in our experiments, were observed to be less than 10 minutes and 4G bytes, respectively (except in the case of G02677, G03661, G04458, and G07243). These compression sizes can be used to estimate the similarities between glycan structures. If the compression size of two glycans is smaller than the sum of compression sizes of the individual glycans, these glycan structures are considered to be similar. Figure 13 shows the subtrees corresponding to the nonterminal symbols contained within the minimum SEUTG for the unordered rooted trees of the glycans G02677, G03661, and G03664; the subtrees corresponding to the same nonterminal symbol are filled with the same color, and a portion of the nonterminal symbol is shown. The nonterminal symbol colored blue appeared in all three glycans, while those colored green and red appeared in two glycans. The nonterminal symbol colored brown consisted of the nonterminal symbols colored red and blue. This implies that the hierarchical structures contained within the glycans beyond the frequent patterns can be extracted using the developed methods.
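
The comparison between joint and separate compression can be made concrete with the figures quoted above for G02677, G03661, and G03664. The helper name `shared_saving` is an assumption made for this example; it is not a measure defined in the paper.

```python
from typing import List


def shared_saving(single_minima: List[int], joint_minimum: int) -> int:
    """Nonterminal symbols saved by compressing the trees together."""
    return sum(single_minima) - joint_minimum


# Minimum numbers for the single trees (13, 17, 16) and for the set (31).
print(sum([13, 17, 16]))                 # 46
print(shared_saving([13, 17, 16], 31))   # 15
```

A positive saving indicates that some nonterminal symbols are shared among the trees.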

Table 1 Results on the minimum number of nonterminal symbols by MinSEUTGMul for glycan unordered trees


The proposed method MinSEOTGMul was applied to the RNA secondary structures and their several combinations. Table 2 shows the minimum number of nonterminal symbols of SEOTG for RNA ordered trees. In all cases, the minimum number of nonterminal symbols for multiple trees of RNA was lower than the sum of minimum numbers for its single trees, similar to the glycans. The execution time and the memory usage by the IP solver for multiple trees in our experiments were between two seconds and six hours, and between 260 Mbytes and 37 Gbytes, respectively. Figure 14 shows the subtrees corresponding to nonterminal symbols in the minimum SEOTG for ordered rooted trees of the RNA secondary structures RF00002 and RF00008. Figure 15 shows the original RNA secondary structures of RF00002 and RF00008, and the base pairs corresponding to nonterminal symbols in the minimum SEOTG. We also observed the hierarchical structure of the nonterminal symbols, colored blue and brown.

Table 2 Results on the minimum number of nonterminal symbols by MinSEOTGMul for RNA ordered trees


We examined the alternative approach that transforms multiple trees into a single tree and applies our previous methods. For the set of glycans G02677, G03661, and G03664, the alternative method output the existence of the SEUTG grammar with size 39 in 8.83 seconds. However, we could not obtain the result for size 38 within 24 hours, and could not determine the minimum number of nonterminal symbols. For the set of RNAs RF00002, and RF00004, the method output the existence of the SEOTG grammar with size 42 in 5.01 seconds. However, we could not obtain the result for size 41. We can see that the proposed methods are more efficient than the previous methods.

Conclusions

We proposed novel integer programming-based methods MinSEOTGMul and MinSEUTGMul to determine the minimum simple elementary ordered and unordered tree grammars (SEOTG and SEUTG) for multiple ordered and unordered trees, respectively. These could be directly applied to the determination of the minimum grammar, unlike our previously proposed methods. We applied MinSEUTGMul to several unordered trees transformed from glycans, and their combinations; MinSEOTGMul was applied to several ordered trees transformed from RNA secondary structures, and their combinations. In all cases, the minimum number of nonterminal symbols in the grammars used in the construction of multiple trees was lower than the sum of minimum numbers in the grammars used to construct the single trees. This suggests that the proposed methods were successful in determining several common rules for glycans and RNA. In addition, several results of the minimum grammars for multiple trees of glycans and RNA reveal that our methods can provide clues towards extracting the hierarchical structures contained within tree-structured biological data, beyond the frequent patterns.

In our experiments, the execution time and the memory usage for a set of trees required six hours and 37GBytes, respectively. To obtain the minimum SEOTG and SEUTG for more trees including more complicated trees, we need to further improve the efficiency.

In this study, we utilized the minimum grammar for the extraction of common construction rules among multiple distinct trees. However, the proposed methods can also be used for data compression. Furthermore, the execution time of some operations on multiple trees can be reduced by applying the operations to the previously obtained minimum grammar. In the future, we would like to apply our methods to more glycans, RNAs, and other tree-structured biological data.

References

Campbell JA, Davies GJ, Bulone V, Henrissat B. A classification of nucleotide-diphospho-sugar glycosyltransferases based on amino acid sequence similarities. Biochem J. 1997; 326:929–42.
Hizukuri Y, Yamanishi Y, Nakamura O, Yagi F, Goto S, Kanehisa M. Extraction of leukemia specific glycan motifs in humans by computational glycomics. Carbohydr Res. 2005; 340:2270–78.
Aoki-Kinoshita KF, Ueda N, Mamitsuka H, Kanehisa M. ProfilePSTMM: capturing tree-structure motifs in carbohydrate sugar chains. Bioinformatics. 2006; 22:25–34.
Hashimoto K, Takigawa I, Shiga M, Kanehisa M, Mamitsuka H. Mining significant tree patterns in carbohydrate sugar chains. Bioinformatics. 2008; 24:167–73.
Li L, Ching W-K, Yamaguchi T, Aoki-Kinoshita KF. A weighted q-gram method for glycan structure classification. BMC Bioinformatics. 2010; 11(Suppl 1):S33.
Fukagawa D, Tamura T, Takasu A, Tomita E, Akutsu T. A clique-based method for the edit distance between unordered trees and its application to analysis of glycan structures. BMC Bioinformatics. 2011; 12(Suppl 1):S13.
Milo N, Zakov S, Katzenelson E, Bachmat E, Dinitz Y, Ziv-Ukelson M. Unrooted unordered homeomorphic subtree alignment of RNA trees. Algorithms Mol Biol. 2013; 8:13.
Chen S, Zhang K. An improved algorithm for tree edit distance with applications for RNA secondary structure comparison. J Comb Optimization. 2014; 27:778–97.
Farid A, Pabst M, Schoberer J, Altmann F, Glössl J, Strasser R. Arabidopsis thaliana alpha1,2-glucosyltransferase (ALG10) is required for efficient N-glycosylation and leaf growth. Plant J. 2011; 68:314–25.
Charikar M, Lehman E, Liu D, Panigrahy R, Prabhakaran M, Sahai A, et al. The smallest grammar problem. IEEE Trans Inf Theory. 2005; 51:2554–76.
Rytter W. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor Comput Sci. 2003; 302:211–22.
Sakamoto H, Maruyama S, Kida T, Shimozono S. A space-saving approximation algorithm for grammar-based compression. IEICE Trans Inf Syst. 2009; 92-D:158–65.
Scicluna J, de la Higuera C. Grammatical inference of some probabilistic context-free grammars from positive data using minimum satisfiability. In: 12th International Conference on Grammatical Inference. Brookline, MA: Microtome Publishing: 2014. p. 139–52.
Busatto G, Lohrey M, Maneth S. Efficient memory representation of XML document trees. Inf Syst. 2008; 33:456–74.
Murakami S, Doi K, Yamamoto A. Finding frequent patterns from compressed tree-structure data. In: Proc. 11th Int. Conf. Discovery Science. Berlin Heidelberg: Springer: 2008. p. 284–95.
Yamagata K, Uchida T, Shoudai T, Nakamura Y. An effective grammar-based compression algorithm for tree structured data. In: Proc. 13th Int. Inductive Logic Programming. Berlin Heidelberg: Springer: 2003. p. 383–400.
Akutsu T. A bisection algorithm for grammar-based compression of ordered trees. Inf Process Lett. 2010; 110:815–20.
Zhao Y, Hayashida M, Akutsu T. Integer programming-based method for grammar-based tree compression and its application to pattern extraction of glycan tree structures. BMC Bioinformatics. 2010; 11(Suppl 11):4.
Hopcroft JE, Motwani R, Ullman JD. Chapter 5: Context-Free Grammars and Languages. Introduction to Automata Theory, Languages, and Computation. 2nd edn. Boston: Addison-Wesley; 2001, pp. 169–218.
Hashimoto K, Goto S, Kawano S, Aoki-Kinoshita F, Ueda N, Hamajima M, et al. KEGG as a glycome informatics resource. Glycobiology. 2006; 16(5):63–70.
Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, et al. Rfam 11.0: 10 years of RNA families. 2012. doi:10.1093/nar/gks1005.
Shapiro BA. An algorithm for comparing multiple RNA secondary structures. Comput Appl Biosci. 1988; 4:387–93.
Höchsmann M, Voss B, Giegerich R. Pure multiple RNA secondary structure alignments: a progressive profile approach. IEEE/ACM Trans Comput Biol Bioinformatics. 2004; 1:53–62.

Acknowledgments

This work was partially supported by Grants-in-Aid #26240034, #24500361, and #25-2920 from MEXT, Japan.

Author information

Authors and Affiliations

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Japan
Yang Zhao, Morihiro Hayashida, Yue Cao, Jaewook Hwang & Tatsuya Akutsu

Corresponding author

Correspondence to Morihiro Hayashida.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

YZ and MH developed and implemented the proposed methods, and drafted the manuscript. YC, JH, and TA participated in the discussions during the development of the methods and helped draft the manuscript. All authors have read and approved the final manuscript.

Additional file

Additional file 1

Table S1. RNA sequences used in our experiments.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Zhao, Y., Hayashida, M., Cao, Y. et al. Grammar-based compression approach to extraction of common rules among multiple trees of glycans and RNAs. BMC Bioinformatics 16, 128 (2015). https://doi.org/10.1186/s12859-015-0558-4

Received: 26 December 2014
Accepted: 30 March 2015
Published: 24 April 2015
DOI: https://doi.org/10.1186/s12859-015-0558-4