Volume 11 Supplement 11
Proceedings of the 21st International Conference on Genome Informatics (GIW2010)
Integer programming-based method for grammar-based tree compression and its application to pattern extraction of glycan tree structures
 Yang Zhao^{1},
 Morihiro Hayashida^{1} and
 Tatsuya Akutsu^{1}
DOI: 10.1186/1471-2105-11-S11-S4
© Akutsu et al; licensee BioMed Central Ltd. 2010
Published: 14 December 2010
Abstract
Background
A bisection-type algorithm for the grammar-based compression of tree-structured data has been proposed recently. In this framework, an elementary ordered-tree grammar (EOTG) and an elementary unordered-tree grammar (EUTG) were defined, and an approximation algorithm was proposed.
Results
In this paper, we propose an integer programming-based method that finds the minimum context-free grammar (CFG) for a given string under the condition that at most two symbols appear on the right-hand side of each production rule. Next, we extend this method to find the minimum EOTG and EUTG grammars for given ordered and unordered trees, respectively. Then, we conduct computational experiments on ordered and unordered artificial trees. Finally, we apply our methods to pattern extraction of glycan tree structures.
Conclusions
We propose integer programming-based methods that find the minimum CFG, EOTG, and EUTG for given strings, ordered trees, and unordered trees, respectively. Our proposed methods for trees are useful for extracting patterns of glycan tree structures.
Background
Data compression is useful because it reduces the consumption of expensive resources such as hard disk space. To date, many methods, such as Huffman coding and arithmetic coding, have been proposed for data compression. Data compression is also useful for the analysis of biological data. Li et al. proposed the universal similarity metric (USM), and approximated the dissimilarity using compression sizes. They applied a compression algorithm to unaligned mitochondrial genomes, and obtained a phylogeny that was consistent with the commonly accepted one [1]. Similarly, protein tertiary structures and metabolic networks were compressed, and their similarities were measured [2, 3]. Grammar-based compression, which is a typical data-compression method, seeks a small grammar that generates a given string. It is well known that finding the smallest context-free grammar (CFG) is NP-hard. However, in recent years, several polynomial-time algorithms have been proposed that approximate the smallest grammar for the input data within a factor of O(log(n/m)), where n and m are the sizes of the input data and the smallest grammar, respectively [4–6]. These algorithms can be used to compress biological data such as DNA, RNA, and amino acid sequences. However, there also exists a large amount of tree-structured biological data (e.g., glycans). Therefore, it is necessary to develop methods to compress tree-structured data. Recent approaches show that it is feasible to extend grammar-based compression to tree-structured data [7–9]. However, these methods neither output the minimum grammar nor achieve a guaranteed approximation ratio.
In this paper, we propose an integer programming (IP)-based method that finds the minimum CFG for a given string under the condition that at most two symbols appear on the right-hand side of each production rule. Next, we extend this method to find the minimum elementary ordered-tree grammar (EOTG) and elementary unordered-tree grammar (EUTG) for given ordered and unordered trees. To the best of our knowledge, these are the first methods that can find the minimum-size grammars for strings, ordered trees, and unordered trees.
It is possible to compress ordered trees by transforming them into Euler strings [10], and by applying existing grammar-based string-compression algorithms to the strings. Such an approach may achieve better compression performance. However, the string grammars derived in this way do not always correspond to tree grammars. Our objective is not only to compress trees but also to extract features and patterns from input trees. Therefore, we need to develop methods for finding minimum tree grammars.
The organization of the paper is as follows. First, we give an IP-based method for sequence data compression using CFG. Second, we review an elementary ordered tree grammar (EOTG) [10] for ordered rooted tree compression, and extend this IP-based method to the tree compression problem. Then, we also review an elementary unordered tree grammar (EUTG) [10] for unordered rooted tree compression, and extend the above-mentioned IP-based method to unordered trees. Furthermore, we conduct some computational experiments, apply the proposed methods to glycan tree-structure data, and extract tree patterns from the generated production rules of simple EUTGs. Finally, we conclude with future work.
IP formulation for strings
Minimum CFG problem
We use a simple context-free grammar (CFG) for string and text compression. A CFG is defined as a 4-tuple (Σ, Γ, S, Δ), where Σ is a set of terminal symbols (each denoted by a lowercase letter), Γ is a set of nonterminal symbols (each denoted by an uppercase letter), S is a start symbol in Γ, and Δ is a set of production rules. The size of a CFG is defined as the total number of letters appearing on the RHSs (Right-Hand Sides) of its production rules. Two CFGs G_{1} and G_{2} are said to be equivalent if G_{1} generates the same set of strings as G_{2} does. We only consider CFGs consisting of the following types of production rules:

A→a,

A→BC.
We call this CFG a simple CFG. We can show that any CFG of size m can be transformed into an equivalent, simple CFG of size 3m.
The smallest grammar problem is thus defined as the problem of finding the smallest grammar that generates a given string [4].
Minimum CFG
Input: String s = s_{1}s_{2}…s_{n} and integer m.
Output: Simple CFG with m nonterminal symbols that generates s only.
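To make the problem statement concrete, the following is a minimal sketch that finds the smallest such m for a short, non-empty string by exhaustive search. It is not the IP-based method of this paper: it simply tries subsets of the distinct substrings of s, treating each chosen substring u as one nonterminal symbol A_u. The function names are hypothetical, and the search is only practical for very short strings.

```python
from itertools import combinations


def distinct_substrings(s):
    """Return the set of all distinct non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def is_valid(chosen):
    """Check that every chosen substring of length > 1 has a binary rule.

    A chosen substring u stands for one nonterminal A_u, and a split
    u = v + w with v and w both chosen stands for the rule A_u -> A_v A_w.
    Substrings of length 1 are covered by rules of the form A -> a.
    """
    return all(
        any(u[:k] in chosen and u[k:] in chosen for k in range(1, len(u)))
        for u in chosen
        if len(u) > 1
    )


def min_nonterminals(s):
    """Return (m, chosen) for the smallest simple CFG that generates only s."""
    # One nonterminal per distinct letter, plus the start symbol for s itself.
    required = set(s) | {s}
    optional = sorted(distinct_substrings(s) - required)
    # Try 0, 1, 2, ... additional nonterminals; the first hit is minimal.
    for extra in range(len(optional) + 1):
        for picked in combinations(optional, extra):
            chosen = required | set(picked)
            if is_valid(chosen):
                return len(chosen), chosen
    raise ValueError("unreachable: choosing every substring is always valid")
```

For s = abcabcab, this search returns m = 7; one minimal choice is the set of substrings a, b, c, ab, abc, abcabc, and abcabcab.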
Transformation to IP
We use the number of nonterminal symbols m instead of the size of the grammar because the number of terminal symbols appearing in production rules of the form A → a is constant for the given string. In order to solve this minimum CFG problem, we propose an IP-based method as follows. We transform this problem into the following integer program, where x_{1,n} = 1 holds iff there exists a required CFG G.
Maximize x_{1,n}
Subject to
x_{i,i} = 1 for all i=1,…,n
In the above equations, each of the variables x_{i,j}, y_{i,k,j}, and z_{u} takes either 0 or 1. Each x_{i,j} corresponds to the substring s_{i,j} = s_{i}s_{i+1}…s_{j}, and x_{i,j} = 1 iff there exists a nonterminal symbol A_{i,j} in G that generates s_{i,j}. y_{i,k,j} = 1 iff both x_{i,k} = 1 and x_{k+1,j} = 1 hold. This means that s_{i,j} can be generated by concatenating s_{i,k} and s_{k+1,j}, which are generated from the nonterminal symbols A_{i,k} and A_{k+1,j}, respectively. z_{u} = 1 iff there exists a nonterminal symbol in G that generates a substring u of s. The meaning of each (in)equality is as follows:
(1) each s_{i,i} (= s_{ i }) must be generated,
(2,3) if A_{i,j} appears in G, s_{i,j} must be generated, that is, for at least one k, both s_{i,k} and s_{k+1,j} must be generated and the production rule A_{i,j} → A_{i,k}A_{k+1,j} must appear in G,
(4) A_{i,j} and A_{i',j'} are identified if both generate the same substring u, and
(5) the number of nonterminal symbols used in G must be m.
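Only (in)equality (1) is written out in the text above. As a reading aid, one set of linear constraints that is consistent with the stated meanings of (1) to (5) can be written as follows. The exact form is an assumption and not necessarily the authors' formulation. In particular, the factor 1/2 in (2), the direction of the inequality in (4), and the use of ≤ rather than = in (5) are choices made here for illustration.

```latex
\begin{align}
x_{i,i} &= 1 && \text{for all } i = 1,\dots,n \tag{1}\\
y_{i,k,j} &\le \tfrac{1}{2}\,(x_{i,k} + x_{k+1,j}) && \text{for all } i \le k < j \tag{2}\\
x_{i,j} &\le \sum_{k=i}^{j-1} y_{i,k,j} && \text{for all } i < j \tag{3}\\
z_{u} &\ge x_{i,j} && \text{for all } (i,j) \text{ with } s_{i,j} = u \tag{4}\\
\sum_{u} z_{u} &\le m \tag{5}
\end{align}
```

Under this reading, (2) allows y_{i,k,j} = 1 only when both halves are generated, (3) allows x_{i,j} = 1 only when some split is available, (4) forces z_{u} = 1 whenever any occurrence of u is used, and (5) bounds the number of distinct nonterminal symbols.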
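For example, for the string s = abcabcab (n = 8), a solution with x_{1,8} = 1 corresponds to production rules such as the following:
A_{1,8} → A_{1,6}A_{7,8},
A_{1,6} → A_{1,3}A_{4,6},
A_{1,3} → A_{1,2}A_{3,3},
A_{4,6} → A_{4,5}A_{6,6},
A_{7,8} → A_{7,7}A_{8,8},
….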
On the other hand, we have A_{abcabcab} = A_{1,8}, A_{abc} = A_{1,3} = A_{4,6}, etc. Therefore, we finally have:
A_{abcabcab} → A_{abcabc}A_{ab},
A_{abcabc} → A_{abc}A_{abc},
A_{abc} → A_{ab}A_{c},
A_{ab} → A_{a}A_{b},
A_{a} → a,
A_{b} → b,
A_{c} → c.
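The final grammar can be checked mechanically. The sketch below stores the rules in simple (binary) form, keeping the intermediate nonterminal A_abcabc that arises from A_{1,6} → A_{1,3}A_{4,6}, expands the start symbol, and computes the grammar size as defined earlier (the total number of letters on the right-hand sides). The dictionary layout and function names are hypothetical and chosen only for illustration.

```python
# Each nonterminal maps either to a terminal letter (rule A -> a)
# or to a pair of nonterminals (rule A -> B C).
RULES = {
    "A_abcabcab": ("A_abcabc", "A_ab"),  # from A_{1,8} -> A_{1,6} A_{7,8}
    "A_abcabc": ("A_abc", "A_abc"),      # from A_{1,6} -> A_{1,3} A_{4,6}
    "A_abc": ("A_ab", "A_c"),
    "A_ab": ("A_a", "A_b"),
    "A_a": "a",
    "A_b": "b",
    "A_c": "c",
}


def expand(symbol, rules):
    """Return the unique string generated from the given nonterminal."""
    rhs = rules[symbol]
    if isinstance(rhs, str):          # rule of the form A -> a
        return rhs
    left, right = rhs                 # rule of the form A -> B C
    return expand(left, rules) + expand(right, rules)


def grammar_size(rules):
    """Total number of letters on the right-hand sides of all rules."""
    return sum(1 if isinstance(rhs, str) else len(rhs) for rhs in rules.values())


generated = expand("A_abcabcab", RULES)
size = grammar_size(RULES)
```

Here the grammar has m = 7 nonterminal symbols and t = 3 rules of the form A → a, so its size is 2(m − t) + t = 11. This illustrates why, for a fixed string, minimizing m and minimizing the size amount to the same thing when every nonterminal has exactly one rule.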
IP formulation for ordered trees
Minimum EOTG problem
We use a simple elementary ordered tree grammar (EOTG) [10] for rooted tree compression. In this grammar, a tree can contain a special vertex called a tag, which marks the position where the root of another tree can be attached. We assume that there is at most one tag in such a tree.
Within the class of simple EOTGs (SEOTGs), we can transform the minimum grammar problem into an IP. For this purpose, we define the minimum SEOTG problem as follows.
Minimum SEOTG
Input: Rooted ordered tree T(V, E) and integer m, where V is a set of vertices and E is a set of labeled edges.
Output: Simple EOTG with m nonterminal symbols that generates only T.
Transformation to IP
Maximize x_{1,∈,lch(1),rch(1)}
(1u,t) each untagged (tagged) edge T_{i,∈,j,j} (T_{i,j,j,j}) must be generated,
(2-3u,t) if A_{i,j,h,k} appears in G, then either, for at least one l, A_{i,j,h,k} → A_{i,j,h,l}A_{i,j,l+1,k} ∈ G holds, or, for at least one t, A_{i,j,h,k} → A_{i,t,h,k}A_{t,j,lch(t),rch(t)} ∈ G holds. The former means that T_{i,j,h,k} is horizontally divided at root i into the children {s ∈ ch(i) | s ≤ l} and {s ∈ ch(i) | s > l}. The latter means that T_{i,j,h,k} is vertically divided at t into two subtrees T_{i,t,h,k} and T_{t,j,lch(t),rch(t)} (if j ≠ ∈, t must be in an(j); otherwise, a divided tree would have two tags),
(4) A_{i,j,h,k} and A_{i',j',h',k'} are identified if both generate the same Euler string u, and
(5) the number of nonterminal symbols used in G must be m.
IP formulation for unordered trees
Thus, within the class of simple EUTGs (SEUTGs), we define the minimum SEUTG problem as follows:
Minimum SEUTG
Input: Rooted unordered tree T(V, E) and integer m, where V is a set of vertices and E is a set of labeled edges.
Output: Simple EUTG with m nonterminal symbols that generates only T.
We must identify unordered subtrees to count the number of nonterminal symbols m. For this purpose, we also use the Euler strings es(T) for the unordered trees T, as in the minimum SEOTG problem. First, the unordered tree T is transformed into an ordered tree T' as follows. The children of each vertex in T are sorted by their labels, and if the children include a tag, the tag is moved to the first position among them. Next, es(T) is defined to be es(T').
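The transformation from T to T' can be sketched as follows. The Euler string is not defined in this section, so the sketch assumes the usual convention for edge-labeled trees: a depth-first traversal emits the edge label on the way down and a barred copy of the label on the way back (written here with a trailing "^"). A vertex is represented as a list of (edge label, child) pairs. Ties between equal labels are broken by the Euler string of the subtree, which is a stronger ordering than sorting by labels alone. Tags are not modeled. All names are hypothetical.

```python
def euler_string(vertex):
    """Euler string of an ordered tree given as a list of (label, child) pairs."""
    tokens = []
    for label, child in vertex:
        tokens.append(label)                 # going down the edge
        tokens.extend(euler_string(child))   # visiting the subtree
        tokens.append(label + "^")           # going back up ("^" stands for the bar)
    return tokens


def canonical_order(vertex):
    """Return an ordered copy T' of an unordered tree T with sorted children."""
    ordered = [(label, canonical_order(child)) for label, child in vertex]
    # Sort by edge label, then by the Euler string of the (already sorted) subtree.
    ordered.sort(key=lambda edge: (edge[0], euler_string(edge[1])))
    return ordered


def unordered_euler_string(vertex):
    """es(T) for an unordered tree T, computed as es(T')."""
    return euler_string(canonical_order(vertex))


# Two drawings of the same unordered tree with different child orders.
t1 = [("a", [("c", []), ("b", [])]), ("a", [])]
t2 = [("a", []), ("a", [("b", []), ("c", [])])]
```

With this representation, t1 and t2 have different Euler strings as ordered trees but the same value of unordered_euler_string, so the corresponding subtrees would be identified as one nonterminal symbol.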
Thus, this problem is transformed into the following integer program, where x_{1,∈,ch(1)} = 1 holds iff there exists a required EUTG G:
Maximize x_{1,∈,ch(1)}
Computational experiments
We implemented the above-mentioned IP-based methods for ordered and unordered trees to perform computational experiments. We used ILOG CPLEX (version 11.2, http://www.ilog.com/products/cplex/) to solve the integer programs. All of the computational experiments were conducted on a PC with a 3.33 GHz Xeon CPU and 10 GB RAM running Linux. In our implementation, we first transformed the minimum grammar problems for ordered and unordered trees into integer programs. Next, we used ILOG CPLEX to obtain the number of nonterminal symbols needed in the minimum grammars. Finally, the minimum grammars for the tree compression were constructed from the solutions of the integer programs. We also measured the computational time required to solve these integer programs. We performed experiments on both artificial data and glycan tree-structure data, and compared our proposed methods with an existing method.
Artificial data
T_{1,∈,2,6} → T_{1,∈,2,2}T_{1,∈,3,6} (1)
T_{1,∈,3,6} → T_{1,∈,3,3}T_{1,∈,6,6} (2)
T_{1,∈,3,3} → T_{1,3,3,3}T_{3,∈,4,5} (3)
T_{3,∈,4,5} → T_{3,∈,4,4}T_{3,∈,5,5}. (4)
The production rules of this tree compression are also shown in Figure 7. The elapsed time to solve the IP was 0.014 s.
When T was regarded as an unordered tree, we generated the integer program with 14 nonterminal symbols for 12 horizontal and 4 vertical divisions. The minimum number of nonterminal symbols of T is 6 (Figure 8). The minimum grammar was constructed as follows.
T_{1,∈,2,3,6} → T_{1,∈,3}T_{1,∈,2,6} (5)
T_{1,∈,3} → T_{1,3,3}T_{3,∈,4,5} (6)
T_{1,∈,2,6} → T_{1,∈,2}T_{1,∈,6} (7)
T_{3,∈,4,5} → T_{3,∈,4}T_{3,∈,5}. (8)
Table 1. Results on the elapsed time (seconds) for ordered and unordered trees of type A and B
tree type  max degree  # vertices  ordered m  ordered time (s)  unordered m  unordered time (s)
A  2  11  7  0.021  7  0.019
A  2  31  10  302.74  10  329.20
A  2  41  10  8063.19  10  7730.64
A  2  51  12*  230.51  12*  233.44
B  3  7  9  0.011  8  0.010
B  6  19  11  0.185  10  1.108
B  8  25  11  1.404  10  26440.01
B  10  31  12  2.265  -  -
B  16  49  11  481.15  -  -
B  20  61  13  432.72  -  -
Glycan treestructure data
Table 2. Statistics of the glycans G02703, G03655, G03710, G04045, G04458, G04666, G04859, G05058, G05256, G05552, G06867, and G09054, and results on the grammar size
glycan  max degree  # vertices  # distinct labels  Min SEOTG size  Min SEOTG time  Min SEUTG size  Min SEUTG time  TREEBISECTION size  TREEBISECTION time
G02703  3  26  3  22  3.68  22  2.9  32  0.001
G03655  3  34  3  47  0.96  47  2.32  49  0.001
G03710  3  28  3  20  0.47  20  0.51  20  0.001
G04045  3  36  3  20  1.77  20  1.98  22  0.001
G04458  3  21  2  16  1.55  16  0.69  36  0.001
G04666  3  20  4  25  1.41  25  0.94  33  0.001
G04859  3  19  5  27  0.12  27  0.25  29  0.001
G05058  3  25  5  26  3.03  26  66.28  36  0.001
G05256  3  25  2  19  3.14  19  3.98  29  0.001
G05552  3  19  5  27  0.66  23  0.23  27  0.001
G06867  3  28  3  22  2.22  22  6.46  26  0.001
G09054  4  31  5  29  2.81  29  6.71  29  0.001
We compared the grammar sizes of the minimum SEOTG and SEUTG obtained by our methods with those obtained by an existing method, TREEBISECTION [10]. TREEBISECTION repeatedly divides a given tree horizontally and vertically, such that the two resulting subtrees have similar sizes, until each subtree consists of a single edge. It is known that TREEBISECTION computes in polynomial time a simple EOTG of size O(mn^{5/6}) [10], where m is the size of the minimum simple EOTG and n is the number of vertices of the given tree. Table 2 shows the grammar sizes and the elapsed times for our proposed IP-based methods for the minimum SEOTG and SEUTG problems, and for TREEBISECTION. The minimum SEOTG size was the same as the minimum SEUTG size for each glycan except G05552 because, in those trees, the vertices have at most two children, and all subtrees of any vertex having three children are isomorphic. The size of the grammar generated by our methods was always smaller than or equal to that generated by TREEBISECTION, and the ratio of the TREEBISECTION size to the minimum SEOTG size ranged from 1.0 (G09054) to 2.25 (G04458). This result shows that our proposed methods achieve better compression than TREEBISECTION.
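The ratios quoted above can be recomputed directly from the sizes listed in Table 2. The following sketch assumes that the ratio is taken as the TREEBISECTION size divided by the minimum SEOTG size; the variable names are hypothetical.

```python
# Grammar sizes from Table 2: glycan -> (min SEOTG, min SEUTG, TREEBISECTION).
SIZES = {
    "G02703": (22, 22, 32),
    "G03655": (47, 47, 49),
    "G03710": (20, 20, 20),
    "G04045": (20, 20, 22),
    "G04458": (16, 16, 36),
    "G04666": (25, 25, 33),
    "G04859": (27, 27, 29),
    "G05058": (26, 26, 36),
    "G05256": (19, 19, 29),
    "G05552": (27, 23, 27),
    "G06867": (22, 22, 26),
    "G09054": (29, 29, 29),
}

# Ratio of the TREEBISECTION size to the minimum SEOTG size for each glycan.
ratios = {g: tb / seotg for g, (seotg, seutg, tb) in SIZES.items()}

# Glycans for which the ordered and unordered minimum sizes differ.
differing = [g for g, (seotg, seutg, tb) in SIZES.items() if seotg != seutg]

worst = max(ratios, key=ratios.get)
```

The largest ratio is obtained for G04458 (36/16 = 2.25), the smallest ratio is 1.0, and G05552 is the only glycan whose minimum SEOTG and SEUTG sizes differ.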
Conclusions
We proposed integer programming-based methods for finding the minimum grammars to generate given strings, ordered trees, and unordered trees. By conducting computational experiments, we confirmed that our IP formulations work correctly. The results also show that our IP-based grammar compression is efficient for ordered trees, although some improvements are required for unordered trees.
We applied our proposed method to glycan tree-structure data, and extracted interesting patterns. Although these patterns were obtained from production rules generated for a single tree, we may be able to extract common patterns and rules from multiple glycans by extending our methods to find minimum grammars to generate given forests.
In this paper, we dealt with grammars for trees. However, real structured data often contain some cycles. Therefore, we are in the process of developing IP-based methods for more complex structured data.
Declarations
Acknowledgements
This work was partially supported by Grants-in-Aid #22240009 and #21700323 from MEXT, Japan.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 11, 2010: Proceedings of the 21st International Conference on Genome Informatics (GIW2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S11.
Authors’ Affiliations
References
1. Li M, Badger J, Chen X, Kwong S, Kearney P, Zhang H: An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics. 2001, 17: 149-154.
2. Hayashida M, Akutsu T: Image compression-based approach to measuring the similarity of protein structures. In Proc. 6th Asia-Pacific Bioinformatics Conference. 2008, 221-230.
3. Hayashida M, Akutsu T: Comparing biological networks via graph compression. BMC Systems Biology. 2010, 4 (Supp 2): S13.
4. Charikar M, Lehman E, Liu D, Panigrahy R, Prabhakaran M, Sahai A, Shelat A: The smallest grammar problem. IEEE Transactions on Information Theory. 2005, 51: 2554-2576.
5. Rytter W: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science. 2003, 302: 211-222.
6. Sakamoto H, Maruyama S, Kida T, Shimozono S: A space-saving approximation algorithm for grammar-based compression. IEICE Transactions on Information and Systems. 2009, 92-D: 158-165.
7. Busatto G, Lohrey M, Maneth S: Efficient memory representation of XML document trees. Information Systems. 2008, 33: 456-474.
8. Murakami S, Doi K, Yamamoto A: Finding frequent patterns from compressed tree-structure data. In Proc. 11th Int. Conf. Discovery Science. 2008, 284-295.
9. Yamagata K, Uchida T, Shoudai T, Nakamura Y: An effective grammar-based compression algorithm for tree structured data. In Proc. 13th Int. Inductive Logic Programming. 2003, 383-400.
10. Akutsu T: A bisection algorithm for grammar-based compression of ordered trees. Information Processing Letters. 2010, 110: 815-820.
11. Hizukuri Y, Yamanishi Y, Nakamura O, Yagi F, Goto S, Kanehisa M: Extraction of leukemia specific glycan motifs in humans by computational glycomics. Carbohydrate Research. 2005, 340: 2270-2278.
12. Hashimoto K, Goto S, Kawano S, Aoki-Kinoshita K, Ueda N, Hamajima M, Kawasaki T, Kanehisa M: KEGG as a glycome informatics resource. Glycobiology. 2006, 16 (5): 63R-70R.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.