Integer programming-based method for grammar-based tree compression and its application to pattern extraction of glycan tree structures

Background: A bisection-type algorithm for the grammar-based compression of tree-structured data has been proposed recently. In this framework, an elementary ordered-tree grammar (EOTG) and an elementary unordered-tree grammar (EUTG) were defined, and an approximation algorithm was proposed.
Results: In this paper, we propose an integer programming-based method that finds the minimum context-free grammar (CFG) for a given string under the condition that at most two symbols appear on the right-hand side of each production rule. Next, we extend this method to find the minimum EOTG and EUTG grammars for given ordered and unordered trees, respectively. Then, we conduct computational experiments on artificial ordered and unordered trees. Finally, we apply our methods to pattern extraction of glycan tree structures.
Conclusions: We propose integer programming-based methods that find the minimum CFG, EOTG, and EUTG for given strings, ordered trees, and unordered trees, respectively. Our proposed methods for trees are useful for extracting patterns of glycan tree structures.


Background
Data compression is useful because it can reduce the consumption of expensive resources such as hard disk space. Many methods, such as Huffman coding and arithmetic coding, have been proposed for data compression. Data compression is also useful for the analysis of biological data. Li et al. proposed the universal similarity metric (USM) and approximated the dissimilarity using compression sizes. They applied a compression algorithm to unaligned mitochondrial genomes and obtained a phylogeny that was consistent with the commonly accepted one [1]. Similarly, protein tertiary structures and metabolic networks were compressed, and their similarities were measured [2,3]. Grammar-based compression, which is a typical data-compression method, seeks a small grammar that generates a given string. It is well known that finding the smallest context-free grammar (CFG) is NP-hard. However, in recent years, several polynomial-time algorithms have been proposed that approximate the smallest grammar for the input data within a factor of O(log(n/m)), where n and m are the sizes of the input data and the smallest grammar, respectively [4][5][6]. These algorithms can be used to compress biological data such as DNA, RNA, and amino acid sequences. However, there also exists a large amount of tree-structured biological data (e.g., glycans). Therefore, it is necessary to develop methods to compress tree-structured data. Recent approaches show that it is feasible to extend grammar-based compression to tree-structured data [7][8][9]. However, these methods neither output the minimum grammar nor achieve a guaranteed approximation ratio.
In this paper, we propose an integer programming (IP)-based method that finds the minimum CFG for a given string under the condition that at most two symbols appear on the right-hand side of each production rule. Next, we extend this method to find the minimum elementary ordered-tree grammar (EOTG) and elementary unordered-tree grammar (EUTG) for given ordered and unordered trees. To the best of our knowledge, these are the first methods that can find the minimum size grammars for strings, ordered trees, and unordered trees.
It is possible to compress ordered trees by transforming them into Euler strings [10] and applying existing grammar-based string-compression algorithms to the strings. Such an approach may achieve better compression performance. However, a tree grammar corresponding to the string grammar derived by this approach does not always exist. Our objective is not only to compress trees but also to extract features and patterns from input trees. Therefore, we need to develop methods for finding minimum tree grammars.
The organization of the paper is as follows. First, we give an IP-based method for sequence data compression using CFGs. Second, we review the elementary ordered tree grammar (EOTG) [10] for ordered rooted tree compression, and extend the IP-based method to the tree compression problem. Then, we review the elementary unordered tree grammar (EUTG) [10] for unordered rooted tree compression, and extend the IP-based method to unordered trees. Furthermore, we conduct computational experiments, apply the proposed methods to glycan tree-structure data, and extract tree patterns from the generated production rules of simple EUTGs. Finally, we conclude with future work.

Minimum CFG problem
We use a simple context-free grammar (CFG) for string and text compression. A CFG is defined as a 4-tuple (Σ, Γ, S, Δ), where Σ is a set of terminal symbols (denoted by lower-case letters), Γ is a set of nonterminal symbols (denoted by upper-case letters), S is a start symbol in Γ, and Δ is a set of production rules. The size of a CFG is defined as the total number of letters appearing on the right-hand sides (RHSs) of its production rules. Two CFGs G_1 and G_2 are said to be equivalent if G_1 generates the same set of strings as G_2 does. We only consider CFGs consisting of the following types of production rules:
• A → a,
• A → BC.
We call this CFG a simple CFG. We can show that any CFG of size m can be transformed into an equivalent, simple CFG of size 3m.
The smallest grammar problem is thus defined as the problem of finding the smallest grammar that generates a given string [4].

Minimum CFG
Input: String s = s_1 s_2 … s_n and integer m.
Output: Simple CFG with m nonterminal symbols that generates s only.
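To make the problem statement concrete, the following minimal sketch checks whether a candidate grammar is a simple CFG that generates exactly one string, and reports its size and number of nonterminal symbols. The dictionary encoding, the function names, and the example grammar for "abcabcab" are illustrative assumptions; the grammar is not the one reported by the authors.

```python
from typing import Dict, Optional, Set, Tuple

# Assumed encoding: each nonterminal maps to its single rule, either a
# 1-tuple holding a terminal (A -> a) or a 2-tuple of nonterminals (A -> BC).
Grammar = Dict[str, Tuple[str, ...]]


def expand(grammar: Grammar, symbol: str, active: Optional[Set[str]] = None) -> str:
    """Return the unique string derived from `symbol`; reject cyclic rules."""
    active = set() if active is None else active
    if symbol in active:
        raise ValueError(f"cyclic rule at {symbol!r}")
    rhs = grammar[symbol]
    if len(rhs) == 1:  # rule of type A -> a
        return rhs[0]
    if len(rhs) != 2:
        raise ValueError(f"{symbol!r} is not a simple-CFG rule")
    active.add(symbol)
    left, right = rhs  # rule of type A -> BC
    result = expand(grammar, left, active) + expand(grammar, right, active)
    active.remove(symbol)
    return result


def grammar_size(grammar: Grammar) -> int:
    """Total number of letters on the right-hand sides of all rules."""
    return sum(len(rhs) for rhs in grammar.values())


def is_solution(grammar: Grammar, start: str, s: str, m: int) -> bool:
    """True if the grammar has m nonterminals and derives exactly s."""
    return len(grammar) == m and expand(grammar, start) == s


# Hypothetical grammar for "abcabcab" (chosen by hand for illustration).
EXAMPLE: Grammar = {
    "S": ("F", "D"), "F": ("E", "E"), "E": ("D", "C"), "D": ("A", "B"),
    "A": ("a",), "B": ("b",), "C": ("c",),
}
```

With this encoding each nonterminal has exactly one rule, so an acyclic grammar derives a single string, which matches the requirement that the grammar generates s only. The example grammar uses 7 nonterminal symbols and has size 11.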

Transformation to IP
We use the number of nonterminal symbols m instead of the size of the grammar because the number of terminal symbols appearing in production rules of the form A → a is constant for a given string. To solve the minimum CFG problem, we transform it into the following integer program, where x_{1,n} = 1 holds iff there exists a required CFG G.
Maximize x_{1,n} subject to constraints (1)-(5), which are described below. Each of the variables x_{i,j}, y_{i,k,j}, and z_u takes either 0 or 1. Each x_{i,j} corresponds to the substring s_{i,j} = s_i s_{i+1} … s_j, and x_{i,j} = 1 iff there exists a nonterminal symbol A_{i,j} in G that generates s_{i,j}. y_{i,k,j} = 1 iff both x_{i,k} = 1 and x_{k+1,j} = 1 hold; this means that s_{i,j} can be generated by concatenating s_{i,k} and s_{k+1,j}, which are generated from the nonterminal symbols A_{i,k} and A_{k+1,j}, respectively. z_u = 1 iff there exists a nonterminal symbol in G that generates a substring u of s. The meaning of each (in)equality is as follows: (1) each s_{i,i} (= s_i) must be generated; (2,3) if A_{i,j} appears in G, s_{i,j} must be generated, that is, for at least one k, both s_{i,k} and s_{k+1,j} must be generated and the production rule A_{i,j} → A_{i,k} A_{k+1,j} must appear in G; (4) A_{i,j} and A_{i',j'} are identified if both generate the same substring u; and (5) the number of nonterminal symbols used in G must be m. Figure 1 shows an example of the above IP formulation for the string "abcabcab", for which a grammar is constructed from a solution of the IP.
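One way to write a program that matches the verbal descriptions of (1)-(5) is sketched below. This is a reconstruction from the prose only, not necessarily the form used by the authors: the index ranges, the averaging form of constraint (3), and the per-occurrence form of constraint (4) are assumptions.

```latex
\begin{align}
\text{Maximize}\quad & x_{1,n} \notag\\
\text{subject to}\quad
& x_{i,i} = 1 && 1 \le i \le n, \tag{1}\\
& x_{i,j} \le \sum_{k=i}^{j-1} y_{i,k,j} && 1 \le i < j \le n, \tag{2}\\
& y_{i,k,j} \le \tfrac{1}{2}\,\bigl(x_{i,k} + x_{k+1,j}\bigr) && 1 \le i \le k < j \le n, \tag{3}\\
& z_u \ge x_{i,j} && \text{for all } i \le j \text{ with } s_{i,j} = u, \tag{4}\\
& \sum_{u} z_u = m, \tag{5}\\
& x_{i,j},\; y_{i,k,j},\; z_u \in \{0,1\}. \notag
\end{align}
```

In this sketch, (3) forces y_{i,k,j} = 0 unless both halves are generated, (2) then requires at least one valid split for every generated substring of length two or more, and (4) charges one nonterminal symbol per distinct substring u rather than per occurrence.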

Minimum EOTG problem
We use a simple elementary ordered tree grammar (EOTG) [10] for rooted tree compression. In this grammar, a tree can contain a vertex called a tag. A tag indicates a vertex to which the root of another tree can be attached. We assume that there is at most one tag in such a tree. A simple EOTG (SEOTG) is defined as a 4-tuple (Σ, Γ, S, Δ), where Σ is a set of terminal symbols, Γ is a set of nonterminal symbols, each edge of a tree is labeled with either a terminal or a nonterminal symbol, S is a start symbol in Γ, and Δ is a set of production rules of types (R1u,t), (R2u,t,t'), and (R3u,t), as shown in Figure 2. (R1u) (respectively, (R1t)) denotes a rule in which an untagged (tagged) edge labeled with a nonterminal symbol A is replaced with an untagged (tagged) edge labeled with a terminal symbol a. (R2u,t,t') denotes a rule in which an edge labeled with a nonterminal symbol A is replaced with a tree whose root is the common upper endpoint of two edges labeled with nonterminal symbols B and C, and whose two children are their lower endpoints. (R3u,t) denotes a rule in which an edge labeled with A is replaced with a tree whose root is the upper endpoint of an edge labeled with B, and in which the lower endpoint of that edge is the upper endpoint of an edge labeled with C. We can show that any EOTG of size m can be transformed into an equivalent SEOTG of size 3m.
Within the class of SEOTGs, we can transform the minimum grammar problem into the IP. For this purpose, we define the minimum SEOTG problem as follows.

Minimum SEOTG
Input: Rooted ordered tree T(V, E) and integer m, where V is a set of vertices and E is a set of labeled edges.
Output: Simple EOTG with m nonterminal symbols that generates only T.

Transformation to IP
We consider subtrees T_{i,t,h,k} of T(V, E), where i is the root, t is the tag, and the subtree includes all the children of i between h and k in T(V, E). If a subtree does not contain any tag, we introduce a special symbol Î that is not included in V, and represent the subtree as T_{i,Î,h,k}. It is clear from the production rules of simple EOTGs that it is sufficient to consider only such subtrees T_{i,t,h,k} of T(V, E), because (R2u,t) denotes a rule that horizontally divides a tree into two trees at the root, and (R3u,t) denotes a rule that vertically divides a tree into two trees at an internal vertex that becomes a tag. Let ch(i) = (lch(i), …, rch(i)) denote the sequence of all the children of i in T(V, E), where lch(i) is the left-most child and rch(i) is the right-most child. Without loss of generality, we assume that i_1 ≤ … ≤ i_k for ch(i) = (i_1, …, i_k). We suppose that the root of T(V, E) is 1. Then, this problem can be transformed into the following integer program, where x_{1,Î,lch(1),rch(1)} = 1 holds iff there exists a required EOTG G.
Maximize x_{1,Î,lch(1),rch(1)} subject to constraints on the variables x, y^{ho}, y^{ve}, and z, where I(T_{i,Î,h,k}) denotes the set of internal vertices (vertices that are neither the root nor leaves) of T_{i,Î,h,k}, an(i) denotes the set of ancestors of i (i ∉ an(i), and we suppose an(Î) = ∅), and es(T) denotes the Euler string of the ordered tree T (for a tagged tree, the tagged edge with label A is transformed into AxĀ, where x is a special symbol representing the tag). It should be noted that if es(T) = es(T'), then T is isomorphic to T'. In the above program, each of the variables x_{i,j,h,k}, y^{ho}_{i,j,h,l,k}, y^{ve}_{i,j,h,k,t}, and z_u takes either 0 or 1. Each x_{i,j,h,k} corresponds to the subtree T_{i,j,h,k}, and x_{i,j,h,k} = 1 iff there exists a nonterminal symbol A_{i,j,h,k} in G that generates T_{i,j,h,k}. z_u = 1 iff there exists a nonterminal symbol in G that generates the subtree u of T(V, E). The meaning of each (in)equality is illustrated in Figure 4. In particular, a vertical division at vertex t divides T_{i,j,h,k} into T_{i,t,h,k} and T_{t,j,lch(t),rch(t)} (if j ≠ Î, t must be in an(j); otherwise, a divided tree would have two tags).
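The Euler string es(T), which is used to decide whether two subtrees are the same, can be computed by a depth-first traversal. The sketch below handles untagged, edge-labeled ordered trees only; the nested-list representation, the function name, the use of a trailing apostrophe in place of the barred symbol, and the small example tree are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed representation: an ordered tree is the list of its root's child
# edges from left to right; each edge is (label, subtree below the edge).
# A leaf is the empty list.
Tree = List[Tuple[str, "Tree"]]


def euler_string(tree: Tree) -> List[str]:
    """Euler string of an untagged, edge-labeled ordered tree.

    Descending an edge labeled a emits "a"; returning along it emits "a'",
    which stands in for the barred symbol.
    """
    out: List[str] = []
    for label, child in tree:
        out.append(label)                 # go down the edge
        out.extend(euler_string(child))   # traverse the subtree below it
        out.append(label + "'")           # come back up the edge
    return out


# Hypothetical example: root with child edges a, c, b, where the lower
# endpoint of the c edge has child edges a and b.
EXAMPLE: Tree = [("a", []), ("c", [("a", []), ("b", [])]), ("b", [])]

print(" ".join(euler_string(EXAMPLE)))
```

Because the traversal records both the descent and the return along every edge, two ordered trees with equal Euler strings have the same shape and labels, which is the property the formulation relies on when it identifies nonterminal symbols.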

IP formulation for unordered trees
In some cases, a given ordered tree is not well compressed. Figure 5 shows an example of such a tree T(V, E), where edges (1, 2), (1, 3), (1, 6), (3, 4), and (3, 5) ∈ E are labeled with a, c, b, a, and b, respectively. The subtree T_{3,Î,4,5} is the same as the subtree with root 1 that consists of the edges (1, 2) and (1, 6). However, we cannot divide the tree into such a subtree and the remaining part, as in the figure, according to the production rules of EOTGs, because the children 2 and 6 of the root are not adjacent. Therefore, we need to extend the above IP for ordered trees to one for unordered trees. For this purpose, we extend the EOTG to a grammar for unordered trees, called the elementary unordered tree grammar (EUTG) [10], and use a simple EUTG for rooted unordered tree compression.
A simple EUTG (SEUTG) is defined as a 4-tuple (Σ, Γ, S, Δ) in a similar way to a simple EOTG. The set of production rules Δ is also the same as that of EOTG (Figure 2), except that the trees appearing in the production rules are treated as unordered trees. In other words, there is no sibling order between the children B and C in the rules (R2u,t). Therefore, we must consider the subtrees of T(V, E) of the form T_{i,t,C} (Figure 6), where C (≠ ∅) is a subset of the children of i. Although ch(i) denotes the sequence of children of i for ordered trees, we let it denote the set of children of i for unordered trees. Thus, within the class of SEUTGs, we define the minimum SEUTG problem as follows:

Minimum SEUTG
Input: Rooted unordered tree T(V, E) and integer m, where V is a set of vertices and E is a set of labeled edges.
Output: Simple EUTG with m nonterminal symbols that generates only T.
We must identify identical unordered subtrees to count the number of nonterminal symbols m. For this purpose, we also use the Euler strings es(T) for unordered trees T, as in the minimum SEOTG problem. First, the unordered tree T is transformed into an ordered tree T' as follows: the children of each vertex in T are sorted by labels, and if they include a tag, the tag is moved to the first position among the children. Next, es(T) is defined as es(T').
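The sorting step can be sketched as follows for untagged unordered trees. The nested-list representation and the function name are assumptions, tags are omitted, and ties between children with equal labels are broken by the encoding of their subtrees, which is an addition made here so that the result does not depend on the order in which the input lists the children.

```python
from typing import List, Tuple

# Assumed representation: a tree is the list of its root's child edges,
# each given as (label, subtree below the edge); list order carries no meaning.
Tree = List[Tuple[str, "Tree"]]


def canonical_euler_string(tree: Tree) -> Tuple[str, ...]:
    """Euler string of an untagged unordered tree after sorting children."""
    segments = []
    for label, child in tree:
        # Segment contributed by one child edge: label, subtree, barred label
        # (the barred symbol is written with a trailing apostrophe).
        segments.append((label,) + canonical_euler_string(child) + (label + "'",))
    # Each segment starts with its edge label, so sorting the segments orders
    # the children by label first and by subtree encoding on ties.
    segments.sort()
    return tuple(token for segment in segments for token in segment)


# Two hypothetical listings of the same unordered tree.
T1: Tree = [("a", []), ("c", [("a", []), ("b", [])]), ("b", [])]
T2: Tree = [("b", []), ("c", [("b", []), ("a", [])]), ("a", [])]

print(canonical_euler_string(T1) == canonical_euler_string(T2))
```

Sorting complete segments rather than labels alone keeps the string canonical even when several children of a vertex carry the same label, so equal strings correspond to isomorphic unordered trees in this untagged setting.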
Thus, this problem is transformed into the following integer program, where x_{1,Î,ch(1)} = 1 holds iff there exists a required EUTG G: Maximize x_{1,Î,ch(1)} subject to constraints that extend those of the IP for ordered trees to the subtrees T_{i,t,C}.

Computational experiments
We implemented the above-mentioned IP-based methods for ordered and unordered trees and performed computational experiments. We used ILOG CPLEX (version 11.2, http://www.ilog.com/products/cplex/) to solve the integer programs. All of the computational experiments were conducted on a PC with a 3.33 GHz Xeon CPU and 10 GB of RAM running Linux. In our implementation, we first transformed the minimum grammar problems for ordered and unordered trees into integer programs. Next, we used ILOG CPLEX to obtain the number of nonterminal symbols needed in the minimum grammars. Finally, the minimum grammars for tree compression were constructed from the solutions of the integer programs. We also measured the computational time required to solve these integer programs.
We performed experiments on both artificial data and the glycan tree-structure data, and compared our proposed methods with an existing method.

Artificial data
We chose the left tree T of Figure 5, in which edges labeled with "a" and "b" are connected to both endpoints of an edge labeled with "c", and performed computational experiments in which this simple tree T was treated either as an ordered or as an unordered tree. When T was regarded as an ordered tree, we generated the integer program with 13 nonterminal symbols for 9 horizontal and 4 vertical divisions. The number of nonterminal symbols needed in the minimum grammar of T is 7, because the number of production rules other than (R1u,t) is 4 and the number of terminal symbols is 3 (Figure 7, in which the nonterminal symbols S, A, ..., and F are used). The minimum grammar was constructed from the solution of the IP.
When T was regarded as an unordered tree, we generated the integer program with 14 nonterminal symbols for 12 horizontal and 4 vertical divisions. The minimum number of nonterminal symbols for T is 6 (Figure 8). The minimum grammar was likewise constructed from the solution of the IP.
In addition to this simple example, we performed experiments on two types of trees with more vertices (Figure 9), where the number of vertices and the maximum degree were up to 61 and 20, respectively, and measured the elapsed times. Type A trees contain only vertices with degree at most two and edges labeled with a, while Type B trees contain edges labeled with a and b and have height two. Table 1 shows the elapsed time (seconds) required to solve the minimum SEOTG and SEUTG problems using CPLEX for ordered and unordered trees of Types A and B with several sizes. m was the same as the minimum number of nonterminal symbols, except in the case of Type A trees with 51 vertices. In these cases, CPLEX did not output a solution for m = 11 within 8 h. However, we were able to generate the production rules for m = 12, although 10 is the minimum number of nonterminal symbols. If we do not need the minimum grammar, we can obtain production rules faster than when finding the minimum grammar. Furthermore, the results show that the elapsed time for an ordered Type A tree was almost the same as that for the corresponding unordered tree, and that the time for an ordered Type B tree was shorter than that for the corresponding unordered tree. Even for the ordered tree with 61 vertices, the time was a few minutes. These results suggest that our proposed method is efficient for ordered trees. They also suggest that the IP-based method for unordered trees should be used when the order of siblings has no meaning and the number of vertices and the maximum degree are not too large, because the minimum SEUTG size is never larger than the minimum SEOTG size. However, if the maximum degree is large and sufficient time is not available, the IP-based method for ordered trees should be used. This is because solving the minimum SEUTG problem for such trees may take too much time, whereas the method for ordered trees is expected in many cases to provide a small grammar whose size is close to or the same as that of the smallest grammar obtained by the method for unordered trees.
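A rough indication of why the unordered formulation is sensitive to the maximum degree comes from counting, for a single vertex, the child sets that define candidate subtrees: contiguous ranges (h, k) for ordered trees and nonempty subsets C for unordered trees. The sketch below only performs this count; it ignores the choice of tag, the function names are assumptions, and it is not the authors' complexity analysis.

```python
def ordered_child_ranges(degree: int) -> int:
    """Number of nonempty contiguous child ranges (h, k) of one vertex."""
    return degree * (degree + 1) // 2


def unordered_child_subsets(degree: int) -> int:
    """Number of nonempty subsets C of the children of one vertex."""
    return 2 ** degree - 1


# Compare the two counts for a few degrees, up to the maximum degree of 20
# used in the experiments on artificial trees.
for d in (2, 3, 10, 20):
    print(d, ordered_child_ranges(d), unordered_child_subsets(d))
```

For a vertex with 20 children this gives 210 ranges against 1,048,575 subsets, so the number of subtree variables attached to a high-degree vertex grows quadratically in the ordered case and exponentially in the unordered case.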

Glycan tree-structure data
It is known that glycans play important roles in cells, such as in cellular adhesion and antigen-antibody reactions. Therefore, it is important to analyze the structures of glycans. Hizukuri et al. extracted characteristic functional motifs of glycans, predicted a leukemia-specific glycan motif, and confirmed by biological experiments that the Agrocybe cylindracea galectin specifically recognized human leukemic cells [11]. Thus, it is also important to find motifs and repeated patterns in glycans. We obtained twelve glycans, G02703, G03655, G03710, G04045, G04458, G04666, G04859, G05058, G05256, G05552, G06867, and G09054, as rooted trees from the KEGG Glycan database [12]. We labeled each edge with a lower-case letter corresponding to the type of sugar at its lower endpoint, because the edges are not labeled in the original data. For each glycan, the maximum degree, the number of vertices, and the number of distinct labels are shown in Table 2. Then, we applied our proposed IP-based method for SEUTG to each glycan as an unordered tree and obtained the production rules. Figures 10, 11, 12, and 13 show patterns extracted from the production rules of G03655, G04458, G04666, and G05058, respectively. The generated production rules show that the tree of G03655 contains two identical subtrees with 4 vertices and three identical subtrees with 5 vertices; the tree of G04458 contains two identical subtrees with 8 vertices, each of which contains three identical subtrees with 3 vertices; the tree of G04666 contains three identical subtrees with 5 vertices and two identical subtrees with 3 vertices; and the tree of G05058 contains three identical subtrees with 6 vertices and two identical subtrees with 5 vertices. We were able to extract similar patterns for the other glycans.
The detailed derivation diagrams of production rules for the four glycans are available on our supplementary web site (http://sunflower.kuicr.kyoto-u.ac.jp/morihiro/treegram/).
We compared the grammar sizes for the minimum SEOTG and SEUTG obtained by our methods with those of an existing method, TREE-BISECTION [10]. TREE-BISECTION repeatedly divides a given tree horizontally and vertically such that the sizes of the two divided subtrees are similar, until each subtree consists of a single edge. It is known that TREE-BISECTION computes in polynomial time a simple EOTG of size O(m n^{5/6}) [10], where m is the size of the minimum simple EOTG and n is the number of vertices of the given tree. Table 2 shows the grammar sizes and the elapsed times obtained by our proposed IP-based methods for the minimum SEOTG and SEUTG problems and by TREE-BISECTION. The minimum SEOTG size was the same as the minimum SEUTG size for each glycan except G05552, because the tree contains only vertices with at most two children, and all subtrees of a vertex having three children are isomorphic. The size of the grammar generated by TREE-BISECTION can be compared with these minimum sizes in Table 2.

Table 1 Results on the elapsed time (seconds) for ordered and unordered trees of Types A and B (in Figure 9) with several sizes. m was the same as the minimum number of nonterminal symbols for each tree, except the case denoted by '*'. '-' denotes that the solver took more than 8 hours.

Table 2 Statistics of glycans G02703, G03655, G03710, G04045, G04458, G04666, G04859, G05058, G05256, G05552, G06867, and G09054, and results on the grammar size and the elapsed time (seconds) by our proposed IP-based methods for the minimum SEOTG and SEUTG problems, and by TREE-BISECTION [10].

Figure 10 Extracted patterns from glycan G03655. The label related to the lower endpoint is attached to each edge. Labels a, b, and c denote GlcNAc, Man, and P, respectively.

Figure 11 Extracted patterns from glycan G04458. The label related to the lower endpoint is attached to each edge. Labels a and b denote Xyl and Glc, respectively.

Figure 13 Extracted patterns from glycan G05058. The label related to the lower endpoint is attached to each edge. Labels a, b, c, d, and e denote Glc, Man6Ac, Man, GlcA, and 3-en-eryHexA, respectively.

Conclusions
We proposed integer programming-based methods for finding the minimum grammars to generate given strings, ordered trees, and unordered trees. By conducting computational experiments, we confirmed that our IP formulations work correctly. The results also show that our IP-based grammar compression is efficient for ordered trees, although some improvements are required for unordered trees.
We applied our proposed method to glycan tree-structure data, and extracted interesting patterns. Although these patterns were obtained from production rules generated for a single tree, we may be able to extract common patterns and rules from multiple glycans by extending our methods to find minimum grammars to generate given forests.
In this paper, we dealt with grammars for trees. However, real structured data often contain some cycles. Therefore, we are in the process of developing IP-based methods for more complex structured data.