Enumerating treelike chemical graphs with given upper and lower bounds on path frequencies
BMC Bioinformatics volume 12, Article number: S3 (2011)
Abstract
Background
Enumeration of chemical graphs satisfying given constraints is one of the fundamental problems in chemoinformatics and bioinformatics since it leads to a variety of useful applications including structure determination of novel chemical compounds and drug design.
Results
In this paper, we consider the problem of enumerating all treelike chemical graphs from a given set of feature vectors, which is specified by a pair of upper and lower feature vectors, where a feature vector represents the frequency of prescribed paths in a chemical compound to be constructed. This problem can be solved by applying the algorithm proposed by Ishida et al. to each single feature vector in the given set, but this method may take much computation time because a given set generally contains many feature vectors. We propose a new exact branch-and-bound algorithm for the problem so that all the feature vectors in a given set are handled directly. Since the bounding operation proposed by Ishida et al. cannot be used under upper and lower constraints, we introduce new bounding operations based on upper and lower feature vectors, a bond constraint, and a detachment condition.
Conclusions
Our proposed algorithm is useful for enumerating treelike chemical graphs with given upper and lower bounds on path frequencies.
Introduction
Development of novel drugs is one of the major goals in chemoinformatics and bioinformatics. To achieve this purpose, it is important not only to investigate common chemical properties over chemical compounds having common structural patterns [1–3] but also to study methods of enumerating chemical structures satisfying given constraints. The enumeration of chemical structures has a long history. Indeed, Cayley [4] considered the enumeration of structural isomers of alkanes in the 19th century. Applications of the enumeration of chemical compounds include structure determination using mass-spectrum and/or NMR-spectrum [5, 6], virtual exploration of the chemical universe [7, 8], reconstruction of molecular structures from their signatures [9, 10], and classification of chemical compounds [11].
In the field of machine learning, the preimage problem [12, 13] has been studied. In this problem, a desired object is computed as a feature vector in a feature space, and then the feature vector is mapped back to the input space, where this mapped-back object is called a preimage. Feature vectors defined by the frequency of labeled paths [14, 15] or small fragments [11, 16] have been widely used. Akutsu and Fukagawa [17] formulated the graph preimage problem as the problem of inferring graphs from the frequency of paths of labeled vertices, and proved that the problem is NP-hard even for planar graphs with bounded degrees. Nagamochi [18] proved that a graph determined by the frequency of paths of length 1 can be found in polynomial time if one exists.
To enumerate treelike chemical graphs, Fujiwara et al. [19] proposed a branch-and-bound algorithm which consists of a branching procedure based on the tree enumeration algorithm due to Nakano and Uno [20, 21] and bounding operations based on the path frequency and the atom-atom bonds. In addition, to reduce the size of search trees, Ishida et al. [22] introduced a new bounding operation, called the detachment-cut, based on the result by Nagamochi [18]. An implementation of the algorithm proposed by Ishida et al. [22] is available at a web server (http://sunflower.kuicr.kyoto-u.ac.jp/tools/enumol/) for enumerating treelike chemical graphs with given path frequency. However, an instance whose constraint is specified by a single feature vector admits no solution in many cases. Therefore, a constraint more relaxed than a single feature vector is needed to obtain solutions in the treelike chemical graph enumeration problem.
In this paper, we are given a set of feature vectors, which is specified by a pair of upper and lower feature vectors, and enumerate all treelike chemical graphs satisfying one of the vectors. It seems that this can be done by simply applying the algorithm proposed by Ishida et al. to each single feature vector in the given set. However, this method will take much computation time because a given set generally contains many feature vectors. We propose a new exact branch-and-bound algorithm for the problem so that all the feature vectors in a given set are handled directly.
Methods
Preliminaries and problem formulation
A graph is called a multigraph if multiple edges (i.e., edges with the same end vertices) are allowed; otherwise it is called simple. A path P is a sequence v_{0}, e_{1}, v_{1}, e_{2}, v_{2}, …, e_{k}, v_{k} of distinct vertices v_{i} (i = 0, …, k) and edges e_{j} that join v_{j – 1} and v_{j} (j = 1, …, k). When no confusion arises, we write P = (v_{0}, v_{1}, …, v_{k}). The length |P| of a path P is defined to be k, i.e., the number of edges. Assume that a set Σ = {ℓ_{1}, ℓ_{2}, …, ℓ_{s}} of labels (i.e., chemical elements) is given. Let each label ℓ be associated with a valence val(ℓ) ∈ ℤ_{+}. A multigraph G is called Σ-labeled if each vertex v has a label ℓ(v) ∈ Σ, and is called (Σ, val)-labeled if, in addition, the degree of each vertex v is val(ℓ(v)), i.e., the valence of the element ℓ(v). We regard chemical compounds as (Σ, val)-labeled, self-loopless, and connected multigraphs, where vertices and labels represent atoms and elements, respectively. For a path P = (v_{0}, v_{1}, …, v_{k}), we call ℓ(P) = ℓ(v_{0}), ℓ(v_{1}), …, ℓ(v_{k}) the label sequence of P. Given a label sequence t, let #t denote the number of paths P with ℓ(P) = t in a graph, where multiple edges with the same end vertices are treated as a single edge and paths are considered to be “directed.” The feature vector f_{K}(G) of level K (∈ ℤ_{+}) of G is defined to be the vector whose entry f_{K}(G)[t] (|t| ≤ K) represents #t. See Fig. 1 for an example.
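The definition of the feature vector can be illustrated with a minimal Python sketch. It assumes that level K means paths with at most K edges, that a graph is encoded as a vertex-to-label dictionary plus an edge list, and that the name `feature_vector` is purely illustrative; none of this encoding comes from the original implementation.

```python
from collections import Counter


def feature_vector(labels, edges, K):
    """Count directed label sequences of paths with at most K edges.

    labels: dict mapping vertex -> label (hypothetical encoding).
    edges:  iterable of (u, v) pairs; parallel edges are collapsed,
            as required by the definition of #t.
    """
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    counts = Counter()

    def extend(path):
        # Record the label sequence of the current directed path.
        counts[tuple(labels[v] for v in path)] += 1
        if len(path) - 1 == K:
            return
        for w in adj[path[-1]]:
            if w not in path:  # vertices on a path must be distinct
                extend(path + [w])

    for v in labels:
        extend([v])
    return counts


# Hydrogen-suppressed C-C-O skeleton, level K = 1.
fv = feature_vector({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)], 1)
print(dict(fv))
```

Because paths are directed, the C–C bond contributes two paths with label sequence (C, C), while the C–O bond contributes one path for (C, O) and one for (O, C).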
Let deg(v; G) denote the degree of a vertex v in a graph G. The treelike chemical graph enumeration problem with a single given feature vector can be formulated as follows [19].
Enumeration of Treelike chemical graphs with given Path Frequency (ETPF)
Given a set Σ of labels, a valence function val : Σ → ℤ_{+} and a feature vector g of level K, find all (Σ, val)-labeled multitrees T such that f_{K}(T) = g and deg(v;T) = val(ℓ(v)) for all vertices v ∈ V(T).
Observe that a large number of chemical compounds contain a high proportion of hydrogens. Based on this fact, another model can be considered in the problem ETPF by removing all hydrogen atoms. These two different models were proposed by Fujiwara et al. [19] and Ishida [23].
In this paper, we consider the problem of enumerating all treelike chemical graphs based on given upper and lower feature vectors because we want to relax the feature vector constraint in the problem ETPF. For feature vectors g_{1} and g_{2} of level K, we define g_{1} ≤ g_{2} to mean g_{1}[t] ≤ g_{2}[t] for any label sequence t (|t| ≤ K). The problem of enumerating treelike compounds from two given feature vectors can be formulated based on the problem ETPF as follows (see Fig. 2 for an illustration).
Enumeration of Treelike chemical graphs with given Upper and Lower bounds on path Frequencies (ETULF)
Given a set Σ of labels, a valence function val : Σ → ℤ_{+} and feature vectors g_{U} and g_{L} of level K (g_{L} ≤ g_{U}), find all (Σ, val)-labeled multitrees T such that g_{L} ≤ f_{K}(T) ≤ g_{U} and deg(v;T) = val(ℓ(v)) for all vertices v ∈ V(T).
For the problem ETULF, we assume that g_{L}(ℓ) = g_{U}(ℓ) for every atom type ℓ ∈ Σ, where g(L) denotes the entry in g that corresponds to a label sequence L (thus g(ℓ) specifies the number of vertices of label ℓ), and that g_{L}(L) ≤ g_{U}(L) for any label sequence L (|L| ≥ 2).
Note that the number n of vertices is given by Σ_{ℓ∈Σ} g(ℓ). To solve the problem ETULF, we start with an empty graph, and repeatedly extend the current tree T by appending a new vertex with each label ℓ ∈ Σ to obtain a valid tree (a tree that does not violate any constraints on output trees) one by one until we get n vertices. In order to avoid duplicate outputs, we follow the branch-and-bound framework of Fujiwara et al. [19], which first defines a canonical representation for isomorphic trees, and then lists them using the algorithm of Nakano and Uno [20, 21] (the branching operation), discarding invalid trees with some bounding operations. Since we cannot directly use the bounding operation proposed by Ishida et al. [22] due to upper and lower constraints, we introduce some new bounding operations.
Canonical representation of trees and the branching operation
In this section, we explain a canonical representation of trees introduced by Fujiwara et al. [19] and the branching operation based on the canonical representation.
First of all, we introduce a root of a tree based on the following theorem.
Theorem 1 (Jordan [24]) For any tree with n′ vertices, either there exists a unique vertex v* such that each subtree obtained by removing v* contains at most ⌊(n′ – 1)/2⌋ vertices, or there exists a unique edge e* such that both of the subtrees obtained by removing e* contain exactly n′/2 vertices.
Such a vertex v* and such an edge e* in Theorem 1 are called the unicentroid and the bicentroid, respectively. Either a unicentroid or a bicentroid is called a centroid. Note that a bicentroid can exist only for an even n′. Since the case of a bicentroid is similar to that of a unicentroid, we only explain the case of a unicentroid.
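The following Python sketch illustrates Theorem 1 by locating the centroid of a small tree. It assumes the standard thresholds ⌊(n′ – 1)/2⌋ for a unicentroid and n′/2 for a bicentroid, an adjacency-dictionary encoding, and sortable vertex names; the function name `centroid` is hypothetical.

```python
def centroid(adj):
    """Return ("vertex", v) or ("edge", (u, v)) for a tree.

    adj: dict mapping each vertex to the list of its neighbours.
    """
    n = len(adj)
    root = next(iter(adj))
    parent = {root: None}
    order = [root]
    for v in order:  # breadth-first order; the list grows while iterating
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                order.append(w)

    # Subtree sizes, accumulated from the leaves upwards.
    size = {v: 1 for v in adj}
    for v in reversed(order):
        if parent[v] is not None:
            size[parent[v]] += size[v]

    def worst(v):
        # Size of the largest subtree left after removing v.
        parts = [size[w] for w in adj[v] if parent[w] == v]
        parts.append(n - size[v])
        return max(parts)

    unicentroids = [v for v in adj if worst(v) <= (n - 1) // 2]
    if unicentroids:
        return ("vertex", unicentroids[0])
    u, v = sorted(x for x in adj if 2 * worst(x) == n)
    return ("edge", (u, v))


print(centroid({0: [1], 1: [0, 2], 2: [1]}))             # path on 3 vertices
print(centroid({0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}))  # path on 4 vertices
```

A path on three vertices has its middle vertex as unicentroid, whereas a path on four vertices has its middle edge as bicentroid.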
Next we introduce a canonical representation of trees that must be unique up to isomorphism. Let T be a tree of n vertices rooted at a vertex v_{0} (which is not necessarily its unicentroid). Suppose that it is embedded in the plane as an ordered tree, where v_{0} is located at the top. Without loss of generality, let v_{0}, v_{1}, …, v_{n – 1} be indexed by the depth-first search (DFS) that starts from v_{0} and visits vertices from left to right. Define the depth d(v) of a vertex v to be the length of the (unique) path from v_{0} to v in T. The depth-label sequence L(T) of T is defined to be
L(T) = (d(v_{0}), ℓ(v_{0}), d(v_{1}), ℓ(v_{1}), …, d(v_{n – 1}), ℓ(v_{n – 1})).
Given an arbitrary order of labels, we define the order of depth-label sequences as follows. For any T_{1} and T_{2}, we write L(T_{1}) > L(T_{2}) if L(T_{1}) is lexicographically larger than L(T_{2}). Then the canonical representation of a rooted tree is defined by the largest depth-label sequence among all its plane embeddings. This is equivalent to the left-heavy plane embedding [20, 21].
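A small Python sketch may help to see why the largest depth-label sequence does not depend on the embedding one starts from. It assumes that labels are ordered by Python's default string order and that the sequence is stored as a list of (depth, label) pairs; the function name `canonical_sequence` is hypothetical. Sorting the child subsequences in descending order at every vertex yields the left-heavy embedding.

```python
def canonical_sequence(children, labels, root):
    """Largest depth-label sequence over all plane embeddings.

    children: dict mapping a vertex to the list of its children.
    labels:   dict mapping a vertex to its label.
    """
    def seq(v, depth):
        # Canonical sequences of the child subtrees, largest first.
        subs = sorted((seq(c, depth + 1) for c in children.get(v, [])),
                      reverse=True)
        out = [(depth, labels[v])]
        for s in subs:
            out.extend(s)
        return out

    return seq(root, 0)


labels = {0: "C", 1: "C", 2: "O", 3: "N"}
# Two different plane embeddings of the same rooted tree.
a = canonical_sequence({0: [1, 2], 1: [3]}, labels, 0)
b = canonical_sequence({0: [2, 1], 1: [3]}, labels, 0)
print(a == b, a)
```

Both embeddings produce the same sequence, so comparing canonical sequences is enough to detect isomorphic rooted trees in this encoding.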
Thus our branching task is to list all centroid-rooted left-heavy trees with n vertices and m (= |Σ|) labels. Following the scheme of [20, 21], we define a parent-child relation between two left-heavy trees. The parent P(T) of a left-heavy tree T is obtained from T by removing its rightmost leaf. Clearly P(T) is still left-heavy. In this way, we can define a family tree ℱ of left-heavy trees whose leaves are exactly what we want to obtain.
Therefore we only need to enumerate the (leaf) nodes of ℱ. This can be done by starting from the empty tree (the root node of ℱ) and repeatedly appending a new leaf to some appropriate place on the rightmost path of the current tree. Our branching operation employs the algorithm of Nakano and Uno [20, 21], which extends the current tree T (i.e., finds a child of T) in constant time [19].
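The family-tree traversal can be sketched in Python as follows. This is a naive illustration, not the constant-time algorithm of Nakano and Uno: it tests left-heaviness by recomputing the canonical sequence of every candidate, it does not enforce the centroid constraint, and all names (`canonical`, `children`, `enumerate_trees`) are hypothetical. A tree is stored as its depth-label sequence, so appending a leaf at depth d attaches it to the rightmost-path vertex at depth d – 1.

```python
def canonical(seq):
    """Largest depth-label sequence of the rooted tree encoded by seq."""
    def build(i, depth):
        # Parse the subtree rooted at seq[i]; return (sequence, next index).
        subs = []
        j = i + 1
        while j < len(seq) and seq[j][0] > depth:
            s, j = build(j, depth + 1)
            subs.append(s)
        subs.sort(reverse=True)
        out = [seq[i]]
        for s in subs:
            out.extend(s)
        return out, j

    return build(0, 0)[0]


def children(seq, labels):
    """Left-heavy trees obtained by appending one leaf on the rightmost path."""
    last_depth = seq[-1][0]
    for d in range(1, last_depth + 2):
        for lab in labels:
            cand = seq + [(d, lab)]
            if canonical(cand) == cand:  # keep only left-heavy children
                yield cand


def enumerate_trees(n, labels):
    """All left-heavy rooted trees with n vertices (n >= 1)."""
    level = [[(0, lab)] for lab in labels]
    for _ in range(n - 1):
        level = [c for t in level for c in children(t, labels)]
    return level


print([len(enumerate_trees(n, ["C"])) for n in range(1, 6)])
```

With a single label the counts agree with the known numbers of rooted trees on 1 to 5 vertices, which is a useful sanity check that every isomorphism class is generated exactly once.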
Bounding operations
In this section, we explain how to check the validity of the current tree T. If we can conclude that T and all its descendants are not valid, then we can discard T. Our bounding operation discards T if at least one of the following criteria is violated:
(C1) The root of T remains the centroid of an output (the centroid constraint);
(C2) deg(v;T) ≤ val(ℓ(v)) for all v ∈ V(T) (the valence constraint);
(C3) f_{K}(T) ≤ g_{U}, and g_{L} ≤ f_{K}(T) if |T| = n (the feature vector constraint);
(C4) T can be extended to a connected and loopless tree with n vertices (the detachment constraint);
(C5) T can have a descendant which has an appropriate number of multiple bonds (the multiplicity constraint).
(C1) and (C2) are the same as in the work by Fujiwara et al. [19] and are not difficult to check. (C3) and (C4) differ from the work by Fujiwara et al. [19] and Ishida et al. [22] due to upper and lower constraints. (C5) is a new bounding operation that we propose in this paper. In the following three subsections, we discuss the three bounding operations resulting from (C3), (C4), and (C5), called feature-vector-cut, detachment-cut, and multiplicity-cut, respectively.
Feature-vector-cut procedure
In the problem ETULF, we cannot use the bounding operation proposed by Fujiwara et al. [19] directly due to upper and lower feature vectors, but we can introduce a bounding operation based on upper and lower feature vectors by modifying Fujiwara et al.’s work slightly.
Let T denote the current tree, f_{K}(T) the feature vector of T, g_{U} the given upper feature vector, and g_{L} the given lower feature vector. By the feature vector constraints in the problem ETULF, we check the following condition:
f_{K}(T) ≤ g_{U}. (1)
If T violates (1), then we discard T, since appending a leaf never decreases a path frequency and hence no descendant of T can satisfy the upper bound.
In addition, if |T| = n, then we check the following condition based on the constraint of upper and lower feature vectors:
g_{L} ≤ f_{K}(T) ≤ g_{U}. (2)
If T violates (2), then we discard T.
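The two tests of the feature-vector-cut can be sketched in Python as below. The sketch assumes that feature vectors are stored sparsely as dictionaries from label sequences to counts, with missing entries read as 0; the names `leq` and `feature_vector_cut` are hypothetical.

```python
def leq(a, b):
    """Entrywise a <= b for sparse vectors (missing entries count as 0)."""
    return all(a.get(t, 0) <= b.get(t, 0) for t in set(a) | set(b))


def feature_vector_cut(f, g_upper, g_lower, size, n):
    """Return True if the current tree must be discarded.

    f:    feature vector of the current tree
    size: number of vertices of the current tree
    n:    number of vertices of an output tree
    """
    if not leq(f, g_upper):  # the upper bound is checked at every node
        return True
    if size == n and not leq(g_lower, f):  # the lower bound only at full size
        return True
    return False


g_upper = {("C",): 2, ("C", "C"): 2}
g_lower = {("C",): 2, ("C", "C"): 2}
print(feature_vector_cut({("C",): 1}, g_upper, g_lower, size=1, n=2))
```

The partial tree with a single carbon is kept because it does not yet exceed the upper vector, even though it is still below the lower one.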
Detachment-cut procedure
This subsection describes the definition of a detachment [18] and a new bounding operation based on it for the problem ETULF. Let G be a multigraph that may have self-loops, which represents the graph obtained from a chemical graph H by contracting the vertices with the same label into a single vertex, where each vertex in G corresponds to a label in H (note that we do not eliminate any edges in H in contracting vertices to obtain G). A process of regaining H from G is described as follows. Given a function r : V(G) → ℤ_{+}, an r-detachment H of G is a multigraph obtained from G by splitting each vertex v ∈ V(G) into a set of r(v) copies of v, denoted by W_{v} = {v^{1}, v^{2}, …, v^{r(v)}}, so that each edge {u, v} ∈ E(G) joins some vertices u^{i} ∈ W_{u} and v^{j} ∈ W_{v}. Hence an r-detachment H of G is not unique in general. A self-loop {u, u} in G may be mapped to a self-loop {u^{i}, u^{i}} or a non-loop edge {u^{i}, u^{j}} in a detachment H of G. Note that, for all vertex pairs {u, v} ⊆ V(G), the number of edges between subsets W_{u} and W_{v} in H is equal to that of edges between vertices u and v in G.
To obtain a chemical graph H as an r-detachment of G, we need to specify the degrees of the vertices (with the same label) in H. For a function r : V(G) → ℤ_{+}, an r-degree specification is a set ρ of vectors ρ(v) = (ρ_{1}^{v}, ρ_{2}^{v}, …, ρ_{r(v)}^{v}) for v ∈ V(G) such that
Σ_{1 ≤ i ≤ r(v)} ρ_{i}^{v} = deg(v; G),
which is necessary for all the edges incident to vertex v in G to be assigned to split vertices v^{i} ∈ W_{v} completely. An r-detachment H of G is called a ρ-detachment if each v ∈ V satisfies
deg(v^{i}; H) = ρ_{i}^{v} for all v^{i} ∈ W_{v},
which is a requirement that each vertex v^{i} in H must have the prescribed degree ρ_{i}^{v}. Figure 3 illustrates a ρ-detachment H for a graph G = (V, E) with V = {a, b, c}, a function r with r(a) = 4, r(b) = 3, r(c) = 1, and a degree specification ρ with ρ(a) = (2, 2, 3, 2), ρ(b) = (2, 3, 1), ρ(c) = (3). The next theorem gives a characterization of a multigraph G that admits a connected and loopless ρ-detachment.
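To make the definition concrete, the following Python sketch checks whether a given multigraph H is a ρ-detachment of G. It assumes that the i-th copy of a vertex v is written as the pair (v, i) with 0-based i, that edges are given as pairs, and that a self-loop adds 2 to a degree; the function name `is_rho_detachment` and the toy graphs are hypothetical and unrelated to Fig. 3.

```python
from collections import Counter


def is_rho_detachment(g_edges, h_edges, rho):
    """Check that H is a rho-detachment of G.

    g_edges: list of (u, v) edges of G (self-loops allowed).
    h_edges: list of ((u, i), (v, j)) edges of H between copies.
    rho:     dict mapping v to the tuple of prescribed degrees of its copies.
    """
    # Projecting each copy back to its original vertex must recover E(G).
    projected = Counter(tuple(sorted((a[0], b[0]))) for a, b in h_edges)
    if projected != Counter(tuple(sorted(e)) for e in g_edges):
        return False

    degree = Counter()
    for a, b in h_edges:
        degree[a] += 1
        degree[b] += 1  # a self-loop (a == b) therefore counts twice

    return all(degree[(v, i)] == d
               for v, spec in rho.items()
               for i, d in enumerate(spec))


# G is the contraction of the chain C-C-O: one loop at C and one C-O edge.
g_edges = [("C", "C"), ("C", "O")]
h_edges = [(("C", 0), ("C", 1)), (("C", 1), ("O", 0))]
print(is_rho_detachment(g_edges, h_edges, {"C": (1, 2), "O": (1,)}))
```

Here the loop at C in G is mapped to a non-loop edge between the two copies of C, which is exactly the situation described for self-loops above.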
Theorem 2 (Nagamochi [18]) Let G = (V, E) be a multigraph, r : V → ℤ_{+}, and ρ be an r-degree specification. Then G has a connected and loopless ρ-detachment H if and only if the following hold:
where r(X) = Σ_{v∈X} r(v), c(G′) denotes the number of connected components of a graph G′, G – X denotes the graph obtained from a graph G by removing the vertices in X together with all edges incident to vertices in X, and d(A, B; G) denotes the number of edges (u, v) ∈ E with u ∈ A and v ∈ B.
Ishida et al. [22] proposed a bounding operation for the problem ETPF based on Theorem 2. However, we cannot use the bounding operation proposed by Ishida et al. for the problem ETULF due to upper and lower constraints. We now describe our new bounding operation based on detachments for the problem ETULF. The new bounding operation, called detachment-cut, tests whether the current multitree T has a multitree that is consistent with the given path frequencies among its descendants in the family tree, based on the difference between the feature vector f_{K}(T) and the input feature vectors g_{U} and g_{L}.
Let ℓ_{1}, ℓ_{2}, …, ℓ_{ s } be input labels and g_{ U }, g_{ L } : Σ^{≤ K + 1} → ℤ_{+} be feature vectors. Let r_{0}, …, r_{ h } be the vertices in the rightmost path to which a new leaf can be appended and denote the number of vertices r_{ j } (0 ≤ j ≤ h) with ℓ(r_{ j }) = ℓ_{ i }. For each label sequence t, #t denotes the number of paths P in T with ℓ(P) = t. From g_{ U }, g_{ L }, and T, we define new feature vectors and of level K = 1 to be
We next introduce a vertex with a new label ℓ_{s+1} of valence h + 1 (for example, label A in Fig. 4), a graph G_{U} = (V_{U}, E_{U}) with a vertex set V_{U} = {v_{1}, …, v_{s}, v_{s+1} | ℓ(v_{i}) = ℓ_{i}, 1 ≤ i ≤ s + 1} and an edge set E_{U}, and a graph G_{L} = (V_{L}, E_{L}) with a vertex set V_{L} = {v_{1}, …, v_{s}, v_{s+1} | ℓ(v_{i}) = ℓ_{i}, 1 ≤ i ≤ s + 1} and an edge set E_{L}. Note that d({v_{i}}, {v_{j}}; G) means the multiplicity of the edge {v_{i}, v_{j}} in a graph G. The function r and degree specification ρ are defined to be
Using G_{ U }, G_{ L }, r, and ρ, we can check if a current multitree T violates (C4). We need to check whether none of the following two conditions is violated.

(a)
.

(b) r(X) + c(G_{U} – X) – d(X, V_{U}; G_{U}) ≤ 1 (∀X ⊆ V_{U}, X ≠ ∅).
In the first condition, we check whether the number of remaining bonds is large enough to satisfy the lower feature vector constraint. In the second condition, we check whether T has a connected and loopless descendant based on G_{U} and Theorem 2.
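Condition (b) can be checked by brute force when the number of labels is small, as in the following Python sketch. It assumes that d(X, V; G) counts each edge with at least one end vertex in X once and that the empty graph has zero components; the names `components` and `connectivity_condition` are hypothetical. The sketch covers only the inequality r(X) + c(G – X) – d(X, V; G) ≤ 1 and is exponential in the number of vertices of G, which equals the number of labels plus one in the construction above.

```python
from itertools import combinations


def components(vertices, edges):
    """Number of connected components, via a small union-find."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in vertices})


def connectivity_condition(vertices, edges, r):
    """Check r(X) + c(G - X) - d(X, V; G) <= 1 for all nonempty X."""
    for k in range(1, len(vertices) + 1):
        for X in map(set, combinations(vertices, k)):
            rest = [v for v in vertices if v not in X]
            kept = [(u, v) for u, v in edges if u not in X and v not in X]
            touching = len(edges) - len(kept)  # edges with an end in X
            if sum(r[v] for v in X) + components(rest, kept) - touching > 1:
                return False
    return True


# Contraction of the chain C-C-O: two carbons, one oxygen.
r = {"C": 2, "O": 1}
print(connectivity_condition(["C", "O"], [("C", "C"), ("C", "O")], r))
print(connectivity_condition(["C", "O"], [("C", "O")], r))
```

The second call fails because three vertices cannot be connected by a single edge, which the choice X = V detects.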
Multiplicity-cut procedure
This subsection describes a new bounding operation based on multiplicity for the problem ETULF. Let g(ℓ) be the number of vertices with label ℓ ∈ Σ obtained from the given feature vectors. We assume that g(ℓ) is fixed for all ℓ ∈ Σ in the problem ETULF, so we can calculate the number of edges in output trees. Let n be the number of vertices in output trees. If we treat a multiple edge as a set of single edges, then, since every vertex v has degree val(ℓ(v)) and each edge contributes to two degrees, the number of edges e_{m} in an output tree is given by
e_{m} = (1/2) Σ_{ℓ∈Σ} val(ℓ) g(ℓ).
On the other hand, if we treat a multiple edge as a simple one, the number of edges e_{s} in an output tree is equal to n – 1 due to the treelike constraint. Now we consider
M = e_{m} – e_{s} = e_{m} – (n – 1),
which means that only M edges are used to construct multiple bonds in an output tree. Note that M ≥ 0. We calculate M from an input of the problem ETULF before the enumeration algorithm starts.
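As a worked example, M can be computed directly from the atom counts. The Python sketch below assumes the hydrogen-included model, in which every vertex has degree exactly val(ℓ), so that the edge count with multiplicity is half the valence sum; the function name `multiplicity_budget` is hypothetical.

```python
def multiplicity_budget(val, g):
    """Return M = e_m - (n - 1) for atom counts g and valences val."""
    n = sum(g.values())
    degree_sum = sum(val[label] * count for label, count in g.items())
    if degree_sum % 2 != 0:
        raise ValueError("valence sum is odd: no such multigraph exists")
    e_m = degree_sum // 2  # edges counted with multiplicity
    return e_m - (n - 1)   # n - 1 edges when multiple edges are merged


val = {"C": 4, "H": 1}
print(multiplicity_budget(val, {"C": 2, "H": 6}))  # ethane
print(multiplicity_budget(val, {"C": 2, "H": 4}))  # ethylene
print(multiplicity_budget(val, {"C": 2, "H": 2}))  # acetylene
```

The three formulas give M = 0, 1, and 2, matching a structure with no multiple bond, one double bond, and one triple bond, respectively.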
Let T = (V, E) be a multitree, and let m_{e} denote the multiplicity of e ∈ E. The multiplicity M(T) of T is defined to be
M(T) = Σ_{e∈E} (m_{e} – 1).
Now we describe the multiplicity-cut based on M(T) and M.
Let T be the current rooted multitree in the branching operation, M(T) be the multiplicity of T, RP(T) = (r_{0}, r_{1}, …, r_{k}) be the rightmost path of T, T_{i} be the new rooted multitree obtained by appending a new leaf p to a vertex r_{i} (0 ≤ i ≤ k), and RP(T_{i}) be the rightmost path of T_{i}. When a new leaf p is appended to r_{i}, the rightmost path is updated to RP(T_{i}) = (r_{0}, r_{1}, …, r_{i}, p). Then the multiplicities of the edges {(r_{j}, r_{j – 1}), j = k, k – 1, …, i + 1} are determined by the valence constraint, and at the same time we update M(T_{i}). We denote the multiplicity of an edge (r_{j}, r_{j – 1}) in T_{i} by Mul(r_{j}, r_{j – 1} | T_{i}). When we update the multiplicity of the edge (r_{j}, r_{j – 1}), M(T_{i}) is updated as follows:
M(T_{i}) := M(T_{i}) + Mul(r_{j}, r_{j – 1} | T_{i}) – 1.
By the definition of M, a valid multitree T_{i} satisfies
M(T_{i}) ≤ M. (3)
If T_{i} violates (3), then we discard T_{i}. See Fig. 5 for an illustration of this.
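The multiplicity-cut can be sketched in Python as below. The sketch assumes the hydrogen-included model, so that a vertex which has left the rightmost path must spend all of its remaining valence on the edge to its parent; it recomputes the forced multiplicities from scratch instead of updating M(T_{i}) incrementally, and the names `forced_multiplicities` and `multiplicity_cut` are hypothetical. Rejecting a forced multiplicity below 1 is an extra valence check added here for safety.

```python
def forced_multiplicities(parent, labels, val, rightmost_path):
    """Multiplicity of the edge above every vertex off the rightmost path.

    parent: dict mapping each vertex to its parent (None for the root).
    """
    children = {}
    for v, p in parent.items():
        if p is not None:
            children.setdefault(p, []).append(v)

    open_vertices = set(rightmost_path)
    mult = {}

    def solve(v):
        # Valence left after serving the (already closed) child edges.
        if v not in mult:
            used = sum(solve(c) for c in children.get(v, []))
            mult[v] = val[labels[v]] - used
        return mult[v]

    for v in parent:
        if v not in open_vertices:
            solve(v)
    return mult


def multiplicity_cut(mult, M):
    """Return True if the tree must be discarded."""
    if any(m < 1 for m in mult.values()):  # infeasible bond (valence check)
        return True
    return sum(m - 1 for m in mult.values()) > M  # violates M(T) <= M


val = {"C": 4, "O": 2, "H": 1}
parent = {0: None, 1: 0, 2: 0}       # root C with children O and H
labels = {0: "C", 1: "O", 2: "H"}
mult = forced_multiplicities(parent, labels, val, rightmost_path=[0, 2])
print(mult, multiplicity_cut(mult, M=0), multiplicity_cut(mult, M=1))
```

In this example the oxygen leaf is closed once the hydrogen is appended, so its bond to the root is forced to be double; the tree is discarded when M = 0 and kept when M = 1.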
Results
This section reports the experimental results of our algorithm. First of all, we mention that the problem ETULF can be solved by applying the algorithm proposed by Ishida et al. [22] to each single feature vector in a given set of feature vectors, i.e., the problem ETULF can be regarded as a set of instances of the problem ETPF. We call this algorithm for the problem ETULF, which is based on the algorithm proposed by Ishida et al., RepEnum (Repeated Enumeration). On the other hand, we call our algorithm SimEnum (Simultaneous Enumeration). It is to be noted that RepEnum is one of the fastest tools to enumerate treelike chemical structures from a given molecular formula (i.e., feature vector with K = 0) [22] and, to our knowledge, there does not exist any other available tool to enumerate chemical structures from a given feature vector based on path frequency (i.e., feature vector with general K).
Now we compare the performance of the two algorithms SimEnum and RepEnum, and we also compare the performance of SimEnum with and without the multiplicity-cut. We have tested the algorithm SimEnum for several widths between upper and lower feature vectors. Tests were carried out on a PC with an AMD Athlon Dual Core Processor 5050e CPU using instances based on some chemical compounds selected from the KEGG LIGAND database [25] (http://www.genome.jp/ligand/). Note that we treat a benzene ring contained in these compounds as a new virtual atom of valence six.
We define w ∈ ℤ_{+} to be the width between upper and lower feature vectors. From a feature vector g, we construct two feature vectors g_{U} and g_{L} as follows. For each entry a > 0 of g, the corresponding entry of the upper feature vector g_{U} is a_{U} = a + w, and the corresponding entry of the lower feature vector g_{L} is a_{L} = max{0, a – w}. Note that if w = 0, then an instance of the problem ETULF is equivalent to an instance of the problem ETPF.
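The construction of the test instances can be sketched in Python as follows. The sketch assumes that single-label entries (the atom counts) are kept fixed, in line with the assumption g_{L}(ℓ) = g_{U}(ℓ) of the problem ETULF, and that only positive entries of g are stored; the names `widen` and `count_vectors` are hypothetical. The second function counts the integer vectors between g_{L} and g_{U}, which is the number of ETPF instances a repeated enumeration would have to consider.

```python
from math import prod


def widen(g, w):
    """Build (g_upper, g_lower) from a feature vector g and a width w."""
    g_upper, g_lower = {}, {}
    for t, a in g.items():
        if a <= 0:
            continue
        if len(t) == 1:
            # Atom counts stay fixed (assumption of the problem ETULF).
            g_upper[t] = g_lower[t] = a
        else:
            g_upper[t] = a + w
            g_lower[t] = max(0, a - w)
    return g_upper, g_lower


def count_vectors(g_upper, g_lower):
    """Number of integer vectors g with g_lower <= g <= g_upper."""
    return prod(g_upper[t] - g_lower[t] + 1 for t in g_upper)


g = {("C",): 2, ("O",): 1, ("C", "C"): 2, ("C", "O"): 1, ("O", "C"): 1}
for w in range(3):
    g_upper, g_lower = widen(g, w)
    print(w, count_vectors(g_upper, g_lower))
```

Even for this tiny vector with three path entries, the number of candidate vectors grows from 1 at w = 0 to 27 at w = 1, which illustrates why handling the whole set at once is attractive.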
Table 1 and Additional file 1 show the results of the comparison. We find that the algorithm RepEnum cannot solve all the problems with K = 2 within the time limit since the number of feature vectors in a given set increases exponentially with K. On the other hand, Table 1 shows that the algorithm SimEnum can solve the problem much faster for a larger K. This shows that the algorithm SimEnum runs significantly faster than the algorithm RepEnum. It is also seen that RepEnum can only examine a very small portion of feature vectors in most cases. Additional file 1 shows that the algorithm SimEnum with the multiplicity-cut runs faster than the algorithm SimEnum without the multiplicity-cut for almost all of the instances. This shows that the multiplicity-cut operation works well to improve enumeration efficiency.
Table 2 shows the results on the performance for varying width w for the problem ETULF. The search space in the problem ETULF is exponentially increasing with w. However, it seems that the number of search nodes and computation time are not exponentially increasing with w. This suggests that the algorithm SimEnum works efficiently for the large search space in the problem ETULF.
Here, we briefly discuss practical values of K and w, though we do not have concrete evidence and these values depend on the target classes of chemical compounds. The results on similar feature vectors [9, 10, 15] suggest that K between 3 and 10 should be used. Though there is no previous result on w, it is seen from Table 2 that w cannot be large because there may exist too many solutions. Therefore, w less than 4 should be used.
Conclusions
We considered the problem of enumerating all treelike chemical graphs from a given set of feature vectors, which is specified by upper and lower feature vectors based on frequencies of paths, and proposed a new exact branch-and-bound algorithm. Our experimental results show that our algorithm outperforms the naive algorithm based on a previous method. In comparison to the algorithm based on Ishida et al. [22], our algorithm can greatly reduce the number of search nodes and the computation time and enumerate all the feasible solutions in many instances.
However, the search space of the problem ETULF is much larger than that of the problem ETPF due to upper and lower constraints, and in fact there are many search nodes when solving the problem ETULF by our algorithm. One direction for future work is to improve the bounding operations or to introduce a new bounding operation. For example, in the feature-vector-cut described in the subsection on the feature-vector-cut procedure, information of the lower feature vector g_{L} is used only if |T| = n. Another direction is to develop a web server that implements our proposed algorithm. Generalization of the proposed techniques to other types of kernel functions and other problems is also left as future work.
References
1. Bytautas L, Klein DJ: Chemical combinatorics for alkane-isomer enumeration and more. Journal of Chemical Information and Computer Sciences 1998, 38: 1063–1078. 10.1021/ci980095c
2. Bytautas L, Klein DJ: Formula periodic table for acyclic hydrocarbon isomer classes: combinatorially averaged graph invariants. Physical Chemistry Chemical Physics 1999, 1: 5565–5572.
3. Bytautas L, Klein DJ: Isomer combinatorics for acyclic conjugated polyenes: enumeration and beyond. Theoretical Chemistry Accounts 1999, 101: 371–387. 10.1007/s002140050455
4. Cayley A: On the analytic forms called trees with applications to the theory of chemical combinations. Reports British Association for the Advancement of Science 1875, 45: 257–305.
5. Buchanan BG, Feigenbaum EA: DENDRAL and Meta-DENDRAL: their applications dimension. Artificial Intelligence 1978, 11: 5–24. 10.1016/0004-3702(78)90010-3
6. Funatsu K, Sasaki S: Recent advances in the automated structure elucidation system, CHEMICS. Utilization of two-dimensional NMR spectral information and development of peripheral functions for examination of candidates. Journal of Chemical Information and Computer Sciences 1996, 36: 190–204. 10.1021/ci950152r
7. Fink T, Reymond JL: Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery. Journal of Chemical Information and Computer Sciences 2007, 47: 342–353. 10.1021/ci600423u
8. Mauser H, Stahl M: Chemical fragment spaces for de novo design. Journal of Chemical Information and Computer Sciences 2007, 47: 318–324. 10.1021/ci6003652
9. Faulon JL, Churchwell CJ, Visco DP Jr: The signature molecular descriptor. 2. Enumerating molecules from their extended valence sequences. Journal of Chemical Information and Computer Sciences 2003, 43: 721–734. 10.1021/ci020346o
10. Hall LH, Dailey ES: Design of molecules from quantitative structure-activity relationship models. 3. Role of higher order path counts: path 3. Journal of Chemical Information and Computer Sciences 1993, 33: 598–603. 10.1021/ci00014a012
11. Deshpande M, Kuramochi M, Wale N, Karypis G: Frequent substructure-based approaches for classifying chemical compounds. IEEE Transactions on Knowledge and Data Engineering 2005, 17: 1036–1050.
12. Bakir GH, Weston J, Schölkopf B: Learning to find preimages. Advances in Neural Information Processing Systems 2003, 16: 449–456.
13. Bakir GH, Zien A, Tsuda K: Learning to find graph preimages. Lecture Notes in Computer Science 2004, 3175: 253–261. 10.1007/978-3-540-28649-3_31
14. Kashima H, Tsuda K, Inokuchi A: Marginalized kernels between labeled graphs. Proceedings of the Twentieth International Conference on Machine Learning, AAAI Press 2003, 321–328.
15. Mahé P, Ueda N, Akutsu T, Perret JL, Vert JP: Graph kernels for molecular structure-activity relationship analysis with support vector machines. Journal of Chemical Information and Modeling 2005, 45: 939–951. 10.1021/ci050039t
16. Byvatov E, Fechner U, Sadowski J, Schneider G: Comparison of support vector machine and artificial neural network systems for drug/nondrug classification. Journal of Chemical Information and Computer Sciences 2003, 43: 1882–1889. 10.1021/ci0341161
17. Akutsu T, Fukagawa D: Inferring a graph from path frequency. Lecture Notes in Computer Science 2005, 3537: 371–392. 10.1007/11496656_32
18. Nagamochi H: A detachment algorithm for inferring a graph from path frequency. Algorithmica 2009, 53: 207–224. 10.1007/s00453-008-9184-0
19. Fujiwara H, Wang J, Zhao L, Nagamochi H, Akutsu T: Enumerating treelike chemical graphs with given path frequency. Journal of Chemical Information and Modeling 2008, 48: 1345–1357. 10.1021/ci700385a
20. Nakano S, Uno T: Generating colored trees. Lecture Notes in Computer Science 2005, 3787: 249–260. 10.1007/11604686_22
21. Nakano S, Uno T: Efficient generation of rooted trees. NII Technical Report NII-2003-005E 2003.
22. Ishida Y, Zhao L, Nagamochi H, Akutsu T: Improved algorithms for enumerating treelike chemical graphs with given path frequency. Genome Informatics 2008, 21: 53–64.
23. Ishida Y: Improved algorithms for enumerating treelike chemical graphs with given path frequency. Master thesis of Graduate School of Informatics in Kyoto University 2008.
24. Kvasnicka V, Pospichal J: Constructive enumeration of acyclic molecules. Collect Czech Chem Commun 1991, 56: 1777–1802. 10.1135/cccc19911777
25. Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 2010, 36: D355–D360.
Acknowledgements
This work was partially supported by Grant-in-Aid #22240009 from MEXT, Japan.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 14, 2011: 22nd International Conference on Genome Informatics: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S14.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
HN gave the basic idea based on discussions with TA and MS. MS developed and implemented the algorithms, and carried out the experiments. MS, HN, and TA authored and approved the manuscript.
Shimizu, M., Nagamochi, H. & Akutsu, T. Enumerating treelike chemical graphs with given upper and lower bounds on path frequencies. BMC Bioinformatics 12, S3 (2011). https://doi.org/10.1186/1471-2105-12-S14-S3
Keywords
 Feature Vector
 Multiple Edge
 Canonical Representation
 Enumeration Algorithm
 Label Sequence