Enumerating tree-like chemical graphs with given upper and lower bounds on path frequencies

Shimizu, Masaaki; Nagamochi, Hiroshi; Akutsu, Tatsuya

doi:10.1186/1471-2105-12-S14-S3

Volume 12 Supplement 14

22nd International Conference on Genome Informatics: Bioinformatics

Proceedings
Open access
Published: 14 December 2011


Masaaki Shimizu¹,
Hiroshi Nagamochi¹ &
Tatsuya Akutsu²

BMC Bioinformatics volume 12, Article number: S3 (2011)

Abstract

Background

Enumeration of chemical graphs satisfying given constraints is one of the fundamental problems in chemoinformatics and bioinformatics since it leads to a variety of useful applications including structure determination of novel chemical compounds and drug design.

Results

In this paper, we consider the problem of enumerating all tree-like chemical graphs from a given set of feature vectors, which is specified by a pair of upper and lower feature vectors, where a feature vector represents the frequency of prescribed paths in a chemical compound to be constructed. This problem can be solved by applying the algorithm proposed by Ishida et al. to each single feature vector in the given set, but this method may take much computation time because in general there are many feature vectors in a given set. We propose a new exact branch-and-bound algorithm for the problem so that all the feature vectors in a given set are handled directly. Since we cannot use the bounding operation proposed by Ishida et al. due to upper and lower constraints, we introduce new bounding operations based on upper and lower feature vectors, a bond constraint, and a detachment condition.

Conclusions

Our proposed algorithm is useful for enumerating tree-like chemical graphs with given upper and lower bounds on path frequencies.

Introduction

Development of novel drugs is one of the major goals in chemoinformatics and bioinformatics. To achieve this purpose, it is important not only to investigate common chemical properties over chemical compounds having common structural patterns [1–3] but also to study methods of enumerating chemical structures satisfying given constraints. The enumeration of chemical structures has a long history. Actually, Cayley [4] considered the enumeration of structural isomers of alkanes in the 19th century. Applications for the enumeration of chemical compounds include structure determination using mass-spectrum and/or NMR-spectrum [5, 6], virtual exploration of chemical universe [7, 8], reconstruction of molecular structures from their signatures [9, 10], and classification of chemical compounds [11].

In the field of machine learning, the pre-image problem [12, 13] has been studied. In this problem, a desired object is computed as a feature vector in a feature space, and then the feature vector is mapped back to the input space, where the mapped-back object is called a pre-image. Feature vectors based on the frequency of labeled paths [14, 15] or small fragments [11, 16] have been widely used. Akutsu and Fukagawa [17] formulated the graph pre-image problem as the problem of inferring graphs from the frequency of paths of labeled vertices, and proved that the problem is NP-hard even for planar graphs with bounded degrees. Nagamochi [18] proved that a graph determined by the frequency of paths of length 1 can be found in polynomial time if one exists.

To enumerate tree-like chemical graphs, Fujiwara et al. [19] proposed a branch-and-bound algorithm which consists of a branching procedure based on the tree enumeration algorithm due to Nakano and Uno [20, 21] and bounding operations based on the path frequency and the atom-atom bonds. In addition, to reduce the size of search trees, Ishida et al. [22] introduced a new bounding operation, called the detachment-cut, based on the result by Nagamochi [18]. Implementations of the algorithm proposed by Ishida et al. [22] are available on a web server (http://sunflower.kuicr.kyoto-u.ac.jp/tools/enumol/) for enumerating tree-like chemical graphs with given path frequency. However, an instance whose constraint is specified by a single feature vector admits no solution in many cases. Therefore, it is necessary to introduce a constraint more relaxed than a single feature vector in order to obtain solutions in the tree-like chemical graph enumeration problem.

In this paper, we are given a set of feature vectors, which is specified by a pair of upper and lower feature vectors, and enumerate all tree-like chemical graphs satisfying one of the vectors. It seems that this can be done by simply applying the algorithm proposed by Ishida et al. to each single feature vector in the given set. However, this method will take much computation time because in general there are many feature vectors in a given set. We propose a new exact branch-and-bound algorithm for the problem so that all the feature vectors in a given set are handled directly.

Methods

Preliminaries and problem formulation

A graph is called a multigraph if multiple edges (i.e., edges with the same end vertices) are allowed; otherwise it is called simple. A path P is a sequence v₀, e₁, v₁, e₂, v₂, …, e_k, v_k of distinct vertices v_i (i = 0, …, k) and edges e_j that join v_{j – 1} and v_j (j = 1, …, k). When no confusion arises, we may write P = (v₀, v₁, …, v_k). The length |P| of path P is defined to be k, i.e., the number of edges. Assume that a set Σ = {ℓ₁, ℓ₂, …, ℓ_s} of labels (i.e., chemical elements) is given. Let each label ℓ be associated with a valence val(ℓ) ∈ ℤ₊. A multigraph G is called Σ-labeled if each vertex v has a label ℓ(v) ∈ Σ, and is called (Σ, val)-labeled if, in addition, the degree of each vertex v is val(ℓ(v)), i.e., the valence of the element ℓ(v). We regard chemical compounds as (Σ, val)-labeled, self-loopless, and connected multigraphs, where vertices and labels represent atoms and elements, respectively. For a path P = (v₀, v₁, …, v_k), we call ℓ(P) = (ℓ(v₀), ℓ(v₁), …, ℓ(v_k)) the label sequence of P. Given a label sequence t, let #t denote the number of paths P with ℓ(P) = t in a graph, where multiple edges with the same end vertices are treated as a single edge and paths are considered to be “directed.” The feature vector f_K(G) of level K (∈ ℤ₊) of G is defined to be the vector whose entry f_K(G)[t] (|t| ≤ K) represents #t. See Fig. 1 for an example.
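As an illustration of the definition above, the following is a minimal Python sketch (not taken from the authors' implementation) that counts directed label sequences of all paths with at most K edges. The function name `feature_vector` and the dictionary-based graph encoding are assumptions made for this example only.

```python
from collections import Counter


def feature_vector(labels, edges, K):
    """Count directed label sequences of all paths with at most K edges.

    labels: dict vertex -> label; edges: iterable of (u, v) pairs.
    Parallel edges collapse to one adjacency, as in the definition of #t.
    """
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    counts = Counter()

    def extend(path):
        # Record the label sequence of the current path, then try to grow it.
        counts[tuple(labels[v] for v in path)] += 1
        if len(path) - 1 == K:
            return
        for w in adj[path[-1]]:
            if w not in path:  # vertices on a path must be distinct
                extend(path + [w])

    for v in labels:
        extend([v])
    return counts
```

For the three-vertex tree C–C–O with K = 2, this sketch counts two paths labeled (C, C), one for each direction, and one path each for (C, C, O) and (O, C, C). Doubling the C–C bond does not change the result.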

Let deg(v; G) denote the degree of a vertex v in a graph G. The tree-like chemical graph enumeration problem with a single given feature vector can be formulated as follows [19].

Enumeration of Tree-like chemical graphs with given Path Frequency (ETPF)

Given a set Σ of labels, a valence function val : Σ → ℤ₊ and a feature vector g of level K, find all (Σ, val)-labeled multitrees T such that f_K(T) = g and deg(v;T) = val(ℓ(v)) for all vertices v ∈ V(T).

Observe that a large number of chemical compounds contain a high proportion of hydrogens. Based on this fact, another model can be considered in the problem ETPF by removing all hydrogen atoms. These two different models were proposed by Fujiwara et al. [19] and Ishida [23].

In this paper, we consider the problem of enumerating all tree-like chemical graphs based on given upper and lower feature vectors because we want to relax the feature vector constraint in the problem ETPF. For feature vectors g₁ and g₂ of level K, we define g₁ ≤ g₂ to mean g₁[t] ≤ g₂[t] for every label sequence t (|t| ≤ K). The problem of enumerating tree-like compounds from two given feature vectors can be formulated based on the problem ETPF as follows (see Fig. 2 for an illustration).

Enumeration of Tree-like chemical graphs with given Upper and Lower bounds on path Frequencies (ETULF)

Given a set Σ of labels, a valence function val : Σ → ℤ₊ and feature vectors g_U and g_L of level K (g_L ≤ g_U), find all (Σ, val)-labeled multitrees T such that g_L ≤ f_K(T) ≤ g_U and deg(v;T) = val(ℓ(v)) for all vertices v ∈ V(T).

For the problem ETULF, we assume that g_L(ℓ) = g_U(ℓ) for each atom type ℓ ∈ Σ, where g(L) denotes the entry in g that corresponds to a label sequence L (thus g(ℓ) specifies the number of vertices of label ℓ), and that g_L(L) ≤ g_U(L) for any label sequence L (|L| ≥ 2).

Note that the number n of vertices is given by Σ_{ℓ∈Σ} g(ℓ). To solve the problem ETULF, we start with an empty graph, and repeatedly extend the current tree T by appending a new vertex with each label ℓ ∈ Σ to obtain a valid tree (a tree that does not violate any constraints on output trees) one by one until we get n vertices. In order to avoid duplicate outputs, we follow the branch-and-bound framework of Fujiwara et al. [19], which first defines a canonical representation for isomorphic trees, and then lists them using the algorithm of Nakano and Uno [20, 21] (the branching operation) discarding invalid trees with some bounding operations. Since we cannot directly use the bounding operation proposed by Ishida et al. [22] due to upper and lower constraints, we introduce some new bounding operations.

Canonical representation of trees and the branching operation

In this section, we explain a canonical representation of trees introduced by Fujiwara et al. [19] and the branching operation based on the canonical representation.

First of all, we introduce a root of a tree based on the following theorem.

Theorem 1 (Jordan [24]) For any tree with n′ vertices, either there exists a unique vertex v* such that each subtree obtained by removing v* contains at most ⌊(n′ – 1)/2⌋ vertices, or there exists a unique edge e* such that both of the subtrees obtained by removing e* contain exactly n′/2 vertices.

The vertex v* and the edge e* in Theorem 1 are called the unicentroid and the bicentroid, respectively, and either one is called the centroid. Note that a bicentroid exists only for an even n′. Since the case of a bicentroid is similar to that of a unicentroid, we explain only the unicentroid case below.
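The centroid of Theorem 1 can be located in linear time from subtree sizes. The following Python sketch is one standard way to do this and is not taken from the paper; the function name `centroid` and the encoding (vertices 0, …, n – 1, a list of simple edges) are assumptions for illustration.

```python
def centroid(n, edges):
    """Return ('uni', v) or ('bi', (u, v)) for a tree on vertices 0..n-1.

    edges must list each tree edge once (bond multiplicities are ignored).
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    # Iterative DFS from vertex 0 to get a parent array and a visiting order.
    parent = [-1] * n
    order = []
    stack = [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for w in adj[u]:
            if w != parent[u]:
                parent[w] = u
                stack.append(w)

    # Subtree sizes, accumulated from the leaves upward.
    size = [1] * n
    for u in reversed(order):
        if parent[u] != -1:
            size[parent[u]] += size[u]

    # heaviest[v] = size of the largest component left after deleting v.
    heaviest = [n - size[v] for v in range(n)]
    for v in range(n):
        if parent[v] != -1:
            heaviest[parent[v]] = max(heaviest[parent[v]], size[v])

    # One or two vertices leave no component larger than n / 2.
    best = [v for v in range(n) if 2 * heaviest[v] <= n]
    if len(best) == 1:
        return ("uni", best[0])
    return ("bi", tuple(sorted(best)))
```

A path on three vertices has the middle vertex as its unicentroid, whereas a path on four vertices has the middle edge as its bicentroid.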

Next we introduce a canonical representation of trees that must be unique up to isomorphism. Let T be a tree of n vertices rooted at a vertex v₀ (which is not necessarily its unicentroid). Suppose that it is embedded in the plane as an ordered tree, where v₀ is located at the top part. Without loss of generality, let v₀, v₁, …, v_{n – 1} be indexed by the depth-first search (DFS) that starts from v₀ and visits vertices from the left to the right. Define the depth d(v) of a vertex v to be the length of the (unique) path from v₀ to v in T. The depth-label sequence of T (L(T)) is defined to be

L(T) = (d(v₀), ℓ(v₀), d(v₁), ℓ(v₁), …, d(v_{n – 1}), ℓ(v_{n – 1})).

Given an arbitrary order of labels, we define the order of depth-label sequences as follows. For any T₁ and T₂, we denote L(T₁) >L(T₂) if L(T₁) is lexicographically larger than L(T₂). Then the canonical representation of a rooted tree is defined by the largest depth-label sequence among all its plane embeddings. Actually this is equivalent to the left-heavy plane embedding [20, 21].
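The canonical representation can be computed recursively by sorting the sequences of the child subtrees in decreasing order. The Python sketch below illustrates this idea under our own encoding assumptions: labels are given as comparable ranks, and the depth-label sequence is stored as a tuple of (depth, label) pairs, which compares lexicographically in the same way as the flat sequence. The name `canonical_sequence` is hypothetical.

```python
def canonical_sequence(children, labels, root, depth=0):
    """Largest depth-label sequence over all orderings of the children.

    children: dict vertex -> list of child vertices (rooted tree);
    labels: dict vertex -> comparable label (e.g. an integer rank).
    Returns a tuple of (depth, label) pairs in DFS order.
    """
    subsequences = [
        canonical_sequence(children, labels, c, depth + 1)
        for c in children.get(root, [])
    ]
    # Placing lexicographically larger subtrees first yields the maximum,
    # i.e. the left-heavy embedding.
    subsequences.sort(reverse=True)
    result = ((depth, labels[root]),)
    for s in subsequences:
        result += s
    return result
```

Two different plane embeddings of the same rooted tree produce the same tuple, so the tuple can serve as a key for detecting isomorphic rooted trees.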

Thus our branching task is to list all centroid-rooted left-heavy trees with n vertices and m (= |Σ|) labels. Following the scheme of [20, 21], we define a parent-child relation between two left-heavy trees. The parent P(T) of a left-heavy tree T is obtained from T by removing its rightmost leaf. Clearly P(T) is still left-heavy. In this way, we can define a family tree of left-heavy trees whose leaves are exactly what we want to obtain.

Therefore we only need to enumerate the (leaf) nodes of the family tree. This can be done by starting from the empty tree (the root node of the family tree) and repeatedly appending a new leaf to some appropriate place on the rightmost path of the current tree. Our branching operation employs the algorithm of Nakano and Uno [20, 21], which extends the current tree T (i.e., finds a child of T) in constant time [19].
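The traversal of the family tree can be sketched as follows. This Python example is a simplified illustration rather than the algorithm of Nakano and Uno: it starts from a single root vertex instead of the empty tree, ignores the centroid constraint, and tests left-heaviness by recomputing the canonical form of every candidate, so it does not achieve constant time per tree. The names `canonical` and `left_heavy_trees` and the integer label encoding are assumptions.

```python
def canonical(seq):
    """Largest depth-label sequence among all child orderings of seq."""
    def build(i):
        # Parse the subtree rooted at index i; return (canonical tuple, next index).
        d = seq[i][0]
        parts = []
        j = i + 1
        while j < len(seq) and seq[j][0] > d:
            sub, j = build(j)
            parts.append(sub)
        parts.sort(reverse=True)
        out = (seq[i],)
        for p in parts:
            out += p
        return out, j
    return build(0)[0]


def left_heavy_trees(n, num_labels):
    """Yield all left-heavy rooted trees with n vertices as (depth, label) tuples."""
    def grow(seq):
        if len(seq) == n:
            yield seq
            return
        last_depth = seq[-1][0]
        # A new rightmost leaf may hang off any vertex of the rightmost path,
        # so its depth ranges over 1 .. last_depth + 1.
        for depth in range(1, last_depth + 2):
            for label in range(num_labels):
                child = seq + ((depth, label),)
                if canonical(child) == child:  # keep only left-heavy children
                    yield from grow(child)

    for label in range(num_labels):
        yield from grow(((0, label),))
```

Because the parent of a left-heavy tree is itself left-heavy, discarding a non-left-heavy candidate never loses a valid descendant. With a single label, the sketch produces the 4 rooted trees on four vertices and the 9 rooted trees on five vertices.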

Bounding operations

In this section, we explain how to check the validity of the current tree T. If we can conclude that T and all its descendants are not valid, then we can discard T. Our bounding operation discards T if at least one of the following criteria is violated:

(C1) The root of T remains the centroid of an output (the centroid constraint);

(C2) deg(v;T) ≤ val(ℓ(v)) for all v ∈ V(T) (the valence constraint);

(C3) f_K(T) ≤ g_U, and, if |T| = n, also g_L ≤ f_K(T) (the feature vector constraint);

(C4) T can be extended to a connected and loopless tree with n vertices (the detachment constraint);

(C5) T can have a descendant which has an appropriate number of multiple bonds (the multiplicity constraint).

(C1) and (C2) are the same as in the work by Fujiwara et al. [19] and are not difficult to check. (C3) and (C4) differ from the work by Fujiwara et al. [19] and Ishida et al. [22] due to upper and lower constraints. (C5) is a new bounding operation that we propose in this paper. In the following three subsections, we discuss the three bounding operations resulting from (C3), (C4), and (C5), called the feature-vector-cut, the detachment-cut, and the multiplicity-cut, respectively.

Feature-vector-cut procedure

In the problem ETULF, we cannot use the bounding operation proposed by Fujiwara et al. [19] directly due to upper and lower feature vectors, but we can introduce a bounding operation based on upper and lower feature vectors by modifying Fujiwara et al.’s work slightly.

Let T denote the current tree, f_K(T) denote the feature vector of T, g_U denote the given upper feature vector, and g_L denote the given lower feature vector. By the feature vector constraints in the problem ETULF, we check the following condition.

f_K(T) ≤ g_U. (1)

If T violates (1), then we discard T.

In addition, if |T| = n, then we check the following condition based on the constraint of upper and lower feature vectors.

g_L ≤ f_K(T). (2)

If T violates (2), then we discard T.
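The two checks of the feature-vector-cut can be sketched in Python as follows. This is an illustrative reading of criterion (C3), not the authors' code; the function name `feature_vector_cut` and the dictionary encoding of feature vectors (label-sequence tuples mapped to counts, missing entries treated as 0) are assumptions.

```python
def feature_vector_cut(f, g_upper, g_lower, size, n):
    """Return True if the current tree should be discarded.

    f, g_upper, g_lower: dicts mapping label sequences (tuples) to counts;
    size: number of vertices of the current tree; n: target vertex count.
    """
    # Condition (1): path counts only grow as leaves are appended, so an
    # entry already above the upper bound can never be repaired.
    if any(count > g_upper.get(t, 0) for t, count in f.items()):
        return True
    # Condition (2): the lower bound is tested only for complete trees.
    if size == n and any(f.get(t, 0) < low for t, low in g_lower.items()):
        return True
    return False
```

The asymmetry is deliberate: a partial tree that is still below g_L may reach it later, so the lower bound cannot prune before |T| = n in this procedure.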

Detachment-cut procedure

This subsection describes the definition of detachment [18] and a new bounding operation based on it for the problem ETULF. Let G be a multigraph that may have self-loops, which represents the graph obtained from a chemical graph H by contracting the vertices with the same label into a single vertex, where each vertex in G corresponds to a label in H (note that we do not eliminate any edges in H in contracting vertices to obtain G). A process of regaining H from G is described as follows. Given a function r : V(G) → ℤ₊, an r-detachment H of G is a multigraph obtained from G by splitting each vertex v ∈ V(G) into a set of r(v) copies of v, denoted by W_v = {v¹, v², …, v^{r(v)}}, so that each edge {u, v} ∈ E(G) joins some vertices uⁱ ∈ W_u and v^j ∈ W_v. Hence an r-detachment H of G is not unique in general. A self-loop {u, u} in G may be mapped to a self-loop {uⁱ, uⁱ} or a non-loop edge {uⁱ, u^j} in a detachment H of G. Note that, for all vertex pairs {u, v} ⊆ V(G), the number of edges between subsets W_u and W_v in H is equal to that of edges between vertices u and v in G.

To obtain a chemical graph H as an r-detachment H of G, we need to specify the degree of vertices (with the same label) in H. For a function r : V(G) → ℤ₊, an r-degree specification is a set ρ of vectors ρ(v) = (ρ₁^v, ρ₂^v, …, ρ_{r(v)}^v) for v ∈ V(G) such that

Σ_{1 ≤ i ≤ r(v)} ρ_i^v = deg(v; G),

which is necessary for all the edges incident to vertex v in G to be assigned to split vertices vⁱ ∈ W_v completely. An r-detachment H of G is called a ρ-detachment if each v ∈ V satisfies

deg(vⁱ; H) = ρ_i^v for all vⁱ ∈ W_v,

which is a requirement that each vertex vⁱ in H must have the prescribed degree ρ_i^v. Figure 3 illustrates a ρ-detachment H for a graph G = (V, E) with V = {a, b, c}, a function r with r(a) = 4, r(b) = 3, r(c) = 1, and a degree specification ρ with ρ(a) = (2, 2, 3, 2), ρ(b) = (2, 3, 1), ρ(c) = (3). The next theorem gives a characterization of a multigraph G that admits a connected and loopless ρ-detachment.

Theorem 2 (Nagamochi [18]) Let G = (V, E) be a multigraph, r : V → ℤ₊, and ρ be an r-degree specification. Then G has a connected and loopless ρ-detachment H if and only if the following hold:

where r(X) = Σ_{v∈X} r(v), c(G′) denotes the number of connected components of a graph G′, G – X denotes the graph obtained from a graph G by removing the vertices in X together with all edges incident to vertices in X, and d(A, B; G) denotes the number of edges (u, v) ∈ E with u ∈ A and v ∈ B.

Ishida et al. [22] proposed a bounding operation for the problem ETPF based on Theorem 2. However, we cannot use the bounding operation proposed by Ishida et al. for the problem ETULF due to upper and lower constraints. We now describe our new bounding operation based on detachments for the problem ETULF. The new bounding operation, called the detachment-cut, tests whether the current multitree T has, among its descendants in the family tree, a multitree that is consistent with the given path frequencies, based on the difference between the feature vector f_K(T) and the input feature vectors g_U and g_L.

Let ℓ₁, ℓ₂, …, ℓ_s be input labels and g_U, g_L : Σ^{≤ K + 1} → ℤ₊ be feature vectors. Let r₀, …, r_h be the vertices in the rightmost path to which a new leaf can be appended and denote the number of vertices r_j (0 ≤ j ≤ h) with ℓ(r_j) = ℓ_i. For each label sequence t, #t denotes the number of paths P in T with ℓ(P) = t. From g_U, g_L, and T, we define new feature vectors and of level K = 1 to be

We next introduce a vertex with a new label ℓ_s+1 of valence h + 1 (for example, label A in Fig. 4), a graph G_U = (V_U, E_U) with a vertex set V_U = {v₁, …, v_s, v_s₊₁ | ℓ(v_i) = ℓ_i, 1 ≤ i ≤ s + 1} and edge set , and a graph G_L = (V_L,E_L) with a vertex set V_L = {v₁, …, v_s, v_s₊₁ | ℓ(v_i) = ℓ_i, 1 ≤ i ≤ s + 1} and edge set . Note that d({v_i}, {v_j}; G) means a multiplicity of the edge {v_i,v_j} in a graph G. The function r and degree specification ρ are defined to be

Using G_U, G_L, r, and ρ, we can check whether the current multitree T violates (C4). We check that neither of the following two conditions is violated.

(a)
.
(b) r(X) + c(G_U – X) – d(X, V_U; G_U) ≤ 1 (∀X ⊆ V_U, X ≠ ∅).

In the first condition, we check whether the number of remaining bonds is large enough to satisfy the lower feature vector constraint. In the second condition, we check whether T has a connected and loopless descendant based on G_U and Theorem 2.
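Condition (b) can be checked by brute force over all nonempty vertex subsets when the number of labels is small. The Python sketch below is an illustration only and is not the authors' procedure. It assumes that d(X, V; G) counts each edge having at least one end vertex in X exactly once (our reading of the definition), and the names `components` and `satisfies_condition_b` are hypothetical.

```python
from itertools import combinations


def components(vertices, edges):
    """Number of connected components of the graph (vertices, edges)."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in vertices})


def satisfies_condition_b(vertices, edges, r):
    """Check r(X) + c(G - X) - d(X, V; G) <= 1 for every nonempty X."""
    vertices = list(vertices)
    edges = list(edges)
    for k in range(1, len(vertices) + 1):
        for X in combinations(vertices, k):
            X = set(X)
            rest = [v for v in vertices if v not in X]
            kept = [(u, v) for u, v in edges if u not in X and v not in X]
            # d(X, V; G): edges with at least one end vertex in X (assumption).
            touching = len(edges) - len(kept)
            if sum(r[v] for v in X) + components(rest, kept) - touching > 1:
                return False
    return True
```

For example, contracting methane by label gives two vertices C and H with r(C) = 1, r(H) = 4 and four C–H edges, and the condition holds. With only three C–H edges it fails for X = {H}, reflecting that four hydrogen atoms cannot all be attached.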

Multiplicity-cut procedure

This subsection describes a new bounding operation based on multiplicity for the problem ETULF. Let g(ℓ) be the number of vertices with label ℓ ∈ Σ, obtained from the given feature vector. We assume that g(ℓ) is fixed for all ℓ ∈ Σ in the problem ETULF. Then we can calculate the number of edges in output trees in the problem ETULF. Let n be the number of vertices in output trees. If we treat a multiple edge as a set of single edges, the number of edges e_m in an output tree is given by:

e_m = (1/2) Σ_{ℓ∈Σ} val(ℓ) g(ℓ).

On the other hand, if we treat a multiple edge as a simple one, the number of edges e_s in an output tree is equal to n – 1 due to the tree-like constraint. Now we consider

M = e_m – e_s,

which means that only M edges are used to construct multiple bonds in an output tree. Note that M ≥ 0. We calculate M from an input of the problem ETULF before the enumeration algorithm starts.
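The value M can be computed directly from the atom counts. The Python sketch below assumes that e_m is half the sum of the valences of all atoms (each bond contributes to the degree of both of its end vertices) and that M = e_m – e_s with e_s = n – 1; the function name `multiplicity_budget` is hypothetical.

```python
def multiplicity_budget(valence, atom_counts):
    """Return M = e_m - e_s for the given atom counts.

    valence: dict label -> valence; atom_counts: dict label -> g(label).
    """
    degree_sum = sum(valence[a] * k for a, k in atom_counts.items())
    if degree_sum % 2:
        raise ValueError("odd degree sum: no graph has these degrees")
    n = sum(atom_counts.values())
    e_m = degree_sum // 2  # every bond counted with its multiplicity
    e_s = n - 1            # a tree on n vertices has n - 1 edges
    return e_m - e_s
```

For ethylene (two C of valence 4 and four H of valence 1), e_m = 6 and e_s = 5, so M = 1, matching its single double bond. For ethane M = 0, and for carbon dioxide M = 2.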

Let T = (V, E) be a multitree, and m_e denote the multiplicity of e ∈ E. The multiplicity M(T) of T is defined to be

M(T) = Σ_{e∈E} (m_e – 1).

Now we describe the multiplicity-cut based on M(T) and M.

Let T be the current rooted multitree in the branching operation, M(T) be the multiplicity of T, RP(T) = (r₀, r₁, …, r_k) be the rightmost path of T, T_i be the new rooted multitree obtained by appending a new leaf p to a vertex r_i (0 ≤ i ≤ k), and RP(T_i) be the rightmost path of T_i. The rightmost path RP(T_i) of T_i is updated by appending p to the end of RP(T) when a new leaf p is appended to r_i, that is, RP(T_i) = (r₀, r₁, …, r_i, p). Then we can determine the multiplicities of the edges {(r_j, r_{j – 1}), j = k, k – 1, …, i + 1} due to the valence constraint, and at the same time we update M(T_i). We denote the multiplicity of an edge (r_j, r_{j – 1}) in T_i by Mul(r_j, r_{j – 1} | T_i). When we update the multiplicity of the edge (r_j, r_{j – 1}), M(T_i) is updated as follows:

M(T_i) := M(T_i) + Mul(r_j, r_{j – 1} | T_i) – 1.

By the definition of M, a valid multitree T_i satisfies

M(T_i) ≤ M. (3)

If T_i violates (3), then we discard T_i . See Fig. 5 for an illustration of this.
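The step in which bond multiplicities become fixed can be sketched as follows. This Python example reflects our interpretation of the text and is not the authors' code: once a vertex r_j leaves the rightmost path it receives no further children, so the valence constraint forces the multiplicity of its edge to r_{j – 1} to be the valence of r_j minus the multiplicities of the edges to its children. Each such edge is assumed to contribute its multiplicity minus one to M(T_i). The function name `close_rightmost_path` and the data layout are hypothetical.

```python
def close_rightmost_path(valence_of, child_orders, path, i):
    """Fix bond orders of edges (r_j, r_{j-1}) for j = k, ..., i + 1.

    valence_of: dict vertex -> valence of its label;
    child_orders: dict vertex -> list of bond orders to its finished
        children (updated in place);
    path: rightmost path [r_0, ..., r_k]; i: index receiving the new leaf.
    Returns the increase of M(T_i), or None if some valence cannot be met.
    """
    increase = 0
    for j in range(len(path) - 1, i, -1):
        # r_j gets no more children, so its parent edge takes the rest.
        order = valence_of[path[j]] - sum(child_orders.get(path[j], []))
        if order < 1:
            return None
        child_orders.setdefault(path[j - 1], []).append(order)
        increase += order - 1
    return increase
```

In use, a candidate T_i would be discarded when the function returns None or when the multiplicity accumulated so far plus the returned increase exceeds M. For a rightmost path C, C, O with valences 4, 4, 2 and no other children, appending a leaf at r₀ forces two double bonds and an increase of 2.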

Results

This section reports the experimental results of our algorithm. First of all, we mention that the problem ETULF can be solved by applying the algorithm proposed by Ishida et al. [22] to each single feature vector in a given set of feature vectors, i.e., the problem ETULF can be regarded as a set of instances of the problem ETPF. We call an algorithm for the problem ETULF based on the algorithm proposed by Ishida et al. RepEnum (Repeated Enumeration), and we call our algorithm SimEnum (Simultaneous Enumeration). It is to be noted that RepEnum is one of the fastest tools to enumerate tree-like chemical structures from a given molecular formula (i.e., feature vector with K = 0) [22] and, to our knowledge, there does not exist any other available tool to enumerate chemical structures from a given feature vector based on path frequency (i.e., feature vector with general K).

Now we compare the performances of two algorithms, SimEnum and RepEnum, and we also compare the performances of two algorithms, SimEnum including multiplicity-cut and SimEnum not including multiplicity-cut. We have tested the algorithm SimEnum for some widths between upper and lower feature vectors. Tests were carried out on a PC with CPU AMD Athlon Dual Core Processor 5050e using instances based on some chemical compounds selected from the KEGG LIGAND database [25] (http://www.genome.jp/ligand/). Note that we treat a benzene ring contained in these compounds as a new virtual atom of valence six.

We define w ∈ ℤ₊ to be the width between upper and lower feature vectors. From a feature vector g, we construct two feature vectors g_U and g_L as follows. For each entry a > 0 of g, the corresponding entry a_U of the upper feature vector g_U is given by a + w, and the corresponding entry a_L of the lower feature vector g_L is given by max{0, a – w}. Note that if w = 0, then an instance of the problem ETULF is equivalent to an instance of the problem ETPF.
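The construction of g_U and g_L from g can be sketched in Python as follows. Two points are assumptions on our part rather than statements of the text: single-label entries (atom counts) are kept fixed, in line with the assumption g_L(ℓ) = g_U(ℓ) made in the problem formulation, and zero entries are left at zero in both vectors. The function name `widen` is hypothetical.

```python
def widen(g, w):
    """Build (g_lower, g_upper) from feature vector g and width w >= 0.

    g: dict mapping label sequences (tuples of labels) to counts.
    """
    g_lower, g_upper = {}, {}
    for t, a in g.items():
        # Assumption: atom counts stay fixed and zero entries are not widened.
        slack = 0 if len(t) == 1 or a == 0 else w
        g_lower[t] = max(0, a - slack)
        g_upper[t] = a + slack
    return g_lower, g_upper
```

With w = 0 both returned vectors coincide with g, which corresponds to the ETPF special case.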

Table 1 and Additional file 1 show the results of the comparison. We find that the algorithm RepEnum cannot solve all the problems with K = 2 within the time limit since the number of feature vectors in a given set is exponentially increasing with K. On the other hand, Table 1 shows that the algorithm SimEnum can solve the problem much faster for a larger K. This shows that the algorithm SimEnum runs significantly faster than the algorithm RepEnum. It is also seen that RepEnum can only examine a very small portion of feature vectors in most cases. Additional file 1 shows that the algorithm SimEnum including multiplicity-cut runs faster than the algorithm SimEnum not including multiplicity-cut for almost all of the instances. This shows that the multiplicity-cut operation works well to improve enumeration efficiency.

Table 1 Comparison of previous method and our method


Table 2 shows the results on the performance for varying width w for the problem ETULF. The search space in the problem ETULF is exponentially increasing with w. However, it seems that the number of search nodes and computation time are not exponentially increasing with w. This suggests that the algorithm SimEnum works efficiently for the large search space in the problem ETULF.
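To make the growth of the search space concrete, the number of feature vectors between g_L and g_U is the product of the interval sizes over all entries, so d entries that are each widened by w on both sides give (2w + 1)^d vectors. The short Python sketch below computes this count; the function name `vectors_in_box` is hypothetical, and it requires Python 3.8 or later for `math.prod`.

```python
from math import prod


def vectors_in_box(g_lower, g_upper):
    """Number of integer vectors g with g_lower <= g <= g_upper."""
    return prod(g_upper[t] - g_lower.get(t, 0) + 1 for t in g_upper)
```

This is the number of ETPF instances that a repeated-enumeration approach would have to examine in the worst case.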

Table 2 Comparison of varying width


Here, we briefly discuss practical values of K and w, though we do not have concrete evidence and these values depend on the target classes of chemical compounds. The results on similar feature vectors [9, 10, 15] suggest that K between 3 and 10 should be used. Though there is no previous result on w, it is seen from Table 2 that w cannot be large because there may exist too many solutions. Therefore, w less than 4 should be used.

Conclusions

We considered the problem of enumerating all tree-like chemical graphs from a given set of feature vectors, which is specified by upper and lower feature vectors based on frequencies of paths, and proposed a new exact branch-and-bound algorithm. Our experimental results show that our algorithm outperforms the naive algorithm based on a previous method. In comparison to the algorithm based on Ishida et al. [22], our algorithm can greatly reduce the number of search nodes and the computation time and enumerate all the feasible solutions in many instances.

However, the search space of the problem ETULF is much larger than that of the problem ETPF due to upper and lower constraints, and in fact there are many search nodes when solving the problem ETULF by our algorithm. One direction for future work is to improve the bounding operations or to introduce a new bounding operation. In fact, in the feature-vector-cut described in the subsection “Feature-vector-cut procedure”, information of the lower feature vector g_L is used only if |T| = n. Another direction is to develop a web server that implements our proposed algorithm. Generalization of the proposed techniques to other types of kernel functions and other problems is also left as future work.

References

1. Bytautas L, Klein DJ: Chemical combinatorics for alkane-isomer enumeration and more. Journal of Chemical Information and Computer Sciences 1998, 38: 1063–1078. 10.1021/ci980095c
2. Bytautas L, Klein DJ: Formula periodic table for acyclic hydrocarbon isomer classes: combinatorially averaged graph invariants. Physical Chemistry Chemical Physics 1999, 1: 5565–5572.
3. Bytautas L, Klein DJ: Isomer combinatorics for acyclic conjugated polyenes: enumeration and beyond. Theoretical Chemistry Accounts 1999, 101: 371–387. 10.1007/s002140050455
4. Cayley A: On the analytic forms called trees with applications to the theory of chemical combinations. Reports British Association for the Advancement of Science 1875, 45: 257–305.
5. Buchanan BG, Feigenbaum EA: DENDRAL and Meta-DENDRAL: their applications dimension. Artificial Intelligence 1978, 11: 5–24. 10.1016/0004-3702(78)90010-3
6. Funatsu K, Sasaki S: Recent advances in the automated structure elucidation system, CHEMICS. Utilization of two-dimensional NMR spectral information and development of peripheral functions for examination of candidates. Journal of Chemical Information and Computer Sciences 1996, 36: 190–204. 10.1021/ci950152r
7. Fink T, Reymond JL: Virtual exploration of the chemical universe up to 11 atoms of C, N, O, F: assembly of 26.4 million structures (110.9 million stereoisomers) and analysis for new ring systems, stereochemistry, physicochemical properties, compound classes, and drug discovery. Journal of Chemical Information and Computer Sciences 2007, 47: 342–353. 10.1021/ci600423u
8. Mauser H, Stahl M: Chemical fragment spaces for de novo design. Journal of Chemical Information and Computer Sciences 2007, 47: 318–324. 10.1021/ci6003652
9. Faulon JL, Churchwell CJ, Visco DP Jr: The signature molecular descriptor. 2. Enumerating molecules from their extended valence sequences. Journal of Chemical Information and Computer Sciences 2003, 43: 721–734. 10.1021/ci020346o
10. Hall LH, Dailey ES: Design of molecules from quantitative structure-activity relationship models. 3. Role of higher order path counts: path 3. Journal of Chemical Information and Computer Sciences 1993, 33: 598–603. 10.1021/ci00014a012
11. Deshpande M, Kuramochi M, Wale N, Karypis G: Frequent substructure-based approaches for classifying chemical compounds. IEEE Transactions on Knowledge and Data Engineering 2005, 17: 1036–1050.
12. Bakir GH, Weston J, Schölkopf B: Learning to find pre-images. Advances in Neural Information Processing Systems 2003, 16: 449–456.
13. Bakir GH, Zien A, Tsuda K: Learning to find graph pre-images. Lecture Notes in Computer Science 2004, 3175: 253–261. 10.1007/978-3-540-28649-3_31
14. Kashima H, Tsuda K, Inokuchi A: Marginalized kernels between labeled graphs. Proceedings of the Twentieth International Conference on Machine Learning, AAAI Press 2003, 321–328.
15. Mahé P, Ueda N, Akutsu T, Perret JL, Vert JP: Graph kernels for molecular structure-activity relationship analysis with support vector machines. Journal of Chemical Information and Modeling 2005, 45: 939–951. 10.1021/ci050039t
16. Byvatov E, Fechner U, Sadowski J, Schneider G: Comparison of support vector machine and artificial neural network systems for drug/nondrug classification. Journal of Chemical Information and Computer Sciences 2003, 43: 1882–1889. 10.1021/ci0341161
17. Akutsu T, Fukagawa D: Inferring a graph from path frequency. Lecture Notes in Computer Science 2005, 3537: 371–392. 10.1007/11496656_32
18. Nagamochi H: A detachment algorithm for inferring a graph from path frequency. Algorithmica 2009, 53: 207–224. 10.1007/s00453-008-9184-0
19. Fujiwara H, Wang J, Zhao L, Nagamochi H, Akutsu T: Enumerating treelike chemical graphs with given path frequency. Journal of Chemical Information and Modeling 2008, 48: 1345–1357. 10.1021/ci700385a
20. Nakano S, Uno T: Generating colored trees. Lecture Notes in Computer Science 2005, 3787: 249–260. 10.1007/11604686_22
21. Nakano S, Uno T: Efficient generation of rooted trees. NII Technical Report NII-2003–005E 2003.
22. Ishida Y, Zhao L, Nagamochi H, Akutsu T: Improved algorithms for enumerating tree-like chemical graphs with given path frequency. Genome Informatics 2008, 21: 53–64.
23. Ishida Y: Improved algorithms for enumerating tree-like chemical graphs with given path frequency. Master thesis of Graduate School of Informatics in Kyoto University 2008.
24. Kvasnicka V, Pospichal J: Constructive enumeration of acyclic molecules. Collect Czech Chem Commun 1991, 56: 1777–1802. 10.1135/cccc19911777
25. Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M: KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 2010, 36: D355-D360.


Acknowledgements

This work was partially supported by Grant-in-Aid #22240009 from Mext, Japan.

This article has been published as part of BMC Bioinformatics Volume 12 Supplement 14, 2011: 22nd International Conference on Genome Informatics: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S14.

Author information

Authors and Affiliations

Graduate School of Informatics, Kyoto University, Yoshida, Kyoto, 606-8501, Japan
Masaaki Shimizu & Hiroshi Nagamochi
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto, 611-0011, Japan
Tatsuya Akutsu

Corresponding author

Correspondence to Hiroshi Nagamochi.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

HN gave the basic idea based on discussions with TA and MS. MS developed and implemented the algorithms, and carried out the experiments. MS, HN, and TA authored and approved the manuscript.

Electronic supplementary material

12859_2011_4962_MOESM1_ESM.pdf

Additional file 1: Comparison of multiplicity-cut. Comparison of SimEnum including multiplicity-cut and SimEnum not including multiplicity-cut for the problem ETULF. Note: (1) “add multiplicity-cut” is the algorithm SimEnum including multiplicity-cut; and (2) “no multiplicity-cut” is the algorithm SimEnum not including multiplicity-cut. (PDF 10 KB)

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Shimizu, M., Nagamochi, H. & Akutsu, T. Enumerating tree-like chemical graphs with given upper and lower bounds on path frequencies. BMC Bioinformatics 12 (Suppl 14), S3 (2011). https://doi.org/10.1186/1471-2105-12-S14-S3
