Towards comprehensive structural motif mining for better fold annotation in the

Jia, Yi; Huan, Jun; Buhr, Vincent; Zhang, Jintao; Carayannopoulos, Leonidas N

doi:10.1186/1471-2105-10-S1-S46

Volume 10 Supplement 1

Selected papers from the Seventh Asia-Pacific Bioinformatics Conference (APBC 2009)

Research
Open access
Published: 30 January 2009

Towards comprehensive structural motif mining for better fold annotation in the "twilight zone" of sequence dissimilarity

Yi Jia¹,
Jun Huan¹,
Vincent Buhr¹,
Jintao Zhang² &
…
Leonidas N Carayannopoulos³

BMC Bioinformatics volume 10, Article number: S46 (2009) Cite this article

3945 Accesses
3 Citations
Metrics details

Abstract

Background

Automatic identification of structure fingerprints from a group of diverse protein structures is challenging, especially for proteins whose divergent amino acid sequences may fall into the "twilight-" or "midnight-" zones where pair-wise sequence identities to known sequences fall below 25% and sequence-based functional annotations often fail.

Results

Here we report a novel graph database mining method and demonstrate its application to protein structure pattern identification and structure classification. The biologic motivation of our study is to recognize common structure patterns in "immunoevasins", proteins mediating virus evasion of host immune defense. Our experimental study, using both viral and non-viral proteins, demonstrates the efficiency and efficacy of the proposed method.

Conclusion

We present a theoretic framework, offer a practical software implementation for incorporating prior domain knowledge, such as substitution matrices as studied here, and devise an efficient algorithm to identify approximate matched frequent subgraphs. By doing so, we significantly expanded the analytical power of sophisticated data mining algorithms in dealing with large volume of complicated and noisy protein structure data. And without loss of generality, choice of appropriate compatibility matrices allows our method to be easily employed in domains where subgraph labels have some uncertainty.

Background

Genomics efforts continue to yield a myriad of new protein sequences. Among the most valuable are those expressed by mammalian pathogens, organisms that successfully grow and disseminate despite a hostile host immunologic environment. A subset of pathogen-encoded proteins, "immunoevasins", facilitate this success by mediating cellular adhesion and entry, and by distorting the interactions of host receptors and cell-surface ligands [1]. Study of immunoevasins gives insight into host-defense mechanisms, insight that can help guide development of therapies and vaccines against refractory organisms [2].

Though immunoevasins frequently possess protein-recognition domain (PRD) folds common to mammalian proteins of immunologic importance, their divergent amino acid sequences may fall into the "twilight-" or "midnight-" zones where pair-wise sequence identities to known sequences fall below 25% and purely sequence-based attempts at annotations often fail [3, 4].

To better annotate these, and any other highly divergent sequences, more generally, some means of explicitly incorporating three-dimensional structural information into the sequence evaluation is required. Inclusion of even rudimentary structural considerations enhances the performance of sequence scoring heuristics such as local alignment tools [5] and hidden Markov models (HMM) [6]. Indeed an HMM constrained with crystallographically determined secondary structure data allowed discovery of a previously unsuspected MHC class I-like immunoevasin in the genomes of orthopoxviruses [7]. A vast literature covers various schemes for structural data incorporation and fold classification. Nevertheless, much progress remains to be made [8].

We are pursuing an approach whereby structural patterns common to a protein fold are collected, assessed for their classification value, and mapped onto statistical models of protein sequences (e.g. HMMs, support vector machines (SVMs), and conditional random fields). As a first step, a comprehensive and objective means is required of identifying and assessing the above common structure patterns, or structure fingerprints.

Automatic identification of structure fingerprints from a group of diverse protein structures is challenging for a number of reasons. First, we have only limited knowledge about the possible location, composition, and geometric shape of these structure patterns. Second, protein structures are large geometric objects that typically contain hundreds of amino acids with thousands of atoms and chemical bonds. Third, due to accumulated mutations in evolution the same structure pattern may appear slightly different in different proteins. If we use terms from computer algorithm design, we say that the problem of automatic structure pattern identification is challenging since (1) the problem has a large combinatory search space (meaning patterns may occur in any part of a protein and in any subset of a group of proteins) and (2) we should use approximate matching rather than exact matching in retrieving such patterns (meaning that we should tolerate certain level of geometric distortion and amino acid mismatch in search for common structure patterns).

In this paper we demonstrate a novel data mining technique that efficiently extracts and scores structure pattern from diverse proteins. Specifically in our method, we encode a protein structure as a geometric graph where a node represents an amino acid residue and an edge represents a physical or a chemical interaction between a pair of residues. We encode structural motifs as subgraphs of a geometric graph and we identify conserved structure fingerprints by searching for frequently occurring approximately subgraphs in a group of graph represented proteins.

Our contributions in designing a new graph data mining method are to develop a solid theoretic framework, to offer a practical software implementation for incorporating prior domain knowledge, such as substitution matrices as studied here, and to devise an efficient algorithm to identify approximate matched frequent subgraphs. By doing so, we expanded the analytical power of data mining algorithms in dealing with large volume of complicated and noisy protein structure data. As evaluated in our driving biological application of recognizing common structure patterns in immunoevasins, our proposed method identifies many structure patterns and affords better structure classification accuracy compared to existing graph mining algorithms.

The rest of the paper is organized in the following way. In the Related Work section, we give an overview of related work on subgraph mining and protein structure pattern identification. In the Methods section, we introduce the technique about how to translate protein structures into graphs, provide our model for approximate subgraph mining, and present the details of our algorithm. In the Results section, we show an empirical study of the proposed algorithm using protein structure data sets. In the Discussion section, we discuss the biological significance of the structural motifs mined by our method. Finally in the Conclusions section, we conclude with a short discussion of our approach.

Related work

There is an extensive body of literature on comparing and classifying proteins using multiple sequence or structure alignment, such as VAST [9] and DALI [10]. Here we focus on the recent algorithmic techniques for discovering structure motifs from protein structures. The methods can be classified into the following five types:

Depth-first search, starting from simple geometric patterns such as triangles, progressively finding larger patterns [11–13].
Geometric hashing, originally developed in computer vision, applied pairwise between protein structures to identify structure motifs [14–16].
String pattern matching methods that encode the local structure and sequence information of a protein as a string, and apply string search algorithms to derive motifs [17–19].
Delaunay Tessellation (DT) [20–22] partitioning the structure into an aggregate of non-overlapping, irregular tetrahedra thus identifying all unique nearest neighbor residue quadruplets for any protein [22].
Graph matching methods comparing protein structures modeled as graphs and discovering structure motifs by finding recurring subgraphs [23–29].

Graph database mining is an active research field in data mining research. The goal of graph database mining is to locate useful and interpretable patterns in a large volume of graph data. Recent exact matching graph mining algorithms can be roughly divided into three categories. The first category uses the level-wise search strategy, which includes AGM [30] and FSG [31]. And the second category takes the depth-first search strategy, which includes gSpan [32] and FFSM [33]. The third category works by mining frequent trees, for which SPIN [34] and GASTON [35] are the representative. There are many other existing graph mining algorithms, and we refer to [36] for a recent survey.

Frequent subgraph mining with approximate matching capability has also been investigated. The current approximate subgraph mining algorithms can be divided into four categories: (1) proximity measures between graphs [37–39], (2) given a proximity measurement, compute representative frequent subgraphs [40], (3) pattern discovery in a single large graph [41], and (4) pattern discovery from a group of graphs. The last category is what we concentrate on. For algorithms in (4), SUBDUE [42] does not claim completeness. Monkey [43] handles only edge missing and edge label mismatch. Partially Labeled Graphs [44] uses a wild card method to handle node label mismatches. The algorithm may be viewed as a special case of our algorithm.

Different from the existing work, to our best knowledge, we are the first group that incorporates a probability matrix in a graph mining method. We also developed a general framework to fully utilize a probability matrix for approximate match, which we can apply to a number of different applications. In addition, we have developed two ways to demonstrate the statistical significance of the patterns mined from a graph database. Statistical significance is an important but often overlooked issue in evaluating the quality of identified pattern in frequent pattern mining. Finally we offered a practical implementation and evaluated its performance using the synthetic sets.

Methods

In this section, we first briefly describe the technique that translates protein structures into graphs. Then we demonstrate our method called APGM(AP proximate G raph M ining) with two steps: introducing the theoretic model, and showing our algorithm in detail.

Almost-Delaunay graph

Since the protein backbone trace defines the overall protein conformation, we choose the C_αatoms as the nodes of protein graphs. Based on this simplified protein model, we compute edges using Almost-Delaunay Tesselation [45]. The Almost-Delaunay edges are a superset of the Delaunay edges. All nearest neighbor residues connected by Delaunay edges are defined using Delaunay Tessellation [46]. This tessellation is defined for a finite set of points by an empty sphere property: A pair of points is joined by an edge iff one can find an empty sphere whose boundary contains those two points. The definition of the Delaunay Tessellation depends on the precise coordinate values given to its points, but these coordinate values are not exact in the case of proteins due to measurement imprecision and atomic motions. In order to address this problem, Almost-Delaunay Edges are defined by relaxing the empty sphere property to say that a pair of points p and q is joined by an Almost-Delaunay edge with parameter ε, or AD(ε), if by perturbing all points by at most ε, p and q can be made to lie on an empty sphere. In Figure 1, we show one segment of the 3D structure and the corresponding AD graph of 1FP5A Immunoglobulin C1-type protein as an example. More detailed information is available in [45] and [47].

Theoretic framework

Definition 1

A labeled graph G is a 5-tuple G = {V, E, Σ_V, Σ_E, λ) where V is the set of vertices of G and E ⊆ V × V is the set of undirected edges of G. Σ_Vand Σ_E are (disjoint) sets of labels. And labeling function λ: V → Σ_V∪ E → Σ_Emaps vertices and edges in G to their labels. A graph database D is a set of graphs.

We also use V[G] to denote the node set of a graph G and E[G] to denote the edge set of G. We also use Σ_V[G]to denote the node labels, Σ_E[G]to denote edge labels, and λ_Gto denote the labeling function for a graph G. Before we introduce approximate matching, we define compatibility matrix, which offers a probability framework for approximate subgraph mining.

Definition 2

A compatibility matrix M = (m_i,j) is an n × n matrix indexed by symbols from a label set Σ (n = |Σ|). An entry m_i,j( 0 ≤ m_i,j≤ 1, Σ_jm_i,j= 1) in M is the probability that the label i is replaced by the label j.

A compatibility matrix M is stable if the diagonal entry is the largest one in the row (i.e. M_i,i> M_i,j, for all j ≠ i). A compatibility matrix being stable means that any label i is more likely to be replaced by itself rather than by any other symbol. For our biological application, we consider substitution matrices as being, in essence, stable matrices since most or all rows fit the criterion. For example, in the BLOSUM62 substitution matrix, there is only one violation of the criterion – the row for methionine(MET). Hence for the rest of the discussion, we will treat substitution matrices as stable compatibility matrices.

Example 1. We show a graph database D with three labeled graphs P, Q, R on the left side of Figure 2. In this database, the node label set is {a, b, c} and the edge label set is {x, y}. On the right part of Figure 2, we show a compatibility matrix M, which is a 2D matrix indexed by the set of node labels in D. The probability that the vertex label a is substituted by b is m_a,b= 0.3. In M, we use probability 0 to simplify the matrix. In reality these probabilities are never 0.

Definition 3

A labeled graph G = {V, E, Σ_V, Σ_E, λ} is approximately subgraph isomorphic to another graph G' = {V', E', ${Σ^{'}}_{V}$ , ${Σ^{'}}_{E}$ , λ'} if there exists an injection f : V → V' such that

∏_u∈VM_{λ(u),λ'(f(u))}≥ τ, and
$\prod_{(u, v) \in E} {M^{'}}_{λ (u, v), λ^{'} (f (u), f (v))} \geq τ^{'}$

The injection f is an approximate subgraph isomorphism between G and G'. M is a compatibility matrix for node label sets Σ_V∪ ${Σ^{'}}_{V}$ . M' is a compatibility matrix for edge label sets Σ_E∪ ${Σ^{'}}_{E}$ . In an edge compatibility matrix, we assume Σ_Eand ${Σ^{'}}_{E}$ both contain a special label called empty edge. In this way, we handle both topology distortion (missing edges) and edge label mismatches in the same unified way through an edge compatibility matrix. τ (0 <τ ≤ 1) is the threshold for node mismatch and τ'(0 <τ' ≤ 1) is the threshold for edge mismatch.

For simplicity in the following discussion, we assume that we only need to handle node label mismatches (i.e. corresponding edge relations and corresponding edge labels should exactly match each other in matching two graphs). In principle, edge label mismatch (including missing edges) can be handled in a similar way as node label mismatch. Hence our assumption does not reduce the complexity of algorithm design, but the assumption significantly simplifies our demonstration and makes our algorithm easy of access.

With the assumption, the new definition of approximate subgraph isomorphism is:

Definition 4

A graph G is approximate subgraph isomorphic to another graph G', denoted by G ⊆_aG' if there exists a 1-1 injection f V[G] to V[G'], such that

∏_u∈VM_{λ(u),λ'(f(u))}≥ τ,
∀ u, v ∈ V, (u, v) ∈ E ⇔ (f(u), f(v)) ∈ E', and
∀ (u, v) ∈ E, λ(u, v) = λ(f(u), f(v))

Given a node injection f from graph G to G', the co-domain of f is an embedding of G in G'. M is a compatibility matrix for node label sets Σ_V∪ ${Σ^{'}}_{V}$ . The approximate subgraph isomorphism score of f, denoted by S_f(G, G'), is the product of normalized probabilities: $S_{f} (G, G^{'}) = \prod \frac{M_{λ (u), λ^{'} (f (u))}}{M_{λ (u), λ (u)}}$ . For the case of exception in mutation matrix, we use MAX(M_{λ(u), *}) as the normalizing factor instead of M_λ(u),λ(u). For a pair of graphs, there may be many different ways of mapping nodes from one graph to another and hence may have different approximate isomorphism scores. The approximate matching score (score for simplicity) between two graphs, denoted by S(G, G'), is the largest approximate subgraph isomorphism score, or

S (G, G^{'}) = \max_{f} {S_{f} (G, G^{'})}

Similarly, we define exact subgraph isomorphism below.

Definition 5

A graph G is subgraph isomorphic to another graph G', denoted by G ⊆ G' if there exists a 1-1 injection f from the node set V of a graph G to V' of a graph G', such that

∀ u ∈ V, λ(u) = λ'(f(u))
∀ u, v ∈ V, (u, v) ∈ E ⇔ (f(u), f(v)) ∈ E', and
∀ (u, v) ∈ E, λ(u, v) = λ(f(u), f(v))

Example 2. In Figure 2, we show a graph database D = {P, Q, R} and a compatibility matrix M. We set isomorphism threshold τ = 0.4 and with this threshold, graph P is approximate subgraph isomorphic to graph Q with the approximate subgraph isomorphic score equaling 0.6. To see this, there are a total of 6 different ways to map nodes of P to those of Q. The only two that satisfy edge label constraints are f₁ = p₁ → q₁ p₂ → q₂ p₃ → q₃ and f₂ = p₁ → q₂ p₂ → q₁ p₃ → q₃. The approximate subgraph isomorphism score of f₁ equals that of f₂.

Definition 6

Given a graph database D, an isomorphism threshold τ, a support threshold σ (0 <σ ≤ 1), the support value of a graph G, denoted by sup_G, is the average score of the graph to graphs in the database:

s u p G = \sum_{G^{'} \in D, G \subseteq_{a} G^{'}} S (G, G^{'}) / | D |

(1)

G is a frequent approximate subgraph if its support value is at least σ. With this definition, we only use those graphs that a subgraph G is approximate subgraph isomorphic to (controlled by the parameter τ) to compute the support value of G. We do this to filter out low quality (but potentially many) graph matchings in counting the support value of a subgraph. For a moderate sized graph database (100 1000), according our experience, the number of frequent subgraphs identified is usually not sensitive to the isomorphism threshold, which makes sense since low quality graph matching has low "weight" in the support computation nevertheless.

Problem statement

Given a graph database D, an isomorphism threshold τ, a compatibility matrix M, and a support threshold σ, the approximate subgraph mining problem is to find all the frequent approximate subgraphs in D. In Figure 3, we show all the frequent approximate subgraphs in the graph database D shown in Figure 2. By comparison with the frequent subgraphs acquired by the exact graph mining, the approximate mining method identifies meaningful patterns that cannot be identified by exact graph mining methods. Since the support value of approximate subgraph mining and that of frequent subgraph mining have different meaning, it is generally hard to do a comparison of approximate subgraph mining and that of frequent subgraph mining. Fortunately with the assumption of stable compatibility matrix, we can see frequent subgraph mining as a special case of approximate subgraph mining.

Example 3. Given a graph database D, a compatibility matrix M in Figure 2, the support threshold σ = 2/3 and isomorphism threshold τ = 0: 4, we show how to calculate the isomorphism score and support value for the approximate frequent patterns in Figure 3.

S(A₁, P) = 1, S(A₁, Q) = 1, S(A₁, R) = 1, Sup(A₁) = 3/3;

S(A₂, P) = 1, S(A₂, Q) = 1, S(A₂, R) = 1, Sup(A₂) = 3/3;

S(A₃, P) = 1, S(A₃, Q) = 0.6, S(A₃, R) = 0.4, Sup(A₃) = 2/3;

S(A₄, P) = 1, S(A₄, Q) = 0.6, S(A₄, R) = 0.4, Sup(A₄) = 2/3.

Algorithm design

Here we demonstrate a new algorithm APGM for approximate subgraph mining. APGM starts with frequent single node subgraphs. At a subsequent step, it adds a node to an existing pattern to create new subgraph patterns and identify their support value. If none of the resulting subgraphs are frequent, APGM backtracks. APGM stops when no more patterns need to be searched. Before we proceed to the algorithmic details, we introduce the following definitions to facilitate the demonstration of the APGM algorithm.

Definition 7

Given a graph T, one of the embeddings e = v₁, v₂,⋯,v_kof T, a node v is a neighbor of e if ∃u ∈ e, (u, v) ∈ E[G].

In other words, a neighbor node of a embedding e is any node that connects to at least one node in e. The neighbor set of an embedding e, denoted by N(e), is the set of e's neighbors.

Definition 8

Given a graph T, one of the embeddings e = v₁, v₂,⋯,v_kof T in a graph G, a node v ∈ N(e), and a node label l, the approximate subgraph, denoted by G|_T,e,v,l, is a graph (V', E', ${Σ^{'}}_{V}$ , ${Σ^{'}}_{E}$ , λ') such that

V' = {v₁, v₂,⋯,v_k} ∪ v
E' = V' × V' ∩ E[G]
${Σ^{'}}_{V}$ = Σ_V
${Σ^{'}}_{E}$ = Σ_E
∀ u ∈ e : λ'(u) = λ_T(u)
λ'(v) = l
∀ u, v ∈ e : λ'((u, v)) = λ_G((u, v))

Example 4. In Figure 4, we show a pattern T and one of its embeddings e = (s₁, s₂) in a graph Q. Node s₃ is a neighbor node of e since it connects to at least one node of e (in fact both). Given a node label l ="a", we obtain an approximate subgraph G' = Q|_T,e,v,lof Q shown in the same figure. The G' has an embedding e' = (s₁, s₂, s₃) in Q and the score of the embedding is $\frac{M (a, a)}{M (a, a)} \frac{M (a, a)}{M (a, a)} \frac{M (a, b)}{M (a, a)} = \frac{M (a, b)}{M (a, a)} = 0.6$ . (Recall the score of an embedding is the multiplication of the probability of observed node label replacement, normalized by the probability of node label self-replacement.)

With the two definitions, we present the pseudo code of APGM below. follows.

Algorithm 1. APGM_MAIN(D, M, τ , σ )

1: Begin

2: C ← {frequent single node}

3: F ← C

4: for each T ∈ C do

5: APGM_SEARCH(T, τ, σ, F)

6: end for

7: return F

8: End

Algorithm 2. APGM_SEARCH(T, τ , σ , F)

1: Begin

2: C ← ∅

3: for each (e, v), e is an embedding of T in G, v ∈ N(e) do

4: CL ← approximateLabelSet(T, G, e, v)

5: for each l ∈ CL do

6: X ← G|_{T, e, v, l}

7: C ← C ∪ {X}

8: $H$ (X) = $H$ (X) ∪ (e, v)

9: end for

10: end for

11: remove infrequent T from C

12: F ← F ∪ C

13: for each T ∈ C do

14: APGM_SEARCH(T, τ, σ, F)

15: end for

16: End

$H$ is a hash function to store candidate subgraphs and their embeddings. The hash key of the function in our implementation is a canonical code of the subgraph X, which is a unique string presentation of a graph. We use the Canonical Adjacency matrix (CAM) and the Canonical Adjacency Matrix code, developed in [48], to compute the canonical code of a graph.

Algorithm 3. approximateLabelSet(T, G, e, v)

1: Begin

2: R ← ∅

3: l₀ ← λ_G(v)

4: for each l ∈ Σ_V[G]do

5: if $S (e, T) \times \frac{M (l_{0}, l)}{M (l_{0}, l_{0})} \geq τ$ then

6: R ← R ∪ l

7: end if

8: end for

9: return R

10: End

Example 5. Applying APGM to the graph database shown in Figure 2with the support threshold σ = 2/3 and the isomorphism threshold τ = 0.4, we identify one frequent single-node pattern a (shown as A₁ in Figure 3). Adding one node to the pattern A₁, there are two candidate single-edge patterns and both of them are frequent. These two are shown as A₂ and A₃ in the same figure. From pattern A₂, we enumerate one additional pattern A₄. We stop here since there is no more candidate patterns to explore.

Results

Experimental setup

We performed all the experiments on a cluster with 256 Intel Xeon 3.2 Ghz EM64T processors with 4 GB memory each. The approximate graph mining algorithm was implemented in the C++ language and compiled by using the g++ compiler in Linux environment with -O3 optimization.

We downloaded all protein structures from Protein Data Bank (PDB). We followed [45] to use the same software as [47] to calculate Almost-Delaunay(AD) for graph representation of protein geometry. We took BLOSUM62 as the compatibility matrix and back-calculated the conditional probability matrix by following the procedure described in [49]. We normalized the matrix according to Definition 4.

Data set

We investigated two immunologically relevant protein domain families: the Immunoglobulin V set and the Immunoglobulin C1 set. Immunoglobulin domains are among those used by immunoevasins [50, 51]. We collected proteins from SCOP release 1.69. For each family we created a culled set of proteins with maximal pairwise sequence identity percentage below some threshold by using PISCES server [52](Immunoglobulin C1 set below 40%, and Immunoglobulin V set below 30%). The characteristics of the complete domain sequence sets are shown in Table 1. And the PDB IDs of individual proteins for the two culled sets are shown in Table 2.

Table 1 Characteristics of domain sequence sets

Full size table

Table 2 Immunoevasins protein lists for research

Full size table

Experimental protocol

We randomly divided proteins from each family into two groups: 10 proteins to serve as sources for feature extraction, and the remainder(positive sample) for training and testing in "leave-one-out" cross validation. A negative sample set of the the same size as the positive sample set was randomly chosen from PDB. The negative sample was used along with the positive sample in testing. The complete flowchart of our experiment procedure is shown in Figure 5. During this experimental research, we mined frequent clique subgraphs [53] in order to enforce biological constraints on the patterns. We compared APGM with the exact graph mining methods MGM [53]. We chose MGM as the counterpart for the comparison because it is an available clique pattern mining algorithm. (Any exact match method with clique constraint should provide the same number of patterns from a graph database.)

Number of patterns identified

We identified frequent approximate subgraph patterns from 10 positive proteins in each family. There are two parameters that may have significant influence on the set of mined patterns. The first is the support threshold(σ) and the second is the isomorphism threshold(τ). For simplicity, in following experiments in this section we use the new support threshold σ' = σ × |D|, |D| is the size of graph database, and the same change applied in support value. In Figure 6, we run APGM with different combinations of τ and σ and collect the total number of identified patterns. Our results show that the total number of patterns is not sensitive to the isomorphism threshold, and rather depends on the support threshold heavily. Such fact eases the worry that the parameter τ may be too strong for deciding the number of patterns.

For the purpose of comparison, the number of patterns mined by two mining methods are shown in Table 3 and 4, and the number of patterns acquired by APGM from Immunoglobulin C1 proteins are also shown in Figure 6. In our experiment, we treat a pattern set with the number more than 10000 as a meaningless one because our sample space is comparatively small and the isomorphism check is computationally expensive. From Table 4, we see that exact match fails to provide useful patterns on the Immunoglobulin V proteins, which is the typical data set with very noisy background. In comparison, APGM does find some pattern set with a reasonable size in such situation. (We only use rough parameter combination grids to do the pattern search. If we increase the precision of τ and σ, more patterns will be found.) In order to evaluate the quality of these patterns, we use the identified frequent subgraphs in classification tests as discussed below.

Table 3 Number of patterns by APGM(τ = 0.35) and MGM on Immunoglobulin C1

Full size table

Table 4 Number of patterns by APGM(τ = 0.75) and MGM on Immunoglobulin V

Full size table

Classification performance

In this experimental section, we used libsvm SVM package [54] for protein structure classification. We treat each mined pattern as a feature and a protein is represented as a feature vector V = (v_i) where i ≤ i ≤ n and n is the total number of identified features. v_iis 1, if the related feature occurs in the protein and otherwise v_iis 0. We used the linear kernel and default parameters for SVM leave-one-out cross validation. The classification results are summarized in Table 5 and 6. For some parameter combinations, there are no accuracies – an event which happens under two circumstances. First, there are no patterns found. Second, the pattern set is too big to be useful. From the tables we see that the classifications with APGM-based feature highly outperform those based on exact match. For Immunoglobulin C1 set, the classification based on feature identified by MGM only can reach 73%, while APGM is between 69%~91%. For Immunoglobulin V set, since the exact match method cannot mine any meaningful patterns, it fails in classification, while by using APGM, we have the accuracy around 78%. This shows that our APGM has more capability to mine useful structure information from very noisy background than general exact match graph mining algorithms.

Table 5 Classification accuracy of APGM (τ = 0.35) and MGM on Immunoglobulin C1 Set

Full size table

Table 6 Classification accuracy of APGM τ = 0.75) and MGM on Immunoglobulin V set

Full size table

Statistical significance of patterns

In order to further demonstrate the quality of the patterns mined by using APGM, we chose the parameter combination with the best accuracy for the Immunoglobulin C1 proteins and the Immunoglobulin V proteins to check the distribution and significance of patterns. Figure 7 shows the number of the patterns that the 11 Immunoglobulin C1 proteins contain and the significance scores. Figure 8 shows those for the 9 Immunoglobulin V proteins. Proteins in Figure 7 and 8 are numbered according to their appearance order in in Table 2. For example protein "10" in Figure 7 is protein 1nfa (chain A). The proteins in Figure 7 and 8 are sorted according to the number of patterns contained in the proteins. The significance score P is defined as follows.

\begin{matrix} P = \log \frac{f^{+} / N^{+}}{f^{-} / N^{-}}, i f f^{-} \neq 0 & f^{+} \neq 0 \end{matrix}

(2)

There are three special cases of P's value. If f^- = 0 and f⁺ ≠ 0, we set P = 10; if f^- = 0 and f⁺ ≠ 0, we set P = -10; and if f^- = 0 and f⁺ = 0, we set P = 0.

Although the patterns do not distribute uniformly among Immunoglobulin C1 proteins, they cover all the positive proteins. The significance score of these patterns shows strong bias toward the Immunoglobulin C1 proteins, and among 202 only 30 noise features(P = -10) exist. For Immunoglobulin V proteins, the features miss two positive proteins, but these features are highly correlated with positive samples with all P equalling 10.

Computational performance

Since the support value of approximate subgraph mining and that of frequent subgraph mining have different meaning, it is generally hard to compare the computational performance of approximate subgraph mining and that of frequent subgraph mining. If τ is less than 1, approximate subgraph mining may obtain more patterns than that of general frequent subgraph mining by taking more running time. Because of this reason, we use the pattern discovery rate ("rate" for simplicity), which is computed as the number of discovered patterns N divided by the running time t. We use rate rather than running time as the criteria to compare computational efficiencies of different algorithms. We evaluated the computational efficiency of APGM with synthetic data sets.

We generated the synthetic data set by the same synthetic graph generator as [56]. The synthetic graph generator takes the following set of parameters: D is the total number of graphs; T is the average size of graph; I is the average size of potentially frequent subgraphs; L is the number of potentially frequent subgraphs; V is the number of vertex labels; E is the number of edge labels. The default parameter values that we use are D = 10000, T = 30, I = 11, L = 200, E = 20, V = 20.

We compare the performance rate between MGM and APGM using different isomorphism threshold values (and hence introduce different level of approximate matching). We use the support threshold (σ) defined in Definition 6 in this experiment. From Figure 9, we see that with the change of isomorphism threshold, performance of APGM differs narrowly. Even if APGM takes approximate matching, its performance is very similar with MGM. Indeed, with some values of support threshold, APGM with low isomorphism threshold (τ = 0.6) even has much higher rates.

Discussion

Finding features (corresponding to packing motifs) that discriminate one protein family from random selected proteins motivated us to further investigate the possibility of examining these motifs as characteristic signatures of a protein family. We investigated the spatial distribution of the residues covered by our mined structure motifs in individual proteins. We found the residues of structure motifs are highly centralized on a limited number of positions for each protein. We picked up the protein 1mju (chain l) in Immunoglobulin C1 set as one example. 202 patterns, which we obtained, maps to 21 amino acids among the total of 219 residues in 1mju. Through literature search, we found residues identified by APGM are related to the known functional sites in the protein. For example, position 200 and 202 are residues in contact with ligand GOL1406 as studied in [55]. Both positions are not discovered by the exact pattern mining method. This result suggests that APGM is more sensitive in recognizing functional related residues, as compared to exact pattern mining methods. However, we admit that comprehensive experimental study, involving multiple protein families, is needed before we could draw the conclusion convincingly.

Conclusion

In this paper we present a novel data mining algorithm, APGM(AP proximate G raph M ining), to perform structure comparison and structure motif identification in diverse proteins. In our method we encode structural motifs as subgraphs of geometric graph of proteins. Instead of using a general graph mining method to extract frequent subgraph motifs, we have developed the approximate graph mining algorithm and taken advantage of known substitution matrices in protein structure motif identification. Compared with general graph mining algorithms, APGM not only offers more qualified patterns that achieve higher classification accuracy, but also shows a reasonable computational performance. By applying this method to other protein families, "structure fingerprints" can be collected and used in domain classification schemes where structural information is desired. Furthermore, without loss of generality, choice of appropriate compatibility matrices allows our method to be employed in any domain where subgraph labels have some uncertainty. For example, networks of personal contacts "mutate" as people die or change employment. Compatibility matrices assigning probabilities of 'label substitution' within families or organizations may allow the essential natures of personal contact subgraphs to be preserved nevertheless.

References

J L, HL P: Antigen presentation and the ubiquitin-proteasome system in host-pathogen interactions. Adv Immunol. 2006, 92: 225-305.
Article Google Scholar
Judson KA, Lubinski JM, Jiang M, Chang Y, Eisenberg RJ, Cohen GH, Friedman HM: Blocking Immune Evasion as a Novel Approach for Prevention and Treatment of Herpes Simplex Virus Infection. J Virol. 2003, 77: 12639-12645.
Article PubMed Central CAS PubMed Google Scholar
RF D: Of URFs and ORFs: A Primer on How to Analyze Derived Amino Acid Sequences. 1986, Mill Valley: University Science Books, 92:
Google Scholar
B R: Twilight zone of protein sequence alignments. Protein Eng. 1999, 12 (2): 85-94.
Article Google Scholar
JU B, R L, D E: A method to identify protein sequences that fold into a known three-dimensional structure. Science. 253 (5016): 164-170. 1991 Jul 12
Hargbo J, Elofsson A: Hidden Markov models that use predicted secondary structures for fold recognition. Proteins. 1999, 36 (1): 68-76.
Article CAS PubMed Google Scholar
Campbell JA, Trossman DS, WM WM, Carayannopoulos LN: Zoonotic orthopoxviruses encode a high-affinity antagonist of NKG2D. J Exp Med . 204 (6): 1311-7.
Kryshtafovych A, Venclovas C, Fidelis K, Moult J: Progress over the first decade of CASP experiments. Proteins. 2005, 61 (Suppl 7): 225-236.
Article CAS PubMed Google Scholar
Gibrat J, Madej T, Bryant S: Surprising similarities in structure comparison. Curr Opin Struct Biol. 1996, 6 (3): 377-385.
Article CAS PubMed Google Scholar
Holm L, Sander C: Mapping the protein universe. Science. 1996, 273: 595-602.
Article CAS PubMed Google Scholar
Bradley P, Kim PS, Berger B: TRILOGY: Discovery of sequence-structure patterns across diverse proteins. Proc Natl Acad Sci USA . 2002, 99 (13): 8500-8505.
Article PubMed Central CAS PubMed Google Scholar
Russell RB: Detection of protein three-dimensional side-chain patterns: new examples of convergent evolution. Journal of Molecular Biology. 1998, 279: 1211-1227.
Article CAS PubMed Google Scholar
Stark A, Russell R: Annotation in three dimensions. PINTS: Patterns in Non-homologous Tertiary Structures. Nucleic Acids Res. 2003, 31 (13): 3341-4.
Article PubMed Central CAS PubMed Google Scholar
Barker J, Thornton J: An algorithm for constraint-based structural template matching: application to 3D templates with statistical analysis. Bioinformatics. 2003, 19 (13): 1644-9.
Article CAS PubMed Google Scholar
Nussinov R, Wolfson HJ: efficient detection of three-dimensional structural motifs in biological macromolecules by computer vision techniques. PNAS. 1991, 88: 10495-99.
Article PubMed Central CAS PubMed Google Scholar
Wallace A, Borkakoti N, Thornton J: TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci. 1997, 6 (11): 2308-23.
Article PubMed Central CAS PubMed Google Scholar
Jonassen I, Eidhammer I, Conklin D, Taylor WR: Structure motif discovery and mining the PDB. Bioinformatics. 2002, 18: 362-367.
Article CAS PubMed Google Scholar
Jonassen I, Eidhammer I, Taylor WR: Discovery of local packing motifs in protein structures. Proteins. 1999, 34: 206-219.
Article CAS PubMed Google Scholar
Taylor WR, Jonassen I: A Method for Evaluating Structural Models using Structural Patterns. Proteins. 2004
Google Scholar
Cammer S, Carter C, Tropsha A: Identification of sequence-specific tertiary packing motifs in protein structures using Delaunay tessellation. Lecture notes in Computational Science and Engineering. 2002, 24: 477-494.
Article Google Scholar
Krishnamoorthy B, Tropsha A: Development of a four-body statistical pseudo-potential to discriminate native from non-native protein conformations. Bioinformatics. 2003, 19 (12): 1540-48.
Article CAS PubMed Google Scholar
Tropsha A, Carter C, Cammer S, Vaisman I: Simplicial neighborhood analysis of protein packing (SNAPP) : a computational geometry approach to studying proteins. Methods Enzymol. 2003, 374: 509-544.
Article CAS PubMed Google Scholar
Artymiuk PJ, Poirrette AR, Grindley HM, Rice DW, Willett P: A Graph-theoretic Approach to the Identification of Three-dimensional Patterns of Amino Acid Side-chains in Protein Structures. Journal of Molecular Biology. 243: 327-44.
Grindley H, Artymiuk P, Rice D, Willet P: Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. J Mol biol. 1993, 229: 707-721.
Article CAS PubMed Google Scholar
Huan J, Wang W, Bandyopadhyay D, Snoeyink J, Prins J, Tropsha A: Mining Protein Family Specific Residue Packing Patterns >From Protein Structure Graphs. Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB). 2004, 308-315.
Google Scholar
Milik M, Szalma S, Olszewski K: Common Structural Cliques: a tool for protein structure and function analysis. Protein Eng. 2003, 16 (8): 543-52.
Article CAS PubMed Google Scholar
Spriggs RV, Artymiuk PJ, Willett P: Searching for patterns of amino acids in 3D protein structures. J Chem Inf Comput Sci. 2003, 43: 412-421.
Article CAS PubMed Google Scholar
Stark A, Shkumatov A, Russell RB: Finding functional sites in structural genomics proteins. Structure (Camb). 2004, 12: 1405-1412.
Article CAS Google Scholar
Wangikar P, Tendulkar A, Ramya S, Mali D, Sarawagi S: Functional sites in protein families uncovered via an objective and automated graph theoretic approach. J Mol Biol. 2003, 326 (3): 955-978.
Article CAS PubMed Google Scholar
Inokuchi A, Washio T, Motoda H: An apriori-based algorithm for mining frequent substructures from graph data. PKDD'00. 2000, 13-23.
Google Scholar
Kuramochi M, Karypis G: Frequent Subgraph Discovery. Proc International Conference on Data Mining'01. 2001, 313-320.
Google Scholar
Yan X, Han J: gSpan: Graph-Based Substructure Pattern Mining. Proc International Conference on Data Mining'02. 2002, 721-724.
Google Scholar
Huan J, Wang W, Prins J: Efficient mining of frequent subgraphs in the presence of isomorphism. Proc of ICDM. 2003
Google Scholar
Huan J, Prins WWJ, Yang J: SPIN: Mining Maximal Frequent Subgraphs from Graph Databases. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2004, 581-586.
Google Scholar
Nijssen S, Kok J: A quickstart in frequent structure mining can make a difference. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2004, 647-652.
Google Scholar
Han J, Cheng H, Xin D, Yan X: Frequent Pattern Mining: Current Status and Future Directions. Data Mining and Knowledge Discovery. 2007, 14:
Google Scholar
Koren Y, North SC, Volinsky C: Measuring and extracting proximity in networks. KDD. 2006, 245-255.
Google Scholar
Tong H, Koren Y, Faloutsos C: Fast Direction-Aware Proximity for Graph Mining.
Yan X, Zhu F, Yu PS, Ha J: Feature based substructure similarity search. ACM Transactions on Database Systems. 2006
Google Scholar
Hasan M, Chaoji V, Salem S, jeremy Besson, Zaki M: ORIGAMI: Mining Representative Orthogonal Graph Patterns. Proc. 2007 Int. Conf. on Data Mining (ICDM'07). 2007
Google Scholar
Chen C, Yan X, Zhu F, Han J: gapprox: Mining frequent approximate patterns from a massive network. Proc. 2007 Int. Conf. on Data Mining (ICDM'07). 2007
Google Scholar
Holder LB, Cook DJ, Djoko S: Substructures discovery in the subdue system. Proc AAAI'94 Workshop Knowledge Discovery in Databases. 1994, 169-180.
Google Scholar
Zhang S, Yang J, Cheedella V: Monkey: Approximate Graph Mining Based on Spanning Trees. Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference. 2007, 1247-1249.
Chapter Google Scholar
Vanetik N, Gudes E: Mining Frequent Labeled and Partially Labeled Graph Patterns.
Bandyopadhyay D, Snoeyink J: Almost-Delaunay Simplices : Nearest Neighbor Relations for Imprecise Points. ACM-SIAM Symposium On Distributed Algorithms. 2004, 403-412.
Google Scholar
Delaunay B: Sur la sphere vide. A la memoire de Georges Voronoi. Izv. Akad. Nauk SSSR, Otdelenie Matematicheskih i Estestvennyh Nauk. 1934, 7: 793C800-
Google Scholar
Huan J, Wang W, Bandyopadhyay D, Snoeyink J, Prins J, Tropsha A: Mining Family Specific Residue Packing Patterns from Protein Structure Graphs. Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB). 2004, 308-315.
Google Scholar
Huan J, Wang W, Prins J: Efficient Mining of Frequent Subgraph in the Presence of Isomorphism. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM). 2003, 549-552.
Chapter Google Scholar
Eddy SR: Where did the BLOSUM62 alignment score matrix come from. Nature Biotechnology. 2004, 22: 1035-1036.
Article CAS PubMed Google Scholar
Kelly G: Structure of the cell-adhesion fragment of intimin from enteropathogenic Escherichia coli. Nature Struct Biol. 1999, 6: 313-318.
Article CAS PubMed Google Scholar
Hamburger Z: Crystal structure of invasin: a bacterial integrin-binding protein. Science. 1999, 286: 291-295.
Article CAS PubMed Google Scholar
Wang G, Dunbrack RL, PISCES J: A Protein Sequence Culling Server. Bioinformatics. 2003, 19: 1589-1591.
Article CAS PubMed Google Scholar
Huan J, Bandyopadhyay D, Snoeyink J, Prins J, Tropsha A, Wang W: Distance-based identification of spatial motifs in proteins using constrained frequent subgraph mining. Proceedings of the IEEE Computational Systems Bioinformatics (CSB). 2006
Google Scholar
LIBSVM. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
PDBsum Structure Database. [http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum]
Kuramochi M, Karypis G: Frequent subgraph discovery. Proc International Conference on Data Mining01. 2001, 313C320-
Google Scholar

Download references

Acknowledgements

This work has been partially supported by the Kansas IDeA Network for Biomedical Research Excellence (NIH/NCRR award #P20 RR016475) and a NIH grant #R01 GM868665.

This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1

Author information

Authors and Affiliations

Department of Electrical Engineering & Computer Science, University of Kansas, Lawrence, KS, 66045, USA
Yi Jia, Jun Huan & Vincent Buhr
Department of Molecular Biosciences, The University of Kansas, Lawrence, KS, 66046, USA
Jintao Zhang
School of Medicine, Washington University in St. Louis, St. Louis, MO, 63130, USA
Leonidas N Carayannopoulos

Authors

Yi Jia
View author publications
You can also search for this author in PubMed Google Scholar
Jun Huan
View author publications
You can also search for this author in PubMed Google Scholar
Vincent Buhr
View author publications
You can also search for this author in PubMed Google Scholar
Jintao Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Leonidas N Carayannopoulos
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jun Huan.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

YJ developed methods, implemented the software, and drafted the manuscript. VB and JZ were involved in testing the data set. JH was responsible for all aspects of the project, and helped revise the manuscript. LC provided advices on the biological aspect of the work, and helped revise the manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Jia, Y., Huan, J., Buhr, V. et al. Towards comprehensive structural motif mining for better fold annotation in the "twilight zone" of sequence dissimilarity. BMC Bioinformatics 10 (Suppl 1), S46 (2009). https://doi.org/10.1186/1471-2105-10-S1-S46

Download citation

Published: 30 January 2009
DOI: https://doi.org/10.1186/1471-2105-10-S1-S46

Selected papers from the Seventh Asia-Pacific Bioinformatics Conference (APBC 2009)

Towards comprehensive structural motif mining for better fold annotation in the "twilight zone" of sequence dissimilarity

Abstract

Background

Results

Conclusion

Background

Related work

Methods

Almost-Delaunay graph

Theoretic framework

Definition 1

Definition 2

Definition 3

Definition 4

Definition 5

Definition 6

Problem statement

Algorithm design

Definition 7

Definition 8

Results

Experimental setup

Data set

Experimental protocol

Number of patterns identified

Classification performance

Statistical significance of patterns

Computational performance

Discussion

Conclusion

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors' contributions

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Bioinformatics

Contact us