Towards comprehensive structural motif mining for better fold annotation in the "twilight zone" of sequence dissimilarity
 Yi Jia^{1},
 Jun Huan^{1},
 Vincent Buhr^{1},
 Jintao Zhang^{2} and
 Leonidas N Carayannopoulos^{3}
https://doi.org/10.1186/1471-2105-10-S1-S46
© Jia et al; licensee BioMed Central Ltd. 2009
Published: 30 January 2009
Abstract
Background
Automatic identification of structure fingerprints from a group of diverse protein structures is challenging, especially for proteins whose divergent amino acid sequences may fall into the "twilight" or "midnight" zones where pairwise sequence identities to known sequences fall below 25% and sequence-based functional annotations often fail.
Results
Here we report a novel graph database mining method and demonstrate its application to protein structure pattern identification and structure classification. The biologic motivation of our study is to recognize common structure patterns in "immunoevasins", proteins mediating virus evasion of host immune defense. Our experimental study, using both viral and nonviral proteins, demonstrates the efficiency and efficacy of the proposed method.
Conclusion
We present a theoretic framework, offer a practical software implementation for incorporating prior domain knowledge, such as substitution matrices as studied here, and devise an efficient algorithm to identify approximately matched frequent subgraphs. By doing so, we significantly expand the analytical power of sophisticated data mining algorithms in dealing with large volumes of complicated and noisy protein structure data. Without loss of generality, the choice of appropriate compatibility matrices allows our method to be easily employed in domains where subgraph labels have some uncertainty.
Background
Genomics efforts continue to yield a myriad of new protein sequences. Among the most valuable are those expressed by mammalian pathogens, organisms that successfully grow and disseminate despite a hostile host immunologic environment. A subset of pathogen-encoded proteins, "immunoevasins", facilitate this success by mediating cellular adhesion and entry, and by distorting the interactions of host receptors and cell-surface ligands [1]. Study of immunoevasins gives insight into host-defense mechanisms, insight that can help guide development of therapies and vaccines against refractory organisms [2].
Though immunoevasins frequently possess protein-recognition domain (PRD) folds common to mammalian proteins of immunologic importance, their divergent amino acid sequences may fall into the "twilight" or "midnight" zones where pairwise sequence identities to known sequences fall below 25% and purely sequence-based attempts at annotation often fail [3, 4].
To better annotate these, and more generally any other highly divergent sequences, some means of explicitly incorporating three-dimensional structural information into the sequence evaluation is required. Inclusion of even rudimentary structural considerations enhances the performance of sequence scoring heuristics such as local alignment tools [5] and hidden Markov models (HMM) [6]. Indeed, an HMM constrained with crystallographically determined secondary structure data allowed discovery of a previously unsuspected MHC class I-like immunoevasin in the genomes of orthopoxviruses [7]. A vast literature covers various schemes for structural data incorporation and fold classification. Nevertheless, much progress remains to be made [8].
We are pursuing an approach whereby structural patterns common to a protein fold are collected, assessed for their classification value, and mapped onto statistical models of protein sequences (e.g. HMMs, support vector machines (SVMs), and conditional random fields). As a first step, a comprehensive and objective means is required of identifying and assessing the above common structure patterns, or structure fingerprints.
Automatic identification of structure fingerprints from a group of diverse protein structures is challenging for a number of reasons. First, we have only limited knowledge about the possible location, composition, and geometric shape of these structure patterns. Second, protein structures are large geometric objects that typically contain hundreds of amino acids with thousands of atoms and chemical bonds. Third, due to accumulated mutations in evolution, the same structure pattern may appear slightly different in different proteins. In algorithm-design terms, automatic structure pattern identification is challenging because (1) the problem has a large combinatorial search space (patterns may occur in any part of a protein and in any subset of a group of proteins) and (2) approximate rather than exact matching is needed to retrieve such patterns (we should tolerate a certain level of geometric distortion and amino acid mismatch when searching for common structure patterns).
In this paper we demonstrate a novel data mining technique that efficiently extracts and scores structure patterns from diverse proteins. Specifically, in our method we encode a protein structure as a geometric graph where a node represents an amino acid residue and an edge represents a physical or a chemical interaction between a pair of residues. We encode structural motifs as subgraphs of a geometric graph, and we identify conserved structure fingerprints by searching for frequently occurring approximate subgraphs in a group of graph-represented proteins.
Our contributions in designing a new graph data mining method are to develop a solid theoretic framework, to offer a practical software implementation for incorporating prior domain knowledge, such as substitution matrices as studied here, and to devise an efficient algorithm to identify approximately matched frequent subgraphs. By doing so, we expand the analytical power of data mining algorithms in dealing with large volumes of complicated and noisy protein structure data. As evaluated in our driving biological application of recognizing common structure patterns in immunoevasins, our proposed method identifies many structure patterns and affords better structure classification accuracy compared to existing graph mining algorithms.
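As a rough illustration of this encoding, the sketch below builds a residue graph from one coordinate per residue using a plain distance cutoff. The cutoff rule, the function name `build_residue_graph`, and the default of 8.0 are assumptions made for illustration only; the method described in this paper derives edges from the Almost-Delaunay construction rather than from a simple cutoff.

```python
import math
from itertools import combinations

def build_residue_graph(residues, coords, cutoff=8.0):
    """Return (node_labels, edges) for one protein.

    residues: list of residue type labels, e.g. ["ALA", "GLY", ...]
    coords:   list of (x, y, z) tuples, one representative point per residue
    cutoff:   distance below which two residues are treated as interacting
              (hypothetical contact rule, not the Almost-Delaunay criterion)
    """
    node_labels = dict(enumerate(residues))
    edges = set()
    for i, j in combinations(range(len(residues)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            edges.add((i, j))  # undirected edge, stored with i < j
    return node_labels, edges
```

A structural motif is then any subgraph of the returned graph, so that motif discovery reduces to finding subgraphs that recur across the graphs of a protein family.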
The rest of the paper is organized in the following way. In the Related Work section, we give an overview of related work on subgraph mining and protein structure pattern identification. In the Methods section, we introduce the technique about how to translate protein structures into graphs, provide our model for approximate subgraph mining, and present the details of our algorithm. In the Results section, we show an empirical study of the proposed algorithm using protein structure data sets. In the Discussion section, we discuss the biological significance of the structural motifs mined by our method. Finally in the Conclusions section, we conclude with a short discussion of our approach.
Related work
There is an extensive body of literature on comparing and classifying proteins using multiple sequence or structure alignment, such as VAST [9] and DALI [10]. Here we focus on the recent algorithmic techniques for discovering structure motifs from protein structures. The methods can be classified into the following five types:

Depth-first search, starting from simple geometric patterns such as triangles, progressively finding larger patterns [11–13].

Geometric hashing, originally developed in computer vision, applied pairwise between protein structures to identify structure motifs [14–16].

String pattern matching methods that encode the local structure and sequence information of a protein as a string, and apply string search algorithms to derive motifs [17–19].

Delaunay Tessellation (DT) [20–22], partitioning the structure into an aggregate of non-overlapping, irregular tetrahedra and thus identifying all unique nearest neighbor residue quadruplets for any protein [22].

Graph matching methods comparing protein structures modeled as graphs and discovering structure motifs by finding recurring subgraphs [23–29].
Graph database mining is an active research field in data mining. The goal of graph database mining is to locate useful and interpretable patterns in a large volume of graph data. Recent exact matching graph mining algorithms can be roughly divided into three categories. The first category uses the level-wise search strategy and includes AGM [30] and FSG [31]. The second category takes the depth-first search strategy and includes gSpan [32] and FFSM [33]. The third category works by mining frequent trees, for which SPIN [34] and GASTON [35] are representative. There are many other graph mining algorithms, and we refer to [36] for a recent survey.
Frequent subgraph mining with approximate matching capability has also been investigated. The current approximate subgraph mining algorithms can be divided into four categories: (1) proximity measures between graphs [37–39], (2) given a proximity measurement, compute representative frequent subgraphs [40], (3) pattern discovery in a single large graph [41], and (4) pattern discovery from a group of graphs. The last category is what we concentrate on. For algorithms in (4), SUBDUE [42] does not claim completeness. Monkey [43] handles only edge missing and edge label mismatch. Partially Labeled Graphs [44] uses a wild card method to handle node label mismatches. The algorithm may be viewed as a special case of our algorithm.
Different from the existing work, to the best of our knowledge, we are the first group to incorporate a probability matrix in a graph mining method. We also developed a general framework to fully utilize a probability matrix for approximate matching, which can be applied to a number of different applications. In addition, we have developed two ways to demonstrate the statistical significance of the patterns mined from a graph database. Statistical significance is an important but often overlooked issue in evaluating the quality of identified patterns in frequent pattern mining. Finally, we offer a practical implementation and evaluate its performance using synthetic data sets.
Methods
In this section, we first briefly describe the technique that translates protein structures into graphs. Then we present our method, called APGM (APproximate Graph Mining), in two steps: introducing the theoretic model, and showing our algorithm in detail.
Almost-Delaunay graph
Theoretic framework
Definition 1
A labeled graph G is a 5-tuple G = {V, E, Σ_{V}, Σ_{E}, λ} where V is the set of vertices of G and E ⊆ V × V is the set of undirected edges of G. Σ_{V} and Σ_{E} are (disjoint) sets of labels. The labeling function λ: V → Σ_{V} ∪ E → Σ_{E} maps vertices and edges in G to their labels. A graph database D is a set of graphs.
We also use V[G] to denote the node set of a graph G and E[G] to denote the edge set of G. We use Σ_{V[G]} to denote the node labels, Σ_{E[G]} to denote the edge labels, and λ_{G} to denote the labeling function for a graph G. Before we introduce approximate matching, we define the compatibility matrix, which offers a probability framework for approximate subgraph mining.
Definition 2
A compatibility matrix M = (m_{i,j}) is an n × n matrix indexed by symbols from a label set Σ (n = |Σ|). An entry m_{i,j} (0 ≤ m_{i,j} ≤ 1, Σ_{j} m_{i,j} = 1) in M is the probability that the label i is replaced by the label j.
A compatibility matrix M is stable if the diagonal entry is the largest one in its row (i.e. m_{i,i} > m_{i,j} for all j ≠ i). A stable compatibility matrix means that any label i is more likely to be replaced by itself than by any other symbol. For our biological application, we consider substitution matrices as being, in essence, stable matrices since most or all rows fit the criterion. For example, in the BLOSUM62 substitution matrix, there is only one violation of the criterion: the row for methionine (MET). Hence for the rest of the discussion, we will treat substitution matrices as stable compatibility matrices.
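The two conditions above (row-stochastic entries and a strictly dominant diagonal) can be checked mechanically. The following is a minimal sketch; the function names and the 3 × 3 matrix are made up for illustration and are not taken from BLOSUM62 or from our implementation.

```python
def is_compatibility_matrix(M, tol=1e-9):
    """Check Definition 2: entries lie in [0, 1] and every row sums to 1."""
    return all(
        all(0.0 <= x <= 1.0 for x in row) and abs(sum(row) - 1.0) <= tol
        for row in M
    )

def unstable_rows(M):
    """Return the row indices whose diagonal entry is not the strict row maximum."""
    return [
        i for i, row in enumerate(M)
        if any(row[i] <= x for j, x in enumerate(row) if j != i)
    ]

# Toy matrix (illustrative values only).
M = [
    [0.6, 0.4, 0.0],
    [0.4, 0.6, 0.0],
    [0.5, 0.0, 0.5],  # diagonal ties with another entry, so this row is not stable
]
```

Here `is_compatibility_matrix(M)` holds, while `unstable_rows(M)` reports row 2, mirroring the way a single row of a substitution matrix can violate the stability criterion.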
Definition 3
A labeled graph G = {V, E, Σ_{ V }, Σ_{ E }, λ} is approximately subgraph isomorphic to another graph G' = {V', E', ${{\Sigma}^{\prime}}_{V}$, ${{\Sigma}^{\prime}}_{E}$, λ'} if there exists an injection f : V → V' such that

∏_{u∈V}M_{λ(u),λ'(f(u))}≥ τ, and

$\prod_{(u,v)\in E} M'_{\lambda(u,v),\lambda'(f(u),f(v))} \ge \tau'$
The injection f is an approximate subgraph isomorphism between G and G'. M is a compatibility matrix for node label sets Σ_{ V }∪ ${{\Sigma}^{\prime}}_{V}$. M' is a compatibility matrix for edge label sets Σ_{ E }∪ ${{\Sigma}^{\prime}}_{E}$. In an edge compatibility matrix, we assume Σ_{ E }and ${{\Sigma}^{\prime}}_{E}$ both contain a special label called empty edge. In this way, we handle both topology distortion (missing edges) and edge label mismatches in the same unified way through an edge compatibility matrix. τ (0 <τ ≤ 1) is the threshold for node mismatch and τ'(0 <τ' ≤ 1) is the threshold for edge mismatch.
For simplicity in the following discussion, we assume that we only need to handle node label mismatches (i.e. corresponding edge relations and corresponding edge labels should exactly match each other when matching two graphs). In principle, edge label mismatch (including missing edges) can be handled in a similar way to node label mismatch. Hence our assumption does not reduce the complexity of algorithm design, but it significantly simplifies our presentation and makes our algorithm easier to follow.
With the assumption, the new definition of approximate subgraph isomorphism is:
Definition 4
A graph G is approximate subgraph isomorphic to another graph G', denoted by G ⊆_{a} G', if there exists a 1-1 injection f from V[G] to V[G'] such that

∏_{u∈V}M_{λ(u),λ'(f(u))}≥ τ,

∀ u, v ∈ V, (u, v) ∈ E ⇔ (f(u), f(v)) ∈ E', and

∀ (u, v) ∈ E, λ(u, v) = λ'(f(u), f(v))
Similarly, we define exact subgraph isomorphism below.
Definition 5
A graph G is subgraph isomorphic to another graph G', denoted by G ⊆ G', if there exists a 1-1 injection f from the node set V of a graph G to V' of a graph G', such that

∀ u ∈ V, λ(u) = λ'(f(u))

∀ u, v ∈ V, (u, v) ∈ E ⇔ (f(u), f(v)) ∈ E', and

∀ (u, v) ∈ E, λ(u, v) = λ'(f(u), f(v))
Example 2. In Figure 2, we show a graph database D = {P, Q, R} and a compatibility matrix M. We set the isomorphism threshold τ = 0.4; with this threshold, graph P is approximate subgraph isomorphic to graph Q with an approximate subgraph isomorphism score of 0.6. To see this, note that there are a total of 6 different ways to map the nodes of P to those of Q. The only two that satisfy the edge label constraints are f_{1} = {p_{1} → q_{1}, p_{2} → q_{2}, p_{3} → q_{3}} and f_{2} = {p_{1} → q_{2}, p_{2} → q_{1}, p_{3} → q_{3}}. The approximate subgraph isomorphism score of f_{1} equals that of f_{2}.
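The enumeration described in Example 2 can be sketched as a brute-force check of Definition 4: try every injection, keep those that preserve edges and edge labels, and take the best product of node compatibilities. The graph encoding, the function name `best_match_score`, and the toy data are assumptions for illustration (they are not the graphs of Figure 2), and the exhaustive search is only practical for very small graphs; it is not the APGM algorithm.

```python
from itertools import permutations

def best_match_score(G, H, M, tau):
    """Best node-compatibility product over all valid injections from G into H.

    A graph is (node_labels, edge_labels): node_labels maps node -> label and
    edge_labels maps frozenset({u, v}) -> label (labels are assumed not None).
    M maps (label, label) -> probability. Returns 0.0 if no mapping reaches tau.
    """
    g_nodes, g_edges = G
    h_nodes, h_edges = H
    nodes = list(g_nodes)
    best = 0.0
    for image in permutations(h_nodes, len(nodes)):
        f = dict(zip(nodes, image))
        # Edge condition: an edge exists in G iff its image exists in H,
        # and the two edge labels are equal.
        preserved = all(
            g_edges.get(frozenset((u, v))) == h_edges.get(frozenset((f[u], f[v])))
            for i, u in enumerate(nodes)
            for v in nodes[i + 1:]
        )
        if not preserved:
            continue
        score = 1.0
        for u in nodes:
            score *= M.get((g_nodes[u], h_nodes[f[u]]), 0.0)
        best = max(best, score)
    return best if best >= tau else 0.0

# Toy data (illustrative only).
G = ({1: "a", 2: "b"}, {frozenset((1, 2)): "x"})
H = ({1: "a", 2: "c", 3: "b"}, {frozenset((1, 2)): "x"})
M = {("a", "a"): 1.0, ("b", "b"): 0.6, ("b", "c"): 0.4}
```

In this toy case the only edge-preserving mapping with a non-zero score sends a to a and b to c, giving a score of 0.4, so G matches H at τ = 0.4 but not at τ = 0.5.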
Definition 6
G is a frequent approximate subgraph if its support value is at least σ. With this definition, we use only those graphs to which a subgraph G is approximate subgraph isomorphic (as controlled by the parameter τ) to compute the support value of G. We do this to filter out low-quality (but potentially many) graph matchings when counting the support value of a subgraph. For a moderate-sized graph database (100–1000), in our experience, the number of frequent subgraphs identified is usually not sensitive to the isomorphism threshold, which makes sense since a low-quality graph matching carries a low "weight" in the support computation anyway.
Problem statement
Example 3. Given a graph database D and a compatibility matrix M in Figure 2, the support threshold σ = 2/3 and the isomorphism threshold τ = 0.4, we show how to calculate the isomorphism score and support value for the approximate frequent patterns in Figure 3.
S(A_{1}, P) = 1, S(A_{1}, Q) = 1, S(A_{1}, R) = 1, Sup(A_{1}) = 3/3;
S(A_{2}, P) = 1, S(A_{2}, Q) = 1, S(A_{2}, R) = 1, Sup(A_{2}) = 3/3;
S(A_{3}, P) = 1, S(A_{3}, Q) = 0.6, S(A_{3}, R) = 0.4, Sup(A_{3}) = 2/3;
S(A_{4}, P) = 1, S(A_{4}, Q) = 0.6, S(A_{4}, R) = 0.4, Sup(A_{4}) = 2/3.
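The numbers in Example 3 are consistent with a support value computed as the sum of the isomorphism scores over the database divided by the database size, counting only scores that reach τ. The sketch below treats that reading as an assumption, since the formula is not restated in this section; the function name `support` is illustrative, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

def support(scores, tau):
    """Mean isomorphism score over the database, ignoring scores below tau."""
    kept = [s for s in scores if s >= tau]
    return sum(kept, Fraction(0)) / len(scores)

tau = Fraction(2, 5)  # isomorphism threshold 0.4

# Scores of A_1 and A_3 against P, Q, R, as listed in Example 3.
sup_a1 = support([Fraction(1), Fraction(1), Fraction(1)], tau)
sup_a3 = support([Fraction(1), Fraction(3, 5), Fraction(2, 5)], tau)
```

Under this reading, `sup_a1` equals 3/3 and `sup_a3` equals 2/3, matching the values in the example, and both patterns meet the support threshold σ = 2/3.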
Algorithm design
Here we demonstrate a new algorithm APGM for approximate subgraph mining. APGM starts with frequent single node subgraphs. At a subsequent step, it adds a node to an existing pattern to create new subgraph patterns and identify their support value. If none of the resulting subgraphs are frequent, APGM backtracks. APGM stops when no more patterns need to be searched. Before we proceed to the algorithmic details, we introduce the following definitions to facilitate the demonstration of the APGM algorithm.
Definition 7
Given a graph T and one of the embeddings e = v_{1}, v_{2},⋯,v_{k} of T in a graph G, a node v is a neighbor of e if ∃u ∈ e, (u, v) ∈ E[G].
In other words, a neighbor node of an embedding e is any node that connects to at least one node in e. The neighbor set of an embedding e, denoted by N(e), is the set of e's neighbors.
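A minimal sketch of the neighbor set computation follows. The edge-list representation and the function name `neighbor_set` are assumptions for illustration; the sketch also excludes nodes that already belong to the embedding, which Definition 7 does not state explicitly but which matches the use of neighbors as candidate nodes for pattern extension.

```python
def neighbor_set(embedding, edges):
    """N(e): nodes outside the embedding that share an edge with one of its nodes.

    embedding: iterable of nodes of G onto which the pattern T is mapped
    edges:     iterable of undirected edges (u, v) of G
    """
    inside = set(embedding)
    neighbors = set()
    for u, v in edges:
        if u in inside and v not in inside:
            neighbors.add(v)
        elif v in inside and u not in inside:
            neighbors.add(u)
    return neighbors
```

For example, with edges (1, 2), (2, 3), (3, 4), (1, 3) and the embedding {1, 2}, the only neighbor is node 3.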
Definition 8
Given a graph T, one of the embeddings e = v_{1}, v_{2},⋯,v_{ k }of T in a graph G, a node v ∈ N(e), and a node label l, the approximate subgraph, denoted by G_{T,e,v,l}, is a graph (V', E', ${{\Sigma}^{\prime}}_{V}$, ${{\Sigma}^{\prime}}_{E}$, λ') such that

V' = {v_{1}, v_{2},⋯,v_{k}} ∪ {v}

E' = (V' × V') ∩ E[G]

${{\Sigma}^{\prime}}_{V}$ = Σ_{ V }

${{\Sigma}^{\prime}}_{E}$ = Σ_{ E }

∀ u ∈ e : λ'(u) = λ_{ T }(u)

λ'(v) = l

∀ u, v ∈ e : λ'((u, v)) = λ_{ G }((u, v))
With the two definitions, we present the pseudo code of APGM below.
Algorithm 1. APGM_MAIN(D, M, τ , σ )
Begin
  C ← {frequent single node}
  F ← C
  for each T ∈ C do
    APGM_SEARCH(T, τ, σ, F)
  end for
  return F
End
Algorithm 2. APGM_SEARCH(T, τ , σ , F)
Begin
  C ← ∅
  for each (e, v) such that e is an embedding of T in a graph G ∈ D and v ∈ N(e) do
    CL ← approximateLabelSet(T, G, e, v)
    for each l ∈ CL do
      X ← G_{T, e, v, l}
      C ← C ∪ {X}
      $\mathcal{H}$(X) ← $\mathcal{H}$(X) ∪ {(e, v)}
    end for
  end for
  remove infrequent patterns from C
  F ← F ∪ C
  for each T ∈ C do
    APGM_SEARCH(T, τ, σ, F)
  end for
End
$\mathcal{H}$ is a hash table that stores candidate subgraphs and their embeddings. The hash key in our implementation is a canonical code of the subgraph X, which is a unique string representation of a graph. We use the Canonical Adjacency Matrix (CAM) and the Canonical Adjacency Matrix code, developed in [48], to compute the canonical code of a graph.
Algorithm 3. approximateLabelSet(T, G, e, v)
Begin
  R ← ∅
  l_{0} ← λ_{G}(v)
  for each l ∈ Σ_{V[G]} do
    if $S(e,T)\times \frac{M(l_{0},l)}{M(l_{0},l_{0})}\ge \tau$ then
      R ← R ∪ {l}
    end if
  end for
  return R
End
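Algorithm 3 can be rendered in a few lines of Python. This is a sketch under assumptions: the compatibility matrix is stored as a dictionary keyed by label pairs, the embedding score S(e, T) is passed in as a number, and the parameter names are illustrative rather than taken from our C++ implementation.

```python
def approximate_label_set(score_e, l0, labels, M, tau):
    """Labels that node v may take without pushing the score below tau.

    score_e: S(e, T), the score of the current embedding e of pattern T
    l0:      label of the neighbor node v in the database graph G
    labels:  candidate node labels (the node label set of G)
    M:       compatibility matrix as a dict {(label_i, label_j): probability}
    tau:     isomorphism threshold
    """
    diag = M[(l0, l0)]  # normalizing by the diagonal gives an exact match weight 1
    return {
        l for l in labels
        if score_e * M.get((l0, l), 0.0) / diag >= tau
    }

# Toy row of a compatibility matrix for label "a" (illustrative values only).
M = {("a", "a"): 0.6, ("a", "b"): 0.3, ("a", "c"): 0.1}
```

With these toy values and τ = 0.4, an embedding with score 1.0 admits the labels a and b for the new node, while an embedding with score 0.5 admits only a: the lower the score already accumulated, the fewer substitutions remain affordable.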
Example 5. Applying APGM to the graph database shown in Figure 2 with the support threshold σ = 2/3 and the isomorphism threshold τ = 0.4, we identify one frequent single-node pattern a (shown as A_{1} in Figure 3). Adding one node to the pattern A_{1}, there are two candidate single-edge patterns and both of them are frequent. These two are shown as A_{2} and A_{3} in the same figure. From pattern A_{2}, we enumerate one additional pattern A_{4}. We stop here since there are no more candidate patterns to explore.
Results
Experimental setup
We performed all the experiments on a cluster with 256 Intel Xeon 3.2 GHz EM64T processors with 4 GB memory each. The approximate graph mining algorithm was implemented in C++ and compiled with the g++ compiler in a Linux environment with -O3 optimization.
We downloaded all protein structures from the Protein Data Bank (PDB). We followed [45] in using the same software as [47] to calculate Almost-Delaunay (AD) edges for the graph representation of protein geometry. We took BLOSUM62 as the compatibility matrix and back-calculated the conditional probability matrix by following the procedure described in [49]. We normalized the matrix according to Definition 4.
Data set
Characteristics of domain sequence sets
| | Immunoglobulin C1 Set | Immunoglobulin V Set |
|---|---|---|
| Number of Proteins | 1786 | 371 |
| Average Length | 210 | 194 |
| Maximum Length | 457 | 444 |
| Minimum Length | 98 | 99 |
Immunoevasins protein lists for research
PDB ID of proteins in Immunoglobulin C1 set
Proteins for Feature Extraction (10): 1fp5a 1onqa 1ogad 1pqza 1t7va 1l6xa 1je6a 1mjul 1uvqb 1dn0b
Proteins for Leave-one-out Testing (11): 1nfda 1uvqa 1q0xl 1mjuh 1a6za 1k5na 1hdma 3frua 1ogae 1hdmb 1k5nb
PDB ID of proteins in Immunoglobulin V set
Proteins for Feature Extraction (10): 1pkoa 1ogad 1npua 1cdca 1jmaa 1fo0b 1nkoa 1mjuh 1nfdb 1qfoa
Proteins for Leave-one-out Testing (9): 1zcza 1f97a 1eaja 1mjul 1cida 1neua 1cdya 1hkfa 1nezg
Experimental protocol
Number of patterns identified
Number of patterns by APGM (τ = 0.35) and MGM on Immunoglobulin C1
| Support Threshold (σ) | 6 | 5.5 | 5 | 4.5 | 4 |
|---|---|---|---|---|---|
| APGM (τ = 0.35) | 17 | 24 | 141 | 202 | 841 |
| MGM | 16 | 16 | 126 | 126 | 660 |
Number of patterns by APGM (τ = 0.75) and MGM on Immunoglobulin V
| Support Threshold (σ) | 6 | 5.5 | 5 | 4.5 | 4 |
|---|---|---|---|---|---|
| APGM (τ = 0.75) | 0 | 0 | 0 | 160 | 14686 |
| MGM | 0 | 0 | 0 | 0 | 13911 |
Classification performance
Classification accuracy of APGM (τ = 0.35) and MGM on Immunoglobulin C1 set
| Support Threshold (σ) | 6 | 5.5 | 5 | 4.5 | 4 |
|---|---|---|---|---|---|
| APGM | 68.18% | 77.27% | 86.36% | 90.91% | 81.82% |
| MGM | 72.73% | 72.73% | 72.73% | 72.73% | 72.73% |
Classification accuracy of APGM (τ = 0.75) and MGM on Immunoglobulin V set
| Support Threshold (σ) | 6 | 5.5 | 5 | 4.5 |
|---|---|---|---|---|
| APGM | - | - | - | 77.78% |
| MGM | - | - | - | - |
Statistical significance of patterns
There are three special cases of P's value. If f^{-} = 0 and f^{+} ≠ 0, we set P = 10; if f^{-} ≠ 0 and f^{+} = 0, we set P = -10; and if f^{-} = 0 and f^{+} = 0, we set P = 0.
Although the patterns are not distributed uniformly among the Immunoglobulin C1 proteins, they cover all the positive proteins. The significance scores of these patterns show a strong bias toward the Immunoglobulin C1 proteins: among the 202 patterns, only 30 noise features (P = -10) exist. For the Immunoglobulin V proteins, the features miss two positive proteins, but these features are highly correlated with the positive samples, with all P values equal to 10.
Computational performance
Since the support value in approximate subgraph mining and that in frequent subgraph mining have different meanings, it is generally hard to compare the computational performance of the two directly. If τ is less than 1, approximate subgraph mining may obtain more patterns than general frequent subgraph mining at the cost of more running time. For this reason, we use the pattern discovery rate ("rate" for simplicity), which is computed as the number of discovered patterns N divided by the running time t. We use rate rather than running time as the criterion to compare the computational efficiency of different algorithms. We evaluated the computational efficiency of APGM with synthetic data sets.
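The rate N / t can be measured with a small wrapper around any mining routine. The sketch below is illustrative only: `discovery_rate` is a hypothetical helper name, and `mine` stands for whichever mining function is being timed.

```python
import time

def discovery_rate(mine, *args):
    """Run a mining function and return (patterns, rate), where rate = N / t."""
    start = time.perf_counter()
    patterns = mine(*args)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from the timer on extremely fast runs.
    rate = len(patterns) / elapsed if elapsed > 0 else float("inf")
    return patterns, rate
```

Comparing rates instead of raw running times avoids penalizing an algorithm simply because it reports more patterns.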
We generated the synthetic data set by the same synthetic graph generator as [56]. The synthetic graph generator takes the following set of parameters: D is the total number of graphs; T is the average size of graph; I is the average size of potentially frequent subgraphs; L is the number of potentially frequent subgraphs; V is the number of vertex labels; E is the number of edge labels. The default parameter values that we use are D = 10000, T = 30, I = 11, L = 200, E = 20, V = 20.
Discussion
Finding features (corresponding to packing motifs) that discriminate one protein family from randomly selected proteins motivated us to further investigate the possibility of using these motifs as characteristic signatures of a protein family. We investigated the spatial distribution of the residues covered by our mined structure motifs in individual proteins. We found that the residues of structure motifs are highly concentrated on a limited number of positions in each protein. We picked the protein 1mju (chain l) in the Immunoglobulin C1 set as one example. The 202 patterns that we obtained map to 21 amino acids among the total of 219 residues in 1mju. Through a literature search, we found that residues identified by APGM are related to the known functional sites in the protein. For example, positions 200 and 202 are residues in contact with ligand GOL1406 as studied in [55]. Neither position is discovered by the exact pattern mining method. This result suggests that APGM is more sensitive in recognizing functionally related residues than exact pattern mining methods. However, we admit that a comprehensive experimental study, involving multiple protein families, is needed before we can draw this conclusion convincingly.
Conclusion
In this paper we present a novel data mining algorithm, APGM (APproximate Graph Mining), to perform structure comparison and structure motif identification in diverse proteins. In our method we encode structural motifs as subgraphs of the geometric graphs of proteins. Instead of using a general graph mining method to extract frequent subgraph motifs, we have developed an approximate graph mining algorithm that takes advantage of known substitution matrices in protein structure motif identification. Compared with general graph mining algorithms, APGM not only offers higher-quality patterns that achieve higher classification accuracy, but also shows reasonable computational performance. By applying this method to other protein families, "structure fingerprints" can be collected and used in domain classification schemes where structural information is desired. Furthermore, without loss of generality, the choice of appropriate compatibility matrices allows our method to be employed in any domain where subgraph labels have some uncertainty. For example, networks of personal contacts "mutate" as people die or change employment. Compatibility matrices assigning probabilities of 'label substitution' within families or organizations may allow the essential nature of personal contact subgraphs to be preserved nevertheless.
Declarations
Acknowledgements
This work has been partially supported by the Kansas IDeA Network for Biomedical Research Excellence (NIH/NCRR award #P20 RR016475) and an NIH grant #R01 GM868665.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
References
 J L, HL P: Antigen presentation and the ubiquitinproteasome system in hostpathogen interactions. Adv Immunol. 2006, 92: 225305.View ArticleGoogle Scholar
 Judson KA, Lubinski JM, Jiang M, Chang Y, Eisenberg RJ, Cohen GH, Friedman HM: Blocking Immune Evasion as a Novel Approach for Prevention and Treatment of Herpes Simplex Virus Infection. J Virol. 2003, 77: 1263912645.PubMed CentralView ArticlePubMedGoogle Scholar
 RF D: Of URFs and ORFs: A Primer on How to Analyze Derived Amino Acid Sequences. 1986, Mill Valley: University Science Books, 92:Google Scholar
 B R: Twilight zone of protein sequence alignments. Protein Eng. 1999, 12 (2): 8594.View ArticleGoogle Scholar
 JU B, R L, D E: A method to identify protein sequences that fold into a known threedimensional structure. Science. 253 (5016): 164170. 1991 Jul 12Google Scholar
 Hargbo J, Elofsson A: Hidden Markov models that use predicted secondary structures for fold recognition. Proteins. 1999, 36 (1): 6876.View ArticlePubMedGoogle Scholar
 Campbell JA, Trossman DS, WM WM, Carayannopoulos LN: Zoonotic orthopoxviruses encode a highaffinity antagonist of NKG2D. J Exp Med . 204 (6): 13117.Google Scholar
 Kryshtafovych A, Venclovas C, Fidelis K, Moult J: Progress over the first decade of CASP experiments. Proteins. 2005, 61 (Suppl 7): 225236.View ArticlePubMedGoogle Scholar
 Gibrat J, Madej T, Bryant S: Surprising similarities in structure comparison. Curr Opin Struct Biol. 1996, 6 (3): 377385.View ArticlePubMedGoogle Scholar
 Holm L, Sander C: Mapping the protein universe. Science. 1996, 273: 595602.View ArticlePubMedGoogle Scholar
 Bradley P, Kim PS, Berger B: TRILOGY: Discovery of sequencestructure patterns across diverse proteins. Proc Natl Acad Sci USA . 2002, 99 (13): 85008505.PubMed CentralView ArticlePubMedGoogle Scholar
 Russell RB: Detection of protein threedimensional sidechain patterns: new examples of convergent evolution. Journal of Molecular Biology. 1998, 279: 12111227.View ArticlePubMedGoogle Scholar
 Stark A, Russell R: Annotation in three dimensions. PINTS: Patterns in Nonhomologous Tertiary Structures. Nucleic Acids Res. 2003, 31 (13): 33414.PubMed CentralView ArticlePubMedGoogle Scholar
 Barker J, Thornton J: An algorithm for constraintbased structural template matching: application to 3D templates with statistical analysis. Bioinformatics. 2003, 19 (13): 16449.View ArticlePubMedGoogle Scholar
 Nussinov R, Wolfson HJ: efficient detection of threedimensional structural motifs in biological macromolecules by computer vision techniques. PNAS. 1991, 88: 1049599.PubMed CentralView ArticlePubMedGoogle Scholar
 Wallace A, Borkakoti N, Thornton J: TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci. 1997, 6 (11): 230823.PubMed CentralView ArticlePubMedGoogle Scholar
 Jonassen I, Eidhammer I, Conklin D, Taylor WR: Structure motif discovery and mining the PDB. Bioinformatics. 2002, 18: 362367.View ArticlePubMedGoogle Scholar
 Jonassen I, Eidhammer I, Taylor WR: Discovery of local packing motifs in protein structures. Proteins. 1999, 34: 206219.View ArticlePubMedGoogle Scholar
 Taylor WR, Jonassen I: A Method for Evaluating Structural Models using Structural Patterns. Proteins. 2004Google Scholar
 Cammer S, Carter C, Tropsha A: Identification of sequencespecific tertiary packing motifs in protein structures using Delaunay tessellation. Lecture notes in Computational Science and Engineering. 2002, 24: 477494.View ArticleGoogle Scholar
 Krishnamoorthy B, Tropsha A: Development of a fourbody statistical pseudopotential to discriminate native from nonnative protein conformations. Bioinformatics. 2003, 19 (12): 154048.View ArticlePubMedGoogle Scholar
 Tropsha A, Carter C, Cammer S, Vaisman I: Simplicial neighborhood analysis of protein packing (SNAPP) : a computational geometry approach to studying proteins. Methods Enzymol. 2003, 374: 509544.View ArticlePubMedGoogle Scholar
 Artymiuk PJ, Poirrette AR, Grindley HM, Rice DW, Willett P: A Graphtheoretic Approach to the Identification of Threedimensional Patterns of Amino Acid Sidechains in Protein Structures. Journal of Molecular Biology. 243: 32744.Google Scholar
 Grindley H, Artymiuk P, Rice D, Willet P: Identification of tertiary structure resemblance in proteins using a maximal common subgraph isomorphism algorithm. J Mol biol. 1993, 229: 707721.View ArticlePubMedGoogle Scholar
 Huan J, Wang W, Bandyopadhyay D, Snoeyink J, Prins J, Tropsha A: Mining Protein Family Specific Residue Packing Patterns >From Protein Structure Graphs. Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB). 2004, 308315.Google Scholar
 Milik M, Szalma S, Olszewski K: Common Structural Cliques: a tool for protein structure and function analysis. Protein Eng. 2003, 16 (8): 54352.View ArticlePubMedGoogle Scholar
 Spriggs RV, Artymiuk PJ, Willett P: Searching for patterns of amino acids in 3D protein structures. J Chem Inf Comput Sci. 2003, 43: 412421.View ArticlePubMedGoogle Scholar
 Stark A, Shkumatov A, Russell RB: Finding functional sites in structural genomics proteins. Structure (Camb). 2004, 12: 14051412.View ArticleGoogle Scholar
 Wangikar P, Tendulkar A, Ramya S, Mali D, Sarawagi S: Functional sites in protein families uncovered via an objective and automated graph theoretic approach. J Mol Biol. 2003, 326 (3): 955978.View ArticlePubMedGoogle Scholar
 Inokuchi A, Washio T, Motoda H: An apriori-based algorithm for mining frequent substructures from graph data. PKDD'00. 2000, 13-23.
 Kuramochi M, Karypis G: Frequent Subgraph Discovery. Proc International Conference on Data Mining'01. 2001, 313-320.
 Yan X, Han J: gSpan: Graph-Based Substructure Pattern Mining. Proc International Conference on Data Mining'02. 2002, 721-724.
 Huan J, Wang W, Prins J: Efficient mining of frequent subgraphs in the presence of isomorphism. Proc of ICDM. 2003.
 Huan J, Wang W, Prins J, Yang J: SPIN: Mining Maximal Frequent Subgraphs from Graph Databases. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2004, 581-586.
 Nijssen S, Kok J: A quickstart in frequent structure mining can make a difference. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2004, 647-652.
 Han J, Cheng H, Xin D, Yan X: Frequent Pattern Mining: Current Status and Future Directions. Data Mining and Knowledge Discovery. 2007, 14.
 Koren Y, North SC, Volinsky C: Measuring and extracting proximity in networks. KDD. 2006, 245-255.
 Tong H, Koren Y, Faloutsos C: Fast Direction-Aware Proximity for Graph Mining.
 Yan X, Zhu F, Yu PS, Han J: Feature based substructure similarity search. ACM Transactions on Database Systems. 2006.
 Hasan M, Chaoji V, Salem S, Besson J, Zaki M: ORIGAMI: Mining Representative Orthogonal Graph Patterns. Proc. 2007 Int. Conf. on Data Mining (ICDM'07). 2007.
 Chen C, Yan X, Zhu F, Han J: gApprox: Mining frequent approximate patterns from a massive network. Proc. 2007 Int. Conf. on Data Mining (ICDM'07). 2007.
 Holder LB, Cook DJ, Djoko S: Substructures discovery in the subdue system. Proc AAAI'94 Workshop Knowledge Discovery in Databases. 1994, 169-180.
 Zhang S, Yang J, Cheedella V: Monkey: Approximate Graph Mining Based on Spanning Trees. Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference. 2007, 1247-1249.
 Vanetik N, Gudes E: Mining Frequent Labeled and Partially Labeled Graph Patterns.
 Bandyopadhyay D, Snoeyink J: Almost-Delaunay Simplices: Nearest Neighbor Relations for Imprecise Points. ACM-SIAM Symposium On Distributed Algorithms. 2004, 403-412.
 Delaunay B: Sur la sphere vide. A la memoire de Georges Voronoi. Izv. Akad. Nauk SSSR, Otdelenie Matematicheskih i Estestvennyh Nauk. 1934, 7: 793-800.
 Huan J, Wang W, Bandyopadhyay D, Snoeyink J, Prins J, Tropsha A: Mining Family Specific Residue Packing Patterns from Protein Structure Graphs. Proceedings of the 8th Annual International Conference on Research in Computational Molecular Biology (RECOMB). 2004, 308-315.
 Huan J, Wang W, Prins J: Efficient Mining of Frequent Subgraph in the Presence of Isomorphism. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM). 2003, 549-552.
 Eddy SR: Where did the BLOSUM62 alignment score matrix come from?. Nature Biotechnology. 2004, 22: 1035-1036.
 Kelly G: Structure of the cell-adhesion fragment of intimin from enteropathogenic Escherichia coli. Nature Struct Biol. 1999, 6: 313-318.
 Hamburger Z: Crystal structure of invasin: a bacterial integrin-binding protein. Science. 1999, 286: 291-295.
 Wang G, Dunbrack RL Jr: PISCES: A Protein Sequence Culling Server. Bioinformatics. 2003, 19: 1589-1591.
 Huan J, Bandyopadhyay D, Snoeyink J, Prins J, Tropsha A, Wang W: Distance-based identification of spatial motifs in proteins using constrained frequent subgraph mining. Proceedings of the IEEE Computational Systems Bioinformatics (CSB). 2006.
 LIBSVM. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
 PDBsum Structure Database. [http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum]
 Kuramochi M, Karypis G: Frequent subgraph discovery. Proc International Conference on Data Mining'01. 2001, 313-320.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.