- Open Access
Identification of hot regions in protein-protein interactions by sequential pattern mining
© Hsu et al; licensee BioMed Central Ltd. 2007
- Published: 24 May 2007
Identification of protein interacting sites is an important task in computational molecular biology. As more and more protein sequences are deposited without available structural information, it is highly desirable to predict protein binding regions from sequence alone. This paper presents a pattern mining approach to tackle this problem. A functional region of a protein structure usually consists of several peptide segments linked by large wildcard regions. The proposed mining method therefore allows large irregular gaps when growing patterns, in order to find residues that are simultaneously conserved but widely separated on the sequences. A derived pattern is called a cluster-like pattern because the discovered conserved residues are always grouped into several blocks, each of which corresponds to a locally conserved region on the protein sequence.
The experiments conducted in this work demonstrate that the derived long patterns automatically discover the important residues that form one or several hot regions of protein-protein interactions. The methodology is evaluated on the web server MAGIIC-PRO using a well-known benchmark containing 220 protein chains from 72 distinct complexes. Among the 218 tested proteins, 900 sequential blocks are discovered, 4.25 blocks per protein chain on average. About 92% of the derived blocks are clustered in space with at least one of the other blocks, and about 66% of the blocks are near the interface of protein-protein interactions. In summary, for about 83% of the tested proteins, at least two interacting blocks can be discovered by this approach.
This work aims to demonstrate that the important residues associated with the interface of protein-protein interactions may be automatically discovered by sequential pattern mining. The detected regions are highly conserved and are thus considered computational hot regions. This information would be useful for characterizing protein sequences, predicting protein function, finding potential partners, and facilitating protein docking for drug discovery.
- Protein Data Bank
- Pattern Mining
- Protein Chain
- Sequential Block
- Conservation Score
Identification of functionally important regions directly from a protein sequence is a challenging problem in molecular biology [1–7]. Investigating possible protein-protein interactions and predicting the associated physical binding areas facilitate the study of all aspects of cellular function [8, 9]. The principles that govern the interaction of two proteins and the general properties of their interacting interfaces remain to be uncovered [10–12], which makes it difficult to predict interface regions directly from protein sequences. Even when the structure of a protein is available, it is still not a trivial task to localize the functional interfaces and to clarify the contribution of each involved residue [7, 13, 14].
Previous studies observed that not all interface residues contribute the same amount of free energy to a complex [15–17]. Alanine scanning mutagenesis, which estimates the energetic contribution of individual side chains, suggests that a small set of interface residues contributes most of the binding free energy [15, 16, 19]. These critical residues are called hot spots; they give rise to a significant increase in the absolute binding energy when mutated to alanine [15, 16, 20]. Interestingly, hot spots are not uniformly spread along the interfaces. Instead, they are clustered in densely packed regions and are surrounded by energetically less important residues, which might serve to occlude bulk solvent from the hot spots. The assemblies of hot spots and their neighboring moderately conserved residues are called hot regions. A single hot region or a few hot regions can be found in the interacting interface of two proteins [17, 21]. Within the dense clusters, both the hot spots and some moderately conserved residues contribute to the stability of the complex.
Several approaches have attempted to predict interacting sites based on structure information [22–31]. Some of them identify potential surface patches based on the shape of structures and then use features such as solvation potential, hydrophobicity, planarity, or accessible surface area to differentiate interacting sites from the other surface patches. Evolutionary information has also been shown to be a useful feature for this problem and is widely employed when structures are available [32–36]. While little correlation between interface and conservation is observed at the level of amino acid side chains [15, 32, 37–40], the conservation of hot spots is more significant [15, 17]. Several studies have shown that hot spots are usually more conserved than other surface residues and are clustered in space [17, 21, 38]. It has also been shown that structurally conserved residues at protein-protein interfaces correlate with the experimental alanine-scanning hot spots. In other words, the residues that affect the binding free energy dramatically tend to be strictly conserved during evolution. In this regard, Lichtarge et al. proposed an evolutionary trace method to facilitate the study of protein interfaces, followed by the development of an easy-to-use facility named ConSurf by Armon et al. in 2001. The procedure first extracts functionally important residues from homologous proteins and then maps the conserved residues onto the protein surface to identify the functional interfaces [7, 13].
The task becomes much more challenging when only sequence information is available. In such a situation, the information about residue composition remains, and evolutionary information is also available if there are sufficient homologues. In this regard, a classification scheme based on neural networks or support vector machines (SVM), with features extracted from a sliding window over amino acid composition and evolutionary information, is usually adopted [41–43]. Constructing a classifier requires a set of training data for which the protein structures are available. After that, the interacting residues of a query sequence can be predicted without structure information. Although the information about which conserved residues form clusters in space is absent and cannot be exploited here, another observation from [42, 44], that interface residues tend to form clusters in sequence, has been widely employed in recent studies to refine the prediction results [41, 42]. There are also approaches that attempt to tackle this problem without learning from existing structures. Gallet et al. showed in their work that the interacting residues can be identified by hydrophobic moments.
As evolutionary information has been shown to be useful in finding interacting sites, we present here an alternative approach to discovering conserved residues: sequential pattern mining [1, 46, 47]. Unlike the evolutionary information derived by multiple sequence alignment of homologous sequences, the pattern mining approach focuses on the concurrence of several conserved blocks present in a subset of protein homologues. Sequential pattern mining discovers a particular subsequence that frequently occurs among a set of sequences. This technique has been widely used to identify protein motifs in many previous studies [48–50], where the term motif refers to a subsequence that captures the characteristics of a specific biochemical function. Finding functional motifs directly from protein sequences is challenging, because many sequence motifs are discontinuous and the spacing between motif elements is usually large and irregular. By considering large flexible gaps in sequential pattern mining, the developed method can efficiently deliver long patterns spanning large wildcard regions [1, 47]. Though the conserved blocks in our patterns are widely separated in sequence, they are often close to each other in 3D structures and play critical roles in protein function. The proposed methodology performs well even when the sequence identities between input sequences are low or the functional sites are conserved in only a few members of the input sequences [1, 47]. This feature is important since it has been observed that residues conserved only in a specific subfamily may play more family-specific functional roles and are usually found at functional patches [5, 6, 14, 52]. We expect that a highly supported pattern may highlight the residues that were conserved together during evolution for a particular purpose, for example, interacting with other proteins.
The experiments conducted in this work reveal that the conservation information provided by sequential pattern mining is helpful for this problem even before any existing structures are included to facilitate the learning task.
This paper investigates the effectiveness of the approach by answering the following two questions: (1) are the sequential blocks located near the interfaces of protein-protein interactions? and (2) do the derived sequential blocks tend to cluster together in space? The first question is more directly related to the objective of this study, but by answering the second question we expect to make it clearer why the proposed methodology works. We do not address the recall issue in this paper because we are aware that it might not be possible to identify the complete set of interacting residues by a single pattern or in a single run of the mining process. In fact, identifying important residues associated with hot regions is not identical to the problem of predicting interacting residues. As mentioned in the previous paragraphs, not all interface residues are hot spots, and not all are expected to be conserved. On the other hand, some interior residues might also contribute to the stability of the complexes and are thus conserved. This work aims to show that the information provided by sequential pattern mining is useful for discovering hot regions of protein-protein interactions. This information can be refined and incorporated into other approaches to enhance the predictive power of state-of-the-art predictors.
In this section, we first describe the datasets used in this work and how the patterns are selected for different experiments. Using the five proteins in the first dataset, we investigate the potential of sequential pattern mining in identifying hot regions of protein-protein interactions by examining carefully the discovered patterns. To illustrate the advantages of our method, we compare our results with ConSurf's results. Next, we use the 220 protein chains of the second dataset to evaluate the general performance of the proposed method. The details of the datasets and the experimental procedures are described in the following subsections.
Summary of the first dataset
Query protein (Swiss-Prot AC number)
PDB complex (PDB entry : chain)
Carboxypeptidase A2 precursor
Ras GTPase-activating protein 1
Guanine nucleotide-binding protein G(i)
Ras-related C3 botulinum toxin substrate 2
Summary of the second dataset, the protein-protein docking benchmark 2.0
Number of complexes
Number of chains
Total in the dataset
For the first dataset, the ten largest patterns in the mining results of each query protein are examined. The size of a pattern is defined as the number of conserved residues it contains. In the first experiment, it is observed in every case that the hot regions can be revealed directly by the maximum-size pattern. Thus in the second experiment, we investigate how well the maximum-size pattern of each query protein performs in identifying protein interacting regions automatically.
Results on the first dataset
The performance of the proposed methodology is evaluated from two aspects. First, the effectiveness of identifying hot regions is evaluated. Second, the efficiency of the pattern mining algorithm is compared with ConSurf, where multiple sequence alignment is employed in identifying conserved residues. In addition, the conservation plots generated by ConSurf are included for comparison.
Comparing the efficiency of MAGIIC-PRO and ConSurf
Query protein (PDB Code:Chain ID)
Results on the second dataset
Summary of the experimental results for the second dataset
Number of tested protein chains
Number of patterns examined
Number of discovered blocks
Average number of blocks per protein chain
Average time used for each protein chain
Number of blocks that are near the interface
Number of blocks that form clusters
The maximum support of the patterns
The minimum support of the patterns
Average support of the patterns
Clustering propensity: the percentage of sequential blocks in a pattern P that interact with at least one of the other blocks in P. A pair of blocks is considered to interact if there exists an atom of one block that is within 5 Å of an atom of the other block.
Interface propensity: the percentage of sequential blocks in a pattern P that contact another protein chain in the complex. A block is considered to be in contact if any of its atoms is within 7 Å of any atom of another protein chain in the complex.
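The two propensities defined above can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function names, the representation of a block as a plain list of atom coordinates, and the use of an inclusive cutoff are all assumptions made here for clarity.

```python
import math
from itertools import combinations

Atom = tuple[float, float, float]  # (x, y, z) in angstroms; hypothetical representation


def min_distance(a: list[Atom], b: list[Atom]) -> float:
    """Smallest atom-to-atom distance between two groups of atoms."""
    return min(math.dist(p, q) for p in a for q in b)


def clustering_propensity(blocks: list[list[Atom]], cutoff: float = 5.0) -> float:
    """Fraction of blocks lying within `cutoff` of at least one other block."""
    interacting: set[int] = set()
    for i, j in combinations(range(len(blocks)), 2):
        if min_distance(blocks[i], blocks[j]) <= cutoff:
            interacting.update((i, j))
    return len(interacting) / len(blocks)


def interface_propensity(blocks: list[list[Atom]], partner: list[Atom],
                         cutoff: float = 7.0) -> float:
    """Fraction of blocks lying within `cutoff` of any atom of a partner chain."""
    hits = sum(1 for b in blocks if min_distance(b, partner) <= cutoff)
    return hits / len(blocks)


# Toy pattern with three single-atom blocks and a one-atom partner chain.
blocks = [[(0.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)], [(20.0, 0.0, 0.0)]]
partner = [(0.0, 6.5, 0.0)]
cp = clustering_propensity(blocks)
ip = interface_propensity(blocks, partner)
```

In the toy data, the first two blocks are 3 Å apart and so count as clustered, while only the first block is within 7 Å of the partner atom.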
Clustering and interface propensities of the patterns derived for different categories of the proteins in the second dataset
Average clustering propensity
Average interface propensity
Total average in the dataset
A similar conclusion can be drawn from Table 4. As summarized in Table 4, about 66% of the derived blocks are close to the contacting areas of protein-protein interactions. Furthermore, about 92% of the blocks cluster with at least one of the other blocks to form protein substructures in space. In some cases, clustered but non-interacting blocks are observed to be the binding sites of other molecules (ligands).
The statistics on the block numbers of the derived patterns for 218 protein chains of the second dataset.
Number of blocks: x
Patterns with x blocks
Patterns with at least x blocks
The statistics on the number of interacting blocks of the derived patterns for 218 protein chains of the second dataset.
Number of blocks: x
Patterns with x interface blocks
Patterns with at least x interface blocks
The statistics on the number of interacting blocks of the derived patterns for 138 non-redundant protein chains of the second dataset
Number of blocks: x
Patterns with x interface blocks
Patterns with at least x interface blocks
Conservation information is important in predicting hot regions involved in protein-protein binding. However, conservation information at the residue level is not sufficient for predicting hot regions, because not all the reported residues are conserved for the same purpose (the one studied in this paper is to preserve the environment for interacting with another protein). The conservation information derived by the pattern mining approach is more precise than that generated by multiple sequence alignment followed by construction of an evolutionary tree. That is, it focuses on the concurrence of conserved blocks among a subset of protein homologues. The experiments conducted in this paper reveal that the derived conserved blocks tend to cluster together in space and that most of the aggregated blocks are related to interacting interfaces. The detected regions are highly conserved and are thus considered computational hot regions. By using sequential pattern mining, it may be possible to predict hot spots of an interface without exhaustive mutagenesis and thermodynamic analysis, so that the link between protein functions and their primary sequences can be established much more rapidly.
In this section, we provide the details about the procedures of discovering and selecting patterns for predicting hot regions.
The residues associated with an interface are not necessarily found in one region of the sequence. Instead, it is usually observed that several remote segments of a protein sequence constitute a binding site [57–59]. Since it is time consuming to find long patterns with large irregular gaps, we recently presented a novel algorithm named MAGIIC to tackle this problem by using a combination of intra- and inter-block gap constraints. In MAGIIC, the flexibility of intra-block gaps is limited, but the flexibility of inter-block gaps is largely relaxed. Using two types of gap constraints for different purposes improves the efficiency of the mining process while keeping the accuracy of the mining results high.
The constraint model of MAGIIC has been refined in our recent work WildSpan to enhance the capability of the mining algorithm in discovering functional motifs for a specific query protein. WildSpan restricts the length of intra-block gaps to be fixed, because it has been observed in previous studies that insertions and deletions are seldom present within highly conserved regions [17, 59]. WildSpan further merges the upper and lower bounds of an inter-block gap into a single gap constraint called maximum relative flexibility. This constraint sets the upper and lower bounds of an inter-block gap with respect to the length of the gap observed on the query protein. The refinement of the constraint model reduces the complexity of the mining program and largely improves the accuracy of the derived patterns when functional motifs are desired. The idea of WildSpan was previously realized on the web server MAGIIC-PRO to facilitate the whole process of discovering functional signatures from protein sequences. MAGIIC-PRO provides an easy-to-use environment in which users can collect training data for a query protein by invoking PSI-BLAST or Swiss-Prot annotations. In addition, after the mining process completes, the derived patterns can be examined through several well-developed facilities.
A distinguishing characteristic of pattern mining, compared with multiple sequence alignment, in providing conservation information is that the residues in a pattern are simultaneously conserved among a certain number of protein sequences in the training data. This property is valuable from two points of view. First, a pattern collects a set of residues that are not necessarily the most highly conserved residues but are certain to have been conserved simultaneously during evolution. Second, the pattern mining algorithm automatically identifies the subset of sequences from the training data that matches a particular pattern. The resultant support rates are usually quite low, but this can still be meaningful since a subfamily could play family-specific functional roles [5, 6, 14, 52]. We will explain more about why the concurrence of conserved residues in a set of protein sequences is important after introducing the concept of cluster-like patterns and the detailed mining procedures.
Definition of cluster-like patterns and associated constraints
The maximum length of an intra-block gap: the length of intra-gap is rigid and cannot exceed the specified value.
The minimum number of residues in a block: a sequential block must contain at least a certain number of residues to eliminate noise.
The flexibility of an inter-block gap: a sequence can match a pattern as long as the inter-block gap does not violate the flexibility with respect to the query protein.
The minimum number of blocks in a pattern: a binding site usually consists of more than one protein segment. This constraint is set to 2 by default.
The minimum support of a pattern: the minimum percentage of sequences in the training data that match the derived pattern.
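The five constraints listed above can be gathered into a small configuration object, as in the sketch below. The class and function names are hypothetical. The defaults for intra-block gap, block size, and number of blocks follow values stated elsewhere in this paper, while the default flexibility value and the symmetric reading of "relative flexibility" (the gap may shrink or grow by a fraction of the gap observed on the query protein) are assumptions; the exact bound used by WildSpan is not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraints:
    max_intra_gap: int = 3      # maximum length of an intra-block gap
    min_block_size: int = 3     # minimum number of residues in a block
    flexibility: float = 0.25   # hypothetical value for the inter-block flexibility
    min_blocks: int = 2         # minimum number of blocks in a pattern
    min_support: float = 1.0    # fraction of training sequences that must match


def gap_bounds(query_gap: int, flexibility: float) -> tuple[int, int]:
    """Allowed (lower, upper) inter-block gap lengths; symmetric form is assumed."""
    delta = int(query_gap * flexibility)
    return max(0, query_gap - delta), query_gap + delta


def satisfies_shape(blocks: list[list[int]], c: Constraints) -> bool:
    """Check block count, block size, and intra-block gaps.

    Each block is a sorted list of query-protein positions of conserved residues.
    """
    if len(blocks) < c.min_blocks:
        return False
    for block in blocks:
        if len(block) < c.min_block_size:
            return False
        # The gap between neighbors p < q is the number of wildcard positions between them.
        if any(q - p - 1 > c.max_intra_gap for p, q in zip(block, block[1:])):
            return False
    return True


c = Constraints()
ok = satisfies_shape([[10, 11, 13], [60, 62, 63]], c)
too_wide = satisfies_shape([[10, 11, 16], [60, 61, 62]], c)
bounds = gap_bounds(40, c.flexibility)
```

The support constraint is carried in the object but not checked here, since checking it requires matching the pattern against the training sequences.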
Setting the minimum support is not an easy task. A loose bound may lead to an explosion of patterns and a huge amount of computation, while a tight bound might yield no patterns at all. In MAGIIC-PRO, this issue is handled automatically by relaxing the minimum support constraint step by step until the expected number of desired patterns is discovered. As a result, the patterns that match the most input sequences are always reported first.
Description of WildSpan algorithm
Constraint-based sequential pattern mining extracts frequent patterns that satisfy user-specified constraints from unaligned sequences, where pattern components maintain their order in the sequential data. The WildSpan algorithm aims at discovering the cluster-like patterns defined above by using a two-phase mining strategy. In the first phase, WildSpan generates the complete set of closed pattern blocks satisfying the block constraint and the intra-block gap constraint. A pattern or block is closed if none of its super-patterns has exactly the same support (i.e. occurrence frequency). In the second phase, WildSpan discovers the complete set of closed long patterns satisfying the inter-block gap constraint by connecting the frequent blocks found in the first phase across large irregular gaps. Both phases execute a procedure named bounded-prefix-growth, which was developed based on the prefix-growth function of a well-known sequential pattern mining algorithm, PrefixSpan. The bounded-prefix-growth procedure takes our new constraint framework into account in order to meet both effectiveness and efficiency requirements. It uses two pruning strategies during the mining process. First, it exploits properties of the constraints to aggressively filter out many unpromising patterns/candidates at an early mining stage. Second, it recursively projects a sequence database into a smaller search space and grows patterns only within each projected database. Both features contribute to favorable mining efficiency. At the end of the second phase, WildSpan outputs the complete set of patterns that satisfy all the constraints specified by the users. The readers can refer to  for the details of the algorithm and  for the web server MAGIIC-PRO.
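The notion of a closed pattern used above can be illustrated with a short filter. This is only a sketch of the definition, not of WildSpan itself: it assumes, for simplicity, that every pattern is anchored on the query protein and can therefore be represented as a set of query positions, so that a super-pattern is simply a proper superset. The function name and input format are hypothetical.

```python
def closed_patterns(support: dict[frozenset[int], int]) -> list[frozenset[int]]:
    """Keep patterns that have no proper super-pattern with exactly the same support."""
    closed = []
    for pattern, count in support.items():
        absorbed = any(
            pattern < other and count == other_count  # `<` is the proper-subset test
            for other, other_count in support.items()
        )
        if not absorbed:
            closed.append(pattern)
    return closed


# Toy supports: positions {1, 2} always occur together with position 3,
# so {1, 2} carries no information beyond {1, 2, 3} and is not closed.
toy = {
    frozenset({1}): 7,
    frozenset({1, 2}): 5,
    frozenset({1, 2, 3}): 5,
}
result = sorted(sorted(p) for p in closed_patterns(toy))
```

Reporting only closed patterns keeps the output compact without losing any support information, because the support of every discarded pattern equals that of one of its retained super-patterns.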
Obtaining homologues of a query protein (150 at most): This is achieved by running PSI-BLAST against the Swiss-Prot database posted on Aug 4, 2005 with the BLOSUM62 substitution matrix and an E-value cut-off of 0.01. If the query protein does not have sufficient homologues in the Swiss-Prot database (< 5 homologues), the search is executed once more against the non-redundant (NR) database posted on Aug 4, 2005. Sequences that are nearly identical to the query protein (sequence identity from BLAST > 90%) or that have a low identity to it (sequence identity from BLAST < 30%) are further excluded from the training data.
Executing pattern mining: The minimum support is initially set to 100% and decreased repeatedly until at least one pattern with five blocks is discovered. A sequential block must contain at least three conserved residues, and the maximum length of an intra-block gap is 3. The mining process is terminated once a single run exceeds four minutes, which often happens when the minimum support constraint is set so low that the number of patterns explodes. If no pattern with five blocks can be reported with these settings, MAGIIC-PRO is invoked iteratively with the constraint on the minimum number of blocks relaxed by one at a time.
Merging information from all the patterns with two or more blocks into one conservation plot: The derived patterns are collected together to create a conservation plot. The conservation plot provides a whole picture of the conserved residues of a query protein. In this plot, the conservation scores are represented in different colors. The color level of a residue x is defined as: L(x) = ceil(9 × R(x)), where the conservation score R(x) is calculated by the following equation:
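The identity filter in this step can be sketched as below. The function names and the input format (a list of accession and percent-identity pairs in search rank order) are hypothetical, and whether the cap of 150 sequences is applied before or after the identity filter is an assumption; here it is applied afterwards.

```python
def select_homologues(hits: list[tuple[str, float]],
                      min_identity: float = 30.0,
                      max_identity: float = 90.0,
                      limit: int = 150) -> list[tuple[str, float]]:
    """Drop near-identical (> 90%) and distant (< 30%) hits, then cap the list."""
    kept = [(acc, ident) for acc, ident in hits
            if min_identity <= ident <= max_identity]
    return kept[:limit]


def needs_nr_search(n_homologues: int, minimum: int = 5) -> bool:
    """True when too few homologues were found and the NR database should be searched."""
    return n_homologues < minimum


# Made-up accessions and identities, for illustration only.
hits = [("A", 95.0), ("B", 72.5), ("C", 30.0), ("D", 12.0), ("E", 90.0)]
training = select_homologues(hits)
capped = select_homologues(hits, limit=2)
```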
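The relaxation schedule described in this step can be sketched as a pair of nested loops. The `mine` callable stands in for one invocation of the mining server and is purely hypothetical: it is assumed to return a list of patterns, an empty list when nothing is found, or `None` when the run exceeds its time budget. The decrement of five percentage points is also an assumption, since the text only states that the support is decreased repeatedly.

```python
from typing import Callable, Optional

# One mining run: (minimum support in percent, minimum number of blocks) -> patterns.
Miner = Callable[[int, int], Optional[list]]


def relax_until_found(mine: Miner, start_blocks: int = 5,
                      min_blocks: int = 2, step_pct: int = 5):
    """Lower the support first; if that fails, relax the block requirement by one."""
    for blocks in range(start_blocks, min_blocks - 1, -1):
        for support_pct in range(100, 0, -step_pct):
            found = mine(support_pct, blocks)
            if found is None:   # run timed out: lowering support further will not help
                break
            if found:
                return support_pct, blocks, found
    return None


def fake_miner(support_pct: int, blocks: int):
    """Toy stand-in: five-block runs time out below 90% support;
    runs requiring fewer blocks succeed at 80% support or lower."""
    if blocks == 5:
        return None if support_pct < 90 else []
    return [[[1, 2, 3]] * blocks] if support_pct <= 80 else []


outcome = relax_until_found(fake_miner)
```

With the toy miner, the five-block search times out at 85% support, after which the four-block search succeeds at 80%.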
Here, the conservation level of each residue is determined by the percentage of total number of supporting proteins merged from different patterns.
The conservation plot is reported with the derived patterns to provide more detailed information when a pattern is examined.
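The merging step can be sketched as follows. Because the equation for R(x) is not reproduced in this text, the form used below is an assumption based on the verbal description: R(x) is taken to be the number of distinct training sequences supporting at least one pattern that contains residue x, divided by the total number of training sequences. The function names and the input format are hypothetical.

```python
import math


def conservation_scores(patterns: list[tuple[set[int], set[str]]],
                        n_training: int) -> dict[int, float]:
    """Assumed R(x): fraction of training sequences supporting any pattern containing x.

    Each pattern is (query positions of its conserved residues, ids of supporting sequences).
    """
    supporters: dict[int, set[str]] = {}
    for positions, seq_ids in patterns:
        for x in positions:
            supporters.setdefault(x, set()).update(seq_ids)  # merge across patterns
    return {x: len(ids) / n_training for x, ids in supporters.items()}


def color_level(score: float) -> int:
    """L(x) = ceil(9 * R(x)), as defined in the text."""
    return math.ceil(9 * score)


# Two toy patterns sharing residue 11, mined from four training sequences.
patterns = [({10, 11}, {"a", "b"}), ({11, 40}, {"b", "c", "d"})]
scores = conservation_scores(patterns, n_training=4)
levels = {x: color_level(r) for x, r in scores.items()}
```

Residue 11 appears in both patterns, so its supporters are merged and it reaches the highest color level in this toy example.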
The authors would like to thank Yuan Ze University and the National Science Council of the Republic of China, Taiwan (contract no. NSC 95-2221-E-002-274-MY2), for the financial support.
This article has been published as part of BMC Bioinformatics Volume 8, Supplement 5, 2007: Articles selected from posters presented at the Tenth Annual International Conference on Research in Computational Biology. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S5.
- Hsu CM, Chen CY, Liu BJ: MAGIIC-PRO: detecting functional signatures by efficient discovery of long patterns in protein sequences. Nucleic Acids Res 2006, (34 Web Server):W356-W361. 10.1093/nar/gkl309
- Zvelebil MJ, Barton GJ, Taylor WR, Sternberg MJ: Prediction of protein secondary structure and active sites using the alignment of homologous sequences. J Mol Biol 1987, 195: 957–961. 10.1016/0022-2836(87)90501-8
- Godzik A, Sander C: Conservation of residue interactions in a family of Ca-binding proteins. Protein Eng 1989, 2: 589–596. 10.1093/protein/2.8.589
- Valdar WS: Scoring residue conservation. Proteins 2002, 48: 227–241. 10.1002/prot.10146
- Livingstone CD, Barton GJ: Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation. Comput Appl Biosci 1993, 9: 745–756.
- Casari G, Sander C, Valencia A: A method to predict functional residues in proteins. Nat Struct Biol 1995, 2: 171–178. 10.1038/nsb0295-171
- Armon A, Graur D, Ben-Tal N: ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J Mol Biol 2001, 307: 447–463. 10.1006/jmbi.2000.4474
- Sali A, et al.: From words to literature in structural proteomics. Nature 2003, 422: 216–225. 10.1038/nature01513
- Rhodes DR, et al.: Probabilistic model of the human protein-protein interaction network. Nat Biotechnol 2005, 23: 951–959. 10.1038/nbt1103
- Janin J: Elusive affinities. Proteins 1995, 21: 30–39. 10.1002/prot.340210105
- Xu D, et al.: Hydrogen bonds and salt bridges across protein-protein interfaces. Protein Eng 1997, 10: 999–1012. 10.1093/protein/10.9.999
- Lo Conte L, Chothia C, Janin J: The atomic structure of protein-protein recognition sites. J Mol Biol 1999, 285: 2177–2198. 10.1006/jmbi.1998.2439
- Lichtarge O, Sowa ME: Evolutionary predictions of binding surfaces and interactions. Curr Opin Struct Biol 2002, 12: 21–27. 10.1016/S0959-440X(02)00284-1
- Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257: 342–358. 10.1006/jmbi.1996.0167
- Bogan AA, Thorn KS: Anatomy of hot spots in protein interfaces. J Mol Biol 1998, 280(1):1–9. 10.1006/jmbi.1998.1843
- Thorn KS, Bogan AA: ASEdb: a database of alanine mutations and their effects on the free energy of binding in protein interactions. Bioinformatics 2001, 17: 284–285. 10.1093/bioinformatics/17.3.284
- Keskin O, Ma B, Nussinov R: Hot regions in protein-protein interactions: the organization and contribution of structurally conserved hot spot residues. J Mol Biol 2005, 345: 1281–1294. 10.1016/j.jmb.2004.10.077
- Cunningham BC, Wells JA: Rational design of receptor-specific variants of human growth hormone. Proceedings of the National Academy of Sciences of the United States of America 1991, 88(8):3407–3411. 10.1073/pnas.88.8.3407
- Clackson T, Wells JA: A hot spot of binding energy in a hormone-receptor interface. Science 1995, 267: 383–386. 10.1126/science.7529940
- Li X, Keskin O, Ma B, Nussinov R, Liang J: Protein-protein interactions: hot spots and structurally conserved residues often locate in complemented pockets that pre-organized in the unbound states: implications for docking. J Mol Biol 2004, 344: 781–795. 10.1016/j.jmb.2004.09.051
- Ma B, Elkayam T, Wolfson H, Nussinov R: Protein-protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proceedings of the National Academy of Sciences of the United States of America 2003, 100(10):5772–5777. 10.1073/pnas.1030237100
- Bahadur RP, et al.: A dissection of specific and non-specific protein-protein interfaces. J Mol Biol 2004, 336: 943–955. 10.1016/j.jmb.2003.12.073
- Chakrabarti P, Janin J: Dissecting protein-protein recognition sites. Proteins 2002, 47: 334–343. 10.1002/prot.10085
- Chothia C, Janin J: Principles of protein-protein recognition. Nature 1975, 256: 705–708. 10.1038/256705a0
- Jones S, Thornton JM: Principles of protein-protein interactions. Proceedings of the National Academy of Sciences of the United States of America 1996, 93(1):13–20. 10.1073/pnas.93.1.13
- Lo Conte L, et al.: The atomic structure of protein-protein recognition sites. J Mol Biol 1999, 285(5):2177–2198. 10.1006/jmbi.1998.2439
- Nooren IMA, Thornton JM: Structural characterization and functional significance of transient protein-protein interactions. J Mol Biol 2003, 325: 991–1018. 10.1016/S0022-2836(02)01281-0
- Ofran Y, Rost B: Analysing six types of protein-protein interfaces. J Mol Biol 2003, 325: 377–387. 10.1016/S0022-2836(02)01223-8
- Jones S, Thornton JM: Analysis of protein-protein interaction sites using surface patches. J Mol Biol 1997, 272: 121–132. 10.1006/jmbi.1997.1234
- Jones S, Thornton JM: Prediction of protein-protein interaction site using surface patches. J Mol Biol 1997, 272: 133–143. 10.1006/jmbi.1997.1233
- Neuvirth H, Raz R, Schreiber G: ProMate: a structure based prediction program to identify the location of protein-protein binding sites. J Mol Biol 2004, 338: 181–199. 10.1016/j.jmb.2004.02.040
- Burgoyne NJ, Jackson RM: Predicting protein interaction sites: binding hot-spots in protein-protein and protein-ligand interfaces. Bioinformatics 2006, 22: 1335–1342. 10.1093/bioinformatics/btl079
- Liang S, Zhang C, Song L, Zhou Y: Protein binding site prediction using an empirical scoring function. Nucleic Acids Res 2006, 34: 3698–3707. 10.1093/nar/gkl454
- Fariselli P, Pazos F, Valencia A, Casadio R: Prediction of protein-protein interaction sites in heterocomplexes with neural networks. Eur J Biochem 2002, 269: 1356–1361. 10.1046/j.1432-1033.2002.02767.x
- Bradford JR, Westhead DR: Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics 2005, 21(8):1487–1494. 10.1093/bioinformatics/bti242
- Panchenko AR, Kondrashov F, Bryant S: Prediction of functional sites by analysis of sequence and structure conservation. Protein Science 2004, 13: 884–892. 10.1110/ps.03465504
- Caffrey DR, et al.: Are protein-protein interfaces more conserved in sequence than the rest of the protein surface. Protein Science 2004, 13: 190–202. 10.1110/ps.03323604
- Hu Z, Ma B, Wolfson H, Nussinov R: Conservation of polar residues as hot spots at protein interfaces. Proteins 2000, 39: 331–342. 10.1002/(SICI)1097-0134(20000601)39:4<331::AID-PROT60>3.0.CO;2-A
- Ouzounis C, Perez-Irratxeta C, Sander C, Valencia A: Are binding residues conserved? Pac Symp Biocomput 1998, 401–412.
- Aloy P, Querol E, Aviles FX, Sternberg MJ: Automated structure-based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J Mol Biol 2001, 311: 395–408. 10.1006/jmbi.2001.4870
- Res I, Mihalek I, Lichtarge O: An evolution based classifier for prediction of protein interfaces without using protein structures. Bioinformatics 2005, 21: 2496–2501. 10.1093/bioinformatics/bti340
- Ofran Y, Rost B: Predicted protein-protein interaction sites from local sequence information. FEBS Lett 2003, 544: 236–239. 10.1016/S0014-5793(03)00456-3
- Yan C, et al.: A two-stage classifier for identification of protein-protein interface residues. Bioinformatics 2004, 20(Suppl 1):i371-i378. 10.1093/bioinformatics/bth920
- Madabushi S, Yao H, Marsh M, Kristensen DM, Philippi A, Sowa ME, Lichtarge O: Structural clusters of evolutionary trace residues are statistically significant and common in proteins. J Mol Biol 2002, 316(1):139–154. 10.1006/jmbi.2001.5327
- Gallet X, Charloteaux B, Thomas A, Brasseur R: A fast method to predict protein interaction sites from sequences. J Mol Biol 2000, 302(4):917–926. 10.1006/jmbi.2000.4092
- Pei J, Han J, Mortazavi-Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu MC: Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Transactions on Knowledge and Data Engineering 2004, 16: 1424–1440. 10.1109/TKDE.2004.77View ArticleGoogle Scholar
- Hsu CM, Chen CY, Hsu CC, Liu BJ: Efficient discovery of structural motifs from protein sequences with combination of flexible intra- and inter-block gap constraints. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining: 9–12 April 2006; Sigapore. Volume LNCS 3918. Edited by: Carbonell JG, Siekmann J. Springer Berlin/Heidelberg; 2006:530–539.Google Scholar
- Rigoutsos I, Floratos A: Combinatorial pattern discovery in biological sequences: the TEIRESIAS algorithm. Bioinformatics 1998, 14: 55–67. 10.1093/bioinformatics/14.1.55View ArticlePubMedGoogle Scholar
- Jonassen I: Efficient discovery of conserved patterns using a pattern graph. Comput Appl Biosci 1997, 13: 509–522.PubMedGoogle Scholar
- Califano A: SPLASH: structural pattern localization analysis by sequential histograms. Bioinformatics 2000, 16(4):341–347. 10.1093/bioinformatics/16.4.341View ArticlePubMedGoogle Scholar
- Gregory AP, Dagmar R: Protein motifs. In Protein structure and function. 4th edition. Edited by: Gregory AP, Dagmar R. Waltham, MA: New Science Press; 2003.Google Scholar
- Landgraf R, Xenarios I, Eisenberg D: Three-dimensional cluster analysis identifies interfaces and functional residue clusters in protein. J Mol Biol 2001, 307: 1487–1502. 10.1006/jmbi.2001.4540View ArticlePubMedGoogle Scholar
- Berman HM, et al.: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242. 10.1093/nar/28.1.235PubMed CentralView ArticlePubMedGoogle Scholar
- Mintseris J, Wiehe K, Pierce B, Anderson R, Chen R, Janin J, Weng Z: Protein-Protein Docking Benchmark 2.0: an update. Proteins 2005, 60(2):214–216. 10.1002/prot.20560View ArticlePubMedGoogle Scholar
- Li W, Godzik A: CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22: 1658–1659. 10.1093/bioinformatics/btl158View ArticlePubMedGoogle Scholar
- Online supplement of this paper[http://biominer.bime.ntu.edu.tw/hotregions]
- Schueler-Furman O, Baker D: Conserved residue clustering and protein structure prediction. Proteins 2003, 52: 225–235. 10.1002/prot.10365View ArticlePubMedGoogle Scholar
- Ogiwara A, Uchiyama I, Yasuhiko S, Kanehisa M: Construction of dictionary of sequence motifs that characterize groups of related proteins. Protein Eng 1992, 5: 479–488. 10.1093/protein/5.6.479View ArticlePubMedGoogle Scholar
- Chakrabarti S, Anand AP, Bhardwaj N, Pugalenthi G, Sowdhamini R: SCANMOT: searching for similar sequences using s simultaneous scan of multiple sequence motifs. Nucleic Acids Res 2005, (33 Web Server):W274-W276. 10.1093/nar/gki493Google Scholar
- Hsu CM, Chen CY, Liu BJ: WildSpan: efficient discovery of functional motifs spanning large wildcard regions from protein sequences. Technical Report [http://biominer.bime.ntu.edu.tw/wildspan/]
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: The universal protein resource (UniProt). Nucl Acids Res 2005, (33 Database):D154-D159.Google Scholar
- Pei J, Han J, Wang W: Mining sequential patterns with constraints in large database. In Proceedings of the 11th ACM International Conference on Information and Knowledge Management: 4–9 November 2002; McLean. ACM Press; 18–25.Google Scholar
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences of the United States of America 1992, 89(22):10915–10919. 10.1073/pnas.89.22.10915PubMed CentralView ArticlePubMedGoogle Scholar
- BLAST Database[ftp://ftp.ncbi.nlm.nih.gov/blast/db/]
- Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben-Tal N: ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Nucleic Acids Res 2005, (33 Web Server):W299-W302. 10.1093/nar/gki370Google Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.