A Network of SCOP Hidden Markov Models and Its Analysis
© Zhang et al; licensee BioMed Central Ltd. 2011
Received: 6 December 2010
Accepted: 23 May 2011
Published: 23 May 2011
The Structural Classification of Proteins (SCOP) database uses a large number of hidden Markov models (HMMs) to represent families and superfamilies composed of proteins that presumably share the same evolutionary origin. However, how the HMMs are related to one another has not been examined before.
In this work, taking into account the processes used to build the HMMs, we propose a working hypothesis to examine the relationships between HMMs and the families and superfamilies that they represent. Specifically, we perform an all-against-all HMM comparison using the HHsearch program (similar to BLAST) and construct a network where the nodes are HMMs and the edges connect similar HMMs. We hypothesize that the HMMs in a connected component belong to the same family or superfamily more often than expected under a random network connection model. Results show a pattern consistent with this working hypothesis. Moreover, the HMM network possesses features distinctly different from the previously documented biological networks, exemplified by the exceptionally high clustering coefficient and the large number of connected components.
The current finding may provide guidance in devising computational methods to reduce the degree of overlap between the HMMs representing the same superfamilies, which may in turn enable more efficient large-scale sequence searches against the database of HMMs.
The Structural Classification of Proteins (SCOP) database is a comprehensive protein database that organizes and classifies proteins based on their evolutionary and structural relationships [1–3]. It is organized into four hierarchical levels: family, superfamily, fold, and class. At the lowest level (family), individual proteins are clustered into families based on criteria that may indicate a common evolutionary origin, such as a pairwise sequence similarity of more than 30%, or a lower sequence similarity combined with similar functions and structures. A good example of the latter is the globin proteins, whose pairwise sequence similarities are much lower than 30% but which have similar functions. Next, families are grouped into superfamilies if their structural and/or functional features indicate a possible common evolutionary origin. Superfamilies are then clustered into folds if they share major secondary structures with the same topological arrangements. Finally, different folds are grouped into classes based on their secondary structural compositions. Unlike the other levels, a class does not necessarily imply a common evolutionary origin and exists more for convenience than for actual biological implications.
Apart from the hierarchical classification and organization of proteins, the SCOP database employs hidden Markov models (HMMs) to represent superfamilies [4, 5]. The basic procedure of building an HMM for a particular superfamily starts with a seed protein and performs sequence search in a database to obtain other proteins that have sequence similarities above a set threshold. The newly obtained sequences are used to iterate the search for some number of times to obtain additional proteins. Finally, all sequences are aligned and an HMM is constructed for the multiple sequence alignment [4, 5]. It has been shown that different seed proteins might produce HMMs that cover different members of the superfamily [4, 5]. Thus, in order to represent the full set of proteins in a superfamily, multiple HMMs are built for the superfamily using multiple seed proteins. For example, the beta-beta-alpha zinc fingers superfamily has altogether 91 HMMs representing it, and the P-loop containing nucleoside triphosphate hydrolases superfamily has 406 HMMs representing it.
Because each superfamily might be represented by multiple HMMs, there may be a high degree of overlap and redundancy among the models. However, there have not been any studies examining this issue systematically. To understand how the HMMs in the SCOP database are related to one another and the degree of overlap or redundancy among HMMs from either the same or different superfamilies, we perform a detailed analysis of the HMMs in SCOP for their similarity and relationships using a network approach. Specifically, we perform an all-against-all HHsearch for the library of HMMs in the SCOP database.
HHsearch is similar to BLAST, except that instead of matching a sequence against a database of sequences, it uses a query HMM or sequence to match against a database of HMMs and identifies the HMMs significantly homologous to the query HMM or sequence. We then construct a network of HMMs, where the link between two HMMs is based on their similarity, and examine some commonly evaluated network properties. We compare the current network with previously documented networks and outline some questions for future research.
Results and Discussion
General statistics of the HMMs and their network
Table: The general statistics of the HMM library (rows: number of HMMs, number of folds, number of superfamilies, number of families).
Table 2: The 20 largest connected components and their densities (columns include the number of vertices).
Thus, individual CCs tend to have very high connectivity, whereas the entire network is not well connected. The density of the 20 largest CCs is shown in Table 2. The largest CC, with 590 vertices, has the lowest density, and the 18th largest CC, with 70 vertices, has a density of 1 and is therefore a fully connected component. There is a significant negative correlation between CC size and density (Kendall's rank correlation τ = -0.43, p-value < 2.2 × 10⁻¹⁶ for CC size > 2).
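The density figures can be illustrated with a short sketch. It assumes the usual definition of density for a simple undirected graph (observed edges divided by possible edges); the source does not state the formula, so that definition and the function name are assumptions.

```python
def density(num_vertices: int, num_edges: int) -> float:
    """Density of a simple undirected graph: observed edges / possible edges.

    Assumed definition: 2E / (V * (V - 1)). Components with fewer than two
    vertices have no possible edges, so their density is reported as 0.0.
    """
    if num_vertices < 2:
        return 0.0
    return 2.0 * num_edges / (num_vertices * (num_vertices - 1))


# A fully connected component with 70 vertices has 70 * 69 / 2 = 2415 edges,
# so its density is exactly 1.
fully_connected = density(70, 70 * 69 // 2)

# A path on three vertices has 2 of the 3 possible edges.
sparse = density(3, 2)
```

Under this definition a density of 1 is reached only when every pair of vertices in the component is joined by an edge, which is why the 70-vertex CC is described as fully connected.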
Table: The 20 HMMs with largest betweenness.
Table: The top 2 HMMs with the highest centrality measurements for the 20 largest CCs (surviving entry: d1fi5a_ (a.39.1.5)).
The results show that, across the entire network, the vertices with the highest degrees do not necessarily have the highest betweenness, and vice versa. Degree measures how many immediate neighbors an HMM has; the more neighbors it has, the more central it is. The vertices with the 20 largest degrees are all from the third largest CC, and are connected to about 94% of its vertices. The vertices with the 20 largest betweenness values are from either the largest CC or the second largest CC. Since betweenness reflects how essential one vertex is to the connection of any other two vertices in the graph, a high betweenness may indicate that an HMM is a hybrid of two HMMs: the two HMMs share no significant similarity with each other, but both can be linked through the hybrid HMM. Biologically, this idea seems to reflect hybrid or mosaic proteins, where one protein contains domains from multiple proteins. To our knowledge, the idea of hybrid HMMs has not been discussed previously and deserves more research attention. Moreover, we hypothesize that the HMMs with high centrality measurements may be better able to pick up the sequences that belong to the superfamily than the more peripheral HMMs. This idea seems especially promising given the observation that the three centrality measurements identify similar sets of vertices within the connected components. Future studies can be directed to test this hypothesis.
The diameter of the largest CC (containing 590 vertices) is 9. The average distance between the vertices in the component is 2.94. This bears some similarity to the yeast protein interaction network, constructed using the protein interaction data from the January 2007 version of the BioGRID database, an online repository for interaction datasets aggregated from both high-throughput data and focused individual studies for the affinity of interacting protein pairs [8, 9]. This protein interaction network consists of 5,151 proteins and 31,201 interactions. Its largest CC (containing 5,128 vertices) also has a diameter of 9, but a larger average distance of 3.68. Thus, the vertices of this protein interaction network seem to be somewhat more spread out, which contributes to the larger average distance. It is notable that, despite the large difference in the sizes of the largest CCs of the two networks, their diameters are the same.
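The diameter and average distance of a connected component can be obtained from breadth-first searches started at every vertex. The sketch below assumes an unweighted, undirected component stored as a dictionary mapping each vertex to a set of neighbors; the function names and the toy path graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter_and_mean_distance(adj):
    """Return (diameter, average distance) of one connected component.

    The diameter is the longest shortest path; the average is taken over
    all ordered pairs of distinct vertices.
    """
    longest, total, pairs = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                longest = max(longest, d)
                total += d
                pairs += 1
    mean = total / pairs if pairs else 0.0
    return longest, mean


# Toy example: a path a - b - c - d has diameter 3 and mean distance 10/6.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
diam, mean_dist = diameter_and_mean_distance(path)
```

Running one search per vertex costs on the order of V × (V + E) steps, which is manageable for components of a few hundred or a few thousand vertices such as the two compared here.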
The effect of e-value cutoff on the network
CCs and SCOP hierarchy
Within the CCs, we examined whether the HMM members are from the same family, superfamily, fold, or class. There are altogether 1178 CCs whose members have the same SCOP domain classification (conserved at all hierarchical levels), 271 CCs whose HMMs belong to the same superfamily but to different families, 24 whose members belong to the same fold, but to different superfamilies, 18 whose members belong to the same class but have different folds, and the remaining 33 whose members are from different classes.
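The classification of a CC by the deepest SCOP level that all of its members share can be sketched as follows. The sketch assumes each HMM carries a four-field SCOP classification ID of the form class.fold.superfamily.family (for example a.39.1.5); the function name and the example IDs other than a.39.1.5 are hypothetical.

```python
# Levels ordered from most to least specific, with the number of leading
# ID fields that must agree for the level to be shared.
LEVELS = (("family", 4), ("superfamily", 3), ("fold", 2), ("class", 1))


def conserved_level(scop_ids):
    """Return the most specific SCOP level shared by all IDs, or None.

    None means the members come from different classes.
    """
    fields = [sid.split(".") for sid in scop_ids]
    for level, depth in LEVELS:
        prefixes = {tuple(f[:depth]) for f in fields}
        if len(prefixes) == 1:
            return level
    return None


# Toy connected components, each given as the list of its members' IDs.
examples = {
    "same family": ["a.39.1.5", "a.39.1.5"],
    "same superfamily": ["a.39.1.5", "a.39.1.2"],
    "same fold": ["a.39.1.5", "a.39.2.1"],
    "same class": ["a.39.1.5", "a.4.1.1"],
    "different classes": ["a.39.1.5", "b.1.1.1"],
}
results = {name: conserved_level(ids) for name, ids in examples.items()}
```

Applying such a function to every CC and tallying the results gives the five categories reported above, whose counts sum to 1178 + 271 + 24 + 18 + 33 = 1524 CCs.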
Table: Functional annotation of the top ten superfamilies that have either the largest number of HMM representations or CCs.
Superfamilies listed under the columns "# of HMMs, # of CCs": P-loop containing nucleoside triphosphate hydrolases; NAD(P)-binding Rossmann-fold domains; Winged helix DNA-binding domain; Fibronectin type III.
Superfamilies listed under the columns "# of CCs, # of HMMs": Winged helix DNA-binding domain; E set domains; Nucleic acid-binding proteins; Concanavalin A-like lectins/glucanases; Ribosomal protein S5 domain 2-like; Glucocorticoid receptor-like (DNA-binding domain); Positive stranded ssRNA viruses.
The working hypothesis
Taking into account the processes that built the HMMs and the hierarchical classification of the HMMs in the SCOP database, we hypothesize that the network should reflect these processes, i.e., the HMMs in a connected component belong to the same family or superfamily more often than expected under a random network connection model. The results show strong evidence that HMMs in a connected component tend to represent the same family or superfamily. Among the total of 1524 CCs, more than 77% have only members from the same family, and more than 95% have only members from the same superfamily. Thus, there is overwhelming evidence supporting our working hypothesis that HMMs belonging to the same family or superfamily tend to cluster together in the network.
However, to formally evaluate this and provide some statistical support, we also simulated 10,000 random networks, while preserving the degree distribution and the number and sizes of connected components. Each random network has the same number of connected components as our original network, and the working hypothesis predicts that the connected components of such a network have a lower degree of conservation in the family and superfamily assignment. Among the 10,000 simulated random networks, the highest proportions of CCs having only members from the same family and from the same superfamily are only 0.5% and 0.7%, respectively. This shows that in the observed network, the HMMs from the same family or superfamily do have a strong tendency to cluster, agreeing with our working hypothesis.
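The source does not spell out how the random networks were generated. One simple null model that trivially preserves the degree distribution and the number and sizes of the CCs keeps the network fixed and shuffles the family (or superfamily) labels among the vertices. The sketch below illustrates that substitute procedure; it is not the authors' simulation, and all function names and toy labels are hypothetical.

```python
import random


def same_group_fraction(components, label_of):
    """Fraction of components whose members all carry one label."""
    pure = sum(1 for comp in components
               if len({label_of[v] for v in comp}) == 1)
    return pure / len(components)


def shuffled_fractions(components, label_of, n_iter=1000, seed=0):
    """Null distribution obtained by permuting labels across all vertices.

    The component structure is untouched, so degrees and component sizes
    are preserved exactly; only the label assignment is randomized.
    """
    rng = random.Random(seed)
    vertices = [v for comp in components for v in comp]
    labels = [label_of[v] for v in vertices]
    fractions = []
    for _ in range(n_iter):
        rng.shuffle(labels)
        shuffled = dict(zip(vertices, labels))
        fractions.append(same_group_fraction(components, shuffled))
    return fractions


# Toy data: three components, each made of two HMMs from one family.
components = [["h1", "h2"], ["h3", "h4"], ["h5", "h6"]]
family = {"h1": "A", "h2": "A", "h3": "B", "h4": "B", "h5": "C", "h6": "C"}

observed = same_group_fraction(components, family)
null = shuffled_fractions(components, family, n_iter=200, seed=1)
# Share of shuffles that reach the observed fraction (empirical p-value).
p_value = sum(f >= observed for f in null) / len(null)
```

Comparing the observed fraction with the largest value in the null distribution mirrors the comparison reported above, where the observed proportions (more than 77% and 95%) far exceed the highest simulated ones.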
Comparison with other networks
It is evident that the HMM network is highly clustered. In fact, its clustering coefficient is 0.85, which, to our knowledge, seems to be the highest among the biological networks that have been studied so far. As shown by Newman, the undirected networks that tend to have high clustering coefficients are social networks. For example, the film directors network has a clustering coefficient of 0.20, and the coauthorship networks for math, physics, and biology have clustering coefficients of 0.15, 0.45, and 0.088, respectively, whereas biological networks such as the metabolic network and the protein interaction network have clustering coefficients of only 0.09 and 0.07, respectively. The comparison indicates that the current network has features distinct from those of the previously characterized real-world networks.
In this paper, we examined the properties of the network constructed for the HMMs in the SCOP protein structural classification database. A number of questions remain to be addressed in future research. For example, can we devise a computational method to measure or evaluate the degree of redundancy or overlap between the HMMs that are used to represent the same superfamily? This research is meaningful given the ever-increasing number of large-scale genomic sequences (and therefore protein sequences). Given that we can measure the redundancy of the HMMs of a superfamily, the logical next question is whether we can computationally reduce the redundancy of the HMM library, e.g., by constructing super-HMMs, each of which represents a collection of redundant HMMs, so that a protein sequence is scanned against a reduced set of HMMs (super-HMMs) rather than the entire set of HMMs that have overlaps and redundancies. Finally, because the HMM network shows properties distinct from those of many documented networks, as discussed above, can we propose a theoretical model to better account for the observations in the current network? Moreover, as our HMM network is also weighted, with edges quantifying the similarity between two HMMs, future models can also consider incorporating weighted edges into the network.
The SCOP library of HMMs (scop70_1.75.hhm.tar.gz) was downloaded from the website http://scop.mrc-lmb.cam.ac.uk/scop/count.html#scop-1.75, where the SCOP version was filtered to 70% maximum pairwise sequence identity. The library contains a total of 13,730 HMMs from seven classes, a, b, c, d, e, f, and g, where class a contains only α (i.e., α helix) proteins, class b contains only β (i.e., β sheet) proteins, class c contains α and β proteins (mainly parallel β sheets (beta - alpha - beta units)), class d contains α and β proteins (mainly antiparallel β sheets, i.e., segregated α and β regions), class e contains multi-domain proteins (i.e., folds consisting of two or more domains belonging to different classes), class f contains membrane and cell surface proteins, and class g contains small proteins. It is useful to mention that the SCOP domain classification ID specifies the entire hierarchy: in c.1.1.1, for example, the first field gives the class (c), the second the fold, the third the superfamily, and the last the family.
HHsearch was performed for all-against-all HMMs with the default parameters. The command used was "hhsearch -i hmm.hhm -d hmmlib.hhm", where hmm.hhm is the query HMM and hmmlib.hhm is the library of all the HMMs. The secondary structure scoring option was not used, as our goal was not to detect remote homology between HMMs and sequences. According to the HHsearch authors, no calibration is necessary, as HHsearch is performed on the SCOP database. HHsearch, similar to BLAST, uses a query that can be either a protein sequence or an HMM to search a database of sequences or HMMs and identify homology between the query and the sequences and HMMs in the database that is above a given threshold. In the current study, the e-value cutoff, a measurement of homology similar to BLAST's e-value, was set to 0.001. This e-value cutoff has also been used by Pfam to identify a Pfam clan, which is essentially equivalent to the superfamily level of the hierarchy. A total of 13,547 HMMs have matches that met the criterion, with 1,618 having no other matches except themselves. Thus, 11,929 HMMs were used for the subsequent network analysis.
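The step from search hits to a network can be sketched as below. The sketch assumes the HHsearch results have already been parsed into (query, target, e-value) tuples; the parsing itself is not shown. It also assumes that an edge is drawn when a hit in either direction has an e-value at or below the cutoff, a symmetrization rule the source does not specify. The HMM identifiers in the example are hypothetical.

```python
from collections import defaultdict


def build_network(hits, evalue_cutoff=1e-3):
    """Build an undirected adjacency dict from (query, target, evalue) hits.

    Self-hits are ignored, so an HMM that matches only itself never enters
    the network (mirroring the exclusion of self-only HMMs in the text).
    """
    adj = defaultdict(set)
    for query, target, evalue in hits:
        if query != target and evalue <= evalue_cutoff:
            adj[query].add(target)
            adj[target].add(query)
    return dict(adj)


def connected_components(adj):
    """Return the connected components as sorted lists of vertex names."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(comp))
    return components


# Hypothetical parsed hits: (query HMM, matched HMM, e-value).
hits = [
    ("hmmA", "hmmA", 0.0),    # self-hit, ignored
    ("hmmA", "hmmB", 1e-10),  # edge
    ("hmmB", "hmmC", 1e-5),   # edge
    ("hmmC", "hmmD", 0.5),    # above the cutoff, no edge
    ("hmmE", "hmmE", 0.0),    # self-only HMM, excluded from the network
]
network = build_network(hits)
ccs = connected_components(network)
```

The bookkeeping in the text follows the same logic: of the 13,547 HMMs with matches meeting the criterion, removing the 1,618 that match only themselves leaves 13,547 - 1,618 = 11,929 vertices.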
Betweenness centrality, introduced in Freeman, measures roughly the number of shortest paths going through a vertex a:

$$c_B(a) = \sum_{s \neq t \neq a} \frac{\sigma(s, t \mid a)}{\sigma(s, t)}$$

where σ(s, t) is the number of shortest paths between vertices s and t, and σ(s, t | a) is the number of shortest paths between vertices s and t that go through a. Thus, the higher the betweenness of a vertex, the more "central"/important the vertex is. In a fully connected network, the betweenness of all vertices is 0.
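Betweenness as described here, the sum over vertex pairs of σ(s, t | a) / σ(s, t), can be computed for an unweighted graph with Brandes' accumulation scheme. The sketch below is one possible implementation, not necessarily the software used in the study; it counts each unordered pair once, and the function name and toy graphs are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Shortest-path betweenness for an unweighted, undirected graph."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                       # vertices in order of discovery
        preds = {v: [] for v in adj}     # predecessors on shortest paths
        sigma = {v: 0 for v in adj}      # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if dist[w] < 0:          # first time w is reached
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Back-propagate pair dependencies from the farthest vertices.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair (s, t) was visited from both ends.
    return {v: b / 2.0 for v, b in bc.items()}


# In a fully connected graph every pair is adjacent, so betweenness is 0.
triangle = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
# In a path a - b - c, the only shortest path between a and c passes b.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
bc_triangle = betweenness(triangle)
bc_path = betweenness(path)
```

The triangle example reproduces the statement that all vertices of a fully connected network have betweenness 0, while the path example shows a single bridging vertex collecting the whole score.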
Closeness centrality of a vertex a is based on the inverse of the total shortest-path distance from a to the other vertices, $\sum_{i} d_{a,i}$, where $d_{a,i}$ is the length of the shortest path between vertex a and vertex i. Closeness is greater than 0 and at most 1; the higher it is for a vertex, the more "central" the vertex is. These centrality measurements have different motivations and capture different aspects of the importance of vertices in a network.
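Closeness can be illustrated with a breadth-first search from the vertex of interest. The exact formula is not preserved in the text, so the sketch below makes an explicit assumption: closeness is the number of other reachable vertices divided by the sum of the shortest-path lengths to them, which keeps the value in the interval (0, 1]. The unnormalized variant (one over the sum of distances) is available through a flag. The function name is hypothetical.

```python
from collections import deque


def closeness(adj, a, normalized=True):
    """Closeness of vertex a within its connected component.

    Assumed definition: (n - 1) / sum_i d(a, i) when normalized, and
    1 / sum_i d(a, i) otherwise, where n counts a and the vertices
    reachable from it. An isolated vertex gets 0.0.
    """
    dist = {a: 0}
    queue = deque([a])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    if total == 0:
        return 0.0
    reachable = len(dist) - 1
    return reachable / total if normalized else 1.0 / total


# Path a - b - c: the middle vertex is closest to everything else.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
middle = closeness(path, "b")   # 2 / (1 + 1)
end = closeness(path, "a")      # 2 / (1 + 2)
```

With the normalized form, a value of 1 is reached only by a vertex adjacent to every other vertex in its component.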
The network clustering coefficient, C, also known as transitivity, measured by the ratio between the number of triangles and the number of connected triplets, was computed for the entire network. The number of connected components that are trees, where there are N vertices but only N - 1 edges between the vertices, was computed for the entire network as well.
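Both quantities in this paragraph can be computed directly from an adjacency dictionary. The sketch below uses the common convention in which each triangle is counted once per center vertex (three times in total), so that the ratio to the number of connected triplets reaches 1 for a fully connected graph; that convention, like the function names, is an assumption rather than something stated in the text. The tree check assumes it is given a single connected component.

```python
from itertools import combinations


def transitivity(adj):
    """Closed triplets divided by all connected triplets.

    A connected triplet is a center vertex together with two of its
    neighbors; it is closed when those two neighbors are also adjacent.
    """
    closed, triplets = 0, 0
    for center, nbrs in adj.items():
        k = len(nbrs)
        triplets += k * (k - 1) // 2
        for u, w in combinations(nbrs, 2):
            if w in adj[u]:
                closed += 1
    return closed / triplets if triplets else 0.0


def is_tree(component_adj):
    """True if a connected component with N vertices has exactly N - 1 edges."""
    n = len(component_adj)
    m = sum(len(nbrs) for nbrs in component_adj.values()) // 2
    return m == n - 1


# Triangle a-b-c with a pendant vertex d attached to a:
# 3 closed triplets out of 5 connected triplets.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
c_value = transitivity(graph)
path_is_tree = is_tree({"a": {"b"}, "b": {"a", "c"}, "c": {"b"}})
```

A component that is a tree contains no triangles at all, so tree-shaped CCs contribute connected triplets but no closed ones to the network-wide coefficient.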
An ROC curve was plotted for the four levels (i.e., class, fold, superfamily, and family) with different e-value cutoffs ranging from 10⁻²⁰ to 10⁻³.
The authors thank T Murali for discussion. The work was partially supported by USDA 2009-35205-05221 and AFRL Grant FA8650-09-2-3938 and AFOSR Grant FA9550-09-1-0153.
- Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: Data growth and its impact on the SCOP database: New developments. Nucleic Acids Res 2008, (36 Database):D419–25.
- Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG: SCOP database in 2002: Refinements accommodate structural genomics. Nucleic Acids Res 2002, 30: 264–7. 10.1093/nar/30.1.264
- Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: A structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–40.
- Gough J, Chothia C: SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucleic Acids Res 2002, 30: 268–72. 10.1093/nar/30.1.268
- Gough J, Karplus K, Hughey R, Chothia C: Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol 2001, 313(4):903–19. 10.1006/jmbi.2001.5080
- Soding J: Protein homology detection by HMM-HMM comparison. Bioinformatics 2005, 21(7):951–60. 10.1093/bioinformatics/bti125
- Kolaczyk E: Statistical Analysis of Network Data: Methods and Models. New York, Springer; 2009.
- Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006, (34 Database):D535–9.
- Stark C, Breitkreutz BJ, Chatr-Aryamontri A, Boucher L, Oughtred R, Livstone MS, Nixon J, Van Auken K, Wang X, Shi X, Reguly T, Rust JM, Winter A, Dolinski K, Tyers M: The BioGRID Interaction Database: 2011 update. Nucleic Acids Res 2011, (39 Database):D698–704.
- Newman ME: Networks: An Introduction. New York, NY, USA: Oxford University Press, Inc; 2010.
- Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A: Pfam: Clans, web tools and services. Nucleic Acids Res 2006, (34 Database):D247–51.
- Freeman L: A set of measures of centrality based on betweenness. Sociometry 1977, 40: 35–41. 10.2307/3033543
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.