Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks
© Ma et al; licensee BioMed Central Ltd. 2005
Received: 12 April 2005
Accepted: 03 October 2005
Published: 03 October 2005
The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes.
Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster sequences of experimentally verified and predicted proteins from all sequenced genomes using a novel distance metric which is a neural network score between a pair of protein sequences. This distance metric is based on the pairwise sequence similarity score and the similarity between their domain structures. The distance metric is the probability that a pair of protein sequences are of the same Interpro family/domain, which facilitates the modelling of transitive homology closure to detect remote homologues. The hierarchical average clustering method is applied with the new distance metric.
Benchmarking studies of our algorithm versus those reported in the literature shows that our algorithm provides clustering results with lower false positive and false negative rates. The clustering algorithm is applied to cluster several eukaryotic genomes and several dozens of prokaryotic genomes.
Clustering of protein sequences from different organisms has been used to identify orthologous and paralogous protein sequences, to find protein sequences unique to an organism, and to derive the phylogenetic profile for a cluster of protein sequences. These are some of the essential components of a comparative genomics study of protein sequences across several genomes.
Clusters of Orthologous Groups (COGs) find triangles of mutually consistent genome-specific best hits from distant organisms without specifying a fixed similarity cut-off, thus accommodating both fast evolving and slow evolving genes. It then merges triangles which share a common edge. Each COG cluster is further analyzed manually to eliminate false positives caused by multidomain proteins so that each COG cluster represents a domain.
ProDom is based on the assumption that short protein sequences are single domain proteins. It first sorts all the protein sequences according to their lengths. It then undergoes a repetitive process: during each iteration, ProDom chooses as the query sequence the current shortest protein sequence or its internal repeat unit if it has internal repeats, searches the whole protein sequence set with PSI-BLAST , builds the sequence profile, and masks segments covered by the sequence profile for multidomain proteins or removes the single domain proteins completely covered by the sequence profile. The process iterates until there is no sequence left in the protein sequence set.
Picasso merges pairwise sequence alignments from the initial all-against-all pairwise sequence similarity searches into multiple sequence alignments of closer homologs, and later hierarchically merges these multiple sequence alignments into representative sequence profiles of remote homologs by profile-profile comparisons. The representative sequence profiles may contain sequences of different domain structures, but share at least one domain. Picasso then cuts domains within the representative sequence profiles into individual domain clusters based on the concept of overlapping maximal clusters proposed in SYSTERS . Maximal clusters are clusters not fully contained in any other clusters. Two maximal clusters may have not only the overlapping set of neighbour members but also the unique set of neighbour members to these two maximal clusters. Thus these two overlapping maximal clusters must be of different domain structures sharing at least one domain which corresponds to the overlapping set of neighbour members. Then these two overlapping maximal clusters must undergo domain-cutting to be split into individual domains corresponding to closed neighbours, where no member has any neighbour outside of the cluster, from multiple alignments. However, since it is still a challenging problem to precisely pinpoint the structure domain border based on primary sequence information [6, 7], the performance of the clustering algorithm will be determined by the accuracy of domain demarcations.
Family-based clustering methods group protein sequences into families, which contain a group of evolutionarily related proteins that share similar domain architecture (see Figure 1), e.g., CluSTr , SYSTERS, ProClust[9, 10], PROTONET /ProtoMap , and MCL. CluSTr clusters protein sequence with the single linkage algorithm using the Z-score as the metric.
SYSTERS uses each protein sequence in the dataset as a seed sequence and applies the single linkage algorithm with a stringent threshold. Thus, each seed sequence has a cluster associated with it. It then merges all the clusters to maximal clusters. The maximal clusters could be either separate maximal clusters corresponding to single domain protein clusters or overlapping maximal clusters representing clusters having multiple domains, but sharing at least one domain.
ProClust uses a different metric to detect whether the aligned two proteins have similar domain structures. The metric value, which scales from 0 to 1, is the ratio of the raw score of the sequence alignments to the raw score of one of those two sequences aligned to itself. Thus the metric value between two sequences is directional. It assumes that the metric is symmetric if two aligned sequences have similar domain structure and non-symmetric otherwise. It then represents each sequence as a vertex and represents the metric value above the threshold as a directional edge in a directed graph. Each strongly connected component corresponds to a cluster . It was later improved by building Profile-HMMs for all clusters having more than 20 sequences and merging two clusters A and B into a cluster corresponding to a SCOP superfamily if the average E-value from searching all the sequences in the cluster A against the profile-HMM of the cluster B is below the threshold.
PROTONET applies the hierarchical clustering of protein sequences based on the their pairwise similarity E-values, but adopts different rules for merging clusters: arithmetic mean, geometric mean, and harmonic mean. However, different families may have different levels of sequence conservation. It is not appropriate to choose one E-value threshold. And at the level of higher E-value, it may merge two clusters of different domain structures, but sharing one domain. However, different families may have different levels of sequence conservation. It is not appropriate to choose one E-value threshold. And at the level of higher E-value, it may merge two clusters of different domain structures, but sharing one domain.
Transitive homology detection methods have been proposed in the Intermediate Sequence Search, ISS [14, 15], and . It works by searching the query sequence against the database with a conservative threshold to find the closely homologous sequences and using these homologous sequences as seeds to search the database to find remotely homologous sequences with a less conservative threshold. The method has been shown to be close to the profile based methods and better than a direct pairwise homology search . But, it is a challenge to quantify the indirect, transitive homology as opposed to using the E-value for quantifying direct pairwise sequence homology.
The Markov cluster (MCL)  algorithm has been successfully applied to clustering protein sequences. MCL represents protein sequences as nodes on a graph where similar proteins are connected by edges weighted according to their BLASP E-Value. The MCL algorithm works through a series of iterative random walks across the graph and inflations of the edge weights that gradually strengthens the connections between very similar nodes and weakens the connections between less similar nodes. MCL makes no explicit use of protein domain architecture but does leverage transitive homologies in the random walk phase of the algorithm.
Compared to the hierarchical clustering family based clustering method, e.g., PROTONET , our method can take advantage of the transitive homologue closure by the third intermediate sequence to detect remote homologues at the superfamily level. Compared to single linkage based methods, e.g., CluSTr , our method avoids the problem of merging two unrelated multi-domain cluster sharing a common domain. Compared to the iterative clustering method, e.g., SYSTERS , our method generates clusters where each sequence belongs to only one cluster.
Results and discussion
Specificity: The specificity of a cluster is defined as N tp / (N tp + N fp ).
Sensitivity: The sensitivity of an InterPro superfamily/domain is defined as N tp / (N tp + N fn ).
Goodness: The goodness of an InterPro superfamily/domain is a measure of how well a cluster corresponds to the mapped InterPro superfamily/domain and is defined below (Equation 1) where N denotes the number of generated clusters associated with that InterPro superfamily/domain. The Area Under the ROC Curve (AUC) has been shown to be a better evaluation measure than accuracy within the context of binary classification, where the negative dataset is clearly defined. However, we cluster protein sequences into 1598 interpro families simultaneously. As a result, using the AUC as a measure of performance is not the appropriate metric here. Instead, we adopt as the "goodness " the set relative measure as defined in . In order to decrease the goodness value when a large number of clusters is associated with an InterPro superfamily/domain, a penalty of (N-1) is applied in the numerator of the equation.
Ideally specificity, sensitivity, and goodness should be 100%.
Specificity, sensitivity, goodness, cluster number, and orphan cluster values at different cutoff values on the benchmarking dataset.
Number of orphan clusters
In Figure 5 we see additional tradeoffs between specificity and overall performance. As specificity increases the number of orphan clusters decreases. This improvement in performance comes with an increase in the total number of clusters. Once again the extreme case of one sequence per cluster guarantees no orphan clusters at the cost of many non-informative clusters. Ideally one wishes to strike a balance reducing the number of orphan clusters while not drastically increasing the total number of clusters.
The overall performances of MCL and CLUGEN are fairly similar, with CLUGEN demonstrating a clear advantage at specificities below 0.98. CLUGEN's sensitivity and goodness are better at specificities below 0.98, whereas MCL's goodness is slightly better at specificities greater than 0.98. The number of total clusters and orphan clusters generated by both methods are comparable at specificities below 0.98. CLUGEN tends toward fewer orphan clusters at the cost of more total clusters at higher specificities.
Analysis of some CLUGEN generated clusters
In this section, we will give examples of some successfully generated clusters with one-to-one correspondence to specific InterPro families, some clusters which have false positives, and some which have false negatives.
Top 50 InterPro superfamily/domains that have been mapped to clusters with one-to-one correspondence
InterPro family/Domain ID
Number of proteins in the benchmark dataset
Ribulose bisphosphate carboxylase, large chain
Short-chain dehydrogenase/reductase SDR
Heat shock protein Hsp70
Zinc-containing alcohol dehydrogenase superfamily
Glyceraldehyde 3-phosphate dehydrogenase
HMG1/2 (high mobility group) box
20S proteasome, A and B subunits
Ribulose bisphosphate carboxylase, small chain
Cytochrome c oxidase, subunit III
Guanine nucleotide binding protein (G-protein), alpha subunit
H+-transporting two-sector ATPase, A subunit
Cytochrome c oxidase, subunit I
Hemagglutinin, HA1/HA2 chain
Secreted growth factor Wnt protein
Intermediate filament protein
Bacterial regulatory protein, LysR
Respiratory-chain NADH dehydrogenase, subunit 1
Small chemokine, interleukin-8 like
Proteinase inhibititor I4, serpin
Pyridoxal-5'-phosphate-dependent enzyme, beta subunit
Binding-protein-dependent transport systems inner membrane component
Copper/Zinc superoxide dismutase
Glutamine synthetase, catalytic domain
Manganese and iron superoxide dismutase
4Fe-4S ferredoxin, iron-sulfur binding domain
Photosynthetic reaction centre protein
Ribosomal protein S12, bacterial and chloroplast form
Heat shock protein Hsp20
NADH/Ubiquinone/plastoquinone (complex I)
Mitochondrial substrate carrier
Papillomavirus major capsid L1 (late) protein
Fork head transcription factor
We also conducted a detailed analysis of clusters that had false negatives/false positives in order to understand the areas in which the clustering algorithm could be further improved. The following is a description of errors encountered in clustering algorithms with specific reference to the data from our method.
Errors from low-complexity and coiled-coil regions
The first type of error is due to the presence of low complexity sequences with repetitive sequence patterns or sequences with coiled-coil structures, since we mask the low complexity regions and coiled-coil regions before the all-against-all pairwise similarity searches. As an example, the InterPro family IPR000533, Tropomyosins, which regulate muscle contraction, are alpha-helical proteins that form a coiled-coil. There are 25 tropomyosin sequences in the benchmarking dataset, among which 24 tropomyosin sequences are false negative sequences and appear in the following cluster along with members of IPR002699 ATP synthase subunit D.
Errors from short sequences or from an abundance of certain amino acid type in the sequences
Short sequences with less than 70 amino acids could also cause false positives in the clustering results. Cluster 1259 which is mapped to InterPro family 003019, the metallothionein superfamily, consists of 125 short protein sequences with 68 amino acids on average in length among which 34 false positive protein sequences are from InterPro family IPR001762, Disintegrin, and 34 false positive protein sequences are from InterPro family IPR000877, Bowman-Birk serine protease inhibitor. The reason these families cluster together is that metallothioneins are small proteins with high content of cysteine residues, while disintegrins and Bowman-Birk serine protease inhibitors are also short cysteine-rich protein sequences. This suggests that a more stringent threshold should be applied to cluster short protein sequences which are rich in a particular amino acid.
Similar domain structures in different superfamilies
Sequences that belong to different superfamilies but share similar domain structures may also cluster incorrectly in some cases. For example, 1039 out of 1058 sequences correctly cluster into the IPR000276 Rhodopsin-like G-protein coupled receptors family. But 17 out of 19 sequences from the IPR000276 Rhodopsin-like G-protein coupled receptors which have Cysteine-rich N-terminal regions are mistakenly clustered into the InterPro 000372 which is annotated as a cysteine-rich flanking region N-terminal. Similarly, members of the IPR001878 CCHC Zinc finger domain have been incorrectly clustered into the cluster 1008 which is mapped to the InterPro family 000981 Neurohypophysical hormone because they share the two cysteine residues and other surrounding weak motifs.
This paper describes a novel method for the clustering of protein sequences based on a new metric derived from prediction using neural networks and further utilizing the metric to model the transitive sequence homologue to detect the remote homologue. Good performance with respect to the InterPro protein sequence database has been achieved on the benchmarking dataset.
Clustering of sequences has many applications in target discovery and gene functionalization. One can identify in silico, antimicrobial drug targets by examining clusters without any eukaryotic member sequence in it. These proteins could be potentially selective targets for infectious diseases due to the absence of any appreciable homolog in the human proteome. Such computationally derived targets need to be further validated using experimental data derived from gene expression profiling and proteomics experiments . Another application is to predict the function for prokaryotic proteins of unknown function by phylogenetic profiling , where the phylogenetic profile for a cluster is a vector of binary values, with 1 meaning the presence of a genome in that cluster and 0 otherwise. The assumption here is that genes with the same phylogenetic profile could have the same function.
After we mask the low complexity regions and the coiled-coil regions and carry out the all-against-all pairwise sequence similarity searches, we extract four sets of features to represent the homology between any given pair of sequences. The first two sets of input features detect the homology of two aligned sequences, the last two sets of input features test whether two aligned sequences have similar domain structures. We use neural networks to map input features to a new metric, a probability value which scales from 0 to 1 and is interpreted as the likelihood that two sequences are of the same homologous superfamily.
The first input feature is the log scale of the pairwise E-value.
The raw score, from which the E-value of the two aligned sequences is derived, is calculated by summing up the log score of each position in the alignment, which assumes that each position is independent of the other. However, in practice, it has been shown that two consecutive positions in the alignments are quite correlated . To model the correlation between two consecutive positions in the alignment, we adopt the concept of the 2-gram encoding method . Ideally, hydrophobic regions of one sequence should align with the hydrophobic regions of the other sequence, and hydrophilic regions should align with each other as well. Each position in the alignment could fall into one of four categories: residue identity denoted by A1, hydrophobic similarity denoted by A2, hydrophilic similarity denoted by A3, and everything else denoted by A4. Let Len a denote the total length of the alignment and Occuri,jdenote the number of occurrences of i and j, where i is immediately followed by j, with i and j denoting any one of A1, A2, A3, or A4 respectively. Let freq i,j denote frequency of i,j, which is equal to Occur i,j /(Len a -1). Thus the second set of input features consists of freq i,j values of the alignment positions, which consists of 16 input feature values for a pair of aligned sequences since each of the two consecutive positions could be one of four possible values.
The fourth and final input feature is to measure the overlap between two neighbor sets of aligned sequences, where the neighbor set, Set i of sequence i , is defined as the set of protein sequences that sequence i hits when sequence i is used as the query sequence. One straightforward method to measure the overlap is to use the ratio of the cardinality of the intersection of two neighbor sets to the cardinality of the union of two neighbor sets.
Here we propose another method to measure the overlap. Specifically, if we represent the neighbor set of sequence i as Vector i , the value of the k th element of Vector i is the log E-value, Log ik between sequence i and its k th neighbor in Set i . However, Vector i and Vector j for two aligned sequences, sequence i and sequence j , may be of different dimensions since the cardinalities of Set i and Set j may be different. We make these two vectors have the same dimension by adding the log E-value threshold to Vector i whenever the sequence i has no corresponding neighbor in the neighbor set, Set j of the other aligned sequencej.Thus the last input feature is the linear correlation coefficient LCC2 between Vector i and Vector j defined by Equation 3. Intuitively, the more similar the domain structure two aligned sequences have, the more similar neighbor sets they will have, and the closer to 1 the linear correlation coefficient will be.
where N is the dimension of Vector i and Vector j and Log i and Log j are the mean values and are defined by , and .
To summarize, the first input feature is the log scale of the pairwise E-value. The second input feature consists of 16 feature values, which are frequency values of the alignment positions and the third input feature includes 7 values, which relate to the details of the alignments. And finally, the fourth input feature includes 1 feature value which measures the overlap of the neighbour sets to make a total of 25 input features which will be used to train the neural network as described below.
After we represent the sequence homology between a pair of sequences by the set of 25 input features, we train the neural network using the training data. Each homologous pair of sequences is labeled as 1 if they belong to the same InterPro superfamily or the same domain if they are single domain proteins, and 0 otherwise. We selected as large a number of sequences as possible to train the neural networks to avoid overfitting the data. In all, we selected 27844 homologous sequence pairs as the positive training set and 29999 non-homologous sequence pairs as the negative training set.
Given the large amount of training data relative to the number of the weights in the neural network, the neural network is unlikely to overfit. It may however underfit if there are not a sufficient number of weights. If the training data are smaller relative to the number of weights in the neural network, measures should be taken to avoid the overfitting problem and the cross-validation method should be used to choose the best model. Clearly in this study, such is not the case.
We used a split-sample approach in which the validation set is not used during training, but is used to select the best model. After the neural network is trained, it is validated on the validation dataset containing 250597 homologous pairs of sequences and 30000 non-homologous pairs. Different numbers of hidden layer nodes have been tested. The ultimate goal is not to determine if any two proteins sequences belong to the same Interpro family, but to cluster all sequences in Interpro families as accurately as possible. We selected the model with the smallest number of weights and smallest error on the validation set. Thus, we chose 4 hidden layer nodes such that the neural network has the least number of hidden units and the best performance on the validation dataset with a specificity of 94.18% and a sensitivity of 91.81%.
Modeling the transitive homology
Hierarchical average linkage clustering
Once the metric value for every pair of protein sequences is calculated, the hierarchical average linkage clustering method is applied to cluster the protein sequences in the new metric space using the geometric mean as the merging rule.
Hierarchical average linkages uses the Unweighted Pair-Group Average (UPGA)  to measure the distance between clusters. Let D i , i = 1,2, ... n. denote the protein sequences contained in Cluster D and let E j , j = 1,2, ..., m denote the protein sequence contained in Cluster E. The geometric mean distance G between Cluster D and Cluster E is defined as Equation 4:
The hierarchical average linkage clustering works in an iterative process: it begins with each protein sequence as a singleton cluster; during each iteration, it finds two clusters with the lowest metric value, then joins these two clusters into a new cluster, and updates the metric value between this new cluster and all others. This process iterates until the current lowest metric value exceeds the threshold.
We thank Dr. Xia Yang for comments and feedback. We also acknowledge Mischa Reinhardt and Darrell Ricke for a critical reading of this manuscript.
- Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278(5338):631–637.View ArticlePubMedGoogle Scholar
- Gouzy J, Corpet F, Kahn D: Whole genome protein domain analysis using a new method for domain clustering. Comput Chem 1999, 23(3–4):333–340.View ArticlePubMedGoogle Scholar
- Heger A, Holm L: Picasso: generating a covering set of protein family profiles. Bioinformatics 2001, 17(3):272–279.View ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402.PubMed CentralView ArticlePubMedGoogle Scholar
- Krause A, Vingron MA: set-theoretic approach to database searching and clustering. Bioinformatics 1998, 14(5):430–438.View ArticlePubMedGoogle Scholar
- George RA, Heringa J: SnapDRAGON: a method to delineate protein structural domains from sequence data. J Mol Biol 2002, 316(3):839–851.View ArticlePubMedGoogle Scholar
- Nagarajan N, Yona G: A multi-expert system for the automatic detection of protein domains from sequence information. In Proceedings of the seventh annual international conference on Computational molecular biology. Berlin, Germany; 2003:224–234.Google Scholar
- Kriventseva EV, Fleischmann W, Zdobnov EM, Apweiler R: CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res 2001, 29(1):33–36.PubMed CentralView ArticlePubMedGoogle Scholar
- Bolten E, Schliep A, Schneckener S, Schomburg D, Schrader R: Clustering protein sequences–structure prediction by transitivehomology. Bioinformatics 2001, 17(10):935–941.View ArticlePubMedGoogle Scholar
- Pipenbacher P, Schliep A, Schneckener S, Schonhuth A, Schomburg D, Schrader R: ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics 2002, (Suppl 2):S182–191.Google Scholar
- Sasson O, Linial N, Linial M: The metric space of proteins-comparative study of clustering algorithms. Bioinformatics 2002, (Suppl 1):S14–21.Google Scholar
- Yona G, Linial N, Linial M: ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space. Proteins 1999, 37(3):360–378.View ArticlePubMedGoogle Scholar
- Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30(7):1575–84.PubMed CentralView ArticlePubMedGoogle Scholar
- Teichmann SA, Chothia C, Church GM, Park J: Fast assignment of protein structures to sequences using the intermediate sequence library PDB-ISL. Bioinformatics 2000, 16(2):117–124.View ArticlePubMedGoogle Scholar
- Park J, Teichmann SA, Hubbard T, Chothia C: Intermediate sequences increase the detection of homology between sequences. J Mol Biol 1997, 273(1):349–354.View ArticlePubMedGoogle Scholar
- Gerstein M: Measurement of the effectiveness of transitive sequence comparison, through a third 'intermediate' sequence. Bioinformatics 1998, 14(8):707–714.View ArticlePubMedGoogle Scholar
- Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C: Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 1998, 284(4):1201–1210.View ArticlePubMedGoogle Scholar
- Boeckmann B, et al.: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31(1):365–370.PubMed CentralView ArticlePubMedGoogle Scholar
- Zdobnov EM, Apweiler R: InterProScan – an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001, 17(19):847–848.View ArticlePubMedGoogle Scholar
- Cole ST: Comparative mycobacterial genomics as a tool for drug target and antigen discovery. Eur Respir J Suppl 2002, 36: 78s-86s.View ArticlePubMedGoogle Scholar
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96(8):4285–4288.PubMed CentralView ArticlePubMedGoogle Scholar
- Pei J, Grishin NV: AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics 2001, 17(8):700–712.View ArticlePubMedGoogle Scholar
- Wang TJ, Ma Q, Shasha D, Wu C: New techniques for extracting features from protein sequences. IBM Systems Journal, Special Issue on Deep Computing for the Life Sciences 2001, 40(2):426–441.Google Scholar
- Bishop CM: Neural Networks for Pattern Recognition. Oxford University Press, New York, New York; 1995.Google Scholar
- Hanselman DC: Mastering MATLAB 5: A comprehensive tutorial and reference. Prentice Hall, Upper Saddle River, New Jersey; 1998.Google Scholar
- Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. Springer, New York; 2001.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.