Skip to main content

Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks



The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes.


Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster sequences of experimentally verified and predicted proteins from all sequenced genomes using a novel distance metric which is a neural network score between a pair of protein sequences. This distance metric is based on the pairwise sequence similarity score and the similarity between their domain structures. The distance metric is the probability that a pair of protein sequences are of the same Interpro family/domain, which facilitates the modelling of transitive homology closure to detect remote homologues. The hierarchical average clustering method is applied with the new distance metric.


Benchmarking studies of our algorithm versus those reported in the literature shows that our algorithm provides clustering results with lower false positive and false negative rates. The clustering algorithm is applied to cluster several eukaryotic genomes and several dozens of prokaryotic genomes.


Clustering of protein sequences from different organisms has been used to identify orthologous and paralogous protein sequences, to find protein sequences unique to an organism, and to derive the phylogenetic profile for a cluster of protein sequences. These are some of the essential components of a comparative genomics study of protein sequences across several genomes.

The methods of clustering protein sequences can be either domain-based or family-based. All the clustering methods start with an all-against-all pairwise protein sequence similarity search. The domain-based clustering methods organize the protein sequence universe into domain clusters where domains are the structural units of proteins, e.g., COG [1], ProDom [2], and Picasso [3] (Figure 1). A multidomain protein may belong to multiple domain clusters.

Figure 1
figure 1

The schematic view of family-based clustering. Figure 1 illustrates a typical example of the clustering of three protein families denoted by the three oval outlines. Family I consists of protein sequences 1 and 2. Family II consists of protein sequences 3, 4, and 5. Family III consists of protein sequence 6 and 7. Domain A is common to families 1 and 2 while Domain B is common to families 2 and 3.

Clusters of Orthologous Groups (COGs) find triangles of mutually consistent genome-specific best hits from distant organisms without specifying a fixed similarity cut-off, thus accommodating both fast evolving and slow evolving genes. It then merges triangles which share a common edge. Each COG cluster is further analyzed manually to eliminate false positives caused by multidomain proteins so that each COG cluster represents a domain.

ProDom is based on the assumption that short protein sequences are single domain proteins. It first sorts all the protein sequences according to their lengths. It then undergoes a repetitive process: during each iteration, ProDom chooses as the query sequence the current shortest protein sequence or its internal repeat unit if it has internal repeats, searches the whole protein sequence set with PSI-BLAST [4], builds the sequence profile, and masks segments covered by the sequence profile for multidomain proteins or removes the single domain proteins completely covered by the sequence profile. The process iterates until there is no sequence left in the protein sequence set.

Picasso merges pairwise sequence alignments from the initial all-against-all pairwise sequence similarity searches into multiple sequence alignments of closer homologs, and later hierarchically merges these multiple sequence alignments into representative sequence profiles of remote homologs by profile-profile comparisons. The representative sequence profiles may contain sequences of different domain structures, but share at least one domain. Picasso then cuts domains within the representative sequence profiles into individual domain clusters based on the concept of overlapping maximal clusters proposed in SYSTERS [5]. Maximal clusters are clusters not fully contained in any other clusters. Two maximal clusters may have not only the overlapping set of neighbour members but also the unique set of neighbour members to these two maximal clusters. Thus these two overlapping maximal clusters must be of different domain structures sharing at least one domain which corresponds to the overlapping set of neighbour members. Then these two overlapping maximal clusters must undergo domain-cutting to be split into individual domains corresponding to closed neighbours, where no member has any neighbour outside of the cluster, from multiple alignments. However, since it is still a challenging problem to precisely pinpoint the structure domain border based on primary sequence information [6, 7], the performance of the clustering algorithm will be determined by the accuracy of domain demarcations.

Family-based clustering methods group protein sequences into families, which contain a group of evolutionarily related proteins that share similar domain architecture (see Figure 1), e.g., CluSTr [8], SYSTERS, ProClust[9, 10], PROTONET [11]/ProtoMap [12], and MCL[13]. CluSTr clusters protein sequence with the single linkage algorithm using the Z-score as the metric.

SYSTERS uses each protein sequence in the dataset as a seed sequence and applies the single linkage algorithm with a stringent threshold. Thus, each seed sequence has a cluster associated with it. It then merges all the clusters to maximal clusters. The maximal clusters could be either separate maximal clusters corresponding to single domain protein clusters or overlapping maximal clusters representing clusters having multiple domains, but sharing at least one domain.

ProClust uses a different metric to detect whether the aligned two proteins have similar domain structures. The metric value, which scales from 0 to 1, is the ratio of the raw score of the sequence alignments to the raw score of one of those two sequences aligned to itself. Thus the metric value between two sequences is directional. It assumes that the metric is symmetric if two aligned sequences have similar domain structure and non-symmetric otherwise. It then represents each sequence as a vertex and represents the metric value above the threshold as a directional edge in a directed graph. Each strongly connected component corresponds to a cluster [9]. It was later improved by building Profile-HMMs for all clusters having more than 20 sequences and merging two clusters A and B into a cluster corresponding to a SCOP superfamily if the average E-value from searching all the sequences in the cluster A against the profile-HMM of the cluster B is below the threshold[10].

PROTONET applies the hierarchical clustering of protein sequences based on the their pairwise similarity E-values, but adopts different rules for merging clusters: arithmetic mean, geometric mean, and harmonic mean. However, different families may have different levels of sequence conservation. It is not appropriate to choose one E-value threshold. And at the level of higher E-value, it may merge two clusters of different domain structures, but sharing one domain. However, different families may have different levels of sequence conservation. It is not appropriate to choose one E-value threshold. And at the level of higher E-value, it may merge two clusters of different domain structures, but sharing one domain.

Transitive homology detection methods have been proposed in the Intermediate Sequence Search, ISS [14, 15], and [16]. It works by searching the query sequence against the database with a conservative threshold to find the closely homologous sequences and using these homologous sequences as seeds to search the database to find remotely homologous sequences with a less conservative threshold. The method has been shown to be close to the profile based methods and better than a direct pairwise homology search [17]. But, it is a challenge to quantify the indirect, transitive homology as opposed to using the E-value for quantifying direct pairwise sequence homology.

The Markov cluster (MCL) [13] algorithm has been successfully applied to clustering protein sequences. MCL represents protein sequences as nodes on a graph where similar proteins are connected by edges weighted according to their BLASP E-Value. The MCL algorithm works through a series of iterative random walks across the graph and inflations of the edge weights that gradually strengthens the connections between very similar nodes and weakens the connections between less similar nodes. MCL makes no explicit use of protein domain architecture but does leverage transitive homologies in the random walk phase of the algorithm.

Compared to the hierarchical clustering family based clustering method, e.g., PROTONET [11], our method can take advantage of the transitive homologue closure by the third intermediate sequence to detect remote homologues at the superfamily level. Compared to single linkage based methods, e.g., CluSTr [8], our method avoids the problem of merging two unrelated multi-domain cluster sharing a common domain. Compared to the iterative clustering method, e.g., SYSTERS [5], our method generates clusters where each sequence belongs to only one cluster.

Results and discussion


In order to test the performance of CLUGEN, we selected all Swissprot [18] sequences with an InterPro [19] annotation, which resulted in 41480 sequences from 1598 InterPro families. The criteria we used to select sequences are that more than one member database from the InterPro annotation have the same superfamily or domain assignment and that the aligned region of the Swissprot sequence with respect to either Profile or hidden Markov model is longer than 30 amino acids. The benchmarking dataset is available on request. Figure 2 shows the InterPro superfamily/domain size distribution in the benchmarking dataset. There are 102 singleton families, that is families that consist of only one sequence. The largest family is IPR000276, the Rhodopsin-like G-Protein Coupled Receptor (GPCR) family which comprised of 1058 protein sequences.

Figure 2
figure 2

shows the distribution of InterPro family size. Figure 2 shows the distribution of the InterPro families used in the benchmarking dataset based upon the number of members in each family. There are 102 singleton InterPro families, and the largest InterPro family in the benchmarking dataset is Rhodopsin-like GPCR superfamily which has 1058 protein sequences in the benchmarking dataset.

Performance measure

We measure CLUGEN's performance by sensitivity, specificity, and goodness. A protein sequence is a false positive (FP) if it is misclassified to a certain InterPro superfamily/domain and a true positive (TP) otherwise. A protein sequence in a certain InterPro superfamily/domain is a false negative (FN) if it is not classified to that InterPro superfamily/domain (Figure 3). Let N fp and N tp denote the number of false positives and the number of true positives with respect to a cluster. Let N fn denote the number of false negatives with respect to a InterPro superfamily/domain.

Figure 3
figure 3

Definition of various clustering parameters. Figure 3 illustrates the mapping of three generated clusters denoted by oval outlines differentiated by different colors into a InterPro family denoted by a rectangle. The cluster can be mapped to an InterPro family only if more than 50% cluster members belong to that InterPro family; and is declared as a orphan cluster otherwise. Protein sequences outside the rectangle are false positives. Protein sequences within both the oval outline and the rectangle are true positives. Protein sequences wholly within the grey rectangle are false negatives.

Specificity: The specificity of a cluster is defined as N tp / (N tp + N fp ).

Sensitivity: The sensitivity of an InterPro superfamily/domain is defined as N tp / (N tp + N fn ).

Goodness: The goodness of an InterPro superfamily/domain is a measure of how well a cluster corresponds to the mapped InterPro superfamily/domain and is defined below (Equation 1) where N denotes the number of generated clusters associated with that InterPro superfamily/domain. The Area Under the ROC Curve (AUC) has been shown to be a better evaluation measure than accuracy within the context of binary classification, where the negative dataset is clearly defined. However, we cluster protein sequences into 1598 interpro families simultaneously. As a result, using the AUC as a measure of performance is not the appropriate metric here. Instead, we adopt as the "goodness " the set relative measure as defined in [12]. In order to decrease the goodness value when a large number of clusters is associated with an InterPro superfamily/domain, a penalty of (N-1) is applied in the numerator of the equation.

Ideally specificity, sensitivity, and goodness should be 100%.

Equation 1:

G o o d n e s s = i = 1 N N t p i N + 1 i = 1 N N t p i + i = 1 N N f p i + N f n MathType@MTEF@5@5@+=feaafeart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGhbWrcqWGVbWBcqWGVbWBcqWGKbazcqWGUbGBcqWGLbqzcqWGZbWCcqWGZbWCcqGH9aqpdaWcaaqaamaaqahabaGaemOta40ccqWG0baDcqWGWbaCdaWgaaqaaiabdMgaPbqabaGccqGHsislcqWGobGtcqGHRaWkcqaIXaqmaSqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaGcbaWaaabCaeaacqWGobGtliabdsha0jabdchaWnaaBaaabaGaemyAaKgabeaakiabgUcaRmaaqahabaGaemOta40ccqWGMbGzcqWGWbaCdaWgaaqaaiabdMgaPbqabaGccqGHRaWkcqWGobGtliabdAgaMjabd6gaUbqaaiabdMgaPjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaWcbaGaemyAaKMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaaaaaa@66DA@

Overall performance

We evaluated CLUGEN at several threshold values. Table 1 lists shows the specificity, sensitivity, and goodness as well as the number of generated clusters and the number of orphan clusters as a function of the different threshold values respectively. A cluster can be mapped to an InterPro family only if more than 50% of the cluster members belong to that InterPro family; and is declared as an orphan cluster otherwise. At one extreme of the spectrum, each cluster is a singleton cluster consisting of only one protein sequence. Thus both specificity and sensitivity are 100%. But the goodness value is very low, the reciprocal of the size of the family. Clustering based on more stringent threshold values, e.g. 0.5, generates a larger number of smaller clusters causing a smaller number of false positives, also resulting in a low goodness value. As the threshold values become less stringent, small clusters merge at different levels into larger clusters, therefore a smaller number of larger clusters are generated. At the threshold of 0.2, there are 1706 clusters resulting in a specificity of 97.1%, sensitivity 98.6%, goodness value of 78.2%, and the number of orphan clusters is 201. As can be seen from table 1, the threshold value is a compromise of sensitivity, specificity, goodness and the number of orphan clusters. Ideally, we would like the clustering results to produce as many clusters as there should be and as few orphan clusters as possible.

Table 1 Specificity, sensitivity, goodness, cluster number, and orphan cluster values at different cutoff values on the benchmarking dataset.

For a basis of comparison we also applied the MCL [13] algorithm to the same test dataset with various inflation values. Results are depicted in Figures 4 and 5. At higher specificities, the sensitivity of both methods increases. This is expected because higher specificities are achieved via stricter thresholds that result in more clusters overall and fewer large clusters. In the extreme case one could place each test sequence in its own cluster of size 1 and achieve 100% sensitivity and 100% specificity but with a low goodness score. This trade-off between sensitivity, specificity, and goodness is clearly evident in Figure 4; as specificity increases, sensitivity increases whereas goodness decreases.

Figure 4
figure 4

Specificity, sensitivity, and goodness on the benchmarking dataset. Sensitivity and specificity for CLUGEN and MCL at various specificities. At higher specificities, the sensitivity of both methods increases, whereas the goodness of both methods decreases. This is expected because higher specificities are achieved via stricter parameter thresholds that more clusters overall and fewer large clusters. Performance for both methods is comparable in this range with CLUGEN performing better at lower specificities and MCL performing better at higher specificities.

Figure 5
figure 5

The number of generated clusters and orphan clusters on the benchmarking dataset. Total clusters and orphan clusters for clugen and MCL at various specificities. With stricter parameter thresholds, overall specificity and the total number of clusters increases for both methods. The larger number of small clusters at higher specificities leads to a reduction in the number of orphan clusters in both methods.

In Figure 5 we see additional tradeoffs between specificity and overall performance. As specificity increases the number of orphan clusters decreases. This improvement in performance comes with an increase in the total number of clusters. Once again the extreme case of one sequence per cluster guarantees no orphan clusters at the cost of many non-informative clusters. Ideally one wishes to strike a balance reducing the number of orphan clusters while not drastically increasing the total number of clusters.

The overall performances of MCL and CLUGEN are fairly similar, with CLUGEN demonstrating a clear advantage at specificities below 0.98. CLUGEN's sensitivity and goodness are better at specificities below 0.98, whereas MCL's goodness is slightly better at specificities greater than 0.98. The number of total clusters and orphan clusters generated by both methods are comparable at specificities below 0.98. CLUGEN tends toward fewer orphan clusters at the cost of more total clusters at higher specificities.

Analysis of some CLUGEN generated clusters

In this section, we will give examples of some successfully generated clusters with one-to-one correspondence to specific InterPro families, some clusters which have false positives, and some which have false negatives.

As previously outlined, 41480 sequences with Interpro superfamily annotation (1598 clusters) were clustered using our algorithm. This results in a total of 1972 clusters. Overall, there are 1199 clusters that have been correctly mapped with one-to-one correspondence to 1199 out of 1598 InterPro superfamilies /domains. There are 79 orphan clusters. Some correctly clustered large protein superfamily/domain examples are: 507 Cytochrome P450 proteins are correctly clustered into family IPR001128 without false positives and false negatives; 398 large chain ribulose bisphosphate carboxylase proteins are correctly clustered into family IPR000685 without false positives and false negatives; 290 short-chain dehydrogenase/reductase SDR proteins are clustered into family IPR002198 without false positives and false negatives. Table 2 shows top 50 InterPro superfamily/domains that have been mapped to clusters with one-to-one correspondence.

Table 2 Top 50 InterPro superfamily/domains that have been mapped to clusters with one-to-one correspondence

We also conducted a detailed analysis of clusters that had false negatives/false positives in order to understand the areas in which the clustering algorithm could be further improved. The following is a description of errors encountered in clustering algorithms with specific reference to the data from our method.

Errors from low-complexity and coiled-coil regions

The first type of error is due to the presence of low complexity sequences with repetitive sequence patterns or sequences with coiled-coil structures, since we mask the low complexity regions and coiled-coil regions before the all-against-all pairwise similarity searches. As an example, the InterPro family IPR000533, Tropomyosins, which regulate muscle contraction, are alpha-helical proteins that form a coiled-coil. There are 25 tropomyosin sequences in the benchmarking dataset, among which 24 tropomyosin sequences are false negative sequences and appear in the following cluster along with members of IPR002699 ATP synthase subunit D.

Errors from short sequences or from an abundance of certain amino acid type in the sequences

Short sequences with less than 70 amino acids could also cause false positives in the clustering results. Cluster 1259 which is mapped to InterPro family 003019, the metallothionein superfamily, consists of 125 short protein sequences with 68 amino acids on average in length among which 34 false positive protein sequences are from InterPro family IPR001762, Disintegrin, and 34 false positive protein sequences are from InterPro family IPR000877, Bowman-Birk serine protease inhibitor. The reason these families cluster together is that metallothioneins are small proteins with high content of cysteine residues, while disintegrins and Bowman-Birk serine protease inhibitors are also short cysteine-rich protein sequences. This suggests that a more stringent threshold should be applied to cluster short protein sequences which are rich in a particular amino acid.

Similar domain structures in different superfamilies

Sequences that belong to different superfamilies but share similar domain structures may also cluster incorrectly in some cases. For example, 1039 out of 1058 sequences correctly cluster into the IPR000276 Rhodopsin-like G-protein coupled receptors family. But 17 out of 19 sequences from the IPR000276 Rhodopsin-like G-protein coupled receptors which have Cysteine-rich N-terminal regions are mistakenly clustered into the InterPro 000372 which is annotated as a cysteine-rich flanking region N-terminal. Similarly, members of the IPR001878 CCHC Zinc finger domain have been incorrectly clustered into the cluster 1008 which is mapped to the InterPro family 000981 Neurohypophysical hormone because they share the two cysteine residues and other surrounding weak motifs.


This paper describes a novel method for the clustering of protein sequences based on a new metric derived from prediction using neural networks and further utilizing the metric to model the transitive sequence homologue to detect the remote homologue. Good performance with respect to the InterPro protein sequence database has been achieved on the benchmarking dataset.

Clustering of sequences has many applications in target discovery and gene functionalization. One can identify in silico, antimicrobial drug targets by examining clusters without any eukaryotic member sequence in it. These proteins could be potentially selective targets for infectious diseases due to the absence of any appreciable homolog in the human proteome. Such computationally derived targets need to be further validated using experimental data derived from gene expression profiling and proteomics experiments [20]. Another application is to predict the function for prokaryotic proteins of unknown function by phylogenetic profiling [21], where the phylogenetic profile for a cluster is a vector of binary values, with 1 meaning the presence of a genome in that cluster and 0 otherwise. The assumption here is that genes with the same phylogenetic profile could have the same function.


Feature extraction

After we mask the low complexity regions and the coiled-coil regions and carry out the all-against-all pairwise sequence similarity searches, we extract four sets of features to represent the homology between any given pair of sequences. The first two sets of input features detect the homology of two aligned sequences, the last two sets of input features test whether two aligned sequences have similar domain structures. We use neural networks to map input features to a new metric, a probability value which scales from 0 to 1 and is interpreted as the likelihood that two sequences are of the same homologous superfamily.

The first input feature is the log scale of the pairwise E-value.

The raw score, from which the E-value of the two aligned sequences is derived, is calculated by summing up the log score of each position in the alignment, which assumes that each position is independent of the other. However, in practice, it has been shown that two consecutive positions in the alignments are quite correlated [22]. To model the correlation between two consecutive positions in the alignment, we adopt the concept of the 2-gram encoding method [23]. Ideally, hydrophobic regions of one sequence should align with the hydrophobic regions of the other sequence, and hydrophilic regions should align with each other as well. Each position in the alignment could fall into one of four categories: residue identity denoted by A1, hydrophobic similarity denoted by A2, hydrophilic similarity denoted by A3, and everything else denoted by A4. Let Len a denote the total length of the alignment and Occuri,jdenote the number of occurrences of i and j, where i is immediately followed by j, with i and j denoting any one of A1, A2, A3, or A4 respectively. Let freq i,j denote frequency of i,j, which is equal to Occur i,j /(Len a -1). Thus the second set of input features consists of freq i,j values of the alignment positions, which consists of 16 input feature values for a pair of aligned sequences since each of the two consecutive positions could be one of four possible values.

Each of the two aligned sequences is separated into three segments, the unaligned N-terminal region, the aligned region, and the unaligned C-terminal region, by the beginning position of the alignment, denoted by begin i , and the end position of the alignment, denoted by end i with respect to sequence i (Figure 6). Let Len i denote the length of sequence i . Then lengths of three segments of sequence i are begin i -1, end i - begin i +1, and Len i - end i , respectively. If we normalize the length of each of three segments within an aligned sequence i by dividing the length of each segment by Len i , we get a vector of three values, Segi 1= (begin i -1)/Len i , Segi 2= (end i - begin i +1)/Len i , and Segi 3= (Len i - end i )/Len i . Intuitively, if the two aligned sequences have similar domain structures, the alignment will divide the two aligned sequences i and j in a similar proportion, and the linear correlation coefficient, LCC1 defined by Equation 2, between these two vectors tend to be close to 1. So the third set of input features include LCC1, Segi 1, Segi 2, Segi 3, Segj 1, Segj 2, and Segj 3.

Figure 6
figure 6

A schematic view of a pairwise alignment. Figure 6 shows a pairwise alignment between two aligned sequences. The aligned regions of the two sequences are highlighted. Their boundaries are pinpointed by the arrows.

Equation 2:

L C C 1 = 1 3 k = 1 N ( S e g i k 1 3 ) ( S e g j k 1 3 ) 1 3 k = 1 N ( S e g i k 1 3 ) 2 1 3 k = 1 N ( S e g j k 1 3 ) 2 MathType@MTEF@5@5@+=feaafeart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGmbatcqWGdbWqcqWGdbWqdaWgaaWcbaGaeGymaedabeaakiabg2da9maalaaabaWaaSaaaeaacqaIXaqmaeaacqaIZaWmaaWaaabCaeaacqGGOaakcqWGtbWucqWGLbqzcqWGNbWzdaWgaaWcbaGaemyAaKMaem4AaSgabeaakiabgkHiTmaalaaabaGaeGymaedabaGaeG4mamdaaiabcMcaPiabcIcaOiabdofatjabdwgaLjabdEgaNnaaBaaaleaacqWGQbGAcqWGRbWAaeqaaOGaeyOeI0YaaSaaaeaacqaIXaqmaeaacqaIZaWmaaGaeiykaKcaleaacqWGRbWAcqGH9aqpcqaIXaqmaeaacqWGobGta0GaeyyeIuoaaOqaamaakaaabaWaaSaaaeaacqaIXaqmaeaacqaIZaWmaaWaaabCaeaacqGGOaakcqWGtbWucqWGLbqzcqWGNbWzdaWgaaWcbaGaemyAaKMaem4AaSgabeaakiabgkHiTmaalaaabaGaeGymaedabaGaeG4mamdaaiabcMcaPmaaCaaaleqabaGaeGOmaidaaaqaaiabdUgaRjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaWcbeaakmaakaaabaWaaSaaaeaacqaIXaqmaeaacqaIZaWmaaWaaabCaeaacqGGOaakcqWGtbWucqWGLbqzcqWGNbWzdaWgaaWcbaGaemOAaOMaem4AaSgabeaakiabgkHiTmaalaaabaGaeGymaedabaGaeG4mamdaaiabcMcaPmaaCaaaleqabaGaeGOmaidaaaqaaiabdUgaRjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaWcbeaaaaaaaa@7C1C@

The fourth and final input feature is to measure the overlap between two neighbor sets of aligned sequences, where the neighbor set, Set i of sequence i , is defined as the set of protein sequences that sequence i hits when sequence i is used as the query sequence. One straightforward method to measure the overlap is to use the ratio of the cardinality of the intersection of two neighbor sets to the cardinality of the union of two neighbor sets.

Here we propose another method to measure the overlap. Specifically, if we represent the neighbor set of sequence i as Vector i , the value of the kthelement of Vector i is the log E-value, Log ik between sequence i and its kthneighbor in Set i . However, Vector i and Vector j for two aligned sequences, sequence i and sequence j , may be of different dimensions since the cardinalities of Set i and Set j may be different. We make these two vectors have the same dimension by adding the log E-value threshold to Vector i whenever the sequence i has no corresponding neighbor in the neighbor set, Set j of the other aligned sequencej.Thus the last input feature is the linear correlation coefficient LCC2 between Vector i and Vector j defined by Equation 3. Intuitively, the more similar the domain structure two aligned sequences have, the more similar neighbor sets they will have, and the closer to 1 the linear correlation coefficient will be.

Equation 3:

L C C 2 = 1 N k = 1 N ( L o g i k L o g i ) ( L o g j k L o g j ) 1 N k = 1 N ( L o g i k L o g i ) 2 1 N k = 1 N ( L o g j k L o g j ) 2 MathType@MTEF@5@5@+=feaafeart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGmbatcqWGdbWqcqWGdbWqdaWgaaWcbaGaeGOmaidabeaakiabg2da9maalaaabaWaaSaaaeaacqaIXaqmaeaacqWGobGtaaWaaabCaeaacqGGOaakcqWGmbatcqWGVbWBcqWGNbWzdaWgaaWcbaGaemyAaKMaem4AaSgabeaakiabgkHiTiabdYeamjabd+gaVjabdEgaNnaaBaaaleaacqWGPbqAaeqaaOGaeiykaKIaeiikaGIaemitaWKaem4Ba8Maem4zaC2aaSbaaSqaaiabdQgaQjabdUgaRbqabaGccqGHsislcqWGmbatcqWGVbWBcqWGNbWzdaWgaaWcbaGaemOAaOgabeaakiabcMcaPaWcbaGaem4AaSMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaakeaadaGcaaqaamaalaaabaGaeGymaedabaGaemOta4eaamaaqahabaGaeiikaGIaemitaWKaem4Ba8Maem4zaC2aaSbaaSqaaiabdMgaPjabdUgaRbqabaGccqGHsislcqWGmbatcqWGVbWBcqWGNbWzdaWgaaWcbaGaemyAaKgabeaakiabcMcaPmaaCaaaleqabaGaeGOmaidaaaqaaiabdUgaRjabg2da9iabigdaXaqaaiabd6eaobqdcqGHris5aaWcbeaakmaakaaabaWaaSaaaeaacqaIXaqmaeaacqWGobGtaaWaaabCaeaacqGGOaakcqWGmbatcqWGVbWBcqWGNbWzdaWgaaWcbaGaemOAaOMaem4AaSgabeaakiabgkHiTiabdYeamjabd+gaVjabdEgaNnaaBaaaleaacqWGQbGAaeqaaOGaeiykaKYaaWbaaSqabeaacqaIYaGmaaaabaGaem4AaSMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaaleqaaaaaaaa@8ABD@

where N is the dimension of Vector i and Vector j and Log i and Log j are the mean values and are defined by L o g i = 1 N k = 1 N L o g i k MathType@MTEF@5@5@+=feaafeart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGmbatcqWGVbWBcqWGNbWzdaWgaaWcbaGaemyAaKgabeaakiabg2da9maalaaabaGaeGymaedabaGaemOta4eaamaaqahabaGaemitaWKaem4Ba8Maem4zaC2aaSbaaSqaaiabdMgaPjabdUgaRbqabaaabaGaem4AaSMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaaaa@42BC@ , and L o g j = 1 N k = 1 N L o g j k MathType@MTEF@5@5@+=feaafeart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGmbatcqWGVbWBcqWGNbWzdaWgaaWcbaGaemOAaOgabeaakiabg2da9maalaaabaGaeGymaedabaGaemOta4eaamaaqahabaGaemitaWKaem4Ba8Maem4zaC2aaSbaaSqaaiabdQgaQjabdUgaRbqabaaabaGaem4AaSMaeyypa0JaeGymaedabaGaemOta4eaniabggHiLdaaaa@42C0@ .

To summarize, the first input feature is the log scale of the pairwise E-value. The second input feature consists of 16 feature values, which are frequency values of the alignment positions and the third input feature includes 7 values, which relate to the details of the alignments. And finally, the fourth input feature includes 1 feature value which measures the overlap of the neighbour sets to make a total of 25 input features which will be used to train the neural network as described below.

Neural networks

After we represent the sequence homology between a pair of sequences by the set of 25 input features, we train the neural network using the training data. Each homologous pair of sequences is labeled as 1 if they belong to the same InterPro superfamily or the same domain if they are single domain proteins, and 0 otherwise. We selected as large a number of sequences as possible to train the neural networks to avoid overfitting the data. In all, we selected 27844 homologous sequence pairs as the positive training set and 29999 non-homologous sequence pairs as the negative training set.

The neural network we use is a fully connected feed-forward back propagation neural network and has one hidden layer with sigmoid activation functions (see Figure 7). The output layer of the neural network has one output unit. The output value is bounded between 0 and 1. The network is trained with the scaled conjugate gradient algorithm [24] implemented in MATLAB [25].

Figure 7
figure 7

The architecture of the neural network. Figure 7 demonstrates the architecture of the neural network. The neural network is actually fully connected, but not shown in the figure for simplicity, and has three layers. The first layer is the input layer consisting of 25 input features. The hidden layer in the middle has 4 nodes. The output layer has one output node.

Given the large amount of training data relative to the number of the weights in the neural network, the neural network is unlikely to overfit. It may however underfit if there are not a sufficient number of weights. If the training data are smaller relative to the number of weights in the neural network, measures should be taken to avoid the overfitting problem and the cross-validation method should be used to choose the best model. Clearly in this study, such is not the case.

We used a split-sample approach in which the validation set is not used during training, but is used to select the best model. After the neural network is trained, it is validated on the validation dataset containing 250597 homologous pairs of sequences and 30000 non-homologous pairs. Different numbers of hidden layer nodes have been tested. The ultimate goal is not to determine if any two proteins sequences belong to the same Interpro family, but to cluster all sequences in Interpro families as accurately as possible. We selected the model with the smallest number of weights and smallest error on the validation set. Thus, we chose 4 hidden layer nodes such that the neural network has the least number of hidden units and the best performance on the validation dataset with a specificity of 94.18% and a sensitivity of 91.81%.

Modeling the transitive homology

The neural network is then used to calculate the metric value for any pair of protein sequences that hit to each other below the E-value that was used as a cutoff. The metric value, P(A,B), for protein sequences A and B is interpreted as the likelihood that these two protein sequences belong to the same InterPro superfamily or have the same single domain. The value P(A,B) is replaced by P(A,C)P(C,B) if there exists a sequence C such that P(A,C)P(C,B) is larger than the current value of P(A,B). This transformation takes advantage of the transitive homology of sequences A and B through the intermediate sequence C, assuming that protein sequences A and C and protein sequences B and C are independently homologous. Figure 8 illustrates the transitive homology between sequence a and sequence b through the third sequence c. The E-values between sequence a and sequence c, sequence c and sequence b, as well as sequence a and sequence b are 0.01, 0.005, 20 respectively. P(a,c), P(c,b), and P(a,b) are 0.8, 0.9, and 0.2 respectively. The homology between sequence a and sequence b cannot be detected with their direct E-value. However, the value of P(a,b) is assigned to 0.72 because of the transitive sequence homology.

Figure 8
figure 8

Figure 8 illustrates the transitive homology between sequence a and sequence b through the third sequence c. The homology between sequence a and sequence b can be detected with P(a,b) = 0.72 by the transitive sequence homology.

Hierarchical average linkage clustering

Once the metric value for every pair of protein sequences is calculated, the hierarchical average linkage clustering method is applied to cluster the protein sequences in the new metric space using the geometric mean as the merging rule.

Hierarchical average linkages uses the Unweighted Pair-Group Average (UPGA) [26] to measure the distance between clusters. Let D i , i = 1,2, ... n. denote the protein sequences contained in Cluster D and let E j , j = 1,2, ..., m denote the protein sequence contained in Cluster E. The geometric mean distance G between Cluster D and Cluster E is defined as Equation 4:

Equation 4:

G = i = 1 , j = 1 i = n , j = m P ( D i , E j ) MathType@MTEF@5@5@+=feaafeart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGhbWrcqGH9aqpdaqeWbqaaiabdcfaqjabcIcaOiabdseaeTGaeeyAaKMccqGGSaalcqWGfbqrliabdQgaQPGaeiykaKcaleaacqWGPbqAcqGH9aqpcqaIXaqmcqGGSaalcqWGQbGAcqGH9aqpcqaIXaqmaeaacqWGPbqAcqGH9aqpcqWGUbGBcqGGSaalcqWGQbGAcqGH9aqpcqWGTbqBa0Gaey4dIunaaaa@49A8@

The hierarchical average linkage clustering works in an iterative process: it begins with each protein sequence as a singleton cluster; during each iteration, it finds two clusters with the lowest metric value, then joins these two clusters into a new cluster, and updates the metric value between this new cluster and all others. This process iterates until the current lowest metric value exceeds the threshold.


  1. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278(5338):631–637.

    Article  CAS  PubMed  Google Scholar 

  2. Gouzy J, Corpet F, Kahn D: Whole genome protein domain analysis using a new method for domain clustering. Comput Chem 1999, 23(3–4):333–340.

    Article  CAS  PubMed  Google Scholar 

  3. Heger A, Holm L: Picasso: generating a covering set of protein family profiles. Bioinformatics 2001, 17(3):272–279.

    Article  CAS  PubMed  Google Scholar 

  4. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  5. Krause A, Vingron MA: set-theoretic approach to database searching and clustering. Bioinformatics 1998, 14(5):430–438.

    Article  CAS  PubMed  Google Scholar 

  6. George RA, Heringa J: SnapDRAGON: a method to delineate protein structural domains from sequence data. J Mol Biol 2002, 316(3):839–851.

    Article  CAS  PubMed  Google Scholar 

  7. Nagarajan N, Yona G: A multi-expert system for the automatic detection of protein domains from sequence information. In Proceedings of the seventh annual international conference on Computational molecular biology. Berlin, Germany; 2003:224–234.

    Google Scholar 

  8. Kriventseva EV, Fleischmann W, Zdobnov EM, Apweiler R: CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins. Nucleic Acids Res 2001, 29(1):33–36.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  9. Bolten E, Schliep A, Schneckener S, Schomburg D, Schrader R: Clustering protein sequences–structure prediction by transitivehomology. Bioinformatics 2001, 17(10):935–941.

    Article  CAS  PubMed  Google Scholar 

  10. Pipenbacher P, Schliep A, Schneckener S, Schonhuth A, Schomburg D, Schrader R: ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics 2002, (Suppl 2):S182–191.

    Google Scholar 

  11. Sasson O, Linial N, Linial M: The metric space of proteins-comparative study of clustering algorithms. Bioinformatics 2002, (Suppl 1):S14–21.

    Google Scholar 

  12. Yona G, Linial N, Linial M: ProtoMap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space. Proteins 1999, 37(3):360–378.

    Article  CAS  PubMed  Google Scholar 

  13. Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30(7):1575–84.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  14. Teichmann SA, Chothia C, Church GM, Park J: Fast assignment of protein structures to sequences using the intermediate sequence library PDB-ISL. Bioinformatics 2000, 16(2):117–124.

    Article  CAS  PubMed  Google Scholar 

  15. Park J, Teichmann SA, Hubbard T, Chothia C: Intermediate sequences increase the detection of homology between sequences. J Mol Biol 1997, 273(1):349–354.

    Article  CAS  PubMed  Google Scholar 

  16. Gerstein M: Measurement of the effectiveness of transitive sequence comparison, through a third 'intermediate' sequence. Bioinformatics 1998, 14(8):707–714.

    Article  CAS  PubMed  Google Scholar 

  17. Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C: Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 1998, 284(4):1201–1210.

    Article  CAS  PubMed  Google Scholar 

  18. Boeckmann B, et al.: The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 2003, 31(1):365–370.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  19. Zdobnov EM, Apweiler R: InterProScan – an integration platform for the signature-recognition methods in InterPro. Bioinformatics 2001, 17(19):847–848.

    Article  CAS  PubMed  Google Scholar 

  20. Cole ST: Comparative mycobacterial genomics as a tool for drug target and antigen discovery. Eur Respir J Suppl 2002, 36: 78s-86s.

    Article  CAS  PubMed  Google Scholar 

  21. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96(8):4285–4288.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  22. Pei J, Grishin NV: AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics 2001, 17(8):700–712.

    Article  CAS  PubMed  Google Scholar 

  23. Wang TJ, Ma Q, Shasha D, Wu C: New techniques for extracting features from protein sequences. IBM Systems Journal, Special Issue on Deep Computing for the Life Sciences 2001, 40(2):426–441.

    Google Scholar 

  24. Bishop CM: Neural Networks for Pattern Recognition. Oxford University Press, New York, New York; 1995.

    Google Scholar 

  25. Hanselman DC: Mastering MATLAB 5: A comprehensive tutorial and reference. Prentice Hall, Upper Saddle River, New Jersey; 1998.

    Google Scholar 

  26. Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning. Springer, New York; 2001.

    Book  Google Scholar 

Download references


We thank Dr. Xia Yang for comments and feedback. We also acknowledge Mischa Reinhardt and Darrell Ricke for a critical reading of this manuscript.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Qicheng Ma.

Additional information

Authors' contributions

QM carried out the design and implementation of the method and wrote the manuscript. JDS compared the performances between CLUGEN and MCL. GWC and RC participated in the project. NRN directed and participated in the project and prepared the figures in the manuscript. All authors involved in reviewing and revising the manuscript and approved the final manuscript.

Authors’ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and Permissions

About this article

Cite this article

Ma, Q., Chirn, GW., Cai, R. et al. Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks. BMC Bioinformatics 6, 242 (2005).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: