To verify that protein environments with known similarities cluster together under our FEATURE vector representation, we clustered vectors from environments known to be similar (taken from previously built FEATURE models) using several different distance measures. We use the silhouette value (see Methods section) to quantify clustering quality for each object in a cluster as a continuous number between +1 (perfectly clustered) and -1 (poorly clustered, that is, closer on average to another cluster than to its own). The plot of these values, termed the silhouette plot, shows tight and separate clusters as blue bars to the right, whereas loose or overlapping clusters appear as blue bars to the left (as described in the Methods section). Figures 1A and 1B compare clustering results obtained with the Euclidean distance between the original FEATURE vectors and with the Hamming distance between binary vectors. The Hamming distance between two binary vectors is defined as the number of coordinates in which the vectors' values differ. As seen in the plots, clustering vectors in their binary representation produces a better result in terms of the median silhouette values. In Figure 1C, we show that the F-distance (a weighted version of the Hamming distance) outperforms the Hamming distance and produces clusters with much better definition in terms of the silhouette value. On average, a cluster contains 1.06 residues from each protein chain represented in the cluster, indicating that the potential similarities between overlapping local environments do not dominate the clustering results.
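The two quantities compared above can be illustrated with a minimal sketch. It assumes the standard silhouette definition, s = (b - a) / max(a, b), where a is the mean distance from an object to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster; the exact formulation used in this work is given in the Methods section. The function names and the toy vectors are hypothetical.

```python
from statistics import mean


def hamming(u, v):
    """Number of coordinates in which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def silhouette_values(vectors, labels, dist=hamming):
    """Silhouette value of every vector, assuming the standard definition.

    Singleton clusters are assigned 0.0, a common convention.
    """
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette values need at least two clusters")
    values = []
    for i, v in enumerate(vectors):
        # a: mean distance to the other members of the vector's own cluster
        own = [dist(v, w) for j, w in enumerate(vectors)
               if j != i and labels[j] == labels[i]]
        if not own:
            values.append(0.0)
            continue
        a = mean(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(v, w) for w, lab in zip(vectors, labels) if lab == other)
            for other in clusters - {labels[i]}
        )
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values


# Toy data: two tight groups of binary vectors (hypothetical, not from the study).
toy_vectors = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0)]
well_clustered = silhouette_values(toy_vectors, [0, 0, 1, 1])
poorly_clustered = silhouette_values(toy_vectors, [0, 1, 0, 1])
```

With the first labeling every silhouette value is positive (bars to the right in a silhouette plot); with the second, deliberately scrambled labeling every value is negative (bars to the left).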
For the full clustering, we used a single-processor 3.6-GHz P4 machine with 4 GB of memory running the Linux operating system. The total runtime of clustering nearly two million vectors was on the order of a few hours.
Figure 2 shows a heat map of the values of f_i, a measure of information content after normalization, over all the 44 features and the 6 shells used. The vast majority of properties have high information content at low to medium radii, and many properties have high information content even at 6 to 7.5 Angstroms.
Initial validation of clusters
We have defined a distance metric between two binary vectors based on the Hamming distance that takes the information content of each feature into account (see Methods section). We call this weighted Hamming distance the F-distance. In our clustering, the mean distance between clusters (intercluster distance) was 0.210 ± 0.197 F-distance units. The mean distance of vectors within clusters (intracluster distance) was 0.118 ± 0.028 F-distance units. Thus, as expected from the trial silhouette plots using the F-distance in Figure 1, the resulting clusters are generally tight and separated. Figure 3 shows a histogram of cluster size. The number of FEATURE vectors in each cluster ranges from as few as 2 to as many as 6,731. The mean and median sizes are 437.2 and 232, respectively, and the standard deviation is 589.8.
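As an illustration of how a weighted Hamming distance and the two summary distances can be computed, the sketch below makes two assumptions that are not taken from this paper: the distance is the sum of the weights of the differing coordinates divided by the sum of all weights, and the distance between two clusters is the mean over all cross-cluster pairs of vectors. The actual F-distance weights are defined in the Methods section; the names and toy values here are hypothetical.

```python
from itertools import combinations, product
from statistics import mean


def weighted_hamming(u, v, weights):
    """Weighted Hamming distance, normalized to lie in [0, 1].

    Assumption: each differing coordinate contributes its weight, and the
    total is divided by the sum of all weights.
    """
    if not (len(u) == len(v) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    differing = sum(w for a, b, w in zip(u, v, weights) if a != b)
    return differing / sum(weights)


def mean_intracluster_distance(cluster, weights):
    """Mean pairwise distance among the vectors of one cluster."""
    return mean(weighted_hamming(u, v, weights)
                for u, v in combinations(cluster, 2))


def mean_intercluster_distance(cluster_a, cluster_b, weights):
    """Mean distance over all pairs drawn from two different clusters
    (an assumed definition of the distance between clusters)."""
    return mean(weighted_hamming(u, v, weights)
                for u, v in product(cluster_a, cluster_b))


# Hypothetical weights and clusters, for illustration only.
toy_weights = [0.5, 0.3, 0.2]
cluster_a = [(1, 0, 1), (1, 0, 0)]
cluster_b = [(0, 1, 0), (0, 1, 1)]
intra = mean_intracluster_distance(cluster_a, toy_weights)
inter = mean_intercluster_distance(cluster_a, cluster_b, toy_weights)
```

With equal weights this distance reduces to the Hamming distance divided by the vector length, so the weighting only changes the result when some features carry more information than others.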
In terms of biological validation, Figure 4 presents fingerprints of the features listed in Table 1 that are over- or underrepresented in each of the clusters described below with respect to the background of all two million feature vectors. Since some variation among environments sharing the same PROSITE annotation is expected, we do not anticipate that all examples of a given motif will cluster together. We present five examples in which at least 75% of the hits to a PROSITE pattern among the 9,600 protein chains used in this study occur in the same cluster. All these clusters have additional unannotated residues. These may represent novel predictions of shared function or they may be cases of related but different functions. Assessment of novelty will be addressed in future work. Our focus here is to evaluate the validity of the clustering approach.
Tyrosine protein kinases specific active-site signature
Thirteen of the 17 TYROSINE_KINASE_TYR PROSITE pattern (accession number PS00109) hits among the protein chains used in this study are contained within a single cluster. There are 346 total residues clustered together. Of the 15 residues in this cluster that have PROSITE annotations, only 2 do not belong to this motif. The average sequence identity among the proteins in this cluster with the tyrosine protein kinase motif is 31.4 ± 4.8%. Figure 4A shows a fingerprint of the features listed in Table 1 that are over- or underrepresented in this cluster with respect to the background of all two million feature vectors, and Figure 5A shows a comparison of the environments around two residues from the cluster that share the TYROSINE_KINASE_TYR annotation. All the residues in the cluster annotated with this PROSITE pattern are centered around alanines, and they share a great deal of structural similarity even though only one-half of the residues in the environment are contained within the PROSITE pattern itself.
Staphylococcal enterotoxin/streptococcal pyrogenic exotoxin signature 2
Ten examples of the STAPH_STREP_TOXIN_2 PROSITE motif (accession number PS00278) occur in our dataset, and nine of these occur in a single cluster (Figure 4B). There are 275 total residues in this cluster, 6 of which have PROSITE annotations other than STAPH_STREP_TOXIN_2. The average sequence identity among the proteins in this cluster with this pattern is 20.3 ± 8.4%. Two clusters are required to capture all 10 instances of this motif in our dataset. The two examples of environments in this cluster around residues that participate in the STAPH_STREP_TOXIN_2 motif (Figure 5C) exhibit greater structural diversity than do the environments from the other validation clusters described here. Fewer than one-third of the residues in these two environments are located within the motif.
Guanylate kinase-like signature
Four of the five hits to the GUANYLATE_KINASE_1 PROSITE motif (accession number PS00856) within our dataset are represented in a single cluster (Figure 4C). There are 162 total residues in this cluster, and 3 residues have differing PROSITE annotations. The average pairwise sequence identity among the proteins in this cluster with the guanylate kinase-like signature is 21.7 ± 6.0%.
Glycosyl hydrolases family 1 active site
All seven hits to the GLYCOSYL_HYDROL_F1_2 PROSITE pattern (accession number PS00572) in our dataset are represented in a single cluster (Figure 4D). Of the 151 residues in this cluster, 6 have PROSITE annotations other than GLYCOSYL_HYDROL_F1_2. The average pairwise sequence identity among the proteins with the glycosyl hydrolase family 1 active site motif is 31.6 ± 5.5%.
Ubiquitin-conjugating enzymes active site
Eight of the 10 hits to the UBIQUITIN_CONJUGAT_1 PROSITE pattern (accession number PS00183) occur in the same cluster (Figure 4E). Of the 362 total residues in the cluster, 7 have alternate PROSITE annotations. The average pairwise sequence identity among the proteins with the ubiquitin-conjugating enzyme active site motif is 26.2 ± 4.2%. Figure 5B shows two examples of environments around asparagine residues contained in the UBIQUITIN_CONJUGAT_1 motif. Although the cysteine residue toward the top of this figure is annotated in the PROSITE database as the catalytic residue, it lies on the outskirts of the environment. Because active sites are often dynamic, regions that are slightly removed from the center of catalytic activity may show stronger conservation than the active site itself. Fewer than one-half of the residues in the environments are located within the PROSITE motif.
Statistical significance of the validation clusters
Because the five validation clusters described above could have arisen by chance, we repeatedly reassigned the two million feature vectors to our clusters at random and assessed how residues with the same PROSITE annotation segregated into clusters. As all of the PROSITE patterns associated with the validation clusters reported above have at least five hits in our dataset, we limited our analysis to PROSITE patterns with at least five occurrences. In approximately 13% of 50,000 trials, we observed one case where at least 75% of the hits to a PROSITE pattern occurred in a single cluster. The probabilities of obtaining two or three such clusters were 0.7% and 0.02%, respectively. A random trial in which four PROSITE patterns were each predominantly captured in a single cluster occurred only once, and we never observed five patterns clustering according to these criteria. Thus, the results reported above (five examples of PROSITE patterns having at least five hits that are each predominantly captured in a single cluster) are statistically significant, with an empirical probability below 1 in 50,000.
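A randomization test of this kind can be sketched as follows. The sketch assumes that each vector carries at most one PROSITE annotation, that a hit is counted per annotated vector, and that cluster sizes are held fixed by shuffling the cluster labels; these are simplifying assumptions, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def count_captured_patterns(cluster_labels, annotations,
                            min_hits=5, threshold=0.75):
    """Count patterns with at least `min_hits` hits for which at least
    `threshold` of the hits fall in one cluster. `None` means unannotated."""
    per_pattern = defaultdict(Counter)
    for cluster, pattern in zip(cluster_labels, annotations):
        if pattern is not None:
            per_pattern[pattern][cluster] += 1
    captured = 0
    for counts in per_pattern.values():
        hits = sum(counts.values())
        if hits >= min_hits and max(counts.values()) >= threshold * hits:
            captured += 1
    return captured


def null_distribution(cluster_labels, annotations, trials=1000, seed=0):
    """Tally of captured-pattern counts over random reassignments.

    Shuffling the labels keeps every cluster at its original size.
    """
    rng = random.Random(seed)
    shuffled = list(cluster_labels)
    tally = Counter()
    for _ in range(trials):
        rng.shuffle(shuffled)
        tally[count_captured_patterns(shuffled, annotations)] += 1
    return tally


def empirical_p_value(tally, observed):
    """Fraction of random trials with at least `observed` captured patterns."""
    trials = sum(tally.values())
    return sum(n for k, n in tally.items() if k >= observed) / trials


# Toy example (hypothetical): pattern "A" has five hits, all in cluster 0;
# pattern "B" has only four hits and is ignored when min_hits=5.
toy_labels = [0] * 5 + [1] * 5
toy_annotations = ["A"] * 5 + ["B"] * 4 + [None]
observed = count_captured_patterns(toy_labels, toy_annotations)
tally = null_distribution(toy_labels, toy_annotations, trials=200, seed=1)
p_value = empirical_p_value(tally, observed)
```

When the observed count is never reached in any random trial, this estimate is zero, and the result is better reported as an upper bound of one over the number of trials.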
We also evaluated the overall performance of the clustering algorithm by determining how well each PROSITE pattern with at least three hits in our dataset clustered. For each pattern, we identified the cluster in which the highest percentage of hits is represented. On average, 67.0 ± 22.3% of the hits to a pattern are represented in the cluster that best captures that pattern. For the 50,000 random clusterings described above, this number drops to 38.5 ± 0.5%. If we exclude all patterns with fewer than five hits, 57.4 ± 21.2% of hits occur in the best cluster, whereas only 28.6 ± 0.5% are expected to cluster together by chance.
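The per-pattern summary described here can be sketched as below, under the same simplifying assumption that each annotated vector counts as one hit to one pattern. The function name and toy data are hypothetical, and the sketch does not reproduce the study's figures.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev


def best_cluster_fractions(cluster_labels, annotations, min_hits=3):
    """For each pattern with at least `min_hits` hits, return the fraction
    of its hits found in the cluster that contains the most of them."""
    per_pattern = defaultdict(Counter)
    for cluster, pattern in zip(cluster_labels, annotations):
        if pattern is not None:  # None marks an unannotated vector
            per_pattern[pattern][cluster] += 1
    fractions = {}
    for pattern, counts in per_pattern.items():
        hits = sum(counts.values())
        if hits >= min_hits:
            fractions[pattern] = max(counts.values()) / hits
    return fractions


# Toy example (hypothetical): "A" has 3 of 4 hits in cluster 0, "B" has
# 2 of 3 hits in cluster 2, and "C" has too few hits to be evaluated.
toy_labels = [0, 0, 0, 1, 2, 2, 3, 4, 4]
toy_annotations = ["A", "A", "A", "A", "B", "B", "B", "C", "C"]
fractions = best_cluster_fractions(toy_labels, toy_annotations)
average = mean(fractions.values())
spread = stdev(fractions.values())
```

Raising `min_hits` from 3 to 5 corresponds to the stricter filter used in the second comparison; applying the same function to randomly shuffled cluster labels gives the chance baseline.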