### Data preparation and preprocessing

We downloaded a list of approximately 9,600 nonredundant protein chains in the PDB [16, 24]. No two structures in this data set share greater than 50% sequence similarity. From these structures, we derived 1,992,567 FEATURE vectors. Each vector represents the 44 physicochemical properties listed in Table 1, measured in 6 concentric shells, each 1.25 Angstroms thick, yielding a total of 264 dimensions in an environment with a radius of 7.5 Angstroms. Vectors for amino acids with aromatic rings are centered at the centroid of the rings, and vectors for hydrophobic residues are centered at the beta carbon. A hypothetical beta carbon is constructed for glycines based on idealized backbone geometry. For amino acids with polar side chains, the FEATURE vector is centered at the centroid of the functional group containing the polar atoms. For instance, the vectors for serines, threonines, and tyrosines are centered at the oxygen atom of the hydroxyl group. Since tyrosine contains both an aromatic group and a hydroxyl group, it is represented by two separate FEATURE vectors. We also constructed two FEATURE vectors for tryptophans, centered at the beta carbon and at the centroid of the aromatic rings.

As previously reported, FEATURE vectors are a combination of continuous and discrete variables. This makes the definition of a distance metric challenging. In order to simplify the vectors, we converted each FEATURE vector into a binary form. For discrete variables, the zero values were kept, and the non-zero values were replaced with 1. For continuous variables, the values less than the median value among all two million feature vectors were set to zero, and the others were set to one.
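The binarization step can be sketched as follows. The function name, the use of NumPy, and the assumption that the median threshold is taken per continuous dimension are ours, not from the original description:

```python
import numpy as np

def binarize(vectors, discrete_mask):
    """Convert raw FEATURE vectors to the binary representation.

    vectors:       (N, d) array of raw FEATURE vectors
    discrete_mask: length-d boolean array, True where a dimension is discrete
    """
    vectors = np.asarray(vectors, dtype=float)
    discrete_mask = np.asarray(discrete_mask, dtype=bool)
    binary = np.zeros_like(vectors, dtype=np.uint8)

    # Discrete dimensions: keep zeros, replace any non-zero value with 1.
    binary[:, discrete_mask] = (vectors[:, discrete_mask] != 0).astype(np.uint8)

    # Continuous dimensions: values below the median over all vectors become 0,
    # the rest become 1 (we assume the median is computed per dimension).
    cont = ~discrete_mask
    medians = np.median(vectors[:, cont], axis=0)
    binary[:, cont] = (vectors[:, cont] >= medians).astype(np.uint8)
    return binary
```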

We adopted this preprocessing method because clustering results on 15 FEATURE models previously built manually [3, 4] showed that FEATURE vectors in the binary representation produced better clusters than the original vectors, as measured by the silhouette value [25].

The silhouette value *s* is defined as:

s(i)=\frac{\underset{k}{\mathrm{min}}\left\{b(i,\mathrm{cluster}_{k})\right\}-a(i)}{\mathrm{max}\left\{a(i),\underset{k}{\mathrm{min}}\left\{b(i,\mathrm{cluster}_{k})\right\}\right\}}

where *a(i)* is the average distance from the *i*^{th} point to the other points in its own cluster, and *b(i, cluster*_{k}*)* is the average distance from the *i*^{th} point to the points in another cluster *k*. The silhouette value quantifies the clustering quality of each object by a continuous number between +1 (perfectly clustered) and -1 (the opposite).
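The definition above can be illustrated with a small sketch. This is our own code, using a plain Hamming distance for simplicity rather than the F-distance defined later:

```python
from statistics import mean

def hamming(x, y):
    """Fraction of positions at which two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def silhouette(i, labels, vectors, dist=hamming):
    """Silhouette value s(i) of point i given cluster labels.

    a(i): mean distance to the other points in i's own cluster.
    b(i, k): mean distance to the points of another cluster k; min taken over k.
    """
    own = labels[i]
    a = mean(dist(vectors[i], vectors[j])
             for j in range(len(vectors)) if labels[j] == own and j != i)
    b = min(mean(dist(vectors[i], vectors[j])
                 for j in range(len(vectors)) if labels[j] == k)
            for k in set(labels) if k != own)
    return (b - a) / max(a, b)
```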

### Clustering FEATURE vectors

We used the K-means clustering algorithm to cluster the binary vectors. We first select K initial centers in a manner designed to bias the selection toward likely functional sites (see below). The distance metric (the F-distance, defined below) is then used to assign each vector to its nearest center. After all vectors are assigned, new cluster centers are computed as the average of the assigned vectors. The average value for each dimension is determined by a voting method: if ones outnumber zeros in that dimension, the average is set to one; otherwise it is set to zero. The procedure terminates when no cluster center moves more than a predefined cutoff value.
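The assignment and voting steps can be sketched as follows. This is illustrative code: we use a plain Hamming distance in place of the F-distance, and we test exact center equality rather than a movement cutoff:

```python
def hamming(x, y):
    """Fraction of positions at which two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def vote_center(members):
    """Majority vote per dimension: 1 if ones outnumber zeros, else 0."""
    n = len(members)
    return [1 if 2 * sum(col) > n else 0 for col in zip(*members)]

def kmeans_binary(vectors, centers, dist=hamming, max_iter=100):
    """K-means over binary vectors with majority-vote centroid updates."""
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest center.
        clusters = [[] for _ in centers]
        for v in vectors:
            k = min(range(len(centers)), key=lambda k: dist(v, centers[k]))
            clusters[k].append(v)
        # Update step: per-dimension majority vote within each cluster
        # (empty clusters keep their previous center).
        new_centers = [vote_center(c) if c else centers[k]
                       for k, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```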

Since the K-means algorithm is an expectation-maximization (EM) algorithm that can find only local optima, it is sensitive to the location of initial centers [26]. In order to improve our selection of initial center locations, we generated and examined distributions for solvent exposure and atom density over all feature vectors. Based on these distributions, we defined four classes of points: (1) low atom density and high solvent exposure; (2) low atom density and low solvent exposure; (3) high atom density and high solvent exposure; and (4) high atom density and low solvent exposure. Since functional sites, such as surface pockets, are most likely to belong to the first class, 50% of our initial cluster centers were randomly selected from it. Ten percent of the cluster centers belong to the second class, 20% belong to the third class, and 20% belong to the fourth class. We varied the number of cluster centers (K) between 300 and 5,000 and found that K = 4,550 provided the best correlation with known functional sites (as discussed in the Results section).
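The stratified seeding can be sketched as below. This is illustrative code; `classify` stands in for the density/exposure thresholding, whose cutoffs the paper derives from the empirical distributions and which are not shown here:

```python
import random

def pick_initial_centers(vectors, classify, k, fractions=(0.5, 0.1, 0.2, 0.2)):
    """Draw K initial centers stratified over the four density/exposure classes.

    classify(v) -> 0..3 maps a vector to one of the four classes
    (low/high atom density crossed with high/low solvent exposure).
    """
    pools = [[], [], [], []]
    for v in vectors:
        pools[classify(v)].append(v)
    centers = []
    for pool, frac in zip(pools, fractions):
        # Sample without replacement; never ask for more than the pool holds.
        n = min(round(k * frac), len(pool))
        centers.extend(random.sample(pool, n))
    return centers
```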

We define a weighted Hamming distance, the F-distance, as the distance metric for the binary vector space. The F-distance between two *n*-dimensional binary vectors *X* = (*X*_{1}, *X*_{2},...,*X*_{n}) and *Y* = (*Y*_{1}, *Y*_{2},...,*Y*_{n}) is defined as follows:

\text{F-distance}=\frac{1}{n}{\sum}_{i=1}^{n}2{f}_{i}\left|{X}_{i}-{Y}_{i}\right|

where the weight

2{f}_{i}=2\times \left|0.5-\frac{{\displaystyle {\sum}_{N}{X}_{i}}}{N}\right|

represents how far dimension *i* is from randomness in the information-theoretic sense, with respect to the distribution of that dimension over the entire N vectors. That is, *f*_{i} empirically indicates how important feature *i* is in calculating the distance between two FEATURE vectors. The values of *f*_{i} are calculated from the distribution of ones and zeros in feature *i* over the entire two million FEATURE vectors used.
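Putting the two formulas together, the weights and the F-distance can be computed as in this sketch (our own illustrative names):

```python
def feature_weights(vectors):
    """2*f_i = 2*|0.5 - p_i|, where p_i is the fraction of ones in dimension i
    over all N binary vectors."""
    n_vec = len(vectors)
    return [2 * abs(0.5 - sum(col) / n_vec) for col in zip(*vectors)]

def f_distance(x, y, weights):
    """Weighted Hamming distance between two equal-length binary vectors,
    normalized by the number of dimensions n."""
    n = len(x)
    return sum(w * abs(a - b) for w, a, b in zip(weights, x, y)) / n
```

Note that a dimension split 50/50 between ones and zeros gets weight zero and contributes nothing to the distance, while a nearly constant dimension gets weight close to one, matching the interpretation of *f*_{i} as a measure of distance from randomness.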

### Characterization and validation of clusters

We created a histogram of cluster sizes for the resulting clusters and computed both the intracluster and intercluster distances between vectors to assess the degree of separation achieved by the clustering.
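The separation check can be sketched as follows (our own illustrative code, with a plain Hamming distance; a well-separated clustering shows a mean intracluster distance much smaller than the mean intercluster distance):

```python
from itertools import combinations
from statistics import mean

def hamming(x, y):
    """Fraction of positions at which two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def intra_inter(clusters, dist=hamming):
    """Mean within-cluster and between-cluster pairwise distances.

    clusters: list of clusters, each a list of binary vectors.
    """
    intra = mean(dist(x, y)
                 for c in clusters if len(c) > 1
                 for x, y in combinations(c, 2))
    inter = mean(dist(x, y)
                 for a, b in combinations(clusters, 2)
                 for x in a for y in b)
    return intra, inter
```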

We annotated every residue in the data set that is part of a PROSITE pattern with that pattern's identifier. This results in an average of 6.2 ± 3.8 vectors per PROSITE hit. To biologically validate some of the clusters, we looked for PROSITE patterns for which at least 75% of the hits in our data set are contained within a relatively small number of clusters and for which the pattern is the dominant annotation in those clusters.

The clusters are available to the public for additional analysis and collaboration [27].