A hybrid clustering approach to recognition of protein families in 114 microbial genomes

Background Grouping proteins into sequence-based clusters is a fundamental step in many bioinformatic analyses (e.g., homology-based prediction of structure or function). Standard clustering methods such as single-linkage clustering capture a history of cluster topologies as a function of threshold, but in practice their usefulness is limited because unrelated sequences join clusters before biologically meaningful families are fully constituted, e.g. as the result of matches to so-called promiscuous domains. Use of the Markov Cluster algorithm avoids this non-specificity, but does not preserve topological or threshold information about protein families. Results We describe a hybrid approach to sequence-based clustering of proteins that combines the advantages of standard and Markov clustering. We have implemented this hybrid approach over a relational database environment, and describe its application to clustering a large subset of PDB, and to 328577 proteins from 114 fully sequenced microbial genomes. To demonstrate utility with difficult problems, we show that hybrid clustering allows us to constitute the paralogous family of ATP synthase F1 rotary motor subunits into a single, biologically interpretable hierarchical grouping that was not accessible using either single-linkage or Markov clustering alone. We describe validation of this method by hybrid clustering of PDB and mapping SCOP families and domains onto the resulting clusters. Conclusion Hybrid (Markov followed by single-linkage) clustering combines the advantages of the Markov Cluster algorithm (avoidance of non-specific clusters resulting from matches to promiscuous domains) and single-linkage clustering (preservation of topological information as a function of threshold). Within the individual Markov clusters, single-linkage clustering is a more-precise instrument, discerning sub-clusters of biological relevance. Our hybrid approach thus provides a computationally efficient approach to the automated recognition of protein families for phylogenomic analysis.


Background
The comprehensive classification of proteins into similarity groups is an important but difficult challenge in postgenomic bioinformatics. These similarity groups might be based on e.g. common sequence, structure, or function. We are studying the evolutionary diversification of microbial genomes [1,2] and therefore wish to group proteins into families based on sequence similarity for subsequent multiple alignment and inference of phylogenetic trees. The ongoing rapid appearance of new microbial genome sequences makes it imperative that this clustering be rapid, scalable, and automated to the extent possible.
Sets of protein sequences linked by pairwise matches can be clustered, and the resulting clusters interpreted as families [3,4]. This has proven successful for protein domains (ADDA [5], DIVCLUS [6], PRODOM [7]) and for complete protein sequences (ProtoMap [8], SYSTERS [9]). Sequence similarity is first assessed pairwise, typically using BLAST [10], following which single-linkage clustering [11] is used to generate a hierarchy of internal nodes or subtrees. Sometimes FASTA [12] or another statistically based comparison tool is utilised in place of BLAST, or another criterion in place of single-linkage (SL) clustering; although our discussion below focuses on BLAST and SL clustering, the argument applies more generally to other pairwise comparison tools and clustering criteria.
This approach to recognition of clusters offers several important advantages. Protein family groups built up in this way are not simply unstructured lists, but natively possess an internal structure (topology) readily interpretable in the formalism of graphs, with vertices representing protein sequences, and edges representing pairwise matches. Each edge can be assigned a length that is inversely proportional to the strength of the corresponding pairwise BLAST match. Membership in these protein families can be made more, or less, stringent by adjusting a single threshold that defines when an edge is recognised to be present. This threshold can be based in a straightforward, intuitive way on the BLAST output (e.g. bit score or e-value). As the stringency of this threshold is (for example) increased, a given paralogous family will often be resolved into its constituent orthologous families.
There is, however, one major drawback to this naïve approach. At threshold values that are sometimes more stringent than required to constitute all orthologous (much less all paralogous) protein families, xenologous (evolutionarily unrelated) proteins begin to be drawn into these clusters, undermining their biological interpretability. This occurs for several reasons. At higher stringencies, BLAST may recognise so-called "promiscuous domains" common to otherwise unrelated proteins [13]. At lower stringencies, BLAST may report weakly convergent motifs [14] or chance similarities. These non-specific clusters grow explosively in size with decreasing stringency, thereby preventing the useful extension of standard clustering down to the threshold values required to fully constitute most evolutionarily related protein families.
Alternative approaches are of course available, but are not necessarily appropriate for the problem at hand.
Approaches based on machine learning, e.g. [15,16], identify putative homologs even at low levels of sequence identify, but can be computationally very expensive. Putative remote homologs can likewise be recognised based on similarities in folded structure. However, inclusion of remote homologs is likely to be counterproductive for us, for three reasons: rigorous multiple alignment and inference of phylogenetic trees scale exponentially with number of sequences; alignment of weakly similar sequences is problematic; and weakly supported branches contribute little or no biological information to our analyses. For this and similar applications, it is therefore far better that remote homologs be excluded from the analysis pipeline at the clustering stage, rather than later.
Recently, an alternative approach based on the Markov Cluster algorithm (MCL: [17]) has been introduced to comparative genomics [4]. The MCL algorithm simulates random walks through a graph (e.g. of sequences as vertices, and edges as pairwise matches). By iteratively re-computing random walks and favouring those with higher probability (which tend to be intra-cluster walks) over those with low probability (which tend to be inter-cluster walks), the algorithm partitions the graph into segments that can be interpreted as clusters [4,17]. Computation is rapid and, in application to molecular sequences, the Markov Cluster algorithm produces clusters that resist contamination by promiscuous domains. However, Markov clusters are unstructured lists without internal topology, and as such do not yield information useful to many biologists, e.g. a hierarchical ordering of orthologs. Information about edge lengths (strength of BLAST matches) is lost (transformed into stochastic Markov probabilities), making it very difficult to conceptualise these clusters in terms familiar to biologists. How aggressively the Markov clusters find membership (i.e., the resulting granularity) can be adjusted only via operators that may have limited useful dynamic range, and are not intuitive to most biologists.
Here we present a hybrid approach to recognizing protein families among very large (multi-genome) datasets. Our hybrid approach preserves the advantages of single-linkage (SL) clustering identified above, but captures the power of the Markov Cluster algorithm to avoid indiscriminate cluster membership. As quality control, we apply our approach to a manually curated database, Protein Data Bank [18], and report how the Structural Classification of Proteins (SCOP) database families and domains [19] map onto the Markov clusters. We demonstrate the application of hybrid clustering to a problem that cannot be usefully addressed by either SL or the Markov Cluster algorithm alone, recognition of orthologs and paralogs of rotary motor ATP synthase F1 subunit proteins [20], and to a 114-genome dataset of protein sequences. Finally, we describe how clusters with nonredundant genome coverage ("maximally representative clusters", or MRCs) can be selected automatically from the output of our hybrid method, for subsequent analysis e.g. in a phylogenomic pipeline.

PDB
In order to characterise the behaviour of our hybrid method with a well-understood dataset before application to multi-genome data, we used MCL [17] to cluster PDB at a range of granularities, then mapped SCOP families (fa) and domains (dm) onto the Markov clusters. We assess the resulting mapping from the viewpoints of both PDB (cluster purity) and SCOP (distribution of families or domains over multiple clusters) ( Table 1; see Methods for definitions of purity and distribution). Recall that SCOP domains are more compact than SCOP families; one SCOP family can contain one or more SCOP domains.
The MCL inflation parameter I, which alters the relative probabilities of within-cluster and between-cluster random walks, is the main parameter by which users can adjust cluster granularity [4,17]. At inflation value I = 1.00, 89.5% of all n ≥ 2 clusters are "pure" by SCOP domain, i.e. contain all instances of any SCOP domain represented in that cluster. Markov clusters found at I = 1.10 are slightly more pure by SCOP domain (89.7%) but purity diminishes somewhat thereafter with increasing graininess, to 79.5% at I = 5.00 (Table 1). Looking instead at SCOP families, the Markov clusters are less pure, ranging from 56.1% at I = 1.00 to 38.5% at I = 5.00. These percentages reflect the relative granularity SCOP families and SCOP domains within this subset of PDB: at I = 1.00, for example, the 761 Markov clusters of N ≥ 2 contain 1852 SCOP domains but only 831 SCOP families.
Among the n ≥ 2 Markov clusters that are pure by SCOP family, 88-92% (depending on granularity) contain only a single SCOP family, and >97% contain either 1 or 2 SCOP families. 51-60% of these clusters contain a single SCOP domain, and 77-85% either 1 or 2 SCOP domains. Cluster purity tends to increase slightly with increasing granularity: higher inflation values yield more and cleaner clusters ( Table 1).
As clusters become finer-grained, individual SCOP families tend to become distributed among more clusters: the proportion fully contained within a single cluster drops from 69% (I = 1.00) to 56% (I = 5.00). Nevertheless, most (87-93%) families have all their members in ≤ 3 clusters throughout this range of inflation values. SCOP domains  show a similar but less-pronounced trend, decreasing from 94% (I = 1.00) to 88% (I = 5.00) within a single Markov cluster. Most (97-98%) domains have all their members in ≤ 2 clusters.
We carried out single-linkage clustering (i.e. completed our hybrid method) on the pure Markov clusters that have ≥ 2 SCOP families each. At I = 1.00, for instance, there are 427 -380 = 47 such clusters. By raising the clustering threshold S' norm (see Methods) we cleanly resolve 30 of these into constituent families (each in its own pure cluster); in the other 17 clusters, at least one family fragments (is not resolved into a pure cluster). Similarly, among 323 pure Markov clusters with ≥ 2 SCOP domains each, hybrid clustering resolves 292 cleanly, while 31 exhibit domain fragmentation (results not shown). The numbers of fragmenting families and domains decrease at higher inflation values (results not shown).

Multi-genome data
Single-linkage clustering of the 328577 proteins in these 114 completely sequenced microbial genomes yields, at maximum, 14440 clusters of size n ≥ 4 ( Figure 1a). This maximum is reached at S' norm 0.47, at which point 157540 proteins are included in an n ≥ 4 cluster ( Figure 1b). The number of proteins included in clusters of size n ≥ 4 continues to increase with further decrease in S' norm threshold ( Figure 1b), but the number of clusters decreases precipitously ( Figure 1a) because existing clusters are progressively and quickly swallowed up into a single large nonspecific cluster ("blob") that eventually encompasses 286109 proteins, more than 87% of the total (Figure 1c). We focus on n = 4 as a minimum cluster size because phylogenetic trees become interesting only for n ≥ 4. For n ≥ 2 (all non-singular graphs) at I = 1.10, the maximum number of clusters (53120) was at S' norm 0.67, at which point 183119 proteins are members of a n ≥ 2 cluster (results not shown).
Markov clustering at I = 1.10 yields 4797 clusters (n ≥ 4). Projecting these onto the BLASTp data followed by SL clustering within (but disallowed between) Markov clusters yields, at S' norm 0.47, 14403 clusters, almost (99.74%) as many as found by SL clustering alone ( Figure 2a). As before, as the S' norm threshold is decreased further, the number of proteins in clusters increases ( Figure 2b) and the number of clusters decreases ( Figure 2a); but disallowing edges that link proteins in different Markov clusters prevents the formation of a non-specific "blob" ( Figure  2c). Consolidation within Markov clusters is complete by S' norm 0.02 (Figure 2a), at which point 4802 hybrid protein-family clusters of n ≥ 4 remain. In this way we estimate that the number of phylogenetically interesting (n ≥ 4) protein families in these genomes is between about 4802 (the number at S' norm 0.01, where paralogous fami-lies might be expected to dominate) and about 14403 (the maximum number observed, where orthologous families, some perhaps not fully consolidated, are presumably more numerous). Similar behaviour is observed at other inflation values and with clusters of size n ≥ 2, although of course with different numbers of families and of proteins within these families, and with different inflection points.
The size distribution of all hybrid clusters obtained at Markov inflation values I = 1.10, 1.20, 2.00, 3.00, 4.00 and 5.00 is shown in Figure 3. Small and medium-sized clusters of a given size tend to be less numerous as I decreases across this range.

Characterisation of hybrid clusters from multi-genome protein data
In the context of our research on the evolutionary diversification of microbial genomes [1,2], the purpose of protein clustering is to generate sets of orthologs [21] that can be taken forward into subsequent analysis steps (multiple sequence alignment, phylogenetic inference, and topological comparison of subtrees We define a representative cluster (RC) as one in which each protein represents a different genome, and a maximally representative cluster (MRC) as an RC that cannot grow further (as the S' norm threshold is incremented toward zero) without incurring multiple representation of one of the genomes. The complete distribution of MRCs over these 114 genomes is shown in Figure 4, and the representation of bacterial "phyla" (second-level NCBI categories: [1]) in these MRCs in Figure 5. Not surprisingly, many of the MRCs -particularly the smaller ones -are of relatively limited distribution across bacterial phyla. The range of S' norm values through which an RC is maximal is its range of maximality, and is bounded above by its maximum threshold and below by its minimum threshold. More precisely, the maximum threshold is the lesser of the maximum possible value of S' norm (1.00 in the absence of rounding errors) and the increment just below that at which the cluster ceases to be maximal (because all edges linking one or more of its proteins no longer satisfy the threshold criterion). The minimum threshold is the greater of the minimum possible threshold (here 0.01) Single-linkage clustering of multi-genome data  C and the increment just greater than that at which one or more genomes becomes represented more than once in the cluster. Maximum and minimum thresholds for the MRCs are shown in Figures 6a and 6b respectively, and the distribution of ranges of maximality in Figure 6c. The 13 clusters that are maximally representative over 0.99 threshold units (and represent at least four genomes) include leader and attenuator peptides, a streptolysin Sassociated protein, the Streptococcus lantibiotic precursor protein, and hypothetical proteins in Escherichia coli, Salmonella enterica, Shigella flexneri and Streptococcus pyogenes.
Because our cluster data are stored under a relational data model (in our case, in Oracle), these and even more-complex queries -for example, involving successive relaxation of the admittedly strict criterion used here for recognizing an MRC -can easily be made using very short SQL scripts.  Under our hybrid (Markov Clustering plus SL) approach at I = 1.10, the cluster history is identical to that described immediately above for the SL approach down through S' norm 0.24. The 96 α subunits and (separately) the Methanosarcina acetivorans C2A predicted protein gi|20090748 join them at 0.21, the 38 A subunits at 0.18, and the 80 ρ subunits at 0.13. The Bradyrhizobium japonicum unannotated protein gi|27377965 joins at S' norm 0.06, and the Agrobacterium tumefaciens C58 hypothetical protein gi|17938963 at 0.04. This 433-member paralogous cluster is thus fully constituted at S' norm 0.04 (or at 0.13, if the two outliers are spurious matches), and remains cohesive until at least S' norm 0.01, the lowest threshold we examined. The size of the paralogous cluster depends on granularity (i.e. on parameterisation of I), and the exact threshold value at which a sequence or group of sequences joins a cluster is to some extent data-dependent; but we observed little or no non-specificity within the range of inflation parameter settings we examined for these data.

Conclusions
The results reported above demonstrate that our hybrid (Markov followed by single-linkage) clustering approach efficiently sorts protein sequences into biologically meaningful clusters that are not accessible by SL clustering alone. The hybrid clusters retain intuitively meaningful topological and ordered edge-length information not immediately available using only MCL, as illustrated specifically by our recovery of all six ATP synthase F1 rotary motor subunit paralogous families, and more globally by the extent of SCOP-to-PDB mapping for protein sequences encoded by 114 microbial genomes.
For the well-behaved subset of PDB we examined, MCL preserves most SCOP domains intact, although the (intrinsically larger) SCOP families can be distributed among several Markov clusters within this range of granularities. Most pure clusters (as defined above) contain a single SCOP family. A very substantial majority of domains within multi-domain clusters, and most families within multi-family clusters, are cleanly resolved during the SL stage of our hybrid method. As the SCOP classifica-tion of proteins is based neither on evolutionary principles nor on primary-sequence similarity, it is unrealistic to expect perfect concordance between our hybrid clusters and SCOP families or domains.
Enright et al. [4] used a somewhat different approach to test the effectiveness of MCL in clustering protein sequences. Their analysis of the full PDB yielded 1167 families at I = 1.10 and 1761 families at I = 5.00, i.e. 50.9% increase, whereas for our subset of PDB we found 831 and 812 n ≥ 2 families at these inflation values respectively; this suggests that the extra graininess observed by Enright et al. reflects the formation of many single-member clusters, which although important in some contexts would not be useful for phylogenetic analysis. These authors reported that 79-87% (depending on the MCL inflation value) of proteins were clustered consistently, as assessed against annotation by domain or domain combination. This validation statistic is not easily commensurate with the (in our opinion) more transparent statistics we present herein (above and Table 1), but was interpreted by Enright et al. [4] as indicating that MCL "accurately and consistently assign(s) proteins into families, despite the fact that this classification relies on structural similarities, which are not all readily detectable at the sequence level". We argue similarly for our results.

Bacterial phylum representation in MRCs
For both the 114-genome protein data set in general, and ATP synthase F1 subunits in particular, SL clustering alone fails progressively at thresholds below about S' norm 0.50. This failure is due to non-specific attraction into a large, non-specific cluster. Depending on the data, granularity and probably other factors, more than one such large nonspecific cluster can exist fleetingly, but all are soon attracted into a single large "blob" that grows extremely rapidly and eventually takes in most proteins in these genomes. By not allowing edges to be recognised between proteins in different Markov clusters, we prevent the formation of this blob.
Individual Markov clusters may, of course, contain paralogous or, possibly, even non-related sequences; these are resolved into families during the SL step. With the MCL software, the inclusiveness of Markov clusters is determined by the value of the inflation parameter I. The range of I values accepted by MCL [17] produces only a limited dynamic range of cluster sizes (Figure 3). We hypothesise (based on e.g. the ATP synthase example) that as the mean cluster size increases, at least the larger clusters are more likely to contain paralogous proteins. This will be tested Clustering of ATP synthase F1 paralog sequences Figure 7 Clustering of ATP synthase F1 paralog sequences Membership in the ATP synthase F1 cluster, as a function of S' norm threshold. Single-linkage and hybrid clustering gave identical results at S' norm ≥ 0.22; cluster structure below S' norm 0.22 is for our hybrid method only (see text). NCBI gi numbers are displayed across the top for all F1β subunit sequences, and for three singleton sequences that group with this paralogous family. Large adjacent dots depict clusters at S' norm 1.00, and small adjacent dots show singleton sequences at S' norm 1.00 that are clustered at 0.99.  15828921  15829170  13357610  15828707  16799215  16802140  21673859  20091272  21223733  23464956  28572498  28493391  25027871  19552435  15827576  15840761  15608450  15607015  29346121  13357686  26553514  12045261  13508337  24215476  15791495  15612125  15645746  24323736  17548034  15677764  15793517  15837745  28198344  21230029  21244374  27904521  21672302  15616638  15642757  28899843  27364454  28872698  15600747  26992088  24376219  16272426  15603359  16124229  22128003  16767149  29143949  16762461  15804332  26250476  24115035  15833928  16131600  15644358  15828737  16330679  22298068  17232531  16127677  15893158  15604633  27375551  13473456  23502652  17986535  15966787  17936497  15889877  20807130  21675043  19703700  18311169  15896119  27468618  15927677  15925093  21283756  23100430  16801734  16804567  15616316  16080734  28378938  15673746  15903403  15901355  24379917  22537026  25010936  19745847  15674808  28896267  by iterative clustering, alignment and tree inference when our pipeline is in place. We also intend to examine the extent to which the hierarchical course of cluster formation with decreasing S' norm threshold approximates the inferred phylogeny, and whether MRCs tend to be orthologous sets.
At a more algorithmic level, our results illustrate the complementarity between the Markov Clustering algorithm and hierarchical linkage-based clustering in application to these data. MCL operates not so much on the absolute differences among edge lengths, but rather on the overall density structure of the edge-length data [17]. The result is a partitioning of edge space that avoids the formation of non-specific (hence biologically meaningless) groupings, but does not produce the degree of resolution needed to resolve most orthologous protein families. Single-linkage clustering imposes an absolute view of edge-length differences that works with precision locally, but fails globally. MCL thus provides SL with the local environments to which it is best suited, and restricts it from the global context in which it can fail.
We have demonstrated that our hybrid approach can be implemented efficiently in conjunction with a relational database structure, with results saved automatically and queries conducted using generic SQL commands. The hybrid method is fast, and is appropriate for problems where remote homologs are not needed or wanted. It has already proven valuable as part of an automated inference pipeline for studying patterns of vertical and lateral gene transfer among microbial genomes.

All-vs-all BLASTp
All-versus-all BLASTp [10] used NHGRI blastall with default settings, including low-complexity filtering, except that word size was set to 2. All BLASTp output was saved, but only output (bit scores) corresponding to matches for which expectation e ≤ 10 -3 was used throughout this work (i.e. for analyses of PDB, multi-genome data and individual protein datasets by single-linkage, Markov and hybrid clustering).

Edge lengths
Relatedness between each protein pair (a, b) was calculated as follows: with a as query and b as target, we normalised [22] the BLASTp bit score S' ab by dividing by S' aa ; then for b as query and a target, we normalised S' ba by dividing by S' bb . The relatedness of a and b is defined as max(S' ab /S' aa , S' ba /S' bb ). Thus each edge recognised as present (i.e. having e ≤10 -3 in at least one direction) is assigned a single scalar value (S' norm ) between 0 and 1. Due to rounding, scores can slightly exceed 1. Edge values were also stored in the Oracle database at three decimal places precision.

ATP synthase subset
We analysed the clustering of ATP synthase F1 subunit paralogs as a function of threshold [see Additional file 1] using the 114-genome dataset (Results, and Figure 7). Orthology and paralogy were assessed manually based on annotations available in public genome databases (i.e. as supplied by the various genome projects); a more rigorous approach would use phylogenetic analysis (the goal of our automated pipeline project).
We also used the ATP synthase F1 dataset to establish the default setting of the MCL inflation parameter I. Working with an earlier, 84-genome (235970 protein) version of this dataset, by string search of annotation lines we found the largest SL cluster that contained only proteins annotated as homologs of ATP synthase rotary motor subunits. This contained F1 ATPase α (64 proteins) and β subunits (66); archaeal/vacuolar ATPase A (32) and B (24); bacterial flagellar assembly ATPase subunits (53); and termination factor ρ (56) (total, 295 proteins). We added to this cluster all 11083 proteins that match one or more of these 295 at S' norm 0.20, and carried out MCL at values of I between 1.00 and 5.00. Based on these results, we selected a default value (I = 1.10) small enough to group all the paralogous subunits of ATP synthase F1 within a single cluster, but large enough to avoid the extremely long computation times required for inflation values nearer 1.00.

Single-linkage clustering in a relational environment
Single-linkage clustering was initiated with the most-similar matches (those with S' norm ≥ 1.01, as the maximum S' norm observed was 1.012) and proceeded in 0.01 stepwise intervals to S' norm 0.01. Clusters were recognised, and allowed to grow and/or merge with others, as relaxation of S' norm threshold progressively caused additional proteins (vertices) to join through valid matches (edges). The S' norm threshold values we refer to must be understood in context. Recall that we store normalised edge length data to three decimal places precision, but step through these data in increments of 0.01. If two proteins (or groups of proteins) are in the same cluster at (for example) S' norm 0.465 but in different clusters at S' norm 0.466, we could say that they "join at S' norm 0.46" or that they "split at S' norm 0.47". The sets of dual vertical lines in Figure 7 are intended to convey this nuance.
Our automated phylogeny inference pipeline is being implemented over a relational environment for reasons well beyond the scope of work presented in this paper. We therefore implemented SL clustering over Oracle 9i to facilitate data coordination with the larger project. We processed (at each threshold) the ordered S' norm edge data, writing a series of new cluster-state information (membership) tables. We do not store topology tables per se, but reconstruct topologies from membership plus edge (S' norm ) data. After the list is processed, clusters (graphs) are labelled in ascending order of S' norm . Graphs present at S' norm 0.01 are arbitrarily numbered 1, 2, etc. and as these graphs fragment with stepwise increase in S' norm , the labels are extended. Thus if the graph labelled 1 fragments at a higher value of S' norm , the fragments (having n ≥ 2 members) are arbitrarily labelled 1.

Markov clustering
The Markov Cluster algorithm was implemented using MCL (available from http://micans.org/mcl) and was applied with inflation parameter typically set to I = 1.10, and with other parameters at default values. Cluster membership was stored in ordered Oracle tables as described above. With PDB we clustered all sequences, but subsequently used only the 6435 (or 6430) entries/IDs identified above.

Hybrid clustering
Our hybrid clustering method was carried out in two stages, as follows. First, we processed the entire dataset (valid edges with S' norm ≥ 0.01) with MCL, yielding Markov clusters (ordered tables). We then conducted SL clustering as above, but on only those edges that have both ends (proteins) within the same Markov cluster; edges that span Markov clusters (i.e. link proteins in different Markov clusters) were ignored. We again wrote tables of clustered proteins at each threshold, and backtracked to label graphs.

Cluster purity, family/domain distribution, and family/ domain splitting
We define a Markov cluster to be pure if it contains only SCOP families (or domains) in their entirety. A pure cluster can contain multiple SCOP families (or domains) so long as each family (or domain) is contained in its entirety. If any member of any of the families (or domains) is external to that cluster, the cluster is not pure. SCOP families (or domains) are distributed over multiple clusters if members of that family (or domain) are found in more than one cluster (n ≥ 2). A SCOP family (or domain) is split if one or more, but not all, of its members occurs in a cluster (n ≥ 2) together with at least one member of at least one different family (or domain). The nonincluded members of the split family (or domain) do not need to be included in a cluster.

Authors' contributions
JPG and MAR have longstanding interests in lateral gene transfer. JPG suggested the use of Markov clustering, and provided the initial ATP synthase F1 datasets. TJH conducted all computational analyses, wrote programs and scripts, managed our relational database, and initiated specific questions. MAR provided conceptual design and wrote the manuscript. All authors contributed to data analysis.